Data Mining

The first task of this project is to get as many word definitions and crosswords as we can. To do that, we have to find resources like encyclopedias and newspapers and automate data collection.

Raw data usually comes as HTML/JS or XML, with a layout that varies from one source to the next. Scripts download the data automatically and convert it into a common format.
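As an illustration of what a "common format" could look like, here is a minimal sketch. The `ClueRecord` fields, the `normalize` rules, and the JSON-lines output are assumptions made for this example; they are not a format the project has settled on.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ClueRecord:
    """One word with its definition or clue (hypothetical common format)."""
    answer: str  # the word itself, letters only, upper-cased
    clue: str    # definition or crossword clue text
    source: str  # where the pair was collected, e.g. "wiktionary"


def normalize(answer: str, clue: str, source: str) -> ClueRecord:
    """Clean a raw pair so that records from different sites are comparable."""
    # Keep only letters in the answer (drops spaces, hyphens, digits).
    cleaned_answer = "".join(ch for ch in answer if ch.isalpha()).upper()
    # Collapse runs of whitespace left over from HTML or XML extraction.
    cleaned_clue = " ".join(clue.split())
    return ClueRecord(cleaned_answer, cleaned_clue, source)


def to_json_line(record: ClueRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
```

Each converter script could then emit one such line per clue, whatever the original site layout was.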

Word definitions

Wiktionary

This open dictionary provides its data in a downloadable format. The archive repository is here. For instance, the current version with all pages (but no history) is enwiktionary-20150224-pages-meta-current.xml.bz2 (515.3 MB).
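The dump is too large to load into memory comfortably, so one option is to stream it. The sketch below assumes the archive follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>`); the function names `local_name` and `iter_pages` are invented for this example.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each page of a bz2-compressed XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        title, text = None, None
        # "end" events fire once an element has been read completely.
        for _, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                if title is not None and text is not None:
                    yield title, text
                title, text = None, None
                elem.clear()  # release the finished page's subtree
```

A call such as `for title, text in iter_pages("enwiktionary-20150224-pages-meta-current.xml.bz2"): ...` would then visit the pages one by one; extracting definitions from the wikitext is a separate step not covered here.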

Because crosswords are most often part of the newspaper ecosystem, we directed our search towards news sites and their games sections. Sites specializing in crosswords also exist, but they appear to be fewer in number and to offer fewer crosswords than news sites.

However, the search for sites to mine yielded an interesting result: most websites hosting crosswords use an application that reads from a .puz (puzzle) file written in plain text. This will greatly help with acquiring data for our model. The formats that we have come across include:

.puz files
JavaScript applications
Java applications
pure text and images
PDF
pure HTML

Crosswords

Here is a list of websites, along with the format in which they store their crosswords, sorted by known potential (the estimated number of crosswords that can be collected).

The Guardian
- Potential: 10000 (complete; more are available, but multi-line words are ignored)
- New crosswords in HTML/JS (more than 10000)
- Old crosswords in image format
- Contact (quick, cryptic, quiptic): crossword.editor@theguardian.com
- Contact (everyman, speedy): crossword.editor@observer.co.uk
BoatLoad Puzzles
- Potential without buying: 40000
- Potential with buying ($20): 100000
- HTML/JS
- Contact: support2id@boatloadpuzzles.com
Crossword Puzzle Games
- Potential: 32400 (complete)
- HTML/JS
Mirror
- Potential: 1460 (+4/day) (complete)
- HTML
Puzzle Choice
- Potential: 632
- Java (for the daily one)
- Image and text
- Contact: puzzlemaster@puzzlechoice.com
Puzzles by Jim
- Potential: 412 (complete)
- HTML/JS
- Contact: http://www.puzzlesbyjim.com/contact-us.html
Triple Play
- Potential (PUZ): 35
- Potential (PDF): 38
Puzzles by Fred
- Potential: 52
- .puz
- Contact ("Important"): tihzwa@aol.com
14 Across
- Potential: 25 (without solution)
- HTML/JS
- Contact: http://www.14across.com/contact.php
The New York Daily
- Potential without subscription: 2 (each day)
- HTML/JS or .puz file
- Contact: edu@nytimes.com and NYTimes.com/edu

These websites may also be useful, as they contain more information on the .puz format or link to numerous crossword puzzle websites:

Crosswords Source (links to other crossword sources and lists the format each one uses)
Crossword Links (links to numerous crossword websites)
Puz Format
