Data Mining
The first task of this project is to get as many word definitions and crosswords as we can. To do that, we have to find resources like encyclopedias and newspapers and automate data collection.
Raw data usually comes as HTML/JS or XML, in a variety of layouts. Scripts automatically download the data and convert it into a common format.
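As an illustration of what a common format could look like, here is a minimal sketch that stores each definition or clue as one JSON object per line. The `Entry` fields and the helper names are assumptions made for this example; the project's actual format is not specified here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, List


@dataclass
class Entry:
    """One (answer, clue) pair in a hypothetical common format."""
    answer: str   # the word to find, normalised to upper case
    clue: str     # the definition or crossword clue
    source: str   # where the pair came from, e.g. "wiktionary"


def normalise(answer: str, clue: str, source: str) -> Entry:
    """Trim the answer, upper-case it, and collapse whitespace in the clue."""
    return Entry(answer.strip().upper(), " ".join(clue.split()), source)


def to_json_lines(entries: Iterable[Entry]) -> str:
    """Serialise entries as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


def from_json_lines(text: str) -> List[Entry]:
    """Inverse of to_json_lines; blank lines are skipped."""
    return [Entry(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

A line-oriented format like this lets every scraper append its output to the same file regardless of where the data came from.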
Word definitions
Wiktionary
This open dictionary provides its data in a downloadable format, published in its archive repository. For instance, the current version with all pages (but no history) is enwiktionary-20150224-pages-meta-current.xml.bz2 (515.3 MB).
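Since the dump is a large bz2-compressed XML file, one way to read it is to stream it page by page instead of loading it whole. The sketch below assumes the usual MediaWiki export layout, with `<page>` elements containing a `<title>` and a `<text>`. The function names are made up for this example, and extracting definitions from the wikitext is left out.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> element in an XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's children to limit memory use


def iter_dump(path: str) -> Iterator[Tuple[str, str]]:
    """Stream pages straight from a .xml.bz2 dump without unpacking it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Usage would look like `for title, text in iter_dump("enwiktionary-20150224-pages-meta-current.xml.bz2"): ...`, with the definition extraction applied to `text`.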
Because crosswords are most often part of the newspaper ecosystem, we focused our search on news sites and their games sections. Sites specializing in crosswords also exist, but they appear to be fewer and to host fewer crosswords than news sites.
However, the search for sites to mine yielded an interesting result: most websites hosting crosswords use an application that reads from a .puz (puzzle) file. This will help greatly with acquiring data for our model. The formats we have come across include:
- .puz files
- javascript applications
- java applications
- pure text and images
- pure html
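For the .puz files in the list above, a quick sanity check is to read the file header before attempting a full conversion. The sketch below assumes the binary Across Lite layout described in community documentation of the format: a magic string `ACROSS&DOWN` at offset 0x02, the width and height at 0x2C and 0x2D, and a little-endian clue count at 0x2E. These offsets are assumptions that the sites listed here do not confirm, and `read_puz_header` is a hypothetical helper.

```python
import struct
from typing import NamedTuple

MAGIC = b"ACROSS&DOWN\x00"   # assumed file magic, stored at offset 0x02
HEADER_SIZE = 0x34           # assumed fixed header length in bytes


class PuzHeader(NamedTuple):
    width: int
    height: int
    clue_count: int


def read_puz_header(data: bytes) -> PuzHeader:
    """Return grid size and clue count, or raise ValueError if not a .puz."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to be a .puz file")
    if data[0x02:0x0E] != MAGIC:
        raise ValueError("missing ACROSS&DOWN magic")
    # One byte each for width and height, then a 16-bit little-endian count.
    width, height, clue_count = struct.unpack_from("<BBH", data, 0x2C)
    return PuzHeader(width, height, clue_count)
```

Checking the magic first lets a download script tell a real puzzle file from an HTML error page saved under a .puz name.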
Crosswords
Here is a list of websites and the formats in which they store their crosswords, sorted by known potential (the estimated number of crosswords we can collect).
- The Guardian
  - Potential: 10000 (complete, more available but multi-line words ignored)
  - New crosswords in HTML/JS (more than 10000)
  - Old crosswords in image format
  - Contact (quick, cryptic, quiptic): crossword.editor@theguardian.com
  - Contact (Everyman, speedy): crossword.editor@observer.co.uk
- BoatLoad Puzzles
  - Potential without buying: 40000
  - Potential with buying ($20): 100000
  - HTML/JS
  - Contact: support2id@boatloadpuzzles.com
- Crossword Puzzle Games
  - Potential: 32400 (complete)
  - HTML/JS
- Mirror
  - Potential: 1460 (+4/day) (complete)
  - HTML
- Puzzle Choice
  - Potential: 632
  - Java (for the daily one)
  - Image and text
  - Contact: puzzlemaster@puzzlechoice.com
- Puzzles by Jim
  - Potential: 412 (complete)
  - HTML/JS
  - Contact: http://www.puzzlesbyjim.com/contact-us.html
- Triple Play
  - Potential (PUZ): 35
  - Potential (PDF): 38
- Puzzles by Fred
  - Potential: 52
  - .puz
  - Contact ("Important"): tihzwa@aol.com
- 14 Across
  - Potential: 25 (without solution)
  - HTML/JS
  - Contact: http://www.14across.com/contact.php
- The New York Daily
  - Potential without subscription: 2 (each day)
  - HTML/JS or .puz file
  - Contact: edu@nytimes.com and NYTimes.com/edu
These websites may be useful: they contain more information on the .puz format or link to numerous crossword puzzle websites.
- Crosswords Source (links to other crossword sources and lists the format each one uses)
- Crossword Links (links to numerous crossword websites)
- Puz Format