Data Mining

The first task of this project is to get as many word definitions and crosswords as we can. To do that, we have to find resources like encyclopedias and newspapers and automate data collection.

Raw data usually comes as HTML/JS or XML, with a layout that varies from site to site. Scripts download it automatically and convert it into a common format.
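As a rough illustration of what a common format could look like, the sketch below stores each record as one JSON object per line (JSON Lines). The schema (`word`, `clue`, `source`) and the names `Entry`, `write_entries` and `read_entries` are assumptions made for this example; the project does not define them here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, Iterator, TextIO


@dataclass
class Entry:
    """One word/clue pair, whatever the original source format was (hypothetical schema)."""
    word: str
    clue: str
    source: str


def write_entries(entries: Iterable[Entry], out: TextIO) -> int:
    """Write entries as JSON Lines (one JSON object per line) and return the count."""
    count = 0
    for entry in entries:
        # ensure_ascii=False keeps accented words readable in the output file
        out.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")
        count += 1
    return count


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Read entries back from JSON Lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield Entry(**json.loads(line))
```

With a line-oriented format like this, each download script can append its records to the same file independently of the others.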

Word definitions

Wiktionary

This open dictionary provides its data in a downloadable format. The archive repository is here. For instance, the current version with all pages (but no history) is enwiktionary-20150224-pages-meta-current.xml.bz2 (515.3 MB).
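A Wiktionary dump of this size is best read as a stream rather than loaded whole. The sketch below is one possible way to do that with the Python standard library, assuming the usual MediaWiki export layout in which each `<page>` contains a `<title>` and a `<revision>` with a `<text>` element. The function names `local_name`, `iter_pages` and `iter_dump` are illustrative, not part of the project.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = "", ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the finished page's children to limit memory use


def iter_dump(path: str) -> Iterator[Tuple[str, str]]:
    """Stream pages straight from the .xml.bz2 file without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

For example, `for title, text in iter_dump("enwiktionary-20150224-pages-meta-current.xml.bz2")` would walk through the dump named above, assuming it has been downloaded to the working directory. Extracting definitions from the wikitext of each page is a separate step that this sketch does not cover.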

 

Because crosswords are most often part of the newspaper ecosystem, we directed our search towards news sites and their games sections. Sites that specialize in crosswords exist, but there appear to be fewer of them, and they carry fewer crosswords than news sites.

However, the search for sites to mine yielded an interesting result: most websites hosting crosswords use an application that reads from a .puz (puzzle) file written in plain text. This will greatly help the acquisition of data for our model. The other formats that we have come across include:

 

Crosswords

Here is a list of websites, along with the format in which they store their crosswords, sorted by known potential.

These websites may be useful: they contain more information on the .puz format or link to numerous crossword puzzle websites.