Bigdata2015-crosswords
Crosswords Data Mining
Supervisors:
Team members:
- Patrick Andrade
- Johan Berdat (team leader)
- Timothée Emery
- Matteo Filipponi
- Grégory Maître
- Vincent Mettraux
- Utku Sirin
- Laurent Valette
Source code on this GitHub repository
Abstract
Many websites and newspapers publish puzzles of various kinds. While solving tools are readily available for games like Sudoku, help for crosswords is limited. Some web pages provide customized search (e.g. Crossword Solver), but only to a small extent.
Our goal is to collect word definitions from various sources (dictionaries and encyclopedias like Wiktionary) and crosswords (from newspaper sources like The Guardian). This database could be analysed and hierarchically clustered to group entries by themes, definitions and synonyms. By joining multiple definitions of a single word, characteristic keywords could be identified. Moreover, projects like WordNet could improve the results.
The end goal is to build a web application providing tools to generate and solve crosswords. In this application, we would propose words based on definitions and keywords, and offer word completion.
Goals and milestones
Since we want to create a complete web application based on the results of machine learning and language processing algorithms, we can split the project into 3 distinct tasks. Each of them is described on its own page. They will mainly communicate through a central database. We will also try, as much as possible, to reuse code and algorithms.
- Data mining and conversion to a common format
- Processing using machine learning and language processing methods
- Web design of the final product to provide features
The project has 3 milestones:
- 16 March, framework setup: tools are chosen and installed locally on our machines. We are able to create simple web pages that can access databases. We have also written some scripts to download and parse crosswords and definitions, and we have some clustering/language processing algorithms.
- 13 April, basic result: the first version of our web app is complete, and our algorithms produce satisfactory results for the main features, like keyword search/completion. We also have a large database of words and crosswords.
- 4 May, final result: one week before the deadline, we have implemented more features and tested multiple algorithms. The web page design is improved.
Tasks
Each task has some people assigned. How they split the work among themselves is up to them. The team is split into two groups, though this is only a general indication of what each member focuses on:
- Web and scripting: Grégory, Matteo, Patrick, Timothée, Vincent
- Database and machine learning: Johan, Laurent, Utku
Common file format [Laurent, Timothée, Vincent]
11 March
Choose a file format (probably XML or JSON) and define how we will encode words, definitions and crosswords in a portable way. The format should be scalable and robust, i.e. support Unicode and be able to store any metadata. Also, prepare a list of common tags/properties. Finally, build a simple application in Scala to visualize these common files.
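As an illustration only, a crossword could be encoded in JSON along the following lines. All field names (`source`, `url`, `language`, `title`, `words`, `clue`, `x`, `y`, `dir`) are assumptions made for this sketch, not the format the team settled on. Python is used here only for brevity; the project itself targets Scala.

```python
import json

# Hypothetical common-format record; every field name here is an assumption.
crossword = {
    "source": "example",
    "url": "http://example.org/puzzle/1",
    "language": "fr",
    "title": "Mots croisés n°1",
    "words": [
        {"word": "ÉTÉ", "clue": "Saison chaude", "x": 0, "y": 0, "dir": "across"},
        {"word": "THÉ", "clue": "Boisson chaude", "x": 1, "y": 0, "dir": "down"},
    ],
}

# ensure_ascii=False keeps accented characters readable in the output.
encoded = json.dumps(crossword, ensure_ascii=False, indent=2)

# Decoding must give back exactly the same structure.
decoded = json.loads(encoded)
```

Keeping free-form metadata as plain key/value pairs next to the word list lets each source add its own properties without changing the format.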
Database framework [Utku]
11 March
Choose the database system we will use as common storage. Compare candidates and select the most appropriate/simple/free solution. Also, propose how we should store words, definitions and crosswords in the database, i.e. what the database schema will be. Of course, the schema will be extended once the machine learning part produces results.
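A possible starting point for the schema is sketched below. SQLite is used only as a stand-in because it ships with Python; the database system, table names and column names are all assumptions, not the team's choice. Python is used here only for brevity; the project itself targets Scala.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for this sketch.
SCHEMA = """
CREATE TABLE words (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
CREATE TABLE definitions (
    id      INTEGER PRIMARY KEY,
    word_id INTEGER NOT NULL REFERENCES words(id),
    text    TEXT NOT NULL,
    source  TEXT
);
CREATE TABLE crosswords (
    id     INTEGER PRIMARY KEY,
    source TEXT,
    url    TEXT
);
CREATE TABLE clues (
    id           INTEGER PRIMARY KEY,
    crossword_id INTEGER NOT NULL REFERENCES crosswords(id),
    word_id      INTEGER NOT NULL REFERENCES words(id),
    clue         TEXT NOT NULL
);
"""


def create_database():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def definitions_of(conn, word):
    """Return all definition texts stored for a word, in insertion order."""
    rows = conn.execute(
        "SELECT d.text FROM definitions d "
        "JOIN words w ON w.id = d.word_id WHERE w.word = ? ORDER BY d.id",
        (word,),
    )
    return [text for (text,) in rows]
```

Storing each word once and pointing to it from both definitions and clues makes it easy to join all the texts known for a word.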
Web framework [Patrick, Grégory, Utku]
11 March
Enumerate, compare and choose the web framework we will use. Prefer ones that produce nice and simple web pages, like Polymer. Since we are going to use Scala for most of the algorithmic part, Scala Play should be the main candidate.
Application feature proposal [everyone]
During the whole project, but as soon as possible (ideally we would have an interesting discussion on 11 March)
Propose features for the final app. Any idea is welcome, even a weird one :)
Websites for data mining [Timothée, Vincent]
During the whole project, but as soon as possible (ideally we would have an interesting discussion on 11 March)
List websites that provide word definitions and crosswords. Synonyms, antonyms and any other data are welcome.
Algorithms [Laurent, Johan, Utku]
During the whole project, but as soon as possible (ideally we would have an interesting discussion on 11 March)
List machine learning and natural language processing techniques that we can use on words and sentences. Hierarchical clustering and graph creation methods should be interesting for finding which words are close to each other.
Scripts [Timothée, Vincent, Johan, Laurent, Matteo]
During the whole project, but as soon as possible (at least two scripts by 17 March)
Write scripts to automate data collection and cleaning. Downloaded HTML/JS/XML/... files need to be converted to the common format automatically. While Perl is a good scripting language for text processing, Scala seems to be a better idea, as the whole project is in Scala.
- Collect a lot of PUZ files on the web and import them. Don't forget to add information like the URL and the source [Timothée]
- Script for The Guardian, Puzzles by Jim [Johan]
- Script for Crossword Puzzle Games [Matteo]
- Script for Boatload of crosswords [Timothée]
- Script for Mirror [Timothée]
- Script for Wiktionary archives [Laurent, Timothée]
Some websites allow personal use of their data, which should include "personal" learning of our system. However, we must not publish the data without their consent! This means that crosswords won't be published on the Git repository, nor be available in the final application!
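The conversion step of such a script could look like the sketch below, which turns a downloaded HTML page into a common-format record while keeping the URL and the source. The markup (`<li data-answer="...">`) and the record fields are hypothetical and do not describe any of the websites listed above; each real site needs its own parser. Python is used here only for brevity; the project itself targets Scala.

```python
from html.parser import HTMLParser


class ClueParser(HTMLParser):
    """Collect (answer, clue) pairs from <li data-answer="..."> elements.

    The markup is hypothetical; it only illustrates the conversion step.
    """

    def __init__(self):
        super().__init__()
        self.entries = []
        self._answer = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # None when the attribute is missing, so the item is skipped.
            self._answer = dict(attrs).get("data-answer")
            self._text = []

    def handle_data(self, data):
        if self._answer is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._answer is not None:
            clue = "".join(self._text).strip()
            self.entries.append({"word": self._answer.upper(), "clue": clue})
            self._answer = None


def html_to_common(html, url, source):
    """Convert one downloaded page to a record, keeping URL and source."""
    parser = ClueParser()
    parser.feed(html)
    parser.close()
    return {"source": source, "url": url, "words": parser.entries}
```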
Local database and server [Utku, Grégory, Patrick]
17 March
Set up a local database and web server. Prepare instructions so that everyone can easily install them as well. We won't have any public online server until we have meaningful results.
HDFS and Spark [Laurent, Johan]
17 March
Set up a local Spark distribution. It may be used with a local Hadoop installation, or with the one on the provided clusters.
Web viewer [Utku, Grégory, Patrick]
24 March
Create the web application project in Scala Play. Write some web pages to list and view the crosswords that are in the database. Also provide a "download as JSON" button. As a first attempt, you could show only the solution, but the idea is to have an interactive game, like those of The Guardian and the Mirror.
The next 6 steps are described in detail under: Machine Learning
Upload data [Johan, Timothée]
13 April
- Merge the many JSON files into a single file
- Upload the data to HDFS, storing:
  - Crosswords into: /projects/crosswords/crosswords
  - Wiktionary into: /projects/crosswords/definitions
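The merging step could be sketched as follows. Writing one compact JSON object per line (the "JSON Lines" layout) is an assumption of this sketch, chosen because a line-oriented file is easy to process record by record; the source does not say which layout the merged file uses. Python is used here only for brevity; the project itself targets Scala.

```python
import json


def merge_json(documents):
    """Merge JSON documents (given as strings) into one JSON Lines string.

    Each output line holds one compact JSON object, so the merged file can
    be split on newlines and processed record by record.
    """
    lines = []
    for text in documents:
        record = json.loads(text)  # fails early on malformed input
        lines.append(json.dumps(record, ensure_ascii=False, separators=(",", ":")))
    return "".join(line + "\n" for line in lines)
```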
Bags creation [Johan, Timothée]
13 April
Provide a Spark-friendly representation of the crosswords and the Wiktionary definitions, following the bag-of-words model.
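In the bag-of-words model, a text is reduced to the multiset of its tokens, ignoring word order. A minimal sketch, assuming each entry is a (word, text) pair and that all clues and definitions of the same word are merged into one bag; the function names are illustrative. Python is used here only for brevity; the project itself targets Scala and Spark.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def bag_of_words(text):
    """Return a Counter mapping each upper-cased token to its frequency."""
    return Counter(token.upper() for token in TOKEN.findall(text))


def build_bags(entries):
    """Merge all clues/definitions of the same word into a single bag.

    `entries` is an iterable of (word, text) pairs.
    """
    bags = {}
    for word, text in entries:
        bags.setdefault(word.upper(), Counter()).update(bag_of_words(text))
    return bags
```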
Word reduction [Vincent, Patrick, Grégory]
13 April
Find a way to clean the bags, in order to:
- Treat similar words as equivalent (e.g. TAKE and TAKING)
- Discard "common" words (e.g. "a", "the", "my", ...)
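Both cleaning steps can be illustrated with a deliberately naive sketch: a tiny stop-word list and a suffix stripper that maps TAKE and TAKING to the same stem. The stop-word list and the suffix rules are assumptions chosen for the example, and a proper stemmer or lemmatizer would handle far more cases. Python is used here only for brevity; the project itself targets Scala.

```python
# Illustrative lists only; a real system would use much larger ones.
STOP_WORDS = {"A", "AN", "THE", "MY", "OF", "TO", "IN", "AND", "OR"}
SUFFIXES = ("ING", "ED", "ES", "S", "E")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def reduce_words(tokens):
    """Upper-case the tokens, drop stop words and stem the rest."""
    upper = (token.upper() for token in tokens)
    return [stem(token) for token in upper if token not in STOP_WORDS]
```

Note that the stems (e.g. "TAK") are not real words; they only serve as shared keys in the bags.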
Graph representation [Laurent, Matteo]
13 April
Use the bags to create a weighted graph representation (the weights corresponding to similarity), allowing us to do some learning and analysis on the bags mentioned previously.
As a first step:
- Provide an adjacency list.
- Provide a way to find words that are close to some given keywords.
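One way to obtain such a graph is to use the cosine similarity between bags as edge weight. This is a sketch under that assumption; the source does not state which similarity measure was used, and the function names and the threshold are illustrative. The pairwise loop is quadratic in the number of words, so it only suits small examples. Python is used here only for brevity; the project itself targets Scala and Spark.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags (Counter objects)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def adjacency_list(bags, threshold=0.1):
    """Map each word to a list of (neighbour, weight) pairs, heaviest first."""
    graph = {word: [] for word in bags}
    words = sorted(bags)
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            weight = cosine(bags[u], bags[v])
            if weight >= threshold:
                graph[u].append((v, weight))
                graph[v].append((u, weight))
    for neighbours in graph.values():
        neighbours.sort(key=lambda pair: (-pair[1], pair[0]))
    return graph


def closest_words(bags, keywords, limit=5):
    """Rank words by the similarity between their bag and the keyword bag."""
    query = Counter(keyword.upper() for keyword in keywords)
    scored = [(word, cosine(bag, query)) for word, bag in bags.items()]
    scored = [(word, score) for word, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]
```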
Clustering [Utku, Johan]
13 April
Use a clustering algorithm to provide another way of doing machine learning on the bags.
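As one example among many possible algorithms, the sketch below performs agglomerative (hierarchical) clustering with single linkage, using the Jaccard similarity between token sets. Both choices, as well as the threshold, are assumptions; the source does not say which clustering algorithm was used. Python is used here only for brevity; the project itself targets Scala and Spark.

```python
def jaccard(bags):
    """Build a similarity function over words from their token sets."""
    def similarity(u, v):
        a, b = set(bags[u]), set(bags[v])
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return similarity


def single_linkage(items, similarity, threshold):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    Cluster similarity is the best similarity between any two members
    (single linkage). Merging stops when no pair reaches `threshold`.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]
```

Recording the order of the merges, instead of only the final clusters, would give the full hierarchy.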
Mining more data [Timothée, Johan]
13 April
Gather more data from:
- Crosswords
- Definitions
- Any text that associates words together
Improve web page [Patrick, Grégory, Vincent, Timothée]
4 May
We need to improve the web page. Here are the remaining tasks:
- Finalize the search, using two text fields (keywords and filter). The results must display the words with their scores, sorted by score. Also automatically add a link to Wiktionary.
- For now, the Spark/SQL data is not complete, so you should just return the query. But make sure to have done as much as possible.
- If the search is fast enough, we should use AJAX to get "real-time" search. So please check how to use it with Scala Play.
- Improve the crossword page.
- Try to generate random crosswords (offline, in Scala or Spark).
- Improve the CSS of the whole web page.
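The search described in the first task could behave like the sketch below: candidate words are filtered by a pattern, sorted by score, and each row gets a Wiktionary link. The filter syntax ("?" for one unknown letter), the `scores` input and the lower-casing of the link target are assumptions of this sketch. Python is used here only for brevity; the web application itself is planned in Scala Play.

```python
import re
from urllib.parse import quote


def filter_to_regex(pattern):
    """Turn a filter such as "C?T" into a regex; "?" is one unknown letter."""
    parts = ["." if ch == "?" else re.escape(ch) for ch in pattern.upper()]
    return re.compile("".join(parts))


def search(scores, pattern=""):
    """Return (word, score, link) rows matching the filter, best score first.

    `scores` maps candidate words to the score computed for the keywords.
    """
    regex = filter_to_regex(pattern) if pattern else None
    rows = []
    for word, score in scores.items():
        if regex is None or regex.fullmatch(word.upper()):
            # Assumption: entries are looked up in lower case on Wiktionary.
            link = "https://en.wiktionary.org/wiki/" + quote(word.lower())
            rows.append((word, score, link))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```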
Continue data processing [Laurent, Matteo, Utku, Johan]
4 May
Laurent and Matteo achieved some interesting results in Spark. We have to continue on this path, in order to have something that more or less works as soon as possible. Once we have that, we can still improve it if we have time. So, the idea is the following:
- Laurent and Matteo, please continue and improve what you have done with these sparse vectors. Try to clean up your code to improve usability and efficiency. For instance, you could precompute word-definition pairs on the cluster.
- I will try to improve my markup parser, to provide more data from Wiktionary. I will also try to get data from Wikipedia.
- Utku will prepare SQL queries for the website, assuming an adjacency matrix stored in the database.
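A query of the kind mentioned in the last item could look like the sketch below, assuming the adjacency matrix is stored sparsely as (source, target, weight) rows. The table name, the column names and the scoring rule (summing edge weights) are assumptions, and SQLite is used only as a stand-in for the project's database. Python is used here only for brevity; the project itself targets Scala.

```python
import sqlite3


def create_adjacency(edges):
    """Store a sparse adjacency matrix as (source, target, weight) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE adjacency (source TEXT, target TEXT, weight REAL, "
        "PRIMARY KEY (source, target))"
    )
    conn.executemany("INSERT INTO adjacency VALUES (?, ?, ?)", edges)
    return conn


def suggest(conn, keywords, limit=10):
    """Rank target words by the summed weight of edges from the keywords."""
    if not keywords:
        return []
    placeholders = ", ".join("?" for _ in keywords)
    query = (
        "SELECT target, SUM(weight) AS score FROM adjacency "
        f"WHERE source IN ({placeholders}) "
        "GROUP BY target ORDER BY score DESC, target LIMIT ?"
    )
    return conn.execute(query, [*keywords, limit]).fetchall()
```

Only the placeholders are interpolated into the query string; the keywords themselves are passed as bound parameters.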
Finalize and test current solution [Everyone]
11 May
On the data processing side, we have found a way to generate a good adjacency matrix. So we need to implement and test it:
- Laurent and I will continue coding in Spark and SQL to improve speed and correctness.
- Utku and Matteo will continue to test other metrics and machine learning techniques. They will also prepare some tests for our adjacency matrix.
- Patrick and Vincent, you will focus on the project report.
- Grégory and Timothée will continue to improve the webpage as planned.