Bigdata2015-crosswords

Crosswords Data Mining

Supervisors:

Team members:

Source code on this GitHub repository

Abstract

Many websites and newspapers offer puzzles in various forms. While solving tools are widely available for games like Sudoku, crosswords have only limited help. Some web pages provide customized search (e.g. Crossword Solver), but only to a small extent.

Our goal is to collect word definitions from various sources (dictionaries and encyclopedias like Wiktionary) and crosswords (from newspaper sources like The Guardian). This database could be analysed and hierarchically clustered to group entries by themes, definitions and synonyms. By joining multiple definitions for a single word, characteristic keywords could be identified. Moreover, some projects like WordNet could improve the results.

The end goal would be to build a web application providing tools to generate and solve crosswords. In such an application, we would provide word proposals through definitions and keywords, as well as word completion.

Goals and milestones

Since we want to create a complete web application based on the results of machine learning and language processing algorithms, we can split the project into 3 distinct tasks. Each of them is described on its own page. They will mainly communicate through a central database. We will also try, as much as possible, to reuse code and algorithms.

The project has 3 milestones:

  1. 16 march, framework setup: tools are chosen and installed locally on our machines. We are able to create simple web pages that can access databases. We have also written some scripts to download and parse crosswords and definitions, and we have some clustering/language processing algorithms.
  2. 13 april, basic result: our first web app version is complete, and we are able to use our algorithms to produce satisfactory results for the main features, like keyword search/completion. We also have a large database of words and crosswords.
  3. 4 may, final result: one week before the deadline, we have implemented more features and tested multiple algorithms. The web page design is improved.

Tasks

Each task will have some people assigned. How they split the work between them is open. The team is split into two groups, though this is only a general idea of what each member focuses on:

Common file format [Laurent, Timothée, Vincent]

11 march

Choose a file format (probably XML or JSON) and define how we will encode words, definitions and crosswords in a portable way. This should be scalable and robust, i.e. support Unicode and be able to store any metadata. Also, prepare a list of common tags/properties. Finally, build a simple application in Scala to visualize these common files.
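To make the discussion concrete, here is a minimal sketch of what a JSON-based common format could look like in Scala. The case classes, the field names (number, direction, definition, answer, source, language, metadata) and the hand-written serializer are all assumptions made for illustration. The actual schema is still to be decided, and a real JSON library would probably replace the manual encoding.

```scala
// Sketch only: every name below is hypothetical, not an agreed schema.
object CommonFormat {
  final case class Clue(number: Int, direction: String, definition: String, answer: String)

  // "metadata" is a free-form map so that any extra property can be stored.
  final case class Crossword(source: String, language: String,
                             metadata: Map[String, String], clues: Seq[Clue])

  // Encode a string as a JSON string literal. Non-ASCII characters are kept
  // as-is, so Unicode text survives as long as the file is written in UTF-8.
  def quote(s: String): String = {
    val sb = new StringBuilder
    sb.append('"')
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c if c < ' ' => sb.append('\\').append('u').append("%04x".format(c.toInt))
      case c => sb.append(c)
    }
    sb.append('"')
    sb.toString
  }

  // Build a JSON object from (key, already-encoded value) pairs.
  private def obj(fields: (String, String)*): String =
    fields.map { case (k, v) => quote(k) + ":" + v }.mkString("{", ",", "}")

  def clueJson(c: Clue): String = obj(
    "number" -> c.number.toString,
    "direction" -> quote(c.direction),
    "definition" -> quote(c.definition),
    "answer" -> quote(c.answer))

  def crosswordJson(cw: Crossword): String = obj(
    "source" -> quote(cw.source),
    "language" -> quote(cw.language),
    // Keys are sorted so that the output is deterministic.
    "metadata" -> obj(cw.metadata.toSeq.sortBy(_._1).map { case (k, v) => k -> quote(v) }: _*),
    "clues" -> cw.clues.map(clueJson).mkString("[", ",", "]"))
}
```

For example, `CommonFormat.crosswordJson(cw)` would return a single-line JSON object for a crossword `cw`, which the viewer application could then read back.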

Database framework [Utku]

11 march

Choose which database system we will use as a common storage. Compare candidates and select the most appropriate/simple/free solution. Also, propose how we should store words, definitions and crosswords in the database, i.e. what the database schema will be. Of course, the schema will be extended once the machine learning part produces results.

Web framework [Patrick, Gregory, Utku]

11 march

Enumerate, compare and choose which web framework we will use. Prefer the ones that produce nice and simple webpages, like Polymer. Since we are going to use Scala for most of the algorithmic part, Scala Play should be the major candidate.

Application feature proposal [everyone]

During the whole project, but as soon as possible (ideally we would have some interesting discussion on 11 march)

Propose features for the final app. All ideas are welcome, even weird ones :)

Websites for data mining [Timothée, Vincent]

During the whole project, but as soon as possible (ideally we would have some interesting discussion on 11 march)

Enumerate some websites that provide word definitions and crosswords. Synonyms, antonyms and any other data are welcome.

Algorithms [Laurent, Johan, Utku]

During the whole project, but as soon as possible (ideally we would have some interesting discussion on 11 march)

Enumerate machine learning and natural language processing techniques that we can use on words and sentences. Hierarchical clustering and graph creation methods should be interesting for finding which words are close to each other.

Scripts [Timothée, Vincent, Johan, Laurent, Matteo]

During the whole project, but as soon as possible (at least two scripts by 17 march)

Write scripts to automate data collection and cleaning. Downloaded HTML/JS/XML/... files need to be converted to the common format automatically. While Perl is a good scripting language for text processing, Scala seems to be a better idea, as the whole project is in Scala.

Some websites allow personal use of their data, which should include "personal" learning of our system. However, we must not publish the data without their consent! This means that crosswords won't be published on the Git repository, nor will they be available in the final application!

Local database and server [Utku, Gregory, Patrick]

17 march

Set up a local database and web server. Prepare instructions so that everyone can easily install them as well. We won't have any public online server until we have meaningful results.

HDFS and Spark [Laurent, Johan]

17 march

Set up a local Spark distribution. It may be used with a local Hadoop installation, or with the one on the provided clusters.

Web viewer [Utku, Gregory, Patrick]

24 march

Create the web application project in Scala Play. Write some web pages to enumerate and view the crosswords that are in the database. Also provide a "download as JSON" button. As a first try, you could show only the solution, but the idea is to have an interactive game, like The Guardian and the Mirror.

 

The next 6 steps are described in detail under: Machine Learning

Upload data [Johan, Timothée]

13 april

  1. Merge many JSON files into a single file
  2. Upload data to HDFS, storing:
    • Crosswords into: /projects/crosswords/crosswords
    • Wiktionary into: /projects/crosswords/definitions
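The merging step could be sketched as follows, assuming that the merged file holds one JSON document per line (a layout that line-oriented tools read easily). This layout, the object name JsonMerge and its methods are assumptions, not the format we have settled on. The sketch relies on the fact that raw line breaks cannot occur inside JSON strings, so replacing them with spaces does not change the meaning of a valid document.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Sketch only: names and the one-document-per-line layout are assumptions.
object JsonMerge {
  // Collapse a (possibly pretty-printed) JSON document onto a single line.
  // Line breaks in valid JSON only appear between tokens, never inside strings.
  def toJsonLine(doc: String): String =
    doc.trim.replaceAll("[\\r\\n]+", " ")

  // Concatenate documents, one per line, with a trailing newline.
  def mergeDocs(docs: Seq[String]): String =
    docs.map(toJsonLine).mkString("", "\n", "\n")

  // Read every input file as UTF-8 and write the merged result to `output`.
  def mergeFiles(inputs: Seq[Path], output: Path): Unit = {
    val docs = inputs.map(p => new String(Files.readAllBytes(p), UTF_8))
    Files.write(output, mergeDocs(docs).getBytes(UTF_8))
    ()
  }
}
```

The merged file could then be copied to the HDFS directories listed above with a standard client command such as hdfs dfs -put.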

Bags creation [Johan, Timothée]

13 april

Provide a Spark-friendly representation of the crosswords and the Wiktionary definitions, following the bag-of-words model.
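As an illustration of the bag-of-words model, the sketch below turns a definition or a clue into a map from word to number of occurrences. The tokenization rule (lowercase, split on anything that is not a letter) and the names BagOfWords, tokenize and bag are assumptions. It is written in plain Scala so that it runs without a cluster.

```scala
// Sketch only: tokenization rules and names are assumptions.
object BagOfWords {
  // Lowercase the text, split on every run of non-letter characters
  // (Unicode-aware), and drop the empty tokens produced by leading separators.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty).toSeq

  // A bag maps each word to the number of times it occurs in the text.
  def bag(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }
}
```

For instance, `BagOfWords.bag("Small domestic cat, a small feline")` counts "small" twice and every other word once. On Spark, the same function could presumably be applied to each (word, definition) entry, for example with mapValues on a pair RDD.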

Word reduction [Vincent, Patrick, Grégory]

13 april

Finding a way to clean the bags to:

Graph representation [Laurent, Matteo]

13 april

Use the bags to create a weighted graph representation (the weight corresponding to the similarity), allowing us to do some learning and analysis on the bags previously mentioned.
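One possible way to build such a graph is sketched below. It assumes cosine similarity between bags as the edge weight and keeps only the edges above a threshold. Both choices, as well as the names SimilarityGraph, cosine, Edge and edges, are assumptions made for illustration and not necessarily the measure we will end up using.

```scala
// Sketch only: cosine similarity and the threshold are assumed choices.
object SimilarityGraph {
  type Bag = Map[String, Int]

  // Cosine similarity between two bags seen as sparse word-count vectors.
  def cosine(a: Bag, b: Bag): Double = {
    val dot = a.iterator.map { case (w, n) => n.toDouble * b.getOrElse(w, 0) }.sum
    val normA = math.sqrt(a.values.map(n => n.toDouble * n).sum)
    val normB = math.sqrt(b.values.map(n => n.toDouble * n).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Undirected weighted edge between two words.
  final case class Edge(from: String, to: String, weight: Double)

  // Compare every pair of words once and keep the edges whose weight
  // reaches the threshold. This is quadratic in the number of words.
  def edges(bags: Map[String, Bag], threshold: Double): Seq[Edge] = {
    val keys = bags.keys.toVector.sorted
    for {
      i <- keys.indices
      j <- (i + 1) until keys.size
      w = cosine(bags(keys(i)), bags(keys(j)))
      if w >= threshold
    } yield Edge(keys(i), keys(j), w)
  }
}
```

The all-pairs comparison is only reasonable for small inputs. On the full data set, some pruning (for example comparing only bags that share at least one word) would likely be needed.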

As a first step:

Clustering [Utku, Johan]

13 april

Use a clustering algorithm to provide another way of doing machine learning on the bags.
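Since hierarchical clustering is one of the candidates mentioned earlier, here is a minimal sketch of agglomerative clustering with single linkage. It repeatedly merges the two most similar clusters until no pair reaches a similarity threshold. The linkage rule, the stopping criterion and the names Agglomerative, cluster and linkage are assumptions. The similarity function is passed as a parameter, so any measure on bags could be plugged in.

```scala
// Sketch only: single linkage and a fixed threshold are assumed choices.
object Agglomerative {
  def cluster[A](items: Seq[A], sim: (A, A) => Double, threshold: Double): List[Set[A]] = {
    // Single linkage: two clusters are as similar as their closest members.
    def linkage(c1: Set[A], c2: Set[A]): Double =
      (for (x <- c1.iterator; y <- c2.iterator) yield sim(x, y)).max

    @annotation.tailrec
    def loop(clusters: List[Set[A]]): List[Set[A]] = {
      // Score every unordered pair of clusters.
      val pairs = for {
        (c1, i) <- clusters.zipWithIndex
        (c2, j) <- clusters.zipWithIndex
        if i < j
      } yield (c1, c2, linkage(c1, c2))

      if (pairs.isEmpty) clusters
      else {
        val (c1, c2, best) = pairs.maxBy(_._3)
        if (best < threshold) clusters // nothing is similar enough to merge
        else loop((c1 ++ c2) :: clusters.filterNot(c => c == c1 || c == c2))
      }
    }

    // Start with one singleton cluster per distinct item.
    loop(items.distinct.map(Set(_)).toList)
  }
}
```

This naive version recomputes all pairwise linkages at every step, so it is only meant to explain the idea on small samples, not to run on the whole database.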

Mining more data [Timothée, Johan]

13 april

Gathering more data from:

Improve web page [Patrick, Grégory, Vincent, Timothée]

4 may

We need to improve the web page. Here are the remaining tasks:

Continue data processing [Laurent, Matteo, Utku, Johan]

4 may

Laurent and Matteo achieved some interesting results in Spark. We have to continue on this path, in order to have something that works more or less as soon as possible. Once we have something, we can still improve it if we have time. So, the idea is the following:

Finalize and test current solution [Everyone]

11 may

On the data processing side, we have found a way to generate a good adjacency matrix. So we need to implement and test it: