First meeting with the proposer of the project

Here we will just summarize what was said during the meeting with Vincent Buntinx on March 27th, from 2pm to 4pm.

We will not describe the project again here; go to the project's thread in the Big Data projects forum to see it.

We still have to see our Big Data T.A., with whom we will discuss the deadlines, the goals, etc. in more detail. So everything we say here can change.

During the meeting, Vincent explained to us what exactly the project is. We will have access to ~200 years of articles (originating from two different newspapers) from the Le Temps database. First thing to know: they will give us access to the database, but we will apparently have some papers to sign, as the database is not public.

Then he showed us his own research in the field. The database is apparently made of zip files. Each zip contains the newspaper, images and, most important to us, an XML file containing all the words in the articles. So the OCR work has already been done for us, but judging by the sample we were given, the text has not yet been cleaned.
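To make the zip/XML layout described above more concrete, here is a minimal sketch of how the words could be pulled out of one archive. We have not studied the real schema yet, so this is only an assumption: it ignores tag names entirely and simply collects every text node of every .xml file in the zip. The function names (words_from_xml, words_from_zip) are our own, and the very rough tokenization (letters only, lowercased) is just a first cleaning step, not a decision.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Runs of Unicode letters only: digits, punctuation and underscores are dropped.
WORD_RE = re.compile(r"[^\W\d_]+")


def words_from_xml(xml_bytes):
    """Return the lowercased words found in all text nodes of one XML document."""
    root = ET.fromstring(xml_bytes)
    # Assumption: we do not know the real tag names, so we take every text node.
    text = " ".join(root.itertext())
    return [word.lower() for word in WORD_RE.findall(text)]


def words_from_zip(zip_file):
    """Yield the words of every .xml member of a zip archive (path or file object)."""
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                yield from words_from_xml(archive.read(name))
```

Once we have seen the real files, the tag-agnostic part should probably be replaced by something that only reads the elements that actually hold article text.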

The first thing to do would be to parse those XML files and build a database with a count of each word for each year. After that there is no precise method, but one thing we should be able to do is draw a graph showing the distances between years. Vincent showed us what he has already done. We are free to use, for example, the same distance metric as him, but we are also free to find new ones and to read articles to try other things. From there we are free to extend this further.
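As a rough illustration of the two steps above (word counts per year, then distances between years), here is a small sketch. It assumes we already have (year, word) pairs from the parsing step; how the year is obtained from the files is still unknown to us. The metric used here is plain cosine distance on the raw counts, which is only a stand-in and not necessarily the one Vincent uses. All function names are our own.

```python
import math
from collections import Counter, defaultdict


def count_words_by_year(words_with_year):
    """Build {year: Counter(word -> count)} from an iterable of (year, word) pairs."""
    counts = defaultdict(Counter)
    for year, word in words_with_year:
        counts[year][word] += 1
    return counts


def cosine_distance(counts_a, counts_b):
    """Return 1 - cosine similarity of two word-count mappings (0 = same profile)."""
    dot = sum(n * counts_b.get(word, 0) for word, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A year without any word is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(counts):
    """Return {(year_a, year_b): distance} for every pair of years."""
    years = sorted(counts)
    return {(a, b): cosine_distance(counts[a], counts[b]) for a in years for b in years}
```

For example, distance_matrix(count_words_by_year(pairs)) gives a dictionary that can be fed to any plotting tool to draw the graph of distances between years. Swapping in another metric only means replacing cosine_distance.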

Actually, for the moment, the project is relatively open, at least as far as Vincent is concerned. This could change when we see our Big Data T.A., as he still has to meet with Vincent.

The thing to remember is that this will be our project, with goals set in agreement by us and the Big Data T.A., so there's nothing definitive yet.

The best would be for all of us (the full team) to meet at some point during the week. We will create a Doodle for that.