Project plan

A study of linguistic drift in the Le Temps newspaper corpus


BigData supervisors:

DHLab supervisors:


The team :


Project Description :

We have access to the archives of the newspaper Le Temps, which cover approximately 200 years of publication (from 1816 to 1998). The goal of this project is to use those archives to quantify, or otherwise represent, linguistic drift across the years. Language evolves and changes: some words appear while others disappear, and we want to measure and interpret this phenomenon scientifically.


Project goals :

The first main goal of the project is to find a way to use the data we have and to find a good distance metric that lets us quantify and represent the drift between years and its evolution over time.

The second goal of this project (still under discussion) would be to apply machine learning techniques to part of the corpus (the training set) and then, given a text, estimate which year it belongs to (within a certain precision threshold).


Methods to achieve the goals :

For the first goal, two tasks can be done in parallel at the start: reading some of the literature to see what has been done and which distance metrics already exist, and computing word frequencies by year on the corpus. We will then have to find a good distance metric; this part will be somewhat empirical. Once we have a good distance metric, we will be able to compute the distance between individual years or between groups of years. We can then analyze the results and look for interesting observations.
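To make the first-goal pipeline concrete, here is a minimal sketch of per-year word frequencies followed by a distance between two years. The plan does not yet choose a metric; the Jensen-Shannon divergence used here is only one candidate we might evaluate, and the function names and toy token lists are hypothetical.

```python
import math
from collections import Counter


def word_distribution(tokens):
    """Turn a list of tokens into a relative-frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Candidate metric only (an assumption, not the plan's final choice).
    It is symmetric and bounded between 0 (identical) and 1 (disjoint).
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


# Toy data standing in for the real corpus (hypothetical tokens).
tokens_by_year = {
    1900: ["le", "train", "vapeur", "le", "journal"],
    1990: ["le", "ordinateur", "internet", "le", "journal"],
}
distributions = {y: word_distribution(t) for y, t in tokens_by_year.items()}
drift = js_divergence(distributions[1900], distributions[1990])
```

The same function can be applied to groups of years by concatenating their token lists before building the distribution.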

For the second goal, we have to find features to extract from the corpus. Once the features are extracted, we can apply machine learning techniques to a training set (part of the corpus) and try to classify articles from a test set by year as precisely as possible.
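As an illustration of the second goal, the sketch below uses plain word counts as features and a small multinomial Naive Bayes classifier with add-one smoothing. Both the features and the model are assumptions chosen for brevity, since the plan has not yet fixed either; the function names and the toy articles are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_articles):
    """Count words per period from (period, tokens) pairs."""
    counts = defaultdict(Counter)
    for period, tokens in labelled_articles:
        counts[period].update(tokens)
    vocabulary = {word for c in counts.values() for word in c}
    return counts, vocabulary


def predict(model, tokens):
    """Return the period with the highest smoothed log-likelihood.

    Assumes a uniform prior over periods and ignores words that were
    never seen during training.
    """
    counts, vocabulary = model
    best_period, best_score = None, -math.inf
    for period, word_counts in counts.items():
        total = sum(word_counts.values())
        score = sum(
            math.log((word_counts[w] + 1) / (total + len(vocabulary)))
            for w in tokens
            if w in vocabulary
        )
        if score > best_score:
            best_period, best_score = period, score
    return best_period


# Toy training set (hypothetical articles, already tokenized).
training_set = [
    (1850, ["le", "train", "vapeur", "diligence"]),
    (1990, ["le", "ordinateur", "internet", "train"]),
]
model = train(training_set)
guess = predict(model, ["ordinateur", "internet"])
```

In practice the labels could be years or groups of years, and a prediction could be counted as correct when it falls within the chosen precision threshold of the true year.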


Problem Statement :

Required Resources :

The data comes in the form of XML files containing the words (already OCRed) of the articles. Reading and parsing XML files is not a difficult task, so we do not expect the difficulty to come from the data access part of the project.
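A minimal parsing sketch with Python's standard library is shown below. The actual schema of the archive files is not described here, so the `article` element, its `year` attribute, and the `text` child are purely hypothetical placeholders to be replaced once the real layout is known.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real archive schema may differ entirely.
SAMPLE = """<articles>
  <article year="1900"><text>Le train arrive</text></article>
  <article year="1900"><text>Le journal du matin</text></article>
  <article year="1990"><text>Le nouvel ordinateur</text></article>
</articles>"""


def tokens_by_year(xml_string):
    """Group lowercased, whitespace-split words by the article's year."""
    root = ET.fromstring(xml_string)
    grouped = {}
    for article in root.iter("article"):
        year = int(article.get("year"))
        text = article.findtext("text", default="")
        grouped.setdefault(year, []).extend(text.lower().split())
    return grouped


corpus = tokens_by_year(SAMPLE)
```

Whitespace splitting is only a placeholder for tokenization; OCR noise and punctuation would likely need extra cleaning.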


Risks to the success of the project :

(To complete)



11 March

24 March

3 April

5 May

12 May (project due)

19 May


Research topics :



(To complete).