Project plan
A study of linguistic drift in the Le Temps newspaper corpus
BigData supervisors:
- Christophe Koch
- Immanuel Trummer
DHLab supervisors:
- Yannick Rochat
- Vincent Buntinx
The team :
- Cynthia Oeschger (Team leader)
- Farah Bouassida
- Tatiana Nikitina
- Tao Lin
- Jéremy Weber
- Nicolas Bornand
- Marc Schär
- Gil Brechbühler
- Malik Bougacha
Project Description :
We have access to the archives of the newspaper Le Temps, which cover nearly 200 years of publication (1816 to 1998). Using these archives, the goal of this project is to quantify, or otherwise represent, linguistic drift across the years. Language evolves: some words appear while others disappear, and we want to study this phenomenon scientifically.
Project goals :
The first main goal of the project is to find a way to exploit the data we have and to find a good distance metric that lets us quantify and represent the drift between years and its evolution.
The second goal of this project (still under discussion) would be to apply machine learning techniques to part of the corpus (the training set) and then, given a text, estimate which year it belongs to (within a given precision threshold, of course).
Methods to achieve the goals :
For the first goal, two things must first be done in parallel: read some of the literature to see what has been done and which distance metrics already exist, and compute word frequencies by year on the corpus. We will then have to find a good distance metric; this part will be somewhat empirical. Once we have a good distance metric, we will be able to compute the distance between individual years or between groups of years. We can then analyze the results and look for interesting observations.
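As a rough illustration of this pipeline, the sketch below computes relative word frequencies per year and compares years with the cosine distance. The cosine distance is only one candidate metric, picked here for simplicity; the toy corpus, the tokenizer and the names `word_frequencies` and `cosine_distance` are assumptions made for the example, not decisions of the project.

```python
import math
import re
from collections import Counter

def word_frequencies(texts):
    """Return relative word frequencies for a list of article texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (assumption): lowercase runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def cosine_distance(freq_a, freq_b):
    """Return 1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(value * freq_b.get(word, 0.0) for word, value in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty vectors
    return 1.0 - dot / (norm_a * norm_b)

# Toy corpus, invented for illustration: year -> list of article texts.
corpus_by_year = {
    1900: ["le train arrive à la gare", "la gare est grande"],
    1901: ["le train quitte la gare", "la ville est grande"],
    1990: ["l'ordinateur remplace la machine", "la télévision arrive"],
}

freq_by_year = {year: word_frequencies(texts) for year, texts in corpus_by_year.items()}

# Distance between every pair of distinct years.
distances = {
    (a, b): cosine_distance(freq_by_year[a], freq_by_year[b])
    for a in freq_by_year
    for b in freq_by_year
    if a < b
}
```

On the real corpus the per-year counts would come from the cluster job rather than from in-memory strings, and groups of years could be compared by merging their counts before normalizing.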
For the second goal, we have to find features to extract from the corpus. Once features are extracted we can apply machine learning techniques on a training set (some part of the corpus) and try to classify articles from a test set by year as precisely as possible.
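To make the second goal more concrete, here is a minimal sketch that dates a text with a multinomial naive Bayes classifier over word counts, with add-one smoothing. Naive Bayes is a stand-in chosen for the example; the plan does not commit to any particular learning technique or feature set, and the training articles and function names below are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def train(labelled_articles):
    """Build per-year word counts from an iterable of (year, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for year, text in labelled_articles:
        word_counts.setdefault(year, Counter()).update(tokenize(text))
        doc_counts[year] += 1
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary

def predict_year(text, model):
    """Return the year with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    tokens = [w for w in tokenize(text) if w in vocabulary]  # skip unseen words
    best_year, best_score = None, -math.inf
    for year, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[year] / total_docs)  # class prior
        for word in tokens:
            # Add-one (Laplace) smoothing over the training vocabulary.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_year, best_score = year, score
    return best_year

# Invented training set, for illustration only.
training_set = [
    (1900, "le télégraphe annonce la nouvelle"),
    (1900, "la diligence arrive en ville"),
    (1990, "l'ordinateur traite la nouvelle"),
    (1990, "la télévision diffuse le match"),
]
model = train(training_set)
guess = predict_year("le télégraphe et la diligence", model)
```

With one class per year and nearly two centuries of data, predicting a decade or a window of years may be a more realistic target than an exact year; that choice is left open here.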
- Does linguistic drift accelerate or decelerate ?
- Are there peaks of linguistic shift, and what might they be related to?
- Which influences the distance more: the words used (1-grams) or the structure of the sentences?
- Does OCR correction improve the results? Compare with the drop-below-threshold method.
- Are n-grams (2-grams or longer) more relevant than 1-grams for quantifying the distance?
- Does tf-idf improve the distance metric?
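For the tf-idf question, one possible reading is to treat each year as a "document" and down-weight words that occur in every year. The sketch below follows that reading with the plain `tf * log(N / df)` formula; both the reading and the toy counts are assumptions, and other tf-idf variants exist.

```python
import math
from collections import Counter

def tf_idf_by_year(counts_by_year):
    """Map {year: Counter of word counts} to {year: {word: tf-idf weight}}."""
    n_years = len(counts_by_year)
    # Document frequency: in how many years does each word occur?
    years_containing = Counter()
    for counts in counts_by_year.values():
        years_containing.update(counts.keys())
    weights = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        weights[year] = {
            word: (n / total) * math.log(n_years / years_containing[word])
            for word, n in counts.items()
        }
    return weights

# Invented counts, for illustration only.
counts_by_year = {
    1900: Counter({"la": 5, "diligence": 2}),
    1950: Counter({"la": 4, "radio": 3}),
    1990: Counter({"la": 6, "ordinateur": 1}),
}
weights = tf_idf_by_year(counts_by_year)
```

A word such as "la", present in every year, gets weight zero, so a distance computed on these weights would be driven by year-specific vocabulary rather than by function words.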
Required Resources :
- Access to Le Temps archives (already guaranteed)
- Access to the cluster, to process this large amount of data in parallel.
The data comes as XML files containing the words (already OCRed) of the articles. Reading and parsing XML files is not a difficult task, so we do not expect the data access part of the project to be a source of difficulty.
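As an indication of how little code the parsing step should need, here is a sketch using Python's standard `xml.etree.ElementTree`. The actual schema of the archive files is not described in this plan: the `<article>` element, its `date` attribute and the `<text>` child are purely hypothetical and would have to be replaced by the real element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real archive schema may differ entirely.
SAMPLE_XML = """<articles>
  <article date="1900-01-15"><text>Le train arrive à la gare</text></article>
  <article date="1990-06-02"><text>La télévision diffuse le match</text></article>
</articles>"""

def articles_by_year(xml_string):
    """Group article texts by publication year (assumes a YYYY-MM-DD date attribute)."""
    root = ET.fromstring(xml_string)
    by_year = {}
    for article in root.iter("article"):
        year = int(article.get("date")[:4])
        text = article.findtext("text", default="")
        by_year.setdefault(year, []).append(text)
    return by_year

parsed = articles_by_year(SAMPLE_XML)
```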
Risks to the success of the project
(To complete)
Milestones
11 March
- Every team member is assigned related papers to read
24 March
- Every member has read their assigned paper and written a short summary of it
- Have computed n-grams for n from 1 up to 4-6 over the entire corpus using Hadoop, and have found a suitable way to store the results for efficient querying during the later phases.
- Have explored alternatives for coping with OCR errors, which seem widespread and would likely distort the rest of our work.
3 April
- Have computed basic statistics using the n-grams to get a first understanding of the data such as the number of unique words, histograms of frequencies, ...
- Define conditions that a metric must satisfy
- Have a first draft of computed distances using a simple metric
- Have chosen and split the research topics between the team members
5 May
- Every member has results for their research topics
12 May (project due)
- Have a web application to interactively visualize some of our findings.
- Have a report.
19 May
- Presentation
- [Cynthia, Marc] Choose a few metrics to quantify linguistic drift, compare them on time slots of various sizes, and answer the questions in the problem statement for each metric.
- Explore other ways to measure the drift, such as:
- 1 - [Nicolas, Tao] compare the variations in the use of synonyms over time
- 2 - [Farah, Jeremy] compare the linguistic drift in articles classified by topic (clustering articles)
- 3 - [Gil, Malik] number of words needed to reach x % coverage of the language (rationale: the number of existing words undoubtedly increases over time, but is this mirrored in the variety actually used?)
- 4 - measure style characteristics, namely the length of sentences, use of punctuation, ratio of nouns / verbs / names, ...
- 5 - use a database of Swiss family names or places and look for occurrences in the articles to quantify the openness of the country (not really related to linguistics though ;)
- 6 - look at the influence of official rectifications of the French language over the course of the 19th and 20th centuries. How long does it take before they become mainstream?
- (Try to date an article using either the aforementioned distances or pure machine learning tools)
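Topic 3 can be made concrete with a small sketch: given the word counts of one year, count how many of the most frequent words are needed for their occurrences to reach a fraction x of all tokens. The function name `words_for_coverage` and the toy counts are assumptions for illustration; "coverage" is interpreted here as a share of token occurrences, which is one possible reading of the topic.

```python
from collections import Counter

def words_for_coverage(counts, coverage):
    """Smallest number of most frequent words whose occurrences
    account for at least `coverage` (between 0 and 1) of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= coverage * total:
            return rank
    return len(counts)

# Invented counts for a single year, for illustration only.
counts_one_year = Counter({"la": 5, "le": 3, "gare": 1, "train": 1})
needed_for_half = words_for_coverage(counts_one_year, 0.5)
needed_for_all = words_for_coverage(counts_one_year, 1.0)
```

Computing this value for each year and plotting it over time would show whether the vocabulary actually in use widens. OCR errors are likely to inflate the count at high coverage levels, so the result should be read together with the OCR correction question.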
Workpackages
(To complete).