Bigdata2015-EventStreamDetection
As part of the 2015 Big Data course, we are leading a project addressing event/topic detection in news streams. Nine people are working on this project. For details about our team composition, please refer here.
For everything related to the development environment, please refer here.
Dataset
The dataset for this project is provided by the DHLab. It consists of archives from the Swiss newspaper "Le Temps". This historical database comprises two newspapers spanning 200 years:
- “Journal de Genève” (JDG) from 1826 to 1998,
- “Gazette de Lausanne” (GDL, under different names) from 1798 to 1998.
Goal
We aim to detect articles that discuss the same topic over a set of issues contiguous in time, and across the two newspapers. To do this, we are looking into clustering, hierarchical clustering, and correlation detection techniques. One of the main challenges is the sheer amount of data: we are considering articles spanning almost 200 years, which is why the algorithms we implement need to be scalable.
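To make the clustering idea concrete, here is a minimal, illustrative sketch, not our actual implementation: articles are represented as TF-IDF vectors and grouped by a greedy single-pass threshold on cosine similarity. The function names, the toy tokenization, and the 0.2 threshold are all assumptions chosen for illustration; a scalable version would need an inverted index or locality-sensitive hashing rather than pairwise comparison.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector (term -> weight dict) for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: assign each article to the first
    cluster whose representative (first member) is similar enough,
    otherwise start a new cluster. Threshold is a tunable assumption."""
    vecs = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

On four toy articles (two about a Geneva election, two about a Lausanne railway), this groups the articles pairwise by topic. The greedy pass is O(n·k) in the number of clusters, which is why the production version must be smarter about candidate selection.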
Project calendar and milestones
In a first phase, we are looking into various research papers tackling similar problems; a discussion of these papers' content can be found here. By the end of this first phase, the goal is to have a formal definition of what a topic is and to have chosen the two or three methods/algorithms that appear best suited for our project, both in terms of efficiency and scalability. The deadline for the end of this first phase is:
milestone 1 : March 18
We will then begin to implement the algorithms. We first want a running implementation, even if it runs sequentially on a small subset of the whole dataset. This phase is to be finished by:
milestone 2 : April 15
At milestone 2, we had several pieces of code working independently of each other. We now enter an integration and optimization phase of the project. This part is to be done by the new milestone 2b, so that we can assess what we should focus on in the last two weeks.
milestone 2b : April 30
Finally, we will work on the scalability of our implementations, run various experiments, extract results, and try to interpret them. The end of this last phase is also the end of the project:
milestone 3 : May 12
Division of labour
We are currently working on formalizing aspects of the project and finding the best-suited methods to tackle event stream detection. Small sets of research papers (2 to 3) have therefore been assigned to small teams (2 to 3 people). This is a temporary organization; we will reorganize the team according to the implementations to be performed. This reorganization will be done by March 18th (milestone 1).