TV-shows analysis & recommender system
Motivations
Because the number of TV shows keeps growing, it is difficult to classify them properly. Genre labels carry little information, and reading a synopsis often leads to spoilers. In addition, the offer is now so large that many shows struggle to get exposure, because "blockbusters" and heavily hyped programs capture most of the attention (the long-tail phenomenon). As a result, many people never consider shows that might be a good, enjoyable match for them.
Goals
The goal of the project is to analyze the content of TV shows according to certain themes (i.e., tags or topics) through subtitle analysis. To achieve this, we will acquire a large, good-quality data set of subtitles and then use machine learning and data analysis techniques to identify the themes present in each show.
For example, for the show "Homeland", the resulting tag scores could be:
60% terrorism, 20% psychology, 10% espionage and 10% romance.
As a final result, we want two things:
- For each TV show, a detailed information page listing its themes and their weight (frequency) in the show.
- A content-based recommender system for TV shows: given one TV show, the system proposes the most similar TV show based on the similarity between their themes.
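As a minimal sketch of the second goal, each show can be represented by its theme weights and compared with the others. The proposal does not fix a similarity measure, so cosine similarity is an assumption here, and `Show A`, `Show B` and their weights are hypothetical; only the "Homeland" weights come from the example above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse theme-weight dicts."""
    dot = sum(weight * b.get(theme, 0.0) for theme, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a show without any theme is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(target, profiles):
    """Return (title, score) of the show closest to `target`, excluding itself."""
    candidates = (
        (title, cosine_similarity(profiles[target], themes))
        for title, themes in profiles.items()
        if title != target
    )
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical theme profiles (assumption), except for "Homeland".
profiles = {
    "Homeland": {"terrorism": 0.6, "psychology": 0.2, "espionage": 0.1, "romance": 0.1},
    "Show A": {"terrorism": 0.5, "espionage": 0.4, "romance": 0.1},
    "Show B": {"romance": 0.7, "comedy": 0.3},
}

best_title, best_score = most_similar("Homeland", profiles)
print(best_title, round(best_score, 3))
```

Cosine similarity ignores the overall magnitude of the weight vectors, which is convenient if some shows have far more subtitle text than others; other measures could be substituted without changing the structure.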
Project Steps
1. Crawling: getting the data
2. Preprocessing: formatting the data
3. Processing: analyzing the data (words & topics classification)
4. Recommender system
5. Results presentation in a web interface [Optional]
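The preprocessing step could look roughly like the sketch below, which turns one subtitle file into a list of lowercase words ready for topic analysis. It assumes the subtitles are in SubRip (`.srt`) format, which the proposal does not specify; the function name `srt_to_tokens` and the sample text are hypothetical.

```python
import re

# Assumption: SubRip (.srt) cues, e.g. "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
TAG = re.compile(r"<[^>]+>")  # inline markup such as <i>...</i>


def srt_to_tokens(srt_text):
    """Drop cue numbers, timestamps and markup; return lowercase word tokens."""
    tokens = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue counters and timing lines.
        # Note: a dialogue line made only of digits would also be skipped.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        line = TAG.sub("", line)
        tokens.extend(re.findall(r"[a-z']+", line.lower()))
    return tokens


sample = """1
00:00:01,000 --> 00:00:03,500
<i>Previously on the show...</i>

2
00:00:04,000 --> 00:00:06,000
We need to talk.
"""

tokens = srt_to_tokens(sample)
print(tokens)
```

Stop-word removal and stemming are left out on purpose; whether they are needed depends on the topic model chosen in the processing step.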
Resources
We will not need much storage space. We will try to collect as many TV shows as we can, but since a subtitle file is around 50 kB, an upper bound on the storage space we need is 100 GB (10,000 TV shows, each with 7 seasons of 25 episodes).
We will certainly need Hadoop or Spark to parallelize the computation.
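The storage upper bound can be checked with a short calculation. The sketch assumes one subtitle file per episode and decimal units (1 GB = 1,000,000 kB); the variable names are illustrative.

```python
shows = 10_000
seasons_per_show = 7
episodes_per_season = 25
kb_per_subtitle = 50  # approximate size of one subtitle file

# Assumption: exactly one subtitle file per episode.
episodes = shows * seasons_per_show * episodes_per_season
total_kb = episodes * kb_per_subtitle
total_gb = total_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 kB

print(episodes, total_gb)
```

This gives 1,750,000 episodes and 87.5 GB, which is consistent with the 100 GB upper bound stated above.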
Risks & Difficulties
- In step 3 (Processing): this part involves a lot of testing and manual work to attribute a label to each cluster. This can quickly become time-consuming, and a special effort will be needed to obtain relevant cluster tags.
Team composition & Task repartition
The team has 8 members:
- Claire Musso: 1.1, 1.2
- Florian Simond: 2.2, 2.3, 2.4
- Grigory Rozhdestvenskiy: 3.1, 3.2, 3.3
- Khalil Hajji: 3.1, 3.2, 3.3
- Nassim Drissi El Kamili: 1.1, 1.2
- Nils Bouchardon: 1.1, 1.2
- Simon-Pierre Génot: 3.1, 3.2, 3.3
- Raphaël von Aarburg (leader): 2.1, 2.2, 3.3
Timeline
- By 24.03
  - Final proposal
- From 24.03 to 07.04
  - Dataset acquisition: 1.1, 1.2 & 1.3
  - Preprocessing done: 2.1, 2.2, 2.3, 2.4 & 2.5
  - Processing started: 3.1, 3.2 & 3.3
- From 08.04 to 21.04
  - Processing done: 3.4 & 3.5
  - Recommender system started: 4.1, 4.2 & 4.3
- From 22.04 to 05.05
  - Recommender system done: 4.4 (& 4.5 [Optional])
- From 06.05 to 13.05
  - Result presentation
  - Web interface done [Optional]
END OF THE PROJECT
Milestones
- 08.04.2014:
  - Data stored, indexed and preprocessed
  - -> Ready for the processing part
- 22.04.2014:
  - LDA algorithm implemented and tested
  - Theme/topic selection made
  - Words and TV shows given a score for each theme
  - -> Ready to implement the recommender system
- 06.05.2014:
  - Recommender system algorithm implemented and tested
  - For each TV show, the list of similarities with the other TV shows obtained
- 13.05.2014:
  - Final presentation