TV-shows analysis & recommender system

Motivations

Because the number of TV-shows keeps growing, it is difficult to classify them properly. The genre alone gives little information, and reading a synopsis often leads to spoilers. In addition, the offer is now so large that some shows struggle to get exposure, because “blockbusters” and heavily hyped programs capture most of the attention (the long tail phenomenon). As a result, many people never consider shows that might be a good and enjoyable match for them.

Goals

The goal of the project is to analyze the content of TV-shows according to certain themes (i.e., tags or topics) through subtitle analysis. To achieve that, we will acquire a large, good-quality data set of subtitles and then use machine learning and data analysis techniques to identify the themes present in each show.

For example, for the show "Homeland", the resulting tag scores could be:

60% terrorism, 20% psychology, 10% espionage and 10% romance.
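
A breakdown like this could come from a topic model; the milestones name LDA. The following is only a minimal sketch of LDA with collapsed Gibbs sampling in pure Python, not the project's implementation. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns, for each document, a list of num_topics weights summing to 1.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words of doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                        # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy "subtitle" documents, invented for this example.
docs = [
    "bomb agent attack agent bomb".split(),
    "love kiss wedding love kiss".split(),
    "agent bomb love attack wedding".split(),
]
theta = lda_gibbs(docs, num_topics=2)
for dist in theta:
    print([round(p, 2) for p in dist])
```

Each printed row is one document's weight per topic. The topics themselves are unlabeled, so turning them into tags such as "terrorism" still requires manual inspection of the most frequent words of each topic.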

As a final result, we want to have two things:

For each TV-show, we want a detailed information page that lists the different themes and their weight (frequency) in the show.
We want to implement a content-based recommender system for TV-shows: given one TV-show, the system proposes the most similar TV-show based on the similarity between their themes.
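
The second goal can be sketched as follows, assuming each show is represented by a mapping from theme to weight and that similarity is measured with cosine similarity. The proposal does not fix a similarity measure, so cosine is an assumption. The "Homeland" weights come from the example above; "Show A", "Show B" and their weights are invented placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two theme -> weight dictionaries."""
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a show with no themes is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(show, profiles):
    """Return the name of the show whose themes are closest to `show`."""
    target = profiles[show]
    candidates = (name for name in profiles if name != show)
    return max(candidates, key=lambda name: cosine_similarity(target, profiles[name]))


profiles = {
    "Homeland": {"terrorism": 0.6, "psychology": 0.2, "espionage": 0.1, "romance": 0.1},
    "Show A": {"espionage": 0.5, "terrorism": 0.4, "romance": 0.1},  # hypothetical
    "Show B": {"romance": 0.7, "comedy": 0.3},                       # hypothetical
}

print(most_similar("Homeland", profiles))  # prints "Show A"
```

Cosine similarity compares the direction of the theme vectors rather than their length, so it still works if the weights of a show do not sum exactly to 1.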

Project Steps

1. Crawling : Getting the data
2. Preprocessing : Formatting the data
3. Processing : Analyzing the data (words & topics classification)
4. Recommender System
5. Results presentation in a web interface [Optional]
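
As an illustration of the preprocessing step, here is a minimal sketch that turns a subtitle file into a list of lowercase word tokens. It assumes the subtitles are in the common SubRip (.srt) format, which the proposal does not specify; the function name `srt_to_tokens` and the sample text are made up for the example.

```python
import re

# An SRT timing line looks like "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# Formatting tags such as <i>...</i> sometimes appear in the dialogue.
TAG = re.compile(r"<[^>]+>")


def srt_to_tokens(srt_text):
    """Drop cue numbers, timing lines and tags; return lowercase word tokens."""
    tokens = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Note: a dialogue line made only of digits is also dropped here.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        line = TAG.sub("", line)
        tokens.extend(re.findall(r"[a-z']+", line.lower()))
    return tokens


sample = """1
00:00:01,000 --> 00:00:03,500
<i>Previously on the show...</i>

2
00:00:04,000 --> 00:00:06,000
We need to talk.
"""

print(srt_to_tokens(sample))
# ['previously', 'on', 'the', 'show', 'we', 'need', 'to', 'talk']
```

A real pipeline would probably also remove stop words and handle other subtitle formats and encodings; those choices are left open here.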

Resources

We will not need much storage space. We will try to collect as many TV-shows as we can, but since a subtitle file is around 50 kB, an upper bound on the storage space we need is 100 GB (10,000 TV-shows, each with 7 seasons of 25 episodes).
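
The bound can be checked with a quick calculation. The figures are the ones given above; the only assumption is the decimal convention of 1 GB = 1,000,000 kB.

```python
SUBTITLE_SIZE_KB = 50        # approximate size of one subtitle file
SHOWS = 10_000
SEASONS_PER_SHOW = 7
EPISODES_PER_SEASON = 25

episodes = SHOWS * SEASONS_PER_SHOW * EPISODES_PER_SEASON
total_gb = episodes * SUBTITLE_SIZE_KB / 1_000_000  # kB to GB (decimal)

print(episodes)   # 1750000
print(total_gb)   # 87.5
```

The estimate of about 87.5 GB sits below the rounded bound of 100 GB.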

We will certainly need Hadoop or Spark to parallelize the computation.

Risks & Difficulties

In step 3 (Processing): this part involves a lot of testing and manual work to assign a label to each cluster. This could quickly become time-consuming, and a special effort will be needed to obtain relevant cluster tags.

Team composition & Task distribution

The team has 8 members:

Claire Musso : 1.1, 1.2
Florian Simond : 2.2, 2.3, 2.4
Grigory Rozhdestvenskiy : 3.1, 3.2, 3.3
Khalil Hajji : 3.1, 3.2, 3.3
Nassim Drissi El Kamili : 1.1, 1.2
Nils Bouchardon : 1.1, 1.2
Simon-Pierre Génot : 3.1, 3.2, 3.3
Raphaël von Aarburg (leader) : 2.1, 2.2, 3.3

Timeline

By 24.03
- Final proposal
From 24.03 to 07.04
- Dataset acquisition : 1.1, 1.2 & 1.3
- Preprocessing done : 2.1, 2.2, 2.3, 2.4 & 2.5
- Processing started : 3.1, 3.2 & 3.3
From 08.04 to 21.04
- Processing done : 3.4 & 3.5
- Recommender System started : 4.1, 4.2 & 4.3
From 22.04 to 05.05
- Recommender System done : 4.4 (& 4.5 [Optional])
From 06.05 to 13.05
- Result presentation
- Web Interface done [Optional]

END OF THE PROJECT

Milestones

08.04.2014 :
- Data stored, indexed and preprocessed
- -> Ready for the processing part
22.04.2014 :
- LDA algorithm implemented and tested
- Theme/topic selection made
- Words and TV-shows given a score for each theme
- -> Ready to implement the recommender system
06.05.2014 :
- Recommender system algorithm implemented and tested
- For each TV-show, the list of similarities with the other TV-shows is obtained
13.05.2014 :
- Final Presentation

Additional Resources
