TV-shows analysis & recommender system

 


 

 

Motivations

Because the number of TV shows keeps growing, it is difficult to classify them properly. The genre alone gives little information, and reading a synopsis often leads to spoilers. In addition, the catalogue is now so large that many shows struggle to get exposure, because “blockbusters” and heavily hyped programs capture most of it (the long tail phenomenon). As a result, many people never consider shows that could be a good and enjoyable match for them.

 

Goals

The goal of the project is to analyze the content of TV shows in terms of themes (i.e., tags or topics) through subtitle analysis. To achieve that, we will acquire a large, good-quality data set of subtitles and then use machine learning and data analysis techniques to identify the themes present in each show.

For example, for the show "Homeland", the resulting tag scores could be:

60% terrorism, 20% psychology, 10% espionage and 10% romance.

As a final result, we want two things:

  1. For each TV show, a detailed information page that lists its themes and the weight (frequency) of each theme in the show.
  2. A content-based recommender system for TV shows: given one TV show, the system proposes the most similar TV show, based on the similarity between their themes.
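As a rough illustration of the second goal, the sketch below compares shows by the cosine similarity of their theme weights. The choice of cosine similarity is an assumption, not a decision made in this proposal, and the function names are hypothetical. The "Homeland" weights come from the example above; "Show A" and "Show B" are invented placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two theme-weight dicts (theme -> weight)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a show with no themes is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(title, profiles):
    """Return (other_title, score) for the show closest to `title`."""
    target = profiles[title]
    candidates = [
        (other, cosine_similarity(target, themes))
        for other, themes in profiles.items()
        if other != title
    ]
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical theme profiles; only the "Homeland" weights come from the text.
profiles = {
    "Homeland": {"terrorism": 0.6, "psychology": 0.2, "espionage": 0.1, "romance": 0.1},
    "Show A": {"espionage": 0.5, "terrorism": 0.4, "romance": 0.1},
    "Show B": {"romance": 0.7, "comedy": 0.3},
}

best_title, best_score = most_similar("Homeland", profiles)
print(best_title, round(best_score, 3))
```

With these placeholder profiles, "Show A" is ranked closest to "Homeland" because the two share most of their weight on terrorism and espionage.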

 

Project Steps

  1. Crawling : Getting the data
  2. Preprocessing : Formatting the data
  3. Processing : Analyzing the data (words & topics classification)
  4. Recommender System
  5. Results presentation in a web interface [Optional]
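To make the preprocessing step more concrete, here is a minimal sketch that turns one subtitle file into a list of lowercase word tokens. It assumes the subtitles are in the common SubRip (.srt) layout (cue number, timestamp line, text lines), which this proposal does not specify, and the function name `srt_to_tokens` is hypothetical.

```python
import re

# Assumed SubRip timestamp line, e.g. "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Simple formatting tags such as <i>...</i>.
TAG = re.compile(r"<[^>]+>")


def srt_to_tokens(srt_text):
    """Strip cue numbers, timestamps and tags; return lowercase word tokens."""
    tokens = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timestamp lines.
        # Limitation: a dialogue line made only of digits is skipped too.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        line = TAG.sub("", line)
        tokens.extend(re.findall(r"[a-z']+", line.lower()))
    return tokens


sample = """1
00:00:01,000 --> 00:00:03,500
<i>The agency</i> is watching.

2
00:00:04,000 --> 00:00:06,000
We need more time!
"""

print(srt_to_tokens(sample))
```

The resulting token lists would then feed the word and topic analysis of step 3; stop-word removal and stemming are left out of this sketch.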

 

Resources

We will not need much storage space. We will try to collect as many TV shows as we can, but since a subtitle file is around 50 kB, an upper bound on the storage space we need is 100 GB (10,000 TV shows, each with 7 seasons of 25 episodes, is 1.75 million files, or about 87.5 GB).
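The storage estimate can be checked with a few lines of arithmetic. The sketch below uses the figures from the paragraph above and assumes decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes).

```python
# Figures taken from the estimate above.
SHOWS = 10_000
SEASONS_PER_SHOW = 7
EPISODES_PER_SEASON = 25
BYTES_PER_SUBTITLE = 50 * 1000  # about 50 kB per file, decimal units assumed

episodes = SHOWS * SEASONS_PER_SHOW * EPISODES_PER_SEASON
total_bytes = episodes * BYTES_PER_SUBTITLE
total_gb = total_bytes / 1e9

print(f"{episodes:,} episodes, about {total_gb:.1f} GB")
# prints "1,750,000 episodes, about 87.5 GB"
```

The result stays under the 100 GB bound, which leaves some margin for metadata and intermediate files.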

We will certainly need Hadoop or Spark to parallelize the computation.

 

Risks & Difficulties

 

Team composition & Task distribution

The team has 8 members:

 

Timeline

END OF THE PROJECT

 

Milestones

 

Additional Resources