Processing

Once our data is stored and preprocessed, our goal is to cluster the words by theme and then classify the TV shows according to those themes.

  1. Extracting topics: description

    To extract topics, we will use a technique called Latent Dirichlet Allocation, commonly abbreviated to LDA.

    LDA is a powerful unsupervised machine learning technique that discovers topics inside a corpus of documents. It requires very few parameters and no labeled training data, which avoids the "cold-start problem" most classification techniques face.

    The idea behind the method is quite complex: in short, we assume an a priori generative model of the corpus of documents, and infer the topics by inverting it.
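
    To make this concrete, the generative process LDA assumes can be sketched in a few lines of code. The snippet below is only an illustration (the helper name and the symmetric priors alpha and beta are our own choices): it samples a toy corpus the way LDA imagines one is produced, and inference amounts to running this process "in reverse".

        import numpy as np

        def generate_corpus(D, V, K, doc_len, alpha=0.1, beta=0.01, seed=0):
            """Sample a toy corpus of D documents over a vocabulary of V
            words from the LDA generative model with K topics."""
            rng = np.random.default_rng(seed)
            phi = rng.dirichlet([beta] * V, size=K)    # one word distribution per topic
            docs = []
            for _ in range(D):
                theta = rng.dirichlet([alpha] * K)     # topic mixture of this document
                words = []
                for _ in range(doc_len):
                    z = rng.choice(K, p=theta)             # draw a topic for this token
                    words.append(rng.choice(V, p=phi[z]))  # draw a word from that topic
                docs.append(words)
            return docs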

    For more information, see the original paper behind LDA by David Blei et al. (link here), as well as a series of two video lectures he gave at Cambridge University (link here).

    It takes as input the documents that compose the corpus and the number of topics we want to extract, and returns a bag-of-words representation of each topic as well as a feature vector for every document.

    In our case, the documents would be the full subtitles of entire TV shows, and if we wish to find K topics, the feature vector of each show would look like: x = [topic_1_score, ..., topic_K_score]

    This is ideal, because it is exactly the feature vector that will be used in our recommender system.
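
    As a concrete illustration, this is what the input/output contract looks like with an off-the-shelf single-machine library such as gensim (the toy documents below are invented, and our real corpus will be far too large for this approach, which is why we turn to MapReduce next):

        from gensim import corpora, models

        # Toy corpus: each "document" stands in for one show's preprocessed subtitles.
        texts = [["murder", "detective", "police", "crime"],
                 ["love", "wedding", "family", "love"],
                 ["spaceship", "alien", "planet", "crew"]]

        K = 2  # number of topics we want to extract
        dictionary = corpora.Dictionary(texts)
        bow_corpus = [dictionary.doc2bow(t) for t in texts]

        lda = models.LdaModel(bow_corpus, num_topics=K, id2word=dictionary)

        # One feature vector x = [topic_1_score, ..., topic_K_score] per show.
        for bow in bow_corpus:
            print(lda.get_document_topics(bow, minimum_probability=0.0))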
    There are many different algorithms that can be used to infer the topic models. We will focus on Gibbs sampling (http://en.wikipedia.org/wiki/Gibbs_sampling), mostly because, based on our research, it is the one that adapts best to a MapReduce implementation.
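
    To give an idea of what the inference loop looks like, below is a minimal single-machine sketch of collapsed Gibbs sampling for LDA (our own simplified code, not the distributed version we will build): each token's topic assignment is resampled from its full conditional given all other assignments, and the count matrices yield the topic mixtures at the end.

        import numpy as np

        def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
            """docs: list of documents, each a list of word ids in [0, V)."""
            rng = np.random.default_rng(seed)
            D = len(docs)
            ndk = np.zeros((D, K))   # topic counts per document
            nkw = np.zeros((K, V))   # word counts per topic
            nk = np.zeros(K)         # total words per topic
            z = []                   # current topic assignment of every token
            for d, doc in enumerate(docs):
                zd = rng.integers(K, size=len(doc))
                z.append(zd)
                for w, t in zip(doc, zd):
                    ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
            for _ in range(iters):
                for d, doc in enumerate(docs):
                    for i, w in enumerate(doc):
                        t = z[d][i]
                        # Remove this token's assignment from the counts...
                        ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                        # ...sample a new topic from the full conditional...
                        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                        t = rng.choice(K, p=p / p.sum())
                        # ...and put the token back with its new topic.
                        z[d][i] = t
                        ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
            # Point estimates of document-topic (theta) and topic-word (phi) mixtures.
            theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
            phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
            return theta, phi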

    Indeed, several teams have previously worked on adapting LDA to a Hadoop MapReduce infrastructure, one of which is the work by Yahoo! Research that is described here.
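
    The general decomposition used by such distributed samplers (in the spirit of "approximate distributed LDA"; the exact Yahoo! design may differ, and the function names below are our own) maps cleanly onto MapReduce: each mapper runs a local Gibbs sweep over its shard of documents against a frozen copy of the global topic-word counts, and the reducer merges the resulting count deltas before the next iteration.

        import numpy as np

        def map_local_sweep(docs, z, ndk, nkw_global, alpha, beta, rng):
            """Mapper: one Gibbs sweep over a document shard, using a local
            copy of the global topic-word counts; returns this shard's delta."""
            K, V = nkw_global.shape
            nkw = nkw_global.copy()          # stale local copy for this sweep
            nk = nkw.sum(axis=1)
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]
                    ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                    t = rng.choice(K, p=p / p.sum())
                    z[d][i] = t
                    ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
            return nkw - nkw_global          # what this shard changed

        def reduce_merge(nkw_global, shard_deltas):
            """Reducer: fold every shard's delta back into the global counts,
            producing the starting counts of the next MapReduce iteration."""
            for delta in shard_deltas:
                nkw_global = nkw_global + delta
            return nkw_global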

    This is the hardest and most crucial part of the project, and it will take up most of our work.

  2. Steps and tasks:

    1: Study and understand LDA, Gibbs sampling, and collapsed Gibbs sampling adapted to MapReduce (Grigory Rozhdestvenskiy, Khalil Hajji, Simon-Pierre Genot)
    2: Explain the method and the algorithm to the rest of the team (Grigory Rozhdestvenskiy, Khalil Hajji, Simon-Pierre Genot)
    3: Divide the work into subtasks to distribute as much work as possible (Raphaël von Aarburg, Grigory Rozhdestvenskiy, Khalil Hajji, Simon-Pierre Genot)
    4: Code the method (as many people as possible; assignment still to be decided)
    5: Test, run and evaluate (Grigory Rozhdestvenskiy, Khalil Hajji, Simon-Pierre Genot)