Preprocessing

This "pipeline" will be applied on all the subtitles files once.

From a subtitle file, we should extract the most relevant words for the show, together with their occurrence counts.

1. Parsing:

Going from a *.srt file to a list of sentences. We keep everything, including punctuation, in order to facilitate POS tagging; we only get rid of the subtitle synchronization information, subtitle effect tags and special characters.
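
A minimal sketch of what this step could look like (the function name `parse_srt`, the exact tag patterns and the character whitelist are our assumptions, to be refined):

```python
import re

def parse_srt(path):
    """Parse a .srt file into a list of cleaned subtitle lines,
    keeping punctuation for later POS tagging."""
    with open(path, encoding="utf-8", errors="replace") as f:
        blocks = f.read().replace("\r\n", "\n").split("\n\n")
    sentences = []
    for block in blocks:
        for line in block.strip().splitlines():
            # Drop the cue number and the synchronization line ("00:01:02,000 --> ...").
            if line.strip().isdigit() or "-->" in line:
                continue
            line = re.sub(r"<[^>]+>", "", line)    # HTML-style effect tags (<i>, <b>, ...)
            line = re.sub(r"\{[^}]+\}", "", line)  # ASS/SSA effect tags ({\an8}, ...)
            line = re.sub(r"[^\w\s.,;:!?'\"-]", " ", line)  # remaining special characters
            if line.strip():
                sentences.append(line.strip())
    return sentences
```

Note that this returns cleaned subtitle lines; re-assembling sentences that span several cues is left out of the sketch.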

2. Part-of-Speech tagging & cleaning:

We will use the Stanford POS tagger to assign a part-of-speech tag to each token, such as noun, verb, adjective, etc. This will allow us to remove informationless words such as determiners and pronouns. Some categories clearly need to be discarded and some kept (nouns, for example), but for others, verbs for example, the decision is not clear yet, and we will need to run our algorithm on some sample texts before deciding what to keep.
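
A sketch using NLTK's wrapper around the Stanford tagger (the model and jar paths are placeholders for a local install, and the keep-list below is provisional: nouns and adjectives kept, verbs left out until we have tested on sample texts):

```python
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

# Placeholder paths to a local install of the Stanford POS tagger.
tagger = StanfordPOSTagger(
    "models/english-bidirectional-distsim.tagger",  # hypothetical path
    "stanford-postagger.jar",                       # hypothetical path
)

# Provisional keep-list of Penn Treebank tags: nouns and adjectives are
# clearly kept; verbs (VB*) are left out until we decide on them.
KEEP = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"}

def content_words(sentence):
    """Tag one sentence and keep only the informative word classes."""
    return [word for word, tag in tagger.tag(word_tokenize(sentence))
            if tag in KEEP]
```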

3. Stemming:

We will probably need some kind of stemming to make the processing with LDA easier; we will use Porter's algorithm to do so.

We will aim to stem words in a way that fits as well as possible with what LDA needs, i.e. collapsing the inflected forms of a word into a single token.
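
NLTK ships an implementation of Porter's algorithm, so the step itself is short (`stem_all` is a hypothetical helper name):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_all(words):
    """Collapse inflected forms so LDA sees a single token per word family."""
    return [stemmer.stem(w.lower()) for w in words]

# stem_all(["murders", "murdered", "murdering"]) -> ["murder", "murder", "murder"]
```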

4. Indexing:

Once we have our words, we should count them and write them to a file alongside their occurrence counts. The file will probably be sorted, but that depends a lot on what we will need later to decide on the tags.
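
A possible sketch, assuming a tab-separated "word, count" format sorted by decreasing frequency (both the format and the sorting are open choices):

```python
from collections import Counter

def write_index(words, path):
    """Count the words and write one "word<TAB>count" line each,
    most frequent first."""
    counts = Counter(words)
    with open(path, "w", encoding="utf-8") as f:
        for word, count in counts.most_common():
            f.write(f"{word}\t{count}\n")
```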

5. Extra metrics:

While performing this task, we will also gather extra metrics that may be useful later, for example the word rate of a show (#words / length).
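
As an illustration of how such a metric could be computed (we assume here that "length" means the running time of the episode, approximated by the end time of the last cue; `words_per_minute` is a hypothetical helper):

```python
import re

def words_per_minute(srt_path, word_count):
    """Approximate word rate: #words divided by the running time in minutes,
    where the running time is taken from the end time of the last cue."""
    with open(srt_path, encoding="utf-8", errors="replace") as f:
        times = re.findall(r"--> (\d+):(\d+):(\d+)", f.read())
    h, m, s = (int(x) for x in times[-1])
    return word_count / (h * 60 + m + s / 60)
```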