Crawling

First, we need to collect the data (the subtitles) and organize it into a well-defined hierarchy so that it is easy to work with. To achieve this, we proceed in the following steps:

 

1. Crawling (Done):

We decided to download all subtitles from tvsubtitles.net because the site does not limit the number of downloads. To do this, we use a Python script built on web-scraping libraries (Beautiful Soup, Mechanize, Scrapy).
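The link-collection part of such a crawler could look roughly like the sketch below. It is not the project's actual script: to stay dependency-free it uses only the Python standard library instead of Beautiful Soup, Mechanize or Scrapy, and the `download-<id>.html` link pattern and the sample page fragment are assumptions made for illustration, not a verified description of the site's markup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the target of every <a> tag whose href matches a pattern."""

    def __init__(self, base_url, pattern):
        super().__init__()
        self.base_url = base_url
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, pattern):
    """Return the absolute URLs of all links in `html` matching `pattern`."""
    parser = LinkCollector(base_url, pattern)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment; the real pages may be structured differently.
SAMPLE = """
<a href="/about.html">About</a>
<a href="download-101.html">English</a>
<a href="download-102.html">French</a>
"""

# The link pattern is an assumption, passed in so it is easy to change.
links = extract_links(SAMPLE, "http://www.tvsubtitles.net/", r"download-\d+\.html$")
print(links)
```

In a full crawler, each collected URL would then be fetched and saved to disk. Keeping the pattern as a parameter means the same function can be reused for show pages, season pages and download pages.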

 

2. Cleaning the data:

 

3. Getting IMDB ratings (Done):

In our subtitle analysis, an interesting parameter is the IMDB score given to a TV show, along with the number of people who rated it. We obtain these ratings with Python scripts that query OMDB, an API for IMDB.
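A rating lookup of this kind might be sketched as follows. This is an illustration rather than the project's script: the helper names (`build_query`, `parse_rating`, `fetch_rating`) are hypothetical, the sample payload is made up, and the response fields used (`imdbRating`, `imdbVotes`, `Response`) reflect our understanding of the OMDB JSON format and should be checked against its documentation. OMDB also requires an API key, which is passed in as a parameter here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_ENDPOINT = "http://www.omdbapi.com/"


def build_query(title, api_key):
    """Build the lookup URL for a TV show, searched by title."""
    params = {"t": title, "type": "series", "apikey": api_key}
    return OMDB_ENDPOINT + "?" + urlencode(params)


def parse_rating(payload):
    """Return (rating, votes) from a JSON response, or None if unavailable."""
    data = json.loads(payload)
    if data.get("Response") != "True":
        return None  # show not found, or the request was rejected
    try:
        rating = float(data["imdbRating"])
        # Vote counts are assumed to use thousands separators, e.g. "12,345".
        votes = int(data["imdbVotes"].replace(",", ""))
    except (KeyError, ValueError):
        return None  # field missing or not numeric (e.g. "N/A")
    return rating, votes


def fetch_rating(title, api_key):
    """Query the API over the network and parse the answer."""
    with urlopen(build_query(title, api_key)) as response:
        return parse_rating(response.read().decode("utf-8"))


# Made-up payload, used so the parsing step can be tried offline.
SAMPLE = '{"Title": "Example Show", "imdbRating": "8.1", "imdbVotes": "12,345", "Response": "True"}'
result = parse_rating(SAMPLE)
print(result)
```

Separating parsing from fetching keeps the network call out of the logic that needs testing, and returning `None` for missing ratings lets the analysis skip shows without a score instead of failing.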