Crawling
First, we need to obtain the data (the subtitles) and organize it into a workable form with a precise folder hierarchy. To achieve this, we proceed in three steps:
1. Crawling (Done):
We decided to download all subtitles from tvsubtitles.net because the site places no limit on the number of downloads. To do this, we use Python scripts with web libraries (Beautiful Soup, Mechanize, Scrapy).
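As an illustration of the link-collection part of a crawl, here is a minimal sketch. It uses the standard library's `html.parser` instead of the libraries named above, so that it runs without extra dependencies. The sample markup, the `download-` link prefix, and the `example.org` base URL are assumptions made for the example and do not describe the real pages of tvsubtitles.net.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose target starts with a prefix."""

    def __init__(self, base_url, href_prefix):
        super().__init__()
        self.base_url = base_url
        self.href_prefix = href_prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.href_prefix):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, href_prefix):
    """Return the matching links of one page, in document order."""
    parser = LinkCollector(base_url, href_prefix)
    parser.feed(html)
    return parser.links


# Made-up page fragment (assumption): the real markup of the site may differ.
SAMPLE_HTML = """
<a href="download-101.html">Episode 1</a>
<a href="about.html">About</a>
<a href="download-102.html">Episode 2</a>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "http://example.org/", "download-"):
        print(link)
```

In a real crawl, the HTML would come from an HTTP request for each show and season page, and each collected link would then be downloaded to disk.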
2. Cleaning the data:
- The downloaded data does not follow the required hierarchy (one folder per TV show, containing one folder per season), so we wrote a script to reorganize it. (Done)
- Some TV shows contain duplicate subtitles for the same episode; we remove them with a Python script. (Done)
- We are also considering removing TV shows that have only a few episodes (not decided yet).
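The first two cleaning steps can be sketched as a single planning function. This is only an illustration: the file naming convention (`<show> - <season>x<episode>...`), the `season_NN` folder names, and the rule of keeping the first file in alphabetical order are assumptions, not a description of the actual scripts.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention: "<show> - <season>x<episode>...",
# for example "Lost - 1x02.en.srt".
NAME_PATTERN = re.compile(r"^(?P<show>.+?) - (?P<season>\d+)x(?P<episode>\d+)")


def plan_layout(filenames):
    """Return (moves, duplicates) for a flat list of subtitle file names.

    moves maps each kept file name to its target path
    <show>/season_<NN>/<file name>; duplicates lists the files whose
    (show, season, episode) was already covered by a kept file.
    Files that do not match the pattern appear in neither result.
    """
    moves, duplicates, seen = {}, [], set()
    for name in sorted(filenames):
        match = NAME_PATTERN.match(name)
        if match is None:
            continue
        key = (match["show"], int(match["season"]), int(match["episode"]))
        if key in seen:
            duplicates.append(name)
            continue
        seen.add(key)
        season_dir = f"season_{key[1]:02d}"
        moves[name] = str(PurePosixPath(key[0]) / season_dir / name)
    return moves, duplicates


if __name__ == "__main__":
    files = ["Lost - 1x02.en.srt", "Lost - 1x02.HDTV.en.srt", "Lost - 2x01.en.srt"]
    moves, duplicates = plan_layout(files)
    for source, target in moves.items():
        print(source, "->", target)
    print("duplicates:", duplicates)
```

Separating the plan from the file operations makes it possible to review what would be moved or deleted before touching the downloaded data.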
3. Getting IMDB ratings (Done):
In our subtitle analysis, an interesting parameter is the IMDB score of a TV show, along with the number of people who rated it. We obtain these ratings with Python scripts that query OMDB, an API for IMDB data.
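A rating lookup of this kind might look like the sketch below. It assumes the public OMDB endpoint at `http://www.omdbapi.com/`, the query parameters `t`, `type` and `apikey`, and the response fields `imdbRating`, `imdbVotes` and `Response`; these should be checked against the current OMDB documentation. The API key is a placeholder that must be supplied by the caller.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "http://www.omdbapi.com/"  # assumed endpoint


def build_query(title, api_key):
    """Build the lookup URL for a TV show title."""
    return OMDB_URL + "?" + urlencode({"t": title, "type": "series", "apikey": api_key})


def parse_rating(payload):
    """Return (rating, votes) from a JSON response, or None if unavailable."""
    data = json.loads(payload)
    if data.get("Response") != "True":
        return None
    rating, votes = data.get("imdbRating"), data.get("imdbVotes")
    if rating in (None, "N/A") or votes in (None, "N/A"):
        return None
    # Vote counts are assumed to use thousands separators, e.g. "12,345".
    return float(rating), int(votes.replace(",", ""))


def fetch_rating(title, api_key):
    """Query the API over the network and parse the answer."""
    with urlopen(build_query(title, api_key), timeout=10) as response:
        return parse_rating(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline demonstration with a made-up response body.
    sample = '{"Title": "Example Show", "imdbRating": "8.3", "imdbVotes": "12,345", "Response": "True"}'
    print(parse_rating(sample))
```

Keeping the parsing separate from the network call lets the conversion of ratings and vote counts be checked without querying the API.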