Crawling

First, we need to obtain the subtitles and organize them into a precise folder hierarchy so that they are workable. To achieve this, we proceed in three steps:

1. Crawling (Done):

We decided to download all subtitles from tvsubtitles.net because it places no limit on the number of downloads. To do this, we use a Python script with web libraries (Beautiful Soup, Mechanize, Scrapy).
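The link-collection part of such a crawler can be sketched as follows. This is only a minimal illustration, not the script we actually ran: it uses Python's standard `html.parser` instead of Beautiful Soup so that it runs without extra dependencies, and the `subtitle-` link prefix, the `example.org` URL and the names `LinkCollector` and `extract_links` are assumptions made up for the example, not the real layout of tvsubtitles.net.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a> tags whose href starts with a given prefix."""

    def __init__(self, base_url, prefix):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix  # hypothetical naming scheme of subtitle pages
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore anchors without href and links that do not match the prefix.
        if href and href.lstrip("/").startswith(self.prefix):
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, prefix):
    """Return the matching absolute URLs, without repeats, in page order."""
    parser = LinkCollector(base_url, prefix)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


if __name__ == "__main__":
    page = '<a href="subtitle-1.html">ep 1</a> <a href="/about.html">about</a>'
    print(extract_links(page, "http://example.org/show/", "subtitle-"))
```

A real crawl would fetch each page (for instance with `urllib.request.urlopen`), pass its HTML to `extract_links`, and then download every returned URL, ideally with a short pause between requests.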

2. Cleaning the data:

The downloaded data does not follow the required hierarchy (one folder per TV show, containing one folder per season), so we wrote a script to reorganize it. (Done)
Some TV shows contain duplicate subtitles for the same episode; we remove them with a Python script. (Done)
We are also considering removing TV shows that have only very few episodes (not decided yet).
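The two cleaning steps above (building the show/season hierarchy and dropping duplicate subtitles) can be sketched as below. The file-name pattern `<show> - <season>x<episode> ...` is an assumption chosen for illustration, as are the `Season NN` folder name and the function names; the real files may be named differently, and the rule of keeping the alphabetically first duplicate is an arbitrary choice.

```python
import re
from pathlib import PurePosixPath

# Hypothetical file-name convention, e.g. "Lost - 1x02 - Pilot.en.srt".
EPISODE_RE = re.compile(r"^(?P<show>.+?) - (?P<season>\d+)x(?P<episode>\d+)")


def target_path(filename):
    """Return show/season/filename for a subtitle file, or None if unparsable."""
    match = EPISODE_RE.match(filename)
    if match is None:
        return None
    season = int(match.group("season"))
    show = match.group("show").strip()
    return PurePosixPath(show, f"Season {season:02d}", filename)


def drop_duplicates(filenames):
    """Keep one file per (show, season, episode); unparsable names are kept."""
    seen = set()
    kept = []
    for name in sorted(filenames):  # sorted, so the choice is deterministic
        match = EPISODE_RE.match(name)
        if match is None:
            kept.append(name)
            continue
        key = (
            match.group("show").strip().lower(),
            int(match.group("season")),
            int(match.group("episode")),
        )
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


if __name__ == "__main__":
    files = ["Lost - 1x02 - B.srt", "Lost - 1x02 - A.srt", "Lost - 1x03 - C.srt"]
    for name in drop_duplicates(files):
        print(target_path(name))
```

Moving the files is then a matter of creating each target folder and calling `shutil.move` on the kept files; the sketch only computes the paths so that it can be checked without touching the disk.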

3. Getting IMDb ratings (Done):

In our subtitle analysis, an interesting parameter is the IMDb score given to a TV show, along with the number of people who rated it. We obtained these ratings with a Python script that queries OMDb, an API for IMDb data.
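A lookup of this kind can be sketched as follows. It is a minimal example under stated assumptions rather than our actual script: it assumes the OMDb endpoint `http://www.omdbapi.com/` with the query parameters `t`, `type` and `apikey`, and a JSON answer carrying the fields `Response`, `imdbRating` and `imdbVotes`. These should be checked against the current OMDb documentation, and `"KEY"` is a placeholder for a personal API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "http://www.omdbapi.com/"  # assumed endpoint


def build_query(title, api_key):
    """Build the lookup URL for a TV show title."""
    params = {"t": title, "type": "series", "apikey": api_key}
    return OMDB_URL + "?" + urlencode(params)


def parse_rating(payload):
    """Return (rating, votes) from a JSON answer, or None if unavailable."""
    data = json.loads(payload)
    if data.get("Response") != "True":
        return None
    rating, votes = data.get("imdbRating"), data.get("imdbVotes")
    if rating in (None, "N/A") or votes in (None, "N/A"):
        return None
    # Vote counts are assumed to be formatted with commas, e.g. "12,345".
    return float(rating), int(votes.replace(",", ""))


def fetch_rating(title, api_key):
    """Query the API over the network and parse the answer."""
    with urlopen(build_query(title, api_key), timeout=10) as response:
        return parse_rating(response.read().decode("utf-8"))


if __name__ == "__main__":
    sample = '{"Response": "True", "imdbRating": "8.3", "imdbVotes": "12,345"}'
    print(parse_rating(sample))
```

Keeping `parse_rating` separate from the network call makes it possible to test the parsing on saved answers and to handle shows the API does not know (the function then returns `None`).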
