Git description
ngrams
- Contains the Java code used to extract the n-grams from the articles.
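
To illustrate what n-gram extraction means here, the following is a minimal standalone sketch and not the repository's code. The class name `NgramSketch`, the method `ngrams`, and the whitespace tokenization are all assumptions made for this example; the actual extractor may tokenize and normalize the articles differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the class used in the ngrams directory.
final class NgramSketch {

    // Returns every run of n consecutive tokens, joined by single spaces.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (n <= 0 || trimmed.isEmpty()) {
            return result;
        }
        // Assumption: tokens are separated by whitespace.
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [the quick, quick brown, brown fox]
        System.out.println(ngrams("the quick brown fox", 2));
    }
}
```

A text with fewer than n tokens yields an empty list in this sketch.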
stats
- punctuation_stats
- Contains the Spark code that computes the punctuation statistics.
- sentences_length
- Contains the Java (Hadoop) code that computes the average sentence length per year.
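
The per-year average can be sketched in plain Java as below. This is not the Hadoop job itself: it only mirrors the aggregation such a job would plausibly perform (count words and sentences per article, then sum per year and divide). The class and method names are hypothetical, and two further assumptions are made: a sentence ends at `.`, `!` or `?`, and length is measured in whitespace-separated words.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the Hadoop code in sentences_length.
final class SentenceLengthSketch {

    // Returns {wordCount, sentenceCount} for one article.
    static long[] countWordsAndSentences(String article) {
        long words = 0;
        long sentences = 0;
        // Assumption: '.', '!' and '?' end a sentence.
        for (String sentence : article.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            sentences++;
            words += trimmed.split("\\s+").length;
        }
        return new long[] {words, sentences};
    }

    // Sums the counts of all articles of a year, then divides words by sentences.
    static Map<Integer, Double> averageByYear(Map<Integer, List<String>> articlesByYear) {
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> entry : articlesByYear.entrySet()) {
            long words = 0;
            long sentences = 0;
            for (String article : entry.getValue()) {
                long[] counts = countWordsAndSentences(article);
                words += counts[0];
                sentences += counts[1];
            }
            if (sentences > 0) {
                averages.put(entry.getKey(), (double) words / sentences);
            }
        }
        return averages;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> articles = new TreeMap<>();
        articles.put(1900, List.of("One two three. Four five."));
        articles.put(1950, List.of("A b c d.", "E f."));
        // Prints: {1900=2.5, 1950=3.0}
        System.out.println(averageByYear(articles));
    }
}
```

Summing words and sentences before dividing keeps long articles from being weighted the same as short ones; whether the original job averages this way is not stated in this description.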
metric/src/ch/epfl/bigdata
- date_article/punct-sentences-metric: contains the Spark code that computes, for a set of articles from a single year, the distance to each year. The metric uses the average sentence length and the punctuation statistics.
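
As a rough picture of a "distance to each year", the sketch below compares one feature vector (for the article set) against one reference vector per year. Everything specific in it is an assumption: the description above does not say which distance is used, so Euclidean distance is swapped in for illustration, and the class name, method names, and feature layout are hypothetical. It is plain Java, not the Spark implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the real metric lives in the Spark code.
final class YearDistanceSketch {

    // Euclidean distance between two feature vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Distance from the sample's features to the reference features of every year.
    static Map<Integer, Double> distanceToEachYear(double[] sample, Map<Integer, double[]> yearProfiles) {
        Map<Integer, Double> distances = new TreeMap<>();
        for (Map.Entry<Integer, double[]> entry : yearProfiles.entrySet()) {
            distances.put(entry.getKey(), euclidean(sample, entry.getValue()));
        }
        return distances;
    }

    public static void main(String[] args) {
        // Made-up toy vectors, e.g. {average sentence length, one punctuation statistic}.
        Map<Integer, double[]> profiles = new TreeMap<>();
        profiles.put(1900, new double[] {0.0, 0.0});
        profiles.put(1950, new double[] {3.0, 4.0});
        // Prints: {1900=5.0, 1950=0.0}
        System.out.println(distanceToEachYear(new double[] {3.0, 4.0}, profiles));
    }
}
```

With features on very different scales (sentence lengths in words versus punctuation frequencies), some normalization would likely be needed before a distance like this is meaningful.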