HDFS Files

/projects/linguistic-shift/articles_samples

/*_articles: one directory per subset of articles; the * prefix is the number of articles in the subset
- /Corrected or /WithoutCorrection
  - /* : named after the year of the articles. It contains all the words of the articles with their number of occurrences. Lines have the format: word \tab occurrences
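As an illustration of the "word \tab occurrences" format, here is a minimal Python sketch that loads such lines into a counter. The function name `parse_word_counts` and the sample lines are hypothetical; reading the lines out of HDFS is left to whatever client is in use.

```python
from collections import Counter


def parse_word_counts(lines):
    """Build a Counter from lines of the form 'word<TAB>occurrences'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab: the count is always the final field.
        word, _, occurrences = line.rpartition("\t")
        counts[word] += int(occurrences)
    return counts


# Made-up sample lines, for illustration only.
sample = ["maison\t12\n", "ville\t7\n", "maison\t3\n"]
counts = parse_word_counts(sample)
```

Summing into a `Counter` also makes it easy to merge several years by adding the counters together.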

/projects/linguistic-shift/distances

/chi-square: directory containing results of the chi-square distance. Lines have the format
year1:year2 \tab distance
/cosine: directory containing results of the cosine-based distance. Lines have the format
year1:year2 \tab distance
/distance1: directory containing results of the simple first metric. Lines have the format
year1:year2 \tab distance
/kullback-leibler :
- /tfidf:
  - /Corrected : directory containing results of KL on the corrected corpus with TFIDF. Lines have the format year1,year2,distance
  - /WithoutCorrection: directory containing results of KL on the non-corrected corpus with TFIDF. Lines have the format year1,year2,distance
- /probability_of_a_word_per_year :
  - /Corrected : directory containing results of KL on the corrected corpus with probability of a word in a year. Lines have the format (year1,year2,distance)
  - /WithoutCorrection : directory containing results of KL on the non-corrected corpus with probability of a word in a year. Lines have the format (year1,year2,distance)
- /article: dating of a subset of articles (using the probability of words per year)
  - /Corrected or /WithoutCorrection :
    - /*_articles : the * prefix is the number of articles considered.
      - /*-ngrams/year : its subfolders contain the distance between a subset of articles from one year and all the other years. Lines have the format (year1,year2,distance). In these lines, year 1839 denotes the subset of articles.

        /best_estimation : contains the best matches for the subset of articles. Same format: year1,year2,distance. Year1 is simulated by year2.
/punctuation-metric : directory containing results for the metric based on punctuation statistics and sentence length. Contains the file result.csv in the format: real_year_of_articles,year,distance
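The distance files above use three line layouts: "year1:year2 \tab distance", "year1,year2,distance" and "(year1,year2,distance)". The sketch below parses any of them into one tuple. It assumes that years are integers and that distances are plain decimal numbers; the function names and sample rows are hypothetical. `best_match` additionally assumes that the best match is the year with the smallest distance, which this page does not state explicitly.

```python
def parse_distance_line(line):
    """Return (year1, year2, distance) from one line of a distance file.

    Accepts 'year1:year2<TAB>distance', 'year1,year2,distance'
    and '(year1,year2,distance)'.
    """
    text = line.strip()
    if "\t" in text:
        years, distance = text.split("\t")
        year1, year2 = years.split(":")
    else:
        # Drop the optional surrounding parentheses before splitting.
        year1, year2, distance = text.strip("()").split(",")
    return int(year1), int(year2), float(distance)


def best_match(rows, year1):
    """Return the year2 closest to year1, assuming smaller means closer."""
    candidates = [(d, y2) for (y1, y2, d) in rows if y1 == year1 and y2 != year1]
    if not candidates:
        return None
    return min(candidates)[1]


# Made-up sample rows, for illustration only.
rows = [
    parse_distance_line("1839:1840\t0.4\n"),
    parse_distance_line("(1839,1850,0.1)"),
    parse_distance_line("1839,1860,0.9"),
]
closest = best_match(rows, 1839)
```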

/projects/linguistic-shift/stats

/Corrected (statistics based on corrected n-grams from Tao):
- *-grams-TotOccurenceYear: directory containing a file with the total number of occurrences of all the words of the corpus for each year. Lines have the format: year \tab number-of-occurrences
- ProbabilityOfAWordOverAllYears/*-grams: The probability of each word over all years. Lines have the format: word \tab probability. Be careful with 2-grams and 3-grams: the words are also separated by commas!
- ProbabilityOfAWordPerYear/*-grams: The probability of each word in a given year. Lines have the format: word \tab probability. Be careful with 2-grams and 3-grams: the words are also separated by commas!
- TFIDF: The TFIDF value of each word in a given year. Has format : word \tab value
/WithoutCorrection (statistics based on the initial n-grams, without OCR correction and without accent removal):
- Same content as /Corrected
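Because the words of 2-grams and 3-grams are separated by commas in the probability files, a parser has to split the key as well as the line. The sketch below is one hedged way to do it: it assumes the tab separates the n-gram from its probability and that commas appear only between the words of an n-gram (a word that itself contains a comma would be split incorrectly). The function name and sample lines are hypothetical.

```python
def parse_ngram_probability(line):
    """Return (words, probability) from a line 'w1[,w2[,w3]]<TAB>probability'."""
    text = line.rstrip("\n")
    # The probability is the last tab-separated field.
    ngram, _, probability = text.rpartition("\t")
    # Assumption: commas only separate the words of the n-gram.
    words = tuple(ngram.split(","))
    return words, float(probability)


# Made-up sample lines, for illustration only.
unigram = parse_ngram_probability("maison\t0.25\n")
bigram = parse_ngram_probability("la,ville\t0.5\n")
```

Returning the words as a tuple keeps 1-grams, 2-grams and 3-grams usable as dictionary keys with the same code path.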
/WordOccurenceOverAllYears: file containing the number of times each word appears over all years. Lines have the format
word \tab number-of-occurrences
/YearCardinality: file containing the number of distinct words for each year. Lines have the format
year \tab number-of-distinct-words
/topKWordsCoverage: contains folders named following the format "k-percent", where k means that the top k% most used words were taken to compute the result in that folder. The results show how much of the full text of a year is covered by the top k% most used words
- File name in folder : result.txt
- Line format : (year,percentage)
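The coverage figure can be sketched as follows: sort the words of a year by number of occurrences, keep the top k% of distinct words, and divide their occurrences by the total. This is an illustrative reconstruction, not the code that produced result.txt; in particular, rounding the number of kept words up with `ceil` is an assumption, and the function name and sample counts are hypothetical.

```python
import math


def top_k_coverage(counts, k_percent):
    """Percentage of all occurrences covered by the top k% most used words."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    # Assumption: the number of words kept is rounded up.
    top_n = math.ceil(len(ordered) * k_percent / 100)
    return 100.0 * sum(ordered[:top_n]) / total


# Made-up word counts for one year, for illustration only.
year_counts = {"a": 6, "b": 2, "c": 1, "d": 1}
coverage_25 = top_k_coverage(year_counts, 25)
coverage_50 = top_k_coverage(year_counts, 50)
```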
/sentencesLength : contains the average sentence length by year.
- File name in folder : means.csv
- Line format : year,average
/punctuationStats : contains the average number of commas, semicolons and colons per sentence, by year
- File name in folder : stats.csv
- Line format : year,average_comas,average_semicolons,average_colons

/projects/linguistic-shift/corrected_ngrams

/n-grams: directory containing the result of OCR correction based on n-grams. Here n can be 1, 2 or 3.
- Lines have the format: word, frequency.

/projects/linguistic-shift/synonyms

/1-grams: file containing all words that appear in the dictionary of synonyms.
- Lines have the format: id, year, word, frequency.

/projects/linguistic-shift/articles_separated

Contains the articles separated by year: the entry for year x holds all the articles of year x, with articles separated by a newline ('\n'). Each article begins with <full_text> and ends with </full_text>
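A minimal Python sketch for pulling the individual articles out of the text of one year, relying only on the <full_text> and </full_text> delimiters described above. The function name and the sample text are hypothetical, and the sketch assumes the content of a year fits in memory; a very large year would call for a streaming approach instead.

```python
import re

# Non-greedy match so that each article is captured separately;
# DOTALL in case an article body itself spans several lines.
ARTICLE_RE = re.compile(r"<full_text>(.*?)</full_text>", re.DOTALL)


def extract_articles(text):
    """Return the list of article bodies found in the text of one year."""
    return [body.strip() for body in ARTICLE_RE.findall(text)]


# Made-up sample content, for illustration only.
sample_year = (
    "<full_text>Premier article.</full_text>\n"
    "<full_text>Second article.</full_text>\n"
)
articles = extract_articles(sample_year)
```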

/projects/linguistic-shift/ngrams

Contains one subfolder for each n (1, 2, 3, 4, 5). In each subfolder there is one file per year.
- File format : word <tab> count
- Files are sorted in alphabetical order.
