HDFS Files
/projects/linguistic-shift/articles_samples
- /*_articles: subset of articles, where * is the number of articles in the subset
- /Corrected or /WithoutCorrection
- /* : named after the year of the articles. It contains every word of the articles with its number of occurrences. Lines have the format: word \tab occurrences
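The per-year word counts can be read back with a few lines of Python. This is only a minimal sketch: it assumes the file content has already been fetched from HDFS as text lines, and the function name `parse_word_counts` is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def parse_word_counts(lines: Iterable[str]) -> Counter:
    """Parse 'word<TAB>occurrences' lines into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, _, occurrences = line.rpartition("\t")
        counts[word.strip()] += int(occurrences)
    return counts
```

For instance, `parse_word_counts(["le\t10\n", "de\t7\n"])` gives a counter with `le` mapped to 10 and `de` mapped to 7.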
/projects/linguistic-shift/distances
- /chi-square: directory containing results of the chi-square distance. Lines have the format year1:year2 \tab distance
- /cosine: directory containing results of the cosine-based distance. Lines have the format year1:year2 \tab distance
- /distance1: directory containing results of the simple first metric. Lines have the format year1:year2 \tab distance
- /kullback-leibler :
- /tfidf:
- /Corrected : directory containing results of KL on the corrected corpus with TFIDF. Lines have the format year1,year2,distance
- /WithoutCorrection: directory containing results of KL on the non-corrected corpus with TFIDF. Lines have the format year1,year2,distance
- /probability_of_a_word_per_year :
- /Corrected : directory containing results of KL on the corrected corpus with probability of a word in a year. Lines have the format (year1,year2,distance)
- /WithoutCorrection : directory containing results of KL on the non-corrected corpus with probability of a word in a year. Lines have the format (year1,year2,distance)
- /article: results of dating a subset of articles (using the probability of words per year)
- /Corrected or /WithoutCorrection :
- /*_articles : number of articles considered.
- /*-ngrams/year : Contains, in subfolders, the distances between a subset of articles from one year and all the other years. Lines have the format (year1,year2,distance). Year 1839 denotes the subset of articles.
- /best_estimation : contains values of the best matches for the subset of articles. Same format : year1,year2,distance. Year1 is simulated by year2.
- /punctuation-metric : directory containing results for the metric based on punctuation statistics and sentence length. Contains the file result.csv in the format: real_year_of_articles,year,distance
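The distance directories above use three slightly different line layouts (year1:year2 \tab distance, year1,year2,distance, and the same wrapped in parentheses). A small parser that accepts all three could look like the following sketch; the name `parse_distance_line` is hypothetical, and it assumes years are integers and distances are plain decimal or scientific-notation numbers.

```python
import re
from typing import Tuple

# Field separators seen in the listed formats: colon, comma, tab.
_SEPARATORS = re.compile(r"[:,\t]")


def parse_distance_line(line: str) -> Tuple[int, int, float]:
    """Return (year1, year2, distance) from one result line."""
    text = line.strip()
    # Some directories wrap the triple in parentheses.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    year1, year2, distance = _SEPARATORS.split(text)
    return int(year1), int(year2), float(distance)
```

A line with a different number of fields raises `ValueError`, which makes malformed lines easy to spot.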
/projects/linguistic-shift/stats
- /Corrected (statistics based on corrected n-grams from Tao):
- *-grams-TotOccurenceYear: directory containing a file with the total number of occurrences of all the words in the corpus for each year. Lines have the format: year \tab number-of-occurrences
- ProbabilityOfAWordOverAllYears/*-grams: The probability of each word over all years. Lines have the format: word \tab probability. Be careful with 2-grams and 3-grams: the words are also separated by commas!
- ProbabilityOfAWordPerYear/*-grams: The probability of each word in a given year. Lines have the format: word \tab probability. Be careful with 2-grams and 3-grams: the words are also separated by commas!
- TFIDF: The TFIDF value of each word in a given year. Has format : word \tab value
- /WithoutCorrection (statistics based on the initial n-grams, without OCR correction and without accent removal):
- Same Content as /Corrected
- /WordOccurenceOverAllYears: file containing the number of times each word appears over all years. Lines have the format word \tab number-of-occurrences
- /YearCardinality: file containing the number of distinct words for each year. Lines have the format year \tab number-of-distinct-words
- /topKWordsCoverage: contains folders named in the format "k-percent", where k means that the top k% most used words were used to compute the result in the folder. These results show how much of the full text of a year is covered by the top k% most used words.
- File name in folder : result.txt
- Line format : (year,percentage)
- /sentencesLength : contains the average sentence length by year.
- File name in folder : means.csv
- Line format : year,average
- /punctuationStats : contains the average number of commas, semicolons and colons per sentence, by year
- File name in folder : stats.csv
- Line format : year,average_comas,average_semicolons,average_colons
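The warning about 2-grams and 3-grams in the ProbabilityOfAWord* entries above matters when parsing: the tab separates the n-gram from its probability, while commas separate the words inside the n-gram. A possible way to handle this is sketched below; `parse_ngram_probability` is an illustrative name, and the sketch assumes the words themselves contain no commas.

```python
from typing import Tuple


def parse_ngram_probability(line: str) -> Tuple[Tuple[str, ...], float]:
    """Split 'w1,w2,...<TAB>probability' into (words, probability)."""
    # The last tab separates the n-gram from its probability.
    ngram, _, probability = line.rstrip("\n").rpartition("\t")
    # Inside the n-gram, words are separated by commas (assumption:
    # no word contains a comma itself).
    words = tuple(part.strip() for part in ngram.split(","))
    return words, float(probability)
```

A 1-gram simply comes back as a one-element tuple, so the same function can be used for every n.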
/projects/linguistic-shift/corrected_ngrams
- /n-grams: directory containing the result of OCR correction based on n-grams. Here n can be 1, 2 or 3.
- Lines have the format: word, frequency.
/projects/linguistic-shift/synonyms
- /1-grams: file containing all words that appear in the dictionary of synonyms.
- Lines have the format: id, year, word, frequency.
/projects/linguistic-shift/articles_separated
- Contains the articles separated by year: year x holds all the articles of year x. Articles are separated by a newline ('\n'). Each article begins with <full_text> and ends with </full_text>
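Given the <full_text> ... </full_text> delimiters, the articles of one year can be pulled out with a regular expression. The sketch below assumes the content of a year has already been read into a string and that the tags are never nested; `extract_articles` is a hypothetical helper name.

```python
import re
from typing import List

# Non-greedy match so each article stops at its own closing tag;
# DOTALL lets an article body span several lines if it ever does.
_ARTICLE = re.compile(r"<full_text>(.*?)</full_text>", re.DOTALL)


def extract_articles(text: str) -> List[str]:
    """Return the body of every <full_text>...</full_text> block."""
    return [body.strip() for body in _ARTICLE.findall(text)]
```

Matching on the tags rather than splitting on newlines keeps the result correct even if an article body happens to contain a line break.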
/projects/linguistic-shift/ngrams
- Contains one subfolder for each n (1,2,3,4,5). In each subfolder there is one file per year.
- File format : word <tab> count
- Files are in alphabetical order.