Dating articles with statistics on punctuation

This distance uses punctuation statistics and the average sentence length to estimate the year of articles. The Scala/Spark code takes a sample of articles from a given year (e.g. 15 of them) and treats them as one big article. Here is the result for 15 articles from 1925:

[Figure: distance results for 15 articles from 1925]

As we can see, this distance does not work very well. As shown in the statistics (https://wiki.epfl.ch/bigdata2015-linguistic-drift-le-temps/sentences-length and https://wiki.epfl.ch/bigdata2015-linguistic-drift-le-temps/punct-stats), there are some peaks. We think the data should be cleaned for those years, but we did not find the particular cases that create those peaks.
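The distance described above can be sketched in plain Scala (the names, the Euclidean distance, and the exact punctuation set are our assumptions for illustration; the original project code uses Spark):

```scala
// Sketch of the punctuation-statistics distance.
// Hypothetical names; a plain-Scala illustration of the idea, not the Spark code.
object PunctuationDistance {
  val punctuation = Seq('.', ',', ';', ':', '!', '?')

  // Feature vector: relative frequency of each punctuation mark,
  // followed by the average sentence length in words.
  def features(text: String): Vector[Double] = {
    val chars = text.length.toDouble.max(1.0)
    val punctFreqs = punctuation.map(p => text.count(_ == p) / chars)
    val sentences = text.split("[.!?]").map(_.trim).filter(_.nonEmpty)
    val avgLen =
      if (sentences.isEmpty) 0.0
      else sentences.map(_.split("\\s+").length).sum.toDouble / sentences.length
    (punctFreqs :+ avgLen).toVector
  }

  // Euclidean distance between two feature vectors.
  def distance(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => val d = x - y; d * d }.sum)

  // Guess the year whose aggregated sample ("one big article")
  // is closest to the article under this distance.
  def closestYear(article: String, samples: Map[Int, String]): Int =
    samples.minBy { case (_, text) =>
      distance(features(article), features(text))
    }._1
}
```

In this sketch, concatenating a year's sampled articles into one string before calling `features` plays the role of treating them as one big article.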

Using a machine learning method would be a good idea; unfortunately, we did not have the time to implement it.