Dating Articles With KL
How: take a subset of articles from a given year, treat this subset as an additional "year" (a separate file), and then compute the different metrics with this added "year".
With 15 articles taken from each of 1840, 1880, 1920, 1960 and 1995
- We took a subset of articles from each of these years and ran the different metrics without removing these articles from their source years.
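The sampling step could be sketched as follows. This is only an illustration under assumptions not stated on this page: the corpus is assumed to be a dict mapping each year to a list of tokenized articles, and the names `word_counts` and `make_pseudo_year` are hypothetical. The `remove` flag corresponds to the two variants reported below (articles kept in, or removed from, their source year).

```python
import random
from collections import Counter


def word_counts(articles):
    """Sum word counts over a list of tokenized articles."""
    counts = Counter()
    for tokens in articles:
        counts.update(tokens)
    return counts


def make_pseudo_year(corpus, year, n_articles=15, remove=False, seed=0):
    """Sample n_articles from `year` and treat them as an extra "year".

    corpus: dict year -> list of articles, each article a list of tokens
            (assumed layout, not taken from the original project).
    Returns (sample_counts, year_counts). With remove=True the sampled
    articles are excluded from the counts of their source year.
    """
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(corpus[year])), n_articles))
    sample = [a for i, a in enumerate(corpus[year]) if i in picked]

    year_counts = {}
    for y, articles in corpus.items():
        if remove and y == year:
            articles = [a for i, a in enumerate(articles) if i not in picked]
        year_counts[y] = word_counts(articles)
    return word_counts(sample), year_counts
```

Fixing the seed makes the same 15 articles come out on every run, which helps when comparing the variants with and without removal.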
Kullback-Leibler Divergence
With OCR Correction (LEFT) and Without (RIGHT)
As KL is not symmetric, we computed it in one direction only: we try to represent our subset of articles with each year. As expected, the best match is the year the articles were taken from in all but one case (1840 with OCR correction, once the articles are removed, where 1899 scores best).
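A minimal sketch of this computation, under assumptions: the direction is read here as D(subset || year), the word distributions come from raw counts, and a small additive smoothing term (`epsilon`, an illustrative choice, not a value from the original experiment) keeps the logarithm finite for words missing from one side. The function names are hypothetical.

```python
import math


def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    """D_KL(P || Q) in nats, for two word-count dicts.

    Both distributions are smoothed over the joint vocabulary so that
    every probability is strictly positive (assumed smoothing scheme).
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + epsilon * len(vocab)
    q_total = sum(q_counts.values()) + epsilon * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (p_counts.get(word, 0) + epsilon) / p_total
        q = (q_counts.get(word, 0) + epsilon) / q_total
        divergence += p * math.log(p / q)
    return divergence


def best_estimate(sample_counts, year_counts):
    """Return (year, divergence) for the year minimizing D(sample || year)."""
    scores = {
        year: kl_divergence(sample_counts, counts)
        for year, counts in year_counts.items()
    }
    year = min(scores, key=scores.get)
    return year, scores[year]
```

Swapping the two arguments of `kl_divergence` gives the other direction (representing a year from the subset), which generally yields a different value because KL is not symmetric.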
-
1840
Best Est.: 1840, 0.031660477138007924 | Best Est.: 1840, 0.01932141594350566
Articles removed from the test set:
Best Est.: 1899, 0.5329287459661379 | Best Est.: 1840, 0.5597024069836072
-
1880
Best Est.: 1880, 0.043447215324982254 | Best Est.: 1880, 0.020543887035207067
Articles removed from the test set:
Best Est.: 1880, 0.46239086016971465 | Best Est.: 1880, 0.5223219897849468
-
1920
Best Est.: 1920, 0.03990581959677868 | Best Est.: 1920, 0.022052117114189976
Articles removed from the test set:
Best Est.: 1920, 0.45956701697224045 | Best Est.: 1920, 0.5159421530851702
-
1960
Best Est.: 1960, 0.04071517376390723 | Best Est.: 1960, 0.025952373391739208
Articles removed from the test set:
Best Est.: 1960, 0.17596041743309723 | Best Est.: 1960, 0.41523532824370085
-
1995
Best Est.: 1995, 0.047007380332804216 | Best Est.: 1995, 0.027078483784840415
Articles removed from the test set:
Best Est.: 1995, 0.1913952267788147 | Best Est.: 1995, 0.24046060197740382
Here are the results for the other direction, where we try to represent each year from the 15 articles:
-
1840
Best Est.: 1847, 0.6293087038542783 | Best Est.: 1850, 0.6886133376755331
-
1880
Best Est.: 1847, 0.6932575433754864 | Best Est.: 1850, 0.7003421853715898
-
1920
Best Est.: 1850, 0.6632945152618476 | Best Est.: 1850, 0.7223024234233494
-
1960
Best Est.: 1850, 0.7382428795259338 | Best Est.: 1847, 0.7431473183454298
-
1995
Best Est.: 1850, 0.7958442463528926 | Best Est.: 1847, 0.7685063648973741
-
We can observe that every subset of articles, whatever year it comes from, is best matched to the mid-19th century (1847 or 1850). This is probably because these years contain fewer words, so with only a small subset of articles it is easier to be close to 1840 than to 1998.