Metric determination

Requirements for the distance between two corpora d(C1, C2):

 

Matrix representation:

          year1     year2     year3     ...     yearN
year1     0         d(1,2)    d(1,3)    ...     d(1,N)
year2     d(2,1)    0         d(2,3)    ...     d(2,N)
year3     d(3,1)    d(3,2)    0         ...     d(3,N)
...       ...       ...       ...       ...     ...
yearN     d(N,1)    d(N,2)    d(N,3)    ...     0
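
As an illustration, the matrix above could be filled in with a small helper like the one below. This is only a sketch: the representation of a corpus as a word-to-relative-frequency dictionary, the names `distance_matrix` and `l1_distance`, and the choice of an L1 distance as the example are assumptions made for this example, not the metric chosen in this project. Both d(i,j) and d(j,i) are computed separately, since some of the candidate measures are not symmetric.

```python
from typing import Callable, Dict, List

# Assumed representation: one dict per year, mapping word -> relative frequency.
Corpus = Dict[str, float]


def distance_matrix(
    corpora: List[Corpus],
    d: Callable[[Corpus, Corpus], float],
) -> List[List[float]]:
    """Return the N x N matrix whose entry [i][j] is d(corpora[i], corpora[j])."""
    n = len(corpora)
    # The diagonal is fixed to 0; every off-diagonal entry is computed
    # independently because d is not assumed to be symmetric.
    return [
        [0.0 if i == j else d(corpora[i], corpora[j]) for j in range(n)]
        for i in range(n)
    ]


def l1_distance(a: Corpus, b: Corpus) -> float:
    """Placeholder distance (hypothetical): sum of absolute frequency differences."""
    words = set(a) | set(b)
    return sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in words)


years = [{"a": 0.5, "b": 0.5}, {"a": 1.0}]
matrix = distance_matrix(years, l1_distance)
```

Any other candidate distance with the same signature can be passed in place of `l1_distance`.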

 

 

Criteria to judge whether one metric is better than another:

 

Choices of distances:


RESULTS (left: with OCR correction; right: without it):

1-grams

  • When we approximate the 1990s with the 1840s, the divergence is large, but this is not the case in the other direction.
  • The year 1980 is slightly more divergent, and we cannot yet explain why.

 

2-grams

3-grams



Chi-square 

where f_{i,j} is the frequency of word j in year i.
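
The formula itself is not reproduced on this page, so the sketch below uses one common symmetric form of the chi-square distance, the sum over words j of (f_{i,j} - f_{k,j})^2 / (f_{i,j} + f_{k,j}). This choice is an assumption and may differ from the variant actually used in the experiment; the function name `chi_square_distance` and the dictionary representation are also hypothetical.

```python
from typing import Dict


def chi_square_distance(f_i: Dict[str, float], f_k: Dict[str, float]) -> float:
    """Chi-square distance between the word frequencies of two years.

    Assumed form: sum_j (f_i[j] - f_k[j])**2 / (f_i[j] + f_k[j]).
    Words missing from one year are treated as having frequency 0.
    """
    total = 0.0
    for word in set(f_i) | set(f_k):
        a = f_i.get(word, 0.0)
        b = f_k.get(word, 0.0)
        if a + b > 0:  # skip words with zero mass in both years
            total += (a - b) ** 2 / (a + b)
    return total


same = chi_square_distance({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
mixed = chi_square_distance({"a": 0.75, "b": 0.25}, {"a": 0.25, "b": 0.75})
disjoint = chi_square_distance({"a": 1.0}, {"b": 1.0})
```

With this form, identical distributions give 0 and the value grows as the two years share less of their vocabulary.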


In the experiment, the Out-of-place measure is split into two methods, depending on whether the unmatched words' frequencies are used. In addition, the author of this metric mentions in his paper that the method is highly robust to OCR errors, so the effect of OCR correction is not clear for this method (the ratio changes little, but the count changes more). Results are as follows:
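
To make the idea concrete, here is a minimal sketch of a rank-based out-of-place measure. It rests on several assumptions: words are ranked by decreasing frequency, a matched word contributes the absolute difference of its two ranks, and the two methods are read as "unmatched words add a fixed maximum penalty" versus "unmatched words are ignored". The names `rank_profile`, `out_of_place` and the flag `count_unmatched` are hypothetical, and the actual experiment may treat unmatched words differently.

```python
from typing import Dict, List


def rank_profile(freqs: Dict[str, float], top_k: int) -> List[str]:
    """Return the top_k words ordered by decreasing frequency (ties broken alphabetically)."""
    ordered = sorted(freqs, key=lambda w: (-freqs[w], w))
    return ordered[:top_k]


def out_of_place(
    profile_a: List[str],
    profile_b: List[str],
    count_unmatched: bool = True,
) -> int:
    """Sum of rank displacements of profile_a's words relative to profile_b.

    Assumption: a word absent from profile_b costs len(profile_b) when
    count_unmatched is True, and nothing when it is False.
    """
    rank_b = {word: rank for rank, word in enumerate(profile_b)}
    max_penalty = len(profile_b)
    total = 0
    for rank_a, word in enumerate(profile_a):
        if word in rank_b:
            total += abs(rank_a - rank_b[word])
        elif count_unmatched:
            total += max_penalty
    return total


a = rank_profile({"the": 10, "of": 5, "cat": 1}, top_k=3)
b = rank_profile({"of": 9, "the": 8, "dog": 2}, top_k=3)
with_unmatched = out_of_place(a, b, count_unmatched=True)
without_unmatched = out_of_place(a, b, count_unmatched=False)
```

Because only ranks enter the score, a rare misspelled token produced by OCR mostly affects the tail of the profile, which is consistent with the robustness claim quoted above.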

For the corrected dataset

For the uncorrected dataset

For the corrected dataset

For the uncorrected dataset