List of articles to read

[Gil] M. Rajman and L. Lebart. Similarités pour données textuelles. Proc. of 4th International Conference on Statistical Analysis of Textual Data (JADT'98), pp. 545-555, Nice (France), 1998. Link : http://liawww.epfl.ch/Publications/Archive/RajmanLebart98.pdf

Summary : A short and simple state of the art on text similarity. It starts with the representation of documents: each document is represented by a tuple D containing the frequencies of textual units in the document. Textual units are often words, but can also be 2-grams, 3-grams, ..., or morphosyntactic information (is the word a verb, a noun, ...). They discuss three distances, but the most interesting one for us is the one used in Textual Data Analysis (TDA), the Chi-square distance. Usually the whole corpus is represented as one matrix where each row contains the tuple of one document in the corpus.
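
As a rough illustration of the Chi-square distance on such a document-by-unit matrix, here is a minimal Python sketch. It assumes the usual correspondence-analysis form of the distance (row profiles compared unit by unit, each squared difference divided by the unit's overall share of the corpus), which may differ in detail from the paper's notation. The function name and the toy counts are made up for the example.

```python
from math import sqrt

def chi_square_distance(matrix, i, j):
    """Chi-square distance between rows i and j of a document-by-unit count matrix.

    Assumes every document (row) contains at least one textual unit.
    """
    total = sum(sum(row) for row in matrix)
    n_cols = len(matrix[0])
    # Share of each textual unit (column) in the whole corpus.
    col_mass = [sum(row[k] for row in matrix) / total for k in range(n_cols)]
    row_i, row_j = matrix[i], matrix[j]
    sum_i, sum_j = sum(row_i), sum(row_j)
    d2 = 0.0
    for k in range(n_cols):
        if col_mass[k] == 0:
            continue  # unit never occurs in the corpus, nothing to compare
        # Difference between the two document profiles for unit k.
        diff = row_i[k] / sum_i - row_j[k] / sum_j
        d2 += diff * diff / col_mass[k]
    return sqrt(d2)

# Hypothetical corpus: 3 documents (rows), 3 textual units (columns).
corpus = [
    [2, 1, 0],
    [4, 2, 0],
    [0, 1, 3],
]
d_same_profile = chi_square_distance(corpus, 0, 1)
d_different = chi_square_distance(corpus, 0, 2)
```

Documents 0 and 1 have the same relative frequencies (one is just twice as long), so their distance is zero, while document 2 is far from both. Dividing by the column share gives rare units more influence than frequent ones.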



[Marc] R. Besançon, M. Rajman and J.-C. Chappelier. Textual Similarities based on a Distributional Approach. Proceedings of the Tenth International Workshop on Database and Expert Systems Applications (DEXA99), pp. 180-184, Firenze (Italy), 1999.

Summary : This paper presents a method that improves on the "Vector Space" model for Information Retrieval (searching for documents that are close to a textual query).


[Cynthia] R. Besançon and M. Rajman. Evaluation of a Vector Space similarity measure in a multilingual framework. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'2002), May, 2002.

Summary: Analyses how close the vector-space representations of the same text are in different languages. In other words, the distance between two texts (for example in English) should remain the same if we compute it again after translating both texts.
It is not very useful for our purpose, but it contains some examples of distance computations that we can consider (sections 2.1 The standard vector space model and 2.2 The DSIR Model). Briefly, they take into account the frequencies of words in a document and the weight of the document compared to the others in the corpus.
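
To make the standard vector space idea concrete, here is a small Python sketch using one common tf-idf weighting and cosine similarity. This is an assumption for illustration: the exact weighting scheme in section 2.1 of the paper may differ, and the DSIR model is not covered here. The function names and the example sentences are invented.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into sparse tf-idf vectors (dicts)."""
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Weight = term frequency * log(N / document frequency).
    return [{t: tf * log(n_docs / df[t]) for t, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mini-corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
```

The first two sentences share several terms and get a positive similarity, while the third shares none with the first and gets zero. A term that occurs in every document gets weight zero with this idf, so it does not contribute to the similarity.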



[Tao] St-Jacques and C. Barrière. Similarity judgments: philosophical, psychological and mathematical investigations. In Proceedings of the Workshop on Linguistic Distances (LD '06). Association for Computational Linguistics, Stroudsburg, PA, USA, 8-15, 2006.

Summary: This study investigates similarity judgments from two angles:

 

[Nicolas] J.-B. Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 17, 2011. DOI: 10.1126/science.1199644.

Summary : Aims to analyze cultural phenomena and linguistic changes quantitatively, using a subset of about 5 million books (>500 billion words) of Google’s corpus, mostly in English. Uses n-grams with n from 1 to 5 and discards those that appear fewer than 40 times over the dataset, to cope with OCR faults. Points explored:

Lexicon size: evolution of the words in use over time. Many tokens are not proper words (dates, names, OCR mistakes, …), so samples from various periods are inspected by hand to estimate the ratio of true words. 1-grams that appear with frequency < 10^-9 are thrown away.

Grammar changes: irregular verbs becoming regular and vice versa

Forgetting rate of events (use of dates) and famous people (taken from Wikipedia).

Censorship: the rise of Nazi Germany triggered a sudden change in family name frequencies
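
The n-gram counting and low-count filtering mentioned in this summary can be sketched in a few lines of Python. This is only an illustration on a toy token list, not the authors' pipeline. The function names are invented, and the threshold is lowered from 40 to 2 in the usage so that the example is readable.

```python
from collections import Counter

def count_ngrams(tokens, max_n=5):
    """Count all n-grams of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts

def drop_rare(counts, min_count=40):
    """Keep only n-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy example with 1-grams and 2-grams only, and a threshold of 2.
tokens = "to be or not to be".split()
counts = count_ngrams(tokens, max_n=2)
frequent = drop_rare(counts, min_count=2)
```

Here `frequent` keeps only `("to",)`, `("be",)` and `("to", "be")`, each seen twice. On a real corpus the counts would be accumulated over all books before filtering.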


[Farah] F. Kaplan. Linguistic Capitalism and Algorithmic Mediation. In Representations, vol. 127, num. 1, p. 57-63, 2014.

Summary : This article is about Google using linguistic drift metrics to update auto-completion. When you use auto-completion, Google offers you related ads and earns money when you visit one of those pages. This is why linguistic drift measurements are important to them.

[Farah] M. Zampieri. Using Bag-of-words to Distinguish Similar Languages: How Efficient are They? Saarland University.

[Farah] V. Niculae, M. Zampieri, L. P. Dinu and A. M. Ciobanu. Temporal Text Ranking and Automatic Dating of Texts. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg (Sweden), 2014.

 

[Marc] G. Salton and C. Buckley. Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24:513-523, 1988.

Summary : See if it's interesting for us or not