List of articles to read
[Gil] M. Rajman and L. Lebart. Similarités pour données textuelles. Proc. of 4th International Conference on Statistical Analysis of Textual Data (JADT'98), pp. 545-555, Nice (France), 1998. Link : http://liawww.epfl.ch/Publications/Archive/RajmanLebart98.pdf
Summary : A brief state of the art on text similarity. First, the representation of documents: each document is represented by a tuple D containing the frequencies of the textual units it contains. Textual units are often words, but can also be 2-grams, 3-grams, ...; they can also carry morphosyntactic information (whether the word is a verb, a noun, ...). The authors discuss three distances, but the most interesting one for us is the one used in Textual Data Analysis (TDA), the Chi-square distance. Usually the whole corpus is represented as one matrix where each row contains the tuple of one document of the corpus.
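A minimal Python sketch of the Chi-square distance on such a document-by-term matrix, as a rough illustration of the summary above (the toy corpus and the exact column weighting, the inverse of each term's marginal frequency as in correspondence analysis, are our assumptions, not taken from the paper):

```python
import numpy as np

def chi_square_distance(counts, i, j):
    """Chi-square distance between the row profiles of documents i and j.

    `counts` is the document-by-term matrix described above: one row per
    document, one column per textual unit (word, n-gram, ...). Rows are
    normalised into profiles, and each squared difference is weighted by
    the inverse of the term's marginal frequency in the corpus.
    """
    counts = np.asarray(counts, dtype=float)
    col_marginals = counts.sum(axis=0) / counts.sum()   # corpus-wide term weights
    profile_i = counts[i] / counts[i].sum()             # row profile of document i
    profile_j = counts[j] / counts[j].sum()             # row profile of document j
    return np.sqrt(np.sum((profile_i - profile_j) ** 2 / col_marginals))

# Toy corpus: 3 documents x 4 textual units (hypothetical counts)
corpus = [[3, 0, 1, 2],
          [2, 1, 0, 3],
          [0, 4, 2, 1]]
print(chi_square_distance(corpus, 0, 1))
```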
[Marc] R. Besançon, M. Rajman and J.-C. Chappelier. Textual Similarities based on a Distributional Approach. Proceedings of the Tenth International Workshop on Database and Expert Systems Applications (DEXA99), pp. 180-184, Firenze (Italy), 1999.
Summary : This paper presents an improvement over the "Vector Space" method for Information Retrieval (finding documents that are close to a textual query).
- Vector Space : each document dn is represented by a vector (wn1,...,wnM) called the "lexical profile", where wnk is the weight (or importance) of the term tk in the document dn, and M is the size of the indexing term set.
- The terms in the indexing term set are chosen to be "as discriminative as possible".
- Weight of a term : its frequency -> Improvement : take into account the importance of each term (for ex.: terms that occur rarely are more important -> terms weighted by the inverse document frequency (see 2.1.1))
- Measure of similarity : cosine of the angle between the two vectors (document and query) (see 2.1.2 for the formula; a small tf-idf/cosine sketch follows this list)
- Other measures : Chi-square, Kullback-Leibler Divergence
- Distributional Semantics
- The semantics of a word is related to the set of contexts in which that word appears -> given several sentences with an unknown word X, it's possible to guess the word X
- In other words : two words are semantically similar to the extent that their contexts are similar
- Co-occurrence frequency : for 2 words, it is the frequency of both words occurring within a given textual unit.
- Experiments :
- Distributional Semantics gives better results than Vector Space
- Existence of hybrid DS
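A minimal sketch of this standard vector-space model, assuming plain tf × idf weights and the cosine measure (the tokenised toy documents and the exact idf formula are our assumptions; the weighting scheme of section 2.1.1 may differ in detail):

```python
import math
from collections import Counter

def tf_idf_profiles(documents):
    """Build a lexical profile (term -> weight) for each tokenised document.

    The weight combines the raw term frequency with the inverse document
    frequency, so that rare (more discriminative) terms count more.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))                       # in how many docs each term appears
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in documents]

def cosine(p, q):
    """Cosine of the angle between two sparse lexical profiles."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical, already-tokenised documents; the last one plays the role of a query.
docs = [["text", "similarity", "vector", "space"],
        ["vector", "space", "model", "retrieval"],
        ["vector", "space", "similarity"]]
profiles = tf_idf_profiles(docs)
print(cosine(profiles[0], profiles[2]))
```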
[Cynthia] R. Besançon and M. Rajman. Evaluation of a Vector Space similarity measure in a multilingual framework. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), May 2002.
Summary: Analyses how close the vector-space representations of the same text are in different languages. In other words, the distance between two texts (for example in English) should remain roughly the same if we compute it again after translating both texts.
It is not very useful for our purpose, but there are some examples of distance computations that we can consider (sections 2.1 The standard vector space model and 2.2 The DSIR Model). Briefly, they take into account the frequencies of words in a document and the weight of the document compared to the others in the corpus.
[Tao] St-Jacques and C. Barrière. Similarity judgments: philosophical, psychological and mathematical investigations. In Proceedings of the Workshop on Linguistic Distances (LD '06). Association for Computational Linguistics, Stroudsburg, PA, USA, 8-15, 2006.
Summary: This study investigates similarity judgments from two angles:
- Look at models suggested in the psychology and philosophy literature:
- Gain insight into the human capability of performing similarity judgments.
- Philosophical evidence
- Humans are able to give different judgments of similarity between different senses of the word "game", i.e., the meaning of concepts is mind-dependent and individuation is not intractable.
- Ordinary people can use their own similarity metric to disambiguate polysemous terms.
- Psychological Evidence
- The philosophical evidence shows that non-experts can perform similarity judgments; different psychological models suggest how they may do so.
- There are three approaches:
- Subjective scaling: similarity judgments over n objects are collected in an n × n matrix and analysed by multidimensional scaling (MDS) of the distances between the objects.
- Objective scaling: similarity measures are calculated from the ratio of objective features describing the objects under analysis; subjects are asked to make qualitative judgments on common or distinctive features of the objects, and the comparison is then made using distance axioms.
- Semantic differential: measures the meanings that individual subjects grant to words and concepts according to a series of factor analyses.
- Analyze the properties (shared or non-shared) of many metrics
- Classification models are divided according to two criteria:
- the cardinality of sets
- Classifies 28 similarity measures for ordinary sets; these measures can be distinguished on the basis of only a few properties, i.e., reflexivity, symmetry and transitivity.
- All 28 measures are reflexive and symmetric, but they vary in the type of transitivity they achieve.
- the proximity-based similarity measures
- The model is divided into three groups:
- the distance model: overlaps in part with the subjective scaling of similarity.
- the probabilistic model: based on the statistical analysis of objects and their attributes in a data space.
- the angular coefficients: also a metric-space model, but it uses angular measures between feature vectors to determine the similarity between objects.
- Analysis of Similarity metrics:
- Uses the psychological model of meaning and the typical properties of the classes to analyse similarity metrics, with the typical properties serving as dividing lines between groups of metrics (one concrete measure from each group is sketched below).
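To make the classification concrete, here is an illustrative sketch with one commonly used measure per family: Jaccard for ordinary sets (cardinality-based), Euclidean distance for the distance model, and the cosine for the angular coefficients. The specific measures and the toy data are our choice, not necessarily the ones analysed in the paper:

```python
import math

def jaccard(a, b):
    """Set-cardinality measure: reflexive and symmetric, like the 28 ordinary-set measures."""
    return len(a & b) / len(a | b) if a | b else 1.0

def euclidean(x, y):
    """Distance-model measure: a metric over feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    """Angular-coefficient measure: similarity from the angle between feature vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

print(jaccard({"red", "round", "small"}, {"red", "round", "large"}))  # shared features / all features
print(euclidean([1.0, 0.0, 2.0], [0.0, 1.0, 2.0]))
print(cosine([1.0, 0.0, 2.0], [0.0, 1.0, 2.0]))
```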
[Nicolas] J.-B. Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 17, 2011. DOI: 10.1126/science.1199644.
Summary : Aims at analyzing cultural phenomena and linguistic changes quantitatively using a 5-million-book (>500 billion words) subset of Google's corpus, mostly in English. Uses n-grams of length 1 to 5 and discards those that appear fewer than 40 times over the dataset to cope with OCR faults. Points explored:
Lexicon size : evolution of the words in use over time. Many tokens are not proper words (dates, names, OCR mistakes, ...), so samples from various periods are inspected by hand to estimate the ratio of true words. 1-grams that appear with relative frequency < 10^-9 are thrown away (a small filtering sketch follows this list).
Grammar changes: irregular verbs becoming regular and vice versa
Forgetting rate of events (measured via the use of dates) and of famous people (taken from Wikipedia).
Censorship: the rise of Nazi Germany triggered a sudden change in family-name frequencies.
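A minimal sketch of the frequency filtering described above, assuming plain per-year token lists (the toy data are hypothetical, and the thresholds are relaxed in the example call because the real ones only make sense on a >500-billion-word corpus):

```python
from collections import Counter

def yearly_relative_frequencies(corpus_by_year, min_total_count=40, min_rel_freq=1e-9):
    """Per-year relative frequency of every 1-gram that survives the thresholds:
    dropped if seen fewer than `min_total_count` times overall, or if its
    relative frequency within a year falls below `min_rel_freq`."""
    total_counts = Counter()
    for tokens in corpus_by_year.values():
        total_counts.update(tokens)
    usage = {}
    for year, tokens in corpus_by_year.items():
        counts = Counter(tokens)
        n_tokens = len(tokens)
        usage[year] = {
            gram: count / n_tokens
            for gram, count in counts.items()
            if total_counts[gram] >= min_total_count
            and count / n_tokens >= min_rel_freq
        }
    return usage

# Hypothetical toy data with relaxed thresholds, just to show the shape of the output.
toy = {1900: ["war", "of", "the", "worlds", "the"],
       1950: ["the", "computer", "age", "of", "the", "atom"]}
print(yearly_relative_frequencies(toy, min_total_count=1, min_rel_freq=0.0))
```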
[Farah] F. Kaplan. Linguistic Capitalism and Algorithmic Mediation. In Representations, vol. 127, num. 1, p. 57-63, 2014.
Summary : This article is about Google using linguistic-drift metrics to update auto-completion. When you use auto-completion, Google offers you related ads and earns money when you open one of those pages. This is why measuring linguistic drift is important to them.
- Words that are more likely to be used are valued more highly.
- Auto-completion denatures language, as it is based on algorithms built on statistical models applied to texts already altered by various algorithmic mediations.
- On the web, many texts are corrected and formatted by algorithms => secondary resources (Wikipedia, translated articles, etc.)
- The Le Temps corpus should consist only of primary resources.
[Farah] M. Zampieri (Saarland University). Using Bag-of-Words to Distinguish Similar Languages: How Efficient Are They?
- Earliest approach: use the Zipf's-law distribution to rank the frequencies of short words in a text and use this ranking for language identification. N-gram approaches came later.
- Uses distances between frequency rankings of words (or n-grams): count words per year, rank them, and compute the distance between the two rankings of a word in two different years (a small rank-distance sketch follows this list).
- They use either short words, frequent words, or character n-grams.
- A blacklist of words is used to distinguish between similar languages (for instance, words that have never been used before).
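A minimal sketch of comparing frequency rankings between two samples, assuming a plain sum of rank differences over the shared vocabulary (the toy samples and this particular rank distance are our assumptions; the paper may use a different ranking distance):

```python
from collections import Counter

def frequency_ranking(tokens):
    """Rank tokens by descending frequency (rank 1 = most frequent)."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {token: rank for rank, token in enumerate(ordered, start=1)}

def rank_distance(tokens_a, tokens_b):
    """Sum of rank differences over the shared vocabulary of two samples
    (e.g. the same corpus in two different years, or two language samples).
    Out-of-vocabulary words could also be penalised; that is omitted here."""
    rank_a = frequency_ranking(tokens_a)
    rank_b = frequency_ranking(tokens_b)
    shared = rank_a.keys() & rank_b.keys()
    return sum(abs(rank_a[t] - rank_b[t]) for t in shared)

sample_1900 = ["the", "of", "the", "steam", "engine", "of"]
sample_1950 = ["the", "computer", "of", "the", "the", "engine"]
print(rank_distance(sample_1900, sample_1950))
```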
[Farah] V. Niculae, M. Zampieri, L. P. Dinu and A. M. Ciobanu. Temporal Text Ranking and Automatic Dating of Texts. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg (Sweden), 2014.
[Marc] G. Salton and C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24:513-523, 1988.
Summary : See if it's interesting for us or not