Research Papers
On this page we discuss the research papers that tackle problems similar to ours. Out of the 8 papers our TAs helped us identify, we selected 4, which correspond to the 4 sections below (please don't mind the numbering; it is a left-over from the time we were considering all 8 papers).
2. Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining, Mei and Zhai, 2005
This paper describes a method that uses EM algorithms to extract themes from a set of documents, then derives an evolution graph and analyses the themes' life cycles.
In essence, the paper emphasizes temporal structures whose strength evolves through time. This can be useful to explore major events or to follow the trend of a theme.
The paper is divided into 3 parts:
- First, extracting themes on a partition of time periods (for instance, they use the example of the tsunami coverage over 2 months)
- Then, finding which themes in one time period are related to themes in the subsequent period
- Finally, assessing the strength of the themes over the time periods, which is called the theme life cycle
Overall, the proposed methods rely on the following technical points:
- A partition of the documents is built at the beginning, which suggests the approach is well suited for parallel processing
- The theme extraction is based on a mixture model with an expectation-maximization method
- The references also point to an article about Latent Dirichlet Allocation, which already has mature implementations, in Spark for instance
- The transition graph is based on a Kullback-Leibler divergence evaluation between two given themes (a toy sketch of this computation is given after this list)
- The analysis of the theme life cycle uses Hidden Markov Models
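As a quick illustration of the transition-graph step, here is a minimal Python sketch of the Kullback-Leibler divergence between two theme word distributions. This is not the paper's implementation; the themes, vocabulary and epsilon smoothing below are invented for the example:

```python
import math

def kl_divergence(p, q, vocab, epsilon=1e-12):
    """Kullback-Leibler divergence D(p || q) between two theme word
    distributions given as dicts word -> probability.
    A small epsilon stands in for words missing from a theme."""
    return sum(
        p.get(w, epsilon) * math.log(p.get(w, epsilon) / q.get(w, epsilon))
        for w in vocab
    )

# Toy example: two themes over a tiny shared vocabulary.
vocab = ["wave", "relief", "donation", "earthquake"]
theme_t1 = {"wave": 0.5, "earthquake": 0.4, "relief": 0.05, "donation": 0.05}
theme_t2 = {"relief": 0.45, "donation": 0.45, "wave": 0.05, "earthquake": 0.05}

# A low divergence between a theme at period t and a theme at period t+1
# would suggest an edge in the evolution graph.
print(kl_divergence(theme_t2, theme_t1, vocab))
```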
Pros and Cons:
+ It considers the temporal evolution of themes
+ Theme extraction seems to be well documented, and these machine learning methods can be run in parallel
+ They used thousands of press articles from 10 different sources, so their experiments are relevant to our data constraints
- As described, the article gives no hierarchical theme clustering method
4. Incremental Hierarchical Clustering of Text Documents, Sahoo et al., 2006
Incremental hierarchical text document clustering algorithms are used to organize documents generated by streaming online sources such as newswires and blogs. Existing incremental hierarchical clustering algorithms include Cobweb and Classit; however, they have not been used with text document data. Based on these algorithms, the authors developed a new algorithm suitable for text clustering, which they evaluate on newswire articles.
Why is this important?
Document clustering is an effective tool to manage information overload. By clustering similar documents together we can quickly browse large document collections, easily grasp the distinct topics and subtopics (concept hierarchies) in them, and efficiently query them, among many other applications.
Variation of Cobweb and Classit
The new algorithm, Katz-Classit, is an adapted version of the Cobweb and Classit algorithms that uses Katz's distribution instead of the normal distribution, because it is more appropriate for word occurrence data. The paper also proposes a new way to evaluate the quality of the hierarchy produced by hierarchical clustering algorithms, by observing how often the children clusters of a cluster receive the children classes of the class assigned to that cluster.
To summarize, there are two important improvements:
1. Katz distribution (it is more appropriate for word occurrence data)
2. Evaluation of the quality of the result (it becomes much easier to compare algorithms that use different distributions)
However, this is a relatively unexplored area of the text document clustering literature. Also, testing showed that Katz's distribution does not always give the best results on every data set; sometimes other distributions perform better.
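To make the first improvement more concrete, here is a rough Python sketch of fitting Katz's K-mixture for a single word from simple collection statistics. It follows the usual alpha/beta parameterization found in the IR literature, not necessarily the exact estimation procedure of the paper, and the counts in the example are invented:

```python
def fit_katz_k_mixture(cf, df, n_docs):
    """Fit the two parameters of Katz's K-mixture for one word:
    cf = collection frequency (total occurrences),
    df = document frequency (documents containing the word),
    n_docs = total number of documents.
    Returns (alpha, beta) such that
    P(k) = (1 - alpha) * [k == 0] + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
    """
    lam = cf / n_docs        # average occurrences per document
    beta = (cf - df) / df    # extra occurrences per document containing the word
    alpha = lam / beta if beta > 0 else 0.0
    return alpha, beta

def katz_pmf(k, alpha, beta):
    """Probability of observing the word exactly k times in a document."""
    p = (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
    if k == 0:
        p += 1 - alpha
    return p

# Toy example: a word seen 500 times across 200 of 10,000 documents.
alpha, beta = fit_katz_k_mixture(cf=500, df=200, n_docs=10_000)
print(alpha, beta, katz_pmf(0, alpha, beta), katz_pmf(3, alpha, beta))
```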
5. Mining Correlated Bursty Topic Patterns from Coordinated Text Streams, Wang et al., 2007
The proposed algorithm tries to detect major events by finding "bursty" topic patterns across different streams. Conceptually, a topic is a set of words with high probability in a given stream; a topic is "bursty" when it receives intensive coverage during a certain time interval. The technique is designed to detect bursts of topics even when the text streams are not in the same language. It does so with a purely probabilistic approach: the algorithm essentially detects patterns by calculating the correlation between bursts of topics from different streams. Typically, after a major event, two text streams will tend to intensely use the same set of words in their respective vocabularies (languages) during a similar period of time.
Technically, at a given timestamp t, the algorithm merges the words of all the streams and computes their distribution, called the coordinated mixture model. It also estimates the individual word distribution of each stream over its whole time span, called the stream's "background model". The background model represents the topics covered "locally" by the stream, while the coordinated mixture model captures the correlated topics across all streams at time index t.
Basically, the authors assume that a text sample (i.e., an article from a news stream) at a certain point in time is "generated" either from the background model of its stream or from a time-dependent mixture model associated with a certain pattern (i.e., an event). From these assumptions, they derive a formula for the probability that a pattern exists at a precise point in time. This probability is estimated using the Expectation-Maximisation algorithm.
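As a very rough Python illustration of this generative assumption (not the paper's actual model; the mixing weight, vocabularies and probabilities below are invented), one EM-style quantity is the posterior probability that an observed word was generated by the stream's background model rather than by a shared bursty pattern:

```python
def responsibility(word, background, patterns, pattern_weights, lambda_bg=0.9):
    """E-step-style computation: given a word observed in a stream at time t,
    return the posterior probability that it was generated by the stream's
    background model rather than by one of the shared bursty patterns.
    `background` and each pattern are dicts word -> probability;
    `pattern_weights` are the current estimates of P(z | t)."""
    p_bg = lambda_bg * background.get(word, 1e-12)
    p_patterns = (1 - lambda_bg) * sum(
        w * z.get(word, 1e-12) for z, w in zip(patterns, pattern_weights)
    )
    return p_bg / (p_bg + p_patterns)

# Toy example: "tsunami" is rare in the background but strong in one pattern.
background = {"the": 0.2, "market": 0.1, "tsunami": 0.001}
patterns = [{"tsunami": 0.3, "relief": 0.2}, {"election": 0.4, "vote": 0.3}]
print(responsibility("tsunami", background, patterns, pattern_weights=[0.7, 0.3]))
```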
To represent the temporal dependency of a burst, the authors add a constraint that takes into account the probability of the pattern at neighboring timestamps. More formally, let z be a pattern and t a timestamp. Without the constraint, we use P(z|t) as the estimate of the pattern's strength. With the constraint, we average ("smooth") P(z|t-1), P(z|t) and P(z|t+1) and take this value as the pattern strength.
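A minimal sketch of this smoothing, assuming a plain unweighted average and leaving the boundary timestamps untouched (the paper may weight the neighbors differently):

```python
def smooth_pattern_strength(strengths):
    """Average each P(z | t) with its neighbors P(z | t-1) and P(z | t+1),
    keeping the boundary values unchanged, to model temporal dependency."""
    smoothed = list(strengths)
    for t in range(1, len(strengths) - 1):
        smoothed[t] = (strengths[t - 1] + strengths[t] + strengths[t + 1]) / 3
    return smoothed

# Toy strength sequence for one pattern across consecutive timestamps.
print(smooth_pattern_strength([0.05, 0.10, 0.60, 0.15, 0.05]))
```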
Another improvement forces the algorithm to eliminate topics from a single stream that are weakly correlated with topics from all other streams. Indeed, low-correlated topics would most likely correspond to local patterns that are not associated with an event of global importance.
As far as scalability is concerned, the algorithm is theoretically not splittable at all: some computations, such as the background model or the "noise-filter" improvement, are done over the whole time span of every stream. In practice we could, however, define sub-intervals of the time span and apply the algorithm on each of them, but that might reduce its accuracy.
Overall, the described method is conceptually quite suited to our purpose, but it satisfies a very strong constraint that we do not need: it is meant to find patterns across streams in different languages and thus does not perform any kind of semantic analysis. Typically, it is not able to detect hierarchies of topics (local sub-topics will be explicitly discarded by the second additional constraint). If we were to implement this algorithm, we would most likely be able to improve it (on scalability or accuracy) by using natural language processing techniques.
7. Finding surprising patterns in textual data streams, Snowsill et al., 2010
(This is a first rough evaluation of this method and might contain inaccuracies.)
This method tries to estimate the probability that a given n-gram appears, which makes it possible to detect a gap between the estimate and what actually happens. Such gaps indicate current events. (A crude sketch of this idea is given after the list of technical points below.)
- It treats sequences of n-grams as Markov chains.
- It uses a suffix tree as the data structure providing the memory needed to estimate the probability.
- It processes the sequences of articles in batches, over a fixed timeframe.
- The given algorithm is O(N^2), N being the size of the timeframe, and allegedly constant in space usage (nevertheless, after careful consideration, it seems to be O(N)).
- It seems that the authors were able to distribute it over a cluster (or at least some of the preprocessing; the article is unclear and provides few details on how they achieved it). They apparently distributed the n-grams directly between nodes. In any case, this method is quite scalable, as another way to distribute it would simply be between the different timeframes.
- It also seems well adapted to the size of our data, as they tested it on a set of 1.6 × 10^6 articles, which resulted in a 30 GB index tree.
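The paper itself relies on a generalized suffix tree and a statistical significance test; the Python sketch below is only a crude frequency-ratio illustration of the "expected vs. observed" gap, with invented function names and toy data, not the authors' method:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def surprise_scores(history_tokens, current_tokens, n=2, smoothing=1.0):
    """Very rough surprise score: ratio between the observed frequency of an
    n-gram in the current batch and its expected frequency estimated from the
    history, with add-one style smoothing. High scores flag candidate events."""
    hist = Counter(ngrams(history_tokens, n))
    curr = Counter(ngrams(current_tokens, n))
    hist_total = sum(hist.values()) or 1
    curr_total = sum(curr.values()) or 1
    scores = {}
    for gram, count in curr.items():
        expected = (hist[gram] + smoothing) / (hist_total + smoothing)
        observed = count / curr_total
        scores[gram] = observed / expected
    return scores

# Toy batches: the bigrams of the current batch that were never seen in the
# history ("massive earthquake", "massive tsunami") get the highest scores.
history = "the markets were calm and the weather was calm".split()
current = "a massive earthquake struck and a massive tsunami followed".split()
print(sorted(surprise_scores(history, current).items(), key=lambda kv: -kv[1])[:3])
```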
There are mainly two parts to its implementation (whether they can be done in parallel is subject to caution):
- Implementing accessors to a generalized suffix tree that supports annotating nodes (we should check whether simply using a file on the provided HDFS distributed file system is enough, or whether another approach would be more efficient; our data set might be small enough for the first option to be viable)
- Implementing the algorithm itself
Pros/Cons
+ It requires no prior knowledge of the events contained in the data
+ It is not very resource-intensive
- The intelligence it provides might not be enough to really categorize it as a machine learning algorithm