Topic extraction from cable data and war diaries
Topics are extracted with Latent Dirichlet Allocation (LDA). This generative method is popular for discovering topics in a set of documents, and library implementations are available on many platforms, such as gensim for Python or MLlib for Spark (mllib.clustering.LDA). We use Spark's LDA library for computational reasons.
In this model, each document is represented as a bag of words: what matters is which words occur and how often, not their order.
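As a small illustration of this representation, the sketch below turns one document into word counts. It uses only the Python standard library rather than Spark, and the whitespace tokenization and the helper name bag_of_words are assumptions made for the example, not the project's code.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for one document, ignoring word order.

    Splitting on whitespace is a simplifying assumption; a real pipeline
    would use a proper tokenizer.
    """
    return Counter(text.lower().split())

doc = "the embassy reported the visit"
bow = bag_of_words(doc)
print(bow["the"], bow["visit"])  # prints: 2 1
```

Two documents containing the same words in a different order get the same representation, which is exactly the information LDA works with.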
Before running an LDA model on our corpus, we first need to filter the corpus so that the model is not spoiled by "stop words" such as "for" or "the", and so that it gives us more accurate topics. The filter removes stop words, non-alphabetical words, common words (those that appear as many times as twice the corpus size) and unique words (those that appear only once in the whole corpus).
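The filtering step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the project's Spark code: the stop-word list is a tiny placeholder, "corpus size" is read as the number of documents, and the helper name filter_corpus is hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"for", "the", "a", "of", "and", "to", "in"}

def filter_corpus(docs, stop_words=STOP_WORDS):
    """Filter a corpus given as a list of token lists."""
    # First pass: drop stop words and tokens that are not purely alphabetical.
    cleaned = [
        [w.lower() for w in doc if w.isalpha() and w.lower() not in stop_words]
        for doc in docs
    ]
    # Count how often each remaining word occurs in the whole corpus.
    counts = Counter(w for doc in cleaned for w in doc)
    # Assumed reading of "twice the corpus size": 2 x number of documents.
    too_common = 2 * len(docs)
    # Second pass: drop words seen only once, or at least `too_common` times.
    return [[w for w in doc if 1 < counts[w] < too_common] for doc in cleaned]

docs = [
    "the visit visit to bonn in 1975".split(),
    "visit visit for the embassy in bonn".split(),
]
filtered = filter_corpus(docs)
print(filtered)  # prints: [['bonn'], ['bonn']]
```

In this toy corpus "1975" is dropped as non-alphabetical, "embassy" as unique, and "visit" as too common (4 occurrences in 2 documents), so only "bonn" survives.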
Secondly, since we want accurate topics, the number of topics requested from the model is crucial too. To find the optimal value, we run the model with different numbers of topics on a sample corpus made up of 10% of the whole set.
In our case, the optimal number of topics is 5 for the Afghan war diary, 40 for the Iraqi war diary and 50 for the cable data.
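The selection procedure described above could be sketched along these lines. The text does not say which criterion was used to compare topic counts, so the score callable below is purely an assumption (lower is better, as with perplexity), and dummy_score is a stand-in that trains no model at all. The helper names are hypothetical.

```python
import random

def sample_corpus(docs, fraction=0.1, seed=0):
    """Draw a random sample containing `fraction` of the documents."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = max(1, int(len(docs) * fraction))
    return rng.sample(docs, size)

def best_topic_count(docs, candidates, score):
    """Return the candidate number of topics with the lowest score.

    `score(docs, k)` is a hypothetical callable that would train a model
    with k topics on `docs` and return a number where lower is better.
    """
    return min(candidates, key=lambda k: score(docs, k))

# Placeholder scorer for demonstration only; it does not train anything.
def dummy_score(docs, k):
    return abs(k - 40)

corpus = list(range(1000))       # stand-in for 1000 documents
sample = sample_corpus(corpus)   # 10% sample
chosen = best_topic_count(sample, [5, 20, 40, 50], dummy_score)
print(len(sample), chosen)  # prints: 100 40
```

In a real run, score would wrap the LDA training call and whatever quality measure is being compared across topic counts.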
Finally, since the LDA model is fitted iteratively, the number of iterations matters for its convergence. When running on the sample corpus with different numbers of topics, we set the number of iterations to 100 to keep the execution time low; when running on the whole corpus, we set it to 1,000. The number is kept low for the sample because there we do not need the model to converge, only to give us an idea of how fast it will.
At the end of the execution, we get two output files per corpus. The first, "Topic Matrix", contains the topics, each represented by its 10 most relevant words and the weight of each word in that topic. Each topic is then named by hand for clarity. The second, "Topic Distributions", lists the most relevant topics for each document.
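To make the two outputs concrete, here is a minimal sketch of how such rankings can be derived from a trained model's raw numbers. The input shapes (one list of per-word weights for a topic, one list of per-topic shares for a document) and the helper names are assumptions for illustration; the actual file layout is the one in the repositories.

```python
def top_terms(topic_weights, vocabulary, n=10):
    """Return the n highest-weighted (word, weight) pairs of one topic.

    topic_weights[i] is assumed to be the weight of vocabulary[i].
    """
    ranked = sorted(zip(vocabulary, topic_weights),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

def top_topics(doc_distribution, n=3):
    """Return the n most relevant (topic index, share) pairs of one document.

    doc_distribution[k] is assumed to be the share of topic k.
    """
    ranked = sorted(enumerate(doc_distribution),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

vocabulary = ["bonn", "turkish", "panama", "german"]
weights = [0.0321, 0.0273, 0.0160, 0.0272]
print(top_terms(weights, vocabulary, n=2))
# prints: [('bonn', 0.0321), ('turkish', 0.0273)]
print(top_topics([0.1, 0.7, 0.2], n=2))
# prints: [(1, 0.7), (2, 0.2)]
```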
Now, let's look at some topics from the cable data. Here are 3 hand-picked topics, each with its 10 most important terms. (The quoted topic titles were added by hand for clarity.)
"turkish-german relations" | | "visit schedules" | | "middle east" | |
term | weight | term | weight | term | weight |
bonn | 0.0321 | visit | 0.2369 | media | 0.0207 |
turkish | 0.0273 | october | 0.0358 | african | 0.0195 |
german | 0.0272 | august | 0.0323 | arab | 0.0170 |
federal | 0.0262 | november | 0.0293 | speech | 0.0169 |
berlin | 0.0241 | september | 0.0266 | israeli | 0.0159 |
ankara | 0.0214 | saudi | 0.0232 | coverage | 0.0149 |
european | 0.0211 | washington | 0.0215 | continued | 0.0148 |
reps | 0.0164 | march | 0.0192 | middle | 0.0145 |
western | 0.0161 | january | 0.0182 | east | 0.0144 |
panama | 0.0160 | june | 0.0180 | israel | 0.0141 |
All these files and the implementation of LDA can be found on GitHub:
For the Afghan war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/afg
For the Iraqi war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/irq
For the cable data -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/plusd/topic-extraction