Topic extraction from cable data and war diaries

To extract topics, we use Latent Dirichlet Allocation (LDA). This generative method is popular for discovering topics in a set of documents and is implemented in many libraries, such as gensim for Python or Spark MLlib (mllib.clustering.LDA). We use Spark's LDA implementation for computational reasons.

In this model, each document is represented as a bag of words: the words it contains together with their counts, with word order ignored.
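As a minimal sketch of this representation, the snippet below counts the words of a single document. The function name bag_of_words and the whitespace tokenization are assumptions made for illustration; they are not taken from our Spark implementation.

```python
from collections import Counter

def bag_of_words(text):
    """Return the word counts of one document, ignoring word order."""
    # Assumption: lowercase the text and split on whitespace.
    return Counter(text.lower().split())

doc = "the convoy left the base"
bow = bag_of_words(doc)
print(bow)
```

Two documents that contain the same words in a different order get the same representation, which is what lets LDA work on counts alone.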

Before running an LDA model on our corpus, we first need to filter it so that the model is not spoiled by "stop words", such as "for" or "the", and gives us more accurate topics. The filter removes stop words, non-alphabetical words, common words (those that appear twice the corpus size) and unique words (those that appear only once in the whole corpus).
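The filter described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code we ran on Spark: the stop-word list is a tiny hypothetical one, "corpus size" is taken to mean the number of documents, a word counts as common once its total count reaches twice that number, and tokens are lowercased first.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"for", "the", "a", "of", "and", "to", "in"}

def filter_corpus(docs, stop_words=STOP_WORDS):
    """Filter a corpus given as a list of token lists."""
    docs = [[t.lower() for t in doc] for doc in docs]
    # Total number of occurrences of each word over the whole corpus.
    totals = Counter(t for doc in docs for t in doc)
    # Assumption: corpus size = number of documents.
    max_count = 2 * len(docs)

    def keep(token):
        return (
            token not in stop_words          # drop stop words
            and token.isalpha()              # drop non-alphabetical words
            and totals[token] > 1            # drop unique words
            and totals[token] < max_count    # drop common words
        )

    return [[t for t in doc if keep(t)] for doc in docs]

docs = [
    ["the", "convoy", "left", "the", "base"],
    ["convoy", "attacked", "near", "base", "2"],
    ["base", "base", "base", "base", "report"],
]
filtered = filter_corpus(docs)
print(filtered)
```

In this toy corpus only "convoy" survives: "the" is a stop word, "2" is not alphabetical, "base" occurs 6 times in 3 documents, and every other word occurs once.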

Second, since we want accurate topics, the number of topics we ask the model for is crucial too. To find the optimal value, we run our model with different numbers of topics on a sample corpus, which is 10% of the whole set.

In our case, the optimal number of topics is 5 for the Afghan war diary, 40 for the Iraqi war diary and 50 for the cable data.
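The search over the number of topics can be sketched as below. The helper names sample_corpus and pick_num_topics are hypothetical, and so is score_model: it stands for any function that trains an LDA model with k topics on the sample and returns a quality score where higher is better. The text does not fix a particular score, so the one used in the example is a placeholder.

```python
import random

def sample_corpus(docs, fraction=0.1, seed=0):
    """Draw a random sample of the corpus (10% by default)."""
    rng = random.Random(seed)
    size = max(1, round(len(docs) * fraction))
    return rng.sample(docs, size)

def pick_num_topics(docs, candidates, score_model):
    """Score each candidate number of topics on the sample, keep the best."""
    sample = sample_corpus(docs)
    scores = {k: score_model(sample, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Placeholder score for illustration only: it peaks at 40 topics.
def fake_score(sample, k):
    return -abs(k - 40)

corpus = [f"doc-{i}" for i in range(1000)]
best_k, scores = pick_num_topics(corpus, [5, 10, 20, 40, 50], fake_score)
print(best_k)
```

Fixing the random seed keeps the sample identical across candidates, so the scores are compared on the same documents.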

Finally, since an LDA model is fitted iteratively, the number of iterations is important for its convergence. When running on the sample corpus with different numbers of topics, we set the number of iterations to 100 to lower the execution time; when we run the model on the whole corpus, it is set to 1,000. The number is kept low for the sample because there we do not need the model to converge, only to give us an indication of how fast it will converge.

At the end of the execution, we get two output files per corpus. The "Topic Matrix" file contains the topics, each represented by its 10 most relevant words and the weight of each word for that topic. Each topic is then named by hand for clarity. The "Topic Distributions" file lists the most relevant topics for each document.
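To illustrate what the two files contain, the sketch below ranks terms within a topic and topics within a document. The function names top_terms and top_topics and the toy inputs are assumptions made for illustration; the actual files are produced from the Spark model.

```python
def top_terms(topic_weights, vocab, n=10):
    """Return the n (term, weight) pairs with the highest weight in a topic."""
    ranked = sorted(zip(vocab, topic_weights), key=lambda p: p[1], reverse=True)
    return ranked[:n]

def top_topics(doc_distribution, n=3):
    """Return the n (topic index, weight) pairs most relevant to a document."""
    ranked = sorted(enumerate(doc_distribution), key=lambda p: p[1], reverse=True)
    return ranked[:n]

# Toy inputs: four terms of one topic and one document over three topics.
vocab = ["panama", "german", "bonn", "turkish"]
weights = [0.0160, 0.0272, 0.0321, 0.0273]
terms = top_terms(weights, vocab, n=2)
topics = top_topics([0.1, 0.7, 0.2], n=2)
print(terms)
print(topics)
```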

Now, let's look at some topics from the cable data. Here are 3 hand-picked topics, each with its 10 most important terms. (The quoted topic titles were added by hand for clarity.)

"turkish-german relations"    "visit schedules"             "middle east"
term        weight            term        weight            term        weight
bonn        0.0321            visit       0.2369            media       0.0207
turkish     0.0273            october     0.0358            african     0.0195
german      0.0272            august      0.0323            arab        0.0170
federal     0.0262            november    0.0293            speech      0.0169
berlin      0.0241            september   0.0266            israeli     0.0159
ankara      0.0214            saudi       0.0232            coverage    0.0149
european    0.0211            washington  0.0215            continued   0.0148
reps        0.0164            march       0.0192            middle      0.0145
western     0.0161            january     0.0182            east        0.0144
panama      0.0160            june        0.0180            israel      0.0141

 

All these files and the implementation of LDA can be found on GitHub:

For the Afghan war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/afg

For the Iraqi war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/irq

For the cable data -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/plusd/topic-extraction