Topic extraction from cable data and war diaries

For topic extraction, we use Latent Dirichlet Allocation (LDA). This generative model is a popular method for discovering topics in a set of documents, and implementations are available as libraries on many platforms, such as gensim for Python or MLlib for Spark (mllib.clustering.LDA). We use Spark's LDA library for computational reasons.

In this model, each document is represented as a bag of words: word order is ignored and only the number of occurrences of each word is kept.
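As a minimal sketch of this representation, the snippet below turns one document into word counts. The function name `bag_of_words`, the simple regular-expression tokenizer and the sample sentence are assumptions made for illustration; they are not taken from the project code.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for one document; word order is discarded."""
    # Lowercase the text and keep runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Made-up sentence used only to show the shape of the result.
doc = "The visit to Ankara follows the visit to Berlin"
bow = bag_of_words(doc)
print(bow["visit"], bow["ankara"])
```

Two documents that contain the same words in a different order get the same representation, which is exactly what the model sees.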

Before running LDA on our corpus, we first need to filter it so that the model is not polluted by uninformative words and gives us more accurate topics. The filter removes stop words (such as "for" and "the"), non-alphabetical words, common words (words whose number of occurrences reaches twice the corpus size) and unique words (words that appear only once in the whole corpus).
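The filter described above could look roughly like the following sketch. It assumes that "corpus size" means the number of documents and that a word is "common" when its total count exceeds twice that number; the tiny stop-word list, the function name `filter_corpus` and the sample documents are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"for", "the", "a", "an", "and", "of", "to", "in"}

def filter_corpus(docs, stop_words=STOP_WORDS):
    """docs is a list of token lists; returns the filtered token lists."""
    # Step 1: drop stop words and non-alphabetical tokens.
    cleaned = [
        [t.lower() for t in doc if t.isalpha() and t.lower() not in stop_words]
        for doc in docs
    ]
    # Step 2: count every remaining word over the whole corpus.
    totals = Counter(t for doc in cleaned for t in doc)
    # Assumed reading of "twice the corpus size": 2 x number of documents.
    max_count = 2 * len(docs)
    # Step 3: keep words that are neither unique (count 1) nor too common.
    keep = {t for t, c in totals.items() if 1 < c <= max_count}
    return [[t for t in doc if t in keep] for doc in cleaned]

# Made-up documents: "visit" is too common, "report" is unique, "2009" is not a word.
docs = [
    ["the", "visit", "to", "ankara", "2009"],
    ["visit", "for", "the", "embassy", "ankara"],
    ["embassy", "visit", "visit", "visit", "visit", "visit", "visit", "report"],
]
filtered = filter_corpus(docs)
print(filtered)
```

Counting over the whole corpus before filtering means the two thresholds are applied globally, not per document.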

Secondly, since we want accurate topics, the number of topics requested from the model is crucial too. To find the optimal value, we run the model with different numbers of topics on a sample corpus made of 10% of the whole set.

In our case, the optimal number of topics is 5 for the Afghan war diary, 40 for the Iraqi war diary and 50 for the cable data.
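The search over the number of topics can be sketched as below. The text does not say how the "optimal" value was judged, so the scoring step is left as a hypothetical callback `train_and_score(sample, k)` that returns a higher-is-better score; `toy_score` is a stand-in that trains nothing and exists only so the sketch runs.

```python
import random

def sample_corpus(docs, fraction=0.10, seed=42):
    """Draw a reproducible random sample of the corpus (10% by default)."""
    rng = random.Random(seed)
    size = max(1, round(len(docs) * fraction))
    return rng.sample(docs, size)

def pick_num_topics(docs, candidates, train_and_score):
    """Score each candidate number of topics on the sample, return the best.

    train_and_score(sample, k) is a hypothetical callback: it should fit a
    model with k topics and return a quality score (higher is better).
    """
    sample = sample_corpus(docs)
    scores = {k: train_and_score(sample, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def toy_score(sample, k):
    # Stand-in scorer for demonstration only: it simply peaks at k = 40.
    return -abs(k - 40)

corpus = [f"doc{i}" for i in range(1000)]  # placeholder documents
best_k, scores = pick_num_topics(corpus, [5, 10, 20, 40, 50], toy_score)
print(best_k)
```

Fixing the random seed keeps the sample identical across candidate values, so differences in score come from the number of topics and not from the documents drawn.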

Finally, as LDA is fitted iteratively, the number of iterations matters for its convergence. When running on the sample corpus with different numbers of topics, we set the number of iterations to 100 in order to lower the execution time; it is set to 1,000 when we run the model on the whole corpus. The number is kept low for the sample because we do not need the model to converge there, only to give us an insight into how fast it will.
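The two iteration settings can be captured in a small sketch around PySpark's `pyspark.mllib.clustering.LDA.train`. This assumes the Python API of MLlib; the project itself may use the Scala API instead. The names `SAMPLE_RUN`, `FULL_RUN` and `train_lda` are hypothetical, and `corpus_rdd` is assumed to be an RDD of (document id, term-count vector) pairs.

```python
# Iteration counts taken from the text: 100 on the sample, 1,000 on the full corpus.
SAMPLE_RUN = {"maxIterations": 100}
FULL_RUN = {"maxIterations": 1000}

def train_lda(corpus_rdd, num_topics, settings):
    """Fit an LDA model with the given number of topics and iteration count."""
    # Imported here so the settings above can be inspected without PySpark installed.
    from pyspark.mllib.clustering import LDA
    return LDA.train(
        corpus_rdd,
        k=num_topics,
        maxIterations=settings["maxIterations"],
    )
```

A call such as `train_lda(corpus_rdd, 50, FULL_RUN)` would then correspond to the full run on the cable data, with `SAMPLE_RUN` used during the search for the number of topics.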

At the end of the execution, we get two output files per corpus. The file named "Topic Matrix" contains the topics, each represented by its 10 most relevant words and the weight of each word for that topic. Each topic is then named by hand for clarity. The other file, "Topic Distributions", lists the most relevant topics for each document.
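Both outputs boil down to ranking by weight, which the sketch below illustrates. The on-disk format of the two files is not described here, so the helpers work on plain Python lists; the function names and the sample numbers are made up for illustration.

```python
def top_terms(topic_weights, vocab, n=10):
    """Rank the vocabulary terms of one topic by decreasing weight."""
    ranked = sorted(zip(vocab, topic_weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

def top_topics(doc_distribution, n=3):
    """Rank the topics of one document by decreasing weight."""
    ranked = sorted(enumerate(doc_distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Illustrative values only: a four-word vocabulary and one topic's weights.
vocab = ["bonn", "visit", "media", "turkish"]
weights = [0.0321, 0.0005, 0.0010, 0.0273]

print(top_terms(weights, vocab, n=2))     # [('bonn', 0.0321), ('turkish', 0.0273)]
print(top_topics([0.1, 0.7, 0.2], n=2))   # [(1, 0.7), (2, 0.2)]
```

The first helper mirrors what "Topic Matrix" stores per topic, and the second mirrors what "Topic Distributions" stores per document.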

Now, let's look at some topics from the cable data. Here are 3 hand-picked topics, each with its 10 most important terms. (The quoted topic titles were added by hand for clarity.)

"turkish-german relations"		"visit schedules"		"middle east"
term	weight	term	weight	term	weight
bonn	0.0321	visit	0.2369	media	0.0207
turkish	0.0273	october	0.0358	african	0.0195
german	0.0272	august	0.0323	arab	0.0170
federal	0.0262	november	0.0293	speech	0.0169
berlin	0.0241	september	0.0266	israeli	0.0159
ankara	0.0214	saudi	0.0232	coverage	0.0149
european	0.0211	washington	0.0215	continued	0.0148
reps	0.0164	march	0.0192	middle	0.0145
western	0.0161	january	0.0182	east	0.0144
panama	0.0160	june	0.0180	israel	0.0141

All these files and the implementation of LDA can be found on GitHub:

For the Afghan war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/afg

For the Iraqi war diary -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/wardiary/topic-extraction/irq

For the cable data -> https://github.com/fouweric/wikileaks-data-analytics/tree/master/groups/plusd/topic-extraction
