PlusD group

Public Library of US Diplomacy (PlusD):
 
Abstract:
Roughly 2.3 million US diplomatic cables have been leaked so far, at different times and by different sources. We will gather these separate datasets and analyze the relations between the cables. Based on these relations, the cables can be grouped by topic, with any two cables on related topics linked together, which provides a new way to traverse the data. The relations will then be analyzed to visualize how trends in the cables change over time and space.
 
Project Goals:
 
Plan:
  • Crawl the data
  • Clean the data and use the labels provided with each document to build a set of parameters for it
  • Analyze the text of each document to assign topic labels
  • Find relations between documents and group them accordingly
  • Analyze the results for insight into how the flow of cables changes
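
As an illustration of the "find relations between documents" step, here is a minimal sketch that links documents whose TF-IDF vectors have a high cosine similarity. This is only one possible approach, not necessarily the one the project will use. The function names, the tokenizer, and the threshold value are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_documents(docs, threshold=0.05):
    """Return (i, j, similarity) for every pair at or above the threshold.

    The threshold is an arbitrary illustrative value and would need tuning.
    """
    vectors = tfidf_vectors(docs)
    links = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                links.append((i, j, sim))
    return links
```

The pairwise loop is quadratic in the number of documents, so on the full dataset it would have to be replaced by something that scales better, such as an inverted index or approximate nearest-neighbour search.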
Technique:
 
Resources/Dataset:
Cables are available at:
https://archive.org/details/wikileaks-cables-csv
 
Milestones:
31 March: Finish looking for new datasets and get familiar with the data. Upload the data to the cluster and import it into a database with meaningful attributes.
14 April: Have a simple back-end for querying the data and running "simple" aggregations
(Tasks up to this point will be very similar to those of the team members working on the War Diaries)
28 April: Experiment with more advanced analysis tools/NLP
12 May: Complete the visualization and the report
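
To give an idea of the "simple" aggregations planned for the 14 April milestone, the sketch below counts cables per year and origin from CSV input. The column names (`date`, `origin`, `classification`) and the sample rows are hypothetical placeholders, not the actual layout of the dataset, which still has to be inspected.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real CSV columns may be named and ordered differently.
SAMPLE = (
    "date,origin,classification\n"
    "1973-05-02,Embassy A,CONFIDENTIAL\n"
    "1973-11-17,Embassy A,UNCLASSIFIED\n"
    "1974-01-09,Embassy B,SECRET\n"
)


def cables_per_year_and_origin(csv_file):
    """Count cables per (year, origin), assuming ISO dates in a 'date' column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = row["date"][:4]  # first four characters of YYYY-MM-DD
        counts[(year, row["origin"])] += 1
    return counts


if __name__ == "__main__":
    counts = cables_per_year_and_origin(io.StringIO(SAMPLE))
    for (year, origin), n in sorted(counts.items()):
        print(year, origin, n)
```

Once the data is in a database, the same aggregation would more naturally be a GROUP BY query. The Python version is only meant to show the shape of the result.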