PlusD and NLP

Cable analysis using NLP:

To analyze and extract information from the cables, we used several Natural Language Processing (NLP) techniques. Here is a short description of each.

Named Entity Recognizer (NER): 

The main function of a NER is to find so-called entities (e.g. Person, Location, Organization, Date, ...) in a text and extract them.

In our case we used the Stanford NER library:

http://nlp.stanford.edu/software/CRF-NER.shtml

It is provided with several pre-trained models. The results were quite good, even though we got a few mistakes, mostly because our documents are not purely text data (they also contain arrays).

The extracted information was reused in the following part:
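As an illustration of how tagged output can be turned into entities, here is a minimal sketch in Python. It assumes the tagger output is available as "word/LABEL" tokens, with "O" marking tokens outside any entity; the function name group_entities and the sample sentence are hypothetical and not taken from our code.

```python
from itertools import groupby


def group_entities(tagged_text, outside_label="O"):
    """Turn 'word/LABEL' tokens into a list of (entity_text, label) pairs."""
    tokens = []
    for item in tagged_text.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, _, label = item.rpartition("/")
        tokens.append((word, label))

    entities = []
    # Consecutive tokens that share a label are merged into one entity.
    for label, run in groupby(tokens, key=lambda pair: pair[1]):
        if label != outside_label:
            entities.append((" ".join(word for word, _ in run), label))
    return entities


# Hypothetical tagged sentence, for illustration only.
sample = "The/O ambassador/O met/O John/PERSON Smith/PERSON in/O Paris/LOCATION ./O"
print(group_entities(sample))
```

Merging consecutive tokens with the same label is a simplification: two distinct entities of the same type that appear side by side would be merged into one.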

Topic Classification with Latent Dirichlet Allocation (LDA):

LDA is a generative model that describes each document as a mixture of topics and each topic as a distribution over words. Fitting the model on a set of documents extracts these topics, which makes it possible to classify the documents according to them. The main advantages of LDA are the accuracy of the topics and its performance on a high number of documents.
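To make the idea concrete, here is a minimal sketch of LDA in plain Python. It uses collapsed Gibbs sampling, which is chosen here only because it is short to write; it is not necessarily the inference algorithm used by Spark, and it is not our implementation. The function name lda_gibbs, the hyperparameter values and the toy documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic_counts, topic_word_counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Put the token back with its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


# Toy corpus, for illustration only.
docs = [
    "oil price export oil".split(),
    "visa passport embassy visa".split(),
    "oil export price".split(),
    "embassy visa passport".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2, iterations=50)
print(doc_topic)  # per-document topic counts
```

The per-document topic counts can be read as the mixture of topics in each document, and the per-topic word counts as the words that characterize each topic.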

Spark 1.3 offers a library (MLlib.clustering.LDA), which we used to implement our model. To get the best out of the model, we tried different numbers of topics on a sample corpus, then used the optimal number of topics when running the model on the whole set of documents.
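The search over the number of topics can be sketched as a simple loop. The sketch below is in plain Python and does not call Spark. It assumes that each candidate is evaluated with a single score where higher is better; the actual criterion is not specified here, and placeholder_score is a made-up stand-in for "train the model with k topics on the sample corpus and evaluate it".

```python
def pick_num_topics(candidates, score):
    """Return the candidate with the highest score, plus all scores."""
    scores = {k: score(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def placeholder_score(k):
    # Stand-in only: a real score would come from training and evaluating
    # an LDA model with k topics on the sample corpus.
    return -abs(k - 20)


best, scores = pick_num_topics([5, 10, 20, 40], placeholder_score)
print(best)  # the value to reuse on the whole set of documents
```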

After the topics are extracted, each one is named by hand according to its most relevant words, to make its meaning clear.
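Listing the most relevant words of each topic can be sketched as follows, in plain Python. It assumes the model output is available as one list of word weights per topic, aligned with a vocabulary list. The weights, the vocabulary and the topic names below are made up for illustration.

```python
def top_words(topic_word_weights, vocabulary, n=3):
    """Return, for each topic, the n words with the largest weight."""
    result = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Made-up vocabulary and weights for two topics.
vocabulary = ["oil", "embassy", "visa", "price", "export", "passport"]
topic_word_weights = [
    [0.40, 0.02, 0.01, 0.30, 0.25, 0.02],
    [0.01, 0.35, 0.30, 0.03, 0.03, 0.28],
]

words_per_topic = top_words(topic_word_weights, vocabulary)
for topic_id, words in enumerate(words_per_topic):
    print(topic_id, words)

# The names are then chosen by a person after reading the word lists.
topic_names = {0: "energy trade", 1: "consular affairs"}  # hypothetical names
```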

NLP pipeline:

We also used an NLP pipeline, this time with the SistaNLP processors library:

https://github.com/sistanlp/processors

It's mainly a Scala wrapper for CoreNLP from the Stanford NLP group. It's powerful, and it's easy to add or remove a step in the pipeline. It was used in the analysis of the cables (tokenization, parsing, NER, dependency analysis) to extract information about money and currency.
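The idea of a pipeline whose steps can be added or removed can be sketched in a few lines of plain Python. This sketch does not use the processors library or CoreNLP: the step functions, the regular expression and the currency table are simplified assumptions, and the money step is a naive token pattern rather than the parse-based extraction described above.

```python
import re

# Hypothetical lookup tables, for illustration only.
CURRENCIES = {"$": "USD", "USD": "USD", "EUR": "EUR", "€": "EUR"}
MAGNITUDES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def tokenize(doc):
    # Naive tokenizer: numbers (with ',' or '.'), words, or single symbols.
    doc["tokens"] = re.findall(r"\d[\d,.]*\d|\d|\w+|[^\w\s]", doc["text"])
    return doc


def find_money(doc):
    # Look for a currency token followed by a number and an optional magnitude.
    tokens = doc["tokens"]
    found = []
    for i in range(len(tokens) - 1):
        currency = CURRENCIES.get(tokens[i])
        number = tokens[i + 1].replace(",", "")
        if currency is None or not re.fullmatch(r"\d+(\.\d+)?", number):
            continue
        amount = float(number)
        if i + 2 < len(tokens):
            amount *= MAGNITUDES.get(tokens[i + 2].lower(), 1)
        found.append((amount, currency))
    doc["money"] = found
    return doc


def run_pipeline(text, steps):
    # Each step receives the document dictionary and returns it enriched.
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


text = "The ministry requested USD 2.5 million and later $300,000."
result = run_pipeline(text, [tokenize, find_money])
print(result["money"])
```

Because the pipeline is just a list of steps, removing the money extraction is a matter of calling run_pipeline(text, [tokenize]), and adding a new step means appending another function to the list.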