PlusD and NLP

Cable analysis using NLP:

To analyze the cables and extract information from them, we used several Natural Language Processing (NLP) techniques. Here is a short description of each.

Named Entity Recognizer (NER):

The main function of a NER is to find so-called entities (e.g. Person, Location, Organization, Date, ...) in a text and extract them.

In our case we used the Stanford NER library:

http://nlp.stanford.edu/software/CRF-NER.shtml

It comes with several pre-trained models. The results were quite good, although we got a few mistakes, mostly because our documents are not purely text data (they also contain arrays).

The extracted information was reused in the following parts:

Money analysis
Graph building
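
As an illustration of how tagger output can be turned into entities, here is a minimal Python sketch. It assumes the tagger returns one (token, label) pair per token, with "O" marking tokens outside any entity; the function name `group_entities` and the sample sentence are made up for this example and are not taken from our code.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_label="O"):
    """Merge consecutive tokens that share a label into (text, label) entities."""
    entities = []
    # groupby yields one run per stretch of consecutive identical labels
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == outside_label:
            continue  # skip tokens that are not part of any entity
        text = " ".join(token for token, _ in run)
        entities.append((text, label))
    return entities


# Made-up tagger output, for illustration only
sample = [
    ("John", "PERSON"), ("Smith", "PERSON"), ("visited", "O"),
    ("Geneva", "LOCATION"), ("in", "O"), ("2009", "DATE"),
]
entities = group_entities(sample)
```

Note that this simple grouping merges two distinct entities of the same type when they are directly adjacent, which is one possible source of mistakes on array-like content.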

Topic Classification with Latent Dirichlet Allocation (LDA):

LDA is a generative model that represents each document as a mixture of topics and each topic as a distribution over words. Fitting it iterates over the documents and extracts the topics, which lets us classify documents according to these topics. The main advantages of LDA are the accuracy of its topics and its performance on a large number of documents.

Spark 1.3 offers an LDA implementation (MLlib.clustering.LDA), which we used to implement our model. To get the best out of the model, we tried different numbers of topics on a sample corpus, then used the best-performing number of topics when running the model on the whole set of documents.
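
The search over the number of topics can be sketched as follows. This is only an illustration in plain Python: the page does not state which criterion was used, so the sketch assumes a lower-is-better score (for example perplexity on held-out documents). The names `choose_num_topics` and `score_model`, and the scores in the usage example, are hypothetical placeholders, not measured results.

```python
def choose_num_topics(candidates, score_model):
    """Return (best_k, scores), where a lower score is assumed to be better.

    score_model is a hypothetical callable that trains a model with k topics
    on the sample corpus and returns its score.
    """
    scores = {k: score_model(k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Placeholder scores standing in for real training runs
placeholder_scores = {5: 910.0, 10: 850.0, 20: 870.0}
best_k, scores = choose_num_topics([5, 10, 20], placeholder_scores.get)
```

In practice `score_model` would wrap the actual LDA training on the sample corpus; the selected `best_k` is then reused for the run on the full set of documents.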

After the topics are extracted, each one is named by hand according to its most relevant words, to make it clear what it covers.
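
To name a topic by hand, one first needs its most relevant words. A minimal sketch, assuming the model exposes one list of word weights per topic, aligned with a vocabulary list; the function name `top_words`, the vocabulary and the weights below are invented for illustration.

```python
def top_words(topic_word_weights, vocabulary, n=3):
    """For each topic, return the n vocabulary words with the highest weight."""
    result = []
    for weights in topic_word_weights:
        # Rank word indices by decreasing weight within this topic
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Invented vocabulary and weights, for illustration only
vocabulary = ["oil", "visa", "trade", "embassy", "tariff"]
weights = [
    [0.40, 0.05, 0.30, 0.05, 0.20],  # topic 0
    [0.05, 0.45, 0.05, 0.40, 0.05],  # topic 1
]
words_per_topic = top_words(weights, vocabulary, n=2)
```

A person reading these lists could then label topic 0 as something like "trade" and topic 1 as "consular affairs".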

NLP pipeline:

We also used an NLP pipeline, this time with the SistaNLP processors library:

https://github.com/sistanlp/processors

It is mainly a Scala wrapper for CoreNLP from the Stanford NLP group. It is powerful, and it is easy to add or remove a step in the pipeline. We used it in the analysis of the cables (tokenization, parsing, NER, dependency analysis) to extract information about money and currency.
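
The last step, turning a money mention into an amount and a currency, could look roughly like the sketch below. It is a simplified stand-in written in Python, not the pipeline itself: it assumes the mention has already been isolated (for example as a MONEY entity) and tokenized on spaces, and it does not use the dependency analysis. The function name `parse_money` and the symbol and multiplier tables are assumptions made for this example.

```python
import re

# Assumed lookup tables; extend as needed
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def parse_money(text):
    """Parse a mention such as '$ 5 million' into (amount, currency), or None."""
    currency = None
    amount = None
    multiplier = 1.0
    for token in text.split():
        if token in CURRENCY_SYMBOLS:
            currency = CURRENCY_SYMBOLS[token]
        elif re.fullmatch(r"[A-Z]{3}", token):
            currency = token  # treat a three-letter uppercase token as a currency code
        elif re.fullmatch(r"\d[\d,]*(\.\d+)?", token):
            amount = float(token.replace(",", ""))
        elif token.lower() in MULTIPLIERS:
            multiplier = MULTIPLIERS[token.lower()]
    if amount is None:
        return None  # no numeric amount found in the mention
    return amount * multiplier, currency
```

For example, `parse_money("$ 5 million")` gives an amount of five million with currency "USD", and a mention without any number returns `None`.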
