Dating articles with statistics on punctuation

This distance uses punctuation statistics and the average sentence length to estimate the year of articles. The Scala/Spark code takes a sample of articles from a given year (e.g. 15 of them) and treats them as one big article. Here is the result for 15 articles from 1925:

[Figure: distance results for 15 articles from 1925]

As we can see, this distance does not work very well. As shown in the statistics (https://wiki.epfl.ch/bigdata2015-linguistic-drift-le-temps/sentences-length and https://wiki.epfl.ch/bigdata2015-linguistic-drift-le-temps/punct-stats), there are some peaks. We think the data should be cleaned for those years, but we did not find the particular cases that create those peaks.
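The distance described above can be sketched in plain Scala (the names, the Euclidean distance, and the exact punctuation set are our assumptions for illustration; the original project code uses Spark):

```scala
// Sketch of the punctuation-statistics distance.
// Hypothetical names; a plain-Scala illustration of the idea, not the Spark code.
object PunctuationDistance {
  val punctuation = Seq('.', ',', ';', ':', '!', '?')

  // Feature vector: relative frequency of each punctuation mark,
  // followed by the average sentence length in words.
  def features(text: String): Vector[Double] = {
    val chars = text.length.toDouble.max(1.0)
    val punctFreqs = punctuation.map(p => text.count(_ == p) / chars)
    val sentences = text.split("[.!?]").map(_.trim).filter(_.nonEmpty)
    val avgLen =
      if (sentences.isEmpty) 0.0
      else sentences.map(_.split("\\s+").length).sum.toDouble / sentences.length
    (punctFreqs :+ avgLen).toVector
  }

  // Euclidean distance between two feature vectors.
  def distance(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => val d = x - y; d * d }.sum)

  // Guess the year whose aggregated sample ("one big article")
  // is closest to the article under this distance.
  def closestYear(article: String, samples: Map[Int, String]): Int =
    samples.minBy { case (_, text) =>
      distance(features(article), features(text))
    }._1
}
```

In this sketch, concatenating a year's sampled articles into one string before calling `features` plays the role of treating them as one big article.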

Using a machine learning method would be a good idea; unfortunately, we did not have the time to implement it.