TF-IDF algorithm

Intro

In information retrieval, most standard algorithms make use of term frequencies (TF) and inverse document frequencies (IDF). The goal of such an algorithm is to rank documents according to their relevance to a query. TF measures how relevant a word is to a given document, while IDF measures how discriminative a word is across the full corpus of documents. For example, a word that appears in most of the documents should have little impact on relevance, whereas a word that appears in very few documents makes those documents highly relevant when it appears in the query.

The algorithm

TF(document, word) = #occurrences of word in document / #words in document

IDF(word) = log(#documents / #documents containing word)
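
For instance (numbers chosen only for illustration), in a corpus of 100 documents where a word appears in 4 of them, and in a 200-word document containing that word 5 times:

TF = 5 / 200 = 0.025
IDF = log(100 / 4) = log(25) ≈ 3.22 (natural logarithm)
TF-IDF = TF × IDF ≈ 0.08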

to be completed soon...
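
In the meantime, here is a minimal sketch of how the two formulas above could be combined to rank documents against a query. It assumes a plain whitespace tokenisation and a toy corpus; all names and example documents are illustrative, not part of the project.

import math

def tf(document_tokens, word):
    # Term frequency: #occurrences of word in document / #words in document.
    return document_tokens.count(word) / len(document_tokens)

def idf(corpus_tokens, word):
    # Inverse document frequency: log(#documents / #documents containing word).
    containing = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(len(corpus_tokens) / containing) if containing else 0.0

def rank(corpus, query):
    # Score each document as the sum of the TF-IDF weights of the query words,
    # then return the documents from most to least relevant.
    corpus_tokens = [doc.lower().split() for doc in corpus]
    query_tokens = query.lower().split()
    scores = []
    for raw, tokens in zip(corpus, corpus_tokens):
        score = sum(tf(tokens, w) * idf(corpus_tokens, w) for w in query_tokens)
        scores.append((score, raw))
    return sorted(scores, reverse=True)

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "archives of the digital humanities lab",
    ]
    for score, doc in rank(corpus, "cat mat"):
        print(f"{score:.3f}  {doc}")

Running this toy example, the first document scores highest because "mat" is rare in the corpus, which matches the intuition given in the introduction.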

Why implement it?

Even though the TF-IDF algorithm doesn't answer our main objective, we find that implementing it is worthwhile for several reasons. First of all, the algorithms that we are going to implement next are very likely to rely on TF and IDF as well. Furthermore, this information retrieval algorithm may prove useful when testing and checking the results of other algorithms. It might also be an interesting addition for the DHLab.