Cleaning OCR errors
Here are some ideas for correcting the errors introduced by OCR and for handling names of people, places, etc.
- For names:
- They probably start with a capital letter -> find all capitalized words
- Determine whether each one is a standard word or a proper name (how to do that?)
- Replace proper names with a specific token (that we probably ignore in further computations)
- Convert all remaining capital letters to lowercase
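The name-handling steps above can be sketched as follows. This is only one possible answer to the open question of how to tell a standard word from a proper name: it assumes a capitalized word is a name whenever its lowercase form is missing from a known vocabulary. The names `KNOWN_WORDS`, `NAME_TOKEN`, and `mask_names` are hypothetical, and the tiny vocabulary is for illustration only.

```python
import re

# Hypothetical placeholder that later computations can ignore.
NAME_TOKEN = "<name>"

# Illustrative vocabulary (assumption); a real one would come from a French dictionary.
KNOWN_WORDS = {"il", "habite", "à", "le", "la", "pays", "dans"}

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def mask_names(text, vocabulary=KNOWN_WORDS, token=NAME_TOKEN):
    """Replace capitalized words missing from the vocabulary by a token,
    and lowercase every other word."""
    def replace(match):
        word = match.group(0)
        lowered = word.lower()
        # Capitalized and unknown once lowercased: treat it as a proper name.
        if word[0].isupper() and lowered not in vocabulary:
            return token
        return lowered

    return WORD_RE.sub(replace, text)


print(mask_names("Il habite à Lausanne"))
```

A capitalized word at the start of a sentence ("Il") is kept because its lowercase form is in the vocabulary, while "Lausanne" is replaced by the token. The quality of the result depends entirely on how complete the vocabulary is.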
- How to correct errors:
- find a list of common OCR errors (such as the one linked below) and correct them.
- Things to correct:
- remove isolated letters (caused by a spurious space: d e instead of de)
- remove punctuation
- remove special characters (; : / - ' < >> << > & etc.)
- q.([a-z]) -> qu$1
- so -> se (whole word only)
- pir -> par (whole word only)
- pavs -> pays (whole word only)
- êlre -> être
- iii -> m
- ii -> n
- bc -> be
- fiile -> fille
- compute tf-idf scores, then remove all words whose score is too low.
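The substitution rules listed above can be applied with a small table of regular expressions. The sketch below is a minimal illustration, not a complete cleaner: it covers punctuation removal and the replacement rules only, and the names `RULES` and `clean` are hypothetical. Rule order is an assumption that matters here: fiile must be handled before ii, and iii before ii, otherwise the shorter pattern fires first.

```python
import re

# (pattern, replacement) pairs taken from the list above.
# \b restricts a rule to whole words.
RULES = [
    (r"q.([a-z])", r"qu\1"),   # as written in the list; note that "." also matches a space
    (r"\bso\b", "se"),
    (r"\bpir\b", "par"),
    (r"\bpavs\b", "pays"),
    (r"êlre", "être"),
    (r"fiile", "fille"),       # must come before the "ii" rule
    (r"iii", "m"),             # must come before the "ii" rule
    (r"ii", "n"),
    (r"bc", "be"),
]


def clean(text):
    """Lowercase, strip punctuation, apply the substitution rules in order."""
    text = text.lower()
    # Replace punctuation and special characters by a space.
    text = re.sub(r"[^\w\s]", " ", text)
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    # Collapse the runs of spaces left by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean("La fiile du pavs, êlre."))
```

Rules such as bc -> be or ii -> n are applied everywhere, so they can also damage correct words; checking each result against a dictionary before accepting it would be a safer variant.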
Interesting links:
- http://www.reverso.net/spell-checker/french-spelling-grammar/
- http://usesofscale.com/gritty-details/basic-ocr-correction/
- http://arxiv.org/pdf/1204.0191.pdf
- http://www.sciencedirect.com/science/article/pii/0306457383900225
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.82.369&rep=rep1&type=pdf
- http://www.infogridpacific.com/blog/igp-blog-20130317-ocr-production-nightmares.html (list of common mistakes)
- Computing the distance between two strings: http://www.gettingcirrius.com/2011/06/calculating-similarity-part-3-damerau.html
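As an illustration of the string-distance idea, here is a minimal sketch of the optimal string alignment distance, a restricted variant of the Damerau-Levenshtein distance, used to pick the closest dictionary word for an OCR token. The function names `osa_distance` and `closest_word` and the sample vocabulary are hypothetical and not taken from the linked article.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i            # delete all i characters of a
    for j in range(cols):
        d[0][j] = j            # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def closest_word(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to word."""
    return min(vocabulary, key=lambda candidate: osa_distance(word, candidate))


print(closest_word("pavs", ["pays", "par", "fille"]))
```

In practice a maximum distance threshold would be needed so that tokens far from every dictionary word are left untouched rather than forced onto an unrelated word.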