I just realised that DT3 allows us to configure custom dictionaries to help with optical character recognition. Instead of entering every word manually, I’d love to be able to import a list of words from a file, e.g. plain text with one word per line. Or is this (also) meant to group certain words for entity recognition (e.g. New York) and some NLP magic in the background?
The main use of the custom dictionary is to help the OCR recognise words that are not in the standard dictionary for a language. For example, if you had a company product name like “zort”, the OCR would probably think that “sort” was more likely and use that. Adding “zort” to the custom dictionary gives that word extra weight in the recognition stage.
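As a rough illustration of the weighting idea only (this is not the actual OCR engine code; the confidence scores, the `boost` value and the `pick_candidate` helper are all invented for the example), you can picture it like this in Python:

```python
def pick_candidate(candidates, custom_words, boost=0.2):
    """Pick the best word from (word, confidence) pairs.

    Words found in the custom dictionary get a fixed bonus.
    The bonus size is an arbitrary assumption for this sketch.
    """
    def score(item):
        word, confidence = item
        bonus = boost if word.lower() in custom_words else 0.0
        return confidence + bonus

    return max(candidates, key=score)[0]


# Hypothetical recogniser output for one scanned word.
candidates = [("sort", 0.55), ("zort", 0.45)]

print(pick_candidate(candidates, custom_words=set()))     # no custom dictionary
print(pick_candidate(candidates, custom_words={"zort"}))  # "zort" was added
```

Without the custom entry the more common word wins; with it, the bonus is enough to tip the choice towards “zort”.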
That’s what I thought, thanks for clarifying. Any thoughts on the import function? (I actually compiled an extensive list for post-processing wrongly recognized words so I could process them mechanically in a past project.)
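For what it’s worth, a file like that would be simple to read. A minimal Python sketch, assuming plain text with one entry per line; `parse_word_list` is a made-up helper for illustration, not anything DT3 provides:

```python
def parse_word_list(text):
    """Return the unique, non-empty entries of a one-entry-per-line list.

    Entries are compared case-insensitively, so "Zort" and "zort"
    count as the same word; the first spelling seen is kept.
    """
    seen = set()
    words = []
    for line in text.splitlines():
        word = line.strip()
        if not word:
            continue  # skip blank lines
        key = word.casefold()
        if key not in seen:
            seen.add(key)
            words.append(word)
    return words


sample = "zort\n\nZort\nNew York\n"
print(parse_word_list(sample))
```

Multi-word entries such as “New York” survive because each line is kept as a whole rather than split on spaces.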
We will look to add this option, although it probably will not be until after the initial v3 release.
I wish DT had a more intelligent way to index OCR documents, to avoid databases with zillions of indexed words. For example, it could use statistical analysis: no language in the world has words with ten consecutive consonants (or vowels). It could try to guess which language the document is in and leave out words of that kind. Or it could treat words found in the dictionary as searchable words and keep the other words separate. It could even let the user discard words. Or use AI.
Basically, what I mean is “oirjewjkejwhfjqhfjhewuygweyguijlklhjdghgeuywuyewqu” is not a word, but it is in my indexed terms… I know it is a difficult task…
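To show the kind of check I have in mind, here is a minimal Python sketch of the consecutive-letters rule. The function names and the limit of nine are my own assumptions, and it only knows the five plain vowels, so accented vowels would be counted as consonants:

```python
VOWELS = set("aeiou")


def longest_run(word):
    """Length of the longest run of consecutive consonants or vowels."""
    longest = current = 0
    previous_kind = None
    for ch in word.lower():
        if not ch.isalpha():
            # Digits, hyphens, etc. break a run.
            previous_kind, current = None, 0
            continue
        kind = "v" if ch in VOWELS else "c"
        current = current + 1 if kind == previous_kind else 1
        previous_kind = kind
        longest = max(longest, current)
    return longest


def is_plausible_word(word, max_run=9):
    """Reject words with ten or more consonants (or vowels) in a row."""
    return longest_run(word) <= max_run


garbage = "oirjewjkejwhfjqhfjhewuygweyguijlklhjdghgeuywuyewqu"
print(is_plausible_word(garbage))      # contains "jwhfjqhfjh", ten consonants
print(is_plausible_word("strengths"))  # longest run is "ngths", five consonants
```

A rule this crude will still let plenty of shorter junk through, so it would only be a first filter.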
My suggestion is to store two dictionaries for each database: one with the words recognized via dictionaries (allowing for prefixes, suffixes, and so on) and a second one with the remaining words. Then, for search purposes, use only the first table by default. If we want a complete search, add a search option to enable it.
This way we can cut down the gigantic databases of nonsense words.
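A minimal Python sketch of the two-table idea, just to make it concrete. `SplitIndex` and its methods are hypothetical names, it says nothing about how DT actually stores its index, and it skips the prefix/suffix handling entirely:

```python
class SplitIndex:
    """Toy index that keeps dictionary words and other terms apart."""

    def __init__(self, dictionary):
        self.dictionary = {w.lower() for w in dictionary}
        self.known = {}    # term -> set of document ids (dictionary words)
        self.unknown = {}  # term -> set of document ids (everything else)

    def add(self, doc_id, text):
        # Naive whitespace tokenisation; a real indexer would do far more.
        for term in text.lower().split():
            table = self.known if term in self.dictionary else self.unknown
            table.setdefault(term, set()).add(doc_id)

    def search(self, term, include_unknown=False):
        term = term.lower()
        hits = set(self.known.get(term, set()))
        if include_unknown:
            hits |= self.unknown.get(term, set())
        return hits


index = SplitIndex(["invoice", "total"])
index.add(1, "Invoice total xqzzt")

print(index.search("invoice"))                      # found in the first table
print(index.search("xqzzt"))                        # ignored by default
print(index.search("xqzzt", include_unknown=True))  # complete search
```

The nonsense terms are still stored, so nothing is lost; they simply stay out of the way unless the complete search is requested.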