Some feedback on the OCR accuracy - for me, a key factor in deciding whether to make DT my ‘trusted’ source is how reliably I can find what’s in the documents I hold in the database. So I tested the OCR accuracy, and there’s a pretty wide divergence.
METHOD - I imported/OCR'd the same PDF document in 3 programs (DTP 1.5, DTP2 Beta and Adobe Acrobat) at various settings and compared the results. The document was in Spanish.
RESULTS - Each method produced a different word count. DTP2 produced fewer nonsense words but missed some significant blocks of text and numbers.
Here are the word counts - Acrobat 714 Words (some nonsense), DTP1.5 727 Words (some nonsense, all numbers accurate), DTP2 626 Words (no nonsense but text, numbers and dates missing).
Examples -
Here’s a screenshot from the DTP1.5 import. Scroll all the way to the left of the image to see the Word list and Word counts. The number ‘160’ is in the document and correctly identified in the word list. There are also some obvious nonsense words in the word list -
DTPnonbeta.tiff (319 KB)
And here’s the result from the DTP2b3 import. Again, scroll to the left to see the Word count info. The number ‘160’ is nowhere to be found in the word list. There are about 100 fewer words (626 versus 727); admittedly some of those were nonsense words, but a lot of text, dates and numbers are missing. There are no nonsense words, probably because the word list is checked against a dictionary.
DTP2Beta.tiff (355 KB)
The bottom line is that Acrobat and DTP1.5 give more data, albeit requiring some further analysis, and DTP2 loses significant amounts of data.
Something needs to be done to improve the inclusiveness of DTP2 OCR or at the very least give an indication of what is missing.
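For anyone who wants to repeat the comparison without eyeballing word lists, here is a rough Python sketch of the kind of check I mean. It assumes you have copied the OCR text out of each program into plain strings; the function names (tokenize, compare_ocr) are my own and are not part of DT or Acrobat. It counts the words in each result and lists the tokens, including numbers like ‘160’, that one result has and the other lacks.

```python
import re
from collections import Counter


def tokenize(text):
    """Split OCR text into lowercase word and number tokens.

    \\w is Unicode-aware for Python 3 strings, so accented Spanish
    letters stay inside their words.
    """
    return re.findall(r"\w+", text.lower())


def compare_ocr(reference_text, candidate_text):
    """Compare two OCR results of the same document.

    reference_text: text from the run treated as the baseline
    candidate_text: text from the run being checked for lost content
    """
    ref = Counter(tokenize(reference_text))
    cand = Counter(tokenize(candidate_text))
    # Counter subtraction keeps only tokens the candidate has fewer of
    missing = ref - cand
    return {
        "reference_words": sum(ref.values()),
        "candidate_words": sum(cand.values()),
        "missing": sorted(missing.elements()),
        "missing_numbers": sorted(t for t in missing if t.isdigit()),
    }


if __name__ == "__main__":
    # Made-up sample strings, not taken from the test document
    result = compare_ocr("Factura 160 del año 2009", "Factura del año")
    print(result["missing_numbers"])
```

Note that the "missing" list will also contain any nonsense words from the baseline run, so it still needs a quick read-through. The "missing_numbers" list is the part that should be empty if nothing important was dropped.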