DTTG v3.0 OCR questions

eleven · February 10, 2021, 11:20am

I noticed that the updated version 3.0 mentions “OCR converts scans to searchable PDFs on your device”. I can’t figure out how to do it on the iPad.

Does it mean that I can OCR an image-only PDF into a two-layer PDF with both text and image? Can this be done on iPhone and iPad?

BLUEFROG · February 10, 2021, 11:32am

? > Help > Manage Items > Convert Items

chrillek · February 10, 2021, 12:04pm

On a side note: The German version uses “Konvertieren/nach durchsuchbares PDF/nach Text”. I find the preposition a bit awkward and would suggest “in” instead of “nach”, which is what DT uses on the desktop.

eleven · February 10, 2021, 12:40pm

OK, thanks.

BLUEFROG · February 10, 2021, 12:41pm

You’re welcome.

eleven · February 11, 2021, 3:46am

Hello, I used the OCR function today and found that it does not support Chinese, which I really need!

I would like to ask whether a future version will support OCR for Chinese documents.

eboehnisch · February 11, 2021, 7:04am

Yes, it will. Some technicalities have to be solved first, though.

enGeo · February 11, 2021, 1:11pm

Can you elaborate on the OCR engine used and the implications of adjusting the OCR quality slider in the settings?

aedwards · February 11, 2021, 1:55pm

The OCR engine is Tesseract. The quality slider adjusts the level of compression applied to the image for each page when generating the final PDF file. In general around 75% gives a good compromise between file size and quality. Setting the quality value lower will make the PDF file size smaller but depending on the content it can make the text harder to read.

enGeo · February 11, 2021, 2:03pm

Thank you!
What would you say are the main differences between the ABBYY engine on Mac and Tesseract on iOS? And can you recommend which one to prefer if both are available?

aedwards · February 11, 2021, 3:17pm

It does depend on the content that you are OCR’ing. In general, however, ABBYY OCR in DEVONthink 3 will have a higher accuracy for text recognition across a range of different document types. This is not to say that Tesseract provides poor results; it is just that ABBYY OCR can utilise more system resources on the Mac than would be allowed or available on an iOS device.

jerwin · February 11, 2021, 5:48pm

A bit of a niche request, but since DTTG uses Tesseract, would it be possible to add in the deu_frak/frk libraries?

I have many German documents printed in Fraktur, and it would be nice to finally OCR them. Neither ABBYY FineReader Pro nor DEVONthink Pro can decipher the typeface. Letting DTTG tackle the task intrigues me.

chrillek · February 11, 2021, 5:53pm

Out of curiosity: does Tesseract recognize different gothic (if that’s the right term) fonts in general? I’m under the impression that Fraktur encompasses a huge gamut of character shapes. But maybe that’s not a problem or I’m mistaken?

jerwin · February 11, 2021, 6:51pm

No, generally OCR that’s not trained on Fraktur does quite poorly. Take this sample text, for instance.

fraktur-sample

The first line reads

“Aktum Dienstags den 3. März 1891”

DEVONthink reads it as

“Mtum Pitnfliigs öen 3. Miar? 1891.”

I expect that ABBYY FineReader Pro would come up with something similarly unhelpful, but it doesn’t work with my new Mac. Tesseract with the Fraktur libraries might be more successful.

Miwagner1 · February 11, 2021, 10:24pm

I would have thought that iOS could provide just as accurate results at a faster speed if the neural hardware on iOS is used. Perhaps in the future, once ABBYY is optimized for the new M1 Mac hardware, the kit should run fantastically on iOS as well. Either way, it’s mind-blowing to see PDF OCR on a phone in your pocket.

aedwards · February 12, 2021, 9:39am

Fraktur is not included in the standard ABBYY supported languages and requires a specialist licence. I had seen that Tesseract supports Fraktur; however, we haven’t had a chance to test it yet. We will be adding support for other languages in future updates and I have noted your request for Fraktur.

I assume you have an M1 Mac. In DEVONthink 3, the ABBYY OCR runs under Rosetta 2.

chrillek · February 12, 2021, 10:47am

As far as I understood their GitHub pages, Tesseract should run on your desktop system just fine. So you could actually find out how good the deu_frak rules are on your M1 Mac (it should be blazingly fast). I’m not so sure about the quality of their rules, though. They said something about deu_frak last being updated for version 3-something; 4 is the current one.
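
For anyone who wants to try this on the desktop, here is a minimal sketch that shells out to the Tesseract command-line tool. It assumes Tesseract is installed and on the PATH and that the Fraktur trained data (`frk`, or the older `deu_frak`) is present in its tessdata directory. The function names and the file name `fraktur-sample.png` are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="frk"):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes the CLI print the text
    # instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="frk"):
    """Run Tesseract if it is installed; return the text, or None if it is missing."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not on the PATH
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if Tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; replace with a scan of your own.
    print(ocr_image("fraktur-sample.png", lang="frk"))
```

Running it once with `lang="frk"` and once with `lang="deu"` on the same scan would be a quick way to compare the two models.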

Miwagner1 · February 12, 2021, 12:54pm

@aedwards I’m curious, why Tesseract instead of the Apple VisionKit framework?
I believe Vision is faster and more accurate than Tesseract, but it requires iOS 13+.

eboehnisch · February 12, 2021, 1:25pm

Apple Vision still supports only a very limited set of languages. But this is far from the final version of our own OCR framework.

Miwagner1 · February 12, 2021, 8:28pm

Can you make Apple OCR the default when the requested language is available and fall back to Tesseract when it is not?
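
The fallback I have in mind could look roughly like the sketch below. This is only an illustration of the selection logic, not how DTTG works internally: the function name is hypothetical, and the language set is a placeholder rather than Apple's actual list of supported languages.

```python
# Placeholder set; the real list of languages supported by Apple's
# text recognition would have to be queried from the system.
VISION_LANGUAGES = {"en"}


def choose_ocr_engine(language, vision_languages=VISION_LANGUAGES):
    """Prefer Apple's OCR when the language is supported, else use Tesseract."""
    if language.lower() in vision_languages:
        return "vision"
    return "tesseract"


if __name__ == "__main__":
    for lang in ("en", "zh", "frk"):
        print(lang, "->", choose_ocr_engine(lang))
```

With the placeholder set above, only English would go to Apple's OCR, while Chinese and Fraktur would still be handled by Tesseract.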