OCR Problems after Update 3.51

MauriceK · June 12, 2020, 5:24pm

Hi,

I have problems with OCR after updating to 3.51. The same PDF file (originally 150.4 KB, dimensions 20.7 x 29.5 cm) is 290.1 KB after OCR under 3.50, with dimensions of 20.8 x 29.5 cm. Under 3.51, the size after OCR is 319.7 KB and the dimensions are 43.2 x 61.5 cm. In addition, the image is much more blurred after OCR in 3.51 than in 3.50.

subzero · June 13, 2020, 9:00am

Hello,

Maybe the OCR settings are different now. I noticed that after the update they are set to 150 dpi by default, which of course leads to rather washed-out results.

Best regards
Subzero

MauriceK · June 13, 2020, 9:59am

Hi Subzero,

Yeah, I guess so. But even at 300 dpi there is still a visible difference. What really confuses me is that the dimensions change after OCR.

Kind regards
MauriceK

tjur · June 26, 2020, 5:09am

I received feedback from support that the last Abbyy update has a bug.

The error can be remedied by restoring the OCRHelper (~/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app) from a backup (e.g. Time Machine) to the previous version (1.1.0). Don’t forget to disable the Abbyy extension update.
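Before overwriting anything, it may be worth keeping a copy of the helper that is currently installed. The following is only a minimal sketch in Python: the helper path is the one mentioned above, while `backup_bundle` and the destination folder are hypothetical names chosen for illustration. It does not perform the restore itself, which still has to come from your own backup.

```python
import shutil
import time
from pathlib import Path

# Path as given in this thread; adjust it if your installation differs.
HELPER = Path.home() / "Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app"


def backup_bundle(bundle: Path, dest_dir: Path) -> Path:
    """Copy an .app bundle (a directory) into dest_dir under a timestamped name."""
    if not bundle.is_dir():
        raise FileNotFoundError(f"No bundle found at {bundle}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{bundle.stem}-{stamp}{bundle.suffix}"
    # symlinks=True keeps symbolic links inside the bundle as links
    shutil.copytree(bundle, target, symlinks=True)
    return target
```

With DEVONthink quit, calling `backup_bundle(HELPER, Path.home() / "Desktop")` would leave a timestamped copy on the Desktop, so the current helper can be put back later if needed.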

aedwards · June 26, 2020, 8:21am

The difference in page size is due to an issue with the ABBYY PDF exporter; the dimensions will usually be a multiple, or close to a multiple, of the original size.
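As a quick check of that against the figures from the first post, the ratio between the two page sizes can be worked out as below. This is just an illustrative sketch in Python (`scale_factors` is a made-up helper name), and it says nothing about why the exporter picks that factor.

```python
def scale_factors(original_cm, after_ocr_cm):
    """Return the (width, height) ratios between two page sizes."""
    (ow, oh), (nw, nh) = original_cm, after_ocr_cm
    return nw / ow, nh / oh


# Dimensions reported in the first post, in cm
sx, sy = scale_factors((20.7, 29.5), (43.2, 61.5))
print(f"width x{sx:.2f}, height x{sy:.2f}")  # width x2.09, height x2.08
```

Both ratios come out just above 2, which fits "close to a multiple" of the original size.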

MauriceK · June 26, 2020, 9:00am

Thanks for the information. Will ABBYY fix the problem?

aedwards · June 26, 2020, 9:36am

There is an update in the pipeline from ABBYY; however, we do not have a release date from them.

You mentioned that the document is more blurred after OCR. Is it possible to share the original document so I can try to determine why?

MauriceK · June 26, 2020, 4:50pm

Thanks for the offer, but unfortunately that is not possible for data protection reasons. I am also just starting to use DT, so I have not yet processed any other documents with OCR; so far this was only a first test. Actually, I was looking for an OCR solution that leaves the scanned image absolutely untouched. I have about 8000 PDF files in my archive, and I have adjusted the quality of the scans because the PDFs are now my “originals”. The image quality should not change under any circumstances.

pfarrelle · June 30, 2020, 11:34pm

Thank you @tjur for your detailed instructions. I went back to June 10 to find the 1.1.0 version from my Arq backups. I am all set to copy it into my installation but can’t find where to disable the Abbyy extension update. I can only find the pane for install add-ons that shows the extension has already been installed. Where is the control to prevent updates?

BLUEFROG · July 1, 2020, 12:03am

Quit DEVONthink and the ABBYY folder. Then do the restore.

pfarrelle · July 1, 2020, 12:30am

Thanks for the quick response Jim.

By quit the ABBYY folder do you mean delete it?

I’m assuming from what you (don’t) say that the ABBYY helper file will not update without my intervention, by updating DEVONthink for example? The comment from @tjur made me think there was a separate update path that I needed to prevent until the fix was released.

Here’s what I did: I quit DEVONthink, overwrote the DTOCRHelper file, left the languages.plist intact and then restarted DEVONthink.

Scans now work as expected.

FWIW, I only updated to Catalina last week and immediately noticed this issue. I thought that I might have been asleep at the wheel, but when I went back and looked at my scans made with 3.5.1, but before upgrading to Catalina, they all seem fine in terms of pdf dimensions.

Thanks, Paul

BLUEFROG · July 1, 2020, 4:18am

Yes. That was an accidental deletion as I was typing.

tjur · July 1, 2020, 5:25am

Just uncheck the Abbyy extension in install add-ons and click install. Then updates are disabled. Before restoring the OCR helper, quit DT. After restarting DT, you are able to uncheck the update…

MauriceK · July 1, 2020, 7:28pm

I finally ran a test with a simple document and compared OCRHelper 1.10 and 1.12. The attached zip file contains the original scan (300 dpi) from my ScanSnap, one version with OCR from 1.10, one with OCR from 1.12 at 150 dpi, and one with OCR from 1.12 at 300 dpi. I compared the results in Preview and made a screenshot. The original scan and the version from 1.10 are identical. The 150 dpi version from 1.12 is very blurry. The 300 dpi version from 1.12 is better, but still not as good as the version from 1.10, and the file is 4 times larger.

OCR Test.zip (1.7 MB)

aedwards · July 2, 2020, 8:26am

Thanks for the files, I will look into it.

aedwards · July 2, 2020, 3:10pm

In version 1.12 of the OCRHelper we had to change how we process PDF documents due to an issue with the ABBYY OCR. Whilst we are waiting for an update from ABBYY, we now have to break the PDF down into individual images before OCR. A byproduct of this change is that there may be some minor artefacts around the characters. At 150 dpi the image is scaled down, which amplifies these artefacts, whilst at 300 dpi they are significantly less pronounced.
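To give a rough sense of how much detail the downscaling removes, the sketch below computes the pixel dimensions of a page at both resolutions. It is only an illustration in Python: the page size is the 20.7 x 29.5 cm from the first post (the test document may differ), and `pixel_size` is a hypothetical helper, not part of the OCRHelper.

```python
CM_PER_INCH = 2.54


def pixel_size(width_cm, height_cm, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return (round(width_cm / CM_PER_INCH * dpi),
            round(height_cm / CM_PER_INCH * dpi))


full = pixel_size(20.7, 29.5, 300)
half = pixel_size(20.7, 29.5, 150)
print(full, half)  # (2445, 3484) (1222, 1742)

# Ratio of total pixel counts between the two renderings
print((full[0] * full[1]) / (half[0] * half[1]))
```

Halving the resolution leaves roughly a quarter of the pixels, so any artefacts around the characters take up a larger share of what remains.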

The final document is larger than the original because it is re-saved by Apple’s PDFKit after the OCR to transfer the Creator property. ABBYY compresses PDF documents significantly better than PDFKit, hence the larger size. I have added a fix so that the Creator property is written to the PDF file by the ABBYY PDF creator, so the file should be much smaller in the next update.

MauriceK · July 2, 2020, 3:39pm

Thanks for your support. I am looking forward to the results after the next update. The image quality could then improve again if ABBYY also delivers a new update.

Ryan_N · July 2, 2020, 5:30pm

@BLUEFROG if of any use, I want to point out some triggers I’m aware of, which caused OCR to seize. None of these are probably very surprising:

clicking “X” on a file in the OCR queue at lower left of DT window (I never tried from within activity window)
renaming a file in the OCR queue
clicking “X” on currently-being-OCRed file (to stop the OCR process)
feeding a document exceeding 350 pages

In all of these scenarios, I had to delete the SDK folder, but was still dead in the water until I did a Verify followed by an Optimize. I found that Verify alone did nothing, but doing both operations in succession worked every single time. (I was OCRing thousands of pages last week and running amuck regularly, so I had time to experiment with what was ultimately causing this to happen, and what the silver-bullet fix was, or was not.)

On the 350-page bit, I realized this once the hard way (and realize full well, this is not in specs nor normal to do something like this in the first place!). After ABBYY seized twice at page #349, on two separate documents, I was then curious what would happen if I converted those huge PDFs into little 1MB plain text files, then OCR’ed those files (in DT) by converting them to paginated PDF. Once again, ABBYY seized at page 349 (I only tried it once, but once was enough–I was only curious in this case.)

In all cases, deleting SDK, quitting DT, restarting it, then re-opening my database (which I couldn’t do from “open recent” for some reason; DT somehow force-closed the database when I quit on every occasion SDK was deleted), then doing a Verify followed by an Optimize once the database was re-opened, was indeed a silver-bullet fix. OCR then worked as per normal.

edit: @aedwards Alan–meant to tag you also. Cheers.

BLUEFROG · July 2, 2020, 6:37pm

Very strange. Interesting but strange. Alan (and potentially @cgrunenberg) would have to assess why the Optimize may have any effect in this situation.

Ryan_N · July 2, 2020, 6:51pm

I should clarify that I wasn’t clicking Verify just on a whim after re-opening. I did so because, after re-opening, I was still unable to OCR, so I started trying things. Verify alone didn’t work on two tries, but Verify followed by Optimize did. If memory serves, I also tried Optimize alone and was unsuccessful, so it’s entirely possible that there is no link here at all and that it’s just a coincidence. I also restarted on some of these tries, which makes my little test scenario not exactly lab-grade, to say the least.