OCR. Failed to process document

What does this error mean?


I’m just going to add "Index of page collection is out of bounds." as text to this thread, so it can be found in future by anybody searching for the same error.

I’m sorry I can’t be of any assistance; whilst you are waiting for @bluefrog or @aedwards (I think Alan does the ABBYY bit of DT…?) to respond, perhaps you could provide a little more detail - where is the file coming from, what steps are you taking which trigger the error?


I imported PDFs from Evernote and applied a rule. About half are not processed.

Hold the Option key and choose Help > Report bug to start a support ticket. Please attach a problematic PDF. Thanks.

Can you turn on ABBYY OCR logging? To do this:

  • Quit DEVONthink 3
  • In Finder select the menu Go->Go to Folder, copy and paste the line below and press Go.
    ~/Library/Application Support/DEVONthink 3/Abbyy
  • Copy the file below to this folder.
    OCR.plist (274 Bytes)
  • If you get the same error again could you send a copy of the OCRLog.txt file which will be created in the same folder.
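For anyone who prefers to script the steps above, here is a minimal sketch. It assumes you have already quit DEVONthink 3 and downloaded OCR.plist somewhere; the function names are made up for illustration, and only the folder path and the two file names come from the instructions above.

```python
from pathlib import Path
import shutil

# Folder named in the instructions above.
ABBYY_DIR = Path.home() / "Library" / "Application Support" / "DEVONthink 3" / "Abbyy"


def install_ocr_plist(plist_path, abbyy_dir=ABBYY_DIR):
    """Copy a downloaded OCR.plist into the Abbyy folder and return the destination.

    Hypothetical helper: quit DEVONthink 3 before calling it.
    """
    abbyy_dir = Path(abbyy_dir)
    if not abbyy_dir.is_dir():
        raise FileNotFoundError(f"Abbyy folder not found: {abbyy_dir}")
    return Path(shutil.copy2(plist_path, abbyy_dir / "OCR.plist"))


def find_ocr_log(abbyy_dir=ABBYY_DIR):
    """Return the path to OCRLog.txt if it has been created, otherwise None."""
    log = Path(abbyy_dir) / "OCRLog.txt"
    return log if log.is_file() else None
```

A typical use would be to call install_ocr_plist(Path.home() / "Downloads" / "OCR.plist") (the Downloads location is an assumption), reproduce the error in DEVONthink, and then attach whatever find_ocr_log() returns.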

I’m having the same issue after installing DT 3.7.

This worked before, but maybe the file sizes were different in the past.

OCRLog.txt.zip (3.0 KB)


Thanks for the log file. The problem is that ABBYY believes the page image size exceeds limits. Could you hold the Option key and choose Help > Report bug to start a support ticket, mark it for the attention of Alan, and attach a copy of a PDF that causes this error? Thanks.

Thanks Alan. Before I was helpfully pointed to this thread by @Blanc, I started another where I uploaded a file like that. Here is the file link again, in case this is enough.

I can reproduce this with files that are created with Page Screenshot (a Safari extension), as long as “[k]eep full retina resolution quality” is enabled.
With this setting disabled, the OCR error has not yet come up.

The thing is, I always had this setting enabled, but only after DT 3.7 did the errors start.

Why are you using this Safari extension then doing OCR afterwards?
Why aren’t you just capturing a PDF from the start?

Because Devon’s PDF clipper doesn’t work for me like 75% of the time because of cookie notices or other popups that result in a grey/darkened page with the popup saved in the foreground. Clutter-free only works on some articles and also fails often with an error message of needing to enable JavaScript.
Printing to PDF from Safari is another option, but often this changes the layout too much as well.
Those extensions are the only ones that really capture some pages as they are.

For long non-paginated PDFs, the image when scaled for OCR will be larger than the maximum image size that ABBYY can handle. The best way would be to use DEVONthink to capture the URL as a PDF, as this will be generated with a searchable text layer. If you have a problem with certain web sites, try using Safari’s File->Export as PDF menu, as this will also generate a PDF with a text layer.
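To get a rough feel for why a single very tall page runs into a size limit, here is a small sketch of the arithmetic. PDF page sizes are measured in points (1/72 inch), so the rendered image grows with both the page length and the resolution used for OCR. The resolution and the maximum side length below are placeholders: the thread does not state what DEVONthink or ABBYY actually use, so treat both as assumptions.

```python
import math

POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)


def rendered_pixels(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return (
        math.ceil(width_pt * dpi / POINTS_PER_INCH),
        math.ceil(height_pt * dpi / POINTS_PER_INCH),
    )


def exceeds_limit(width_pt, height_pt, dpi, max_side_px):
    """True if either side of the rendered page is longer than max_side_px.

    max_side_px is a hypothetical limit supplied by the caller; the real
    ABBYY limit is not given in this thread.
    """
    return max(rendered_pixels(width_pt, height_pt, dpi)) > max_side_px


def pages_needed(height_pt, page_height_pt=792):
    """How many pages of page_height_pt points a long capture would span."""
    return math.ceil(height_pt / page_height_pt)


# Assumed values for illustration only: 300 dpi and a 32000 px limit.
print(rendered_pixels(612, 792, 300))          # one US Letter page
print(exceeds_limit(612, 20000, 300, 32000))   # one very long web page
print(pages_needed(20000))                     # same capture, paginated
```

Under these assumed numbers a normal page stays well inside the limit while a full-length, single-page capture does not, which is why a paginated capture or a PDF with an existing text layer avoids the problem.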

Does the Page Screenshot extension allow you to capture the PDF as paginated rather than as one page?


As a side note: OCR results might be good enough without full retina resolution quality. Did you try that?

Most of the time, I get cookie notices and other pop-ups when using Devon’s own PDF clipper, that’s why I hardly use it any longer.

I often used print to PDF with bad results as far as the layout goes and thought File > Export as PDF was the same. It seems my assumption was wrong and export as PDF keeps the layout pretty nicely. I’ll have to test this more, but this might be the best solution. Thanks for making me aware of that option!

It doesn’t, to preserve the layout. But File > Export as PDF seems like the best option so far if it works similarly to a screenshot.

Yes, they are ok, just a bit more blurry after DT runs OCR.

DT will try to scale lower-resolution images to improve the accuracy of the OCR; however, this can lead to some noise around the characters, which makes them appear slightly blurred.
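A toy illustration of why upscaling can soften edges: the sketch below stretches one row of pixels with plain linear interpolation. This is not DEVONthink's or ABBYY's actual scaling method (the thread does not say what they use); it only shows that a sharp black-to-white edge picks up in-between grey values once it is resampled.

```python
def upscale_linear(row, factor):
    """Upscale a 1-D row of grey values by an integer factor using linear interpolation.

    Illustrative only; real image scalers work in 2-D and often use other filters.
    """
    n = len(row)
    out_len = n * factor
    if n < 2 or out_len < 2:
        return list(row) * factor
    out = []
    for i in range(out_len):
        x = i * (n - 1) / (out_len - 1)   # position in source coordinates
        lo = int(x)
        hi = min(lo + 1, n - 1)
        t = x - lo                        # blend weight between the two neighbours
        out.append(round(row[lo] * (1 - t) + row[hi] * t))
    return out


edge = [0, 0, 255, 255]            # a hard edge: black then white
scaled = upscale_linear(edge, 2)
print(scaled)                      # the middle values are neither 0 nor 255
```

The original row contains only pure black and pure white; the scaled row contains intermediate greys at the edge, which is the slight blur you see around characters.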