Image size exceeds limits when opening a pdf (and triggering OCR rule)

willfoster11 · May 22, 2020, 3:08pm

I’m getting an error I haven’t seen before from what seems like a perfectly good pdf:

11:02:12: Inbox > Safari Screenshot 2020-05-22 11.01.09 No Text
11:02:13: ~/Library/Application Support/DEVONthink 3/Inbox.dtBase2/Files.noindex/pdf/9/Safari Screenshot 2020-05-22 11.01.09.pdf Skipped
11:02:13: Failed to add image
11:02:13: OCR failed for document ‘Safari Screenshot 2020-05-22 11.01.09’.
11:02:13: Page 0 of the image file cannot be opened due to the following error: Image size exceeds limits ().
32512 x 32512. FileName = /Users/XXXXXX/Library/Caches/Safari Screenshot 2020-05-22 11_01_09_1.pdf

I should note that the pdf is actually 20067 x 12000 pixels at 300 dpi (i.e., less than 32512 on either side).

Any ideas? Thanks in advance!

willfoster11 · May 22, 2020, 3:41pm

To add to the strangeness:

I opened the original pdf with Preview, then re-exported it as a pdf using “Export as pdf…”.

The resulting file was smaller and easily opened by DT3…

I have a rule turned on to run OCR (if Kind is PDF/PS and Word Count is 0, Perform the following actions: On Import OCR to searchable PDF and Move To Trash).

But when it imports the file (Kind: PDF document) and applies the rule, it creates a new document (Kind: PDF+Text) which has no selectable text layer in it (and deletes the original).

If I then OCR that file manually (ignoring “Are you sure you want to convert this searchable PDF again?”), it finally does have selectable text.

I’m attaching the documents themselves, if that helps:

filename.pdf [failed import]: this is the original pdf that DT3 wouldn’t import
filename 2.pdf [successful import]: this is the pdf that I created with Preview, per above
filename 2.pdf [auto OCR]: this is the file created by the OCR rule, per above
filename 2.pdf [manual OCR]: this is the resulting file from manually OCRing the [auto OCR] version

Safari Screenshot 2020-05-22 [manual OCR] 11.01.09 2.pdf (341.6 KB)
Safari Screenshot 2020-05-22 11.01.09 2 [auto OCR].pdf (213.7 KB)
Safari Screenshot 2020-05-22 11.01.09 2 [successful import].pdf (839.9 KB)
Safari Screenshot 2020-05-22 11.01.09 [failed import].pdf (839.8 KB)

One more thing: the dimensions I reported above are from Pixelmator Pro—when I open the original “[failed import]” file with Preview, it reports both “Media Box” and “Crop Box” as 2880 x 4816 points. Not sure why the discrepancy.
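One possible explanation, offered as a sketch rather than a confirmed diagnosis: PDF page boxes are measured in points (72 per inch by default), while an image editor reports pixels at whatever resolution it rasterizes the page. Under that assumption, 2880 x 4816 points rendered at 300 dpi works out to the Pixelmator Pro figures, with width and height listed in the opposite order. The helper name `points_to_pixels` is hypothetical, and 300 dpi is simply the value reported above.

```python
# Sketch: convert PDF page dimensions (points) to pixels at a given resolution.
# Assumes the default PDF user-space unit of 1/72 inch.
POINTS_PER_INCH = 72


def points_to_pixels(points: float, dpi: float) -> int:
    """Return the pixel length of `points` when rasterized at `dpi`."""
    return round(points * dpi / POINTS_PER_INCH)


# Media Box reported by Preview, rendered at the 300 dpi Pixelmator Pro reported.
width_px = points_to_pixels(2880, 300)
height_px = points_to_pixels(4816, 300)
print(width_px, height_px)  # prints "12000 20067"
```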

Thanks!

BLUEFROG · May 23, 2020, 4:20pm

the pdf is actually 20067 x 12000

Is this a typo??

willfoster11 · May 31, 2020, 7:05pm

No, that’s what Pixelmator Pro said, but as I subsequently reported, Preview reported it as smaller. I’ve repeated the experiment, saving the file at each step of the way. In this instance, file (1) is the one that won’t import, and I leave it to you to determine the actual dimensions (Preview is reporting it as 2880 x 17482 points on my Mac).

(1) This is the original file that DT wouldn’t open (see log).
(2) This is the file after I opened it with Preview and saved it using the “Export as pdf” command. It was successfully imported by DT.
(3) This should be the same file as (2); it’s the one DT moved to the Trash after applying the rule: if Kind is PDF/PS and Word Count is 0, Perform the following actions: On Import OCR to searchable PDF and Move To Trash.
(4) This is the result of the auto OCR using the above rule. Note that there is no text layer, at least none that I can discern, suggesting that the rule somehow isn’t working (although DT moved the original import to the Trash and reports this one as “PDF+Text”).
(5) This is the result of manually OCRing file (4) in DT, after ignoring the warning: “Are you sure you want to convert this searchable PDF again?” It does have a text layer, finally.

So it seems to me there are two issues:

A. DT won’t open the pdf—until I’ve opened it in Preview and (re)exported it as a pdf.
B. The DT rule for automatically OCRing documents isn’t resulting in selectable text—but manually doing it does.

The files are too large to attach here, so they are available zipped on Dropbox:

Attaching the .log file compressed as .zip here.

2020-05-31.log.zip (1.2 KB)

I really appreciate your help!

BLUEFROG · June 1, 2020, 3:24am

Where did this PDF come from?

aedwards · June 1, 2020, 7:54am

The failed image exceeds the maximum pixel size accepted for a page in a PDF document. How did you generate this file?
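As a rough illustration of that kind of limit (a sketch under stated assumptions, not DEVONthink's or the OCR engine's actual code): the 32512 figure is taken from the log line "32512 x 32512" and is assumed here to apply to each side of the rasterized page. The resolution used for rasterizing is not stated anywhere in this thread, so the 300 dpi below is only the value mentioned earlier. The function names are hypothetical.

```python
# Sketch: does a PDF page, rasterized at a given dpi, fit within a per-side
# pixel limit? Assumes 72 points per inch and a limit of 32512 px per side
# (the number shown in the log; how the engine applies it is an assumption).
POINTS_PER_INCH = 72
MAX_SIDE_PX = 32512


def fits_limit(width_pt: float, height_pt: float, dpi: float,
               max_side_px: int = MAX_SIDE_PX) -> bool:
    """True if the longest side, in pixels at `dpi`, is within the limit."""
    longest_px = max(width_pt, height_pt) * dpi / POINTS_PER_INCH
    return longest_px <= max_side_px


def max_dpi_within_limit(width_pt: float, height_pt: float,
                         max_side_px: int = MAX_SIDE_PX) -> float:
    """Highest resolution at which the longest side still fits the limit."""
    return max_side_px * POINTS_PER_INCH / max(width_pt, height_pt)


# Page sizes reported by Preview in this thread, in points.
for width_pt, height_pt in [(2880, 4816), (2880, 17482)]:
    print(width_pt, height_pt,
          fits_limit(width_pt, height_pt, dpi=300),
          round(max_dpi_within_limit(width_pt, height_pt), 1))
```

Under these assumptions, the 2880 x 17482 point page would stay within the limit only below roughly 134 dpi, while the 2880 x 4816 point page would fit up to roughly 486 dpi.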

willfoster11 · June 7, 2020, 11:27pm

It’s a Safari extension called “Page Screenshot for Safari”:
https://alexdenk.eu/mywork/pagescreenshot.html

But here’s the thing—I’m having the same issue with the first OCR pass not finding text or creating a text layer with ANY image.

Attached are a screenshot I made of this page with the native macOS screenshot shortcut, the first pass at OCR (right-click –> OCR –> to searchable text), and the second pass (of the pdf, using the same method and overriding the “Are you sure you want to… again?” dialogue).

This is separate from the import issue—why does the first pass of OCR not work?

Also, oddly, none of this is showing up or being recorded in the log, which is empty—any idea about that?

Thanks in advance!

Image 2020-06-07
Image 2020-06-07 (first OCR).pdf (398.6 KB)
Image 2020-06-07 (second OCR).pdf (966.5 KB)

Empty log photo and export:
06-07-2020-19.26.03
2020-06-07.log.zip (844 Bytes)