OCR PDF Resolution from an Intel Mac

jamesdempsey · November 22, 2024, 12:22am

I am in the process of migrating to DEVONthink from Evernote to store scanned documents.

My scan station is an older Intel Mac mini connected to a ScanSnap iX500. My initial testing of the setup is going well.

What would be the recommended DPI setting for the Resolution setting in the OCR Preferences/Settings?

It is set by default to 0. Or does 0 have a specific meaning similar to ‘automatic’?

I am running on a venerable old Intel Mac mini that is running Monterey - the latest operating system it supports.

Thank you for any insights!

BLUEFROG · November 22, 2024, 1:24am

200dpi is sufficient for most purposes.
Try setting it, then quitting and relaunching DEVONthink. Recheck the Resolution setting after relaunch.

jamesdempsey · November 22, 2024, 1:32am

Thank you.

I just want to clarify my understanding of this setting.

This is the DPI that the scanned document in the PDF will be saved at, regardless of the resolution of the original scanned PDF?

chrillek · November 22, 2024, 5:27am

Why would you scan a PDF again? And how? I don’t see how you feed its bytes to the scanner.

BLUEFROG · November 22, 2024, 6:10am

It is the resolution of the page image under the text layer.
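For anyone who wants to check what that resolution works out to for a given page image, here is a minimal sketch of the arithmetic. It assumes the image spans the full page width and that the page uses the default PDF unit of 1 point = 1/72 inch; `effective_dpi` is a hypothetical helper name, not something DEVONthink provides.

```python
POINTS_PER_INCH = 72  # default PDF user unit: 1 point = 1/72 inch


def effective_dpi(pixel_width: int, page_width_points: float) -> float:
    """Resolution of an image that spans the full width of a page.

    pixel_width: width of the embedded page image in pixels
    page_width_points: width of the PDF page in points
    """
    page_width_inches = page_width_points / POINTS_PER_INCH
    return pixel_width / page_width_inches


# Hypothetical example: a 1700-pixel-wide image on a US Letter page (612 pt wide).
print(effective_dpi(1700, 612))  # 200.0
```

The same calculation with the image height and page height should give roughly the same value if the image was not stretched.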

jamesdempsey · November 22, 2024, 8:18am

That is what I am trying to understand. My understanding is that I am not scanning a PDF again; rather, DT is performing some operations on the original scanned document image in the PDF.

From what I am reading in the OCR settings section of the DT manual:

OCR is performed on the original scanned image contained in the PDF generated by the scanner. The original PDF and its scanned image are not modified.
A new PDF is generated which contains both the scanned image and a text layer with the recognized text. This new PDF is what is added to the DT database.
The scanned image in the new PDF can potentially be modified from the scanned image in the original PDF. These modifications can include:
- Auto Correct: Deskewing and Page Orientation, if these options are selected
- Different DPI? On an Intel Mac, there is no “As Source” setting to retain the original resolution, so a DPI value is specified.
  
  When a DPI value is specified the image of the document in the new PDF will always be the specified DPI, even if the original document scan image was a different DPI.
The new PDF which contains the potentially modified document image and the text layer may be compressed, if that option is checked. There is no info in the doc as to what kind of compression that is.

Is this the basic gist of what is happening when using OCR on incoming PDF scans from known scanning software?

chrillek · November 22, 2024, 9:34am

I guess so.

Here’s what I think (!) is happening:

You have a PDF that results from scanning a document. That is just a pixel image, wrapped in a PDF. This image has the resolution your scanner was set to – let’s assume 600 dpi. The image can be JPG, TIFF or something else – we don’t know, and it probably depends on the scanner and its software.
Now you have DT run its OCR on this PDF. For that, it extracts the image and possibly converts it to one with another resolution – namely the one you set for OCR.
That resolution might be lower than the one set during scanning, e.g. 200 dpi.
After OCR, DT creates a new PDF containing the pixel image and an invisible text layer matching the visible text closely (so you can select text from the PDF with the mouse).
In my opinion, it would make no sense to save an image with lower resolution than the original in the OCRd PDF. For the OCR process, a lower resolution is ok. But visually, it is probably worse than the original. But the option to preserve the original resolution exists only on Apple silicon (which makes no technical sense whatsoever, but that’s an issue ABBYY would have to fix).
Compressing the text layer makes no sense. Or rather: There is nothing to compress in a text layer, it contains just the instructions to draw the invisible text. OTOH, it might make sense to compress the pixel image. But as long as no one tells us what that means (different internal image format, like JPG vs. TIFF, JPG with a higher loss, TIFF with another compression algorithm?), I’d just not do it. And somehow compression seems to be related to the PDF metadata. The reason for that is beyond me.
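To put rough numbers on the resolution difference discussed above, here is a minimal sketch. It assumes a US Letter page (8.5 × 11 inches) and uses the 600 dpi and 200 dpi values mentioned in this thread; `pixel_size` is a hypothetical helper, and nothing here reflects how DT or the OCR engine actually resamples images.

```python
def pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page image at a given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
scan = pixel_size(8.5, 11, 600)  # resolution assumed for the original scan
ocr = pixel_size(8.5, 11, 200)   # resolution set in the OCR settings

# How many times more pixels the original scan holds than the resampled image.
ratio = (scan[0] * scan[1]) / (ocr[0] * ocr[1])

print(scan)   # (5100, 6600)
print(ocr)    # (1700, 2200)
print(ratio)  # 9.0
```

Under these assumptions, going from 600 dpi to 200 dpi keeps one ninth of the pixels, which is why the saved image can look visibly softer than the original even when the OCR result is just as good.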