OCRing PDFs

When applying OCR to a PDF, I notice it creates a new document (PDF+Text). It’s also smaller in size. Is it best to delete the non-OCR PDF?

The size will depend on the Quality and Resolution settings in DEVONthink’s Preferences > OCR.

My recommendations…

  1. Resolution should be balanced between file size and quality at 200 dpi (300 dpi max.). Only use Same as scan when you know the resolution of the image and it doesn’t exceed the suggested limits.
  2. Quality should be balanced at 75-85%.

As a rough guide, bearing in mind there are factors affecting compressibility beyond the Quality setting:

200 dpi at 85% quality generally yields a slightly smaller file than the original.
250 dpi at 75% quality generally yields a slightly larger file than the original.
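
To illustrate why the Resolution setting matters so much, here is a minimal Python sketch of how the pixel count of a page grows with dpi. It assumes a US Letter page (8.5 x 11 in), and `page_pixels` is a hypothetical helper written for this example, not anything DEVONthink provides. Actual file size also depends on compression, so treat this only as a rough indication.

```python
# Hypothetical helper: pixel dimensions of a page rendered at a given dpi.
# Assumes a US Letter page (8.5 x 11 in); real scans vary.

def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Return (width_px, height_px, total_pixels) for a page at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


baseline = page_pixels(200)[2]  # total pixels at the suggested 200 dpi
for dpi in (200, 250, 300):
    w, h, total = page_pixels(dpi)
    print(f"{dpi} dpi: {w} x {h} px, {total / baseline:.2f}x the pixels of 200 dpi")
```

Because the pixel count scales with the square of the resolution, 250 dpi carries roughly 1.56 times the pixels of 200 dpi, and 300 dpi carries 2.25 times, before any compression is applied.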

Also, you can check the checkbox for Original Document: Move to Trash, if you want to delete the originating file.

Thank you for that clarification. Much appreciated.

I’m finding the OCR process in DT takes a lot longer than in PdfPen Pro. Also, is there a way to batch process PDFs with OCR in DT?

Outside of changing resolution / quality, no, there isn’t a way to speed it up.
It’s also not an apples-to-apples comparison with PDFPen, Acrobat, etc.

As for batching, you can select more than one record in DEVONthink and use Data > Convert > To Searchable PDF.

Great support here. Thank you.

You’re welcome. :smiley:

I tried OCR on a 450 KB PDF, and this resulted in a ballooned 5.4 MB PDF. Settings were 250 dpi and 75% quality. If I do OCR with the ScanSnap Home software, files are not ballooned like this. Are the settings wrong? I do not want to risk files being lower in quality after OCR. However, if I choose to use the same resolution as the original PDF in the DT settings, the file balloons even further, to 12 MB. Can you give any recommendation on how to handle this better?

Compression methods in PDFs can vary greatly. Different apps use different methods. So the resulting file, which still includes the originating image, will be larger than the original in almost every case. You can’t change the method of compression applied in our OCR.

200 dpi / 75% should yield a slightly smaller file.
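
If you want to check how a given combination of settings behaves on your own files before trashing the originals, a small script can compare the sizes. This is only a sketch: `size_ratio` and `describe` are hypothetical helpers, and the file paths you pass in are assumed to be copies exported from the database, not anything DEVONthink produces automatically.

```python
import os


def size_ratio(original_path, ocr_path):
    """Return the OCR'd file size divided by the original file size."""
    original = os.path.getsize(original_path)
    converted = os.path.getsize(ocr_path)
    if original == 0:
        raise ValueError("original file is empty")
    return converted / original


def describe(ratio):
    """Turn a size ratio into a short human-readable verdict."""
    if ratio < 1:
        return f"{(1 - ratio) * 100:.0f}% smaller"
    if ratio > 1:
        return f"{ratio:.1f}x the original size"
    return "same size"


# The sizes reported above, taking 450 KB as 450,000 bytes and
# 5.4 MB as 5,400,000 bytes (an assumption about the units used).
print(describe(5_400_000 / 450_000))
```

With the figures from the question above, the converted file comes out at roughly 12 times the original size, which is the kind of result that suggests trying a lower resolution.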