Possible to strip OCR text from a PDF

TylerGred · November 4, 2012, 4:45pm

For a quick and simple process, I OCR everything. However, there are some documents that I just don’t need OCR’ed, and I’d like to remove the text layer from them in order to create a smaller file. Is there a way to do this via a script or in the application?

Thanks.

korm · November 4, 2012, 5:59pm

You can open an OCRd PDF in Preview, save the file as TIFF or PNG or JPG and then save that image file as PDF. The result is a non-OCRd PDF. The resolution of the image layer is poor compared to the original. The resulting PDF is about 12 times larger than the original, and all annotations are permanently affixed to the image layer.

However, if you don’t need a PDF, just save the TIFF from Preview and you’ll usually get a smaller (albeit fuzzier) file.

Bill_DeVille · November 4, 2012, 6:16pm

As korm notes, the results of removing text will be a larger and fuzzier file.

The size of the text layer in an OCRed PDF is negligible, so you won’t achieve your purpose of reducing file size by trying to remove it.

korm · November 4, 2012, 6:32pm

If you run Data > Convert > to Plain Text on an OCRd PDF, you’ll get the text in a new file and can roughly gauge how much of the original OCRd file’s size is due to the text layer.
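
As a rough sketch of that comparison, the Python snippet below divides the size of the exported plain text file by the size of the OCRd PDF. The function name and the file names in the usage note are hypothetical, and the plain text size is only an approximation: the text layer is stored differently (and often compressed) inside the PDF.

```python
import os


def text_share(pdf_path, txt_path):
    """Return the exported text size as a fraction of the PDF size."""
    pdf_bytes = os.path.getsize(pdf_path)
    txt_bytes = os.path.getsize(txt_path)
    if pdf_bytes == 0:
        raise ValueError("PDF file is empty")
    return txt_bytes / pdf_bytes
```

For example, `text_share("scan.pdf", "scan.txt")` returns 0.025 when the text file is 2.5% of the size of the PDF.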

Bill_DeVille · November 4, 2012, 8:39pm

If your OCRed PDFs are very large, experiment with the settings for resolution in Preferences > OCR.

I do most scans with a ScanSnap set for black & white at 300 dpi. In DEVONthink Pro Office Preferences > OCR I uncheck the option to keep the original scan resolution and have settings of 130 dpi and 50% image quality. For most documents I’m satisfied with the view/print quality, and the OCRed PDF is smaller than the original scanner output.
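
To see why the resolution setting matters so much, here is a small Python sketch that counts the pixels in a page at 300 dpi and at 130 dpi. The US Letter page size (8.5 x 11 inches) and the function name are assumptions for illustration. Pixel count is not file size, since the final size also depends on compression and image quality, but it shows the scale of the reduction.

```python
def pixel_count(width_in, height_in, dpi):
    """Pixels in a page of the given size (inches) at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
scan = pixel_count(8.5, 11, 300)  # resolution of the original scan
ocr = pixel_count(8.5, 11, 130)   # resolution set in Preferences > OCR

print(scan)                  # 8415000
print(ocr)                   # 1580150
print(round(ocr / scan, 3))  # 0.188
```

Under these assumptions, going from 300 dpi to 130 dpi leaves a little under a fifth of the pixels.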

bcarpenter · November 24, 2012, 4:56am

It’s not the text layer that adds size to the file. It’s the conversion of bitmapped images to greyscale images that simultaneously occurs.
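
To put rough numbers on the difference between the two image types, the Python sketch below compares the uncompressed size of a 1-bit (black & white) page image with an 8-bit greyscale image of the same dimensions. The 2550 x 3300 pixel page (US Letter at 300 dpi) and the function name are assumptions, and real PDFs compress their images, so actual file sizes will differ.

```python
def raw_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in bytes, with each row padded to whole bytes."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page: 2550 x 3300 pixels (US Letter at 300 dpi).
bilevel = raw_bytes(2550, 3300, 1)  # 1 bit per pixel, black & white
grey = raw_bytes(2550, 3300, 8)     # 8 bits per pixel, greyscale

print(bilevel)  # 1052700
print(grey)     # 8415000
```

Before compression, the greyscale version holds roughly eight times as much data as the bitmapped one.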