Is there any way to remove an OCR layer from a PDF file?

juseong · September 8, 2023, 8:21pm

I quite often encounter PDF files that have OCR issues.
For example, when I annotated a PDF file via the DT3 PDF reader, its OCR suddenly became so corrupt that the OCR layer is now just full of unreadable symbols.

Is there any way to remove an OCR layer from a PDF file?
It seems very complex, and I doubt there is any way. But I’m still wondering.

stephenjw · September 8, 2023, 11:07pm

I’ve had the same issue over the years - it’s a pain.

Re-applying OCR through DT often seems to fix the issue. I’m not sure if it’s because DT creates a new file using only the image before starting OCR.

Failing that, the only way I have found is to export the PDF to an image file (e.g. TIFF) using Acrobat DC, then re-create the PDF from that image in Acrobat DC and apply OCR again.
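
For anyone without Acrobat, a scripted route to the same result (a PDF with only the page images, ready for a fresh OCR pass) might look like the rough sketch below. It is not a DEVONthink or Acrobat feature. It assumes Ghostscript is installed with `gs` on the PATH and is recent enough to support the `-dFILTERTEXT` switch, which drops text objects while rewriting the file. The function names are my own, purely for illustration. Note that the switch removes all text, visible or hidden, so this only makes sense for scanned PDFs where what you see on the page is an image.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_strip_text_command(src: str, dst: str, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command that rewrites src to dst without text objects.

    Hypothetical helper: it only assembles the argument list and runs nothing.
    """
    return [
        gs_binary,
        "-o", dst,            # output file (also implies batch, non-interactive mode)
        "-sDEVICE=pdfwrite",  # write a new PDF
        "-dFILTERTEXT",       # drop all text, including the hidden OCR layer
        src,
    ]


def strip_text_layer(src: str, dst: str) -> None:
    """Run Ghostscript to write a copy of src with the text layer removed."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on the PATH")
    if Path(src).resolve() == Path(dst).resolve():
        raise ValueError("Write to a new file so the original is kept")
    subprocess.run(build_strip_text_command(src, dst), check=True)


# Example (file names are placeholders):
# strip_text_layer("scan_with_bad_ocr.pdf", "scan_image_only.pdf")
```

The output file can then be run through OCR again in DT or any other tool. Keep the original until you have checked the result, since annotations and other content may not survive the rewrite unchanged.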

mbbntu · September 9, 2023, 6:08am

I don’t know what features are available in DEVONthink because I don’t use it for OCR. I use Nitro (https://www.gonitro.com) which is also available through SetApp (https://setapp.com). Nitro makes it very simple to remove an OCR layer (just a matter of selecting a menu item). When I tested various OCR options some years ago, Nitro always gave me the best results (the application was called PDFpen previously). Things may have changed, so I suppose I ought to do another round of testing, but since I’ve always had good results from Nitro I didn’t see a need to change.

cgrunenberg · September 10, 2023, 7:49am

The next release will include additional workarounds which will make this PDFKit issue much less likely to occur (and will also speed up saving of modified/annotated PDF documents a lot).