Extract text layer without OCR?

halloleo · May 1, 2023, 3:53am

I have scanned PDFs, and my scanner software already runs some OCR on them. The scanner saves each PDF to my Mac, and I then import it as is into DEVONthink.

Now I would like to compare the text layer my scanner software puts on the PDF with the text DEVONthink’s OCR produces.

In DEVONthink I tried OCR → Convert to RTF, but this seems to kick off DEVONthink’s OCR engine and produce an RTF from that. Is there instead a way to tell DEVONthink to save the text layer that was already in the PDF to an RTF?

BLUEFROG · May 1, 2023, 5:15am

How is that a legitimate test between the two? You can’t just convert the text layer from the scanner’s output and make any useful examination of the two apps’ capabilities.

Do OCR on the same document with both apps. Then use Data > Convert > To Plain Text on each OCR’d document to compare.
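Once both documents have been converted to plain text, the comparison itself can be scripted. What follows is only a minimal sketch using Python's standard `difflib` module, not anything built into DEVONthink. The function name `compare_ocr_text` and the labels `scanner` and `devonthink` are hypothetical, chosen for illustration.

```python
import difflib


def compare_ocr_text(text_a: str, text_b: str) -> tuple[float, list[str]]:
    """Return a similarity ratio (0.0 to 1.0) and a unified diff of two texts."""
    # Collapse runs of whitespace and drop blank lines, so that layout
    # differences between the two exports do not drown out real OCR differences.
    lines_a = [" ".join(line.split()) for line in text_a.splitlines() if line.strip()]
    lines_b = [" ".join(line.split()) for line in text_b.splitlines() if line.strip()]

    # Character-level similarity of the normalized texts.
    matcher = difflib.SequenceMatcher(None, "\n".join(lines_a), "\n".join(lines_b))
    ratio = matcher.ratio()

    # Line-level diff; the file labels are placeholders, not real file names.
    diff = list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile="scanner", tofile="devonthink", lineterm=""
        )
    )
    return ratio, diff


if __name__ == "__main__":
    # Made-up sample strings standing in for the two plain-text exports.
    scanner_text = "Invoice 2023\nTotal due: 100 EUR"
    devonthink_text = "Invoice 2023\nTotal due: 1OO EUR"
    similarity, changes = compare_ocr_text(scanner_text, devonthink_text)
    print(f"similarity: {similarity:.3f}")
    print("\n".join(changes))
```

In practice the two strings would be read from the exported text files, for example with `pathlib.Path(...).read_text()`. The ratio only says how alike the two outputs are. Judging which engine is actually more accurate still means checking the diff against the scanned page.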

halloleo · May 2, 2023, 1:56am

Thanks @BLUEFROG

That’s exactly what I was after! (Data > OCR > … kicks off a new DEVONthink OCR run.)

PS: DEVONthink’s OCR is quite a bit superior!

BLUEFROG · May 2, 2023, 3:50am

halloleo: “PS: DEVONthink’s OCR is quite a bit superior!”

Is this with the Canon scanner?
Do you know what OCR engine they’re using?

jerwin · May 2, 2023, 6:35am

Podofyllin is great for examining the text layers of PDFs.

Personally, I use it to check whether redacted documents actually are redacted.

halloleo · May 2, 2023, 8:35am

Yes.

No. The PDF metadata say:

PDF Producer: IJ Scan Utility Lite
Content Creator: Canon SC1011

halloleo · May 2, 2023, 8:36am

Thanks for the tip. Will check it out. Looks extremely promising!