Adobe Acrobat Typewriter Text removed during OCR

psnodgrass · July 30, 2012, 6:29pm

When I convert a PDF to PDF + Text, DEVONthink Pro Office removes everything that I added to the document using Adobe Acrobat’s Typewriter. This happened to a large number of documents, and the information that was removed is important. How can I avoid this, and is there a way to get that information back?

BLUEFROG · July 30, 2012, 11:41pm

I am not sure why the OCR process doesn’t read the Typewriter text.

If you are importing your files, then the originals outside the database are still intact.
The better workflow in this case would be to get the PDFs into DTPO, OCR them, and then add the Typewriter edits.

For existing Typewriter-edited docs, I’m afraid I don’t have a good solution. Sorry. Are these image scans with Typewriter edits?

psnodgrass · August 13, 2012, 4:14pm

These are PDF forms (usually scanned, but sometimes downloaded PDFs) that I completed using Adobe’s Typewriter and then put into DEVONthink to archive. Without the Typewriter text, they are now just a bunch of blank forms.

Your suggested workflow is helpful. I also discovered that, for existing documents, printing to PDF from Preview before converting to “PDF + Text” in DEVONthink seems to solve the problem as well.

It would be better if DEVONthink could handle this, or at least detect that the conversion will cause data loss and warn the user.

Bill_DeVille · August 15, 2012, 8:55pm

As your Typewriter notes were not part of the normal image layer of the PDFs, they were not carried over when the images were re-rasterized after optical character recognition.

Your workflow change of printing your PDFs prior to OCR does work, because that made the Typewriter characters part of the PDF image layer.

This isn’t the kind of issue for which DEVONthink would detect an anomaly in the PDF and issue a warning to the user.
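
For anyone who wants to check a folder of originals before running OCR, here is a rough sketch of a do-it-yourself check. It rests on an assumption not confirmed anywhere in this thread: that Acrobat stores Typewriter text as a FreeText annotation, so the raw file contains an entry like "/Subtype /FreeText". The function names has_freetext_annotation and flag_pdfs are made up for illustration. This is only a heuristic: it reads the raw bytes, so it can miss annotations stored inside compressed object streams, and it will flag any FreeText annotation, not just Typewriter ones.

```python
import re
from pathlib import Path

# Assumption: Typewriter text is saved as a FreeText annotation, which
# appears in the raw PDF as "/Subtype /FreeText" (whitespace optional).
FREETEXT = re.compile(rb"/Subtype\s*/FreeText\b")


def has_freetext_annotation(data: bytes) -> bool:
    """Return True if the raw PDF bytes mention a FreeText annotation."""
    return FREETEXT.search(data) is not None


def flag_pdfs(folder):
    """Return a sorted list of .pdf paths under folder that look annotated.

    Heuristic only: annotations inside compressed object streams
    will not be found by a plain byte search.
    """
    return sorted(
        path
        for path in Path(folder).rglob("*.pdf")
        if has_freetext_annotation(path.read_bytes())
    )
```

Calling flag_pdfs on the folder that holds the originals returns the files worth printing to PDF (or otherwise flattening) before converting them to PDF + Text.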