Can you retain the crisp original image when OCR'ing a PDF?

I OCR most every PDF that goes into my Devonthink Database to make everything searchable, but the huge downside to doing this is you lose the sharp, crisp image from the original PDF and are left with a searchable, yet relatively fuzzy document.

Everything is readable on the PDF+Text version, but when comparing to the original, it seems as though your eyes aren't completely focusing on the document :open_mouth:

When OCR’ing a PDF, is it possible to leave the original PDF image and layer the OCR data on top of it instead of it creating an entirely new image? (Or any other tips to retain the “crisp” look of the original?)

Thanks for listening! :smiley:
-Cameron

P.S. I scan my documents with a Fujitsu 300m and normally have the Devonthink OCR settings at 150dpi and 75%. I’ve tried maxing both out to 600dpi and 100% just to see, and there is still a “fuzzy” difference when compared to the original PDF.

That’s the rub, and unfortunately the answer is no.

During the OCR process the image layer is recreated.

The default dpi/image quality settings in DTPO2 Preferences > OCR represent a compromise to save the searchable PDF with reasonably readable image quality, but without a huge increase in file size.

Annard discovered that with the ABBYY OCR code he could have a very sharp PDF image and very compact file size, but at the expense of completely throwing the original image away and substituting a PDF image of the recognized text only. That would not be acceptable, as the image of the original document should be considered a faithful representation, whereas the recognized text image could contain errors, leave out images, etc.

If I scan a contract into my database, it’s important to me that the image of the contract should be faithful to the paper copy and contain, for example, any handwritten initials and signatures. If there’s a dispute about that contract, the image layer is the ‘master’ for resolving the dispute. But the OCRd text might contain errors resulting from the proximity of handwritten initials overlaying text, or a coffee stain. The image rules.

I’ve got Preferences > OCR set to 150 dpi and 75% image quality. So I can read that contract comfortably and a printout is readable. Not as sharp as the original, but I can live with it.

I do my ScanSnap scans usually set in ScanSnap Manager for black & white at 600 dpi, and in the ‘Compression’ tab of ScanSnap Manager Settings I’ve got the Compression slider all the way to the left. That increases the size of the file sent to DTPO2 for OCR (good for recognition accuracy of color content), but doesn’t pose much of a penalty in the final searchable PDF stored in my database.

The view/print image quality of ABBYY’s output is better than output of the IRIS OCR we used in DTPO 1.x and the accuracy of ABBYY OCR is much better; yet my searchable PDFs produced by ABBYY are considerably smaller than those produced by IRIS in DTPO 1.x.

One of these days the technology will advance so that the sharpness and file size of OCRd copy will be as good as what I can produce by exporting a Pages document as PDF. And, something I’ve been hoping for for years, one will be able to correct errors in the text layer without modifying the image layer. :slight_smile:

Thank you for the detailed response, Bill.

I will try your recommended Scansnap Manager settings. I’ve been scanning with default compression of 3 (middle). When you say you are scanning at 600dpi, is that the equivalent of the Scansnap manager set to B&W with Image Quality set to “Best”?

Their selections in that menu are rather cryptic :wink:

Edit: Should have checked the Fujitsu’s help file first :smiley:

I leave my ScanSnap Manager setting at Best (Slow). That’s still pretty fast with a ScanSnap, and runs multiple circles around most flatbed scanners.

That equates to a 600 dpi scan for black & white, and to a 300 dpi scan if color scanning is set. If Automatic Color recognition is set, pages that are all black & white will scan at 600 dpi and pages that contain color will scan at 300 dpi.

I’ve found that I still get pretty good OCR accuracy at the fastest setting on the ScanSnap, but only for clean paper copy that doesn’t have fine print or unusual fonts.
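To get a rough feel for what those resolutions mean in pixels, here is a small sketch. It assumes US Letter paper (8.5 x 11 inches), which is not stated anywhere in this thread, and `page_pixels` is just an illustrative helper name, not part of any scanner software.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Pixel dimensions of one page scanned at the given dpi.

    Assumption: US Letter paper (8.5 x 11 inches); pass other
    dimensions for A4 or other sizes.
    """
    return round(width_in * dpi), round(height_in * dpi)


# Compare the resolutions mentioned in this thread.
for dpi in (150, 300, 600):
    w, h = page_pixels(dpi)
    print(f"{dpi} dpi: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the dpi quadruples the pixel count, which is part of why higher settings can cost so much in file size before compression is applied.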

Bill,

I wanted to let you know that I’ve seen a significant increase in quality by leaving my Devonthink settings at 150dpi/75% and changing the ScanSnap settings to “Slow” and Compression to “Low/1” like you said.

The scans are much better, and for more important documents I can simply increase the dpi to 200-300 if needed.

For example, I tested with a one page bank statement to see:

ScanSnap/Faster/Compression3 - DevonThink 150/75% - 250k (What I was originally using)
ScanSnap/Slow/Compression1 - DevonThink 150/75% - 270k (MUCH better sharpness and readability)
ScanSnap/Slow/Compression1 - DevonThink 200/75% - 355k (An improvement over the previous settings)
ScanSnap/Slow/Compression1 - DevonThink 300/75% - 613k (An improvement over the previous settings)
ScanSnap/Slow/Compression1 - DevonThink 300/100% - 1.9MB (The best, but at a huge cost of filesize)

Your recommended settings are great and one can increase the dpi for more important files when needed. It also looks like I’d probably leave Quality percentage at 75%.
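For anyone who wants the size trade-off as ratios, here is a quick sketch using the file sizes from the test above. It treats "1.9MB" as 1900 KB, which is an assumption, and `relative_sizes` is just an illustrative helper name.

```python
# File sizes reported in the bank statement test, in kilobytes.
# Assumption: "1.9MB" is treated as 1900 KB.
results = [
    ("Faster/Compression3, 150 dpi/75%", 250),
    ("Slow/Compression1, 150 dpi/75%", 270),
    ("Slow/Compression1, 200 dpi/75%", 355),
    ("Slow/Compression1, 300 dpi/75%", 613),
    ("Slow/Compression1, 300 dpi/100%", 1900),
]


def relative_sizes(rows):
    """Return (label, size_kb, ratio to the first row) for each row."""
    baseline = rows[0][1]
    return [(label, kb, kb / baseline) for label, kb in rows]


for label, kb, ratio in relative_sizes(results):
    print(f"{label:35s} {kb:5d} KB  {ratio:.2f}x")
```

On these numbers, the ScanSnap change alone costs about 8% in file size, while going to 300 dpi at 100% quality is more than seven times the original size.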