OCRing degrades PDF image quality

erico · September 16, 2009, 7:01pm

Dear all,

I am wondering if anyone else is experiencing a degradation of image quality within Devonthink once a PDF has been OCRed. On my machine, there is a clear degradation, as the two screenshots below demonstrate. What is strange is that if I view the same OCRed image in Skim or Preview, it looks exactly as it always did. I don’t know if this is a bug limited to my system or something reproducible by others, but if there’s any way to document its occurrence and thereby fix it, I would be very interested.

-erico

before:

after:

Bill_DeVille · September 17, 2009, 12:50am

erico, what are your DTPO2 Preferences > OCR settings for dpi and image quality?

The default settings are 150 dpi and 50% image quality. That’s a compromise between file size and view/print quality.

I use a DTPO2 Preferences setting of 150 dpi and 75% image quality, which I find quite acceptable. Once in a while I’ll kick the setting up to 200 dpi for a document with small print. (However, for the scanner settings the resolution should be 300 dpi or better for good OCR accuracy.)

Today two users sent examples like yours, and one of them sent a PDF that had been OCRd with DTPO2 Preferences > OCR settings of 600 dpi and 100% image quality. The file size of a 3.5-page paper was 8.2 MB. I imported it into a database. It was very difficult to read, and I wasn’t impressed by legibility in Preview, although that was better.

For grins, I printed that PDF from Preview, ran the printout through my ScanSnap, and OCRed it into a DTPO2 database. The result, with my 150 dpi/75% setting, was a very readable and searchable PDF in my database. The file size had dropped to 1.2 MB, only 15% of the size of the original PDF.

Moral: Properties don’t necessarily scale up in a linear fashion as numbers get bigger.
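One way to see why the numbers don’t scale linearly: the pixel count of a scanned page grows with the square of the resolution, so 600 dpi means 16 times as many pixels as 150 dpi before any compression is applied. A minimal sketch, assuming a US Letter page (8.5 × 11 in) at 24-bit colour; the function name and page size are illustrative assumptions, not anything DTPO2 exposes:

```python
def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=24):
    """Uncompressed size in bytes of one scanned page (assumed US Letter)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Compare the resolutions mentioned in this thread.
for dpi in (150, 300, 600):
    mb = raw_page_bytes(dpi) / 1_000_000
    print(f"{dpi} dpi: {mb:.1f} MB uncompressed per page")

# Quadrupling the dpi multiplies the raw pixel data by 4 * 4 = 16.
ratio = raw_page_bytes(600) / raw_page_bytes(150)
print(f"600 dpi vs 150 dpi: {ratio:.0f}x the raw data")
```

These are raw, uncompressed figures only; the size of the final PDF also depends on the image quality setting and on the page content.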

Granted, however, legibility was better in Preview than in the database for the original PDF. I could detect no difference in legibility in Preview or in DTPO2 for my rescanned PDF.

Aside: The first sample I saw was from a paper by a political scientist, about quantitative analysis of behavior. Many years ago I was a maverick physical scientist in a political science department, about the time quantitative methods were becoming popular. I teased some of my new friends about being impressed by Tom Swift and his Electric Factor Analysis Machine.

erico · September 17, 2009, 4:00am

Bill,

Good question. I of course should have posted my numbers. I’m using 300 dpi, 100% quality. And as you can see, it looks like 10% quality in the database. Weird. I’ll try lowering the quality and test again, but I have to admit I would like to keep all the dots there, largely for the sake of printing, but also because some of the documents I have are legal documents, and I think it is good not to “alter” the image if that is not necessary. So eventually I’d like the display to look better at 100%. This problem has been going on for a while, maybe since the switch to ABBYY.

-eric o

annard · September 24, 2009, 3:38pm

As usual (not sure if you did it already) send a message to support@devon-technologies.com with the original PDF. Then if necessary we can contact Abbyy.

jean_alexis · October 18, 2009, 6:20pm

I second that. This looks like a bug in the preview of PDF files (my workflow is scanning documents as PDF at 300 dpi BW-8bit with Epson scan, then using import with OCR in Devon Think, settings 300 dpi, 46% quality).
The resulting PDFs have good quality, as can be seen when opening a document in Preview or other PDF-aware applications.
However, when displayed in DevonThink they look really bad and are hard to read unless you zoom; it looks like dithering is disabled.
It would be really nice if this were fixed.

annard · October 18, 2009, 10:07pm

It has been fixed for the next build; it’s just a display issue on Snow Leopard.

brentnanis · November 13, 2009, 8:09pm

Also having trouble with OCR.

Scanned a document at 600 dpi and 100%. Then sent it via fax using the ‘send to fax’ option in the print dialog box.

Recipients are complaining that the fax is too grainy to read.

I also note that the OCR process seriously degrades the PDF when viewed in Devon or an external PDF viewer.

I am using the latest version of DTPro, 2.0pb7.

Is this to be fixed in a version not yet released?

Is the layer of OCR added to a document expected to cause problems for faxing when using ‘send to fax’ versus a traditional hardcopy paper-feed fax?

Bill_DeVille · November 13, 2009, 8:26pm

The display fix mentioned by Annard will be in the next release, public beta 8.

I do a lot of scanning and OCR to searchable PDFs. I’ve got DTPO2 Preferences > OCR set to 150 dpi (sometimes to 200 dpi) and 75% image quality. The resulting PDFs display well in DTPO2 and Preview, and faxes sent from them are legible.

Ironically, I can replicate the problem you described if I set DTPO2 Preferences > OCR to 600 dpi and 100% image quality (under Snow Leopard). So I don’t do that.

brentnanis · November 14, 2009, 12:24am

Thanks Bill.
Did some testing. Tried:

150 dpi and 75% - marginal difference but still fuzzy
100 dpi and 75% - not as good
100 dpi and 50% - worse, of course

Looking forward to the fix…

We are talking about import images with OCR, right?? There isn’t another command somewhere else that I’m missing??

Is there any way to OCR images/PDFs that are already in DTPro?

mk.lv · November 28, 2009, 3:11pm

I have to agree that JPEG-only compression for images is not the best choice in all cases.
The best solution would be a kind of automatic mode in which the PDF engine detects the image type used in the original document (at least choosing between lossy JPG and lossless PNG) and its colour bit depth (for PNG), and respects the original settings. The same goes for DPI and compression.
Let’s say I have a lot of scans in 1bit B&W at a high resolution (PNG compressed images in PDFs), taking some 200 KB per page. Converting them with the JPEG compression the OCR currently uses is overkill: it would convert them to 24bit JPEG, bring in artifacts and take up more space than the original…
If there were at least an option to choose between the current JPEG and 4 PNG modes (24bit, 8bit grayscale, 8bit optimized colour palette, 1bit B&W), I would already be happy!
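A rough back-of-the-envelope sketch of the bit-depth point above: before any compression, a 24bit rendering of a page carries roughly 24 times the raw data of a 1bit one, which is why converting 1bit scans to 24bit JPEG can make files bigger. The A4 page size, the 300 dpi figure and the function name are assumptions for illustration only; real PNG and JPEG sizes depend heavily on the page content:

```python
def raw_bytes(dpi, bits_per_pixel, width_in=8.27, height_in=11.69):
    """Uncompressed bitmap size in bytes for one page (assumed A4)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px

# The four PNG modes suggested above, keyed by bits per pixel.
modes = {
    "24bit": 24,
    "8bit grayscale": 8,
    "8bit optimized colour palette": 8,
    "1bit B&W": 1,
}

for name, bpp in modes.items():
    kb = raw_bytes(300, bpp) / 1000
    print(f"{name}: {kb:,.0f} KB uncompressed per page at 300 dpi")
```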

joshgibson · February 22, 2020, 7:14pm

Is there a way to customize compression levels after OCR in DEVONthink 3? I’ve looked at my preferences and deselected “Compress PDF” but after I run DT3’s OCR on a 6+MB PDF, the file size shrinks to 2MB or so, every time.

I’ve tried quitting and reopening the database, but to no avail. Any other ideas?

It’s a subtle difference, but I’m scanning and preserving genealogical records, and I’d like to have more fine-tuned controls over this kind of thing:

BLUEFROG · February 23, 2020, 3:37pm

We are working on some things in here that may reduce some artifacting in OCR. Thanks for your patience and understanding.

joshgibson · February 23, 2020, 6:14pm

No problem, thank you! Do you recommend any workaround apps? I prefer OCRing in DT3 because it seems to get more accurate results than any other app I’ve tried OCRing with, but I’m guessing that if I OCR before importing into DT3, there won’t be any artifacts?

BLUEFROG · February 23, 2020, 6:25pm

“I’m guessing that if I OCR before importing into DT3, there won’t be any artifacts?”

That’s impossible to tell without testing. There is no singular OCR engine, so the results can vary app-to-app.

joshgibson · February 23, 2020, 8:08pm

Makes sense, thank you!