File Sizes of PDFs are HUGE after scanning/OCR in DTPO 2.0

I haven’t changed my scan settings since upgrading my databases to DTPO 2.0b1. One thing I’ve noticed is that PDFs after OCR are HUGE now: one 265-page manual I scanned, 90% text, weighed in at 500 MB! The same file was reduced to 42 MB after optimizing in Acrobat Pro.

My OCR settings are 300 dpi, 100%, automatic, with incoming scans converted to searchable PDF. Why the incredibly large file sizes? Has something changed in the final release version of DTPO 2.0?



The default settings in DTPO Preferences > OCR are for 150 dpi and 50% image quality, as a compromise between file size and image view/print quality. I do most scanning on a ScanSnap in black & white. My experience is that I’m getting smaller files with better view quality in the final release than during the public beta period.

The image layer of OCRed PDFs is created using Apple’s PDFKit code, which isn’t all that efficient at minimizing file size. See what happens, for example, if you remove a page from a two-page PDF and save the resulting one-page PDF — the file size may be larger instead of smaller.

There are third-party utilities such as PDF Shrink that allow one to customize a ‘shrink set’ for resolution and image quality. I sometimes use PDF Shrink to make significantly smaller PDFs without appreciable loss of view quality.

First of all, thank you, Bill, for your diligent support of all things DevonThink. I’ve implemented many of your suggestions (silently) over the past year reading through these forums.

I’m going to change the settings as per your suggestion. I suppose if I change the settings in DTPO prefs and then Convert to Searchable PDF, it will re-OCR the document with the new settings reflected.

Thanks again!

I purchased PDFShrink - wonderful product.

Is there any way to incorporate PDFShrink into a workflow to shrink documents already imported into DTPO 2.0?

Thank you very much for the help!


I just tried PDFShrink and all the files either doubled or tripled in size. A folder that started out as 481.1MB is now 1.01GB.

Can anyone recommend something else?

I don’t use PDF Shrink’s default shrink sets, as one of them shrinks PDFs too much, while others may not decrease PDF size.

Try out the settings for a custom set, checking the effect of each setting against the user manual. I created two new sets that I like. The individual settings may vary with the quality of the scan image (some scanners are better than others, just as some cameras are better than others), whether it’s black & white, grayscale or color, the importance of images, etc. It takes a bit of experimentation (work with copies, not your originals) to produce a custom PDF Shrink set that gives satisfactory results, both in size reduction and in view/print quality.

But remember: if you start with a PDF that was saved into DTPO at 150 dpi and 50% image quality, you won’t get a very much smaller PDF unless you sacrifice view/print quality. For a black & white scan already saved to DTPO with default settings in Preferences > OCR I can reduce the file size to about half while legibility remains good. For such a scan saved in DTPO with the option to retain scanner settings, I can get a six- to eight-fold size reduction with good view/print legibility.

Acrobat has a mode to reduce PDF size, and I’ve used that at times. My experience is that I can get smaller files with better legibility with an appropriate PDF Shrink set. Acrobat’s settings can be tweaked, but that’s not for the faint of heart.

There are applications that produce very highly compressed PDFs. The only problem is that some size reduction schemes produce PDFs that cannot be viewed in Preview or in DTPO (and in some cases, text cannot be indexed by DTPO or Spotlight) — which defeats the purpose.

It was the custom settings that made them so huge. I just tried the “print” default setting and it reduced everything by about 75% and I don’t see any loss of quality. Unfortunately, it puts everything back in the same folder and adds “-print” to the file name, so now I have to go through umpteen million files deleting duplicates and renaming.

There’s a setting to overwrite the existing PDF, so that it is replaced by a smaller PDF with the same filename.

Thanks for your last comment, Bill. I’ll look for the setting. Have you done any scripting to integrate PDFShrink with DT Pro?

Just discovered that when I was pressing “Save”, the forum app was saving a draft of my post and not submitting it. I was wondering where my posts were going…

Thanks again for the great support you lend to the DevonThink community.


Thank you. I had to go through the user manual again to figure out how to edit an existing setting; it isn’t apparent that this is even possible, so I didn’t think I could until you said it.

I think it must have been the choice of viewers that was making it so huge, since that was the only difference between my custom settings and the preset. So I am going to try it again now.

Thanks for all the help.

Just for the benefit of anyone who wants to incorporate PDFShrink into their document import workflow:

I’ve set up a new PDFShrink configuration that is set for minimum file size and which overwrites the existing file. I saved this configuration as a droplet on my Desktop. When import into DTPro2 is completed, I just drag the file from the DTPro GUI onto the desktop droplet. Voilà - it’s shrunk after the process is finished. You can even select multiple docs and PDFShrink queues 'em all up.

I’ve seen resulting file sizes that are 10% - 20% the size of the original document, with OCR intact.
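PDFShrink is a GUI app, but for anyone who wants a fully scriptable version of this batch workflow, here’s a minimal sketch using Ghostscript’s `pdfwrite` device instead (my own substitution, not something mentioned in this thread; it assumes `gs` is installed and on your PATH, and the folder path is hypothetical):

```python
import subprocess
from pathlib import Path

def build_gs_command(src: Path, dst: Path, preset: str = "/ebook") -> list:
    """Ghostscript command that rewrites a PDF with downsampled images.
    Presets: /screen (~72 dpi), /ebook (~150 dpi), /printer (~300 dpi)."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS={preset}",
        f"-sOutputFile={dst}",
        str(src),
    ]

def shrink_folder(folder: str, preset: str = "/ebook") -> None:
    """Shrink every PDF in `folder`, writing a *-small.pdf copy alongside
    each original so nothing is overwritten before you check the results."""
    for src in sorted(Path(folder).glob("*.pdf")):
        dst = src.with_name(src.stem + "-small.pdf")
        subprocess.run(build_gs_command(src, dst, preset), check=True)
```

Like the droplet, this processes a whole batch in one go; unlike the overwrite setting, it keeps the originals around until you’re satisfied with the output.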


I was planning to scan and OCR a ton (3 file drawers) of paper into DevonThink 2 Pro.

Is it not going to work reasonably using the standard automated DT workflow (including reasonable file sizes)? Am I pretty much required to use a 3rd party app like PDFShrink?


I’m quite satisfied with most of my scans OCRed into DT Pro Office, with Preferences set to 150 dpi/50% quality. Most of my scans are black & white; OCR of such scans will usually not result in significant increase in the size of the final PDFs.

Example: an IRS Form 1099-R with accompanying documentation is 4 pages. With my ScanSnap set for black & white, and with DTPO Preferences > OCR set to 200 dpi/50% image quality, the original PDF output is 2 MB and the searchable PDF stored in the database is 1.8 MB — a slight reduction in file size.

I’ve scanned a few very large documents, some of which I share with others. I’ve used PDF Shrink on those with success. Example: Someone sent me a PDF with 421 pages that has a file size of 419 MB. I used PDF Shrink to reduce the file size to 159.2 MB but with text still clear and easy to read.

With today’s huge and cheap hard drives, file size isn’t all that important. When I scan articles, reports, letters and so on I want the scans to be clear enough to be very comfortable to read, so I usually won’t use PDF Shrink on them.

But I frequently scan receipts and invoices, e.g., for tax records. Most are 1 or 2 pages. As handwritten entries in forms are images, a 2-page document of this type might have a size of 1.2 MB. PDF Shrink can reduce it to 73.1 KB using the “smallest possible” shrink set, a very large size reduction. Although a bit fuzzy, such records meet IRS requirements should they be needed, e.g., for a tax audit. When I’ve finished a tax filing, e.g., for 2009, I’ll archive a database containing those records, perhaps processing many of them with PDF Shrink to save file size in the archive.

Yes, but when sending folders and folders of scanned PDFs to your accountant every quarter, file size is important. Especially in countries where the internet connection isn’t as reliable, fast, or cheap as in North America, Europe or Japan.

I’ve set up different profiles in my ScanSnap software. If I’m scanning a black and white document, I choose the profile that results in smaller file sizes (you can fiddle around with image quality and compression settings to do so). I’ve found that when OCR’ing multi-page color documents, however, the file size increases (I scan 30-60 page all-color documents quite a bit). The great thing about PDFShrink is that, using the procedure above (after configuring PDFShrink settings as per Bill’s suggestion), I can select multiple files in DTPO and drag them to the Shrink droplet. The shrinking is done in place with no loss of quality (as far as I’ve been able to see).

That’s what I’ve found.

I recently bought a ScanSnap S1300, which has OCR bundled, and I’ve been finding that the resulting PDF is quite a bit smaller and seems slightly clearer. I’m tempted to do the scanning outside DT, but the workflow isn’t as nice if I either use “scan to folder” or let the ScanSnap do the OCR and tell DT not to, since I lose the title step and it doesn’t delete the original. It would be nice to have the same integrated workflow when letting the ScanSnap do the OCR for me.


It appears that when DTPO OCRs documents, the new file it creates has the image data saved with a lossy grayscale algorithm; if you zoom in close to a DTPO OCRed PDF, you can see what appear to be JPEG artifacts, which do not exist in bitonal PDFs.

This is an inappropriate compression method for high-resolution bitonal images, which is what most of us are OCRing, for the following reasons:

  1. Much larger file sizes, on the order of 5-10x.
  2. Loss of image quality, which is particularly important if the file needs to be re-OCRed in the future.

Lossy compression algorithms are appropriate for photographs, not text.
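A back-of-envelope calculation makes the 5-10x figure plausible. The compression ratios below are ballpark assumptions of mine, not measurements:

```python
DPI = 300
W_IN, H_IN = 8.5, 11.0                      # US-letter page
pixels = int(W_IN * DPI) * int(H_IN * DPI)  # 2550 x 3300 = 8,415,000 px

raw_bitonal = pixels // 8    # 1 bit/pixel, packed  -> ~1.0 MB raw
raw_gray = pixels            # 8 bits/pixel         -> ~8.4 MB raw

G4_RATIO = 20    # CCITT G4 on clean text often manages 15-25:1
JPEG_RATIO = 10  # JPEG on the same page at moderate quality

g4_kb = raw_bitonal / G4_RATIO / 1024
jpeg_kb = raw_gray / JPEG_RATIO / 1024
print(f"G4 bitonal:     ~{g4_kb:.0f} KB/page")
print(f"JPEG grayscale: ~{jpeg_kb:.0f} KB/page")
```

Expanding each pixel from 1 bit to 8 and then JPEG-compressing lands an order of magnitude above a G4 stream of the same page, which is consistent with the bloat described above.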

Please forward this to the dev team; this needs to be fixed. No serious document management software should recompress bitonal images in this manner.

As confirmation of my last post:

This is an old ScanSnap 300 dpi black-and-white PDF that I ran through DTPO 2.0.2’s OCR.

This inspector is Acrobat Pro 8’s Advanced > Print Production > Preflight: run any preflight on the file, then in that window choose Options > Browse Internal PDF Structure… and click the “Browse internal structure by page” button.

It’s actually being compressed as a 300 dpi color JPEG – DCTDecode is the JPEG codec. facepalm

See: wikipedia/Portable_Document_Format#Raster_images
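You don’t strictly need Acrobat’s internal-structure browser to check this. Here’s a crude sketch of mine that greps a file’s raw bytes for stream filter names rather than properly parsing the PDF, so take the counts as indicative only:

```python
import re

# DCTDecode = JPEG (lossy); CCITTFaxDecode / JBIG2Decode = bitonal codecs;
# FlateDecode = generic lossless (zlib).
FILTERS = (b"DCTDecode", b"CCITTFaxDecode", b"JBIG2Decode", b"FlateDecode")

def filter_counts(pdf_path: str) -> dict:
    """Count occurrences of each known stream filter name in the raw bytes.
    No real PDF parsing -- just enough to spot JPEG recompression."""
    with open(pdf_path, "rb") as fh:
        data = fh.read()
    return {name.decode(): len(re.findall(re.escape(name), data))
            for name in FILTERS}
```

A bitonal scan whose OCRed copy reports DCTDecode hits instead of CCITTFaxDecode or JBIG2Decode has had its image layer recompressed as JPEG.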

Alexander: This has been documented before in the forums. DevonThink isn’t interested; the OCR is outsourced to another program. Since you have Acrobat 8 Professional, you can use the same workflow as I do. It creates files which retain image quality and file size, but at a slight loss of OCR quality, since Acrobat’s OCR isn’t as accurate as ABBYY’s. See here: viewtopic.php?f=20&t=10045&p=46927#p46927.

Thanks for the heads-up on that, B. For kicks, I downloaded ABBYY’s own FineReader, and it does the same ridiculous thing.

I like this quote from Adobe’s forum:

"Accuracy/completeness of OCR output, from any application, is significantly impacted by the effective resolution of the image. Resolutions of 200 to 250 ppi resulted in a noticeable degradation.
Below 200 ppi degradation ramps up significantly. Below 100 ppi results in OCR output being effectively worthless.
Any downsample results in destructive removal of pixels with the consequent degradation of image fidelity.
Image output of downsampling rarely provides much for an OCR process to “grab”.

Adobe documentation identifies 300 ppi as an optimal resolution for a balance of file size and OCR accuracy.
Records agencies, such as NARA, recommend 600 ppi."
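To put rough per-page numbers behind those resolutions (my arithmetic, not from the quote; the 20:1 G4 ratio is a ballpark assumption for clean bitonal text):

```python
G4_RATIO = 20  # ballpark CCITT G4 compression for clean bitonal text
for dpi in (100, 150, 200, 300, 600):
    pixels = int(8.5 * dpi) * int(11 * dpi)   # US-letter page
    raw_kb = pixels / 8 / 1024                # 1 bit per pixel, packed
    print(f"{dpi:4} dpi: {raw_kb:7.0f} KB raw, ~{raw_kb / G4_RATIO:4.0f} KB as G4")
```

Raw size grows with the square of the resolution, but G4 keeps even a 600 dpi bitonal page around 200 KB, which is why keeping OCR-friendly originals costs so little disk space.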

So for everyone who is scanning/importing documents with DTPO, saving them as “150 dpi 50% quality” PDFs, and throwing out the original copies, I hope you’re happy with today’s OCR quality, because you’ll never be able to accurately re-OCR them again, ever.

Meanwhile, I’ve got my original quality 300+ dpi scans, in smaller files.

See why this is a problem?