OCR to PDF+text

I am confused.

Getting started on the scan > OCR > DEVONthink process. (And anxiously waiting with credit card in hand for Devon to come out with an integrated solution!)

I have scanned a 4 page document using my wonderful new Fujitsu ScanSnap. It scans directly into DEVONthink, and the information panel shows the PDF occupying 319 KB. Okay, that sounds reasonable.

Then I move the PDF to Adobe Acrobat 7.0 and perform OCR on it. The resulting PDF+Text document, back in DEVONthink, is 2939 KB!

I copied the text out and pasted it into a blank BBEdit file. The text is 7425 bytes, rounded to 8K.

(I tried doing a Save As in Acrobat, then tried Reduce File Size, but that only got the size down to 1.9 megs… still too much.)

So we have a 319K image plus 8K of text adding up to a 2.9 megabyte PDF file. What’s going on here?

(I realize the issue is probably with Acrobat, not DEVONthink, but I like the people on this forum better. :slight_smile:)
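
For scale, the sizes quoted in the post above can be compared directly. This is only a quick arithmetic sketch using the figures given (319 KB scan, 7425 bytes of text, 2939 KB result); the function name is made up for illustration. It works out to roughly nine times the combined size of the image and the text.

```python
def bloat_factor(image_kb, text_bytes, result_kb):
    """Return how many times larger the OCR'd PDF is than image + text."""
    # Convert the text size to KB so all three figures use the same unit.
    expected_kb = image_kb + text_bytes / 1024
    return result_kb / expected_kb


# Figures taken from the post: 319 KB scan, 7425 bytes of text, 2939 KB PDF+Text.
factor = bloat_factor(image_kb=319, text_bytes=7425, result_kb=2939)
print(f"PDF+Text is about {factor:.1f}x the size of its parts")
```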

Stephen, I feel the same way about my ScanSnap; it’s amazing.

OCR processes rasterize each page again, and that’s where the PDF file size can bloat.

I’ve used a little app for years, PDFShrink, to reduce PDF file size. It provides options for end use. For example, if smallest possible size for screen view only is needed, that’s an option. If reasonably good print quality is needed, that can be accomplished.

High-resolution scans prior to OCR are good. After OCR, the image quality of the PDF+text file can be reduced.
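
To get a feel for why a re-rasterized page can be so much bigger, here is a rough sketch of how the raw (uncompressed) size of a page bitmap grows with resolution and color depth. The page size (US Letter) and the DPI and bit-depth values are assumptions picked for illustration, not settings known to be used by Acrobat or the ScanSnap, and a real PDF compresses its images, so actual files are smaller than these figures.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one page rasterized at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Compare black-and-white, grayscale, and color at two assumed resolutions.
for dpi in (150, 300):
    for label, bits in (("1-bit B&W", 1), ("8-bit gray", 8), ("24-bit color", 24)):
        size_kb = raw_page_bytes(dpi, bits) / 1024
        print(f"{dpi} dpi, {label}: {size_kb:,.0f} KB per page")
```

The point is only that a page stored as 1-bit black and white and then re-saved as grayscale or color at the same resolution carries 8 to 24 times as much raw image data before compression.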

Really? That seems… well, “unwise” is probably the nicest word I can think of. At minimum, you’d think there’d be a prefs option in Acrobat/IRIS/whatever to “leave my bitmap alone”!

I’ll try PDFshrink, but I hate adding one more step to a process that already has too many. Thanks!

Stephen, there’s another option: get a bigger hard drive. :slight_smile: Then you won’t worry so much about file size. I’ve got 1.5 TB online on my PowerMac G5.

It’s not necessary to “shrink” the individual PDF files, one by one. If you start to get low on space or whatever, PDFShrink can batch shrink all the PDF files in your database Files folder. That requires a bit of familiarity, starting with how to reveal the contents of your package’s Files folder, and configuring PDFShrink so that it can batch process all the PDFs. You would probably set up PDFShrink to save the output to another folder, saving each PDF with an unaltered file name, then replace the existing PDFs in the Files folder with the “shrunken” PDFs of the same name. Note: Before doing that, experiment with the various compression options in PDFShrink to choose the one you prefer. You probably wouldn’t want the smallest possible file size if you want easy screen reading with reasonable print quality.
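
The “replace the existing PDFs with the shrunken ones” step could be scripted along these lines. This is a minimal sketch, not a DEVONthink or PDFShrink feature: the function name and the three folder arguments are hypothetical, it assumes the output folder uses the same file names (and the same subfolder layout, if any) as the Files folder, and it assumes DEVONthink is not running while the files are swapped. It keeps a copy of each original and skips any file that did not actually get smaller.

```python
import shutil
from pathlib import Path


def replace_with_shrunken(files_dir, shrunk_dir, backup_dir):
    """Swap in smaller same-named PDFs, keeping a backup of each original.

    Returns the relative paths of the files that were replaced.
    """
    files_dir = Path(files_dir)
    shrunk_dir = Path(shrunk_dir)
    backup_dir = Path(backup_dir)
    replaced = []

    for shrunk in sorted(shrunk_dir.rglob("*.pdf")):
        rel = shrunk.relative_to(shrunk_dir)
        original = files_dir / rel

        # Only touch files that already exist under the same name.
        if not original.is_file():
            continue
        # Skip anything the compression step failed to make smaller.
        if shrunk.stat().st_size >= original.stat().st_size:
            continue

        # Back up the original before overwriting it.
        backup = backup_dir / rel
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(original, backup)
        shutil.copy2(shrunk, original)
        replaced.append(str(rel))

    return replaced
```

Trying it first on a copy of the Files folder, rather than on the live database, would be the cautious way to check that the result opens and searches as expected.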

To save space (which may or may not be the ultimate goal here), try using Omni’s DiskSweeper. I found some HUGE audio files that I had forgotten about from when I’d digitized some old cassettes.

(I personally also have the bad habit of not clearing out my downloads folders very often. When I do, I recover lots of space I’d ‘lost’.)

(Hope this isn’t a forum no-no.)

I sure don’t see how anything you’ve posted here could possibly be in the no-no category.

Hopefully my suggesting JDiskReport, a freeware alternative to OmniDiskSweeper, is okay, too. :slight_smile:

grin Well, it’s not the disk space as much as the backup strategy for all that space. I burn my Documents folder, minus mail messages, to DVD regularly, and I can see that this scanning process will quickly overwhelm that strategy.

Hmm. I understand the steps you outlined, but is it safe to monkey around inside the .dtBase package like that? Makes me nervous…

The universe isn’t safe. But one can take precautions :slight_smile:

Reducing the file size of PDFs can be done without incident. But as usual it’s a good idea to make an external backup of the database first.