De-OCR?

Hi,

This may be a dumb question. But… My workflow involves using DTO’s OCR capability to turn book-length PDFs into searchable text. The resulting PDF files are, of course, several times larger than the original image-only PDFs.

What I’m wondering is this: Let’s say I’ve finished working with the PDF file. I’d like to retain the PDF, but I don’t need it to be searchable any longer. Any way to convert back to a non-searchable, image-only PDF and so shrink down the file and save some precious hard drive space?

You did do your work on a copy of the original PDF file, right? If so, then you have the original smaller PDF file somewhere, which is what you’re looking for, I assume. Just delete the OCR’d file (PDF + text) and then add your original to your database.

I’m wondering the same thing. My workflow is to scan documents directly into DTPO & OCR them. When I finally get to sorting out the Inbox, I realise that not all documents, or all pages, need to be OCR’d. How can I un-OCR specific pages? (And I will open up another thread with a slightly different question: How can I OCR only page 1?)

Thanks, Twicks… But no–my current workflow involves retaining ONLY the DTO (searchable PDF) copies. In the future I can retain the image-only versions… But I’ve got quite a few searchable PDFs I’d like to “de-OCR” and reclaim some disk space.

Anyone–is it possible?

And–BCarpenter… You could always open your PDF, print specific page(s) to a new PDF, and then use DTO to OCR the new PDF. (If you want everything together in a single file, you can use Preview to insert the new OCR’d pages back into the original file, and delete the corresponding un-OCR’d pages.)

Acrobat enables you to view and delete objects in the top-level dictionary of a PDF. This includes text, etc. Not sure what other applications have this feature.

C

Thanks–that would be one possibility!

Removing the text layer would make little difference in file size. The increase in file size after OCR is primarily due to the re-rasterization of the image layer. Acrobat has a procedure to reduce file size of PDFs, but this may or may not produce significant reduction.

I just tried removing the text layer using Acrobat. As Bill has pointed out, very little file size reduction was achieved, like 2-3% perhaps. Bill, what do you mean by re-rasterisation? I thought the settings in DTPO to decrease the dpi & decrease the JPEG quality would have reduced the file size, not increased it.

The OCR process itself recreates the image layer, saving each page of the PDF as a temporary file at 300 dpi and maximum image quality, then assembling them into a single PDF when all have been converted. This file would probably be very much larger than the original image-only scanned PDF.
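To put rough numbers on the 300 dpi figure, here is a minimal sketch of how many pixels a rasterised page contains. The US Letter page size (8.5 x 11 in) is an assumption for illustration, and `page_pixels` is a hypothetical helper, not anything DTPO exposes.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Return (width_px, height_px, total_pixels) for a page rasterised at `dpi`.

    The default page size is US Letter, which is an assumption here.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


for dpi in (300, 150):
    w, h, total = page_pixels(dpi)
    print(f"{dpi} dpi: {w} x {h} = {total:,} pixels")
```

Halving the resolution from 300 to 150 dpi cuts the pixel count to a quarter, which is why the dpi setting matters so much for the size of the image layer before any compression is applied.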

The DTPO2 Preferences > OCR settings for dpi and image quality (suggested: 150 dpi, 50% image quality) are a compromise to reduce the final size of the PDF as stored in the database. Even so, the saved PDF will likely be larger than the original image-only PDF as produced by the scanner.

If it’s any comfort, ABBYY OCR tends to produce significantly smaller searchable PDFs than did the IRIS OCR previously used in DTPO 1.x. (ReadIRIS Pro 12 has an option to produce compressed PDFs, which are considerably smaller; the downside is that such compressed PDFs are unreadable in DEVONthink or Preview.)

The good news is that storage media are getting larger and cheaper at the same time. :)

Bill,

Thanks–this is all helpful info. I did suspect that simply stripping out the text would only reduce file size minimally… After all, it can’t be the text that takes up so much space, right?

So thanks for helping us look under the hood.

And yes–at least hard drive real estate is a buyer’s market. :)

Alan

Thanks for the detailed reply, Bill. Out of interest, I tried taking a 4-page monochrome PDF scanned at 150 dpi & converting it to an OCR’d PDF in DTPO using the settings of 150 dpi/50%. The file size went from about 80 KB to 750 KB. I then put the same file through OCR using Acrobat Professional 8 & the file size went from 80 KB to 92 KB.
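For anyone who wants the comparison as ratios, here is a small sketch using only the sizes quoted above (about 80 KB before, 750 KB after DTPO, 92 KB after Acrobat). The function names are made up for illustration.

```python
def growth_factor(before_kb, after_kb):
    """How many times larger the file became."""
    return after_kb / before_kb


def percent_increase(before_kb, after_kb):
    """Size increase as a percentage of the original size."""
    return (after_kb - before_kb) / before_kb * 100


original_kb = 80  # approximate figure from the post
results = {"DTPO (150 dpi / 50%)": 750, "Acrobat Professional 8": 92}

for tool, after_kb in results.items():
    factor = growth_factor(original_kb, after_kb)
    pct = percent_increase(original_kb, after_kb)
    print(f"{tool}: {factor:.3f}x the original (+{pct:.1f}%)")
```

So for this one monochrome sample the DTPO output is roughly nine times the original, while the Acrobat output grows by about 15%.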

So, Acrobat is much more efficient at OCR than DTPO. Now to figure out if the extra step in the workflow is worth the effort…

I don’t know that that’s the correct conclusion. Acrobat is likely encoding the image data in CCITT Group 4 compressed format, while ABBYY is using Apple’s PDFKit (or whatever), which is 32-bit uncompressed. DTPO is just using the Apple-native format, which gives an optimal viewing experience at the expense of larger file size.

Try working with a large (~50 page, say) PDF saved as CCITT4 in Apple’s Preview before you get too excited about Acrobat, or knock DEVONtechnologies.

(BTW, There’s lots on this forum about the above. Search for it.)

Best, C

Yes, I spoke too soon. Looks like I picked the one example where there is such a glaringly obvious difference. I’ve tried several other sample workflows now using colour scans at various DPIs, OCR in DTPO & Acrobat, different ways of reducing image quality, etc. There doesn’t seem to be too much difference between Acrobat & DTPO (Acrobat better by about 10%), except in the case that I mentioned above. Perhaps it is something to do with converting the monochrome image to a colour/greyscale image when the re-rasterisation is done?

As for previous posts on this, I couldn’t find them. However, I’ve just worked out how to do an advanced search & some posts from early in the year have shown up. Having gone through them, there is no solution to this problem. Yes, the monochrome images are converted to JPEGs (increasing file size, decreasing contrast), but no explanation of why this is necessary has been given.

From my experiments, the best workflow to maintain readability & keep file sizes to a minimum is:

  1. Scan to folder.
  2. Open in Acrobat. Recognise text using OCR with settings of “Searchable image (exact)”. Then PDF Optimiser to downsample everything to, say, 150dpi. Save file.
  3. Import this into DTPO.
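If you want to check whether a workflow like the one above actually pays off on your own scans, a before/after size comparison is easy to script. This is only a sketch: `percent_change` and `compare_files` are hypothetical helper names, and the two paths are whatever original and processed PDFs you pass on the command line.

```python
import os
import sys


def percent_change(before_bytes, after_bytes):
    """Size change as a percentage of the original (negative means smaller)."""
    return (after_bytes - before_bytes) / before_bytes * 100


def compare_files(before_path, after_path):
    """Return (before_bytes, after_bytes, percent_change) for two files on disk."""
    before = os.path.getsize(before_path)
    after = os.path.getsize(after_path)
    return before, after, percent_change(before, after)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_sizes.py original.pdf processed.pdf
    before, after, pct = compare_files(sys.argv[1], sys.argv[2])
    print(f"{before:,} bytes -> {after:,} bytes ({pct:+.1f}%)")
```

Running it over a handful of representative scans (monochrome, greyscale, colour) gives a fairer picture than a single sample.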

Try this experiment: forget about DTPO for a second and get one of those “slim” monochrome PDFs you mentioned. Open it in Preview and then “Save As…” You should have a much larger file, although it should scroll much more smoothly. Pick a big one, and you’ll really see the difference the image format makes to OSX. Why Apple doesn’t have a decent (in hardware?) CCITT4 decompression algorithm in Quartz is beyond me.

FWIW, I’ve re-compressed DTPO/ABBYY OCR’ed PDFs in Acrobat, and have had great results. So you can always revisit the issue of the bitmaps’ data format at your leisure.

Charles

I’ve just given this a go. I took a monochrome image at 150 dpi & Saved As using Preview. It made no difference to the document size. I then tried adding a note annotation to it & it added a minuscule amount to the file size.

Thanks for this tip. How do you do this? I’ve tried using the PDF Optimiser section of Acrobat 8 Professional & it made little difference to the file size (in some instances the file even got bigger). The image remains greyscale with JPEG compression. I can’t seem to get a greyscale PDF resaved as a monochrome image.

It works for me. Perhaps the source file wasn’t a bitmap format?

I spoke too soon, or imprecisely, about the recompression. I’ve been able to reduce the DPI of a PDF OCR’ed in DTPO, but haven’t changed the format from greyscale to bitmap.

This could be done by exporting the PDF as image files, but then you’d have to re-OCR…

C

Oh, okay. The DPI isn’t really the issue because DTPO can downsize the DPI at the time of OCR. As far as I can see, the main reason the file size increases is that the document is changed from monochrome to greyscale/colour. It is this process that I can’t work out how to reverse.

Right, but this gets us back to (sort of) my original point: DTPO/ABBYY converts the monochrome PDFs to greyscale because it yields superior performance on OSX. Or, DTPO/ABBYY converts the monochrome PDFs to greyscale because they use the underlying OS support, which actually makes the conversion.

The point is, because it yields superior performance on OSX, why would you want to convert it back?

C

By “superior performance”, do you mean faster decompression & therefore faster display, therefore less jerking during scrolling? If so, I understand what you’re getting at. There is a small but noticeable difference on my system.

However, my primary concern is not the smoothness of scrolling, but rather minimising file size while preserving image quality. The image has already been converted from monochrome to greyscale by DTPO, leading to a drop in quality & increase in file size. Do I want to convert it back? Not really, as it will lose quality further again. Rather, I’d prefer that the conversion didn’t happen in the first place & the original image was preserved. The ABBYY engine doesn’t require a greyscale image for OCR as the standalone ABBYY program happily runs on monochrome images without conversion.

Yes, although I think there are also rasterization issues as well.

Not necessarily. If you don’t change the DPI, you’re simply exchanging a 1-bit representation of B&W for a 32-bit representation. No loss of “quality” there.
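As a back-of-the-envelope illustration of the 1-bit versus 32-bit point, here is a sketch of the raw (uncompressed) size of one page at each bit depth. The 1275 x 1650 pixel page (US Letter at 150 dpi) is an assumption, and real PDFs store compressed image data (CCITT Group 4, JPEG, etc.), so actual file sizes will differ a lot from these figures.

```python
def raw_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in bytes, ignoring row padding and file overhead."""
    return width_px * height_px * bits_per_pixel / 8


# Assumed page: US Letter (8.5 x 11 in) at 150 dpi.
width_px, height_px = 1275, 1650

mono = raw_bytes(width_px, height_px, 1)     # 1 bit per pixel (black & white)
rgba = raw_bytes(width_px, height_px, 32)    # 32 bits per pixel

print(f"1-bit:  {mono:,.0f} bytes")
print(f"32-bit: {rgba:,.0f} bytes")
print(f"ratio:  {rgba / mono:.0f}x")
```

The same pixels are kept in both cases, so nothing is lost by going from 1 bit to 32 bits per pixel; the cost is purely in storage, before whatever compression the PDF writer applies.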

So just OCR your documents outside of DTPO. Just because the OCR is there doesn’t mean you have to use it there. :)