DTPO OCR: downsample to 72dpi after recognition, before save

Robert_Black · December 4, 2006, 7:13pm

In order for the OCR in DTPO to work, you need to scan with a resolution of at least 300dpi, which is fine. However, the OCR process then embeds the image at 300dpi as a layer in the resulting PDF, which results in a bloated PDF.

I note that the OCR process in Acrobat, flawed as it is in many ways, does have one distinct advantage - it offers a setting to downsample an embedded image to 72dpi after OCR has been performed, and before the image layer is added to the PDF. Best of both worlds! Please consider that a feature request for DTPO.
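
To put rough numbers on the bloat, here is a minimal sketch of how many pixels one scanned page holds at each resolution. The US Letter page size (8.5 x 11 inches) and the helper name `page_pixels` are assumptions for illustration, not anything from DTPO or Acrobat.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Pixel count of one page scanned at the given resolution.

    The default page size is US Letter, an assumption for this example.
    """
    return round(width_in * dpi) * round(height_in * dpi)


scan = page_pixels(300)    # 2550 x 3300 pixels
screen = page_pixels(72)   # 612 x 792 pixels

print(scan, screen)
print(round(scan / screen, 1))  # how many times more pixels the 300dpi image holds
```

Because pixel count grows with the square of the resolution, the 300dpi image layer carries roughly 17 times as many pixels as a 72dpi one. The compressed file size will not scale exactly the same way, but it follows the same trend.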

Robert_Black · December 6, 2006, 3:18pm

I wrote the following in another post (http://www.devon-technologies.com/phpBB2/viewtopic.php?p=18316#18316):

Maybe you don’t need to rely on IRIS, or engineer your own solution from scratch? OS X has the tools to post-process a PDF after it’s been saved to a file. See the Save As… dialog in Preview. If you open one of IRIS’s PDFs, select PDF as the file format, and then choose the Quartz Filter ‘Reduce File Size’, the resulting PDF has its image and text layers intact, but the image layer is hugely reduced (too much in fact - the quality setting is too low by default).

A Quartz Filter seems perfect to study or harness as a way to downsample the image layer to 72dpi without discarding the text layer in IRIS’s PDFs.

See the ‘Filters’ tab in the ‘ColorSync Utility.app’ to create a custom compression filter. I created one that downsamples to 72dpi and uses medium JPG compression, and tried it on a PDF from IRIS - file size dropped from 1249 KB to 164 KB, while remaining very readable on-screen.
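
For scale, the saving reported above works out as follows. This is only arithmetic on the two file sizes quoted in the post; the function name `size_reduction` is made up for the example.

```python
from typing import Tuple


def size_reduction(before_kb: float, after_kb: float) -> Tuple[float, float]:
    """Return (percent saved, shrink factor) for a before/after file size."""
    percent_saved = (1 - after_kb / before_kb) * 100
    shrink_factor = before_kb / after_kb
    return round(percent_saved, 1), round(shrink_factor, 1)


# File sizes quoted in the post: 1249 KB before the filter, 164 KB after.
print(size_reduction(1249, 164))
```

So the custom filter removed about 87% of the file, leaving a PDF between a seventh and an eighth of its original size.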

So maybe you don’t need to rely on IRIS to change their engine to offer an option to downsample scanned and OCR’d PDFs to 72dpi, just like Acrobat.

Anyway, I’m sure I have a very unrealistic idea of how hard that would be to integrate into DTPO (issues like supporting earlier OS versions and their frameworks, e.g. Panther). I’m just trying to help out with info. I do appreciate that the IRIS OCR engine gives more accurate results - it’s just the size of the image layer I take minor issue with.

Robert_Black · December 6, 2006, 4:41pm

I’ve figured out a workaround I’m happy with.

1. I made my own custom 2-step Quartz filter (first downsample to 72dpi with high quality, and then compress using JPG with a 50% quality setting).
2. I created an Automator workflow with three steps: Filter Finder Items (File Type is PDF File), Apply Quartz Filter to PDF Documents (My own 72dpi Quartz Filter), and Label Finder Items (Green). This I’ve saved as an application so that PDFs can be dragged onto it as the input.
3. I’ve put my Automator workflow applet inside the /DEVONthink.dtBase/Files/ folder, and I’ll periodically drag PDFs that aren’t labelled Green onto it.
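
The logic of the three workflow steps above can be sketched in a few lines of Python. This is not how Automator runs them and it does not call any Quartz API: `apply_filter` is a hypothetical caller-supplied function standing in for the Quartz filter action, and the `already_done` set stands in for the Green Finder label (and for the manual choice of dragging only unlabelled PDFs).

```python
from pathlib import Path
from typing import Callable, Iterable, List, Set


def run_workflow(
    items: Iterable[Path],
    already_done: Set[Path],
    apply_filter: Callable[[Path], None],
) -> List[Path]:
    """Mimic the three workflow steps on a batch of dropped files."""
    processed: List[Path] = []
    for item in items:
        # Step 1: Filter Finder Items (keep PDF files only).
        if item.suffix.lower() != ".pdf":
            continue
        # Stand-in for only dragging files that are not yet labelled Green.
        if item in already_done:
            continue
        # Step 2: Apply Quartz Filter to PDF Documents (hypothetical callback).
        apply_filter(item)
        # Step 3: Label Finder Items (Green), modelled here as a set entry.
        already_done.add(item)
        processed.append(item)
    return processed


if __name__ == "__main__":
    done: Set[Path] = {Path("old-scan.pdf")}
    dropped = [Path("invoice.pdf"), Path("old-scan.pdf"), Path("notes.txt")]
    # print() is used as a dummy filter so the example runs anywhere.
    print(run_workflow(dropped, done, print))
```

Marking each file once it has been filtered is what keeps a PDF from being recompressed a second time, which matters because JPG compression loses a little more quality on every pass.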

Robert_Black · December 6, 2006, 5:03pm

And here I thought I’d be smart and save my workflow (as a workflow this time, not an applet) in the ~/Library/Application Support/DEVONthink Pro/ folder, so I could run it directly on my OCR’d PDF from within DTPO’s Scripts drop-down menu in the toolbar (it’s a custom toolbar option). But no luck - the magic doesn’t work. The workflow runs, but not on the PDF I have selected in DTPO. Oh well. It would have been nice. Any advice?

annard · December 7, 2006, 9:07am

Yes, I have thought about using the Quartz filters but if I use a workflow solution it doesn’t work in Panther and in any case it will add another (lengthy) delay to the process. So, in the future this will be tackled in an elegant way when we have the possibility to do so.

In the meantime, on Tiger, running a workflow is a good idea. You can send me the workflow and I’ll take a look at it. I know that with 10.4.8 there is an issue with the Get Data from Record action and subsequent PDF actions. So maybe you have the same issue here? Just send it to support@devon-technologies.com. It may even end up in the distribution if you don’t mind.

Thanks!

Robert_Black · December 10, 2006, 7:27pm

Done
Robert

StephenFleming · January 5, 2007, 4:09am

Any hints as to when a built-in solution for downsampling the image layer might be available?

We’re moving, and I have several boxes of documents that I’d rather not move. And since I paid for DTPO, I want to scan and OCR them.

But I don’t want to scan them if they’re going to wind up with the godawful-huge PDFs we have now.

So… on the edge of my seat here. A hint? Thanks!

annard · January 8, 2007, 9:42am

The next release will downsample images to 150dpi, resulting in both great space savings and still a nice printout.
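
As a rough guide to what 150dpi means for size, the sketch below compares pixel counts between resolutions. It only assumes that pixel count scales with the square of the dpi; actual PDF sizes also depend on the compression used, and `relative_pixels` is a name invented for this example.

```python
def relative_pixels(new_dpi: float, old_dpi: float) -> float:
    """Pixel count at new_dpi as a multiple of the pixel count at old_dpi."""
    return (new_dpi / old_dpi) ** 2


print(relative_pixels(150, 300))            # 150dpi versus the original 300dpi scan
print(round(relative_pixels(150, 72), 1))   # 150dpi versus a 72dpi screen-only image
```

A 150dpi image layer holds a quarter of the pixels of the 300dpi scan, and a little over four times as many as a 72dpi layer, which fits the stated goal of saving space while keeping printouts usable.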