Just a quick follow-up:
While I have complained here and elsewhere about the size of PDFs OCR’d in DTPO, I have to admit the accuracy of its OCR is hard to beat (I think Adobe ClearScan is really the cat’s pyjamas in terms of accuracy and file size… but Adobe).
I use large government documents for one aspect of my research. For some reason I can’t explain, these documents tend to be scans of paper, so each page is an image of text. The scans are reasonably good in terms of DPI, but they can be a bit noisy and are generally poor in quality. They range from 150 to 2000 pages. Recently, on a smaller document (175 pages), I thought I’d test DTPO versus PDFpen Pro to see:
- Who was faster
- Who produced the smaller file
- Who produced the better-looking file
PDFpen produced a smaller, better-looking file a fair bit faster than DTPO. The settings I have for quality in DTPO did not bring the size down to a comparable level and produced an uglier PDF.
Since the PDFpen file was done first, I began searching through it, trying to select text here and there, and so on. While in general it did okay, it was clear that the less-than-ideal quality of the scans was a bit of a challenge. Lots of poorly aligned elements in the OCR layer, many inaccurate or incomplete words, and so on. Not a big deal, but I generally need to search these documents for specific words, so accuracy is somewhat critical in this case. I don’t want to track down a keyword in a 500-page document manually because the OCR failed to detect it.
Once the DTPO version was done, I tried some of the same text-based things, searching, selecting words, using the “lookup” feature in Mac OS, and found that in many places where PDFpen Pro struggled to OCR, DTPO succeeded like a champ.
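For anyone who wants to put a rough number on that kind of difference, here is a minimal sketch in Python. It assumes you have already exported the text layer of each OCR’d PDF to a plain string with whatever tool you prefer; the function names and the sample strings are made up for illustration and are not part of DTPO or PDFpen.

```python
import re
from collections import Counter

def keyword_hits(text, keywords):
    """Count case-insensitive whole-word occurrences of each single-word keyword."""
    words = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return {kw: words[kw.lower()] for kw in keywords}

def compare_layers(text_a, text_b, keywords):
    """Return {keyword: (hits_in_a, hits_in_b)} for keywords whose counts differ."""
    hits_a = keyword_hits(text_a, keywords)
    hits_b = keyword_hits(text_b, keywords)
    return {kw: (hits_a[kw], hits_b[kw]) for kw in keywords if hits_a[kw] != hits_b[kw]}

# Made-up sample strings standing in for the text exported from each OCR'd PDF.
layer_a = "The min1stry issued the tariff schedule. The tarif was revised."
layer_b = "The ministry issued the tariff schedule. The tariff was revised."

print(compare_layers(layer_a, layer_b, ["ministry", "tariff"]))
# → {'ministry': (0, 1), 'tariff': (1, 2)}
```

A keyword that shows up fewer times in one layer than the other is a hint that the OCR misread it somewhere, which is exactly the kind of miss that makes searching a long document unreliable.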
So, the moral of the story is, OCR is about tradeoffs. In this specific use-case I am burdened with a massive PDF no matter what. Whether the file is 60 MB or 90 MB isn’t terribly important (though of course I would prefer the smaller size) because it doesn’t interfere with my ability to use it. Here, accuracy is more important than file size. Since it isn’t a lengthy monograph, compression artefacts are also not the end of the world (though I certainly don’t like them). In this case DTPO is the clear choice.
In other instances where scan quality is better, PDFpen may do a better job with a smaller file size and marginally better image quality. This would be important where file size matters more, such as for things I might store or need to access on a mobile device.
So, I think the gist is, the suitability of DTPO as an OCR tool depends a great deal on your needs. It doesn’t always work for my needs, but its high level of accuracy really does win out in this specific case for me.