For Searchable Text in PDFs--DTPO or Adobe Acrobat?

rmansfield · July 6, 2009, 12:55pm

I have a number of documents saved as PDF files created originally from simple straight scans. I notice that DTPO can convert these to searchable text PDFs. I also have a full version of Adobe Acrobat (8) which can do the same thing.

Adobe Acrobat is faster for creating searchable text than DTPO, but I wonder if anyone knows or has an opinion as to which one is better. Is DTPO slower because it is more accurate in its OCR capabilities?

rmansfield · July 29, 2009, 5:09pm

I’m going to comment on this again to bring it back as a current topic in hopes that someone can answer my question. As mentioned above, I know Acrobat is faster than DTPO at converting PDFs to searchable text, but I really want to use whichever program is more accurate. In the big picture, speed is not an issue.

And obviously, I’m going to keep the files in DTPO, but am just curious as to whether it or Adobe Acrobat has the more accurate OCR. Thanks.

Bill_DeVille · July 29, 2009, 5:47pm

I haven’t done a comparison of OCR accuracy for Acrobat and DTPO 2 for the same document. But comparing the text of similar documents, I would give the point to DEVONthink Pro Office 2.

There is a reason for the speed difference you noted. OCR of a large file can be very RAM-intensive, as many page images are kept in memory. This can result in choking the computer.

To mitigate that problem, Annard set up the procedure to store temporary pages on disk.

Thus, although OCR will take longer, other work can be done on the computer at the same time. A large queue of images for OCR can chug along to completion.
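To illustrate the trade-off, here is a minimal Python sketch of the general idea: page images are written to temporary files first, then read back and recognized one at a time, so only a single page sits in memory at any moment. This is not DEVONthink’s actual code. The function names, the file naming scheme, and the stand-in `fake_recognize` step are all hypothetical and exist only for illustration.

```python
import os
import tempfile
from typing import Callable, Iterable, Iterator, List


def spool_pages_to_disk(pages: Iterable[bytes], workdir: str) -> List[str]:
    """Write each page image to its own file and return the file paths.

    Only one page is held in memory at a time while spooling.
    """
    paths = []
    for index, data in enumerate(pages):
        path = os.path.join(workdir, f"page_{index:05d}.bin")
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    return paths


def ocr_from_disk(paths: List[str], recognize: Callable[[bytes], str]) -> Iterator[str]:
    """Read pages back one by one, recognize each, and delete its temp file."""
    for path in paths:
        with open(path, "rb") as handle:
            data = handle.read()
        text = recognize(data)
        os.remove(path)  # free the disk space as soon as the page is done
        yield text


def fake_recognize(data: bytes) -> str:
    """Hypothetical stand-in for a real OCR engine: just decodes the bytes."""
    return data.decode("ascii")


def run_demo() -> List[str]:
    # A generator, so the "scanned pages" are never all in memory at once.
    pages = (f"page {n}".encode("ascii") for n in range(3))
    with tempfile.TemporaryDirectory() as workdir:
        paths = spool_pages_to_disk(pages, workdir)
        return list(ocr_from_disk(paths, fake_recognize))
```

The extra disk writes and reads are what cost time, but peak memory stays at roughly one page regardless of how long the queue is.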

rmansfield · July 29, 2009, 11:09pm

That’s what I kind of needed to hear.

If that weren’t enough, though, I found another way in which DTPO’s scanner is better than Adobe Acrobat…

Today, I scanned a full-color newsletter. Some of the text was white on black. I wondered how each program would handle this. I ran it through Acrobat first and Acrobat would recognize yellow text on a black background, but not the white text on a black background.

So I tried the same thing in DTPO. And guess what–it recognized ALL the text.

So, now I know!

ptram · March 25, 2010, 10:06pm

I was able to run a test comparing DTPO 2 and Acrobat 8 (so, not the latest version). The attached document shows the results on rather problematic documents.

Devon is definitely more accurate, while Acrobat sometimes even refuses to compete.

NOTE: I’m not able to attach the test file. Neither rtf nor pdf files are allowed by the board. As a temporary solution, I’m posting the link to my dropbox:
dl.dropbox.com/u/982284/OCR_Acrobat_Devon.pdf

Paolo

darwin · March 26, 2010, 8:23am

Discussions here in the forum indicate that Acrobat produces much smaller pdfs, it seems to me.

ptram · March 26, 2010, 11:03am

Darwin, I’m not sure of the relation between recognition accuracy and generated PDF size. Could you elaborate on that?

Paolo

darwin · March 26, 2010, 11:16am

No, I’m sorry. Perhaps the discussions about OCRd pdfs and file size give some answers.

ptram · March 26, 2010, 3:55pm

Darwin, thank you for pointing me there. The few scans I’ve done don’t exhibit this increased size problem, but they are rather small PDF files to begin with. What I care about more, however, is that the OCR is done well, so that searching a document is effective. It seems to me that DT has much better OCR than Acrobat (8).

Paolo

darwin · March 26, 2010, 9:36pm

That’s interesting, because most complaints here in the forum are about file size, I think. It would be interesting to know how big your files are.

ptram · March 27, 2010, 2:42am

I’ve tried only two files: one is a 700 KB, four-page PDF that remained the same size after OCR. The other is a 684 KB JPG file of a brochure, which was reduced to 215 KB after OCR in Devon.

Settings are 150 dpi, 75%, Automatic quality. I’ll run more tests before the demo expires.

Paolo

bcarpenter · March 28, 2010, 11:13pm

Some of your problem here may be whether Acrobat has been set up to recognise non-English words or not.

Apart from that, my testing over the last 6 months has found that DTPO is certainly more accurate than Acrobat 8 in OCR, but does generate larger file sizes at lower quality. There are other threads on this, but the basic problem is that DTPO converts bitmap monochrome files to greyscale, increasing the file sizes significantly. The file size is determined by the graphics/image component of the file, not the hidden OCR’d text layer.
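As a rough back-of-the-envelope check on why the greyscale conversion matters, the Python sketch below compares the uncompressed size of a single page at 1 bit per pixel (monochrome) and 8 bits per pixel (greyscale). The US Letter page size is an assumption chosen for illustration, and 150 dpi is borrowed from the settings mentioned earlier in the thread. Real PDFs compress their images, so actual file sizes will differ, and the exact ratio depends on the compression used.

```python
def raw_image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scanned page, with each row padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8  # round each row up to a full byte
    return row_bytes * height_px


# Assumed page: US Letter (8.5 x 11 inches) scanned at 150 dpi.
mono_bytes = raw_image_bytes(8.5, 11, 150, bits_per_pixel=1)
grey_bytes = raw_image_bytes(8.5, 11, 150, bits_per_pixel=8)

print(f"monochrome: {mono_bytes:,} bytes")
print(f"greyscale:  {grey_bytes:,} bytes")
print(f"ratio:      {grey_bytes / mono_bytes:.1f}x")
```

Before compression, the greyscale page carries about eight times as much image data as the monochrome one, while the hidden text layer adds comparatively little.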

ptram · March 29, 2010, 5:34pm

Hi,

That shouldn’t be the problem, then, since Acrobat’s OCR was set to Italian.
Thank you for your hints, I’ll do some other tests with larger files.

Paolo