DTPO and OCR capabilities: to upgrade or not to upgrade

valente · December 5, 2006, 12:09pm

I’m in the last stage of my DTPO trial and a decision must be made: to buy or not to buy, that’s the question.

To be honest, the only feature that really appeals to me is the OCR capability. I don’t use email importing, and the server feature may be handy in the future, but not right now.

So, going back to OCR… I already have an application that does this: Adobe Acrobat. However, Acrobat is external to DTP, and I keep all my PDF files in DTP. Should this fact alone (all-in-one) be enough to make me upgrade from DTPro to DTPO?

To make my decision I ran some tests. I don’t scan many papers, but I do download a lot of articles as PDFs that are not recognized as PDF+Text (only as PDF, i.e. like one big image). So my test matched that need: OCR processing of a pre-existing PDF file with no recognized text.

Here are my results for one 948KB file (4 pages) that has both text (some smaller than 8pt) and images, after processing it with DTPO and Acrobat 8 Pro:

Processing time:
DTPO = 3 minutes
Acrobat = 1 minute

Results:
DTPO = a 2.3MB file. Some problems with the smaller fonts. No problem with images.
Acrobat = a ~300KB file. A lot of problems with text, some of it missing (the smaller text). Images only partially recognized (parts of them are taken as text).

Looking at the results, I have no doubt that, even though it takes more time, DTPO does a much better job. The bigger file is just a consequence of a better conversion (PDF to PDF+Text).
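
To put the numbers above side by side, here is a minimal Python sketch. It uses only the figures reported in this post; the decimal megabyte conversion and the per-page breakdown are assumptions added for illustration, and the ~300KB Acrobat size is approximate.

```python
# Figures reported in the post above; nothing here is newly measured.
ORIGINAL_KB = 948   # source PDF
PAGES = 4
KB_PER_MB = 1000    # assumption: decimal megabytes

results = {
    "DTPO":    {"minutes": 3, "size_kb": 2.3 * KB_PER_MB},
    "Acrobat": {"minutes": 1, "size_kb": 300},  # "~300KB", approximate
}


def summarize(name):
    """Return per-page time and output size relative to the original file."""
    r = results[name]
    return {
        "app": name,
        "minutes_per_page": r["minutes"] / PAGES,
        "size_vs_original": r["size_kb"] / ORIGINAL_KB,
    }


for name in results:
    s = summarize(name)
    print(f"{s['app']}: {s['minutes_per_page']:.2f} min/page, "
          f"{s['size_vs_original']:.2f}x the original size")
```

On these figures the DTPO output is roughly 2.4 times the size of the original, while the Acrobat output is about a third of it. This says nothing about accuracy, which is the part that matters to me.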

However, there are a few things that need to be reviewed in future updates (some of them already mentioned in the forum):

1. There’s a problem with the date in the dialogue box that appears after the OCR process finishes. (It never defaults to today’s date; you have to set it yourself.)
2. The OCR process should be possible on files already inside the dtBase. (If you want to OCR a file that is already in the dtBase, you must drag it out [to the desktop, for instance] and then import it again using the OCR dialogue.)

Still, the end result is that I’ll most probably upgrade to DTPO.

– MJ

annard · December 5, 2006, 12:36pm

Regarding 1: the date of the end result is set to the date of the original file. Thus for archival purposes the OCR-ed document is identical to the original one. This saves a lot of trouble for people who care about these things and was requested numerous times during our initial beta test phase.

Regarding 2: this is definitely in the works and we hope to include it in the final release. If not, it will come soon thereafter, since it is obviously a useful and necessary feature.

Regarding file size, Acrobat is definitely superior there. I hope that IRIS will improve this in future releases because we’re depending on them to achieve this.

Bill_DeVille · December 5, 2006, 7:33pm

Comment: ScanSnap provided a ‘free’ copy of Acrobat 7 Standard when I got my scanner a few months ago.

I’ve compared the OCR accuracy of Acrobat and DTPO on a set of image-only PDFs, and the IRIS OCR engine in DTPO was considerably more accurate. Acrobat had more trouble with small print and with PDF images of less-than-perfect photocopies. Remember, though, that no OCR engine can produce error-free text from low-resolution images or from scans of blemished or marked-up paper originals.

Robert_Black · December 6, 2006, 3:21pm

Please see this post in another thread regarding PDF file sizes:
http://www.devon-technologies.com/phpBB2/viewtopic.php?p=18376#18376