Problem with text recognition on converted jpgs

michiganuser67 · July 24, 2010, 7:36pm

I am a brand-new customer, now using DEVONthink Pro Office in trial mode to see whether it fits my needs. I seem to have a problem with the text-recognition (OCR) on imported pdfs. I have many photocopies of foreign-language documents (mostly in Russian). These are currently jpgs. Quality varies (some are fuzzy, dark, misaligned, etc.), but some are quite easily legible. There are two issues:

1. I understand DEVONthink Pro Office needs these files to be pdfs (i.e. it can’t scan a jpg for text), and it cannot convert jpg to pdf on its own - correct? I realize I can convert jpg to pdf using Preview or Acrobat, but it would be easier if DTPO could do the conversion. I’m also looking for an add-on or third-party solution to batch-convert jpgs to pdfs (I have about 3500 images), but so far without success.
2. Based on sample scans/imports today (I’m on the free trial, so the program limits my OCR runs each day), ABBYY’s OCR cannot identify text on ANY of these documents. This is odd, because I did a test with Adobe Acrobat’s OCR to compare, and it has no problem with these same copies. Since all the reviews I’ve seen for ABBYY say that it is equal to (usually better than) Acrobat, I wonder whether I’m doing something wrong. Is there any trick to setting up the OCR? (I’ve imported ABBYY, and I’ve been through the Preferences, and I set up Russian as the primary language.) I COULD use Acrobat for all my OCR scans, and then import these scanned versions to DTPO, but would rather use ABBYY if it is better.

Thank you.

Bill_DeVille · July 24, 2010, 9:13pm

First, DTPO’s OCR mode can recognize and convert JPEG images to text. If you tried that and it didn’t work, your JPEGs may be a different “flavor” – for example, there’s a “flavor” of JPEG called lossless, and I doubt that DTPO OCR would recognize that, although I haven’t tried it.

Example: I made a screenshot of your post as a 72 dpi JPEG, then used DTPO’s File > Import > Images (with OCR) to capture it to the database, then selected the resulting searchable PDF and invoked Data > Convert > to plain text. There were conversion errors, but the result was not bad, considering the low resolution of the image. Here it is:

Problem with text t i ^ B W Ä W '^üöTTI recognition on converted jpgs Dby michiganuser67 on 24 Jul 2010
I am a brand-new customer, now using DEVONthink Pro Office in trial mode to see whether it fits my needs. I seem to have a problem with the text-recognition (OCR) on imported pdfs. I have many photocopies of foreign-language documents (mostly in Russian). These are currently jpgs. Quality varies (some are fuzzy, dark, misaligned, etc.], but some are quite easily legible.

I understand DEVONthink Pro Office needs these files to be pdfs (i.e. it can’t scan ajpg for text), and it cannot convert jpg to pdf on its own - correct? I realize I can convert jpg to pdf using Preview or Acrobat, but it would be easier if DTPO could do the conversion. I’m also looking for an add-on or third-party solution to batch-convert jpgs to pdfs (I have about 3500 images), but so far without success.
Based on sample scans/imports today (I’m on the free trial, so the program limits my OCR runs each day], ABBYY’s OCR cannot identify text on ANY of these documents. This is odd, because I did a test with Adobe Acrobat’s OCR to compare, and it has no problem with these same copies. Since all the reviews I’ve seen for ABBYY say that it is equal to (usually better than) Acrobat, I wonder whether I’m doing something wrong. Is there any trick to setting up the OCR? (I’ve imported ABBYY, and I’ve been through the Preferences, and I set up Russian as the primary language.) I 'COULD* use Acrobat for all my OCR scans, and then import these scanned versions to DTPO, but would rather use ABBYY if it is better.
„M,^nuser67
Joined: 2- QTM a
Thank you.

You indicated that some of your images are dark or fuzzy. OCR tries to recognize text characters and words in an image, and writes out a new text layer based on recognition of pictures of characters and words in a searchable PDF.

Obviously, if the image lacks contrast (which can be the case in a dark or fuzzy image), OCR accuracy can become poor. Often, image processing can improve the contrast and sharpness of the picture of characters and words, and that can improve OCR conversion accuracy. But image processing results saved as a JPEG will probably result in some degradation of the image, since normal JPEGs are not lossless. I recommend that you save the results of image processing as TIFF for that reason, especially as you can then do any further touchup of a TIFF image with minimal degradation of the saved edit. You may encounter very large file sizes, so make sure you’ve got plenty of free hard drive space.

With the large number of images you have, you will want to use image processing software that can easily be set up to do batch conversions of images from one filetype, e.g., JPEG, to another, e.g., TIFF. You may also need to post-process some of your TIFF images, e.g., to lighten the image, and improve contrast and sharpness of the pictures of text characters for better OCR recognition. The OCR can probably handle many cases of misalignment of the image of text, but post-processing to “straighten” the image may be required in some cases to improve character recognition.
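For anyone who prefers a script, a batch JPEG-to-TIFF conversion of this kind can be sketched in Python around `sips`, the command-line image tool that ships with Mac OS X. This is only a rough sketch: the function names `sips_command` and `batch_convert` are made up for illustration, it assumes `sips -s format tiff <input> --out <output>` behaves on your system as it does on mine, and it does no contrast or sharpness adjustment at all.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def sips_command(src: Path, dest_dir: Path, fmt: str = "tiff") -> list[str]:
    """Build the sips invocation that converts one image to `fmt`."""
    dest = dest_dir / f"{src.stem}.{fmt}"
    return ["sips", "-s", "format", fmt, str(src), "--out", str(dest)]


def batch_convert(src_dir: Path, dest_dir: Path, fmt: str = "tiff",
                  dry_run: bool = True) -> list[list[str]]:
    """Convert every JPEG in src_dir; returns the commands that were built.

    With dry_run=True (the default) nothing is executed, so the commands
    can be inspected first.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(p for p in src_dir.iterdir()
                     if p.suffix.lower() in JPEG_SUFFIXES)
    commands = [sips_command(p, dest_dir, fmt) for p in sources]
    if not dry_run:
        if shutil.which("sips") is None:
            raise RuntimeError("sips not found (it is a macOS-only tool)")
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Run it once with the default `dry_run=True` on a small test folder and look over the returned commands before letting it loose on thousands of files.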

For batch processing and relatively simple image processing I use and recommend GraphicConverter, which is built for such tasks and is reasonably priced. There are those who say it’s an “ugly” application. This is one of those cases where I say its utility as a tool is more important than its appearance.

The “big boys” who scan thousands of books to produce digital versions of them don’t use flatbed scanners. Instead, they use special equipment with digital cameras. Their equipment costs thousands of dollars, but automatically controls lighting evenness, minimizes curvature of pages, flips pages automatically and can digitize a whole book in minutes. Of course, they’ve also automated the procedure for turning the digital camera images (often taken with two cameras) into the final PDF or ebook.

michiganuser67 · July 24, 2010, 10:55pm

Thanks very much for that quick and helpful reply!

I don’t know whether my jpgs are lossless or not. They are standard photos, taken with a Panasonic camera in the 1.2 MB range. I just tried to use the import-jpg process you mentioned with 6 more images, and it didn’t work, so I suspect that’s not an option. (First tried it as “File-Import-Image (with OCR)” – this did nothing, DTPO didn’t even import the jpgs. Then I dragged and dropped the files into my DTPO global inbox – DTPO did then import the jpgs, but all showed up as jpgs, none as having any text that I can tell.)
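One way to find out which JPEG “flavor” a file uses is to read the start-of-frame (SOF) marker near the beginning of the file. The sketch below uses only the Python standard library; the function name `jpeg_flavor` is invented for this example, and it names only the four most common SOF types rather than every variant in the JPEG standard. Ordinary camera photos would normally be expected to report baseline DCT.

```python
import struct

# Start-of-frame marker codes and the encoding process they announce.
SOF_NAMES = {
    0xC0: "baseline DCT",
    0xC1: "extended sequential DCT",
    0xC2: "progressive DCT",
    0xC3: "lossless (sequential)",
}
# Markers that stand alone and carry no length field.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def jpeg_flavor(data: bytes) -> str:
    """Return a description of the first SOF marker found in `data`."""
    if data[:2] != b"\xff\xd8":          # every JPEG starts with SOI
        return "not a JPEG"
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return "unknown (malformed marker)"
        marker = data[pos + 1]
        if marker == 0xFF:               # fill byte, skip it
            pos += 1
            continue
        if marker in STANDALONE:
            pos += 2
            continue
        # 0xC0-0xCF are SOF markers, except DHT (C4), JPG (C8) and DAC (CC).
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            return SOF_NAMES.get(marker, f"other SOF type (0x{marker:02X})")
        if marker == 0xDA:               # image data starts, no SOF seen
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length                # jump over this segment
    return "unknown (no SOF marker found)"
```

Usage would be something like `jpeg_flavor(open("photo.jpg", "rb").read())`, where `photo.jpg` stands in for one of your own files.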

It’s possible that this experiment with these 6 images failed because I’ve hit my 20-per-day cap on the trial license, but the fact remains that NONE of the images I tried today were OCR’d by DTPO (this includes files I converted first to pdf format, and which Acrobat – by comparison – had little trouble turning into text-readable form). Since I know Acrobat isn’t THAT much better than ABBYY (if at all), I’m really wondering whether I’m doing something wrong in a very basic way, like failing to have turned “on” the ABBYY engine or something similarly fundamental. The log did show ABBYY as having been downloaded and installed, but I wonder if there is some similarly basic step that my trial version is not performing. I can’t believe that NONE of these files are readable, in other words, especially since Acrobat is doing fine with them. So on that score they don’t seem to require TIFF reconversion and reprocessing (although that’s good advice).

On a semi-related note, would DTPO OCR (ABBYY) automatically rotate, realign, etc. images? Acrobat is doing that in some cases as part of its OCR, but the ABBYY images coming into DTPO are not changing in these visible ways.

I did find GraphicConverter, thanks very much for that tip, and it does look great for a giant batch task. The only problem is the files are being (roughly) quintupled in size from their original (e.g., a jpg of 1 MB is turning into a PDF of 5 MB). That’s not likely workable, with so many images to process. I can’t see an option in GraphicConverter to adjust size, so I may be forced to use Acrobat for this reason as well. (The Acrobat conversion adds some to the file size, but only about 10%, not 500%.)
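To put numbers on that, here is a back-of-the-envelope calculation in Python. It assumes roughly 3500 files averaging 1 MB each (the real average may be closer to 1.2 MB) and counts 1 GB as 1000 MB; `batch_total_gb` is just an illustrative helper name.

```python
def batch_total_gb(n_files: int, avg_mb: float, size_factor: float) -> float:
    """Total size in GB (1 GB = 1000 MB) if every file grows by size_factor."""
    return n_files * avg_mb * size_factor / 1000


if __name__ == "__main__":
    n_files, avg_mb = 3500, 1.0  # assumed averages, not measured values
    scenarios = [
        ("original JPEGs", 1.0),
        ("about 10% larger", 1.1),
        ("quintupled", 5.0),
    ]
    for label, factor in scenarios:
        total = batch_total_gb(n_files, avg_mb, factor)
        print(f"{label}: {total:.2f} GB")
```

Under those assumptions the quintupled set comes to about 17.5 GB, against a little under 4 GB for the 10% case.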

elwood151 · July 25, 2010, 12:31am

Hi,

for importing multiple images at once with DEVONthink:
I’ve done that many thousands of times over the last few weeks and it works quite well.

When the jpegs sit in your inbox as jpegs:
did you try the context menu “convert - searchable pdf” (also accessible via the Data menu)?

Kind regards

Martin

michiganuser67 · July 25, 2010, 1:11am

Thanks very much – that was the trick! This time it worked, I’m very glad to report. It was a bit slow, and not all of the text on those particular documents turned out to be searchable (no surprise there). But the key is that some of it WAS readable; and that’s enough to confirm that the DTPO OCR is working. And that in turn lets me start with the real work. Thank you!

elwood151 · July 25, 2010, 2:41am

I’m glad I could help you!

The trick with the conversion via the Data menu is extremely helpful whenever the context menu is not available (e.g. if you have performed a search and selected several documents in the search results. Then the OCR conversion context menu is not there, but you can convert all selected documents via the Data menu.)

Bill_DeVille · July 25, 2010, 2:54am

As to the issue of increased file size in converting an original JPEG to a different image type, e.g., TIFF, prior to OCR – that’s really not much related to the size of the final PDF after OCR. In fact, during the OCR procedure individual images of each page of the PDF are being created as temporary files, which can be 40 MB or more each.

See DTPO Preferences > OCR re options related to image resolution and image quality. The default settings of 150 dpi and 50% image quality are a compromise between file size and view/print quality. I sometimes tweak this to 200 dpi and 75% image quality, but I don’t check the option to retain the original scan resolution, which can result in enormous file sizes for some PDFs. For most paper copy scans on my ScanSnap I use Black & White and the Better setting, and the resulting searchable PDFs are generally about the same size, or even smaller, than the original PDF produced by the scanner.
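To see why resolution matters so much for size, here is a small Python calculation of the pixel dimensions and raw, uncompressed size of one page at different dpi settings. It is a sketch under assumptions of mine: US Letter pages (8.5 x 11 inches), 24-bit color (3 bytes per pixel) and no compression. The actual internal format of the temporary page images is not something I can state, and the function names are illustrative only.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page of the given size at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bytes_per_pixel: int = 3) -> int:
    """Raw size of the page image, before any compression is applied."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # US Letter page at the resolutions mentioned above, plus two higher ones.
    for dpi in (150, 200, 300, 400):
        w, h = page_pixels(8.5, 11, dpi)
        mb = uncompressed_bytes(8.5, 11, dpi) / 1_000_000
        print(f"{dpi} dpi: {w} x {h} px, {mb:.1f} MB uncompressed")
```

Raw size grows with the square of the resolution, so doubling the dpi quadruples the data before compression.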

michiganuser67 · July 25, 2010, 3:37am

Thanks very much. In that case, would Martin’s technique for conversion work with a TIFF file? I don’t think I will do this routinely – I have too many files, so mostly I’ll try the existing jpgs first – but for those that seem tweakable into better quality, this could be handy. Thanks again, to both of you!