First, DTPO’s OCR mode can recognize and convert JPEG images to text. If you tried that and it didn’t work, your JPEGs may be a different “flavor” – for example, there’s a “flavor” of JPEG called lossless, and I doubt that DTPO OCR would recognize that, although I haven’t tried it.
Example: I made a screenshot of your post as a 72 dpi JPEG, then used DTPO’s File > Import > Images (with OCR) to capture it to the database, then selected the resulting searchable PDF and invoked Data > Convert > to plain text. There were conversion errors, but the result was not bad, considering the low resolution of the image. Here it is:
Problem with text t i ^ B W Ä W '^üöTTI recognition on converted jpgs Dby michiganuser67 on 24 Jul 2010
I am a brand-new customer, now using DEVONthink Pro Office in trial mode to see whether it fits my needs. I seem to have a problem with the text-recognition (OCR) on imported pdfs. I have many photocopies of foreign-language documents (mostly in Russian). These are currently jpgs. Quality varies (some are fuzzy, dark, misaligned, etc.], but some are quite easily legible.
- I understand DEVONthink Pro Office needs these files to be pdfs (i.e. it can’t scan ajpg for text), and it cannot convert jpg to pdf on its own - correct? I realize I can convert jpg to pdf using Preview or Acrobat, but it would be easier if DTPO could do the conversion. I’m also looking for an add-on or third-party solution to batch-convert jpgs to pdfs (I have about 3500 images), but so far without success.
- Based on sample scans/imports today (I’m on the free trial, so the program limits my OCR runs each day], ABBYY’s OCR cannot identify text on ANY of these documents. This is odd, because I did a test with Adobe Acrobat’s OCR to compare, and it has no problem with these same copies. Since all the reviews I’ve seen for ABBYY say that it is equal to (usually better than) Acrobat, I wonder whether I’m doing something wrong. Is there any trick to setting up the OCR? (I’ve imported ABBYY, and I’ve been through the Preferences, and I set up Russian as the primary language.) I 'COULD* use Acrobat for all my OCR scans, and then import these scanned versions to DTPO, but would rather use ABBYY if it is better.
Joined: 2- QTM a
You indicated that some of your images are dark or fuzzy. OCR tries to recognize the characters and words pictured in an image, then writes what it recognized as a new text layer in a searchable PDF.
Obviously, if the image lacks contrast (which can be the case in a dark or fuzzy image), OCR accuracy can become poor. Often, image processing can improve the contrast and sharpness of the picture of characters and words, and that can improve OCR conversion accuracy. But image processing results saved as a JPEG will probably show some degradation of the image, since normal JPEGs are not lossless and are recompressed on every save. For that reason I recommend that you save the results of image processing as TIFF, especially as you can then do any further touchup of the TIFF image with minimal degradation of the saved edit. You may encounter very large file sizes, so make sure you’ve got plenty of free hard drive space.
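To make “improving contrast” concrete, here is a minimal Python sketch of a linear contrast stretch on a row of 8-bit grayscale pixel values. It is only an illustration of the idea, not what DTPO or any particular image editor does internally, and the function name stretch_contrast is made up for this example.

```python
def stretch_contrast(pixels):
    """Linearly rescale 8-bit grayscale values so the darkest
    value becomes 0 and the lightest becomes 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [(p - lo) * 255 // (hi - lo) for p in pixels]


# A dim, low-contrast row: dark "ink" near 90, grey "paper" near 140.
row = [140, 138, 90, 92, 140, 139]
print(stretch_contrast(row))
```

After the stretch, the ink pixels sit near 0 and the paper pixels near 255, which is the kind of separation that gives OCR an easier job.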
With the large number of images you have, you will want to use image processing software that can easily be set up to do batch conversions of images from one filetype, e.g., JPEG, to another, e.g., TIFF. You may also need to post-process some of your TIFF images, e.g., to lighten the image and to improve the contrast and sharpness of the pictures of text characters for better OCR accuracy. The OCR can probably handle many cases of misalignment of the image of text, but post-processing to “straighten” the image may be required in some cases to improve character recognition.
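If you are comfortable with a little scripting, a batch JPEG-to-TIFF conversion can be sketched along these lines. This is a rough sketch, not a tested workflow: it assumes Python with the third-party Pillow library installed, and the function names are made up for the example. It converts each image to grayscale, applies Pillow’s automatic contrast adjustment, and saves an LZW-compressed TIFF. LZW is lossless and helps somewhat with file size.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def plan_conversions(src_dir, dst_dir):
    """Pair every JPEG in src_dir with a .tif path of the same name in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [
        (src, dst_dir / (src.stem + ".tif"))
        for src in sorted(src_dir.iterdir())
        if src.is_file() and src.suffix.lower() in JPEG_SUFFIXES
    ]


def convert_all(src_dir, dst_dir):
    """Write a grayscale, contrast-adjusted, LZW-compressed TIFF for each JPEG."""
    # Assumption: Pillow is installed (it is not part of the standard library).
    from PIL import Image, ImageOps

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_conversions(src_dir, dst_dir):
        with Image.open(src) as img:
            gray = ImageOps.autocontrast(img.convert("L"))
            gray.save(dst, format="TIFF", compression="tiff_lzw")
```

Calling convert_all("scans_jpeg", "scans_tiff") (hypothetical folder names) would leave the original JPEGs untouched and write the TIFFs to the second folder. Try it on a handful of images first and check the OCR results before running all 3500.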
For batch processing and relatively simple image processing I use and recommend GraphicConverter, which is built for such tasks and is reasonably priced. There are those who say it’s an “ugly” application. This is one of those cases where I say its utility as a tool is more important than its appearance.
The “big boys” who scan thousands of books to produce digital versions of them don’t use flatbed scanners. Instead, they use special equipment with digital cameras. Their equipment costs thousands of dollars, but automatically controls lighting evenness, minimizes curvature of pages, flips pages automatically and can digitize a whole book in minutes. Of course, they’ve also automated the procedure for turning the digital camera images (often taken with two cameras) into the final PDF or ebook.