DTPO quits the entire OCR job when there is one OCR error

Ryuji · April 16, 2007, 9:14pm

I have been using Devonthink Pro Office with Fujitsu ScanSnap for a few days. While it is very nice to be able to keep scanning while OCR is working in the background, there are annoying problems with the way DTPO handles OCR errors.

Problem 1.

When multiple PDF files are lined up for OCR and one file causes an OCR problem of some sort, I can choose to give up OCR on that file and continue. In reality, however, this quits the rest of the OCR job. Files that were waiting to be OCRed in DTPO are no longer processed, and they are not imported as-is either. So far, I haven’t found any workaround other than manually digging out the PDF files saved by the scanner driver, renaming them, dragging them into DTPO and converting them to searchable PDFs. This is very annoying.

Problem 2.

There were several cases where DTPO rejected a PDF because it caused an OCR error. However, when I attempted to OCR the same file later, it sometimes went through. I don’t understand why. Most of the time, when DTPO failed on a file twice, I could OCR it in Acrobat 7.0.9. I don’t understand why either, as I find the OCR in DTPO generally superior to that in Acrobat 7. (Haven’t used Acrobat 8 tho.)

Problem 3.

DTPO cannot OCR documents larger than 50 pages. HOWEVER, this error comes up only after time has been spent OCRing the document. It counts as an OCR error, so Problem 1 above applies, on top of the wasted time and not getting a single searchable page. Can this be detected before wasting time? Even better, can DTPO detect scanned PDFs longer than 50 pages and invoke Acrobat to do the OCR? (I have been using Acrobat for documents longer than 50 pages.)

Problem 4 (feature request)

Say I have 1000 documents in the OCR queue. A popup window will come up and ask for the file name and keywords, every time a single file OCR task is completed but before saving it.

a. While the popup dialog is active, it doesn’t seem that the next document in the OCR queue is processed. That is, if I don’t realize that the popup dialog is there, I’m wasting time. Can this be fixed?

b. Can the OCRed results be saved in a temporary folder, wherein I can rename the files and add keywords altogether at a later time? This is so that I can scan a bunch of documents and go away for a cup of coffee or even go home.

c. The filename dialogue above currently has a postage-stamp size preview of the first page. Can this be made bigger, and also pageable? I scan a lot of academic journal papers, and the info needed to make up a filename is often printed in 8 or 10 point size, so there is no way to read it before the file is saved with a filename. I usually have to keep a stack of paper in the scanned order so that I can look at it to make up a filename. This is very annoying. Otherwise, I had to save the document with a random filename foo1, foo2, whatever (oops, these are deterministic and highly predictable, not random) and rename it later, until I realized the problem below.

d. Is there a way to rename the file once it has been saved in the dialog? Changing the document title in DTPO does not seem to change the filename. Also, how can I edit the keywords and subject metadata in the PDF file once the file is saved?

Thanks!

Ryuji

Bill_DeVille · April 17, 2007, 1:21am

There are two different types of error reports I’ve gotten when using a ScanSnap.

[1] An error in page feed. ScanSnap Manager provides an option to reload the sheet feeder and continue scanning.

[2] If I insert more than 25 pages of double-sided material into the feeder or OCR an image-only PDF that’s longer than 50 pages, the OCR engine will report a page limit error and stop. That limitation is imposed by IRIS. At this time there’s no workaround except to avoid exceeding the 50-page limit.

I haven’t run across inconsistent behavior like that.

I would avoid using Acrobat 8 to OCR PDFs, as Adobe has changed the PDF output in a way that’s not compatible with Tiger’s PDFKit code (I’m guessing that Apple will update PDFKit in the future, but perhaps not until Leopard). OCR using Acrobat 7 or earlier is OK, although I find the output a bit less accurate.

That 50-page limit imposed by our license from IRIS is irritating. But IRIS charges hundreds of dollars more for a version that doesn’t have the limitation.

I inspect PDFs that I plan to run OCR on. If they are already in my database I can quickly check the number of pages before invoking OCR. If they are longer than 50 pages I’ll either split them using another application (and merge the segments afterwards), or OCR with Acrobat. I usually do the split/merge option in order to get better accuracy.
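
For the split option, the page ranges can be worked out mechanically. Here is a minimal sketch in Python, assuming the 50-page limit described above. `split_ranges` is a hypothetical helper name, not part of DTPO or any PDF tool; it only computes the ranges and leaves the actual splitting to whatever application is used.

```python
def split_ranges(total_pages, limit=50):
    """Return 1-based inclusive (first, last) page ranges of at most `limit` pages.

    `limit` defaults to the 50-page OCR limit discussed in this thread.
    """
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    return [
        (first, min(first + limit - 1, total_pages))
        for first in range(1, total_pages + 1, limit)
    ]


# A 120-page scan would be split into three segments.
print(split_ranges(120))
```

Each segment can then be OCRed separately and the results merged afterwards.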

No, that dialog is modal and I doubt that can be fixed.

My workaround is to turn off the Preferences > OCR option to enter document attributes. That lets the queue proceed without stopping and removes the need for me to sit over the computer entering document attributes one by one. Later, I’ll rename the documents inside my database and add any necessary keywords to the Comment field. Saves time and aggravation.

c. The filename dialogue above currently has a postage-stamp size preview of the first page. Can this be made bigger, and also pageable? I scan a lot of academic journal papers, and the info needed to make up a filename is often printed in 8 or 10 point size, so there is no way to read it before the file is saved with a filename. I usually have to keep a stack of paper in the scanned order so that I can look at it to make up a filename. This is very annoying. Otherwise, I had to save the document with a random filename foo1, foo2, whatever (oops, these are deterministic and highly predictable, not random) and rename it later, until I realized the problem below.

d. Is there a way to rename the file once it has been saved in the dialog? Changing the document title in DTPO does not seem to change the filename. Also, how can I edit the keywords and subject metadata in the PDF file once the file is saved?

Even if it were relatively easy to let you scan a preview of the OCR’d PDF in the process of naming the file (which it would certainly not be), you would waste time.

That’s why I turn off the document attributes option and change document names in the database. Most PDFs have a title that I can simply select and Control-click on. Then I select the contextual menu option to Set Title As.

True, the file name of the PDF will still be the date-time string assigned during OCR. But if I select such a PDF and export it to the Finder using File > Export > Files & Folders the file name will be changed to the document title you assigned in DTPO. I may want to do that if I send the PDF to a colleague. Otherwise, I don’t care about the file name.

Ryuji · April 17, 2007, 6:39pm

Thanks for your response. It contained a few helpful tips and it was very useful to uncheck “enter document attribute” although this doesn’t do exactly what I hoped. Some of my problems weren’t covered. Plus, I have a few more things to ask.

There are a third and a fourth kind of error.

[3] The OCR module can’t process the scanned image, particularly when the page contains a lot of figures, handwritten characters, or many rules (as in ledgers or heavily underlined text).

[4] An error in handing files over from the ScanSnap driver to Devonthink. Sometimes I get a popup saying “Can’t import file […path…] at location: Top or current group. You can try to inspect the document in the finder and try to import it from there.”

I didn’t think to write down the error message for [3].

The files I see in the finder are largely broken or empty 0-byte files. Maybe there is a problem in the way the ScanSnap driver hands over the PDF file to Devonthink? (I downloaded the latest driver for Mac OS X from their website… this thing says V2.0L11.)
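
Such files can at least be spotted before retrying the import. Here is a rough sketch in Python, assuming the scans sit in one folder and that a healthy PDF begins with the `%PDF-` marker. `find_suspect_pdfs` and the folder path are hypothetical names, and this check says nothing about damage further inside the file.

```python
from pathlib import Path


def find_suspect_pdfs(folder):
    """Return sorted names of .pdf files that are empty or lack the %PDF- header."""
    suspects = []
    for path in sorted(Path(folder).glob("*.pdf")):
        with path.open("rb") as handle:
            header = handle.read(5)
        # A 0-byte file yields an empty header, so it is caught here too.
        if header != b"%PDF-":
            suspects.append(path.name)
    return suspects
```

Calling `find_suspect_pdfs("/path/to/scans")` with the real scan folder would list the files worth rescanning.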

There are some cases where I made, say, 10 separate scans of different documents, and 2 of the scans (2 documents) were saved concatenated into one file. I am not sure why this happens, as the PDF files saved in the ScanSnap directory are perfectly ok.

Another problem I have is that the image resolution and color values are changed after using OCR. Is there a way to avoid degrading resolution or color when I need them? Some papers have photographs, fine diagrams, etc. Also, I am a typeface fetishist, and when I find an interesting typeface sample I scan it, and in these cases I like to preserve fine details in the serifs, etc.

Another request is that, if the document is longer than 50 pages, is it possible to make a PDF+text file with the text applied to the first 50 pages?

I understand that you don’t need to care about the internal file names at this point as long as you use the export function rather than dragging the files out, but I do the latter a lot, because I have a few different areas of interest and my databases are split up accordingly, so many files have to move around. Currently I see no way to move files or groups across databases other than saving files and importing them to another database. Also, I’m waiting for the day when Devonthink database contents become spotlightable. When this happens (well, some use their own hacks to do this already) it would be much nicer to have real filenames than the date strings.

Ryuji · April 18, 2007, 6:21am

When I start up DTPO, it tells me that the default output of the ScanSnap driver is not Devonthink, even though it is set so. Then if I tell it to go ahead and change it to Devonthink, it actually changes the handling application to Preview…

annard · April 18, 2007, 8:19am

No, it should continue. It will actually try a file twice, just to be sure.

Neither do I, but I have seen it happen. This is an IRIS problem.

No, it doesn’t OCR it at all. It only analyzes the document (this takes up a great deal of time). This and the rest of your comment is not something we can change with the current technology we have from IRIS.

As to your problem with the wrong detection of the ScanSnap Manager defaults, this can happen if its preferences get messed up for some unknown reason (I bet you have two entries for DTPO in there). If you remove or reset them to the default value, that should restore the situation to its proper state.

Ryuji · September 8, 2007, 7:28pm

I have also observed that the chance of OCR error is much higher when there is a diagram, figure, chart, etc. other than text on the same page. Is this something being recognized and worked on?

annard · September 8, 2007, 7:57pm

We’ve licenced the OCR from IRIS, so I don’t really know what they’re working on.