Better OCR page limit feedback/process

milhouse · February 2, 2007, 7:18am

Hi,

I imported ~1600 PDFs from my research library. About 40-60 of them have no text layer, so I attempted to batch convert them to searchable PDFs.

The conversion seems to stall when a document has over 50 pages. Canceling the job crashes something called RDE (I submitted a crash report to feedback@…).

Is it possible to provide some sort of pre-OCR notification to avoid this?

It will be time-consuming to sort through each of these manually to see which are 50 pages or fewer, and then to crack open the full version of IRIS or Adobe to do the rest.

The 50-page limit is a bit of a turn-off for research use. Perhaps an auto-split function for documents over 50 pages?

Also, when it does work, some OCR’d docs are placed in the same folder as the original and some at the root level. There doesn’t seem to be any consistency, at least that I can see (I’m sure there is, so someone can correct me).

Kludges, comments and solutions are welcome.

cheers

Bill_DeVille · February 2, 2007, 8:41am

The DEVONtechnologies license from IRIS for the OCR engine in DTPO limits the maximum size of a document for OCR to 50 pages. (IRIS charges big bucks for OCR without a page limit.)

We did internally talk about the possibility of an automatic routine that would split and recombine long documents. Unfortunately, IRIS would not take kindly to that.

So the only kludge I can think of for image-only PDFs is to try to avoid including large PDFs in a batch being OCR’d. PDFs running over 50 pages in length will require segmenting before OCR, then reassembly afterwards. There’s an Automator action, ‘Combine PDF Records.workflow’ in the Extras folder on the download disk image.

I suspect that we won’t provide a script or workflow to split PDFs. Hmm. Users could do that, though.
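
For anyone who wants to roll their own, the segmenting step described above comes down to working out which page ranges go into each piece. Here is a minimal sketch in Python, not anything shipped with DTPO: the function name `plan_segments` is made up for illustration, and it only computes the ranges. The actual splitting and the reassembly afterwards would still be done with whatever PDF tool you prefer.

```python
def plan_segments(page_count, limit=50):
    """Return 1-based inclusive (first, last) page ranges of at most `limit` pages.

    `plan_segments` is a hypothetical helper, not part of any DEVONthink or IRIS API.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Step through the document in strides of `limit`; the last range may be shorter.
    return [
        (first, min(first + limit - 1, page_count))
        for first in range(1, page_count + 1, limit)
    ]


# A 120-page scan needs three pieces; a 50-page one fits in a single pass.
print(plan_segments(120))  # [(1, 50), (51, 100), (101, 120)]
print(plan_segments(50))   # [(1, 50)]
```

A document that comes back with a single range can go straight into the OCR batch; anything with two or more ranges is one to set aside and segment first.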

annard · February 2, 2007, 10:34am

Like Bill said, you can build something yourself using AppleScript. There is a specific error code that is returned when the maximum page limit is exceeded and thus you could decide to split it up.

Our code can deal with RDE crashes, so that is not the root of your problems. Also, when you run the Automator workflow, you’re not held up by the user interaction that the Convert menu items require.

By default, when a conversion succeeds, the result is placed in the same group as the original. If this isn’t the case, I would like you to run the conversion on one of the problem cases and send the group location and name of the original file, along with any messages in the Log tool or the Console, to support@devon-technologies.com.

milhouse · February 2, 2007, 2:20pm

Thanks for the replies. I understand the licensing issue. I do own a full IRIS Pro license as well as Adobe Full.

I initially imported all files to the root level and used auto group on several dozen files.

I wonder, could the seemingly random placement of OCR’d files result from using that function? Does the auto group function create replicants?

If I OCR’d a replicant (is that possible?), would it place the file in the same location as the original or the replicant?

Also, is there a way to automatically tag or label imported PDFs without text to make them more easily identifiable? (instead of having to read through the log file and sort them individually)

Thanks again
cheers

annard · February 2, 2007, 6:07pm

Yes, that is the cause of the confusion. The next release will improve this when you use the interactive way of converting these records.

As for automatically tagging or labeling PDFs without text: not at this point in time. You could probably hack something together with AppleScript, but it will not be very fast since you would have to check every “image” record to see if it has text associated with it. Or you could use the convert workflow example, stick some more AppleScript in it to do this detection, add a recursive Get Records action at the beginning, and start with the top group. That should go through your whole database.
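
To illustrate the detection logic only, here is a rough sketch in Python rather than AppleScript. It assumes the database has already been dumped into nested dictionaries; the keys `kind`, `name`, `text` and `children` and the function `find_image_only_pdfs` are invented for this example and are not the application's scripting interface. It starts at the top group, recurses into every subgroup, and collects each PDF record that has no text associated with it.

```python
def find_image_only_pdfs(group, path=""):
    """Recursively collect the locations of PDF records that carry no text.

    The record layout (dicts with "kind", "name", "text", "children") is a
    hypothetical stand-in for a real database, used for illustration only.
    """
    found = []
    for record in group.get("children", []):
        location = f"{path}/{record['name']}"
        if record.get("kind") == "group":
            # Descend into subgroups so the whole database is covered.
            found.extend(find_image_only_pdfs(record, location))
        elif record.get("kind") == "pdf" and not (record.get("text") or "").strip():
            # A PDF with empty or whitespace-only text is treated as image-only.
            found.append(location)
    return found


# Made-up sample data: two image-only PDFs, one searchable PDF, one text note.
database = {
    "kind": "group",
    "children": [
        {"name": "scan-001.pdf", "kind": "pdf", "text": ""},
        {"name": "Journals", "kind": "group", "children": [
            {"name": "editor-letter.pdf", "kind": "pdf", "text": "   "},
            {"name": "article.pdf", "kind": "pdf", "text": "Open access ..."},
        ]},
        {"name": "notes.txt", "kind": "text", "text": ""},
    ],
}

image_only = find_image_only_pdfs(database)
```

The resulting list of locations could then be labeled or moved into a holding group in one go, instead of reading through the log and sorting records individually.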

Bill_DeVille · February 3, 2007, 5:09am

Hey, milhouse, I’ve got exactly 16 image-only PDFs in my main database of about 22,000 documents.

How do I know that? Use the History tool. If there’s not already a column that displays Kind, it can be added using View > Columns. Then sort by Kind. Scroll (which can be tricky in a large database) to find any plain PDF files.

Most of those 16 PDFs wouldn’t benefit from OCR (images of handwritten notes, for example). But I see a couple that could usefully be converted. Unfortunately, the Control-click contextual menu option Convert > to Searchable PDF isn’t available in the History window.

But I can press Command-R to Reveal the document, sort by Creation Date, and run OCR. A second version of that document (same Creation Date, different Modification Date) is now visible. I can delete the original, and now have a searchable PDF.

For example, I had downloaded a 14-page letter by a scholarly journal editor on the subject of open access issues for journals. I had kept it, even though it was image-only. Now I’ve got the full, searchable text and it is certainly more useful in my database.

How’s that? One more kludge for the books.

milhouse · February 3, 2007, 5:31am

Sweet, actually!

Thanks