Automatic Indexing and OCR

Sampsa · December 27, 2007, 3:09pm

I use Bibdesk for handling all my papers and it automatically files the associated PDFs, if any, in a specified folder. That folder now has an action to index the files automatically in DT to take advantage of the features in DT.

However, some of the PDFs don’t have a text layer and I would like to automate OCR on them, but not at the time of import, as it takes a while. The OCR could run at night.

So I’ve thought of the following flow:

When a PDF file is put into the folder, it is:
- indexed by DT if it has a text layer
- moved to a subfolder “No Text” if it doesn’t have a text layer
At night all files in the “No Text” subfolder are OCRed and moved into the main folder, and thus automatically indexed by DT.

A couple of questions:
a) How do I sort files based on whether they have a text layer? Is there a way in AppleScript to detect that?
b) The OCR Automator action seems to import the document automatically after OCR. Is there a way to leave it where it is or even move it to the main folder?

cgrunenberg · January 7, 2008, 2:27pm

You might check if the word count property of a record is zero or not.
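
As a rough illustration of how a zero word count could drive the sorting step from the first post, here is a minimal Python sketch of the folder side only. It does not talk to DT: `word_count` is a hypothetical function you would supply yourself (for example, one that asks DT for the record's word count), and the function name `sort_pdfs` and its parameters are assumptions made for this example.

```python
import shutil
from pathlib import Path
from typing import Callable


def sort_pdfs(
    folder: Path,
    word_count: Callable[[Path], int],  # hypothetical: returns 0 for image-only PDFs
    holding_name: str = "No Text",
) -> list[Path]:
    """Move PDFs whose word count is zero into a holding subfolder.

    Returns the new paths of the files that were moved.
    """
    holding = folder / holding_name
    moved = []
    # glob("*.pdf") is not recursive, so files already in the holding
    # subfolder are not examined again. sorted() takes a snapshot first,
    # so moving files during the loop is safe.
    for pdf in sorted(folder.glob("*.pdf")):
        if word_count(pdf) == 0:
            holding.mkdir(exist_ok=True)
            target = holding / pdf.name
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

PDFs that stay in the main folder are the ones with a text layer, so the existing indexing action can pick them up as before.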

Sampsa · January 7, 2008, 10:09pm

Good idea. Thank you!

m021478 · January 14, 2008, 11:11am

I was hoping you’d be willing to explain the details of how you would go about accomplishing the workflow you’ve mentioned here. I would love to pull off something like this, but would have no idea where to begin…

Any suggestions would be greatly appreciated… Thanks!

Bill_DeVille · January 14, 2008, 7:26pm

Don’t forget that the History tool gives you a flat file view of your entire database and you can do batch selection and processing there. Display the History window (Tools > History).

If it’s not already there, choose View > Columns to add the Kind column in the History view, and click on the Kind header to sort by Kind.

Scroll until you have identified image-only PDFs (Kind = PDF). Select all the image-only PDFs that you wish to convert to searchable PDF and choose Data > Convert > to Searchable PDF (Kind = PDF+Text). DT Pro Office will now run the selected PDFs through OCR.

Note: It’s a good idea to check the log (Tools > Log) to identify any PDFs that could not be converted, e.g. because of too low resolution.

Now you can again sort History by Kind, select the PDF documents (except for those that could not be converted) and delete them.

If you have already filed image-only PDFs into, e.g. project groups, the converted PDFs will be in the proper group locations. The converted PDFs will have the original Creation Date but a new Modification Date.

blatch · January 29, 2008, 5:00pm

I just tried your advice and it worked wonderfully, until my hard drive filled up with temporary files causing the OCR to stop.

Bill_DeVille · January 29, 2008, 7:09pm

Never let your hard drive run out of free space! When that happens, the operating system might start overwriting data files, which is a Very Bad Thing.

Apple engineers recommend keeping at least 10% to 15% of the drive’s capacity free, so that there’s room for VM swap files and other temporary files used by the operating system and your applications.
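
For anyone scripting a nightly OCR run, a free-space check before starting the batch is easy to add. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the 15% default are assumptions for this example, taken from the upper end of the range quoted above.

```python
import shutil


def free_fraction(path: str = "/") -> float:
    """Return the fraction of the volume holding `path` that is free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_headroom(path: str = "/", min_free: float = 0.15) -> bool:
    """True if at least `min_free` of the volume is still free."""
    return free_fraction(path) >= min_free
```

A batch script could call `has_headroom()` before each document and stop early rather than letting the drive fill up.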

blatch · February 4, 2008, 9:46pm

I had 27GB free before I started the OCR process. One of the items was a book. The temporary files left over from OCRing the book are what filled my hard drive.

blatch · May 9, 2008, 4:05pm

Since you didn’t grok my last post, let me illustrate the problem:

Do you see now? This is what happens if I have books in my DT database and OCR them. That is a big problem, don’t you think? And what do I get? A canned response about free space.

Bill_DeVille · May 9, 2008, 5:56pm

Sometimes canned responses are appropriate.

You may need to OCR book-length PDFs one at a time. Those temporary files created by an application should be temporary and cleared when a procedure is finished. But if the procedure doesn’t finish (runs out of disk space, for example), it may be necessary to intervene. That’s the way OS X works.
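
The one-at-a-time approach can be sketched in Python for a script you control yourself. This does not change how DT handles its own temporary files. The `ocr` callable is hypothetical and stands in for whatever OCR step you use; the point of the sketch is only that each document gets its own scratch directory, which is removed even when the step fails.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def ocr_one_at_a_time(
    pdfs: Iterable[Path],
    ocr: Callable[[Path, Path], None],  # hypothetical: ocr(pdf, scratch_dir)
) -> list[Path]:
    """Process PDFs sequentially; return the ones whose OCR step raised an error."""
    failed = []
    for pdf in pdfs:
        # A fresh scratch directory per document. It is deleted when the
        # with-block exits, whether or not the OCR step succeeded.
        with tempfile.TemporaryDirectory() as scratch:
            try:
                ocr(pdf, Path(scratch))
            except Exception:
                failed.append(pdf)
    return failed
```

Because scratch space is released after every document, a single large book can only ever consume its own temporary files, not those left over from earlier items in the batch.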

Often a restart will clear left-over temporary files. If not, a maintenance utility such as Cocktail or OnyX can be used to clear out caches (which is a good thing to do periodically, anyway).

blatch · May 12, 2008, 9:05pm

Okeydokey. Perhaps I’ll symlink DT’s cache to another machine with more scratch storage.