Automatic Indexing and OCR

I use BibDesk for handling all my papers, and it automatically files the associated PDFs, if any, in a specified folder. That folder now has an action attached that indexes new files in DT automatically, so I can take advantage of DT’s features.

However, some of the PDFs don’t have a text layer and I would like to automate OCR on them, but not at the time of import, as it takes a while. The OCR could run at night.

So I’ve thought of the following flow:

  1. When a PDF file is put into the folder, it is:
    • indexed by DT if it has a text layer
    • moved to a subfolder “No Text” if it doesn’t have a text layer
  2. At night, all files in the “No Text” folder are OCRed and moved into the main folder, and thus automatically indexed by DT.

A couple of questions:
a) How do I sort files based on whether they have a text layer? Is there a way in AppleScript to detect that?
b) The OCR Automator action seems to import the document automatically after OCR. Is there a way to leave it where it is or even move it to the main folder?

You might check whether the word count property of a record is zero.
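
Since the word count is a property of a DEVONthink record, the check can only happen after the file has been indexed. So one approach is to let the folder action index everything and then sweep the indexed group, moving anything with a zero word count out to your “No Text” folder. Here’s a rough, untested sketch assuming DEVONthink Pro (Office); the group location “/Papers” and the destination path are placeholders, and the command names (get record at, delete record) are worth double-checking against the scripting dictionary:

```applescript
-- Untested sketch: sweep an indexed group, find PDFs without a text layer
-- (word count = 0) and move their files to a "No Text" subfolder for later OCR.
set noTextFolder to POSIX path of (path to home folder) & "Papers/No Text/" -- placeholder path

tell application id "DNtp" -- DEVONthink Pro (Office)
    set theGroup to get record at "/Papers" -- the indexed group (placeholder location)
    repeat with theRecord in children of theGroup -- direct children only
        if (path of theRecord ends with ".pdf") and (word count of theRecord is 0) then
            -- move the underlying file out of the indexed folder ...
            do shell script "mv " & quoted form of (path of theRecord) & space & quoted form of noTextFolder
            -- ... and remove the now-orphaned record from the database
            delete record theRecord
        end if
    end repeat
end tell
```

For the nightly step, DT Pro Office’s scripting dictionary also has OCR commands (ocr file, convert image) that might let you skip the Automator action entirely, though I haven’t checked whether they can leave the result outside the database.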

Good idea. Thank you!

Would you be willing to explain in detail how you would go about accomplishing the workflow you’ve mentioned here? I would love to pull off something like this, but would have no idea where to begin…

Any suggestions would be greatly appreciated… Thanks!

Don’t forget that the History tool gives you a flat file view of your entire database and you can do batch selection and processing there. Display the History window (Tools > History).

If it’s not already there, choose View > Columns to add the Kind column to the History view, and click on the Kind header to sort by Kind.

Scroll until you have identified the image-only PDFs (Kind = PDF). Select all the image-only PDFs that you wish to convert and choose Data > Convert > to Searchable PDF; the converted copies will show up with Kind = PDF+Text. DT Pro Office will now run the selected PDFs through OCR.

Note: It’s a good idea to check the log (Tools > Log) to identify any PDFs that could not be converted, e.g. because the resolution is too low.

Now you can again sort History by Kind, select the image-only originals (Kind = PDF), except for any that could not be converted, and delete them.

If you have already filed the image-only PDFs into project groups, for example, the converted PDFs will end up in the proper group locations. The converted PDFs will have the original Creation Date but a new Modification Date.
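
If you’d rather not do the History pass by hand each time, the same selection can in principle be scripted against the word count, as suggested above. The sketch below is untested and assumes the OCR commands in DT Pro Office’s scripting dictionary (convert image, delete record); it walks the current database, converts every PDF with a zero word count, and removes the image-only original when the conversion succeeds:

```applescript
-- Untested sketch: batch-convert image-only PDFs (word count = 0) to
-- searchable PDFs, mirroring the manual History / Data > Convert pass.
tell application id "DNtp" -- DEVONthink Pro Office
    set allRecords to contents of current database -- flat list of every document
    repeat with theRecord in allRecords
        if (path of theRecord ends with ".pdf") and (word count of theRecord is 0) then
            try
                convert image record theRecord -- runs OCR, yields a PDF+Text copy
                delete record theRecord -- drop the image-only original
            on error
                -- conversion failed (e.g. resolution too low); keep the original
                -- and check Tools > Log afterwards
            end try
        end if
    end repeat
end tell
```

If convert image behaves like the menu command, the converted copy ends up in the same group as the original, so your filing stays intact; in any case, test on a copy of the database first.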

I just tried your advice and it worked wonderfully, until my hard drive filled up with temporary files, causing the OCR to stop. :frowning:

Never let your hard drive run out of free space! When that happens, the operating system might start overwriting data files, which is a Very Bad Thing.

Apple engineers recommend keeping at least 10% to 15% of the drive’s capacity free, so that there’s room for VM swap files and other temporary files used by the operating system and your applications.

I had 27GB free before I started the OCR process. One of the items was a book. The temporary files left over from OCRing the book are what filled my hard drive.

Since you didn’t grok my last post, let me illustrate the problem:

Do you see now? This is what happens if I have books in my DT database and OCR them. That is a big problem, don’t you think? And what do I get? A canned response about free space.

Sometimes canned responses are appropriate. :slight_smile:

You may need to OCR book-length PDFs one at a time. Temporary files created by an application should be cleared when the procedure finishes, but if the procedure doesn’t finish (because it runs out of disk space, for example), it may be necessary to intervene. That’s the way OS X works.

Often a restart will clear left-over temporary files. If not, a maintenance utility such as Cocktail or OnyX can be used to clear out caches (which is a good thing to do periodically, anyway).

Okeydokey. Perhaps I’ll symlink DT’s cache to another machine with more scratch storage :slight_smile: