Script for removing PDFs not deleted after OCR

historicist · June 13, 2012, 6:21pm

Hello:

I have an enormous number of PDFs in my database that have been OCR’d (using the Convert>to Searchable PDF command). For a long time I (and my research assistants) forgot to check the box for “Move to Trash” in Preferences>OCR>Original Document. So we have many, many duplicates we now want to clean out.

Is there a script that would take each PDF, compare its filename and Kind (PDF vs. PDF+Text) with the other documents in the database, and, where both versions exist, delete the plain PDF version?

I’m an AppleScript novice, but I think I remember how to install new scripts in the script menu.

Many thanks,
Michael U

eboehnisch · June 14, 2012, 2:59pm

You could create a smart group that looks for PDFs with empty textual content (word count equals 0). These should be the PDFs before OCR (after OCR they have a higher word count, of course).

historicist · June 14, 2012, 3:37pm

I could, yes – and that would isolate all of the un-OCR’d PDFs.

But a number of our PDFs are deliberately not OCR’d (because they have handwriting, or very difficult text in 16th-century books).

So what I need instead is a way to compare each PDF that has a word count of 0 with any PDF of the same name that has a word count greater than 0.

I guess I could create a Smart Group of all PDFs in the database, add a column with word counts, sort them by name, and then delete every obvious duplicate by hand. It’s just a long process that I’d hoped to automate.
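
To make the comparison concrete, here is a rough sketch of the matching logic only, written in Python rather than AppleScript. It does not talk to DEVONthink at all: the `Record` type, the `find_redundant_originals` function, and the sample data are all hypothetical, and it assumes the name, Kind, and word count of each document have already been obtained somehow. A real script would still need to read those properties from the database and move the matches to the Trash.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one database document."""
    name: str
    kind: str        # "PDF" or "PDF+Text", as shown in the Kind column
    word_count: int


def find_redundant_originals(records):
    """Return un-OCR'd PDFs that have an OCR'd copy with the same name."""
    # Group all records by name so same-named documents can be compared.
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec.name].append(rec)

    redundant = []
    for group in by_name.values():
        # Only flag originals when a searchable copy actually exists.
        has_ocr_copy = any(
            r.kind == "PDF+Text" and r.word_count > 0 for r in group
        )
        if has_ocr_copy:
            redundant.extend(
                r for r in group if r.kind == "PDF" and r.word_count == 0
            )
    return redundant


# Made-up sample data, for illustration only.
sample = [
    Record("Letter 1587", "PDF", 0),
    Record("Letter 1587", "PDF+Text", 412),
    Record("Manuscript notes", "PDF", 0),   # deliberately not OCR'd
    Record("Census 1901", "PDF+Text", 9120),
]

for rec in find_redundant_originals(sample):
    print(rec.name)
```

The point of grouping by name first is that a PDF with a word count of 0 is left alone unless a searchable copy with the same name exists, so the deliberately un-OCR'd documents would not be touched.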

yours
Michael