Hello:
I have an enormous number of PDFs in my database that have been OCR’d (using the Convert>to Searchable PDF command). For a long time I (and my research assistants) forgot to check the box for “Move to Trash” in Preferences>OCR>Original Document. So we have many, many duplicates we now want to clean out.
Is there a script that takes each PDF, compares its filename to the others in the database, compares its Kind (PDF vs. PDF+Text), and, if both versions exist, deletes the plain PDF version?
I’m an AppleScript novice, but I think I remember how to install new scripts in the script menu.
Many thanks,
Michael U
You could create a smart group that looks for PDFs with empty textual content (word count equals 0). These should be the PDFs before OCR (after OCR they have a higher word count, of course).
I could, yes – and that would isolate all of the un-OCR’d PDFs.
But a number of our PDFs are deliberately not OCR’d (because they have handwriting, or very difficult text in 16th-century books).
So what I need instead is a way to compare each PDF that has a word count of 0 with any PDF of the same name that has a word count above 0.
I guess I could create a Smart Group of all PDFs in the database, add a column with word counts, sort them by name, and then delete every obvious duplicate. It’s just a long process I’d hoped to automate.
yours
Michael
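The comparison described above can be sketched in a few lines. This is only an illustration of the matching logic in Python, not a DEVONthink script: the record fields (`id`, `name`, `word_count`) and the function name are made up for the example, and it assumes the OCR’d copy kept exactly the same name as the original.

```python
from collections import defaultdict


def find_redundant_originals(records):
    """Return the zero-word PDFs that have a same-named copy with text.

    `records` is an iterable of dicts with the hypothetical keys
    "id", "name", and "word_count" (not DEVONthink's real property names).
    """
    # Group every record under its exact name.
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)

    redundant = []
    for group in by_name.values():
        # Only flag zero-word PDFs when an OCR'd twin (word count > 0) exists.
        if any(r["word_count"] > 0 for r in group):
            redundant.extend(r for r in group if r["word_count"] == 0)
    return redundant


# Made-up sample data: one OCR'd pair and one deliberately un-OCR'd PDF.
sample = [
    {"id": 1, "name": "Letter 1587", "word_count": 0},
    {"id": 2, "name": "Letter 1587", "word_count": 412},
    {"id": 3, "name": "Handwritten notes", "word_count": 0},
]
print([r["id"] for r in find_redundant_originals(sample)])  # prints [1]
```

A PDF that was deliberately left un-OCR’d (record 3 here) is never flagged, because no same-named copy with text exists. It would probably be safer to move the flagged items to a review group first rather than delete them outright.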