convert to searchable PDF script?

Could a script be written that would let us use DT’s OCR outside of the database? I often have a PDF that I need to OCR and then attach to my bibliographic database; only after that does the PDF go into DT (my bibliography manager’s attachment folder is indexed by DT).

It would be great if I could skip the process of moving the PDF into DT, converting, and then taking it out again.

Thanks, Danny

The “Import Droplet.app” and the “OCR Commands Suite” are probably a good starting point for such a script.

Don’t forget Automator!

Hi Danny!

Have you been successful in writing such a workflow/script? I am trying to set up something similar so I would be very interested to hear about your advances.

Nils

Hi Nils,

I’m not savvy with AppleScript, or really with Automator, so I’ve been lazily sitting back hoping someone else will figure it out and share it!

Sorry :(

I have made some alterations to a script written by Eric Boehnisch. It is supposed to do the following with my database (I use DT Pro Office), which contains only indexed PDFs that stay on a network volume:

  • find those PDFs which have not yet been OCR’d
  • convert those PDFs to searchable PDFs (this stores them in my database)
  • delete the original file in the database AND on the network volume
  • export the searchable PDF to the network volume
  • delete the searchable PDF in the database
  • delete the automatically produced file “DEVONtech_storage”

There are still some problems with my script:

  1. I cannot find a way to tell DEVONthink to delete the item in the database AND the original file, though this is possible manually (DT asks whether you want to delete only the link or also the original file).
  2. Because of that problem, I think, the export command does not work (it does not replace the original file).
  3. The Finder does not delete the file “DEVONtech_storage”. This might be due to a wrong path name, but nothing I have tried has worked.

Any help is much appreciated.

Nils

using terms from application "DEVONthink Pro"
    tell application "DEVONthink Pro"
        activate
        set theDatabase to current database
        set theContents to contents of theDatabase
        repeat with this_record in theContents
            set this_text to plain text of this_record
            if this_text is "" and type of this_record is not group then
                try
                    set converted_record to convert image record this_record
                    delete record this_record
                    export record converted_record to "/Volumes/test/"
                    delete record converted_record
                end try
            end if
        end repeat
    end tell
end using terms from
tell application "Finder"
    delete file "DEVONtech_storage" of folder "/Volumes/test"
end tell
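
Regarding problem 3: the Finder addresses items by colon-separated HFS paths or aliases, whereas “/Volumes/test” is a slash-separated POSIX path, which may be why the delete fails. A minimal sketch of one way around this, assuming the volume really is mounted at /Volumes/test and the file is named exactly DEVONtech_storage (the variable names are made up for illustration):

```applescript
-- Assumption: the network volume is mounted at /Volumes/test
set storagePath to "/Volumes/test/DEVONtech_storage"
try
    -- POSIX file turns the slash-separated path into a file reference;
    -- coercing it to an alias fails if the file does not exist
    set storageItem to (POSIX file storagePath) as alias
    tell application "Finder" to delete storageItem
on error errMsg
    -- nothing to delete, or the Finder refused; just record why
    log errMsg
end try
```

Because the alias coercion fails when the file is missing, the try block also covers runs in which no DEVONtech_storage file was created.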

I have now succeeded in writing a script that works; for AppleScript professionals it surely looks a little bit clunky, so any help is appreciated. However, beware: it is supposed to delete files, so take care. The only problem I have is that AppleScript sometimes stops with the message that the file supposed to be deleted is already in use. Any suggestions for that?

Nils

using terms from application "DEVONthink Pro"
    tell application "DEVONthink Pro"
        activate
        set theDatabase to current database
        set theContents to contents of theDatabase
        repeat with this_record in theContents
            -- the script treats a word count of 0 as "not yet OCR'd"
            set countWords to word count of this_record
            if countWords is 0 and type of this_record is not group then
                set name_of_this_record to (the name of this_record)
                -- HFS path of the original file ("neuer_Pfad" = "new path")
                set neuer_Pfad to "Testvolume:Testfolder:" & (name_of_this_record)
                try
                    -- OCR can take a long time, so allow up to two hours
                    with timeout of 7200 seconds
                        set converted_record to convert image record this_record
                    end timeout
                    delete record this_record
                    tell application "Finder"
                        delete file neuer_Pfad
                    end tell
                    export record converted_record to "/Volumes/Testvolume/Testfolder/"
                    delete record converted_record
                on error
                    -- converted_record is undefined if the conversion itself failed
                    try
                        delete record converted_record
                    end try
                end try
            end if
        end repeat
    end tell
end using terms from
tell application "Finder"
    try
        delete file "Testvolume:Testfolder:DEVONtech_storage"
    end try
end tell

Cool. Now pretend that you are talking to a complete idiot when it comes to AppleScript (because you are) and tell me exactly what I’m supposed to do with this.

You could attach it to a folder as a folder action. When you put an image-only PDF inside the folder, the script should do the following:

  • invoke DT
  • DT would then OCR it and create a new searchable PDF in its current database
  • then DT would put the OCR’d PDF back into the original folder and delete the image-only PDF, the OCR’d PDF in the database, and the annoying “DEVONtech_storage” file

You just have to change the names of the folders and the folder location, according to your needs.
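
For anyone who has not used folder actions before, the outline above could be wired up roughly like this. It is only a sketch: `processPDF` is a hypothetical handler name, and its body is a placeholder where the DEVONthink steps from the script earlier in the thread would go. The `on adding folder items to` handler is the standard hook a folder action script receives.

```applescript
-- Called by the system whenever items are added to the folder
-- this script is attached to.
on adding folder items to thisFolder after receiving addedItems
    repeat with anItem in addedItems
        set itemPath to POSIX path of (contents of anItem)
        -- AppleScript string comparisons ignore case by default,
        -- so this also matches ".PDF"
        if itemPath ends with ".pdf" then
            my processPDF(itemPath)
        end if
    end repeat
end adding folder items to

-- Hypothetical placeholder: the convert / export / delete steps
-- from the DEVONthink script would go here.
on processPDF(pdfPath)
    log "Would OCR: " & pdfPath
end processPDF
```

Save it as a script and attach it to the folder with Folder Actions Setup. One thing to watch: exporting the OCR’d PDF back into the watched folder also counts as adding an item, so the handler would presumably fire again. A check like the word-count test in the script above, or exporting to a different folder, should prevent a loop.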

After the script has run, the image-only PDF has been replaced by a PDF with a text layer and is still outside DT (which is what you wanted, I think). I have extended the script so that it indexes the newly created PDF. In this way, I can OCR PDFs without having to import them into my database.
Unfortunately, I frequently get an unspecific AppleScript error (“unable to read file”), which stops the script altogether. I had hoped to get around the error by wrapping the commands in a “try” block, but this did not work. Could someone with a bit more experience please have a look at the script?

Nils

Hm, I think I found the problem: the PDFs that trigger the AppleScript errors were longer than 50 pages, the limit in DT Pro Office. I would like to exclude those PDFs, but I could not find a reference to document length in DT’s script dictionary. Did I miss something?

Nils
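
One possible workaround that does not depend on DT’s dictionary is to ask Spotlight for the page count. A minimal sketch, assuming the PDF sits on a Spotlight-indexed volume (network volumes often are not) and that the installed `mdls` supports the `-raw` option; `pageCountOf` and the example path are made up for illustration:

```applescript
-- Returns the page count Spotlight has recorded for a PDF,
-- or missing value if Spotlight has no answer for that file.
on pageCountOf(pdfPath)
    try
        set rawCount to do shell script "mdls -raw -name kMDItemNumberOfPages " & quoted form of pdfPath
        -- mdls prints "(null)" for unknown attributes; the coercion
        -- then fails and we fall through to the error branch
        return rawCount as integer
    on error
        return missing value
    end try
end pageCountOf

-- Hypothetical usage: skip anything over the 50-page limit
set n to pageCountOf("/Volumes/Testvolume/Testfolder/example.pdf")
if n is not missing value and n > 50 then
    log "Skipping: more than 50 pages"
end if
```

Files for which Spotlight has no page count come back as missing value, so the caller has to decide whether to OCR those anyway.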

Nils, that 50-page limit per PDF on OCR was formerly imposed by our license with IRIS, but has been removed. So make sure you are using DT Pro Office 1.5.2.1.

You might inspect the PDFs that generate the error message. A PDF that requires a password to open would be a problem. There are some “flavors” of PDF files that may not be compatible with Apple’s PDFKit. In some cases, if the file will open under Preview, using Save As in Preview may allow OCR conversion by DT Pro Office. If you have Acrobat 8, re-saving a PDF as an earlier version, perhaps 1.4, may make the PDF readable.

I should also mention that some hacks to OS X can cause problems. Unsanity’s ShapeShifter, for example, severely affects the creation and saving of PDFs (and causes numerous other problems).

I didn’t know that! Sweet

Bill, thank you for the information!

Nils