Many academic PDFs downloaded from sites like JSTOR have defective text layers and need to be re-OCRed before they can be searched or annotated. I want to automate this process, so that anything I tag with ‘OCRthis’ gets re-OCRed in the background.
This is easily set up as a smart rule pointing to a script; however, I’m having difficulty figuring out the DEVONthink AppleScript ocr command. I think what I’m missing is that the direct parameter of the command is a reference, not a record, and I’m not clear on what that means.
So, this won’t work:
set thePath to path of eachRecord
set theResult to ocr eachRecord file thePath type PDF document without waiting for reply
As far as the script is concerned: in the second line you have an extraneous eachRecord, so the line should read:
set theResult to ocr file thePath type PDF document without waiting for reply
Note, however, that the resulting (new) file will be created in the global inbox. I’m not sure that is what you want. Incorporating the OCR command directly into the smart rule would leave the item otherwise untouched, that is, with the original creation date, tags and so on.
(P.S. I want to do this via AppleScript, rather than a plain smart rule, as there are several other steps that I want to add that won’t work without scripting. Also, no problem re the global inbox. I know how to send it to another destination group.)
Just be aware that it is a new document with a new ID. If that were a problem, you could run OCR > Apply as the first step in the rule and perform the script actions as a second or later step.
records.forEach(record => {
  let ocrRecord = record;
  if (record.wordCount() === 0) {
    /* No text layer, perform OCR */
    ocrRecord = app.ocr({ file: record.path(), waitingForReply: true });
    /* Remove the original now that the OCRed copy exists */
    app.delete({ record: record, in: record.parents[0] });
  }
  /* further processing */
});
It loops over every record (forEach). In the loop, it sets ocrRecord to the current record, then checks whether the current record lacks a text layer (wordCount() === 0). If so, it runs OCR on the original file, replaces ocrRecord with the result, and deletes the original record. That takes care of @Blanc’s remark.
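For anyone who wants to try the logic outside DEVONthink, here is a minimal sketch of the same check-then-OCR step as a standalone function. The name ocrIfNeeded is hypothetical, and the sketch assumes that ocr takes only named parameters (in line with the remark above that the command has no direct record parameter) and that a failed OCR returns a falsy value, which I have not verified. The application object is passed in as an argument so that a stand-in can be used for testing.

```javascript
// Sketch only: `ocrIfNeeded` is a hypothetical helper, not part of DEVONthink.
// `app` is expected to provide ocr() and delete() with the named parameters
// used in this thread; `record` must provide wordCount(), path() and parents.
function ocrIfNeeded(app, record) {
  // A non-zero word count means a text layer exists, so keep the record.
  if (record.wordCount() > 0) {
    return record;
  }

  // No text layer: OCR the underlying file and wait for the new record.
  const ocrRecord = app.ocr({ file: record.path(), waitingForReply: true });

  // Assumption: a failed OCR yields a falsy value. Keep the original then.
  if (!ocrRecord) {
    return record;
  }

  // Only delete the original once the OCRed replacement exists.
  app.delete({ record: record, in: record.parents[0] });
  return ocrRecord;
}
```

Inside the loop this would be used as records.forEach(record => { const ocrRecord = ocrIfNeeded(app, record); /* further processing */ }). Returning the original record when OCR fails means a failed run never removes the only copy.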
Thanks both. Here’s the solution I arrived at, using Shane Stanley’s FileManagerLib. (N.B. The compressPDF handler compresses the file with PDF Squeezer, as mentioned here.)
use AppleScript version "2.8"
use scripting additions
use myLib : script "myLib"
use script "FileManagerLib" version "2.3.5"

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with eachRecord in theRecords
			try
				set oldPath to path of eachRecord
				-- OCR the file into a new record in the incoming group
				set newRecord to ocr file oldPath type PDF document to incoming group with waiting for reply
				set newPath to path of newRecord
				---
				-- Give the OCRed file the original file name, then move it over the original file
				set oldPathElements to parse object oldPath
				set newPathElements to parse object newPath
				if (full_name of newPathElements) is not (full_name of oldPathElements) then
					rename object newPath to name (full_name of oldPathElements)
				end if
				move object newPath to folder (parent_folder_path of oldPathElements) with replacing
				-- Remove the now-redundant record created by the OCR step
				delete record newRecord
				---
				-- compressPDF is a handler in my own library; an empty result is treated as success
				set theResult to myLib's compressPDF(oldPath, "l")
				if theResult is "" then add custom meta data 1 for "mdfilecompressed" to eachRecord
			on error theError
				display alert theError
				return
			end try
		end repeat
	end tell
end performSmartRule