Background re-OCRing of defective PDFs

P12 · July 15, 2022, 6:08pm

Many academic PDFs downloaded from sites like JSTOR have defective text layers and need to be re-OCRed before they can be searched or annotated. I want to automate this process, so that anything I tag with ‘OCRthis’ gets re-OCRed in the background.

This is easily set up as a smart rule pointing to a script; however, I’m having difficulty figuring out the DEVONthink AppleScript ocr command. I think what I’m missing is that the direct parameter of the command is to a reference not to a record. I’m not clear on what that means.

So, this won’t work:

	set thePath to path of eachRecord
	set theResult to ocr eachRecord file thePath type PDF document without waiting for reply

Many thanks for any advice given.

Blanc · July 15, 2022, 7:56pm

You could incorporate the actions directly in the smart rule:

As far as the script is concerned: in the second line you have an extraneous eachRecord, so the line should read set theResult to ocr file thePath type PDF document without waiting for reply. Note, however, that the resulting - new - file will be created in the global inbox. I’m not sure that is what you want? Incorporating the OCR command directly into the smart rule would leave the item otherwise untouched, that is with the original created date, tags and so on.

P12 · July 16, 2022, 10:30am

Ah ha! Thanks. That solves it.

The dictionary is a bit confusing as it seems to require a reference as the direct parameter (as something distinct from the file).

(P.S. I want to do this via AppleScript, rather than a plain smart rule, as there are several other steps that I want to add that won’t work without scripting. Also, no problem re the global inbox. I know how to send it to another destination group.)

Thanks again.

Blanc · July 16, 2022, 10:38am

Just be aware that it is a new document with a new ID. If that were a problem, you could OCR Apply as the first step in the rule and perform script actions as a second or further step.

chrillek · July 16, 2022, 12:30pm

Here’s what I do in a (longish) JavaScript:

 records.forEach (record => {
    let ocrRecord = record;
    if (record.wordCount() === 0) {
      /* No text layer, perform OCR */
      ocrRecord = app.ocr(record.path(), { file: record.path(), waitingForReply: true});
      app.delete({record: record, in: record. parents[0]});
    }
   /* further processing */

It loops over every record (forEach). In the loop, it sets ocrRecord to the current record, then checks if the current record already lacks a text layer (wordCount === 0). If so, it runs OCR on the original record and replaces ocrRecord with the result. It finally deletes the original record. That takes care of @Blanc’s remark.

The AppleScript would look similar.

P12 · July 16, 2022, 6:47pm

Thanks both. Here’s the solution I arrived at, using Shane Stanley’s FileManagerLib. (N.B. The compressPDF handler compresses the file with PDF Squeezer, as mentioned here.)

use AppleScript version "2.8"
use scripting additions
use myLib : script "myLib"
use script "FileManagerLib" version "2.3.5"

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with eachRecord in theRecords
			try
				set oldPath to path of eachRecord
				set newRecord to ocr file oldPath type PDF document to incoming group with waiting for reply
				set newPath to path of newRecord
				---
				set oldPathElements to parse object oldPath
				set newPathElements to parse object newPath
				if (full_name of newPathElements) is not (full_name of oldPathElements) then
					rename object newPath to name (full_name of oldPathElements)
				end if
				move object newPath to folder (parent_folder_path of oldPathElements) with replacing
				delete record newRecord
				---
				set theResult to myLib's compressPDF(oldPath, "l")
				if theResult is "" then add custom meta data 1 for "mdfilecompressed" to eachRecord
			on error theError
				display alert theError
				return
			end try
		end repeat
	end tell
end performSmartRule