Smart Rule AppleScript to OCR a PDF using Nitro PDF Pro (lossless)

tkrunning · September 8, 2023, 3:39pm

When I started using DEVONthink Pro, my main frustration was its OCR engine. It does a decent job of creating the OCR layer in PDFs, but it either drastically reduces the quality of the file contents or drastically increases the file size.

Luckily, Nitro PDF Pro (included with Setapp) does not have this issue, so I set up a Hazel automation on a dedicated folder in iCloud (called “DevonThink OCR inbox”) where I save PDFs and images that need to be OCRed. The PDF automation in Hazel opens the file in Nitro, runs the OCR, saves the file, and moves it to the DT inbox. I don’t recall exactly where I found the AppleScript used in the Hazel automation, but it was probably on the MPU forum.

The embedded script is the following:

tell application "Nitro PDF Pro"
	open theFile as alias
	-- does the document need to be OCR'd?
	get the needs ocr of document 1
	if result is true then
		tell document 1
			ocr
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
		--In PDFpen, when no documents are open, window 1 is "Preferences"
		--If other documents are open, do not close the App.
		if name of window 1 is "Preferences" then
			tell application "Nitro PDF Pro"
				quit
			end tell
		end if
	else
		-- Scan Doc was previously OCR'd or is already a text type PDF.
		tell document 1
			close without saving
		end tell
		--In PDFpen, when no documents are open, window 1 is "Preferences"
		--If other documents are open, do not close the App.
		if name of window 1 is "Preferences" then
			tell application "Nitro PDF Pro"
				quit
			end tell
		end if
	end if
end tell

This workflow works great whether I scan documents with my phone or download PDFs needing OCR on my Mac. However, I already have loads of PDFs inside DT that still need OCR.

So I spent some time getting it to work in DT as a Smart Rule, even though only some minor tweaks were needed.

The updated AppleScript used for the Smart Rule is this:

on performSmartRule(theRecords)
	log "Starting script"
	tell application id "DNtp"
		repeat with theRecord in theRecords
			set thePath to path of theRecord
			tell application "Nitro PDF Pro"
				open (POSIX file thePath) as alias
				-- does the document need to be OCR'd?
				get the needs ocr of document 1
				if result is true then
					tell document 1
						ocr
						repeat while performing ocr
							delay 1
						end repeat
						delay 1
						close with saving
					end tell
					--In PDFpen, when no documents are open, window 1 is "Preferences"
					--If other documents are open, do not close the App.
					if name of window 1 is "Preferences" then
						tell application "Nitro PDF Pro"
							quit
						end tell
					end if
				else
					-- Scan Doc was previously OCR'd or is already a text type PDF.
					tell document 1
						close without saving
					end tell
					--In PDFpen, when no documents are open, window 1 is "Preferences"
					--If other documents are open, do not close the App.
					if name of window 1 is "Preferences" then
						tell application "Nitro PDF Pro"
							quit
						end tell
					end if
				end if
			end tell
			
			
			
		end repeat
	end tell
end performSmartRule

Figured I’d share it here in case others find it helpful.

Note: I’m an AppleScript novice, so if there’s anything that could be improved upon, please let me know!

BLUEFROG · September 8, 2023, 6:35pm

What are the references to PDFpen in your script?
If your smart rule is filtering out PDF documents that don’t need OCR, telling Nitro PDF to determine if OCR needs to be done is superfluous. It’s not wrong and won’t break anything; just noting it.
It’s better practice to not nest tell blocks for different applications when it can be avoided.

—
You can use Word Count is 0 as well in the smart rule criteria.
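
For anyone who would rather do the same check inside the script than in the smart rule criteria, here is a minimal sketch. It assumes the record's `word count` property reads 0 for PDFs without a text layer, and `handleRecordNeedingOCR` is a hypothetical placeholder handler (not part of any script in this thread) standing in for whatever OCR step you use:

```applescript
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- Assumption: a word count of 0 means the PDF has no text layer yet.
			if (word count of theRecord) is 0 then
				set thePath to path of theRecord
				-- Hypothetical handler; replace with the actual OCR step.
				my handleRecordNeedingOCR(thePath)
			end if
		end repeat
	end tell
end performSmartRule

on handleRecordNeedingOCR(thePath)
	-- Placeholder only: just logs the path of a record that appears to need OCR.
	log "Needs OCR: " & thePath
end handleRecordNeedingOCR
```

If the smart rule criteria already filter on Word Count, the `if` test here is redundant and only serves as a safety net.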

PS: NitroPDF isn’t using the same OCR engine so it’s not a 1:1 comparison. Just something to consider.

tkrunning · September 9, 2023, 8:54am

Thank you for your valuable input, Jim!

As mentioned, this script is something that was posted on the MPU forum years ago, before PDFpen was renamed Nitro. I just didn’t think to change the comments when I changed the app name in the script.

Great feedback!

I’ve updated the script used for the Smart Rule (with a bit of help from ChatGPT):

on performSmartRule(theRecords)
    log "Starting script"
    tell application id "DNtp"
        repeat with theRecord in theRecords
            set thePath to path of theRecord
            my processFileWithOCR(thePath)
        end repeat
    end tell
end performSmartRule

on processFileWithOCR(thePath)
    tell application "Nitro PDF Pro"
        -- Open the file
        open (POSIX file thePath) as alias
        
        -- Perform OCR
        tell document 1
            ocr
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
        
        -- Quit the application if no other documents are open
        if name of window 1 is "Preferences" then
            quit
        end if
    end tell
end processFileWithOCR

Yeah, for sure. I know the limitations in DEVONthink are caused by its OCR engine (which has been discussed elsewhere on the forum), and short of changing the OCR engine there’s nothing the DT developers can do to fix the issue. For my purposes it’s important that the OCR process is lossless, meaning it does not resample the actual images in the PDF, which is why I rely on Nitro instead.

Is there any benefit to doing it this way instead of using character count? Any edge cases I’m not considering?

BLUEFROG · September 9, 2023, 2:05pm

You’re welcome

Is there any benefit to doing it this way instead of using character count? Any edge cases I’m not considering?

Nope. Six of one; half a dozen of the other

Not having the Nitro app, I can’t test this, but have you verified that the documents are indeed found in a DEVONthink search?

PS: I had forgotten Nitro bought PDFpen from Smile in the recent past.

chrillek · September 9, 2023, 2:32pm

Can you say anything about the languages supported by this OCR engine? Their website keeps so mum about that, it makes me suspicious…

mbbntu · September 9, 2023, 3:45pm

From Nitro PDF Pro Preferences:

Catalan
Danish
Dutch
English
Finnish
French
German
Italian
Korean
Norwegian
Polish
Portuguese
Russian
Simplified Chinese
Spanish
Swedish
Traditional Chinese
Welsh

tkrunning · September 9, 2023, 5:17pm

Yes, works fine!

What @mbbntu said