Script to OCR PDFs with the latest FineReader

Silverstone · October 16, 2019, 2:11pm

DEVONthink 3 revision

What’s new in this version

Made specifically to use in Smart Rules (more on how to do it later)
Added new Document Properties to clone. OCRing a document means importing a new, recognized document, which gets a new UUID, so all the properties of the «old» document must be cloned to the new one. You can choose in the script exactly what you want to clone (more on this later).
Simplified Progress Bar

Why this script may be a better option

The built-in ABBYY FineReader engine is version 11 (which is much better than the previous built-in one); I tested the script with version 12.1.13. The sample is fairly modest, so some conclusions may be arguable and more evidence is needed.

Better picture quality along with a smaller resulting file size (this is still true in most cases, thanks to MRC)
Recognition is faster and the overall recognition quality is better (I still confirm this)
Automatic outline in resulting PDF (very useful feature). May be turned off in script, if needed.
Correctly OCRs vertical text. Built-in engine cannot do it. This is especially important if you have both text orientations on one page (like pivot tables).
Splits two-sided scans (e.g. book scans). The built-in engine has no such option. May be turned off.
The OCRed document does not lose its Annotation or Reminder. If you OCR a document with the built-in OCR command after making an Annotation or setting a Reminder, you will lose them in the new OCRed document. With this script you choose whether to keep them or not.

Configuring a script

Making a Smart Rule

Copy this script to: «/Users/YOU/Library/Application Scripts/com.devon-technologies.think3/Smart Rules»
Create a Smart Rule. Select the files to search for, e.g. «Extension» is «PDF Document» AND «Word Count» is 0. Add any other conditions you want.
In the action section choose «Execute Script», then «External», and select the name of this script from the drop-down menu.

You are all set. Now you can select any files you want, «Option-click», and choose «Apply Rules» and then «Name of the Rule you created» (or use the menu «Tools» - «Apply Rules»). This starts the process.

If you choose «Perform Rules» instead of «Apply Rules», DT will run the script on the files currently in your rule group (instead of the selected files), so be careful.
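
Either way, the script is entered through its «performSmartRule» handler, which receives the affected records. A stripped-down skeleton of that handler, with the OCR step replaced by a placeholder comment, might look like this (the «pdfNames» list is only illustrative and is not part of the original script):

```applescript
on performSmartRule(theRecords)
	tell application id "DNtp"
		set pdfNames to {} -- illustrative only, not in the original script
		repeat with theRecord in theRecords
			if type of theRecord is PDF document then
				-- the FineReader export and the re-import would go here
				set end of pdfNames to (name of theRecord) as string
			end if
		end repeat
	end tell
end performSmartRule
```

Records that are not PDF documents are simply skipped in this sketch; the full script below asks what to do with them instead.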

The Progress Bar is simplified: you will not see the different stages of recognition, only a text like «(1 of 3): Name of currently recognized file». If you want more details on what’s going on, just switch to FineReader.
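
The progress text is plain string concatenation. As a minimal sketch, the same label can be built by a small helper; the handler name «progressLabel» and its parameter names are hypothetical, the script itself builds the string inline:

```applescript
-- Hypothetical helper: returns a label such as "(1 of 3): Scan.pdf"
on progressLabel(currentIndex, totalCount, recordName)
	-- the integers are coerced to text because the left operand of & is a string
	return "(" & currentIndex & " of " & totalCount & "): " & recordName
end progressLabel

progressLabel(1, 3, "Scan.pdf")
```

Inside the «tell application id "DNtp"» block such a handler would have to be called as «my progressLabel(...)».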

Options you can set in FineReader

Open FineReader Preferences and choose:

Enhance images (yes/no)
Split the «book-scan» (yes/no)
Detect page orientation (yes/no)

Options you can set in Script

Open this script in any script editor and find one of the following sections (they are marked with comments).

FineReader Recognizing Preferences

Every parameter is described in a comment, right in the code.

«langList»: Set the recognition languages (take them from here: https://abbyy.technology/en:products:fre:win:v11:languages).
«PdfLayout»: Set the PDF layout. One from: {page image; text and pictures; text over image; text under image}
«saveType»: How the output documents are saved. One from: {empty pages split files; same files as source; separate file for each page; single file}
«CreateOutlineboolean»: Whether to generate a Table of Contents automatically (yes/no)
«UseMRCboolean»: Whether to use MRC compression (yes/no). Helps to keep files very small and clear
«KeepPageNumberHeadersAndFootersBoolean»: Yes or No to keep page numbers, headers and footers
«EnablePDFTaggingboolean»: Yes or No to save PDF tags
«KeepTextandBackgroundColorsboolean»: Yes or No to keep the background and text colours
«EmbedFontsboolean»: Whether to embed fonts or not
«KeepPicturesboolean»: Yes or No to keep pictures in the OCRed document
«ImageQuality»: Page image quality. One from: {balanced quality; high quality; low quality}
«PageSize»: Define the page size or leave «Automatic». Pick sizes from here: https://gist.github.com/dmgig/5e30cedc17e4458ef2dd52ffb6c552c7
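
As an illustration, a variation of the preferences block for English-only documents saved as one file with smaller images could look like the sketch below. It only uses values named in the lists above and assumes that FineReader is installed, so that its terminology can be resolved; the parameters not shown stay as they are in the script.

```applescript
using terms from application "FineReader"
	set langList to {English} -- recognize English only
	set PdfLayout to text under image -- searchable text layer beneath the page image
	set saveType to single file -- one output file instead of one per source file
	set UseMRCboolean to yes -- keep MRC compression on for small files
	set ImageQuality to balanced quality -- trade some image quality for size
end using terms from
```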

Set up a temporary folder to use

FineReader saves its output file to this path («outPath» in the script). The file is removed after it has been imported into DEVONthink.
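
The script uses a hard-coded folder, so it has to be edited on every machine. One possible alternative, shown here only as an untested sketch, is to ask the system for the user's temporary items folder. The handler name «temporaryOutputPath» is hypothetical, and whether FineReader is allowed to write to that folder is an assumption you would have to check:

```applescript
-- Hypothetical helper: builds an output path inside the user's temporary items folder
on temporaryOutputPath(fileName)
	-- "path to" returns an alias; the POSIX path of a folder ends with "/"
	set tempFolder to POSIX path of (path to temporary items from user domain)
	return tempFolder & fileName
end temporaryOutputPath

temporaryOutputPath("Scan.pdf")
```

In the script this would replace the hard-coded line, e.g. «set outPath to my temporaryOutputPath(theName)».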

Setting up Clone Options

Every parameter is described in a comment. To turn an option off, just put the comment sign «--» at the start of its line.

«addition date»: same “Added” date
«aliases»: Same “Aliases”
«altitude»: Same “Altitude”
«attached script»: Reattach the same script
«comment»: Clone “Finder Comments”
«creation date»: Same “Created” date
«exclude from classification»: same boolean
«exclude from search»: same boolean
«exclude from see also»: same boolean
«exclude from tagging»: same boolean
«label»: Same “Label”
«latitude»: Same “Latitude”
«locking»: Same “Locked/Unlocked” state
«longitude»: Same “Longitude”
«meta data»: Same PDF meta data
«rating»: Same “Rating”
«state»: Same “State/Flag”
«tags»: Same “Tags”
«URL»: Same “URL” field
«custom meta data»: Clone custom meta data if it is not empty

You can also choose to clone properties that the built-in OCR does not clone:

«annotation»: Reattach the same annotation to the OCRed document
«reminder»: Set the same reminder to the OCRed document
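
Turning an option off means commenting out its line. The sketch below shows the idea on a handful of properties, wrapped in a hypothetical handler «cloneSelectedProperties» so that it stands on its own; in the actual script these lines sit inline after the import:

```applescript
-- Hypothetical helper, not part of the original script
on cloneSelectedProperties(theRecord, thePDF)
	tell application id "DNtp"
		set tags of thePDF to tags of theRecord -- cloned
		set label of thePDF to label of theRecord -- cloned
		-- set rating of thePDF to rating of theRecord -- disabled: rating is not cloned
		try
			set annotation of thePDF to annotation of theRecord -- cloned, if present
		end try
		-- try
		-- 	set reminder of thePDF to reminder of theRecord -- disabled: reminder is dropped
		-- end try
	end tell
end cloneSelectedProperties
```

Note that a «try ... end try» block has to be commented out as a whole, not just its middle line.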

Script

Copy the script and save it from any script editor.

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Recognizing…" steps (count of theRecords) with cancel button
			
			-- *********************************************
			-- Set Up Your FineReader Recognizing Preferences Here (the rest of the preferences like: "Enhance picture quality"; "Divide two-paged images"; "Recognize page orientation", you may setup from the FineReader app prefs)
			
			using terms from application "FineReader"
				set langList to {Russian, English} -- set recognizing languages (take it here: https://abbyy.technology/en:products:fre:win:v11:languages)
				set PdfLayout to text under image -- one from: {page image; text and pictures; text over image; text under image}
				set saveType to same files as source
				set CreateOutlineboolean to yes -- Whether to generate a Table of Contents automatically
				set UseMRCboolean to yes -- Helps to keep file sizes very small and clear.
				set KeepPageNumberHeadersAndFootersBoolean to yes -- Yes or No to keep Page numbers, Headers and Footers
				set EnablePDFTaggingboolean to yes --Yes or No to save PDF tags
				set KeepTextandBackgroundColorsboolean to yes -- Yes or No to keep the background and text colours
				set EmbedFontsboolean to yes -- Whether to embed fonts or not
				set KeepPicturesboolean to yes -- Yes or No to keep pictures in the OCRed document
				set ImageQuality to high quality -- Page image quality. One from: {balanced quality; high quality; low quality}
				set PageSize to automatic -- pick one from here: https://gist.github.com/dmgig/5e30cedc17e4458ef2dd52ffb6c552c7
			end using terms from
			
			-- *********************************************
			
			try
				set theNumber to 0
				repeat with theRecord in theRecords
					step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
					
					set theName to (filename of theRecord) as string
					if cancelled progress then exit repeat
					set theType to type of theRecord
					if theType is PDF document then
						
						set oldName to theName & "_old"
						set name of theRecord to oldName
						set inPath to path of theRecord
						
						-- *********************************************
						-- Set Up Your Temporary Folder Here ("outPath" - the folder where FineReader will create a recognized file, it will be deleted after import):
						
						set outPath to "/Users/ilya/Documents/00_Temp/" & theName
						
						-- *********************************************
						
						set theNumber to theNumber + 1
						tell application "FineReader"
							
							repeat until is finereader controller active
								delay 1
							end repeat
							
							export to pdf outPath from file inPath ¬
								ocr languages enum langList ¬
								export mode PdfLayout ¬
								saving type saveType ¬
								create outline CreateOutlineboolean ¬
								use mrc UseMRCboolean ¬
								keep page numbers headers and footers KeepPageNumberHeadersAndFootersBoolean ¬
								enable pdf tagging EnablePDFTaggingboolean ¬
								keep text and background colors KeepTextandBackgroundColorsboolean ¬
								embed fonts EmbedFontsboolean ¬
								keep pictures KeepPicturesboolean ¬
								image quality ImageQuality ¬
								page size PageSize
							
							set isBusy to true
							
							repeat until isBusy is false
								delay 1
								set isBusy to (is busy) as boolean
							end repeat
							
						end tell
						
						delay 1
						
						try
							set theParents to parents of theRecord
							set thePDF to import outPath to (item 1 of theParents)
							
							-- *********************************************
							-- In this section of the script you can set up options to clone the properties of the OCRed copy. Turn them off if you don't want them to clone (with comment symbol "--")
							
							-- Restoring replicants
							repeat with i from 2 to (count of theParents)
								replicate record thePDF to (item i of theParents)
							end repeat
							
							set addition date of thePDF to addition date of theRecord -- Same "Added" date
							set aliases of thePDF to aliases of theRecord -- Same "Aliases"
							set altitude of thePDF to altitude of theRecord -- Same "Altitude"
							set attached script of thePDF to attached script of theRecord -- Same "Script" attached 
							set comment of thePDF to comment of theRecord -- Same "Finder Comments"
							set creation date of thePDF to creation date of theRecord -- Same "Created" date
							set exclude from classification of thePDF to exclude from classification of theRecord
							set exclude from search of thePDF to exclude from search of theRecord
							set exclude from see also of thePDF to exclude from see also of theRecord
							set exclude from tagging of thePDF to exclude from tagging of theRecord
							set label of thePDF to label of theRecord -- Same "Label"
							set latitude of thePDF to latitude of theRecord -- Same "Latitude"
							set locking of thePDF to locking of theRecord -- Same "Locked/Unlocked" state
							set longitude of thePDF to longitude of theRecord -- Same "Longitude"
							set meta data of thePDF to meta data of theRecord -- Same PDF meta data
							set rating of thePDF to rating of theRecord -- Same "Rating"
							set state of thePDF to state of theRecord -- Same "State/Flag"
							set tags of thePDF to tags of theRecord -- Same "Tags"
							set URL of thePDF to URL of theRecord -- Same "URL" field
							
							try
								set custom meta data of thePDF to custom meta data of theRecord -- Cloning Custom meta data, if not empty
							end try
							try
								set annotation of thePDF to annotation of theRecord -- setting up the same Annotation, if not empty
							end try
							try
								set reminder of thePDF to reminder of theRecord -- setting up the same Reminder, if not empty
							end try
							
							-- *********************************************
							
							delete record theRecord
							
						end try
						
						tell application "Finder" to delete outPath as POSIX file
						
					else
						display dialog "File: " & theName & " is not a PDF file" with title "Not a PDF" buttons {"Skip File", "Stop Script"} default button "Skip File" with icon caution giving up after 5
						if the gave up of the result is true or button returned of the result is "Skip File" then
							set theNumber to theNumber + 1
						else
							exit repeat
						end if
					end if
					if cancelled progress then exit repeat
				end repeat
			on error error_message number error_number
				if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
			end try
			hide progress indicator
		else
			display dialog "Select PDF files in DEVONthink." with title "No selection" with icon caution buttons {"Cancel", "OK"} default button "OK"
		end if
	end tell
	
	tell application "FineReader"
		repeat until is finereader controller active
			delay 1
		end repeat
		quit
	end tell
	
end performSmartRule