Script to OCR PDFs with the latest FineReader

Silverstone · March 28, 2019, 5:21pm

Hey there,

08.04.2019
Version 1.1

What’s new:

Function «is finereader controller active» is now used (more stable)
Added a progress bar (the percentage comes from the FR function “get document progress”):
- to view the progress as a dialog, save the script as an applet;
- to view the progress in the menu bar, run the script from the menu bar;
- running it from KM or other apps or launchers makes the progress bar invisible.
Added a file type handler:
- if one or more of the files you selected in DTPO are not PDFs, a dialog lets you skip that file or stop the script;
- if you do nothing, the script proceeds with the default answer (skip the non-PDF file) after a 5-second timeout.
Clearly marked the places where you need to modify the script to tailor it to your setup:
- Set up the FR recognition parameters
- Set up the temporary folder

28.03.2019
Version 1.0

I’ve just finished a script that OCRs multiple PDFs from DTPO.

What is new compared to the existing method (I used ABBYY FR Pro 12):

Better picture quality along with a smaller resulting file size
Recognition is faster and its quality is better
Automatic outline in the resulting PDF
“Aliases” and “Exclude from…” metadata are also preserved from the original PDF
Many other tweaks you can make manually in the script (paper format, embedded fonts, recognition languages, etc.)

Here is the script:

use AppleScript version "2.4"
use scripting additions

tell application id "DNtp"
	try
		set theSelection to the selection
		set theNumber to 0
		if theSelection is not {} then
			show progress indicator "Recognizing..." steps (count of theSelection)
			
			-- Set Up Your FineReader Recognizing Preferences Here:
			
			using terms from application "FineReader"
				set langList to {Russian, English}
				set PdfLayout to text under image
				set saveType to same files as source
				set CreateOutlineboolean to yes
				set UseMRCboolean to yes
				set KeepPageNumberHeadersAndFootersBoolean to yes
				set EnablePDFTaggingboolean to yes
				set KeepTextandBackgroundColorsboolean to yes
				set EmbedFontsboolean to yes
				set KeepPicturesboolean to yes
				set ImageQuality to high quality
				-- set PageSize to A4
			end using terms from
			
			repeat with theRecord in theSelection
				set theName to (name of theRecord) as string
				step progress indicator theName
				if cancelled progress then exit repeat
				set theType to type of theRecord
				if theType is PDF document then
					
					set oldName to theName & "_old"
					set name of theRecord to oldName
					set inPath to path of theRecord
					
					-- Set Up Your Temporary Folder Here:
					
					set outPath to "/Users/ilya/Documents/00_Temp/" & theName & ".pdf"
					set theNumber to theNumber + 1
					
					tell application "FineReader"
						
						repeat until is finereader controller active
							delay 1
						end repeat
						
						export to pdf outPath from file inPath ¬
							ocr languages enum langList ¬
							export mode PdfLayout ¬
							saving type saveType ¬
							create outline CreateOutlineboolean ¬
							use mrc UseMRCboolean ¬
							keep page numbers headers and footers KeepPageNumberHeadersAndFootersBoolean ¬
							enable pdf tagging EnablePDFTaggingboolean ¬
							keep text and background colors KeepTextandBackgroundColorsboolean ¬
							embed fonts EmbedFontsboolean ¬
							keep pictures KeepPicturesboolean ¬
							image quality ImageQuality ¬
							-- page size PageSize
						
						set isBusy to true
						tell me to set progress total steps to 100
						tell me to set progress description to "Recognizing PDF: " & theNumber & " from " & (count of theSelection)
						
						repeat until isBusy is false
							delay 1
							set theProgress to get document progress
							tell me to set progress completed steps to theProgress
							tell me to set progress additional description to theName & ": " & theProgress & "%..."
							set isBusy to (is busy) as boolean
						end repeat
						
					end tell
					
					delay 1
					
					try
						set theParents to parents of theRecord
						set thePDF to import outPath to (item 1 of theParents)
						
						repeat with i from 2 to (count of theParents)
							replicate record thePDF to (item i of theParents)
						end repeat
						
						set addition date of thePDF to addition date of theRecord
						set aliases of thePDF to aliases of theRecord
						set attached script of thePDF to attached script of theRecord
						set comment of thePDF to comment of theRecord
						set creation date of thePDF to creation date of theRecord
						set exclude from classification of thePDF to exclude from classification of theRecord
						set exclude from search of thePDF to exclude from search of theRecord
						set exclude from see also of thePDF to exclude from see also of theRecord
						set exclude from tagging of thePDF to exclude from tagging of theRecord
						set label of thePDF to label of theRecord
						set locking of thePDF to locking of theRecord
						-- set modification date of thePDF to modification date of theRecord
						-- set opening date of thePDF to opening date of theRecord
						set state of thePDF to state of theRecord
						set tags of thePDF to tags of theRecord
						set URL of thePDF to URL of theRecord
						
						delete record theRecord
						
					end try
					
					tell application "Finder" to delete outPath as POSIX file
					tell me to set progress total steps to 0
					tell me to set progress completed steps to 0
					tell me to set progress description to ""
					tell me to set progress additional description to ""
				else
					display dialog "File: " & theName & " is not a PDF file" buttons {"Skip File", "Stop Script"} default button "Skip File" with icon caution giving up after 5
					if the gave up of the result is true or button returned of the result is "Skip File" then
						set theNumber to theNumber + 1
					else
						exit repeat
					end if
				end if
			end repeat
			hide progress indicator
		end if
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

tell application "FineReader"
	repeat until is finereader controller active
		delay 1
	end repeat
	quit
end tell

lande80 · April 8, 2019, 2:15am

Hi, I would love to be able to make this work. Every time I use it, I get an error message that reads as below. Any idea what I might be doing wrong? Thanks!

Silverstone · April 8, 2019, 3:03pm

You may want to try this solution, if it is the same problem.

lande80 · April 8, 2019, 3:35pm

Unfortunately, I end up with the same error…

Silverstone · April 8, 2019, 4:03pm

I’ve uploaded the new version; it works fine on my system…
In the new version I’ve marked where in the script you need to set up a temporary folder for FR. Just replace it with your own folder and make sure that Finder is allowed to delete files there.
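
A minimal sketch of that setup step, assuming you would rather not hard-code a folder. It builds the output path in the per-user temporary items folder via the Standard Additions command `path to temporary items`. The record name is a made-up placeholder, and whether FineReader can write to that folder on your system is an assumption you should verify:

```applescript
use AppleScript version "2.4"
use scripting additions

-- Placeholder for the record name that the script above stores in theName
set theName to "Example scan"

-- The POSIX path of a folder alias ends with "/", so the name can be appended directly
set tempFolder to POSIX path of (path to temporary items from user domain)
set outPath to tempFolder & theName & ".pdf"
```

If this works on your system, the resulting `outPath` line could stand in for the hard-coded one under the «Set Up Your Temporary Folder Here» comment.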

lande80 · April 8, 2019, 4:11pm

Now I got it! Thank you. What a wonderful script!

lande80 · April 8, 2019, 9:21pm

May I ask one further question please?

If I want FR to split the pages and enhance images, how would I go about introducing that to the script? Sorry, I don’t know how to write scripts.

Silverstone · April 9, 2019, 10:52am

I think you can set these things up in Preferences - General (“split the pages”, “enhance images” and “page orientation”).

With the script you can set:

Recognition languages (“langList”). You can look them up here. Find your language, pick its “Internal Name” and paste it into the script.
PDF Layout (“PdfLayout”). Options:
- page image
- text and pictures
- text over image
- text under image
Create Outline (“CreateOutlineboolean”). Whether FR creates a PDF outline from the headings in the document. “yes” or “no”.
Apply MRC compression (“UseMRCboolean”). Keeps the files very small and clear. “yes” or “no”.
Keep page numbers, headers and footers (“KeepPageNumberHeadersAndFootersBoolean”). “yes” or “no”.
Save PDF tags (“EnablePDFTaggingboolean”). “yes” or “no”.
Keep the background and text colours (“KeepTextandBackgroundColorsboolean”). “yes” or “no”.
Embed fonts (“EmbedFontsboolean”). “yes” or “no”.
Keep pictures in the OCRed document (“KeepPicturesboolean”). “yes” or “no”.
Page image quality (“ImageQuality”). Options are:
- balanced quality
- high quality
- low quality
Page size. The list of possible options is here, from line 275.

That’s it.
I’ve marked the place in the script where you can change these parameters.

lande80 · April 10, 2019, 12:39pm

Ah, thank you. I hadn’t realized that once the settings are set up in my preferences they would also apply when using the script. This is very helpful! Appreciate you taking the time!

GrantBarrett · May 3, 2019, 8:29pm

Thank you for this script! It’s the perfect solution to my problem: how to use my already existing OCR Finereader without having to pay for the less sophisticated Devonthink version.

Silverstone · May 3, 2019, 9:57pm

Glad it is useful for you
Welcome to the forums

nm1 · June 18, 2019, 12:30am

@GrantBarrett: Would you mind clarifying how the version of Finereader that DTPO uses is less sophisticated? I was looking into Finereader Pro 12, which I believe is one full version ahead of what is being used in DTPO as of DT3 beta 3, but it seems awfully expensive to purchase separately unless it’s providing something major that is lacking from the OCR engine that DTPO provides. Am I missing something about this? Reliable OCR is very important for my work so I am very interested if you have any thoughts to share about this! Thank you in advance.

zeitlings · June 18, 2019, 10:01am

FineReader could, for example, split two-sided scans of books into single pages for a more streamlined document and realign not too heavily crooked lines. As OP stated, the app can also automatically generate a ToC. However, the results I get with DT are usually excellent and entirely satisfactory for most of my needs; also in regards to speed. (Is the app really faster?)

With the FineReader app I actually encountered weird resizing and image-quality issues – I applied the appropriate settings – that I never ran into with DT. I don’t know why, but I’ll take it. If I didn’t have access to FineReader in another context, I wouldn’t miss it or use it. But perhaps there’s more sophistication to it that I’m just ignorant about.

Silverstone · October 16, 2019, 2:11pm

DEVONthink 3 revision

What’s new in this version

Made specifically for use in Smart Rules (more on how to do it later)
Added new document properties to clone. Keep in mind that OCRing means importing a new, recognized document, which gets a new UUID, so all the properties of the «old» document must be cloned to the new one. You can choose in the script exactly what you want to clone (more on this later).
Simplified Progress Bar

Why this script may be a better option

The built-in ABBYY FineReader engine is version 11 (which is much better than the previous built-in one); I tested the script with version 12.1.13. My sample is fairly modest, so some conclusions may be arguable and more evidence is needed.

Better picture quality along with a smaller resulting file size (this is still true in most cases, thanks to MRC)
Recognition is faster and its quality is better overall (I can still confirm this)
Automatic outline in the resulting PDF (a very useful feature). Can be turned off in the script if needed.
Correctly OCRs vertical text, which the built-in engine cannot do. This is especially important if you have both text orientations on one page (as in pivot tables).
Splits two-sided scans (e.g. book scans). The built-in engine has no such option. Can be turned off.
The OCRed document does not lose its Annotation or Reminder. If you OCR a document with the built-in OCR command after adding an Annotation or setting a Reminder, you will lose them in the new OCRed document. With this script you choose whether to keep them.

Configuring the script

Making a Smart Rule

Copy this script to: «/Users/YOU/Library/Application Scripts/com.devon-technologies.think3/Smart Rules»
Create a Smart Rule. Select the files to search for, e.g. «Extension» is «PDF Document» AND «Word Count» is 0. Add any other conditions you want.
In the action section choose «Execute Script», choose «External» and select the name of this script from the drop-down menu.

You are all set. Now you can select any files you want, «Option-click», and choose «Apply Rules» and then «Name of the Rule you created» (or use the menu «Tools» - «Apply Rules»). This starts the process.

If you choose «Perform Rules» instead of «Apply Rules», DT will run the script on the files currently in your rule group (instead of the selected files), so be careful.
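
Before pointing a rule at the full OCR script, it may help to check what the rule actually hands over. The following is only a sketch, assuming the same `performSmartRule` handler signature and the same `type of theRecord is PDF document` test used in the script in this post. It changes nothing and only reports how many of the passed records are PDFs:

```applescript
on performSmartRule(theRecords)
	tell application id "DNtp"
		-- Count the records that the OCR script would accept as PDFs
		set pdfCount to 0
		repeat with theRecord in theRecords
			if (type of theRecord) is PDF document then set pdfCount to pdfCount + 1
		end repeat
		display dialog "Rule passed " & (count of theRecords) & " record(s), " & pdfCount & " of them PDF." buttons {"OK"} default button "OK"
	end tell
end performSmartRule
```

Saved into the same «Smart Rules» folder and selected in «Execute Script», it shows whether «Apply Rules» and «Perform Rules» pick up the files you expect.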

The Progress Bar is simplified: you will not see the different recognition stages, only a text like «(1 of 3): Name of currently recognized file». If you want more details on what’s going on, just switch to FineReader.

Options you can set in FineReader

Open FineReader Preferences and choose:

Enhance images (yes/no)
Split the «book-scan» (yes/no)
Detect page orientation (yes/no)

Options you can set in Script

Open this script in any script editor and find the relevant section (the sections are marked with comments).

FineReader Recognizing Preferences

Each parameter is described in a comment, right in the code.

«langList»: Set the recognition languages (take them from here).
«PdfLayout»: Set the PDF layout. One of: {page image; text and pictures; text over image; text under image}
«saveType»: How the documents are saved. One of: {empty pages split files; same files as source; separate file for each page; single file}
«CreateOutlineboolean»: Whether to generate a Table of Contents automatically (yes/no)
«UseMRCboolean»: Whether to use MRC compression (yes/no). Helps to keep files very small and clear
«KeepPageNumberHeadersAndFootersBoolean»: Yes or No to keep page numbers, headers and footers
«EnablePDFTaggingboolean»: Yes or No to save PDF tags
«KeepTextandBackgroundColorsboolean»: Yes or No to keep the background and text colours
«EmbedFontsboolean»: Whether to embed fonts or not
«KeepPicturesboolean»: Yes or No to keep pictures in the OCRed document
«ImageQuality»: Page image quality. One of: {balanced quality; high quality; low quality}
«PageSize»: Define the page size or leave it as «automatic». Pick sizes from here.

Set up a temporary folder to use

This path will be used by FineReader to save an output file. This file will be removed after import to DEVONthink.

Setting up Clone Options

Each property is described in a comment. If you want to turn one off, just comment out its line with «--».

«addition date»: Same “Added” date
«aliases»: Same “Aliases”
«altitude»: Same “Altitude”
«attached script»: Reattach the same script
«comment»: Clone “Finder Comments”
«creation date»: Same “Created” date
«exclude from classification»: Same boolean
«exclude from search»: Same boolean
«exclude from see also»: Same boolean
«exclude from tagging»: Same boolean
«label»: Same “Label”
«latitude»: Same “Latitude”
«locking»: Same “Locked/Unlocked” state
«longitude»: Same “Longitude”
«meta data»: Same PDF meta data
«rating»: Same “Rating”
«state»: Same “State/Flag”
«tags»: Same “Tags”
«URL»: Same “URL” field
«custom meta data»: Clone custom meta data if it is not empty

You can also choose to clone properties, which are not cloned using built-in OCR:

«annotation»: Reattach the same annotation to the OCRed document
«reminder»: Set the same reminder to the OCRed document

Script

Copy and save it in any script editor.

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Recognizing…" steps (count of theRecords) with cancel button
			
			-- *********************************************
			-- Set Up Your FineReader Recognizing Preferences Here (the rest of the preferences like: "Enhance picture quality"; "Divide two-paged images"; "Recognize page orientation", you may setup from the FineReader app prefs)
			
			using terms from application "FineReader"
				set langList to {Russian, English} -- set recognizing languages (take it here: https://abbyy.technology/en:products:fre:win:v11:languages)
				set PdfLayout to text under image -- one from: {page image; text and pictures; text over image; text under image}
				set saveType to same files as source
				set CreateOutlineboolean to yes -- Whether to generate a Table of Contents automatically
				set UseMRCboolean to yes -- Helps to keep file sizes very small and clear.
				set KeepPageNumberHeadersAndFootersBoolean to yes -- Yes or No to keep Page numbers, Headers and Footers
				set EnablePDFTaggingboolean to yes --Yes or No to save PDF tags
				set KeepTextandBackgroundColorsboolean to yes -- Yes or No to keep the background and text colours
				set EmbedFontsboolean to yes -- Whether to embed fonts or not
				set KeepPicturesboolean to yes -- Yes or No to keep pictures in the OCRed document
				set ImageQuality to high quality -- Page image quality. One from: {balanced quality; high quality; low quality}
				set PageSize to automatic -- pick one from here: https://gist.github.com/dmgig/5e30cedc17e4458ef2dd52ffb6c552c7
			end using terms from
			
			-- *********************************************
			
			try
				set theNumber to 0
				repeat with theRecord in theRecords
					step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
					
					set theName to (filename of theRecord) as string
					if cancelled progress then exit repeat
					set theType to type of theRecord
					if theType is PDF document then
						
						set oldName to theName & "_old"
						set name of theRecord to oldName
						set inPath to path of theRecord
						
						-- *********************************************
						-- Set Up Your Temporary Folder Here ("outPath" - the folder where FineReader will create a recognized file, it will be deleted after import):
						
						set outPath to "/Users/ilya/Documents/00_Temp/" & theName
						
						-- *********************************************
						
						set theNumber to theNumber + 1
						tell application "FineReader"
							
							repeat until is finereader controller active
								delay 1
							end repeat
							
							export to pdf outPath from file inPath ¬
								ocr languages enum langList ¬
								export mode PdfLayout ¬
								saving type saveType ¬
								create outline CreateOutlineboolean ¬
								use mrc UseMRCboolean ¬
								keep page numbers headers and footers KeepPageNumberHeadersAndFootersBoolean ¬
								enable pdf tagging EnablePDFTaggingboolean ¬
								keep text and background colors KeepTextandBackgroundColorsboolean ¬
								embed fonts EmbedFontsboolean ¬
								keep pictures KeepPicturesboolean ¬
								image quality ImageQuality ¬
								page size PageSize
							
							set isBusy to true
							
							repeat until isBusy is false
								delay 1
								set isBusy to (is busy) as boolean
							end repeat
							
						end tell
						
						delay 1
						
						try
							set theParents to parents of theRecord
							set thePDF to import outPath to (item 1 of theParents)
							
							-- *********************************************
							-- In this section of the script you can set up options to clone the properties of the OCRed copy. Turn them off if you don't want them to clone (with comment symbol "--")
							
							-- Restoring replicants
							repeat with i from 2 to (count of theParents)
								replicate record thePDF to (item i of theParents)
							end repeat
							
							set addition date of thePDF to addition date of theRecord -- Same "Added" date
							set aliases of thePDF to aliases of theRecord -- Same "Aliases"
							set altitude of thePDF to altitude of theRecord -- Same "Altitude"
							set attached script of thePDF to attached script of theRecord -- Same "Script" attached 
							set comment of thePDF to comment of theRecord -- Same "Finder Comments"
							set creation date of thePDF to creation date of theRecord -- Same "Created" date
							set exclude from classification of thePDF to exclude from classification of theRecord
							set exclude from search of thePDF to exclude from search of theRecord
							set exclude from see also of thePDF to exclude from see also of theRecord
							set exclude from tagging of thePDF to exclude from tagging of theRecord
							set label of thePDF to label of theRecord -- Same "Label"
							set latitude of thePDF to latitude of theRecord -- Same "Latitude"
							set locking of thePDF to locking of theRecord -- Same "Locked/Unlocked" state
							set longitude of thePDF to longitude of theRecord -- Same "Longitude"
							set meta data of thePDF to meta data of theRecord -- Same PDF meta data
							set rating of thePDF to rating of theRecord -- Same "Rating"
							set state of thePDF to state of theRecord -- Same "State/Flag"
							set tags of thePDF to tags of theRecord -- Same "Tags"
							set URL of thePDF to URL of theRecord -- Same "URL" field
							
							try
								set custom meta data of thePDF to custom meta data of theRecord -- Cloning Custom meta data, if not empty
							end try
							try
								set annotation of thePDF to annotation of theRecord -- setting up the same Annotation, if not empty
							end try
							try
								set reminder of thePDF to reminder of theRecord -- setting up the same Reminder, if not empty
							end try
							
							-- *********************************************
							
							delete record theRecord
							
						end try
						
						tell application "Finder" to delete outPath as POSIX file
						
					else
						display dialog "File: " & theName & " is not a PDF file" with title "Not a PDF" buttons {"Skip File", "Stop Script"} default button "Skip File" with icon caution giving up after 5
						if the gave up of the result is true or button returned of the result is "Skip File" then
							set theNumber to theNumber + 1
						else
							exit repeat
						end if
					end if
					if cancelled progress then exit repeat
				end repeat
			on error error_message number error_number
				if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
			end try
			hide progress indicator
		else
			display dialog "Select PDF files in DEVONthink." with title "No selection" with icon caution buttons {"Cancel", "OK"} default button "OK"
		end if
	end tell
	
	tell application "FineReader"
		repeat until is finereader controller active
			delay 1
		end repeat
		quit
	end tell
	
end performSmartRule

sawxray · November 9, 2019, 3:54am

Hi, Silverstone-

I look forward to following your instructions and trying your script. You have obviously put a lot of thought into this.

I do have a standalone FineReader 12, and have been struggling to get it to work. In Catalina, it no longer functions in a workflow, so I have had to try creating a folder action, and using Hazel to rename and move it through several destinations. This process is not consistent, and I haven’t figured it out yet.

So your script looks like it might be my golden solution.

I am scanning with a new Fujitsu ScanSnap iX1500. Should I scan to a Finder folder, or is there a way to scan to DT3 (I have the Pro version), and letting the smart rule find the new files? Sorry, I haven’t used smart rules before.

Thanks!

sawxray · November 9, 2019, 4:51am

To follow up, I set up a smart rule, made a copy of your script, and saved it to the folder as directed. I imported a pdf into the designated folder, saw the name change, saw FineReader start up, but then got the error message:

DEVONthink

Finder got an error: Handler can’t handle objects of this class.

I will try to narrow down the part of the script that is causing the error.

sawxray · November 9, 2019, 7:16am

I finally found the issue for my setup:

I usually do NOT show filename extensions in DT3. When I changed preferences to show filename extensions, the script worked exactly as intended.

It looks like your awesome script will allow me to scan files in as I used to, using the improved FR12 engine. I don’t know if there’s a way to modify the name before the extension, but I will play around, and post my findings.

Thanks!
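
For anyone hitting the same Finder error: one possible guard, sketched under the assumption that the failure comes from the temporary file name lacking a «.pdf» suffix when extensions are hidden (the thread does not confirm this as the cause). `ensurePDFExtension` is a hypothetical helper, not part of the original script, and the record name is a placeholder:

```applescript
-- Hypothetical helper: append ".pdf" only if the name does not already end with it.
-- AppleScript string comparisons ignore case by default, so ".PDF" also counts.
on ensurePDFExtension(baseName)
	if baseName ends with ".pdf" then return baseName
	return baseName & ".pdf"
end ensurePDFExtension

-- Same temporary folder as in the script above; the record name is made up
set outPath to "/Users/ilya/Documents/00_Temp/" & ensurePDFExtension("Invoice 2019")
```

With a guard like this the output file name would not depend on whether DT3 shows filename extensions.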

Silverstone · November 12, 2019, 11:20am

I’m glad you’ve found it useful.
Some tips for a better experience:

When you rename a PDF in DT, delete the extension from the name. By default, when you click on a PDF to rename it, DT selects the file name without the extension. If you leave the extension in, you may get “name.pdf.pdf”.
Instead of the line:

delete record theRecord

use these two lines:

move record theRecord to (trash group of database of theRecord)
set state of thePDF to true

This gives you more control over automatic rule runs:

It flags all automatically OCRed items, so you can easily find them and check whether your settings were right
If you want to re-OCR, you can always find the original in the Trash and re-OCR it with other settings (with the “delete” command the original is permanently deleted)
So the general rule is: no Trash purge without checking all the flags first )

With this you can set your rule to run automatically (daily, weekly or “on import”)

Silverstone · November 12, 2019, 11:31am

And one more thing… ))

If you change the smart rule script, reload DT.
I have another script which lets you choose all these settings in a dialog (so you don’t have to edit the script each time), but it will not run on Catalina, because it uses scripting additions… (

So, just use the smart rule version for the main flow of your scans, and make a standalone version for specific cases (paper size, image quality, MRC*).

*MRC blurs bar codes; turn it off if you want to preserve bar codes.
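
As a sketch of what such a standalone variant might change for documents with bar codes, using only parameter names and values that already appear in the script above (it needs FineReader installed to compile, and the remaining settings would stay as they are):

```applescript
-- Preference block for a bar-code variant of the script (sketch only)
using terms from application "FineReader"
	set UseMRCboolean to no -- MRC off, so bar codes are not blurred
	set ImageQuality to high quality -- one of: balanced quality, high quality, low quality
end using terms from
```

Expect larger files with MRC off, since the earlier posts credit MRC for the small file sizes.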

sawxray · November 14, 2019, 12:01am

Thank you, Silverstone.

Would you be so kind as to clarify the script edit? Is there a particular setting in which the new script should be used? I have already set my script to run automatically, but using a smart script where Extension is PDF, Word Count is 0, and Date Created is This Hour.

That works for me, as the script runs on pdf’s that are newly imported, but not on pdf’s that are already present in the inbox. Does your updated script address the same issue, or an issue I am just not getting at the moment?

Thanks again!