Script to OCR PDFs with the latest FineReader

Ah, thank you. I hadn’t realized that once the settings are set up in my preferences they would also apply when using the script. This is very helpful! Appreciate you taking the time!

Thank you for this script! It’s the perfect solution to my problem: how to use my already existing OCR Finereader without having to pay for the less sophisticated Devonthink version.

Glad it is useful for you
Welcome to the forums

@GrantBarrett: Would you mind clarifying how the version of Finereader that DTPO uses is less sophisticated? I was looking into Finereader Pro 12, which I believe is one full version ahead of what is being used in DTPO as of DT3 beta 3, but it seems awfully expensive to purchase separately unless it’s providing something major that is lacking from the OCR engine that DTPO provides. Am I missing something about this? Reliable OCR is very important for my work so I am very interested if you have any thoughts to share about this! Thank you in advance.

FineReader could, for example, split two-sided scans of books into single pages for a more streamlined document and realign not too heavily crooked lines. As OP stated, the app can also automatically generate a ToC. However, the results I get with DT are usually excellent and entirely satisfactory for most of my needs; also in regards to speed. (Is the app really faster?)

With the FineReader app I actually encountered weird resizing and image-quality issues – I applied the appropriate settings – that I never ran into with DT. I don’t know why, but I’ll take it. If I didn’t have access to FineReader in another context, I wouldn’t miss it or use it. But perhaps there’s more sophistication to it that I’m just ignorant about.

DEVONthink 3 revision

What’s new in this version

  • Made specifically to use in Smart Rules (more on how to do it later)
  • Added new document properties to clone. OCRing a document means importing a new, recognized document with a new UUID, so every property of the «old» document has to be cloned to the new one. You can choose in the script exactly which properties to clone (more on this later).
  • Simplified Progress Bar

Why this script may be a better option

The built-in ABBYY FineReader engine is version 11 (which is much better than the previous built-in engine); I tested the script with version 12.1.13. My sample is fairly modest, so some of these conclusions are arguable and more evidence is needed.

  • Better picture quality along with a smaller resulting file size (this is still true in most cases, thanks to MRC)
  • Recognition is faster and the overall recognition quality is better (I still confirm this)
  • Automatic outline in the resulting PDF (a very useful feature). It can be turned off in the script if needed.
  • Correctly OCRs vertical text. The built-in engine cannot do this. It is especially important if you have both text orientations on one page (as in pivot tables).
  • Splits two-sided scans (e.g. book scans). The built-in engine has no such option. It can be turned off.
  • The OCRed document does not lose its Annotation or Reminder. If you OCR a document with the built-in OCR command after making an Annotation or setting a Reminder, you will lose them in the new OCRed document. With this script you choose whether to keep them.

Configuring a script

Making a Smart Rule

  • Copy this script to: «/Users/YOU/Library/Application Scripts/com.devon-technologies.think3/Smart Rules»
  • Create a Smart Rule. Select the files to search for, e.g. «Extension» is «PDF Document» AND «Word Count» is 0. Add any other conditions you want.
  • In the action section choose «Execute Script», then «External», and select the name of this script from the drop-down menu.

You are all set. Now select any files you want, «Option-click» them, and choose «Apply Rules» and then the name of the rule you created (or use the menu «Tools» > «Apply Rules»). This starts the process.

If you choose «Perform Rules» instead of «Apply Rules», DT will run the script on all files currently in your rule group (instead of the selected files), so be careful.
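
Since «Perform Rules» acts on everything in the rule group, a dry run can show which files would be processed before any OCR happens. Here is a minimal sketch, assuming DEVONthink 3's `log message` command behaves as I remember it; it is not part of the OCR script. It only writes the names of the PDFs it receives to the Log window and changes nothing. Save it as a separate script and attach it to the rule temporarily.

```applescript
-- Dry run: list the records a smart rule would pass to the OCR script.
-- Nothing is renamed, exported or deleted here.
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- The OCR script only handles PDFs, so only those are logged.
			if (type of theRecord) is PDF document then
				-- Assumption: "log message … info …" writes one line to the Log window.
				log message "Would OCR" info ((name of theRecord) as string)
			end if
		end repeat
	end tell
end performSmartRule
```

Once the Log window lists only the files you expect, switch the rule back to the OCR script.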

The Progress Bar is simplified: you will not see the individual recognition stages, only a text like «(1 of 3): Name of currently recognized file». If you want more detail on what’s going on, just switch to FineReader.

Options you can set in FineReader

Open FineReader Preferences and choose:

  • Enhance images (yes/no)
  • Split the «book-scan» (yes/no)
  • Detect page orientation (yes/no)

Options you can set in Script

Open this script in any script editor and find the sections described below (they are marked with comments).

FineReader Recognizing Preferences

Each parameter is described in a comment right in the code.

  • «LangList»: Recognition languages (see the language list linked in the script comment).
  • «PdfLayout»: PDF layout. One of: {page image; text and pictures; text over image; text under image}
  • «saveType»: How documents are saved. One of: {empty pages split files; same files as source; separate file for each page; single file}
  • «CreateOutlineboolean»: Whether to generate a Table of Contents automatically (yes/no)
  • «UseMRCboolean»: Whether to use MRC compression (yes/no). Helps keep file sizes very small and clear
  • «KeepPageNumberHeadersAndFootersBoolean»: Yes or No to keep page numbers, headers and footers
  • «EnablePDFTaggingboolean»: Yes or No to save PDF tags
  • «KeepTextandBackgroundColorsboolean»: Yes or No to keep the background and text colours
  • «EmbedFontsboolean»: Whether to embed fonts or not
  • «KeepPicturesboolean»: Yes or No to keep pictures in the OCRed document
  • «ImageQuality»: Page image quality. One of: {balanced quality; high quality; low quality}
  • «PageSize»: A fixed page size, or «automatic». The available sizes are listed at the link in the script comment.

Set up a temporary folder to use

This path will be used by FineReader to save an output file. This file will be removed after import to DEVONthink.
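
The path in the script is hard-coded to one user's folder, so it has to be edited before first use. As an alternative, here is a minimal sketch that builds the output path inside the per-user temporary folder. `tempOutputPath` is a hypothetical helper name, not something from the original script, and I have not tested it together with FineReader.

```applescript
-- Sketch: build the FineReader output path in the per-user temporary folder
-- instead of a hard-coded folder in someone's home directory.
-- tempOutputPath is a hypothetical helper, not part of the original script.
on tempOutputPath(theName)
	-- "path to temporary items" is a Standard Additions command;
	-- the POSIX path of a folder ends with "/".
	set tempFolder to POSIX path of (path to temporary items from user domain)
	return tempFolder & theName
end tempOutputPath

tempOutputPath("scan.pdf")
```

Inside the `tell application id "DNtp"` block the call would be `set outPath to my tempOutputPath(theName)`, because handlers called from a tell block need `my`.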

Setting up Clone Options

Each parameter is described in a comment. If you want to turn one off, just put the comment sign «--» at the start of its line.

  • «addition date»: Same “Added” date
  • «aliases»: Same “Aliases”
  • «altitude»: Same “Altitude”
  • «attached script»: Reattach the same script
  • «comment»: Clone “Finder Comments”
  • «creation date»: Same “Created” date
  • «exclude from classification»: Same boolean
  • «exclude from search»: Same boolean
  • «exclude from see also»: Same boolean
  • «exclude from tagging»: Same boolean
  • «label»: Same “Label”
  • «latitude»: Same “Latitude”
  • «locking»: Same “Locked/Unlocked” state
  • «longitude»: Same “Longitude”
  • «meta data»: Same PDF meta data
  • «rating»: Same “Rating”
  • «state»: Same “State/Flag”
  • «tags»: Same “Tags”
  • «URL»: Same “URL” field
  • «custom meta data»: Clone custom meta data if it is not empty

You can also choose to clone properties, which are not cloned using built-in OCR:

  • «annotation»: Reattach the same annotation to the OCRed document
  • «reminder»: Set the same reminder to the OCRed document

Script

Copy and save it in any script editor.

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Recognizing…" steps (count of theRecords) with cancel button
			
			-- *********************************************
			-- Set Up Your FineReader Recognizing Preferences Here (the rest of the preferences like: "Enhance picture quality"; "Divide two-paged images"; "Recognize page orientation", you may setup from the FineReader app prefs)
			
			using terms from application "FineReader"
				set langList to {Russian, English} -- set recognizing languages (take it here: https://abbyy.technology/en:products:fre:win:v11:languages)
				set PdfLayout to text under image -- one from: {page image; text and pictures; text over image; text under image}
				set saveType to same files as source
				set CreateOutlineboolean to yes -- Whether to automatically generate a Table of Contents
				set UseMRCboolean to yes -- Helps to keep file sizes very small and clear.
				set KeepPageNumberHeadersAndFootersBoolean to yes -- Yes or No to keep Page numbers, Headers and Footers
				set EnablePDFTaggingboolean to yes --Yes or No to save PDF tags
				set KeepTextandBackgroundColorsboolean to yes -- Yes or No to keep the background and text colours
				set EmbedFontsboolean to yes -- Whether to embed fonts or not
				set KeepPicturesboolean to yes -- Yes or No to keep pictures in the OCRed document
				set ImageQuality to high quality -- Page image quality. One from: {balanced quality; high quality; low quality}
				set PageSize to automatic -- pick one from here: https://gist.github.com/dmgig/5e30cedc17e4458ef2dd52ffb6c552c7
			end using terms from
			
			-- *********************************************
			
			try
				set theNumber to 0
				repeat with theRecord in theRecords
					step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
					
					set theName to (filename of theRecord) as string
					if cancelled progress then exit repeat
					set theType to type of theRecord
					if theType is PDF document then
						
						set oldName to theName & "_old"
						set name of theRecord to oldName
						set inPath to path of theRecord
						
						-- *********************************************
						-- Set Up Your Temporary Folder Here ("outPath" - the folder where FineReader will create a recognized file, it will be deleted after import):
						
						set outPath to "/Users/ilya/Documents/00_Temp/" & theName
						
						-- *********************************************
						
						set theNumber to theNumber + 1
						tell application "FineReader"
							
							repeat until is finereader controller active
								delay 1
							end repeat
							
							export to pdf outPath from file inPath ¬
								ocr languages enum langList ¬
								export mode PdfLayout ¬
								saving type saveType ¬
								create outline CreateOutlineboolean ¬
								use mrc UseMRCboolean ¬
								keep page numbers headers and footers KeepPageNumberHeadersAndFootersBoolean ¬
								enable pdf tagging EnablePDFTaggingboolean ¬
								keep text and background colors KeepTextandBackgroundColorsboolean ¬
								embed fonts EmbedFontsboolean ¬
								keep pictures KeepPicturesboolean ¬
								image quality ImageQuality ¬
								page size PageSize
							
							set isBusy to true
							
							repeat until isBusy is false
								delay 1
								set isBusy to (is busy) as boolean
							end repeat
							
						end tell
						
						delay 1
						
						try
							set theParents to parents of theRecord
							set thePDF to import outPath to (item 1 of theParents)
							
							-- *********************************************
							-- In this section of the script you can set up options to clone the properties of the OCRed copy. Turn them off if you don't want them to clone (with comment symbol "--")
							
							-- Restoring replicants
							repeat with i from 2 to (count of theParents)
								replicate record thePDF to (item i of theParents)
							end repeat
							
							set addition date of thePDF to addition date of theRecord -- Same "Added" date
							set aliases of thePDF to aliases of theRecord -- Same "Aliases"
							set altitude of thePDF to altitude of theRecord -- Same "Altitude"
							set attached script of thePDF to attached script of theRecord -- Same "Script" attached 
							set comment of thePDF to comment of theRecord -- Same "Finder Comments"
							set creation date of thePDF to creation date of theRecord -- Same "Created" date
							set exclude from classification of thePDF to exclude from classification of theRecord
							set exclude from search of thePDF to exclude from search of theRecord
							set exclude from see also of thePDF to exclude from see also of theRecord
							set exclude from tagging of thePDF to exclude from tagging of theRecord
							set label of thePDF to label of theRecord -- Same "Label"
							set latitude of thePDF to latitude of theRecord -- Same "Latitude"
							set locking of thePDF to locking of theRecord -- Same "Locked/Unlocked" state
							set longitude of thePDF to longitude of theRecord -- Same "Longitude"
							set meta data of thePDF to meta data of theRecord -- Same PDF meta data
							set rating of thePDF to rating of theRecord -- Same "Rating"
							set state of thePDF to state of theRecord -- Same "State/Flag"
							set tags of thePDF to tags of theRecord -- Same "Tags"
							set URL of thePDF to URL of theRecord -- Same "URL" field
							
							try
								set custom meta data of thePDF to custom meta data of theRecord -- Cloning Custom meta data, if not empty
							end try
							try
								set annotation of thePDF to annotation of theRecord -- setting up the same Annotation, if not empty
							end try
							try
								set reminder of thePDF to reminder of theRecord -- setting up the same Reminder, if not empty
							end try
							
							-- *********************************************
							
							delete record theRecord
							
						end try
						
						tell application "Finder" to delete outPath as POSIX file
						
					else
						display dialog "File: " & theName & " is not a PDF file" with title "Not a PDF" buttons {"Skip File", "Stop Script"} default button "Skip File" with icon caution giving up after 5
						if the gave up of the result is true or button returned of the result is "Skip File" then
							set theNumber to theNumber + 1
						else
							exit repeat
						end if
					end if
					if cancelled progress then exit repeat
				end repeat
			on error error_message number error_number
				if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
			end try
			hide progress indicator
		else
			display dialog "Select PDF files in DEVONthink." with title "No selection" with icon caution buttons {"Cancel", "OK"} default button "OK"
		end if
	end tell
	
	tell application "FineReader"
		repeat until is finereader controller active
			delay 1
		end repeat
		quit
	end tell
	
end performSmartRule

Hi, Silverstone-

I look forward to following your instructions and trying your script. You have obviously put a lot of thought into this.

I do have a standalone FineReader 12, and have been struggling to get it to work. In Catalina, it no longer functions in a workflow, so I have had to try creating a folder action and using Hazel to rename and move files through several destinations. This process is not consistent, and I haven’t figured it out yet.

So your script looks like it might be my golden solution.

I am scanning with a new Fujitsu ScanSnap iX1500. Should I scan to a Finder folder, or is there a way to scan to DT3 (I have the Pro version) and let the smart rule find the new files? Sorry, I haven’t used smart rules before.

Thanks!

To follow up, I set up a smart rule, made a copy of your script, and saved it to the folder as directed. I imported a pdf into the designated folder, saw the name change, saw FineReader start up, but then got the error message:

DevonThink

Finder got an error: Handler can’t handle objects of this class.

I will try to narrow down the part of the script that is causing the error.

I finally found the issue for my setup:

I usually do NOT show filename extensions in DT3. When I changed preferences to show filename extensions, the script worked exactly as intended.

It looks like your awesome script will allow me to scan files in as I used to, using the improved FR12 engine. I don’t know if there’s a way to modify the name before the extension, but I will play around, and post my findings.

Thanks!

I’m glad you’ve found it useful for you.
Some tips for better experience:

  • When you rename any PDF in DT, delete the extension from the name. By default, when you click on a PDF to rename it, DT selects the file name without the extension. If you leave the extension in, you may get a “name.pdf.pdf”.
  • Instead of the line:
delete record theRecord

use these two lines:

move record theRecord to (trash group of database of theRecord)
set state of thePDF to true

It’ll give you more control over automatic rule runs:

  • It will flag all automatically OCRed items so you can easily find them and check if your settings were right
  • If you decide you would like to re-OCR a file, you can always find the original in the Trash folder and re-OCR it with other settings (with the “delete” command the original is permanently deleted).
    So the general rule is: never empty the Trash without checking all the flags first.

With this you can set your rule to run automatically (daily, weekly or “on import”).

And one more thing…

If you change the smart rule script, restart DT.
I have another script which lets you choose all these settings in a dialog (so you don't have to edit the script each time), but it will not run on Catalina, because it uses scripting additions.

So just use the smart rule version for the bulk of your scans, and make a standalone version for specific cases (paper size, image quality, MRC*).

*MRC blurs bar codes; turn it off if you want to preserve bar codes.

Thank you, Silverstone.

Would you be so kind as to clarify the script edit? Is there a particular setting in which the new script should be used? I have already set my script to run automatically, using a smart rule where Extension is PDF, Word Count is 0, and Date Created is This Hour.

That works for me, as the script runs on PDFs that are newly imported, but not on PDFs that are already present in the inbox. Does your updated script address the same issue, or an issue I am just not getting at the moment?

Thanks again!

Do you mean the script or the rule?
In the script you just need to find one line of code, in any script editor, and replace it with the two lines given above.
What this change means: if you keep the “delete” version, the script deletes the original and there is no way to undo the OCR, which matters if you are not satisfied with the results. If you replace “delete” with “move”, you can find the original scan in the Trash folder after OCR, move it back and re-OCR it if needed. It also flags PDFs after OCR, so you can easily see what this automation did to your files while you were away drinking coffee. That’s it.

In the rule, I think your settings are good enough.

Meanwhile, instead of drinking coffee you can set up reminders for the scan and make annotations, and be sure you will not lose them when the OCR is done. The built-in OCR function currently does not carry over reminders and annotations.

Thanks for the script. The file is converted as pdf.pdf, which I understand cannot be changed.
What is strange is that after the scan by FineReader I cannot mark text in the converted PDF: it is a block where I cannot mark a single word, only the whole page. Can anyone explain what happens?
I am using the trial period and set langList to {German, English}.

  1. About the “pdf.pdf” issue. When you rename a PDF item (with the option “show extensions” turned on), DT3 suggests renaming only the base name: it selects the name without “.pdf”. If you change only the selected text, you get “changed name.pdf.pdf” as the “filename” property, and the “Name” property reads “changed name.pdf” instead of just “changed name”. How to avoid it: (1) delete the extension every time you rename; (2) turn off the option “show extensions”.

  2. About the “blocks”. This may not be a script issue but a matter of recognition parameters. Try changing them. You may want to OCR the PDF manually with FR and see what you get.
    Did you change the LangList in the script to your languages?
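
For the “pdf.pdf” point, the doubled extension could also be avoided in a script by dropping a trailing “.pdf” from a name before it is reused. This is a minimal sketch, not part of the script above; `stripPDFExtension` is a hypothetical helper name, and where exactly to call it depends on your setup.

```applescript
-- Sketch: remove one trailing ".pdf" from a name, if present.
-- stripPDFExtension is a hypothetical helper, not part of the original script.
on stripPDFExtension(theName)
	-- AppleScript string comparison ignores case by default,
	-- so ".PDF" is matched as well.
	if theName ends with ".pdf" and (length of theName) > 4 then
		return text 1 thru -5 of theName
	end if
	return theName
end stripPDFExtension

stripPDFExtension("changed name.pdf") -- "changed name"
```

Names without the extension are returned unchanged, so the helper gives the same result whether or not “show extensions” is turned on.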

Thanks for the reply, I will try to OCR manually and see what happens. Yes, I changed the language. Watching the OCR run, I see it moving through the text areas perfectly; it is only when it is done that I cannot mark words or sentences.

You are right, it has nothing to do with your script. Really strange: after recognition I can copy the text and paste it into TextEdit, for example, but after export I cannot mark anything. I will have to write to FineReader support, as I cannot find this problem on Google.

Found it: it’s Catalina’s Preview that cannot select the text. Acrobat Reader does not have this problem.

This may help. Not having Catalina, I can’t comment further.

its Catalina Preview that can not select the text. Acrobat reader does not have this problem

Bear in mind, Adobe created the PDF format and do things their own way. Preview uses Apple’s PDFKit (as does DEVONthink), so the behavior can definitely vary.