Checking OCR Quality

Silverstone · March 18, 2020, 12:03am

Many of us have thousands of PDFs from different sources, and more keep arriving. Often we are not sure about the recognition quality, yet it matters a great deal, because DEVONthink is intelligent software that uses the text of our PDFs for its AI functions.

Recently @ngan posted a script to assess OCR quality manually. But there is always the question of how to automate this for a batch of files, if possible. Below is a script I put together that tries to do this.

General Info

The script analyses the text of the selected PDFs and currently makes two checks on each PDF:

For “illegal symbols” (non-text symbols, often as a result of a bad text layer for multiple reasons)
For spelling (usually as a result of bad image quality or wrong recognition language)

If a PDF fails either of these checks, the script writes the Custom metadata “Poor OCR” to it and flags it. The script also checks whether the PDF has any words at all and writes the Custom metadata “No Text” if it has none.
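
A minimal sketch of the “illegal symbols” ratio described above, written as a shell function rather than in AppleScript. The helper name `illegal_ratio` and the ASCII-only allowed set are assumptions for illustration; the script below builds a much larger set that also covers Cyrillic and accented letters.

```shell
#!/bin/sh
# Sketch only: share of characters that fall outside an allowed set.
# illegal_ratio is a hypothetical helper, not part of the AppleScript.
# It counts bytes, so it is only meaningful for ASCII samples.
illegal_ratio() {
    text=$1
    # $(( ... )) strips the padding some wc implementations add
    total=$(( $(printf '%s' "$text" | wc -c) ))
    legal=$(( $(printf '%s' "$text" | LC_ALL=C tr -cd 'A-Za-z0-9 .,' | wc -c) ))
    # awk does the floating-point division; an empty sample yields 0.00
    awk -v t="$total" -v l="$legal" \
        'BEGIN { if (t == 0) print "0.00"; else printf "%.2f\n", (t - l) / t }'
}

illegal_ratio 'Hello, world.'   # 0.00
illegal_ratio 'H#llo~ w@rld'    # 0.25
```

With the default threshold of 0.05, the second sample would be treated as a bad text layer and the first would go on to the spellcheck.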

For the spellcheck it uses the free utility Aspell. The current setup is for English, German, French, Spanish and Russian. The script is made for Smart Rule use in DEVONthink 3.

Setting up

Install the free spellcheck utility Aspell. There are a variety of ways; the easiest one I found is Homebrew: $ brew install aspell
Set up a Custom metadata field “OCR Status” with at least the values “No Text” and “Poor OCR”. You may change the names, but be sure to make the same changes in the script.
Save the script below and copy it to the folder with all your Smart Rules scripts (usually “/Users/You/Library/Application Scripts/com.devon-technologies.think3/Smart Rules”).
Create a Smart Rule. Condition: Extension is “PDF Document”; then choose the saved script as the action. I run the rule “On Import”.

Script

-- The script analyses the text of the selected PDFs, currently making 2 checks on each PDF: 
--	1. For "illegal symbols" (non-text symbols, often as a result of a bad text layer for multiple reasons)
--	2. For spelling (often as a result of bad image quality or wrong recognition language)
-- If a PDF fails either of these checks, the script writes the Custom metadata "Poor OCR" to it and flags it
-- For the spellcheck it uses the free utility Aspell
-- The current setup is for English, German, French, Spanish and Russian
-- The script is made for Smart Rule use in DEVONthink 3

-- Created by Silverstone on 17.03.2020
-- Version 1.0

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--Use to Debug
--tell application id "DNtp"
--	set theDocs to get selection
--	my performSmartRule(theDocs)
--end tell

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
			set theNumber to 0
			repeat with theRecord in theRecords
				set WordsPDF to word count of theRecord
				if WordsPDF > 0 then
					step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
					
					--Setting the Sample
					set SampleWords to 200 --Indicate a quantity of words in a sample for checks
					
					-- Getting text from PDF according to a given sample size
					set PDFtext to plain text of theRecord
					set AllWords to count of words in PDFtext
					if SampleWords > AllWords then set SampleWords to AllWords
					set theText to words 1 thru SampleWords of PDFtext
					set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, " "}
					set theText to theText as text
					set AppleScript's text item delimiters to tid
					
					-- Setting Legal Symbols. Insert more symbols which you want to allow
					set LegalSymbols to id of "0123456789" & ¬
						id of "!=@#$%&*(){}[]|^-+~`'?></_№,.«» " & ¬
						id of "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" & ¬
						id of "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя" & ¬
						id of "áéíóúÉòàèòùâêîôûäïöüÄÖÜëçßæÆåÅ"
					
					-- Calculating the "Illegal Symbols Ratio"
					set SymbolsList to characters of theText
					set TotalText to 0
					set TotalNontext to 0
					repeat with theSymbol in SymbolsList
						if id of theSymbol is in LegalSymbols then
							set TotalText to TotalText + 1
						else
							set TotalNontext to TotalNontext + 1
						end if
					end repeat
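					-- Ratio of illegal symbols to all symbols; this denominator is never zero for a non-empty sample
					set IllegalSymbolsRatio to TotalNontext / (TotalText + TotalNontext)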
					
					if IllegalSymbolsRatio < 0.05 then --Setup the value for "Illegal Symbols Ratio" you want
						
						--Adjust the path below to the folder where Aspell is installed
						set theSpelling to do shell script "echo " & quoted form of theText & " | 
						/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 | 
						/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d fr | 
						/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d de | 
						/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d es | 
						/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d ru"
						set theSpelling to count (paragraphs of theSpelling)
						set theSpellingRatio to theSpelling / SampleWords
						
						if theSpellingRatio is greater than 0.1 then -- Setup the value for the "Spelling Ratio" you want
							add custom meta data "Poor OCR" for "ocrstatus" to theRecord
							set state of theRecord to true
						end if
					else
						add custom meta data "Poor OCR" for "ocrstatus" to theRecord
						set state of theRecord to true
					end if
				else
					add custom meta data "No Text" for "ocrstatus" to theRecord
				end if
				-- Count each record exactly once, whatever the outcome
				set theNumber to theNumber + 1
			end repeat
		end if
		hide progress indicator
	end tell
end performSmartRule

Tweaking the Script

You may want to change the following things (feel free):

The sample of words for the checks. The default is 200. With this, all checks for one PDF take about 2-4 seconds. A bigger sample gives a better check but takes more time.
Legal symbols. You may paste any symbols you want right into the script to allow them as valid.
“Illegal Symbols Ratio” (illegal symbols/all symbols). Set the value you think appropriate. The default is 0.05 (i.e. 5%).
Spellcheck languages. See the list of languages supported by Aspell. Add the appropriate line to the script for the desired language (don’t forget to correct the folder where aspell resides). Currently each PDF is spellchecked with the English, German, French, Spanish and Russian dictionaries. The more languages, the less strict the check.
“Spelling Ratio” (misspelled words/all words). Set the value you think appropriate. The default is 0.1 (i.e. 10%).

That’s it. Enjoy!

Silverstone · March 18, 2020, 5:48pm

Revision 1.1

The main changes are:

I dropped the “illegal symbols” check, simply because the spellcheck largely duplicates it: any non-text symbols will be caught by the spellcheck. This allowed me to enlarge the sample and speed up the process.
The standard sample size is now 100,000 characters, which is very representative.
Added error handling: if the script runs into a spellcheck problem, it offers to retry with a “safe sample” (10,000 characters); if you refuse, it skips the file. The default (also used if the dialog times out) is to try this smaller sample. If the problem persists, the script offers to flag the item for manual resolution.
I dropped qualitative assessments like “Poor OCR”. Instead, the script writes a single integer for each PDF as a measure of its OCR quality. This is the ratio of “good words” to “all words” in a given sample, expressed per mille from 0 to 1000 and written to the appropriate Custom metadata. The higher the value, the better the PDF’s OCR quality. You may use this as a column in any PDF view to sort the list, or in any Smart Rule to filter PDFs with “bad text”.
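
As a worked example of the score, here is a small shell sketch of the same arithmetic: (all words - misspelled words) * 1000 / all words, rounded and clamped at 0. The helper name `per_mille_score` is hypothetical, and rounding at exact halves may differ from AppleScript's `round`.

```shell
#!/bin/sh
# Sketch only: per_mille_score MISSPELLED TOTAL prints an integer 0..1000.
# per_mille_score is a hypothetical helper mirroring the formula in the script.
per_mille_score() {
    awk -v bad="$1" -v all="$2" 'BEGIN {
        if (all <= 0) { print 0; exit }               # no words, no score
        score = int((all - bad) * 1000 / all + 0.5)   # round to nearest
        if (score < 0) score = 0                      # clamp, as the script does
        print score
    }'
}

per_mille_score 25 500   # 950
per_mille_score 3 7      # 571
```

The clamp matters when the spellchecker reports more tokens than the word count used as the denominator, which would otherwise give a negative score.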

Conjugative spellcheck notion

This refers to the behavior of a spellchecker that uses multiple language dictionaries in parallel: if a given word is valid in any one of the languages, it is accepted as having passed the whole spellcheck. Thus, the more languages you choose for such a spellcheck, the fewer mistakes you will potentially find; on the other hand, it means you can spellcheck a text containing all these languages at once.
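
The chaining can be illustrated without Aspell installed. In the sketch below, `unknown_words` is a hypothetical stand-in for one `aspell list -d LANG` stage: it reads words on standard input and prints only those missing from a word list. The tiny word lists are made up for the example. Piping the stages together leaves only the words that no language accepts.

```shell
#!/bin/sh
# Sketch only: unknown_words DICT_FILE prints the stdin lines that are not
# in DICT_FILE. It is a hypothetical stand-in for one "aspell list" stage.
unknown_words() {
    # -F fixed strings, -x whole-line match, -v invert; grep exits 1 when
    # nothing is left, which is not an error here
    grep -v -x -F -f "$1" || true
}

dir=$(mktemp -d)
printf '%s\n' the house is red > "$dir/en.txt"
printf '%s\n' das haus ist rot > "$dir/de.txt"

# Two English words, two German words and one OCR garbage token
printf '%s\n' the house das haus xq7zz |
    unknown_words "$dir/en.txt" |
    unknown_words "$dir/de.txt"
# prints: xq7zz

rm -r "$dir"
```

Removing a stage makes the check stricter: with only the English stage, `das` and `haus` would also be counted as misspelled.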

New script

-- The script analyses the text of the selected PDFs, making a conjugative spellcheck on each PDF
-- The result is a single integer for each PDF: the ratio of "good" words to all words 
-- in a given sample, expressed per mille from 0 to 1000 and written to the appropriate 
-- Custom metadata. The higher the value, the better the PDF's OCR quality. 
-- For the spellcheck the script uses the free utility Aspell
-- The current setup is for English, German, French, Spanish and Russian
-- The script is made for Smart Rule use in DEVONthink 3

-- Created by Silverstone on 18.03.2020
-- Version 1.1

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--Use to Debug
--tell application id "DNtp"
--	set theDocs to get selection
--	my performSmartRule(theDocs)
--end tell

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
			set theNumber to 0
			repeat with theRecord in theRecords
				step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
				set WordsPDF to word count of theRecord
				if WordsPDF > 0 then
					
					set SampleChars to 100000 --Sample of Characters to use for major spellcheck
					set theAspel to "/usr/local/Cellar/aspell/0.60.8/bin/aspell" --Location of your Aspell executable
					
					-- Getting text from PDF according to a given sample size
					set PDFtext to plain text of theRecord
					set AllChars to count of PDFtext
					if AllChars > 0 then
						if AllChars ≤ SampleChars then set SampleChars to AllChars
						set PDFtext to (texts 1 thru SampleChars of PDFtext)
						set AllWords to count of words in PDFtext
						try
							--Spellchecking. Add additional languages here if you need
							set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
								theAspel & " list --encoding=utf-8 --normalize | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d ru"
							
							set theSpelling to count (paragraphs of theSpelling)
							set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
							if theSpellingRatio < 0 then set theSpellingRatio to 0
							add custom meta data theSpellingRatio for "ocrquality" to theRecord
							
						on error errMsg number errNum
							display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & linefeed & "Do you want to try spellcheck it with a safe sample (10000 chars)?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 20
							if button returned of the result is "Yes" or gave up of the result is true then
								set SampleChars to 10000
								set AllChars to count of PDFtext
								if AllChars ≤ SampleChars then set SampleChars to AllChars
								set PDFtext to (texts 1 thru SampleChars of PDFtext)
								set AllWords to count of words in PDFtext
								try
									--Add additional languages here if you need
									set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
										theAspel & " list --encoding=utf-8 --normalize | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d ru"
									
									set theSpelling to count (paragraphs of theSpelling)
									set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
									if theSpellingRatio < 0 then set theSpellingRatio to 0
									add custom meta data theSpellingRatio for "ocrquality" to theRecord
									
								on error errMsg number errNum
									display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & "File will be skipped" & linefeed & linefeed & "Do you want to flag it for manual check?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 10
									if button returned of the result is "Yes" or gave up of the result is true then set state of theRecord to true
								end try
							end if
						end try
					else
						add custom meta data "No Text" for "ocrstatus" to theRecord
					end if
				else
					add custom meta data "No Text" for "ocrstatus" to theRecord
				end if
				set theNumber to theNumber + 1
			end repeat
		end if
		hide progress indicator
	end tell
end performSmartRule

ngan · March 20, 2020, 12:32pm

To cgrunenberg:

The use of a dictionary inspires another thought: perhaps DT could consider/advise on the feasibility/value of running/caching the concordance through a dictionary (local or multilingual) at the back end? I suspect that the uniqueness weightings of the concordance can be negatively affected by bad-quality OCR, or by random outliers in relatively good-quality OCR. I can only speak for the academic papers I use: it is quite common to see that a small portion of the top-weighted words in the concordance are truncated words (e.g. “Instituti onal” instead of “institutional”, “embed dedness” instead of “embeddedness”, etc.; these are due to the line breaks in some literature) or sticky words (e.g. a top-weighted word in the concordance of a well-OCRed PDF is “inordertoassess”). I suspect that this sort of uniqueness may render the See Also and Classify functions less effective (if its presence is statistically significant)? Or at least a spell-checked concordance could be an option…

I understand that the concept of dictionary-as-filter-of-concordance may not be applicable to some (many) types of material, so this is just a thought. Obviously, I am just guessing here and have no idea of the performance penalty of having that option.

cgrunenberg · March 20, 2020, 3:11pm

This can affect the results but the longer the documents are, the less important random OCR issues should be.

ngan · March 20, 2020, 5:45pm

Thanks for the info.

Antoine · March 28, 2023, 2:44pm

Thanks for the solution. Still, what can you do about bad-quality OCR?

Silverstone · May 30, 2023, 10:13am

Auto re-OCR, based on given quality score

gtackett · July 28, 2023, 6:56pm

A couple of minor suggestions for your script:

Use $(brew --prefix) or an AppleScript equivalent, so that it works for prefixes other than the usual default one that’s currently hard-coded. (I use the default myself, but others might not.)
Change aSpel to aSpell: trivial, no change to how it works, but it’s a small aesthetic improvement.

I would do the first one myself even though I don’t need it, but I’m leery of tinkering in AppleScript.

gtackett · July 28, 2023, 7:52pm

Another suggestion to consider:

Don’t hard-code a specific version of aspell into the path?
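
One possible way to do both, sketched as a shell function whose output could be captured with `do shell script`. The helper name `find_aspell` is hypothetical. It relies on Homebrew linking `aspell` into `bin` under its prefix, so no version number appears in the path, and it tries the two common default prefixes explicitly because `do shell script` runs with a minimal PATH that usually does not include Homebrew.

```shell
#!/bin/sh
# Sketch only: find_aspell prints the first usable aspell path, or fails.
# find_aspell is a hypothetical helper; extra candidate paths may be passed
# as arguments and are tried before the built-in defaults.
find_aspell() {
    # 1. Whatever is already on PATH
    if command -v aspell >/dev/null 2>&1; then
        command -v aspell
        return 0
    fi
    # 2. The bin directory under Homebrew's prefix, if brew itself is on PATH
    if command -v brew >/dev/null 2>&1; then
        set -- "$@" "$(brew --prefix)/bin/aspell"
    fi
    # 3. Common default prefixes (Apple Silicon, then Intel)
    for candidate in "$@" /opt/homebrew/bin/aspell /usr/local/bin/aspell; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

find_aspell || echo "aspell not found" >&2
```

The printed path could then replace the hard-coded value of `theAspel` in the script.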

gtackett · July 28, 2023, 7:59pm

I don’t understand AppleScript well enough to figure this out:

If I remove the commands that run Aspell on other languages (in my case, other than English) what is the effect on how the OCR quality is computed?

Dellu · July 29, 2023, 8:05pm

Dear @Silverstone, thank you for this script. It is incredibly useful.

AW2307 · July 30, 2023, 6:14am

Just discovered this. Very useful!

Silverstone · July 30, 2023, 8:42am

Glad you like it

Mindstormer · December 20, 2023, 4:22pm

When installed with Homebrew, aspell is now located in a new directory, so some may need to update the script:

/opt/homebrew/Cellar/aspell/0.60.8.1/bin/aspell

I tried scanning a dissertation with spaces between every letter, and got a higher score than the re-ocr’d copy that was fixed.

A r i s t o t l e d i s a g r e e s w i t h E mp e d o c l e s = File Score: 986
Arristotle disagrees with Empedocles = File Score: 792

So I guess it can’t see typos/issues with individual letters.
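
A likely reason, stated here as an assumption rather than something verified against this script: Aspell has an `ignore` option that skips very short words, so a text split into single letters produces almost no misspellings. An average word length check would catch this case. A minimal sketch, with `avg_word_length` as a hypothetical helper:

```shell
#!/bin/sh
# Sketch only: avg_word_length prints the mean length of the
# whitespace-separated tokens on stdin, with one decimal.
# avg_word_length is a hypothetical helper, not part of the posted script.
avg_word_length() {
    awk '{ for (i = 1; i <= NF; i++) { chars += length($i); words++ } }
         END { if (words == 0) print "0.0"; else printf "%.1f\n", chars / words }'
}

# The two samples from this post
printf '%s\n' 'A r i s t o t l e d i s a g r e e s w i t h E mp e d o c l e s' | avg_word_length   # close to 1
printf '%s\n' 'Arristotle disagrees with Empedocles' | avg_word_length                            # about 8
```

A sample whose average falls far below what is normal for running text could be flagged regardless of its spelling score.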

Silverstone · December 20, 2023, 5:58pm

The script checks the average word length, so it should notice this. The only explanation that comes to mind is that the estimate is made over a sample of pages.

Mindstormer · December 20, 2023, 6:21pm

Yeah, I guess there’s not much I can do there.

Is there a way to have it start in the middle of a document? This could help situations where poorly-ocr’d files begin with 1-10 perfect cover pages.

Silverstone · December 20, 2023, 6:56pm

As far as I remember, the script does take the sample from the middle of the document.
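
For reference, the version 1.1 listing earlier in this thread takes its sample from the start of the text (`texts 1 thru SampleChars`). Taking it from the middle instead could look like the sketch below; `middle_sample` is a hypothetical helper, and character counting depends on the awk implementation for non-ASCII text.

```shell
#!/bin/sh
# Sketch only: middle_sample N prints N characters from the middle of the
# text on stdin (newlines are flattened to spaces first).
# middle_sample is a hypothetical helper, not part of the posted script.
middle_sample() {
    tr '\n' ' ' | awk -v n="$1" '{
        len = length($0)
        if (n >= len) { print; exit }      # text shorter than the sample
        start = int((len - n) / 2) + 1     # awk strings are 1-indexed
        print substr($0, start, n)
    }'
}

printf '%s' 'aaaaaBODYTEXTzzzzz' | middle_sample 8   # BODYTEXT
```

This skips clean cover pages at the front of a document as long as the sample is clearly smaller than the whole text.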