Checking OCR Quality

Many of us have thousands of PDFs from different sources, and more keep arriving. Often we cannot be sure of the recognition quality, yet it matters a great deal, because DEVONthink relies on the text of our PDFs for its AI functions.

Recently @ngan posted a script to assess OCR quality manually. The natural next question is whether this can be automated for batches of files. Below is a script I compiled that tries to do this.

General Info

The script analyses the text of the selected PDFs and currently makes two checks for each PDF:

  1. For “illegal symbols” (non-text symbols, often as a result of a bad text layer for multiple reasons)
  2. For spelling (usually as a result of bad image quality or wrong recognition language)

If a PDF fails either of these checks, the script writes the Custom metadata “Poor OCR” to it and flags it. The script also checks whether the PDF has any words at all and writes the Custom metadata “No Text” if it has none.
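
As a rough illustration of the two checks (not the script itself, which follows below in AppleScript), the decision can be sketched in Python. The names `LEGAL`, `illegal_ratio` and `ocr_status` are made up for this sketch, the allowed character set is deliberately reduced, and the number of misspelled words is passed in rather than computed:

```python
import string

# Assumption: a reduced set of allowed characters, for illustration only;
# the script below also allows Cyrillic and accented letters.
LEGAL = set(string.ascii_letters + string.digits + string.punctuation + " ")


def illegal_ratio(text):
    """Share of characters in `text` that are not in LEGAL (0.0 for empty text)."""
    if not text:
        return 0.0
    illegal = sum(1 for ch in text if ch not in LEGAL)
    return illegal / len(text)


def ocr_status(text, misspelled, sample_words, max_illegal=0.05, max_misspelled=0.1):
    """Return "No Text", "Poor OCR" or None (nothing to report)."""
    if sample_words == 0:
        return "No Text"
    # Check 1: too many non-text symbols
    if illegal_ratio(text) >= max_illegal:
        return "Poor OCR"
    # Check 2: too many misspelled words in the sample
    if misspelled / sample_words > max_misspelled:
        return "Poor OCR"
    return None


print(ocr_status("plain readable text", misspelled=0, sample_words=3))
print(ocr_status("t\u25a1\u25a1t \u25a1\u25a1", misspelled=0, sample_words=2))
```

The first call reports nothing, while the second is rejected by the symbol check alone, so the spellcheck never has to run for it.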

For the spellcheck it uses the free utility Aspell. The current setup covers English, German, French, Spanish and Russian. The script is made for use in a Smart Rule in DEVONthink 3.

Setting up

  1. Install the free spellcheck utility Aspell. There are a variety of ways; the easiest one I found is Homebrew: $ brew install aspell
  2. Set up a Custom metadata field “OCR Status” with at least the values “No Text” and “Poor OCR”. You may change the names, but be sure to make the same change in the script
  3. Save the script below and copy it to the folder with all your Smart Rule scripts (usually “/Users/You/Library/Application Scripts/com.devon-technologies.think3/Smart Rules”)
  4. Create a Smart Rule. Condition: Extension is “PDF Document”; as the action, choose the script you saved. I run it “On Import”.
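
The script hard-codes the full path to the aspell executable, so it helps to know where your installation actually put it. A minimal sketch in Python, assuming it is run from a Terminal session whose PATH includes your Homebrew directory (the helper name `find_aspell` is made up for this example):

```python
import shutil


def find_aspell(candidates=("aspell",)):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_aspell() or "aspell not found on PATH")
```

Whatever path this prints is what belongs in the script. The full path is needed because `do shell script` runs with a minimal PATH that normally does not include the Homebrew directories.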

Script

-- The script analyses the text of the selected PDFs, currently making 2 checks for each PDF: 
--	1. For "illegal symbols" (non-text symbols, often as a result of a bad text layer for multiple reasons)
--	2. For spelling (often as a result of bad image quality or wrong recognition language)
-- If a PDF fails either of these checks, the script writes the Custom metadata "Poor OCR" to it and flags it
-- For the spellcheck it uses the free utility Aspell
-- The current setup is for English, German, French, Spanish and Russian
-- The script is made for Smart Rule use in DEVONthink 3

-- Created by Silverstone on 17.03.2020
-- Version 1.0

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--Use to Debug
--tell application id "DNtp"
--	set theDocs to get selection
--	my performSmartRule(theDocs)
--end tell

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
			set theNumber to 0
			repeat with theRecord in theRecords
				set WordsPDF to word count of theRecord
				if WordsPDF > 0 then
					step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
					
					--Setting the Sample
					set SampleWords to 200 --Indicate a quantity of words in a sample for checks
					
					-- Getting text from PDF according to a given sample size
					set PDFtext to plain text of theRecord
					set AllWords to count of words in PDFtext
					if SampleWords > AllWords then set SampleWords to AllWords
					set theText to words 1 thru SampleWords of PDFtext
					set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, " "}
					set theText to theText as text
					set AppleScript's text item delimiters to tid
					
					-- Setting Legal Symbols. Insert any further symbols you want to allow
					set LegalSymbols to id of "0123456789" & ¬
						id of "!=@#$%&*(){}[]|^-+~`'?></_№,.«» " & ¬
						id of "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" & ¬
						id of "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя" & ¬
						id of "áéíóúÉòàèòùâêîôûäïöüÄÖÜëçßæÆåÅ"
					
					-- Calculating the "Illegal Symbols Ratio"
					set SymbolsList to characters of theText
					set TotalText to 0
					set TotalNontext to 0
					repeat with theSymbol in SymbolsList
						if id of theSymbol is in LegalSymbols then
							set TotalText to TotalText + 1
						else
							set TotalNontext to TotalNontext + 1
						end if
					end repeat
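					-- Ratio of illegal symbols to all symbols in the sample (also avoids dividing by zero when no symbol is legal)
					set IllegalSymbolsRatio to TotalNontext / (TotalText + TotalNontext)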
					
					if IllegalSymbolsRatio < 0.05 then --Setup the value for "Illegal Symbols Ratio" you want
						
						--Change this path to wherever Aspell is installed on your Mac
						set theAspell to "/usr/local/Cellar/aspell/0.60.8/bin/aspell"
						--Each stage passes on only the words it does not know, so what remains is unknown in all five languages
						set theSpelling to do shell script "echo " & quoted form of theText & " | " & ¬
							theAspell & " list --encoding=utf-8 | " & ¬
							theAspell & " list --encoding=utf-8 -d fr | " & ¬
							theAspell & " list --encoding=utf-8 -d de | " & ¬
							theAspell & " list --encoding=utf-8 -d es | " & ¬
							theAspell & " list --encoding=utf-8 -d ru"
						set theSpelling to count (paragraphs of theSpelling)
						set theSpellingRatio to theSpelling / SampleWords
						
						if theSpellingRatio is greater than 0.1 then -- Set the "Spelling Ratio" threshold you want
							add custom meta data "Poor OCR" for "ocrstatus" to theRecord
							set state of theRecord to true
						end if
					else
						add custom meta data "Poor OCR" for "ocrstatus" to theRecord
						set state of theRecord to true
					end if
				else
					add custom meta data "No Text" for "ocrstatus" to theRecord
				end if
				set theNumber to theNumber + 1 -- count each record exactly once
			end repeat
		end if
		hide progress indicator
	end tell
end performSmartRule

Tweaking the Script

You may want to change the following things (feel free):

  • The sample of words for the checks. The default is 200; with this, all checks for one PDF take about 2-4 seconds. A bigger sample gives a better check but takes more time
  • Legal symbols. You may paste any symbols you want directly into the script to allow them as valid
  • “Illegal Symbols Ratio” (illegal symbols/all symbols). Set the value you think appropriate. The default is 0.05 (i.e. 5%)
  • Spellcheck languages. See the list of languages supported by Aspell. Add the appropriate line to the script for each desired language (don’t forget to correct the path where aspell resides). Currently each PDF is spellchecked against English, German, French, Spanish and Russian dictionaries. The more languages, the less strict the check.
  • “Spelling Ratio” (misspelled words/all words). Set the value you think appropriate. The default is 0.1 (i.e. 10%)

That’s it. Enjoy!

Revision 1.1

The main changes are:

  • I dropped the “strange symbols” check, because the spellcheck largely duplicates it: if there are non-text symbols, the spellcheck will catch them. This allowed me to enlarge the sample and speed up the process.
  • The standard sample size is now 100 000 characters, which is very representative.
  • Added error handling: if the script runs into a spellcheck problem, it offers to retry with a “safe sample” (10 000 characters); if you decline, it skips the file. The default (also used if the dialog times out) is to retry with the smaller sample. If the problem persists, the script offers to flag the item for manual resolution.
  • I dropped qualitative labels like “Poor OCR”. Instead, the script writes a single integer for each PDF as a measure of its OCR quality: the ratio of “good words” to all words in the sample, expressed in per mille from 0 to 1000 and written to the appropriate Custom metadata. The higher the value, the better the PDF’s OCR quality. You can show this as a column in any PDF view to sort the list, or use it in any Smart Rule to filter PDFs with “bad text”.
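
To make the per-mille score concrete, here is the same arithmetic as a small Python sketch. The function name `ocr_quality` is made up for this example, and the clamping to 0..1000 is spelled out explicitly:

```python
def ocr_quality(all_words, misspelled):
    """Ratio of good words to all words in per mille, clamped to 0..1000."""
    if all_words <= 0:
        return None  # nothing to score
    score = round((all_words - misspelled) * 1000 / all_words)
    return max(0, min(1000, score))


print(ocr_quality(2000, 150))  # 925
print(ocr_quality(50, 60))     # 0
```

The second call shows why the clamp matters: the spellchecker may split words differently from the word counter, so it can report more misspellings than there are counted words.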

Conjugative spellcheck notion

By this I mean the behavior of a spellchecker that uses several language dictionaries in combination: if a word is valid in any one of the languages, it is accepted as having passed the whole spellcheck. Thus, the more languages you choose for such a spellcheck, the fewer mistakes you will potentially find; on the other hand, it lets you spellcheck a text that contains all these languages at once.
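
The idea can be sketched in Python with toy word lists standing in for the real dictionaries. Everything here (`DICTIONARIES`, `misspelled_in`, `conjugative_misspellings`) is invented for illustration; the actual script gets the same effect by piping the output of one `aspell list` into the next:

```python
# Toy word lists standing in for real dictionaries (assumption for illustration).
DICTIONARIES = {
    "en": {"the", "quality", "of", "text"},
    "de": {"die", "qualität", "des", "textes", "text"},
    "fr": {"la", "qualité", "du", "texte"},
}


def misspelled_in(words, vocabulary):
    """Keep only the words the vocabulary does not know, like one `aspell list` stage."""
    return [w for w in words if w.lower() not in vocabulary]


def conjugative_misspellings(words, languages):
    """Pass the rejects of each language on to the next one."""
    rejects = list(words)
    for lang in languages:
        rejects = misspelled_in(rejects, DICTIONARIES[lang])
    return rejects


words = "the Qualität du texte qwrtz".split()
print(conjugative_misspellings(words, ["en"]))              # ['Qualität', 'du', 'texte', 'qwrtz']
print(conjugative_misspellings(words, ["en", "de", "fr"]))  # ['qwrtz']
```

With English alone, every foreign word counts as a mistake; with all three lists only the real garbage remains. Each added language can only shrink the list of rejects.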

New script

-- The script analyses the text of the selected PDFs, making a conjugative spellcheck for each PDF
-- The result is a single integer for each PDF: the ratio of "good" words to all words 
-- in a given sample, expressed in per mille from 0 to 1000 and written to the appropriate 
-- Custom metadata. The higher the value, the better the PDF's OCR quality. 
-- For the spellcheck the script uses the free utility Aspell
-- The current setup is for English, German, French, Spanish and Russian
-- The script is made for Smart Rule use in DEVONthink 3

-- Created by Silverstone on 18.03.2020
-- Version 1.1

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--Use to Debug
--tell application id "DNtp"
--	set theDocs to get selection
--	my performSmartRule(theDocs)
--end tell

on performSmartRule(theRecords)
	tell application id "DNtp"
		if (count of theRecords) > 0 then
			show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
			set theNumber to 0
			repeat with theRecord in theRecords
				step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
				set WordsPDF to word count of theRecord
				if WordsPDF > 0 then
					
					set SampleChars to 100000 --Sample of Characters to use for major spellcheck
					set theAspel to "/usr/local/Cellar/aspell/0.60.8/bin/aspell" --Location of your Aspell executable
					
					-- Getting text from PDF according to a given sample size
					set PDFtext to plain text of theRecord
					set AllChars to count of PDFtext
					if AllChars > 0 then
						if AllChars ≤ SampleChars then set SampleChars to AllChars
						set PDFtext to (texts 1 thru SampleChars of PDFtext)
						set AllWords to count of words in PDFtext
						try
							--Spellchecking. Add additional languages here if you need
							set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
								theAspel & " list --encoding=utf-8 --normalize | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
								theAspel & " list --encoding=utf-8 --normalize -d ru"
							
							set theSpelling to count (paragraphs of theSpelling)
							set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
							if theSpellingRatio < 0 then set theSpellingRatio to 0
							add custom meta data theSpellingRatio for "ocrquality" to theRecord
							
						on error errMsg number errNum
							display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & linefeed & "Do you want to try spellcheck it with a safe sample (10000 chars)?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 20
							if button returned of the result is "Yes" or gave up of the result is true then
								set SampleChars to 10000
								set AllChars to count of PDFtext
								if AllChars ≤ SampleChars then set SampleChars to AllChars
								set PDFtext to (texts 1 thru SampleChars of PDFtext)
								set AllWords to count of words in PDFtext
								try
									--Add additional languages here if you need
									set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
										theAspel & " list --encoding=utf-8 --normalize | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
										theAspel & " list --encoding=utf-8 --normalize -d ru"
									
									set theSpelling to count (paragraphs of theSpelling)
									set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
									if theSpellingRatio < 0 then set theSpellingRatio to 0 -- same clamp as in the main pass
									add custom meta data theSpellingRatio for "ocrquality" to theRecord
									
								on error errMsg number errNum
									display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & "File will be skipped" & linefeed & linefeed & "Do you want to flag it for manual check?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 10
									if button returned of the result is "Yes" or gave up of the result is true then set state of theRecord to true
								end try
							end if
						end try
					else
						add custom meta data "No Text" for "ocrstatus" to theRecord
					end if
				else
					add custom meta data "No Text" for "ocrstatus" to theRecord
				end if
				set theNumber to theNumber + 1
			end repeat
		end if
		hide progress indicator
	end tell
end performSmartRule

To cgrunenberg:

The use of a dictionary inspires another thought: perhaps DT could consider or advise on the feasibility and value of running or caching the concordance through a dictionary (local or multilingual) at the back end? I suspect that the uniqueness weightings of the concordance can be negatively affected by bad-quality OCR, or by random outliers in relatively good-quality OCR. I can only speak for the academic papers I use: it is quite common to find that a small portion of the top-weighted words in the concordance are truncated words (e.g. “Instituti onal” instead of “institutional”, “embed dedness” instead of “embeddedness”; these come from the line breaks in some of the literature) or sticky words (e.g. a top-weighted word in the concordance of a well-OCRed PDF is “inordertoassess”). I suspect that this sort of uniqueness may render the See Also and Classify functions less effective (if its presence is statistically significant)? Or at least a spell-checked concordance could be an option…

I understand that the concept of dictionary-as-filter-of-concordance may not be applicable to some (many) types of material, so this is just a thought. Obviously, I am just guessing here and have no idea of the performance penalty of having that option.
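
Purely to illustrate what I mean by a dictionary as a filter (this says nothing about how DT builds its concordance internally, which I do not know), here is a toy sketch in Python. `KNOWN_WORDS` and `filtered_concordance` are hypothetical names, and the word list is tiny on purpose:

```python
from collections import Counter

# Hypothetical mini-dictionary; a real filter would use a full word list.
KNOWN_WORDS = {"institutional", "embeddedness", "in", "order", "to", "assess", "is", "key"}


def filtered_concordance(text, known_words):
    """Count words and split the counts into (kept, dropped) by dictionary membership."""
    counts = Counter(text.lower().split())
    kept = Counter({w: n for w, n in counts.items() if w in known_words})
    dropped = Counter({w: n for w, n in counts.items() if w not in known_words})
    return kept, dropped


kept, dropped = filtered_concordance(
    "Instituti onal embeddedness is key inordertoassess embed dedness", KNOWN_WORDS
)
print(sorted(kept))
print(sorted(dropped))
```

One caveat the toy list hides: a fragment such as “embed” is itself a valid word, so a real dictionary would let it through and the filter would only catch part of the problem.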

This can affect the results but the longer the documents are, the less important random OCR issues should be.

Thanks for the info.