Many of us have thousands of PDFs from different sources, and they keep arriving. Often we are not sure about the recognition quality, but it matters a lot, because DEVONthink uses the text of our PDFs for its AI functions.
Recently @ngan posted a script to assess OCR quality manually. But the question is always whether a whole batch of files can be processed automatically. Below is a script I compiled that tries to do this.
General Info
The script analyses the text of the selected PDFs, currently making two checks for each PDF:
For “illegal symbols” (non-text symbols, often as a result of a bad text layer for multiple reasons)
For spelling (usually as a result of bad image quality or wrong recognition language)
If a PDF fails either of these checks, the script writes the Custom metadata “Poor OCR” to it and flags it. The script also checks whether the PDF has any words at all and writes the Custom metadata “No Text” if it doesn’t.
For the spellcheck it uses the free utility Aspell. The current setup is for English, German, French, Spanish and Russian. The script is made for Smart Rule use in DEVONthink 3.
Setting up
Install the free spellcheck utility Aspell. There are a variety of ways; the easiest for me was Homebrew: $ brew install aspell
Set up a Custom metadata field “OCR Status” with at least the values “No Text” and “Poor OCR”. You may change the names, but be sure to do the same in the script.
Save the script below and copy it to the folder with all your Smart Rule scripts (usually “/Users/You/Library/Application Scripts/com.devon-technologies.think3/Smart Rules”).
Create a Smart Rule. Condition: Extension is “PDF Document”, and choose the script you just saved. I use the action “On Import”.
Script
-- Script analyses the text of selected PDFs making currently 2 checks for each PDF:
-- 1. For "illegal symbols" (non-text symbols, often as a result of a bad text layer for multiple reasons)
-- 2. For spelling (often as a result of bad image quality or wrong recognition language)
-- If a PDF fails either of these checks, the script writes the Custom metadata "Poor OCR" to it and flags it
-- For the spellcheck it uses the free utility Aspell
-- Current setup is for English, German, French, Spanish and Russian languages
--Script is made for Smart Rule use in DEVONthink 3
-- Created by Silverstone on 17.03.2020
-- Version 1.0
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
--Use to Debug
--tell application id "DNtp"
-- set theDocs to get selection
-- my performSmartRule(theDocs)
--end tell
on performSmartRule(theRecords)
tell application id "DNtp"
if (count of theRecords) > 0 then
show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
set theNumber to 0
repeat with theRecord in theRecords
set WordsPDF to word count of theRecord
if WordsPDF > 0 then
step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
--Setting the Sample
set SampleWords to 200 --Indicate a quantity of words in a sample for checks
-- Getting text from PDF according to a given sample size
set PDFtext to plain text of theRecord
set AllWords to count of words in PDFtext
if SampleWords > AllWords then set SampleWords to AllWords
set theText to words 1 thru SampleWords of PDFtext
set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, " "}
set theText to theText as text
set AppleScript's text item delimiters to tid
-- Setting Legal Symbols. Insert more symbols which you want to allow
set LegalSymbols to id of "0123456789" & ¬
id of "!=@#$%&*(){}[]|^-+~`'?></_№,.«» " & ¬
id of "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" & ¬
id of "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя" & ¬
id of "áéíóúÉòàèòùâêîôûäïöüÄÖÜëçßæÆåÅ"
-- Calculating the "Illegal Symbols Ratio"
set SymbolsList to characters of theText
set TotalText to 0
set TotalNontext to 0
repeat with theSymbol in SymbolsList
if id of theSymbol is in LegalSymbols then
set TotalText to TotalText + 1
else
set TotalNontext to TotalNontext + 1
end if
end repeat
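set IllegalSymbolsRatio to TotalNontext / (TotalText + TotalNontext) -- illegal symbols / all symbols; also avoids dividing by zero when no symbol is legal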
if IllegalSymbolsRatio < 0.05 then --Setup the value for "Illegal Symbols Ratio" you want
-- Adjust the path below to the folder where Aspell is installed
set theSpelling to do shell script "echo " & quoted form of theText & " |
/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 |
/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d fr |
/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d de |
/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d es |
/usr/local/Cellar/aspell/0.60.8/bin/aspell list --encoding=utf-8 -d ru"
set theSpelling to count (paragraphs of theSpelling)
set theSpellingRatio to theSpelling / SampleWords
if theSpellingRatio is greater than 0.1 then -- Setup the value for the "Spelling Ratio" you want
add custom meta data "Poor OCR" for "ocrstatus" to theRecord
set state of theRecord to true
end if
else
add custom meta data "Poor OCR" for "ocrstatus" to theRecord
set state of theRecord to true
end if
else
add custom meta data "No Text" for "ocrstatus" to theRecord
end if
set theNumber to theNumber + 1 -- count each record exactly once for the progress text
end repeat
end if
hide progress indicator
end tell
end performSmartRule
Tweaking the Script
You may want to change the following things (feel free):
The sample of words for checks. Default is 200. With this, all checks for one PDF take about 2-4 seconds. A bigger sample gives a better check but takes more time.
Legal symbols. You may paste any symbols you want right into the script to allow them as valid.
“Illegal Symbol Ratio” (illegal symbols/all symbols). Set the value you think appropriate. Default is 0.05 (i.e. 5%).
Spellcheck languages. See the list of languages supported by Aspell. Add the appropriate line to the script for each language you want (don’t forget to correct the folder where aspell resides). Currently each PDF is spellchecked with English, German, French, Spanish and Russian dictionaries. The more languages, the less strict the check will be.
“Spelling Ratio” (misspelled words/all words). Set the value you think appropriate. Default is 0.1 (i.e. 10%).
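For anyone who wants to play with the two thresholds outside DEVONthink, here is a minimal Python sketch of the decision logic described above. It is not the AppleScript itself: the names `LEGAL`, `illegal_symbol_ratio` and `ocr_status` are made up for illustration, the legal set is only an approximation of the script's list, and the misspelled-word count is assumed to come from a spellchecker such as Aspell.

```python
import string

# Assumed stand-in for the script's LegalSymbols list; extend it as needed.
LEGAL = set(string.ascii_letters + string.digits + string.punctuation + " «»№")


def illegal_symbol_ratio(text, legal=LEGAL):
    """Share of characters in `text` that are not in the allowed set."""
    if not text:
        return 0.0
    illegal = sum(1 for ch in text if ch not in legal)
    return illegal / len(text)


def ocr_status(text, misspelled, total_words,
               max_illegal=0.05, max_misspelled=0.1):
    """Return "No Text", "Poor OCR" or None (nothing to report)."""
    if total_words == 0:
        return "No Text"
    # Check 1: too many non-text symbols in the sample.
    if illegal_symbol_ratio(text) >= max_illegal:
        return "Poor OCR"
    # Check 2: too many misspelled words in the sample.
    if misspelled / total_words > max_misspelled:
        return "Poor OCR"
    return None
```

Raising `max_illegal` or `max_misspelled` makes the check more forgiving, lowering them makes it stricter.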
I dropped the “Strange symbols” check, because it is largely duplicated by the spellcheck: if there are non-text symbols, the spellcheck will catch them. This allowed me to enlarge the checking sample and speed up the process.
The standard sample size is now 100 000 characters, which is very representative.
Added error handling: if the script runs into a spellcheck problem, it offers to retry with a “safe sample” (10 000 characters); if you refuse, it skips the file. The default (also used if the dialog times out) is to try this smaller sample. If the problem persists, the script offers to flag the item for manual resolution.
I dropped the qualitative assessment like “Poor OCR”. Instead, the script writes a single integer for each PDF, representing a measure of its OCR quality. This is the ratio of “good words” to “all words” in a given sample, expressed per mille (from 0 to 1000) and written to the appropriate Custom metadata. The higher the value, the better the PDF’s OCR quality. You may use this as a column in any PDF view to sort the list, or in any Smart Rule to filter PDFs with “bad text”.
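To make the scoring and the fallback concrete, here is a rough Python sketch of the same idea. It is only an illustration under assumptions: `count_misspelled` is a hypothetical callable standing in for the Aspell pipeline, and the function names are mine, not part of the script.

```python
def quality_per_mille(total_words, misspelled):
    """Ratio of good words to all words, as an integer from 0 to 1000."""
    if total_words <= 0:
        raise ValueError("sample contains no words")
    score = round((total_words - misspelled) * 1000 / total_words)
    # Clamp, because a spellchecker may tokenize differently and report
    # more "misspelled" items than the word count of the sample.
    return max(0, min(1000, score))


def score_with_fallback(text, count_misspelled,
                        sample=100_000, safe_sample=10_000):
    """Try the full sample first; on any failure retry with the safe one.

    `count_misspelled` is a hypothetical callable returning the number of
    misspelled words in a string. Returns None if both attempts fail, so
    the caller can flag the item for manual checking.
    """
    for size in (sample, safe_sample):
        chunk = text[:size]
        try:
            return quality_per_mille(len(chunk.split()),
                                     count_misspelled(chunk))
        except Exception:
            # Deliberately broad for the sketch: any failure means "retry".
            continue
    return None
```

For example, a sample of 200 words with 20 misspelled ones gives `quality_per_mille(200, 20) == 900`.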
Conjugative spellcheck notion
By this I mean the behavior of a spellchecker that uses multiple language dictionaries in parallel: if a given word is valid in any one of the languages, it counts as having passed the whole spellcheck. Thus, the more languages you choose for such a spellcheck, the fewer mistakes you will potentially find; on the other hand, it means you can spellcheck a text that contains all these languages at once.
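A toy Python sketch of this notion, in case it helps. It does not call Aspell; the two dictionaries are tiny made-up sets, and `misspelled_in_all` is a name I invented. Each pass drops the words one dictionary knows, in the same way each `aspell list` stage in the script passes only its unknown words on to the next stage.

```python
def misspelled_in_all(words, dictionaries):
    """Keep only the words that no dictionary accepts."""
    remaining = list(words)
    for known in dictionaries:
        # Words this language knows are accepted and drop out of the list.
        remaining = [w for w in remaining if w.lower() not in known]
    return remaining


# Tiny hypothetical dictionaries, for illustration only.
english = {"the", "house", "is", "red"}
german = {"das", "haus", "ist", "rot"}

words = "the Haus is rot xqzv".split()
print(misspelled_in_all(words, [english, german]))  # ['xqzv']
```

With only the English set, "Haus" and "rot" would also be reported, which is why adding languages makes the check less strict.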
New script
-- Script analyses the text of selected PDFs making a conjugative spellcheck for each PDF
-- The result is a single integer for each PDF: the ratio of "good" words to all words
-- in a given sample, expressed per mille (from 0 to 1000) and written to the appropriate
-- Custom metadata. The higher the value, the better the PDF's OCR quality.
-- For the spellcheck the script uses the free utility Aspell
-- Current setup is for English, German, French, Spanish and Russian languages
-- Script is made for Smart Rule use in DEVONthink 3
-- Created by Silverstone on 18.03.2020
-- Version 1.1
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
--Use to Debug
--tell application id "DNtp"
-- set theDocs to get selection
-- my performSmartRule(theDocs)
--end tell
on performSmartRule(theRecords)
tell application id "DNtp"
if (count of theRecords) > 0 then
show progress indicator "Checking OCR quality…" steps (count of theRecords) with cancel button
set theNumber to 0
repeat with theRecord in theRecords
step progress indicator "(" & (theNumber + 1) & " of " & (count of theRecords) & "): " & ((name of theRecord) as string)
set WordsPDF to word count of theRecord
if WordsPDF > 0 then
set SampleChars to 100000 --Sample of Characters to use for major spellcheck
set theAspel to "/usr/local/Cellar/aspell/0.60.8/bin/aspell" -- Location of your Aspell executable
-- Getting text from PDF according to a given sample size
set PDFtext to plain text of theRecord
set AllChars to count of PDFtext
if AllChars > 0 then
if AllChars ≤ SampleChars then set SampleChars to AllChars
set PDFtext to (texts 1 thru SampleChars of PDFtext)
set AllWords to count of words in PDFtext
try
--Spellchecking. Add additional languages here if you need
set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
theAspel & " list --encoding=utf-8 --normalize | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d ru"
set theSpelling to count (paragraphs of theSpelling)
set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
if theSpellingRatio < 0 then set theSpellingRatio to 0
add custom meta data theSpellingRatio for "ocrquality" to theRecord
on error errMsg number errNum
display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & linefeed & "Do you want to try spellcheck it with a safe sample (10000 chars)?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 20
if button returned of the result is "Yes" or gave up of the result is true then
set SampleChars to 10000
set AllChars to count of PDFtext
if AllChars ≤ SampleChars then set SampleChars to AllChars
set PDFtext to (texts 1 thru SampleChars of PDFtext)
set AllWords to count of words in PDFtext
try
--Add additional languages here if you need
set theSpelling to do shell script "echo " & quoted form of PDFtext & " | " & ¬
theAspel & " list --encoding=utf-8 --normalize | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d fr | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d de | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d es | " & ¬
theAspel & " list --encoding=utf-8 --normalize -d ru"
set theSpelling to count (paragraphs of theSpelling)
set theSpellingRatio to round ((AllWords - theSpelling) * 1000 / AllWords)
add custom meta data theSpellingRatio for "ocrquality" to theRecord
on error errMsg number errNum
display dialog "Spelling for the file: " & linefeed & "'" & filename of theRecord & "'" & linefeed & "caused an error (" & errNum & ": " & errMsg & ")" & linefeed & "File will be skipped" & linefeed & linefeed & "Do you want to flag it for manual check?" with title "Spellcheck Error" buttons {"Yes", "No"} default button "Yes" giving up after 10
if button returned of the result is "Yes" or gave up of the result is true then set state of theRecord to true
end try
end if
end try
else
add custom meta data "No Text" for "ocrstatus" to theRecord
end if
else
add custom meta data "No Text" for "ocrstatus" to theRecord
end if
set theNumber to theNumber + 1
end repeat
end if
hide progress indicator
end tell
end performSmartRule
The use of a dictionary inspires another thought: perhaps DT could consider/advise on the feasibility/value of running/caching the concordance through a dictionary (local or multilingual) at the back end? I suspect that the uniqueness weightings of the concordance can be negatively affected by bad-quality OCR, or by random outliers in relatively good-quality OCR. I can only speak for the academic papers I use: it is quite common to see that a small portion of the top-weighted words in the concordance are truncated words (e.g. “Instituti onal” instead of “institutional”, “embed dedness” instead of “embeddedness”, etc.; these are due to line breaks in some literature) or sticky words (e.g. a top-weighted word in the concordance of a well-OCRed PDF is “inordertoassess”). I suspect that this sort of uniqueness may render the See Also and Classify functions less effective (if their presence is statistically significant)? Or at least a spell-checked concordance could be an option…
I understand that the concept of dictionary-as-filter-of-concordance may not be applicable to some (many) types of material, so this is just a thought. Obviously, I am just guessing here and have no idea of the performance penalty of having that option.
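Just to illustrate what I mean by dictionary-as-filter, here is a toy Python sketch. It says nothing about how DEVONthink actually builds its concordance; `filtered_concordance` and the dictionary set are hypothetical, and the tokenizing is a plain whitespace split that ignores punctuation.

```python
from collections import Counter


def filtered_concordance(text, dictionary):
    """Count word frequencies, keeping only words the dictionary knows."""
    counts = Counter(w.lower() for w in text.split())
    return Counter({w: n for w, n in counts.items() if w in dictionary})


# Hypothetical mini dictionary, for illustration only.
known = {"in", "order", "to", "assess", "embed"}
sample = "in order to assess inordertoassess embed dedness"
print(sorted(filtered_concordance(sample, known)))
```

The sticky word and the truncated fragment simply drop out, so they can no longer end up among the top-weighted entries.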
Use $(brew --prefix) or an AppleScript equivalent, so that it works for prefixes other than the usual default that is currently hard-coded. (I use the default myself, but others might not.)
Change aSpel to aSpell: trivial, no change to how it works, but it’s a small aesthetic improvement.
I would do the first one myself even though I don’t need it, but I’m leery of tinkering in AppleScript.
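Not AppleScript, but here is the lookup idea for the first suggestion as a small Python sketch. The fallback directories are my assumption about common Homebrew prefixes, and `find_aspell` is a made-up helper name; `shutil.which` does the actual search.

```python
import os
import shutil

# Common Homebrew bin directories; treat these as assumptions about your machine.
FALLBACK_DIRS = ["/opt/homebrew/bin", "/usr/local/bin"]


def find_aspell(extra_dirs=FALLBACK_DIRS, path=None):
    """Return the full path of the aspell executable, or None if not found."""
    search = os.environ.get("PATH", "") if path is None else path
    # Fallbacks are appended because the PATH seen by a script launched from
    # a GUI app may not include the Homebrew bin directory.
    dirs = [d for d in search.split(os.pathsep) if d] + list(extra_dirs)
    return shutil.which("aspell", path=os.pathsep.join(dirs))


print(find_aspell())
```

The result could then replace the hard-coded Cellar path, which also breaks whenever the Aspell version number changes.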
The script checks the average word length, so it has to notice it. The only explanation that comes to my mind is that the estimate is made over a sample of pages.