Script (v1b2): Quickly check the quality of OCR of pdf file by 1st N words or by the selected text

ngan · March 15, 2020, 8:55pm

Why the script:
I have recently been working on the text content of some older PDF+Text files that were imported from other sources. I found that even if the “word count” column shows a certain number of words, that doesn’t mean the OCR quality is good. This script extracts the first N words from the plain-text content of the PDF file and shows them in a dialog. If the “Reveal” button is clicked, it reveals the document in the viewer window for further action (such as re-OCR).

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--ngan 2020.03.15

tell application id "DNtp"
	set {theDoc} to item 1 of {selection}
	
	if kind of theDoc is "PDF+Text" then
		set theChoice to button returned of (display dialog (name of theDoc as string) & return & return & my getFirstNWords(300, get plain text of theDoc) buttons {"Cancel", "Reveal"} default button "Cancel")
		
		if theChoice is "Reveal" then
			set the root of viewer window 1 to (parent 1 of theDoc)
			set selection of viewer window 1 to {theDoc}
			set the index of viewer window 1 to 1
		else
			return
		end if
	end if
end tell

on getFirstNWords(n, theText)
	set wdsInText to my min(n, count words of theText)
	set theText to words 1 thru wdsInText of theText
	set {tid, text item delimiters} to {text item delimiters, " "}
	set theText to theText as text
	set text item delimiters to tid
	return theText
end getFirstNWords

on min(x, y)
	if x ≤ y then
		return x
	else
		return y
	end if
end min

cgrunenberg · March 16, 2020, 9:55am

Thanks for the script! Just had an idea how the quality could be theoretically calculated via the concordance. Not sure how and where this could be useful or whether it would just confuse people.

ngan · March 16, 2020, 10:22am

That sounds very interesting. It may help to alleviate the frustration from not knowing why search can’t find what users “think” they should find (me as one of them in some cases).

I think it would help users identify poorly OCRed documents if they could use such an attribute to create a smart group, much like finding duplicate items. IMHO, a “rank of OCR quality” could also be offered as an optional column.

Silverstone · March 17, 2020, 7:02am

Known problem. I usually select some potentially “problem text” in such a document and press a hot key which copies it to the Sorter, then compare it to this PDF.

It would be great if there were an automatic method, trying to guess PDFs with a bad recognition quality, presenting us with a “probability bar” like in “See also” )

ngan · March 17, 2020, 7:29am

I was just about to post the updated script when @Silverstone mentioned this. Note that this is a DT-only script.

(1) If no text is selected, or when a pdf is selected in the viewer window, the first N words of plain text content are shown.
(2) If text is selected, the plain text of the selected text is shown.
(3) When the active window is a viewer window and no item is selected in it, the script enters config mode and the user can enter the number of words of plain text to show. I’m not sure what the maximum number of words a dialog box can show is (probably about 1000+ words?). The new word limit remains valid until DT is closed. When DT restarts, the word limit is reset to the default (not by design, but due to how DT loads a script into cache/memory).

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

--ngan 2020.03.15
--v1b2
property numWords : 500
tell application id "DNtp"
	
	
	if (count of selection) is 0 then
		set numWords to text returned of (display dialog "Enter number of words to see " default answer numWords) as integer
		
	else
		set {theDoc} to item 1 of {selection}
		set theCitedText to selected text of think window 1 as string
		
		if kind of theDoc is "PDF+Text" then
			
			if theCitedText is "" then
				set theText to my getFirstNWords(numWords, get plain text of theDoc)
			else
				set theText to theCitedText
			end if
			
			set theChoice to button returned of (display dialog (name of theDoc as string) & return & return & theText buttons {"Cancel", "Reveal"} default button "Cancel")
			
			if theChoice is "Reveal" then
				set the root of viewer window 1 to (parent 1 of theDoc)
				set selection of viewer window 1 to {theDoc}
				set the index of viewer window 1 to 1
			else
				return
			end if
		end if
		return
	end if
	
	
	
end tell

on getFirstNWords(n, theText)
	set wdsInText to my min(n, count words of theText)
	set theText to words 1 thru wdsInText of theText
	set {tid, text item delimiters} to {text item delimiters, " "}
	set theText to theText as text
	set text item delimiters to tid
	return theText
end getFirstNWords

on min(x, y)
	if x ≤ y then
		return x
	else
		return y
	end if
end min

Silverstone · March 17, 2020, 8:30am

If we are to automate this process, we need some verification rules so the script can assess the OCR quality automatically. I think this is the main difficulty here.

My hypothesis is that a poorly OCRed PDF contains more “strange symbols” in a given sample. So we need a checklist of such “strange symbols”, like:

symbols outside the “normal” Unicode ranges
rare letter combinations, or combinations impossible in normal language (like those used in automatic transliteration apps)
proportion of letters (text symbols) relative to all other characters
…?

The script could check all this, write a custom metadata value such as “Poor recognition”, and flag such PDFs so they can be verified manually.
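As a minimal sketch of the third check in the list above (the share of letters among all characters), something like the following could work. It is only an illustration under stated assumptions, not the script shared later in this thread: the handler names `letterShare` and `looksPoorlyOCRed` and the 0.7 threshold are hypothetical, and the check only counts unaccented Latin letters, so it would misjudge documents in other scripts or with many accented characters.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

-- Hypothetical threshold (an assumption, not a tested value):
-- samples with a smaller share of letters are flagged as suspicious.
property minLetterShare : 0.7

-- Returns the share (0.0 to 1.0) of non-whitespace characters
-- that are unaccented Latin letters.
on letterShare(theText)
	set alphabet to "abcdefghijklmnopqrstuvwxyz"
	set blanks to {space, tab, return, linefeed}
	set total to 0
	set hits to 0
	repeat with c in (characters of theText)
		set ch to contents of c
		if blanks does not contain ch then
			set total to total + 1
			-- Text comparison ignores case by default, so "A" matches "a".
			if alphabet contains ch then set hits to hits + 1
		end if
	end repeat
	if total is 0 then return 0.0
	return hits / total
end letterShare

-- True when the sample falls below the hypothetical threshold.
on looksPoorlyOCRed(theText)
	return (my letterShare(theText)) < minLetterShare
end looksPoorlyOCRed
```

Inside the DEVONthink tell block of the script above, the sample could come from the existing handler, for example `my looksPoorlyOCRed(my getFirstNWords(numWords, get plain text of theDoc))`. The threshold would need tuning against documents whose OCR quality is already known, since digits and punctuation also lower the share.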

ngan · March 17, 2020, 8:37am

I’ll leave it to the expert…

However, in my limited experience with perhaps 30-40 poorly OCRed PDFs, a quick visual check of the first few hundred words has given me 100% hits. This is not scientific evidence at all. But generally, if the OCR is of bad quality, it shows everywhere in the document, as long as the problem is the general OCR quality and not a specific issue (such as equations, tables, etc.). I am focused on overall text quality only.

Something like this is what I want to check before I start citing and taking notes.

Silverstone · March 18, 2020, 12:08am

Here it is, the smart rule script for automated batch processing.
Feel free to modify as you wish.

ngan · March 18, 2020, 4:46am

Just when I was expecting the expert (I was referring to DT…) to add OCR check in the future update … Thank you for sharing the script!