Finding PDFs that need to be OCR'ed

cturner · March 4, 2009, 5:36pm

Hi all-

I get a lot of PDFs from online databases (JSTOR, SAGE, EBSCO) that in theory are OCR’ed scans of the original print journals. I’ve found that this is not always the case, and that fact is disguised on import to DTPO because the PDF’s citation page is text. So DTPO rightfully types these as “PDF+Text,” even when the scanned article pages themselves have never been OCR’ed.

I wrote a small script that goes through the current database looking for content of type “PDF+Text” and, based on a number you enter, creates a numerically sorted list in the Inbox of the files with FEWER THAN that many words. It’s a sort of roadmap for where to start looking for PDFs that need to be OCR’ed.

The script might not be perfectly suitable for anyone else’s needs, but it does have some examples of shelling out to make a temp file and of using UNIX sort from within an AppleScript.

Enjoy! Charles


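-- Ask for the word-count threshold.
set dialogResult to display dialog "How Many Words Less Than?" default answer "1000"
set wordCount to (text returned of dialogResult) as integer

-- Collect one line per matching document: "<count> words: <name>".
set theLog to ""
tell application "DEVONthink Pro"
	set contentList to every content of current database
	repeat with i in contentList
		if (word count of i is less than wordCount) and (kind of i is equal to "PDF+Text") then
			set theLog to theLog & (word count of i) & " words: " & (name of i) & linefeed
		end if
	end repeat
end tell

-- Write the list to a temp file as UTF-8 so non-ASCII names survive the trip through the shell.
-- Close the file before sorting so that sort reads the finished file.
set theFilename to do shell script "mktemp"
set fileRef to open for access POSIX file theFilename with write permission
write theLog to fileRef as «class utf8»
close access fileRef

-- Sort numerically on the leading word count, then remove the temp file.
set theLog to do shell script "sort -n " & quoted form of theFilename & "; rm -f " & quoted form of theFilename

-- Store the sorted list as a plain text record.
tell application "DEVONthink Pro"
	create record with {name:"Low Word Count.log", type:txt, plain text:theLog}
end tell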
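
If you want to try the temp file and sort step on its own, here is a rough sketch you can paste into Terminal. The three sample lines and their document names are made up for illustration. They just follow the "<count> words: <name>" format the script builds.

```shell
# Make a temp file and fill it with sample lines (hypothetical names).
tmpfile=$(mktemp)
printf '%s\n' \
  '950 words: Article C' \
  '85 words: Article A' \
  '300 words: Article B' > "$tmpfile"

# -n compares the leading number numerically. A plain sort would
# compare character by character and put 300 before 85.
sorted=$(sort -n "$tmpfile")
printf '%s\n' "$sorted"

# Clean up the temp file when done.
rm -f "$tmpfile"
```

The shortest documents come out at the top, which is where the un-OCR'ed scans are most likely to be.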

cturner · March 4, 2009, 5:40pm

Hahaha!

I see that I could have done this with a Smart Group. Anyway, some fun scripting…

Blush

Charles