Hi all-
I get a lot of PDFs from online databases (JSTOR, SAGE, EBSCO) that in theory are OCR’ed scans of the original print journals. I’ve found that this is not always the case, and that fact is disguised on import to DTPO because the PDF’s citation page is text. So these are rightfully typed as “PDF+Text,” even if the scanned pages behind the citation page were never OCR’ed.
I wrote a small script that goes through the current database looking for content of type “PDF+Text” and, based on a number you enter, creates a numerically sorted list in the Inbox of the files with FEWER THAN that number of words. It’s a sort of roadmap for where to start looking for PDFs that need to be OCR’ed.
The script might not be perfectly suitable for anyone else’s needs, but it does have some examples of shelling out to make a temp file and of using UNIX sort within an AppleScript.
Enjoy! Charles
-- Ask for the threshold; records with fewer words than this are listed.
set dialogReply to display dialog "How Many Words Less Than?" default answer "1000"
set wordLimit to (text returned of dialogReply) as integer

-- Collect one "N words: name" line per matching record.
set theLog to ""
tell application "DEVONthink Pro"
	set contentList to every content of current database
	repeat with theRecord in contentList
		if (kind of theRecord is equal to "PDF+Text") and (word count of theRecord is less than wordLimit) then
			set theLog to theLog & (word count of theRecord) & " words: " & (name of theRecord) & linefeed
		end if
	end repeat
end tell

-- Write the log to a temp file and close it before sort reads it.
set theFilename to do shell script "mktemp"
set fileRef to open for access POSIX file theFilename with write permission
write theLog to fileRef as «class utf8»
close access fileRef

-- Sort numerically on the leading word count, then delete the temp file.
set theLog to do shell script "sort -n " & quoted form of theFilename & "; rm -f " & quoted form of theFilename

tell application "DEVONthink Pro"
	create record with {name:"Low Word Count.log", type:txt, plain text:theLog}
end tell
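
If you want to try the shell half on its own, here is a rough sketch of the temp file plus sort -n step that you can paste into Terminal. The function name sort_word_counts and the sample record names are made up for illustration; the only real pieces are mktemp, sort -n and rm, which is all the script hands to the shell.

```shell
# Hypothetical helper (not part of the script above): sorts "N words: name"
# lines numerically by way of a temp file, like the mktemp + sort -n steps.
sort_word_counts() {
    tmpfile=$(mktemp) || return 1
    # Write the unsorted log, passed as the first argument, to the temp file.
    printf '%s\n' "$1" > "$tmpfile"
    # -n compares the leading number as a number, so 9 sorts before 120.
    sort -n "$tmpfile"
    # Clean up the temp file once sort has printed its output.
    rm -f "$tmpfile"
}

# Made-up sample lines in the same format the script builds.
sort_word_counts "850 words: Smith 1998
9 words: Jones 2004
120 words: Lee 2011"
```

A plain sort without -n would compare the lines as text and put "120 words" ahead of "9 words", which is why the numeric flag matters for this kind of list.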