I have a large number of PDFs (5000+) that originate from OCR’d newspaper clippings. The first line of text in each PDF contains the title of the article.
To rename these files into a more human-readable scheme, I would like to copy the first line of text in each PDF file (alternatively the first X characters) and append this string to the existing filename.
This seems to work for me. The if is there to bypass blank lines I’ve found in some of my documents. Ideally, I’d do a regexp or something, but I don’t know how to do that in AppleScript.
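tell application "DEVONthink Pro"
    set selectionList to selection
    repeat with i in selectionList
        -- Walk the paragraphs until one has more than 2 characters (skips blank lines)
        repeat with aParagraph in (paragraphs of (rich text of i))
            if ((count of characters of aParagraph) > 2) then
                -- Note: this replaces the name; to append instead, something like
                -- (name of i) & " " & (aParagraph as text) should work
                set name of i to aParagraph as text
                exit repeat
            end if
        end repeat
    end repeat
end tell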
And FWIW, if you wanted to screen out a set of predictably uninteresting first lines, you could list regexes describing them at the start of the script, and use something broadly along the lines of:
property plstJunkLines : {"^Sign in$", "^Register$", "^larger$", "^smaller$", ¬
    "^(Dear|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", ¬
    "^Thank you", "REILLY"}
property pMax : 4 * (10 ^ 6) -- Max byte size - some PDFs are just a bit too big and slow to process automatically this way

-- Join the junk patterns into one alternation: "|^Sign in$|^Register$|..."
set strSkip to ""
repeat with oJunk in plstJunkLines
    set strSkip to strSkip & "|" & oJunk
end repeat

tell application id "DNtp"
    set {dlm, my text item delimiters} to {my text item delimiters, linefeed}
    repeat with oDoc in selection as list
        tell oDoc
            if type is PDF document then
                if size < pMax then
                    set strLines to (paragraphs of ((its plain text) as string)) as text -- prepare line delimiters for shell
                    try
                        -- Print the first line that is longer than 3 characters and matches no junk pattern
                        set strLine to (do shell script "echo " & ¬
                            quoted form of (strLines) & ¬
                            " | perl -ne 'if (!(m/^.{0,3}$" & strSkip & "/)) {print \"$_\"; exit}' ")
                        -- do shell script returns text, so test against the empty string
                        if strLine ≠ "" then set its name to strLine & ".pdf"
                    end try
                end if
            end if
        end tell
    end repeat
    set my text item delimiters to dlm
end tell