Copy first line of PDF and append to filename

maars · July 9, 2011, 5:10am

Hi,

I am puzzling over a problem:

I have a huge number of PDFs (5000+) that originate from OCR’d newspaper clippings. The first line of text in each PDF contains the title of the article.
To rename these files into a more human-readable scheme, I would like to copy the first line of text in each PDF file (alternatively the first X characters) and append this string to the existing filename.

Is there any way of doing this in AppleScript?

Thanks a lot,
Marcel

alanshutko · July 9, 2011, 6:39pm

This seems to work for me. The if is there to bypass blank lines I’ve found in some of my documents. Ideally, I’d do a regexp or something, but I don’t know how to do that in AppleScript.


tell application "DEVONthink Pro"
	set selectionList to selection
	repeat with i in selectionList
		repeat with aParagraph in (paragraphs of (rich text of i))
			if ((count of characters of aParagraph) > 2) then
				set name of i to aParagraph as text
				exit repeat
			end if
		end repeat
	end repeat
end tell
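
Note that the script above replaces the name with the first non-blank line rather than appending to it. If you would rather keep the existing name and add only the first X characters, a helper along these lines might do it. This is only a sketch: firstUsefulLine is a made-up handler name, and the length limits are assumptions you would want to tune.

```applescript
-- Hypothetical helper (not part of DEVONthink): returns the first paragraph
-- of theText that is longer than minLength characters, cut down to at most
-- maxLength characters. Returns "" if no paragraph qualifies.
on firstUsefulLine(theText, minLength, maxLength)
	repeat with aParagraph in paragraphs of theText
		set lineText to aParagraph as text
		if (length of lineText) > minLength then
			if (length of lineText) > maxLength then
				return text 1 thru maxLength of lineText
			end if
			return lineText
		end if
	end repeat
	return ""
end firstUsefulLine

-- Quick check on a literal string: the blank line and "ab" are skipped,
-- and the first real line is truncated to 7 characters.
firstUsefulLine(return & "ab" & return & "Council approves new bridge", 2, 7)
-- result: "Council"
```

Inside the repeat loop above, something like set name of i to (name of i) & " - " & my firstUsefulLine(plain text of i, 2, 60) should then append instead of replace. The " - " separator and the limit of 60 are arbitrary choices, and you would still want to skip the rename when the handler returns "".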

maars · July 11, 2011, 3:29am

Works great, thanks! Didn’t realize it was that easy.
Marcel

houthakker · July 11, 2011, 7:52am

And FWIW, if you wanted to screen out a set of predictably uninteresting first lines, you could list regexes describing them at the start of the script, and use something broadly along the lines of:

property plstJunkLines : {"^Sign in$", "^Register$", "^larger$", "^smaller$", ¬
	"^(Dear|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", ¬
	"^Thank you", "REILLY"}

property pMax : 4 * (10 ^ 6) -- Max byte size - some PDFs are just a bit too big and slow to process automatically this way

set strSkip to ""
repeat with oJunk in plstJunkLines
	set strSkip to strSkip & "|" & oJunk
end repeat

tell application id "DNtp"
	set {dlm, my text item delimiters} to {my text item delimiters, linefeed}
	repeat with oDoc in selection as list
		tell oDoc
			if type is PDF document then
				if size < pMax then
					set strLines to (paragraphs of ((its plain text) as string)) as text -- prepare line delimiters for shell
					try
						set strLine to (do shell script "echo " & ¬
							quoted form of (strLines) & ¬
							" | perl -ne 'if (!(m/^.{0,3}$" & strSkip & "/)) {print \"$_\"; exit}' ")
						if strLine ≠ "" then set its name to strLine & ".pdf"
					end try
				end if
			end if
		end tell
	end repeat
	set my text item delimiters to dlm
end tell
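
If you want to try the junk-line filter on some sample text before letting it loose on 5000 files, the perl step can be pulled out into a handler and run on a literal string, with no DEVONthink involved. A rough sketch, assuming a made-up handler name (firstNonJunkLine) and patterns that contain no single quotes or slashes, since they end up inside a single-quoted perl m/.../ expression:

```applescript
-- Hypothetical helper: returns the first line of theText that is longer than
-- three characters and matches none of the regexes in junkPatterns.
-- Returns "" if every line is filtered out.
on firstNonJunkLine(theText, junkPatterns)
	-- Join the patterns into one alternation: ^.{0,3}$|pat1|pat2|...
	set skipPattern to "^.{0,3}$"
	repeat with aPattern in junkPatterns
		set skipPattern to skipPattern & "|" & aPattern
	end repeat
	-- Rejoin the paragraphs with linefeeds so perl sees one line per paragraph
	set {oldDelims, text item delimiters} to {text item delimiters, linefeed}
	set shellText to (paragraphs of theText) as text
	set text item delimiters to oldDelims
	-- Print the first line that does not match the skip pattern, then stop
	return do shell script "printf '%s\\n' " & quoted form of shellText & ¬
		" | perl -ne 'if (!(m/" & skipPattern & "/)) {print; exit}'"
end firstNonJunkLine

-- Quick check: "Sign in" matches a junk pattern and "ab" is too short,
-- so the third line is the one returned.
firstNonJunkLine("Sign in" & return & "ab" & return & "Council approves new bridge", {"^Sign in$", "^Register$"})
-- result: "Council approves new bridge"
```

Keeping the shell call in one handler also makes it easier to see what the assembled pattern looks like if a rename goes wrong.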