Smart rules - Use PDF content for custom metadata

Connor · January 17, 2020, 8:45am

Dear,

I am currently experimenting with smart rules. This is working really well for static conditions (like "if filename is …" or "if file contains …").
What I would like to achieve is to search inside the document for a certain term (like an invoice number), pick up the value that follows it, and place that value in custom metadata.

Is this possible, either by scripting or with a simple smart rule?

Thank you for your feedback,

Connor

cgrunenberg · January 17, 2020, 9:07am

An AppleScript should definitely be able to handle this. What does such a term/number look like?

Connor · January 17, 2020, 9:44am

Sounds great to me. A classical case would look like

IF there is a term like “Belegnummer” or “Rechnungsnummer” THEN take the next 10 characters and write this value into the field “Belegnummer” (custom metadata).

BLUEFROG · January 17, 2020, 4:45pm

The text in a PDF is not guaranteed to be in any logical order. The word Belegnummer may visually appear before the invoice number in the PDF, but that is only an appearance. It could easily be in a different place in the actual text.

Converting such an invoice to plain text and checking this would be worthwhile, though it could vary from company to company (and possibly document to document).
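
One quick way to do that check is to look at the text DEVONthink actually extracts from the PDF. This is a minimal sketch, assuming a single PDF is selected and the script is run from Script Editor, where the returned text shows up in the result pane:

```applescript
-- Sketch only: return the plain text of the first selected record
-- so the real order of the extracted text can be inspected.
tell application id "DNtp"
	set theSelection to (selection as list)
	if theSelection is {} then return "Nothing selected"
	return plain text of (item 1 of theSelection)
end tell
```

Comparing this output for a few invoices from the same company shows whether the line of interest always appears in the same form.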

Connor · January 17, 2020, 7:39pm

Even though it makes my task harder, I have to admit that you are right.
After checking the plain text of several documents, the text seems to be laid out identically across documents of the same kind, but it is not in the same order as it appears in the PDF.
So taking this into consideration, which approach would you recommend?

BLUEFROG · January 17, 2020, 7:58pm

What’s an example invoice value? E.g., is it 10 digits or a combination of digits and characters (including dashes, etc.)?

Connor · January 17, 2020, 8:07pm

The corresponding line is

7000607467 0151XXXXXX 12.06.2019 3213321751 10.01.2020

The first sequence is fixed, and the value I need is the fourth column of digits (3213321751). So I am not sure whether it would be better to take the “fourth column” or the “first number after the third separator”.

Which scripting functions would be a good start?
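
For the sample line above, the “fourth column” and the “first number after the third separator” come to the same thing if the separator is a single space. A minimal sketch using AppleScript's text item delimiters follows; the handler name `columnOfLine` is hypothetical, and the single-space separator is an assumption taken from the sample line:

```applescript
-- Hypothetical helper (not from the thread): return the Nth space-separated
-- column of a line, or missing value if the line has fewer columns.
on columnOfLine(theLine, columnIndex)
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to " "
	set theColumns to text items of theLine
	-- Restore the previous delimiters so other code is not affected
	set AppleScript's text item delimiters to savedDelimiters
	if (count of theColumns) < columnIndex then return missing value
	return item columnIndex of theColumns
end columnOfLine

set sampleLine to "7000607467 0151XXXXXX 12.06.2019 3213321751 10.01.2020"
columnOfLine(sampleLine, 4) -- result: "3213321751"
```

If the real text contains runs of several spaces or tabs between columns, splitting on a single space yields empty items, so the column index would need adjusting for that data.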

BLUEFROG · January 17, 2020, 8:41pm

You need to identify that line in the whole body of the text first. You have to remember computers are very dumb and very literal. What seems obvious to you as a person is not the way a computer looks at it.

This can be handled in different ways and would still require testing with your actual data.

This is a simple teaching edition example based on your particular parameters…

tell application id "DNtp"
	repeat with thisRecord in (selection as list)
		
		set recordText to (plain text of thisRecord)
		-- Get the plain text of the file
		
		repeat with currentLine in (paragraphs of recordText)
			-- Loop through the paragraphs in the text
			
			set lineWords to (words of currentLine)
			-- Get the words in the current line
			-- This returns a list of words
			
			if (count of lineWords) >= 4 and (count characters of (item 1 of lineWords as string)) = 10 then
				-- Skip lines with fewer than 4 words (e.g., empty lines), then check if the first item in the list has 10 characters in it (per your example)
				
				set possibleMatch to (item 1 of lineWords)
				-- If it has 10 characters, set it as a possible match for further validation
				
				try -- This is used to do a process without an error necessarily stopping the script
					(possibleMatch as real)
					-- Validating if the first word is a number or not. If it's a word, it obviously can't be coerced into a numerical value.
					-- i.e., you can't say "What is apple minus three?"
					
					display alert "" & (item 4 of lineWords)
					-- If it is a number, we are displaying the fourth word (again, per your example).
					
					-- If the number is matched, here is where you'd do stuff with the "possibleMatch" variable.
					
					exit repeat -- Stop looping since there's no reason to continue after the match has been made.
				end try -- Otherwise, end trying when it errors and continue on looping to the next line…
			end if
		end repeat
	end repeat
end tell
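
To close the loop on the original question, the matched value can be written to custom metadata instead of being shown in an alert. The following is a minimal sketch, not a tested solution. It assumes DEVONthink 3, whose scripting dictionary provides the `add custom meta data … for … to …` command and the `performSmartRule` handler used by smart rule scripts. It also assumes a custom metadata field whose identifier is "belegnummer", which may differ in your setup.

```applescript
-- Sketch only. Assumes DEVONthink 3 and a custom metadata field
-- with the identifier "belegnummer" (an assumption, adjust to your setup).
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with thisRecord in theRecords
			set recordText to (plain text of thisRecord)
			repeat with currentLine in (paragraphs of recordText)
				set lineWords to (words of currentLine)
				-- Guard against short lines, then apply the same 10-character test as the script above
				if (count of lineWords) >= 4 and (count characters of (item 1 of lineWords as string)) = 10 then
					try
						((item 1 of lineWords) as real) -- errors if the first word is not numeric
						-- Write the fourth word into the custom metadata field
						add custom meta data (item 4 of lineWords as string) for "belegnummer" to thisRecord
						exit repeat -- one match per document is enough
					end try
				end if
			end repeat
		end repeat
	end tell
end performSmartRule
```

Attached to a smart rule as a script action, this would run once for the records the rule matches. As with the teaching example, it should be tested against real invoices before being trusted.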