I am currently experimenting with smart rules. This works really well for static conditions (like "if the filename is …" or "if the file contains …").
What I would like to achieve is to search inside the document for a certain term (like an invoice number), pick up the information that follows it, and place that information in custom metadata.
Is this possible, either by scripting or with a simple smart rule?
Sounds great to me. A classic case would look like this:
IF there is a term like “Belegnummer” (document number) or “Rechnungsnummer” (invoice number), THEN take the next 10 characters and write them into the custom metadata field “Belegnummer”.
The text in a PDF is not guaranteed to be in any logical order. The word Belegnummer may visually appear before the invoice number in the PDF, but that is only an appearance. It could easily be in a different place in the actual text.
Converting such an invoice to plain text and checking this would be worthwhile, though it could vary from company to company (and possibly document to document).
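As a rough way to run that check, the following sketch looks for a label in a piece of plain text and returns the characters that come right after it. The handler name `textAfterLabel`, its parameters, and the sample text are all made up for illustration, and it only helps where the value really does follow the label in the plain text, which is exactly what needs verifying per document type.

```applescript
-- Return up to charCount characters that follow the first occurrence of
-- labelText in sourceText, or "" if the label is not found.
-- (Hypothetical helper, not part of DEVONthink.)
on textAfterLabel(sourceText, labelText, charCount)
	set labelStart to offset of labelText in sourceText
	if labelStart = 0 then return ""
	set valueStart to labelStart + (length of labelText)
	-- Skip a colon, spaces or tabs between the label and the value
	repeat while valueStart <= (length of sourceText) and (character valueStart of sourceText) is in {":", " ", tab}
		set valueStart to valueStart + 1
	end repeat
	if valueStart > (length of sourceText) then return ""
	set valueEnd to valueStart + charCount - 1
	if valueEnd > (length of sourceText) then set valueEnd to length of sourceText
	return text valueStart thru valueEnd of sourceText
end textAfterLabel

-- Hypothetical sample text, not taken from a real invoice
set sampleText to "Rechnung" & linefeed & "Belegnummer: 3213321751" & linefeed & "Datum 01.02.2020"
textAfterLabel(sampleText, "Belegnummer", 10)
```

If the handler returns something other than the expected number for a real document, that is a sign the plain text order differs from what the PDF shows.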
Even though it makes my task harder, I have to admit that you are right.
After checking the plain text of several documents, it seems to be consistent across documents of the same kind, but it is not ordered at all like the PDF.
Taking this into consideration, which approach would you recommend?
The first sequence is fixed and then it is the fourth column of digits (3213321751). So I am not sure whether it would be better to take the “fourth column” or the “first number after the third separator”.
You need to identify that line in the whole body of the text first. You have to remember computers are very dumb and very literal. What seems obvious to you as a person is not the way a computer looks at it.
This can be handled in different ways and would still require testing with your actual data.
This is a simple teaching edition example based on your particular parameters…
tell application id "DNtp"
	repeat with thisRecord in (selection as list)
		set recordText to (plain text of thisRecord)
		-- Get the plain text of the file
		repeat with currentLine in (paragraphs of recordText)
			-- Loop through the paragraphs in the text
			set lineWords to (words of currentLine)
			-- Get the words in the current line
			-- This returns a list of words
			if (count of lineWords) > 0 and (count characters of (item 1 of lineWords as string)) = 10 then
				-- Skip empty lines, then check if the first item in the list has 10 characters in it (per your example)
				set possibleMatch to (item 1 of lineWords)
				-- If it has 10 characters, set it as a possible match for further validation
				try -- This is used to do a process without an error necessarily stopping the script
					(possibleMatch as real)
					-- Validating if the first word is a number or not. If it's a word, it obviously can't be coerced into a numerical value.
					-- i.e., you can't say "What is apple minus three?"
					display alert "" & (item 4 of lineWords)
					-- If it is a number, we are displaying the fourth word (again, per your example).
					-- If the number is matched, here is where you'd do stuff with the "possibleMatch" variable.
					exit repeat -- Stop looping since there's no reason to continue after the match has been made.
				end try -- Otherwise, end trying when it errors and continue on looping to the next line…
			end if
		end repeat
	end repeat
end tell
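For the “first number after the third separator” reading of the question, a line can also be split on an explicit separator instead of relying on `words of`. This is only a sketch: the handler name `fieldOfLine`, the `|` separator and the sample line are assumptions, so check them against your actual plain text. The commented lines at the end show where DEVONthink's `add custom meta data` command could store the result, assuming the field's identifier is "Belegnummer".

```applescript
-- Return field number fieldIndex of lineText when split on separator,
-- or "" if the line has fewer fields. (Hypothetical helper.)
on fieldOfLine(lineText, separator, fieldIndex)
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to separator
	set fieldList to text items of lineText
	-- Restore the delimiters so other code is not affected
	set AppleScript's text item delimiters to savedDelimiters
	if (count of fieldList) < fieldIndex then return ""
	return item fieldIndex of fieldList
end fieldOfLine

-- Hypothetical line; the "|" separator is an assumption
set sampleLine to "0012345678|4711|0815|3213321751|EUR"
set invoiceNumber to fieldOfLine(sampleLine, "|", 4)

-- Inside the loop over the selected records, the value could then be stored:
-- tell application id "DNtp"
-- 	add custom meta data invoiceNumber for "Belegnummer" to thisRecord
-- end tell
```

Splitting on a known separator gives the fourth field regardless of how AppleScript decides what counts as a word, while `words of` may break differently on punctuation.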