Hi Everyone,
I have finally understood how to make DTPO work for me in many ways, but one thing that was driving me nuts was that after scanning years’ worth of paperwork, including old receipts, I had to hunt for the relevant information and copy/paste it into the record name.
I entered this script into the contest as well, so I hope my posting it here is consistent with the spirit of that event.
The problem with many documents, even after PDF+Text OCR conversion, is that key information is scattered around the document, which means hunting, zooming, clicking the mouse, copying, pasting, etc.
For receipts, the key information is the date of sale, the total amount, and where I spent the money.
So I set out to automate that process with the script below. It’s far from comprehensive, but if you have a date in the (M)M-(D)D-(YY)YY format in the OCR text, this thing is pretty durn accurate.
It can be extended easily to look for more date patterns, and I think over time, and with help from many of you scripting wizards in here, it could even be improved to find the total more often, as well as getting the vendor name better.
You can select a pile of scanned receipts, and it’ll rename all of them one by one, pretty quickly.
The receipt name takes the form:
RCPT 2009-09-30 for $103.89 at APPLE Store Summit
if the OCR is perfect, didn’t get junk on the first line, and didn’t break your columns into separate text blocks. Otherwise, you’ll get something like this:
RCPT 2010-01-24 for $TAX at sdf-we= ~@!!@ ~~!
If the date is in an unrecognized pattern, like '10Jan1, the receipt will take the form:
RCPT ODDBALL DATE FORMAT for $5.10 at LUNCH DELI 1023
But I’ve found that for more than 95% of what I’ve scanned, it gets the date automatically, and that’s a big win because I can smart group them then…
Please add on to it, but also, I hope this script inspires more people to apply this approach to more types of documents. I can see bill statements being a logical next step.
-- OCR Receipt Renamer v 1.0 JBM 2010-02-23 (M)M-(D)D-(YY)YY Total found on word lists, with nominal parsing by word lists. Store name merely first three words of OCR text.
-- outputs a consistent record name format, RCPT YYYY-MM-DD, to make sorting and scripted batching easier. Please add more and more logic to this little pattern finder, but share it back to Devontech so we can all use it.
-- this script looks for dates and the total payment on OCR'd receipts and takes the first three words of the receipt as a way to try and present the store name and constructs a niftier object name in DEVONthink for the OCRd receipt.
-- WARNING! this script is built in with a Year 2100 bug. If you are still using this script then, well... You've probably gotten really good at Applescript by now, so I'll just let you figure that out.
-- This section sets up the variables. The first one is the date pattern that grep (run through "do shell script") uses to do its little magic and print the date.
-- This pattern looks only for (M)M-(D)D-YY(YY), with separators such as a slash or hyphen. Sorry to everyone else in the world: I know your patterns are different, and I had a heck of a time just getting the regex to work for me.
-- I think it'd be easy to set up a pattern for other formats, BUT you'll have to rearrange things below. I intend to add some more variants as I see them.
-- For instance, a favorite lunch stop of mine has a very strange Mon'00'YY format, but it sometimes prints out as Y, Mon'D, looking like this: 09'Feb14. We have the technology to find those dates in grep, but so far that looks to be only about 1 in 1000 receipt formats.
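-- both separators accept a comma, slash, or hyphen; the hyphen goes last inside the brackets so grep does not read it as a range
set mypattern to "[0-9]\\{1,2\\}[,/-][0-9]\\{1,2\\}[,/-][0-9][0-9]"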
-- now we set us up the bomb
-- this should make it multi-selection friendly -- easier to do this than to trap multiples, at least for me, plus, if it works for one record, why not do it over and over
-- first try grabs selection loop. second try tries to get the text layer of the OCR to parse. later tries are contingencies based on the date formats.
tell application id "com.devon-technologies.thinkpro2"
try
set theselection to the selection
if theselection is {} then error "Please select some records."
-- repeat with thisitemhere from 1 to count of items of input
repeat with theRecord in theselection
set rcdtext to "oops, this ain't right if you get this message."
try
set rcdtext to the plain text of theRecord
on error
display alert "You done did it Now. There ain't no text for me to be looking at, so this ain't gonna be right."
end try
set mycat to rcdtext as text
set newname to ""
set myprice to ""
set myplace to ""
-- these two lists attempt to deal with the variety of ways people state the "total" on any receipt. I adjusted them as I scanned some of my own receipts to most situations that my receipts encounter. This was also an attempt to make this script much easier to localize for others.
-- The logic at work here: most receipts in America label the amount due and the total cash paid as "total", but not always. Sometimes they write "total due", "payment", or "total purchase".
-- This script scans the OCR text word by word looking for a match in the first list (testtotals). Then it merely grabs the next word, figuring that 90% of the time that's the cash outlay.
-- HOWEVER, to increase the hit ratio, there's a second match list. Once AppleScript grabs the second word, it is matched against the duelists list. If there is a hit, AppleScript grabs the third word instead, which should be the price paid in 98% of clean OCR.
-- There is one last check: if AppleScript only grabbed a dollar sign, it takes the following word instead.
-- To localize, replace these first-word and second-word lists with your country's equivalents, or add to them, and you can have a pretty fancy international script for the world traveller who buys things in many, many places.
set testtotals to {"TOTAL", "Total", "total", "PAYMENT"}
set duelists to {"Due", "due", "DUE", "Purchase", "PURCHASE"}
try
-- first attempt to find date. the pattern here looks for (M)M-DD-YYYY
-- quote the whole extended pattern so the shell never sees the trailing brackets unquoted
do shell script "echo " & quoted form of mycat & " | grep -o " & quoted form of (mypattern & "[0-9][0-9]")
set myDate to first paragraph of result
--fix that pesky M format to MM
if the first word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
set myDate to "0" & myDate
end if
--fix that even peskier D format to DD
if the second word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
set myMonth to the first word of myDate
set myday to "0" & the second word of myDate
set myyear to the third word of myDate
set myDate to myMonth & "-" & myday & "-" & myyear
end if
-- write the preferred date string
set myDate to the third word of myDate & "-" & the first word of myDate & "-" & the second word of myDate
on error
-- if you are here in the code, then there was not a four YYYY code and grep spits out some nasty error, which is good because now we can try the 90% pattern for (m)m-dd-yy
try
do shell script "echo " & quoted form of mycat & " | grep -o " & quoted form of mypattern
set myDate to first paragraph of result
-- fix that pesky M format to MM
if the first word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
set myDate to "0" & myDate
end if
--fix that even peskier D format to DD
if the second word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
set myMonth to the first word of myDate
set myday to "0" & the second word of myDate
set myyear to the third word of myDate
set myDate to myMonth & "-" & myday & "-" & myyear
end if
--note to self. there was something special about this line, but by the time you are looking for it, we'll have both forgotten.
--write the preferred date string for record names
set myDate to "20" & the third word of myDate & "-" & the first word of myDate & "-" & the second word of myDate
on error
--at this point, there are no numeric designations of month in a delimited list on your receipt. The very, very intrepid could now begin to search for the more obscure formats by embedding deeper try blocks, for instance, this one right here could look for 'YY MMM dd forms, but well, I get tired of branch logic after a while and figure that we've explained about 98% of U.S. receipt dates to our little expert system with the first two. If you like getting things more precise, add more date pattern tries here.
-- This is to make the failure to find a date stand out.
set myDate to "ODDBALL DATE FORMAT"
end try
end try
-- this next section is where you can go fishing for data in your receipt. For me, it's really just looking for the total. The block structure can really be built up to find Debit card charges, Sales Tax rate, etc, and you could save those findings into a paired list in the comments or something. Somebody who uses that info would find that helpful. I like the idea, but only about two-thirds of my receipts get OCRd in a way where my search method here will find a reliable pair. There's just so many options on receipt structure and it gets complicated because the OCR sometimes will parse your receipt into two columns of text which totally ruins the reliability here. If you have a way to force OCR to recognize full lines, instead of columns, this section will work much better.
set theprices to (mycat as text)
repeat with thisitem from 1 to (count of words in theprices)
if word thisitem of theprices is in testtotals then
if myprice = "" then
set i to thisitem + 1
if word i of theprices is in duelists then
set ii to i + 1
if word ii of theprices is "$" then
set ii to i + 2
end if
set myprice to word ii of theprices
else
if word i of theprices is "$" then
set i to i + 1
end if
set myprice to word i of theprices
end if
end if
end if
end repeat
set myplace to word 1 of theprices & " " & word 2 of theprices & " " & word 3 of theprices
set the name of theRecord to "RCPT " & myDate & " for $" & myprice & " at " & myplace
end repeat
on error errText
display dialog errText
end try
end tell
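
For anyone who wants to experiment with the matching in Terminal before touching the script, here is a rough sketch of the same two ideas: grab the first (M)M-(D)D-YY(YY) date and flip it to YYYY-MM-DD, then look for an amount on a TOTAL or PAYMENT line. It is not part of the script above. The function names receipt_date and receipt_total and the sample receipt text are made up for illustration, and the total finder assumes the label and the amount sit on the same OCR line, which won't hold when the OCR splits a receipt into columns.

```shell
#!/bin/sh
# Sketch only: receipt_date and receipt_total are hypothetical helper names.

receipt_date() {
    # $1 = OCR text. Take the first (M)M-(D)D-YY(YY) match, with a comma,
    # slash, or hyphen as separator, and print it as YYYY-MM-DD.
    match=$(printf '%s\n' "$1" |
        grep -o -E '[0-9]{1,2}[,/-][0-9]{1,2}[,/-][0-9]{2}([0-9]{2})?' |
        head -n 1)
    if [ -z "$match" ]; then
        echo "ODDBALL DATE FORMAT"
        return 1
    fi
    printf '%s\n' "$match" | awk -F'[,/-]' '{
        year = $3
        if (length(year) == 2) year = "20" year   # same Year 2100 caveat
        printf "%04d-%02d-%02d\n", year, $1, $2   # zero-pads month and day
    }'
}

receipt_total() {
    # $1 = OCR text. Keep lines where TOTAL or PAYMENT appears as a whole
    # word (so SUBTOTAL is skipped), then print the first dollars-and-cents
    # amount found on those lines. Prints nothing if there is no such line.
    printf '%s\n' "$1" |
        grep -i -w -E 'total|payment' |
        grep -o -E '[0-9][0-9,]*\.[0-9]{2}' |
        head -n 1
}

# Made-up sample text standing in for the OCR layer of one receipt.
text='APPLE Store Summit
9/30/2009 14:02
Subtotal 96.64
Tax 7.25
TOTAL $103.89'

echo "RCPT $(receipt_date "$text") for \$$(receipt_total "$text")"
# prints: RCPT 2009-09-30 for $103.89
```

Because the total finder works line by line rather than word by word, it behaves differently from the word-list walk in the script, so treat it as one possible direction rather than a drop-in replacement.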