Automatic Renaming of Receipts to RCPT YYYY-MM-DD for $XX.XX

jbmanos · February 24, 2010, 12:03am

Hi Everyone,

I have finally figured out how to make DTPO work for me in many ways, but one thing that was driving me nuts was that after scanning years' worth of paperwork, including old receipts, I had to hunt down the relevant information and copy/paste it into the record name.

I entered this script into the contest as well, so I hope my posting it here is consistent with the spirit of that event.

The problem with many documents, even after PDF+Text OCR conversion, is that key information is scattered around the document, which means hunting, zooming, clicking the mouse, copying, pasting, etc.

For receipts, the key information is the date of sale, the total amount, and where I spent the money.

So I set out to automate that process with a script. The one below. It’s far from comprehensive, but if you have a date in the (M)M-(D)D-(YY)YY format in the OCR text, this thing is pretty durn accurate.

It can be extended easily to look for more date patterns, and I think over time, with help from the many scripting wizards in here, it could even be improved to find the total more often, as well as getting the vendor name better.

You can select a pile of scanned receipts, and it’ll rename all of them one by one, pretty quickly.

the receipt name takes the form:

RCPT 2009-09-30 for $103.89 at APPLE Store Summit

if the OCR is perfect, didn’t get junk on the first line, and didn’t break your columns into separate text blocks. Otherwise, you’ll get something like this:

RCPT 2010-01-24 for $TAX at sdf-we= ~@!!@ ~~!

if the date is in an unrecognized pattern, like '10Jan1, the receipt will take the form

RCPT ODDBALL DATE FORMAT for $5.10 at LUNCH DELI 1023

but I’ve found that for more than 95% of what I’ve scanned, it gets the date automatically, and that’s a big win because I can smart group them then…
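For anyone who wants to experiment with the date logic outside AppleScript, here is a minimal Python sketch of the same idea: look for a date with a four-digit year first, fall back to a two-digit year, and otherwise return the oddball marker. The function name `receipt_date` and the digit-boundary checks are assumptions of this sketch, not part of the script, and it does not validate month or day ranges.

```python
import re

# Same shape as the script's grep pattern: 1-2 digits, a separator
# (comma, slash, or hyphen), 1-2 digits, a separator, then the year.
# The lookarounds keep a match from starting or ending inside a longer
# run of digits (an addition of this sketch).
FOUR_DIGIT_YEAR = re.compile(r"(?<!\d)(\d{1,2})[,/-](\d{1,2})[,/-](\d{4})(?!\d)")
TWO_DIGIT_YEAR = re.compile(r"(?<!\d)(\d{1,2})[,/-](\d{1,2})[,/-](\d{2})(?!\d)")


def receipt_date(ocr_text):
    """Return YYYY-MM-DD for the first M-D-Y date found, else a marker."""
    # Try four-digit years over the whole text first, then two-digit years.
    for pattern in (FOUR_DIGIT_YEAR, TWO_DIGIT_YEAR):
        match = pattern.search(ocr_text)
        if match is None:
            continue
        month, day, year = match.groups()
        if len(year) == 2:
            year = "20" + year  # assumes 20xx, same Year 2100 caveat
        # Zero-pad single-digit months and days.
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return "ODDBALL DATE FORMAT"
```

More date formats could be supported by adding further patterns to the tuple, each with its own way of reordering the captured groups.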

Please add on to it, but also, I hope this script inspires more people to apply this approach to more types of documents. I can see bill statements being a logical next step.

-- OCR Receipt Renamer v 1.0 JBM 2010-02-23  (M)M-(D)D-(YY)YY Total found on word lists, with nominal parsing by word lists.  Store name merely first three words of OCR text.

-- outputs consistent record name format RCPT YYYY-MM-DD to more easily facilitate sorting and scripted batching.  Please add more and more logic to this little pattern finder, but share it back to Devontech so we can all use it.  

-- this script looks for dates and the total payment on OCR'd receipts and takes the first three words of the receipt as a way to try and present the store name and constructs a niftier object name in DEVONthink for the OCRd receipt.

-- WARNING!  this script is built in with a Year 2100 bug.  If you are still using this script then, well...  You've probably gotten really good at Applescript by now, so I'll just let you figure that out.

-- this section sets up the variables.  The first one is the date pattern that grep (run through the shell) uses to do its little magic and print the date.  This pattern looks only for M-D(D)-YY(YY) or MM-D(D)-YY(YY)  (sorry anyone else in the world because I know your patterns are different and I had a heck of a time with regex working just to get this going for me).  I think it'd be easy to set up a pattern for other formats, BUT you'll have to rearrange things below.  I intend to add some more variants, as I see them.  For instance, a favorite lunch stop for me, has a very strange Mon'00'YY format, but it sometimes prints out as Y, Mon'D, looking like this 09'Feb14   WE have the technology to find those dates in grep, but so far that looks to be about only 1 in 1000 receipt formats?


set mypattern to "[0-9]\\{1,2\\}[,/-][0-9]\\{1,2\\}[,/-][0-9][0-9]"

--  now we set us up the bomb
-- this should make it multi-selection friendly -- easier to do this than to trap multiples, at least for me, plus, if it works for one record, why not do it over and over

-- first try grabs selection loop.  second try tries to get the text layer of the OCR to parse.  later tries are contingencies based on the date formats.

tell application id "com.devon-technologies.thinkpro2"
	
	try
		
		set theselection to the selection
		
		if theselection is {} then error "Please select some records."
		-- repeat with thisitemhere from 1 to count of items of input
		
		repeat with theRecord in theselection
			
			set rcdtext to "oops, this ain't right if you get this message."
			try
				
				set rcdtext to the plain text of theRecord
			on error
				display alert "You done did it Now.  There ain't no text for me to be looking at, so this ain't gonna be right."
			end try
			
			
			set mycat to rcdtext as text
			set newname to ""
			set myprice to ""
			set myplace to ""
			
			-- these two lists attempt to deal with the variety of ways people state the "total" on any receipt.  I adjusted them as I scanned some of my own receipts to most situations that my receipts encounter.  This was also an attempt to make this script much easier to localize for others.
			
			-- logic at work here.  Most receipts in America put the amount due and total cash paid as "total", but not always; sometimes they write "total due" or "payment" or "total purchase".  This script scans the OCR text word by word looking for a match in the testtotals list.  Then it merely grabs the next word (figuring that 90% of the time that's the cash outlay).  HOWEVER, in order to increase the hit ratio, there's a second match list.  That way, once AppleScript grabs the second word, it is matched against the duelists list.  If there is a hit, then AppleScript grabs the third word, which should be the price paid in 98% of clean OCR.  There is one last check, which is to see if AppleScript only grabbed a dollar sign; if so, it gets the word after that.
			
			-- To localize, replace these first-word and second-word lists with your country's equivalents, or add to them and you can have a pretty fancy international script for the world traveller who buys things in many, many places.
			
			set testtotals to {"TOTAL", "Total", "total", "PAYMENT"}
			set duelists to {"Due", "due", "DUE", "Purchase", "PURCHASE"}
			
			try
				
				--  first attempt to find date.  the pattern here looks for (M)M-DD-YYYY
				
				do shell script "echo " & quoted form of mycat & " | grep -o " & quoted form of (mypattern & "[0-9][0-9]")
				set myDate to first paragraph of result
				
				--fix that pesky M format to MM
				
				if the first word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
					set myDate to "0" & myDate
				end if
				
				--fix that even peskier D format to DD
				if the second word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
					set myMonth to the first word of myDate
					set myday to "0" & the second word of myDate
					set myyear to the third word of myDate
					set myDate to myMonth & "-" & myday & "-" & myyear
				end if
				
				-- write the preferred date string
				
				set myDate to the third word of myDate & "-" & the first word of myDate & "-" & the second word of myDate
			on error
				
				-- if you are here in the code, then there was not a four YYYY code and grep spits out some nasty error, which is good because now we can try the 90% pattern for (m)m-dd-yy
				
				try
					do shell script "echo " & quoted form of mycat & " | grep -o " & quoted form of mypattern
					set myDate to first paragraph of result
					
					-- fix that pesky M format to MM
					if the first word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
						set myDate to "0" & myDate
					end if
					--fix that even peskier D format to DD
					if the second word of myDate is in {"1", "2", "3", "4", "5", "6", "7", "8", "9"} then
						set myMonth to the first word of myDate
						set myday to "0" & the second word of myDate
						set myyear to the third word of myDate
						set myDate to myMonth & "-" & myday & "-" & myyear
					end if
					--note to self.  there was something special about this line, but by the time you are looking for it, we'll have both forgotten.
					
					--write the preferred date string for record names
					
					set myDate to "20" & the third word of myDate & "-" & the first word of myDate & "-" & the second word of myDate
				on error
					
					--at this point, there are no numeric designations of month in a delimited list on your receipt.  The very, very intrepid could now begin to search for the more obscure formats by embedding deeper try blocks, for instance, this one right here could look for 'YY MMM dd forms, but well, I get tired of branch logic after a while and figure that we've explained about 98% of U.S. receipt dates to our little expert system with the first two.  If you like getting things more precise, add more date pattern tries here.
					
					-- This is to make the failure to find a date stand out.
					
					set myDate to "ODDBALL DATE FORMAT"
				end try
				
				
			end try
			
			-- this next section is where you can go fishing for data in your receipt.  For me, it's really just looking for the total.  The block structure can really be built up to find Debit card charges, Sales Tax rate, etc, and you could save those findings into a paired list in the comments or something.  Somebody who uses that info would find that helpful.  I like the idea, but only about two-thirds of my receipts get OCRd in a way where my search method here will find a reliable pair.  There's just so many options on receipt structure and it gets complicated because the OCR sometimes will parse your receipt into two columns of text which totally ruins the reliability here.  If you have a way to force OCR to recognize full lines, instead of columns, this section will work much better.
			
			set theprices to (mycat as text)
			repeat with thisitem from 1 to (count of words in theprices)
				if word thisitem of theprices is in testtotals then
					if myprice = "" then
						set i to thisitem + 1
						if word i of theprices is in duelists then
							set ii to i + 1
							if word ii of theprices is "$" then
								set ii to i + 2
							end if
							set myprice to word ii of theprices
							
						else
							if word i of theprices is "$" then
								set i to i + 1
							end if
							set myprice to word i of theprices
						end if
					end if
				end if
				
			end repeat
			
			set myplace to word 1 of theprices & " " & word 2 of theprices & " " & word 3 of theprices
			
			set the name of theRecord to "RCPT " & myDate & " for $" & myprice & " at " & myplace
		end repeat
		
	on error errText
		display dialog errText
	end try
	
	
	
end tell
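
The word-list search for the total and the "first three words" store name can also be tried outside DEVONthink. The following is a minimal Python sketch of that logic, not a port of the script: it splits on whitespace instead of using AppleScript's `words`, and the function names `find_total`, `store_name`, and `receipt_name` are hypothetical. It reproduces the script's weak spots too, so "TOTAL TAX 1.23" still yields "TAX".

```python
# Word lists copied from the script; extend or localize as needed.
TOTAL_WORDS = {"TOTAL", "Total", "total", "PAYMENT"}
DUE_WORDS = {"Due", "due", "DUE", "Purchase", "PURCHASE"}


def find_total(ocr_text):
    """Return the token after the first total-like label, or ""."""
    words = ocr_text.split()
    for index, word in enumerate(words):
        # Drop a trailing colon so "TOTAL:" still counts as a label.
        if word.rstrip(":") not in TOTAL_WORDS:
            continue
        rest = words[index + 1:]
        # Skip a qualifier such as "Due", then a lone dollar sign.
        if rest and rest[0].rstrip(":") in DUE_WORDS:
            rest = rest[1:]
        if rest and rest[0] == "$":
            rest = rest[1:]
        # Only the first label is used, as in the script.
        return rest[0].lstrip("$") if rest else ""
    return ""


def store_name(ocr_text):
    """Use the first three words of the OCR text as the store name."""
    return " ".join(ocr_text.split()[:3])


def receipt_name(date_text, ocr_text):
    """Assemble the record name in the RCPT format used above."""
    return f"RCPT {date_text} for ${find_total(ocr_text)} at {store_name(ocr_text)}"
```

Unlike the script, this sketch returns an empty price instead of raising an error when the label is the last word of the text.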

jbmanos · February 24, 2010, 12:12am

one more thing!

I think a grep pattern that looked for $#(,)###.## type patterns could be used to find all dollar amounts in the text. I need to tweak that pattern to make the comma and the dollar sign optional, as well as make it hit on #.##, ##.##, ###.##, etc.

From there, applescript can compare the items in the grep result list to find the highest amount, and then do some logic to see if that amount was the cash tendered or the total paid. After figuring that out, it would select the highest or second highest amount.

That would probably increase the hit ratio on finding the receipt total substantially. So if someone doesn’t beat me to it, that’s the next planned change to the script.
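
As a rough illustration of that plan, here is a minimal Python sketch that collects every dollar amount with an optional dollar sign and optional thousands commas, then picks the highest or second highest. The "cash tendered" test is an assumption of this sketch: it treats the highest amount as cash handed over only when the text contains a word like cash, tendered, or change and the difference between the two highest amounts also appears on the receipt (as the change). The names `find_amounts` and `guess_total` are hypothetical.

```python
import re
from decimal import Decimal

# Optional "$", then digits with or without thousands commas, then ".##".
AMOUNT_RE = re.compile(r"(?<![\d.,])\$?(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})(?!\d)")
# Hypothetical hint words suggesting a cash payment.
TENDER_RE = re.compile(r"\b(?:cash|tendered|change)\b", re.IGNORECASE)


def find_amounts(text):
    """Return every dollar amount in the text, in reading order."""
    return [
        Decimal(m.group(1).replace(",", "") + "." + m.group(2))
        for m in AMOUNT_RE.finditer(text)
    ]


def guess_total(text):
    """Guess the amount paid: the highest amount, or the second highest
    when the highest looks like cash tendered."""
    amounts = sorted(find_amounts(text), reverse=True)
    if not amounts:
        return None
    if len(amounts) > 1 and TENDER_RE.search(text):
        change = amounts[0] - amounts[1]
        if change in amounts:
            return amounts[1]
    return amounts[0]
```

This heuristic is easy to fool, for example by a receipt that lists a large unrelated figure such as a loyalty balance, so treat the result as a guess to review.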

Knight_of_Nee · February 28, 2010, 3:13am

Mind += Blown;

That is some nice work. I especially like the shell script you slipped in there. I never considered that. In fact, that opens up a whole crazy world to me, like calling Python from within my AS.

Hey Eric, Bill and Christian, WRITE A DTP SCRIPTING BOOK. And give jbmanos some kickback.

I’d gladly pay another $15-$20 for a pro scripting manual to go along with DTPO2. Publish it through Take Control (http://www.takecontrolbooks.com/)

I’m serious.

korm · February 28, 2010, 12:21pm

I like what you did JBM.

Regarding getting date/time info. I use a personalized version of the annotation smart template that I’ve modified in a number of ways. Setting the name for the annotation file uses this code to get the system date/time and use it in the file name – similar to what you do:

```
set base_name to do shell script "date -n +%Y%m%d\\ %H.%M.%S"
set short_base_name to do shell script "date -n +%Y%m%d"
```

and with that result I do


```
set the name of theRecord to (base_name & " NOTE: " & (name of theFrontmostDocument) as string)
```

which takes the date/time string and prepends it to an indicator of the type of annotation and the name of the annotated document. So I get:

20100226 06.23.40 NOTE: Journal 9-23.pdf

I've refactored this approach in other workflow scripts.
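
For comparison, the same naming scheme can be sketched in Python, which may be handy if the name is built in a helper script instead of through `date`. The function name `annotation_name` and its optional `now` parameter are assumptions made for illustration.

```python
from datetime import datetime


def annotation_name(document_name, now=None):
    """Build a name such as "20100226 06.23.40 NOTE: Journal 9-23.pdf"."""
    # Fall back to the current system time when no timestamp is given.
    now = now or datetime.now()
    base_name = now.strftime("%Y%m%d %H.%M.%S")
    return f"{base_name} NOTE: {document_name}"
```

Passing a fixed `datetime` makes the result reproducible, which helps when testing a workflow script.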

Nickedy · May 10, 2014, 9:43pm

I am new to DT and interested whether this has been perfected !?
“Automatic Renaming of Receipts to RCPT YYYY-MM-DD for $XX.XX”
thx for any information on that … I would also need help to get a script set up so I can use it in DT !?
highly Appreciate any comment and/or help

project_guru · September 25, 2019, 8:08pm

I know it’s years later but I think this solution is still one of the best, even in 2019!

I am also wondering if anyone has built upon this solution.

Kindly share.

Thanks!

cgrunenberg · September 26, 2019, 6:59am

In version 3 it’s also possible to use smart rules and the amount and document date placeholders instead.

jbmanos · October 3, 2019, 12:12pm

That makes me glad that this was helpful for you! Thanks for saying so, too!

I never did write the rest of this script. I wanted to add more tests and branches so it’d find names better and get more hits for stranger receipts but never did that.

Scansnap receipts got good at doing this so I hadn’t needed to improve it for my scans.

I do wonder if this wouldn’t be a good case for machine learning, and it has me wondering now whether Apple’s ML can be accessed from AppleScript.

hawkboy · October 8, 2020, 11:43pm

Hello… I am brand new to DEVONthink 3 and am hoping you can help me to figure out how to implement a smart rule for scanning receipts with all information intact as you alluded to in your response on the community board Sept 2019? I cannot figure it out! TIA

BLUEFROG · October 9, 2020, 12:47am

NOTE: There is no way to guarantee all the information will come through as expected, as there may be multiple potential matches or the OCR layer of a scanned image may not contain the expected text.

That being said, here’s an example smart rule using a regular expression to look for walmart in the receipt, then using the document date not only for renaming but also for filing the document afterwards.

It detects PDFs with a text layer in the Inbox of this particular database, noting it’s always wise to be more specific than general when setting criteria.
It parses the text for walmart, ignoring case.
It renames the file with matched text \1 and uses placeholders for the detected document amount and date. Control-click in this edit field and choose from the Insert Placeholder contextual menu.
It files the document in a Purchases group, segregating by the year and month of purchase.
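
The matching and filing steps above can be mimicked in a few lines of Python to see what the rule is doing. This is only a sketch of the logic, not how DEVONthink implements smart rules: the amount and date are passed in because detecting them is DEVONthink's job, and the name layout, the `Purchases/YYYY/MM` path shape, and the function name `plan_filing` are all assumptions.

```python
import re
from datetime import date

# One capture group, so the vendor text plays the role of \1 in the rule.
VENDOR_RE = re.compile(r"(walmart)", re.IGNORECASE)


def plan_filing(ocr_text, amount, doc_date):
    """Return (new_name, group_path), or None when the vendor is absent."""
    match = VENDOR_RE.search(ocr_text)
    if match is None:
        return None
    vendor = match.group(1)
    # Hypothetical name layout: vendor, amount, then ISO date.
    new_name = f"{vendor} {amount} {doc_date.isoformat()}"
    # Group by year and month of purchase.
    group_path = f"Purchases/{doc_date:%Y}/{doc_date:%m}"
    return new_name, group_path
```

Note that the captured text keeps whatever casing the OCR layer contains, so a receipt printed in capitals would produce a name starting with "WALMART".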

The original: [screenshot: original in Inbox]

The output: [screenshot: renamed and filed]

Here’s the receipt showing what DEVONthink detected in the text layer…

hawkboy · October 9, 2020, 2:01am

Thanks so much…still can’t get it to work for me but I will keep trying. To be clear I’m trying to figure out if DT3 can be manipulated to work like the old Neat software did. I’m using Paperless now but I don’t love it for a lot of reasons. However, it does a much better job scanning receipts and so I think I’m going to have to accept it.

BLUEFROG · October 9, 2020, 2:20am

Hold the Option key and choose Help > Report bug to start a support ticket.
Include a file you’re trying to process and what your expected output filename is.

jbmanos · February 10, 2021, 2:37pm

I’m so rusty with my applescript! I have some new challenges as I recently built a database of 16 years of files and now want to automate some data gathering.

I’m totally shocked every time I see replies here, but I’m glad I could do something that sparks more usability!