Find out the Date in an OCR Scanned PDF and Rename to Date

mannigfaltig · June 25, 2012, 11:46am

Hello forum

I would like to know if it’s possible to create an AppleScript which can find the date in an OCR-scanned PDF, for example the issue date of a receipt, and then rename that PDF to the receipt’s date.

This would save me a lot of time!

Or is there any software on the market which can do this already?

Looking forward to your answers

G.

Hugh · June 25, 2012, 2:28pm

The challenge for software is, I suspect, to identify within the text of the PDF the particular text string which is the date. It needs to be unique and stand out. After that, renaming the document should be fairly simple.

For example, if you physically highlight the date with a marker, the OCR software supplied with the Fujitsu Scansnap 1500M is designed to be capable of turning the resultant text string into a PDF keyword. Hazel (noodlesoft.com/hazel.php) can then be set up to rename the document with the keyword (among many other clever things). However, I’ve found the reliability of the highlight-finding to be iffy; bright green marker is supposed to be best, and I suspect the legibility of the original document matters a lot. Especially with receipts, legibility can be an issue.

I believe some other applications may be able to try to parse the entire text and put the date into a database, if they can find it and if it is uniquely “datey”.
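
To illustrate the "unique and stand out" point, here is a minimal JavaScript sketch that collects date-like strings from OCR text and only accepts the document when exactly one distinct candidate exists. The function names and the handful of numeric formats are assumptions made for illustration; they are not taken from any of the applications mentioned.

```javascript
// Hypothetical helper: collect every numeric date-like string in OCR text.
// Only dd.mm.yyyy, dd/mm/yyyy, dd-mm-yyyy and yyyy-mm-dd are recognised here.
function findDateCandidates(text) {
  const pattern = /\b(?:\d{1,2}[./-]\d{1,2}[./-]\d{4}|\d{4}-\d{2}-\d{2})\b/g;
  return Array.from(text.matchAll(pattern), (m) => m[0]);
}

// Treat the document as unambiguous only if exactly one distinct candidate shows up.
function uniqueDate(text) {
  const distinct = [...new Set(findDateCandidates(text))];
  return distinct.length === 1 ? distinct[0] : null;
}

const receipt = "ACME Store\nDate: 26.09.2016\nTotal 12.50 EUR";
const invoice = "Invoice 01.02.2016, due 15.02.2016";
console.log(uniqueDate(receipt)); // one candidate, so it is returned
console.log(uniqueDate(invoice)); // two candidates, so null
```

A receipt with a single printed date passes; an invoice with both an issue date and a due date is rejected and would need a smarter rule or a manual choice.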

mannigfaltig · June 25, 2012, 7:37pm

Brilliant, thank you Hugh for the tip, as I work with the brilliant ScanSnap scanner.

But it would be perfect to have this as an AppleScript (one app does it all; thanks to Apple I don’t have to think of another term than app, and it’s even the short form of application, which makes it even handier).

mannigfaltig · September 8, 2012, 1:11am

Found this post:

Automatic Renaming of Receipts to RCPT YYYY-MM-DD for $XX.XX

Mezzanine · September 10, 2012, 10:41am

I’ve been working on a solution for this over the past week or so.

At first, I developed a perl script outside of DTP that would search freshly scanned and OCR’d documents for date text - there are a few pre-built, intelligent PDF text extraction and date parser modules for perl on CPAN. After more trial-and-error than I had hoped, this script set the file’s timestamp (modification date) to the first date found within the document (if any).

After importing the document into DTP, I would then use one of the bundled scripts “rename to creation date” (?) which changed the document name to reflect the “creation date” of the object in DTP, which itself originated from the file’s timestamp. With me so far? This finally produced record names within DTP like:

2010-06-20
2012-08-01

There were a few issues with the built-in perl PDF text extraction modules which meant that some downloaded (rather than scan-generated) PDF docs couldn’t be parsed so I changed to use the “mdimport -d2” system command which was messier but more robust.

After all of that, I then started work on managing the workflow via a DTP script. I wanted something that was repeatable/re-runnable so that improvements in the date searching algorithm could be applied to documents already imported into DTP. This was not possible with my first solution as changing a file’s timestamp within the DTP DB structure doesn’t automatically get reflected in the object’s metadata - although there is another bundled script for that!

Anyway, I now have a DTP applescript that:

- takes the list of currently selected docs within DTP and for each doc:
  - passes the PDF text and location (path) to my new perl script
  - the perl script:
    - transforms and searches the text for appropriate date strings
    - sets the file’s timestamp to the date (if found)
    - returns the date “name” back to the applescript
  - the applescript reads the file’s timestamp and sets the internal DTP record “creation date”
  - the document name is changed to the “name” returned by the perl script. This is “YYYY-MM-DD” or “no date” if no date was found.
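
The "set the timestamp, return the name" step in the list above can be sketched in a few lines. This is only an illustration in JavaScript for Node.js (an assumption; the scripts in this thread are AppleScript and perl), and `stampFile` is a hypothetical name.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: stamp a file with the date found in its text and
// return the "YYYY-MM-DD" name, or "no date" when nothing usable was found.
function stampFile(filePath, date) {
  if (!(date instanceof Date) || Number.isNaN(date.getTime())) {
    return "no date";
  }
  fs.utimesSync(filePath, date, date); // sets access time and modification time
  return date.toISOString().slice(0, 10); // UTC calendar date
}

// Demo on a throwaway file in the temp directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "stamp-"));
const demoFile = path.join(demoDir, "receipt.pdf");
fs.writeFileSync(demoFile, "placeholder");
const stampedName = stampFile(demoFile, new Date(Date.UTC(2012, 7, 1)));
console.log(stampedName, fs.statSync(demoFile).mtime.toISOString());
```

Because the file’s modification time carries the date, a later import step can read it back without parsing the text again, which is the reason the workflow goes through the timestamp at all.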

This is all a work-in-progress and I don’t have any of the scripts available right now but am happy to post them when I’m back home if there’s any interest…

jonaswouters · September 20, 2012, 5:37pm

I would like those scripts!

Mezzanine · September 20, 2012, 11:17pm

Here’s the applescript part…

-- Set object name and creation date from document text
-- Created by Graeme Wilford 8/9/2012
-- Copyright (c) 2012. All rights reserved.

-- the perl script (takes document text on STDIN and source file path as 1st argument)
set PDFdate to "/Users/gwilford/bin/PDF-date4-stdin"
set MaxTextSize to 1000

tell application id "com.devon-technologies.thinkpro2"
	try
		set this_selection to the selection
		if this_selection is {} then error "Please select some contents."
		
		set number_of_steps to count of this_selection
		show progress indicator "Parsing documents…" steps number_of_steps
		--show progress indicator "Parsing records…"
		
		repeat with this_record in this_selection
			-- fake a 'continue' with an exit repeat of a 1-pass loop
			repeat 1 times
				set this_name to name of this_record as string
				step progress indicator this_name
				
				-- get text contents of PDF and pass up to MaxTextSize chars to parser
				set theText to plain text of this_record
				set theTextSize to length of theText
				-- skip to the next record if we have no text to parse 
				if theTextSize = 0 then exit repeat
				
				-- only pass MaxTextSize chars through "do shell script"
				if theTextSize > MaxTextSize then set theTextSize to MaxTextSize
				-- "text 1 thru n" returns a string directly, independent of text item delimiters
				set theTrimmedText to text 1 thru theTextSize of theText
				
				set thePath to path of this_record
				
				-- do the heavy lifting
				-- NB. shell can only handle ~260k command line	
				set theName to do shell script "echo " & quoted form of theTrimmedText & " | " & PDFdate & space & quoted form of thePath
				
				-- get the file timestamp
				tell application "System Events"
					set theDate to modification date of file thePath
				end tell
				
				-- set DB object name
				-- do this *after* checking file mod date
				-- as this command actually moves the file!
				set the name of this_record to theName
				
				-- set DB object creation date
				set the creation date of this_record to theDate
			end repeat
		end repeat
		hide progress indicator
		
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

and here’s the perl part:

#!/usr/bin/perl -w

# Created by Graeme Wilford 8/9/2012
# Copyright (c) 2012. All rights reserved.

# Take text on STDIN and a source filename as 1st argument
# - search for appropriate date string in text
# - set the source file modification time to date (if found)
# - print out date string for use as name of file/record

use Date::Extract;
use DateTime;	# used directly below for DateTime->now()

my $parser = Date::Extract->new();
my $now = time();
my $today = DateTime->now();
my $year = 52*7*24*3600;	# roughly one year (364 days) in seconds

# Range of dates that are deemed allowable:
# ($now - $past) <= date <= ($now + $future)
my $future = $year;
my $past = 30*$year;

# initial part of document to check first
my $initial_size = 500;

my $filename = $ARGV[0] || die "no filename supplied";

# Process the text
print process() . "\n";

# re-form parsed dates in Date::Extract friendly format including y2k fix
sub dform {
	my ($d, $m, $y) = @_;
	#print "$y/$m/$d: ";

	# ensure 4-digit year: 00-19 become 20xx, 20-99 become 19xx
	# (4-digit years are left untouched)
	if ($y < 20) {
		$y += 2000;
	} elsif ($y < 100) {
		$y += 1900;
	}

	# for Date::Extract 
	my $s = sprintf("%s %d, %4d", $m, $d, $y);
	#print STDERR $s . "\n";
	return ($s);
}

	
sub process {
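	# slurp all of STDIN; a bare <STDIN> in scalar context reads only the first line
	my $text = do { local $/; <STDIN> };
	return "no text" unless defined $text && length $text;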
	my ($dt, $textlen);

	# Pre-process various Date::Extract unfriendly date formats
	#
	# separators can be '-' or whitespace 
	# day can be:
	# - single digit with or without leading zero
	# - double digit
	# - qualified (eg 1st, 15th, 22nd)
	# month can be full or short name
	# year can be 2 or 4 digits
	# NB. the quantifier must be written {1,3}; {1-3} is taken literally
	$text =~ s/(\d{1,2})(st|nd|rd|th)?(-|\s{1,3})(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*(-|\s{1,3})(\d{4}|\d{2})/dform($1, $4, $6)/eig;

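	# lose now/today/tomorrow (whole words only, so "know" is left alone)
	$text =~ s/\b(now|today|tomorrow)\b//ig;

	# lose days of week (whole words only, so "month" or "friend" are left alone)
	$text =~ s/\b(mon|tues?|wed(nes)?|thu(rs)?|fri|sat(ur)?|sun)(day)?\b//ig;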

	# look for first "date*" string
	if (substr($text, 0, $initial_size) =~ /date:?\s+(.*)/is) {
		$dt = $parser->extract(substr($1, 0, 50));
	}

	# check initial part of document text
	if (!$dt) {
		$dt = $parser->extract(substr($text, 0, $initial_size));
	}

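	# check remaining parts of document, re-scanning the last 30 characters
	# of the initial part (presumably so a date straddling the boundary is not missed)
	if (!$dt && length($text) > $initial_size) {
		$dt = $parser->extract(substr($text, $initial_size - 30));	
	}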
	return "no date" unless $dt;

	# sanity-check the date found
	if ($dt->epoch > $now + $future ||
	    $dt->epoch < $now - $past) {
		return "no date";
	# don't allow 'today'
	} elsif ($today->ymd eq $dt->ymd) {
		return "no date";
	}
		
	# date is good!
	utime($dt->epoch, $dt->epoch, $filename);
	return $dt->ymd;
}
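
For readers who don’t know perl, the two core ideas of the script (rewriting awkward day-month-year strings into a friendlier form, then sanity-checking the result) can be sketched in JavaScript. This is a rough approximation under stated assumptions, not a port: `normaliseDates` and `isPlausible` are hypothetical names, and the real script hands the actual parsing to Date::Extract.

```javascript
// Rewrite "1st-Aug-12" or "22nd August 2012" as "Aug 1, 2012" / "Aug 22, 2012".
function normaliseDates(text) {
  const re = /(\d{1,2})(?:st|nd|rd|th)?(?:-|\s{1,3})(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*(?:-|\s{1,3})(\d{4}|\d{2})/gi;
  return text.replace(re, (whole, d, m, y) => {
    let year = Number(y);
    // Two-digit years: 00-19 become 20xx, 20-99 become 19xx (same pivot as the perl script).
    if (year < 100) year += year < 20 ? 2000 : 1900;
    return `${m} ${Number(d)}, ${year}`;
  });
}

// "One year" is 52 weeks, mirroring the perl script's approximation.
const YEAR_MS = 52 * 7 * 24 * 3600 * 1000;

// Accept a date only if it is at most a year ahead, at most 30 years back,
// and not the same (UTC) calendar day as "now".
function isPlausible(date, now = new Date()) {
  const t = date.getTime();
  if (t > now.getTime() + YEAR_MS) return false;
  if (t < now.getTime() - 30 * YEAR_MS) return false;
  return date.toISOString().slice(0, 10) !== now.toISOString().slice(0, 10);
}

console.log(normaliseDates("Paid 22nd August 2012"));
console.log(isPlausible(new Date(Date.UTC(2012, 7, 1)), new Date(Date.UTC(2012, 8, 20))));
```

Rejecting "today" guards against picking up a print or scan date stamped on the page instead of the date the document is actually about.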

To make this work, you’ll need to:

save the perl script somewhere
run “cpan Date::Extract”
load the applescript into the applescript editor
edit the location of the perl script near the top
compile and save the applescript to the usual scripts folder

Still very much a work-in-progress but getting fairly good results now. Please make sure you back up often in case there are any (more) bugs

mannigfaltig · November 11, 2012, 4:33pm

wow, a huge thank you!

marshalleq · March 15, 2016, 5:15am

Hi there, I have been searching for this feature for quite some time. Really I thought it should be possible to use a rule such as ‘choose first date’ and apply across a whole bunch of documents. That would be something DevonThink could do a solid job of to make the product much more useful out of the box. Thankfully you’ve had this idea too!

Since this is quite old now, I just wanted to check if you had any further updated versions of your scripts as you mentioned you may be still tweaking them.

Fantastic job for an ESSENTIAL feature!

Thanks!

Mezzanine · March 15, 2016, 7:39pm

Thanks. I did a bit more development and the latest version is on GitHub: https://github.com/gwilford/paperless-scripts

It’s not been updated for 3yrs but the GitHub version still works fine for me.

Regards,
Mezzanine

marshalleq · March 25, 2016, 9:01pm

Thanks I have tried it and it doesn’t yet work. But it’s my first foray into DevonThink scripting and the last time I did scripting, it was batch scripting in DOS, so this is rather above my current skillset!

So what am I missing here?

Mac-mini:Downloads username$ osascript Set\ Creation\ Date\ from\ Contents.scpt
Set Creation Date from Contents.scpt:4:5: script error: Expected end of line, etc. but found “<”. (-2741)

I took out the top blanks spaces with Vi:
Mac-mini:Downloads username$ vi Set\ Creation\ Date\ from\ Contents.scpt
Mac-mini:Downloads username$ osascript Set\ Creation\ Date\ from\ Contents.scpt
Set Creation Date from Contents.scpt:0:1: script error: A “<” can’t go here. (-2740)

chrillek · November 6, 2016, 12:14pm

I implemented something like that in JavaScript. It does not change the filename, only the creation date of the document. Changing the filename as well should be simple.

The code tries to take into account German and English versions of month names (if it recognizes at least the first three characters of it), but it will not work (yet) with English day variants like “31st” etc. If somebody needs that, they have to tweak the variable dayString.

I suggest this version because it is self-sufficient, doing all the regular expression stuff itself instead of passing it on to a Perl script. That should be faster than the variant that runs an external script to fetch the date from the OCR record.

The text in the dialog is in German, but that can be easily changed.



var months = ['Jan', 'Feb', 'M[äa]r', 'Apr', 'Ma[iy]', 'Jun', 'Jul', 'Aug', 'Sep', 'O[ck]t', 'Nov', 'De[cz]'];

var monthsRE = months.map(function (x) { 
   return new RegExp(x); });
   

var monthString = "(?:(0?[1-9]|1[012])[-./ ]+|(" 
 + months.join('|')  // All month names as alternatives
 + ")[a-z]*\\s+)";    // followed by possibly more characters (long month name) and at least one space

var dayString = "(0?[1-9]|[12]\\d|3[01])[-./ ]+";

var yearString = "((?:[12]\\d)?(?:\\d{2}))";

var REString = dayString + monthString + yearString;
var dateRE = new RegExp(REString);



var Devon  = Application("com.devon-technologies.thinkpro2");

Devon.includeStandardAdditions = true
 
var pr = Devon.properties();
var selection = pr['selection'];

if (!selection || selection.length === 0) {
  Devon.displayAlert("Erst Datensätze auswählen");
} else {
  for (var i = 0; i < selection.length; i++) {
    var record = selection[i];
	if (record.type() === 'PDF document') {
	    var t = record.plainText();
    	var found = t.match(dateRE);
		if (found) {
  		  console.log(found);
		  Devon.displayAlert(found[0]);
  		  var tag = +found[1];
		  var monat = found[2];
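		  if (monat === undefined) { // month was given as a name, not as a number
		    var monatName = found[3];
		    monthsRE.every(function (m, idx) {
			  if (m.test(monatName)) { // RegExp objects have test(), not match()
			    monat = idx + 1;
				return false; // stop at the first matching month
			  } else {
			    return true;
			  }	
			 });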
		  }
		  var j = +found[4];
		  var jahr = j < 100 ? +j + 2000 : j;
		  

		  var result = Devon.displayDialog(record.name() + 
		    '\nDatum ändern zu:' ,{
			    defaultAnswer: tag + '/' + monat + '/' + jahr,
		        buttons: ["Abbrechen", "Ändern"],
                          defaultButton: "Ändern"
                          });
	      if (result.buttonReturned === 'Ändern') {
		    record.date = new Date(jahr, monat-1, tag);
		  }
		} else {
		  Devon.displayAlert('Kein Datum gefunden');
		}
	  }
}
}
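
For the English day variants mentioned above (“31st” etc.), one possible tweak is to allow an optional ordinal suffix after the day number. The sketch below is standalone and only covers the month-name branch; `dayStringOrdinal`, `monthNames` and `yearPart` are names made up for this example, so treat it as a starting point rather than a drop-in change.

```javascript
// Day number, optionally followed by st/nd/rd/th, then at least one separator.
const dayStringOrdinal = "(0?[1-9]|[12]\\d|3[01])(?:st|nd|rd|th)?[-./ ]+";
// Month names (German and English), possibly continued to the long form.
const monthNames = "(Jan|Feb|M[äa]r|Apr|Ma[iy]|Jun|Jul|Aug|Sep|O[ck]t|Nov|De[cz])[a-z]*\\s+";
// Two- or four-digit year.
const yearPart = "((?:[12]\\d)?\\d{2})";

const ordinalDateRE = new RegExp(dayStringOrdinal + monthNames + yearPart);

const ordinalMatch = "Issued 31st October 2016".match(ordinalDateRE);
console.log(ordinalMatch && ordinalMatch.slice(1, 4)); // day, month name, year

const plainMatch = "Rechnung vom 16. August 2016".match(ordinalDateRE);
console.log(plainMatch && plainMatch.slice(1, 4));
```

The suffix group is non-capturing, so the numbering of the day, month and year groups stays the same as without it.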

marshalleq · November 7, 2016, 6:08am

Thanks, I will give it a try. I think it’s going to be nearly impossible for me though, unless I learn some scripting language. That will take the kind of time that I don’t have and really I think this should be built into Devonthink via a simple drag drop logic interface. Hopefully they read this and add it into their roadmap somewhere. That would be killer.

BLUEFROG · November 7, 2016, 2:01pm

A stitch in time saves nine.

Learning a computer language is worth the time you can end up saving.

Frederiko · November 7, 2016, 4:11pm

True, although this problem is a particularly hard one that would really benefit from some API support so that some of its incredible complexity is abstracted out.

It would be nice for example to be able to search by date, such as October 2016 and have the search return all results where the documents contain a reference to October 2016 in any one of the possible ways it might be expressed.

So searching for 2016/10/10 would return ‘10 Oct 2016’, ‘10 October 2016’, ‘10th Oct 2016’ … ‘Oct 2016’ in a descending order of probability of matching, something like what the AI does at the moment matching similar documents.

Add this to the DT’s devs todo list

Frederiko

BLUEFROG · November 7, 2016, 5:35pm

Not necessarily a simple task from an API standpoint either. In fact, it would be easier (though uglier) to hardcode permutations.

”Hey, Criss…” (ducks)
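
As a rough idea of what “hardcoding permutations” could look like, here is a small JavaScript sketch that expands one date into a list of spellings, most specific first. The list of formats and the names `ordinal` and `dateVariants` are assumptions for illustration; this is not anything DEVONthink offers.

```javascript
const MONTHS = ["January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"];

// 1 -> "1st", 2 -> "2nd", 11 -> "11th", 22 -> "22nd", ...
function ordinal(n) {
  const rem100 = n % 100;
  if (rem100 >= 11 && rem100 <= 13) return n + "th";
  switch (n % 10) {
    case 1: return n + "st";
    case 2: return n + "nd";
    case 3: return n + "rd";
    default: return n + "th";
  }
}

// Expand a date into common spellings, full dates before month-only forms.
function dateVariants(year, month, day) {
  const long = MONTHS[month - 1];
  const short = long.slice(0, 3);
  const dd = String(day).padStart(2, "0");
  const mm = String(month).padStart(2, "0");
  const variants = [
    `${day} ${short} ${year}`,
    `${day} ${long} ${year}`,
    `${ordinal(day)} ${short} ${year}`,
    `${ordinal(day)} ${long} ${year}`,
    `${year}/${mm}/${dd}`,
    `${year}-${mm}-${dd}`,
    `${dd}.${mm}.${year}`,
    `${short} ${year}`,
    `${long} ${year}`,
  ];
  return [...new Set(variants)]; // "May" is both long and short, so drop duplicates
}

console.log(dateVariants(2016, 10, 10));
```

Each variant could then be fed to an ordinary text search, with the position in the list standing in for the “descending order of probability”.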

pete31 · March 1, 2017, 9:01am

chrillek,
it works like a charm for dates like 26.09.2016. If it’s like 16. August 2016 the date will be recognized correctly (and displayed in the first dialog), but the month is not passed to the second dialog, which displays 16/undefined/2016.
Unfortunately I have no idea of JavaScript. Is it broken or just a little flaw?
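
That symptom looks like a small flaw rather than something fundamentally broken. When the month is written as a name, the capture group for the numeric month stays undefined, and a test such as `+value === 0` never fires because `+undefined` is NaN. The standalone sketch below shows the effect with a simplified pattern and English month names only; `sampleRE` and `monthNumber` are hypothetical names for this example.

```javascript
// Simplified pattern: day, then EITHER a numeric month OR a month name, then a year.
const sampleRE = /(\d{1,2})[-./ ]+(?:(0?[1-9]|1[012])[-./ ]+|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+)(\d{4})/;

const hit = "16. August 2016".match(sampleRE);
console.log(hit[2]);               // numeric-month group did not take part in the match
console.log(+hit[2] === 0);        // NaN is never equal to 0
console.log(hit[2] === undefined); // this is the check that detects the name branch

const NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Return the month as a number 1..12, or 0 if the name is not recognised.
function monthNumber(numeric, name) {
  if (numeric !== undefined) return Number(numeric);
  return NAMES.findIndex((n) => name.startsWith(n)) + 1;
}

console.log(monthNumber(hit[2], hit[3]));
```

With a check against `undefined` and a lookup like this, the dialog would get a month number in both the 26.09.2016 and the 16. August 2016 case.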