Last September, I wrote a fairly awkward parser that searched the “plain text” content of a record for date strings, renamed the record with a consistent YYYY-MM-DD date, changed the record’s “creation date” to match, and updated the source file’s modification time.
Find out the Date in an OCR Scanned PDF and Rename to Date
The version I posted kinda worked, but anyone installing it would also have had to download a Perl CPAN module, which in turn needed the Xcode package with the optional command line tools in place to be able to run ‘make’.
I’ve spent the last week dusting off the code and have…
uploaded it to GitHub
completely rewritten the date parsing logic
removed any requirement for CPAN modules
changed the naming to YYYY-MM if the date has no day (otherwise YYYY-MM-DD)
added logging and testing options
added a test suite
tested and tuned it against hundreds of OCR’d documents
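As a rough illustration of the YYYY-MM versus YYYY-MM-DD naming rule in the list above, here is a minimal sketch. It is written in Python rather than the script's own Perl, and the function name `date_prefix` is made up for the example, so treat it as a description of the rule rather than the actual code.

```python
from typing import Optional


def date_prefix(year: int, month: int, day: Optional[int] = None) -> str:
    """Build the date part of a record name (hypothetical helper).

    Returns YYYY-MM-DD when a day is known, otherwise YYYY-MM.
    """
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"


if __name__ == "__main__":
    print(date_prefix(2013, 9, 5))  # full date found in the text
    print(date_prefix(2013, 9))     # only month and year found
```

Zero-padding the month and day keeps the names sorting chronologically in a plain alphabetical listing.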
It’s under active development so a few changes are still likely.
The README has some basic instructions. All you need to do is copy the compiled AppleScript (.scpt file) to the usual DTP Scripts folder and copy the Perl code into the folder above (~/Library/Application Support/DEVONthink Pro 2/), or anywhere else if you want to edit the location within the AppleScript.
Graeme, this is clever and useful. Thank you for the major effort you invested in your creation – I am sure many will find it a very useful tool for document management.
If I’m parsing the code correctly, the routine appears to stop after the first date string match in a file. That’s a logical design decision, but perhaps the documentation should be explicit about it, because some files might have multiple candidate date strings, of which the first is not necessarily the one the user wants.
search for the first occurrence of ‘date’, i.e. /date:?\s+(.*)/is
search only the first 500 chars
search the rest of the document from 470 chars onwards
The first element was useful but led to fairly horrible inconsistencies like pulling out my DoB on a range of docs instead of the document authoring date. I also considered returning the chronologically earliest date found but again this proved too inconsistent.
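For anyone curious how that three-element search might hang together, here is a small sketch in Python (not the original Perl). Several things in it are assumptions of mine: the order in which the stages fall through, the helper names `find_date` and `staged_search`, and the deliberately narrow dd/mm/yyyy pattern standing in for the real date matching. The overlap between 470 and 500 presumably exists so that a date straddling the 500-character boundary is not cut in half.

```python
import re
from typing import Optional

# Hypothetical, deliberately narrow date pattern (dd/mm/yyyy style);
# the real parser recognises many more formats.
DATE_RE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{4}\b")

# Python equivalent of the Perl regex /date:?\s+(.*)/is
LABEL_RE = re.compile(r"date:?\s+(.*)", re.IGNORECASE | re.DOTALL)


def find_date(chunk: str) -> Optional[str]:
    """Return the first date-like string in chunk, or None."""
    match = DATE_RE.search(chunk)
    return match.group(0) if match else None


def staged_search(text: str) -> Optional[str]:
    """Try the three search elements in turn (assumed order)."""
    # 1. Look in the text that follows the first occurrence of 'date'.
    label = LABEL_RE.search(text)
    if label:
        found = find_date(label.group(1))
        if found:
            return found
    # 2. Look only in the first 500 characters.
    found = find_date(text[:500])
    if found:
        return found
    # 3. Look in the rest, starting at 470 so the two windows overlap.
    return find_date(text[470:])


if __name__ == "__main__":
    letter = "Letter sent 15/09/2013\nDate of birth: 01/02/1970\n"
    # The 'date' label stage wins, so the DoB is returned instead of
    # the authoring date: the inconsistency described above.
    print(staged_search(letter))
```

The sample letter shows why the first element was dropped: any "Date of birth:" label anywhere in the text beats an earlier, unlabelled authoring date.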
The current (No_CPAN_Modules) version simply returns the first well-formed date found within the document. It doesn’t compensate for bad OCR or appreciate the text column order in which the OCR or document authoring software stores text. DTP is responsible for passing the “document text” to the parser and it arrives in whatever order DTP chooses (actually I think this is determined by Spotlight).
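To make "first well-formed date" a bit more concrete, here is a hedged Python sketch of the idea (the real code is Perl and handles far more formats). The pattern, the day-first reading of the numbers, and the function name `first_well_formed_date` are all assumptions for illustration. Candidates are checked in the order they appear in the text, and anything that is not a real calendar date is skipped.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical candidate pattern: d/m/yyyy with '/', '.' or '-' separators.
CANDIDATE_RE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b")


def first_well_formed_date(text: str) -> Optional[date]:
    """Return the first candidate that is a valid calendar date."""
    for match in CANDIDATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            # date() raises ValueError for impossible dates such as 31/02.
            return date(year, month, day)
        except ValueError:
            continue  # malformed (e.g. bad OCR), try the next candidate
    return None


if __name__ == "__main__":
    sample = "Ref 31/02/2013, issued 05/09/2013, due 01/10/2013"
    print(first_well_formed_date(sample))
```

Note that nothing here reorders the text, so the result still depends entirely on the order in which the document text arrives.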
I should also say that the code ought to work with any DTP document with some text component, not just PDFs. I haven’t tried with others but as long as a record holds some plain text, it should work. Any docs without plain text are simply skipped over.