Solved: Renaming by finding the correct date, i.e. the past date least distant from the present date, or the present date itself

enGeo · April 12, 2020, 7:33pm

I am trying to extract the correct date from a PDF file which contains several dates.

The files are often read by OCR, and the built-in options “Document Date”, “Oldest Document Date”, and “Newest Document Date” often cannot find the date I need for renaming the file.

Is there a way, via a smart rule, Hazel, Keyboard Maestro, AppleScript, or similar, to find the following?

the past date least distant from the present date, or the present date itself

Thanks for your help!

Edit: Hazel can do this.

BLUEFROG · April 13, 2020, 3:13am

Welcome @enGeo
If Hazel can do it, please post screen captures of how it’s accomplished.

enGeo · April 13, 2020, 1:17pm

This is aimed at newly received documents, e.g. scanned mail, recent invoices, etc.
It does not help much when importing and renaming old documents.

BLUEFROG · April 13, 2020, 1:59pm

So you’re trying to find dates with the same format with a positional parameter?

Example, a document could have contents like so…

Born: 12.05.1901
Married: 07.13.1925
Died: 12.15.1966

And you’d look for a first, second, or third number?

enGeo · April 13, 2020, 2:20pm

For example, I receive mail from my health insurance company with the following data (this is also the order in which Hazel finds the dates in the document, due to the OCR process):

born: 12.05.1980
member since: 07.13.1999
date of the mail: 13.04.2020 (date I am looking for)
birthday of child: 12.15.2011

rule d1 finds 12.05.1980, which is not within the last 7 days and is therefore not matched
rule d2 finds 07.13.1999, which is not within the last 7 days and is therefore not matched
rule d3 finds 13.04.2020, which is today and is therefore matched
rule d4 and the following rules find nothing because the file has already been moved by rule d3

→ file is moved to target folder and renamed as 2020-04-13

The number of d-rules depends on how many dates the document contains, or how many you want to find. In a multi-page document the date being searched for is usually on the first page, so for invoices and the like there are probably no more than six dates (→ six d-rules) on page one, right?

→ You need more d-rules than there are dates on the first page of the document.
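The rule cascade described above boils down to walking through the dates in the order OCR returned them and taking the first one that falls within the last seven days. The following Python sketch illustrates only that logic; it is not how Hazel implements it. The function name `first_recent_date` is hypothetical, the dates are assumed to be parsed already, and the ambiguous 12.05.1980 is read here as 12 May.

```python
from datetime import date, timedelta

def first_recent_date(dates, today, window_days=7):
    """Return the first date (in document order) that lies within the
    last `window_days` days, today included, or None if none matches."""
    earliest = today - timedelta(days=window_days)
    for d in dates:  # each iteration plays the role of one d-rule
        if earliest <= d <= today:
            return d
    return None

# Dates in the order OCR found them (already parsed; parsing is a separate problem).
found = [date(1980, 5, 12), date(1999, 7, 13), date(2020, 4, 13), date(2011, 12, 15)]
match = first_recent_date(found, today=date(2020, 4, 13))
print(match.isoformat() if match else "no match")  # 2020-04-13
```

A loop like this does not need one rule per date, but it shares the limitation mentioned earlier: it only works for documents processed within a few days of the date printed on them.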

It is an approach that came to my mind and works for me.
On Twitter, @devontech considered this “trying to get dates in an unusual way here”.
Is this approach “unprofessional” i.e. do you have another idea to approach this?

BLUEFROG · April 13, 2020, 3:07pm

I wouldn’t characterize it as professional or unprofessional, but it’s definitely the first request of this kind I’ve seen.

From your description above, the Newest Document Date would match. Here’s an AppleScript result for a text document with the dates you mentioned above…

enGeo · April 13, 2020, 4:02pm

How does this script perform regarding due dates in invoices? The result would not be the date the document was “written” but the due date and therefore not the one I am looking for, right?

BLUEFROG · April 13, 2020, 4:43pm

Getting document dates is entirely dependent on content of the specific document. If an invoice has several dates, the desired date may or may not be detected.

What do you mean by due date? Are you processing accounts payable that still need to be paid, or invoices that have been paid? (And again, that underscores what I said in the first paragraph.)

enGeo · April 13, 2020, 4:58pm

There often are future dates in the document.
That’s why the newest date doesn’t work for me.

BLUEFROG · April 13, 2020, 5:57pm

You seem to be discussing a wide variety of documents with differing data, i.e., not very uniform. This makes the matter very difficult to automate at face value.

enGeo · April 13, 2020, 7:32pm

indeed!

I am in the process of going paperless, i.e. scanning every paper-based document, running OCR on it, renaming it, archiving it, setting a reminder for it, tagging it, etc. Except for scanning and OCR, the same steps are applied to digital documents.
The hub for this is DEVONthink, accompanied by DTTG and several other apps on Mac, iPad and iPhone.

This is the private side to it. Additionally, I also organize the professional aspects with DEVONthink.

Thanks to your replies and questions, I now understand much better how important knowledge of the setting and requirements is in order to understand a problem and then help with it. Thank you.

BLUEFROG · April 13, 2020, 7:42pm

You’re welcome

We are working on some things in here that may be useful in the future, so stay tuned for future updates.

enGeo · April 13, 2020, 7:45pm

I am very much looking forward to any update of your astonishing piece of software.

BLUEFROG · April 14, 2020, 3:40am

Thanks. While we have other projects needing time and resources too, more good stuff is coming.

enGeo · April 14, 2020, 7:55pm

I think I would need a script like this:

extract dates from document as date01, date02, …
if date01 is earlier than current date
    subtract date01 from current date
else
    subtract current date from date01
set result as result01
repeat with all dates
compare results
the date with the smallest result is the date I am looking for

Unfortunately I am not familiar with AppleScript.
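For illustration, the steps above can be sketched in Python rather than AppleScript. This is a minimal sketch that assumes the dates have already been extracted and parsed, which is the hard part. The function name `closest_date` and the `allow_future` switch are hypothetical additions; the switch covers the case where future dates such as due dates should be ignored.

```python
from datetime import date

def closest_date(dates, today, allow_future=True):
    """Return the date with the smallest absolute distance to `today`.
    With allow_future=False, dates after `today` are ignored."""
    candidates = [d for d in dates if allow_future or d <= today]
    if not candidates:
        return None
    # abs() replaces the if/else in the outline: the order of subtraction no longer matters.
    return min(candidates, key=lambda d: abs((d - today).days))

found = [date(1980, 5, 12), date(2020, 4, 13), date(2020, 4, 20)]
print(closest_date(found, today=date(2020, 4, 14)))                      # 2020-04-13
print(closest_date(found, today=date(2020, 4, 18)))                      # 2020-04-20
print(closest_date(found, today=date(2020, 4, 18), allow_future=False))  # 2020-04-13
```

The second call shows why a plain "smallest difference" can pick a future date once a few days have passed, and the third restricts the search to the past or the present.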

BLUEFROG · April 14, 2020, 9:04pm

Date extraction isn’t a foolproof thing as the format can vary. Also the system Language & Region settings affect date coercions, so if I run the script using US English I will get a different result than you running Deutsch in Germany.

Also, if you have future dates for accounts payable, there’s a good chance they won’t be closer than today’s date.

enGeo · April 14, 2020, 9:20pm

By default, Hazel will try and detect the date format for you. This is indicated by the “Automatically detect date format” checkbox. When checked, Hazel will try to determine dates in various formats as defined by your system.

I have the feeling that Hazel is pretty good at this.

Developer of Hazel in May 2017:

Hazel is actually using Apple’s data detectors for this.

Do you think it’s possible to use the identified dates? As a file name it can be used. Can they also be used in a script?

To write scripts that work with ADD, you must have the Apple Data Detectors Scripting scripting addition, which supplies some required new terminology. To write custom detectors (which are simple text files), download the Apple Data Detectors SDK, which includes the Detector Editor and a manual.
source

BLUEFROG · April 14, 2020, 9:46pm

That scripting addition is very old.

Development would have to assess using data detectors in this way (or have an alternate idea). But it still doesn’t address what I said previously.

enGeo · April 14, 2020, 9:50pm

Seems for the moment I can be happy with what I set up in Hazel.

Is this something different? https://nshipster.com/nsdatadetector/

BLUEFROG · April 14, 2020, 9:55pm

Here is an example return from a test document, again bearing in mind the results will vary by language/region. See how the use of MM/DD/YYYY or DD/MM/YYYY or YYYY/MM/DD makes a difference in the coercion…

English (US)

	(**)
	(*12.31.1999 = Friday, December 31, 1999 at 12:00:00 AM*)
	(*1980.12.05 = Tuesday, December 12, 12169 at 12:00:00 AM*)
	(*13.07.1999 = Friday, January 7, 2000 at 12:00:00 AM*)
	(*04.13.2020 = Monday, April 13, 2020 at 12:00:00 AM*)
	(*12.15.2011 = Thursday, December 15, 2011 at 12:00:00 AM*)

Deutsch - German

	(**)
	(*12.31.1999 = Donnerstag, 12. Juli 2001 um 12:00:00 AM*)
	(*1980.12.05 = Dienstag, 3. Mai 2011 um 12:00:00 AM*)
	(*13.07.1999 = Dienstag, 13. Juli 1999 um 12:00:00 AM*)
	(*04.13.2020 = Montag, 4. Januar 2021 um 12:00:00 AM*)
	(*12.15.2011 = Montag, 12. März 2012 um 12:00:00 AM*)

Also notice that the top (**) in each set of results is a matched ten-digit serial number that was an invalid value to coerce to a date.
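The same ambiguity can be reproduced outside AppleScript. The Python sketch below parses a few of the strings above with an explicit month-first and an explicit day-first format; the helper name `parse_with` is hypothetical. Unlike the coercions shown above, which appear to roll an out-of-range month over into the next year, `datetime.strptime` rejects such input, so the sketch returns `None` for it.

```python
from datetime import datetime

MONTH_FIRST = "%m.%d.%Y"  # US-style reading
DAY_FIRST = "%d.%m.%Y"    # German-style reading

def parse_with(fmt, text):
    """Parse `text` with an explicit format; return None if it does not fit."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

for text in ["12.05.1980", "13.07.1999", "04.13.2020"]:
    print(text, parse_with(MONTH_FIRST, text), parse_with(DAY_FIRST, text))
# 12.05.1980 1980-12-05 1980-05-12
# 13.07.1999 None 1999-07-13
# 04.13.2020 2020-04-13 None
```

A string like 12.05.1980 is valid under both readings, so no parser can decide which one was meant without knowing where the document came from.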

And I’m not saying it’s impossible to do in AppleScript. I’m saying it’s not a trivial matter and the lack of uniformity of incoming data, system settings, etc. all have an effect. It’s easy to write something that parses one document from one source. Different documents from different sources add complexity quickly.