Regex to extract dates from pdfs

Wolkenhauer · June 9, 2022, 10:07am

One application of smart rules, that has been discussed here in several places, is the extraction of dates from pdf files. In this context, I compiled a cheat sheet from several places and hope others may find it useful.

Using Regular Expressions.pdf (3.5 MB)

EDIT: I have compiled everything we learned in this thread in the pdf above.

Wolkenhauer · June 9, 2022, 10:14am

In my own example, I am working on, I want to file receipts for tax purposes. I scanned a large number of them with Readdles Scanner Pro on the iPhone. This app even allows for a workflow to rename a file to something like “receipt-YYYY-MM-DD” and upload it to a Dropbox folder, where I collect them (a folder indexed by DT). Unless I scan the receipts immediately, the date of scanning will not the correct one, hence the desire for a smart rule in DT to help.

In my collection for this year, I only found four variations, which may be an advantage of receipts that there will most likely be only one date in the file. The variations I am trying to cover are:

Variation 1: DD.MM.202Y (date and month with or wihout leading ‘0’)
- Variation 1.1: D.MM.202Y
- Variation 1.2: D.M.202Y
- Variation 1.3: D.M.2Y
Variation 2: DD.M.202Y (without leading 0 for date and month)
- Variation 2.1: DD.M.2Y
Variation 3: 202Y-MM-DD (date and month with or without leading ‘0’)
Variation 4: DD/MM/202Y (date and month with or wihout leading ‘0’)
- Variation 4.1: D.M.202Y

The sub-cases did not actually occur but I can imagine it may happen.

I assume that restricting the year to 202Y should also make the search quite specific. (I should not be later with my tax than two years back

Wish me luck and in case someone has an idea for my case, let me know!

Olaf

chrillek · June 9, 2022, 10:59am

Although I do not quite understand what a “correct date” means, here’s what I suggest as a regex:
\d+[-.]\d+[-.]2\d+ for the first variation and its subclasses.
Oh, well, that catches the second variation, too. And the fourth one.
˚202\d-\d+-\d+` for the third one.

However, I’d advise against restricting the years at all. This kind of RE will stop working in eight years time, and then you might be wondering why. If I understand it correctly, you are dealing with two basic date formats: German (day separator month separator year) and ISO (year separator month separator year). And fortunately no named months.
The first one might be captured like this:
(\d\d?)[-.\/](\d\d?)[-.\/](\d{2,4})
Dissected:

\d\d? is a single digit with an optionally following one. Captures everything from 0 to 99, including leading zeros
[-.\/] captures one of the characters ‘-’, ‘.’ or ‘/’. Note that the latter one might have to be escaped since it is used to limit REs. Not doing that might result in problems, escaping does no harm. Also note that the - has to come first in a character class! Otherwise, it is indicating a range (as in [1-9] in the example below).
\d{2,4} captures between two and four digits, i.e. the year in the complete and abbreviated form.

This works for all years, but it has (of course) some shortcomings, since it would also capture something like “45-07-1923” which might be a telefon number or a code used for archiving or what not. In order to prevent that, you could refine the expression, and that’s where the fun starts:
\b(0?[1-9]|[12]\d|3[01])[-.\/](0?[1-9]|1[012])[-.\/](\d{2,4})

\b means a word boundary, i.e. we do want a stand-alone date string. More on that below
0?[1-9] catches all single-digit days with an optional leading zero
[12]\d catches all days between the 10th and the 29th
3[01] catches the last one or two days of the month for all months but February
| is the alternation operator so that the three preceding sub-expressions are alternatives and the first matching one is selected
Something similar is happening for the months, so no need to dissect that.

Now, this expression will not match “45-07-1923”, but only because of the leading \b. Otherwise it would find a match starting at the “5”, and we do not want that. So the leading \bensures that, and we’d even be more secure with appending the same to the end of the RE (thus preventing partial matches for something like “12-34-5678AB”.

Also, this refined RE would happily match “30-02-2022”. One can prevent that, too, with some more work.

In any case, the day, month and year captured in groups () and accessible later als $1, $2, and $3.

I recommend fiddling around with REs in https://regex101.com. It’s very easy to test REs against real-life strings there: As you can see below, matched and unmatched strings are clearly marked.

Wolkenhauer · June 9, 2022, 11:49am

Thank you @chrillek !! This is much appreciated. I keep trying with CSS and automation, but I am very, very slow … Olaf

BLUEFROG · June 9, 2022, 1:25pm

I am sure there are things you know well that we would fumble with

Wolkenhauer · June 10, 2022, 7:13pm

Dear @chrillek , would you be so kind to provide advice on your RE from above?

I have created a Smart Rule with it and it works really well, on most receipts. Here is the smart rule, in case someone else is interested:

An example where it does not work, is the following receipt:

Using the regex101 page, one can check that the problem arises from the space in front of the 1. First without a space, working fine:

With the space:

Is there a quick fix?

Thank you

Olaf

Wolkenhauer · June 10, 2022, 10:05pm

Ok, one can add another alternative to the single digit months element: ?[1-9] (there is a space before the ?)

If a leading 0 is missing, or a space is there, this will also be passed on to the filename. So, some file may be named Receipt-2022-4-05.pdf and another Receipt-2022-04-05.pdf

chrillek · June 11, 2022, 8:32am

If spaces can be present in the dates, they can be matched by \s* (one or more white space characters) at the appropriate place. Which would be after the separator’s character class and before the next capturing group.

( ?, i.e. a space followed by a question mark inside a capturing group is wrong for several reasons:

It will match a maximum of one space (the question mark), while there could be more
It will match only a space but not a tab character or a non-breaking space.
It will capture the space in the group. That leads, as you’ve noticed, to it being added to the filename in the replacement expression (obviously, because it’s captured in the month part of this RE, so it can’t disappear from this part by magic). Just as the separators, the space(s) has/have to be matched outside of a capturing group.

The space will only be passed on if you’re using a capturing group, as I explained.

For the “missing” zeroes: You’re using regular expressions (as mentioned in the title of the thread). Not a full-fledged “rename my file according to certain rules” program. A regular expression (at least in this context) can’t by itself add zeroes when they’re “missing” (rather: the user wants them to appear out of thin air). What you want could be achieved in JavaScript’s replace method with a replacerFunction (see here). But that’s outside of a normal RE replace as DT uses it. Of course, a smart rule could simply execute a script that parses the record, transforms the date as needed and then sets the filename. I’ve posted something like that before.

I know that you probably will not like the following, but still:
Everything I said is already explained all over the net. It’s actually worth the effort to read about REs if one is going to need them.

Wolkenhauer · June 11, 2022, 10:12am

Thank you @chrillek (!) but why ending on that note? You provide great help, only to tell me that I could/should just look it up

I am going to compile this thread later and share this here. For the ordinary user, the steps and alternatives discussed by you, are helpful.

chrillek · June 11, 2022, 10:53am

It’s a bit like the old saying “Give a man a fish and he is not hungry for a day. Teach them how to fish, and they’ll never be hungry again.” Shelling out recipes to solve one particular problem does not really help. Learning about the technique does (which does require reading and experimenting, yes).
That’s also the reason why I explained the REs in question in detail.

Wolkenhauer · June 11, 2022, 3:40pm

I have compiled the thread and more into a pdf file which I uploaded at the beginning of this thread. If you are interested in extracting dates from texts and using regex in smart rules, have a look at the pdf file. It contains all of @chrillek 's tips as well and a step by step description of what I did.

Lynx · June 14, 2022, 2:51pm

Thank you for your contribution.