Regex to extract dates from PDFs

If spaces can be present in the dates, they can be matched by \s* (zero or more whitespace characters) at the appropriate place, namely after the separator’s character class and before the next capturing group.

( ?, i.e. a space followed by a question mark inside a capturing group, is wrong for several reasons:

  • It will match a maximum of one space (the question mark means “zero or one”), while there could be more.
  • It will match only a space but not a tab character or a non-breaking space.
  • It will capture the space in the group. That leads, as you’ve noticed, to the space being added to the filename in the replacement expression: it’s captured in the month part of this RE, so it can’t disappear from there by magic. Just like the separators, the spaces have to be matched outside of a capturing group.
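To make the difference concrete, here is a minimal sketch in JavaScript. The sample string, the day/month/year order and both patterns are my assumptions for illustration, not your actual RE; the point is only where the whitespace is matched.

```javascript
// Hypothetical sample: a date with a stray space after the first separator.
const sample = "Invoice 12. 03.2021";

// Space matched INSIDE the month group: it is captured along with the digits.
const inside = /(\d{1,2})[.\/-]( ?\d{1,2})[.\/-](\d{4})/;

// Whitespace matched OUTSIDE the groups: it is consumed but not captured.
const outside = /(\d{1,2})[.\/-]\s*(\d{1,2})[.\/-]\s*(\d{4})/;

const monthInside = sample.match(inside)[2];
const monthOutside = sample.match(outside)[2];

console.log(JSON.stringify(monthInside));  // " 03"
console.log(JSON.stringify(monthOutside)); // "03"

// With the second pattern, the replacement no longer carries the space along.
const renamed = sample.replace(outside, "$3-$2-$1");
console.log(renamed); // Invoice 2021-03-12
```

The same idea should carry over to other RE flavours, though how \s treats non-breaking spaces can differ between engines.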

The space will only be passed on if it’s matched inside a capturing group, as I explained.

For the “missing” zeroes: You’re using regular expressions (as mentioned in the title of the thread), not a full-fledged “rename my file according to certain rules” program. A regular expression (at least in this context) can’t by itself add zeroes when they’re “missing” (rather: the user wants them to appear out of thin air). What you want could be achieved with JavaScript’s replace method and a replacerFunction (see here). But that’s outside of a normal RE replace as DT uses it. Of course, a smart rule could simply execute a script that parses the record, transforms the date as needed and then sets the filename. I’ve posted something like that before.
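As a rough sketch of the replacer-function approach in plain JavaScript: the function name toIsoDate, the pattern and the day/month/year input order are assumptions of mine, and setting the filename in DT is deliberately left out.

```javascript
// Assumed input format: day, separator, optional whitespace, month, separator, year.
const datePattern = /(\d{1,2})[.\/-]\s*(\d{1,2})[.\/-]\s*(\d{4})/;

// Hypothetical helper: rewrites the first date found as YYYY-MM-DD.
function toIsoDate(text) {
  return text.replace(datePattern, (match, day, month, year) =>
    // padStart adds the leading zero that a plain RE replacement cannot.
    `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`
  );
}

console.log(toIsoDate("Receipt 3. 7.2021")); // Receipt 2021-07-03
```

The padding happens in code, not in the RE: the pattern only finds and splits the date, and the function decides what the replacement looks like.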

I know that you probably will not like the following, but still:
Everything I said is already explained all over the net. It’s actually worth the effort to read about REs if one is going to need them.
