Syntax for using regular expression named capture group substitution

gtackett · July 2, 2023, 10:57pm

The ICU Regular Expressions docs say that ${name} will be replaced with the text matched by a named capture group.

But when I scan text with this regular expression: .+(?<number>\d++).* and do a “Display Alert” action with the text \1 ${number}, what I see in the alert is the matching digit string followed by, literally, ${number}.

What is the right way to do this?

BLUEFROG · July 3, 2023, 12:17am

What are you trying to match?

chrillek · July 3, 2023, 6:23am

Are you sure that your syntax is correct? Iirc, (?...) defines a non-capturing group. And what is \d++ supposed to match?

gtackett · July 4, 2023, 2:03am

Let me amend the regular expression to be consistent with a change I made. What I’m trying now is:
^(.+?)(?<number>\d+?).*
This is a stripped-down version of a regex that wasn’t working as I expected.

The plain text file I’m applying the rule to contains just this:

abc123

The whole rule looks like this:

When I run it, the result is this:

gtackett · July 4, 2023, 2:11am

A non-capturing group is denoted by (?:…).
(?<name>…) is a named capture group.
++ matches the preceding expression 1 or more times, and is called a possessive match. Like simple + it matches as many times as possible, but unlike simple +, once it has matched it gives nothing back: it prevents backtracking.
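
As a rough illustration of these three constructs, here is a small sketch in Python rather than ICU. It rests on two assumptions worth stating: Python's re module spells a named group (?P<name>…) instead of (?<name>…), and it accepts possessive quantifiers only from Python 3.11 on, so the possessive check is skipped on older interpreters.

```python
import re
import sys

# (?:...) groups without capturing, so the named group is the only capture.
# Python writes named groups as (?P<name>...); ICU writes (?<name>...).
m = re.match(r"(?:abc)(?P<number>\d+)", "abc123")
only_groups = m.groups()        # contains just the named group
number = m.group("number")

# Greedy \d+ first takes all of "123", then backtracks one digit so that
# the literal 3 at the end of the pattern can still match.
greedy = re.match(r"\d+3", "123")

# Possessive \d++ also takes all of "123" but never gives a digit back,
# so the literal 3 has nothing left to match and the whole match fails.
if sys.version_info >= (3, 11):
    possessive = re.match(r"\d++3", "123")
else:
    possessive = None  # placeholder only: ++ is not accepted before 3.11

print(only_groups, number, greedy.group(0), possessive)
```

The \d+3 versus \d++3 pair is only there to make the backtracking difference visible; in the pattern from the first post nothing follows the digits that would force a backtrack, so + and ++ would capture the same text there.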

chrillek · July 4, 2023, 6:16am

Thanks for pointing that out. I never use named capturing groups, so I’m not familiar with the syntax.
The named groups might be accessible by number, too (\2, in this case). Did you try that?

Edit: To answer my own question: using \1 \2 in this context works. This seems to indicate that DT’s regex handling is not yet up to dealing with named capturing groups (it seems they were not available initially in Apple’s RE implementation). Maybe this could be fixed, @cgrunenberg?
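
For anyone who wants to keep named placeholders in a template while the engine only honours numbered ones, the idea can be sketched outside DEVONthink. The following is a Python illustration, not DEVONthink's or ICU's API: expand_named_refs is a hypothetical helper, and Python writes the group as (?P<number>…) and numbered references as \g<N>.

```python
import re

def expand_named_refs(pattern: "re.Pattern", template: str) -> str:
    """Rewrite ICU-style ${name} placeholders as numbered \\g<N> references.

    Hypothetical helper for illustration; it looks up each name in the
    compiled pattern's groupindex mapping (name -> group number).
    """
    def to_number(m: "re.Match") -> str:
        return rf"\g<{pattern.groupindex[m.group(1)]}>"
    return re.sub(r"\$\{(\w+)\}", to_number, template)

# Python spelling of ^(.+?)(?<number>\d+?).* from this thread.
pattern = re.compile(r"^(.+?)(?P<number>\d+?).*")

template = expand_named_refs(pattern, r"\1 ${number}")  # becomes \1 \g<2>
result = pattern.sub(template, "abc123")

# Group 1 is "abc"; the lazy \d+? takes a single digit, so group 2 is "1".
print(template, "->", result)
```

Since the named group is also group 2, the numbered form reaches the same text; the cost is that the template has to be regenerated whenever groups are added or reordered.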

cgrunenberg · July 4, 2023, 8:28am

This isn’t supported currently; a future release might improve this.

BLUEFROG · July 4, 2023, 2:42pm

But why are you trying to use a named capture group? Is it really necessary?

Given this text, what would you expect your RegEx to return?

abc123
123
s456
This is something with a number in it: 5.87
123 precedes words.

chrillek · July 4, 2023, 5:08pm

The regular expression
^(.+?)(?<number>\d+?)
should match
abc123 such that $1 is “abc” and ${number} is “1” (the lazy \d+? stops after one digit because nothing after it requires more; a greedy \d+ would give “123”)
123 => $1 is “1” and ${number} is “2” (the 1 matches the non-greedy .+?)
s456 => $1 is “s” and ${number} is “4”
This is something with a number in it: 5.87 => $1 is “This is something with a number in it: ” and ${number} is “5” (the non-greedy .+? stops in front of the first digit)
123 precedes words. => $1 is “1” and ${number} is “2” (because the second capturing group is not anchored)
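
Because \d+? is lazy and nothing after it forces it to take more, a single digit is enough for it to succeed. Expectations like these are easy to check mechanically; the sketch below does so in Python, on the assumption that lazy quantifiers behave the same way as in ICU (only the group-name spelling, (?P<number>…), differs). The greedy variant is added purely for comparison.

```python
import re

# Python spelling of ^(.+?)(?<number>\d+?)
lazy = re.compile(r"^(.+?)(?P<number>\d+?)")
# Same pattern with a greedy \d+ for comparison.
greedy = re.compile(r"^(.+?)(?P<number>\d+)")

samples = [
    "abc123",
    "123",
    "s456",
    "This is something with a number in it: 5.87",
    "123 precedes words.",
]

lazy_captures = {}
greedy_captures = {}
for line in samples:
    # Every sample contains a digit after at least one character,
    # so both patterns match each line.
    lazy_captures[line] = lazy.match(line).groups()
    greedy_captures[line] = greedy.match(line).groups()
    print(repr(line), "lazy:", lazy_captures[line], "greedy:", greedy_captures[line])
```

In both variants the non-greedy .+? has to consume at least one character, which is why a line that starts with digits loses its first digit to group 1.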

gtackett · July 4, 2023, 5:09pm

This is a simplified example to show the problem, rather than a real use case.

In the more complex real case, named groups would make understanding the regex and the substitution string much easier.

BLUEFROG · July 4, 2023, 6:41pm

Without examples that more closely match real-world use, it’s difficult to assess this, including whether named groups are necessary.

gtackett · July 7, 2023, 1:48am

Whether or not named groups are necessary in my tiny example should have no bearing on whether they work.

I’ve always believed that when trying to pinpoint why something doesn’t work, a minimal reproducer is a good idea. After all, troubleshooting something small and simple is generally easier than troubleshooting something larger and/or more complex. That’s why I trimmed my real use case down to this tiny example: my real use case wasn’t working, so I went after a tiny example that showed the same problem: named capture groups not working.