Regex search and replace: No captured group?

bws950 · March 20, 2021, 11:52am

Forgive me if I’m using the wrong terminology here, but is there a way to use more than 9 replacements? When I use \10, I get [Replacement1]0 instead of [Replacement10]. And I don’t seem to be able to use named groups to solve the problem.

(For context, I’m trying to use Regex to pull information from the name of PDFs I’ve saved to DT – which include case names and citations for legal cases – and then save that information in custom metadata fields using a smart rule.)

chrillek · March 20, 2021, 12:16pm

Apparently not. The syntax is described here: ICU User Guide | ICU Documentation
You could try to use a named capture group, though. Like

(?<name>...)

where the angle brackets are part of the expression (!). So

/(?<case>CA-\d\d-\d{4})/California Case ${case}/

If DT relies on NSRegularExpression, this should work. If (and that’s a big if!) Apple has implemented the ICU set as it promised.
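To make the named-group idea concrete, here is a minimal sketch in Python. Python's re module is not ICU, so the spelling differs: Python writes the group as (?P<case>...) and the back-reference as \g<case>, where ICU uses (?<case>...) and ${case}. The sample string CA-12-2021 is made up for illustration. The last lines show how Python avoids the \10 ambiguity with \g<1>0; whether DT offers an equivalent is not established here.

```python
import re

# Python spelling of the named group; ICU would write (?<case>...).
pattern = re.compile(r"(?P<case>CA-\d\d-\d{4})")

# Hypothetical sample input, not taken from a real file name.
sample = "CA-12-2021"

# \g<case> is Python's counterpart to ICU's ${case}.
renamed = pattern.sub(r"California Case \g<case>", sample)
print(renamed)

# \g<1>0 means "group 1, then a literal 0", which avoids the \10 ambiguity.
padded = re.sub(r"(\d+)", r"\g<1>0", "92")
print(padded)
```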

BLUEFROG · March 20, 2021, 3:22pm

It does.

bws950 · March 21, 2021, 11:39am

Should this work in a Smart Rule that’s set up to alter custom metadata? It’s not working for me.

To be more detailed, here’s a set of (made-up) sample data that I’ve designed my regex to account for. As you can see, there are differences in the party names (the parts before and after the “v.”), and differences in the precise format of the citation (because they can come in different volumes of the case reporter, which is the first number; can come in different case reporters, which is the “U.S.” or “F.3d”; can start on different pages of a given reporter volume, which is the final number known as the “pincite”; can include information about the court - D.C. Cir., for example – or not; and are decided in different years):

United States v. Harris, 515 U.S. 12 (2004)
Thomas v. United States, 32 U.S. 535 (1999)
Jones v. Smith, 92 F.3d 1112 (D.C. Cir. 1991)

Here’s the regex I’ve written:

(?<Party1>((?=[A-Za-z])(.*?)(?=\sv.\s)))\sv.\s(?<Party2>((?=[A-Za-z])(.*?)(?=,\s[1-9]*[0-9]))),\s(?<Volume>([1-9]*[0-9]))\s(?<Reporter>([A-Z].[A-Z]*[A-Za-z0-9][.d]))\s(?<Pincite>([0-9]*[0-9]))\s\((?<Court>(?<=\()(.*?(?=[1-2][0-9][0-9][0-9])))(?<Year>([1-2][0-9][0-9][0-9]))

That seems to correctly capture all three of my examples above. Using the program Patterns to test these, here’s what I get:

But when I try to use a Smart Rule in DT to set custom metadata for those fields, I’m unable to get it to work. Here’s the Smart Rule I’ve set up:

For the Jones v. Smith example, the case name that rule produces is just the literal text: ${Party1} v. ${Party2}

By contrast, if I set the Change Case Name field in the smart rule to \1 v. \4, it produces: Jones v. Smith

I.e., it labels it correctly. (I assume that \4 is returning Smith because there are so many subgroups within the regex. If I set the case name field to \1 v. \2, I get: Jones v. Jones )

For the case name, this isn’t a problem, because I can just use \1 v. \4 to set metadata. But \9 returns the “Reporter” field - i.e., the “F.3d” in this example - so I’m unable to get anything after that - the pincite, court, and year - into the metadata.
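The numbering described here can be checked outside DT. The following is a rough sketch in Python, not a test of DT or NSRegularExpression: it uses Python's (?P<name>...) spelling in place of (?<name>...), and it assumes the literal opening parenthesis before Court and the one inside its lookbehind are escaped as \(. Every capturing parenthesis, named or not, takes the next number, which is why Party2 lands on 4 and Reporter on 9.

```python
import re

# The posted pattern in Python syntax: (?P<name>...) instead of (?<name>...).
# Assumption: the "(" before Court and the one inside its lookbehind
# are escaped as \( so that the parentheses balance.
pattern = re.compile(
    r"(?P<Party1>((?=[A-Za-z])(.*?)(?=\sv.\s)))\sv.\s"
    r"(?P<Party2>((?=[A-Za-z])(.*?)(?=,\s[1-9]*[0-9]))),\s"
    r"(?P<Volume>([1-9]*[0-9]))\s"
    r"(?P<Reporter>([A-Z].[A-Z]*[A-Za-z0-9][.d]))\s"
    r"(?P<Pincite>([0-9]*[0-9]))\s\("
    r"(?P<Court>(?<=\()(.*?(?=[1-2][0-9][0-9][0-9])))"
    r"(?P<Year>([1-2][0-9][0-9][0-9]))"
)

# A named group's number is its position among all capturing "(";
# lookaheads and lookbehinds do not count.
numbers = dict(pattern.groupindex)
print(numbers)

m = pattern.search("Jones v. Smith, 92 F.3d 1112 (D.C. Cir. 1991)")
first_two = m.expand(r"\1 v. \2")    # group 2 is nested inside Party1
one_and_four = m.expand(r"\1 v. \4")  # group 4 is Party2
print(first_two)
print(one_and_four)
print(m.group(9), m.group(11), m.group(15))
```

With this pattern the pincite, court, and year sit at groups 11, 13, and 15, which is out of reach if only \1 through \9 are available.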

Apologies for hijacking an old thread with such a specific question, but I’d be thrilled to be pointed in the right direction on how to fix this. Thank you!

chrillek · March 21, 2021, 12:08pm

As I said before: Apple says that they implement the ICU version of regexes. Those have named capture groups. And DT confirmed that they rely on Apple’s NS-Regex routines.

Since it is not working as advertised, either Apple did not really implement the full ICU definition (probable) or DT somehow deviated from The Right Way (less probable, because that would put more burden on them).
If you really wanted to figure out who’s the culprit here, you could write an ObjC program testing it: if the named captures work there, DT did something not quite correct. Otherwise, all hope is lost.

BTW: in your examples, you have a maximum of seven capture groups (as I see it, at least – US court documents vary wildly, probably):

Jones    Smith    92    F.3d   1112   D.C. Cir.   1991
party1   party2   vol   Rep.   p      Court       year

So maybe you could get away with 9 numeric capturing groups after all?

Regardless of that: Why are you using lookahead REs all the time? What is wrong with a simple

([A-Za-z].*)\s+v\.\s+([A-Za-z].*),\s+

to (for example) get at Party1 and Party2? Letter, followed by anything up to at least one space followed by “v.” followed by at least one space, followed by anything up to but excluding a comma?
If you really, really want to use groups even if they’re not necessary, use non-capturing ones.
For example, this ((?=[A-Za-z])(.*?) seems to be utter overkill to capture Party1: You want the first letter and the rest up to, but excluding the “v.”, so use one capturing group like so: ([A-Za-z].*?). You do not want the “v.” part to be part of the first Party, so why include it in this named group?
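The effect on group numbering can be sketched in Python (whose syntax for non-capturing groups, (?:...), matches ICU's). The nested pattern below is a balanced stand-in for the fragment quoted above, not the exact original; the point is only that each plain "(" uses up a group number, while "(?:" does not.

```python
import re

name = "Jones v. Smith, 92 F.3d 1112 (D.C. Cir. 1991)"

# Two capturing groups: the outer wrapper and the inner (.*?).
nested = re.compile(r"((?=[A-Za-z])(.*?))\sv\.\s")
# The outer wrapper is non-capturing, so only (.*?) gets a number.
non_capturing = re.compile(r"(?:(?=[A-Za-z])(.*?))\sv\.\s")
# One group and no lookahead at all.
simple = re.compile(r"([A-Za-z].*?)\sv\.\s")

counts = (nested.groups, non_capturing.groups, simple.groups)
print(counts)

party1 = simple.search(name).group(1)
print(party1)
```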

chrillek · March 21, 2021, 3:55pm

What I’d suggest

([A-Za-z].*)\s+v\.\s+([A-Za-z].*),\s+(\d+)\s+([^\s]+)\s+(\d+)\s+\(([^\d]*)?(\d{4})\)

Seven capturing groups, where the sixth one can be empty (in the first two examples). Works with your examples on regex101.com:

This RE works with four-digit years in the parentheses. It must be modified if two-digit years are also possible.
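For anyone who wants to check the expression outside DT, here is a minimal Python sketch that runs it against the three sample names from earlier in the thread. The field labels and the parse helper are illustrative names, not anything DT provides, and Python's engine is not ICU, so treat this as a sanity check of the pattern itself rather than of DT's behaviour.

```python
import re

# The seven-group expression from the post, unchanged.
CITATION = re.compile(
    r"([A-Za-z].*)\s+v\.\s+([A-Za-z].*),\s+(\d+)\s+([^\s]+)\s+(\d+)\s+"
    r"\(([^\d]*)?(\d{4})\)"
)

# Illustrative labels for groups 1 to 7.
FIELDS = ("party1", "party2", "volume", "reporter", "page", "court", "year")


def parse(name):
    """Return the seven fields as a dict, or None if the name does not match."""
    m = CITATION.search(name)
    if m is None:
        return None
    # Group 6 may be empty; when present it keeps a trailing space, so strip it.
    values = [(g or "").strip() for g in m.groups()]
    return dict(zip(FIELDS, values))


samples = [
    "United States v. Harris, 515 U.S. 12 (2004)",
    "Thomas v. United States, 32 U.S. 535 (1999)",
    "Jones v. Smith, 92 F.3d 1112 (D.C. Cir. 1991)",
]
for s in samples:
    print(parse(s))
```

In a smart rule the same seven values would be referenced as \1 through \7, which stays within the single-digit range.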

BLUEFROG · March 21, 2021, 5:27pm

https://regexr.com/ is fun too

BLUEFROG · March 21, 2021, 5:29pm

So these are the names of individual PDFs?
What is the full name of a PDF (which I’m assuming there is more to the name if you’re trying to scrape them)?

bws950 · March 22, 2021, 12:47pm

Thanks very much, chrillek - with a slight tweak to account for the fact that court names sometimes have numbers in them, this (far superior!) RE worked for my purposes, and I was able to use it to get the smart rule working with \1, \2 etc.

And to BlueFrog’s question - the example case names and citations I gave are full PDF names. My goal was to avoid having to enter the information twice, once in the Title (which I often generate by copying and pasting from another source) and then separately by re-typing that information in metadata fields.