Recommendations for searching content of record with regex

KevinCoates · July 17, 2023, 6:50am

Hi

I’ve searched the forum, and I’m not sure what I want to do is possible. It may be possible with a smart rule but it’s not obvious (to me!). Any recommendations of an approach are welcome.

I’m looking to scan the content of a record for a regex pattern:
[C|T][-| ][0-9]{1-4}[/||-| ][0-9]{1-4}

Then search the DTP database for record names matching that pattern, and put the results into a markdown file.

In AppleScript, I can’t see how to use the DTP wildcards to deal with the problem that each of the two numbers in the regex above may have between 1 and 4 digits.

In smart rules, I can use scan text to match the regex, but I don’t see how to then generate a list of all hits in the document.

Any pointers would be much appreciated.

Kevin

chrillek · July 17, 2023, 7:07am

The only way to do that (as far as I know) is to run a script that checks the RE against the content of all relevant records.

That’s not too complex, but AppleScript is not the ideal choice for that due to its lack of RE support.

cgrunenberg · July 17, 2023, 7:22am

Does a toolbar search for name:"[CT] [0-9]* [0-9]*" return many undesired results or not the desired results at all?

KevinCoates · July 17, 2023, 7:52am

Yup. Unfortunately that’s all I know at the moment

KevinCoates · July 17, 2023, 7:53am

Lots of false positives I’m afraid. (And if I remove the spaces, then no results at all.)

EDIT: Actually, no desired results at all.
False positives: any number
False negative: with C-388/15 it matches only the individual numbers not the pattern.

chrillek · July 17, 2023, 8:21am

Your “regular expression” looks a bit weird to me. For example, [C|T] would match an uppercase T, an uppercase C or a pipe symbol (|). I suppose that the latter is intended as an alternation here, which makes utterly no sense in a bracket expression. The same goes for [/||-| ] – Why do you have three pipe symbols in a bracket expression here? The brackets define a set of characters to match, and each element of this set counts only once, so that repeating a pipe doesn’t make sense.
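A quick way to see this is to test single characters against the bracket expressions. The following is only a minimal sketch in plain JavaScript (no DEVONthink involved), and the variable names are made up for illustration. It also shows a side effect of the second expression: the `|-|` in the middle is read as a range from pipe to pipe, so the hyphen itself is not part of the set.

```javascript
// Inside a bracket expression, | is a literal character, not an alternation.
const withPipe = /^[C|T]$/;
const withoutPipe = /^[CT]$/;

console.log(withPipe.test("C"));    // true
console.log(withPipe.test("|"));    // true: the pipe itself is accepted
console.log(withoutPipe.test("|")); // false

// [/||-| ] is parsed as: slash, pipe, the range |-| (pipe to pipe), space.
// Built from a string here so the slash needs no escaping.
const secondSeparator = new RegExp("^[/||-| ]$");
console.log(secondSeparator.test("/")); // true
console.log(secondSeparator.test(" ")); // true
console.log(secondSeparator.test("-")); // false: the hyphen is not in the set
```

If a hyphen is meant to be matched literally, placing it first or last in the brackets (for example `[-/ ]`) avoids the range interpretation.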

Below, you’ll see my take on the task. Please check and modify the regular expression in the test call to whatever you want to match, taking care that the RE is valid (and sensible). regex101.com helps with figuring that out.

The code is JavaScript and intended to be used in a smart rule’s “execute script” part. It doesn’t do any error checking, so it’s advisable that your smart rule selects only records with a plainText property (e.g. Markdown, OCR’d PDF).

function performsmartrule(records) {
  const foundNames = [];
  // Keep only the records whose text matches the pattern, then collect their names
  records
    .filter(r => /[CT]\s+\d{1,4}\s+\d{1,4}/.test(r.plainText()))
    .forEach(r => foundNames.push(r.name()));
  if (foundNames.length) {
    // Create a Markdown record listing the names, separated by blank lines
    Application("DEVONthink3").createRecordWith({type: "markdown", name: "found-files", plainText: foundNames.join('\n\n')});
  }
}

`records` are the records selected by the smart rule’s conditions. `filter` reduces them to those whose `plainText` property matches the regular expression specified for the `test` call. `forEach` then pushes the _name_ of the matching records into the array `foundNames`. If all records are processed _and_ this array contains anything at all (`length` is not zero), a new markdown record is created and its `plainText` property set to the names in `foundNames`, separated by two newlines.
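The script above only records which documents match. For the other half of the original question, listing every hit inside one document, something along these lines might serve as a starting point. It is a sketch in plain JavaScript: `findMatches` is a hypothetical helper, not part of DEVONthink, and the sample text is invented. In a smart rule, the text would presumably come from `r.plainText()`.

```javascript
// Hypothetical helper: returns every distinct match of `pattern` in `text`.
function findMatches(text, pattern) {
  // matchAll requires the "g" flag, so add it if the caller left it out.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  const hits = [...text.matchAll(globalPattern)].map(m => m[0]);
  return [...new Set(hits)]; // drop duplicates, keep first-seen order
}

// Invented sample text, using the same pattern as the smart rule script.
const sampleText = "Compare C 388 15 with T 1 99 and again C 388 15.";
console.log(findMatches(sampleText, /[CT]\s+\d{1,4}\s+\d{1,4}/));
// [ 'C 388 15', 'T 1 99' ]
```

The returned array could then be joined with newlines and written to a Markdown record in the same way as `foundNames`.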

KevinCoates · July 17, 2023, 9:43am

That’s very kind, many thanks.

And you’re right, my regex was wrong. I was using regex101 and it got the results I wanted, but only because there were no pipes in the source text. I didn’t look closely enough at “how” it was getting the results. It should, I think, have been:
[CT][- ][0-9]{1,4}[-\/][0-9]{1,4}
But I can see your version is more elegant.

In parallel, I kept looking and this also seems to work:

set theResult to do shell script "echo " & quoted form of theContent & " | egrep -o '[CI][- /][0-9]{1,4}[-/\\\\ ][0-9]{1,4}'"

On a quick check, the output looks right.

Thanks all.

chrillek · July 17, 2023, 12:15pm

Sure. This brute-force way to circumvent AppleScript’s shortcomings always works. egrep, however, is also limited and limiting in its support of regular expressions. Not to mention that apparently innocent strings like "O'Reilly's books are everybody\'s darling" seem to throw quoted form of off the rails.
I’m not saying that it doesn’t work in your case. Or in many cases. But it is neither an elegant nor a robust solution, since it needs quoting/escaping stuff for the shell.

You’re using a different regular expression (matching [CI], not [CT]) than before, having a bunch of backslashes in the second separator ([-/\\\\ ]). What do you intend to match there?

KevinCoates · July 17, 2023, 1:41pm

Thanks for taking the time. I appreciate it.

Elegance doesn’t worry me in this context, but robustness does.

On the other hand, the shell solution is something I understand and can expand on myself relatively easily. What I’ve asked is a first step, not the destination.

Your solution - which again I do appreciate - given my current knowledge, I can’t expand on.

The CI in my first example should have been CT. Not sure what happened.

Longer version.

(Note to someone more expert in Eur-lex in case they see this thread: I know that in theory I can extract this data from the eur-lex database by querying the api. Again far beyond my current skill set. But the more I type the more I think I should bite the bullet.)

I have a set of PDFs that are EU court cases.

Each case cites other cases. EU court cases are cited in the text in the form:
C-1234/56
Or
T-1234/56

Where
C or T is the court
1234 is the case number (from 1 with no leading zeroes)
56 is the year (currently in two digit form but not sure how long that will last)

I’ve seen typos in the past where

the hyphen (-) is rendered as a space;
the slash (/) is rendered as a backslash (\) (I added a space as an alternative typo out of caution).

(These issues may have been corrected over time but I’d rather check for them just in case. The earliest PDFs are scans of 1957 court reports.)

I’m not expecting other text in the PDFs which would mess up this approach.

If I can correctly identify the pattern in the PDFs, I can construct records that say “this case cites these cases” and “this case is cited by these cases”. Back links and forward links.
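As a rough illustration of that idea, the sketch below extracts citations from each text, normalises the typo variants to the C-1234/56 form, and builds both directions of links. It is plain JavaScript under stated assumptions: the input is a simple object mapping a case number to its text, and `extractCitations`, `buildLinks` and the sample data are all hypothetical names and values, not anything from DEVONthink or EUR-Lex.

```javascript
// Court letter, hyphen or space, 1-4 digits, then slash, backslash, hyphen
// or space, then 1-4 digits. \b avoids matching inside longer words or numbers.
const CITATION = /\b([CT])[- ](\d{1,4})[-\/\\ ](\d{1,4})\b/g;

// Returns the set of citations in `text`, normalised to the form "C-388/15".
function extractCitations(text) {
  const found = new Set();
  for (const [, court, number, year] of text.matchAll(CITATION)) {
    found.add(`${court}-${number}/${year}`);
  }
  return found;
}

// Builds forward links (cites) and back links (citedBy) for all cases.
function buildLinks(cases) {
  const cites = {};
  const citedBy = {};
  for (const [id, text] of Object.entries(cases)) {
    // A judgment usually mentions its own number, so skip self-references.
    const targets = [...extractCitations(text)].filter(t => t !== id);
    cites[id] = targets;
    for (const target of targets) {
      if (!citedBy[target]) citedBy[target] = [];
      citedBy[target].push(id);
    }
  }
  return { cites, citedBy };
}

// Invented sample data; "T 12\\99" is the text T 12\99 with both typos.
const sampleCases = {
  "C-1/20": "As held in C-388/15 and in T 12\\99.",
  "C-2/21": "See C-1/20 and C-388/15.",
};
const links = buildLinks(sampleCases);
console.log(links.cites["C-1/20"]);     // [ 'C-388/15', 'T-12/99' ]
console.log(links.citedBy["C-388/15"]); // [ 'C-1/20', 'C-2/21' ]
```

Normalising before storing means that a typo variant and the correct form end up as the same link target, which matters if the results are later matched against record names.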

There are other possible uses including many I haven’t thought of, hence my concern about “next steps”.

cgrunenberg · July 17, 2023, 1:48pm

A toolbar search for "[CT] [0-9][0-9][0-9][0-9] [0-9][0-9]" should actually match cases having 4-digit case numbers. A prefix like name: or text: could be used to search only in the name or the text.

NOTE: When searching for phrases (meaning words enclosed by quotation marks), any white space or separator between the words of the phrase is accepted. E.g. "e mail" accepts e-mail, e+mail or e@mail etc.

KevinCoates · July 17, 2023, 2:26pm

Hi Christian,

Thanks. That matches:
C-1234/56
but not
C-1/56

The target text does not have leading zeros.

(Though your suggestion does highlight a false positive I hadn’t spotted: I need to exclude colons. There is some text in the form C:1234:56 which should not be matched.)

I’m thinking that I either need to accept the weaknesses of the shell script approach, or learn the eur-lex API.

I appreciate the suggestions.

Kevin

chrillek · July 17, 2023, 2:39pm

There’s this

http://api.epdb.eu/

Which seems to be way cooler than the official API. The latter dates from 2014 and works only with XML, which is not fun to produce nor consume.

cgrunenberg · July 17, 2023, 3:03pm

See…

KevinCoates · July 17, 2023, 3:11pm

Thanks. I hadn’t seen that.

KevinCoates · July 17, 2023, 3:17pm

We may be talking at cross-purposes, and I certainly didn’t mean to be rude if that’s the way it came across.

Yes, that matches 4-digit case numbers, but that doesn’t help for my purposes. I’d seen that option before I posted, but as the case-number can be anything from 1 to xxxx it doesn’t produce a useful result (a partial set of results doesn’t work I’m afraid.)

cgrunenberg · July 17, 2023, 3:30pm

No need to worry, this was just one simple example for 4-digit cases to demonstrate that the search might be sufficient.

E.g. "[CT] [0-9] [0-9][0-9]" OR "[CT] [0-9][0-9] [0-9][0-9]" OR "[CT] [0-9][0-9][0-9] [0-9][0-9]" OR "[CT] [0-9][0-9][0-9][0-9] [0-9][0-9]" should find all desired cases.
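Typing such a query by hand is error-prone, so it may be easier to generate it. The snippet below is a small sketch in plain JavaScript; `buildCaseQuery` and its parameters are hypothetical, and it only assembles the query string in the form shown above. How DEVONthink interprets the result is as described in this post, not something the code checks.

```javascript
// Hypothetical helper: one quoted phrase per case-number length, joined by OR.
function buildCaseQuery(maxDigits = 4, yearDigits = 2) {
  const digit = "[0-9]";
  const phrases = [];
  for (let n = 1; n <= maxDigits; n++) {
    phrases.push(`"[CT] ${digit.repeat(n)} ${digit.repeat(yearDigits)}"`);
  }
  return phrases.join(" OR ");
}

console.log(buildCaseQuery());
```

Passing a different `yearDigits` would cover four-digit years, should the citation format ever change.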

BLUEFROG · July 17, 2023, 3:49pm

This still matches colon-delimited strings, e.g.,…

cgrunenberg · July 18, 2023, 6:12am

I didn’t claim that it wouldn’t. But the search results could be used in a script to find the documents that might match the regex, so the script doesn’t have to apply the regex to all documents, which could easily be very slow.