Regex (global mode): How to look for (and pass on) multiple occurrences of a string in a document?

Tossn · July 10, 2022, 4:39pm

Hello,

Is it possible to use a regular expression in a smart rule to find multiple occurrences of a string in a document and then use the result to change the document’s name (by appending all the occurrences)?

For example, the regular expression
/hello/g
looks for multiple occurrences of the string “hello” in a document (I use Regex101.com to check). However, it seems that DEVONthink does not allow me to add a leading
/
and a trailing
/g
to a regular expression placed in a smart rule. Is this the case?

This being said in general, I am looking for a way to solve this particular scenario:

I want to extract the tracking numbers from DHL receipts. For example, one document contains the tracking numbers “RM661929878DE” and “LP042481351DE”. The numbers are readable as text (cross-checked with a plain text conversion), and Regex101.com says that the regular expression
/((RM|LP|RC)[0-9]{9}DE)/g
retrieves both numbers.

When I put this tested expression in a smart rule (as an action under “scan text” and “regular expression”), the “display alert” action I also set up never appears, so nothing is returned.

On the other hand, the expression
((RM|LP|RC)[0-9]{9}DE)
– so without global mode
/…/g
– returns only the first tracking number (“RM…”), but not the second one (“LP…”).

What else could I try?

chrillek · July 10, 2022, 5:24pm

“Retrieves” is a misnomer here. A better word would be “matches”. Because that is what happens (in this context!). You provide a regular expression, and you tell regex101.com that you want to see all matches in the sample text. Easy.

Now, we’re not talking about regex101.com, but about DT. And DT simply can’t do what you want it to (at least not in this situation) because it can’t do anything useful with “all matches”. Let me put it differently: What exactly would you write in the replacement part of your naming rule? You might think to use a reference to the capturing group, like $1. But that can’t work, since there’s only one capturing group for every match – and you want all matches.
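
To make the difference concrete outside DEVONthink, here is a minimal sketch in plain JavaScript (runnable in Node or a browser console, not inside a smart rule). The sample text is made up around the two tracking numbers from the question. It shows that a non-global match stops at the first hit, while a global one yields one match per occurrence, each with its own group 1.

```javascript
// Sample text (an assumption for illustration) with the two numbers from the question.
const text = "Receipt: RM661929878DE and LP042481351DE";

// Without the g flag, match() returns only the first occurrence.
const first = text.match(/(RM|LP|RC)[0-9]{9}DE/);

// With the g flag, matchAll() yields one match object per occurrence.
const all = [...text.matchAll(/(RM|LP|RC)[0-9]{9}DE/g)];

// Group 1 exists once per match, so a single $1 cannot stand for all of them.
const fullMatches = all.map(m => m[0]);
const groupOnes = all.map(m => m[1]);

console.log(first[0], fullMatches, groupOnes);
```

Each entry in fullMatches is one complete tracking number, and groupOnes holds only the two-letter prefix of the corresponding match.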

However, all is not lost: You can write a script (I’d suggest JavaScript, because of its built-in support for REs, but it is feasible with AppleScript, too) that does what you want.
Something like this

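(() => {
  // Prefix RM, LP, or RC, then nine digits, then DE; g finds every occurrence
  const myRE = /(RM|LP|RC)\d{9}DE/g;
  const app = Application("DEVONthink 3");
  const records = app.selectedRecords();
  records.forEach(r => {
    const txt = r.plainText();
    // One entry per match; m[0] is the full tracking number
    const matches = [...txt.matchAll(myRE)];
    // Append the matches to the current name, separated by spaces
    const name = r.name() + " " + matches.map(m => m[0]).join(" ");
    r.name = name;
  });
})()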

This little gem

gets the currently selected records;
loops over them with forEach and retrieves the text of each record;
collects all the strings that match the expression in myRE in an array;
appends these matches to the name of the record, separated by spaces.

I took the liberty of simplifying your RE (no need for the outer capturing group, really). And I didn’t test the code at all, but it should give you an idea of how you could achieve what you want.
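
Since the script is untested, one way to check the naming logic without DEVONthink is to pull it into a small pure function and run it in Node. The sketch below is only that: buildName is a hypothetical helper, and dropping duplicates and leaving the name alone when nothing matches are assumptions about the desired behaviour, not something the script above does.

```javascript
// Hypothetical helper, not part of DEVONthink: returns the new name for a record.
function buildName(name, text, re) {
  // A Set drops repeated tracking numbers while keeping first-seen order.
  const found = [...new Set([...text.matchAll(re)].map(m => m[0]))];
  // Leave the name untouched when nothing matched.
  return found.length ? `${name} ${found.join(" ")}` : name;
}

// matchAll() requires the g flag on the expression.
const re = /(RM|LP|RC)\d{9}DE/g;
console.log(buildName("Receipt", "RM661929878DE LP042481351DE RM661929878DE", re));
console.log(buildName("Receipt", "no tracking number here", re));
```

Inside the forEach loop, the same function could then be called with r.name() and r.plainText() as its first two arguments.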

But: What exactly do you want? Appending all tracking numbers to the name of the record does not seem to make much sense to me. It would if these numbers were easy for the human eye to scan (which they are not, at least not for my eyes). And depending on the number of tracking numbers in the file, the name would grow longer and longer until DT couldn’t display it completely anymore. If you want to search for them, why put them in the name at all? DT will find these codes in the record itself if you search for them.

BTW: I’m fairly certain that DT does not support any modifier in REs. So, not only does /g not work as you hoped it would, but also /i, /m, /s etc. will not do anything (and probably silently break the RE stuff in a smart rule).

Tossn · July 12, 2022, 4:55pm

@chrillek Thanks for your JavaScript approach. It looks very reasonable. Since I feel more at home with regular expressions, I kept looking for a regex solution in combination with smart rules.

I finally found a workaround that produces the result I described, using search operators and regular expressions.

So this thread now offers users with a similar problem both a JavaScript approach and a regex approach.

The regex approach gets there by running several smart rules with regular expressions one after the other, rather than in a single action as I first tried. Of course, the successive smart rules must cover every structure the wanted text can take.

In my case there are the following structures (DHL tracking numbers on the receipts):

RM000000000DE
or
RM000000000DE RC000000000DE
or
RM000000000DE
RC000000000DE
or
RM 000 000 000 DE

(The nine zeros stand for variable digits. The letters RM and RC can sometimes be LP or CY in the text content.)
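
For comparison with the script approach, the four structures above could probably be covered by one pattern in JavaScript. This is a rough sketch under assumptions: extractTrackingNumbers is a hypothetical helper, and it assumes that whitespace only ever appears after the prefix and between or after the digits, as in the examples listed. It is not something a smart rule can run directly.

```javascript
// Hypothetical helper: collects tracking numbers in any of the layouts listed above.
function extractTrackingNumbers(text) {
  // Prefix, nine digits each optionally followed by whitespace, then "DE".
  const re = /\b(RM|RC|LP|CY)\s?((?:\d\s?){9})DE\b/g;
  // Strip the whitespace so every result has the compact form, e.g. RM000000000DE.
  return [...text.matchAll(re)].map(m => m[1] + m[2].replace(/\s/g, "") + "DE");
}

console.log(extractTrackingNumbers("RM661929878DE CY042481351DE"));
console.log(extractTrackingNumbers("RM661929878DE\nRC042481351DE"));
console.log(extractTrackingNumbers("LP 042 481 351 DE"));
```

Normalizing the spaced variant to the compact form keeps the resulting names consistent regardless of how the number was printed on the receipt.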

Here is the sequence of the smart rules:
(translations: “eine” = “one” / “Inhalt” = “(text) content” / “Name ändern” = “change name” / “verschieben” = “move” / “sortierbares Dokumentendatum” = “sortable document date”)

Here is the result in the file names: