What you always wanted to know about Regular Expressions

chrillek · July 15, 2024, 12:13pm

If you have used the search function of word processors or other programs, you might have stumbled on “wild-card” characters like * and ?, indicating any number of characters and exactly one character, respectively. So, you might have used ???-??? ???? to search for US phone numbers. That kind of works: this pattern will match a phone number like 555-123 4567. But it will also match entirely different strings, like ABC-xyz 1234.

Regular expressions (REs) come to the rescue, since they allow a very detailed specification of the pattern you are searching for.

Staying with US phone numbers, you’d use a RE like \d{3}[ -]\d{3} \d{4}. That means:

three digits, followed by
a space or a dash, followed by
three digits, followed by
a space, followed by
four digits.

This expression matches 555-123 4567 and 555 123 4567. But it does not match ABC-xyz 1234.
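To try the pattern outside of a search dialog, here is a minimal sketch in Python. The language and the helper name is_phone are assumptions made for this illustration only. fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Pattern from the text: 3 digits, a space or dash, 3 digits, a space, 4 digits.
# re.ASCII restricts \d to 0-9, as described in the text.
PHONE = re.compile(r"\d{3}[ -]\d{3} \d{4}", re.ASCII)


def is_phone(text: str) -> bool:
    """Return True if the entire string has the phone-number shape."""
    return PHONE.fullmatch(text) is not None


for candidate in ["555-123 4567", "555 123 4567", "ABC-xyz 1234"]:
    print(candidate, "->", is_phone(candidate))
```

The first two candidates are accepted, the third is rejected, which is exactly the difference to the ???-??? ???? wild-card search.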

So, a RE is a string of characters specifying a pattern to search for. And you might want to know more about them because they are useful if you are looking for more than just fixed words.

DEVONthink allows you to use regular expressions in smart rules with its scan name and scan text actions. Together with other actions, these two allow you to change the name of a record or set tags depending on its content. Or you can modify the name of a record by rearranging its components.

Simple RE patterns

The simplest RE pattern is the dot ., which stands for any single character. For example, .he matches strings like The, the, she, She and “ he” (that’s a space followed by “he”). To match a literal dot, for example in a number like 1.234, you must escape it with a backslash: 1\.234.

You already saw the pattern for ASCII digits, \d. There is no simple corresponding pattern for letters, though. [a-zA-Z] matches any upper- or lower-case ASCII letter. To match any letter in any script, you can use \p{L} with many current RE engines. Here, the L stands for “letter”. There are other \p{} sequences matching emojis, non-ASCII digits and other Unicode classes.

Looking at the patterns matching digits and letters, you might be wondering why they look so different. Some background information should help to understand that.

Digits and letters are special cases of character classes in REs. These character classes are enclosed in square brackets []. For example, [ab] matches the lower-case letter a or b, and [- ] matches a dash or a space. To match a range of characters, you write its first element, a dash, and its last element: [a-z] matches all lower-case ASCII letters. Similarly, [0-9] matches all ASCII digits, and \d is only an abbreviation for that – two characters instead of five. Note that if you want to use a literal dash in a character class, it must be the first or the last character (or be escaped as \-): [- ] and [ -] both match a dash or a space, whereas a dash between two characters, as in [a-z], denotes a range.

Another often used shortcut is \s for “ASCII white-space”, which you’d have to write as [ \t\n\v\f\r] otherwise.

There are also inverted character classes: [^ab] matches any character but a and b, [^0-9] matches anything but digits. The shorthand versions use upper-case letters, so that \D matches anything but a digit, and \S matches anything but white-space.
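The character classes described so far can be compared side by side in a short Python sketch. The sample text is made up for the illustration.

```python
import re

text = "Invoice 42, due 2024-07-15"

digits = re.findall(r"[0-9]+", text)        # runs of ASCII digits
non_digits = re.findall(r"[^0-9]+", text)   # runs of anything but digits
chunks = re.findall(r"\S+", text)           # runs of non-white-space
dash_or_space = re.findall(r"[- ]", text)   # single dashes or spaces

print(digits)         # ['42', '2024', '07', '15']
print(non_digits)     # ['Invoice ', ', due ', '-', '-']
print(chunks)         # ['Invoice', '42,', 'due', '2024-07-15']
print(dash_or_space)  # [' ', ' ', ' ', '-', '-']
```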

How much do you want to match?

REs provide several notations to simplify repeating patterns. They are called quantifiers, must immediately follow the pattern and determine how often it must repeat to match:

* matches any number of occurrences, including none;
+ matches one or more occurrences;
? matches zero or one occurrences;
{n} where n is a number, matches exactly n occurrences;
{n,} where n is a number, matches at least n occurrences;
{n,m} where n and m are numbers, matches at least n and at most m occurrences.

You should use the asterisk wisely. For example, a* will find a match in the string b, which might surprise you. But the * matches the preceding pattern zero or more times, and that is exactly what is found here: zero occurrences of a, i.e. an empty match right before the b.
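The quantifiers are easy to check in Python. This is only a sketch: the helper name matches and the sample patterns are invented for the illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the pattern covers the entire string."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"ab*c", "ac"))        # True:  * accepts zero b's
print(matches(r"ab+c", "ac"))        # False: + needs at least one b
print(matches(r"colou?r", "color"))  # True:  ? makes the u optional
print(matches(r"\d{2,4}", "12345"))  # False: {2,4} allows at most four digits

# a* finds an empty match at the start of "b": zero occurrences of "a".
empty = re.search(r"a*", "b")
print(repr(empty.group()), empty.span())  # '' (0, 0)
```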

So, don’t use * indiscriminately – in addition to matching nothing, it also can gobble up more of the string than you intend. Let’s say you want to find HTML start tags with a RE and try this: <.*>. That will match

an opening < followed by
any number of any characters, followed by
a closing >.

Applying that to the string <a href="#target">Target</a><p>Some text</p> will not only match <a href="#target">, but the complete string, because the * is greedy: together with the preceding dot, it consumes as many characters as possible, so the match only ends at the last > in the string.

The * and + quantifiers can be made non-greedy (lazy) by appending a ?. Thus, <.+?> matches the individual tags, such as <a href="#target"> and </a>, in the string used above. Alternatively, using a negated character class obtains the same result: <[^>]+> matches a < followed by at least one character that is not >, followed by a >.
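The difference between greedy and lazy matching shows up clearly when all three patterns run against the same input. The following Python sketch assumes a sample string with an a element followed by a p element.

```python
import re

# Sample input (an assumption of this sketch): two elements in a row.
html = '<a href="#target">Target</a><p>Some text</p>'

greedy = re.findall(r"<.*>", html)       # runs up to the last >
lazy = re.findall(r"<.+?>", html)        # stops at the first possible >
negated = re.findall(r"<[^>]+>", html)   # cannot run past a > at all

print(greedy)   # one match covering the whole string
print(lazy)     # ['<a href="#target">', '</a>', '<p>', '</p>']
print(negated)  # same list as the lazy variant
```

The negated character class is often the more predictable choice, since it cannot expand across a > no matter how the rest of the pattern looks.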

Escaping special characters

Since many characters have a special meaning in REs, you must escape them using a backslash \ if you need a literal match. These characters are
., +, ?, *, |, ^, $, [, ], {, }, (, and ). However, inside a character class, you only need to escape ] (and, depending on their position, ^ and -). Thus, \+ matches a single “+” sign, as does [+].

If you want to match a literal backslash, you have to escape it everywhere, even inside a character class.
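A short Python sketch of the escaping rules follows. The sample strings are invented. re.escape is a helper of Python's re module that adds the backslashes for a fixed piece of text; other environments may not offer an equivalent.

```python
import re

price = "Total: 1.234 (+5%)"

# Unescaped, the dot matches any character, so "1x234" fits as well.
loose = re.compile(r"1.234")
strict = re.compile(r"1\.234")
print(bool(loose.search("1x234")), bool(strict.search("1x234")))  # True False

# A + is literal inside a character class and needs a backslash outside of one.
plus_escaped = re.findall(r"\+", price)
plus_in_class = re.findall(r"[+]", price)

# A backslash must be escaped even inside a character class.
backslashes = re.findall(r"[\\]", r"C:\temp\file")

# re.escape() escapes whatever needs escaping in a fixed search text.
literal = re.escape("(+5%)")
found = re.search(literal, price)
print(found.group())  # (+5%)
```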

Re-using parts of a match

If you wanted to not only match start tags but a complete HTML element, you could be tempted to write an RE like this <a.*?>.*?</a> and repeat that for all HTML elements. While that might be possible, it would be a pain.

HTML elements consist of an opening tag with the element’s name, followed possibly by text or other elements and a closing tag, again with the element’s name. If we capture the name of the element in the opening tag, we can re-use it in the closing tag:

<([a-zA-Z]+).*?>(.*</\1>)?

If you use this RE with the text <a href="#target">Target</a><p>Some text</p>, it will find two matches:

<a href="#target">Target</a>
<p>Some text</p>

But don’t get carried away and try to parse arbitrary HTML with REs – that is not possible. If, for example, the p element in the preceding example contained an img, our RE would not see that. It would simply return the whole p, including the img.

Parentheses in REs define capturing groups (in general). That means that you can refer to the content of a matched pattern by using \n later in the RE, where n is the number of the capturing group, starting with 1 for the first group.

In addition, capturing groups are useful if you want to search with a RE and replace the matches with something else. You should use capturing groups only if you need their contents. If you simply intend to group several patterns like with .*</\1> above, you should always use a non-capturing group like so:
(?:.*</\1>)?
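Both variants of the element pattern, with and without the second capturing group, can be compared in Python. This is a sketch only; the sample string with an a and a p element is an assumption.

```python
import re

html = '<a href="#target">Target</a><p>Some text</p>'

# \1 repeats whatever group 1 (the element name) matched.
capturing = re.compile(r"<([a-zA-Z]+).*?>(.*</\1>)?")
# Same pattern, but the optional part only groups and is not stored.
grouping_only = re.compile(r"<([a-zA-Z]+).*?>(?:.*</\1>)?")

elements = [m.group(0) for m in capturing.finditer(html)]
names = [m.group(1) for m in grouping_only.finditer(html)]

print(elements)  # ['<a href="#target">Target</a>', '<p>Some text</p>']
print(names)     # ['a', 'p']
print(capturing.groups, grouping_only.groups)  # 2 1
```

Both patterns find the same text. The only difference is how many groups are stored for later use.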

Another example of reusing patterns is the search for word repetitions:
(\w+)\W+\1
finds one or more “word” characters (\w+) and saves them in a capturing group; it then looks for at least one non-word character (\W+), followed by the content of the capturing group. It will, for example, match the the.
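Here is the word-repetition pattern at work in a small Python sketch. The sample sentence is made up.

```python
import re

text = "It was the the best of of times."

# A word, at least one non-word character, then the same word again.
doubled = re.compile(r"(\w+)\W+\1")

found = [m.group(1) for m in doubled.finditer(text)]
print(found)  # ['the', 'of']

# Replacing each match with group 1 drops the repetition.
cleaned = doubled.sub(r"\1", text)
print(cleaned)  # It was the best of times.

# Pitfall: the pattern also fires inside words.
partial = doubled.search("this is")
print(partial.group())  # is is
```

As the last lines show, the pattern does not care about word boundaries and finds “is is” inside “this is”. Wrapping it in \b on both sides, as in \b(\w+)\W+\1\b, is one way to avoid that in engines that support \b.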

Modifying strings with REs

In a replacement operation, REs are even more powerful than with simple searches. However, the exact syntax for using the content of capturing groups in replacements depends on the environment. JavaScript’s replace and replaceAll methods refer to capturing groups as $1, $2 and so on. DEVONthink’s smart rules and the UNIX command line tool sed use \1, \2 and so on.

Here, I’ll use \n to refer to the nth matched capturing group in a regular expression replacement.

Example: Change the date from German to ISO format

Suppose your text contains dates in the German format, i.e. DD.MM.YYYY, and you want to convert them to ISO dates (which permits easier sorting by date). Your regular expression would be
(\d\d)\.(\d\d)\.(\d{4})
so that the day, month and year would be captured in the first, second and third capturing group. Then your replacement string would be
\3-\2-\1

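You could make this expression more flexible by using [-/.] instead of \. and \d{1,2} instead of \d\d in the search expression. Then you’d be able to match dates like 2/7/2014 as well as 4-8-2013 and 1.1.1990. Note that the simple replacement string copies days and months as they are, without adding a leading zero.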
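In Python, the date conversion could look like the sketch below. Python's re.sub happens to use the same \1, \2 notation in replacement strings. The function names are invented, and the flexible variant assumes \d{1,2} for day and month and pads them with a replacement function instead of a plain replacement string.

```python
import re

german = re.compile(r"(\d\d)\.(\d\d)\.(\d{4})")
flexible = re.compile(r"(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})")


def to_iso(text: str) -> str:
    """Rewrite DD.MM.YYYY dates as YYYY-MM-DD."""
    return german.sub(r"\3-\2-\1", text)


def to_iso_padded(text: str) -> str:
    """Like to_iso, but accepts -, / or . and one-digit days and months."""
    def pad(match: re.Match) -> str:
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return flexible.sub(pad, text)


print(to_iso("Invoice dated 15.07.2024"))
# Invoice dated 2024-07-15
print(to_iso_padded("2/7/2014, 4-8-2013 and 1.1.1990"))
# 2014-07-02, 2013-08-04 and 1990-01-01
```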

Note: As it stands, the search expression will match invalid dates like 32.13.0001. A more robust variant would be
(0?[1-9]|[12]\d|3[01])[-/.](0?[1-9]|1[0-2])[-/.](\d{4})
It uses the alternation symbol | (“or”) and matches only valid days and months, with or without leading zeroes. The first capturing group means match

a single digit between 1 and 9, with or without a leading zero, or
any number between 10 and 29, or
30 or 31.

However, it would still permit a February with 30 days or a 31st of September, and it requires years to be written with four digits, even those before 1000. Excluding this kind of malformed date string is possible with a more complicated RE.
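The limits of the more robust expression can be seen in a Python sketch. Two things are assumptions of this sketch: the month part is written as 0?[1-9] so that 0 and 00 are rejected, and the calendar check is not done with a more complicated RE but handed to Python's datetime.date, which is a different technique.

```python
import re
from datetime import date

DATE = re.compile(r"(0?[1-9]|[12]\d|3[01])[-/.](0?[1-9]|1[0-2])[-/.](\d{4})")


def looks_like_date(text: str) -> bool:
    """Shape check only: day 1-31, month 1-12, four-digit year."""
    return DATE.fullmatch(text) is not None


def is_real_date(text: str) -> bool:
    """Shape check plus a calendar check done by datetime.date."""
    match = DATE.fullmatch(text)
    if match is None:
        return False
    day, month, year = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


print(looks_like_date("32.13.0001"))  # False
print(looks_like_date("30.02.2024"))  # True: the RE alone accepts it
print(is_real_date("30.02.2024"))     # False: there is no 30 February
```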

Swap two columns in a CSV

Suppose you have tabular data where the columns are separated by a comma (and the columns themselves do not contain commas). You want to swap the second and third entry. The search expression would be
/(.*?,)(.+?),(.+?)(,.*)/g
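and you’d swap columns 2 and 3 with
\1\3,\2\4
(this one is a bit tricky because of the way the capturing groups include the commas). The slashes and the trailing g in the search expression are not part of the pattern: they delimit it and ask for all matches, as in JavaScript. The expression also works with an empty first column. If you don’t want that, replace the * by a + in the first capturing group.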
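The column swap can be tried out with the following Python sketch. The function name swap_columns and the sample rows are invented. Python's re.sub replaces all matches by default, so there is no equivalent of the g flag to set.

```python
import re

# Group 1: first column with its comma; groups 2 and 3: the columns to swap;
# group 4: everything from the next comma to the end of the line.
SWAP = re.compile(r"(.*?,)(.+?),(.+?)(,.*)")


def swap_columns(text: str) -> str:
    """Swap the second and third comma-separated column on every line."""
    return SWAP.sub(r"\1\3,\2\4", text)


print(swap_columns("a,b,c,d"))           # a,c,b,d
print(swap_columns(",b,c,d"))            # ,c,b,d  (empty first column)
print(swap_columns("a,b,c,d\ne,f,g,h"))  # both lines are swapped
print(swap_columns("a,b,c"))             # a,b,c   (unchanged)
```

The last call shows a limitation: because the fourth group starts with a comma, rows with only three columns do not match and are left as they are.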

RE resources on the Web and elsewhere

My preferred test bed for REs is regex101.com. There, you can enter a RE and some text you want to test it against, and the page will tell you if and where your RE matches (or doesn’t). If you test an RE, make sure to run it against text where it should not match!

To learn more about REs, https://www.regular-expressions.info is an excellent point of reference. A bit short on examples, but very thorough in its explanations and in detailing the differences between RE implementations.

The best and definitive source on REs (in my opinion) is Jeffrey Friedl’s book “Mastering Regular Expressions”. Unfortunately, it’s out of print, but you can buy an e-book version.

saltlane · July 15, 2024, 12:44pm

A superb, easy guide to regex. Thanks

BLUEFROG · July 15, 2024, 1:22pm

Another good site is gskinner’s…

And for the very curious, DEVONthink uses ICU Regular Expressions as used by Apple.

PS: Topic is closed to preserve the integrity and educational value of the post. For regex related questions, please start a new thread and reference this one, if needed. Thanks!