If you have used the search function of word processors or other programs, you might have stumbled on “wild-card” characters like * and ?, standing for any number of characters and exactly one character, respectively. So, you might have used ???-??? ????
to search for US phone numbers. Which kind of works: This pattern will match a phone number like 555-123 4567
. But it will also match entirely different strings, like ABC-xyz 1234
.
REs come to the rescue, since they allow a very detailed specification of the pattern you are searching.
Staying with US phone numbers, you’d use a RE like \d{3}[ -]\d{3} \d{4}
. That means:
- three digits, followed by
- a space or a dash, followed by
- three digits, followed by
- a space, followed by
- four digits.
This expression matches 555-123 4567
and 555 123 4567
. But it does not match ABC-xyz 1234
.
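To try this out yourself, here is a minimal sketch using Python's standard `re` module. The choice of Python and the variable names are mine. The `re.ASCII` flag is only there to keep `\d` limited to the digits 0-9, as described above.

```python
import re

# The pattern from the text: 3 digits, a space or a dash, 3 digits, a space, 4 digits.
# re.ASCII restricts \d to the ASCII digits 0-9.
PHONE = re.compile(r"\d{3}[ -]\d{3} \d{4}", re.ASCII)

candidates = ["555-123 4567", "555 123 4567", "ABC-xyz 1234"]

# fullmatch() succeeds only if the whole string fits the pattern.
results = {s: bool(PHONE.fullmatch(s)) for s in candidates}
print(results)
# {'555-123 4567': True, '555 123 4567': True, 'ABC-xyz 1234': False}
```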
So, a RE is a string of characters specifying a pattern to search for. And you might want to know more about them because they are useful if you are looking for more than just fixed words.
DEVONthink allows you to use regular expressions in smart rules with its scan name
and scan text
actions. Together with other actions, these two allow you to change the name of a record or set tags depending on its content. Or you can modify the name of a record by rearranging its components.
Simple RE patterns
The simplest RE pattern is the dot ., which stands for any single character (in most engines, any character except a line break). For example, .he
matches strings like The
, the
, she
, She
and he
(that’s a space followed by “he”). To match a literal dot, for example in a number like 1.234, you must escape it with a backslash: 1\.234.
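The following Python sketch (my own illustration, not tied to any particular tool) shows the dot at work. The pattern is written as a raw string (`r"..."`), so the backslash reaches the RE engine unchanged and does not have to be doubled.

```python
import re

words = ["The", "the", "she", "She", " he", "he"]

# .he needs exactly one character (any character) in front of "he".
dot_matches = [w for w in words if re.fullmatch(r".he", w)]
print(dot_matches)   # ['The', 'the', 'she', 'She', ' he']

# An unescaped dot matches any character, an escaped one only a literal dot.
unescaped = bool(re.fullmatch(r"1.234", "1x234"))
escaped = bool(re.fullmatch(r"1\.234", "1x234"))
literal = bool(re.fullmatch(r"1\.234", "1.234"))
print(unescaped, escaped, literal)   # True False True
```

The bare `he` is missing from the result because nothing precedes it.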
You already saw the pattern for ASCII digits, \d
. There is no simple corresponding pattern for letters, though. [a-zA-Z]
matches all upper- and lower-case ASCII letters. To match any letter in any script, you can use \p{L}
with many current RE engines. Here, the L
stands for “letter”. There are other \p{}
sequences matching emojis, non-ASCII digits and other Unicode classes.
Looking at the patterns matching digits and letters, you might be wondering why they look so different. Some background information should help to understand that.
Digits and letters are special cases of character classes in RE. These character classes are enclosed in square brackets []
. For example, [ab]
matches the lower-case letters a
and b
, and [- ]
matches a dash or a space. To match a range of characters, you write its first element, a dash, and its last element: [a-z]
matches all lower-case ASCII letters. Similarly, [0-9]
matches all ASCII digits, and \d
is only an abbreviation for that: two characters instead of five. Note that if you want a literal dash in a character class, it should be the first or the last character: [- ] and [ -] both match a dash or a space, whereas a dash between two characters, as in [a-z], denotes a range.
Another often used shortcut is \s for “ASCII white-space”, which you’d have to write as [ \t\n\r\f\v] otherwise.
There are also inverted character classes: [^ab]
matches any character but a
and b
, [^0-9]
matches anything but digits. The shorthand versions use upper-case letters, so that \D
matches anything but a digit, and \S
matches anything but white-space.
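Here is a small Python sketch of these character classes. The sample strings are made up for illustration.

```python
import re

text = "Room 42-b, floor 3"

# A range inside square brackets: every single ASCII digit.
digits = re.findall(r"[0-9]", text)
print(digits)            # ['4', '2', '3']

# An inverted class: runs of characters that are not digits.
non_digit_runs = re.findall(r"[^0-9]+", text)
print(non_digit_runs)    # ['Room ', '-b, floor ']

# A class listing single characters; the dash comes first, so it is literal.
dash_or_space = re.findall(r"[- ]", text)
print(dash_or_space)     # [' ', '-', ' ', ' ']

# The upper-case shorthand \S: runs of anything but white-space.
chunks = re.findall(r"\S+", "a b\tc\n")
print(chunks)            # ['a', 'b', 'c']
```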
How much do you want to match?
REs provide several notations to simplify repeating patterns. They are called quantifiers, must immediately follow the pattern and determine how often it must repeat to match:
- * matches any number of occurrences, including none;
- + matches one or more occurrences;
- ? matches zero or one occurrence;
- {n}, where n is a number, matches exactly n occurrences;
- {n,}, where n is a number, matches at least n occurrences;
- {n,m}, where n and m are numbers, matches at least n and at most m occurrences.
Use the asterisk wisely. For example, a* will find a match in the string b, which might surprise you. But the * matches the preceding pattern zero or more times, and that is exactly what happens here: zero occurrences of a, i.e. an empty match right in front of the b.
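To see the quantifiers side by side, here is a Python sketch. The helper `accepted` and the sample list are hypothetical names of mine; the function simply reports which samples a pattern matches completely.

```python
import re

# a* succeeds on "b" by matching zero "a"s: an empty match at position 0.
m = re.match(r"a*", "b")
empty_match = (m.group(0), m.span())
print(empty_match)   # ('', (0, 0))

samples = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(accepted(r"a*"))       # ['', 'a', 'aa', 'aaa', 'aaaa']
print(accepted(r"a+"))       # ['a', 'aa', 'aaa', 'aaaa']
print(accepted(r"a?"))       # ['', 'a']
print(accepted(r"a{2}"))     # ['aa']
print(accepted(r"a{2,}"))    # ['aa', 'aaa', 'aaaa']
print(accepted(r"a{2,3}"))   # ['aa', 'aaa']
```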
So, don’t use *
indiscriminately – in addition to matching nothing, it also can gobble up more of the string than you intend. Let’s say you want to find HTML start tags with a RE and try this: <.*>
. That will match
- an opening <, followed by
- any number of any characters, followed by
- a closing >.
Applying that to the string <a href="#target">Target</a><p>Some text</p>
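will not only match <a href="#target">, but the complete string, because the * is greedy: together with the preceding dot, it first consumes all characters to the end of the string and then gives back only as many as needed to find a closing >, which is the very last one.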
The *
and +
quantifiers can be made non-greedy (lazy) by appending a ?
. Thus, <.+?>
matches <a...> and <p> in the string used above (and also the closing tags </a> and </p>). Alternatively, using a negated character class obtains the same result: <[^>]+> matches a < followed by at least one character that is not >, followed by a >.
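The difference between greedy and lazy matching is easiest to see by running the three patterns against the same string. The Python sketch below does that. The last pattern, which skips closing tags, is an addition of mine and not part of the explanation above.

```python
import re

html = '<a href="#target">Target</a><p>Some text</p>'

# Greedy: .* runs to the end and backtracks to the LAST ">".
greedy = re.findall(r"<.*>", html)
print(greedy)       # ['<a href="#target">Target</a><p>Some text</p>']

# Lazy: .+? stops at the FIRST ">" it can reach.
lazy = re.findall(r"<.+?>", html)
print(lazy)         # ['<a href="#target">', '</a>', '<p>', '</p>']

# Negated character class: same result without lazy quantifiers.
negated = re.findall(r"<[^>]+>", html)
print(negated)      # ['<a href="#target">', '</a>', '<p>', '</p>']

# Assumption of mine: to keep start tags only, forbid "/" right after "<".
start_tags = re.findall(r"<[^/>][^>]*>", html)
print(start_tags)   # ['<a href="#target">', '<p>']
```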
Escaping special characters
Since many characters have a special meaning in REs, you must escape them using a backslash \ if you need a literal match. These characters are ., +, ?, *, ^, $, |, [, ], {, }, ( and ). However, inside a character class, you only need to escape ]
. Thus, \+
matches a single “+” sign, as does [+]
.
If you want to match a literal backslash, you have to escape it everywhere, also in a character class.
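A short Python sketch of escaping follows. The sample strings are invented. `re.escape` is Python's helper for escaping arbitrary literal text; other environments may or may not offer something similar.

```python
import re

# Two ways to match a literal plus sign.
plus_escaped = re.findall(r"\+", "1+1=2")
plus_in_class = re.findall(r"[+]", "1+1=2")
print(plus_escaped, plus_in_class)   # ['+'] ['+']

# A literal backslash must be escaped, also inside a character class.
backslashes = re.findall(r"\\", r"C:\temp\x")
slashes = re.findall(r"[\\/]", r"C:\temp/x")
print(len(backslashes), len(slashes))   # 2 2

# Unescaped, "$", "." and the parentheses keep their special meaning.
sentence = "Only $9.99 (approx.) today"
unescaped_found = re.search("$9.99 (approx.)", sentence) is not None
print(unescaped_found)   # False

# re.escape() adds the backslashes for you.
price = re.escape("$9.99 (approx.)")
found = re.search(price, sentence) is not None
not_found = re.search(price, "Only $9x99 (approx.) today") is None
print(found, not_found)   # True True
```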
Re-using parts of a match
If you wanted to not only match start tags but a complete HTML element, you could be tempted to write an RE like this <a.*?>.*?</a>
and repeat that for all HTML elements. While that might be possible, it would be a pain.
HTML elements consist of an opening tag with the element’s name, followed possibly by text or other elements and a closing tag, again with the element’s name. If we capture the name of the element in the opening tag, we can re-use it in the closing tag:
<([a-zA-Z]+).*?>(.*</\1>)?
If you use this RE with the text <a href="#target">Target</a><p>Some text</p>
, it will find two matches:
<a href="#target">Target</a>
<p>Some text</p>
But don’t get carried away and try to parse arbitrary HTML with REs: that is not possible. If, for example, the p
element in the preceding example contains an img
, our RE will not see that. It will simply return the whole p
, including the img
.
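Here is the RE from above in a Python sketch, including the case of a nested element. The second sample string with the `img` is made up for illustration.

```python
import re

# Group 1 captures the element name; \1 re-uses it in the closing tag.
element = re.compile(r"<([a-zA-Z]+).*?>(.*</\1>)?")

html = '<a href="#target">Target</a><p>Some text</p>'
matches = [m.group(0) for m in element.finditer(html)]
names = [m.group(1) for m in element.finditer(html)]
print(matches)   # ['<a href="#target">Target</a>', '<p>Some text</p>']
print(names)     # ['a', 'p']

# A nested element is swallowed by the match for its parent.
nested_html = '<p>See <img src="x.png"> here</p>'
nested = [m.group(0) for m in element.finditer(nested_html)]
print(nested)    # ['<p>See <img src="x.png"> here</p>']
```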
Parentheses in REs define capturing groups (in general). That means that you can refer to the text matched by a group by using \n
later in the RE, where n
is the number of the capturing group, starting with 1 for the first group.
In addition, capturing groups are useful if you want to search with a RE and replace the matches with something else. You should use capturing groups only if you need their contents. If you simply intend to group several patterns like with .*</\1>
above, you should always use a non-capturing group like so:
(?:.*</\1>)?
Another example of reusing patterns is the search for word repetitions:
(\w+)\W+\1
finds one or more “word” characters (\w) and saves them in a capturing group. It then looks for at least one non-word character (\W+), followed by the content of the capturing group. It will, for example, match the the.
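A Python sketch of the repeated-word search follows. It also shows a pitfall: without word boundaries, the pattern finds "is is" inside "this is". Anchoring it with `\b` is my own suggestion, not something the explanation above relies on.

```python
import re

doubled = re.compile(r"(\w+)\W+\1")

m = doubled.search("It was the the best of times")
print(m.group(0), "|", m.group(1))   # the the | the

# Pitfall: the group may match only the tail of a word.
false_hit = doubled.search("this is fine").group(0)
print(false_hit)   # is is

# Assumption of mine: \b (word boundary) on both sides avoids that.
doubled_words = re.compile(r"\b(\w+)\W+\1\b")
no_hit = doubled_words.search("this is fine")
real_hit = doubled_words.search("It was the the best of times").group(0)
print(no_hit, "|", real_hit)   # None | the the
```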
Modifying strings with REs
In a replacement operation, REs are even more powerful than with simple searches. However, the exact syntax for using the content of capturing groups in replacements depends on the environment. JavaScript’s replace
and replaceAll
methods refer to capturing groups as $1
, $2
and so on. DEVONthink’s smart rules and the UNIX command line tool sed
use \1
, \2
and so on.
Here, I’ll use \n
to refer to the nth
matched capturing group in a regular expression replacement.
Example: Change the date from German to ISO format
Suppose your text contains dates in the German format, i.e. DD.MM.YYYY
, and you want to convert them to ISO dates (which permits easier sorting by date). Your regular expression would be
(\d\d)\.(\d\d)\.(\d{4})
so that the day, month and year would be captured in the first, second and third capturing group. Then your replacement string would be
\3-\2-\1
.
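In Python, whose `re.sub` happens to use the same \1, \2 notation in the replacement string, the conversion could be sketched like this. The sample sentence is invented.

```python
import re

# Day, month and year end up in groups 1, 2 and 3.
german_date = re.compile(r"(\d\d)\.(\d\d)\.(\d{4})")

text = "Invoice dated 07.02.2014, due 01.03.2014."

# Re-assemble the groups as year-month-day.
iso = german_date.sub(r"\3-\2-\1", text)
print(iso)   # Invoice dated 2014-02-07, due 2014-03-01.
```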
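You could make this expression more flexible by using [-/.] instead of \. and \d{1,2} instead of \d\d in the search expression. Then you’d be able to match dates like 2/7/2014 as well as 4-8-2013 and 1.1.1990 (although the replacement would not add the leading zeroes that ISO dates require).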
Note: As it stands, the search expression will match invalid dates like 32.13.0001. A more robust variant would be (0?[1-9]|[12]\d|3[01])[-/.](0?[1-9]|1[0-2])[-/.](\d{4})
It uses the alternation symbol |
(“or”) and matches only valid days and months, with or without leading zeroes. The first capturing group means match
- a single digit between 1 and 9, with or without a leading zero, or
- any number between 10 and 29, or
- 30 or 31.
However, it would still permit a February with 30 days or a 31st of September, and it requires four digits even for years before 1000. Excluding these kinds of malformed dates is possible with a more complicated RE.
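Instead of an ever more complicated RE, you can let the RE check the rough shape and leave the calendar logic to a date library. The Python sketch below does that, assuming a variant of the pattern in which the month may not be zero and the year has its own capturing group. The function name `parse_date` is hypothetical.

```python
import re
from datetime import date

# Day 1-31, month 1-12, four-digit year; separators are -, / or .
DATE = re.compile(r"(0?[1-9]|[12]\d|3[01])[-/.](0?[1-9]|1[0-2])[-/.](\d{4})")

def parse_date(s):
    """Return a date object for a valid day-month-year string, else None."""
    m = DATE.fullmatch(s)
    if m is None:
        return None
    day, month, year = (int(g) for g in m.groups())
    try:
        # date() rejects impossible combinations such as 30 February.
        return date(year, month, day)
    except ValueError:
        return None

print(parse_date("32.13.0001"))   # None
print(parse_date("30.02.2020"))   # None
print(parse_date("2/7/2014"))     # 2014-07-02
print(parse_date("31.12.1999"))   # 1999-12-31
```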
Swap two columns in a CSV
Suppose you have tabular data where the columns are separated by a comma (and the columns themselves do not contain commas). You want to swap the second and third entry. The search expression would be
/(.*?,)(.+?),(.+?)(,.*)/g
and you’d swap column 2 and 3 with
\1\3,\2\4
(this one is a bit tricky because the capturing groups include some of the commas). It also works with an empty first column. If you don’t want that, replace the *
by a +
in the first capturing group.
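Here is the column swap as a Python sketch, applied row by row. The sample rows are invented, and the slashes and the g flag from the search expression are dropped because Python does not use that notation. Note that the expression needs at least four columns: a row with only three is left unchanged.

```python
import re

# Group 1: first column plus comma, groups 2 and 3: the columns to swap,
# group 4: the rest of the row, starting with a comma.
swap = re.compile(r"(.*?,)(.+?),(.+?)(,.*)")

rows = ["id,last,first,city", ",Smith,Anna,Berlin", "1,2,3,4,5"]
swapped = [swap.sub(r"\1\3,\2\4", row) for row in rows]
print(swapped)
# ['id,first,last,city', ',Anna,Smith,Berlin', '1,3,2,4,5']

# Only three columns: no match, so nothing is replaced.
unchanged = swap.sub(r"\1\3,\2\4", "a,b,c")
print(unchanged)   # a,b,c
```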
RE resources on the Web and elsewhere
My preferred test bed for REs is regex101.com. There, you can enter a RE and some text you want to test it against, and the page will tell you if and where your RE matches (or doesn’t). If you test an RE, make sure to run it against text where it should not match!
To learn more about REs, https://www.regular-expressions.info is an excellent point of reference. A bit short on examples, but very thorough in its explanations and in detailing the differences between RE implementations.
The best and definitive source on REs (in my opinion) is Jeffrey Friedl’s book “Mastering Regular Expressions”. Unfortunately, it’s out of print, but you can buy an e-book version.