Seeking advice on how to use regex for capturing part of a markdown link

AW2307 · November 2, 2024, 10:29pm

Hi,

I’m trying to use Regex via the “Scan Text” smart rule action to add part of a markdown link to a custom data field.

However, it seems that the URL itself is inaccessible via Regex.

For example, take the following text in a markdown document:
"This is some text. And this is the [item link](brain://abc123) followed by more text."

The following RegEx (for debugging purposes) should in theory capture everything: ([\S\s]*)

However, it only captures the following:
"This is some text. And this is the [item link] followed by more text."

Therefore, it seems I will not be able to achieve the goal of isolating and capturing “abc123”, since it seems to be ignored completely.

Is this a known limitation, or is there a way around it? Any advice would be highly appreciated.

korm · November 2, 2024, 11:12pm

If you are looking to extract just the URI from the text, and the requirement is not just for brain, then try

(?<=\]\()\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\)]+
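
For anyone who wants to check this pattern outside DEVONthink, here is a minimal Python sketch. It assumes Python's `re` engine, which may not behave exactly like the one behind "Scan Text"; the sample string is the one from the first post. Note that the pattern uses a lookbehind and has no capture group, so the URI is the whole match (group 0) and there is nothing for `\1` to refer to.

```python
import re

# Pattern from the post above: lookbehind for "](" followed by scheme://...
URI_AFTER_LINK_TEXT = re.compile(r"(?<=\]\()\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\)]+")

sample = ("This is some text. And this is the [item link](brain://abc123) "
          "followed by more text.")

match = URI_AFTER_LINK_TEXT.search(sample)
# The URI is the entire match; the pattern defines zero capture groups.
uri = match.group(0) if match else None
print(uri)
print(URI_AFTER_LINK_TEXT.groups)  # number of capture groups in the pattern
```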

AW2307 · November 2, 2024, 11:21pm

Thanks for the input. Correct, I am only trying to extract the URI.

Tried the Regex, but for some reason I just get \1 in the custom field. It does not seem to find a match…

mbbntu · November 2, 2024, 11:49pm

I just tried an experiment using https://regex101.com, and I found that this worked:

(?<=\]\()\b[a-zA-Z][a-zA-Z0-9+.-]*:\/\/[^\)]+

BLUEFROG · November 3, 2024, 5:42am

Have you enabled IndexRawMarkdownSource in DEVONthink’s hidden preferences?

See…

AW2307 · November 3, 2024, 5:46am

Hi @mbbntu

Yes, it does.

But for whatever reason it does not seem to work if I use a smart rule “scan text” and a subsequent one “change custom field”.

Maybe it’s related to the text containing a link in markdown format, see unexpected behavior described in initial post.

AW2307 · November 3, 2024, 5:47am

Hi @bluefrog

Yes, this setting is enabled. Does it impact Regex behavior?

meowky · November 3, 2024, 5:56am

That’s because you did not specify a capture group.

Try the following regex instead:

]\([^:)]+:\/\/(.+?)[ |)]
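
A small Python sketch of what this pattern captures, assuming Python's `re` engine as a stand-in (DEVONthink's engine may differ) and the sample text from the first post. Here the part after `://` lands in group 1, which is what `\1` refers to.

```python
import re

# Pattern from the post above, with one capture group for the part after "://".
AFTER_SCHEME = re.compile(r"]\([^:)]+:\/\/(.+?)[ |)]")

sample = ("This is some text. And this is the [item link](brain://abc123) "
          "followed by more text.")

match = AFTER_SCHEME.search(sample)
# Group 1 is what a "\1" placeholder would refer to.
item_id = match.group(1) if match else None
print(item_id)
```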

chrillek · November 3, 2024, 6:53am

The [ and ] are special characters in RegExes. You must escape them to match literally.
Try your RE in regex101.com. That tells you what matches where (or doesn’t).

meowky · November 3, 2024, 7:12am

FYI only the left square bracket [ is a special character. The right one ] is not required to be escaped in most scenarios.
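
To illustrate the point about brackets, here is a minimal sketch using Python's `re` module. This shows Python's behaviour only; other engines can be stricter about an unescaped `]`, so escaping both brackets is the safe choice when in doubt. The helper name `compiles` is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

right_ok = compiles(r"]")   # unescaped right bracket: accepted as a literal
left_ok = compiles(r"[")    # unescaped left bracket: opens an unterminated set
# Escaping both brackets is always accepted and matches them literally.
literal = re.search(r"\[item link\]", "this is the [item link](brain://abc123)")

print(right_ok, left_ok, literal is not None)
```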

AW2307 · November 3, 2024, 8:45am

Thanks again for your input. However, it’s still not working i.e. there appears to be no match.

I will set the IndexRawMarkdownSource hidden preference to off, rebuild the database (otherwise the change doesn’t take effect, if I remember correctly), and then try again.

AW2307 · November 3, 2024, 9:49am

After setting IndexRawMarkdownSource to off and rebuilding the database, these are the findings:

- The URI is still unexpectedly not captured via ([\S\s]*), even though it is in Regex101.
- While the Regex variations kindly provided by different users in this thread work in Regex101, they do not seem to work via the smart rule “Scan Text” action.

This is how I am setting up the smart rule:

chrillek · November 3, 2024, 9:58am

But you do want DT to index it, so why set it to off?

Your RE says “get me between zero and any number of space and non-space characters”. That’s equivalent to (.*) when the dot is also allowed to match line breaks. Something you should always be wary of.

Yes. And no: the RE matches everything. Your whole text. If you use
This is some text. And this is the [item link](brain://abc123) followed by more text. as the test string in regex101, you’ll see that it matches the URI. And everything else.
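
A short Python sketch of how ([\S\s]*) and (.*) compare in an ordinary regex engine. It assumes Python's `re` module, and the two-line test string is made up for illustration. On a single line both capture everything, URI included; they only differ once the text contains a line break, unless the dot is allowed to match newlines.

```python
import re

# Hypothetical two-line test string containing a Markdown link.
text = "first line with [item link](brain://abc123)\nsecond line"

class_match = re.match(r"([\S\s]*)", text).group(1)           # any character at all
dot_match = re.match(r"(.*)", text).group(1)                  # stops at the line break
dotall_match = re.match(r"(.*)", text, re.DOTALL).group(1)    # dot also matches "\n"

print(class_match == text)
print(dot_match)
print(dotall_match == text)
```

So in a standard engine the URI is part of the match; if it is missing, the text handed to the regex presumably did not contain it in the first place.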

This
\]\((.*?:\/\/.*?)\)
captures your sample URI in the first capturing group. It should work in DT (and is obviously less complicated than the one you posted in your screenshot). \1 should give you access to the captured URI. Try showing it with an alert or notification in your smart rule. If that works, the issue is not with the RE but with the custom meta data field (we don’t know how that’s defined, for example). DT sometimes does not behave as one would expect in that context.
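
As a quick check outside DEVONthink, here is a minimal Python sketch of a capture-group pattern of this shape, written with both the closing bracket and the opening parenthesis escaped so that group 1 holds only the URI. It assumes Python's `re` engine, which is only a stand-in for whatever "Scan Text" uses, and `expand` is used here to mimic a `\1` placeholder.

```python
import re

# Closing bracket and opening parenthesis escaped; group 1 is the URI.
LINK_URI = re.compile(r"\]\((.*?:\/\/.*?)\)")

sample = ("This is some text. And this is the [item link](brain://abc123) "
          "followed by more text.")

match = LINK_URI.search(sample)
captured = match.group(1) if match else None
# expand() substitutes "\1" with group 1, similar to a "\1" placeholder.
placeholder = match.expand(r"\1") if match else None
print(captured, placeholder)
```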

AW2307 · November 3, 2024, 11:16am

I tried using ([\S\s]*) to generate an alert. It again excludes the URI from the captured text.

If ([\S\s]*) doesn’t capture the URI, it does seem that something unrelated to Regex is getting in the way. Potentially a bug?

Defining the correct Regex is definitely not the issue. There are now several variations that should work according to Regex101.

chrillek · November 3, 2024, 4:08pm

Turn on raw markdown indexing. After that, create a new MD document and test your rule with it.

BLUEFROG · November 3, 2024, 4:27pm

If you just enabled it, you’d need to rebuild the database, reimport the file, or create a new one.

troejgaard · November 3, 2024, 4:29pm

I have not tried to use the “Scan Text” smart rule action with RegEx on markdown before. I just tried enabling IndexRawMarkdownSource and creating a new document with links after that. It also doesn’t work for me. It does work with things that are not links, like HTML <tags>; actually, I don’t seem to need the hidden preference for that. But for [links](https://) it only shows [links]. I didn’t rebuild the database though, because I don’t want this currently, so I don’t know if that is the reason.

What I can quickly find in the forum seems conflicting.

Here BLUEFROG says this (March 2023):

But in September 2022 he says this, specifically about the “Scan Text” smart rule action:

Does that still stand? Is the “Scan Text” a case where IndexRawMarkdownSource has no effect?

BLUEFROG · November 3, 2024, 4:31pm

My question is why are you approaching this with RegEx? And are you expecting only one URL per document since it’s not going to match and return more than one?
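
To make the one-match-versus-many point concrete, here is a small Python sketch. It assumes Python's `re` module; the pattern and the two-link sample text are made up for illustration and are not taken from the thread. A single search returns only the first link target, while collecting all matches returns every one.

```python
import re

# Hypothetical pattern: the target inside "](...)" of a Markdown link.
LINK_TARGET = re.compile(r"\]\(([^)\s]+)\)")

# Made-up document with two links.
doc = "See [one](brain://abc123) and [two](https://example.com/page) for details."

first = LINK_TARGET.search(doc)
first_target = first.group(1) if first else None   # only the first link
all_targets = LINK_TARGET.findall(doc)             # every link target, in order

print(first_target)
print(all_targets)
```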

BLUEFROG · November 3, 2024, 4:34pm

Likely, as I don’t recall a change.

Here is a simple (and a bit more verbose) smart rule example…

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- Read the record's source and collect the links found in it
			set src to source of theRecord
			set documentLinks to (get links of src)
			-- Store only the first link, if any, in the "External Link" custom metadata field
			if documentLinks is not {} then add custom meta data (item 1 of documentLinks) for "External Link" to theRecord
		end repeat
	end tell
end performSmartRule

But again, I don’t know your expectation in terms of how many links are in the document, etc.

troejgaard · November 3, 2024, 4:37pm

Then that explains the “problem” here.

I was thinking the same. @AW2307 you don’t give many details. How do the documents in question look in real life? What is the bigger goal?

Maybe it is not possible to achieve the bigger goal in this specific way, but there might be other ways to go about it. DEVONthink can do many things. Maybe a script, or something else.