Regex changing of URL seems to add a character to found text

jrickmd · February 6, 2022, 7:06pm

I have a rule that looks for a URL in imported records:

The text from a sample record is:

**Tweet by Thomas Dayspring**

With respect to lipid and lipoprotein population percentile cut points - Here is the data from the more ethnically/racially diverse MESA population [pic.twitter.com/xpKupZI6Nl](https://t.co/xpKupZI6Nl)

Thomas Dayspring (@Drlipid) [February 6, 2022](https://twitter.com/Drlipid/status/1490361799065149448)

The smart rule finds the URL, but it appears to add a closing bracket (i.e., ]) to the end of the URL for some reason. What am I doing wrong?

2022-02-06-13-05-50_screenshot

Rick

chrillek · February 6, 2022, 7:20pm

I suppose that you want to match pic.twitter.com/xpKupZI6Nl and none of the other URIs. In that case, I’d not use a greedy RE, but one that is as specific as possible. For example
pic.twitter.com/([0-9a-zA-Z]*)
As it stands, the RE should gobble up everything from pic.twitter… to xpKupZI6Nl), because only then it will find the newline.
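
To make the difference concrete, here is a minimal sketch using Python's `re` module as a stand-in for the smart rule's regex engine (an assumption; the two engines may differ in details). The "loose" pattern is hypothetical and only illustrates how a permissive character class can pick up the Markdown `]`; the actual pattern from the rule is only visible in the screenshot.

```python
import re

# Markdown source of the tweet line from the sample record (shortened).
markdown_source = (
    "Here is the data from the more ethnically/racially diverse MESA population "
    "[pic.twitter.com/xpKupZI6Nl](https://t.co/xpKupZI6Nl)"
)

# Specific pattern from the reply above, with the dots escaped so they
# match a literal "." instead of any character.
specific = re.compile(r"pic\.twitter\.com/([0-9a-zA-Z]*)")

# HYPOTHETICAL loose pattern (not the rule from the screenshot): it accepts
# anything up to whitespace or "(", so the Markdown "]" is swept in.
loose = re.compile(r"pic\.twitter\.com/[^\s(]+")

specific_match = specific.search(markdown_source).group(0)
loose_match = loose.search(markdown_source).group(0)

print(specific_match)
print(loose_match)
```

The specific pattern stops at the first character that is not a letter or digit, so the bracket that closes the Markdown link text never becomes part of the match.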

jrickmd · February 6, 2022, 7:42pm

I appreciate the difference that you describe (and realize I will never become an expert in RegEx)…

And it worked… wondering how the ‘]’ character got in there though… it IS found in the markdown text of the document, but is not in the visible shown text. I understood that DT was looking only at the displayed text and not the markdown.

Is this a bug or my misunderstanding?

Here is a screenshot of the displayed and MD text of the file…

chrillek · February 6, 2022, 8:03pm

The RE is, of course, looking in the MD text. The “other” version is HTML, and you would not want to scan that.

jrickmd · February 6, 2022, 8:10pm

I see you commenting on other topics in regards to “non-rendered text” in markdown. (DT3 searching markdown footnotes fails - #21 by cgrunenberg)
I thought the outcome of that discussion was that DT was only searching “rendered text” which in my mind would be the text inside of the brackets, and not the actual URL inside the parentheses…

does DT in fact search the entire markdown source?

If it does, why did only the closing bracket end up in my RE result and not the rest of the text or the rest of the line?
If it doesn’t, why did the closing bracket show up in my RE result?

Rick

chrillek · February 7, 2022, 8:47am

First, DT never searches in the “rendered” text (aka the HTML), simply because that does not really exist (in the sense of a document); it is merely an ephemeral something presented to you on the fly. Also, if you convert an MD document to HTML (i.e. a new document), DT will most certainly not search in the raw text of the HTML (which is full of things like span, header, div and so on) but in the text part of it.

From what I understood from @cgrunenberg’s and others’ explanations, DT builds an index of a document into which it enters certain words. In the process, it removes stuff like punctuation, parentheses, brackets etc. As a result, the index contains only a part of all the glyphs in the text. That’s what the thread you mentioned was dealing with.

Now, there are two ways to search in DT. One is with the scan methods in smart rules, which involves regular expressions and works on the raw text of the document. Which might still exclude the HTML elements in the case of HTML documents, but I don’t know about that. This method (i.e. scan something with a regular expression) will find punctuation, brackets etc.

The other method has no special name, but it looks only at the index. I.e. punctuation etc is excluded. And this is the method that you use in all the other smart rule matches (those without the scan part) and when pressing Cmd-F or in the search bar.
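
A toy sketch of the two approaches, in Python. This is not how DT builds its index; `build_toy_index` and `scan_raw` are made-up helpers that only illustrate why a regex scan can see brackets while an index lookup cannot.

```python
import re

# The Markdown link from the sample record.
markdown_source = "[pic.twitter.com/xpKupZI6Nl](https://t.co/xpKupZI6Nl)"


def build_toy_index(text):
    """Hypothetical stand-in for a word index: lowercase alphanumeric runs only."""
    return set(re.findall(r"[0-9a-z]+", text.lower()))


def scan_raw(pattern, text):
    """Regex scan over the raw text; punctuation is still there to be matched."""
    return re.findall(pattern, text)


toy_index = build_toy_index(markdown_source)
raw_hits = scan_raw(r"[\[\]()]", markdown_source)

print(sorted(toy_index))  # words only, no brackets or parentheses
print(raw_hits)           # the brackets and parentheses found by the raw scan
```

In this sketch the index knows the word `xpkupzi6nl` but has no entry for `]`, whereas the raw scan finds every bracket and parenthesis of the link syntax.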

So the answer to

does DT in fact search the entire markdown source?

is: that depends on how you search.

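If it does, why did only the closing bracket end up in my RE result and not the rest of the text or the rest of the line?

Good question, to which I have no immediate answer. You might want to head over to regex101.com and copy/paste your regular expression as well as the original text you’re searching in. That should tell you more.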

BLUEFROG · February 7, 2022, 1:10pm

does DT in fact search the entire markdown source?

Not by default, no.

There is a hidden preference, IndexRawMarkdownSource, documented in Help > Documentation > Appendix > Hidden Preferences, that can be toggled to change this. However, after enabling it, a database rebuild is required to index the source.