Smart rule wildcard behaviour

I just created my first smart rule. But I cannot understand the Wildcard behaviour.

I want to match all documents with a url like
http*://epaper.heise.de/download/archiv/*/ct.??.??.???-???.pdf

like for example https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf

But I get no matches. When I started to examine the problem by trial and error, I found that this rule matches the document:

https://epaper.heise.de/download/archiv/5*/ct.20.04.054-058.pdf

While this one doesn’t:
https://epaper.heise.de/download/archiv/*/ct.20.04.054-058.pdf

This leaves me in total confusion (and coming from someone who has worked with Perl regexps for 25 years, that means a lot :wink: ).

Can someone point me to my mistake, please?

Only the “matches” conditions support operators and wildcards, and both are applied to the word-based search index, not to complete, raw URLs. The easiest setup might be to use these conditions instead:

URL contains "//epaper.heise.de/download/archiv/"
URL ends with ".pdf"

Or would this return too many documents?
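For readers who think in code, here is a rough Python sketch of what those two raw-text conditions amount to. It is only an illustration, not how the application implements them: the helper names are made up, the second URL is a hypothetical non-matching example, and the comparison here is case-sensitive, which may or may not be true of the real conditions.

```python
# Sketch of the two raw-text conditions: "URL contains ..." and "URL ends with ...".
# Helper names are hypothetical; they only mirror the wording of the conditions.

def url_contains(url: str, fragment: str) -> bool:
    # Plain substring test on the complete, unsplit URL.
    return fragment in url

def url_ends_with(url: str, suffix: str) -> bool:
    # Plain suffix test on the complete, unsplit URL.
    return url.endswith(suffix)

urls = [
    # URL taken from the thread.
    "https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf",
    # Hypothetical URL that should not match.
    "https://example.com/some/other/file.pdf",
]

# Both conditions must hold, as in a rule where all conditions are required.
hits = [
    u for u in urls
    if url_contains(u, "//epaper.heise.de/download/archiv/")
    and url_ends_with(u, ".pdf")
]

for u in hits:
    print(u)
```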


Thanks for your reply.
Does that mean the URL is split on the slashes into words? That comes as a surprise. :slight_smile:

That helps me understand, but it would still return too many documents, as there are 3 different types of magazines on that server. They share the beginning //epaper.heise.de/download/archiv/, but differ in the first 2 characters of the filename:

ct.20.04.016-017.pdf
ix.18.09.100-104.pdf
mi.20.01.078-085.pdf

Indexing always splits text into words, and this word index is what the “matches” condition uses to support boolean operators and wildcards. Only the is/begins/ends/contains (not) conditions use the raw text.
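To make the difference concrete, here is a small Python sketch. The exact splitting rules of the real indexer are not stated in this thread, so the tokenizer below (split on anything that is not a letter or digit, then lowercase) is purely an assumption, and Python's fnmatch stands in for the wildcard syntax. It only shows why a pattern written for the whole URL can succeed against the raw text yet find nothing when wildcards are applied word by word.

```python
import fnmatch
import re

def tokenize(text: str) -> list[str]:
    # ASSUMPTION: words are runs of ASCII letters/digits, lowercased.
    # The real indexer may split differently.
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]

url = "https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf"
pattern = "http*://epaper.heise.de/download/archiv/*/ct.??.??.???-???.pdf"

tokens = tokenize(url)
print(tokens)

# Wildcard pattern applied to the complete, raw URL.
raw_match = fnmatch.fnmatchcase(url, pattern)

# The same pattern applied to each indexed word on its own.
word_match = any(fnmatch.fnmatchcase(t, pattern) for t in tokens)

# A wildcard that stays inside a single word can still hit a word.
prefix_match = any(fnmatch.fnmatchcase(t, "5*") for t in tokens)

print(raw_match, word_match, prefix_match)
```

Under these assumptions the whole-URL pattern matches the raw string but no individual word, while a wildcard confined to one word, such as 5*, does find one.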

You could add another condition like…

URL contains "/ct."

Alternatively the following condition should be able to replace all these conditions:

URL matches "epaper.heise.de" download archiv ct pdf
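As a rough model of that single condition, the following Python sketch treats the quoted part as a phrase (its words must appear consecutively) and requires every remaining term to appear as a word. That reading, the tokenizer, and the function names are all assumptions made for illustration. The three filenames come from this thread; the archive ID is reused from the one full URL posted above, so the ix and mi URLs are hypothetical.

```python
import re

def tokenize(text: str) -> list[str]:
    # ASSUMPTION: words are runs of ASCII letters/digits, lowercased.
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]

def contains_phrase(words: list[str], phrase: list[str]) -> bool:
    # True if the phrase words occur consecutively in the word list.
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def url_matches(url: str, phrase: str, terms: list[str]) -> bool:
    # ASSUMPTION: the quoted phrase and all plain terms are implicitly ANDed.
    words = tokenize(url)
    return contains_phrase(words, tokenize(phrase)) and all(
        t.lower() in words for t in terms
    )

# Archive ID reused from the thread; only used to build illustrative URLs.
BASE = "https://epaper.heise.de/download/archiv/594cfcd18563/"
urls = [
    BASE + name
    for name in (
        "ct.20.04.016-017.pdf",
        "ix.18.09.100-104.pdf",
        "mi.20.01.078-085.pdf",
    )
]

hits = [
    u for u in urls
    if url_matches(u, "epaper.heise.de", ["download", "archiv", "ct", "pdf"])
]

for u in hits:
    print(u)
```

In this model only the ct URL is kept, because "ct" has to be present as a word of its own.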


Yay, that works (and is specific enough). Thanks for the explanation!

Are all metadata fields split into tokens for the word index like that?

Yes, everything is indexed the same way to make the behaviour of the operators/wildcards consistent.