Smart rule wildcard behaviour

hoenix · March 29, 2020, 9:37am

I just created my first smart rule, but I cannot understand the wildcard behaviour.

I want to match all documents with a URL like
http*://epaper.heise.de/download/archiv/*/ct.??.??.???-???.pdf

for example https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf

But I get no matches. When I started to examine the problem by trial and error, I found that this rule matches the document:

https://epaper.heise.de/download/archiv/5*/ct.20.04.054-058.pdf

While this one doesn’t:
https://epaper.heise.de/download/archiv/*/ct.20.04.054-058.pdf

Which leaves me in total confusion (as someone who has worked with Perl regexps for 25 years, this means a lot).

Can someone point me to my mistake, please?

cgrunenberg · March 30, 2020, 9:43am

Only the “matches” conditions support operators & wildcards, and both are applied to the word-based search index, not to the complete, raw URL. The easiest setup might be to use these conditions instead:

URL contains "//epaper.heise.de/download/archiv/"
URL ends with ".pdf"

Or would this return too many documents?
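
To illustrate the difference between a word index and the raw URL, here is a minimal Python sketch. The tokenizer is an assumption (words are runs of letters and digits); the actual indexing rules are not spelled out in this thread, so treat it as a model of the idea rather than of the application itself.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    # Assumption: a "word" is a run of letters/digits; the real indexer may differ.
    return [w for w in re.split(r"[^0-9A-Za-z]+", text.lower()) if w]


def any_word_matches(pattern, text):
    # The wildcard pattern is tested against each word, never against the whole text.
    return any(fnmatchcase(word, pattern) for word in tokenize(text))


url = "https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf"

words = tokenize(url)
prefix_hit = any_word_matches("5*", url)      # matches the word "594cfcd18563"
slash_hit = any_word_matches("*/ct*", url)    # no word can contain a slash

print(words)
print(prefix_hit, slash_hit)
```

Under this assumption, a pattern that spans separators such as "/" or "." has nothing to match against, because those characters never appear inside an indexed word.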

hoenix · March 31, 2020, 6:52am

Thanks for your reply.
Does that mean the URL is split into words at the slashes? That comes as a surprise.

That helps me understand, but it would still return too many documents, as there are 3 different types of magazines on that server. They share the beginning //epaper.heise.de/download/archiv/, but differ in the first 2 characters of the filename:

ct.20.04.016-017.pdf
ix.18.09.100-104.pdf
mi.20.01.078-085.pdf

cgrunenberg · March 31, 2020, 7:54am

Indexing always splits text into words, and this information is used by the “matches” condition, which supports boolean operators & wildcards. Only the is/begins/ends/contains (not) conditions use the raw text.

You could add another condition like…

URL contains "/ct."
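
As a rough Python equivalent, the raw-text conditions behave like plain substring checks on the full URL. This is only a sketch: the directory name EXAMPLE_ID is a hypothetical placeholder, the filenames are the three mentioned earlier in the thread, and case handling is ignored.

```python
def is_ct_article(url):
    # Mirrors three raw-text conditions: two "contains" and one "ends with".
    return (
        "//epaper.heise.de/download/archiv/" in url
        and url.endswith(".pdf")
        and "/ct." in url
    )


# Hypothetical directory name, used only to build complete example URLs.
base = "https://epaper.heise.de/download/archiv/EXAMPLE_ID/"
names = [
    "ct.20.04.016-017.pdf",
    "ix.18.09.100-104.pdf",
    "mi.20.01.078-085.pdf",
]

selected = [name for name in names if is_ct_article(base + name)]
print(selected)
```

Without the "/ct." check, all three filenames would pass, which is the "too many documents" case.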

Alternatively, the following single condition should be able to replace all of these conditions:

URL matches "epaper.heise.de" download archiv ct pdf
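
A hedged sketch of how such a word-based condition could be evaluated. It assumes that the quoted part is treated as a phrase of consecutive words, that the remaining terms must each occur as a word somewhere in the URL, and that words are runs of letters and digits. None of these details are confirmed in this thread.

```python
import re


def tokenize(text):
    # Assumption: a "word" is a run of letters/digits.
    return [w for w in re.split(r"[^0-9A-Za-z]+", text.lower()) if w]


def contains_phrase(words, phrase_words):
    # True if phrase_words occur as consecutive words.
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def url_matches(url, phrase, terms):
    # The phrase and every single term must be found (implicit AND, assumed).
    words = tokenize(url)
    return contains_phrase(words, tokenize(phrase)) and all(t in words for t in terms)


url = "https://epaper.heise.de/download/archiv/594cfcd18563/ct.20.04.054-058.pdf"
terms = ["download", "archiv", "ct", "pdf"]

ct_result = url_matches(url, "epaper.heise.de", terms)
# Same URL with the filename prefix swapped, for comparison only.
ix_result = url_matches(url.replace("/ct.", "/ix."), "epaper.heise.de", terms)
print(ct_result, ix_result)
```

In this model the word "ct" is what separates the magazines, since the other terms occur in every URL from that server.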

hoenix · March 31, 2020, 10:54am

Yay, that works (and is specific enough). Thanks for the explanation!

Are all metadata fields split into tokens for the word index like that?

cgrunenberg · March 31, 2020, 11:57am

Yes, everything is indexed the same way to make the behaviour of the operators/wildcards consistent.