Benchmarking searches with and without wildcards

chrillek · December 17, 2023, 6:00pm

Since it was mentioned in another thread, I decided to run a very simplistic benchmark on DT’s search with and without wildcards.
Here’s the JavaScript code:

(() => {
  const app = Application("DEVONthink 3");
  const searchTerms = ['text:auszug datum', 'text:auszug* datum*', 'text:*auszug* *datum*'];
  const runs = 100;
  searchTerms.forEach(term => {
    let result;
    // Run the same search `runs` times and report the mean time per run.
    const start = Date.now();
    // `let` keeps the loop counter local instead of leaking an implicit global.
    for (let i = 0; i < runs; i++) {
      result = app.search(term);
    }
    const end = Date.now();
    console.log(`"${term}" search ${(end - start)/runs} ms per run ${result.length} matches`);
  });
})()

And here are the results for a set of search terms. The “key learning” is (not surprisingly) that more wildcards slow down the search, especially if you search for more than one word (see the last three lines of the results).
With no wildcards, the search is very fast. Prepending wildcards to every word causes a 12-fold slowdown (while the number of matches increases only by a factor of about 3.4). Appending wildcards to every word increases search time tenfold. And fencing every word with wildcards raises it to nearly 17 times the wildcard-free time.
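The factors quoted here can be recomputed from the two-word timings in this post. The following is only a sketch in plain JavaScript: the numbers are copied from the results, while `relativeToBaseline` and the field names are hypothetical helpers introduced for illustration, not part of DEVONthink's scripting interface.

```javascript
// Timings (ms per run) and match counts copied from the two-word results in this post.
const twoWordResults = [
  { term: 'text:auszug datum',     ms: 24.02,  matches: 143 },
  { term: 'text:*auszug *datum',   ms: 293.65, matches: 484 },
  { term: 'text:auszug* datum*',   ms: 245.8,  matches: 189 },
  { term: 'text:*auszug* *datum*', ms: 401.7,  matches: 509 },
];

// Hypothetical helper: express every row relative to the first (wildcard-free) row.
function relativeToBaseline(rows) {
  const [baseline] = rows;
  return rows.map(row => ({
    term: row.term,
    slowdown: +(row.ms / baseline.ms).toFixed(1),
    matchGain: +(row.matches / baseline.matches).toFixed(1),
  }));
}

const twoWordFactors = relativeToBaseline(twoWordResults);
twoWordFactors.forEach(f =>
  console.log(`"${f.term}" ${f.slowdown}x time, ${f.matchGain}x matches`));
```

For these measurements the slowdown factors come out at 12.2 (leading wildcards), 10.2 (trailing wildcards) and 16.7 (fenced), against match gains of 3.4, 1.3 and 3.6.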

"text:auszug datum"     search  24.02 ms per run 143 matches
"text:*auszug *datum"   search 293.65 ms per run 484 matches 
"text:auszug* datum*"   search 245.8 ms  per run 189 matches
"text:*auszug* *datum*" search 401.7 ms  per run 509 matches

All this for 9 databases with a total of 4.6 GB and fewer than 20,000 files. As one can see, search time for two wildcard-fenced words is about 0.4 seconds. For three words, the differences get bigger:

"text:auszug datum herrn"       search  23.41 ms per run  62 matches
"text:*auszug *datum *herrn"    search 384.79 ms per run 269 matches
"text:auszug* datum* herrn*"    search 220.25 ms per run 103 matches
"text:*auszug* *datum* *herrn*" search 675.54 ms per run 282 matches

Now, search time for wildcard-fenced words is nearly 30 times longer than for words without wildcards.
All that seems to indicate that it’s not a very good idea performance-wise to systematically fence all words with wildcards.
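To see which patterns actually get slower when a third word is added, the two result sets can be put side by side. This is a rough sketch in plain JavaScript: the timings are copied from the results in this post, and `growthByPattern` and the pattern labels are hypothetical names chosen for illustration.

```javascript
// ms per run, copied from the two-word and three-word results in this post.
const timings = {
  'no wildcards':      { twoWords: 24.02,  threeWords: 23.41 },
  'leading wildcard':  { twoWords: 293.65, threeWords: 384.79 },
  'trailing wildcard': { twoWords: 245.8,  threeWords: 220.25 },
  'fenced':            { twoWords: 401.7,  threeWords: 675.54 },
};

// Hypothetical helper: ratio of three-word time to two-word time per pattern.
function growthByPattern(data) {
  return Object.fromEntries(
    Object.entries(data).map(([pattern, t]) =>
      [pattern, +(t.threeWords / t.twoWords).toFixed(2)])
  );
}

const growth = growthByPattern(timings);
console.log(growth);
```

In these numbers, only the leading-wildcard and fenced searches grow with the third word (ratios of 1.31 and 1.68). The wildcard-free and trailing-wildcard searches stay roughly flat (0.97 and 0.9), although with a single series of 100 runs per term, differences that small may simply be measurement noise.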

FrankT · December 17, 2023, 7:02pm

Thank you, @chrillek. Interesting experiment.

The analysis is valuable. The conclusions are arbitrary.

“Very fast”, so there is almost no waiting time. Even 100 times longer than almost nothing is still not much more than almost nothing, so still fast.

This result depends on the texts that are searched. Apart from that: If what was searched for is found, that is a win.

Not at all. It indicates that the user should decide for himself whether he wants imprecise speed or whether he would rather wait a little longer to find what he was looking for. The important thing is that you can choose. The option to search for word components does not take anything away from anyone. It gives everyone a new opportunity.

chrillek · December 17, 2023, 7:30pm

I very clearly said “performance-wise” and “systematically”:

If you need it, do it. And wait a bit.

Defaulting search to word parts would slow down search for everyone. And I doubt that word part search is very useful in, e.g., English, French or Spanish.

FrankT · December 17, 2023, 7:38pm

I understood that. But that can also be discussed. What is the goal? To find what you are looking for. Is it better to search three times, with very short waiting times, or once with a slightly longer waiting time? What shows more “performance” in the end?

What about German? But again, an option means that you can use this feature or not.

The search with * is already implemented in DT. Is it so difficult to make it selectable as a feature?