Why the file with fewer occurrences of a search term ranks higher

alexzh · February 5, 2025, 10:45am

Hi community!
I am wondering how DT ranks the relevance of a file for a search term: by the number of occurrences of the term in the file, or by the term's share of the whole text?

BLUEFROG · February 5, 2025, 2:15pm

You say “a search term”. Are you talking about a single word?

alexzh · February 5, 2025, 2:34pm

Yes.

BLUEFROG · February 5, 2025, 3:44pm

Then a document with more occurrences of a single word would rank higher than a document with fewer.

The blue whale is the largest animal on Earth.

… or …

This sentence has the word blue in it.
No fruit is blue.
Being sad is often called "being blue".
Blue is a primary color on traditional color wheels.

Which do you think would rank higher?

NickLowe · February 5, 2025, 7:48pm

It’s percentage, and @Bluefrog’s second example would rank higher not because it has more occurrences of the search term but because the search term occurs once per seven words of the text (four times in 28 words) rather than once per nine words. A short article will generally be ranked higher in a search than a book-length document in which the term makes up a smaller proportion of the word count. (This is normally what you want.)
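
To make the arithmetic concrete, here is a rough Python sketch that counts occurrences per word for the two example texts above. It is only an illustration of the "proportion of word count" idea, not DT's actual ranking code, which isn't shown in this thread. The function name `term_density` and the simple tokenizer are my own assumptions.

```python
import re

def term_density(text, term):
    """Return (hits, total_words, hits / total_words) for a single-word term.

    Hypothetical helper for illustration only; the tokenizer is a simple
    letters-and-apostrophes regex, which is an assumption, not DT's rule.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0, 0, 0.0
    hits = sum(1 for w in words if w == term.lower())
    return hits, len(words), hits / len(words)

doc_a = "The blue whale is the largest animal on Earth."
doc_b = (
    "This sentence has the word blue in it. "
    "No fruit is blue. "
    'Being sad is often called "being blue". '
    "Blue is a primary color on traditional color wheels."
)

# Rank the two documents by density, highest first.
ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: term_density(item[1], "blue")[2],
    reverse=True,
)
for name, text in ranked:
    hits, total, density = term_density(text, "blue")
    print(f"{name}: {hits} hit(s) in {total} words, density {density:.3f}")
```

With this tokenizer the first text comes out at 1 hit in 9 words and the second at 4 hits in 28 words, so the second sorts first. A very long document with the same 4 hits would drop below both.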

alexzh · February 12, 2025, 10:53am

I see. Thanks a lot!