False positives with boolean searches?

When doing various kinds of boolean searches, I've started to notice that I often get a long list of file results, but a good percentage of them appear to be false positives. When I click on them, the in-file search inspector often returns nothing. This happens most often when a phrase in quotation marks is combined with a boolean search term. It seems as though both individual phrases are within the file, but DEVONthink is not excluding results as it should when the boolean requirement is not met.

What’s going wrong? Is some kind of setting enabled that includes partial matches in the results? Am I using incorrect search syntax? This is highly problematic for academic research and I’m hoping it’s something simple.

False positives are less of a problem than real matches that are never found.

What I am about to tell you will not make you happy. Search with three different apps in identical databases and you will get different results.

In theory this probably shouldn't happen, but in real life it does.

I've imported old data (RTF and PDF) that I no longer change but occasionally need into three apps (DT, Scrivener, MacJournal). If I want to be as sure as possible of finding everything I'm looking for, I search in all three apps.

The results often differ by two or three documents. What DT doesn't find, Scrivener or MacJournal will, or vice versa. :slightly_smiling_face:

Yeah, that certainly is problematic for two reasons:

  1. For research, a search can yield thousands of results in a database of 100,000+ documents, and having to parse hundreds or thousands of false positives manually adds a tremendous amount of time. If, as you're suggesting, false negatives are also a common reality, the problem is even bigger: after analyzing all results for a paper or dissertation, one can no longer claim that a particular dataset does or does not contain certain content.
  2. One of my objectives has been to develop improved academic research workflows. I have been demonstrating DT Pro at my university as a proof of concept for improved historical research workflows for retrieving digital primary sources. If it is incapable of objectively accurate searches (assuming the search parameters use correct syntax), then I must find another, more reliable solution. Is there a logical reason why the results should be inconsistent with the source data?

I understand why generative AI LLMs hallucinate, but I'm not clear on why this should happen here unless search is also built on the same underlying neural net/transformer architecture. My understanding is that searching an index is fundamentally different, and objectivity should be possible. If DEVONthink is fundamentally flawed and buggy, that is disappointing yet good to know, and I'll have to keep looking for something that isn't. If all indexes are fundamentally flawed in how they work, then that's another story.

I did a lot of preliminary testing with two databases of over 1,000 identical files and initially saw no inconsistencies at all between the results of simple searches in DEVONthink and another web-based solution. These false positives have only started showing up with more advanced search syntax, suggesting either an underlying issue with DEVONthink itself or improper syntax.

I am a historian, so I am used to being able to prove what I claim. Here I have to be careful. I could prove some things with screenshots and an exact record of what I searched for and how; these errors are reproducible. But of course I have not collected everything systematically over all these years. That would be pointless, because I can't do anything about it.

What I can say is that there are both false negative and false positive results. False positives are, as you say, annoying but easy to spot if you invest the time. False negatives are much harder to find. That's why I use multiple apps. Of course, that only works for home users and still doesn't guarantee finding everything, but it increases the chances.

When I realized this years ago, I was as disappointed as you probably are now. But as a historian, I see it pragmatically. We work with what we have. When I look through sources on paper in an archive, I’m bound to miss something. It’s the same with digital searches.

It’s clear why this can happen to a human being. Unfortunately, I don’t know why it happens to a software program.

Short of an actual issue with the index of your database (which is possible but likely not the case here), there is no great mystery, and I would not call these false positives.

You have given the search engine no context.

If I tell you I want something blue, is that very helpful?

  • Do I mean something with blue in the name, like The Blue Angels?
  • Or something in the content, like Roses are red; violets are blue?
  • Or something with an attribute of blue, like the sky or the ocean?
  • Or some combination of those things?
  • And what about related terms to help narrow the results, e.g., including water would eliminate sky?

So as you can see, my query is incomplete, and you too would give me many results. The fact that they're not what I was thinking of does not make them false positives. In fact, they are true positives, as they match exactly what I asked for.

You should be using search prefixes to provide context to the search engine, e.g., using text: when you’re looking for content, name: when searching by name, etc.

PS: This is a topic we’ve covered in the documentation, on these forums, and in the Help > Tutorials.

PPS: And yes, it is still possible (but much less likely) that there could be a bug in the interface :slight_smile:

Thank you for the clarification. I do not think this resolves the issue, however. I will attempt to demonstrate.

Suppose we take the DEVONthink manual as our example.

If I search for:

text:"occurs n" NEAR/3 "less"

The DEVONthink manual pops up as a result, with 4 in-file results, as shown on page 229 (see screenshot 1).

Suppose I modify my query to a boolean search I know should fail, as in screenshot 2.

text:"occurs n" NEAR/2 "less"

No result should be displayed because "occurs n" and the word "less" are at least 3 words apart in this document. And yet, the file is returned as if it contains my query when, in fact, it does not. This is demonstrated when the in-file search runs.

What am I doing wrong? I had hoped that the “text:” prefix would resolve this issue, but either my syntax is still incorrect, or this is a false positive.

In other words, what search syntax would I need to use at the top so that no file is returned on the left in this second case example since no text contents of any file technically match my second query?


It looks like there may be an issue with using a distance with the NEAR operator. Remove it and just use NEAR. Do you see the expected results?

@cgrunenberg would have to comment on this specifically.

I'd like to concur that something is not working as expected (for me). If I search for Agentur NEAR/4 verpflichtet, DT turns up a number of documents. Many of them are correct hits. Some, however, are not.

According to the documentation (**term1 NEAR/n term2**: term1 occurs n or less words before or after term2), I'd expect to always see Agentur before verpflichtet. However, that's not always the case. And this is not due to a bogus OCR result. I converted the PDF to pure text and got this:

Der Beitrag wird zusammen mit der Leistung an Sie ausgezahlt. 
Bitte leiten Sie diesen an Ihr Versicherungsunternehmen YYY 
(Aktenzeichen: xxxxx) weiter, dem Sie zur Zahlung verpflichtet sind.
Postanschrift
Agentur für Arbeit Berlin-Mitte

Clearly, verpflichtet comes before Agentur in the pure text.

Is that just a glitch in the documentation (in that the order in which the words appear doesn’t matter, only their distance), or is it something in DT that is not handling the NEAR correctly?

(in that the order in which the words appear doesn’t matter, only their distance)

NEAR is exactly as stated; order doesn’t matter, only distance…

If the order matters, you’d use BEFORE or AFTER

Note: nice in the second line is matched as it's still the proper distance away. Criss would have to comment on whether that's intended.
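To make these documented semantics concrete, here is a minimal sketch in Python (illustrative only; this is not DEVONthink's actual code):

    import re

    def positions(text, term):
        # Token indices at which a single word occurs.
        tokens = re.findall(r"\w+", text.lower())
        return [i for i, tok in enumerate(tokens) if tok == term.lower()]

    def near(text, term1, term2, n):
        # NEAR/n: terms at most n words apart, in either order.
        return any(abs(i - j) <= n
                   for i in positions(text, term1)
                   for j in positions(text, term2))

    def before(text, term1, term2, n):
        # BEFORE/n: term1 at most n words before term2 (order matters).
        return any(0 < j - i <= n
                   for i in positions(text, term1)
                   for j in positions(text, term2))

    def after(text, term1, term2, n):
        # AFTER/n: term1 at most n words after term2.
        return before(text, term2, term1, n)

    sample = "dem Sie zur Zahlung verpflichtet sind. Postanschrift Agentur für Arbeit"
    print(near(sample, "Agentur", "verpflichtet", 4))    # True: 3 words apart
    print(before(sample, "Agentur", "verpflichtet", 4))  # False: wrong order

Under these semantics, the Agentur NEAR/4 verpflichtet hit above is a true positive even though verpflichtet comes first.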

After having read the doc three times, I noticed that it says "before or after" (emphasis mine).

I’d suggest saying something like Both terms are n or less words apart or Both terms have a distance of n words or less. That way, neither term is named first, and the sentence becomes shorter, thus easier/faster to comprehend.

With the DEVONthink manual example, I do get the results when removing the distance parameter. But there are many other searches where NEAR returns false positives even without one.

This took some tinkering to demonstrate once again with the DEVONthink manual, but I think I've isolated the problem. Here's another example of the problem using NEAR alone.

It’s a little difficult to explain, so take a look at the three screenshots before reading on.

Basically, the false positive occurs when a legitimate result falls ONE word outside of the boolean search NEAR parameter. Two or more words outside of the parameter and no result is returned (expected correct behavior). This bug extends to the BEFORE operator as well, with a false positive at 11 words apart. The AFTER operator actually returns a proper result up to 11 words apart. At 12, the file is omitted from the results. So probably, the code for each boolean operator needs to be checked and fixed, especially in association with distance modifiers. I did not check every boolean operator.
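To illustrate the kind of off-by-one these results suggest (purely speculative on my part; I have no access to DEVONthink's code), a boundary check that is one word too generous would produce exactly this behavior:

    def within_correct(i, j, n):
        # Intended check: word positions at most n apart.
        return abs(i - j) <= n

    def within_buggy(i, j, n):
        # Hypothetical off-by-one: accepts matches one word outside the
        # requested distance, so a term n + 1 words away still "hits".
        return abs(i - j) <= n + 1

    # Two terms exactly 3 words apart, searched with NEAR/2:
    print(within_correct(0, 3, 2))  # False: correctly rejected
    print(within_buggy(0, 3, 2))    # True: the false positive observed above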

I can’t think of a practical use case where a content search returns a file to the left, but then nothing in the search inspector on the right. When this occurs, I would typically assume something is wrong. If it intends to return a fuzzy partial match that falls one word/step outside of the search parameters, it should at least jump to the partial match if it is going to do anything useful at all. If and when this happens, it should be something one can toggle on and off.

As I mentioned, Criss is going to have to weigh in on this.

Excellent. Hopefully, the additional examples will help him weigh in on what is happening and where the problem is. In the meantime, let me know if I need to clarify anything that was not expressed with sufficient clarity.

Trying the following three searches on the DEVONthink manual also demonstrates the problem, using this sentence on p. 229:

“For convenience, some of these operators can also be abbreviated using commonly used symbols”

SEARCH 1:

text:"convenience, some" BEFORE/4 "operators"

File returned, result returned. Works perfectly.

SEARCH 2:

text:"convenience, some" BEFORE/3 "operators"

The file is listed, but no in-file result is shown. A single match does exist that falls one word outside the 3-word proximity parameter, but this near match is neither jumped to nor highlighted, rendering the result useless: it is effectively a needle in a PDF haystack. This could be a useful kind of fuzzy search if there were an easy toggle for it, and if such results were listed and jumped to in a way visually distinct from results that actually fall within the search parameters.

SEARCH 3:

text:"convenience, some" BEFORE/2 "operators"

Correct and expected outcome: No file returned.

Edit: Upon further testing, it appears that all numerical proximity parameters also replicate this issue, including AFTER/n and even, occasionally, AFTER.

Does Chris have an email I could reach out to directly?

Sure. But actually it's a known shortcoming of the database search: the BEFORE, AFTER & NEAR operators are currently recommended only for words, not for phrases. With phrases, too many results might be returned. The Search inspector doesn't have this limitation.

Thanks kindly for the feedback.

Is this a shortcoming that can/will be resolved? Surely a file should not be returned if no matches are present? Is this not a bug?

The DEVONthink manual seems to suggest that the use of phrases is acceptable and intended (p. 229), as does p. 242 of Joe Kissell's Take Control of DEVONthink.

If the problem arises from counting how many words apart two phrases or word combinations are, it seems logical to count from the last word of the phrase preceding the operator to the first word of the phrase following it, for NEAR/n and BEFORE/n. This would obviously have to be inverted for AFTER/n. A sketch of this counting rule follows below.
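As a sketch of that counting rule (my own illustration in Python, with helper names of my own invention, not DEVONthink's implementation), using the manual sentence from the earlier example:

    import re

    def tokenize(text):
        # Lowercase word tokens; punctuation is dropped, as most indexers do.
        return re.findall(r"\w+", text.lower())

    def phrase_spans(tokens, phrase):
        # (start, end) index pairs where the phrase occurs in the token list.
        words = tokenize(phrase)
        return [(i, i + len(words) - 1)
                for i in range(len(tokens) - len(words) + 1)
                if tokens[i:i + len(words)] == words]

    def phrase_before(text, phrase1, phrase2, n):
        # Gap counted from the LAST word of phrase1 to the FIRST word of
        # phrase2, per the proposal above.
        tokens = tokenize(text)
        return any(0 < start2 - end1 <= n
                   for _, end1 in phrase_spans(tokens, phrase1)
                   for start2, _ in phrase_spans(tokens, phrase2))

    sentence = ("For convenience, some of these operators can also be "
                "abbreviated using commonly used symbols")
    # Under this convention the gap from "some" to "operators" is 3 words:
    print(phrase_before(sentence, "convenience, some", "operators", 3))  # True
    print(phrase_before(sentence, "convenience, some", "operators", 2))  # False

(Judging by the BEFORE/4 vs. BEFORE/3 results above, the in-file search appears to count from the first word of the phrase instead. Either convention would do; what matters is that the database search and the in-file search apply the same one.)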

On my end, I have been working on proposals to obtain the necessary permissions to establish locally hosted databases on workstations at over 20 initial research/study centers, with the ultimate goal of expanding to around 100 sister universities. I was hoping that DEVONthink could be an integral component of this, but this type of bug is liable to be a non-starter for many. Even in my own Ph.D. research, this issue means adding hours to my research workflows.

We're aware of it, and it's on our to-do list. It's actually an issue of the optimized index-based search; the text-based search used by the inspector doesn't have this limitation but is far too slow for the database search.
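As a toy sketch of that division of labor (greatly simplified; the real implementation is of course different):

    files = {"file A": True, "file B": False}  # True = really matches the query

    def index_stage(f):
        # Fast, index-based over-approximation: may accept files that the
        # exact query should reject (e.g., phrase proximity).
        return True

    def text_stage(f):
        # Exact but slow text-based check, as used by the Search inspector.
        return files[f]

    listed = [f for f in files if index_stage(f)]    # the database result list
    verified = [f for f in listed if text_stage(f)]  # what the inspector confirms
    print(listed)    # ['file A', 'file B']
    print(verified)  # ['file A']; file B appears in the list,
                     # but the inspector finds nothing in it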

Ok, thank you very much!

I am also having this issue. I was expecting the boolean proximity operators to work, but the false positives reduce their value dramatically, and the functionality of my workflows is directly impacted. I was trying to sell another colleague on DEVONthink, and the first search he wanted to try was a NEAR search; it returned false positives and failed to show accurate results. If this bug is not resolved, DEVONthink's value for academic database research will be severely limited.

Welcome @Orion1844

Have you read through this thread, e.g.,…