🟥 Problems with Exact Phrase Search in DEVONthink

ClaudeCloud · August 4, 2024, 11:48pm

I have a problem with the search function in DEVONthink. When I search for an exact phrase by putting it in quotation marks, DEVONthink does not find the contiguous words.

Exact Phrase Search in DEVONthink

DEVONthink offers the ability to search for exact phrases by putting them in quotation marks. For example, a search for "It is important for future development to realise that policies" should only return documents that contain this exact phrase.

Problem Description

Despite entering the search phrase correctly and updating the indexes, the problem persists. The following points have already been checked and implemented:

Correct Search Syntax: The search phrase was correctly placed in quotation marks.
Updated Indexes: The database indexes have been updated.
Reindexing: The database has been reindexed.
Search Scope: The search was set to “All Databases”.
Fuzzy Search: The search was conducted with both enabled and disabled fuzzy search.
File Properties: It was verified that the file is in the correct folder and database.
No Filters: No unnecessary filters are activated that could restrict the search results.

Specific Issue

When I enter the exact phrase "It is important for future development to realise that policies" in the global search bar, nothing is found, even though the phrase exists in the document exactly as written. Even when reducing the search phrase to 3-4 words, the document is not found. Interestingly, the phrase is found when searching directly within the document with the same settings.

Expectation

I expect the global search in DEVONthink to find documents that contain the exact phrase without using Boolean operators or additional search parameters.

FrankT · August 5, 2024, 7:06am

Unfortunately, this won’t help you, but I copied your sentence into a document and was able to find it without any problems. This exact phrase and only this.

cgrunenberg · August 5, 2024, 7:09am

What kind of document? Is the document or one of its enclosing groups excluded from searching?

rfog · August 5, 2024, 8:56am

Could there be a PDF with errors in OCR? A Word document with extra control stuff in the middle? My two cents. (Normally, when searching with quotes, DT cannot find the string because of one of those reasons, sometimes even soft/hard hyphens in the middle of the phrase.)
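
One way to check for stray characters of this kind is to copy the phrase out of the document and list anything invisible or look-alike in it. What follows is a minimal Python sketch, not something DEVONthink provides; the function name hidden_chars and the LOOKALIKES set are made up for illustration, and the set is not exhaustive.

```python
import unicodedata

# Characters that look like ordinary ones but are different code points.
# Illustrative assumption: no-break space, non-breaking hyphen, Unicode hyphen.
LOOKALIKES = {"\u00a0", "\u2011", "\u2010"}

def hidden_chars(text):
    """Return (index, code point, name) for invisible or look-alike characters."""
    found = []
    for i, ch in enumerate(text):
        # "Cf" covers format characters such as the soft hyphen and zero-width
        # space; "Cc" covers control characters, including line breaks and tabs.
        if ch in LOOKALIKES or unicodedata.category(ch) in ("Cf", "Cc"):
            name = unicodedata.name(ch, "UNNAMED")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found

# Paste the phrase copied from the PDF between the quotes.
copied = "future develop\u00adment to realise"
for index, code_point, name in hidden_chars(copied):
    print(index, code_point, name)
```

An empty result means the copied phrase contains only ordinary characters, so the cause is more likely a missing or outdated text layer than hidden characters.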

cgrunenberg · August 5, 2024, 9:02am

Or the PDF doesn’t have a text layer at all and is therefore not indexed/searchable. In this case, searching just in the document would still work on modern macOS versions due to Live Text.

ClaudeCloud · August 5, 2024, 3:00pm

It is a scientific paper (PDF) from ResearchGate. So, it’s absolutely standard, nothing special. It is quite unsettling when something like this happens.

The error can be reproduced at will. However, I have now downloaded the paper again. With the newly downloaded version, it works. I did not manipulate the previously downloaded version.

I don’t understand. Does this happen often, and do I now have to check every paper to ensure it is searchable, even though the technical requirements should be met if it comes from such a portal?

ClaudeCloud · August 5, 2024, 3:11pm

That sounds interesting. I compress my PDFs with DEVONthink, which had no impact on the document in question. The sentence was correctly rendered as “text only”. I also conducted a test where I copied the sentence directly from the document and pasted it into the global search. The global search still couldn’t find the sentence.

It happens quite often that I can’t find sentences, even though they appear to be original.

And what is this error with Word controls? What is meant by “two cents”?

Of course, there’s the additional problem that the compressed texts are searched with ChatGPT 4, and besides hallucinations, there are also minor improvements that correct scanning errors. I am aware of all this and have to deal with it. It’s annoying, but what can you do? This has nothing to do with the previous problem, which was tested manually without AI. I’m just mentioning it because it seems to be the direction DEVONthink is heading, and surely, it’s not just me using it this way.

BLUEFROG · August 5, 2024, 3:13pm

What “technical requirements have been met”?

ClaudeCloud · August 5, 2024, 3:17pm

I believe that when a PDF is downloaded as a scientific paper from ResearchGate, it should be technically flawless. It should not contain any errors and should be a PDF containing text, not just images. Well, this experiment now suggests that this might not always be the case.

chrillek · August 5, 2024, 3:23pm

A PDF can be technically flawless and still have no text layer. The two aspects are not related. So, it might be worthwhile to check if the PDFs that you can’t search in fact do contain a text layer.

BLUEFROG · August 5, 2024, 3:29pm

That would be an incorrect assumption. A site like that is an aggregator / clearing house for the documents. They aren’t verifying the quality of "160+ million publication pages" (their marketing hype), just as they’re not peer-reviewing or vouching for any of it. (And yes, there are certainly documents with incorrect data out there.) Sites like this just provide a way for people with information they want to share to make it more widely accessible to potentially interested parties. This goes for PubMed, Library of Congress, Westlaw, etc.

On a related note: You shouldn’t make assumptions about metadata either. For example, a DOI is a useful thing in professional documents. But just because you found a DOI in several documents you opened, you should not assume there is one in all your documents, even if pulled from ResearchGate, etc. There are not rigorous standards in publication creation, and things like DOI haven’t been around since the beginning of publication. (DOI was standardized in 2012 (ISO 26324), withdrawn and updated in 2022 and expected to be replaced by an open draft, so only around for 12 years so far and obviously in flux. And just because it’s a standard doesn’t mean people have used it.)

Also, we discussed PDF searchability here…

Tangentially, the whole of the Internet is the same thing: You search for things online but don’t believe “Well, it’s online so it must be true.”… right?

rkaplan · August 5, 2024, 3:43pm

Lord no

They don’t even enforce copyright, much less technical standards

ClaudeCloud · August 5, 2024, 3:45pm

Understood. As I previously mentioned, the phrase was successfully found by the global search after a second download of the paper. This suggests that the paper should have been technically error-free on the first attempt.

The question remains as to where the error lay during the first attempt, when the identical document was downloaded but not found in the global search with the quoted phrase, even though the copied phrase was found when searching directly within the PDF.

BLUEFROG · August 5, 2024, 3:47pm

And how did you get the PDF into your database?

ClaudeCloud · August 5, 2024, 3:52pm

Cheers, I’ve been messing about with the searchability of PDF documents for yonks. Naturally, you get to know the pitfalls and try to set things up so they’re technically readable. In a pinch, I’ll even scan an entire book myself. Still, I can’t always figure everything out, so I’m chuffed to bits when I can get to the bottom of why something isn’t being found.

ClaudeCloud · August 5, 2024, 3:54pm

It was downloaded using Safari on macOS Sonoma 14.6 and bunged into the DEVONthink inbox via Finder.

chrillek · August 5, 2024, 4:02pm

It could also mean that they provided another version in the meantime.

Well, it has been mentioned several times now that a PDF must have a text layer in order to be searchable by DT. If the document type is PDF+Text, that condition is met. Otherwise, not.

BLUEFROG · August 5, 2024, 5:08pm

If you find this consistently reproducible with certain documents, let us know. For example, if you download a PDF, drop it into the Global Inbox’s alias, and it’s a PDF Document not PDF+Text, then you drop the same PDF into the Inbox again and it’s now a PDF+Text. Isolate that file in the Finder and see if you have other documents behaving similarly.

Also, if a PDF is large, it may be a PDF Document initially, then change to PDF+Text after DEVONthink has indexed its content. So don’t jump to it being an error immediately. Give it a minute or three, at least.

MsLogica · August 5, 2024, 8:20pm

I will second what several others have said here, which is that you need to check that the original faulty PDF actually has a text layer, because it might not.

But even if it does, you should then check what that text layer actually says, because it’s not always useable. For example, you could get a text layer that just loads as

􏿾 􏿾 􏿾 􏿾 􏿾 􏿾 􏿾

Always annoying when that happens.

Or even worse, you get a text layer which loads with some code instead of characters, so for example instead of “apples & oranges” you now get

apples &amp; oranges

This doesn’t search correctly and is quite irritating!

On the plus side, if you use DT Pro you have in-built OCR and can re-do the text layer yourself whilst muttering about PDFs being annoying.
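
Both kinds of broken text layer can be spotted with a few lines of Python once you have copied some text out of the PDF. This is only a rough sketch: the helper names unreadable_ratio and decode_entities are invented here, and it assumes the "code instead of characters" case is HTML-style entities, which may not match every document.

```python
import html
import unicodedata

def unreadable_ratio(text):
    """Fraction of non-whitespace characters that carry no readable text."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for ch in chars
        # U+FFFD is the replacement character; "Co" is private use,
        # "Cn" is unassigned/noncharacter, "Cs" is a lone surrogate.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cs")
    )
    return bad / len(chars)

def decode_entities(text):
    """Turn HTML-style entities back into the characters they stand for."""
    return html.unescape(text)

# Paste text copied from the PDF's text layer between the quotes.
sample = "\U0010fffe \U0010fffe \U0010fffe"
print(f"{unreadable_ratio(sample):.0%} of the characters are unreadable")
print(decode_entities("apples &amp; oranges"))
```

A ratio close to 1 suggests the text layer is useless for searching and the document is a candidate for re-running OCR.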

Antoine · August 10, 2024, 8:18am

Maybe you could check whether the problem comes from the PDF: open it in Preview and search for your sentence.

I just tried this with a PDF, and Preview does not find a sentence if there is a line break in it. (I don’t know how to get around that, though.) What happens when you select the sentence in Preview and copy/paste it as text?
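
To see whether a line break or an end-of-line hyphen is what breaks the match, the copied text can be compared with whitespace normalized. This is a minimal Python sketch under the assumption that you have pasted the copied text into a string; normalize and contains_phrase are hypothetical helper names, and this is not how DEVONthink or Preview search internally.

```python
import re

def normalize(text):
    """Collapse layout differences that do not change the wording."""
    text = text.replace("\u00ad", "")      # drop soft hyphens
    # Rejoin words split by a hyphen at the end of a line. Caveat: this also
    # merges genuinely hyphenated words that happen to break at a line end.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Treat any run of spaces, tabs, or line breaks as a single space.
    return re.sub(r"\s+", " ", text).strip().lower()

def contains_phrase(document_text, phrase):
    """True if the phrase occurs in the text once layout is ignored."""
    return normalize(phrase) in normalize(document_text)

copied = "It is important for future develop-\nment to realise\nthat policies"
phrase = "future development to realise that policies"
print(phrase in copied)                  # literal comparison
print(contains_phrase(copied, phrase))   # layout-insensitive comparison
```

If the literal comparison fails but the normalized one succeeds, the words are all there and only the layout of the text layer is getting in the way of the exact phrase search.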