Confused by phrase search

Ok, after finding that Index’d files cannot be phrase searched, I re-imported my library of books, which I often search using particular phrases (I usually know what I’m looking for, I just can’t remember which book it’s in).

However, now I’m finding that several books which do not contain the phrase I’m looking for appear in the list. They do contain the component words, just not the phrase. For example, I choose “search in all”, phrase, exact match, and look for the words, “with exceeding gladness”. I think I know this phrase is in one particular book, and I’m hoping DEVONthink will show me that book within the first three hits.

Instead, I discover that I haven’t accurately remembered the phrase. Yet the book I’m looking for appears near the top anyway! I click on it, only to be greeted with the first page of text, and an error beep when I use Cmd-F to try and find the “found” phrase within the book.

Any idea what’s going on? It’s useless to me if books appear in the list that don’t actually match the phrase. What ends up happening is that I have to click on each book to see if it shows me a page with words in blue; otherwise, it’s an invalid match and I must move on to the next book. This is much less effective than a plain grep, which tells me right away if I’ve misremembered the phrase.

Any suggestions? Have I imported things incorrectly?

John

John:

I would suggest that you use Ignore case instead of Exact, although that’s slightly slower.

If you used Phrase, all the listed results do contain that phrase. Really. :slight_smile:

Try again, making certain that you really are using the Phrase operator instead of All Words.

Suggestion: I use Tools > Search all the time. Reasons: I can click on the Options button to examine and modify the search operators, and there are more search operators available than in the toolbar Search field. Importantly, I find the Context button on the Tools Search window extremely useful.

Actually, it did not. I confirmed that all the search settings were correct, and I even exported the text to an external file, opened it in Emacs and did the search there. Sure enough, no phrase.

If this happens again I will try to reproduce it by creating a new database with just that file. If I can get it to reproduce, I will send you the bits.

John

I found a very simple test case that demonstrates the incorrect phrase search problem.

I created a new database and imported the following HTML file into it. I saved it to disk as a plain file (not a webarchive), and then imported it using File | Import | Files and Folders…. That file is here:

johnwiegley.com/msg01528.html

Once you have the file in your new database, set your search preferences to All/Phrase/Exact/In Selection, and click on the Home icon. Now search for “joy and happiness”. On my machine, it will show the file in the search results, even though neither “joy and happiness” nor “joy happiness” exists as a phrase within the file.

John

John:

Sorry, but the phrase “joy and happiness” IS IN THAT DOCUMENT. The Phrase search that showed it as a result is CORRECT. :slight_smile:

Go to the paragraph in that page’s text that begins with “insofar”. The next to last sentence in that paragraph contains – guess what – the phrase, “joy and happiness”. So DEVONthink Pro’s Phrase search is correct.

True, there is something strange about your example. A Command-F search doesn’t find “joy”, even in a TextEdit copy. I switched the format of the TextEdit copy between plain text and rich text a couple of times, and finally got Command-F to find “joy”.

Hi John,

They do exist in a proper context: namely there is a return character of some kind between the words “and” and “happiness”. Despite this it should still count as a phrase.

The fact that an editor doesn’t find it is because of the return character since it doesn’t look at the data as a sentence.
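To illustrate the difference described above, here is a minimal Python sketch. It is not DEVONthink's actual implementation; the sample text and the helper name `phrase_pattern` are made up for the example. A literal substring search fails when a line break sits between “and” and “happiness”, while a search that allows any whitespace between the words still matches and reports where the hit is.

```python
import re

def phrase_pattern(phrase):
    """Build a regex that matches the words of `phrase` in order,
    separated by any run of whitespace (spaces, tabs, or line breaks)."""
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(w) for w in words))

# Hypothetical sample: a hard return between "and" and "happiness".
text = "a time of joy and\nhappiness for all"

# Literal search, as a plain editor Find would do it.
literal_hit = "joy and happiness" in text
print(literal_hit)  # False

# Whitespace-tolerant search.
match = phrase_pattern("joy and happiness").search(text)
if match:
    # The span gives the character offsets a viewer could scroll to and highlight.
    print(match.span(), repr(match.group()))
```

The span returned by the tolerant match is the kind of information a viewer would need in order to jump to the hit, even when a literal Find for the same phrase comes up empty.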

Bill,

I notice that the html page has “joy and happiness” broken by a hard return, between “and” and “happiness” --so that not even a browser Find can come up with the phrase. The same break appears in the source code, though it’s not an html break or return code.

Will

Will:

Thanks. I noticed some other glitches in that page, also.

Even so, score 1 for DT Pro’s Search, 0 for text glitches. :slight_smile:

I’m very happy to be proven wrong where DT bugs go! :slight_smile: So how about a request: when I select a page that is found by a phrase search, whether the phrase is broken up by returns or not, still page forward to the hit and highlight the result in blue. Since I’m not able to find the phrase that was matched (and searching for single words from the phrase might take way too long), it does reduce the usability of the search results. I take it from your note that the page has anomalies; is this something which can be corrected?

John

John,

I would suggest that you massage the text a bit before you import it into DTP.

If you copy and paste it into TextEdit, and then from the Services menu run Format: Reformat, all of those end-line hard returns will vanish. (You need to have the free bundle of Service apps, from the DevonTech site, installed in your user Library.)

Then copy and paste or drag the contents of the TextEdit window into DTP, where it will become an RTF file.

Although that’s an extra step, it should make the DTP searches far more consistent.
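For plain-text files, a similar cleanup could be scripted rather than done by hand. The following Python sketch is only an approximation of that idea, not what the Reformat service actually does; the function name `unwrap_hard_returns` and the sample text are made up. It assumes that blank lines mark paragraph boundaries and that every other line break is an unwanted hard return. HTML files would need a proper HTML parser instead.

```python
import re

def unwrap_hard_returns(text):
    """Join hard-wrapped lines into single-line paragraphs.
    Assumption: a blank line separates paragraphs; all other
    line breaks are treated as end-of-line hard returns."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace inside a paragraph to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

# Hypothetical sample with a hard return in the middle of a phrase.
sample = "full of joy and\nhappiness.\n\nNext paragraph\nhere."
cleaned = unwrap_hard_returns(sample)
print(cleaned)
```

After this step a literal search for “joy and happiness” succeeds on the cleaned text, at the cost of losing the original line layout.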

Will

John:

I don’t think so. DT Pro is using standard OS X tools to mark the matching strings.

The author of the Web page has either used sloppy text formatting or, more likely, HTML creation software that produces sloppy results. If you encounter this often with files from a particular author or organization, you might alert them to the problem in hopes they will correct it.

Actually, if you want to go to the trouble for an important document, you could use a good HTML editor to clean up/reformat the text.

Now I understand what you mean by “not finding it”: you can’t get to the search result on the page that DT Pro’s Search window claims is there. I agree that this should be addressed. Can you please send an email to support with the link to this thread (and please don’t remove the HTML link on your site)? I will inform Christian as well.

Thanks for the tips, I have sent an e-mail to support. I must add that “massaging” the files before I import them is not really an option: my database is at almost 100,000 entries now. And I experience the same difficulty with plain text, not only these HTML files which have been generated by mhonarc (and I have 16,000 of these in the database alone!).

Thanks for your ever-rapid response,
John