DT and phrases within searches

DEVONagent allows the use of phrases within a group of search words by surrounding the “phrase portion” of the search in quotes. For example, wind erosion of “coal piles” in Michigan and “lake contamination” would return only those websites containing the actual terms “coal piles” and “lake contamination” in the result. This is great for narrowing results and making them more relevant. I’ve tried this in DEVONthink PE but it doesn’t seem to work: it returns results on anything containing wind, erosion, coal, piles, etc., as long as all the words are present. Many environmental and regulatory texts run to hundreds of pages, and these words will almost always show up individually somewhere in the document, requiring a perusal of the entire document to see whether it actually contains the information needed. A search on the entire phrase will, of course, find nothing. Does DEVONthink Pro add the ability to include phrases within a set of search terms and, if not, couldn’t this be readily incorporated into DEVONthink since it is already available in DA?
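For illustration, the behavior being asked for — every bare word must appear somewhere, and every quoted phrase must appear verbatim — can be sketched in a few lines of Python. This is purely illustrative and is not DT’s or DA’s actual implementation; the matching here is naive substring matching:

```python
import re

def parse_query(query):
    """Split a query into quoted phrases and the remaining single words."""
    phrases = re.findall(r'"([^"]+)"', query)       # exact phrases
    words = re.sub(r'"[^"]+"', ' ', query).split()  # leftover bare words
    return words, phrases

def matches(document, query):
    """True if the document contains every bare word AND every exact phrase."""
    words, phrases = parse_query(query)
    text = document.lower()
    return (all(w.lower() in text for w in words) and
            all(p.lower() in text for p in phrases))

doc = 'Wind erosion of coal piles near Lake Michigan; lake contamination study.'
print(matches(doc, 'wind erosion "coal piles" "lake contamination"'))  # True
print(matches(doc, 'wind erosion "coal stacks"'))                      # False
```

The second query fails because the exact phrase “coal stacks” never occurs, even though the individual words might.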

Thanks,
ChemBob

Bob,

In DA the use of quotes to group a phrase is unnecessary, since there is a ‘phrase’ search option (a drop-down in the toolbar search window). If you use quotes in such a search they are treated as literals – that is, DA returns only documents in which the phrase actually appears surrounded by quotes.

If you use quotes in an ‘all words’ search, they are ignored (which seems inconsistent with their literal treatment in the phrase search), and DA returns a list of any document containing all the words, with no proximity taken into account.

I’m puzzled by the apparent weakness of DA search routines that ignore proximity and even (it seems) categories and other readily available ‘hints’ about relevance. I’ve raised the issue in this unanswered post.

I hope that DevonTech plans to improve this aspect of the product.

I think (let me know if I’m misunderstanding you) that you mean DT rather than DA in your post. I referred to searches in DA as a means of comparison to DT. Yes, I’m aware of the drop-down phrase search option in DT, but I’m pretty sure it treats the entire entry in the search field as the phrase. I am interested in running searches in my DT database that include one or two embedded phrases along with other search words that aren’t part of the specific phrases.

I just reread the post at the link you provided. My experience is the same as yours. DT almost NEVER finds the documents that are truly the most relevant ones, and it is also extremely bad at guessing where a document should be classified, even though I have a fairly large database of over 11,000,000 words. I have gone to great lengths to organize all my DT entries into groups with descriptive names and have titled the entries (in most cases) to reflect the most relevant aspects of their content. My impression is that DT ignores all of this. As you observed, it seems to preferentially find and rank the documents that contain the largest numbers of the search words (even when those words are highly dispersed throughout the document and used in a manner inconsistent with the query) well above the most relevant documents, often failing to rank the most relevant folders at all in the “classification” or the most relevant documents in the “see also.”

When I do DT searches I find myself saying “no, that’s not it” and having to do additional work to find the relevant document or folder. What this means is that I can’t just import the information and organize it; I have to at least vaguely remember it myself to locate it later. This seems to defeat the very purpose for which I most need DT.

I too would like to ask the developers: Are we doing something wrong or is this a known limitation of DT? What can be done to improve these searches by you (coding) or us (organizing and wording)? Is there some reason why the DA web search capabilities aren’t included in DT? Any help would be appreciated.

Thanks,
ChemBob

Bob,

You’re correct. I was referring to DT, not DA. Sorry for the confusion.

Thanks for your support on the search capability.

Relevance is tricky, and may be harder to code than it looks. But there’s little point in using a specialized database if it does no better at returning results than the indexed search built into the file system. As far as I can see at the moment, DT is faster (because it’s not maintaining an index?) but no smarter.

I agree, too, with your observation on auto-classification. DT seems to be pretty poor at that despite descriptive classifications. I’ve given up on it for the moment.

I’d be interested to hear from anyone who has a positive experience with auto-classification. Maybe there are circumstances in which it works?

Peter

Peter:

[1] Relevance weighting in searches is, as you say, a tricky business. Especially if the search returns many hits, I do find that I need to scan through search hits to pick out the items that I’m most likely to find useful. At the first level, because I try to give items descriptive names, I select (Command-click) each item that might be useful (even if the relative rating isn’t high).

Then I create a new group and replicate the selected search hits into it. At this point, I can do other searches limited to that group to try to further filter the results. In list view I can sort by date, and so on. I usually throw away the new group(s) after I’m finished with them, as they only contain replicants of existing items.
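The replicate-and-filter workflow described above amounts to applying successive filters to a hit list. A toy Python sketch of the idea (the item names and predicates here are hypothetical, and this is not DT’s API):

```python
def refine(hits, *predicates):
    """Narrow a list of search hits by applying each filter in turn."""
    for pred in predicates:
        hits = [h for h in hits if pred(h)]
    return hits

hits = ["Coal pile dust report 2003",
        "Lake contamination memo",
        "Coal pile erosion study"]

result = refine(hits,
                lambda h: "coal" in h.lower(),   # first narrowing pass
                lambda h: "pile" in h.lower())   # second narrowing pass
print(result)  # ['Coal pile dust report 2003', 'Coal pile erosion study']
```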

I do make a lot of use of “See Also” when I’m looking at an interesting item. Often, that works better than a search.

Christian is promising the DA search operators in DT version 2. That should greatly help filtering items that are most likely to be what I need.

[2] I’m not using Auto-Classify. But one of these days pretty soon I plan to make a copy of my database, turn that feature on, and dump in a set of files to see what happens.

Behavior of the Classify button has greatly improved in recent versions of DT. I spent a half hour this evening using that to classify some ungrouped items. Although I can often nit-pick the Classify decisions, they were reasonably logical and consistent. The most important improvement is that Classify always makes a decision, choosing one or more groups. In the past, it often couldn’t decide, so made no decision at all.

I suspect that Auto-Classify has been given the same tweaking. If so, I may find it useful. Even in cases where I might not have made the same decision as DT, it’s possible that DT’s decisions will become more and more consistent and logical as the database grows.

I’ll let you know what happens when I try it out. For a starter, I’ll dump about a hundred files into the copy database and check the results. I’ve got another 2,500 or so files to dump in if Auto-Classify works reasonably well.

Hi Bill,

You’re saying that you do repeated scanning on a temporary group to narrow the search results? But I can do this with the Finder search results, too. The point of relevance-weighting is precisely to avoid this time-consuming mechanical iteration.

As for using DA search operators in a future DT: that would be good. ‘Near’, ‘not’ etc are in my view pretty basic requirements for intelligent search. If you were ever a Nexis user, for example, you’ll be aware that these modifiers have been around in text databases since the 1970s.
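For readers unfamiliar with these modifiers, here is a minimal Python sketch of what ‘near’ and ‘not’ operators do over tokenized text. The window size and tokenization are simplifying assumptions, not how DA or Nexis actually implement them:

```python
def tokenize(text):
    return text.lower().split()

def near(doc, a, b, window=5):
    """NEAR operator: True if words a and b occur within `window` tokens."""
    toks = tokenize(doc)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def must_not(doc, word):
    """NOT operator: the word must be absent from the document."""
    return word not in tokenize(doc)

doc = "coal piles cause dust but the lake shows no contamination yet"
print(near(doc, "lake", "contamination", window=3))  # True
print(must_not(doc, "erosion"))                      # True
```

The point of ‘near’ is exactly the proximity question raised earlier in the thread: words that co-occur within a few tokens are far more likely to be semantically related than words scattered across hundreds of pages.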

But making the user implement them is not, perhaps, the most innovative program design. The problem is that this approach “delegates-up” (to the user). It requires the user to supply structures for the search that a good analyst/designer would already have anticipated.

I honestly expected DT would take account of at least the relevancy ‘hints’ that are readily available to it in category names and document headings, or perhaps use some of the built-in document summary techniques for deciding what’s important and what’s not. Although, to be honest, I find OS X summaries pretty ordinary, they’re still way ahead of what DT seems to be doing (do you agree?).

I think the innovations in relevance weighting that DA most relies on are mostly innovations in the search engines themselves. They don’t require the user to manipulate a command line of search qualifiers to manage relevancy: they just do it, and do it very well, too, given the size of the databases they are searching.
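One classic ingredient of such relevance weighting is TF-IDF, which rewards query terms that are frequent in a document but rare across the collection. A small Python sketch, purely to illustrate the idea (real engines use far more signals than this):

```python
import math
from collections import Counter

def tfidf_scores(docs, query_terms):
    """Score each document by the summed TF-IDF of the query terms."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # document frequency: how many documents contain each query term
    df = {t: sum(1 for toks in tokenized if t in toks) for t in query_terms}
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = sum(
            (counts[t] / len(toks)) * math.log((N + 1) / (df[t] + 1))
            for t in query_terms
        )
        scores.append(score)
    return scores

docs = [
    "coal piles and lake contamination near the lake",
    "coal mining history and coal production figures",
    "wind erosion of coal piles causes lake contamination",
]
print(tfidf_scores(docs, ["lake", "contamination"]))
```

The second document scores zero because it contains neither query term, no matter how many other words it shares with the query topic; the first scores highest because it mentions both terms, and “lake” twice.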

By the way, when I said ‘auto-classify’, I was referring to the ‘classify’ button. I meant that this was ‘automatic’ in the sense that using the button is the alternative to manual classification.

I find that DT is classifying material more often than not, but that the discriminations it makes are unpredictable. I guess that they’re based on the same numerical weighting that seems to drive search results and ‘keyword’ results. Both of these seem to be linked to concordance-weights (I’m guessing here) and not to the more useful relevance hints that Bob and I have referred to.

I’d be interested to know how your experiment with auto-classify turns out. In particular, of course, how successful DT is in making the right choice between material that is close in word content but semantically far apart.

Peter