Documentation Request: Search, See also, Classify

louise · April 9, 2006, 8:22pm

I work with DEVONthink Pro every day and I really love it, except for the missing documentation of the “A.I.” features. There have been threads about this subject before, but as I found a similar request in Ted Goranson’s excellent About This Particular Outliner column, I thought it might be time to ask again for better documentation of these features.
Here is a quote from Ted Goranson’s review of DEVONthink Pro in About This Particular Outliner, August 2005:

And this is where I think DEVONthink has a long way to go. It won’t tell me what its search algorithms are or how they are weighted. I’m someone—admittedly an atypical user—who could understand that and wish to tinker with the settings. In fact, I believe that the settings could be exposed in such a way that many users would understand the controls.

Since I can’t understand nor control how it associates, I will always mistrust it. And I’ll want to build or purchase my own modules. I’m pretty sure DEVONthink is set up to do this; DEVON has a new investor that does intelligence work who will likely be adding its own modules, almost certainly based on n-grams.

(About n-grams: Since you’re not actually indexing the meaning of words, why bother to index the whole word? After all, there are vastly fewer combinations of two or three letters than there are whole words in all languages. As it turns out, you can get just as good patterns with these as with whole words, but with vastly fewer resources, and allowing incredibly more efficient pattern matching techniques. The spy guys like this because they could be dealing with tens of millions of unindexed items at a whack.)
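To make the n-gram idea concrete, here is a minimal sketch that compares two texts by their overlapping three-letter sequences instead of whole words. It is a generic illustration only: the function names `char_ngrams` and `cosine` are made up for this example, and nothing here is claimed to reflect DEVONthink's undisclosed methods.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, invented for illustration.
wetland = char_ngrams("arsenic uptake in wetland bacteria")
related = char_ngrams("bacterial uptake of arsenic in wetlands")
unrelated = char_ngrams("quarterly sales figures")

print(cosine(wetland, related) > cosine(wetland, unrelated))
```

Because the profiles are built from letter sequences, "bacteria" and "bacterial" or "wetland" and "wetlands" contribute shared trigrams without any stemming or dictionary.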

louise

Bill_DeVille · April 9, 2006, 10:02pm

Hi, Louise:

DEVONtechnologies has no plans at this time to disclose the proprietary techniques used in the AI features. Nor do we agree with Goranson either that nondisclosure of the methods is grounds to ‘mistrust’ the results, or that most users would benefit from tinkering with them. (We do agree, however, that Goranson’s columns are outstanding.)

AI features such as See Also are unique to DEVONthink/DEVONthink Pro and we believe they can be of considerable use to many users.

Whenever I have discussed the AI features, including See Also, I have always noted that it remains the user’s responsibility to decide whether or not the suggested related documents or concepts are indeed useful. In this sense, I always approach DT Pro’s suggested relationships with a certain degree of skepticism or ‘mistrust’, for reasons that could not be mitigated by access to the tinkering methods Goranson proposes.

Example: I’m looking at a document concerning the biological processes in a wetland environment, with a focus on how those processes affect the mobility and toxicity of arsenic. In my database, when I press See Also, DT Pro promptly suggests a number of other documents that may be related. Some of those suggestions are about the metabolic processes of bacteria that result in conversion of arsenic from one chemical state to another. Good! Some of them are about relative rates of uptake of the different chemical states of arsenic in the food chain by various organisms. Good – that could lead me to still other sources of interesting information if I follow that trail. One of them is about problems in sampling and analyzing contaminants in soils and sediments. Hmm – I may not follow that trail today, but might come back to it later. Another suggested paper is mostly about phosphate levels in wetlands. Could that be related? Maybe, it might even turn out to be important. But I’ll skip it today. And some other suggestions I reject as not very relevant or interesting at the moment. Of 86 suggestions for the first document, I’ve picked 8 as worth looking at closely, and I’ll repeat the process of See Also on three or four of those to see where they lead me.

Conclusion about the utility of AI: DT Pro was really useful in helping me peruse the contents of my large database to find some information that I’m investigating right now. But I’m responsible for understanding the literature. I’m a chemist, I’ve done research in biochemistry and microbiology and I’ve done investigations of pollution problems in wetlands and the associated food chains. DT Pro doesn’t “know” anything at all about those disciplines. DT Pro is, however, “looking” at the contextual relationships of words in my database, and it does that far faster and more comprehensively than I could. So we have an interactive relationship, DT Pro and I. DT Pro makes suggestions based merely on its analysis of word associations, and I evaluate those suggestions to see if they are useful to me. Some are, some are not. But it’s my responsibility to understand the material – DT Pro doesn’t “understand” anything.

If I were doing something like an analysis of word frequency in the works of William Shakespeare, perhaps it might seem useful to ‘tweak’ the AI methodologies in DT Pro. (Remember, though, that the Concordance automatically produces that word frequency listing.)
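For readers unfamiliar with what a word frequency listing involves, a minimal sketch follows. It only illustrates the general idea of counting words; the function name `word_frequencies` and the tokenizing rule are assumptions for this example, not a description of how DT Pro's Concordance is built.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count each word in the text, ignoring case and punctuation."""
    # Assumption: a "word" is a run of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


freq = word_frequencies("To be, or not to be, that is the question")
for word, count in freq.most_common(3):
    print(word, count)
```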

But the built-in AI for See Also works quite well for generalized use in large reference databases, and I doubt that knowledge of how it works, or the ability to tweak it, would really improve its utility (except for essentially trivial matters). DT Pro will sometimes make brilliant suggestions, and often make dumb suggestions. It’s the user’s responsibility to distinguish brilliant from dumb. In short, user knowledge about the topic under investigation is what’s really important.

Bill_DeVille · April 9, 2006, 10:05pm

Louise:

BTW, technically, DT Pro’s database is unindexed.

marcellus · April 10, 2006, 7:27pm

I’d like to connect up this discussion to another topic that has appeared fairly frequently on the forum, requests for features such as annotating, more flexible metadata, the ability to retrieve highlighted text, etc. Like Bill I use DTP to manage a large number of articles. I use the AI features to find related and relevant material to what I’m looking for, and I’m very happy with that aspect of the program. Yet, I’m continually frustrated in doing something I suspect many users want to do. I’ve made the point before on the forum that often you want to work at a level between the word and the document, in other words with passages of text within documents. Why?
Users such as journalists and many kinds of researchers in the social sciences often work with material such as interviews. The problem with data of this kind is that in natural speech people often use the same words to refer to different topics and different words to refer to the same topic. In these situations what you want to be able to do is to look at a passage of text and mark it in some way using a tag, a label, a code or whatever (there is no one preferred term) so that you can retrieve passages marked in a similar way into a new document and compare them. In a similar way, if you are a researcher in the humanities you might work with published documents rather than recorded speech but find that you want to mark up, for example, passages in different documents that you feel show the influence of a prior writer. The judgement you make there is yours; it might not appear in the words themselves.
Tools for doing this kind of thing are not well developed on the Mac platform. (For an example of the kinds of things that might be possible, it is worth looking at the German WinTel program, Atlas/ti; atlasti.de) Adding such a capability would make DTP an extraordinarily powerful program, and would, I believe, open up a substantial additional market for it. I would really urge the developers to investigate the possibilities.
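The passage-level coding described in this post can be sketched as a small data structure: each tagged passage records its source document and character range, and passages sharing a tag can be pulled together for comparison. Everything here (`Passage`, `PassageIndex`, the sample interviews) is hypothetical and is not a feature of DT Pro or Atlas/ti.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    document: str  # name of the source document
    start: int     # character offset where the passage begins
    end: int       # character offset just past the passage
    text: str      # the passage itself


class PassageIndex:
    """Maps user-chosen tags to passages, independent of the words in them."""

    def __init__(self):
        self._by_tag = defaultdict(list)

    def tag(self, document, full_text, start, end, *tags):
        """Mark full_text[start:end] of a document with one or more tags."""
        passage = Passage(document, start, end, full_text[start:end])
        for name in tags:
            self._by_tag[name].append(passage)
        return passage

    def collect(self, tag):
        """Return all passages carrying the tag, in the order they were marked."""
        return list(self._by_tag.get(tag, []))

    def report(self, tag):
        """Gather the tagged passages into one text for side-by-side reading."""
        return "\n\n".join(f"[{p.document}] {p.text}" for p in self.collect(tag))


# Two invented interview excerpts that share a topic but no vocabulary.
first = "We moved here for work. The river floods every spring."
second = "Nobody trusts the council. High water ruined the harvest."

index = PassageIndex()
index.tag("interview-1", first, first.index("The river"), len(first), "flooding")
index.tag("interview-2", second, second.index("High water"), len(second),
          "flooding", "livelihood")
print(index.report("flooding"))
```

The two passages share no significant words, which is exactly the case where a judgement recorded by the reader does work that word association cannot.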

Bill_DeVille · April 10, 2006, 8:36pm

Marcellus:

Thanks. Interesting comments. The developers are considering additional metadata features and appreciate suggestions for such specific potential features. Other users have made a variety of similar requests.

BTW, although this is limited to relationships that are already “in the words”, DT Pro 1.1 will let you select a sentence, paragraph or longer section of a document, Control-click and use the contextual menu option See Selected Text. DT Pro will then suggest a list of similar documents.

See Selected Text can let the user approximate the “Johnson approach” of breaking longer documents into segments, without having to physically split the documents.

Example: I’ve got a number of long PDFs that contain information about chemical analytical methods. Some of them run over 500 pages, and cover literally hundreds of procedures for testing environmental contaminants. Suppose that I’m interested at the moment in looking at procedures for arsines. I select a section dealing with those compounds, Control-click and do a See Selected Text search. Now DT Pro is likely to suggest highly relevant material in my database.

Of course, I could have selected “arsine” (a single word only) and pressed the Option key. DT Pro would then display a list of all documents that contain the selected word. That list is in a slide-out drawer, making it convenient to quickly explore items in the list, then return to the document you were reading. (Remember that you can always insert notes into the Comment field of a document, and such notes are searchable.)
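The single-word lookup described above amounts to listing every document in which the word occurs. A minimal sketch of that idea, assuming the documents are available as a plain name-to-text mapping (the function `documents_containing` and the sample data are invented, and whole-word, case-insensitive matching is an assumption rather than documented DT Pro behaviour):

```python
import re


def documents_containing(word, documents):
    """Return the names of documents whose text contains the word as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [name for name, text in documents.items() if pattern.search(text)]


# Invented sample database.
database = {
    "methods.pdf": "Procedures for arsine and stibine in air samples.",
    "wetlands.txt": "Phosphate levels in wetland sediments.",
    "safety.txt": "Handle Arsine only in a fume hood.",
}
print(documents_containing("arsine", database))
```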

And now one of my workarounds, or kludges:

I use DEVONnote as an omnipresent notetaker, because it provides a floating window that can be minimized to the Dock and then become available for use in any application, including DT Pro. When I’m rummaging about in my database I use DN to make notes, collect excerpts from various DT Pro documents (with the related document title included), and so on. That lets me quickly create information that is “my judgement” and not in the words themselves. When I copy that DN document into DT Pro and turn on Wiki links, I’ve automatically got hyperlinks to all the documents I excerpted or referenced (simply by copying the name of each referenced DT Pro document into my notes). That simple approach might be useful in your example of comparing excerpts from transcripts of interviews.
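The linking step in this workaround can be pictured as scanning a note for the names of existing documents. The sketch below illustrates only that general idea: `find_wiki_links` is a hypothetical helper, and exact, case-sensitive name matching with preference for the longest name is an assumption, not a statement of how DT Pro's Wiki links behave.

```python
import re


def find_wiki_links(note_text, document_titles):
    """Return (start, end, title) for each known document title found in the note."""
    # Try longer titles first so "Arsenic in Wetlands" wins over "Arsenic".
    titles = sorted((t for t in document_titles if t), key=len, reverse=True)
    if not titles:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in titles))
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(note_text)]


# Invented note and document names.
note = "Compare Arsenic in Wetlands with the sampling caveats in Soil Methods."
titles = ["Arsenic", "Arsenic in Wetlands", "Soil Methods"]
for start, end, title in find_wiki_links(note, titles):
    print(f"{start}-{end}: link to '{title}'")
```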

Christian has noted that such a floating window may be incorporated into DT Pro in a future release.

marcellus · April 11, 2006, 12:51pm

Thanks, Bill. Inventive as ever!! In fact, I use a similar kludge. Hadn’t thought to use the Wiki links, though.

louise · April 17, 2006, 5:03pm

Bill, even if the answer is negative, thank you anyway for your explanations. I won’t ask again before next year.

The AI features are useful for sure, but well documented they could be even more useful.

One simple question: do See Also and Classify consider only the document text itself, and not its name, comment, links (from and to the document), its layout or the groups it is stored in?

Bill_DeVille · April 17, 2006, 6:01pm

Both See Also and Classify consider the organizational structure of the database. Obviously, that’s very important to Classify.