The basis for scoring results in DT

I’m new to DT but I’d like to know how this works because I’m finding the ‘relevance’ displays confusing.

I have a small database of about 200–300 documents dealing with issues in the same domain. They range in size from 200-word press releases to long monographs.

Starting with the toolbar search field (say, searching ALL for an Exact phrase of one or more words): how are the documents returned ranked? By simple frequency of occurrence?

This seems to be the case, since I regularly see the LONGEST document in a subject domain returned at the top of the list, even when it is NOT the most relevant by any semantic measure. For example, brief documents with the phrase in the title and with one or two headings using the phrase (RIGHT ON THE NOSE for a relevancy score) are found at the bottom of the list.
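
To put the suspicion in concrete terms, here is a toy sketch (my own Python illustration, not DEVONthink’s actual code) of ranking by raw occurrence counts; since nothing normalizes for document length, the longest document tends to win:

```python
import re

def raw_count_rank(query, docs):
    """Rank documents by the raw number of occurrences of the query phrase.

    The naive scheme I suspect is at work: a long document that merely
    mentions the phrase many times outranks a short document that is
    entirely about it, because nothing corrects for document length.
    """
    query = query.lower()
    scores = {name: len(re.findall(re.escape(query), text.lower()))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "monograph": "water quality " * 40 + "many other topics " * 2000,
    "press release": "Water quality report: water quality improved this year.",
}
print(raw_count_rank("water quality", docs))
# The monograph (40 mentions, mostly off topic) beats the press release
# that is 'right on the nose' for the query.
```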

I scroll down through the returned documents to the bottom of the list and choose a document I believe is more relevant than DT believes. Although it is listed near the bottom of the list, it is located in two DT folders (as replicants). The title of the document contains the words I searched for and one of the folders in which it is located (as a replicant) has a name that contains the words in the phrase for which I searched (in other words, I have classified this document to a folder named for its relevant content). But DT has returned – at the bottom of the relevancy list – the replicant in an earlier folder in the hierarchy whose name bears no relation to the content I searched for.

I click on that document in the list of documents returned by the search and the drawer to the right of the window fills with a list of documents. But none of these is ANY of the documents (all replicants) in the folder with a name that contains the words I searched for. Why not?

The same list seems to appear in the Drawer when the ‘see also’ button is pressed.

It seems to me that DT is not taking notice of relevancy signals other than frequency of words in the documents themselves. Is that correct?

It seems to be ignoring my manual classification into folders when searching for documents. Is that correct?

When I search for a phrase, does the score reflect simply the frequency of the individual words? Or is there some attention paid to proximity in the frequency calculation? Could there be?
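
For instance, a proximity bonus could be layered on top of plain frequency along these lines (a hypothetical sketch, not a claim about how DT’s engine works; the window size and bonus factor are invented):

```python
def proximity_score(words, doc_tokens, window=4, bonus=2):
    """Score a multi-word query: plain word frequency, plus a bonus each
    time the query words occur within a small window of one another, so a
    document using them as a phrase beats one that merely scatters them."""
    base = sum(doc_tokens.count(w) for w in words)        # plain frequency
    extra = 0
    for i, tok in enumerate(doc_tokens):
        if tok == words[0] and all(w in doc_tokens[i:i + window] for w in words[1:]):
            extra += bonus                                # words appear together
    return base + extra

doc_a = "tidal mills generate power from tidal currents near old mills".split()
doc_b = "tidal forces shape coasts and ancient grain mills intrigue me".split()
print(proximity_score(["tidal", "mills"], doc_a))  # phrase used together: higher
print(proximity_score(["tidal", "mills"], doc_b))  # same words, scattered: lower
```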

Is there (or will there be) attention paid to, e.g., titles or headings in a document? I could understand that weighting headings inside a document might be difficult for plain text, but what about RTF (where headings are more readily distinguished)?

Is there (or will there be) some way of giving DT ‘hints’ about relevancy? For example, by identifying ‘key’ phrases manually in a document? Shouldn’t classification have this effect?

I apologise for the questionnaire. I tried to find the answers to these questions (that seem sort of fundamental to me) in the documentation or here in the forums (using terms such as ‘score’, ‘relevan*’, ‘title’) without finding answers.

Thank you.

I was also wondering about this: folders whose names coincide with the search string are ranked very low.

I would greatly appreciate it if someone from Devon could answer the original poster’s questions.

Thanks,
Ben

Peter and Ben:

It seems to me that searches in DT are likely to be weighted by the frequency of occurrence of the search string in documents.

But as your database grows larger, you will probably find that isn’t the case for “See Also.” This is a feature I use a great deal, and it does often display striking intelligence.

“See Also” relevancy ratings show the document you are viewing as most similar in context to itself – which is right on. My main database contains a good many thousand items, and over 14,300,000 words. Even large documents that contain some of the terms in the base document may rank low – again, the relevancy rankings really do reflect the fact that DEVONthink is looking at the contents of the database to find semantic relationships. Clearly, DEVONthink is looking beyond the hierarchical organization of items, which I find very useful indeed. It can find similar items regardless of how they are grouped. What I really love is that DEVONthink can often suggest similarities or relationships that I hadn’t realized. From that comes inspiration.

I haven’t really been able to keep my database content well organized in groups. At the moment, for example, I’ve got over 3,000 items in my “Edit This” imports group. That doesn’t stop DEVONthink from acting as a very useful research assistant.

Intelligent features such as See Also mean that DEVONthink will thrive in Tiger. Spotlight simply can’t do that sort of ‘thinking.’

There are some things that I organize carefully into groups, so that I – a mere human – can find them myself: contacts, drafts, and that sort of thing.

Peter, you wondered whether giving special weight to titles of documents, or providing other metadata hints, would result in better performance by DEVONthink. Although we probably will see additional metadata possibilities for ‘marking’ items, See Also already displays an uncanny ability to sense contextual relationships (semantics? meaning?). Keep adding to your database, and you will find DEVONthink becoming almost magical.

Hi Bill,

I’d hoped that DevonTech would have given you definite information on its technologies when they appointed you as an ‘evangelist’. I guess from your use of conditional phrasing (‘likely to…’, ‘…probably’) that they haven’t. I’m sorry if that’s the case, because you are making a valiant effort.

As far as I could tell after two months of using DT (I more or less gave up on it a month ago), it never improved its ‘intelligence’. In fact, I’d rate it as a singularly ‘unintelligent’ and procedural database. It seems to use only concordance weights (including for ‘see also’). That is: raw frequency. Nothing smarter than that. Purely statistical weighting (and pretty ordinary statistics). This works for, say, eliminating SPAM, but it does little or nothing for intelligent search.

DT has NO CLUE AT ALL about semantics. Sorry, but I believe you’re wrong there. It seems to ignore semantic clues completely. That’s why it seems to return such dumb results. An ‘intelligent’ database that doesn’t even weight the names of the folders containing the data, as lief_ben points out, is not so ‘intelligent’ after all.

Your advice to continue to use DT until it performs better amounts, if I’m right, to saying that with a VERY large sample space, the impression a user will have from simple probabilistic results will be better. That’s plausible. But not very much better. And not better to the same extent as even a modest use of intelligent weighting would have made it (in my estimation).

I’m afraid I didn’t have the patience to wait months for small incremental improvements.

I’ve gone back to relying on the file system as a storage mechanism. I’m hoping SPOTLIGHT will be no worse than DT and a whole lot more convenient.

Best wishes,

Peter

It’s somewhat hard for me to assess the issue. For me, the search has worked quite OK so far, but not to the extent that I would call it impressive. Generally speaking, DT has worked quite OK for me; in fact, I wouldn’t want to miss it.

Peter, there are reports from people like Steven Johnson which support the idea that DT has something “smart” about it. And on a more pragmatic note, statistics can deliver amazingly good results … But I sort of agree with your criticism anyway: DT tries to build its reputation on the “AI features”, and I would expect those features to be employed for ranking search results in just such problems.

I would still really appreciate some more concrete information on the actual technology used for ranking the entries matching queries – developers, this is your call! :slight_smile:

Ben

Peter:

Wish I could sit you down with my computer and demonstrate my database. It covers a wide spectrum of disciplines related to my science and environmental policy interests.

I’ve had career training and experience in a number of disciplines: chemistry, physiology, molecular biology, biochemistry, public administration (theory and practice), science and environmental policy (teaching and making), environmental regulatory administration, risk assessment, quality assurance and even high finance (administering a quarter of a billion dollars in water quality improvement loans).

DEVONthink doesn’t “know” about any of those things. When I’m looking at a document about a water quality issue such as mercury intake in fish, I can press the See Also button and DEVONthink will suggest other documents that may be similar. In doing that, DEVONthink is looking at the terms used in the document I’m viewing (including patterns or associations), going to its glossary, and looking at links to other documents.

The list of documents that are “similar” to the one I’m viewing is sometimes obvious, sometimes dumb, and sometimes very interesting. In this example, I’ll see literature on occurrences of methylmercury in fish, health effects, regulatory limits (U.S. and in other countries), policy discussions, risk assessment procedures, mercury sources and sinks in the environment, analytical and statistical evaluation procedures, impacts on fishery economics and so on.

It’s often useful to open one of the “similar” documents and do See Also on it. Useful threads of information and thought can result. I can quickly get an overview, for example, about the range of expert opinions on health risks and possible trends in regulation of mercury sources, fisheries and dietary recommendations.

That’s much more useful than Search. When I do a search, I’m simply looking for those documents that contain my search terms. I do that a lot. Then I follow the See Also threads in documents that may give me an interesting starting point.

This is an interactive process. I’m responsible for “understanding” what I’m reading. DEVONthink acts as my research assistant by serving up suggestions of other documents that it “thinks” are contextually related in some way to the document I’m looking at.

I’m reminded of one of Goethe’s sayings: There is nothing more frightening than ignorance in action. DEVONthink is ignorant about fish, mercury, humans and health risks. It merely parses through documents looking for terms and contextual relationships. The human part of the team is responsible for not being ignorant, while mining knowledge and understanding from the data. If I don’t truly understand what I’m reading, whatever I write is likely to be bad. (God knows, there’s already too much bad literature in the fields of science and environmental policy!)

Bottom line: DEVONthink’s relevance ratings for searches are primarily statistical, with weighting for the number of occurrences of terms. DEVONthink’s relevancy rankings for See Also lists are more tied to a “similarity” construct based on contextual relationships, which I find more useful. But I’m responsible for deciding what’s interesting and useful for my purpose at hand, so DEVONthink’s rankings are only a start. For topics that are well represented in my database, I’m certain to find a good starting point among DEVONthink’s suggestions.
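
Restated as two toy scoring functions (my own sketch; cosine similarity is only a stand-in for whatever similarity construct DEVONthink actually uses):

```python
import math
from collections import Counter

def search_score(term, doc):
    """Search: essentially, how many times does the term occur?"""
    return doc.lower().split().count(term.lower())

def see_also_score(doc_a, doc_b):
    """See Also: compare whole word profiles, here via cosine similarity
    of term-frequency vectors -- a stand-in for the real 'similarity'
    construct, but it shows the difference in kind: documents are
    compared as wholes rather than counted against one query term."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

source = "mercury levels in fish and health risks of mercury intake"
other = "dietary limits for methylmercury in fish and human health"
print(search_score("mercury", other))   # 0 -- the exact term is absent
print(see_also_score(source, other))    # > 0 -- the word profiles still overlap
```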

Bill:

Thanks for the further explanation. You’re a very good evangelist for DT.

I guess I would have liked to see DT take ‘contextual’ clues – including the names of the folders/categories that I use for documents – into account in a more deliberate way.

Best

Peter

I’ve loved the concept – and the execution – of DT since I first heard of it. I was convinced to buy it by those two magical functions: “classify” and “see also.” Of the two, I originally valued “classify” the most. At the time, I was a Filer, not a Piler, of documents.

I’ve now come around 180 degrees. With the power of DT’s search function – and the power of Spotlight – I no longer want to waste the time it takes to create and maintain a complex hierarchy of nested files and folders. I’d much rather throw everything into one or two giant folders, and use the power of the search tools available. That means I now value “see also” even more.

HOWEVER, I’ve always assumed that DT’s “see also” function drew its accuracy from that folder hierarchy. If it’s true that DT doesn’t use the context provided by the enclosing folders when it calculates the “see also” function… why, that’s very liberating. :slight_smile:

Peter:

Thanks for the compliment (I think). My views about DEVONthink didn’t change when I started evangelizing for DEVONtechnologies – and I continue to spend more time as a user than as an evangelist. :smiley:

DEVONthink does pay attention to the database’s organizational structure, but – fortunately for me – not in too rigid a way. It does networking, too.

My example of mercury intake in fish required DT to look across disciplinary and organizational lines. I’ve got groups for Chemistry/Analytical Procedures, Water Quality/Heavy Metals/Fate & Transport, Agriculture & Food/Fisheries, Risk Assessment/Standards & Limits, Toxicology/Heavy Metals/Health Effects, Air Quality/Pollutants/Sources & Source Controls, and so on. All of those groups (and still others) contain information that I may find useful when I’m thinking about human health effects related to intake of mercury from eating fish.

By the way, fish contained mercury from natural (does that mean “organic” :slight_smile: ) sources long before humans appeared. Natural loads of mercury in fish vary significantly by locality and fish species. Some fish have always had high mercury levels, essentially independently of anthropogenic pollutants. On the other hand, some fish in many localities have mercury levels that are clearly related to human activities. I need to know that sort of thing, too. It suggests what can, and what cannot, be done through environmental policies and regulations to reduce mercury pollution. Zero mercury in fish, for example, would be an unattainable goal; Mother Nature wouldn’t stand for it.

See Also threads can walk me through that relatively complex organization of my references, and across disciplinary lines to find pertinent information. Often, it helps me fit together bits of information in ways that I had never thought of before. I’m still fascinated by that.

Based on my experience with DT, search results are ranked by number of occurrences of the search term in the documents contained in the database.

The functioning of See Also is trickier to understand, but DT appears to list the found documents according to the frequency of shared words.

DT does not consistently suggest classifying a document into the group that contains the top document in the See Also list. However, I don’t know much about this feature and I’ve barely used it at all, because I prefer to organize the database on my own.

Thus if I’m correct, “context” in DT simply means the ranking of a word among other words, and ranking depends on frequency instead of semantics.

DT should at least consider the actual weight of words in a document. If a 10,000-word document contains the word “water” 20 times, it should be ranked below a 500-word document that contains it 10 times.
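
In other words, normalize by length. A minimal sketch of that arithmetic (my illustration of the proposal, not DT’s actual weight formula):

```python
def relative_frequency(term, doc_words):
    """Occurrences of the term divided by the document's total word count."""
    return doc_words.count(term) / len(doc_words)

long_doc  = ["water"] * 20 + ["filler"] * 9980   # 10,000 words, 'water' x 20
short_doc = ["water"] * 10 + ["filler"] * 490    # 500 words,    'water' x 10

print(relative_frequency("water", long_doc))     # 0.002
print(relative_frequency("water", short_doc))    # 0.02 -- ten times the weight
```

On this measure, “water” would make the keyword list of the 500-word document but not of the 10,000-word one, which is the behaviour argued for below.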

At present, instead, the “weight” column in the Words drawer assigns a high ranking to uncommon words, misspelled words, foreign words, and so forth. [More on this in the PS below.]

The keyword list (shown by clicking on the >> icon) usually contains the same words listed on top of the “weight” columns, and is practically useless.

If weight were calculated simply by considering the number of occurrences of each word in relation to the total number of words in a document, then DT would provide the user with a more reliable list of keywords (in the example above, “water” would likely figure among the keywords in the 500-word document, but not in the 10,000-word document).

Thus a user could do a See Also, obtain a more reliable list of related documents, quickly check the keywords in each of them, and decide whether a document is really relevant.

Since I’m completely ignorant of these issues at a theoretical level, I wonder whether what I said makes sense, and whether it would be difficult to implement in DT. Like many of us, I am grateful to Bill for his support, clarifications and tips. I agree, though, that these issues concern the very “soul” of DT and would deserve a reply from the developers.

PS. Here is an example of how weight and keywords work at present in DT. I just pasted the present message in DT, opened it in its own window and clicked on Words. The top words in the “weight” column were “trickier”, “misspelled”, “developers”, “quicky” (a typo for “quickly”), and “semantics”. When I corrected “quicky” to “quickly” and updated the list of Words, “quickly” disappeared from the top of the “weight” column. And the top keyword in this message is… “misspelled”!

Agreed… Clarification on the importance of folder structure in these functions would also be nice, as Fred pointed out. A response from the developers on “how” these important functions work would help us all organize and use our DT databases more effectively. As a now “somewhat experienced” new user I am still figuring out how best to go about using DT, and these are important issues, as “Classify” and “See Also” are what set DT apart…

Eric and Christian are burning the midnight oil readying new releases. Perhaps they will pick up on this thread after that.

I’ve played with See Also, trying to get a feeling for some of the logical tricks behind this feature. I’ve pulled up sets of similar documents and examined the Words and Keywords listings for each. If “context” were as simple as your analysis suggests, I would have expected very different similarity listings from what I get in practice.

My guess is that DT’s similarity rankings are based not merely on word frequencies, but on associations of terms, most of which are not listed as Keywords in the sets I’ve looked at closely. If I’m right, association tends toward context, which tends toward semantics.

Example: I pulled up a document about tidal mills (generation of electricity from tides). My database holds other documents that include both of the terms ‘tidal’ and ‘mills’ in other contexts. Some of those other documents contain several other terms that were present in my “source” document. None of them showed up except those dealing with the topic of my source document: generation of power from tides or currents. But See Also also pulled up documents dealing with production of power via the temperature differences in ocean depths, and windmill power stations. Interestingly, the articles on the latter two topics contained neither ‘tidal’ nor ‘mills’ but did contain the terms ‘power’ or ‘generation’ and/or ‘electricity’. What really fascinated me about this example is that I’ve got hundreds of articles on energy production, including power plants using coal, natural gas, biofuels and, of course, nuclear power plants. None of those was suggested by See Also from my source article.
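
One way that ‘association tends toward context’ could emerge from plain statistics (a speculative sketch, not a description of DEVONthink’s algorithm) is to expand the source document’s word profile with terms that co-occur with its dominant words elsewhere in the database, and only then compare profiles:

```python
from collections import Counter

def cooccurrence_expand(source_words, database, top_n=5):
    """Add to the source profile the words that most often appear in the
    same documents as its dominant terms. A 'tidal mills' article can then
    match a windmill article via shared companions such as 'power' or
    'electricity', even where 'tidal' and 'mills' never occur."""
    top_terms = [w for w, _ in Counter(source_words).most_common(top_n)]
    companions = Counter()
    for doc in database:
        doc_set = set(doc)
        if any(t in doc_set for t in top_terms):
            companions.update(doc_set - set(top_terms))
    expanded = Counter(source_words)
    for word, _ in companions.most_common(top_n):
        expanded[word] += 1          # crude boost for associated vocabulary
    return expanded

db = [["wind", "power", "electricity", "turbine"],
      ["tidal", "power", "electricity", "mills"]]
print(cooccurrence_expand(["tidal", "mills", "tidal", "power"], db))
# 'electricity', 'wind' and 'turbine' join the profile via co-occurrence
```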

The source article also contained the term ‘alternative’ in conjunction with power production. Some, but not all, the items in the See Also list also contained that term. On the other hand, I’ve got lots of articles about another alternative energy source, solar power – as well as alternative sources such as biofuels, or household fuel cells. None of those showed up in the See Also list, although many of them contained the terms alternative, power, generation, electricity, and so on.

The See Also list didn’t contain any items about hydroelectric power generation from dams.

The See Also list pulled suggestions from different groups, including my group of not yet classified items.

In other words, DEVONthink produced a list of suggested articles that is eerily like the one I would have produced myself. I would have been looking for ‘new’ stuff, based on tides or currents (including wind currents). I might or might not have thought to include the articles suggested by DT on electricity production based on water temperature differences. I would have ignored ‘conventional’ power plants, but might have looked at solar energy topics.

Frequency alone doesn’t cut it. See Also is doing some sort of pattern recognition that filtered out hundreds of other potential suggestions that were not quite similar enough to the source article.

See Also is really doing some interesting things. Christian has talked about what he would like to do when CPU power goes up another step or two. I’m looking forward to that. :slight_smile:

Bill,

Thanks for your reply; it’s a pleasure to be in touch with you.

What made me curious to understand how DT works is this. Among my documents, I have a large (330,000 words) dictionary of Buddhist terms, translated into English with short explanations. Whenever I do a See Also on any document in my database, that dictionary is always at the top of the list. My database contains many short documents on topics loosely related to Buddhism, but very few of them are specifically on Buddhism, and I think I have never used any of those documents as a starting point. So why is the Buddhist dictionary related to almost everything else? I think the reason is that, being a dictionary and being large, it contains repeated occurrences of virtually any word found in the shorter documents. To me, this suggests that word frequency, rather than keywords or context or semantics, is the main criterion used by DT in looking for related documents.

I think we should continue to talk about this topic in order to understand a little better how DT works, until Christian et al. have time to reply. So here are a few tentative comments on your post:

This is exactly what I wanted to say in my previous post. “Tidal” and “mills” must occur frequently in your source document, since it’s about tidal mills. DT retrieves other documents in which those words have a high frequency. In addition, it also retrieves documents in which other words that occur frequently in your source document have a high frequency.

The question here is: Don’t “power” and “generation” and/or “electricity” occur frequently in your source document? I would guess so, since that document is about generation of electricity from tides. Therefore DT duly retrieves other documents in which these words occur frequently.

Could this be because the words that occur most frequently in them do not occur frequently in your source document? Or – to ask the same question in a different way – do they contain many words specifically related to tidal mills?

To find out, please open the source document and the “unrelated” documents in their own windows, and open the Words drawer in each window. Arrange the lists of words according to frequency, with the most frequent words on top. Do the unrelated documents contain many words in common with the source document among, say, the first 10 or so words (excluding “a”, “the”, “who”, “that”, “is”, “are” and similar words, which I suppose DT omits in its calculations)? My guess – please correct me if I’m wrong – is that they do not.

Again, the point is not whether a particular word occurs or does not occur in the related or unrelated documents. The point is how many times the most frequently used words in the source document occur in other documents. Say that I have a document in which the four most frequent words are “lunch”, “dinner”, “eat” and “drink”. When I do a See Also, the first documents retrieved by DT would be those in which all of those words have a high frequency, and not only “lunch”. A document concerned only with “lunch” would be ranked lower in the list, or perhaps would appear only if I click on More to see more results.
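
That proposed mechanism, sketched out (just my guess at what See Also might be doing, nothing more):

```python
from collections import Counter

STOPWORDS = {"a", "the", "who", "that", "is", "are", "and", "of", "in", "we"}

def see_also_guess(source, candidates, top_n=4):
    """Take the most frequent non-trivial words of the source document,
    then rank each candidate by how often *all* of those words occur in
    it. A document rich in 'lunch', 'dinner', 'eat' and 'drink' beats one
    that repeats 'lunch' alone."""
    def profile(text):
        return Counter(w for w in text.lower().split() if w not in STOPWORDS)

    top_words = [w for w, _ in profile(source).most_common(top_n)]
    ranking = []
    for name, text in candidates.items():
        p = profile(text)
        ranking.append((sum(p[w] for w in top_words), name))
    return sorted(ranking, reverse=True)

candidates = {
    "menu":   "lunch dinner drink eat drink dinner lunch eat",
    "picnic": "lunch lunch lunch sandwiches",
}
print(see_also_guess("eat lunch eat dinner drink wine drink", candidates))
# [(8, 'menu'), (3, 'picnic')] -- the document about eating in general wins
```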

I don’t think that DT sees those four words and says, “mmmh, this document must be about eating; let me retrieve a few documents about that topic”. That would be (if I understand the Big Word correctly) “semantics”.

I hope I’m wrong :open_mouth: . Thanks a lot again for your reply.

You really aren’t wrong, at least not entirely wrong. DEVONthink is looking at word frequencies and keywords in comparing the textual content of documents.

And you are almost correct when you note that DEVONthink doesn’t make the leap to understanding that a document is about eating. I agree with you that DT doesn’t understand the meaning of eating, much less the meaning of lunch. What DT is doing is working with contextual relationships of groups of terms, not just single terms, across the database. That’s actually pretty rich stuff!

Years ago, preparing for a presentation, I printed out histogram charts on the frequency of selected tumors by location, race, and sex. The data came from the U.S. SEER tumor registry database, covering a time period from the 1950s into the 1970s. I printed out three sets corresponding to lung cancer, bladder cancer and another that I can’t recall now.

Each tumor type produced a distinctive histogram. That is, I could easily distinguish between tumor types by the pattern of the histograms. Lung cancer histograms had a distinctive ‘fingerprint’, as did the others. There were easily identifiable patterns for each tumor type that corresponded to sex and race. To a lesser degree, there were identifiable patterns by location. A striking location pattern was that for San Francisco, where lung cancer in women was considerably elevated in comparison to the other locations. An interesting correlation: I had data on cigarette smoking behavior indicating that female residents of San Francisco smoked more heavily than those in other regions of the U.S. during that time period.

Frequencies of the selected cancers in New Orleans were at or below the national average (for race and sex) except in one instance. White males in New Orleans had elevated frequencies of bladder cancer. Other available data showed that this increased frequency corresponded with socioeconomic status, with the increased frequencies of bladder cancer concentrated in ‘upper class’ white males. I wasn’t able to identify the environmental factors here, but suggested only half in jest that it could be expensive bourbon whisky and brandy, perhaps with expensive cigars.

Those relatively simple histogram charts were a good communication device for my audience, as they provided visual representations of some data correlations in a big data set. They got some points across, and I had fun with the presentation.

What See Also is doing is conceptually similar, but more involved, with more variables. It ‘fingerprints’ the source document, then looks through my database (over 14,300,000 words, with thousands of documents) for those with similar ‘fingerprints’. All this in seconds. And it doesn’t need visual aids to help it make correlations!
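
The fingerprint analogy can be pictured, very roughly, as reducing each document to its set of dominant terms and measuring the overlap between sets (Jaccard overlap here; purely illustrative, since the real comparison is presumably much richer):

```python
from collections import Counter

def fingerprint(text, size=50):
    """A crude 'fingerprint': the set of a document's most frequent terms,
    the textual analogue of a distinctive histogram shape."""
    return {w for w, _ in Counter(text.lower().split()).most_common(size)}

def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints: 0.0 (nothing shared) to 1.0."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

doc_a = "tidal mills mills power power electricity tide turbine"
doc_b = "wind power power electricity turbine turbine blades"
print(similarity(fingerprint(doc_a), fingerprint(doc_b)))  # shared dominant terms
```

Scanning a few thousand stored fingerprints and sorting by overlap would take only seconds, which fits the behaviour described.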

As I’ve noted before, it’s up to me, the human part of the team, to decide whether there is meaning (semantics) that can be associated with the list of similar documents. I’m free to toss out the suggestions that I don’t think are ‘meaningful’.

To refer back to your example, it’s true that DEVONthink doesn’t understand the ‘meaning’ of ‘eating’. But I’m delighted by how well See Also performs in suggesting related material that’s in my database. If I start with a document about the health benefits of fish oils, for example, I can quickly survey other literature on that and related topics. If I start with a document about toxic pollutants in food, I quickly get another set of suggestions covering that topic.

I’m looking forward to Spotlight, but Spotlight can’t do for me what DEVONthink already does. See Also has become indispensable to me for literature research.

Me, too. Amusing reference.

Exactly.

Saving/organizing certain data into hierarchical nests became a BWOT for me a few years ago. Nowadays I’d rather the computer handled the tedious location-centric, filesystem(-like) organizational storage details so I have more time to be pedantic about other things. :wink:

More generalized fast searching (including metadata), “smart” items/groups, and interface enhancements are making it easier to correlate a broader diversity of information across hierarchical and other boundaries. Fun stuff.

In followup to my last post, I can’t resist adding this comment from Apple’s Brian Croll in Giles Turnbull’s “Finder’s end?” weblog entry:

“The first thing you start doing is cease using hierarchy as a means of storing stuff. I know this, because I’ve been using this for a few months and I don’t bother to file things anymore. I don’t give a rat’s ass where my files are, because Spotlight finds everything for me.”

A prototype Spotlight-extremist view? :slight_smile:

I’d like to go back to the original question of Peter:

What exactly does the AI of DEVONthink do? How exactly do the search/classify/see-also functions work?

I bought DEVONthink a few days after I started testing it, I now work with it every day, and I’m really satisfied with it – but I think the above questions should be answered in the manual, because knowing and understanding how these functions work would help me work with their results.

Thanks,
Lousie 8)

Yes, that would definitely be helpful. There is something in the air in this thread in the spirit of “there really is no AI in DT, it’s just a marketing strategy”, and I would really love to have this claim shown to be unsubstantiated by the makers of a program I love…

I too would suggest something at least along the lines of a logic diagram or a conceptual model of the “reasoning” that is occurring within the artificial intelligence of DT. I think this would help a great many of us find even more efficient uses of DT in our workflows. It could be done in a way that doesn’t reveal any proprietary code.

ChemBob

I’d be sort of surprised if Dev_Tech were prepared to tell us how they are managing the relevancy assessments that rank search results and that apparently lie behind ‘classify’ and ‘see also’.

From my own experience, I share xuanyingzi’s assessment of what is likely going on behind the scenes: mostly a weighted frequency assessment based on a concordance.

What I’d appreciate from Dev_Tech even more than the details of existing technologies is an assurance that DT will in future take semantic relevance clues (manual classification to folders, headings in text, titles) into account when weighting results. That would improve results enormously, I believe.
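
The kind of weighting being asked for might look something like this (a hypothetical sketch; the field boost factors are invented purely for illustration and say nothing about how DT actually scores):

```python
def weighted_score(term, doc):
    """Combine occurrence counts from different 'fields' of a document,
    giving extra weight to the title, headings and enclosing folder names.
    The multipliers (5, 3, 4) are arbitrary illustrations."""
    term = term.lower()
    body_hits    = doc["body"].lower().split().count(term)
    title_hits   = doc["title"].lower().split().count(term)
    heading_hits = sum(h.lower().split().count(term) for h in doc["headings"])
    folder_hits  = sum(f.lower().split().count(term) for f in doc["folders"])
    return body_hits + 5 * title_hits + 3 * heading_hits + 4 * folder_hits

doc = {
    "title": "Tidal mills",
    "headings": ["Tidal power basics"],
    "folders": ["Energy", "Tidal generation"],
    "body": "a short press release that mentions tidal power once",
}
print(weighted_score("tidal", doc))   # title, heading and folder hits dominate
```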

Also, since even Spotlight can do boolean ‘not’ searches, it would be good to have this basic search functionality in DT.

A step up from there would be some sort of ‘Bayesian’ approach to weighting: for example, providing check boxes against a list of documents returned from a search so I can indicate ‘good result’ or ‘not what I wanted at all’. This is how my spam filter (SpamSieve) works… and it now works extremely well.
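
A toy sketch of that feedback loop, loosely in the spirit of a Bayesian spam filter (my own illustration, not a proposal for how DT would have to implement it): each ‘good result’ or ‘bad result’ click nudges per-word weights, and later searches are re-ranked with them.

```python
from collections import Counter, defaultdict

class FeedbackRanker:
    """Toy relevance feedback: marking a result good or bad adjusts
    per-word weights, which then bias later rankings -- roughly the way a
    Bayesian spam filter learns from its good and bad corpora."""

    def __init__(self):
        self.weights = defaultdict(float)

    def mark(self, doc_text, good=True):
        delta = 1.0 if good else -1.0
        for word, count in Counter(doc_text.lower().split()).items():
            self.weights[word] += delta * count

    def score(self, term, doc_text):
        words = doc_text.lower().split()
        base = words.count(term.lower())                 # plain frequency
        feedback = sum(self.weights[w] for w in set(words))
        return base + 0.1 * feedback                     # 0.1: arbitrary mix

ranker = FeedbackRanker()
ranker.mark("tidal mills and tidal power generation", good=True)
ranker.mark("textile mills and labour history", good=False)
print(ranker.score("mills", "tidal power from offshore mills"))           # boosted
print(ranker.score("mills", "cotton mills in the industrial revolution")) # not
```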

Now for something OFF TOPIC (sorry):

The new PDFKit in Tiger allows PDF annotation. A particularly neat application of that facility is demonstrated in the most recent iterations of TeXShop (uoregon.edu/~koch/texshop/texshop.html). TeX source is plain text, but the output is frequently PDF. TeXShop allows users to navigate from a word in the PDF to the text version and back again by clicking on a word (in either the text or the PDF version). VERY neat: imagine how this could help when using a text/RTF conversion of a PDF file in DT.

Best wishes,

Peter