Reprise: Comments/Expectations about See Also

Recent comments about the usefulness of the See Also feature in DT Pro (and about DT Pro’s strengths and weaknesses as a research tool) are found in this thread: < devon-technologies.com/phpBB … php?t=3211>.

Recommendations for modification of See Also

Peter Gallagher notes that I’m enthusiastic about See Also as a research tool (which I am, with my main database), but he would like the ability to tweak the algorithms so as to make the results more focused. For example, he suggests that additional weight be given to the terms that are included in the title of a document, and to the terms used in bolded headings and subheadings, as opposed to the body of the text in the document. Other such tweaks previously discussed might include weighting of the organizational structure of the database (which in fact does get consideration in the algorithms) and/or user-assigned keywords. (In some past threads, some users have commented that they couldn’t “trust” the AI algorithms unless they were disclosed. The algorithms are proprietary and are not likely to be disclosed.)
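To make Peter's suggestion concrete, here is a minimal sketch of field-weighted term scoring. It is not how DT Pro works (those algorithms are proprietary); the field names, the weights and the `weighted_terms` function are all hypothetical, chosen only to show what "more weight for titles and headings" could mean in practice.

```python
import re
from collections import Counter

# Hypothetical weights for illustration only; not DT Pro's actual values.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def weighted_terms(doc, weights=FIELD_WEIGHTS):
    """Return term -> weighted frequency, summed over the document's fields."""
    scores = Counter()
    for field, weight in weights.items():
        for term in tokenize(doc.get(field, "")):
            scores[term] += weight
    return scores

doc = {
    "title": "Mercury discharge limits",
    "headings": "Background Analytical methods",
    "body": "The limits apply to mercury in wastewater. Background levels vary.",
}
scores = weighted_terms(doc)
print(scores.most_common(3))
```

With these assumed weights, a word that appears in the title and once in the body ("mercury") outranks a word that appears only in the body ("wastewater"). Whether that would actually give better suggestions is an empirical question.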

In a similar vein, talazem reminds us “I think everyone here with a high school education (everyone, I’d assume) remembers when a history or English lit teacher tuned you in to the fact that if you SCAN the table of contents and index of a book, plus any abstract that you might find, you’ll have gotten the whole of the gist of the book; the rest is evidence and details.” So he sides with Peter and wishes that the AI routine could be tweaked to act as recommended by his high school teacher.

My own use and expectations of See Also

First, I neither expect nor want an exhaustive “catalog” of all the books, articles and notes about a topic to be presented in the See Also slide-out panel. I try (with more or less efficiency) to do that with my organizational structure of the database. If I wish to produce such a catalog (usually to improve my organizational structure), I’ll use searches and/or smart groups. As Christian has noted, the search features in version 2.0 will become much more powerful for such purposes.

Second, even if I could tweak the algorithms to look for bolded text or tables of contents, the variety of layouts and formatting in my collection of documents would probably make that of limited utility, and it would pose a complicated set of development problems. Perhaps keywords might be considered more easily. But personally I try to avoid keywords or tagging, because (a) I have never come up with a consistent scheme that would cover all of my possible uses of reference materials and (b) I sometimes add hundreds or thousands of references to a database, and I don’t have the time or inclination to tag them.

Third, I’ve gone through the process of producing large bibliographies (more than 3,000 references in one) using talazem’s teacher’s trick of producing index card notes taken from “scanning” tables of contents and introductory chapters (and as much more as feasible). But when I’m doing research I’m not interested in the similarity of documents in that sense. I’m looking for the “evidence and details” instead. Similar TOCs don’t necessarily imply much about similarity of the contents. :slight_smile:

I hope, when I press the See Also button, to get a list of suggestions that will help me explore a topic and perhaps lead me to new (hence unexpected) insights about relationships to other topics or ideas. The results, when I press that button, can be highly variable, depending on the text contents of the document I’m viewing and the other content of the database.

I monitor the memory usage on my Mac using a preference pane named MenuMeters. If I’ve just launched DT Pro, select a document and then press the See Also button, I notice a very large drop in free memory. That’s because DT Pro is comparing the word patterns in the document being viewed to the word patterns in the other documents in the database, a very big task indeed. It may take a few seconds to produce the first set of suggestions. Subsequent uses of See Also on other documents produce virtually instantaneous results on my computers. That’s because DT Pro retains the “setup” for See Also until the application is quit.
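For readers who like to see the idea in code, here is a rough sketch of "compare word patterns once, then reuse the setup". It uses plain bag-of-words cosine similarity, which is a generic textbook technique and only an assumption on my part, not DT Pro's proprietary method. The class name `SeeAlsoIndex` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SeeAlsoIndex:
    """Hypothetical index: builds all term vectors on first use, then keeps them."""

    def __init__(self, documents):
        self.documents = documents  # name -> text
        self._vectors = None        # the expensive "setup", built lazily

    def _ensure_built(self):
        if self._vectors is None:
            self._vectors = {n: term_vector(t) for n, t in self.documents.items()}

    def see_also(self, name, limit=5):
        """Return up to `limit` other documents sharing terms, most similar first."""
        self._ensure_built()
        target = self._vectors[name]
        ranked = sorted(
            ((cosine(target, vec), other)
             for other, vec in self._vectors.items() if other != name),
            reverse=True,
        )
        return [other for score, other in ranked[:limit] if score > 0]

index = SeeAlsoIndex({
    "dogs": "dogs are loyal pets",
    "canines": "canines such as dogs and wolves are carnivores",
    "car": "owner manual for the car engine",
})
print(index.see_also("dogs"))
```

The first call pays the cost of building every vector; later calls reuse them, which is the same pattern as the slow first See Also followed by near-instant ones.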

I’ll often follow a trail of suggestions, choosing a suggested document and running the See Also routine on it to see where it may lead me.

My main database of more than 20,000 documents deals with my interests in environmental science and technology, and associated policy and regulatory issues. It covers a broad range of scientific and engineering disciplines, from chemistry to conservation ecology, case histories of pollution problems, as well as law, economics and many other topics. Many of these disciplines have a highly structured “language” of technical terminology; others do not. Sometimes I’ll Option-click on a term to see other documents that use that term.

Perhaps I’ll examine a draft environmental regulation that sets limits on a pollutant discharged into the environment. Is it enforceable, i.e., are there available analytical techniques to measure the contaminant? Does the normal background already exceed the discharge limit? Does the toxicological information support the proposed standard? How does one balance risk assessments with cost-benefits of the proposed standard? (Those questions are often raised, and I’ve seen proposals fail because they had not been asked in advance.)

DT Pro provides me a very useful set of tools to ask and answer questions like that. I may identify areas for which I’ve got insufficient information, leading me to go looking for additional references. In such cases, the See Also suggestions may be “dumb”.

The utility of See Also suggestions depends, of course, on the content of the document being viewed and on the other documents in the database. If the text of the viewed document isn’t contextually similar to anything else in the database, there will be few or no suggestions, and they may not be of any use. But in most cases, in my database, there will be potentially useful suggestions. There are occasional glitches. I’ve moved the user manual for my Infiniti G35x car into my main database because I look at it frequently. Although it’s by no means the largest PDF file there, for some reason it turns up very frequently in See Also suggestions. I simply ignore it. One of these days I’ll remove this “magnet” file.

I try always to remember that DT Pro doesn’t “know” anything about chemistry, or toxicology, or the law. It’s up to me to understand and interpret the material I’m looking at. I’m interacting with the information in my database, and I’m responsible for decisions.

So I use See Also to explore connections between the documents in my database. The connections that often prove most valuable are those that may seem surprising, those I wouldn’t have thought of. These are not random. There are logical connections between the documents (throwing out outliers such as the user manual for my car). It’s those “surprising” ones that can lead to a new understanding, or even a new idea. That, in a nutshell, is why I would be disappointed if See Also simply regurgitated a catalog of all the other documents that say the same thing as the one I’m looking at. Example: If I’m looking at a page about dogs and press See Also, I’ll be pleased if the suggestions include documents about canines, carnivores, or perhaps pets.

The “Tower of Babel” problem of multiple languages and linguistic analysis

Maria, who deals with multiple languages in her database, wants the AI function to “see” correspondences between documents regardless of the language. Timotheus and talazem echo the desire for better multilingual capabilities.

No question about it. The Classify and See Also features would be much stronger if they could handle multiple languages. Perhaps some day that may happen.

Maria has suggested that the developers of DT Pro should construct interpretative tables between languages so that terms used among various languages could be correlated in searches and AI functions.

But which languages? There are a great many languages used in DT databases.

And which words? I don’t see any way to decide except all words, including all of the idiosyncratic variations of use and context.

For years very large organizations with very large funding have been working on similar problems. I’m not aware of any claim that a universal solution to language translation has yet been developed. Phone companies, for example, have developed correspondences between limited sets of words or phrases for many languages, but even those limited sets are very large.

For searches, though, the enhanced search capabilities of DT Pro version 2.0 will provide some assistance, as it should be possible to improve searching for terms in multiple languages.

So it should be possible to enter a query such as:
((“ExactEnglish” OR “ExactLatvian” OR “ExactJapanesevariant”) BUT NOT (“ExactRussian” OR “ExactPolish”)) BEFORE “ExactGerman”

That’s a silly example. But that kind of query will allow much more useful searches of multi-language databases.
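As a rough illustration of what such a query asks for, here is a small sketch that evaluates the same shape of query against a piece of text. The exact semantics of the version 2.0 operators are not described here, so this assumes that "BEFORE" means some match of the included terms occurs earlier in the text than some occurrence of the final term. The `matches` function and its parameters are hypothetical.

```python
def find(text, phrase):
    """Case-insensitive position of the first occurrence, or -1."""
    return text.lower().find(phrase.lower())

def matches(text, any_of, none_of, before):
    """(any_of[0] OR ...) BUT NOT (none_of[0] OR ...), occurring BEFORE `before`.

    Assumed semantics: at least one any_of term appears earlier in the text
    than the last occurrence of `before`, and no none_of term appears at all.
    """
    if any(find(text, p) >= 0 for p in none_of):
        return False
    hits = [pos for pos in (find(text, p) for p in any_of) if pos >= 0]
    anchor = text.lower().rfind(before.lower())
    return bool(hits) and anchor >= 0 and min(hits) < anchor

ok = matches("water quality near the Wasser plant", ["water", "ūdens"], ["вода"], "Wasser")
wrong_order = matches("Wasser und water", ["water", "ūdens"], ["вода"], "Wasser")
excluded = matches("water, вода, Wasser", ["water", "ūdens"], ["вода"], "Wasser")
print(ok, wrong_order, excluded)
```

This does plain substring matching; a real search would work on indexed words, and the multilingual terms here are just sample data.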

Bill,

as always, I have been reading your post with care and pleasure. It is an important matter, and I only add some comment on the part I have quoted.

Well, I did not suggest that the developers construct tables, but that the users can construct tables with equivalents. Like you, I would not like to tag all text with labels and keywords, I would just set up a table with words in fields that I work on intensively and ask DT to consider these words as (1) identical and (2) important.

Erich answered in that other thread that DT wants to make anything as automatic as possible and asked for open source dictionaries. I answered that I am against it, my field of study would be overlooked and misinterpreted again. Things would be easier with a DT option to create simple tables, mark them as “synonym tables” and have DT do the rest.
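A minimal sketch of what such a user-built synonym table could look like, assuming each row lists words to be treated as identical and the first word in the row serves as the shared form. This only covers the "identical" half of the idea, not the "important" half, and the names `SYNONYMS`, `build_synonym_map` and `normalize` are hypothetical, not anything DT offers.

```python
def build_synonym_map(tables):
    """Map every word in a row to the row's first word (lowercased)."""
    canonical = {}
    for row in tables:
        head = row[0].lower()
        for word in row:
            canonical[word.lower()] = head
    return canonical

def normalize(tokens, canonical):
    """Replace each token by its canonical form; unknown tokens are only lowercased."""
    return [canonical.get(t.lower(), t.lower()) for t in tokens]

# Hypothetical user-defined table: one row per concept.
SYNONYMS = [
    ["water", "Wasser", "eau", "agua"],
    ["law", "Gesetz", "loi", "ley"],
]

canonical = build_synonym_map(SYNONYMS)
result = normalize("Das Gesetz über Wasser".split(), canonical)
print(result)
```

If words were normalized this way before any comparison, a German and an English document about water law would share terms even though no word in them is spelled the same.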

That is my dream…

Until then, I use DT as data storage and hope for integration with the Finder and the possibility to open all databases at once ASAP.

Best,
Maria

Hi Bill,
Thanks for this longer post on ‘See Also’. My wish was to see not just ‘See Also’ but—more important to me—the standard database query that is presently based on DT’s internal concordance use ‘significance clues’ as well as frequency clues. I made some obvious suggestions about what those significance clues might be (words in the title; words in the title that repeat words in the name of the folder, if any; words that might be in headings or emphasized in the document).

It’s an empirical question how well any of these–or other clues–would work; that’s why it might be an idea to make them optional (unless the results are outstanding). But my guess is that they’d offer somewhat more ‘intelligence’ than a search based on word frequency. I may be wrong. I’m not a statistician. But I am a user of libraries from back in the days when catalogs came on index cards. I remember very well what algorithms to apply to find the item you want from a stack of candidates (hint: it wasn’t repetition of a keyword that signalled a ‘hit’).

I agree with you about the fallibility of ‘tagging’ (a la Yojimbo) that relies on the user to add ‘hints’. It might work, but it also distorts.

I also recognize the value of ‘serendipity’ that you point to in ‘See Also’. But for me, that’s a potential (and not much used) bonus. I’m more likely to be trying to find the use of a particular piece of data in a particular context that’s hidden somewhere in a pile of frustratingly similar data. “Sibling” branches of the same idea are a distraction from this sort of searching. Above all, I don’t want DT to offer me 20 results of a query where the last one (#20, which falls somewhere out on the ‘tail’ of the frequency distribution) is the only significant use, in context.

You could say (I think you may have said) ‘so what… 20 results is not so much. You can eliminate 10 at a glance through the titles’. True. But these are often LONG documents. The frequency orientation means that DT displays the first ‘hit’ that it finds and I have to open each of the candidates and search them all again (CMD+F) to find a hit that’s really in context.

It would be good if DT had had the smarts to eliminate the 10 I eliminated from glancing at the titles by (maybe) using the titles in its own search algorithms. Then I might have had to look through only 5 using CMD+F. Or maybe only 1.

Am I asking too much? Why not? When I’m writing a paper I might load 20 documents into DT and perform 50 searches. Each one of those can take more than 10 minutes with the ‘query-triage-CMD+F’ routine. It’s a hell of a lot better than the index-card-and-microfiche-days (I’m quite old). But I’m greedy. I’d like it to be a lot better still.

Best,

P

Would it be possible for the “see also” function to have, say, two presets? And either add another button next to “see also” and “classify” or allow the user to switch between the two?

I like the way “see also” functions just fine, but I would often like a lot more preference given to directory structure. Ideally, some day, I think it would be wonderful if a document could be auto-grouped based on the structures of similar documents, but in their own group tree – for instance:

Literary Sources > Anne of Green Gables > Plot Summary
(plus about a hundred others)

would cause my “Anna Karenina – Plot Summary” to be grouped in
Literary Sources > Anna Karenina > Plot Summary

It’s a pipe dream, but I think it’s a good thing to work toward. :slight_smile:

As I said in the other thread, I like your idea very much. I have no idea how easy it would be to implement… I suddenly had visions of having to rebuild the entire database or something like that… but I definitely like the idea.