DN 2b7 - Database Properties (incorrect statistics?)

Total: 221, 18.3MB
Words: 116,664 unique, 350,773 total

Number of unique words seems to be excessive for a small database.
How are these statistics calculated?

Take a look at the database’s Concordance (Tools > Concordance).

Where is “Tools > Concordance” in DEVONnote? I was unable to find…

BTW, in my database there is a file with the first 100,000 prime numbers --> 11,115 paragraphs, 100,032 words. It appears that in “107 109 113 127 131 …” each number is counted as a separate word.

Are word statistics used by the “Classify” feature?

Tools > Concordance is available in the DEVONthink applications, but not in DEVONnote.

Yes, a ‘word’ is a string of three or more alphanumeric characters, by default.

‘Classify’ does consider word frequencies.
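To see why a list of primes inflates the unique-word count, here is a minimal Python sketch of the word definition given above (three or more alphanumeric characters). It is only an illustration, not DEVONnote's actual tokenizer: the regular expression, the case folding, and the `word_stats` helper are all assumptions.

```python
import re

# Assumption: a "word" is a run of 3 or more ASCII letters or digits.
WORD_RE = re.compile(r"[A-Za-z0-9]{3,}")

def word_stats(text):
    """Return (unique_words, total_words), ignoring case (an assumption)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    return len(set(words)), len(words)

sample = "107 109 113 127 131 is a run of primes, 107 again"
unique, total = word_stats(sample)
print(unique, total)  # 8 9
```

Every distinct number of three or more digits becomes its own unique word, while short strings such as “is” and “of” are dropped.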

OK, is there a way to change this “default” definition?

In my experience, junk/noise words are show-stoppers for automatic document clustering and text mining. Some form of word filtering could be quite useful, e.g. a user-defined stop list or per-document/per-library word importance weighting.

Better yet, there’s a user-accessible way to filter out a document such as your list of prime numbers. In that document’s Info panel you may exclude the document from Classification, See Also, Search and/or Tagging. When you’ve been using Classify or See Also in a large database, it’s pretty easy to spot a small handful of documents (among tens of thousands) that can be filtered out to improve the focus of Classify and See Also. For example, in a large database I found that See Also was likely to pop up the same two documents in many of its suggestions, regardless of the topic of the document being viewed at the time. One was my owner’s manual for an Infiniti G35x, the other was a Surgeon General’s report on second-hand smoke. I’ve never been able to figure out why See Also was so entranced by those two documents. Excluding them from See Also suggestions solved the problem.

Although one can in fact emphasize the importance of a particular term in search queries (there’s an operator to do that), I think the value of See Also would actually be lessened by that approach. I value See Also most when it suggests an interesting relationship that I wouldn’t have thought of, rather than regurgitating relationships that I already know.

Hi Bill,

First of all, thanks a lot for the explanation. The ability to exclude “difficult” documents from Classify or See Also is really great.

I would be grateful if you could answer a few more questions.

  1. Is there any chance you will introduce some form of user-visible stop-word list in DEVON products? It could help eliminate common words (it, else, then, etc.).

  2. The weighting I mentioned in my previous post is about choosing which words from a document to use for indexing. E.g. the word ‘algebra’ may be important for one document (where it is used as a keyword) but not for another (where it is not). The query itself does not include weighting. Do you think such a scheme would affect the See Also feature?

  1. No. If you are using DT Pro or DT Pro Office, take a look at the Concordance (Tools > Concordance).

When a new document is added to the database, its text is analyzed and incorporated into the Concordance. What is a word? The settings for my database include alphanumeric strings of 3 to 50 characters. Strings such as ‘a’, ‘it’, ‘is’, and so on are not included in the Concordance. The Concordance for my main database includes 466,665 unique ‘words’ by that definition.

As most of my documents are in English, there are a great many English words in my Concordance. There are also many foreign words, in languages ranging from Greek and Russian to French and German. There are numeric strings. There are variants of words, including typos. There are ‘strange’ words, such as strings picked up from URLs.

The most frequent word in my database is ‘The’. It occurs 2,117,287 times. It can be found in a total of 3,480 groups. It has a length of 3 characters. It has a weight of 0.

Algebra occurs 106 times, appears in 49 groups and has a weight of 4. Algebraic occurs 19 times, appears in 11 groups and has a weight of 11. Algebraically occurs 2 times, appears in 6 groups and has a weight of 27. Algebras occurs 8 times, appears in 5 groups and has a weight of 15.
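For readers without access to the Concordance, per-word figures like those above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not DEVONthink's implementation: `build_concordance` is a hypothetical helper, words are lowercased, and it counts the documents a word appears in as a stand-in for the “groups” column.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z0-9]+")

def build_concordance(docs, min_len=3, max_len=50):
    """Map each word to (occurrences, documents containing it, length)."""
    occurrences = Counter()
    doc_counts = Counter()
    for text in docs:
        # Keep only alphanumeric strings of min_len to max_len characters.
        words = [w.lower() for w in WORD_RE.findall(text)
                 if min_len <= len(w) <= max_len]
        occurrences.update(words)
        doc_counts.update(set(words))  # count each document once per word
    return {w: (occurrences[w], doc_counts[w], len(w)) for w in occurrences}

docs = [
    "The algebra of the reals",
    "The cat sat on the mat",
    "It is algebraic",
]
conc = build_concordance(docs)
print(conc["the"])      # (4, 2, 3)
print(conc["algebra"])  # (1, 1, 7)
```

Two-letter strings such as ‘it’ and ‘of’ never enter the table, which matches the 3 to 50 character setting described above.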

Extremely common words such as ‘the’ and ‘and’ are assigned a weight of zero. Knowing the kind of documents in my database, I would agree on the relative weights assigned, for example, to ‘algebra’ and ‘algebraically’. Those weights are assigned by proprietary algorithms – don’t ask about them.
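Since the actual weighting is proprietary, the following is not DEVONthink's method. It is a sketch of a textbook technique, inverse document frequency (IDF), that shows the same qualitative pattern: a word found in every document gets a weight of zero, and rarer words get higher weights. The `idf_weights` helper and the sample counts are made up for illustration.

```python
import math

def idf_weights(doc_counts, n_docs):
    """Plain IDF, log(N / df): a word present in every document scores 0."""
    return {w: math.log(n_docs / df) for w, df in doc_counts.items()}

# Hypothetical document counts in a 4-document collection.
doc_counts = {"the": 4, "algebra": 2, "algebraically": 1}
weights = idf_weights(doc_counts, n_docs=4)

for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

With a scheme like this, very common words drop out on their own, with no hand-maintained stop list.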

So the approach already taken in DEVONthink answers your question about a user-editable list of common stop words: no. There would be no obvious advantage, and there are many ways it could reduce effectiveness.

  2. The Concordance and AI routines such as Classify and See Also deal only with words in the content of documents. Other words that may appear in document names, comments and other metadata are not included or considered.

The pattern of word usage in a document “is what it is” and is objectively compared to that of other documents by the algorithms. I think that’s highly appropriate.

There have been comments from time to time that users would like to tilt the See Also routine in such a way as to force the outcome.

I think that would be inappropriate. I value See Also for its usefulness in guiding me to look at some ideas or facts that might not at first glance be obviously related to the document I’m reading.

If I’m reading an article about Aldous Huxley, I don’t want See Also merely to list a catalog of books and articles about Aldous Huxley. The appropriate way to do that is by tagging. Instead, I would hope that See Also would provide me interesting trails to articles about dystopian views of the future, foibles of the British upper class of his time, or perhaps an article on junk science in the field of vision correction. Indeed, See Also has done that for me.

Many years ago I spent several days with Huxley. He had always fascinated me, especially because of his gloomy view of science and technology as expressed in Brave New World. In preparation for his visit I tried to read Eyeless in Gaza, but found it impossibly boring. Nevertheless, he was a fascinating person. And eccentric. He gave a lecture on techniques for correcting poor eyesight so that optical correction would not be required. He had to hold his notes within a few inches of his thick eyeglasses to read them – which is to say that his theories of the physiology of vision were junk science. Meeting him was a memorable experience. He was very intelligent and had a sharp and questioning mind. He was intrigued by research in biochemical individuality by Roger Williams at the University of Texas and came there to learn about it. I had the good fortune to be assigned to discuss the research with him.