Related words: same file, different database, quite different response

dspady · June 29, 2023, 11:43pm

PROBLEM: I have an issue with related words. In one database, for one file, when I do a concordance and then click on a specific word (in the attached screenshot, screenshot 2023-06-29 at 5.23.25PM.png, the word is “wellbeing”), I get ONE related word. If I click on other files and look for the same related word, I get the same response (ONE related word). In another database, which is a small subset of the first, created for troubleshooting purposes, when I click on the SAME file and the same word “wellbeing”, I get a bunch of related words (shown in the other attached file, screenshot 2023-06-29 at 5.23.51PM.png). When I click on another file name and look for the related word, I get a similar pattern of related words.

QUESTION: Why is one database only giving me ONE related word whereas the other gives me quite a few (sensible) words? In both databases, if I look for relations for some other words, the system seems to respond appropriately.

OTHER INFO: The big database has been rebuilt and reindexed, with no change. The little database came from a sample of the files that make up the big database. Both databases use indexed files, i.e. the files are not imported into a DT database per se.
Thanks
Don Spady

cgrunenberg · June 30, 2023, 7:22am

The result depends not only on the document but also heavily on the contents of the database and its group structure, so it’s actually working as expected.

dspady · June 30, 2023, 2:31pm

I find the answer hard to accept. If you say this is an appropriate response, I just don’t understand the relevance of this metric, nor how it is determined. It makes no sense to me that a smaller database, containing ONLY files from the larger database, can have such a radically different set of related terms for the same word. How is this metric determined?

cgrunenberg · June 30, 2023, 2:44pm

Neither predefined data (such as dictionaries) nor pretrained data (such as machine-learning models) is used. But the concordance matters: e.g. the number of groups and the frequency & weight of words depend on the database.

dspady · June 30, 2023, 3:05pm

Well, that just tells me what is NOT done to determine related words, not what IS done. It seems to me that related words would suggest other words, and obviously links to other files (because the file names seem to change if one clicks on the other ‘related’ words). In this case, where there is only 1 word that is related, there is not much to click on, or to derive any insight from. So, what is the utility of the related word metric?
It seems to me that the ‘see also’ is far more useful.

cgrunenberg · July 1, 2023, 7:59am

Basically both the concordance (including frequencies & weights of words) and the context of the word in other documents in the same database are analyzed to figure out words which are similar/related in this database.

E.g. Hubble might be related to telescope, to Edwin, to constant or to something completely different depending on the documents & database.

The found words can be used e.g. for searching (by double-clicking) to find related documents or tagging (see contextual menu).
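
To illustrate why the same word can get different related words in different databases, here is a toy sketch in Python. It is not DEVONthink's actual algorithm, which this thread does not describe. It assumes a deliberately simple, hypothetical measure: two words count as related when the sets of documents containing them overlap strongly (Jaccard similarity at or above a made-up threshold). The function name `related_words`, the threshold, and the sample documents are all invented for the example.

```python
from collections import defaultdict


def related_words(docs, target, min_score=0.5):
    """Toy relatedness: words whose document sets overlap the target's.

    Assumption for illustration only: relatedness is the Jaccard similarity
    between the set of documents containing each word and the set of
    documents containing the target word.
    """
    # Map every word to the set of document indices it appears in.
    docs_with = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            docs_with[word].add(doc_id)

    target_docs = docs_with.get(target, set())
    if not target_docs:
        return []

    related = []
    for word, ids in docs_with.items():
        if word == target:
            continue
        # Shared documents divided by all documents containing either word.
        score = len(ids & target_docs) / len(ids | target_docs)
        if score >= min_score:
            related.append(word)
    return sorted(related)


# A small database, and a big one that contains every small-database file.
small = [
    "wellbeing health happiness mindfulness",
    "wellbeing health community mindfulness",
    "budget report",
]
big = small + [
    "health insurance budget",
    "health policy report",
    "happiness survey community",
    "community health budget",
]

print(related_words(small, "wellbeing"))
print(related_words(big, "wellbeing"))
```

In this sketch the small database reports four related words, while the big one reports only one. The extra documents use "health", "happiness" and "community" in contexts that have nothing to do with "wellbeing", so those words stop looking specific to it. Under this assumed measure, adding documents can shrink the list of related words even though every original file is still present.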

dspady · July 1, 2023, 3:30pm

Thanks for the explanation. However, I still do not understand why, when using a MUCH larger database, I get 1 related word link, but when using a quite small database made up of files from the larger database, I get 9 potential related word links. If anything, I would have expected the opposite. It just makes no sense, but if this rather bizarre result is correct, then so be it.
Thanks for your help.
Don