"Keywords" functionality... Rebuild?

Hello all,

The “keyword” tool almost never works this side.
I just about never use it - although, I probably would - if it provided me with something a bit more constructive.
It gets forgotten - and when I check it again, I get more of the below.

Below are screengrabs of what comes up.
This happens 90% of the time.

I initially assumed it was a problem with the OCR of the PDF – but as mentioned, it happens almost always, despite none of the other aspects relying on the OCR layer being affected (i.e. searches for words/phrases identify and find them inside the open PDF; the “text-select” script into an Annotation RTF works etc.)…

Any ideas about what could cause this?
Anyone else suffer from a similar fate?
Can something be ‘rebuilt’/‘reset’ to have better results generated?

Hoping someone has some suggestions…

DTPO Keyword.png

DTPO Keyword2_thumb.jpg

Hmm… Maybe I’m missing something about what “Keywords” is intended for?

See the three below.

Keyword “Carby” selected - this meant nothing to me (initially).
7/8/9 seemingly random files suggested.
Hit the 2nd one, for no good reason – and “Carby” thrown up as the selection… :open_mouth:

I. Am. Confused.

[Fortunately, I’m used to the feeling.] :laughing:

DPTO KW Carby.png
DPTO KW Carby_Popup.png

The “Keywords” button seems to be performing properly in your examples. This procedure examines the text content of the selected document and lists possibly interesting words, excluding common words such as “and”, “the”, etc. This has nothing to do with keywords that you may have assigned in Document Properties or in the Info panel.

The “Similar Words” button in the full Search window looks at the text content of all the results and lists possibly interesting terms (leaving out, of course, common terms such as “and”, “the”, etc.). It can also help identify alternative spellings of a term, even typos in documents.

I sometimes find such a list useful to help me identify alternative terms for searching my document collections. Once in a while it gets me onto a track I hadn’t thought of and that’s useful.

For example, when viewing an article about Alistair Cooke, I pressed the Keywords button and came across “Agrarian” in the list. I had been thinking about stages of our economy in the past and environmental issues at various times. Clicking on “Agrarian” got me started on some references related to that project.

Those are elements of what I call DEVONthink’s “rich environment” to help me make use of information in my document collections.

With respect … “interesting” to who? Or, to “what”? I’m with @Cassady. I rarely use the Keywords feature because it suggests nothing of interest, and most frequently nothing sensical. An example similar to @Cassady’s above … 3 of the top 9 keywords are article reference IDs from a PNAS selection (PDF). They occur only once in the article (the bibliography), and not at all in any other articles in the database. None of that is interesting to this human, though the machine obviously loves this stuff :confused:

BTW, all 9 of the top keywords in this example occur in the first 200 words (the corrections section) of an 8000 word article. The really interesting stuff in the article is in the other 7800 words.
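
One possible explanation for one-off tokens (such as reference IDs) floating to the top, sketched under a loud assumption: the thread does not say how DEVONthink scores keywords, so the code below uses a generic TF-IDF style weighting purely as an illustration. The stop-word list, function names and sample texts are all hypothetical. Under this kind of weighting, a token that appears in only one document of the collection gets a high rarity score even if it occurs just once.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not DEVONthink's list).
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text, split into alphanumeric tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(doc, corpus, n=3):
    """Rank the words of doc by term frequency times inverse document frequency."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for word, count in tf.items():
        df = sum(1 for s in doc_sets if word in s)  # documents containing the word
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed rarity weight
        scores[word] = count * idf
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


corpus = [
    "the dingo and the predator e1234",
    "the dingo hunts the predator",
    "a dingo is a predator in the outback",
]
print(top_keywords(corpus[0], corpus, n=1))  # the one-off ID ranks first
```

If the real scoring resembles this at all, it would also explain why bibliography entries and page-border debris rank so well: they are rare across the database, and rarity is exactly what this kind of weighting rewards.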

Hello Bill!

Thanks for the reply. You’ve confirmed my understanding of the “keywords” vs “similar words” functionality, but I’m still not ‘seeing’ it. :neutral_face:

In the screen-grabs I popped up, simply retyping the words listed - gives me the following:

Uthne; Eulaw; Frsa; Flume; Niiic; Mays; Laff; Tnhe; Borto; Narne.

Harld; Aeven; Athhe; Nexpl; Necv; Vtion; Orthy; Ohome; Uyld; Banon; Epcas; Ifilr; Cmost; Ocf; Dhoss; Alsai; Fayct; Escap; Gare; etc etc…

Hmp; Kcl; Carby [I now know that this was in reference to a Mr. Carby-Hall - but no way I would have known that before hitting the selection option]; Cro; Rife; Annul; Poa; Tucc; Tum; Issn [presumably a reference to the book id?].

With the exception of “Issn”, not a single one of the words listed above (or those remaining that I never re-typed) makes any sense. Not in the least. As @korm mentioned, ‘nothing sensical’.
So it’s pretty pointless to me as a tool - since I end up randomly clicking on the suggestions, not having the faintest clue what any of them mean, or are possibly referring to.

If it’s supposed to work like that - well, then that’s one thing.

But if it’s supposed to throw out words that actually mean something - as in ‘grammatically’ (in other words, grab words from within the text?), then surely that’s something completely different, since it does not appear to be working particularly well…

The alternative scenario(??) is that the “keywords” function is better suited to a specific type of document - i.e. RTFs below/above a certain length/number of words?

If the latter- guidance would be appreciated.
But if the former - is this something that can be/needs to be addressed?

That being said - I’ve tried using it in a fairly straightforward RTF, containing both my comments and using ‘text’ extracted from the PDF, with the DTPO script. The paragraph at the bottom ends three lines below where the screen is cut-off, the rest is all in.

I at least start seeing more “understandable” words, but the majority (in PINK) are still being pulled from very obscure elements, like the journal description in the heading (Iss-Issue; Rel-Relations; Ind-Industrial), or the standard template descriptors in the annotation sub-heading (Click; Apr; Annotator name).

Now I realise that there is only so much that can be done by an algorithm, and I’m not expecting/hoping for miracles, but I get frustrated at not being able to use a potentially powerful tool.

If we could get a bit more clarification from the developer, or other users that use it effectively(??), as to what the “ideal environment” would be to run the tool on - that would no doubt be of great use!

[Edit: cleaned up Screengrab]

“Interesting to whom” and “the ideal environment” of the Keyword button vary with the content of the selected article. I’ll give two examples of the results for documents in my databases.

  1. For a document named “ECOLOGY A Pardon for the Dingo” (Science, 10 January 2014: Vol. 343 no. 6167 pp. 142-143).

I found that list of words quite useful. I wouldn’t have had to type “Canis” to search for the “family” to which dingos belong, or the first term in a search for “Marsupial Australia” for the “family” of critters threatened by the appearance of dingos in their ecology. I could try a search for “Tasmanian Carnivores”. Or “Marsupial Predation”.

  2. For a document named Assessing Contractor Use in Superfund, January, 1989, Office of Technology Assessment, Congress of the United States, I was offended by the list of words proffered by Keywords. :imp:

That document resulted from a Google Scholar search. The report itself isn’t included, just bibliographic information, table of contents and a list of the panel members. The document contains 183 unique words, one of which is my last name, as I was a panel member. But Keywords didn’t list my name, or that of the panel chairman, David Marks. But colleagues such as Bill Librizzi and Bob Pojasek were listed.

Bah, humbug—sometimes Keyword lists are not at all useful! :frowning:

But in this case the Similar Words button for the entire list of search results was useful. One of my notes about this report had a typo. I had spelled Superfund as “Supcrfund” and was able to catch that spelling error.
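
The typo-catching effect described above can be approximated with ordinary fuzzy string matching. This is only a minimal sketch using Python's standard `difflib`; it is not how the Similar Words button is implemented (the thread does not say), and the function name, cutoff and word list are made up for illustration.

```python
import difflib


def spelling_variants(term, vocabulary, cutoff=0.8):
    """Return words in vocabulary that closely resemble term, excluding term itself."""
    term = term.lower()
    # Deduplicate, lowercase, and remove the exact term so only variants remain.
    candidates = sorted({w.lower() for w in vocabulary} - {term})
    return difflib.get_close_matches(term, candidates, n=5, cutoff=cutoff)


# Hypothetical list of unique words gathered from a set of search results.
words = ["Supcrfund", "Superfund", "contractor", "assessment"]
print(spelling_variants("Superfund", words))
```

Lowering `cutoff` surfaces more distant variants at the cost of more false matches, which is the usual trade-off with this kind of comparison.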

@korm: Yes, the Keywords procedure won’t scan all the content of a book-length document. And only a small number of “keywords” will be listed. Steven Johnson’s approach, having paid research assistants create notes that are small chunks of text not exceeding about 500 words, would suit Keywords better. So does my approach of using Annotation notes about documents. Don’t forget that the Words button will list all of the unique words in a selected document, sorted alphabetically or by frequency, length or weight.
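
For anyone who wants to see what such a word list amounts to, here is a rough sketch of a concordance with three of the four sort orders mentioned. It is an illustration only, with hypothetical names throughout; the "weight" ordering is left out on purpose because its definition is not given in this thread.

```python
import re
from collections import Counter


def concordance(text, sort_by="frequency"):
    """List (word, count) pairs for the unique words in text, in the chosen order."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if sort_by == "alphabetical":
        key = lambda item: item[0]
    elif sort_by == "frequency":
        key = lambda item: (-item[1], item[0])  # most frequent first
    elif sort_by == "length":
        key = lambda item: (-len(item[0]), item[0])  # longest first
    else:
        raise ValueError(f"unsupported sort order: {sort_by}")
    return sorted(counts.items(), key=key)


sample = "the dingo and the marsupial"
print(concordance(sample))
print(concordance(sample, sort_by="length"))
```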

Thanks for the comprehensive reply, Bill.

I had a closer look at things my side. It appears that the gibberish I was seeing, is more-often-than-not either incomplete sections of legitimate words [i.e. ‘Cro’, instead of “Crotum”], or otherwise is just a really obscure word that is to be found, usually along the borders of my pdf’s… :wink:

But what does happen often - enough for it to be noticeable - is that the word(s) that appear in the drop-down list when the “Keywords” option/icon is selected often have little to no bearing on the terms identified in the various PDFs provided in the “see-also” drawer…

For instance “Tum” yields “turn”.
“Thze” yields “the”…

If it is an OCR issue - can that explain why the ‘source’-term says one thing, but the “see-also” terms are all similar, and something different?

Lastly, just to clarify – in the Concordance drawer (since it was being spoken of earlier), what is to be understood by the term “Weight”?

Is that the ‘weight/significance’ attached to the word by the algorithm?

Concordance Window.png