Keyword gibberish, am I holding it wrong?

scottlougheed · April 16, 2015, 1:36pm

Hi everyone,
I have noticed that the keyword dropdown consistently offers mostly nonsensical gibberish, even when viewing well-OCR’d PDFs or modern academic journal articles (which are not OCR’d because they are inherently text, not images).
Here are some examples: one from an unencrypted modern eBook, another from a well-OCR’d scan of text, and one from a modern journal article:
eBook:

OCR:

Modern journal article:

In many cases the keyword is a bunch of words jammed together. In other cases the keyword is not actually anything that can be found in the document. Most of all, for the majority of the keywords, I question the extent to which they are truly “key” even when the word is a real word that actually exists in the text. Is “backdraft” a key word? Are random author names pulled from the bibliography key?

Are others experiencing this behaviour with Keywords? How did you overcome this issue (if at all)? How do you use the keywords feature in your workflow?

Quite possibly the keyword feature is simply not intended to be used with journal articles and scholarly books, since there often is a lot of strange stuff like unusual author names or repeated headers that might give an AI the impression of significance when there is actually none.

Cheers,
S

cgrunenberg · April 16, 2015, 2:17pm

It seems that some words were concatenated while converting the PDF documents to text for indexing. You might try rebuilding the database; this may fix the issue if the PDF documents are fully compatible with the latest releases of Mac OS X.

In the case of eBooks, a third-party Spotlight mdimporter might have returned garbage. Could you please send the eBook to cgrunenberg - at - devon-technologies.com? And which Spotlight mdimporter for eBooks do you use? Thanks!

scottlougheed · April 16, 2015, 2:36pm

Hello Christian,
For the benefit of other forum users who may have this problem, I am posting (most of) the text from the email I sent you, to which several sample documents were attached.
…
Thanks for following up with me about my keyword issues. I have sent you one of the offending eBooks and a journal article from 2015 that exhibited the same issue. I rebuilt the database and the keyword list looks identical to how it looked before rebuilding.

I am indexing these rather than importing them into DTPO. As far as an mdimporter goes, it is either the implementation of mdimporter used by DTPO or Spotlight’s own implementation. I have done nothing funky with Spotlight or anything like that (this is also slightly outside my knowledge, so I could just be ignorant about this!).

Cheers,
Scott

cgrunenberg · April 16, 2015, 2:57pm

The documents are PDF documents, and it’s an issue with Mac OS X’s PDFKit framework: in one case the conversion to plain text creates concatenated words. This can also be reproduced using e.g. Preview.app.
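
For anyone who wants to check their own extracted text for this kind of damage, here is a minimal sketch. It is not anything DEVONthink or PDFKit does; the function name `split_run_together` and the tiny vocabulary are made up for illustration. It flags a token that is not itself a known word but splits cleanly into two known words:

```python
def split_run_together(token, vocabulary, min_len=3):
    """Return (left, right) if `token` looks like two vocabulary words
    joined together, otherwise None.

    Hypothetical helper for illustration only; `vocabulary` is any set of
    lowercase words you trust (a real check would use a full word list).
    """
    token = token.lower()
    if token in vocabulary:
        # A known word on its own is not treated as a run-together.
        return None
    # Try every split point that leaves at least `min_len` letters per side.
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in vocabulary and right in vocabulary:
            return left, right
    return None


if __name__ == "__main__":
    words = {"departmental", "autonomy", "budget"}
    for candidate in ["departmentalautonomy", "budget", "xyzzy"]:
        print(candidate, "->", split_run_together(candidate, words))
```

A check like this only tells you that the text layer is damaged. It does not repair the PDF, and a small vocabulary will miss most cases.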

Bill_DeVille · April 16, 2015, 3:09pm

In the case of OCRed scans, the algorithm must work with the text as it is, errors included. No OCR software is perfect – the problems of text recognition from images are tricky – but generally the better the paper copy (no blemishes, no weird/tiny fonts) and the optics and lighting of the scanner, the better the results.

One of those run-together word combinations seems almost a stroke of genius. Especially in the context of ever-tighter budgets, university administrators find issues of departmental autonomy difficult to manage. Combining that into a single term, departmentalautonomy, might be a useful (if somewhat Teutonic) addition to the language.

In the case of journal articles captured from the Web as PDF (from most sources) or rich text (of HTML pages – my own preferred capture mode), the text of the captured document should be that of the original source – no run-togethers or other glitches. The algorithm to generate “keywords” deals with the text of a document. It isn’t trained in the discipline or its jargon. It simply evaluates the uniqueness and frequency of the terms used. Some terms listed as keywords may be useful to the user, others not. Nevertheless, I often do find such lists useful.
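
To make “uniqueness and frequency” concrete, here is a rough sketch using a plain TF-IDF score. This is an assumption for illustration, not DEVONthink’s actual algorithm, and the function names and sample texts are invented. It shows why a run-together token can float to the top: it occurs in one document and nowhere else, so it looks very “unique”.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_scores(document, corpus):
    """Score each term in `document` by its frequency in the document
    multiplied by its rarity across `corpus` (a plain TF-IDF heuristic)."""
    doc_counts = Counter(tokenize(document))
    total = sum(doc_counts.values())
    # Count how many corpus documents contain each term at least once.
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    scores = {}
    for term, count in doc_counts.items():
        tf = count / total
        # Smoothed inverse document frequency; rarer terms score higher.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf
    return scores


def top_keywords(document, corpus, n=5):
    """Return the `n` highest-scoring terms, ties broken alphabetically."""
    scores = keyword_scores(document, corpus)
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


if __name__ == "__main__":
    corpus = [
        "the budget of the university",
        "the department and the administrators",
        "the issue of departmentalautonomy is the departmentalautonomy problem",
    ]
    print(top_keywords(corpus[2], corpus, n=3))
```

A scorer of this kind has no notion of whether a token is a real word or belongs to the discipline’s vocabulary, so garbage in the text layer comes out as confident-looking keywords.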

scottlougheed · April 16, 2015, 3:18pm

Thanks Christian, for the help here and via email.
Bill, thanks for your additional insight and your departmentalautonomy observation!

Cassady · April 16, 2015, 6:11pm

Sadly I too suffer from this affliction, in many cases.

Not DTPO’s fault - gremlins picked up in the done-long-ago OCR process (when they still used elves and unicorn tears) - and even older journal articles with fonts predating the printing press.

One day, when I have minions working for me, I shall seek to ferret them out and try and replace with improved versions (the articles, not the minions).

FROBGOBLIN · April 16, 2015, 10:54pm

I think the OCR stuff is generally “good enough” and I haven’t seen any improvements in the software I use (Adobe Pro) for many years now. It’s a lot better than it used to be ten or fifteen years ago, but nowhere near perfect. If only we had unlimited budgets to hire unlimited numbers of people to fix it all! Amazon’s Mechanical Turk?

By the way, the software Google is using for Google Books seems better than what I have. I don’t know for sure about that, but there might be some proprietary stuff we don’t have access to yet that already solved this problem.