I am using the concordance tool on web archives. The web archives contain user names, jargon, etc. that are not real words. Is there a way to omit these words from the concordance, for example by checking them against a dictionary and removing them from the concordance word list?
This is not yet possible, but we’re considering adding a way to exclude words in future releases.
Has there been any progress on this? In text mining, it is standard to eliminate a variety of so-called “stop” words, which are words so common that they are useless. It looks like DTP uses the “weight” of a word - I presume for the same purpose.
So is there a way to screen out words below a certain weight, analogous to how we screen out words of a certain length?
There’s no such possibility right now.
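Until something like this exists in the application, the filtering asked about above can be approximated outside it on exported text. The following is only a minimal sketch: the stop word list, the function name `filter_word_list`, and the use of a raw occurrence count as a stand-in for "weight" are all assumptions for illustration and do not reflect how DTP actually computes or exposes weights.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def filter_word_list(text, stop_words=STOP_WORDS, min_count=2, min_length=3):
    """Count words in text, then drop stop words, rare words, and short words.

    min_count is a raw-frequency stand-in for a "weight" threshold (assumption).
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {
        word: n
        for word, n in counts.items()
        if word not in stop_words and n >= min_count and len(word) >= min_length
    }

if __name__ == "__main__":
    sample = "the cat and the dog and the cat"
    print(filter_word_list(sample))
```

Raising `min_count` or `min_length` makes the filter stricter; the stop word set can be replaced with any list that suits the material.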
I just wanted to add my interest in seeing a feature like this.
I am indexing PDF files of technical journals: recent ones published directly to PDF, and old ones scanned and converted to text.
Noise words are too dominant compared to the technical terms, and scanned documents can unfortunately generate gibberish words as well.
I wondered whether, in addition to maintaining noise words as a separate resource, there could be a spell check to eliminate the more obvious rubbish caused by OCR working with difficult scans.
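As a rough illustration of the dictionary check suggested in this thread, here is a minimal sketch that splits a word list into known and unknown words. It is not a feature of the application; the function names `load_dictionary` and `split_by_dictionary` are hypothetical, and the word list file is assumed to be a plain text file with one word per line that you supply yourself.

```python
def load_dictionary(path):
    """Load a one-word-per-line text file into a lowercase set (file is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def split_by_dictionary(words, dictionary):
    """Return (known, unknown): words found in the dictionary and words that are not.

    The comparison is case-insensitive; the original spelling is kept in the output.
    """
    known, unknown = [], []
    for word in words:
        if word.lower() in dictionary:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown

if __name__ == "__main__":
    # Small inline dictionary for demonstration instead of a file.
    dictionary = {"valve", "pressure", "the"}
    known, unknown = split_by_dictionary(["Valve", "xq7rn", "pressure", "tlie"], dictionary)
    print("keep:", known)
    print("review or drop:", unknown)
```

A plain lookup like this would also reject legitimate technical terms that are missing from a general dictionary, so for journal material the unknown list is probably better reviewed than deleted outright, or the dictionary extended with domain vocabulary.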