Word Stemming in DT3?

kappabear · March 18, 2021, 6:48pm

I know it’s been asked before, and the answer has always been “no”, but does DT3x support “word stemming”? For instance, searching for the word “swim~” and having DT3 search for “swim, swimming, swam, swum”? I know that I can use a wildcard for “swim*”, and get “swim, swimming, swimsuit, swimmer, etc”, but not the other various forms of the word swim.

I’ve searched the manual for the word “stemming” and not found what I was looking for, and hope that it’s there, but referred to as something different.

chrillek · March 18, 2021, 7:23pm

The stem being “sw”? Sorry, bad joke. Frankly, I don’t see that coming. It’s different for all languages, and not even possible for all languages (think Asian languages). I’m fairly certain that it’s not there (certainly not for my native language). And I also think it is a bit out of the range of DT.
Just think “swamp” and “swan”, both dangerously close to “swam”: you’d need a whole English/American dictionary, and that is only 1.5 languages. Then there are one or two others … One of them is very versatile with prefixes and suffixes: we have arbeiten, bearbeiten, verarbeiten, umarbeiten, durcharbeiten… Should they be found if someone is looking for “arbeit~” or for “~arbeit”?

brookter · March 18, 2021, 7:23pm

For your particular example, then wouldn’t sw?m* work?

Is that any help?

chrillek · March 18, 2021, 7:27pm

Apart from the fact that swimming in a swamp is pushing it a bit … I doubt that this is what the OP is looking for. I suppose they also want “go~” to find “goes”, “going”, “went” and “gone” or “sit~” to match “sitting”, “sits”, “sat” (if the latter even exists, my irregular verbs are a bit rusty). In these cases, wildcards don’t get you very far.
It might even be challenging to get that working at all for English words: “book” → “(she) books” vs “the books”, “voice” → “she voices” vs. “the voices”. There’s probably a reason for linguistics being a domain of AI.

brookter · March 18, 2021, 7:40pm

Well, obviously… if the request is for lexical lookup then clearly the answer is no.

However, the OP says that they can’t use wildcards to find ‘swim’ words where the third letter changes. All I’ve done is point out that you can do that, and shown how. I could have added that you can limit the third-letter search to ‘a’, ‘i’ and ‘u’: sw[aiu]m*.
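
A rough way to see what these patterns catch, sketched in Python. The standard-library `fnmatch` module is used here only as a stand-in for shell-style wildcards; it is an assumption that DT3's wildcard matching behaves the same way, and the word list is made up for illustration.

```python
from fnmatch import fnmatchcase

# Made-up sample vocabulary, including the near-misses mentioned above.
WORDS = ["swim", "swam", "swum", "swimming", "swimsuit", "swamp", "swan", "swarm"]

def matches(pattern, words=WORDS):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(matches("swim*"))      # prefix wildcard only
print(matches("sw?m*"))      # any single third letter
print(matches("sw[aiu]m*"))  # third letter limited to a, i or u
```

With this list, “swamp” still slips through both the `?` and the `[aiu]` pattern, while “swan” and “swarm” do not.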

kappabear · March 18, 2021, 7:47pm

My background is in digital forensics and eDiscovery, and I used stemming all the time in those searches. I would love to see that ability in DT3, if it doesn’t already exist.

Using a wildcard wouldn’t work for irregular verbs like “to be”, where I’d want “be, been, was, were”, or “lie”, where I’d want “laid”, “lay”, and “lain”.

brookter · March 18, 2021, 7:56pm

Fair enough!

cgrunenberg · March 19, 2021, 7:45am

For the “swim” example, the query sw[iau]m* should deliver good results.

For the “lie” example, the OR operator would be necessary, e.g. lai[dn] | lay
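
To illustrate how the alternatives combine, here is a small Python sketch. It emulates the query with `fnmatch` and a split on `|`, which is an assumption about the semantics and not how DT3 evaluates it internally; `matches_query` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def matches_query(word, query):
    """True if the word matches any '|'-separated wildcard alternative."""
    alternatives = [part.strip() for part in query.split("|")]
    return any(fnmatchcase(word, alt) for alt in alternatives)

# Made-up sample words; "lie" and "lair" should not be hits.
forms = ["lie", "lay", "laid", "lain", "lair"]
hits = [w for w in forms if matches_query(w, "lai[dn] | lay")]
print(hits)
```

The base form “lie” itself is not covered by this query, so it would need its own alternative.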

chrillek · March 19, 2021, 7:54am

I think that the OP does not really want to figure out the appropriate wildcard combinations for every case. They’re probably more after a generic solution: throw the base form (infinitive) of a verb at the search, get back all documents containing the different forms this verb can appear in.
NB finding an expression for “be” is probably futile, anyway.

cgrunenberg · March 19, 2021, 7:57am

Maybe. But my examples were just meant to show how to work around it as well as possible. Makes me wonder: is there any software or tool (CLI) on the Mac that can stem text? Then maybe a script could be used to stem the text of items and add the stemmed text as a comment or custom metadata.
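
As a rough idea of what such a script might compute, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter algorithm or any real stemming library, `naive_stem` and `stemmed_text` are made-up names, and how the result would be written to a comment or custom metadata field is left out.

```python
import re

# Suffixes tried in order; a toy list for English only.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix; irregular forms are left untouched."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "swimm" -> "swim".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

def stemmed_text(text):
    """Return the unique stems of a text as one sorted, space-separated string."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(sorted({naive_stem(t) for t in tokens}))

print(stemmed_text("She swims; he was swimming. They swam."))
```

“swims” and “swimming” collapse to the same stem here, but “swam” stays separate, which is exactly the irregular-verb gap discussed in this thread.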

chrillek · March 19, 2021, 8:10am

Maybe in the context of spelling tools (aspell). However, this is more a domain of NLP (natural language processing) tools, I think. Python-istas might know more about that.

BLUEFROG · March 19, 2021, 5:27pm

And do you have software that accurately processes the strings, including irregular verbs?

kappabear · March 19, 2021, 6:21pm

Hey Jim!

No, I don’t personally own any software that does stemming, as I don’t often have the need. However, I’ve used TONS at work over several decades that definitely stem anything you throw at them, including irregular verbs. But these were very expensive eDiscovery & digital forensics apps that run on servers purpose-built for quickly finding, culling, and filtering words in terabytes of data. There are also some smaller desktop apps like DTSearch that do stemming, but they probably won’t handle irregular verbs.

Cheers my friend!

BLUEFROG · March 19, 2021, 6:28pm

Thanks for the clarification. There is some stemming and lemmatization code I found for Python, but I can’t speak to its usefulness. And almost surely it’s not going to pick up on irregular verbs.

Interesting suggestion though.

Cheers to you as well!

kappabear · March 19, 2021, 6:37pm

I would think that it wouldn’t be hard to create lookup tables for the various irregular verbs, and stem using those. And yes, lemmatization is even better than stemming.
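
A minimal sketch of that lookup-table idea in Python. The table is tiny and hand-written for illustration, the names `IRREGULAR_FORMS`, `expand` and `or_query` are hypothetical, and the `|` separator simply follows the OR syntax shown earlier in this thread.

```python
# Hand-written toy table: base form -> its other forms.
IRREGULAR_FORMS = {
    "be": ["am", "is", "are", "was", "were", "been", "being"],
    "swim": ["swam", "swum", "swimming"],
    "go": ["goes", "went", "gone", "going"],
}

# Reverse index so any form leads back to its base form.
LEMMA_OF = {
    form: lemma
    for lemma, forms in IRREGULAR_FORMS.items()
    for form in forms
}

def expand(term):
    """Return the base form plus all known forms for a search term."""
    lemma = LEMMA_OF.get(term.lower(), term.lower())
    return [lemma] + IRREGULAR_FORMS.get(lemma, [])

def or_query(term):
    """Join all forms into a single OR query string."""
    return " | ".join(expand(term))

print(or_query("swim"))
print(or_query("went"))
```

Words missing from the table fall through unchanged, so a real table would still need to be combined with ordinary stemming or wildcards for regular verbs.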

andrzejm007 · March 25, 2021, 7:40pm

Don’t know of its applicability to DEVONthink, but Elasticsearch is a technology that does magic when it comes to searching, including stemming: Stemming | Elasticsearch Reference [7.12] | Elastic

papy · July 28, 2021, 11:34am

Sorry for reviving this thread, but Apple provides a “Natural Language” framework which may be interesting for future Devon search features.
I believe that is the framework the PDF Search app uses, and it works well for this “stemming” issue (at least in English and French) which is very useful.