Handling of non-indexed characters and word connectors

Hi, I’m working on a script that uses a toolbar query to do a regex search inside the results and have some questions. I know that DEVONthink only indexes alphanumeric characters (and some more) but am not sure:

  • Are $€£¥%§ the only additional characters that are indexed?

  • Are non-indexed characters handled as either a space or a ? wildcard?

  • Are characters that are used in DEVONthink’s search syntax -:!?."()[]*/&^+<=>|~ also handled this way if they are used as part of a search term?

  • Are double quotes the only type of quotes that can be used as part of the search syntax?

  • Which characters are regarded as word connectors? I know of - _ . / but are there more? If so, do they follow any standard, e.g. an NSCharacterSet?

Yes.

They’re not indexed at all, so I’m not sure what should actually be handled.

These characters aren’t indexed either, they’re just part of the search syntax.

Yes, see documentation.

The index doesn’t use such things.

If I search for e.g. ABC@2020-12-11, then DEVONthink uses some kind of replacement for characters that are not indexed, in this case @ and -. How is @ replaced? Does it match any character, including a space?

What characters does DEVONthink use to recognize words as connected? Is the list - _ . / complete? I can’t test for all possibilities, I think.

Edit: I think I’ve mixed up regex and DEVONthink searches. DEVONthink does not care about “word connectors” at all but ignores them, right?

(Background is that I don’t want a regex search for the query as entered in the search field but a regex search that yields the exact same results as a DEVONthink search would, so I have to replace e.g. @ with the same logic as DEVONthink does.)

It’s like searching for ABC 2020 12 11 or ABC/2020:12;11. All non-indexed characters are handled the same way, no matter whether they are white space, separators, etc.
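To illustrate the point above, here is a minimal Python sketch of that equivalence. It assumes, based only on this thread, that the indexed characters are letters, digits and $€£¥%§; the names EXTRA_INDEXED, NON_INDEXED and normalize are made up for the example and are not part of DEVONthink.

```python
import re

# Assumption from this thread: besides letters and digits, only these
# symbols are indexed. Adjust the set if your own tests show otherwise.
EXTRA_INDEXED = "$€£¥%§"

# One or more characters that are not indexed. Python's \w also covers "_",
# which is a separator here, so it is listed as a separate alternative.
NON_INDEXED = re.compile(r"(?:[^\w" + re.escape(EXTRA_INDEXED) + r"]|_)+")


def normalize(term: str) -> str:
    """Replace every run of non-indexed characters with a single space."""
    return NON_INDEXED.sub(" ", term).strip()


# Terms that differ only in their non-indexed characters normalize identically.
same = normalize("ABC@2020-12-11") == normalize("ABC/2020:12;11")
```

With this, ABC@2020-12-11, ABC/2020:12;11 and ABC 2020 12 11 all end up as the same string, which is the behaviour described above.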

I’m still not sure what this is supposed to mean; the index doesn’t connect words.

Got it. I completely mixed up where the query could be sanitized. The whole confusion came from the fact that DEVONthink handles

Words linked by non-white separators (e.g., page-index or page_id) are treated like phrases put into “quotes”.

and me trying to rebuild this behaviour. Instead of trying to sanitize the whole query while leaving out “word connectors”, I’ll have to sanitize each word, group of connected words or quoted phrase after it has already been recognized as such in the script. That’s obvious now but caused some serious confusion.
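In case it helps someone else, a rough Python sketch of that order of operations: split the query into tokens first, then sanitize each token on its own. Everything here is an assumption for illustration (the indexed character set is taken from this thread, and the names SEP, TOKEN, tokenize, token_to_regex, query_to_patterns and matches are hypothetical). It ignores operators and wildcards, does not enforce whole-word matching, and assumes case-insensitive matching.

```python
import re

# Assumption from this thread: letters, digits and $€£¥%§ are indexed;
# any run of other characters (including "_") only separates words.
SEP = r"(?:[^\w$€£¥%§]|_)+"

# A token is either a "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def tokenize(query: str) -> list[str]:
    """Return quoted phrases (without the quotes) and bare tokens, in order."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in TOKEN.finditer(query)]


def token_to_regex(token: str) -> str:
    """Sanitize one token: its indexed parts must appear in order, separated
    by any run of non-indexed characters. Connected words such as page-index
    therefore behave like the quoted phrase "page index"."""
    parts = [p for p in re.split(SEP, token) if p]
    return SEP.join(re.escape(p) for p in parts)


def query_to_patterns(query: str) -> list[str]:
    """One regex per token; tokens without any indexed character are dropped."""
    patterns = (token_to_regex(t) for t in tokenize(query))
    return [p for p in patterns if p]


def matches(query: str, text: str) -> bool:
    """True if every token pattern is found somewhere in the text."""
    patterns = query_to_patterns(query)
    return bool(patterns) and all(
        re.search(p, text, re.IGNORECASE) for p in patterns)
```

For example, matches('page-index "foo bar" baz', text) requires "page" followed by "index", "foo" followed by "bar", and "baz" to each occur in text, with any non-indexed characters allowed between the parts of a phrase.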

Yes, I mixed this up too. DEVONthink doesn’t recognize connected words, but I need to recognize them in the script to treat them as if they were quoted. For the now obviously wrong approach, I thought I needed a way to exclude “word connectors” when sanitizing the query.

Thanks a lot and sorry for the confusion!