Hi,
I’ve been wondering how you work out document similarity. For a German-language text I find mostly related German-language documents (and English finds English), so this looks more like classical text indexing than vectorized similarity (which might work across languages). Have you thought about adding an option to index content into a vector store (of one’s choice)? From what I understand, a smart combination of text-based and vector-based search might deliver richer and more relevant results.
Axel
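To make the hybrid idea concrete, here is a minimal sketch in plain Python. The scoring functions are toy stand-ins of my own (a real system would use something like BM25 for the lexical part and a trained embedding model for the vectors); the point is only how the two scores could be blended:

```python
import math
from collections import Counter  # not strictly needed here; handy if you extend to term frequencies


def keyword_score(query: str, doc: str) -> float:
    # Classical text match (toy version): fraction of query terms present in the document.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(u, v) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_score(query: str, doc: str, q_vec, d_vec, alpha: float = 0.5) -> float:
    # Weighted blend: alpha controls how much lexical vs. semantic similarity counts.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)
```

The interesting case is when `keyword_score` is zero (e.g. a German query against an English document) but the embedding vectors are still close, so the hybrid score stays above zero.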
DT4 - inner workings of indexing / see also / similar documents - thought about using vectorization?
You need to clarify what you’re referring to.
this looks more like classical text indexing than vectorized similarity (which might work across languages).
What looks like it?
Have you thought about adding an option to index content into a vector store (of one’s choice)?
We have thought about a great many things, but that doesn’t mean all of them are feasible at this time, for whatever reasons there may be.
I would also recommend you spend time in the new Help, especially the Getting Started > AI Explained section. DEVONthink is not suddenly “an AI application”. Access to external AI engines has been added as a technology complementary to our own internal (and still improving) AI.
Hi,
Don’t get me wrong - it’s a cool release. I’m kicking the tires immediately.
I’m referring to “See Also”, the “Documents” list, and the graph in the AI section - how do you determine similarity and ranking there? As I am not a computational linguist I might be off, but based on my recent reading on RAG and on improving search, vectorizing text seems to be a very powerful addition: it allows related concepts to be identified, to some extent, without necessarily matching character patterns.
See Also and Classify are driven by DEVONthink’s internal AI. There is no RAG or embedding going on in it.
Another thing mentioned in the Help…
We’ve considered this and it’s still an option for future releases. But in the end this also has disadvantages, e.g. increased disk space usage and slower indexing (a lot slower).
I have compared DT4 AI performance with numerous “Chat with your PDF” apps which use a vector database and embeddings. I understand why they are needed to search databases of petabyte scale and beyond.
But for querying a specific PDF or set of PDFs as is likely to occur in DT4, I believe the DT4 approach without a vector database is considerably more accurate.
For tasks too large to run through an LLM in DT4, such as finding a particular document in a gigabyte-sized database, the existing DT3/DT4 Boolean toolbar search still works lightning fast. I suspect that for most searches, the benefits of a semantic search through one’s entire DT4 database would not be that large. Yes, a semantic search could find synonyms in such a situation, but, as you say, at a big cost in performance.
Thanks to both of you for sharing your insights.