At the moment I am experimenting with automation and AI. To let the AI access my DT files (mostly PDFs), I am thinking of the following scenario:
- Whenever a file is added to DT, a script (via smart rule?) is triggered that exports a text version of the file to a specified folder.
- The filename should be the item link ("Verweis"; sorry, I'm using DT in German) so the file can later be referenced via x-devonthink-item://UUID
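If the exported text files are named after the item UUID, mapping them back to DEVONthink links is trivial. A minimal Python sketch to illustrate (the function name and export path are hypothetical):

```python
from pathlib import Path

def item_link_from_export(path: str) -> str:
    """Derive the DEVONthink item link from an exported file's name.

    Assumes the file was exported as <UUID>.txt, so e.g.
    '1B7F3C62-XYZ.txt' maps back to 'x-devonthink-item://1B7F3C62-XYZ'.
    """
    uuid = Path(path).stem  # filename without the extension
    return f"x-devonthink-item://{uuid}"
```

Stored as metadata alongside each vector, this link lets the AI's answers cite the original record directly.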
Does anyone see a chance to achieve this? Integrated or local AIs are all well and good, but I want to automate certain things with make.com and so need to build a vector DB in the cloud.
Any help or ideas would be greatly appreciated. Has anyone used DT with make.com or Zapier already?
Depending on the size of the files, you may need to consider ways to break up the text into sections before storing and vectorising them.
The size of the chunks, the amount of overlap between chunks, and strategies for splitting text across boundaries are also things to look at.
The advice I’ve seen is to go for 10% overlap between chunks of text, and to use semantic chunking to decide where a chunk ends and the next begins.
Chunk sizes of around 2K seem to work OK, not too big, not too small.
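As a starting point, here is a minimal Python sketch of that chunking strategy. True semantic chunking needs an embedding model to detect topic shifts; this simplified version falls back to cutting at the last sentence boundary inside each window. `chunk_size` is measured in characters here, so adjust if your pipeline works in tokens.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap_ratio: float = 0.10):
    """Split text into ~chunk_size-character chunks with ~10% overlap.

    A crude stand-in for semantic chunking: each chunk is cut back to
    the last sentence boundary (". ") inside the window when possible.
    """
    overlap = int(chunk_size * overlap_ratio)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        window = text[start:end]
        if end < len(text):
            cut = window.rfind(". ")
            if cut > chunk_size // 2:  # only cut back if most of the chunk remains
                end = start + cut + 2
                window = text[start:end]
        chunks.append(window)
        if end >= len(text):
            break
        start = end - overlap  # step back so consecutive chunks overlap
    return chunks
```

Each chunk would then be embedded and stored with the item link as metadata, so retrieval results point back to the source PDF.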
In the case of Azure, you can use the REST API to send documents (even in PDF format) directly into blob storage, and Azure AI Search takes care of the rest; other platforms may be more manual.
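For the Azure route, a rough sketch using only the Python standard library: it builds a Put Blob request against the Blob Storage REST API. The `x-ms-blob-type` and `x-ms-version` headers are what that operation expects; the account/container names are placeholders, and I'm assuming auth via a pre-generated SAS token, since Shared Key request signing is considerably more involved.

```python
import urllib.request

def build_put_blob_request(account: str, container: str, blob_name: str,
                           data: bytes, sas_token: str) -> urllib.request.Request:
    """Build a Put Blob request for Azure Blob Storage's REST API.

    Auth via a SAS token appended to the URL (placeholder approach).
    Azure AI Search can then index the container as a data source.
    """
    url = (f"https://{account}.blob.core.windows.net/"
           f"{container}/{blob_name}?{sas_token}")
    headers = {
        "x-ms-blob-type": "BlockBlob",   # required by the Put Blob operation
        "x-ms-version": "2021-08-06",    # storage REST API version
        "Content-Type": "application/pdf",
        "Content-Length": str(len(data)),
    }
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")

# urllib.request.urlopen(req) would then perform the actual upload.
```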
Thank you for your answer. Unfortunately I cannot send the files directly to my vector storage. I could send them to make.com and have them OCRed there, but that seems kind of unnecessary. And then there is the idea of using the internal DT ID as the identifier… that would make integrating the backlinks smooth.
I might have to look into Python coding to write the functions I need. I might also differentiate between "first time building the index" (a lot of PDF embedding) and daily work (a few PDFs per day, which come in through a WebDAV cloud anyway).
Are you saying that these PDFs don’t contain a text layer? Wouldn’t it then be easiest to OCR them in DT?
They do have a pdf layer, but I fail to extract it on the make.com platform.
You mean a text layer, I guess?
In your OP you asked for a way to export the text to a folder so that your AI software can work with it.
A script in a smart rule can do that by writing the plainText property to a file. If the PDF has a text layer, that is.
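To make that concrete, a rough AppleScript sketch of such a smart rule script (untested; the /Users/you/VectorDB export path is a placeholder you'd adapt): it writes each matched record's text layer to a file named after the item's UUID, so every exported file maps straight back to x-devonthink-item://UUID.

```applescript
-- Sketch of a smart rule script: export the text layer of each new
-- record to a folder, named after the item's UUID.
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			set theText to plain text of theRecord
			if theText is not "" then -- skip PDFs without a text layer
				set theFile to "/Users/you/VectorDB/" & (uuid of theRecord) & ".txt"
				set theHandle to open for access (POSIX file theFile) with write permission
				set eof of theHandle to 0
				write theText to theHandle as «class utf8»
				close access theHandle
			end if
		end repeat
	end tell
end performSmartRule
```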
Is the AI able to work with these files?
What exactly do you want to automate? Maybe it’s possible without using this cloud service.
Thank you for all your answers.
@cgrunenberg: I built myself an AI assistant. This assistant connects phone calls, sends SMS and faxes, and provides my client information. I access it on the go via a Mattermost chat server, utilizing mistral.ai. So I need the information in the cloud.
I tried the smart rules approach and got pretty much what I wanted (picture attached, in German, but you'll figure it out). I clean up the filename (getting rid of the DEVONthink:// part) with Hazel and copy the file to my WebDAV connection. Using a tag, I can flexibly reindex files. Neat.
DT is very well thought through regarding smart rules. Thank you for that.
"Edit comment" is not needed at the moment. I dream of using the real filename and putting the link in the comment some other time. For now you can leave that part alone.
Oh, and /VectorDB is a real folder that I indexed with DT, so it can be processed by Hazel.
Thank you for the nice words, good to know that smart rules are sufficient to export your data in the desired way.