Multithreaded indexing of files

I need to index quite a few thousand PDFs. I’ve chosen to use Index Files and Folders… due to storage constraints and the fact that more PDFs will be added to an existing file structure.

It’s very painful watching the indexing while being pretty much unable to use DT in the meantime, i.e. there’s a constant cycle: a 6-10 second hang when I can’t do anything, a 2 second break when I can click something, then another 6-10 second hang.

I’m running DT on a Mac Studio with 64GB RAM. Memory utilisation is 34% and CPU usage is only 18%.

To be clear, I’m not using OCR, as the PDFs have already been scanned in correctly.

I would like to know if there’s any way for DT to use the machine’s resources more efficiently when working on a large library of files, and whether the indexing couldn’t be sped up by utilising the other cores and the available memory. Thanks

Which version do you use? DEVONthink actually indexes PDF documents in the background, so it’s unclear what’s causing the delay. A screenshot of File > Database Properties would be useful.

Another thing to check (as I experienced it myself) is that you don’t have files being removed locally (by the optimise storage feature) and put in iCloud, as these seem to be re-downloaded every time DEVONthink accesses the indexed folder. You need the indexed folder contents to be available locally, or there is a time lag while they are pulled back down from iCloud.
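If it helps, here’s a minimal sketch (my own, nothing built into DEVONthink) for spot-checking an indexed folder for evicted files. It assumes the usual convention that iCloud replaces evicted files with hidden `.<name>.icloud` placeholders; adjust the pattern if your setup differs.

```python
#!/usr/bin/env python3
"""Spot-check an indexed folder for iCloud placeholder files."""
from pathlib import Path
import sys

# Assumption: files evicted by iCloud's optimise storage feature are replaced
# by hidden ".<original name>.icloud" placeholders.
def find_evicted(root: Path):
    yield from root.rglob(".*.icloud")

if __name__ == "__main__":
    root = Path(sys.argv[1] if len(sys.argv) > 1 else ".").expanduser()
    evicted = list(find_evicted(root))
    print(f"{len(evicted)} evicted file(s) under {root}")
    for p in evicted[:20]:  # show a sample only
        print("  ", p)
```

If it reports placeholders, downloading the folder contents in Finder before indexing should avoid the re-download lag.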

Thank you for your reply.

I admit - when I set up a new database with just 1,000 PDFs, it did index these in the background.

But it doesn’t appear to do this on a larger (ahem) dataset; instead it constantly hogs resources, rendering DT unusable whilst the indexing takes place. Meanwhile the rest of the Mac runs perfectly fine.

In terms of version, I’m on DT Server, and we bought a Mac Studio with 64GB RAM to dedicate to it. Before I give you the database properties - yes, we’ve split the data into a couple of libraries to try to limit the size, but the more libraries, the more times searches need to be run. Loading the largest library is no problem and using DT is very fast. It’s just the indexing of a large import of files that renders it unusable.

In terms of DB properties… I know we’re pushing the limits… It’s 1.6M items and over 500GB (a lot of legal files that have been scanned). It’s not searching the library that’s a problem; it’s just the indexing, which causes the stalls and massively underutilises the system resources.

Actually, it’s more likely that the database size, which exceeds the recommendations, causes the stalls. Or does the new database cause the same issue while (!) all other databases are closed?

The number of words would be interesting too, as that’s also important, and likewise the location of the indexed files in the filesystem.

If you’ve attempted to add all these at once, smaller batches are recommended.
And yes, you’ve exceeded the comfortable limits of a database.

Only 216m words, 13m unique :wink:

For some positive feedback: When it’s not indexing files, it holds its own…

I appreciate that we’re way beyond the recommended limits - I’ll consider reorganising the library data into 3-month periods… It’s just that it feels like there is much more power in the system that isn’t being utilised.

That would indicate to me that many documents lack a searchable text layer.

216,000,000 words / 1,600,000 documents = 135 words per document :thinking:… unless I’m misunderstanding something.
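If you want to spot-check that theory yourself, here’s a small sketch using the third-party pypdf library (nothing to do with DEVONthink itself); the sample size and the 50-word threshold are arbitrary assumptions.

```python
#!/usr/bin/env python3
"""Spot-check PDFs for a searchable text layer (pip install pypdf)."""
from pathlib import Path
import random
import sys

from pypdf import PdfReader

def word_count(pdf: Path, max_pages: int = 3) -> int:
    """Count words extracted from the first few pages of a PDF."""
    reader = PdfReader(pdf)
    words = 0
    for i, page in enumerate(reader.pages):
        if i >= max_pages:
            break
        words += len((page.extract_text() or "").split())
    return words

if __name__ == "__main__":
    root = Path(sys.argv[1]).expanduser()
    pdfs = list(root.rglob("*.pdf"))
    sample = random.sample(pdfs, k=min(50, len(pdfs)))    # spot-check up to 50 files
    sparse = [p for p in sample if word_count(p) < 50]    # arbitrary "no text layer" threshold
    print(f"{len(sparse)} of {len(sample)} sampled PDFs have little or no text layer")
```

Anything flagged by a check like this would be a candidate for OCR, which would also explain the low words-per-document figure.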

Also, you didn’t state where these indexed files are located.

The next time you’re going to index data, please launch Apple’s Activity Monitor application (see Applications > Utilities), select DEVONthink 3 in the list of processes, choose the menu item View > Sample Process and send the result (or even better multiple samples) to cgrunenberg - at - devon-technologies.com! Thanks in advance!
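If the app is too unresponsive to drive Activity Monitor comfortably during a hang, macOS also ships a command-line `sample` tool; here’s a rough sketch for grabbing a few consecutive samples (the process name, durations, and output filenames are assumptions you may need to adjust).

```python
#!/usr/bin/env python3
"""Capture a few consecutive samples of a process with macOS's sample tool."""
import subprocess
from datetime import datetime

PROCESS = "DEVONthink 3"   # assumption: match the process name shown in Activity Monitor
DURATION = "10"            # seconds per sample
SAMPLES = 3                # take several samples while the hang is happening

for i in range(SAMPLES):
    out = f"devonthink-sample-{i + 1}-{datetime.now():%H%M%S}.txt"
    # sample <process> <duration> -file <output>
    subprocess.run(["sample", PROCESS, DURATION, "-file", out], check=True)
    print("wrote", out)
```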

Thanks both for your responses.

I am not sure what happened, but I set up new libraries, re-indexed the files, and it’s now working beautifully, with indexing taking place in the background even when adding 25-75k files at a time.

So I’ve decided to reindex all the historic data overnight in smaller chunks. It makes such a difference to be able to use DT whilst the indexing is going on.

We’ll continue to push the limits but for now… all good.

Glad to hear it’s working as expected now. If you should ever experience the slow performance again, then a sample would be great, plus a description of what exactly you did. Thanks in advance!