Performance expectations with large indexed library

I recently indexed my Bookends PDF library in DEVONthink and I’m having some performance issues (beachballing). I’d like to know whether this is just to be expected or whether I can improve the situation in some way.

It’s a large PDF library: nearly 6,000 items and 18GB. I initially had it in the same database as all my other research materials but have now moved it to its own database (which didn’t seem to make much difference).

Here are some examples of where I’m experiencing problems:

  • Manually move one PDF from the indexed attachments folder in Finder to a subfolder named ‘Reading list.’ DNtp beachballs for about 15 seconds while it registers the change.
  • Right click on one of my PDF records in DNtp itself. DNtp beachballs for about 15 seconds before it shows the menu. (If I try this twice in succession, the second time is quicker. Presumably a matter of loading into memory.)
  • Change the metadata of any of the PDF records in DNtp via AppleScript (roughly the kind of call sketched after this list). Again, DNtp beachballs, meaning I have to wait to continue with my workflow. Not the end of the world, but it breaks the flow.
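
For context, the metadata change itself is nothing exotic. Here’s a minimal sketch of the kind of call I mean (the field name and value are placeholders rather than my actual workflow, and it assumes the add custom meta data command from DEVONthink’s scripting dictionary):

    -- Sketch only: set a custom metadata field on whatever is selected.
    -- "status" and "read" are placeholder names, not my real fields.
    tell application id "DNtp"
        repeat with theRecord in (selected records)
            -- this single call is where the beachball appears for me
            add custom meta data "read" for "status" to theRecord
        end repeat
    end tell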

I’m running a 2015 MacBook Pro (2.9 GHz i5). So, a slow machine compared to the current Apple Silicon generation but not ancient.

My question, then, is whether this is an expected level of performance, under the circumstances.

I’ve considered moving the whole library fully into DNtp (rather than indexing it), but I’m not sure what difference that would make to performance. It would also be inconvenient, as it’d take a whole lot of scripting to allow Bookends to keep interacting with its own library.

I plan to invest in a new machine next year but I want to make my scripts and workflows performant with what I have first.

How large are these PDF documents (number of bytes, pages & words)? How much time does e.g. Preview.app require to open one?

What is reported in File > Database Properties for the database?

The files themselves vary in size: some are 20MB+, but the vast majority are much smaller than that. Here’s a reasonably random sample:

As I said, a large library. If this is just the reality of having 1/3 of a billion words (!) in a database then that’s just how it is. But I want to optimise things as much as I can before I throw an M1 chip at the problem.

Thanks!

How many pages do the PDF documents that cause noticeable delays have? And which version of macOS do you use and how much RAM does your computer have? Maybe it’s also an issue of virtual memory.

Okay, here’s a memory test:

I quit DEVONthink and reopened it. This is the system memory usage just after reopening (with all libraries now open but no documents focused or being previewed):

That’s from 16GB total.

I then navigated to the group containing my indexed PDF attachments and browsed through some PDFs in the preview window (some small files, some 20MB+). No lag, no problem, no change in RAM usage.

I then right-clicked to open the contextual menu on a document (any document). Beachball for 10-15 seconds. Here’s the memory usage a few seconds after that:

As you can see, accessing the context menu causes a large chunk of information to be loaded into memory that was not there when merely browsing. That would explain the beachball. Once that large chunk has been loaded in, the context menu shows without any delay on any file I select (at least immediately; if I come back later it will lag again).

I should add that accessing the context menu is not the only thing that causes a lag, as described above, but it’s the most easily reproducible trigger. There’s also a lag when I run a script that queries the attachments library (roughly the kind of query sketched below), and that still happens even when RAM usage is already up around 4GB (and that’s the more serious problem). But I would guess that these issues are connected.
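
To illustrate, here’s a stripped-down sketch of the kind of query I mean; the database name, group path and search string are placeholders, not my actual setup:

    -- Sketch only: look up an indexed group and search inside it.
    -- "Bookends Attachments", "/Attachments" and the query string are placeholders.
    tell application id "DNtp"
        set theGroup to get record at "/Attachments" in database "Bookends Attachments"
        set theResults to search "kind:PDF tags:reading-list" in theGroup
        repeat with theRecord in theResults
            log (name of theRecord) -- even simple reads like this stall
        end repeat
    end tell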

My question I suppose is therefore: is this expected behaviour? There’s clearly headroom in the RAM, the issue is more how DEVONthink loads it in and out. I’d be happy for it to use more RAM in order to get better performance!

It’s most likely data loaded while calculating the suggested destinations for Data > Classify and for the same contextual menu command, so that the name of the preferred destination can be shown immediately in the menu. Afterwards other documents shouldn’t cause such a huge delay anymore, but this might vary depending on the size of the document.

How many words (see e.g. info in navigation bar) does this document contain?

Okay, this is interesting. I started by performing the test, exactly as described above, on two files:

  • 5,000 words, 300KB: beachballs for 10 seconds; DEVONthink RAM load goes from 1GB to 4.4GB.

  • 200,000 words, 19MB: menu opens immediately; no RAM increase.

So, it’s the smaller documents that are the problem. In fact, after some more testing, the problem seems to arise only on PDFs with fewer than 10,000 words. I’ve tried this several times and it works the same every time.

Below 10k words, beachball, RAM usage increases 4.4x; above 10k, no beachball, no extra RAM.

Clearly there’s something going on there!
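
In case anyone wants to reproduce this, the word count DEVONthink shows in the navigation bar is also readable from a script, so checking which side of the threshold a selected PDF falls on is roughly this (a sketch using the word count property):

    -- Sketch only: report the word count of whatever is selected,
    -- to see which side of the ~10k-word threshold a PDF falls on.
    tell application id "DNtp"
        repeat with theRecord in (selected records)
            log (name of theRecord & ": " & (word count of theRecord) & " words")
        end repeat
    end tell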

Please compress the two files, upload them to a cloud account of yours, open a support ticket, and include a link to the files. Thanks!

In the case of smaller documents (indeed fewer than 10,000 words), Data > Classify (and the same contextual menu command) actually calculates the destinations and, if there’s a valid suggestion, shows the destination name in the menus.

Well, that explains it!

So, I guess this is basically just how DEVONthink works. Every time something changes in a database (a Finder tag is modified; the menus are refreshed for a file with < 10k words), DEVONthink recalculates the internal relationships of that database. One sign of this is loading a large amount of information into RAM. On a slower machine, with a large database, it can take a significant amount of time on each occasion.

Is it worth raising a ticket for this or do I just need to reconcile myself to needing a new computer?

As it’s working as intended, it’s IMHO not necessary.

Splitting the database into multiple smaller ones might also be an option (at least if not all of them need to be open all the time).