Indexing vs. adding to the database question

Hi,
Is it common for DEVONthink to hang for around 30 seconds, and is it possible to do something about it? (I have 8 GB of RAM on a MacBook Pro.) For instance, will I get better performance if I store my PDFs on Dropbox or SugarSync and index them in DEVONthink without physically adding the files?
A related question concerns accessing the database from another computer: my DEVONthink Pro database is about 7 GB and growing. Generally I just keep adding PDFs to it, often whole scanned books and articles. Recently I started to wonder whether it would be more efficient to store the files on a server, as this might allow me to access them from another computer with DEVONthink installed.

Thank you,
Daniel

The size of your database is at issue. Remember, the larger the database, the more RAM, etc. it requires. I would consider splitting your database into smaller ones.

Also, indexing off a server would yield slower performance due to the speed of a network versus the speed of a local hard disk. This is not to say you can’t use a server and index the files, but the quality and stability of your network will affect the performance (as it does with any network-related process).

Again, you can successfully use a server, especially in light of wanting to share a database (though our in-beta Sync options are addressing some of this), but just remember the performance implications.

Thank you for a very helpful response. Just to clarify: if I store all my PDFs on a local hard disk but not inside the database, and index them through DT, will there be an improvement in speed? I believe SugarSync will allow me to have a local folder mirrored on the server, so this might be a workaround for the network speed issues.

Thank you,
Daniel

Not noticeably. As Jim pointed out, it is the size of the database that matters. DEVONthink is keeping meta-information in memory regardless of whether the documents are inside the database package or elsewhere on disk. The server issue merely adds latency to data retrieval times, but not to the amount of memory that the database requires.

I’m going to go against the flow here and say that perhaps it is not the size of your database. I usually have 5 databases open, the largest of which is 16 GB, and I never see a slowdown like what you are experiencing. I also use a 15 GB database on an old PowerBook with 1 GB of RAM and, again, no slowdown. Christian has also maintained that it is not the size of the database or the number of documents that matters so much as the total number of words in the database.

Having said that, I don’t know what to tell you to try, but do explore some additional options before turning your existing workflow upside down.

This would lead me even more to the idea of splitting the database… IMO.

Well, I’d be interested to hear how many words @fantabulosa has in his database. I just checked and I am approaching 18 million total words in my 5 daily-use databases, 10 million alone in the database that is 16 GB in size. From what I have seen reported by users here, my ‘big’ database is mid-sized compared to some. I never see anything approaching a 30-second hang with DEVONthink. The other thing that makes me believe it is not necessarily a RAM issue is his configuration now compared to roughly a month ago. We only just got a 64-bit version of DEVONthink that can take advantage of that 8 GB of RAM. If it is slow now, what was it like with the 32-bit version of DEVONthink?

My configuration: 2011 MacBook Pro, 8 GB RAM, 2.2 GHz Core i7, 7,200 RPM HDD

Agreed, Greg. I would be interested as well. Nice machine BTW.

Hi
Just checked the database statistics:
Words: 731,783 unique, 61,079,904 total

That is quite a lot of words… I have been collecting academic literature in this DB for a few years now.

Daniel

It does appear that splitting the database may be in order. Yours is a great example of how the size of the database, in GB, is pretty much meaningless with respect to database performance.

Will that mean I will have to search each DB individually, or will the search function span several databases?
Thanks!

The global search will search across all open databases. However, and I could be mistaken, but…

I expect that splitting your master database into x number of smaller databases and keeping them all open at once will give you roughly the same performance as what your master database is doing now.

@Greg: I’m not sure either about the performance of splitting the database and leaving them all open. I’d hazard a guess it would be a net-zero gain.

@fantabulosa: Splitting the database shouldn’t just be a blind split but an informed segregation of the data into logical sub-databases. This requires a philosophical shift where you don’t leave the databases open all the time but open them as you need them. Some people may argue it infringes on their “workflow” (and I use that term very loosely), but there’s little logic in leaving unused resources open if it adversely affects your performance.

I think you’d find that smaller, more focused databases, along with Tagging, Groups, Smart Groups, etc., would yield faster and more accurate results (fewer false positives).

PS: There’s no shame in having a lot of little databases instead of one gigantic one. :smiley:

Before jumping into splitting, I’d suggest reviewing Christian’s advice (he’s the one who should know):

The words reported by @fantabulosa are within the range. The number of items … I don’t think that’s been posted yet.

I mention this because there might be something else going on with the content (e.g., a poorly built or poorly OCR’d PDF), and going through the effort of splitting the database might not bring @fantabulosa the desired relief. For example: a 30-second hang was reported. Hang while doing what? Hang on every document? What happens when the hung document is removed temporarily from the database? Have Verify & Repair and/or Rebuild been run? Do disk permissions check out OK?
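
If you want a quick, rough way to spot PDFs with little or no text layer (one common sign of a bad OCR job), something like the sketch below would do it. This is nothing DEVONthink-specific, just an illustrative Python script; it assumes the third-party PyPDF2 library is installed, and the folder path is only an example:

```python
# Rough check for PDFs with little or no extractable text (possible bad OCR).
# Assumes the third-party PyPDF2 library; the folder path below is just an example.
import os
from PyPDF2 import PdfReader

FOLDER = os.path.expanduser("~/Documents/IndexedPDFs")  # hypothetical indexed folder

for name in sorted(os.listdir(FOLDER)):
    if not name.lower().endswith(".pdf"):
        continue
    path = os.path.join(FOLDER, name)
    try:
        reader = PdfReader(path)
        # Sample the first few pages; a nearly empty result suggests a missing text layer.
        sample = ""
        for i in range(min(3, len(reader.pages))):
            sample += reader.pages[i].extract_text() or ""
        if len(sample.strip()) < 20:
            print("Little or no text found:", name)
    except Exception as exc:
        print("Could not read:", name, exc)
```

A document that shows up here would be a good candidate to pull out of the database temporarily to see whether the hang goes away.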

Hello,
I have a sizable DB (7 GB dtDbase2, 1.2 million unique words, 97 million total words in 10,000 docs). I use the ‘index’ method: I index from a nested Finder folder, and that folder is 7.4 GB in 12,000 ‘items’.
The disk is my standard iMac’s (3.06 GHz Core i3, 4 GB RAM) 7,200 RPM disk. I use DTP 2.4.1 on OS X 10.6.8.
When I do a ‘content’ search for a simple word (Chia), it gives 457 results in 3.3 seconds. I am quite happy with that. Once a day (?), when I wake DTP, it is busy for some time (30 seconds? minutes? always too long!). I can see quite a lot of disk use in Activity Monitor, and DTP is then shown as “not responding”. That is, I assume, some daily housekeeping, maybe the DTP daily ‘backup’.
It is probably too simple to assume that an ‘hourly’ backup setting in DTP is the culprit.

The above is from DEVONthink > Preferences > Backup. Hourly backups are probably excessive. If you’re running Time Machine, DEVONthink’s backups might not be necessary at all. I use the default Weekly backup, and in thousands of hours of use I’ve never needed to restore a backup – but I have needed that Time Machine backup occasionally.

I’d suggest turning off backups, or lowering the frequency, for a while. But that probably won’t fix the start-up issue. You said “when I wake DTP” – do you mean when you use it for the first time in the morning after it has been sitting idle overnight, or do you shut it down at night and start it up in the morning? If the latter, then backup might be implicated, and it might also just be the overhead of loading metadata for your database.

It’s hard to tell, actually. I use the same databases on two machines: an old, crusty, crufty Mac Pro and a MacBook Pro. Startup of those databases (about 6 GB total) on the Mac Pro is slow – up to 60 seconds, I’d guess. On the MacBook, startup for the same databases is about 20 seconds.

The other thing I recommend is Verify & Repair and then Backup & Optimize. Then reboot the machine. Then see if anything changed.

Bottom line, though, is that it seems the start-up experience isn’t wonderful (that can be ameliorated by a morning cup of coffee :wink: ) but other than that, your experience isn’t so bad. Am I reading too much into that?

Until very recently, DEVONthink was running as a 32-bit application. That means the total addressable memory space was limited to 4 GB (if a database required more than 4 GB of addressable space, memory errors would occur). And most Macs were limited to 4 GB or less of installed RAM (especially laptops).

When a DEVONthink database is loaded, the entire text index plus other metadata must be loaded into memory. For years, I’ve used topically split databases with a rule-of-thumb upper limit of 40 million total words as a means of allowing databases to fit into my available RAM (typically 4 GB installed on my laptops), so that free physical RAM wasn’t all used up. As long as a database fits within available free RAM (and doesn’t exceed the memory space limitations when running as 32-bit), database operations can proceed at the full speed of which the Mac is capable. But the operating system requires memory space, as do other applications running alongside DEVONthink.
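
To make the bookkeeping concrete, here is a trivial sketch of the kind of word-count budgeting I mean. The database names and totals are made-up examples (you would take the real figures from each database’s statistics), and the 40 million figure is only my personal rule of thumb, not a DEVONthink limit:

```python
# Illustrative only: the database names and word counts below are made-up examples.
# The 40-million-word figure is a personal rule of thumb, not a DEVONthink limit.
WORD_BUDGET = 40_000_000

open_databases = {
    "Research": 18_000_000,   # hypothetical totals, taken from each database's statistics
    "Teaching": 6_500_000,
    "Clippings": 11_000_000,
}

total = sum(open_databases.values())
print("Total words across open databases: {:,} of {:,}".format(total, WORD_BUDGET))
if total > WORD_BUDGET:
    print("Over budget; consider closing one database before opening another.")
```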

Apple supplements the physical RAM installed in a computer with Virtual Memory. When free physical RAM becomes exhausted, processes can continue by swapping data back and forth between RAM and Virtual Memory swap files on disk. This allows processes to run to completion, but at the expense of slowdowns, because read/write speeds on disk are much slower than read/write speeds in RAM. I hate slowdowns, and my practice of creating topical databases has been the principal workaround: it limits their sizes and so keeps them running quickly in free RAM.
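
If you want to see whether your Mac is actually dipping into Virtual Memory, the standard OS X command-line tools sysctl and vm_stat will tell you. Here is a small Python sketch that simply shells out to them; nothing about it is DEVONthink-specific:

```python
# Check whether OS X is swapping to disk (a sign that free RAM has run out).
# Shells out to the standard command-line tools sysctl and vm_stat.
import subprocess

swap = subprocess.check_output(["sysctl", "vm.swapusage"]).decode().strip()
print(swap)  # total / used / free swap file sizes

for line in subprocess.check_output(["vm_stat"]).decode().splitlines():
    # A non-zero and growing "Pageouts" count means data is being pushed out to swap.
    if line.startswith("Pageouts"):
        print(line.strip())
```

If swap usage and pageouts keep climbing while a DEVONthink database is open, that is a good hint the database (plus everything else running) no longer fits in free RAM.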

For users whose computers meet Apple’s requirements for running Mountain Lion and have lots of RAM installed, a new day has dawned. The current version of DEVONthink runs in 64-bit mode and so raises the bar on database size.

My new MacBook Pro has 16 GB RAM, which allows much larger DEVONthink databases, or an increased number of simultaneously open databases, without slowdowns or memory errors. I will continue to use topically designed databases, because there are other advantages to that approach beyond merely controlling database word count. But I no longer have to spend time juggling the total word count of open databases or freeing up RAM in order to avoid slowdowns. Yes, even with 16 GB of RAM there are limits to database size, but as a practical matter I haven’t approached those limits. Wonderful! :slight_smile: