Very large databases, search and memory help

Hello,

I decided to import all my email into a DEVONthink database; there were a few reasons why this was necessary. The database is roughly 17.5 GB.

I know I could probably “split” the database into a smaller one holding just the mail I’m most likely to reference, but then searching across everything becomes more difficult. If there’s no way to address this problem I may do just that… but I’d like to avoid it if possible.

Closing and opening the database takes a very long time, around 10 minutes. Searches within the database also take a very long time (100% CPU for 5-10 minutes or so).

My goal here isn’t necessarily perfect classification, but rather a way to search for a string in the database. To that end, I’d like to make this as fast as possible, using as little as possible of the machine learning DEVONthink is great at. Essentially, it’s as if I had all of this in flat files and just wanted to grep through them… that’s pretty much all I need from this database.

The EML files appear to be HTML. Converting them to plain text as a batch operation might help.

Is there any way to approach this besides just going into the eml folder and doing a grep manually?
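
To illustrate what I mean by “doing a grep manually”, here’s a rough sketch of that approach (Python; the folder path and search string are placeholders, since I’d have to point it at wherever the raw .eml files actually live):

# Rough sketch of the "grep through flat files" idea: walk a folder of
# .eml files and print the ones whose raw contents contain a string.
# The folder path and search string are placeholders.
from pathlib import Path

EML_FOLDER = Path("~/EmailArchive").expanduser()   # placeholder location
NEEDLE = b"some search string"                     # bytes, like grep on raw files

for path in EML_FOLDER.rglob("*.eml"):
    if NEEDLE.lower() in path.read_bytes().lower():
        print(path)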

Thanks

Here’s my treatise on database size…
Size in gigabytes isn’t the critical number. If you check File > Database Properties > … for a given database, the number of words / unique words is more critical. On a modern machine with 8 GB RAM, a comfortable limit is 40,000,000 words and 4,000,000 unique words in a database. (Note: this does not scale in a linear way, so a machine with 16 GB wouldn’t necessarily have a comfortable limit of 80,000,000 words / 8,000,000 unique words.) So the text content of a database is far more important.
If you have a database of images, it will have very few words but be large in gigabytes.
If you have a database of emails, it will have many words, but may be smaller in gigabytes.
The second one may perform more poorly as the number of words increases beyond the comfortable limit.

Smaller, more focused databases will generally perform better, Sync faster, and be more data-safe in the event of a catastrophe (avoiding the “all your eggs in one basket” problem). They also give you the opportunity to close databases when you’re not using them, which frees up resources not only for DEVONthink but for the rest of the system. There is no benefit to having a bunch of unused databases open all the time.

Concerning conversion, it is possible to script converting emails to plain text (though I certainly would NOT try converting thousands at a time). Note that a basic script converts just the text content and basic header info. For example, here is a Dropbox email converted to plain text via script…
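
For illustration only, the general shape of such a conversion looks roughly like this. This is not the actual script used; the headers kept, the HTML stripping, and the output location are all assumptions.

# Illustrative sketch (not the actual script): convert one .eml file to
# plain text, keeping only basic headers plus the body with HTML tags
# crudely stripped out.
import email
import email.policy
import re
from pathlib import Path

def eml_to_text(eml_path: Path, out_dir: Path) -> Path:
    with open(eml_path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=email.policy.default)

    # Basic header info only
    header_lines = [
        f"From: {msg.get('From', '')}",
        f"To: {msg.get('To', '')}",
        f"Date: {msg.get('Date', '')}",
        f"Subject: {msg.get('Subject', '')}",
        "",
    ]

    # Prefer a text/plain part; fall back to de-tagged HTML
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part else ""
    if part and part.get_content_type() == "text/html":
        body = re.sub(r"<[^>]+>", " ", body)

    out_path = out_dir / (eml_path.stem + ".txt")
    out_path.write_text("\n".join(header_lines) + body, encoding="utf-8")
    return out_path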

Hey,

Thank you for the reply.

Looking at the database properties, I’m sitting at 41,631,810 unique words and 131,435,700 total words. I have ‘Create Spotlight Index’ set; I should probably turn that off. Also, the Total line says 28.334, 1.8 GB. I’m not sure if that means 1.8 GB of content or 28.3 GB.

Also, the machine is an early 2015 MacBook Pro, 13-inch (i7, 16 GB of RAM). It seems like it meets that comfortable limit you were referencing, but not by much.

The Total line lists the number of records, then the size of the database.

Again, the resources do not scale in a linear fashion.

Yeah, well, I’m off by a factor of 10 on how I originally read those numbers: I’m at roughly 41 million unique words against a comfortable limit of around 4 million. So even if my machine were 10x more powerful, it likely wouldn’t work.

On a side topic: mailboxes as large as mine are not uncommon, and having the ability to support more mail without the classification overhead would be nice. For now, I think I’ll convert it all to plain text, then parse the headers and create my own file structure that I can grep through, roughly as sketched below.
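
Roughly what I have in mind (a Python sketch; the year/month layout, the headers I keep, and the paths are all my own assumptions, not a finished tool):

# Sketch of the plan: convert each .eml to a small text file filed by
# year/month, so a plain grep over the output tree covers everything.
# Paths, folder layout, and kept headers are assumptions.
import email
import email.policy
import email.utils
import re
from pathlib import Path

SRC = Path("~/EmailArchive").expanduser()   # where the raw .eml files are
DST = Path("~/MailAsText").expanduser()     # greppable output tree

for i, eml in enumerate(sorted(SRC.rglob("*.eml"))):
    with open(eml, "rb") as f:
        msg = email.message_from_binary_file(f, policy=email.policy.default)

    # File by year/month parsed from the Date header, if present
    try:
        when = email.utils.parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        when = None
    folder = DST / (f"{when.year}/{when:%m}" if when else "undated")
    folder.mkdir(parents=True, exist_ok=True)

    # Body: prefer text/plain, otherwise crudely strip HTML tags
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part else ""
    if part and part.get_content_type() == "text/html":
        body = re.sub(r"<[^>]+>", " ", body)

    (folder / f"{i:06d}.txt").write_text(
        f"From: {msg.get('From', '')}\n"
        f"Subject: {msg.get('Subject', '')}\n"
        f"Date: {msg.get('Date', '')}\n\n{body}",
        encoding="utf-8",
    )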