I hope the brilliant folks at DT will get more aggressive about expanding. I would love to hear that DT has purchased a company like nVivo, or decided to make a Windows version of the product.
It’s selfish to keep something so wonderful to yourself. I work in a consulting capacity and can’t tell you how many times I’ve utterly amazed clients, customers and peers with my ability to “ingest” insane amounts of information into DT with my Mac and my ScanSnap.
Someone says something…I take a note. Someone gives me a document, I scan/OCR it. Presentations? All sucked into my bottomless hard disk.
But I want MORE POWER. When will DT take it to the next level?
The next level looks like DT not being confined to Mac users, and becoming the way that people everywhere manage information. I want to be able to use it at work, with a few thousand of my peers, for my personal life and my professional life.
Oh…I have some prices for my new MacBook somewhere…open DT…there it is…Need to find last month’s phone bill…DT on my phone…bingo…notes from the last all hands meeting…on my iPad, cross-referenced…got it…even though I’m stuck with Windows at work.
One thing I’d like to see is for DT to be able to index more. Right now, I suppose 200 GB or so is the workable limit. I don’t know. The app freezes up a lot, so it is difficult to tell exactly where it is running into trouble. Spotlight, on the other hand, can index several terabytes without a hiccup. If DT could do whatever Spotlight is doing, that’d be nice!
A DEVONthink database that holds Imported or Indexed content requires more memory than does Spotlight, because the DEVONthink database contains more information about the groups, documents and their metadata than does Spotlight. But that additional information is what makes the DEVONthink environment so much richer than the Finder environment. Ergo, I don’t want DEVONthink to do what Spotlight is doing.
Artificial intelligence algorithms are at the kernel level of a DEVONthink database.
The most important measure of database size is not the database file storage size. The most important measure is the total number of words contained in open databases. The next most important measure is the number of items contained in open databases. For example, a database with a file size of 1 GB that holds plain text documents will likely be larger in total word count than a database with a file size of 200 GB that holds PDFs with a high percentage of images in their content.
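To make the word-count point concrete, here is a minimal Python sketch that measures a collection by total words and unique words rather than by bytes. It is only an illustration: the tokenizer, the `word_stats` helper and the sample strings are assumptions made for this example, and it does not reproduce how DEVONthink itself counts words.

```python
import re
from collections import Counter

# Simple tokenizer (an assumption for this sketch): runs of word characters.
WORD_RE = re.compile(r"\w+", re.UNICODE)

def word_stats(texts):
    """Return (total_words, unique_words) across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return sum(counts.values()), len(counts)

# Hypothetical stand-ins for the extracted text of two small collections.
plain_text_db = [
    "mercury levels in fish tissue",
    "fish sampling methods and mercury limits",
]
image_heavy_db = ["Figure 1"]  # a scanned page that yields almost no text

print(word_stats(plain_text_db))
print(word_stats(image_heavy_db))
```

The second collection could be far larger on disk because of its images, yet it contributes almost nothing to the word totals.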
As DEVONthink is a 64-bit application, it can have a large memory space. Problems begin to emerge when there’s not enough free RAM left, so that the computer moves into Virtual Memory to allow procedures to continue. In the Virtual Memory mode, data is moved back and forth between RAM and Virtual Memory swap files. This causes slowdowns because of the difference in read/write speeds between RAM and disk, especially as disk speeds are orders of magnitude slower on a conventional hard drive. In the worst case, Virtual Memory swap files grow large and, if free disk space is used up, can pose the potential of overwriting existing data on the drive.
Typically, a computer will have the largest amount of free RAM right after a restart. Apple’s memory management is good, but not perfect. “Inactive RAM” is data that is temporarily retained in RAM because it has been frequently called or is necessary for a given application, and inactive RAM is subject to being replaced when free RAM is needed by an application. However, over time “crud” inactive RAM accumulates and sticks, tending to reduce available free RAM.
After a restart, my MacBook Pro with 16 GB RAM has about 10 GB free RAM (exclusive of inactive RAM) with my current suite of 7 open DEVONthink Pro Office databases, DEVONagent Pro, Mail, ScanSnap Manager, Messages and several other apps open. At this moment, there’s a bit more than 5 GB free RAM available, so all is well. The computer is operating at full speed, with no pageouts. Day by day, as I continue work, the amount of free RAM will continue to diminish and if I were to take no action would reduce to the point that pageouts occur and Virtual Memory swap files start growing—the spinning ball slowdown indicator would appear.
I monitor free RAM. When it drops to about 1 GB I’ll take action. I can recover most of the free RAM by quitting and relaunching apps, or by a restart. As my MacBook Pro has a 500 GB SSD, restarts take less than a minute, so are no longer to be feared. Remember, too, that any errors that have accumulated in the computer’s memory will be cleared by a restart.
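For anyone who would rather script that check than watch a monitor, here is a rough Python sketch that reads the text printed by the macOS `vm_stat` command and compares free RAM with a threshold. The function names, the 1 GB default and the sample text are assumptions for illustration; the parsing assumes `vm_stat` output shaped like the sample, and it counts only "Pages free" (not inactive RAM).

```python
import re

def free_ram_gb(vm_stat_output):
    """Estimate free RAM in GB from vm_stat-style text (free pages only)."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    free_pages = int(re.search(r"Pages free:\s+(\d+)\.", vm_stat_output).group(1))
    return free_pages * page_size / 1024**3

def needs_action(vm_stat_output, threshold_gb=1.0):
    """True when free RAM has dropped below the chosen threshold."""
    return free_ram_gb(vm_stat_output) < threshold_gb

# Sample text in the assumed vm_stat layout (numbers are made up).
SAMPLE = """Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                              131072.
Pages active:                            900000.
Pages inactive:                          400000.
"""

print(free_ram_gb(SAMPLE), needs_action(SAMPLE))
```

On a Mac, the text could be obtained with `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout` and passed to `needs_action`.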
RAM is good. More (free) RAM is better!
I really don’t need a Mac Pro with the maximum currently possible 128 GB RAM! But I sometimes wonder just how large a DEVONthink database set could be run on such a beast.
Good post, Bill. Database size is one of the least understood aspects of DEVONthink. Your post shouldn’t be buried in a long thread – it should be a stand-alone locked posting in the Tips & Tricks forum.
Thanks for the wonderful clarification on the database size. Indeed, I was vague in my post, and should have said that RAM was probably the issue.
What happens with DEVONthink (from what I can tell) is that it works to index my content and at some point the available RAM is used up or something, but it just freezes, as I remember. I haven’t tried to index my big files in a while.
As you pointed out, it isn’t file size, but words. In my case, they are one and the same, though. I have many thousands of PDFs, many of them scanned and OCRd by me, and this seems to trip DEVONthink up. When my core files (a few hundred gigabytes) are compressed down into just text files they amount to only 2 or 3 gigabytes. Of course, this is 2 or 3 gigabytes of densely packed text, so that is a lot.
With only 8GB of RAM on my 2014 MBPr, I’m probably at the low end of the scale. Wish I had at least 16GB…
Anyhow, I don’t know how to make it happen, of course, but as a user I’d like to see DEVONthink perform as smoothly as Spotlight – if it does different things, all the better. That’s why I purchased it, after all.
As for the artificial intelligence stuff, which I am very interested in, I haven’t benefited as much from it. My files are in multiple languages, and I don’t seem to have as much luck with the AI this way. Of course, I know that separating my files into different databases is one solution to this problem, but I do my work in at least two languages (English and Japanese), oftentimes more (Classical Japanese, Classical Chinese, Portuguese, and Spanish), so it’s all hopelessly jumbled, even within individual files. Certainly, my writing and research notes, which are some of my most important files, use several languages, so I lose out there.
Could DEVONthink improve the AI for multiple languages in single databases? Maybe. That would be nice, but I imagine it is no minor feat! There is surely a lot for me to improve in my use of DT. I wouldn’t say I’m a novice (using it for several years now, though only recently bought my own copy). However, I get the sense that I am only barely scratching the surface right now, especially with all the use I get out of DEVONagent. These two apps together are amazing.
I believe I am understanding what you are describing, but for others that may be reading this thread now or in the future, the term indexing has a very different meaning in DEVONthink. In DEVONthink parlance, indexing is the process of adding documents to a database without moving them from their original location in the Finder. What you are describing are the various DEVONtechnology functions of classification, AI, etc. The more words (unique and total) in a database, the more RAM needed for DEVONthink to classify the documents.
Actually, I meant indexing in the DEVONthink sense. I don’t know what DEVONthink is doing in the background, because I am afraid I don’t know enough about the process, but whatever that is (dealing with unique words?), it is freezing the app, and I can’t index many of my folders or files. The reason I compared it to Spotlight earlier is that Spotlight indexes everything without a hiccup. I wish that DEVONthink could accomplish the same thing.
FROBGLOBIN, it sounds as though you are trying to capture a very large collection of documents into a single DEVONthink database.
I wouldn’t recommend doing that, even if I had unlimited RAM. Nor, for that matter, do I capture all the files on my computer into a DEVONthink database.
Two of my large databases hold content relevant to my professional interests in environmental issues. The one in which I spend most time doing research and writing holds more than 30,000 documents and has a total word count of more than 40 million words (about the same as the Encyclopedia Britannica). That database contains scientific and engineering documents from a number of disciplines, case histories of environmental problems, environmental policy issues and items dealing with environmental laws and regulations (primarily U.S. and EU). A related database also deals with environmental topics, but covers methodologies including environmental sampling, analytical methods, statistical methodologies for evaluating environmental data, risk assessment and cost/benefit analysis.
Why did I separate those kinds of content into two databases? When I’m interested in, for example, human health effects of mercury contamination in edible portions of fish, I’m interested in toxicological information, in case histories, and in regulatory limits that exist or have been proposed to reduce health effects. When performing searches or invoking See Also I wouldn’t want to be bothered by potentially hundreds of references about how samples are collected, complications in choices of analytical procedures posed by other materials present in samples, and so on. Those two databases complement each other, but my work in each of them is more efficient because they are separated.
I have a number of other databases that meet a need or interest. Each is more efficient for my purposes because its contents relate to a specific purpose for using its information.
I normally have a set of 5 open databases. I can treat my databases like information Lego blocks, opening and closing them as needed. As a practical matter, with the 16 GB RAM in my MacBook Pro I wouldn’t try to open all of my databases at once; the computer would slow to a crawl.
As I’ve checked the option in Database Properties for each database to provide indexing information to Spotlight, I can use Spotlight to search across all my DEVONthink databases, whether they are open or closed. When I do that, I click on the option, “Show all in Finder” and can then identify items that are in a database by the blue shell logo. If I select a result and press Space I can see a Quick Look display of the content. If I double-click on a result it will open in DEVONthink, opening a previously closed database if necessary.
Thanks for the detailed advice, Bill. I’ve benefited from your posts on the topic in other past threads as well.
Up until now, I’ve been thinking of DEVONthink as something that could index and organize everything, but with the amount of data I have, this might not be feasible after all. Certainly, with only 8 GB of RAM, it will be unwieldy on my machine.
I think what I will do is follow your suggestion to make several databases. I’ll consider Spotlight as a way to catch any data anywhere that might possibly be related to my current project, and I’ll use DEVONthink to index relevant files into curated databases.
In my experience, the ability to index in DEVONthink is one of the features that makes it stand apart from other seemingly similar applications (like Evernote). It makes something like creating a curated database a little less stressful, because I don’t actually have to move anything around and “commit” to anything. I hope DT will keep this feature around and/or improve on it.
One possible factor with this scenario: erroneous OCR. I know this is an issue with another program that “indexes” by building a concordance - a list of all possible words.
When I OCR a poor quality original, the resulting text has a lot of errors in it. This creates unique “words” such as quo1ty or 0CR, and many others that you would never have imagined. Each such word requires space in the concordance. Even if OCR is 99% accurate, which is darned good with imperfect originals, the 1% of erroneous words are all different, so they are probably “blowing up” the total index size, taking time to process going in, and slowing down searches.
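A small Python sketch of that effect, under stated assumptions: the tokenizer and the "mixes letters and digits" test are a crude heuristic invented for this example, not what DT or any other program actually does, and the heuristic will also flag legitimate tokens such as "h2o".

```python
import re

# Tokenizer assumed for this sketch: runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def looks_like_ocr_noise(token):
    """Heuristic: a token that mixes letters and digits is suspicious."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return has_alpha and has_digit

def concordance_stats(text):
    """Return (number of unique tokens, sorted list of suspicious tokens)."""
    tokens = {t.lower() for t in TOKEN_RE.findall(text)}
    suspicious = sorted(t for t in tokens if looks_like_ocr_noise(t))
    return len(tokens), suspicious

clean_text = "quality OCR gives quality text"
noisy_text = "quo1ty 0CR gives qua1ity text"  # same sentence with OCR errors

print(concordance_stats(clean_text))
print(concordance_stats(noisy_text))
```

The clean sentence repeats "quality", so it adds one entry; the noisy version turns the same word into two different "words", which is how a small error rate can inflate the list of unique terms.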
My solution is that when I have a long, poor-quality document, I choose one of several options:
Only OCR the first few pages
Don’t OCR it at all. Sometimes I copy and paste the abstract into the document’s meta-data.
Manually clean up the document.
Finally, and I have zero idea to what extent this applies to DT, I rebuild the database after I have cleaned up multiple thousands of pages of material (such as by shifting to not indexing them). This compresses the DB size and “cleans it up.”
Again, I don’t know enough about DT to know how much of this will apply. Certainly breaking the database into pieces is a more powerful and fundamental step, but sometimes less convenient.
To be honest, I am generally quite pleased with the current system, which is why I did not offer any concrete suggestions for improvement beyond wanting it to index more stuff, which may not even be desirable (as Bill pointed out). Still, I am sure there are lots of things that would be even more enjoyable to use if they were updated, and I look forward to being pleasantly surprised by the improvements.
Ideas for exported summary reports in tabular format keep flowing. Take for example Frederiko’s latest:
Maybe some future DEVONthink could have a graphical interface to define report formats and save the defined format as a script (something more robust than AppleScript – maybe Swift?) for tweaking and customization. I’m thinking of this as a FileMaker-like (and definitely FileMaker-lite) interface where we can drag database properties (name, dates, comments, tags) onto a grid and define an output report. So, in the case of Frederiko’s script, instead of relying on Numbers to do the formatting, the grid could have been defined in the DEVONreport definition pane.
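To show what a saved report definition might boil down to, here is a purely hypothetical Python sketch: the definition is just an ordered list of property names, and rendering walks the items and writes one row each. Nothing here is an existing DEVONthink feature or API; `render_report`, the property names and the sample items are all made up for illustration.

```python
import csv
import io

def render_report(columns, items):
    """Render items as CSV text, one column per property in the definition."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)  # header row comes from the definition
    for item in items:
        # Missing properties become empty cells; extra properties are ignored.
        writer.writerow([item.get(col, "") for col in columns])
    return buf.getvalue()

# Hypothetical report definition: the properties "dragged onto the grid".
report_columns = ["name", "tags", "comment"]

# Hypothetical items with made-up property values.
items = [
    {"name": "Phone bill", "tags": "bills", "comment": "March", "created": "2015-03-01"},
    {"name": "All-hands notes", "tags": "work"},
]

print(render_report(report_columns, items))
```

Keeping the definition as plain data is what would make it easy to save, tweak and reuse, whatever language the real thing ended up using.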