Large unindexed archives

Graham_Nelson · June 26, 2004, 4:02pm

Firstly, let me say that DEVONthink is an interesting application with great potential for use on writing projects: I won’t be using it until multiple databases are allowed, because the things I would use it for are too disparate, but I gather this is on the way. So I don’t want this posting to read as criticism.

The main thing I would like to use DEVONthink for is as a file organiser, a sort of iTunes-for-documents. This is more or less the "hoarder" scenario in the manual: I have a great many text files, PDFs and so forth, and what I want is not to search them simultaneously but to efficiently store, group, manage and read them. DEVONthink nearly does this perfectly, but only for relatively small archives, and I would like to suggest two features which would open the way to much larger ones.

(i) I’d like all files, not just images and PDFs, to be copied into the library area in an open way (so that the Finder can get at them if anything should go wrong). Ideally, they would be managed the way iTunes organises the contents of its own music folder, or indeed the way DEVONthink writes the files out when Export… is used, but it wouldn’t really matter how they were organised. Right now, if the database should be corrupted, it’s not obvious how to retrieve the data (though I must say DEVONthink does recover from errors very well).

(ii) I’d like an option for a DEVONthink database (when we have multiple databases) to be unindexed! This would be inappropriate for many projects and many users, so it should be off by default, but when it’s on, all files would be treated the way DEVONthink currently treats images: as binary files which can be displayed and whose titles can be searched, but which have no indexed interior.

Because you’re experts on indexing, and because DEVONthink does indexing very well, it may not occur to you that anyone would want this, but here’s why I suggest it. Inevitably, DEVONthink’s performance is bound to reduce as the archive grows larger - and we are all gathering steadily larger archives: if you’re a scientist, you pile up preprints at a frightening rate, and then there are all the computer manuals, and… On my iMac G4 (1 GHz, 512MB of RAM) a collection of about 6000 text and PDF files running to 500 MB pushed DEVONthink over some kind of performance limit, probably because physical RAM became too small - the program became very sluggish, and then took over half an hour to quit, so that in the end I had to force quit. Browsing the data was fine, but adding new files now took impossibly long: maybe ten minutes to add another 1MB of text.

I’m not complaining here - indexing and cross-referencing 500 MB of text is not a small task. A faster computer might have made it an easier task. But my point is that it’s a task I didn’t need. I wanted the organisation and presentation side of DEVONthink, the elegant browser and the hierarchical groups, but I didn’t need the ability to search within all files simultaneously. What I’m asking for is the chance to sacrifice this in return for very much better performance on large archives.

With this option, a text/PDF/whatever archive could be gigabytes in size without any significant performance hit, just as iTunes will happily manage 250GB of music because it is "only" 25,000 file references.

I’m guessing this would be an easy feature to add? As I say, it wouldn’t be for every project or every user, but as an option it would considerably add to the scope of what the program could be used for.

cgrunenberg · June 26, 2004, 4:29pm

Graham,

actually the current builds should already meet your expectations:

Use the “Link” command (or hold the command-option modifiers while dropping files) and the files are not indexed/imported, but it’s still possible to view them afterwards (if it’s a supported file format).
DT 1.9 will introduce an option to copy linked files to the database folder/packages too.

BTW:
Indexing (see File menu) instead of importing should both improve performance and reduce memory usage, but wildcard and phrase searches are not available for such content.

Bill_DeVille · June 26, 2004, 7:46pm

Graham:

Like you, I’ve got a large collection of scientific and technical references. My current database runs to almost 7 thousand items, many of them large PDF files, some of which run over 500 pages. My database is over 500 MB after optimizing.

I get great value from the “indexing” features of DT. If, for example, I’m looking for analytical techniques for a particular substance in water or biological samples, DT is much more powerful if it has analyzed the words and contextual relationships in text. Searches are much more powerful, and the “See Also” button or “Word” button often provides very useful literature analysis and review assistance.

While I can’t be certain of the reasons you are seeing slowdowns in DT operations, I can offer suggestions from my own experience.

I’m running DT on a TiBook 500 MHz with 1 GB RAM. DT searches are extremely fast, and windows pop open quickly on my machine, until the number of Virtual Memory files hits 5. At that point, things begin slowing down, because the operating system has to go to disk frequently. What causes Virtual Memory to grow? For one thing, the Optimize and Backup operation, which I use at the end of a DT session. The Classify operation is also CPU and memory intensive. I tend to add new items to my “Edit This” group and then use the Classify button to suggest group(s) for permanent location of each item. I monitor free memory and the number of Virtual Memory files. When free memory gets low and the number of VM files gets to 5, I stop, run the Verify & Repair tool, the Optimize and Backup tool, and quit DT. A restart is in order (time for a five minute break, anyway, after working on about a hundred files). After the reboot, DT is blazing fast again.

I find that launching Mail eats free memory and grows VM files rapidly during synchronization of my mailboxes and my iDisk. I avoid launching Mail before an intensive DT session.

Using these practices, DT performance stays very good, and I don’t experience the severe slowdowns you’ve seen. Of course, one should also do routine disk housekeeping, such as running cron scripts, cleaning caches and repairing permissions as needed. Every week or so, I run Disk Utility and then Disk Warrior to keep my HD up to snuff.

I rather like DT’s database structure, as opposed to storing files individually. But I don’t copy my PDF files into the database. Instead, I prefer importing text of the PDF or, for large PDF files, doing Index import. I’ve got a 60 GB drive in my TiBook and so far don’t have file space problems.

My TiBook is almost three and a half years old. I continue to be amazed at how fast DT runs on it. One of these days, of course, I hope to have a G5 PowerBook with 4 GB or more of RAM – and I’ll be pushing Christian to add even more AI features to DT.

Bill_DeVille · June 26, 2004, 7:56pm

Graham:

One more thing.

I’ve been working for some time using alpha versions of DT Pro. Although I’ve experimented with multiple databases, I’m still using and enlarging my original database, which does contain multiple topics under lots of group and subgroup headings.

One of these days I’ll get around to breaking up the ‘mixed’ database into individual topical databases. But there’s also value in mixing things up. DT sometimes finds relationships between items that I hadn’t thought of, which can provide useful new insights.

cgrunenberg · June 27, 2004, 12:08am

Can’t wait until almost all Macs are G5s or G6s - 64-bit and fast memory access are ideal for the technology. Lots of ideas in the pipeline…

Graham_Nelson · June 27, 2004, 3:06pm

Thanks for these replies. I see now that Link to… does indeed do what I’m looking for, and DEVONthink PE now has no trouble in managing some 1.5 GB of files in just the iTunes-like way I was asking for. I realise this is using only about 1% of the cleverness of the thing, but it’s still very helpful to me. I was going to wait for the Pro version to come out, but it’s all too good to miss out on in the meantime, so I’ve just ordered the license code. Thanks once again.

(One thing I do notice is that Link to… sometimes behaves oddly if there’s a DEVONthink storage file in the same folder, as there is if it’s linking stuff which has previously been exported: it makes a link with the right name, but which is empty when opened. Besides that, it would sometimes be nice to export the files for other applications to use, where all that metadata is unwanted. Maybe Export… could have an option to do so without writing these files?)