DT Pro Office Hangs after large import

danfinkelstein · December 12, 2006, 8:26pm

I’ve recently upgraded from DT Pro to DT Pro Office (and put in a new registration key, to boot). I’m trying to import a directory structure with about 70000 documents, which DT Pro Office dutifully churns through… until the end.

It pops up a Log window showing the 20 or 30 files that it couldn’t import (file format problems), and hangs. I can move both the Log window and the main interface window around on the screen, but I can’t save the Log window’s contents nor clear it via the buttons on that window, nor can I close the Log window, and the application becomes unresponsive. It’s consuming as much CPU as it can, apparently, as well.

If I force-quit the application, the import fails.

I’m running Mac OS X 10.4.8, 1.67 GHz PowerPC G4, 1 GB RAM.

Bill_DeVille · December 12, 2006, 10:00pm

Hi, steuben. When one asks an application to undertake a very large job, sometimes patience is recommended.

DT Pro isn’t just a Finder replacement, which is to say that it doesn’t just have to remember all of the files and their locations in the database. It’s also looking at every text string in every document, noting the occurrences of every string in every document in the database and compiling a concordance of the whole. When you toss a few new documents into a database, that takes place seamlessly, i.e. you won’t notice a significant lag time. But tossing seventy thousand documents at a database will result in a significant lag time. There’s a lot of processing going on while the database is digesting seventy thousand documents.
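To give a feel for what compiling a concordance involves, here is a minimal Python sketch. It is only an illustration of the general idea, not DEVONthink's actual implementation: the function names (`build_concordance`, `search`), the tokenizing rule and the sample documents are all made up for the example.

```python
import re
from collections import defaultdict


def build_concordance(documents):
    """Map each lowercase word to {document name: occurrence count}.

    `documents` is a hypothetical dict of {name: text}; a real database
    would read file contents from disk instead.
    """
    concordance = defaultdict(dict)
    for name, text in documents.items():
        # Assumed tokenizer: runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            concordance[word][name] = concordance[word].get(name, 0) + 1
    return concordance


def search(concordance, word):
    """Return the sorted names of documents that contain `word`."""
    return sorted(concordance.get(word.lower(), {}))


# Two tiny made-up documents stand in for a real collection.
docs = {
    "memo.txt": "Sampling methods for water quality",
    "notes.txt": "Water sampling and statistical sampling",
}
index = build_concordance(docs)
print(search(index, "sampling"))
```

Every word of every document has to pass through a loop like this once, which is why the work grows with the total word count of an import, while a later search is just a lookup in the finished table.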

If you hadn’t terminated the operation, perhaps the database would have finished ‘digesting’ the content, resulting in a successful database build. So patience is sometimes a virtue. On the other hand, something might have gone wrong and the build might not have gone to completion because of memory corruption from import of a corrupted file, or even insufficient free drive space to hold all of the swap files and cache files required to get to build completion. We don’t know.

I’m not certain that your computer has enough horsepower to manage your database in such a way that it will perform quickly. When you open a database the text contents and metadata will be loaded into memory. A very large database will quickly consume all of the computer’s physical RAM and then start using Apple’s Virtual Memory, which requires creation of VM swap files on the hard drive – and a resulting swapping of data back and forth between RAM and the disk during database operations.

Apple’s Virtual Memory is very efficient, in the sense that it will allow large amounts of data to be called into memory and processed, even when there’s insufficient physical RAM to hold all of the data. The process will go to completion; but if VM is heavily used, the process will become much slower than if all of the data to be processed could fit into RAM. Disk access is incredibly slow, compared to RAM access.

It’s great to have fast CPUs, but the real bottleneck in a large document database will be memory. RAM is good, the more the better.

I’m spoiled. I spend most of my working time inside my databases, especially my main one. I expect many of my searches to produce results in a few milliseconds; I expect searches to be orders of magnitude faster than Spotlight searches, with the results presented in a much richer environment than Spotlight searches. When I start a series of AI operations such as ‘See Also’ or ‘Classify’ I’ll put up with a few seconds before the first results are produced, but I expect each subsequent set of results to be presented with no discernible pause.

I’m managing well over 100,000 documents in several databases. I work with collections of databases that reflect my particular interests. So I’ve broken my collections into ‘topical’ databases so that I can work on a database on my MacBook Pro 2.0 GHz dual core with 2 GB RAM, or on my Power Mac G5 2.3 GHz dual core with 5 GB RAM.

My main database reflects my professional interests in environmental science and technology and related policy and legal and regulatory matters. In that sense it’s topical, although in fact it covers a broad spectrum of disciplines. It has about 21,000 documents and runs a bit over 24 million words in content.
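For a rough sense of scale, those figures can be projected onto a seventy-thousand-document import. This is back-of-the-envelope arithmetic only, and it rests on a big assumption: that the imported documents average about as many words as the ones in my main database, which may well not be true of your collection.

```python
# Figures quoted above for the main database.
main_docs = 21_000
main_words = 24_000_000

# Size of the import under discussion.
import_docs = 70_000

# ASSUMPTION: the imported documents have the same average length
# as those in the main database. Real collections can differ widely.
words_per_doc = main_words // main_docs
projected_words = main_words * import_docs // main_docs

print(f"average words per document: about {words_per_doc:,}")
print(f"projected words in the import: about {projected_words:,}")
```

Under that assumption the import would hold more than three times the text of the main database, all of which has to be indexed and then held in memory when the database is opened.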

It runs briskly on my MacBook Pro, although after I’ve been working it hard for a few hours it will have accumulating pageouts and increasing use of Virtual Memory. If it slows down, I’ll quit and relaunch DTPO to get back to full speed. On my Power Mac G5 with 5 GB RAM I never see pageouts or slowdowns.

Another of my databases is quite large and covers financial and tax records. When I need to delve into those matters, I switch to it. But if I incorporated that with my main database, there would be no practical advantage, and I would expect slower operation, which I would find frustrating.

Another of my databases is devoted to a very large collection of items dealing with the Apple Newton.

Still another large and rapidly growing database covers detailed methodologies for environmental data sampling, analytical methodologies and statistical evaluation procedures. Those matters are easily separable from my main database. By separating the materials I not only gain in operational speed, but in increased focus of the artificial intelligence operations.

I know nothing about the nature and content of your collection of seventy thousand documents. Could they in fact be separated into topical or interest databases of a size that would be speedy on your PowerBook?

Of course, there are Macs that could easily fly through a collection of your size.

Future versions of DT Pro and DTPO will reduce the memory ‘footprint’ of databases and add speed advantages. Even so, practical considerations may still favor separating collections into topical or interest databases.

And of course if one uses the Web server feature in DTPO, there are other reasons why one might not wish to make some documents searchable and viewable by others.