Speed problems when accessing search results

The following problem happens to me every time I perform the described search:

My database has several tens of thousands of entries of all types in it. Also, I did a Backup & Optimize right before doing this test. Anyway, I open up a search window, constrain the search down to a particular section – which has about 1500 text and HTML files in it – and search for All words, Fuzzy match: “society affected which live”. It finds 189 items in 11.75 seconds. So far this is great.

I then click on the top entry in the results list (note: the file I’m looking for is entry #2, so the search is working well). At this point I get a wait cursor.

It is 18 minutes later now, and I still have a wait cursor. This happens every time, even after restarting. So far I have not been patient enough to find out when DT will come back to me. Activity Monitor says it is sometimes “not responding” and sometimes pegging the CPU. The file in question is a 1.8 MB plain text file, as are many of the returned search items. It is not practical for me at this point to break the files into “pages”.

Is there a preference setting I can turn off? Is DT doing something that can be pushed to a background thread, so I can see the contents of these search hits right away? I’m really hoping I can use DT as my all-around information library, but so far I’ve been forced to export all my files to disk and do these searches using “grep” again.

Let me know if there’s anything else I can do. I’ve gathered profiling results here:

johnwiegley.com/sample.txt

Thanks, John

John, you didn’t mention your CPU type and speed, RAM and amount of free disk space.

My guess is that you are getting bogged down in Virtual Memory, so that instead of operations happening in free RAM, they continually have to access the VM swap files and other data on the drive. Of course, disk access is REALLY slow compared to RAM access, and it gets even worse if there’s not much free hard drive space. I suspect that a Fuzzy search works the CPU harder than an “Ignore case” search, but that’s not the main problem leading to the slowdown.

Opening a 1.8 MB text file takes some memory, of course. Then there’s the added work: while opening the file, DT Pro scans through it to highlight the search terms and scroll down to the first occurrence, probably with zero free RAM left by that time. Smaller files would be less “stressful”, but the real solution is more RAM, if possible.

I run big DT Pro databases on two machines: a TiBook G4 (500 MHz, 1 GB RAM, 60 GB HD, which has less free space than I would like) and a Power Mac G5 (dual-core 2.3 GHz, 5 GB RAM, 500 GB HD).

I can do just about everything on the TiBook that I can do on the PowerMac, but sometimes with difficulty. On the TiBook I have to monitor the accumulation of Virtual Memory swap files as an indicator of impending slowdown. There are times when, in the midst of what I’m trying to do, I have to stop and reboot the TiBook to get some free RAM back so that I can finish my task. Sometimes, quitting DT Pro and then relaunching it will speed things up for a while. Note that this is with BIG databases. I can run relatively small databases on the TiBook very satisfactorily. As my CPU and RAM are maxed out, the only thing I can do to improve performance is to free up some more HD space, which I plan to do.

But the PowerMac flies through everything. Monitoring VM use, I see that at the moment I’ve got 4,178 MB free RAM and one 64 MB swap file in use (which rarely if ever grows). Since my last restart I’ve had nearly a million page-ins and zero page-outs. Sure, I’ve got a lot more CPU speed and power in the Power Mac than in the TiBook, but the amount of physical RAM probably makes the most significant difference in avoiding slowdowns.
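
For anyone who wants to watch the same numbers on their own machine, here is a rough sketch in Python. It is only my illustration, nothing from DEVONthink, and it assumes Mac OS X, where the swap files live in /var/vm and the vm_stat tool reports the paging counters:

    import glob
    import os
    import subprocess

    def swap_in_use_mb():
        """Total size of the Mac OS X swap files in /var/vm, in MB."""
        paths = glob.glob("/var/vm/swapfile*")
        return sum(os.path.getsize(p) for p in paths) / (1024.0 * 1024.0)

    def pageouts():
        """Read the cumulative 'Pageouts' counter from vm_stat."""
        output = subprocess.check_output(["vm_stat"]).decode("utf-8")
        for line in output.splitlines():
            if line.startswith("Pageouts"):
                # The line looks like: "Pageouts:        12345."
                return int(line.split(":")[1].strip().rstrip("."))
        return 0

    print("Swap in use: %.0f MB" % swap_in_use_mb())
    print("Page-outs since boot: %d" % pageouts())

If the swap total keeps growing, or the page-out counter climbs while you work, you are in the situation described above: the system is substituting disk for RAM.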

Obviously, when I’m working with a big DT Pro database, I choose the Power Mac – especially if I’m on a tight schedule. :-)

DT Pro 2.0 will introduce a new database structure. The “body” of the database will be smaller, especially in your case where the database consists mostly of text and HTML files. You will see speed improvements for many operations, especially on computers with limited physical RAM.

Sorry, I forgot to mention my specs: this is one of the newest 15" PowerBooks, with 2 GB of RAM, a 1.67 GHz CPU, and the 7200 RPM hard drive. I made sure to kill all processes except for a little clock timer before running this test.

John, that’s a nice PowerBook, and blows away my old TiBook.

I’ve also got a Rev. B iMac, 2 GHz G5, 2 GB RAM and 250 GB HD. I found that Virtual Memory swap files grew on it too, although with a really big database it goes a lot further than the TiBook before slowing down. That’s why I went for 5 GB RAM on the Power Mac, which seems to be enough to eliminate growth of VM swap files entirely. I should emphasize that this is with a really big database like yours; it has no problems with ‘reasonable’-sized databases.

Here’s a comparison. Using your search string, “society affected which live” with a Fuzzy search, my Power Mac found 157 items in 1.815 seconds. That’s pretty quick.

But I was bragging too much about my Power Mac. The first search result was a Project Gutenberg text file, Darwin - The Descent of Man, size 1871 KB. I clicked on it to open it from the Search window. Got the spinning ball, with one CPU running about 95% and the other about 20%, and force-quit after 23 minutes.

Repeated the search, but this time used “Ignore case” instead of a Fuzzy search. Got 105 results in 1.944 seconds. This time, the Project Gutenberg file opened in about 4 seconds with all occurrences of the search terms highlighted, and scrolled down to the first term occurrence. As a second test, I opened a 12 MB document. DT Pro took 20 seconds to open it, with all the search terms highlighted.

Conclusion: When a Fuzzy search with multiple terms is run and a result is selected, DT Pro looks throughout the text for all possible spelling variants of each term. There’s the bottleneck: the CPU has a heck of a lot to do under those circumstances. By comparison, an Ignore case search requires DT Pro only to find and highlight the actual search terms before opening the document. The larger the document and the greater the number of terms in the Fuzzy query, the greater the demand on processing power.
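
To make that concrete, here is a toy model in Python. It is only my illustration, emphatically not DT Pro’s actual routine: the fuzzy pass has to score every word of the document against every query term (here with a Levenshtein edit distance), while the Ignore case pass is a single linear scan for the exact, case-folded terms.

    import re

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # delete ca
                               cur[j - 1] + 1,            # insert cb
                               prev[j - 1] + (ca != cb))) # substitute
            prev = cur
        return prev[-1]

    def fuzzy_hits(text, terms, max_dist=1):
        """Score every word against every term: one full distance
        computation per word/term pair."""
        hits = []
        for m in re.finditer(r"\w+", text):
            word = m.group(0).lower()
            if any(edit_distance(word, t) <= max_dist for t in terms):
                hits.append(m.span())
        return hits

    def ignore_case_hits(text, terms):
        """One linear scan; only exact (case-folded) occurrences match."""
        pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(t) for t in terms),
                             re.IGNORECASE)
        return [m.span() for m in pattern.finditer(text)]

    terms = ["society", "affected", "which", "live"]
    sample = "Society is affected by those who live in it, which matters."
    print(fuzzy_hits(sample, terms))        # fuzzy matches, by character span
    print(ignore_case_hits(sample, terms))  # exact matches only

On a file of almost two million characters, the fuzzy pass runs an edit-distance computation for every word against every term; the Ignore case pass is a single trip through the text.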

So don’t feel bad about your PowerBook. I’ll ask Christian if there’s any way to optimize the routine for opening a document from the Search results list.

What’s the nature of the top entry? E.g. a PDF/PS document, an HTML page or a rich text document? And how large is it?

Christian:

I’ve been experimenting with a number of Fuzzy searches today.

And I’m not having a problem opening documents in the results list from the “society affected which live” query string, except for the text document that appeared at the top of the list. It’s a text file downloaded from Project Gutenberg: Darwin, The Descent of Man. Size is 1,821 KB. That’s the ONLY file that results in an endless spinning beach ball. I’m able to open other, much larger text, HTML and PDF files without a problem.

Supposition: There’s something unique about that Gutenberg text file that causes a problem in the Fuzzy search results list when it’s selected for display. If I do an Ignore case search for the single term “Darwin”, that same file opens quickly.

John, I’m back to using Fuzzy searches again, as I’ve hit only one file so far that ties up DT Pro on my machine when selected. I’ll send a copy of that file to Christian. You may have run into a similar situation in your database. Try opening some of the other files in your Fuzzy search results list, starting with small ones, and see what happens.

John & Bill,

I was able to speed up the highlighting of occurrences in plain/rich text documents after a fuzzy search by a factor of around 100. But it might still take a few seconds if a file is that huge and contains almost 2 million characters.
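
To give a rough idea of where a factor like that can come from, here is a toy sketch building on the fuzzy_hits example earlier in this thread (same import and edit_distance; simplified, and not our actual code). A huge book repeats the same words constantly, so caching the verdict for each unique word, and skipping any word whose length alone rules out a match, removes most of the work:

    def fuzzy_hits_fast(text, terms, max_dist=1):
        """Same toy semantics as fuzzy_hits above, but far cheaper."""
        verdict = {}  # unique lowercased word -> does it match any term?
        hits = []
        for m in re.finditer(r"\w+", text):
            word = m.group(0).lower()
            if word not in verdict:
                verdict[word] = any(
                    # Lengths differing by more than max_dist can never be
                    # within max_dist edits, so skip the full computation.
                    abs(len(word) - len(t)) <= max_dist
                    and edit_distance(word, t) <= max_dist
                    for t in terms)
            if verdict[word]:
                hits.append(m.span())
        return hits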