text or pdf?

robertom · January 22, 2006, 5:24pm

is it better to convert PDFs to text? sometimes DT takes time to classify, do you think it can classify items quicker if the database is composed of text files instead of HTML and PDFs?

thanks

cgrunenberg · January 22, 2006, 8:09pm

No. The most important factor is how many words (both total and unique) are in the database; the kind of document doesn’t matter.

Bill_DeVille · January 22, 2006, 9:36pm

The file type really isn’t the significant issue.

When you are viewing a document and – for the first time in a session – press the “Classify” button, there may be a pause while code and data are brought into working memory.

If you have plenty of free RAM, that code and data are brought into your physical RAM and processing proceeds quickly.

But if your computer needs to use Virtual Memory, it will be swapping code and data back and forth between physical RAM and files on the hard drive, including VM swap files. Since data access from the hard drive is many, many times slower than data access from physical RAM, this can slow down an operation. Of course, the worst case will be when there’s so little free physical RAM available that your computer has to access the hard drive for every step of the processing.

Other than the obvious fact of slow operations, there are three indicators of this situation: little available free RAM; a buildup in Virtual Memory swap files; and (most telling) a large number of Virtual Memory page-outs.
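
For anyone who wants to watch the free RAM and page-out indicators from Terminal, here is a minimal sketch in Python. It assumes the macOS `vm_stat` command and its usual "Pages free:" / "Pageouts:" text output. The function names and the sample text are made up for illustration, and none of this is part of DT Pro.

```python
import re
import subprocess


def parse_vm_stat(text):
    """Parse vm_stat-style output into a dict of counters plus the page size."""
    stats = {}
    size = re.search(r"page size of (\d+) bytes", text)
    # Assumption: fall back to 4096 bytes if the header line is missing.
    stats["page_size"] = int(size.group(1)) if size else 4096
    for line in text.splitlines():
        # Matches lines such as 'Pages free:      25600.'
        m = re.match(r'\s*"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def memory_pressure(stats):
    """Return free RAM in MB and the cumulative page-out count."""
    free_mb = stats.get("Pages free", 0) * stats["page_size"] / (1024 * 1024)
    return {"free_mb": free_mb, "pageouts": stats.get("Pageouts", 0)}


def read_vm_stat():
    """Run vm_stat (macOS only) and return its text output."""
    result = subprocess.run(["vm_stat"], capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample in the shape vm_stat prints; the numbers are invented.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                          25600.
Pages active:                       100000.
Pageouts:                               42.
"""

if __name__ == "__main__":
    print(memory_pressure(parse_vm_stat(SAMPLE)))
```

On a Mac, `memory_pressure(parse_vm_stat(read_vm_stat()))` would give live numbers. A page-out count that keeps climbing between two readings is the telling sign described above.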

Another DT Pro feature that uses artificial intelligence and lots of processing is “See Also”. Running out of free physical RAM will also slow down this procedure.

The most critical phase of operations is when one switches from one DT Pro set of processes, such as searches, to “Classify” or “See Also”. For the first instance of the new operation, code will be switched into memory along with data from the database. Even on a powerful computer with lots of free RAM, there might be a pause of a second or two to shift to the new set of processes and complete the first operation, especially if the document being viewed at the time is large.

Subsequent instances of the “Classify” or “See Also” operation will be faster than the first (assuming comparable document sizes). That’s because the code to implement the procedure is already available and doesn’t have to be fetched again.

Bottom line: Slowdowns in DT Pro operations almost always result from insufficient free physical RAM and the consequent need to resort to repeated access to the hard drive to pull in code and data. Although it’s nice to have a fast CPU, available free physical RAM is the most critical speed factor.

Comparison of my three Macs on the same database: My main DT Pro database contains about 18 million total words and tens of thousands of reference documents. I can run it satisfactorily on each of my three computers as described below. But the best results are on the computer with the largest physical RAM.

- TiBook 500 MHz G4, 1 GB RAM: While traveling, I can accomplish useful work on this machine. But I often notice significant slowdowns after an hour or so of working it hard, including extensive use of the AI features. I monitor the buildup of VM swap files and free RAM, and it’s evident that when the total size of the VM swap files equals or exceeds the amount of physical RAM, I need to close and relaunch DT Pro to get some speed back (or reboot to clear the swap files).

- iMac (Rev. B) 20", 2 GHz G5, 2 GB RAM: This is, of course, much speedier than the TiBook. I can work my DT Pro database for hours with good performance. By “work” I mean adding new content, writing and editing, and heavy use of the searching and AI features. But even on this computer, I can see buildup of VM swap files over time. If I work DT Pro heavily for a day or two, I’ll accumulate 5 VM swap files and see the page-out numbers start to increase. At that point, it’s time to quit and relaunch DT Pro, or else watch processes slow down.

- PowerMac dual core 2.3 GHz G5, 5 GB RAM: Again, a big step up in speed. But with 5 GB RAM, I can run DT Pro heavily for several days without seeing a single page-out, indicating that there’s sufficient free physical RAM to avoid swapping data back and forth between VM swap files and free RAM. I almost never see more than the standard one 64 MB VM swap file, and have never exceeded 2 VM swap files so far. When I switch between procedures in DT Pro, e.g. from searching and editing to classifying, I may see a pause of a second or two while DT Pro sets up for the new procedure. But after that, when classifying a series of documents, there’s no perceptible slowdown in moving on to the next document – it pops up immediately. It’s really fast, and stays fast. Most single-term searches take 50 milliseconds or less using Ignore case. An Exact search for a single term often takes 5 milliseconds or less. That’s awesome.
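
The rule of thumb running through the three machines above (relaunch once the swap files add up to at least the physical RAM, or once page-outs start climbing) can be written down as a small check. This is only a sketch of that heuristic in Python. The function name, its parameters, and the example swap file sizes are hypothetical, and the readings have to be collected by hand or by some other tool.

```python
def should_relaunch(swap_file_sizes_mb, physical_ram_mb, pageouts_before, pageouts_after):
    """Apply the rule of thumb from the comparison above.

    swap_file_sizes_mb: sizes of the current VM swap files, in MB (hypothetical input)
    physical_ram_mb:    installed physical RAM, in MB
    pageouts_before/after: two page-out readings taken some time apart
    """
    swap_total_mb = sum(swap_file_sizes_mb)
    # Sign 1: total swap has grown to match or exceed physical RAM.
    swap_heavy = swap_total_mb >= physical_ram_mb
    # Sign 2: the page-out counter is increasing between readings.
    paging = pageouts_after > pageouts_before
    return swap_heavy or paging


if __name__ == "__main__":
    # Invented readings: a 1 GB machine whose swap files total 1024 MB.
    print(should_relaunch([64, 64, 128, 256, 512], 1024, 0, 0))
    # Invented readings: a 5 GB machine with one 64 MB swap file and no page-outs.
    print(should_relaunch([64], 5 * 1024, 0, 0))
```

The thresholds are just the ones described in this post, not anything DT Pro itself reports or enforces.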

Is 5 GB RAM overkill? Probably. The reason I went for that much in the new PowerMac was because I was running projects that stressed the iMac. Example: I did a volunteer project to pull down all available Web data on the effects of Hurricanes Katrina and Rita on Louisiana’s health care infrastructure. I used DEVONagent to pull down more than 10,000 HTML pages and dumped that material into a new DT Pro database. I then heavily used searching and AI features to filter down and organize about 5,000 documents that helped analyze the effects – which have been extremely severe. Making results available as soon as possible becomes a productivity issue, rather like a professional photographer’s use of Photoshop. Saving a few hours can sometimes make all the difference! In this case, the iMac did the job, but I’ve got some more projects that are really benefiting from the “more horsepower” capabilities of the PowerMac (and making life easier for me at the same time).

That term “Professional” in the name of DEVONthink Professional isn’t hype. DT Pro enables one to do some amazing things. I can do those things on my old, slow TiBook. But when time is of the essence, it’s a pleasure to do them on the PowerMac.

robertom · January 23, 2006, 1:15pm

thanks.

I have a PB with 1 GB RAM and thought it was more than enough…

Bill_DeVille · January 23, 2006, 8:59pm

It’s a lot faster than my old TiBook.

I used the TiBook to display that hurricane effects database. It did a good job. When I talked about stressing the 2 GB RAM iMac, that was happening because I kept standing the database on its head to quickly filter it down in size but with increased content relevance. I was using a lot of tricks. Example: I had over ten thousand individual HTML documents with no organizational structure. I didn’t even have a total picture of the kinds of information in the database. So I started by doing some searches to look for things of interest. Then I could create a new group and move the most relevant search results into that group. That was the start of an organizational structure. I made the organizational structure more fine-grained by picking a document, running See Also and either moving or replicating the most relevant suggested documents into another new group or subgroup. In a couple of hours, with comments from health care professionals, I had fleshed out an organizational structure for the documents. Then I ran Auto Classify on the ungrouped items to help fill out the organization. There were still thousands of unclassified documents. I ran Auto Group on that residue, which resulted in some new groups that made sense (and others that didn’t), and the useful ones were renamed and fitted into the structure. That was iterative work for two or three hours.

I ended up with about 5,000 still unclassified documents. Much of that material was redundant or of little immediate relevance. We were interested in the big picture, and the classified material gave good coverage of topics and data for that purpose – closed hospitals and clinics, areas without emergency room services, etc. So I made a copy of the database and deleted the unclassified items. They are still available in the original if there’s a need to do further work, but that probably won’t have to be done.

What went into the database via DEVONagent searches? Sources included the professional health services Web sites in Louisiana, state, local and federal government sites, and a variety of Internet sources including newspaper articles.

I made heavy use of the AI features in DT Pro to select and organize the material. From starting the DEVONagent searches to finished database took less than 3 days, representing perhaps 12 hours of my time (I was doing other things at the same time).

Without the assistance of DEVONagent and DEVONthink Pro, such a project would take weeks of intensive effort. But the whole purpose of this exercise was to pull together information quickly, because decisions had to be made quickly. As it was, the iMac took the work stress (better it than me, is my philosophy). The CPU was pegged out a good deal and there were a number of times when I wished for more than 2 GB RAM.