Import, Index, File Size, and Speed

I’m trying out DEVONthink Pro right now and see how it could be very useful. The main features I would like to use are search, classify, and see also.

My initial document import (containing text, PDF, HTML, DOC, images, etc.) would be about 800 MB. I’ve seen others mention they have larger databases. Would indexing rather than importing the information make for a smaller database and/or faster searching? Would indexing rather than importing improve performance overall?

Also, I noticed a couple of people mentioned that when their db became large, phrase searching, classify, and see also became painfully slow to the point they just stopped using it. These would be some of the main reasons I would use DT, so if that really is the case, it would be a deal breaker.

I’d appreciate any insights anyone has.

Thanks

If my understanding of the discussions on that point is correct, it was not the speed but two other problems:

  • Typing into the search field in the toolbar could slow down significantly because DT started searching with the first letter you typed. This is different in the separate search window, which works great with every kind of search in larger databases as well – phrase searches in my 4GB database return results instantly.

  • Some people – me being one of them – have stopped using “see also” and “classify” because we are working in several languages, and DT cannot find similarities between a Japanese and a German text about the same topic. There have been discussions about workarounds – with index files etc., but for the time being, I can live with the great search tool.

Don’t worry about speed!

Maria

Thanks for the comments. That definitely helps alleviate my worries about search speed, see also, and classify. Does anyone have any thoughts on whether it is better to simply index contents rather than import them? I can see how each method could be superior. If you simply index contents, then the db stays small and is potentially lighter and faster. However, it could be that by importing contents the contents are “optimized” so that even though the database is larger it functions optimally.

Any thoughts?

Thanks again

It’s true that Index-imported files, e.g., PDF documents, require less ‘space’ in the DT/DT Pro database. And because the database ‘knows’ all the text strings that are contained in such documents, simple searches involving AND or OR multiple strings work very well.

But if searches of database documents are made that depend on the relative positions of text strings within the body of the text, there are some limitations of Index-imported documents. Currently, this limitation affects primarily Phrase searches, which can be multi-string searches in a specified order.

As I understand the result of an Index import, DT/DT Pro ‘knows’ all of the individual words contained within a document, but doesn’t ‘know’ about occurrences of multiple word strings, without going back to the original text to take a look.

Let’s look at a hypothetical example of two documents; one contains the string “North Pole” and the other doesn’t contain that string but does contain “telephone pole” and “North America”. Let’s look at two cases: Case 1 - the database contains the full text of the documents; Case 2 - both documents have been Index imported. Note that I may get different results depending on whether I select “Exact” or “Ignore case”.

Case 1a: The results of an AND (ignore case) search for “North” and “Pole” will find both documents.

Case 1b: The results of a Phrase (ignore case) search for “North Pole” will find only the document that contains the string “North Pole”.

Case 2a: The results of an AND (ignore case) search for “North” and “Pole” will find both documents.

Case 2b: The results of a Phrase (ignore case) search for “North Pole” will not find either document.
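
To make the difference between the two cases concrete, here is a minimal Python sketch. It is not DEVONthink's actual implementation; the document names, sample texts, and function names are all made up for illustration. It assumes an Index import keeps only the set of words in each document, while a full import keeps the text itself.

```python
import re

def tokenize(text):
    # Lowercase and split into words ("ignore case" behaviour).
    return re.findall(r"[a-z0-9]+", text.lower())

# Two hypothetical documents matching the example above.
docs = {
    "doc_a": "Explorers reached the North Pole in winter.",
    "doc_b": "A telephone pole fell somewhere in North America.",
}

# Assumption: an Index import stores only WHICH words occur, not their order.
word_index = {name: set(tokenize(text)) for name, text in docs.items()}

def and_search(terms):
    # AND search: every term must occur somewhere in the document.
    wanted = [t.lower() for t in terms]
    return sorted(n for n, words in word_index.items()
                  if all(t in words for t in wanted))

def phrase_search_full_text(phrase):
    # Phrase search over the stored text: the words must be adjacent, in order.
    target = tokenize(phrase)
    hits = []
    for name, text in docs.items():
        tokens = tokenize(text)
        windows = range(len(tokens) - len(target) + 1)
        if any(tokens[i:i + len(target)] == target for i in windows):
            hits.append(name)
    return sorted(hits)

def phrase_candidates_from_index(phrase):
    # With only the word sets, the best we can do is list documents that
    # contain all the words; order and adjacency are unknown.
    return and_search(tokenize(phrase))

print(and_search(["North", "Pole"]))               # both documents
print(phrase_search_full_text("North Pole"))       # only doc_a
print(phrase_candidates_from_index("North Pole"))  # both: index cannot decide
```

In this sketch the word-set index cannot tell the two documents apart for the phrase “North Pole”. A program in that position either has to reread the original text or decline to answer, which is consistent with the behavior described in Case 2b.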

That’s a problem for me. Sometimes a Phrase search is just what I need. Over the evolution of DT Pro (a total of more than 50 alphas and betas prior to the public betas) there have been variants of the Phrase search operation. A successful Phrase search of Index-imported content can be done if the operation forces a reexamination of the original documents’ text content; but that really slows things down, of course. Currently, that isn’t done (at least with my database).

So I’ve quit Index-importing of PDF documents, so that I can perform more comprehensive Phrase searches. One of these days, because DT Pro is so scriptable, I may explore the possibility of automatically identifying PDFs that have previously been Index-imported, with the hope of automatically reimporting them using PDFKit. (I’d appreciate it if some kind soul would develop such a procedure.)

On my computer, with my large database, Phrase searches are extremely fast (and most content is not Index-imported). Some users report a different experience, with slow Phrase searches. I’m not certain why there’s such a variance.

At some point in the evolutionary development of DT Pro we will see much more powerful Boolean search operators such as NEAR(x), BEFORE(x) and so on. I would expect that Index-imported documents won’t work with such operators without a forced reexamination of the content of the original documents.
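
As a rough illustration of why such operators need more than a list of words, here is a small Python sketch. The semantics are my assumption, since the operators have not been specified: NEAR(x) is taken to mean “within x words, in either order” and BEFORE(x) to mean “the first term occurs at most x words ahead of the second”. The function names are hypothetical. Both need word positions, which a plain word list does not record.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split into words.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positions(text):
    # Map each word to the list of positions where it occurs.
    positions = defaultdict(list)
    for i, word in enumerate(tokenize(text)):
        positions[word].append(i)
    return positions

def near(positions, a, b, x):
    # Assumed NEAR(x): a and b occur within x words of each other, any order.
    return any(abs(i - j) <= x
               for i in positions.get(a.lower(), [])
               for j in positions.get(b.lower(), []))

def before(positions, a, b, x):
    # Assumed BEFORE(x): a occurs ahead of b, at most x words earlier.
    return any(0 < j - i <= x
               for i in positions.get(a.lower(), [])
               for j in positions.get(b.lower(), []))

pos = build_positions("A telephone pole fell somewhere in North America.")
print(near(pos, "north", "pole", 4))    # "pole" is 4 words from "north"
print(near(pos, "north", "pole", 3))
print(before(pos, "pole", "north", 4))
print(before(pos, "north", "pole", 4))
```

Without the position lists, neither function could be answered from the index alone, which is why I expect a forced reexamination of the original documents.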

So I’ve begun to avoid Index-imports except for special purposes. I’ve got enough free drive space, CPU speed and RAM that I shouldn’t need to worry about the demands on them posed by other import methods. If I copy imported PDFs to my database Files folder, they don’t add to the size of the DT Pro database file (although, of course, they add to the size of the database package file).

Bottom line: Yes, there are pros and cons for choosing types of import.

Thanks for the thorough response. Does importing to the database folder render the same improvements in phrase searching as importing into the database?

Thanks

Yes. The critical point for PDFs is to set Preferences > PDF & PS / Index & Convert to check ONLY Use PDFKit (Tiger). (I assume you are using Tiger.) Note on that same preference panel that you retain the checkable options to link externally to a PDF file, to copy it into the body of the database, or to copy it into the database’s Files folder.

Then you will have an attractive DT/DT Pro display of PDF documents and Phrase searches will work.

There’s no difference - phrase searching will work in both cases.

However, indexing and phrase searching are currently mutually exclusive (you get either indexing or phrase searching), but this will change in v2.0. In addition, the performance of importing and indexing will be identical in v2.0 (at the moment indexing is a lot faster and creates smaller databases, which therefore need less memory).

Thanks, Christian. That clarifies my concerns about any future limitations of indexed files when version 2.0 comes along. So I won’t have to worry about reimporting some of my older files. :)

But for a current project, I’ll stick to importing PDFs via File > Import > Files & Folders instead of Index, because I do need Phrase searches for the files in that project.