DTTG Strategy for Indexing Large Documents?

Hi-

I wonder if a DEVONthinker could give (or point me to) a description of the indexing strategy that DTTG uses? It’s currently limited to <20k, so the question becomes what happens when DTTG encounters a document that will create an index larger than this. I have a PDF that’s 632Kb or 3,159 unique words, and I find it hard to understand what search will find (or not) within this document.

On the first page, DTTG will find “Israel,” but not “socially.” DTTG will find “consumption,” which has a high word frequency of 193 (the fourth most frequent word), but not “Routledge,” which has a frequency of 17 and is the 113th most common. DTTG will return “2009” even though “1997” is more frequent.

So a little understanding of what’s being left out would help in determining how to utilize your iPad app. Also, the statement of “index size” is a bit opaque to a user, who has no idea how to relate that to the “real world,” or even to information that DTPO provides. Does a word take a fixed number of bytes in the index? Any way to convert 20k to something meaningful like words or characters?

Also, if the limit is going to remain for a while, it would be great to be able to view the document index, not necessarily on DTTG, but on DTPO, so a user could get a finer grained idea of what was going on.

Thanks!

Charles

It simply takes the first 20 KB of the file for the index, so you can only find words within the first 20 KB. We might drop this limit in a future release; it was introduced to increase search performance on memory-limited devices.
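A minimal sketch of the behaviour described in this reply, assuming the limit applies to the UTF-8 bytes of a document's plain text and that "20 KB" means 20 × 1024 bytes. The names `indexed_prefix` and `is_searchable` are made up for illustration and are not DTTG's actual code.

```python
INDEX_LIMIT_BYTES = 20 * 1024  # assumption: "20 KB" means 20 * 1024 bytes


def indexed_prefix(text: str, limit: int = INDEX_LIMIT_BYTES) -> str:
    """Return the part of `text` that fits in the first `limit` UTF-8 bytes."""
    raw = text.encode("utf-8")[:limit]
    # The cut may land inside a multi-byte character; drop the broken tail.
    return raw.decode("utf-8", errors="ignore")


def is_searchable(text: str, term: str, limit: int = INDEX_LIMIT_BYTES) -> bool:
    """True if `term` occurs (case-insensitively) inside the indexed prefix."""
    return term.lower() in indexed_prefix(text, limit).lower()
```

Under these assumptions, a word is findable only if it starts and ends before the byte limit, regardless of how frequent it is in the rest of the document.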

At least in the German App Store full-text search is promised. I think you should make users aware of this limitation.

Birgitt

Ahh, it’s not a bug, it’s a feature!

That makes my entry in the bug tracker system obsolete,
should have read this post in the first place. :wink:

Joking aside, a very common use case is the use of DTTG as a repository for PDF files up to 1.5 MB in size in meetings.
Especially in the UK Public Sector you have a lot of governmental papers, policies etc. which it is useful to be able to refer to ‘on the spot’ w/o schlepping around loads of paper.
Quite frankly, indexing only a limited subset renders DTTG useless in this scenario, because as a user I simply cannot know whether a search term will be found or not, due to the previously stated limitations.

Besides the point that there is a technical limit based on memory, I would strongly suggest showing a warning when a document is selected in DT Pro/Office, making it clear to the user that the document will be synced but will not be completely searchable.

Thanks for this, but my first example seems not to bear out your assertion:

The original PDF document is 16,271 (non-unique) words or 107,662 bytes. I also created a subset PDF of the first two pages, 321 (non-unique) words or 2,495 bytes. So the subset is well within the <20Kb limit, which I have set.

So if I search for “Israel,” it is returned in both documents, but if I search for “situated” only the smaller subset PDF returns a hit. The distance between these words is only 181 words or 1,187 bytes.

My word/byte measurements are done by taking the “plain text” element of the DTPO database and getting the statistics via TextMate. The actual files count ~651Kb and ~410Kb respectively, but I presume you aren’t indexing any of the PDF overhead.
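For anyone who wants to reproduce measurements like these without TextMate, here is a rough sketch. It assumes words are whitespace-separated runs and bytes are counted in UTF-8, so the numbers may differ slightly from TextMate's statistics. The function names are hypothetical.

```python
import re


def text_stats(text: str) -> dict:
    """Byte, word, and unique-word counts for a plain-text string."""
    words = re.findall(r"\S+", text)
    return {
        "bytes": len(text.encode("utf-8")),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
    }


def first_offset(text: str, term: str):
    """Return (byte_offset, words_before) for the first hit, or None."""
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return None
    before = text[: match.start()]
    return len(before.encode("utf-8")), len(before.split())
```

Comparing the byte offset returned by `first_offset` with the configured index limit shows whether a given word ought to be findable if only the start of the text were indexed.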

I’ll send my PDF files to support, and perhaps they can figure out what I might be doing wrong here.

Best, Charles

Edit: suggestion removed since it merely inflamed those already irritated. BTW, I don’t work for DTech. :frowning:

Not to be a pill, Korn, but the hope is to search across documents, which is the strength of DTPO on the Mac. Searching within PDFs on the iPad is pretty much conquered territory.

But for me, this is even worse. Here’s a 66Kbyte TEXT file. The word “boil” occurs 3,992 bytes, 709 words in. So it should be within the first 20K of the file as per Eric’s post above. I’m not able to get DTTG to find the word “boil” in it. Can anyone else? (With apologies to Gilles Deleuze.)

http://vze26m98.net/devon/lotsa_text.zip

I’ll report this to Support, but I haven’t got a ticket issued yet for my previous post.

Best wishes, Charles

I really think that it is not okay to refer users to other apps to do what DTTG promises to do, i.e. full-text search. In this respect, the company is misleading the customer about what to expect.

Birgitt

Okay. I took apart a DTTG iTunes backup, and the document data is stored in a Core Data SQLite backing store. I looked at the ZFULLTEXT field for the “lotsatext.txt” file above, and all that’s there is the first 1,025 bytes, or 179 words. So I can find “Kantianism” but not “elaboration” in the first paragraph.
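A hedged sketch of how such a check could be scripted with Python's built-in `sqlite3` module. Only the `ZFULLTEXT` column name comes from the backup described above; the table name and the name column are not given in this thread, so the caller must supply them, and the ones shown in the usage note are placeholders.

```python
import sqlite3


def fulltext_sizes(conn, table, name_col, text_col="ZFULLTEXT"):
    """Return [(name, stored_bytes)] sorted by stored size, smallest first."""
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for ident in (table, name_col, text_col):
        if not ident.isidentifier():
            raise ValueError(f"unexpected identifier: {ident!r}")
    # CAST(... AS BLOB) makes LENGTH count bytes rather than characters.
    query = (
        f'SELECT "{name_col}", COALESCE(LENGTH(CAST("{text_col}" AS BLOB)), 0) '
        f'FROM "{table}" ORDER BY 2'
    )
    return conn.execute(query).fetchall()
```

Usage might look like `fulltext_sizes(sqlite3.connect("backup.sqlite"), "ZDOCUMENT", "ZNAME")`, where the file, table, and name-column names are hypothetical and would need to be read from the actual store's schema.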

Going further, I dumped my entire DTTG database, and did an analysis of what percentage of a file’s full-text is indexed versus what is indexed in DTPO. You can find a spreadsheet of the results here:

http://vze26m98.net/devon/dttg-ft-analysis.zip

It’s sorted by the ascending percentage of text that is indexed by DTTG. If you compare the percentage indexed by the size of the plain text of the document, it appears that above 20Kbytes (which is where I have my indexing set) only the file’s first 1024 bytes are retained.
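The comparison in the spreadsheet can be approximated with a few lines of code. This is only a sketch: it assumes you already have, for each document, the byte size of the full plain text (from DTPO) and the byte size of what DTTG stored, and the function name and tuple layout are invented for illustration.

```python
def coverage_report(docs, limit=20 * 1024):
    """Rank documents by how much of their plain text was stored.

    `docs` is an iterable of (name, full_bytes, stored_bytes).
    Returns rows of (name, full_bytes, stored_bytes, percent, over_limit),
    sorted by ascending percent.
    """
    rows = []
    for name, full, stored in docs:
        # Treat an empty document as fully covered to avoid dividing by zero.
        percent = 100.0 * stored / full if full else 100.0
        rows.append((name, full, stored, round(percent, 1), full > limit))
    rows.sort(key=lambda row: row[3])
    return rows
```

If the pattern described above holds, every row with `over_limit` set would show a stored size of about 1,024 bytes, while rows under the limit would sit at or near 100 percent.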

So, to conclude, at least for my installation, DTTG craps out when trying to index a file whose plain text is greater than 20Kbytes, i.e. a bug on my machine.

HTH, Charles

:slight_smile: Seems like we’d have interesting things to talk about. Deleuze was interpreting Zarathustra’s boiling? Definitely haven’t read Also Sprach very carefully but have been looking at other texts… sorry for the offtopic, seen Nietzsche et la philosophie (untranslated) in an ebook?

I haven’t done much testing… does DTTG reindex docs if the index size setting is changed? Probably unrelated to what you’re going through. 11am here and didn’t sleep last night, so trying to go into tech mode and read what you said is a bit much for me now. :slight_smile: But I’ll try to take a look at searching as well.

Hi gyuen-

That was a fragment of Deleuze’s lectures on Kant from his class in the 1970s:

http://www.webdeleuze.com/php/sommaire.html

As far as my issue with DTTG goes, only 1Kbytes of plain text is stored for a file over the 20Kbyte index limit. DTTG indexes this just fine, so it’s hard to tell what’s going on. Effectively, it’s an import problem rather than an index problem.

Can’t tell you about how DTTG indexes the document. Didn’t really pore over the backup, and didn’t look at the SQLite indexes, or look for a custom index.

Not sure what would happen if I returned to the default 10Kbyte maximum. Perhaps everything would be fine, but then I’d have to decide whether to go with a lower limit, or live with the hard clip at 20K… And to do this, I’d have to re-sync my Inbox with DTTG because, as per the above, it’s an import problem: re-indexing will just index the same 1K all over again.

EDIT:
Tried a re-sync with the indexing limit set back to the default 10Kbytes and the trouble remains for me. Just to repeat, having not heard from anyone else who’s experiencing this so far indicates I’m the only one having the problem! :wink:

Best wishes, Charles

Charles,

I searched a few of his works and found talk of boiling and bubbling. As far as boiling goes, the best explanation seems to be from Nietzsche et la philosophie, and it seems integral to his later ideas, many of which I’d say are new names/concepts developed from Nietzsche. :slight_smile: It’d be fun to chat more sometime.

I played with search a bit more. Trying it only for a few minutes or so, it was a bit tedious to try to narrow down exact behavior.

Some things I noticed: I was searching for ‘Ister’ and partial matches like ‘sister’ show up, without any ordered preference for exact matches. I think the DT toolbar search box behavior and DTTG’s should eventually be the same. For iOS, I’d probably also prefer the ‘Name’ and ‘Name & Contents’ buttons to be icon-only buttons on the same line as the search box, with an OS X Mail-like toolbar (From, To, Subject, …) right below for search options like exact, partial, … Maybe that won’t be the UI exactly, but something like how Mail displays search options seems like the best idea. And the same for desktop DT, with a more advanced search option UI like the Finder Smart Group creation.
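To make the "exact matches first" idea concrete, here is a small sketch. It is only an illustration of the suggested ordering, not how DTTG or DEVONthink rank results, and `rank_hits` is a hypothetical name.

```python
import re


def rank_hits(term, titles):
    """Return titles containing `term`, whole-word matches first."""
    whole_word = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    partial = re.compile(re.escape(term), re.IGNORECASE)
    hits = [t for t in titles if partial.search(t)]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(hits, key=lambda t: 0 if whole_word.search(t) else 1)
```

With this ordering, a search for ‘Ister’ would still return ‘sister’, but only after any document that contains ‘Ister’ as a word of its own.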

And a bit off topic, both DTTG and DT need better ways of working with multiple documents. I suspect OS X will eventually adapt tabs like Safari in all apps, that’s the best thing I can think of at the moment, while DTTG, to conserve space, could use the current document name as a dropdown menu to navigate between or open other docs. Such things just haven’t happened since the iOS UI hasn’t evolved much yet.

I think your mention is enough of a call to the guys and they’ll deal with it eventually. Sad to say, it seems like a bug fix release may take some time and delay new features. I think DT has only one dedicated iOS developer, and only since August, so things could take a while. And the beta testers are possibly people who volunteered, and so the group may not have yet been reshuffled to keep only those most useful to QA. Seems like you and I would be good additions. :slight_smile:

Gary

We are addressing a number of issues and enhancement requests related to search. Your points are well taken, but we have certain limitations that are inherent in iPads (and worse on iPhones): there is only 256 MB of memory, we can only obtain a small portion of it, and there is no VM. Disk space is also limited compared to desktop machines, and we are also dealing with moving all data over a WiFi network. We could build a full index for each document on the desktop and move it over to the device, but given that users so far are moving thousands of documents with sizes of several megabytes over to DTTG, this quickly becomes a real issue. Therefore, we have imposed limitations on what is possible in this first release, and we are grappling with determining which features we must provide in new releases, but in the context of the limitations described above. This is not to say that there are no solutions to this (and other) issues, but the question is “can they be implemented on the device and, if so, what is the priority in relation to other features we must have?”

We appreciate the passion shown by DT users, and particularly those who have purchased our new DTTG product. We are listening, and will let you know our thoughts on these matters going forward. All (constructive) suggestions are welcome, and we know that our users will provide ideas to help solve some difficult problems in improving this product over time.

Hi Mike-

Thanks for your response. I imagine that Sync issues are consuming most of your attention right now, so it’s understandable that you might not have caught the details of my posts and support tickets.

I think I’ve verified that DTTG consistently can’t index and search the document database within its self-advertised limits of 10K or 20Kbytes. This may be an issue with my machine only, but it’s an issue that needs to be fixed for me.

Providing full-text search in DTTG for any configuration of document size and quantity is a completely different topic. I understand the memory constraints within the iOS architecture.

I would say that partial indexing of documents is going to be a hard sell for a company whose brand is built on indexing. I don’t see how you can avoid the requirement of fully indexing a document resident in DTTG. In fact, my issue with DTTG surfaced because I often search bibliographies at the end of academic papers. So, a frequent use of DTTG (and DTPO) is to search the last few kilobytes of the PDFs I’ve stored on these apps.

In the middle term, I’m OK with the 10/20K limit, insofar as it actually works, which for me it doesn’t right now. My solution is simply not to put documents larger than your indexing limit on DTTG.

Eventually, I’m sure you’ll find the right mix of functionality for an iPad/iPhone-based product. Hopefully the name “DEVONthink to Go” isn’t a misnomer that oversells what can be brought to the platform.

Best wishes,

Charles

As far as full text search, iAnnotate seems to handle it fairly well and it is quite quick. :slight_smile: So at least one dev seems to have figured it out. Syncing the index from the desktop seems like a good idea.

Whatever your reasons are for implementing your 20 KB full text search limit, I still think it is only fair to tell your customers about it and change the text in the App Store accordingly. I think you are really misleading your customers.

Birgitt

As was pointed out earlier, iAnnotate indexes the PDFs either on the host before syncing them to the iPad or on the iPad if needed (eg, when you do an Open In from another app). It seems to be something DT should be able to do as well!

Yes, but the developers of iAnnotate probably didn’t contemplate the possibility that people would send thousands of documents over to their app. I’m pretty sure that iAnnotate would immediately choke in that situation. The current Apple mobile devices are much more limited in RAM than any current Mac.

As he noted, Mike is thinking about the possibilities, such as sending over an index file for the synced database content. If you are currently activating Spotlight indexing for your databases, those index files themselves can grow quite large. I don’t activate Spotlight indexing on all the DTPO databases on my laptop, but even so, the index files themselves total about 580 MB.

What’s the current status on this topic?
I am looking at buying DTPro (DTTG already bought in order to test) so that I can search through a large variety of files (mainly PDFs) via a full-text index, available on both iMac and iPad (the latter increasingly replacing the former).

If you use the latest version, 2.0.6, of DEVONthink, all new PDFs sent to DEVONthink To Go will be fully indexed. If you want to apply this to all your files, sync, then reset DEVONthink To Go using its Settings, then sync them again to the device.

DEVONthink 2.0.5 and earlier clipped the index text sent to the device due to a typo in the code. Sorry for the inconvenience!

Eric.