Search Inconsistencies?

FROBGOBLIN · January 26, 2016, 12:05am

Hi. I’m trying to get a handle on searching in DEVONthink, because it isn’t working for me the way that I would like. Any help would be greatly appreciated. Specifically, I want to know what inconsistencies or blind spots you might have run into – hopefully your experiences will shed light on my problem, or at least help us improve the product.

My problem is that things I easily find in Spotlight (HoudahSpot) are not turning up in DEVONthink. This is sometimes a minor annoyance, but at other times, especially with Smart Groups, it can be a much more serious issue. For example, one Smart Group ought to have a few dozen files and it has two. I assume the classification AI suffers from the same blind spots. So far the inconsistencies seem to be limited to CJK (Chinese, Japanese, and Korean), but the problem may be more widespread (I have some other oddities, but haven’t been able to pin them down yet).

I’d like to give you the files, but I cannot share any of them. Sorry. I’m trying to pin down files / terms that are in files I can share, but I haven’t had the time to sift through things and figure out what is missing from the search results.

What I can say is that it seems to be terribly inconsistent. If I make a sentence using the term I am searching for (I usually have to use asterisks around CJK terms, because the languages don’t put spaces between words), it works just fine. And if I copy / paste a page of text from a scanned document (OCR by Adobe) into a new document, that works fine too. But the same PDF file that the text came from cannot be found. This doesn’t make sense to me. I hope, and assume, that I am missing something. Perhaps your experiences will give me some hints.

Thanks!

korm · January 26, 2016, 12:31am

Could you clarify? Are you using Spotlight only, or HoudahSpot only, or both? (Results should be the same, but …). I assume your database(s) have activated “Create Spotlight Index” and that’s why you expect Spotlight to be returning the same results as DEVONthink and vice-versa. And that all your PDFs are OCRd.

BLUEFROG · January 26, 2016, 12:42am

Asian language support is limited due to the lack of word boundaries, so no - you cannot expect the same results. Using asterisks or a tilde prefix will aid in the search.
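To illustrate the word-boundary point, here is a minimal sketch in Python. It is not DEVONthink's or Spotlight's actual indexing code; the function names and sample strings are made up for illustration. It only shows why an index that splits text on spaces can miss a CJK term, why a wildcard (substring) match can still find it, and how character bigrams are one common workaround.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace and basic punctuation, as a naive word-based indexer might."""
    return [t for t in re.split(r"[\s、。,.!?]+", text) if t]

def matches_whole_word(term, text):
    """True only if the term is exactly one of the tokens."""
    return term in whitespace_tokens(text)

def matches_wildcard(term, text):
    """Rough equivalent of searching for *term*: a plain substring match."""
    return term in text

def bigrams(text):
    """Overlapping two-character sequences, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}

english = "search the database index"
japanese = "データベースの検索結果が表示されない"  # no spaces between words

print(matches_whole_word("index", english))   # the English term is its own token
print(matches_whole_word("検索", japanese))    # the whole sentence is a single token
print(matches_wildcard("検索", japanese))      # substring match still finds the term
print("検索" in bigrams(japanese))             # a bigram index would contain the term
```

A bigram index trades a larger index and some false positives for not needing to know where words begin and end, which is one reason different search tools can return different results for the same CJK text.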

FROBGOBLIN · January 26, 2016, 8:09am

@korm
Thanks! I’m using HoudahSpot (which essentially is Spotlight, but just in case there was something funky going on in HS, I wanted to mention it). Actually, everything is showing up in HS, so the Create Spotlight Index is on, but I also have them indexed (I’ve tried both ways – sticking stuff into DT and indexing – and neither works as expected in DT). All PDFs are OCRd. When I copy the text out of a PDF into a Word file, it works, but the original PDF doesn’t.

@Bluefrog
That’s very unfortunate.

Tildes and asterisks have been used, but to no avail. Essentially, the content is there, but it isn’t being recognized, which eviscerates the core feature of DT; namely, its AI. If it’s only partially effective, it cannot be relied upon to produce the expected results, and everything becomes suspect. Spotlight can manage it, so the problem has a solution, and I would urge DT to consider addressing the issue somehow.

In DT’s defense, as Bluefrog says, unclear word boundaries make the search thing tough. And, as a result (perhaps), there is really very little out there that does perform such searches well. It is kind of shocking, considering that 1 to 2 billion folks (a third or so of the planet) rely on CJK in their daily life. But, my feeling is that things are “good enough” for most tasks, and people give it a pass, because they wouldn’t be willing to shell out money for something more powerful anyhow. What about Google Drive? Well, besides having to give up your privacy, it only indexes the first 100 pages of a PDF (a hidden, awful limitation). What about Evernote? Well, besides having to give up your privacy (local notebooks are a kludgy semi-solution), it only indexes files below a certain size as well (note sizes are limited).

It could simply be my incompetence. There might well be amazing search apps out there. I’m asking an IT researcher to look into the situation for Windows to see what is out there (I’ve had no success on that platform). At the moment, the Mac and Spotlight are simply the best commercial options available, as far as I can tell, for getting searches done. DT is certainly great, and I appreciate a lot of what it does, but if the search doesn’t produce better results than what I am seeing, it is a major drawback (again, it looks to be primarily a CJK issue – most folks on this forum are probably immune from the issue).

Specifically, my smart groups are empty, so I have to make smart folders in Finder, and these aren’t indexed by DT. That means I am in and out of the app just to do basic stuff, like look up guidelines for a project, or something like that. It ends up being faster to just work outside of DT, which is the opposite of how it ought to be, in my opinion.

“Make it so,” as Picard would say.

FROBGOBLIN · January 26, 2016, 8:32am

By the way, I’ve been aware for a while that there were some inconsistencies between Spotlight and DT’s search results, but I didn’t begin to realize the extent of the problem (for my use case, at least) until recently. I’ve needed increasingly precise results the past few months, and I have kept coming across cases where files seem to be missing, even though I know they are somewhere in the database, and it turns out that only Spotlight can find them.

Yesterday, though, I was surprised to see a smart group with only a few files when it should have had a bunch of them. Then, looking into things, I realized that this was occurring all over in my database, and I was just unaware of it – missing out on the data I had collected. Ouch.

Again, DT isn’t “bad,” especially for non-CJK stuff, but I would like to see it get a lot better with CJK.

BLUEFROG · January 26, 2016, 1:43pm

Also, consider that Spotlight and DEVONthink are not using the same underlying technology (meaning, DEVONthink is not querying the Spotlight index for its information when searching).

korm · January 26, 2016, 2:33pm

For completeness, I assume this means “DEVONthink is not … when searching databases internally in the DEVONthink client”.

So … wouldn’t it be a nice feature if DEVONthink optionally included Spotlight indexed data in Search results. The best of all worlds?

BLUEFROG · January 26, 2016, 3:02pm

Considering I’m a huge Spotlight fan (from my days at Ironic Software), my personal opinion is: yes.

My technical opinion is: not so much. Spotlight has their index. We have ours. They have their data structures. We have ours. Meshing the results - and having to weed out duplicates, etc. - would either be technologically difficult or return unexpected results.

Don’t get me wrong, if anyone could do it, Criss could do it. (He’s some kind of magician, I think.) But the question of, “Should we do it?” would really be up to him.
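As a rough picture of what "meshing the results" involves, here is a small Python sketch. Everything in it is hypothetical: it assumes both searches return plain file paths, which may not be how either index identifies documents, and it says nothing about relevance ranking, which is arguably the harder part.

```python
import posixpath

def merge_results(primary, secondary):
    """Combine two lists of file paths, keeping the first occurrence of each file.

    Paths are normalised first so that '/docs//a.pdf' and '/docs/./a.pdf'
    count as the same file. Items from `primary` keep their order and come first.
    """
    seen = set()
    merged = []
    for path in list(primary) + list(secondary):
        key = posixpath.normpath(path)
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged

# Hypothetical result lists from two independent indexes
internal_hits = ["/Users/me/db/report.pdf"]
spotlight_hits = ["/Users/me/db//report.pdf", "/Users/me/db/guidelines.pdf"]

print(merge_results(internal_hits, spotlight_hits))
```

Even in this toy version, the merged list simply appends the second index's extra hits after the first's, so the ordering no longer reflects a single relevance score. That is one way combined results could look "unexpected".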

FROBGOBLIN · January 26, 2016, 9:15pm

well, at the moment, the dt structure is not returning the expected results. the content is there, but dt cannot find it. from my perspective, this is a major issue that needs to be addressed.

it doesn’t need to use the spotlight data, or act like spotlight, as long as it finds my stuff. my point about spotlight was that i am not asking for dt to invent a technology that does not exist. it is obviously possible to do a better job indexing my stuff, because spotlight is already doing it. the question then becomes: will dt settle for mediocrity, or ascend to a higher level, all shiny and chrome (to borrow the words of immortan joe in mad max)?
m.youtube.com/watch?v=BX-FMvt83fA

for my use, frankly speaking, the current search results are simply insufficient, because the data i have so carefully collected is being ignored. of course, the ability to make use of the spotlight index would be pretty cool whatever happens. the status quo, though, is pretty disappointing.

BLUEFROG · January 26, 2016, 9:35pm

Actually, it is not shocking, considering that our User base is not 1 to 2 billion people and that English is still the primary method of communication in the world. Our User base and presence is overwhelmingly English speaking. (The most commonly spoken language is a simple statistic. Look at what is taught as the secondary language in these countries: English.)

So, while you may feel “disappointed”, I don’t think you should be “surprised”.

I don’t know why this would be the question, as if this one issue defines the overall quality of the product.

This does not mean we are sitting still, but we also have many things to consider and address.

FROBGOBLIN · January 27, 2016, 6:43am

hi. my comment about the large cjk user base that does not seem to enjoy the same level of search reliability was not addressed specifically to dt, but directed at the entire search indexing community (however one defines such a thing). this includes the big players such as google and microsoft. in general, there is surprisingly little out there for cjk on windows, for example – at least, nothing that measures up to spotlight, in my experience. again, i could be wrong, but that seems to be the case.

as for english as a second language, there is a considerable gap between the acquisition of a language (however that is measured by school classes taken or tests passed) and actual usage of a language. i think your point is certainly valid when it comes to localization of software or manuals – perhaps it isn’t worth the effort. all of my students speak english as a second language (i work in a japanese university), and i am certain that they could figure it out somehow. more importantly, though, nearly all of my work (and probably nearly all of anyone’s work in east asia) is conducted in cjk, not english, so the content of the index is quite unlikely to contain much english.

personally, i consider the search index to be the heart of the product, because it informs the ai (presumably, dt’s defining feature), and shortcomings in the index have a huge impact on several other features. in my use case, searching is critical. unreliable or poorly optimized searches are not especially useful, because they fail to make full use of the data i have painstakingly collected over the years.

i get it, though. if it’s not a priority, it isn’t a priority, and the folks at dt have to make the best decision for dt. dt is quite amazing with western languages, and i understand that my issue is a relatively inconsequential one within the existing user base, which is largely composed of western language users.

thanks for the prompt answer to my query (i have to admit that i didn’t realize the extent of the issue with my data, at least), and i appreciate your considering the improvements – it’s nice to have a forum where we can have input into the product.

BLUEFROG · January 27, 2016, 6:54am

I do hear you. I am off and on teaching myself Korean, Mandarin Chinese, and Japanese (slowly, given I am self-teaching in spare moments, and the only impetus is love of language itself) so I have plenty of documents with non-Roman data in them. However, I currently don’t have the daily necessity of searching that data, more than I do the collecting of it.

You are definitely heard here, and we now know we have at least one person to whom such functionality is critical (i.e. we know who we could ask to test / validate something we may be working on).

FROBGOBLIN · January 27, 2016, 7:39am

Good luck with the languages! Great fun, as I am sure you are discovering. I am still teaching myself Chinese and Korean – slowly as well.

Yep. I am here anytime to help.