PDF search speed

Altostratus · May 26, 2019, 7:36pm

Hello,

In DT3b2, searching for a string inside a large PDF file is a gazillion times slower (scientifically measured) than opening the file with Cmd+Shift+O and just doing the search in Preview instead, which is near-instantaneous. Why is that?

cgrunenberg · May 27, 2019, 12:22pm

I wasn’t aware of the scientific number “gazillion”. Which search string did you use? And did you enter it in the toolbar search field or in the Search inspector? Finally, the performance is basically limited by macOS’ PDFKit framework, but DEVONthink 3 performs some additional actions. A sample might be useful.

Please launch Apple’s Activity Monitor application (see Applications > Utilities), choose DEVONthink while it’s still searching in the list of processes, select the menu item View > Sample Process and send the result to cgrunenberg - at - devon-technologies.com - Thanks in advance!

Altostratus · May 27, 2019, 12:42pm

I don’t remember the precise string but it was a nondescript, medium-sized word (such as “nondescript”).

It wasn’t entered in the toolbar (the one for cross-database search which is usually very fast). Rather, it was entered in the panel you get when doing a Command+F when viewing a PDF file, in order to search just inside that file.

Process sample sent by mail. Thanks.

When you have the time: Gazillion

Altostratus · June 9, 2019, 7:24pm

Here is the type of searchable PDF file with which this happens consistently. I almost always end up giving up on DT3b2 and just opening it in Preview to find the text I’m looking for, which is really supposed to be DT’s strength.

Of course you can try this with any old PDF file at archive.org.

BLUEFROG · June 9, 2019, 9:28pm

Interesting. This seems like a pretty atypical PDF with almost 600,000 words in it. If we can fine-tune for that, that would be great but it definitely seems unusual.

Altostratus · June 9, 2019, 10:00pm

It certainly isn’t your run-of-the-mill 100-page double-spaced contemporary camera manual or product overview, but it’s still far from unusual, at least to me. I use tons of similar material in my research.

There’s one thing I don’t get (and maybe it’s too technical to explain briefly): if I do a search for a string across several open databases, DT will come up with one of those heavy-duty PDFs instantly (if the string is there). There is absolutely no waiting involved, and this has always been part of the magic. The engine will throw PDFs at you that you’ve completely forgotten about, and make connections you didn’t realize were there. So it knows the text is there, because it can pick the file itself out of a thousand others in a millisecond. So how come that when you actually look at the file, searching for text becomes so slow?

BLUEFROG · June 9, 2019, 10:11pm

In-document searching is searching the PDF contents.

This is not the same mechanism as our search engine, which searches the databases’ indices at lightning-speed.
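To make the distinction concrete, here is a toy sketch of what a database-level index lookup looks like. It is only an illustration under simple assumptions (whitespace tokenization, made-up file names and contents), not DEVONthink's actual index format:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

# Hypothetical documents; the names and contents are invented for the example.
docs = {
    "manual.pdf": "camera manual with a nondescript cover",
    "congress.pdf": "record of the congress session",
}
index = build_index(docs)

# Database-level search: a single dictionary lookup, whose cost does not
# depend on how long each document is.
hits = index.get("nondescript", set())
print(hits)
```

The lookup answers "which documents contain this word?" and nothing more, which is why it stays fast no matter how large the individual PDFs are.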

cgrunenberg · June 10, 2019, 11:26am

I just checked this, and the delay is caused by macOS’ PDFKit framework, which is used for searching, not by DEVONthink 3’s post-processing (e.g. scanning of annotations or ensuring that not all substrings are highlighted/found if only complete words should be found).
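For readers wondering what the "complete words" post-processing amounts to, a minimal sketch follows. It uses a regular expression on plain text as a stand-in, which is an assumption made for illustration; the real post-processing works on PDFKit's search results and its details are not described in this thread. The function name `find_matches` is hypothetical.

```python
import re

def find_matches(text, term, whole_words=False):
    """Return start offsets of case-insensitive matches of term in text."""
    pattern = re.escape(term)
    if whole_words:
        # Word boundaries reject hits that sit inside a longer word.
        pattern = r"\b" + pattern + r"\b"
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]

text = "The art of cartography starts with a chart."
substring_hits = find_matches(text, "art")
word_hits = find_matches(text, "art", whole_words=True)
print(substring_hits, word_hits)
```

The substring search reports four offsets, while the whole-word search keeps only the standalone "art".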

Altostratus · June 10, 2019, 12:26pm

@cgrunenberg: So that means Preview is faster because it’s not going through the PDFKit framework?

@BLUEFROG: I know it’s not the same mechanism, that’s what I was trying to say. It might seem obvious to you, but I don’t get it: if DT has all the words of all the PDFs in a database stored very efficiently in some data structure, how come it can’t access that same efficient information when searching inside one of them?

Thank you.

cgrunenberg · June 10, 2019, 12:44pm

Hard to tell what it does exactly but Apple’s apps are of course not limited to official APIs.

ngan · June 10, 2019, 12:50pm

Just my naive thought/common sense about software/kits/APIs: knowing that the words are in the file is not the same as knowing where the words are and highlighting/pinpointing the hits. So the underlying mechanisms of search and search+locate are very different in terms of the amount of work/ways of indexing…

BLUEFROG · June 10, 2019, 12:53pm

if DT has all the words of all the PDFs in a database stored very efficiently in some data structure, how come it can’t access that same efficient information when searching inside one of them?

You are dealing with two different data sources. If you do a Google search, it will find a large number of web pages. If you do a Command-F while reading a web page, that search is not conducted via Google, but by another mechanism. DEVONthink is no different in this regard.

Whether it could be changed would be up to Criss to determine, but that’s the simple logic behind it.

ngan · June 10, 2019, 1:06pm

Another IMHO: DT can handle most of what different users need, but some users are quite used to integrating different apps with DT to customise the optimal solution for a specific purpose. In your case, if this were just a matter of optimisation, I suspect that the developer would have responded already. So, if you find that it is more efficient to open a long document in Preview to search and locate keywords, why not? I am starting to open some of my RTFD notes in an external editor for cleaning up formatting and resizing images, and all I need is to click the open-externally button, and that’s perfectly easy.

Altostratus · June 10, 2019, 6:27pm

I was asking this because when searching globally DT will bring up a list of files and, when each of them is clicked, will instantly highlight the search term in yellow (in several places if needed). So, from my perspective, DT “knows” where the words are and doesn’t need to “find” them in the PDF. I was just curious as to why this same mechanism couldn’t be used for searching just one file.

Like I wrote above, I didn’t say it was difficult, I was curious about the reason for the huge difference in search speed (inasmuch as it can be explained without going into a one-hour tech talk about APIs and all that).

Actually, this is not a great analogy. Try entering for example “bismarck prinz eugen rheinubung” in a Google Search. You will most likely hit this Google Books page, with all your search terms highlighted. This is exactly what DT is doing for me on a global database level. Now, if you type e.g. “Raeder” in the search box which is on the Google Books page, you’ll get all the pages of that book with the new search term highlighted, at essentially the same speed. You would find it weird if that search took a lot longer than searching through an index of the entire WWW. Please understand I’m not asking how come Google is faster than DT. Rather, I’m pointing out that the example you gave is not the best one IMHO.

ngan · June 10, 2019, 6:42pm

I see… But it is true that, at least for me, it takes a while for the inspector bar to pick up and highlight the list of hits even for a 20-page PDF, and longer for books. So there are different mechanisms at work and some are uncontrollable by DT (now I understand more after the explanation by DT colleagues).

BTW, a very interesting file. I’ve never read any printed newspaper (or magazine?) from the year 1931!

BLUEFROG · June 10, 2019, 6:57pm

Actually, your analogy makes less technological sense.

That’s Google searching Google. It’s a GET request to books.google.com.

That is essentially the same technology as was used for the initial search.
DEVONthink uses its internal indices for database searches and PDFKit for searching PDFs.

Altostratus · June 10, 2019, 7:23pm

I guess we agree, then! What puzzles me is why DT doesn’t use essentially the same technology as is used for the initial search, considering that it’s a gazillion times faster (as was mentioned before). Why bother with PDFKit if your internal indices are so fast?

BLUEFROG · June 10, 2019, 7:30pm

That would be for Criss to respond to.

cgrunenberg · June 11, 2019, 4:14am

The search index is only for searching, it doesn’t know anything about the layout of documents or how to highlight occurrences in them.
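A rough sketch of why locating hits is a different job from finding the document: to highlight occurrences, something has to walk through the text of every page and record where each match sits. The code below is a simplified illustration in plain Python (pages as strings, positions as character offsets, a hypothetical `locate` helper). It is not how PDFKit works internally, and a real viewer would additionally need the page geometry to draw the highlights.

```python
def locate(pages, term):
    """Scan every page and return (page_number, offset) for each occurrence."""
    term = term.lower()
    hits = []
    for page_number, text in enumerate(pages, start=1):
        lowered = text.lower()
        offset = lowered.find(term)
        while offset != -1:
            hits.append((page_number, offset))
            # Continue searching just past the start of the previous match.
            offset = lowered.find(term, offset + 1)
    return hits

# Invented page texts standing in for the extracted text of a PDF.
pages = ["Raeder met the staff.", "No mention here.", "Raeder and Raeder again."]
hits = locate(pages, "raeder")
print(hits)
```

The work here grows with the total amount of text in the document, whereas a document-level index lookup does not, so a 1500-page file costs far more to search this way than a 20-page one.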

Altostratus · June 11, 2019, 9:01am

This issue is definitely not scaling well. I am now working with several PDF records of the US Congress, each 600 MB and 1500 pages long, and the problem is getting worse.

In order to give an idea of what a gazillion means, please have a look at this screen capture: https://streamable.com/drvti First I’m searching for a string inside DT, then trying it again in Preview. Also, please note that while DT is populating the search results, I am clicking on different results in the hope of displaying the page. DT is very unresponsive at this stage. Compare this to the way Preview reacts to clicks. I have no idea what Apple has done here in terms of optimization but it’s like night and day. I’m not sure how much this is related to DT3; I remember DT2 was faster with this sort of thing, but I may be wrong.