PDF search speed

If DT has all the words of all the PDFs in a database stored very efficiently in some data structure, how come it can’t access that same efficient information when searching inside one of them?

You are dealing with two different data sources. If you do a Google search, it will find a large number of web pages. If you do a Command-F while reading a web page, that search is not conducted via Google, but by another mechanism. DEVONthink is no different in this regard.

Whether it could be changed would be up to Criss to determine, but that’s the simple logic behind it.

Another IMHO: DT can handle most of what different users need, but some users are quite used to integrating different apps with DT to customise the optimal solution for a specific purpose. In your case, if this were just a matter of optimisation, I suspect that the developer would have responded already. So, if you find that it is more efficient to open a long document in Preview to search and locate keywords, why not? I am starting to open some of my RTFD notes in an external editor for cleaning up formatting and resizing images, and all I need to do is click the open-externally button, which is perfectly easy.

I was asking this because when searching globally DT will bring up a list of files and, when each of them is clicked, will instantly highlight the search term in yellow (in several places if needed). So, from my perspective, DT “knows” where the words are and doesn’t need to “find” them in the PDF. I was just curious as to why this same mechanism couldn’t be used for searching just one file.

Like I wrote above, I didn’t say it was difficult, I was curious about the reason for the huge difference in search speed (inasmuch as it can be explained without going into a one-hour tech talk about APIs and all that).

Actually, this is not a great analogy. Try entering for example “bismarck prinz eugen rheinubung” in a Google Search. You will most likely hit this Google Books page, with all your search terms highlighted. This is exactly what DT is doing for me on a global database level. Now, if you type e.g. “Raeder” in the search box which is on the Google Books page, you’ll get all the pages of that book with the new search term highlighted, at essentially the same speed. You would find it weird if that search took a lot longer than searching through an index of the entire WWW. Please understand I’m not asking how come Google is faster than DT. Rather, I’m pointing out that the example you gave is not the best one IMHO :slight_smile:

I see… But it is true that, at least for me, it takes a while for the inspector bar to pick up and highlight the list of hits even for a 20-page PDF, and longer for books. So there are different mechanisms at work and some are outside DT’s control (now I understand more after the explanation by DT colleagues).

BTW, a very interesting file. I’ve never read any printed newspaper (or magazine?) from the year 1931!

Actually, your analogy makes less technological sense :stuck_out_tongue:

That’s Google searching Google. It’s a GET request to books.google.com.

That is essentially the same technology as was used for the initial search.
DEVONthink uses its internal indices for database searches and PDFKit for searching PDFs.

I guess we agree, then! What puzzles me is why DT doesn’t use essentially the same technology as is used for the initial search, considering that it’s a gazillion times faster (as was mentioned before). Why bother with PDFKit if your internal indices are so fast?

That would be for Criss to respond to.

The search index is only for searching, it doesn’t know anything about the layout of documents or how to highlight occurrences in them.
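To illustrate the distinction, here is a minimal Python sketch of the general idea. It is not DEVONthink’s actual implementation: the `build_index`, `search_database` and `find_in_document` names and the sample data are all made up for illustration. The index answers “which documents contain this word?” in one lookup, but it stores no page or position information, so highlighting hits still means scanning the document’s text.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it.

    Hypothetical toy index: it records *that* a word occurs in a
    document, not where on which page it occurs.
    """
    index = defaultdict(set)
    for doc_id, pages in docs.items():
        for page in pages:
            for word in page.lower().split():
                index[word].add(doc_id)
    return index


def search_database(index, term):
    """Database-wide search: a single dictionary lookup."""
    return index.get(term.lower(), set())


def find_in_document(pages, term):
    """In-document search: scan every page for (page_number, offset) hits."""
    term = term.lower()
    hits = []
    for page_no, text in enumerate(pages, start=1):
        lowered = text.lower()
        start = lowered.find(term)
        while start != -1:
            hits.append((page_no, start))
            start = lowered.find(term, start + 1)
    return hits


# Made-up sample data standing in for two PDFs.
docs = {
    "report.pdf": ["Bismarck sailed north", "Raeder approved the plan"],
    "notes.pdf": ["Nothing relevant here"],
}
index = build_index(docs)

matching_docs = search_database(index, "raeder")
positions = find_in_document(docs["report.pdf"], "Raeder")
```

The lookup cost of `search_database` does not grow with document length, while `find_in_document` has to touch every page, which is roughly why a 1500-page PDF feels so different from a database-wide query.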

This issue is definitely not scaling well. I am now working with several PDF records of the US Congress, each 600 MB and 1500 pages long, and the problem is getting worse.

In order to give an idea of what a gazillion means, please have a look at this screen capture: https://streamable.com/drvti First I’m searching for a string inside DT, then trying it again in Preview. Also, please note that while DT is populating the search results, I am clicking at different results in the hope of displaying the page. DT is very unresponsive at this stage. Compare this to the way Preview reacts to clicks. I have no idea what Apple has done here in terms of optimization but it’s like night and day. I’m not sure how much this is related to DT3, I remember DT2 was faster with this sort of thing but I may be wrong.

Version 3 performs a few additional tasks (e.g. scanning PDF annotations and ensuring that substrings aren’t matched) but the performance should be basically identical to that of version 2.
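As a rough illustration of what “ensuring that substrings aren’t matched” could involve, here is a small Python sketch. It assumes the phrase means whole-word matching, which is a guess; the function names and sample text are hypothetical and this is not how DEVONthink or PDFKit actually do it.

```python
import re


def substring_hits(text, term):
    """Return the start offset of every occurrence, including inside longer words."""
    return [m.start() for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]


def whole_word_hits(text, term):
    """Return start offsets only where the term stands as a whole word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


# Made-up sample sentence: "art" also appears inside "department", "smart" and "artist".
text = "The art department hired a smart artist for the art show."

all_hits = substring_hits(text, "art")
word_hits = whole_word_hits(text, "art")
```

The extra boundary check is cheap per hit, which fits the statement that performance should stay basically the same.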

Compare this to the way Preview reacts to clicks.

Looks very similar to DT’s behavior IMHO

It isn’t. Not remotely. It’s too bad the screen capture doesn’t show mouse clicks as visual feedback. DT is very sluggish in this case (the one I recorded) compared to Preview.

Could you please launch Apple’s Activity Monitor application (see Applications > Utilities), choose DEVONthink while it’s sluggish in the list of processes, select the menu item View > Sample Process and send the result to cgrunenberg - at - devon-technologies.com? Thanks in advance!

Done. I launched a search through a large PDF, immediately started the sampling and while it was running I tried clicking on the slowly populating search results, with very sluggish response from DT (I’m on beta 3 now).

I also import substantial PDFs. To check, I imported variety104-1931-11.pdf into DT3.3 and searched for the text “Features in production”. Instant result. Instant too for ‘PLAYING CAPITOL THEATRE’. Deeper in, finding “manager of Hippo-” was also instant.

Doesn’t appear to be an issue, at least not for me? Am I missing something about the OP’s concern?

I don’t think you are. But it’s not instant at all on my end, and it’s a pretty fast Mac. I don’t know what is causing that.

I have a similar age/spec iMac except it’s 32GB; also, my Mojave version is 10.14.5. Might be worth updating the OS?

I don’t know if this could be your culprit, but two weeks ago, a few days after I updated from Sierra to Mojave, I encountered a massive problem with Spotlight not indexing properly, Finder tags not showing, and being unable to find Mail messages. I tried all the common solutions but to no avail. I finally found an uncommon solution, which included deleting an obsolete TagsMail.mdimporter.

I’d also suggest searching Google for “mojave slow pdf search”. You might find something amongst the comments.

Another possibility is to run OCR on the PDF again. I use Acrobat Pro X1 for OCR but I assume DT’s ABBYY FineReader is similar. Have you tried Data>OCR>to searchable pdf? (Not that that explains how I was able to find the text without redoing it, but maybe OS 10.14.5 is the answer?)

but maybe OS 10.14.5 is the answer?

While I can’t say this is the solution, we do generally recommend keeping up to date on operating system point releases.

After updating to 10.14.5 I can confirm searches seem to go faster but still slower than Preview. Thanks for the suggestion.

but still slower than Preview.

As @cgrunenberg mentioned, “Hard to tell what it does exactly but Apple’s apps are of course not limited to official APIs.”

There are plenty of private APIs and things Apple may do but not make available to developers, so it’s entirely possible the team working on Preview has done things we aren’t aware of or don’t have access to.