PDF search speed

BLUEFROG · June 10, 2019, 12:53pm

If DT has all the words of all the PDFs in a database stored very efficiently in some data structure, how come it can’t access that same efficient information when searching inside one of them?

You are dealing with two different data sources. If you do a Google search, it will find a large number of web pages. If you do a Command-F while reading a web page, that search is not conducted via Google, but by another mechanism. DEVONthink is no different in this regard.

Whether it could be changed would be up to Criss to determine, but that’s the simple logic behind it.

ngan · June 10, 2019, 1:06pm

Another IMHO: DT can handle most of what different users need, but some users are quite used to integrating other apps with DT to customise the optimal solution for a specific purpose. In your case, if this were just a matter of optimisation, I suspect that the developer would have responded already. So, if you find that it is more efficient to open a long document in Preview to search and locate keywords, why not? I am starting to open some of my RTFD notes in an external editor for cleaning up formatting and resizing images, and all I need to do is click the open-externally button, which is perfectly easy.

Altostratus · June 10, 2019, 6:27pm

I was asking this because when searching globally DT will bring up a list of files and, when each of them is clicked, will instantly highlight the search term in yellow (in several places if needed). So, from my perspective, DT “knows” where the words are and doesn’t need to “find” them in the PDF. I was just curious as to why this same mechanism couldn’t be used for searching just one file.

Like I wrote above, I didn’t say it was difficult, I was curious about the reason for the huge difference in search speed (inasmuch as it can be explained without going into a one-hour tech talk about APIs and all that).

Actually, this is not a great analogy. Try entering, for example, “bismarck prinz eugen rheinubung” in a Google Search. You will most likely hit this Google Books page, with all your search terms highlighted. This is exactly what DT is doing for me on a global database level. Now, if you type e.g. “Raeder” in the search box on the Google Books page, you’ll get all the pages of that book with the new search term highlighted, at essentially the same speed. You would find it weird if that search took a lot longer than searching through an index of the entire WWW. Please understand I’m not asking how come Google is faster than DT. Rather, I’m pointing out that the example you gave is not the best one IMHO.

ngan · June 10, 2019, 6:42pm

I see… But it is true that, at least for me, it takes a while for the inspector bar to pick up and highlight the list of hits even for a 20-page PDF, and longer for books. So there are different mechanisms at work and some are outside DT’s control (now I understand more after the explanation by DT colleagues).

BTW, a very interesting file. I’ve never read any printed newspaper (or magazine?) from the year 1931!

BLUEFROG · June 10, 2019, 6:57pm

Actually, your analogy makes less technological sense.

That’s Google searching Google. It’s a GET request to books.google.com.

That is essentially the same technology as was used for the initial search.
DEVONthink uses its internal indices for database searches and PDFKit for searching PDFs.

Altostratus · June 10, 2019, 7:23pm

I guess we agree, then! What puzzles me is why DT doesn’t use essentially the same technology as is used for the initial search, considering that it’s a gazillion times faster (as was mentioned before). Why bother with PDFKit if your internal indices are so fast?

BLUEFROG · June 10, 2019, 7:30pm

That would be for Criss to respond to.

cgrunenberg · June 11, 2019, 4:14am

The search index is only for searching, it doesn’t know anything about the layout of documents or how to highlight occurrences in them.
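As a rough illustration of that distinction (a sketch only, not DEVONthink’s actual data structures; `build_index`, `find_in_document`, and the sample documents are hypothetical names made up for this example): an index of this kind can answer “which documents contain this word?” almost instantly, but it holds no page or position information, so locating and highlighting each occurrence still means scanning the document’s own text.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document names containing it.

    The index records *which* documents match, not where in the document
    (page, offset, layout) the word occurs.
    """
    index = defaultdict(set)
    for name, pages in docs.items():
        for page_text in pages:
            for word in page_text.lower().split():
                index[word].add(name)
    return index


def find_in_document(pages, term):
    """Scan every page of one document and return (page_number, offset) hits."""
    hits = []
    needle = term.lower()
    for page_number, page_text in enumerate(pages, start=1):
        haystack = page_text.lower()
        start = haystack.find(needle)
        while start != -1:
            hits.append((page_number, start))
            start = haystack.find(needle, start + 1)
    return hits


# Hypothetical sample data: each document is a list of page texts.
docs = {
    "a.pdf": ["Bismarck sailed north", "Prinz Eugen followed Bismarck"],
    "b.pdf": ["Raeder approved the plan"],
}
index = build_index(docs)

# Database-level search: a single dictionary lookup.
matching_documents = index["bismarck"]

# In-document search: a scan whose cost grows with the document's length.
occurrences = find_in_document(docs["a.pdf"], "Bismarck")
```

The lookup cost is roughly independent of document size, while the scan touches every page, which is consistent with in-document search slowing down on very long PDFs.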

Altostratus · June 11, 2019, 9:01am

This issue is definitely not scaling well. I am now working with several PDF records of the US Congress, each 600 MB and 1500 pages long, and the problem is getting worse.

In order to give an idea of what a gazillion means, please have a look at this screen capture: https://streamable.com/drvti First I’m searching for a string inside DT, then trying it again in Preview. Also, please note that while DT is populating the search results, I am clicking on different results in the hope of displaying the page. DT is very unresponsive at this stage. Compare this to the way Preview reacts to clicks. I have no idea what Apple has done here in terms of optimization but it’s like night and day. I’m not sure how much this is related to DT3; I remember DT2 being faster with this sort of thing, but I may be wrong.

cgrunenberg · June 11, 2019, 9:03am

Version 3 performs a few additional tasks (e.g. scanning PDF annotations and ensuring that substrings aren’t matched) but the performance should be basically identical to that of version 2.
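To show what “ensuring that substrings aren’t matched” can mean in practice, here is a minimal sketch using Python’s standard `re` module. It is an assumption about the general technique (word-boundary matching), not a description of how DEVONthink or PDFKit implements it; the function names and sample sentence are hypothetical.

```python
import re


def substring_hits(text, term):
    """Return the offset of every occurrence, including those inside longer words."""
    return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]


def whole_word_hits(text, term):
    """Return offsets only where the term stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]


text = "The art of the cartographer is part art, part science."

all_hits = substring_hits(text, "art")    # also matches inside "cartographer" and "part"
word_hits = whole_word_hits(text, "art")  # matches only the standalone word "art"
```

Each candidate match needs an extra boundary check, which is one plausible kind of the small additional work mentioned above.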

BLUEFROG · June 11, 2019, 2:39pm

Compare this to the way Preview reacts to clicks.

Looks very similar to DT’s behavior IMHO

Altostratus · June 11, 2019, 4:27pm

It isn’t. Not remotely. It’s too bad the screen capture doesn’t show mouse clicks as visual feedback. DT is very sluggish in this case (the one I recorded) compared to Preview.

cgrunenberg · June 12, 2019, 6:37am

Could you please launch Apple’s Activity Monitor application (see Applications > Utilities), choose DEVONthink while it’s sluggish in the list of processes, select the menu item View > Sample Process and send the result to cgrunenberg - at - devon-technologies.com? Thanks in advance!

Altostratus · June 12, 2019, 8:45am

Done. I launched a search through a large PDF, immediately started the sampling and while it was running I tried clicking on the slowly populating search results, with very sluggish response from DT (I’m on beta 3 now).

mog · June 12, 2019, 12:16pm

I also import substantial pdfs. To check, I have imported the variety104-1931-11.pdf into DT3.3 and search/find text “Features in production”. Instant result. Instant too for ‘PLAYING CAPITOL THEATRE’. Deeper in find “manager of Hippo-” also instant.

Doesn’t appear to be an issue, at least not for me? Am I missing something about the op’s concern?

Altostratus · June 12, 2019, 1:23pm

I don’t think you are. But it’s not instant at all on my end, and it’s a pretty fast Mac. I don’t know what is causing that.

mog · June 12, 2019, 1:52pm

I have similar age/spec iMac except it’s 32GB; also Mojave version is 10.14.5. Might be worth updating the OS?

I don’t know if this could be your culprit, but a few days after I updated from Sierra to Mojave two weeks ago, I encountered a massive problem with Spotlight not indexing properly, Finder tags not showing, and Mail messages that could not be found. I tried all the common solutions, to no avail. I finally found an uncommon solution, which included deleting an obsolete TagsMail.mdimporter.

I also suggest searching Google for “mojave slow pdf search”. You might find something amongst the comments.

Another possibility is to run OCR on the PDF again. I use Acrobat Pro XI for OCR, but I assume DT’s ABBYY FineReader is similar. Have you tried Data>OCR>to searchable pdf? (Not that that explains how I was able to find the text without redoing it, but maybe OS 10.14.5 is the answer?)

BLUEFROG · June 12, 2019, 1:54pm

but maybe OS 10.14.5 is the answer?

While I can’t say this is the solution, we do generally recommend keeping up to date on operating system point releases.

Altostratus · June 12, 2019, 2:50pm

After updating to 10.14.5 I can confirm searches seem to go faster but still slower than Preview. Thanks for the suggestion.

BLUEFROG · June 12, 2019, 3:39pm

but still slower than Preview.

As @cgrunenberg mentioned, “Hard to tell what it does exactly but Apple’s apps are of course not limited to official APIs.”

There are plenty of private APIs and things Apple may do but not make available to developers, so it’s entirely possible the team working on Preview has done things we aren’t aware of or don’t have access to.