Instant PDF Search

atlas · May 30, 2020, 4:35am

I recently came across an app called pdfsearchapp and I was absolutely blown away by the speed with which it finds text within indexed PDFs. Is there any way that this kind of rapid indexed searching can be incorporated into DEVONthink? Searches of large 1000-page PDFs take minutes instead of under a second.

rfog · May 30, 2020, 5:50pm

I just installed it from Setapp, and so far it is taking more or less the same time to index my PDFs as DT takes. I will check search later, but I don’t think it will be faster.

rkaplan · May 30, 2020, 6:07pm

Time to index is about the same, but the interface is extremely nice for reviewing the hits for your search.

atlas · May 30, 2020, 6:10pm

Sorry, I should have clarified. The index time is about the same, maybe slower, but once a file has been indexed, the search time is orders of magnitude faster.

BLUEFROG · May 30, 2020, 6:17pm

It’s likely building and searching a per-document index. DEVONthink has a master index for the database that allows for greater functionality and feeds the AI.

Development would have to assess this but if implemented it would add to the size of the database package.
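To illustrate the distinction drawn above, here is a minimal sketch of a per-document index next to a database-wide one. It is not how DEVONthink or PDF Search actually work internally; the document names, page texts, and function names are hypothetical, and a real index would store far more than this.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_document_index(pages):
    """Per-document index: word -> sorted list of page numbers (1-based)."""
    index = defaultdict(set)
    for page_number, page_text in enumerate(pages, start=1):
        for word in tokenize(page_text):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def build_master_index(documents):
    """Database-wide index: word -> set of document names containing it."""
    index = defaultdict(set)
    for name, pages in documents.items():
        for page_text in pages:
            for word in tokenize(page_text):
                index[word].add(name)
    return dict(index)


# Hypothetical two-document "database"; each document is a list of page texts.
documents = {
    "voyage.pdf": [
        "The Nautilus dives.",
        "Captain Nemo appears.",
        "The Nautilus surfaces.",
    ],
    "moon.pdf": ["The cannon fires.", "The capsule orbits the Moon."],
}

master = build_master_index(documents)
per_document = {
    name: build_document_index(pages) for name, pages in documents.items()
}

# The master index answers "which documents mention this word?"
print(master["nautilus"])
# A per-document index answers "where inside this one document?"
print(per_document["voyage.pdf"]["nautilus"])
```

In this sketch the per-document structure is stored in addition to the master index, which is one way to picture why such a feature would add to the size of the database package.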

atlas · May 30, 2020, 6:21pm

Cheers, thanks for the insight. It would be interesting if it could be an option that the user can enable. I obviously have no idea about the complexity of its implementation, but it’s just a thought.

BLUEFROG · May 30, 2020, 6:24pm

You’re welcome.
It could have a negative impact due to the added size.

Also, there could be an expectation for DEVONthink To Go to provide a similar immediate function. A cross-platform implementation could prove more difficult, especially with smaller capacity devices like iPhones.

rkaplan · May 30, 2020, 6:28pm

I think this “PDF Search” app would be helpful as an alternate means of search in very large documents in DT3 using the “Open With” feature.

Presently it is possible to “Open With” PDF Search and view a file, but in order to search it you then have to save it in a designated directory, which is less than convenient. However, the developer responds quickly to suggestions and says he would consider changing the app so that it can index/search “on the fly” individual documents opened via “Open With.” If that is added, it would seem to be a straightforward and very helpful integration with DT3, giving that search option for specific files.

BLUEFROG · May 30, 2020, 6:29pm

“in order to search it you then have to save it in a designated directory, which is less than convenient.”

? Can you clarify this?

rkaplan · May 30, 2020, 6:33pm

The “PDF Search App” requires that you designate specific folders where files to be indexed are located. If you use “Open With” it will display the file but it will not find anything in a search if the file is not located in a designated search folder.

I understand the idea of search folders if you want to index a bunch of files in advance. But it seems to me it would be helpful if it could also index any arbitrary file being viewed in its app; their developer is considering that idea after I explained the use case for that feature.

BLUEFROG · May 30, 2020, 6:38pm

I think this is similar to how DEVONsphere Express operates, though @cgrunenberg would have to give an authoritative answer on that.

rkaplan · May 30, 2020, 9:30pm

Response time seems similar to Devonsphere.

The difference is that here it is possible to search one particular PDF file or a set of a few specific PDF files. I do not think that is possible in Devonsphere, or am I missing something there?

Also, the presentation of multiple hits and the interface for browsing through them are more helpful in some situations than those in DT3 or Devonsphere.

(I am not questioning that DT3 and Devonsphere are both immensely useful, but this app seems to be more capable when it comes to searching within a specific document.)

BLUEFROG · May 31, 2020, 1:17am

“I do not think that is possible in Devonsphere, or am I missing something there?”

No. I didn’t mean to imply the functionality is the same, just that the underlying mechanism might be.

rfog · May 31, 2020, 6:50am

Not in Devonsphere (which I have but am not used to using), but in DT itself you can very easily search sets of PDFs grouped by folder or database. I put the search terms in the search box, then click on each database or folder in turn, and the search scope changes immediately (and when I say immediately, I mean immediately, across my 14 databases of about 20,000 documents). BTW, this is a very interesting thing that DT does, and I don’t think this tool can do it easily.

How do I know I have about 20K docs? Because I’m indexing all my PDFs in this PDF Search tool and it says I have 15,953 PDFs; adding about 5K other documents such as ePubs makes a total of about 20K.

As for PDF Search, it is very slow at indexing: it has indexed only about 1500 PDFs over the whole night, in which time DT would have indexed all items. Another test still pending is how long a search takes once all files are indexed… and of course, DT does infinitely more things.

rkaplan · May 31, 2020, 8:28am

Can you clarify how you do that?

If I click on other folders, the search box becomes inactive for me.

rfog · May 31, 2020, 8:41am

I go to the search box and type what I want to search for (then press Enter, as I have disabled searching while typing).

Then I click on one database, “Investigación y ciencia” in this case.

After that I click on the Jules Verne database.

And, as I move across folders, the search results change dynamically.

Of course, this can also be done from the search field (its scope changes as you click around), but it is way easier to move around databases/folders. And once the first search is done, moving between folders/databases is almost immediate.

rkaplan · May 31, 2020, 8:47am

Oh I see now… I usually begin my searches by selecting a frequently used Group from my Favorites. If you do it that way, then switching to another group in the sidebar ends the search.

But if I instead choose the group from further down in the sidebar (not from Favorites) then I can switch the search scope as you suggest.

rfog · May 31, 2020, 9:03am

You are right; when starting from Favorites, the search terms are not carried over. Could this be a bug, @cgrunenberg?

cgrunenberg · June 1, 2020, 7:38am

The next release will fix this.

rfog · June 2, 2020, 6:43pm

OK, after 4 days of indexing, and not yet finished (still about 3K of 16K documents left), PDF Search is more or less usable…

PDF Search uses 62 GB in an SQLite database, for PDFs only, no other kinds of documents. DT uses 44 GB with more documents indexed: about 1300 ePub files, plus 3 databases whose files are not indexed but stored inside the database.

DT wins here by far. Then I ran some searches. For “eureka verne” (without quotes), PDF Search takes at least twice as long to find the results, and it finds eureka OR verne. DT is more “intelligent”: it looks for documents that contain both words, and orders them by how near the two words are to each other.

Changing the syntax to “+eureka +verne” (without quotes), which according to the online manual means “each term must be in same document” in PDF Search, it finds 5 documents versus 197 for DT. Even allowing for the documents not yet indexed, PDF Search is not finding all the matches.

To be sure of that, I restricted the search to one database (which is fully indexed in PDF Search): DT finds 7 documents and PDF Search only one:
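The difference between the two query behaviours described above (any term versus every term) can be sketched as follows. This is only an illustration of OR versus AND matching over hypothetical sample texts; it does not reproduce either product's query parser, ranking, or the “+term” syntax.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_any(documents, query):
    """OR semantics: a document matches if it contains at least one term."""
    terms = tokenize(query)
    return sorted(
        name for name, text in documents.items() if terms & tokenize(text)
    )


def search_all(documents, query):
    """AND semantics: a document matches only if it contains every term."""
    terms = tokenize(query)
    return sorted(
        name for name, text in documents.items() if terms <= tokenize(text)
    )


# Hypothetical sample documents, invented for this sketch.
documents = {
    "a.txt": "Eureka! shouted the engineer.",
    "b.txt": "Jules Verne wrote many novels.",
    "c.txt": "Verne's heroes often cry eureka.",
}

any_hits = search_any(documents, "eureka verne")
all_hits = search_all(documents, "eureka verne")

print(any_hits)  # documents containing eureka OR verne
print(all_hits)  # documents containing eureka AND verne
```

With AND semantics the result set can only be the same size or smaller than with OR semantics, so a large gap between two tools on the same AND query points to differences in what each has indexed or how each tokenizes, not to the query logic itself.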

https://dl.dropboxusercontent.com/s/xgaq3ck56wg0fbp/Captura%20de%20pantalla%202020-06-02%2020.40.00.png?dl=0 (image larger than 4096KB)

That said, and taking into consideration that PDF Search uses a very big SQLite database, I’m sticking with DT for indexing PDFs, as it is faster, finds more and, very importantly, seems more reliable than PDF Search because of its database. Still, I don’t think PDF Search is a bad product, but DT is better (and of course has a zillion more features not related to PDF searching).