Exclude term/phrase when searching pdf

dspady · January 14, 2023, 7:21pm

When I search a PDF for a term that is also in the PDF’s title, and the title appears on every page, is there a way to exclude the ‘title term’? If, for example, “climate change” is the search term and the title includes “climate change in the 20th century”, is there a search expression that will exclude the title?
e.g. “climate change” NOT “climate change in the 20th century”

Don

BLUEFROG · January 14, 2023, 7:55pm

No, this is not possible. There is no concept of a page title in a PDF. That’s a human perception.

dspady · January 14, 2023, 8:55pm

The issue is less the title than the presence of a recurring phrase that hides the desired result. It just happens that the phrase of interest is usually part or all of the title, but conceivably it could be somewhere else. I just want to sift out the relevant specific phrase from a bunch of unnecessary noise.

DrJJWMac · January 14, 2023, 9:26pm

Perhaps this could be accomplished by searching for PDFs that have more than one occurrence of the term or for PDFs where the search term is not on the first page. Maybe AppleScript is required to handle such a distinction.

In any case, if the title is, for example, “climate change in the 20th century”, would you not expect the sub-phrase “climate change” to be in the body of the document anyway? Can you give a counterexample where the search sub-phrase is in the title but should never appear in the body, just for clarification? Your query seems hypothetical otherwise.

–
JJW

dspady · January 14, 2023, 11:27pm

I would not be interested in a situation where the sub-phrase is in the title but not elsewhere. I have attached a screen capture where the search term was “hotter” and the title was “4 degrees hotter” but the problem of multiple unwanted finds is apparent.

I don’t quite know how to use AppleScript for this situation, or if it is even possible.
Don

BLUEFROG · January 15, 2023, 5:57pm

What you’re trying to do isn’t feasible. This is the closest I can get, and it’s still inaccurate: as you’ll see, I selected an instance of “climate change” that wasn’t matched…

PS: You’re in the difficult situation of trying to make the computer think. And not only think, but think like you.

dspady · January 15, 2023, 6:13pm

I guess we can only dream. And while in some ways it might be nice, I think having a computer think like me (or any human) is not such a good idea in the long run. The possibility of unintended consequences seems too great.
Thanks for all your help.
Don

BLUEFROG · January 15, 2023, 6:16pm

You’re welcome. You could of course try the BEFORE search operator as my example was contrived. But you should be aware of the potential for some misses.

dspady · January 15, 2023, 6:48pm

Actually, you hit the problem dead on. It works. I tried the following:
“soil biodiversity” !BEFORE “to ecosystem functions and services”
and it worked. Then I added an OR (|) condition as below:
| !AFTER “Global diversity and distribution of”
and then made it more complex by adding
| !AFTER (“Global diversity and distribution of” | “State of knowledge of”)
so a ‘complete’ search might be
“soil biodiversity” !BEFORE “to ecosystem functions and services” | !AFTER (“Global diversity and distribution of” | “State of knowledge of”)
That becomes a good filter, and is pretty much what I want.
So, I learned something useful today.
I need to learn more about these search terms like BEFORE and AFTER and ! etc.
Many thanks to you.
Don
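
For anyone who wants to see the filtering idea outside DEVONthink, here is a minimal Python sketch. It does not use DEVONthink's search engine or reproduce the exact semantics of BEFORE and AFTER; it only assumes the PDF text has already been extracted to a string, and it keeps occurrences of the term that are neither immediately followed by nor immediately preceded by one of the recurring phrases (a stricter combination than the OR used above). The function name `filtered_hits` and its parameters are hypothetical.

```python
import re


def filtered_hits(text, term, followed_by=(), preceded_by=()):
    """Return offsets of `term` that are not adjacent to an excluded phrase.

    Hypothetical helper: offsets refer to the whitespace-collapsed,
    lowercased copy of `text`, not to the original string.
    """
    # Collapse whitespace and lowercase so PDF line breaks do not matter.
    flat = re.sub(r"\s+", " ", text).lower()
    term = term.lower()
    hits = []
    for m in re.finditer(re.escape(term), flat):
        after = flat[m.end():].lstrip()
        before = flat[:m.start()].rstrip()
        # Skip hits that are directly followed by an excluded phrase.
        if any(after.startswith(p.lower().strip()) for p in followed_by):
            continue
        # Skip hits that are directly preceded by an excluded phrase.
        if any(before.endswith(p.lower().strip()) for p in preceded_by):
            continue
        hits.append(m.start())
    return hits


# Made-up sample text standing in for extracted PDF content.
sample = (
    "The contribution of soil biodiversity to ecosystem functions and services\n"
    "Global diversity and distribution of soil biodiversity\n"
    "We found that soil biodiversity declines under tillage."
)
hits = filtered_hits(
    sample,
    "soil biodiversity",
    followed_by=["to ecosystem functions and services"],
    preceded_by=["Global diversity and distribution of", "State of knowledge of"],
)
print(len(hits))  # only the body-text occurrence survives
```

The adjacency checks are done by hand rather than with a regex lookbehind because Python's `re` module requires lookbehind patterns to have a fixed width, which phrases of different lengths do not.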

BLUEFROG · January 15, 2023, 7:09pm

You’re welcome. These things are covered in the Appendix > Search Operators section of the built-in Help and manual.

dspady · January 15, 2023, 8:50pm

I guess that is where I should have started. Apologies.

Don

DrJJWMac · January 15, 2023, 9:38pm

I have to ask this. Are you trying to search PDF journal articles? If so, you may do better to off-load your files to a true bibliography database manager. The typical ones auto-populate fields named title, author, journal, and so on, allowing you to search each field separately.

Your approach suggests to me that you either already know the full title that you need or do multi-step searching anyway. For example, you pick a search term, find all the occurrences, notice the title phrase that you want (or want to exclude), and then narrow your search after the fact to get that one article that covers your scope.

It seems the more generic approach is to obtain the number of returns for the search phrase from each PDF and then select out only those PDFs that have one return (one hit) for the search term. While that one hit may not be only in the title, at least you will have fewer hits (PDF files) to review.

But then, I believe the above generic multi-step search filter might only be possible to construct with AppleScript.

–
JJW
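
The hit-count filter described above could be sketched roughly as follows. This is Python rather than AppleScript, it does not talk to DEVONthink at all, and it assumes the text of each PDF has already been extracted into a dictionary keyed by file name. The names `count_hits` and `documents_by_hit_count` and the sample data are hypothetical.

```python
def count_hits(text, term):
    """Count case-insensitive, non-overlapping occurrences of `term`."""
    return text.lower().count(term.lower())


def documents_by_hit_count(docs, term, min_hits=1, max_hits=None):
    """Return sorted names of documents whose hit count is within the bounds."""
    selected = []
    for name, text in docs.items():
        n = count_hits(text, term)
        if n >= min_hits and (max_hits is None or n <= max_hits):
            selected.append(name)
    return sorted(selected)


# Hypothetical sample data standing in for extracted PDF text.
docs = {
    "a.pdf": "Climate change in the 20th century. No further mention.",
    "b.pdf": "Climate change in the 20th century. Climate change accelerated.",
    "c.pdf": "Soil biodiversity and ecosystem services.",
}

# Documents with exactly one hit (the single-hit filter).
print(documents_by_hit_count(docs, "climate change", max_hits=1))  # ['a.pdf']

# Documents with more than one hit.
print(documents_by_hit_count(docs, "climate change", min_hits=2))  # ['b.pdf']
```

Leaving the bounds as parameters lets the same helper serve either reading of the idea: keep only single-hit files, or keep only files where the term appears more than once.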

BLUEFROG · January 15, 2023, 10:13pm

No worries at all. Just know the resources are there when you need them.

dspady · January 16, 2023, 5:11am

Thanks for your thoughts. Yes, I am searching PDF journal articles plus various reports, books, etc. that are in PDF form. I use Bookends as my bibliography manager, along with DT, Scrivener, and Nisus or Word. I have over 25,000 journal articles covering a wide range of medical and environmental topics and use DT as a way of finding relevant information and potentially associated documents. Many of these documents, especially reports and books, have running heads that contain words I might search for. It is to avoid getting multiple ‘hits’ from those running heads that I made my inquiries to DT. Using Boolean terms in the search helps a great deal.
Don
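
As a rough illustration of the running-head problem, the sketch below drops lines that repeat on most pages before counting hits. It is plain Python, not a DEVONthink feature, and it assumes each page's text is already available as a separate string. The function name `strip_running_heads`, the 0.6 threshold, and the sample pages are hypothetical choices.

```python
from collections import Counter


def _norm(line):
    """Normalize a line for comparison: collapse whitespace, lowercase."""
    return " ".join(line.split()).lower()


def strip_running_heads(pages, min_fraction=0.6):
    """Drop lines that recur on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        # With a single page nothing can be judged "recurring".
        return list(pages)
    seen = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        seen.update({_norm(l) for l in page.splitlines() if l.strip()})
    cutoff = min_fraction * len(pages)
    repeated = {line for line, n in seen.items() if n >= cutoff}
    return [
        "\n".join(l for l in page.splitlines() if _norm(l) not in repeated)
        for page in pages
    ]


# Made-up pages with the running head "4 Degrees Hotter" on each one.
pages = [
    "4 Degrees Hotter\nIt was getting hotter in the south.",
    "4 Degrees Hotter\nRainfall fell sharply.",
    "4 Degrees Hotter\nHotter summers followed.",
]
cleaned = strip_running_heads(pages)
print("\n".join(pages).lower().count("hotter"))    # 5 hits with running heads
print("\n".join(cleaned).lower().count("hotter"))  # 2 hits without them
```

One limitation worth noting: a running head that includes a changing page number will not match from page to page, so real documents would need a looser comparison than the exact line match used here.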