Boolean NOT (NEAR ...)?

Mindstormer · August 28, 2023, 6:27pm

Is there a way to use the NOT operator without having it exclude entire documents? For example, suppose I have a file containing the following two lines of text:

The quick brown fox jumps over the lazy dog that is sleeping on the job.
The fox is sleeping.

The closest I can come is by using the following syntax:

text:(fox NEAR/9 sleeping) NOT (fox NEAR/3 sleeping).

Ideally, if a file contained both lines 1 and 2, this type of search would return a result containing the first line in the document while ignoring the second line without excluding the entire document from the pool of results.

In reality, this search excludes all files containing a match with the second condition while returning any files containing only the first line/match. As a result, the example file above never shows up.
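To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not how DEVONthink evaluates queries internally; `tokens` and `near_pairs` are hypothetical helpers, and "NEAR/n" is assumed to mean that the two word positions differ by at most n.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation and line breaks are ignored."""
    return re.findall(r"[a-z]+", text.lower())

def near_pairs(words, a, b, n):
    """Return (i, j) position pairs where a and b are at most n words apart (assumed meaning of NEAR/n)."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [j for j, w in enumerate(words) if w == b]
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= n]

doc = (
    "The quick brown fox jumps over the lazy dog that is sleeping on the job.\n"
    "The fox is sleeping."
)
words = tokens(doc)

wide = near_pairs(words, "fox", "sleeping", 9)
tight = near_pairs(words, "fox", "sleeping", 3)

# Document-level Boolean: the whole file gets a single yes/no answer.
document_matches = bool(wide) and not bool(tight)

# Occurrence-level filtering (the behaviour asked for): drop only the tight pairs.
kept_pairs = [pair for pair in wide if pair not in tight]

print("document matches:", document_matches)
print("pairs kept:", [(words[i], i, words[j], j) for i, j in kept_pairs])
```

At the document level the file is rejected because one tight pair exists. At the occurrence level the first-line pair survives. Because this sketch ignores line breaks, a pair spanning the two lines survives as well.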

This is important because I’m working with nearly 100,000 periodicals in my database, and this would be tremendously useful in situations where I am looking for two terms whenever they occur in proximity, but also want to exclude a third term that is muddying and adding hundreds of useless results to my search… If I want to do a comprehensive study of a topic in newspapers from 1880–1890, for example, I either have to weed through thousands of muddied file content results or else exclude a lot of files in favor of a more focused, exclusionary search. The ability to combine these and dynamically return file contents that match the entire search string could cut down on 90% of the work.

Hope this all makes sense. Happy to clarify if needed.

BLUEFROG · August 29, 2023, 2:01am

This is the logical result of the search. I don’t understand why it would ignore the parameters you gave it.

(fox NEAR/9 sleeping) NOT (fox NEAR/3 sleeping).

Can you provide an actual example? I’m guessing you’re not actually searching for foxes, sleeping or otherwise.

Mindstormer · August 29, 2023, 11:36am

Yeah, I understand this is the logical result. This is why I prefaced the post with: “Is there a way to use the NOT operator without having it exclude entire documents?”

As explained, I’m working with periodicals. Periodicals and newspapers contain many articles on a variety of topics. Searching for historical theological articles on the Old and New Testament provides lots of results but is often also muddied with results tied to the Old and New Covenant. Using “NOT” to exclude the term “covenant” proves problematic because, in many cases, periodicals that address the discussion of the Old and New Testaments also use the term covenant elsewhere (often in a separate article within the same periodical). So the result is that rather than excluding only articles that address discussion of old/new covenants, I end up excluding entire periodical issues because some other article elsewhere in the file listed the term “covenant.” The problem is that this wipes out a lot of good results that I would want to keep. This is why I’m looking for some kind of way to exclude a term when it is in close proximity to others rather than universally.

For academic and archival research, this is a surprisingly common problem when trying to work with tens of thousands of multi-topic periodicals and journals. Having some kind of way to delimit the results with this kind of proximity-based specificity would enable better research and lookups for all sorts of things.

I.e., a NOT/15 Boolean operator would be amazing for excluding any results where a term occurs within 15 words of the match. In that regard, this would be a feature suggestion if there is no other way to achieve this currently.
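A rough Python sketch of what such a NOT/15 could mean follows. The operator does not exist in the application as far as this thread establishes. `phrase_hits` and `not_within` are hypothetical names, the sample text is invented, and the window is assumed to extend n words on either side of the hit.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())

def phrase_hits(words, firsts, second):
    """Positions of `second` when it is directly preceded by one of `firsts`."""
    return [i for i in range(1, len(words))
            if words[i] == second and words[i - 1] in firsts]

def not_within(words, hits, excluded, n):
    """Keep only hits with no `excluded` word within n words on either side (hypothetical NOT/n)."""
    kept = []
    for i in hits:
        window = words[max(0, i - n): i + n + 1]
        if excluded not in window:
            kept.append(i)
    return kept

# Invented stand-in for one periodical issue holding two unrelated articles.
article_1 = "The new covenant and the old testament are compared at length."
filler = " ".join(["filler"] * 40)
article_2 = "A study of the new testament manuscripts."
words = tokens(f"{article_1} {filler} {article_2}")

hits = phrase_hits(words, {"old", "new"}, "testament")
kept = not_within(words, hits, "covenant", 15)

# A document-level NOT would reject the whole issue instead.
document_level = bool(hits) and "covenant" not in words

print("hits kept:", [" ".join(words[i - 1: i + 1]) for i in kept])
print("document-level result:", document_level)
```

Only the hit in the second article is kept, while the document-level test discards the issue altogether.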

Mindstormer · September 1, 2023, 1:13pm

I’ll take it that this is not possible. Do I need to make a feature request somewhere for consideration of NOT/n? I know a lot of apps have vote-based feature request pages.

chrillek · September 1, 2023, 1:34pm

I’m wondering if
old AND (NOT(Covenant near/15 Testament))
might be doing what you want. That should find all documents containing “Old” where “Covenant” does not occur within 15 words of “Testament”.

It’s a bit difficult for me to gather from your prose what you’re after - perhaps a short example of what you do and don’t want to match helps.

Mindstormer · September 1, 2023, 2:07pm

Thanks for the idea, chrillek. Unfortunately, when I try this same syntax with test files to see whether it works, it still excludes entire files if they contain a proximity match between covenant and testament. This is not helpful, because one article in a periodical PDF might contain the result I want to exclude, while another article later in the same file contains the result I want to include. In other words, I want the Boolean “NOT” to suppress in-file results when they match the “NOT,” without removing entire files that may have other in-file matches of what I want to find.

In my case example, I’m wanting articles discussing the old/new testament without reference to old/new covenant because the latter are usually articles outside of the intended scope of research. If a periodical contains one of each kind of article, I don’t want to lose the good one just because the undesired one was present elsewhere in the file.

Hope that makes sense.

chrillek · September 1, 2023, 2:26pm

But then
((old OR new) AND testament) AND (NOT covenant)
should do the trick. Before, you were talking about “distances” between words.

Speaking in terms of boolean algebra: You intend to find those PDFs that contain term A, but do not contain term B. There’s nothing about “distance” in this requirement, it’s all about existence.

But maybe I’m still not getting it. I’m educated in logic and might fail translating from your description.

And perhaps it would be reasonable to split the periodicals into articles, to make things easier.

Mindstormer · September 1, 2023, 2:53pm

Yeah, “NOT” is all about existence in the entirety of the document, unfortunately. So the moment (NOT covenant) is added, all newspapers/periodicals where the term is mentioned even once anywhere will exclude the entire file result entirely.

I’m working with a database of around 100,000 multi-column historical periodicals and newspapers. It is virtually impossible to split them into articles because they don’t cleanly start and end on mutually separate pages (just like modern newspapers where articles can start mid-page, after another article). The amount of work necessary, even if they could be separated, would be astronomical.

BLUEFROG · September 1, 2023, 3:08pm

So the moment (NOT covenant) is added, all newspapers/periodicals where the term is mentioned even once anywhere will exclude the entire file result entirely.

And this is the logical result of your search. You’re telling the search engine you don’t want something to be matched but show it to me anyways. That’s contradictory and would indicate a broken search engine IMHO.

((old OR new) AND testament) AND (NOT covenant)

This would match old testament or new testament and logically exclude documents with covenant in it.

Boolean operators are powerful things but they also don’t lead to a syntax that is intuitive for everyone. And for the kind of search you are referring to, they are required.

Does this look correct, excluding old covenant…?

Mindstormer · September 1, 2023, 4:06pm

Yes, your screenshot represents the type of dynamic search I’d like to be able to accomplish, were it possible.

BLUEFROG · September 1, 2023, 4:35pm

Well, that’s the result of a search I constructed, so yes, it’s possible. However, I am testing some things related to it.

chrillek · September 1, 2023, 4:42pm

I think @bluefrog and I just demonstrated that it is possible. Boolean algebra is the key to it. But frankly, I’m not sure that this is really what you’re after – from your previous posts, I’d gathered that you want to include and exclude the same document in the search results simultaneously.

That’s only possible if you put it in the same box as Schrödinger’s cat. But even then – lifting the lid will destroy the illusion.

It is about negating the following predicate. So,
NOT (a NEAR/4 b) will hopefully match every document where a and b are not four words or less apart. Including those, I’d suppose, where neither or only one term appears, and of course those where the distance is more than four words.
NOT (a AND b) will match all documents not containing a and b at the same time: those with only a, only b, or neither.
NOT (a OR b) will match all documents containing neither a nor b.
And so on.
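The three negations above can be checked against a handful of toy documents. This is only a sketch of document-level Boolean logic in Python, not the application's search engine. The `near` helper is hypothetical and assumes NEAR/4 means the word positions differ by at most 4.

```python
import re

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def near(words, a, b, n):
    """True if some occurrence of a lies within n words of some occurrence of b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [j for j, w in enumerate(words) if w == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

# Toy documents; "x" and "y" are filler words.
docs = {
    "both_close": "a b",
    "both_far": "a x x x x x x b",
    "only_a": "a x",
    "only_b": "x b",
    "neither": "x y",
}
words_of = {name: tokens(text) for name, text in docs.items()}

not_near = sorted(n for n, w in words_of.items() if not near(w, "a", "b", 4))
not_and = sorted(n for n, w in words_of.items() if not ("a" in w and "b" in w))
not_or = sorted(n for n, w in words_of.items() if not ("a" in w or "b" in w))

print("NOT (a NEAR/4 b):", not_near)
print("NOT (a AND b):  ", not_and)
print("NOT (a OR b):   ", not_or)
```

Each predicate yields one verdict per document, which is why a negation can only keep or drop a file as a whole.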

Mindstormer · September 1, 2023, 4:56pm

Hmm, what am I doing wrong that is preventing the result from showing? See attached the created file and then the result of the search.

BLUEFROG · September 1, 2023, 5:03pm

As I mentioned, this is not going to work in the manner you’re attempting.

Does old and new actually matter in this case?

And the moment you add AND (NOT …) you’re excluding documents. That is the ONLY logical response to your query.

Mindstormer · September 1, 2023, 5:17pm

All I was really asking about was the possibility of a proximity-based exclusion. The old and new terms merely keep the results from including things like people’s last will and testament. In a sense, this begins to accomplish the same thing in a positive way. That alone wasn’t quite enough, which is why I was seeking a way to exclude additional terms in proximity to these without excluding entire files.

If it’s not possible currently, that’s fine. That’s all I was trying to determine in case there was another method that could work to narrow things down. The bigger the database, the more beneficial advanced filter techniques become.

BLUEFROG · September 1, 2023, 5:22pm

You’re fighting against how searches technically work versus your conception.

In your example, AND (NOT…) is going to do your results a disservice, as it will exclude documents you hope (not expect) to match.

If you use OR you will get more matching documents, positively matched ones, even if they also include covenant. IMHO this is better than missing documents.

Here’s a question, re-asked if need be:

You want to match old testament or new testament.
You don’t want to match documents with only covenant in them but only all the terms?

Given these documents, which one(s) should be matched?

As far as I understand, you only want to match document 1, correct?

Mindstormer · September 1, 2023, 5:39pm

Yeah, this is what I have been doing. The problem is that when I get, say, 900–5000 results… a lot of them could have been excluded if there was a way to exclude specific terms in proximity to my desired results.

You are correct that in your example, I only want to match document 1 in that scenario.

It gets more complicated, however, because I have additional periodicals in my database of two kinds:

Some have articles on the old or new testament that reference the Abrahamic covenant, or discuss the old and new covenants, which I don’t want to muddy my results. These entire documents would be fine to exclude and I’d be happy with using NOT “covenant” to exclude these.

But then others can have an article on page 1 which could be about new marriage covenants, or some other use of the term covenant. Then a completely separate article on page 5 is all about the old and new testament. If I excluded the term covenant as in the prior example, the entire document would be excluded and I would miss out on seeing the article on page 5 just because a separate article on page 1 referenced the term covenant.

So, in the end, I’m forced to stick with AND and OR boolean logic, which is fine, but produces lots of additional hits for me to wade through.

BLUEFROG · September 1, 2023, 5:57pm

Use this: ((new OR old) NEXT testament) OPT covenant
This matches Document 1, not Document 2. The term covenant will not be highlighted in the search hits. While that lack of highlighting gives you what you say you need, @cgrunenberg would have to weigh in on whether that’s intended or not.

Do you understand why the NOT operator will not work here?
Do you see how the NEXT operator gives specificity more easily than BEFORE or AFTER in this case (and similar ones)?

So, in the end, I’m forced to stick with AND and OR boolean logic, which is fine, but produces lots of additional hits for me to wade through.

Bear in mind, there is a limit to how much filtering you can do through searching.

chrillek · September 1, 2023, 6:16pm

I’m not convinced that the search criteria are at all clear. Basically, they want to find old/new testament and covenant, provided that it’s not old/new covenant and that the terms appear in the same article.

Each document can, they said, contain many articles. And there is no clear limit to an article, it seems. But the area of interest is the article. How would anyone write a search expression for that? I can understand that it’s desirable, but feasible? I don’t think so. Even with regular expressions (and I’m a big fan of those) – I don’t see a way.

Whatever search mechanism one uses, it operates on well-defined regions: a line, a paragraph, a page, a document. Arbitrary, not clearly defined parts of a document are not among them.
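The dependence on the region can be illustrated with a small Python sketch that applies the same Boolean test once to the whole document and once per paragraph. It assumes paragraphs are separated by blank lines, which scanned multi-column periodicals may well not provide. The sample text and the `matches` helper are invented for illustration.

```python
import re

def tokens(text):
    """Set of lowercase word tokens in a region."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(region):
    """((old OR new) AND testament) AND (NOT covenant), applied to one region."""
    w = tokens(region)
    return ("old" in w or "new" in w) and "testament" in w and "covenant" not in w

# Invented two-paragraph stand-in for an issue with two unrelated articles.
issue = (
    "A notice on new marriage covenant customs.\n\n"
    "An essay on the old testament prophets."
)

# Region = whole document: one verdict for the file.
document_level = matches(issue)

# Region = paragraph (assumed blank-line separated): one verdict per paragraph.
paragraph_level = [p for p in issue.split("\n\n") if matches(p)]

print("document-level:", document_level)
print("paragraph-level hits:", paragraph_level)
```

The query text is identical in both runs and only the region changes. The approach works only when the region has a boundary the search can detect, which is the sticking point for articles.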

Mindstormer · September 1, 2023, 6:26pm

I do understand why the NOT operator would not work there… because it would exclude the entire document when part of it may be desired (as with the attachment below). This is why I was asking about a feature request for an operator that, instead of searching for terms in proximity, excludes results with terms in proximity to others (as in article 1 below). It would essentially be the inverse of regular proximity operators.