"Search Confusion" NEAR (Boolean OR Secondary) = HELP!

Trebor_D · March 20, 2009, 5:16pm

I’ve been using DA a lot over the past couple of weeks to run scheduled searches on (mostly) News plugins - I thought I understood the principles behind Boolean operators and the Default vs. Secondary search fields…and was using the ‘unlimited’ nature of the Secondary to attempt very specific results. But the search results kept including pages that shouldn’t have been there…and not including pages that I thought should have been. So I did some very simple comparative searches today and the results really have me scratching my head. Can anyone (is Mr. DeVille in the house?) help explain these results?

[All searches were made using the Google News plugin @ 100 Results per, with a few domains previously placed in the Preferences=>Excluded list.]

Search 1: Why are these results not more or less the same?

Search 1a
Default Query: Kenya NEAR crisis (NOT food) (NOT economy)
Secondary: [empty]
Results: 23

Search 1b
Default Query: Kenya NEAR crisis NOT (food OR economy)
Same result with: (Kenya NEAR crisis) NOT (food OR economy)
Secondary: [empty]
Results: 0

Search 1c
Default Query: Kenya NEAR crisis
Secondary: NOT (food OR economy)
Results: 69

Search 1d
Default Query: Kenya NEAR crisis
Secondary: Kenya NOT (food OR economy)
Results: 144 [inc. lots with food &/or economy!]

Search 2: Why do “NEAR (x OR y)” and “NEAR/1 (x OR y)” deliver near-identical results…And (per #1), why do similar queries in Default or Secondary fields deliver different results?

Search 2a
Default: Kenya NEAR (economy OR crisis)
Secondary: [empty]
Results: 70

Search 2b
Default: Kenya NEAR/1 (economy OR crisis)
Secondary: [empty]
Results: 71

Search 2c
Default: (Kenya NEAR economy) OR (Kenya NEAR crisis)
Secondary: [empty]
Results: 73

Search 2d
Default: (Kenya NEAR/1 economy) OR (Kenya NEAR/1 crisis)
Secondary: [empty]
Results: 7

Search 2e
Default: Kenya
Secondary: Kenya NEAR (economy OR crisis)
Results: 35

Search 2f
Default: Kenya
Secondary: Kenya NEAR/1 (economy OR crisis)
Results: 36

Search 2g
Default: Kenya
Secondary: (Kenya NEAR economy) OR (Kenya NEAR crisis)
Results: 37

Search 2h
Default: Kenya
Secondary: (Kenya NEAR/1 economy) OR (Kenya NEAR/1 crisis)
Results: 3

Obviously, I’m dealing with at least 2 separate issues here:

Boolean: Why does x NOT (y OR z) ≠ x (NOT y) (NOT z)? (Similar for NEAR and NEAR/n).

Default vs Secondary: The Help literature states, “Secondary Query: When you enter something here, the primary term (the one that is entered or the default query) is only used for querying the search engines, but not for accepting or rejecting pages. Without a secondary query, DEVONagent uses the primary query for both querying search engines and post-filtering the results.” With this logic, I would think that # of Results for [Default=x and Secondary=x NEAR y] should be ≥ [Default=x NEAR y and Secondary=empty]…but it comes out the other way around.

I’ve emphasized the # of results as the main factor in these test searches; obviously, quality is the real issue - but unless the result counts make sense, I can’t trust the quality.

Can anyone help clarify this for me? I’ve never thought of myself as any more dense than the next fella, but now I’m having my doubts…

Thanks in advance!

cgrunenberg · March 23, 2009, 9:14am

The primary term is used to initialize a search, therefore the results are NOT comparable unless this term is identical.

Maybe because of the contents of the pages, maybe because there’s a bug in the range handling of proximity operators in the current release.

The second term uses implicit AND, therefore the terms are NOT identical.

Trebor_D · March 23, 2009, 5:33pm

Thanks for your help…please bear with me!

But why would a primary of “Kenya” and a secondary of “Kenya NEAR economy” yield significantly fewer results than a primary of “Kenya NEAR economy” with an empty secondary? (Given that with an empty secondary, the primary terms are used for both the initial search and the post-filtering.) Surely it should be the other way around, since “Kenya” as a search initializer is so much less limiting than “Kenya NEAR economy”?

[And, please understand, I’m just using this Kenya / economy as an example…results are duplicated with other search terms. Interestingly, if NOT is substituted, a similar pattern results, but to a much lesser degree - a difference of up to a third, rather than 3/4.]

This is important to me because I’ve been running searches with well over 10 terms in the secondary search field…and now I’m afraid that putting them in the secondary rather than the primary (as I’ve had to do given the limit of 10 in the primary) is really constricting the results…not sure how to get the best combination of primary & secondary.

I’ve had similar results regardless of the search conducted…
It’ll be a drag to have to write out (NEAR/2 a) OR (NEAR/2 b) OR (NEAR/2 c), etc, etc…which I guess I’d better do.

That makes no sense to me. In both examples, I’m asking for all pages that include x but neither y nor z, am I not?

But let’s talk it out: if I am trying to find, for example, pages that include the word “Kenya” falling near the word “economy”, but in which neither “food” nor “crisis” appears, what’s the proper way to shape the query? And, if the answer matches either [(x NEAR n) NOT (y OR z)] or [(x NEAR n) (NOT y) (NOT z)], what result would the other form provide?

Again, thanks for taking me through this…but I do have to agree that there may be a glitch with the NEAR (x OR y) vs NEAR/1 (x OR y) - that can definitely be duplicated.
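
For reference, here is a minimal Python sketch of what a plain word-distance reading of NEAR/n would do. It is only an illustration: the function names, the sample sentences, and the default range of 10 words are assumptions, not DEVONagent's documented behavior. Under this reading, "x NEAR/1 (y OR z)" and "(x NEAR/1 y) OR (x NEAR/1 z)" accept exactly the same pages, and NEAR/1 is stricter than a bare NEAR.

```python
import re

DEFAULT_GAP = 10  # assumed range for a bare NEAR; not taken from the thread


def near(text, a, b, max_gap):
    """True if words a and b occur within max_gap word positions of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def near_any(text, a, others, max_gap):
    """x NEAR/n (y OR z), written out as (x NEAR/n y) OR (x NEAR/n z)."""
    return any(near(text, a, other, max_gap) for other in others)


# Hypothetical sample sentences
adjacent = "The Kenya crisis deepened this week."
distant = "Kenya held talks on Monday as the regional crisis deepened."

for label, text in (("adjacent", adjacent), ("distant", distant)):
    print(label,
          near_any(text, "Kenya", ["economy", "crisis"], 1),
          near_any(text, "Kenya", ["economy", "crisis"], DEFAULT_GAP))
```

If the real engine behaved this way, the grouped and written-out NEAR/1 queries would return the same count, so a large gap between them would point at how the grouped form is parsed rather than at the pages themselves.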

Bill_DeVille · March 23, 2009, 6:49pm

No, the two expressions are not the same.

x AND (implicit) NOT (y OR z) will list documents that contain ‘x’ and EITHER NOT ‘y’ OR NOT ‘z’.
x AND (implicit) (NOT y) AND (implicit) (NOT z) will list documents that contain ‘x’ but do not contain BOTH ‘y’ and ‘z’.

So the first expression may well yield more hits than the second. A document will become a ‘hit’ for the first query if it doesn’t contain either y or z. But it will become a ‘hit’ for the second query only if it fails to contain both y and z. (In either query, of course, a document ‘hit’ must contain x.)

Trebor_D · March 23, 2009, 8:54pm

Ay, madre, my head goes 'round…thought I had it there for a minute, but then I got thrown off when you said #1 might have more results than #2. A little more hand-holding, if you’ve got a moment…

Here are 4 simple text samples, to plug in your explanation:
A) This year has seen an improvement in the economy of Kenya.
B) This year has seen an improvement in the economy of Kenya, due to the abundance of food.
C) This year has seen an improvement in the economy of Kenya, due to increased flower exports.
D) This year has seen an improvement in the economy of Kenya, due to the abundance of food and increased flower exports.

With x=Kenya, y=food, and z=flower,
#1) “x NOT (y OR z)” would return YES on A, and NO on B, C, D, would it not?
#2) “x (NOT y) (NOT z)” would come back YES on A, B, and C, but NO on D?

And if you substitute NEAR for NOT, then #1 comes back YES on B, C, and D, and NO on A…while #2 comes back YES on D, NO on A, B, and C.

Do I get it? (I swear, they sure don’t teach Boolean in school any more!)
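
As a cross-check, the four samples can be run mechanically. The sketch below is a minimal Python illustration under textbook Boolean semantics, where De Morgan's law makes NOT (y OR z) the same as (NOT y) AND (NOT z). That is an assumption about standard logic only; DEVONagent's parser may group implicit ANDs and NOT differently, as the result counts in this thread suggest. NEAR is approximated here by simple co-occurrence, since the samples are single sentences.

```python
import re

samples = {
    "A": "This year has seen an improvement in the economy of Kenya.",
    "B": "This year has seen an improvement in the economy of Kenya, "
         "due to the abundance of food.",
    "C": "This year has seen an improvement in the economy of Kenya, "
         "due to increased flower exports.",
    "D": "This year has seen an improvement in the economy of Kenya, "
         "due to the abundance of food and increased flower exports.",
}


def has(text, word):
    """Whole-word, case-insensitive containment test."""
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


def hits(predicate):
    """Labels of the samples for which the predicate holds."""
    return [label for label, text in samples.items() if predicate(text)]


x, y, z = "Kenya", "food", "flower"

# x NOT (y OR z)
not_grouped = hits(lambda t: has(t, x) and not (has(t, y) or has(t, z)))
# x (NOT y) (NOT z), with implicit AND between the terms
not_separate = hits(lambda t: has(t, x) and not has(t, y) and not has(t, z))
# x NEAR (y OR z), approximated as x AND (y OR z)
near_grouped = hits(lambda t: has(t, x) and (has(t, y) or has(t, z)))
# (x NEAR y) (x NEAR z), approximated as x AND y AND z
near_separate = hits(lambda t: has(t, x) and has(t, y) and has(t, z))

print(not_grouped, not_separate, near_grouped, near_separate)
# ['A'] ['A'] ['B', 'C', 'D'] ['D']
```

Under these assumptions both NOT forms accept only sample A, while the two NEAR forms differ exactly as described above (B, C, D versus D only).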

cgrunenberg · March 24, 2009, 6:23am

Search engines return all kinds of pages for the primary term “Kenya” but obviously few containing “economy”, so there’s not much left for the secondary term to match. With the primary term “Kenya NEAR economy” it’s the opposite, of course: lots of the returned pages contain both words, and therefore more of them contain both within a certain range.
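
A toy model may make this easier to see. The Python sketch below is purely illustrative: the corpus, the `engine` and `post_filter` functions, and the cap of 100 results are all made up for the example (the cap mirrors the "100 Results per" setting mentioned earlier), and the primary NEAR is sent to the engine as a plain AND, on the assumption that search engines do not support NEAR. Because the engine only hands back a capped list, a broad primary term fills that list with pages the secondary filter then rejects.

```python
import re


def words(text):
    return re.findall(r"\w+", text.lower())


def near(text, a, b, max_gap=10):
    """True if a and b occur within max_gap words of each other (assumed range)."""
    tokens = words(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


# Hypothetical index: every page mentions Kenya, one in five mentions the economy.
corpus = [
    "Kenya economy grows" if i % 5 == 0 else "Kenya safari travel tips"
    for i in range(300)
]


def engine(primary_words, limit=100):
    """Stand-in for a search engine: AND of the words, capped at `limit` pages."""
    matching = [p for p in corpus
                if all(w.lower() in words(p) for w in primary_words)]
    return matching[:limit]


def post_filter(pages):
    """Stand-in for the post-filtering step: Kenya NEAR economy."""
    return [p for p in pages if near(p, "Kenya", "economy")]


broad = post_filter(engine(["Kenya"]))              # Default: Kenya
narrow = post_filter(engine(["Kenya", "economy"]))  # Default: Kenya NEAR economy
print(len(broad), len(narrow))  # 20 60
```

In this model the broad query keeps 20 pages and the narrow one keeps 60, even though "Kenya" alone matches far more pages in the index. How closely real engine ranking follows this pattern is an open question.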

Trebor_D · March 24, 2009, 5:57pm

Still seems counter-intuitive to me. # of results for (Kenya) > (Kenya AND economy) for an initial search (with AND substituting for NEAR since the search engines can’t handle NEAR)…so with the same secondary applied to each, I’d expect (Kenya)+(Kenya NEAR economy) ≥ (Kenya AND economy)+(Kenya NEAR economy).

But the proof’s in the pudding…guess I may be a lost cause. Thanks for trying!