Boolean NOT (NEAR ...)?

You are certainly correct, but at this point, separating out articles from so many historical archives would require a budget of at LEAST $100k. Plus, a feasible workflow would be needed, as articles don’t neatly start and end on separate pages and scan quality varies dramatically.

I am super thankful for what DEVONthink brings to the table over what is available anywhere else, however, as it does make the process of research far easier despite the hindrances of multi-topic files.

Incidentally, this syntax does work successfully, but only for PDFs. I’m told the next maintenance update will fix it across rtf/txt as well.

Working simple example:
If you want to search for second, but not second coming or second advent:
second (NOT(second NEXT/2 (coming) OR (advent)))

Working complex example:
If you want to search for the term late, but NOT “late 18th century” or “late 1880s”, etc.:
late (NOT(late NEXT/2 (centur*) OR [12]???s))

Working phrase example:
If you want to search for “Devonthink is” but not “Devonthink is a” in the manual:
“devonthink is” NOT (devonthink NEAR/3 a)
“devonthink is” (NOT(Devonthink NEAR/3 a))

What will NOT work / unsupported syntax:
“devonthink is” (NOT “devonthink is a”)
“devonthink is” NOT “devonthink is a”

Now, does the syntax for that last non-working example work sometimes? Yes.
Is this inconsistent? Absolutely.
So who knows how many false negatives are occurring with the other working examples!
This probably suggests another bug somewhere, but I’m glad for what works some of the time.
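
Since that behavior is inconsistent, one rough way to audit for false negatives is to cross-check a small sample outside DEVONthink. Here is a minimal sketch, assuming the documents have been exported as plain text to a folder; the folder path and the regex stand-in for NEXT/2 are my assumptions, not DEVONthink’s actual matching logic:

```python
import re
from pathlib import Path

# Hypothetical folder of plain-text exports; adjust to your own setup.
EXPORT_DIR = Path("~/Exports/periodicals_txt").expanduser()

# Rough, document-level stand-in for: second (NOT(second NEXT/2 (coming) OR (advent)))
# i.e. keep documents containing "second" unless "second" is followed within
# two words by "coming" or "advent".
HIT = re.compile(r"\bsecond\b", re.IGNORECASE)
EXCLUDE = re.compile(r"\bsecond\b(?:\W+\w+)?\W+(coming|advent)\b", re.IGNORECASE)

for txt in sorted(EXPORT_DIR.glob("*.txt")):
    text = txt.read_text(errors="ignore")
    if HIT.search(text) and not EXCLUDE.search(text):
        print(txt.name)  # files the DEVONthink query should also be returning
```

Comparing that list against the actual search results for even a few hundred files would give a feel for how often the working syntax is silently dropping hits.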

I would not classify this as a simple search.

Can you post an example PDF?

If you have them in PDF form, there are any number of automated or manual ways to mark the end of each article, which would cost drastically less than that.

And you must do that. Anything else will lead to a mediocre project at best.

Valid point, haha! Relatively simple in academics, perhaps.

Example

Tell me more! I’m hoping AI advances will enable more layout-aware OCR that can eventually assist in this process. Currently, the best I’ve seen still has difficulty separating out columns consistently.

As @pete31 notes - it would be very helpful for you to post an example PDF.

But putting aside the potential for AI to detect white space or a large font suggesting a new article, what if you simply had an easy-to-use, low-tech app in which a human marked the transition between two articles?

Using a back-of-napkin calculation: if a staffer can review one periodical per minute with such software, that means roughly 500 periodicals per day, or 200 days for the whole task. Allowing for error in the calculations, that’s clearly under 1 year for a full-time clerical person. Surely that does not cost $100,000.
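
For what it’s worth, a quick check of that arithmetic (the working-day assumptions are mine):

```python
# Back-of-napkin check of the review-time estimate; the inputs are illustrative.
periodicals = 100_000      # total items to mark up
per_minute = 1             # one periodical reviewed per minute
hours_per_day = 8          # one full-time clerical day

per_day = per_minute * 60 * hours_per_day   # 480, rounded to ~500 above
days_needed = periodicals / per_day         # roughly 208 working days
print(per_day, round(days_needed))
```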

But assuming this is some sort of project of academic or historic merit, you could probably do it even more efficiently by inviting a bunch of college students to help support the endeavor in return for pizza while they pull a few all-nighters as a group to get it done.

One way or another if these periodicals are worthwhile to organize, surely you can get help both to devise an app to split the articles and for the human labor to divide them.

Alternatively - hire a computer science guru student to utilize Amazon Mechanical Turk or some open-source equivalent to outsource the splitting of articles to the web at large. It’s a perfect task for such a project, and there are a surprising number of people on the web who, either for free or for a nominal amount of money, would do such a task. 10 cents per periodical would probably be a generous rate to pay and would get the job done for $10K plus the cost of a technical person to get the process rolling.

I did post an example in the post you replied to. One of the research centers I got a portion of the database from has been paying staff and students for years to scan these. Just getting the text recognized in columns is a laborious task, as Abbyy is by no means efficient for manually configuring layouts. For 2 million pages, if a student could process 100 pages an hour, it’s still 20,000 hours of work. At $10/hr, that’s $200k.

I’m a Ph.D. student trying to develop the most comprehensive historical database for my field to maximize research efficiency. So far, I’ve spent 400–500 hours collecting and tagging this ever-growing database. In the grand scheme of things, though, it will save far more time than that. In my own case, it is halving the time necessary to find sources, which is awesome. It is also revolutionizing the workflow of the 20–30 academics I’ve shared it with.

If/when the time comes that I have a budget or grant to develop a better OCR layout and/or article separation solution, I’d love to do so, but app development is something I’ve paid for in the past when I worked for another organization, and it rarely comes cheap. For a project like this, I’d expect it would cost an additional $30–100k to develop a good solution to automate the necessary processing.

Do you need to do all 100K items in the first round?

Or can you demonstrate the value of your idea by starting with the newest 1,000, the oldest 1,000, the top 100 classic articles, the 500 most cited articles in the field, etc.?

Way better in my mind to do a great job archiving the first 1,000 than a mediocre job archiving them all.

It’s history, so it’s hard to prioritize too much, as anything can become important depending on what one is researching. But yeah, if perfect layout recognition existed, along with the fictional piece of software that could mark transitions to split up the OCR’d text, that would be great. Even then, however, there are plenty of ways one could still benefit from using Boolean NOT+NEAR modifiers just to narrow down types of results within single-topic articles. To that end, the solutions I’ve posted above are quite helpful.

There is no bibliographic database anywhere that does not very clearly use individual articles as the core element in the database. All searches flow from there.

No matter what else you do with the data - step 1 is an unequivocally accurate catalog of each article.

Actually this query is not the one you want. It’s basically this:

second AND NOT ((second BEFORE/2 coming) OR advent)

Operators are evaluated from left to right, and proximity operators have the highest priority. Therefore your query matches documents containing second, but only if they contain neither advent nor second coming.
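
To make that precedence concrete, here is a small illustration; the documents and their flags are invented, and the snippet only mirrors the boolean structure described above, not DEVONthink’s internals:

```python
# Invented sample documents, reduced to the three conditions in
# second AND NOT ((second BEFORE/2 coming) OR advent)
docs = {
    "doc_a": {"second": True, "second_before2_coming": False, "advent": False},
    "doc_b": {"second": True, "second_before2_coming": True,  "advent": False},
    "doc_c": {"second": True, "second_before2_coming": False, "advent": True},
}

for name, d in docs.items():
    matches = d["second"] and not (d["second_before2_coming"] or d["advent"])
    print(name, "matches" if matches else "excluded")
```

Note that doc_c is excluded merely because advent appears somewhere in it, which is broader than excluding only second advent.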

This should work but unfortunately doesn’t due to a known issue with complex proximity queries using parentheses:

second NOT BEFORE/2 (coming OR advent)

…but will work in the next release.

  • Who is doing the scanning and who is doing the OCR?
  • And what app is being used for OCR?

Libraries often save their periodicals in the same format as the example I linked, unfortunately. They don’t typically take the time to tag them by article, based on the various archives I’ve navigated. It’s one of those things that seems ridiculous in this day and age, yet persists nonetheless.

Superb, thanks! And thanks for the syntax feedback.

There’s no simple answer to this, because my archive is a composite of libraries and archives throughout the US and some other countries. The research center I live closest to has told me they are using a variety of tools, including Abbyy for layouts as and when they have time, along with a Tesseract OCR solution running on Amazon AWS. In general, however, it seems that libraries tend to prioritize digitization, since they’re often more concerned about preserving old documents before they fall apart, recognizing that they can hopefully pay more attention to OCR later.

Agreed. But your project is to be a searchable database in your field, not a library - correct?

To do that Step 1 is to convert the digital periodicals to digital articles. Otherwise there will not be much utility to your project.

Thanks. There might be a way to split this PDF. Before I try further I’d like to check whether other scans look the same (as there’s no point in finding a solution for a single scan). Can you link to 2 or 3 more scans?

There’s a ton of utility already. NOT-based searches are a significant potential time-saver for eliminating undesired results, but even without them, being able to search for specific phrases in 100k periodicals within 1-2 seconds is light-years ahead of online archives, which rarely support Boolean searches and return files in such a way that you have to download them one at a time and hope the result was even an accurate hit on your terms. In the end, I’ve already reduced what would typically take hours or days to a matter of minutes. For me and my colleagues, this is revolutionary.

I’m dealing with over 100 different periodicals, so they do vary quite significantly. Even with the title I shared previously, they changed things up every other decade. Example 1 Example 2 Example 3 | Different Example | Another | Another 2

@Mindstormer

OK I see now what you are working on - that’s similar to initial attempts to archive newspapers. Agreed that the formatting is a challenge.

Perhaps it would be simplest in that case to arrange your database by the name of the periodical and simply do full text search within each periodical. You could subdivide by date or simply assign arbitrary page numbers within the full set of each periodical.

The “NOT” logic you are seeking could be done by the searcher in choosing which periodical(s) are relevant to the desired search.

Yes, this is precisely what I have done. All periodicals are uniformly named and tagged by title abbreviation, year, decade, etc., and sorted into an efficient directory for my master archive. Hazel for macOS has been pretty useful to that end when it comes to naming/tagging automatically.
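
For anyone wanting to replicate that sorting outside Hazel, here is a minimal sketch of the same idea; the filename pattern and folder paths are my assumptions, not the actual scheme used in this archive:

```python
import re
import shutil
from pathlib import Path

# Hypothetical filename scheme, e.g. "ADV_1895-03.pdf" -> abbreviation ADV, year 1895.
PATTERN = re.compile(r"^(?P<abbr>[A-Z]+)_(?P<year>\d{4})")
SOURCE = Path("~/Scans/incoming").expanduser()
ARCHIVE = Path("~/Archive/periodicals").expanduser()

for pdf in SOURCE.glob("*.pdf"):
    m = PATTERN.match(pdf.stem)
    if not m:
        continue  # anything that doesn't fit the pattern is left for manual review
    year = int(m["year"])
    decade = f"{year // 10 * 10}s"              # e.g. "1890s"
    dest = ARCHIVE / m["abbr"] / decade / str(year)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf), str(dest / pdf.name))
```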
