Auto-search incoming doc and generate alarm if text string found

VincentA · October 20, 2023, 2:38pm

Hi - I’ve been using DT happily since 2010 - initially for my MSc, then as a file repository for a legal practice.
I would like to do the following:
Send file to the global inbox (manually), then:

DT automatically searches doc for given text strings, then
DT puts an alarm on my desktop.

I knew a teeny bit of AppleScript years ago, and have the Sal Soghoian bible, but haven’t got a clue where to start…

BLUEFROG · October 20, 2023, 2:45pm

What does this mean? How do you have an alarm on your desktop?

And what “text strings”?

Automation requires clear directions.

VincentA · October 20, 2023, 3:06pm

Thanks for the very speedy reply. Forgive me if I’m being sloppy in my choice of terms for the question:
Alarm - Should have said Notification - the banner that pops out at the right-hand side of my screen. DEVONthink 3 sent me one a couple of days ago about how to generate a new database template.
Text string - I receive Court Case Listings as email attachments several times daily. I want to automate the process of searching them for cases in which we are instructed using either the client’s surname or the alphanumeric case reference. Obviously the list of case names / references will need updating semi-daily.

chrillek · October 20, 2023, 3:19pm

A smart rule could do that, I think. But how many search strings are we talking about?

VincentA · October 20, 2023, 3:26pm

Depends how I do it - if I update weekly for what “SHOULD” be called into court the next week, then 10-20.
If I’m checking for nasty surprises on any of the cases running at any given time, then more like 100.
The UK criminal court system is on its knees, and all sorts of unexpected things can and do happen.

chrillek · October 20, 2023, 3:32pm

A smart rule for 10 to 20 strings should be ok. More than that, and it’ll become tedious.

One possibility might be to put the strings (aka case number, client names) into a separate file. A script (possibly run by the smart rule) can read this file and check the incoming record against the strings it finds there.
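
As a rough illustration of that idea, here is a minimal sketch in plain JavaScript. It assumes the list and the incoming document are already available as strings; reading them from DEVONthink is left out. The helper name `findListedCases` and the sample data are placeholders, and the plain substring check is just the simplest possible test, not necessarily the best one.

```javascript
// Hypothetical helper: return the list entries that occur somewhere in the document text.
function findListedCases(listText, documentText) {
  const entries = listText
    .split("\n")
    .map(line => line.trim())
    .filter(line => line.length > 0); // ignore blank lines in the list
  return entries.filter(entry => documentText.includes(entry));
}

// Placeholder data: one name or case number per line.
const sampleList = "SMITH John\n01VV9876543\nBAR Foo\n01AA1234567\n";
const sampleDoc = "<td>01AA1234567</td><td>BAR Foo</td><td>01ZZ0000000</td>";

const listHits = findListedCases(sampleList, sampleDoc);
console.log(listHits); // [ 'BAR Foo', '01AA1234567' ]
```

A substring check like this is case-sensitive and would also match an entry that is only part of a longer name or number.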

VincentA · October 20, 2023, 3:40pm

Sounds good - I can generate case name / reference lists really easily

chrillek · October 20, 2023, 3:53pm

That’s conveniently vague. From your first post, you’re sending a “file” to the global inbox. What’s the format of this file? If it’s a PDF, does it contain a text layer or do you have to do OCR on it to make it searchable?

VincentA · October 20, 2023, 3:55pm

Nothing that fiddly - the attachments are htm files. I’m trying to keep the thread out of the weeds and just making it trickier for you to help! Argh.
Not relevant to this thread, but running OCR on unsearchable pdfs and then pulling textual rabbits out of hats was one of my favourite party tricks for years. Most pdfs I deal with now are pdf&text…

VincentA · October 20, 2023, 4:01pm

One advantage of the file in question being .htm is that I can include (I know this isn’t code) “IF file format = .htm, then…” into the smart rule.

chrillek · October 20, 2023, 4:24pm

You can do that with any format, also more useful ones like PDF.

VincentA · October 20, 2023, 10:39pm

More lazy drafting - they’re the only files of that type I’d ever be importing to DT

chrillek · October 21, 2023, 10:35am

Here’s a JavaScript script that can be used in a smart rule or stand alone on selected HTML files:

function performsmartrule(records) {
  const app = Application("DEVONthink 3");
  const curApp = Application.currentApplication();
  curApp.includeStandardAdditions = true;
  
  /* UUID of the text record containing names and case nos, one on each line */
  const caseFileUUID = 'E69CC3A3-48F6-4575-B95D-327B1C8BBB2F';
  /* Build an array of names and numbers from this record */
  const caseNameList = app.getRecordWithUuid(caseFileUUID).plainText().split('\n');
  
  /* Regular expression to match the case number and name entries in the raw HTML data */
  const RE = /<td valign="top" width="10%">(.*?)<\/td><td valign="top" width="50%">(.*?)<\/td>/g;
  
  /* Use `Set`s for the found case nos and names because they automatically store each value only once */
  const caseNos = new Set();
  const names = new Set();
  
  /* Loop over the records matched by the smart rule, weeding out all that are not HTML */
  records.filter(r => r.type() === "html").forEach(r => {
    
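    /* Start with empty sets for each record so that its notification lists only its own matches */
    caseNos.clear();
    names.clear();
    
    /* Get the raw HTML from the current record */
    const txt = r.source();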
    
    /* Find all matches in the HTML and add them to `caseNos` and `names` respectively */
    const matches = txt.matchAll(RE);
    [...matches].forEach(m => {
      caseNos.add(m[1]);
      names.add(m[2].replace(',','')); /* remove comma from name */
    })
    /* caseNos and names now contain all case numbers and names only once, bar casing.
    put all names and numbers matching one entry in the case name list into `foundList`
    */
    
    const foundList =  caseNameList.filter(code => caseNos.has(code) || names.has(code));
    if (foundList.length > 0) {
      curApp.displayNotification(`${foundList.join("\n")}`, {withTitle: `Matches found in ${r.name()}`});
    }
  })
}

(() => {
  const app = Application.currentApplication();
  if (app.name() !== "DEVONthink 3") {
    performsmartrule(Application("DEVONthink 3").selectedRecords());
  }
})()

You have to adjust the UUID at the top of the script to refer to a simple text file containing one name/case number per line.

Shortcomings

There seems to be some variability as to the casing of names: Sometimes last names are all uppercase, sometimes only the first letter is uppercase. That might cause problems if the casing in your name list is different from the court ones.
The notification only shows the first four matches. You must click on the little triangle on the upper right of the popup to expand the list. If that’s a problem, the matches could be sent by e-mail or something …

VincentA · October 21, 2023, 1:15pm

Thank you very much indeed! You’re a star. JS and regexes are way beyond me (I speak 4 human languages, but programming is a different matter).
What you’ve come up with is workable for sure - the hardest bit will be adjusting the UUID each time I update the list of case names/numbers. And that’s not hard.

A couple of questions:

why are the names case sensitive?
if the script is triggered by a smart rule which is looking for .htm files, what is line 18 for? " /* Loop over the records matched by the smart rule, weeding out all that are not HTML */"

I can easily extract the quotation marks, but this is what my (minimal effort) case list looks like with one reference per line:
“SMITH John” 01VV9876543
“BAR Foo” 01AA1234567
etc

Is that going to work? I’m not concerned if I get double hits - i.e. if both the above examples are in a given attachment I get 4 hits not two.

chrillek · October 21, 2023, 1:42pm

Because that’s how they arrive. It’s the court that writes “Foo Bar” once and “FOO Bar” the next time. Someone has to either lower or upper case everything somewhere if the names should be case-insensitive.

const caseNameList = ... plainText().toLowerCase().split('\n');
...
names.add(m[2].toLowerCase().replace(',', ''));

might help, but then the names will be all lower case in the notification.
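
If the all-lowercase output is unwanted, one option is to lowercase only for the comparison and report each entry as it is written in the list. The following is a sketch in plain JavaScript, not part of the script above; the helper name `matchIgnoringCase` and the sample values are made up for illustration.

```javascript
// Hypothetical helper: compare case-insensitively, but return the list entries unchanged.
function matchIgnoringCase(caseNameList, foundValues) {
  // Lowercased copy of everything found in the court listing
  const foundLower = new Set([...foundValues].map(v => v.toLowerCase()));
  // Keep the list entries whose lowercased form was found
  return caseNameList.filter(entry => foundLower.has(entry.toLowerCase()));
}

// Placeholder data: the list uses one casing, the court listing another.
const myList = ["SMITH John", "01VV9876543", "BAR Foo"];
const namesFromCourt = new Set(["Smith John", "Bar Foo", "Doe Jane"]);

const caseMatches = matchIgnoringCase(myList, namesFromCourt);
console.log(caseMatches); // [ 'SMITH John', 'BAR Foo' ]
```

The notification would then show the names exactly as they appear in your own list, whatever casing the court used.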

As I said, the script can also run stand-alone, for example from Script Editor or from the command line with osascript, working on the currently selected records. The filter call is just a safeguard for this case.

No, that format won’t work (I sent you a sample by PM, by the way). The list needs one name or case number per line:

SMITH John 
01VV9876543
BAR Foo 
01AA1234567
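
If the list is easier to produce in the quoted form, it could be converted to one entry per line with a few lines of JavaScript. This is only a sketch: it assumes every line looks like a quoted name (straight or curly quotes) followed by whitespace and a case number, and the function name `toOnePerLine` is made up. Lines that do not fit that pattern are silently skipped.

```javascript
// Hypothetical converter: '“SMITH John” 01VV9876543' becomes two lines, name then case number.
function toOnePerLine(pairedText) {
  const out = [];
  for (const line of pairedText.split("\n")) {
    // Quoted name, whitespace, then a case number without spaces
    const m = line.trim().match(/^[“"](.+?)[”"]\s+(\S+)$/);
    if (m) out.push(m[1], m[2]);
  }
  return out.join("\n");
}

const pairedList = "“SMITH John” 01VV9876543\n“BAR Foo” 01AA1234567\netc";
const converted = toOnePerLine(pairedList);
console.log(converted);
// SMITH John
// 01VV9876543
// BAR Foo
// 01AA1234567
```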

If the name-code pairs are fixed (i.e. if you’re not interested in cases with “Bar Foo” other than 01AA1234567), then the code could be simplified.
You will not get double hits. If “Foo” is contained twice in the HTML, it will be only once in the names set used in the script (that’s why I used a set and not an array).
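
One possible reading of that simplification, sketched in plain JavaScript: if each case number belongs to exactly one client, it is enough to look up the case numbers and take the name from your own list. This is not the script above, only an illustration. The table layout is assumed from the regular expression in the script, and `findKnownCases`, the `Map` of pairs, and the sample HTML are placeholders.

```javascript
// Same pattern as in the script: case number cell followed by name cell.
const RE = /<td valign="top" width="10%">(.*?)<\/td><td valign="top" width="50%">(.*?)<\/td>/g;

// Hypothetical helper: knownCases maps case number -> client name as written in your list.
function findKnownCases(html, knownCases) {
  const found = new Map(); // a Map keeps each case number only once
  for (const m of html.matchAll(RE)) {
    const caseNo = m[1].trim();
    if (knownCases.has(caseNo)) found.set(caseNo, knownCases.get(caseNo));
  }
  return [...found].map(([caseNo, name]) => `${caseNo} ${name}`);
}

// Placeholder data
const knownCases = new Map([
  ["01VV9876543", "SMITH John"],
  ["01AA1234567", "BAR Foo"],
]);
const sampleHtml =
  '<tr><td valign="top" width="10%">01AA1234567</td><td valign="top" width="50%">Bar, Foo</td></tr>' +
  '<tr><td valign="top" width="10%">01ZZ0000000</td><td valign="top" width="50%">Doe, Jane</td></tr>' +
  '<tr><td valign="top" width="10%">01AA1234567</td><td valign="top" width="50%">BAR, Foo</td></tr>';

const pairReport = findKnownCases(sampleHtml, knownCases);
console.log(pairReport); // [ '01AA1234567 BAR Foo' ]
```

Because only the case number is compared, the casing of the names in the court listing no longer matters in this variant.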

VincentA · October 21, 2023, 2:05pm

Thanks again - got the PM. Time to have a play.

VincentA · October 23, 2023, 3:40pm

Huge thanks to chrillek. I’ve now got a workable system which will reduce work stress levels measurably.