I’m considering moving my (home) scanned documents to DT, and while I’ve imported them easily enough, I’m trying to see if there’s a reasonable way of scripting/automating a series of rules to tag the docs, along the lines of:
for each document:
if “phrase A” found in document.text, add tag “A”
if “phrase B” found in document.text, add tag “B”
etc…
I can do this phrase by phrase using a smart rule, but I either need a separate rule for each phrase or to keep editing the rule before applying it, which seems sub-optimal! I did a bit of a search, but didn’t find anything obviously matching, so thought I’d ask for pointers…
Similarly, Batch/Classify via Chat didn’t really help, as the tags it found were primarily applicable to only a single document, although I’ve not tested to see if the prompt can be modified as that seems the wrong direction for my pretty simple requirements…
Seemed fairly simple to me? There was no concern about exclusions, simply things such as: if the word “Santander” was in the text, tag it as “Santander”; if the phrase “Premium Bonds”, then tag as “Premium Bonds”, etc.
There was no need even to parameterise the tagging with the target string, as the tests and actions could be completely static with multiple tests. It was just the framework I was missing: being able to iterate over the files, as the batch process appears to do, and then run a script against each one, which I was expecting to exist but haven’t been able to find as yet.
It can easily be scripted by using a map (record in AppleScript parlance) associating each phrase with the corresponding tag.
You can even use regular expressions instead of phrases. Straightforward in JavaScript, more coding required with AppleScript.
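As a rough illustration of the regular-expression variant, here is a minimal sketch in plain JavaScript that runs outside DEVONthink. The names `tagRules` and `tagsMatching` are made up for this example, and the patterns are only guesses at what might be useful.

```javascript
/* Hypothetical rule list: each entry pairs a regular expression with a tag */
const tagRules = [
  { pattern: /santander/i, tag: "Santander" },
  { pattern: /premium\s+bonds?/i, tag: "Premium Bonds" },
];

/* Return the tags of all rules whose pattern occurs in the text.
   The patterns have no "g" flag, so test() keeps no state between calls. */
function tagsMatching(text, rules) {
  return rules.filter(rule => rule.pattern.test(text)).map(rule => rule.tag);
}

console.log(tagsMatching("Your PREMIUM  BOND prize", tagRules));
```

Keeping pattern and tag side by side means the tag never has to be looked up from the matched text, which can differ from the pattern in case or spacing.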
In JavaScript, one could do something like this (untested):
/* Map strings and regular expressions to tags.
If a string maps to a tag containing only the same string,
leave the mapping empty, fixTagMap() fills in the value */
const tagMap = {
  "Santander": "",     // will be set to "Santander" by fixTagMap
  "Premium Bonds": "", // will be set to "Premium Bonds" by fixTagMap
  "Telco": "TelcoTag"
};
function performsmartrule(records) {
  const app = Application("DEVONthink");
  fixTagMap();
  /* Build a regular expression from the keys in tagMap.
     Will look like /(Santander|Premium Bonds|Telco)/g */
  const phraseMatcher = new RegExp(`(${Object.keys(tagMap).join('|')})`, 'g');
  /* Loop over all records */
  records.forEach(r => {
    /* Read the record's plainText attribute once; skip records without text */
    const text = r.plainText();
    if (!text) return;
    /* matchAll returns an iterator of match arrays; in each of them,
       element 1 (the capture group) contains the matched phrase */
    const matches = Array.from(text.matchAll(phraseMatcher));
    if (matches.length > 0) {
      /* Build a duplicate-free Array of tag names from the matches */
      const matchedTags = [...new Set(matches.map(m => tagMap[m[1]]))];
      /* Add the new tags to the record's existing tags */
      r.tags = r.tags().concat(matchedTags);
    }
  });
}
/* Add a tag name identical to the phrase if the tag name is empty */
function fixTagMap() {
  Object.keys(tagMap).forEach(key => {
    if (!tagMap[key]) {
      tagMap[key] = key;
    }
  });
}
That’s a script to be executed by a smart rule action.
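The matching part can also be tried outside DEVONthink before wiring it into a smart rule. The following is only a sketch: `escapeRegExp` and `tagsForText` are hypothetical helper names, not part of DEVONthink's scripting interface. It assumes phrases should be matched literally (so characters such as "." are escaped) and that each tag should be returned only once.

```javascript
/* Hypothetical helpers for testing the matching logic in plain JavaScript */

/* Escape characters with a special meaning in regular expressions,
   so that a phrase like "A.B. Bank" is matched literally */
function escapeRegExp(phrase) {
  return phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/* Return the tags whose phrases occur in text, each tag at most once.
   An empty tag name in tagMap means "use the phrase itself". */
function tagsForText(text, tagMap) {
  const phrases = Object.keys(tagMap);
  if (!text || phrases.length === 0) return [];
  const matcher = new RegExp(`(${phrases.map(escapeRegExp).join("|")})`, "g");
  const found = new Set();
  for (const m of text.matchAll(matcher)) {
    found.add(tagMap[m[1]] || m[1]);
  }
  return [...found];
}

const sampleMap = { "Santander": "", "Premium Bonds": "", "Telco": "TelcoTag" };
const sampleText = "Santander statement. Transfer to Premium Bonds. Santander fee.";
console.log(tagsForText(sampleText, sampleMap));
```

Inside the smart rule script, the result could then be appended to a record's tags in the same way as above, i.e. by concatenating it to `r.tags()`.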
Just out of curiosity, I don’t know much about tags, but why would you tag documents with words that are already in the text?
I would tag “Santander” (I assume the bank) with “bank,” as well as JPMorgan, HSBC, UBS, etc., in order to find all banks, even if the word “bank” does not appear in the text, no?
One reason might be because the tag will ultimately include related materials that don’t contain the word.
Depending on the materials, searching for tags can ultimately be faster than searching in the body text. That’s especially relevant if further searches within the specific tag are contemplated.
If I were researching a project about several different banks, the “bank” tag would be almost completely useless. YMMV.
I don’t understand what you mean. If I have 50 documents with different bank names in the database, I can find them very easily with the tag “bank”. Why would that be completely useless? I’m probably misunderstanding you.
Or: Imagine “Santander” was the name of a bank and also the name of an exotic bird.
Then I would search for “Santander” and the tag “bank” if I only wanted documents relating to the bank … and not to the bird.
I’m afraid I don’t understand that either. Whether I search for “Santander” as a word or a tag, why should one be faster than the other?
There are probably just different ways to use tags, and everyone does it the way that makes sense to them.
It depends on how the text index is organised, and on whether you search for a word in the text. Searching for something in the tags should be a simple index operation, i.e. O(1): the time to access an element does not depend on the size of the array.
Since a word index is probably not organized as a list, we can expect something like O(log n) (in the case of a binary tree): the time to access an element grows with the logarithm of the number of elements. Under these assumptions, tag access is faster. If the word index is organized as a hash table, the best-case complexity is O(1) and the worst case O(n), which would be worse than with a binary tree.
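To make the O(1) versus O(log n) difference a bit more tangible, here is a toy model in plain JavaScript. It says nothing about how DEVONthink actually stores its indexes. It uses binary search over a sorted array as a stand-in for the binary tree (same logarithmic behaviour), and the names `binarySearchSteps` and `tagIndex` are invented for the example.

```javascript
/* Toy model only, not DEVONthink's real index structures */

/* Binary search over a sorted array; returns the number of comparisons made */
function binarySearchSteps(sortedWords, target) {
  let lo = 0, hi = sortedWords.length - 1, steps = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    steps++;
    if (sortedWords[mid] === target) return steps;
    if (sortedWords[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return steps;
}

/* Build sorted word lists of growing size and look up the last word.
   Zero-padding keeps the lexicographic order equal to the numeric order. */
for (const n of [1000, 1000000]) {
  const words = Array.from({ length: n }, (_, i) => "w" + String(i).padStart(7, "0"));
  const steps = binarySearchSteps(words, words[n - 1]);
  console.log(`${n} words: ${steps} comparisons`);
}

/* A hash-based tag index: on average one lookup, regardless of size */
const tagIndex = new Map([
  ["Santander", ["doc1", "doc7"]],
  ["Premium Bonds", ["doc3"]],
]);
console.log(tagIndex.get("Santander"));
```

A thousand times more words costs only about twice as many comparisons, so in practice the gap between the two approaches may matter less than the notation suggests.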
As to the general question of why: for a number of the more common selections that I do periodically, it’s easier for me to click on the tag to select the subset than to type it out and verify I’ve not misspelt something. And, as was mentioned earlier, a tag becomes even more useful if I end up with false positives, i.e. documents that match the test string but that I’ve removed the tag from because they’re not what I’m looking for with that particular tag.
Before you tag along, you might perhaps familiarize yourself with all the metadata and search facilities of DT. For example, I rarely use tags for anything but sorting a list of records. For me, simply typing a search term is usually faster than trying to remember a particular tag.
Indeed, this is true for many but DEVONthink has tag-specific features for those tag users. For example, the Tags filter pane (via Tools > Filter) and the Tags inspector (via Tools > Inspectors) both let you filter, apply, remove, and see tag connections/suggestions.
If it were my database, I’d already have a group or even an entire database focused on banks, making the tag redundant.
Depending on how the index is organized, a tag search only needs to look at tags, while the full text search needs to look at everything. That can be several orders of magnitude difference in the number of things to be searched.
The context is that of importing a set of previously scanned docs into DT, so there’s no prior organisation beyond the file names, which primarily give the date of the scan.