Performance tips

Do people have any performance tips?

For example, if I have a smart rule running while the results of that smart rule are open in the current view, execution of the smart rule is painfully slow.

However, if I switch the view to the Inbox, the smart rule executes significantly faster; I’d estimate 3x or 4x faster. I assume this is because DTP doesn’t have to keep refreshing the view as the smart rule executes.

Have people identified similar performance issues/workarounds? I feel I’m starting to push the limits of what one DTP database can cope with, and performance is suffering as a result.

Thanks in advance.

SM

A screenshot of the rule and the number of items in your database(s) would be useful.

These are the stats from the database in question…

The smart rule is an AppleScript a few thousand lines long, so not really worth sharing, but it does update a lot of custom data fields. I regularly see DEVONthink running flat out at 99% busy, and smart rules take longer to run if the smart rule’s view is open.

Maybe when I can switch to “matches” rather than “is”, as per this thread, performance will improve overall, as even the smart rule’s view can take 15-30 seconds to populate even before the smart rule runs.

That’s why a screenshot of the smart rule’s conditions would be useful. E.g. lots of conditions or an enabled Fuzzy option reduce performance. In addition, how much RAM does your Mac have? Maybe it’s also related to virtual memory. But it might also be an issue of the long script that is executed in the background.

So this smart rule took 15.288 and 15.326 seconds to populate on two consecutive executions. This is before the on-demand AppleScript is executed.

My Mac has 36 GB of RAM.

I have to admit, this smart rule is running against a group with 40,997 records in it.

I suspect that each time my script updates a record, DEVONthink spends 15+ seconds refreshing the view before (or possibly while) the next record is processed by the script.

Do you truly use 19,707 tags? That can slow things down considerably.

Well spotted, and @MrSkooby, that indeed is not a good idea. In fact, you should be receiving warnings in Window > Log about this.

Does the rule search in all opened databases? Are there more databases in addition to the one shown above?

This searches one group in the database screenshotted above.

In short, yes, I’m using 19,707 tags, and the number is increasing daily. I find them incredibly useful. All my PDFs have a “Keywords” section.

e.g. Keywords : ERR-01234

My script

  • tags the PDF with the tag ERR-01234
  • converts “ERR-01234” in the PDF into a link that takes me directly to the ERR-01234 tag, so when I click the link I immediately see all the other PDFs that have issues related to ERR-01234

If this is the root of my performance issues I can start to look at a different structure, but at the moment, other than the performance of searches, it’s functionally an excellent configuration.

I’m not seeing any warnings/errors in the log. Is there a threshold?

I’d expect keywords in PDFs to be searchable, using the docKeywords prefix. Such a search should immediately turn up all the PDFs with this keyword. Why is it necessary to have these keywords replicated to tags for searching?
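
For example, with ERR-01234 from earlier stored as a real PDF keyword, entering this in the search field should immediately list every PDF that carries it:

docKeywords:ERR-01234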

A link where? In the PDF, so that each PDF with the keyword “ERR-01234” also has a link to the DT item representing the tag “ERR-01234”?


Each document has sections like this …

My script has tagged this record with 14 tags. 13 keyword tags and 1 error tag.

If I click on “AUDIT DATA” in the PDF’s keywords section, it is a link (“x-devonthink-item://02493B39-E4D2-446B-980C-A098DD90537D”) that takes me directly to the AUDIT DATA tag, where in this instance I’m immediately presented with 7 documents that are specifically tagged with “AUDIT DATA”.

If I just search the DB for “AUDIT DATA” I get 220 results many of which are not all that relevant.

At the moment I’m working around this issue by using an external SQLite database as an index, which takes a 5-6 second search down to a few milliseconds and works really well.
If the number of tags is the cause of my performance issues then I might be able to do something similar.
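
For reference, the index is essentially a table mapping each keyword to the UUIDs of the DT items that carry it, queried via sqlite3 from AppleScript. A simplified sketch of the lookup (the file name dt-index.sqlite and the table doc_keywords are illustrative, not my real schema):

set theKeyword to "AUDIT DATA"
set dbPath to POSIX path of (path to home folder) & "dt-index.sqlite" -- illustrative index file
-- doc_keywords(keyword TEXT, uuid TEXT) is an illustrative table mapping keywords to item UUIDs
set theSQL to "SELECT uuid FROM doc_keywords WHERE keyword = '" & theKeyword & "';"
set theUUIDs to paragraphs of (do shell script "/usr/bin/sqlite3 " & quoted form of dbPath & " " & quoted form of theSQL)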

What are these “sections”? You were speaking of “keywords” before, and keywords are predefined metadata in PDFs. Now it appears that you’re using your own “sections” (whatever that is) in a PDF that you introduce with the heading “Keywords”. In addition, you seem to use another section titled “Errors” to summarize errors. Before, you called these things “keywords”, indiscriminately. That insinuated that you were talking about a kind of metadata, namely the PDF keywords.

No. These are not “PDF keywords”. These are just links in a PDF below the heading “Keywords”. Of course, this is not searchable with docKeywords: AUDIT DATA as a real PDF keyword would be.

Using diverging terminology and not clarifying necessary details makes it difficult to help.

I’d suggest storing metadata as your “Keywords” and “Errors” as such, not as tags. Metadata is searchable and (in the case of PDF) part of the document itself. Tags are also searchable but separate from the document.

Currently, you’re duplicating metadata from your document, which is (in my opinion) never a good idea: if you change the document’s metadata, you’ll have to change the replicated data, too. That’s an error-prone process that introduces unnecessary overhead.

Apologies if my terminology wasn’t quite correct. I wasn’t aware of the keywords metadata field available within a PDF.

You are correct, my “text keywords” section is just part of each document. However, these “text keywords” and “text errors” are just links to tags. The tags simply work as a one-to-many relationship that I can follow with one click. I don’t see it as replicating data; it’s more like joining documents together by common tags.

My script just goes one step further in that it automates identifying “text keywords/errors”, creating the required tags, and updating the PDF text to become links to the relevant tags. It also removes orphaned tags. If a PDF changes, the script will update the tags/links as required. So I personally don’t see much of an overhead.

If I place the “text keywords/errors” in the PDF’s keyword metadata, yes, I believe I can search it, and I like that it is independent of DTP, so if the PDF is used by me or somebody else outside of DTP the metadata is still there. But while it’s in DTP, if I wanted to follow up on “AUDIT DATA” I would then have to manually run the query “docKeywords: AUDIT DATA”, compared to my current arrangement where I click a link and get taken to the tag.

However, this thread was about performance. If the tags are the cause of my performance issues then I will look for an alternative configuration. But if the tags aren’t the cause of my performance issue then I need to look elsewhere.

If you change, for example, AR-2010 to AR-2011, you have to change all occurrences of “AR-2010” in all PDFs to “AR-2011” as well as changing the links. Depending on the number of PDFs, that might take some time.

You seem to employ DT as a relational database. Which it is not. I read in another of your posts that you’re currently circumventing some of the perceived performance issues by using SQLite – that seems to be a better approach than forcing DT into doing something it wasn’t conceived to do.

What about using a URL command instead of your DT item URL?
x-devonthink://search?query=docKeywords:AUDIT%20DATA
should (I hope) do what you’re currently doing with the x-devonthink-item link that takes you to the tag.

Edit: Well, that’s not as brilliant an idea as I initially thought. Real PDF keywords are not clickable, so you’d still have to replicate them into your PDF text. An alternative might be to use pseudo-keywords that don’t appear in the text itself, like AUDIT_DATA and _AUDIT. I think that DT indexes some non-word characters, but I couldn’t find anything authoritative on that with a quick glance at the search chapter in the documentation.

Thankfully this event doesn’t happen. If, in the unlikely event, one document was updated to AR-2011, it would correctly become disconnected from the existing AR-2010 records. If one of the other documents was also updated to AR-2011, then it too would be disconnected from the remaining AR-2010 records and would be re-connected to the other AR-2011 document via an AR-2011 tag.

In my opinion DT is great at mining unstructured data, and that is great for this requirement as well. However, my data already contains a lot of structure, and that structure has a lot of value to me. So yes, it may be inappropriate, but I am trying to retain as much of that structure in DT as I can and hopefully get the best of both worlds.

This looks interesting. I will experiment with it. But looking at the AppleScript dictionary, the “meta data” property of a record is get-only; there seems to be no set option, even though the UI appears to let you add keywords to a PDF’s metadata.
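
A quick check, assuming DT3’s dictionary (using the record shown in the frontmost window, just for the example):

tell application id "DNtp"
  set theRecord to content record of think window 1 -- the record currently shown
  set md to meta data of theRecord -- reading works
  -- set meta data of theRecord to md -- this errors; the property is get-only
end tell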

I would need to really give this some thought and it may be hard to truly understand the mission without understanding the full document which of course you cannot share.

That said - is it possible that rather than a bunch of tags, what you really want is an Annotations document with 14 sections, where each section contains Page Links to each occurrence of what you are now calling a tag?

I’m not sure if I understand you correctly, but I think that is what I already have.

The text keywords at the bottom of my PDF, such as “AUDIT DATA”, start off as plain text.
My script identifies these text keywords and runs some code, “FindOrCreateTag(theKeyword)”, which returns the DT reference URL for an existing tag, or creates a new tag if one doesn’t exist and returns the reference URL for that. In short: pass it a tag name and it returns a reference URL.

The script then annotates the “AUDIT DATA” text by attaching the reference URL to it. Now when I click on “AUDIT DATA” in my PDF, it takes me directly to the “AUDIT DATA” tag in DT, where I can see all of the documents I’ve written that contain the “AUDIT DATA” tag.
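
For the curious, FindOrCreateTag boils down to something like this sketch; it relies on tags living as groups under the database’s Tags group:

on FindOrCreateTag(theKeyword)
  tell application id "DNtp"
    set theDatabase to current database
    -- tags are addressable as groups at the path /Tags/<name>
    set theTag to get record at "/Tags/" & theKeyword in theDatabase
    if theTag is missing value then
      set theTag to create record with {name:theKeyword, type:group} in (tags group of theDatabase)
    end if
    return reference URL of theTag -- an x-devonthink-item:// link
  end tell
end FindOrCreateTag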

So one “text keyword” equates to one DT tag. As you can see, in this instance the document has 14 “text keywords”, so it has 14 links to 14 DT tags.

With 48,000+ documents, each with a number of “text keywords”, I’m actually surprised I only have around 19,000 unique tags.

I actually like this approach more…

If I can push those “text keywords”, like “AUDIT DATA”, into the PDF metadata keywords field, I can then update the annotation links to this format, “x-devonthink://search?query=docKeywords:AUDIT%20DATA”, rather than linking directly to the “AUDIT DATA” tag. This circumvents the need for tags, which you have flagged as a possible performance issue. It also means the keywords are embedded in the PDF, making it more portable/searchable by other products.
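
Building those links is straightforward; a sketch that only escapes spaces (keywords containing other URL-reserved characters would need proper percent-encoding):

set theKeyword to "AUDIT DATA"
set AppleScript's text item delimiters to " "
set theParts to text items of theKeyword
set AppleScript's text item delimiters to "%20"
set theURL to "x-devonthink://search?query=docKeywords:" & (theParts as text)
set AppleScript's text item delimiters to ""
-- theURL is now "x-devonthink://search?query=docKeywords:AUDIT%20DATA"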

Some manual tests suggest this should work. As above, the current challenge is pushing the keywords into the PDF metadata field, ideally via the DT AppleScript interface, or failing that by accessing the PDF directly via its POSIX path.
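
For the POSIX path route, something like this might work, assuming exiftool is installed (the /usr/local/bin path and the keyword string are placeholders; a PDF’s keywords field is a single comma-separated string):

tell application id "DNtp"
  set theRecord to content record of think window 1
  set thePath to path of theRecord -- POSIX path of the underlying PDF file
end tell
set theKeywords to "AUDIT DATA, ERR-01234" -- placeholder keyword string
do shell script "/usr/local/bin/exiftool -overwrite_original -PDF:Keywords=" & quoted form of theKeywords & " " & quoted form of thePath

One caveat: DT wouldn’t know the file changed underneath it, so the record would presumably need re-indexing afterwards.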