Performance of searching custom meta data

Hi,
I have some automation that needs to check for the presence of a record with a specific custom meta data value. e.g. search “mddocumentID==32.410.922”. If the record doesn’t exist it creates a new one.

The slowest part of the entire process is the “search”. This is taking around 4s per lookup and seems to be increasing as the number of records increases. This is on an M3 Pro with 36GB RAM and no other apps to speak of running.

Are there any recommendations on how I could improve performance on these queries.
Would using a non custom meta data field such as the Aliases field perform better?

An alternative is collect all of the documentIDs into a list a the start of each run and then search the list rather than the database. But to do this I still have to do a search of all records where mddocumentID!=“” and then visit each record to collect the documentID. This is also a huge overhead. Is it possible to get the documentID directly without visiting each record.

Something equivalent to …
set documentIdList to (select documentId from theLocation where documentId is not null)

Thanks

MS

What kind of automation, a script?
Have you tried using a smart rule?

Sorry, yes. An AppleScript. I’ve not tried an a smart rule. Can you search within a Smart Rule from AppleScript. Will this help ?

Post your script, please.

The main part of script is 2000+ lines long so I’ve just included the relevant parts as the rest of it will just cloud the issue.

to DNtpSearchByQuery(msgPrefix, documentLocation, searchQuery)
	#Generic function for running a query in DNtp, returning the record count and record List.
	#{documentCount, documentList} DNtpSearchByQuery(msgPrefix, documentLocation, searchQuery)
	tell application id "DNtp"
		set msgPrefix to msgPrefix & "DNtpSearchByQuery|"
		set theRecords to search searchQuery in documentLocation
		return {count of theRecords, theRecords}
	end tell
end DNtpSearchByQuery

to getDocumentByDocumentID(msgPrefix, documentId)
	# Search the database for documentID. Return true it exists and the record, or false if 
	# it doesn't exist
	set {successful, documentLocation} to (my getDocumentLocationByDocumentID(msgPrefix, documentId))
	if successful then
		set {recordCount, recordList} to DNtpSearchByQuery(msgPrefix, documentLocation, "mddocumentid==" & documentId)
		if recordCount is 1 then
			set theRecord to item 1 of recordList
			set theResult to {true, theRecord}
		else
			set theResult to {false, ""}
		end if
	else
		set theResult to {false, ""}
	end if
	return theResult
end getDocumentByDocumentID

to createDocument(msgPrefix, documentId)
	# Create a new empty document with basic meta data. The record created form a "to-do" item. This
	# record needs to be populated from the production Oracle database. But also allows a URL to be
	# generated that can be used to reference this documentId going forward
	tell application id "DNtp"
		set msgPrefix to msgPrefix & "createDocument|"
		set {successful, documentLocation} to (my getDocumentLocationByDocumentID(msgPrefix, documentId))
		if successful then
			set theRecord to create record with {type:"PDF document"} in documentLocation
			
			set custom meta data of theRecord to {mddocumentid:documentId}
			my setDocumentDetail(msgPrefix, theRecord, "documentName", "Placeholder for |" & documentId & "|")
			my setDocumentDetail(msgPrefix, theRecord, "documentID", documentId)
			my setDocumentDetail(msgPrefix, theRecord, "documentRefreshStatus", "New")
			return {true, theRecord}
		else
			return {false, ""}
		end if
	end tell
end createDocument

to findOrCreateDocument(msgPrefix, documentId)
	# Seach DNtp for the documentId. If its found return success and the record, if its
	# not found create a new record and return it.
	tell application id "DNtp"
		set msgPrefix to msgPrefix & "findOrCreateDocument|"
		set {successful, theRecord} to my getDocumentByDocumentID(msgPrefix, documentId)
		if not successful then
			set {successful, theRecord} to my createDocument(msgPrefix, documentId)
		end if
		return {successful, theRecord}
	end tell
end findOrCreateDocument

to createNotesFromPDF(msgPrefix, theRecord)
	# Using regex the PDF associated with theRecord looking for matching documet IDs.
	# For each document ID discovered search DNtp for that ID in the custom meta data 
	# field "mdDocumentId". If its found, great, if not create it. 
	tell application id "DNtp"
		set msgPrefix to msgPrefix & "createNotesFromPDF|"
		set theFilePath to path of theRecord
		my logmsg("d", msgPrefix & "File|" & theFilePath & "|")
		set documentIdList to my grepPDF(theFilePath)
		set documentIdList to my cleanTheList(documentIdList)
		repeat with documentId in documentIdList
			my logmsg("d", msgPrefix & "Found Note ID|" & documentId & "|")
			set {successful, theRecord} to my findOrCreateDocument(msgPrefix, documentId)
		end repeat
		
	end tell
end createNotesFromPDF

my createNotesFromPDF("createNotesFromPDF | ", theRecord)

But Script Debugger times all of the calls, and by far this the longest call, which takes 3.5 - 4 seconds.

search "mddocumentid==136.98-75.1" in parent id 1549626 of database id 2
{content id 1573272 of database id 2}

I do have an undesirable feature in my code at the moment where by it does the search twice. I’m looking to resolve that, but ultimately it’s this 3.5-4 second search time that is killing performance. Everything else is completing in 0.00x seconds.

And how long does an explicit toolbar search in DEVONthink take for mddocumentid==136.98-75.1 scope:selection ?

“1 item found in 3.489” seconds.

I would provide a screenshot but I can’t work out how to post images.

You can drag and drop a screen capture from your desktop into your reply.

“1 item found in 3.489” seconds.

So, this appears to be a legitimate amount of time for the search in AS.

  • How many databases are you searching? Just the one?
  • How many documents are in the database?

Only searching 1 database, with approx 13,000 documents in the database.

Not sure if it should matter in this case but…

  • Are the documents imported or indexed?
    • If indexed, where are they located - the file path in the Finder?

These are all imported pdf’s.

Try this search: mddocumentid==136.98 75.1. Any faster?

What’s the type of the custom meta data? In addition, how many items are in your databases(s) and how many of them have custom meta data? Not every type is indexed and not every search uses the index (e.g. string conditions like is/begins/contains).

Unfortunately this doesn’t help. Even documents with an ID like “2113768.1” are taking just as long.

The one we’re using as a primary index as such, documentID is “Single-line Text”.

In an attempt to rule out other issues, all other databases are closed. Only this one database is open.

I would expect nearly all of the PDFs, so nearly 14,000, have custom meta data.

I’m specifically using the “is/==” condition as we want an exact match. So if this is indexed using a hash/bitmap/b-tree index I’d hope access would be very quick.

If there are fields or data types that are specifically indexed, I could look to switch to an indexed data type.

The normalized full text index doesn’t include separators & white spaces etc., therefore all exact string comparisons (is, contains, begins, ends) are currently relatively slow. Try mddocumentid:"136.98 75.1" instead. is this faster/successful?

1 Like

That’s done it !

mddocumentid==1670198.1
“1 item found in 3.796 seconds”

mddocumentid:“1670198.1”
“1 item found in 0.005 seconds”

Thank you !

Maybe I spoke too soon. Not enough testing.

Whilst this is slower it returns the correct results of 1 record.

mddocumentid==16573211.4
1 item found in 3.890 seconds

This is much much faster, but its returning two records.

mddocumentid:"16573211.4"
2 items found in 0.003 seconds

Yes it returns the record where mddocumentid is 16573211.4, but it is also returning another record with a completely different value for mddocumentid. The second record returned has 16573211.4 within the body of a PDF.

“mddocumentid” is defined as a “Single-line Text”.

Does this mean the two fields are sharing an index ? Is there a way to get the desirable 0.003 response time but to only return the correct records ?

Thanks

So far I couldn’t reproduce this. A screenshot of the toolbar search options (see its popup menu) would be useful.

So the “is” clause correctly returns one record in just under 4 seconds

Whilst the “matches” clause returns 2 records (incorrectly I think) in less than 0.2 seconds.

Or am I just being a
Muppets_cast