Archiving big files / indexing files in another database

Lasse · December 22, 2020, 6:47am

I’m a journalist using DT as my research database. As my projects often overlap (people, articles, notes) and I therefore want everything searchable from one place, I have one database where I keep all projects. The problem is that some of the files are very big, so the db keeps growing.

Is there a way to archive a folder/file outside of the database/in another database and still keep the meta information within my research database so it still appears in searches? Much like indexing files in the Finder, but I would like to store the file on an external disk or NAS.

pete31 · December 22, 2020, 6:52am

It should be possible to create a text record with some of a given record’s information via AppleScript. What “meta information” do you want to include?

Lasse · December 22, 2020, 7:06am

Maybe meta information is the wrong term; what I mean is the information DT uses to find a file in a search. Maybe index information or something like that would be more correct. What I want is a kind of alias for a file, with enough information attached that I can find the original file wherever I store it. I’m only guessing this is how indexed files work.

pete31 · December 22, 2020, 7:20am

Yes, I got that. But finding it by what? By name? By text? By creation date?

Please search for Search Prefixes in DEVONthink’s help, you’ll find all possible searches there.

Lasse · December 22, 2020, 7:43am

Best would be if I can find it by anything, like I can find files (often PDFs) – that are in the database by name, content, creation date etc. I just don’t want a big PDF using hundreds of MB on my HDD when that project is done – I want to archive the PDF and only keep the “alias” in my database.

rmschne · December 22, 2020, 7:48am

There is probably a way, and the folks here are likely to find it … but, with disk storage (on spinning, SSD, USB, etc.) so inexpensive these days … why not just store the “hundreds of MB” on another device, not your computer’s hard disk. I have a 5 TB disk (yes, I know way too big) just for that purpose (and other stuff).

Simplifies and reduces complexity.

Lasse · December 22, 2020, 8:00am

That’s a good idea but as I use the database every day I don’t want to connect an external drive every time I need to find a document, and to carry that drive wherever I go.

rmschne · December 22, 2020, 8:18am

Oh. I thought you were talking about archiving the metadata without the document … To me archiving meant that it would only occasionally be referred to. In the past when I had the urge to archive, I simply created a new DEVONthink database and moved stuff over there. I kept the file on the local hard disk but did not open it. When it occurred to me that I hadn’t used the archived file for a long time, I moved the file to long-term storage off the local hard disk.

Well, is “hundreds of MB” really a limitation? Just asking. In today’s world it’s not much. You have a small disk on your Mac, I guess? I guess I’m lucky having a 1 TB disk on my MacBook, so space is rarely an issue despite GBs of stuff. Do I need all that stuff? Probably not, but it collects!

pete31 · December 22, 2020, 8:37am

A record with a PDF’s plain text would still take circa half the PDF’s size. Not sure this is worth it.

Edit: I tested this with a PDF that doesn’t contain (a lot of?) images. The size dramatically decreases if a PDF has images …

Lasse · December 22, 2020, 8:38am

Ah, no, I want the metadata searchable within the database so I can find the document stored somewhere else.

A single PDF would of course not be a problem, but 100 PDFs of 200 MB each make 20 GB. And then there is video, audio, etc. So yes, it’s a problem for sure, although it might seem strange for the reasons you point out.

pete31 · December 22, 2020, 8:39am

Again, I got that. But you’re not telling me by what metadata.

I suggest you think a bit about what you actually want first.

Lasse · December 22, 2020, 8:58am

Ok, sorry if I’m being unclear. I want all metadata to be searchable. And I’m not sure if metadata is the right word, as I want to find the document by all its content, things like date created as well as the text within it.

So what I want to keep in the database from a scanned PDF: all metadata and the OCR text.
What I don’t want to keep in the database: the images/scans that take up a lot of space.

Does that make it easier to understand?

rmschne · December 22, 2020, 9:12am

Just occurred to me, but you may still not like the idea. But in the spirit of keeping things simple (short and long term), move the archived databases to an external SSD. I travelled (hardly can remember what that is anymore!) with a Samsung Portable SSD T5. Very small and light. Hardly a burden. Put all your archived stuff on something like that.

Lasse · December 22, 2020, 10:35am

I got a T5 and I love it, but as small and delicate as it is, it cannot be compared with the simplicity of an internal drive.

pete31 · December 22, 2020, 10:35am

This script creates a text record with the following properties of a selected record:

comment
creation date
custom meta data
label
modification date
plain text, plus
- name
- kind
- URL
state
tags
Reference URL

It first searches for a text record whose name is the selected record’s name plus “(Plain text and some meta data)”. If you want another name, change this in the script before you start creating your “alias” records.

If no record with this name exists, it creates one; if one exists, it updates all listed properties.

Click the “alias” record’s URL to open the original record.

There’s a part in the script that opens the “alias” record if only one record was selected. This is just for testing; comment it out or remove it (see script).

Note: If you select a text record, the script does nothing but display a notification (it doesn’t make sense to create an “alias” text record for a text record …).

-- Create or update text record with plain text and some properties of a selected non-text record

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some records"
		set theTab to (ASCII character 9)
		
		repeat with thisRecord in theRecords
			set theType to (type of thisRecord) as string
			if theType is not in {"text", "«constant txt »"} then
				set {theName_original, theSuffix} to my recordName(name of thisRecord, filename of thisRecord)
				set theName_plaintextAndSomeMetadata to (theName_original & space & "(Plain text and some meta data)") as string
				
				set theResults to search ("name:" & theName_plaintextAndSomeMetadata & space & "kind:text") as string
				if theResults = {} then
					set thePlainTextRecord to create record with {name:theName_plaintextAndSomeMetadata, type:text} in root of inbox
				else
					set thePlainTextRecord to item 1 of theResults
				end if
				
				tell thePlainTextRecord
					set comment to comment of thisRecord
					set creation date to creation date of thisRecord
					try
						set custom meta data to {}
					end try
					try
						set custom meta data to custom meta data of thisRecord
					end try
					set label to label of thisRecord
					set modification date to modification date of thisRecord
					set theText to (("Name:" & theTab & theName_original & "." & theSuffix & linefeed & linefeed & "Kind:" & theTab & (kind of thisRecord) as string) & linefeed & linefeed & "URL:" & theTab & theTab & (URL of thisRecord) as string) & linefeed & linefeed & "---" & linefeed & linefeed & plain text of thisRecord
					set plain text to theText
					set state to state of thisRecord
					set tags to tags of thisRecord
					set URL to reference URL of thisRecord
				end tell
			else
				display notification "This script doesn't work with text records"
			end if
		end repeat
		
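		-- Comment out this part by prefixing each line with # (or simply remove it)
		if (count theRecords) = 1 then
			-- thePlainTextRecord is undefined if the selected record was a text record, so ignore that case
			try
				open window for record thePlainTextRecord
				activate
			end try
		end if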
		
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on recordName(theName, theFilename)
	set theSuffix to my getSuffix(theFilename)
	if theName ends with theSuffix and theName ≠ theSuffix then set theName to characters 1 thru -((length of theSuffix) + 2) in theName as string
	return {theName, theSuffix}
end recordName

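-- Returns the text after the last "." in thePath (assumes thePath contains a dot)
on getSuffix(thePath)
	set revPath to reverse of characters in thePath as string
	set theSuffix to reverse of characters 1 thru ((offset of "." in revPath) - 1) in revPath as string
	return theSuffix
end getSuffix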
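
Side note on the getSuffix handler above: it assumes the filename contains a dot. The following is only a sketch of one possible alternative that splits on "." with text item delimiters and returns an empty string when there is no dot. The handler name getSuffixSafe is made up for this illustration and is not part of the script above; a caller such as recordName would still need to decide how to handle an empty suffix.

```applescript
-- Hypothetical helper (not part of the script above): return the filename
-- extension, or "" if the filename contains no dot.
on getSuffixSafe(theFilename)
	-- Remember the current delimiters so they can be restored afterwards
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "."
	set theParts to text items of theFilename
	set AppleScript's text item delimiters to oldDelimiters
	
	-- Only one part means there was no "." to split on
	if (count of theParts) < 2 then return ""
	return (last item of theParts) as string
end getSuffixSafe

getSuffixSafe("Report 2020.pdf") --> "pdf"
getSuffixSafe("README") --> ""
```

Restoring the previous delimiters matters here because the reverse-and-join lines in the script above rely on the delimiter being empty.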

Lasse · January 11, 2021, 10:37am

Wow, will try asap! Thanks!