Towards a new content-based linking scheme?

macula · September 28, 2015, 8:05pm

Even though I spend practically half of each working day with DevonThink, I still haven’t resolved the perennial dilemma: to import or to index? To lock my data in a closed format or to expose it to the hazards of indexing? DevonThink in general, at least in my experience, can act temperamentally with indexed files. I have lost indexed data numerous times; never have I lost a single imported document, though. True, these unfortunate incidents are often the result of a clumsy move or rename operation in the Finder—but not always. Syncing and the handling of replicants are two areas in which I’ve found indexing to be particularly counterintuitive, if not actually fragile. The risks are particularly acute when the indexed content is in Dropbox—a use case that, I hasten to add, unlike placing the database itself on Dropbox, is perfectly legitimate. There are simply too many things that can go awry with indexed content, exposed as it is to the multitude of cloud services and apps running on our capable multitasking machines; and far more unfortunate scenarios than I can list, which may result in broken links or duplicate data.

The problem is aggravated by the habit several apps (especially iOS apps) have of deleting and rewriting a file from scratch on each “Save”, a process that (I think?) insidiously results in an entirely new entity carrying a new DevonThink URI. And if the contents of your database are extensively interconnected with x-devonthink-item:// links, then your painstakingly constructed wiki breaks apart while you remain oblivious to the damage.

So I am taking this idle moment to daydream a future in which DevonThink would no longer need to import its content: everything would be indexed. Not only that, but links would somehow (magically) always point to the intended content, even if the original file were deleted and its content copy-pasted into another, new file with a different file name! I don’t know exactly how this could be made possible, but in principle it should be possible, perhaps by calculating the hash of each file and producing a unique URI based on that hash, which would be updated on each Save. I guess our dual- and quad-core MacBook Airs/Pros and iMacs are capable of doing some speedy hashing on our document-size files (a few MBs maximum, usually much less).
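A minimal sketch of the hash-derived link idea, in Python. The `x-content-hash://` scheme and the `content_uri` name are made up for illustration; nothing here reflects how DEVONthink builds its real `x-devonthink-item://` links.

```python
import hashlib
from pathlib import Path

# Hypothetical URI scheme, invented for this sketch only.
SCHEME = "x-content-hash://"

def content_uri(path: Path) -> str:
    """Build a link from the file's bytes alone, ignoring its name and folder."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return SCHEME + digest
```

Two files with identical bytes get the same link whatever they are called, which is the "copy-pasted into a new file" case. The flip side is that any edit produces a different link, so something would still have to carry old links forward on each Save.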

Food for thought for the long run, perhaps.

korm · September 28, 2015, 8:55pm

Very interesting concept. Well beyond OS X. Looks like you want iOS on the desktop, where the file system is completely hidden from end users.

I can see your point. All the hazards you mention are not unique to DEVONthink, though, are they? As long as we can fiddle with files we can shoot our feet.

I see this comment here from time to time and scratch my head. Your data is never “locked up” in DEVONthink. The files are all stored in the open inside the database package – all you have to do is remove the .dtBase2 extension. The file hierarchy is not the same as in the DEVONthink “browser”, but that’s it. I’d bet there is almost zero probability that one day DEVONthink would stop working without warning and you’d never be able to run an export (which maintains the hierarchy, btw). With export, replicants are turned into duplicates; nothing is lost. I don’t think I’ve ever read a case here where someone had to manually rebuild a file hierarchy by digging around inside the database package.

But on iOS if an app dies – and the developer didn’t bother to expose the files to iTunes sync – your data is also dead. Now, that’s real lock-in.

BLUEFROG · September 28, 2015, 9:41pm

+1 korm

We have definitely kept things far more open than people think. And while you shouldn’t muck about inside the .dtBase2 file outside of DEVONthink, the files are still accessible and recoverable in various ways, if the need arises.

macula · September 28, 2015, 9:57pm

I stand corrected about the semantics of “locked”. Data in DevonThink is indeed readable and retrievable but with a big “do not touch” sign on the door. One of the problems with this is that data is modified all the time by today’s apps, even when the user believes it’s not. I would personally think twice before reading a PDF on, for example, the excellent PDF Expert app (iOS) straight from the .dtBase2 file. Who knows what data (or metadata!) could be modified in this supposedly read-only process.

Anyway, I am not so deluded as to think that such a radical paradigm shift can happen overnight, but I do believe that it would make sense to think in that direction.

Actually, the scheme I tried to outline (vaguely, no doubt) is very much unlike that of iOS, where data is literally locked in app-specific sandboxes (with exceptions). I am rather suggesting that data should remain “free-range”, with DevonThink links to that data based on unique identifiers derived from the content, not the location of the content in the file system.

korm · September 28, 2015, 10:10pm

I wonder if there is anyone who has implemented a sort of “free range” data storage regime. Most file systems are based on the premise that content is opaque to the operating system. To make it less opaque to the user, utilities like Spotlight came along and larger applications like DEVONthink. These things crawl the content of your files and collect metadata and word concordances.

But going to the next step suggested here – knowing where data is located by its context – might be orders of magnitude more complex. If I have 400 news clippings with the phrase “the Pope arrived in Philadelphia” then how much context (i.e., how much more data) would be needed to find that article clipped from the Inquirer at 2032 on September 27 that I’m looking for? Especially if I don’t know that the article was clipped from that source at that time?

Oh, wait. DEVONthink’s concordance and AI can actually help out here.

macula · September 28, 2015, 10:19pm

This is beyond my area of expertise also, but I believe that it is possible to compute a “hash” that, with extremely high probability, is unique to a given file. This is how, for example, Dropbox avoids storing copies of the same file more than once on its servers. Say that we both have a copy of the same mp3 in our respective Dropboxes. Dropbox will identify the two copies as instances of the same file (same hash) and therefore store only one copy of that file on its cloud servers, linking it to both my and your account. This idea is very different from the DevonThink AI algorithms that calculate degrees of affinity between documents. With hashes, change one byte in a 500 GB video and you’ll get a completely different hash, which is exactly what you need if you are trying to identify files uniquely.
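To make the deduplication idea concrete, here is a toy content-addressed store in Python. It is only a sketch of the general technique: `DedupStore` and its fields are hypothetical names, and it makes no claim about how Dropbox actually implements this.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}   # digest -> the stored bytes
        self.owners = {}  # digest -> set of account names sharing that blob

    def put(self, account: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)              # stored on first sight only
        self.owners.setdefault(digest, set()).add(account)
        return digest
```

If two accounts upload the same mp3 bytes, `put` returns the same digest both times and `blobs` holds a single entry. Flip one byte and the digest no longer matches, so the file is treated as a different item.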

BLUEFROG · September 28, 2015, 10:46pm

It seems to me you’re describing an OS and equipment agnostic filesystem which would require an incredible amount of interconnection and data sharing, and most likely would require an intermediary filesystem on the cloud (as many wouldn’t be satisfied with a local solution).

I think there is still a large group of people who would balk at the idea of having all their data in someone else’s hands. (I know I would never go for it.) In my opinion, the price of full data mobility like this, is security and privacy.

macula · September 28, 2015, 10:57pm

Well, I was “just” thinking of a layer of abstraction between DevonThink and the filesystem, which would link DevonThink items with the physical files in the filesystem based on a hash of their contents. I seriously doubt that the necessary computational power would be that high—and anyway, we’re not talking about tomorrow’s release but a long way down the (imaginary) road.
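Such a layer could be sketched roughly as below, in Python. `HashIndex`, `file_digest`, and the rescan strategy are all assumptions made for illustration, not anything DEVONthink provides.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

class HashIndex:
    """Hypothetical layer that resolves a content digest to a file on disk."""

    def __init__(self, root: Path):
        self.root = root
        self.known = {}  # digest -> last path where that content was seen

    def add(self, path: Path) -> str:
        digest = file_digest(path)
        self.known[digest] = path
        return digest

    def resolve(self, digest: str):
        # Fast path: the file is still where we last saw it, unchanged.
        path = self.known.get(digest)
        if path is not None and path.is_file() and file_digest(path) == digest:
            return path
        # Slow path: rescan the indexed folder for the same content.
        for candidate in sorted(self.root.rglob("*")):
            if candidate.is_file() and file_digest(candidate) == digest:
                self.known[digest] = candidate
                return candidate
        return None  # content was edited or deleted
```

A file that is moved or renamed is found again by the rescan. A file that is edited is not, since its digest changed, so this sketch covers only the "same content, new location" half of the problem.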

I should note, by the way, that OSX aliases work on a similar (similar, not same) principle.

BLUEFROG · September 28, 2015, 11:41pm

No offense intended. This is a free place to speak and discuss.

I’m not sure I see the benefits of what you’re envisioning, but you obviously have a different vision than I do (which is good).

macula · September 28, 2015, 11:48pm

No offense taken, either, Jim! (I now see my quotation marks around just, which I intended as self-sarcasm for my rather tall order, may have given the wrong impression). It’d be a pretty dull forum if we agreed all the time. Thanks for the chat.

kstrauser · October 7, 2015, 9:45pm

This is one of the main reasons you got my money. DTPO isn’t inexpensive and I spent a lot of time comparing it to alternatives before I bought a license. One of my hard requirements was that I had to retain full access to all my data, even if the app organizing it explodes / gets deleted / is obsoleted and not maintained / etc. There’s no way I’d trust a truly locked-up system to manage all of my personal records.

BLUEFROG · October 7, 2015, 9:49pm

And neither would we.

kewms · October 10, 2015, 5:45pm

The time needed to compute a unique hash for a file is directly proportional to the length of the file. So when you start talking about unique hashes for all items in a gigabyte or terabyte-scale file system? Yes, the computing power required starts to add up.
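A rough Python sketch of that cost, and of one common way to limit it. Hashing reads every byte, so the work grows with file size; caching the digest against size and modification time avoids re-reading files that look unchanged. `hash_file` and `CachedHasher` are hypothetical names, and the size/mtime shortcut is an assumption that can miss an edit which preserves both.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # read 1 MiB at a time so large files never sit fully in memory

def hash_file(path):
    """Return (hex digest, number of bytes read). Every byte must be read once."""
    h = hashlib.sha256()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
            total += len(chunk)
    return h.hexdigest(), total

class CachedHasher:
    """Re-hash a file only when its size or modification time has changed."""

    def __init__(self):
        self.cache = {}        # path -> (size, mtime_ns, digest)
        self.bytes_hashed = 0  # running total of bytes actually read

    def digest(self, path):
        st = os.stat(path)
        hit = self.cache.get(path)
        if hit and hit[0] == st.st_size and hit[1] == st.st_mtime_ns:
            return hit[2]
        d, n = hash_file(path)
        self.bytes_hashed += n
        self.cache[path] = (st.st_size, st.st_mtime_ns, d)
        return d
```

With a cache like this, the first pass over a terabyte-scale tree is still expensive, but later passes only pay for files whose metadata changed.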

Katherine

avatar · October 18, 2015, 9:22pm

I get the feeling this is something of a touchy area, but - I keep thinking that it would be good if I could index a folder with DTPO, edit the files (which I can do right now) and move them around (which I can’t do right now).

It’s that last bit about moving the files around that makes people baulk or respond with a shocked “DTPO is not a Finder replacement.” But you know, it’s actually damn close to the best Finder replacement - except for that inability to move files around.

Listing a Finder folder in column view and making the window big comes close - but you can’t edit the text content directly in the rightmost pane - you have to open some default app. (Really someone should come up with one of those Finder replacement apps which would do this, but they never implement it.)

In DTPO as three panes widescreen, you can edit the “right pane content”, and for me that’s the best way to browse Finder content … until I want to move something that is; then I have to remember that it’ll appear to work, but it won’t actually be reflected in the Finder.

To me, DTPO’s way of presenting Finder data is better than the Finder itself - except for that big exception - files and folders can’t “really” be moved.

I expect in reply to this everyone will just say, well, just stick it all into a database. But there are advantages to keeping stuff in the file system.

But I suppose the “move file in Finder” addition will never happen, even as an option.

gg378 · October 18, 2015, 9:47pm

Avatar: I haven’t thought too carefully about the issue with moving files around, but one runs into some fundamental problems:

If you go simply by filename (as DT right now does for indexed files), you cannot move files. Because there is no law that forbids files with exactly the same name in different folders. So how would DT unambiguously determine where the file has gone?

Because of this issue, one often tracks a UUID associated with the file; that’s for example what DT does for its internal database files. The filename is only a convenient facade to help you remember what this might be, but the system goes by some atrociously long hex-numbered UUID. That way you can follow anything wherever it goes! Fair enough, but now you start to have the reverse problem: Suppose I have a file “manuscript_last draft.pdf”. I file it. It has its own UUID. I can move it where I want. But then at the last moment, my collaborator sends me a final, unexpected update, and then submits the manuscript. So now I have to replace the original file with the truly “last draft”. I want it to go exactly where the old one went, same tags/groups. But DT can never accept them as the same, because they will forever have different UUIDs. In the current scheme of DT indexing, I can slip DT an updated file, whenever the location and name match. I use that fact quite often. You lose that if you go by UUID (*).
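The trade-off between the two schemes can be shown with a small Python model. It is purely illustrative: `Index` is a hypothetical class, and the `file_id` strings stand in for whatever per-file identity (a UUID, say) a real system would attach.

```python
class Index:
    """Toy metadata index keyed either by file path or by a per-file ID."""

    def __init__(self, key_by: str):
        if key_by not in ("path", "id"):
            raise ValueError("key_by must be 'path' or 'id'")
        self.key_by = key_by
        self.tags = {}

    def _key(self, path: str, file_id: str) -> str:
        return path if self.key_by == "path" else file_id

    def add(self, path: str, file_id: str, tags) -> None:
        self.tags[self._key(path, file_id)] = set(tags)

    def lookup(self, path: str, file_id: str):
        # Returns the tags if the index still recognizes the file, else None.
        return self.tags.get(self._key(path, file_id))
```

Add the manuscript to both kinds of index, then try two events. After a move (new path, same ID) the path-keyed index returns `None` while the ID-keyed one still finds the tags. After a replacement (same path, new ID) the result is reversed. Neither key survives both events.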

I think as long as you have two disjoint systems (Finder and DT) working on the same data (in case of indexed files), there is no complete reconciliation of these issues, as by default one system cannot know what the other one has been doing.

So I think it’s not just a question of the devs not doing it. This is deeper.

(*): There are of course ways around this: I can open the old pdf, merge the new one into it, and then delete the old pages, and re-save: filename and UUID preserved! But those are crutches.

alanshutko · October 18, 2015, 10:15pm

If you did this, you would need to update the link in each file linking to the file you changed. If you had one file that fifty files linked to, you would need to update fifty links. Then, since you just changed those files, you would need to change any files that linked to them. And so forth.
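The cascade can be illustrated in Python under one assumption: that a link embeds the hash of its target, so a document's own hash depends on the hashes of everything it links to. The `docs` layout and function names are hypothetical, and the sketch assumes the link graph has no cycles.

```python
import hashlib

def doc_hash(name, docs, cache=None):
    """Hash a document's text together with the hashes of its link targets.

    docs maps a name to (text, [names it links to]); links must not form a cycle.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        text, links = docs[name]
        h = hashlib.sha256(text.encode())
        for target in links:
            # The link's content is the target's hash, so it feeds into ours.
            h.update(doc_hash(target, docs, cache).encode())
        cache[name] = h.hexdigest()
    return cache[name]

def all_hashes(docs):
    cache = {}
    return {name: doc_hash(name, docs, cache) for name in docs}
```

Editing one leaf document changes its hash, which changes the hash of every document linking to it, and so on up the chain. Documents that do not reach the edited file, directly or indirectly, keep their hashes.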

gg378 · October 18, 2015, 10:17pm

At least for me, the record of keeping aliases good and alive, especially across OS X fresh installs, is rather mixed. Anything done along those lines with DT has to be a lot more solid.

Finder aliases are most of the time just a convenience. I can live with losing those. In DT, the connections between files are really relevant in terms of information management, at least for me. So the mechanisms must be extremely robust.

avatar · October 19, 2015, 7:09pm

I don’t really understand this. DTPO’s method of tracking indexed files is so different from that of Finder that it can’t move files? And it can never find a way of doing so?

So, um… right, as I understand it, because DTPO and Finder are so fundamentally different in their initial architectural design approaches that there never will be the option to move indexed files around the Finder with DTPO.

Seems a shame. I wonder how those Finder replacement apps like Path Finder do it. Presumably by not having such a difference. TBH if something like Path Finder did allow editing of text etc. in the right pane I’d be tempted to use it quite a bit.

macula · October 19, 2015, 7:18pm

As Napoleon said, where there’s a will, there’s a way.

(As a last resort, there’s always the avenue of kernel extensions, for instance.)

gsgmx2 · February 8, 2016, 12:51pm

I do wonder why it should be so complicated to follow movements of indexed files under OS X. AFAIK there is an OS X API for Time Machine which keeps track of any change in the filesystem. That change information should be available to DT through that API.
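For comparison, here is a simple polling approximation in Python. It does not use the OS X API mentioned above. It assumes a POSIX filesystem, where a file keeps its inode number when moved or renamed within one volume, and it compares two directory snapshots. `snapshot` and `detect_moves` are hypothetical names.

```python
import os
import stat
from pathlib import Path

def snapshot(root):
    """Map (device, inode) -> path for every regular file under root."""
    snap = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            st = p.lstat()
            if stat.S_ISREG(st.st_mode):
                snap[(st.st_dev, st.st_ino)] = p
    return snap

def detect_moves(before, after):
    """Return {old_path: new_path} for files whose inode survived at a new path."""
    return {
        old: after[key]
        for key, old in before.items()
        if key in after and after[key] != old
    }
```

This catches plain moves and renames. It does not catch an app that saves by deleting the file and writing a fresh one, because the new file gets a new inode.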

But Apple does not use it for any of its photo apps on OS X either. If you ask either iPhoto or Photos to just index-add some files to their database and later move them to another location in your filesystem, these apps lose track of the files as well. Why is that?