Extract metadata of damaged database to CSV?

My most important question is how can I export the whole list of metadata in some delimited format?

Is it possible to export all the metadata, especially URLs, labels, etc., from a database with missing images but intact metadata?

Then I can figure out some way to download the images again using that data, if I have it.

I had a big database that I'd stupidly let swell too large between full backups. It contained a great many images and web pages that I'd saved.

It was being copied to my main subject DB when the program froze, and I had to reboot.

When I opened the program again, a lot of the images were broken. I made a backup and then tried to repair the DB, but it did not work; the images were still broken. I have not been able to find them in either database's "files.noindex" folder - or anywhere on my hard disk. The backup is too old to be useful.

However, I still have the original database, which - although many of the images themselves are missing (thousands of them) - still contains their metadata, which I now want.

The URLs, labels, filenames, etc. are significant for me. Using them I may be able to reconstruct important info. Some of the URLs are fairly long.

Of course, it would be ideal if DTP could "refetch" ONLY the missing web-based files - but actually DTP is great; I was just pushing it too hard. I will be happy just to be able to save the info from those several days.

Is there some way to get just a list of the missing files - with their URLs and paths - to run wget or curl on?

The OPML export includes the dates but still no labels. The files show as missing, though, because their paths appear to have gotten completely messed up and their file names have been truncated.

Is there a way to get DTPO to export the complete list of metadata, including the original URLs, labels, CREATION TIME, and the list of paths where it THINKS those files should be on the filesystem?

Is there a way to get every field dumped out - all the fields that I can see in the database window, plus the path that shows up when these files are not found where the program expects them to be?

If I can get all that out of my DB, I'll be able to move the files to where they would be recognized, and then export them in some logical structure that lets me organize them better before reimporting. That was my big mistake - I didn't expect the crash when I tried to copy them.

Here’s an example script exporting metadata as double-quoted CSV:


-- Export Metadata (Name, URL, Comment, Tags & Annotations) as double-quoted CSV.
-- Created by Christian Grunenberg on Fri Dec 03 2010.
-- Copyright (c) 2010. All rights reserved.

tell application id "com.devon-technologies.thinkpro2"
	try
		set theSelection to the selection
		if theSelection is {} then error "Please select some documents."
		set theFile to choose file name default name "Export.csv"
		
		show progress indicator "Exporting..."
		
		set theCSV to "\"Name\";\"URL\";\"Comments\";\"Tags\";\"Annotations\"" & return
		set theCSV to theCSV & my createCSV(theSelection)
		set thePath to POSIX path of theFile
		if thePath does not end with ".csv" then set thePath to thePath & ".csv"
		
		do shell script "echo " & quoted form of theCSV & ">" & quoted form of thePath
		
		hide progress indicator
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

on createCSV(theseRecords)
	local this_record, this_csv, this_name, this_URL, this_annotation, theseTags
	tell application id "com.devon-technologies.thinkpro2"
		set this_csv to ""
		repeat with this_record in theseRecords
			set this_name to name of this_record as string
			set this_csv to this_csv & my prepareCSV(this_name)
			set this_URL to URL of this_record as string
			if this_URL begins with "x-devonthink-item://" then
				set this_csv to this_csv & my prepareCSV("")
				set this_URL to (characters 21 thru -1 of this_URL) as string
			else
				set this_csv to this_csv & my prepareCSV(this_URL)
				set this_URL to ""
			end if
			set this_csv to this_csv & my prepareCSV(comment of this_record as string)
			
			set {od, text item delimiters of AppleScript} to {text item delimiters of AppleScript, return}
			set theseTags to (tags of this_record) as string
			set text item delimiters of AppleScript to od
			set this_csv to this_csv & my prepareCSV(theseTags)
			
			if this_URL is not "" then
				-- Guard against the referenced record no longer existing;
				-- "plain text of missing value" would otherwise abort the export.
				set this_annotation to missing value
				try
					set this_annotation to plain text of (get record with uuid this_URL)
				end try
			else
				set this_annotation to missing value
			end if
			
			if this_annotation is not missing value then
				set this_csv to this_csv & my prepareCSV(this_annotation) & return
			else
				set this_csv to this_csv & my prepareCSV("") & return
			end if
			
			if type of this_record is group then
				step progress indicator this_name
				set this_csv to this_csv & my createCSV(children of this_record)
			end if
		end repeat
	end tell
	return this_csv
end createCSV

on prepareCSV(theString)
	if theString contains "\"" then
		local od
		set {od, text item delimiters of AppleScript} to {text item delimiters of AppleScript, "\""}
		set theString to text items of theString
		set text item delimiters of AppleScript to "\"\""
		set theString to "" & theString
		set text item delimiters of AppleScript to od
	end if
	return "\"" & theString & "\"" & ";"
end prepareCSV
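With the exported CSV in hand, a small script could turn the URL column back into download commands, as asked earlier. A minimal sketch, not part of DEVONthink - it assumes the exact header and semicolon delimiter produced by the export script above, and `curl_commands` is a hypothetical helper name:

```python
import csv
import shlex

def curl_commands(csv_text):
    """Turn the semicolon-delimited export above into curl commands,
    one per record that still has a web URL."""
    # The AppleScript writes CR line endings; normalize them first.
    lines = csv_text.replace("\r\n", "\n").replace("\r", "\n").splitlines()
    reader = csv.reader(lines, delimiter=";", quotechar='"')
    next(reader)  # skip the Name;URL;Comments;Tags;Annotations header
    commands = []
    for row in reader:
        if len(row) < 2:
            continue
        name, url = row[0], row[1]
        if url.startswith("http"):  # skip records without a web URL
            commands.append("curl -L -o %s %s" % (shlex.quote(name), shlex.quote(url)))
    return commands
```

The resulting commands could be dumped into a shell script or run via `subprocess`; writing only the URLs to a file and feeding that to `wget -i` would work similarly.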

And here’s a script replicating all items with missing files to the group “Missing Files”:


tell application id "com.devon-technologies.thinkpro2"
	set theDatabase to current database
	set theContents to contents of theDatabase
	set theMissingFiles to create location "/Missing Files" in theDatabase
	show progress indicator "Checking..." steps (count of theContents)
	try
		repeat with theRecord in theContents
			set thePath to path of theRecord
			if thePath is not missing value and length of thePath > 0 then
				try
					set pathExists to false
					tell application "Finder" to if exists thePath as POSIX file then set pathExists to true
					if not pathExists then replicate record theRecord to theMissingFiles
				end try
			end if
			step progress indicator (name of theRecord) as string
		end repeat
	end try
	hide progress indicator
end tell

Still hoping for a DT script archive outside the forum …

There are plans to add this to the Support/Welcome Assistant, including automatic installation.

Are there any plans to add hooks to external RDBMSs like PostgreSQL (http://PostgreSQL.org)?

That could be extremely flexible - free added value, and there are high-quality JDBC drivers. A bulletproof RDBMS backup process could replicate a database folder structure by creating or restoring a DTPO database from the exact same structure in tables.

The potential could be much bigger though. That combination would open up a lot of doors.

Are the file paths currently shown for the missing files meaningful - i.e., is that where the files used to be?

Is there a way to export those paths with the rest of the info - especially the label, flag info, and original file creation and modification times - all at the same time, in the same records, so I can use that to do a mass copy from the folders to a location where they can be reattached?

The OPML script exports the times and URLs but not the label data… The OS X Finder shows the times in a generalized fashion, but there must be a more accurate time stored, as your OPML script shows one. But is that necessary?

All but around 100-200 of the images never made it to the destination DB. They show as missing - apart from several hundred that were damaged and now won't open.

I think the files that I need are - apart from a fair number (>200) that were damaged - in a directory structure that was copied to the destination database, but the names and paths are wrong. That's what was lost in the copy.

I have a folder full of images in an alphabetized structure, but the file names and paths are clearly not right.

I originally tried to repair the database, but that basically left me with a worse mess… Luckily I had saved both databases immediately after the original copy went wrong, so that's what I am working with now.

After running the replicate-missing-files script, I realized that almost all of the images are missing. The (now renamed) original Inbox database still has the thumbnails and labels, but almost no images still work - they all show in red now.

I need to reconstruct the actual paths now. The database contained important images and information I can’t replace.

Any ideas on how best to proceed? The putative target files are in the structure the program was trying to move them to, but it didn't work.

The modification times might be usable as a primary key to help the matching… That's my best guess on what to do, but it's a major pain. Is there an easier way?

No.

You could just move the files (e.g. from a backup) to these paths.

As DEVONthink simply uses Mac OS X to copy/move files: were you using a disk image (or another volume/drive) that ran out of space? Did you check the filesystem?

The files occupied around 3 gigs and there was 10-12 gigs of empty space left.

The files were in directories in the files.noindex folder of the destination database; however, the paths as listed in the original source DB were completely wrong.

As we speak, I'm trying to sort it out by using the file creation time. That seems like the best method.

BTW, despite these problems, you have a great product. It's a tribute to its usefulness that people are … extending it the way they are.

Is there any way to get the info that was lost? What was the algorithm used to decide how to change the names?

Are there any tools I can use to open the old database and get out whatever information might be useful? This is proving more complicated than I thought, for a lot of stupid reasons.

Thanks.

The only “tool” is probably AppleScript, see examples above. But if the files are missing in the source database, then there should be orphaned files in the destination database. The command Tools > Verify & Repair should add them again to the database (although without any meta information like labels).

The repair was the first thing I tried, but it made things even worse: thousands of "orphans", many of them damaged and unreadable.

Yesterday I realized that the DEVONtech_storage files written during a normal export contain most of the information that I need. Is there a way to get that DEVONtech_storage file output - for the old, missing files - without the files being there?

I tried, but found that I cannot export the records (with the missing files) from the damaged database to get the DEVONtech_storage file. Instead it performs a check and deletes the missing files from its list - something that I don't want at this time, for this DB. (I've turned off automatic backups too for now, for that reason.)

Is there any utility you could point me to that I can use to get the equivalent data? Once I have it, I could match the files much more easily than now.

The thumbnails of the missing files are still fine and almost readable.

An AI program could probably match the thumbnails to my missing files and have them all back in their proper locations in seconds!

Sadly, this disruption comes at a bad time for me, and I really need that database in one piece without spending a lot of time solving the problem, so I can work with the complete information.

The crash did come in the middle of a copy to my main DB, but that DB is too valuable for me to experiment with right now. (Like the other poster, I now have lots of new work in it, and reconciling two different versions would be a big pain.)

But suppose I did try to recreate the situation immediately after the crash and rebuilt it again. (Stupidly, that time, I didn't back them both up completely first - I just hit the "repair" button.)

Is there a way to make that work?

Getting several thousand "orphans" won't work.

Serializing the file names - even if the filenames were incomprehensibly dense - would be far better.

I've (made copies of the originals and) renamed all of the files by the date (year, month, day, hours, minutes, seconds) they were written, and as that info is still in the DB, I can laboriously look them up one by one. But it's unbearably tedious.
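The renaming step described here can be automated rather than done one by one. A minimal sketch, assuming the copies sit in one folder and the write times survive as the files' modification times; `timestamp_names` is a hypothetical helper that only computes the new names, leaving the actual renaming to the caller:

```python
import os
from datetime import datetime, timezone

def timestamp_names(folder):
    """Map each file in `folder` to a name built from its modification
    time (year, month, day, hours, minutes, seconds).
    Note: files written in the same second would collide and need a suffix."""
    mapping = {}
    for entry in sorted(os.listdir(folder)):
        path = os.path.join(folder, entry)
        if not os.path.isfile(path):
            continue
        ts = os.path.getmtime(path)
        stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d-%H%M%S")
        mapping[entry] = stamp + os.path.splitext(entry)[1]
    return mapping
```

The returned mapping can then be applied with `os.rename`, or just printed first as a dry run.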

If I could override the check for the missing files - and simply output a “DEVONtech_storage” file then I would be closer.

Of course, instead it performs a check and deletes the records with missing files from its list - something I didn't want right now.

Thanks.

Even with my problems, it's still an invaluable program to me. As soon as I can, I'm going to shrink my DBs to a more manageable size.

Reimporting should mark the files as duplicates (of the missing files). Is that the case? Then it might be easily scriptable, just let me know and I’ll post a script.

I don't think that would work unless the program ignores filenames and the lack of URLs in the on-disk files… would it?

The file names are different, and of course many of the originals came from the web, whereas the imported files would just be on disk.

During the version upgrade and the failed move from the Inbox to the main DB, at some point the program tried to make the names shorter. I don't know exactly what made it hang, but it hung, had to be force-quit, and then I stupidly tried to repair it without a backup the first time.

The result was what I have now.

The recognition of duplicates compares only contents (text, images), and as the thumbnails are fine, duplicate recognition should still work even if the files are missing or the filenames/URLs are different.

I just tried re-importing a fairly large number of the image files, and DT does not seem to recognize them as duplicates of the missing files, unfortunately.

The only ones marked as duplicates are pairs of files where neither copy is missing.

The file creation dates are intact. The icons are intact.

Am I missing some important step?

I imported them by dragging a folder full of them to the Sorter, which then put them in the target database. I then moved them to the top level, which is where the others still are.

That’s indeed strange. I’ve just imported an image, removed it from the database package so that the file is missing and finally reimported the image again and they’re now both marked as duplicates (as expected).

Are similar/identical images listed in the See Also/Classify drawer?

The most obvious difference between the re-imported images and the originals: the originals were downloaded from the web using DEVONagent, and they have a (no longer valid, but meaningful to me) URL associated with them.

Also, the re-imported images were originally brought into DTPO1; then I upgraded to DTPO2, and one of the first things I did in DTPO2 was try to import the images in my bloated Inbox into my main database.

That's when it crashed.

I should have done the import before the upgrade, I now realize that.

However, the data I need is in there, as the thumbnails for the missing files still appear.

What is the long string associated with the images? Where does that come from? Is it an internal identifier?

Personally, because it has the potential to be forever unique, I would prefer it if DTPO2 had an option to save files using a different naming scheme - even one that looks incomprehensible to humans - rather than what it does now, which seems to be problematic when multiple similarly named files are damaged or removed.

What seems most useful would be a name that starts with the year, then the month, then the day, in two-digit format, then the hour, minute, and second, and then increments a counter based on an internal sequence that could not clash. That way every file would have a unique filename, even if there were tens of thousands of them in one directory. (This kind of thing is where an ACID-compliant database REALLY starts showing its utility, even in a single-user situation.)

For example, I have noticed that some files that were accidentally imported into the old Inbox database at some point got URLs pointing to the few old image files that were not lost in the transition - because the name the new DTPO attaches to them was SO similar.

Does it look at the sixteen directories and simply keep adding one to the base filename, round robin - eeny, meeny, miney, mo, sort of - so there isn't a clash?

That's a hack that only puts off the need for a more robust solution.

Why don’t you look at SQLite? Lots of commercial products embed SQLite. For example, I have seen it in commercial GPS units.

Did I answer this question properly before?

"T: Are the file paths that currently showing as the path to the missing files meaningful? Thats where the files used to be.

C: You could just move the files (e.g. from a backup) to these paths.

T:All but around 100-200 of the images naver made it to the destination DB. They show as missing but - apart from several hundred that were damaged and now wont open"

That was the first thing I tried, but I found that the paths were wrong! And the truncation of the file names removed the part of the previous name that was unique for me (with DTPO1, the path to the file was in the filename). DTPO2 replaced the longer string with a shorter one - a string that incremented upward by one in round-robin fashion…