Importing E-mail and Attachments via AppleScript

I haven’t written much because I haven’t done much: as it turned out, there were fewer than 2,000 attachments in messages in Apple Mail from 2000-2005 (and none before), so I very quickly ran out of folders to run my scripts over :slight_smile:

But I did end up with 30,000+ extracted attachments from 1994-2024.

There are messages from pre-2005 which show evidence of having had attachments, but they’re not in the data within Apple Mail. I do have at least some of those “missing” attachments which were separated by whatever means when the messages were first received.

I don’t yet know if I’ll bother importing and linking any of those to the original e-mails as it will be a relatively manual process. I might import them and see if they’re picked up on the Graph (slim chance for many with generic names, I reckon).

I’m still contemplating what I’m going to do about duplicate attachments, because if I make them replicants, they’ll all only be able to point back to one original message.

I think there are potential methods to deal with that via metadata – I just haven’t started to investigate yet.

I still have to re-import the messages in the created .mbox files so I can get rid of the bloated original e-mails previously imported. This will also apply for when I decide to not even keep the attachments which have been extracted – there is a point at which even my hoarding goes well beyond usefulness :slight_smile:

So there are a few manual things to do; maybe after that I’ll review the script for additional processing exclusions and potential rationalisations of code.

Sean

Time for a slight reversion…

Not all e-mail source files in DEVONthink have the first line of the kind:

From somebody@example.com Fri Oct 04 08:35:27 2002

to mark the start of a message, so I’ll need to add a test for whether the first 5 characters of the first line are "From " (including the trailing space) and, if they aren’t, create that line, as I had previously been doing for all messages.
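As a rough illustration of that check, here is a minimal sketch in Python (not the AppleScript the thread's script is written in). The function name, the fallback sender, and the date layout are assumptions made for the example; only the test on the first five characters comes from the post.

```python
import time

def ensure_from_line(source, sender="MAILER-DAEMON", when=None):
    """Return the message source with an mbox "From " separator as line 1."""
    first_line = source.split("\n", 1)[0]
    if first_line[:5] == "From ":
        return source  # separator already present, leave the source untouched
    # Assumed layout: "From <sender> <asctime-style date>". Note that
    # time.asctime() pads the day with a space rather than a zero.
    stamp = time.asctime(when if when is not None else time.gmtime())
    return f"From {sender} {stamp}\n{source}"

# A "From:" header is not a separator, so this message gets one added.
without_line = "From: somebody@example.com\nSubject: hi\n\nbody\n"
with_line = "From somebody@example.com Fri Oct 04 08:35:27 2002\n" + without_line

fixed = ensure_from_line(
    without_line,
    sender="somebody@example.com",
    when=time.strptime("2002-10-04 08:35:27", "%Y-%m-%d %H:%M:%S"),
)
unchanged = ensure_from_line(with_line)
```

Checking only the first line keeps the extra cost small, since messages that already have the separator are returned as they are.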

At least it’s a solved problem, but it will add a small amount of time to the processing whenever that line doesn’t exist. (Thank goodness I’ve kept a copy of my script each day I made significant changes!)

I blame the many and varied e-mail programs I’ve used over the years and their many and varied ways of storing things.

I do have a question for @cgrunenberg and @eboehnisch, however (or anyone else who has an answer - @BLUEFROG, perhaps?):

If I delete a message whose attachments I have imported and kept (each attachment links to the original e-mail via its item link), and then re-import the message source with those attachments removed, the existing links from the imported attachments to the re-imported e-mail still work (i.e. the incoming link works). However, those attachments do not show up for the newly re-imported message, either as data in the “Incoming Item Links” column or in the Document > Links Inspector. Is there a way to “re-generate” this count of incoming links (and have them show in the Document > Links Inspector) for databases, groups, or selected items?

I have found removing then adding the link back into the attachment’s URL field works, and I could automate that, but if I could regenerate wholesale, that would be handier :slight_smile:

Certainly not a feature request!* :smiley:

Thanks,

Sean

PS UPDATE: I have tried verifying, optimising, and synchronising the database (not actually expecting these to have an effect), as well as Update Items. I just saw “Rebuild Database” and was going to try that but, going by its description, I think I might just do the scripted re-linking mentioned above if that’s the only other way to achieve the same thing.

*Unless you want it to be!

Slight addition to question.

What happens in the backend when an e-mail record’s .eml file is changed (i.e. the one you see when you click on a record in a list and choose Show in Finder)?

I can see the modification date changes (of course), I can see that the displayed message data changes (great!), I can see the “Incoming Links” remain (hooray!) so those seem pretty sensible.

But is it like any other time a DEVONthink record’s source file is edited (e.g. a .docx in Word)? Do indices, etc. all get updated as well (these are imported messages)?

I’m currently leaning towards generating the new e-mail data and just writing it straight back into the .eml file as I keep or delete attachments, rather than writing an .mbox to bring in later (I can drop generation of the From first line for those messages which don’t have one, then).

It’s my intended “end state” on the messages and attachments in DEVONthink, after all.

It’s also fewer file operations than the alternative path undertaken by others (as I understand it).

I could still include an “attachments links” footer if it feels useful, too.

Thanks,

Sean

Emails are considered to be static so far and are not reindexed.

Thanks!

Back from my conference and caught up on my sleep, so here are the final totals on space saved – those before 2002 had 0 MB saved because all attachments were already extracted by the mail client (usually Eudora back then).

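| Year | E-mails | Original Size (MB) | Final Size (MB) | Saving (MB) |
|------|--------:|-------------------:|----------------:|------------:|
| 2002 | 2406 | 86 | 66 | 20 |
| 2003 | 2472 | 294 | 217 | 77 |
| 2004 | 2013 | 166 | 126 | 40 |
| 2005 | 3032 | 223 | 170 | 53 |
| 2006 | 2261 | 338 | 257 | 81 |
| 2007 | 2652 | 299 | 236 | 63 |
| 2008 | 4120 | 471 | 369 | 102 |
| 2009 | 5205 | 543 | 421 | 122 |
| 2010 | 7217 | 898 | 687 | 211 |
| 2011 | 6719 | 733 | 561 | 172 |
| 2012 | 8085 | 863 | 674 | 189 |
| 2013 | 9265 | 1000 | 806 | 194 |
| 2014 | 9445 | 1000 | 795 | 205 |
| 2015 | 9821 | 1100 | 868 | 232 |
| 2016 | 9445 | 968 | 804 | 164 |
| 2017 | 10988 | 1400 | 1100 | 300 |
| 2018 | 8353 | 1000 | 794 | 206 |
| 2019 | 5818 | 945 | 734 | 211 |
| 2020 | 7207 | 1300 | 1100 | 200 |
| 2021 | 6133 | 1400 | 1100 | 300 |
| 2022 | 3642 | 836 | 656 | 180 |
| 2023 | 4831 | 1100 | 848 | 251 |
| 2024 | 3638 | 1200 | 936 | 264 |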

That’s a total saving of 3.84GB of space by decoding the Base64 data across 30,800 attachments.
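For a rough sense of where that saving comes from: Base64 represents every 3 bytes of data as 4 ASCII characters, and MIME-style encoding adds a line break after every 76 characters. The Python sketch below measures that overhead with the standard library; the 30,000-byte payload is an arbitrary stand-in, not a figure from my archives.

```python
import base64

raw = bytes(30_000)                  # stand-in for a decoded attachment
encoded = base64.encodebytes(raw)    # MIME-style Base64, with line breaks

overhead = len(encoded) - len(raw)   # extra bytes carried by the encoding
ratio = len(encoded) / len(raw)      # a little over 4/3 because of newlines

print(f"{len(raw)} bytes -> {len(encoded)} encoded ({ratio:.2f}x, {overhead} extra)")
```

An encoded attachment is therefore about 35% larger than the decoded file, so decoding recovers roughly a quarter of the encoded size.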

Here are the final database statistics (the difference from the screenshot of statistics above is more than 3.85GB because, at that prior stage, I still had split and unsplit versions of e-mails in the database for some year groups):

Time to review my script and post final version/s (there are alternative paths on a few matters, so I have to decide whether to have alternate handlers called, or post different versions of the script).

Sean

Ah, Houston, we’ve had a problem.
Jim Lovell, Mission Commander, Apollo 13

Well, not really a problem, but an unexpected mismatch between DEVONthink and MIME types.

Well, OK, it shouldn’t have been unexpected, but the fun is in the journey, right? Right?

So, I was working through the script yesterday and I realised I tell DEVONthink to delete images which have been imported via import attachments of record which are < minPictureSize, via this test:

if (the record type of currentAttachment) is picture and (the size of currentAttachment) < minPictureSize then
	delete record currentAttachment
else
	...
end if

but when it comes to keeping MIME parts of Content-Type: image/ that are below that size when Base64-encoded (these MIME parts were being kept in the message because the corresponding imported attachments were being deleted), I used this logic:

if (text item 1 of currentPart contains "Content-Type: image/") and (text item 1 of currentPart contains "Content-Disposition: inline") then
	if currentPartSize < minMIMEEncodedSize then
		-- If so, write that small encoded data into our mail source variable
		set messageMboxText to messageMboxText & currentPart & "\n" & partDelimiter
	end if
end if

This raised a couple of issues:

  1. The file formats (graphics formats) included in DEVONthink records where record type is picture do not align with MIME parts of Content-Type: image/, so I can’t be assured I’m always deleting the right attachments and/or including the right MIME parts for any given message.
  2. I wasn’t testing that attachments imported via import attachments of record which were < minPictureSize were also inline before deletion (I suspect I can’t even locate inline attachments after they’ve been imported via import attachments of record), further increasing the discrepancy between deleted imported attachments and kept MIME types already present from mismatched sets of file types.

I suspect these issues are insurmountable without decoding each MIME part into its original file data so I can test that decoded data before importing selected attachments.

I am most definitely not going to decode each MIME part into its original file data so I can test that decoded data before importing selected attachments.

So, slight change in plans.

I will now not delete attachments imported via import attachments of record where record type is picture and their size is < minPictureSize.

I still won’t delete MIME parts which are of Content-Type: image/ and of Content-Disposition: inline if the currentPartSize < minMIMEEncodedSize, and I will still delete all other attachment MIME parts (except those of Content-Type: text/). So, in fact, I don’t need to re-process the message sources to “correct” this “mistake” (see below).
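The keep/delete rule for MIME parts can be sketched as follows. This is a Python illustration using the standard `email` package, not the AppleScript text-item parsing my script actually uses; `keep_part`, the demo message, and the 50,000 threshold value are assumptions for the example.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

MIN_MIME_ENCODED_SIZE = 50_000  # assumed threshold, in encoded characters

def keep_part(part, min_encoded_size=MIN_MIME_ENCODED_SIZE):
    """Decide whether a leaf MIME part stays in the slimmed-down message."""
    if part.get_content_maintype() == "text":
        return True  # text/ parts are never removed
    if (part.get_content_maintype() == "image"
            and part.get_content_disposition() == "inline"):
        # get_payload() without decode=True returns the still-encoded text
        return len(part.get_payload()) < min_encoded_size
    return False  # every other attachment part is dropped

# Demo message with dummy contents: body text, a small inline logo,
# a large inline photo, and a PDF attachment.
msg = MIMEMultipart()
msg.attach(MIMEText("Hello"))
logo = MIMEImage(bytes(1_000), _subtype="png")
logo.add_header("Content-Disposition", "inline", filename="logo.png")
photo = MIMEImage(bytes(60_000), _subtype="jpeg")
photo.add_header("Content-Disposition", "inline", filename="photo.jpg")
pdf = MIMEApplication(bytes(2_000), _subtype="pdf")
pdf.add_header("Content-Disposition", "attachment", filename="report.pdf")
for part in (logo, photo, pdf):
    msg.attach(part)

kept = [p.get_filename() or p.get_content_type()
        for p in msg.walk()
        if not p.is_multipart() and keep_part(p)]
```

Measuring the encoded payload rather than the decoded data matches the comparison against minMIMEEncodedSize, and avoids having to decode every part.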

This allows (most) signature images to still display in the message, and I may or may not manually delete those at a later stage via a Smart Group that finds the small picture records.

I can narrow it down further by only including .jpg, .jpeg, .png, and .gif files (the most likely graphics file types used in signatures).
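Expressed as a predicate, that narrowing might look like the Python sketch below. It is only an illustration of the criteria (extension plus size), not a DEVONthink Smart Group definition; the names are hypothetical, and the 50,000-byte limit is the picture-size threshold used elsewhere in this thread.

```python
from pathlib import PurePath

SIGNATURE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
MAX_SIGNATURE_BYTES = 50_000  # picture-size threshold from this thread

def looks_like_signature_image(filename, size_bytes):
    """True for small files with one of the likely signature extensions."""
    ext = PurePath(filename).suffix.lower()  # case-insensitive match
    return ext in SIGNATURE_EXTENSIONS and size_bytes < MAX_SIGNATURE_BYTES

# Made-up examples: (file name, size in bytes)
candidates = [
    ("logo.PNG", 4_200),
    ("scan.tiff", 12_000),
    ("holiday.jpg", 2_400_000),
    ("banner.gif", 31_000),
]
matches = [name for name, size in candidates
           if looks_like_signature_image(name, size)]
```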

For now, the implications:

  • The script actually completed 2010 in about half the time (1h06m vs 2h00m), including recreating message sources
  • I’ll need to re-import and re-process all the 2002+ year archives (except 2010) – however, I don’t need to recreate the message source, I just need to re-import the picture attachments via import attachments of record and keep those < 50,000 bytes, so processing will be much quicker than prior runs, but importing the mailboxes will still be some level of pain
  • For 2010, an extra 1,000 attachments were not deleted, and 15MB of extra space is used before any manual processing – so I’ll gain less space back than the above table implies (maybe < 5%, going by 2010 figures)
  • There’ll be extra manual processing time after running the script if I want to get rid of obviously-signature-related images in DEVONthink

So my plan is to finish checking over the script for any other (now) obvious issues or improvements while I re-import the mailboxes, then run the re-processing and see where I’m at and post results and the next iteration of the script here.

The adventure continues!

Sean

Nothing like a deadline to spur action…

It’s Friday night and as of 7 hours ago, I start my new job (yay!) next Thursday morning (I did not see that coming!).

I’ve already processed out most signature-related attachments: the count of pictures < 50,000 bytes has gone from over 35,000 records to < 13,000, with just a day’s spotty efforts.

I’ve utilised DEVONthink’s custom metadata and its ability to count duplicates to ease this, along with tagging scripts which can be triggered from a toolbar button or a menu/hotkey.

The highest number of duplicates of an individual signature-related image was ~1,150, and several images had that many duplicates.
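To illustrate the general idea of counting duplicates, here is a stand-alone Python sketch that groups items by a hash of their content. This is an assumption-laden stand-in: I relied on DEVONthink's built-in duplicate count, and nothing here reflects how DEVONthink detects duplicates internally. The sample byte strings are made up.

```python
import hashlib
from collections import Counter

def count_duplicates(blobs):
    """Map each content hash to the number of items sharing that content."""
    return Counter(hashlib.sha256(data).hexdigest() for data in blobs)

# Made-up attachment contents: the same "logo" appears three times.
attachments = [b"logo-bytes", b"logo-bytes", b"photo-bytes", b"logo-bytes"]

counts = count_duplicates(attachments)
most_common_hash, copies = counts.most_common(1)[0]
```

Sorting by copy count puts the most-repeated images first, which is where signature graphics tend to cluster.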

I will commit to post, by bedtime Wednesday evening (my time), as much as possible of:

  • The latest version of my Attachments-detaching script
  • The process, scripts and settings for dealing with signature graphics semi-manually (or semi-automatically, whatever floats your boat)
  • The script to relink attachments to exported/re-imported truncated .eml files of DEVONthink records
  • As many thoughts and observations as I can on alternative paths, choices I made, etc.

For my personal data, I’m now like 95% of where I’d like to end up – and most of the remaining 5% is deciding if I just leave duplicate non-signature-related extracted attachments as duplicates, or figure out a way to change them to replicants while still linking them somehow to each one’s original e-mail message. I may well kick that can down the road.

I’ve had a ball getting to where I am :slight_smile:

Sean

PS and I couldn’t have gotten there without the phenomenal coding of the DEVONthink devs (@cgrunenberg and @eboehnisch), and the ever-present guiding hand here in the forums of @BLUEFROG!

Congratulations on the new gig! We hope the successes outweigh the stumbles and it provides you with fulfilling challenges and good camaraderie! :slight_smile:
