Separate imported e-mail attachments for better search

Based on the various scripts for importing e-mail attachments floating around the forum (e.g. importing emails with attached PDF files - #10 by Blanc) I’ve created a script that can do the following:

  • Separate attachments from e-mails (.eml files)
  • Add attachments and original e-mail to a new group
  • Add a backlink (x-devonthink-item://) to the attachments referring to the original e-mail
  • Update the timestamps of the group and attachments based on the original e-mail

The script uses large timeouts to also make it usable on larger e-mail archives.

-- Import attachments of selected emails

tell application id "DNtp"
	set theSelection to the selection
	set tmpFolder to path to temporary items
	set tmpPath to POSIX path of tmpFolder
	
	with timeout of 14400 seconds
		repeat with theRecord in theSelection
			if type of theRecord is unknown and path of theRecord ends with ".eml" then
				set theRTF to convert record theRecord to rich
				set theURL to reference URL of theRecord
				set newGroup to false
				
				try
					if type of theRTF is rtfd then
						set thePath to path of theRTF
						set theGroup to parent 1 of theRecord
						set theName to name of theRecord
						set theModificationDate to the modification date of theRecord
						set theCreationDate to the creation date of theRecord
						set theAdditionDate to the addition date of theRecord
						
						tell text of theRTF
							if exists attachment in attribute runs then
								
								tell application "Finder"
									set filelist to every file in ((POSIX file thePath) as alias)
									repeat with theFile in filelist
										set theAttachment to POSIX path of (theFile as string)
										
										if theAttachment does not end with ".rtf" and theAttachment does not end with ".png" then
											try
												with timeout of 7200 seconds
													
													-- Importing skips files inside the database package,
													-- therefore let's move them to a temporary folder first
													
													set theAttachment to move ((POSIX file theAttachment) as alias) to tmpFolder with replacing
													set theAttachment to POSIX path of (theAttachment as string)
													tell application id "DNtp"
														if newGroup is false then
															set newGroup to create record with {name:theName, type:group, modification date:theModificationDate, creation date:theCreationDate, addition date:theAdditionDate} in theGroup
														end if
														set importedFile to import theAttachment to newGroup
														set URL of importedFile to theURL
														set the modification date of importedFile to theModificationDate
														set the creation date of importedFile to theCreationDate
														set the addition date of importedFile to theAdditionDate
														
													end tell
												end timeout
											end try
										end if
									end repeat
									
								end tell
								if newGroup is not false then
									tell application id "DNtp"
										move record theRecord to newGroup
									end tell
								end if
							end if
						end tell
					end if
				on error msg
					display dialog msg
				end try
				-- Remove the temporary rich text conversion (it only exists for .eml records)
				delete record theRTF
			end if
		end repeat
	end timeout
end tell

on makeDate(dateString)
	set theDate to date dateString
end makeDate

Thanks for posting this enhanced version of my script! For anyone interested in this: The script requires at least the Pro edition of DEVONthink 3.

An additional tip: it’s best to run this only on e-mails that contain attachments. You can use the selector from Advanced Search (look for Attachments and search for >= 1) or directly use md_attachments>=1 in the search box.

Also note: this script keeps the original e-mails (.eml files) which include the attachments, plus it separates the attachment. This is to keep as much context as possible. I’ve looked into adapting the script to also be able to strip the original attachments, but DT isn’t able to manipulate .eml files. But you can easily do it yourself after the archiving has been done by using a script like GitHub - Conengmo/emailstripper: Strip attachments from local mbox files or parts from here: MailboxCleanup/mailbox_message.py at main · AlexanderWillner/MailboxCleanup · GitHub

I’m working on expanding this script to be able to do something which has been asked in the forums sometimes: being able to replace attachments with DEVONthink links and index the attachments as ‘proper’ DT records. Wondering if there would be any tips or things to be aware of @cgrunenberg? I still have to test this on a larger archive, but so far it seems to be working.

I’m doing the following:

  • Adding a function to the AppleScript above which writes the names + reference URLs of the attachment to the .eml file on disk in the form of a Finder comment (see below for the updated script)
  • Running a Python script which goes through the .eml files on disk, strips the attachments and replaces them with the filenames + URLs found in the Finder comment of the .eml file

This is the Python script used (be sure to install xattr and bpylist):

import email.mime.text
from email import message_from_file
import os
import re
import uuid
import xattr
from bpylist import bplist

def main(path, filename=None):
    """Strip attachments from all .eml files in path, or from a single .eml file."""
    iterator = [filename] if filename is not None else os.listdir(path)
    for filename in iterator:
        count = 0
        if filename.endswith('.eml'):
            count_before = count
            # Use a context manager so the file handle is closed after parsing
            with open(os.path.join(path, filename)) as f:
                msg = message_from_file(f)
            count = walk_over_parts(msg, count, path, filename)
            if count > count_before:
                print(msg)
            print('Removed {} attachments from {}.'.format(count, filename))


def walk_over_parts(parent, count, path, filename):
    """Walk over the parts of a parent and try to remove attachments.
    
    This function works recursive. So parent is a message, or a part of a message, or a subpart of a part, etc.
    """
    if not parent.is_multipart():
        return count
    for i, part in enumerate(parent.get_payload()):
        if part.get_content_type() in ["text/plain", "text/html"]:
            continue
        if part.is_multipart():
            count = walk_over_parts(part, count, path, filename)
            continue
        content_size, attachment_name = parse_attachment(part)
        if content_size is not None and content_size > 1e3:
            print('Removing attachment {} with size {:.0f} kB.'.format(attachment_name, content_size / 1e3))
            payload = parent.get_payload()
            comment = bplist.parse(xattr.getxattr(os.path.join(path, filename), 'com.apple.metadata:kMDItemFinderComment')).rstrip("|")
            payload[i] = get_replace_text(comment)
            parent.set_payload(payload)
            count += 1
    return count


def parse_attachment(part):
    """Parse the message part and find whether it's an attachment."""
    if part.get_content_disposition() not in ['inline', 'attachment']:
        return None, None
    attachment_name = part.get_filename()
    if attachment_name is None:
        attachment_name = create_default_name(part)
    if attachment_name is None:
        return None, None
    content = part.get_payload()
    assert type(content) is str
    content_size = len(content)
    return content_size, attachment_name


def create_default_name(part):
    """Build a fallback filename such as 'attachment-<uuid>.png' from the part's MIME subtype."""
    # get_content_subtype() returns e.g. 'png' for 'image/png' and ignores header parameters
    return part.get_content_disposition() + '-' + str(uuid.uuid4()) + '.' + part.get_content_subtype()


def get_replace_text(comment):
    """Return a message object to replace an attachment with."""
    replace_text = ""
    attachments = comment.split("|")
    for attachment in attachments:
        parts = attachment.split(";")
        print(parts)
        filename = parts[0]
        link = parts[1]
        replace_text = "\n\n<li><a href='{}'>{}</a></li>\r\n".format(link, filename) + replace_text
    return email.mime.text.MIMEText("<br/><br/><hr><br/><strong>Attachments:</strong><ul>{}</ul>".format(replace_text), 'html')


if __name__ == '__main__':
    main(path='/Users/yourname/E-mail/archive')
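
Note that main() as posted only prints the modified message; it does not write it back to disk. A minimal sketch of a save step, assuming you want to leave the original file untouched and write the result next to it. The `save_message` helper and the `.stripped.eml` suffix are hypothetical and not part of the script above, and the sketch assumes the message round-trips as text the same way message_from_file read it:

```python
import os
from email import message_from_string
from email.generator import Generator


def save_message(msg, eml_path, suffix=".stripped.eml"):
    """Write msg next to eml_path under a new name and return the new path.

    Hypothetical helper: the original .eml is left untouched so the result
    can be checked before anything is replaced.
    """
    base, _ = os.path.splitext(eml_path)
    out_path = base + suffix
    with open(out_path, "w", newline="") as f:
        # mangle_from_=False keeps body lines starting with "From " unescaped
        Generator(f, mangle_from_=False).flatten(msg)
    return out_path
```

In main() this could be called where the message is currently printed, e.g. `save_message(msg, os.path.join(path, filename))`. Test it on a copy of the archive first.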

And this is the updated AppleScript:

-- Import attachments of selected emails
property currentCount : 0

tell application id "DNtp"
	set theSelection to the selection
	set tmpFolder to path to temporary items
	set tmpPath to POSIX path of tmpFolder
	
	with timeout of 14400 seconds
		repeat with theRecord in theSelection
			set currentCount to currentCount + 1
			if type of theRecord is unknown and path of theRecord ends with ".eml" then
				set theRTF to convert record theRecord to rich
				set theURL to reference URL of theRecord
				set theSender to URL of theRecord
				set theGroup to parent 1 of theRecord
				set theName to name of theRecord
				set theModificationDate to the modification date of theRecord
				set theCreationDate to the creation date of theRecord
				set theAdditionDate to the addition date of theRecord
				set commentString to ""
				set newGroup to false
				
				-- Coerce the counter to text first; integer & string would otherwise build a list
				set logString to (currentCount as string) & ": " & theName & " (" & theURL & ")"
				log logString
				
				try
					if type of theRTF is rtfd then
						set thePath to path of theRTF
						tell text of theRTF
							if exists attachment in attribute runs then
								tell application "Finder"
									set filelist to every file in ((POSIX file thePath) as alias)
									repeat with theFile in filelist
										set theAttachment to POSIX path of (theFile as string)
										
										if theAttachment does not end with ".rtf" and theAttachment does not end with ".png" then
											try
												with timeout of 7200 seconds
													
													-- Importing skips files inside the database package,
													-- therefore let's move them to a temporary folder first
													
													set theAttachment to move ((POSIX file theAttachment) as alias) to tmpFolder with replacing
													set theAttachment to POSIX path of (theAttachment as string)
													tell application id "DNtp"
														if newGroup is false then
															set newGroup to create record with {name:theName, type:group, modification date:theModificationDate, creation date:theCreationDate, addition date:theAdditionDate} in theGroup
														end if
														
														set importedFile to import theAttachment to newGroup
														set URL of importedFile to theURL
														set the modification date of importedFile to theModificationDate
														set the creation date of importedFile to theCreationDate
														
														--set importedPath to path of importedFile
														--tell application "Finder"
														--	set comment of ((POSIX file importedPath) as alias) to theURL
														--end tell
														
														set commentString to ((filename of importedFile) as string) & ";" & ((reference URL of importedFile) as string) & "|" & commentString
														log commentString
													end tell
												end timeout
											end try
										end if
									end repeat
								end tell
							end if
						end tell
						if newGroup is not false then
							tell application id "DNtp"
								move record theRecord to newGroup
								set recordPath to path of theRecord
								if commentString is not equal to "" then
									tell application "Finder"
										set comment of ((POSIX file recordPath) as alias) to commentString
									end tell
								end if
							end tell
						end if
						
					end if
				on error msg
					display dialog msg
				end try
				-- Remove the temporary rich text conversion (it only exists for .eml records)
				delete record theRTF
			end if
		end repeat
	end timeout
end tell

Not sure what exactly should be replaced where. An example would be helpful.

The attachments are an integral part of the e-mail message (.eml file). Searching for their content turns up the e-mail, but doesn’t allow full search or showing occurrences the way regular DT records do. So the script separates the attachment from the .eml message and adds it to DT as a separate record. The .eml file is then changed so it includes a MIME part with an HTML message that links to the relevant DT record.
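
To illustrate the replacement on a toy message, here is a sketch using only Python's standard email package. The message contents and the x-devonthink-item://uuid1 link are made up for the example:

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a toy message: one text body plus one attachment.
msg = MIMEMultipart()
msg["Subject"] = "Report"
msg.attach(MIMEText("See attached.", "plain"))
attachment = MIMEApplication(b"x" * 2000, Name="attachment1.doc")
attachment.add_header("Content-Disposition", "attachment", filename="attachment1.doc")
msg.attach(attachment)

# Swap the attachment part for a small HTML part that links to the DT record.
link_html = "<ul><li><a href='x-devonthink-item://uuid1'>attachment1.doc</a></li></ul>"
parts = msg.get_payload()
for i, part in enumerate(parts):
    if part.get_content_disposition() == "attachment":
        parts[i] = MIMEText(link_html, "html")
msg.set_payload(parts)

content_types = [p.get_content_type() for p in msg.get_payload()]
print(content_types)
```

After the swap the message holds the original text part plus one HTML part, and the attachment data is gone from the message.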

This sounds like a tricky operation actually. In case of multiple attachments having no or the same name it might be difficult to figure out the related file.

Maybe not that tricky, but I might be overlooking something, so I’m open to tips. The Python script takes the reference URL and filename directly from what the AppleScript you wrote has done (I just write the separated attachment name + reference URL to a Finder comment so they can be linked to the .eml file).
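
The Finder comment is stored in the extended attribute as a binary property list, which is why the script runs it through bplist. As a sketch of the decoding step alone, the standard library's plistlib can read the same bytes. The `decode_finder_comment` name is hypothetical, and the bytes are generated here as a stand-in for what xattr.getxattr(...) would return on macOS:

```python
import plistlib


def decode_finder_comment(raw):
    """Decode the raw bytes of the Finder comment attribute into a string."""
    value = plistlib.loads(raw)  # auto-detects binary vs. XML plists
    return value if isinstance(value, str) else ""


# Stand-in for the bytes xattr.getxattr(...) would return on macOS.
raw = plistlib.dumps("attachment1.doc;x-devonthink-item://uuid1|", fmt=plistlib.FMT_BINARY)
print(decode_finder_comment(raw))
```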

I don’t figure out any attachment names etc. in the Python script (that’s done by the import part of the AppleScript). I just remove all attachments above a certain size from the .eml file with the Python script and replace them with a bit of HTML linking to the DT items from the Finder comment:

Example with an .eml file with ‘attachment1.doc’ and ‘attachment2.doc’:

set commentString to ((filename of importedFile) as string) & ";" & ((reference URL of importedFile) as string) & "|" & commentString

This gives me attachment1.doc;x-devonthink-item://uuid1|attachment2.doc;x-devonthink-item://uuid2 in the Finder comment of the .eml file, which gets processed in the Python script in these parts:

comment = bplist.parse(xattr.getxattr(os.path.join(path, filename), 'com.apple.metadata:kMDItemFinderComment')).rstrip("|")

and

def get_replace_text(comment):
    """Return a message object to replace an attachment with."""
    replace_text = ""
    attachments = comment.split("|")
    for attachment in attachments:
        parts = attachment.split(";")
        print(parts)
        filename = parts[0]
        link = parts[1]
        replace_text = "\n\n<li><a href='{}'>{}</a></li>\r\n".format(link, filename) + replace_text
    return email.mime.text.MIMEText("<br/><br/><hr><br/><strong>Attachments:</strong><ul>{}</ul>".format(replace_text), 'html')
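
As a sketch of how that comment string breaks down into (filename, URL) pairs: the `parse_comment` helper below is hypothetical and only illustrates the `name;url|name;url|` format. It keeps pairs in a list so that attachments with identical names stay distinct, but a filename containing "|" would still break the format:

```python
def parse_comment(comment):
    """Split 'name;url|name;url|' into a list of (name, url) tuples."""
    pairs = []
    for entry in comment.strip("|").split("|"):
        if not entry:
            continue  # empty comment or doubled separator
        # Split on the last ';' so a ';' inside the filename does not break the pair.
        name, sep, url = entry.rpartition(";")
        if sep:
            pairs.append((name, url))
    return pairs


comment = "attachment2.doc;x-devonthink-item://uuid2|attachment1.doc;x-devonthink-item://uuid1|"
print(parse_comment(comment))
```

Because the AppleScript prepends each new entry, the most recently imported attachment comes first in the string.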

I’m content with the group storing the original .eml file and attachment files.
Why are we modifying the .eml file?

attachments as ‘proper’ DT records

The attachments work for me as part of the group, or as independent DT records

The .eml files encapsulate all the attachments. You won’t be able to find them as separate documents in DT, and you also can’t easily search ‘inside’ them there. E.g. I’ve got loads of interesting documents (mostly PDF and Office documents) sent to me over the years. I can find them when I search for them, but I don’t know why they surface, because DT can’t look ‘inside’ the attachment (only the .eml file).

By separating the attachments I can treat them as ‘regular’ DT documents in regards to surfacing the content.

Modifying the original .eml is definitely not needed if you just want to include both the original .eml file and the attachment. But my e-mail archive spans 22 years and is ~35GB. Having all attachments in there twice is a bit too much so that’s why I’m doing it this way. Probably not needed for most, but works well for me.

The attachments of emails are actually indexed since version 3.0, therefore a toolbar search should find the email. However, the Search inspector only supports the document itself, not its attachments.

Yes, you’re absolutely right - sorry for not making myself clear. Why I’m doing it this way is because when I have e.g. an e-mail with a PDF attachment which contains a specific phrase I’m searching for, it does turn up in the search results, but I can’t find the occurrences without separately opening the attachment and redoing the search e.g. in Preview (see How to find search occurence in an e-mail attachment?). But maybe there’s an easier way I’m not seeing?

How are you getting the attachments as independent DT records? By running an AppleScript like the one above or some other way?

That’s probably the easiest option currently.

Why don’t you just drag and drop the attachments out of the email into the database? They’re indexed and searchable as individual documents then. :thinking:

I’m using the script provided by DT (Add message(s) & attachments to DEVONthink.scpt)

Because my e-mail archive spans ~300.000 messages, so it would be quite some dragging and dropping :wink:

True but I’m guessing you don’t need to import attachments from all 300,000 emails. :slight_smile:

Thank you @mdbraber for your work. I am recently “back” to DT and was hoping to use it to quickly wrangle a pile of .eml files with attachments. I copied your script from 5/11 to the DT Scripts folder [Library/Application Scripts/com.d-t.think3/Menu]. When I run it on a selected email in DT, it creates a group and places the message into the group, but does not appear to separate the attachments (PDFs) or include them in the newly created group. Am I missing a critical step? (I wandered into these chains last night and am trying to follow, but am not at the level of also figuring out python integration!)

Many thanks.

Try running the script from the Apple Script Editor (separate app on your system) and see what the Debug output says (click the buttons 1 and 2 to get the debug output)