Importing email WITHOUT attachments in DTPO, as plain text

Sorry if I’m being dense.

I have DEVONthink Pro Office. After using the forum search, I thoroughly understand that this affords me the capability to bulk import emails from Apple Mail as RTF with attachments.

However, I’d actually prefer to import just the plain text. I have large amounts of old email in Mail, and more yet archived on DVDs. I’d like to dump it all into a DTPO database so I can do a full-text search once or twice a month when I need some old snippet of information. The rest of the time I’d like to keep it all locked up, maybe on a disk image, so it doesn’t distract Mail and Spotlight. I don’t need any formatting for this, or any of the attachments, some of which are quite large.

But I can’t find an option to DISABLE the RTF email import feature of DTPO. What to do?

Also, is there a practical limit to the number of files that can be catalogued in a DTPO database? If I dump all my old emails, including the spam folders and mailing lists, I’ll probably have over a million messages in one database. Is that a problem for the full-text search engine?

Thanks,
Chris Ferebee

You cannot disable the “rich” mail import feature. But if you are willing to spend some time shaping up the database, you could export all the imported files, run an RTF-to-text converter on the lot, and then create a new database by importing the resulting text files.
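For the export-and-convert route, here is a minimal sketch of the conversion step. It assumes the exported files are .rtf or .rtfd, that you are on a Mac where the `textutil` command-line tool is available, and that plain `-convert txt` output is good enough. The function names and the folder layout are made up for illustration; this is not something DTPO provides.

```python
import subprocess
from pathlib import Path


def textutil_command(src: Path, dst: Path) -> list[str]:
    """Build the textutil call that converts one RTF/RTFD item to plain text."""
    return ["textutil", "-convert", "txt", "-output", str(dst), str(src)]


def convert_export(export_dir: Path, out_dir: Path) -> int:
    """Convert every .rtf/.rtfd under export_dir, mirroring the folder layout.

    Returns the number of items converted. Attachments inside .rtfd packages
    are simply not carried over, because only the text is written out.
    """
    count = 0
    for src in sorted(export_dir.rglob("*")):
        if src.suffix.lower() not in (".rtf", ".rtfd"):
            continue
        # Skip the TXT.rtf stored inside an .rtfd package; the package
        # itself is converted as one item.
        if any(parent.suffix.lower() == ".rtfd" for parent in src.parents):
            continue
        dst = out_dir / src.relative_to(export_dir).with_suffix(".txt")
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(textutil_command(src, dst), check=True)
        count += 1
    return count
```

Calling `convert_export(Path("Exported"), Path("PlainText"))` would leave a folder of .txt files that can then be imported into a fresh database. Both folder names are placeholders.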

The practical file limit depends on the amount of RAM in your machine. But millions of files may be stretching it.

Annard,

Thanks for the explanation!

I’d like to request two features then.

  1. Please let me disable the RTF conversion. I want to be able to bulk import emails as plain text, without any formatting and without the attachments.

  2. Please provide a means of bulk importing email without using Apple Mail by simply dragging a folder (hierarchy) of .emlx files. While you’re at it, please make the parser robust enough to support other formats. The CommuniGate Pro mail server that I use can store messages in “MailDir” format, each email as a single file, and other servers such as cyrus imapd use similar formats - basically just the raw message text, possibly with something extra tacked on the beginning.

The RTF import facility is impressive, but it’s not suited to dealing with large amounts of raw email, perhaps restored from backups, and maybe from a situation where Apple Mail itself is unable to deal with the full mailbox hierarchy.

Sincerely,
Chris Ferebee

We do support this format (at least Pantomime supports it). If you choose the Unix Mailbox Source, the Open panel should recognise it as such, as far as I remember. Just give it a try.

Annard,

are you sure about that? I think the “Unix Mailbox Source” refers to the classic “mail spool” format, which has all the messages from one mailbox in a single text file.

Apple Mail through OS X 10.3 used that format, but from 10.4 onwards Mail stores each message in a separate (.emlx) file. Similarly, POP3 servers traditionally use mail spool format, but for IMAP servers, which may have tens of thousands of messages in a single mailbox and need random access to each message, MailDir format is usually more efficient. (Again, MailDir uses directories with each message in a separate text file.)
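To make the difference between the two layouts concrete, here is a small sketch using Python's standard `mailbox` module, which can write both a classic mbox spool and a Maildir. The sample messages, addresses and file names are invented for the demonstration.

```python
import mailbox
from email.message import EmailMessage
from pathlib import Path


def sample_message(n: int) -> EmailMessage:
    """Create a throwaway message for the demonstration."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "reader@example.com"
    msg["Subject"] = f"Message {n}"
    msg.set_content(f"Body of message {n}\n")
    return msg


def write_both(root: Path, count: int) -> tuple[Path, Path]:
    """Store the same messages once as an mbox spool and once as a Maildir."""
    spool_path = root / "spool.mbox"
    maildir_path = root / "Maildir"
    spool = mailbox.mbox(str(spool_path))
    maildir = mailbox.Maildir(str(maildir_path), create=True)
    try:
        for n in range(count):
            spool.add(sample_message(n))    # appended to one big file
            maildir.add(sample_message(n))  # written as its own file
        spool.flush()
    finally:
        spool.close()
        maildir.close()
    return spool_path, maildir_path


def message_files(maildir_path: Path) -> int:
    """Count the per-message files a Maildir keeps under new/ and cur/."""
    return sum(
        1
        for sub in ("new", "cur")
        for p in (maildir_path / sub).iterdir()
        if p.is_file()
    )
```

After `write_both(root, 3)` the spool is a single file holding all three messages, while the Maildir holds three separate files, which is why an importer that asks for one file fits the first layout but not the second.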

When I import “Unix Mailbox Source”, DTPO asks for a single file to import. If I were to import a whole mailbox, I’d need to select thousands of individual files. Also, “.emlx” files aren’t selectable in the open file dialog…

Sincerely,
Chris Ferebee

No, .emlx is not supported. As for the maildir format, you can select a Unix mailbox file or a maildir folder and the latter will be valid if the folder contains the following subfolders: “new”, “cur” and “tmp”. In that case it should interpret the folder as a maildir folder and read its contents.
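The folder test described above can be sketched in a few lines of Python. This only illustrates the rule as stated (the three subfolders must exist); it is not DTPO's actual code, and the function names are made up.

```python
from pathlib import Path

# Subfolders a folder must contain to be treated as a maildir folder.
MAILDIR_SUBDIRS = ("new", "cur", "tmp")


def looks_like_maildir(folder: Path) -> bool:
    """True if folder contains all of new/, cur/ and tmp/ as directories."""
    return all((folder / name).is_dir() for name in MAILDIR_SUBDIRS)


def find_maildirs(root: Path) -> list[Path]:
    """List root and every folder below it that passes the test."""
    candidates = [root, *root.rglob("*")]
    return sorted(p for p in candidates if p.is_dir() and looks_like_maildir(p))
```

Running `find_maildirs` over a mail store before importing shows which folders an importer that applies this rule would accept, and which ones (a folder with messages stored directly inside it, say) it would pass over.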

Thanks for the clarification.

Unfortunately, the “MailDir” directories used by CommuniGate Pro do not have the subdirectories you mention, and are not recognized by the import function.

So I guess that brings me back to my feature request for the ability to import a directory (with subdirectories) filled with email messages in raw text format, one to a file, parsing the header and ignoring extraneous information at the beginning of the file. That should take care of .emlx as well…
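To illustrate what I mean, here is a rough sketch in Python using only the standard `email` package. It skips any leading lines that do not look like a mail header, parses the rest, and returns just the subject and the plain-text body, so attachments are never touched. The header heuristic is my own assumption about what the "something extra" at the start of a file looks like, and the sketch does not handle anything a format appends after the message.

```python
import re
from email import message_from_bytes, policy

# A line such as "Subject: hello" or "Received: from ..." marks the start
# of the real message; anything before it is treated as junk.
HEADER_LINE = re.compile(rb"^[A-Za-z][A-Za-z0-9-]*:")


def strip_preamble(raw: bytes) -> bytes:
    """Drop leading lines until the first one that looks like a header."""
    lines = raw.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if HEADER_LINE.match(line):
            return b"".join(lines[index:])
    return raw  # nothing header-like found; hand back the input unchanged


def plain_text(raw: bytes) -> tuple[str, str]:
    """Return (subject, plain-text body) for one raw message file."""
    msg = message_from_bytes(strip_preamble(raw), policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return str(msg["Subject"] or ""), text
```

A message that only has an HTML part comes back with an empty body here; falling back to the HTML part would be a separate decision.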

Sincerely,
Chris Ferebee

The quest continues. There is an “emlx to mbox converter” available from http://www.cosmicsoft.net/emlxconvert.html that, well, you get the idea. I dragged 22,000 .emlx’s to its window and it gamely created an mbox from them, so it seems to be pretty stoic.
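For anyone who would rather script this step, here is a minimal sketch of the same idea in Python. It is not how the converter above works internally, which I have not looked at. It assumes the commonly described .emlx layout: a first line giving the byte count of the message, then the raw message itself, then a trailing property list that gets ignored.

```python
import mailbox
from pathlib import Path


def read_emlx(path: Path) -> bytes:
    """Return just the raw message bytes from one .emlx file."""
    data = path.read_bytes()
    first_line, _, rest = data.partition(b"\n")
    length = int(first_line.strip())  # byte count of the message proper
    return rest[:length]              # drops the trailing plist


def emlx_to_mbox(source_dir: Path, mbox_path: Path) -> int:
    """Append every .emlx below source_dir to one mbox; returns the count."""
    box = mailbox.mbox(str(mbox_path))
    count = 0
    try:
        for path in sorted(source_dir.rglob("*.emlx")):
            # mailbox adds the "From " separator line and escapes any
            # body lines that start with "From ".
            box.add(read_emlx(path))
            count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Something like `emlx_to_mbox(Path("Messages"), Path("archive.mbox"))` would produce a single file that the Unix Mailbox import can take. Both paths are placeholders.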

The original Messages folder full of .emlx files (and hence also the resulting mbox) was about 2 GB. Importing it into DTPO took about an hour, resulting in a 3.6 GB database. DTPO got stuck at the end, leaving a “Stop” button open in the import progress pane that didn’t do anything, so I force quit it. After restarting DTPO, the database opened up without a problem.

I’ve been doing this on a Mac Pro with 12 GB of RAM - but the mouse did start getting jerky during the import process.

Next to get rid of the attachments. I selected all 22,000 items and chose Convert to Text from the popup menu.

That seems to be too much. DTPO got stuck at message 2,115 and has been beachballing for 15 minutes. I think it’s time to give up. It looks like I won’t be importing a million emails anytime soon!

We now return to your regularly scheduled programming. :)