Re: mailbox format(s)


On 27 feb 2004, at 0:26, Bruce Lilly wrote:

I quite strongly agree with Iljitsch on this one.
Having a well-defined format that can be used for files
is very helpful. The biggest advantage would be that it
makes it easier to switch MUAs.

If the mailbox is accessed via a standard network protocol such as
POP or IMAP, it is trivially easy to switch MUAs.

Not quite. Then the new MUA must download all the messages again from the server. Note that ISPs often don't allow customers to use their mailbox for long-term storage, so in practice this doesn't work at all for many users. The alternative would be to run a local IMAP-like service, which seems excessive and still doesn't allow the use of more than one program for downloading mail, just more than one program for displaying it.

A single format might not scale
well; what works for an organization with plenty of resources
might not work at all for a guy with a PDA or cell phone (and
vice versa).

I have more than 300 MB worth of mail on my server. My laptop keeps a complete copy of this, but I don't think it makes sense for a PDA to do the same: just caching the most recently read messages would be a better choice.

The format could come in single-message and multiple-message (like
current mbox) variants. The latter should just be a concatenation
of the former, or otherwise a very trivial transform.

Been there, done that. Never again.  A flat file just doesn't
work well with even a modest number of messages.

It can work if you build an index and don't go around removing messages from the middle of the file too often.
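
A minimal sketch of such an index in Python, assuming a simplified mbox-style file in which every message starts with a line beginning with "From " (real mbox variants also need From-quoting rules, which are ignored here). The function names are made up for illustration:

```python
import io

def build_index(f):
    """Scan the file once and return a list of (offset, length) per message."""
    index = []
    start = None          # offset of the message currently being scanned
    pos = 0               # offset of the current line
    for line in f:
        if line.startswith(b"From "):
            if start is not None:
                index.append((start, pos - start))
            start = pos
        pos += len(line)
    if start is not None:
        index.append((start, pos - start))
    return index

def read_message(f, index, n):
    """Fetch message n with one seek and one read, using the index."""
    offset, length = index[n]
    f.seek(offset)
    return f.read(length)

# Usage with an in-memory stand-in for the mailbox file.
mbox = io.BytesIO(
    b"From a@x Mon\nSubject: one\n\nbody1\n"
    b"From b@y Tue\nSubject: two\n\nbody2\n"
)
index = build_index(mbox)
second = read_message(mbox, index, 1)
```

Appending a message only adds one entry to the index. Removing one from the middle forces a rewrite of everything after it, which is why deletions are the expensive case.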

Cyrus IMAP stores one
message per file, with a database for metadata (access lists,
etc.), and it's quite fast.

Maybe for random access, but if you need to access all messages you're bound to be slower. Also, the file system overhead makes this a pretty bad idea.

What I imagine is a system where messages are stored in a binary container format such as IFF/AIFF/ASF/AVI when they are created. A typical message would start with a header section (which includes the msgid), then a body section consisting of one or more text and/or binary parts and finally an optional signature section. These are concatenated and the whole thing is flagged as immutable in transit. The container format makes sure that it's easy to skip ahead to the part of the message that is of interest at any particular time. Then there are three other parts: an end-to-end control section, a local control section and any information left by intermediate systems. These sections can either be kept in a separate place and be linked to the main section through the use of the msgid, or they can simply be tacked on at the end of it.
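
To make the idea concrete, here is a rough sketch in Python of an IFF-style layout: each section is a 4-byte tag, a 4-byte big-endian length, the data, and a pad byte if the length is odd. The tags HEAD, BODY and SIGN and the function names are invented for this illustration; nothing here is a proposed standard.

```python
import io
import struct

def write_chunk(out, tag, data):
    """Write one section: tag, length, data, plus a pad byte to stay even-aligned."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    out.write(tag)
    out.write(struct.pack(">I", len(data)))
    out.write(data)
    if len(data) % 2:
        out.write(b"\0")

def pack_message(header, body_parts, signature=None):
    """Concatenate header, body parts and optional signature into one message."""
    buf = io.BytesIO()
    write_chunk(buf, b"HEAD", header)
    for part in body_parts:
        write_chunk(buf, b"BODY", part)
    if signature is not None:
        write_chunk(buf, b"SIGN", signature)
    return buf.getvalue()

def iter_chunks(data):
    """Yield (tag, payload) for every section without interpreting the payload."""
    pos = 0
    while pos + 8 <= len(data):
        tag = data[pos:pos + 4]
        (length,) = struct.unpack(">I", data[pos + 4:pos + 8])
        start = pos + 8
        yield tag, data[start:start + length]
        pos = start + length + (length % 2)   # skip the pad byte if present

# Usage: one text part, one binary part, and a signature.
msg = pack_message(
    b"Message-ID: <1@example.invalid>\r\n",
    [b"hello", b"\x00\x01\x02"],
    b"sig",
)
sections = list(iter_chunks(msg))
```

Because the length comes before the data, a reader never has to scan a body for a delimiter, so binary parts need no encoding or escaping.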

A mailbox can then simply consist of a number of these messages that may or may not be concatenated into larger files. The control sections can be split off or copied to a different location and be used as an index if this is desired. But it's not really necessary to do that as searching (on header fields) through a large mailbox is fairly efficient: read the header, skip the body, read the control sections, next message. (The length of each section is specified in the container format so there is no need to parse the whole body and no 8-bit cleanliness issues.)
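
The scan described above could look roughly like this in Python. It assumes sections made of a 4-byte tag, a 4-byte big-endian length, the data and a pad byte for odd lengths, with the made-up tags HEAD and BODY; control sections are left out for brevity but would be read the same way as HEAD.

```python
import io
import struct

def chunk(tag, data):
    """Encode one section: tag, big-endian length, data, pad byte if odd."""
    pad = b"\0" if len(data) % 2 else b""
    return tag + struct.pack(">I", len(data)) + data + pad

def find_messages(f, needle):
    """Yield the offset of each message whose HEAD section contains needle."""
    while True:
        offset = f.tell()
        prefix = f.read(8)
        if len(prefix) < 8:
            return                       # end of mailbox
        tag = prefix[:4]
        (length,) = struct.unpack(">I", prefix[4:])
        padded = length + (length % 2)
        if tag == b"HEAD":
            header = f.read(length)      # headers are read in full
            f.seek(padded - length, io.SEEK_CUR)
            if needle in header:
                yield offset
        else:
            f.seek(padded, io.SEEK_CUR)  # bodies are skipped, never parsed

# Usage: two messages, the second with a body that would confuse mbox.
m1 = chunk(b"HEAD", b"Subject: lunch\r\n") + chunk(b"BODY", b"x" * 1001)
m2 = chunk(b"HEAD", b"Subject: mailbox formats\r\n") + chunk(b"BODY", b"\xff\x00From ")
mailbox = io.BytesIO(m1 + m2)
hits = list(find_messages(mailbox, b"mailbox"))
```

The cost of the search is proportional to the total size of the headers plus one seek per body, not to the size of the mailbox.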

I think neither one message per file nor all messages in one file is the best idea; grouping messages in files of a few hundred kB or several MB is probably better. We can add some padding after any embedded control/logging sections (which can just as easily be trailers as headers) so that when those sections grow, space can be borrowed from the padding sections.
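
A sketch of how borrowing from the padding might work, in Python. It assumes each section is a 4-byte tag followed by a 4-byte big-endian length and the data (alignment padding is left out to keep it short), and it uses the invented tags CTRL for a control section and "PAD " for the padding that follows it:

```python
import struct

HDR = 8  # 4-byte tag + 4-byte big-endian length

def section(tag, data):
    """Encode one section (no alignment padding in this simplified sketch)."""
    return tag + struct.pack(">I", len(data)) + data

def grow_in_place(buf, ctrl_off, extra):
    """Append extra to the CTRL section at ctrl_off by shrinking the PAD after it."""
    ctrl_len = struct.unpack_from(">I", buf, ctrl_off + 4)[0]
    pad_off = ctrl_off + HDR + ctrl_len
    if buf[pad_off:pad_off + 4] != b"PAD ":
        raise ValueError("no padding section follows the control section")
    pad_len = struct.unpack_from(">I", buf, pad_off + 4)[0]
    if len(extra) > pad_len:
        raise ValueError("not enough padding; the file must be rewritten")
    # Lengthen the control section and write the new data over the old padding.
    struct.pack_into(">I", buf, ctrl_off + 4, ctrl_len + len(extra))
    buf[pad_off:pad_off + len(extra)] = extra
    # Re-create a smaller padding section right after the grown control section.
    new_pad_off = pad_off + len(extra)
    buf[new_pad_off:new_pad_off + HDR] = b"PAD " + struct.pack(">I", pad_len - len(extra))

# Usage: add a flag to the first message without moving the second one.
first = section(b"HEAD", b"h")
buf = bytearray(
    first + section(b"CTRL", b"seen") + section(b"PAD ", bytes(32))
    + section(b"HEAD", b"next")
)
size_before = len(buf)
grow_in_place(buf, len(first), b",replied")
```

The file size and the offsets of all later messages stay the same, so an external index stays valid. Only when the padding runs out does the file have to be rewritten.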

Now obviously this is just one idea and I'm not saying an eventual solution must be like this; I'm just trying to show there are more ways to skin a cat.

On the way out, not worth caring about too much
(as Iljitsch said, nobody should be forced to use
a format).

If it won't work for some systems (remember that bit about
heterogeneity), what is the point of having a standard?

Standards improve quality, because they usually eliminate inferior ways of achieving a result. And often vendors that implement their own solution also support the standard to some extent in order to be compatible.