Re: Last Call: 'The APPLICATION/MBOX Media-Type' to Proposed Standard

This proposal needs to be sent back for further consideration, either by a
WG, or at least by some mailing list of those with knowledge of the
format.

It has just missed too many opportunities, including the opportunity to
document the mbox format once and for all (and warts and all).

uh, no.  this is an application for a media-type.  it's not trying to
define the mbox format, it's just trying to define a label for the mbox
format that can be used in contexts where MIME media-type labels are
required.  the intent is clearly that this label should be usable with
the variety of mbox files that exist, not with one specific variant of
mbox file.

But to do that, it needs to specify which variant of mbox the data is, and
to do that, there needs to be at least a description of the various
variants.

there are lots of media types for which there isn't a description of all of
the variants available.  we find it useful to label things even without
those descriptions.


Exactly right. A label can be of value even if the thing it labels is a bit
fuzzy.

Eric Hall already made this point, but it bears repeating: Having names
for commonly used formats is a lot more important than insisting on precise
definitions of those formats. MIME tried it the other way at first, and we
still haven't fully recovered from the mess it led to operationally.

From looking at the formail man page (just something quickly found on my
machine), I see mention of a variant using Content-Length:; a "traditional
Berkeley mbox format"; and a "BABYL rmail format" (*is* that mbox?).

no, BABYL format is not the same as mbox format.


FWIW, I do not regard schemes using Content-length: (there are several
incompatible variants) in lieu of colonless from lines as being legitimate mbox
files.

trying to write a definitive specification for the mbox format is not
the same thing as trying to define a label for that format.  nor is it
the same thing as trying to document existing practice.  many authors
and groups get these confused, with frequently unfortunate results.

However (again), it *does* need to provide a way to identify the variants
in actual use. If it doesn't tell me which variant of mbox it is, it might
as well just say "application/octet-stream".

lots of tools seem to be able to make use of mbox files without knowing
precisely which variant they're reading.


Bingo.

I will also add that the addition of parameters specifying regular expressions
will almost certainly harm interoperability more than it helps it. There
are many reasons for this:

(1) An optional parameter will likely be neither generated nor read. I certainly
    wouldn't bother with it in the various tools I've developed that deal
    with mbox files.

(2) A mandatory parameter stands a good chance of scaring people away from
    ever using the media type. Alternately, they'll use the media type
    but ignore the parameter requirement.

(3) To the extent that people would generate such a parameter, I view the
    chances that it would be done correctly as fairly low.

(4) There are performance and security issues associated with the use of
    regular expressions. Writing a regexp that consumes vast amounts of
    CPU isn't hard, which means anyone developing, say, an automatic
    tool that accepts and converts labelled mbox files from random sources
    would be vulnerable. Very large mbox files are pretty common as well -
    I've seen a bunch weighing in at more than 2GB, so the performance
    impact of regexp scanning also cannot be ignored.

(5) Most mbox delimiter lines involve one or more spaces. Spaces in
    MIME parameter values are sometimes problematic due to header
    folding, and while parameter encoding exists as a solution to this
    problem, the number of agents that don't support it is pretty large.
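
To make point (4) concrete, here is a rough sketch of how cheaply a
pattern can be made expensive. The pattern and the function name are
made up for illustration; nothing like them appears in the proposal. It
uses Python's backtracking `re` engine, where nested quantifiers roughly
double the work for each extra input character on a near-miss.

```python
import re
import time

# Hypothetical "delimiter" pattern a hostile sender might supply in a
# parameter. The nested quantifiers force exponential backtracking when
# the subject almost matches but fails at the end.
HOSTILE_PATTERN = re.compile(r"(a+)+$")


def time_match(n):
    """Match against n 'a' characters followed by 'b'; return (match, seconds)."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = HOSTILE_PATTERN.match(subject)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the cost grows roughly twofold per added character.
    for n in (12, 14, 16, 18):
        _, seconds = time_match(n)
        print(f"n={n:2d}  {seconds:.4f}s")
```

The subject strings here are a couple of dozen bytes long. A tool that
accepts patterns from random sources would have to bound both the
pattern and the time spent on it, and that is before the 2GB files enter
the picture.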

I should also note that specifying a boundary and line terminator in a
parameter might not even be sufficient; there's also from stuffing to consider.
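
For anyone who hasn't run into from stuffing, a minimal sketch of two
conventions follows. The names "mboxo" and "mboxrd" are the informal
labels commonly used for them, not terms from the proposal, and the
function names are mine. The point is that two writers can agree on the
delimiter and the line terminator and still disagree on how body lines
were escaped.

```python
import re

# Matches a body line that is "From " preceded by zero or more '>' marks.
_QUOTED_FROM = re.compile(r">*From ")


def stuff_mboxo(body_lines):
    """Prefix '>' only to lines starting with 'From ' (not reversible)."""
    return [">" + line if line.startswith("From ") else line
            for line in body_lines]


def stuff_mboxrd(body_lines):
    """Prefix '>' to lines matching '>*From ' so the step can be undone."""
    return [">" + line if _QUOTED_FROM.match(line) else line
            for line in body_lines]


def unstuff_mboxrd(body_lines):
    """Remove exactly one leading '>' from lines matching '>+From '."""
    return [line[1:] if line.startswith(">") and _QUOTED_FROM.match(line)
            else line
            for line in body_lines]


if __name__ == "__main__":
    body = ["From here on it gets odd", ">From a quoted reply", "plain text"]
    print(stuff_mboxo(body))
    print(stuff_mboxrd(body))
    print(unstuff_mboxrd(stuff_mboxrd(body)) == body)
```

Note that the first convention leaves ">From a quoted reply" alone, so a
reader cannot tell afterwards whether that '>' was in the original
message or was added by the writer.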

Now, some may be inclined to compare this sort of parameter with the boundary
parameter on multipart. However, the two cases are actually fairly different.
For one thing, the first line of an mbox provides you with a sample delimiter;
there's no need to scan for it and no chance of it getting confused with other
stuff in the preamble.  Additionally, mbox stores a sequential list of messages,
whereas MIME multiparts can be nested; this leads to interesting cases where
it can be useful to have a definitive indicator of the boundary. (The obvious
example is a malformed multipart with no legal boundaries stored inside a
legitimate multipart.) And finally, multipart boundaries are fixed strings;
there are no regexps in sight.
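
As a rough illustration of the mbox side of that comparison, the sketch
below splits raw mbox bytes with no parameter at all. It assumes the
file starts with a "From " line and that body lines beginning with
"From " were stuffed by whoever wrote the file; the function name is
hypothetical and this is not taken from any particular tool.

```python
def split_mbox(data: bytes):
    """Split raw mbox bytes into (from_line, message_bytes) pairs.

    Assumes every message starts with a line beginning b"From " and that
    matching body lines were escaped by the writer.
    """
    lines = data.splitlines(keepends=True)
    # The first line doubles as the sample delimiter: there is no preamble.
    if not lines or not lines[0].startswith(b"From "):
        raise ValueError("first line is not a 'From ' delimiter")

    messages = []
    for line in lines:
        if line.startswith(b"From "):
            # A delimiter line opens a new message; the list is flat.
            messages.append((line.rstrip(b"\r\n"), []))
        else:
            messages[-1][1].append(line)
    return [(from_line, b"".join(body)) for from_line, body in messages]


if __name__ == "__main__":
    sample = (
        b"From a@example.org Mon Jan  1 00:00:00 2004\n"
        b"Subject: one\n\nbody\n>From stuffed line\n\n"
        b"From b@example.org Tue Jan  2 00:00:00 2004\n"
        b"Subject: two\n\nhi\n"
    )
    for from_line, message in split_mbox(sample):
        print(from_line, len(message))
```

There is no nesting to track and no stack of boundaries, just one pass
over the lines with a fixed prefix test.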

                                Ned