Alternative view of some aspects of MIME

The recent discussion of 0d0a led me to dust off something I started
to write a while ago, and tidy it up a bit. I hope it is helpful.
If you think that I am misunderstanding something that everyone else
understands then please let me know without bothering the list. I don't
want to start another storm in a teacup, but writing this out helped
me understand it so it might help others.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

A MIME basic bodypart (i.e. a non-multipart bodypart) can be
viewed at several levels:

 1. ENCODED. This is the format of the message as it crosses the mail
    transport path. This will always consist of a sequence of lines
    of 7-bit bytes, each line less than 1000 bytes and not including
    0d0a as consecutive bytes.

 2. UNDERLYING. The underlying format after you remove the encoding
    is not necessarily a sequence of octets. It may have other structure.
    In particular the TEXT types are a sequence of lines. At this
    level lines are arbitrarily long and can contain any octets.

 3. LOCAL. Sometimes various bit level transformations will be needed
    before passing the underlying message to the display processor.

 4. DISPLAY. This is what the user ultimately sees/hears/experiences.
    The objective of the exercise is to arrange as closely as possible
    that the recipient sees(/etc) what the sender intends.
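As a rough illustration of the first two levels, here is a minimal
Python sketch. It assumes base64 as the encoding, and the sample text is
made up for the example. One long UNDERLYING line containing octets
above 127 becomes a series of short 7-bit ENCODED lines, and decoding
recovers the original octets. Note that Python's base64.encodebytes
ends each line with LF; on the wire each line would end with CRLF.

```python
import base64

# Hypothetical UNDERLYING content: one very long "line" (5500 octets)
# that includes octets above 127 (Latin-1 accented letters).
underlying = "naïve café ".encode("latin-1") * 500

# ENCODED level: base64 yields short lines of 7-bit bytes.
# encodebytes() terminates lines with LF locally; the transport
# would carry them with CRLF line endings.
encoded = base64.encodebytes(underlying)
encoded_lines = encoded.splitlines()

longest = max(len(line) for line in encoded_lines)
all_7bit = all(octet < 128 for octet in encoded)
round_trip = base64.decodebytes(encoded) == underlying

print(len(underlying), "underlying octets in one line")
print(len(encoded_lines), "encoded lines, longest is", longest)
print("7-bit only:", all_7bit, " round trip ok:", round_trip)
```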

To give an example of LOCAL transformations: in Unix you will add
LF (0a) to the line before sending it to the display code in the
operating system. That display code will discard your LF and 
do something appropriate for your hardware such as send CR, fill
characters, LF, more fill characters. LOCAL is only mentioned to
help people realise that it is unimportant and not really part of
the model.

One issue that looks related to the LOCAL level, but isn't, is how
programmers choose to store the underlying message after decoding it.
For example, if on VMS you decide to store the message as a
variable-length-record file then each "line" will be prefixed in the
file by a 16 bit length. This means that you won't be able to handle
lines longer than about 65K. More significantly, on Unix, if you
choose to store the message with lines delimited by LF (0a) then you
lose the ability to handle lines which contain a 0a octet. If someone
later defines a Content subtype in which lines longer than 65K, or
lines in which a 0a octet must be preserved, are likely, then these
programming decisions will need to be reviewed. In practice such
Content-type/subtypes would probably be intercepted on the
standardization track.
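The two storage choices can be compared with a small Python sketch. The
function names are invented for the example. The length-prefixed format
is a simplification (a little-endian 16 bit count followed by the
octets) and not a faithful copy of the VMS record layout.

```python
import struct

def store_lf(lines):
    """Store lines Unix-style: each line followed by an LF delimiter."""
    return b"".join(line + b"\n" for line in lines)

def load_lf(blob):
    """Split on LF; drop the empty piece after the final delimiter."""
    return blob.split(b"\n")[:-1]

def store_length_prefixed(lines):
    """Store each line behind a 16 bit length (simplified, hypothetical)."""
    out = bytearray()
    for line in lines:
        # struct.pack raises struct.error if len(line) > 65535
        out += struct.pack("<H", len(line))
        out += line
    return bytes(out)

def load_length_prefixed(blob):
    lines, pos = [], 0
    while pos < len(blob):
        (count,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        lines.append(blob[pos:pos + count])
        pos += count
    return lines

# The second underlying line contains a 0a octet.
original = [b"first", b"has\x0ainside", b"last"]

via_lf = load_lf(store_lf(original))
via_prefix = load_length_prefixed(store_length_prefixed(original))

print(via_lf)       # the 0a octet has turned into a line boundary
print(via_prefix)   # the 0a octet survives inside its line
```

The LF-delimited store returns four lines instead of three, while the
length-prefixed store preserves the 0a octet but cannot hold a line of
more than 65535 octets.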

Even more significant in the Unix context is how the message is
stored in ENCODED form after reading it in from the mail transport.
If you represent the lines of the encoded message in a file with LF
delimiters then you lose the ability to correctly handle messages
which contain a bare LF within a line of the encoded form. As before,
the effects of this programming decision are not serious, since the
encodings are robust against having an LF octet replaced by an extra
line break: the likely result is a line break in the decoded message
in place of a raw LF. Since a raw LF has no semantic meaning in any
existing content subtype, this is unlikely to be an unsatisfactory
result.
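The same effect at the ENCODED level can be sketched in Python as
follows. This assumes a 7bit message whose lines end in CRLF on the
wire; the helper names and the sample message are made up for the
example.

```python
def lines_on_wire(message):
    """Lines as the transport sees them: only CRLF ends a line."""
    return message.split(b"\r\n")[:-1]

def store_unix(message):
    """Store the message in a Unix file, turning CRLF into LF."""
    return message.replace(b"\r\n", b"\n")

def lines_from_unix_file(stored):
    """Lines as read back from the file: every LF ends a line."""
    return stored.split(b"\n")[:-1]

# Hypothetical message: the second line contains a bare LF.
wire = b"one\r\ntwo\nhalf\r\nthree\r\n"

on_wire = lines_on_wire(wire)
from_file = lines_from_unix_file(store_unix(wire))

print(on_wire)    # three lines, the second holding a raw LF
print(from_file)  # four lines: the raw LF became a line break
```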

A Content subtype must define the mapping between UNDERLYING
and DISPLAY in clear terms so that the implementors of mail user
interfaces will correctly display the message.

The remaining issue is the mapping between UNDERLYING and ENCODED.
[I presume in the following that, since portable eol was removed from
BASE64, line breaks in BASE64 are represented by encoding 0d0a.]
Not all bodyparts can be encoded in all encodings. A bodypart with
lines longer than 1000 octets cannot be encoded in 7bit or 8bit. A line
containing 0d0a as consecutive octets cannot be encoded in 7bit, 8bit
or base64. An octet greater than 127 cannot be encoded in 7bit. Note
that the restriction on consecutive 0d0a octets for encoding in
base64 only applies to line-oriented Content-type/subtypes. Anyway,
here is the table:

                        | 7bit  | 8bit  |  q-p  | base64
                        ----------------------------------
Not line-oriented       |  N    |  N    |  Y    |  Y
                        |       |       |       |
lines + has octet >127  |  N    |  Y    |  Y    |  Y
                        |       |       |       |
lines + line > 1000     |  N    |  N    |  Y    |  Y
                        |       |       |       |
lines + 0d0a sequence   |  N    |  N    |  Y    |  N

[Well I guess this table should be multi-dimensional but you get the
general idea.]
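One way to handle the several dimensions is to apply the rows of the
table as independent tests. The following Python sketch does that. The
function name and the MAX_LINE constant are invented for the example,
and the exact boundary of the line-length limit is an assumption taken
from the "longer than 1000" wording above.

```python
# Assumed limit; the text says lines "longer than 1000" cannot be
# carried in 7bit or 8bit.
MAX_LINE = 1000

def allowed_encodings(lines, line_oriented=True):
    """Return the set of encodings the table permits.

    lines is a list of bytes objects, one per UNDERLYING line.
    For a body that is not line-oriented only q-p and base64 apply,
    whatever the content.
    """
    if not line_oriented:
        return {"quoted-printable", "base64"}

    allowed = {"7bit", "8bit", "quoted-printable", "base64"}
    if any(octet > 127 for line in lines for octet in line):
        allowed.discard("7bit")
    if any(len(line) > MAX_LINE for line in lines):
        allowed -= {"7bit", "8bit"}
    if any(b"\r\n" in line for line in lines):
        allowed -= {"7bit", "8bit", "base64"}
    return allowed

print(sorted(allowed_encodings([b"plain ascii"])))
print(sorted(allowed_encodings([b"caf\xe9"])))
print(sorted(allowed_encodings([b"x" * 5000])))
print(sorted(allowed_encodings([b"a\r\nb"])))
print(sorted(allowed_encodings([b"\x00\x01"], line_oriented=False)))
```

A bodypart that falls under several rows at once simply loses every
encoding that any one of those rows rules out.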

Note that all known and proposed character sets that might be used in
the line-oriented Content-type/subtypes are careful to avoid using
small octet values, including 0d and 0a, even in the middle of a 32bit
character, so the fact that lines containing such a sequence restrict
one's choice of encoding is not a practical problem.

Bob Smart