First of all, let me say that I think Keith has spelled things out _very_
nicely here. Thanks Keith -- your original message is a very important one that
I think everyone involved in this debate should read.
This is exactly what I think we need. I know it sounds silly, but it is far
simpler than making exceptions for text body parts and quoted-printable.
I'm afraid that I agree with Keith and not Nathaniel on this. There's an
important point that we all seem to be missing here. What we're talking about
is a model for how things work. That's all it is -- a model. When it comes down
to implementation of the model, the actual mechanisms involved may break down on
completely different boundaries and operate in completely different ways.
What's important is that the resulting output be consistent. The model is
simply an explanation of the process that's involved. For best results the
model should be as simple and consistent as possible. Reality is almost never
so kind, and actual implementations look very different.
sendmail does not conform to the MTA/UA model either, when it comes right down
to it. But who cares -- it works, most of the time, at least...
I believe that we're all in total agreement as to what the output should look
like. I think this implies that we can find the simplest model fairly easily
once we divorce ourselves from the notion that the model must accurately
parallel the steps that a specific UNIX mail system implementation takes.
I would generalize the procedure you outlined above to all types of body
parts and all content-transfer-encodings, as follows:
1. Body part is "composed" somehow, in some "native" format. This might be
a UNIX-style text file, or a Sun raster image, or audio samples in a
system-dependent format, whatever.
It can also encompass systems that have multiple native representations for
the same thing, e.g. CR delimited, LF delimited, counted records, fixed
length records, and so on.
2. Before a content-transfer-encoding is applied to a body part, the body
part is first converted to "canonical" format. Continuing the examples
above, the canonical format might be a CRLF-delimited text file, a GIF file,
and audio samples according to the audio/basic spec.
Let's take a moment and consider canonical forms. Most content types implicitly
define their own canonical form. For example, audio/basic is sound that's
encoded into a stream of bytes. The stream of bytes is the canonical form of
this content type. There's no concept of an end-of-line associated with this
content-type.
The application/postscript content-type also defines a canonical form, since
PostScript is a stream of bytes that collectively comprise a PostScript
program. PostScript may or may not have end-of-lines -- you cannot assume that
any particular sequence in PostScript is an end-of-line unless you know
something about the program itself. A majority of PostScript admits the
interpretation that a CR, LF, or CRLF is an end-of-line (and an end-of-line can
usually be represented by any of these sequences), but this is not always true.
Thus, unless you know something about the PostScript you're encoding above and
beyond the fact that it is PostScript, you cannot assume that there are any
end-of-line sequences in it.
And of course there's text/plain, which is the one that's causing us so much
grief. The canonical form for text/plain is a bunch of lines delimited by CRLF.
Yes, I know that this is not the usual "native" representation of text (except
on the 80 million-odd PCs in the world today ;-). But we get the simplest model
if we assume text to be in this form prior to encoding.
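To make step 2 concrete for text, here's a rough sketch in Python (the language
and the function name are just mine for illustration, and I'm assuming the
native form is a UNIX-style, LF-delimited file):

    def to_canonical_text(native_octets):
        # native_octets: a UNIX-style text file read as raw bytes, with bare
        # LF as the line delimiter and no stray CRs.  The canonical form of
        # text/plain delimits lines with CR LF, so just swap the delimiters.
        return native_octets.replace(b"\n", b"\r\n")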
It is the responsibility of the definition of the content-type to define the
canonical form. Most of the content-types defined in MIME do this quite nicely.
(If you read the PostScript manual you'll find that the canonical form of
PostScript is very clearly laid out. Amusingly enough, it does NOT include the
notion of a CTRL/D end-of-file indicator; this is an indicator specific to
certain environments that is not part of the canonical form.)
About the only canonical form that's not rigidly laid out is plain text!
RFC821/RFC822 does lay out the format, but MIME actually supports something a
bit different. It is certainly possible to have very long lines, and 8 bit
characters, in MIME. These are not allowed in 821/822.
I have always thought that, once the portable EOLs disappeared from MIME, the
structure of this was obvious. Given all this discussion I'm no longer sure
of this.
3. Content-transfer-encoding is applied.
I'd like to tighten this up a bit. The base64 encoding is pretty obvious. Step 2
of this gives you a byte stream. base64 turns this into text, and that's
that.
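In other words, something like this (a sketch only; Python's base64 module is
just standing in for whatever encoder an implementation actually uses):

    import base64

    def encode_base64(canonical_octets):
        # Step 3 for base64: turn the canonical octets, whatever they
        # represent, into 7-bit text.  Line breaks and content types never
        # enter into it.
        return base64.encodebytes(canonical_octets)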
Quoted-printable is a bit different. It turns the byte stream into printable
text, but in so doing it may convert CRLFs into line breaks. I think this last
"may" should be reserved for text data only. Quoted-printable should still work
with other types of data, but the CRLF-to-line-break conversion should only be
done for text. I view this as a clarification rather than something essential.
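Here's roughly what I have in mind, as a sketch rather than a complete
quoted-printable encoder (no 76-character line limit, no soft line breaks,
no special treatment of trailing whitespace):

    def qp_encode_text(canonical_octets):
        # Canonical text arrives with CR LF line delimiters.  CR LF MAY come
        # out as a hard line break; '=' and anything outside the printable
        # range gets the =XX treatment.
        out = []
        i = 0
        while i < len(canonical_octets):
            if canonical_octets[i:i+2] == b"\r\n":
                out.append("\r\n")                  # hard end-of-line
                i += 2
            else:
                b = canonical_octets[i]
                if b == 0x3D or not 0x20 <= b <= 0x7E:
                    out.append("=%02X" % b)         # quote it
                else:
                    out.append(chr(b))
                i += 1
        return "".join(out)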
4. The encoded object is inserted into a MIME-message with appropriate body
part headers and boundary markers.
Note that after step 2, any object to be encoded is just an octet stream, and
the rules are the same no matter which content-transfer-encoding gets applied.
Yes, but this is just the model, and I think we should make that clear.
In practice, of course, many q-p encoders will combine steps 2 and 3,
especially if they "know" whether the object being encoded is text or binary.
That's fine as long as the result is the same. But...
I have it easy -- in my implementation I do the encodings in the MTA and not
the UA. The problems start when you have to encode in the UA but the MTA does
not accept messages in RFC822 compliant format. This leads to a model where
there's a bunch of weirdness in the output processing after quoted-printable is
produced. If this happens to match the form that step 1 produces you can
simplify the quoted-printable step by simply passing the EOL thingies straight
through. This is the right way for sendmail to work, and it does not violate
the rules in any way or produce non-compliant output. The steps are different,
that's all.
If we want to describe a "typical UNIX implementation", so that the poor
implementors won't be confused by being presented with a concise, consistent
model, well, I have no objection to that either. But I do object to the
pollution of a very simple model with concepts that pertain, not even to a
specific operating system, but to a specific implementation that happens to run
on that operating system. And worse, this model is easily implemented in that
environment (albeit not by mapping each step onto a separate chunk of code).
It is very important to specify things in such a way that every content-type
has a well-defined canonical form that is independent of
content-transfer-encoding.
I don't think this is an issue except for text. Note that it is going to make
UTF very interesting (I think it forces it to be a subtype), but that's not a
problem we're going to address until we have some assurance that UTF is a
reality.
When specified this way,
* it's easy to define how a content integrity check should work (it just gets
computed over the output of step 1),
Are you sure about this? How about after step 2, when the thing is in canonical
format? I don't think you want to do an integrity check on something that will
be represented differently on different systems.
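That is, if we're going to have a content integrity check I'd expect it to look
something like this (MD5 is only a stand-in here; I'm not proposing any
particular algorithm):

    import hashlib

    def content_check(canonical_octets):
        # Computed over the canonical form -- the output of step 2 -- so that
        # every system arrives at the same value regardless of its native
        # representation of the data.
        return hashlib.md5(canonical_octets).hexdigest()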
* it's easy to define how to convert from one encoding to another (undo steps
4 through 2, and redo steps 2 through 4 with the new encoding),
If you mean you convert back to the canonical form and then forward to the new
encoding, then I agree completely.
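For instance, quoted-printable text to base64 would look roughly like this (a
sketch; note that Python's quopri.decodestring hands the hard line breaks back
as bare LFs, so they get put back into the canonical CR LF form before
re-encoding):

    import base64, quopri

    def qp_text_to_base64(qp_octets):
        decoded = quopri.decodestring(qp_octets)
        # Normalize line ends back to the canonical CR LF form, then redo the
        # encoding step with the new content-transfer-encoding.
        canonical = decoded.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
        return base64.encodebytes(canonical)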
* the encoding of text body parts is consistent with the encoding of other
body parts.
And all the transfer encodings can be viewed as consistent. 7bit and 8bit are
just not capable of encoding everything, that's all.
* if the need arises to do so, it is easy to define a new
content-transfer-encoding without changing the definition of any
body part.
This is a _real_ advantage that the debate over UTF has only just made clear
to me.
The simple rule that lets this happen is: in quoted-printable, octets 0D 0A
MAY be encoded as ("hard") end-of-line, and when decoding, a "hard"
end-of-line ALWAYS means 0D 0A. (If you want to be really strict, then say
that octets 0D 0A may only be encoded as end-of-line when they are intended
to represent an end-of-line in the native format text, but I think it is a
lot simpler to leave this rule out.)
I agree that this last rule should not be hard-and-fast. I believe it should be
recommended, however.
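The decoding side of the rule, in miniature (again just a sketch -- soft line
breaks and malformed escapes are not handled): a hard end-of-line always comes
back as the octets 0D 0A, while =0D and =0A stand only for themselves.

    def qp_decode_line(encoded_line):
        # encoded_line is one line of quoted-printable text (a str), with or
        # without its trailing line break.
        had_eol = encoded_line.endswith("\n")
        body = encoded_line.rstrip("\r\n")
        out = bytearray()
        i = 0
        while i < len(body):
            if body[i] == "=" and i + 2 < len(body):
                out.append(int(body[i+1:i+3], 16))  # =XX stands for one octet
                i += 3
            else:
                out.append(ord(body[i]))
                i += 1
        if had_eol:
            out += b"\r\n"          # a hard end-of-line ALWAYS means 0D 0A
        return bytes(out)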
It also needs to be said that the canonical form of a text/* object is one
where end-of-line is always represented as a CR LF pair from the specified
character set. This makes it clear how to encode a text/* object in base64.
(This rule may be extended to other content-types if the definition for that
content-type specifically says to use the text end-of-line rule.)
Absolutely!
Keith
Great job, great message!
Ned