Re: Newline problem: Another stab

From: Nathaniel Borenstein <nsb(_at_)thumper(_dot_)bellcore(_dot_)com>

Keith, I just don't think it works.

The hard reality is that there are lots of mail transport agents out
there that think they have license to mess with line breaks.  If
everything gets converted to CRLF format before transport, I predict
that LOTS of existing transport agents will break, badly, because
they've suddenly started getting CRLF where they expected whatever the
local standard was.


Obviously my explanation still leaves something to be desired. :-)

I'm not recommending that we convert everything to CRLF before transport,
or that we change anyone's local standard for how to store mail messages.

Let's say I have a text object that I want to encode in base64 (because it
needs to get there bit for bit intact) and ship somewhere via email.  I don't
want to just convert the file byte-for-byte to base64 and mail it, because
the recipient's end of line convention might be different from mine.  What I
need is a canonical format for the representation of text files.  Then I
can convert my text file to that format, encode it in base64, and mail it.
When the message arrives at the other end, and after base64 decoding, my
text object will still be in canonical form, and the recipient's UA can
do whatever is necessary to display it, convert it to his local storage
format, whatever.

I'm assuming (in the absence of a specification) that the "canonical" format
for text objects will use CR LF to denote end-of-line.  (Note 1: the MIME
spec needs to define what the canonical form of a text object looks like.)

An important point here is that the "canonical" format that uses CR LF is the
octet stream *before* encoding.  The resulting message containing the base64
encoding is a *text file* in the sender's local format.  It doesn't contain
bare CRLFs unless the local system happens to use that convention for
end-of-line.  

An example:  Say I want to encode the text:

Hi there!
Lolita.

in base64.  My local system uses the LF character (0a hex, 012 octal) to
denote end-of-line.  So on my system the object looks like this (using
C-escapes):

"Hi there!\012Lolita.\012"

After conversion to canonical form, the object looks like this:

"Hi there!\015\012Lolita.\015\012"

After encoding in base64, the object looks like this (using the same
notation):

"SGkgdGhlcmUhDQpMb2xpdGEuDQo=\012"

The CR (\015) and LF (\012) that end each line in canonical form are encoded
along with everything else.  The \012 character that ends the encoded version
is the end-of-line character in my local environment.
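
The two steps above can be sketched in a few lines of Python.  This is only
an illustration of the example, not a proposed implementation: the helper
name to_canonical and its local_eol parameter are made up here, and it
assumes the local file uses a single, consistent end-of-line sequence.

```python
import base64


def to_canonical(local_text: bytes, local_eol: bytes = b"\n") -> bytes:
    """Replace the local end-of-line sequence with CR LF (hypothetical helper)."""
    return b"\r\n".join(local_text.split(local_eol))


# The object as stored on a system that ends lines with LF.
local = b"Hi there!\nLolita.\n"

# Step 1: convert to canonical form (an octet stream with CR LF line ends).
canonical = to_canonical(local)

# Step 2: base64-encode the canonical octets; CR and LF are encoded like
# any other octet.
encoded = base64.b64encode(canonical)

print(canonical)  # b'Hi there!\r\nLolita.\r\n'
print(encoded)    # b'SGkgdGhlcmUhDQpMb2xpdGEuDQo='
```

The end-of-line that terminates the encoded text in the mail message is
added afterwards by the local system and is not part of the encoded data.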

So what does this have to do with quoted-printable?

No matter what type of object I'm mailing, the canonical form for that
object should be an octet stream that is independent of the
content-transfer-encoding.

So the canonical form of my text object is *still*

"Hi there!\015\012Lolita.\015\012"

Now if I want to encode this in quoted-printable instead of base64,
I take this canonical form (it's an octet stream, remember?), pass it
through my quoted-printable encoder, and I get:

"Hi there=21\012Lolita.\012"

Notice what has happened: the bytes \015 \012 in the canonical form have been
*encoded* as an end-of-line according to my local system's convention.
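
A rough sketch of such an encoder, in Python, might look like the following.
The function name qp_encode_text is hypothetical, and several things a real
quoted-printable encoder must handle are deliberately left out (the line
length limit, soft line breaks, trailing white space).  The "!" is escaped
only so that the output matches the example above.

```python
def qp_encode_text(canonical: bytes, local_eol: bytes = b"\n") -> bytes:
    """Encode canonical text; each CR LF pair becomes a local line break."""
    out = bytearray()
    i = 0
    while i < len(canonical):
        if canonical[i:i + 2] == b"\r\n":
            out += local_eol          # CR LF is *encoded* as a line break
            i += 2
            continue
        byte = canonical[i]
        # Printable ASCII passes through, except "=" (the escape character)
        # and "!" (escaped here only to mirror the example).
        if 32 <= byte <= 126 and byte not in b"=!":
            out.append(byte)
        else:
            out += b"=%02X" % byte    # everything else, including a bare CR or LF
        i += 1
    return bytes(out)


print(qp_encode_text(b"Hi there!\r\nLolita.\r\n"))  # b'Hi there=21\nLolita.\n'
```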

Let's say that the person who receives this message uses a VMS system,
which normally stores text files in a record-oriented format, rather
than using a special end-of-line character.  When the encoded message
is received on this system, it will look like:

Record 1: "Hi there=21"
Record 2: "Lolita."

The mail reader could then translate this back to canonical form, by
decoding each record and inserting \015\012 after the end of every line:

Hi there!\015\012Lolita.\015\012

This octet stream is then passed on to the display module for text
body parts.  This module is responsible for translating \015\012
into whatever sequence is needed to start a new line.
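
On a record-oriented system, the translation back to canonical form could
be sketched like this (Python, for illustration only).  The names
qp_decode_record and records_to_canonical are invented for this sketch, and
it assumes the mail system hands over each record as a byte string with no
end-of-line character attached.

```python
import re

_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def qp_decode_record(record: bytes) -> bytes:
    """Turn each =XX escape in one record back into the octet it stands for."""
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), record)


def records_to_canonical(records) -> bytes:
    """Rebuild the canonical octet stream from a sequence of records."""
    out = bytearray()
    for record in records:
        if record.endswith(b"="):
            # Soft line break: the record boundary is not part of the data.
            out += qp_decode_record(record[:-1])
        else:
            # Hard line break: the record boundary stands for CR LF.
            out += qp_decode_record(record) + b"\r\n"
    return bytes(out)


print(records_to_canonical([b"Hi there=21", b"Lolita."]))
# b'Hi there!\r\nLolita.\r\n'
```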

Of course, the recipient's system might just as well convert each line
individually from quoted-printable to unencoded form, and display each line
separately.  The result (as displayed on the recipient's screen) would be the
same.  It might seem easier to implement a mail reader this way, rather than
converting a text object to canonical form just to display it.  But to be
fully MIME-compliant, the recipient's mail reader would still need to be able
to deal with text messages encoded in base64, where the bytes after decoding
would contain CRLFs.

It is very important that each content-type have a well-defined canonical
form that is an octet stream, and that is independent of
content-transfer-encoding.  If we specify things this way, then
content-integrity-checks, encoding text in base64, using q-p for "binary"
objects, 8-to-7-bit gateways, gateways that change the transfer encoding to
ensure safe transport through an 80 column network, compressed body parts, or
additional content-transfer-encodings -- all become easy to define, because
at some level, everything is an octet stream that is represented the same way
on every system.
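
To illustrate the integrity-check point: if the check is computed over the
canonical octet stream, two systems with different local conventions get the
same answer.  The sketch below is an assumption-laden example, not a
proposal.  SHA-256 is used only because it is handy in Python (no particular
algorithm is being suggested), the CR-only system is a hypothetical second
convention, and to_canonical is an invented helper.

```python
import hashlib


def to_canonical(local_text: bytes, local_eol: bytes) -> bytes:
    """Replace the local end-of-line sequence with CR LF (hypothetical helper)."""
    return b"\r\n".join(local_text.split(local_eol))


# The same text as stored under two different local conventions.
lf_copy = b"Hi there!\nLolita.\n"   # lines end with LF
cr_copy = b"Hi there!\rLolita.\r"   # lines end with CR (hypothetical system)

# Checks over the local representations disagree...
raw_lf = hashlib.sha256(lf_copy).hexdigest()
raw_cr = hashlib.sha256(cr_copy).hexdigest()

# ...but checks over the canonical form agree.
canon_lf = hashlib.sha256(to_canonical(lf_copy, b"\n")).hexdigest()
canon_cr = hashlib.sha256(to_canonical(cr_copy, b"\r")).hexdigest()

print(raw_lf == raw_cr)      # False
print(canon_lf == canon_cr)  # True
```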

As for quoted-printable, we can do three things with end-of-line:

(1) we can ignore it, like in base64.  This is simple and unlikely to be
misunderstood, but ugly.  If we do this we will need to encode the CRLFs
that are used as end-of-line in the unencoded format as =0D=0A.

(2) we can say, as the current MIME draft says, that line breaks in the
unencoded text become line breaks in quoted-printable.  That looks fine, but
it breaks the separation between encoding and content-type.  Now we have to
special case our content-transfer-encoders and -decoders to know about text
objects.  Now it is more difficult to define a content-integrity-check for
text, because the line breaks aren't part of the octet stream anymore.  Also,
the canonical form for text objects to be q-p encoded is different from the
form for text objects to be base64 encoded, because we need to use CR LF to
end lines in the unencoded form of the base64-encoded object.  Basically this
leads to lots of special cases.

(3) we can define q-p such that 0D 0A may be encoded as a line break, and
line breaks are always converted to 0D 0A when decoded.  This is a very
simple rule, and it eliminates lots of special cases that would result from
the use of rule 2.

Note: It's fine with me if we include language that discourages use of this
rule with other than text objects.  If I'm going to use q-p to transmit a
binary object (containing mostly printable ascii characters), then I'm going
to generate nothing but soft line breaks in my q-p.  But I think it's
important to define precisely what octets are produced when decoding a hard
line break in q-p.
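
Here is a small Python sketch of how rule 3 and the note fit together.  It
is only an illustration: the names qp_encode_binary and qp_decode are
hypothetical, the 76 column width is an assumed limit, and the encoder
escapes spaces as well so that it never has to worry about trailing white
space.  The encoder emits only soft line breaks; the decoder turns every
hard line break into 0D 0A and drops every soft one.

```python
import re

_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def qp_encode_binary(data: bytes, width: int = 76, local_eol: bytes = b"\n") -> bytes:
    """Encode arbitrary octets using nothing but soft line breaks."""
    lines, current = [], b""
    for byte in data:
        if 33 <= byte <= 126 and byte != 0x3D:
            token = bytes([byte])        # printable, and not "="
        else:
            token = b"=%02X" % byte      # CR, LF, space, "=", 8-bit, ...
        # Keep one column free for the "=" that marks a soft line break.
        if len(current) + len(token) > width - 1:
            lines.append(current + b"=")
            current = b""
        current += token
    lines.append(current)                # last line: no line break at all
    return local_eol.join(lines)


def qp_decode(encoded: bytes, local_eol: bytes = b"\n") -> bytes:
    """Decode q-p; a hard line break always yields the octets 0D 0A."""
    out = bytearray()
    lines = encoded.split(local_eol)
    for n, line in enumerate(lines):
        soft = line.endswith(b"=")
        body = line[:-1] if soft else line
        out += _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), body)
        if not soft and n < len(lines) - 1:
            out += b"\r\n"               # hard line break
    return bytes(out)


blob = bytes(range(256))
print(qp_decode(qp_encode_binary(blob)) == blob)  # True
print(qp_decode(b"Hi there=21\nLolita.\n"))       # b'Hi there!\r\nLolita.\r\n'
```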

Keith