PROBLEM: Newlines & Quoted-printable

As you know, I'm trying to put together the new MIME draft that
incorporates the IESG comments as well as the others that have trickled
in since the last draft.  Unfortunately, I've hit a bit of a snag when
trying to incorporate last week's discussion about quoted-printable and
newlines.   At the time, I saw nothing to argue with in Alain Fontaine's
suggested changes, but when I went to make them, I realized I didn't
agree with them.  Either I just didn't look at them hard enough the
previous time, or I've grown more obtuse over the weekend, but I feel
like I have to reopen this subject.

This is a VERY abstruse subject, and I offer my apologies in advance if
this seems long and tedious.  The basic question is simple:  How should
line breaks be represented and encoded in the quoted-printable encoding?

The current MIME draft (Jan 1992) is pretty clear on this:  Line breaks
in quoted-printable are represented as line breaks.  While there's a
mechanism for preceding a line break with an equal sign, which means it
is a non-significant line break, all "real" line breaks simply appear as
line breaks.  
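
To make that rule concrete, here is a minimal sketch in Python of a
decoder that follows it.  The function name and the local_newline
parameter are hypothetical, and mapping each =XX escape to a single
character is a simplifying assumption, not something taken from the
draft.

```python
import re

def decode_qp_line_oriented(encoded: str, local_newline: str = "\n") -> str:
    """Decode quoted-printable text one line at a time (illustrative only)."""
    out = []
    for line in encoded.splitlines():
        # A trailing "=" marks a non-significant (soft) line break.
        soft = line.endswith("=")
        if soft:
            line = line[:-1]
        # Replace each =XX escape with the character it names.
        out.append(re.sub(r"=([0-9A-Fa-f]{2})",
                          lambda m: chr(int(m.group(1), 16)), line))
        # A "real" line break stays a line break, in the local convention.
        if not soft:
            out.append(local_newline)
    return "".join(out)

print(repr(decode_qp_line_oriented("a=3Db=\nc\nd\n")))
```

Note that the decoder never has to decide what a line break "is"; it
simply hands back whatever the local system uses.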

Confusion enters because of the ambiguous phrase "line break". 
Presumably, by "line break" we mean CRLF.  

Now we're already getting onto shakier ground, because there has always
been a "polite fiction" about CRLF in RFC 822.  The polite fiction says
that "whenever an RFC 822 parser looks at a message, it sees CRLF as the
line break."  Unfortunately, this fiction does not correspond to reality
very well.  For example:  How many UNIX user agent programs, do you
think, parse a message's header by changing all the newlines to CRLF's,
then searching for lines that end with CRLF's, and then changing them
back before displaying them to the user?   This may seem irrelevant
insofar as 822 is the standard format for message TRANSPORT, but
unfortunately it is more than that, and this is where we start to get
into real trouble.

Fontaine's proposal says that any data to be transmitted with
quoted-printable must be converted to the CRLF representation for
newlines BEFORE encoding.  On the surface, this sounds reasonable, but
the more I think about it the more I think it is a recipe for disaster,
if it is even implementable at all, which is why I'm re-raising the
issue.

Think about the way text is transmitted now:  Within a
domain-of-newline-convention (e.g. a local UNIX system), mail is NOT
typically converted to the CRLF convention.  Oh, sure, sendmail and
other MTA's do this for message transport, but by the time a message
shows up in a mailbox, it is in the local newline convention, and it
typically stays that way for all non-delivery processing.  In other
words, in existing practice, the conversion of newlines is very much a
function of the transport layer.  A UNIX UA typically composes mail
using the local newline convention and then passes it off to the MTA,
which converts to CRLF when talking over SMTP.
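
As a rough illustration of that layering, the following Python sketch
keeps the newline conversion entirely at the transport boundary.  The
names to_wire and from_wire are hypothetical, and the sketch assumes a
UNIX-style local convention of a bare "\n" and pure ASCII text.

```python
def to_wire(local_text: str) -> bytes:
    """Outbound: local newlines become CRLF just before transport."""
    return local_text.replace("\n", "\r\n").encode("ascii")

def from_wire(wire_data: bytes) -> str:
    """Inbound: CRLF becomes the local newline on delivery."""
    return wire_data.decode("ascii").replace("\r\n", "\n")

message = "Subject: test\n\nhello\n"
on_the_wire = to_wire(message)
print(on_the_wire)
print(from_wire(on_the_wire) == message)
```

Everything above the transport layer, including the UA, only ever
sees the local form.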

Under Fontaine's proposal, the newline characters would be converted to
CRLF by whoever was doing the encoding.  Typically, in many
environments, this will be the user agent.  But now look at the
situation from poor sendmail's perspective:  Now sometimes it is being
called to deliver "plain text" (old-fashioned) mail in which there are
newlines that need to be converted to CRLF, and sometimes it is being
given quoted-printable mail in which CRLF's are already there.  How's it
supposed to tell the difference?  Worse still, say that sendmail
receives a message from the outside that is encoded in quoted-printable.
 Currently, sendmail knows to convert CRLF's to newlines in mail that
comes in from the outside with a local destination.  Is it supposed to
do the same thing, now, with quoted-printable mail?  If so, does that
mean it has to decode such mail?  If quoted-printable implies CRLF
newline representation, does this mean that the message must be passed
on to the UA either using the "alien" newline convention or using long
lines and eight-bit data that may break something else?  The question is
further complicated by the possibility of encoded sequences like
=0D=0A, which are unambiguously NOT representations of line breaks
according to the existing rule #1, but become ambiguous by the
introduction of Fontaine's new rule #1.  (The existing rule PROHIBITS
using =0D=0A to encode line breaks, but Fontaine permits it.)  The net
effect is that MTA's would have to get into the business of decoding
encoded data, performing newline transformations, and then maybe even
somehow re-encoding it for local delivery.  
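
Both difficulties can be sketched in a few lines of Python.  All of
the names here are hypothetical, and the "naive" MTA is an assumption
made for illustration, not a description of what sendmail actually
does.

```python
import re

def naive_outbound(body: str) -> str:
    """Hypothetical MTA step: turn every local newline into CRLF."""
    return body.replace("\n", "\r\n")

def decode_escapes(line: str) -> str:
    """Replace each =XX escape with the character it names."""
    return re.sub(r"=([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), line)

def literal_reading(line: str) -> str:
    """Existing rule: =0D=0A is just two bytes, never a line break."""
    return decode_escapes(line)

def line_break_reading(line: str, local_newline: str = "\n") -> str:
    """Proposed rule: a decoded CRLF may stand for a line break."""
    return decode_escapes(line).replace("\r\n", local_newline)

# Body that the UA has already converted to CRLF before encoding:
print(repr(naive_outbound("hello\r\nworld\r\n")))

# The same encoded text, read two ways:
print(repr(literal_reading("one=0D=0Atwo")))
print(repr(line_break_reading("one=0D=0Atwo")))
```

An MTA that cannot tell the two kinds of body apart doubles the
carriage returns in the first case, and a decoder cannot tell which of
the last two readings the sender intended.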

As I tried to figure out how to make the changes Alain proposed, I came
to remember that we had specifically designed quoted-printable NOT to
behave the way he suggests.  As currently defined, quoted-printable
says, in effect, "we're not messing with the definition of line breaks".
 This has the very nice property that all existing software that deals
with line breaks should do the right thing.   A quoted-printable line
break is represented, on the local system, precisely the way the local
system represents CRLF as defined by RFC 822.  That's a very simple
rule, and one that I don't think we should break.  

What all this points to, I believe, is that quoted-printable is
fundamentally line-oriented in the same way that unencoded 822 mail is,
and we should just be upfront about that fact.  It is NOT an encoding
intended to produce identical binary data on the recipient's end. 
Quoted-printable data will not even necessarily have the same number of
BYTES on the recipient system as on the sending system (e.g. if CRLF is
converted to newline).  This is a property it shares with text.  This
does not mean you couldn't checksum it, but you'd need a checksum
algorithm that treats line breaks specially, something like the notion
of "portable newline" that used to be in base64 but no longer is.
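
One way such a checksum might look, as a sketch only: fold every CRLF
to a single canonical newline before hashing, so that the result does
not depend on the local convention.  The function name is hypothetical
and the choice of SHA-256 is arbitrary; this is not the "portable
newline" mechanism itself.

```python
import hashlib

def line_oriented_checksum(data: bytes) -> str:
    """Checksum that ignores the difference between CRLF and bare LF."""
    canonical = data.replace(b"\r\n", b"\n")
    return hashlib.sha256(canonical).hexdigest()

sent = b"first line\r\nsecond line\r\n"      # as seen on a CRLF system
received = b"first line\nsecond line\n"      # as seen on a UNIX system
print(line_oriented_checksum(sent) == line_oriented_checksum(received))
```

The byte counts differ between the two systems, but the checksum of
the lines does not.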

In summary, the biggest problems with Alain's proposal are that it
muddles the current layering of Internet mail software, in which CRLF
conversion is almost exclusively a transport/gateway issue, and that it
introduces the possibility that quoted-printable data would contain
sequences such as =0D=0A which open up new ambiguities.   (Should that
be a newline or just the two specified bytes?)   The existing draft, I
believe, has neither of these problems, and therefore it is my current
belief that it should not be changed.  At the moment, in fact, I feel
VERY lucky to have caught it, rather than having introduced a possibly very
severe problem on the eve of proposed standard status.

Comments, anyone?  -- Nathaniel