Re: PROBLEM: Newlines & Quoted-printable

This is a VERY obtuse subject, and I offer my apologies in advance if
this seems long and tedious.  The basic question is simple:  How should
line breaks be represented and encoded in the quoted-printable encoding?


I've never thought this was particularly obtuse. But then again, I use an
operating system that fully supports at least 5 different mechanisms for
encoding line breaks, all of which can be converted and interchanged any
time you want. After having to deal with this for so long it has become
second nature.

The current MIME draft (Jan 1992) is pretty clear on this:  Line breaks
in quoted-printable are represented as line breaks.  While there's a
mechanism for preceding a line break with an equal sign, which means it
is a non-significant line break, all "real" line breaks simply appear as
line breaks.
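
To make the rule concrete, here is a minimal sketch of a decoder that
treats a line ending in "=" as a non-significant (soft) break and every
other line break as a real one. The function name decode_qp_text and its
newline parameter are hypothetical, chosen only for illustration, and the
sketch assumes the input has already been split on the local newline
convention. It ignores details such as trailing whitespace and character
sets.

```python
import re

# Matches an escaped octet such as "=3D" (assumed well-formed input).
_HEX = re.compile(r"=([0-9A-Fa-f]{2})")


def decode_qp_text(encoded: str, newline: str = "\n") -> str:
    """Decode quoted-printable text, honouring soft and hard line breaks.

    Hypothetical helper: `newline` is whatever the local system uses to
    represent a real line break.
    """
    lines = encoded.split("\n")
    pieces = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        if line.endswith("="):
            # Soft break: drop the "=" and the line break that follows it.
            body, terminator = line[:-1], ""
        else:
            # Hard break: a real line break, emitted in the local convention.
            body, terminator = line, ("" if is_last else newline)
        pieces.append(_HEX.sub(lambda m: chr(int(m.group(1), 16)), body))
        pieces.append(terminator)
    return "".join(pieces)


print(repr(decode_qp_text("long line that was =\nfolded\nnext line\n")))
```

The first break disappears on decoding because it was preceded by "=",
while the other two survive as real line breaks.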


No argument here.

Confusion enters because of the ambiguous phrase "line break". 
Presumably, by "line break" we mean CRLF.


No, we mean a line break. For text messages this is defined to be equivalent to 
a CRLF. Other types and subtypes may do things differently.

My favorite example is, as always, PostScript. PostScript material is a
stream. Normally in this stream CR, LF, CRLF, and space are all more or
less equivalent, but any of them can be redefined at any time to mean
something completely different.

Other formats might involve, say, fixed length records. In these formats
line breaks simply do not exist.

Now we're already getting onto shakier ground,  because there has always
been a "polite fiction" about CRLF in RFC 822.  The polite fiction says
that "whenever an RFC 822 parser looks at a message, it sees CRLF as the
line break."  Unfortunately, this fiction does not correspond to reality
very well.  For example:  How many UNIX user agent programs, do you
think, parse a message's header by changing all the newlines to CRLF's,
then searching for lines that end with CRLF's, and then changing them
back before displaying them to the user?   This may seem irrelevant
insofar as 822 is the standard format for message TRANSPORT, but
unfortunately it is more than that, and this is where we start to get
into real trouble.


Coming as I do from an operating system that thinks this is entirely
normal I don't see the problem apart from the fact that standard I/O cannot
deal with this stuff very gracefully. (You should see the contortions imposed
on standard I/O when it tries to cope with a more general definition of file
in a way that's both capable of dealing with mixed formats yet preserves the
overall "flavor" of UNIX. Yuck.)

Fontaine's proposal says that any data to be transmitted with
quoted-printable must be converted to the CRLF representation for
newlines BEFORE encoding.  On the surface, this sounds reasonable, but
the more I think about it the more I think it is a recipe for disaster,
if it is even implementable at all, which is why I'm re-raising the
issue.


It is implementable. I've implemented it. But I cannot and do not claim
that I have a typical environment.

Think about the way text is transmitted now:  Within a
domain-of-newline-convention (e.g. a local UNIX system), mail is NOT
typically converted to the CRLF convention.  Oh, sure, sendmail and
other MTA's do this for message transport, but by the time a message
shows up in a mailbox, it is in the local newline convention, and it
typically stays that way for all non-delivery processing.  In other
words, in existing practice, the conversion of newlines is very much a
function of the transport layer.  A UNIX UA typically composes mail
using the local newline convention and then passes it off to the MTA,
which converts to CRLF when talking over SMTP.


I actually use counted records to store everything prior to the point of
transmission. This is, of course, the most general of the formats available
to me.

Under Fontaine's proposal, the newline characters would be converted to
CRLF by whoever was doing the encoding.  Typically, in many
environments, this will be the user agent.  But now look at the
situation from poor sendmail's perspective:  Now sometimes it is being
called to deliver "plain text" (old-fashioned) mail in which there are
newlines that need to be converted to CRLF, and sometimes it is being
given quoted-printable mail in which CRLF's are already there.  How's it
supposed to tell the difference?  Worse still, say that sendmail
receives a message from the outside that is encoded in quoted-printable.


This seems so easy to resolve I must be missing something. For starters,
the encoder reads the input material. It must know what constitutes a
"line break" in whatever the input material is. For text, this is going to
be whatever the local newline convention or conventions are. Most other things 
by and large don't have line break conventions that make sense to recognize, so
there are no line breaks to deal with. (Certainly the two examples I gave
above don't have newline mechanisms that make any sense.)

When a newline is encountered it is encoded as such. Both base64 and 
quoted-printable admit the possibility of encoding it as a 0D0A. 
Quoted-printable also admits the use of line break as the encoding for a
line break (note that these are NOT the same thing -- one is the encoding
for the other).
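
The two choices can be sketched roughly as follows. This is only an
illustration: encode_qp_text and its explicit_crlf flag are hypothetical
names, the input is assumed to be text whose line breaks have already
been recognized and normalized to "\n", and the sketch leaves out soft
line breaks, the line length limit, and trailing whitespace handling.
Whether the second form should be allowed for line breaks at all is
exactly what is being debated here.

```python
def encode_qp_text(text: str, explicit_crlf: bool = False) -> str:
    """Encode text as quoted-printable, choosing how line breaks appear.

    Hypothetical helper. With explicit_crlf=False a line break is encoded
    as a line break; with explicit_crlf=True it is encoded as the two
    octets 0D 0A.
    """
    out = []
    for ch in text:
        if ch == "\n":
            out.append("=0D=0A" if explicit_crlf else "\n")
        elif ch == "=" or not (" " <= ch <= "~"):
            # Escape "=" and anything outside printable ASCII, octet by octet.
            out.append("".join(f"={b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(repr(encode_qp_text("a=b\nc\n")))
print(repr(encode_qp_text("a=b\nc\n", explicit_crlf=True)))
```

In the first form the line break is still a line break and is subject to
whatever the transport does to line breaks. In the second form it is
just two more encoded octets.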

But the process is not finished. The encoding must be placed in "transmission
format". This may be CRLF delimited lines, or LF delimited lines, or counted
length lines (this is what I use), or whatever. The point is that this is
yet another intermediate representation -- it is what the transport
mechanism expects to get. The one thing that is for sure now is that you're
talking about short text lines here with some form of newline convention.
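
A rough sketch of this last step, assuming the encoder hands over a list
of short encoded lines with no terminators attached. The function name
to_transmission_format and the style names are hypothetical, and the
counted-record layout (a 2-byte big-endian length before each line) is
an assumption made for illustration, not the format of any particular
system.

```python
import struct


def to_transmission_format(lines: list[bytes], style: str) -> bytes:
    """Serialize already-encoded lines in the form a transport expects.

    Hypothetical helper: `lines` carry no line terminators of their own.
    """
    if style == "crlf":
        # Canonical 822 style: every line ends in CR LF.
        return b"".join(line + b"\r\n" for line in lines)
    if style == "lf":
        # Local UNIX style: every line ends in LF.
        return b"".join(line + b"\n" for line in lines)
    if style == "counted":
        # Counted records: assumed 2-byte big-endian length prefix.
        return b"".join(struct.pack(">H", len(line)) + line for line in lines)
    raise ValueError(f"unknown transmission style: {style!r}")


encoded_lines = [b"a=3Db", b"c"]
for style in ("crlf", "lf", "counted"):
    print(style, to_transmission_format(encoded_lines, style))
```

The encoded lines are the same in all three cases. Only the way the
boundaries between them are represented changes.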

Currently, sendmail knows to convert CRLF's to newlines in mail that
comes in from the outside with a local destination.  Is it supposed to
do the same thing, now, with quoted-printable mail?  If so, does that
mean it has to decode such mail?  If quoted-printable implies CRLF
newline representation, does this mean that the message must be passed
on to the UA either using the "alien" newline convention or using long
lines and eight-bit data that may break something else?  The question is
further complicated by the possibility of encoded sequences like
=0D=0A, which are unambiguously NOT representations of line breaks
according to the existing rule #1, but become ambiguous by the
introduction of Fontaine's new rule #1.  (The existing rule PROHIBITS
using =0d=0a to encode line breaks, but Fontaine permits it.)  The net
effect is that MTA's would have to get into the business of decoding
encoded data, performing newline transformations, and then maybe even
somehow re-encoding it for local delivery.


I'm not sure I understand all this. There are two cases. Either sendmail
does the encoding or it does not. If it does the encoding it has to be
able to obtain not only the data but knowledge of what newline convention,
if any, the data supports.

If sendmail doesn't do the encoding (and I don't think it should) the
encoder must be elsewhere (in the UA, I guess) but it still has to analyze
the material presented to it and understand what a newline is in it. It
then spits out something that is acceptable to the conventions of the
local transport. I expect this will not be canonical 822 but instead will
be 822/LF.

In either case I don't see that much of a problem.

As I tried to figure out how to make the changes Alain proposed, I came
to remember that we had specifically designed quoted-printable NOT to
behave the way he suggests.  As currently defined, quoted-printable
says, in effect, "we're not messing with the definition of line breaks".


But this gets back to the question of what a line break is. I don't know what
it is unless you tell me what we're dealing with first.

This has the very nice property that all existing software that deals
with line breaks should do the right thing.   A quoted-printable line
break is represented, on the local system, precisely the way the local
system represents CRLF as defined by RFC 822.  That's a very simple
rule, and one that I don't think we should break.


I don't think Alain's changes break this rule. All you have to accept is the
existence of yet another intermediate form between the encoder and the
transport.

What all this points to, I believe, is that quoted-printable is
fundamentally line-oriented in the same way that unencoded 822 mail is,
and we should just be upfront about that fact.  It is NOT an encoding
intended to produce identical binary data on the recipient's end. 
Quoted-printable data will not even necessarily have the same number of
BYTES on the recipient system as on the sending system (e.g. if CRLF is
converted to newline).  This is a property it shares with text.  This
does not mean you couldn't checksum it, but you'd need a checksum
algorithm that treats line breaks specially, something like the notion
of "portable newline" that used to be in base64 but no longer is.
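
One way such a checksum might look, as a sketch only: map every line
break representation to a single canonical form before hashing. The name
text_checksum is hypothetical, the choice of SHA-256 is arbitrary, and
treating a bare CR as a line break is an assumption that would be wrong
for data in which CR and LF carry distinct meanings.

```python
import hashlib


def text_checksum(data: bytes) -> str:
    """Checksum text so that the newline convention does not matter.

    Hypothetical helper: CRLF, bare CR, and bare LF are all treated as
    the same line break (an assumption suitable only for plain text).
    """
    canonical = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(canonical).hexdigest()


sent = b"first line\r\nsecond line\r\n"    # as carried over SMTP
received = b"first line\nsecond line\n"    # as stored on a UNIX system
print(text_checksum(sent) == text_checksum(received))
```

The byte counts of the two copies differ, yet the checksums agree,
because only the line break representation changed.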


Suppose I never use the line breaks in quoted-printable. I always explicitly
code the 0D0A or whatever into the stream. (This is, in fact, what I believe
is necessary to represent many things in quoted-printable.) Why is this
fragile in any way?

In summary, the biggest problems with Alain's proposal are that it
muddles the current layering of Internet mail software, in which CRLF
conversion is almost exclusively a transport/gateway issue, and that it
introduces the possibility that quoted-printable data would contain
sequences such as =0D=0A which open up new ambiguities.   (Should that
be a newline or just the two specified bytes?)   The existing draft, I
believe, has neither of these problems, and therefore it is my current
belief that it should not be changed.  At the moment, in fact, I feel
VERY lucky to have caught it, rather than having introduced a possibly
very severe problem on the eve of proposed standard status.


It's pretty obvious that I just don't have the right perspective to deal with
this, since I deal with it at all sorts of levels all the time. But I'll
be the first to say that my views don't represent anything like a majority
view here. (Although come to think of it, PCs use CRLF conventions...) I
can accept pretty much anything short of out-of-band conventions for the
representation of line breaks in base64. Those were hideous and I'm glad
that they are gone.

                                Ned