Re: Character-set header (was Re: Minutes of the Atlanta 822ext meeting)

John C Klensin writes:

A gateway that is choosing an appropriate encoding without full knowledge of
all types might do well to use a combination of the type, an analysis of
the data, and some heuristics to encode things. No other approach seems
appropriate.

  This is scary, and dredges up all of the old fears about gateways and 
other critters creatively encoding and improving messages beyond 
recognition.


I disagree. There are two defined encodings. That's all. You must use one or
the other. Currently we're ONLY talking about how to encode 8 bit text (you
know, the stuff with limited line lengths, etc.). Either encoding will handle 8
bit text in an invertible fashion -- done correctly, either encoding can be
reversed with NO information loss whatsoever. The encodings are defined to
work this way.
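
As a rough illustration of the invertibility claim, here is a minimal sketch
in Python. It uses the standard library's base64 and quopri modules as
stand-ins for a conforming encoder and decoder. The sample text and the helper
names (SAMPLE, encode_both, decode_both) are made up for this example.

```python
import base64
import quopri

# Hypothetical sample: short lines of 8 bit (ISO-8859-1) text.
SAMPLE = b"Gr\xfc\xdfe aus K\xf6ln\nna\xefve caf\xe9\n"

def encode_both(raw: bytes) -> dict:
    """Encode the same bytes with each of the two encodings."""
    return {
        "base64": base64.encodebytes(raw),
        "quoted-printable": quopri.encodestring(raw),
    }

def decode_both(encoded: dict) -> dict:
    """Reverse each encoding with its matching decoder."""
    return {
        "base64": base64.decodebytes(encoded["base64"]),
        "quoted-printable": quopri.decodestring(encoded["quoted-printable"]),
    }

encoded = encode_both(SAMPLE)
decoded = decode_both(encoded)

# Both encodings give back exactly the bytes that went in.
assert decoded["base64"] == SAMPLE
assert decoded["quoted-printable"] == SAMPLE
```

Both encoded forms contain only 7 bit characters, so either one can cross a
7 bit path and still be reversed on the far side.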

You can talk about creative encoders -- but they are broken if they don't
encode according to the specification. You cannot solve the problem of people
using broken encoders by pushing and tugging at who can use what encoding when.
You can produce recommendations until you're blue and it will not solve the
broken encoder problem. Indeed, by making the selection of encoder more and
more complicated, you ensure that implementers will spend more time selecting
an encoding and less time getting it right.

The selection of an encoder is significant in one way, and one way only --
efficiency. It is certainly true that for some material quoted-printable is
more efficient (material composed of mostly printable ASCII characters) and for
some material base64 is more efficient. This is because base64 has a fixed
overhead of about 33% while quoted-printable's overhead ranges from 0% to
200%, depending on the distribution of the bytes in the input
material. This is why I say some heuristics are useful in determining which
encoding to use.
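
The kind of heuristic meant here can be sketched in a few lines of Python.
This is only an estimate under stated assumptions: it treats the input as
text (so tab, space, CR and LF are counted as literal), and it ignores line
breaks added by either encoding. The function names needs_escape,
estimate_sizes and choose_encoding are invented for this example.

```python
def needs_escape(byte: int) -> bool:
    """True if quoted-printable would write this byte as =XX (3 characters)."""
    if byte in (9, 10, 13, 32):      # tab, LF, CR, space: literal in text
        return False
    return byte == ord("=") or not (33 <= byte <= 126)

def estimate_sizes(raw: bytes) -> tuple:
    """Return (quoted-printable size, base64 size), ignoring line breaks."""
    escaped = sum(needs_escape(b) for b in raw)
    qp_size = len(raw) + 2 * escaped          # each escape costs 2 extra chars
    b64_size = 4 * ((len(raw) + 2) // 3)      # 4 output chars per 3 input bytes
    return qp_size, b64_size

def choose_encoding(raw: bytes) -> str:
    """Pick whichever encoding is estimated to be shorter."""
    qp_size, b64_size = estimate_sizes(raw)
    return "quoted-printable" if qp_size <= b64_size else "base64"

print(estimate_sizes(b"abc"))            # all printable: 0% vs 33% overhead
print(estimate_sizes(b"\xff\xff\xff"))   # all 8 bit: 200% vs 33% overhead
```

Under these assumptions the break-even point is where about one byte in six
needs escaping: below that quoted-printable is shorter, above it base64 is.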

There is no argument to be made on the basis of readability -- it works out to
be the same consideration as efficiency. base64 renders the input material
totally unreadable. quoted-printable leaves the material readable in
direct proportion to its efficiency (the more unreadable the output is, the
less efficient the encoding is).

We can also extend this to cover encoding of binary material. Once again either
encoding will work -- you have to be a little more careful, but either one will
work. (I'm not going to get into why this is true, but it is true and a careful
study of the documents will show this.) You can make a case for never using
quoted-printable on binary material, but apart from potential inefficiency
there is no harm in doing so.
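
One reading of "a little more careful" is that line breaks inside binary data
must be escaped rather than passed through as literal line breaks. The sketch
below assumes that reading and uses Python's binascii module, whose b2a_qp
function takes an istext flag for this purpose. The names qp_encode_binary,
qp_decode and BLOB are made up for the example.

```python
import base64
import binascii

def qp_encode_binary(raw: bytes) -> bytes:
    """Quoted-printable for binary data.

    istext=False makes CR and LF come out as =0D and =0A instead of being
    treated as line breaks; quotetabs=True also escapes space and tab.
    """
    return binascii.b2a_qp(raw, quotetabs=True, istext=False)

def qp_decode(encoded: bytes) -> bytes:
    """Reverse qp_encode_binary."""
    return binascii.a2b_qp(encoded)

# Hypothetical binary sample: every possible byte value, repeated.
BLOB = bytes(range(256)) * 4

qp = qp_encode_binary(BLOB)
b64 = base64.encodebytes(BLOB)

# Either encoding reverses to the exact original bytes.
assert qp_decode(qp) == BLOB
assert base64.decodebytes(b64) == BLOB

# The only cost of choosing quoted-printable here is size.
print(len(BLOB), len(b64), len(qp))
```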

Finally, there is some difference between encodings in the EBCDIC world (some
of the characters quoted-printable uses may not be available, although I should
point out that it is possible to encode more than the minimal set of characters
in quoted-printable to avoid this invariance problem). This is not, however, a
problem for this group, since we're assuming a 7 bit printable path at least.
And who is in a position to determine the path in front of them better, the
gateway or the person who sent the message? By exercising a sender's mandate of
one encoding over another, you may in fact be blocking interoperability.
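
To show what encoding "more than the minimal set" could look like, here is a
hedged sketch of a quoted-printable encoder that also escapes a set of
printable characters assumed to be unreliable across EBCDIC translation. The
contents of EBCDIC_UNSAFE and the function names are assumptions for
illustration, and the sketch assumes LF line endings in the input. A stock
decoder (Python's quopri module here) reverses the result unchanged.

```python
import quopri

# Assumed set of printable ASCII characters that may not survive EBCDIC paths.
EBCDIC_UNSAFE = frozenset(b'!"#$@[\\]^`{|}~')
MAX_LINE = 76  # maximum encoded line length, soft-break "=" included

def _token(byte: int, at_line_end: bool) -> bytes:
    """Return the byte itself if it is safe to send literally, else =XX."""
    safe = 33 <= byte <= 126 and byte != ord("=") and byte not in EBCDIC_UNSAFE
    blank = byte in (9, 32) and not at_line_end   # no literal trailing blanks
    if safe or blank:
        return bytes([byte])
    return b"=%02X" % byte

def qp_encode_conservative(text: bytes) -> bytes:
    """Quoted-printable that escapes EBCDIC_UNSAFE characters as well."""
    out_lines = []
    for line in text.split(b"\n"):
        current = b""
        for i, byte in enumerate(line):
            tok = _token(byte, at_line_end=(i == len(line) - 1))
            if len(current) + len(tok) > MAX_LINE - 1:
                out_lines.append(current + b"=")   # soft line break
                current = b""
            current += tok
        out_lines.append(current)
    return b"\n".join(out_lines)

sample = b"user@example {x|y} [z] caf\xe9 ~\n"
encoded = qp_encode_conservative(sample)

# An ordinary decoder does not need to know which characters were escaped.
assert quopri.decodestring(encoded) == sample
```

The decoder side needs no changes, which is the point: escaping extra
characters costs some efficiency but does not affect invertibility.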

My conclusion, then, is that all this discussion of mechanisms for specifying
the encoding to be done by a remote host is largely a waste of time. Since
either encoding is invertible and both must be supported, the sender really has
no reason to specify this. By including such a specification you complicate
the implementation of the intermediate host, and thereby may end up causing
the nonexistent problem you were trying to avoid in the first place!

                                Ned