ietf-822

Re: latest draft - content-transfer-encoding

1992-03-01 08:14:00
It seems I was composing two replies at the same time and got the recipients
transposed. My apologies -- it is late and I'm a little dazed.

Here's the reply that should have gone back to you.

> I agree that UTF *could* be treated as a character set.  However it
> also has properties similar to a transfer encoding.

Similar in the sense that it's a mapping, maybe, but it is definitely not the
same. Generally speaking, a transfer encoding is a mechanism that maps a byte
stream into a format that's compatible with the RFC822 concept of text. The
only exception to this notion is the handling of end-of-line boundaries, and
the decision was made in Santa Fe to force these things to be in-band, so they
now amount to no exception at all.

UTF is a mechanism for representing a specific set of characters as a byte
stream. While it is true that you could use it as an encoding mechanism to map
(and maybe compress in some cases but certainly not all) arbitrary 32-bit
quantities into a byte stream, this still does not satisfy the definition of a
transfer encoding -- the domain of the mapping is completely different and so
is the range!
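To make the domain mismatch concrete, here is a minimal sketch (mine, not part of the original message, and using the UTF-8 rules that were standardized after this exchange rather than the early DIS draft encoding under discussion): the encoder's input is a character number, so an arbitrary byte stream is simply not in its domain.

```python
# Sketch only: a UTF-style encoding maps *character numbers* to bytes.
# Its domain is a set of characters, not arbitrary byte streams --
# unlike a transfer encoding, whose domain is any byte stream at all.
# (These are the later UTF-8 rules, used purely for illustration.)

def encode_utf8(code_point: int) -> bytes:
    """Map a single character number to its byte sequence."""
    if code_point < 0x80:                      # 7-bit ASCII: 1 byte
        return bytes([code_point])
    if code_point < 0x800:                     # 2-byte form
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 3-byte form
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("not a character number this sketch handles")
```

Note that nothing here accepts, say, the byte 0xFF as input; a transfer encoding, by definition, must.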

Mathematically speaking, UTF is about as similar to a transfer encoding as the
sine function is to the set of all Diophantine functions. They are all
mappings, but they serve completely different purposes, operate on completely
different inputs, and produce outputs with completely different
characteristics.

> When I have a
> 10646 editor (for mail), I think I would prefer that it be encoded in
> UTF for transmission (over the eventual 8 bit connection).

You cannot assume the connection is 8-bit. This was an axiomatic basis of the
MIME work.

> That way
> when it is received by someone with the same mail handling facilities
> I have, it would be automatically decoded.

This sounds nice in principle, but it begs the question of whether anyone will
ever use UTF, and, for that matter, whether UTF will be adopted as a standard
of any kind whatsoever. Look what happened to the mappings specified in the
first DIS 10646.

> If it is handled as
> a character set, not a transfer encoding, then the user has to do the
> character set transformation explicitly.

There's nothing that says this has to be the way things are implemented. In
fact, I would find such an implementation totally unacceptable. I can only
speak of my own implementation of MIME, but in it character set translations
are performed automatically; there is no need for the user to do anything at
all aside from the need to specify (once) what transformations are to take place
and when. (The system manager will usually do most of this on behalf of the
users, but that's a local configuration issue well beyond the scope of this
discussion.)

This point has nothing to do with UTF anyway -- at present UTF is nothing more
than a promising, but distant, spectre on the horizon. It may loom larger some
day, but whether it does depends on at least two things. One, it
has to make it to full IS status. Two, it has to be widely implemented. It all
depends on who you talk to, but I hear noises that certainly seem to indicate
that there's going to be substantial opposition to both of these things
happening. Others, of course, say different things, but the point I'm trying to
make is that the future is far from certain and I for one don't care to commit
to one particular view of what the future may bring.

It is just generally sound engineering that when you're dealing with the
handling of multiple character sets (and I have to handle dozens of different
ones) a structurally sound framework for implementing general character set
support is very handy to have around. Once you have this why not make it
generally available and programmable?

In summary, the reasons that UTF is not listed as a valid transfer encoding in
MIME are:

(1) Listing something that's defined in a DIS is unacceptable to the IESG.
    Even a placeholder for such a thing is unacceptable. This reason alone
    would be sufficient to completely veto any such notion -- the rest of this
    discussion is purely academic.

(2) There is no clear need for such a specification at this time. It is not
    clear that UTF will ever really exist, let alone achieve significant
    usage anywhere.

(3) UTF is only suitable for a subset of the inputs a transfer encoding is
    expected to handle, and does not produce output in the format that meets
    the requirements of a transfer encoding. You could fix the problem with
    inputs by ignoring it (make UTF only applicable to a subset of all content
    types). However, doing things like this was a SHOW STOPPER for many people.
    (We visited this issue in another context.) The output problem would then
    require some sort of additional encoding (either quoted-printable or
    base64 would work but as others have pointed out neither is exactly right
    for the job). This immediately begs the question of why consider UTF
    as a transfer encoding at all.

(4) Calling UTF a character set does not seem to impose any significant
    restrictions on functionality, and seems to align nicely with the overall
    structure of MIME, whereas calling it a transfer encoding seems to really
    clash with the structure in lots of ways.

(5) If, for some reason, there comes a time when it is necessary to add
    additional transfer encodings, this can be done by adding them in a
    follow-on standards-track RFC. This has been made very hard to do quite
    intentionally, since the proliferation of additional encodings was viewed
    as a very bad thing. Allowing casual proliferation of transfer encodings
    was a SHOW STOPPER for many people. But if a compelling case can be made,
    and in particular points (1) and (3) can be dealt with, this could
    definitely be done in the future. Moreover, experimentation with this could
    begin now by using X-UTF as a transfer encoding. Thus, there's actually
    nothing standing in the way of doing this now and standardizing it later.

Finally, let me say a few words about the suitability of using the existing
transfer encodings with a UTF character set. It is definitely true that
quoted-printable would not handle this efficiently. However, I view the use of
quoted-printable as being rather limited, in that it provides an encoding
that's specifically suited to the occasional use of 8-bit characters (note
that this is not specifically tied to any one content-type). In particular,
material that contains 8-bit text will be rendered in a fashion that's about as
legible as it can be on 7-bit-only equipment. This was felt to be useful to
make the transition from 7-bit systems a little easier (easing this transition
has dictated several aspects of overall MIME design besides this).
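The legibility point can be seen from a simplified sketch of the quoted-printable idea (my illustration, covering only the core escaping rule; the real specification also handles line lengths, tabs, and trailing whitespace):

```python
# Simplified sketch of quoted-printable: printable ASCII passes through
# untouched, while '=', 8-bit bytes, and control characters become '=XX'.
# Mostly-ASCII text therefore stays readable on 7-bit-only equipment.

def qp_encode(data: bytes) -> str:
    out = []
    for byte in data:
        if byte == 0x3D or byte > 0x7E or (byte < 0x20 and byte != 0x0A):
            out.append("=%02X" % byte)     # escape as hex
        else:
            out.append(chr(byte))          # pass through
    return "".join(out)

# qp_encode("voil\u00e0".encode("latin-1")) -> 'voil=E0'
```

A reader on 7-bit equipment sees `voil=E0` rather than an opaque blob, which is exactly the transitional property described above.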

Base64 is almost always the preferred encoding for material that contains
non-trivial numbers of bytes with the high bit set. (I personally recommend,
and implement, making this choice on the basis of byte frequency counts.) And
it, or more accurately, the general class of 3-in-4 encodings, provides near
optimal worst-case performance when the input distribution is unknown and when
the output character set is limited to the invariant subset of ASCII.
(Actually, there are encodings that have superior worst-case performance. They
use the few remaining characters in the minimal invariant subset to advantage.
It has been a while since I worked this all out, but as I recall, using a 76
character output set admits the possibility of an 8 in 10 encoding. These
encodings are, however, rather complex and very slow, and they only improve
the 1.33 expansion factor to 1.25 at best, so there's little point in using
them.)
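The expansion arithmetic above is easy to check directly. The following sketch (my own, using Python's standard base64 module) verifies the 3-in-4 ratio and states the general information-theoretic floor for any fixed output alphabet:

```python
import base64
import math

# 3-in-4: every 3 input bytes become 4 output characters,
# for an expansion factor of 4/3, roughly 1.33.
data = bytes(range(256))                  # arbitrary binary input
encoded = base64.b64encode(data)
assert len(encoded) == 4 * math.ceil(len(data) / 3)

# General floor: m output characters over a k-character alphabet carry
# m*log2(k) bits, so encoding n bytes requires m >= 8n / log2(k).
def min_output_chars(n_bytes: int, alphabet_size: int) -> int:
    return math.ceil(8 * n_bytes / math.log2(alphabet_size))

# For base64's 64-character alphabet the floor is exactly 4 chars
# per 3 bytes: min_output_chars(3, 64) == 4.
```

Any scheme claiming a better ratio over a small alphabet can be sanity-checked against this floor.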

It would certainly be possible to implement an encoding that gives preferential
treatment to high-bit-set-bytes. Given some knowledge of input frequency
distributions it is _always_ possible to do as well as or better than base64. If
the use of some format that generates high-bit-set-bytes becomes dominant it
would certainly be a good idea to consider such a scheme. However, I feel
compelled to point out that this has not happened yet and does not promise to
happen any time soon. It is far from clear that the majority of mail (by
volume) will even be text, let alone UTF text, in the future. As such, perhaps
the time could be better spent working on a compression extension to MIME. This
promises to provide considerably better use of bandwidth for a broader class of
inputs than additional transfer encodings can.
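The frequency-count choice mentioned earlier (picking an encoding per body based on how many bytes need escaping) might be sketched like this; the specific threshold logic here is my assumption, not a description of any particular implementation:

```python
# Sketch (assumed heuristic, not anyone's actual code): estimate the
# output size of each encoding from byte counts and pick the smaller.

def choose_encoding(data: bytes) -> str:
    # Bytes quoted-printable must escape: '=', 8-bit, and most controls.
    escapes = sum(1 for b in data
                  if b == 0x3D or b > 0x7E
                  or (b < 0x20 and b not in (0x09, 0x0A, 0x0D)))
    qp_size = (len(data) - escapes) + 3 * escapes   # '=XX' per escape
    b64_size = 4 * ((len(data) + 2) // 3)           # fixed 4/3 growth
    return "quoted-printable" if qp_size <= b64_size else "base64"

# Occasional 8-bit characters favor quoted-printable;
# dense binary (many high-bit-set bytes) favors base64.
```

With mostly-ASCII input the escape count is small and quoted-printable wins; once roughly a sixth of the bytes need escaping, base64's fixed 4/3 growth becomes the better deal.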

                                Ned
