Re: printable wide character (was "multibyte") encodings

In <9301111814.AA26178@dimacs.rutgers.edu>, Henry wrote:

UTF-2 does not use escape sequences...
...UTF-2 is completely
unambiguous -- there is *no* uncertainty about the location of
character boundaries, and no doubt about whether you are seeing a
full character or partial bytes.
...This particular encoding,
and no other, is what has convinced me that the transition can be made
without massive pain, without massive duplication of code, without
massive incompatibility.  It is, purely and simply, *well designed*.
...The tremendous advantage of the 10646-UTF-2 character set is that it
breaks almost nothing in our existing software corpus.
...Although there are several possible paths out of the 8-bit-character world,
this one is *overwhelmingly* the path of least resistance.  This is the
one people are going to use.


Henry makes many excellent arguments in favor of the UTF-2
encoding.  His points are well taken, particularly those about
attempting to preserve the existing (8-bit-oriented) software
corpus.  However, it seems to me that all of these arguments
prove only that UTF-2 is an excellent choice for representing an
extended character set within a particular system.  They have
little bearing on nomenclature and partitioning of functionality
within an interchange standard such as MIME.
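
The boundary property Henry describes can be illustrated with a
short sketch.  It assumes that UTF-2 uses the bit layout later
standardized as UTF-8: single octets below 0x80, lead octets
whose count of leading 1 bits gives the character length (up to
six), and continuation octets of the form 10xxxxxx.  The function
names are hypothetical, and the input is assumed to be well
formed.

```python
def is_continuation(octet: int) -> bool:
    """True for octets of the form 10xxxxxx, which never start a character."""
    return (octet & 0xC0) == 0x80


def expected_length(lead: int) -> int:
    """Number of octets in a character whose first octet is `lead`."""
    if lead < 0x80:
        return 1
    count = 0
    mask = 0x80
    while lead & mask:  # count the leading 1 bits of the lead octet
        count += 1
        mask >>= 1
    return count


def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the first octet of the character containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def ends_mid_character(data: bytes) -> bool:
    """True if the buffer stops partway through a multi-octet character."""
    if not data:
        return False
    start = char_start(data, len(data) - 1)
    return len(data) - start < expected_length(data[start])
```

Given any offset into a buffer, char_start() finds the enclosing
character by looking only at nearby octets, and
ends_mid_character() tells a reader whether more octets must
arrive before the last character is complete.  No escape-sequence
state has to be carried along.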

It is a mistake to view UTF-2 as a content-transfer-encoding.  It is
an alternate way to represent 10646 characters, a variable-width form
as contrasted to the fixed-width form that 10646 uses.  It is a mistake
to think of it as something layered on top of 10646.


It is not a mistake to keep explicitly separate and distinct that
which is separate and distinct, particularly when the costs of
maintaining the distinction are low.  UTF-2 *is* an encoding.
Whether or not it is a good encoding (and it is a very good
encoding) does not affect our choice of whether to indicate its
use in a field labeled "encoding" or a field labeled "character
set."

Earlier, in <9301111450.AA01003@adam.MIT.EDU>, I had written:

      ...define a Content-transfer-encoding which encodes 16
      (or 32) -bit characters, and model the communication path
      between the content-transfer-encoding decoder and the
      richtext parser as a stream of 16- or 32-bit characters.
      (Whether this stream is implemented as an octet stream in
      some canonical order, or as some word-oriented IPC
      mechanism, is an implementation detail.)


In the same vein as the parenthetical sentence, an operating
system (such as Plan Nine) which happened to use UTF-2 internally
could implement a UTF-2 content-transfer-encoding decoder as a
null process, with the communication path between decoder and
parser as its conventional UTF-2 octet stream.  (The document
describing MIME's adoption of 10646 and/or UTF could even
recommend such a strategy.)
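
A minimal sketch of the two decoder behaviors, under stated
assumptions: the function names are hypothetical, UTF-2 is
assumed to share the bit layout later standardized as UTF-8 (up
to six octets per character), and overlong forms are not
rejected.  On a system that uses UTF-2 internally the decoder
passes octets through untouched; on a wide-character system it
expands them into integer character codes.

```python
from typing import Iterator


def null_decoder(octets: bytes) -> bytes:
    """Decoder for a system that already uses UTF-2 internally: do nothing."""
    return octets


def utf2_to_wide(octets: bytes) -> Iterator[int]:
    """Decoder for a wide-character system: yield one integer per character."""
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:
            length, value = 1, lead
        else:
            # Count leading 1 bits to get the character's length.
            length, mask = 0, 0x80
            while lead & mask:
                length += 1
                mask >>= 1
            if length < 2 or length > 6:
                raise ValueError(f"bad lead octet at offset {i}")
            value = lead & (mask - 1)  # payload bits of the lead octet
        tail = octets[i + 1 : i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed character at offset {i}")
        for b in tail:
            value = (value << 6) | (b & 0x3F)  # six payload bits per octet
        yield value
        i += length
```

Either function could sit behind the same decoder interface; the
richtext parser downstream sees only whatever form of character
stream its own system prefers.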

10646 could just
as easily have defined the UTF-2 representations as the canonical form,
with a fixed-width alternate form to simplify some kinds of processing.


But it did not.

I have nothing against UTF-2.  But as much as we like it, there
will be environments which do not use it internally, either
because they use some other encoding, or because they use wide
(>8 bit) characters internally.  (For what it's worth, I have
heard it claimed that "Windows NT" will do so, not that it should
by any means be viewed as a trend-setter to be emulated; I
shudder to think of the abandonment of an existing software
corpus being engendered by its decision.)  I don't think that an
interchange standard should presuppose or encourage the use of a
particular encoding by lumping it in with a character set.

In <9301112056.AA03406@wilma.cs.utk.edu>, Keith wrote:

[I had written:]

Richtext only barely meshes with ISO-2022-JP because 2022-JP is
sometimes 8 bits per character and sometimes 16...


To me, it doesn't make sense to feed text with character set 'x' to a
richtext parser designed to understand character set 'y'.


I agree.  That's exactly why I'm trying to avoid having to teach
richtext parsers (or any other message processing software) about
encodings, which are properly dealt with in one central place.

     Keith Moore last month bemoaned the suggestion of a
     departure from the familiar and comfortable byte stream.
     If we're going to use characters larger than 8 bits, some
     departure somewhere from an octet stream is obviously
     (and by definition) necessary.


There is no reason why we cannot define everything in terms of a canonical
encoding into an octet stream.  FTP has been doing this for years.


But this suggestion begs the question: somewhere there is a
mapping between that octet stream and the "everything" which has
been encoded.  Above the level of the FTP "representation type"
[RFC959 sec. 3.1.1], data is handled which may be bytes or words
of more than 8 bits.  Analogously, it is not unthinkable for mail
messages above the level of the content-transfer-encoding to
consist of wide characters.

UTF is clearly *not* a suitable MIME content-transfer-encoding.  It cannot be
applied to any content-type, nor is it interchangeable with other
content-transfer-encodings.  It is possible and even reasonable to encode,
say, image/gif as base64, quoted-printable, or binary.  Encoding image/gif in
UTF is meaningless.

A content-transfer-encoding maps the canonical form of a particular content
(which MUST be an octet stream) into its on-the-wire representation (which
also must be an octet stream).


These arguments are suggestive, but are not entirely backed up
by the existing language in RFC1341.  The language discussing
canonical forms of content-types tends to mention "natural
format" or "character-oriented" or "byte stream."  I do not see
an explicitly stated requirement that a canonical form must be
8-bit, and whether or not such a requirement is implicit or
hidden, my arguments do in fact suggest that withdrawing it (or
making other changes; see below) might bear consideration.

UTF does not make sense for existing content-types because none
of them have 16- or 32-bit natural representations.  (I can't be
sure of GIF, JPEG/JFIF, and MPEG, which are only sketchily
described.)

It's true that RFC1341 suggests that encoding/decoding
activities may take place during stages other than those
corresponding to the content-transfer-encoding.  Section 7.5
mentions the JFIF encoding, and appendix H step 2 mentions
conversion, transformation, compression, or "various other
operations specific to the various content types" under the guise
of "conversion to canonical form."  This suggests to me not that
we should accept analogous implicit conversions and encodings of
wide character sets in text messages, but that canonical
encodings deserve more attention.  I had already been worried
about the mention of JFIF in section 7.5, and about the lack of
mention of similar encodings for GIF, audio, or video.  (Perhaps
the references for these formats, which I haven't chased, do
describe canonical 8-bit encodings.)

If MIME is to enjoy the same widespread and long-lived
applicability as has RFC822, as we'd all like it to, I think
it's a good idea to err on the side of explicit rather than
implicit indication of encodings and other operations.  Just as
the character set is explicit (rather than implicit ASCII, as it
was for RFC822), encodings and other processing steps should be
explicitly and distinctly specified, not lumped in with
content-types and character sets.  This is not for the purposes
of allowing or encouraging the proliferation of multiple
redundant encodings (or other parameters), but rather for the
support of clean design and separation of function.

Imagine a system which does not use UTF-2 internally, either
because it uses some other encoding, or because it uses entirely
16- or 32-bit characters, or because it has no widespread
knowledge of wide characters but is making only piecemeal
modifications to, say, mail transfer and display software.
If the application of UTF is specified as part of the charset
parameter in a content-type: text line, then either:

     1. knowledge of UTF (which is fairly easy, but not trivial,
        to decode) will have to be built into all message display
        programs, or

     2. whichever message transfer program is responsible for
        decoding the content-transfer-encoding will have to peek
        at the content-type: text charset= parameter to determine
        whether also to decode UTF.

If, on the other hand, UTF is specified separately from the
character set, a single process which is responsible for
decomposition of MIME messages into this hypothetical system's
canonical internal wide character format would be able to do so
much more cleanly, without peeking at fields which shouldn't
concern it.

If it is agreed that UTF-2 is not a proper content-transfer-encoding,
and/or that the requirement that canonical forms be restricted to
8-bit encodings be formalized and preserved, perhaps we need a
specification of content-encoding, either as a separate header or
as a new parameter within the content-type header.  This
hypothetical new field would then also be the place to specify
that JPEG images are being encoded with JFIF, and that the
still-sketchy audio and video formats are being encoded with
whatever their preferred canonical encodings turn out to be.
Again, this hypothetical field would not exist for the sake of
multiple redundant encodings, but simply to make explicit and
distinct an important processing step which might otherwise be
awkward to handle on a system whose internal assumptions differ
from those of the system which originated a particular message,
or from those contemplated by and familiar to the designers of
the interchange format.
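
As a rough sketch of the separation argued for above, the
following decomposes a message in two distinct steps.  Everything
specific here is an assumption made for illustration and is not
part of RFC1341: the header name "Content-Encoding", its value
"UTF-2", the charset label "ISO-10646", and the function names.
Python's utf-8 codec stands in for a UTF-2 decoder, on the
assumption that the two bit layouts coincide.

```python
import base64
from email import message_from_string
from typing import List


def undo_transfer_encoding(msg) -> bytes:
    """Step 1: undo the content-transfer-encoding, yielding octets."""
    cte = (msg.get("Content-Transfer-Encoding") or "7bit").strip().lower()
    body = msg.get_payload()
    if cte == "base64":
        return base64.b64decode(body)
    return body.encode("latin-1")  # 7bit/8bit/binary: octets as they are


def undo_content_encoding(msg, octets: bytes) -> List[int]:
    """Step 2: map octets to wide characters, guided only by the
    hypothetical Content-Encoding field, never by charset=."""
    encoding = (msg.get("Content-Encoding") or "").strip().lower()
    if encoding == "utf-2":
        # Stand-in decoder; assumes UTF-2 matches Python's utf-8 codec.
        return [ord(ch) for ch in octets.decode("utf-8")]
    return list(octets)  # no content encoding: one octet per character


def decompose(msg) -> List[int]:
    """Run both steps; display programs receive wide characters only."""
    return undo_content_encoding(msg, undo_transfer_encoding(msg))


# A sample message using the hypothetical header.
sample_octets = "a\u00e9\u20ac".encode("utf-8")
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=ISO-10646\n"
    "Content-Encoding: UTF-2\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(sample_octets).decode("ascii")
    + "\n"
)
wide = decompose(message_from_string(raw))
```

Neither step inspects the charset parameter, so a display program
that understands the character set never needs to know which
encoding carried it.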

                                        Steve Summit
                                        scs@adam.mit.edu