Date: Mon, 11 Jan 93 09:50:46 -0500
From: scs@adam.mit.edu (Steve Summit)
Subject: Re: printable wide character (was "multibyte") encodings
(Steve writes):
10646 is a character set, and UTF is an encoding,
and it's risky to muddle the two issues.
For any MIME content-type, there is a "canonical form" which defines how you
map an object of that particular type into an octet stream. The MIME
canonical form for "text/plain;charset=ISO-10646" could be defined as "encode
the characters according to UTF-2", without breaking anything. (It could
also be "encode each n-bit code as n/8 octets, most significant octet first".
Whatever. It just needs to be defined.)
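To make the UTF-2 option concrete, here is a rough sketch (mine alone, and
assuming the FSS-UTF bit layout that I understand UTF-2 to use) of flattening
one 16-bit code into octets; plain ASCII codes pass through as single octets,
which is much of UTF-2's appeal:

    /* Sketch only: encode one 16-bit 10646 code as 1-3 octets,
     * UTF-2 (FSS-UTF) style.  Returns the number of octets
     * written into buf. */
    int
    utf2_encode(unsigned int c, unsigned char *buf)
    {
        if (c < 0x80) {                 /* 0xxxxxxx */
            buf[0] = (unsigned char) c;
            return 1;
        } else if (c < 0x800) {         /* 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char) (0xC0 | (c >> 6));
            buf[1] = (unsigned char) (0x80 | (c & 0x3F));
            return 2;
        } else {                        /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char) (0xE0 | (c >> 12));
            buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char) (0x80 | (c & 0x3F));
            return 3;
        }
    }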
Many content-types require some translation to get them into canonical form;
this is nothing specific to 10646. After all, the canonical form for
"text/plain;charset=US-ASCII" is "each character encoded in one octet, with
the 0x80 bit of the octet set to zero, and ends of lines encoded as CR LF" --
it's not as if you can do anything you wish with that bit or encode the ends
of lines in any way you wish.
The importance of a one-to-one mapping (in the case of Unicode,
between characters and 16-bit quantities) becomes apparent when
additional processing steps are imposed. It's nice always to be
able to know where the individual character boundaries are, and
not to misinterpret partial bytes which aren't full characters.
This isn't a problem with UTF-2.
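It isn't a problem because UTF-2's octets are self-describing: a continuation
octet always has the form 10xxxxxx, so a character boundary can be recognized
by looking at a single octet.  A small sketch (mine, assuming that bit
layout) of backing up to a character boundary:

    /* Sketch: back up from an arbitrary point in a UTF-2 octet
     * stream to the first octet of the character containing it.
     * Works because continuation octets always look like 10xxxxxx. */
    const unsigned char *
    utf2_char_start(const unsigned char *p, const unsigned char *start)
    {
        while (p > start && (*p & 0xC0) == 0x80)
            p--;                /* step back over 10xxxxxx octets */
        return p;
    }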
One such additional processing step which illustrates this nicety
is a richtext parser.
Richtext only barely meshes with ISO-2022-JP because 2022-JP uses
one byte per character in some runs and two in others.  Since a
richtext parser isn't likely to understand that distinction, it
can get confused when one byte of a two-byte character happens
to match the bit pattern for '<'.  The solution, as Rhys
Weatherly has proposed, is to further encode a byte with that
value as <lt>, just as literal '<' characters are encoded in
richtext, but (to borrow a phrase) the ice seems thin here.
(Another point to consider is that a richtext processor wants to
keep track of character boundaries so that it can count them
while justifying and filling lines.)
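To make the failure mode concrete, here is a toy example (mine; the byte
values are merely illustrative) of a naive octet-at-a-time scan tripping over
an ISO-2022-JP double-byte run.  Under UTF-2 the same scan is safe, since
every octet of a multi-octet character has its 0x80 bit set and so can never
collide with '<':

    #include <stdio.h>

    /* Toy illustration: inside an ISO-2022-JP double-byte run, a
     * single octet can equal 0x3C ('<') without being a '<' at all.
     * A scanner that works octet by octet and ignores the 2022
     * shift state will take it for markup. */
    int
    main(void)
    {
        /* ESC $ B switches to double-byte JIS; the 0x3C here is the
         * second octet of a two-octet character, not a '<';
         * ESC ( B switches back to ASCII. */
        unsigned char body[] = { 0x1B, '$', 'B', 0x30, 0x3C,
                                 0x1B, '(', 'B', 0 };
        unsigned char *p;

        for (p = body; *p != 0; p++)
            if (*p == '<')
                printf("naive scanner sees '<' at offset %d\n",
                       (int) (p - body));
        return 0;
    }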
To me, it doesn't make sense to feed text with character set 'x' to a
richtext parser designed to understand character set 'y'. Aside from the
charset switching commands inside richtext, the richtext language doesn't
have a problem -- if you want to say "<foo>", you have to encode each of the
characters in "<foo>" in whatever character set is being used. The body part
will then be labeled "text/richtext;charset=whatever" and the recipient's
richtext parser will have to understand that character set.
The character set support *within* richtext is not well defined -- for
example, what does it mean to have <iso10646>stuff</iso10646> within a
richtext body part that is encoded in EBCDIC?  [What happens if the "stuff",
in encoded form (when interpreted according to the outer character set),
contains valid richtext commands?  What happens if the "stuff", in *decoded*
form, contains valid richtext commands?]
Allowing a character set parameter to a richtext body part creates a problem
in that richtext interpreters might be expected to understand several
character sets. We don't expect a PostScript interpreter to understand
anything but ASCII -- why should we expect this of richtext?
(One solution might be to define the canonical form for text/richtext to
*always* be ISO 10646 with UTF-2 encoding, disallow the charset= parameter
to text/richtext content-types, and eliminate the character set commands from
within richtext.)
But I tend to view these as problems in the definition of richtext, which
have no bearing on how ISO 10646 text is to be represented.
I think there is agreement among richtext implementors that the richtext spec
in general needs lots of tightening up. So if there is a problem with
richtext combined with ISO 10646 encoded in a particular way (which, mind
you, does not appear to be the case if UTF-2 is used), perhaps the attention
is best directed to fixing the richtext definition to make it clearer what
character set is to be used.
--------
Keith Moore last month bemoaned the suggestion of a
departure from the familiar and comfortable byte stream.
If we're going to use characters larger than 8 bits, some
departure somewhere from an octet stream is obviously
(and by definition) necessary.
There is no reason why we cannot define everything in terms of a canonical
encoding into an octet stream. FTP has been doing this for years.
Calling UTF a transfer encoding has the additional
implication that we either need to expand the syntax of
the Content-Transfer-Encoding line to allow the
specification of two (or more) cascaded encodings, or
else define an encoding which maps 16- or 32-bit
characters all the way back to printable characters.
UTF is clearly *not* a suitable MIME content-transfer-encoding. It cannot be
applied to any content-type, nor is it interchangeable with other
content-transfer-encodings. It is possible and even reasonable to encode,
say, image/gif as base64, quoted-printable, or binary. Encoding image/gif in
UTF is meaningless.
A content-transfer-encoding maps the canonical form of a particular content
(which MUST be an octet stream) into its on-the-wire representation (which
also must be an octet stream). This is the model currently used in MIME, and
I have yet to see a good reason to change it.
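To spell the model out for the 10646 case, the headers might look like this
(the charset token is only for the sake of the example; whatever name
eventually gets registered would go in its place):

    Content-Type: text/plain; charset=ISO-10646
    Content-Transfer-Encoding: base64

The charset parameter, together with its defined canonical form (UTF-2, say),
determines how characters become octets; the content-transfer-encoding is
then applied to that octet stream, independently of what the octets mean.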
Steve Summit
scs@adam.mit.edu