ietf-822

Re: printable multibyte encodings

1992-12-16 14:21:06
In <9212161857.AA11763@wilma.cs.utk.edu>, Keith Moore writes:
> MIME currently has the model that the "canonical" form for any content-type
> is an octet-stream.  I don't want to break this, and there is no compelling
> reason to do so.

This is worth reconsidering as we move to multibyte character
sets, especially unified ones: it's arguable that the canonical
form should be a stream of *characters*.

I think it's important to keep very clearly in mind that encoding
is orthogonal to content, and that richtext is a content.
Therefore, it should be possible to write content-transfer-encoding
decoders as separate programs, piped into (assuming the existence
of Unix-like pipes) a content-specific display process such as a
richtext parser.

Imagine we're using straight Unicode as a character set.  (I
realize that the incorporation of Unicode into MIME hasn't been
finalized yet, but that's why I'm exploring a few ideas here.)
If we model the channel between the content-transfer-encoding
decoder and the richtext parser as an octet stream, which somehow
(e.g. UTF) encodes Unicode, we have two problems:

     1. The richtext parser still has to do some decoding, and
        we've sullied the partitioning between transfer-encoding
        and content (the decoding process hasn't really done its
        job).

     2. The richtext parser can get confused if the intermediate
        encoded form of the Unicode characters contains bytes
        with values 60 or 62 ('<' or '>').  Rhys Weatherley is
        trying to untangle a horrible snarl introduced by the
        presence of such bytes in ISO-2022-JP when integrated
        with richtext.
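Problem 2 is easy to demonstrate (a sketch using a modern scripting language's iso2022_jp codec purely as a stand-in for a 1992-era implementation): plenty of JIS X 0208 characters encode to byte pairs containing 0x3C or 0x3E, so a byte-oriented richtext scanner would mistake them for tag delimiters.

```python
# Search the CJK range for a character whose ISO-2022-JP encoding contains
# a literal '<' or '>' byte -- the collision described in problem 2.
# (Python's iso2022_jp codec stands in for a 1992-era implementation.)
for cp in range(0x4E00, 0xA000):
    ch = chr(cp)
    try:
        encoded = ch.encode('iso2022_jp')
    except UnicodeEncodeError:
        continue                # character not in JIS X 0208
    if b'<' in encoded or b'>' in encoded:
        # A naive scanner looking for richtext tags trips over this.
        print('U+%04X encodes as %r' % (cp, encoded))
        break
```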

I hasten to admit that modeling the path between the decoder and
the richtext parser as a hexadectet stream has its own problems,
mainly in that it makes the richtext parser harder to write (one
has to be careful before using any standard string manipulation
routines).
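To make the string-manipulation hazard concrete (again a toy sketch in a modern language, not anything prescriptive): once characters are sixteen bits wide, byte-oriented length and search routines silently give wrong answers.

```python
# A hexadectet (16-bit-unit) stream breaks byte-oriented string routines.
text = "A\u0100B"                  # 'A', LATIN CAPITAL A WITH MACRON, 'B'
octets = text.encode('utf-16-be')  # b'\x00A\x01\x00\x00B'

print(len(text))          # 3 characters
print(len(octets))        # 6 octets: a strlen-style count is double
print(octets.find(b'B'))  # 5: a byte search lands at a misaligned offset
# Worse, the 0x00 high bytes would terminate a C string immediately.
```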

Keith continues:
> The way to do 10646 is to define a MIME-canonical form that is an
> octet-stream.  Pick one such that most of the printable 10646 characters that
> are also in the ASCII set coincide with the code values used for those
> characters in ASCII.  Then encode the result with any content-transfer-
> encoding you want.

And, in <Qf=qUjW0M2Yt05NhJz@thumper.bellcore.com>, Nathaniel
Borenstein agrees:
> ...it seems to me that any 16-bit (or 32-bit,
> or 128-bit, or whatever) characters can be (and typically are)
> represented as 8-bit octets in a canonical order.  At that point, they
> can be represented in MIME using either base64 or quoted-printable in a
> straightforward manner.

This is an obvious way to do it; I guess my acknowledgement of
that fact in my earlier message wasn't clear enough.

My minor concern, from <9212161600.AA21085@adam.MIT.EDU>:
> ...something
> bothers me about using a multibyte encoding (i.e. one of the UTF
> variants) which assumes the reliability of the eighth bit and so
> which will almost always have to be turned right around and fed
> through quoted-printable or base64.

is mostly an aesthetic one.  If we believe (as I do) that
8-bit-clean channels cannot be depended upon to exist everywhere
soon, it's a bit odd to define a multibyte encoding (e.g. UTF)
which assumes their existence.  I expect to be doing a lot of
manual encoding and decoding for a while; an encoding which makes
the underlying character visible (e.g. in hexadecimal) is quite
friendly.  (The UTF variants are not so friendly, whether or not
their individual octets are clearly visible.)
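To illustrate the kind of friendliness I mean (a toy scheme invented here for illustration only, down to its brace delimiters and its name; this is not a proposal): spell each character outside printable ASCII as its code value in visible hexadecimal.

```python
def visible_hex_encode(s):
    """Toy printable encoding: printable ASCII passes through unchanged;
    any other character becomes {XXXX}, its code value in visible hex.
    (The scheme and its delimiters are invented for illustration.)"""
    out = []
    for ch in s:
        if 0x20 <= ord(ch) < 0x7F and ch not in '{}':
            out.append(ch)
        else:
            out.append('{%04X}' % ord(ch))
    return ''.join(out)

print(visible_hex_encode('na\u00efve caf\u00e9'))  # na{00EF}ve caf{00E9}
```

A human armed with a code chart can decode that by eye, which is exactly what the UTF variants don't offer.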

(There is also the possibility of not using a 16 -> 8 bit
encoding, but using quoted-printable or base64 to encode all of
the octets comprising a multibyte stream.  Perhaps this is what
Keith and Nathaniel were suggesting, but it seems unwise; plain
text containing a preponderance of ASCII characters would appear
(in quoted-printable) as

        =00t=00h=00i=00s=00 =00i=00s=00 =00a=00 =00m=00e=00s=00s

which is neither compact nor readable.)
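That output is easy to reproduce (a sketch using a modern quoted-printable library; in 1992 terms, serialize each 16-bit character as two octets, high byte first, then apply quoted-printable):

```python
import quopri

text = "this is a mess"
# Each 16-bit character becomes two octets, high byte first; for ASCII
# text the high byte is always 0x00, which quoted-printable must escape.
wide = text.encode('utf-16-be')
print(quopri.encodestring(wide).decode('ascii'))
# =00t=00h=00i=00s=00 =00i=00s=00 =00a=00 =00m=00e=00s=00s
```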

My more significant concern was hidden in a parenthetical note:

> (In particular, I think
> we're going to have to acknowledge the possibility of multiple,
> cascaded encodings specified in the Content-transfer-encoding:
> header.)

If I've written a document using Unicode characters, which is
encoded first using UTF and then using quoted-printable before
transmission, what should the Content-transfer-encoding line
contain?  RFC1341 suggests strongly that Content-transfer-encoding
always mentions exactly one encoding, yet neither quoted-printable
nor UTF by themselves tell the whole story.
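If cascading were permitted, the receiver would simply undo the encodings in reverse order. A sketch (the cascaded header value is hypothetical, and today's UTF-8 stands in for the UTF variant under discussion):

```python
import quopri

# Hypothetical cascade, innermost first:
#   Content-Transfer-Encoding: utf, quoted-printable
wire = b"caf=C3=A9"                 # "café" in UTF-8, then quoted-printable
octets = quopri.decodestring(wire)  # undo the outer (transfer) encoding
text = octets.decode('utf-8')       # undo the inner (multibyte) encoding
print(text)                         # café
```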

Let me make my position clear: I'm not lobbying intensely for a
hypothetical new printable multibyte encoding; I'm just tossing
the idea out for consideration.  However, if it's felt that it's
more important to keep the number of encodings down (and I agree
that it is important), then I will lobby for an expanded
definition of the Content-transfer-encoding header to provide for
multiple, cascaded encodings.

                                        Steve Summit
                                        scs@adam.mit.edu
