ietf-822

Re: printable wide character (was "multibyte") encodings

1993-01-11 07:51:10
I should apologize for having started this discussion and then
going away on vacation while it degenerated.  This reply, to a
letter of Nathaniel's from December 17, should be a bit more
on-topic.

In <UfAAv6i2NasDI0gq5p@thumper.bellcore.com>, Nathaniel wrote:
[Henry had written:]
> > UTF-2, in particular, is an encoding of 16-bit characters that represents
> > ASCII characters as themselves (one octet apiece) and is "file-system
> > safe", avoiding octets that have special meaning to common software.
>
> That's fine.  It seems to me that the right way to do 10646 in MIME is
> to have a character set something like "ISO-10646-UTF-2", and to say
> that the raw data for a MIME text/* entity of this character set is text
> in UTF-2.

I disagree.  10646 is a character set, and UTF is an encoding,
and it's risky to muddle the two issues.

Issues of nationalistic pride aside, one real strength of Unicode
(the 16-bit subset of 10646) is that it strives in large part to
map characters one-to-one with code points *without* "out of
band" codes such as the character-set-switching escape sequences
of ISO-2022.  (Exceptions are the various joiners, left/right
embedded directionality indicators, and non-spacing diacritics.)
To call an encoding of 10646 (such as UTF) a distinct character
set would be a step in the wrong direction.
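
To make the distinction concrete, here is a quick sketch, in C, of
what a UTF-2-style encoder does to a single 16-bit character.  The
bit layout is the one I understand the current FSS-UTF proposal to
use, and the function is my own illustration, not anything lifted
from a spec; the point is only that this is a transformation applied
to code points of an existing character set, not a character set in
its own right.

    /*
     * Sketch (mine, not from any spec) of a UTF-2/FSS-UTF-style
     * encoding of one 16-bit character.  ASCII values come out as
     * single octets, as themselves; everything else becomes two or
     * three octets, each with the high bit set.  Returns the number
     * of octets written into buf (at most 3).
     */
    #include <stdio.h>

    int utf2_encode(unsigned int c, unsigned char *buf)
    {
        if (c < 0x80) {                 /* one octet: 0xxxxxxx */
            buf[0] = (unsigned char)c;
            return 1;
        } else if (c < 0x800) {         /* two octets: 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (c >> 6);
            buf[1] = 0x80 | (c & 0x3F);
            return 2;
        } else {                        /* three octets: 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = 0xE0 | (c >> 12);
            buf[1] = 0x80 | ((c >> 6) & 0x3F);
            buf[2] = 0x80 | (c & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        static unsigned int examples[] =
            { 0x003C, 0x00E9, 0x4E00 }; /* '<', e-acute, a CJK ideograph */
        unsigned char buf[3];
        int i, j, n;

        for (i = 0; i < 3; i++) {
            n = utf2_encode(examples[i], buf);
            printf("%04X ->", examples[i]);
            for (j = 0; j < n; j++)
                printf(" %02X", buf[j]);
            printf("\n");
        }
        return 0;
    }

Note that in this layout every octet of a multi-octet character has
its high bit set, which is exactly why stray '<', CR, LF, and NUL
octets can't fall out of the middle of one.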

The importance of a one-to-one mapping (in the case of Unicode,
between characters and 16-bit quantities) becomes apparent when
additional processing steps are imposed.  It's nice always to be
able to know where the individual character boundaries are, and
not to misinterpret partial bytes which aren't full characters.
One such additional processing step which illustrates this nicety
is a richtext parser.

Richtext only barely meshes with ISO-2022-JP because 2022-JP is
sometimes 8 bits per character and sometimes 16.  Since a
richtext parser isn't likely to understand that distinction, it
can get confused when an 8-bit half of a 16-bit character happens
to match the bit pattern for '<'.  The solution, as Rhys
Weatherly has proposed, is to further encode an 8-bit half with
that value as <lt>, as other '<' characters are encoded in
richtext, but (to borrow a phrase) the ice seems thin here.
(Another point to consider is that a richtext processor wants to
keep track of character boundaries so that it can count them
while justifying and filling lines.)
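
In case the failure mode isn't obvious, here is a tiny demonstration,
again in C and again only a sketch of mine.  The JIS bytes are made
up for the example (I haven't checked which kanji they spell), but
0x3C is a perfectly legal half of a two-byte code, and a
byte-at-a-time scan for '<' duly fires on it.

    #include <stdio.h>

    int main(void)
    {
        /* "abc", ESC $ B (shift to JIS X 0208), the two-byte code
           0x3C 0x41, ESC ( B (back to ASCII), then a real richtext
           directive.  The naive scan below reports '<' at offset 6
           (the false hit inside the JIS pair) and at offset 11 (the
           real one). */
        static const unsigned char msg[] =
            "abc\033$B\074\101\033(B<bold>";
        const unsigned char *p;
        int i;

        for (p = msg, i = 0; *p != '\0'; p++, i++)
            if (*p == '<')
                printf("naive scan: '<' at offset %d\n", i);
        return 0;
    }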

> Well, if most of the characters are in fact ASCII, you can then use
> quoted-printable, right?  And troublesome characters like NUL can be
> encoded using either quoted-printable OR base64, right?

Not if we view the Content-transfer-encoding as a step which
happens before any higher-level processing, such as richtext
parsing, occurs.  If UTF (or some other encoding, which we're
mistakenly lumping with a character set) can contain octets of
values 60 ('<'), 13 (CR), 10 (LF), or 0, a richtext parser which
presupposes an octet stream is likely to get confused, and
quoted-printable won't help unless it is handled somewhere within
the richtext parser, which seems unclean and unnecessary.

It is often mentioned that MIME was designed with Unicode and/or
10646 in mind, and that dropping those character sets in is
something that could happen any day now.  Some may find it
obvious exactly how 16- or 32-bit character sets are to be
handled, but I think it's clear based on the amount of confusion
surrounding these issues that any such obvious solutions should
be described explicitly, and soon, lest mistaken assumptions
proliferate.

I can think of three general approaches for handling these large
character sets while also considering richtext:

     1. Continue to think primarily about octet streams, with
        wide characters hidden in them via UTF or equivalent
        encoding mechanisms.  This is probably the path of least
        resistance, but it's the one I'm arguing against, because
        I think it's full of lingering problems.

     2. Take the plunge and embrace wide characters with open
        arms: define a Content-transfer-encoding which encodes 16
        (or 32) -bit characters, and model the communication path
        between the content-transfer-encoding decoder and the
        richtext parser as a stream of 16- or 32-bit characters.
        (Whether this stream is implemented as an octet stream in
        some canonical order, or as some word-oriented IPC
        mechanism, is an implementation detail.)  The point is
        that the richtext parser's front-end "get a character"
        primitive would get a wide, multioctet character; a sketch
        of such a primitive appears after this list.  (The special
        '<' character would therefore appear as a 16- or 32-bit
        quantity with value 60.)

        Keith Moore last month bemoaned the suggestion of a
        departure from the familiar and comfortable byte stream.
        If we're going to use characters larger than 8 bits, some
        departure somewhere from an octet stream is obviously
        (and by definition) necessary.  Recalling the proper
        definition of "byte", however, we can if we wish continue
        to think about byte streams, as long as we remember that
        a byte may have more than 8 bits.  (It is for this reason
        that I am avoiding the term "multibyte character" in this
        note, and wishing that I hadn't used it in the original
        Subject line.)

        Deciding that knowledge of wide characters should
        permeate the entire richtext parser, all the way back to
        its front-end input primitives, does of course entail a
        significant rewrite of the parser, and implies eschewing
        many language-supplied string manipulation facilities
        (e.g. the standard C library's str* routines).
        TANSTAAFL.

        Calling UTF a transfer encoding has the additional
        implication that we either need to expand the syntax of
        the Content-Transfer-Encoding line to allow the
        specification of two (or more) cascaded encodings, or
        else define an encoding which maps 16- or 32-bit
        characters all the way back to printable characters.

     3. Implement special wide-character support wholly within
        richtext, with a mechanism like

                <widechar>char-id</widechar>
        or
                <widechar: char-id>

        where char-id is a printable hexadecimal representation,
        printable name, or other printable description of a
        single wide character.  (In the latter syntax,
        <widechar...> obviously becomes another example of a
        richtext directive not to be closed with a matching </>
        directive; it also introduces a new kind of parameterized
        directive.)  This approach has the advantages of limiting
        wide character support towards the back end of the
        richtext processor and eliminating any pernicious
        interactions at the richtext parser front-end, but it has
        the disadvantages that it introduces Yet Another encoding
        mechanism, and it only supports wide characters when
        richtext is also being used, offering no help for using
        wide characters with other text or content types.
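
For the record, here is roughly what the front-end primitive
mentioned under alternative 2 might look like.  This is only a
sketch of mine, assuming that the stream handed to the richtext
parser is an octet stream in a canonical high-octet-first order
(which, as noted, is an implementation detail); the names are
invented for the illustration.

    #include <stdio.h>

    #define WIDE_EOF (-1L)

    /* Fetch one 16-bit character, or WIDE_EOF at end of input. */
    long get_wide_char(FILE *fp)
    {
        int hi, lo;

        if ((hi = getc(fp)) == EOF)
            return WIDE_EOF;
        if ((lo = getc(fp)) == EOF)
            return WIDE_EOF;    /* truncated character; real code would complain */
        return ((long)hi << 8) | lo;
    }

    int main(void)
    {
        long c;

        /* A toy parser loop: the test for the richtext '<' is now a
           comparison against the 16-bit value 60, not an octet match,
           so halves of other characters can never masquerade as it. */
        while ((c = get_wide_char(stdin)) != WIDE_EOF)
            if (c == 60)
                printf("directive opens here\n");
        return 0;
    }

Whether a pair of octets or some word-oriented IPC mechanism carries
the characters doesn't matter to the parser; all it needs is that
get_wide_char() never returns half of anything.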

I'd be very interested to hear other people's opinions on which
of these alternatives (or others I haven't thought of) are
preferable.  Since alternatives 2 and 3 contain surprises which I
haven't heard any mention of, I suspect that people are tending
to think along the lines of alternative 1, which I should
probably come up with more cogent arguments against in order to
bolster my claim that it's the wrong approach.

Compilers were once thought to be nearly impossible to write,
until (among other things) we learned to separate lexical
analysis from parsing, which turned out to make the task much
cleaner and more tractable.  In an analogous way, I'd like to
keep transfer encoding issues clearly separated from character
set issues, and to rig things up so that all encodings can be
cleanly removed during one phase of processing, leaving a
straightforward character stream to be handled by later phases,
and to encourage the use of "constant width" extended character
sets such as Unicode.  (To preserve the constant width attribute,
we might want to encourage the use of precomposed characters and
discourage constructive non-spacing diacritics if we do adopt
Unicode within MIME.)  I have nothing against UTF (I agree that
it's nearly mandatory to use an encoding such that characters
which require only the lower 7 bits can be transmitted as
themselves, both for readability and transmission efficiency
reasons), but I'd like to see it stripped out early upon receipt.
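
To be concrete about what "stripped out early" might mean, here is
one more sketch of mine, using the same FSS-UTF-style bit layout as
above: a single pass removes the encoding and leaves a constant-width
array of 16-bit characters for every later phase to work on, with no
partial octets anywhere downstream.

    #include <stddef.h>

    /* Decode nocts octets of UTF-2-style text into at most outmax
       16-bit characters.  Returns the number of characters decoded,
       or -1 on a malformed sequence. */
    long utf2_decode(const unsigned char *in, size_t nocts,
                     unsigned short *out, size_t outmax)
    {
        size_t i = 0;
        long n = 0;

        while (i < nocts && (size_t)n < outmax) {
            unsigned int c = in[i];

            if (c < 0x80) {                     /* plain ASCII octet */
                out[n++] = (unsigned short)c;
                i += 1;
            } else if ((c & 0xE0) == 0xC0) {    /* 110xxxxx 10xxxxxx */
                if (i + 1 >= nocts || (in[i+1] & 0xC0) != 0x80)
                    return -1;
                out[n++] = (unsigned short)(((c & 0x1F) << 6)
                                            | (in[i+1] & 0x3F));
                i += 2;
            } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx 10xxxxxx 10xxxxxx */
                if (i + 2 >= nocts || (in[i+1] & 0xC0) != 0x80
                                   || (in[i+2] & 0xC0) != 0x80)
                    return -1;
                out[n++] = (unsigned short)(((c & 0x0F) << 12)
                                            | ((in[i+1] & 0x3F) << 6)
                                            |  (in[i+2] & 0x3F));
                i += 3;
            } else {
                return -1;                      /* stray continuation octet, etc. */
            }
        }
        return n;
    }

Later phases (the richtext parser, line filling and justification,
whatever else) then see nothing but whole characters in out[].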

Alternative 1 could be made to work, but by mushing together the
notions of character set and transfer encoding it places
unnecessary constraints on the encoding chosen or the richtext
parser's implementation or both.  (Either the encoding must
ensure that stray <, CR, LF, NUL, etc. characters aren't
introduced, or the richtext parser must be able to deal with
them.  UTF-2 does avoid these particular troublesome characters,
but it would be unfortunate to constrain any future encodings or
eventual son-of-1341 migration paths by choices made today.)

Alternative 2 seems like a lot of work, but moving to characters
wider than 8 bits is a radical change which we shouldn't expect
to be able to make with only minor modifications and rethinkings
of our ways of doing things.  Keeping our nascent mechanisms for
handling wide characters cleanly defined will help ensure that
they will also be long-lived.

Please note that the thrust of my discussion here has concerned
the means by which wide character sets might be supported, and
not which particular one is to be used.  I may try to defend
Unicode/10646 against its detractors elsewhere, but for the
moment I'd like to leave that still-volatile argument aside and
consider how we will incorporate wide character set(s) assuming
we manage to agree on their definition.

                                        Steve Summit
                                        scs@adam.mit.edu