Re: UTF-2 and the 8-bit problem

Erik's note raises several provocative questions (perhaps more than he
realizes).  I think the answers may tell us a lot about how to proceed.

"10646, UTF-2 and MU may be solutions, but what is the problem?"


   There are a few problems claimed.  They are not necessarily
consistent with each other in terms of the choices they dictate about
the use of 10646 (or some subset or additional coding on it) in Internet
email.  Some are:

(i) There is strong demand to use many character sets (which, absent
10646, would be separate ones) in the same message, and to do so in
plain text, without procedural or generic markup.  No other solution,
known or proposed, is in sight.

    I used to be in this camp.  I've dropped out after thinking about
the discussions on this list and thinking about where I actually see
multilingual texts in paper and electronic mail.  Those texts fall
mostly into two categories:
   -- parallel translations for the convenience of the reader.  It turns
      out that multipart/alternative is a better approach to most of
      these, and it permits different character sets
      to be used in different sections.  [I wonder if that example
      belongs in the MIME doc?]
   -- texts, typically scholarly ones rather than casual mail, that need
      to use one language to talk about another, or to insert passages
      from one into another.  We discussed this one in the richtext
      context and decided it didn't belong there and that stronger
      markup was needed.  If that be the case, then we don't need the
      capability for text/plain either.
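
As a rough sketch of the multipart/alternative case in the first
category, the following uses Python's standard email package to build
one message whose two text/plain parts carry different character sets.
The sample strings and the choice of iso-8859-1 and iso-2022-jp are
assumptions made purely for illustration; nothing here is taken from
the MIME document itself.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Illustrative sample texts (assumed, not from the discussion): a short
# greeting in each of two languages that need different character sets.
GERMAN = "Grüße"
JAPANESE = "こんにちは"


def build_parallel_message():
    """Return a multipart/alternative message with one part per charset."""
    msg = MIMEMultipart("alternative")
    msg.attach(MIMEText(GERMAN, "plain", "iso-8859-1"))
    msg.attach(MIMEText(JAPANESE, "plain", "iso-2022-jp"))
    return msg


message = build_parallel_message()
for part in message.get_payload():
    # Each part labels its own charset independently of its siblings.
    print(part.get_content_type(), part.get_content_charset())
```

The point of the sketch is only that the charset label lives on each
body part, so no single part has to mix character sets.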

(ii) It is really a big problem to handle multiple character sets in
UAs; one should have only one character set in mail.

   This is a real problem, but 10646 won't solve it.  First, we are
stuck with [US-]ASCII, with a family of 8859 character sets, and with
2022-JP.  We might conceivably be able to phase down the things that
create them (although I have my doubts), but readers are going to need
to support at least some of them for quite a while.  Second, 10646 and
its variations become a big advantage in this regard only if we can settle
on exactly one of them.  It is clear from the interminable discussions
of the last few months that this isn't going to happen.  There is good
reason for using a coding variation that is seven-bit compatible and
algorithmic (e.g., MU).  There is good reason for using a coding
variation that is seven-bit compatible and understandable when not
decoded (e.g., MNEM).  There is good reason for using a coding variation
that uses 8bit and that is compact for the Unicode/BMP subset and very
compact for ASCII and 8859-1 and that, as Erik points out, offers good
compatibility with U**X file storage (e.g., UTF-2).  There is good
reason for sending the full 32-bit form (UCS-4).  There may be good
reason for sending the 16-bit form (UCS-2).
   But, while we have seen a lot of argument that one form or another is
"better" or "good enough for most purposes", we've yet to see anything
that I would consider even mildly convincing for prohibiting some forms
altogether.  Given the registration mechanisms, it is not even clear
that we could do that.
   So much for "10646 vastly simplifies the UA problem".
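
To make the size trade-offs among the forms listed under (ii) a little
more concrete, here is a minimal sketch that measures a few sample
strings in three encodings.  It rests on loud assumptions: Python's
utf-8, utf-16-be and utf-32-be codecs are used as rough stand-ins for
UTF-2, UCS-2 and UCS-4 respectively, and the sample strings are
invented.  MU and MNEM are left out because Python ships no codec for
them.

```python
# Assumed stand-ins: these Python codecs approximate the forms named in
# the text; they are not implementations of the 10646 drafts themselves.
FORMS = {
    "UTF-2": "utf-8",      # 8-bit, variable width
    "UCS-2": "utf-16-be",  # 16-bit form, no byte-order mark
    "UCS-4": "utf-32-be",  # 32-bit form, no byte-order mark
}

# Invented sample strings, one per repertoire mentioned in the text.
SAMPLES = {
    "ASCII": "Hello, world",
    "8859-1": "Grüße aus Zürich",
    "BMP": "こんにちは",
}


def encoded_sizes(text):
    """Return the encoded length in octets of `text` for each form."""
    return {form: len(text.encode(codec)) for form, codec in FORMS.items()}


def is_seven_bit(data):
    """True if every octet in `data` has its high bit clear."""
    return all(octet < 0x80 for octet in data)


for name, text in SAMPLES.items():
    print(name, encoded_sizes(text), is_seven_bit(text.encode("utf-8")))
```

None of this argues for one form over another; it only shows that each
form wins on some input, which is why prohibiting any of them is hard
to justify.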

(iii) 10646 is about to take over the universe, and, if the email
community does not get behind it, that will be a serious problem.

Taking over the universe is not demonstrated.  We are talking here about
a standard that hasn't even been published and its use in the internals
of some operating systems that, despite hype and interest, are in
"research" and/or "pre-release" or "beta test" status.  Can one
experiment with it?  Sure.  Should one? Probably.  Should it be
standardized by IETF and pushed aggressively on the community to solve a
problem?  Can everyone spell "the ISO way"?

(iv) Transport of multiple character sets is really a problem and it can
be solved by using a single character set in transport, even if
different sets are used *within* receiving and sending systems.

It is a problem.  We don't have a solution.  If we had adopted
transport-based strategies, rather than the UA/MIME one, it would be an
interesting issue.  As it is, this argument is a layer-violating
variation on (ii) above.

"Who wants to use 10646 or Unicode in email anyway, and why?"


Some of the answers to this one are part of the problem statements
above.  Some are utopian fantasies (e.g., the belief that 10646 is going
to permit us to get rid of all of those other character sets).  Some are
variants on the "system X is going to use this internally, it would be
much nicer if everyone used it in mail" theme (which might be
interesting if "system X" really dominated the market today, rather than
being speculation about what might happen in the future).

if I added that UTF-2 could be used on 8-bit paths, and that UTF-2
could be converted to MU when a mail message bumps into an MTA that
does not support 8-bit.
...
I.e. you convert from
   Content-Type: text/plain; charset=utf-2
   Content-Transfer-Encoding: 8bit
to
   Content-Type: text/plain; charset=mu
   Content-Transfer-Encoding: 7bit


I think this illustrates the problem, rather than a solution to it (!). 
While the SMTP extensions stuff clearly does not prohibit this (nor
should it), we already have a problem of expecting gateways to have to
reach much further into the message structure to make proper conversions
than is comfortable for either people who worry about layering or those
who hope that MTAs and gateways can be kept simple enough to be correct.

Here we have an 8 -> 7 gateway that recognizes a particular charset
parameter and goes off and converts it to a different charset.  Bleech. 
This may be a strong argument for modifying MIME to permit:
    Content-Type: text/plain; charset=iso-10646
    Content-10646-encoding: MU  (or UTF-2 or UCS-2 or UCS-4 or MNEM)
    Content-transfer-encoding: 8bit (or 7bit or quoted-printable or base64)

That impresses me as a genuinely terrible idea.  But it may be
preferable to encouraging gateways to do character set conversion in
order to overcome 8 -> 7 boundaries.
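
The difference between the two kinds of 8 -> 7 downgrade discussed
above can be sketched as follows.  This is only an illustration under
stated assumptions: headers are modelled as a plain dict, both function
names are hypothetical, and utf-8 and utf-7 stand in for utf-2 and mu
simply because Python has codecs for them.  Line-length limits and
every other real gateway concern are ignored.

```python
import quopri


def downgrade_transfer_only(headers, body):
    """Make the body 7-bit safe by changing only the transfer encoding.

    The charset label is left alone, so the gateway never needs to
    understand the text itself.
    """
    new_headers = dict(headers)
    new_headers["Content-Transfer-Encoding"] = "quoted-printable"
    return new_headers, quopri.encodestring(body)


def downgrade_by_charset_conversion(headers, body, src_codec, dst_codec):
    """Make the body 7-bit safe by re-encoding it in another charset.

    The gateway must decode the text and rewrite the charset parameter,
    which is the deeper reach into the message that the text objects to.
    """
    text = body.decode(src_codec)
    new_headers = dict(headers)
    new_headers["Content-Type"] = "text/plain; charset=" + dst_codec
    new_headers["Content-Transfer-Encoding"] = "7bit"
    return new_headers, text.encode(dst_codec)


# Hypothetical incoming message on an 8-bit path.
original_headers = {
    "Content-Type": "text/plain; charset=utf-8",
    "Content-Transfer-Encoding": "8bit",
}
original_body = "Grüße".encode("utf-8")

print(downgrade_transfer_only(original_headers, original_body))
print(downgrade_by_charset_conversion(
    original_headers, original_body, "utf-8", "utf-7"))
```

In the first function the Content-Type line passes through untouched;
in the second the gateway has to know both charsets, which is the
layering cost at issue.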

I draw some conclusions from this:
  -- The ratio of speculation to consensus and testing has gotten a
little high, and it is probably time to go back to the drawing board.
  -- With the exception of the (IMHO very useful) introduction of the
"MU" structure to deal with 7bit transports, we are no closer to
convergence and consensus, or new ideas that might help, on this than we
were six weeks ago.
  -- With the possible exception of the order of encoding issue
identified in Erik's other note, there are no MIME problems here that
need solving with the current revision.

And that leads to a suggestion/request that is certainly not new either:
  There is clearly interest within IETF in an effort to define how to
handle 10646 over email in a relatively standardized way.  There seems to
be no shortage of proposals, all of which can be cast as
MIME-extensions.  Could the people who feel passionately about this
please ask for creation of a WG--one that can be monitored for whether
or not it is converging on anything--and move that discussion to it?
  Greg, can you rule this discussion as having reached a sufficient dead
end (and demonstrated that there are no critical MIME issues) that it is
henceforth out of order here?

     ---john