Re: Character set Detail Considered Harmful

Nathaniel writes...

Yeah, well, maybe.  But I think that "X-ISO-8859-1" is kind of offensive
-- there's hardly anything experimental about it, which is what "X-"
usually implies.  How about the following compromise proposal:
...


While I am sympathetic to Dave's concerns -- character set issues are 
*very* complex and won't be completely and elegantly "solved" any time 
soon, since there are not only multiple Standards but multiple 
*models* -- it seems to me that we are in danger of "throwing out the 
baby with the bathwater" and taking a serious step backwards.

ISO-8859-n, especially for "n" equal to 1, isn't experimental. 
These sets are widely used and, indeed, are implemented in the hardware 
of several vendors.  For 8859-1 in particular, we even know what to do 
with it at gateways into networks whose native character set is not 
ASCII-based.  Call it ISO-8859-1 (or some agreed-upon lexical variant on 
that form) and I can implement gateway code to do rational things with 
it.  Call it X-ISO-8859-X, or X-foobar, and, if I am cautious, I can't 
do a thing with it unless I keep tables of known senders whose 
definition of that X-token is the same as mine.  

For a gateway between major networks that has to perform character set 
translations, X-tokens are pretty close to useless: one can either
reject the mail as untranslatable or can send it through with a "this 
may be trash" warning of some sort.
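
As a rough illustration of the gateway decision described above, here is a 
minimal sketch.  Everything in it is an assumption made for the example: 
the function name gateway_translate, the table of known names, and the 
choice of EBCDIC code page 037 as the "native character set that is not 
ASCII-based".  It is not taken from any actual gateway.

```python
# Hypothetical gateway logic; all names here are illustrative assumptions.

# Charset names with an agreed definition, mapped to Python codec names.
KNOWN_CHARSETS = {
    "US-ASCII": "ascii",
    "ISO-8859-1": "latin-1",
}


def gateway_translate(body: bytes, charset: str, target: str = "cp037"):
    """Return (action, payload) for a body labeled with `charset`.

    `target` stands in for the destination network's native set
    (EBCDIC code page 037 is assumed purely for illustration).
    """
    codec = KNOWN_CHARSETS.get(charset.strip().upper())
    if codec is None:
        # X-tokens and other unknown names have no shared definition,
        # so the bytes are passed through untouched with a warning.
        return ("warn", body)
    try:
        # Known set: decode under the agreed meaning, then re-encode.
        return ("translated", body.decode(codec).encode(target))
    except UnicodeError:
        # The label and the bytes disagree, or the target cannot
        # represent the text: treat the message as untranslatable.
        return ("reject", body)
```

With a table like this, a body labeled ISO-8859-1 is translated, while the 
same bytes labeled X-foobar can only be forwarded with a warning or 
refused.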

Now the Japanese use of ISO 2022 is a little different from this, since 
it is not established in an ISO Standard and we have had some difficulty 
obtaining an official definition.  But it is in very common use in part 
of the world, the people who use it know what it means, it is possible 
to build gateways to convert to locally-preferred forms on private 
networks if one knows that it is coming in, and, again, there is absolutely
nothing experimental about it.
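
A conversion of that kind might look like the sketch below.  It assumes 
that the Japanese use of ISO 2022 discussed here corresponds to what 
Python's standard codecs call "iso2022_jp", and it picks EUC-JP as the 
locally-preferred form only as an example.  The helper name to_local_form 
is hypothetical.

```python
# Hypothetical helper; the codec choices are assumptions for illustration.

def to_local_form(wire: bytes, local_codec: str = "euc_jp") -> bytes:
    """Convert 7-bit ISO 2022 Japanese text to a local 8-bit form."""
    # Decode under the known meaning of the escape sequences...
    text = wire.decode("iso2022_jp")
    # ...then re-encode in whatever the private network prefers.
    return text.encode(local_codec)


# Three kanji, written as escapes so this file stays pure ASCII.
sample = "\u65e5\u672c\u8a9e"
wire = sample.encode("iso2022_jp")   # what arrives over the mail transport
local = to_local_form(wire)          # what the private network stores
```

The point is only that the step is mechanical once the incoming label is 
trustworthy; with an X-token in its place the gateway could not safely 
choose the decoder.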

10646 is different, significantly different, for two reasons.  First, it 
is not a Standard but a proposal that has spent its long life (and 
through several versions) mired in controversy.  It is impossible to 
know what will ultimately appear, and when.  And, second, it raises 
issues of how to handle and use 16- and/or 32-bit characters, and there is 
little or no production-level experience with doing that in 
heterogeneous data communication environments.   So there are strong 
arguments for saying "let 10646 stabilize and get itself approved in 
some form, then write rules for using it that are consistent with the 
final, Standard, definition".  Experimenting with X-10646 would 
certainly be consistent with that approach.

While I personally like it a lot, and don't see the internal 
contradictions in it that I see in the current 10646 draft, RFC-CHAR is 
in somewhat similar status.  Still evolving a bit, not nearly the kind 
of production-use experience that exists with, e.g., ISO 8859-1 or the 
Japanese use of 2022.   So I can see arguments for saying "let's defer 
locking ourselves into that for the moment", even though I hope we can 
avoid that decision.

For whatever it is worth, please, folks, remember where this WG started 
a year ago.  The major mandate--reinterpreted in the spring to separate 
out dealing with the transport issues--was to make international 
character sets, especially western European character sets, "work" in a 
well-defined, canonical way.  Let's not let go of that: it is 
at least as important to some major communities as sound and pictures. 
And, if I'm left in a situation in which I can send international 
characters in a canonical way by converting pages to images and then 
sending them as image types, but I can't send them canonically as 
characters, we may find ourselves wiping out one of the major advantages 
of email over fax machines, the manipulability of the transmitted text.

RFC-XXXX defines a place to put the "charset" value, as it does now.  It
defines "US-ASCII" as the string to use for expressing, well, US-ASCII. 
It says other values may be used among consenting mail systems, and
SUGGESTS that the names given to the character sets should be taken from
RFC-CHAR.  End of story.  That is functionally equivalent to the current
draft, I believe.  Would it be satisfactory to all parties?

   No, I don't think so.

      --john