Re: The future of multilingual character sets

Ned Freed writes:

This is of course an interesting question. But it is not a question
for our working group to answer or get involved with. I take it for
granted that being aligned with international standards efforts (10646
among others) is a Good Thing.


Well, ISO 2022 is also an International Standard. Actually, 2022 *is*
an IS, while 10646 isn't. It is not at all clear that 10646 will be
useful and used on the Internet. Since this is not clear yet, the
document being developed by this WG should not take any position on
this issue. However, the current version of MIME does seem to take a
position, in some sense:

                            It is our hope that ISO-10646  or  some
                 other effort will eventually define a single world
                 character set which can then be specified for  use
                 in  Internet  mail,  but  in  the  advance of that
                 definition we cannot specify the use of ISO-10646,
                 Unicode,   or   any   other  character  set  whose
                 definition is, as of this writing, incomplete.

Actually, the above is barely acceptable, since it does say "10646 or
some other effort". I would prefer it if "10646" and "Unicode" were
not mentioned at all. And the following paragraph is, in my view,
unacceptable:

            The use  of  the  string  "ISO-10646"  as  a  character  set
            specification  is  hereby  reserved for future use, once the
            ongoing efforts to define a standard universal character set
            are completed.

It is not MIME's business to predict which names will be registered
with IANA, particularly names of formats that aren't being used yet
(unlike ASCII).

This position arises from several
different rationales, but the one that concerns me most at this time
is that I perceive the Internet as trying to align with international
standards in the areas where they exist and don't cause formal
conflicts.


JUNET deliberately chose to use an extensible subset of an
International Standard (i.e. ISO 2022), leaving their door wide open
for future international cooperation. But what do you do? You just
walk right past them, without even glancing at their open arms,
straight to some other goal. Is this NIH or what?

You say you don't want to "cause formal conflicts". Choosing 10646 is
a giant conflict, in my mind.

But wait a sec. Perhaps I should cool down a bit. Maybe we should let
10646 try to get up on its feet, and see how far it can walk (or
run!). The Internet should be open to experimentation...

Since the Internet currently has no standards in place for
the use of character sets on the network (existing practice does not
constitute a standard) there are ipso facto no formal standards to
conflict with.


You what?

Moreover, it has been intimated to me that documents that
directly conflict with international standards work are extremely
unlikely to be approved as standards.


While I would generally agree that "direct conflict with international
standards" is a bad idea, I think people value ISs a bit too much. We
should not forget that ISs are written by ordinary people, and that
ordinary people make mistakes. X windows is not based entirely on ISs,
and yet it may become one. ISO 646 is an IS, and yet we hear people
pushing to obsolete it.

10646 is sort of a special case, since it is only a draft at present. 
(I also don't know its current status -- anyone care to comment on
this?)


A 4-month review period started at the end of January 1992. After
that, voting will take place, and if the results are favorable, we may
have an IS in another few months. People say that the C and POSIX WGs
are likely to arrange to send warnings about NUL octets.  On the other
hand, some experts say that 10646 will probably pass this time.

It has a 2-octet form that strongly resembles Unicode (Unicode says
they will align Unicode 1.1 with 10646's 2-octet form), and there is a
4-octet form. Also, there are two levels: level one is normal, and
level two allows "combining marks" (floating diacriticals and the like).

From the email point of view, one should note that the 2- and 4-octet
forms use the C0 and C1 spaces. This means that there may be stray CR
and LF octets. However, these people are brilliant -- they thought up
a way around this, by algorithmically converting 10646 to a non-fixed
width encoding called UTF. UTF still uses the 8th bit, though, which
may also be of interest to this WG.
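
To make the stray CR/LF point concrete, here is a minimal sketch in
Python. It rests on assumptions: Python's "utf-16-be" codec stands in
for the 2-octet form, and "utf-8" stands in for a variable-width
transformation. Neither is the draft's own encoding (the UTF mentioned
above is a different scheme), and the helper name control_octets is
invented for illustration.

```python
def control_octets(data):
    """Return the offsets of octets equal to LF (0x0A) or CR (0x0D)."""
    return [i for i, b in enumerate(data) if b in (0x0A, 0x0D)]


# U+010A, LATIN CAPITAL LETTER C WITH DOT ABOVE: an ordinary letter
# whose code value happens to have 0x0A as its low octet.
ch = "\u010a"

# Assumption: UTF-16-BE as a stand-in for a fixed 2-octet form.
two_octet = ch.encode("utf-16-be")   # octets 0x01 0x0A

# Assumption: UTF-8 as a stand-in for a variable-width transformation.
variable = ch.encode("utf-8")        # octets 0xC4 0x8A

# The fixed-width form contains an octet that a mail transport would
# read as LF; the variable-width form does not.
stray_fixed = control_octets(two_octet)     # [1]
stray_variable = control_octets(variable)   # []

# The variable-width form still sets the 8th bit on its octets.
uses_8th_bit = any(b & 0x80 for b in variable)  # True
```

So the line-ending problem goes away in the variable-width form, but
the data is still not safe for a 7-bit transport.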

10646
imposes a large up-front cost, but once that's been borne the amount
of effort involved in national customization drops considerably.


I'm skeptical. (Customers won't throw out their existing data so
readily.)

... but who is already hard at work developing
support for 10646 even though it is not final yet!


I'd hope they have their bases covered.

I really, really wish that there was an IETF working group that
addressed character set issues directly.


I'll second that!

Since Keld's code does not address mnemonics for Vietnamese (as far as
I know) why is there a conflict?


Actually, Keld *does* try to accommodate the Vietnamese.

There is one piece that's missing from Keld's work, and that is
information about who (or what) uses what. This properly should be a
different document anyway, since it could never be anything other than
informational, but it would be very useful information to have, don't
you think?


Yes, I think it would be very useful. Some of the stuff in my first
draft is similar to this, and I have collected more info since then. I
certainly would be interested in cooperating with others to produce
one or more RFCs of this nature.

I'm not so sure about the "informational" part, however. I have no
trouble imagining a non-informational spec for ISO-2022-jp, for
example.


Regards,
Erik