ietf-822

The future of multilingual character sets (was RE: quoted-printable)

1992-02-16 16:28:51
But there are other
character sets on the horizon (I already have to cope with two for
Japanese, and this before the arrival of 10646), and the problem of
how to convert from one to the other is not that far away. RFC-CHAR is
attempting to address the need to support existing practice while
allowing for conversion to/from future practice.
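The conversion problem described above can be made concrete with a modern sketch (anachronistic to this 1992 discussion, and purely illustrative): today's libraries convert between two Japanese encodings by passing through a single unified character set, which is exactly the role 10646 was being proposed to play.

```python
# Hypothetical modern illustration: converting between two Japanese
# encodings via a unified internal character set (Unicode), as Python's
# codecs do today. In 1992 no such common pivot was standardized.
text = "日本語"                     # internal unified representation

jis = text.encode("iso2022_jp")    # a 2022-style wire encoding
euc = text.encode("euc_jp")        # another common Japanese encoding

# Converting JIS -> EUC means decoding into the unified set,
# then re-encoding into the target set:
converted = jis.decode("iso2022_jp").encode("euc_jp")
assert converted == euc
```

The point of the sketch is only that a unified set turns an N-squared set of pairwise converters into N encoders/decoders against one pivot.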

I'm still not convinced that conversions based on 10646 will be that
useful. For example, it is not clear whether East Asian users will
accept the "Han unification" done in the current draft of 10646. So if
one converts from a 2022-like encoding to 10646, and then tries to
render that, the user may not see what he/she wants to see. I'm not
stating this as a fact. This is just a concern of mine, based on
numerous comments from Japanese users saying that Unicode and 10646 are
destroying culture, etc.

This is of course an interesting question. But it is not a question for our
working group to answer or get involved with. I take it for granted that being
aligned with international standards efforts (10646 among others) is a Good
Thing. This position arises from several different rationales, but the one that
concerns me most at this time is that I perceive the Internet as trying to
align with international standards in the areas where they exist and don't
cause formal conflicts. Since the Internet currently has no standards in place
for the use of character sets on the network (existing practice does not
constitute a standard) there are ipso facto no formal standards to conflict
with.

I don't know what the actual policy is here, but when I go to IETF meetings and
see a significant number of working groups pushing towards the adoption of
various international standards into the Internet at large, I assume that
something is at work making this happen. Moreover, it has been intimated to me
that documents that directly conflict with international standards work are
extremely unlikely to be approved as standards.

10646 is sort of a special case, since it is only a draft at present. (I also
don't know its current status -- anyone care to comment on this?) But I also
see an enormous pressure building behind the need for a unified character set
standard. The ASN.1 folks are literally drooling over the potential implicit in
having 10646 strings (or whatever they are called) as a standard primitive. As
are the X.400 folks, the X.500 folks, and probably every other working group as
well. I think that once 10646 reaches closure you will see a big push to update
standards and subsequently implementations to the point where they can support
it.

I also see a huge wave of support building in the vendor community. Vendors are
tired of building special versions of everything so that their products can
sell into all sorts of different markets. 10646 imposes a large up-front cost,
but once that's been borne the amount of effort involved in national
customization drops considerably. I've talked with more than one senior
developer whose usual attitude about standards-to-be is "maybe we'll support it
when the cows come home, and maybe not even then", but who is already hard at
work developing support for 10646 even though it is not final yet!

My position, then, is that while we are in no position to criticize the 10646
work that's going on now, we are also in no position to develop standards that
contradict it in any major way. (Side note: I really, really wish that there
was an IETF working group that addressed character set issues directly. I don't
like the fact that we have to make decisions like this. I was very relieved
when the security group decided on an integrity check algorithm to use; it made
the selection for us so much easier.)

Mind you, I'm not particularly pleased myself with the direction 10646 has
gone. I am especially unhappy about an attitude I see developing, to wit, that
the only part of 10646 worth supporting is the Unicode subset. But I am not the
one to complain, and if there are people with legitimate concerns that are not
being addressed they should get in there and fight some more for what they
want.

But having two mnemonic formats is an entirely different kettle of
fish. We don't need two, we need one that has the input of the entire
community going into its design.

I feel the same way, but, unfortunately, we already have more than one
set of mnemonics. There are at least 3 sets known to "the character
encoding experts". (By Keld Simonsen, Alain LaBonté, and Johan van
Wingen.)

My position about 10646 translates to a basic requirement for mnemonic -- it
must be, if not aligned with 10646, at least alignable with 10646.
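"Alignable with 10646" can be sketched concretely: each mnemonic must resolve to exactly one code position in the universal set. The entries below follow the style of Keld Simonsen's mnemonic tables (later published as RFC 1345); this tiny selection is my own illustration, not an excerpt from any proposal.

```python
# A sketch of mnemonic-to-10646 alignment: every mnemonic maps to a
# single code position in the universal set. Entries are illustrative,
# in the style of Keld Simonsen's tables (later RFC 1345).
MNEMONICS = {
    "e'": "\u00E9",  # é  LATIN SMALL LETTER E WITH ACUTE
    "a:": "\u00E4",  # ä  LATIN SMALL LETTER A WITH DIAERESIS
    "o/": "\u00F8",  # ø  LATIN SMALL LETTER O WITH STROKE
}

def to_10646(mnemonic: str) -> str:
    """Resolve a mnemonic to its single 10646 code position."""
    return MNEMONICS[mnemonic]

print(to_10646("e'"))  # é
```

A scheme that assigned mnemonics with no corresponding 10646 position would fail this alignment test, which is the requirement stated above.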

I don't know what these other mnemonic sets are -- I have never seen them, so I
don't have an opinion about them. This is the first I've heard of them. If the
authors of these mnemonics would like to put forward proposals we would look at
them, of course. But no such proposals have come forward, and I cannot evaluate
what I have not seen.

And the Vietnamese-using community has yet another method, which uses
up to three characters to represent letters with two accents. I have
been trying to convince them that it would be a good idea to unify
these approaches, but I haven't made much progress.  They've been
using their method for a couple of years already.
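The Vietnamese method described sounds like VIQR or a close relative: a base letter followed by up to two accent marks, so a letter carrying two accents takes three characters. The decoding sketch below is my own reconstruction under that assumption, not the community's actual tables.

```python
# A sketch of VIQR-style decoding (assumed to be the Vietnamese method
# described): base letter plus up to two trailing accent characters.
import unicodedata

ACCENTS = {
    "^": "\u0302",  # combining circumflex
    "(": "\u0306",  # combining breve
    "'": "\u0301",  # combining acute
    "`": "\u0300",  # combining grave
    ".": "\u0323",  # combining dot below
}

def decode_viqr(seq: str) -> str:
    """Turn e.g. 'a^.' (three characters, two accents) into one letter."""
    base, marks = seq[0], seq[1:]
    combined = base + "".join(ACCENTS[m] for m in marks)
    # NFC reorders the combining marks canonically and composes them
    # into a single precomposed character where one exists.
    return unicodedata.normalize("NFC", combined)

print(decode_viqr("a^."))  # ậ
```

The three input characters collapse to a single letter, which is why unifying this with a one-mnemonic-per-character scheme is not entirely straightforward.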

It sounds like this would be a reasonable approach for Vietnamese. Since
Keld's code does not address mnemonics for Vietnamese (as far as I know) why
is there a conflict?

I'm beginning to think that perhaps an all-encompassing truly
multilingual encoding won't be used that much. Perhaps it would be a
good idea to simply document and name the formats currently used by
the various communities of the world. E.g. one for iso-2022-jp, one
for the Vietnamese method, one for Latin-1, and so on. And then wait
to see which multilingual encodings catch on.

Isn't this exactly what Keld is trying to do? Admittedly he is aligning
things with 10646, but there are a bunch of characters that have no
10646 equivalents already and there will probably be even more in the
future.

There is one piece that's missing from Keld's work, and that is information
about who (or what) uses what. This properly should be a different document
anyway, since it could never be anything other than informational, but it would
be very useful information to have, don't you think? I know that Ran Atkinson
has collected some sketchy data along these lines, but it would be, if
anything, just a start. Anyone else care to start collecting more of this sort
of data?

                                        Ned
