ietf-822

Re: 10646 etc.

1993-02-10 14:20:13
> What are the defining documents for these two different character sets?
> My understanding is that 10646 claims to define only one.

Henry,
   While the S/N ratio has gotten lousy enough to confuse everyone,
there are two or three real discussions going on here.  The real
discussion in this case is that "10646" [appears to*] "claim" that Hindi
and Sanskrit use the same character set (plus maybe some characters at
the edges that appear in one or the other but not both).  The same
"claim" is made for Japanese and Chinese.
   There are probably purposes for which this claim is appropriate. 
They include more or less crude renderings where the goal is to have
something that is "close enough" that a human reader with some tolerance
and flexibility can figure out what is intended, even if an
eight-year-old who wrote that way in a school local to one of the
countries involved would be sternly spoken to for committing the errors
involved.
   There are probably purposes for which this claim is false.  They
might include discussions of one of the languages in another that
"shared" a "unified" character set and a variety of situations in which
use of renderings appropriate to one language for the other was
aesthetically offensive enough to be considered just plain wrong.

I'm trying to explain the question here without getting drawn into the
flame wars.  That question is ultimately whether, for Internet mail
purposes (on this mailing list, I suggest it is inappropriate to look at
other issues or contexts) "unification" of these character sets is
appropriate and in what contexts.

One hypothesis, implicit in Unicode and in the commonly-held assumption
that no one will ever get around to defining 10646 beyond the (16 bit)
BMP, is "yes, always".   The alternate hypothesis is that, for at least
some purposes, "unification" assigns fundamentally different things to
the same code points.  To the extent to which the alternate hypothesis
is true, one either needs a universal character set that does not unify
(a goal of 10646 DIS-1) or one must identify both the character coding
in use (e.g., "10646") *and* the language in use in the content type.
And one must not try to use two "unified" languages that share the same
code points in any text/plain subtype.
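
A minimal sketch of the second option (labeling both the character
coding and the language), using Python's standard email package.  The
Content-Language header, the UTF-8 charset, the helper name
labeled_message, and the sample ideograph are all assumptions made for
illustration; none of them is proposed in this message.

```python
from email.message import EmailMessage


def labeled_message(text: str, language: str) -> EmailMessage:
    """Build a text/plain message that labels both charset and language.

    Hypothetical helper: the header used to carry the language tag is an
    assumption for this sketch, not something the message above specifies.
    """
    msg = EmailMessage()
    # The charset parameter names the character coding in use.
    msg.set_content(text, charset="utf-8")
    # A separate header names the language, since the code points alone
    # do not say which language's rendering is intended.
    msg["Content-Language"] = language
    return msg


# One "unified" ideograph: a single code point shared by Chinese and Japanese.
han = "\u76f4"

ja = labeled_message(han, "ja")
zh = labeled_message(han, "zh")

print(ja.get_content_type(), ja.get_content_charset(), ja["Content-Language"])
print(zh.get_content_type(), zh.get_content_charset(), zh["Content-Language"])
# The encoded text is identical; only the language label differs.
print(ja.get_content() == zh.get_content())
```

Note that a single label of this kind covers the whole body part, so it
does nothing for text that mixes two "unified" languages, which is the
restriction stated above.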

That is a statement of the question/issue.  I don't know how to resolve
it--we are going around in circles here.  And we can't draw much comfort
from 10646: the ISO working group essentially worked out a political
compromise and then threw up its hands about how to apply its result
to real-world situations.  The introductory sections of 10646 DIS-2 are
just full of disclaimers about applicability assertions.

There is a separate claim, which I don't think we need to put energy
into but that may help to explain some perspectives, that "unification"
was ruthlessly applied to some Asian languages while avoiding the
opportunity to apply it to European ones.  To take an extreme example,
many of the Roman upper-case letters are derived from the glyphs of
North Semitic, as are many Greek and Cyrillic characters.  There has
been no effort to "unify" the Roman, Greek, and Cyrillic characters back
onto their common North Semitic ancestor, even though one could probably
read such things with a little practice.

   --john

* All assertions made above about the content of 10646 are inferences
based on the text of 10646.1 DIS-2, combined with assorted accounts
of the comment resolution meetings of JTC1/SC2 and its working groups. 
I haven't seen IS 10646, and I don't think anyone else has either. 
