Re: 10646, and all that

In <9303130959(_dot_)AA19206(_at_)necom830(_dot_)cc(_dot_)titech(_dot_)ac(_dot_)jp>,
Masataka Ohta wrote:

The DIS does not say that the corresponding CJK characters are
the same single character. Instead, it says that the same code point
is assigned to the different "graphic symbols".


Can you cite some text from the DIS on which you base this claim?


In section 25 of the DIS, it is written that:

      Any entry in any of the G, T, J, or K columns includes
      a sample graphic symbol from the source character set
      standard, together with its coded representation in
      that standard.


In other words, the differences are still graphic.  I do not see any
suggestion here that distinct characters are sharing a single code
point.  (I'm sorry to be picking nits over terms like "character"
and "glyph.")

...I asked on
the Unicode and ISO10646 mailing lists, and was assured that
ISO-10646 retains Unicode's notion that one code point is exactly
equivalent to one (possibly unified) character, and that
nationalized ideographs are treated as glyph variants.


Strange. Why can't you cite some text, then?


I'm sorry; I had no text to cite.  I have only a copy of Unicode,
from which I have cited before and which is irrelevant to exactly
what ISO-10646 says.  My "assurance" was in the form of mail
(from Glenn Adams) on a different list, which it would have been
improper to re-post without permission.  I now have Glenn's
permission, and will send a copy of his mail to anyone who is
interested, although it probably won't say much to anyone whose
mind is already made up on the issue.

The issue is not ISO-10646-specific.  Examples have been
presented for which language information would be useful
regardless of the character set.


So far, the only proposed use of separate language information,
other than displaying ISO 10646, is for spell checking...


I don't remember the spell check example; it's obviously silly.
Multipart/alternative messages could have different parts in
different languages.  A message composed using any character set
which unifies the diaeresis and umlaut marks (e.g. ISO-8859-1)
can be transliterated more effectively, on a terminal without
either mark, if it is known whether the message is in German or not.
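
To make the transliteration point concrete, here is a rough sketch in
Python.  It is only an illustration: the function name to_ascii, the
table GERMAN_UMLAUTS, and the use of "de" as a language tag are all
made up for this example and are not part of MIME or any charset
standard.  It assumes that a German umlaut is conventionally written
as the vowel followed by "e", while a diaeresis in other languages is
simply dropped.

```python
import unicodedata

# Assumed convention for this sketch: German umlauts become vowel + "e".
GERMAN_UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def to_ascii(text, language=None):
    """Transliterate marked vowels for a terminal that cannot show them.

    `language` is a hypothetical body-scope language tag; only "de"
    changes the behaviour here.
    """
    out = []
    for ch in text:
        if language == "de" and ch in GERMAN_UMLAUTS:
            # Known to be German: the mark is an umlaut.
            out.append(GERMAN_UMLAUTS[ch])
            continue
        # Otherwise treat the mark as a diaeresis (or other accent):
        # decompose the character and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

print(to_ascii("Müller", "de"))   # German: umlaut
print(to_ascii("naïve", "fr"))    # French: diaeresis
print(to_ascii("Müller"))         # language unknown: best guess
```

The same ü comes out as "ue" or as "u" depending only on the language
tag; without the tag, the transliterator has to guess.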

The example of transliterating German appropriately may not seem
interesting, but I can assure you that it is interesting to me,
just as I know that correctly rendering Japanese and Chinese Han
is interesting and important to you.

A body-scope language tag may introduce
some potential for confusion, but it replaces the more confusing
and less workable notion of trying to encode language matrices in
the character set name.


How can you say it more confusing and less workable?


Actually, I can't, because with my speech impediment it comes out
sounding like "ore onfusing and ess orkable." :-)

But I can seriously argue that language-overloaded charset names
of the form iso-10646-x-y-z are both:

        confusing, because it is not obvious which one to choose
        for a message which uses none of the languages
        discriminated by x, y, and z (you have recently suggested
        "iso-10646-eurocentric" to cover this class of messages); and

        unworkable, because the number of permutations and
        combinations of x, y, and z is too large.  A simple,
        workable solution will have a small number of charsets.
        (This is one reason why the ISO-646 variants were left
        out of MIME and are generally falling out of favor.)
        Too many people would decide that a "standard" containing
        12 or 24 (or whatever the number is) language-based
        charset variants was a mess, and ignore it. 
        (Perhaps this is your intention.)
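
The growth in the number of names can be sketched as follows.  This is
only a back-of-the-envelope illustration under assumptions of my own:
that x, y, and z are drawn without repetition from four tags (here the
G, T, J, and K columns mentioned in the DIS text quoted above), and
that their order is significant.  Neither assumption comes from any
actual proposal, and the helper charset_names is hypothetical.

```python
from itertools import permutations

# Assumed tag set for this sketch: the four source columns G, T, J, K.
TAGS = ["G", "T", "J", "K"]

def charset_names(tags, max_len):
    """List every name of the form iso-10646-x[-y[-z]] (order matters)."""
    names = []
    for n in range(1, max_len + 1):
        for combo in permutations(tags, n):
            names.append("iso-10646-" + "-".join(t.lower() for t in combo))
    return names

names = charset_names(TAGS, 3)
three_tag_names = [n for n in names if n.count("-") == 4]
print(len(three_tag_names), "names with exactly three tags")
print(len(names), "names with one to three tags")
```

Under these assumptions there are already 24 three-tag names and 40
names in all, and every additional tag multiplies the count further.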

We already have a definition of "charset" which is completely workable
and not confusing.


I beg to differ.  The "current" definition is not workable,
because I cannot read it and figure out what it implies or what
its underlying rationale is, nor can I use it to accurately answer
questions about proposed charsets, nor has anyone been able to
explain it to me satisfactorily.

I have shown a concrete, non-confusing and workable example of such a
profiling.
On the other hand,
I have never seen any workable example of the use of separate
language information.
I have never seen a workable definition of separate language information.
I have never seen a workable purpose for separate language information.


We have seen plenty of examples of the use of both schemes, some
workable, some otherwise.  We have not seen a formal proposal for
either scheme.  We have not seen a definitive argument concluding
that one scheme or the other is more or less workable, or more or
less confusing.  We have merely seen a number of opinions advanced.

                                        Steve Summit
                                        scs(_at_)adam(_dot_)mit(_dot_)edu