
Re: restrictions when defining charsets

1993-02-01 18:59:14
In <728305818.16894.KLENSIN@INFOODS.UNU.EDU>, John Klensin writes:
> ...is it not the case that
>   (i) By making the decision that ideographic characters that "look the
> same" (i.e., have the same glyphs) are coded the same way, IS 10646
> becomes a "glyph standard" not a "character standard" for the subset of
> languages involved?
>   (ii) By maintaining a distinction between code positions for
> characters with the same appearance in most alphabetic languages, IS
> 10646 really is a "character [set] standard" for those languages...

We could probably argue forever over shades of meaning of "glyph"
and "character" (and I gather that the ISO working groups on
character sets do so), but I would hope that for the purposes of
IETF we could duck a few of those issues and accept the fruits of
the Unicode and ISO 10646 "unification" efforts.

I'm no expert on international character sets, and all I know
about Unicode is what I read in the book [1].  On the other hand,
that gives me an external and (I hope) less biased perspective,
which I here offer, for what it's worth.

Unicode represents an excellent unification effort.  Nevertheless,
it is tempered by pragmatism, which in some cases led to decisions
that might seem at odds with the theoretical ideal of complete
unification.  For example, U+0041
LATIN CAPITAL LETTER A, U+0391 GREEK CAPITAL LETTER ALPHA, and
U+0410 CYRILLIC CAPITAL LETTER A retain distinct code points even
though their glyphs are largely identical.  I will not attempt to
justify such decisions: they weren't mine, and although I have a
few suspicions as to their rationale, dragging them out here now
would serve no other purpose than inevitably to reopen some
tedious discussion.

I gather that the complaints still hovering around Unicode,
several of which have recently cropped up on this list and in the
Usenet newsgroup comp.std.internat, come from individuals who feel
that their particular language, country, and/or culture has been
slighted by one of the asymmetrical unification decisions.  That
is, a text containing the character U+0B85 TAMIL LETTER A is
almost certainly written in the Tamil language, and a text
containing U+0391 GREEK CAPITAL LETTER ALPHA is probably written
in Greek (unless it's a technical usage...).  But a text
containing one of the unified Chinese/Japanese/Korean ideographs
might (obviously) be written in any of Chinese, Japanese, or
Korean; a text containing U+00C4 LATIN CAPITAL LETTER A DIAERESIS
might be written in one of several European languages; and a text
containing U+0041 LATIN CAPITAL LETTER A could be written in
almost anything.
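
The asymmetry can be made concrete with a small sketch (my own
illustration, not part of any standard; the block boundaries are
rough approximations from the code charts) which guesses at a
script, and no more than a script, from a single code point:

        #include <stdio.h>

        /* Guess the script of a Unicode code point from its block.
           The ranges are illustrative, not a complete table. */
        static const char *script_of(unsigned long c)
        {
            if (c >= 0x0041 && c <= 0x024F) return "Latin (language unknown)";
            if (c >= 0x0391 && c <= 0x03C9) return "Greek";
            if (c >= 0x0410 && c <= 0x044F) return "Cyrillic";
            if (c >= 0x0B85 && c <= 0x0BB9) return "Tamil";
            if (c >= 0x4E00 && c <= 0x9FFF) return "CJK (language unknown)";
            return "unknown";
        }

        int main(void)
        {
            printf("%s\n", script_of(0x0B85));  /* Tamil */
            printf("%s\n", script_of(0x0041));  /* could be almost anything */
            return 0;
        }

The most the code point alone can tell you is the script; whether
a run of Latin letters is English, German, or Welsh is simply not
recorded.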

If the complaints have to do with unification having been done at
all (rather than with its having perhaps been done less than
impartially), the Unicode standard itself provides an excellent
defense for the process:

        There is some concern that unifying the Han characters
        can lead to confusion because they are sometimes used
        differently in the three languages [Chinese, Japanese,
        and Korean].  Computationally, Han character unification
        presents no more problems than having a single character
        set for the Roman alphabet that is used to write
        languages as different as English and Vietnamese.
        Programmers do not expect the characters "c" "h" "a" and
        "t" alone to tell us whether "chat" is a French word for
        "cat" or an English word meaning "informal talk."
        Likewise, we depend on context to identify the American
        hood (of a car) with the British bonnet.  Few computer
        users are confused by the fact that ASCII can also be
        used to represent such words as the Welsh word "ynghyd,"
        which are strange looking to English eyes.  Although it
        would be convenient to identify words by language for
        programs such as spell-checkers, it is neither practical
        nor productive to encode a separate Latin character set
        for every language which uses it.
        [1, sec. 3.4, p. 112]

Whatever shape/character/meaning information Unicode characters
do convey, language should not be thought to be part of it, and
the fact that some assumptions about language can be inferred
from some of the characters should be viewed as an accident.  If
any interchange standard desires to transmit language
information, it should not rely on the character set, but should
instead use an explicit field in a header of some kind.
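
On this list, the natural vehicle for that would be an
RFC-822-style header field.  The field name below is purely
hypothetical (I am not proposing a specific syntax, and no such
field has to my knowledge been standardized), but it shows the
shape of the idea, "de" being the ISO 639 code for German:

        Content-Language: de

The language then travels alongside the text rather than being
guessed from the code points, and a receiver can apply, say,
German display conventions with confidence.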

For those who have a use for an "even more unified" Unicode,
perhaps in order to display Unicode characters on equipment which
provides a single glyph assumed to be suitable for Latin and
Cyrillic A as well as Greek Alpha, I am preparing a set of
typographical equivalence tables which I will distribute once
they're finished.
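
As a sketch of roughly what such a table might look like (my own
illustration here, not the tables themselves, folding in the
Latin direction purely for concreteness):

        #include <stdio.h>
        #include <stddef.h>

        /* Fold characters whose glyphs largely coincide onto a single
           code point, for display equipment with only one "A" glyph. */
        struct equiv { unsigned long from, to; };

        static const struct equiv fold_table[] = {
            { 0x0391, 0x0041 },  /* GREEK CAPITAL ALPHA -> LATIN CAPITAL A */
            { 0x0410, 0x0041 },  /* CYRILLIC CAPITAL A  -> LATIN CAPITAL A */
            { 0x0392, 0x0042 },  /* GREEK CAPITAL BETA  -> LATIN CAPITAL B */
            { 0x0412, 0x0042 },  /* CYRILLIC CAPITAL VE -> LATIN CAPITAL B */
        };

        static unsigned long fold(unsigned long c)
        {
            size_t i;
            for (i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
                if (fold_table[i].from == c)
                    return fold_table[i].to;
            return c;  /* no known equivalent; pass through */
        }

        int main(void)
        {
            printf("U+0391 folds to U+%04lX\n", fold(0x0391));
            return 0;
        }

This is, in effect, unification carried one step further, applied
at display time rather than at encoding time.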

In <199301291937.AA27973@dkuug.dk>, Keld Simonsen writes:
> I am not an expert on Han characters, but I believe that the
> unification is done at the character level. This means that Chinese
> (PRC), Taiwanese, Japanese and Korean character sets have been
> tabled and characters having the same origin and almost the same
> appearance have been said to be equivalent, and each of these
> characters relates to a single Unihan character. This relation is
> defined in ISO 10646 (at least it was in the DIS2).
>
> So it is still the meaning that is coded, it is not the shape.
> The shape may be different for different languages, there may be
> a Chinese Unihan font and a Japanese Unihan font which may
> differ significantly in many places.

This is essentially my understanding as well, although I would
not go so far as to say that it is absolute meaning which is
encoded, "meaning" being a term that is as impossible to define
as it is (in the present context) emotionally laden.  Suffice it to say
that the Han unification was not done capriciously; the
ideographs which have been unified possess demonstrable and
documented aspects in common (involving derivation, shape, and
usage) which warrant that unification.

It is also worth noting that

        The validity of [the ideographic unification] effort was
        verified in 1991 by an independent team of East Asian
        experts at the University of Toronto.  (See the Unicode
        CJK Unification Verification Project Final Report, Kazuko
        Nakajima, Project Leader, Associate Professor, Department
        of East Asian Studies, University of Toronto.)
        [1, sec. 3.4, p. 115]

In <9301300723.AA21903@necom830.cc.titech.ac.jp>, Masataka Ohta writes:
> So, I want the character code to be informative enough that I can
> produce state-of-the-art quality shapes of Japanese characters and
> Chinese characters without requiring external profiling information.

There is no question but that Unicode does not attempt to support
this goal.  My own feeling is that it is not the purpose of a
character set to do so, and that where language-specific
processing is desired, an explicit indication (if that means
"external profiling information," so be it) of language is both
necessary and appropriate.

> CAUTION: Don't be confused by the fact that Unicode gives a unique
> mapping of a byte stream to glyphs of almost all *EUROPEAN* languages
> without requiring external profiling information.

This is a curious usage of "almost all."  If we want a code point
out of a character set to convey language information, about the
only European language for which Unicode does so is Greek.  In
particular, all of the languages which use variants of the Latin
alphabet have been unified (via ASCII and the ISO 8859 variants)
for far longer than Unicode has been around, and distinctions
between languages which are coded in these scripts are completely
demolished.

If the concern is merely that the display fonts being used by
Chinese and Japanese speakers tend to differ more significantly
than those used for, say, English and German, this seems like a
comparatively minor issue.

There may be some nuance of Masataka Ohta's complaint which I am
missing, but I know I am not completely insensitive to the issue.
Unicode, like ISO 8859-1 before it, assigns a single code point,
U+00F6 LATIN SMALL LETTER O DIAERESIS, to both the o-diaeresis
used in English and the o-umlaut used in German; one cannot tell
from the character set alone which is intended.  This is not an
academic concern; I am working on software to display
multinational characters on possibly-restricted equipment (Markus
Kuhn, and doubtless many others, are working on similar
projects), and the appropriate transliterations end up depending
on language.  On equipment with a limited character repertoire,
lacking diacritics, the English word coo"perate should be
rendered as cooperate, but the German word scho"n should be
rendered as schoen.  For this reason, I would like to see some
means of explicitly specifying language, which would help to
address Masataka's concern as well, but that's a proposal for
another day.
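
In the meantime, by way of illustration only (the language tags
and the function are hypothetical, and the real tables are
considerably larger), the kind of language-dependent fallback
described above might be sketched like this:

        #include <stdio.h>
        #include <string.h>

        /* Fallback text for U+00F6 LATIN SMALL LETTER O DIAERESIS
           on equipment lacking diacritics.  The choice depends on
           the language, which must be supplied from outside; it
           cannot be recovered from the code point itself. */
        static const char *fold_o_diaeresis(const char *lang)
        {
            if (strcmp(lang, "de") == 0)
                return "oe";   /* German umlaut expands to "oe" */
            return "o";        /* English diaeresis is simply dropped */
        }

        int main(void)
        {
            printf("co%sperate\n", fold_o_diaeresis("en"));  /* cooperate */
            printf("sch%sn\n",     fold_o_diaeresis("de"));  /* schoen */
            return 0;
        }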

                                                Steve Summit
                                                scs@adam.mit.edu

[1] The Unicode Consortium, The Unicode Standard -- Worldwide
Character Encoding -- Version 1.0, Volume 1, Addison-Wesley,
1990, 1991, ISBN 0-201-56788-1.