perl-unicode

Re: Is it true that there are no longer any official mappings from JIS X to Unicode?

2002-03-28 01:50:12
Dan Kogai passed on this question:

On Wednesday, March 27, 2002, at 08:06, Anton Tagunov wrote:
Hello, Dan!

BTW, is the guy speaking

http://www.debian.or.jp/~kubota/unicode-symbols.html.en

right or not? His article is dated September.
He is speaking about lack of official tables, that the tables
have been withdrawn and made obsolete but without any replacement.
Is that true?


and answered it:

Correct.

http://www.unicode.org/Public/MAPPINGS/EASTASIA/Readme.txt
The entire former contents of this directory are obsolete and have been
moved to the OBSOLETE directory.  The latest information may be found
in the Unihan.txt file in the latest Unicode Character Database.
August 1, 2001.

But the Unihan database only contains JIS X 0208 and 0212, not JIS X 
0201 because 0201 only maps ASCII and Halfwidth Katakana.
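
For anyone pulling the JIS data out of Unihan.txt, here is a minimal
Python sketch. It assumes the tab-separated layout
U+XXXX<TAB>field<TAB>value and that the kJis0 field holds the JIS X 0208
position as a four-digit decimal ku-ten value. The two sample lines are
illustrative, and the function names are made up for this example.

```python
# Illustrative lines in the assumed Unihan.txt layout (not a full extract).
SAMPLE_UNIHAN = """\
U+4E00\tkJis0\t1676
U+4E9C\tkJis0\t1601
"""


def parse_unihan_kjis0(text):
    """Return {code point: (row, cell)} for every kJis0 line in the text."""
    mapping = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        codepoint, field, value = line.split("\t")
        if field != "kJis0":
            continue
        # The ku-ten value is read as two decimal digits of row, two of cell.
        row, cell = int(value[:2]), int(value[2:])
        mapping[int(codepoint[2:], 16)] = (row, cell)
    return mapping


def kuten_to_euc_jp(row, cell):
    """Encode a JIS X 0208 ku-ten position as two EUC-JP bytes."""
    return bytes([row + 0xA0, cell + 0xA0])


kjis0 = parse_unihan_kjis0(SAMPLE_UNIHAN)
for codepoint, (row, cell) in sorted(kjis0.items()):
    print(f"U+{codepoint:04X} ku-ten {row:02d}-{cell:02d} "
          f"EUC-JP {kuten_to_euc_jp(row, cell).hex()}")
```

Decoding the resulting bytes with Python's euc_jp codec is a cheap sanity
check that the row and cell were split correctly.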

I also asked at unicode(_at_)unicode(_dot_)org whether the issue Kubota-san 
has raised has ever been worked on, but I got no definitive answer.

So here is your definitive answer.

The mapping issues that Kubota-san raises in that document are
real and complicated, having to do with problems of roundtripping through
various East Asian character encodings, differences in vendor
mappings, and differences in interpretation of character widths.

The problems of specification of character widths *are* clearly
within the scope of UAX #11, East Asian Width, and some of Kubota-san's
issues have been dealt with there in recent revisions of UAX #11
and its associated data table, EastAsianWidth.txt for Unicode 3.2.
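
As a minimal sketch of what the East Asian Width property looks like in
practice, Python's standard unicodedata module exposes it per character.
Note the assumption: the module reflects whatever version of the Unicode
Character Database the interpreter ships with, not Unicode 3.2
specifically, and width_class is just a name chosen for this example.

```python
import unicodedata


def width_class(ch):
    """Return the East Asian Width property value for one character."""
    return unicodedata.east_asian_width(ch)


SAMPLES = [
    "A",        # LATIN CAPITAL LETTER A
    "\uff21",   # FULLWIDTH LATIN CAPITAL LETTER A
    "\u30a2",   # KATAKANA LETTER A
    "\uff71",   # HALFWIDTH KATAKANA LETTER A
    "\u00b1",   # PLUS-MINUS SIGN
]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {width_class(ch)}")
```

The values are Na (narrow), F (fullwidth), W (wide), H (halfwidth), and
A (ambiguous); the ambiguous class is where legacy implementations tend
to disagree about display width.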

However, the issue of non-Han character mappings between Unicode
and legacy East Asian character sets has been the subject of
some misunderstanding.

Contrary to popular opinion, the Unicode Consortium has *never*
published authoritative mapping tables for any of the East Asian
legacy standards per se. The Windows and Macintosh East Asian
mapping tables are provided by Microsoft and Apple, respectively,
and are vouched for by those companies as representing their
vendor mappings. The tables for the East Asian *standards*, on
the other hand, such as JIS0201, JIS0208, JIS0212, KSC5601.TXT,
BIG5.TXT, CNS11643.TXT were only ever provided as tentative,
informational-only tables. No claims were made about those
tables being authoritative determinations by the Unicode Consortium
as to what the mappings *should* be. On the contrary, the tables
had very tentative wording, indeed; for example:

#       This table contains the data the Unicode Consortium has on how
#       JIS X 0208 (1983) characters map into Unicode.

#       The kanji mappings are a normative part of ISO/IEC 10646.  The
#       non-kanji mappings are provisional, pending definition of
#       official mappings by Japanese standards bodies

However, merely having the tables up on the website led most people
to ignore the cautions and interpret them as authoritative tables
provided by the Unicode Consortium, anyway.
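
For readers who consume such a table programmatically, here is a minimal
Python sketch of a parser. It assumes the layout used by the JIS0208.TXT
file: '#' starts a comment, and each data line holds three hexadecimal
columns (Shift-JIS code, JIS X 0208 code, Unicode code point). The sample
rows are illustrative only, and parse_mapping_table is a hypothetical
name. Whatever such a parser yields is exactly as informational as the
table it was read from.

```python
# Illustrative rows in the assumed three-column layout (not a full table).
SAMPLE_TABLE = """\
# Shift-JIS code, JIS X 0208 code, Unicode code point
0x8140\t0x2121\t0x3000\t# IDEOGRAPHIC SPACE
0x8160\t0x2141\t0x301C\t# WAVE DASH
0x889F\t0x3021\t0x4E9C\t# <CJK>
"""


def parse_mapping_table(text):
    """Return {JIS X 0208 code: Unicode code point} from table text."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then skip blank and comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        sjis, jis, ucs = [int(field, 16) for field in data.split()[:3]]
        table[jis] = ucs
    return table


table = parse_mapping_table(SAMPLE_TABLE)
for jis, ucs in sorted(table.items()):
    print(f"JIS 0x{jis:04X} -> U+{ucs:04X}")
```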

That, in turn, has led over the years to various and repeated
reports of "bugs" in the tables -- some presented rather indignantly.
And because of differences in implementations, the bug reports
sometimes come in equal but opposite pairs.

However desirable it may be for somebody to "provide the answer" for
everyone about East Asian character set non-Han mappings, the
Unicode Technical Committee has not yet determined that it is
part of its charter to "standardize" mapping tables, particularly
for East Asian non-Han characters, nor is it self-evident how
it would go about doing so, given the de facto differences
in implementations and preexisting (pre-Unicode) complications
in interpretation of some characters in the East Asian standards.

The uncertainty within the Unicode Technical Committee as to exactly
who owns the mapping problem -- the UTC itself, the East Asian
standards committees, or the vendors -- led to the decision
last year to move all the East Asian standards mapping tables
explicitly to the OBSOLETE directory under the MAPPINGS section
of the online data on the website. This leaves the same, unchanged tables
available to people if they want, but makes it more obvious
that the UTC is not standing behind those tables as representing
any authoritative opinion.

However, this action itself has led to further misinterpretations.
Kubota-san says, "The cross mapping tables for east asian encodings
and character sets ... became obsolete [in Unicode 3.1.1]." The
fact is that the UTC has no official statuses for mapping tables,
and it is meaningless to say that the UTC "obsoleted" some
particular mapping table, because none of them are standardized,
authoritative, obsoleted, deprecated, superseded, or have any other
official status. They are all simply provided for information --
and because of the problematical issues in the East Asian standards
mapping tables, they were pushed into the already existing "OBSOLETE"
directory to make it more obvious that the UTC wasn't claiming they
were authoritative or up-to-date. Kubota-san also says, "Now we
don't have any authorized mapping tables for east asian encodings."
Well, they weren't "authorized" (in the sense of "authoritative")
in the first place -- they were simply individual contributions
that were posted for information. Their posting was, of course,
authorized, but the content was not authoritative.

Kubota-san also goes on to say, "Unicode is a standard. Not supplying
an authorized unified reference mapping table seems to show that
Unicode abandons the responsibility as a standard." Again, however
desirable it might be to have the Unicode Consortium provide
the definitive answer to all East Asian character set interoperability
problems, and however much we might like there to be a single,
simple answer, it isn't obvious that that is going to happen. The
responsibility of the Unicode Consortium as a standards organization
is to maintain and develop the *Unicode Standard* -- which it does.
The Unicode Consortium is not responsible for the maintenance and
interpretation of JIS standards or KS X standards or GB standards
or CNS standards and the like.

What people may be missing here is that there is no IRG equivalent
for the non-Han unification problem in East Asian standards. The IRG
provides normative mapping tables for *ideographic* characters as
part of the official work of WG2 to encode unified Han characters
in 10646 -- and those tables are then published both by ISO and
the Unicode Consortium as normative parts of their standards.
But there is no "nonIRG" to do the same work for the non-ideographic
characters, with official participation by the relevant national
standards committees representing their standards. Until such
time as a "nonIRG" is put together, it isn't clear how anyone is
going to assemble definitive, normative mapping tables for those
various legacy standards.

In the meantime, it may be possible to do a better job of *explaining*
the mapping problems highlighted by Kubota-san. And in fact there
have been tentative proposals for someone to write a Unicode Technical
Report about East Asian non-ideographic character mapping. That could
enable the provision of mapping tables that would have some context,
validity, and an explanation of alternative mappings and mapping
problems. But until someone steps forward to truly *own* the problem
and author such a Technical Report, it is unlikely that the issue
will move forward.

Ken Whistler, Technical Director, Unicode, Inc.

[I don't usually sign myself thus in contributions to this list, but
 you wanted a definitive answer, so there it is.]


