perl-unicode

Re: Encode::CJKguide

2002-03-26 17:09:29
Dan Kogai wrote on 2002-03-26 22:35 UTC:
I would appreciate if you give me some feedback.

Thanks for the posting. I'm afraid this text still needs a lot of
feedback. Sorry if my remarks below are somewhat mixed up in order.

And don't forget there are many scripts which has no character set at
all that are waiting to be coded one way or another.

Unicode 3.2 does now encode every script for which there exists any
coding practice. People are starting to find it really difficult to come
up with any ideas for what else could possibly be included into Unicode
4.0, apart from the few difficult scripts (such as Hieroglyphs) that are
already being worked on. For the remaining scripts (practically all of
which are historic scripts that haven't been used for decades or
centuries), Unicode is likely going to be the only character set that
will code them for the foreseeable future.

And not all
scripts are accepted or approved by Unicode Consotium.  If you want to
spell in Klingon, you have to find your own encoding.

Klingon is a very bad example. The entire available Klingon language
literature uses a Latin transcription of Klingon, therefore it was
decided, after consultation with leading Klingon experts, that ASCII
fulfills the requirements of the Klingon language community
perfectly and that no Unicode extension is necessary. The Klingon script
that was proposed by Everson to the Unicode consortium turned out to be
nothing but a part of the set decoration of some Star Trek movies, which
is copyrighted by Paramount Pictures and which is not actually used to
write Klingon by anyone.

or a nihilistic anti-Unicode activist who accept nothing but Mule
ctext

Actually, Mule is currently being rewritten to move completely from
ctext to Unicode (will be released in Emacs 22).

You are arguing in a discussion that happened 3-5 years ago. The
pro/contra battle is mostly over today ...

Perl 5.6 tackled the modern when it added Unicode support internally.
Now in Perl 5.8 we tackled the classic by adding supoprt for other
encodings externally.  I hope you like it.

I wouldn't use current version numbers in documentation this way ("now
in 5.8"). You will otherwise have to update these in each release, and
if not, the documentation will look out-of-date.

postmodernistist -> postmodernist
who think -> who thinks
who want -> who wants
scripts which has -> scripts which have
legacy data are there to last -> legacy data are here to stay
at your fingertip -> at your fingertips

(Remember: 3rd person singular verbs end with an "s" in English.)

The text generally needs some proofreading from a native reader
(I'm not one).

50% of statically linked perl consists of Encode!

Side note: I still think Encode should have used the encoding tables
that are already provided by the operating system where available. For
example on Linux, the iconv() function with glibc 2.2 or newer does
already provide access to all the necessary tables. I observe at the
moment that almost a dozen different programming language communities
reinvent the recoding wheel simultaneously and independently, even
though portable C libraries such as libiconv are already available for
exactly the same purpose.

You have only one and single authority, Unicode Consortium.

Actually, that's not quite correct, as there is also ISO/IEC
JTC1/SC2/WG2, the independent committee in charge of ISO 10646.
Fortunately, however, the two coordinate their work very closely.

You have to *pay* the Consortium to become a member, ultimately to
vote on what Unicode will be.  It is not Open Source :)

What does this have to do with Open Source? It costs nothing to submit
encoding proposals and it costs nothing to join the 
unicode@unicode.org
mailing list to discuss them. It also costs nothing to join your
national standards body as a contributing expert and vote this way
within ISO/IEC JTC1/SC2/WG2.

You *ONLY* have to pay the Consortium to become a member and vote on
what Unicode will be.  You don't have to be knowledgeable, you don't
have to be respected, you don't even have to be a native user of the
language you want to poke your nose on. It is not Open Source :)

Again. What does competence have to do with Open Source? Come on! You
are obviously able to write a lot of inflammatory things in open source
documentation without having to be knowledgeable about all of them, so
where is the big difference here? :-)

Actually, my personal experience has been that the real gurus within
both SC2 and UTC today are in fact highly knowledgeable linguists, who
clearly have qualified themselves in their writings as the world's
leading experts on coded character sets, who are known for in-depth
research and who are very willing to take diverse scholarly expert
advice on board. Decisions on Han encoding are not made by the Unicode
consortium directly, but by a subgroup called the Ideographic Rapporteur
Group (IRG), which consists practically exclusively of CJKV native
speakers. Therefore I read your comments with some dismay. At least
please clarify that this text represents Dan Kogai's personal and
possibly uninformed opinion on character encodings and their history,
and not some consensus of everyone involved in the Perl 5.8 release.
I think this text is still in very early alpha testing ...

There are many good pages on this subject in Japanese but not
so many in English....

Many of which have a rather Japan-specific and sometimes semi-informed
view of Unicode and often do not at all represent Chinese or Korean
views on issues such as Han unification. Please remember: CJK != Japan
and there are also many equally good or even better Korean and Chinese
web pages on
these issues.

I would at least add a reference to Ken Lunde's CJKV Information
Processing, O'Reilly & Associates, 1999, ISBN 1565922247, which is
widely regarded as the bible on this topic, even by Japanese
anti-Unicode geeks.

You should definitely also add a pointer to the Unihan database, which
is the most comprehensive existing source of cross-reference and
encoding conversion data between the different Han encodings:

http://www.unicode.org/Public/UNIDATA/Unihan.txt

Hope this helped ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>