perl-unicode

Encode::CJKguide (> 500 lines Long!)

2002-03-26 15:35:29
Folks,

I am almost ready to release Encode-1.00. While I am waiting for Anton to submit his patch for Encode::Supported, I have written the pod below, which explains how CJK encodings are made. I would appreciate it if you gave me some feedback. There are many good pages on this subject in Japanese but not so many in English....

Dan the Encode Maintainer

=head1 NAME

Encode::CJKguide -- a guide to CJK encodings

=head1 SYNOPSIS

This POD document describes how various CJK encodings are structured
and their underlying history.

=head1 The Evolution of CJK encodings

This section describes how CJK encodings evolved before Unicode.

=head2 The history before CJK

First there was ASCII.  ASCII is a seven-bit encoding that looks
like this:

=over 2

=item The ASCII table

         0123456789abcdef0123456789abcdef
  0x00:  <!--    Control Characters   -->
  0x20:   !"#$%&'()*+,-./0123456789:;<=>?
  0x40:  @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
  0x60:  `abcdefghijklmnopqrstuvwxyz{|}~

=back

The very last code point (0x7F) is DEL (*1).  ASCII was already
prevalent before any CJK encoding was introduced, and ASCII is needed
to implement the very programs that handle CJK, so virtually all CJK
encodings are designed to coexist with ASCII (*2).

=over 2

=item *1

Why DEL is assigned not to 0x00-0x1F but to 0x7F is a funny story.
Back when ASCII was designed, punchcards were still widely used.  To
conserve forest :), instead of throwing out mispunched cards they
decided to punch out all the holes for a mistyped byte.  So 0x7F,
with all 7 holes punched, became DEL.

=item  *2

I have heard of EBCDIC-Kanji but does anyone know more about it
than its name?

=back

=head2 Escaped vs. Extended -- two ways to implement multibyte

The history of multi-byte character encodings began when the Japanese
Industrial Standards body (JIS) published JIS C 6226 (which later
became JIS X 0208:1978).  It contained 6353 characters, which
naturally do not fit in a single byte; each character must be encoded
with a pair of bytes instead.  But how are you going to do so and
still keep ASCII available?

One way is to somehow tell the computer where double-byte characters
begin and end.  In practice, we use an I<escape sequence> to do so.
When the computer catches a sequence of bytes beginning with the
escape character (\x1b), it switches state to indicate whether the
following bytes are ASCII or halves of double-byte characters.

But there are many computers (still today) that have no idea of
escape sequences.  To coexist with those, you should avoid any case
where a given byte, including either half of a double-byte character,
could be mistaken for a control character.  In other words, avoid
0x00-0x20 (controls and space) and 0x7F (DEL).  The resulting
double-byte characters can map up to 94 x 94, or 8836, code points --
enough to fit everything in JIS C 6226.  Thus 7bit-JIS was born.

=over 2

=item the JIS Character Table

    21  22  ....  7E First Byte
   +---------------------------
 21| You can now map up to
 22| 8836 Characters here
 . |
 . |
 7E|
 Second byte

=back

Escape-based double-byte implementations are great for transfer
encodings.  But once you need to develop a text editor, they become a
pain in the neck because you cannot tell whether the byte you are
looking at is a whole ASCII character or half of a double-byte
character simply by looking at the byte itself.

Fortunately, ASCII uses only 7 bits out of 8.  Most computers back
then already used octets for bytes, so every byte has one extra bit
to spare.  Why not use that bit as a double-byte indicator?

Instead of using escape sequences, you just add 0x8080 to each JIS
character.  That is what Extended Unix Code does.  In a way, EUC
I<extends> ASCII rather than escapes it.

=over 2

=item the EUC Character Table

    A1  A2  ....  FE First Byte
   +---------------------------
 A1| You can map up to
 A2| 8836 Characters here
 . |
 . |
 FE|
 Second byte

=back
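
In code, this "extension" is nothing more than setting the MSB of
each byte.  A minimal sketch, assuming $j1 and $j2 hold a JIS X 0208
byte pair, each in the 0x21-0x7E range:

  # EUC = JIS + 0x8080: just set the MSB of both bytes.
  my ($e1, $e2) = ($j1 | 0x80, $j2 | 0x80);
  # e.g. JIS 0x21 0x21 (ideographic space) becomes EUC 0xA1 0xA1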

This concept of a 94 x 94 planar table quickly became the standard in
the rest of the CJK world as well.  The People's Republic of China
(simply "China" hereafter) set the GB 2312 national standard in 1980
and the Republic of Korea (South Korea; simply "Korea" hereafter) set
KS C 5601 in 1987.  They are both modeled after JIS C 6226, which
could be one of the reasons why these character sets contain Kana,
the phonetic characters used only in Japan.

Though there are escape-based encodings for these two (ISO-2022-CN
and ISO-2022-KR, respectively), they are hardly used; EUC is favored.
When you say gb2312 or ksc5601, the EUC-based encoding is assumed.

=head2 Scattered JIS? -- An acrobat called Shift JIS

So we have escape-based encodings (ISO-2022-XX) and extension-based
encodings (EUC-XX).  They coexist with old systems very well,
especially EUC.  In most cases, programs developed by people who know
nothing but ASCII run unmodified (perl has long been one of them).
And they lived happily ever after...?  NO!

Mitsubishi and ASCII-Microsoft (now Microsoft Japan) were in trouble
when they tried to introduce support for Han ideographic characters
(I simply call them "Kanji" hereafter) in MBASICPlus, which ran on
the Multi 16, Mitsubishi's first 16-bit personal computer.  Before
JIS X 0208 was first introduced in 1978, JIS had already introduced
what is called JIS X 0201 in 1976.

The Japanese had tried to teach computers how to handle their
language before double-byte characters became available.  Unlike
Chinese, which is purely ideographic (*1), Japanese also has two
variations of Kana, a phonetic representation of the language.  So
they decided to squeeze Katakana into the upper half of the byte.

=over 2

=item The JIS X 0201 table

         0123456789abcdef0123456789abcdef
  0x00:  <!--    Control Characters   -->
  0x20:  <!--
  0x40:       ASCII isprint() zone
  0x60:                               -->
  0x80:
  0xa0:   <!-- Katakana zone
  0xc0:                             -->
  0xe0:

=back

Mitsubishi, among other companies, had already been using this
Katakana extension of ASCII.  So you can't apply EUC, for backward
compatibility's sake.

Their answer was nothing short of acrobatic.

  - Let's use 0x81-0x9F and 0xE0-0xEF for the first byte
  (47 code points); they are the gaps left by JIS X 0201.
  - Let's use 0x40-0x7E and 0x80-0xFC for the second byte
  (188 code points).  ASCII control codes are still avoided,
  and CP/M (the OS of the Multi 16) uses 0xFD-0xFF.

Coincidentally, 47 x 188 is also 8836, exactly the same as 94 x 94.
Now all you have to do is lay each character of JIS X 0208 therein.

=over 2

=item The MS Kanji

         First Byte          Second Byte
         0123456789abcdef    0123456789abcdef
  0x00:  cccccccccccccccc    cccccccccccccccc
  0x10:  cccccccccccccccc    cccccccccccccccc
  0x20:  aaaaaaaaaaaaaaaa
  0x30:  aaaaaaaaaaaaaaaa
  0x40:  aaaaaaaaaaaaaaaa    JJJJJJJJJJJJJJJJ
  0x50:  aaaaaaaaaaaaaaaa    JJJJJJJJJJJJJJJJ
  0x60:  aaaaaaaaaaaaaaaa    JJJJJJJJJJJJJJJJ
  0x70:  aaaaaaaaaaaaaaac    JJJJJJJJJJJJJJJc
  0x80:   JJJJJJJJJJJJJJJ    JJJJJJJJJJJJJJJJ
  0x90:  JJJJJJJJJJJJJJJJ    JJJJJJJJJJJJJJJJ
  0xa0:   kkkkkkkkkkkkkkk    JJJJJJJJJJJJJJJJ
  0xb0:  kkkkkkkkkkkkkkkk    JJJJJJJJJJJJJJJJ
  0xc0:  kkkkkkkkkkkkkkkk    JJJJJJJJJJJJJJJJ
  0xd0:  kkkkkkkkkkkkkkkk    JJJJJJJJJJJJJJJJ
  0xe0:  JJJJJJJJJJJJJJJJ    JJJJJJJJJJJJJJJJ
  0xf0:                      JJJJJJJJJJJJJXXX

  c = ASCII control          J = MS Kanji
  a = ASCII printable        X = CP/M reserved (0xFD-0xFF)
  k = JIS X 0201 kana

=back
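
The "laying therein" follows a fixed piece of arithmetic: two JIS
rows fold into one Shift JIS lead byte, and the lead byte hops over
the JIS X 0201 katakana zone.  Here is a sketch of that well-known
formula (Encode itself converts via Unicode tables, not this
arithmetic):

  # Convert a JIS X 0208 byte pair (each byte in 0x21-0x7E)
  # to a Shift JIS byte pair.
  sub jis_to_sjis {
      my ($j1, $j2) = @_;
      my ($s1, $s2);
      if ($j1 % 2) {                              # odd JIS row
          $s1 = ($j1 + 1) / 2 + 0x70;
          $s2 = $j2 + ($j2 < 0x60 ? 0x1F : 0x20); # skips 0x7F (DEL)
      }
      else {                                      # even JIS row
          $s1 = $j1 / 2 + 0x70;
          $s2 = $j2 + 0x7E;
      }
      $s1 += 0x40 if $s1 >= 0xA0;                 # hop over the kana zone
      return ($s1, $s2);
  }
  # e.g. JIS 0x21 0x21 => Shift JIS 0x81 0x40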

Simply put, MS Kanji made double-byte characters possible by giving
up ASCII/JIS X 0201 compliance in the second byte.  Ugly it may be,
but backward compatibility with their previous code was now secured.

NEC also adopted this new Kanji code when they introduced MS-DOS 2.0
on the PC-9801, the most popular line of personal computers in Japan
until the AT compatible (the same "PC" as anywhere else) finally took
over its reign with the help of Windows 95.  So did Apple when they
introduced KanjiTalk.

With the support of the two most popular operating systems for
personal computers, this acrobatic encoding, later called Shift JIS,
became the most popular encoding in Japan.

But there was a price to be paid.  It is harder to port applications
to Shift JIS than to EUC because the second byte may look like ASCII
when it is in the 0x40-0x7E range.  Shift JIS also lacks the
extensibility that EUC has (EUC now supports JIS X 0212-1990, the
extended Kanji set, which is theoretically impossible in Shift JIS).
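
The "looks like ASCII" problem is easy to demonstrate.  In Shift JIS
the character for "table" (U+8868) is the byte pair 0x95 0x5C, and
0x5C is the ASCII backslash:

  # One Shift JIS character whose second byte is 0x5C (backslash):
  my $sjis = "\x95\x5C";    # U+8868
  # Byte-oriented code that treats 0x5C as an escape character will
  # corrupt this text.  EUC-JP is immune: both bytes of a double-byte
  # character always have their MSB set.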

The name "Shift" JIS came from the fact that JIS character sets are
"Shifted" when mapped.  IMHO, this is more like "Realigned" but the
word "Realigned" is hardly appealing to most Japanese speakers.  When
we talk about "Shift"ing, EUC is far more like shifting, with MSB
acting as the shift....

As you see, Shift JIS is more vendor-driven than the other JIS
encodings.  The same was true of Big5, the most popular encoding for
Traditional Chinese.  The name Big5 comes from the fact that the five
major PC vendors in Taiwan worked on the encoding.

Well, for Big5 there was a better reason to do so, because 8836
characters were hardly enough.  And fortunately for them, they had
no katakana to silly-walk around.

Here is how Big5 maps.

  - First byte:   0xA1-0xC6, 0xC9-0xF9
  - Second byte:  0x40-0x7E, 0xA1-0xFE
  - Source Character Set: Proprietary
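
From those ranges you can derive a quick validity check.  A sketch
(real Big5 codecs are table-driven, of course):

  # True if the byte pair ($b1, $b2) is in the Big5 double-byte range.
  sub is_big5_pair {
      my ($b1, $b2) = @_;
      return ((0xA1 <= $b1 && $b1 <= 0xC6) || (0xC9 <= $b1 && $b1 <= 0xF9))
          && ((0x40 <= $b2 && $b2 <= 0x7E) || (0xA1 <= $b2 && $b2 <= 0xFE));
  }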

Back then there was no equivalent of JIS X 0208 that they could refer
to.  The Taiwanese were aware of the weaknesses of this Shift-JIS-ish
encoding, so they decided to build yet another encoding, this time by
the government.  The result was CNS 11643.  CNS 11643 consists of 7
94 x 94 planes.  The first two planes derive from Big5 but are tidied
up, with duplicate characters removed.  CNS 11643 is EUC-safe and is
used in EUC-TW.

=head1 CJK, Unicode and ISO-2022

This section describes Unicode and its impact on CJK.  It also
describes ISO-2022, the biggest contender to Unicode today.

=head2 Write once, read everywhere? -- Unicode

Back in the days before Unicode, virtually every encoding was merely
bilingual or biscript: ASCII plus a local script.  With so many
encodings emerging, it was only natural to try to create an encoding
that covers as many written languages as possible, if not all of
them.  Unicode was one of the answers.

I carefully said "one of" because ISO-2022 already existed.  ISO-2022
is an escape-based encoding (*1) (7bit-JIS is one of its instances),
and by assigning an escape sequence to each existing character set,
ISO-2022 can, in theory, swallow as many character sets as needed to
form a universal encoding.  ISO-2022-JP-2 adopts this idea: in
addition to JIS X 0208 and JIS X 0212, it contains GB 2312 and
KS C 5601.

  *1 Strictly speaking, this is not true.  As a matter of fact,
  EUC is ISO-2022-compliant.  I'll discuss this later.

However, what many people, especially vendors and programmers, were
waiting for was a fixed-width encoding, so that you can manipulate
each character statelessly.  That is Unicode -- or its first goal,
which has now somewhat diverged.

Back in 1987 when the word Unicode was coined, 16 bits were thought
to be the practical maximum for a single character; memory was too
expensive and no 32-bit OS was available on the desktop.  In order to
squeeze in all the (ever increasing) CJK character sets, they found
that simply realigning the existing character sets would not work.
So they came up with arguably the most controversial idea: Han
Unification.

Many of the ideographs used in China, Japan, and Korea not only
I<look> the same but also I<mean> the same.  Then why not give those
in common a single code point and save code points?

There are two cases to consider: characters that look different but
mean the same (Case 1), and vice versa (Case 2).  The Han Unification
of Unicode decided to unify based upon Case 2: let's unify the ones
with the same shape!

As a result, something funny happened.  For example, U+673A means "a
machine" in Simplified Chinese but "a desk" in Japanese ("a machine"
in Japanese is U+6A5F).  So you can't tell what a character means
just by looking at its code point.

But the controversy didn't stop there.  Han Unification also decided
to apply Case 1 to those characters whose origin is the same.
Characters that are shaped differently but mean the same, sharing a
common origin, are called I<Itaiji>.  Unicode does not differentiate
Itaiji; should you need to differentiate them, use different fonts.

The problem is that Itaiji are very common in proper nouns,
especially surnames, in Japan.  "Watanabe", written with the two
characters "Wata" (U+6E21) and "Nabe", is a very popular family name
in Japan, but there are at least 31 different "Nabe" in existence.
Unicode lists only U+8FBA, U+908A, and U+9089 in its code set. (*2)

  *2 Actually, Unicode is less to blame for the itaiji problem than
  the Japanese domestic character sets, because JIS X 0208 also
  contains only 3 of them.  But the point is that Unicode has shut
  the door on itaiji even though there is room for expansions and
  updates -- at least for the time being.

For better or for worse, Unicode is still practical, at least as
practical as most regional character sets, thanks to a third rule
(Case 3): if the existing character set says two characters are
different, give them different code points, so that you can convert a
string to Unicode and back and get the same string.  That is why
"Nabe" has 3 code points, not one: in the case above, JIS X 0208 had
three of them.

Ironically, this move toward Han Unification reduced the number of
code points but bloated the size of Unicode encoders and decoders,
including Encode.  For instance, you can convert from Shift JIS to
EUC-JP algorithmically because they both share the same charset.
That is impossible with Unicode; you need a giant table to do so.
As a result, 50% of a statically linked perl consists of Encode!
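
That giant table is exactly what Encode carries, and with it the
conversion itself becomes a one-liner.  A minimal example using
Encode's from_to():

  use Encode qw(from_to);
  # Convert $bytes from Shift JIS to EUC-JP in place.  Under the hood
  # this decodes to Unicode and re-encodes, consulting the full
  # mapping tables both ways.
  from_to($bytes, "shiftjis", "euc-jp");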

=head2 the "Multicode" ? -- the ISO-2022

While Unicode makes multilingualization possible by providing a
single, unified character set, ISO-2022 tries to achieve the same
goal by supplying glue for multiple character sets.  Here is how it
does that.

=over 2

=item 0.  In-Use Table

Divide a table of 256 elements into 4 sections:

  0x00-0x1f    C0
  0x20-0x7f    GL
  0x80-0x9f    C1
  0xA0-0xFF    GR

The whole table is called the I<In-use table>.  Note that C0 and GL
correspond to ASCII controls and printables, respectively.

=item 1.  G0-G3 Buffers

Prepare 4 tables, each the size of GL.  We call them buffers,
and they are named G0 to G3.

=item 2.  Single Shift and Charset Invocation

When you receive a certain control character, swap GR with either G2
or G3.  This is called Character Table I<Invocation>.  When a whole
character is complete, the state of GR is restored.  Since GR may
change on a character-by-character basis, the control character used
here is called a "Single Shift", or SS for short.  SS2 and SS3
invoke G2 and G3, respectively.

=item 3.  Locking Shift and Charset Designation

When you receive an escape sequence, swap GL with the character set
the escape sequence specifies.  This is called Character Set
I<Designation>.  You don't have to restore GL until the next escape
sequence; thus this action is called a "Locking Shift".

=item 4.  Character Set Specifications

The character sets that can be invoked or designated must contain
94**n or 96**n characters (a 96-set uses the positions of space and
DEL as well).

=back

Whoa.  Complicated?  Maybe.  But let me show you two examples, EUC-JP
and ISO-2022-JP-1, so you get the picture.

=over 2

=item EUC-JP

                                  sizeof(charset)  ESC. seq.
  ----------------------------------------------------------
  GL  G0: US-ASCII                        96 ** 1
  GR  G1: JIS X 0208-1983                 94 ** 2
      G2: JIS X 0201:1976 (Katakana only) 94 ** 1
      G3: JIS X 0212:1990                 94 ** 2
  SS2 = 0x8E
  SS3 = 0x8F

  No escape sequence used.

=item ISO-2022-JP-1  [RFC2237]

  ----------------------------------------------------------
  GL G0: US-ASCII                         96 ** 1   \e ( B
         JIS X 0208-1978                  94 ** 2   \e $ @
         JIS X 0208-1983                  94 ** 2   \e $ B
         JIS X 0201-Roman                 96 ** 1   \e ( J
         JIS X 0212-1990                  94 ** 2   \e $ D

  GR is unused; G1-G3 are unused.

=back
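
You can watch both mechanisms from perl.  A minimal sketch, with
byte values taken straight from the tables above:

  use Encode qw(encode decode);
  # Single shift: in EUC-JP, SS2 (0x8E) invokes G2 (JIS X 0201
  # Katakana) for the one byte that follows.
  my $kana = decode("euc-jp", "\x8E\xB1");   # HALFWIDTH KATAKANA A
  # Locking shift: ISO-2022-JP designates JIS X 0208 into G0 with
  # "\e$B" and shifts back to US-ASCII with "\e(B".
  my $jis = encode("iso-2022-jp", "\x{6F22}\x{5B57}");   # "Kanji"
  # $jis now holds "\e\$B4A;z\e(B"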

As you see, you can call EUC "single-shift-based ISO-2022" or even
"ISO-2022-8bit".  You may not know this, but ISO-8859-X, also known
as Latin X, is ISO-2022-compliant too.  It can be defined as
follows:

                                  sizeof(charset)  ESC. seq.
  ----------------------------------------------------------
  GL  G0: US-ASCII                        96 ** 1
  GR  G1: Varies                          96 ** 1

  No G2 and G3, no escape sequence and single shifts.

ISO-2022 has advantages over Unicode, as follows:

=over 2

=item *

ISO-2022 differentiates I<charset> and I<encoding> strictly, and it
specifies only the encoding.  Charsets are left to regional
government bodies.  All that ECMA, which maintains the ISO-2022
registry, has to do is register them.  This makes work sharing much
easier.

On the other hand, the Unicode Consortium has to work on both
charsets and encodings (even though some of the work is delegated to
other parties, such as the IRG), resulting in more time and arguments
before a new character is introduced.

=item *

It has no practical size limit, even in EUC form.  EUC-TW is already
4 bytes at maximum.  And if you are happy with escape sequences, you
can swallow as many charsets as you wish.

=item *

You have to *pay* the Consortium to become a member, ultimately to
vote on what Unicode will be.  It is not Open Source :)

=back

At the same time, Unicode does have advantages over ISO-2022, as
follows:

=over 2

=item *

There is one single authority, the Unicode Consortium.  You don't
have to worry about whom to ask whether and how a given character is
mapped.

=item *

It has a concise set of characters that covers the most popular
languages.  You may not be able to express what you have to say to
the fullest extent, but you can say most of it.

=item *

More support from vendors.  Unicode started its life to make vendors
happier (or lazier), not poets or linguists, by tidying up the
charset and encoding.  Well, it turned out not to be as easy as they
first hoped, with heaping demands for new code points and surrogate
pairs.  But it is still bliss enough that you only have to know one
charset (though Unicode does have several different encodings).
That is, except for those who hack converters like Encode :)

=item *

You *ONLY* have to pay the Consortium to become a member and vote on
what Unicode will be.  You don't have to be knowledgeable, you don't
have to be respected, you don't even have to be a native user of the
language you want to poke your nose into.  It is not Open Source :)

=back

=head1 Will Character Sets and Encodings ever be Unified?

This section discusses the future of charsets and encodings.  In
doing so, I decided to grok the philosophy of perl one more time.

=head2  Character Sets and Encodings should be designed to make easy
writings easy, without making hard writings impossible

Does Unicode meet this criterion?  It first opted for the former, to
make easy writings easy by squeezing everything you need into 16
bits.  But Unicode today seems more focused on making hard writings
possible.

The problem is that this move toward making hard writings possible
is making Unicode trickier and trickier as time passes.  The
surrogate pair was introduced in 1996, but I have yet to see an
application that makes use of it.
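
For the record, the trick itself is only arithmetic.  A sketch of
how one code point beyond U+FFFF is split into a surrogate pair:

  # Split a code point above U+FFFF into a UTF-16 surrogate pair.
  my $cp = 0x10330;                            # GOTHIC LETTER AHSA
  my $hi = 0xD800 + (($cp - 0x10000) >> 10);   # high surrogate
  my $lo = 0xDC00 + (($cp - 0x10000) & 0x3FF); # low surrogate
  # $hi == 0xD800, $lo == 0xDF30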

On the other hand, the ISO-2022 series seems to care little about
this.  EUC is easy, yet it stops short when you try hard writings
(multiple CJK texts in a single file).

I have to conclude that there is no silver bullet here yet.  Unicode
has tried hard to be one, but it is quicksilver at best.  Quicksilver
it may be, but that's the bullet we have.  That's the bullet Larry
has decided to use, so I forged the gunmetal to shoot it.  And the
result was Encode.

=head2  There is more than one way to encode it

In spite of all the advocacy for a Unified Character Set and
Encoding, legacy data are here to stay.  So at the very least, you
still need "legacy" encodings, for the very same reason you can't
trash your tape drives while you have a terabyte RAID at your
fingertips.

Also remember that EBCDIC is still in use (and coexists with Unicode!
See L<perlebcdic>).

And don't forget there are many scripts with no character set at all
that are waiting to be coded one way or another.  And not all scripts
are accepted or approved by the Unicode Consortium.  If you want to
spell in Klingon, you have to find your own encoding.

=head2  A Module for getting your job done

If you are a die-hard Unicode advocate who wants to tidy up the
world by converting everything in it to Unicode, a nihilistic
anti-Unicode activist who accepts nothing but Mule ctext, or a
postmodernist who thinks the classic is cool and the modern rocks,
this module is for you.

Perl 5.6 tackled the modern when it added Unicode support internally.
Now in Perl 5.8 we tackle the classic by adding support for other
encodings externally.  I hope you like it.
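
For example, with the new I/O layers, converting a legacy file needs
no explicit Encode calls at all (the file names here are, of course,
made up):

  # Read EUC-JP, write UTF-8; Encode works underneath the PerlIO layer.
  open my $in,  "<:encoding(euc-jp)", "legacy.txt" or die $!;
  open my $out, ">:encoding(utf-8)",  "modern.txt" or die $!;
  print $out $_ while <$in>;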

=head1 Author(s)

By Dan Kogai E<lt>dankogai@dan.co.jpE<gt>.  Send your comments to
E<lt>perl-unicode@perl.orgE<gt>.  You can subscribe via
L<http://lists.perl.org/showlist.cgi?name=perl-unicode>.

=cut
