perl-unicode

Re: Encode::CJKguide (> 500 lines Long!)

2002-03-26 21:24:50
On Wed, Mar 27, 2002 at 07:35:18AM +0900, Dan Kogai wrote:
> as follows.  That explains how CJK encodings are made.  I would
> appreciate it if you give me some feedback.  There are many good
> pages on this subject in Japanese but not so many in English....

Thanks a lot for explaining this matter so eloquently. :-)

I've patched it with the usual spellchecking, podchecking and nitpicking,
corrected the reversed Case 1 / Case 2 mistake, and added some
Traditional-Chinese-related info.

Thanks,
/Autrijus/

--- bb  Wed Mar 27 11:50:31 2002
+++ aa  Wed Mar 27 12:21:11 2002
@@ -5,7 +5,7 @@
 =head1 SYNOPSIS
 
 This POD document describes how various CJK encodings are structured
-and its underling history.
+and its underlying history.
 
 =head1 The Evolution of CJK encodings
 
@@ -30,7 +30,7 @@
 
 The last one (0x7F) is DEL (*1).  ASCII was already prevalent
 before any CJK encoding was introduced.  ASCII is also needed to
-implement very CJK handling programs so virtually all CJK encodings
+implement various CJK handling programs, so virtually all CJK encodings
 are designed to coexist with ASCII (*2).
 
 =over 2
@@ -38,12 +38,12 @@
 =item *1
 
 Why DEL is assigned not to 0x00-0x1F but to 0x7F is a funny story.  Back
-when ASCII was designed, punchcards were still widely used.  To
+when ASCII was designed, punch cards were still widely used.  To
 conserve forests :), instead of throwing out mispunched cards they
 decided to punch all the holes over a mistyped byte.  So 0x7F, with all 7
 holes punched, became DEL.
 
-=item  *2
+=item *2
 
 I have heard of EBCDIC-Kanji, but does anyone know more about it
 than its name?
@@ -53,7 +53,7 @@
 =head2 Escaped vs. Extended -- two ways to implement multibyte
 
 The history of multi-byte character encoding began when the Japanese
-Industorial Standard (JIS) has published JIS C 6226 (later became JIS
+Industrial Standards (JIS) published JIS C 6226 (which later became JIS
 X 0208:1978).  It contained 6353 characters, which naturally won't fit
 in a byte; they must be encoded with a pair
 of bytes.  But how are you going to do so and still make ASCII
@@ -62,7 +62,7 @@
 One way is to somehow tell the computer the beginning
 and end of a double-byte character.  In practice, we use an I<escape
 sequence> to do so.  When the computer catches a sequence of bytes
-beginning with an escape character (\x1b), it changes the state
+beginning with an escape character (C<\x1b>), it changes the state
 as to whether the following bytes are ASCII or a double-byte character.
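
 For instance, in ISO-2022-JP (of the 7bit-JIS family), the escape
 sequence C<\x1b$B> designates JIS X 0208 and C<\x1b(B> switches back
 to ASCII.  A minimal sketch with B<Encode> (the sample bytes are mine):

    use Encode;
    # ESC $ B (0x1b 0x24 0x42) enters JIS X 0208;
    # ESC ( B (0x1b 0x28 0x42) returns to ASCII.
    my $octets = "\x1b\x24\x42\x24\x22\x1b\x28\x42";   # JIS 0x2422
    my $string = decode('iso-2022-jp', $octets);       # HIRAGANA A, U+3042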
 
 But there are many computers (still today) that have no idea of escape
@@ -96,7 +96,7 @@
 
 Fortunately, ASCII uses only 7 bits out of 8.  Most computers back
 then already used an octet for a byte, so a byte had one spare bit.
-Why not use that bit double-byte inidicator?
+Why not use that bit as a double-byte indicator?
 
 Instead of an escape sequence, you just add 0x8080 to each JIS code
 point.  That is what Extended Unix Code does.  In a way, EUC
@@ -118,36 +118,37 @@
 =back
 
 This concept of a 94x94 planar table quickly became standard in the
-CJK world as well.  People's Republic of China (just "China" as
+CJK world as well.  The People's Republic of China (simply I<China>
 hereafter) set GB 2312 as its national standard in 1980, and the Republic of
-Korea (South Korea; simply "Korea" as follows) has set KS C 5601 in
-1989.  They are both based upon JIS C 6226, could be one of the
+Korea (South Korea; simply I<Korea> hereafter) set KS C 5601 in
+1989.  They are both based upon JIS C 6226, which could be one of the
 reasons why these character codes contain Kana, phonetic characters
 used only in Japan.
 
 Though there are escape-based encodings for these two (ISO-2022-CN
 and ISO-2022-KR, respectively), they are hardly used; EUC is preferred.
-When you say gb2312 and ksc5601, EUC-based encoding is assumed.
+When you specify C<gb2312> or C<ksc5601> in B<Encode>, EUC-based
+encoding is assumed.
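
 To see the EUC arithmetic at work: HIRAGANA LETTER A is 0x2422 in JIS
 X 0208, and adding 0x8080 yields its EUC-JP form, 0xA4A2.  A minimal
 sketch (the variable names are mine):

    use Encode;
    my $jis = 0x2422;                  # HIRAGANA LETTER A in JIS X 0208
    my $euc = $jis + 0x8080;           # 0xA4A2 -- MSB set on both bytes
    my $str = decode('euc-jp', pack 'n', $euc);   # the character U+3042
    # gb2312 and ksc5601 octets decode the same EUC way:
    # my $cn = decode('gb2312', $gb_octets);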
 
 =head2 Scattered JIS? -- An acrobat called Shift JIS
 
-So we have escape-based encodings (ISO-2022-XX) and extention-based
-encodings (EUC-XX).  They coexist with old systems very well,
+So we have escape-based encodings (ISO-2022-XX) and extension-based
+encodings (EUC-XX).  They both coexist with old systems very well,
 especially EUC.  In most cases, programs developed by people who know
 nothing but ASCII run unmodified (perl has been one of them).  And
 they lived happily ever after...?  NO!
 
-Mitsubishi, and ASCII-Microsoft (Now Microsoft Japan) was in troble
-when they try to introduce Han ideographic character (I simply call
-them "Kanji" as follows) support in MBASICPlus that runs on Multi 16,
+Mitsubishi and ASCII-Microsoft (now Microsoft Japan) were in trouble
+when they tried to introduce Han ideographic character support (I'll
+simply call them I<Kanji> hereafter) in MBASICPlus, which ran on the Multi 16,
 Mitsubishi's first 16-bit personal computer.  Before JIS X 0208 was
 first introduced in 1978, JIS had already introduced what is called JIS X
 0201 in 1976.
 
-The Japanese try to teach computers how to handle thier language
+The Japanese tried to teach computers how to handle their language
 before double-byte characters became available.  Unlike Chinese, which
 is purely ideographic (*1), Japanese had two variations of Kana, a
-phonetic representation of thier language as well.  So they decided to
+phonetic representation of their language as well.  So they decided to
 squeeze Katakana into the upper half of the byte.
 
 =over 2
@@ -167,7 +168,7 @@
 =back
 
 Mitsubishi, among other companies, had already used this
-Katakana extention of ASCII.  So you can't apply EUC for backward
+Katakana extension of ASCII.  So EUC could not be applied, for backward
 compatibility's sake.
 
 Their answer was nothing short of acrobatic.
@@ -204,7 +205,7 @@
    0xe0:  JJJJJJJJJJJJJJJJ    JJJJJJJJJJJJJJJJ
    0xf0:                      JJJJJJJJJJJJJJXX
 
-   c = ASCII control          J = MS Kaji
+   c = ASCII control          J = MS Kanji
    a = ASCII printable        K = CP/M control
    k = JIS X 0201 kana
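
 The acrobatics reduce to arithmetic.  Here is a sketch of the usual
 JIS-to-Shift-JIS realignment (my own formulation, assuming a valid
 JIS X 0208 byte pair, not quoted from any standard):

    sub jis2sjis {
        my ($j1, $j2) = @_;   # JIS X 0208 bytes, each in 0x21-0x7E
        my $s1 = (($j1 + 1) >> 1) + ($j1 < 0x5F ? 0x70 : 0xB0);
        my $s2 = ($j1 & 1) ? $j2 + ($j2 < 0x60 ? 0x1F : 0x20)
                           : $j2 + 0x7E;
        return ($s1, $s2);    # e.g. (0x24, 0x22) maps to (0x82, 0xA0)
    }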
 
@@ -212,12 +213,11 @@
 
 Simply put, MS Kanji made double-byte characters possible by giving
 up ASCII/JIS X 0201 compliance for the second byte.  Ugly as it may be,
-the
-backward compatibility to thier previous code was promised.
+backward compatibility with their previous code was now secured.
 
 NEC also adopted this new Kanji code when it introduced MS-DOS
 ver. 2.0 on the PC-9801, the most popular line of personal computers in
-Japan until AT compatible (or the same "PC" as anywhere) finally takes
+Japan until AT compatibles (or the same I<PC> as anywhere else) finally took
 over its reign with the help of Windows 95.  So did Apple when it
 introduced KanjiTalk.
 
@@ -228,23 +228,24 @@
 But there were prices to be paid.  It is harder to port applications
 than EUC because the second byte may look like ASCII when it
 is in 0x40-0xFE.  It also lacks the expandability that EUC had (EUC
-now suports JIS X 0212-1990, extended Kanji, which is theoretically
+now supports JIS X 0212-1990, extended Kanji, which is theoretically
 impossible in Shift JIS).
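
 A classic illustration: the Shift JIS form of U+8868 (the Kanji for
 I<surface>) is the two bytes 0x95 0x5C, and 0x5C is the ASCII
 backslash.  A byte-blind program trips over it, as in this sketch:

    use Encode;
    my $sjis = encode('shiftjis', "\x{8868}");   # bytes 0x95 0x5C
    # a naive byte-wise scan mistakes the second byte for a backslash:
    print "bogus backslash!\n" if $sjis =~ /\x5c/;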
 
-The name "Shift" JIS came from the fact that JIS character sets are
-"Shifted" when mapped.  IMHO, this is more like "Realigned" but the
-word "Realigned" is hardly appealing to most Japanese speakers.  When
-we talk about "Shift"ing, EUC is far more like shifting, with MSB
+The name I<Shift> JIS came from the fact that JIS character sets are
+I<Shifted> when mapped.  IMHO, this is more like I<Realigned>, but
+that word is hardly appealing to most Japanese speakers.  When
+we talk about I<shift>ing, EUC is far more like shifting, with the MSB
 acting as the shift....
 
 As you see, Shift JIS is more vendor-driven than other JIS encodings.
 And this was the case for Big5, the most popular encoding for
 Traditional Chinese.  The name Big5 comes from the fact that the 5
-major PC vendors in Taiwan has worked on the encoding.
+major PC vendors in Taiwan (Acer, Eten, FIC, Mitac, and Zerone)
+composed the encoding together.
 
-Well, for Big5, there were better reason to do
-so because 8836 characters were hardly enough.  And fortunately for
-them, they have no katakana to to silly-walk.
+Well, for Big5, there was better reason to do so, because 8836 characters
+were hardly enough.  And fortunately for them, they had no katakana to
+silly-walk.
 
 Here is how Big5 maps.
 
@@ -268,19 +269,25 @@
 
 Back in the time before Unicode, virtually every encoding was merely
 bilingual or biscript: ASCII plus a local charset.  With so many encodings
-emerging, it is only natural to try to set an encoding that coveres as
+emerging, it is only natural to try to set an encoding that covers as
 many, if not all, written languages.  And Unicode was one of the
 answers.
 
-I carefully said "one of" because ISO-2022 has already existed.
+I carefully said I<one of>, because ISO-2022 has already existed.
 ISO-2022 is an escape-based encoding (*1) (7bit-JIS is one of them), and by
 assigning escape sequences to existing character sets, ISO-2022, in
 theory, can swallow as many character sets as needed to form a
 universal encoding.  ISO-2022-JP-2 adopts this idea.  In addition to
 JIS X 0208 and JIS X 0212, it contains GB 2312 and KS C 5601.
 
-   *1 Exactly speaking, this is not true.   As a matter of fact,
-   EUC is ISO-2022-compliant.  I'll discuss this later
+=over 2
+
+=item *1
+
+Precisely speaking, this is not true.   As a matter of fact, EUC is
+ISO-2022-compliant.  I'll discuss this later.
+
+=back
 
 However, what many people, especially vendors and programmers, were
 waiting for was a fixed-width encoding, so you can manipulate each
@@ -292,55 +299,62 @@
 expensive and 32-bit OSes were not available on the desktop.  In order to
 squeeze in all the (ever-increasing) CJK encodings, they found that simply
 realigning the existing character sets would not work.  They came up
-with arguablly the most controversial idea;  Han Unification.
+with arguably the most controversial idea: B<Han Unification>.
 
 Many of the ideographs used in China, Japan, and Korea not only I<look>
 the same but also I<mean> the same.  Then why not give the same code
 point to those in common and save code points?
 
-There are two cases to consider.  Those they look different but means
-the same (Case 1) and vise varsa (Case 2).  The Han Unification of
+There are two cases to consider: those that look the same but have
+different meanings (Case 1), and vice versa (Case 2).  The Han Unification of
 Unicode decided to unify based upon Case 1: let's unify the ones with
 the same shape!
 
-As a result, something funny has happed.  For example, U+673A means "a
-machine" in Simplified Chinese but "a desk" in Japanese.  "a machine"
-in Japanese.  U+6A5F.  So you can't tell what it means just by looking
-at the code.
+As a result, something funny has happened.  For example, U+673A means I<a
+machine> in Simplified Chinese but I<a desk> in Japanese.  The character
+that means I<a machine> is U+6A5F in Japanese and Traditional Chinese.
+So you can't tell what it means just by looking at the code.
 
 But the controversy didn't stop there.  Han Unification also decided
 to apply Case 2 to those characters whose origin was the same.  These
-characters that are sheped different but means the same with the same
-origin is called I<Itaiji>.  Unicode does not differenciate Itaiji;
-should you need to differenciate, use different fonts.
+characters that are shaped differently but mean the same, sharing the same
+origin, are called I<Itaiji> (characters with alternative bodies).
+Unicode does not differentiate Itaiji; should you need to differentiate,
+use different fonts.
 
 The problem is that Itaiji is very common in proper nouns, especially
-surnames in Japan.  "Watanabe", with two characters "Wata" (U+6E21)
-and "Nabe",  is a very popular family name in Japan but there are at
-least 31 different "Nabe" in existence. But Unicode lists only
+surnames in Japan.  For example: I<Watanabe>, with two characters I<Wata>
+(U+6E21) and I<Nabe>, is a very popular family name in Japan -- but there
+are at least 31 different I<Nabe> in existence.  Unicode lists only
 U+8FBA, U+908A, and U+9089 in the code set. (*2)
 
-   *2 Actually, Unicode is less to blame on itaiji problem than the
-   Japanese domestic character sets, Because JIS X 0208 only contains 3
-   of them also.  But the point is that Unicode has shut the door for
-   itaiji even when there are rooms for expansions and updates -- at
-   least for the time being.
+=over 2
+
+=item *2
+
+Actually, Unicode is less to blame for the Itaiji problem than the
+Japanese domestic character sets, because JIS X 0208 only contains 3
+of them too.  But the point is that Unicode has shut the door for
+Itaiji even when there is room for expansions and updates -- at
+least for the time being.
+
+=back
 
 For better or for worse, Unicode is still practical, at least
-as practical as most regional character sets, thanks to Case 3.  If
+as practical as most regional character sets, thanks to Case 3:  If
 the existing character set says they are different, give them
 different code points so you can convert the string to Unicode then
-back and get the same string.  That is why "Nabe" has 3, not one, code
-points;  In the case above, JIS X 0208 had three of them.
+back and get the same string.  That is why I<Nabe> has three, not one, code
+points;  in the case above, JIS X 0208 had three of them.
 
 Ironically, this move toward Han Unification has reduced the number of
 code points but bloated the size of Unicode encoders and decoders,
-including the Encode.  For instance, you can convert from Shift JIS to
-EUC-JP programatically because they both share the same charset.  This
+including the B<Encode> module.  For instance, you can convert from Shift JIS to
+EUC-JP algorithmically, because they both share the same charset.  This
 is impossible with Unicode, where you need a giant table to do so.
-As a result, 50% of statically linked perl consists of Encode!
+As a result, 50% of statically linked perl consists of B<Encode>!
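
 With B<Encode>, for instance, even a Shift JIS to EUC-JP conversion
 makes the round trip through Unicode and its tables.  A sketch (the
 sample bytes are mine):

    use Encode qw(from_to);
    my $bytes = "\x82\xA0";                  # HIRAGANA A in Shift JIS
    from_to($bytes, 'shiftjis', 'euc-jp');   # in place; now "\xA4\xA2"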
 
-=head2 the "Multicode" ? -- the ISO-2022
+=head2 The I<Multicode>? -- ISO-2022
 
 While Unicode makes multilingualization possible by providing a
 single, unified character set, ISO-2022 tries to achieve the same
@@ -366,13 +380,13 @@
 Prepare 4 tables, each the same size as GL.  We call them buffers,
 and they are named G0 to G3.
 
-=item 2.  Single Shift and  Charset Invocation
+=item 2.  Single Shift and Charset Invocation
 
 When you receive a certain control character, swap GR with either G2
 or G3.  This is called Character Table I<Invocation>.  When a whole
 character is complete, restore the state of GR.  Since GR may change
 on a character-to-character basis, the control character used here is
-called "Single Shift Character", or SS for short.  SS2 and SS3
+called I<Single Shift Character>, or SS for short.  SS2 and SS3
 correspond to G2 and G3, respectively.
 
 =item 3.  Locking Shift and Charset Designation
@@ -380,9 +394,9 @@
 When you receive an escape sequence, swap GL with the character
 set the escape sequence specifies.  This is called Character Set
 I<Designation>.  You don't have to restore GL until the next escape
-sequence.  Thus this action is called "Locking Shift".
+sequence.  Thus this action is called I<Locking Shift>.
 
-=item 4.  Character Set Spesifications
+=item 4.  Character Set Specifications
 
 The character sets that can be invoked or designated must contain
 94**n or 96**n characters (a 96-set also uses the SPACE and DEL positions).
@@ -420,7 +434,7 @@
 
 =back
 
-As you see, can call EUC "Single-shift based ISO-2022" or even
+As you see, we can call EUC I<Single-shift based ISO-2022> or even
 ISO-2022-8bit.  You may not know this, but ISO-8859-X, also known
 as Latin X, is also ISO-2022-compliant.  They can be defined as
 follows:
@@ -432,13 +446,13 @@
 
    No G2 and G3, no escape sequences or single shifts.
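
 You can watch the single shift at work in EUC-JP itself, where SS2
 (0x8E) invokes JIS X 0201 Katakana from G2 and SS3 (0x8F) invokes
 JIS X 0212 from G3.  A sketch:

    use Encode;
    # SS2 (0x8E) followed by 0xB1 is HALFWIDTH KATAKANA LETTER A:
    my $kana = decode('euc-jp', "\x8E\xB1");   # U+FF71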
 
-ISO-2022 has advantages over Unicode as follows;
+ISO-2022 has advantages over Unicode as follows:
 
 =over 2
 
 =item *
 
-ISO-2022 differenciates I<charset> and I<encoding> strictly and it
+ISO-2022 differentiates I<charset> and I<encoding> strictly and it
 specifies only encoding.  Charsets are up to regional government
 bodies.  All that ECMA, which maintains ISO-2022, has to do is to
 register them.  This makes work sharing much easier.
@@ -450,19 +464,19 @@
 
 =item *
 
-Has no practal size limit, even in EUC-form.  EUC-TW is already 4
+Has no practical size limit, even in EUC-form.  EUC-TW is already 4
 bytes max.  And if you are happy with escape sequences, you can swallow
 as many charsets as you wish.
 
 =item *
 
-You have to *pay* the Consortium to become a member, ultimately to
+You have to B<pay> the Consortium to become a member, ultimately to
 vote on what Unicode will be.  It is not Open Source :)
 
 =back
 
 At the same time, Unicode does have advantages over ISO-2022 as
-follows;
+follows:
 
 =over 2
 
@@ -474,7 +488,7 @@
 
 =item *
 
-Have a consise set of characters that covers most popular languages.
+Have a concise set of characters that covers most popular languages.
 You may not be able to express what you have to say to the fullest
 extent but you can say most of it.
 
@@ -486,16 +500,16 @@
 =item *
 
 More support from vendors.  Unicode started its life to make vendors
-happier (or lazier), not poets or liguists by tidying the charset and
+happier (or lazier), not poets or linguists by tidying the charset and
 encoding.  Well, it turned out not to be as easy as they first hoped,
-with heaping demand for new codepoints and sarrogate pair.  But it is
+with heaping demands for new codepoints and surrogate pairs.  But it is
 still bliss enough that you only have to know one charset (Unicode
 does have several different encodings).  That is, except for those who
-hack converters like Encode :)
+hack converters like B<Encode>. :)
 
 =item *
 
-You *ONLY* have to pay the Consortium to become a member and vote on
+You B<ONLY> have to pay the Consortium to become a member and vote on
 what Unicode will be.  You don't have to be knowledgeable, you don't
 have to be respected, you don't even have to be a native user of the
 language you want to poke your nose into.  It is not Open Source :)
@@ -505,14 +519,14 @@
 =head1 Will Character Sets and Encodings ever be Unified?
 
 This section discusses the future of charsets and encodings.  In doing
-so, I decided to grok the philosophy of perl one more time
+so, I decided to grok the philosophy of perl one more time.
 
 =head2  Character Sets and Encodings should be designed to make easy
 writings easy, without making hard writings impossible
 
 Does Unicode meet this criterion?  It first opted for the first part,
 to make easy writings easy by squeezing all you need into 16 bits.  But
-Unicode today seems more forcused on making hard writings possible.
+Unicode today seems more focused on making hard writings possible.
 
 The problem is that this move toward making hard writings possible is
 making Unicode trickier and trickier as time passes.  Surrogate pair
@@ -526,36 +540,36 @@
 I have to conclude there is no silver bullet here yet.  Unicode has
 tried hard to be one, but it is quicksilver at best.  Quicksilver it may
 be, but that's the bullet we have.  That's the bullet Larry has decided to
-use so I forge the gunmetal to shoot it.  And the result was Encode.
+use, so I forged the gunmetal to shoot it.  And the result was B<Encode>.
 
 =head2  There is more than one way to encode it
 
 In spite of all the advocacy for the Unified Character Set and Encoding,
-legacy data are there to last.   So at very least, you still need
-"legacy" encodings for the very same reason you can't trash your tape
+legacy data are there to last.  So at the very least, you still need
+I<legacy> encodings for the very same reason you can't trash your tape
 drives while you have a terabyte RAID at your fingertips.
 
-Also remeber EBCDIC is still in use (and coexists with Unicode! see
+Also remember EBCDIC is still in use (and coexists with Unicode! see
 L<perlebcdic>).
 
 And don't forget there are many scripts which have no character set at
 all and are waiting to be coded one way or another.  And not all
-scripts are accepted or approved by Unicode Consotium.  If you want to
-spell in Klingon, you have to find your own encoding.
+scripts are accepted or approved by the Unicode Consortium.  If you want
+to spell in the Klingon alphabet, you have to find your own encoding.
 
 =head2  A Module for getting your job done
 
-If you are a die-hard Unicode advocate who want to tidy the world by
+If you are a die-hard Unicode advocate who wants to tidy the world by
 converting everything there into Unicode, or a nihilistic anti-Unicode
-activist who accept nothing but Mule ctext, or a postmodernistist who
-think the classic is cool and the modern rocks, this module is for
+activist who accepts nothing but Mule ctext, or a postmodernist who
+thinks the classic is cool and the modern rocks, this module is for
 you.
 
 Perl 5.6 tackled the modern when it added Unicode support internally.
-Now in Perl 5.8 we tackled the classic by adding supoprt for other
+Now in Perl 5.8 we tackled the classic by adding support for other
 encodings externally.  I hope you like it.
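
 In other words, you decode the classic into the modern and encode it
 back at will.  A parting sketch (the file name is made up):

    use Encode;
    open my $in, '<', 'legacy.sjis' or die $!;
    my $text = decode('shiftjis', do { local $/; <$in> });  # classic to modern
    my $euc  = encode('euc-jp', $text);                     # and back again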
 
-=head1 Author(s)
+=head1 AUTHORS
 
 By Dan Kogai E<lt>dankogai(_at_)dan(_dot_)co(_dot_)jpE<gt>.  Send your
 comments to E<lt>perl-unicode(_at_)perl(_dot_)orgE<gt>.  You can subscribe via

Attachment: pgpelNCmyWWPJ.pgp
Description: PGP signature