perl-unicode

[PATCH] Big5-related changes.

2002-04-19 13:04:08
I've been immersed in Big5-related issues in the past few days, and
came back with these last-minute (err, week?) changes before 5.8-RC1.

The diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

(For jhi) The README fixes are trivial -- they mention the new HanExtra
encodings, fix some China word usage, and add my Latin-1 name.

(For dan) big5-hkscs should be upgraded to the 2001 edition, as per
the Hong Kong government's decree. It's available separately at:

    http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz

Also, please delete big5.ucm and replace it with big5-eten, at:

    http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

I've fixed Alias.pm so that big5 aliases to big5-eten. The reason is that
'Big5' as originally defined isn't used anywhere on earth; non-Microsoft
systems use 'big5' to mean 'big5-eten', and Microsoft uses 'big5' to
mean 'cp950'.

It is therefore unwise to have a canonical 'big5' encoding, much like
there should not be a 'gb2312' encoding. Since gb2312 is now aliased
to euc-cn and not cp936, I think big5 should alias to big5-eten and
not cp950.
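
To make the intended aliasing concrete, here is a minimal sketch in plain Perl. It uses the three alias patterns as listed in the TW.pm table in the patch below; the `resolve` helper and the `@aliases` table are hypothetical names for illustration only, not the actual Encode::Alias internals.

```perl
use strict;
use warnings;

# Alias patterns as documented in the TW.pm table of this patch.
# @aliases and resolve() are illustrative assumptions, not Encode's API.
my @aliases = (
    [ qr/\bbig-?5$/i           => 'big5-eten'  ],
    [ qr/\bbig5-?et(?:en)?$/i  => 'big5-eten'  ],
    [ qr/\bbig5-?hk(?:scs)?$/i => 'big5-hkscs' ],
);

# Return the canonical name for the first matching pattern,
# or nothing if no alias applies.
sub resolve {
    my ($name) = @_;
    for my $pair (@aliases) {
        my ($pattern, $canonical) = @$pair;
        return $canonical if $name =~ $pattern;
    }
    return;
}

for my $name (qw(big5 Big-5 big5-eten big5et big5hk big5-hkscs cp950)) {
    my $canonical = resolve($name);
    printf "%-10s => %s\n", $name,
        defined $canonical ? $canonical : '(no alias)';
}
```

Under these patterns a bare 'big5' never names an encoding of its own; it always resolves to big5-eten, while cp950 has to be requested explicitly.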

<!--
This agrees with T. H. Hsieh's similar decision on glibc-2.2:
<http://www.linux.org.tw/mail-archie/cle-devel/cle-devel.200009/msg00100.html>;
it also agrees with my FreeBSD charmap (and the dominant ETen charmap
in Taiwan). The Unicode mappings now also agree with libiconv-1.7's,
although the latter does not contain the ETen-specific parts.
-->

Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
the encoding is called 'gb2312-raw'. I admit that I don't fully
understand the reason, but if that's to stand, then big5-eten could also
be named 'big5.ucm', and still say '<code_set_name> "big5-eten"', for
consistency's sake.

Thanks,
/Autrijus/

--- /home/autrijus/perl/ext/Encode/TW/TW.pm     Fri Apr 19 22:02:58 2002
+++ TW.pm       Sat Apr 20 03:13:07 2002
@@ -30,10 +30,10 @@
 
   Canonical   Alias            Description
   --------------------------------------------------------------------
-  big5        /\bbig-?5$/i     The original Big5 encoding
-  big5-hkscs  /\bbig5-hk(scs)?$/i
-                                Big5 plus Cantonese characters in 
-                                Hong Kong
+  big5-eten   /\bbig-?5$/i     Big5 encoding (with ETen extensions)
+             /\bbig5-?et(en)?$/i
+  big5-hkscs  /\bbig5-?hk(scs)?$/i
+                                Big5 + Cantonese characters in Hong Kong
   MacChineseSimp               Big5 + Apple Vendor Mappings
   cp950                                Code Page 950 
                                 = Big5 + Microsoft vendor mappings
@@ -44,11 +44,18 @@
 =head1 NOTES
 
 Due to size concerns, C<EUC-TW> (Extended Unix Character), C<CCCII>
-(Chinese Character Code for Information Interchange) and C<BIG5PLUS>
-(CMEX's Big5+) are distributed separately on CPAN, under the name
-L<Encode::HanExtra>. That module also contains extra China-based encodings.
+(Chinese Character Code for Information Interchange), C<BIG5PLUS>
+(CMEX's Big5+) and C<BIG5EXT> (CMEX's Big5e) are distributed separately
+on CPAN, under the name L<Encode::HanExtra>. That module also contains
+extra China-based encodings.
 
 =head1 BUGS
+
+Since the original C<big5> encoding (1984) is not supported anywhere
+(glibc and DOS-based systems use C<big5> to mean C<big5-eten>; Microsoft
+uses C<big5> to mean C<cp950>), a conscious decision was made to alias
+C<big5> to C<big5-eten>, which is the de facto superset of the original
+big5.
 
 The C<CNS11643> encoding files are not complete. For common C<CNS11643>
 manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
--- /home/autrijus/perl/ext/Encode/lib/Encode/Alias.pm  Wed Apr 10 05:13:28 2002
+++ Alias.pm    Sat Apr 20 03:11:11 2002
@@ -217,8 +217,9 @@
         define_alias( qr/(?:x-)?windows-949$/i    => '"cp949"' );
         define_alias( qr/\bks_c_5601-1987$/i      => '"cp949"' );
         # for Encode::TW
-       define_alias( qr/\bbig-?5$/i              => '"big5"' );
-       define_alias( qr/\bbig5-hk(?:scs)?$/i     => '"big5-hkscs"' );
+       define_alias( qr/\bbig-?5$/i              => '"big5-eten"' );
+       define_alias( qr/\bbig5-?et(?:en)?$/i     => '"big5-eten"' );
+       define_alias( qr/\bbig5-?hk(?:scs)?$/i    => '"big5-hkscs"' );
     }
     # utf8 is blessed :)
     define_alias( qr/^UTF-8$/i => '"utf8"',);
--- /home/autrijus/perl/README.tw       Thu Apr 18 06:01:01 2002
+++ README.tw   Sat Apr 20 03:15:51 2002
@@ -29,8 +29,8 @@
 
 Encode 延伸模組支援下列正體中文的編碼方式:
 
-    big5       原始的 Big5 編碼 (含倚天日文字形)
-    big5-hkscs Big5 + 香港外字集
+    big5       Big5 編碼 (含倚天延伸字形)
+    big5-hkscs Big5 + 香港外字集, 2001 年版
     cp950      字碼頁 950 (Big5 + 微軟添加的字符)
 
 舉例來說, 將 Big5 編碼的檔案轉成 Unicode, 祗需鍵入下列指令:
@@ -61,8 +61,10 @@
 如果需要更多的中文編碼, 可以從 CPAN (L<http://www.cpan.org/>) 下載
 Encode::HanExtra 模組. 它目前提供下列編碼方式:
 
+    cccii      1980 年文建會的中文資訊交換碼
     euc-tw     Unix 延伸字符集, 包含 CNS11643 平面 1-7
     big5plus   中文數位化技術推廣基金會的 Big5+
+    big5ext    中文數位化技術推廣基金會的 Big5e
 
 另外, Encode::HanConvert 模組則提供了簡繁轉換用的兩種編碼:
 
@@ -163,6 +165,6 @@
 
 Jarkko Hietaniemi E<lt>jhi(_at_)iki(_dot_)fiE<gt>
 
-唐宗漢 E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
+Autrijus Tang (唐宗漢) E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
 
 =cut
--- /home/autrijus/perl/README.cn       Thu Apr 18 06:01:01 2002
+++ README.cn   Sat Apr 20 03:15:43 2002
@@ -24,7 +24,7 @@
 
 Perl 本身以 Unicode 进行操作. 这表示 Perl 内部的字符串数据可用 Unicode
 表示; Perl 的函式与算符 (例如正规表示式比对) 也能对 Unicode 进行操作.
-在输入及输出时, 为了处理以 Unicode 之前的编码方式储存的数据, Perl
+在输入及输出时, 为了处理以 Unicode 之前的编码方式存放的数据, Perl
 提供了 Encode 这个模块, 可以让你轻易地读取及写入旧有的编码数据.
 
 Encode 延伸模块支援下列简体中文的编码方式:
@@ -36,7 +36,7 @@
     cp936      字码页 936, 也称为 GBK (扩充国标码)
     hz         7 比特逸出式 GB2312 编码
 
-举例来说, 将 EUC-CN 编码的档案转成 Unicode, 祗需键入下列指令:
+举例来说, 将 EUC-CN 编码的文档转成 Unicode, 祗需键入下列指令:
 
     perl -Mencoding=euc-cn,STDOUT,utf8 -pe1 < file.euc-cn > file.utf8
 
@@ -51,12 +51,12 @@
     # 启动 euc-cn 字串解析; 标准输出入及标准错误都设为 euc-cn 编码
     use encoding 'euc-cn', STDIN => 'euc-cn', STDOUT => 'euc-cn';
     print length("骆驼");             #  2 (双引号表示字符)
-    print length('骆驼');             #  4 (单引号表示位元组)
+    print length('骆驼');             #  4 (单引号表示字节)
     print index("谆谆教诲", "蛔唤"); # -1 (不包含此子字符串)
     print index('谆谆教诲', '蛔唤'); #  1 (从第二个字节开始)
 
-在最后一列例子里, "谆" 的第二个位元组与 "谆" 的第一个位元组结合成 EUC-CN
-码的 "蛔"; "谆" 的第二个位元组则与 "教" 的第一个位元组结合成 "唤".
+在最后一列例子里, "谆" 的第二个字节与 "谆" 的第一个字节结合成 EUC-CN
+码的 "蛔"; "谆" 的第二个字节则与 "教" 的第一个字节结合成 "唤".
 这解决了以前 EUC-CN 码比对处理上常见的问题.
 
 =head2 额外的中文编码
@@ -143,6 +143,6 @@
 
 Jarkko Hietaniemi E<lt>jhi(_at_)iki(_dot_)fiE<gt>
 
-唐宗汉 E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
+Autrijus Tang (唐宗汉) E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
 
 =cut
