I've been immersed in Big5-related issues for the past few days, and
have come back with these last-minute (err, last-week?) changes before
5.8-RC1. The diff below contains fixes to TW.pm, Alias.pm, and
README.(tw|cn).
(For jhi) The README fixes are trivial: they mention the new HanExtra
encodings, fix some mainland-China word usage, and add my Latin-1 name.
(For dan) big5-hkscs should be upgraded to the 2001 edition, as per the
Hong Kong government's decree. It's available separately at:
http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz
Also, please delete big5.ucm and replace it with big5-eten, at:
http://egb.elixus.org/~autrijus/big5-eten.ucm.gz
I've fixed Alias.pm so that big5 aliases to big5-eten. The reason is that
the 'Big5' as originally defined isn't used anywhere on earth:
non-Microsoft systems use 'big5' to mean 'big5-eten', and Microsoft
uses 'big5' to mean 'cp950'.
It is therefore unwise to have a canonical 'big5' encoding, much like
there should not be a 'gb2312' encoding. Since gb2312 is now aliased
to euc-cn and not cp936, I think big5 should alias to big5-eten and
not cp950.
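
To make the intended alias behaviour concrete, here is a rough sketch of
how the three patterns documented in the TW.pm hunk resolve names. It is
written in Python rather than Perl purely for illustration; the ALIASES
table and the resolve_alias() helper are hypothetical names of mine and
are not part of Encode or of this patch.

```python
import re

# Hypothetical table mirroring the alias patterns documented in TW.pm:
# each entry is (pattern, canonical encoding name), tried in order.
ALIASES = [
    (re.compile(r"\bbig-?5$", re.I), "big5-eten"),
    (re.compile(r"\bbig5-?et(?:en)?$", re.I), "big5-eten"),
    (re.compile(r"\bbig5-?hk(?:scs)?$", re.I), "big5-hkscs"),
]


def resolve_alias(name):
    """Return the canonical name for the first matching pattern, else None."""
    for pattern, canonical in ALIASES:
        if pattern.search(name):
            return canonical
    return None


if __name__ == "__main__":
    for name in ("big5", "Big-5", "big5-et", "big5eten", "big5-hk", "cp950"):
        print(name, "->", resolve_alias(name))
```

The point is only that a bare 'big5' (or 'Big-5') lands on big5-eten,
while cp950 is left alone and has to be requested by its own name.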
<!--
This matches T. H. Hsieh's similar decision for glibc-2.2:
<http://www.linux.org.tw/mail-archie/cle-devel/cle-devel.200009/msg00100.html>;
it also matches my FreeBSD charmap (and the dominant ETen charmap
in Taiwan). The Unicode mappings now also agree with libiconv-1.7's,
although the latter does not contain the ETen-specific parts.
-->
Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
the encoding is called 'gb2312-raw'. I admit that I don't fully
understand the reason, but if that's to stand, then big5-eten could also
be named 'big5.ucm', and still say '<code_set_name> "big5-eten"', for
consistency's sake.
Thanks,
/Autrijus/
--- /home/autrijus/perl/ext/Encode/TW/TW.pm Fri Apr 19 22:02:58 2002
+++ TW.pm Sat Apr 20 03:13:07 2002
@@ -30,10 +30,10 @@
Canonical Alias Description
--------------------------------------------------------------------
- big5 /\bbig-?5$/i The original Big5 encoding
- big5-hkscs /\bbig5-hk(scs)?$/i
- Big5 plus Cantonese characters in
- Hong Kong
+ big5-eten /\bbig-?5$/i Big5 encoding (with ETen extensions)
+ /\bbig5-?et(en)?$/i
+ big5-hkscs /\bbig5-?hk(scs)?$/i
+ Big5 + Cantonese characters in Hong Kong
MacChineseSimp Big5 + Apple Vendor Mappings
cp950 Code Page 950
= Big5 + Microsoft vendor mappings
@@ -44,11 +44,18 @@
=head1 NOTES
Due to size concerns, C<EUC-TW> (Extended Unix Character), C<CCCII>
-(Chinese Character Code for Information Interchange) and C<BIG5PLUS>
-(CMEX's Big5+) are distributed separately on CPAN, under the name
-L<Encode::HanExtra>. That module also contains extra China-based encodings.
+(Chinese Character Code for Information Interchange), C<BIG5PLUS>
+(CMEX's Big5+) and C<BIG5EXT> (CMEX's Big5e) are distributed separately
+on CPAN, under the name L<Encode::HanExtra>. That module also contains
+extra China-based encodings.
=head1 BUGS
+
+Since the original C<big5> encoding (1984) is not supported anywhere
+(glibc and DOS-based systems use C<big5> to mean C<big5-eten>; Microsoft
+uses C<big5> to mean C<cp950>), a conscious decision was made to alias
+C<big5> to C<big5-eten>, which is the de facto superset of the original
+big5.
The C<CNS11643> encoding files are not complete. For common C<CNS11643>
manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
--- /home/autrijus/perl/ext/Encode/lib/Encode/Alias.pm Wed Apr 10 05:13:28 2002
+++ Alias.pm Sat Apr 20 03:11:11 2002
@@ -217,8 +217,9 @@
define_alias( qr/(?:x-)?windows-949$/i => '"cp949"' );
define_alias( qr/\bks_c_5601-1987$/i => '"cp949"' );
# for Encode::TW
- define_alias( qr/\bbig-?5$/i => '"big5"' );
- define_alias( qr/\bbig5-hk(?:scs)?$/i => '"big5-hkscs"' );
+ define_alias( qr/\bbig-?5$/i => '"big5-eten"' );
+ define_alias( qr/\bbig5-?et(?:en)?$/i => '"big5-eten"' );
+ define_alias( qr/\bbig5-?hk(?:scs)?$/i => '"big5-hkscs"' );
}
# utf8 is blessed :)
define_alias( qr/^UTF-8$/i => '"utf8"',);
--- /home/autrijus/perl/README.tw Thu Apr 18 06:01:01 2002
+++ README.tw Sat Apr 20 03:15:51 2002
@@ -29,8 +29,8 @@
Encode 延伸模組支援下列正體中文的編碼方式:
- big5 原始的 Big5 編碼 (含倚天日文字形)
- big5-hkscs Big5 + 香港外字集
+ big5 Big5 編碼 (含倚天延伸字形)
+ big5-hkscs Big5 + 香港外字集, 2001 年版
cp950 字碼頁 950 (Big5 + 微軟添加的字符)
舉例來說, 將 Big5 編碼的檔案轉成 Unicode, 祗需鍵入下列指令:
@@ -61,8 +61,10 @@
如果需要更多的中文編碼, 可以從 CPAN (L<http://www.cpan.org/>) 下載
Encode::HanExtra 模組. 它目前提供下列編碼方式:
+ cccii 1980 年文建會的中文資訊交換碼
euc-tw Unix 延伸字符集, 包含 CNS11643 平面 1-7
big5plus 中文數位化技術推廣基金會的 Big5+
+ big5ext 中文數位化技術推廣基金會的 Big5e
另外, Encode::HanConvert 模組則提供了簡繁轉換用的兩種編碼:
@@ -163,6 +165,6 @@
Jarkko Hietaniemi E<lt>jhi(_at_)iki(_dot_)fiE<gt>
-唐宗漢 E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
+Autrijus Tang (唐宗漢) E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
=cut
--- /home/autrijus/perl/README.cn Thu Apr 18 06:01:01 2002
+++ README.cn Sat Apr 20 03:15:43 2002
@@ -24,7 +24,7 @@
Perl 掛旯眕 Unicode 輛俴紱釬. 涴桶尨 Perl 囀窒腔趼睫揹杅擂褫蚚 Unicode
桶尨; Perl 腔滲宒迵呾睫 (瞰?諆?寞桶尨宒掀勤) 珩夔勤 Unicode 輛俴紱釬.
-婓怀?趧動銙麜?, 峈賸揭燴眕 Unicode 眳ゴ腔晤鎢源宒揣湔腔杅擂, Perl
+婓怀?趧動銙麜?, 峈賸揭燴眕 Unicode 眳ゴ腔晤鎢源宒湔溫腔杅擂, Perl
枑鼎賸 Encode 涴跺耀輸, 褫眕?藥蒯愻袢媔賺□匾椅踾孖迮覺鉰輮?擂.
Encode 晊扥耀輸盓堔狟蹈潠极笢恅腔晤鎢源宒:
@@ -36,7 +36,7 @@
cp936 趼鎢珜 936, 珩備峈 GBK (孺喃弊梓鎢)
hz 7 掀杻砯堤宒 GB2312 晤鎢
-撼瞰懂佽, 蔚 EUC-CN 晤鎢腔紫偶蛌傖 Unicode, 檍剒瑩?輴臏倗蜂?:
+撼瞰懂佽, 蔚 EUC-CN 晤鎢腔恅紫蛌傖 Unicode, 檍剒瑩?輴臏倗蜂?:
perl -Mencoding=euc-cn,STDOUT,utf8 -pe1 < file.euc-cn > file.utf8
@@ -51,12 +51,12 @@
# ぎ雄 euc-cn 趼揹賤昴; 梓袧怀堤?趧停縢撈簊騥暴駘? euc-cn 晤鎢
use encoding 'euc-cn', STDIN => 'euc-cn', STDOUT => 'euc-cn';
print length("醬邯"); # 2 (邧竘瘍桶尨趼睫)
- print length('醬邯'); # 4 (等竘瘍桶尨弇啋郪)
+ print length('醬邯'); # 4 (等竘瘍桶尨趼誹)
print index("袘袘諒餃", "閤遢"); # -1 (祥婦漪森赽趼睫揹)
print index('袘袘諒餃', '閤遢'); # 1 (植菴媼跺趼誹羲宎)
-婓郔綴珨蹈瞰赽爵, "袘" 腔菴媼跺弇啋郪迵 "袘" 腔菴珨跺弇啋郪賦磁傖 EUC-CN
-鎢腔 "閤"; "袘" 腔菴媼跺弇啋郪寀迵 "諒" 腔菴珨跺弇啋郪賦磁傖 "遢".
+婓郔綴珨蹈瞰赽爵, "袘" 腔菴媼跺趼誹迵 "袘" 腔菴珨跺趼誹賦磁傖 EUC-CN
+鎢腔 "閤"; "袘" 腔菴媼跺趼誹寀迵 "諒" 腔菴珨跺趼誹賦磁傖 "遢".
涴賤樵賸眕ゴ EUC-CN 鎢掀勤揭燴奻都獗腔恀枙.
=head2 塗俋腔笢恅晤鎢
@@ -143,6 +143,6 @@
Jarkko Hietaniemi E<lt>jhi(_at_)iki(_dot_)fiE<gt>
-昄跁犖 E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
+Autrijus Tang (昄跁犖) E<lt>autrijus(_at_)autrijus(_dot_)orgE<gt>
=cut