perl-unicode

The last pieces of Chinese puzzle

2002-03-04 19:27:12
I just finished the HZ support. Since its escape rules are a little
beyond what the "E" format can express (specifically, the escape ~~ and
the to-be-ignored ~\n), I opted for a regex-based approach.
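
For readers unfamiliar with HZ, the escape handling can be sketched
roughly as below. This is not the attached HZ.pm: hz_to_euc_cn is a
hypothetical helper name, and the ~{ and ~} mode switches are taken from
RFC 1843 rather than from this message. The sketch only turns HZ bytes
into EUC-CN bytes and assumes well-formed input.

```perl
use strict;
use warnings;

# Hypothetical helper (not the attached HZ.pm): convert HZ bytes to
# EUC-CN bytes, following the escape rules described in RFC 1843.
sub hz_to_euc_cn {
    my ($hz) = @_;
    my $out = '';
    my $gb  = 0;    # true while inside a ~{ ... ~} GB2312 run

    while (length $hz) {
        if ($gb) {
            if ($hz =~ s/\A~\}//) {
                $gb = 0;                         # back to ASCII mode
            }
            elsif ($hz =~ s/\A([\x21-\x77][\x21-\x7e])//) {
                # One GB2312 character: set the high bit on both bytes.
                (my $pair = $1) =~ tr/\x21-\x7e/\xa1-\xfe/;
                $out .= $pair;
            }
            else {
                die "malformed HZ input in GB mode\n";
            }
        }
        else {
            if    ($hz =~ s/\A~~//)      { $out .= '~' }  # escaped tilde
            elsif ($hz =~ s/\A~\n//)     { }              # ignored line continuation
            elsif ($hz =~ s/\A~\{//)     { $gb = 1 }      # enter GB mode
            elsif ($hz =~ s/\A([^~]+)//) { $out .= $1 }   # plain ASCII run
            else  { die "malformed HZ escape in ASCII mode\n" }
        }
    }
    return $out;
}
```

The resulting bytes could then be handed to Encode::decode('euc-cn', ...)
to obtain Perl characters. The two cases that the "E" format cannot
express are the ~~ and ~\n branches above.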

I tested it against libiconv's test file and found no problems. However,
I ran into a small problem when trying to apply this patch (it seems that
'patch' won't create the new directory for me), so HZ.pm is attached
separately. Please:

1. remove Encode/Encode/HZ.enc
2. mkdir Encode/lib/Encode/CN, and put HZ.pm there
3. apply the rest of the patch, included below.

Aside from HZ support, this patch also makes ::TW and ::CN try to
autoload Encode::HanExtra. The reason is that, once the user has
downloaded HanExtra.pm, it should be transparent to them whether we
choose to put a given encoding into the core or not. The patch also
fixes a couple of \t nits in the POD.
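
The optional-load idea can be sketched in isolation like this. The
try_load helper is a hypothetical name used only for illustration; the
patch itself simply uses a string eval at file scope. The sketch uses
require rather than use, so it does not call the module's import method.

```perl
use strict;
use warnings;

# Hypothetical helper: try to load an optional module by name and
# report whether that worked, without disturbing the caller's $@.
sub try_load {
    my ($module) = @_;
    die "bad module name\n" unless $module =~ /\A\w+(?:::\w+)*\z/;
    local $@;    # a failed load must not leak into the caller's $@
    return eval "require $module; 1" ? 1 : 0;
}

# Core encodings keep working whether or not the add-on is installed.
my $have_extra = try_load('Encode::HanExtra');
print $have_extra ? "extra encodings loaded\n" : "core encodings only\n";
```

Because a failed load is silently ignored, users without HanExtra see no
difference, while users who have it get the extra encodings registered
as a side effect of loading ::TW or ::CN.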

I'll work on the tests when I find some time...

Thanks,
/Autrijus/

diff -dur Encode/CN/CN.pm Encode.new/CN/CN.pm
--- Encode/CN/CN.pm     Tue Mar  5 06:50:47 2002
+++ Encode.new/CN/CN.pm Tue Mar  5 09:59:25 2002
@@ -1,8 +1,13 @@
 package Encode::CN;
-use Encode;
 our $VERSION = '0.02';
+
+use Encode;
+use Encode::CN::HZ;
 use XSLoader;
 XSLoader::load('Encode::CN',$VERSION);
+
+local $@;
+eval "use Encode::HanExtra"; # load extra encodings if they exist
 
 1;
 __END__
@@ -25,7 +29,8 @@
   gb2312       The raw (low-bit) GB2312 character map
   gb12345      Traditional chinese counterpart to GB2312 (raw)
   iso-ir-165   GB2312 + GB6345 + GB8565 + additions
-  cp936        Code Page 936, also known as GBK (Extended GuoBiao)
+  cp936                Code Page 936, also known as GBK (Extended GuoBiao)
+  hz           7-bit escaped GB2312 encoding
 
 To find how to use this module in detail, see L<Encode>.
 
@@ -35,9 +40,10 @@
 separately on CPAN, under the name L<Encode::HanExtra>. That module
 also contains extra Taiwan-based encodings.
 
-=head1 BUGS
+This module will automatically load L<Encode::HanExtra> if you have it on
+your machine.
 
-The C<HZ> (Hanzi) escaped encoding is not supported.
+=head1 BUGS
 
 ASCII part (0x00-0x7f) is preserved for all encodings, even though it
 conflicts with mappings by the Unicode Consortium.  See
diff -dur Encode/KR/KR.pm Encode.new/KR/KR.pm
--- Encode/KR/KR.pm     Tue Mar  5 06:50:47 2002
+++ Encode.new/KR/KR.pm Tue Mar  5 10:01:05 2002
@@ -1,6 +1,7 @@
 package Encode::KR;
-use Encode;
 our $VERSION = '0.02';
+
+use Encode;
 use XSLoader;
 XSLoader::load('Encode::KR',$VERSION);
 
@@ -23,7 +24,7 @@
 
   euc-kr       EUC (Extended Unix Character)
   ksc5601      Korean standard code set
-  cp949        Code Page 949 (EUC-KR + Unified Hangul Code)
+  cp949                Code Page 949 (EUC-KR + Unified Hangul Code)
   
 To find how to use this module in detail, see L<Encode>.
 
diff -dur Encode/MANIFEST Encode.new/MANIFEST
--- Encode/MANIFEST     Tue Mar  5 06:50:47 2002
+++ Encode.new/MANIFEST Tue Mar  5 10:00:38 2002
@@ -95,7 +95,6 @@
 Encode/gb1988.enc
 Encode/gb2312.enc
 Encode/gsm0338.enc
-Encode/HZ.enc
 Encode/iso-ir-165.enc
 Encode/ir-197.enc
 Encode/jis0201.enc
@@ -155,6 +154,7 @@
 lib/Encode/Unicode.pm
 lib/Encode/utf8.pm
 lib/Encode/XS.pm
+lib/Encode/CN/HZ.pm
 lib/Encode/Tcl/Escape.pm
 lib/Encode/Tcl/Extended.pm
 lib/Encode/Tcl/HanZi.pm
diff -dur Encode/TW/TW.pm Encode.new/TW/TW.pm
--- Encode/TW/TW.pm     Tue Mar  5 06:50:47 2002
+++ Encode.new/TW/TW.pm Tue Mar  5 09:59:21 2002
@@ -1,9 +1,13 @@
 package Encode::TW;
-use Encode;
 our $VERSION = '0.02';
+
+use Encode;
 use XSLoader;
 XSLoader::load('Encode::TW',$VERSION);
 
+local $@;
+eval "use Encode::HanExtra"; # load extra encodings if they exist
+
 1;
 __END__
 =head1 NAME
@@ -23,7 +26,7 @@
 
   big5         The original Big5 encoding
   big5-hkscs   Big5 plus Cantonese characters in Hong Kong
-  cp950        Code Page 950 (Big5 + Microsoft vendor mappings)
+  cp950                Code Page 950 (Big5 + Microsoft vendor mappings)
   
 To find how to use this module in detail, see L<Encode>.
 
@@ -32,6 +35,9 @@
 Due to size concerns, C<EUC-TW> (Extended Unix Character) and C<BIG5PLUS>
 (CMEX's Big5+) are distributed separately on CPAN, under the name
 L<Encode::HanExtra>. That module also contains extra China-based encodings.
+
+This module will automatically load L<Encode::HanExtra> if you have it on
+your machine.
 
 =head1 BUGS
 
--- Encode/Encode.pm    Tue Mar  5 06:50:47 2002
+++ Encode.new/Encode.pm        Tue Mar  5 10:05:33 2002
@@ -173,7 +173,6 @@
 # TODO: HP-UX '8' encodings arabic8 greek8 hebrew8 kana8 thai8 turkish8
 # TODO: HP-UX '15' encodings japanese15 korean15 roi15
 # TODO: Cyrillic encoding ISO-IR-111 (useful?)
-# TODO: Chinese encodings HZ
 # TODO: Armenian encoding ARMSCII-8
 # TODO: Hebrew encoding ISO-8859-8-1
 # TODO: Thai encoding TCVN

Attachment: HZ.pm
Description: Perl program

Attachment: pgpO1embqGnzi.pgp
Description: PGP signature
