perl-unicode

Re: 5.8 roadmap and Encode

2002-03-04 15:34:00
On Tue, Mar 05, 2002 at 06:46:49AM +0900, Dan Kogai wrote:
Note that the CN.a JP.a KR.a TW.a bring in extra 5.5 MB, which is
quite a bit extra, the whole Perl being here (x86 linux) only 6 MB...

  I agree.  Extra 5.5 Meg for single file is definitely too much (not a 
big deal for shared perl, however).

Agreed too.

In related news: the depot's ext/Encode/MANIFEST seems to list
        t/table.rnd
        t/table.utf8
but they are not found in the corresponding directory. Integration mistake?

Also, I just found out that gbk.enc is virtually the same as cp936
(unlike almost all other CPs, Microsoft didn't add any vendor extensions
to this codepage), so please delete Encode/gbk.enc; I added an alias
entry in Encode.pm instead.
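
A quick way to check the alias is sketched below. It assumes an Encode build in which "gbk" resolves to cp936 (as the alias entry intends); the sample string and variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Assumption: this Encode build has the "gbk" alias pointing at cp936.
my $gbk   = find_encoding("gbk")   or die "gbk is not available\n";
my $cp936 = find_encoding("cp936") or die "cp936 is not available\n";

# If the alias is in place, both lookups report the same canonical name.
printf "gbk -> %s, cp936 -> %s\n", $gbk->name, $cp936->name;

# Encoding the same characters under either name should give the same bytes.
my $text = "\x{4E2D}\x{6587}";    # two common Han characters
my $same = encode("gbk", $text) eq encode("cp936", $text);
print $same ? "byte-identical\n" : "different\n";
```

If the two names ever print differently, the alias is missing or a separate gbk table is still being built.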

Attached please find the above patches, with POD pages (thanks Dan!) 
for TW, CN and KR.pm. Oh, and CP949 really belongs to KR.

/Autrijus/

diff -dur Encode/CN/CN.pm Encode.new/CN/CN.pm
--- Encode/CN/CN.pm     Sat Mar  2 11:45:11 2002
+++ Encode.new/CN/CN.pm Tue Mar  5 06:18:06 2002
@@ -6,4 +6,48 @@
 
 1;
 __END__
-todo: HZ (Escape-based)
+=head1 NAME
+
+Encode::CN - China-based Chinese Encodings
+
+=head1 SYNOPSIS
+
+    use Encode::CN;
+    $euc_cn = encode("euc-cn", $utf8);
+    $utf8   = decode("euc-cn", $euc_cn);
+
+=head1 DESCRIPTION
+
+This module implements China-based Chinese charset encodings.
+Encodings supported are as follows.
+
+  euc-cn       EUC (Extended Unix Code)
+  gb2312       The raw (low-bit) GB2312 character map
+  gb12345      Traditional Chinese counterpart to GB2312 (raw)
+  iso-ir-165   GB2312 + GB6345 + GB8565 + additions
+  cp936        Code Page 936, also known as GBK (Extended GuoBiao)
+
+To find how to use this module in detail, see L<Encode>.
+
+=head1 NOTES
+
+Due to size concerns, C<GB 18030> (an extension to C<GBK>) is distributed
+separately on CPAN, under the name L<Encode::HanExtra>. That module
+also contains extra Taiwan-based encodings.
+
+=head1 BUGS
+
+The C<HZ> (Hanzi) escaped encoding is not supported.
+
+ASCII part (0x00-0x7f) is preserved for all encodings, even though it
+conflicts with mappings by the Unicode Consortium.  See
+
+F<http://www.debian.or.jp/~kubota/unicode-symbols.html.en>
+
+to find why it is implemented that way.
+
+=head1 SEE ALSO
+
+L<Encode>
+
+=cut
diff -dur Encode/CN/Makefile.PL Encode.new/CN/Makefile.PL
--- Encode/CN/Makefile.PL       Tue Mar  5 05:10:29 2002
+++ Encode.new/CN/Makefile.PL   Tue Mar  5 06:25:35 2002
@@ -3,7 +3,6 @@
 use ExtUtils::MakeMaker;
 
 my %tables = (EUC_CN   => ['euc-cn.enc'],
-             GBK      => ['gbk.enc'],
              GB2312   => ['gb2312.enc'],
              GB12345  => ['gb12345.enc'],
              CP936    => ['cp936.enc'],
diff -dur Encode/Encode.pm Encode.new/Encode.pm
--- Encode/Encode.pm    Tue Mar  5 00:29:25 2002
+++ Encode.new/Encode.pm        Tue Mar  5 05:57:54 2002
@@ -167,10 +167,13 @@
 # Seen in some Linuxes.
 define_alias( qr/^ujis$/i => 'euc-jp' );
 
+# CP936 doesn't have vendor-addon for GBK, so they're identical.
+define_alias( qr/^gbk$/i => '"cp936"');
+
 # TODO: HP-UX '8' encodings arabic8 greek8 hebrew8 kana8 thai8 turkish8
 # TODO: HP-UX '15' encodings japanese15 korean15 roi15
 # TODO: Cyrillic encoding ISO-IR-111 (useful?)
-# TODO: Chinese encodings GB18030 EUC-TW HZ
+# TODO: Chinese encodings HZ
 # TODO: Armenian encoding ARMSCII-8
 # TODO: Hebrew encoding ISO-8859-8-1
 # TODO: Thai encoding TCVN
diff -dur Encode/KR/KR.pm Encode.new/KR/KR.pm
--- Encode/KR/KR.pm     Sun Feb 17 01:12:34 2002
+++ Encode.new/KR/KR.pm Tue Mar  5 06:19:03 2002
@@ -6,6 +6,40 @@
 
 1;
 __END__
+=head1 NAME
 
-todo:
+Encode::KR - Korean Encodings
+
+=head1 SYNOPSIS
+
+    use Encode::KR;
+    $euc_kr = encode("euc-kr", $utf8);
+    $utf8   = decode("euc-kr", $euc_kr);
+
+=head1 DESCRIPTION
+
+This module implements Korean charset encodings.  Encodings supported
+are as follows.
+
+  euc-kr       EUC (Extended Unix Code)
+  ksc5601      Korean standard code set
+  cp949        Code Page 949 (EUC-KR + Unified Hangul Code)
+  
+To find how to use this module in detail, see L<Encode>.
+
+=head1 BUGS
+
+The C<Johab> (two-byte combination code) encoding is not supported.
+
+ASCII part (0x00-0x7f) is preserved for all encodings, even though it
+conflicts with mappings by the Unicode Consortium.  See
 
+F<http://www.debian.or.jp/~kubota/unicode-symbols.html.en>
+
+to find why it is implemented that way.
+
+=head1 SEE ALSO
+
+L<Encode>
+
+=cut
diff -dur Encode/KR/Makefile.PL Encode.new/KR/Makefile.PL
--- Encode/KR/Makefile.PL       Tue Feb 26 06:59:47 2002
+++ Encode.new/KR/Makefile.PL   Tue Mar  5 06:16:25 2002
@@ -4,6 +4,7 @@
 
 my %tables = (EUC_KR   => ['euc-kr.enc'],
              KSC5601  => ['ksc5601.enc'],
+             CP949    => ['cp949.enc'],
              );
 
 my $name = 'KR';
diff -dur Encode/MANIFEST Encode.new/MANIFEST
--- Encode/MANIFEST     Sun Feb 17 01:12:34 2002
+++ Encode.new/MANIFEST Tue Mar  5 06:21:44 2002
@@ -47,6 +47,7 @@
 Encode/ascii.enc
 Encode/ascii.ucm
 Encode/big5.enc
+Encode/big5-hkscs.enc
 Encode/cp1006.enc
 Encode/cp1047.enc
 Encode/cp1047.ucm
@@ -95,6 +96,7 @@
 Encode/gb2312.enc
 Encode/gsm0338.enc
 Encode/HZ.enc
+Encode/iso-ir-165.enc
 Encode/ir-197.enc
 Encode/jis0201.enc
 Encode/jis0208.enc
diff -dur Encode/TW/TW.pm Encode.new/TW/TW.pm
--- Encode/TW/TW.pm     Sat Mar  2 11:45:11 2002
+++ Encode.new/TW/TW.pm Tue Mar  5 06:16:38 2002
@@ -6,3 +6,49 @@
 
 1;
 __END__
+=head1 NAME
+
+Encode::TW - Taiwan-based Chinese Encodings
+
+=head1 SYNOPSIS
+
+    use Encode::TW;
+    $big5 = encode("big5", $utf8);
+    $utf8 = decode("big5", $big5);
+
+=head1 DESCRIPTION
+
+This module implements Taiwan-based Chinese charset encodings.
+Encodings supported are as follows.
+
+  big5         The original Big5 encoding
+  big5-hkscs   Big5 plus Cantonese characters in Hong Kong
+  cp950        Code Page 950 (Big5 + Microsoft vendor mappings)
+  
+To find how to use this module in detail, see L<Encode>.
+
+=head1 NOTES
+
+Due to size concerns, C<EUC-TW> (Extended Unix Code) and C<BIG5PLUS>
+(CMEX's Big5+) are distributed separately on CPAN, under the name
+L<Encode::HanExtra>. That module also contains extra China-based encodings.
+
+=head1 BUGS
+
+The C<CNS11643> encoding files are not complete (only the first two planes,
+C<11643-1> and C<11643-2>, exist in the distribution). For common CNS11643
+manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
+planes 1-7.
+
+ASCII part (0x00-0x7f) is preserved for all encodings, even though it
+conflicts with mappings by the Unicode Consortium.  See
+
+F<http://www.debian.or.jp/~kubota/unicode-symbols.html.en>
+
+to find why it is implemented that way.
+
+=head1 SEE ALSO
+
+L<Encode>
+
+=cut
