perl-unicode

Fwd: [PATCH][docs] Encode.pm

2002-03-20 07:56:42
This is a forwarded message
From: Anton Tagunov <tagunov(_at_)motor(_dot_)ru>
To: perl5-porters(_at_)perl(_dot_)org <perl5-porters(_at_)perl(_dot_)org>
Date: Tuesday, March 19, 2002, 9:48:06 PM
Subject: [PATCH][docs] Encode.pm

===8<==============Original message text===============
Hello, developers!

With my improved knowledge of encoding naming, I propose the patch below.

Justification:

1)
  Shift-JIS -> Shift_JIS does not hurt anyone, because it does
                         not work either way: Encode::encode
                         understands only 'shiftjis'.

                         I would prefer to settle the naming
                         first; I am going to submit a separate
                         bug report later for all the aliases
                         that do not work.
2)
  I do not care too much if I have classified some encodings
  wrongly: I hope that as soon as something like this
  gets into the docs we'll get plenty of feedback, enough
  to correct even the worst mistakes :-) To me it looks
  good enough just to start the section.

  <DISCLAIMER>
   The main goal was to separate MIME names from
   ISO names from proprietary names.
  </DISCLAIMER>
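
  Point 1 above can be checked against an installed Encode. The
  following is a minimal sketch, assuming a present-day Encode in
  which find_encoding() and the name() method are available. The
  helper canonical_names() and the list of spellings are only
  illustrations; which spellings resolve depends on the installed
  version and its alias table.

```perl
use strict;
use warnings;
use Encode ();

# Spellings discussed in this message. Whether each one resolves
# depends on the installed Encode and its alias table.
my @candidates = ('Shift_JIS', 'Shift-JIS', 'shiftjis', 'HZ', 'HZ-GB-2312');

# Hypothetical helper: map each spelling to the canonical name that
# Encode reports for it, or undef when Encode does not recognize it.
sub canonical_names {
    my @names = @_;
    my %result;
    for my $name (@names) {
        my $enc = Encode::find_encoding($name);
        $result{$name} = defined $enc ? $enc->name : undef;
    }
    return \%result;
}

my $resolved = canonical_names(@candidates);
for my $name (@candidates) {
    printf "%-12s => %s\n", $name, $resolved->{$name} // '(not recognized)';
}
```

  A spelling reported as not recognized is a candidate for the
  separate alias bug report mentioned in point 1.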
  
Comment:

  JIS 0201
  JIS 0208
  JIS 0212
  GB 1988
  GB 2312

  are names I strongly suspect to be wrong, but I have posted
  separate mails about them.

Grumbling:

  CNS 11643
  GB 12345

  really hurt my feelings because they have a space inside,
  but I have found no reason to touch them: neither
  IANA nor RFC 1345 names them, and everywhere I've seen them
  they are written with a space.
  Do you think they could still be changed to CNS-.., GB-
  for consistency and beauty?  :-)

Proposition:

  Should the name HZ-GB-2312 be established as a synonym for HZ?
  Or is it not worth the trouble?
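
  Until that is decided, a script can add such a synonym locally.
  This is only a sketch, assuming a present-day Encode in which
  Encode::Alias exports define_alias() and the 'hz' encoding is
  loadable on demand. The helper add_synonym() is hypothetical and
  is not part of the patch below.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Alias qw(define_alias);

# Hypothetical helper: register $alias as an extra name for $target,
# then report the canonical name Encode resolves $alias to
# (undef if nothing matches).
sub add_synonym {
    my ($alias, $target) = @_;
    define_alias($alias => $target);
    my $enc = Encode::find_encoding($alias);
    return defined $enc ? $enc->name : undef;
}

my $name = add_synonym('HZ-GB-2312' => 'hz');
print defined $name
    ? "HZ-GB-2312 now resolves to $name\n"
    : "HZ-GB-2312 could not be resolved here\n";
```

  Such an alias lasts only for the running process, so it leaves
  the naming question itself open.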

Looking forward to your opinions! :-)))

- Anton


--- ext/Encode/Encode.pm.orig   Mon Mar 18 00:20:24 2002
+++ ext/Encode/Encode.pm        Tue Mar 19 21:42:26 2002
@@ -500,34 +500,34 @@
 
   ISO 10646-1 => UCS-2
 
-The ISO 8859 and KOI:
+The ISO-8859 and KOI:
 
-  ISO 8859-1  ISO 8859-6   ISO 8859-11         KOI8-F
-  ISO 8859-2  ISO 8859-7   (12 doesn't exist)  KOI8-R
-  ISO 8859-3  ISO 8859-8   ISO 8859-13         KOI8-U
-  ISO 8859-4  ISO 8859-9   ISO 8859-14
-  ISO 8859-5  ISO 8859-10  ISO 8859-15
-                           ISO 8859-16
-
-  Latin1  => 8859-1  Latin6  => 8859-10
-  Latin2  => 8859-2  Latin7  => 8859-13
-  Latin3  => 8859-3  Latin8  => 8859-14
-  Latin4  => 8859-4  Latin9  => 8859-15
-  Latin5  => 8859-9  Latin10 => 8859-16
-
-  Cyrillic => 8859-5
-  Arabic   => 8859-6
-  Greek    => 8859-7
-  Hebrew   => 8859-8
-  Thai     => 8859-11
-  TIS620   => 8859-11
+  ISO-8859-1  ISO-8859-6   ISO-8859-11         KOI8-F
+  ISO-8859-2  ISO-8859-7   (12 doesn't exist)  KOI8-R
+  ISO-8859-3  ISO-8859-8   ISO-8859-13         KOI8-U
+  ISO-8859-4  ISO-8859-9   ISO-8859-14
+  ISO-8859-5  ISO-8859-10  ISO-8859-15
+                           ISO-8859-16
+
+  Latin1  => ISO-8859-1  Latin6  => ISO-8859-10
+  Latin2  => ISO-8859-2  Latin7  => ISO-8859-13
+  Latin3  => ISO-8859-3  Latin8  => ISO-8859-14
+  Latin4  => ISO-8859-4  Latin9  => ISO-8859-15
+  Latin5  => ISO-8859-9  Latin10 => ISO-8859-16
+
+  Cyrillic => ISO-8859-5
+  Arabic   => ISO-8859-6
+  Greek    => ISO-8859-7
+  Hebrew   => ISO-8859-8
+  Thai     => ISO-8859-11
+  TIS620   => ISO-8859-11
 
 The CJKV: Chinese, Japanese, Korean, Vietnamese:
 
-  ISO 2022     ISO 2022 JP-1  JIS 0201  GB 1988   Big5       EUC-CN
-  ISO 2022 CN  ISO 2022 JP-2  JIS 0208  GB 2312   HZ         EUC-JP
-  ISO 2022 JP  ISO 2022 KR    JIS 0210  GB 12345  CNS 11643  EUC-JP-0212
-  Shift-JIS                            GBK       Big5-HKSCS EUC-KR
+  ISO-2022     ISO-2022-JP-1  JIS 0201  GB 1988   Big5       EUC-CN
+  ISO-2022-CN  ISO-2022-JP-2  JIS 0208  GB 2312   HZ         EUC-JP
+  ISO-2022-JP  ISO-2022-KR    JIS 0210  GB 12345  CNS 11643  EUC-JP-0212
+  Shift_JIS                            GBK       Big5-HKSCS EUC-KR
   VISCII                               ISO-IR-165
 
 (Due to size concerns, additional Chinese encodings including C<GB 18030>,
@@ -572,6 +572,59 @@
   DingBats    Roman8
   GSM 0338    Symbol
 
+=head2 Encoding Classification
+
+Encodings
+
+  US-ASCII    UTF-8       KOI8-R      ISO-8859-*
+  ISO-2022-CN ISO-2022-JP ISO-2022-KR Big5
+  EUC-CN      EUC-JP      EUC-KR
+
+are L<http://www.iana.org/assignments/character-sets>-registered
+as preferred MIME names and may probably be used over the Internet. 
+So is
+
+  Shift_JIS
+
+but despite its wide spread it bears the label of being 
+Microsoft proprietary.
+
+  UTF-16 KOI8-U ISO-2022-JP-2 
+
+are IANA-registered preferred MIME names but probably should
+be avoided as encoding for web pages due to lack of browser 
+support.
+
+
+  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
+  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
+  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
+  GBK 
+  VISCII
+  GB 12345      (only planes 1 and 2 available)
+  GB 18030
+  CNS 11643
+
+are totally valid encodings but not registered at IANA.
+
+  BIG5PLUS
+  EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)
+
+are a bit proprietary
+
+You may probably get some info on CJK encodings at
+
+  brief description for most of the mentioned CJK encodings
+   http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
+
+  several years old, but still useful
+   http://www.oreilly.com/people/authors/lunde/cjk_inf.html
+
+  and some in-depth reading for the heroes :-)
+   http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM (eq ISO-2022)
+   http://www.faqs.org/rfcs/rfc1345.txt
+
+
 =head1 PERL ENCODING API
 
 =head2 Generic Encoding Interface
@@ -598,7 +651,7 @@
 internal form and returns the resulting string.  For CHECK see
 L</"Handling Malformed Data">.
 
-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:
 
        $utf8 = decode("latin1", $latin1);
 
@@ -611,7 +664,7 @@
 encode() or through PerlIO: See L</"Encoding and IO">.  For CHECK
 see L</"Handling Malformed Data">.
 
-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:
 
        from_to($data, "iso-8859-1", "utf-8");
 
@@ -848,7 +901,7 @@
 "character operations" (e.g. C<lc>, C</\W+/>, ...).
 
 You can also use PerlIO to convert larger amounts of data you don't
-want to bring into memory.  For example to convert between ISO 8859-1
+want to bring into memory.  For example to convert between ISO-8859-1
 (Latin 1) and UTF-8 (or UTF-EBCDIC in EBCDIC machines):
 
     open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;
===8<===========End of original message text===========



-- 
Best regards,
 Anton                            mailto:tagunov(_at_)motor(_dot_)ru