perl-unicode

[PATCH] Re: [Encode] Encode::Supported revised

2002-04-04 04:32:51
Hello, Dan!

1)
This is my second portion of comments on the renewed Supported.pod.
This part is 100% orthogonal to the first part.

2)

This patch
- changes the status of KOI8-U following Jungshik's comment
  (sorry, I have never tested that myself :-( )
- upgrades GB2312 to a "first class citizen"
  (why not?)
- adds a section on Microsoft naming acrobatics
- includes a comment on the Shift_JIS
  differences between JIS X 0208-1997 Appendix 1
  and cp932
- ...
- makes clear that Encode supports
  the standards for GB2312 and Big5, not the Microsoft
  extensions (have I grasped it right? :-)
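
For anyone who wants to check the alias behaviour the patched POD
describes, here is a minimal sketch. It assumes an Encode build in
which GB2312 and KS_C_5601-1987 are aliased as stated in the patch;
the helper name encodable() and the sample character U+9555 (a hanzi
that, as far as I know, is in GBK/cp936 but not in GB 2312-80) are my
own choices for illustration, not anything taken from Encode itself.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);

# Print the canonical Encode name behind each disputed label.
for my $name (qw(GB2312 KS_C_5601-1987 Shift_JIS Big5)) {
    my $canonical = resolve_alias($name) || '(not found)';
    print "$name => $canonical\n";
}

# Hypothetical helper: true if $char encodes cleanly in $encoding.
sub encodable {
    my ($encoding, $char) = @_;
    my $copy  = $char;    # encode() may modify its input when CHECK is set
    my $bytes = eval { encode($encoding, $copy, Encode::FB_CROAK()) };
    return defined $bytes;
}

# Assumption: U+9555 is outside GB 2312-80 but inside GBK/cp936.
my $sample = "\x{9555}";
for my $encoding (qw(euc-cn cp936)) {
    printf "U+9555 in %s: %s\n", $encoding,
        encodable($encoding, $sample) ? "yes" : "no";
}
```

If the aliases are as documented, the first loop should show GB2312
resolving to the standard EUC-CN encoding rather than to cp936, which
is the distinction the new section tries to spell out.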

--- ext/Encode/lib/Encode/Supported.pod.orig    Mon Apr  1 03:42:52 2002
+++ ext/Encode/lib/Encode/Supported.pod Thu Apr  4 15:16:10 2002
@@ -308,8 +308,8 @@
 
 =item * 
 
-To (en|de) code Encodings marked as C<*>, You need C<Encode::HanExtra>
-,available from CPAN.
+To (en|de)code encodings marked as C<(*)>, you need 
+C<Encode::HanExtra>, available from CPAN.
 
 =back
 
@@ -317,33 +317,43 @@
 
   US-ASCII    UTF-8     ISO-8859-*  KOI8-R
   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
-  EUC-KR      Big5
+  EUC-KR      Big5      GB2312
 
-are registered to IANA as preferred MIME names and may probably be used over the Internet.
+are registered to IANA as preferred MIME names and may probably 
+be used over the Internet.
 
-C<Shift_JIS> is no longer Microsft proprietary since it has been
-officialized by JIS X 0208-1997.
+C<Shift_JIS> has been officialized by JIS X 0208-1997.
+L<Microsoft-related naming mess> gives details.
+
+C<GB2312> is the IANA name for C<EUC-CN>.
+See L<Microsoft-related naming mess> for details.
+
+C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
+with Encode. See L<Encode::CN -- Continental China> for details.
 
   EUC-CN
+  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 
-has not been registered with IANA (as of march 2002) but
-seems to be supported by major web browsers. In Encode, GB2312
-is aliased to EUC-CN, with "uncooked" version of GB2312 canonicalized
-as gb2312-raw.  See L<Encode::CN> for details.
+have not been registered with IANA (as of March 2002) but
+seem to be supported by major web browsers. 
+The IANA name for C<EUC-CN> is C<GB2312>.
 
   KS_C_5601-1987
 
-has been registered to IANA but when they are used, they are
-EUC-coded.  Internet community in Korea is not happy with this.
-so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
-of C<euc-kr>, with ksc5601-raw for "uncooked".
+is heavily misused.
+See L<Microsoft-related naming mess> for details.
+
+C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
+with Encode. See L<Encode::KR -- Korea> for details.
 
   UTF-16 
-  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 
-are IANA-registered (C<UTF-16> even as a preferred MIME name)
+=for comment
+waiting for comments from Jungshik Shin to soften this - Anton
+
+is an IANA-registered preferred MIME name
 but probably should be avoided as encoding for web pages due to 
-the lack of browser supports.
+the lack of browser support.
 
   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
@@ -360,6 +370,73 @@
   BIG5PLUS (*)
 
 is a bit proprietary name. 
+
+=head2 Microsoft-related naming mess
+
+Microsoft products misuse the following names:
+
+=over 2
+
+=item KS_C_5601-1987
+
+Microsoft extension to C<EUC-KR>.
+
+Proper name: C<CP949>.
+
+See
+http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
+for details.
+
+Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
+this common misuse. 
+I<Raw> C<KS_C_5601-1987> encoding is available as C<ksc5601-raw>.
+
+See L<Encode::KR -- Korea> for details.
+
+=item GB2312
+
+Microsoft extension to C<EUC-CN>.
+
+Proper names: C<CP936>, C<GBK>.
+
+C<GB2312> has been registered in the C<EUC-CN> meaning at
+IANA. This has partially repaired the situation: Microsoft's 
+C<GB2312> has become a superset of the official C<GB2312>.
+
+Encode aliases C<GB2312> to C<euc-cn> in full agreement with
+IANA registration. C<cp936> is supported separately.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
+
+See L<Encode::CN -- Continental China> for details.
+
+=item Big5
+
+Microsoft extension to C<Big5>.
+
+Proper name: C<CP950>.
+
+Encode separately supports C<Big5> and C<cp950>.
+
+=item Shift_JIS
+
+Microsoft's understanding of C<Shift_JIS>.
+
+JIS has not endorsed the full Microsoft standard, however.
+The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
+subsets, while Microsoft has always used C<Shift_JIS> to mean
+an encoding of a wider character repertoire.
+
+As the historical predecessor, Microsoft's variant
+probably has a stronger claim to the name, although it may be objected
+that Microsoft shouldn't have used JIS as part of the name
+in the first place.
+
+Unambiguous name: C<CP932>.
+
+Encode separately supports C<Shift_JIS> and C<cp932>.
+
+=back
+
 
 =head1 Bookmarks
 
What do you think of it, Dan? :-)

3)

Jungshik, I would certainly have advocated linking not only to
http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
but also to your comments on KS_C_5601-1987 in the list archive,
but each of your mails covered several subjects.

Jungshik> ... refer to Ken Lunde's CJKV Information Processing
Jungshik> about that 'epic war' between two camps. (see p.197 of
Jungshik> the book and http://jshin.net/faq/qa8.html)
Jungshik> We even set up a web page to prevent M$ from spreading that
Jungshik> ill-defined name.

Maybe we could link to this page? What is its address?

4)

The bug
[ID 20020312.006] pod2html does not translate space to '_' in L<>-s
certainly still spoils our links. I have sent a new mail on that to
perl5-porters.

Furthermore, I don't understand why C<gb2312-raw> converts
to <CODE>gb2312-raw</CODE> while C<GB2312> becomes a link.

Anyway, I have put C<> around the names, but if that feature/bug
persists it may be better to drop the C<> from my patch.

- Anton