perl-unicode

Re: [PATCH 1/2 + 0.1] Supported.pod

2002-04-05 05:13:35
Hello!

I have just read Jungshik's mail and patched Supported.pod
a bit more: added the (x-)windows-949 aliases.
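
For what it's worth, the zero-byte and BOM points made in the UTF-16 text of the patch below can be checked with a short sketch. It uses Python's standard codecs rather than Encode, purely as an illustration; the sample string and the variable names are assumptions and not part of the patch.

```python
import codecs

# Hypothetical sample text: Hangul plus some ASCII.
text = "한글 test"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")        # BOM followed by native byte order
utf16be = text.encode("utf-16-be")   # explicit byte order, no BOM written

# The generic "utf-16" codec prepends a byte order mark on encoding.
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ASCII characters occupy two bytes in UTF-16, one of which is zero;
# UTF-8 never produces a zero byte unless the text contains U+0000.
utf8_has_nul = b"\x00" in utf8
utf16_has_nul = b"\x00" in utf16be

print("UTF-8 bytes:   ", len(utf8), "contains NUL:", utf8_has_nul)
print("UTF-16BE bytes:", len(utf16be), "contains NUL:", utf16_has_nul)
print("UTF-16 starts with a BOM:", has_bom)
```

The zero bytes are what tends to trip up byte-oriented tools in a pipe, while the BOM is what lets a reader of plain "UTF-16" data work out the byte order.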

--- ext/Encode/lib/Encode/Supported.orig.pod    Fri Apr  5 01:00:36 2002
+++ ext/Encode/lib/Encode/Supported.pod Fri Apr  5 15:18:25 2002
@@ -63,7 +63,7 @@
   ascii         US-ascii                                   [ECMA]
   iso-8859-1   latin1                                       [ISO]
   utf8          UTF-8                                   [RFC2279]
-  UCS-2                ucs2, iso-10646-1, UTF-16LE             [IANA, UC]
+  UCS-2                ucs2, iso-10646-1, UTF-16BE             [IANA, UC]
   UTF-16LE      UCS-2LE                                       [UC]
   ----------------------------------------------------------------
 
@@ -188,8 +188,11 @@
 
   ----------------------------------------------------------------
   euc-kr               MacKorean                        [RFC1557]
-               cp949                   ks_c_5601-1987 is an alias
-                                       thereof.
+               cp949                   ks_c_5601-1987
+                                        windows-949 
+                                        x-windows-949
+                                        uhc
+                                        are aliases thereof.
   iso-2022-kr                                           [RFC1557]
   johab                                  [KS X 1001:1998, Annex 3]
   ksc5601-raw                          KSC5601 as is
@@ -456,14 +459,42 @@
 C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
 with Encode. See L<Encode::KR -- Korea> for details.
 
-  UTF-16 
+  UTF-16 UTF-16BE UTF-16LE
 
-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+
+=over 2
+
+=item * 
+
+C<UTF-16> support in any software you're going to be 
+using/interoperating with has probably been less tested 
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C<ASCII> form data. To get a general impression refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
+While encoding of form data has stabilized for C<UTF-8> coded pages 
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to 
+expect fun (and cross-browser discrepancies) with C<UTF-16> coded 
+pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and unless you really benefit from using C<UTF-16>.
 
-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to 
-the lack of browser support.
 
   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
@@ -498,7 +529,8 @@
 for details.
 
 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
-this common misusage. 
+this common misusage. Other aliases are C<x-windows-949> (used by
+Mozilla), C<windows-949> and C<uhc>.
 I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>.
 
 See L<Encode::KR -- Korea> for details.
@@ -515,7 +547,7 @@
 
 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 
 See L<Encode::CN -- Continental China> for details.
 
@@ -568,6 +600,23 @@
 belongs.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 example of being both a CCS and CES.
 
+=item charset (in MIME context)
+
+has long been used in the meaning of C<encoding>, CES.
+
+While the C<character set> word combination has lost this meaning
+in MIME context since [RFC 2130], the C<charset> abbreviation has
+retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ...  (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+                                               [RFC 2277]
+
 =item EUC
 
 Extended Unix Character.  See ISO-2022
@@ -683,7 +732,7 @@
 
 =item czyborra.com
 
-<http://czyborra.com/>
+L<http://czyborra.com/>
 
 Contains a a lot of useful information, especially gory details of ISO
 vs. vendor mappings.
@@ -697,6 +746,37 @@
 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 
 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+a comprehensive overview of the Korean (C<KS *>) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN: 1-56592-224-7
+
+The modern successor of the C<CJK.inf>.
+
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L<http://www.oreilly.com/catalog/cjkvinfo/>
 
 =back

