perl-unicode

[PATCH 1/2] Supported.pod

2002-04-05 03:56:22
Hello, experts!

I have split my patch to Supported.pod into two levels.

This is the general-utility patch; it leaves out my
debatable changes [level 1/2].

- fixes some typos
- rewords section on UTF-16
- adds 'charset (MIME context)' to glossary
- adds a reference to Ken's CJKV book

Dan?

/Anton/


--- ext/Encode/lib/Encode/Supported.orig.pod    Fri Apr  5 01:00:36 2002
+++ ext/Encode/lib/Encode/Supported.pod Fri Apr  5 14:41:56 2002
@@ -63,7 +63,7 @@
   ascii         US-ascii                                   [ECMA]
   iso-8859-1   latin1                                       [ISO]
   utf8          UTF-8                                   [RFC2279]
-  UCS-2                ucs2, iso-10646-1, UTF-16LE             [IANA, UC]
+  UCS-2                ucs2, iso-10646-1, UTF-16BE             [IANA, UC]
   UTF-16LE      UCS-2LE                                       [UC]
   ----------------------------------------------------------------
 
@@ -456,14 +456,42 @@
 C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
 with Encode. See L<Encode::KR -- Korea> for details.
 
-  UTF-16 
+  UTF-16 UTF-16BE UTF-16LE
 
-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+
+=over 2
+
+=item * 
+
+C<UTF-16> support in any software you're going to be 
+using/interoperating with has probably been less tested 
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C<ASCII> form data. To get a general impression refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
+While encoding of form data has stabilized for C<UTF-8> coded pages
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with C<UTF-16> coded 
+pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and unless you really need to use C<UTF-16>.
 
-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to 
-the lack of browser support.
 
   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
@@ -515,7 +543,7 @@
 
 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 
 See L<Encode::CN -- Continental China> for details.
 
@@ -568,6 +596,23 @@
 belongs.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 example of being both a CCS and CES.
 
+=item charset (in MIME context)
+
+has long been used to mean C<encoding>, that is, a CES.
+
+While the phrase C<character set> has lost this meaning
+in MIME context since [RFC 2130], the abbreviation C<charset> has
+retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ...  (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+                                               [RFC 2277]
+
 =item EUC
 
 Extended Unix Character.  See ISO-2022
@@ -683,7 +728,7 @@
 
 =item czyborra.com
 
-<http://czyborra.com/>
+L<http://czyborra.com/>
 
 Contains a a lot of useful information, especially gory details of ISO
 vs. vendor mappings.
@@ -697,6 +742,36 @@
 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 
 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+a comprehensive overview of the Korean (C<KS *>) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
+
+The modern successor of C<CJK.inf>.
+
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L<http://www.oreilly.com/catalog/cjkvinfo/>
 
 =back
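
The following is not part of the patch. It is a minimal sketch of the byte-level points the reworded UTF-16 section makes: UTF-16 data carries zero bytes for ASCII text, and a BOM is what lets a reader tell the byte order. It uses Python's built-in codecs purely to inspect bytes and does not exercise Perl's Encode; the helper names has_zero_bytes and sniff_bom are hypothetical, chosen for illustration.

```python
import codecs

text = "Perl"

utf8 = text.encode("utf-8")
utf16be = text.encode("utf-16-be")  # big-endian, no BOM
utf16le = text.encode("utf-16-le")  # little-endian, no BOM


def has_zero_bytes(data: bytes) -> bool:
    """Return True if the data contains NUL bytes, which trip up byte-oriented tools."""
    return b"\x00" in data


def sniff_bom(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return "no BOM"


print(has_zero_bytes(utf8))      # False: ASCII text stays plain ASCII in UTF-8
print(has_zero_bytes(utf16be))   # True: every ASCII character gains a zero byte
print(sniff_bom(codecs.BOM_UTF16_BE + utf16be))  # UTF-16BE
print(sniff_bom(utf16le))        # no BOM: byte order must be known out of band
```

Without a BOM the two UTF-16 variants cannot be told apart from the data alone, which is one reason the BOM-carrying form is the one reported above to work in browsers.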

