perl-unicode

[PATCH] Supported.pod: cleanup/UTF-16/CJK.inf + an invasion to the Glossary

2002-04-04 20:47:30
Hello, Dan! Hello, Jungshik and Autrijus!

Here's a new patch for Supported.pod.

It

- does minor cleanup
- rewrites the section on UTF-16 (I hope you and Jungshik will like it :-)
- adds a link to Jungshik's reference on Korean character set standards
- tries to add a link to Ken Lunde's offline book
- drops, for good, this commented-out link:

    I could not find this page because the hostname doesn't resolve!

     Brief description for most of the mentioned CJK encodings
    L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

- and.. a heavy invasion of the glossary

Gentlemen! Please excuse my boldness. Here's my vision of
what the Glossary should be.

As a justification for my changes here's a more expanded form:

http://tagunov.tripod.com/survey2.html

It has been largely upgraded in the last 24 hours,
but is still under construction.
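
To make the zero-byte caveat from the rewritten UTF-16 section (in the
patch below) concrete, here is a minimal sketch, assuming an Encode
that knows the UTF-16, UTF-16BE and UTF-16LE names. The helper name
nul_count is made up for this illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Encode): encode $text with the named
# encoding and count the NUL octets in the result.
sub nul_count {
    my ($name, $text) = @_;
    my $octets = encode($name, $text);
    return ($octets =~ tr/\0//);    # tr/// in scalar context returns the count
}

# Plain ASCII sample: every UTF-16 code unit carries a zero high byte.
my $text = "Hi";

for my $name (qw(UTF-8 UTF-16 UTF-16BE UTF-16LE)) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, encode($name, $text);
    printf "%-8s %-18s NUL bytes: %d\n", $name, $hex, nul_count($name, $text);
}
```

With a current Encode this should report no NUL bytes for UTF-8 and
two for each UTF-16 variant, with plain UTF-16 also starting with the
FE FF byte order mark. Those NUL bytes are what trips up tools that
treat their input as C strings.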


My heartiest regards,
    /Anton/


--- ext/Encode/lib/Encode/Supported.pod.orig    Fri Apr  5 01:00:36 2002
+++ ext/Encode/lib/Encode/Supported.pod Fri Apr  5 07:33:54 2002
@@ -63,7 +63,7 @@
   ascii         US-ascii                                   [ECMA]
   iso-8859-1   latin1                                       [ISO]
   utf8          UTF-8                                   [RFC2279]
-  UCS-2                ucs2, iso-10646-1, UTF-16LE             [IANA, UC]
+  UCS-2                ucs2, iso-10646-1, UTF-16BE             [IANA, UC]
   UTF-16LE      UCS-2LE                                       [UC]
   ----------------------------------------------------------------
 
@@ -456,14 +456,42 @@
 C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
 with Encode. See L<Encode::KR -- Korea> for details.
 
-  UTF-16 
+  UTF-16 UTF-16BE UTF-16LE
 
-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+
+=over 2
+
+=item * 
+
+C<UTF-16> support in any software you're going to be 
+using/interoperating with has probably been less tested 
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+generally the way HTML browsers encode non-C<ASCII> form data
+is beyond what words can tell. Refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
+for a general impression. Browser behavior, however, has stabilized
+for C<UTF-8> coded pages (at least IE 5/6, NS 6, Opera 6).
+Be sure to expect more fun (and discrepancies between browsers)
+with C<UTF-16> coded pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and really need C<UTF-16>.
 
-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to 
-the lack of browser support.
 
   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
   GBK
@@ -515,7 +543,7 @@
 
 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 
 See L<Encode::CN -- Continental China> for details.
 
@@ -556,17 +584,75 @@
 A collection of unique characters.  A I<character> set in the most
 strict sense. At this stage characters are not numberd.
 
-=item coded character set (CCS)
+=item CCS (RFC 2130 Coded Character Set)
+
+  A mapping from a set of abstract characters to a set of integers
+                                                         [RFC 2130]
+
+C<Unicode 3.2> is an example of a "pure" CCS.
+
+=item character encoding scheme (CES), encoding
+
+ -  A description of an algorithm which transforms every possible
+    sequence of octets to either a sequence of pairs <CCS, code
+    value> or to the  error state "illegal octet sequence"
+ -  Specifications, either by reference to CCS's registered by IANA or
+    in text, of each CCS upon which this CES is based.
+                                                         [RFC 2130]
+
+
+C<UTF-8>, C<UCS-2>, all of C<ISO-2022-*> and C<EUC-*> standards
+are "pure" CES's (do not define new CCS's on their own).
+
+=item ISO coded character set
+
+  coded character set; code: A set of unambiguous rules that
+  establishes a character set and the one-to-one relationship 
+  between the characters of the set and their coded 
+  representation.
+         [http://www.evertype.com/standards/iso8859/8859-14-en.pdf]
 
-A character set that is mapped in a way computers can use directly.
-Many character encodings including EUC falls in this category.
+Any standard complying with this widespread definition defines both
+a CCS and a CES for it. C<ISO-8859-*> naturally fall into this
+category: they each define a 96-character CCS and an 8-bit CES
+that encodes this CCS together with C<ASCII> and the C<ISO 6429>
+set of control characters.
 
-=item character encoding scheme (CES)
+C<ASCII> is another example of a standard that defines both a CCS
+(96-character) and a CES.
 
-An algorithm to map a character set to a byte sequence.  You don't
-have to be able to tell which character set a given byte sequence
-belongs.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
-example of being both a CCS and CES.
+CJK standards are most controversial here: C<GB 2312-80> for
+example defines both a 96x96 CCS (encoded by EUC-CN, ISO-2022-CN)
+and an implicit I<raw> double-byte 7-bits per byte CES, available
+as C<gb2312-raw> with Encode. This CES is of limited
+applicability: it encodes only the C<GB_2312-80> CCS, not the C<ASCII> or
+C<ISO 6429> CCS's. As a result C<SPACE>, C<CR> and C<LF> are missing
+among others.
+
+C<JIS X 0208-1983> and C<KS C 5601-1987> are in a similar position.
+[RFC 1345] and the IANA registry have defined the C<JIS_X0208-1983> and
+C<KS_C_5601-1987> names to refer to the corresponding I<raw>
+double-byte 7-bit CES's (the latter one is available as
+C<ksc5601-raw> with Encode). Please compare to the
+L<Microsoft-related naming mess> section :-).
+
+
+=item charset (in MIME context)
+
+has long been used to mean C<encoding>, i.e. a CES.
+
+While the phrase C<character set> has lost this
+meaning since [RFC 2130], the abbreviation C<charset> has
+retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ...  (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+                                               [RFC 2277]
 
 =item EUC
 
@@ -590,14 +676,14 @@
 
 =item Unicode
 
-A Character Set that aims to include all character character
+A CCS (Coded Character Set) that aims to include the whole character
 repertoire of the world.  Many character sets in various national as
 well as industorial standards are therefore a subset thereof.
 
 =item UTF
 
 Short for I<Unicode Transformation Format>.  Determinse how to map a
-unicode character into byte sequnece.
+Unicode character into a byte sequence. A CES.
 
 =item UTF-16
 
@@ -671,7 +757,7 @@
 
 L<http://www.unicode.org/glossary/>
 
-The glossary of this document is based opon this site.
+The glossary of this document is based upon this site.
 
 =back
 
@@ -683,7 +769,7 @@
 
 =item czyborra.com
 
-<http://czyborra.com/>
+L<http://czyborra.com/>
 
 Contains a a lot of useful information, especially gory details of ISO
 vs. vendor mappings.
@@ -698,11 +784,34 @@
 
 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
 
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+has a comprehensive overview of the C<KS *> (Korean) standards.
+The author claims however that the document needs
+some modernisation :-)
+
 =back
 
-=cut
+=head2 Offline sources
+
+=over 2
+
+=item Ken Lunde
 
-I could not find this page because the hostname doesn't resolve!
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 
- Brief description for most of the mentioned CJK encodings
-L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
+The modern successor of the C<CJK.inf>.
+The book of choice for everyone interested.
+
+L<http://www.oreilly.com/catalog/cjkvinfo/>
+
+=back
+
+=cut

