Hello, Dan! Hello, Jungshik and Autrijus!
Here's a new patch for Supported.pod.
It
- does some minor cleanup
- rewrites the section on UTF-16 (I hope you and Jungshik will like it :-)
- adds a link to Jungshik's reference on Korean character set standards
- tries to add a link to Ken Lunde's offline book
- drops the following entry for good:
    I could not find this page because the hostname doesn't resolve!
    Brief description for most of the mentioned CJK encodings
    L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
- and... makes a heavy invasion into the glossary
Gentlemen! Please excuse my boldness. Here's my vision of
what the Glossary should be.
As justification for my changes, here is a more detailed write-up:
http://tagunov.tripod.com/survey2.html
It has been substantially updated in the last 24 hours,
but is still under construction.
My heartiest regards,
/Anton/
--- ext/Encode/lib/Encode/Supported.pod.orig Fri Apr 5 01:00:36 2002
+++ ext/Encode/lib/Encode/Supported.pod Fri Apr 5 07:33:54 2002
@@ -63,7 +63,7 @@
ascii US-ascii [ECMA]
iso-8859-1 latin1 [ISO]
utf8 UTF-8 [RFC2279]
- UCS-2 ucs2, iso-10646-1, UTF-16LE [IANA, UC]
+ UCS-2 ucs2, iso-10646-1, UTF-16BE [IANA, UC]
UTF-16LE UCS-2LE [UC]
----------------------------------------------------------------
@@ -456,14 +456,42 @@
C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
with Encode. See L<Encode::KR -- Korea> for details.
- UTF-16
+ UTF-16 UTF-16BE UTF-16LE
-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware, however, that
+
+=over 2
+
+=item *
+
+C<UTF-16> support in any software you're going to be
+using/interoperating with has probably been less tested
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+generally the way HTML browsers encode non-C<ASCII> form data
+is beyond what words can tell. Refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
+for a general impression. Browser behavior, however, has stabilized
+for C<UTF-8> coded pages (at least IE 5/6, NS 6, Opera 6).
+Be sure to expect more fun (and discrepancies between browsers)
+with C<UTF-16> coded pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and really need C<UTF-16>.
-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to
-the lack of browser support.
ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
GBK
@@ -515,7 +543,7 @@
Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
See L<Encode::CN -- Continental China> for details.
@@ -556,17 +584,75 @@
A collection of unique characters. A I<character> set in the most
strict sense. At this stage characters are not numberd.
-=item coded character set (CCS)
+=item CCS (RFC 2130 Coded Character Set)
+
+ A mapping from a set of abstract characters to a set of integers
+ [RFC 2130]
+
+C<Unicode 3.2> is an example of a "pure" CCS.
+
+=item character encoding scheme (CES), encoding
+
+ - A description of an algorithm which transforms every possible
+ sequence of octets to either a sequence of pairs <CCS, code
+ value> or to the error state "illegal octet sequence"
+ - Specifications, either by reference to CCS's registered by IANA or
+ in text, of each CCS upon which this CES is based.
+ [RFC 2130]
+
+
+C<UTF-8>, C<UCS-2>, all of C<ISO-2022-*> and C<EUC-*> standards
+are "pure" CES's (do not define new CCS's on their own).
+
+=item ISO coded character set
+
+ coded character set; code: A set of unambiguous rules that
+ establishes a character set and the one-to-one relationship
+ between the characters of the set and their coded
+ representation.
+ [http://www.evertype.com/standards/iso8859/8859-14-en.pdf]
-A character set that is mapped in a way computers can use directly.
-Many character encodings including EUC falls in this category.
+Any standard complying with this widespread definition defines both
+a CCS and a CES for it. C<ISO-8859-*> naturally fall into this
+category: they each define a 96-character CCS and an 8-bit CES
+that encodes that CCS together with C<ASCII> and the C<ISO 6429>
+set of control characters.
-=item character encoding scheme (CES)
+C<ASCII> is another example of a standard that defines both a CCS
+(96-character) and a CES.
-An algorithm to map a character set to a byte sequence. You don't
-have to be able to tell which character set a given byte sequence
-belongs. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
-example of being both a CCS and CES.
+CJK standards are the most controversial here: C<GB 2312-80>, for
+example, defines both a 96x96 CCS (encoded by EUC-CN, ISO-2022-CN)
+and an implicit I<raw> double-byte, 7-bits-per-byte CES, available
+as C<gb2312-raw> with Encode. This CES is of only limited
+applicability: it encodes only the C<GB_2312-80> CCS, not the C<ASCII>
+or C<ISO 6429> CCS's. As a result C<SPACE>, C<CR> and C<LF> are missing,
+among others.
+
+C<JIS X 0208-1983> and C<KS C 5601-1987> are in a similar position.
+[RFC 1345] and the IANA registry have defined the C<JIS_X0208-1983> and
+C<KS_C_5601-1987> names to refer to the corresponding I<raw>
+double-byte 7-bit CES's (the latter one is available as
+C<ksc5601-raw> with Encode). Please compare to the
+L<Microsoft-related naming mess> section :-).
+
+
+=item charset (in MIME context)
+
+has long been used to mean C<encoding>, i.e. a CES.
+
+While the phrase C<character set> has lost this
+meaning since [RFC 2130], the abbreviation C<charset> has
+retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ... (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+ [RFC 2277]
=item EUC
@@ -590,14 +676,14 @@
=item Unicode
-A Character Set that aims to include all character character
+A CCS (Coded Character Set) that aims to include all character
repertoire of the world. Many character sets in various national as
well as industorial standards are therefore a subset thereof.
=item UTF
Short for I<Unicode Transformation Format>. Determinse how to map a
-unicode character into byte sequnece.
+Unicode character into a byte sequence. A CES.
=item UTF-16
@@ -671,7 +757,7 @@
L<http://www.unicode.org/glossary/>
-The glossary of this document is based opon this site.
+The glossary of this document is based upon this site.
=back
@@ -683,7 +769,7 @@
=item czyborra.com
-<http://czyborra.com/>
+L<http://czyborra.com/>
Contains a a lot of useful information, especially gory details of ISO
vs. vendor mappings.
@@ -698,11 +784,34 @@
You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+has a comprehensive overview of the C<KS *> (Korean) standards.
+The author claims, however, that the document needs
+some modernisation :-)
+
=back
-=cut
+=head2 Offline sources
+
+=over 2
+
+=item Ken Lunde
-I could not find this page because the hostname doesn't resolve!
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN 1-56592-224-7
- Brief description for most of the mentioned CJK encodings
-L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
+The modern successor to C<CJK.inf>.
+The book of choice for everyone interested.
+
+L<http://www.oreilly.com/catalog/cjkvinfo/>
+
+=back
+
+=cut