perl-unicode

Re: [Encode] Encode::Supported revised

2002-04-03 23:23:57
On Wed, 3 Apr 2002, Dan Kogai wrote:

  Dan,

  Thank you for your write-up. Below are some comments.

        o The MIME name as defined in IETF RFCs.
   UCS-2         ucs2, iso-10646-1                    [IANA, et al]
   UCS-2le
   UTF-8         utf8                                     [RFC2279]
   ----------------------------------------------------------------

  How about UCS-2BE? Of course, if UCS-2 is network byte order
(big endian), it's not necessary. In that case, you may alias UCS-2
to UCS-2BE.
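
  The byte-order difference can be seen in a small Python sketch. Python has no codec named UCS-2, so this assumes its utf-16-be and utf-16-le codecs as stand-ins; for characters in the Basic Multilingual Plane they produce the same two-byte code units that UCS-2BE and UCS-2LE would.

```python
# Stand-in for UCS-2BE / UCS-2LE: for BMP characters UTF-16 emits the
# same 16-bit code units, so only the byte order differs.
text = "A\uac00"  # U+0041 and U+AC00

big_endian = text.encode("utf-16-be")     # network byte order
little_endian = text.encode("utf-16-le")

print(big_endian.hex(" "))     # 00 41 ac 00
print(little_endian.hex(" "))  # 41 00 00 ac
```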


=item Encode::KR -- Korea

   ----------------------------------------------------------------
   euc-kr                MacKorean      

     euc-kr                MacKorean      [RFC1557, IANA, KS X 2901]

                 cp949                   ks_c_5601-1987
   iso-2022-kr                           [RFC1557]
   johab                                 

     johab                                 KS X 1001:1998 Annex 3

   ksc5601-raw                           KSC5601 as is
   ----------------------------------------------------------------

=item Vietnamese encodings VPS

  Mozilla supports VPS. See

   http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf
   http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut


=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

=item Character I<Set> (I<charset> for short)

  Could you please just say 'Encoding vs Character Set'
and remove the parenthetical 'charset for short' or 'just charset' following
'character set'?  I agree with your distinction between 'encoding' and
'character set', but what bothers me is that you treat 'charset'
as a synonym for 'character set'.

Whether you like it or not, 'charset' is overloaded by MIME to mean
'encoding' (Character Encoding Scheme, or CES, as defined in RFC 2130).
Every day numerous HTML documents are produced with meta tags that read
'Content-Type=text/html; charset=XXXX'. The same is true of email
messages with a Content-Type header like 'text/plain; charset=ISO-2022-JP'.
Therefore 'Encoding vs Charset' can be interpreted as 'Encoding vs
Encoding'.  On the other hand, no one with *sufficient understanding*
of the issue uses 'character set' to mean encoding.
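
  As a rough illustration of how the MIME 'charset' parameter is consumed as an encoding name, here is a Python sketch using the standard email package. The header value is the one quoted above; the sample text is only an assumption for the demonstration.

```python
from email.message import Message

# Build a message carrying the Content-Type header quoted above.
msg = Message()
msg["Content-Type"] = "text/plain; charset=ISO-2022-JP"

# The 'charset' parameter comes back lowercased and is used directly
# as the name of an encoding (a CES), not of a coded character set.
charset = msg.get_content_charset()
print(charset)  # iso-2022-jp

sample = "\u65e5\u672c\u8a9e"        # illustrative text only
wire_bytes = sample.encode(charset)   # escape sequences plus 7-bit bytes
round_trip = wire_bytes.decode(charset)
```

  ISO-2022-JP switches between several coded character sets (ASCII, JIS X 0208 and others) with escape sequences, which is exactly why the parameter cannot be naming a single character set.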


Is a collection of characters in which each character is distinguished
with unique ID (in most cases, ID is number).

  Some people like to distinguish between a mere collection of characters
and a collection of characters with unique (numeric) IDs / code points.
The former is sometimes referred to as a character repertoire
or a character set, whereas the latter is called a 'coded character set'.

=item Character I<Encoding>

A character encoding may also encode character set as-is (also called
a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is
as-is, JIS X 0201 is prepended  with \x8E, JIS X 0208 is added by
0x8080, and JIS X 0212 is added by 0x8080 then prepended with \x8F).

   In a strict sense, the concept of 'raw' or 'as-is' (which you
apparently use to mean a coded character set invoked on GL) is not
appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
characters to their GL positions when enumerating characters in their
charts. The numeric IDs used in JIS X 0208, JIS X 0212 and KS X 1001
are row (ku) and column (ten?) numbers, while GB 2312-80 appears to use GL
code points. That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
and ksx1001-raw: 'gl' doesn't run the risk of being mistaken for row and
column numbers.
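
   The relation between row/column numbers, GL code points and the EUC (GR) form can be sketched in Python as below. The helper names kuten_to_gl and gl_to_gr are hypothetical, made up for this illustration; the two sample characters are U+AC00 (KS X 1001 row 16, column 1) and U+3042 (JIS X 0208 row 4, column 2).

```python
def kuten_to_gl(row, cell):
    """Convert row/column numbers (1..94 each) to two GL bytes (0x21..0x7E)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def gl_to_gr(gl):
    """Set the high bit of each byte, i.e. add 0x8080 as EUC does."""
    return bytes(b | 0x80 for b in gl)


ks_gl = kuten_to_gl(16, 1)   # KS X 1001 row 16, column 1 (U+AC00)
jis_gl = kuten_to_gl(4, 2)   # JIS X 0208 row 4, column 2 (U+3042)

print(ks_gl.hex(), gl_to_gr(ks_gl).hex())    # 3021 b0a1
print(jis_gl.hex(), gl_to_gr(jis_gl).hex())  # 2422 a4a2
```

   The same character thus has three different numeric forms (16-01, 0x3021 and 0xB0A1 for U+AC00), so a name like 'raw' leaves open which one is meant.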


   KS_C_5601-1987

has been registered to IANA but when they are used, they are
EUC-coded.  Internet community in Korea is not happy with this.
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for "uncooked".

  I'm afraid this could give the impression that
IANA is to blame for the misuse of the CCS name to mean encoding/CES. Whether
ks_c_5601-1987 is registered with IANA or not, nobody had used it as a
MIME charset designation (although the general public used KS C 5601 or
Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
*enhanced* version of EUC-KR. CP949 lacks some nice properties
of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
Linux/Unix.  We'll just go straight to UTF-8.


   UTF-16
   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encoding for web pages due to
the lack of browser supports.

  Not that I'd encourage people to use UTF-16 for their web pages,
but UTF-16 is supported by MS IE, and KOI8-U is supported by both MS IE
and Mozilla.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful.  Also try

  Is there any rule against mentioning a book in print as opposed
to online docs :-) ?  Why don't you also refer to the successor to
CJK.inf, CJKV Information Processing, which has very comprehensive coverage
of character sets and encodings?

   Cheers,

  Jungshik