perl-unicode

- charset + character set + coded character set + CCS (?) (was: [Encode] Encode::Supported revised)

2002-04-04 04:47:50
Hello Jungshik!

Our comments go in the same direction, but will you
let me strengthen your statements a bit?

=head1 Encoding vs. Charset
JS> Whether you like it or not, 'charset' is overloaded by MIME to mean
JS> 'encoding' (Character set Encoding Scheme=CES as defined in RFC 2130).
Indeed it is.
RFC 2278 additionally makes it explicit.

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

[ECMA-35, (equivalent of ISO 2022?)]:
coded character set; code
  A set of unambiguous rules that establishes a
  character set and the one-to-one relationship between the 
  characters of the set and their coded representation.

[RFC 1345]:
  The ISO definition of the term "coded character set" is as
  follows: "A set of unambiguous rules that establishes a 
  character set and the one-to-one relationship between the 
  characters of the set and their coded representation."

Hmmm... could this lead to "character set" being mistaken for
a short form of "coded character set" (in the ISO meaning)?

I see that these definitions themselves make a distinction between a
"character set"       (= repertoire    ) and
"coded character set" (= CCS + encoding = CCS + CES),

Jungshik?

Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with unique (numeric) IDs / code points.
JS> The former is sometimes referred to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.
or rather CCS to rule out the ISO understanding

=item Character I<Encoding>

A character encoding may also encode character set as-is (also called
a I<raw> encoding, e.g. US-ascii) or processed (e.g. EUC-JP: US-ascii is
as-is, JIS X 0201 is prefixed with \x8E, JIS X 0208 has 0x8080 added,
and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
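
The EUC-JP packing described in this item can be sketched in a few lines. This is only an illustration in Python, not how Encode implements it; the helper names are made up for this example, and the inputs are assumed to be the 7-bit (GL) codes of each set, or the 8-bit code for JIS X 0201 katakana.

```python
def euc_jp_from_ascii(code: int) -> bytes:
    """US-ASCII passes through as-is (one byte, 0x00-0x7F)."""
    return bytes([code])


def euc_jp_from_jisx0201_kana(code: int) -> bytes:
    """JIS X 0201 katakana (0xA1-0xDF) is prefixed with 0x8E."""
    return bytes([0x8E, code])


def euc_jp_from_jisx0208(gl: int) -> bytes:
    """A two-byte JIS X 0208 GL code (e.g. 0x2422) has 0x8080 added."""
    return (gl + 0x8080).to_bytes(2, "big")


def euc_jp_from_jisx0212(gl: int) -> bytes:
    """JIS X 0212: add 0x8080, then prefix the result with 0x8F."""
    return bytes([0x8F]) + (gl + 0x8080).to_bytes(2, "big")
```

For instance, HIRAGANA LETTER A is 0x2422 in JIS X 0208, so `euc_jp_from_jisx0208(0x2422)` yields the bytes A4 A2, which is also what Python's own `euc_jp` codec produces for that character.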

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL)  is not
JS> appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
Looks like RFC 1345 has made one big pile:

  JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
  GB_1988-80
  KS_C_5601-1987
  
are all listed in a similar manner there. Does this RFC change
anything?

JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
JS> codepoints.
Thanks a lot! I would never have caught this subtlety from the
reading I have done.
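
To make the subtlety concrete, here is a small Python sketch of how a row/column pair relates to a GL code and to the GR bytes seen in EUC. The function names are hypothetical, and the sketch assumes the usual 94x94 layout in which 0x20 is added to each of row and column.

```python
def kuten_to_gl(row: int, col: int) -> int:
    """Row/column (1-94 each) -> two-byte GL code (0x2121-0x7E7E)."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return ((row + 0x20) << 8) | (col + 0x20)


def gl_to_gr(gl: int) -> bytes:
    """GL code -> GR bytes (high bit set on both bytes), as used in EUC."""
    return (gl | 0x8080).to_bytes(2, "big")
```

Row 16, column 1 of KS X 1001 therefore becomes GL 0x3021 and GR B0 A1; the same three numbers name one character, which is why a '-raw' label that does not say which of them it means is easy to misread.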

JS>  That's why I prefer gb2312-gl and ksx1001-gl to gb2312-raw
JS> and ksx1001-raw. 'gl' doesn't have a risk of being mistaken for row and
JS> column numbers.
I used to advocate the RFC 1345 names, but apparently they
do not ease the situation (too long and too complex :)


   KS_C_5601-1987

has been registered to IANA but when they are used, they are
EUC-coded.  Internet community in Korea is not happy with this.
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for "uncooked".

JS>   I'm afraid this could give an impression that
JS> IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
JS> ks_c_5601-1987 is registered with IANA or not, nobody had used it in
JS> MIME charset designation (although the general public used KS C 5601 or
JS> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
JS> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
JS> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties
JS> of EUC-KR/JP/CN. Rather, I'd say it's an extension of EUC-KR used
JS> in MS-Windows 9x/ME/NT4/2k/XP. CP949 will never be supported under
JS> Linux/Unix.  We'll just go straight to UTF-8.
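
One property that CP949 gives up can be checked with a short Python sketch. It assumes that the property meant here is "both bytes of a two-byte character lie in 0xA1-0xFE", which is a guess at what is being referred to, and it uses U+B620 as an example of a Hangul syllable outside KS X 1001; `in_euc_range` is a helper invented for the illustration.

```python
def in_euc_range(data: bytes) -> bool:
    """True if every byte lies in 0xA1-0xFE, as in a two-byte EUC code."""
    return all(0xA1 <= b <= 0xFE for b in data)


ga = "\uAC00"    # Hangul syllable GA, present in KS X 1001
ttom = "\uB620"  # Hangul syllable TTOM, assumed absent from KS X 1001

ga_cp949 = ga.encode("cp949")      # same bytes as in EUC-KR
ttom_cp949 = ttom.encode("cp949")  # lands in the CP949-only area

print(ga_cp949.hex(), in_euc_range(ga_cp949))
print(ttom_cp949.hex(), in_euc_range(ttom_cp949))
```

The first syllable stays inside the EUC byte range while the second does not, so code that relies on the EUC byte ranges cannot treat CP949 data as if it were EUC-KR.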

I have incorporated your ideas into a patch; let's see what Dan
thinks of it! (The patch was sent in reply to Dan's core message on
the Supported.pod renewal.)


   UTF-16
Awaiting more comments from you (see below)

   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
Haven't tested this one myself :-(
No objections to changing its status. My patch has that.


are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as an encoding for web pages due to
the lack of browser support.

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.
Hmm.. My attempts to use UTF-16 failed with IE5.5..
Has anyone demonstrated it to work?

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last update in 1996), but still useful.  Also try

JS>   Is there any rule against mentioning a book in print as opposed
JS> to online docs :-) ?  Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing with a very comprehensive coverage
JS> on character sets and encodings.

The link for the book is http://www.oreilly.com/catalog/cjkvinfo/
and its name is "CJKV Information Processing".
But someone has to write a good recommendation for it.
Let it be someone who has the book ;-)

Or may it be

Ken Lunde's book "CJKV Information Processing"
http://www.oreilly.com/catalog/cjkvinfo/

Successor to CJK.inf. Features very comprehensive coverage
of CJKV character sets and encodings.

?

Heartiest regards, Anton