perl-unicode

Re: - charset + character set + coded character set + CCS (?) (was: [Encode] Encode::Supported revised)

2002-04-04 20:30:00
On Thu, 4 Apr 2002, Anton Tagunov wrote:

 Hi Anton !!

AT> Our comments go in the same direction, but will you
AT> let me strengthen your statements a bit?

  Thank you !

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

AT> [ECMA-35, (equivalent of ISO 2022?)]:

  Yes, I think they're a verbatim equivalent of ISO 2022. I'd never
have been able to read ISO 2022 unless ECMA released it free as ECMA 35.

AT> coded character set; code
AT>   A set of unambiguous rules that establishes a
AT>   character set and the one-to-one relationship between the 
AT>   characters of the set and their coded representation.

AT> [RFC 1345]:
AT>   The ISO definition of the term "coded character set" is as
AT>   follows: "A set of unambiguous rules that establishes a 
AT>   character set and the one-to-one relationship between the 
AT>   characters of the set and their coded representation."

AT> Hmmm... can this potentially lead to messing "character set" for
AT> a short form of "coded character set" (in the ISO meaning)?

AT> I see that these definitions themselves make a distinction between a
AT> "character set"       (= repertoire    ) and
AT> "coded character set" (= CCS + encoding = CCS + CES),

Jungshik?

  Hmm, I feel like being treated as 'the' ultimate something here, which
I'm certainly not and never wanted to be :-)

  I think Dan is right when he wrote that EUC-JP,EUC-KR,EUC-CN,
EUC-TW and even UTF-8 could be regarded as both CCS and CES. Even though
they involve multiple character set standards, the mapping from abstract
characters in those multiple character set standards to integers (despite
being of multiple 'lengths') is strictly one-to-one.  I didn't realize
that it's possible to view things that way until he wrote that. On the
other hand, as he wrote, any encoding that utilizes any form of escape
sequence (locking/single shift, designator, etc.), whether defined in
ISO 2022 or not (I have HZ in mind here), cannot be called a CCS because
the mapping alone cannot fully specify how actual text in that encoding
is 'serialized' into an octet sequence. Therefore, I believe the equation
below doesn't hold for all the encodings we have to deal with, although
it does hold for some of them.

AT> "coded character set" (= CCS + encoding = CCS + CES),
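
  To make the stateless/stateful distinction concrete, here is a minimal
sketch. It assumes Python and its standard 'gb2312' (EUC-CN) and 'hz'
codecs purely for illustration; it is not how Encode does things, and the
variable names are made up for this example.

```python
# EUC-CN (Python codec 'gb2312') is stateless: the octets D6 D0 always
# decode to the same character, wherever they appear in the stream.
euc_octets = "中".encode("gb2312")

# HZ switches between ASCII and GB 2312 with the escapes '~{' and '~}'.
# The octets 'VP' (0x56 0x50) are the GL form of the same character,
# but what they mean depends on the shift state that precedes them.
hz_in_ascii_mode = b"VP".decode("hz")
hz_in_gb_mode = b"~{VP~}".decode("hz")

print(euc_octets.hex(), repr(hz_in_ascii_mode), repr(hz_in_gb_mode))
```

  The same two octets come out as two ASCII letters in one state and as
one Chinese character in the other, which is why a mapping table alone
cannot describe HZ.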

Then, I realize that RFC 1345 has the following after quoting
ISO definition of coded character set which you quoted above.

1345> This memo does not put further
1345> restrictions on the term of "coded character set" than the following:
1345>  "A coded character set is a set of rules that unambiguously and
1345>  completely determines which sequence of characters, if any, is
1345>  represented by each possible sequence of n-bit bytes for a certain
1345>  value of n." This implies that e.g. a coded character set extended
1345>  with one or more other coded character sets by means of the extension
1345>  techniques of ISO 2022 constitutes a coded character set in its own
1345>  right.  In this memo the term "charset" is used to refer to the above
1345>  interpretation of the ISO term "coded character set".

However, even RFC 1345 came up with a new term 'charset' for its
*extended* definition of 'coded character set' to distinguish it from
the original ISO definition. The definition of 'charset' in RFC 1345
is actually in line with RFC 2130/2278. Therefore, what I wrote above
about the statement that "coded character set" (= CCS + encoding = CCS + CES)
still stands, IMO.



DOC> Is a collection of characters in which each character is distinguished
DOC> with unique ID (in most cases, ID is number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with unique (numeric) IDs / code points.
JS> The former is sometimes referred to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.

AT> or rather CCS to rule out the ISO understanding

  I don't see any conflict between the RFC 2130 CCS and the ISO coded
character set _quoted_ in RFC 1345. It's not the original ISO definition
of 'coded character set' but RFC 1345's extension of the definition that
made things complicated. However, even RFC 1345 gave it a new term,
'charset', to tell it apart from the original ISO definition.


DOC> =item Character I<Encoding>
DOC> A character encoding may also encode character set as-is (also called
DOC> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL) is not
JS> appropriate, because JIS X 0208, JIS X 0212 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
AT> Looks like RFC 1345 has made one big pile:

AT>   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT>   GB_1988-80
AT>   KS_C_5601-1987
AT>   
AT> are all listed in a similar manner there. Does this RFC change
AT> anything?

  As we all know well now (and you documented), at least Encode cannot
use 'ks_c_5601-1987' to mean what's described in RFC 1345 (a mapping
between characters and row/column numbers) because MS took it away for
their own CP949. A similar misuse of GB2312 made it undesirable to use
GB_2312-80 to mean the row/column (or GL) representation of GB 2312-1980
in Encode.


JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001 
JS> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL 
JS> codepoints.

AT> Thanks a lot! I would have never caught this subtlety from what
AT> reading I have.

  Then, you also have to note what Dan wrote about the difference. JIS and
KS may have tried to 'please' the decimal-oriented :-)  Reading
what RFC 1345 wrote about GB 2312-80,

1345> Considering the Chinese standard GB 2312-1980, the
1345> Japanese standards JIS X0208 and JIS X0212, and the Korean standard
1345> KS C 5601, they are all given by row and column numbers between 1 and
1345> 94. So two positions for row and column and a character set
1345> identifier of one character would be almost as short as possible

I developed a reservation about what I wrote about GB 2312-80. Either I
(or Ken Lunde) am wrong, or the author of RFC 1345 was. Or both could be
right: it's possible that the printed version of GB 2312-80 in Chinese
used GL code points while the English document submitted to ISO to
register GB 2312-80 used row/column numbers.
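
  Either way, the two notations differ only by a constant offset. A small
sketch, assuming Python for illustration (the function names are
hypothetical, not taken from any library): a row/column pair in 1..94
becomes GL code points by adding 0x20, and GL becomes the GR (EUC) form by
setting the high bit.

```python
def kuten_to_gl(row, cell):
    """Row/column numbers (1..94) -> GL code points (0x21..0x7E)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def gl_to_gr(gl):
    """GL code points -> GR octets as used in EUC (high bit set)."""
    return bytes(b | 0x80 for b in gl)


def gr_to_kuten(gr):
    """GR (EUC) octets -> (row, cell) numbers."""
    return gr[0] - 0xA0, gr[1] - 0xA0


# GB 2312 row 54, cell 48, checked against Python's 'gb2312' codec.
gl = kuten_to_gl(54, 48)
print(gl, gl_to_gr(gl).decode("gb2312"))
```

  So whichever form a given standard prints in its charts, the other one
is recoverable by simple arithmetic.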


DOC>    KS_C_5601-1987

JS>   I'm afraid this could give an impression that
JS> IANA is to blame for misuse of the CCS name to mean encoding/CES. Whether
JS> ks_c_5601-1987 is registered with IANA or not, nobody had used it in
JS> MIME charset designation (although the general public used KS C 5601 or
JS> Wansung to mean EUC-KR) before Microsoft began to use it in 1997~1998
JS> for their own CP949 (not EUC-KR per se). BTW, I wouldn't call CP949 an
JS> *enhanced* version of EUC-KR. CP949 doesn't have some nice properties

 By 'nice properties', I mean you don't have to go back and forth
to figure out which character set any given octet in a file/stream
belongs to, because all octets representing characters in KS X 1001
have MSB=1 while octets for US-ASCII have MSB=0. That doesn't hold
true for CP949/UHC, Shift_JIS, Big5, and Johab.
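
  A rough way to check this property, sketched in Python with its
standard codecs (an assumption for illustration only; the helper name
msb_separable is made up here):

```python
def msb_separable(text, encoding):
    """True if every non-ASCII character in text encodes only to octets
    with MSB=1, so each octet can be classified without any context."""
    for ch in text:
        octets = ch.encode(encoding)
        if ord(ch) >= 0x80 and any(b < 0x80 for b in octets):
            return False
    return True


# EUC-KR keeps every KS X 1001 octet in the MSB=1 range.
print(msb_separable("한글 text", "euc_kr"))

# Shift_JIS, Big5 and Johab all use trail bytes in the ASCII range.
print(msb_separable("ソ", "shift_jis"))
print(msb_separable("許", "big5"))
print(msb_separable("가", "johab"))

# CP949/UHC: look for any Hangul syllable with an ASCII-range trail byte.
print(all(msb_separable(chr(cp), "cp949") for cp in range(0xAC00, 0xD7A4)))
```

  With the encodings that fail this check, an octet such as 0x5C may be a
backslash or the second half of a two-octet character, and you can only
tell by scanning from a known character boundary.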


AT> I have incorporated your ideas into a patch, let's see what Dan
AT> thinks on it! (patch sent in reply to Dan's core message on
AT> Supported.pod renewal)

  Thanks a lot.


DOC>    UTF-16

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.

AT> Hmm.. My attempts to use UTF-16 failed with IE5.5..
AT> Has anyone demonstrated it to work?

  See my comment about this in my other reply to you. It works fine
with MS IE, Netscape 4, Netscape 6 and Mozilla.



JS> to online docs :-) ?  Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing with a very comprehensive coverage
JS> on character sets and encodings.

AT> http://www.oreilly.com/catalog/cjkvinfo/ is the link for the book
AT> "CJKV Information Processing" is the name
AT> But someone has to write a good recommendation for that.
AT> Let it be someone who has the book ;-)

  Hmm, is it me :-) ? A collection of reviews is supposed to be at

ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/review/cjkv-reviews.txt

At the moment, the link is broken, though.

AT> Ken Lunde's book "CJKV Information Processing"
AT> http://www.oreilly.com/catalog/cjkvinfo/

  Or, his web page on the book at

  http://www.oreilly.com/~lunde/cjkv-ip.html

AT> Successor to CJK.inf. Features a very comprehensive coverage
AT> on CJKV character sets and encodings.

 How about just adding the following after '...sets and encodings'

  along with many other issues faced by anyone trying to
  better support CJKV languages/scripts in all the areas of information
  processing.

  Cheers,

  Jungshik