perl-unicode

Re[2]: - charset + character set + coded character set + CCS (?) (was: [Encode] Encode::Supported revised)

2002-04-06 01:20:18
Hello, Jungshik!

http://tagunov.tripod.com/survey2.html is largely an answer,
so, if you allow, I will comment with links into this page :)

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

ISO> coded character set; code
ISO>   A set of unambiguous rules that establishes a
ISO>   character set and the one-to-one relationship between the
ISO>   characters of the set and their coded representation.

AT> Hmmm... can this potentially lead to mistaking "character set" for
AT> a short form of "coded character set" (in the ISO meaning)?

JS>   I think Dan is right when he wrote that EUC-JP,EUC-KR,EUC-CN,
JS> EUC-TW and even UTF-8 could be regarded as both CCS and CES.
They can :)
http://tagunov.tripod.com/survey2.html#BD
classifies it as the ISO point of view: every encoding inevitably
defines a "Character Set" too.

I understand this "Character Set" to be a CCS, not
a character repertoire. And you?

JS> Even though
JS> they involve multiple character set standards, the mapping from abstract
JS> characters in those multiple character set standards to integers (despite
JS> being of multiple 'lengths') is strictly one-to-one.  I didn't realize
JS> that it's possible to view things that way until he wrote that.
Neither did I!

JS> On the other hand, as he wrote, any encoding that utilize any form of
JS> escape sequence (locking/single shift, designator, etc) , whether
JS> defined in ISO 2022 or not (I have HZ in mind here)  cannot be called
JS> a CCS because just providing the mapping alone cannot fully specify
JS> the way actual text in that encoding is 'serialized' in octet-sequence.
I agree that EUC-JP is "more" a CCS than ISO-2022-JP :-)
Still, as I write at
http://tagunov.tripod.com/survey2.html#BD
I think that the [RFC 2130] approach is better than the ISO one, and you? ;)
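
To make the escape-sequence point concrete, here is a small sketch in Python (not Perl, and not anything from Encode; it only assumes the stock "euc_jp" and "iso2022_jp" codecs that ship with Python). It tries to show that in ISO-2022-JP the same two octets are read differently depending on the shift state, whereas the EUC-JP octets stand on their own.

```python
# U+4E9C, the first kanji of JIS X 0208.
ch = "\u4e9c"

# EUC-JP: stateless, the two octets alone identify the character.
euc = ch.encode("euc_jp")

# ISO-2022-JP: the codec wraps the two octets in escape sequences
# that designate JIS X 0208 and then switch back to ASCII.
iso = ch.encode("iso2022_jp")

# The same octets 0x30 0x21 mean different things depending on
# which character set has been designated beforehand.
plain = b"0!".decode("iso2022_jp")                 # ASCII state
shifted = b"\x1b$B0!\x1b(B".decode("iso2022_jp")   # JIS X 0208 state

print(euc.hex(), iso, repr(plain), repr(shifted))
```

So a bare table of octets to characters is not enough for ISO-2022-JP; one also has to know the state, which is how I read Jungshik's argument.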

JS> Therefore, I believe the below doesn't hold true for all encodings
JS> we have to deal with although it's the case for some encodings.
I'm afraid I just do not understand you well here, Jungshik.
AT> "coded character set" (= CCS + encoding = CCS + CES),
My statement is "ISO coded character set" = CCS + CES
This always holds, doesn't it?

JS> Then, I realize that RFC 1345 has the following after quoting
JS> ISO definition of coded character set which you quoted above.
1345> This memo does not put further
1345> restrictions on the term of "coded character set" than the following:
1345>  "A coded character set is a set of rules that unambiguously and
1345>  completely determines which sequence of characters, if any, is
1345>  represented by each possible sequence of n-bit bytes for a certain
1345>  value of n." This implies that e.g. a coded character set extended
1345>  with one or more other coded character sets by means of the extension
1345>  techniques of ISO 2022 constitutes a coded character set in its own
1345>  right.  In this memo the term "charset" is used to refer to the above
1345>  interpretation of the ISO term "coded character set".
JS> However, even RFC 1345 came up with a new term 'charset' for its
JS> *extended* definition of 'coded character set'  to distinguish it from
JS> the original ISO definition. The definition of 'charset' in RFC 1345
JS> is actually in line with RFC 2130/2278.
I was just more than happy when I opened 2277. The 'charset' definition
there is the best I have seen :-))

Yes, the second definition of "coded character set" in 1345, also named
'charset', is identical to that of RFC 2130/2277/2278.

JS> Therefore, what I wrote about
JS> the statement that "coded character set" (= CCS + encoding = CCS + CES)
JS> is still the case, IMO.
I'm sorry, Jungshik. I'm afraid I did not understand that. Could you
explain that again?

DOC> Is a collection of characters in which each character is distinguished
DOC> with unique ID (in most cases, ID is number).

JS>   Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with unique (numeric) IDs / code points.
JS> The former is sometimes referred to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.

AT> or rather CCS to rule out the ISO understanding

JS>   I don't see any conflict between RFC 2130 CCS and ISO coded character
JS> set _quoted_ in RFC 1345.
Thanks to Markus G. Kuhn we now have the
http://www.evertype.com/standards/iso8859/8859-14-en.pdf link :)
Both 8859-14-en.pdf and ECMA 35 contain a very similar, slightly reworded
definition:
ISO 8859-14> coded character set; code
ISO 8859-14>   A set of unambiguous rules that establishes a
ISO 8859-14>   character set and the one-to-one relationship between the
ISO 8859-14>   characters of the set and their bit combinations.

2130>  A Coded Character Set (CCS) is a mapping from a set of abstract
2130>  characters to a set of integers.
Does the conflict look more evident now?
[RFC 2130] CCS is not at all about encoding. It rather is about
_enumerating_ a set of characters IMO.
Here's how I try to reword the [RFC 2130] CCS definition:
http://tagunov.tripod.com/survey2.html#BB what do you think of it? ;-)
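
A toy sketch of the distinction, in Python and with made-up names (TOY_CCS, ces_one_octet and ces_two_octets_be are hypothetical and correspond to no real standard): the [RFC 2130] CCS only enumerates characters, a CES turns those integers into octets, and only the two together give something you could call a 'charset'.

```python
# CCS in the RFC 2130 sense: abstract characters -> integers.
# Nothing here says how the integers become octets.
TOY_CCS = {"A": 1, "B": 2, "C": 3}

# Two hypothetical CESs over the same CCS: integers -> octets.
def ces_one_octet(code: int) -> bytes:
    return bytes([code])

def ces_two_octets_be(code: int) -> bytes:
    return code.to_bytes(2, "big")

def encode(text: str, ccs: dict, ces) -> bytes:
    """CCS followed by CES: characters -> integers -> octets."""
    return b"".join(ces(ccs[ch]) for ch in text)

one = encode("CAB", TOY_CCS, ces_one_octet)
two = encode("CAB", TOY_CCS, ces_two_octets_be)
print(one.hex(), two.hex())
```

The same CCS gives two different octet streams here, which is why I say the ISO definition (characters directly to bit combinations) fixes more than the [RFC 2130] CCS does.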

JS>  It's not the original ISO definition of 'coded
JS> character set' but  RFC 1345's extension of the definition that made
JS> things complicated. However, even RFC 1345 gave it a new term 'charset'
JS> to tell it from the original ISO definition.
Yes, it does conflict, '[RFC 2130] CCS' and '[RFC 2277] charset'==encoding

And furthermore, my opinion is that
http://tagunov.tripod.com/survey2.html#A3.1
ISO coded character set == CCS + CES
Do you approve?

So,
'ISO coded character set' is a 'charset' (not vice versa)
'ISO coded character set' is a CCS       (not vice versa)
'charset'  == 'encoding' == 'RFC 1345 second definition'

DOC> =item Character I<Encoding>
DOC> A character encoding may also encode character set as-is (also called
DOC> a I<raw> encoding.  i.e. US-ascii) or processed (i.e. EUC-JP, US-ascii is

JS>    In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL)  is not
JS> appropriate. Because JIS X 0208, JIS X 0212 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.
AT> Looks like RFC 1345 has made one big pile:

AT>   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT>   GB_1988-80
AT>   KS_C_5601-1987
AT>
AT> are all listed in a similar manner there. Does this RFC change
AT> anything?

JS>   As we all know well now (and you documented), at least Encode cannot
JS> use 'ks_c_5601-1987' to mean what's described in RFC 1345 (mapping
JS> bet. characters and row/column numbers) because MS took it away for
JS> their own CP949. A similar misuse of GB2312 made it not desirable to
JS> use GB_2312-80 to mean row/column (or GL) repr. of GB 2312-1980 in Encode.
Yes, yes, yes!
But we're speaking about beautiful theory, not rude practice! :-)
And even in theory the situation is fun to me:
GB 2312-80 _has_ defined a raw CES
JIS X 0208 and KS C 5601 _haven't_
But [RFC 1345] has lumped them together and has defined a raw
encoding for each, hasn't it?

JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column(ten?)  while GB 2312-80 appears to use GL
JS> codepoints.

AT> Thanks a lot! I would have never caught this subtlety from what
AT> reading I have.

JS>   Then, you also have to note what Dan wrote about the difference. JIS and
JS> KS may have tried to 'please' the decimal-oriented :-)
:-) given we're hex-oriented, rather than decimal-oriented, does
http://tagunov.tripod.com/survey2.html#BB please us?

JS> Reading what RFC wrote about GB 2312-80,

1345> Considering the Chinese standard GB 2312-1980, the
1345> Japanese standards JIS X0208 and JIS X0212, and the Korean standard
1345> KS C 5601, they are all given by row and column numbers between 1 and
1345> 94. So two positions for row and column and a character set
1345> identifier of one character would be almost as short as possible

Just what I was speaking about. [RFC 1345] has neglected that
difference and has lumped them all together. And it has presented us
with raw encodings for each!!
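
For the record, the arithmetic between the two numbering styles is small. A sketch in Python, with hypothetical helper names (kuten_to_gl and kuten_to_gr are mine, not Encode's), assuming the usual 94x94 layout: row/column numbers run from 1 to 94, the GL form adds 0x20 to each, and the GR form used by EUC sets the high bit on top of that.

```python
def kuten_to_gl(ku: int, ten: int) -> bytes:
    """Row/column (1..94 each) -> two GL octets (0x21..0x7E)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("row and column must be between 1 and 94")
    return bytes([ku + 0x20, ten + 0x20])

def kuten_to_gr(ku: int, ten: int) -> bytes:
    """Row/column -> two GR octets (0xA1..0xFE), as used by EUC."""
    return bytes(b | 0x80 for b in kuten_to_gl(ku, ten))

# JIS X 0208 row 16, column 1.
gl = kuten_to_gl(16, 1)
gr = kuten_to_gr(16, 1)

# Cross-check against Python's stock EUC-JP codec.
char = gr.decode("euc_jp")
print(gl.hex(), gr.hex(), char)
```

Whether a standard prints its charts with row/column numbers or with GL code points, the 'raw' two-octet form comes out the same once this offset is applied.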

(This is rather redundant, since I retell your, Autrijus's and
Dan's explanations at http://tagunov.tripod.com/survey2.html#A5.3)

JS> I developed a reservation about what I wrote about GB 2312-80.  Either I
JS> (or Ken Lunde) am(is) wrong or the author of RFC 1345 was wrong. Or,
JS> both could be right because it's possible that the printed version of
JS> GB 2312-80 in Chinese used GL code points while the English document
JS> submitted to ISO to register GB 2312-80 used row/column number.
The world is a mess :-)
And it seems [RFC 2130] has added to the mess.
No matter that Microsoft has stolen the name, the raw encoding
continues to live. As I've recently heard on perl5-porters,
jis201-raw and jis208-raw are probably going to get back, because
of some issues I do not understand. I'm indifferent about it, just
noting that I blame (or praise :-) [RFC 1345] for bringing them to us.

JS><snip/>

JS>   Not that I'd encourage people to use UTF-16 for their web pages,
JS> but  UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.
That was in my last patch.

JS> Why don't you also  refer to a successor to
JS> CJK.inf, CJKV Information Processing
JS> ...
JS>   Hmm, is it me :-) ?
;-)
JS>   ...
JS>   along with many other issues faced by anyone trying to
JS>   better support CJKV languages/scripts in all the areas of information
JS>   processing.
Done.
Thanks to Dan for speedy application!

My ultimate regards,
 - Anton

P.S.

JS>   Hmm, I feel like being treated as 'the' ultimate something here, which
JS> I'm certainly not and never wanted to be :-)

Settled :)