perl-unicode

Re: Encoding vs Charset

2002-03-27 15:13:50
On Wed, Mar 27, 2002 at 11:59:10AM -0500, Jungshik Shin wrote:
On Wed, 27 Mar 2002, Dan Kogai wrote:

On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote:
  IMHO, you're also misusing the term 'charset' here. A MIME charset
can be used synonymously with 'encoding' (or 'character set encoding
scheme': see CJKV Information Processing, IETF RFC 2130 and RFC 2278).
What has to be distinguished is 'coded character set' on the one hand
(JIS X 0208, JIS X 0212, KS X 1001, KS X 1003, GB 2312, CNS 11xxx,
ISO 10646, ISO 646, US-ASCII, ISO-8859-x) and 'encoding' / 'character
set encoding scheme' / 'MIME charset' on the other hand (EUC-JP,
EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC).

   I do not think so.  This time I can confidently say it is IANA that
has goofed.  To make my point clear, let me define Charset and Encoding
once again.

Character Set:

   A collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).

Character Encoding:

   A way to represent characters in a byte stream.  A given character
encoding may contain a single character set (e.g. US-ASCII) or multiple
character sets (e.g. EUC-JP, which contains US-ASCII, JIS X 0201 Kana,
JIS X 0208 and JIS X 0212).  A given character encoding may also encode
a character set as-is (raw; US-ASCII) or processed (in EUC-JP, US-ASCII
is as-is, JIS X 0201 Kana is prefixed with \x8E, JIS X 0208 has 0x8080
added to its code, and JIS X 0212 has 0x8080 added to its code and is
then prefixed with \x8F).
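
As a rough illustration of the EUC-JP rules described above, here is a
minimal Python sketch. The function name euc_jp_bytes and the
character-set labels are made up for this example, and the code values
(0x2422 for HIRAGANA LETTER A in JIS X 0208, 0xB1 for half-width
katakana A in JIS X 0201) are passed in by hand rather than looked up
in a table. The comparison against Python's built-in euc_jp codec is
only a sanity check, not part of the definition.

```python
def euc_jp_bytes(charset: str, code: int) -> bytes:
    """Encode one character, given its coded character set and its code there.

    Hypothetical helper for illustration; the labels are this sketch's own.
    """
    if charset == "US-ASCII":
        return bytes([code])                        # encoded as-is
    if charset == "JIS X 0201 Kana":
        return b"\x8e" + bytes([code])              # prefixed with 0x8E
    if charset == "JIS X 0208":
        return (code + 0x8080).to_bytes(2, "big")   # 0x8080 added to the code
    if charset == "JIS X 0212":
        # 0x8080 added to the code, then prefixed with 0x8F
        return b"\x8f" + (code + 0x8080).to_bytes(2, "big")
    raise ValueError(f"unknown character set: {charset}")


ascii_a = euc_jp_bytes("US-ASCII", 0x41)
kana_a = euc_jp_bytes("JIS X 0201 Kana", 0xB1)
hiragana_a = euc_jp_bytes("JIS X 0208", 0x2422)

# Sanity check against the codec that ships with Python.
print(ascii_a == "A".encode("euc_jp"))
print(kana_a == "\uff71".encode("euc_jp"))
print(hiragana_a == "\u3042".encode("euc_jp"))
```

The point of the sketch is that one encoding (EUC-JP) dispatches over
several coded character sets, each with its own transformation into
bytes.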

  You got me wrong. I don't have any objection to 'coded character set'
and 'encoding' defined this way. The problem is that you're using '(coded)
character set' and 'charset' interchangeably.  They're two different
things depending on where you come from. My point is that because
'charset' is already overloaded with two or more different meanings (as
a MIME Content-Type header parameter, it means 'encoding' as you defined
it above), you'd better not use it when comparing coded character set on
the one hand and encoding / character set encoding scheme on the other.
Simply put, it'd be much better for you to say '(coded) character set vs
encoding' instead of 'charset vs encoding'.

I think you are getting closer to agreement. The IETF term 'charset' is
indeed defined very closely to what is called "encoding" above.

Let me point out that "coded character set" and "character set"
are two quite different things. In the former, each character also has
a code associated with it, while in the latter there are no codes
associated. A "character set" consists of "abstract characters" in
Unicode parlance.
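
To make the three levels concrete, here is a small Python sketch. The
variable names are hypothetical, and using Unicode (ISO 10646) code
points as the codes and UTF-8 as the encoding is just a convenient
choice for illustration; any other coded character set and encoding
would show the same layering.

```python
# Character set: abstract characters only, no numbers attached.
character_set = frozenset({"A", "\u00e9", "\u3042"})

# Coded character set: each abstract character is paired with a code.
# Assumption for this sketch: the codes are Unicode code points.
coded_character_set = {ch: ord(ch) for ch in character_set}

# Encoding: a rule that turns those codes into bytes (UTF-8 here).
encoded = {ch: ch.encode("utf-8") for ch in character_set}

for ch in sorted(character_set):
    print(f"U+{coded_character_set[ch]:04X} -> {encoded[ch].hex()}")
```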

Kind regards
keld