Re: comments on latest MIME drafts

I'll also note that the MIME term "character set" is NOT the same
thing as the ISO term "coded character set".


What is the difference?

RFC 1345, where many MIME charsets are defined, follows ISO
terminology closely, and has the following definition of "charset":

   The ISO definition of the term "coded character set" is as follows:
   "A set of unambiguous rules that establishes a character set and the
   one-to-one relationship between the characters of the set and their
   coded representation." and this definition may be subject to
   different interpretations.

RFC 1345 is not a standards-track document, and is therefore
irrelevant to this discussion.  The RFC 1521 definition holds.


Not only does this RFC 1345 definition disagree with the MIME definition of
character set, it also disagrees with the definition of "coded character set"
in "Character Sets Considered Harmful" (draft-ietf-html-charset-harmful-00.txt,
or simply CCH for the rest of this message).

CCH's definition of "coded character set" is as follows:

coded character set
     A function whose domain is a subset of the integers, and whose
     range is a set of characters.
 
It should be obvious that this is a completely different beast from Keld's, but
in case it isn't, the key difference is that Keld's definition calls for a 1:1
relationship between characters and their coded representation. This doesn't
allow for characters with more than one coded representation. The CCH definition
operates in the other direction: it says only that each integer in the domain
must map to a character, so two integers may map to the same character.
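
To make the contrast concrete, here is a rough Python sketch. The table and
the function names are made up for illustration; they are not taken from CCH,
RFC 1345, or any registered charset. It models a CCH-style coded character set
as a mapping from integers to characters and then checks it against the
stricter one-to-one reading.

```python
# Hypothetical toy table (an assumption for illustration only): a CCH-style
# coded character set is a function from a subset of the integers to
# characters.
TOY_CCS = {
    0x41: "A",
    0x42: "B",
    0xC1: "A",  # a second integer for the same character
}

def is_function_to_characters(ccs):
    """CCH-style check: every integer in the domain maps to one character."""
    return all(
        isinstance(code, int) and isinstance(char, str) and len(char) == 1
        for code, char in ccs.items()
    )

def is_one_to_one(ccs):
    """Stricter check in the spirit of the 1:1 wording: no character may
    have more than one coded representation."""
    chars = list(ccs.values())
    return len(chars) == len(set(chars))

if __name__ == "__main__":
    print(is_function_to_characters(TOY_CCS))
    print(is_one_to_one(TOY_CCS))
```

The toy table passes the first check and fails the second, because "A" has two
coded representations. That is exactly the case the 1:1 wording rules out.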

CCH is an attempt to arrive at consistent terminology for future IETF use. This
document's terminology is derived from a variety of sources, including ISO
specifications. (I don't know specifically where Keld's definition comes from.
It is quite possible that it originates in the ISO as well, since ISO
terminology is known to be inconsistent.)

According to CCH, MIME's "character sets" should have been called "character
encoding schemes" or, more simply, "character encodings". We didn't do this and
it is now too late to change in MIME. There is already a note about this
specific discrepancy in the MIME definition of a character set.

MIME does not require that there be a one-to-one relationship between
characters and their coded representation; it requires only that there
be a unique mapping *from* the coded representation of a sequence of
characters to those characters.


Quite correct, but this also glosses over the distinction between the CCH set
of integers and the sequence of octets in MIME. These are NOT the same --
mappings between them may exist, but they are still very different things.

Put another way, the CCH "character encoding scheme" and MIME's
"character set" give you all you need to get from octets to
characters. A coded character set only goes part way -- it gets you from
integers to characters, but you first have to get from octets to
integers.
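
The two steps can be sketched in Python as follows. Everything here is an
assumption made for illustration: the three-entry table, the rule that every
pair of octets forms one big-endian integer, and the function names. None of
it describes a real registered charset.

```python
# Hypothetical coded character set: integers -> characters.
TOY_CCS = {0x0041: "A", 0x0042: "B", 0x03A9: "\u03a9"}

def octets_to_integers(octets):
    """Toy encoding step: each pair of octets is one big-endian integer."""
    if len(octets) % 2:
        raise ValueError("odd number of octets")
    return [
        int.from_bytes(octets[i:i + 2], "big")
        for i in range(0, len(octets), 2)
    ]

def integers_to_characters(integers, ccs):
    """Coded character set step: look each integer up in the table."""
    return "".join(ccs[n] for n in integers)

def decode(octets, ccs=TOY_CCS):
    """Both steps together: the job of a MIME "character set" (or a CCH
    "character encoding scheme")."""
    return integers_to_characters(octets_to_integers(octets), ccs)

if __name__ == "__main__":
    print(octets_to_integers(b"\x00\x41\x03\xa9"))
    print(decode(b"\x00\x41\x03\xa9"))
```

Here integers_to_characters() alone plays the part of the coded character
set; it is useless on raw octets until octets_to_integers() has run.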

                                Ned