Erik M. van der Poel writes:
OK, folks, let's try to reach closure on the definition of the term
"character set".
RFC 1341 (mime1) has this definition:
The term "character set", wherever it is used in this
document, refers to a coded character set, in the sense of
ISO character set standardization work, and must not be
misinterpreted as meaning "a set of characters."
Mime2 has this definition:
The term "character set", wherever it is used in this
document, refers to a unique mapping of a byte stream to
glyphs, a mapping which does not require external profiling
information. For example, bare "ISO 10646" can't be the
charset parameter, because it requires several language
information for the unique mapping to glyphs. However, this
term can refer to multibyte character sets and to extension
techniques such as those used in ISO 2022.
I suggest this definition:
The term "character set" is used in this document to refer to a method
used with one or more tables to convert encoded text to a series of
octets. This definition is intended to allow various kinds of text
encodings, from simple single-table mappings such as ASCII, to complex
table switching methods such as those that use ISO 2022's techniques.
We should get it right.
The first thing to note is that we have invented our own term, "charset",
and that is handy, as it lets us distinguish our usage from the normal
use of the term "character set", which in an IETF document would
naturally be read as the ISO term, given the close cooperation with
ISO.
The IETF definition of "charset" would most naturally be closely related
to the ISO term "coded character set". All charsets mentioned in
the MIME standards are ISO coded character sets: ASCII, ISO-8859-X
etc. If we mean something different or a special subset of the
family of ISO "coded character sets" we should say so in the document.
The mime2 definition has several flaws in it. It should not refer
to the term "glyph" - as noted in several previous mailings.
It should not exclude 10646 per se, IMHO.
What it got right (IMHO) is the direction of the mapping - from the
bits to the characters.
The EvdP definition does not relate the concept to ISO terms.
Here a "charset" defines a method for converting "encoded text" to a series
of "octets". The basic terms are not defined here, but "octet"
is an ISO defined term. I will question the use of "octet".
There are 7-bit communication lines in existence today that are
perfectly capable of doing MIME mail. There are charsets which
do not need more than 7 bits, so I would prefer either
to say a "series of bits" or "output stream".
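To make the 7-bit point concrete, here is a minimal sketch in Python.
The helper name fits_in_seven_bits and the sample strings are my own
illustration, not part of any MIME draft; the codec labels are the ones
Python's standard library happens to use.

```python
def fits_in_seven_bits(data: bytes) -> bool:
    """Return True if every code unit in data has its high bit clear."""
    return all(b < 0x80 for b in data)


# ASCII never needs the eighth bit.
ascii_bytes = "Hello, MIME".encode("ascii")

# ISO 8859-1 uses the eighth bit for characters such as e-acute.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

print(fits_in_seven_bits(ascii_bytes))
print(fits_in_seven_bits(latin1_bytes))
```

The ASCII stream could travel unchanged over a 7-bit line, while the
ISO 8859-1 stream could not, which is why "octet" seems too narrow a
word for the general definition.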
The term "encoded text" is undefined, and to me it seems circular.
What is "encoded text"? I would expect it to be encoded characters
representing text, so there we have the encoding already.
I would much prefer the ISO term "coded character set" which maps
"characters" to bits in a stream. We should then also define
the concept of "character", and this should allow unified
CJKT characters, although some Japanese and others do not like them.
I think there are good reasons for their dislike, but anyway there
are characters defined which have the combined meaning of each
of the pre-unified CJKT characters. Characters are here taken to
mean the ISO term.
So here is my go:
The terms "character set" and "charset" are synonymous in this document.
The terms "character set" and "charset", wherever they are used in this
document, refer to a "coded character set", in the sense of
ISO/IEC JTC1 character set standardization work, and must not be
misinterpreted as meaning a "character repertoire". A charset
specification must include all information for a bit stream to
be interpreted as the correct characters.
The terms "bit", "stream" and "character" are as defined by ISO/IEC JTC1.
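To illustrate why the charset specification has to carry all the
information needed to interpret the bits, here is a small sketch in
Python. It decodes one and the same octet under three ISO 8859 parts.
The choice of the octet 0xE6 and of these three parts is just my
example; the codec labels are those of Python's standard library.

```python
# One octet, three different coded character sets.
octets = b"\xe6"
labels = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]

# Decode the same bits under each charset label.
decoded = {label: octets.decode(label) for label in labels}

for label, char in decoded.items():
    print(f"{label}: U+{ord(char):04X}")
```

Without the label, the receiver cannot tell which of the three
characters was meant, so the bits alone are not enough.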
Comments on the above definitions:
The term "character set" as used in this document must not be
misinterpreted as meaning a "character repertoire" (which is the
ISO definition of the term).
A charset will normally include specification of both graphic and
control characters, contrary to many ISO coded character set
standards that only contain either graphic or control characters.
A charset can use stateful encodings such as the techniques
defined in ISO 2022. Thus a charset can consist of several ISO
charsets. Also a charset can use extension techniques as defined
in ISO 2022 and the mnemonic technique defined in RFC1345.
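As a rough illustration of a stateful charset built with ISO 2022
techniques, the sketch below uses the ISO-2022-JP encoding as it is
implemented by Python's "iso2022_jp" codec. ISO-2022-JP is my choice of
example and is not named in the drafts; the sample text is arbitrary.

```python
text = "abc \u65e5\u672c\u8a9e xyz"  # ASCII, three kanji, ASCII

encoded = text.encode("iso2022_jp")

# Escape sequences switch the active table: one to enter the
# two-byte Japanese set, one to return to ASCII.
escape_count = encoded.count(b"\x1b")

# Every byte stays below 0x80, so the stream is 7-bit clean.
seven_bit = all(b < 0x80 for b in encoded)

# Dropping the escape sequences loses the state, and the same
# bytes then read as plain ASCII characters instead of kanji.
payload = encoded.replace(b"\x1b$B", b"").replace(b"\x1b(B", b"")
misread = payload.decode("ascii")

print(escape_count, seven_bit)
print(encoded.decode("iso2022_jp") == text)
print(misread == text)
```

The meaning of a byte pair depends on the escape sequence seen before
it, which is what I mean by one charset consisting of several ISO
coded character sets.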
A character can have its meaning further defined by a language
indication, for example a unified CJK character can have its
meaning restricted to a Japanese character.
Bit and byte ordering must be defined for a charset.
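A short sketch of why byte ordering must be pinned down. I use Python's
"utf-16-be" and "utf-16-le" codecs purely as a stand-in for a two-octet
form of a multi-octet coded character set; those codecs are an
assumption of this example and are not discussed in the drafts.

```python
char = "A"  # code position 0041

big_endian = char.encode("utf-16-be")     # high-order octet first
little_endian = char.encode("utf-16-le")  # low-order octet first

print(big_endian.hex(), little_endian.hex())

# Reading big-endian octets with the wrong byte order
# yields a different character altogether.
misread = big_endian.decode("utf-16-le")
print(f"U+{ord(misread):04X}")
```

The two octet streams carry the same character, but only if the reader
knows which order was used.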
An encoding of a charset, such as UTF-1 of ISO 10646, is also
considered a charset.
The definition uses the ISO term "character" - which means that
it can represent several "glyphs" (in the ISO sense). It also means
that ISO 10646 "combining sequences" are not considered "characters".
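The distinction between a character and a combining sequence can be
sketched as follows. This uses Python's unicodedata module, whose
character model I am assuming is close enough to ISO 10646 for the
purpose; the normalization call is only there to show the two forms
are related and is not something the definition depends on.

```python
import unicodedata

precomposed = "\u00e9"  # e with acute, one coded character
sequence = "e\u0301"    # e followed by a combining acute accent

# The precomposed form is one character; the sequence is two.
print(len(precomposed), len(sequence))

# A non-zero combining class marks the accent as a combining character.
is_combining = unicodedata.combining("\u0301") != 0
print(is_combining)

# Composing the sequence gives the single precomposed character.
composed = unicodedata.normalize("NFC", sequence)
print(composed == precomposed)
```

Both forms would normally be shown with the same glyph, but under the
definition above only the first is a single "character".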
keld