Re: What is a charset?

Erik M. van der Poel writes:

OK, folks, let's try to reach closure on the definition of the term
"character set".

RFC 1341 (mime1) has this definition:

            The term "character  set",  wherever  it  is  used  in  this
            document,  refers  to a coded character set, in the sense of
            ISO character set standardization  work,  and  must  not  be
            misinterpreted as meaning "a set of characters."

Mime2 has this definition:

            The term "character  set",  wherever  it  is  used  in  this
            document,  refers  to  a  unique mapping of a byte stream to
            glyphs, a mapping which does not require external  profiling
            information.   For  example,  bare  "ISO 10646" can't be the
            charset parameter,  because  it  requires  several  language
            information for the unique mapping to glyphs.  However, this
            term can refer to multibyte character sets and to  extension
            techniques such as those used in ISO 2022.

I suggest this definition:

    The term "character set" is used in this document to refer to a method
    used with one or more tables to convert encoded text to a series of
    octets.  This definition is intended to allow various kinds of text
    encodings, from simple single-table mappings such as ASCII, to complex
    table switching methods such as those that use ISO 2022's techniques.


We should get it right.

The first thing to note is that we have invented our own term "charset",
and that is handy, as we can distinguish ourselves from the normal
use of the term "character set" - which for an IETF document would
naturally be read as the ISO term - due to the close cooperation with
ISO.

The IETF definition of "charset" would most naturally be closely related
to the ISO term "coded character set". All charsets mentioned in
the MIME standards are ISO coded character sets: ASCII, ISO-8859-X
etc. If we mean something different or a special subset of the
family of ISO "coded character sets" we should say so in the document.

The mime2 definition has several flaws in it. It should not refer
to the term "glyph" - as noted in several previous mailings.
It should not exclude 10646 per se, IMHO.
What it got right (IMHO) is the direction of the mapping - from the
bits to the characters.
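
To make the direction of the mapping concrete, here is a minimal sketch
in Python. It only assumes the codec names "iso-8859-1" and "iso-8859-7"
from Python's standard library; the variable names are my own. The same
octet comes out as different characters depending on which charset is
named, which is why the charset label has to carry all the information
needed for the interpretation.

```python
# One and the same octet, interpreted under two different charsets.
# The mapping goes from the bits to the characters.
raw = bytes([0xE6])

as_latin1 = raw.decode("iso-8859-1")  # ISO 8859-1, Latin alphabet No. 1
as_greek = raw.decode("iso-8859-7")   # ISO 8859-7, Latin/Greek alphabet

# Without the charset label the receiver cannot know which one was meant.
print(repr(as_latin1), repr(as_greek))
```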

The EvdP definition does not relate the concept to ISO terms.
Here a "charset" defines a method from "encoded text" to a series
of "octets". The basic terms are not defined here, but "octet"
is an ISO defined term. I question the use of "octet".
There are 7-bit communication lines in existence today that are
perfectly capable of carrying MIME mail. There are charsets which
do not need more than 7 bits, so I would prefer to say either
a "series of bits" or an "output stream".
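
As a rough illustration of the 7-bit point, a small Python sketch.
The helper name fits_in_7_bits and the sample strings are hypothetical,
chosen only for this example. Text coded in ASCII never sets the eighth
bit, while the same check fails for ISO 8859-1 text with national letters.

```python
def fits_in_7_bits(data: bytes) -> bool:
    """Return True if no unit in the stream has the eighth bit set."""
    return all(b < 0x80 for b in data)


ascii_bytes = "MIME mail".encode("ascii")
latin1_bytes = "Bl\u00e5b\u00e6rgr\u00f8d".encode("iso-8859-1")

print(fits_in_7_bits(ascii_bytes))   # ASCII needs only 7 bits per character
print(fits_in_7_bits(latin1_bytes))  # ISO 8859-1 uses the eighth bit here
```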

The term "encoded text" is undefined, and to me it seems circular.
What is "encoded text"? I would expect it to be encoded characters
representing text, so there we have the encoding already. 

I would much prefer the ISO term "coded character set", which maps
"characters" to bits in a stream. We should then also define
the concept of "character" - and this should allow unified
CJKT characters, although some Japanese and others do not like them.
I think there are good reasons for their dislike, but in any case there
are characters defined which have the combined meaning of each
of the pre-unified CJKT characters. Characters are here taken to
mean the ISO term.

So here is my go:

    The terms "character set" and "charset" are synonymous in this document.
    The terms "character set" and "charset", wherever they are used in this
    document, refer to a "coded character set", in the sense of
    ISO/IEC JTC1 character set standardization work, and must not be
    misinterpreted as meaning a "character repertoire". A charset
    specification must include all information for a bit stream to 
    be interpreted as the correct characters.
  
    The terms "bit", "stream" and "character" are as defined by ISO/IEC JTC1.

    Comments on the above definitions:

    The term "character set" as used in this document must not be 
    misinterpreted as meaning a "character repertoire" (which is the
    ISO definition of the term). 

    A charset will normally include specification of both graphic and
    control characters, contrary to many ISO coded character set
    standards that only contain either graphic or control characters.

    A charset can use stateful encodings such as the techniques
    defined in ISO 2022. Thus a charset can consist of several ISO
    charsets. Also a charset can use extension techniques as defined
    in ISO 2022 and the mnemonic technique defined in RFC1345.

    A character can have its meaning further defined by a language
    indication, for example a unified CJK character can have its 
    meaning restricted to a Japanese character.

    Bit and byte ordering must be defined for a charset.

    An encoding of a charset, such as UTF-1 of ISO 10646, is also
    considered a charset.

    The definition uses the ISO term "character" - which means that
    it can represent several "glyphs" (in the ISO sense). It also means
    that ISO 10646 "combining sequences" are not considered "characters".
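
Two of the comments above, stateful encodings and byte ordering, can be
sketched in Python. This is an illustration only, under these assumptions:
the codec "iso2022_jp" from Python's standard library stands in for an
ISO 2022 based charset, and the "utf-16-be" / "utf-16-le" codecs stand in
for a two-octet coded form where byte order matters. The sample text and
variable names are my own.

```python
ESC = b"\x1b"

# ASCII, then two kanji, then ASCII again.
text = "abc \u65e5\u672c def"
encoded = text.encode("iso2022_jp")

# Escape sequences switch the active table, so the meaning of a byte
# depends on the state set up by what came before it in the stream.
print(encoded)
print(encoded.count(ESC), "escape sequences")

# The charset name alone is enough to get the characters back.
decoded = encoded.decode("iso2022_jp")

# Byte ordering: the same character, two different octet sequences.
a_big_endian = "A".encode("utf-16-be")
a_little_endian = "A".encode("utf-16-le")
print(a_big_endian, a_little_endian)
```

Note that the ISO 2022 based stream here stays within 7 bits, and that
the two byte orders cannot be told apart unless the charset says which
one is in use.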

keld