ietf-822

charsets and glyphs

1993-02-17 09:30:39
The current version of the MIME document (i.e. mime2) says:

            The term "character  set",  wherever  it  is  used  in  this
            document,  refers  to  a  unique mapping of a byte stream to
            glyphs, a mapping which does not require external  profiling
            information.   For  example,  bare  "ISO 10646" can't be the
            charset parameter,  because  it  requires  several  language
            information for the unique mapping to glyphs.  However, this
            term can refer to multibyte character sets and to  extension
            techniques such as those used in ISO 2022.

and later:

            This RFC specifies the definition of the  charset  parameter
            for  the  purposes  of MIME to be a unique mapping of a byte
            stream to glyphs, a mapping which does not require  external
            profiling  information.  For example, bare "ISO 10646" can't
            be  the  charset  parameter,  because  it  requires  several
            language information for the unique mapping to glyphs.


Why has the stuff about bare 10646 been added to the document?  As far
as I can see, there was no consensus *at all* concerning this issue on
this mailing list.

The term "glyph" is used exactly 4 times in the whole document, and
all 4 of those occurrences are in the material quoted above, but there
is no definition for this term, nor is there any pointer to a document
that defines the term.

However, I *suspect* that MIME's "charset" parameter is not intended
to indicate which glyphs are being represented in the message.  Since
this is only a suspicion of mine, I would like to hear what everybody
else thinks.

There are several ways to write the letter "a", including:

         ***          ***
        *   *        *   *
            *       *    *
         ****       *    *
        *   *       *    *
        *   *       *    *
         *** *       **** *

These two are different glyphs, but they are the same character.  (The
terms "glyph" and "character" have been the subject of lots of debate,
especially in the ISO/IEC JTC1 SCs 2 and 18, but I *think* everyone
would agree about the above example.)
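
To make the distinction concrete, here is a minimal sketch in Python.
The two "fonts" and the names GLYPHS, decode and render are
hypothetical, invented only for illustration; the bitmaps are the two
shapes drawn above.  The charset label is consulted only when turning
bytes into characters; which glyph appears is decided later, by
whatever font the recipient happens to use.

```python
# Hypothetical fonts: each maps a character to a bitmap (a glyph).
GLYPHS = {
    "font_one": {"a": [" ***", "*   *", "    *", " ****",
                       "*   *", "*   *", " *** *"]},
    "font_two": {"a": [" ***", "*   *", "*    *", "*    *",
                       "*    *", "*    *", " **** *"]},
}

def decode(byte_stream, charset):
    """Bytes -> characters: the only step where the charset label matters."""
    return byte_stream.decode(charset)

def render(text, font):
    """Characters -> glyphs: depends on the font, not on the charset."""
    return [GLYPHS[font][ch] for ch in text]

# The byte 0x61 is the character "a" under both charsets...
same_char = decode(b"\x61", "us-ascii") == decode(b"\x61", "iso-8859-1")
# ...but that one character can be drawn with different glyphs.
different_glyphs = render("a", "font_one") != render("a", "font_two")
```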

There seems to be a consensus in this group that us-ascii, iso-8859-1
and iso-2022-jp are MIME "charsets".

The first two, us-ascii and iso-8859-1, are what ISO usually calls
"coded character sets".  The last one, iso-2022-jp, is actually a
well-defined combination of 4 coded character sets (typically, only 2
are used in any one message).
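
As a rough sketch of what that combination looks like on the wire,
the Python below scans an iso-2022-jp byte stream for the escape
sequences that switch between the four coded character sets.  The
helper name sets_used is hypothetical, the scanner does no error
handling, and it relies on Python's built-in "iso2022_jp" codec to
produce the sample bytes.

```python
ESC = b"\x1b"

# Escape sequences that designate each of the four coded character sets.
DESIGNATIONS = {
    ESC + b"(B": "ASCII",
    ESC + b"(J": "JIS X 0201-Roman",
    ESC + b"$@": "JIS X 0208-1978",
    ESC + b"$B": "JIS X 0208-1983",
}

def sets_used(data):
    """Return the coded character sets a message uses, in order of first use."""
    used = []
    current = "ASCII"          # iso-2022-jp text starts out in ASCII
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in DESIGNATIONS:
            current = DESIGNATIONS[seq]
            i += 3
            continue
        if current not in used:
            used.append(current)
        # JIS X 0208 characters take two bytes; the other sets take one.
        i += 2 if current.startswith("JIS X 0208") else 1
    return used

text = "a\u3042"               # Latin "a" followed by HIRAGANA LETTER A
encoded = text.encode("iso2022_jp")
report = sets_used(encoded)
```

For this short sample, report should list only two of the four sets,
which matches the "typically, only 2" remark above.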

So, in my view, a MIME charset is *not* a glyph encoding.  Well, what
*is* a charset, then?  That's the hard question.

John's suggestion of writing a separate document about the kinds of
things that can be registered with IANA as "charsets" seems OK, but I
can't help thinking it would be nice if there were some guidance in
MIME itself, in much the same way that text subtypes are explained.

We could have some prose that explains the intent of The Three Rules,
and it would be up to IANA whether or not to register a particular
proposal.  So far, IANA has not been very strict.  I'm not sure
whether that's a good thing, but then it may just be the Internet Way.
Anything can be registered, but to succeed, it has to prove itself in
the field.


Erik

