Response to MIME charset issue

Below you will find a response to the issue raised concerning whether ISO
10646/Unicode meets the MIME requirements for a charset which may be
registered. This analysis was prepared by John Jenkins of Taligent with
assistance from myself, Lee Collins, and Mark Davis (all also of Taligent),
as well as assistance from Nathaniel Borenstein and Ned Freed, the authors
of RFC 1521.

In summary, we believe that the character set which forms the basis for our
proposal, namely "international standard ISO/IEC 10646-1:1993(E); Coded
Representation Form=UCS-2; Subset=300; Implementation Level=3", does in
fact meet MIME's requirement. The detailed response follows. Taligent will
be closed this coming week, so we may not be able to respond to any
comments, depending on who decides to come to work anyway.
-------------
The fundamental issue here is whether or not Unicode and ISO/IEC 10646
define what MIME considers to be a "charset."  The relevant MIME language
is:

"This RFC specifies the definition of the charset parameter for the
purposes of MIME to be a unique mapping of a byte stream to glyphs, a
mapping which does not require external profiling information."
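
As an illustration of the "byte stream plus charset label" half of this definition, here is a minimal Python sketch that turns a UCS-2 byte stream into a sequence of character codes. The function name decode_ucs2 is hypothetical, and big-endian byte order is an assumption made for the example; neither comes from RFC 1521 or from our proposal.

```python
def decode_ucs2(data: bytes) -> str:
    """Decode a UCS-2 byte stream into characters.

    Assumption for this sketch: each character is one 16-bit code
    unit, stored most significant byte first (big-endian).
    """
    if len(data) % 2:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    # Read two bytes at a time and turn each 16-bit value into a character.
    units = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "".join(chr(u) for u in units)


# Three characters: LATIN CAPITAL LETTER A, HEBREW LETTER ALEF, and a Han ideograph.
sample = bytes.fromhex("004105D04E2D")
print([f"U+{ord(c):04X}" for c in decode_ucs2(sample)])
```

Nothing beyond the bytes and the stated encoding form is consulted here. What the sketch yields is character codes; which glyph is drawn for each is a separate step.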

The terms "character" and "glyph" have specific meanings within ISO usage.
The United States bodies X3L2 and X3V1 have recently developed a
character/glyph model whose main purpose is to clarify the use of these
terms and provide examples of their usage.  This character/glyph model was
developed at the request of the relevant ISO bodies and has been forwarded
both to SC2 and SC18 for formal approval.

Within the context of standards such as 10646 and 8859, a "character" is
considered to be a unit of semantic information, whereas a "glyph" is
considered to be a unit of visual representation -- an abstract shape which
determines, ultimately, specific bits on a screen or dots on a page.
Distinctions may be made with glyphs (such as a character being drawn with
or without serifs) which do not impact semantic content.  Even the glyphs
themselves do not specify the exact visual representation, which requires
additional information such as font and weight.

(There are some highly-specialized contexts within which authors may choose
to define semantic content in terms of specific glyphs.  An example would
be an English textbook teaching children to read by assigning distinct "a"
glyphs to distinct "a" sounds.  Such examples are rare, however, and no
current ISO character set standard supports them in plain text.)

Any implementation of standards such as 10646 and 8859 begins with
characters--semantic information--and transforms them in some fashion to a
sequence of bit patterns on a page or screen via appropriate glyphs and
formatting information.
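
The transformation described above can be sketched in a few lines of Python. The two "fonts" and their glyph names are invented purely for illustration and are not taken from any standard or glyph registry.

```python
# Hypothetical glyph inventories: the font names and glyph names below
# are invented for this sketch only.
FONTS = {
    "serif": {"a": "a.double-storey.serif", "g": "g.looptail.serif"},
    "sans": {"a": "a.single-storey.sans", "g": "g.opentail.sans"},
}


def render(text: str, font: str) -> list[str]:
    """Map each character of `text` to a glyph name from the chosen font."""
    table = FONTS[font]
    return [table[ch] for ch in text]


# The same character sequence, rendered with different out-of-band
# information (the font), gives different glyphs.
print(render("gag", "serif"))
print(render("gag", "sans"))
```

The character sequence is identical in both calls; only the font, which is supplied out of band, changes the glyphs selected.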

The process of transforming characters to glyphs and exact visual
representation is inherently ambiguous without out-of-band information such
as font, point size, and styles such as italic or boldface.  The character
set standards developed by ISO are implicitly intended, however, to be
sufficient to convey semantic information even if this out-of-band
information is missing.  This implicit requirement was made explicit in the
development of Unicode.  Unicode speaks of "minimal legibility," meaning
that a plain-text Unicode file contains sufficient information for the
recipient to be able to read and understand its contents, even if all
out-of-band formatting information is lost.

Unicode, 10646, and other ISO character set standards as well as important
national standards such as JIS X 0208 explicitly avoid limiting the set of
glyphs which can be used to render the characters they encode.  For the
sake of legibility, they provide representative glyphs indicating a common
or typical shape for characters which they are intended to represent, but
they explicitly deny that these shapes are to be considered normative.

For example, the relevant language from JIS X 0208 is:

"Sec. 3.6 of JIS X 0208-1990, The Handling of Variant Characters

"...Accordingly, within a certain scope, [the standard] permits variation
of the graphic expression of a character displayed in a code position, and
that [particular] graphical expression should only be considered to be one
example of the variants. This standard does not specify the details of the
graphical expression."

This is particularly important for ISO/IEC 10646, which uses up to four
glyphs to represent characters in its unified East Asian ideograph set.
The mappings between these ideographs and various national standards are a
normative part of 10646; the glyphs used to render these ideographs are
intended to be reflections of the glyphs used in the relevant national
standards, and are not normative.  Indeed, the rules used for Han
unification, developed by the Japanese delegation to the CJK-JRG, allow at
least as much variation in the characters represented, within any given
language, as is seen among the glyphs used within the four-column
charts.  That multiple glyphs have, in fact, been used to represent these
characters within the 10646 charts is merely a reflection of the
multiple-column format used and is otherwise no more significant than the
fact that a single glyph is used elsewhere within the standard.

Nor is the fact that 10646 allows the use of combining marks relevant.
Combining marks are necessarily a part of the encoding of various South
Asian and Semitic languages.  If the issue is the ability to render text
intelligibly as opposed to rendering text exactly, then any Level 3
implementation of 10646 will be able to provide appropriate rendering.  It
was for this reason that the proposal was phrased to apply to Level 3
implementations of 10646.
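
To make the point about combining marks concrete, here is a small Python sketch that attaches each combining mark to the base character before it, so that a renderer can treat the pair as one unit. The function name clusters is hypothetical, and the grouping rule (any character whose Unicode general category begins with "M" joins the preceding character) is a deliberate simplification for illustration, not a segmentation rule taken from 10646. It relies on the character database shipped with Python, which reflects a much later version of Unicode than the 1993 standard.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplifying assumption: a character is a mark if its general
    category starts with "M" (Mn, Mc, or Me).
    """
    result: list[str] = []
    for ch in text:
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch  # attach the mark to the preceding base
        else:
            result.append(ch)  # start a new unit with a base character
    return result


# Hebrew BET + DAGESH, then Devanagari KA + VOWEL SIGN I, then Latin "x".
sample = "\u05d1\u05bc\u0915\u093fx"
print(len(sample), len(clusters(sample)))
```

The five coded characters form three units for display. Everything needed to group them is in the character data itself.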

If it is the intent of MIME to lock users into the specifics of the
bit-layout on the screen or on the page, then no current ISO character set
standard is a "charset," and only a glyph registry such as ISO/IEC 10036
could qualify.  Indeed, even ISO/IEC 10036 does not fix the exact shape of
a glyph, and so would itself be insufficient to determine the exact
appearance of the rendered text.  Such seems clearly not to be the intent of
MIME.  Instead, MIME intends to provide what Unicode would refer to as
"minimal legibility"--a text with its charset specified can be intelligibly
rendered even if out-of-band information is lost.

We have consulted Ned Freed and Nathaniel Borenstein regarding the intent
of the language within MIME.  Excerpts from their responses follow:

Ned Freed:
------------------
The intent [of MIME] here is pretty simple: Given the sequence of bytes in
the body part and the charset value, it must be possible to display the
message in the fashion the message creator intended.

This may sound like an obvious criteria, but it really isn't. X.400, for
example, includes several body parts that cannot be properly displayed without
out of band information. You may not be aware that out of band information is
being used, but it is.

We wanted to avoid the problems X.400 had in this area, so we attempted to
craft language that would force people registering character sets to deal with
these issues as part of the registration process.

The only problematic part of this sentence is the word "glyph". I have no
preference as to the word we use, but we were told by a large number of
character set folks that this was the word we wanted.

David Goldsmith:
I assume you did not mean glyphs absolutely literally, or else the fact that
ASCII can be displayed in different fonts would disqualify it. Knowing what
you were trying to accomplish here will help a great deal.


My understanding is that using "glyph" to refer to a specific bitmap or
whatever is being overly strict. (Yes, I know that Unicode uses the word in
this sense. The equivalent Unicode concept would be "character code", I
believe.)
-------------
and:
-------------
Now that I look at this stuff closely I think our usage of "glyph" really is
incorrect. The problem is that there isn't a single consistent name for the
thing we want here.

Nathaniel Borenstein:
-------------------
If anyone thinks that Unicode can't be a MIME character set because of
something RFC 1521 says, then RFC 1521 is wrong.  Period.  Our intent very
specifically included being INCLUSIVE of the then-emerging Unicode/10646
standard.  Of course, the MIME character set registration should specify
an unambiguously-interpretable usage of Unicode/10646, which is the intent
behind the "external profiling information" phrase.

--- end quotations

Determining the appropriate ISO language for MIME to use is difficult,
because ISO currently lacks the formal concept of "minimal legibility."  It
is, however, true that any system supporting Level 3 ISO 10646 or Unicode
can intelligibly render plain text in the absence of any further
information.


----------------------------
David Goldsmith
david_goldsmith(_at_)taligent(_dot_)com
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA  95014-2233