on character sets and encodings

I don't know about anyone else, but it doesn't seem to me as if
very much has been resolved out of the recent discussions
concerning character sets and encodings.  ("When everything was
said and done, a lot more was said than done.")  As John Klensin
correctly pointed out, there are some underlying philosophical
issues which are being poorly articulated or ignored, which
leaves people talking at cross purposes.  I think I've figured
out, in my own head at least, one of those issues, and I'd like
to discuss it here, because it's clarified several questions and
tradeoffs for me.  (Please don't misinterpret this note: by the
end you'll see that I am not so much disagreeing with anything in
MIME as suggesting some new vocabulary which may, in my case at
least, prevent future disagreement and misunderstanding.)

When exchanging an e-mail message (or anything, for that matter)
between two machines, we may distinguish between aspects of the
message which are fundamental attributes of it, and aspects of
the message which are artifacts of the transmission process.

Fundamental attributes are supposed to be preserved across the
transmission process; they include the essential, communicated
aspects of the message which the transmission process must not
disturb.  In traditional (RFC822) e-mail, fundamental attributes
have included word choice, line break locations, and language
(i.e. which human language a message is written in).  A mail
transport agent is not supposed to substitute one word for
another.  It is not supposed to fill, justify, or otherwise
rearrange lines.  It cannot be expected to translate a message
from the language of the sender to that of the recipient.

Artifacts of the transmission process are those aspects which
the transmission process may alter.  An individual receiving a
message may inspect the state of these aspects on the received
copy of the message, but he cannot assume that (and cannot even
determine if) they were the same on the originating copy.
Obviously, those aspects of a message which fall into this
category are not supposed to matter; a recipient isn't supposed
to mind that the transmission process might have altered things.
In traditional e-mail, transmission artifacts have included
newline representation (which has always been mapped to and from
the canonical CRLF) and character set (which has always been
translated to and from 7-bit US-ASCII).
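
To make the newline case concrete, here is a minimal sketch in
Python.  It is an illustration only: the names to_wire and
from_wire are invented for this note and come from no RFC, and
the sketch assumes the text is already pure US-ASCII.

```python
import re

CRLF = "\r\n"


def to_wire(local_text: str) -> bytes:
    """Map any local newline convention to canonical CRLF, as US-ASCII octets."""
    # The alternation tries CRLF first, so an existing CRLF is not doubled.
    canonical = re.sub(r"\r\n|\r|\n", CRLF, local_text)
    return canonical.encode("ascii")


def from_wire(wire_octets: bytes, local_newline: str = "\n") -> str:
    """Map canonical CRLF back to whatever the receiving host uses."""
    return wire_octets.decode("ascii").replace(CRLF, local_newline)
```

A recipient who calls from_wire sees only the local convention;
nothing in the result says whether the sender's machine used CR,
LF, or CRLF, which is exactly what makes it an artifact.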

When I first read RFC1341, and as I was first thinking about
(and discussing on this list) how to use and implement multiple
character sets within e-mail, I made what may have been a
mistake: I assumed that the character set of a message was one of
its fundamental attributes, rather than a transmission artifact.
(I had not at that time appreciated the distinction, nor fully
realized that character set has traditionally been a transmission
artifact.)  I thought of the character set as something that the
sender had selected when composing the message, and which I (or
my mail reading agent) might want to keep in mind in order to
interpret the message properly.

This "mistake" was neither casual nor groundless: there are
reasons for wanting the character set to be considered "part of"
a message.  Here is one: most operating systems have a single,
limited, default character set.  (Plan 9 is, I learn, a glorious
exception: its default set, though single, is hardly limited.)
Once upon a time the local character set always was (and in the
U.S., it still often is) ASCII (or EBCDIC).  On more modern
systems, especially in Europe, it may be ISO-8859-1 (a.k.a.
Latin1).  But RFC1341 says that MIME messages may use ISO-8859-x,
for several values of x.  What is a system to do when it receives
a message encoded using, say, ISO-8859-2?  Obviously it could map
it immediately to 8859-1 or ASCII, discarding characters present
in 8859-2 but not in the mapped-to set.  (This is indeed
essentially MIME's definition of minimal compliance.)  However,
many computer terminals, even those intended for use in the
provincial old U.S. of A., can display a variety of character
sets.  Therefore, it becomes attractive to defer any character
set mapping (and particularly any lossy mapping) to the time the
message is displayed, which requires keeping the charset
information around with the intention of using it later.

This is why I have been objecting to the notion of

        Content-Type: text/plain; charset=iso10646-utf-2

and the like.  In general terms, I wanted the character set
specification to be able to tag along with the message and be
dealt with at message display time.  But I wanted all
(transmission) encodings to be dealt with and decoded at message
receipt time, so that all programs which might ever display a
message wouldn't also have to know about encoding algorithms.
And it seemed wrong, a violation of modularity, if the message
receipt process, in order to do its decoding, had to "peek" at
the charset parameter, which "belonged" to the message display
process.
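
The cost of mapping too early, as opposed to deferring it until
the display device is known, can be sketched in Python.  This is
only an illustration under stated assumptions: lossy_map and
characters_lost are hypothetical helpers, Python's codec names
(iso8859_2, iso8859_1, ascii) stand in for the MIME charset
names, and "?" is simply the replacement character Python's codec
machinery happens to use.

```python
def lossy_map(octets: bytes, source: str, target: str) -> bytes:
    """Re-encode octets from source to target; unmappable characters become '?'."""
    return octets.decode(source).encode(target, errors="replace")


def _fits(ch: str, target: str) -> bool:
    """Report whether a single character exists in the target character set."""
    try:
        ch.encode(target)
        return True
    except UnicodeEncodeError:
        return False


def characters_lost(octets: bytes, source: str, target: str) -> int:
    """Count the characters that the target character set cannot represent."""
    return sum(1 for ch in octets.decode(source) if not _fits(ch, target))


# A four-character sample (L-stroke, o-acute, d, z-acute) as ISO-8859-2 octets.
SAMPLE = "\u0141\u00f3d\u017a".encode("iso8859_2")
```

For SAMPLE, mapping to ISO-8859-1 loses two characters and
mapping to ASCII loses three, while a terminal that can display
ISO-8859-2 loses none; only at display time is it known which of
these cases applies.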

(I can hear a few objections out there already.  If I want to
centralize knowledge of encoding algorithms by handling decoding
at message receipt time, why not centralize character set
translation as well?  Answer: because if the character set
translation can involve loss, loss is minimized if the
translation decisions are deferred until display time, when the
display device is known.  Objection: UTF-2 is such a superior
encoding, why would you even want to "decode" it at message
receipt time?  What other representation of wide characters might
you use to hold the message until message display time?  Answer:
doesn't matter; my objection was on philosophical grounds of
separation of function, even if the particulars of the UTF-2
example seem weak.  Objection: why do you persist in
distinguishing between "character sets" and "encodings"?  UTF-2
*is* a character set.  Answer: actually, it's gradually dawning
on me that that might be a better way to think about it, after
all, at least for the purposes of MIME...)

It seemed to me that a thing to do to address this concern (and a
few other issues, such as compression, as well) would be to
introduce a new parameter which could separately specify an
encoding:

        Content-Type: text/plain; charset=iso10646; encoding=utf-2

This would let me push encoding details towards the message
receipt end of things, and character set details towards the
message display end, without either end having to "peek" at the
other's parameter.
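
As a sketch of that division of labor, the following Python uses
the standard email package to pull the two parameters apart.
Note loudly that encoding= is the hypothetical parameter proposed
in this note, not anything defined by MIME, and that
split_parameters and its dictionary keys are invented names.

```python
from email import message_from_string

RAW = (
    "Content-Type: text/plain; charset=iso10646; encoding=utf-2\n"
    "\n"
    "body\n"
)


def split_parameters(raw_message: str) -> dict:
    """Hand each Content-Type parameter to the stage that would consume it."""
    msg = message_from_string(raw_message)
    return {
        "type": msg.get_content_type(),
        # Hypothetical encoding= parameter: consumed at message receipt time.
        "for_receipt": msg.get_param("encoding"),
        # charset= parameter: carried along until message display time.
        "for_display": msg.get_param("charset"),
    }
```

With the single-parameter form, charset=iso10646-utf-2, the
"for_receipt" slot comes back as None, and the receipt process
has nothing to work from except the charset name itself.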

There are, however, several more objections.  Harald Alvestrand
anticipated exactly such a suggestion when he said earlier that


        In this instance, [the Santa Fe meeting's third point]
        would lead to rejecting "charset=iso10646;char-enc=utf-2"
        - just in case anyone was thinking of it!

(The "third point" is the one which says "No further parameters
need to be parsed to get the complete identity of the character
set.")

In private mail, Keith Moore perspicaciously asked if I wouldn't
like, along similar lines, a hypothetical newline= parameter,
which would specify whether a message encoded line breaks as CR,
LF, or CRLF, and which might therefore seem less imposing to
hosts which didn't use internally the same canonical encoding as
messages do externally.  This cuts to the heart of the matter:
it's agreed that newline representation is a transmission
artifact.  Is the thing which charset= describes also?  If so,
then what I need to do, to align my thinking with the model
inherent in MIME, is:

        *not* to try to shove the encoding details into a new
        encoding= parameter, so that charset= can continue to
        describe an inherent attribute, but

        rather, to realize that charset= is a transmission
        artifact, which can and should be inspected, consumed, or
        otherwise completely dealt with at message receipt time,
        and (if I want to defer some character set translation
        decisions to message display time) to augment my own
        internal canonical message representation (whatever my
        message receipt process converts to) with my own
        character set tag, perhaps derived from (but in any case
        separate and distinct from) MIME's charset= parameter.

If the latter is truly the thing to do, I will adopt the memory-
and proper-thinking aid of pretending that "charset=" is
pronounced "transfer character encoding equals."  (Since reaching
this realization, I happened to reread a message from someone
else suggesting essentially the same thing, although I've already
forgotten who it was.)
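
The second of the two alternatives above (consume charset= at
receipt time, and keep a private tag for display time) might look
something like the following Python sketch.  Everything here is
an assumption made for illustration: StoredMessage, receive, and
display are invented names, and Python's own codec registry
stands in for whatever table of character sets a real host would
keep.

```python
import codecs
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredMessage:
    octets: bytes   # message body in the host's internal form
    local_tag: str  # the host's own character set tag, not MIME's charset=


def receive(wire_body: bytes, mime_charset: str) -> StoredMessage:
    """Consume charset= at receipt time, keeping only a locally meaningful tag."""
    # codecs.lookup raises LookupError if the host does not know the charset.
    local_tag = codecs.lookup(mime_charset).name
    return StoredMessage(octets=wire_body, local_tag=local_tag)


def display(stored: StoredMessage, terminal_charset: str) -> bytes:
    """Map to the terminal's set only now, when the display device is known."""
    text = stored.octets.decode(stored.local_tag)
    return text.encode(terminal_charset, errors="replace")
```

The point of the design is that display never sees MIME's
charset= value at all; it sees only the tag that receive chose to
store, which the host is free to spell and interpret as it likes.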

Just yesterday, an even stronger objection (to the notion of
specifying encodings distinctly) occurred to me, which I'm
surprised nobody else has mentioned: any information about a
character set's encoding must be contained in the character set's
name if the mechanism of RFC1342 is to work.  So, perhaps I had
really better not propose encoding= just now.
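
To see why, recall that an RFC1342 encoded word has the shape
=?charset?encoding?text?=, where the middle field names a
transfer encoding (Q or B) and there is no slot for a separate
character encoding.  A minimal Python sketch follows, assuming
that the standard library's email.header module (which follows
the successor specification, RFC 2047) is an acceptable stand-in;
decode_encoded_word is an invented helper name.

```python
from email.header import decode_header


def decode_encoded_word(header_value: str) -> str:
    """Decode a header value using only the charset named in each encoded word."""
    parts = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, bytes):
            # The charset name is the only clue to how these octets are encoded.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A header with no encoded words comes back as plain text.
            parts.append(chunk)
    return "".join(parts)
```

Whatever is needed to turn the octets into characters has to be
recoverable from the charset field alone.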

Now, if anyone is still with me, you may be asking what all of
the preceding discussion has to do with the MIME draft.  Just
because, after long thought and a few thwacks over the head with
two by fours from IETF members, I have finally come to a proper
understanding of what RFC1341 was trying to say all along,
doesn't necessarily mean that its wording is perfect.  I'd hate
for someone else to have as hard a time getting the right idea,
especially if that person didn't have access to this list, and
especially if that person was writing a would-be MIME-compliant
application (and very especially if the actual compliance level
turned out to be low, but that fact were only discovered after
that person had distributed 10,000 copies which were happily
"interoperating" among themselves).  I have, therefore, a few
questions and suggestions.

Question: how are decent-quality MIME mailers expected to deal
with incoming messages encoded using charsets other than the
local system's default?

Question: are MIME character sets, in fact, what I have called
"transmission artifacts?"

Suggestion: discuss (using better nomenclature, if it exists) the
distinction between "message attribute" and "transmission
artifact," and state clearly which one of them a MIME character
set is.

Suggestion: describe, in glowing language such as that which
Keith Moore and Ned Freed have used in response to my bumblings
on this list, the high virtues and significance of a "canonical
encoding."

Suggestion: augment the Appendix on Canonical Encoding Model with
a discussion of character set issues.

Suggestion: state explicitly that "a MIME charset is not what you
might think it is; in particular, it is not a simple character
set.  It is best thought of as an *algorithm* which produces a
stream of characters from an octet stream."

Suggestion: it occurs to me that MIME could use a "rationale"
document, which could be published as an informational RFC, which
would describe at length and with examples the messaging models
inherent in MIME.  Such a document might be a more appropriate
venue for the expanded discussions pertaining to the previous
suggestions.
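
The suggestion above that a MIME charset is best thought of as an
algorithm from octets to characters can be illustrated with a
short Python sketch.  The function name characters is invented
for this note, and Python's incremental decoders are only one
possible model; UTF-8 is used here on the assumption that it is
the closest modern relative of UTF-2.

```python
import codecs
from typing import Iterable, Iterator


def characters(octets: Iterable[int], charset: str) -> Iterator[str]:
    """Yield characters one at a time from a stream of octets."""
    decoder = codecs.getincrementaldecoder(charset)()
    for octet in octets:
        # A multi-octet character yields nothing until its last octet arrives.
        yield from decoder.decode(bytes([octet]))
    # Flush the decoder; this raises if the stream ended mid-character.
    yield from decoder.decode(b"", final=True)
```

Fed the UTF-8 octets or the ISO-8859-2 octets of the same text,
with the matching charset name, characters produces the same
stream of characters from two different octet streams.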

I'm sorry to be complaining and suggesting rather than offering
much actual text, because I know everyone is anxious to get the
revised MIME draft finished and released.  I'm quite willing to
contribute text (and I may even be willing to volunteer to write
a rationale; encouragement gladly accepted), but first I need to
know for sure what the consensus on charset= (whether artifact or
attribute) is.

                                        Steve Summit
                                        scs@adam.mit.edu