ietf-822

Re: on character sets and encodings

1993-02-13 01:58:15
Thanks, Steve Summit, for a great discussion about "charset".  It's
good to have an "outsider" come in and ask questions about the draft;
it really focuses the mind.

I agree with most of what Steve said.  I will only respond to the
parts where I have something to say.


What is a system to do when it receives
a message encoded using, say, ISO-8859-2?  Obviously it could map
it immediately to 8859-1 or ASCII, discarding characters present
in 8859-2 but not in the mapped-to set.  (This is indeed
essentially MIME's definition of minimal compliance.)

If MIME's minimal conformance section can be interpreted to mean that
certain characters may be "discarded", something's wrong.
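
As a rough sketch of the kind of loss I am worried about (the sample
text and the mapping to "?" are mine, not anything the draft
prescribes):

    # A sketch of what "discarding characters" amounts to in practice.
    # The sample text is hypothetical; the point is that every character
    # without an ASCII equivalent simply disappears or becomes "?".
    body = "Cze\u015b\u0107".encode("iso-8859-2")   # "Czesc" with acutes

    text = body.decode("iso-8859-2")
    lossy = text.encode("ascii", errors="replace").decode("ascii")
    print(lossy)                                    # prints "Cze??"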


And it seemed wrong, a violation of modularity, if the message
receipt process, in order to do its decoding, had to "peek" at
the charset parameter, which "belonged" to the message display
process.

This reminds me of an objection I had a long time ago.  If you send an
ASCII message to an EBCDIC site, the message will say that it's in
ASCII (charset=us-ascii), but it will say that in EBCDIC!  (The
gateway automatically converts to EBCDIC, but it doesn't make the
corresponding change to the charset parameter.  But maybe some of
those gateways have now been updated to take MIME into account.  Or
perhaps the UAs in those environments have been updated to take this
issue into account.  Does anyone know of such changes?)

People objected to my objection, of course, saying that we had to add
charset tags *somehow*, and that the current way is probably the best.
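
For anyone who has not seen the gateway problem, here is a minimal
sketch of it (code page 500 is just an assumption for the EBCDIC
side; a real gateway will differ in the details):

    # Sketch of the gateway problem: the whole message, charset label
    # included, gets recoded to EBCDIC (cp500 is only an assumption),
    # yet the label still claims the body is us-ascii.
    header = 'Content-Type: text/plain; charset=us-ascii\r\n\r\n'
    body = 'Hello, world.\r\n'

    ebcdic_message = (header + body).encode('cp500')
    print(ebcdic_message)   # "charset=us-ascii" is now itself in EBCDIC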


If I want to
centralize knowledge of encoding algorithms by handling decoding
at message receipt time, why not centralize character set
translation as well?  Answer: because if the character set
translation can involve loss, loss is minimized if the
translation decisions are deferred until display time, when the
display device is known.

What, exactly, do you mean by "message receipt time"?  If you do the
decoding too early, you might break existing mail readers.
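
The quoted reasoning itself sounds right to me, though.  Here is a
minimal sketch of "decode early, translate late" (the function names
and the base64 example are mine, not something the draft specifies):

    import base64

    def on_receipt(cte, payload):
        # Undo only the transfer encoding; this step is always lossless.
        if cte == 'base64':
            return base64.b64decode(payload)
        return payload                    # 7bit/8bit: nothing to undo

    def on_display(octets, charset, display_charset):
        # Translate only here, where the display device's repertoire is
        # known and where any loss is therefore unavoidable anyway.
        text = octets.decode(charset)
        return text.encode(display_charset,
                           errors='replace').decode(display_charset)

    body = on_receipt('base64',
                      base64.b64encode('d\xe9j\xe0 vu'.encode('iso-8859-1')))
    print(on_display(body, 'iso-8859-1', 'ascii'))    # prints "d?j? vu"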


Question: how are decent-quality MIME mailers expected to deal
with incoming messages encoded using charsets other than the
local system's default?

MIME probably can't mandate anything here, but it could *suggest* that
a fallback (similar to Keld's &e' for e-acute) be used.
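
Something along these lines, say (the mnemonic table is made up for
illustration; Keld's actual list is much larger and more systematic):

    # One possible fallback: replace characters the local device cannot
    # show with a readable mnemonic instead of dropping them silently.
    FALLBACK = {
        '\u00e9': "&e'",     # e-acute, in the spirit of Keld's &e'
        '\u00e8': '&e!',     # e-grave (mnemonic invented for this sketch)
    }

    def display_with_fallback(text):
        return ''.join(c if ord(c) < 128 else FALLBACK.get(c, '&?')
                       for c in text)

    print(display_with_fallback('caf\u00e9'))    # prints "caf&e'"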


Suggestion: discuss (using better nomenclature, if it exists) the
distinction between "message attribute" and "transmission
artifact," and state clearly which one of them a MIME character
set is.

MIME "charsets" are both "message attributes" and "transmission
artifacts".  These two things are not separate.

The MU (Mail Unicode) proposal I made some time ago grew out of the
realization that you cannot keep those two things separate.  At first
glance, my proposal seems like a Content-Transfer-Encoding (CTE)
because it uses a Base64-like method, but it isn't a CTE because it
can only be applied to a stream of 16-bit units (i.e. an even number
of octets), whereas Quoted-Printable and Base64 can be applied to any
number of octets.
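
To make the structural point concrete, here is a toy transform in
the same spirit.  It is *not* the MU specification, just base64
applied to a stream of 16-bit units:

    import base64

    def mu_like_encode(text):
        # Not the real MU; just an illustration of the constraint that
        # the input is a stream of 16-bit units, i.e. an even number of
        # octets, unlike Quoted-Printable or Base64 proper.
        units = text.encode('utf-16-be')
        assert len(units) % 2 == 0
        return base64.b64encode(units).decode('ascii')

    print(mu_like_encode('Hi'))    # prints "AEgAaQ=="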

I designed MU to be (1) transmissible, (2) readable, and (3) simple.
Of course, it is only really readable if most of the text is ASCII.
That is why I designed another charset called "jpu" (an encoding of
Unicode that can be used in *.jp), which uses iso-2022-jp for the
ASCII and JIS X 0208 characters and MU for the remaining Unicode
characters.  A similar "enhancement" of MU could be used in Europe,
i.e. Latin-1 for the Latin-1 characters and MU for the other Unicode
characters.
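
Purely as an illustration of that European variant, the following
shows which route each character would take; it does not attempt to
define how the two encodings would be delimited on the wire:

    def classify(text):
        # Route Latin-1 characters to Latin-1 and everything else to MU.
        for c in text:
            route = 'latin-1' if ord(c) <= 0xFF else 'MU'
            print(f'U+{ord(c):04X} -> {route}')

    classify('caf\xe9 \u2603')   # the snowman character would go via MU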


Cheers,
Erik

