It is not a mistake to keep explicitly separate and distinct that
which is separate and distinct, particularly when the costs of
maintaining the distinction are low. UTF-2 *is* an encoding.
Whether or not it is a good encoding (and it is a very good
encoding) does not affect our choice of whether to indicate its
use in a field labeled "encoding" or a field labeled "character
set."
I place a very high value on maintaining the distinction between an
"encoding" and a MIME "content-transfer-encoding". The latter exists to
map the canonical form of a content into a form that allows it to be
transmitted, without loss, via email. There is a secondary but related
purpose of the "content-transfer-encoding", which is to identify what kind
of transport is required to transmit the content without loss: hence the
binary, 8bit, and 7bit "null" encodings.
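To make the distinction concrete, here is a minimal sketch in C of one of
the non-null encodings, quoted-printable, which maps arbitrary octets into
a 7-bit-safe form. (Simplified: it over-encodes spaces and tabs as =20 and
=09, which is legal but verbose, and it omits the 76-column soft line
break rule that a real encoder must honor.)

    #include <stdio.h>

    /* Minimal quoted-printable encoder: octets in, 7-bit-safe
     * octets out.  Printable ASCII other than '=' passes through;
     * CRLF passes through as a hard line break; everything else
     * becomes =XX. */
    static void qp_encode(FILE *in, FILE *out)
    {
        int c;
        while ((c = getc(in)) != EOF) {
            if ((c >= 33 && c <= 126 && c != '=') || c == '\r' || c == '\n')
                putc(c, out);
            else
                fprintf(out, "=%02X", c);
        }
    }

The 7bit, 8bit, and binary "null" encodings, by contrast, are the identity
transform; they exist only to label what the transport must carry.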
The MIME model is that the canonical form of a content is an octet-stream.
Even if this isn't explicit in the document, it is amply demonstrated by
the fact that all of the defined content-transfer-encodings take an
octet-stream as input and produce one as output; none of them can deal
with anything wider than an 8-bit quantity. If this isn't clear enough,
we should make it clearer for the draft standard version.
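In code, the model looks something like this (the decoder function names
are illustrative, not taken from any particular implementation). The point
is that every entry has the same octets-in, octets-out type:

    #include <stdio.h>

    /* Every content-transfer-encoding codec has the same shape:
     * it reads an octet-stream and writes an octet-stream. */
    typedef void (*cte_fn)(FILE *octets_in, FILE *octets_out);

    extern void copy_octets(FILE *, FILE *);  /* identity */
    extern void qp_decode(FILE *, FILE *);
    extern void b64_decode(FILE *, FILE *);

    static const struct { const char *name; cte_fn decode; } cte_table[] = {
        { "7bit",             copy_octets },  /* "null" encodings */
        { "8bit",             copy_octets },
        { "binary",           copy_octets },
        { "quoted-printable", qp_decode   },
        { "base64",           b64_decode  },
    };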
This assumption considerably simplifies the implementation of MIME mail
readers and composers. My MIME parser can deal with any MIME type at some
level. All it has to do is to undo the content-transfer-encoding and pass
the resulting octet-stream to the appropriate display module (as defined
by mailcap or whatever). Things would be a lot hairier if a
content-transfer-encoding decoder could produce output of arbitrary width.
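A sketch of that parser structure, where decode_cte and lookup_viewer are
hypothetical helpers (the first dispatches on the content-transfer-encoding
name, through a table like the one above; the second does the mailcap
lookup):

    #include <stdio.h>

    extern void decode_cte(const char *cte, FILE *in, FILE *out);
    extern const char *lookup_viewer(const char *content_type);

    /* Undo the content-transfer-encoding, then pipe the resulting
     * octet-stream to whatever display program mailcap names for
     * this content-type. */
    int display_part(const char *content_type, const char *cte, FILE *body)
    {
        FILE *viewer = popen(lookup_viewer(content_type), "w");
        if (viewer == NULL)
            return -1;
        decode_cte(cte, body, viewer);
        return pclose(viewer);
    }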
[I wrote:]
To me, it doesn't make sense to feed text with character set 'x' to a
richtext parser designed to understand character set 'y'.
[Steve's reply:]
I agree. That's exactly why I'm trying to avoid having to teach
richtext parsers (or any other message processing software) about
encodings, which are properly dealt with in one central place.
Making UTF-2 an "encoding" for MIME purposes doesn't solve the problem of
implementing a richtext parser that understands various character sets.
"Decoding" a UTF-2 stream produces Unicode, but the richtext parser still
wants to read ASCII as well (unless we change richtext).
Besides, I think that there's general agreement that richtext needs work,
including in how it deals with non-ASCII characters. So if richtext and
UTF-2 don't like each other, that's not a problem with UTF-2 or with
MIME's idea of the canonical form of a content -- it's up to richtext to
fix. The richtext example is a red herring.
But this suggestion begs the question: somewhere there is a
mapping between that octet stream and the "everything" which has
been encoded. Above the level of the FTP "representation type"
[RFC959 sec. 3.1.1], data is handled which may be bytes or words
of more than 8 bits.
Absolutely right.
Analogously, it is not unthinkable for mail messages above the level of
the content-transfer-encoding to consist of wide characters.
No, it's not unthinkable. It's just *much simpler* for MIME if we make
the "canonical form" an octet-stream. The sending host nearly always has
to do some translation to get its text into canonical form--mapping its
local character set to the one specified in the content-type header,
mapping its newline convention to CRLF, etc. The "canonical form"
provides a very clean boundary between operating system- or host-specific
functions and system-independent functions which are common to any MIME
implementation.
(host-specific "local" form)
             |
            \|/
[convert to canonical form]    conversion is: specific to content-type
             |                                independent of mail transport
            \|/                               (operating system and/or host)-specific
(canonical octet-stream)       -------------------------------------
             |                 encoding is system independent
            \|/                and content-type independent (mostly)
[content-transfer-encoding]    and (hopefully) tuned to mail transport
             |
            \|/
(MIME body part)
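As a concrete instance of the host-specific "convert to canonical form"
box above, here is a sketch of the newline half of that job for a Unix
sender whose local convention is bare LF:

    #include <stdio.h>

    /* Map the local Unix newline convention (bare LF) to the CRLF
     * that the canonical form of a text content requires.  The
     * other half of the job--mapping the local character set to
     * the one named in the content-type header--is omitted. */
    static void to_crlf(FILE *local_in, FILE *canonical_out)
    {
        int c;
        while ((c = getc(local_in)) != EOF) {
            if (c == '\n')
                putc('\r', canonical_out);
            putc(c, canonical_out);
        }
    }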
The "canonical octet-stream" interface is what allows packages like
metamail and MH 6.8 to support arbitrary MIME formats via external
programs, which often *already exist*. It therefore has a LOT to do with
how widely MIME is used and how quickly implementations become available.
The ability to plug a content-specific display module into a standard
interface (like a UNIX pipe) is essential to the success of MIME. Of all
the interfaces which we might choose, the 8-bit wide "paper tape" model
seems to be the most ubiquitous, and therefore the most powerful vehicle
on which to base MIME contents.
This is why I react strongly to the idea of changing the "canonical form"
to be something besides an octet-stream.
------------------------------------------------------------------
As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:
1) "flat": canonical form is to transmit each n-bit character as n/8
octets, in order from most significant octet first to least significant
octet last.
2) "UTF-2": canonical form is a UTF-2 stream.
...and require any reader that accepts one to accept both, since it's
trivial to convert from one to the other (a sketch of the conversion
follows). The sender (or his UA) can pick whichever one seems best for
the text being transmitted.
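Here is that conversion for 16-bit characters, assuming UTF-2's bit
patterns are those of the X/Open FSS-UTF proposal (one octet below 0x80,
two below 0x800, three otherwise):

    #include <stdio.h>

    /* Convert the "flat" form (each 16-bit character sent as two
     * octets, most significant first) to a UTF-2 octet-stream. */
    static void flat16_to_utf2(FILE *in, FILE *out)
    {
        int hi, lo;
        while ((hi = getc(in)) != EOF && (lo = getc(in)) != EOF) {
            unsigned c = ((unsigned)hi << 8) | (unsigned)lo;
            if (c < 0x80) {
                putc(c, out);                        /* 0xxxxxxx */
            } else if (c < 0x800) {
                putc(0xC0 | (c >> 6), out);          /* 110xxxxx */
                putc(0x80 | (c & 0x3F), out);        /* 10xxxxxx */
            } else {
                putc(0xE0 | (c >> 12), out);         /* 1110xxxx */
                putc(0x80 | ((c >> 6) & 0x3F), out); /* 10xxxxxx */
                putc(0x80 | (c & 0x3F), out);        /* 10xxxxxx */
            }
        }
    }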
Keith Moore