
RE: printable wide character (was "multibyte") encodings

1993-01-24 13:49:44
It is not a mistake to keep explicitly separate and distinct that
which is separate and distinct, particularly when the costs of
maintaining the distinction are low.  UTF-2 *is* an encoding.
Whether or not it is a good encoding (and it is a very good
encoding) does not affect our choice of whether to indicate its
use in a field labeled "encoding" or a field labeled "character
set."

I place a very high value on maintaining the distinction between an
"encoding" and a MIME "content-transfer-encoding".  The latter exists to
map the canonical form of a content into a form that allows it to be
transmitted, without loss, via email.  There is a secondary but related
purpose of the "content-transfer-encoding", which is to identify what kind
of transport is required to transmit the content without loss: hence the
binary, 8bit, and 7bit "null" encodings.

Keith is right on target here -- this concern is of paramount importance.
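
To make the distinction concrete, here is a hypothetical pair of headers (my
own illustration, not from Keith's message).  The charset parameter on the
Content-Type names the character set of the canonical text; the
Content-Transfer-Encoding names only the mapping used to get that octet
stream through the transport, and it can be changed without touching the
content at all:

    Content-Type: text/plain; charset=ISO-8859-1
    Content-Transfer-Encoding: quoted-printable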

The MIME model is that the canonical form of a content is an octet-stream.
Even if this isn't explicit in the document, it is amply demonstrated by
the fact that all of the defined content-transfer-encodings take an
octet-stream as input and produce one as output; none of them can deal
with anything wider than an 8-bit quantity.  If this isn't clear enough,
we should make it clearer for the draft standard version.

Not only is this the model, we also explicitly considered and rejected other
models. Wider models didn't receive much attention, since they can trivially be
represented canonically as an octet-stream; the opposite is of course not true.
Narrower models, down to the bit level, were considered and, after much debate,
rejected. What this means is that if you're dealing with a bit stream, part of
the canonicalization process must include conversion into an octet stream.
There's a PADDING parameter you can then use to indicate how many padding bits
are present, but this was done only so that there would be a common facility
for any bit stream that needed it.

This added burden on the canonicalization process was deemed an appropriate
tradeoff for simplification of the encoding process. (The fact that most bit
stream formats, notably G3 FAX, have already dealt with the issue of
octet-stream alignment, and that it does not make sense to reengineer this
work, was definitely a consideration.)
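
For what it's worth, here is a rough sketch in Python of what that
canonicalization step amounts to (my own illustration; the function name is
invented, and only the PADDING notion comes from the spec):

    # Canonicalize a bit stream into an octet stream, returning the octets
    # plus the number of padding bits added -- the value a PADDING parameter
    # would carry.  Illustrative sketch only, not from any implementation.
    def bits_to_octets(bits):          # 'bits' is a string of '0'/'1' characters
        pad = (8 - len(bits) % 8) % 8  # bits needed to reach an octet boundary
        padded = bits + '0' * pad      # pad out the final octet with zero bits
        octets = bytes(int(padded[i:i+8], 2) for i in range(0, len(padded), 8))
        return octets, pad

    octets, padding = bits_to_octets('1011011')   # 7 bits -> 1 octet, PADDING=1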

This assumption considerably simplifies the implementation of MIME mail
readers and composers.  My MIME parser can deal with any MIME type at some
level.  All it has to do is to undo the content-transfer-encoding and pass
the resulting octet-stream to the appropriate display module (as defined
by mailcap or whatever).  Things would be a lot hairier if a
content-transfer-encoding decoder could produce output of arbitrary width.

Not only does it simplify the implementation, it simplifies the specification
as well. One of the reasons for removing the bit encoding stuff was that making
it rigorous would have meant adding a huge amount of additional prose to no
good purpose. The same considerations apply to wider canonical models, I'm
afraid.

But this suggestion begs the question: somewhere there is a
mapping between that octet stream and the "everything" which has
been encoded.  Above the level of the FTP "representation type"
[RFC959 sec. 3.1.1], data is handled which may be bytes or words
of more than 8 bits.  

Absolutely right.

The use of the word "encoding" isn't correct here; this is part of the
canonicalization process for any type. Whenever you define a MIME subtype, part
of its specification must include a definition of how the subtype is
represented in terms of an octet stream. (I note that this is not spelled
out in Appendix G; it should be.)

The price we pay for this simplification is that some specifications must be
slightly more complex than they might otherwise be. But practically any data
format in widespread use has to deal with representation-in-octet-stream
issues; octet streams are the common denominator that holds the computing
community together.

The benefits we get are much greater. We get a much simpler canonical form
specification. We get uniformity amongst encodings. We get seamless
re-encoding and decoding. 
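
To illustrate that last point, here is a hedged sketch using Python's
standard base64 and quopri modules (my own example, not part of MIME):
because every content-transfer-encoding decodes to the same canonical octet
stream, a gateway can swap one encoding for another without knowing anything
about the content it is carrying.

    import base64, quopri

    # Re-encode a body part from base64 to quoted-printable.  The only
    # interface needed is octets in, octets out; the content is opaque.
    def reencode(b64_body):
        canonical = base64.b64decode(b64_body)    # undo CTE -> canonical octets
        return quopri.encodestring(canonical)     # canonical octets -> new CTE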

Analogously, it is not unthinkable for mail messages above the level of
the content-transfer-encoding to consist of wide characters.

No, it's not unthinkable.  It's just *much simpler* for MIME if we make
the "canonical form" an octet-stream.  The sending host nearly always has
to do some translation to get its text into canonical form--mapping its
local character set to the one specified in the content-type header,
mapping its newline convention to CRLF, etc.  The "canonical form"
provides a very clean boundary between operating system- or host-specific
functions and system-independent functions which are common to any MIME
implementation.

Exactly right.
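
A minimal sketch of that sender-side translation (again my own illustration,
with an invented function name); it does nothing more than the two mappings
Keith lists, local character set to the declared one and local newline
convention to CRLF:

    # Map locally stored text into MIME canonical form: the declared
    # charset's octets with CRLF line breaks.  Everything below this
    # boundary is system-independent.  Illustrative sketch only.
    def to_canonical(local_text, charset='iso-8859-1'):
        lines = local_text.splitlines()            # whatever the local newline was
        return '\r\n'.join(lines).encode(charset)  # CRLF plus declared charset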

The "canonical octet-stream" interface is what allows packages like
metamail and MH 6.8 to support arbitrary MIME formats via external
programs, which often *already exist*.  It therefore has a LOT to do with
how widely MIME is used and how quickly implementations become available.
 
This does indeed have everything to do with how well MIME can be made to fit
into lots of applications. 

The ability to plug a content-specific display module into a standard
interface (like a UNIX pipe) is essential to the success of MIME.  Of all
the interfaces which we might choose, the 8-bit wide "paper tape" model
seems to be the most ubiquitous, and therefore the most powerful vehicle
on which to base MIME contents.

Not only is it essential at the implementation level, it also proved to be
an essential part of having a readable and useful specification.
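
As a closing illustration of that pipe model (a hedged sketch of my own; the
viewer command stands in for whatever a mailcap lookup would supply), the
reader's whole job is to undo the content-transfer-encoding and hand the
canonical octet stream to an external program on its standard input:

    import base64, subprocess

    # Undo the content-transfer-encoding and feed the canonical octet stream
    # to a display program over a pipe, metamail-style.  'viewer_cmd' is a
    # stand-in for a mailcap lookup; purely illustrative.
    def display(encoded_body, viewer_cmd):
        canonical = base64.b64decode(encoded_body)   # CTE decode -> octet stream
        subprocess.run(viewer_cmd, input=canonical)  # the 8-bit "paper tape" out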

                                Ned