Re: Prohibition of EBCDIC in text/plain

1995-06-08 19:17:28
        Good.   Good for  on-the-wire.

        Let me restate this:  we need to look at MIME as an off-the-wire
concept.   MIME is *great*, and the spec is *great* as long as we stay
on-the-wire,  and that's fine.   But people are using MIME off-the-wire
and looking at the same on-the-wire spec.   This needs to be,  at least,
clarified,  better,  formally addressed.

The problem is that most environments already have well-entrenched
mechanisms for translating between "on-the-wire" and "off-the-wire"
(local) representations of email messages. These translation
mechanisms existed long before MIME, and generally remain ignorant of
MIME even today.  Of course, this leads to weird-looking things such
as ASCII messages that were translated into EBCDIC on receipt by the
local MTA, but which are still labelled as US-ASCII.

Fortunately, as long as *all* incoming MIME messages go through this
translation, you can deal with the change.  If you're on an EBCDIC
machine, and if you see "text/plain; charset=US-ASCII", you know that
the message is really ASCII translated to EBCDIC (and so should your
user agent).
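
As a rough illustration of that situation, here is a minimal sketch of a
pre-MIME gateway that blindly translates every octet to the local
character set.  The function names are hypothetical, and Python's cp037
codec is only a stand-in for whatever EBCDIC code page the local system
really uses.

```python
# Hypothetical sketch: a pre-MIME gateway on an EBCDIC host.
# cp037 is an assumed stand-in for the local EBCDIC code page.
LOCAL_CODEC = "cp037"


def legacy_translate(wire_message: bytes) -> bytes:
    """Translate every character of an ASCII message to EBCDIC,
    headers and body alike, without looking at any MIME labels."""
    return wire_message.decode("ascii").encode(LOCAL_CODEC)


def read_local(local_message: bytes) -> str:
    """What a local user agent does: read the stored octets as EBCDIC."""
    return local_message.decode(LOCAL_CODEC)


wire = b"Content-Type: text/plain; charset=US-ASCII\r\n\r\nHello, world\r\n"
stored = legacy_translate(wire)   # EBCDIC octets on the local disk
shown = read_local(stored)        # still carries the US-ASCII label
```

The stored octets are EBCDIC, yet the label the user agent sees still
says US-ASCII.  That is consistent only because every message passes
through the same translation.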

(This isn't just a problem with EBCDIC machines; UNIX machines have a
similar problem with the translation of line endings.  A number of
UNIX UAs don't properly handle a text body part encoded in base64,
because they expect all text to be translated automatically into local
format by the translation layer.  Nevertheless, the MIME rules are
clear and unambiguous -- all body parts are converted to canonical
form before encoding.)
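
To make the canonical-form rule concrete, here is a small sketch that
assumes a UNIX-style local format in which lines end with a bare LF.
The helper names are made up for illustration.

```python
import base64


def encode_text_part(local_text: str) -> bytes:
    """Convert local LF line endings to canonical CRLF, then base64-encode.
    Assumes the input uses bare LF and contains no CR characters."""
    canonical = local_text.replace("\n", "\r\n").encode("us-ascii")
    return base64.b64encode(canonical)


def decode_text_part(encoded: bytes) -> str:
    """Base64-decode, then map canonical CRLF back to the local LF.
    The user agent must do this itself; no translation layer can see
    inside the base64 encoding."""
    canonical = base64.b64decode(encoded)
    return canonical.decode("us-ascii").replace("\r\n", "\n")


body = "line one\nline two\n"
encoded = encode_text_part(body)
```

A user agent that skips the last step in decode_text_part ends up with
stray CR characters, which is the failure described above.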

If you want to add any sanity to the "off-the-wire" version of MIME,
you have to change that translation layer to be MIME-aware.  But once
you do this, you have to make sure that *all* messages go through the
new MIME-aware translation when they cross the boundary between
"local" and "on-the-wire". You also have to change *all* of your local
user agents (at least those that were already dealing with
"on-the-wire" MIME translated via the old mechanism), to know about
the new format instead.  

At worst, you need a flag day where you have to change *everything* at
once. At best, your user agents now have to cope with two versions of
MIME.  (You will, of course, want to add some sort of indicator to the
"off-the-wire" format so that user agents can reliably distinguish it
from the "on-the-wire translated to local by the old mechanism"
format.)  You've just made your user agents considerably more complex
to avoid something which was weird but worked just fine.

If you're going to go to all this trouble, it's very tempting to
declare that the new "off-the-wire" format is *identical* to the
"on-the-wire" format, except for that format indicator that lets your
UAs know the difference.  That way, your translator is very simple,
and all of the knowledge about content-types is in the user agent --
where it should be.

Well, the on-the-wire MIME format isn't the best for local storage.
You might want to use a format for which all of the MIME body parts
are stored in canonical form.  But the one thing you DO NOT want to do
is to translate certain kinds of body parts (like text) into the local
format, because then your (new) user agents will fight with your
translator about who should deal with which parts.

        I think Ned convinced me that as an IETF specification,
it is properly focused at on-the-wire operation.   Is there a way
we can,  without ruffling too many feathers,  put in some wording
that will make the MIME spec a better fit for these off-the-wire
square pegs?

MIME already goes to considerable effort to make sure that
"on-the-wire translated to local format via pre-MIME mechanisms" is

Quoted-printable was carefully defined in such a way that it works for
either ASCII or EBCDIC.  (The q-p sequence "ABCDE=46" translates into
"0x41 0x42 0x43 0x44 0x45 0x46" in canonical form regardless of whether
the local charset is EBCDIC or ASCII.)
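
The claim can be checked with a short sketch.  It assumes Python's cp037
codec as a stand-in for the local EBCDIC code page, and the helper name
is hypothetical.  The idea is that literal characters in a q-p body
stand for their ASCII values no matter how they are stored locally.

```python
import quopri


def qp_decode_local(stored: bytes, local_codec: str) -> bytes:
    """Decode a quoted-printable body that is stored in a local charset.
    Each local character is mapped back to its ASCII equivalent first,
    so the canonical octets do not depend on the local charset."""
    ascii_form = stored.decode(local_codec).encode("ascii")
    return quopri.decodestring(ascii_form)


sample = "ABCDE=46"
from_ascii = qp_decode_local(sample.encode("ascii"), "ascii")
from_ebcdic = qp_decode_local(sample.encode("cp037"), "cp037")
```

Both calls give the same six octets, 0x41 through 0x46, even though the
stored bytes differ between the two systems.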

Base64 was also designed so that the encoded form could be translated
to and from EBCDIC without damaging the canonical form.  Trailing
SPACE characters were ignored in quoted-printable, base64, and
multipart boundary markers because of SPACE padding in fixed-length
record systems.  Line lengths in quoted-printable, base64, and header
fields with encoded-words were kept short so that they could fit into
the line-length limitations of the BITNET mail transport.
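
Here is a sketch of how a base64 body survives storage in fixed-length,
SPACE-padded EBCDIC records.  The 80-column record length, the cp037
code page, and the function names are all assumptions made for the
example.

```python
import base64

RECORD_LENGTH = 80      # assumed fixed record size
LOCAL_CODEC = "cp037"   # assumed stand-in for the local EBCDIC code page


def to_records(payload: bytes) -> list:
    """Base64-encode, then store each 76-character line as one
    SPACE-padded EBCDIC record."""
    lines = base64.encodebytes(payload).decode("ascii").splitlines()
    return [line.ljust(RECORD_LENGTH).encode(LOCAL_CODEC) for line in lines]


def from_records(records: list) -> bytes:
    """Translate the records back to ASCII characters and decode.
    b64decode discards characters outside the base64 alphabet by
    default, so the SPACE padding is ignored."""
    text = "".join(record.decode(LOCAL_CODEC) for record in records)
    return base64.b64decode(text)


payload = bytes(range(256))
records = to_records(payload)
```

Every octet value from 0 to 255 comes back intact from
from_records(records), although the records themselves hold only EBCDIC
characters and padding.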

All of this was done so that you could use MIME with existing 822 UAs,
and without having to change that layer that translates between
on-the-wire and local format.

        I tried using a higher level name on CHARSET= once ... ONCE.
It didn't go over too well.   :-(    Latin-1 would apply equally well
to both  ISO-8859-1  and to  IBM CECP 1047,  which are the canonical
pair for ASCII/EBCDIC translation.   Anyway,  my test didn't work.

"Latin-1" is not a valid "character set" (as MIME defines the term)
because it doesn't define a unique mapping of octets to characters.
If you have a body part in canonical form with charset=Latin1, you
know what kind of characters might appear in the body part, but you
don't know what algorithm to use to translate those octets into

        More fundamental and basic than that:  let plain text
be plain text.   Let plain text on the EBCDIC systems be converted
into ASCII when it goes out into SMTP.

This is what happens now, no?  The only weird thing is that the text
you compose on your local machine must be labelled as US-ASCII or some
such (even though it's really EBCDIC) if you're using your old
translator that isn't MIME-aware.  But -- as long as your local user
agents generate messages that look like ASCII messages that arrived
from elsewhere and were translated locally to EBCDIC -- what results
is not ambiguous. It's just weird.


If you still want the "off-the-wire" format, I suggest you implement
it with new content-transfer-encodings.  Let the existing c-t-e's
retain their present meanings: treat them as if the characters in
these encodings were ASCII translated to your local character set.
Define new content-transfer-encodings for use only in "off-the-wire"
mail in your particular environment.  For instance, you could define a
"ibm-off-the-wire-binary" c-t-e that could contain arbitrary octet
sequences without encoding them as characters, and would also work
efficiently on your file system.  And you could define an
"ibm-off-the-wire-plain-text" encoding for text that was already in
the right format to be blatted to a 3270.  Do make sure that the new
encodings *never* leave your environment without being translated back
into a standard MIME "on-the-wire" c-t-e.

Once you have this, you could (if you wish) translate an incoming
message containing:

content-type: text/plain; charset=us-ascii
content-transfer-encoding: quoted-printable

into:

content-type: text/plain; charset=ebcdic
content-transfer-encoding: ibm-off-the-wall-plain-text

because doing so would not create any ambiguity.
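
A minimal sketch of such a MIME-aware boundary translator follows.  It
is purely illustrative: the function names are hypothetical, cp037 is an
assumed stand-in for "charset=ebcdic", and the private c-t-e name is the
"ibm-off-the-wire-plain-text" suggested earlier in this message.
Headers are modelled as a plain dict to keep the example short.

```python
import quopri

LOCAL_CODEC = "cp037"                      # assumed meaning of charset=ebcdic
LOCAL_CTE = "ibm-off-the-wire-plain-text"  # private, local-only encoding name


def wire_to_local(headers: dict, body: bytes):
    """Rewrite a q-p US-ASCII text part into the private local form.
    Any other kind of part is passed through untouched."""
    ctype = headers.get("content-type", "").lower()
    cte = headers.get("content-transfer-encoding", "").lower()
    if ctype == "text/plain; charset=us-ascii" and cte == "quoted-printable":
        text = quopri.decodestring(body).decode("ascii")
        local_headers = {
            "content-type": "text/plain; charset=ebcdic",
            "content-transfer-encoding": LOCAL_CTE,
        }
        return local_headers, text.encode(LOCAL_CODEC)
    return headers, body


def local_to_wire(headers: dict, body: bytes):
    """Translate the private encoding back to a standard on-the-wire
    c-t-e, so that it never leaves the local environment."""
    if headers.get("content-transfer-encoding") == LOCAL_CTE:
        text = body.decode(LOCAL_CODEC)
        wire_headers = {
            "content-type": "text/plain; charset=us-ascii",
            "content-transfer-encoding": "quoted-printable",
        }
        return wire_headers, quopri.encodestring(text.encode("ascii"))
    return headers, body


incoming = {
    "content-type": "text/plain; charset=us-ascii",
    "content-transfer-encoding": "quoted-printable",
}
local_headers, local_body = wire_to_local(incoming, b"Hello=21")
wire_headers, wire_body = local_to_wire(local_headers, local_body)
```

Because the new label is distinct from every standard c-t-e, a user
agent can tell at a glance which of the two local formats it is looking
at.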