Re: charset philosophy

From: erik(_at_)sra(_dot_)co(_dot_)jp (Erik M. van der Poel)
To: ietf-822(_at_)dimacs(_dot_)rutgers(_dot_)edu
Subject: charset philosophy
Date: Thu, 11 Jul 91 21:23:00 +0900

Hi, IETF-822 People!

Just a quick question to see how others feel about this:

If we send a plain ASCII text message and we wish to label it with a
content type header and a character encoding identifier, we include
the following:

      Content-Type: text/us-ascii

Today, when ASCII messages cross a border into an EBCDIC world, they
are converted to EBCDIC. Assuming that such a gateway is fully
upgraded to conform to the new RFC-XXXX, what should be done about the
content type header? Should it be changed to the following?

      Content-Type: text/us-ebcdic

Or something like that?


(Well, perhaps x-ebcdic.  Let's not make EBCDIC any more legitimate :-) )

At first glance, that seems like the right thing to do.

One thing I don't like about this is that if the ASCII message crosses
a border into an EBCDIC world using a present-day gateway, the
Content-type header obviously continues to say "us-ascii".  In the
presence of gateways that behave as you suggest, one could imagine
that RFC XXXX-"smart" mail readers in the EBCDIC world will be taught
to treat "Content-type: text/us-ascii" as identical to
"Content-type: text/x-ebcdic", because some of the gateways will not
have been updated yet!  Or that RFC XXXX ASCII mail composers will
omit the Content-type
header for the case text/us-ascii, so as not to have this problem when
a recipient runs an RFC XXXX-smart EBCDIC mail reader.  Finally, lots
of messages get tunneled from an ASCII environment via BITNET into
another ASCII environment (consider a college campus with several
ASCII machines on a LAN, whose only link with the outside world is via
BITNET).  In this case, the two gateways had better either both be
"smart" or both be "stupid".

Of course, a straight translation from ASCII to EBCDIC is not the only
possible or reasonable way to perform the conversion.  Consider a body
part of

content-type: text/us-ascii
content-transfer-encoding: (omitted)

which might be translated by a gateway into

content-type: text/us-ascii
content-transfer-encoding: quoted-printable

(This transformation seems like a good idea when going from Internet
to BITNET, because quoted-printable is a good way to get around the
line length limitations inherent in BITNET.)
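
To make that concrete, here is a rough sketch in Python of what such a
gateway step might look like.  The function name gateway_to_bitnet and
the dictionary representation of the headers are made up for
illustration.  Note also that Python's quopri module implements the
"=XX" escape of the quoted-printable encoding as it was later
standardized, not the ":XX" form I mention in this message.

```python
import quopri

def gateway_to_bitnet(headers, body):
    """Hypothetical gateway step: re-encode the body, leave content-type alone."""
    out = dict(headers)
    if "content-transfer-encoding" not in out:
        # Only the transfer encoding changes; the content-type is untouched.
        out["content-transfer-encoding"] = "quoted-printable"
        body = quopri.encodestring(body)
    return out, body

headers = {"content-type": "text/us-ascii"}
body = b"x" * 200 + b"\n"          # one line far longer than 76 characters

new_headers, new_body = gateway_to_bitnet(headers, body)

# Long lines are split with soft line breaks, and decoding restores the input.
longest = max(len(line) for line in new_body.split(b"\n"))
restored = quopri.decodestring(new_body)
```

The receiving side can undo the soft line breaks without knowing
anything about the content-type.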

Now, my feeling is that a mail reading program, even on an EBCDIC
machine, should decode a quoted-printable body part assuming that the
characters were encoded in ASCII, so an 'A' would still be 41 hex.
Otherwise, you'd get a mixture of characters from the original
character set (i.e. those encoded as :XX), and from the current
character set.  

So, in general, the decoding of a quoted-printable body part would look
like this:

encoded message --a--> original bytes --b--> displayable form

where (a) is the conversion from quoted-printable to the originally
encoded form, and (b) is the content-type-specific operation that is
required to display the original bytes (or store them in a binary
file, or whatever).  For the case of a mail reader on an EBCDIC
machine, reading a text/us-ascii body part, this would look like:

encoded message --a--> original bytes --b--> displayable form
   (ebcdic)                (ascii)               (ebcdic)

which seems odd, but should work fine in practice.  Note that (b) is
not necessarily the inverse of (a), since (b) can perhaps emit
whatever bytes are necessary to display the message correctly given
the capabilities of that particular display; any sequences of :XX in
the encoded message would presumably display as a single character.

Also, the (a) conversion is dependent only on the body part's
content-transfer-encoding, and (b) is dependent only on content-type.
It would be nice to retain this symmetry for other body part types and
encodings also.
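
As a sketch of the separation between (a) and (b), assuming an EBCDIC
host that uses code page 037 (my choice for the example; real hosts
and gateways use various translation tables), the two steps could be
kept in two independent lookup tables.  The names TRANSFER_DECODERS,
RENDERERS, and read_body_part are invented here, and Python's quopri
uses the "=XX" escape rather than ":XX".

```python
import base64
import quopri

LOCAL_CHARSET = "cp037"  # assumption: one common EBCDIC code page

# Step (a): chosen by content-transfer-encoding only.
TRANSFER_DECODERS = {
    "quoted-printable": quopri.decodestring,
    "base64": base64.b64decode,
}

# Step (b): chosen by content-type only.
RENDERERS = {
    "text/us-ascii": lambda raw: raw.decode("ascii"),
}

def read_body_part(wire_bytes, content_type, cte):
    # The encoded message arrives in the local EBCDIC character set.  Map it
    # to the equivalent ASCII characters so the escapes can be parsed.
    encoded = wire_bytes.decode(LOCAL_CHARSET).encode("ascii")
    original = TRANSFER_DECODERS[cte](encoded)   # (a) original bytes, ASCII
    text = RENDERERS[content_type](original)     # (b) content-type-specific
    return original, text.encode(LOCAL_CHARSET)  # displayable form, EBCDIC

wire = "=41BC".encode(LOCAL_CHARSET)             # as stored on the EBCDIC host
original, shown = read_body_part(wire, "text/us-ascii", "quoted-printable")
```

Here the escaped 'A' comes out as 41 hex in the original bytes, and is
only turned into the local EBCDIC code for 'A' at the display step.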

I now suggest that, in an EBCDIC environment, the _absence_ of a
content-transfer-encoding be interpreted to mean "treat this body part
as if it were originally encoded in ASCII."  This would apply to text
as well as other kinds of body parts (because an application-specific
body part might, after all, consist entirely of ASCII text).  In some
cases, a content-type interpreter might have to translate the ASCII
text back into EBCDIC, but after all, the content-type interpreter
naturally knows whether such conversion is appropriate.
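
A minimal sketch of that rule, again assuming code page 037 stands in
for "EBCDIC" and using an invented helper name:

```python
LOCAL_CHARSET = "cp037"  # assumption: one common EBCDIC code page

def original_bytes_when_unencoded(local_body):
    """No content-transfer-encoding: treat the body as originally ASCII."""
    # Undo the gateway's character translation to recover the ASCII bytes.
    return local_body.decode(LOCAL_CHARSET).encode("ascii")

# A body part that consists entirely of ASCII text, as seen on the EBCDIC side.
local_body = "plain text body".encode(LOCAL_CHARSET)
recovered = original_bytes_when_unencoded(local_body)
```

The content-type interpreter then decides whether to hand these bytes
on as they are or to translate them back for display.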

This means that ASCII->EBCDIC gateways can work pretty much as they do
now without causing problems, though a "smart" ASCII->EBCDIC gateway
might wish to do unencoded->quoted-printable or similar conversions to
minimize information loss.

EBCDIC->ASCII gateways can also work the same way.  Mail composers on
EBCDIC machines should generate "Content-type: text/us-ascii" body
parts for maximum interoperability, even though they might be encoded
in bare EBCDIC or quoted-printable.  After all, *all* RFC-XXXX mail
readers (whether on an ASCII or an EBCDIC machine) are required to be
able to understand and display ASCII text body parts, whether encoded
as bare characters, quoted-printable, or base64.  
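
For illustration, here is the same ASCII text in all three forms,
using Python's standard quopri and base64 modules (which follow the
encodings as later standardized, so details may differ from the
current draft).  Whichever form arrives, the reader ends up with the
same bytes.

```python
import base64
import quopri

text = b"Hello, IETF-822 people!\n"

bare = text                                  # no transfer encoding
qp = quopri.encodestring(text)               # quoted-printable
b64 = base64.b64encode(text)                 # base64

# Each decoder depends only on the transfer encoding, not on the content-type.
decoded = [bare, quopri.decodestring(qp), base64.b64decode(b64)]
all_equal = all(d == text for d in decoded)
```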

Body parts marked "Content-type: text/x-ebcdic" might not be readable
on ASCII mail readers.  So an EBCDIC->ASCII gateway might optionally
translate "Content-type: text/x-ebcdic" body parts to 
"Content-type: text/us-ascii". 

In general, I would prefer that gateways change only the
content-transfer-encoding and not the content-type.  This goes for
EBCDIC->ASCII, ASCII->EBCDIC, 8-bit->7-bit, and so on.  Otherwise, the
gateway has a difficult time determining what kinds of translation are
appropriate; to do so it has to know what to do with every potential
combination of content-type and encoding.  Obviously there are some
exceptions to this rule, but we should make it clear to gateway
implementors that this is dangerous territory.

The RFC XXXX message format was *designed* to transmit data
transparently from end-to-end given the existing system of translating
gateways.  It would be a shame to violate that assumption by "fixing"
the gateways to be smarter about content-types.

Keith