character sets

Dear IETF-822 people:

I have been thinking about the character set issues, and my point of
view has been changing quite a lot.

I think most people would agree that intermediate MTAs should not
modify the contents of messages. Modifications such as character
encoding conversions should only be done by the UAs or by intermediate
nodes (gateways) that have been "authorized" to do so. An example of
such a gateway would be an ASCII<->EBCDIC converter, or a
Latin-1<->Quoted-Readable converter if the people within the enclave
agree to this.

Within our enclave (our company), we have agreed to use the Shift-JIS
encoding, which gets converted automatically at our gateways. So if I
try to send Latin-1, the gateway will think that it is Shift-JIS, and
the message will get messed up. This means that I need to get my UA to
convert to a safe encoding for me -- Quoted-Readable.
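
To make the failure concrete, here is a minimal Python sketch of a gateway that reads Latin-1 bytes as if they were Shift-JIS. The function name is hypothetical, and Python's `shift_jis` codec only stands in for whatever conversion a real gateway performs.

```python
def gateway_misread(text: str) -> str:
    """Encode text as Latin-1, then read the bytes as if they were Shift-JIS."""
    latin1_bytes = text.encode("latin-1")
    # errors="replace" substitutes U+FFFD for byte sequences that are not
    # valid Shift-JIS, so the sketch never raises on bad input.
    return latin1_bytes.decode("shift_jis", errors="replace")
```

Pure ASCII text such as "hello" passes through unchanged, but any word with an 8-bit Latin-1 letter comes out as something else, because bytes above 0x7F are read as katakana or as the first half of a two-byte character.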

If a user in Europe with a national 646 variant terminal receives a
Latin-1 message, the UA could convert it to the 646 variant.
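
As a rough sketch of what such a receiving UA might do, the Python below maps Latin-1 onto the Norwegian/Danish 646 variant, where six ASCII punctuation positions display as national letters. The table is deliberately partial and the function name is an assumption made for illustration.

```python
# Partial table: Latin-1 letters and the 7-bit positions that a
# Norwegian/Danish ISO 646 terminal displays as those letters.
LATIN1_TO_646_NO = {
    "\u00c6": "[",   # capital AE
    "\u00d8": "\\",  # capital O with stroke
    "\u00c5": "]",   # capital A with ring
    "\u00e6": "{",   # small ae
    "\u00f8": "|",   # small o with stroke
    "\u00e5": "}",   # small a with ring
}


def latin1_to_646_no(body: bytes) -> bytes:
    """Convert Latin-1 bytes to 7-bit codes for a Norwegian 646 terminal."""
    out = []
    for ch in body.decode("latin-1"):
        ch = LATIN1_TO_646_NO.get(ch, ch)
        # Anything still outside 7 bits has no slot in this variant.
        out.append(ch if ord(ch) < 0x80 else "?")
    return "".join(out).encode("ascii")
```

The conversion is lossy: a genuine "[" or "|" in the message would also be displayed as a national letter on such a terminal.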

So we see that character encoding conversions might occur at the
sender's UA, the receiver's UA, or "authorized" gateways.

There may be some problems if we specify the character encoding in one
of the header fields (such as Content-Type). Currently, the UAs and
the gateways do not know about the character encoding header. If the
UAs and the gateways are not upgraded together (to be RFC-XXXX
compliant), some messages may get converted without having their
headers adjusted, and when these messages are forwarded to RFC-XXXX
compliant programs, these programs may believe the header and convert
again, possibly mangling the message. This may sound far-fetched, but
I think I can come up with an example if people want me to.
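
One hypothetical way this could play out is sketched below in Python. It is an illustration only, not a case observed in practice: the choice of code page `cp500` for EBCDIC and the function names are assumptions.

```python
EBCDIC = "cp500"  # assumption: one particular EBCDIC code page


def convert(body: bytes, src: str, dst: str) -> bytes:
    """Re-encode a message body from one character encoding to another."""
    return body.decode(src).encode(dst)


def relay(text: str):
    """Return the text as an EBCDIC reader sees it after one and two conversions."""
    body = text.encode("latin-1")  # the header says: Latin-1
    # Old gateway: converts to EBCDIC but leaves the header untouched.
    once = convert(body, "latin-1", EBCDIC)
    # Upgraded program: believes the stale header and converts again.
    twice = convert(once, "latin-1", EBCDIC)
    return once.decode(EBCDIC), twice.decode(EBCDIC)
```

After one conversion the EBCDIC reader sees the original text; after the second, every character has been remapped and the message is unreadable.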

It may not be practical to upgrade all UAs and gateways at the same
time.

So, maybe we should refrain from specifying the character encoding in
the headers. We can rely on the UAs and gateways to do the
"appropriate" conversions for us.

However, there are some things that we need to include in the header
if we want the software to behave intelligently. For example, if we
want our UA to automatically convert Quoted-Readable to, say, Latin-1,
we will need something like:

        Content-Encoding: Quoted-Readable

so that strings such as "J&o/rn" can be converted. This can probably
be made to work even if the string is in EBCDIC. However, if we use
hex in the quoted encoding, ASCII strings that get converted into
EBCDIC may lose their meaning. E.g. "J\F8rn": if this string is in
EBCDIC and you convert the \F8 to one byte, you probably won't get the
o-slash that was intended.
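
The difference between the two quoting styles can be sketched in Python. The mnemonic table below is a tiny illustrative subset, `cp500` is an assumed EBCDIC code page, and the function names are hypothetical.

```python
import re

EBCDIC = "cp500"  # assumption: one particular EBCDIC code page

# Illustrative subset only; a real mnemonic table would be much larger.
MNEMONICS = {"o/": "\u00f8", "a:": "\u00e4", "e'": "\u00e9"}


def ascii_to_ebcdic(body: bytes) -> bytes:
    """What an ASCII<->EBCDIC gateway does to a 7-bit message body."""
    return body.decode("ascii").encode(EBCDIC)


def decode_mnemonics(body: bytes, charset: str) -> str:
    """Expand &xy mnemonics after decoding the body in the local charset."""
    text = body.decode(charset)
    return re.sub(r"&(..)", lambda m: MNEMONICS.get(m.group(1), m.group(0)), text)


def decode_hex_naively(body: bytes, charset: str) -> str:
    """Turn each backslash-XX into the single byte XX, read in the local charset."""
    text = body.decode(charset)
    return re.sub(
        r"\\([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]).decode(charset),
        text,
    )
```

With these definitions, "J&o/rn" decodes to the same name before and after the gateway, because the mnemonic is made of ordinary characters that the gateway translates. The hex form does not survive: byte 0xF8 is o-slash in Latin-1 but the digit "8" in EBCDIC, so the naive decoder yields "J8rn".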

Of course, if you know that the original code was Latin-1, you could
still do something intelligent with "\F8" even if it's in EBCDIC.
Therefore, a hex encoding must be accompanied by an indication of the
original codeset. For example:

        Content-Encoding: Quoted-Printable, Latin-1
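
A receiving program could use such a header roughly as in the Python sketch below. The comma-separated header syntax is just the one proposed above, the backslash-hex notation follows the "\F8" example rather than any finished specification, and the function names are hypothetical.

```python
import re


def parse_content_encoding(value: str):
    """Split 'Quoted-Printable, Latin-1' into (encoding, original codeset or None)."""
    parts = [p.strip() for p in value.split(",")]
    return parts[0], (parts[1] if len(parts) > 1 else None)


def decode_hex_quoted(body: bytes, local_charset: str, header_value: str) -> str:
    """Decode backslash-XX escapes using the codeset named in the header."""
    _encoding, codeset = parse_content_encoding(header_value)
    if codeset is None:
        raise ValueError("hex-quoted text needs the original codeset")
    # The surrounding text is read in the local charset (possibly EBCDIC),
    # but each escaped byte is interpreted in the original codeset.
    text = body.decode(local_charset)
    return re.sub(
        r"\\([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]).decode(codeset),
        text,
    )
```

Because the escaped byte is looked up in Latin-1 rather than in the local charset, the o-slash is recovered even when the body has been converted to EBCDIC on the way.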

Similarly, uuencoded or base64'ed *textual* data should be accompanied
by an indication of the original codeset.

Interestingly, 7-bit ISO 2022 does not need such a header. If escape
sequences such as ESC $ B get converted to something recognizable in
EBCDIC, it is still possible to do something intelligent with the
following bytes.
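
A small Python sketch of that observation follows. It assumes Python's `iso2022_jp` codec as the 7-bit 2022 encoding and `cp500` as the EBCDIC code page, and the function name is made up for the example.

```python
EBCDIC = "cp500"  # assumption: one particular EBCDIC code page


def survives_ebcdic_hop(text: str) -> str:
    """Send 7-bit ISO 2022 text through an ASCII->EBCDIC gateway and read it back."""
    seven_bit = text.encode("iso2022_jp")  # contains ESC $ B ... ESC ( B
    # Gateway: every 7-bit byte is treated as an ASCII character and
    # translated to its EBCDIC equivalent, escape sequences included.
    in_ebcdic = seven_bit.decode("ascii").encode(EBCDIC)
    # EBCDIC side: recognize the same characters, recover their ASCII
    # values, and interpret the escape sequences as usual.
    recovered = in_ebcdic.decode(EBCDIC).encode("ascii")
    return recovered.decode("iso2022_jp")
```

The text comes back intact because the encoding announces itself in-band: the escape sequence travels with the data and is translated along with it, so no separate header is needed to interpret the bytes that follow.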

So what's my conclusion? If we allow people to send whatever they
want, and we don't mark the text with a character encoding header, how
can we interoperate?

I think we will eventually have to agree upon a standard multilingual
encoding, in much the same way that we have agreed to use ASCII until
now. Whether we decide to go with 10646 or not remains to be seen. In
fact, we may not even want to decide yet. Perhaps there is not so much
demand to mix, say, Japanese and European languages. People can send
Japanese to each other in the 7-bit 2022 encoding, even if these
people are in different parts of the world. However, I think there
*is* a great demand to mix the main European languages. So we should
at least provide these users with something. I think the
Quoted-Readable encoding would be perfect for this, since users with
old UAs should be able to read this, and new UAs will be able to
convert it. So let's concentrate on the Quoted-Readable encoding, and
let's make it EBCDIC-safe, i.e. unaffected by ASCII<->EBCDIC
conversions. (Keld? Your turn. :-)

Sorry about the length of this message. This subject is still somewhat
confusing for me, so I may have made some mistakes in the reasoning
above. I would be grateful if someone could point out the mistakes.


Regards,
Erik