Re: The transition to UTF-8 header fields

1999-02-05 16:20:18
The short-term goal is to allow messages with unencoded UTF-8 in the
header. This takes three steps:

   (1) Mailers with restrictions on 8-bit characters, notably sendmail,
       have to be fixed.

   (2) Message readers have to learn how to convert common UTF-8
       characters to the local character set.

Or more to the point, they have to learn how to display them.
It's not at all unusual these days for a GUI to be able to
display characters in other than the default local character set.

   (3) Message readers have to turn on conversion for some messages.
       Many existing messages use unencoded ISO-8859-1, so the new
       messages have to be marked, for example with a new header field.

It is nearly always possible to distinguish a UTF-8 string of more
than a few characters from a string in some other character set.
So a reasonable strategy would be to display valid UTF-8 strings
as UTF-8, and to display other unencoded 8-bit character strings
using a local default character set.
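
A minimal sketch of that display strategy, in Python.  The function
name decode_header_bytes and the ISO-8859-1 default are assumptions
made for illustration; they do not come from any existing mail reader.

```python
def decode_header_bytes(raw: bytes, local_charset: str = "iso-8859-1") -> str:
    """Decode unencoded header bytes for display.

    Valid UTF-8 is shown as UTF-8; anything else falls back to the
    local default character set (assumed here to be ISO-8859-1).
    """
    if raw.isascii():
        # Pure ASCII reads the same under either interpretation.
        return raw.decode("ascii")
    try:
        # Strict decoding rejects malformed sequences, which is what
        # makes the UTF-8 test reliable for non-trivial strings.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw.decode(local_charset, errors="replace")


# The same word, once as UTF-8 and once as ISO-8859-1 bytes.
as_utf8 = decode_header_bytes(b"Gr\xc3\xbc\xc3\x9fe")
as_latin1 = decode_header_bytes(b"Gr\xfc\xdfe")
```

Very short strings remain ambiguous, since a couple of ISO-8859-1
bytes can happen to form a valid UTF-8 sequence; that is why the
heuristic is only claimed for strings of more than a few characters.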

The long-term goal is to eliminate the implementation burden of multiple
character sets and =??=. This takes one extra step:

   (4) Mail writers have to convert all outgoing messages from the local
       character set to unencoded UTF-8.

Eventually all character-set markers can be removed.

This is *very* long-term.  For several years, mail readers would
still need to be able to handle RFC 2047.  At some point after enough
MTAs became 8-bit header-friendly and enough MUAs became UTF-8
header-friendly, sending UTF-8 in headers would be considered an
acceptable risk for the sender.  At some much later point, not
including support for RFC 2047 in your user agent would be considered
an acceptable risk for an MUA vendor.

* Create UTF8HEADER SMTP extension.  Provides RFC 2047 downgrading for
  both top level headers and nested message/rfc822 headers.

That doesn't survive a cost-benefit analysis.

Your conversion is safe in a fantasy world where all message readers
magically understand =??=. But in that fantasy world you can use
_encoded_ UTF-8. Your SMTP extension is completely unnecessary.
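
For reference, "encoded UTF-8" here means RFC 2047 encoded-words
carrying a UTF-8 charset label.  A rough sketch using Python's
standard email.header module; the helper names encode_utf8_header and
decode_any_header are hypothetical.

```python
from email.header import Header, decode_header, make_header


def encode_utf8_header(text: str) -> str:
    """Return text as RFC 2047 encoded-word(s) labelled as UTF-8."""
    return Header(text, charset="utf-8").encode()


def decode_any_header(value: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(value)))


encoded = encode_utf8_header("Gr\u00fc\u00dfe")
# The encoded form is pure ASCII, so it passes through 7-bit mailers.
round_tripped = decode_any_header(encoded)
```

The library chooses between the "Q" and "B" encodings itself, so the
exact encoded text is not shown here.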

I would say it differently - having this in SMTP requires the SMTP
server to know what kind of UA it is delivering a message to, so 
that it knows whether to downgrade or not.  For many environments
this just doesn't work, and for the environments where it does work,
it's easy enough to just detect the presence of UTF-8 in the message 
header.  There's no need for an SMTP extension.
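
As a sketch of what "detect the presence of UTF-8 in the message
header" could look like at delivery time: the function below scans a
raw header block and reports the fields whose values contain
non-ASCII bytes that form valid UTF-8.  The function name and the
assumption of CRLF line endings are mine, not taken from any MTA.

```python
def header_fields_with_utf8(raw_headers: bytes) -> list[str]:
    """Return names of header fields whose values hold non-ASCII, valid UTF-8."""
    names = []
    # Unfold continuation lines (CRLF followed by a space or tab).
    unfolded = raw_headers.replace(b"\r\n ", b" ").replace(b"\r\n\t", b" ")
    for line in unfolded.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if not sep or value.isascii():
            continue  # not a header field, or nothing 8-bit in it
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            continue  # 8-bit, but not UTF-8 (e.g. unencoded ISO-8859-1)
        names.append(name.decode("ascii", errors="replace"))
    return names


sample = (
    b"From: a@example.org\r\n"
    b"Subject: Gr\xc3\xbc\xc3\x9fe\r\n"   # UTF-8
    b"X-Old: caf\xe9\r\n"                 # ISO-8859-1
)
flagged = header_fields_with_utf8(sample)
```

Whether to downgrade the flagged fields is then a local decision about
the receiving user agent, which needs no negotiation in SMTP.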

The real world is more complicated. There's no safe way to send UTF-8.
Many readers don't handle =??=. Many readers don't _want_ to handle
=??=. They want unencoded 8-bit characters. But sendmail can't handle
characters between 128 and 159 in address fields.

Arbitrary 8-bit characters in address fields are a completely separate 
issue, and probably won't happen within our lifetimes.  It is more 
likely that the world will standardize on English (not that I consider
this likely).

Fixing sendmail isn't easy, but it's the fastest way to eliminate this

This is one case where sendmail is not broken.  No change to any single 
MTA will have a great effect on transition times.