
The transition to UTF-8 header fields

1999-02-04 18:46:36

The short-term goal is to allow messages with unencoded UTF-8 in the
header. This takes three steps:

   (1) Mailers with restrictions on 8-bit characters, notably sendmail,
       have to be fixed.

   (2) Message readers have to learn how to convert common UTF-8
       characters to the local character set.

   (3) Message readers have to turn on conversion for some messages.
       Many existing messages use unencoded ISO-8859-1, so the new
       messages have to be marked, for example with a new header field.
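
Steps (2) and (3) can be sketched on the reader side as follows. This is
only an illustration: the marked_utf8 flag stands in for whatever marker
is eventually chosen (no header field name is assumed here), the function
name is hypothetical, and unmarked messages are assumed to already be in
the local character set.

```python
def display_header(raw: bytes, marked_utf8: bool,
                   local_charset: str = "iso-8859-1") -> str:
    """Return header text restricted to what the local charset can show.

    marked_utf8 is a hypothetical flag: True when the message carries
    the (as yet unnamed) marker saying its headers are unencoded UTF-8.
    """
    # Step (3): only turn on UTF-8 conversion for marked messages.
    source = "utf-8" if marked_utf8 else local_charset
    text = raw.decode(source, errors="replace")
    # Step (2): map to the local charset; characters it lacks become "?".
    return text.encode(local_charset, errors="replace").decode(local_charset)
```

For example, display_header("Subject: caf\u00e9".encode("utf-8"), True)
and display_header(b"Subject: caf\xe9", False) produce the same text on
an ISO-8859-1 terminal.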

The long-term goal is to eliminate the implementation burden of multiple
character sets and RFC 2047 encoded words (=??=). This takes one extra
step:

   (4) Mail writers have to convert all outgoing messages from the local
       character set to unencoded UTF-8.
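
On the writer side, step (4) amounts to a single transcoding pass. A
minimal sketch, assuming the local character set is ISO-8859-1 and using
a hypothetical function name:

```python
def to_unencoded_utf8(raw: bytes, local_charset: str = "iso-8859-1") -> bytes:
    """Convert an outgoing header from the local charset to raw UTF-8.

    No =??= wrapping is applied; the result is plain 8-bit UTF-8.
    """
    return raw.decode(local_charset).encode("utf-8")
```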

Eventually all character-set markers can be removed.

Chris Newman writes:
* Create UTF8HEADER SMTP extension.  Provides RFC 2047 downgrading for
  both top level headers and nested message/rfc822 headers.

That doesn't survive a cost-benefit analysis.

Your conversion is safe in a fantasy world where all message readers
magically understand =??=. But in that fantasy world you can use
_encoded_ UTF-8. Your SMTP extension is completely unnecessary.
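
To make the distinction concrete, here is the same subject text as
unencoded UTF-8 and as an RFC 2047 encoded word. The sketch uses Python's
standard email.header module purely as an illustration; the sample text
is made up.

```python
from email.header import Header, decode_header

subject = "caf\u00e9"

# Unencoded UTF-8: the raw 8-bit bytes go into the header as-is.
unencoded = subject.encode("utf-8")

# Encoded UTF-8: the same bytes wrapped in an ASCII-only =?...?= word.
encoded = Header(subject, "utf-8").encode()

# A reader that understands =??= recovers the original bytes.
decoded_bytes, charset = decode_header(encoded)[0]
```

The encoded form passes through 7-bit mailers untouched, but only readers
that implement =??= decoding will display it as text.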

The real world is more complicated. There's no safe way to send UTF-8.
Many readers don't handle =??=. Many readers don't _want_ to handle
=??=. They want unencoded 8-bit characters. But sendmail can't handle
characters between 128 and 159 in address fields.
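
The 128-159 restriction matters more for UTF-8 than for ISO-8859-1.
ISO-8859-1 puts its printable characters at 160-255, but UTF-8
continuation bytes run from 128 to 191, so half of them fall in the
problem range. A small check, with a hypothetical function name:

```python
from typing import List

def bytes_in_128_159(field: bytes) -> List[int]:
    """List the byte values in a header field that fall in 128..159."""
    return [b for b in field if 128 <= b <= 159]

# U+00C0 is C3 80 in UTF-8; the second byte is 128.
hits_utf8 = bytes_in_128_159("\u00c0".encode("utf-8"))

# The same character in ISO-8859-1 is the single byte C0 (192).
hits_latin1 = bytes_in_128_159("\u00c0".encode("iso-8859-1"))
```

Here hits_utf8 contains one byte and hits_latin1 is empty, even though
both encode the same character.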

Fixing sendmail isn't easy, but it's the fastest way to eliminate this
problem. What you're talking about is fixing sendmail _and_ adding a
painful conversion procedure to sendmail _and_ adding the same painful
conversion procedure to dozens of 8-bit-clean MTAs; that's much more
expensive, and will add years to the transition.