
Re: The transition to UTF-8 header fields

1999-02-05 12:12:46
In <19990205015007(_dot_)11553(_dot_)qmail(_at_)cr(_dot_)yp(_dot_)to>
"D. J. Bernstein" <djb(_at_)cr(_dot_)yp(_dot_)to> writes:

The short-term goal is to allow messages with unencoded UTF-8 in the
header. This takes three steps:

  (2) Message readers have to learn how to convert common UTF-8
      characters to the local character set.

They also need to know of some agreed escape notation to render any UTF-8
character they are unable to display using the local facilities.

They also have to do the same thing with headers in MIME multiparts and
message/rfc822 parts (and actually, that is easier because sendmail is not
a problem there AFAIK).
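
As a rough illustration of step (2) with an escape fallback, here is a
minimal Python sketch. It assumes the local character set is ISO-8859-1
and borrows Python's built-in "backslashreplace" notation as the escape;
no agreed notation has been settled in this thread, and the name to_local
is made up for the example.

```python
def to_local(text: str, local_charset: str = "iso-8859-1") -> bytes:
    """Convert decoded header text to the local charset.

    Characters the local charset cannot represent are written as
    backslash escapes (e.g. \\u2192) instead of being dropped.
    The escape notation is an assumption, not an agreed standard.
    """
    return text.encode(local_charset, errors="backslashreplace")


# An unencoded UTF-8 header line as it might arrive on the wire.
header_bytes = "Subject: Gr\u00fc\u00dfe \u2192 Z\u00fcrich".encode("utf-8")

# Decode the UTF-8, then render it for a Latin-1 display.
local = to_local(header_bytes.decode("utf-8"))
print(local)
```

The umlauts and sharp s map directly to single Latin-1 bytes, while the
arrow, which Latin-1 lacks, comes out as an escape sequence.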

  (3) Message readers have to turn on conversion for some messages.
      Many existing messages use unencoded ISO-8859-1, so the new
      messages have to be marked, for example with a new header field.

Well, any message that currently uses unencoded ISO-8859-1 (except within
the protection of a Content-Type specifying that charset) is already
non-conforming, so we are not necessarily obliged to recognise it (it
should have been done as an encoded-word, though I agree it often isn't).
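
For reference, this is roughly what the conforming encoded-word form looks
like, sketched with Python's standard email.header module. The helper
names and the sample name are invented for the example.

```python
from email.header import Header, decode_header, make_header


def to_encoded_word(text: str, charset: str = "iso-8859-1") -> str:
    """Encode header text as RFC 2047 encoded-word(s) in the given charset."""
    return Header(text, charset).encode()


def from_encoded_word(value: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(value)))


encoded = to_encoded_word("Andr\u00e9 Dupont")
print(encoded)                     # pure ASCII, safe for 7-bit transport
print(from_encoded_word(encoded))  # back to the original text
```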

Chris Newman writes:
* Create UTF8HEADER SMTP extension.  Provides RFC 2047 downgrading for
  both top level headers and nested message/rfc822 headers.

The real world is more complicated. There's no safe way to send UTF-8.
Many readers don't handle =??=. Many readers don't _want_ to handle
=??=. They want unencoded 8-bit characters. But sendmail can't handle
characters between 128 and 159 in address fields.

Could you be more specific about that? Which headers are affected? To:
From: Sender: Reply-To: Cc: ... ? I know that sendmail delights in
rewriting such headers under control of its .cf file :-( . And does it
barf when the nasty characters are in phrases or comments in those headers?

Fixing sendmail isn't easy, but it's the fastest way to eliminate this
problem. What you're talking about is fixing sendmail _and_ adding a
painful conversion procedure to sendmail _and_ adding the same painful
conversion procedure to dozens of 8-bit-clean MTAs; that's much more
expensive, and will add years to the transition.

So you are saying that if you have to fix sendmail (and other agents) to
do downgrading then you might as well fix it to pass the UTF-8 characters
cleanly in the first place? Sounds a plausible argument, but what about
the case where a good, clean, upgraded agent is negotiating using EHLO
with some ancient agent that does not support UTF8HEADER?  Surely, it is
the good, clean agent that is then responsible to do the downgrading.
Trouble is, those ancient agents are going to be around for a long time to
come, and the decent agents are still going to have to speak to them.
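
The downgrading decision described above might be sketched as follows.
This is only an illustration: UTF8HEADER is the extension name proposed in
this thread, not a registered SMTP extension, and the function names and
host names are invented for the example. The downgrade itself uses RFC
2047 encoded-words via Python's email.header module.

```python
from email.header import Header

# Extension name as proposed in this thread (hypothetical, not registered).
UTF8HEADER = "UTF8HEADER"


def parse_ehlo_keywords(reply_lines):
    """Collect extension keywords from EHLO reply lines such as '250-8BITMIME'."""
    keywords = set()
    for line in reply_lines[1:]:  # the first line is the greeting
        rest = line[4:].strip()   # skip the '250-' or '250 ' prefix
        if rest:
            keywords.add(rest.split()[0].upper())
    return keywords


def header_value_for_peer(value: str, peer_keywords) -> bytes:
    """Pass UTF-8 through to upgraded peers; downgrade for everyone else."""
    is_ascii = all(ord(c) < 128 for c in value)
    if is_ascii or UTF8HEADER in peer_keywords:
        return value.encode("utf-8")
    # Ancient peer: fall back to RFC 2047 encoded-words (7-bit safe).
    return Header(value, "utf-8").encode().encode("ascii")


old_peer = parse_ehlo_keywords(
    ["250-old.example.org", "250-8BITMIME", "250 SIZE 1000000"])
new_peer = parse_ehlo_keywords(
    ["250-new.example.org", "250-8BITMIME", "250 UTF8HEADER"])

print(header_value_for_peer("Z\u00fcrich", new_peer))
print(header_value_for_peer("Z\u00fcrich", old_peer))
```

The burden falls entirely on the sending side here, which is the point
made above: the upgraded agent has to carry the conversion code for as
long as it may meet a peer that does not advertise the extension.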

The (slightly) good news is that you can translate ISO-8859-1 into UTF-8
without producing any of the characters in 128-159. Does that help at all?
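
Whether this holds depends on how "characters in 128-159" is read, and a
quick Python check makes the difference visible. The sketch below assumes
the input is limited to the printable ISO-8859-1 range 0xA0-0xFF; the
function name is invented for the example. Read as code points, none of
U+0080-U+009F appear. Read as octet values, the UTF-8 encodings of U+00C0
through U+00DF do contain a second byte in 0x80-0x9F.

```python
def octets_in_range(text: str, lo: int = 0x80, hi: int = 0x9F) -> bool:
    """True if the UTF-8 encoding of text contains an octet in [lo, hi]."""
    return any(lo <= b <= hi for b in text.encode("utf-8"))


# Printable upper half of ISO-8859-1 (code points map 1:1 to Unicode).
latin1_printable = [chr(cp) for cp in range(0xA0, 0x100)]

# Code-point view: no character falls in U+0080..U+009F.
in_c1_codepoints = [c for c in latin1_printable if 0x80 <= ord(c) <= 0x9F]

# Octet view: which characters produce a byte valued 128..159?
affected = [c for c in latin1_printable if octets_in_range(c)]

print(len(in_c1_codepoints))
print(len(affected), hex(ord(affected[0])), hex(ord(affected[-1])))
```

So the answer helps only if the problem in sendmail concerns characters
rather than raw octets in that range.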

Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5