Re: The transition to UTF-8 header fields

1999-02-08 05:08:39
In <19990206193352(_dot_)19719(_dot_)qmail(_at_)cr(_dot_)yp(_dot_)to> "D. J. 
Bernstein" <djb(_at_)cr(_dot_)yp(_dot_)to> writes:

sendmail, starting in version 8.8.0, simply drops bytes 128-159 from all
incoming header fields.
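
The effect of such a filter on UTF-8 text can be simulated in a few lines. This is only a sketch of the behaviour described above, not sendmail's code; the names drop_c1_bytes and survives are made up for illustration.

```python
def drop_c1_bytes(header: bytes) -> bytes:
    """Remove every byte in the range 128-159 (0x80-0x9F)."""
    return bytes(b for b in header if not 0x80 <= b <= 0x9F)


def survives(text: str) -> bool:
    """True if the UTF-8 form of text passes through the filter unchanged."""
    raw = text.encode("utf-8")
    return drop_c1_bytes(raw) == raw


# U+00C1 is C3 81 in UTF-8; the 0x81 byte is dropped, leaving a lone lead byte.
damaged = drop_c1_bytes("\u00c1".encode("utf-8"))
print(damaged)              # b'\xc3'

# U+00E9 is C3 A9, so nothing is dropped.
print(survives("\u00e9"))   # True
print(survives("\u00c1"))   # False
```

A header that loses a continuation byte this way is no longer valid UTF-8, so the damage is worse than a single missing character.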

The underlying problem is that sendmail's code is far from 8-bit-clean.
sendmail's entire rewriting mechanism works with a string format that
assigns special meanings to bytes 129, 130, 133, etc. It can't even deal
with an 8-bit name in the From line.

Yes, that seems real ugly :-( .

I thought you were worried about receivers that _don't_ understand =??=.
In this case, downgrading will fail. You have at least some chance of
success if you go ahead and send 8-bit characters.

Wrong. I am worried about the receivers that don't understand UTF-8, of
which sendmail seems the chief offender. MTAs that _don't_ understand =??=
are no problem (they simply have to pass it through). Even user agents
that don't understand it are not show stoppers, because they just render
it as received - ugly, but nothing breaks.
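
To see why the =??= form passes through untouched, here is a small sketch using Python's standard email.header module. The display name is a made-up example; the point is only that the encoded-word contains nothing but ASCII, so a relay that knows nothing about it has no 8-bit bytes to mangle.

```python
from email.header import Header, decode_header

# Hypothetical display name containing U+00C1.
name = "\u00c1lvaro"
encoded = Header(name, charset="utf-8").encode()

# The encoded-word consists of ASCII characters only ...
is_ascii = all(ord(ch) < 128 for ch in encoded)

# ... and a receiver that understands it can recover the original text.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)

print(encoded)
print(is_ascii, decoded == name)
```

A user agent that does not understand the syntax would simply display the string held in encoded as it stands.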

The (slightly) good news is that you can translate iso-8859-1 into UTF-8
without producing any of the characters in 128-159.

Hmmm? Character 193 ("A'"), 11000001 binary, is encoded as one of

  11000011 10000001
  11100000 10000011 10000001
  11110000 10000000 10000011 10000001
  11111000 10000000 10000000 10000011 10000001
  11111100 10000000 10000000 10000000 10000011 10000001
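
The five bit patterns above can be reproduced mechanically. The following is a sketch, assuming the original 2-to-6-byte UTF-8 bit layout; encode_with_length is a hypothetical helper that deliberately allows padded (overlong) forms, which a real encoder would not emit.

```python
def encode_with_length(cp: int, n: int) -> bytes:
    """Encode code point cp in exactly n bytes (2..6), even if overlong."""
    if not 2 <= n <= 6:
        raise ValueError("n must be between 2 and 6")
    payload_bits = (7 - n) + 6 * (n - 1)   # bits in lead byte + trail bytes
    if cp < 0 or cp >= 1 << payload_bits:
        raise ValueError("code point does not fit in n bytes")
    lead_prefix = (0xFF << (8 - n)) & 0xFF  # n one-bits followed by zeros
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))      # trail byte: 10xxxxxx
        cp >>= 6
    out.append(lead_prefix | cp)            # lead byte carries the top bits
    return bytes(reversed(out))


forms = [encode_with_length(193, n) for n in range(2, 7)]
for form in forms:
    print(" ".join(f"{b:08b}" for b in form))

# Every form contains at least one byte in the range 128-159.
all_hit = all(any(0x80 <= b <= 0x9F for b in form) for form in forms)
print(all_hit)  # True
```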

Sorry, it seems I was wrong, though actually only the first on your list
seems to be legal UTF-8 according to RFC 2044.
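
Both points can be checked with a short sketch. Note the assumptions: Python's built-in decoder follows a later and stricter definition of UTF-8 than RFC 2044, so the first check shows only how a current strict decoder treats the five forms, and the second covers only the printable ISO-8859-1 range 160-255.

```python
# The five candidate encodings of character 193, written out as bytes.
FORMS = [
    bytes([0xC3, 0x81]),
    bytes([0xE0, 0x83, 0x81]),
    bytes([0xF0, 0x80, 0x83, 0x81]),
    bytes([0xF8, 0x80, 0x80, 0x83, 0x81]),
    bytes([0xFC, 0x80, 0x80, 0x80, 0x83, 0x81]),
]


def decodes_strictly(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def hits_128_159(byte_value: int) -> bool:
    """True if the UTF-8 form of this ISO-8859-1 byte has a byte in 128-159."""
    utf8 = bytes([byte_value]).decode("iso-8859-1").encode("utf-8")
    return any(0x80 <= b <= 0x9F for b in utf8)


accepted = [decodes_strictly(form) for form in FORMS]
print(accepted)  # [True, False, False, False, False]

affected = [b for b in range(0xA0, 0x100) if hits_128_159(b)]
print(len(affected), hex(affected[0]), hex(affected[-1]))  # 32 0xc0 0xdf
```

So the ISO-8859-1 characters 192-223 all produce a byte in 128-159 when converted, while 160-191 and 224-255 do not.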

Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5