Re: The transition to UTF-8 header fields

1999-02-06 12:30:11
Charles Lindsey writes:
> And does it
> barf when the nasty characters are in phrases or comments in those

sendmail, starting in version 8.8.0, simply drops bytes 128-159 from all
incoming header fields.
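
To illustrate what that does to a header, here is a minimal Python sketch. It only simulates the behaviour described above; it is not sendmail's code, and the function name and sample Subject line are made up for the example.

```python
def drop_128_159(header: bytes) -> bytes:
    """Simulate the described behaviour: remove every byte in 128-159."""
    return bytes(b for b in header if not 128 <= b <= 159)

# Hypothetical sample header containing character 193 (A with acute accent).
text = "Subject: \u00c1 propos"
latin1 = text.encode("latin-1")   # character 193 is the single byte 193
utf8 = text.encode("utf-8")       # character 193 is bytes 195, 129

latin1_after = drop_128_159(latin1)
utf8_after = drop_128_159(utf8)

# The ISO-8859-1 form survives unchanged; the UTF-8 form loses byte 129
# and is left with a dangling lead byte 195.
print(latin1_after == latin1)
print(utf8_after == utf8)
```

The ISO-8859-1 version passes through intact, while the UTF-8 version is no longer valid UTF-8 after the filter.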

The underlying problem is that sendmail's code is far from 8-bit-clean.
sendmail's entire rewriting mechanism works with a string format that
assigns special meanings to bytes 129, 130, 133, etc. It can't even deal
with an 8-bit name in the From line.

It might be possible to reversibly encode the unsafe 8-bit characters
when a header is read, and undo the encoding when a header is written,
without extensive changes to the sendmail code. But, no matter how minor
the changes are, it will take a lot of effort to deploy the new code.
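
As a rough idea of what such a reversible encoding could look like, here is a Python sketch. Everything in it is an assumption for illustration: the escape byte (1), the two-hex-digit format, and the choice of 128-159 as the unsafe set are not taken from sendmail.

```python
ESC = 0x01  # hypothetical escape byte, assumed not to be special internally

def encode_unsafe(raw: bytes, unsafe=range(128, 160)) -> bytes:
    """On read: replace each unsafe byte (and ESC itself) by ESC + 2 hex digits."""
    out = bytearray()
    for b in raw:
        if b == ESC or b in unsafe:
            out.append(ESC)
            out += format(b, "02X").encode("ascii")
        else:
            out.append(b)
    return bytes(out)

def decode_unsafe(encoded: bytes) -> bytes:
    """On write: undo encode_unsafe, restoring the original bytes."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == ESC:
            out.append(int(encoded[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)

original = bytes(range(256))
internal = encode_unsafe(original)
restored = decode_unsafe(internal)
```

Escaping the escape byte itself is what makes the mapping reversible for arbitrary input.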

On the bright side, there are lots of 8-bit-clean MTAs. Also, sendmail
before 8.8.0 can at least tolerate bytes 128-159 in Subject fields.

> Well any message that currently uses unencoded ISO-8859-1 (except within
> the protection of a Content-Type specifying that charset) is not
> conforming already, so we are not necessarily obliged to recognise it

I'm concerned with the features that users rely on, not just the
features guaranteed by the IETF.

MUAs with poor 8-bit support and without any European users can get away
with broad rules such as

   Interpret all 8-bit characters in the header as UTF-8.

But other implementors will be much happier with

   Interpret all 8-bit characters in the header as UTF-8 _if_ you see
   the following special header field: ...

This avoids creating any new problems for current users.
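
The second rule could be sketched in Python as follows. The marker field is left unspecified above, so the sketch takes its name as a parameter; the fallback to ISO-8859-1 and the function name are assumptions for the example, not part of any proposal.

```python
def decode_header_value(raw_value: bytes, field_names, marker_field: str,
                        legacy_charset: str = "latin-1") -> str:
    """Interpret 8-bit bytes as UTF-8 only if the marker field is present.

    marker_field is a caller-supplied, hypothetical field name.
    legacy_charset is what the MUA assumed before (an assumption here).
    """
    present = {name.lower() for name in field_names}
    if marker_field.lower() in present:
        return raw_value.decode("utf-8", errors="replace")
    return raw_value.decode(legacy_charset)

raw = b"\xc3\x81"  # UTF-8 for character 193, or two ISO-8859-1 characters

# "X-Example-Marker" is a placeholder name invented for this example.
with_marker = decode_header_value(raw, ["From", "X-Example-Marker"], "X-Example-Marker")
without_marker = decode_header_value(raw, ["From", "Subject"], "X-Example-Marker")
```

Messages without the marker are decoded exactly as before, which is the point of the rule.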

> Surely, it is
> the good, clean agent that is then responsible to do the downgrading.

If =??= is acceptable to the receiving user, why don't you just send it
in the first place?

I thought you were worried about receivers that _don't_ understand =??=.
In this case, downgrading will fail. You have at least some chance of
success if you go ahead and send 8-bit characters.

Remember that the current situation doesn't work. We have a bunch of
unhappy 8-bit users. How do we make everybody happy?

Widespread support for 8859-1 isn't enough: many users need characters
outside 8859-1. The benefits of UTF-8 are obvious.

The =??= encoding is unsatisfactory, for the reasons you've mentioned.
The benefits of _unencoded_ UTF-8 are obvious.

How do we make unencoded UTF-8 work as quickly as possible? By replacing
the mailers that have trouble with it. Newman's SMTP extension, like
8BITMIME, would make this process much slower and much more expensive.

> The (slightly) good news is that you can translate iso-8859-1 into UTF-8
> without producing any of the characters in 128-159.

Hmmm? Character 193 ("A'"), 11000001 binary, is encoded as one of

   11000011 10000001
   11100000 10000011 10000001
   11110000 10000000 10000011 10000001
   11111000 10000000 10000000 10000011 10000001
   11111100 10000000 10000000 10000000 10000011 10000001

all of which use byte 129. Is there a repetition of 8859-1 somewhere
else in the 10646 character set?
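
The shortest encoding is easy to check with a few lines of Python. This sketch uses Python's built-in UTF-8 codec, which produces only the shortest form; the helper name is made up for the example.

```python
def has_byte_128_159(data: bytes) -> bool:
    """True if any byte of data falls in the range 128-159."""
    return any(128 <= b <= 159 for b in data)

# Character 193 encodes as bytes 195, 129.
encoded_193 = chr(193).encode("utf-8")
print(encoded_193)  # b'\xc3\x81'

# Which of the 8859-1 characters 160-255 produce a byte in 128-159?
affected = [cp for cp in range(160, 256)
            if has_byte_128_159(chr(cp).encode("utf-8"))]
print(affected[0], affected[-1], len(affected))  # 192 223 32
```

So characters 192-223 all produce a byte in 128-159 when translated to UTF-8, while 160-191 and 224-255 do not.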