Re: The transition to UTF-8 header fields

1999-02-10 15:19:38
That's all fine and good. Unfortunately, what I thought Keith was
talking about was which one causes more problems in the *boundary
cases*, such as:

A) Which approach, 1 or 2, causes more brain damage if sent to an MUA
that has *NOT* been upgraded yet?

B) Which approach, 1 or 2, deals with weirdness such as trying to do
a UTF-8 style reply to an unflagged ISO8859-23 message?

This is close.  The problem with using an extra header is that for 
a variety of reasons, different parts of the message header are
generated by different entities.  Replies are a good example:
consider a message thread which has had a number of people reply
to it, and has a long CC list.  Each person's name in that CC 
list may have been copied from a From header supplied by a 
different user agent.  Some of those names may be in RFC 2047 format,
others in raw 8859/*, and others in UTF-8.  A single header field, 
particularly one which is not supported by everyone's user agent,
cannot handle all of those cases.

An effective strategy for displaying headers might be:

- if it's in 2047 format, decode and display per 2047

- if a phrase, *text, comment, or quoted-string is a valid UTF-8 
  string, display as UTF-8 

  (the first byte of each UTF-8 character has the length of the
   character encoded in it, and each subsequent byte within that
   character has certain bits set, so it's fairly unlikely that
   something that looks like a valid UTF-8 string is actually a 
   string from some other charset)

- otherwise, display in the recipient's native or default character set
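
The three steps above can be sketched roughly as follows. This is only
an illustration under some assumptions: the function name
display_header and the native_charset parameter are made up for the
example, the check is applied to a whole field value rather than to
each phrase, *text, comment, or quoted-string separately, and the
regular expression is only an approximation of the 2047 encoded-word
syntax. Python's strict UTF-8 decoder is used for the validity test;
it rejects bad lead bytes and bad continuation bytes, which is the
property the heuristic relies on.

```python
import re
from email.header import decode_header

# Approximate pattern for an RFC 2047 encoded-word: =?charset?B|Q?text?=
ENCODED_WORD = re.compile(rb"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def display_header(raw: bytes, native_charset: str = "iso-8859-1") -> str:
    """Turn a raw header field value into text for display (hypothetical helper)."""
    # Step 1: pure-ASCII value containing encoded-words -> decode per 2047.
    if raw.isascii() and ENCODED_WORD.search(raw):
        pieces = []
        for text, charset in decode_header(raw.decode("ascii")):
            if isinstance(text, bytes):
                try:
                    text = text.decode(charset or "ascii", errors="replace")
                except LookupError:
                    # Charset label unknown to this system: use the default.
                    text = text.decode(native_charset, errors="replace")
            pieces.append(text)
        return "".join(pieces)

    # Step 2: if the bytes form a valid UTF-8 string, display as UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 3: otherwise fall back to the recipient's native/default charset.
    return raw.decode(native_charset, errors="replace")
```

For instance, display_header(b"J\xf6rg") falls through to step 3,
because 0xF6 is not a legal UTF-8 lead byte, while the same name sent
as UTF-8 bytes is caught by step 2.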

You will (almost certainly) still need a way to negotiate in
SMTP whether the next MTA can deal with 8bit UTF-8 headers, 
and to downgrade to 2047 format if this is not possible.
But an extra header field doesn't help you there - by the time 
you scan the message header looking for the extra field, you can
as easily scan the message header looking for 8bit characters.
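
To make the last point concrete, here is a minimal sketch of the scan
and the downgrade. The names has_8bit, prepare_field_value, and
next_hop_accepts_8bit are hypothetical; how the next MTA would actually
advertise that capability in SMTP is left open here. The sketch also
assumes the 8bit value is valid UTF-8 and encodes the whole value as
one unit, whereas a real downgrade would need to encode only the
phrases and comments and leave addresses alone.

```python
from email.header import Header


def has_8bit(header_block: bytes) -> bool:
    """Return True if any byte in the header has the high bit set."""
    return any(b > 0x7F for b in header_block)


def prepare_field_value(value: bytes, next_hop_accepts_8bit: bool) -> bytes:
    """Pass the value through, or downgrade it to 2047 format (hypothetical helper)."""
    if next_hop_accepts_8bit or not has_8bit(value):
        # Nothing to do: either the next hop copes, or the value is pure ASCII.
        return value
    # Assumption: the 8bit value is UTF-8. Re-encode it as encoded-word(s).
    return Header(value.decode("utf-8"), "utf-8").encode().encode("ascii")
```

The scan is a single pass over the header bytes, so looking for a
flag field first saves nothing.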