Re: UTF-8 over RFC 2047 (Re: Call for Usefor to recharter)


Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> wrote:

 I do believe that migrating to utf-8 in message headers is the way
 to go in the long term, and that a transition strategy similar to what
 Dan is suggesting (support utf-8 on receipt soon, generation of utf-8
 later) is ultimately the way to get there from here.


If we do that, we should also simplify the horrible general format of
RFC 2822 messages: No optional comments (or other optional syntax
elements), simpler folding rules, etc.

 But there are several things I don't believe.

 One, that this will return us to a world where messages are ordinary
 text and can be treated as such.  For instance, even if we allow utf-8
 in message headers, there will still be a need for canonicalization of
 certain fields before sending and/or before comparison of values
 embedded in those fields,


In general, these are the parts of header fields that are 'addresses':
newsgroup names, email addresses, and domain names.
For all of these, a compatible superset of the IDNA encoding could be 
used.
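
As a rough sketch of what canonicalization before comparison could look
like, here is the domain-name case only, in Python. It relies on the
standard library's "idna" codec, which implements the original IDNA
(RFC 3490). The function names are made up for illustration, and the
"compatible superset" for newsgroup names and local parts is not covered
because it is not defined anywhere yet.

```python
def canonical_domain(name: str) -> str:
    """Return an ASCII form of a domain name suitable for comparison.

    Non-ASCII labels are nameprepped and Punycode-encoded by the
    built-in "idna" codec. Pure ASCII labels pass through unchanged,
    so the result is lowercased explicitly afterwards.
    """
    return name.encode("idna").decode("ascii").lower()


def same_domain(a: str, b: str) -> bool:
    """Compare two domain names by their canonical ASCII forms."""
    return canonical_domain(a) == canonical_domain(b)


if __name__ == "__main__":
    print(canonical_domain("Bücher.Example"))
    print(same_domain("BÜCHER.example", "xn--bcher-kva.EXAMPLE"))
```

The point is only that comparison happens on the encoded form, so it
does not matter which spelling a given header happened to carry.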

 Two, that this allows use of utf-8 in addresses or domain names.
 Those are separate problems,


Yes, this is a completely different problem (and the more serious one:
mangled UTF-8 text fields are not nearly as big a problem as completely
misdirected messages).

Further, IDNA has some nice properties that make it a better solution
than UTF-8: The encoding is also a good thing FOR HUMANS. This might
sound strange but it's actually quite simple:

While virtually every human being (except illiterates) has an
understanding of the basic DNS alphabet (Latin alphanumerics plus
hyphen), the same is not true of the full Unicode range.

Even if the system supported Unicode and UTF-8 completely, the humans
using it would not. They see an opaque identifier made of symbols they
don't know, so they can't read a simple email address aloud over the
phone, copy it to paper, etc.

With IDNA, they just have to tell their system that they only want to
see characters used in certain locales. They will still see an opaque
identifier, but this time it is composed of symbols they know, so they
can handle it.

Note that this is even independent of the wire format of the addresses.
It would be sensible to take a string that arrives in UTF-8 and convert
it to IDNA only for display to humans who don't know the characters
used in the string.
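
A minimal sketch of that display rule, again in Python and again only
for domain names. The "readable" test here (ASCII or any character
whose Unicode name starts with "LATIN") is just a stand-in for whatever
the user's locale settings would really say, and both function names
are hypothetical.

```python
import unicodedata


def is_latin(ch: str) -> bool:
    """Crude stand-in for 'the user can read this character'."""
    return ch.isascii() or unicodedata.name(ch, "").startswith("LATIN")


def display_domain(domain: str, readable=is_latin) -> str:
    """Show each label as-is if the user can read it, else in IDNA form."""
    shown = []
    for label in domain.split("."):
        if all(readable(ch) for ch in label):
            shown.append(label)
        else:
            # Fall back to the ASCII-compatible encoding for this label.
            shown.append(label.encode("idna").decode("ascii"))
    return ".".join(shown)


if __name__ == "__main__":
    print(display_domain("bücher.example"))
    print(display_domain("пример.example"))
```

A user with a Latin-only setting would see the first name unchanged and
the second one as an "xn--" label, which is still opaque but can be
read out over the phone.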

 Three, that the existing ability of some user agents to display utf-8
 in message headers is sufficient for proper processing of headers
 containing utf-8,


Well, if the servers can handle UTF-8, I suppose that they will take
care of the canonicalisation of the addresses and the conversion to IDNA
when necessary. With SMTP, this could even mean that the address is kept
in the UTF-8 form until it must be sent to a system that does not
advertise a "UTF8ADDR" extension. For NNTP, this would mean that servers
can accept UTF-8 in the POST command and do the translation.
On the other hand, having UTF-8 on the wire between certain servers is
nothing more than an unnecessary gimmick.
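
To make the SMTP case concrete, here is a sketch of the decision a
sending server would have to make per peer. "UTF8ADDR" is only the
placeholder extension name used above, not an existing extension, and
address_for_peer() is a made-up helper. It converts the domain part
only; what to do with a non-ASCII local part is exactly the open
question, so the sketch simply refuses it.

```python
def address_for_peer(address: str, peer_extensions: set) -> str:
    """Pick the form of an address to send to a given SMTP peer.

    peer_extensions is the set of extension keywords the peer
    advertised. "UTF8ADDR" is a hypothetical keyword.
    """
    if "UTF8ADDR" in peer_extensions:
        # Peer claims to handle UTF-8 addresses: keep the UTF-8 form.
        return address

    local, _, domain = address.rpartition("@")
    if not local.isascii():
        # No agreed ASCII encoding for local parts is assumed here.
        raise ValueError("cannot downgrade non-ASCII local part")
    return local + "@" + domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(address_for_peer("info@bücher.example", {"UTF8ADDR"}))
    print(address_for_peer("info@bücher.example", {"8BITMIME"}))
```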

It's useful to implement it in everything that interacts with humans,
though: config files (could also be automatically translated to UTF-8
and non-UTF-8 local encodings by a command like viactive), command line
for message injection, etc.

Claus
--
------------------------ http://www.faerber.muc.de/ ------------------------
OpenPGP: DSS 1024/639680F0 E7A8 AADB 6C8A 2450 67EA AF68 48A5 0E63 6396 80F0