Re: UTF-8 in headers

In <199902041923(_dot_)OAA02613(_at_)spot(_dot_)cs(_dot_)utk(_dot_)edu> Keith 
Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:

I'm not sure what you could mean by "undoing the first mistake".  The email 
world is still not ready for UTF-8 in headers, though if we manage the 
transition carefully it might be ready in a few years.


Well that is where I am hearing conflicting stories. My immediate problem
is that news is going to go to UTF-8 long before it becomes common in mail.
So we need some means to tunnel news articles safely through existing mail
systems.

Now I have proposed two MIME mechanisms for doing that, one of which has
been described as "ugly", and the other of which is agreed to be
"exceedingly ugly". Instead, I have been told that the "proper" way to do
this would be to assume that UTF-8 would become legal in mail, and that we
should instead use the downgrading mechanism that would be associated with
that. (Actually, allowing it in multipart headers rather than full mail
headers would go a long way towards fixing our problem.)

Moreover, Ned stated on this very list that such could easily
be worked out within six months (and, one would think, he should know). On
that sort of time scale, news could probably wait, but if this problem is
not going to be fixed for "a few years", then news is going to have
to come up with an interim solution _now_.

...   And it would seem 
silly to replace encoded-words with a different and slightly less broken 
mechanism for encoding 8bits in ASCII (but which might work for parameter 
values also) when we already have an investment in encoded-words and 
we'll want to go to UTF-8 in a few years anyway.  It also seems silly
to declare "it's okay to use encoded-words in quoted strings" when 
this will cause a fair amount of disruption.


Yes, the purpose of my message was to try and find how big that
"disruption" might actually be. You already said yourself that much
(most?) software already supports it, though not legally so.

Right! Let us suppose, just by way of Thinking-Out-Loud, that some
extension of RFC2047 (or perhaps some 2047bis) were made that allowed
encoded-words within a quoted-string (with perhaps a prohibition remaining
within addr-specs and msg-ids). What calamities would ensue?

People would start using them in addr-specs, expecting them to be
decoded before display.


No, I carefully said they must be excluded in addr-specs (one day, UTF-8
will be allowed in domain-names, and that transition will lead to all
sorts of interesting happenings - but that is certainly NOT part of
anything under consideration now). I don't really care how they are
displayed, so long as they go to the right place. Obviously, the display
SHOULD reflect the actual address to which the mail goes. And actually, I
don't see why people would try to use them in addr-specs since valid
addresses do not contain 8bit characters anywhere that I know of.

...   Some gateways would translate them to 
raw characters, and fail to translate them back, which would make 
replies fail, as well as causing failures with other tools that 
recognize addresses.


No, because they would only be allowed in the phrase part of an address,
and that part does not affect the routeing. Some mailers that present mail
to you and include the 'real name' in the menu might get it wrong, but I
suspect many (most?) existing mailers might suddenly start getting it
right.
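
As a rough illustration of that distinction (a minimal sketch only, using
Python's standard email package; the helper name display_and_route and the
address are made up for the example), a reader could decode encoded-words
found in the phrase while leaving the addr-spec strictly alone:

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def display_and_route(field: str) -> tuple[str, str]:
    """Hypothetical helper: split an address field into (display name, addr-spec).

    Only the phrase is run through RFC 2047 decoding; the addr-spec is
    returned exactly as parsed, so routing never depends on the decoding.
    """
    phrase, addr_spec = parseaddr(field)
    display = str(make_header(decode_header(phrase)))
    return display, addr_spec


# An encoded-word inside a quoted-string, which RFC 2047 does not currently allow.
field = '"=?ISO-8859-1?Q?Andr=E9_Dupont?=" <andre@example.org>'
print(display_and_route(field))
```

A mailer that skips the decoding step would simply show the raw
encoded-word as the name; the address used for delivery is the same in
both cases.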

...  Some gateways would translate them to raw 
characters, and translate them back in a way that didn't match the 
original name, and cause similar (but more subtle) failures.


Yes, one of my concerns with RFC2047 is that there is no canonical way to
encode a string. Software that cares about such things ought to work with
the unencoded form (one of the other things I am looking at at the moment
is digital signing of headers, and that is one of the concerns to be
addressed there).
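
To make the point concrete, here is a small sketch (Python standard
library; the sample name is invented) of three different encoded-words
that all decode to the same string. Anything that compares or signs the
encoded bytes would treat them as three different headers, whereas
comparing the decoded form treats them as one:

```python
from email.header import decode_header, make_header


def decoded(raw: str) -> str:
    """Return the unencoded (Unicode) form of an RFC 2047 header value."""
    return str(make_header(decode_header(raw)))


variants = [
    "=?UTF-8?B?QW5kcsOp?=",       # "B" (base64) encoding, UTF-8
    "=?UTF-8?Q?Andr=C3=A9?=",     # "Q" encoding, UTF-8
    "=?ISO-8859-1?Q?Andr=E9?=",   # "Q" encoding, Latin-1
]

# Three distinct encoded forms, one decoded form.
print(len(set(variants)), {decoded(v) for v in variants})
```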

Systems that use name=  or content-disposition filenames would
behave inconsistently - some would store the file under a bizarre
name, while others would store the file with the decoded name.


It does not matter particularly if a file gets stored under the bizarre
name on one machine and under the decoded name on another. The important
thing is that it gets stored someplace. But I suspect that systems that do
not yet implement RFC2231 are facing this problem already.
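
The inconsistency is easy to reproduce. The sketch below (Python standard
library with its default parsing behaviour; the message and file name are
invented for illustration) reads the same file name once from an RFC2231
parameter and once from an encoded-word placed in a quoted-string. This
particular library is expected to decode the former and hand back the
latter untouched, which is the 'bizarre name' case:

```python
from email import message_from_string


def filename_of(disposition: str) -> str:
    """Parse a one-header message and return its attachment file name."""
    msg = message_from_string(f"Content-Disposition: {disposition}\n\nbody\n")
    return msg.get_filename()


rfc2231 = "attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.txt"
rfc2047 = 'attachment; filename="=?UTF-8?Q?r=C3=A9sum=C3=A9.txt?="'

print(filename_of(rfc2231))  # decoded via RFC 2231
print(filename_of(rfc2047))  # left as the raw encoded-word
```

Either way a usable name comes back, so the file does get stored someplace.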

All of these things happen right now to some degree due to broken
implementations.  But by changing direction, we would make the
overall behavior worse.


Or better?

I don't see the point in changing encoded-words at this point, especially
not if the purpose is to make eventual downgrading to UTF-8 easier.  
Yes, the downgrading rules will be somewhat complex, with lots of special 
cases.


I think the point to do it would be when the UTF-8 stuff comes along, in
one grand RFC that brings RFC2047, RFC2231 and others together in one
document.

The real question is when that is going to be. Ned says six months. I was
dubious when he said it, and am even more dubious now. But if not by then,
what is news supposed to do in the meantime?

But either way, it is not unreasonable to look at some possibilities that
might or might not work, even if they don't get implemented for some
longer time.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5