Re: UTF-8 in headers

Hmm! I doubt you will persuade the world to go along with that.

Perhaps not.  But it would be helpful to know why this kind of 
approach would or would not work.  Or are you just saying that
it would be difficult to establish this convention in people's
minds?


Yes. The convention is just too unlike the way headers currently get written.


okay, understood.  of course it doesn't have to be that specific
convention, but *some* convention that people would be willing
to adopt.  granted that the idea that "I can put anything I want
into my header field" may be too widespread...

3. Within a <phrase>, I find the syntax ambiguous. Clearly foobar is a
phrase, but is it one <word> or two (foo and bar are both <word>s, and the
syntax allows adjacent <word>s in a <phrase>)? So can I put
   foo=?iso-8859-1?Q?bar?=   ?

In "foobar" there is exactly one word.  Lexical symbols in structured
header fields are delimited by white space, comments, or specials.
(no, this isn't crystal clear in RFC 822, but it is there if you look
closely enough)


I think not. I was working from the syntax in DRUMS (admittedly my version
was dated March 1998, so it may be different now). The syntax I have gives:

atom = [CFWS] 1*atext [CFWS]

so "foo" and "bar" are atoms (the CFWS is optional).


This probably points to a need for a clarification in DRUMS -
an atom is intended to be the entire contiguous sequence of
atext.

so "foobar" is a phrase consisting of two words side-by-side, as allowed
by the syntax. It can also be legitimately parsed in many other ways, of
course.


this may be allowed by the grammar, but was not intended.
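
To make the intended reading concrete, here is a minimal sketch of "an atom
is the entire contiguous sequence of atext".  The atext character set below
is an assumption taken from the later RFC 2822 definition, not from the
March 1998 DRUMS draft, and the name atoms() is purely illustrative.

```python
import re

# Assumed atext set (letters, digits, and the punctuation that is neither
# a special nor white space).  Note that "=" and "?" are both atext.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
ATOM_RE = re.compile(ATEXT + "+")  # greedy: take the longest run of atext


def atoms(text):
    """Return every maximal run of atext characters found in text."""
    return ATOM_RE.findall(text)


print(atoms("foobar"))
print(atoms("foo bar"))
print(atoms("foo=?iso-8859-1?Q?bar?="))
```

Under this reading "foobar" is one atom, "foo bar" is two, and
foo=?iso-8859-1?Q?bar?= is a single atom rather than "foo" followed by
an encoded-word, because nothing in it delimits a lexical symbol.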

That's correct.  This was an explicit decision of the working group,
made during an IETF meeting in Santa Fe, if I recall.  The idea was 
that encoded-words were their own quoting mechanism, and therefore
should not be combined with double quotes.


Because at the time it didn't appear useful? Experience shows that
designing syntax with particular cases excluded from otherwise general
mechanisms usually leads to grief later on :-( .


The question is, how much grief would we have suffered had we done it
the other way?  Hindsight doesn't help us much with that one.
We make the best decisions we can at the time, and then we deal with them.
Sooner or later the amount of entropy due to old design decisions gets 
too large, and we wipe the slate clean and start more-or-less from scratch.  
(e.g. we put UTF-8 in message headers)

In hindsight, we might regret that decision, because it was one of
the considerations that forced us to invent a different syntax
for MIME parameters.


Or you could say that, having made one mistake, you compounded the error
by making a second mistake, rather than undoing the first one :-( .


I'm not sure what you could mean by "undoing the first mistake".  The email 
world is still not ready for UTF-8 in headers, though if we manage the 
transition carefully it might be ready in a few years.   And it would seem 
silly to replace encoded-words with a different and slightly less broken 
mechanism for encoding 8bits in ASCII (but which might work for parameter 
values also) when we already have an investment in encoded-words and 
we'll want to go to UTF-8 in a few years anyway.  It also seems silly
to declare "it's okay to use encoded-words in quoted strings" when 
this will cause a fair amount of disruption.  I'd rather minimize the 
number of design changes (and transitions) than fix every "mistake" 
as soon as possible.

Also it seems gratuitously abusive to call 1342 a "mistake".  UTF-8 was not
available at the time; we were aware of 10646 but realized (correctly) that
it was not sufficiently mature.  We looked at several other alternatives
and all were rejected. In hindsight, I don't think we would make a different
decision.  

Several years ago I was invited to give a talk on MIME.  My last slide was
entitled "where we blew it" and I mentioned the failure of MIME to provide
a general way to put any octet string in a parameter.  This, I think,
was the real "mistake" (I prefer the word "oversight").  Had we thought 
of it at the time I think we would have made the MIME parameter syntax 
slightly more general.  But we would still have needed a 1342-like fix 
to put non-ASCII characters in existing headers.

It is unfortunate that people are trying to use encoded-words for things 
that they were never designed to do, but that is not a failure of the 
design of encoded-words.

 (there were others also; in particular,
encoded-words were designed to encode human-readable text, and
were never intended to encode very long strings of characters
such as might be found in a filename)


Yes, I can see more weight in that part of the argument, though I see that
DRUMS does not put any limit on line lengths in headers (and allows up to
998 in bodies).


This is another omission in DRUMS which should be fixed.
(remember, DRUMS is still an Internet-draft, so you're supposed to cite
it as "work in progress" :)

SMTP still imposes a line length limit of 1000 characters, and DRUMS 
messages should be compatible with SMTP.  (The practical limit is much less
than 1000 characters; many user agents still display more-or-less
raw headers and either do not wrap long lines, or do not wrap headers 
appropriately).
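
A rough sketch of the kind of check this implies.  It assumes the SMTP
figure of 1000 counts the trailing CRLF, which leaves 998 characters of
text per line; the function name and the sample message are hypothetical.

```python
SMTP_MAX_TEXT = 998  # assumed: 1000-character SMTP limit minus CRLF


def overlong_lines(message_text, limit=SMTP_MAX_TEXT):
    """Return (line_number, length) for every line longer than limit.

    Lines are counted from 1 and measured without their line terminator.
    """
    lines = message_text.replace("\r\n", "\n").split("\n")
    return [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) > limit
    ]


sample = "Subject: ok\r\nX-Long: " + "a" * 1000 + "\r\n\r\nbody\r\n"
print(overlong_lines(sample))
```

A check like this only catches the hard limit; it says nothing about the
much shorter lines that user agents can display sensibly.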

8. Whilst one can see the thinking that led to RFC 2231, I must point out
that, in a parameter of a MIME header, the following is already legal:
   filename="=?iso-8859-1?Q?my-funny-file-name?="

I don't know what you mean by "legal", since RFCs 1342, 1522, and 2047
all prohibit the interpretation of this string as an encoded-word.
(because it is in quotes). (and encoded-words are not legal MIME
parameter values outside of quotes because they contain = signs)


It is "legal" in the sense that "=?iso-8859-1?Q?my-funny-file-name?=" is a
syntactically correct quoted-string. Of course, its semantics is not what
you would expect (it will construct a file with a peculiar name 35
characters long on a strictly-conforming implementation).


right.
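
To illustrate the two readings of that filename value, here is a small
sketch using Python's standard email.header.decode_header.  The two
function names are made up for the example; the "lenient" one shows what
an implementation that decodes inside quotes would produce, which is
exactly what RFCs 1342, 1522, and 2047 prohibit.

```python
from email.header import decode_header

RAW_VALUE = "=?iso-8859-1?Q?my-funny-file-name?="


def literal_filename(value):
    """Strict reading: the content of a quoted-string is used as-is."""
    return value


def lenient_filename(value):
    """Non-conforming reading: decode encoded-words found in the value."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


print(len(literal_filename(RAW_VALUE)), literal_filename(RAW_VALUE))
print(lenient_filename(RAW_VALUE))
```

The strict reading yields the 35-character name; the lenient one yields
my-funny-file-name.  Two systems that disagree here will store the same
attachment under different names.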

There are problems with long filenames and header wrapping - an 
encoded-word has a maximum length and some mailers still have line
length limitations.  And people are reluctant to change a 
long-established specification.  (I've asked people before if
encoded-words should be changed to be more uniform and couldn't 
get much support for the idea)


But it seems we are turning up more and more situations that are going to
run into this problem.

Right! Let us suppose, just by way of Thinking-Out-Loud, that some
extension of RFC2047 (or perhaps some 2047bis) were made that allowed
encoded-words within a quoted-string (with perhaps a prohibition remaining
within addr-specs and msg-ids). What calamities would ensue?


People would start using them in addr-specs, expecting them to be
decoded before display.   Some gateways would translate them to 
raw characters, and fail to translate them back, which would make 
replies fail, as well as causing failures with other tools that 
recognize addresses.  Some gateways would translate them to raw 
characters, and translate them back in a way that didn't match the 
original name, and cause similar (but more subtle) failures.
Systems that use name=  or content-disposition filenames would
behave inconsistently - some would store the file under a bizarre
name, while others would store the file with the decoded name.

All of these things happen right now to some degree due to broken
implementations.  But by changing direction, we would make the
overall behavior worse. 

I don't see the point in changing encoded-words at this point, especially
not if the purpose is to make eventual downgrading from UTF-8 easier.  
Yes, the downgrading rules will be somewhat complex, with lots of special 
cases.  But that's not an argument for making still more changes to
existing software (and dealing with yet another set of early 
implementation bugs).  Rather, we should get to work defining the 
eventual use of UTF-8 in email and managing *that* transition. 

Keith