Re: UTF-8 in headers

In <199902040024(_dot_)TAA12052(_at_)spot(_dot_)cs(_dot_)utk(_dot_)edu> Keith 
Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:

Hmm! I doubt you will persuade the world to go along with that.

Perhaps not.  But it would be helpful to know why this kind of 
approach would or would not work.  Or are you just saying that
it would be difficult to establish this convention in people's
minds?


Yes. The convention is just too unlike the way headers currently get written.

3. Within a <phrase>, I find the syntax ambiguous. Clearly foobar is a
phrase, but is it one <word> or two (foo and bar are both <word>s, and the
syntax allows adjacent <word>s in a <phrase>)? So can I put
     foo=?iso-8859-1?Q?bar?=   ?

In "foobar" there is exactly one word.  Lexical symbols in structured
header fields are delimited by white space, comments, or specials.
(no, this isn't crystal clear in RFC 822, but it is there if you look
closely enough)


I think not. I was working from the syntax in DRUMS (admittedly my version
was dated March 1998, so it may be different now). The syntax I have gives:

atom = [CFWS] 1*atext [CFWS]

so "foo" and "bar" are atoms (the CFWS is optional).

word = atom / quoted-string

so "foo" and "bar" are words

phrase = 1*word / obs-phrase

so "foobar" is a phrase consisting of two words side-by-side, as allowed
by the syntax. It can also be legitimately parsed in many other ways, of
course.

So the syntax is ambiguous, though the ambiguity makes no difference
within DRUMS itself (it does once you introduce encoded-words later on).
So is this a bug in DRUMS? It would seem so. Indeed there are many
examples of ambiguities-that-do-not-matter in DRUMS.
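
To make the ambiguity concrete, here is a minimal sketch that enumerates
the parses of "foobar" under the three rules quoted above. It assumes every
character is atext and that each optional CFWS is empty; the function name
atom_splits is mine, not anything from DRUMS.

```python
def atom_splits(text):
    """Yield every way to cut text into one or more adjacent non-empty atoms.

    Assumption: all characters are atext and every optional CFWS is empty,
    so any non-empty substring is an atom (and hence a word).
    """
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        for rest in atom_splits(text[i:]):
            yield [head] + rest


parses = list(atom_splits("foobar"))
print(len(parses))                 # number of distinct parses as 1*word
print(["foobar"] in parses)        # the one-word reading
print(["foo", "bar"] in parses)    # the two-word reading
```

A string of n atext characters has 2^(n-1) such parses, so "foobar" has 32,
of which the one-word and the foo/bar readings are only two.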

but not
     "=?iso-8859-1?Q?Charles=20H.=20Lindsey?="

That's correct.  This was an explicit decision of the working group,
made during an IETF meeting in Santa Fe, if I recall.  The idea was 
that encoded-words were their own quoting mechanism, and therefore
should not be combined with double quotes.


Because at the time it didn't appear useful? Experience shows that
designing syntax with particular cases excluded from otherwise general
mechanisms usually leads to grief later on :-( .

In hindsight, we might regret that decision, because it was one of
the considerations that forced us to invent a different syntax
for MIME parameters.


Or you could say that, having made one mistake, you compounded the error
by making a second mistake, rather than undoing the first one :-( .

 (there were others also; in particular,
encoded-words were designed to encode human-readable text, and
were never intended to encode very long strings of characters
such as might be found in a filename)


Yes, I can see more weight in that part of the argument, though I see that
DRUMS does not put any limit on line lengths in headers (and allows up to
998 in bodies).

See above.  I think you'll find that in practice many (most?) decoders
will decode encoded-words even if they appear in quoted-strings.

8. Whilst one can see the thinking that led to RFC 2231, I must point out
that, in a parameter of a Mime header, the following is already legal:
     filename="=?iso-8859-1?Q?my-funny-file-name?="

I don't know what you mean by "legal", since RFCs 1342, 1522, and 2047
all prohibit the interpretation of this string as an encoded-word.
(because it is in quotes). (and encoded-words are not legal MIME
parameter values outside of quotes because they contain = signs)


It is "legal" in the sense that "=?iso-8859-1?Q?my-funny-file-name?=" is a
syntactically correct quoted-string. Of course, its semantics is not what
you would expect (it will construct a file with a peculiar name 35
characters long on a strictly-conforming implementation).
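
A quick check of that count, as a sketch using the decode_header function
from Python's standard email.header module. The variable names are mine;
decode_header is simply one readily available RFC 2047 decoder, not a claim
about what any particular mailer does.

```python
from email.header import decode_header

# The parameter value exactly as it appears between the double quotes.
value = "=?iso-8859-1?Q?my-funny-file-name?="

# What a strictly-conforming implementation would use as the file name.
print(len(value))

# What the string "looks as though it means" once treated as an encoded-word.
raw, charset = decode_header(value)[0]
decoded = raw.decode(charset)
print(decoded, len(decoded))
```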

So what would be wrong with letting it mean what it looks as though it
means?

There are problems with long filenames and header wrapping - an 
encoded-word has a maximum length and some mailers still have line
length limitations.  And people are reluctant to change a 
long-established specification.  (I've asked people before if
encoded-words should be changed to be more uniform and couldn't 
get much support for the idea)


But it seems we are turning up more and more situations that are going to
run into this problem.

Right! Let us suppose, just by way of Thinking-Out-Loud, that some
extension of RFC2047 (or perhaps some 2047bis) were made that allowed
encoded-words within a quoted-string (with perhaps a prohibition remaining
within addr-specs and msg-ids). What calamities would ensue?

1. Existing strictly-conforming implementations would display the phrase
"=?iso-8859-1?Q?Charles=20H.=20Lindsey?=" as-is. Yes, it is not a pretty
sight, but those of us without RFC2047-conforming implementations have been
seeing such things for years. It does not seem a show stopper.

2. Many (most?) existing but non-conforming implementations would suddenly
become conforming (and would correctly decode that phrase).

3. Every syntactically correct header, under either the old or the new
syntax, would still be syntactically correct under either old or new
implementations. So nothing is going to blow up in your face that did not
blow up in your face before.

4. You could immediately use encoded-words in Mime parameters. This makes
some (not all) of the features of RFC2231 redundant. Maybe you declare
those features "deprecated".

5. For really long parameters, you keep the mechanism described in section
3 of RFC2231, and obviously you keep the language-specification within
encoded-words feature of RFC2231.

6. You advise all people developing new headers that they should include
quoted-string as an option in any human-writeable fields that they invent,
or to include encoded-word in their syntax explicitly, or both. And you
advise them to make provision for long fields if necessary (so, for
example, we might have to look at long newsgroup-names).

7. When you invent UTF-8 headers as an extension to mail, you enact that
all downgrading is to be done by RFC2047(bis).

So, what other problems might there be? I see no show stoppers so far.
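
Points 1 and 2 can be sketched as two toy decoders: a strict one that leaves
quoted-strings alone, and a lenient one that decodes encoded-words wherever
it sees them. This is only an illustration under loud assumptions: the
regular expressions are simplified, the RFC 2047 requirement that
encoded-words be delimited by white space is ignored, and the names
decode_strict and decode_lenient are hypothetical.

```python
import re
from email.header import decode_header

# Simplified patterns (assumptions, not the full RFC 822 / RFC 2047 grammar).
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?=")
QUOTED_STRING = re.compile(r'("(?:[^"\\]|\\.)*")')


def _decode(match):
    """Decode a single encoded-word using the standard library."""
    raw, charset = decode_header(match.group(0))[0]
    return raw.decode(charset)


def decode_lenient(value):
    """Decode encoded-words anywhere, even inside a quoted-string."""
    return ENCODED_WORD.sub(_decode, value)


def decode_strict(value):
    """Decode encoded-words only outside quoted-strings."""
    parts = QUOTED_STRING.split(value)
    return "".join(
        part if part.startswith('"') else ENCODED_WORD.sub(_decode, part)
        for part in parts
    )


phrase = '"=?iso-8859-1?Q?Charles=20H.=20Lindsey?="'
print(decode_strict(phrase))    # shown as-is: not pretty, but no calamity
print(decode_lenient(phrase))   # decoded, quotes retained
```

Both functions accept every input the other accepts; they differ only in
what they display, which is the substance of point 3.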

 "X-" header, I can use RFC2047 stuff wherever I like, whether it
fits with any supposed syntax or not. But what about other headers, not
defined in RFC822 or DRUMS, but defined in extensions? That is unclear to
me.

It's difficult to answer the question.  The definition for a particular
field should specify whether encoded-words are allowed.  Unfortunately,
this means that the encoded-word decoder has to know how to parse each 
field, and different fields may be either structured or unstructured,
may allow comments or not, and have different notions of special characters. 
In general, encoded-words should only be used for text intended to
be presented to humans, not for machine-readable text like filenames.
In practice, most decoders will decode encoded-words anywhere they see them.


I think your last sentence sums it up (what harm could such an approach
do?). It might be better to encourage designers of new fields to ensure
that new encoded-words only arose within "..." .

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5