Re: UTF-8 in headers

Perhaps we could establish something of a uniform syntax for all new 
headers, and along with it, uniform rules for converting to 7bit.
For example, declare that all human-readable strings in future fields 
are to be enclosed in single quotes.  Then it would be easy for a 
converter to know how to translate between UTF-8 and 7bit - convert 
the strings within quotes and leave everything else alone.
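
To make the idea concrete, here is a minimal Python sketch of such a
converter.  Everything specific in it is an assumption made for
illustration: the single-quote convention is only the proposal above, and
the 7bit form (base64 of the UTF-8 bytes in an encoded-word-like wrapper),
the names to_7bit and to_utf8, and the sample field are hypothetical.  It
also ignores how a quote character inside a string would be escaped.

```python
import base64
import re

# Hypothetical convention: human-readable text sits between single quotes.
QUOTED = re.compile(r"'([^']*)'")
# Assumed 7bit form of a quoted string (not defined by any specification).
WRAPPED = re.compile(r"=\?utf-8\?B\?([A-Za-z0-9+/=]+)\?=")


def to_7bit(field_body: str) -> str:
    """Encode non-ASCII quoted strings; leave everything else alone."""
    def encode(match):
        text = match.group(1)
        if text.isascii():
            return match.group(0)
        b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
        return "'=?utf-8?B?" + b64 + "?='"
    return QUOTED.sub(encode, field_body)


def to_utf8(field_body: str) -> str:
    """Reverse to_7bit: decode wrapped quoted strings back to UTF-8 text."""
    def decode(match):
        wrapped = WRAPPED.fullmatch(match.group(1))
        if wrapped is None:
            return match.group(0)
        return "'" + base64.b64decode(wrapped.group(1)).decode("utf-8") + "'"
    return QUOTED.sub(decode, field_body)
```

The point of the sketch is that the converter never needs to know the
syntax of the particular field: it only looks for the quotes.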


Hmm! I doubt you will persuade the world to go along with that.


Perhaps not.  But it would be helpful to know why this kind of 
approach would or would not work.  Or are you just saying that
it would be difficult to establish this convention in people's
minds?

RFC 2047 attempts to define this very carefully, but that hasn't stopped
people from trying to use encoded-words where they don't fit - such as 
within a quoted string.


I still find the RFC2047 rules bizarre.


To some degree, so do I, even though I (mostly) understand the reasons
that they're there.

1. It seems that in an <unstructured> an encoded-word must have whitespace in
front of it and behind it.


Yes.   I believe the original idea in RFC 1342 was that you could parse 
any header field by just looking at the words surrounded by white space
and seeing whether they matched the pattern =\?[^?]+\?[BQbq]\?[^?]+\?=
But this overlooked that in structured fields you can have certain 
specials immediately adjacent to a 'word' with no intervening white space
(the usual case is a word in a phrase which is normally separated
by white space from surrounding special characters, but this is 
not required).  So the rules got changed for structured fields.
Yes, this is a pain.
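
A rough Python sketch of that original "split on white space and match"
idea, for an unstructured field only.  The function names are mine, and it
deliberately ignores the RFC 2047 rule that white space between two
adjacent encoded-words is dropped on display; it is an illustration, not
a complete decoder.

```python
import base64
import re

# An encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BQbq])\?([^?]+)\?=")


def decode_word(token: str) -> str:
    """Decode one whitespace-delimited token if it is an encoded-word."""
    m = ENCODED_WORD.fullmatch(token)
    if m is None:
        return token
    charset, encoding, payload = m.groups()
    if encoding in "Bb":
        raw = base64.b64decode(payload)
    else:
        # Q encoding: "_" stands for a space, "=XX" for the byte 0xXX.
        raw = re.sub(
            rb"=([0-9A-Fa-f]{2})",
            lambda h: bytes([int(h.group(1), 16)]),
            payload.replace("_", " ").encode("ascii"),
        )
    return raw.decode(charset)


def decode_unstructured(value: str) -> str:
    """Decode every token that is exactly an encoded-word."""
    return " ".join(decode_word(tok) for tok in value.split())
```

With this approach a token such as (=?iso-8859-1?Q?caf=E9?=) is left
alone, because the parentheses are part of the whitespace-delimited word.
That is exactly the case that breaks down in structured fields.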

So if I really really want my Subject: to be
"=?iso-8859-1?Q?my=20text?=" I can put "\=?iso-8859-1?Q?my=20text?=", though
it is not clear that reading agents are expected to hide the "\".


Or you could encode the encoded-word as an encoded-word.
Not that this case happens very often...

"\" affects only how the field is parsed, not how it is displayed.
Presentation of the "\" is not defined.

2. But in a comment this is not so. My encoded-word can have other
characters adjacent to it. But preceding it by a "\" will definitely stop
it being decoded, and it is more likely that the reading agent will hide
the "\".


It's not clear what would happen in practice (the combination of "\"
with encoded-words is an ambiguity that should probably be cleared up), 
but I don't know of any mail readers that hide "\"s.

3. Within a <phrase>, I find the syntax ambiguous. Clearly foobar is a
phrase, but is it one <word> or two (foo and bar are both <word>s, and the
syntax allows adjacent <word>s in a <phrase>)? So can I put
      foo=?iso-8859-1?Q?bar?=   ?


In "foobar" there is exactly one word.  Lexical symbols in structured
header fields are delimited by white space, comments, or specials.
(no, this isn't crystal clear in RFC 822, but it is there if you look
closely enough)

4. It seems the following are allowed in a <phrase>
      Charles Lindsey
      Charles "H." Lindsey
      "Charles H. Lindsey"
      =?iso-8859-1?Q?Charles=20H.=20Lindsey?=
but not
      "=?iso-8859-1?Q?Charles=20H.=20Lindsey?="


That's correct.  This was an explicit decision of the working group,
made during an IETF meeting in Santa Fe, if I recall.  The idea was 
that encoded-words were their own quoting mechanism, and therefore
should not be combined with double quotes.
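
As a sketch of what a strictly conforming decoder has to do with a phrase,
the Python below splits it into tokens and labels each one; only tokens
labelled "encoded-word" would be candidates for decoding.  The tokenizer
and the name classify_phrase are my own simplifications: comments are not
handled, and anything that is neither quoted nor an encoded-word is called
a "word" without checking it against the RFC 822 specials.

```python
import re

# Either a quoted-string (with backslash escapes) or a run of other characters.
PHRASE_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[^\s"]+')
ENCODED_WORD = re.compile(r"=\?[^?]+\?[BQbq]\?[^?]+\?=")


def classify_phrase(phrase: str):
    """Return (token, kind) pairs for each token of a phrase."""
    result = []
    for token in PHRASE_TOKEN.findall(phrase):
        if token.startswith('"'):
            # Never decoded, even if the contents look like an encoded-word.
            kind = "quoted-string"
        elif ENCODED_WORD.fullmatch(token):
            kind = "encoded-word"
        else:
            kind = "word"
        result.append((token, kind))
    return result
```

Because the quoted-string test comes first, the text between the double
quotes is never even compared with the encoded-word pattern.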

In hindsight, we might regret that decision, because it was one of
the considerations that forced us to invent a different syntax
for MIME parameters.  (there were others also; in particular,
encoded-words were designed to encode human-readable text, and
were never intended to encode very long strings of characters
such as might be found in a filename)  But the rule has been 
that way for about 8 years now, through three document revisions,
so it seems too late to change it.

5. It seems that the syntax was designed so that an encoded-word would
always be syntactically correct in the places where it was allowed even in
the absence of RFC2047. Which would seem to allow reading agents not to
decode them, for whatever reason. Presumably it was this policy that led
to RFC 2231.


The syntax was designed so that ordinary RFC 822 parsers would not
be confused by the presence of an encoded-word in a structured header
field - an encoded-word looks like an 'atom' to an RFC 822 parser.
It was also designed to be unusual enough that it was unlikely
to appear in a header field unless it were actually intended to represent
non-ASCII text.  Finally, the characters within an encoded word were 
chosen to survive translation between all known character sets.
(lots of 7-bit ISO 646 national character sets were in use at the time,
not to mention EBCDIC)
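
The "looks like an atom" property can be checked mechanically.  The Python
below is a small sketch using the RFC 822 definition of an atom (printable
ASCII other than SPACE and the specials); the function name and the sample
strings are mine.

```python
# RFC 822 specials: ( ) < > @ , ; : \ " . [ ]
SPECIALS = set('()<>@,;:\\".[]')


def is_rfc822_atom(s: str) -> bool:
    """True if s is one or more ASCII chars, none a CTL, SPACE or special."""
    return bool(s) and all(
        33 <= ord(c) <= 126 and c not in SPECIALS for c in s
    )
```

For example, =?iso-8859-1?Q?Andr=E9?= passes this test, while the raw
8-bit text it stands for does not.  An encoded-word with a literal "."
in it fails too, since "." is a special; in a phrase it would have to be
written as =2E.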

6. The rule that an encoded-word cannot occur in a <quoted-string> I find
particularly odd. In fact that <phrase> I quoted above 
      "=?iso-8859-1?Q?Charles=20H.=20Lindsey?="
IS allowed. It fits the syntax of a <quoted-string>. But reading agents
would not be allowed to decode it. AFAICS, the only place where allowing
an encoded-word within a <quoted-string> would be an embarrassment would be
in the <local-part> of an <addr-spec> - a problem which could surely be
fixed in other ways.


Though in hindsight the decision may seem like it was short-sighted,
avoiding embarrassment was not the justification for the decision.

In fact, part of the problem appears to be that the
syntax of DRUMS is set out in such a way as not to facilitate dealing with
problems that do not arise within DRUMS itself (my foobar example above is
a case in point).


DRUMS was deliberately scoped to not look at MIME (because the task
was already very difficult).  There was also (for a while) an assumption 
that the DRUMS documents might go directly to full Internet Standard,
without first going through Proposed Standard and Draft Standard.
Hence it was necessary to keep DRUMS from referencing MIME, because
MIME is only a Draft Standard, and a full Standard cannot reference
a Draft Standard.

7. So can someone please explain to me why the use of encoded-words in a
<quoted-string> was outlawed, and what evils would ensue from letting them
in?


See above.  I think you'll find that in practice many (most?) decoders
will decode encoded-words even if they appear in quoted-strings.

8. Whilst one can see the thinking that led to RFC 2231, I must point out
that, in a parameter of a MIME header, the following is already legal:
      filename="=?iso-8859-1?Q?my-funny-file-name?="


I don't know what you mean by "legal", since RFCs 1342, 1522, and 2047
all prohibit the interpretation of this string as an encoded-word
(because it is in quotes).  And encoded-words are not legal MIME
parameter values outside of quotes, because they contain = signs.

RFC 2157 suggests that FTBP parameters can be mapped as encoded-words,
but points out that this 'is not the normal usage of encoded-words'.
RFC 2388 says that encoded-words can be used in Content-Disposition,
but this is clearly an error as it contradicts RFC 2183.

So what would be wrong with letting it mean what it looks as though it
means?


There are problems with long filenames and header wrapping - an 
encoded-word has a maximum length and some mailers still have line
length limitations.  And people are reluctant to change a 
long-established specification.  (I've asked people before if
encoded-words should be changed to be more uniform and couldn't 
get much support for the idea)
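
To illustrate the length problem: RFC 2047 limits a single encoded-word to
75 characters, delimiters included, so a long string has to be cut into
several encoded-words, each of which must decode on its own.  Here is a
minimal Python sketch, assuming the B encoding and UTF-8; the function
names are mine, and a real encoder would also have to fold the resulting
words onto continuation lines.

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit, including =? ? ? ?= delimiters


def make_word(text: str, charset: str = "utf-8") -> str:
    """Wrap text as a single B-encoded encoded-word (no length check)."""
    b64 = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{b64}?="


def encode_words(text: str, charset: str = "utf-8") -> list:
    """Split text into encoded-words of at most 75 characters each.

    Splitting is done between characters, never inside a multi-byte
    sequence, so each word can be decoded independently.
    """
    words = []
    chunk = ""
    for ch in text:
        candidate = chunk + ch
        if chunk and len(make_word(candidate, charset)) > MAX_ENCODED_WORD:
            words.append(make_word(chunk, charset))
            chunk = ch
        else:
            chunk = candidate
    if chunk:
        words.append(make_word(chunk, charset))
    return words
```

A filename of a hundred non-ASCII characters already needs several words
this way, which is part of why encoded-words are a poor fit for long
machine-readable values.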

 "X-" header, I can use RFC2047 stuff wherever I like, whether it
fits with any supposed syntax or not. But what about other headers, not
defined in RFC822 or DRUMS, but defined in extensions? That is unclear to
me.


It's difficult to answer the question.  The definition for a particular
field should specify whether encoded-words are allowed.  Unfortunately,
this means that the encoded-word decoder has to know how to parse each 
field, and different fields may be either structured or unstructured,
may allow comments or not, and have different notions of special characters. 
In general, encoded-words should only be used for text intended to
be presented to humans, not for machine-readable text like filenames.
In practice, most decoders will decode encoded-words anywhere they see them.

I am asking all these questions just to make sure I understand the present
situation correctly, because it would be useless to try to look for
solutions to the more general problem without doing that first.


right.

Keith