Re: UTF-8 in headers

In 
<Pine(_dot_)SOL(_dot_)3(_dot_)95(_dot_)990127145619(_dot_)482B-100000(_at_)elwood(_dot_)innosoft(_dot_)com>
 Chris Newman <Chris(_dot_)Newman(_at_)innosoft(_dot_)com> writes:

On Sun, 24 Jan 1999, Charles Lindsey wrote:

As regards language names, I would presume that for bodies this is a MIME
matter. Does there exist (or is there a proposal for) a Content-Language:
header, or an equivalent parameter for one of the other MIME headers?

Language in MIME headers is defined in RFC 2231 as an extension to RFC
2047.

Yes, but the question I had in mind was how to specify the language being
used in a body part. Suppose I am writing my body in French. Do I put in a
header like
      Content-Type: text/plain; charset=iso-8859-1; language=FR
I was not aware of such a language parameter in RFC-2046.


This is done with a Content-Language field. See RFC 1766 for details.

Note, however, that this is only a solution for body information (and not
necessarily even text), not text in header fields.
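
As a rough illustration only (not something from RFC 1766 itself), here is
how the French example above might look when built with Python's standard
email package. The language goes in its own Content-Language field rather
than in a Content-Type parameter. The body text and the helper name
build_french_message are made up for the example.

```python
from email.message import EmailMessage


def build_french_message() -> EmailMessage:
    """Build a text/plain message in ISO-8859-1, labelled as French."""
    msg = EmailMessage()
    # The charset goes on Content-Type; the body text is a placeholder.
    msg.set_content("Bonjour tout le monde, ça va ?", charset="iso-8859-1")
    # The language is a separate header field, per RFC 1766.
    msg["Content-Language"] = "fr"
    return msg


msg = build_french_message()
print(msg["Content-Type"])
print(msg["Content-Language"])
```

Note that this labels the body only; it says nothing about the language of
text in the header fields.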

I have now looked at RFC 2231, and what a can of worms! Well it does give
a way to specify charsets and languages in MIME parameters, and adds
languages to RFC 2047, though the syntactic sugar is not particularly
sweet :-( .

However, my concern is that it seems to provide yet another way to
downgrade 8bits to 7bits. It seems that, for downgrading from 8bit to 7bit
when headers are written in UTF-8 (as now proposed for news, and soon to
be proposed for email) we have to distinguish 3 cases:

1. Comments (...), Phrases (e.g. as in
"Charles H. Lindsey" <chl(_at_)(_dot_)(_dot_)(_dot_)>), Unstructured text
(as in Subject:s), "extension message header fields" (not quite sure what
that means) and all "X-" headers:
      downgrading is by RFC 2047

2. Parameters of MIME headers (e.g. Content-Disposition: attachment;
filename="some-name-written-in-funny-characters")
      downgrading is by RFC 2231

3. All other cases
      there is no downgrading mechanism specified yet
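
To make the difference between cases 1 and 2 concrete, here is a minimal
sketch using Python's standard email package. It assumes the header text
starts out as UTF-8; the function names and sample strings are invented for
illustration, and case 3 is deliberately absent because no mechanism exists
for it.

```python
from email.header import Header, decode_header, make_header
from email.utils import encode_rfc2231


def downgrade_unstructured(text: str) -> str:
    """Case 1: RFC 2047 encoded-word for unstructured text, phrases, comments."""
    return Header(text, charset="utf-8").encode()


def downgrade_parameter(name: str, value: str, language: str = "") -> str:
    """Case 2: RFC 2231 extended parameter, written as name*=charset'lang'value."""
    return f"{name}*={encode_rfc2231(value, 'utf-8', language)}"


# Hypothetical sample values.
subject = downgrade_unstructured("Réunion demain")
param = downgrade_parameter("filename", "été.txt", "fr")

print("Subject:", subject)
print("Content-Disposition: attachment;", param)
# Decoding the encoded-word recovers the original text.
print(str(make_header(decode_header(subject))))
```

The same non-ASCII text thus ends up in two unrelated 7bit forms depending on
where in the header it appears: an =?charset?encoding?text?= word in one
place, percent-escaped octets after charset'language' in the other.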

Now, within (3.) we can distinguish
      a) Protocol words. I would be happy to see these remain forever in
         ASCII.
      b) Parameters as in "keyword=parameter" that are part of headers
         that are not Mime headers, but have borrowed the Mime syntax.
      c) Other 'tokens' in assorted headers, including those not
         invented yet. Newsgroup-names in news is the obvious example
         here, but is fixable because we know about it. The worrying
         ones are the ones we do not know about.

So I can see a great danger that the grand canonical method of downgrading
UTF-8 headers (which is the subject of this thread) is likely to degenerate
into a large collection of special cases, all done differently. And I do
not see any immediate clean answer to this :-( .


There is no clean answer short of writing out explicit rules, which is what
I've been saying from the start has to be done.

                                Ned