
Re: UTF-8 in headers

1999-01-24 10:28:18
 Chris Newman <Chris(_dot_)Newman(_at_)innosoft(_dot_)com> writes:

What you have is mostly right, but it's only a start to the complete
problem.  You'll note in RFC 2277 there's a requirement for language
labelling of i18n text.  You'll probably have to use RFC 2482 and define
all the rules for default language and when a language reset happens.

Thanks. I have looked at those two RFCs now. I see that there is provision
for certain "names" to be declared a part of the protocol, and that would
apply to all the things I said should be in ASCII (Header names, and the
like). It seems we need a wording somewhere to draw attention to that, but
otherwise I think we comply with that bit.

As regards language names, I would presume that for bodies this is a MIME
matter. Does there exist (or is there a proposal for) a Content-Language:
header, or an equivalent parameter for one of the other MIME headers?

As to the language of headers, I can see two possibilities:

1. A Language header. That would be simplest, but not so suitable for the
man who wants to give his real name (in a From:) in Chinese, his Subject:
in Arabic, and his Keywords: in Hebrew. Should we care?

2. Use RFC 2482 (language tags embedded in UTF-8 text). Extremely
flexible, but would undoubtedly raise howls of protest from users whose
existing agents saw them as a sequence of garbage characters (people who
read news can get exceedingly irate when shown such things - as witness
the railings against HTML in news, or even against any form of Mime).

I think I prefer 1. so far, but clearly this needs to be looked at in both
mail and news circles. Opinions anyone?

AFAICS the only reason why a newsreader would care about knowing the
language would be in deciding whether to display the characters
left-to-right or right-to-left. Or is that to be determined by the
charset?  Unicode seems confused on that issue.

On Wed, 20 Jan 1999, Charles Lindsey wrote:
2. Header-names are strictly ascii (in fact, the only characters allowed
are ALPHA / DIGIT / "-", which is more restrictive than DRUMS).

I'd prefer if you followed the rules from the Message Format draft.

Regard it as a declaration by the news community that they do not intend
to invent headers outside of those characters. I doubt the mail community
does either, but if they do they would be accepted (news is committed to
accepting, and usually ignoring, any properly-defined mail header).
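
As a rough illustration of the restriction in point 2, here is a minimal
Python sketch of a check that a header name uses only ALPHA / DIGIT / "-".
The function name is hypothetical, and treating any non-empty sequence of
those characters as valid is an assumption; the draft's actual grammar may
constrain the name further.

```python
import re

# Assumption: a header name is any non-empty run of ALPHA / DIGIT / "-".
# The name is_news_header_name is illustrative only, not from any draft.
_HEADER_NAME = re.compile(r"[A-Za-z0-9-]+")


def is_news_header_name(name: str) -> bool:
    """Return True if name contains only ASCII letters, digits and hyphens."""
    return _HEADER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("User-Agent", "Content-Language", "X_Private", "Sujet\u00e9"):
        print(candidate, is_news_header_name(candidate))
```

An agent that is committed to accepting properly-defined mail headers would
use such a check only on what it generates, not on what it receives.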

7. Tokens can use full UTF-8 (but that probably needs reviewing).

Which tokens?

'token' is defined in RFC2045 so, for example, in "charset=iso8859-1" both
"charset" and "iso8859-1" are tokens. The question is whether some day
Mime might allow non-ascii characters in such tokens. That is for Mime to
decide. The only token so far defined in news is for our new User-Agent:
header, in which the name of the agent (e.g. "Mozilla") can use non-ascii
characters if it wants (no harm done, because it is not a protocol word).
As I said, our text on that issue probably needs reviewing.
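
For reference, the present ASCII-only definition of 'token' in RFC 2045 (one
or more US-ASCII characters, excluding SPACE, CTLs and the tspecials) can be
sketched in Python as below. The function name is hypothetical; the sketch
only shows what the current rule accepts, not what a UTF-8 extension would
look like.

```python
# tspecials as listed in RFC 2045: ( ) < > @ , ; : \ " / [ ] ? =
TSPECIALS = set('()<>@,;:\\"/[]?=')


def is_rfc2045_token(s: str) -> bool:
    """Return True if s is a non-empty RFC 2045 token (ASCII-only rule)."""
    if not s:
        return False
    for ch in s:
        code = ord(ch)
        # Reject SPACE and control characters (0-32), DEL (127) and non-ASCII.
        if code <= 32 or code >= 127:
            return False
        if ch in TSPECIALS:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("charset", "iso8859-1", "charset=iso8859-1", "Mozill\u00e4"):
        print(candidate, is_rfc2045_token(candidate))
```

Under this rule "charset" and "iso8859-1" are each tokens, the whole
"charset=iso8859-1" is not (because of the "="), and any agent name with a
non-ASCII character is rejected, which is exactly the point that needs
reviewing.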

In some of these cases, it may be acceptable to drop the
`case-insensitive' rules.  Your 450kB number is misleading.  The 450kB
table is a US-ASCII form of a fairly complete table of several character
attributes and character names for Unicode.

Yes, I knew I was overstating the argument, but even a bitmap would be
quite large.

Compressed, it's only 70kB and for case conversion you probably only need a
fraction of that.  So I suspect the case-conversion table would be of
negligible size in practice.
The hard part is not the size of the table, but the fact that it has to be
periodically updated as new characters are added to Unicode so it's a
nasty maintenance issue.

Yes, that is the killer.

Indeed, in the Newsgroups header it is a definite MUST NOT be used. Granted
it may have to be used when downgrading to mail, but in that case it would
have to be restored on the upgrade.

Any news client which is also an SMTP client, MUST support RFC 2047
encoding if it uses non-ASCII characters in any headers sent over SMTP. 
If a "Newsgroups" header is included in an SMTP message, it MUST use RFC
2047 encoding for UTF-8 characters.

I think that is a gatewaying issue. What I would like to see is that if
you downgrade to RFC2047-charset=utf-8, you MUST upgrade to UTF-8
if/when it comes back into the news system. In particular, you do not
downgrade to anything other than utf-8 (even if you think you know how)
and you do not attempt to upgrade anything other than utf-8 (even if you
think you know how - that is for user agents only when they are ready to
display). The point is that UTF-8 <-> RFC2047-charset=utf-8 is a simple
algorithm. Anything else may require knowledge which not all agents
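
To show how small the UTF-8 <-> RFC2047-charset=utf-8 step is, here is a
minimal Python sketch. The function names downgrade and upgrade are
hypothetical, and so is the example newsgroup name. It assumes the "B"
(base64) encoding and a single encoded-word per value; it does not split long
values to respect the 75-character encoded-word limit, and it does not handle
the "Q" encoding or whitespace between adjacent encoded-words, all of which a
real gateway would need.

```python
import base64
import re

# Matches only encoded-words whose charset is utf-8 and whose encoding is "B".
_UTF8_B_WORD = re.compile(r"=\?utf-8\?b\?([A-Za-z0-9+/=]+)\?=", re.IGNORECASE)


def downgrade(value: str) -> str:
    """UTF-8 header text -> one RFC 2047 encoded-word with charset utf-8."""
    if value.isascii():
        return value  # nothing to encode
    payload = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return "=?utf-8?B?" + payload + "?="


def upgrade(value: str) -> str:
    """Decode utf-8 encoded-words only; any other charset is left untouched."""
    def _decode(match: "re.Match[str]") -> str:
        return base64.b64decode(match.group(1)).decode("utf-8")

    return _UTF8_B_WORD.sub(_decode, value)


if __name__ == "__main__":
    original = "fr.rec.caf\u00e9"           # hypothetical newsgroup name
    mailed = downgrade(original)
    print(mailed)
    print(upgrade(mailed) == original)
    print(upgrade("=?iso-8859-1?Q?caf=E9?="))  # not utf-8, so passed through
```

Leaving non-utf-8 encoded-words alone in upgrade is deliberate: it mirrors
the rule that only user agents, at display time, should try anything cleverer.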

P.S. I think your client has a bug such that when you both post to the
local.mime newsgroup and email to the mailing list, your client omits the
"To" header from the email message

Is it OK now?

Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5
