UTF-8 in headers

In <199901122237(_dot_)WAA25919(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk> 
Charles Lindsey <chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk> writes:

Newman:

* Create rules for UTF-8 headers and downgrading thereof.

Lindsey:

We have worked this out for news. Would you like me to post an outline of
what we have done here?


Newman:

Sure.


OK. So here is a short description of UTF-8 in headers in the present
USEFOR draft. I can provide a fuller description, complete with all the
syntax, if you really want it, but I doubt that would be useful just yet.



1. All headers are in the character set UTF-8. But, since US-ASCII is a
strict subset of UTF-8, anything that is permissible at the present time
remains so. However, that is not to say that non-ASCII characters are now
to be allowed absolutely anywhere - only where the syntax specifically
allows them.

2. Header-names are strictly ascii (in fact, the only characters allowed
are ALPHA / DIGIT / "-", which is more restrictive than DRUMS).

3. Addr-specs are strictly ascii. If ever the DNS changes to allow UTF-8
in domain names, that might change, but I guess that is a looong way off.
OTOH phrases can use the full UTF-8, so a person's "real name" in an
address can be in Greek/whatever. Actually, we currently allow full UTF-8
in any quoted-string (qtext), but I think that needs fixing.

4. Message-IDs are strictly ascii, wherever they occur. Likewise Dates.

5. Comments (ctext) can use full UTF-8 (but comments SHOULD NOT be used
in news except in the few places that are allowed at present - they MUST
be accepted, of course).

6. Subject, Keywords, Summary, etc allow full UTF-8.

7. Tokens can use full UTF-8 (but that probably needs reviewing).

8. Newsgroup-names can use full UTF-8. Indeed, this was our reason for
admitting UTF-8 in the first place. The Scandinavians wanted to use their
extra characters in newsgroup names, and were all for making iso-8859-1
the default, but we managed to talk them into UTF-8 instead.

9. The Path header is strict ascii.
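
As a rough illustration of points 1-9, here is a minimal Python sketch of
a per-header check. The table ALLOWS_UTF8 and the function check_header
are hypothetical names, not anything from the draft, and treating unlisted
headers as ASCII-only is purely an assumption of the sketch. From is left
out because it mixes both rules (phrases may be UTF-8, addr-specs may not).

```python
import re

# Point 2: header-names are limited to ALPHA / DIGIT / "-".
HEADER_NAME = re.compile(r"[A-Za-z0-9-]+")

# Hypothetical policy table distilled from the list above.
# True means the header body may contain non-ASCII UTF-8.
ALLOWS_UTF8 = {
    "subject": True,
    "keywords": True,
    "summary": True,
    "newsgroups": True,
    "message-id": False,
    "date": False,
    "path": False,
}


def check_header(name: str, body: str) -> bool:
    """Return True if the header passes this simplified policy."""
    if not HEADER_NAME.fullmatch(name):
        return False
    # Assumption: headers not in the table are treated as ASCII-only.
    allows_utf8 = ALLOWS_UTF8.get(name.lower(), False)
    return allows_utf8 or body.isascii()
```

A real implementation would have to follow the syntax of each header
rather than apply one flag to the whole body.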


One of the problems of working with UTF-8 is that some scripts make no
distinction between upper- and lower-case letters, and some have an extra
title-case, whatever that might mean. But worse, there is no algorithmic
method of converting upper- to lower-case; it can only be done, in
general, by table lookup, and the table is 450kB in length. So you cannot
allow UTF-8 in any place where a token is said to be 'case-insensitive'
and agents need to be able to detect and act on it.
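
To make the contrast concrete, here is a small Python sketch. The helper
ascii_lower is a hypothetical name; it shows that ASCII case conversion is
a one-bit operation, whereas the assertions after it rely on the case
tables built into Python's str methods and the unicodedata module.

```python
import unicodedata


def ascii_lower(raw: bytes) -> bytes:
    """Lower-case ASCII letters algorithmically: 'A'-'Z' differ from
    'a'-'z' only in bit 0x20. All other bytes pass through untouched."""
    return bytes(b | 0x20 if 0x41 <= b <= 0x5A else b for b in raw)


# ASCII: no table needed.
assert ascii_lower(b"Message-ID") == b"message-id"

# Some scripts have no case at all: conversion is a no-op.
assert "中".upper() == "中" and "中".lower() == "中"

# Some mappings change the length of the string.
assert "ß".upper() == "SS"

# Title-case letters exist as a separate category ("Lt"),
# e.g. U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON.
assert unicodedata.category("\u01c5") == "Lt"
```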

Also, one cannot use UTF-8 in situations where transports (that is,
relayers and servers in the news context) need to be able to decode and
understand UTF-8 characters (we get away with it in the Newsgroups header
only because the UTF-8 characters are themselves regarded as the canonical
form of the newsgroup-name). The most that a transport agent can be
expected to do is to convert UTF-8 characters into some escaped notation
so that the admin-guy can look at them for diagnostic purposes.
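
The sort of escaping meant here could be as simple as the following
Python sketch. The function name escape_for_diagnostics and the \xNN
notation are assumptions chosen for illustration; the draft does not
prescribe any particular escaped form. Note that the bytes are never
decoded as UTF-8.

```python
def escape_for_diagnostics(raw: bytes) -> str:
    """Render header bytes as printable ASCII for a log or error message.

    Printable ASCII is kept as-is; every other byte, and the backslash
    itself, is shown as \\xNN so the result is unambiguous.
    """
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02X}")
    return "".join(out)
```

For example, the (made-up) newsgroup name "norge.fjørd" encoded as UTF-8
comes out as norge.fj\xC3\xB8rd.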

User agents, OTOH, do need to know about UTF-8 characters so that they can
display them or generate them from the keyboard. But there is no need for
every user agent to understand all characters. If you want to read Chinese
newsgroups, then you had better buy yourself a user agent that understands
that part of the character space.


As regards RFC-2047, it is a SHOULD accept, but SHOULD NOT generate.
Indeed, in the Newsgroups header it is a definite MUST NOT be used. Granted
it may have to be used when downgrading to mail, but in that case it would
have to be restored on the upgrade.
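
For what the downgrade/upgrade round trip might look like, here is a
minimal Python sketch using the standard email.header module. The function
names downgrade_to_mail and upgrade_to_news are hypothetical, and the
sketch only covers an unstructured header such as Subject; it is not the
procedure from the draft.

```python
from email.header import Header, decode_header, make_header


def downgrade_to_mail(value: str) -> str:
    """Encode a UTF-8 header body as RFC 2047 encoded-words (pure ASCII)."""
    if value.isascii():
        return value  # nothing to do
    return Header(value, charset="utf-8").encode()


def upgrade_to_news(value: str) -> str:
    """Decode RFC 2047 encoded-words back to the original text."""
    return str(make_header(decode_header(value)))


original = "Grüße aus Köln"
on_the_wire = downgrade_to_mail(original)
assert on_the_wire.isascii()
assert upgrade_to_news(on_the_wire) == original
```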

HTH

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Email:     chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk  Web:   http://www.cs.man.ac.uk/~chl
Voice/Fax: +44 161 437 4506      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9     Fingerprint: 73 6D C2 51 93 A0 01 E7  65 E8 64 7E 14 A4 AB A5