ietf-822

Re: Language Tagging within Unicode

2003-01-31 10:12:42

In <4(_dot_)2(_dot_)0(_dot_)58(_dot_)J(_dot_)20030127183119(_dot_)054ec5d8(_at_)localhost> Martin Duerst <duerst(_at_)w3(_dot_)org> writes:

Hello Charles,

At 22:30 03/01/27 +0000, Charles Lindsey wrote:
Sorry, your mail address at <mduerst(_at_)ifi(_dot_)unizh(_dot_)ch> seems not
to work any longer, so I am trying w3.org. The original was Cced to the
Usefor and rfc822-list, and I have set Reply-To there.

Sorry, and thanks. Please keep me cc'ed; I'm not currently on
either list.


One way to check whether the two communities are actually pulling in
opposite directions would be to check to what extent things such as
Tag Characters (U+E0000-U+E007F) and/or RFC 2231 are actually implemented
and used. I don't know of actual use, but I may just not be aware of it.

I doubt anyone actually uses the Unicode Tag characters because, in
bodies, there is the Content-Language header. RFC 2231 introduces language
codes into headers (e.g. by way of updating RFC 2047), but I wouldn't know
how widely used they are if, indeed, anyone actually implements them. I
suspect that it is the Chinese and the Japanese who will be the main
users, if any.  There is, of course, no significant current usage
(experiments apart) of raw UTF-8 in headers.
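
For readers who have not met the RFC 2231 form: it lets the charset in an RFC 2047 encoded-word carry a "*language" suffix, as in "=?US-ASCII*EN?Q?Keith_Moore?=". The following is only a minimal sketch in Python of picking such a word apart; the function name split_encoded_word is hypothetical, and the pattern is deliberately simplified (no length limit, no check of the token characters).

```python
import re

# Encoded-word with an optional "*language" suffix on the charset.
# Simplified for illustration: it does not enforce the length limit or
# the exact token grammar of RFC 2047.
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*]+)(?:\*(?P<language>[^?]+))?"
    r"\?(?P<encoding>[QqBb])\?(?P<text>[^?]*)\?="
)

def split_encoded_word(word):
    """Return (charset, language, encoding, text), or None if no match.

    language is None when the word carries no "*language" suffix.
    """
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        return None
    return (m.group("charset"), m.group("language"),
            m.group("encoding").upper(), m.group("text"))

print(split_encoded_word("=?US-ASCII*EN?Q?Keith_Moore?="))
print(split_encoded_word("=?ISO-8859-1?Q?Andr=E9?="))
```

The second call shows that a plain RFC 2047 word still matches, with the language reported as None, which is why software unaware of the suffix tends to trip over the charset name rather than the syntax.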

My immediate concern is with the possibility that raw UTF-8 will become
the charset for headers within Netnews (and maybe even for Email one day).

That in and by itself would not be a concern at all for me.
I think that for Netnews, and also for Email, using raw UTF-8
is the right way to go.


It is in the nature of headers that the texts within them are short - they
are not intended for long essays (that is what message bodies are for). So
it could be argued that language tagging within them is hardly necessary;
in which case the question arises as to whether leaving it to Unicode
tagging for people who really want to do it would be enough.

Effectively, one would say "It will usually be unnecessary to use language
tagging within headers but, if it is considered necessary, then the
language tagging defined for Unicode MAY be used" (note that the
contexts where this would be applicable are all for human consumption).
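
To make concrete what that would mean on the wire, here is a rough sketch in Python, assuming the Unicode scheme in which U+E0001 (LANGUAGE TAG) is followed by tag characters that mirror printable ASCII at an offset of 0xE0000. The helper names language_prefix and strip_tag_characters are hypothetical, and the Subject text is an invented example.

```python
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset

def language_prefix(lang):
    """Encode a language code such as 'ja' as invisible tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)

def strip_tag_characters(text):
    """Drop everything in the Tag Characters block (U+E0000-U+E007F)."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)

plain = "日本語の件名"
tagged = language_prefix("ja") + plain

# Each tag character lies outside the BMP, so it costs 4 bytes in UTF-8.
overhead = len(tagged.encode("utf-8")) - len(plain.encode("utf-8"))
print("extra bytes for the 'ja' tag:", overhead)
print(strip_tag_characters(tagged) == plain)
```

A reader that does not understand the tags can simply discard them, as strip_tag_characters does, which is what makes the "MAY" wording tolerable for text meant only for human consumption.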


So is that an allowable usage according to Unicode 3.2? Observe that
headers using raw UTF-8 will not be using any MIME protocol.


My proposal for how to proceed, based on my understanding of
Internet protocols and their usage, and of Unicode,
would be to not use language tagging for headers (i.e. prefer
a tiny degree of 'suboptimality' that in most cases won't matter
at all over a complex design with low chance of getting
implemented (correctly, or at all)).

The only real reason we are having to consider them at all is because RFC
2277 says you MUST make some provision for language tagging in any IETF
standard. I am sure they had bodies rather than headers in mind when they
wrote that, but there it is (though one could argue that RFC 2277 is only
a BCP document).


As an alternative, a wording like the one you propose, which
leaves this to the implementers to 'vote with their feet'
(as my understanding is they have done up to now) will probably
also not cause too much damage.

Yes, that was the idea of my wording. "You MAY do it this way if you really
really must, but we really recommend you not to bother". Japanese browsers
might go so far as to recognise them in that context, though I doubt many
Japanese would bother to include them in their headers (except perhaps for
their names which would likely be configured with great care into their
browsers).

However, in that case, I think it would be important to make a
clear exception for identifiers (e.g. newsgroup names, email addresses,
as opposed to simply descriptive text such as subjects, ...),
where adding language tag characters would lead to a breakdown
in interoperability and where they therefore clearly should
be prohibited.

Absolutely so. They will be totally forbidden in newsgroup-names.
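
The sort of check this implies is simple enough. As a sketch only, in Python, with hypothetical function names and covering nothing but the tag-character rule (not any of the other syntax rules for newsgroup-names):

```python
def contains_tag_characters(s):
    """True if s holds any character from U+E0000 to U+E007F."""
    return any(0xE0000 <= ord(c) <= 0xE007F for c in s)

def check_newsgroup_name(name):
    """Reject a newsgroup-name that carries Unicode tag characters.

    Only the tag-character prohibition is tested here; a real
    implementation would apply the rest of the syntax as well.
    """
    if contains_tag_characters(name):
        raise ValueError("tag characters are not allowed in newsgroup-names")
    return name

print(check_newsgroup_name("comp.lang.python"))

try:
    # U+E0001 LANGUAGE TAG followed by tag letters 'e', 'n'
    check_newsgroup_name("\U000E0001\U000E0065\U000E006Ecomp.test")
except ValueError as err:
    print("rejected:", err)
```

Rejecting outright, rather than silently stripping, is a design choice made for this sketch: two names that differ only in invisible characters would otherwise look identical to a human while failing to match as identifiers.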

Anyway, any more opinions from either of these lists?

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk
Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
