Re: Language Tagging within Unicode


Hello Charles,

At 22:30 03/01/27 +0000, Charles Lindsey wrote:

Sorry, your mail address at <mduerst(_at_)ifi(_dot_)unizh(_dot_)ch> seems not
to work any longer, so I am trying w3.org. The original was Cced to the Usefor
and rfc822-list, and I have set Reply-To there.


Sorry, and thanks. Please keep me cc'ed; I'm not currently on
either list.

See below for comments.

--- Below this line is a copy of the message.

Date: Mon, 27 Jan 2003 14:37:34 +0000 (GMT)
From: Charles Lindsey <chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk>
Reply-To: Charles Lindsey <chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk>
Subject: Language Tagging within Unicode
To: "Martin J. Duerst" <mduerst(_at_)ifi(_dot_)unizh(_dot_)ch>
Cc: usenet-format(_at_)landfield(_dot_)com, ietf-822(_at_)imc(_dot_)org

This message is addressed primarily to Martin Duerst, but is Cced to
the Usefor and ietf-822 lists. Will people replying please ensure that
Martin is copied, since I am not aware that he regularly reads either of
those lists.

There currently appears to be some confusion between the Unicode community and
the Email community as to the correct usage of Language Tagging.

Within Unicode (since version 3.1) there have existed the Tag Characters
U+E0000-U+E007F which can, in principle, change the language over and over
again within a single text.
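
For illustration only, here is a minimal sketch of how such a tag is spelled
out, assuming the usual description of the block: U+E0001 LANGUAGE TAG
introduces the tag, and the characters that follow mirror the printable ASCII
characters of the language identifier, offset by 0xE0000. The helper name
language_tag is made up for this example.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG introduces a tag
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def language_tag(lang: str) -> str:
    """Spell out an ASCII language identifier such as 'ja' in tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language identifier must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


# A Japanese string preceded by an (invisible) "ja" tag.
tagged = language_tag("ja") + "日本語"
print([f"U+{ord(c):04X}" for c in tagged])
```

The tagged string is three code points longer than the visible text, which is
the extra burden on receivers that the remarks quoted below are concerned
with.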

However, there are discouraging remarks about the use of this facility:

    ... However, the use of these characters is strongly discouraged. The
    characters in this block are reserved for use with special protocols.
    They are not to be used in the absence of such protocols, or with any
    protocols that provide alternate means for language tagging, such as
    HTML or XML.  The requirement for language information embedded in
    plain text data is often overstated. ...

(However, there is no defined list of those "special protocols".)


[aside: I think it would be inappropriate for Unicode to say which
 protocols should use them or not. I think one thing the above is
 saying is: don't assume you can use them unless the protocol you
 are using explicitly says you can.]

    Because of the extra implementation burden, language tags should be
    avoided in plain text unless language information is required and it is
    known that the receivers of the text will properly recognize and maintain
    the tags. ...

    Language tags should also be avoided wherever higher-level protocols,
    such as a rich-text format, HTML or MIME, provide language attributes.
    This practice prevents cases where the higher-level protocol and the
    language tags disagree. See Unicode Technical Report #20, "Unicode in
    XML and other Markup Languages".

All of which seems to be saying "Don't use Unicode language tagging if the
protocol provides some alternative, and even if it doesn't, think twice
about whether you really, really need to do it".


I would say that this is a fair summary.

Indeed, it is rumoured
that the next revision of Unicode will deprecate them even more.


I seem to remember having heard such rumors, although I can't
confirm them off the top of my head.

OTOH, within the Email community, I find the following within RFC 2231,
when introducing a language-tag extension to RFC 2047:

    In the future it is likely that some character sets will provide
    facilities for inline language labeling. Such facilities are
    inherently more flexible than those defined here as they allow for
    language switching in the middle of a string.

    If and when such facilities are developed they SHOULD be used in
    preference to the language labeling facilities specified here. Note
    that all the mechanisms defined here allow for the omission of
    language labels so as to be able to accommodate this possible future
    usage.

And also within RFC 3066 I find:

    The issue of deciding upon the rendering of a character set based on
    the language tag is not addressed in this memo; however, it is
    thought impossible to make such a decision correctly for all cases
    unless means of switching language in the middle of a text are
    defined (for example, a rendering engine that decides font based on
    Japanese or Chinese language may produce suboptimal output when a
    mixed Japanese-Chinese text is encountered).

Which at least suggests that the ability to change languages in mid-text
is a useful feature to have.

Now whilst that may or may not represent the current view of the Email
community, it is certainly on the record as such, and so the two
communities would seem to be pulling in opposite directions. This really
needs to be clarified.


One way to check whether the two communities are actually pulling in
opposite directions would be to check to what extent things such as
Tag Characters (U+E0000-U+E007F) and/or RFC 2231 are actually implemented
and used. I don't know of actual use, but I may just not be aware of it.
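
Such a check could be sketched roughly as follows. This is only an
illustration under my own assumptions: it looks for the RFC 2231 form of an
encoded-word ("=?charset*language?encoding?text?=") and for any character in
the Tag Characters block. The function name and the shape of the result are
invented for the example, and the regular expression is deliberately loose
rather than a full RFC 2047 parser.

```python
import re

# RFC 2231 extends RFC 2047 encoded-words to "=?charset*language?encoding?text?=".
ENCODED_WORD_LANG = re.compile(
    r"=\?[^?*\s]+\*([A-Za-z0-9-]+)\?[bBqQ]\?[^?\s]*\?="
)
# Any character from the Tag Characters block U+E0000-U+E007F.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")


def language_labelling_used(header_value: str) -> dict:
    """Report which language-labelling mechanisms appear in one header value."""
    return {
        "rfc2231_languages": ENCODED_WORD_LANG.findall(header_value),
        "unicode_tag_chars": len(TAG_CHARS.findall(header_value)),
    }


sample = "=?US-ASCII*EN?Q?Keith_Moore?= <moore@cs.utk.edu>"
print(language_labelling_used(sample))
```

Running something like this over a large archive of headers would give a first
impression of actual use, though it says nothing about whether receiving
software does anything with the labels.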

My immediate concern is with the possibility that raw UTF-8 will become
the charset for headers within Netnews (and maybe even for Email one day).


That in and of itself would not be a concern at all for me.
I think that for Netnews, and also for Email, using raw UTF-8
is the right way to go.

It is in the nature of headers that the texts within them are short - they
are not intended for long essays (that is what message bodies are for). So
it could be argued that language tagging within them is hardly necessary;
in which case the question arises as to whether leaving it to Unicode
tagging for people who really want to do it would be enough.

Effectively, one would say "It will usually be unnecessary to use language
tagging within headers but, if it is considered necessary, then the
language tagging defined for Unicode MAY be used" (note that the
contexts where this would be applicable are all for human consumption).


I think that RFC 3066 is correct in saying 'may produce suboptimal
output' (when a mixed Japanese-Chinese text is encountered).
I think this is just an observation, without any kind of analysis
as to how suboptimal that output is.

I think we can agree that the rendering of email headers
(and also of email bodies) is in many ways 'suboptimal',
e.g. not of outstanding typographic quality, and so on.
In almost all cases, this suboptimality is widely accepted.
Of course, if you have a Chinese or Japanese name and want
it shown in a particular form, then this suboptimality is
less acceptable to you. But then again, there are people
who insist on particular shapes for a character in their
name, for example within Japan, in a way which cannot be
expressed in the current national standard (and therefore
not in current email practice) and also cannot be distinguished
by using language tags.
[Please note that there are a lot of heuristics (e.g. based
on analysis of the characters used; using user preference;
using the 'Content-Language' header on the body;...) and
other tricks (e.g. using a font that de-emphasizes the
differences, which will be the case anyway in many of the
small-size fonts used for headers) to easily get a very good
display in almost all cases.]

So is that an allowable usage according to Unicode 3.2? Observe that
headers using raw UTF-8 will not be using any MIME protocol.

If this is not allowable, then the only way to tag these short texts would
be through the use of RFC 2047 or RFC 2231 in place of raw UTF-8, which is
OK so far as it goes, but might not be so if those usages are phased out
in the future (we are talking many many years down the line here, but it
could happen one day).


My understanding of Unicode 3.2 is that if you really, really
think that you need to use these tags, then that's what they
are for. But before using them, you should really, really
consider the issue carefully.

My proposal for how to proceed, based on my understanding of
Internet protocols and their usage, and of Unicode,
would be to not use language tagging for headers, i.e. to prefer
a tiny degree of 'suboptimality', which in most cases won't matter
at all, over a complex design with a low chance of getting
implemented correctly, or at all.

As an alternative, a wording like the one you propose, which
leaves this to the implementers to 'vote with their feet'
(as my understanding is they have done up to now) will probably
also not cause too much damage.

However, in that case, I think it would be important to make a
clear exception for identifiers (e.g. newsgroup names, email addresses,
as opposed to simply descriptive text such as subjects,...),
where adding language tag characters would lead to a breakdown
in interoperability and where they therefore clearly should
be prohibited.
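
To make the interoperability problem concrete, here is a small sketch. It
assumes nothing beyond the block range U+E0000-U+E007F mentioned above; the
function names are hypothetical and not taken from any draft.

```python
def contains_tag_characters(text: str) -> bool:
    """True if any character lies in the Tag Characters block U+E0000-U+E007F."""
    return any(0xE0000 <= ord(c) <= 0xE007F for c in text)


def check_identifier(identifier: str) -> str:
    """Reject identifiers (newsgroup names, addresses) carrying tag characters."""
    if contains_tag_characters(identifier):
        raise ValueError("tag characters are not allowed in identifiers")
    return identifier


plain = "comp.lang.c"
# The same newsgroup name preceded by a language tag spelling "en".
tagged = "\U000E0001\U000E0065\U000E006E" + plain

# The two strings name the same group to a human reader, but they do not
# compare equal, so lookups and comparisons would fail.
print(plain == tagged)
print(contains_tag_characters(tagged))
```

Descriptive text such as a Subject would not need this check; it is only the
strings that software compares or looks up that break.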


Regards,    Martin.