Language Tagging within Unicode


This message is addressed primarily to Martin Duerst, but is Cc'd to
the Usefor and ietf-822 lists. Will people replying please ensure that
Martin is copied, since I am not aware that he regularly reads either of
those lists.


There currently appears to be some confusion between the Unicode community and
the Email community as to the correct usage of Language Tagging.

Since version 3.1, Unicode has included the Tag Characters
(U+E0000-U+E007F), which can, in principle, switch the language repeatedly
within a single text.
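
For concreteness, here is a minimal sketch (in Python, chosen purely for
illustration) of how such tagging looks in practice. It assumes the usual
layout of the block: U+E0001 LANGUAGE TAG introduces a tag, and the tag
characters U+E0020-U+E007E mirror printable ASCII at an offset of 0xE0000.
The function names tag_prefix, tagged and strip_tags are hypothetical helpers
of my own, not part of any standard or library.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG, introduces a language tag
TAG_OFFSET = 0xE0000     # tag characters mirror ASCII at this offset


def tag_prefix(lang: str) -> str:
    """Return the tag-character sequence announcing language `lang`."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def tagged(lang: str, text: str) -> str:
    """Prefix `text` with an inline language tag (hypothetical helper)."""
    return tag_prefix(lang) + text


def strip_tags(text: str) -> str:
    """Drop every character in the Tags block, as an unaware reader might."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


# Switch language in mid-text: Japanese first, then Chinese.
mixed = tagged("ja", "日本語") + tagged("zh", "中文")
plain = strip_tags(mixed)
```

Note that each switch costs one LANGUAGE TAG character plus one tag character
per letter of the language code, and that a receiver which does not understand
the tags can recover the plain text simply by discarding the whole block.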

However, there are discouraging remarks about the use of this facility:

    ... However, the use of these characters is strongly discouraged. The
    characters in this block are reserved for use with special protocols.
    They are not to be used in the absence of such protocols, or with any
    protocols that provide alternate means for language tagging, such as
    HTML or XML.  The requirement for language information embedded in
    plain text data is often overstated. ...

(However, there is no defined list of those "special protocols".)

    Because of the extra implementation burden, language tags should be
    avoided in plain text unless language information is required and it is
    known that the receivers of the text will properly recognize and maintain
    the tags. ...

    Language tags should also be avoided wherever higher-level protocols,
    such as a rich-text format, HTML or MIME, provide language attributes.
    This practice prevents cases where the higher-level protocol and the
    language tags disagree. See Unicode Technical Report #20, "Unicode in
    XML and other Markup Languages".

All of which seems to be saying "Don't use Unicode language tagging if the
protocol provides some alternative, and even if it doesn't, think twice
about whether you really, really need to do it". Indeed, it is rumoured
that the next revision of Unicode will deprecate them even more.


OTOH, within the Email community, I find the following within RFC 2231,
when introducing a language-tag extension to RFC 2047:

    In the future it is likely that some character sets will provide
    facilities for inline language labeling. Such facilities are
    inherently more flexible than those defined here as they allow for
    language switching in the middle of a string.

    If and when such facilities are developed they SHOULD be used in
    preference to the language labeling facilities specified here. Note
    that all the mechanisms defined here allow for the omission of
    language labels so as to be able to accommodate this possible future
    usage.
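
By way of contrast, the RFC 2231 extension labels a whole encoded-word by
appending "*" and a language tag to the charset name, as in
"=?US-ASCII*EN?Q?Keith_Moore?=". The following Python sketch shows how a
reader might pull the language out of such a word. The pattern is a
simplification of the RFC 2047 grammar and decode_word is a hypothetical name
of my own; treat it as an illustration, not a conforming parser.

```python
import base64
import quopri
import re

# Simplified pattern for "=?charset[*language]?encoding?encoded-text?=".
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*]+)(?:\*(?P<lang>[^?]+))?"
    r"\?(?P<enc>[BbQq])\?(?P<text>[^?]*)\?="
)


def decode_word(word: str):
    """Return (decoded text, language or None) for one encoded-word."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        raise ValueError("not an encoded-word: %r" % word)
    raw = m["text"].encode("ascii")
    if m["enc"].upper() == "B":
        data = base64.b64decode(raw)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        data = quopri.decodestring(raw, header=True)
    return data.decode(m["charset"]), m["lang"]


labelled = decode_word("=?US-ASCII*EN?Q?Keith_Moore?=")
unlabelled = decode_word("=?ISO-8859-1?Q?Andr=E9?=")
```

The point to observe is that the label applies to the encoded-word as a unit;
there is no way to change language part way through it, which is exactly the
limitation the RFC 2231 text above acknowledges.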

And also within RFC 3066 I find:

    The issue of deciding upon the rendering of a character set based on
    the language tag is not addressed in this memo; however, it is
    thought impossible to make such a decision correctly for all cases
    unless means of switching language in the middle of a text are
    defined (for example, a rendering engine that decides font based on
    Japanese or Chinese language may produce suboptimal output when a
    mixed Japanese-Chinese text is encountered).

Which at least suggests that the ability to change languages in mid-text
is a useful feature to have.

Now whilst that may or may not represent the current view of the Email
community, it is certainly on the record as such, and so the two
communities would seem to be pulling in opposite directions. This really
needs to be clarified.


My immediate concern is with the possibility that raw UTF-8 will become
the charset for headers within Netnews (and maybe even for Email one day).
It is in the nature of headers that the texts within them are short - they
are not intended for long essays (that is what message bodies are for). So
it could be argued that language tagging within them is hardly necessary;
in which case the question arises as to whether leaving it to Unicode
tagging for people who really want to do it would be enough.

Effectively, one would say "It will usually be unnecessary to use language
tagging within headers but, if it is considered necessary, then the
language tagging defined for Unicode MAY be used" (note that the
contexts where this would be applicable are all for human consumption).

So is that an allowable usage according to Unicode 3.2? Observe that
headers using raw UTF-8 will not be using any MIME protocol.

If this is not allowable, then the only way to tag these short texts would
be through the use of RFC 2047 or RFC 2231 in place of raw UTF-8, which is
OK so far as it goes, but might not be so if those usages are phased out
in the future (we are talking many many years down the line here, but it
could happen one day).



Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5