Re: Language Tagging within Unicode


Charles Lindsey wrote:

This message is addressed primarily to Martin Duerst, but is Cced to
the Usefor and ietf-822 lists. Will people replying please ensure that
Martin is copied, since I am not aware that he regularly reads either of
those lists.


There currently appears to be some confusion between the Unicode community and
the Email community as to the correct usage of Language Tagging.

Within Unicode (since version 3.1) there have existed the Tag Characters
U+E0000-U+E007F which can, in principle, change the language over and over
again within a single text.

However, there are discouraging remarks about the use of this facility:

    ... However, the use of these characters is strongly discouraged. The
    characters in this block are reserved for use with special protocols.
    They are not to be used in the absence of such protocols, or with any
    protocols that provide alternate means for language tagging, such as
    HTML or XML.  The requirement for language information embedded in
    plain text data is often overstated. ...

(However, there is no defined list of those "special protocols".)


There is; one of the Unicode documents related to the introduction of
these characters specifically mentions ACAP as the intended use, as was
pointed out several months ago on the Usefor list when this topic first
came up, and repeated recently. A search for "ACAP" on the Unicode web
site takes one right to the document and highlights the relevant text.
The full document is: http://www.unicode.org/unicode/reports/tr20/index.html
Martin is listed as one of the co-authors. The relevant section is

"
3.10 Language Tag Characters, U+E0000 .. U+E007F

Short description: A series of characters for expressing language tags, based 
on existing standards for language tags using the rules in [Unicode31].

Reason for inclusion: These characters allow in-band language tagging in 
situations where full markup is not available, while allowing easy filtering by 
applications that do not support them. They were solely included for the 
benefit of those Internet protocols, such as ACAP, which require a standard 
mechanism for marking language in UTF-8 strings, and at the same time to avoid 
the use of other tagging schemes that relied on specific details of the 
encoding form used.

Problems when used in markup: These characters duplicate information that can 
be expressed in markup.

Problems with other uses: Their special code range allows them to be easily 
filtered, but applications that do not expect them will treat them as garbage 
characters.

Replacement markup: Replace with equivalent language markup. XML and XHTML have 
the xml:lang attribute. HTML has the lang attribute. These attributes follow 
different scoping rules than the tag characters, therefore this replacement 
will generally not be a simple 1:1 substitution.

What to do if detected: Browsers may ignore these characters. When received in 
an editing context, editors may remove or replace them by equivalent markup.
"

[copied and pasted directly from the web site -- Charles, you may
direct complaints about line length to the Unicode consortium.]
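
To make the "easily filtered" point concrete, here is a minimal Python
sketch. It assumes the layout of the tag block as I understand it:
U+E0001 introduces a language tag, U+E0020..U+E007E mirror ASCII at an
offset of 0xE0000, and U+E007F cancels. The function names are mine and
purely illustrative; nothing here is taken from the Unicode documents.

```python
# Illustrative sketch only: names and structure are assumptions.
TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F
LANGUAGE_TAG = 0xE0001  # introduces a language tag


def make_language_tag(lang):
    """Encode an ASCII language tag such as 'fr' as tag characters."""
    return chr(LANGUAGE_TAG) + "".join(
        chr(TAG_BLOCK_START + ord(c)) for c in lang
    )


def strip_tags(text):
    """Drop every character in U+E0000..U+E007F (the 'easy filtering')."""
    return "".join(
        ch for ch in text
        if not TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END
    )


def language_tags(text):
    """Return the language tags found in text, in order, as ASCII."""
    tags, current = [], None
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            if current:
                tags.append(current)
            current = ""  # start collecting a new tag
        elif current is not None and 0xE0020 <= cp <= 0xE007E:
            current += chr(cp - TAG_BLOCK_START)
        elif current is not None:
            if current:
                tags.append(current)
            current = None  # tag ended at the first non-tag character
    if current:
        tags.append(current)
    return tags
```

Filtering really is a one-liner; the burden discussed below lies in
keeping the tags attached to the right text when that text is edited,
which this sketch does not attempt.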

    Because of the extra implementation burden, language tags should be
    avoided in plain text unless language information is required and it is
    known that the receivers of the text will properly recognize and maintain
    the tags. ...


In this case proper recognition requires that all receivers use an
implementation of Unicode 3.1 or 3.2; earlier versions definitely
won't work, and later versions might not work.  Maintaining the
tags not only requires a Unicode editor, it requires an editor which
properly handles the 3.1/3.2 language tags when modifying text -- I
am not aware of any such editor.

OTOH, within the Email community, I find the following within RFC 2231,
when introducing a language-tag extension to RFC 2047:

    In the future it is likely that some character sets will provide
    facilities for inline language labeling. Such facilities are
    inherently more flexible than those defined here as they allow for
    language switching in the middle of a string.


In the sense of character sets as used in RFC 2231, there is no language
labeling provided by any character set; Unicode 3.1 is not a character
set in that sense -- utf-7 is, but neither utf-7 nor any other charset
provides language labeling.

Flexibility is not attained without a price; in the case of paired tags,
one price is the high complexity required to maintain tags when editing. Even
higher complexity is required to support nested paired tags, as used in
Unicode 3.1 and 3.2.  That is the "implementation burden" associated
with use of those tags, a burden which does not exist with RFC 2047/2231
tagging as the 2047/2231 methods do not use paired *or* nested tags.

Note also that because of the display rules associated with encoded-words,
it *is* possible to switch languages within a contiguously-displayed
string using RFC 2047/2231 encoding in human-readable text in comments,
phrases, and in unstructured fields.  It is not possible to do so within
a single *parameter*, but that's not a big deal since most parameters
are intended as protocol elements, not as human-readable text strings.
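
As a rough illustration of that language switching, the Python sketch
below builds adjacent encoded-words using the RFC 2231 form
"=?charset*language?encoding?text?=". The helper names are hypothetical,
the "B" encoding is chosen arbitrarily, and each segment is assumed to be
short enough to fit in one encoded-word (RFC 2047 limits an encoded-word
to 75 characters).

```python
import base64


def encoded_word(text, language, charset="utf-8"):
    """Build one language-tagged encoded-word (hypothetical helper)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    word = f"=?{charset}*{language}?B?{payload}?="
    if len(word) > 75:
        raise ValueError("encoded-word longer than 75 characters")
    return word


def tagged_text(segments, charset="utf-8"):
    """Join (text, language) segments as adjacent encoded-words.

    White space between adjacent encoded-words is not displayed, so any
    space wanted in the displayed text must be inside a segment.
    """
    return " ".join(
        encoded_word(text, language, charset) for text, language in segments
    )


subject = tagged_text([("Bonjour ", "fr"), ("hello", "en")])
```

Here subject holds two encoded-words, one tagged "fr" and one tagged
"en", which a conforming reader displays as a single contiguous string.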

    If and when such facilities are developed they SHOULD be used in
    preference to the language labeling facilities specified here. Note
    that all the mechanisms defined here allow for the omission of
    language labels so as to be able to accommodate this possible future
    usage.


As noted, it hasn't happened yet.

And also within RFC 3066 I find:

    The issue of deciding upon the rendering of a character set based on
    the language tag is not addressed in this memo; however, it is
    thought impossible to make such a decision correctly for all cases
    unless means of switching language in the middle of a text are
    defined (for example, a rendering engine that decides font based on
    Japanese or Chinese language may produce suboptimal output when a
    mixed Japanese-Chinese text is encountered).

Which at least suggests that the ability to change languages in mid-text
is a useful feature to have.


RFC 2047/2231 provides that capability for human-readable text in message
header fields, MIME-part fields, and MDN and DSN fields, and some rich
text formats (e.g. text/html) provide such capability for message body
text.  RFC 2231 provides the ability to specify language on a per-parameter
basis. Yes, it is a (sometimes) useful feature, and with the exception of
within a single parameter, it is already supported by 2047/2231 for
message header and MIME-part fields, and via MIME-supported rich text
media formats for body content.
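
For the per-parameter case, a small sketch using the Python standard
library's email.utils helpers shows the charset'language'value form of
an RFC 2231 extended parameter value. The wrapper function names and the
sample file name are made up for illustration.

```python
from email.utils import decode_rfc2231, encode_rfc2231
from urllib.parse import unquote


def tag_parameter(value, language, charset="utf-8"):
    """Return an extended value in charset'language'percent-encoded form."""
    return encode_rfc2231(value, charset=charset, language=language)


def untag_parameter(extended):
    """Split an extended value into (text, language)."""
    charset, language, quoted = decode_rfc2231(extended)
    if charset is None:
        # No charset'language' prefix was present; return the value as-is.
        return extended, None
    return unquote(quoted, encoding=charset), language


extended = tag_parameter("r\u00e9sum\u00e9.txt", "fr")
text, language = untag_parameter(extended)
```

The language applies to the whole parameter value, which is the
single-parameter limitation mentioned above.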

Now whilst that may or may not represent the current view of the Email
community, it is certainly on the record as such, and so the two
communities would seem to be pulling in opposite directions.


There is no conflict between 2047/2231 and 3066, nor with all of those
and RFC 2277.  There is no conflict between those RFCs and the Unicode
documents; both sets of documents clearly indicate use of existing
MIME mechanisms in text messages (until such time as a *charset* provides
language labeling -- and even then 2047/2231 will be required indefinitely
to support reading of legacy content).

I don't know off the top of my head whether or not HTML supports *nested*
language tags, but that's not a big issue (for body text), since (unlike
Unicode language-tag-aware editors) there *are* HTML editors widely
available, and I expect that they do the right thing if they support
HTML language tagging.  Moreover, editing body text is practical with an
external program; not so for editing message and MIME-part header fields.

My immediate concern is with the possibility that raw UTF-8 will become
the charset for headers within Netnews (and maybe even for Email one day).


Maybe pigs will fly one day; it's not yet time to invest in
pig-manure-resistant umbrellas or reinforce flagpoles to carry the
weight of roosting pigs.

It is in the nature of headers that the texts within them are short - they
are not intended for long essays (that is what message bodies are for). So
it could be argued that language tagging within them is hardly necessary;


See RFC 2277 and consider Subject, Comments, and Keywords headers, not
to mention parenthesized comments in structured header fields and
proper names and nicknames in phrases associated with email mailboxes
(e.g. the proper name "Jesus" is pronounced differently depending on
language).  And there are some individuals with impaired vision who rely
on screen reader technology.

in which case the question arises as to whether leaving it to Unicode
tagging for people who really want to do it would be enough.


Unicode 3.1/3.2 are not universally deployed. No header fields may
yet contain unencoded non-ascii octets, either in the message header
or in MIME-part headers.  So the obvious answer is no, Unicode language
tagging is not (yet) practical.  The MIME mechanisms, on the other
hand, have been in place for some time, are widely deployed in mail
and news user agents, and are fully compliant with Best Current
Practice (e.g. RFC 2277).

Effectively, one would say "It will usually be unnecessary to use language
tagging within headers but, if it is considered necessary, then the
language tagging defined for Unicode MAY be used" (note that the
contexts where this would be applicable are all for human consumption).


That is not applicable in message or MIME-part header fields and
would not comply with the requirements of RFC 2277 for those
fields.  Because the Unicode documents expressly forbid use of the
language tags with MIME, they would be inappropriate for body text
in a MIME message, and there is no way to get body content other
than plain text in US-ASCII with a 7bit transfer encoding except
via MIME.

So is that an allowable usage according to Unicode 3.2? Observe that
headers using raw UTF-8 will not be using any MIME protocol.


By definition, no header fields may use raw UTF-8.  Any message
interpreted as a MIME message necessarily uses MIME protocol, and
that includes MIME-part header fields as well as all header fields
in any message that contains a MIME-Version header field. Even if
you package such (non-)"headers" in an application/news-transmission
wrapper, that's still (obviously!) MIME.

If this is not allowable, then the only way to tag these short texts would
be through the use of RFC 2047 or RFC 2231 in place of raw UTF-8


No, not "in place of raw UTF-8"; just RFC 2047 and/or RFC 2231 full stop.

which is OK so far as it goes, but might not be so if those usages are
phased out
in the future (we are talking many many years down the line here, but it
could happen one day).


So how many pig-manure-resistant umbrellas have you purchased,
Charles?  Or better still, how many would you like to purchase
(such a deal I have for you...)?

As mentioned above, 2047/2231 support will be required indefinitely,
if for no other reason than for reading of legacy content.