RE: Last Call on Language Tags (RE: draft-phillips-langtags-08)

From: John C Klensin [mailto:john-ietf(_at_)jck(_dot_)com]

Ignoring whether "that very nearly happened in RFC 3066",
because some of us would have taken exception to inserting a
script mechanism then, let's assume that 3066 can be
characterized as a language-locale standard (with some funny
exceptions and edge cases) and that the new proposal could
similarly be characterized as a language-locale-script standard


I can see we might run into some terminological hurdles here. I would
decidedly *not* describe RFC 3066 as a "locale" standard just because it
allows for tags that include country identifiers. I would strongly
contend that a "language" tag and a "locale" ID are different things
serving quite different purposes. But I'll read the rest of your
comments assuming that by "language-locale(-script) standard" you simply
mean a standard for language tags that can include subtags for region
and script.

If one makes that assumption, then
the (or a) framework for the answer to the question of what
problem this solves that 3066 does not becomes clear: it meets
the needs of when a language-locale-script specification is
needed.

But that takes us immediately to the comments Ned and I seem to
be making, characterized especially by Ned's "sweet spot"
remark.  It has not been demonstrated that Internet
interoperability generally, and the settings in which 3066 is
now used in particular, require a language-locale-script set of
distinctions.


I disagree. There are many cases in which script distinctions in
language tags have been recognized as being needed; several such tags
have been registered for that reason already under the terms of RFC
3066, and there are more that would already have been registered except
for the fact that people have been anticipating acceptance of this
proposed revision. (For instance, in response to recent discussions, a
representative of Reuters has indicated that he was holding off
registering various language tags that include ISO 15924 script IDs on
that basis, and that he plans to do so if this proposed revision is
delayed much longer.)

The document does not address that issue.


That is probably because those of us who have been participants in the
IETF-language list, where this draft originated, have become so familiar
with the need that it seems obvious -- evidently, it's not as obvious to
people who have not been as focused on IT-globalization issues as we
have.

Equally important, but just as one example, in the MIME context
(just one use of 3066, but a significant one), we've got a
"charset" parameter as well as a "language" one.   There are
some odd new error cases if script is incorporated into
"language" as an explicit component but is not supported in the
relevant "charset".  On the one hand, the document does not
address those issues and that is, IMO, a problem.  But, on the
other, no matter how they are addressed, the level of complexity
goes up significantly.


I don't see how such error cases are significantly different from
current possibilities, such as having a language tag of "hi" and a
charset of ISO 8859-1 (where the content actually uses some
non-standard encoding for Devanagari).
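
To make that existing possibility concrete, here is a minimal sketch
using Python's standard email library. The body text is a placeholder
and the message itself is hypothetical; the point is only that the
language label and the charset label are set independently, and
nothing checks one against the other.

```python
from email.message import EmailMessage

# Hypothetical message: the body is a placeholder, not real content.
msg = EmailMessage()

# Declare the charset as ISO 8859-1 ...
msg.set_content("placeholder body", charset="iso-8859-1")

# ... and, independently, declare the language as Hindi.
# (Set after set_content(), which clears existing Content-* headers.)
msg["Content-Language"] = "hi"

# Both labels are accepted as-is; no consistency check is applied.
print(msg["Content-Type"])
print(msg["Content-Language"])
```

The mismatch is already expressible today with a plain "hi" tag, which
is why I don't think script subtags create a new class of error.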

One can also raise questions as to whether, if script
specifications are really needed, those should reasonably be
qualifiers or parameters associated with "charset" or "language"
(and which one) rather than incorporated into the latter.  I
don't have any idea what the answer to those questions ought to
be.


Having worked on these particular issues for several years, I and many
others feel we *do* have an idea what the answer to those questions
ought to be -- that script should be incorporated as a permitted subtag
within a language tag.

But they are fairly subtle, the document doesn't address
them (at least as far as I can tell), and I see no way to get to
answers to them without a lot more specificity about what real
internetworking or interoperability problem you are trying to
solve.


Some days ago, I made reference to a white paper I wrote a few years ago
that explores the kinds of distinctions that need to be made in metadata
elements declaring linguistic attributes of information objects. It's
long, and there are some details I'd change, but that may provide a
starting point. The people who have contributed to this draft are all
familiar with these ideas. You can find this paper at
http://www.sil.org/silewp/abstract.asp?ref=2002-003. Granted, this paper
does not go into details regarding specific implementations.

Similarly, as we know, painfully, from other
internationalization efforts, the only comparisons that are easy
involve bit-string identity.  Working out, at an application
level, when two "languages" under the 3066 system are close
enough that the differences can be ignored for practical
purposes is quite uncomfortable.   Attempting similar logic for
this new proposal is mind-boggling, especially if one begins to
contemplate comparison of a language-locale specification with a
language-script one -- a situation that I believe from reading
the spec is easily possible.


RFC 3066 makes reference to a fairly simplistic matching algorithm using
the notion of language-range. The proposed draft would continue to
support that same algorithm with an expectation that implementations of
language-range matching as defined in RFC 3066 would continue to operate
using exactly the same algorithm on new tags permitted by the proposed
revision -- and with generally desirable results. 
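
As a rough illustration, here is a minimal sketch of that matching in
Python. It assumes the prefix rule from RFC 3066: a language-range
matches a tag if it equals the tag, or if it is a prefix of the tag and
the next character is "-", with "*" matching everything. The function
name is hypothetical, and "zh-Hans-CN" is included only as an example of
the kind of tag the proposed revision would permit.

```python
def matches(language_range: str, tag: str) -> bool:
    """Prefix match on subtag boundaries, compared case-insensitively."""
    rng = language_range.lower()
    t = tag.lower()
    if rng == "*":
        return True  # the wildcard range matches any tag
    # Either an exact match, or the range is followed by "-" in the tag.
    return t == rng or t.startswith(rng + "-")


tags = ["zh-CN", "zh-TW", "zh-Hans", "zh-Hant", "zh-Hans-CN", "en-US"]
for rng in ["zh", "zh-Hans", "zh-CN"]:
    print(rng, [t for t in tags if matches(rng, t)])
```

The same unmodified function handles tags with script subtags: a range
of "zh" picks up all the Chinese tags, and "zh-Hans" picks up
"zh-Hans-CN". A range of "zh-CN" does not match "zh-Hans-CN", which is
one reason I say the results are "generally" rather than always
desirable.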

There may be implementations that use a more complex approach to
matching involving inspection of the tagged content itself, or
inspecting the particular subtags of a language tag. Certainly an
existing RFC 3066 implementation that does the latter will not be aware
of the specific syntax of the proposed revision, though it also cannot
be aware of registered RFC 3066 tags defined after the implementation
was created -- there is no categorical difference here. 

As for how difficult it would be to update such an implementation to use
a sophisticated matching algorithm based on interpretation of individual
subtags permitted by this draft, I grant that there is greater
complexity, but the draft specifically imposes syntactic constraints
that allow different types of sub-elements to be identified quite
readily. 
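
A minimal sketch of what I mean, in Python. It assumes the length
conventions I understand the draft to rely on: a four-letter subtag is
an ISO 15924 script ID, and a two-letter or three-digit subtag after the
first position is a region. The function name is hypothetical, and
everything else (variants, extensions, private-use and grandfathered
tags) is deliberately lumped together as "other".

```python
import re


def classify_subtags(tag: str) -> list:
    """Label each subtag by position and shape (simplified assumption)."""
    subtags = tag.split("-")
    # The first subtag is taken to be the primary language.
    result = [("language", subtags[0])]
    for sub in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", sub):
            kind = "script"   # assumed: four letters, e.g. Hans
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub):
            kind = "region"   # assumed: two letters or three digits
        else:
            kind = "other"    # not distinguished in this sketch
        result.append((kind, sub))
    return result


for example in ["zh-CN", "zh-Hans", "zh-Hans-CN"]:
    print(example, classify_subtags(example))
```

An implementation that wants to inspect subtags does not need a table
of registered tags to tell a script from a region; the shape of the
subtag is enough under these assumptions.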

As for how the different sub-elements would be used for matching, for
instance in recognizing a relationship between a language-region tag and
a language-script tag, those are issues that already exist with valid
RFC 3066 tags such as zh-CN and zh-Hans. I agree that it is not a
trivial matter to decide exactly how such tags relate. 

That does not, however, change the fact that language tags that
incorporate script IDs are useful and appropriate. In this particular
example, all that was available for tagging Chinese content for some
time were tags like zh-CN and zh-TW, and this was causing very
significant problems for implementations and users. That is precisely
why zh-Hans and zh-Hant have been registered, and why many of us are
eager to see a revision of RFC 3066 that incorporates script IDs.

(Granted, that does not speak to other changes proposed by the draft.)

That situation almost invites
profiling of how this specification should be used in different
circumstances...


I have no particular counter to the opinions you expressed in your
remaining comments.



Peter Constable

