
RE: Last Call on Language Tags (RE: draft-phillips-langtags-08)

2005-01-03 17:55:09


--On Monday, 03 January, 2005 12:29 -0800 Peter Constable
<petercon@microsoft.com> wrote:

From: John C Klensin [mailto:john-ietf@jck.com]

Ignoring whether "that very nearly happened in RFC 3066",
because some of us would have taken exception to inserting a
script mechanism then, let's assume that 3066 can be
characterized as a language-locale standard (with some funny
exceptions and edge cases) and that the new proposal could
similarly be characterized as a language-locale-script
standard

I can see we might run into some terminological hurdles here.
I would decidedly *not* describe RFC 3066 as a "locale"
standard just because it allows for tags that include country
identifiers. I would strongly contend that a "language" tag
and a "locale" ID are different things serving quite different
purposes. But I'll read the rest of your comments assuming
that by "language-locale(-script) standard" you simply mean a
standard for language tags that can include subtags for region
and script.

That is more than close enough for discussion purposes.

If one makes that assumption, then
the (or a) framework for answering the question of what problem
this solves that 3066 does not becomes clear: it meets the need
that arises when a language-locale-script specification is
required.

But that takes us immediately to the comments Ned and I seem
to be making, characterized especially by Ned's "sweet spot"
remark.  It has not been demonstrated that Internet
interoperability generally, and the settings in which 3066 is
now used in particular, require a language-locale-script set of
distinctions.

I disagree. There are many cases in which script distinctions
in language tags have been recognized as being needed; several
such tags have been registered for that reason already under
the terms of RFC 3066, and there are more that would already
have been registered except for the fact that people have been
anticipating acceptance of this proposed revision. (For
instance, in response to recent discussions, a representative
of Reuters has indicated that he was holding off registering
various language tags that include ISO 15924 script IDs on
that basis, and that he plans to do so if this proposed
revision is delayed much longer.)
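
(For a concrete sense of the form at issue: such tags pair a
primary language subtag with an ISO 15924 script code, along the
lines of "sr-Latn" or "zh-Hant" -- offered here purely as
illustrations of the pattern, not as a list of what has or has
not actually been registered.)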

It would be very helpful, to me at least, if you or he could
identify the specific context in which such tags would be used
and are required.  The examples should ideally be of
IETF-standard software, not proprietary products.

The document does not address that issue.

That is probably because those of us who have been
participants in the IETF-language list, where this draft
originated, have become so familiar with the need that it
seems obvious -- evidently, it's not as obvious to people who
have not been as focused on IT-globalization issues as we have.

How nice.  In 2004, I discovered that I had no operational
experience and then that I knew nothing about standardization
processes outside the IETF.  It is now only three days into 2005
and already I've learned that I haven't been focused on "IT
globalization".  I anxiously await the opportunity to find out
what comes next in this sequence :-)
 
Equally important, but just as one example, in the MIME
context (just one use of 3066, but a significant one), we've
got a "charset" parameter as well as a "language" one.
There are some odd new error cases if script is incorporated
into "language" as an explicit component but is not supported
in the relevant "charset".  On the one hand, the document
does not address those issues and that is, IMO, a problem.
But, on the other, no matter how they are addressed, the
level of complexity goes up significantly.
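
(As a purely illustrative sketch of such an error case -- the
particular tag and charset are an invented pairing, not taken
from the draft -- consider a message whose headers declare

    Content-Type: text/plain; charset=ISO-8859-1
    Content-Language: sr-Cyrl

where the language tag asserts Cyrillic script but the declared
charset cannot represent a single Cyrillic character.)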

I don't see how such error cases are significantly different
from current possibilities, such as having a language tag of
"hi" and a charset of ISO 8859-1 (where the content is
actually uses some non-standard encoding for Devanagari).

Since I haven't paid attention to IT globalization and
internationalization issues for the last 20 or 30 years, I
obviously don't know enough about alphabetic equivalency
relationships, the collection of TC 46 transliteration standards
(including, in this case, the possibility that IS 15919 is in
use), and related work to be able to address this question.

One can also raise questions as to whether, if script
specifications are really needed, those should reasonably be
qualifiers or parameters associated with "charset" or
"language" (and which one) rather than incorporated into the
latter.  I don't have any idea what the answer to those
questions ought to be.

Having worked on these particular issues for several years, I
and many others feel we *do* have an idea what the answer to
those questions ought to be -- that script should be
incorporated as a permitted subtag within a language tag.

Good.  See the request for explanation and examples above.
Things that you and your colleagues know, but that aren't in the
draft or some supplemental and equally accessible document, are
really not helpful.

But they are fairly subtle, the document doesn't address
them (at least as far as I can tell), and I see no way to get
to answers to them without a lot more specificity about what
real internetworking or interoperability problem you are
trying to solve.

Some days ago, I made reference to a white paper I wrote a few
years ago that explores the kinds of distinctions that need to
be made in metadata elements declaring linguistic attributes
of information objects. It's long, and there are some details
I'd change, but that may provide a starting point. The people
who have contributed to this draft are all familiar with these
ideas. You can find this paper at
http://www.sil.org/silewp/abstract.asp?ref=2002-003. Granted,
this paper does not go into details regarding specific
implementations.

I've just now skimmed parts of this paper.  It is very
interesting and I look forward to carefully reading the rest of
it.  We are in agreement about your category model.   The only
place where there is a difference is whether, for the purposes
of the IETF and the actual demands on RFC 3066, something else
--and something as complex as I perceive this proposal as
being-- is really needed.   I can, for the record, believe that
this proposal is unnecessary and too complex while also
believing that it is possible to make registrations under the
rules of 3066 that would make quite a mess of things.   We have
tag review processes to prevent just that eventuality.  I can
also believe that 3066 represents a compromise, rather than a
perfect solution to the issues you outline in your paper,
without believing that translates into either a problem that
needs to be solved or a problem that needs to be solved with
this particular proposal.  I've got a fairly open mind on those
subjects; I just believe that the burden of demonstrating that a
major change is needed in a system that appears to be working
is, and should be, fairly high.

Similarly, as we know, painfully, from other
internationalization efforts, the only comparisons that are
easy involve bit-string identity.  Working out, at an
application level, when two "languages" under the 3066 system
are close enough that the differences can be ignored for
practical purposes is quite uncomfortable.   Attempting
similar logic for this new proposal is mind-boggling,
especially if one begins to contemplate comparison of a
language-locale specification with a language-script one -- a
situation that I believe from reading the spec is easily
possible.
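
(To make that concrete with an illustrative pairing, not one
drawn from the draft: deciding whether content tagged "sr-Latn"
should satisfy a request for "sr-CS", or vice versa, is exactly
the sort of judgment that no simple string comparison can make.)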

RFC 3066 makes reference to a fairly simplistic matching
algorithm using the notion of language-range. The proposed
draft would continue to support that same algorithm with an
expectation that implementations of language-range matching as
defined in RFC 3066 would continue to operate using exactly
the same algorithm on new tags permitted by the proposed
revision -- and with generally desirable results. 
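
(A minimal sketch of that prefix rule, for concreteness; the
helper name and the sample tags below are illustrative, not
drawn from the draft or from RFC 3066 itself:

    def range_matches(language_range, language_tag):
        # RFC 3066 language-range matching: a range matches a tag if
        # it equals the tag, or is a prefix of the tag ending at a
        # "-" boundary; "*" matches any tag.  Case-insensitive.
        r, t = language_range.lower(), language_tag.lower()
        return r == "*" or t == r or t.startswith(r + "-")

    range_matches("az", "az-Latn-IR")       # True
    range_matches("az-Latn", "az-Latn-IR")  # True
    range_matches("az-IR", "az-Latn-IR")    # False: the region is no
                                            # longer a prefix once a
                                            # script subtag intervenes

so an existing implementation of the rule keeps operating
unchanged on the longer tags; what it returns depends entirely
on which prefixes are asked for.)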

There may be implementations that use a more complex approach
to matching involving inspection of the tagged content itself,
or inspecting the particular subtags of a language tag.
...

Peter, you are talking, I think, about different applications
doing different things given the greater range of options and
flexibility that the new specification provides.  From my point
of view and experience, every time someone says "well, some
applications may do something else" or "some implementations may
use a more complex approach", what I hear is more potential for
ways in which things won't interoperate, more areas in which
profiles are needed to assure interoperability, and so on.
Whether the interoperability issues show up at a protocol level
or to the user as a violation of the law of least astonishment
makes little difference: such things make the Internet work less
well and should be avoided unless there is a _really_ strong
reason for them.  What I'm trying to probe here are those
reasons.

...

Let me also comment on the ISO 3166 issues here, rather than
starting another note.  For me, there is no question that
3166/MA has made quite a mess of things with a few of their
reuse decisions, most notably the recent assignment of CS to
Serbia and Montenegro.  In the pre-ICANN period, IANA had fairly
well-considered procedures for dealing with code changes, and I
have been appalled that ICANN has sometimes felt a need to
ignore those precedents in favor of believing that it needs to
consider ccTLD changes any time 3166/MA makes a change.   But
the solution to the problem of various ISO TCs not having an
adequate understanding of the impact on the Internet and IT
communities (and, in the case of TC 46, even the
library/information sciences community that is one of their main
historical constituencies) is, IMO, to get that message
across via liaison statements and, if necessary and appropriate,
encouraging national member bodies to cast "no" votes on
standards and registration procedures that are insufficiently
stable.  After the "CS" decision, the statements from the
British Library advocating a much longer time-to-reuse and from
the IAB suggesting that a century might be adequate were, again,
IMO, just the right sort of approach.   In particular, I presume
that TC 37 has an adequate liaison mechanism in place with TC 46
to insist that a much more conservative position be adopted with
regard to changes.  If TC 37 isn't able or inclined to do that
job effectively, I'm not persuaded that shifting the task to the
IETF is an appropriate solution or one that is likely to be
effective.

As I have noted in other contexts, an attitude in the Internet
community that extreme stability in external standards is
critical is not a new development, as evidenced by our continued
use of ANSI/X3.4-1968 as the base reference for "US-ASCII", just
as our response to some incompatible changes in Unicode between
3.2 and 4.0 has been to freeze some things at 3.2.  Our solution
has not been to try to create IETF standards to work around the
stability issues in ISO (or other) standards.  Down that path
generally lies madness.  If it is really necessary --i.e., there
are no other practical alternatives and we have the needed
expertise-- then I think we should consider it, but that case
has, IMO, not yet been made here.

My apologies but, since the Last Call is closing and there is
supposed to be a -09 coming, I don't believe that it is useful
to continue this discussion much further until the IESG has made
some decisions about what should be done next and told the
community about them.

    john

