From: ietf-languages-bounces(_at_)alvestrand(_dot_)no [mailto:ietf-languages-
bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly
It was removed in the development of RFC 3066, which was appropriate
because it was a particular application involving language tags;
We may be in danger of confusing terminology; the Content-Language
field is a means of specifying the language of a (part of) a
document, using a language-tag in the context of the MIME protocols.
"The World Wide Web" is an application.
The *use* of RFC 3066 language tags in a particular field of some protocol is
an application of RFC 3066. That is what I meant.
Separating the specification
of language via a field from registration procedure was entirely
appropriate, as BCP documents are used for procedures and policies
and not for technical specifications.
I'm not as deeply versed in the distinct purposes of distinct kinds of
documents used for Internet specifications as you. It's my understanding that
there is a general expectation that BCP documents are used for something like a
registration procedure but not for a technical specification. Joel Halpern
recently reported to me that he provided similar feedback on the draft to Harald
Alvestrand some time ago, that Harald responded with reasons why things were
mixed in this case, and that the two of them concluded that mixing these in a
BCP document in this case was acceptable.
We already anticipate future revisions, and it would be a possibility to
consider whether a division of the content into distinct documents of different
types would be better. In view of the time already taken, the delays incurred,
and the fact that there have been products in development that have been
assuming the completion of this current round of revision, I think it makes
best sense to allow the mixing that has existed since RFC 1766 to persist in a
BCP for this round.
So I take it that you agree that the technical specification of
matching algorithms should also be separated from the tag registration
procedure?
Again, apart from knowing the preferred divisions of content into different
types of IETF documents, I don't think there's a problem in describing one type
of matching algorithm in this document provided it is recognized that some
applications may require different algorithms. And, again, I don't think it
would be particularly helpful to delay completion of this revision any further
to address an issue of mixing different kinds of content in a BCP that has
existed through two versions already; if it's a serious problem, it can be
remedied in the next, already-anticipated revision.
The *meaning* of any given language tag would be no more or less a
problem under the proposed revision than it was for RFC 3066 or RFC 1766...
That's a somewhat different take on the issue; certainly the ability
to use a generative mechanism (i.e. w/o review/registration of an
entire tag) can lead to a proliferation of incompatible uses by
independent generators (and possibly loss of interoperability as a
result). The draft under discussion would expand use of generative
mechanisms to encompass all but private-use tags, and thereby expands
the potential for such incompatibilities and loss of interoperability.
That was an issue I initially voiced when it was first suggested that the
registry be a registry of subtags rather than tags. In practice, I'm not sure
at this point that there's really a significantly greater problem with the new
level of generativity than there was before. The reason is that the new
elements have quite specific semantic effects on the whole, whereas the
semantic impact of a region ID on the whole is less certain: it may imply
dialectal variants, spelling variants; it might actually reflect nothing but
simply have been inserted because it could. In contrast, there is little
question of what effect a script ID such as "Hans" has on the whole.
Under the proposed draft, anybody may legally generate
a tag such as
sr-Latn-CS-gaulish-boont-guoyu-i-enochian
or
sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
with *no* specific registration requirements (i.e. all components
are either registered or require no registration). In the latter
case, a parser can only determine that it contains a private-use
subtag after wading through the other subtags. In either case,
it is difficult (to say the least) for the recipient or his
software to determine what the generator of that tag intended to
convey.
I've shown that this is no different in general than what already exists
for RFC 3066 or RFC 1766.
It is certainly different; under RFC 3066 rules such a tag (as a whole)
would be subject to review and registration.
You're describing a pathological case that is never going to occur, which I
don't think is particularly helpful. If the meanings of "guoyu" and of "boont"
are clearly documented, it will be evident that something like "sr-boont-guoyu"
is basically as meaningless as "sr-guoyu" or as "pl-Hant-TH".
You might respond that "pl-Hant-TH" is at least conceivable whereas "sr-guoyu"
is oxymoronic, but it's just as useless as long as it has no correlation with
the real world. I do not think it should be a requirement of a language tag
specification to constrain combinations that are not useful, inconsistent with
reality, or logically impossible. Likewise, it really isn't necessary IMO to
insist on registration of complete tags vs. subtags just to avoid tags that are
not useful, inconsistent with reality, or logically impossible.
No, you seem to have missed the point; there exist RFC 3066
implementations. Such implementations, using the RFC 3066 rules,
could match something like "sr-CS-Latn" to "sr-CS", but could
not match "sr-Latn-CS" to "sr-CS". By changing the definition of
the interpretation of the second subtag, the proposed draft fails
to be compatible with existing deployed implementations (which is
what is meant by "backwards compatibility", which is a prime
consideration for Internet protocols).
Ah, but RFC 3066 does not sanction use of tags like "sr-CS-Latn" without
registration, and no such tags are registered.
Because of the prevalence of implementations that use a left-prefix matching
algorithm, it is more useful to combine elements in the order "sr-Latn-CS"
rather than "sr-CS-Latn". If "sr-CS-Latn" were used, these implementations
would fail to match "sr-CS-Latn" with "sr-Latn", which is actually a greater
problem than failing to match "sr-Latn-CS" with "sr-CS".
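
To make the matching behaviour concrete, here is a minimal sketch of left-prefix matching as I understand RFC 3066 to describe it: a range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-". The function name prefix_match is hypothetical and is not taken from any deployed implementation.

```python
def prefix_match(language_range: str, tag: str) -> bool:
    """Left-prefix match of a language range against a language tag.

    Assumption: comparison is case-insensitive and "*" matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    # Exact match, or the range is a prefix ending on a subtag boundary.
    return t == r or t.startswith(r + "-")


for rng, tag in [
    ("sr-Latn", "sr-Latn-CS"),
    ("sr-CS", "sr-Latn-CS"),
    ("sr-Latn", "sr-CS-Latn"),
    ("sr-CS", "sr-CS-Latn"),
]:
    print(rng, tag, prefix_match(rng, tag))
```

Whichever order is chosen, one of the two ranges fails to match under this scheme; the question is only which failure matters more.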
At this point, I feel confident that it is not a problem to combine
script IDs into "language" tags, and this is the consensus of the domain
experts that have been discussing this proposed revision for the past year
and more.
Evidently w/o considering the implications of and for core Internet
protocols.
You assume this is w/o such consideration. I think otherwise. I can't say that
consideration has been given to every single individual protocol. But
consideration has been given to many different protocols and usage scenarios. I
think it's appropriate for the onus to be on someone to identify particular
problems they feel would exist with protocols that concern them (which is
precisely the kind of thing we have last-call announcements for).
If script *can be* specified in a language tag *between*
the language code and country code, then a parser must be able to
recognize that case and deal with it appropriately (which, as noted
above, existing RFC 3066 implementations in deployed use do not and
cannot do) at *any* time and in any context (context may not be
available when a Content-Language field is parsed).
As described above, I think this argument is invalid.
I don't have an
issue with provision for specification of script where appropriate,
but for crying out loud, at least do so in a compatible manner (e.g.
a Content-Script field) rather than a) breaking compatibility with
deployed protocols and b) burdening applications which need not be
concerned with script from having to parse script information.
I've stated that the imputed back-compat problem is a non-issue. Lots of
consideration was given early on to this. If you want to press this argument, I
think you need to show exactly how a problem would result in realistic usage
scenarios.
Can you identify for us an Internet protocol that would not be concerned with
script distinctions?
Can you identify an Internet protocol for which matching algorithms imply that
"sr-CS-Latn" makes better sense than "sr-Latn-CS"?
There is a clear need for script codes...
But none of that applies to an audio file of spoken material,
where script would be superfluous...
Not a problem: the proposed revision *allows* for the use of script IDs
but does not require them.
Yes, it's a problem. Having allowed them, each parser must be able
to handle them.
Look, they're already there in registered tags. This draft isn't doing anything
new in that regard.
and, as noted above, would
lead to loss of backwards compatibility.
But, as noted above, this is not an issue that is peculiar to the
proposed revision -- it already existed in RFC 3066.
No, given a primary subtag which is a language code (and per RFCs
1766 and 3066, that's any primary subtag with 2 or more (RFC 3066
only, more being limited to 3) characters), the second subtag --
in either RFC 1766 or RFC 3066 language tags -- is always a country
code and never a script code.
Go back and read RFC 3066 again. It does not impose that constraint:
<quote>
The following rules apply to the second subtag:
- All 2-letter subtags are interpreted as ISO 3166 alpha-2 country
codes from [ISO 3166], or subsequently assigned by the ISO 3166
maintenance agency or governing standardization bodies, denoting
the area to which this language variant relates.
- Tags with second subtags of 3 to 8 letters may be registered with
IANA, according to the rules in chapter 5 of this document.
</quote>
It must be a country ID *if* it is two letters, but not otherwise.
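
As a rough illustration of the two quoted rules only, a parser could classify the second subtag as sketched below. The function name second_subtag_kind and the labels it returns are my own, and anything the quoted text does not cover is simply reported as "other".

```python
def second_subtag_kind(tag: str) -> str:
    """Classify the second subtag according to the two rules quoted above.

    Returns "none" if there is no second subtag, "country" for a
    2-letter subtag, "registered" for a 3-to-8-letter subtag, and
    "other" for anything the quoted rules do not cover.
    """
    subtags = tag.split("-")
    if len(subtags) < 2:
        return "none"
    second = subtags[1]
    letters_only = second.isascii() and second.isalpha()
    if letters_only and len(second) == 2:
        return "country"      # interpreted as an ISO 3166 alpha-2 code
    if letters_only and 3 <= len(second) <= 8:
        return "registered"   # valid only as part of an IANA-registered tag
    return "other"


print(second_subtag_kind("sr-CS"))
print(second_subtag_kind("az-Latn"))
```

A parser written this way already has to cope with a second subtag that is not a country code.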
The proposed draft pulls the rug out
from under existing parsers by changing that.
You are completely mistaken on this point -- the proposed draft does not change
the constraint you assumed as that constraint never existed.
Again you seem to be conflating established Internet Standards Track
protocols with "applications"
I apparently am using "applications" in a sense you're not familiar with. I
don't think it's that uncommon to refer to a specification A that makes use of
another specification B as an application of B.
and ignoring the critical importance of
backwards compatibility.
As stated earlier, I quite disagree that back-compat issues have been ignored.
Note that there is nothing that prevents other applications from using
other matching algorithms, including perhaps something that is able to
recognize in "az-AZ" and "az-Latn-AZ" that both involve Azeri as used in
Azerbaijan.
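
One possible shape for such an alternative matcher, purely as a sketch: pull out the language subtag and the first 2-letter subtag after it, and compare those while ignoring script. The helper names are hypothetical, and the assumption that the region is the first 2-letter subtag before any single-character subtag is mine, not something the draft's matching text specifies.

```python
from typing import Optional, Tuple


def language_and_region(tag: str) -> Tuple[str, Optional[str]]:
    """Return (language, region) for a tag, ignoring script and variants.

    Assumption: the region is the first 2-letter subtag after the
    language, and scanning stops at a single-character subtag such
    as "x", which introduces private-use material.
    """
    subtags = tag.lower().split("-")
    language = subtags[0]
    region = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break
        if len(subtag) == 2 and subtag.isalpha():
            region = subtag
            break
    return language, region


def same_language_and_region(a: str, b: str) -> bool:
    """True if both tags name the same language and the same region."""
    return language_and_region(a) == language_and_region(b)


print(same_language_and_region("az-AZ", "az-Latn-AZ"))
print(same_language_and_region("sr-CS", "sr-Latn-CS"))
```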
The issue at hand is the existing deployed base of RFC 3066
implementations that depend on the matching algorithm specified
therein (which doesn't work with a script tag interposed between
language code and country code).
You say that these do not work; these implementations will still work, but they
will match "sr-Latn" but not "sr-CS" with "sr-Latn-CS". If that is a problem,
please explain why.
This is all a discussion we on the IETF-languages list went through five
years ago, and in the intervening five years I think we have reached a
consensus on these issues, that consensus being reflected in the proposed
revision to RFC 3066. (Note that we made the relevant decisions over a
year and a half ago when we reached a consensus to register az-Latn etc.
The precedent was established then; the proposed revision adds nothing new
in this regard.)
As previously noted, that is a danger recognized by RFC 2026 in
activity that does not conform to IETF procedures; it is
possible to reach good consensus on the wrong approach.
Well, that potential was created when RFC 1766 was first approved. Tags like
az-Latn could have been registered under the terms of that RFC just as readily
as RFC 3066.
But you are speaking as though it's a problem that these tags are registered. I
have no idea why.
7.1 says...
The proposed revision does not create Internet-specific versions of ISO
standards...
By cherry-picking, it effectively seeks to establish such a version.
I would not call what is done "cherry-picking". Any identifier defined in the
source standard is valid for use, except in the case that the identifier was
previously defined with a different meaning in that ISO standard. That isn't
cherry-picking; that is a blindly-applied general principle, created with
reasoned motivation: to provide stability.
But speaking of selective usage, have you noticed that RFC 3454 identifies
specific characters from ISO/IEC 10646 as prohibited? Various space and control
characters are not permitted, INVISIBLE TIMES isn't permitted, END OF AYAH
isn't permitted, COMBINING GRAVE TONE MARK isn't permitted... How is what is
proposed in this draft any more "cherry-picking" than that?
10.1 states a general policy regarding IP...
The ISO, as developers of ISO 639 and 3166, have rights. In particular,
they have the right to determine what those standards specify -- in
whole -- and they have the right to revise and amend those standards,
and are the sole arbiters of what is (and what is not) "valid".
They certainly have and retain rights over standards for language, script and
country identifiers. They do not, however, determine what is valid for use in
Internet protocols. Just as it is appropriate for an IETF document, RFC 3454, to
specify for particular reasons that certain encoded entities of ISO/IEC 10646
are not valid for Stringprep output, so also it is appropriate for an IETF
document to specify for particular reasons that certain encoded entities of an
ISO standard are not valid for use in language tags used on the Internet.
Peter Constable