From: ietf-languages-bounces(_at_)alvestrand(_dot_)no [mailto:ietf-languages-bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly
So why not then also throw in the closely linked specification of
the Content-Language field, which has historically been in the same
document (RFC 1766)?
It was removed in the development of RFC 3066, which was appropriate because it
was a particular application involving language tags; other applications exist,
and other applications may use different approaches for how matching should be
done.
No, the revision clearly expands the scope of language
distinctions that can be represented with a language tag--quite
significantly in some cases.
Indeed, and without registration of the tags and the review process
associated with that (existing RFC 3066) registration procedure. As
Harald Alvestrand pointed out some time ago, that (inappropriately)
shifts implementation effort from the tag generator (no registration
required) to the recipient (what the heck does this mysterious tag
actually *mean*).
The *meaning* of any given language tag would be no more or less a problem
under the proposed revision than it was for RFC 3066 or RFC 1766. For instance,
there is a concurrent thread that has been discussing when country distinctions
are appropriate or recommended ("ca" or "ca-ES"?); this discussion pertains to
RFC 3066, and part of the issue is that meanings of tags are implied rather
than specified -- and always have been even under RFC 1766 (I pointed this out
five years ago when we were working on preparing RFC 3066).
So, for instance, when an author uses "de-CH", what does he intend recipients
to understand to be the difference between that and "de-DE" or even "de"?
Neither RFC 1766 nor RFC 3066 sheds any light on this, and ultimately only the
author knows for sure.
Under RFC 3066, it was the *exceptional* case that a complete tag was
registered, allowing some indication of the meaning of the whole (though even
in that regard nothing really required that the documentation provide clear
indication of the meaning). The 98% cases were those like "de-CH" in which it
was assumed that everyone would understand what the intended meaning is.
Under the proposed draft, anybody may legally generate
a tag such as
sr-Latn-CS-gaulish-boont-guoyu-i-enochian
or
sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
with *no* specific registration requirements (i.e. all components
are either registered or require no registration). In the latter
case, a parser can only determine that it contains a private-use
subtag after wading through the other subtags. In either case,
it is difficult (to say the least) for the recipient or his
software to determine what the generator of that tag intended to
convey.
I've shown that this is no different in general than what already exists for
RFC 3066 or RFC 1766. And I think we can all agree that there's much less
likelihood of someone generating sr-Latn-CS-gaulish-boont-guoyu-i-enochian than
there is of someone generating something like pl-AZ. So, I suggest that we not
dwell on pathological cases that we aren't really likely to encounter.
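
For reference, the left-to-right scan that the quoted objection describes can be sketched in a few lines of Python. This is only an illustration: the function name find_private_use is hypothetical, and the sketch assumes that a single-character subtag "x" introduces the private-use subtags and that everything after it belongs to them.

```python
def find_private_use(tag):
    """Return the private-use subtags of `tag`, or [] if there are none.

    Assumption: a single-character subtag "x" introduces the private-use
    sequence, and every subtag after it belongs to that sequence.
    """
    subtags = tag.lower().split("-")
    # Walk the subtags left to right; there is no shortcut to the "x".
    for i, subtag in enumerate(subtags):
        if subtag == "x":
            return subtags[i + 1:]
    return []
```

Applied to the two tags quoted above, this returns ["foo"] for the second and an empty list for the first. It locates the private-use part but says nothing about what the tag as a whole is meant to convey.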
A recipient using software that interprets RFC 3066
tags isn't going to be able to do anything useful with any
hypothetical tag which contains a script subtag that would be
produced under the draft rules (if the script subtag were to appear
*after* the region subtag, one could at least match "sr-CS-Latn"[...]
to "sr-CS", which an RFC 3066 parser could handle.
This would be no more or less true of registered tags like "az-Latn-AZ", for
which registration requests were submitted but those were postponed (by the
submitter withdrawing the request) until details for RFC3066bis were worked
out. Again, the concerns you are raising in relation to the proposed
replacement of RFC 3066 apply equally to RFC 3066 itself.
It's not entirely clear if some of those items (e.g. script) should
be expressed by an orthogonal mechanism rather than embedded in a
*language* tag (for that matter, in retrospect, country codes was
probably a bad idea).
Of course it would not be clear if you don't have a conceptual model of what
"language" tags are identifiers *of*. When RFC 3066 was being developed, there
was a suggestion that script IDs be incorporated, but some were reluctant,
raising the same question you have here. I was one of those. But I didn't
remain obstructionist over the issue; instead, I gave a fair amount of thought
to the ontology that underlies "language" tags, and subsequently published a
white paper and presented on the topic at two conferences in the spring and
fall of 2002. (Paper is available online at
http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has evolved
since then, but some key results remain valid, I think.)
At this point, I feel confident that it is not a problem to combine script IDs
into "language" tags, and this is the consensus of the domain experts that have
been discussing this proposed revision for the past year and more.
There is a clear need for script codes...
But none of that applies to an audio file of spoken material,
where script would be superfluous...
Not a problem: the proposed revision *allows* for the use of script IDs but
does not require them. In the case of audio content, one simply would never
include a script ID.
and, as noted above, would
lead to loss of backwards compatibility.
But, as noted above, this is not an issue that is peculiar to the proposed
revision -- it already existed in RFC 3066.
The bigger problem you're pointing out is the limitations of using
suffix-truncation alone as a matching algorithm. In the discussion following
the registration request for de-1996, etc., there was some discussion as to
whether de-1996-DE format or de-DE-1996 format was preferable, and in the
course of that discussion it was mentioned that sometimes the 1901 vs 1996
spelling differences would be more important than the regional dialect
differences, but in other situations the regional differences would be more
important than the spelling. But the problem with prefix matching used e.g. for
Accept-Language is that only one of these two can be supported. That is a
shortcoming in that application.
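
To make the limitation concrete, here is a minimal Python sketch of prefix matching in the sense of the RFC 3066 "language range": a range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-". The function name prefix_match is hypothetical, and the "*" wildcard is left out.

```python
def prefix_match(language_range, tag):
    """RFC 3066-style match: the range equals the tag, or is a prefix of it
    that ends exactly at a "-" boundary. Comparison is case-insensitive."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


# Variant before region: only a range on the spelling reform matches.
print(prefix_match("de-1996", "de-1996-DE"))  # True
print(prefix_match("de-DE", "de-1996-DE"))    # False
# Region before variant: only a range on the region matches.
print(prefix_match("de-DE", "de-DE-1996"))    # True
print(prefix_match("de-1996", "de-DE-1996"))  # False
# The script examples discussed earlier behave the same way.
print(prefix_match("sr-CS", "sr-CS-Latn"))    # True
print(prefix_match("sr-CS", "sr-Latn-CS"))    # False
```

Whichever subtag comes first is the only one a shorter range can select on; the order of the subtags decides which distinction prefix matching can see.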
Note that there is nothing that prevents other applications from using other
matching algorithms, including perhaps something that is able to recognize in
"az-AZ" and "az-Latn-AZ" that both involve Azeri as used in Azerbaijan.
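
One possible shape for such a matcher, sketched in Python, compares tags component by component instead of by prefix. The helper names components and same_language_and_region are hypothetical. The sketch assumes subtags can be classified by shape alone: the first subtag is the language, a four-letter alphabetic subtag directly after it is a script, and a two-letter alphabetic or three-digit subtag is a region. All other subtags are ignored.

```python
def components(tag):
    """Split a tag into (language, script, region), classifying by shape.

    Assumptions: the first subtag is the language; a four-letter alphabetic
    subtag seen before any region is a script; a two-letter alphabetic or
    three-digit subtag is a region. Everything else is ignored.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = region = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break  # singleton (extension or private use): stop classifying
        if (script is None and region is None
                and len(subtag) == 4 and subtag.isalpha()):
            script = subtag.title()
        elif region is None and (
                (len(subtag) == 2 and subtag.isalpha())
                or (len(subtag) == 3 and subtag.isdigit())):
            region = subtag.upper()
    return language, script, region


def same_language_and_region(a, b):
    """True if both tags name the same language and the same region,
    regardless of whether either one also carries a script subtag."""
    lang_a, _, region_a = components(a)
    lang_b, _, region_b = components(b)
    return lang_a == lang_b and region_a == region_b
```

With this sketch, same_language_and_region("az-AZ", "az-Latn-AZ") is true even though neither tag is a prefix of the other. Whether an application should then treat the two as interchangeable is a separate policy decision.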
Surely some types
of script is indicated by the charset; in situations where that
is not the case, a separate mechanism could be used for that
orthogonal parameter without breaking compatibility with
existing parsers of language tags.
This is all a discussion we on the IETF-languages list went through five years
ago, and in the intervening five years I think we have reached a consensus on
these issues, that consensus being reflected in the proposed revision to RFC
3066. (Note that we made the relevant decisions over a year and a half ago when
we reached a consensus to register az-Latn etc. The precedent was established
then; the proposed revision adds nothing new in this regard.)
Does the ISO not set ground rules for the 3166/MA? Could it not
specify that codes are not to be reused?
No, ISO does not. The ground rules for the ISO 3166/MA are established in ISO
3166. I don't have the current version immediately at hand, but I believe the
ground rules it specified were simply that something not be re-used for at
least five years after it has been withdrawn. The re-assignment of CS made
several parties very upset, and I note that the CD for the revision to ISO
3166-1 which is in progress has upped this to 50 years, and added a clause
saying, "Before reallocating... the ISO 3166/MA shall consult, as appropriate,
the authority or agency on whose behalf the code element was
reserved and consideration shall be given to difficulties which might arise
from the reallocation" -- nothing about consulting other users.
Matching hasn't actually changed...
Do you not see the contradiction between "one should not expect to
receive anything less specific" vs. "may receive less specific
content"?
There is no substantive change from RFC 3066. RFC 3066 happened to mention one
particular matching approach used in one application (HTTP), in relation to
which it defined "language range"; but there is no question that there are
different approaches to matching used in different applications, some of which
may well involve receiving content the linguistic properties of which are not
within the specific properties requested; and besides, the proposed revision
retains the exact same definition for "language range" (for the sake of
whatever applications may use that notion).
Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
Note that RFC 3066 strictly complies with those sections, while
the draft under discussion, by cherry-picking from ISO lists
for which change control has not been transferred to the IESG,
does not.
7.1 says,
<quote>
To avoid conflict between competing versions of a specification, the
Internet community will not standardize a specification that is
simply an "Internet version" of an existing external specification
unless an explicit cooperative arrangement to do so has been made.
However, there are several ways in which an external specification
that is important for the operation and/or evolution of the Internet
may be adopted for Internet use.
</quote>
The proposed revision does not create Internet-specific versions of ISO
standards; it uses IDs drawn from ISO standards with semantics defined in those
source standards at the time they were adopted for use in language tags -- the
source for the IDs, the symbols and their meanings all reside in the ISO
standards. The fact that not all are used, or that some are used as they were
specified in a dated version of the ISO standard, is not in contradiction with 7.1
-- it's just one of "several ways in which an external specification... may be
adopted."
7.1.1 simply says that an open external standard may be incorporated merely by
reference. There is no requirement here that is not met by the proposed
revision.
7.1.3 simply says that an Internet specification may be an adaptation of an
external specification provided certain conditions are met. Neither RFC 3066 nor
the proposed revision is an adaptation of any existing external specification,
so this is not applicable.
10.1 states a general policy regarding IP:
<quote>
In all matters of intellectual property rights and procedures, the
intention is to benefit the Internet community and the public at
large, while respecting the legitimate rights of others.
</quote>
Again, there is no requirement stated here that is not met by the proposed
revision. Clearly, the intent of the proposed draft is to benefit the Internet
community and the public at large. There are no rights of others that are in
any way violated by the proposed revision.
Thus, I see no difference between RFC 3066 and this proposed revision in
relation to compliance with the sections of RFC 2026 you referred to.
Agreed. But the activity on the ietf-languages list regarding the
draft under discussion isn't an IETF process -- there is no WG or
Chair, no charter, etc. Like the fictional Topsy, it jes' growed up.
RFC 3066 was developed in exactly the same manner as this proposed revision has
been developed -- as an internet draft prepared by a member of the
IETF-languages list and processed among members of that list until it was
submitted for last call and subsequent IESG action.
Peter Constable