
RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

2004-12-30 05:30:48
From: ietf-languages-bounces(_at_)alvestrand(_dot_)no [mailto:ietf-languages-
bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly


So why not then also throw in the closely linked specification of
the Content-Language field, which has historically been in the same
document (RFC 1766)?

It was removed in the development of RFC 3066, which was appropriate because it 
was a particular application involving language tags; other applications exist, 
and other applications may use different approaches for how matching should be 
done.



No, the revision clearly expands the scope of language
distinctions that can be represented with a language tag--quite
significantly in some cases.

Indeed, and without registration of the tags and the review process
associated with that (existing RFC 3066) registration procedure. As
Harald Alvestrand pointed out some time ago, that (inappropriately)
shifts implementation effort from the tag generator (no registration
required) to the recipient (what the heck does this mysterious tag
actually *mean*).

The *meaning* of any given language tag would be no more or less a problem 
under the proposed revision than it was for RFC 3066 or RFC 1766. For instance, 
there is a concurrent thread that has been discussing when country distinctions 
are appropriate or recommended ("ca" or "ca-ES"?); this discussion pertains to 
RFC 3066, and part of the issue is that meanings of tags are implied rather 
than specified -- and always have been even under RFC 1766 (I pointed this out 
five years ago when we were working on preparing RFC 3066).

So, for instance, when an author uses "de-CH", what does he intend recipients 
to understand to be the difference between that and "de-DE" or even "de"? 
Neither RFC 1766 nor RFC 3066 sheds any light on this, and ultimately only the 
author knows for sure.

Under RFC 3066, it was the *exceptional* case that a complete tag was 
registered, allowing some indication of the meaning of the whole (though even 
in that regard nothing really required that the documentation provide clear 
indication of the meaning). The 98% cases were those like "de-CH" in which it 
was assumed that everyone would understand what the intended meaning is.



Under the proposed draft, anybody may legally generate
a tag such as
  sr-Latn-CS-gaulish-boont-guoyu-i-enochian
or
  sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
with *no* specific registration requirements (i.e. all components
are either registered or require no registration). In the latter
case, a parser can only determine that it contains a private-use
subtag after wading through the other subtags.  In either case,
it is difficult (to say the least) for the recipient or his
software to determine what the generator of that tag intended to
convey.
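
To make the parsing point concrete, here is a minimal sketch of locating the private-use part of such a tag. It assumes only what the example above shows, namely that a single-letter "x" subtag introduces the private-use subtags; the function name split_private_use is made up for illustration and is not taken from the draft.

```python
def split_private_use(tag):
    """Split a tag into (ordinary subtags, private-use subtags).

    Assumption for illustration: the first single-letter "x" subtag
    introduces the private-use part, as in the example tag above.
    """
    subtags = tag.split("-")
    # Every subtag has to be inspected in order until "x" is found.
    for i, sub in enumerate(subtags):
        if sub.lower() == "x":
            return subtags[:i], subtags[i + 1:]
    return subtags, []


ordinary, private = split_private_use(
    "sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo")
print(len(ordinary), private)
```

The scan is linear in the number of subtags, which is the "wading through" described above.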

I've shown that this is no different in general than what already exists for 
RFC 3066 or RFC 1766. And I think we can all agree that there's much less 
likelihood of someone generating sr-Latn-CS-gaulish-boont-guoyu-i-enochian than 
there is of someone generating something like pl-AZ. So, I suggest that we not 
dwell on pathological cases that we aren't really likely to encounter.


A recipient using software that interprets RFC 3066
tags isn't going to be able to do anything useful with any
hypothetical tag which contains a script subtag that would be
produced under the draft rules (if the script subtag were to appear
*after* the region subtag, one could at least match "sr-CS-Latn"[...]
to "sr-CS", which an RFC 3066 parser could handle.

This would be no more or less true of registered tags like "az-Latn-AZ", for 
which registration requests were submitted but those were postponed (by the 
submitter withdrawing the request) until details for RFC3066bis were worked 
out. Again, the concerns you are raising in relation to the proposed 
replacement of RFC 3066 apply equally to RFC 3066 itself.



It's not entirely clear if some of those items (e.g. script) should
be expressed by an orthogonal mechanism rather than embedded in a
*language* tag (for that matter, in retrospect, country codes was
probably a bad idea).

Of course it would not be clear if you don't have a conceptual model of what 
"language" tags are identifiers *of*. When RFC 3066 was being developed, there 
was a suggestion that script IDs be incorporated, but some were reluctant, 
raising the same question you have here. I was one of those. But I didn't 
remain obstructionist over the issue; instead, I gave a fair amount of thought 
to the ontology that underlies "language" tags, and subsequently published a 
white paper and presented on the topic at two conferences in the spring and 
fall of 2002. (Paper is available online at 
http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has evolved 
since then, but some key results remain valid, I think.) 

At this point, I feel confident that it is not a problem to combine script IDs 
into "language" tags, and this is the consensus of the domain experts that have 
been discussing this proposed revision for the past year and more.


There is a clear need for script codes...

But none of that applies to an audio file of spoken material,
where script would be superfluous...

Not a problem: the proposed revision *allows* for the use of script IDs but 
does not require them. In the case of audio content, one simply would never 
include a script ID.



and, as noted above, would
lead to loss of backwards compatibility.

But, as noted above, this is not an issue that is peculiar to the proposed 
revision -- it already existed in RFC 3066.

The bigger problem you're pointing out is the limitations of using 
suffix-truncation alone as a matching algorithm. In the discussion following 
the registration request for de-1996, etc., there was some discussion as to 
whether de-1996-DE format or de-DE-1996 format was preferable, and in the 
course of that discussion it was mentioned that sometimes the 1901 vs 1996 
spelling differences would be more important than the regional dialect 
differences, but in other situations the regional differences would be more 
important than the spelling. But the problem with prefix matching used e.g. for 
Accept-Language is that only one of these two can be supported. That is a 
shortcoming in that application. 
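
As a rough illustration of that limitation, here is a minimal sketch of the prefix matching that RFC 3066 describes for language ranges: a range matches a tag if it equals the tag, or equals a prefix of the tag that is followed by "-". The function name prefix_match is made up for this example.

```python
def prefix_match(language_range, tag):
    """Return True if the range matches the tag by prefix.

    A range matches when it equals the tag, or equals a prefix of the
    tag that is immediately followed by "-". "*" matches any tag.
    Comparison is case-insensitive.
    """
    rng = language_range.lower()
    t = tag.lower()
    return rng == "*" or t == rng or t.startswith(rng + "-")


# With the region first, only the region can act as a prefix.
print(prefix_match("de-DE", "de-DE-1996"))    # True
print(prefix_match("de-1996", "de-DE-1996"))  # False

# With the orthography first, only the orthography can.
print(prefix_match("de-1996", "de-1996-DE"))  # True
print(prefix_match("de-DE", "de-1996-DE"))    # False
```

Whichever order is chosen, a request for the other distinction falls back to matching plain "de".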

Note that there is nothing that prevents other applications from using other 
matching algorithms, including perhaps something that is able to recognize in 
"az-AZ" and "az-Latn-AZ" that both involve Azeri as used in Azerbaijan.
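
One way such a matcher might look, purely as a sketch: it picks out the language and region by the shape of each subtag and ignores everything else. The shape rule used here (a two-letter alphabetic subtag after the first is a region) and both function names are assumptions for illustration, not something either RFC 3066 or the draft prescribes.

```python
def language_and_region(tag):
    """Extract (language, region) from a tag, ignoring other subtags.

    Assumption for illustration: the first subtag is the language, and
    the first later two-letter alphabetic subtag is the region. Scanning
    stops at a single-letter subtag such as "x".
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    region = None
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # singleton: what follows is not a region
        if len(sub) == 2 and sub.isalpha():
            region = sub.upper()
            break
    return language, region


def same_language_and_region(a, b):
    """True if both tags name the same language and the same region."""
    return language_and_region(a) == language_and_region(b)


print(same_language_and_region("az-AZ", "az-Latn-AZ"))  # True
print(same_language_and_region("az-AZ", "az-Latn-IR"))  # False
```

A matcher along these lines treats script as one more dimension to compare or ignore, rather than relying on its position in the tag.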



Surely some types
of script is indicated by the charset; in situations where that
is not the case, a separate mechanism could be used for that
orthogonal parameter without breaking compatibility with
existing parsers of language tags.

This is all a discussion we on the IETF-languages list went through five years 
ago, and in the intervening five years I think we have reached a consensus on 
these issues, that consensus being reflected in the proposed revision to RFC 
3066. (Note that we made the relevant decisions over a year and a half ago when 
we reached a consensus to register az-Latn etc. The precedent was established 
then; the proposed revision adds nothing new in this regard.)


Does the ISO not set ground rules for the 3166/MA?  Could it not
specify that codes are not to be reused?

No, ISO does not. The ground rules for the ISO 3166/MA are established in ISO 
3166. I don't have the current version immediately at hand, but I believe the 
ground rules it specified were simply that something not be re-used for at 
least five years after it has been withdrawn. The re-assignment of CS made 
several parties very upset, and I note that the CD for the revision to ISO 
3166-1 which is in progress has upped this to 50 years, and added a clause 
saying, "Before reallocating... the ISO 3166/MA shall consult, as appropriate, 
the authority or agency on whose behalf the code element was
reserved and consideration shall be given to difficulties which might arise 
from the reallocation" -- nothing about consulting other users. 


Matching hasn't actually changed...

Do you not see the contradiction between "one should not expect to
receive anything less specific" vs. "may receive less specific
content"?

There is no substantive change from RFC 3066. RFC 3066 happened to mention one 
particular matching approach used in one application (HTTP), in relation to 
which it defined "language range"; but there is no question that there are 
different approaches to matching used in different applications, some of which 
may well involve receiving content the linguistic properties of which are not 
within the specific properties requested; and besides, the proposed revision 
retains the exact same definition for "language range" (for the sake of 
whatever applications may use that notion).

 
Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
Note that RFC 3066 strictly complies with those sections, while
the draft under discussion, by cherry-picking from ISO lists
for which change control has not been transferred to the IESG,
does not.

7.1 says,

<quote>
To avoid conflict between competing versions of a specification, the
   Internet community will not standardize a specification that is
   simply an "Internet version" of an existing external specification
   unless an explicit cooperative arrangement to do so has been made.
   However, there are several ways in which an external specification
   that is important for the operation and/or evolution of the Internet
   may be adopted for Internet use.
</quote>

The proposed revision does not create Internet-specific versions of ISO 
standards; it uses IDs drawn from ISO standards with semantics defined in those 
source standards at the time they were adopted for use in language tags -- the 
source for the IDs, the symbols and their meanings all reside in the ISO 
standards. The fact that not all are used, or that some are used as they were 
specified in a dated version of the ISO standard, is not in contradiction with 7.1 
-- it's just one of "several ways in which an external specification... may be 
adopted."


7.1.1 simply says that an open external standard may be incorporated merely by 
reference. There is no requirement here that is not met by the proposed 
revision.

7.1.3 simply says that an Internet specification may be an adaptation of an 
external specification provided certain conditions are met. Neither RFC 3066 nor 
the proposed revision is an adaptation of any existing external specification, 
so this is not applicable.

10.1 states a general policy regarding IP: 

<quote>
In all matters of intellectual property rights and procedures, the
   intention is to benefit the Internet community and the public at
   large, while respecting the legitimate rights of others.
</quote>

Again, there is no requirement stated here that is not met by the proposed 
revision. Clearly, the intent of the proposed draft is to benefit the Internet 
community and the public at large. There are no rights of others that are in 
any way violated by the proposed revision.

Thus, I see no difference between RFC 3066 and this proposed revision in 
relation to compliance with the sections of RFC 2026 you referred to.


Agreed.  But the activity on the ietf-languages list regarding the
draft under discussion isn't an IETF process -- there is no WG or
Chair, no charter, etc.  Like the fictional Topsy, it jes' growed up.

RFC 3066 was developed in exactly the same manner as this proposed revision has 
been developed -- as an internet draft prepared by a member of the 
IETF-languages list and processed among members of that list until it was 
submitted for last call and subsequent IESG action.



Peter Constable
