Re: draft-phillips-langtags-08, process, specifications, "stability",  and extensions

2005-01-01 10:57:32
 Date: 2004-12-30 07:26
 From: "Peter Constable" <petercon(_at_)microsoft(_dot_)com>
 To: ietf-languages(_at_)alvestrand(_dot_)no, ietf(_at_)ietf(_dot_)org
 
From: ietf-languages-bounces(_at_)alvestrand(_dot_)no
[mailto:ietf-languages-bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly


So why not then also throw in the closely linked specification of
the Content-Language field, which has historically been in the same
document (RFC 1766)?

It was removed in the development of RFC 3066, which was appropriate because 
it was a particular application involving language tags;

We may be in danger of confusing terminology; the Content-Language
field is a means of specifying the language of (part of) a
document, using a language-tag in the context of the MIME protocols.
"The World Wide Web" is an application.  Separating the specification
of language via a field from registration procedure was entirely
appropriate, as BCP documents are used for procedures and policies
and not for technical specifications.

other applications exist, and other applications may use different approaches 
for how matching should be done. 

So I take it that you agree that the technical specification of
matching algorithms should also be separated from the tag registration
procedure?

Harald Alvestrand pointed out some time ago, that (inappropriately)
shifts implementation effort from the tag generator (no registration
required) to the recipient (what the heck does this mysterious tag
actually *mean*).

The *meaning* of any given language tag would be no more or less a problem 
under the proposed revision than it was for RFC 3066 or RFC 1766. For 
instance, there is a concurrent thread that has been discussing when country 
distinctions are appropriate or recommended ("ca" or "ca-ES"?); this 
discussion pertains to RFC 3066, and part of the issue is that meanings of 
tags are implied rather than specified -- and always have been even under RFC 
1766 (I pointed this out five years ago when we were working on preparing RFC 
3066).

So, for instance, when an author uses "de-CH", what does he intend recipients 
to understand to be the difference between that and "de-DE" or even "de"? 
Neither RFC 1766 nor RFC 3066 sheds any light on this, and ultimately only the 
author knows for sure.

That's a somewhat different take on the issue; certainly the ability
to use a generative mechanism (i.e. w/o review/registration of an
entire tag) can lead to a proliferation of incompatible uses by
independent generators (and possibly loss of interoperability as a
result). The draft under discussion would expand use of generative
mechanisms to encompass all but private-use tags, and thereby expands
the potential for such incompatibilities and loss of interoperability.
 
Under the proposed draft, anybody may legally generate
a tag such as
  sr-Latn-CS-gaulish-boont-guoyu-i-enochian
or
  sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
with *no* specific registration requirements (i.e. all components
are either registered or require no registration). In the latter
case, a parser can only determine that it contains a private-use
subtag after wading through the other subtags.  In either case,
it is difficult (to say the least) for the recipient or his
software to determine what the generator of that tag intended to
convey.
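
The "wading through" point can be illustrated with a minimal sketch. It assumes only that a private-use sequence is introduced by a single-letter "x" subtag; the function name find_private_use is hypothetical, and nothing here validates the other subtags.

```python
def find_private_use(tag):
    """Scan subtags left to right for the "x" singleton.

    Returns (index, private_subtags) for the first "x" subtag found,
    or None when the tag has no private-use sequence. This is an
    illustrative scan only; it does not validate any other subtag.
    """
    subtags = tag.split("-")
    for index, subtag in enumerate(subtags):
        if subtag.lower() == "x":
            return index, subtags[index + 1:]
    return None


long_tag = "sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo"
print(find_private_use(long_tag))
print(find_private_use("sr-Latn-CS-gaulish-boont-guoyu-i-enochian"))
```

For the longer tag the scan has to pass eight subtags before it reaches the "x" marker, so nothing at the front of the tag signals that a private-use part follows.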

I've shown that this is no different in general than what already exists for 
RFC 3066 or RFC 1766.

It is certainly different; under RFC 3066 rules such a tag (as a whole)
would be subject to review and registration.

And I think we can all agree that there's much less likelihood of someone 
generating sr-Latn-CS-gaulish-boont-guoyu-i-enochian than there is of someone 
generating something like pl-AZ. So, I suggest that we not dwell on 
pathological cases that we aren't really likely to encounter. 

Please don't confuse a specific example with the general principle.
Also, in technical specifications (of language tag syntax or anything
else), "likelihood" is largely irrelevant; the quality of the
specification is dependent on how well it handles all cases, including
edge cases.
 
A recipient using software that interprets RFC 3066
tags isn't going to be able to do anything useful with any
hypothetical tag which contains a script subtag that would be
produced under the draft rules (if the script subtag were to appear
*after* the region subtag, one could at least match "sr-CS-Latn"[...]
to "sr-CS", which an RFC 3066 parser could handle).

This would be no more or less true of registered tags like "az-Latn-AZ", for 
which registration requests were submitted but those were postponed (by the 
submitter withdrawing the request) until details for RFC3066bis were worked 
out. Again, the concerns you are raising in relation to the proposed 
replacement of RFC 3066 apply equally to RFC 3066 itself.

No, you seem to have missed the point; there exist RFC 3066
implementations. Such implementations, using the RFC 3066 rules,
could match something like "sr-CS-Latn" to "sr-CS", but could
not match "sr-Latn-CS" to "sr-CS".  By changing the definition of
the interpretation of the second subtag, the proposed draft fails
to be compatible with existing deployed implementations (which is
what is meant by "backwards compatibility", which is a prime
consideration for Internet protocols).
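
A minimal sketch of the prefix rule at issue, assuming the RFC 3066 definition that a language-range matches a tag when it equals the tag or equals a prefix of it that is immediately followed by "-". The function name rfc3066_range_matches is hypothetical, and comparison is done case-insensitively.

```python
def rfc3066_range_matches(language_range, tag):
    """Return True if language_range matches tag by RFC 3066 prefix matching.

    A range matches when it equals the tag, or equals a prefix of the
    tag such that the next character in the tag is "-". The wildcard
    range "*" matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


for candidate in ("sr-CS-Latn", "sr-Latn-CS"):
    print(candidate, rfc3066_range_matches("sr-CS", candidate))
```

With the range "sr-CS", the tag with the script subtag last matches and the tag with the script subtag in second position does not; both still match the bare range "sr".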

It's not entirely clear if some of those items (e.g. script) should
be expressed by an orthogonal mechanism rather than embedded in a
*language* tag (for that matter, in retrospect, country codes were
probably a bad idea).

Of course it would not be clear if you don't have a conceptual model of what 
"language" tags are identifiers *of*. When RFC 3066 was being developed, 
there was a suggestion that script IDs be incorporated, but some were 
reluctant, raising the same question you have here. I was one of those. But I 
didn't remain obstructionist over the issue; instead, I gave a fair amount of 
thought to the ontology that underlies "language" tags, and subsequently 
published a white paper and presented on the topic at two conferences in the 
spring and fall of 2002. (Paper is available online at 
http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has 
evolved since then, but some key results remain valid, I think.) 

It's an issue of what is essential vs. what might be an orthogonal
issue applicable to specific cases.  That should (in an IETF
specification) take core Internet protocols into consideration.

At this point, I feel confident that it is not a problem to combine script 
IDs into "language" tags, and this is the consensus of the domain experts 
that have been discussing this proposed revision for the past year and more.

Evidently w/o considering the implications of and for core Internet
protocols.  If script *can be* specified in a language tag *between*
the language code and country code, then a parser must be able to 
recognize that case and deal with it appropriately (which, as noted
above, existing RFC 3066 implementations in deployed use do not and
cannot do) at *any* time and in any context (context may not be
available when a Content-Language field is parsed).  I don't have an
issue with provision for specification of script where appropriate,
but for crying out loud, at least do so in a compatible manner (e.g.
a Content-Script field) rather than a) breaking compatibility with
deployed protocols and b) burdening applications which need not be
concerned with script with having to parse script information.

There is a clear need for script codes...

But none of that applies to an audio file of spoken material,
where script would be superfluous...

Not a problem: the proposed revision *allows* for the use of script IDs but 
does not require them.

Yes, it's a problem. Having allowed them, each parser must be able
to handle them.

In the case of audio content, one simply would never include a script ID. 

But a Content-Language field parser needs to be able to parse *any*
Content-Language field, without knowledge of whether the content
that is referred to by that field is audio, video, image, model,
application, or text.  Generation is easy; printf("%s", whatever); --
the problem is in parsing, particularly considering the deployed base
of RFC 3066-compliant parsers.

and, as noted above, would
lead to loss of backwards compatibility.

But, as noted above, this is not an issue that is peculiar to the proposed 
revision -- it already existed in RFC 3066.

No, given a primary subtag which is a language code (per RFC 1766,
any 2-character primary subtag; per RFC 3066, any primary subtag of
2 or 3 characters), the second subtag --
in either RFC 1766 or RFC 3066 language tags -- is always a country
code and never a script code.  The proposed draft pulls the rug out
from under existing parsers by changing that.

The bigger problem you're pointing out is the limitations of using 
suffix-truncation alone as a matching algorithm. In the discussion following 
the registration request for de-1996, etc., there was some discussion as to 
whether de-1996-DE format or de-DE-1996 format was preferable, and in the 
course of that discussion it was mentioned that sometimes the 1901 vs 1996 
spelling differences would be more important than the regional dialect 
differences, but in other situations the regional differences would be more 
important than the spelling. But the problem with prefix matching used e.g. 
for Accept-Language is that only one of these two can be supported. That is a 
shortcoming in that application. 
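
The trade-off between the two orderings can be sketched by listing every range that would reach a given tag under prefix matching, assuming a range matches only a whole-subtag prefix of the tag. The function name matching_ranges is hypothetical.

```python
def matching_ranges(tag):
    """List every language-range that matches tag under prefix matching.

    Under prefix matching the only ranges that can match are the
    whole-subtag prefixes of the tag itself, from shortest to longest.
    """
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(1, len(subtags) + 1)]


for tag in ("de-1996-DE", "de-DE-1996"):
    print(tag, matching_ranges(tag))
```

A request for "de-DE" reaches only the second ordering, and a request for "de-1996" reaches only the first, which is the sense in which only one of the two distinctions can be given priority.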

Again you seem to be conflating established Internet Standards Track
protocols with "applications" and ignoring the critical importance of
backwards compatibility.  Regardless, you again seem to be supporting
separation of matching algorithms from registration.

Note that there is nothing that prevents other applications from using other 
matching algorithms, including perhaps something that is able to recognize in 
"az-AZ" and "az-Latn-AZ" that both involve Azeri as used in Azerbaijan.

The issue at hand is the existing deployed base of RFC 3066
implementations that depend on the matching algorithm specified
therein (which doesn't work with a script tag interposed between
language code and country code).

Surely some types
of script are indicated by the charset; in situations where that
is not the case, a separate mechanism could be used for that
orthogonal parameter without breaking compatibility with
existing parsers of language tags.

This is all a discussion we on the IETF-languages list went through five 
years ago, and in the intervening five years I think we have reached a 
consensus on these issues, that consensus being reflected in the proposed 
revision to RFC 3066. (Note that we made the relevant decisions over a year 
and a half ago when we reached a consensus to register az-Latn etc. The 
precedent was established then; the proposed revision adds nothing new in 
this regard.)

As previously noted, that is a danger recognized by RFC 2026 in
activity that does not conform to IETF procedures; it is
possible to reach good consensus on the wrong approach.

Does the ISO not set ground rules for the 3166/MA?  Could it not
specify that codes are not to be reused?

No, ISO does not. The ground rules for the ISO 3166/MA are established in ISO 
3166. I don't have the current version immediately at hand, but I believe the 
ground rules it specified were simply that something not be re-used for at 
least five years after it has been withdrawn. The re-assignment of CS made 
several parties very upset, and I note that the CD for the revision to ISO 
3166-1 which is in progress has upped this to 50 years, and added a clause 
saying, "Before reallocating... the ISO 3166/MA shall consult, as 
appropriate, the authority or agency on whose behalf the code element was
reserved and consideration shall be given to difficulties which might arise 
from the reallocation" -- nothing about consulting other users. 

I would think that that's covered by the "difficulties which might arise..."
part.  In any event, as the ISO seems to be in the process of tightening
the rules, it would be a more productive and mutually beneficial process
to convince the ISO to add specific language addressing specific issues
than to go off in a hissy fit saying (in effect) "we're setting up a
registry in competition with the ISO lists specifically to second-guess
the ISO and its MA". [By a process which demonstrably doesn't abide by
its own rules, I might add.]

Matching hasn't actually changed...

Do you not see the contradiction between "one should not expect to
receive anything less specific" vs. "may receive less specific
content"?

There is no substantive change from RFC 3066. RFC 3066 happened to mention 
one particular matching approach used in one application (HTTP), in relation 
to which it defined "language range"; but there is no question that there are 
different approaches to matching used in different applications, some of 
which may well involve receiving content the linguistic properties of which 
are not within the specific properties requested; and besides, the proposed 
revision retains the exact same definition for "language range" (for the sake 
of whatever applications may use that notion).

The problem is that the change to the language tag format is incompatible
with that algorithm.  Incidentally, HTTP is mentioned w.r.t. the syntax
for language-range, but does not restrict use of the matching algorithm
or of the Accept-Language field to HTTP or any other specific protocol
or set of protocols.
 
Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
Note that RFC 3066 strictly complies with those sections, while
the draft under discussion, by cherry-picking from ISO lists
for which change control has not been transferred to the IESG,
does not.

7.1 says,

<quote>
To avoid conflict between competing versions of a specification, the
   Internet community will not standardize a specification that is
   simply an "Internet version" of an existing external specification
   unless an explicit cooperative arrangement to do so has been made.
   However, there are several ways in which an external specification
   that is important for the operation and/or evolution of the Internet
   may be adopted for Internet use.
</quote>

The proposed revision does not create Internet-specific versions of ISO 
standards; it uses IDs drawn from ISO standards with semantics defined in 
those source standards at the time they were adopted for use in language tags 
-- the source for the IDs, the symbols and their meanings all reside in the 
ISO standards. The fact that not all are used, or that some are used as they 
were specified in a dated version of the ISO standard is not in contradiction 
with 7.1 -- it's just one of "several ways in which an external 
specification... may be adopted."

By cherry-picking, it effectively seeks to establish such a version.
The "several ways" refers not to some random procedure, but to specific
provisions in RFC 2026; moreover, ISO documents are specifically
covered by provisions regarding open external standards (as opposed
to proprietary specifications).

7.1.1 simply says that an open external standard may be incorporated merely by 
reference. There is no requirement here that is not met by the proposed 
revision.

It does not give leave to cherry-pick bits and pieces of an external
specification.  RFC 3066 does not do so. The draft under discussion
does.

7.1.3 simply says that an Internet specification may be an adaptation of an 
external specification provided certain conditions are met. Neither RFC 3066 
or the proposed revision are adaptations of any existing external 
specification, so this is not applicable.

See above. Has ISO transferred change control to the IETF so that it
can declare some codes invalid?

10.1 states a general policy regarding IP: 

<quote>
In all matters of intellectual property rights and procedures, the
   intention is to benefit the Internet community and the public at
   large, while respecting the legitimate rights of others.
</quote>

Again, there is no requirement stated here that is not met by the proposed 
revision. Clearly, the intent of the proposed draft is to benefit the 
Internet community and the public at large. There are no rights of others 
that are in any way violated by the proposed revision.

The ISO, as developers of ISO 639 and 3166, have rights. In particular,
they have the right to determine what those standards specify -- in
whole -- and they have the right to revise and amend those standards,
and are the sole arbiters of what is (and what is not) "valid".

Agreed.  But the activity on the ietf-languages list regarding the
draft under discussion isn't an IETF process -- there is no WG or
Chair, no charter, etc.  Like the fictional Topsy, it jes' growed up.

RFC 3066 was developed in exactly the same manner as this proposed revision 
has been developed -- as an internet draft prepared by a member of the 
IETF-languages list and processed among members of that list until it was 
submitted for last call and subsequent IESG action.

There is a time limit within which objections may be raised. That limit
has passed. Moreover, RFC 3066 had fairly minor backwards compatibility
issues and corrected some defects by splitting off an independent
specification. The draft under discussion has many serious compatibility
issues, and there are issues (e.g. cherry-picking open external standard
content, ignoring core Internet protocols) that have raised procedural
issues.  To wit, the benefit of the Internet Community is probably best
served by establishing an IETF Working Group, with corresponding
procedures, a charter, etc.
