Re: draft-phillips-langtags-08, process, specifications, and extensions

 Date: 2005-01-01 21:58
 From: "Peter Constable" <petercon(_at_)microsoft(_dot_)com>
 To: ietf-languages(_at_)alvestrand(_dot_)no, ietf(_at_)ietf(_dot_)org

2.  RFC 3066 did not require every possible combination of language
subtag + country subtag to be registered.


None *could* be registered.


Even if by some oversight or lapse of judgment the tag
"en-US" were to be registered, its interpretation by a
parser would be as an ISO 639 language code followed by
an ISO 3166 country code.  Such a registration would
therefore be pointless.  In practice, therefore, it
simply wouldn't happen.

Indeed, Section 2.2 of RFC
3066 specifically says such combinations "do not need to be registered
with IANA before use."  Yet you criticize RFC 3066bis for allowing
"en-Latn-US-boont" to be used without being registered as a unit.


Yes, because an RFC 3066 parser cannot make any sense of it.
I.e. the proposed draft lacks "backwards compatibility".


It would be entirely possible for "en-Latn-US-boont" to be registered under 
the terms of RFC 3066.


But it hasn't been. No RFC 3066 parser will therefore find
that complete tag in its list of IANA registered tags, nor
will it be able to interpret "Latn" as an ISO 3166 2-letter
country code.

In what sense would any existing RFC 3066 parser (assuming it conforms to 
RFC 3066) make any less sense of that tag than of any other 
registered tag?


You're missing the critical factor: it is NOT a registered
tag -- an RFC 3066 parser has no way of recognizing it.

[de-AT-1901, incidentally, (as an example) does not meet the RFC 3066
requirement of 3 to 8 characters in the second subtag for registration
with IANA...].


There is nothing in RFC 3066 that says a registered tag must have 3 to 8 
characters in the second subtag. It simply requires that any tag in which the 
second subtag is 3 to 8 letters must be registered.


   The following rules apply to the second subtag:

   - All 2-letter subtags are interpreted as ISO 3166 alpha-2 country
     codes from [ISO 3166], or subsequently assigned by the ISO 3166
     maintenance agency or governing standardization bodies, denoting
     the area to which this language variant relates.

   - Tags with second subtags of 3 to 8 letters may be registered with
     IANA, according to the rules in chapter 5 of this document.

   - Tags with 1-letter second subtags may not be assigned except after
     revision of this standard.

That does not permit tags with two-letter second subtags to be registered
in the IANA registry; it permits that only for "Tags with second subtags
of 3 to 8 letters".  Granted, it could be clearer.
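
As a rough illustration of the three second-subtag rules quoted above, here is a minimal sketch in Python. The function name and the returned labels are hypothetical, not taken from RFC 3066. It classifies by length only, and it assumes that a 2-character second subtag containing a digit falls outside the quoted rules.

```python
import re

def classify_second_subtag(tag: str) -> str:
    """Classify the second subtag of a tag per the quoted RFC 3066 rules.

    The labels are illustrative only (hypothetical names).
    """
    subtags = tag.split("-")
    if len(subtags) < 2:
        return "none"            # the tag has only a primary subtag
    second = subtags[1]
    if not re.fullmatch(r"[A-Za-z0-9]{1,8}", second):
        return "malformed"       # outside the 1 to 8 alphanumeric limit
    if len(second) == 1:
        return "reserved"        # not assignable without revising the standard
    if len(second) == 2:
        if re.fullmatch(r"[A-Za-z]{2}", second):
            return "country"     # read as an ISO 3166 alpha-2 code
        return "unspecified"     # assumption: the quoted rules do not cover this
    return "registrable"         # 3 to 8 characters: may be registered with IANA

for example in ("en-US", "sr-Latn-CS", "de-AT-1901", "en"):
    print(example, "->", classify_second_subtag(example))
```

Under this reading "de-AT-1901" lands in the "country" branch, because only the second subtag is inspected; that is the case the disagreement turns on.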

Absolutely correct.  The need for RFC 3066 tags that go beyond language
+ country has reached the point where such tags have been registered in
violation of the RFC.  Does that not indicate the need for a revision of
the core specification?


No, it indicates that the review/registration procedure has violated
the rules of syntax specified by BCP, and as a result has caused
problems of a nature similar to those being criticized w.r.t. ISO
MA action (pot to kettle: "you're black").


Um, this entire sub-thread was based on an invalid premise. No rules of 
syntax were violated in any review/registration procedure.


See the direct quote from RFC 3066 above.

There is no reason to create a separate mechanism. When identifying textual 
content,


Language is not exclusively associated with text.  It is also a
characteristic of spoken (sung, etc.) material (but script is
not).

the identity of the writing system


Writing doesn't apply to spoken material, etc.  There is nothing
in RFC 3282 or MIME that requires that Content-Language and/or
Accept-Language fields be used exclusively with written text.

*is* very closely related to the identity of the language variety.
Indeed, the writing system is generally going to be of greater importance 
than distinctions such as dialect


For spoken material!?!  I don't think so.

It is not adequate to simply say that script can be identified from the 
charset or range of codes used. In the former regard, a charset of UTF-8 
provides no information.


Note my use of "or" not "and".  I certainly did not state that the
information could be obtained from charset alone in all cases.

In the latter regard, relying on the range of codes used in content does not 
provide a way to request an HTTP server to return pages that are (say) Azeri 
in Latin script rather than Cyrillic script. (You have mentioned numerous 
times the need to respect how language tags are used in Internet protocols; 
pot to kettle... )


The analogous way to handle that in Internet protocols would be
via Content-Script and Accept-Script where relevant (which they
would not be for audio media).

Perhaps someone will make the case that
Japanese written in Romaji needs to be specially indicated and will
write a request for "ja-Latn", and they will be right too.  Allowing
script subtags to be used generatively, instead of having to be
individually registered, solves this real problem.


In an inappropriate way. Without consideration for backwards
compatibility.  In violation of the BCP that specified the syntax
and registration procedure.


Not inappropriate at all.


Specifying script for audio material is as inappropriate as
specifying charset. In Internet protocols, we do not burden
protocols with having to interpret charset information for
non-text material; we should not do so for script information.

And all your repeated comments about lack of consideration for backwards 
compatibility and violation of syntax and procedures of BCP47 have been shown 
to be invalid.


Sorry -- saying so doesn't make it so.  I have explained in
detail that an RFC 1766/3066 parser cannot be expected to
make sense of unregistered "sr-Latn-CS" etc.  I have pointed
to specific second subtag length requirements in RFC 3066 for
registration.

RFC 3066 doesn't require "haw-US", and if encountered provides for
matching it (in an "accept" role) with "haw" (as content to be
provided). "sr-Latn" and "sr-Latn-CS" cannot be matched by an
RFC 3066-compliant process to anything, since they do not fit the
RFC 3066 syntax for well-formed language tags.


Certainly they do; and certainly an RFC 3066 parser will match "sr" with 
"sr-Latn" or "sr-Latn-CS", and "sr-Latn" with "sr-Latn-CS".


No, a strict RFC 3066 parser will not be able to identify "sr-Latn"
or "sr-Latn-CS" as valid tags.
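
For reference, the two mechanisms being argued over can be kept apart in a small Python sketch: the generic tag syntax from the RFC 3066 ABNF (a primary subtag of 1 to 8 letters, followed by any number of subtags of 1 to 8 letters or digits) and the prefix matching described in section 2.5. The function names are hypothetical. The sketch checks well-formedness and matching only; it says nothing about whether a tag is registered or how a given parser interprets its subtags, which is the separate question disputed above.

```python
import re

# Generic RFC 3066 tag shape: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

def is_well_formed(tag: str) -> bool:
    """True if the tag fits the generic ABNF, regardless of registration."""
    return WELL_FORMED.fullmatch(tag) is not None

def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching: the range equals the tag, or equals a prefix of
    the tag that is immediately followed by "-". Case-insensitive."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")

print(is_well_formed("sr-Latn-CS"))
print(range_matches("sr", "sr-Latn"))
print(range_matches("sr-Latn", "sr-Latn-CS"))
print(range_matches("sr-Latn-CS", "sr"))
```

A parser that additionally insists on recognizing each subtag against ISO 3166 or the IANA registry would need a further lookup step, which this sketch deliberately leaves out.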
