RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

From: ietf-languages-bounces(_at_)alvestrand(_dot_)no [mailto:ietf-languages-bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly

Ah, but RFC 3066 does not sanction use of tags like "sr-CS-Latn" without
registration, and no such tags are registered.

Precisely; an RFC 1766/3066 parser, based on the 1766 and
3066 specifications, can expect four classes of language tags:
1. ISO 639 language code as the primary subtag, optionally
   followed by an ISO 3166 country code as the second tag
2. i as the primary tag; complete tag registered
3. x as primary tag; private-use
4. some other IANA-registered complete tag

"sr-CS-Latn" fits category 1. "sr-Latn-CS" fits none.


You are mistaken; "sr-Latn-CS" fits your category 4.

I've stated that the imputed back-compat problem is a non-issue.


You haven't convinced me of that.  Show me source code of an
existing, deployed, RFC 3066 parser that handles "sr-Latn-CS".


It matches the RFC 3066 syntax, and so can be recognized; the notion of 
language-range is still applicable, and nothing about that tag would prevent 
language-range handling. In what way could a parser *not* handle it? Even if I 
had my finger on source code, I can't demonstrate that a fault doesn't exist if 
you only say "there's a problem in there somewhere".
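
The syntactic part of this claim can be checked mechanically. As a minimal sketch, assuming the RFC 3066 section 2.1 grammar (a primary subtag of 1 to 8 letters, followed by any number of hyphen-separated subtags of 1 to 8 letters or digits), the hypothetical helper `is_well_formed` below tests shape only. It says nothing about registration, which is the separate point in dispute.

```python
import re

# Assumed grammar (RFC 3066, section 2.1):
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag: str) -> bool:
    """Return True if the whole string matches the RFC 3066 tag grammar."""
    return RFC3066_TAG.fullmatch(tag) is not None


# Both orderings have the same shape as far as the grammar is concerned.
print(is_well_formed("sr-Latn-CS"))  # True
print(is_well_formed("sr-CS-Latn"))  # True
print(is_well_formed("sr--CS"))      # False (empty subtag)
```

A parser built only on this grammar has no syntactic basis for accepting one ordering and rejecting the other.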

If you want to press this argument, I think you need to show exactly how
a problem would result in realistic usage scenarios.

I have explained the classes of tags described by RFCs
1766 and 3066, and how the proposed changed syntax permits
tags which do not fit in any of those classes.


It has been shown that you have described such classes incorrectly.

In the
interest of interoperability, I believe the onus is on the
proposers of the revised format to demonstrate that existing
deployed implementations will be able to handle the revised
syntax with no loss in functionality (meaning, e.g., that
"sr-Latn-CS" must be recognizable by all such deployed
implementations and be interpreted as equivalent to "sr-CS").


Why is it a requirement that a request for "sr-CS" must match "sr-Latn-CS"? 
That's quite unreasonable. It's like saying that a bunch of new characters are 
added to Unicode and existing implementations should recognize strings using 
the new characters as being equivalent to strings using only existing 
characters. *That tag represents _new_ functionality.*

Look, they're already there in registered tags. This draft isn't doing
anything new in that regard.

RFC 1766/3066 registered tags are integral tags, and can't
meaningfully (in the context of a parser) be said to
contain a script subtag;


If one is registered with a script subtag, then it contains a script subtag.

the entire tag needs to be recognized
by a 1766/3066 parser and treated as a unit.


And nothing prevents that happening with a tag containing a script subtag.

The draft
certainly changes that, in a way with which an RFC 1766/3066
parser cannot be expected to cope.


Not at all. RFC 1766/3066 parsers need to be able to deal with tags that contain
pieces they don't know about -- the only subtags they can know about are initial
subtags of "i", "x" or ISO 639 IDs, or a second subtag consisting of an ISO
3166 code in case the first subtag is an ISO 639 ID. There are lots of other
possible subtags permitted by RFC 1766/3066, including subtags that happen to
be script IDs from ISO 15924. This draft does not change that in the slightest.
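
To illustrate which pieces such a parser can interpret, here is a rough sketch that labels each subtag using shape alone. The function name `describe_subtags` and the label strings are hypothetical, and the sketch assumes any 2- or 3-letter first subtag is an ISO 639 ID and any 2-letter second subtag after it is an ISO 3166 code, without consulting the actual code lists.

```python
def describe_subtags(tag: str) -> list[tuple[str, str]]:
    """Label each subtag with what a parser could infer from position and shape."""

    def is_letters(s: str) -> bool:
        return s.isascii() and s.isalpha()

    subtags = tag.split("-")
    first = subtags[0]

    # The first subtag is the only one with three possible known meanings.
    if first.lower() == "i":
        labels = [(first, "IANA-registered tag prefix")]
    elif first.lower() == "x":
        labels = [(first, "private use")]
    elif is_letters(first) and len(first) in (2, 3):
        labels = [(first, "ISO 639 language code")]
    else:
        labels = [(first, "opaque")]

    first_is_language = labels[0][1] == "ISO 639 language code"

    for position, subtag in enumerate(subtags[1:], start=2):
        # Only a 2-letter second subtag after a language code has a defined reading.
        if position == 2 and first_is_language and len(subtag) == 2 and is_letters(subtag):
            labels.append((subtag, "ISO 3166 country code"))
        else:
            labels.append((subtag, "opaque"))
    return labels


print(describe_subtags("sr-CS-Latn"))
print(describe_subtags("sr-Latn-CS"))
```

Under these assumptions "Latn" is opaque in either position, and "CS" is read as a country code only when it comes second. The parser copes with both tags; what differs is how much of each tag it can interpret.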

Convince me by demonstrating that all deployed implementations
handle "sr-Latn-CS" at least no differently than "sr-CS-Latn".


Why? They should not, by design.

The issue at hand is the existing deployed base of RFC 3066
implementations that depend on the matching algorithm specified
therein (which doesn't work with a script tag interposed between
language code and country code).


You say that these do not work; these implementations will still work,
but they will match "sr-Latn" but not "sr-CS" with "sr-Latn-CS". If that
is a problem, please explain why.

No, unregistered "sr-Latn" is not a valid RFC 3066 language-tag. Nor
is "sr-Latn-CS".  "sr-CS-Latn" is likely valid (the first two subtags
are legal and have defined interpretation; RFC 3066 says that there
are no requirements (implicitly including registration) other than
syntax for third and subsequent subtags). "sr-CS" is clearly valid
and in use. An RFC 1766/3066 parser/matcher has a chance of matching
legal "sr-CS-Latn" containing script designation with legal "sr-CS"
(no script specified).


In your comments here, you are being rather loose in your assessment of what is 
or isn't valid. The tag "sr-Latn" is a registered, valid RFC 3066 language tag. 
The tag "sr-Latn-CS" is not registered, but could be and would be valid if 
registered. The tag "sr-CS" is certainly valid; I have no idea how widely it is 
used. The tag "sr-CS-Latn" would be valid if registered, but is not registered 
(and it is unlikely that, if requested, a consensus could be obtained to 
register it, given the preference among those involved in reviewing requests 
for a different ordering of subtags).

*If* "sr-CS-Latn" were registered (it is not), then a language-range matcher 
*must* match a request of "sr-CS" with content tagged "sr-CS-Latn". In 
precisely the same way, if "sr-Latn-CS" were registered, a language-range 
matcher would, and without modification could, match a request of "sr-Latn" 
with "sr-Latn-CS".

You cannot say that "sr-Latn-CS" has any less or more likelihood of being 
handled by existing language-range matchers than "sr-CS-Latn". Either the 
matchers work per the terms of RFC 3066 or they do not, and RFC 3066 does not 
indicate that either of these is any less valid than the other.

The proposed draft would make "sr-CS-Latn"
illegal and would instead require "sr-Latn-CS" which cannot be
recognized as a valid language tag by an RFC 1766/3066 parser, let
alone matching against "sr-CS".


There is no reason why an RFC 1766/3066 parser should not recognize 
"sr-Latn-CS" as valid since it conforms to the syntax specified.

A language-range matcher should match "sr-Latn-CS" against a request for 
"sr-Latn", but not "sr-CS". That is by design since a left-prefix matching 
algorithm is limited in what tags it can match, and it is considered more 
important to match for script than for regional variations.
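
The left-prefix behaviour can be sketched in a few lines. This assumes the language-range rule of RFC 3066 section 2.5 (a range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by "-", compared case-insensitively, with "*" matching everything). The function name `range_matches` is hypothetical.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Left-prefix language-range matching, as assumed from RFC 3066 section 2.5."""
    wanted = language_range.lower()
    candidate = tag.lower()
    if wanted == "*":
        return True
    # Exact match, or a prefix that ends on a subtag boundary.
    return candidate == wanted or candidate.startswith(wanted + "-")


print(range_matches("sr-Latn", "sr-Latn-CS"))  # True
print(range_matches("sr-CS", "sr-Latn-CS"))    # False
print(range_matches("sr-CS", "sr-CS-Latn"))    # True
print(range_matches("sr", "sr-Latn-CS"))       # True
```

The same unmodified matcher handles both orderings; the ordering only determines which of the shorter ranges ("sr-Latn" or "sr-CS") succeeds as a prefix.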

But you are speaking as though it's a problem that these tags are
registered. I have no idea why.

Registration of a complete tag is not itself a problem.  Registration
of a complete tag which incorporates script information is not an
ideal solution to the issue of conveying script information; that
would be more appropriately done using an orthogonal mechanism to
convey the orthogonal information...


That's one opinion; there are many who hold a different opinion.

But speaking of selective usage, have you noticed that RFC 3454
identifies specific characters from ISO/IEC 10646 as prohibited? Various
space and control characters are not permitted, INVISIBLE TIMES isn't
permitted, END OF AYAH isn't permitted, COMBINING GRAVE TONE MARK isn't
permitted... How is what is proposed in this draft any more
"cherry-picking" than that?

1. RFC 3454 is not a BCP, and isn't being pushed through for immediate
   Standards status without a phased roll-in. The draft under discussion
   has been proposed as a BCP, which would lack phased roll-in.


So acceptability of selective usage depends upon whether the document is a BCP 
or a proposed standard? I cannot see anything in RFC 2026 that suggests that 
(and it seems pretty odd).

2. RFC 3454 does not declare any parts of ISO 10646 as not valid and
   does not call for setting up an IANA registry of code points for the
   purpose of effectively declaring ISO 10646 code points invalid.  The
   draft under discussion explicitly seeks to set up a registry to
   replace use of ISO standard list.


RFC 3454 does say that some parts of ISO 10646 are not valid in strings output 
by stringprep implementations. If new characters are added to ISO 10646, it is 
certainly possible that RFC 3454 could be updated to exclude some of those new 
characters as well. What is proposed in this draft is analogous; the only 
difference is that the values considered invalid for the given purpose are 
documented in the IANA registry rather than in an RFC, which is certainly the 
easier way to maintain things, though perhaps it's not considered the preferred 
means of doing this in the IETF context.

3. RFC 3454 does not seek to redefine the meaning of any ISO 10646 code
   points.  The draft under discussion does, as specifically noted in
   the case of the ISO 3166 code "CS".


This draft would not change the meaning of an ISO identifier; it simply does 
not adopt the latest assigned meaning when a prior ISO-assigned meaning is 
already in use on the Internet.

(Note: the draft itself does not entail that CS in particular should be handled 
one way or another, and the question of the best handling of CS to provide 
stability on the Internet is open to comment as a separate issue from the draft 
itself.)

So, hypothetically, if some other standards body, say W3C, were to declare
that "CS" used in a language-tag in an application profile of SGML (i.e.
not an Internet protocol) meant something other than what the draft
under discussion would have it mean while importing the meaning of other
language tag components w/o change, you would have no issue with such
cherry-picking?


Well, it would be a concern, though it is their prerogative to do what they 
want in their specifications. Since W3C has consistently referenced RFC 
1766/3066, however, this is no more than a purely hypothetical question -- I 
have no expectation of such a thing ever happening.



Peter Constable

