RE: New Last Call: 'Tags for Identifying Languages' to BCP

From: ietf-languages-bounces(_at_)alvestrand(_dot_)no [mailto:ietf-languages-
bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly

The "grandfathered" production in the current draft is

grandfathered   = ALPHA *(alphanum / "-")

which does permit the sequences claimed by Bruce (except for
not-purely-alphabetic primary sub-tags),


No exception.  "alphanum" is ALPHA / DIGIT.


My mistake; again, I had in mind constraints beyond the ABNF.

syntactically; but the set of
tags available for use is constrained by more than the ABNF syntax
alone: the acceptable productions for each sub-tag must either be taken
from one of the source standards or be registered.


So what? The ABNF is an expression of the grammar that
describes the set of all valid tags.


It is *part* of the expression of the grammar. Even in RFC 3066 this is the 
case: you know that t-abc is not valid under RFC 3066, but not because that is 
constrained by the ABNF of RFC 3066.

I will accept that the ABNF of the draft should be changed to better reflect 
what the form of grandfathered productions can be, which, as I stated in my 
previous message, would be the equivalent of the ABNF of RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I think that's an improvement, though technically I don't think it changes 
anything.
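To make the difference between the two productions concrete, here is a minimal sketch that transcribes each one into a Python regular expression and runs a few test strings through both. The regex translations and the names DRAFT, PROPOSED and accepted_by are my own assumptions for illustration; they are not text from either document.

```python
import re

# Draft production:     grandfathered = ALPHA *(alphanum / "-")
DRAFT = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Proposed production:  grandfathered = 1*8ALPHA *("-" 1*8alphanum)
PROPOSED = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by(pattern: re.Pattern, tag: str) -> bool:
    """Return True if the whole tag matches the given production."""
    return pattern.fullmatch(tag) is not None


if __name__ == "__main__":
    # A registered-style tag plus strings only the looser production admits.
    for tag in ["i-klingon", "a1-b", "a--b", "en-", "abcdefghijk"]:
        print(f"{tag!r}: draft={accepted_by(DRAFT, tag)} "
              f"proposed={accepted_by(PROPOSED, tag)}")
```

Under this reading, the draft production admits a digit in the first subtag, empty subtags, a trailing hyphen, and subtags longer than eight characters, none of which the RFC 3066-style production admits.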

If
one doesn't intend to impose such requirements, the
ABNF specifying the grammar should be changed
accordingly.

This is no different
from RFC 3066, so it is no more of a problem in this specification than
it was in RFC 3066.


It is a very different grammar from RFC 3066, imposing
very different requirements on parsers.


Our disagreement amounts to a basic question of whether parsers should be 
written based on the ABNF alone, or based on the ABNF plus other constraints 
provided in the spec. Clearly, I think anyone writing a parser should consider 
other constraints as well.
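As a rough sketch of what "ABNF plus other constraints" could look like in a parser, the following checks a candidate grandfathered tag in two layers: syntax first, then membership in a registry. The function names are hypothetical, and SAMPLE_REGISTRY is only a small illustrative subset of tags registered under RFC 1766/3066, not the full IANA list.

```python
import re

# Syntax layer: the draft's production, ALPHA *(alphanum / "-").
GRANDFATHERED_SYNTAX = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Registration layer: illustrative subset only (assumption: a real parser
# would load the complete registry instead of this hard-coded sample).
SAMPLE_REGISTRY = frozenset({
    "art-lojban", "en-gb-oed", "i-default",
    "i-klingon", "i-navajo", "zh-guoyu",
})


def matches_abnf(tag: str) -> bool:
    """ABNF-only check: does the tag fit the production at all?"""
    return GRANDFATHERED_SYNTAX.fullmatch(tag) is not None


def is_valid_grandfathered(tag: str) -> bool:
    """ABNF plus the registration constraint (tags compare case-insensitively)."""
    return matches_abnf(tag) and tag.lower() in SAMPLE_REGISTRY


def longest_registered(registry=SAMPLE_REGISTRY) -> int:
    """The registry, not the ABNF, is what bounds the tag length."""
    return max(len(tag) for tag in registry)
```

A parser built on matches_abnf alone has to cope with input of any length; one built on is_valid_grandfathered never accepts anything longer than longest_registered().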

In particular, tags other than private-use tags with more than
two subtags require registration under RFC 3066 rules, and it
is a trivial matter to determine the longest registered tag.
The draft, however, encourages use of more subtags as well as
removal of the subtag length upper bound; moreover, it permits
infinite numbers of subtags without requiring registration of
the resulting complete tag.


Bruce states incorrectly that there is no upper bound on the length of
sub-tags.


Look again at the draft definition of "grandfathered" -- now
show me where there's a limit in that production on subtag
length.


As mentioned, the limit is imposed by other tight constraints on 
'grandfathered'; you have already identified that the longest registered tag 
under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be 
at most 11 octets in length.

There are three open doors for infinite-length productions in the ABNF
of the current draft:

- unlimited extlang sub-tags
- unlimited variant sub-tags
- the number of possible extensions is limited to 25

...

, but the length of
extensions is unlimited


You have missed several others:

1. "privateuse" length is unlimited (either tacked on
    after "lang" etc., or directly as an alternative in
    "Language-Tag")


I disregarded this since it is identical to the case for RFC 3066, and you 
were, after all, charging that the draft creates problems that were worse than 
for RFC 3066.

2. "grandfathered", which as already discussed
    permits unlimited length.


But, as already stated, it is very tightly constrained, with a de facto upper 
limit of 11 octets (subject to change if new tags are registered before the 
proposed spec is accepted).

We could impose some upper limits on these things...

That leaves the extension portions' length at up to
25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
of a tag into account!   That's way too long (the RFC 2047
limit for an encoded-word is 75 octets, including charset tag,
some text, and some syntactic glue in addition to the language
tag).
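The arithmetic in the paragraph above can be checked with a short sketch. The terms are copied from the formula as written; the constant names, and the comparison against the 75-octet encoded-word limit, are added here for illustration only.

```python
# Terms copied from the formula 25 * (1 + 1 + 8 * 9) quoted above.
MAX_EXTENSIONS = 25
OCTETS_PER_EXTENSION = 1 + 1 + 8 * 9   # 74 octets per extension

# RFC 2047 encoded-word limit cited in the same paragraph.
ENCODED_WORD_LIMIT = 75

total_extension_octets = MAX_EXTENSIONS * OCTETS_PER_EXTENSION

if __name__ == "__main__":
    print(f"per extension: {OCTETS_PER_EXTENSION} octets")
    print(f"all extensions: {total_extension_octets} octets")
    print(f"ratio to encoded-word limit: "
          f"{total_extension_octets / ENCODED_WORD_LIMIT:.1f}x")
```

So the figure of 1850 octets follows from the stated terms, and a single extension at 74 octets already comes within one octet of the entire encoded-word limit.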


The problem already exists in RFC 3066. Even apart from private-use tags, 
tomorrow someone could request a registration for a tag that's 87 octets long, 
and there's nothing in RFC 3066 that would prohibit acceptance.

So, I think Bruce has identified a valid issue here. I personally would
not have characterized it as greatly exacerbating, though,


IMO, an increase from 11 octets worst-case, which is tolerable
for constructing RFC 2047/2231 encoded-words, to >> 1850
octets, which exceeds by a large margin what can be handled
in a Content-Language or Accept-Language message header
field, constitutes "greatly exacerbated".


Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 
10^100 octets in length. Of course, all of us know that such a tag wouldn't be 
useful. At some point, we have to engage common sense, even for RFC 3066. The 
draft would allow a tag 

en-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont

(over 75 octets), but common sense tells us it doesn't make sense (and that 
anyone who uses such a thing deserves whatever they get). 

Now, we could try to revise the ABNF to constrain for such things, just as the 
ABNF of RFC 3066 could have been constrained further. It's not easy to express 
common-sense constraints in ABNF, however.

I suggest that wording be added to the draft giving a strong recommendation 
to users that they not use tags whose complete length exceeds 75 
characters.
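A recommendation like this is cheap for an implementation to enforce. Here is a minimal sketch, assuming the 75-character figure suggested above; the names MAX_TAG_LENGTH and exceeds_recommended_length are hypothetical, and the test tag is built in the same pattern as the repeated-variant example rather than copied from it.

```python
# Ceiling taken from the suggestion above; not a limit in any published spec.
MAX_TAG_LENGTH = 75


def exceeds_recommended_length(tag: str, limit: int = MAX_TAG_LENGTH) -> bool:
    """Return True if the complete tag is longer than the recommended limit.

    Tags are US-ASCII, so character count and octet count coincide.
    """
    return len(tag) > limit


if __name__ == "__main__":
    # "en" followed by thirteen "-boont" subtags.
    long_tag = "en" + "-boont" * 13
    print(len(long_tag), exceeds_recommended_length(long_tag))
    print(len("en-boont"), exceeds_recommended_length("en-boont"))
```

A check of this kind sits outside the ABNF entirely, which is the point: the grammar stays simple and the common-sense bound is stated in prose.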

I am absolutely shocked that a draft dealing with language
lacks an "Internationalization considerations" section as
recommended by RFC 2277 (a.k.a. BCP 18).


No more or less shocking than for RFC 3066, regarding which I'm not
aware of any complaints.


By deferring to the bilingual ISO lists for language and country
tags, 3066 at least provided a minimal degree of internationalization.
By explicitly limiting description fields to English and restricting
the charset to US-ASCII, the draft proposal takes a giant leap
backwards.


The US-ASCII limitation existed in RFC 3066, so is not new. 

On the more general point, I believe you are confusing i18n concerns with 
localization concerns: you are looking for strings to be used in UI for 
different local markets. Apart from charset, RFC 1766, RFC 3066 and RFC 3066bis 
do not raise *internationalization* concerns.

I don't quite understand what the critique is here: what is there to
internationalize about language tags?


There should probably be a reference (at least informative)
pointing to BCP 18 and mentioning that the language tags
defined provide a means of labeling the language of text,


Have you not read the abstract in the draft?

<quote>
   This document describes the structure, content, construction, and
   semantics of language tags for use in cases where it is desirable to
   indicate the language used in an information object.
</quote>


Or the introduction?
<quote>
   One means of indicating the language used is by labeling the
   information content with a language identifier...

   This document specifies an identifier mechanism...
</quote>

How much clearer does it need to be?

The draft (if/when approved) should also indicate that
it updates BCP 18, which refers to RFC 1766.


Is this right? This draft is not a replacement for RFC 2277, or an addendum to 
it. RFC 2277 also refers to RFC 1958, which was updated by RFC 3439, but surely 
RFC 3439 doesn't state that it updates BCP 18? (RFC 2277 does have a section 
with significant overlap in topic, though, so perhaps this makes sense. I'm not 
well-enough versed in IETF document process to know.)

Given the divergence noted above from RFC 3066's use
of multilingual reference lists, the Internationalization
considerations section should include a synopsis of the
approach chosen (viz. to restrict description to English) and
the rationale for that choice (see BCP 18 section 6).


Again, this is a localization issue, not an internationalization issue. I do 
not consider this necessary or even appropriate.

It's
true that ALPHA and DIGIT are not defined


Non-sequitur aside, those terms are defined in RFC 2234.


Of course I meant "not defined *within this document*".

    implications (ISO 8601 date format parsing).


As mentioned above, this really is a non-issue.


It's an issue (esp. in light of the finger pointing regarding
accessibility to ISO 639/3166).


As has been pointed out, there is no such finger-pointing in the draft.

Admittedly it can be
resolved without much difficulty (but then that could
have been done earlier, couldn't it?).


I think the authors and those of us who have been reviewing thought that the 
intent was quite clearly YYYY-MM-DD, so didn't see a concern. That's why last 
calls are announced to a much wider audience.

2. the clear contradiction between the claims about
    ABNF compatibility with RFC 3066 and the factual
    incompatibility of certain provisions in the grammar.


The main concern was with the "grandfathered" production, but I've shown
that that is a non-issue.


Again, it is an issue that imposes requirements on language
tag parsers.  What you've shown is that the ABNF is not
consistent with what was desired to be expressed, and
that makes it an issue that needs to be addressed.


Again, I believe the bigger issue is not getting the ABNF to express what was 
desired, but rather whether parsers are written to consider only the ABNF or 
the ABNF plus other specified constraints as well.

The maximal length issue exists just as much
in RFC 3066 due to private-use tags; it is a technical concern that
might be worth reviewing in RFC 3066bis, but it is not
insurmountable, and not a new problem.


Private-use carries its own considerable baggage; aside from
that, the draft proposal increases the length of non-private
tags that affect both protocol design and implementations
from a worst case maximum of 11 octets under RFC 3066...


Worst case at present; a month from now it could be arbitrarily larger. But 
I've accepted that it would be an improvement to add constraints on overall 
length.


Peter Constable
Microsoft Corporation

_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf