Re: New Last Call: 'Tags for Identifying Languages' to BCP

The ABNF is an expression of the grammar that
describes the set of all valid tags.

No, this is simply incorrect. You cannot expect that an implementation that
merely implements the ABNF is conformant. There are a great many constraints on
the tags that are not in the ABNF grammar, that are clearly required in any
reading of the text. Most of these *cannot* be encompassed in any ABNF
grammar. There are a few that could be expressed in the ABNF; some at little
cost, some with a great deal of complication. This is not a technical
problem for the draft.

as reasonable as the current worst-case of 11 octets.

Also simply untrue. You seem not to be reading all the messages on this
subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF
there!

"
   The syntax of this tag in ABNF [RFC 2234] is:

    Language-Tag = Primary-subtag *( "-" Subtag )

    Primary-subtag = 1*8ALPHA

    Subtag = 1*8(ALPHA / DIGIT)
"

-- http://www.ietf.org/rfc/rfc3066.txt?number=3066
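
As a rough illustration of that point, the RFC 3066 ABNF quoted above can be transcribed into a regular expression. This is only a sketch of the syntax: the names RFC3066_TAG and is_rfc3066_syntax are made up for the example, and no registry or source-standard checks are attempted.

```python
import re

# Transcription of the RFC 3066 ABNF quoted above:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
# RFC3066_TAG and is_rfc3066_syntax are illustrative names only.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(tag: str) -> bool:
    """True if tag matches the RFC 3066 ABNF (syntax only, no registry check)."""
    return RFC3066_TAG.fullmatch(tag) is not None


# Each subtag is capped at 8 characters, but the number of subtags is not,
# so the grammar alone puts no upper bound on the overall tag length.
long_tag = "en" + "-abcdefgh" * 1000

print(is_rfc3066_syntax("cel-gaulish"))             # a registered tag
print(is_rfc3066_syntax(long_tag), len(long_tag))   # still matches the ABNF
print(is_rfc3066_syntax("a123-xyz"))                # primary subtag has digits
```

Any length limit under RFC 3066 therefore has to come from the registration rules in the text, not from the grammar.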


Mark

----- Original Message ----- 
From: "Bruce Lilly" <blilly(_at_)erols(_dot_)com>
To: <ietf-languages(_at_)alvestrand(_dot_)no>
Cc: <ietf(_at_)ietf(_dot_)org>
Sent: Friday, December 10, 2004 20:39
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP

RE: New Last Call: 'Tags for Identifying Languages' to BCP
 Date: 2004-12-10 20:03
 From: "Peter Constable" <petercon(_at_)microsoft(_dot_)com>
 To: ietf(_at_)ietf(_dot_)org
 CC: ietf-languages(_at_)alvestrand(_dot_)no

Resuming my comments:

Specifically, the draft allows, and RFC 3066 disallows:
subtags more than 8 octets in length
hyphens which do not separate subtags
zero-length subtags
primary tags which are not purely alphabetic
Curiously, all of those are permitted by the draft ABNF
production "grandfathered"...


The "grandfathered" production in the current draft is

grandfathered = ALPHA *(alphanum / "-")

which does permit the sequences claimed by Bruce (except for
not-purely-alphabetic primary sub-tags),


No exception.  "alphanum" is ALPHA / DIGIT.  In plain
English, "grandfathered" as defined in the draft is a letter
followed by any number of letters, digits, and/or hyphens, in
any order.  And that includes "a123-xyz" as I initially stated,
and clearly 1, 2, and 3 are digits.

syntactically; but the set of
tags available for use is constrained by more than the ABNF syntax
alone: the acceptable productions for each sub-tag must either be taken
from one of the source standards or be registered.


So what? The ABNF is an expression of the grammar that
describes the set of all valid tags.  If the grammar permits
"y-----", "a123-xyz", etc. (and it does) then a parser
claiming to parse language tags as defined by that ABNF
must be able to parse such tags.  That is, the ABNF-
specified grammar imposes requirements on parsers.  If
one doesn't intend to impose such requirements, the
ABNF specifying the grammar should be changed
accordingly.
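
A minimal sketch of what a parser built strictly from the draft's "grandfathered" production would accept. It assumes the production exactly as quoted in this message; DRAFT_GRANDFATHERED and matches_draft_grandfathered are hypothetical names.

```python
import re

# The draft production as quoted in this thread:
#   grandfathered = ALPHA *(alphanum / "-")
# where alphanum = ALPHA / DIGIT.
# DRAFT_GRANDFATHERED and matches_draft_grandfathered are hypothetical names.
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def matches_draft_grandfathered(tag: str) -> bool:
    """True if tag is derivable from the draft's grandfathered production."""
    return DRAFT_GRANDFATHERED.fullmatch(tag) is not None


candidates = [
    "y-----",        # hyphens that do not separate subtags
    "a123-xyz",      # primary subtag that is not purely alphabetic
    "a--b",          # zero-length subtag
    "abcdefghijkl",  # subtag longer than 8 octets
]
for tag in candidates:
    print(tag, matches_draft_grandfathered(tag))
```

All four candidates match, which is the sense in which the production imposes these cases on any parser that claims to follow the ABNF.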

This is no different
from RFC 3066, so it is no more of a problem in this specification than
it was in RFC 3066.


It is a very different grammar from RFC 3066, imposing
very different requirements on parsers.

It might be that the wording in 2.2 could be tightened up to eliminate
any possible question regarding the source for "grandfathered"
productions.


It's not a matter of wording; the problem is with the ABNF.

Alternately, there's no reason why the "grandfathered" production
shouldn't be composed exactly to match what was used in RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)


I believe I said as much (though one then needs to look
at reduce/reduce conflicts implied by the revised grammar):

I see no reason for the ABNF to permit such content as is
forbidden by RFC 3066; the actual ABNF for what RFC 3066
permits is contained within 3066, and could have been directly
incorporated rather than producing a "grandfathered"
production which opens up several cans of worms.
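
To make the difference concrete, here is a small sketch that reports which RFC 3066 syntax rules a candidate tag breaks. It works by splitting on hyphens rather than by a grammar, the function name rfc3066_violations and the wording of the messages are invented for the example, and only syntax is checked (no registry lookup).

```python
import string

ALPHA = set(string.ascii_letters)
ALPHANUM = ALPHA | set(string.digits)


def rfc3066_violations(tag: str) -> list:
    """List the RFC 3066 syntax rules that tag breaks (empty list: none)."""
    problems = []
    subtags = tag.split("-")
    # A leading, trailing, or doubled hyphen shows up as an empty subtag.
    if any(len(s) == 0 for s in subtags):
        problems.append("zero-length subtag")
    if any(len(s) > 8 for s in subtags):
        problems.append("subtag longer than 8 octets")
    if not set(subtags[0]) <= ALPHA:
        problems.append("primary subtag not purely alphabetic")
    if any(not set(s) <= ALPHANUM for s in subtags[1:]):
        problems.append("subtag with characters other than letters/digits")
    return problems


for tag in ["cel-gaulish", "a123-xyz", "y-----", "abcdefghijkl"]:
    print(tag, rfc3066_violations(tag))
```

Each of the last three strings is derivable from the draft's "grandfathered" production as quoted earlier, yet each breaks at least one RFC 3066 rule.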


This vastly overstates the problem. There is no can of worms unless it
exists in tags currently available under RFC 3066.


I referred to the additional requirements imposed on
parsers, as well as the unlimited tag length permitted.

One defect related to tag length in RFC 3066 is not remedied
by the draft; indeed the problem is greatly exacerbated...

Unfortunately, a language-tag's length is unlimited by
the ABNF in RFC 3066 (due to an unlimited number of subtags)
and in the draft...

In particular, tags other than private-use tags with more than
two subtags require registration under RFC 3066 rules, and it
is a trivial matter to determine the longest registered tag.
The draft, however, encourages use of more subtags as well as
removal of the subtag length upper bound; moreover, it permits
infinite numbers of subtags without requiring registration of
the resulting complete tag.


Bruce states incorrectly that there is no upper bound on the length of
sub-tags.


Look again at the draft definition of "grandfathered" -- now
show me where there's a limit in that production on subtag
length.

His other concern, on the overall length of complete tags, is
valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC
3066bis, infinite-length productions are possible, but RFC 3066 would
require registration of complete non-private-use tags while RFC 3066bis
does not.


Yes, and a quick look at the registry reveals that the longest
tag is 11 octets ("cel-gaulish").

There are three open doors for infinite-length productions in the ABNF
of the current draft:

- unlimited extlang sub-tags
- unlimited variant sub-tags
- the number of possible extensions is limited to 25


The ABNF indicates no such limit.

, but the length of
extensions is unlimited


You have missed several others:

1. "privateuse" length is unlimited (either tacked on
    after "lang" etc., or directly as an alternative in
    "Language-Tag")

2. "grandfathered", which as already discussed
    permits unlimited length.


We could impose some upper limits on these things; e.g.

Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-"
extension)


I think you mean *25("-" extension), not 1*25...

extension = singleton 1*8("-" 2*8alphanum)


That leaves the extension portions' length at up to
25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
of a tag into account!   That's way too long (the RFC 2047
limit for an encoded-word is 75 octets, including charset tag,
some text, and some syntactic glue in addition to the language
tag).  Heck, 1850 octets won't even fit into a maximum length
RFC [2]821/[2]822 message line (998 octets).
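
The arithmetic above can be spelled out step by step. The sketch below assumes the limits proposed in the quoted text (at most 25 extensions, each a singleton followed by at most 8 subtags of at most 8 characters); the constant names are chosen for the example.

```python
# Worst-case length of the extension portion under the proposed limits:
#   Language-Tag = ... *25("-" extension)
#   extension    = singleton 1*8("-" 2*8alphanum)
MAX_EXTENSIONS = 25
MAX_SUBTAGS_PER_EXTENSION = 8
MAX_SUBTAG_LEN = 8

# One extension: "-" + singleton + up to 8 * ("-" + 8 alphanum)
per_extension = 1 + 1 + MAX_SUBTAGS_PER_EXTENSION * (1 + MAX_SUBTAG_LEN)
total = MAX_EXTENSIONS * per_extension

ENCODED_WORD_LIMIT = 75   # RFC 2047 encoded-word limit cited above
LINE_LIMIT = 998          # RFC 2821/2822 line limit cited above

print(per_extension, total)                             # 74 1850
print(total > ENCODED_WORD_LIMIT, total > LINE_LIMIT)   # True True
```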

If we also imposed limits on the length of private-use tags and defined
the grandfathered production in a way that made clear there was an upper
limit for those, then we could end up eliminating an issue that had
existed in RFC 3066.


Perhaps; but you have a long way to go to get from 1850+
down to <64 octets.  Even farther to get to something
as reasonable as the current worst-case of 11 octets.

So, I think Bruce has identified a valid issue here. I personally would
not have characterized it as greatly exacerbating, though,


IMO, an increase from 11 octets worst-case, which is tolerable
for constructing RFC 2047/2231 encoded-words, to >> 1850
octets, which exceeds by a large margin what can be handled
in a Content-Language or Accept-Language message header
field, constitutes "greatly exacerbated".  YMMV. [N.B. that
">>1850" takes into account your proposed restrictions which
are not present in the draft]

as the issue
was present in RFC 3066: private-use tags did not need to be registered
in RFC 3066, so there was no way an implementation could be written with
certain knowledge that tags beyond some given length would not be
encountered.


True, but:
A. implementation is only one issue; protocol design (encoded-
    words and message header fields, for example) is a more
    important issue
B. private-use tags require end-to-end cooperation as a
    prerequisite; given such cooperation, agreement can be
    reached on tag length
C. Per some readings of BCP 82, not only are implementations
    not required to support experimental/private-use values,
    they are expected to erect barriers to their use, requiring
    users to specifically enable use of experimental/private-use
    functionality.

I am absolutely shocked that a draft dealing with language
lacks an "Internationalization considerations" section as
recommended by RFC 2277 (a.k.a. BCP 18).


No more or less shocking than for RFC 3066, regarding which I'm not
aware of any complaints.


By deferring to the bilingual ISO lists for language and country
tags, 3066 at least provided a minimal degree of internationalization.
By explicitly limiting description fields to English and restricting
the charset to US-ASCII, the draft proposal takes a giant leap
backwards.

I don't quite understand what the critique is here: what is there to
internationalize about language tags?


There should probably be a reference (at least informative)
pointing to BCP 18 and mentioning that the language tags
defined provide a means of labeling the language of text,
when combined with other mechanisms (RFC 2047/2231
encoded-words, Content-Language fields, etc.), to
implement the BCP 18 requirement for language tagging.

The draft (if/when approved) should also indicate that
it updates BCP 18, which refers to RFC 1766.

Given the divergence noted above from RFC 3066's use
of multilingual reference lists, the Internationalization
considerations section should include a synopsis of the
approach chosen (viz. to restrict description to English) and
the rationale for that choice (see BCP 18 section 6).
[Conversely the difficulty in writing a convincing rationale
might prompt some effort into producing a less
Anglo-centric design.]

It's true that ALPHA and DIGIT are not defined


Non-sequitur aside, those terms are defined in RFC 2234.

Perhaps even more disturbing is the content of the "IANA
Considerations" section; the draft predicts that certain things
will happen ("IANA will"[...]), but doesn't actually direct
(e.g. "IANA shall") IANA to do anything. The placement of that
section does not correspond to current RFC-Editor guidelines
(it should appear after Security Considerations); also on that
point, Appendices should precede References.


There is a process issue here, but I have assumed that the authors have
dealt with IANA on that. Otherwise, these are editorial issues -- "even
more disturbing" seems to me to be somewhat overstated.


The words "will" and "shall" have very distinct meanings.  If
one expects IANA to take specific action, it would be advisable
to clearly specify that IANA shall do so, rather than merely
expressing the hope that IANA will do so.

Many of the references are obsolete (e.g. RFCs 1327,
1521)... and at least one reference ([19])
gives a bracketed URI rather than the correctly formatted
RFC reference.


The RFC-Editor provides an "rfc-ref.txt" file containing the
preferred citations.  That file contains an "Obsoleted By"
column that points authors to the current RFC.  This isn't
rocket science...

In fairness to the authors, page-oriented plain text is not exactly
conducive to authoring and revising a long document,


There's no requirement to author in final publication
form. In fact the original RFC Editor has provided
guidelines and suggestions in the form of RFC 2223,
discussing methods that have been used successfully in
publishing quite long documents (textbooks!).  The current
RFC-Editor staff has a draft update.

implications (ISO 8601 date format parsing).


As mentioned above, this really is a non-issue.


It's an issue (esp. in light of the finger pointing regarding
accessibility to ISO 639/3166). Admittedly it can be
resolved without much difficulty (but then that could
have been done earlier, couldn't it?).

2. the clear contradiction between the claims about
ABNF compatibility with RFC 3066 and the factual
incompatibility of certain provisions in the grammar.


The main concern was with the "grandfathered" production, but I've shown
that that is a non-issue.


Again, it is an issue that imposes requirements on language
tag parsers.  What you've shown is that the ABNF is not
consistent with what was desired to be expressed, and
that makes it an issue that needs to be addressed.

The maximal length issue exists just as much
in RFC 3066 due to private-use tags; it is a technical concern that
might be worth reviewing in RFC 3066bis, however; but it is not
insurmountable, and not a new problem.


Private-use carries its own considerable baggage; aside from
that, the draft proposal increases the length of non-private
tags that affect both protocol design and implementations
from a worst case maximum of 11 octets under RFC 3066
registered tags to an infinite length, which is unworkable
for existing Standards Track protocols (RFC 2822 at
Proposed, RFC 2047 at Draft, and RFC 822 at Full Standard,
to name a few).
