Re: New Last Call: 'Tags for Identifying Languages' to BCP

 Date: 2004-12-12 15:31
 From: "Peter Constable" <petercon(_at_)microsoft(_dot_)com>
 To: ietf-languages(_at_)alvestrand(_dot_)no, ietf(_at_)ietf(_dot_)org

From: ietf-languages-bounces(_at_)alvestrand(_dot_)no
[mailto:ietf-languages-bounces(_at_)alvestrand(_dot_)no] On Behalf Of Bruce Lilly

Moreover, the point is that countries do change, and that use
of country codes (as provided for in RFC 3066 and in the
proposed draft) carries with it the inherent instability
which is characteristic of politics.  A quest for "stability"
of countries seems Quixotic and oxymoronic.  According to the
principle of stability as that term is used in defense of the
draft, I suppose we're all intended to refer to Malawi as
"Rhodesia" because that's what it (in part) was called 50 years
ago, or that we're supposed to ignore the breakup of the USSR,
Yugoslavia, etc., the reunification of Germany, etc.


That is not at all the aim here wrt stability; rather, the aim is that a
symbolic identifier used for metadata in IT systems not change because
some government on a whim says, "We would now prefer to use 'yz' rather
than 'xy' to designate our country."


If by international agreement, 'yz' becomes the designation
for that country, then it is rather silly to stick one's
fingers in one's ears and shout "NA-NA-NA-NA-NA I don't want
to hear you".  A more rational approach would be to say that
before such-and-such a date/time the designation was 'xy' and
after that date/time (until further notice) it is 'yz'. As I
have pointed out, politicians change the definitions of time
zones frequently, and those who have to deal with time zone
issues have found a way to cope with such change without
trying to declare international standardization organizations
irrelevant.
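
As a rough illustration of that dated approach, here is a minimal sketch
of a lookup that records when each designation took effect. It is only
an assumption about how such a table might look: 'xy' and 'yz' are the
placeholder codes used above, and the dates, the HISTORY table and the
code_on function are invented for the example.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history for one country: (effective date, code), sorted by date.
# 'xy' and 'yz' are placeholders from the discussion; the dates are invented.
HISTORY = [
    (date(1974, 1, 1), "xy"),
    (date(2003, 7, 1), "yz"),
]

def code_on(day, history=HISTORY):
    """Return the code in force on `day`, or None if `day` precedes the first entry."""
    dates = [start for start, _ in history]
    # Index of the first entry that starts after `day`; the one before it is in force.
    i = bisect_right(dates, day)
    return history[i - 1][1] if i else None
```

Data tagged before the changeover can then still be interpreted with
the designation that was valid when it was written, much as time zone
tables keep the old rules alongside the new ones.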

Sure, there will be changes that we need to deal with; but there's no
reason to subject all implementations, users and data to changes that
are purely cosmetic changes to things that are not designed to be read
by humans.


"Designed" or not, country codes *are* read by humans; they
appear in top-level domain names.  Currently the ISO 3166
2-letter codes mean the same thing as the last component of
a domain name and as the second component of a language-tag.
It's rather silly to change that correspondence simply because
a few people are piqued that international agreement has been
reached to change a few 2-letter codes.

A related problem with the use of country codes in language
tags is that there is not necessarily an inherent relationship
between language and country borders.


That is not what country IDs within a language tag are intended to
suggest. In fact, if there were inherent relationships, we probably
would never have needed to use country IDs in a language tag.


I submit that it was never a good idea. Language evolves
over time, even in a given place.

The borders of Germany
have changed many, many times.  If one is referring to the
German language as spoken by inhabitants of Alsace, using
country codes would imply that that same language spoken by
the same people would have been tagged at various times as
de-DE and de-FR according to where the France-Germany border
happened to have been determined by politicians of the time.
That strikes me as being a rather silly way to tag language,
but that's the precedent set by RFC 1766.


I agree that that's a silly way to tag that language; I disagree that
RFC 1766 suggests I should tag it that way.


RFC 1766 (and 3066) leave you little choice; if you wish
to indicate a region, you either have to do it with ISO
3166 codes or you have to register a separate tag (no
separate tag for German as spoken in Alsace exists). Never
mind the shortcomings of that particular example; consider
"de-DE" -- does that mean Germany as it exists today, West
Germany as it existed 25 years ago, Germany as it existed
in the 1930s, the 1900s, ...?

As far as I can tell,
the draft doesn't really deal with the issue of changing borders
or changing country names -- it merely pretends that these
things don't happen by attempting to declare a snapshot of the
status at some point in time as being valid for all time.


That may be your reading of the situation, but it is not how it is seen
by those of us who have been working on this spec and examining these
issues closely.


As far as I can tell, the draft pretends that the meaning
of "CS" hasn't changed, and would in fact change the meaning
of the currently valid RFC 3066 language tag "sr-CS".

But the user has indicated that he speaks French, and the
proposed registry contains a description in English only.
Where is the implementor supposed to get the *official*
translation for display?  N.B. under the current (RFC 3066)
situation, the definitive ISO lists provide an official
description in French.


Neither RFC 1766 nor RFC 3066 has ever presented "official" translations;


Both defer to the ISO lists for definitions (not "translations")
of the various codes.

this is no different for RFC 3066bis.


It is very different; under the proposed draft, there is only
an English definition, so somebody wishing to provide a French
definition finds that he has none and must resort to an
unofficial translation.

Under RFC 3066, one is pointed to 
ISO 639-1 and ISO 639-2 to get the alpha-2 and alpha-3 IDs, but it does
not anywhere state that implementors should use the English and French
language names in those ISO standards;


It defines the correspondence between the codes and
names as agreed to by international agreement.

exactly the same situation holds 
for RFC 3066bis.


So where are the French definitions?

(Note, btw, that the names listed by ISO 639-1/-2 have 
no particular "official" status; they are normative in those standards
to the extent that they indicate what language variety a given ID
denotes, but they do not claim that the particular form of the language
names has any particular status.)


Well, sure. But the name is an important thing by itself.
It is rather pointless to ask a user to indicate the
language of a piece of text by selecting from a list "AB, ACE,
ACH,..., ZHA, ZUL, ZUN" -- the user doesn't normally refer to
languages by codes. It's quite a different matter to ask the
user to select from "Abkhaze, Aceh, Acoli,..., Zhuang (Chuang),
Zoulou, Zuni".
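
A minimal sketch of that difference, assuming a small hand-copied
excerpt of names: the NAMES table and the picker_entries function are
hypothetical, the French names are the ones quoted above, and the
English names are assumed for illustration rather than taken from an
official list.

```python
# Hypothetical excerpt: code -> names keyed by description language.
# French names are those quoted in the message; English names are assumed.
NAMES = {
    "ab":  {"fr": "Abkhaze", "en": "Abkhazian"},
    "ace": {"fr": "Aceh", "en": "Achinese"},
    "ach": {"fr": "Acoli", "en": "Acoli"},
    "zha": {"fr": "Zhuang (Chuang)", "en": "Zhuang (Chuang)"},
    "zul": {"fr": "Zoulou", "en": "Zulu"},
    "zun": {"fr": "Zuni", "en": "Zuni"},
}

def picker_entries(ui_lang):
    """Return (label, code) pairs for a selection list.

    Falls back to the bare code when no name exists in `ui_lang`,
    which is the situation a user would face with codes alone.
    """
    return [(names.get(ui_lang, code), code) for code, names in NAMES.items()]
```

Calling picker_entries("fr") yields the readable French labels, while a
description language with no entries leaves the user with nothing but
the codes.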

One possibility would be two description fields.


Why two?


There are now two in the ISO lists (and, as noted, in the
UN list).  I have no objection to more, but I object to
a reduction.


If anything, I am inclined to object to two:


We now have objections to one and to two; would anybody
care to try for three :-).

Note that the RFC 3066 specifies a registry that does not include French
language names. I suggest that this issue should be dropped.


Yes, the current IANA registry has that problem for
the non-ISO-based tags only. If the registry is to be
changed to subsume ISO codes as well, that defect should
be remedied.  I'm willing to postpone the discussion
(other problems with the proposed registry format dictate
a broader solution which could easily have provision for
an arbitrary number of descriptions).

I have an implementation which (in accordance with RFC 3066)
uses the official ISO lists. It has provision for displaying
ISO 639 language tags with their descriptions in either of the
two languages supported by the official 639 lists, and likewise
for the ISO 3166 country codes.


RFC 3066 *does not at any point* suggest, let alone state, that
implementations should use ISO 639 language names or ISO 3166 country
names for UI purposes. IMO, you are creating an issue where none exists.


No, you are overlooking the fact that a set of codes with
no corresponding definitions is useless.  RFC 3066 defers
the code/definition pairs to ISO, which provides multilingual
definitions. The proposed draft would remove that multilingual
characteristic.

The specification of the
draft is *NOT* compatible with that existing implementation
because it removes the existing functionality of official
descriptions in French of language and country codes. As a
result of that incompatibility,  the newly proposed
specification does not work with (at least that one)
existing implementation (but I agree that that is a crucial
concern).


Display names for languages and countries are not within the scope of
RFC 1766 or RFC 3066. It is preposterous to suggest that this draft is
not compatible with existing implementations of RFC 3066 on that basis.


On the contrary, it is preposterous to suggest that codes
will be attached to text by magic; some human somewhere,
somehow is going to have to indicate the language to
something, and it certainly isn't going to be by way of
a 2- or 3-letter code without some reference to what those
codes *mean*.  And at the present time, the meaning of
those codes is defined -- bilingually -- in the ISO
lists.

It might be worthwhile considering the differences in the
way language tags are used, by whom they are used, and for
what purpose.  There may well be a substantial difference
between use of a tag to represent an obscure dialect of a
dead language in a research paper vs. tagging a piece of
text in one of the core Internet protocols such as SMTP.
The draft seems to ignore the needs of the core Internet
protocols (e.g. unbounded tag length which is incompatible
with those protocols).


IETF language tags are used in a wide variety of applications. The
parties involved in development of this spec (the authors and others)
have examined these issues for the past several years and have arrived
at this architecture.


Then why is there no provision for limiting the length of tags
to a range appropriate for encoded-words (N.B. we're not talking
about some obscure little-used protocol; we're talking about
email, which is widely recognized as one of the core Internet
application protocols!)?  Why, when I pointed out the
length limit for an encoded-word, including charset name,
text, encoding, and syntactic glue, did you still assume that
limiting language-tags to 75 octets (leaving no room at all
for the text to be tagged (!!!), let alone the charset,
encoding, and necessary overhead) would suffice?  Something tells me that
there has been insufficient attention paid to the implications
for core Internet protocols.
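
To make the arithmetic concrete, here is a minimal sketch of the
budget, assuming the 75-character encoded-word limit of RFC 2047 and
the "=?charset*language?encoding?text?=" form of RFC 2231. The
room_for_text function and the sample charset are illustrative
assumptions, not part of any specification.

```python
MAX_ENCODED_WORD = 75  # RFC 2047 limit on the whole encoded-word

def room_for_text(charset, language, encoding="Q"):
    """Characters left for encoded-text in =?charset*language?encoding?text?=."""
    overhead = (
        len("=?") + len(charset)
        + len("*") + len(language)
        + len("?") + len(encoding)
        + len("?") + len("?=")
    )
    # A negative result means the tag alone already overflows the encoded-word.
    return MAX_ENCODED_WORD - overhead

# Sample inputs; "ISO-8859-1" is just an example charset name.
short_tag = room_for_text("ISO-8859-1", "de-DE")
long_tag = room_for_text("ISO-8859-1", "x" * 75)
```

With a 5-octet tag there is still room for some text; with a 75-octet
tag the result is negative, i.e. nothing at all fits.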

But this spec is not an ISO standard; it is an IETF
standard.


One that seeks to replace use of ISO standards with something
less functional.

But
you are simply adding localization requirements to a spec for i18n
infrastructure, and I consider that not at all appropriate.


No, I am complaining about removal of internationalized
definitions associated with language tag components.
"Localization" would be translation of the French definition
into some other language.  That is not my concern. My concern
is the elimination of the French definition in the first place.

One part of my claim is that non-private-use RFC 3066 tags
up to the present time are no longer than 11 octets in length.


Only coincidentally at the present time.


As mentioned, under RFC 1766/3066 review/registration rules,
excessively long tags would certainly raise objections. That's
no coincidence -- it's an intentional design feature.

As the draft, if/when approved, would close that registration
process, that limit (unless a longer tag is registered in
the interim) would apply for all time.


And so that limit would be a constraint applying for all time to the
'grandfathered' production which concerned you so much.


And so it can easily be incorporated into that ABNF production.
