Re: [Ltru] draft-ietf-ltru-4645bis-10.txt issue with preferred value for YU

2009-03-02 13:00:55
Hi -

From: "Tex Texin" <textexin(_at_)xencraft(_dot_)com>
To: <ltru(_at_)ietf(_dot_)org>; <ietf(_at_)ietf(_dot_)org>
Sent: Monday, March 02, 2009 1:05 AM
Subject: [Ltru] draft-ietf-ltru-4645bis-10.txt issue with preferred value for YU

With respect to the proposed update to the Language Subtag Registry 
draft-ietf-ltru-4645bis-10:

I would like to lodge an objection to the deletion of the Preferred-Value for 
language subtag YU.

As ltru co-chair: it's exceedingly late for such an objection - the issue was
discussed at length in the working group over a year ago.  A recent
revisiting of the question arrived at the same conclusion.

This change breaks the equivalence class relation between YU and CS.
It detrimentally changes the behavior of existing implementations.

As a technical contributor:

The main reason that CS does not make sense as a preferred value for
YU is that there is *not* an "equivalence class relation" between them.
There are pieces of what was YU that are not covered by CS.  To treat
them as an "equivalence class" ignores linguistic, geographic, and
political reality.

The loss of the relationship between YU and CS means that documents that were
believed to be tagged equivalently are no longer equivalent.

In my opinion, regarding them as equivalent is an error, since
CS and YU don't encompass the same regions.

There is also no benefit to this change.

I disagree.  The change removes an error.

To be concrete, assume a user attempts to find documents for languages from 
Yugoslavia.

Language tags do *not* pretend to be able to answer this sort of query.

Using a region subtag (e.g. 'CS') says that the tagged data uses a specific
variety of the primary language, and that the party tagging the data believes
that this distinction is useful.  For example, I could tag this paragraph with
'en' or with 'en-US'.  Is that extra distinction necessary or useful?  In this
case, no.  Consequently, the "retrieving documents by region subtag" use case,
although technically permitted by RFC 4647, is not realistic, and in many
ways contrary to the basic "tag wisely" principle.
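
To make the 'en' versus 'en-US' point concrete, here is a minimal sketch of
RFC 4647 basic filtering. The function name and the sample documents are
invented for illustration; only the prefix-matching rule comes from the RFC.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Return True if the range matches the tag under RFC 4647 basic filtering.

    A range matches a tag when the two are equal, or when the range is a
    prefix of the tag and the next character of the tag is '-'.
    Comparison is case-insensitive, and '*' matches every tag.
    """
    rng = language_range.lower()
    t = tag.lower()
    if rng == "*":
        return True
    return t == rng or t.startswith(rng + "-")


# Hypothetical corpus: the same kind of English text, tagged three ways.
documents = {"doc1": "en", "doc2": "en-US", "doc3": "en-GB"}

# A range without a region subtag finds everything.
broad = sorted(n for n, tag in documents.items() if basic_filter("en", tag))

# A range with a region subtag silently misses data tagged without one.
narrow = sorted(n for n, tag in documents.items() if basic_filter("en-US", tag))
```

In this sketch the region-qualified query finds only doc2, even though doc1
may well be American English that its tagger saw no reason to mark as such.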

Using the then-current registry data, a query tool noting the preferred
value relationship matches both xx-YU and xx-CS.

Another user searches for documents for Serbia.

A query tool using the current registry data, noting the preferred value
relationship, matches both xx-YU and xx-CS.

The results are in some sense accurate and complete, given the history of the 
subtag.

No, they are not.
  (1) there is no requirement, much less a guarantee, that the data will
      bear a region subtag at all
  (2) there are many bits and pieces of YU not covered by CS - 
      even if data always bore a region subtag, the YU->CS mapping
      would miss all the other territory that once belonged to YU.
  (3) blindly replacing all YU subtags with CS subtags would in fact
      falsify some data, since the language could well be of a variety
      covered by YU but not by CS.
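
Point (3) can be sketched as follows. The mapping table, the helper name, and
the sample tags are assumptions made for illustration; the helper is a
deliberate simplification that rewrites any subtag after the first, not a
full tag canonicalizer.

```python
# Hypothetical one-way table modelled on a Preferred-Value field.
PREFERRED_VALUE = {"YU": "CS"}


def canonicalize(tag: str) -> str:
    """Blindly replace subtags after the primary language subtag
    with their preferred value, when one is recorded."""
    language, *rest = tag.split("-")
    rest = [PREFERRED_VALUE.get(subtag.upper(), subtag) for subtag in rest]
    return "-".join([language, *rest])


# Slovenian and Macedonian data tagged with YU end up labelled as
# varieties used in CS, a region that did not include SI or MK.
rewritten = {tag: canonicalize(tag) for tag in ["sr-YU", "sl-YU", "mk-YU"]}
```

Here 'sl-YU' becomes 'sl-CS' and 'mk-YU' becomes 'mk-CS'; the rewrite is
mechanical, and nothing in it can tell whether the result is still true of
the data.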

After this change in the preferred value relationship, the query
tool does not know to search for both xx-YU and xx-CS, since the
registry does not indicate a relationship. Only one or the other
subtag is used for each query. However, the query results are now
incomplete since some documents for xx-YU have been tagged with
the one-time preferred tag of xx-CS.

The relationship cannot be adequately automated with a simple
one-way pointer like "preferred-value".  The former YU also
encompassed BA, HR, ME, MK, RS, and SI.
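
As a rough sketch of the difference, compare a one-way pointer with a
one-to-many relation. Both tables and both function names below are
hypothetical; the registry defines no such one-to-many field, and the region
list is simply the one given above.

```python
# One-way pointer, as a Preferred-Value field would express it.
PREFERRED_VALUE = {"YU": "CS"}

# Hypothetical one-to-many relation; not a registry field.
RELATED_REGIONS = {
    "YU": frozenset({"BA", "CS", "HR", "ME", "MK", "RS", "SI"}),
}


def regions_via_pointer(region: str) -> set:
    """Regions a tool would search if it followed only the one-way pointer."""
    region = region.upper()
    found = {region}
    if region in PREFERRED_VALUE:
        found.add(PREFERRED_VALUE[region])
    return found


def regions_via_relation(region: str) -> set:
    """Regions a tool would search using the one-to-many relation."""
    region = region.upper()
    return {region} | RELATED_REGIONS.get(region, frozenset())


pointer_result = regions_via_pointer("YU")    # YU and CS only
relation_result = regions_via_relation("YU")  # YU plus seven other regions
```

Even the one-to-many table is not an equivalence relation: in this sketch a
query for RS does not pull in YU, and data with no region subtag is still
missed entirely.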

Comments in the registry are not a solution. Comments are a good
thing for recording rationale and tangential history. However,
implementers are not going to go through and read the comments on any
or all tags in order to make a correct implementation. They are going
to implement based on the schema and operate with the data values.

If someone (or something) is applying region subtags, they'd better
have sufficient knowledge of the language varieties to do so meaningfully.
This effectively requires *understanding* those comments and much more.
The Language Subtag Registry does *not* attempt to record all the
information needed to recognize language varieties.  Rather, once
someone (or something) has made a distinction, the LSR provides
the bits needed to encode a tag for that language variety.

In the particular case of the languages of the former YU, the
region subtags now available (such as BA, HR, ME, MK, RS, and SI)
are arguably far more useful, if someone needs to distinguish
regional variations in their Croatian-language data, than just YU.
(It's unclear to me whether YU would ever have been terribly useful,
since it would allow the distinction of Croatian as spoken there
from Croatian spoken somewhere (where?) else.)

The registry should stay as it is with respect to YU and retain
CS as the preferred value.

As CS is now being used as a preferred value, deprecated or not,
there isn't a compelling motivation to remove the preferred value for YU.

Please, let's look at the actual tagged language data.
What corpora out there have employed YU (correctly) as a subtag?
To what extent would replacing that subtag with 'CS' (rather than
with BA, HR, ME, MK, RS, or SI) be correct, for Serbian, Croatian,
or any of the other languages of that region?

Please eliminate this needless change that breaks applications
relying on the relationship between YU and CS.

I would argue rather that an application that relies on an
equivalence relation between YU and CS is already in some sense
broken, in the same way as one assuming that Russia and the
Soviet Union are somehow equivalent.
 
tex

Randy
