To:  moore(_at_)cs(_dot_)utk(_dot_)edu
Subject:  Re: language tags 
Date:  Fri, 5 Mar 93 16:49:10 -0500
In <9303052036(_dot_)AA02941(_at_)wilma(_dot_)cs(_dot_)utk(_dot_)edu>, Keith 
writes:
Brief, unique codes are *great* for machine use.  It's not a cop-out.  MIME
software will be more robust if implementors don't have to account for all
of the variant ways to spell a language name.
It's no harder to switch on a unique longer name than it is to
switch on a unique two-character abbreviation (unless you're
thinking of using 26*26 or 52*52 lookup tables).  I stand by my
criticism of two-character abbreviations as being promulgated for
the convenience of implementors or standards bodies, not users or
persons interested in extensibility.  (John's point about the
unrepresentability of language names in their own language is
well taken, although here too the two-letter abbreviation is much
more of a least-unacceptable compromise than an ideal solution.)
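To make the two positions concrete, here is a minimal sketch in Python.
The tables and the lookup() helper are hypothetical and the entries are
only illustrative; nothing here is taken from an actual registry.  It
suggests that the lookup itself costs the same for a short code as for a
long name, and that the difficulty lies in spellings the table does not
list.

```python
# Hypothetical tables for illustration only; not a real registry.
BY_CODE = {"en": "English", "fr": "French", "de": "German"}
BY_NAME = {"english": "English", "french": "French", "german": "German"}


def lookup(tag):
    """Resolve a language tag given as a two-letter code or a longer name."""
    key = tag.strip().lower()
    # A dictionary lookup is the same operation for either kind of key.
    return BY_CODE.get(key) or BY_NAME.get(key)


print(lookup("FR"))       # a two-letter code
print(lookup("German"))   # a unique longer name
print(lookup("Deutsch"))  # a variant spelling missing from the table
```

The last call returns None, which is the variant-spelling problem: a long
name is only as easy to switch on as the list of names is complete and
agreed upon.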
The trick is making the long name unique and making the list of names both
extensive enough for use and widely available.  We would have to define such
a list ourselves, which is much harder than referencing another list that
already exists. ISO 639 defines "symbols" for languages, not language names
-- the language names themselves are for reference only and are listed in
both English and French (the document is bilingual).  Furthermore, I am told
that many of the language names in the document are arguably incorrect, but
this isn't an issue if the language codes are used.  ISO 639 is also not
very extensive...which is perhaps why it allows use of UDC language numbers
also...my guess is that the librarians care a lot more about being able to
classify documents according to their obscure dialects than the ISO people
did.
As to brevity:  ISO 639 seems to have been designed to attach a language
symbol to a description of a document -- thus XYZZY (Fr) denotes a French
version of document XYZZY.  The people who designed this valued brevity --
they wanted a compact representation for experts, not a user-friendly one
for ordinary humans.
For human use, we can suggest that an appropriate comment
follow the language parameter.  Something like:
content-type: text/plain;charset=us-ascii;language=20 (English)
Not using the current grammar.  (The grammar for Content-Type
parameters could use augmentation; I'd like to see it permit a
list of more than one token, separated by commas, for reasons I
haven't mentioned yet.)
There is perhaps a question as to whether comments are allowed in MIME
bodypart headers.  This should probably be cleared up for the draft standard
MIME version.
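As a rough illustration of what a reader would have to do if comments
were permitted there, here is a minimal sketch in Python.  It assumes
that RFC 822 style parenthesized comments are tolerated after a
parameter value, which is exactly the open question, and it ignores
quoted strings and backslash escapes.  The function names are
hypothetical.

```python
def strip_comments(value):
    """Remove parenthesized comments, honoring nested parentheses."""
    out = []
    depth = 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def parse_content_type(header):
    """Split a Content-Type header line into (type, parameters)."""
    # Drop the field name, then the comments, then split on semicolons.
    _, _, value = header.partition(":")
    parts = [p.strip() for p in strip_comments(value).split(";")]
    params = {}
    for part in parts[1:]:
        name, sep, val = part.partition("=")
        if sep:
            params[name.strip().lower()] = val.strip()
    return parts[0].lower(), params


ctype, params = parse_content_type(
    "content-type: text/plain;charset=us-ascii;language=20 (English)")
print(ctype, params)
```

With the comment stripped, the machine sees only language=20, while a
human reading the raw header still sees "(English)".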
P.S.  I also like using a number because it's a clue to an implementor to
actually READ THE SPECIFICATION to see what the number means.
That's a much more reasonable argument.  (Though I happen to
disagree with it, because I am concerned, perhaps too much, about
the recipient who has access to neither a MIME mailreader nor the
documents which define the language name encodings.)
Such a recipient could probably tell at a glance whether the language
were one she understood.
Keith