Re: language tags

In <731291025(_dot_)384006(_dot_)KLENSIN(_at_)INFOODS(_dot_)UNU(_dot_)EDU>, 
John wrote:

> Anyone for Content-Language: ?   Would that, with the "not required, but
> encouraged when it is important" solve enough of this problem that we
> can get on with our lives?


I am overjoyed to see this suggestion being made and taken
seriously.  I had been summoning the courage to make such a
suggestion myself, but was afraid it would get shot down, for
reasons I'll get to.

Like Keith, I'd lean towards a language= parameter on the
Content-Type line, rather than a separate header, but it's not
terribly important to me, particularly if RFC1327 is already
setting a precedent for a separate header.
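
To make the two placements concrete, here is a rough Python sketch of how a
receiving agent might look for the language in either spot. Neither form was
settled in this discussion, so both the "Content-Language" header name and the
"language" parameter name are taken as proposed above, and the helper name
declared_language is purely illustrative.

```python
from email import message_from_string


def declared_language(raw_message):
    """Return the declared language of a message, or None if absent.

    ASSUMPTION: checks a separate Content-Language header first, then a
    language= parameter on Content-Type. Both are proposals, not settled.
    """
    msg = message_from_string(raw_message)

    # Option 1: a separate header.
    header = msg.get("Content-Language")
    if header is not None:
        return header.strip()

    # Option 2: a parameter on the Content-Type line.
    # get_param returns None when the header or the parameter is missing.
    return msg.get_param("language")
```

Either placement carries the same information; the difference is only in
where an agent has to look for it.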

One thing I would like to mention (though I fear that it won't do
any good) is that numeric codes and two-letter abbreviations
*suck*.  I would very much rather see "language=English" than
"language=EN" or "language=20".  I am aware of technical reasons
(i.e. implementation cop-out) for preferring the "tighter"
encodings, as well as political reasons (i.e. should it be
"language=German" or "language=Deutsch"?), but I always hate to
see the top-level appearance and usability of something suffer
because of inadequacies in the development process.

Could we perhaps (pretty please?) define any language tags as
something like

        language = atom [ language-description ]

rather than

        language = 2*ALPHA [ language-description ]

to at least leave the door open for more descriptive tags, not
to mention expansion and extensions?  (Yes, the optional
language-description helps somewhat in this regard.)
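
As a minimal sketch of what the looser rule would accept, the following
Python parses a value against "atom [ language-description ]". The atom
character set follows RFC 822. The syntax of language-description is not
spelled out here, so the sketch assumes (hypothetically) that it is a
parenthesized comment; the name parse_language is likewise illustrative.

```python
import re

# RFC 822 "atom": one or more printable ASCII characters, excluding SPACE
# and the specials ( ) < > @ , ; : \ " . [ ]
ATOM = r"[!#$%&'*+\-/0-9=?A-Z^_`a-z{|}~]+"

# ASSUMPTION: the optional language-description is a parenthesized comment.
LANGUAGE_RE = re.compile(rf"\s*({ATOM})(?:\s*\(([^()]*)\))?\s*")


def parse_language(value):
    """Split a language value into (tag, description-or-None)."""
    match = LANGUAGE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid language tag: {value!r}")
    tag, description = match.group(1), match.group(2)
    if description is not None:
        description = description.strip()
    return tag, description
```

Under this rule "English", "EN", and "20" all parse as tags, so the choice
between descriptive and terse names is left to a registry rather than
forced by the syntax.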

I have one serious question, which should not be viewed as a
"spoiler," because as I've already indicated I'd love to see a
top-level language tag defined.

As has been repeatedly discussed, a key tenet of the MIME
definition of "charset" is that

        "No further parameters need to be parsed to get the
        complete identity of the character set."

I was going to ask this list (next Monday, so the question
wouldn't get lost over the weekend) for some more explanation of
this rule.  What, exactly, is this rule supposed to accomplish? 
What abuses, interoperability problems, inelegancies, etc. is it
supposed to prevent?  What sorts (specific examples, please) of
"charsets" does this rule permit and prohibit?  I would really
appreciate some answers to these questions; send them to me
personally if you're afraid the rest of the list wouldn't be
interested.

My reason for asking, of course, is that a strict reading of this
rule would seem to prevent utilizing the new language tag to
disambiguate unified Han characters.  (Believe me, I despise such
strict readings, but since I don't know what the appropriate
reading of the rule in question is, I have few alternatives.)


On another note, in 
<9303050409(_dot_)AA02542(_at_)samrat(_dot_)poel(_dot_)juice(_dot_)or(_dot_)jp>,
 Erik wrote:

> Sorry, but I beg to differ.  In my opinion, language info is far from
> being "required".  If I send you some email in ASCII, do I have to
> tell you what language I'm using in that note?  I think not.  You can
> probably tell just by looking at it.


Don't underestimate the utility of language tagging.  I wouldn't
make it "required," either, but I would make strong recommendations
that we move towards using it as often as possible.  (A decent
mail composing agent could of course be configured to attach the
user's native language by default to all outgoing messages.) 
Leaving off the tag is fine if you know your recipient and his
environment, but as soon as you have several recipients (e.g. a
mailing list or a Usenet newsgroup), explicit tags become much
more useful.

To mention one example, if a message is marked as being in
German, and uses a charset which contains umlauted characters
and the German double-s character, a display process at the
recipient's end which did not have those characters available
could transliterate them:

        <ss>            =>      ss
        <a-diaeresis>   =>      ae
        <o-diaeresis>   =>      oe
        etc.

If the language is known to be German, these transliterations
are appropriate, and highly recommended.  However, for other
languages (or if the language is not known), they are not
appropriate.
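
A display process along these lines might be sketched as follows in Python.
Only the ss, ae, and oe mappings come from the table above; the ue mapping,
the uppercase forms, and the set of tag spellings treated as "German" are
assumptions added for illustration, as is the function name.

```python
# Fallbacks for a display that lacks these characters.
# ASSUMPTION: the u-diaeresis and uppercase entries extend the "etc." above.
GERMAN_FALLBACKS = {
    "ß": "ss",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

# ASSUMPTION: hypothetical spellings of the tag; no registry is defined yet.
GERMAN_TAGS = {"german", "deutsch", "de"}


def transliterate_for_display(text, language=None):
    """Apply German fallbacks only when the message is tagged as German."""
    if language is None or language.strip().lower() not in GERMAN_TAGS:
        # Unknown or non-German language: the substitutions would be wrong.
        return text
    return "".join(GERMAN_FALLBACKS.get(ch, ch) for ch in text)
```

Without the tag the function leaves the text alone, which is the point:
the same characters in, say, a Swedish message should not be rewritten
this way.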

It's not quite as easy (particularly for a native English speaker)
to see why messages in ASCII, in English should be tagged as being
in English, but if "we" want "foreigners" to tag their messages,
it seems only fair if "we" follow the same rules :-) .

                                        Steve Summit
                                        scs(_at_)adam(_dot_)mit(_dot_)edu
