Re: A spec for showing language in MIME headers

Harald T. Alvestrand writes:

I have written up a draft of a draft, showing what I feel that
a "Content-language" header should look like.


Looks good as a draft of a draft.

Ready, set - COMMENT!


Here are mine:

Language tag for MIME body parts

<blabla goes here>


Some text should be included here about what kinds of language 
information the Content-Language: field is intended for, 
perhaps something like this:

-- By "language" in this document is meant only natural 
   languages, like Norwegian, and artificial languages designed 
   to substitute for natural languages, like Esperanto. So-called
   programming languages, for example, are not covered.

-- Both languages that have a written form and languages that
   are only spoken can be indicated by the Content-Language:
   header field. (The latter case may be relevant for
   Audio/Basic body parts.) It can be used for both living
   languages and dead languages.

-- Language groups, individual languages and dialects of 
   languages can be indicated in this scheme, which doesn't in 
   itself force the adoption of any specific position on, e.g.,
   whether Chinese is one language or a group of related 
   languages.

This document describes a Content-Language: header for use with body
parts of MIME.


I suppose a new Content-Language: header field is preferred to a
new "content-language=" parameter in the Content-Type: header,
because language information may be relevant to many different
content-types. As I understand it, each content-type has its own
name space for parameters and a parameter to a content-type C1
with the same name as a parameter to a content-type C2 may have
totally unrelated semantics. RFC 1521 says, on one hand:

   ... The set of meaningful
   parameters differs for the different types.  In particular, there are
   NO globally-meaningful parameters that apply to all content-types.

and on the other:

   ... Although
   most parameters make sense only with certain content-types, others
   are "global" in the sense that they might apply to any subtype.  For
   example, the "boundary" parameter makes sense only for the
   "multipart" content-type, but the "charset" parameter might make
   sense with several content-types.

I suppose this doesn't exclude using "charset" with semantics
other than the usual ones in the definition of some new
content-type.

Perhaps we should then also have a Content-Filename-Suggestion: 
header for those many cases where it's appropriate to save the 
body of a body-part in a file? And a Header-Filename-Suggestion: 
for saving the header of the body-part?

It also describes a new parameter to the Multipart/Alternative type,
to aid in the usage of the Content-Language: header.


Can a new parameter be added to an already defined content-type 
or must a new content-type name be chosen when extending it in 
this way?

The syntax of this header is:

Content-language: <2xAlpha>[_2xAlpha] (comment) [ , ... ]

The first 2xAlpha is an ISO 639 code for a language. If required, the
second 2xAlpha may define the country using a particular language
(such as en_GB and en_US), as per ISO 639.


I would prefer a more general syntax definition, something like 
this (using RFC 822 notation):

   The syntax of this header is:

      "Content-Language:" language-token *("," language-token)

   where

      language-token = language-code ["-" variant-code]
      language-code  = 1*8 ALPHA
      variant-code   = 1*8 ALPHA

   The case of a letter in a <language-token> is insignificant.

   As in all structured header fields, comments as defined in 
   RFC 822 can be inserted between <language-token>s and commas. 
   These are intended for information comprehensible to human 
   readers, not for data to be interpreted by programs. The 
   comments are also subject to interpretation according to
   RFC 1522, so that characters not available in US-ASCII can be 
   used.

   At present only these parts of the <language-code> name space 
   are used:

      2 ALPHA          ; language code according to ISO 639 
      "XI" 2*6 ALPHA   ; additional language code, registered with IANA
      "XX" 2*6 ALPHA   ; language code for private or experimental use

   It is expected that

      3 ALPHA

   in the future will be used for the three-letter language 
   codes of a forthcoming second part of ISO 639.

   The <variant-code> name space is used in this way:

      2 ALPHA          ; country code registered according to ISO 3166
      "XI" 2*6 ALPHA   ; language variant code registered with IANA
      "XX" 2*6 ALPHA   ; language variant code for private or experimental use

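As a rough illustration of the syntax above, here is a sketch in
Python of how a program might parse the field body. The function
names are mine and purely hypothetical, comments are assumed to
have been removed already, and codes are folded to lower case
since case is insignificant.

```python
import re

# Grammar from the proposal:
#   language-token = language-code ["-" variant-code]
#   language-code  = 1*8 ALPHA
#   variant-code   = 1*8 ALPHA
TOKEN_RE = re.compile(r"([A-Za-z]{1,8})(?:-([A-Za-z]{1,8}))?")


def classify(code, standard):
    """Name the part of the name space a code falls in.

    `standard` is the label to return for a two-letter code
    ("ISO 639" for language codes, "ISO 3166" for variant codes).
    Returns None for codes outside the parts currently in use.
    """
    code = code.lower()
    if len(code) == 2:
        return standard
    if len(code) >= 4 and code.startswith("xi"):
        return "IANA"
    if len(code) >= 4 and code.startswith("xx"):
        return "private"
    return None


def parse_content_language(value):
    """Parse a comment-free field body into a list of
    (language-code, variant-code or None) pairs, lowercased."""
    tokens = []
    for item in value.split(","):
        match = TOKEN_RE.fullmatch(item.strip())
        if match is None:
            raise ValueError("not a language-token: %r" % item.strip())
        language, variant = match.groups()
        tokens.append((language.lower(), variant.lower() if variant else None))
    return tokens
```

With this sketch, "no-xiny, no-xibok" parses into the pairs
("no", "xiny") and ("no", "xibok"), while a Posix-style "en_GB"
is rejected because "_" is not allowed by the grammar.
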
Substantial differences from Harald's draft:

+  A syntax to which all future extensions of the allowed values 
   must conform is given.

+  Comments are not used to provide information meant to be 
   understood by programs.

+  Registration of values with IANA is introduced to compensate 
   for the slowness of the ISO standardization process.

+  The second part of a <language-token> is generalized from 
   only country code to any language variant. Different dialects 
   are a more important aspect than country for many small 
   languages, such as the Sami language (also called Lappish), 
   where the most distant dialects are not mutually 
   intelligible and differ more than e.g. the languages 
   Swedish and Norwegian. Also different orthographies for the 
   same language can be handled with this extended variant code. 
   (Several former Soviet republics where the major language 
   belongs to the Turkic language group are switching from the 
   Cyrillic script to the Latin script.)

+  As separator between the two parts of a <language-token> "-" 
   is used instead of "_". This is of course a minor point, but 
   I think that this change is justified by making IETF language 
   codes easily distinguishable from the more limited language 
   codes of the form "en_US" used in Posix and X/Open locale 
   names.

If further information is needed, it is carried as RFC-822 comments
until ISO 639 is revised.


RFC 822-type comments should only be used for human-readable-
only information. I think this is in the spirit of RFC 822,
which states:

        The comment construct permits message originators to add  text
        which  will  be  useful  for  human readers, but which will be
        ignored by the formal semantics.  ...
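
To show what "ignored by the formal semantics" would mean for a
program reading this field, here is a minimal sketch in Python
(the function name is hypothetical) that removes RFC 822
comments, including nested ones, before any language codes are
interpreted. It assumes the field has already been unfolded into
a single line.

```python
def strip_comments(value):
    """Remove RFC 822 comments (which may nest) from a field body."""
    out = []
    depth = 0          # current nesting depth of comment parentheses
    in_quotes = False  # inside a quoted-string, where "(" is literal
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # quoted-pair: the next character is taken literally
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif depth > 0 and ch == ")":
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)
```

After this step, "en-xxcockn (cockney)" and "en-xxcockn" carry
the same information as far as a program is concerned.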

For languages that do not have an ISO 639 code, the language "xx" is
used, with an appropriate geographical area and comment. This is not
very useful for picking the correct thing, but is better than lying.
(The codes xa to xz are reserved for local use in ISO 639 <CHECK>)


I doubt that ISO 639 has reserved any codes for private use. 
Besides this, there is definitely a potential use for more than 
26 different private language codes, considering that the total 
number of languages is probably in excess of 6000.

This may include:

- Dialect information. ISO 639 does not recognize variants of a
language that do not correspond to countries.


Also different orthographies for the same language should be 
possible to indicate, as well as competing language forms
that are not attributable to different geographical areas or 
dialects, like the Bokmal and Nynorsk forms of Norwegian.

- Languages not listed in ISO 639.


The current ISO draft for three-letter language codes,
ISO CD 639-2, also contains codes for
-  groups of languages such as "gem" for Germanic (Other)
-  historical forms of some languages such as "enm" for English,
   Middle (1100-1500) and "non" for Norse, Old.

The first case should be handled by a language code. In the
second case, if no ISO-standardized code is available, a variant
code should be used when the historical language is a historical
form of a language spoken today (like Medieval Swedish), while a
language code should be used when it isn't (like the original
Indo-European language).

The meaning of the header is:

- For a single information object, it should be taken as the set of
languages that is required for a complete comprehension of the
complete object. Examples: Simple text.


What about a fully bilingual text? Or an English text containing 
a few non-translated Latin quotations? Or an English text 
containing one French phrase such as "tour de force"? Or a long
Swedish text with a summary in English at the top?

It might be better to have two different header fields:

Content-Language: indicating one or more languages, each of 
which is in itself sufficient for full understanding of the 
object.

Content-Supplemental-Language: indicating one or more languages 
that are required in addition to the Content-Language: 
language(s) for a complete comprehension of every part of the 
object.
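
A reading program could apply these two fields roughly as in the
following Python sketch. The function name and the three result
labels are my own assumptions, not part of the proposal, and the
comparison is on whole language tokens only (no special handling
of variant codes).

```python
def comprehension(reader_languages, content, supplemental=()):
    """Return "full", "partial" or "none" for a reader who knows
    reader_languages, given the values of the two proposed header
    fields as lists of language tokens."""
    known = {lang.lower() for lang in reader_languages}
    # Any one Content-Language: language suffices for the object.
    if not any(lang.lower() in known for lang in content):
        return "none"
    # Every Content-Supplemental-Language: language is needed in
    # addition for complete comprehension of every part.
    if all(lang.lower() in known for lang in supplemental):
        return "full"
    return "partial"
```

For the English text with Latin quotations, labelled with
Content-Language: en and Content-Supplemental-Language: la, a
reader who knows only English would get "partial", while for a
fully bilingual English and French text a reader of either
language would get "full".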

Norwegian official document, with parallel text in both official
versions of Norwegian. Both versions are readable by all Norwegians.

  Content-language: no (nynorsk), no (bokm=86l)


I think the second comment should be: (=?ISO-8859-1?Q?bokm=E5l?=)
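
To check the encoded form, here is a small Python sketch using
the standard email.header module. The helper name is
hypothetical, and it is given only the text inside the
parentheses of the comment, not the whole field.

```python
from email.header import decode_header


def decode_comment(text):
    """Decode RFC 1522 encoded-words in the text of a comment."""
    parts = []
    for chunk, charset in decode_header(text):
        if isinstance(chunk, bytes):
            # Encoded-words come back as bytes plus a charset name.
            chunk = chunk.decode(charset or "us-ascii")
        parts.append(chunk)
    return "".join(parts)
```

Applied to "=?ISO-8859-1?Q?bokm=E5l?=" this gives the name with
a-ring as its fifth letter, and plain text such as "nynorsk"
passes through unchanged.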

But in this case I would prefer

   Content-Language: no-xiny, no-xibok

provided that "xiny" and "xibok" have been registered as 
language variant codes with IANA for Nynorsk and Bokmal 
respectively.

Voice recording from the London docks

  Content-language: en_GB (cockney)


Here I would like

   Content-Language: en-xxcockn (cockney)

especially in the case of a text object, to allow for a speech
synthesis system to automatically select the right dialect
module. (I use a private variant code here, because I don't
expect individual dialects of many languages to be registered
with IANA.)

Document in Sami, which does not have an ISO 639 code, and is spoken
in several countries, but with about half the speakers in Norway

  Content-language: xx_no (Sami)


There is a code for Sami, "smi", in ISO CD 639-2 and it may be 
convenient to register "xismi" with IANA, waiting for 639-2 to 
be adopted as an IS. At the same time the very dissimilar Sami 
dialects should be registered as language variants. (It's 
implausible that one would want to indicate a Norwegian or 
Swedish form of Sami, since the dialect and orthography 
differences are more or less perpendicular to the Swedish--
Norwegian border, following ancient seasonal migration 
patterns. Both the South Sami and the North Sami dialects are 
used in both Norway and Sweden.)

My version of this example would be:

   Content-Language: xismi-xis (South Sami)

An official EC document

  Content-language: en, fr, ge, da, gr, it


An official EC (or is it EU, nowadays?) document must include 
fully equivalent texts in these languages. In this case Harald's 
semantic criterion -- the set of languages that is required for 
a complete comprehension of the complete object -- becomes 
problematic. If you have read and understood the Danish text you 
are probably not interested in also reading the same thing in 
English or French. You could in this case use e.g.
"Content-Language: en", but that would be an arbitrary selection of one 
of several equivalent alternatives. I think that my weaker 
semantics for Content-Language: is more practical.

Content-type: multipart/alternative; difference=content-language
Content-language: en, fr

--limit
Content-language: fr

--limit
Content-language: en

--limit--


To make the example more realistic I would suggest including the 
first paragraph of Gustave Flaubert's novel "Madame Bovary" here 
(which should be free of copyright restrictions).
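
For what it's worth, the structure of this example, and the way
a reader program might choose among the parts, can be sketched
in Python with the standard email package. The function names
are hypothetical, the "difference" parameter is the one proposed
in Harald's draft, and the body texts are left to the caller.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_alternatives(texts):
    """Build a multipart/alternative message from a list of
    (language-token, text) pairs, one body part per pair."""
    msg = MIMEMultipart("alternative")
    # Parameter proposed in the draft, not defined by RFC 1521.
    msg.set_param("difference", "content-language")
    msg["Content-Language"] = ", ".join(lang for lang, _ in texts)
    for lang, text in texts:
        part = MIMEText(text, "plain")
        part["Content-Language"] = lang
        msg.attach(part)
    return msg


def pick_part(msg, preferred):
    """Return the part matching the reader's most preferred
    language, or the first part if none matches."""
    parts = msg.get_payload()
    for wanted in preferred:
        for part in parts:
            language = (part["Content-Language"] or "").strip().lower()
            if language == wanted.lower():
                return part
    return parts[0]
```

A reader preferring French and then English gets the "fr" part.
A reader who knows neither falls back to the first part, which
is also what a non-MIME reader would see first.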

In order to give a sensible display first on non-MIME readers, the
English version should usually be the first one in the list of body
parts.


Maybe add here: "... if the sender has no particular reason to 
suppose another language would be more appropriate for (the 
majority of) the recipient(s)."

CHARACTER SET CONSIDERATIONS

See RFC 1342 comment. Codes are always US-ASCII.


I don't understand what "Codes are always US-ASCII." means.
Maybe something like this?

   The language codes and the language variant codes are 
   restricted to the letters of US-ASCII, since they should be 
   internationally usable and should, if possible, be based on 
   the names of languages in English. Otherwise codes should be 
   based on a name in another language, transliterated to 
   English. Comments can be given in any MIME character set, 
   though. It is anticipated that MIME-reading programs will 
   translate the short language codes to full language names in 
   a language suitable for the human reader.
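
The translation of codes to names anticipated in the last
sentence could be as simple as this Python sketch. The table is
a tiny illustrative sample of my own, not a registry, and the
function name is hypothetical.

```python
# Language names keyed first by the reader's language, then by code.
LANGUAGE_NAMES = {
    "en": {"en": "English", "fr": "French", "da": "Danish", "no": "Norwegian"},
    "sv": {"en": "engelska", "fr": "franska", "da": "danska", "no": "norska"},
}


def display_name(code, reader_language="en"):
    """Return the name of a language code in the reader's language,
    falling back to English names and then to the bare code."""
    table = LANGUAGE_NAMES.get(reader_language, LANGUAGE_NAMES["en"])
    return table.get(code.lower(), code)
```

A Swedish reader would then see "franska" for the code "fr",
and an unknown code is shown as it stands.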

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>