Harald T. Alvestrand writes:
I have written up a draft of a draft, showing what I feel that
a "Content-language" header should look like.
Looks good as a draft of a draft.
Ready, set - COMMENT!
Here are mine:
Language tag for MIME body parts
<blabla goes here>
Some text should be included here about what kinds of language
information the Content-Language: field is intended for,
perhaps something like this:
-- By "language" in this document is meant only natural
languages, like Norwegian, and artificial languages designed
to substitute natural languages, like Esperanto. E.g. so-
called programming languages are not covered.
-- Both languages that have a written form and languages that
are only spoken can be indicated by the Content-Language:
header field. (The latter case may be relevant for
Audio/Basic body parts.) It can be used for both living
languages and dead languages.
-- Language groups, individual languages and dialects of
languages can be indicated in this scheme, which doesn't in
itself force the adoption of any specific position regarding
e.g. whether Chinese is one language or a group of related
This document describes a Content-Language: header for use with body
parts of MIME.
I suppose a new Content-Language: header field is preferred to a
new "content-language=" parameter in the Content-Type: header,
because language information may be relevant to many different
content-types. As I understand it, each content-type has its own
name space for parameters and a parameter to a content-type C1
with the same name as a parameter to a content-type C2 may have
totally unrelated semantics. (RFC 1521 says, on one hand:
... The set of meaningful
parameters differs for the different types. In particular, there are
NO globally-meaningful parameters that apply to all content-types.
and on the other:
most parameters make sense only with certain content-types, others
are "global" in the sense that they might apply to any subtype. For
example, the "boundary" parameter makes sense only for the
"multipart" content-type, but the "charset" parameter might make
sense with several content-types.
I suppose this doesn't exclude using "charset" with some other
semantics than the usual in the definition of some new
Perhaps we should then also have a Content-Filename-Suggestion:
header for those many cases where it's appropriate to save the
body of a body-part in a file? And a Header-Filename-Suggestion:
for saving the header of the body-part?
It also describes a new parameter to the Multipart/Alternative type,
to aid in the usage of the Content-Language: header.
Can a new parameter be added to an already defined content-type
or must a new content-type name be chosen when extending it in
The syntax of this header is:
Content-language: <2xAlpha>[_2xAlpha] (comment) [ , ... ]
The first 2xAlpha is an ISO 639 code for a language. If required, the
second 2xAlpha may define the country using a particular language
(such as en_GB and en_US), as per ISO 639.
I would prefer a more general syntax definition, something like
this (using RFC 822 notation):
The syntax of this header is:
"Content-Language:" language-token *("," language-token)
language-token = language-code ["-" variant-code]
language-code = 1*8 ALPHA
variant-code = 1*8 ALPHA
The case of a letter in a <language-token> is insignificant.
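
The grammar above can be sketched in Python as follows. This is only
an illustration, not part of the proposal: the names TOKEN_RE and
parse_content_language are made up for this example, and the value is
assumed to have had any comments removed already.

```python
import re

# language-token = language-code ["-" variant-code]
# language-code and variant-code are each 1*8 ALPHA.
TOKEN_RE = re.compile(r"([A-Za-z]{1,8})(?:-([A-Za-z]{1,8}))?")


def parse_content_language(value):
    """Parse a comment-free Content-Language value.

    Returns a list of (language_code, variant_code) pairs, where
    variant_code is None when no variant is given.
    """
    tokens = []
    for item in value.split(","):
        item = item.strip()
        match = TOKEN_RE.fullmatch(item)
        if match is None:
            raise ValueError(f"not a language-token: {item!r}")
        language, variant = match.groups()
        # Case is insignificant, so fold everything to lower case.
        tokens.append((language.lower(), variant.lower() if variant else None))
    return tokens


print(parse_content_language("no-xiny, NO-XIBOK"))
```

A value using "_" as separator, such as "en_GB", is rejected with a
ValueError under this syntax.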
As in all structured header fields, comments as defined in
RFC 822 can be inserted between <language-token>s and commas.
These are intended for information comprehensible for human
readers, not for data to be interpreted by programs. The
comments are also subject to interpretation according to
RFC 1522, so that characters not available in US-ASCII can be
At present only these parts of the <language-code> name space
2 ALPHA ; language code according to ISO 639
"XI" 2*6 ALPHA ; additional language code, registered with IANA
"XX" 2*6 ALPHA ; language code for private or experimental use
It is expected that
in the future will be used for the three-letter language
codes of a forthcoming second part of ISO 639.
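
As a rough illustration of how a program might sort a <language-code>
into these parts of the name space, here is a small Python sketch. The
function name classify_language_code and the category labels are
invented for this example. Everything outside the three listed parts
is simply reported as "unassigned".

```python
import re


def classify_language_code(code):
    """Say which part of the proposed <language-code> name space a code is in."""
    code = code.lower()  # case is insignificant
    if not re.fullmatch(r"[a-z]{1,8}", code):
        return "invalid"  # not 1*8 ALPHA
    if len(code) == 2:
        return "iso-639"  # 2 ALPHA
    # "XI" 2*6 ALPHA and "XX" 2*6 ALPHA: prefix plus two to six letters.
    if len(code) >= 4 and code.startswith("xi"):
        return "iana"
    if len(code) >= 4 and code.startswith("xx"):
        return "private"
    # Everything else (e.g. three-letter codes) is not in use at present.
    return "unassigned"


for example in ("no", "xismi", "xxcockn", "smi"):
    print(example, classify_language_code(example))
```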
The <variant-code> name space is used in this way:
2 ALPHA ; country code registered according to ISO 3166
"XI" 2*6 ALPHA ; language variant code registered with IANA
"XX" 2*6 ALPHA ; language variant code for private or experimental use
Substantial differences from Harald's draft:
+ A syntax to which all future extensions of the allowed values
must conform is given.
+ Comments are not used to provide information meant to be
understood by programs.
+ Registration of values with IANA is introduced to compensate
for the slowness of the ISO standardization process.
+ The second part of a <language-token> is generalized from
only country code to any language variant. Different dialects
are a more important aspect than country for many small
languages, like for the Sami language (also called Lappish),
where the most distant dialects are not mutually
understandable and differ more than e.g. the languages
Swedish and Norwegian. Also different orthographies for the
same language can be handled with this extended variant code.
(Several former Soviet republics where the major language
belongs to the Turkic language group are switching from the
Cyrillic script to the Latin script.)
+ As separator between the two parts of a <language-token> "-"
is used instead of "_". This is of course a minor point, but
I think that this change is justified by making IETF language
codes easily distinguishable from the more limited language
codes of the form "en_US" used in Posix and X/Open locale
If further information is needed, it is carried as RFC-822 comments
until ISO 639 is revised.
RFC 822-type comments should only be used for human-readable-
only information. I think this is in the spirit of RFC 822,
The comment construct permits message originators to add text
which will be useful for human readers, but which will be
ignored by the formal semantics. ...
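
If comments carry no formal meaning, a program can simply discard them
before looking at the language tokens. A minimal Python sketch of that
step follows. The function name strip_comments is made up here, and
the sketch assumes the field value contains no quoted strings, so only
nested parentheses and backslash quoting are handled.

```python
def strip_comments(value):
    """Remove RFC 822 comments, which may nest, from a field value."""
    kept = []
    depth = 0  # current nesting level of parentheses
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # A backslash quotes the next character; keep the pair
            # only when it is outside every comment.
            if depth == 0:
                kept.append(value[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
        i += 1
    return "".join(kept)


print(strip_comments("en-xxcockn (cockney)").strip())
```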
For languages that do not have an ISO 639 code, the language "xx" is
used, with an appropriate geographical area and comment. This is not
very useful for picking the correct thing, but is better than lying.
(The codes xa to xz are reserved for local use in ISO 639 <CHECK>)
I doubt that ISO 639 has reserved any codes for private use.
Besides this, there is definitely a potential use for more than
26 different private language codes, considering that the total
number of languages is probably in excess of 6000.
This may include:
- Dialect information. ISO 639 does not recognize variants of a
language that do not correspond to countries.
Also different orthographies for the same language should be
possible to indicate, as well as competing language forms
that are not attributable to different geographical areas or
dialects, like the Bokmal and Nynorsk forms of Norwegian.
- Languages not listed in ISO 639.
The current ISO draft for three-letter language codes,
ISO CD 639-2, also contains codes for
- groups of languages such as "gem" for Germanic (Other)
- historical forms of some languages such as "enm" for English,
Middle (1100-1500) and "non" for Norse, Old.
The first case should be handled by a language code. In the
second case, if no ISO-standardized code is available, a variant
code should be used when the historical language is a historical
form of a language spoken today (like Medieval Swedish), while a
language code should be used when it isn't (like the original
The meaning of the header is:
- For a single information object, it should be taken as the set of
languages that is required for a complete comprehension of the
complete object. Examples: Simple text.
What about a fully bilingual text? Or an English text containing
a few non-translated Latin quotations? Or an English text
containing one French phrase such as "tour de force"? Or a long
Swedish text with a summary in English at the top?
It might be better to have two different header fields:
Content-Language: indicating one or more languages, each of
which is in itself sufficient for full understanding of the
Content-Supplemental-Language: indicating one or more languages
that are required in addition to the Content-Language:
language(s) for a complete comprehension of every part of the
Norwegian official document, with parallel text in both official
versions of Norwegian. Both versions are readable by all Norwegians.
Content-language: no (nynorsk), no (bokm=86l)
I think the second comment should be: (=?ISO-8859-1?Q?bokm=E5l?=)
But in this case I would prefer
Content-Language: no-xiny, no-xibok
provided that "xiny" and "xibok" have been registered as
language variant codes with IANA for Nynorsk and Bokmal
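
For what it is worth, the encoded-word form of the comment suggested
above can be decoded with the standard Python email package. This is
only a sketch of how a reading program might present such a comment.
The helper name decode_comment is invented for the example.

```python
from email.header import decode_header


def decode_comment(text):
    """Decode RFC 1522 encoded-words in a comment into a plain string."""
    pieces = []
    for raw, charset in decode_header(text):
        if isinstance(raw, bytes):
            # Encoded-words come back as bytes plus their charset name.
            pieces.append(raw.decode(charset or "ascii"))
        else:
            # Text without any encoded-word is returned unchanged.
            pieces.append(raw)
    return "".join(pieces)


print(decode_comment("=?ISO-8859-1?Q?bokm=E5l?="))
```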
Voice recording from the London docks
Content-language: en_GB (cockney)
Here I would like
Content-Language: en-xxcockn (cockney)
especially in the case of a text object, to allow for a speech
synthesis system to automatically select the right dialect
module. (I use a private variant code here, because I don't
expect individual dialects of many languages to be registered
Document in Sami, which does not have an ISO 639 code, and is spoken
in several countries, but with about half the speakers in Norway
Content-language: xx_no (Sami)
There is a code for Sami, "smi", in ISO CD 639-2 and it may be
convenient to register "xismi" with IANA, waiting for 639-2 to
be adopted as an IS. At the same time the very dissimilar Sami
dialects should be registered as language variants. (It's
implausible that one would want to indicate a Norwegian or
Swedish form of Sami, since the dialect and orthography
differences are more or less perpendicular to the
Swedish--Norwegian border, following ancient seasonal migration
patterns. Both the South Sami and the North Sami dialects are
used in both Norway and Sweden.)
My version of this example would be:
Content-Language: xismi-xis (South Sami)
An official EC document
Content-language: en, fr, ge, da, gr, it
An official EC (or is it EU, nowadays?) document must include
fully equivalent texts in these languages. In this case Harald's
semantic criterion -- the set of languages that is required for
a complete comprehension of the complete object -- becomes
problematic. If you have read and understood the Danish text you
are probably not interested in also reading the same thing in
English or French. You could in this case use e.g. "Content-
Language: en", but that would be an arbitrary selection of one
of several equivalent alternatives. I think that my weaker
semantics for Content-Language: is more practical.
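
To make the weaker semantics concrete, here is a small Python sketch
of how a reading program might apply them, together with the
Content-Supplemental-Language: field I suggested. The function name
comprehension and its three return values are made up for this
illustration.

```python
def comprehension(reader_languages, content_languages, supplemental_languages=()):
    """Estimate how much of an object a reader can understand.

    content_languages: each one is in itself sufficient.
    supplemental_languages: all are needed in addition for every part.
    """
    known = {lang.lower() for lang in reader_languages}
    sufficient = any(lang.lower() in known for lang in content_languages)
    if not sufficient:
        return "none"
    complete = all(lang.lower() in known for lang in supplemental_languages)
    return "full" if complete else "partial"


# A document with equivalent texts in several languages,
# read by someone who knows only Danish.
print(comprehension(["da"], ["en", "fr", "da", "it"]))
# A Swedish text with an English summary, read by someone
# who knows only Swedish.
print(comprehension(["sv"], ["sv"], ["en"]))
```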
Content-type: multipart/alternative; difference=content-language
Content-language: en, fr
To make the example more realistic I would suggest including the
first paragraph of Gustave Flaubert's novel "Madame Bovary" here
(which should be free of copyright restrictions).
In order to give a sensible display first on non-MIME readers, the
English version should usually be the first one in the list of body
Maybe add here: "... if the sender has no particular reason to
suppose another language would be more appropriate for (the
majority of) the recipient(s)."
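
A reading program could choose among the alternatives roughly as in
the Python sketch below, which uses the standard email package. The
sample message, the function name pick_part and the rule of falling
back to the first body part are all assumptions made for this example.
The proposed "difference" parameter is left out, since the sketch does
not need it.

```python
from email import message_from_string

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain
Content-Language: en

English text
--b1
Content-Type: text/plain
Content-Language: fr

Texte en francais
--b1--
"""


def pick_part(message, preferred):
    """Return the first body part matching the reader's language preferences."""
    parts = message.get_payload()  # list of body parts for a multipart
    for wanted in preferred:
        for part in parts:
            value = part.get("Content-Language", "")
            languages = [token.strip().lower() for token in value.split(",")]
            if wanted.lower() in languages:
                return part
    # No preferred language is available: fall back to the first part.
    return parts[0]


message = message_from_string(RAW)
print(pick_part(message, ["fr", "en"]).get_payload().strip())
```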
CHARACTER SET CONSIDERATIONS
See RFC 1342 comment. Codes are always US-ASCII.
I don't understand what "Codes are always US-ASCII." means.
Maybe something like this?
The language codes and the language variant codes are
restricted to the letters of US-ASCII, since they should be
internationally usable and should, if possible, be based on
the names of languages in English. Otherwise codes should be
based on a name in another language, transliterated to
English. Comments can be given in any MIME character set,
though. It is anticipated that MIME-reading programs will
translate the short language codes to full language names in
a language suitable for the human reader.
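
The translation from codes to names could be as simple as a table
lookup. Here is a Python sketch. The table NAMES holds only a few
sample entries and the function name display_name is invented for this
example. A real program would need a full table for each language it
supports.

```python
# Language names keyed first by the reader's language, then by code.
NAMES = {
    "en": {"en": "English", "fr": "French", "no": "Norwegian", "da": "Danish"},
    "fr": {"en": "anglais", "fr": "français", "no": "norvégien", "da": "danois"},
}


def display_name(code, reader_language="en"):
    """Return a language name for the reader, or the code itself if unknown."""
    table = NAMES.get(reader_language, NAMES["en"])
    return table.get(code.lower(), code)


print(display_name("NO"))
print(display_name("da", "fr"))
```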
Olle Jarnefors, Royal Institute of Technology, Stockholm