I have two comments on substance:
1) The following special values can be useful:
   iana-art   (Some) artificial language, usable e.g. for source
              code, a shell script, a sendmail configuration
              file.
   iana-hum   (Some) human language without a known registered
              language code.
   iana-ukn   (Some) unknown language(s). This code can be used
              by a program that automatically analyzes (or at
              least guesses) which human language(s) is used in
              a written text, but failed in this particular case.
Maybe it would be appropriate to include these values, which
can't be expressed by ISO codes, in the Content-Language
standard?
2) To discourage less serious applications for language codes, I
propose that some conditions for applications should be
stated. It seems reasonable to require the _name_ in the
original language and, if possible, in English and French.
Also a reference to the linguistic literature which can be
used to _identify_ the (sub)language shall be given.
The rest of my comments have to do only with presentation, not
substance, and can be skipped by everyone not particularly
interested.
Abstract
This document describes a Content-Language: header for use with
body parts of MIME.
Not only body parts but whole messages, which need not be MIME
messages. Write instead:
"... for use in RFC 822 messages and body parts of MIME messages."
It also describes a new parameter to the Multipart/Alternative
type, to aid in the usage of the Content-Language: header.
As I see it, this standard will do two things:
a) extend the functionality of the MIME format for email
b) introduce a registration mechanism for language and
sublanguage codes.
The latter is important and directly usable not only for email
but also for many other application layer services, e.g. WWW,
Whois++, URCs. Many of these will probably use the language
codes but not the Content-Language: header. Therefore, add to
the Abstract:
"The registration mechanism for language and sublanguage
codes introduced here can be used also by other protocols
which need to indicate natural language."
1. The Language tag
I suggest that this section be split into two sections, with the
headings:
"Syntax of the Language header and language tags"
"Registration of language tags"
The present section 1 should be split before the paragraph:
The namespace of language tags and subtags is administered by the
IANA. The following registrations are predefined:
The new section 1 should start with:
The syntax of this header in RFC-822 EBNF is:
The present first paragraph
The language tag is composed of 2 parts: A language tag and a
subtag.
should be moved to after the paragraph
Note that the Language-Header is allowed to list several languages
in a comma-separated list.
The term "language tag" is used in two different senses in the
paragraph I want to move. I would prefer that the first part of
the whole tag is called "principal tag", so the paragraph would
become:
"The language tag is composed of 2 parts: A principal tag
and, optionally, a subtag."
Language-Header = "Content-Language" ":" 1#Language
Language ::= 1*8ALPHA [ '-' 1*8ALPHA ]
To use exactly the same notation as RFC 822, change "::=" to "="
in the definitions.
It would be better to use "Language-Tag" instead of "Language"
in these definitions (it's a name of the symbol, not of the thing
the symbol denotes).
Why double quotes in the first definition and single quotes, or
rather apostrophes, in the second? The RFC 822 meta language
doesn't use apostrophes.
It might be added here that white space can be used between and
after the tokens in the righthand part of the Language-Header
definition according to RFC 822 rules but not around the "-" in
the Language definition.
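The syntax and the white space rule above can be sketched in Python as
follows. This is only an illustration, not part of the draft: the
function name parse_content_language is hypothetical, RFC 822 comments
are assumed to be non-nested, and header folding is not handled.

```python
import re

# One tag: 1*8ALPHA [ "-" 1*8ALPHA ], with no white space around the "-".
TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})?")
# Non-nested RFC 822 comment; nested comments are not handled in this sketch.
COMMENT_RE = re.compile(r"\([^()]*\)")


def parse_content_language(header_line):
    """Return a list of (principal tag, subtag or None) pairs for one header line."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "content-language":
        raise ValueError("not a Content-Language header")
    value = COMMENT_RE.sub(" ", value)
    tags = []
    for item in value.split(","):
        item = item.strip()  # white space is allowed between list elements
        if not item:
            continue  # the RFC 822 "#" list rule tolerates empty elements
        if not TAG_RE.fullmatch(item):
            raise ValueError("malformed language tag: %r" % item)
        principal, _, subtag = item.partition("-")
        tags.append((principal, subtag or None))
    if not tags:
        raise ValueError("at least one language tag is required")
    return tags
```

With this sketch a value such as "en - US" is rejected, since white
space around the "-" does not match the tag pattern, while a trailing
comment after the last tag is simply ignored.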
The namespace of language tags and subtags is administered by the
IANA. The following registrations are predefined:
Delete "and subtags".
In the language tag:
Change "language tag" to "principal tag".
- All 2-letter codes are interpreted according to ISO 639.
- All 3-letter codes are reserved for a (hopefully) forthcoming
extension to ISO 639
- The value "IANA" is reserved for IANA-defined
subregistrations
Write "iana" to follow the case convention for ISO language codes.
- The value "X" is reserved for private use. Subtags of "X"
will not be registered by the IANA.
Write "x" for the same reason.
- No other registration is allowed.
In the sublanguage tag:
Change "sublanguage tag" to "subtag". (Some subtags will denote
independent languages, not sublanguages.)
- All 2-letter codes are interpreted as ISO 3166 country codes,
according to the rules laid down in ISO 639.
The rules in ISO 639 for the semantic interpretation of a
language code supplemented with a country code are restricted to
terminology. I think we can explicitly state the rule
appropriate for general Internet use of these tags:
"All 2-letter codes are interpreted as denoting a country,
dependency or other area of geopolitical interest according
to the rules of ISO 3166. A language tag with such a subtag
indicates a variant of the language given by the principal
tag that is characteristic of this part of the world. These
subtags should not be used with the principal tag 'iana'."
- Codes of 3 to 8 letters may be registered with the IANA by
anyone who feels a need for it. IANA has the right to reject
registrations that are felt to be misleading.
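The interpretation rules quoted above can be summarized in a short
Python sketch. The function name describe_tag and the wording of the
returned descriptions are hypothetical, chosen only for illustration.
The sketch follows the rules as quoted from the draft and does not
enforce the proposed restriction on 2-letter subtags under "iana".

```python
def describe_tag(principal, subtag=None):
    """Describe how a tag is interpreted under the rules quoted above."""
    p = principal.lower()
    if p == "x":
        # Subtags of "x" are never registered, so nothing more can be said.
        return "private use; not registered by the IANA"
    if p == "iana":
        kind = "IANA-defined subregistration"
    elif len(p) == 2:
        kind = "ISO 639 language code"
    elif len(p) == 3:
        kind = "reserved for an extension to ISO 639"
    else:
        raise ValueError("no other principal tag may be registered: %r" % principal)
    if subtag is None:
        return kind
    if len(subtag) == 2:
        return kind + " with ISO 3166 country code"
    if 3 <= len(subtag) <= 8:
        return kind + " with IANA-registered subtag"
    raise ValueError("subtag must be 2 to 8 letters: %r" % subtag)
```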
The information in the sublanguage tag may for instance be:
Change "sublanguage tag" to "subtag".
- Country identification, such as en-US (this usage is
described in ISO 639)
- Dialect information, such as no-NYNORSK or en-COCKNEY
Nynorsk is not a dialect of Norwegian. It is an example of a
variant of a language which nevertheless is worthy of registering
(as well as the competing form "bokmaal"). For a language that
can be written with different scripts, these can be seen as
different forms of the language for which subtags can be useful,
e.g. az-arabic, az-cyrillic, az-latin for Azerbaijani. I suggest
this wording:
"Dialect information or other variant of a language, such as
en-cockney and no-nynorsk"
- Languages not listed in ISO 639, which can be registered with
the IANA prefix, such as IANA-CHEROKEE
Add here, after "in ISO 639": "or dialects or other variants of
such languages".
To follow the convention "small letters for languages, capitals
for countries", write "no-nynorsk", "en-cockney", "iana-cherokee".
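The case convention can be applied mechanically. A minimal sketch in
Python, assuming that every 2-letter subtag is a country code (the
function name normalize_case is hypothetical):

```python
def normalize_case(tag):
    """Lowercase a tag, except that a 2-letter subtag (a country code) is uppercased."""
    principal, sep, subtag = tag.partition("-")
    principal = principal.lower()
    if not sep:
        return principal  # no subtag present
    if len(subtag) == 2:
        subtag = subtag.upper()  # capitals for countries
    else:
        subtag = subtag.lower()  # small letters for everything else
    return principal + "-" + subtag
```

Since the tags are compared without regard to case in this sketch, the
normalization only affects presentation.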
If multiple languages are used in the MIME body part, they are
listed with commas between them.
This paragraph should be dropped, since this has already been
said in connection with the description of the syntax of the
Language-Header.
At this point text for my first two suggestions can be inserted:
"These three special values are preregistered as subtags for
the principal tag 'iana':
   art   (Some) artificial language. This tag is usable e.g.
         for source code and other files primarily intended for
         interpretation by computer programs. Constructed human
         languages such as Volapuk and Esperanto are not
         regarded as artificial languages in this context.
   hum   (Some) human language without a known registered
         language code.
   ukn   (Some) unknown language(s). This code can be used by
         programs that automatically analyze which human
         language(s) is used in a written text.
An application to IANA for registering a subtag shall
contain these elements, if it concerns a language:
L1) The original name of the language. If it is not
originally written in the Latin script the name shall be
transliterated by the standard transliteration system
for that language or, alternatively, a specified
transliteration system.
L2) The English or French name of the language, preferably
both.
L3) A reference to information about the language in a
specified scholarly work.
If the application concerns a sublanguage, it shall contain
these elements:
S1) The language of which it is a form or dialect.
S2) A name or descriptive phrase for the form or dialect
referred to by the proposed subtag. This shall be given
in the original language, if necessary transliterated
to the Latin script, and may also be given in English
or French.
S3) A reference to information about the form/dialect in a
specified scholarly work."
The following codes have been added in 1989 (nothing later): ug
(Uigur), iu (Eskimo), za (Zhuang), he (Hebrew, replacing iw), yi
Since most Eskimos live in Canada I think we should use the
Canadian name of the language iu. Write:
"... iu (Inuktitut, also called Eskimo), ..."
At this point it might be useful to include some information
about ISO country codes:
"NOTE: A maintenance agency exists for ISO 3166 which makes
additions to and changes in the list of countries and other
geopolitical areas in ISO 3166. This agency is:
ISO 3166 Maintenance Agency Secretariat
c/o DIN Deutsches Institut fuer Normung
Burggrafenstrasse 6
Postfach 1107
D-1000 Berlin 30
Germany
Phone: +49 30 26 01 320
Fax: +49 30 26 01 231
NOTE: ISO 3166 reserves these country codes as
user-assigned codes: AA, QM--QZ, XA--XZ, ZZ."
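A program that interprets 2-letter subtags might want to recognize
the user-assigned codes listed in the second note. A minimal Python
sketch, taking the ranges exactly as given there (the function name
is hypothetical):

```python
def is_user_assigned_country_code(code):
    """True if code is one of AA, QM--QZ, XA--XZ or ZZ, in either case."""
    c = code.upper()
    if len(c) != 2 or not (c.isascii() and c.isalpha()):
        return False
    # For 2-letter uppercase ASCII strings, string comparison checks the range.
    return c in ("AA", "ZZ") or "QM" <= c <= "QZ" or "XA" <= c <= "XZ"
```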
2. MEANING
A better heading, showing that this section only has MIME
significance, would be:
"Meaning of the Language header"
The meaning of the header is:
- For a single information object, it should be taken as the
set of languages that is required for a complete
comprehension of the complete object. Examples: Simple text.
- For an aggregation of information object, it should be taken
Change "object" to "objects", "Examples" to "Example".
as the set of languages used inside components of that
aggregation. Examples: Document stores and libraries.
Should we not have a MIME example here? I propose:
"Example: MIME Multipart/Digest."
- For information objects whose purpose in life is providing
alternatives, it should be regarded as a hint that the
material inside is provided in several languages, and that
one has to inspect each of the alternatives in order to find
its language or languages. In this case, multiple languages
need not mean that one needs to be multilingual to get
complete understanding of the document. Examples: MIME
multipart/alternative.
Change "Examples" to "Example".
EXAMPLES:
I think the examples here will be more illustrative if also a
Content-Type: header is included.
Norwegian official document, with parallel text in both
official versions of Norwegian. Both versions are readable by
all Norwegians.
Add: Content-Type: multipart/mixed
Content-Language: no-nynorsk, no-bokmaal
Voice recording from the London docks
Add: Content-Type: audio/basic
Content-Language: en-cockney
Document in Sami, which does not have an ISO 639 code, and is
spoken in several countries, but with about half the speakers
in Norway
Not only is Sami (formerly called Lappish) a native language for
people in four countries, it is also split into at least six
dialects which are not mutually understandable. Probably several
of these dialects will be registered with different subtags to
"iana". I suggest the following is added to the text above:
"Here the biggest dialect, North Sami, is used."
Also, add: Content-Type: text/plain; charset=ISO-8859-10
Content-Language: iana-sami
Change to: Content-Language: iana-samino
An English-French dictionary
Add: Content-Type: text/plain; charset=ISO-8859-1
Content-Language: en, fr (This is a dictionary)
An official EC document (in a few of its official languages)
This description should be amended with:
"In this case it is necessary to know only one of the
languages to get a full understanding of the document."
Add: Content-Type: multipart/alternative
Content-Language: en, fr, de, da, el, it
Here I would like to add two new examples:
"An English text including untranslated passages in German.
Understanding of both English and German is essential.
Content-Type: text/plain; charset=ISO-8859-1
Content-Language: en, de
An English text including passages in German for which
translations to English are given.
Content-Type: text/plain; charset=ISO-8859-1
Content-Language: en"
An excerpt from Star Trek dialogue
Add: Content-Type: video/mpeg
Content-Language: x-klingon
In my opinion section 3 is unnecessary now, considering the
detailed exposition of different MIME uses of the Language
header. Also, isn't it inappropriate in a MIME extension
document to present suggestions for the design of other
services like WWW?
3. Usage examples
Examples of protocol usage of this header are:
- WWW selection of an appropriate version of information for
display, based on a profile for the user listing languages
that are understood
- MIME usage of alternate body parts in E-mail
4. The differences parameter to multipart/alternative
As defined in RFC 1521, Multipart/Alternative only has one
parameter: boundary.
The common usage of Multipart/Alternative is to have more than one
format of the same message (f.ex. PostScript and ASCII).
Change "f.ex." to "e.g.", "PostScript" to "application/postscript",
and "ASCII" to "text/plain".
6. Character set considerations
Codes are always US-ASCII. The issue of deciding upon the
rendering of a character set based on the language encoding is not
addressed in this memo; however, the author cautions against
thinking that such a decision can be made correctly for all cases
(for example, a rendering engine that decides font based on
Japanese or Chinese language will fail to work when a mixed
Japanese-Chinese text is encountered)
Are references to the opinions of the author appropriate for a
document intended to become an Internet standard? At least this
reflection should be in a separate paragraph, since it has
nothing to do with the first sentence.
Regarding the first sentence, I would like to qualify it a bit.
It should read
"Codes as used in the Language header are always US-ASCII."
There is no reason to restrict the character set when other
protocols use the language tags defined in this document.
7. Gatewaying considerations
RFC 1327 defines a Language: header. This header is not
recommended now, because it is defined to be a single 2-letter
language code, and the X.400 header it is supposed to gateway is a
list of language codes.
It is suggested that RFC 1327 be updated to produce the Content-
Language: header, and to turn this header into the ISO/CCITT
specified Language components rather than the RFC-822-headers
heading extension.
Should a standards track document express opinions about an
administrative matter such as how other Internet standards
should be revised?
8. References
[ISO 639]
ISO 639:1988 (E/F) - Code for the representation of names of
languages - The International Organization for
Standardization, 1st edition, 1988 17 pages Prepared by
ISO/TC 37 - Terminology (principles and coordination)
[ISO 3166]
ISO 3166:1988 - Codes for the representation of names of
countries
Add "(E/F)" after "1988" and "- The International Organization for
Standardization, 3rd edition, 1988-08-15" at the end.
That's all for now. 8-)
/Olle
--
Olle Jarnefors, Royal Institute of Technology, Stockholm
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>