Re: New content-language draft

I have two comments on substance:

1) The following special values can be useful:

   iana-art   (Some) artificial language, usable e.g. for source
              code, a shell script, a sendmail configuration
              file.

   iana-hum   (Some) human language without a known registered
              language code.

   iana-ukn   (Some) unknown language(s). This code can be used
              by a program that automatically analyzes (or at
              least guesses) which human language(s) is used in
              a written text, but failed in this particular case.

   Maybe it would be appropriate to include these values, which
   can't be expressed by ISO codes, in the Content-Language
   standard?

2) To discourage less serious applications for language codes, I
   propose that some conditions for applications should be
   stated. It seems reasonable to require the _name_ in the
   original language and, if possible, in English and French.
   Also, a reference to the linguistic literature which can be
   used to _identify_ the (sub)language shall be given.

The rest of my comments have to do only with presentation, not
substance, and can be skipped by everyone not particularly
interested.

    Abstract

    This document describes a Content-Language: header for use with
    body parts of MIME.


Not only body parts but whole messages, which need not be MIME
messages. Write instead:

   "... for use in RFC 822 messages and body parts of MIME messages."

    It also describes a new parameter to the Multipart/Alternative
    type, to aid in the usage of the Content-Language: header.


As I see it, this standard will do two things:

a) extend the functionality of the MIME format for email

b) introduce a registration mechanism for language and
   sublanguage codes.

The latter is important and directly usable not only for email
but also for many other application layer services, e.g. WWW,
Whois++, URCs. Many of these will probably use the language
codes but not the Content-Language: header. Therefore, add to
the Abstract:

   "The registration mechanism for language and sublanguage
    codes introduced here can be used also by other protocols
    which need to indicate natural language."

    1.  The Language tag


I suggest that this section be split into two sections, with the
headings:

   "Syntax of the Language header and language tags"

   "Registration of language tags"

The present section 1 should be split before the paragraph:

    The namespace of language tags and subtags is administered by the
    IANA. The following registrations are predefined:


The new section 1 should start with:

    The syntax of this header in RFC-822 EBNF is:


The present first paragraph

    The language tag is composed of 2 parts: A language tag and a
    subtag.


should be moved to after the paragraph

    Note that the Language-Header is allowed to list several languages
    in a comma-separated list.


The term "language tag" is used in two different senses in the
paragraph I want to move. I would prefer that the first part of
the whole tag is called "principal tag", so the paragraph would
become:

   "The language tag is composed of 2 parts: A principal tag
    and, optionally, a subtag."

    Language-Header = "Content-Language" ":" 1#Language
    Language ::= 1*8ALPHA [ '-' 1*8ALPHA ]


To use exactly the same notation as RFC 822, change "::=" to "="
in the definitions.

It would be better to use "Language-Tag" instead of "Language"
in these definitions (it's the name of the symbol, not of the thing
the symbol denotes).

Why double quotes in the first definition and single quotes, or
rather apostrophes, in the second? The RFC 822 meta language
doesn't use apostrophes.

It might be added here that white space can be used between and
after the tokens on the right-hand side of the Language-Header
definition according to RFC 822 rules, but not around the "-" in
the Language definition.
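
To make the syntax concrete, here is a rough Python sketch of how a
receiver might split a Content-Language field body into tags. The
function names are my own, not part of the draft, and I assume the
usual RFC 822 treatment of parenthesised comments and of null
elements in "#" lists. Quoted pairs inside comments are ignored for
brevity.

```python
import re

# Language-Tag = 1*8ALPHA [ "-" 1*8ALPHA ], no white space around "-".
TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})?")


def strip_comments(value):
    """Remove RFC 822 parenthesised comments; nesting is allowed."""
    out, depth = [], 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def parse_content_language(value):
    """Return the list of language tags in a Content-Language field body."""
    tags = []
    for item in strip_comments(value).split(","):
        item = item.strip()
        if not item:
            continue  # null list elements do not count
        if not TAG_RE.fullmatch(item):
            raise ValueError("not a valid language tag: %r" % item)
        tags.append(item)
    if not tags:
        raise ValueError("at least one language tag is required")
    return tags
```

With this sketch, "en, fr (This is a dictionary)" yields the two tags
"en" and "fr", while "en - US" is rejected because of the white space
around the hyphen.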

    The namespace of language tags and subtags is administered by the
    IANA. The following registrations are predefined:


Delete "and subtags".

    In the language tag:


Change "language tag" to "principal tag".

    -    All 2-letter codes are interpreted according to ISO 639.

    -    All 3-letter codes are reserved for a (hopefully) forthcoming
         extension to ISO 639

    -    The value "IANA" is reserved for IANA-defined
         subregistrations


Write "iana" to follow the case convention for ISO language codes.

    -    The value "X" is reserved for private use. Subtags of "X"
         will not be registered by the IANA.


Write "x" for the same reason.

    -    No other registration is allowed.

    In the sublanguage tag:


Change "sublanguage tag" to "subtag". (Some subtags will denote
independent languages, not sublanguages.)

    -    All 2-letter codes are interpreted as ISO 3166 country codes,
         according to the rules laid down in ISO 639.


The rules in ISO 639 for the semantic interpretation of a
language code supplemented with a country code are restricted to
terminology. I think we can explicitly state the rule
appropriate for general Internet use of these tags:

   "All 2-letter codes are interpreted as denoting a country,
    dependency or other area of geopolitical interest according
    to the rules of ISO 3166.  A language tag with such a subtag
    indicates a variant of the language given by the principal
    tag that is characteristic of this part of the world. These
    subtags should not be used with the principal tag 'iana'."

    -    Codes of 3 to 8 letters may be registered with the IANA by
         anyone who feels a need for it. IANA has the right to reject
         registrations that are felt to be misleading.
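
The rules quoted above for the two parts of a tag can be summarised
in a small Python sketch. The function names and the category labels
returned are only my own illustration, not terms from the draft, and
lengths the draft does not cover are reported as "unspecified".

```python
def _is_ascii_alpha(s):
    """True if s is a non-empty string of ASCII letters only."""
    return s.isascii() and s.isalpha()


def classify_principal(tag):
    """Classify the first part of a language tag by the draft's rules."""
    t = tag.lower()
    if not _is_ascii_alpha(t):
        return "invalid"
    if t == "iana":
        return "iana"            # IANA-defined subregistrations
    if t == "x":
        return "private-use"     # subtags never registered by IANA
    if len(t) == 2:
        return "iso-639"         # 2-letter ISO 639 language code
    if len(t) == 3:
        return "reserved"        # kept for an extension to ISO 639
    return "not-registrable"     # "No other registration is allowed."


def classify_subtag(subtag):
    """Classify the second part of a language tag by the draft's rules."""
    s = subtag.lower()
    if not _is_ascii_alpha(s) or len(s) > 8:
        return "invalid"
    if len(s) == 2:
        return "iso-3166"        # country code
    if len(s) >= 3:
        return "iana-registered" # 3 to 8 letters, registered with IANA
    return "unspecified"         # 1-letter subtags are not covered
```

For example, classify_principal("no") gives "iso-639" and
classify_subtag("nynorsk") gives "iana-registered".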

    The information in the sublanguage tag may for instance be:


Change "sublanguage tag" to "subtag".

    -    Country identification, such as en-US (this usage is
         described in ISO 639)

    -    Dialect information, such as no-NYNORSK or en-COCKNEY


Nynorsk is not a dialect of Norwegian. It is an example of a
variant of a language which is nevertheless worth registering
(as is the competing form "bokmaal"). For a language that
can be written in different scripts, these can be seen as
different forms of the language for which subtags can be useful,
e.g. az-arabic, az-cyrillic, az-latin for Azerbaijani. I suggest
this wording:

   "Dialect information or other variant of a language, such as
    en-cockney and no-nynorsk"

    -    Languages not listed in ISO 639, which can be registered with
         the IANA prefix, such as IANA-CHEROKEE


Add here, after "in ISO 639": "or dialects or other variants of
such languages".

To follow the convention "small letters for languages, capitals
for countries", write "no-nynorsk", "en-cockney", "iana-cherokee".
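
As an illustration of this case convention, a minimal Python sketch
follows. The function name is hypothetical, and I assume that every
2-letter subtag is a country code, as the draft's rules state.

```python
def normalize_case(tag):
    """Lower-case language parts, upper-case 2-letter country subtags."""
    principal, sep, subtag = tag.partition("-")
    principal = principal.lower()
    if not sep:
        return principal
    # A 2-letter subtag is a country code and is written in capitals;
    # longer subtags name languages or variants and use small letters.
    subtag = subtag.upper() if len(subtag) == 2 else subtag.lower()
    return principal + "-" + subtag
```

This turns "no-NYNORSK" into "no-nynorsk" and "EN-us" into "en-US".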

    If multiple languages are used in the MIME body part, they are
    listed with commas between them.


This paragraph should be dropped, since this has already been
said in connection with the description of the syntax of the
Language-Header.

At this point text for my first two suggestions can be inserted:

   "These three special values are preregistered as subtags for
    the principal tag 'iana':

    art   (Some) artificial language. This tag is usable e.g.
          for source code and other files primarily intended for
          interpretation by computer programs. Constructed human
          languages such as Volapuk and Esperanto are not
          regarded as artificial languages in this context.

    hum   (Some) human language without a known registered
          language code.

    ukn   (Some) unknown language(s). This code can be used by
          programs that automatically analyze which human
          language(s) is used in a written text.

    An application to IANA for registering a subtag shall
    contain these elements, if it concerns a language:

    L1) The original name of the language. If it is not
        originally written in the Latin script the name shall be
        transliterated by the standard transliteration system
        for that language or, alternatively, a specified
        transliteration system.

    L2) The English or French name of the language, preferably
        both.

    L3) A reference to information about the language in a
        specified scholarly work.

    If the application concerns a sublanguage, it shall contain
    these elements:

    S1) The language of which it is a form or dialect.

    S2) A name or descriptive phrase for the form or dialect
        referred to by the proposed subtag. This shall be given
        in the original language, if necessary transliterated
        to the Latin script, and may also be given in English
        or French.

    S3) A reference to information about the form/dialect in a
        specified scholarly work."

    The following codes have been added in 1989 (nothing later): ug
    (Uigur), iu (Eskimo), za (Zhuang), he (Hebrew, replacing iw), yi


Since most Eskimos live in Canada I think we should use the
Canadian name of the language iu. Write:

   "... iu (Inuktitut, also called Eskimo), ..."

At this point it might be useful to include some information
about ISO country codes:

   "NOTE: A maintenance agency exists for ISO 3166 which makes
    additions to and changes in the list of countries and other
    geopolitical areas in ISO 3166. This agency is:

       ISO 3166 Maintenance Agency Secretariat
       c/o DIN Deutsches Institut fuer Normung
       Burggrafenstrasse 6
       Postfach 1107
       D-1000 Berlin 30
       Germany
       Phone: +49 30 26 01 320
       Fax:   +49 30 26 01 231

    NOTE: ISO 3166 reserves these country codes as
    user-assigned codes: AA, QM--QZ, XA--XZ, ZZ."
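
A program that wants to recognise these user-assigned codes could
test for them roughly as in the Python sketch below. The function
name is hypothetical, and the ranges are simply those listed in the
note above.

```python
def is_user_assigned_country(code):
    """True for the codes AA, QM--QZ, XA--XZ and ZZ."""
    c = code.upper()
    if len(c) != 2 or not (c.isascii() and c.isalpha()):
        return False
    if c in ("AA", "ZZ"):
        return True
    if c[0] == "Q":
        return "M" <= c[1] <= "Z"   # QM through QZ
    return c[0] == "X"              # XA through XZ
```

Here "QM" and "xk" are reported as user-assigned, while "QL" and "SE"
are not.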

    2.  MEANING


A better heading, showing that this section only has MIME
significance, would be:

   "Meaning of the Language header"

    The meaning of the header is:


    -    For a single information object, it should be taken as the
         set of languages that is required for a complete
         comprehension of the complete object. Examples: Simple text.

    -    For an aggregation of information object, it should be taken


Change "object" to "objects", "Examples" to "Example".

         as the set of languages used inside components of that
         aggregation.  Examples: Document stores and libraries.


Should we not have a MIME example here? I propose:

   "Example: MIME Multipart/Digest."

    -    For information objects whose purpose in life is providing
         alternatives, it should be regarded as a hint that the
         material inside is provided in several languages, and that
         one has to inspect each of the alternatives in order to find
         its language or languages.  In this case, multiple languages
         need not mean that one needs to be multilingual to get
         complete understanding of the document. Examples: MIME
         multipart/alternative.


Change "Examples" to "Example".

         EXAMPLES:


I think the examples here will be more illustrative if a
Content-Type: header is also included.

         Norwegian official document, with parallel text in both
         official versions of Norwegian. Both versions are readable by
         all Norwegians.


Add:         Content-Type: multipart/mixed

           Content-Language: no-nynorsk, no-bokmaal

         Voice recording from the London docks


Add:         Content-Type: audio/basic

           Content-Language: en-cockney

         Document in Sami, which does not have an ISO 639 code, and is
         spoken in several countries, but with about half the speakers
         in Norway


Not only is Sami (formerly called Lappish) a native language for
people in four countries, it is also split into at least six
dialects which are not mutually intelligible. Probably several
of these dialects will be registered with different subtags to
"iana". I suggest that the following be added to the text above:

   "Here the biggest dialect, North Sami, is used."

Also, add:   Content-Type: text/plain; charset=ISO-8859-10

           Content-Language: iana-sami


Change to:   Content-Language: iana-samino

         An English-French dictionary


Add:         Content-Type: text/plain; charset=ISO-8859-1

           Content-Language: en, fr (This is a dictionary)

         An official EC document (in a few of its official languages)


This description should be amended with:

   "In this case it is necessary to know only one of the
    languages to get a full understanding of the document."

Add:         Content-Type: multipart/alternative

           Content-Language: en, fr, de, da, el, it


Here I would like to add two new examples:

   "An English text including untranslated passages in German.
    Understanding of both English and German is essential.

      Content-Type: text/plain; charset=ISO-8859-1
      Content-Language: en, de

    An English text including passages in German for which
    translations to English are given.

      Content-Type: text/plain; charset=ISO-8859-1
      Content-Language: en"

         An excerpt from Star Trek dialogue


Add:         Content-Type: video/mpeg

           Content-Language: x-klingon


In my opinion section 3 is unnecessary now, considering the
detailed exposition of different MIME uses of the Language
header. Also, isn't it inappropriate in a MIME extension
document to present suggestions for the design of other
services like WWW?

    3.  Usage examples

    Examples of protocol usage of this header are:


    -    WWW selection of an appropriate version of information for
         display, based on a profile for the user listing languages
         that are understood

    -    MIME usage of alternate body parts in E-mail

    4.  The differences parameter to multipart/alternative

    As defined in RFC 1541, Multipart/Alternative only has one
    parameter: boundary.

    The common usage of Multipart/Alternative is to have more than one
    format of the same message (f.ex. PostScript and ASCII).


Change "f.ex." to "e.g.", "PostScript" to "application/postscript",
and "ASCII" to "text/plain".

    6.  Character set considerations

    Codes are always US-ASCII. The issue of deciding upon the
    rendering of a character set based on the language encoding is not
    addressed in this memo; however, the author cautions against
    thinking that such a decision can be made correctly for all cases
    (for example, a rendering engine that decides font based on
    Japanese or Chinese language will fail to work when a mixed
    Japanese-Chinese text is encountered)


Are references to the opinions of the author appropriate for a
document intended to become an Internet standard? At least this
reflection should be in a separate paragraph, since it has
nothing to do with the first sentence.

Regarding the first sentence, I would like to qualify it a bit.
It should read

   "Codes as used in the Language header are always US-ASCII."

There is no reason to restrict the character set when other
protocols use the language tags defined in this document.

    7.  Gatewaying considerations

    RFC 1327 defines a Language: header. This header is not
    recommended now, because it is defined to be a single 2-letter
    language code, and the X.400 header it is supposed to gateway is a
    list of language codes.

    It is suggested that RFC 1327 be updated to produce the Content-
    Language: header, and to turn this header into the ISO/CCITT
    specified Language components rather than the RFC-822-headers
    heading extension.


Should a standards track document express opinions about an
administrative matter such as how other Internet standards
should be revised?

    8.  References


    [ISO 639]
          ISO 639:1988 (E/F) - Code for the representation of names of
         languages - The International Organization for
         Standardization, 1st edition, 1988 17 pages Prepared by
         ISO/TC 37 - Terminology (principles and coordination)


    [ISO 3166]
         ISO 3166:1988 - Codes for the representation of names of
         countries


Add "(E/F)" after "1988" and "- The International Organization for
Standardization, 3rd edition, 1988-08-15" at the end.

That's all for now. 8-)

/Olle

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>