Re: language tags

So, we should have:

      charset=iso-10646-g-*
      charset=iso-10646-t-*
      charset=iso-10646-j-*
      charset=iso-10646-k-*

where "*" is replaced with Devanagari variations. As DIS 10646-1.2 cites
ISCII as its source of Devanagari characters, Devanagari distinctions
should be made according to an Indian standard. Is there an Indian
standard that lists the names of Indian languages in the Latin alphabet?


According to "Writing Systems of the World", by Akira Nakanishi,
India's constitution recognizes 15 official languages:

        Language     Script

    1.  Assamese     Bengali
    2.  Bengali      Bengali
    3.  Gujarati     Gujarati
    4.  Kannada      Kannada or Kanarese
    5.  Kashmiri     Urdu or Arabic
    6.  Malayalam    Malayalam
    7.  Marathi      Devanagari
    8.  Oriya        Oriya
    9.  Punjabi      Gurmukhi
    10. Sanskrit     Devanagari
    11. Tamil        Tamil
    12. Telugu       Telugu
    13. Urdu         Urdu or Arabic
    14. Hindi        Devanagari
    15. English      Latin

The Unicode standard lists many other languages that are written using
the Devanagari script: Nepali, Awadhi, Bagheli, Bhatneri, Bhili,
Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul,
Chhindwara, Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji,
Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari,
Palpa, and Santali.

But perhaps you want to ignore all these other languages since you're
also ignoring the difference between Mandarin and Cantonese (since
they use the same glyphs)?

So if you want to take India's 15 official languages and sort them by
script, you get:

    Language     Script

    Assamese     Bengali
    Bengali      Bengali
    Hindi        Devanagari
    Marathi      Devanagari
    Sanskrit     Devanagari
    Gujarati     Gujarati
    Punjabi      Gurmukhi
    Kannada      Kannada or Kanarese
    English      Latin
    Malayalam    Malayalam
    Oriya        Oriya
    Tamil        Tamil
    Telugu       Telugu
    Kashmiri     Urdu or Arabic
    Urdu         Urdu or Arabic
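
The regrouping above can be reproduced mechanically.  Here is a minimal
Python sketch, assuming the table is held as (language, script) pairs.
The names OFFICIAL, by_script and shared are made up for illustration;
they do not come from any standard.

```python
from itertools import groupby

# (language, script) pairs, copied from the table above.
OFFICIAL = [
    ("Assamese", "Bengali"),
    ("Bengali", "Bengali"),
    ("Gujarati", "Gujarati"),
    ("Kannada", "Kannada or Kanarese"),
    ("Kashmiri", "Urdu or Arabic"),
    ("Malayalam", "Malayalam"),
    ("Marathi", "Devanagari"),
    ("Oriya", "Oriya"),
    ("Punjabi", "Gurmukhi"),
    ("Sanskrit", "Devanagari"),
    ("Tamil", "Tamil"),
    ("Telugu", "Telugu"),
    ("Urdu", "Urdu or Arabic"),
    ("Hindi", "Devanagari"),
    ("English", "Latin"),
]

# Sort by script first, then by language within each script.
by_script = sorted(OFFICIAL, key=lambda pair: (pair[1], pair[0]))

# Keep only the scripts that serve more than one language.
# groupby needs its input sorted by the grouping key, which by_script is.
shared = {}
for script, group in groupby(by_script, key=lambda pair: pair[1]):
    languages = [language for language, _ in group]
    if len(languages) > 1:
        shared[script] = languages

for language, script in by_script:
    print(f"{language:<12} {script}")
```

The shared dictionary ends up holding just the scripts that carry more
than one official language, which are the ones that matter for the
question of per-language glyph variants.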

As you can see, Devanagari is not the only script that is used to
write multiple languages.  (See Bengali and "Urdu or Arabic".)  Are
the languages that use these scripts also written using different
glyphs?  If so, your approach would seem to generalize to:

    charset=iso-10646-<han>-<devanagari>-<bengali>-<urdu/arabic>

Expanding these, we would get:

    charset=iso-10646-g-hindi-assamese-kashmiri
    charset=iso-10646-g-hindi-assamese-urdu
    charset=iso-10646-g-hindi-bengali-kashmiri
    charset=iso-10646-g-hindi-bengali-urdu
    charset=iso-10646-g-marathi-assamese-kashmiri
    charset=iso-10646-g-marathi-assamese-urdu
    charset=iso-10646-g-marathi-bengali-kashmiri
    charset=iso-10646-g-marathi-bengali-urdu
    charset=iso-10646-g-sanskrit-assamese-kashmiri
    charset=iso-10646-g-sanskrit-assamese-urdu
    charset=iso-10646-g-sanskrit-bengali-kashmiri
    charset=iso-10646-g-sanskrit-bengali-urdu
    charset=iso-10646-t-hindi-assamese-kashmiri
    charset=iso-10646-t-hindi-assamese-urdu
    charset=iso-10646-t-hindi-bengali-kashmiri
    charset=iso-10646-t-hindi-bengali-urdu
    charset=iso-10646-t-marathi-assamese-kashmiri
    charset=iso-10646-t-marathi-assamese-urdu
    charset=iso-10646-t-marathi-bengali-kashmiri
    charset=iso-10646-t-marathi-bengali-urdu
    charset=iso-10646-t-sanskrit-assamese-kashmiri
    charset=iso-10646-t-sanskrit-assamese-urdu
    charset=iso-10646-t-sanskrit-bengali-kashmiri
    charset=iso-10646-t-sanskrit-bengali-urdu
    charset=iso-10646-j-hindi-assamese-kashmiri
    charset=iso-10646-j-hindi-assamese-urdu
    charset=iso-10646-j-hindi-bengali-kashmiri
    charset=iso-10646-j-hindi-bengali-urdu
    charset=iso-10646-j-marathi-assamese-kashmiri
    charset=iso-10646-j-marathi-assamese-urdu
    charset=iso-10646-j-marathi-bengali-kashmiri
    charset=iso-10646-j-marathi-bengali-urdu
    charset=iso-10646-j-sanskrit-assamese-kashmiri
    charset=iso-10646-j-sanskrit-assamese-urdu
    charset=iso-10646-j-sanskrit-bengali-kashmiri
    charset=iso-10646-j-sanskrit-bengali-urdu
    charset=iso-10646-k-hindi-assamese-kashmiri
    charset=iso-10646-k-hindi-assamese-urdu
    charset=iso-10646-k-hindi-bengali-kashmiri
    charset=iso-10646-k-hindi-bengali-urdu
    charset=iso-10646-k-marathi-assamese-kashmiri
    charset=iso-10646-k-marathi-assamese-urdu
    charset=iso-10646-k-marathi-bengali-kashmiri
    charset=iso-10646-k-marathi-bengali-urdu
    charset=iso-10646-k-sanskrit-assamese-kashmiri
    charset=iso-10646-k-sanskrit-assamese-urdu
    charset=iso-10646-k-sanskrit-bengali-kashmiri
    charset=iso-10646-k-sanskrit-bengali-urdu
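
The list above is just the Cartesian product of the four sets of
variants.  A rough Python sketch that generates it, assuming the variant
names are spelled exactly as in the list (the constant names HAN,
DEVANAGARI, BENGALI and URDU_ARABIC are only illustrative):

```python
from itertools import product

# One list per script position in the proposed charset name.
HAN = ["g", "t", "j", "k"]
DEVANAGARI = ["hindi", "marathi", "sanskrit"]
BENGALI = ["assamese", "bengali"]
URDU_ARABIC = ["kashmiri", "urdu"]

# product() varies the last position fastest, which matches the
# ordering of the hand-written list.
tags = [
    "charset=iso-10646-" + "-".join(combo)
    for combo in product(HAN, DEVANAGARI, BENGALI, URDU_ARABIC)
]

for tag in tags:
    print(tag)
```

The number of names is the product of the list sizes, 4 x 3 x 2 x 2 = 48,
so each additional script with N variants multiplies the total by N.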

Note that this does not even take into account any of the other
scripts and languages used around the world.  Or are you saying that
the others don't have important glyph differences?  If so, how would
you know that they are not important?  Have you asked the people in
those countries for their opinion?

Or perhaps you're saying that people wouldn't normally mix so many
different languages in one MIME body part?

Or perhaps you're saying that we should at least solve the problem for
g, t, j and k (and maybe Devanagari), and then worry about the other
glyphs later on when there is demand for such distinctions?  (You once
said that you don't want to "overgeneralize".)

Could you elaborate on what you're envisioning?  Please also tell us
what happens when people want to include more stuff in the future.


Erik