ietf-822

10646, moving towards a proposal.

1993-03-15 15:31:02
With apologies to John Klensin, and anyone else who has been disgusted by this
bickering over Charset=10646 issues, I think I can bring us a little closer to
closure.

Originally, the concern was for C/J/K disambiguation.

Somehow the question of Devanagari folding was raised, and I, for one, took it
at face value.

Now I have reason to question whether 10646 in fact folds the scripts of
Devanagari at all.

My recent re-reading of the Unicode specification (from which 10646.1 is largely
derived, and which is the best source available to me at this point, so pipe
down, you all in the peanut gallery) shows that it distinguishes all 9 written
forms of the Devanagari family and has no ambiguity in the code points.

Devanagari  U+0900..097f
Bengali     U+0980..09ff
Gurmukhi    U+0a00..0a7f
Gujarati    U+0a80..0aff
Oriya       U+0b00..0b7f
Tamil       U+0b80..0bff
Telugu      U+0c00..0c7f
Kannada     U+0c80..0cff
Malayalam   U+0d00..0d7f

(The Unicode Standard, Addison Wesley, 1990, ISBN 0-201-56788-1, pp58-67)
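
A minimal sketch of what "no ambiguity in the code-points" means in practice:
because the nine ranges above do not overlap, a single code point is enough to
name its script. The table is copied from the list above; the names
INDIC_BLOCKS and script_of are hypothetical, chosen only for illustration.

```python
# Block ranges copied from the list above (inclusive on both ends).
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali",    0x0980, 0x09FF),
    ("Gurmukhi",   0x0A00, 0x0A7F),
    ("Gujarati",   0x0A80, 0x0AFF),
    ("Oriya",      0x0B00, 0x0B7F),
    ("Tamil",      0x0B80, 0x0BFF),
    ("Telugu",     0x0C00, 0x0C7F),
    ("Kannada",    0x0C80, 0x0CFF),
    ("Malayalam",  0x0D00, 0x0D7F),
]


def script_of(ch):
    """Return the name of the listed block containing ch, or None.

    Hypothetical helper: it only knows the nine blocks listed above.
    """
    cp = ord(ch)
    for name, first, last in INDIC_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def blocks_are_disjoint(blocks):
    """True if no two ranges in the table share a code point."""
    ordered = sorted(blocks, key=lambda b: b[1])
    return all(prev[2] < nxt[1] for prev, nxt in zip(ordered, ordered[1:]))
```

With this table, script_of("\u0915") names Devanagari and script_of("\u0b95")
names Tamil, with no outside tag needed to tell them apart.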

It may be that 10646.2 diverges from this; lacking access to a copy of the spec
for 10646.2, I cannot say. But I have been told that 10646.1 is point-for-point
equivalent to Unicode, so it should have preserved this distinction.

And if 10646.2 in fact adheres to this layout of Devanagari, then
disambiguation is only needed for Han, leaving us with the much more
manageable:

        Charset=10646-T
        Charset=10646-G
        Charset=10646-J
        Charset=10646-K

Furthermore, I am not sure that the distinction between T and G must be made
here, based on the following paragraph, which implies that Unicode itself
distinguishes them:

"GB characters with simplified Kang Xi radicals are placed in a group following
the traditional Kang Xi radical from which the simplified radical is
derived...."

(The Unicode Standard, p116.)

If it should be determined that G/T in fact continue to be disambiguated in
10646.2, then we come back to:

        Charset=10646-C
        Charset=10646-J
        Charset=10646-K

This will allow us unambiguous transmission of poly-lingual messages with
mono-Han content, and should suffice for the majority of E-mail.

Poly-Han content would have to be marked up in some way, either with a parallel
data structure, in-line tags (future RFC), or perhaps via RichText, all of which
is a different problem.

Mail which has no Han content at all might be sent under:

        Charset=10646

With the understanding that, since it might (improperly) contain Han, the
receiving UA should be prepared to deal with this improper Han content, either
by interacting with the user or by presenting the Han in some manner preselected
by the user.
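
The labelling rule above can be sketched roughly as follows. This is an
illustration under stated assumptions, not part of any specification: the Han
range U+4E00..U+9FFF is an assumption (the message does not give one), and the
function names charset_label and han_variant_for are hypothetical.

```python
# ASSUMPTION: Han ideographs occupy U+4E00..U+9FFF. The message itself does
# not state a range; adjust to whatever the standard in use actually assigns.
HAN_FIRST, HAN_LAST = 0x4E00, 0x9FFF

# Variant suffixes from the three-way proposal above.
HAN_VARIANTS = ("C", "J", "K")


def contains_han(text):
    """True if any character falls in the assumed Han range."""
    return any(HAN_FIRST <= ord(ch) <= HAN_LAST for ch in text)


def charset_label(text, han_variant=None):
    """Sender side: pick a Charset value for a mono-Han or Han-free message."""
    if not contains_han(text):
        return "10646"
    if han_variant not in HAN_VARIANTS:
        raise ValueError("Han content needs a variant: C, J or K")
    return "10646-" + han_variant


def han_variant_for(label, user_default):
    """Receiver side: variant to render Han with.

    A bare "10646" label may still (improperly) carry Han, so fall back to
    the presentation the user preselected.
    """
    prefix = "10646-"
    if label.startswith(prefix) and label[len(prefix):] in HAN_VARIANTS:
        return label[len(prefix):]
    return user_default
```

A UA that prefers to ask the user rather than apply a preselected default
would replace the final fallback with a prompt.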

It is to be expected that some UAs will be unable to display some messages for
lack of preparation: the richness of the world's alphabets is expensive to
provide definitions for on disk, and vendors do like to charge for such things.
A UA which attempts poly-lingual receipt must be prepared for an inability to
display such messages. Presumably it will declare the need for a particular
script's font; perhaps it would offer to transliterate to a related script from
a list proffered to the user; perhaps it will simply suggest that the mail be
archived or left unread until the necessary resources are installed.

I leave it to the rest of you to prove my 2 suppositions:

        10646.2  treats Devanagari script family as Unicode does?
        10646.2  adequately distinguishes T/G (the ~2000 simplified Han)?

I think the separate matter of spoken-language-form tagging should be recognized
as the irrelevancy that it is.

As to spell-checking information, that is not as unreasonable as some would
think, but I will agree that it is a bit exotic for Text/Plain.
--
dana s emery <de19(_at_)umail(_dot_)umd(_dot_)edu>

