Language tags and 10646


Several points have been raised in different messages that relate to
important issues with the language header (or parameter) suggestion.
I'm going to try to draw them together and see if I can say something 
coherent.

Date: Fri, 05 Mar 1993 20:03:35 -0500
From: Dana S Emery <de19(_at_)umail(_dot_)umd(_dot_)edu>

The problem is that when 10646 is sent, complete polylingual display must be
anticipated by the receiving UA (to the best of its ability), as the user is
expecting to see that full polylingual message in all its glory.  

The user expects to see the sent-japanese rendered as japanese, and the
sent-chinese as chinese, and isn't going to be interested in excuses as to
why it
...
And yes, some real-amount of traffic will contain poly-c/j/k.


I think this desire is clear.  I also think that we have managed to
demonstrate conclusively that 10646 isn't up to the job.  That leaves
several options:
   (i) We can pretend that SC2/WG2 is still at work and hasn't emitted a
product, or that it emitted what is, for our purposes, total trash, and
try to figure out how we would handle this if there were no 10646 or any
prospect of one.  Much as I fear it for other reasons, we have one
option on the table already in the form of 2022; there are probably
others.
   (ii) We can use language tagging at the body part level and accept the
fact that this implies that poly-c/j/k doesn't go with text/plain and
must be handled either with multipart or with in-text language markup
(even if the character codes come from 10646).
   (iii) We can decide that the likely frequency of potentially-ambiguous
polylingual materials is so high that per-body-part tagging is pointless
and we need to use 10646 *only* with in-text language markup.
   (iv) We can decide that, if the users want the precision of rendering
that Dana implies, their expectations are really going to include having
the message arrive in exactly the same form it leaves, i.e., that their
expectations are for virtual facsimile, not what we have normally thought
of as textual email.   I think that expectation will occur, and it will
be difficult to come up with satisfactory reasons why it can't be met. 
But the solution to it lies in PDLs (e.g., Postscript) or in images
(e.g., G3FAX) not in worrying about how to constrain interpretation of
10646 to do what is wanted.
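
To make option (ii) concrete, here is a minimal sketch in Python of a
sender putting Japanese and Chinese text into separate body parts, each
carrying its own language tag.  Nothing in this discussion fixes a syntax:
the header name "Content-Language", the tag values, and the use of UTF-8
as a stand-in encoding for 10646 are all assumptions made for illustration.

```python
from email.message import EmailMessage


def tagged_part(text, language):
    """Build one text/plain body part carrying a language hint.

    "Content-Language" is a hypothetical header name for this sketch;
    UTF-8 stands in for whatever 10646 encoding would actually be used.
    """
    part = EmailMessage()
    part.set_content(text, charset="utf-8")
    part["Content-Language"] = language
    return part


def build_message(parts):
    """Wrap (text, language) pairs in a multipart/mixed message."""
    msg = EmailMessage()
    msg["Subject"] = "poly-c/j/k example"
    msg.make_mixed()
    for text, language in parts:
        msg.attach(tagged_part(text, language))
    return msg
```

Each part stays text/plain; the language distinction lives entirely in the
per-part header, which a receiver is free to use or ignore.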

I am proposing a somewhat brutal solution here, based on the knowledge
that the overwhelming amount of correspondence in the world (email,
paper, etc.) is monolingual.  Sure, polylingual texts occur, but perhaps
we should solve the monolingual problem in a clean and simple way and
then work on a more complex solution to deal with the subtleties of
multilingual text.

Now, if we are going to language-tag body parts in text/plain, do we
need 10646?  That is a very interesting question.  I don't know the
answer, but we do have character groups in 10646 that don't exist in any
of the 8859 sets.  Maybe that is enough.   In addition...

Date: Fri, 05 Mar 1993 20:31:44 -0500
From: lee(_at_)sqlee(_dot_)sq(_dot_)com (Liam R. E. Quin)

It seems to me that if, <FR>par example</FR>, I want to include material
in a message that is in a different language, and I can only do so by
splitting out the message into a sequence (or my mailer can?) that this
is perhaps inadequate.

Of course, to some extent RichText or even text/SGML could handle this,
although not variant Encodings (character sets), which isn't quite the
same thing.


If we want to permit this as plain text (the example, interestingly, uses
in-text language markup, and I don't think that is text/plain any more),
then we need to use at least 8859-1 (to accommodate English and French)
and might want 10646.  Insertion of short phrases from any of several
Eastern European languages would force us to 10646.
  This also suggests that a per-body-part language tag structure might
want to permit a short list of languages present in the message.  As
long as we define the tags as advice the sender is providing the
receiver in a canonical way (rather than something that the sender must
provide and the receiver must interpret), I don't see this as a problem.
Now this suggests that, as a result of the way 10646 is structured,
(English,French) is fairly useful advice and (Japanese,Chinese) is
probably not useful advice.  I don't think we need to write rules about
that.  I don't think we need to be defensive about it either--it is
ISO's fault and, if we don't like it, our only other options are (i) and
(iii) above.
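
As a rough illustration of why one list is useful advice and the other is
not, the sketch below has a receiver look at a language list and ask only
one question: does it name exactly one Han-using language, so that a Han
rendering can be chosen?  The comma-separated syntax, the language names,
and the function names are assumptions, not anything agreed here.

```python
# Languages whose text draws on the unified Han characters in 10646.
HAN_LANGUAGES = {"chinese", "japanese", "korean"}


def parse_language_list(value):
    """Split a comma-separated language tag into lowercase names."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def han_font_hint(value):
    """Return the single Han-using language named in the tag, else None.

    None means the tag gives no usable advice about Han rendering:
    either no Han-using language is listed, or more than one is.
    """
    han = [lang for lang in parse_language_list(value) if lang in HAN_LANGUAGES]
    return han[0] if len(han) == 1 else None
```

With this, "English, Japanese" yields a usable hint, while
"Japanese, Chinese" yields none, and the receiver falls back on whatever
default it would have used without the tag.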

Date: Fri, 05 Mar 1993 20:38:17 -0500
From: Dana S Emery <de19(_at_)umail(_dot_)umd(_dot_)edu>

could well be, I was hoping to avoid delving *deeply* into natural language
parsing (ugh, shudder), but had hoped that enough information could be gleaned
from quotation marks and other punctuation to be useful.

Of course, if we can actually demonstrate that no heuristic is likely to be
satisfactory, then that will allow us to move on to consider more-intrusive
solutions, possibly 

  charset=cjk-tagged-10646.


My biases are that, if a receiving UA is forced into heuristics of any
sort--much less natural language parsing of multilingual text--in order
to figure out what to do with incoming text, we have lost it.  The idea
is just a non-starter.   I certainly have no problems with people
writing UAs that take advantage of whatever they can deduce, but we had
better not come up with a solution that requires such cleverness as a
minimum condition for interoperability and meeting of user expectations.

But, having been around on this one many times, I've concluded that,
while "10646-kanji-sanscrit" is probably a character set (although
I think I like "10646" and some separable language cues better for
several reasons), "xyz-tagged-NNNN" isn't a character set any more. 
Unless there is a real dynamic switching mechanism (e.g., as in 
iso-2022-jp), it also isn't "text/plain", since an interpreter that is
not part of the basic character mechanism is needed to cope with it in
an intelligent way.  Maybe someone needs to propose
   text/language-tagged; charset=10646
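
A small sketch of how a receiving UA might tell the two cases apart,
using Python's standard email package to read the Content-Type value.
The type "text/language-tagged" and the charset name "10646" are the
hypothetical ones floated above, not registered values, and the
needs_interpreter rule is just one reading of the argument.

```python
from email.message import Message


def classify(content_type_value):
    """Return (media type, charset, needs_interpreter) for a Content-Type.

    needs_interpreter is True for anything other than text/plain, that
    is, whenever more than the basic character mechanism is required.
    """
    msg = Message()
    msg["Content-Type"] = content_type_value
    media_type = msg.get_content_type()
    charset = msg.get_param("charset")
    needs_interpreter = media_type != "text/plain"
    return media_type, charset, needs_interpreter
```

The character set parameter stays a pure character set; the need for a
language-tag interpreter is signalled by the type, where a UA that does
not understand it can see that before trying to display anything.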

Date: Sat, 06 Mar 1993 10:21:59 +0900
From: erik(_at_)poel(_dot_)juice(_dot_)or(_dot_)jp (Erik M. van der Poel)

John, I'm trying not to get on your nerves, but can you give us an
example of an Asian character where the rendering in some font would
lead to a "loss of information"?

Here are some examples that have been brought to my attention, all
from the 2nd DIS of 10646: 4e0e, 5094, and 8aa7.

Re: "loss of information": If the receiver is Japanese and he sets his
Han font to a Japanese font, then he will see the character displayed
the way he is used to.  So there is no "loss of information".


You aren't getting on my nerves.  We are trying to work through a very
complex problem here, and as long as people are working on solutions and
trying to explain positions, rather than calling names... :-)

Let me simplify the discussion by accepting your examples.  The point is
that sending a string of octets is not sufficient to get characters
displayed "close enough" to the way one is used to--one needs the
supplemental two bits to determine how to set the Han font in the
receiving display device.  The degree of information loss if the Han
rendering is set wrong is a function of the patience and understanding
of the receiving user, who is very much part of the system.

On the other hand, if the sender is Japanese and he wants to show how
he writes his name to his Chinese colleague, and if his name just
happened to contain one of the few characters where there is a
noticeable difference in the typical CJK renderings, then there would
be "loss of information" if he sent the 10646 characters without any
language info and without any font info.  But that's *his* fault, for
not providing that info.

   Right.  All I'm suggesting is a canonical way for him to provide the
two additional bits of language information--again, not "font", but
"language".  If he chooses not to, or his colleague's receiving software
chooses to discard those bits, then it is not our problem.
   And I think we are actually talking about providing two separate ways
of doing it.  We can agree on one now; the other is going to require
some thinking and a separate RFC:
   --language-tagging body parts, which implies that the "this is how I
write my name" text goes into one body part with a language tag of
"Chinese" and the form of the name goes into another body part with a
language tag of "Japanese" (actually, if I were faced with that problem,
the second part would be application/image, since I'd want the receiver
to see the exact font rendering of those characters, not just however
his UA happened to render the 10646 codes in a Japanese-Han
interpretation).
   -- Explicitly language-tagged text, in which the language could be
dynamically changed inside the text stream.
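
For the second approach, the sketch below splits a text stream into
(language, text) runs.  It borrows the <FR>...</FR> style from Liam's
example purely for illustration; the two-letter tag form, the default
language, and the function name are all assumptions, and a real proposal
would have to define its own syntax.

```python
import re

# Hypothetical in-text markup: <XX> opens a language run, </XX> closes it.
TAG = re.compile(r"<(/?)([A-Z]{2})>")


def language_runs(text, default="EN"):
    """Split text into (language, text) runs, dropping the tags themselves."""
    runs = []
    stack = [default]  # innermost active language is stack[-1]
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            runs.append((stack[-1], text[pos:match.start()]))
        closing, lang = match.group(1), match.group(2)
        if closing:
            # Pop only a matching close; stray closing tags are ignored.
            if len(stack) > 1 and stack[-1] == lang:
                stack.pop()
        else:
            stack.append(lang)
        pos = match.end()
    if pos < len(text):
        runs.append((stack[-1], text[pos:]))
    return runs
```

Note that this needs an interpreter sitting above the character set,
which is exactly why it cannot be folded into the body-part tagging we
can agree on now.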

From: erik(_at_)poel(_dot_)juice(_dot_)or(_dot_)jp (Erik M. van der Poel)
Subject: Re: language tags
Hmmm...  Interesting.  Here again we have the "the way the sender sees
it, and the way the receiver wants to see it" dichotomy.  For example,
the Japanese name 九州 is sometimes transliterated as Kyushu (when the
reader is expected to be an English-speaker), but I have also seen it
transliterated as Kjoesjoe when the audience is expected to be Dutch.
...


Several more good examples why language tagging has to be considered as
advice from the sender to the receiver, which the receiver can put to
whatever good uses it sees fit (and suffer the wrath of his users if he
gets it wrong).

----------------------------------
Obnoxious summary and proposal:

My goal here is to get us unstuck and moving forward, not to figure out
how many years a perfect solution will take.  To put that differently, I
want to see enough separation of the common and moderately easy cases
from the rare and complex ones that we can reach closure on the former
and move the latter aside for additional study and work.

We accept a couple of things as being unpleasant reality.  (a) 10646 is
not as good as it could be at making distinctions that some people think
are important.  Because of that, the use of 10646 without additional
hints will result in what some reasonable people will perceive as an
unacceptable level of information loss.  No IETF decisions will change
their minds.  (b) No "character set" is an adequate substitute for
well-designed markup, page description languages, or transmission of
images when the sender believes that there is important information
embedded in actual fonts or other precise details of presentation.

(1) We reaffirm the principle that text/plain implies text that is not
expected to contain markup, for languages or anything else.  Nothing
prevents transmitting marked-up text with text/plain, but a receiving UA
would be expected to display such tags, not process or interpret them.

(2) We provide a body-part language tagging capability that can take a
list of languages.  I have no real preference about syntax.  We define
it strictly in terms of hints that senders are encouraged to give
receivers when they think that would be helpful and that receivers are
encouraged to make sensible use of.  We give some examples of contexts
in which they are likely to be helpful, including C/J/K differentiation
and the interesting multipart/alternative with different translations
one.  We don't define "sensible" or any of the other weasel-words in
these statements.  If the information that senders give receivers is not
sufficient to overcome whatever information loss is perceived as
occurring from 10646 alone, we let them work it out.

(3) People who are worried about precise identification of multilingual
texts should go off and put together a definition of a good application
type.  As a hint, a project called the Text Encoding Initiative, which
contains lots of the right sorts of librarians, linguists, lexicographers,
and so on has already done a *lot* of work in this area and has been
working out SGML DTDs and other things to handle both simple cases and
ones whose complexity defies easy description.  Maybe we could just
adopt their work :-)

(4) If lightweight intra-body-part language switching is needed, then
someone should make a specific proposal and make either an application
type out of it or, more likely, text/other-than-plain.  As with (3), new
proposal, new RFC.
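
To illustrate the multipart/alternative case mentioned under (2), here is
a sketch of a receiver choosing among translations by the reader's
language preferences.  It assumes a hypothetical "Content-Language"
header holding a comma-separated list; since (2) deliberately leaves
"sensible" undefined, the fallback to the last alternative is only one
possible policy.

```python
from email.message import EmailMessage


def pick_translation(msg, preferred):
    """Return the first body part matching the reader's preferences.

    msg is a multipart message whose parts may carry a language hint;
    preferred is an ordered list of language names.  If nothing matches,
    fall back to the last part, or None if there are no parts.
    """
    parts = list(msg.iter_parts())
    for want in preferred:
        for part in parts:
            tag = part.get("Content-Language", "")
            languages = [item.strip().lower() for item in tag.split(",")]
            if want.lower() in languages:
                return part
    return parts[-1] if parts else None
```

The tag is treated strictly as advice: a part with no tag is never
rejected, it is simply never the one selected by a language match.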

   --john