Re: 10646, and all that

Ohta-san,

I believe that many of us recognize the problems you are pointing out
and are trying to work with you in solving them.  At least some of us
believe that the problems are complex and that there is not a single
simple solution that meets all needs.   It does not help with this
process when you make postings that appear to be deliberately
obstructive, with no good reason.

From 822ext-minutes-91nov.txt:
...
      (d) Character set issues
...
The above definition states that "charset", not other header fields,
should contain enough profiling information to disambiguate unified
Han characters.

   The cited definition is very old.  It was a working definition,
produced at a WG meeting, which [many of] the members of the WG never
formally endorsed.  Unlike the text of RFC1341, it never went through an
IETF "last call" review, IESG/IAB review, etc.   It was written before
the Han-unified version of 10646 emerged.  At this point, it provides a
useful reminder of the WG position at one point in time.  It may help us
understand that many people in the WG were naive about the complexity of
character set issues 16 months ago.
   But almost no one would write that definition today, and quoting it
repeatedly as if it were universal, binding truth from which subtle
additional meanings can be drawn is inappropriate and not constructive.
   Let's try to solve the problem instead.

No strong preference.  What I was really arguing for was separating the
language info from the char set name.


Why do you think the separation is necessary?


Because I'm seeking a position that is technically reasonable, symmetric
across languages, and that people can deal with.  It is clear that we
aren't going to get general agreement on forcing tagging for all
languages.  It is clear that we aren't going to get general agreement on
specification of a list of languages each time the string "10646"
appears.  In both cases, this isn't a matter of "compromise", it is
because that information is *supplemental* -- unimportant sometimes and
very important other times, and possibly problematic to supply when it
is unimportant.  And it is interesting that the ability to designate
language even when it is not needed to clarify a character set may
leverage other useful things.   I think those are good design arguments.
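
A minimal sketch of what that separation could look like to a receiving
program, using Python's standard email package.  The charset label
"ISO-10646", the language value "Japanese", and the helper name
text_labels are assumptions made for illustration only; none of them is
an agreed or registered value.  The point is only that the language
label is read independently of the charset name and may simply be
absent:

```python
from email import message_from_string


def text_labels(raw_message):
    """Return (charset, language); language is None when the sender did not tag it."""
    msg = message_from_string(raw_message)
    # The charset comes from the Content-Type parameter (lower-cased by the library).
    charset = msg.get_content_charset()
    # The language is a separate, optional header; header lookup is case-insensitive.
    language = msg.get("Content-language")
    if language is not None:
        language = language.strip()
    return charset, language


# Hypothetical sample messages: same charset label, with and without a language tag.
tagged = (
    "Content-type: text/plain; charset=ISO-10646\n"
    "Content-language: Japanese\n"
    "\n"
    "body\n"
)
untagged = "Content-type: text/plain; charset=ISO-10646\n\nbody\n"

print(text_labels(tagged))
print(text_labels(untagged))
```

In this sketch the charset name is identical in both cases; whether the
missing language label matters is left to the receiver, not to the
charset registry.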

As I have said before, we would not need to do any of this if 10646 were
really adequate to the role to which we would like to assign it.  It
isn't.  If you don't like that, take it up with ISO.  And, as we have 
discussed in private, while I understand that Japan voted against 10646
DIS-2 at the JTC1 level, I also understand that, had Japan felt very
strongly about this and been able to find a single additional JTC1
P-member to agree, 10646 could easily have been buried in ISO
procedures, probably into the next century.    It is consequently
rational to deduce that there is no strong majority in the Japanese
standards community holding that the unification issue is *that*
important, all of the time.

Currently, the only character encoding which needs language information
are, as far as I know, ISO 646, ISCII and ISO 10646.


   One value of separating language information from the character set
name is that "need" is in the mind of the beholder.  Examples have been
given on this list of situations in which the information can be useful
even with character sets as relatively precise as 8859-1.  Conversely,
there are clearly situations in which unified Han are interpretable from
context, however un-aesthetic or un-linguistic that might be.  The
reality is that the issue isn't a binary construction like "need", but a
scale from "harmless but probably not worth the trouble" to "required
for proper interpretation by many users".  The decision as to which is
which and whether to go to the trouble is best left in the hands of
senders and receivers, not network-wide requirements.

In the case of ISO 646, we have assigned different charset names to
each national variant.

    And deprecated their use.   But these are national variants
recognized by ISO, and national variants in which the character
descriptions and names drawn from the repertoire are different.

Or, are you saying that the following specification:
      Content-type: text/plain; charset=ASCII
      Content-language: French

   We have prohibited "ASCII" after "charset=" and my French colleagues
claim, correctly, that French cannot be written in [US-]ASCII except in
crude transliteration.

Moreover, the truly multilingual character encoding won't need:
      Content-language:
header at all. So, I object to introduce the to-be-obsoleted header.

    As others have pointed out, one often benefits from language
information even if there are no structural ambiguities about the
character encoding.  Tuples as complex as {country, language, character
set, character encoding} are common in linguistic and textual analysis
work.
    You have convinced me (not hard, I was convinced by mid-1991) and
much of the rest of the WG that 10646 isn't a "truly multilingual
character encoding".   But the choices are to provide sufficient
supplemental information, or to just decide to not use 10646 because it
is inadequate and wait for something better to come along.  I don't
think the latter is a practical alternative--people are going to use it
in some form whether you (or I) like it or not.

    ---john