Re: 10646 & MIME [was: Response]


  From: Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu>
  Date: Wed, 26 Jan 1994 16:19:29 -0500

  >        in the
  >         case that every character in the sequence maps to more than one
  >         national standard, then choose the first standard given some
  >         prioritization based on locale (e.g., choose Japanese standard if
  >         Japanese locale)

  Hmmm.  Use of a locale sounds pretty much like "external profiling" to me.

This rule is not strictly necessary; it is a heuristic that serves to produce
the best results.  Without it, one could simply choose an arbitrary standard
that maps the sequence.  For such a short sequence, the result would then be
equivalent to that obtained by case (3a).
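
As a rough illustration only, the locale-based tie-break quoted above might be
sketched as follows.  The coverage table, the per-locale priority lists, and
the function name choose_standard are all hypothetical, introduced here for
the example; they are not taken from any actual implementation or from the
full rule set under discussion.

```python
# Hypothetical coverage table: which national standards can encode each
# character.  The entries are illustrative, not an authoritative mapping.
COVERAGE = {
    "\u4e00": {"GB 2312", "JIS X 0208", "KS C 5601"},
    "\u4e8c": {"GB 2312", "JIS X 0208", "KS C 5601"},
}

# Hypothetical priority order of standards for each locale.
LOCALE_PRIORITY = {
    "ja": ["JIS X 0208", "GB 2312", "KS C 5601"],
    "zh": ["GB 2312", "JIS X 0208", "KS C 5601"],
}


def choose_standard(sequence, locale):
    """Return the highest-priority standard that maps every character."""
    if not sequence:
        return None
    # Standards that can encode every character in the sequence.
    common = set.intersection(*(COVERAGE[ch] for ch in sequence))
    # Walk the locale's priority list and take the first usable standard.
    for standard in LOCALE_PRIORITY[locale]:
        if standard in common:
            return standard
    return None
```

With these assumed tables, the same two-character sequence resolves to a
different standard depending only on the locale, which is exactly the
dependence on external information that Keith points out.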

  If the precise forms of the characters are important to those who use the
  language, the unified ideographs may well be sufficiently different from the
  character desired to violate the intent of the "unique mapping" MIME charset
  requirement.  In short, I think Ohta-san has a valid point which should not
  be dismissed out-of-hand or by claiming that it doesn't exist.

I'm not dismissing the fact that differences exist between national
conventions regarding the way that certain ideographs are depicted.  This
is an accepted fact and it is handled in the context of the ISO Ideographic
Rapporteur Group as a case of Z variation (known as font variants, a situation
identical to the distinction between a Helvetica 'a' and a Times Roman 'a').

What I do not accept from Ohta-san is his unsubstantiated assertion that
displaying an ideograph with one variant glyph versus another variant
glyph is valid cause for rejecting one over the other.  I will allow this
only in the case that a typographic criterion of acceptability is used to
derive such a judgement; however, in the absence of such a criterion, and
in the absence of any evidence that somehow the "wrong choice" would make the
display illegible (a statement for which I can supply considerable
counterevidence), the choice of glyphs is entirely arbitrary and any choice is
acceptable according to the criterion of legibility.  I have repeatedly asked
Ohta-san for evidence for his position, but, as his own peers have concluded
in the official position of Japan on 10646:

  "the reasons agreeable to the CJK unified ideograph can be found, while
   the reasons opposing cannot be found. A variety of opposing opinions
   exist in Japan, but any one of them are incorrect."

Ohta-san, sorry to say, is in the group expressing the "incorrect".

  > Should MIME decide that it will establish a criterion of typographic
  > acceptability for displaying character text, then it would have to
  > describe how a multilingual European text encoded with ISO 8859-1
  > could, without profiling, "acceptably" display distinct language
  > sequences with distinct fonts; or, how a multilingual Arabic and
  > Turkish text encoded with ISO 8859-6 could, without profiling,
  > "acceptably" display distinct language sequences using, say Naskh
  > versus Ruq`ah styles of Arabic as appropriate to Arabic vs. Turkish
  > written language customs.

  This is also a good point; the problem is not specific to 10646.
  (But if we're being pedantic, does the *definition* of 8859/6 give multiple
  possible appearances for some characters?)

This is an interesting question and one I hoped someone would ask.  As
you know, 8859/6 is an encoding of Arabic graphemes which requires that
the display system will choose among any number of distinct glyphs, all
present in the same font or in multiple fonts, for depicting a given
Arabic letter according to its context.  It is immaterial whether the
standard gives multiple appearances (in the standard); for it *requires*
that multiple appearances be employed to produce a minimally legible
display, a process which, in this case, can be done deterministically
(i.e., by a DFSM with some small amount of state) based on a contextual
analysis of the character content.  [By the way, depending on the style
of Arabic script supported by the font and display subsystem, the number
of glyphs required to depict a single character may be as many as 30-50.
The latter would be the case for certain complex Arabic styles such as
Ruq`ah and Nastaliq.]
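
To make the contextual analysis concrete, here is a minimal sketch of form
selection for a simplified subset of the Arabic letters.  It is an assumption-
laden illustration, not a rendering engine: the two letter tables cover only
some letters, vowel marks are not handled, and the function name
contextual_forms is hypothetical.  The only state carried from one letter to
the next is whether the preceding letter can join forward.

```python
# Letters that join on both sides (simplified subset).
DUAL = set(
    "\u0628\u062a\u062b\u062c\u062d\u062e\u0633\u0634\u0635\u0636\u0637"
    "\u0638\u0639\u063a\u0641\u0642\u0643\u0644\u0645\u0646\u0647\u064a"
)
# Letters that join only to the preceding letter (simplified subset).
RIGHT = set("\u0622\u0623\u0624\u0625\u0627\u0629\u062f\u0630\u0631\u0632\u0648")


def contextual_forms(text):
    """Return the positional form name chosen for each character."""
    forms = []
    prev_joins_forward = False  # the single piece of carried state
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        can_join = ch in DUAL or ch in RIGHT
        joins_prev = prev_joins_forward and can_join
        # Only dual-joining letters connect to a following joining letter.
        joins_next = ch in DUAL and (nxt in DUAL or nxt in RIGHT)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
        prev_joins_forward = ch in DUAL
    return forms
```

For the word kaf, teh, alef, beh this yields initial, medial, final, isolated.
A display system would then map each (letter, form) pair to a glyph; the more
complex styles mentioned above need many more glyphs than these four forms.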

  It seems to me that the best thing we can do is to make 10646 as good as
  possible for MIME, without making it incompatible with other anticipated
  uses of 10646.  Glenn Adams's suggestions as to how 10646 might be
  displayed seem to have the right intent -- though others may have better
  ideas.

Regarding "mak[ing] 10646 as good as possible", at this point, 10646 is
published and will not be changed until the first addendum comes along.
The first addendum will not substantially change anything already there,
though it may augment it.  [I am aware of at least one change being suggested
which removes a restriction; namely, the restriction on direct encoding of
C1 controls.]  What can be done is to further articulate various implementation
information so that persons implementing 10646 and/or Unicode systems can do
so more effectively and compatibly.  [As the editor of the newsletter of
the Unicode Consortium, Encoding, I can say that this latter goal is one of
my highest priorities.]

As you say, 10646/Unicode *will* be used whether MIME goes forward or not,
and it would be a shame if the developers of MIME and other Internet
facilities take an unnecessarily restrictive (and may I add, unwarranted)
stance toward it.  It may interest you to note that when the ISO SC22
(Programming Language) subgroups held an ad hoc meeting to attempt to
understand how 10646 would affect their future, they agreed to the
recommendations found below, which I believe it would also behoove MIME
developers and other Internet developers to consider.

[FYI, a full length article on 10646/Unicode which discusses many of the
above issues in more detail, including Ideograph Unification, may be found
in the first issue of the new ACM Journal, StandardView, issued Sep 93.]

Regards,
Glenn Adams

Excerpt of Report from SC22 Adhoc on Character Sets, held in Copenhagen
from 21-23 April 1993.

  Short Term Recommendations (1-3 years)

  1. That support be provided for ISO/IEC 10646 where the unit of processing
  is exactly one coded character - as defined in 10646. "Unit of processing"
  is the smallest unit a programming language or operating system can process
  for ISO/IEC 10646 coded character data.

  This means:

  - all coded characters in ISO/IEC 10646 level 3 are available for use
    by applications

  - this minimum level does not require the interpretation of composite
    sequences as logical processing units.

  2. That SC22 address interlanguage communication of ISO/IEC 10646 coded
  data.

  3. That FSS-UTF be registered within ISO 2375 (ECMA).
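
As an illustration of short term recommendation 1, the sketch below treats
each coded character as one unit of processing and makes no attempt to
interpret a composite sequence as a single unit.  It assumes a language whose
strings are sequences of coded characters (Python is used here) and relies on
the standard unicodedata module for names; the function name
coded_characters is hypothetical.

```python
import unicodedata


def coded_characters(text):
    """List each coded character as its own unit of processing."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# 'e' followed by a combining acute accent is a composite sequence, but at
# this level it is simply two coded characters.
for code, name in coded_characters("e\u0301"):
    print(code, name)
```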

  Long Term Recommendations

  1. That programming languages and supporting environments provide support
  for composite sequences and CC-data-elements of ISO/IEC 10646 as logical
  processing units.  Considerations must be given to the relation between
  logical processing units and natural language and orthography, and as such
  may require a mechanism for their identification.

  2. That WG15 or WG20 address the need for announcement mechanisms for the
  different encodings, levels and subrepertoires of ISO/IEC 10646 (see
  section 17.1 of ISO/IEC 10646, second sentence).  The same mechanisms
  may also be used to announce other coded character sets.
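
As an illustration of long term recommendation 1, the sketch below groups a
base character with any combining marks that follow it, so that a composite
sequence is handled as one logical processing unit.  This is a deliberately
simple approximation that uses only the general category from Python's
unicodedata module; it is not the identification mechanism the report calls
for, and the function name logical_units is hypothetical.

```python
import unicodedata


def logical_units(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units
```

Under this sketch, the three coded characters 'e', combining acute accent,
'a' form two logical units.  As the recommendation notes, the right grouping
can depend on natural language and orthography, which a category test alone
does not capture.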