Re: Response


  From: Masataka Ohta <mohta(_at_)necom830(_dot_)cc(_dot_)titech(_dot_)ac(_dot_)jp>
  Date: Wed, 26 Jan 94 12:42:36 JST

  : [Ohta]
  | As I already know that I can't display Japanese text with UNICODE without
  | some profiling information, you don't have to do so.

  : [Adams] 

  | I disagree.  You *can* display Japanese text without any profiling
  | information.

  : [Ohta]
  | As long as you know the text is Japanese, which is a profiling information.

Again, I disagree that you must know it is Japanese, and, therefore, that you
need profiling information to tell you this fact.  I shall prove this below.

Given 10646/Unicode plain text without any profiling, display the text
as follows:

1. If the system contains only Japanese fonts (e.g., a collection of
fonts possessing the glyphs needed to display the characters in JIS X 0201,
JIS X 0208, and JIS X 0212), then for each 10646/Unicode character in the
text, map it to its JIS counterpart, yielding the glyph code.  If no mapping
is available, map it to a substitution glyph (e.g., a GETA MARK or an
empty box).

2. If the system contains a font covering all unified CJK ideographs in
10646/Unicode (i.e., a font which, for each of the 20,902 unified ideographs,
chooses a representative glyph for each ideograph drawn from any source),
then, for each unified CJK ideograph character contained in the sample text,
display that character with its corresponding glyph from the unified CJK
font.  Display any other character that has no glyph in an available
font with a substitution glyph.

3. If the system contains distinct Chinese, Japanese, and Korean fonts
which cover the respective glyphs contained in their national standards,
e.g., BIG5, GB2312, JIS X 0208, JIS X 0212, and KS C 5601 fonts, then
perform one of the following:

  a. for a given unified CJK ideograph, determine if that ideograph has
     a mapping to one of the font encodings (BIG5, GB, JIS, KSC, etc.); if
     a mapping exists, then display the glyph corresponding to that mapping.

  b. parse the text being displayed in order to determine sequences of
     substrings which can be strongly associated with a particular writing
     system based on character content; using the results of such parse,
     choose the appropriate national font(s) to display each such sequence.
     Such a parser can be easily constructed based on the following
     statistical facts:

     -) sequences of Japanese text will with high probability contain at
        least one Kana character; whereas Chinese and Korean will with
        high probability not contain such a character;

     -) sequences of Korean text will with high probability contain at
        least one Hangul character; whereas Chinese and Japanese will not;

     -) long sequences containing CJK ideographs, say >20 characters,
        containing neither Kana nor Hangul characters are with high
        probability Chinese;

     -) short sequences containing neither Kana nor Hangul characters may
        be resolved by determining whether every character in the sequence
        maps to some character in a particular national standard; in the
        case that every character in the sequence maps to more than one
        national standard, then choose the first standard given some
        prioritization based on locale (e.g., choose Japanese standard if
        Japanese locale)
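
A minimal Python sketch of steps (3a) and (3b) might look as follows.  It
is an illustration under stated assumptions, not a definitive parser: the
function names, the language tags, and the use of Python's built-in codecs
(euc_jp, gb2312, big5, euc_kr) as stand-ins for the national font encodings
are all hypothetical, and the code point ranges follow the current Unicode
layout for Kana, Hangul syllables, and the 20,902 unified ideographs.

```python
# Hypothetical sketch of steps (3a) and (3b); all names are assumptions.
KANA = (0x3040, 0x30FF)        # Hiragana and Katakana blocks
HANGUL = (0xAC00, 0xD7A3)      # Hangul syllables (current Unicode layout)
IDEOGRAPHS = (0x4E00, 0x9FA5)  # the 20,902 unified CJK ideographs

# Python codecs standing in for the national font encodings.
CODECS = {"ja": "euc_jp", "zh-Hans": "gb2312", "zh-Hant": "big5", "ko": "euc_kr"}


def in_range(ch, rng):
    """True if the character's code point lies in the inclusive range."""
    return rng[0] <= ord(ch) <= rng[1]


def has_mapping(ch, codec):
    """Step (3a): True if the character maps into the given encoding."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def guess_font(seq, locale_priority=("ja", "zh-Hans", "zh-Hant", "ko"),
               long_run=20):
    """Step (3b): choose a national font tag for one sequence of text."""
    if any(in_range(c, KANA) for c in seq):
        return "ja"            # Kana present: very likely Japanese
    if any(in_range(c, HANGUL) for c in seq):
        return "ko"            # Hangul present: very likely Korean
    ideographs = [c for c in seq if in_range(c, IDEOGRAPHS)]
    candidates = locale_priority
    if len(ideographs) > long_run:
        # Long run with neither Kana nor Hangul: probably Chinese.
        candidates = [t for t in locale_priority if t.startswith("zh")]
    # Pick the first standard, in priority order, covering every ideograph.
    for tag in candidates:
        if all(has_mapping(c, CODECS[tag]) for c in ideographs):
            return tag
    return None                # no single national standard covers it
```

The locale_priority argument carries the prioritization described in the
last rule; a Japanese locale would list "ja" first, as in the default here.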

In each of the above cases, the text is displayable.  In case (1), given
that a single national font collection is available, no other solution is
possible in any case.  In case (2), the character may be displayed with
a national variant which does not match the text (e.g., a Chinese font's
glyph for a Japanese Kanji).  Case (3a) may produce the same results as
case (2).  Case (3b) will produce the best results for either monolingual
or multilingual CJK texts in the absence of language or font bindings
(i.e., profiling or rich text).
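
Case (1) can be sketched in a few lines of Python.  This is only an
illustration: the function name is hypothetical, and Python's euc_jp
codec is assumed as a stand-in for the mapping from 10646/Unicode to
the JIS glyph codes.  Any character without a JIS mapping is replaced
by GETA MARK (U+3013).

```python
GETA_MARK = "\u3013"  # substitution glyph suggested in case (1)


def to_japanese_glyphs(text, codec="euc_jp"):
    """Return (character, glyph code) pairs for a Japanese-only system.

    Characters with no mapping in the codec are replaced by GETA MARK.
    """
    out = []
    for ch in text:
        try:
            code = ch.encode(codec)
        except UnicodeEncodeError:
            # No JIS counterpart: fall back to the substitution glyph.
            ch = GETA_MARK
            code = GETA_MARK.encode(codec)
        out.append((ch, code))
    return out
```

Every input character yields exactly one output pair, so the text is
always displayable, although possibly with substitution glyphs.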

The above algorithm thus proves that one *may* display any given
10646/Unicode plain code text in the absence of profiling information.
[It is also useful to note that nearly all existing systems fall into
case (1) above, so the issue of how to handle multilingual texts in
these cases is moot.]

The crux of the matter is whether or not such display is deemed to
be acceptable in the case that a wrong font (or glyph) is chosen to
display a given character (e.g., choosing a Japanese font to display
a unified CJK ideograph contained in a Chinese text).

As far as I know, MIME does not specify any criteria for typographic
acceptability.  In the absence of such criteria, it is not possible
to make a negative judgement about correctness or acceptability of
the above algorithm.  The purpose of the algorithm is to display
each character with some glyph, and it plainly does so.  Therefore,
in the absence of criteria for typographic quality, this algorithm
*is* correct and serves the requirement for a MIME client to display
a 10646/Unicode text.

Should MIME decide to establish criteria of typographic
acceptability for displaying character text, it would have to
describe how a multilingual European text encoded with ISO 8859-1
could, without profiling, "acceptably" display distinct language
sequences with distinct fonts; or, how a multilingual Arabic and
Turkish text encoded with ISO 8859-6 could, without profiling,
"acceptably" display distinct language sequences using, say Naskh
versus Ruq`ah styles of Arabic as appropriate to Arabic vs. Turkish
written language customs.

Since MIME could not accomplish the latter without
invalidating existing MIME usage, it cannot enforce such
requirements of typographic acceptability on CJK usage.

Now I have presented a detailed argument stating how you *can* display
Chinese, Japanese, and Korean encoded with 10646/Unicode without the
use of profiling information of any kind.  Unless MIME wishes to
invalidate existing practices by retroactively enforcing an as yet
undetermined quality of typographic acceptability (a task which of
itself would prove extremely difficult), I would suggest that
all discussion of requiring profiling information in order to use
10646/Unicode with MIME should cease.

If you have a detailed counterargument to the above, then please
provide it.

Regards,
Glenn Adams