
Re: I don't want to be facing 8-bit bugs in 2013

2002-03-21 12:30:03
    Date:        Thu, 21 Mar 2002 00:57:18 +0859 ()
    From:        Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp>
    Message-ID:  <200203201557.AAA02490@necom830.hpcl.titech.ac.jp>

Ohta-san

  | Anyway, with the fix, there is no reason to prefer Unicode-based
  | local character sets, which is not widely used today, than existing
  | local character sets already used world wide.

Let's assume that local char sets, plus an explicit indication of which
one to use, are adequate for this purpose (as Harald has said, and I agree,
that's not sufficient for all purposes, but for stuff like domain names,
file names, etc, I suspect it is).

Then, let's take all the MIME names of those charsets and number them
0, 1, 2, 3, ... (as many as it takes).  That's a 1::1 mapping; after all,
the MIME charset names are in ascii, and we wouldn't want to keep those
as the labels anyway.

Then, rather than labelling whole names with a charset, we'll label
every individual character, so if, by some strange chance, ascii (or 8859-1)
happened to be allocated number 0, then 'A' would be 0-65.
This way we can mix and match, building names from characters of any
random character sets (it may be a little less efficient than one label
per whole name, but that's OK; assume we'll pay that price for the
flexibility).
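
(As a sketch of that per-character labelling, again with the hypothetical
assignment of number 0 to ascii; the label() helper below is just an
illustration for single-byte charsets, not anything defined anywhere.)

    # Label a single character as a (charset number, code point) pair,
    # rather than labelling a whole name with one charset.
    def label(char, charset_number, codec):
        # 'codec' is the Python codec name for the charset in question.
        return (charset_number, char.encode(codec)[0])

    print(label("A", 0, "ascii"))       # (0, 65), i.e. the 0-65 above
    # A name is then just a sequence of such pairs, and nothing stops
    # the pairs in one name from coming from different charsets.
    name = [label(c, 0, "ascii") for c in "example"]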

Now, we'll just list all the possible characters in all the char sets,
with their char set numbers attached, in one big table.
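
(Sketching that: the big table is just the union, over every charset, of
(charset number, code, character) rows.  The handful of Python codec
names below stand in for whatever the real registry of charsets would be.)

    # One big table: every assigned code of every charset, tagged with
    # that charset's number.  The codec names are illustrative only.
    CODECS = {0: "ascii", 1: "latin-1", 2: "iso8859-2", 3: "tis-620"}

    big_table = []
    for cs_number, codec in CODECS.items():
        for code in range(256):
            try:
                char = bytes([code]).decode(codec)
            except UnicodeDecodeError:
                continue        # that code isn't assigned in this charset
            big_table.append((cs_number, code, char))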

What we have at that point is (essentially) 10646/Unicode.   Just up
above you said this works (throwing in some redundant labelling cannot
cause it to stop working, nor can altering the labels from ascii names
to binary numerals).

Of course, a large fraction of the char sets that we just listed in a big
table contain ascii as a subset, so we end up with dozens of versions of
'A' (it's in 8859-1, 8859-2, ... tis-620 (Thai), ...).  All of those A's
are in all respects identical; keeping them all will do no more than cause
interoperability problems.  So let's go through and squash all the
duplicates, and then renumber everything to be a bit more compact (the
renumbering is a 1::1 operation that alters nothing important).
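
(And a sketch of that squash-and-renumber step, continuing from the
hypothetical big_table above.  Note that Python's decode already treats
"the same character" in different charsets as identical by mapping it
into Unicode, which is more or less the point being made here.)

    # Squash the duplicates: characters that are in all respects
    # identical collapse to one entry, then renumber densely.
    unique_chars = sorted({char for _cs, _code, char in big_table})
    NEW_NUMBER = {char: n for n, char in enumerate(unique_chars)}

    # 'A' now has exactly one number, however many charsets carried it,
    # and the renumbering is 1::1 on the characters that survive.
    print(NEW_NUMBER["A"])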

Having done that, we have exactly 10646 (or can have, if we picked the
right character sets for our initial big list of everything, gave them
the appropriate char set numbers, and compressed the number spaces in
just the right way).

Again, you said above, this works.

The only place in all of this where there's any possibility of a problem
is in the "squash all the duplicates" step - if some characters that are
duplicates aren't removed, or some that aren't duplicates are treated as
if they are.  If that happened, then there'd be a bug in the actual
character tables (too many characters, or too few) - but it doesn't alter
the principle of the thing.

If there is a bug like this (which I am not able to judge) then someone
should get it fixed.

Whether or not there is a bug, the unicode/10646 approach is clearly the
most flexible way, and is perfectly adequate for labelling things using
whatever characters anyone wants to use - internationally or locally.

There is simply no way to rationally claim that a local char set, plus label,
is adequate and unicode is not.

kre