ietf-822

Re: printable wide character (was "multibyte") encodings

1993-02-15 03:32:22
Yes, but is there anyone in this forum that is likely to use messages
with mostly non-ASCII characters a lot?  Japanese doesn't count, since
we already have iso-2022-jp for that.  Likewise for Latin-1.

The target audience isn't the list, but the whole mail-Internet.

Yes, of course.  But my point is: They're not here to voice their
opinion.  We can make proposals, but we won't know whether or not they
will accept them until we contact them.  Some of them don't even use
email very much yet!

My MU proposal currently uses 3 bytes for each Cyrillic character.
But if I remember correctly, some Russian communities currently use a
single-octet encoding (KOI-8?) on their networks.  So if I tried (or
the IETF tried) to get them to use MU for their multilingual messages,
there could well be some resistance.  They might instead want to
continue using KOI-8 for their own characters, and then MU for *other*
characters.  (Much like my "jpu" proposal.)

Note that I'm using words like "could" and "might", instead of "will".
That's because I cannot speak for the Russians.

So, you ask: "Well then, why are you proposing MU?  Why are you trying
to get all of us to use it?"

The answer is that I'm not.  MU is only a good solution for "the
Americans".  Or, more generally, for the people that often read and
write messages that only contain ASCII.  Not that very many of those
people would want to use Unicode characters often, but...

Also, my proposal is only really intended to be used by those that
have already decided to use Unicode and/or 10646.  GO Corporation,
Bell Labs, Microsoft and maybe even Apple, Sun and so on.  There has
been a lot of discussion on this list and in other forums about
Unicode's merits.  (Frankly, I agree with some of the complaints.)

But my MU proposal is intended to be accompanied by this sentence:

    *If* you do Unicode/10646, *please* do it this way.

This way, you can at least *hope* for some level of interoperability
between those members of the human race that have already decided to
use Unicode.

Masataka has voiced some concerns about the differences between the
displayed versions of the Chinese, Japanese and Korean "Han"
characters.  These concerns are certainly valid, though my view is
that Unicode itself is good enough for plain text (i.e. text/plain),
and that language tagging should be considered for some other subtype
of text (i.e. "enriched", or some other new one).


We should either bite the bullet on UCS-4
(32 bit) now, or should be prepared for a second transition a few years
down the line.

I now agree that we need to accommodate UCS-4.  The MU structure is:

    Characters                        Codes            No. of bytes

    <e1>                              <e1><e1>         2
    <e2>                              <e2><e2>         2
    <e3>                              <e3><e3>         2
    All other ASCIIs                  <a>              1
    Latin-1 letters                   <e1><b>          2
    All other Unicodes up to 0xfff    <e2><b><b>       3
    All other Unicodes                <e3><b><b><b>    4

(Note: <e1>, <e2> and <e3> are "escapes", much like Quoted-Printable's
"=".  <a> is an ASCII code, and <b> is a Base64 code.)

The 4-byte form uses three Base64 characters, which encode 18 bits of
info (262,144 code points).  So that's 10646's Basic Multilingual Plane
(BMP), *plus* 3 more planes of 65,536 characters each.  That should
keep us going for a while.
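
To make the structure above concrete, here is a rough sketch of the
encoding rules in Python.  The actual escape characters, and the exact
offset used for the Latin-1 letters, are *not* fixed by this message,
so the E1/E2/E3 values and the 0xC0-based Latin-1 mapping below are
placeholder assumptions, not part of the proposal:

    # Sketch only: E1, E2, E3 and the Latin-1 offset are hypothetical.
    B64_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "abcdefghijklmnopqrstuvwxyz"
                    "0123456789+/")

    E1, E2, E3 = "&", "#", "%"        # placeholder escape characters
    ESCAPES = {E1, E2, E3}

    def b64(value, ndigits):
        # Encode `value` as `ndigits` Base64 characters, 6 bits each,
        # most significant group first.
        return "".join(B64_ALPHABET[(value >> shift) & 0x3F]
                       for shift in range((ndigits - 1) * 6, -1, -6))

    def mu_encode_char(cp):
        # Encode a single code point according to the table above.
        if cp < 0x80:                     # ASCII: 1 byte, escapes doubled
            ch = chr(cp)
            return ch * 2 if ch in ESCAPES else ch
        if 0xC0 <= cp <= 0xFF:            # Latin-1 letters: <e1><b>
            return E1 + b64(cp - 0xC0, 1) # (assumed 0xC0-based offset)
        if cp <= 0xFFF:                   # <e2><b><b>: 12 bits
            return E2 + b64(cp, 2)
        if cp <= 0x3FFFF:                 # <e3><b><b><b>: 18 bits
            return E3 + b64(cp, 3)
        raise ValueError("beyond the 18-bit form; see the extension below")

For example, a Cyrillic letter such as U+0414 would come out as <e2>
plus two Base64 characters ("#QU" with the placeholder escapes), which
is the 3-bytes-per-Cyrillic-character figure mentioned earlier.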

But just in case ISO extends it even more, we could put in some more
safety latches:

    Characters           Codes                         No. of bytes

    Up to 0xffffff       <e1><e2><b><b><b><b>          6
    Up to 0x3fffffff     <e1><e3><b><b><b><b><b>       7
    Up to 0xfffffffff    <e2><e3><b><b><b><b><b><b>    8
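
Continuing the same (purely illustrative) Python sketch, the safety
latches would just add three more branches, reusing b64() and the
placeholder escapes from above:

    def mu_encode_char_extended(cp):
        # The forms from the first table handle everything up to 18 bits.
        if cp <= 0x3FFFF:
            return mu_encode_char(cp)
        if cp <= 0xFFFFFF:                # <e1><e2> + 4 Base64 chars (24 bits)
            return E1 + E2 + b64(cp, 4)
        if cp <= 0x3FFFFFFF:              # <e1><e3> + 5 Base64 chars (30 bits)
            return E1 + E3 + b64(cp, 5)
        if cp <= 0xFFFFFFFFF:             # <e2><e3> + 6 Base64 chars (36 bits)
            return E2 + E3 + b64(cp, 6)
        raise ValueError("outside the 36-bit range")

As far as I can tell, a decoder can still tell the forms apart
unambiguously: a doubled escape is a literal escape character, and an
escape followed by a *different* escape selects one of the longer forms
(provided the escape characters are not themselves in the Base64
alphabet).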


Erik

