Re: printable wide character (was "multibyte") encodings

As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:

1) "flat": canonical form is to transmit each n-bit character as n/8
octets, in order from most significant octet first to least significant
octet last.
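
A minimal sketch of the "flat" form described above, assuming that n is a
multiple of 8 and that characters are given as integer code points.  The
name flat_encode is hypothetical and used only for illustration.

```python
def flat_encode(code_points, bits):
    """Serialize each code point as bits/8 octets, most significant first."""
    if bits % 8 != 0:
        raise ValueError("bits must be a multiple of 8")
    width = bits // 8
    # int.to_bytes with "big" emits the most significant octet first.
    return b"".join(cp.to_bytes(width, "big") for cp in code_points)


text = "A\u05d0"  # LATIN CAPITAL LETTER A, HEBREW LETTER ALEF
points = [ord(c) for c in text]
print(flat_encode(points, 16).hex())  # 004105d0
print(flat_encode(points, 32).hex())  # 00000041000005d0
```

Note that even the ASCII letter picks up leading zero octets, which is why
mostly-English text in this form is not readable in an ASCII-only viewer.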


That is certainly one way of doing it, but it does have the
disadvantage of making mostly-English text rather unreadable in
viewers that don't support it.


I doubt this method would be used for mostly-English text.  It would be
used when most of the characters weren't in the ASCII set, and thus
non-readable on old systems anyway.


Yes, but is anyone in this forum likely to send many messages that
consist mostly of non-ASCII characters?  Japanese doesn't count, since
we already have iso-2022-jp for that.  Likewise for Latin-1.

Perhaps there are a few Hebrew speakers on this list.  But would they
want to use straight 16-bit Unicode or 32-bit 10646?

The problem with UTF-2 is that it uses the 8th bit, necessitating a
Content-Transfer-Encoding of either Base64 (which would make
mostly-English messages unreadable in the installed base) or
Quoted-Printable, which would be rather lengthy.


With pure ASCII, 7BIT encoding is sufficient.
With mostly ASCII text, Quoted-printable is probably adequate.
It depends on how often the non-ASCII characters are used.


Even if only a few non-ASCII characters are used, quoted-printable is
too lengthy: each octet with the 8th bit set becomes a 3-byte "=XX"
escape, so a 2- or 3-octet UTF-2 character takes 6 or 9 bytes.  The
longer a piece of gibberish is, the more unreadable the message as a
whole becomes.
It's also rather wasteful.  And RFC 1342 headers would become long,
possibly triggering a split of an "encoded word" to the next line.
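
The 6-or-9-byte figure can be checked with a short sketch.  It assumes
that modern UTF-8 is an acceptable stand-in for UTF-2 (for the characters
used here both give 2- or 3-octet sequences) and it uses Python's standard
quopri module.  The helper name qp_len is hypothetical.

```python
import quopri


def qp_len(ch):
    """Bytes needed for one character after UTF-8 plus quoted-printable."""
    return len(quopri.encodestring(ch.encode("utf-8")))


print(qp_len("A"))       # ASCII letter
print(qp_len("\u00e9"))  # Latin-1 e-acute, 2 octets in UTF-8
print(qp_len("\u05d0"))  # Hebrew alef, 2 octets in UTF-8
print(qp_len("\u3042"))  # Japanese hiragana a, 3 octets in UTF-8
```

Under these assumptions the ASCII letter stays at 1 byte, the Latin-1 and
Hebrew characters take 6 bytes each, and the Japanese character takes 9.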

Japanese characters would take up 9 bytes, as opposed to the 4 in my
proposal.


I wouldn't expect to use this charset for pure Japanese text.  (For mixed
Japanese/ASCII, iso-2022-jp would be a better choice.)


Of course.

(I think the WG has finished up the SMTP extensions drafts...which include
optional 8-bit negotiation capability...Just sending 8-bit remains
nonstandard, but there is an informational "transition" document that
suggests how an SMTP server might deal with the situation where someone
sends it unlabeled 8-bit traffic.)


Thanks for the report, but it doesn't answer the question "Is 8-bit
SMTP expected to become universally possible?"  Unless and until 8-bit
becomes universal, when you mention UTF-2, you're really talking about
a further encoding of it (e.g. quoted-printable), or you're talking
about bit-stripping.  My position is that both are undesirable.
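
To illustrate what bit-stripping does to such text, here is a small
sketch.  It assumes a 7-bit transport that simply clears the 8th bit of
every octet, and again uses UTF-8 as a stand-in for UTF-2.  The function
name strip_high_bit is hypothetical.

```python
def strip_high_bit(data):
    """Simulate a 7-bit transport by clearing the top bit of each octet."""
    return bytes(b & 0x7F for b in data)


original = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'
print(strip_high_bit(original))         # b'cafC)'
print(strip_high_bit(b"plain ASCII"))   # b'plain ASCII'
```

The ASCII part survives, but the non-ASCII character turns into unrelated
ASCII characters, and the receiver cannot tell that anything was lost.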

(The compactness of Base64'ed UCS-2 is an
advantage, but its installed-base-hostileness is a severe
disadvantage.)
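
A small sketch of both halves of that remark.  It assumes that Python's
"utf-16-be" codec is an acceptable stand-in for UCS-2 (true only for
characters that need no surrogate pairs).  The helper name b64_ucs2 is
hypothetical.

```python
import base64


def b64_ucs2(text):
    """Encode text as big-endian 2-octet characters, then Base64."""
    return base64.b64encode(text.encode("utf-16-be")).decode("ascii")


encoded = b64_ucs2("Hello")
print(encoded)       # AEgAZQBsAGwAbw==
print(len(encoded))  # 16
```

Sixteen bytes for five characters is compact next to 6 or 9 bytes for each
non-ASCII character, but even a plain English word is unrecognizable to a
reader whose mail program does not decode it.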


Whether you consider UTF-2 "installed-base-hostile" depends on your view of
the installed base.


Sorry, I should have explained what I meant by "UCS-2".  It is the
2-octet form of 10646 (i.e. Unicode).  (UCS-4 is the 4-octet form.)

UTF-2 is *very* installed-base-friendly to ASCII
sites, since it means they don't have to do *anything* to view ASCII text
in the UTF-2 charset.


No, UTF-2 is unfriendly because its non-ASCII characters use the 8th
bit, which either gets trashed, or causes lengthy quoted-printable
encoding or unreadable base64.

Your encoding of Unicode is less friendly to ASCII
sites -- especially those without MIME mail readers.


Wrong.  It is compact (and therefore relatively readable) and it is
transmissible.

I view the "installed base" as almost entirely pure ASCII in terms of
numbers of users...


I agree, more or less.  That's why MU is so appealing.

not that other criteria (such as "fairness") aren't also
relevant in selecting a universal character set...


Again, I agree.  But it may be hard to settle on a single set.



To summarize, I'll list the virtues of MU and the methods that don't
have those virtues:

    MU's virtue         Virtue not shared by these

    transmissibility    UCS-2, UCS-4, UTF-2
    readability         Base64'ed UCS-2, UCS-4, UTF-2
    compactness         Quoted-Printable'd UTF-2
    simplicity          Keld's Mnemonics


I must admit that the only part that I'm unsure of is MU's merit
relative to mnemonics.  For the Latin-1 and similar characters, it may
be better to use highly readable mnemonics.  These could still be used
in conjunction with the MU encoding for other characters.


Regards,
Erik