Re: printable wide character (was "multibyte") encodings

From: erik(_at_)poel(_dot_)juice(_dot_)or(_dot_)jp (Erik M. van der Poel)
Subject: Re: printable wide character (was "multibyte") encodings
Date: Thu, 21 Jan 93 16:12:59 +0900

As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:

1) "flat": canonical form is to transmit each n-bit character as n/8
octets, in order from most significant octet first to least significant
octet last.
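
To make the "flat" form concrete, here is a minimal sketch in Python. The name flat_encode and the 16-bit default width are assumptions for illustration only; nothing in this thread fixes n.

```python
def flat_encode(text: str, bits: int = 16) -> bytes:
    """Encode each character as bits/8 octets, most significant octet first.

    Hypothetical helper: 'bits' plays the role of n in the proposal above.
    Raises OverflowError if a code point does not fit in the chosen width.
    """
    width = bits // 8
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)


sample = "A\u00e9\u65e5"          # 'A', e-acute, and one kanji
print(flat_encode(sample).hex(" "))  # 00 41 00 e9 65 e5
```

Note that even the plain ASCII letter 'A' picks up a leading zero octet, which is the readability cost raised in the reply below.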


That is certainly one way of doing it, but it does have the
disadvantage of making mostly-English text rather unreadable in
viewers that don't support it.


I doubt this method would be used for mostly-English text.  It would be
used when most of the characters weren't in the ASCII set, and thus
non-readable on old systems anyway.

2) "UTF-2":  canonical form is a UTF-2 stream.


The problem with UTF-2 is that it uses the 8th bit, necessitating a
Content-Transfer-Encoding of either Base64 (which would make
mostly-English messages unreadable in the installed base) or
Quoted-Printable, which would be rather lengthy.


With pure ASCII, 7BIT encoding is sufficient.
With mostly ASCII text, Quoted-printable is probably adequate.
It depends on how often the non-ASCII characters are used.
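
As a rough sketch of that choice in Python: the function name choose_cte and the 20% cutoff are purely hypothetical, and a real composer would also have to check line lengths and control characters before labelling anything 7bit.

```python
def choose_cte(body: bytes, max_non_ascii_ratio: float = 0.2) -> str:
    """Pick a Content-Transfer-Encoding from the share of 8-bit octets.

    The threshold is an assumed, tunable value, not taken from any
    standard or from this thread.
    """
    if not body:
        return "7bit"
    high = sum(1 for b in body if b >= 0x80)  # octets with the 8th bit set
    if high == 0:
        return "7bit"
    if high / len(body) <= max_non_ascii_ratio:
        return "quoted-printable"
    return "base64"


print(choose_cte(b"plain ASCII text"))
print(choose_cte("caf\u00e9 au lait, please".encode("utf-8")))
print(choose_cte("\u65e5\u672c\u8a9e".encode("utf-8")))
```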

Latin-1 characters would take up 6 bytes, as opposed to the 2 bytes in my
latest proposal, sent to the Unicode(_at_)Sun(_dot_)COM list.  Japanese
characters would take up 9 bytes, as opposed to the 4 in my proposal.
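
The 6-byte and 9-byte figures can be checked with a short Python sketch. It assumes Python's utf-8 codec is an acceptable stand-in for UTF-2 for these characters, and qp_size is a hypothetical helper name; the alternative proposal mentioned above is not modelled here.

```python
import quopri


def qp_size(ch: str) -> int:
    """Length of one character after UTF-2 (here: utf-8) plus Quoted-Printable."""
    return len(quopri.encodestring(ch.encode("utf-8")))


for ch in ("A", "\u00e9", "\u65e5"):   # ASCII, Latin-1, Japanese
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(raw)} octet(s) raw, {qp_size(ch)} after QP")
```

Each 8-bit octet becomes a three-character "=XX" escape, so a two-octet Latin-1 character grows to 6 bytes and a three-octet Japanese character to 9.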


I wouldn't expect to use this charset for pure Japanese text.  (For mixed
Japanese/ASCII, iso-2022-jp would be a better choice.)

(Or has something wonderful happened on the ietf-smtp list in the
meantime?  Is it now more or less accepted that 8-bit SMTP will become
universally possible, either through negotiation or by "just sending
8-bit"?  I left the ietf-smtp list a while ago; it would be nice if
someone could give the readers of this list a little summary of the
current status.  For example, what is the "transition document" that
Randall mentioned on this list?  I'm not on the ietf list either,
since I have to pay overseas charges for received mail as well as sent
mail.)


(I think the WG has finished up the SMTP extensions drafts...which include
optional 8-bit negotiation capability...Just sending 8-bit remains
nonstandard, but there is an informational "transition" document that
suggests how an SMTP server might deal with the situation where someone
sends it unlabeled 8-bit traffic.)

I thought we were originally aiming for ONE universal charset.


I don't think we are likely to settle on any single charset anytime
soon...since no solution is likely to please everybody...but UTF-2 or
Unicode might become the charset of choice for mixed-language text.

(since it's trivial to convert from one to the other).


Huh?  If it's trivial to convert, might as well just choose ONE.  Am I
missing something?  (The compactness of Base64'ed UCS-2 is an
advantage, but its installed-base-hostileness is a severe
disadvantage.)


Whether you consider UTF-2 "installed-base-hostile" depends on your view of
the installed base.  UTF-2 is *very* installed-base-friendly to ASCII
sites, since it means they don't have to do *anything* to view ASCII text
in the UTF-2 charset.  Your encoding of Unicode is less friendly to ASCII
sites -- especially those without MIME mail readers.  
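
A small Python sketch of that friendliness claim, again treating Python's utf-8 codec as a stand-in for UTF-2 (an assumption; the helper name is hypothetical):

```python
def is_ascii_transparent(text: str) -> bool:
    """True if the UTF-2 (here: utf-8) octets equal the plain ASCII octets."""
    try:
        return text.encode("utf-8") == text.encode("ascii")
    except UnicodeEncodeError:
        # Text contains non-ASCII characters, so there is no ASCII form.
        return False


body = "Plain ASCII mail body."
print(is_ascii_transparent(body))            # True: an old viewer sees it unchanged
print(is_ascii_transparent("na\u00efve"))    # False: i-diaeresis needs 8-bit octets
print(all(b >= 0x80 for b in "\u00ef".encode("utf-8")))  # True: 8th bit set
```

The last line is the flip side raised earlier in the thread: every octet of a non-ASCII character has the 8th bit set, which is why a Content-Transfer-Encoding is needed over 7-bit transport.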

I view the "installed base" as almost entirely pure ASCII in terms of
numbers of users...not that other criteria (such as "fairness") aren't also
relevant in selecting a universal character set...

The idea of two versions of Unicode is just an extension of having multiple
content-transfer-encodings...the mail composer picks whichever one is
appropriate.  I'm not opposed to having only one version...though at this
point, if I had to pick one, I'd pick UTF-2.

But at this point, I'm just tossing out ideas rather than trying to
champion any particular proposal...

Keith