ietf-822

Re: printable wide character (was "multibyte") encodings

1993-01-21 08:25:04
As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:

1) "flat": canonical form is to transmit each n-bit character as n/8
octets, in order from most significant octet first to least significant
octet last.

That is certainly one way of doing it, but it does have the
disadvantage of making mostly-English text rather unreadable in
viewers that don't support it.  If the typical Unicode user used
mostly "exotic", non-ASCII characters, it would not be at all
unreasonable to send out unreadable stuff, since people would have to
upgrade their software ANYWAY for that kind of stuff.  But somehow I
think that the typical (American) user will use mostly ASCII, with the
occasional need for, perhaps, the "smiley" character, or
i-with-diaeresis for the word "naive", or whatever.  In that case, it
would be less hostile to the installed base to use single-byte ASCII
codes for the ASCII characters, as is done in UTF-2 and MU.
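To make the readability point concrete, here is a rough sketch of what an ASCII-only viewer might show for the same mostly-English text in each form. It rests on assumptions that are not part of the proposals themselves: Python's "utf-16-be" codec stands in for the 16-bit "flat" form (most significant octet first), Python's "utf-8" codec stands in for UTF-2, and the helper as_seen_by_ascii_viewer is a hypothetical model of a viewer that shows every non-printable or 8-bit octet as a dot.

```python
# Assumption: "utf-16-be" approximates the 16-bit "flat" form and
# "utf-8" approximates UTF-2 for the characters used here.
text = "na\u00efve :-)"  # "naive" spelled with i-with-diaeresis

flat = text.encode("utf-16-be")  # two octets per character, high octet first
utf2 = text.encode("utf-8")      # one octet per ASCII character


def as_seen_by_ascii_viewer(data: bytes) -> str:
    """Hypothetical viewer: printable ASCII shown as-is, all else as '.'."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in data)


flat_view = as_seen_by_ascii_viewer(flat)
utf2_view = as_seen_by_ascii_viewer(utf2)

print(len(flat), repr(flat_view))
print(len(utf2), repr(utf2_view))
```

In the flat form every ASCII letter is preceded by a zero octet, so the text is interrupted throughout; in the UTF-2-style form only the one non-ASCII character is disturbed.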


2) "UTF-2":  canonical form is a UTF-2 stream.

The problem with UTF-2 is that it uses the 8th bit, necessitating a
Content-Transfer-Encoding of either Base64 (which would make
mostly-English messages unreadable in the installed base) or
Quoted-Printable, which would be rather lengthy.  Latin-1 characters
would take up 6 bytes (2 UTF-2 octets, each sent as "=XX"), as opposed
to the 2 bytes in my latest proposal, sent to the
Unicode(_at_)Sun(_dot_)COM list.  Japanese characters would take up
9 bytes (3 UTF-2 octets), as opposed to the 4 in my proposal.
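The 6-byte and 9-byte figures can be checked with a small sketch. It assumes Python's "utf-8" codec as a stand-in for UTF-2 (the two agree on octet counts for the characters used here) and uses the standard quopri module for Quoted-Printable; qp_len is a hypothetical helper name, and the sample characters are arbitrary choices.

```python
import quopri


def qp_len(ch: str) -> int:
    """Octets on the wire for one character: UTF-8 (as UTF-2), then QP."""
    return len(quopri.encodestring(ch.encode("utf-8")))


ascii_len = qp_len("a")          # plain ASCII passes through unchanged
latin1_len = qp_len("\u00ef")    # i-with-diaeresis: 2 octets, each "=XX"
japanese_len = qp_len("\u65e5")  # a kanji: 3 octets, each "=XX"

print(ascii_len, latin1_len, japanese_len)
```

Each 8-bit octet costs three characters in Quoted-Printable, which is where the tripling comes from.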

(Or has something wonderful happened on the ietf-smtp list in the
meantime?  Is it now more or less accepted that 8-bit SMTP will become
universally possible, either through negotiation or by "just sending
8-bit"?  I left the ietf-smtp list a while ago; it would be nice if
someone could give the readers of this list a little summary of the
current status.  For example, what is the "transition document" that
Randall mentioned on this list?  I'm not on the ietf list either,
since I have to pay overseas charges for received mail as well as sent
mail.)


...and require any reader that accepts one to accept both.

I thought we were originally aiming for ONE universal charset.


(since it's trivial to convert from one to the other).

Huh?  If it's trivial to convert, might as well just choose ONE.  Am I
missing something?  (The compactness of Base64'ed UCS-2 is an
advantage, but its installed-base-hostileness is a severe
disadvantage.)
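As a rough illustration of that trade-off, the sketch below Base64-encodes 16-bit text. It assumes Python's "utf-16-be" codec as a stand-in for UCS-2 and an arbitrary sample string; neither comes from the proposals under discussion.

```python
import base64

text = "Hello, world"  # arbitrary sample, 12 characters

ucs2 = text.encode("utf-16-be")                # 2 octets per character
b64 = base64.b64encode(ucs2).decode("ascii")   # 4 output chars per 3 octets

chars_per_character = len(b64) / len(text)     # 8/3, about 2.67

print(len(ucs2), len(b64), round(chars_per_character, 2))
print(b64)
```

The cost is a fixed 8/3 characters per 16-bit character, which beats Quoted-Printable UTF-2 for non-ASCII text, but no fragment of the original English survives in the encoded form for an old viewer to display.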


The sender (or his UA) can
pick whichever one seems to be the best for the text being transmitted.

Be conservative in what you send...


Regards,
Erik