As for UTF-2...I suggest that this WG define two 10646/Unicode charsets:
1) "flat": canonical form is to transmit each n-bit character as n/8
octets, in order from most significant octet first to least significant
octet last.
That is certainly one way of doing it, but it does have the
disadvantage of making mostly-English text rather unreadable in
viewers that don't support it.
I doubt this method would be used for mostly-English text. It would be
used when most of the characters weren't in the ASCII set, and thus
non-readable on old systems anyway.
Yes, but is there anyone in this forum that is likely to use messages
with mostly non-ASCII characters a lot? Japanese doesn't count, since
we already have iso-2022-jp for that. Likewise for Latin-1.
Perhaps there are a few Hebrew speakers on this list. But would they
want to use straight 16-bit Unicode or 32-bit 10646?
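
For concreteness, the "flat" form proposed above can be sketched in a few
lines of Python. This is only an illustration: the function name
flat_encode and its bits parameter are made up for this example and are
not part of any proposal.

```python
def flat_encode(text: str, bits: int) -> bytes:
    """Encode each character as bits/8 octets, most significant octet first."""
    if bits % 8 != 0:
        raise ValueError("bits must be a multiple of 8")
    width = bits // 8
    # int.to_bytes raises OverflowError if a code point does not fit in `width` octets.
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)


# "A" followed by HEBREW LETTER ALEF (U+05D0), as straight 16-bit characters.
print(flat_encode("A\u05d0", 16).hex())  # 004105d0
# "A" as a 32-bit character.
print(flat_encode("A", 32).hex())        # 00000041
```

Note that even the ASCII letter picks up a leading zero octet, which is
why this form is unfriendly to viewers that don't support it.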
The problem with UTF-2 is that it uses the 8th bit, necessitating a
Content-Transfer-Encoding of either Base64 (which would make
mostly-English messages unreadable in the installed base) or
Quoted-Printable, which would be rather lengthy.
With pure ASCII, 7BIT encoding is sufficient.
With mostly ASCII text, Quoted-printable is probably adequate.
It depends on how often the non-ASCII characters are used.
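
A rough sketch of that rule of thumb, in Python. The function name
choose_cte and the 20% threshold are assumptions picked for illustration
only, and the sketch ignores the line-length limits that also apply to
7BIT.

```python
def choose_cte(octets: bytes, qp_threshold: float = 0.2) -> str:
    """Pick a Content-Transfer-Encoding from the share of 8-bit octets."""
    high = sum(1 for b in octets if b >= 0x80)
    if high == 0:
        return "7bit"                 # pure ASCII needs no further encoding
    if high / len(octets) <= qp_threshold:
        return "quoted-printable"     # mostly ASCII: the text stays largely readable
    return "base64"                   # mostly non-ASCIIContent: compact but opaque


print(choose_cte(b"plain ASCII text"))
print(choose_cte("caf\u00e9 au lait".encode("utf-8")))
print(choose_cte("\u65e5\u672c\u8a9e".encode("utf-8")))
```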
Even if only a few non-ASCII characters are used, quoted-printable is
too lengthy (i.e. 6 or 9 bytes per character). The longer a piece of
gibberish is, the more unreadable the message as a whole becomes.
It's also rather wasteful. And RFC 1342 headers would become long,
possibly triggering a split of an "encoded word" to the next line.
Japanese characters would take up 9 bytes, as opposed to the 4 in my
proposal.
I wouldn't expect to use this charset for pure Japanese text. (For mixed
Japanese/ASCII, iso-2022-jp would be a better choice.)
Of course.
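
The 6- and 9-byte figures above can be checked with a short Python
sketch. It assumes that UTF-2 produces the same octets as today's UTF-8
for these characters, and uses Python's "utf-8" codec as a stand-in; the
helper name qp_length is made up for this example.

```python
import quopri


def qp_length(ch: str) -> int:
    """Length of one character after UTF-8 (standing in for UTF-2) plus quoted-printable."""
    octets = ch.encode("utf-8")
    return len(quopri.encodestring(octets))


# e-acute (Latin-1), alef (Hebrew), and a kanji.
for ch in ("\u00e9", "\u05d0", "\u65e5"):
    encoded = quopri.encodestring(ch.encode("utf-8")).decode("ascii")
    print(f"U+{ord(ch):04X}", encoded, qp_length(ch))
```

Each 8-bit octet becomes a three-character "=XX" sequence, so a
two-octet character takes 6 bytes and a three-octet character takes 9.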
(I think the WG has finished up the SMTP extensions drafts...which include
optional 8-bit negotiation capability...Just sending 8-bit remains
nonstandard, but there is an informational "transition" document that
suggests how an SMTP server might deal with the situation where someone
sends it unlabeled 8-bit traffic.)
Thanks for the report, but it doesn't answer the question "Is 8-bit
SMTP expected to become universally possible?" Unless and until 8-bit
becomes universal, when you mention UTF-2, you're really talking about
a further encoding of it (e.g. quoted-printable), or you're talking
about bit-stripping. My position is that both are undesirable.
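
To show what bit-stripping does, here is a small Python sketch. Again it
assumes UTF-2 octets match modern UTF-8 for the sample character; the
function name strip_high_bit is hypothetical.

```python
def strip_high_bit(octets: bytes) -> bytes:
    """Simulate a 7-bit transport that clears the 8th bit of every octet."""
    return bytes(b & 0x7F for b in octets)


original = "caf\u00e9".encode("utf-8")   # 63 61 66 C3 A9
stripped = strip_high_bit(original)      # 63 61 66 43 29
print(stripped.decode("ascii"))          # prints "cafC)"
```

The ASCII letters survive, but the accented character silently turns
into two unrelated ASCII characters, and the receiver has no way to
tell.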
(The compactness of Base64'ed UCS-2 is an
advantage, but its installed-base-hostileness is a severe
disadvantage.)
Whether you consider UTF-2 "installed-base-hostile" depends on your view of
the installed base.
Sorry, I should have explained what I meant by "UCS-2". It is the
2-octet form of 10646 (i.e. Unicode). (UCS-4 is the 4-octet form.)
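
As an illustration of Base64'ed UCS-2, here is a minimal Python sketch.
Python's "utf-16-be" codec stands in for UCS-2, which is an assumption
that holds only for characters that fit in 16 bits; the helper name
base64_ucs2 is made up for this example.

```python
import base64


def base64_ucs2(text: str) -> bytes:
    """Base64 of the 2-octet form, most significant octet first."""
    # "utf-16-be" matches the 2-octet form for characters below U+10000.
    return base64.b64encode(text.encode("utf-16-be"))


print(base64_ucs2("Hi"))  # b'AEgAaQ=='
```

Every three characters cost eight bytes on the wire, which is compact,
but even plain ASCII text comes out unreadable.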
UTF-2 is *very* installed-base-friendly to ASCII
sites, since it means they don't have to do *anything* to view ASCII text
in the UTF-2 charset.
No, UTF-2 is unfriendly because its non-ASCII characters use the 8th
bit, which either gets trashed, or causes lengthy quoted-printable
encoding or unreadable base64.
Your encoding of Unicode is less friendly to ASCII
sites -- especially those without MIME mail readers.
Wrong. It is compact (and therefore relatively readable) and it is
transmissible.
I view the "installed base" as almost entirely pure ASCII in terms of
numbers of users...
I agree, more or less. That's why MU is so appealing.
not that other criteria (such as "fairness") aren't also
relevant in selecting a universal character set...
Again, I agree. But it may be hard to settle on a single set.
To summarize, I'll list the virtues of MU and the methods that don't
have those virtues:
MU's virtue         Virtue not shared by these
----------------    ------------------------------
transmissibility    UCS-2, UCS-4, UTF-2
readability         Base64'ed UCS-2, UCS-4, UTF-2
compactness         Quoted-Printable'd UTF-2
simplicity          Keld's Mnemonics
I must admit that the only part that I'm unsure of is the merit
relative to mnemonics. For the Latin-1 and similar characters, it may
be better to use highly readable mnemonics. These could still be used
in conjunction with the MU encoding for other characters.
Regards,
Erik