Re: character sets

This presupposes that there exists a set of characters that is a
superset of *all* the character sets in the world. Although 10646 and
Unicode come close to this goal, neither of them has reached it, and
even if they do reach the goal today, there is certainly no guarantee
that tomorrow won't invent or discover some more characters. Also, if
you specify ASCII, Latin-1 and 10646, then you will not be able to
send Unicode across the Internet since some Unicode characters are not
included in 10646.


Reality hits Disneyworld.  It is true that there is still no complete
universal character set, and if there were, it would be obsolete the
day my firstborn scribbles a new glyph for "toast".

Requiring the use of an incomplete character set in the Internet is
not such a change from current practice.  If you use EBCDIC
internally, you are still required to convert to ASCII, even if there
is some information loss.  There are ugly workarounds, but my guess is
that if 10646 is anointed the Internet Mail character set, the number
of lossy translations will be relatively minuscule.
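
To make the EBCDIC case concrete, here is a rough sketch in Python.
It assumes the cp037 codec as a stand-in for "EBCDIC" (there are many
EBCDIC variants) and uses "?" substitution as one possible ugly
workaround.  The function name is made up for illustration.

```python
def ebcdic_to_ascii(data: bytes) -> bytes:
    """Convert EBCDIC bytes (cp037 assumed) to ASCII, lossy where needed."""
    text = data.decode("cp037")
    # Characters with no ASCII equivalent become "?"; that is the
    # information loss the conversion requirement already accepts.
    return text.encode("ascii", errors="replace")


# The cent sign exists in cp037 but not in ASCII, so it is lost.
internal = "5¢ toast".encode("cp037")
print(ebcdic_to_ascii(internal))
```

Text that stays within the ASCII repertoire survives the trip
unchanged; only the characters outside it are lost.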

So, no matter what you do, you need to provide an escape hatch for the
people with "other" character sets. They can encode using e.g. BASE64
of course, but they should be allowed to write e.g. "Unicode" in one
of the headers.


If 10646 is chosen as a new character set, you can then send Unicode
around to everyone you can currently send EBCDIC to.  Common
character sets have a lot of advantages.  I'm arguing that the
interoperability advantages far outweigh the small costs of conversion.

However, if we *do* decide to specify ASCII, Latin-1 and 10646, then
we should not allow e.g. a Unicode message to be sent across the
Internet if the message is fully convertible to one of the required
codesets. In other words, we can only add a character set to the list
of *allowed* character sets, if the conversion between that character
set and one or more of the *required* character sets has been fully
specified (probably in a separate RFC). RFC-XXXX should specify that
gateways *must* convert fully-convertible messages to one of the
required character sets. Only messages that contain unconvertible
characters should be allowed to cross the Internet "as is", but then
they *must* be encoded in a conservative way, i.e. Quoted-Printable or
BASE64/BASE85.
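
The gateway rule proposed above might look roughly like this in
Python.  This is only a sketch under assumptions: the required list
holds just ASCII and Latin-1 (Python has no codec that models 10646 as
something distinct from Unicode), UTF-16 stands in for the "as is"
Unicode form, and the function name and returned labels are
hypothetical.

```python
import base64

# Assumption: required sets in order of preference (10646 omitted, see above).
REQUIRED_CHARSETS = ["ascii", "latin-1"]


def gateway_encode(text: str, source_label: str = "unicode"):
    """Return (charset_label, transfer_encoding, payload_bytes)."""
    for charset in REQUIRED_CHARSETS:
        try:
            # Fully convertible: the gateway *must* convert.
            return charset, None, text.encode(charset)
        except UnicodeEncodeError:
            continue
    # Unconvertible characters: pass "as is", but in a conservative encoding.
    raw = text.encode("utf-16-be")
    return source_label, "base64", base64.b64encode(raw)


print(gateway_encode("hello"))
print(gateway_encode("caf\u00e9"))
print(gateway_encode("\u65e5\u672c"))
```

Only the last message keeps its original label, and it is the only
one that travels in BASE64.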

Does this sound acceptable?


*** NEW IDEA FOLLOWING... WAKE UP. (In the spirit of too much mail)

Only if character sets other than the Internet Mail standard ones are
not called content-type "text", but are instead called x-2022 or
x-unicode.  The text bodypart should refer to a standard character set.

The content type for textual mail should have in its definition a
character set.  If you send information in another character set, the
content-type may not say "text".  This is actually really useful for
maintaining "internal" and "external" formats.  Internally, I can keep
the mail in x-unicode, or x-prime-character-set, and when it is
released to the net, it gets converted into the canonical Internet
character set and x-unicode gets replaced with "text".
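
A minimal sketch of that internal/external relabeling, in Python.
The codec choices are assumptions: UTF-16 stands in for the internal
x-unicode form and UTF-8 for whatever canonical Internet character set
ends up being chosen.  The table and function names are hypothetical.

```python
# Hypothetical internal content-type labels and the codec each one implies.
INTERNAL_TYPES = {"x-unicode": "utf-16-be"}

# Assumption: stand-in for the canonical Internet character set.
CANONICAL_CODEC = "utf-8"


def release_to_net(content_type: str, body: bytes):
    """Convert an internal body part to the canonical form labeled "text"."""
    codec = INTERNAL_TYPES.get(content_type)
    if codec is None:
        # Not an internal label we know; leave the body part untouched.
        return content_type, body
    text = body.decode(codec)
    return "text", text.encode(CANONICAL_CODEC)


stored = "toast".encode("utf-16-be")         # how the mail is kept internally
print(release_to_net("x-unicode", stored))   # what actually leaves the site
```

The label only ever says "text" once the body is really in the
canonical character set, so the content-type never misstates what is
on the wire.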