character sets

Greg writes:

Well, if you can convert from ISO 10646 to national 646-n, and then
display 646-n on your dumb terminal, then I'd say you can handle
10646.  You can do something with it, even if you must suffer info
loss for Kanji.


OK, now I can see what sort of level you are aiming for. You're saying
that the UA should be able to convert from one of a small set of
encodings to the user's codeset. The UA need not display *all* of the
world's characters, but it should display the ones that are included
in the user's codeset. You say that it's OK to "suffer info loss for
Kanji". I assume that you mean that the UA does not have to display
Kanji, but it should not lose the information that it has in the
message file itself.

As far as the above example is concerned, if I were an implementor, I
think I would convert from 10646 to something like 7-bit 2022, so that
the user sees some national 646 characters, with some hyphens
interspersed for the unconvertible characters, so that the user can
see that not all characters could be converted. I might even print a
warning saying that the conversion did not succeed completely.
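
To make the idea concrete, here is a rough sketch in Python of that kind of
lossy display conversion. It is only an illustration under some assumptions:
Python's Unicode strings stand in for 10646 text, the "ascii" codec stands in
for a national 646 variant, and the function name and the choice of "-" as
placeholder are mine, not anything from the draft.

```python
def to_user_codeset(text, codeset="ascii", placeholder="-"):
    """Convert text for display in the user's codeset.

    Characters that the codeset cannot represent are shown as a
    placeholder. Returns the displayable string and the number of
    characters that could not be converted.
    """
    shown = []
    lost = 0
    for ch in text:
        try:
            ch.encode(codeset)       # representable in the user's codeset?
            shown.append(ch)
        except UnicodeEncodeError:
            shown.append(placeholder)
            lost += 1
    return "".join(shown), lost


# "Tokyo" followed by two Kanji, written with escapes to keep this file 7-bit.
shown, lost = to_user_codeset("Tokyo \u6771\u4eac")
print(shown)
if lost:
    print("warning: %d character(s) could not be converted" % lost)
```

Note that only the displayed copy is lossy here; the original text is left
untouched, so the message file keeps all of its information.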

Actually, the conversion does not have to be done by the UA. The
conversion could be done by the enclave's gateways, again as long as
no info is lost.

This is precisely what I'm getting at.  If I pick a series of
codesets, like MAILASCII, Latin-1 and ISO 10646, they are all upwardly
compatible.  If I send Japanese, I must use ISO 10646.  I have no
option.  If I send English in ASCII, I can use 10646, but I can subset
it to ASCII.  If I send French, I can use either 10646, or subset it
to Latin-1.
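
A sender following this scheme could pick the smallest set in the series that
still covers the message. The sketch below is only a guess at how that might
look: the labels are the ones used in this thread, Python's "ascii" and
"latin-1" codecs stand in for the first two sets, and any Python string is
assumed to be representable in 10646.

```python
# The series, smallest first: (label used in this thread, Python codec).
SERIES = (("MAILASCII", "ascii"), ("Latin-1", "latin-1"))


def smallest_charset(text):
    """Return the label of the smallest set in the series that covers text."""
    for label, codec in SERIES:
        try:
            text.encode(codec)
            return label
        except UnicodeEncodeError:
            continue
    # Assumption: whatever is left can only go out as 10646.
    return "ISO 10646"


print(smallest_charset("Hello"))           # English
print(smallest_charset("caf\u00e9"))       # French, e with acute accent
print(smallest_charset("\u6771\u4eac"))    # Japanese
```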

There is a big difference between implementing this series of
character sets and asking that I implement (or be able to convert to
and from) Unicode, 646-n, and ISO 10646.  With the former series, I
implement the level of functionality I need.  If any arbitrary
character set is allowed, I must implement all sets that can possibly
give me the functionality I need, because I must expect any one of them.


This presupposes that there exists a set of characters that is a
superset of *all* the character sets in the world. Although 10646 and
Unicode come close to this goal, neither of them has reached it, and
even if they do reach the goal today, there is certainly no guarantee
that more characters won't be invented or discovered tomorrow. Also, if
you specify ASCII, Latin-1 and 10646, then you will not be able to
send Unicode across the Internet since some Unicode characters are not
included in 10646.

So, no matter what you do, you need to provide an escape hatch for the
people with "other" character sets. They can encode using e.g. BASE64
of course, but they should be allowed to write e.g. "Unicode" in one
of the headers.

Of course, you can argue that the RFC is for the Internet, and that
people can do whatever they like within enclaves, but I think there
will be people who want to communicate across the Internet between
enclaves.

However, if we *do* decide to specify ASCII, Latin-1 and 10646, then
we should not allow e.g. a Unicode message to be sent across the
Internet if the message is fully convertible to one of the required
codesets. In other words, we can only add a character set to the list
of *allowed* character sets, if the conversion between that character
set and one or more of the *required* character sets has been fully
specified (probably in a separate RFC). RFC-XXXX should specify that
gateways *must* convert fully-convertible messages to one of the
required character sets. Only messages that contain unconvertible
characters should be allowed to cross the Internet "as is", but then
they *must* be encoded in a conservative way, i.e. Quoted-Printable or
BASE64/BASE85.
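
Here is a rough sketch of the gateway rule, just to show the decision it has
to make. Everything in it is an assumption for illustration: the function name
and the returned triple are made up, KOI8-R is used as an arbitrary example of
an "other" character set, and 10646 is left out of the required list because
Python's Unicode strings would make every message trivially convertible.

```python
import base64

# Stand-ins for the *required* character sets: (label, Python codec).
REQUIRED = (("MAILASCII", "ascii"), ("Latin-1", "latin-1"))


def gateway(raw, label, codec):
    """Decide how a message in an *allowed* character set crosses the Internet.

    raw is the message body as bytes in the declared character set.
    Returns (charset label, transfer encoding or None, body bytes).
    """
    text = raw.decode(codec)
    for req_label, req_codec in REQUIRED:
        try:
            # Fully convertible: the gateway must convert and relabel.
            # (Transfer encoding of converted bodies is outside this sketch.)
            return req_label, None, text.encode(req_codec)
        except UnicodeEncodeError:
            continue
    # Unconvertible characters: pass "as is", but conservatively encoded.
    return label, "BASE64", base64.b64encode(raw)


plain = "hello".encode("koi8-r")
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
print(gateway(plain, "KOI8-R", "koi8-r")[:2])
print(gateway(cyrillic, "KOI8-R", "koi8-r")[:2])
```

The first message contains nothing outside ASCII, so it leaves the gateway
relabelled as MAILASCII; the second keeps its original label and is only
wrapped in BASE64.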

Does this sound acceptable?


Erik