Re: character sets

Erik writes:

Greg writes:

Well, if you can convert from ISO 10646 to national 646-n, and then
display 646-n on your dumb terminal, then I'd say you can handle
10646.  You can do something with it, even if you must suffer info
loss for Kanji.


OK, now I can see what sort of level you are aiming for. You're saying
that the UA should be able to convert from one of a small set of
encodings to the user's codeset. The UA need not display *all* of the
world's characters, but it should display the ones that are included
in the user's codeset. You say that it's OK to "suffer info loss for
Kanji". I assume that you mean that the UA does not have to display
Kanji, but it should not lose the information that it has in the
message file itself.


Well, could it not be allowed to display an approximation of a
character missing in the current charset, for example displaying
a c-cedilla in ASCII as "c," and marking it as an approximation?

As far as the above example is concerned, if I were an implementor, I
think I would convert from 10646 to something like 7-bit 2022, so that
the user sees some national 646 characters, with some hyphens
interspersed for the unconvertible characters, so that the user can
see that not all characters could be converted. I might even print a
warning saying that the conversion did not succeed completely.
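
As a rough illustration of the hyphen-substitution idea, here is a minimal
Python sketch. It assumes plain ASCII as a stand-in for a national 646
variant (Python ships no codecs for the national variants), and the
function name to_codeset is hypothetical.

```python
def to_codeset(text, codec="ascii", placeholder="-"):
    """Convert text for display in the target codec.

    Characters the codec cannot represent are replaced by a placeholder.
    Returns the displayable string and the number of characters replaced.
    """
    out = []
    lost = 0
    for ch in text:
        try:
            ch.encode(codec)          # probe: can the codeset hold this character?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(placeholder)   # visible sign that something was dropped
            lost += 1
    return "".join(out), lost


# "Facade" with a c-cedilla, followed by two Kanji.
shown, lost = to_codeset("Fa\u00e7ade \u6f22\u5b57")
if lost:
    print(f"warning: {lost} character(s) could not be converted")
print(shown)
```

Note that this only affects what is displayed; the stored message would
have to be kept unchanged for no information to be lost.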


One might also just convert it to ASCII using the approximations
mentioned above. I think that could be done without loss of information
and be fully reversible. Would that not be better than losing
information, as mentioned above?
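
For the conversion to be reversible, the approximation has to be marked so
that it cannot be confused with ordinary ASCII text. A minimal Python sketch
follows; the marker character "&", the three-entry table and the function
names are all assumptions made for illustration, not an agreed scheme.

```python
MARK = "&"  # hypothetical marker that introduces an approximation

# Hypothetical table: each non-ASCII character gets a unique two-character
# ASCII approximation.
TABLE = {"\u00e7": "c,", "\u00e9": "e'", "\u00fc": "u:"}
REVERSE = {approx: ch for ch, approx in TABLE.items()}


def approximate(text):
    """Rewrite text as pure ASCII, marking every approximation."""
    out = []
    for ch in text:
        if ch == MARK:
            out.append(MARK + MARK)        # a literal marker is doubled
        elif ch in TABLE:
            out.append(MARK + TABLE[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            raise ValueError(f"no approximation defined for {ch!r}")
    return "".join(out)


def restore(text):
    """Invert approximate(); raises KeyError on malformed input."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != MARK:
            out.append(text[i])
            i += 1
        elif text[i + 1:i + 2] == MARK:
            out.append(MARK)               # doubled marker: literal "&"
            i += 2
        else:
            out.append(REVERSE[text[i + 1:i + 3]])
            i += 3
    return "".join(out)


original = "gar\u00e7on & caf\u00e9"
ascii_form = approximate(original)         # "gar&c,on && caf&e'"
assert restore(ascii_form) == original
```

The round trip only works if every approximation is unique and both ends
use the same table.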

Actually, the conversion does not have to be done by the UA. The
conversion could be done by the enclave's gateways, again as long as
no info is lost.


Agree.

There is a big difference between implementing this series of
character sets and asking that I implement (or be able to convert to
and from) Unicode, 646-n, and iso10646.  With the former series, I
implement the level of functionality I need.  By allowing any arbitrary
character set, I must implement all sets that can possibly give me
the functionality I need, because (I?) must expect any one of them.


This presupposes that there exists a set of characters that is a
superset of *all* the character sets in the world. Although 10646 and
Unicode come close to this goal, neither of them has reached it, and
even if they do reach the goal today, there is certainly no guarantee
that tomorrow won't invent or discover some more characters. Also, if
you specify ASCII, Latin-1 and 10646, then you will not be able to
send Unicode across the Internet since some Unicode characters are not
included in 10646.

So, no matter what you do, you need to provide an escape hatch for the
people with "other" character sets. They can encode using e.g. BASE64
of course, but they should be allowed to write e.g. "Unicode" in one
of the headers.


The missing characters could also be coded in the private use zones
of 10646. 10646 already has the mechanism, so why not use it?
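
To make the private-use idea concrete, here is a small Python sketch. It
assumes the private use zone of the basic plane, U+E000 to U+F8FF, and a
hypothetical registry class; how such assignments would be agreed between
sender and receiver is left open.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # private use zone of the basic plane


class PrivateUseRegistry:
    """Hypothetical table assigning private-use code points to characters
    that have no standard code point."""

    def __init__(self):
        self._by_name = {}
        self._by_code = {}

    def assign(self, name):
        """Return the private-use character for name, allocating if needed."""
        if name in self._by_name:
            return self._by_name[name]
        code = PUA_START + len(self._by_name)
        if code > PUA_END:
            raise OverflowError("private use zone exhausted")
        ch = chr(code)
        self._by_name[name] = ch
        self._by_code[ch] = name
        return ch

    def lookup(self, ch):
        """Return the registered name for a private-use character, or None."""
        return self._by_code.get(ch)


registry = PrivateUseRegistry()
glyph = registry.assign("vendor glyph 1")
print(hex(ord(glyph)), registry.lookup(glyph))
```

The catch is that the receiver needs the same table, so the assignments
have to be exchanged or registered somewhere.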

However, if we *do* decide to specify ASCII, Latin-1 and 10646, then
we should not allow e.g. a Unicode message to be sent across the
Internet if the message is fully convertible to one of the required
codesets. In other words, we can only add a character set to the list
of *allowed* character sets, if the conversion between that character
set and one or more of the *required* character sets has been fully
specified (probably in a separate RFC). RFC-XXXX should specify that
gateways *must* convert fully-convertible messages to one of the
required character sets. Only messages that contain unconvertible
characters should be allowed to cross the Internet "as is", but then
they *must* be encoded in a conservative way i.e. Quoted-Printable or
BASE64/BASE85.
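
The gateway rule proposed here could be sketched in Python roughly as
follows. The function name, the return shape and the choice of required
sets are assumptions; 10646 is left out of the required list because
Python strings are already Unicode-based, so every decodable message
would trivially convert.

```python
import base64

# Required character sets, tried in order of preference (assumed list).
REQUIRED = ("us-ascii", "iso-8859-1")


def gateway_convert(body, charset, required=REQUIRED):
    """Apply the proposed rule to a message body.

    Returns (charset label, transfer encoding, payload). A fully
    convertible body is re-encoded in the first required set that can
    hold it, and the transfer encoding is left as None to be chosen
    separately. Otherwise the body crosses "as is", BASE64-encoded.
    """
    text = body.decode(charset)
    for target in required:
        try:
            return target, None, text.encode(target)
        except UnicodeEncodeError:
            continue  # this required set cannot hold every character
    return charset, "base64", base64.b64encode(body)


# A Unicode message that is fully convertible to Latin-1 gets converted.
print(gateway_convert("caf\u00e9".encode("utf-16"), "utf-16"))

# A KOI8-R message with Cyrillic text is passed through, BASE64-encoded.
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
print(gateway_convert(cyrillic, "koi8-r"))
```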

Does this sound acceptable?


To summarize:

I would rather use 10646, with its private use zones for any missing
characters.

I would also allow the Japanese use of 2022, along with perhaps the
Chinese and Korean use of the same mechanisms, if these are in use today.

I would like to restrict the character sets to a very few,
like the ones mentioned.

As a fallback when a character cannot be converted, I would rather
use a unique approximation that is fully reversible, to avoid
information loss.

keld