John Klensin's recent mail is a fairly good summary
of the last several weeks and (to me) is quite depressing.
I tried to think of why John's arguments were wrong and maybe it
IS just all personal preference.
My model of interchange looks like this:
sender --> interchange form --> receiver
Confining this to text, it is obvious to me that the right way to
do this is to have a single character set for the interchange form.
(By character set, I mean 10646's ``coded character set'' -- a mapping
from integers to glyph-like things.) This way, the sender can do the
best job possible in converting from local formats into the interchange
charset, and the receiver can do the best job possible converting into
the local format.
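The sender/receiver model above can be sketched in a few lines. This is only an illustration, with Latin-1 standing in for the sender's local format and a list of 10646 code points as the interchange form; the function names are made up for the example:

```python
# Sketch of: sender --> interchange form --> receiver
# Assumption: the local format on both ends is Latin-1, whose 256
# byte values map 1:1 onto the first 256 code points of 10646.

def to_interchange(local_bytes: bytes) -> list[int]:
    # Sender converts local bytes into interchange code points.
    return [b for b in local_bytes]

def from_interchange(codepoints: list[int]) -> bytes:
    # Receiver converts interchange code points into its local format.
    return bytes(cp for cp in codepoints if cp < 256)

msg = "café".encode("latin-1")
assert from_interchange(to_interchange(msg)) == msg
```

Each side only has to know its own local format and the single interchange charset, which is the whole point.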
We all know this isn't the whole problem; you probably need some
extra stuff to help with ``desired display font'' and you get
inexorably drawn into the richtext vortex. But at least we would know
what damned characters we were talking about! And receivers wouldn't
have to know about iso-2022-jp and 8859-? and EBCDIC and .....
The encoding issue is real, but it is a red herring. When we
say charset=10646, that's what we mean. The choice of encoding is
completely independent and driven by forces beyond any rational control.
And the presence of 4 or 5 encodings for 10646 is not an argument against
10646; it is simply an argument that the various objective functions
people use (bandwidth utilisation, presence of 7-bit links, etc) force
different solutions. And don't forget that it is easy to deal with
4 or 5 small, algorithmic, lossless conversions from various encodings
into the local format.
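To show how small and algorithmic such a conversion is, here is a sketch of one of those encodings: UTF-8 (FSS-UTF) encoding of a single 10646 code point. The mail doesn't name a particular encoding; this one is just a familiar example:

```python
# UTF-8 encoding of one 10646 code point: a few shifts and masks,
# fully algorithmic and lossless.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:
        return bytes([cp])
    elif cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

assert utf8_encode(0x41) == b"A"                       # ASCII unchanged
assert utf8_encode(0x65E5) == "日".encode("utf-8")     # a kanji, three bytes
```

The decoder is the same handful of lines in reverse; a receiver supporting 4 or 5 such encodings carries almost no extra weight.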
Why am I blathering about this? 10646 has its problems but it
solves one link in the interchange chain, and it solves it in a way
so that many more groups can interchange easily. Take, for example,
the problems John stated:
i) desire to use characters from multiple character sets in plain text
it clearly solves this problem (or at least 95% of the problem;
as Ohta-san would point out, this may cause kanji to be displayed
	as Chinese -- this is a defect but it's better than NOT seeing
the kanji at all.) My experience in Plan 9 is that there was
no sudden jump to multi-lingual documents but an increase in the
use of diacritics and the like, and symbols (a real Smiley),
and a shift from using formatting constructs to get overstruck/
accented characters to just typing the characters in.
ii) it is a problem to handle multiple charsets in UAs.
	10646 will solve this problem. Sure, right now people use
	ASCII and 8859-? and 2022-jp; fine, migrate them. The encoding
	issue is independent and easy.
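To make the migration point concrete, a sketch using Python's bundled codecs (an anachronism relative to the mail, used here only for illustration): whichever wire charset the text arrives in, decoding yields the same coded characters.

```python
# Two different wire charsets carrying text; after decoding, both
# live in a single coded character set and the UA needs no further
# charset knowledge.
latin = "résumé".encode("iso-8859-1")
jp = "日本語".encode("iso2022_jp")

assert latin.decode("iso-8859-1") == "résumé"
assert jp.decode("iso2022_jp") == "日本語"
```

The per-charset decoder is the only migration cost, and it sits at the edge, not in every UA feature.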
iii) 10646 is taking over and email should get behind it.
My system uses the 10646 charset, and I effectively cannot
	just email outside my local community. Email is holding me back
	here (there are obviously a number of ways around this, but none
	is satisfactory).
iv) Transport of multiple character sets is really a problem and it can
be solved by using a single character set in transport, even if
different sets are used *within* receiving and sending systems.
Agreed. 10646 would solve this. But I am puzzled by the claim
this can't be done within the MIME/UA model. Might someone expand
a little on this?
To sum up, I think using a single charset for interchange is a
good idea and 10646 is the best solution for the foreseeable future.
It is reasonable to work the details out in a WG -- I would prefer
that uninterested people not be turned off charset issues by having to
sit through the bickering that character set stuff normally entails.