Re: RFC 2047 and gatewaying


In a message written on Fri, Jan 03, 2003 at 09:01:42PM -0800, Russ Allbery 
wrote:

   It's best to think of all 8-bit character data as encoded.  UTF-8 is
   just as much an encoding as RFC 2047.  UTF-16 is an encoding.  UTF-32
   is an encoding.  You're going to have to deal with an encoding, no


Per unicode.org (gotta love Google) there are 13 formal standards
for encoding Unicode, and at least a half dozen more.  That's an
order of magnitude too many.  While I would like one, the world
has existed fine with two (US-ASCII & ISO-8859-1), and three might
work.  But 13?  No way, no how; give up now if you think that might work.

The world can deal with a fairly small set of encodings.  The world
cannot deal with a large (> 5) number of them.

 * Your goal above therefore reduces to "can we all agree on an encoding
   for characters?"


If we agree Unicode has all available characters (which I'm not
sure we do) then it seems quite strange to me that we can't agree
how to encode them.  Again, possibly to a couple of standards (e.g.,
big endian and little endian, not 13).
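
To make the "same character, many encodings" point concrete, here is a
minimal sketch in Python 3 using only the standard codecs.  The names
SCHEMES and encode_all are made up for illustration; the sample
character U+00E9 is an arbitrary choice.

```python
# Encode one character under several Unicode encoding schemes and
# show the resulting bytes as hex. SCHEMES and encode_all are
# illustrative names, not part of any standard.
SCHEMES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def encode_all(ch: str) -> dict:
    """Map each scheme name to the hex form of ch's encoded bytes."""
    return {scheme: ch.encode(scheme).hex() for scheme in SCHEMES}


encoded = encode_all("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
for scheme, hex_bytes in encoded.items():
    print(f"{scheme:10} {hex_bytes}")
```

Five schemes, five different byte sequences for one character, and a
receiver that guesses the wrong one gets nonsense.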

 * The development of character sets is highly political.  It's not clear
   that development of a single character set is going to succeed,
   unfortunately not for technical reasons but for political reasons.  But


I hate to be an arrogant American in this, but I will take that
role.  If, after we adopt something better (Unicode?  something
else?  I dunno) you can't speak a character set on the Internet
you don't deserve to be online.  I want, desperately, to do this
today with US-ASCII, but no matter how much I want it much of the
world does not speak English.  So, I will adopt software that allows
the majority of foreign (to me) languages to work.  If some obscure
language with 1,000 speakers whose country's GDP is less than my
salary is excluded, excuse me if I don't care.

If there is anything technology has proved it is that standards
are good.  I'm not going to hold my breath for one worldwide
"language", but the ones on the fringes can die and go away.

 * Lots of people don't use Unicode at all and continue to not use
   Unicode.  Are those people all going to switch to Unicode?  I really


Lots of people don't care.  If their xterm/dos prompt/securecrt
prompt displays {insert favorite unusual language} they will be
happy.  If it displays "garbage" they will be happy.  If it displays
"you cannot read this" they will be happy.

The fact that most people know a single, or perhaps two, languages
is irrelevant.  Display it as it was intended.  If I buy a book
in Chinese and can't read it that doesn't change the fact that it
is in Chinese.  If the software works, no one will care.

 * UTF-16 breaks Internet protocols and all the programs that manipulate
   them horribly because UTF-16 includes characters that C libraries think
   are embedded NULs and therefore end-of-string markers.  So I have a
   hard time imagining the Internet standardizing on UTF-16.  You can't
   even store files in a Unix file system with names encoded in UTF-16
   without major surgery to the entire C API.


You have just suggested a good reason UTF-16 should be removed as a
standard.  1 down, 12 to go.  Next?
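
The embedded-NUL problem is easy to see.  The sketch below is Python 3,
not C, so it only imitates what a NUL-terminated string routine would
do; the helper names has_embedded_nul and as_c_string_sees_it and the
sample file name are invented for illustration.

```python
# Show that UTF-16 puts zero bytes into plain ASCII text while UTF-8
# does not. Helper names are hypothetical; the truncation below only
# imitates how a NUL-terminated C string would be read.

def has_embedded_nul(text: str, encoding: str) -> bool:
    """Return True if encoding the text yields at least one zero byte."""
    return b"\x00" in text.encode(encoding)


def as_c_string_sees_it(text: str, encoding: str) -> bytes:
    """Return the bytes up to (not including) the first zero byte."""
    return text.encode(encoding).split(b"\x00", 1)[0]


name = "hello.txt"
print(has_embedded_nul(name, "utf-16-le"), as_c_string_sees_it(name, "utf-16-le"))
print(has_embedded_nul(name, "utf-8"), as_c_string_sees_it(name, "utf-8"))
```

Under UTF-16 (little endian here) the name is cut off after its first
byte; under UTF-8 it survives intact.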

 * Microsoft Windows uses UTF-16, as do many other early adopters of
   Unicode.


5 billion flies every day eat feces, why don't you?

Don't let an early adopter get in the way of the right solution.
Remember, at one time the entire ARPAnet did not run TCP/IP, that
didn't stop them from converting.

 * UTF-8 seems to be the obvious choice for Internet protocols, since


Yes, it does, at least to me, uninformed as I am on so many of these issues.

 * sendmail still is not fully 8-bit clean in the latest analysis that
   I've seen, which means that 7-bit issues continue to plague the mail
   system.


While I run sendmail, it is not the only mail system.  qmail,
vmmail, etc. are all widely used.  Change the RFC, then complain
about the program.  At least, in general, mail has MIME, which has
a number of ways to encode Unicode in a standard.  That said, "raw"
mail is still of value, and should be made to "just work".
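
For reference, one of those MIME mechanisms is the RFC 2047
encoded-word, which wraps non-ASCII header text in pure ASCII so it
survives a 7-bit transport.  A minimal sketch using Python 3's standard
email.header module follows; the sample subject text and the helper
name decode_to_text are assumptions for illustration.

```python
# Round-trip a non-ASCII header value through an RFC 2047 encoded-word
# using the standard library. The subject text is an arbitrary sample
# and decode_to_text is a hypothetical helper name.
from email.header import Header, decode_header


def decode_to_text(raw: str) -> str:
    """Decode every encoded-word in a header value back to a str."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


subject = "Gr\u00fc\u00dfe"
wire_form = Header(subject, "utf-8").encode()  # ASCII-only header text
print(wire_form)
print(decode_to_text(wire_form))
```

The wire form contains only 7-bit characters, which is exactly why it
gets through gateways that are not 8-bit clean.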

Universal flag days, as you point out, aren't going to work.  I'm
therefore becoming increasingly resigned to having to deal with encodings
pretty much forever, or at least for the foreseeable future.  So in answer


If we have to use (many) encodings then all of this is going to fail.
Period.  Full stop.  Whatever the "." character means in your language.

If it's any comfort, the only reason why you and I have been spared this
so far is because we both use English and are therefore happy with
US-ASCII.  The rest of the world has been dealing with this for years.


I would consider you right, but for the wrong reason.  We all
communicate for one, and only one reason.  To communicate ideas.
In the end ideas of character sets and encodings and all that are
rather esoteric.  The real question is do we understand what each
other is saying?  I am 100% sure that worldwide communication will
lead to one, and only one worldwide language.  In my lifetime?
Doubtful.  In my children's?  Likely.  Will there be a lot of pain
in the middle?  Yes.  Pockets of various languages will exist for a
long time, and generally communicate only internally to themselves.

To that end, I suspect most of the world isn't dealing with this.
Speakers of language "x" use a similar set of tools that use a 
similar set of encodings and character sets.  The problem is when
someone who speaks language x communicates with someone using language
y, and expects them to view something like they do (eg, the book
analogy).  While I think it would be good for us all to view the same
thing (like a book), it really does no good as we can't understand it.
Ever see foreign language books in your local bookstore?  No?  There's
a good reason.

So, protocols being language independent == good, let's support them
all until the world decides on one.  Expecting anyone else to understand
your character set == bad.  For instance, for me something that decodes
Chinese might as well display a row of "xxxxxxxxxxx"'s.  Displaying the
characters will do me no good.
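
A rough sketch of that "row of x's" display, in Python 3.  It assumes
the text has already been decoded correctly and that the reader can
only make sense of US-ASCII; the function name placeholder_for_ascii_reader
and the sample string are invented for illustration.

```python
# Replace every character outside US-ASCII with "x", leaving the rest
# of the text untouched. The function name and sample text are
# hypothetical, chosen only to illustrate the idea.

def placeholder_for_ascii_reader(text: str) -> str:
    """Return text with each non-ASCII character shown as 'x'."""
    return "".join(ch if ord(ch) < 128 else "x" for ch in text)


sample = "\u4f60\u597d, world"  # two Chinese characters, then ASCII
shown = placeholder_for_ascii_reader(sample)
print(shown)
```

The protocol still carried the original characters faithfully; only the
final display step gave up on them.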

-- 
       Leo Bicknell - bicknell(_at_)ufp(_dot_)org - CCIE 3440
        PGP keys at http://www.ufp.org/~bicknell/
Read TMBG List - tmbg-list-request(_at_)tmbg(_dot_)org, www.tmbg.org