ietf-822

Re: RFC 2047 and gatewaying

2003-01-03 22:01:44

Leo Bicknell <bicknell(_at_)ufp(_dot_)org> writes:

> What's my point?  Today I can "grep" a newsgroup article (ok, depends on
> server software and format), pass that to "dig" (ok, maybe with some sed
> and other work) and find out DNS information, and then use "sendmail"
> (ok, with a wrapper to generate a real e-mail) to mail someone.  I hope
> we all agree this is a good thing.  I see a future developing where a
> custom filter will be needed between each of those steps to preserve
> "international" characters.  I hope we can all see why that is bad.

Right.  I think we've all been here and all had these same thoughts.

Here's where my thought processes went after those thoughts, roughly:

 * I realized that, due to the way that 8-bit character sets developed,
   there is no "canonical" 8-bit character encoding and certainly no
   canonical character encoding that handles the full range of characters
   people want to use.  The realization that this leads to is that there's
   simply no such thing as "unencoded" 8-bit data.

   It's best to think of all 8-bit character data as encoded.  UTF-8 is
   just as much an encoding as RFC 2047.  UTF-16 is an encoding.  UTF-32
   is an encoding.  You're going to have to deal with an encoding, no
   matter what happens, until some mythical far future in which everyone
   uses exactly the same data representation, and then we might be able to
   stop thinking about encoding like we did with US-ASCII.
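The point that every representation is an encoding can be made concrete. A minimal sketch (Python chosen for brevity; this example is not from the original post) showing one non-ASCII string under three Unicode encodings and as an RFC 2047 encoded-word:

```python
from email.header import Header

text = "héllo"

# The same five characters produce different bytes under each encoding.
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, text.encode(enc).hex())

# RFC 2047 is just one more encoding, designed to survive 7-bit headers.
print(Header(text, "utf-8").encode())
```

There is no "raw" form here to fall back on; even the shortest of these byte sequences is an encoding choice.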

 * Your goal above therefore reduces to "can we all agree on an encoding
   for characters?"

 * The development of character sets is highly political.  It's not clear
   that development of a single character set is going to succeed,
   unfortunately not for technical reasons but for political reasons.  But
   if any is going to succeed, Unicode has the best shot.  Unicode is,
   however, a character *set*, not a unique character *encoding*.  That's
   just half the puzzle, namely "what characters exist."

 * Lots of people don't use Unicode at all and show no sign of starting.
   Will those people all eventually switch to Unicode?  I really don't
   know; again, see the above point about politics.  I don't know very
   much about politics, particularly those politics, so I don't feel
   qualified even to guess.  That means that assuming Unicode is going
   to win seems pretty risky to me.

 * Assuming that the world agrees on Unicode (there don't seem to be any
   other viable options for a universal character set used all over the
   world), there are at least three major encodings (UTF-8, UTF-16, and
   UTF-32) that can be used with it, as well as a bunch of other minor
   ones.  So "use Unicode" doesn't answer what encoding we should try to
   use.

 * UTF-16 breaks Internet protocols, and all the programs that manipulate
   them, horribly: UTF-16 code units routinely contain zero bytes, which
   byte-oriented C libraries treat as embedded NULs and therefore
   end-of-string markers.  So I have a hard time imagining the Internet
   standardizing on UTF-16.  You can't even store files in a Unix file
   system with names encoded in UTF-16 without major surgery to the
   entire C API.
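The embedded-NUL problem is easy to demonstrate. A small sketch (Python used here just to show the bytes; not part of the original post):

```python
data = "Hi".encode("utf-16-le")
print(data)  # the two ASCII characters each carry a zero high byte

# A byte-oriented C routine like strlen() stops at the first NUL,
# so it would see only the first byte of this two-character string.
as_c_string_sees_it = data.split(b"\x00", 1)[0]
print(as_c_string_sees_it)
```

Any API that passes strings as NUL-terminated `char *` values, which is to say nearly all of Unix, silently truncates such data.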

 * Microsoft Windows uses UTF-16, as do many other early adopters of
   Unicode.

 * UTF-8 seems to be the obvious choice for Internet protocols, since
   transforming US-ASCII into UTF-8 is an identity transform, and NUL
   still means NUL.  However, UTF-8 penalizes non-ASCII characters
   spacewise, and is somewhat more complex to parse and reason about
   than a fixed-width encoding.  So there are good reasons to use it
   and good reasons not to use it.
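Both halves of that trade-off are visible in a few lines. A sketch (Python; the example strings are illustrative, not from the original post):

```python
# US-ASCII bytes are unchanged under UTF-8 (an identity transform),
# so existing 7-bit tools keep working on UTF-8 text that happens
# to be ASCII.
assert "grep".encode("ascii") == "grep".encode("utf-8")

# But non-ASCII text pays a space penalty relative to UTF-16:
s = "日本語"                       # three BMP characters
print(len(s.encode("utf-8")))      # 9 bytes (3 per character)
print(len(s.encode("utf-16-le")))  # 6 bytes (2 per character)
```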

 * sendmail still is not fully 8-bit clean in the latest analysis that
   I've seen, which means that 7-bit issues continue to plague the mail
   system.

When you put all of that together, I don't see any clear character
encoding winning, and furthermore I think that the reasons why no
character encoding is winning are outside the scope of issues that the
IETF can manage.

I accordingly then became much less optimistic about UTF-8 providing the
sort of universal tool access that you talk about, and have been unable to
see any clear path that will get us to there from where we are now.
Universal flag days, as you point out, aren't going to work.  I'm
therefore becoming increasingly resigned to having to deal with encodings
pretty much forever, or at least for the foreseeable future.  So in answer
to Charles's question, yes, I expect to see RFC 2047 ten or even twenty
years from now.  I hope I won't, but I do expect that.

If it's any comfort, the only reason why you and I have been spared this
so far is because we both use English and are therefore happy with
US-ASCII.  The rest of the world has been dealing with this for years.

-- 
Russ Allbery (rra(_at_)stanford(_dot_)edu)             
<http://www.eyrie.org/~eagle/>
