Re: Upgrading to UTF-8


Quoting blilly(_at_)erols(_dot_)com, on Mon, Feb 10, 2003 at 02:57:26PM -0500:

Because there are old messages, all of the existing methods need to
continue to be supported indefinitely so that those old messages can
still be read.  A transition requires backward compatibility so that
the infrastructure doesn't suddenly break (as that would not constitute
a transition), and it requires a feasible plan.  The Usefor draft
breaks backward compatibility and provides no feasible transition plan.
That's not "working on solutions", that's *compounding* the real problem.


All currently valid email headers are 7bit ASCII. Their meaning will not
change if a future standard defines a meaning for 8bit message headers
and assigns them some character set, such as utf-8.

So, it is backwards compatible in this sense, is it not?
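
To make that concrete, here is a minimal Python sketch. The sample header
line is just an illustration I made up. It only shows that 7bit ASCII
octets decode to the same text whether the reader treats them as ASCII or
as utf-8, which is the narrow sense of "compatible" I mean here.

```python
# A legacy header line, pure 7-bit ASCII (sample value, made up for illustration).
legacy = b"Subject: Re: Upgrading to UTF-8\r\n"

# Decoding the same octets as ASCII or as utf-8 yields identical text.
same_meaning = legacy.decode("ascii") == legacy.decode("utf-8")

# The same holds for every one of the 128 ASCII code points.
all_ascii = bytes(range(128))
ascii_is_subset = all_ascii.decode("ascii") == all_ascii.decode("utf-8")

print(same_meaning, ascii_is_subset)
```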

In theory, it is not backwards compatible with the SMTP transport, since
that expects messages to be 7bit ASCII.

In practice, I get 8bit messages (mostly spam, but some from native
French speakers) very frequently.

So, a sender of a message with utf-8 in the headers may find it not
delivered. This doesn't sound like a catastrophic break in the current
messaging system. It actually sounds like the only people who will
notice are the senders and receivers, and nobody else.

This disturbs me.


If you genuinely believe that the Usefor draft solves problems rather
than compounding them, you are indeed seriously disturbed.


The lack of technical content and the incredibly personal comments in
this debate disturb me! I don't understand why it's like that.

The main objections to utf-8 becoming the "native" charset of internet
messages seem to be:

- objections to utf-8 as a standard/"privileged" character set

  Fair enough. It has some problems, as any possible charset probably
  would. But the lack of a standard charset has its own set of pretty
  serious drawbacks: no distinguished encoding is possible for X.509
  certificates, for example.  A number of IETF protocol families, like
  PKIX, are going utf-8 rather than having to deal with multiple
  national language encodings.

- It is incompatible with RFC[2]822

  In my mind, it is a "compatible" extension of RFC2822. It does not
  change the meaning of any currently valid messages.

  A utf-8 message, of course, does NOT have a defined meaning to an
  RFC822 UA. One could argue that neither do RFC2047 encoded messages.
  Seeing =?iso-8859-1?b?45;lakdfj322lkdkd?= as a subject isn't much
  better than how my UA displays Korean.

  Anyhow, there are different shades of backwards compatible. S/MIMEv3
  messages, for example, can fail to be handled by S/MIME agents that
  used to be valid S/MIME implementations. Of course, they should have
  made it through the transport, at least, leading to...

- It is incompatible with SMTP.

  This is true. A valid SMTP implementation does not have to transfer
  messages that aren't pure ASCII. However, many implementations seem
  to do so fairly frequently!
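
For the RFC2047 comparison above, a rough Python sketch using the standard
library's email.header module. The sample subject is arbitrary. It shows
the same text in its two forms: as an encoded-word, which is 7bit clean
but opaque to a UA that only knows RFC822, and as raw utf-8, which is
readable by a utf-8 aware UA but not 7bit clean.

```python
from email.header import Header, decode_header, make_header

subject = "Déjà vu"  # arbitrary sample text

# RFC 2047 form: pure ASCII on the wire, shown verbatim by an RFC 822 only UA.
encoded_word = Header(subject, charset="utf-8").encode()

# An RFC 2047 aware UA can recover the original text.
decoded = str(make_header(decode_header(encoded_word)))

# Raw utf-8 form: the same text as octets placed directly in the header.
raw_utf8 = subject.encode("utf-8")
is_7bit = all(octet < 0x80 for octet in raw_utf8)

print(encoded_word)
print(decoded)
print(is_7bit)
```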


The interesting questions seem to be:

1 - does this mean that it can't be standardized?

  It WILL be transported by some SMTP implementations, and by all NNTP
  ones. But, I can see a strong objection to allowing a message format
  that "may or may not be" transportable.

2 - can a utf-8 encoded message be down-coded during transport?

  This is the real problem, it seems, and it seems to be a fundamental
  property of the RFC822 format: the header field formats aren't
  self-describing. It's not possible to know whether a header field is
  unstructured or structured, and if structured, whether words are
  allowed to be encoded. Because of that, it's not possible to
  encode/decode without knowing the field definition, and an automated
  grep of all RFCs to determine it would be a little much to ask.

  Much as I dislike writing BER codecs, I have to admit that ASN.1 and
  XML are better this way.

  So, you can only transform some fields, like the ones that you know
  are allowed to contain utf-8 because they are in the USEFOR draft.

  What about the others? What about throwing away experimental headers
  that have binary in them? Or leaving them raw, at the gateway admin's
  option?

  What are the problems with this approach, operationally?
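
To make question 2 concrete, here is a rough Python sketch of the gateway
behaviour I am describing. Everything in it is an assumption for the sake
of illustration: the function name downcode_field, the "raw"/"drop" policy
names, and the tiny set of fields treated as known unstructured. It is not
taken from the USEFOR draft or from any existing gateway.

```python
from email.header import Header, decode_header, make_header

# Assumption: a hand-maintained set of fields known to be unstructured.
# Illustrative only, not a complete list.
UNSTRUCTURED_FIELDS = {"subject", "comments"}


def downcode_field(name, value, unknown_policy="raw"):
    """Down-code one header field value (bytes) at a hypothetical gateway.

    Returns 7-bit bytes where that is possible, the untouched value when the
    policy is "raw", or None when the field is to be dropped.
    """
    if all(octet < 0x80 for octet in value):
        return value  # already 7-bit, nothing to do
    if name.lower() in UNSTRUCTURED_FIELDS:
        # Assumes the 8-bit octets really are valid utf-8.
        text = value.decode("utf-8")
        return Header(text, charset="utf-8").encode().encode("ascii")
    # Unknown field definition: no way to tell where encoded-words are legal.
    if unknown_policy == "drop":
        return None
    return value


subject = downcode_field("Subject", "Grüße aus Köln".encode("utf-8"))
round_trip = str(make_header(decode_header(subject.decode("ascii"))))
dropped = downcode_field("X-Experimental", b"\xff\xfe", unknown_policy="drop")
kept_raw = downcode_field("X-Experimental", b"\xff\xfe")
```

The point of the sketch is the last branch: without the field definition,
the gateway can only drop the field or pass it through untouched.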


This seems to be a really important issue. Speaking as an implementor,
if the mail standards HAVE to be as baroque and difficult as they are,
fine, I can keep dealing with them. But I would really, really like to
know the design rationale, because utf-8 sure does seem like it would
solve a whole lot of problems.

The RFCs are fairly lacking in any "design and architecture of the IETF
text messaging system" section. "Read the mailing list archives" seems
to be the standard cop-out, but the flame to signal ratio is starting to
make my cheeks burn!

Cheers,
Sam