ietf-822

UTF-8 over RFC 2047 (Re: Call for Usefor to recharter)

2003-01-07 11:12:01

Dan Kohn wrote:
I wrote:
I have a simple question.  What can a UTF-8 subject header
communicate that an RFC 2047 one can't?  Other than inelegance,
what's the downside of 2047, when the upside is a huge increase in
backward compatibility?

I do not know where this discussion took place, but I have an answer to it.

It's a simple fact.
In every single thread I have seen whose subject carried non-US-ASCII data encoded with RFC 2047 (sorry, I wrote 2049 by mistake in my last mail), the subject turned to garbage after 5 or 6 messages.

The reason is that all the RFC 2047 implementations around are full of errors.

And the reason for that, in turn, is that the RFC 2047 encoding is full of special cases and hard-to-understand rules, and allows an amazing number of different encodings of the same string. The analysis done during the recent discussion on the usefor mailing list uncovered incredibly obscure border cases. These can only lead an implementor to get it wrong, or to a choice between accepting incorrect encodings and respecting the standard. Respecting the standard means refusing what other software produces, which makes the conforming implementation look like the broken one, given how much software produces the incorrect encodings.
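To illustrate the "many encodings of the same string" point, here is a minimal sketch using Python's standard email.header module. The sample subject ("café") and the helper name decode_subject are my own, chosen only for illustration. It shows four different wire forms that a conforming decoder should turn into the same text; it says nothing about how any particular mail or news client behaves.

```python
from email.header import decode_header, make_header

def decode_subject(raw: str) -> str:
    """Decode an RFC 2047 header value into a Unicode string."""
    return str(make_header(decode_header(raw)))

# Four different wire forms of the same subject text (illustrative sample).
variants = [
    "=?utf-8?q?caf=C3=A9?=",            # UTF-8, Q encoding
    "=?utf-8?b?Y2Fmw6k=?=",             # UTF-8, B (base64) encoding
    "=?iso-8859-1?q?caf=E9?=",          # a different charset, Q encoding
    "=?utf-8?q?ca?= =?utf-8?b?ZsOp?=",  # split into two adjacent encoded-words
]

decoded = [decode_subject(v) for v in variants]
for raw, text in zip(variants, decoded):
    print(f"{raw!r} -> {text!r}")
```

The last variant relies on the rule that white space between two adjacent encoded-words is dropped, one of the details that is easy to get wrong.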

And here raw UTF-8 is a clear winner. No complex implementation rules, no border cases: one string will always have one and only one representation.
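As a small sketch of the "one representation" claim, assuming we are talking about a fixed sequence of Unicode code points (composed versus decomposed characters are a separate question that this example does not address): encoding to UTF-8 yields exactly one byte sequence, and a strict decoder rejects non-shortest ("overlong") forms. The helper name is_valid_utf8 and the sample data are mine.

```python
subject = "caf\u00e9"

# Encoding a given sequence of code points yields one fixed byte sequence.
wire = subject.encode("utf-8")
assert wire.decode("utf-8") == subject

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An overlong two-byte form of "/" (U+002F); strict decoders must reject it,
# so it is not a second valid spelling of the same character.
overlong_slash = b"\xc0\xaf"

print(wire, is_valid_utf8(wire), is_valid_utf8(overlong_slash))
```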

Another choice would be to throw away RFC 2047 and devise a new 7-bit encoding that does not have all these drawbacks.

This has been debated, but not chosen.

In my opinion, here are the main reasons that justify that:
- reservations about producing yet another encoding (which might itself have flaws that are not immediately apparent)
- the fact that, by a very wide majority, the non-US-ASCII world has *already* chosen raw 8-bit data over 7-bit (RFC 2047) encoded data.[1]
- such an encoding would at first have no support in the installed base of software, whereas both RFC 2047 and raw UTF-8 are at least already supported by part of it.

[1] This excludes the parts of the world where the standard encoding for email is seven bit.