ietf-822

Re: UTF-8 over RFC 2047 (Re: Call for Usefor to recharter)

2003-01-09 08:57:58

Jean-Marc Desperrier wrote:
Dan Kohn wrote:

I wrote:

I have a simple question.  What can a UTF-8 subject header
communicate that an RFC 2047 one can't?  Other than inelegance,
what's the downside of 2047, when the upside is a huge increase in
backward compatibility?


I do not know where this discussion took place, but I have an answer to it.

There has been substantial discussion on the ietf-822 mailing list. See
http://www.imc.org/ietf-822/mail-archive/maillist.html for the archive.

It's a simple fact.
In every single thread I've seen with non-US-ASCII data in the subject encoded per RFC 2047 (sorry, I wrote 2049 by mistake in my last mail), the subject turned to garbage after 5 or 6 messages.

The reason is that all existing implementations of RFC 2047 are full of errors.

There is one specific error that could cause that, namely improper use
of RFC 2047 for any purpose other than for display.  Once a header
field has been generated with RFC 2047 encoded-words, those
encoded-words should never be modified -- the content may be decoded for
display in the specified charset (and optionally, language), but the
header content should not be modified.

Surely the solution to that problem is to correct the faulty implementations
which are responsible for inappropriately modifying header content.
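
To illustrate the display-only rule, here is a minimal sketch that assumes
Python's standard email.header module.  The sample subject and the helper
names display_form and reply_subject are hypothetical, not taken from any
particular mail client.  The stored header is decoded only to produce a
display string; the reply is built from the untouched original.

```python
from email.header import decode_header, make_header

# Hypothetical stored header value, kept exactly as it was received.
raw_subject = "=?ISO-8859-1?Q?Caf=E9_du_coin?="


def display_form(raw: str) -> str:
    """Decode RFC 2047 encoded-words for display; the input is not modified."""
    return str(make_header(decode_header(raw)))


def reply_subject(raw: str) -> str:
    """Add a single 'Re: ' prefix and pass the encoded-words through as-is."""
    return raw if raw.lower().startswith("re:") else "Re: " + raw


if __name__ == "__main__":
    print(display_form(raw_subject))   # what the user sees
    print(reply_subject(raw_subject))  # what goes back on the wire
```

An agent that instead re-encoded the decoded display string on every reply
would be doing exactly the kind of modification described above.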


And here raw UTF-8 is a clear winner. No complex implementation rules, no border cases, one string will always have one and only one representation.

That is not correct, simply because Unicode allows multiple representations
of the same text, for example through non-spacing modifiers.  Therefore a
UTF-8 encoding of Unicode will also have multiple representations, unless,
of course, some additional "complex implementation rules" are applied for
normalization.  And the longer UTF-8 sequences are effectively border cases,
since different UTF-8 specifications have different rules.
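
Both points can be sketched in a few lines, assuming Python's standard
unicodedata module and its built-in UTF-8 codec.  The helper names same_text
and decodes_as_utf8 are hypothetical.  The first part shows one visible
character with two distinct UTF-8 byte strings that compare equal only after
normalization.  The second shows a five-octet sequence, a form that
RFC 2279 permitted, being rejected by a decoder that follows a stricter
definition.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

composed_bytes = composed.encode("utf-8")      # two octets
decomposed_bytes = decomposed.encode("utf-8")  # three octets


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def decodes_as_utf8(data: bytes) -> bool:
    """Report whether this Python's UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A five-octet sequence in the RFC 2279 style (lead octet 0xF8).
five_octets = b"\xf8\x88\x80\x80\x80"

if __name__ == "__main__":
    print(composed_bytes != decomposed_bytes)  # different octets on the wire
    print(same_text(composed, decomposed))     # equal once normalized
    print(decodes_as_utf8(five_octets))        # depends on the decoder's rules
```

NFC is used here only as one possible choice of normalization form; the
point is that some such rule has to be agreed on before two UTF-8 strings
can be compared reliably.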