RE: UTF-8 over RFC 2047 (Re: Call for Usefor to recharter)


It took place at <http://www.imc.org/ietf-822/mail-archive/msg02674.html>
and <http://www.imc.org/ietf-822/mail-archive/msg02696.html>.

If 2047 is broken, then it should be fixed or replaced (perhaps with
nameprep+punycode) with something that can provide adequate i18n while
still preserving backward compatibility.  Most likely, it should be
fixed, since (due to email) it will quite likely be around forever and
all newsreaders will need to be able to at least decode it for the
foreseeable future.

One solution would be to profile 2047 use for news along the lines of
"legal to generate/must understand" in sections 3 & 4 of RFC 2822.  All
newsreaders would be required to understand 2047, but compliant readers
would only generate 2047 text that actually used Q or B encoding of
UTF-8.
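
As a rough sketch of what "generate only Q or B encoding of UTF-8"
could look like, the Python below builds a B-encoded UTF-8 encoded-word
by hand and decodes it again with the standard library's email.header
module. The helper names encode_b_utf8 and decode_2047 are hypothetical,
and the sketch ignores the 75-character limit on encoded-words and
header folding, so it is an illustration rather than a compliant
generator.

```python
import base64
from email.header import decode_header, make_header


def encode_b_utf8(text):
    """Return text as one RFC 2047 encoded-word (B encoding, UTF-8).

    Hypothetical helper for illustration only: it does not split long
    input to respect the 75-character limit on encoded-words.
    """
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "=?UTF-8?B?" + payload + "?="


def decode_2047(value):
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(value)))


subject = encode_b_utf8("café")
print(subject)               # the 7-bit form that goes on the wire
print(decode_2047(subject))  # what a 2047-aware reader displays
```

Under such a profile a reader would still have to accept other charsets
on input; only the generating side is restricted to UTF-8.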

I'm not sure if this is the right solution, but I stand by my suggestion
that rechartering would provide the context for evaluating the solution
space.

          - dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/>  <tel:+1-650-327-2600> 

-----Original Message-----
From: Jean-Marc Desperrier 
[mailto:jean-marc(_dot_)desperrier(_at_)certplus(_dot_)com] 
Sent: Tuesday, January 07, 2003 10:15
To: ietf-822(_at_)imc(_dot_)org
Cc: usenet-format(_at_)rkive(_dot_)landfield(_dot_)com
Subject: UTF-8 over RFC 2047 (Re: Call for Usefor to recharter)


Dan Kohn wrote :

I wrote:

I have a simple question.  What can a UTF-8 subject header
communicate that an RFC 2047 one can't?  Other than inelegance,
what's the downside of 2047, when the upside is a huge increase in
backward compatibility?


I do not know where this discussion took place, but I have an answer to
it.

It's a simple fact.
In every single thread I've seen with non-US-ASCII data in the subject
encoded with RFC2047 (sorry, I wrote 2049 by mistake in my last mail),
the subject turned to garbage after 5 or 6 messages.

The reason for that is that all implementations of RFC2047 around are 
full of implementation errors.

The reason for that is that the RFC2047 encoding is full of special
cases and hard-to-understand rules, and allows an amazing number of
different encodings of the same string.
The analysis of it that was done during the recent discussion on the
usefor mailing list led to the discovery of incredibly obscure border
cases. These can only result in an implementor getting it wrong, or
having to choose whether to respect the standard and refuse what other
software produces, which will make it look like the one that gets it
wrong, given how much software produces the incorrect encodings.
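
To make the "many encodings of the same string" point concrete, here is
a small sketch using Python's standard email.header module. The sample
string "café" and the helper name decode_2047 are my own choices for
illustration, not something taken from the usefor discussion. It lists
five different header values that a decoder is expected to turn into the
same text.

```python
from email.header import decode_header, make_header

# Five different spellings that should all decode to "café".
VARIANTS = [
    "=?UTF-8?Q?caf=C3=A9?=",               # Q encoding, UTF-8
    "=?UTF-8?B?Y2Fmw6k=?=",                # B encoding, UTF-8
    "=?utf-8?q?caf=C3=A9?=",               # lower-case charset and encoding
    "=?ISO-8859-1?Q?caf=E9?=",             # a different charset
    "=?UTF-8?Q?ca?= =?UTF-8?Q?f=C3=A9?=",  # two adjacent encoded-words
]


def decode_2047(value):
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(value)))


# Collect the distinct decoded results; one entry means all variants agree.
decoded = {decode_2047(v) for v in VARIANTS}
print(decoded)
```

The last variant relies on the rule that whitespace between two adjacent
encoded-words is dropped when decoding, which is the kind of detail an
implementation can easily miss.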

And here raw UTF-8 is a clear winner. No complex implementation rules, 
no border cases, one string will always have one and only one 
representation.

Another choice would be throwing away RFC2047 and devising a new 7-bit
encoding that does not have all these drawbacks.

This has been debated, but not chosen.

In my opinion, here are the main reasons that justify that:
- reservations about producing yet another encoding (that might itself
have defects that are immediately apparent)
- the fact that, by a very wide majority, the non-US-ASCII world has
*already* chosen raw 8-bit over 7-bit (RFC2047) encoded data.[1]
- such an encoding would have no support at first in the installed base
of software, whereas both RFC2047 and raw UTF-8 are at least already
supported by part of it.

[1] This excludes the parts of the world where the standard encoding for
email is seven-bit.