ietf-822

Re: Upgrading to UTF-8

2003-02-11 10:15:43

In <20030211024241(_dot_)GB15342(_at_)debian> Sam Roberts 
<sroberts(_at_)uniserve(_dot_)com> writes:

All currently valid email is 7bit/ASCII. Its meaning will not change if
future email defines a meaning to 8bit message headers, and assigns that
meaning to be some character set, such as utf-8.

So, it is backwards compatible in this sense, is it not?

Yes.
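
A minimal sketch of why this holds, assuming headers are handled as raw bytes (the function names here are illustrative and not taken from any draft): every 7-bit ASCII byte string is also valid UTF-8 and decodes to the same text, so assigning UTF-8 to 8-bit headers cannot change the reading of an existing message.

```python
def is_7bit(raw: bytes) -> bool:
    """True if every octet has the high bit clear (plain ASCII)."""
    return all(b < 0x80 for b in raw)


def read_header(raw: bytes) -> str:
    """Read a header line under the proposed rule: 8-bit octets mean UTF-8."""
    return raw.decode("utf-8")


# A currently valid (7-bit) header reads the same under both interpretations.
legacy = b"Subject: Upgrading to UTF-8"
same_meaning = is_7bit(legacy) and legacy.decode("ascii") == read_header(legacy)

# A header carrying raw UTF-8 is not 7-bit, so only upgraded software
# is guaranteed to handle it.
upgraded = "Subject: Mise à jour".encode("utf-8")
needs_8bit = not is_7bit(upgraded)
```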

In theory, it is not backwards compatible with the SMTP transport, since
that expects messages to be 7bit ASCII.

I prefer to use the term "forwards compatible" to describe the situation
where some newly introduced feature (e.g. UTF-8) can interwork
successfully with the existing installed base. So no, UTF-8 is not
forwards compatible with SMTP (though in practice most current
implementations pass it unchanged, the principal exception being sendmail,
and that will change with the next major upgrade, if not before).


However, Usefor does not permit sending raw UTF-8 in email, so the
question SHOULD NOT arise (though it probably will :-( ).

In practice, I get 8bit messages (mostly spam, but some from native
French speakers) very frequently.

The French Usenet hierarchy took a unilateral decision to use raw ISO8859-1
in headers, because available implementations of RFC 2047 were so buggy. I
think they have been persuaded of the error of their ways, at least to the
extent of switching to UTF-8.

So, a sender of a message with utf-8 in the headers may find it not
delivered. This doesn't sound like a catastrophic break in the current
messaging system. It actually sounds like the only people who will
notice are the senders and receivers, and nobody else.

Ssshhhhhh! You are not supposed to utter heresies like that on this list :-(.
But yes, that is precisely the position taken by the Usefor draft. The
main problem seems to be that if a news2mail gateway (e.g. from a
particular newsgroup to a mailing list) is not upgraded, then all the
people on the mailing list will notice. But at least the fix then need
only be applied in the one place and, for an existing English-speaking
newsgroup (which most of them are), it is not likely to be an issue
anyway.


The main objections to utf-8 becoming the "native" charset of internet
messages seem to be:

- objections to utf-8 as a standard/"privileged" character set

There really is no other viable candidate for the privilege. The Chinese
might make claims for GB18030 but there is zilch chance of the rest of the
world agreeing to that.


- It is incompatible with RFC[2]822

 In my mind, it is a "compatible" extension of RFC2822. It does not
 change the meaning of any currently valid messages.

It is backwards compatible. Forwards compatibility is an issue for the
transport and user agents.

- It is incompatible with SMTP.

Yes. See above.

The interesting questions seem to be:

1 - does this mean that it can't be standardized?

 It WILL be transported by some SMTP implementations, and by all NNTP
 ones. 

Implementations will certainly catch up within the next few years, but ...

 But, I can see a strong objection to allowing a message format
 that "may or may not be" transportable.

I have some ideas on that. A header that says "this message uses 8-bit"
might be necessary.

2 - can a utf-8 encoded message be down-coded during transport?

 This is the real problem, it seems, and it seems to be a fundamental
 property of the RFC822 format: the header field formats aren't
 self-describing. It's not possible to know whether a header field is
 unstructured or structured and, if structured, whether words are allowed
 to be encoded. Because of that, it's not possible to encode/decode
 without knowing the field definition, and an automated grep of
 all RFCs to determine it would be a little much to ask.
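
To make the point concrete, here is a rough sketch using Python's standard email package. The two helper names are hypothetical. It shows that down-coding is done differently for an unstructured field and for an address field, which is why the encoder has to know the field definition in advance.

```python
from email.header import Header
from email.utils import formataddr


def downcode_unstructured(value: str) -> str:
    """Unstructured field (e.g. Subject): the whole value may be encoded."""
    return Header(value, charset="utf-8").encode()


def downcode_address(display_name: str, addr: str) -> str:
    """Address field (e.g. From): only the display name may be encoded;
    the addr-spec inside the angle brackets must stay as it is."""
    return formataddr((display_name, addr), charset="utf-8")


subject = downcode_unstructured("Grüße aus Köln")
sender = downcode_address("José", "jose@example.org")
```

Applying the first function to a From field would encode the address itself and produce something a mailer could no longer parse.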

This is essentially an RFC 2047 problem, because it requires that an
implementation already knows whether a header is structured or not, and if
so how. The position that Usefor now takes is that gateways MUST be able
to recognize all the official News headers (and the usual mail ones as
well) but that any header not recognized should have an "X-" stuck on the
front of it and then be treated as unstructured. Since most 'strange'
headers likely to be found in Usenet articles are for human consumption
only, this is not likely to result in any breaches of protocol.
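
The gateway rule just described might be sketched as follows. This is only an illustration under stated assumptions: KNOWN_HEADERS is a hypothetical subset rather than the list in the Usefor draft, and gateway_header is an invented name. Recognized headers are passed back untouched here, since their field-specific handling is outside the sketch.

```python
from email.header import Header
from typing import Tuple

# Hypothetical subset of the headers a gateway is required to recognize.
KNOWN_HEADERS = {"from", "subject", "newsgroups", "date", "message-id"}


def gateway_header(name: str, value: str) -> Tuple[str, str, bool]:
    """Return (name, value, recognized) for one header crossing the gateway."""
    if name.lower() in KNOWN_HEADERS:
        # Recognized: field-specific encoding rules apply (not shown).
        return name, value, True
    # Unrecognized: stick "X-" on the front and treat it as unstructured,
    # so the whole value can be RFC 2047 encoded if it is not pure ASCII.
    if not value.isascii():
        value = Header(value, charset="utf-8").encode()
    return "X-" + name, value, False
```

For example, gateway_header("Mood", "café") yields the name "X-Mood" with an ASCII-only encoded value, while gateway_header("Subject", "hello") comes back unchanged.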


 So, you can only transform some fields. Like the ones that you know
 are allowed to contain utf-8, because they are in the USEFOR draft.

Exactly so.

 What about the others? What about throwing away experimental headers
 that have binary in them? Or leaving them raw, at the gateway admin's
 option?

There are no such binary headers defined in Usefor, so no problem. The
nearest is the Newsgroups header, and there is a special encoding defined
for that, for use in such emergencies.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
