RE: RFC 2047 and gatewaying


Kai Henningsen wrote:

I don't want to go into that whole debate right now, but I *am* of the
opinion that the IETF is going exactly the wrong way there, exchanging
short-term pain for long-term pain. That solution is *never* the right
one.


OK, I get that you want to move usenet to the right long-term solution,
and that you think that is raw UTF-8 in headers.  Personally, I believe
that raw UTF-8 headers will never make it through the IESG, and that it
is unlikely that large numbers of usenet clients or servers will
implement a proposal without an IETF imprimatur.

Frankly, while it is certainly possible to argue about the relative
merits of various solutions, I really doubt that any other solution
has *significantly* lower impact than [UTF-8 in headers].


This is the question I want to pursue with you.  At the bottom, I'd like
to understand what impact this proposal has that raw UTF-8 doesn't.

[2047] is a combination of wasting programmers' time (without any
hope of this ever getting better), of (as a consequence) introducing
additional bugs, and of (partly as a consequence) irritating users.

Oh, and nobody please try to tell me that 2047 "just works". I've
seen it break far too often for that.

2047 *must go away*, not be perpetuated forever. It is an abomination.
(And really, the same arguments hold for 2231.) And now we get
punycode.

The Internet is increasingly feeling like typical MS code - patches
upon patches upon patches.


The Internet is feeling that way because that's exactly the design
approach that has been practiced.  HTTP/1.1, for example, made all kinds
of absurd design decisions (down to misspelling referrer) to perpetuate
backward compatibility.

Now, I don't claim that 2047/2231 is anywhere near perfect, though I do
agree that it has advantages over raw UTF-8 outside of backward
compatibility (namely, language tagging).

But ignoring that, what programmer time will be saved by adding support
for raw UTF-8 headers?  Won't 2047/2231 headers still be present in
messages?  Won't any reasonable client need to be able to decode them?
Further, won't those clients also need to be able to decode punycode,
since domains will be encoded in it?
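
To make the decoding burden concrete, here is a minimal Python sketch of
what a client already has to do for a 2047-encoded Subject.  It relies
on the standard library's email.header module; the function name
decode_subject and the sample header value are my own illustrations,
not part of any proposal.

```python
from email.header import decode_header, make_header


def decode_subject(raw_value):
    """Turn a header value with RFC 2047 encoded-words into a str.

    Plain ASCII values pass through unchanged, so the same call works
    for both old and new articles.
    """
    return str(make_header(decode_header(raw_value)))


# An encoded-word in ISO-8859-1 with Q encoding.
# decode_subject("=?iso-8859-1?q?R=E4ksm=F6rg=E5s?=")  -> "Räksmörgås"
# decode_subject("plain ASCII subject")                -> "plain ASCII subject"
```

A client that accepted raw UTF-8 headers would still need a path like
this for every message that arrives 2047-encoded.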

Well, the problem is that it seems you cannot be backwards compatible
to both current mail standards and current mail usage by current
Usenet, as the two are *already* incompatible. Or at least not do
that and actually have an even halfway sane method of i18n. Or at
least I haven't seen any such proposal.


I'm suggesting that redefining the usefor article format to be RFC
2822+2047+2231 + a bunch of 1036 headers +
<http://www.normos.org/ietf/draft/draft-faerber-i18n-email-netnews-names-00.txt>
is backward compatible with both current mail and current news formats,
and provides full i18n of headers.

Let's see how it stacks up:

1. Must be able to support non-ASCII newsgroup names.


Supports full Unicode repertoire, encoded in punycode.  E.g.:

   se.test.zq--rksmrgs-5wao1o
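
As a rough sketch of how a client could produce and read a name like the
one above, the following Python uses the standard "punycode" codec on
each dot-separated component.  The function names and the default
prefix argument are assumptions for illustration; the "zq--" prefix is
simply the one used in the example above, and no nameprep step is
applied here.

```python
def encode_newsgroup(name, prefix="zq--"):
    """Encode each non-ASCII component of a newsgroup name as an ACE label."""
    labels = []
    for label in name.split("."):
        if label.isascii():
            # Names that fit in ASCII stay exactly as they are.
            labels.append(label)
        else:
            labels.append(prefix + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def decode_newsgroup(name, prefix="zq--"):
    """Reverse encode_newsgroup for display to the user."""
    labels = []
    for label in name.split("."):
        if label.startswith(prefix):
            labels.append(label[len(prefix):].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


# encode_newsgroup("se.test.räksmörgås")          -> "se.test.zq--rksmrgs-5wao1o"
# decode_newsgroup("se.test.zq--rksmrgs-5wao1o")  -> "se.test.räksmörgås"
```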

2. Because that is how the installed base of servers works,
   newsgroup names (while in Usenet) *can* use non-7bit characters.


Newsgroup names containing raw 8-bit characters would be deprecated.
Note that this wouldn't break any current server or client, but would
serve to show the direction for proper i18n going forward.

Conformant servers could (if they can identify the charset of the 8-bit
newsgroup name) convert it to Unicode and punycode-encode it, although
this might cause more problems than it solves.

3. By the same argument, the identity relation on newsgroup names
   *must* work without needing any form of normalization (because no
   such form is deployed).


One huge advantage of using punycode is that usenet could adopt nameprep
normalization at the same time.  Note that usenet clients will *already*
need to adopt nameprep+punycode to support entry and display of IDNs.
The identical code could be reused for each component of a newsgroup
name.

Normalization would drastically improve the chances that a newsgroup
name a user types in would bring them to the newsgroup they want.
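
To illustrate the point about normalization, here is a small Python
sketch that runs each component of a typed-in newsgroup name through
nameprep before comparison.  It uses the nameprep helper from Python's
encodings.idna module; the function name normalize_newsgroup and the
per-component handling are my assumptions about how a client might
apply it, not something the drafts spell out.

```python
from encodings.idna import nameprep


def normalize_newsgroup(name):
    """Apply nameprep (case mapping plus NFKC) to each component."""
    return ".".join(nameprep(label) for label in name.split("."))


# Different ways of typing the same name collapse to one form:
typed_upper = "se.test.R\u00c4KSM\u00d6RG\u00c5S"               # upper case
typed_decomposed = "se.test.ra\u0308ksmo\u0308rga\u030as"       # combining marks
same = normalize_newsgroup(typed_upper) == normalize_newsgroup(typed_decomposed)
# same is True
```

Without this step the two inputs above are different strings and would
name two different groups.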

4. Names that fit in ASCII must still be in ASCII, for obvious
reasons.


Check

5. Because of moderated groups, news articles *will* be sent to
   moderators via mail.
6. Because of the installed base, this *will* (currently) happen in
   most cases without changing any header at all; we have a small
   chance of using attachments instead.


No need to use attachments, since all usefor headers will also be legal
RFC 2822 headers.  (Note that an attachment of
application/news-transmission works OK for moderators, but has a
horrendously negative effect for cross-postings to mailing lists.)

7. Moderators, in the vast majority, refuse to do anything complicated
   with these articles before injecting them to news servers. (See the
   relevant flamewar in the USEFOR archives.) Most of them use tools
   that are barely adequate to the job as-is, or at least that's the
   impression I get listening to them.


They just add their Approved header like always.

8. Because of crossposts, non-ASCII names *will* make their way to
   people who are not all that interested in groups with those names
   themselves. This must not make anything break.


Many of those headers will be correctly displayed by any mailer that
supports 2047 and 2231.  The rest (Newsgroups, Control) will require
support for punycode newsgroup names, but all existing clients can at
least deal with the (ugly) ACE versions.

While I dislike 2047, I do think a solution could demand it for any
other fields; however, I do not see how it could possibly work for
newsgroup names (Newsgroups: and Followup-To: header fields).


Agreed, you need punycode for that.

I have a simple question.  What can a UTF-8 subject header
communicate that an RFC 2047 one can't?  Other than inelegance,
what's the downside of 2047, when the upside is a huge increase in
backward compatibility?

The downside is exactly lack of backwards compatibility. See above for
details.


Kai, other than bad aesthetics, how is this proposal not backward
compatible with both current mail and current news practices?


You listed your requirements for an i18n proposal.  Here are some of
mine, which this proposal supports, and which raw UTF-8 headers don't:

1) Provides i18n of all user-meaningful headers with support for the
full Unicode repertoire

2) Support for normalization so as to increase chances that what a user
enters is what they mean

3) Support for language tagging, as mandated by Section 4 of RFC 2277
<http://www.normos.org/ietf/rfc/rfc2277.txt>.

4) Does not break any current clients, servers, or gateways, either for
mail or news.

5) Incrementally deployable, with conformant clients getting full i18n
and non-conformant ones having an ugly though usable fallback.

6) No change to servers required.  (At least to support i18n.)

7) Least effort for coders.  This may seem controversial, but I argue
that i18n news clients already need to implement 2047/2231 due to mail
leakage, and nameprep+punycode to deal with IDNs.  So, supporting raw
UTF-8 in headers would be an *additional*, incremental piece of work.
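
On point 3, RFC 2231 extends the 2047 encoded-word so that a language
tag follows the charset after an asterisk.  The sketch below is a
hand-rolled illustration of pulling that tag out; the regular
expression and the function name split_language are my own, and it
handles only a single encoded-word rather than a full header.

```python
import re
from email.header import decode_header, make_header

# Matches the "=?charset*language?" opening of an RFC 2231 style encoded-word.
_TAGGED = re.compile(r"=\?([^?*]+)\*([^?]+)\?")


def split_language(encoded_word):
    """Return (language, text) for one encoded-word; language may be None."""
    match = _TAGGED.search(encoded_word)
    language = match.group(2) if match else None
    # Drop the "*language" part so the ordinary 2047 decoder can handle it.
    plain = _TAGGED.sub(r"=?\1?", encoded_word, count=1)
    return language, str(make_header(decode_header(plain)))


# split_language("=?iso-8859-1*sv?q?R=E4ksm=F6rg=E5s?=")  -> ("sv", "Räksmörgås")
# split_language("=?iso-8859-1?q?R=E4ksm=F6rg=E5s?=")     -> (None, "Räksmörgås")
```

A raw UTF-8 header has no equivalent slot to carry the "sv".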


          - dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/>  <tel:+1-650-327-2600>

  Randomly generated quote:
As the British Constitution is the most subtle organism which has
proceeded from the womb and long gestation of progressive history, so
the American Constitution is, so far as I can see, the most wonderful
work ever struck off at a given time by the brain and purpose of man. 
- W.E. Gladstone