Re: Unicode newsgroup name options


Russ Allbery wrote:

This is a summary of what I believe to be the options for handling Unicode
newsgroup names in an IETF standard.  So far, I see three separate viable
options for handling encoded newsgroup names through the entire protocol.
Those three options are:

  (A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
  (B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
  (C) punycode everywhere


There is another option, viz. use any not-already-in-use name (i.e.
protocol element) and put the i18n in the description associated
with that name (said description could be charset- and language-tagged).

But first, let's consider some implications for the 3 options presented
which were not covered:

A and B each imply that an IMAP SEARCH and an NNTP wildmat will differ,
which is a pain for client authors.  C of course doesn't have that
problem, nor does the fourth method described briefly above.
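
To make the mismatch concrete, here is a minimal Python sketch.  It is
illustrative only and rests on several assumptions: a per-component
punycode encoding with an "xn--" prefix borrowed from IDNA (no such
scheme has been agreed for newsgroup names), fnmatchcase as a rough
stand-in for wildmat matching, and an invented group name.  The helper
names are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumption: prefix borrowed from IDNA; nothing has been specified
# for newsgroup names.
ACE_PREFIX = "xn--"


def encode_component(label: str) -> str:
    """Punycode-encode one dot-separated component if it is non-ASCII."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def encode_group(name: str) -> str:
    """Encode each component of a newsgroup name separately."""
    return ".".join(encode_component(c) for c in name.split("."))


utf8_name = "de.test.münchen"          # hypothetical group, as seen via NNTP
ace_name = encode_group(utf8_name)     # same group, as seen via IMAP

pattern = "de.test.m*"                 # what a user would naturally type

print(ace_name)                        # de.test.xn--mnchen-3ya
print(fnmatchcase(utf8_name, pattern)) # True
print(fnmatchcase(ace_name, pattern))  # False
```

The same user-supplied pattern matches one form of the name and not
the other, so a client that speaks both protocols has to translate
patterns as well as names.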

I am also making the assumption that a standard requiring e-mail or IMAP
to handle unencoded UTF-8 in message headers and in newsgroup names is not
a viable option, due to strong opposition from the e-mail community.


Whatever "the email community" is supposed to mean, such a proposal
simply won't pass the IESG process because of the backwards
incompatibility.

  (1) A newsreader posting via NNTP.
  (2) The NNTP server accepting posts from a client.
  (3) The NNTP transit server relaying posts to other servers.
  (4) The NNTP server providing messages to a client.
  (5) A newsreader reading via NNTP.
  (6) The NNTP server relaying a message posted to a moderated group.
  (7) The local mail system of the NNTP server.
  (8) The mail system of the moderation relay site.
  (9) The local mail system of the moderator.
 (10) The software used by the moderator of a newsgroup.
 (11) A mail to news gateway.
 (12) A news to mail gateway.
 (13) An IMAP server serving Usenet messages to a client.
 (14) An IMAP client reading Usenet messages from an IMAP server.

One or another of these proposed options affects every single component of
this system except for (11).  In the case of (11), none of these three
proposals will affect any existing mail to news gateways.  Existing mail
to news gateways may not be able to handle new non-ASCII newsgroups


Certainly that affects those gateways!  A would require such a
gateway to effect a transformation which no existing gateway
does. That is a backwards compatibility issue. B, C, and the
fourth method do not affect 11.
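
As a rough illustration of the kind of transformation A would demand
of a mail to news gateway, the following Python sketch decodes a
punycode-encoded Newsgroups value back to unencoded form.  The "xn--"
prefix, the per-component encoding, the function names, and the sample
header value are all assumptions made for the example; none of them is
specified anywhere.

```python
# Assumption: prefix borrowed from IDNA; not specified for newsgroups.
ACE_PREFIX = "xn--"


def decode_component(label: str) -> str:
    """Decode one dot-separated component if it carries the prefix."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def decode_newsgroups(value: str) -> str:
    """Decode every group in a comma-separated Newsgroups value."""
    groups = [g.strip() for g in value.split(",")]
    return ",".join(
        ".".join(decode_component(c) for c in g.split("."))
        for g in groups
    )


mail_value = "de.test.xn--mnchen-3ya,de.test"   # as carried in e-mail
news_value = decode_newsgroups(mail_value)      # as required in the article

print(news_value)                  # de.test.münchen,de.test
print(news_value.encode("utf-8"))  # the octets the gateway would have to emit
```

A gateway that lacks this step passes the encoded name through
unchanged, which is precisely what an existing gateway would do.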

In the case of (14), I am making the assumption that changing the IMAP
protocol is not an option.  This means that messages served to (14) will
not contain unencoded UTF-8 in the headers, and newsgroup names in IMAP
will not have unencoded UTF-8 names.  All of the work (if any) required to
make the articles compatible with an IMAP environment would have to be
borne by (13).  In all of these cases, it will therefore be desirable for
the IMAP client to be modified to understand punycode and display the
newsgroup names correctly.


13 requires (for backwards compatibility) that the article format
and the format used in IMAP be the same, which is not the case for
A.  B may also be a problem for some IMAP implementations (any that
communicate via NNTP as opposed to taking articles from a spool).

(A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
===========================================================

This is Andrew's original proposal.  The canonical name of the newsgroup
would be in UTF-8 without further encoding.  The Usenet article format
would be defined to carry UTF-8 newsgroup names without further encoding
in those headers that contain newsgroup names (Newsgroups, Followup-To,
Control, and Xref).


As IMAP has been described in other mailing list postings, I won't
go into detail, but any scheme where the header field format differs
between "news" and "email" won't work with IMAP, and that rules out
A.  This is a fundamental incompatibility.

At every point where a Usenet article must be conveyed via e-mail,
specifically (6), (12), and (13), any non-ASCII content in Newsgroups and
Followup-To would be encoded in punycode (or some other suitable encoding
method).


13 and 14 are the areas which are incompatible with IMAP.

It requires only minimal changes to (2), (3), and (4), the existing news
transit system, to remove restrictions preventing creation of non-ASCII
newsgroups.  It is believed that essentially all existing news transit and
server systems still in active use can handle 8-bit newsgroup names
without difficulties.  It would be desirable for (2), the injection agent,
to be able to undo the mail encoding automatically.

[...]

Additionally, a news reader (5) may be able to read such groups without
modification if it already has support for 8-bit characters and can be
configured appropriately, and similarly a news posting agent (1) may also
be able to be used without modification.  Updates to (1) and (5) to
provide Unicode character entry, canonicalization, and display would of
course be extremely desirable.  (1) and (5) require no modifications to
deal with existing ASCII newsgroups except modifications for 8-bit
cleanliness to handle crossposted messages.

[...]

Moderation software (10) would have to change in order to handle any
non-ASCII groups, since the mail encoding would have to be decoded, or the
moderator would have to arrange to use an updated injecting agent (2).

[...]

Any news to mail gateway (12) would have to be modified if it received any
messages crossposted to non-ASCII newsgroups and wanted to preserve the
Newsgroups header in the e-mail message.  Failure to encode the headers
appropriately would result in unencoded 8-bit text in the headers of a
mail message, where it may be mangled or rejected by the mail system.

[...]

Any IMAP server processing Usenet messages (13) would have to perform the
same transformations, encoding newsgroup names in Newsgroups, Followup-To,
and Control (and Xref if the IMAP server wished to maintain it).


Those are all backwards incompatibilities.  Moreover the issues affecting
13 incorrectly assume that "news" and "mail" can be differentiated. And
while *some* UAs might be usable unmodified, *some* _will_ require
modification to support UTF-8 I/O; there is no alternative, as UTF-8
is what the UA must put in the article and send via NNTP
(and as that applies both to origination and followups, it affects both
1 and 5).

(B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
=========================================================

All NNTP commands would take UTF-8 arguments for newsgroup names, and the
newsgroup names returned by LIST, GROUP, and similar commands would be in
UTF-8.


That's a backwards incompatibility for any IMAP servers dealing via
NNTP.

NNTP servers (2) and (4) must be modified in order to carry non-ASCII
newsgroups: they must decode the newsgroup headers when receiving messages
so as to know which newsgroup to file them in.  The active file would
also need to be kept in UTF-8.  As above, it is believed that the other
NNTP commands besides POST/IHAVE/TAKETHIS would work without modification
because existing NNTP software is already 8-bit clean.  If the NNTP
software is not modified, the newsgroups will show up in their punycode
encoded form, possibly confusing compliant news reading software.


More backwards incompatibilities.

IMAP servers (13) may want to recode the newsgroup names from
punycode to UTF-7, but would not need to make any transformations to the
articles themselves.


Why would anybody want to "recode [...] punycode to UTF-7"?
As noted above, there may be issues with IMAP<->NNTP communications.
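
For readers unfamiliar with the two forms, the sketch below shows the
same name in each.  The punycode form assumes a per-component encoding
with an "xn--" prefix (an assumption; nothing is specified for
newsgroup names), and the second function is my own minimal reading of
the modified UTF-7 that IMAP uses for mailbox names, not a library
routine.  The group name is invented.

```python
import base64


def imap_utf7_encode(name: str) -> str:
    """Minimal sketch of IMAP's modified UTF-7 for mailbox names."""
    out = []
    pending = []  # current run of characters outside printable ASCII

    def flush() -> None:
        # Emit a pending run as &<modified base64 of UTF-16BE>-
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if " " <= ch <= "~":
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


name = "de.test.münchen"  # hypothetical group name

# Assumed punycode form: only the non-ASCII component is encoded.
ace = ".".join(
    c if c.isascii() else "xn--" + c.encode("punycode").decode("ascii")
    for c in name.split(".")
)

print(ace)                     # de.test.xn--mnchen-3ya
print(imap_utf7_encode(name))  # de.test.m&APw-nchen
```

Both results are pure ASCII, so the recoding merely replaces one
ASCII-safe form with another.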

(C) punycode everywhere
=======================

This proposal mandates modifications to the posting agents (1) and the
news readers (5) in order to properly display the names.


Strictly speaking, that is not mandated; 1 and 5 can still be used
by users. Users can still post, read, follow-up, etc.; decoding
for display (where underlying Unicode and font support is available)
is a nicety, but not a necessity.

No modifications are required to (2) or (4), the NNTP servers, although
without modifications the server administrator would have to work with
encoded group names.  It would provide a much better user interface if the
administrative tools implemented punycode encoding and decoding for easier
handling of non-ASCII newsgroup names.


Use of the canonical name may in fact be a benefit, e.g. in the case
of an administrator not familiar with Oriental, Cyrillic, Hebrew, Arabic,
Devanagari, etc. letterforms when dealing with some non-ASCII names.
Therefore the last statement above "It would provide a much better..."
is questionable.

IMAP servers (13) may again wish to recode punycode to UTF-7 for newsgroup
names, but otherwise require no modification.


Why would anybody want to "recode punycode to UTF-7"?

Summary
=======

The following chart summarizes the backward compatibility issues for each
proposal and each component of the news system.  For each portion of the
news system, N means no change required, Y means change is required to
correctly handle non-ASCII newsgroups, D means change is very desirable
but not absolutely necessary, and C means change would be convenient but
unmodified software is still fairly usable.

  | 1   2   3   4   5   6   7   8   9  10  11  12  13  14
 -+------------------------------------------------------
 A| D   C   N   N   D   Y   N   N   N   Y   N   D   Y   D
 B| Y   Y   C   Y   Y   N   N   N   N   N   N   N   C   D
 C| D   C   C   C   D   N   N   N   N   N   N   N   C   D


Based on the comments above, there are a few errors, I've had to
add an I (for incompatible) category, and I've added a fourth row:

 | 1   2   3   4   5   6   7   8   9  10  11  12  13  14
-+------------------------------------------------------
A| Y   C   N   N   Y   Y   N   N   N   Y   Y   Y   I   I
B| Y   Y   C   Y   Y   N   N   N   N   N   N   N   Y   D
C| C   C   C   C   C   N   N   N   N   N   N   N   C   C
D| C   C   C   C   C   N   N   N   N   N   N   N   C   C

The above summary I believe correctly indicates that proposal (B) requires
the most changes to be made to the news system itself.


7-9 are irrelevant (identical for all rows), and are the only
mail-specific items; everything else is in one way or another
part of "the news system". A is completely incompatible with
IMAP, and has more Y's than B. So with the caveat that A is
not viable, then yes, of the remaining viable approaches, B
has backwards incompatibilities whereas C and D do not.
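
The counts can be checked mechanically.  The short Python sketch below
simply transcribes the four rows of the revised chart and tallies the
Y and I entries; the variable and function names are mine.

```python
# Rows transcribed from the revised chart, columns 1 through 14.
chart = {
    "A": "Y C N N Y Y N N N Y Y Y I I",
    "B": "Y Y C Y Y N N N N N N N Y D",
    "C": "C C C C C N N N N N N N C C",
    "D": "C C C C C N N N N N N N C C",
}


def tally(row: str) -> dict:
    """Count required changes (Y) and incompatibilities (I) in a row."""
    cells = row.split()
    return {"Y": cells.count("Y"), "I": cells.count("I")}


counts = {name: tally(row) for name, row in chart.items()}

for name, c in counts.items():
    print(name, c)
# A {'Y': 6, 'I': 2}
# B {'Y': 5, 'I': 0}
# C {'Y': 0, 'I': 0}
# D {'Y': 0, 'I': 0}
```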

It's possible to
use existing software without modification with either proposal (A)


Not IMAP.

[...] but with other news readers it may be
impossible to access a non-ASCII group because no Unicode entry is
supported.


That is why 1 and 5 in A and B are Y. [even if they are considered
D rather than Y, the situation w.r.t. overall backwards compatibility
is unchanged, due to the presence of I's or other Y's in rows
A and B]

Under proposal (C), we're guaranteed that nothing will break
and that it will always be possible to access even non-ASCII groups, but
no existing client will display the names correctly.


Likewise for D (other than clients that display, or can display,
the description, which includes a few).

Overall, it is somewhat less necessary to change client software with (A)
than with (C); exactly how much less necessary is something of an open
question.


Not so with IMAP clients; A is incompatible.

Proposal (A) is the only proposal that requires changes to any system
outside of the news system (other than changes to an IMAP client to
understand the punycode newsgroup names, which are the same for all
proposals).


As mentioned above, the only mail-specific parts are irrelevant, and
everything else is part of the news system.

Both (B) and (C) work with the moderation, e-mail, and IMAP
infrastructure without any additional changes.


B has backwards incompatibilities with IMAP, at least where
there is communication between an IMAP server and an NNTP
server.

Any Y or I is an incompatibility and rules out any chance of IESG
approval (barring a transition plan and protocol negotiation with
fallback where feasible).  That leaves C and D.  C requires more
background work before it can even be specified adequately
(specifically the stringprep), whereas D requires only specification
of how the description field is handled (and certainly the current
guess-the-charset, guess-the-language situation would function, though
not at all ideally -- even if C is used rather than D, the charset
and language issues in the newsgroup descriptions need proper i18n
consideration, and that includes both charset and language).