ietf-822

Unicode newsgroup name options

2003-02-20 17:10:26

This is a summary of what I believe to be the options for handling Unicode
newsgroup names in an IETF standard.  So far, I see three separate viable
options for handling encoded newsgroup names through the entire protocol.
Those three options are:

  (A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
  (B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
  (C) punycode everywhere

These are expanded in more detail below.  Note that in each case some
other encoding system besides punycode could in theory be used.  I don't
believe the choice of encoding changes the remainder of this analysis,
however, so punycode is left as a placeholder (and what seems to currently
be the most likely choice).
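
To make the two forms of a name concrete, here is a minimal sketch using
Python's built-in punycode codec.  The per-component encoding and the
"xn--" marker are assumptions borrowed from IDNA purely for illustration,
and the group name is made up; none of the three options specifies these
details.

```python
# Hypothetical marker borrowed from IDNA; not something the proposals specify.
PREFIX = "xn--"

def encode_group(name: str) -> str:
    """Encode each non-ASCII dot-separated component with punycode."""
    parts = []
    for part in name.split("."):
        if part.isascii():
            parts.append(part)  # ASCII components are left untouched
        else:
            parts.append(PREFIX + part.encode("punycode").decode("ascii"))
    return ".".join(parts)

def decode_group(name: str) -> str:
    """Reverse encode_group for components carrying the marker."""
    parts = []
    for part in name.split("."):
        if part.startswith(PREFIX):
            part = part[len(PREFIX):].encode("ascii").decode("punycode")
        parts.append(part)
    return ".".join(parts)

utf8_name = "de.rec.bücher"          # made-up group name
encoded = encode_group(utf8_name)
print(encoded)                        # de.rec.xn--bcher-kva
print(decode_group(encoded) == utf8_name)  # True
```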

I am also making the assumption that a standard requiring e-mail or IMAP
to handle unencoded UTF-8 in message headers and in newsgroup names is not
a viable option, due to strong opposition from the e-mail community.
Regardless of whether I agree with that opposition or not, I'm
uninterested in reopening that discussion, which went on both in
usenet-format and in ietf-822 at extended length.

This summary does not address internationalization issues in any other
headers or information besides Usenet newsgroup names.

In this analysis, I will be referring to the following components of the
Usenet messaging system:

  (1) A newsreader posting via NNTP.
  (2) The NNTP server accepting posts from a client.
  (3) The NNTP transit server relaying posts to other servers.
  (4) The NNTP server providing messages to a client.
  (5) A newsreader reading via NNTP.
  (6) The NNTP server relaying a message posted to a moderated group.
  (7) The local mail system of the NNTP server.
  (8) The mail system of the moderation relay site.
  (9) The local mail system of the moderator.
 (10) The software used by the moderator of a newsgroup.
 (11) A mail to news gateway.
 (12) A news to mail gateway.
 (13) An IMAP server serving Usenet messages to a client.
 (14) An IMAP client reading Usenet messages from an IMAP server.

One or another of these proposed options affects every single component of
this system except for (11).  In the case of (11), none of these three
proposals will affect any existing mail to news gateways.  Existing mail
to news gateways may not be able to handle new non-ASCII newsgroups
without modification, but all three proposals are backward-compatible in
the sense that all currently working gateways to currently existing groups
will continue to function as they do now.  Please note that I have
separated the moderation process into a separate component from a general
mail to news gateway.

For new mail to news gateways for new non-ASCII newsgroups, the issues are
essentially the same as for posting agents (1).

In the case of (14), I am making the assumption that changing the IMAP
protocol is not an option.  This means that messages served to (14) will
not contain unencoded UTF-8 in the headers, and newsgroup names in IMAP
will not have unencoded UTF-8 names.  All of the work (if any) required to
make the articles compatible with an IMAP environment would have to be
borne by (13).  In all of these cases, it will therefore be desirable for
the IMAP client to be modified to understand punycode and display the
newsgroup names correctly.

Note that there is an additional component that is left unmentioned above,
namely encoding of newsgroup names in URLs.  I don't know enough about
this area to comment usefully, but I believe that it's somewhat orthogonal
to the remaining issues.


(A) UTF-8 in articles and NNTP, punycode in e-mail and IMAP
===========================================================

This is Andrew's original proposal.  The canonical name of the newsgroup
would be in UTF-8 without further encoding.  The Usenet article format
would be defined to carry UTF-8 newsgroup names without further encoding
in those headers that contain newsgroup names (Newsgroups, Followup-To,
Control, and Xref).  Similarly, the body of control messages for non-ASCII
newsgroups would be required to be in UTF-8 and would contain the UTF-8
newsgroup names.

NNTP commands would take UTF-8 arguments wherever newsgroup names are
referred to.  wildmat would be modified to match UTF-8 characters if the
server supported the ? or [] wildcards.
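
The wildmat point can be illustrated with a rough sketch.  Python's
fnmatch is used here only as a stand-in for wildmat (the two are similar
but not identical), and the group name is made up.  A byte-oriented
matcher treats a two-byte UTF-8 character as two units, so a single ?
fails to match it; a character-oriented matcher succeeds.

```python
import fnmatch

name = "de.rec.bücher"       # made-up group name
pattern = "de.rec.b?cher"

# Character-wise matching: '?' consumes the single character 'ü'.
char_match = fnmatch.fnmatchcase(name, pattern)

# Byte-wise matching: 'ü' is two bytes in UTF-8, so one '?' cannot cover it.
byte_match = fnmatch.fnmatchcase(name.encode("utf-8"), pattern.encode("utf-8"))

print(char_match)  # True
print(byte_match)  # False
```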

At every point where a Usenet article must be conveyed via e-mail,
specifically (6), (12), and (13), any non-ASCII content in Newsgroups and
Followup-To would be encoded in punycode (or some other suitable encoding
method).  (Control and Xref headers are generally not gatewayed.)  The
envelope recipient used when sending to the moderation relays (8) would
contain the encoded form of the newsgroup name.  A moderator (10) who
received a post to a non-ASCII newsgroup (either the newsgroup they
themselves are moderating or a newsgroup to which the message was
crossposted) would, in order to approve the message, have to either decode
the newsgroup name to its canonical UTF-8 form again or use an injector
(2) that will do this.  Otherwise, the article should be rejected.
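
A minimal sketch of the two transformations this implies: the gateway
encoding the Newsgroups value before mailing it, and the moderator (or an
updated injector) restoring the canonical UTF-8 form.  The function names,
the per-component encoding, and the "xn--" marker are all hypothetical
choices for illustration, not part of the proposal.

```python
PREFIX = "xn--"  # hypothetical marker borrowed from IDNA

def encode_group(name: str) -> str:
    return ".".join(
        p if p.isascii() else PREFIX + p.encode("punycode").decode("ascii")
        for p in name.split(".")
    )

def decode_group(name: str) -> str:
    return ".".join(
        p[len(PREFIX):].encode("ascii").decode("punycode")
        if p.startswith(PREFIX) else p
        for p in name.split(".")
    )

def to_mail(header_value: str) -> str:
    """Gateway side, (6)/(12)/(13): make a Newsgroups value safe for e-mail."""
    return ",".join(encode_group(g.strip()) for g in header_value.split(","))

def from_mail(header_value: str) -> str:
    """Moderator or injector side, (10)/(2): restore the UTF-8 names."""
    return ",".join(decode_group(g.strip()) for g in header_value.split(","))

posted = "de.rec.bücher,misc.test"   # made-up crosspost
mailed = to_mail(posted)
print(mailed)                         # de.rec.xn--bcher-kva,misc.test
print(from_mail(mailed) == posted)    # True
```

If the from_mail step is skipped, the article is injected with the encoded
name and never shows up in the intended non-ASCII group, which is the
failure mode described for crossposts further down.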

This proposal requires no changes to (7), (8), or (9); in other words, the
existing mail transit systems are unaffected by this proposal.  It
requires only minimal changes to (2), (3), and (4), the existing news
transit system, to remove restrictions preventing creation of non-ASCII
newsgroups.  It is believed that essentially all existing news transit and
server systems still in active use can handle 8-bit newsgroup names
without difficulties.  It would be desirable for (2), the injection agent,
to be able to undo the mail encoding automatically.

Additionally, a news reader (5) may be able to read such groups without
modification if it already has support for 8-bit characters and can be
configured appropriately, and similarly a news posting agent (1) may also
be able to be used without modification.  Updates to (1) and (5) to
provide Unicode character entry, canonicalization, and display would of
course be extremely desirable.  (1) and (5) require no modifications to
deal with existing ASCII newsgroups except modifications for 8-bit
cleanliness to handle crossposted messages.

Moderation software (10) would have to change in order to handle any
non-ASCII groups, since the mail encoding would have to be decoded, or the
moderator would have to arrange to use an updated injecting agent (2).
Moderators of existing ASCII newsgroups who didn't want to deal with this
issue could simply reject all articles crossposted to non-ASCII
newsgroups.  There is some likelihood that messages crossposted between
moderated ASCII newsgroups and other (moderated or unmoderated) non-ASCII
newsgroups would end up under some circumstances being injected into the
news system with the non-ASCII newsgroup names encoded in the mail
encoding, with the only damage being that the articles would not show up
in the non-ASCII newsgroups that they were intended to be posted to.

Any news to mail gateway (12) would have to be modified if it received any
messages crossposted to non-ASCII newsgroups and wanted to preserve the
Newsgroups header in the e-mail message.  Failure to encode the headers
appropriately would result in unencoded 8-bit text in the headers of a
mail message, where it may be mangled or rejected by the mail system.

Any IMAP server processing Usenet messages (13) would have to perform the
same transformations, encoding newsgroup names in Newsgroups, Followup-To,
and Control (and Xref if the IMAP server wished to maintain it).  In
addition, the newsgroup name would have to be presented to the client in
an encoded form; UTF-7 may be preferable in this case to punycode.
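
For a feel of what a UTF-7 form looks like, here is a small sketch using
Python's standard "utf-7" codec.  Note the assumption: IMAP mailbox names
use a modified UTF-7 that differs in detail from standard UTF-7, and
Python's standard library does not provide that variant, so this only
shows the general shape of the encoding, not what an IMAP server would
actually emit.  The group name is made up.

```python
name = "de.rec.bücher"            # made-up group name

# Standard UTF-7, NOT the modified variant IMAP uses for mailbox names.
utf7 = name.encode("utf-7")

print(utf7)                        # 7-bit clean byte string
print(utf7.isascii())              # True
print(utf7.decode("utf-7") == name)  # True: the encoding is reversible
```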


(B) punycode in e-mail, IMAP, and articles, UTF-8 in NNTP
=========================================================

This is the intermediate proposal, allowing use of UTF-8 directly in NNTP
where it's fairly uncontroversial and continuing to treat the canonical
name of the newsgroup as the unencoded UTF-8 form, but always encoding the
newsgroup name wherever it occurs in a news article.  This maintains
complete RFC 2822 compatibility in the article format, unlike (A), but
still allows use of UTF-8 in NNTP.

Any non-ASCII newsgroup names in Newsgroups, Followup-To, Control, and
Xref would be encoded using punycode.  For ease of processing and
consistency, that probably also means that newsgroup names in the bodies
of control messages should also be encoded in punycode.

All NNTP commands would take UTF-8 arguments for newsgroup names, and the
newsgroup names returned by LIST, GROUP, and similar commands would be in
UTF-8.

This means that the newsgroup name sent to the server in a GROUP command
and the newsgroup name in the Newsgroups and Xref headers would not be the
same.  While it may still be possible for an extremely sophisticated user
to use an unmodified news reader (5) or posting agent (1), it would
require the user to override the news client at a multitude of points and
would be at best a last-ditch sort of affair, far too clumsy to use for
any sustained period.  This proposal therefore mandates modifications to
(1) and (5) for any user who wants to use non-ASCII newsgroups.

If the user only wants to use existing ASCII newsgroups, their existing
client software can be used unmodified.  It must, however, be able to
handle 8-bit newsgroup names returned from the LIST command (but doesn't
have to be able to handle 8-bit content in the article headers).

NNTP servers (2) and (4) must be modified in order to carry non-ASCII
newsgroups, since they must decode the newsgroup headers when receiving
messages to know which newsgroups to file them in.  The active file would
also need to be kept in UTF-8.  As above, it is believed that the other
NNTP commands besides POST/IHAVE/TAKETHIS would work without modification
because existing NNTP software is already 8-bit clean.  If the NNTP
software is not modified, the newsgroups will show up in their punycode
encoded form, possibly confusing compliant news reading software.
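
A rough sketch of the server-side step under this proposal: decode the
punycode names in an incoming Newsgroups header and look them up in an
active file kept in UTF-8.  The ACTIVE set, the function names, and the
"xn--" per-component scheme are hypothetical stand-ins for illustration.

```python
PREFIX = "xn--"  # hypothetical marker borrowed from IDNA

# Stand-in for an active file kept in UTF-8 (made-up contents).
ACTIVE = {"de.rec.bücher", "misc.test"}

def decode_group(name: str) -> str:
    return ".".join(
        p[len(PREFIX):].encode("ascii").decode("punycode")
        if p.startswith(PREFIX) else p
        for p in name.split(".")
    )

def groups_to_file(newsgroups_header: str) -> list:
    """Return the locally carried UTF-8 group names for an article."""
    names = [decode_group(g.strip()) for g in newsgroups_header.split(",")]
    return [n for n in names if n in ACTIVE]

header = "de.rec.xn--bcher-kva,misc.test,alt.unknown"
print(groups_to_file(header))  # ['de.rec.bücher', 'misc.test']
```

The name a client would send in GROUP ("de.rec.bücher") and the name in
the header ("de.rec.xn--bcher-kva") differ, which is exactly the mismatch
that makes unmodified clients impractical here.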

Transit servers (3) do not need to be modified.  For the best support of
pattern-based feeds, transit servers will want to decode the newsgroup
header as it comes in and then apply wildmat patterns to the decoded form
so that wildmat patterns can be specified in UTF-8.  The transit servers
will continue to function correctly without this modification, however,
and news administrators could add additional appropriate patterns to catch
the punycode-encoded forms.  Presuming that ASCII newsgroup names are not
encoded (a reasonable assumption for any encoding format, I believe), the
only reason to add punycode support to transit servers would be for the
convenience of the administrator in expressing wildmat patterns for
non-ASCII newsgroups in an unencoded form.  (I believe that the likelihood
that a punycode-encoded name would happen to match one of the widely used
patterns like *sex* or *mp3* is fairly small, but I could be wrong as I've
not done a statistical analysis.)
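
The two ways a transit server could apply feed patterns can be sketched as
follows.  fnmatch again stands in for wildmat, and the "xn--"
per-component scheme and the group name are assumptions for illustration
only.

```python
import fnmatch

PREFIX = "xn--"  # hypothetical marker borrowed from IDNA

def decode_group(name: str) -> str:
    return ".".join(
        p[len(PREFIX):].encode("ascii").decode("punycode")
        if p.startswith(PREFIX) else p
        for p in name.split(".")
    )

wire_name = "de.rec.xn--bcher-kva"   # as it appears in the header

# Unmodified server: patterns are applied to the encoded form.
raw_specific = fnmatch.fnmatchcase(wire_name, "de.rec.b*")
raw_hierarchy = fnmatch.fnmatchcase(wire_name, "de.rec.*")

# Modified server: decode first, so patterns can be written in UTF-8.
decoded_specific = fnmatch.fnmatchcase(decode_group(wire_name), "de.rec.b*")

print(raw_specific)      # False: the encoded component starts with "xn--"
print(raw_hierarchy)     # True: hierarchy-level patterns still work
print(decoded_specific)  # True
```

Hierarchy-level patterns keep working on an unmodified server; only
patterns that reach into a non-ASCII component need either decoding or an
extra pattern written against the encoded form.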

This proposal requires no modifications to the moderation system of (6),
(7), (8), (9), and (10) whatsoever, including while handling non-ASCII
groups.  It similarly requires no modifications to news to mail gateways
(12).  IMAP servers (13) may want to recode the newsgroup names from
punycode to UTF-7, but would not need to make any transformations to the
articles themselves.


(C) punycode everywhere
=======================

The "most encoded" proposal, this proposal says to use punycode
everywhere.  All newsgroup names in the Usenet articles and via the NNTP
protocol would be encoded in punycode and the punycode-encoded version of
the newsgroup name would be the canonical one.  The name would only be
decoded for display purposes in the client software.  This maintains
complete RFC 2822 compatibility for the article format.

This proposal mandates modifications to the posting agents (1) and the
news readers (5) in order to properly display the names.  No modifications
are needed for news readers that are reading only ASCII newsgroups; they
will just see a bunch of additional oddly-named newsgroups.  Existing news
readers could still be used to read and post to non-ASCII newsgroups if
their users didn't mind the odd names.
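
A minimal sketch of what a modified client would do under this proposal:
keep the encoded name for everything sent to the server, and decode only
when showing the name to the user.  The display_name helper and the
"xn--" per-component scheme are hypothetical.

```python
PREFIX = "xn--"  # hypothetical marker borrowed from IDNA

def display_name(wire_name: str) -> str:
    """Decode a group name for display; the wire name stays canonical."""
    parts = []
    for part in wire_name.split("."):
        if part.startswith(PREFIX):
            try:
                part = part[len(PREFIX):].encode("ascii").decode("punycode")
            except UnicodeError:
                pass  # undecodable: fall back to showing the encoded form
        parts.append(part)
    return ".".join(parts)

# Names as an unmodified server would list them (made-up examples).
listed = ["misc.test", "de.rec.xn--bcher-kva"]

for wire in listed:
    # The client sends `wire` in GROUP and Newsgroups, shows the decoded form.
    print(wire, "->", display_name(wire))
```

An unmodified client simply skips the display_name step and shows the
encoded name as-is.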

No modifications are required to (2) or (4), the NNTP servers, although
without modifications the server administrator would have to work with
encoded group names.  It would provide a much better user interface if the
administrative tools implemented punycode encoding and decoding for easier
handling of non-ASCII newsgroup names.

The same issues as with (B) apply to transit servers (3), namely that it
would be convenient but not required for transit servers to decode the
newsgroup names before doing wildmat matching so that the wildmat patterns
could be specified in a convenient format.

No modifications are required for the moderation system of (6), (7), (8),
(9), and (10) whatsoever, including while handling non-ASCII groups.
Similarly, no modifications are required for news to mail gateways (12).
IMAP servers (13) may again wish to recode punycode to UTF-7 for newsgroup
names, but otherwise require no modification.


Summary
=======

The following chart summarizes the backward compatibility issues for each
proposal and each component of the news system.  For each portion of the
news system, N means no change required, Y means change is required to
correctly handle non-ASCII newsgroups, D means change is very desirable
but not absolutely necessary, and C means change would be convenient but
unmodified software is still fairly usable.

  | 1   2   3   4   5   6   7   8   9  10  11  12  13  14
 -+------------------------------------------------------
 A| D   C   N   N   D   Y   N   N   N   Y   N   D   Y   D
 B| Y   Y   C   Y   Y   N   N   N   N   N   N   N   C   D
 C| D   C   C   C   D   N   N   N   N   N   N   N   C   D

I believe the above summary correctly indicates that proposal (B) requires
the most changes to be made to the news system itself.  It's possible to
use existing software without modification with either proposal (A) or
proposal (C); under proposal (A), news readers that aren't 8-bit clean
will break, and some news readers may actually get display right without
having to make any modifications, but with other news readers it may be
impossible to access a non-ASCII group because no Unicode entry is
supported.  Under proposal (C), we're guaranteed that nothing will break
and that it will always be possible to access even non-ASCII groups, but
no existing client will display the names correctly.

Overall, it is somewhat less necessary to change client software with (A)
than with (C); exactly how much less necessary is something of an open
question.

Proposal (A) is the only proposal that requires changes to any system
outside of the news system (other than changes to an IMAP client to
understand the punycode newsgroup names, which are the same for all
proposals).  Both (B) and (C) work with the moderation, e-mail, and IMAP
infrastructure without any additional changes.
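
The chart above can also be tallied mechanically.  This sketch transcribes
it as-is and lists where each proposal has a Y (change required).
Treating components (1) through (5) as "the news system itself" is my
assumption for the purpose of the tally.

```python
# Rows transcribed from the chart; index 0 is component (1).
CHART = {
    "A": "D C N N D Y N N N Y N D Y D".split(),
    "B": "Y Y C Y Y N N N N N N N C D".split(),
    "C": "D C C C D N N N N N N N C D".split(),
}

NEWS_SYSTEM = range(1, 6)      # assumption: components (1)-(5)
OTHER = range(6, 15)           # moderation, mail, gateways, IMAP

def required(proposal: str, components) -> list:
    """Components in the given range marked Y for a proposal."""
    return [c for c in components if CHART[proposal][c - 1] == "Y"]

for p in CHART:
    print(p, "news system:", required(p, NEWS_SYSTEM),
          "elsewhere:", required(p, OTHER))
# A news system: [] elsewhere: [6, 10, 13]
# B news system: [1, 2, 4, 5] elsewhere: []
# C news system: [] elsewhere: []
```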

-- 
Russ Allbery (rra(_at_)stanford(_dot_)edu)             
<http://www.eyrie.org/~eagle/>
