ietf-822

Re: RFC 2047 and gatewaying

2002-12-23 20:14:17

In <3E05888D(_dot_)6010805(_at_)Sonietta(_dot_)blilly(_dot_)com> Bruce Lilly 
<blilly(_at_)erols(_dot_)com> writes:

Charles Lindsey wrote:

In Usefor, where headers in Netnews will be in UTF-8, we have to say how
to gateway into email.

"Headers" is somewhat ambiguous; HTML has "headers", but those are not
Internet text message (RFC 822 / 2822) headers.  RFC 822 / 2822 preclude
8th-bit-high octets implied by UTF-8 in Internet text message header fields.
So presumably the new Usefor format is completely separate from RFC 822
Internet text message format, unlike RFC 1036, which used the same message
format as RFC 822 Internet text messages, and "headers" w.r.t. the Usefor
draft is presumably not the same as "headers" in RFC 822 / 2822 Internet
text messages (unlike RFC 1036 Usenet messages, where they were the same
except for some additional restrictions imposed by 1036).

You are really being rather disingenuous. You know perfectly well (because
you have been on the Usefor list) that "Headers", as defined in Usefor, are
broadly similar to those in Email, the major difference being that UTF-8
is allowed in certain places. In order to ensure that gatewaying from news
to mail can proceed, it is necessary to specify transformations that need
to be made at the boundary. The text I posted covers that, and was posted
here for comment because it makes use of RFC 2047/2231 to do most of the
work.

Reference to RFC 2047 presumes that one is dealing with RFC 822 Internet
text messages; if that is not the case (as seems to be w.r.t. the Usefor
draft), all bets are off.

That is not so, because Usefor explicitly extends RFC 2047 to apply to
Netnews (but it does not actually change anything in the 2047 protocol).

Your comments would be appreciated.

8.8.1.1.  Gatewaying into email


   2. If the header is unstructured, any word(s) which is contained
      within a comment ...

Unstructured RFC 822 / 2822 header fields cannot be said to contain comments;

Oops! Thank you for spotting my typo.


   5. If the header is not one defined by this standard or by any Email
      standard known to the gateway (so that it cannot be determined
      whether it is unstructured, or otherwise where comments and
      phrases occur within it), then it is not possible to encode it
      according to a strict interpretation of [RFC 2047].  Nevertheless,
      it is preferable to attempt an encoding than to discard that
      header or to allow the gatewaying to fail. It is therefore
      suggested that, outside of regions contained within properly
      matched DQUOTEs, <...> or [...], any word(s) contained within
      properly nested "(" and ")" be treated as being within a comment
      and any other word(s) be treated as being within a phrase.

      Likewise, following any ";", anything of the syntactic form of a
      parameter should be treated as such.

This has recently been discussed here; for display, such an error-prone
heuristic may be marginally acceptable (user cut-and-paste is likely to
result in problems), and it will fail on a number of not-uncommon
constructs, but use of such an unreliable mechanism for *gateways* is
highly inadvisable, to say the least.
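
For readers who want to see what the heuristic in item 5 amounts to, here
is a minimal Python sketch. The function name classify_words and the labels
'phrase', 'comment' and 'protected' are illustrative, not taken from the
draft. The sketch does not verify that delimiters are properly matched,
ignores backslash quoting, and leaves out the ";" parameter rule, so it
shows the idea only.

```python
def classify_words(value):
    """Label each word of an unknown header as phrase, comment or protected."""
    closers = {'"': '"', '<': '>', '[': ']'}
    words = []       # collected (text, kind) pairs
    depth = 0        # nesting depth of "(" ... ")"
    closing = None   # delimiter that ends the current protected region
    current = ''
    kind = 'phrase'

    def flush():
        # Emit the word collected so far, labelled with the current kind.
        nonlocal current
        if current:
            words.append((current, kind))
            current = ''

    for ch in value:
        if closing is not None:
            # Inside DQUOTEs, <...> or [...]: copy through untouched.
            current += ch
            if ch == closing:
                flush()
                closing = None
                kind = 'comment' if depth else 'phrase'
        elif ch in closers:
            flush()
            kind = 'protected'
            current = ch
            closing = closers[ch]
        elif ch == '(':
            flush()
            depth += 1
            kind = 'comment'
        elif ch == ')' and depth:
            flush()
            depth -= 1
            kind = 'comment' if depth else 'phrase'
        elif ch.isspace():
            flush()
        else:
            current += ch
    flush()
    return words
```

Only words labelled 'phrase' or 'comment' would be candidates for
encoding; 'protected' regions would be passed through as they stand.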

Well, consider a gateway that gates some newsgroup to a mailing list.

It receives an article with UTF8-xtra-chars in some unrecognized header.
For sure it was not a header known to the Netnews protocol, so it was
superfluous for news propagation. Maybe, however, it might be meaningful
when seen on the mailing list (though one would expect such gateways to
recognize at least the common email header fields). So what is the gateway
to do? I see four possibilities:

1. It leaves it as raw 8-bit and hopes that it survives. Indeed, many mail
transports will pass it on untouched, but it is liable to be munged as
soon as it hits a Sendmail. Maybe that is a reasonable risk. Maybe not.

2. It drops that particular header (presumably it was not an essential one
for delivering mail on the mailing list). But that would be a pity,
because useful information might be lost.

3. It drops the article entirely. That is hardly providing a decent
service to the readers of the mailing list.

4. It tries to encode it using RFC 2047/2231. Maybe it succeeds. Maybe it
doesn't. If it doesn't, at worst some representation will survive in the
email on the mailing list which a human might be able to decipher. At
best, some over-liberal user agent will decode the 2047/2231 stuff and
produce something sensible.

Now which of those four would you recommend the gatewayer to do? My text
recommends #4 as the least of the evils.
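
As an illustration of option 4, a gateway could turn each word containing
non-ASCII characters into one or more encoded-words in UTF-8. A rough
Python sketch follows, assuming "B" encoding throughout; the helper names
encode_word and gateway_word are hypothetical. It respects the 75-character
limit on an encoded-word but none of the other placement rules of RFC 2047.

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of one encoded-word


def encode_word(text):
    """Return a list of 'B' encoded-words in UTF-8 that together cover text."""
    prefix, suffix = '=?UTF-8?B?', '?='
    # Base64 maps 3 octets to 4 characters; work out the octet budget.
    budget = (MAX_ENCODED_WORD - len(prefix) - len(suffix)) // 4 * 3
    pieces, chunk = [], b''
    for ch in text:
        octets = ch.encode('utf-8')
        # Start a new encoded-word rather than split a multi-octet character.
        if chunk and len(chunk) + len(octets) > budget:
            pieces.append(chunk)
            chunk = b''
        chunk += octets
    if chunk:
        pieces.append(chunk)
    return [prefix + base64.b64encode(p).decode('ascii') + suffix
            for p in pieces]


def gateway_word(word):
    """Leave pure-ASCII words alone; encode anything else."""
    try:
        word.encode('ascii')
        return word
    except UnicodeEncodeError:
        return ' '.join(encode_word(word))
```

For example, gateway_word('Grüße') gives '=?UTF-8?B?R3LDvMOfZQ==?=', while
gateway_word('hello') is returned unchanged.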

   In all cases, there are additional restrictions imposed by [RFC 2047]
   regarding the size, placement and contents of encoded-words which
   MUST be observed. Moreover, these transformations MUST be applied
   both within the header of the article and within any body part
   headers (including the headers of any message/rfc822).  It is
   generally preferable for encodings to use the charset UTF-8, although
   it might be wise first to confirm that that is indeed the charset
   which had been used (see 4.4.1).

That raises the issue of round-trip gateways; what mechanism is used to
convey charset (and language) information through a mail -> Usefor -> mail
transformation to ensure that the charset (and language) information is
maintained?  If no such mechanism is provided, one cannot "confirm that
that is [...] the charset which had been used".

If the chain is mail->news->mail, then there should be no 8bit stuff in
headers anyway, so the problem does not arise.

The nasty case is the chain news -> mail -> news. That sort of gatewaying
is _always_ difficult. There are all sorts of things that can go wrong,
and Usefor warns against them. If the 8bit stuff in the headers is all in
UTF-8, then it should work provided newsgroup-names are restored to their
canonical form (that is a MUST).

If the charset used is not UTF-8, then the article is not compliant with
Usefor (nor with any mail standard), in which case all bets are off.

Sadly, such non-compliance does not stop people from doing it (there is a
considerable amount of both Netnews and Email floating around with strange
oriental charsets in the headers - mostly, it must be admitted, Korean
spam). It is however quite easy to detect when non-UTF-8 charsets have
been used, and if systems detect that and are able to work out what to do,
then good luck to them. Usefor draws attention to that possibility, but
does not condone it. It applies equally to Netnews and Email, hence the
remark about confirming the charset. 
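
The detection can be as simple as attempting a strict UTF-8 decode of the
raw header octets. A small Python sketch (the function name looks_like_utf8
is illustrative only):

```python
def looks_like_utf8(raw):
    """Return True if the raw header octets form valid UTF-8."""
    try:
        raw.decode('utf-8')  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII passes this check trivially, and text in the common legacy
8-bit and East Asian charsets usually fails it. A header that passes is
merely consistent with UTF-8; the check cannot prove which charset the
poster intended, and it says nothing about what the charset actually was
when it fails.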

Of course, with RFC 1036, which uses RFC 822 Internet text messages, one would
simply use the Internet text message mechanisms, including RFC 2047, ...

This has been discussed on the Usefor mailing list; 

It has indeed. However, Usefor has recently come to a "Rough Consensus" on
how to proceed, and the text we are discussing arises from that consensus.
As I said, it is posted here for the RFC 2047/2231 experts to check it
over. It is also, of course, being discussed on the Usefor list.

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
