Re: RFC 2047 and gatewaying


Charles Lindsey wrote:

You are really being rather disingenuous.


Not at all; the point is that -- especially when discussing gateways -- clarity
requires specifying whether "headers" refers to RFC 822 / 2822 Internet text
message headers or to something different.  Per RFC 1036, Usenet article
headers are the same as 822 headers; 1036 explicitly says so and uses the same
Internet text message format for articles.  The Usefor draft changes that; its
articles are not Internet text messages (precisely because of the presence of
untagged, unencoded characters forbidden by the Internet text message RFCs (822
and 2822) in Usefor "headers", which for the same reason are not 822 / 2822
headers even though there is a superficial similarity).

Another major difference between Usefor draft "headers" and Internet text
message headers is that the former may include semicolon-delimited parameters
for some fields (e.g. Date) where the syntax and ABNF for the corresponding
Internet text message header field do not permit such a construct.  Obviously
gateways into email would have to elide the extraneous content, though that is
not mentioned in the text which you have presented here (which is not
unreasonable, as it does not bear on RFC 2047 issues).

There is yet another major difference related to the definitions of structural
elements, mentioned below.

And there are some relatively minor differences, such as the delimiter marking
the end of a header field tag (the single character ':' in Internet text
messages vs. the two-character string ": " in Usefor "headers").

Reference to RFC 2047 presumes that one is dealing with RFC 822 Internet
text messages; if that is not the case (as seems to be w.r.t. the Usefor
draft), all bets are off.



That is not so, because Usefor explicitly extends RFC 2047 to apply to
Netnews (but it does not actually change anything in the 2047 protocol).


Those extensions may well need to be presented here before one can reasonably
expect a detailed critique of the bigger picture.

8.8.1.1.  Gatewaying into email

  2. If the header is unstructured, any word(s) which is contained
     within a comment ...

Unstructured RFC 822 / 2822 header fields cannot be said to contain comments;



Oops! Thank you for spotting my typo.


There is an example of the need for clarity; does "the header" refer to a
Usefor draft "header" or to the RFC 822 / 2822 Internet text message header
that the gateway is supposed to construct? [likewise for other parts of
the text, such as section 5 below.]  It's also unclear, at least from what
has been posted here, what the rules are for Usefor "headers". For example,
it is unclear whether or not content in a structured Usefor "header" enclosed
within U+207D and U+207E (or within U+208D and U+208E) is considered to be a
"comment" [or, for that matter, whether or not U+2474 etc. are "comments"].
Such considerations don't arise in Internet text message headers because
those characters cannot exist in headers.  But since they may exist in
Usefor "headers", the issue does arise.  And there is the more serious
matter of whether or not one is expected to use the consistent RFC 822 /
2822 / 2047 / 2231 definitions of "comment", "phrase", "quoted-string", etc.
or the quite different Usefor draft definitions (which are different because
the Usefor draft uses different definitions of "text", "ctext", "qtext", etc.).

  5. If the header is not one defined by this standard or by any Email
     standard known to the gateway (so that it cannot be determined
     whether it is unstructured, or otherwise where comments and
     phrases occur within it), then it is not possible to encode it
     according to a strict interpretation of [RFC 2047].  Nevertheless,
     it is preferable to attempt an encoding than to discard that
     header or to allow the gatewaying to fail. It is therefore
     suggested that, outside of regions contained within properly
     matched DQUOTEs, <...> or [...], any word(s) contained within
     properly nested "(" and ")" be treated as being within a comment
     and any other word(s) be treated as being within a phrase.

     Likewise, following any ";", anything of the syntactic form of a
     parameter should be treated as such.

This has recently been discussed here.  For display, such an error-prone
heuristic may be marginally acceptable (though user cut-and-paste is likely to
result in problems), even though it will fail on a number of not-uncommon
constructs; but use of such an unreliable mechanism for *gateways* is highly
inadvisable, to say the least.



Well, consider a gateway that gates some newsgroup to a mailing list.

It receives an article with UTF8-xtra-chars in some unrecognized header.
For sure it was not a header known to the Netnews protocol, so it was
superfluous for news propagation.


Not necessarily; it may be one of the newfangled Usefor "headers" and the
"gateway" may be one operating according to the current specification which
does not define such a "header".  Or it may be an experimental "header", the
syntax of which is only known to those participating in the experiment [in
email, such experimental headers would begin with "X-", but current practice
in Usenet articles appears to be to make up some tag without consideration
of collisions with existing standards (e.g. Supersedes)]. There may well be
implications for news->mail->news paths w.r.t. such experimental headers,
which might or might not be structured.

Maybe, however, it might be meaningful
when seen on the mailing list (though one would expect such gateways to
recognize at least the common email header fields). So what is the gateway
to do? I see four possibilities:

1. It leaves it as raw 8-bit and hopes that it survives. Indeed, many mail
transports will pass it on untouched, but it is liable to be munged as
soon as it hits a Sendmail. Maybe that is a reasonable risk. Maybe not.

2. It drops that particular header (presumably it was not an essential one
for delivering mail on the mailing list). But that would be a pity,
because useful information might be lost.

3. It drops the article entirely. That is hardly providing a decent
service to the readers of the mailing list.

4. It tries to encode it using RFC 2047/2231. Maybe it succeeds. Maybe it
doesn't. If it doesn't, at worst some representation will survive in the
email on the mailing list which a human might be able to decipher. At
best, some over-liberal user agent will decode the 2047/2231 stuff and
produce something sensible.

Now which of those four would you recommend the gatewayer to do? My text
recommends #4 as the least of the evils.


You are presuming (incorrectly) that there are no other possible options.
The best solution would be to continue the RFC 1036 practice of using the
Internet text message format, i.e. there would be no untagged, unencoded
illegal octets or superfluous "parameters", and gateway header transformations
would be unnecessary (though gateways may need to add or elide header fields).
Many of the other problems associated with this particular deviation from
RFC 1036 practice (viz. article format change from Internet text message to
something incompatible with RFCs 822 / 2822), several of which have been
enumerated elsewhere, would also disappear.  The entire gateway conundrum is
solely the result of deviation from that RFC 1036 practice; i.e. the change
in article format has added additional burdens to gateways above and beyond
the unrelated issues that gateways have always had to deal with.

Failing that ideal solution, there are still other possibilities.  For
example, the gateway could take the offending "header" (or the entire article),
encode it using an established mechanism (e.g. base64), and package it as
application/octet-stream.  That would preserve the content (with no possibility
of munging) and would comply with the relevant Internet RFCs for email.
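A minimal sketch of that packaging, using Python's standard email package.
The function name, subject text, filename and sample article are all
assumptions made up for illustration; nothing here is taken from the Usefor
draft or from any existing gateway.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def wrap_article(raw_article: bytes) -> MIMEMultipart:
    """Package an article, untouched, as a base64 application/octet-stream part."""
    msg = MIMEMultipart()
    msg["Subject"] = "Gatewayed news article"  # hypothetical subject line
    msg.attach(MIMEText("The original article is attached unmodified.",
                        "plain", "us-ascii"))
    # MIMEApplication applies base64 by default, so no raw 8-bit octet
    # reaches the mail transport.
    part = MIMEApplication(raw_article, "octet-stream")
    part.add_header("Content-Disposition", "attachment",
                    filename="article.bin")  # hypothetical filename
    msg.attach(part)
    return msg


# Hypothetical article with untagged UTF-8 in an unrecognized "header".
raw = b"X-Experimental: caf\xc3\xa9\r\n\r\nbody\r\n"
wire = wrap_article(raw).as_bytes()

# The receiving side recovers the original octets exactly.
recovered = message_from_bytes(wire).get_payload()[1].get_payload(decode=True)
```

The message on the wire is pure 7-bit, and the receiving end gets back the
exact octets that the gateway received, whatever their syntax was.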

There may well be other options.

Returning to your four suggestions:
#1 would violate Internet RFCs (822 / 2822 and probably 2821), so it is
   unacceptable.
#2 would generate syntactically legal email, but might be considered inferior
   to packaging the content as described above.
#3 in fact might be a considerable service -- as you note, the offending
   article may well be Korean spam...
#4 presents several problems. One is that as the hypothetical header field
   syntax is unknown, one cannot determine what to encode, or which mechanism
   should be used.  The particular methods described in the quoted section
   5 have been specifically discussed (though at the time you stated that it
   was for display only, not for gateways).  With the exception of content
   within square brackets, the methods enumerated have been demonstrated by
   example to be unreliable (and I suspect examples involving square brackets
   could be provided, but that is unnecessary as the principle is that it is
   necessary to consider the header syntax when encoding/decoding; guessing
   based on superficial examination of characters is unreliable, which is
   particularly bad for gateways).
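To make the objection concrete, here is a rough sketch of the character-level
guessing described in the quoted section 5.  The function name and the region
labels are my own, the sketch does not backtrack on unmatched delimiters, and
the sample field value is a hypothetical case of my choosing, not one of the
examples from the earlier discussion.

```python
def guess_regions(value: str) -> list[tuple[str, str]]:
    """Split a field value into ('phrase' | 'comment' | 'protected', text) regions.

    Outside "...", <...> and [...], text inside nested parentheses is
    guessed to be a comment; everything else is guessed to be a phrase.
    """
    closers = {'"': '"', "<": ">", "[": "]"}
    regions: list[tuple[str, str]] = []
    buf: list[str] = []
    kind = "phrase"
    depth = 0               # current parenthesis nesting depth
    protected_close = None  # closing delimiter we are waiting for, if any

    def flush(new_kind: str) -> None:
        nonlocal kind, buf
        text = "".join(buf).strip()
        if text:
            regions.append((kind, text))
        buf = []
        kind = new_kind

    for ch in value:
        if protected_close is not None:
            buf.append(ch)
            if ch == protected_close:
                protected_close = None
                flush("phrase")
        elif ch in closers and depth == 0:
            flush("protected")
            buf.append(ch)
            protected_close = closers[ch]
        elif ch == "(":
            if depth == 0:
                flush("comment")
            depth += 1
            buf.append(ch)
        elif ch == ")" and depth > 0:
            buf.append(ch)
            depth -= 1
            if depth == 0:
                flush("phrase")
        else:
            buf.append(ch)
    flush("phrase")
    return regions


# Hypothetical value of a field whose real syntax is unstructured.
guessed = guess_regions("Re: f(x) rocks")
```

For that value the guesser labels "(x)" as a comment, although an
unstructured field has no comments at all; only knowledge of the field's
syntax can settle the question.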

  In all cases, there are additional restrictions imposed by [RFC 2047]
  regarding the size, placement and contents of encoded-words which
  MUST be observed.


In the hypothetical situation proposed, the "header" syntax is unknown,
ergo it is not possible to observe the RFC 2047 rules, as those rules
vary depending on whether or not a header field content is structured
or unstructured (or indeed, within the field in the case of mixed
structured/unstructured fields). Moreover, the rules for structured
fields specifically reference the structural elements, and therefore
cannot be applied if the structure is unknown.
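As an illustration of how the placement of encoded-words depends on field
structure, here is a small sketch using Python's standard email package.  The
subject text, display name and address are made up, and decode_words is a
hypothetical helper name.

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr, parseaddr

# Unstructured field (e.g. Subject): the whole text may be carried in
# encoded-words.
subject_text = "Gr\u00fc\u00dfe aus K\u00f6ln"
subject_value = Header(subject_text, "utf-8").encode()

# Structured field (e.g. From): only the phrase (display name) is encoded;
# the addr-spec inside <...> must be left alone.
from_value = formataddr(("J\u00f6rg", "joerg@example.org"), charset="utf-8")


def decode_words(value: str) -> str:
    """Decode any RFC 2047 encoded-words in value back to text."""
    return str(make_header(decode_header(value)))


name, address = parseaddr(from_value)
print(subject_value)
print(from_value)
print(decode_words(subject_value), "|", decode_words(name), "|", address)
```

The library can do this only because it is told which kind of field it is
building; given a field of unknown syntax, neither it nor a gateway has any
basis for choosing between the two treatments.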

  Moreover, these transformations MUST be applied
  both within the header of the article and within any body part
  headers (including the headers of any message/rfc822).


That is likely to be a problem for digitally signed messages and for
encrypted messages.  Of course, if the article uses the Internet text
message format, there's no need to apply any transformation here, and
hence no problem.

  It is
  generally preferable for encodings to use the charset UTF-8, although
  it might be wise first to confirm that that is indeed the charset
  which had been used (see 4.4.1).

That raises the issue of round-trip gateways; what mechanism is used to convey
charset (and language) information through a mail -> Usefor -> mail
transformation to ensure that the charset (and language) information is
maintained?  If no such mechanism is provided, one cannot "confirm that that
is [...] the charset which had been used".



If the chain is mail->news->mail, then there should be no 8bit stuff in
headers anyway, so the problem does not arise.


You are presuming (probably incorrectly) that there will never exist any
mail->news gateway designed and/or operated by an 8-bit thug to transform
tagged and encoded email header content into untagged raw utf-8 Usefor
"header" content.  Bear in mind that the mail->news and news->mail
gateways may be operated by different entities (e.g. one part may involve
submission to a moderator via email).

The nasty case is the chain news -> mail -> news. That sort of gatewaying
is _always_ difficult. There are all sorts of things that can go wrong,
and Usefor warns against them. If the 8bit stuff in the headers is all in
UTF-8, then it should work provided newsgroup-names are restored to their
canonical form (that is a MUST).


The hypothetical unrecognized header will still be a problem so long as
there is any transformation involved, bearing in mind once again that
the news->mail and mail->news transformations may be done at different
places by different software based on different assumptions by different
authors.  There are roughly two types of potential problems: those related
to content transformations and those related to news-specific issues (loops,
etc.). The transformation issues essentially go away if the Internet text
message format is used for articles, as in RFC 1036.

It is however quite easy to detect when non-UTF-8 charsets have
been used


It is trivial in the Internet text message format, since charset is explicitly
tagged (as is language, where relevant).  It is *not* easy to *reliably*
detect charset when untagged, as in Usefor "headers".  One can detect that
an octet stream is not a valid utf-8 stream, but it is possible that an
untagged non-utf-8 octet stream may correspond to a valid utf-8 sequence
even though it is not utf-8.
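A short sketch of that ambiguity, with sample strings chosen purely for
illustration (is_valid_utf8 is a hypothetical helper name):

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Two characters written in ISO-8859-1: U+00C3 followed by U+00A9.
intended = "\u00c3\u00a9"
octets = intended.encode("iso-8859-1")   # the two octets C3 A9

# The same two octets are also well-formed UTF-8 for the single
# character U+00E9, so validity checking cannot tell the cases apart.
misread = octets.decode("utf-8")

# By contrast, U+00E9 written in ISO-8859-1 is the lone octet E9,
# which is not valid UTF-8 and so is detectable.
detectable = "\u00e9".encode("iso-8859-1")
```

Here is_valid_utf8(octets) is true even though the octets were never UTF-8,
whereas is_valid_utf8(detectable) is false.  A validity check therefore only
catches some non-UTF-8 input; an explicit charset tag catches all of it.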