Re: RFC 2047 and gatewaying


In <3E088D4B(_dot_)5050508(_at_)alex(_dot_)blilly(_dot_)com> Bruce Lilly 
<blilly(_at_)erols(_dot_)com> writes:

Charles Lindsey wrote:

That is not so, because Usefor explicitly extends RFC 2047 to apply to
Netnews (but it does not actually change anything in the 2047 protocol).

Those extensions may well need to be presented here before one can reasonably
expect a detailed critique of the bigger picture.


You can read the gory details in section 4.4.1 of
<http://www.landfield.com/usefor/draft-ietf-usefor-article-08.01.unpaged>
but all it does is to define a little terminology to allow the wording of
RFC 2047 to apply to Netnews articles.

There is an example of the need for clarity; does "the header" refer to a
Usefor draft "header" or to the RFC 822 / 2822 Internet text message header
that the gateway is supposed to construct?


Since the text we are discussing is part of Usefor, it should be obvious
to you that "header" means a Usefor header.

[likewise for other parts of
the text, such as section 5 below.]  It's also unclear, at least from what
has been posted here, what the rules are for Usefor "headers". For example,
it is unclear whether or not content in a structured Usefor "header" enclosed
within U+207D and U+207E (or within U+208D and U+208E) is considered to be a
"comment" [or, for that matter, whether or not U+2474 etc. are "comments"].


The Usefor definition of CFWS is identical to the RFC 2822 definition,
except for the extended character set allowed in ctext.

Such considerations don't arise in Internet text message headers because
those characters cannot exist in headers.  But since they may exist in
Usefor "headers", the issue does arise.  And there is the more serious
matter of whether or not one is expected to use the consistent RFC 822 /
2822 / 2047 / 2231 definitions of "comment", "phrase", "quoted-string", etc.
or the quite different Usefor draft definitions (which are different because
the Usefor draft uses different definitions of "text", "ctext", "qtext", etc.).


Again, it should be obvious to you that the purpose of the text under
discussion is to tell you what to do when you encounter a Usefor comment
(etc.) in a Usefor header in order to ensure that it will become a valid
Email header-field.

It receives an article with UTF8-xtra-chars in some unrecognized header.
For sure it was not a header known to the Netnews protocol, so it was
superfluous for news propagation.

Not necessarily; it may be one of the newfangled Usefor "headers" and the
"gateway" may be one operating according to the current specification which
does not define such a "header".


Naturally, a gateway that has not been upgraded to conform to the new
standard will not conform to the new standard. Do you really expect such
tautologies to be included in IETF standards?

 Or it may be an experimental "header", the
syntax of which is only known to those participating in the experiment [in
email, such experimental headers would begin with "X-", but current practice
in Usenet articles appears to be to make up some tag without consideration
of collisions with existing standards (e.g. Supersedes)].


Experimental headers are, by definition, those beginning with "X-". They
are always treated as unstructured by RFC 2047, which is also what my text
says.

 So what is the gateway
to do? I see four possibilities:

1. It leaves it as raw 8-bit and hopes that it survives. Indeed, many mail
transports will pass it on untouched, but it is liable to be munged as
soon as it hits a Sendmail. Maybe that is a reasonable risk. Maybe not.

2. It drops that particular header (presumably it was not an essential one
for delivering mail on the mailing list). But that would be a pity,
because useful information might be lost.

3. It drops the article entirely. That is hardly providing a decent
service to the readers of the mailing list.

4. It tries to encode it using RFC 2047/2231. Maybe it succeeds. Maybe it
doesn't. If it doesn't, at worst some representation will survive in the
email on the mailing list which a human might be able to decipher. At
best, some over-liberal user agent will decode the 2047/2231 stuff and
produce something sensible.

Now which of those four would you recommend the gatewayer to do? My text
recommends #4 as the least of the evils.
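
For illustration, option 4 can be sketched in Python for the simplest case,
where the octets in the unrecognized header turn out to be valid UTF-8. The
function names and the X-Example header are hypothetical. Treating the whole
value as unstructured text is an assumption, and it is exactly the guess a
gateway has to make when it does not know the header's syntax.

```python
from email.header import Header, decode_header, make_header


def encode_unstructured(name: str, raw_value: bytes) -> str:
    """Return an ASCII-only header line, using RFC 2047 encoded-words if needed.

    Assumption: raw_value is UTF-8. A UnicodeDecodeError is raised otherwise,
    and the caller must then fall back to one of the other options.
    """
    text = raw_value.decode("utf-8")
    if text.isascii():
        # Nothing to encode; pass the value through unchanged.
        return f"{name}: {text}"
    encoded = Header(text, charset="utf-8", header_name=name).encode()
    return f"{name}: {encoded}"


def decode_unstructured(value: str) -> str:
    """What a liberal user agent might do with the encoded value."""
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    line = encode_unstructured("X-Example", "Grüße aus Köln".encode("utf-8"))
    print(line)
    print(decode_unstructured(line.split(": ", 1)[1]))
```

If the header was in fact structured, the encoded-words may land in places
where RFC 2047 does not permit them. That is the "maybe it doesn't" case.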

You are presuming (incorrectly) that there are no other possible options.


Well you have failed to provide any others that were not already in my
text. And you have failed to answer the question. Which would you
recommend?

The best solution would be to continue the RFC 1036 practice of using the
Internet text message format, i.e. there would be no untagged, unencoded
illegal octets or superfluous "parameters",


Yes, but the Rough Consensus on the Usefor Group is not to do that.

Failing that ideal solution, there are still other possibilities.  For
example, the gateway could take the offending "header" (or the entire article),
encode it using an established mechanism (e.g. base64), and package it as
application/octet-stream.
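
A minimal sketch of that packaging step, using Python's standard email
package. The function name package_octets is hypothetical, and whether the
octets are a single "header" or the whole article is left to the caller.

```python
from email.mime.application import MIMEApplication


def package_octets(raw: bytes) -> MIMEApplication:
    """Wrap arbitrary octets as a base64-encoded application/octet-stream part.

    MIMEApplication applies base64 by default, so the result is 7-bit clean
    whatever the input contains.
    """
    return MIMEApplication(raw, _subtype="octet-stream")


if __name__ == "__main__":
    # Hypothetical header carrying untagged 8-bit octets.
    part = package_octets(b"X-Example: caf\xc3\xa9\r\n")
    print(part.as_string())
```

The original octets survive intact, but a reader sees nothing useful unless
the user agent knows what to do with the attachment.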


Encoding the entire article (as an application/news-transmission) is
already one of the possibilities in my text, but sadly is not applicable
for moderators (not in the short term, anyway), and would hardly be
appreciated by the readers of the mailing lists.

Encoding a single header is fine, but if you invent Yet Another Encoding
you can hardly expect the average MUA to understand it, which is why the
suggestion was to force it into an RFC 2047 encoding. That way, there is a
reasonable chance that a MUA _Might_ manage to decode it, and if not
then you have not lost anything.

One has to face the fact that there IS no perfect solution. It is simply a
question of which will best serve the users' needs.

Returning to your four suggestions:
#1 would violate Internet RFCs (822 / 2822 and probably 2821), so is 
unacceptable.


So you have never seen a violation of RFC 2822? Lucky you! No Korean spam!

The fact is that, when faced with unpalatable choices, breaking the
standard may be the least worst thing to do. However, my text did not
actually suggest that option. But I expect a lot of gateways will
actually do it.

The other three options at least produce something compatible with RFC
2821.

#4 presents several problems. One is that as the hypothetical header field
   syntax is unknown, one cannot determine what to encode, or which mechanism
   should be used.


A mechanism was suggested that will produce displayable text some of the
time, and leave the viewer looking at the undecoded text the rest of the
time. That is surely better than leaving the viewer with the undecoded text
all of the time.

It is however quite easy to detect when non-UTF-8 charsets have
been used

 It is *not* easy to *reliably*
detect charset when untagged, as in Usefor "headers".  One can detect that
an octet stream is not a valid utf-8 stream, but it is possible that an
untagged non-utf-8 octet stream may correspond to a valid utf-8 sequence
even though it is not utf-8.


The experiment that Andrew Gierth did demonstrated a false positive rate
of 0.05% IIRC.
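
Both halves of this exchange can be seen in a small Python sketch. The
function name looks_like_utf8 is hypothetical, and the sample strings are
chosen only to show one rejection and one false positive. They say nothing
about how often either occurs in real traffic.

```python
def looks_like_utf8(octets: bytes) -> bool:
    """Return True if the octets form a valid UTF-8 sequence.

    This only tests validity. It cannot prove that the sender actually
    intended UTF-8.
    """
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Typical case: ISO-8859-1 text with an accented letter is rejected,
    # because a lone 0xE9 octet is not valid UTF-8.
    print(looks_like_utf8("café".encode("latin-1")))

    # False positive: the two ISO-8859-1 characters "Ã©" are the octets
    # 0xC3 0xA9, which are also the valid UTF-8 encoding of "é".
    print(looks_like_utf8("Ã©".encode("latin-1")))
```

The first call prints False and the second prints True.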

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5