Re: Interpretation of RFC 2047


Charles Lindsey wrote:

> I think a reasonable heuristic, which would nearly always do the "right
> thing" would be:
>
> NOT to decode anything within properly matched "...", <...> or [...] or
> which follows a ';' which looks like the start of some MIME parameters.
> And otherwise decode anything enclosed by WS or within properly matched
> and nested (...).


As header field contents are defined by a grammar, attempts to
decode using only regular expressions (as opposed to a parser
which accepts the defined grammar) are doomed to failure. Failures
include both false positives and false negatives, as illustrated
below.

> But there is a more interesting question, which is what agents that create
> unrecognized headers with 8bit stuff in them could usefully do. I.e. a
> user tries to create a Foobar: header with such stuff in it. This could be
> a problem in news to mail gatewaying. Treating all such headers as
> unstructured is possible, but might not do the right thing. Trying to
> recognise comments might be better (not within "...", <...> or [...]
> though).


One cannot recognise a comment unless the header field syntax is known.

   Content-Features: (& (Type="text/plain") (charset=US-ASCII) )

contains no comments.

   Foobar: (& (Type="text/plain") (charset=US-ASCII) )

might or might not contain comments depending on the definition of
the Foobar header field.  I submit that

   Foobar: file:(=?us-ascii?q?=3D?=)

does not contain a comment. It does have matched parentheses.  It does
not contain an RFC 2047 encoded-word and does not encode any 8-bit
characters.  It does contain a syntactically valid absolute URI.

   Foobar: http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)

does not contain a comment. It does have matched parentheses.  It does
not contain an RFC 2047 encoded-word and does not encode any 8-bit
characters.  It contains a valid absolute URI with a query.  You are
welcome to try the URI; it does work (though the query is ignored).

Either could just as well be a Content-Location header. A simple regular
expression matching heuristic would attempt to decode both.  If thus
inappropriately decoded, they would yield

   Foobar: file:=
   Foobar: http://users.erols.com/blilly/mailparse/=

which are clearly not what was intended.  You are welcome to try the
last one as a URI; you will get a 404 not found error.  N.B. in general
you might instead have stumbled upon a valid URI which was different
from the intended one.
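
The false positive is easy to reproduce. What follows is only a sketch of
the kind of grammar-blind matcher under discussion, not anybody's actual
implementation; the names ENCODED_WORD and naive_decode are made up for
illustration. Note that this version leaves the parentheses in place, but
the URI is altered all the same.

```python
import base64
import re

# Hypothetical grammar-blind pattern: anything shaped like =?charset?Q|B?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def _decode_match(match):
    charset, encoding, text = match.group(1), match.group(2).upper(), match.group(3)
    if encoding == "Q":
        # Q encoding: "_" stands for space, "=XX" for the octet with hex value XX.
        raw = re.sub(
            rb"=([0-9A-Fa-f]{2})",
            lambda h: bytes([int(h.group(1), 16)]),
            text.replace("_", " ").encode("ascii"),
        )
    else:
        raw = base64.b64decode(text)
    return raw.decode(charset, errors="replace")


def naive_decode(value):
    """Decode every look-alike, with no knowledge of the field's syntax."""
    return ENCODED_WORD.sub(_decode_match, value)


print(naive_decode("file:(=?us-ascii?q?=3D?=)"))
# file:(=)
print(naive_decode("http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)"))
# http://users.erols.com/blilly/mailparse/(=)
```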

Treating an unrecognized header field as unstructured in the above
examples would not mangle the URIs, for display or otherwise.  If the
example were instead

   Foobar: =?us-ascii?q?-3D?=

treating the header as unstructured may result in decoding for display.
Gateways should not attempt to transform unrecognized header fields; it
is unknown whether or not the above example really contains an encoded-word.
If an unrecognized header field has content which is forbidden in the
destination network, the header could be elided. If the content is not
forbidden, the unrecognized header field should be passed unaltered.  A
network which would forbid RFC 2047 encoded-word content would be rather
unusual, to say the least.  A gateway should never decode RFC 2047
encoded-words in header fields, as the decoded word may have octets or
combinations of octets which are illegal in header fields (e.g. NUL, DEL,
8-bit-set, lone CR).  Such decoding might be acceptable if both of the
following conditions apply:
1. the destination network uses content in some format other than RFC
   2822 header fields (otherwise, there's no need for transformation).
2. it is guaranteed that a reverse transformation from the destination
   network to Internet mail is possible and produces content equivalent
   to the original (i.e. equivalent to leaving the header field unaltered),
   or that no reverse gateway will attempt to regenerate the header field
   (i.e. equivalent to eliding the header in the forward gateway
   transformation).
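
To illustrate why decoding in a gateway is hazardous, here is a small
sketch that reports the octets named above (NUL, DEL, 8-bit-set, lone CR)
in an already decoded word. The function name illegal_header_octets is
hypothetical, and the check covers only those four cases, not the full
set of restrictions on header field content.

```python
def illegal_header_octets(raw):
    """Return (offset, octet) pairs for NUL, DEL, 8-bit-set and lone CR."""
    found = []
    for i, octet in enumerate(raw):
        # A CR is "lone" when it is not immediately followed by LF.
        lone_cr = octet == 0x0D and raw[i + 1:i + 2] != b"\n"
        if octet == 0x00 or octet == 0x7F or octet > 0x7F or lone_cr:
            found.append((i, octet))
    return found


# Octets obtained by decoding =?iso-8859-1?Q?J=FCrgen?= (0xFC is u-umlaut).
print(illegal_header_octets(b"J\xfcrgen"))
# [(1, 252)]
```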

Excluding content after a semicolon would fail to decode the RFC 2047
encoded-word in the header

   To: empty-list:;, =?iso-8859-1?Q?J=FCrgen?= j@foo.com
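
A sketch of the false negative, assuming the proposed rule is implemented
as "ignore everything after the first semicolon" (my reading of the
heuristic, with hypothetical names); it lists the look-alike words that
such a rule would leave undecoded:

```python
import re

# Hypothetical pattern for text shaped like an encoded-word.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?=")


def skipped_by_semicolon_rule(value):
    """Return encoded-word look-alikes that follow the first ';'."""
    _head, sep, tail = value.partition(";")
    return ENCODED_WORD.findall(tail) if sep else []


print(skipped_by_semicolon_rule("empty-list:;, =?iso-8859-1?Q?J=FCrgen?= rest"))
# ['=?iso-8859-1?Q?J=FCrgen?=']
```

Here the semicolon terminates an empty group in an address list; it has
nothing to do with MIME parameters, which only a parser for the To field
grammar can tell.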

Excluding content bracketed in <> would also be an error. Consider RFCs
2368 and 2369 (not to be confused with 2396, which is also applicable) and:

   List-Owner:
    <mailto:%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j@foo.com?Subject=list>

That does contain an RFC 2047 encoded-word within the <>. Decoding the
content within <> first requires undoing the URI percent-encoding, which
yields the same mailbox as specified in the To header example above.
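
The two layers can be peeled apart with Python's standard urllib.parse
module. This is a sketch only; the address is written out here as
j@foo.com, and the variable names are mine.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# The URI from the List-Owner example, without the enclosing <>.
uri = "mailto:%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j@foo.com?Subject=list"

parts = urlsplit(uri)
# Only the unescaped "?" separates the additional header fields;
# the %3F/%3f escapes belong to the mailbox part.
mailbox = unquote(parts.path)
extra_fields = parse_qs(parts.query)

print(mailbox)
# =?iso-8859-1?Q?J=FCrgen?= j@foo.com
print(extra_fields)
# {'Subject': ['list']}
```

Only after this step does the encoded-word become visible, so a decoder
that skips everything within <> never sees it.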

Further note that List-Owner provides for specifying additional header
fields (as with Subject in the example above), and of course
Content-Location and the hypothetical Foobar are not excluded.

Other examples could be given, but the above show that it is necessary
to fully parse header field content in order to determine whether or
not there is an encoded-word; use of regular expressions (or the
equivalent) is inadequate.