Re: Interpretation of RFC 2047


In <3DFA180A(_dot_)7030705(_at_)alex(_dot_)blilly(_dot_)com> Bruce Lilly 
<blilly(_at_)erols(_dot_)com> writes:

Charles Lindsey wrote:

Would it not be a sensible rule to say that you should decode any occurrence
of =?<charset>?[BQ]?...?= (subject to the 76 character limit) in any
header provided:
    (a) it was immediately preceded by '(' or by CFWS
    (b) it was immediately followed by ')' or by CFWS
    (c) it was not contained within a quoted-string

(d) it was not part of a MIME parameter (RFC 2047 expressly forbids 2047
    encoding in MIME parameters; RFC 2231 provides a mechanism for parameters
    and also extends 2047 to include language tags)

... and more (see below)

Actually, some parsing is required, because an encoded word in an
unstructured header must have LWS (i.e. CFWS) on either side of it, whereas
it can also have '(' and ')' immediately next to it in a structured header.
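
To make the difference concrete, here is a minimal Python sketch of the
boundary check alone. The function name decodable_spans and its
"structured" flag are hypothetical, introduced only for illustration. It
assumes the caller already knows which kind of header it has, treats only
plain whitespace as CFWS (a preceding comment is not recognised), and does
not look at quoted-strings at all. The length check uses 75, which is the
limit RFC 2047 places on a single encoded-word (76 is its limit for a header
line that contains one).

```python
import re

# Shape of an RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")
MAX_LEN = 75  # RFC 2047 limit for one encoded-word
WS = " \t\r\n"


def decodable_spans(value, structured):
    """Return (start, end) spans of encoded-words with acceptable neighbours.

    Hypothetical helper: in an unstructured header only whitespace (or the
    ends of the value) may adjoin an encoded-word; in a structured header
    '(' may also precede it and ')' may also follow it.
    """
    left_ok = WS + ("(" if structured else "")
    right_ok = WS + (")" if structured else "")
    spans = []
    for m in ENCODED_WORD.finditer(value):
        start, end = m.span()
        if end - start > MAX_LEN:
            continue  # too long to be a legal encoded-word
        before_ok = start == 0 or value[start - 1] in left_ok
        after_ok = end == len(value) or value[end] in right_ok
        if before_ok and after_ok:
            spans.append((start, end))
    return spans
```

With the value "(=?iso-8859-1?Q?Andr=E9?=) x" the word is reported when
structured is true and skipped when it is false, which is exactly the
distinction described above.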


I think a reasonable heuristic, which would nearly always do the "right
thing", would be:

NOT to decode anything within properly matched "...", <...> or [...] or
which follows a ';' which looks like the start of some MIME parameters.
And otherwise decode anything enclosed by WS or within properly matched
and nested (...).
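
A rough Python sketch of that heuristic follows, for display purposes only.
Everything named here (decode_unknown_header, PARAM_START, the regular
expressions) is an assumption made for illustration, not something taken
from any RFC or existing agent. Known simplifications: backslash escapes
inside "..." are ignored, an unclosed '(' is not detected, "looks like the
start of some MIME parameters" is approximated as ';' followed by a token
and '=', and whitespace between adjacent encoded-words is left in place.

```python
import re
from email.errors import HeaderParseError
from email.header import decode_header

ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")
# Guess at "start of MIME parameters": ';', optional space, token, '='.
PARAM_START = re.compile(r";\s*[\w*'%.+-]+=(?!\?)")
WS = " \t\r\n"
CLOSERS = {'"': '"', "<": ">", "[": "]"}


def _decode_word(word):
    """Decode one encoded-word; return it unchanged if decoding fails."""
    try:
        raw, charset = decode_header(word)[0]
        return raw.decode(charset or "ascii") if isinstance(raw, bytes) else raw
    except (HeaderParseError, LookupError, UnicodeError):
        return word


def decode_unknown_header(value):
    """Apply the heuristic to the value of a header of unknown syntax."""
    out = []
    i, depth, n = 0, 0, len(value)
    while i < n:
        ch = value[i]
        if ch in CLOSERS:
            close = value.find(CLOSERS[ch], i + 1)
            if close != -1:  # matched "...", <...> or [...]: copy verbatim
                out.append(value[i:close + 1])
                i = close + 1
                continue
        if ch == ";" and depth == 0 and PARAM_START.match(value, i):
            out.append(value[i:])  # parameters follow: decode nothing more
            break
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        m = ENCODED_WORD.match(value, i)
        if m and m.end() - i <= 75:
            # Parentheses count as delimiters only inside a comment.
            edge = WS + "()" if depth > 0 else WS
            before_ok = i == 0 or value[i - 1] in edge
            after_ok = m.end() == n or value[m.end()] in edge
            if before_ok and after_ok:
                out.append(_decode_word(m.group()))
                i = m.end()
                continue
        out.append(ch)
        i += 1
    return "".join(out)
```

For example, the value 'Foo (=?iso-8859-1?Q?Andr=E9?=) bar' has its comment
decoded, while the same encoded-word inside "..." or after something like
'; filename=x' is passed through untouched.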

Other areas that immediately come to mind are:
1. RFC 2557 Content-Location, which permits URIs, which in turn (RFC 2396)
   permit parentheses.  That's in a structured field, but a URI, not a
   comment. [there are issues with 2557 and CFWS vs. the URIs, and these
   have been discussed on the MHTML list]


But URIs are not supposed to contain 8bit stuff, so the question should
not arise. And if IRIs should ever get into the standard, then there is a
special downgrading to URI built in (yes, that is Yet Another Encoding for
us to have to worry about :-( ). Mind you, if the URIs are enclosed within
<...>, then my rule above would cover them.

I'm not certain, but I don't believe that the filter syntax permits anything
resembling 2047 encoding.  URIs probably do, but again, I haven't checked
thoroughly.


We are talking only of headers which the agent does not already know how
to parse, so presumably it is not going to do more than display them. But
displaying them decoded is probably more helpful to the reader than
leaving them alone.

Strictly speaking, one can only decode if one knows the relevant header
syntax.  Display is a relatively minor issue, subject to the above
caveat.  But transformations by gateways may result in fouling up content
beyond all recognition unless the header syntax is known.  Ideally,
gateways shouldn't decode encoded-words -- if they're left in encoded
form there is no chance that they'll be garbled, which is the likely
outcome unless strict syntax of headers is known and applied rigorously.


Agreed.

But there is a more interesting question, which is what agents that create
unrecognized headers with 8bit stuff in them could usefully do. I.e. a
user tries to create a Foobar: header with such stuff in it. This could be
a problem in news to mail gatewaying. Treating all such headers as
unstructured is possible, but might not do the right thing. Trying to
recognise comments might be better (not within "...", <...> or [...]
though).
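
As an illustration of the simplest option, treating the unknown header as
unstructured, here is a small Python sketch using the standard library's
email.header module. The function name encode_unknown_header and the default
of UTF-8 are assumptions made for the example. It makes no attempt at the
comment-aware variant, so any "...", <...> or [...] in the text would be
swallowed into the encoded-words along with everything else.

```python
from email.header import Header, decode_header


def encode_unknown_header(name, text, charset="utf-8"):
    """Build a header line, treating the field body as unstructured text.

    Hypothetical helper: pure ASCII text is passed through unchanged;
    anything else is turned into RFC 2047 encoded-words by email.header.
    """
    if text.isascii():
        return f"{name}: {text}"
    encoded = Header(text, charset=charset, header_name=name).encode()
    return f"{name}: {encoded}"


# Round trip: what a reader at the far end would recover.
line = encode_unknown_header("Foobar", "caf\u00e9 au lait")
value = line.split(": ", 1)[1]
recovered = "".join(raw.decode(cs) for raw, cs in decode_header(value))
```

Whether this does the right thing depends entirely on what the Foobar:
header turns out to mean, which is the problem raised above.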

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5