Re: Interpretation of RFC 2047


In <3DFF6CA2(_dot_)80305(_at_)alex(_dot_)blilly(_dot_)com> Bruce Lilly 
<blilly(_at_)erols(_dot_)com> writes:

Charles Lindsey wrote:

I think a reasonable heuristic, which would nearly always do the "right
thing" would be:

As header field contents are defined by a grammar, attempts to
decode using only regular expressions (as opposed to a parser
which accepts the defined grammar) are doomed to failure. Failures
include both false positives and false negatives, as illustrated
below.


Sure there will be failures. What I said was 'a reasonable heuristic,
which would nearly always do the "right thing"'. The question is what is
the probability of hitting one of those awkward cases (and of finding a
recognizable "=?...?...?=" within it), as opposed to the probability of
finding a "=?...?...?=" in a genuine comment or phrase. My guess (which
seems to be supported by others in this thread) is that it will do the
"right thing" far far more often than it will do the "wrong thing".

Remember that we are merely talking about the display of headers the agent
does not know about. So it will be taking no semantic action, just letting
the reader read it.

And realize also that many (maybe most) current agents already recognize
many of those extra-2047 cases simply because their implementors couldn't
be bothered to parse every obscure header that might turn up. And by and
large those agents do an acceptable job (except that they all work to
slightly different heuristics).
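
The kind of heuristic under discussion can be sketched in a few lines. What
follows is a minimal Python illustration, not any agent's actual code: the
pattern, the function name decode_for_display and the X-Example header are
all assumptions made for this example. It looks for anything shaped like
"=?charset?encoding?text?=" in a header value and decodes it for display,
leaving the text untouched if decoding fails.

```python
import base64
import quopri
import re

# Loose pattern for "=?charset?encoding?text?=". It is deliberately not a
# parser for any header grammar (hypothetical pattern, for illustration only).
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def _decode_word(match):
    """Decode one matched encoded-word, or return it unchanged on failure."""
    charset, encoding, text = match.group(1), match.group(2).upper(), match.group(3)
    try:
        if encoding == "B":
            raw = base64.b64decode(text)
        else:
            # header=True makes "_" decode to a space, as the Q encoding requires.
            raw = quopri.decodestring(text.encode("ascii"), header=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable bytes: show the original text.
        return match.group(0)


def decode_for_display(value):
    """Replace every encoded-word lookalike in a header value for display."""
    return ENCODED_WORD.sub(_decode_word, value)


if __name__ == "__main__":
    print(decode_for_display("X-Example: =?iso-8859-1?Q?J=FCrgen?= wrote"))
```

Because the pattern knows nothing about the header's grammar, it will equally
rewrite an "=?...?...?=" that happens to sit inside a URI, which is the
failure mode being argued about in this thread.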

One cannot recognise a comment unless the header field syntax is known.


Well there seems to be some disagreement about that :-( .

   Foobar: http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)

Either could just as well be a Content-Location header. A simple
regular-expression matching heuristic would attempt to decode both.
If thus inappropriately decoded, they would yield

   Foobar: file:=
   Foobar: http://users.erols.com/blilly/mailparse/=

which are clearly not what was intended.


And what do you suppose _was_ intended? Nobody in the real world is going
to write a real URI like that. The only people who write that sort of
stuff are spammers who are trying to obfuscate things, and if a few of
their obfuscations get misinterpreted, that might even be regarded as a
Good Thing :-) .

treating the header as unstructured may result in decoding for display.
Gateways should not attempt to transform unrecognized header fields;


Exactly so. I never suggested otherwise.

Excluding content after a semicolon would fail to decode the RFC 2047
encoded-word in the header

   To: empty-list:;, =?iso-8859-1?Q?J=FCrgen?= j(_at_)foo(_dot_)com


Yes, groups in addresses are a pain, which is one of the reasons for
disallowing MIME-style parameters in some headers in Usefor. So it is more
work than just ignoring stuff after a ";", but probably still within the
capabilities of REs.

Excluding content bracketed in <> would also be an error. Consider RFCs
2368 and 2369 (not to be confused with 2396, which is also applicable) and:

   List-Owner: <mailto:%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j(_at_)foo(_dot_)com?Subject=list>

That does contain an RFC2047 encoded-word within the <>.


Eh? I don't see one.
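
No "=?...?...?=" is visible in the raw URI; whether one appears after
percent-decoding can be checked mechanically. A minimal Python sketch,
assuming only the standard library's urllib.parse.unquote, applied to the
part of the URI that precedes the mailbox (which this archive obfuscates):

```python
from urllib.parse import unquote

# Portion of the List-Owner mailto: URI quoted above, from just after
# "mailto:" up to and including the "j" that starts the mailbox.
encoded = "%3D%3fiso-8859-1%3FQ%3fJ%3dFCrgen%3F%3d%20j"

# Percent-decoding turns %3D into "=", %3F into "?" and %20 into a space.
decoded = unquote(encoded)

if __name__ == "__main__":
    print(decoded)  # prints: =?iso-8859-1?Q?J=FCrgen?= j
```

On this reading the encoded-word exists only after URI decoding, so a
matcher run over the raw header text would not find it.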

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5