Re: Interpretation of RFC 2047


Pete Resnick wrote:

On 12/24/02 at 12:43 PM -0500, Bruce Lilly wrote:

That depends on the context. In the case of the List-Owner example and a number of fairly widely-used MUAs, the URL is properly decoded when using the List-Owner field content to generate a message to the list owner, and some even decode the 2047 encoding for display.
Absolutely irrelevant. As far as 822/2822/2047 is concerned, List-Owner does not contain a phrase.

[...]

Now, the contents of a List-Owner field can be passed to a 1738/2396 parser and, upon discovering a mailto: URL (as against any other kind of URL), one might parse the URL according to 2368 and discover within the scheme-specific part of the URL something that is a mailbox list. The *result* could then be passed to a 822/2822 parser and *then* to a 2047 parser for display purposes.


We're saying the same thing different ways.

And as far as I can tell, you can't have anything in a URL in a header field that a 2047 parser would recognize as a phrase or a comment.
URLs can contain parentheses
But URLs can't contain "=" or "?", so a 2047 parser is not going to find anything interesting in the parentheses.


Per 2396, an opaque_part can contain '=' and/or '?', and opaque_part
can be part of an absoluteURI.  Also, '?' can delimit a query as part
of a relativeURI, and a query can contain '=' and/or '?'.
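The point about the grammar can be checked mechanically. What follows is only a rough sketch: it transcribes the relevant RFC 2396 character classes (reserved, mark, unreserved, escaped, uric, uric_no_slash) into Python regular expressions and tests a string against the query and opaque_part productions. The helper names is_query and is_opaque_part are made up for illustration and come from no library.

```python
import re
import string

# Character classes transcribed from the RFC 2396 grammar.
ALPHANUM = string.ascii_letters + string.digits
MARK = "-_.!~*'()"
UNRESERVED = ALPHANUM + MARK
RESERVED = ";/?:@&=+$,"          # note that "?" and "=" are both here

ESCAPED = r"%[0-9A-Fa-f]{2}"


def _char_class(chars):
    """Build a regex alternative: one listed character or one %XX escape."""
    return "(?:[" + re.escape(chars) + "]|" + ESCAPED + ")"


URIC = _char_class(RESERVED + UNRESERVED)
URIC_NO_SLASH = _char_class(RESERVED.replace("/", "") + UNRESERVED)

QUERY = re.compile(URIC + "*")                   # query       = *uric
OPAQUE_PART = re.compile(URIC_NO_SLASH + URIC + "*")  # uric_no_slash *uric


def is_query(text):
    """True if text matches the RFC 2396 'query' production (hypothetical helper)."""
    return QUERY.fullmatch(text) is not None


def is_opaque_part(text):
    """True if text matches the RFC 2396 'opaque_part' production (hypothetical helper)."""
    return OPAQUE_PART.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_query("us-ascii?q?=3D?=)"))
    print(is_opaque_part("=?us-ascii?q?=3D?="))
```

Both calls should report a match, since '=' and '?' are members of uric and of uric_no_slash; a string containing a space or beginning with '/' would be rejected by is_opaque_part.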

Examples have been given in earlier messages in this thread; a simplistic matching of header field text using regular expressions might incorrectly match a "comment" where there is none.
But that wasn't the issue. The question that was being asked was whether a simplistic regexp "parser" would accidentally find text which it thought was 2047 syntax. Your claim was that such text could occur in a field which contained a URL.


Yes, and I stand by that claim.

> Since "=" and "?" can't appear in a URL according to 1738 and 2396, a 2047 parser should never be tripped up by a URL in a header field.


See 2396 as referenced above.

If you can come up with a serious example of where one might find something that looked like 2047 text in a field where it shouldn't find any, I'd be significantly more concerned.


I've already given several.  The bottom line is that simplistic matching
via regular expressions simply is inadequate by itself for parsing header
fields sufficiently to correctly identify RFC 2047 encoded-words. To
repeat an earlier example (slightly modified):

   Content-Location: http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)

That contains a valid RFC 2396 URI. A simplistic regular expression match
as proposed in Charles' Usefor draft would incorrectly identify a comment
and an encoded-word (neither exist in the example).  A *correct* grammar-based
parser would not identify a comment (presuming that the inherent ambiguity
in RFC 2557 is resolved by changing CFWS to FWS in the ABNF); it would
identify a URI.  The URI is parsed per 2396 as follows:

http              -> scheme                                           \
:                 -> ":"                                               |
//                -> "//"                                 \            |
users.erols.com   -> authority                             |           |
/                 -> "/"                      \            |           |
blilly            -> segment \                 |           |           |
/                 -> "/"      |                 > abs_path  > net_path  > absoluteURI
mailparse         -> segment   > path_segments |           |           |
/                 -> "/"      |                |           |           |
(=                -> segment /                /           /            |
?                 -> "?"                                               |
us-ascii?q?=3D?=) -> query                                            /
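
For anyone who wants to see the difference in practice, here is a small Python sketch that contrasts the two approaches on the Content-Location field body above. The pattern NAIVE_ENCODED_WORD is my own stand-in for "simplistic regular expression matching" and is not taken from any draft; the function names are likewise hypothetical. The second function uses the standard library's urllib.parse.urlsplit, which follows the later generic URI syntax rather than RFC 2396 itself, but for this example it splits at the first "?" exactly as in the breakdown above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical naive pattern for "=?charset?encoding?text?=".
# It looks only at the characters, not at where in the field they occur.
NAIVE_ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

FIELD_BODY = "http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)"


def naive_encoded_words(field_body):
    """Return every substring the naive pattern takes for an encoded-word."""
    return [m.group(0) for m in NAIVE_ENCODED_WORD.finditer(field_body)]


def parse_as_uri(field_body):
    """Treat the whole field body as one URI and split it into components."""
    parts = urlsplit(field_body.strip())
    return parts.scheme, parts.netloc, parts.path, parts.query


if __name__ == "__main__":
    # The naive matcher reports an encoded-word that is not there.
    print(naive_encoded_words(FIELD_BODY))
    # The structure-aware split sees only a scheme, authority, path and query.
    print(parse_as_uri(FIELD_BODY))
```

The naive matcher picks out "=?us-ascii?q?=3D?=" from the middle of the URI, while the URI split assigns "(=" to the last path segment and "us-ascii?q?=3D?=)" to the query, leaving nothing for a 2047 decoder to act on.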