On Saturday, October 18, 2003, at 08:47 AM, Mark Lentczner wrote:
Macro interpolation - URL encoding is not so simple!
In short: In RFC 2396, "URI Generic Syntax", section 2.2 makes it clear
that what needs to be escaped depends on which component of the URI you
are escaping, and that in turn depends on which scheme the URI is for.
So technically, there is no way to generically escape data that goes
into a URL.
However, I've combed through the relevant RFCs, and made some
assumptions (see below), and I suggest that the relevant parts of
section 2.3.3 of the SPF draft be changed to read:
"The uppercase versions of those macros are to be URL-encoded. That
is, any character not in the unreserved set must be escaped. The
unreserved set is defined in RFC 2396 (in section 2.3), as is the
escaping mechanism (in section 2.4.1)."
While this will be a bit over-zealous in escaping, it is guaranteed
safe for the kinds of URLs that are likely to be generated.
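To make the proposed rule concrete, here is a minimal sketch of it in Python. The function name url_encode is hypothetical, and encoding the input as UTF-8 before escaping is my assumption (RFC 2396 does not say which character encoding to use for non-ASCII data):

```python
import string

# RFC 2396, section 2.3: unreserved = alphanum | mark
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-_.!~*'()")


def url_encode(value: str) -> str:
    """Escape every character that is not in the RFC 2396 unreserved set.

    Assumption: the value is converted to UTF-8 bytes first, and each
    byte that is not an unreserved ASCII character becomes %XX.
    """
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # RFC 2396, section 2.4.1: "%" followed by two hex digits
            out.append("%{:02X}".format(byte))
    return "".join(out)
```

Anything outside the unreserved set, including "/", "@", "+" and space, comes out escaped, which is the over-zealous but safe behaviour described above.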
The attribution at the end of Section 2.3.3.2 of the SPF draft should
be changed to: "See RFC 2396 regarding URL encoding."
- Mark
Ugly Details
------------
Assumptions:
- generated URIs are going to be http: or ftp:, or, if any future
scheme is used, it will follow RFC 2396's hierarchical URI syntax
- the URI will never be relative (what would it be relative to?)
- no macro interpolation in the <scheme> portion (none of the macros
would make any sense in that location).
- macro interpolation should only happen within the <segments> of the
<path>, not across them (in particular, a slash in one of the
interpolated values should be escaped, and not introduce an unexpected
<segment> of the <path>).
- macro interpolation within the <query> should escape the commonly
used structure characters ("&", "=", "+" and ";") even though the URI
RFCs don't require it (otherwise interpolated values could change the
structure of the <query> part.)
- overzealous escaping is generally allowed, so long as the escaping
phase is done once.
- even though HTML 4.01 specifies that spaces are to be escaped by "+",
escaping spaces with "%20" is also compliant (though never stated in
the standard directly, it can be deduced)
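The path, query and escape-once assumptions above can be checked with a short Python sketch. The host example.com, the URL layout and the variable names are made up for illustration. It uses the standard library's urllib.parse.quote with a safe set chosen so that only RFC 2396 unreserved characters pass through:

```python
from urllib.parse import parse_qsl, quote, urlsplit


def escape(value: str) -> str:
    # quote() always leaves letters, digits and "_.-" alone; adding
    # "!~*'()" as safe makes the result match RFC 2396's unreserved set.
    return quote(value, safe="!~*'()")


# Hypothetical interpolated values that try to alter the URL structure.
path_value = "evil/../x"
query_value = "a&b=c"

url = "http://example.com/why/" + escape(path_value) + "?s=" + escape(query_value)
parts = urlsplit(url)

# The slash in the value is escaped, so the path still has two segments.
segments = parts.path.split("/")

# "&" and "=" in the value are escaped, so the query still has one field.
fields = parse_qsl(parts.query)

# Escaping twice is not harmless: the "%" itself gets escaped again.
once = escape("a b")
twice = escape(once)
```

Here segments has only the "why" segment and one more after the leading empty string, fields has a single ("s", "a&b=c") pair, and twice differs from once, which is why the escaping phase must run exactly once.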
Here's how I came up with the character set:
Set S: characters allowed by RFC 2396 in <segment>
alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" |
"(" | ")" | ":" | "@" | "&" | "=" | "+" | "$" |
"," | ";"
Set Q: characters allowed by RFC 2396 in <query> and <fragment> are
those of Set S plus:
"/" | "?"
Set H: characters that are escaped by HTML 4.01, section 17.13.4
(which uses RFC 1738 for the character set definition, but note that
this is NOT the set of characters that RFC 1738 says must be escaped -
RFC 1738 has a smaller set!):
";" | "/" | "?" | ":" | "@" | "&" | "="
Set P: there is a clear bug in the definition of HTML 4.01 in that "+"
really needs to be escaped too (as it is used to encode spaces):
"+"
Set M: the allowed non-escaped character set is equal to ((S intersect
Q) - (H union P)):
alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" |
"$" | ","
Set M is RFC 2396's unreserved set plus "$" and ",". Since these two
characters could easily be used to mean something to a query processor,
and because there is value in using a normative reference rather than
defining our own character set, I felt that using the RFC 2396
definition would be best.
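
For anyone who wants to check the set arithmetic, here it is as a small Python sketch. The variable names simply mirror the set letters used above, and UNRESERVED is RFC 2396's unreserved set (alphanum plus mark):

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
MARK = set("-_.!~*'()")

S = ALNUM | MARK | set(":@&=+$,;")   # allowed in <segment>
Q = S | set("/?")                    # allowed in <query> and <fragment>
H = set(";/?:@&=")                   # escaped by HTML 4.01, section 17.13.4
P = set("+")                         # "+" is used to encode spaces

# Set M: the allowed non-escaped characters
M = (S & Q) - (H | P)

UNRESERVED = ALNUM | MARK            # RFC 2396, section 2.3

# Characters in M that the unreserved set does not cover
extra = M - UNRESERVED
```

Since Q is a superset of S, (S intersect Q) is just S, and extra comes out as exactly "$" and ",".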
-------
Sender Permitted From: http://spf.pobox.com/
Archives at http://archives.listbox.com/spf-discuss/current/
Latest draft at http://spf.pobox.com/draft-mengwong-spf-02.txt