spf-discuss

URL Encoding

2003-10-19 10:26:05
On Saturday, October 18, 2003, at 08:47  AM, Mark Lentczner wrote:
Macro interpolation - URL encoding is not so simple!

In short: In RFC 2396, "URI Generic Syntax", section 2.2 makes it clear that what needs to be escaped depends on which component of the URI you are escaping, and that in turn depends on which scheme the URI is for. So technically, there is no way to generically escape data that goes into a URL.

However, I've combed through the relevant RFCs, and made some assumptions (see below), and I suggest that the relevant parts of section 2.3.3 of the SPF draft be changed to read:

"The uppercase versions of those macros are to be URL-encoded. That is, any character not in the unreserved set must be escaped. The unreserved set is defined in RFC 2396 (in section 2.3), as is the escaping mechanism (in section 2.4.1)."

While this will be a bit over-zealous in escaping, it is guaranteed safe for the kinds of URLs that are likely to be generated.
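
A minimal Python sketch of the proposed rule, for illustration only. The name url_encode is hypothetical, and encoding non-ASCII input as UTF-8 before escaping is my assumption; RFC 2396 does not mandate a character encoding.

```python
# Unreserved set from RFC 2396, section 2.3: alphanum | mark
UNRESERVED = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)


def url_encode(value: str) -> str:
    """Escape every octet outside the unreserved set as %XX (RFC 2396, 2.4.1)."""
    out = []
    # Assumption: non-ASCII input is converted to UTF-8 octets first.
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x80 and ch in UNRESERVED:
            out.append(ch)  # unreserved characters pass through unchanged
        else:
            out.append("%{:02X}".format(byte))  # everything else is escaped
    return "".join(out)


print(url_encode("user+tag@example.com"))
```

Note that "+", "@", "/" and the space all get escaped, which is exactly the over-zealous behavior described above.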

The attribution at the end of Section 2.3.3.2 of the SPF draft should be changed to: "See RFC 2396 regarding URL encoding."

        - Mark

Ugly Details
------------
Assumptions:
- generated URIs are going to be http: or ftp:, or, if some future scheme is used, it will follow RFC 2396's hierarchical URI syntax
- the URI will never be relative (what would it be relative to?)
- no macro interpolation in the <scheme> portion (none of the macros would make any sense in that location)
- macro interpolation should only happen within the <segments> of the <path>, not across them (in particular, a slash in one of the interpolated values should be escaped, and not introduce an unexpected <segment> of the <path>)
- macro interpolation within the <query> should escape the commonly used structure characters ("&", "=", "+" and ";") even though the URI RFCs don't require it (otherwise interpolated values could change the structure of the <query> part)
- overzealous escaping is generally allowed, so long as the escaping phase is done once
- even though HTML 4.01 specifies that spaces are to be escaped by "+", escaping spaces with "%20" is also compliant (though never stated in the standard directly, it can be deduced)
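
To illustrate the path and query assumptions, here is a rough Python sketch. The interpolate helper, the {name} placeholder syntax, and the example.com URL are all hypothetical stand-ins, not SPF macro syntax. It uses the standard library's urllib.parse.quote with an empty safe set, which escapes slightly more than the unreserved set; that is allowed as long as escaping happens only once.

```python
from urllib.parse import parse_qsl, quote, urlsplit


def interpolate(template: str, **macros: str) -> str:
    """Escape each value exactly once, then substitute it into the template."""
    # safe="" means "/" is escaped too, so a value cannot add a path segment.
    escaped = {name: quote(value, safe="") for name, value in macros.items()}
    return template.format(**escaped)


url = interpolate(
    "http://example.com/why/{sender}?ip={ip}",  # hypothetical template
    sender="a/b&c=d@example.org",               # hostile-looking value
    ip="192.0.2.1",
)
parts = urlsplit(url)

print(url)
print(parts.path.split("/"))   # still exactly two segments after the root
print(parse_qsl(parts.query))  # still exactly one query parameter
```

The slash in the sender value stays inside a single path segment, and the "&" and "=" do not leak into the query structure.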

Here's how I came up with the character set:

Set S: characters allowed by RFC 2396 in <segment>:
                alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | ":" | "@" | "&" | "=" | "+" | "$" | "," | ";"

Set Q: characters allowed by RFC 2396 in <query> and <fragment> are those of Set S plus:
                "/" | "?"

Set H: characters that are escaped by HTML 4.01, section 17.13.4 (which uses RFC 1738 for the character set definition, but note that this is NOT the set of characters that RFC 1738 says must be escaped - RFC 1738 has a smaller set!):
                ";" | "/" | "?" | ":" | "@" | "&" | "="

Set P: there is a clear bug in the definition of HTML 4.01 in that "+" really needs to be escaped too (as it is used to encode spaces):
                "+"

Set M: the allowed non-escaped character set is equal to ((S intersect Q) - (H union P)):
                alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" | "$" | ","

Set M is RFC 2396's unreserved set plus "$" and ",". Since these two characters could easily be used to mean something to a query processor, and because there is value in using a normative reference rather than defining our own character set, I felt that using the RFC 2396 definition would be best.
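
The set arithmetic can be checked mechanically. This is just a sketch using Python sets; the variable names mirror the set letters used above, and ALNUM, MARK and UNRESERVED are my own shorthand for the RFC 2396 definitions.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
MARK = set("-_.!~*'()")            # RFC 2396 "mark" characters
UNRESERVED = ALNUM | MARK          # RFC 2396, section 2.3

S = ALNUM | MARK | set(":@&=+$,;") # allowed in a path segment
Q = S | set("/?")                  # allowed in query and fragment
H = set(";/?:@&=")                 # escaped by HTML 4.01, section 17.13.4
P = set("+")                       # "+" encodes a space, so escape it too

M = (S & Q) - (H | P)

print(sorted(M - UNRESERVED))      # what M has beyond the unreserved set
print(sorted(UNRESERVED - M))      # what the unreserved set has beyond M
```

The first difference contains only "$" and ",", and the second is empty, which matches the statement that Set M is the unreserved set plus those two characters.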

-------
Sender Permitted From: http://spf.pobox.com/
Archives at http://archives.listbox.com/spf-discuss/current/
Latest draft at http://spf.pobox.com/draft-mengwong-spf-02.txt