Re: Filtering outgoing messages on DNS names or local-part structure


John C Klensin <john+smtp(_at_)jck(_dot_)com> wrote:

I recently posted draft-klensin-name-filters-01.txt as an I-D.

Comments welcome.


Good idea.  I'd love to see web applications stop mangling the plus-sign
in my email address.

Section 3 says:

         Abc\@def@example.com

    is a valid form of an email address.

    Conventional double-quote characters may be used to surround strings.

         "Abc@def"@example.com

I think the first form is not valid (but the second is).  If I'm not
mistaken, RFCs 821, 822, 2821, and 2822 all agree that quoted-pairs are
allowed in quoted-strings, but not in atoms.
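
To make the distinction concrete, here is a rough Python sketch of the
RFC 2822 rule as read above.  It is an assumption-laden simplification:
the name local_part_is_valid is made up for illustration, and comments,
folding white space, and the obsolete syntax are ignored.

```python
import re

# atext from RFC 2822: letters, digits, and a fixed set of specials.
# Backslash and at-sign are NOT in this set.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom: runs of atext separated by single dots.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

# quoted-string (simplified): qtext or a backslash quoted-pair,
# surrounded by double quotes.
QUOTED_STRING = re.compile(r'"(?:[^"\\\r\n]|\\.)*"')


def local_part_is_valid(local: str) -> bool:
    """Hypothetical helper: True if local is a dot-atom or a quoted-string."""
    return bool(DOT_ATOM.fullmatch(local) or QUOTED_STRING.fullmatch(local))


print(local_part_is_valid(r'Abc\@def'))    # quoted-pair outside quotes
print(local_part_is_valid('"Abc@def"'))    # at-sign inside a quoted-string
print(local_part_is_valid(r'"Abc\@def"'))  # quoted-pair inside a quoted-string
```

Under this reading only the two quoted forms are accepted.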

The character you called "acute accent" is actually a grave accent.

Section 4 says "The syntax for URLs (Uniform Resource Locators) is
specified in [4]" (where [4] maps to RFC 1738).  RFC 1738 specifies
the syntax of some URLs, but others (including http and mailto) are
currently specified by other documents.  See the IANA URI scheme
registry (http://www.iana.org/assignments/uri-schemes).  Most of the URI
schemes listed there are URL schemes, but it's not always easy to tell
which ones.  I have been unable to find any formal definition of URL.

You might want to just avoid the term URL, and say that generic URI
syntax is specified by RFC 2396, and scheme-specific URI syntax is
specified by other documents: RFC 2616 for http, RFC 2368 for mailto,
RFC 1738 for several others; refer the reader to the IANA registry for
the complete list.

Section 4.3:

It looks like some lists of characters were lost.

In the first example, "Joe" has become "joe".  This brings up another
point that would be good to make in this document: filters must not
capitalize or decapitalize letters in local parts, nor in URIs (except
for the host field).
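
A minimal Python sketch of a filter that respects this rule might look
like the following.  The function name normalize_address is hypothetical,
and the split on the last at-sign is a simplification that assumes the
domain itself contains no at-sign.

```python
def normalize_address(addr: str) -> str:
    """Lowercase only the domain; leave the local part untouched."""
    # Split on the LAST at-sign so a quoted local part such as
    # "Abc@def" keeps its embedded at-sign.
    local, sep, domain = addr.rpartition("@")
    if not sep:
        return addr  # no at-sign: not something we understand, leave it alone
    return f"{local}@{domain.lower()}"


print(normalize_address("Joe+Tag@Example.COM"))
```

The local part "Joe+Tag" comes back exactly as it went in; only the
domain is folded to lower case.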

Note 2 says 'There is actually some uncertainty as to whether or not the
"+" characters requires escaping in MAILTO URLs (the standards are not
precisely clear).'  Wow, I never noticed what a nightmare RFC 2368 is in
this respect.  It says:

    Following the syntax conventions of RFC 1738 [RFC1738], a "mailto"
    URL has the form:

    mailtoURL  =  "mailto:" [ to ] [ headers ]
    ...

    Note that all URL reserved characters in "to" must be encoded: in
    particular, parentheses, commas, and the percent sign ("%"), which
    commonly occur in the "mailbox" syntax.

Huh?  Neither parentheses nor comma nor percent are listed as reserved
by RFC 1738 (RFC 2396 lists comma but not the others).  Percent must be
escaped, but not because it is reserved (RFC 1738 calls it "unsafe", and
RFC 2396 calls it "delims", but both explain that it needs to be escaped
because it is the escape character).  RFC 1738 (which is the citation
here) explicitly lists parentheses and comma as characters that need not
be escaped.  But at-signs *are* listed as reserved in both RFC 1738 and
RFC 2396.  So at-signs must be escaped?  But the examples in RFC 2368
show unescaped at-signs.  This requirement is complete nonsense.  If I
were trying to implement mailto URIs, my best guess would be to ignore
it.

A little later, RFC 2368 says 'Within mailto URLs, the characters
"?", "=", "&" are reserved.'  Now that I can believe, because those
characters are indeed used to delimit the components of the mailto URI.

Getting back to section 4.3 of your draft, the third example escapes
the slash but not the equal-sign, which is backwards.  The equal-sign
almost certainly needs to be escaped, because it is explicitly listed as
reserved by RFC 2396, and is actually used as a delimiter in mailto URIs
(in the headers part).  But the slash is in the same boat as plus-sign.
The only thing hinting that slash or plus-sign might need to be escaped
is that crazy sentence that would also imply that at-sign needs to be
escaped, which we know from the examples is wrong.
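
Under that reading (and it is only one reading of an unclear spec), a
mailto generator could be sketched in Python as below.  The function name
mailto_uri and the exact choice of unescaped characters are assumptions
for illustration, not anything RFC 2368 states.

```python
from urllib.parse import quote


def mailto_uri(address: str) -> str:
    """Hypothetical helper: build a mailto URI for a single address."""
    # Leave at-sign, plus-sign and slash alone; everything else outside
    # the unreserved set (including "%", "?", "=" and "&") becomes %HH.
    # This over-escapes a few harmless characters, which is safe.
    return "mailto:" + quote(address, safe="@+/")


print(mailto_uri("user+tag@example.com"))
print(mailto_uri("a=b/c@example.com"))
```

The first address passes through unchanged; in the second, the
equal-sign is escaped and the slash is not.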

There is another escaping-related gotcha that often breaks mail
addresses containing plus-signs:

Some server-side programs generate HTML containing URIs whose query
strings carry previously supplied user data, but forget to apply
x-www-form-urlencoded escaping to that data.  The result is corruption
of any data that contained a plus-sign, percent-sign, ampersand, or
equal-sign.
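
The failure is easy to reproduce with Python's standard library.  In this
sketch the parameter name "to" and both function names are made up; the
point is only the difference between pasting raw data into a query
string and escaping it first.

```python
from urllib.parse import parse_qs, urlencode


def naive_query(address: str) -> str:
    """The bug: user data pasted into the query string unescaped."""
    return "to=" + address


def escaped_query(address: str) -> str:
    """The fix: apply x-www-form-urlencoded escaping to the value."""
    return urlencode({"to": address})


address = "user+tag@example.com"

# The receiving side decodes "+" as a space, so the naive form is corrupted.
print(parse_qs(naive_query(address)))
print(parse_qs(escaped_query(address)))
```

Decoding the naive form yields "user tag@example.com"; decoding the
escaped form yields the original address.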

For example, Yahoo mail is able to reply to email addresses containing
these characters, but if such an address is stored in the address book,
new mail cannot be sent to it.  Hotmail has a similar problem.

Even if x-www-form-urlencoded encoding and decoding is done at all the
right places, there is another potential gotcha for plus-signs, because
the specification of application/x-www-form-urlencoded is broken.  HTML
2.0 and 4.01 both say:

    Space characters are replaced by `+', and then reserved characters
    are escaped as described in RFC 1738: non-alphanumeric characters
    are replaced by `%HH'...

Which is it, reserved characters or non-alphanumeric characters?  Either
way, the specified process is not reversible, because it performs %HH
escaping *after* changing spaces to plus-signs.  For example, the values
"foo+bar" and "foo bar" map to the same thing, either "foo+bar" (if
plus-sign is not escaped), or "foo%2Bbar" (if plus-sign is escaped).
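
The collision can be shown with a literal implementation of the
specified order.  This is a sketch: encode_spec_order and its escape_plus
flag are invented here to model the two possible readings of the spec.

```python
from urllib.parse import quote


def encode_spec_order(value: str, escape_plus: bool) -> str:
    """Hypothetical encoder following the order in the HTML spec text."""
    # Step 1, as specified: spaces are replaced by plus-signs.
    step1 = value.replace(" ", "+")
    # Step 2, as specified: %HH escaping happens afterwards.  Whether
    # "+" is among the escaped characters depends on the reading.
    return quote(step1, safe="" if escape_plus else "+")


for escape_plus in (False, True):
    a = encode_spec_order("foo+bar", escape_plus)
    b = encode_spec_order("foo bar", escape_plus)
    print(escape_plus, a, b, a == b)
```

Under either reading the two inputs produce identical output, so no
decoder can tell them apart.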

As far as I know, browsers always violate the spec and do something
sane instead: they do the %HH escaping *before* changing spaces to
plus-signs, and they include plus-sign in the set of characters to be
escaped.  That way, the server can distinguish between "foo%2Bbar"
(which means "foo+bar") versus "foo+bar" (which means "foo bar").
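
Python's urllib.parse.quote_plus happens to use this same sane order
(escape first, then turn spaces into plus-signs), so it can stand in for
the browser behaviour in a quick round-trip check.  The helper name
round_trips is made up for this sketch.

```python
from urllib.parse import quote_plus, unquote_plus


def round_trips(value: str) -> bool:
    """Hypothetical check: does encode-then-decode return the input?"""
    return unquote_plus(quote_plus(value)) == value


print(quote_plus("foo+bar"))  # the literal plus-sign is %HH-escaped
print(quote_plus("foo bar"))  # the space becomes a plus-sign
print(round_trips("foo+bar"), round_trips("foo bar"))
```

The two values now encode differently, and both survive the round trip.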

AMC