procmail

Re: Trapping non-standard URL's in spam

1999-03-22 09:59:24
"John D. Hardin" <jhardin(_at_)wolfenet(_dot_)com> writes:
On Sun, 21 Mar 1999, Walter Dnes wrote:

 NONSTANDARD="(0x[0-9a-f]+|0[0-7]+)"
 :0fb
 *  1^0 http:(//|//.*@)[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
 *  1^0 http:(//|//.*@)0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]
 *$ 1^0 http:(//|//.*@)${NONSTANDARD}\..*\..*\..*
 *$ 1^0 http:(//|//.*@).*\.${NONSTANDARD}\..*\..*
 *$ 1^0 http:(//|//.*@).*\..*\.${NONSTANDARD}\..*
 *$ 1^0 http:(//|//.*@).*\..*\..*\.${NONSTANDARD}
...
Instead of \..*\. shouldn't it be \.[0-9]+\. ? What if somebody has
a legitimate server at, say, http://0xdeadbeef.computer.lore.com/ ?


As long as you're going to do it, do it right:

URANGE = "-_.!~*'()a-z0-9;:&=+\$,"
USERPART = "(([$URANGE]|%[0-9a-f][0-9a-f])*@)?"
NOTUSER = "[^$URANGE%]"
OCTET   = '(0+|[1-9][0-9]*)'
ILLEGAL = '0(0*[1-9][0-9]*|x[0-9a-f]+)'
EITHER  = '(0x[0-9a-f]+|[0-9]+)'

* 1^0 $ http://${USERPART}\
        ($ILLEGAL(\.$EITHER(\.$EITHER(\.$EITHER)?)?)?|\
         $OCTET(|\.($ILLEGAL(\.$EITHER(\.$EITHER)?)?|\
                    $OCTET(|\.($ILLEGAL(\.$EITHER)?|\
                               $OCTET(\.$ILLEGAL)?)))))$NOTUSER

That regexp should catch all of the bad URLs that you're looking for:
a) numeric, yet fewer than four components given,
b) contains one or more octal or hex components (but no more than four
        components total)
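
For anyone who wants to poke at the pattern outside procmail, here is a
rough Python translation. It is a sketch, not a tested procmail recipe:
the function name is_suspicious and the sample URLs are made up for
illustration, re.IGNORECASE stands in for procmail's default case
insensitivity, and the "%" is placed at the end of the negated class so
that "%-_" cannot be read as a character range.

```python
import re

# Building blocks, mirroring the procmail variables of the same names.
URANGE = r"-_.!~*'()a-z0-9;:&=+$,"
USERPART = rf"(?:(?:[{URANGE}]|%[0-9a-f][0-9a-f])*@)?"
NOTUSER = rf"[^{URANGE}%]"          # assumption: "%" moved to the end
OCTET = r"(?:0+|[1-9][0-9]*)"       # plain decimal component
ILLEGAL = r"0(?:0*[1-9][0-9]*|x[0-9a-f]+)"  # leading-zero or hex component
EITHER = r"(?:0x[0-9a-f]+|[0-9]+)"  # anything numeric

# Same nesting as the procmail condition, one line per original line.
HOST = (
    rf"(?:{ILLEGAL}(?:\.{EITHER}(?:\.{EITHER}(?:\.{EITHER})?)?)?|"
    rf"{OCTET}(?:|\.(?:{ILLEGAL}(?:\.{EITHER}(?:\.{EITHER})?)?|"
    rf"{OCTET}(?:|\.(?:{ILLEGAL}(?:\.{EITHER})?|"
    rf"{OCTET}(?:\.{ILLEGAL})?)))))"
)
PATTERN = re.compile(rf"http://{USERPART}{HOST}{NOTUSER}", re.IGNORECASE)


def is_suspicious(text):
    """Return True if text contains a URL with a non-standard numeric host."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    for url in ("http://192.168.1.1/", "http://3232235777/",
                "http://0300.0250.01.01/", "http://3com.com/"):
        print(url, is_suspicious(url))
```

Note that the pattern needs one character after the host for NOTUSER to
match, so a URL sitting at the very end of the input with nothing after
it will not be caught.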

Furthermore, it's phrased to be efficient for the regexp engine: when
matching with the above the regexp engine should only have to follow
more than two branches simultaneously for more than one character when
it encounters a leading zero.  (Two branches have to be followed almost
all of the time because of the ${USERPART} bit.)  If that doesn't make
sense to you, just trust me that almost all of the other ways of
writing the above are slower.  In particular, Walter's version forces a
lot of useless matching with all of the ".*" in it.  (Sorry Walter!)


Side issue: what would a browser try to do with that example URL?

It should look up "0xdeadbeef.computer.lore.com" in the DNS, just like
any other hostname.  Hostnames are explicitly allowed to contain
components that start with a number or are entirely numeric.
"3com.com" is a legal hostname, as is "35.foo.com".


I'll note here that the following hostname is perfectly legal,
and should arguably be looked up in the DNS like any other hostname:
                0xa.0xa.0xa.0xa

However, there's no "0xa" toplevel domain (like "com"), and it would be
pretty stupid of anyone to register such a thing, so I'm not going to
worry about that loss.
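
For reference, here is a small Python sketch of how a client that
applies the classic BSD inet_aton() rules would read such a string as
an address instead of a name: one to four components, "0x" meaning hex,
a leading "0" meaning octal, and the last component filling whatever
bytes remain. The function name decode_numeric_host is hypothetical,
and whether a given browser actually does this is an assumption, not
something established above.

```python
import re


def decode_numeric_host(host):
    """Return the dotted-quad form of host under inet_aton-style rules,
    or None if host is not a numeric address in that sense."""
    parts = host.lower().split(".")
    if not 1 <= len(parts) <= 4:
        return None
    values = []
    for part in parts:
        if re.fullmatch(r"0x[0-9a-f]+", part):
            values.append(int(part[2:], 16))   # hex component
        elif re.fullmatch(r"0[0-7]*", part):
            values.append(int(part, 8))        # octal component (or zero)
        elif re.fullmatch(r"[1-9][0-9]*", part):
            values.append(int(part, 10))       # decimal component
        else:
            return None                        # not numeric: a real name
    *leading, last = values
    # Leading components are single bytes; the last fills the rest.
    if any(v > 0xFF for v in leading) or last >= 256 ** (4 - len(leading)):
        return None
    address = last
    for i, v in enumerate(leading):
        address |= v << (8 * (3 - i))
    return ".".join(str((address >> s) & 0xFF) for s in (24, 16, 8, 0))


if __name__ == "__main__":
    for host in ("0xa.0xa.0xa.0xa", "3232235777", "127.1", "3com.com"):
        print(host, decode_numeric_host(host))
```

Under those rules "0xa.0xa.0xa.0xa" comes out as 10.10.10.10, while
"3com.com" and "0xdeadbeef.computer.lore.com" are rejected as
non-numeric and would go to the DNS.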


Philip Guenther
