procmail

Re: Filtering URL In Message Body

2000-12-17 03:04:43
Eric Hilding <eric(_at_)hilding(_dot_)com> writes:
> 1.  What is the correct way to put a specific "http://blah.blah" URL into
> the message body filter below?  In place of  |URL HERE|  would it
> be  |http\:\/\/blah\.blah|

Forward slashes are not special in regexps unless they are the
delimiting character, as is the default in perl, sed, and awk.  Since
they aren't the delimiter in procmail, they should not be escaped.
Indeed, the escaped forward slash, \/, represents the 'capture'
operator in procmail, causing the text matched by the rest of the
regexp to be saved in the MATCH variable.  There's more to it than
that, but the key thing to remember is that if you want to match a
forward slash, you enter a forward slash.
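
As a minimal sketch of that capture operator: the recipe below would save
the host part of a URL found in the body.  The variable name URLHOST and
the log text are made up for illustration, and the character class is a
simplified idea of what a host name may contain.

```procmail
# Sketch only: URLHOST is a hypothetical variable name.
# The two slashes after "http:" are ordinary characters; the \/ that
# follows them is the capture operator, so MATCH receives whatever
# [-a-z0-9.]+ matched.
:0 B
* http://\/[-a-z0-9.]+
{
  URLHOST="$MATCH"
  # The closing quote sits on its own line so the log entry ends in a newline.
  LOG="URL host found in body: $URLHOST
"
}
```

Only the backslashed slash is the operator there; the two plain slashes
before it are matched literally.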

So, to match "http://blah.blah", you would use "http://blah\.blah".
To be precise and to cite the procmail source code itself, the only
characters that need to be preceded by a backslash in order to
represent their literal value are the following:

        (|)*?+.^$[\

Anything else you can leave alone.
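
Put into recipes, that could look like the following sketch.  The folder
name is just a placeholder, and the second URL with its query string is
invented here purely to show '?' and '+' being escaped.  Either recipe
stands on its own.

```procmail
# Only the dot needs a backslash; the colon and slashes are ordinary.
:0 B
* http://blah\.blah
/possible/spam

# Hypothetical URL with a query string: ? and + are regexp operators,
# so they take a backslash; = and / do not.
:0 B
* http://blah\.blah/page\?id=1\+2
/possible/spam
```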

Oh yeah, if you want to match a NUL (character code zero), you have to
use an inverted character class containing the range of everything
else, control-A (character code one) to diaeresis-y (character code
255).  You're unlikely to ever need to do that, so I'm not going to
include an example (the 8bit characters would force this message to be
mimified, which would confuse the example).


> 2.  How would I code it to also filter on specific URL's which contain ANY
> number(s) ???

_Any_ numbers?  What about "http://www.3com.com/"?  Perhaps you mean _all_
numbers:

        :0 B
        * http://[0-9.]+([^-a-z0-9_.]|$)
        /possible/spam


Hmm.  When I wrote the site-wide spam filter at my last job I found
that IP-address URLs were too commonly used for legitimate (albeit
misguided) purposes to be completely banned.  Instead, I decided to ban
messages that contained bogo-IP-address URLs: URLs where the host part
contained fewer than four numeric components, or where one or more of
the components was written in hex or octal (leading 0x or 0).  It
seemed that only spammers use such addresses.  So I wrote the following
regexp.  I'll only explain it by saying that it's written the way it is
to minimize processing time by procmail: it's almost fully glommed to
minimize the number of 'branches' that must be considered/followed by
the regexp engine.  The indentation matches the structure, but it's
still a mess to read.

:0 B
* http://(([!$&-.0-;=a-z_~]+|%[0-9a-f][0-9a-f])*@)?\
    (0(0*[1-9][0-9]*|x[0-9a-f]+)\
     (\.(0x[0-9a-f]+|[0-9]+)\
      (\.(0x[0-9a-f]+|[0-9]+)(\.(0x[0-9a-f]+|[0-9]+))?)?)?\
    |(0+|[1-9][0-9]*)\
     (|\.(0(0*[1-9][0-9]*|x[0-9a-f]+)\
          (\.(0x[0-9a-f]+|[0-9]+)(\.(0x[0-9a-f]+|[0-9]+))?)?\
         |(0+|[1-9][0-9]*)\
          (|\.(0(0*[1-9][0-9]*|x[0-9a-f]+)\
               (\.(0x[0-9a-f]+|[0-9]+))?\
              |(0+|[1-9][0-9]*)(\.0(0*[1-9][0-9]*|x[0-9a-f]+))?)))))\
    [^!$-.0-;=a-z_~]
/possible/spam
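
As a reading aid, here is the same policy spelled out case by case instead
of glommed together.  Treat it as an untested sketch: it has not been run
against real mail, and it gives the regexp engine more branches to
consider, so it is presumably slower than the version above.

```procmail
# Sketch only.  Any numeric component is (0x[0-9a-f]+|[0-9]+); a "bogus"
# component is 0(0*[1-9][0-9]*|x[0-9a-f]+), i.e. leading-zero or 0x hex.
# The alternatives, in order:
#   1. one, two, or three numeric components
#   2. four components, the first one bogus
#   3. four components, the second one bogus
#   4. four components, the third one bogus
#   5. four components, the fourth one bogus
:0 B
* http://(([!$&-.0-;=a-z_~]+|%[0-9a-f][0-9a-f])*@)?\
    ((0x[0-9a-f]+|[0-9]+)\
       (\.(0x[0-9a-f]+|[0-9]+)(\.(0x[0-9a-f]+|[0-9]+))?)?\
    |0(0*[1-9][0-9]*|x[0-9a-f]+)\
       \.(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\
    |(0x[0-9a-f]+|[0-9]+)\.0(0*[1-9][0-9]*|x[0-9a-f]+)\
       \.(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\
    |(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\
       \.0(0*[1-9][0-9]*|x[0-9a-f]+)\.(0x[0-9a-f]+|[0-9]+)\
    |(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\.(0x[0-9a-f]+|[0-9]+)\
       \.0(0*[1-9][0-9]*|x[0-9a-f]+))\
    [^!$-.0-;=a-z_~]
/possible/spam
```

A host with four plain decimal components, as in http://192.168.0.1/,
fits none of the five cases and so gets through.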


Philip Guenther

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
