Re: Recipe to match within text - need help

2003-05-30 13:44:50
At 18:15 2003-05-30 +0000, Cyndi Norman wrote:
> Hi all.  I'm having trouble coming up with a recipe for what I need and I
> was hoping someone here could help.  I've studied examples and FAQ pages
> but can't get anywhere.

> I have an HTML classified ad submission form on my website (see:
> http://www.immuneweb.org/classifieds/ ) The submissions come through as
> email and I put up the ads by hand.  Unfortunately, spammers have gotten
> ahold of the specs and I get 100-200 spam submissions every day.  No that's
> not a typo.  I get 5-10 legit ads per week.

Suggestion: add a graphical human-verification step to your submission page. You'd have a CGI that emits a graphic showing a word or number, and the visitor would need to type that into one of the form fields before submitting. That should just about eliminate automated submissions to the form (well, they can submit all they want, but the form would bounce them back to the page asking for a reconfirmation, with a NEW graphic). Be sure that neither your page source nor the URL contains the keyword itself: use a hash (or better yet, something that only the backend can use to do a lookup).
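As a rough illustration of the "use a hash" idea (in Python rather than PHP, purely as a sketch), the page can carry a keyed hash of the expected word instead of the word itself. The names SECRET_KEY, MAX_AGE_SECONDS, make_challenge and check_answer are made up for this example, and the ten-minute lifetime is an assumption:

```python
import hashlib
import hmac
import secrets
import time

# Hypothetical server-side secret; it is never sent to the browser.
SECRET_KEY = secrets.token_bytes(32)
MAX_AGE_SECONDS = 600  # assumed lifetime of one challenge


def _sign(word, issued_at):
    """Keyed hash of the expected word plus the time the challenge was issued."""
    msg = f"{word.lower()}|{issued_at}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


def make_challenge(word, now=None):
    """Return (issued_at, token) to embed as hidden form fields.

    The word itself appears only inside the generated graphic.
    """
    issued_at = int(time.time() if now is None else now)
    return issued_at, _sign(word, issued_at)


def check_answer(typed, issued_at, token, now=None):
    """True if the typed word matches the token and the challenge is still fresh."""
    now = time.time() if now is None else now
    if now - issued_at > MAX_AGE_SECONDS:
        return False
    expected = _sign(typed.strip(), issued_at)
    return hmac.compare_digest(expected, token)
```

A token like this can still be replayed until it expires, which is one reason a backend-only lookup (issue a random ID, keep the word on the server, delete it after one use) is the better option.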

FTR, this is fairly easy to accomplish in PHP.

These graphic validation mechanisms are in use by some freemail services to reduce automated abuses. One example which comes to mind is <http://www.yahoo.co.uk/> - start the process for creating a new mail account, and when you get to the page which asks for details, scroll to the bottom and you'll see a graphic (unless you're running a filter which happens to block out the graphic because it looks like an advert <g>).

Also, avoid placing an email address into your webpage - say, if there's a "hidden" field for your form that is used to specify the address to which the form data is subsequently emailed. At a minimum, you can encode that and your webform can decode it. Ordinal encoding has worked extremely well for me over the years, even in regular mailto links.
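For illustration, assuming "ordinal encoding" here means writing each character as an HTML numeric character reference, a minimal Python sketch could look like this (the function names and the example.org address are placeholders, not anything from the real form):

```python
import html


def ordinal_encode(text):
    """Write every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def ordinal_decode(encoded):
    """Reverse the encoding, as a browser (or the form backend) would."""
    return html.unescape(encoded)


if __name__ == "__main__":
    address = "classifieds@example.org"  # placeholder address
    print(ordinal_encode(address))
```

A browser renders the encoded string as the plain address, while a harvester scanning the raw page source for something shaped like user@domain finds nothing to match.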

Once you've eliminated automated abuse of the webform, you can either change the email address (if spammers are actually emailing directly to it), or have the webform insert a simple header with a secret keyword (which would be the quickest thing to check against - anything not including that keyword wasn't sent to you through the webform, which at this point shouldn't be too easy to abuse).
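A sketch of the secret-header idea on the form side, again in Python for illustration only. The header name X-Webform-Key, its value, and the example.org addresses are all made up; the point is just that the keyword lives in the backend script and never in the HTML:

```python
from email.message import EmailMessage

SECRET_HEADER = "X-Webform-Key"    # hypothetical header name
SECRET_VALUE = "change-me-7f3a9c"  # hypothetical keyword, kept server-side only


def build_submission(fields):
    """Build the mail the form backend would send, tagged with the secret header."""
    msg = EmailMessage()
    msg["From"] = "anonymous@example.org"
    msg["To"] = "classifieds@example.org"
    msg["Subject"] = "Classified Ad Submission"
    msg[SECRET_HEADER] = SECRET_VALUE
    msg.set_content("\n".join(f"{key} = {value}" for key, value in fields.items()))
    return msg


def came_through_form(msg):
    """True only when the message carries the expected secret header."""
    return msg.get(SECRET_HEADER) == SECRET_VALUE
```

On the receiving side this becomes a single header condition in procmail, which is about as cheap a test as you can run.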

> Since there are only a dozen or so key words in the text of these emails
> that I need to ID 95% of them (without ever getting a false positive), I
> thought I would use procmail to sort the spam ones into a folder.  Later,
> when I'm convinced the code is right, I'll sort them into the trash.  The
> fact that I never ever actually post any of these ads hasn't slowed the
> bastards down.
>
> Here's an example of one of the emails:

I've got to wonder if perhaps you've "tweaked" the message headers.

Anything originating from your webserver (versus the outside world) should have a Received: line which identifies it as such, for example:

Received: (from nobody(_at_)localhost)
        by yourwebserver.tld (8.12.9/8.12.9/Submit) id h4UHHahm067439;
        Fri, 30 May 2003 19:17:36 +0200 (CEST)

or something to that effect.  You could check for this type of header like so:

:0
* ^Received:\<*from nobody@localhost\>*by yourwebserver\.tld

(of course, any radical changes in your web or email config could disrupt this)

> Return-Path: <anonymous(_at_)immuneweb(_dot_)org>

The bulk of your directly-emailed spam (versus abuses of the webform itself) will not contain this. Occasionally, though, spammers forge their messages to appear to come from you.

* ^Return-Path:[        ]<anonymous@immuneweb\.org>

> Message-ID: <20030530174518(_dot_)3664(_dot_)qmail(_at_)immuneweb(_dot_)org>

Some spams will have your domain in the Message-ID, but many will not.

* ^Message-ID:.*@immuneweb\.org>

> Subject: Classified Ad Submission

Your own webform submission subject would be a useful baseline check:

* ^Subject:[    ]*Classified Ad Submission


> to = classifieds(_at_)immuneweb(_dot_)org
> subject = Noncommerical Classified Submission
> form = http://www.immuneweb.org/classifieds/submitnoncom.html
> admin = classifieds(_at_)immuneweb(_dot_)org
> Background Info =
> Real Name = Devika Rani
> Real Email = devika_opps2003(_at_)yahoo(_dot_)co(_dot_)in
> Ad Information =
> Subcategory = Employment Offered
> new category =

You can check for these in the less efficient (at least, when you get hammered with a large spam message) body check:

* B ?? ^(to =)
* B ?? ^(subject =)
* B ?? ^(form =)
* B ?? ^(admin =)
(etc)

Messages NOT containing these characteristics didn't _really_ come through your webform.

> 2 = Begin Text of Ad
> Text of Ad = Finally! A Real Work @ Home Opportunity has arrived! Now you
> can become an Independent Typist with Ad-Placer.com. We offer home workers

Ur, was this intended to be an example of a LEGIT posting to your service, or a spam?

If this is an example of the abuse of your webform, then it is imperative that you add a human confirmation method to avoid this type of abuse.

> :0:
> * ^From: anonymous(_at_)immuneweb(_dot_)org
> $HOME/spambouncer/blocked/classifieds
>
> This puts *all* the classified submissions into that folder.

'cept any that are abuses of the webform.


> Now I'd like to run text matching on those emails that already match the
> initial statement (from anonymous(_at_)immuneweb(_dot_)org).  But I can't
> figure it out.

:0
* ^From: anonymous@immuneweb\.org
{
        # check for stuff you don't want, and /dev/null it
        # (or deliver to a trash mailbox, at first)
        :0
        * bad stuff
        /dev/null

        # fall through to "ok"
        :0:
        $HOME/spambouncer/blocked/classifieds
}



> Everything I do causes the match to fail.  I am trying to match with
> "Ad-Placer" in the text and ran several tests to no avail.


:0:
* B ?? Ad-Placer
junk.mbx



> I'm sure this is easy but I just can't figure it out.
>
> I tried:
>
> :0:
> * ^From: anonymous(_at_)immuneweb(_dot_)org
> * .*Ad-Placer

This won't search the BODY - see the flags described in 'man procmailrc'. By default, the condition expressions search the HEADERS only (which is faster, since they're fairly small in size, even if the message is huge). You can either add 'HB' to the flags on the recipe's first line (after the ':0' and before the lockfile colon, e.g. ':0 HB:'), or use the 'B ?? text' syntax to explicitly match against the body content (see elsewhere in this reply).
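To make the header/body distinction concrete, here is a small Python imitation of which part of a message a condition gets to see. It is only a sketch, not procmail's actual matching code: the RAW sample and the matches() helper are invented for the example, and case-insensitive matching is assumed because that is procmail's default.

```python
import re

# Invented sample message: two headers, a blank line, then the body.
RAW = """\
From: anonymous@example.org
Subject: Classified Ad Submission

Text of Ad = Become an Independent Typist with Ad-Placer.com.
"""


def split_message(raw):
    """Split at the first blank line into (header, body)."""
    header, _, body = raw.partition("\n\n")
    return header, body


def matches(pattern, raw, area="H"):
    """Search the header ('H', the default), the body ('B'), or both ('HB')."""
    header, body = split_message(raw)
    parts = {"H": header, "B": body, "HB": header + "\n\n" + body}
    return re.search(pattern, parts[area], re.MULTILINE | re.IGNORECASE) is not None


if __name__ == "__main__":
    print(matches("Ad-Placer", RAW))        # header only
    print(matches("Ad-Placer", RAW, "B"))   # body only
```

With the default area the "Ad-Placer" search never looks past the headers, which is why the recipe as tried never fires.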

> I would like to be able to eventually have a recipe that looks like this:
>
> :0:
> * ^From: anonymous(_at_)immuneweb(_dot_)org
> and a text match on any of the following: a b c d e
> Go to dev/null

No need to specify the lockfile flag for output to /dev/null. I don't think anyone will mind if the output is out of order <g>.

:0
* ^From: anonymous@immuneweb\.org
* 9876543210^0 B ?? some_text_in_the_body_a
* 9876543210^0 B ?? some_text_in_the_body_b
* 9876543210^0 B ?? some_text_in_the_body_d
* 9876543210^0 B ?? some_text_in_the_body_e
/dev/null

The scoring - see 'man procmailsc' - lets procmail do an OR across multiple condition lines: any one matching scored line makes the total positive, so each keyword can live on its own independent condition line.
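A simplified Python model of that behaviour, for illustration only: it assumes a recipe succeeds when every ordinary condition matches and the total score ends up positive, and it ignores procmail's special handling of very large scores. recipe_matches and its arguments are invented names.

```python
import re

FLAGS = re.MULTILINE | re.IGNORECASE


def recipe_matches(header, body, header_conditions, scored_body_conditions):
    """Ordinary conditions must all match; scored ones only need a positive total."""
    if not all(re.search(pattern, header, FLAGS) for pattern in header_conditions):
        return False
    score = 0
    for weight, pattern in scored_body_conditions:
        # With an exponent of 0, only the first match of a pattern adds its weight.
        if re.search(pattern, body, FLAGS):
            score += weight
    return score > 0


if __name__ == "__main__":
    scored = [(9876543210, "keyword_a"), (9876543210, "keyword_b")]
    header = "From: anonymous@example.org"
    print(recipe_matches(header, "text with keyword_b in it",
                         [r"^From: anonymous@example\.org"], scored))
```

The From: line still acts as an AND on top of the scored lines, so mail from other senders is never touched.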

or:

:0
* ^From: anonymous@immuneweb\.org
* B ?? (some_text_in_the_body_a|some_text_in_the_body_b|\
        some_text_in_the_body_d|some_text_in_the_body_e)
/dev/null

IMO, it's easier to manage the one with scoring if you expect to revisit it from time to time (though, if you fix the problem at the front end, this might all be moot).


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail