procmail

Re: Recipe to match within text - need help

2003-05-30 16:32:07
Those graphic validation systems are Section 508 violations.  They're also
not even necessary for a situation like this.  What could be put up instead
is a security table: a table of strings made up of words, numbers, or
combinations of both, with more than one column and more than one row.
Then a security rule gets sent.  It might be "type the third entry from the
top of column 3 and the 2nd entry from column 5."  The security rules and
security tables should change regularly.  Something like that would defeat
the automated submissions and not violate Section 508.

On Fri, 30 May 2003, Professional Software Engineering wrote:

At 18:15 2003-05-30 +0000, Cyndi Norman wrote:
Hi all.  I'm having trouble coming up with a recipe for what I need and I
was hoping someone here could help.  I've studied examples and FAQ pages
but can't get anywhere.

I have an HTML classified ad submission form on my website (see:
http://www.immuneweb.org/classifieds/ ) The submissions come through as
email and I put up the ads by hand.  Unfortunately, spammers have gotten
ahold of the specs and I get 100-200 spam submissions every day.  No that's
not a typo.  I get 5-10 legit ads per week.

Suggestion: put in a graphic humanoid confirmation mechanism on your
submission page.  You'd have a cgi that emits a graphic with a picture of a
word or number, and the visitor would need to type that into one of the
form fields before submitting.  That should just about eliminate automated
submissions to the form (well, they can submit all they want, but the form
would reject them back to the page asking for a reconfirmation - with a NEW
graphic).  Be sure that your page source and the URL don't contain the
keyword - use a hash (or better yet, something that only the backend can
use to do a lookup).

FTR, this is fairly easy to accomplish in PHP.

These graphic validation mechanisms are in use by some freemail services to
reduce automated abuses.  One example which comes to mind is
<http://www.yahoo.co.uk/> - start the process for creating a new mail
account, and when you get to the page which asks for details, scroll to the
bottom and you'll see a graphic (unless you're running a filter which
happens to block out the graphic because it looks like an advert <g>).

Also, avoid placing an email address into your webpage - say, if there's a
"hidden" field for your form that is used to specify the address to which
the form data is subsequently emailed.  At a minimum, you can encode that
and your webform can decode it.  Ordinal encoding has worked extremely well
for me over the years, even in regular mailto links.

Once you've eliminated automated abuse of the webform, you can either
change the email address (if spammers are actually emailing directly to
it), or have the webform insert a simple header with a secret keyword
(which would be the quickest thing to check against - anything not
including that keyword wasn't sent to you through the webform, which at
this point shouldn't be too easy to abuse).
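
A minimal sketch of that secret-header check, assuming the webform adds a
header named X-Form-Token carrying the keyword s3cr3t-keyword, that
submissions arrive with the classifieds address in To:, and that rejects
go to a mailbox called not-from-webform.  The header name, keyword, and
mailbox name are hypothetical placeholders, not anything the form does today:

```procmail
# Hypothetical names: X-Form-Token, s3cr3t-keyword, not-from-webform.
# Anything addressed to the classifieds address that lacks the header
# the webform inserts did not come through the webform.
:0:
* ^TO_classifieds@immuneweb\.org
* ! ^X-Form-Token: s3cr3t-keyword
not-from-webform
```

Both conditions look only at the headers, so the check stays cheap even
when a large spam message arrives.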

Since there are only a dozen or so key words in the text of these emails
that I need to ID 95% of them (without ever getting a false positive), I
thought I would use procmail to sort the spam ones into a folder.  Later,
when I'm convinced the code is right, I'll sort them into the trash.  The
fact that I never ever actually post any of these ads hasn't slowed the
bastards down.

Here's an example of one of the emails:

I've got to wonder if perhaps you've "tweaked" the message headers.

Anything originating from your webserver (versus the outside world) should
have a Received: line which identifies it, such as:

Received: (from nobody(_at_)localhost)
         by yourwebserver.tld (8.12.9/8.12.9/Submit) id h4UHHahm067439;
         Fri, 30 May 2003 19:17:36 +0200 (CEST)

or something to that effect.  You could check for this type of header like so:

:0
* ^Received:\<*from nobody@localhost\>*by yourwebserver\.tld

(of course, any radical changes in your web or email config could
disrupt this)

Return-Path: <anonymous(_at_)immuneweb(_dot_)org>

The bulk of your directly-emailed spam (versus abuses of the webform
itself) will not contain this Return-Path.  Occasionally though, spammers
forge their messages to be from yourself.

* ^Return-Path:[        ]<anonymous@immuneweb\.org>

Message-ID: <20030530174518(_dot_)3664(_dot_)qmail(_at_)immuneweb(_dot_)org>

Some spams will have your domain in the Message-ID, but many will not.

* ^Message-ID:.*@immuneweb\.org>

Subject: Classified Ad Submission

Your own webform submission subject would be a useful baseline check:

* ^Subject:[    ]*Classified Ad Submission


to = classifieds(_at_)immuneweb(_dot_)org
subject = Noncommerical Classified Submission
form = http://www.immuneweb.org/classifieds/submitnoncom.html
admin = classifieds(_at_)immuneweb(_dot_)org
Background Info =
Real Name = Devika Rani
Real Email = devika_opps2003(_at_)yahoo(_dot_)co(_dot_)in
Ad Information =
Subcategory = Employment Offered
new category =

You can check for these in the less efficient (at least, when you get
hammered with a large spam message) body check:

* B ?? ^(to =)
* B ?? ^(subject =)
* B ?? ^(form =)
* B ?? ^(admin =)
(etc)

Messages NOT containing these characteristics didn't _really_ come through
your webform.
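
Put together, the subject and body checks above could be combined roughly
like this.  It is only a sketch: the mailbox names classifieds and
forged-submissions are hypothetical, and the field list should match
whatever your form really emits:

```procmail
# Only look at mail claiming to be a form submission.
:0
* ^Subject:.*Classified Ad Submission
{
        # Genuine webform output: every listed body field is present.
        # (Conditions within one recipe are ANDed.)
        :0:
        * B ?? ^to =
        * B ?? ^subject =
        * B ?? ^form =
        * B ?? ^admin =
        classifieds

        # Anything else only claims to be a submission.
        :0:
        forged-submissions
}
```

The first inner recipe is a delivering recipe, so a message that matches it
never reaches the second one.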

2 = Begin Text of Ad
Text of Ad = Finally! A Real Work @ Home Opportunity has arrived! Now you
can become an Independent Typist with Ad-Placer.com. We offer home workers

Ur, was this intended to be an example of a LEGIT posting to your service,
or a spam?

If this is an example of the abuse of your webform, then it is imperative
that you add a human confirmation method to avoid this type of abuse.

:0:
* ^From: anonymous@immuneweb.org
$HOME/spambouncer/blocked/classifieds

This puts *all* the classified submissions into that folder.

'cept any that are abuses of the webform.


Now I'd like to run text matching on those emails that already match the
initial statement (from anonymous(_at_)immuneweb(_dot_)org).  But I can't
figure it out.

:0
* ^From: anonymous@immuneweb.org
{
         # check for stuff you don't want, and /dev/null it
         # (or deliver it to a trash mailbox, at first)
         :0
         * bad stuff
         /dev/null

         # fall through to "ok"
         :0:
         $HOME/spambouncer/blocked/classifieds
}



  Everything I do causes the match to fail.  I am trying to match with
"Ad-Placer" in the text and ran several tests to no avail.


:0
* B ?? Ad-Placer
junk.mbx



  I'm sure this
is easy but I just can't figure it out.

I tried:

:0:
* ^From: anonymous@immuneweb.org
* .*Ad-Placer

This won't search the BODY - see the flags described in 'man
procmailrc'.  By default, the condition expressions search the HEADERS only
(which is faster, since they're fairly small in size, even if the message
is huge).  You can either add 'HB' to the flags line (before the colon), or
use the 'B ?? text' syntax to explicitly match against the body content
(see elsewhere in this reply).
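
For illustration, here are the two alternatives side by side, using the
junk.mbx mailbox from earlier in this reply.  Treat them as sketches to
test against saved messages, not as drop-in recipes:

```procmail
# Alternative 1: HB on the flags line makes every condition in the
# recipe search both the header and the body.
:0 HB:
* ^From: anonymous@immuneweb\.org
* Ad-Placer
junk.mbx

# Alternative 2: leave the default (header) in place and point just
# the one condition at the body.
:0:
* ^From: anonymous@immuneweb\.org
* B ?? Ad-Placer
junk.mbx
```

With HB, the ^From: condition can also be satisfied by a matching line in
the body, so the second form is the more precise of the two.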

I would like to be able to eventually have a recipe that looks like this:

:0:
* ^From: anonymous@immuneweb.org
and a text match on any of the following: a b c d e
Go to dev/null

No need to specify the lockfile flag for output to /dev/null.  I don't
think anyone will mind if the output is out of order <g>.

:0
* ^From: anonymous@immuneweb\.org
* 9876543210^0 B ?? some_text_in_the_body_a
* 9876543210^0 B ?? some_text_in_the_body_b
* 9876543210^0 B ?? some_text_in_the_body_d
* 9876543210^0 B ?? some_text_in_the_body_e
/dev/null

The scoring - see 'man procmailsc' - allows procmail to do an OR condition
across multiple condition lines, making it easier to have independent
condition lines.
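
For the trial period before anything goes to /dev/null, the same scoring
approach could deliver to a mailbox instead.  A sketch, where spam-trial is
a hypothetical mailbox name; because this writes to a real file, the
lockfile flag (the trailing colon) comes back:

```procmail
# Each scored line that matches adds its weight to the recipe's score;
# the ^0 exponent means only the first match of that line is counted.
# The plain ^From: condition must still match on its own, and the
# action runs only if the final score is positive.
:0:
* ^From: anonymous@immuneweb\.org
* 9876543210^0 B ?? some_text_in_the_body_a
* 9876543210^0 B ?? some_text_in_the_body_b
spam-trial
```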

or:

:0
* ^From: anonymous@immuneweb\.org
* B ?? (some_text_in_the_body_a|some_text_in_the_body_b|\
         some_text_in_the_body_d|some_text_in_the_body_e)
/dev/null

IMO, it's easier to manage the one with scoring if you expect to revisit it
from time to time (though, if you fix the problem at the front end, this
might all be moot).


---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail


