procmail

Re: recipe blocking mail with attachements

2005-06-21 13:42:42
At 14:11 2005-06-21 -0500, Damian Menscher wrote:
> Mind telling us what your rules are, and how many false positives you get? I use procmail to supplement spamassassin and clamav, but couldn't imagine using it alone. Your rules must be quite impressive.

As with some other posters here, I've posted bits and pieces to this list over the years, generally in response to someone's "how do I..." queries. It isn't all packaged up for general consumption, but the basis for it has been discussed frequently. As with SA, I tabulate a running score of how spammy a message is (which I call "SPAMMISHNESS"), and if that exceeds a threshold, the message is scuttled. I don't simply use individual characteristics as a "this is spam" flag (though I certainly have a few tests which score above the spam threshold, there are still a few things which provide an allowance to the scoring - say lists like procmail which discuss spam frequently).

The majority of my false positives have been on a couple of discussion lists which prune out the Received headers added before the list resends the message (which interferes with matching the author's domain in the message delivery history). This relates to the tweak I added recently to basically allow "for this list and this list, skip these tests", by using named conditions at the tests and checking for a match in a string of excluded tests:

# some lists strip the original Received headers from submissions before
# resending them, which causes some tests to bugger out.
# So, we define a SPAMSKIP string which names the tests to skip.
:0
* LISTNAME ?? ^^(mysql|php-general|bugtraq)^^
{
        SPAMSKIP=" MESSIDRCVD FROMDOMRCVD "
}

(LISTNAME is defined by a recipe I posted here long ago which programmatically extracts the listname from various headers - be it a sender, or RFC-based list header, etc)
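
A minimal sketch of the idea, for readers without that recipe to hand. It is not the recipe referred to above: it assumes the list sends either an RFC 2919 List-Id header (of the form "List-Id: <listname.example.org>") or a majordomo-style "owner-listname" Sender, and the fallback order is an assumption as well.

```procmail
# Sketch only: header choices and fallback order are assumptions.
# Start with an empty LISTNAME so a stale value can't leak in.
LISTNAME=""

# RFC 2919 header. Everything to the right of \/ is captured in MATCH,
# here the first label of the list id.
:0
* ^List-Id:.*<\/[^.>]+
{
        LISTNAME="$MATCH"
}

# Fallback: majordomo-style "Sender: owner-listname@host".
# "LISTNAME ?? ^^^^" is true only while LISTNAME is still empty.
:0
* LISTNAME ?? ^^^^
* ^Sender:.*owner-\/[^@>]+
{
        LISTNAME="$MATCH"
}
```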

# 20050425
# message-id host does not appear within the received header chain
:0
* $! SPAMSKIP ?? [      ]MESSIDRCVD\<
* $! ^Received:.*${MESSID_DOMAIN}
{
        SPAMVAL="+50"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} message-id domain not in received chain${NL}"
}
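
Since SPAMMISHNESS is built up as a string of "+N" terms, it has to be evaluated at some point before it can be compared with the threshold. A minimal sketch of that final step follows, assuming bc(1) is available to do the arithmetic. SPAMTOTAL, SPAMTHRESHOLD and the "spam" folder are illustrative names, and the real filter may well do this differently.

```procmail
# Sketch only: SPAMTOTAL, SPAMTHRESHOLD and the folder name are assumptions.
# Turn the accumulated "+50+100..." string into a number; the leading 0
# keeps the expression valid when nothing has been scored yet.
SPAMTOTAL=`echo "0${SPAMMISHNESS}" | bc`
SPAMTHRESHOLD=249

# The two weights sum to SPAMTOTAL - SPAMTHRESHOLD. A scored recipe
# matches only if the final score is positive, i.e. only when the
# total exceeds the threshold.
:0
* $ ${SPAMTOTAL}^0
* $ -${SPAMTHRESHOLD}^0
{
        SPAMNOTES="${SPAMNOTES}SPAM: spammishness exceeds threshold of ${SPAMTHRESHOLD}${NL}"
        # Assigning to LOG writes the notes to the logfile, not the message
        LOG="$SPAMNOTES"

        :0
        spam
}
```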

>> The irony of the regexp being used is that the body keywords will result in his *OWN* message to this list not arriving back in his inbox.

> Yes. In fact, I'm impressed that you managed to receive it, given that you must filter on some of those words also.

But I don't filter based on crude constructs. I also have an allowance for the procmail list, because spam is discussed here frequently. 90% or more of my spam filtering involves characteristics in the HEADERS of messages, not the body.

In fact, his message tripped absolutely nothing, although the message I received immediately after it in my inbox tripped a number of conditions - including a catch-all for "too many" conditions (which allows for escalating something to spam even if the individual characteristics score low):

List: procmail
From procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE  Tue Jun 21 03:59:39 2005
 Subject: recipe blocking mail with attachements
  Folder: gzip -9fc >> procmail.gz                                       4782

SPAM: +30 Advisory - may be forged warning
SPAM: +100 non-list From and envelope differ
SPAM: +50 message-id domain does not match sender domain
SPAM: +35 from_domain not found in received chain
SPAM: +50 allcaps subject
SPAM: +(249*2) Blank To/From
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +249+165 Subject Scoring match 165
SPAM: +5 spam type statements (5)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 1631
SPAM: spammishness exceeds threshold of 249
SPAM: Apparent recipient is **DELETED**(_at_)mail(_dot_)professional(_dot_)org
INFO: SpamFilter v03.11.00  SBS  20050425/1552
From gatexs01(_at_)gatexs(_dot_)nl  Tue Jun 21 04:01:30 2005
 Subject: CONGRATULATIONS LOTTERY WINNER
  Folder:  gzip -9fc >> spam.gz                                          3284

(this data is in the logfile, not inserted into the messages)
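
The "Abundance of triggers" entry in the log above is the catch-all mentioned earlier. One way to sketch such a test is to let procmail's weighted scoring count the "SPAM: +" lines already accumulated in SPAMNOTES. The limit of 7 triggers is a made-up figure, and the +249 simply mirrors the log entry; neither is necessarily what the real filter uses.

```procmail
# Sketch only: the trigger limit (7) is an assumption.
# 1^1 adds 1 for every match of the regexp, so the first condition
# counts the scored notes; -7^0 then subtracts the limit. The recipe
# matches when the result is positive, i.e. with 8 or more triggers.
:0
* 1^1 SPAMNOTES ?? ^SPAM: \+
* -7^0
{
        SPAMVAL="+249"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Abundance of triggers${NL}"
}
```

Such a recipe would need to run after all the individual tests, so that SPAMNOTES is complete by the time the lines are counted.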

> Note that it got a bayes score of 0 even though it had lots of "bad" words in it. Bayes scoring is very nice, which is why I'm so curious how you managed to get by without it.

By having well constructed scoring rules of my own. I have just three recipes which involve the body - one is my Nigerian scam recipe (which is quite effective), and the other two relate to opt-out claims. My one recipe which uses simple keywords is actually restricted to the SUBJECT, and scores very low on mundane spammish words (identity, logo, etc) and higher for more spammy words (aphrodisiac), and even higher on combinations of spammy words. Because that particular recipe requires that the recipe score above a certain level before it considers itself matched, the message has to have a fair number of low-grade matches, or a few high grade ones. This only happens when the message is loaded with crap, and unless it is REALLY loaded, it's not going to contribute a whole lot to the overall spammishness. Header tests remain the most effective method of identifying spam.
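
The subject scoring described above maps fairly directly onto procmail's weighted conditions. A rough sketch follows. The weights, the minimum level of 100 and the combination pattern are invented for illustration; only the example words come from the description, and SUBJSCORE is a hypothetical variable name.

```procmail
# Sketch only: all of the numbers here are assumptions.
# -100^0 is the level the keyword hits must overcome; the recipe
# matches only if the final score is positive. With these weights no
# single keyword is enough, but "identity" plus "logo" (20+20+70)
# or "aphrodisiac" plus one mundane word (90+20) is.
:0
* -100^0
* 20^0 ^Subject:.*identity
* 20^0 ^Subject:.*logo
* 90^0 ^Subject:.*aphrodisiac
* 70^0 ^Subject:.*(identity.*logo|logo.*identity)
{
        # $= expands to the score of the most recent recipe
        SUBJSCORE=$=
        SPAMVAL="+${SUBJSCORE}"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Subject Scoring match ${SUBJSCORE}${NL}"
}
```

Note that SUBJSCORE here is the margin above the minimum level, not the raw sum of the keyword weights.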

Stats thus far for this month show about 1 in 10 received messages is spam, which represents about 1/5 of the total received bytes. This does not include malware, which is shed at a different stage, nor does it account for the MANY messages which are completely avoided due to DNSBLs at the SMTP level.

Yesterday, there were 45 spams blocked from my inbox via my procmail filters.

I receive a daily logfile excerpt showing data about the messages which were categorized as spew, and it takes but a minute to scroll through it to check for false pozzies. Since my tweaks for the header-trimming lists, false pozzies have dropped to ZERO, but one week is a rather short timeframe for a real figure there.

You should check my "spewhosts" and "furrin" recipes sometime (both are linked from my meagre procmail pages). ebay and paypal phishing expeditions are stopped dead in their tracks because the phishers are too lazy to even attempt to forge the messages in a qwality fashion...

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
