Re: recipe blocking mail with attachements

At 14:11 2005-06-21 -0500, Damian Menscher wrote:

> Mind telling us what your rules are, and how many false positives you get? I use procmail to supplement spamassassin and clamav, but couldn't imagine using it alone. Your rules must be quite impressive.

As with some other posters here, I've posted bits and pieces to this list over the years, generally in response to someone's "how do I..." queries. It isn't all packaged up for general consumption, but the basis for it has been discussed frequently. As with SA, I tabulate a running score of how spammy a message is (which I call "SPAMMISHNESS"), and if that exceeds a threshold, the message is scuttled. I don't simply use individual characteristics as a "this is spam" flag. I certainly have a few tests which score above the spam threshold on their own, but there are still a few things which provide an allowance to the scoring, say for lists like procmail which discuss spam frequently.
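For anyone who wants the idea without the procmail syntax, here is a minimal Python sketch of the running-score approach. It is a model of the logic only, not my actual recipes: the test names and the -100 list allowance are made-up values for illustration, and 249 is the threshold that shows up in the log excerpt later in this message.

```python
# Hypothetical model of a running "spammishness" score.
# Test names and the -100 allowance are illustrative assumptions.
THRESHOLD = 249  # threshold reported in the log excerpt in this message


def classify(adjustments, threshold=THRESHOLD):
    """Sum signed score adjustments; spam only if the total exceeds the threshold."""
    total = sum(value for _name, value in adjustments)
    return total, total > threshold


hits = [
    ("allcaps subject", 50),
    ("relay looks like consumer dialup", 125),
    ("discussion list allowance", -100),  # allowances are negative scores
]
score, is_spam = classify(hits)
print(score, is_spam)
```

No single entry decides the outcome; an allowance simply pulls the total down so that a list which discusses spam has more headroom before it crosses the threshold.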

The majority of my false positives have been on a couple of discussion lists which prune out the Received headers added prior to the list's resending of the message (which interferes with matching the author's domain in the message delivery history). This relates to the tweak I added recently to basically allow "for this list and this list, skip these tests", by using named conditions at the tests and checking for a match in a string of excluded tests:


# some lists strip the original Received headers from submissions before
# resending them, which causes some tests to bugger out.
# So, we define a SPAMSKIP string which sets tests which we skip.
:0
* LISTNAME ?? ^^(mysql|php-general|bugtraq)^^
{
        SPAMSKIP=" MESSIDRCVD FROMDOMRCVD "
}

(LISTNAME is defined by a recipe I posted here long ago which programmatically extracts the listname from various headers - be it a sender, or RFC-based list header, etc.)


# 20050425
# message-id host does not appear within the received header chain
:0
* $! SPAMSKIP ?? [      ]MESSIDRCVD\<
* $! ^Received:.*${MESSID_DOMAIN}
{
        SPAMVAL="+50"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} message-id domain not in received chain${NL}"
}
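The skip mechanism in the two recipes above boils down to a token lookup. Here is a rough Python sketch of that logic, under the assumption that a test name counts as skipped only when it appears as a whole whitespace-delimited word in SPAMSKIP. The function names are hypothetical, and this is a model rather than a translation of the procmail matching rules.

```python
# List names taken from the LISTNAME recipe above.
SKIP_LISTS = {"mysql", "php-general", "bugtraq"}


def spamskip_for(listname):
    """Return the SPAMSKIP string for a list (empty when nothing is skipped)."""
    # procmail matches case-insensitively by default, hence lower().
    if listname.lower() in SKIP_LISTS:
        return " MESSIDRCVD FROMDOMRCVD "
    return ""


def should_run(test_name, spamskip):
    """A named test runs unless its name is a whole token in SPAMSKIP."""
    return test_name not in spamskip.split()


print(should_run("MESSIDRCVD", spamskip_for("bugtraq")))   # test is skipped
print(should_run("MESSIDRCVD", spamskip_for("procmail")))  # test runs
```

Matching whole tokens matters: a test named MESSID must not be skipped just because MESSIDRCVD is in the string, which is what the leading whitespace and trailing `\<` in the recipe condition are there for.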

> > The irony of the regexp being used is that the body keywords will result in his *OWN* message to this list not arriving back in his inbox.
> Yes. In fact, I'm impressed that you managed to receive it, given that you must filter on some of those words also.

But, I don't filter based on crude constructs. I also have an allowance for the procmail list because spam is discussed here frequently. 90% or more of my spam filtering involves characteristics in the HEADERS of messages, not the body.

In fact, his message tripped absolutely nothing, although the message I received immediately after it in my inbox tripped a number of conditions - including a catch-all for "too many" conditions (which allows for escalating something to spam even if the individual characteristics score low):


List: procmail
From procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE  Tue Jun 21 03:59:39 2005
 Subject: recipe blocking mail with attachements
  Folder: gzip -9fc >> procmail.gz                                       4782

SPAM: +30 Advisory - may be forged warning
SPAM: +100 non-list From and envelope differ
SPAM: +50 message-id domain does not match sender domain
SPAM: +35 from_domain not found in received chain
SPAM: +50 allcaps subject
SPAM: +(249*2) Blank To/From
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +249+165 Subject Scoring match 165
SPAM: +5 spam type statements (5)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 1631
SPAM: spammishness exceeds threshold of 249
SPAM: Apparent recipient is **DELETED**(_at_)mail(_dot_)professional(_dot_)org
INFO: SpamFilter v03.11.00  SBS  20050425/1552
From gatexs01(_at_)gatexs(_dot_)nl  Tue Jun 21 04:01:30 2005
 Subject: CONGRATULATIONS LOTTERY WINNER
  Folder:  gzip -9fc >> spam.gz                                          3284

(this data is in the logfile, not inserted into the messages)
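As a check on the arithmetic, the scored notes in that excerpt do add up to the reported 1631. The Python sketch below pulls the score expression off each "SPAM: +..." line and sums them. The parsing approach (a small whitelist of arithmetic nodes via the ast module) is my own assumption for this illustration and says nothing about how the filter itself evaluates the expression.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def _eval(node):
    """Evaluate integer +, - and * only; reject anything else."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = _eval(node.operand)
        return value if isinstance(node.op, ast.UAdd) else -value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def total_score(log_lines):
    """Sum the leading score expression of every scored SPAM: note."""
    total = 0
    for line in log_lines:
        match = re.match(r"SPAM: ([-+][-+*()\d]*)\s", line)
        if match:  # advisory lines without a score are ignored
            total += _eval(ast.parse(match.group(1), mode="eval").body)
    return total


LOG_NOTES = [
    "SPAM: +30 Advisory - may be forged warning",
    "SPAM: +100 non-list From and envelope differ",
    "SPAM: +50 message-id domain does not match sender domain",
    "SPAM: +35 from_domain not found in received chain",
    "SPAM: +50 allcaps subject",
    "SPAM: +(249*2) Blank To/From",
    "SPAM: +75 no non-list cleartext recipient matching X-Envelope-To",
    "SPAM: +125 relay hostname appears to be consumer dialup/broadband",
    "SPAM: +249+165 Subject Scoring match 165",
    "SPAM: +5 spam type statements (5)",
    "SPAM: +249 Abundance of triggers",
    "SPAM: Advisory - spammishness is 1631",
]
print(total_score(LOG_NOTES))  # 1631
```

Three of those entries (Blank To/From, the Subject scoring, and the abundance of triggers) each reach or exceed the 249 threshold by themselves; the rest only matter in combination.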

> Note that it got a bayes score of 0 even though it had lots of "bad" words in it. Bayes scoring is very nice, which is why I'm so curious how you managed to get by without it.

By having well constructed scoring rules of my own. I have just three recipes which involve the body - one is my Nigerian scam recipe (which is quite effective), and the other two relate to opt-out claims. My one recipe which uses simple keywords is actually restricted to the SUBJECT, and scores very low on mundane spammish words (identity, logo, etc) and higher for more spammy words (aphrodisiac), and even higher on combinations of spammy words. Because that particular recipe requires that the recipe score above a certain level before it considers itself matched, the message has to have a fair number of low-grade matches, or a few high-grade ones. This only happens when the message is loaded with crap, and unless it is REALLY loaded, it's not going to contribute a whole lot to the overall spammishness. Header tests remain the most effective method of identifying spam.
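To make the shape of that SUBJECT recipe concrete, here is a small Python sketch. Every number in it is invented for illustration: the weights, the combination bonus, and the minimum match level are assumptions, and only the three example words come from the description above.

```python
import re

# Illustrative weights only; the real recipe's words and values are not shown here.
WEIGHTS = {"identity": 10, "logo": 10, "aphrodisiac": 40}
# Hypothetical bonus for a combination of words appearing together.
COMBOS = {frozenset({"aphrodisiac", "identity"}): 30}
MIN_MATCH = 60  # assumed level the recipe must reach before it counts at all


def subject_score(subject):
    """Return the keyword score for a subject, or 0 if it stays under MIN_MATCH."""
    words = set(re.findall(r"[a-z]+", subject.lower()))
    score = sum(weight for word, weight in WEIGHTS.items() if word in words)
    score += sum(bonus for combo, bonus in COMBOS.items() if combo <= words)
    return score if score >= MIN_MATCH else 0


print(subject_score("Your new logo"))              # one mundane word: no match
print(subject_score("Aphrodisiac logo identity"))  # loaded subject: matches
```

The inner minimum is the point of the design: a single mundane word, or even a single spammy one, contributes nothing, so only a subject that is loaded with such words adds to the overall spammishness.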

Stats thus far for this month show about 1 in 10 received messages is spam, which represents about 1/5 of the total received bytes. This does not include malware which is shed at a different stage, nor does it account for the MANY messages which are completely avoided due to DNSBLs at the SMTP level.


Yesterday, there were 45 spams blocked from my inbox via my procmail filters.

I receive a daily logfile excerpt showing data about the messages which were categorized as spew, and it takes but a minute to scroll through it to check for false pozzies. Since my tweaks for the header-trimming lists, false pozzies have dropped to ZERO, but one week is a rather short timeframe for a real figure there.

You should check my "spewhosts" and "furrin" recipes sometime (both are linked from my meagre procmail pages). ebay and paypal phishing expeditions are stopped dead in their tracks because the phishers are too lazy to even attempt to forge the messages in a qwality fashion...


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail