At 14:11 2005-06-21 -0500, Damian Menscher wrote:
> Mind telling us what your rules are, and how many false positives you
> get? I use procmail to supplement spamassassin and clamav, but couldn't
> imagine using it alone. Your rules must be quite impressive.
As with some other posters here, I've posted bits and pieces to this list
over the years, generally in response to someone's "how do I..."
queries. It isn't all packaged up for general consumption, but the basis
for it has been discussed frequently. As with SA, I tabulate a running
score of how spammy a message is (which I call "SPAMMISHNESS"), and if that
exceeds a threshold, the message is scuttled. I don't simply use
individual characteristics as a "this is spam" flag. I certainly have a
few tests which score above the spam threshold on their own, but a few
things still provide an allowance against the score (say, lists like
procmail which discuss spam frequently).
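A minimal sketch of the running-score idea, for illustration only and not the actual filter: SPAMTOTAL, SPAMTHRESHOLD, the "spam" folder and the -200 allowance are assumed names and values. It also assumes bc is available to sum the accumulated "+N" terms.

```procmail
# Illustrative sketch only; SPAMTHRESHOLD, SPAMTOTAL, the "spam" folder and
# the -200 allowance are assumptions, not taken from the real filter.
SPAMTHRESHOLD=249
SPAMMISHNESS=""

# Allowance for a list where spam is discussed frequently.
:0
* LISTNAME ?? ^^procmail^^
{
  SPAMMISHNESS="${SPAMMISHNESS}-200"
}

# ... individual tests append terms such as "+50" to SPAMMISHNESS here ...

# Let bc sum the accumulated expression, e.g. "0-200+50+100".
SPAMTOTAL=`echo "0${SPAMMISHNESS}" | bc`

# File the message as spam only when the total exceeds the threshold.
:0:
* ? test "${SPAMTOTAL}" -gt "${SPAMTHRESHOLD}"
spam
```

Keeping the score as a string of terms and summing it once at the end means each test only has to append its own value, and the same terms can be written to the log as they are.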
The majority of my false positives have been on a couple of discussion
lists which prune out the Received headers from prior to the list's
resending of the message (which interferes with matching the author's
domain in the message delivery history). This relates to the tweak I added
recently to basically allow "for this list and this list, skip these
tests", by using named conditions at the tests and checking for a match in
a string of excluded tests:
# some lists strip the original Received headers from submissions before
# resending them, which causes some tests to bugger out.
# So, we define a SPAMSKIP string which sets tests which we skip.
:0
* LISTNAME ?? ^^(mysql|php-general|bugtraq)^^
{
  SPAMSKIP=" MESSIDRCVD FROMDOMRCVD "
}
(LISTNAME is defined by a recipe I posted here long ago which
programmatically extracts the listname from various headers - be it a
sender, or RFC-based list header, etc)
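The recipe referred to is not reproduced here. As a rough stand-in, a hypothetical extraction might look like the sketch below; the choice of headers and the fallback order are assumptions.

```procmail
# Hypothetical LISTNAME extraction; NOT the recipe referred to above.
LISTNAME=""

# Prefer the RFC 2919 List-Id header: take the first label inside <...>,
# so "List-Id: <mysql.mysql.com>" yields "mysql".
:0
* ^List-Id:.*<\/[^.>]+
{
  LISTNAME=$MATCH
}

# Otherwise fall back to a Majordomo-style "Sender: owner-listname@host".
:0 E
* ^Sender:.*owner-\/[^@ ]+
{
  LISTNAME=$MATCH
}
```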
# 20050425
# message-id host does not appear within the received header chain
:0
* $! SPAMSKIP ?? [ ]MESSIDRCVD\<
* $! ^Received:.*${MESSID_DOMAIN}
{
  SPAMVAL="+50"
  SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
  SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} message-id domain not in received chain${NL}"
}
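The test above depends on MESSID_DOMAIN having been set earlier in the rcfile. How that is done is not shown; one plausible way, offered purely as an assumption, is to capture whatever follows the "@" in the Message-ID header:

```procmail
# Hypothetical derivation of MESSID_DOMAIN (an assumption, not the original
# code): capture the text after "@" in Message-ID, up to the closing ">".
MESSID_DOMAIN=""

:0
* ^Message-ID:.*@\/[^> ]+
{
  MESSID_DOMAIN=$MATCH
}
```

For "Message-ID: <abc123@mail.example.com>" this leaves "mail.example.com" in MESSID_DOMAIN. The dots are not escaped when the value is later interpolated into a regexp, so they match any character there, which makes the comparison slightly looser than an exact match.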
>> The irony of the regexp being used is that the body keywords will result
>> in his *OWN* message to this list not arriving back in his inbox.
> Yes. In fact, I'm impressed that you managed to receive it, given that
> you must filter on some of those words also.
But, I don't filter based on crude constructs. I also have an
allowance for the procmail list because spam is discussed here frequently.
90% or more of my spam filtering involves characteristics in the HEADERS of
messages, not the body.
In fact, his message tripped absolutely nothing, although the message I
received immediately after it in my inbox tripped a number of conditions -
including a catch-all for "too many" conditions (which allows for
escalating something to spam even if the individual characteristics score low):
List: procmail
From procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE Tue Jun 21 03:59:39 2005
Subject: recipe blocking mail with attachements
Folder: gzip -9fc >> procmail.gz 4782
SPAM: +30 Advisory - may be forged warning
SPAM: +100 non-list From and envelope differ
SPAM: +50 message-id domain does not match sender domain
SPAM: +35 from_domain not found in received chain
SPAM: +50 allcaps subject
SPAM: +(249*2) Blank To/From
SPAM: +75 no non-list cleartext recipient matching X-Envelope-To
SPAM: +125 relay hostname appears to be consumer dialup/broadband
SPAM: +249+165 Subject Scoring match 165
SPAM: +5 spam type statements (5)
SPAM: +249 Abundance of triggers
SPAM: Advisory - spammishness is 1631
SPAM: spammishness exceeds threshold of 249
SPAM: Apparent recipient is **DELETED**(_at_)mail(_dot_)professional(_dot_)org
INFO: SpamFilter v03.11.00 SBS 20050425/1552
From gatexs01(_at_)gatexs(_dot_)nl Tue Jun 21 04:01:30 2005
Subject: CONGRATULATIONS LOTTERY WINNER
Folder: gzip -9fc >> spam.gz 3284
(this data is in the logfile, not inserted into the messages)
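The "Abundance of triggers" line in the log comes from the catch-all for "too many" conditions. A sketch of how such a catch-all could be built is below; SPAMTRIGGERS, SPAMTRIGGERMAX and the limit of 8 are assumed for illustration, and NL is taken to hold a newline as in the recipe shown earlier.

```procmail
# Illustrative only; SPAMTRIGGERS and SPAMTRIGGERMAX are assumed names and
# the limit of 8 is a made-up value.
SPAMTRIGGERMAX=8

# Count the scored "SPAM: +..." lines accumulated in SPAMNOTES so far.
SPAMTRIGGERS=`echo "${SPAMNOTES}" | grep -c '^SPAM: +'`

# Too many individual hits: escalate, however low each one scored.
:0
* ? test "${SPAMTRIGGERS}" -gt "${SPAMTRIGGERMAX}"
{
  SPAMVAL="+249"
  SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
  SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Abundance of triggers${NL}"
}
```

A recipe like this has to sit after all the individual tests, so that SPAMNOTES is complete when the lines are counted.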
> Note that it got a bayes score of 0 even though it had lots of "bad" words
> in it. Bayes scoring is very nice, which is why I'm so curious how you
> managed to get by without it.
By having well constructed scoring rules of my own. I have just three
recipes which involve the body - one is my Nigerian scam recipe (which is
quite effective), and the other two relate to opt-out claims. My one
recipe which uses simple keywords is actually restricted to the SUBJECT,
and scores very low on mundane spammish words (identity, logo, etc) and
higher for more spammy words (aphrodisiac), and even higher on combinations
of spammy words. Because that particular recipe requires that the recipe
score above a certain level before it considers itself matched, the message
has to have a fair number of low-grade matches, or a few high grade
ones. This only happens when the message is loaded with crap, and unless
it is REALLY loaded, it's not going to contribute a whole lot to the
overall spammishness. Header tests remain the most effective method of
identifying spam.
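The Subject keyword recipe maps naturally onto procmail's weighted scoring (see procmailsc(5)). The sketch below is an illustration only: the weights, the level of 100 and the +25 contribution are made up, and only the three example words come from the description above.

```procmail
# Illustrative weights and level, not the real recipe.
# The score starts at -100 for any message with a Subject, each keyword hit
# adds its weight, and the recipe matches only if the final score is
# positive.
:0
* -100^0 ^Subject:
*   20^0 ^Subject:.*identity
*   20^0 ^Subject:.*logo
*   70^0 ^Subject:.*aphrodisiac
{
  SPAMVAL="+25"
  SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
  SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Subject keyword scoring${NL}"
}
```

With these numbers no single word is enough: "aphrodisiac" alone leaves the score at -30, and it takes all three words (20+20+70) to push it above zero. A real list would hold many more words, with extra weighted lines for combinations.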
Stats thus far for this month show that about 1 in 10 received messages is
spam, representing about 1/5 of the total received bytes. This does not
include malware which is shed at a different stage, nor does it account for
the MANY messages which are completely avoided due to DNSBLs at the SMTP level.
Yesterday, there were 45 spams blocked from my inbox via my procmail filters.
I receive a daily logfile excerpt showing data about the messages which
were categorized as spew, and it takes but a minute to scroll through it to
check for false pozzies. Since my tweaks for the header-trimming lists,
false pozzies have dropped to ZERO, but one week is a rather short
timeframe for a real figure there.
You should check my "spewhosts" and "furrin" recipes sometime (both are
linked from my meagre procmail pages). ebay and paypal phishing
expeditions are stopped dead in their tracks because the phishers are too
lazy to even attempt to forge the messages in a qwality fashion...
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
____________________________________________________________
procmail mailing list Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail