terry jones writes on 18 August 1997 at 14:00:09
1) match on some regexp like this
"reply (with)? (the)? (word)? ['"]?remove['"]? in the (subject|body)"
[...]
2) look for regexps like these:
"^[^a-z]*!!!![ ]*$"
[...]
my guess is that with this one mechanism in place, i could clobber
a LOT of spam, and pretty efficiently too (no need to fgrep on 800
domain names and add more domains as they multiply). at least in my
I like the ideas... thus far, my SPAM filtering techniques have also
been heuristic-based rather than just listing domain names.
i'd be happy to receive comments. if no one has such a standalone
spam detector, i'll write one.
Here's what I've been working on in my ~/.procmailrc file:
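# NOTE: the recipes below rely on FROM, SENDER, DOMAIN, ME and ME_REGEXP,
# which are set elsewhere and are not shown in this message. The values
# here are hypothetical placeholders (example.com is not a real setting),
# given only so the file can be read on its own; adjust before use.
ME=dan@example.com
ME_REGEXP='dan@(.*\.)?example\.com'
DOMAIN='example\.com'
# Assumed shape: a prefix that matches a sender header up to the point
# where an address may begin.
FROM='^(From|Sender|Reply-To):(.*[^-a-z0-9_.])?'
# formail -rt builds a reply header; -zxTo: extracts the reply address.
SENDER=`formail -rtzxTo:`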
# Invalid Message-Id:s are likely SPAM
:0
* ! ^Message-Id:[ ]*<[^ <>@]+@[^ <>@]+>[ ]*$
{
LOG="spamreject: No valid Message-Id
"
:0:
toread/spam
}
#####
##### Some Spam heuristics
#####
##### Do these before various auto-replies, but after mail-list sorting
#####
LOG="spamreject: BEGIN
"
# required headers
:0h
* ^From:
* ^(Apparently-)?To:
* ^Date:
{ }
:0E
{
LOG="spamreject: Insufficient message headers
"
:0:
toread/spam
}
# don't know who the sender is - good candidate for spam
:0
* !^FROM_DAEMON
* !? grep -i "${SENDER}" $HOME/.mailrc
* $!${FROM}.*@(.*\.)?${DOMAIN}
* !^(In-Reply-To:|References:|Subject:[ ]*Re(\[[0-9]+\])?:).+
{
LOG="spamreject: Unknown sender
"
# I'm not explicitly listed - for now don't assume EVERYTHING from
# an unknown source is spam
:0:
* $!(^TO|^X-Listmember: ?)${ME_REGEXP}
toread/spam
# dan@bristol.com has been incorrect for more than a year
:0:
* ^TOdan@bristol\.com
toread/spam
# Apparently-To: headers should only contain my official address
:0:
* ^Apparently-To:
* $!^Apparently-To:[ ]?${ME}
toread/spam
# No large headers from unknown senders
MAX_COMMAS=45
#
# From David W. Tamkin <dattier@wwa.com>
#
:0h # H is implicit; this is h
* ^Resent-(To|Cc):
ADDRESSES=|formail -czxResent-To: -xResent-Cc:
:0Eh
ADDRESSES=|formail -czxTo: -xCc: -xApparently-To:
# Now, the number of addressees should be the number of non-empty
# lines (procmail always sees an extra empty line at the end of a
# search area) plus the number of commas; this will still overcount
# if someone has a comma inside a name comment (thus MAX_COMMAS
# instead of MAX_ADDRESSES).
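# Worked example of the scoring below (assumed inputs, for illustration):
# a single To: line carrying 50 comma-separated addresses gives
# 1 non-empty line + 49 commas = 50; subtracting MAX_COMMAS leaves
# 50 - 45 = 5, which is positive, so the recipe delivers to toread/spam.
# With 3 addresses on one line the score is 1 + 2 - 45 = -42, which is
# not positive, so the recipe does not fire.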
:0
* 1^1 ADDRESSES ?? ^.+$
* 1^1 ADDRESSES ?? ,
* $-${MAX_COMMAS}^0
toread/spam
}
# spam-like addresses - let friends@planetall.com fall through
:0:
* $${FROM}(remove|delete|free|friend@)
toread/spam
# Thanks to Pegasus mail, we have this:
:0:
* ^X-Distribution:[ ]?(moderate|bulk|mass)
toread/spam
# Headers that shouldn't exist in "real" mail
#
# Might need to be a little more particular here;
# Philip Guenther <guenther@gac.edu>: If a message comes into your
# mailbox that has the X-UIDL: header, and doesn't have your address in
# the header, then I would have strong doubts about its legitimacy.
:0:
* ^X-UIDL:
toread/spam
# Check if From: = To:
#
# Extract Reply-To: or From: (try that order). The negation
# is to pull a De Morgan's law trick and get OR-like semantics
# with short-circuiting.
:0
* ! ^Reply-To: *\/[^ ].*
* ! ^From: *\/[^ ].*
{
# No Reply-To: or From: header was found. What to do here
# is your choice. *Every* message should have a From: header,
# and some MTAs (e.g., sendmail) will create one, so this
# may very well be impossible, in which case anything you put
# here will be ignored, except for comments which you'll continue
# to read and ponder until you realize how silly they are.
# I'll treat this as likely spam
:0:
toread/spam
}
# If the previous recipe failed its conditions, then a match was
# found. Use the match as the target of a ^TO_ search. ^TO_ was
# introduced in procmail 3.11pre4. If you don't have at least that,
# just use ^TO
# We exclude anything with a Resent- header to avoid problems with
# lists that change the Reply-To: to point back to the list.
:0E
* $ ^TO$\MATCH\\>
* ! ^Resent-
toread/spam
# and From: = Reply-To:
# I've generated some messages like this myself :-), thus the added
# check against ^FROM_DAEMON
:0
* ! ^From: *\/[^ ].*
{
# see comments above
:0:
toread/spam
}
:0E
* !^FROM_DAEMON
* $ ^Reply-To:[ ]?$\MATCH\\>
toread/spam
# known SPAM that can't be generalized to fit the above rules
:0:
* $${FROM}mail@mailermachine\.com
toread/spam
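# A sketch of the body-pattern idea quoted at the top of this message,
# written as procmail recipes. This is an untested assumption about how
# those two regexps could be wired in, not part of the original file.
# The B flag makes the conditions search the body instead of the
# headers; procmail matches case-insensitively unless D is given.
# The optional words carry their own trailing space here, so that
# "reply remove in the subject" also matches.
:0 B:
* reply (with )?(the )?(word )?['"]?remove['"]? in the (subject|body)
toread/spam
# A line with no letters that ends in four exclamation marks.
:0 B:
* ^[^a-z]*!!!![ ]*$
toread/spam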
LOG="spamreject: END
"
Dan
------------------- message is author's opinion only ------------------
J. Daniel Smith <DanS(_at_)bristol(_dot_)com>
http://www.bristol.com/~DanS
Bristol Technology B.V. +31 33 450 50 50, ...51 (FAX)
Amersfoort, The Netherlands {info,jobs}(_at_)bristol(_dot_)com