procmail

Re: Killing spam based on nameserver info

1997-09-13 08:48:26
Professional Software Engineering writes on 12 September 1997 at 22:03:35
At 11:32 PM 9/12/97 -0500, Conrad Sabatier wrote:

Here's a novel and interesting idea I ran across recently.  I haven't yet
gotten around to actually setting up a procmail recipe to test this method,
but it does sound extremely clever!
[...]
Instead, why not look to periodically update
cyberpromo/nancynet/quantumm/etc spamdomain lists?

With such a list of domains (mine is right around 1000 spam domains total),
you can simply egrep the message headers for occurrences of any of the

I'm trying to do my SPAM filtering by developing heuristics rather
than using a list of spam domains.  I'm surprised at how much SPAM
(and thus far how little legitimate mail) gets caught with some fairly
simple things like checking for From: == To:.

I've appended my current spamcheck.rc file below.

   Dan
------------------- message is author's opinion only ------------------
J. Daniel Smith <DanS(_at_)bristol(_dot_)com>        
http://www.bristol.com/~DanS
Bristol Technology B.V.                   +31 33 450 50 50, ...51 (FAX)
Amersfoort, The Netherlands               {info,jobs}(_at_)bristol(_dot_)com
-----
# 
# J. Daniel Smith
# 21 August 1997
#
# spam.rc
#
# Try to detect SPAM and take appropriate actions when found.
#

# Like procmail's ^TO, but for From: and CC: lines
# The extra outer layer of parentheses is there so that one can use forms
# like ${FROM}* or ${FROM}+ or ${FROM}?.
CC=${CC:-"(^((Original-)?(Resent-)?(Cc|Bcc)):(.*[^a-zA-Z])?)"}
FROM=${FROM:-"(^((X-(Envelope-)?)?(Apparently-|Resent-)*(From|Reply-To|Sender):\
(.*[^-a-z0-9_])?|From ([^       ]*[-_@!.])?))"}

#
# Much of the following is compliments of David Tamkin
# <dattier(_at_)wwa(_dot_)com>
#
SPAMCHECK_ACTION=${SPAMCHECK_ACTION:-header} # subject, discard, or header

#####
##### Do SPAM detection
#####
##### These recipes could check for an existing X-SpamCheck-Reason: header
##### for improved efficiency, but for now it might be interesting to
##### see how many different heuristics catch a particular piece of SPAM.
#####
##### Each recipe should set SPAMCHECK_SPAM=yes and add an
##### X-SpamCheck-Reason: header
#####

SPAMCHECK_SPAM=no       # default

#####
##### All various header-based checks.
#####

# Invalid Message-Id:s are likely SPAM
:0
* ! ^Message-Id:[       ]*<[^   <>@]+@[^   <>@]+>[         ]*$
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Invalid Message-Id"
}
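# For illustration only (both IDs below are made up): a header such as
#   Message-Id: <199709130648.IAA01234@mail.example.com>
# satisfies the pattern above, so the recipe does not fire.  A header
# such as
#   Message-Id: <12345>
# has no "@" between the angle brackets, so the message gets flagged;
# a message with no Message-Id: header at all is flagged as well.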

# required headers
:0h
* ^From:
* ^(Apparently-)?To:
* ^Date:
{ }
:E
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Insufficient message headers"
}

# bogus addresses
# $!(^TO|${FROM}).+@([-a-z0-9_]+\.)+\.[-a-z0-9_]+
# atext="[a-zA-Z0-9!#$%&'*+-=?^_`{|}~]"
# dotatom="[    ]*${atext}(\.${atext})?[        ]*"
# $!(^TO|${FROM})${dotatom}@${dotatom}
# don't accept all syntactically valid addresses - who's going to have
# a real email address of "foo_@-bar-.com"?
word="[a-z0-9][-a-z0-9_.+]*[a-z0-9]+"
tld="(com|gov|org|edu|net|[a-z][a-z])"
:0h
* $^TO${word}@(${word}\.)+${tld}
* $${FROM}${word}@(${word}\.)+${tld}
{ }
:E
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: invalid Internet address"
}
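# A few made-up addresses, to sketch how ${word} behaves as written
# (my reading of the pattern, not something tested against real mail):
#   dan@mail.example.com   accepted: each part starts and ends with a
#                          letter or digit
#   foo_@-bar-.com         rejected: "foo_" ends with "_" and "-bar-"
#                          starts with "-"
#   a@example.com          also rejected, since ${word} needs at least
#                          two characters; worth keeping in mind if you
#                          correspond with one-letter local parts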


# No large headers
:0
{
  MAX_COMMAS=45
  #
  # From David W. Tamkin <dattier(_at_)wwa(_dot_)com>  
  #
  :0h # H is implicit; this is h
  * ^Resent-(To|Cc):
  ADDRESSES=|formail -czxResent-To: -xResent-Cc:
  :0Eh
  ADDRESSES=|formail -czxTo: -xCc: -xApparently-To:

  # Now, the number of addressees should be the number of non-empty
  # lines (procmail always sees an extra empty line at the end of a
  # search area) plus the number of commas; this will still overcount
  # if someone has a comma inside a name comment (thus MAX_COMMAS
  # instead of MAX_ADDRESSES).
  :0
  * 1^1 ADDRESSES ?? ^.+$
  * 1^1 ADDRESSES ?? ,
  * $-${MAX_COMMAS}^0
  {
    SPAMCHECK_SPAM=yes
    :0fwh
    | formail -A "X-SpamCheck-Reason: Too many commas in addresses"
  }
}
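# Worked example of the scoring above, with made-up numbers.  Each
# non-empty line in ADDRESSES scores 1, each comma scores 1, and
# MAX_COMMAS is subtracted once; the recipe fires only if the total
# is positive.
#   To: with 40 addresses (39 commas) plus Cc: with 10 (9 commas):
#       2 lines + 48 commas - 45 =   5   -> flagged
#   To: with 3 addresses (2 commas), no Cc::
#       1 line  +  2 commas - 45 = -42   -> not flagged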

# spam-like addresses - let friends(_at_)planetall(_dot_)com fall through
:0
* $(${FROM}|^TO)(remove|delete|free|friend@)
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Suspicious addresses"
}

# Thanks to Pegasus mail, we have this:
:0
* ^X-Distribution:[     ]?(moderate|bulk|mass)
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Pegasus moderate/bulk/mass mailing"
}

# This is too easy :-)
:0
* ^X-(Adverti[sz](e)?ment|[0-9]):
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: X-Advertisement: header detected"
}

# Headers that shouldn't exist in "real" mail
#
# Might need to be a little more particular here; 
# Philip Guenther <guenther(_at_)gac(_dot_)edu>: If a message comes into your
# mailbox that has the X-UIDL: header, and doesn't have your address in
# the header, then I would have strong doubts about its legitimacy.
#
# Edward J. Sabol <sabol(_at_)alderaan(_dot_)gsfc(_dot_)nasa(_dot_)gov>:
# E-mails with X-UIDL: headers are almost definitely spam unless they've
# been Resent-To: me by someone. Also, valid X-UIDL: headers have exactly
# 32 hexadecimal digits.
:0
* ^X-UIDL:
* !^X-UIDL:[    ]*[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][     ]*$
* !^Resent-To:
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Invalid X-UIDL: header detected"
}

# Check if From: = To:
SENDER=${SENDER:-`formail -rtzx To:`}
MATCH=`formail -IReply-To: -rtzx To:`
# We exclude anything with a Resent- header to avoid problems with
# lists that change the Reply-To: to point back to the list.
:0
* $^TO($SENDER|$MATCH)\>
* !^Resent-
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: To: and From:/Reply-To: headers are identical"
}

# and From: = Reply-To:
# I've generated some messages like this myself :-), thus the added
# check against ^FROM_DAEMON
MATCH=`formail -IReply-To: -rtzx To:`
:0
* $!(^FROM_DAEMON|${FROM}majordomo)
* $^(Reply|Errors)-To:[         ]?$MATCH\>
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: From: and Reply-To:/Errors-To: headers are identical"
}


#####
##### Look at the body...this starts getting trickier
#####
# this is going to need some beefing up...
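# The B flag makes every condition search the body, so the header test
# needs an explicit H ?? (note that D makes it case-sensitive as well).
:0BD
* ! H ?? ^(In-Reply-To:|References:|Subject:[         ]*Re(\[[0-9]+\])?:).+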
* [^>]*FREE
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Text 'FREE' detected"
}

# raw HTML
:0BH
* !^(Mime-Version|Content-Type):
* \<(body.*|html)\>
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: HTML w/o MIME headers"
}

#####
##### Now deliver the mail
#####

:0
* SPAMCHECK_SPAM ?? yes
{
  :0h
  * SPAMCHECK_ACTION ?? discard
  /dev/null

  MATCH # unset it to start
  :0Efwh # if set to "subject" make it work if there is a subject or not
  * SPAMCHECK_ACTION ?? subject
  * 1^0 ^Subject\/:.*
  * 1^0
  | formail -I"Subject: SPAM$MATCH"

  # this is the "default" action, thus no "SPAMCHECK_ACTION ?? header" test.
  # Some mail system gateways (e.g. Notes' PostalUnion) will pitch
  # this, which is why the "subject" option above exists.  Of course,
  # do this after "discard"... :-)
  #
  # Only add this header for SPAM to make further procmail filtering
  # easy.  But empty header fields might get pitched...
  :0fwh
  | formail -A "X-SpamCheck-Disposition: this message is spam"
}

# ...and record that the message passed through here
:0fwh
| formail -A"X-SpamCheck: Dan's SPAM Detector" \
          -A"X-SpamCheck-Version: 0.2"
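
# A minimal sketch of how this file might be called from ~/.procmailrc.
# The path $HOME/.procmail/spam.rc and the folder name "spam" are
# assumptions, not part of the file above; the lines are commented out
# so that they stay inert if left in this file.
#
#   SPAMCHECK_ACTION=subject      # or discard, or header (the default)
#   INCLUDERC=$HOME/.procmail/spam.rc
#
#   # File anything the checks marked as spam into its own folder.
#   :0:
#   * ^X-SpamCheck-Disposition:
#   spam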
