Re: un-spam in two ways?

era eriksson writes on 20 September 1997 at 14:31:52

[...]
A general problem with spam filtering is that when a filter becomes
very popular, the spammers will find a way to circumvent it. For this
reason, it makes sense to have some spam filtering measures of your


This is also a good reason why I think it's better to focus on general
heuristics rather than, say, matching against a list of known SPAM
domains.  Add heuristics with a weighting mechanism like procmail's
scoring and it should be possible to construct good SPAM filters that
can be readily shared.
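
To make the weighting idea concrete, here is a minimal sketch using
procmail's scoring syntax.  The variable SPAM_THRESHOLD, the individual
weights, and the reason text are illustrative assumptions only; they
are not taken from the attached file and have not been tuned.

```procmail
# Hypothetical threshold: the recipe fires only if the weights of the
# matching conditions add up to MORE than this value.
SPAM_THRESHOLD=3

:0
# An empty weighted condition always matches, so this starts the
# score at -SPAM_THRESHOLD.
* $-${SPAM_THRESHOLD}^0
# Each heuristic adds its (assumed) weight once if it matches.
* 2^0 ^X-Distribution:.*(bulk|mass)
* 2^0 ^Comments:.*Authenticated sender
* 2^0 ^X-(Adverti[sz]e?ment|[0-9]):
* 1^0 B ?? FREE
# Evidence that the message is a reply counts against SPAM.
* -2^0 ^(In-Reply-To|References):
{
  SPAMCHECK_SPAM=yes

  :0fwh
  | formail -A "X-SpamCheck-Reason: heuristic score above threshold"
}
```

With weights like these no single heuristic is decisive; two of the
stronger ones have to agree.  Sharing such a filter then mostly means
sharing the weights.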

This isn't to say that the heuristics and weights won't have to change
as SPAMmers become more sophisticated.  For example, right now if a
message matches
   * ^(In-Reply-To:|References:|Subject:[       ]*Re(\[[0-9]+\])?:).+
then I currently think that it probably isn't SPAM.  In the future
(tomorrow :-) ) this check might have to do something more, such as
comparing the Message-Id:s in References: against those I've
sent/received in the
past N days.  Right now I haven't hooked such a thing up because I
haven't got enough SPAM which slipped past the above simple check.
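
A rough sketch of what such a check could look like follows.  This is
not part of the attached file.  It assumes a hypothetical cache file
(MSGID_CACHE) that something else keeps up to date with one
Message-Id per line, angle brackets included, covering the past N
days.  SPAMCHECK_KNOWN_THREAD is likewise a made-up variable name.

```procmail
# Hypothetical cache of Message-Ids sent/received recently.  It must
# not contain blank lines, since an empty pattern would match anything.
MSGID_CACHE=$HOME/.msgid.cache

:0
* ^(In-Reply-To|References):
# Extract both fields and look for any cached Message-Id as a fixed
# string; grep exits 0 (condition true) only if at least one is found.
* ? formail -czxReferences: -xIn-Reply-To: | grep -F -q -f $MSGID_CACHE
{ SPAMCHECK_KNOWN_THREAD=yes }
```

A message that claims to be a reply but refers to no Message-Id in the
cache would then no longer get the benefit of the doubt.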

Ob-procmail: I'm attaching my latest version of spamcheck.rc (it's
still short...)

   Dan
------------------- message is author's opinion only ------------------
J. Daniel Smith <DanS(_at_)bristol(_dot_)com>        
http://www.bristol.com/~DanS
Bristol Technology B.V.                   +31 33 450 50 50, ...51 (FAX)
Amersfoort, The Netherlands               {info,jobs}(_at_)bristol(_dot_)com
-----
# 
# J. Daniel Smith
# 21 August 1997
#
# spam.rc
#
# Try to detect SPAM and take appropriate actions when found.
#
# Copyright (c) 1997 J. Daniel Smith.  All rights reserved.
#
# You can do anything you want with this provided
#   * you don't make any money as a result
#   * you don't try to claim this is yours
# Obviously, everybody cuts-and-pastes procmail recipes, and I've got
# no problem with you doing that either.  However, if you use a
# significant part of this file, I'd appreciate attribution.  And if
# you figure out a way to make money with it, I want a cut. :-)
#

# Like procmail's ^TO, but for From: and CC: lines
# The extra outer layer of parentheses is there so that one can use forms like
# ${FROM}* or ${FROM}+ or ${FROM}?.
CC=${CC:-"(^((Original-)?(Resent-)?(Cc|Bcc)):(.*[^a-zA-Z])?)"}
FROM=${FROM:-"(^((X-(Envelope-)?)?(Apparently-|Resent-)*(From|Reply-To|Sender):\
(.*[^-a-z0-9_])?|From ([^       ]*[-_@!.])?))"}

#
# Much of the following is compliments of David Tamkin <dattier(_at_)wwa(_dot_)com>
#
SPAMCHECK_ACTION=${SPAMCHECK_ACTION:-header} # subject, discard, or header

#####
##### These recipes could check for an existing X-SpamCheck-Reason: header
##### for improved efficiency, but for now it might be interesting to
##### see how many different heuristics catch a particular piece of SPAM.
#####
##### Each recipe should set SPAMCHECK_SPAM=yes and add an
##### X-SpamCheck-Reason: header
#####

SPAMCHECK_SPAM=no       # default

#####
##### Various header-based checks.
#####

# Invalid Message-Id:s are likely SPAM
:0
* ! ^Message-Id:[       ]*<[^   <>@]+@[^   <>@]+>[         ]*$
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Invalid Message-Id"
}

# don't accept all syntactically valid addresses - who's going to have
# a real email address of "foo_@-bar-.com"?
# $!(^TO|${FROM}).+@([-a-z0-9_]+\.)+\.[-a-z0-9_]+
# atext="[a-zA-Z0-9!#$%&'*+-=?^_`{|}~]"
# dotatom="[    ]*${atext}(\.${atext})?[        ]*"
# $!(^TO|${FROM})${dotatom}@${dotatom}
spamcheck_word="[a-z0-9][-a-z0-9_.+]*[a-z0-9]+"
spamcheck_tld="(com|gov|org|edu|net|[a-z][a-z])"
spamcheck_email="\<${spamcheck_word}@(${spamcheck_word}\.)+${spamcheck_tld}\>"
# 197y, 198y, 199y, 20yy, or just yy
spamcheck_year="((19[7-9]|20[0-9])[0-9]|[0-9][0-9])"
spamcheck_time="((0?|1)[0-9]|2[0-4]):[0-5][0-9](:[0-6][0-9])?"
# required headers and minimal validation
:0h
* $^From:(.*[^-a-z0-9_])?${spamcheck_email}
* $^(Apparently-)?To:(.*[^-a-z0-9_])?${spamcheck_email}
* $^Date:[      ]*.* ${spamcheck_year} ${spamcheck_time}
{
  # bogus addresses
  :0h
  * $^TO${spamcheck_email}
  * $${FROM}${spamcheck_email}
  { }
  :E
  {
    SPAMCHECK_SPAM=yes
    :0fwh
    | formail -A "X-SpamCheck-Reason: invalid Internet address"
  }
}
:E
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Insufficient/Invalid message headers"
}

# No large headers
:0
{
  MAX_COMMAS=45
  #
  # From David W. Tamkin <dattier(_at_)wwa(_dot_)com>  
  #
  :0h # H is implicit; this is h
  * ^Resent-(To|Cc):
  ADDRESSES=|formail -czxResent-To: -xResent-Cc:
  :0Eh
  ADDRESSES=|formail -czxTo: -xCc: -xApparently-To:

  # Now, the number of addressees should be the number of non-empty
  # lines (procmail always sees an extra empty line at the end of a
  # search area) plus the number of commas; this will still overcount
  # if someone has a comma inside a name comment (thus MAX_COMMAS
  # instead of MAX_ADDRESSES).
  :0
  * 1^1 ADDRESSES ?? ^.+$
  * 1^1 ADDRESSES ?? ,
  * $-${MAX_COMMAS}^0
  {
    SPAMCHECK_SPAM=yes
    :0fwh
    | formail -A "X-SpamCheck-Reason: Too many commas in addresses"
  }
}

# spam-like addresses - let friends@planetall.com fall through
:0
* $(${FROM}|^TO)(remove|delete|free|friend@)
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Suspicious addresses"
}

# Thanks to Pegasus mail, we have this:
:0
* ^X-Distribution:[     ]?(moderate|bulk|mass)
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Pegasus moderate/bulk/mass mailing"
}

# From: Gregory Sutter <gsutter(_at_)ugems(_dot_)psu(_dot_)edu>
# Pegasus mailer is the only mailer which legitimately generates
# "Comments: Authenticated sender is ..." so kill anything else.
:0
* ^Comments:.*Authenticated sender
* !^X-Mailer:.*Pegasus Mail
* !^Resent-To:
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: forged Pegasus auth"
}


# This is too easy :-)
:0
* ^X-(Adverti[sz](e)?ment|[0-9]):
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: X-Advertisement: header detected"
}

# Headers that shouldn't exist in "real" mail
#
# Might need to be a little more particular here; 
# Philip Guenther <guenther(_at_)gac(_dot_)edu>: If a message comes into your
# mailbox that has the X-UIDL: header, and doesn't have your address in
# the header, then I would have strong doubts about its legitimacy.
#
# Edward J. Sabol <sabol(_at_)alderaan(_dot_)gsfc(_dot_)nasa(_dot_)gov>: E-mails with
# X-UIDL: headers are almost definitely spam unless they've been
# Resent-To: me by someone. Also, valid X-UIDL: headers have 32 hexadecimal
# digits exactly.
:0
* ^X-UIDL:
* !^X-UIDL:[    ]*[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]\
                  [0-9a-f][0-9a-f][0-9a-f][0-9a-f][     ]*$
* !^Resent-To:
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Invalid X-UIDL: header detected"
}

# Check if From: = To:
SENDER=${SENDER:-`formail -rtzx To:`}
FROM_ADDRESS=${FROM_ADDRESS:-`formail -IReply-To: -rtzx To:`}
# We exclude anything with a Resent- header to avoid problems with
# lists that change the Reply-To: to point back to the list.
:0
* $^TO($SENDER|$FROM_ADDRESS)\>
* !^Resent-
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: To: and From:/Reply-To: headers are identical"
}

# and From: = Reply-To:
# I've generated some messages like this myself :-), thus the added
# check against ^FROM_DAEMON
:0
* $!(^FROM_DAEMON|${FROM}majordomo)
* $^(Reply|Errors)-To:[         ]?$FROM_ADDRESS\>
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: From: and Reply-To:/Errors-To: headers are identical"
}

# be sure the domain is valid.  In addition to this being SLOW, it's
# somewhat risky as DNS timeouts can occur.  Thus, it's done (for now)
# only if there isn't already an X-SpamCheck-Reason: header
:0
* !^X-SpamCheck-Reason:
{
  # this is how to determine if a domain is invalid.  To disable the
  # nslookup, set SPAMCHECK_INVALID_DOMAIN to /bin/false
  #SPAMCHECK_INVALID_DOMAIN=/bin/false
  SPAMCHECK_INVALID_DOMAIN=${SPAMCHECK_INVALID_DOMAIN:-'/usr/sbin/nslookup -query=any $SPAMCHECK_DOMAIN 2>&1 | grep -c "Non-existent domain"'}

  #SPAMCHECK_DOMAIN=`echo $SENDER | awk -F@ '{print $2}'`
  :0
  * $ SENDER ?? ()${spamcheck_word}@\/(${spamcheck_word}\.)+${spamcheck_tld}
  { SPAMCHECK_DOMAIN=$MATCH }
  :0
  * $?$SPAMCHECK_INVALID_DOMAIN
  {
    SPAMCHECK_SPAM=yes
    :0fwh
    | formail -A "X-SpamCheck-Reason: Invalid domain: $SPAMCHECK_DOMAIN"
  }
  :E
  {
    :0
    * $ FROM_ADDRESS ?? ()${spamcheck_word}@\/(${spamcheck_word}\.)+${spamcheck_tld}
    { SPAMCHECK_DOMAIN=$MATCH }
    :0
    * $?$SPAMCHECK_INVALID_DOMAIN
    {
      SPAMCHECK_SPAM=yes
      :0fwh
      | formail -A "X-SpamCheck-Reason: Invalid domain: $SPAMCHECK_DOMAIN"
    }
    :E
    {
      :0
      * $^TO${spamcheck_word}@\/(${spamcheck_word}\.)+${spamcheck_tld}
      { SPAMCHECK_DOMAIN=$MATCH }
      :0
      * $?$SPAMCHECK_INVALID_DOMAIN
      {
        SPAMCHECK_SPAM=yes
        :0fwh
        | formail -A "X-SpamCheck-Reason: Invalid domain: $SPAMCHECK_DOMAIN"
      }
    }
  }
}

#####
##### Look at the body...this starts getting trickier
#####
# this is going to need some beefing up...
:0BD
* H ?? !^(In-Reply-To:|References:|Subject:[    ]*Re(\[[0-9]+\])?:).+
* ^[^>]*FREE
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: Text 'FREE' detected"
}

# raw HTML
:0BH
* !^(Mime-Version|Content-Type):
* ()<(body[^<>]*|html)>
{
  SPAMCHECK_SPAM=yes
  :0fwh
  | formail -A "X-SpamCheck-Reason: HTML w/o MIME headers"
}

#####
##### Now deliver the mail
#####

:0
* SPAMCHECK_SPAM ?? yes
{
  :0h
  * SPAMCHECK_ACTION ?? discard
  /dev/null

  MATCH # unset it to start
  :0Efwh # if set to "subject" make it work if there is a subject or not
  * SPAMCHECK_ACTION ?? subject
  * 1^0 ^Subject\/:.*
  * 1^0
  | formail -I"Subject: SPAM$MATCH"

  # this is the "default" action, thus no "SPAMCHECK_ACTION ?? header" test.
  # Some mail system gateways (e.g. Notes' PostalUnion) will pitch
  # this, which is why the "subject" option above exists.  Of course,
  # do this after "discard"... :-)
  #
  # Only add this header for SPAM to make further procmail filtering
  # easy.  But empty header fields might get pitched...
  :0fwh
  | formail -A "X-SpamCheck-Disposition: this message is spam"
}

# ...and record that the message passed through here
:0fwh
| formail -A"X-SpamCheck: Dan's SPAM Detector" \
          -A"X-SpamCheck-Version: 0.3"
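
For anyone wanting to try the file above, a minimal sketch of calling
it from a .procmailrc might look like the following.  The location
$HOME/.procmail/spamcheck.rc and the folder name "spam" are
assumptions; adjust both to your own setup.

```procmail
# Choose what happens to detected SPAM: header (the default),
# subject, or discard.
SPAMCHECK_ACTION=header

# Run the checks.  The included file ends with a filtering recipe, so
# processing continues here afterwards (unless the mail was discarded).
INCLUDERC=$HOME/.procmail/spamcheck.rc

# File anything that was flagged into a separate folder.
:0:
* ^X-SpamCheck-Disposition:
spam
```

Because the individual X-SpamCheck-Reason: headers are kept, the spam
folder also shows which heuristics fired for each message.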