Re: Two recipe suggestions

terry jones writes on 18 August 1997 at 14:00:09

1) match on some regexp like this 

  "reply (with)? (the)? (word)? ['"]?remove['"]? in the (subject|body)"
[...]
2) look for regexps like these: 

    "^[^a-z]*!!!![    ]*$"
[...]
  my guess is that with this one mechanism in place, i could clobber
  a LOT of spam, and pretty efficiently too (no need to fgrep on 800
  domain names and add more domains as they multiply). at least in my


I like the ideas. Thus far, my SPAM filtering techniques have also
been heuristic-based rather than just lists of domain names.

i'd be happy to receive comments. if no one has such a standalone
spam detector, i'll write one.


Here's what I've been working on in my ~/.procmailrc file:

# Invalid Message-Id:s are likely SPAM
:0
* ! ^Message-Id:[       ]*<[^   <>@]+@[^   <>@]+>[         ]*$
{
  LOG="spamreject: No valid Message-Id
"
  :0:
  toread/spam
}

#####
##### Some Spam heuristics
#####
##### Do these before various auto-replies, but after mail-list sorting
#####
LOG="spamreject: BEGIN
"

# required headers
:0h
* ^From:
* ^(Apparently-)?To:
* ^Date:
{ }
:0E
{
  LOG="spamreject: Insufficient message headers
"
  :0:
  toread/spam
}

# don't know who the sender is - good candidate for spam
:0
* !^FROM_DAEMON
* !? grep -i "${SENDER}" $HOME/.mailrc
* $!${FROM}.*@(.*\.)?${DOMAIN}
* !^(In-Reply-To:|References:|Subject:[         ]*Re(\[[0-9]+\])?:).+
{
  LOG="spamreject: Unknown sender
"

  # I'm not explicitly listed - for now don't assume EVERYTHING from
  # an unknown source is spam
  :0:
  * $!(^TO|^X-Listmember: ?)${ME_REGEXP}
  toread/spam

  # dan@bristol.com has been incorrect for more than a year
  :0:
  * ^TOdan@bristol\.com
  toread/spam

  # Apparently-To: headers should only contain my official address
  :0:
  * ^Apparently-To:
  * $!^Apparently-To:[  ]?${ME}
  toread/spam

  # No large headers from unknown senders
  MAX_COMMAS=45
  #
  # From David W. Tamkin <dattier(_at_)wwa(_dot_)com>
  #
  :0h # H is implicit; this is h
  * ^Resent-(To|Cc):
  ADDRESSES=|formail -czxResent-To: -xResent-Cc:
  :0Eh
  ADDRESSES=|formail -czxTo: -xCc: -xApparently-To:

  # Now, the number of addressees should be the number of non-empty
  # lines (procmail always sees an extra empty line at the end of a
  # search area) plus the number of commas; this will still overcount
  # if someone has a comma inside a name comment (thus MAX_COMMAS
  # instead of MAX_ADDRESSES).
  :0:
  * 1^1 ADDRESSES ?? ^.+$
  * 1^1 ADDRESSES ?? ,
  * $-${MAX_COMMAS}^0
  toread/spam
}
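
# Worked example of the scoring recipe above (illustrative numbers only,
# not taken from real mail), with MAX_COMMAS=45:
#   To: with 50 addresses on one line -> 1 non-empty line + 49 commas
#       score = 1 + 49 - 45 =   5  (> 0, recipe matches, filed as spam)
#   To: with 2 addresses, Cc: with 1  -> 2 non-empty lines + 1 comma
#       score = 2 +  1 - 45 = -42  (<= 0, recipe does not match)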

# spam-like addresses - let friends@planetall.com fall through
:0:
* $${FROM}(remove|delete|free|friend@)
toread/spam

# Thanks to Pegasus mail, we have this:
:0:
* ^X-Distribution:[     ]?(moderate|bulk|mass)
toread/spam

# Headers that shouldn't exist in "real" mail
#
# Might need to be a little more particular here; 
# Philip Guenther <guenther(_at_)gac(_dot_)edu>: If a message comes into your
# mailbox that has the X-UIDL: header, and doesn't have your address in
# the header, then I would have strong doubts about its legitimacy.
:0:
* ^X-UIDL: 
toread/spam

# Check if From: = To:
#
# Extract Reply-To: or From: (tried in that order).  The negations
# are a De Morgan's law trick to get OR-like semantics with
# short-circuiting.
:0
* ! ^Reply-To: *\/[^ ].*
* ! ^From: *\/[^ ].*
{
   # No Reply-To: or From: header was found.  What to do here
   # is your choice.  *Every* message should have a From: header,
   # and some MTAs (e.g., sendmail) will create one, so this
   # may very well be impossible, in which case anything you put
   # here will be ignored, except for comments which you'll continue
   # to read and ponder until you realize how silly they are.

   # I'll treat this as likely spam
   :0:
   toread/spam
}
# If the previous recipe failed its conditions, then a match was
# found.  Use the match as the target of a ^TO_ search.  ^TO_ was
# introduced in procmail 3.11pre4.  If you don't have at least that,
# just use ^TO
# We exclude anything with a Resent- header to avoid problems with
# lists that change the Reply-To: to point back to the list.
:0E:
* $ ^TO$\MATCH\\>
* ! ^Resent-
toread/spam
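
# Trace of the two recipes above for a hypothetical message carrying
#   Reply-To: bob@example.com
#   To: bob@example.com
# 1. "! ^Reply-To: *\/[^ ].*" finds the header, so MATCH is set to
#    bob@example.com; the "!" turns that hit into a failed condition
#    and the From: condition is never tried.
# 2. The first recipe therefore fails and the E recipe runs, testing
#    the To-type headers against the regexp-quoted MATCH.
# 3. To: names the same address and there is no Resent- header, so
#    the message is filed in toread/spam.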

# and From: = Reply-To:
# I've generated some messages like this myself :-), thus the added
# check against ^FROM_DAEMON
:0
* ! ^From: *\/[^ ].*
{
  # see comments above
  :0:
  toread/spam
}
:0E:
* !^FROM_DAEMON
* $ ^Reply-To:[         ]?$\MATCH\\>
toread/spam

# known SPAM that can't be generalized to fit the above rules
:0:
* $${FROM}mail@mailermachine\.com
toread/spam
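
# An untested sketch (an assumption, not part of the ruleset above) of
# how the two suggestions quoted at the top of this message might look
# as recipes.  The optional words carry their own trailing space so the
# pattern still matches when they are left out.  B searches the body; D
# makes the second match case-sensitive so that [^a-z] really means "no
# lowercase letters".  The last brackets hold a space and a tab.
:0 B:
* reply (with )?(the )?(word )?['"]?remove['"]? in the (subject|body)
toread/spam

:0 BD:
* ^[^a-z]*!!!![ 	]*$
toread/spam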

LOG="spamreject: END
"

   Dan
------------------- message is author's opinion only ------------------
J. Daniel Smith <DanS(_at_)bristol(_dot_)com>        
http://www.bristol.com/~DanS
Bristol Technology B.V.                   +31 33 450 50 50, ...51 (FAX)
Amersfoort, The Netherlands               {info,jobs}(_at_)bristol(_dot_)com