Re: Rejecting multiple names/subjects at once?

Era Eriksson posted a set of spam-killing recipes ...

| Here's what I use:
| 
|  SHELL=/bin/sh
| 
|  SPAM="!!!+|\$\$+|(,000)+|magazine| ... etcetera, make your own ;-)
| 
|  :1:
|  $^Subject:.*($SPAM).*($SPAM).*($SPAM)
|  $HOME/scratch/inbox/spam
| 
|  :2:
|  $^Subject:.*($SPAM).*($SPAM)
|  ^From:.*@[^ ]+\.com[ ]
|  $HOME/scratch/inbox/spam
| 
|  :2:
|  $^Subject: .*($SPAM|web)
|  ^From: .*(earthlink\.net|spray\.com|spraynet\.com|spray\.net|pipeline\.com)
|  $HOME/scratch/inbox/spam

and asked,

|   This could be made a lot more straightforward with scoring (man
| procmailsc) but I have yet to see an implementation. I have asked on
| this list if somebody was using scoring to hunt for spams but no
| replies so far. 

Well, ok.  Era's procmail doesn't have scoring, nor apparently even asterisk
notation for conditions (the recipes above give condition counts in ":1:"
and ":2:" instead), but the latter came first, so if we have scoring, we
have asterisk notation.

| SPAM="!!!+|\$\$+|(,000)+|magazine| ... " # etcetera, make your own ;-)

Of course, if you're going to test for (,000)+ as an entire alternative
by itself with no need for anything specific to the left or the right of
it, you might as well just test for ,000; also, I personally would prefer
to include the outer parentheses at this point rather than in every use of
the variable below.  So let's make it

  SPAM="(!!!+|[$][$]+|,000|magazine| ... )" # etcetera, make your own
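
To check the claim about (,000)+ versus plain ,000 outside procmail, here
is a small Python sketch.  It assumes Python's re module behaves like
procmail's egrep-style matching for these two patterns, and the sample
subjects are made up for illustration:

```python
import re

# Made-up sample subjects, purely for illustration.
samples = [
    "Make $1,000,000 fast",
    "Win 5,000 dollars",
    "nothing suspicious here",
    "1,00 0 is not a match",
]

# For an unanchored search, "(,000)+" finds something exactly when
# ",000" does, so the repetition adds nothing to a yes/no test.
agree = [
    (re.search(r"(,000)+", s) is not None) == (re.search(r",000", s) is not None)
    for s in samples
]
print(agree)
```

Every entry should come out True, since a string contains one or more
copies of ,000 exactly when it contains one.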

Now, scoring looks for *non-overlapping* occurrences, so this:

  * $ 1^1 ^Subject:.*$SPAM

would score only 1 no matter how many times $SPAM appears in the subject,
because every match has to begin at the same "^Subject:".  A message would
need to have multiple subject lines, two or more of them containing matches
to $SPAM, to score more than one from that condition.  So we work around
the overlap by saving the rest of the subject in a variable:

  :0
  * ^Subject:\/.*
  { SUBJECT=$MATCH }

Now we can scan $SUBJECT for appearances of $SPAM and get an accurate count.
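
For anyone who wants to see the difference in counts without running
procmail, here is a rough Python model.  It is only a sketch: the pattern
is an abbreviated stand-in for $SPAM using just the alternatives spelled
out above, the header is invented, and Python's re module is assumed to
count non-overlapping matches the way procmail's scoring does.

```python
import re

# Abbreviated stand-in for $SPAM: only the alternatives the post spells out.
SPAM = r"(?:!!!+|[$][$]+|,000|magazine)"

# Invented header for illustration.
header = "Subject: FREE magazine!!! Earn $$$ now"

# Anchored form: each match must start at "Subject:", so a single
# header line can match at most once.
anchored = re.findall(r"^Subject:.*" + SPAM, header, flags=re.I | re.M)

# Rough equivalent of "^Subject:\/.*": keep what follows the header name.
subject = re.match(r"Subject:(.*)", header, flags=re.I).group(1)

# Unanchored scan of the saved text: every non-overlapping match counts.
hits = re.findall(SPAM, subject, flags=re.I)

print(len(anchored), len(hits))
```

The anchored search finds one match for the whole line, while the scan of
the saved subject finds "magazine", "!!!" and "$$$" separately.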

If it weren't for the need to count "web" when the message is from a 2-point
site but not otherwise, the recipe could be as simple as the one below.  Note
that spray.com, spraynet.com, and pipeline.com addresses will get 1 point
from the third condition and 1 from the fourth.  The first condition is
unweighted (and thus absolute) for a quick exit when there's no need to
score.

  :0: # 2 points or fewer acceptable, but more than 2 points and you're out
  * $ SUBJECT ?? web|$SPAM
  * $ 1^1 SUBJECT ?? $SPAM
  * 1^0 ^From:.*@[^ ]+\.com[ ]
  * 1^0 ^From:.*(spray(net)?|pipeline)\.com
  * 2^0 ^From: .*(earthlink|spray)\.net
  * -2^0
  $HOME/scratch/inbox/spam

But we do have that complication, so let's have at it.  If your version of
procmail does not allow comments interleaved in the middle of a recipe,
move them to a safe place:

  :0:
 # If the subject is clean, escape unconditionally.
  * $ SUBJECT ?? web|$SPAM
 # Score .9 for each appearance ("^1") of a match to $SPAM in the subject:
  * $ .9^1 SUBJECT ?? $SPAM
 # Score .8 for any .com site:
  * .8^0 ^From:.*@[^ ]+\.com[ ]
 # Score 1.2 (total 2) for suspect .com sites:
  * 1.2^0 ^From:.*(spray(net)?|pipeline)\.com
 # Score 2 for suspect .net sites:
  * 2^0 ^From: .*(earthlink|spray)\.net
 # Score .1 if "web" is in the subject at least once:
  * .1^0 SUBJECT ?? web
 # Forgive scores of 2 or lower:
  * -2^0
 # If net score is still positive, shunt to spam folder:
  $HOME/scratch/inbox/spam

It should work out like this:

Three appearances of $SPAM in the subject guarantee at least +.7 and will
cause rejection whether or not "web" is also in the subject and regardless
of the site of origin.

Mail from a non-suspect .com site with neither "web" nor $SPAM escapes.
Mail from a non-suspect .com site with "web" but no $SPAM scores -1.1.
Mail from a non-suspect .com site with one $SPAM but no "web" scores -.3.
Mail from a non-suspect .com site with "web" and one $SPAM scores -.2.
None of those result in rejection.

Mail from a non-suspect .com site with two $SPAMs scores +.6 without "web"
or +.7 with "web" and is rejected in either situation.

Coming from a suspect site is worth 2 points (.8+1.2 for those in .com and
simply 2 for those in .net); adding "web" or even one $SPAM is enough to get
a positive final score and to be rejected.

Mail from a suspect site with a clean subject escapes.
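
The arithmetic above can be double-checked with a few lines of Python.
This is only a model of the weights in the commented recipe, not of
procmail itself; the function name net_score and the site labels "com",
"suspect_com", "suspect_net" and "other" are made up for the sketch.

```python
def net_score(spam_hits, web, site):
    """Model the commented recipe's weights.

    spam_hits: number of $SPAM matches in the subject
    web:       True if "web" appears in the subject
    site:      "com", "suspect_com", "suspect_net" or "other" (made-up labels)
    Returns None when the unweighted first condition fails (clean subject),
    otherwise the net score; a positive score means the spam folder.
    """
    if spam_hits == 0 and not web:
        return None                      # clean subject: unconditional escape
    score = 0.9 * spam_hits              # .9 per appearance of $SPAM
    if site in ("com", "suspect_com"):
        score += 0.8                     # any .com site
    if site == "suspect_com":
        score += 1.2                     # suspect .com sites (total 2)
    if site == "suspect_net":
        score += 2.0                     # suspect .net sites
    if web:
        score += 0.1                     # "web" in the subject
    return round(score - 2.0, 2)         # forgive scores of 2 or lower


for case in [(3, False, "other"), (0, True, "com"), (1, False, "com"),
             (1, True, "com"), (2, False, "com"), (2, True, "com"),
             (0, True, "suspect_net"), (1, False, "suspect_com"),
             (0, False, "suspect_net")]:
    print(case, net_score(*case))
```

The rounding is there only to hide floating-point noise; the weights
themselves are the ones in the recipe.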