Re: Spammish?

On 15 Feb, fleet(_at_)teachout(_dot_)org wrote:
| [...]
| 
| > >But how do I tie this to the message so I can file the spam when it has
| > >completed a trip through the filters (without using formail)?
| >
| > The VARIABLE.  It remains set as you progress through the procmailrc.
| 
| Somehow I was afraid that the VARIABLE wouldn't necessarily accompany the
| message through the filters.  Sometimes I think funny, I guess.
| 
| This has been an illuminating session.  I'm still not entirely convinced
| that a message can be "spammish;" but you've made a couple of comments
| that have shaken my conviction. :)  Another Vermont yankee characteristic
| I guess - hardheadedness!

This is a little long, but maybe another example will help.

My spamchkrc currently has 40 spam tests.  Following are a couple of
the more straightforward recipes.  Note that there are a fair number of
variables that are set before a message gets to this point.  How and
where they're set isn't really pertinent, and the names should be
self-explanatory as far as understanding what this is doing, but these
cannot be copied/pasted as is and expected to do any good.

sd = '(\.|[%=]2E)'
SPAMSCORE = ${SPAMSCORE:-0}

:0
* $ 1^0 ()\/(^TO(friend|public${sd}com)|^(To|Cc):$wsstar(_at_)$DOMAINS)
*   1^0 MATCH ?? ^^()^\/.+
{
  SPAMMATCH = "$MATCH"
  SPAMTYPE = 'To: (1)'
  ADDSPAMSCORE = 5
  INCLUDERC = $RCHELPERS/rc.spamtype
}
SPAMMATCH

:0
* $ ${TOCOUNT:-0}^0
*             -15^0
{
  SPAMTYPE = "To: $TOCOUNT recipients"
  xCHKBOTH = -$max
  ADDSPAMSCORE = 1
  INCLUDERC = $RCHELPERS/rc.spamtype
}

:0
* $ ${CCCOUNT:-0}^0
*             -15^0
{
  SPAMTYPE = "Cc: $CCCOUNT recipients"
  xCHKBOTH = -$max
  ADDSPAMSCORE = 1
  INCLUDERC = $RCHELPERS/rc.spamtype
}

:0
* $ ${xCHKBOTH:-0}^0
* $  ${TOCOUNT:-0}^0
* $  ${CCCOUNT:-0}^0
*              -15^0
{
  SPAMTYPE="To: $TOCOUNT + Cc: $CCCOUNT = $= recipients"
  ADDSPAMSCORE = 1
  INCLUDERC = $RCHELPERS/rc.spamtype
}
xCHKBOTH

# Obvious Subject: (1) ALL CAPS (and punctuation)
:0 D
*   SUBJECT ?? ^^[^a-z]+^^
* ! SUBJECT ?? XFREE86
{
  SPAMMATCH = "$SUBJECT"
  SPAMTYPE = 'Subject: (1)'
  ADDSPAMSCORE = 2
  INCLUDERC = $RCHELPERS/rc.spamtype
}
SPAMMATCH

# Obvious Subject: (3)
:0 D
* SUBJECT ?? ()\/(\$\$+|!!!+|MONEY|FAST|[^X]FREE[^8])
{
  SPAMMATCH = "$MATCH"
  SPAMTYPE = 'Subject: (3)'
  ADDSPAMSCORE = 1
  INCLUDERC = $RCHELPERS/rc.spamtype
}
SPAMMATCH

Ok. That's enough to illustrate it.  The first recipe checks for an
obvious To:, the next 3 check the number of recipients in To:, in Cc:,
and in both combined, and the last 2 check for obvious subjects.

Notice each one assigns a numeric value to ADDSPAMSCORE then passes
processing to rc.spamtype.  rc.spamtype does housekeeping for all the
spam recipes.  It adds a header, deals with SPAMMATCH if there is one,
accumulates all the SPAMTYPE's in another variable for final logging,
and accumulates the total SPAMSCORE with a recipe like:

:0
* $         $SPAMSCORE^0
* $ ${ADDSPAMSCORE:-1}^0
{
  SPAMSCORE = $=
  MATCH
}

So if a message matched both of the first 2 recipes, it would have a
SPAMSCORE of 6 as it was being passed to the 3rd.  If it matched the
2nd and 5th, it would have a SPAMSCORE of 3 as it was being passed to
the 6th.  Relating that back to the spamminess terminology, the first
message, with a SPAMSCORE of 6, is more spammy than the second with a
score of 3.  Looking at the recipes, you can see the ADDSPAMSCORE values
range from 1 to 5.  That's because, for example, with my mail usage a
message that passes the first recipe (obvious To:) is far more likely
to be spam (score 5) than one that has a lot of recipients (score 1). 
One with all caps in the Subject is more likely to be spam (score 2)
than one with many recipients, but still not as obvious as the To:
match.
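
For anyone who finds the arithmetic easier to follow outside procmail,
here is a rough model of it in Python.  It is only a sketch: the list
ADD_SCORES and the function spamscore() are names made up for this
illustration, and the values are simply the ADDSPAMSCORE settings of
the six recipes above, in order.  It does not model xCHKBOTH, which
keeps the 4th recipe from matching when the 2nd or 3rd already did.

```python
# ADDSPAMSCORE of the six recipes shown above, in the order they appear.
# (Illustrative model only; the real accumulation happens in rc.spamtype.)
ADD_SCORES = [5, 1, 1, 1, 2, 1]


def spamscore(matched, start=0):
    """Return the accumulated score for a set of matched recipe numbers.

    `matched` holds 1-based recipe numbers; `start` plays the role of
    the SPAMSCORE a message arrives with (0 by default).
    """
    total = start
    for number in sorted(matched):
        total += ADD_SCORES[number - 1]
    return total


print(spamscore({1, 2}))  # first two recipes: 5 + 1 = 6
print(spamscore({2, 5}))  # 2nd and 5th recipes: 1 + 2 = 3
```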

Normally, I'd have rc.spamtype test the total SPAMSCORE after it's been
incremented to see if it's over the threshold.  If so, it'd be marked
as spam and delivered so no more processing was necessary in spamchkrc. 
In that case, matching the first recipe would be enough and no further
processing would be done. However I've made a conscious decision to run
every message that gets this far through every spam recipe.  It's set
up for maximum information and not optimal performance. The headers
that are added and the log entries written provide some verbose
diagnostics I can look at to see what's catching what.  I can get away
with that because they're my machines so I don't have to worry about
being friendly. Plus fewer than 3 or 4 messages a week get as far as
the spam checks anyway.
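
The difference between the two approaches can be sketched in Python
(again a model, not procmail).  The function name run_checks() and the
threshold of 4 are assumptions picked for the example, not values from
my setup; the list stands for a hypothetical message that matched the
1st, 2nd, 5th and 6th recipes above.

```python
def run_checks(additions, threshold, stop_early):
    """Walk the per-recipe score additions in order.

    `additions` holds what each recipe would add (0 if it did not match).
    With `stop_early`, stop as soon as the total is over `threshold`.
    Returns (total score, number of recipes actually run).
    """
    total = 0
    checks_run = 0
    for add in additions:
        checks_run += 1
        total += add
        if stop_early and total > threshold:
            break
    return total, checks_run


# Hypothetical message matching recipes 1, 2, 5 and 6 (scores 5, 1, 2, 1).
message = [5, 1, 0, 0, 2, 1]

print(run_checks(message, threshold=4, stop_early=True))   # (5, 1)
print(run_checks(message, threshold=4, stop_early=False))  # (9, 6)
```

Stopping early saves five recipes here, but the log would only ever
show the first match, which is exactly the information I want to keep.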

The idea is that something like too many recipients is not enough for me
to deem a message spam.  Nor is html or all caps in the subject. But if
a message has all 3 of those characteristics, the odds that it is spam
have risen appreciably.  Matching any one characteristic makes it
spammy, and each additional one makes it more spammy, until it reaches a
level of spamminess where it's easy to say it's spam.  I have many tests
that are enough by themselves to mark a message spam, but there are many
more that are a gray area until enough of them match in conjunction with
others.  Then, what was once gray becomes more black and white.

Having a separate rcfile to accumulate variables and scores means I don't
have to duplicate the same code over and over (except for the common
variable assignments and INCLUDERC).  It's much easier for maintenance.
As has been mentioned by others, the recipes in spamchkrc can be
ordered for performance or anything else your heart desires.  One of
these days I want to test if performance is optimized by having the
least costly (header searches) before the more costly (body search)
recipes, or if it's better to have the most frequently matched recipes,
regardless of processing "cost", before the others.  Many of my
cheapest recipes are old and rarely hit any more. It may be more
efficient to run more expensive recipes with a greater likelihood of
matching first.  Of course in my setup where they all get run, it
doesn't matter.

Finally a couple of unrelated notes in case someone can't make sense of
the ADDSPAMSCORE choices.  The recipient counts are intended to check
both To: and Cc: counts independently, where each will add 1 if over
15. If either one matches, then I do not bother to sum the two, so 2 is
the maximum mark against a message with many recipients.  If neither
matches, then I check the sum and add 1 if the sum is greater than 15
(e.g. 10 To: and 10 Cc:).  The second Subject: recipe might seem like
it's deserving of a greater score, but anything that matches this one
almost certainly matched the one before it.  That's why it only earns 1
additional demerit.
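
The recipient-count logic described above boils down to the following,
sketched in Python rather than procmail.  The function name
recipient_demerits() and the `limit` parameter are made up for the
illustration; 15 is the limit used in the recipes.

```python
def recipient_demerits(to_count, cc_count, limit=15):
    """Return the score added by the three recipient-count recipes.

    To: and Cc: are each checked on their own and add 1 if over `limit`.
    The combined count is only checked when neither matched on its own,
    which mirrors what setting xCHKBOTH = -$max achieves in the recipes.
    """
    score = 0
    if to_count > limit:
        score += 1
    if cc_count > limit:
        score += 1
    if score == 0 and to_count + cc_count > limit:
        score += 1
    return score


print(recipient_demerits(10, 10))  # neither alone, but 20 combined: 1
print(recipient_demerits(20, 20))  # both over the limit: 2 (the maximum)
print(recipient_demerits(5, 5))    # nothing over the limit: 0
```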

-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail