Re: Spammish? (cumulative scoring methods)

At 14:34 2003-02-16 +0100, Ruud H.G. van Tol wrote:

[nigerian scams]

> Most of those are caught by (even the most cautious) DNSBLs.

That might be the case (I don't seem to have my DNSBL-rejected mail here though <g>), but I still see a certain number of matches in my (post-DNSBL) received email, and my experience is that they have generally passed through a regular freemail service.

> Don't think I've ever seen a message with less than three Received:
> headers.

Actually, they're quite common - any time someone emails you directly and isn't off on some remote mail gateway, it should be only two or three hops (a user local to your own server is only ONE). Of course, if you use fetchmail, you're adding at least one hop in there yourself because of your fetch and local delivery. If you internally forward a message, you'll add hops. Obviously, if you do stuff like that, you'll need to adjust the logic accordingly.

As a spammish indicator though, a LOT of spam is going to have only one or two received headers (though some spam inserts additional headers to be misleading). Regular mailing lists should have lots more.

Remember, it is merely a _spammish_ indicator, not a positive match, so it generally doesn't matter if your personal contacts frequently have only a couple of hops - this is something you use in conjunction with other indicators to weigh something in as spam.
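
As a rough illustration of the hop count idea, here is a minimal shell sketch that counts the Received: headers of a message on stdin. The function name count_received, the sample message, and the cutoff of two hops are assumptions for illustration only; adjust the cutoff for your own fetching and forwarding.

```shell
#!/bin/sh
# count_received: print the number of Received: header fields in a message
# read from stdin. Only the header section (everything up to the first
# blank line) is examined, so "Received:" text in the body is not counted.
count_received() {
    sed '/^$/q' | grep -ci '^Received:'
}

# Hypothetical sample message with two hops.
sample='Received: from mx.example.net by mail.example.org; Sun, 16 Feb 2003 15:00:00 +0100
Received: from client.example.net by mx.example.net; Sun, 16 Feb 2003 14:59:58 +0100
From: someone@example.net
Subject: test

Received: this line is body text and must not be counted
'

hops=$(printf '%s' "$sample" | count_received)

# Assumed cutoff: one or two hops only adds weight, it does not decide.
if [ "$hops" -le 2 ]; then
    echo "few hops ($hops): add a small spammish weight"
else
    echo "hops=$hops: no weight from this test"
fi
```

The result is only one input to the overall score, in keeping with the "indicator, not a positive match" point above.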

> Ignore the familiar recent ones, then start counting.

I'm not positive that I follow what you're suggesting, though I suspect that you mean to "exclude your own host and ISP, and count the rest." Well, as originally stated, I didn't declare it that way, but as above, you obviously need to take into consideration your own fetching and forwarding, as those processes affect the received header count. On a typical smtp-smtp message though, there are often three or more.

> What you could do is concatenate a letter to a variable, like
>   { SPAMRATE=$SPAMRATE"G" }
> where each letter stands for some feature or severity. An 'A' could
> mean: strong evidence against spam, a 'Z' would then mean: certainly

BTW, this is an easy way to do simple additive math for the carryover score (otherwise, you have to have interim recipes to "add" the score, or invoke a shell to do math). Imagine assigning each letter a point value, sort of like casino chips:


A = 1
B = 5
C = 10
D = 25
E = 50
F = 100
(etc)

(lest someone get confused, no, you are NOT defining any variables for the above, so that doesn't go in your .procmailrc)

So, let's say that some certain spammy reason has a contributive value of "30" for your spammishness, so:


:0
* conditions
{
        LOG="SPAMMY: reason$NL"

        SPAMMISHNESS="${SPAMMISHNESS}CCC"
        #or "DB", etc - mentally sum the chip values.
}
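
If summing chips in your head gets tedious while writing recipes, a small helper can do the decomposition for you. This is only a sketch: to_chips is a hypothetical name, it runs outside procmail while you edit your rc file, and it uses the A-F values from the table above with a simple largest-chip-first rule (so it gives "DB" for 30, which is equivalent to "CCC").

```shell
#!/bin/sh
# to_chips: print a chip string for a non-negative integer, always taking
# the largest chip that still fits. Values follow the A..F table above.
to_chips() {
    n=$1
    out=""
    for pair in F:100 E:50 D:25 C:10 B:5 A:1; do
        letter=${pair%%:*}    # part before the colon, e.g. "D"
        value=${pair#*:}      # part after the colon, e.g. "25"
        while [ "$n" -ge "$value" ]; do
            out="$out$letter"
            n=$(( n - value ))
        done
    done
    printf '%s\n' "$out"
}

to_chips 30    # prints DB
to_chips 49    # prints DCCAAAA
```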

Down where you go to make the final decision, you might have a final SPAMMISHNESS containing:


        CCCDABFCAABada

This is the concatenated scoring from several recipes which the message triggered, each one having added a letter or two.


# negative value at the top would be your threshold - if the score EXCEEDS
# the positive of this, the message will be filed as spammish.
:0D:
* -100^0
* 1^1 SPAMMISHNESS ?? A
* 5^1 SPAMMISHNESS ?? B
* 10^1 SPAMMISHNESS ?? C
* 25^1 SPAMMISHNESS ?? D
* 50^1 SPAMMISHNESS ?? E
* 100^1 SPAMMISHNESS ?? F
* 250^1 SPAMMISHNESS ?? G
* 500^1 SPAMMISHNESS ?? H
* 1000^1 SPAMMISHNESS ?? I
* 2500^1 SPAMMISHNESS ?? J
* 5000^1 SPAMMISHNESS ?? K
* 10000^1 SPAMMISHNESS ?? L
* 25000^1 SPAMMISHNESS ?? M
* 50000^1 SPAMMISHNESS ?? N
* -1^1 SPAMMISHNESS ?? a
* -5^1 SPAMMISHNESS ?? b
* -10^1 SPAMMISHNESS ?? c
* -25^1 SPAMMISHNESS ?? d
* -50^1 SPAMMISHNESS ?? e
* -100^1 SPAMMISHNESS ?? f
* -250^1 SPAMMISHNESS ?? g
* -500^1 SPAMMISHNESS ?? h
* -1000^1 SPAMMISHNESS ?? i
* -2500^1 SPAMMISHNESS ?? j
* -5000^1 SPAMMISHNESS ?? k
* -10000^1 SPAMMISHNESS ?? l
* -25000^1 SPAMMISHNESS ?? m
* -50000^1 SPAMMISHNESS ?? n
spammish.mbx
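
To check the arithmetic of the recipe above by hand, here is a rough shell model of what the weighted conditions add up to. It is not procmail itself: chip_score is a hypothetical helper, it only covers chips A-F and a-f, and it simply counts each letter and multiplies by its value, which is what the w^1 conditions amount to.

```shell
#!/bin/sh
# chip_score: sum the chip values of a SPAMMISHNESS string (case sensitive,
# like the 'D' flag on the recipe). Only A-F and a-f are modelled here.
chip_score() {
    chips=$1
    total=0
    for pair in A:1 B:5 C:10 D:25 E:50 F:100 \
                a:-1 b:-5 c:-10 d:-25 e:-50 f:-100; do
        letter=${pair%%:*}
        value=${pair#*:}
        # Keep only this letter, then count how many are left.
        count=$(printf '%s' "$chips" | tr -cd "$letter" | wc -c)
        total=$(( $total + $count * $value ))
    done
    echo "$total"
}

score=$(chip_score CCCDABFCAABada)
echo "chip total: $score"                 # prints: chip total: 151

# The -100^0 line is the threshold; the recipe matches on a positive result.
if [ $(( score - 100 )) -gt 0 ]; then
    echo "over threshold: file as spammish"
fi
```

For the example string that is 4 C's, 1 D, 3 A's, 2 B's and 1 F (178) less 2 a's and 1 d (27), giving 151, which clears the threshold of 100.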

Note that this recipe uses the 'D' flag for case sensitivity. Thus, you can use LOWERCASE spammishness "chips" to define negatives, which allow you to define _anti-spammishness_ of messages (or, to declare 49 as "Ea" instead of "DCCAAAA"). Certain domains, matches in "recently emailed", or even "I've received other email from this same address already" (most spammers never re-use, though commercial "list" spammers do, so a lot of your spam tends to be from an address you've never seen before), etc., would potentially decrease - but not totally negate - the spammishness of a message.

Alternately, the letters needn't represent chip values, but simply severity classifications which you assign your own value to - a "class C" spam may be worth 2x or 10x what a "class B" is (or may have no interrelation at all).

Or, the letters could be reason codes - "A" is "no domain in From:" and is assigned some high point value, while "B" is some other reason, and assigned an independent value. I wouldn't be prone to taking this route myself, though the code could be easily embedded into the message header:


:0
* scoring stuff
{
        :0fh
        | formail -A "X-Spam-Codes: $SPAMMISHNESS ($MATCH)"

        :0:
        spam.mbx
}

Likewise, if you use regexps and 'chip' numeric math to do your summing (see below), instead of 'bc', you can embed rule ids:


        SPAMMISHNESS="${SPAMMISHNESS}Rule 12: +25 ;"

And still emit the SPAMMISHNESS string into the message header as above.

Instead of the letter chips (which admittedly could be difficult to keep straight without a cheat-sheet), you could instead have numeric scoring, but still with chip values:


SPAMMISHNESS="${SPAMMISHNESS}+25 +5 "

(not simply +30, since that isn't a "chip value" - if you can think in terms of casino chips, this really isn't hard to follow at all - I raise you a blue, two reds, and a white).


In the final additive function:

* 1^1 SPAMMISHNESS ?? \+1[ ]
* 5^1 SPAMMISHNESS ?? \+5[ ]
* 10^1 SPAMMISHNESS ?? \+10[ ]
* 25^1 SPAMMISHNESS ?? \+25[ ]
(etc)
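
The same counting can be tried out in the shell before committing it to a recipe. The sketch below is an approximation, not procmail: numeric_score is a hypothetical helper that uses grep -oE (a common GNU/BSD extension) to count non-overlapping matches of each "+N " chip, and it only knows the chip values 1 through 100.

```shell
#!/bin/sh
# numeric_score: sum the "+N " chips found in a string, one regexp per chip
# value, each chip requiring a trailing space (as in the conditions above).
numeric_score() {
    s=$1
    total=0
    for value in 1 5 10 25 50 100; do
        # Count matches of a literal plus, the chip value, then a space.
        count=$(printf '%s\n' "$s" | grep -oE "\\+${value}[ ]" | wc -l)
        total=$(( $total + $count * $value ))
    done
    echo "$total"
}

numeric_score "+25 +5 +25 "                   # prints 55
numeric_score "Rule 12: +25 ;Rule 7: +5 ;"    # prints 30
```

Note that the "12" in a rule id is harmless, since it has no leading plus sign.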

Note that each of the numerics MUST be separated by a space. If you did it like so instead:


SPAMMISHNESS="+25+25+5"

* 25^1 SPAMMISHNESS ?? \+25([^0-9]|$)

Such a regexp would cause problems when you have two consecutive identical values. It would evaluate the above example to 30, not 55, because the regexp parser "eats" the following character (the plus in this case), since that matches the "[^0-9]" expression, and thus the following +25 doesn't remain in the regexp buffer as "+25", but rather as "25" (which doesn't match "+25"). The different chip values - which are handled as separate condition lines (and thus separate regexps) - will re-evaluate the line for their respective chip values, so "+25+5" would evaluate fine.
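
The "eaten" separator is easy to reproduce outside procmail. The following sketch uses grep -oE, which also reports non-overlapping matches, as a stand-in for the recipe's regexp engine; the helper name weighted and the restriction to the 5 and 25 chips are assumptions made to keep the example short.

```shell
#!/bin/sh
# weighted STRING SUFFIX_ERE: sum the +5 and +25 chips in STRING, where
# SUFFIX_ERE is the regexp fragment that must follow each chip value.
weighted() {
    total=0
    for value in 5 25; do
        n=$(printf '%s\n' "$1" | grep -oE "\\+${value}$2" | wc -l)
        total=$(( $total + $n * $value ))
    done
    echo "$total"
}

# Unseparated chips: the first match "+25+" swallows the plus sign that the
# second +25 needed, so that chip is missed.
weighted "+25+25+5" '([^0-9]|$)'    # prints 30

# Space-separated chips: each match consumes only its own trailing space.
weighted "+25 +25 +5 " '[ ]'        # prints 55
```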

Using the chip values saves you from having to invoke 'bc' or a similar function to evaluate the additive score string. You _could_ use absolute numerics (="+200") OR continue using the same chip mechanism, and do the following at the summing stage instead, at the cost of some shell overhead, but greatly simplifying the summing logic:


# right before final score decision
# the prefixed zero is because bc doesn't like +25+30, but can handle 0+25+30
# (you could initialize the SPAMMISHNESS string to 0 in your recipes instead)
SPAMMISHNESS=`echo "0${SPAMMISHNESS}" | bc`

Then, the recipes could just tack an absolute spammishness value into the string, but individually, the recipes do not have to resolve the running total (which is overhead you really don't want).
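
To see what the bc step does with such a string, you can try it at a shell prompt first. This is a sketch assuming bc is installed; sum_scores is a hypothetical wrapper name, and the leading zero is the same trick as in the assignment above.

```shell
#!/bin/sh
# sum_scores: evaluate a string of signed values such as "+25+30" with bc.
# The "0" prefix turns the leading sign into an ordinary binary operator.
sum_scores() {
    echo "0$1" | bc
}

# Example calls (left as comments so the file defines the helper only):
#   sum_scores "+25+30"       gives 55
#   sum_scores "+200-50+5"    gives 155
```

Negative contributions need no special handling here, since "-50" is just subtracted.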

Note that the above methods do _not_ promote a system whereby individual _scores_ from tests are summed, which is a painful process involving recipes which follow EACH recipe to tack the results on (esp. for recipes with negative scores).


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.