
Re: Spammish? (cumulative scoring methods)

2003-02-16 12:41:59
At 14:34 2003-02-16 +0100, Ruud H.G. van Tol wrote:

[nigerian scams]
> Most of those are caught by (even the most cautious) DNSBLs.

That might be the case (I don't seem to have my DNSBL-rejected mail here though <g>), but I still see a certain number of matches in my (post-DNSBL) received email, and my experience is that they have generally passed through a regular freemail service.

> Don't think I've ever seen a message with less than three Received:
> headers.

Actually, they're quite common - any time someone emails you directly and isn't off on some remote mail gateway, it should be only two or three hops (a user local to your own server is only ONE). Of course, if you use fetchmail, you're adding at least one hop in there yourself because of your fetch and local delivery. If you internally forward a message, you'll add hops. Obviously, if you do stuff like that, you'll need to adjust the logic accordingly.

As a spammish indicator though, a LOT of spam is going to have only one or two received headers (though some spam inserts additional headers to be misleading). Regular mailing lists should have lots more.

Remember, it is merely a _spammish_ indicator, not a positive match, so it generally doesn't matter if your personal contacts frequently have only a couple of hops - this is something you use in conjunction with other indicators to weigh something in as spam.
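
As a rough illustration of the hop-count indicator, here is a minimal Python sketch (Python rather than procmail, purely so the counting can be run in isolation). The function name `hop_chip`, the default cutoff of three hops, and the chip letter "B" are all assumptions made for this example; adjust them for your own fetching and forwarding setup.

```python
from email import message_from_string


def hop_chip(raw_message: str, min_hops: int = 3, chip: str = "B") -> str:
    """Return a chip letter if the message has fewer than min_hops
    Received: headers, otherwise an empty string."""
    msg = message_from_string(raw_message)
    hops = len(msg.get_all("Received", []))
    return chip if hops < min_hops else ""


# A made-up message with only two Received: headers.
raw = (
    "Received: from a.example by b.example; Sun, 16 Feb 2003 12:00:00 +0000\n"
    "Received: from c.example by a.example; Sun, 16 Feb 2003 11:59:00 +0000\n"
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(hop_chip(raw))  # prints "B"
```

The returned letter is only a small contribution to the overall score, in keeping with the point that a low hop count is a hint and not a verdict.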

> Ignore the familiar recent ones, then start counting.

I'm not positive that I follow what you're suggesting, though I suspect that you mean to "exclude your own host and ISP, and count the rest." Well, as originally stated, I didn't declare it that way, but as above, you obviously need to take into consideration your own fetching and forwarding, as those processes affect the received header count. On a typical smtp-smtp message though, there are often three or more.

> What you could do is concatenate a letter to a variable, like
>   { SPAMRATE=$SPAMRATE"G" }
> where each letter stands for some feature or severity. An 'A' could
> mean: strong evidence against spam, a 'Z' would then mean: certainly

BTW, this is an easy way to do simple additive math for the carryover score (otherwise, you have to have interim recipes to "add" the score, or invoke a shell to do math). Imagine assigning each letter a point value, sort of like casino chips:

A = 1
B = 5
C = 10
D = 25
E = 50
F = 100
(etc)

(lest someone get confused, no, you are NOT defining any variables for the above, so that doesn't go in your .procmailrc)

So, let's say that some certain spammy reason has a contributive value of "30" for your spammishness, so:

:0
* conditions
{
        LOG="SPAMMY: reason$NL"

        SPAMMISHNESS="${SPAMMISHNESS}CCC"
        #or "DB", etc - mentally sum the chip values.
}

Down where you go to make the final decision, you might have a final SPAMMISHNESS containing:

        CCCDABFCAABada

This is the concatenated scoring from several recipes which the message triggered, each one having added a letter or two.

# The negative value at the top is your threshold - if the summed score
# EXCEEDS the positive of this, the message is filed in spammish.mbx.
:0D:
* -100^0
* 1^1 SPAMMISHNESS ?? A
* 5^1 SPAMMISHNESS ?? B
* 10^1 SPAMMISHNESS ?? C
* 25^1 SPAMMISHNESS ?? D
* 50^1 SPAMMISHNESS ?? E
* 100^1 SPAMMISHNESS ?? F
* 250^1 SPAMMISHNESS ?? G
* 500^1 SPAMMISHNESS ?? H
* 1000^1 SPAMMISHNESS ?? I
* 2500^1 SPAMMISHNESS ?? J
* 5000^1 SPAMMISHNESS ?? K
* 10000^1 SPAMMISHNESS ?? L
* 25000^1 SPAMMISHNESS ?? M
* 50000^1 SPAMMISHNESS ?? N
* -1^1 SPAMMISHNESS ?? a
* -5^1 SPAMMISHNESS ?? b
* -10^1 SPAMMISHNESS ?? c
* -25^1 SPAMMISHNESS ?? d
* -50^1 SPAMMISHNESS ?? e
* -100^1 SPAMMISHNESS ?? f
* -250^1 SPAMMISHNESS ?? g
* -500^1 SPAMMISHNESS ?? h
* -1000^1 SPAMMISHNESS ?? i
* -2500^1 SPAMMISHNESS ?? j
* -5000^1 SPAMMISHNESS ?? k
* -10000^1 SPAMMISHNESS ?? l
* -25000^1 SPAMMISHNESS ?? m
* -50000^1 SPAMMISHNESS ?? n
spammish.mbx

Note that this recipe uses the 'D' flag for case sensitivity. Thus, you can use LOWERCASE spammishness "chips" to define negatives, which allow you to define _anti-spammishness_ of messages (or to declare 49 as "Ea" instead of "DCCAAAA"). Candidates include certain domains, matches in "recently emailed", or even "I've received other email from this same address already" (most spammers never re-use an address, though commercial "list" spammers do, so a lot of your spam tends to be from an address you've never seen before), any of which would potentially decrease - but not totally negate - the spammishness of a message.
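
To sanity-check the chip arithmetic by hand, a short Python sketch can mimic what the summing recipe does. It is not procmail, and it assumes each letter simply contributes its chip value once per occurrence, with lowercase letters counting negative; `chip_sum` and `THRESHOLD` are names made up for this example.

```python
# Chip values for A..N, taken from the scoring recipe; lowercase = negative.
VALUES = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000]
POSITIVE = dict(zip("ABCDEFGHIJKLMN", VALUES))
NEGATIVE = {letter.lower(): -value for letter, value in POSITIVE.items()}
CHIPS = {**POSITIVE, **NEGATIVE}

THRESHOLD = 100  # mirrors the "-100^0" line in the recipe


def chip_sum(spammishness: str) -> int:
    """Sum the chip values of every letter; unknown characters count as 0."""
    return sum(CHIPS.get(ch, 0) for ch in spammishness)


score = chip_sum("CCCDABFCAABada")
print(score)              # 151
print(score > THRESHOLD)  # True: this message would be filed as spammish
print(chip_sum("Ea"))     # 49
```

Here the sample string breaks down as four C (40), one D (25), three A (3), two B (10) and one F (100), less two a (2) and one d (25).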

Alternately, the letters needn't represent chip values, but simply severity classifications which you assign your own value to - a "class C" spam may be worth 2x or 10x what a "class B" is (or may have no interrelation at all).

Or, the letters could be reason codes - "A" is "no domain in From:" and is assigned some high point value, while "B" is some other reason, and assigned an independent value. I wouldn't be prone to taking this route myself, though the code could be easily embedded into the message header:

:0
* scoring stuff
{
        :0fh
        | formail -A "X-Spam-Codes: $SPAMMISHNESS ($MATCH)"

        :0:
        spam.mbx
}

Likewise, if you use regexps and 'chip' numeric math to do your summing (see below), instead of 'bc', you can embed rule ids:

        SPAMMISHNESS="${SPAMMISHNESS}Rule 12: +25 ;"

And still emit the SPAMMISHNESS string into the message header as above.


Instead of the letter chips (which admittedly could be difficult to keep straight without a cheat-sheet), you could instead have numeric scoring, but still with chip values:

SPAMMISHNESS="${SPAMMISHNESS}+25 +5 "

(not simply +30, since that isn't a "chip value" - if you can think in terms of casino chips, this really isn't hard to follow at all - I raise you a blue, two reds, and a white).

In the final additive function:

* 1^1 SPAMMISHNESS ?? \+1[ ]
* 5^1 SPAMMISHNESS ?? \+5[ ]
* 10^1 SPAMMISHNESS ?? \+10[ ]
* 25^1 SPAMMISHNESS ?? \+25[ ]
(etc)
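
The space-terminated patterns can be tried out in Python as a stand-in for procmail's repeated-match scoring. This sketch assumes that a `w^1` condition adds `w` once per non-overlapping match, which is how `re.findall` counts; `numeric_chip_sum`, the shortened chip list, and the "Rule 7" id are hypothetical and chosen only for illustration.

```python
import re

CHIP_VALUES = [1, 5, 10, 25, 50, 100]  # extend with larger chips as needed


def numeric_chip_sum(spammishness: str) -> int:
    """Add each chip value once for every '+<value> ' found in the string."""
    total = 0
    for value in CHIP_VALUES:
        # Same shape as the condition lines: literal plus, value, one space.
        pattern = r"\+" + str(value) + r"[ ]"
        total += value * len(re.findall(pattern, spammishness))
    return total


print(numeric_chip_sum("+25 +5 "))                     # 30
print(numeric_chip_sum("Rule 12: +25 ;Rule 7: +5 ;"))  # 30
```

The second call shows that embedded rule ids do not disturb the sum, because only a plus sign followed by a chip value and a space is counted.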

Note that each of the numerics MUST be separated by a space. If you did it like so instead:

SPAMMISHNESS="+25+25+5"

* 25^1 SPAMMISHNESS ?? \+25([^0-9]|$)

Such a regexp would cause problems when you have two consecutive identical values. It would evaluate the above example to 30, not 55: the regexp parser "eats" the following character (the plus in this case) because it matches the "[^0-9]" expression, so the second +25 remains in the regexp buffer not as "+25" but as "25" (which doesn't match "+25"). The different chip values - which are handled as separate condition lines (and thus separate regexps) - each re-evaluate the whole string for their respective chip values, so "+25+5" would evaluate fine.
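
The same consumption effect is easy to reproduce outside procmail. In this Python sketch, `re.findall` scans for non-overlapping matches, which is assumed here to be a fair model of the behaviour described above; the helper `count` is invented for the demonstration, and `(?:...)` is just Python's non-capturing form of the parenthesised group.

```python
import re


def count(pattern: str, text: str) -> int:
    """Number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))


unspaced = "+25+25+5"
# The first match is "+25+": the trailing "+" is consumed, so what is
# left to scan starts with "25", which no longer matches "\+25".
hits_25 = count(r"\+25(?:[^0-9]|$)", unspaced)
hits_5 = count(r"\+5(?:[^0-9]|$)", unspaced)
print(25 * hits_25 + 5 * hits_5)  # 30, not 55

spaced = "+25 +25 +5 "
# With a space after every value, nothing a later match needs is consumed.
print(25 * count(r"\+25[ ]", spaced) + 5 * count(r"\+5[ ]", spaced))  # 55
```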

Using the chip values saves you from having to invoke 'bc' or a similar program to evaluate the additive score string. You _could_ use absolute numerics (="+200") OR continue using the same chip mechanism, and do the following at the summing stage instead, at the cost of some shell overhead, but greatly simplifying the summing logic:

# right before final score decision
# the prefixed zero is because bc doesn't like +25+30, but can handle 0+25+30
# (you could initialize the SPAMMISHNESS string to 0 in your recipes instead)
SPAMMISHNESS=`echo "0${SPAMMISHNESS}" | bc`

Then, the recipes could just tack an absolute spammishness value into the string, but individually, the recipes do not have to resolve the running total (which is overhead you really don't want).
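
For comparison, what the `bc` call computes can be modelled in a few lines of Python. This is only a stand-in for the shell pipeline, not something to drop into a .procmailrc, and `total_score` is a hypothetical helper; it assumes the string contains nothing but signed integers and optional spaces.

```python
import re


def total_score(spammishness: str) -> int:
    """Sum every signed integer (e.g. '+25', '-10') in the string."""
    return sum(int(token) for token in re.findall(r"[+-]\d+", spammishness))


print(total_score("+25+30"))       # 55
print(total_score("+200-25 +5 "))  # 180
print(total_score(""))             # 0, like bc evaluating the bare "0"
```

Because the total is computed once at the end, each recipe only has to append its own signed value to the string.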

Note that the above methods do _not_ promote a system whereby individual _scores_ from tests are summed, which is a painful process involving recipes which follow EACH recipe to tack the results on (esp. for recipes with negative scores).

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
