Re: Spammish? (cumulative scoring methods)
2003-02-16 12:41:59
At 14:34 2003-02-16 +0100, Ruud H.G. van Tol wrote:
[nigerian scams]
> Most of those are caught by (even the most cautious) DNSBLs.
That might be the case (I don't seem to have my DNSBL-rejected mail here
though <g>), but I still see a certain number of matches in my (post-DNSBL)
received email, and my experience is that they have generally passed
through a regular freemail service.
> Don't think I've ever seen a message with less than three Received:
> headers.
Actually, they're quite common - any time someone emails you directly and
isn't off on some remote mail gateway, it should be only two or three hops
(a user local to your own server is only ONE). Of course, if you use
fetchmail, you're adding at least one hop in there yourself because of your
fetch and local delivery. If you internally forward a message, you'll add
hops. Obviously, if you do stuff like that, you'll need to adjust the
logic accordingly.
As a spammish indicator though, a LOT of spam is going to have only one or
two received headers (though some spam inserts additional headers to be
misleading). Regular mailing lists should have lots more.
Remember, it is merely a _spammish_ indicator, not a positive match, so it
generally doesn't matter if your personal contacts frequently have only a
couple of hops - this is something you use in conjunction with other
indicators to weigh something in as spam.
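As a rough sketch of how the hop count could feed the scoring, something like the following might work. The cutoff of two headers and the "C" chip are assumptions chosen for illustration, and SPAMMISHNESS is the cumulative variable described further down in this message:

```procmail
# Sketch only: the cutoff (two headers) and the "C" chip are assumptions.
# Raise the 3 if fetchmail or internal forwarding adds hops on your side.
# The score starts at 3 and loses 1 per Received: header, so it stays
# positive (and the recipe fires) only for two or fewer headers.
:0
* 3^0
* -1^1 ^Received:
{
  LOG="SPAMMY: few Received: headers$NL"
  SPAMMISHNESS="${SPAMMISHNESS}C"
}
```

Because the recipe only appends a chip, a personal contact with a short hop count is not filed as spam on this basis alone.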
> Ignore the familiar recent ones, then start counting.
I'm not positive that I follow what you're suggesting, though I suspect
that you mean to "exclude your own host and ISP, and count the
rest." Well, as originally stated, I didn't declare it that way, but as
above, you obviously need to take into consideration your own fetching and
forwarding, as those processes affect the received header count. On a
typical smtp-smtp message though, there are often three or more.
> What you could do is concatenate a letter to a variable, like
> { SPAMRATE=$SPAMRATE"G" }
> where each letter stands for some feature or severity. An 'A' could
> mean: strong evidence against spam, a 'Z' would then mean: certainly
BTW, this is an easy way to do simple additive math for the carryover score
(otherwise, you have to have interim recipes to "add" the score, or invoke
a shell to do math). Imagine assigning each letter a point value, sort of
like casino chips:
A = 1
B = 5
C = 10
D = 25
E = 50
F = 100
(etc)
(lest someone get confused, no, you are NOT defining any variables for the
above, so that doesn't go in your .procmailrc)
So, let's say that some certain spammy reason has a contributive value of
"30" for your spammishness, so:
:0
* conditions
{
LOG="SPAMMY: reason$NL"
SPAMMISHNESS="${SPAMMISHNESS}CCC"
#or "DB", etc - mentally sum the chip values.
}
Down where you go to make the final decision, you might have a final
SPAMMISHNESS containing:
CCCDABFCAABada
This is the concatenated scoring from several recipes which the message
triggered, each one having added a letter or two.
# The negative value at the top is your threshold - if the summed score
# EXCEEDS the positive of this, the recipe matches and the message is filed
# as spammish.
:0D:
* -100^0
* 1^1 SPAMMISHNESS ?? A
* 5^1 SPAMMISHNESS ?? B
* 10^1 SPAMMISHNESS ?? C
* 25^1 SPAMMISHNESS ?? D
* 50^1 SPAMMISHNESS ?? E
* 100^1 SPAMMISHNESS ?? F
* 250^1 SPAMMISHNESS ?? G
* 500^1 SPAMMISHNESS ?? H
* 1000^1 SPAMMISHNESS ?? I
* 2500^1 SPAMMISHNESS ?? J
* 5000^1 SPAMMISHNESS ?? K
* 10000^1 SPAMMISHNESS ?? L
* 25000^1 SPAMMISHNESS ?? M
* 50000^1 SPAMMISHNESS ?? N
* -1^1 SPAMMISHNESS ?? a
* -5^1 SPAMMISHNESS ?? b
* -10^1 SPAMMISHNESS ?? c
* -25^1 SPAMMISHNESS ?? d
* -50^1 SPAMMISHNESS ?? e
* -100^1 SPAMMISHNESS ?? f
* -250^1 SPAMMISHNESS ?? g
* -500^1 SPAMMISHNESS ?? h
* -1000^1 SPAMMISHNESS ?? i
* -2500^1 SPAMMISHNESS ?? j
* -5000^1 SPAMMISHNESS ?? k
* -10000^1 SPAMMISHNESS ?? l
* -25000^1 SPAMMISHNESS ?? m
* -50000^1 SPAMMISHNESS ?? n
spammish.mbx
Note that this recipe uses the 'D' flag for case sensitivity. Thus, you
can use LOWERCASE spammishness "chips" to define negatives, which allow you
to define _anti-spammishness_ of messages (or to declare 49 as "Ea"
instead of "DCCAAAA"). Candidates include certain domains, matches in
"recently emailed", or even "I've received other email from this same
address already" (most spammers never re-use an address, though commercial
"list" spammers do, so a lot of your spam tends to be from an address you've
never seen before), any of which would potentially decrease - but not
totally negate - the spammishness of a message.
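A minimal sketch of such an anti-spammishness recipe, assuming a hypothetical trusted domain (example.com is a placeholder) and an arbitrarily chosen 'e' chip worth -50:

```procmail
# Hypothetical: mail claiming to come from a domain you correspond with
# earns a negative chip. example.com and the 'e' chip are placeholders.
:0
* ^From:.*@example\.com
{
  LOG="HAMMY: known domain$NL"
  SPAMMISHNESS="${SPAMMISHNESS}e"
}
```

Applied to the sample string above, CCCDABFCAABada becomes CCCDABFCAABadae. The summing recipe then tallies 4xC + D + 3xA + 2xB + F = 178 on the positive side and 2xa + d + e = -77 on the negative side, for 101. Less the threshold of 100, that leaves 1, so the message would still be filed as spammish, but only just.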
Alternately, the letters needn't represent chip values, but simply severity
classifications which you assign your own value to - a "class C" spam
may be worth 2x or 10x what a "class B" is (or may have no interrelation at
all).
Or, the letters could be reason codes - "A" is "no domain in From:" and is
assigned some high point value, while "B" is some other reason, and is
assigned an independent value. I wouldn't be prone to taking this route
myself, though the code could be easily embedded into the message header:
:0
* scoring stuff
{
:0fh
| formail -A "X-Spam-Codes: $SPAMMISHNESS ($MATCH)"
:0:
spam.mbx
}
Likewise, if you use regexps and 'chip' numeric math to do your summing
(see below), instead of 'bc', you can embed rule ids:
SPAMMISHNESS="${SPAMMISHNESS}Rule 12: +25 ;"
And still emit the SPAMMISHNESS string into the message header as above.
Instead of the letter chips (which admittedly could be difficult to keep
straight without a cheat-sheet), you could instead have numeric scoring,
but still with chip values:
SPAMMISHNESS="${SPAMMISHNESS}+25 +5 "
(not simply +30, since that isn't a "chip value" - if you can think in
terms of casino chips, this really isn't hard to follow at all - I raise
you a blue, two reds, and a white).
In the final additive function:
* 1^1 SPAMMISHNESS ?? \+1[ ]
* 5^1 SPAMMISHNESS ?? \+5[ ]
* 10^1 SPAMMISHNESS ?? \+10[ ]
* 25^1 SPAMMISHNESS ?? \+25[ ]
(etc)
Note that each of the numerics MUST be separated by a space. If you did it
like so instead:
SPAMMISHNESS="+25+25+5"
* 25^1 SPAMMISHNESS ?? \+25([^0-9]|$)
Such a regexp would cause problems when you have two consecutive identical
values. It would evaluate the above example to 30, not 55, because the
regexp parser "eats" the following character (the plus in this case)
because that matches the "[^0-9]" expression, and thus the following +25
doesn't remain in the regexp buffer as "+25", but rather "25" (which
doesn't match "+25"). The different chip values - which are handled as
separate condition lines (and thus separate regexps) will re-evaluate the
line for their respective chip values, so "+25+5" would evaluate fine.
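Putting the space-separated form together, a complete summing recipe might look like the sketch below. The threshold of 50 and the sample value in the comment are assumptions for illustration, and only the +5 and +25 chips are shown:

```procmail
# Example value built up by earlier recipes (illustration only):
#   SPAMMISHNESS="+25 +25 +5 "
# Each match consumes only its own trailing space, so the next "+25"
# still begins with its plus sign and is counted.
:0:
* -50^0
* 5^1 SPAMMISHNESS ?? \+5[ ]
* 25^1 SPAMMISHNESS ?? \+25[ ]
spammish.mbx
```

With the sample value, the chips sum to 55; less the threshold of 50, the score is 5, so the recipe would match.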
Using the chip values saves you from having to invoke a 'bc' or similar
function to evaluate the additive score string. You _could_ use absolute
numerics (="+200") OR continue using the same chip mechanism, and do the
following at the summing stage instead, at the cost of some shell overhead,
but greatly simplifying the summing logic:
# right before final score decision
# the prefixed zero is because bc doesn't like +25+30, but can handle 0+25+30
# (you could initialize the SPAMMISHNESS string to 0 in your recipes instead)
SPAMMISHNESS=`echo "0${SPAMMISHNESS}" | bc`
Then, the recipes could just tack an absolute spammishness value into the
string, but individually, the recipes do not have to resolve the running
total (which is overhead you really don't want).
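For the bc route, the final decision could be sketched roughly as follows. This assumes each earlier recipe appended a signed number such as "+200" or "-50", that bc is on the PATH, and that a threshold of 100 is wanted; the '$' modifier asks procmail to expand the variable in the condition line before it is read as a weight:

```procmail
# Sketch under the assumptions above; not a drop-in rule.
# Resolve the accumulated string (e.g. "+200-50+25") to a single number.
SPAMMISHNESS=`echo "0${SPAMMISHNESS}" | bc`

# File the message if the resolved total exceeds 100.
:0:
* -100^0
* $ ${SPAMMISHNESS}^0
spammish.mbx
```

The shell is invoked once per message here, rather than once per scoring recipe.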
Note that the above methods do _not_ promote a system whereby individual
_scores_ from tests are summed, which is a painful process involving
recipes which follow EACH recipe to tack the results on (esp. for recipes
with negative scores).
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail