new spam filtering rule

2005-06-28 01:22:44
Okay, this will be wildly unpopular with those in affected countries, but since I correspond directly with so few people at two-letter TLDs, it makes for a reasonable attribute to check:

Variables used in this recipe (ENVFROM and FROM_DOMAIN) are common extractions which can be found in the sandbox published at my website.

:0
* ENVFROM ?? ()\.\/..^^
* $ FROM_DOMAIN ?? ()\.$MATCH^^
{
        SPAMVAL="+50"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Envelope sender is a two letter TLD${NL}"

        # continuing, we add MORE spammishness if the TLD matches a list
        :0
        * MATCH ?? ^^(ru|hu|it|br|uy|pl|pt|za|cl|ch|sk|ua|su|cz|cc|sg|tw|ro)^^
        {
                SPAMVAL="+50"
                SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
                SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Envelope sender tld is ${MATCH}${NL}"
        }
}
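For readers without the sandbox handy, here is a minimal sketch of the setup the recipe above assumes. These are hypothetical stand-ins, not the sandbox's actual definitions: NL holds a literal newline, ENVFROM is taken from the mbox "From " line, and FROM_DOMAIN is whatever follows the @ in the From: header.

```procmail
# Hypothetical stand-ins, NOT the definitions from the author's sandbox.

# A literal newline, used to terminate each SPAMNOTES entry.
NL="
"

# Envelope sender: the address on the mbox "From " line.
:0
* ^From \/[^ ]+
{ ENVFROM=$MATCH }

# Domain of the From: header address (everything after an @).
:0
* ^From:.*@\/[-a-z0-9.]+
{ FROM_DOMAIN=$MATCH }
```

The From: extraction is deliberately naive; a display name containing an @ would confuse it, so a real setup likely wants something sturdier.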

I check both the envelope and the From: domain because some lists I use remail from a two-letter TLD (er, such as procmail). Further, by extracting the TLD from the first and requiring the SAME TLD in the second, I reduce (though not eliminate) matches where a poster on a list hosted at a two-letter TLD happens to also be at a two-letter TLD domain. If the two TLDs coincide, yes, the message is going to be flagged, but at least a .uk poster on a .de list won't be. Granted, I'm seeing plenty of spam that uses two different domains.

Additionally, in the second (nested) level of the recipe, we optionally match against a list of TLDs which are particularly spammy (in my case, as determined by evaluating my own corpus of spam).


Modify to suit your needs - in my case, since the added score is relatively low (about 1/5 the total needed to classify a message as spam), it won't generally matter if the rule hits several messages which aren't spam - they'll still have to have either several more minor characteristics, or some strong spam flags in order to be categorized as such and removed from my inbox stream.

The listed TLDs are those which have a higher incidence in my own spam corpus and within which I generally don't have correspondents (though there are exceptions).


I could perform an initial match like so:

* ENVFROM ?? ()@\/.*\...^^
* $ FROM_DOMAIN ?? ^^$MATCH^^
* FROM_DOMAIN ?? ()\.\/..^^

This would ensure that the envelope and From: domains match in full (the entire domain portions, not just the TLD). The third condition then re-matches to acquire the TLD needed by the second-level recipe; it could be omitted if that level isn't going to be checked, or simply moved into that recipe.
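Dropped into the outer recipe, those three conditions might look like the sketch below. The body is copied from the recipe earlier in this message with only the note text reworded (the new wording is illustrative). One caveat: the dots in $MATCH are treated as regex wildcards in the second condition, so the comparison is slightly looser than a literal string match.

```procmail
:0
* ENVFROM ?? ()@\/.*\...^^
* $ FROM_DOMAIN ?? ^^$MATCH^^
* FROM_DOMAIN ?? ()\.\/..^^
{
        # MATCH now holds the two letter TLD from the third condition
        SPAMVAL="+50"
        SPAMMISHNESS="${SPAMMISHNESS}${SPAMVAL}"
        SPAMNOTES="${SPAMNOTES}SPAM: ${SPAMVAL} Envelope and From: domains match at a two letter TLD${NL}"
}
```

The nested TLD-list recipe from the original can follow inside the braces unchanged, since MATCH still carries the TLD at that point.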


Note that because one of my spammishness tests flags based on the number of characteristics matched (i.e. if there are too many characteristics, even minor ones, it'll bump the message to actual spam), a match at the second level of the recipe above will provide 2 of the 7 flags necessary to consider the message spam, even if the ultimate score isn't very high.
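How the accumulated values get evaluated isn't shown in this post, so the following is only a rough sketch of one way to do it. It assumes SPAMMISHNESS is a string of signed terms such as "+50+50", that every SPAMNOTES entry starts with "SPAM:", that SHELL is Bourne-compatible, and that bc is installed. The 250 threshold is inferred from "+50 is about 1/5 of the total" and is illustrative, as is the "spam" folder name; the 7 comes from the flag count mentioned above.

```procmail
# Hypothetical final verdict: file the message as spam if the summed
# score reaches 250 OR at least 7 characteristics were noted.
:0
* ? test `echo "0${SPAMMISHNESS}" | bc` -ge 250 || test `echo "$SPAMNOTES" | grep -c '^SPAM:'` -ge 7
spam
```

With only the two-level TLD rule hitting, this sees a score of 100 and 2 flags, so the message is left alone unless other tests contribute.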


Comments anyone (besides arguing about specific tlds, which are a matter of preference)?
---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
