procmail

Re: Again on spam with targeted meaningful text

2004-03-20 10:20:39
M. Fioretti wrote:

[...]

I follow the Africa Linux users list, which discusses how Free
Software can speed up development there. Just this morning bogofilter
let through an html message with the following text (probably in white
color, ie invisible) *and* a picture containing a promise to boost my
private life if I just clicked etc etc...

I would urge you to ask this on the bogofilter list. Interestingly, the lists for each tool seem to have distinct personalities. That list tends to be filled with types fascinated by the statistical implications of things. I'll try to paraphrase below some of what they've told me, but I'm certainly no expert on Bayesian or similar techniques.

In short, training is the key. New words should be more or less neutral, but targeted "good text" like you've shown in this example does complicate things. Here's where you'd want to verify with those more knowledgeable, but my understanding is that if you train on these messages as spam, the "good" words will become more neutral, while the "bad" words will spike next time around. There's a tradeoff though: you risk skewing your bayes database if you're not careful. Since neutral is not bad, those good words (i.e. references to African issues in your example) would be less of a positive non-spam indicator, but would not be flagged as spam either.
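
To make the "good words drift toward neutral, bad words spike" idea concrete, here is a rough sketch of a smoothed per-word spam estimate. It is only in the spirit of the Robinson-style smoothing that statistical filters of this kind are usually described as using; it is not bogofilter's actual code, and the function name, the parameters s and x, and all the counts are made up for illustration.

```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed estimate of how spammy a word is (0 = ham, 0.5 = neutral, 1 = spam).

    spam_count / ham_count: trained spam / ham messages containing the word.
    n_spam / n_ham: total trained spam / ham messages.
    s, x: assumed prior strength and prior value for rarely seen words.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # a never-seen word is treated as neutral
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    p = spam_freq / (spam_freq + ham_freq)
    # Blend the raw estimate with the neutral prior, weighted by evidence.
    return (s * x + n * p) / (s + n)


# Hypothetical database: 1000 spams and 1000 hams trained so far.
# "africa" has only ever appeared in good mail; "clicked" is evenly split.
good_before = token_spam_prob(0, 40, 1000, 1000)
bad_before = token_spam_prob(5, 5, 1000, 1000)

# Now train 10 of the targeted spams, each containing both words.
good_after = token_spam_prob(10, 40, 1010, 1000)
bad_after = token_spam_prob(15, 5, 1010, 1000)

print("africa :", round(good_before, 3), "->", round(good_after, 3))
print("clicked:", round(bad_before, 3), "->", round(bad_after, 3))
```

With these invented counts the "good" word moves up toward 0.5 without crossing it, while the "bad" word moves well above 0.5, which is the tradeoff described above.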

The good news is that this technique is a bit more labor intensive for the spammer, though harvesting text from previous list posts makes it easier.

[...]
I received other (very few however) similar messages lately, all
seeming to demonstrate that that approach *is* being tried. Any extra
recipe or comment is welcome. If you want a copy of that message to look
at the headers or anything just ask.

This is where I think the layered approach works well:

1. Procmail for efficiently catching obvious patterns of abuse, defanging content, re-routing etc.
2. SpamAssassin for more elaborate body pattern matches and cumulative scoring (what I refer to as "smells like spam" characteristics).
3. Bayesian (i.e. bogofilter, or SpamAssassin's bayes) for catching stuff that dodges the previous 2.
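
The three layers above can be sketched as one small function, just to show the order of the checks and where cumulative scoring sits. This is a toy illustration and not how procmail, SpamAssassin, or bogofilter are actually wired together; the patterns, weights, thresholds, and the bayes_prob callable are all hypothetical stand-ins.

```python
import re

# Layer 1: a cheap, header-only pattern for obvious abuse (hypothetical).
OBVIOUS_ABUSE = re.compile(r"^Subject:.*\bv[i1]agra\b",
                           re.IGNORECASE | re.MULTILINE)

# Layer 2: "smells like spam" body rules, each with an invented weight.
SMELL_RULES = [
    # white (invisible) text in an HTML part
    (re.compile(r"<font[^>]*color=[\"']?(#fff(fff)?|white)\b", re.IGNORECASE), 2.0),
    # an embedded picture
    (re.compile(r"<img\b", re.IGNORECASE), 0.5),
    # a call to click something
    (re.compile(r"click here", re.IGNORECASE), 1.0),
]
SCORE_THRESHOLD = 3.0   # assumed cutoff for the cumulative score
BAYES_THRESHOLD = 0.9   # assumed cutoff for the statistical layer


def classify(headers, body, bayes_prob):
    """Return the name of the layer that caught the message, or 'ham'.

    bayes_prob is a callable standing in for a trained statistical filter;
    it takes the body and returns a spam probability between 0 and 1.
    """
    if OBVIOUS_ABUSE.search(headers):
        return "layer1-pattern"
    score = sum(weight for rule, weight in SMELL_RULES if rule.search(body))
    if score >= SCORE_THRESHOLD:
        return "layer2-score"
    if bayes_prob(body) > BAYES_THRESHOLD:
        return "layer3-bayes"
    return "ham"


# A message padded with on-topic text that the statistical layer rates low.
html = ('<font color="#ffffff">Free Software in Africa</font>'
        '<img src="x.gif"> click here')
print(classify("Subject: Re: development", html, lambda body: 0.2))
```

The point of the ordering is cost: the header check is cheapest, the body rules accumulate small hints that no single pattern would justify acting on, and the statistical layer only sees what slipped past both.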

In particular, the SpamAssassin rules gang is going to great lengths to catch such spams, not based on strict patterns in the message, but by noting the little things that still betray them as spam. Statistically (i.e. bayes) these might not be significant, but like a drug dog sniffing for goods amongst a pile of dung, the traits can be noted, at least often. Procmail can do this too of course, but my understanding is that body processing is not cheap (correct me if wrong, gang).

If you care to send me the COMPLETE message, including headers, I'd be happy to show how it scores with various bayes and spamassassin rules. There may be tell-tale signs that can be used in efficient procmail recipes too.

- Bob



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
