M. Fioretti wrote:
[...]
I follow the Africa Linux users list, which discusses how Free
Software can speed up development there. Just this morning bogofilter
let through an html message with the following text (probably in white
color, ie invisible) *and* a picture containing a promise to boost my
private life if I just clicked etc etc...
I would urge you to ask this on the bogofilter list. Interestingly, the
lists for each tool seem to have distinct personalities. That list tends
to be filled with types fascinated by the statistical implications of
things. I'll try to paraphrase some of what they've told me below, but
I'm certainly no expert on Bayesian or similar techniques.
In short, training is the key. New words should be more-or-less neutral,
but targeted "good text" like you've shown in this example does
complicate things. Here's where you'd want to verify with those more
knowledgeable, but my understanding is that if you train with these
messages as spam, the "good" words will become more neutral, while the
"bad" words will score even higher next time around. There's a tradeoff
though, and you risk skewing your bayes database if you're not careful.
Since neutral is not bad, those good words (i.e. references to African
issues in your example) would be less of a positive non-spam indicator,
but not flagged as spam either.
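To make the "drift toward neutral" idea concrete, here is a rough Python
sketch. It is NOT bogofilter's actual code or database format; the class
name TinyBayes, the smoothing parameters and the sample words are all
made up for illustration, and the per-word estimate is only a simplified
smoothed ratio of the kind Bayesian-style filters tend to use.

```python
from collections import Counter


class TinyBayes:
    """Toy per-word spam estimate (illustrative only, not bogofilter)."""

    def __init__(self):
        self.spam = Counter()   # number of spam messages containing a word
        self.ham = Counter()    # number of good messages containing a word
        self.n_spam = 0
        self.n_ham = 0

    def train(self, words, is_spam):
        # Count each word once per message.
        (self.spam if is_spam else self.ham).update(set(words))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spamicity(self, word, strength=1.0, prior=0.5):
        # Fraction of spam / good messages that contain the word.
        s = self.spam[word] / self.n_spam if self.n_spam else 0.0
        h = self.ham[word] / self.n_ham if self.n_ham else 0.0
        raw = s / (s + h) if (s + h) else prior
        n = self.spam[word] + self.ham[word]
        # Pull rarely seen words toward the neutral prior (assumed 0.5).
        return (strength * prior + n * raw) / (strength + n)


db = TinyBayes()
for _ in range(10):
    db.train(["africa", "linux"], is_spam=False)
    db.train(["pills", "click"], is_spam=True)

before = db.spamicity("africa")
click_before = db.spamicity("click")

# Now register five spams that borrow the list's vocabulary.
for _ in range(5):
    db.train(["africa", "click"], is_spam=True)

after = db.spamicity("africa")
click_after = db.spamicity("click")
unseen = db.spamicity("neverseen")
```

With these made-up counts, "africa" moves from strongly good toward the
neutral 0.5 without crossing it, "click" creeps a little higher, and a
word never seen before sits at exactly 0.5.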
The good news is that this technique is a bit more labor intensive for
the spammer, though harvesting text from previous list posts makes it
easier.
[...]
I received other (very few however) similar messages lately, all
seeming to demonstrate that that approach *is* being tried. Any extra
recipe or comment is welcome. If you want a copy of that message to look
at the headers or anything just ask.
This is where I think the layered approach works well:
1. Procmail for efficiently catching obvious patterns of abuse,
defanging content, re-routing etc.
2. SpamAssassin for more elaborate body pattern matches and cumulative
scoring (what I refer to as "smells like spam" characteristics).
3. Bayesian (i.e. bogofilter, or SpamAssassin's bayes) for catching stuff
that dodges the previous 2.
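
As a rough illustration of that layering (in Python rather than actual
procmail recipes or SpamAssassin rules), the sketch below runs a cheap
pattern check first, then a cumulative weighted score, and only then
falls back to a Bayesian verdict. Every pattern, weight, threshold and
name here (OBVIOUS, RULES, THRESHOLD, classify, bayes_score) is a
hypothetical stand-in, not something taken from either tool.

```python
import re

# Layer 1: cheap checks for obvious abuse (hypothetical pattern).
OBVIOUS = [re.compile(r"^Subject:.*\bviagra\b", re.I | re.M)]

# Layer 2: "smells like spam" traits with made-up weights.
RULES = [
    # Text colored white, i.e. possibly invisible to the reader.
    (re.compile(r"color\s*[:=]\s*[\"']?(#fff(fff)?|white)\b", re.I), 2.0),
    # An embedded image.
    (re.compile(r"<img\b", re.I), 1.0),
    # A call to click.
    (re.compile(r"click here", re.I), 1.5),
]
THRESHOLD = 3.0  # assumed cutoff for the cumulative score


def classify(message, bayes_score=lambda m: 0.5):
    """Return a verdict string; bayes_score is any callable giving 0..1."""
    if any(p.search(message) for p in OBVIOUS):
        return "spam (layer 1)"
    score = sum(weight for pattern, weight in RULES if pattern.search(message))
    if score >= THRESHOLD:
        return "spam (layer 2)"
    if bayes_score(message) > 0.9:
        return "spam (layer 3)"
    return "ham"


hidden_text = ('Subject: Africa\n\n<font color="#ffffff">linux development'
               '</font><img src="x.gif"> click here')
verdict = classify(hidden_text)
```

No single trait in the sample message is damning on its own; it is the
sum of the small ones that pushes it over the assumed threshold, even
though the visible-to-the-filter words look like normal list traffic.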
In particular, the SpamAssassin rules gang is going to great lengths to
catch such spams not based on strict patterns in the message, but by
noting the little things that still betray them as spam. Statistically
(i.e. bayes) these might not be significant, but like a drug dog sniffing
for goods amongst a pile of dung, the traits can be noted, at least
often. Procmail can do this too of course, but my understanding is that
body processing is not cheap (correct me if wrong, gang).
If you care to send me the COMPLETE message, including headers, I'd be
happy to show how it scores with various bayes and SpamAssassin rules.
There may be tell-tale signs that can be used in efficient procmail
recipes too.
- Bob
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail