Re: Again on spam with targeted meaningful text

M. Fioretti wrote:

[...]

I follow the Africa Linux users list, which discusses how Free
Software can speed up development there. Just this morning bogofilter
let through an html message with the following text (probably in white
color, ie invisible) *and* a picture containing a promise to boost my
private life if I just clicked etc etc...

I would urge you to ask this on the bogofilter list. Interestingly, the lists for each tool seem to have distinct personalities. That list tends to be filled with types fascinated by the statistical implications of things. I'll try to paraphrase below some of what they've told me, but I'm certainly no expert on Bayesian or similar techniques.

In short, training is the key. New words should be more-or-less neutral, but targeted "good text" like you've shown in this example does complicate things. Here's where you'd want to verify with those more knowledgeable, but my understanding is that if you train with these messages, the "good" words will become more neutral, while the "bad" words will spike next time around. There's a tradeoff though, and you do risk skewing your bayes database if not careful. Since neutral is not bad, those good words (i.e. references to African issues in your example) would be less of a positive non-spam indicator, but not flagged as spam either.
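To make the "good words drift toward neutral" point concrete, here is a minimal Python sketch of a Robinson-style smoothed token score. It is only an illustration and not bogofilter's actual code: the function name, the parameters s and x, and all the counts are assumptions made up for the example.

```python
def spamicity(spam_count, ham_count, n_spam_msgs, n_ham_msgs, s=1.0, x=0.5):
    """Smoothed estimate that a token indicates spam (0.0 = ham, 1.0 = spam).

    s is the weight given to the prior and x is the score assumed for a
    token never seen before; both defaults are illustrative assumptions.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # unseen token: neutral
    spam_freq = spam_count / n_spam_msgs if n_spam_msgs else 0.0
    ham_freq = ham_count / n_ham_msgs if n_ham_msgs else 0.0
    total = spam_freq + ham_freq
    if total == 0:
        return x
    p = spam_freq / total
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * p) / (s + n)


# Hypothetical counts: "africa" seen in 40 of 100 ham messages, never in spam.
before = spamicity(spam_count=0, ham_count=40, n_spam_msgs=100, n_ham_msgs=100)

# After training 10 targeted spams that also contain "africa".
after = spamicity(spam_count=10, ham_count=40, n_spam_msgs=110, n_ham_msgs=100)

print(f"before training: {before:.3f}, after training: {after:.3f}")
```

With these made-up numbers the token moves from about 0.01 to about 0.19: still on the non-spam side, but a weaker non-spam indicator than before.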

The good news is that this technique is a bit more labor intensive for the spammer, though harvesting text from previous list posts makes it easier.

[...]
I received other (very few however) similar messages lately, all
seeming to demonstrate that that approach *is* being tried. Any extra
recipe or comment is welcome. If you want a copy of that message to look
at the headers or anything just ask.

This is where I think the layered approach works well:

1. Procmail for efficiently catching obvious patterns of abuse, defanging content, re-routing etc.
2. SpamAssassin for more elaborate body pattern matches and cumulative scoring (what I refer to as "smells like spam" characteristics).
3. Bayesian (i.e. bogofilter, or SpamAssassin's bayes) for catching stuff that dodges the previous 2.
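The three layers above can be sketched in Python as a rough control-flow illustration. This is not how procmail, SpamAssassin or bogofilter are actually wired together: every pattern, weight and threshold below is a hypothetical placeholder, and bayes_score stands in for whatever Bayesian filter you run. The sketch also reads payloads as-is, so it assumes bodies are not base64 or quoted-printable encoded.

```python
import re
from email import message_from_string

# Layer 1: a cheap, obvious header pattern (hypothetical example).
OBVIOUS_SUBJECT = re.compile(r"\b(viagra|lottery winner)\b", re.IGNORECASE)

# Layer 2: body rules with cumulative weights (all values are made up).
BODY_RULES = [
    (re.compile(r"click here", re.IGNORECASE), 2.0),
    (re.compile(r"<font[^>]*color=[\"']?(#ffffff|white)", re.IGNORECASE), 3.5),
    (re.compile(r"<img\b", re.IGNORECASE), 0.5),
]
RULE_THRESHOLD = 5.0   # assumed cutoff for the cumulative rule score
BAYES_THRESHOLD = 0.9  # assumed cutoff for the Bayesian layer


def body_text(msg):
    """Join the payloads of all non-multipart parts, undecoded."""
    parts = [p for p in msg.walk() if not p.is_multipart()]
    return "\n".join(str(p.get_payload()) for p in parts)


def classify(raw_message, bayes_score):
    """Return (verdict, layer) where layer names the check that decided."""
    msg = message_from_string(raw_message)

    # Layer 1: header-only check, no body processing needed.
    if OBVIOUS_SUBJECT.search(msg.get("Subject", "")):
        return ("spam", "header")

    # Layer 2: add up the weights of every body rule that matches.
    body = body_text(msg)
    score = sum(weight for pattern, weight in BODY_RULES if pattern.search(body))
    if score >= RULE_THRESHOLD:
        return ("spam", "rules")

    # Layer 3: statistical fallback for whatever dodged the first two.
    if bayes_score(body) >= BAYES_THRESHOLD:
        return ("spam", "bayes")
    return ("ham", "none")


sample = (
    "Subject: Re: Linux in Africa\n"
    "\n"
    '<font color="#ffffff">Free Software can speed up development</font>\n'
    '<img src="offer.gif"> click here\n'
)
print(classify(sample, bayes_score=lambda text: 0.1))
```

The sample mimics the kind of message described earlier in the thread: list-flavored invisible text plus a picture. The Bayesian stand-in scores it as innocent, yet the cumulative body rules still flag it, which is the point of layering.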

In particular, the SpamAssassin rules gang is going to great lengths to catch such spams not based on strict patterns in the message, but by noting the little things that still betray them as spam. Statistically (i.e. bayes) these might not be significant, but like a drug dog sniffing for goods amongst a pile of dung, the traits can be noted, at least often. Procmail can do this too of course, but my understanding is that body processing is not cheap (correct me if wrong, gang).

If you care to send me the COMPLETE message, including headers, I'd be happy to show how it scores with various bayes and SpamAssassin rules. There may be tell-tale signs that can be used in efficient procmail recipes too.


- Bob



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail