
Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

2004-09-21 00:55:20
Laird Breyer wrote:
On Sep 20 2004, Markus Stumpf wrote:

The Spammers' Compendium

has a list of tricks spammers use to beat bayesian filters.


Clearly, this is entirely untypical of ordinary language. Like the
nonsense words, this sticks out (e.g. what percentage of legitimate
messages do *you* have that don't contain the word "the"?).

Many. Two examples: as I live in France, most messages don't contain the word "the" because they are written in French. Also, people at our organisation send and receive messages in many other languages: German, Italian, Russian, and even Chinese ... and English of course.
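The question of how many legitimate messages lack the word "the" can be measured on one's own mailbox. Below is a minimal sketch in Python; the function names and the four sample messages are made up for illustration and do not come from any real corpus.

```python
import re

def lacks_word(text, word="the"):
    """Return True if `word` never occurs as a whole word in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is None

def fraction_lacking(messages, word="the"):
    """Fraction of messages in which `word` never appears."""
    if not messages:
        return 0.0
    return sum(lacks_word(m, word) for m in messages) / len(messages)

# Hypothetical sample mailbox, invented for illustration only.
SAMPLE = [
    "Please find the report attached.",
    "Bonjour, la réunion est reportée à demain.",
    "Guten Tag, anbei das Protokoll.",
    "See you at the meeting.",
]

print(fraction_lacking(SAMPLE))
```

On a mixed-language mailbox like this one, the fraction is far from negligible, so the absence of "the" only carries weight relative to what the filter has seen in that user's own legitimate mail.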

In fact, many sequences will recur if a spammer sends several messages
of this type. Even without splitting on punctuation, various parts of
the "typefaces" recur, such as '888' which is used in 'n', 'o', '!'.
So the filter will automatically think messages with large frequencies
of '888' tend to be junk.
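As a rough illustration of why a recurring fragment such as '888' ends up looking like junk, here is a minimal sketch of a smoothed per-token estimate. It assumes whitespace tokenization, equal priors for spam and legitimate mail, and add-one smoothing; it is not the formula of any particular filter, and the training messages are hypothetical.

```python
from collections import Counter

def token_counts(messages):
    """Count, per token, how many messages contain it (whitespace split only)."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.split()))
    return counts

def spam_score(token, spam_msgs, ham_msgs):
    """Smoothed estimate of P(spam | token), assuming equal priors."""
    s = (token_counts(spam_msgs)[token] + 1) / (len(spam_msgs) + 2)
    h = (token_counts(ham_msgs)[token] + 1) / (len(ham_msgs) + 2)
    return s / (s + h)

# Hypothetical training data, invented for illustration only.
SPAM = ["888 888 o888o `888'", ".o. 888 Y8P", "888 888 888 888"]
HAM = ["call me at noon", "lunch tomorrow?", "see attached minutes"]

print(spam_score("888", SPAM, HAM))   # token seen only in spam
print(spam_score("noon", SPAM, HAM))  # token seen only in legitimate mail
```

A token that appears in every junk message and in no legitimate one is pushed well above 0.5, while a token never seen in training stays at exactly 0.5 and contributes nothing.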

What is missing there, and what I have seen lately, is the use of e.g.

|_) |_)|_|\/ /
 _    ___
| |  / (_)___ _____ __________ _
| | / / / __ `/ __ `/ ___/ __ `/
| |/ / / /_/ / /_/ / /  / /_/ /
|___/_/\__,_/\__, /_/   \__,_/
            /____/

                                          .o.
                                          888
ooo. .oo.    .ooooo.  oooo oooo    ooo    888
`888P"Y88b  d88' `88b  `88. `88.  .8'     Y8P
 888   888  888   888   `88..]88..8'      `8'
 888   888  888   888    `888'`888'       .o.
o888o o888o `Y8bod8P'     `8'  `8'        Y8P

to beat a bayesian filter.


A statistical filter will recognize all these things automatically.

Maybe, but there are many legitimate senders, and even companies, who use this kind of message composition (Buy ... now) to add a footer to all their messages. So, false positives...

In this case, for a filter to be acceptable, I define "ALL" as being 100%, and "MOST OF THE TIME" as being 99.99%.

 Jose Marcio MARTINS DA CRUZ           Tel. :(33)
 Ecole des Mines de Paris    
 60, bd Saint Michel      
 75272 - PARIS CEDEX 06      

Asrg mailing list
