Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

On Sep 20 2004, Markus Stumpf wrote:


The Spammers' Compendium
    http://www.jgc.org/tsc/

has a list of tricks spammers use to beat Bayesian filters.


Yes, it's a fun read. Few tricks work well, but we are seeing
evolution in action.

With regard to your example below, this is what a filter might see
(frequency, token):

1 |_)
1 |_)|_|\/
12 /
2 _
1 ___
5 |
1 (_)___
1 _____
1 __________
3 __
3 `/
1 ___/
1 |/
4 /_/
1 |___/_/\__,_/\__,
1 \__,_/
1 /____/
2 .o.
10 888
1 ooo.
1 .oo.
1 .ooooo.
2 oooo
1 ooo
1 `888p"y88b
1 d88'
1 `88b
2 `88.
1 .8'
2 y8p
1 `88..]88..8'
3 `8'
1 `888'`888'
2 o888o
1 `y8bod8p'


Clearly, this is entirely atypical of ordinary language. Like the
nonsense words, it sticks out (e.g. what percentage of legitimate
messages do *you* have that don't contain the word "the"?).
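
The counts in the list are consistent with a very simple tokenizer:
lowercase the text and split on whitespace, nothing more. Here is a
minimal Python sketch under that assumption, run on the first banner of
the example. The tokenizer and the names BANNER and token_counts are
illustrative only, not taken from any particular filter.

```python
from collections import Counter

# First banner of the ASCII-art example (raw string keeps the backslashes).
BANNER = r"""
|_)
|_)|_|\/
      /
  _    ___
 | |  / (_)___ _____ __________ _
 | | / / / __ `/ __ `/ ___/ __ `/
 | |/ / / /_/ / /_/ / /  / /_/ /
 |___/_/\__,_/\__, /_/   \__,_/
           /____/
"""


def token_counts(text):
    """Assumed tokenizer: lowercase, then split on whitespace only."""
    return Counter(text.lower().split())


counts = token_counts(BANNER)
# Print the most frequent tokens in the same "frequency token" layout.
for token, n in counts.most_common(3):
    print(n, token)
```

On this banner the three most frequent tokens come out as '/', '|'
and '/_/', none of which looks like a word from ordinary mail.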

In fact, many sequences will recur if a spammer sends several messages
of this type. Even without splitting on punctuation, various parts of
the "typefaces" recur, such as '888' which is used in 'n', 'o', '!'.
So the filter will automatically think messages with large frequencies
of '888' tend to be junk.
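
To make the '888' point concrete, here is a rough Python sketch of how a
per-token spam estimate could be formed. It assumes add-one smoothing,
equal priors, and counts of messages containing a token rather than raw
frequencies. The toy corpora and function names are made up for
illustration; real filters combine many tokens and differ in the details.

```python
from collections import Counter


def train(messages):
    """Count in how many messages each whitespace-separated token occurs."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Smoothed estimate of P(spam | token), assuming equal priors."""
    p_token_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_spam / (p_token_spam + p_token_ham)


# Made-up toy corpora, for illustration only.
spam = ["888 888 o888o buy now", "888 `88. cheap 888", "888 y8p click"]
ham = ["see you at the meeting", "the patch is attached",
       "call 888 for the help desk"]

spam_counts, ham_counts = train(spam), train(ham)
p_888 = spam_probability("888", spam_counts, ham_counts, len(spam), len(ham))
p_the = spam_probability("the", spam_counts, ham_counts, len(spam), len(ham))
print(f"P(spam | '888') = {p_888:.2f}")
print(f"P(spam | 'the') = {p_the:.2f}")
```

Nobody has to tell the filter that '888' is a typeface fragment; the
estimate moves toward spam simply because the token keeps turning up
there and only rarely in legitimate mail.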

What is missing is something I have seen lately: the use of e.g.

|_)      
|_)|_|\/ 
      /  
  _    ___                       
 | |  / (_)___ _____ __________ _
 | | / / / __ `/ __ `/ ___/ __ `/
 | |/ / / /_/ / /_/ / /  / /_/ / 
 |___/_/\__,_/\__, /_/   \__,_/  
           /____/              
                                       .o. 
                                       888 
ooo. .oo.    .ooooo.  oooo oooo    ooo 888 
`888P"Y88b  d88' `88b  `88. `88.  .8'  Y8P 
 888   888  888   888   `88..]88..8'   `8' 
 888   888  888   888    `888'`888'    .o. 
o888o o888o `Y8bod8P'     `8'  `8'     Y8P 


to beat a Bayesian filter.

Spammers have also started poisoning headers with lines like:
    X-Literature: Once upon a time ...
which they expand for a few lines, just as they add random words or
literature in the text part of multipart/alternative messages.


Again, this example is a fun step, but it doesn't work as the spammer
expects.  Many (not all) Bayesian filters tag token location (e.g.
whether the token was in the header or the body). So the X-Literature
attack sticks out for several reasons: the majority of legitimate
messages don't have an X-Literature: header, and the text "Once upon a
time" doesn't mix with body text due to tagging.

So if another spammer now sticks in a header such as "X-Literature:
time to go!", the "time" token will have been seen before, in spam
headers only. In fact, it's not necessary to go that far: simply
seeing the "X-Literature:" label is indicative of spam, because
legitimate messages don't normally have it.
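
Here is a small Python sketch of what location tagging might look like.
The tag prefixes "hdrname:", "hdr:" and "body:" and the function
tagged_tokens are hypothetical, chosen for illustration; actual filters
use their own schemes. It only relies on the standard email parser.

```python
from email import message_from_string


def tagged_tokens(raw_message):
    """Yield tokens prefixed with where they were found (assumed format)."""
    msg = message_from_string(raw_message)
    for name, value in msg.items():
        # The header label itself is a feature, whatever its contents.
        yield f"hdrname:{name.lower()}"
        for word in value.lower().split():
            yield f"hdr:{word}"
    # Assumes a simple non-multipart message, so the payload is a string.
    for word in msg.get_payload().lower().split():
        yield f"body:{word}"


RAW = """\
From: someone@example.com
Subject: hello
X-Literature: Once upon a time

It is time to buy
"""

tokens = list(tagged_tokens(RAW))
print(tokens)
```

With this kind of tagging, "time" in the X-Literature header becomes
"hdr:time" and "time" in the body becomes "body:time", so statistics
learned from one never dilute the other, and "hdrname:x-literature"
is a token in its own right.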

A statistical filter will recognize all these things automatically.

-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg