
Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

2004-09-22 19:03:40
On Sep 22 2004, Markus Stumpf wrote:

My emails are mostly in English ;-)

So you account for nearly 5% of mail traffic? ;-) Ok, I thought you were
mentioning a specific problem you were having, but we can talk in general.

The problem, however, is that IMHO most people use the simple setups
provided by their ISP. They don't have the skills to manage even
slightly sophisticated setups; some even have trouble hitting the
correct button when faced with the decision "spam" or "not spam".

In general, they don't even want those buttons. It's a hard problem...

possible tokens grows. SpamAssassin is notoriously slow; that's the
cost of using a scripting language (perl). Of course, optimized C is
harder to adapt locally.

Sure, that helps the running time, but sooner or later even optimized
C code will have trouble handling word lists the size of, e.g., all
English and German words with ratings. And with the spammers sending
extracts from literature (I've seen up to 1K of data), the databases
will grow far beyond the 800 words statistically used in
person-to-person emails.

I'm not sure I understand what you are saying. There are two issues.

The size of a single message to be classified is small (the biggest
I've seen personally, IIRC, was about 30,000 filtered tokens, but
often much less), and that means the total time to classify is 
dominated by scripting language overhead. A C implementation
can be 10-50 times faster than the equivalent perl, simply because
perl does so many things behind the scenes and the classification work
is not enough to amortize that extra overhead. If messages were
normally much, much bigger, the speed gain would of course be smaller.

I haven't found that the total size of the word list is a problem.
You need to optimize the representation of tokens, but my current spam
category, for example, has 400,000 unique tokens, and 100,000 for
nonspam, and this has no impact on performance. It could be 10 times
bigger without seriously affecting performance. These sizes are
easily big enough to contain the full vocabulary of a person's mail
(but not necessarily that of a large group of diverse people). After
that, you can manage word lists, or use stemming, and finally use
compression if you like. I'd estimate that a 10,000,000 unique token
list on a desktop is well within our capabilities today, with the
correct data structures. In any given classification, all you need to
do is look up 800 tokens (say) among those 10,000,000 quickly.
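To make that concrete, here is a rough sketch (in Python for brevity,
not the perl or C implementations discussed here; the token names and
counts are made up): the table can hold millions of entries, but one
classification only touches the few hundred tokens that actually
occur in the message.

    # Rough illustration only: a plain hash table of token counts stands
    # in for the word list; real filters store weights/frequencies.
    import random
    import time

    rng = random.Random(42)

    # Pretend word list: a couple of million unique tokens with counts.
    table = {"tok%07d" % i: rng.randint(1, 1000) for i in range(2_000_000)}

    # A single message only contributes a few hundred distinct tokens.
    message_tokens = ["tok%07d" % rng.randrange(2_000_000) for _ in range(800)]

    start = time.perf_counter()
    total = sum(table.get(tok, 0) for tok in message_tokens)   # 800 lookups
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("%d tokens in table, 800 lookups in %.2f ms" % (len(table), elapsed_ms))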

Sorry, I was unclear. What I was trying to say is that if the
spammers' dictionary-text attacks enlarge the set of words and make
these words "neutral", it becomes harder to make decisions. I don't have

Ok, I can address this I think: in a typical "Bayesian" filter, the
words are each given a weight, so every word is both good and bad to
some degree. The weights depend on occurrence frequencies. When a
spammer uses a dictionary attack, he cannot convert a "good" word
into a "bad" word simply by using this word in a spam message. This
is not like a game of Othello.

To make a word's weight closer to "bad", a spammer needs your
cooperation (individual algorithms vary; this is a general discussion).
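For concreteness, here is a toy Python sketch of the kind of
frequency-based token weight I mean (Graham-style; individual filters
use different formulas and smoothing): a word you use a lot stays
"good" even after a spammer stumbles on it.

    # Toy Graham-style token weight; real filters differ in the details
    # (smoothing, corpus-size corrections, etc.).
    def token_spam_prob(spam_count, ham_count, n_spam_msgs, n_ham_msgs):
        """Rough P(spam | token) from per-corpus occurrence frequencies."""
        spam_freq = spam_count / max(n_spam_msgs, 1)
        ham_freq = ham_count / max(n_ham_msgs, 1)
        if spam_freq + ham_freq == 0:
            return 0.5                    # unseen token: neutral
        p = spam_freq / (spam_freq + ham_freq)
        return min(max(p, 0.01), 0.99)    # never fully certain

    # A word you use often stays "good" after a spammer uses it once:
    print(token_spam_prob(spam_count=1, ham_count=40,
                          n_spam_msgs=1000, n_ham_msgs=1000))   # about 0.02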

correct numbers, but what I mean is the following:
- usually spammers use 300 different words in spam mails
- these are trained quickly and easily
- classification is easy, and emails containing these words plus some
  other "good" words are not classified as spam
Now the spammers send the literature attacks. This grows the set of
words used by spammers from 300 to 3000, and it grows the average number
of words in a spam mail from 150 to 250. As, say, 2000 of the 2700 new
words are also used in "good" emails and used to help the classification,
they now become worthless, as they have to be classified "neutral". Even worse

Ok, you appear to assume that "neutral" word weights are as important
as "extreme" word weights in making the decision. I don't think that's
true. In a typical naive Bayesian algorithm, the decision is heavily
influenced by a small number of extreme words. You could remove the
"neutral" words and get a similar result in many cases. So those
extra 2700 words don't usually change any decisions; they were already
"worthless" before the spammer used them.

Now we have to ask, which words can the spammer convert from "good" to 
"neutral" or "bad"? If the word has a high frequency, then the spammer
must match this frequency.

If you use the word "cake" often, then the spammer can only make this
word "neutral" or "bad" by using it more often than you have used it
yourself. And you don't stand still: you continue to use "cake", so
the spammer would have to use the word "cake" at a faster rate than
you and your friends. For example, if you have used "cake" 35 times
and the spammer has used "cake" 20 times, that would be roughly
equivalent to your having used "cake" 15 times and the spammer never
having used it.
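To put toy numbers on the "cake" example (illustration only; I'm
ignoring corpus-size normalization, and the exact arithmetic depends
on the algorithm): with a simple frequency weight, the word stays on
the "good" side as long as you keep using it more than the spammers
do.

    # Toy weight for a single word, assuming equally sized spam/nonspam
    # corpora; the exact numbers depend on the particular filter.
    def cake_weight(spam_count, ham_count):
        if spam_count + ham_count == 0:
            return 0.5
        return spam_count / (spam_count + ham_count)

    print(cake_weight(spam_count=0,  ham_count=35))   # 0.00 -> strongly "good"
    print(cake_weight(spam_count=20, ham_count=35))   # 0.36 -> still leans "good"
    print(cake_weight(spam_count=40, ham_count=35))   # 0.53 -> spammer must out-use you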

If the word has low frequency (once every 6 months or less), then the
spammer can stumble on that word and convert it to "neutral" or "bad"
very easily. But this word never matters very much in decisions, so
it's a nearly insignificant gain for the spammer.

if they are often used in good emails, the use of those words grows the
overall set of words used in spam, weakening the rating, and they carry
a "good" weight which can outweigh the "bad" count.


Take a set of emails and, as a first approximation, consider only the
most extremely weighted words in it. It is not enough for the spammer
to stumble upon several words with an extreme "good" rating; he must
also *not* stumble upon several words with an extreme "bad" rating at
the same time, enough for the balance to be "good".

Now if he wants to communicate a message, he already has to use a few
extreme "bad" rated words, assuming you've trained your filter on a
representative sample. So he starts off with a handicap, and he's
looking to find random words with extreme "good" ratings, without
getting random words with extreme "bad" ratings at the same time.
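A small numerical illustration of that handicap (toy weights again,
not a real filter): the handful of "bad" payload words he cannot
avoid dominate the couple of "good" words he happens to stumble on.

    import math

    # Toy weights: combine per-token spam probabilities naive-Bayes style.
    def combine(probs):
        log_spam = sum(math.log(p) for p in probs)
        log_ham = sum(math.log(1.0 - p) for p in probs)
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    payload_bad = [0.99, 0.98, 0.97]   # words he needs to sell the product
    lucky_good = [0.05, 0.10]          # random quoted words that happen to be "good"

    print(combine(payload_bad))                # ~0.99999: clearly spam
    print(combine(payload_bad + lucky_good))   # ~0.999: still clearly spam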

How does he prevent extra "bad" ratings? He needs the cooperation of
other spammers. If he quotes a piece from Hamlet, and several other
spammers (maybe he himself) have already quoted a piece from Hamlet
earlier, then those random words will easily have "bad" ratings to
counterbalance the "good" ratings he's trying to find randomly. If
nobody has ever quoted Hamlet before, then those words don't have bad
ratings yet.

Again, I want to make clear this is not an analysis of a particular
Bayesian algorithm, just a general discussion of what to look for.
Also, some Bayesian filters are static, updated every 6 months
(e.g. Microsoft), so the spammers can analyse and attack these
unchanging systems, which is much easier.

-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg