
Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

2004-09-22 08:40:33
On Wed, Sep 22, 2004 at 10:46:38AM +1000, Laird Breyer wrote:
> Der, die, das... With German statistics swamping your inbound
> legitimate mail, you can also try to train a second filter solely on
> the English communications. Let the first filter separate the German
> mail from spam, and let the second filter separate the English mail
> from the spam that wasn't German. If you have enough legitimate
> English examples, this might work better than having a single filter.
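
A minimal sketch of that cascade, with toy stand-ins for the two
trained filters (real ones would be statistical filters trained on the
corpora described above):

    # Two-stage cascade: filter 1 separates German ham from spam,
    # filter 2 separates English ham from whatever filter 1 called spam.
    # Both filters below are toy stand-ins, not real trained classifiers.

    def classify_cascade(msg, german_filter, english_filter):
        if german_filter(msg) == "ham":    # trained on German ham + all spam
            return "ham (german)"
        if english_filter(msg) == "ham":   # trained on English ham + non-German spam
            return "ham (english)"
        return "spam"

    german_ham_words = {"der", "die", "das"}

    def german_filter(msg):
        return "ham" if german_ham_words & set(msg.lower().split()) else "spam"

    def english_filter(msg):
        return "ham" if "meeting" in msg.lower() else "spam"

    # German: "the meeting is tomorrow"
    print(classify_cascade("das treffen ist morgen", german_filter, english_filter))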

My emails are mostly in English ;-)
The problem, however, is that IMHO most people use the simple setups
provided by their ISP. They don't have the skills to manage even
slightly sophisticated setups; some even have trouble hitting the
correct button when asked to rate a message as "spam" or "not spam".
They don't want spam, but they also don't want to pay for the
implications, be it paying someone to manage the filters or learning
how to use them.
So power users can effectively fight spam and drive the spammers into
a corner, but the vast majority cannot, and that is why spammers
succeed and survive (power users wouldn't buy from spammers, joe
lusers do, and that is the reason spam works :(.

> possible tokens grows. SpamAssassin is notoriously slow, that's the
> cost of using a scripting language (perl). Of course, optimized C is
> harder to adapt locally.

Sure, that covers the running costs, but sooner or later even optimized
C code will feel the side effects of handling wordlists the size of,
say, all English and German words with ratings. And with spammers
sending extracts from literature (I've seen up to 1K of data), the
databases will grow far beyond the 800 words statistically used in
person-to-person emails.

> What you forget is that the decisions that apply to any given message
> are limited. For example, a typical spam message is around 1K or less.
> Only the tokens found in that message are used for decisions, so the
> rest of the word list shouldn't directly matter.
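
A toy sketch of that point, with a hypothetical token table mapping
token -> spam probability: scoring only ever touches the tokens present
in the message, so a larger table costs storage, not per-message work.

    import re

    # Hypothetical token table: token -> P(spam | token). A real filter's
    # table may hold hundreds of thousands of entries; only the handful
    # of tokens actually in the message are ever looked up.
    token_probs = {"viagra": 0.99, "free": 0.85, "meeting": 0.05}

    def message_probs(message, table, unknown=0.5):
        tokens = set(re.findall(r"[a-z']+", message.lower()))
        return {t: table.get(t, unknown) for t in tokens}  # len(tokens) lookups

    print(message_probs("Free meeting about viagra today", token_probs))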

Sorry, I was unclear. What I was trying to say is that if the spammers'
dictionary/literature attacks enlarge the set of words seen in spam and
push those words towards "neutral", it becomes harder to make
decisions. I don't have exact numbers, but what I mean is the
following:
- usually spammers use 300 different words in spam mails
- these are trained fast and easily
- classification is easy, and emails containing these words but also
  some other "good" words are not classified as spam
Now the spammers send the literature attacks. This grows the set of
words used by spammers from 300 to 3000, and it grows the average
number of words in a spam mail from 150 to 250. Since, say, 2000 of the
2700 additional words are also used in "good" emails and previously
helped the classification, they now become worthless because they have
to be rated "neutral". Even worse, if they are frequently used in good
emails, those words both enlarge the overall set of words seen in spam,
weakening the rating, and carry a "good" weight which can outweigh the
"bad" count.

> It is possible that your filter is getting stale. Are you talking
> about ordinary SpamAssassin rules, or a statistical filter
> specifically? Even statistical filters grow stale if not trained on
> new messages over time, because new messages exhibit topic drift.

I wasn't talking about any specific filter, but about the problems the
literature attacks IMHO cause for statistical filters in general.

        \Maex

-- 
SpaceNet AG            | Joseph-Dollinger-Bogen 14 | Fon: +49 (89) 32356-0
Research & Development |       D-80807 Muenchen    | Fax: +49 (89) 32356-299
"The security, stability and reliability of a computer system is reciprocally
 proportional to the amount of vacuity between the ears of the admin"
