
Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

2004-09-21 18:04:29
On Sep 21 2004, Markus Stumpf wrote:
> On Tue, Sep 21, 2004 at 07:08:37PM +1000, Laird Breyer wrote:
> > Heh, you're right of course. Moreover, Markus, to whom I replied, lives
> > in Germany according to his sig. Double whammy ;-)

> :-)
> For a great many of our customers, who have <5% of their email communication
> in English, SpamAssassin - after some training - works like a charm with
> very high success rates (and, from my feeling, an overly high false
> positive rate when a message in English does come through).

Der, die, das... With German statistics swamping your inbound
legitimate mail, you can also try to train a second filter solely on
the English communications. Let the first filter separate the German
mail from spam, and let the second filter separate the English mail
from the spam that wasn't German.  If you have enough legitimate
English examples, this might work better than having a single filter.
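
Roughly, in Python, something like the sketch below. The filter objects
and their classify() method are just placeholders for whatever trained
classifiers you actually run, not any particular implementation:

    def classify_cascade(message, german_filter, english_filter):
        # Stage 1: a filter trained on German nonspam vs. spam.
        if german_filter.classify(message) == "nonspam":
            return "nonspam"        # looks like legitimate German mail
        # Stage 2: a filter trained on English nonspam vs. the spam
        # that wasn't caught as German in stage 1.
        return english_filter.classify(message)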

> Yes, but with all the zillions of variations, the spammers are IMHO also
> attacking the size of the databases.

Sure, they do whatever they can. If your filter can handle large token
sets efficiently, then it shouldn't matter. However, it could be
considered a DoS attack if the filter slows down as the number of
possible tokens grows. SpamAssassin is notoriously slow; that's the
cost of using a scripting language (Perl). Of course, optimized C is
harder to adapt locally.

> I wasn't too good at statistics, but IMHO it is getting harder for
> statistical filters if the set of data the decisions are based on is
> growing, even more so if the set of tokens in a test has a lot of "no
> decision" tokens and only a few decision-making tokens. This still helps to

What you forget is that the decisions that apply to any given message
are limited. For example, a typical spam message is around 1K or less.
Only the tokens found in that message are used for decisions, so the
rest of the word list shouldn't directly matter. 

Decisions which don't apply shouldn't affect the spam question
for that single message.  If this is a statistical filter, the only things
that should matter are the relative frequencies of the observed tokens in
the spam and nonspam corpora.  But individual algorithms vary, of course.
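
To make that concrete, here is a minimal sketch (not SpamAssassin's or any
particular filter's actual algorithm) of a log-odds score built from
per-token relative frequencies. Only the tokens present in the message are
ever looked up, so the size of the stored token database costs memory, not
per-message computation time:

    import math

    def spam_log_odds(message, spam_freq, ham_freq):
        # spam_freq[t], ham_freq[t]: smoothed relative frequencies of
        # token t in the spam and nonspam training corpora
        # (hypothetical lookup tables).
        score = 0.0
        for token in set(message.lower().split()):
            p_spam = spam_freq.get(token, 1e-6)
            p_ham = ham_freq.get(token, 1e-6)
            # A token unseen in both corpora contributes log(1) = 0.
            score += math.log(p_spam / p_ham)
        return score    # > 0 leans towards spam, < 0 towards nonspam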

> classify, but instead of 70:30 decisions we end up with a *lot* more
> 50.1:49.9 decisions, which has the effect of growing false positive rates.

It is possible that your filter is getting stale. Are you talking about 
ordinary SpamAssassin rules, or a statistical filter specifically?
Even statistical filters grow stale if not trained on new messages over time,
because new messages exhibit topic drift. 

E.g., suppose you filter messages into spam and discussions of chocolate
ice cream. When people start talking about strawberry cake, a binary
classifier will find it difficult to decide whether this cake is spam or
ice cream.
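
In the log-odds picture sketched earlier, tokens the filter has never seen
(or sees about equally often in both corpora) contribute nothing, so a
strawberry cake message lands near the 50:50 line, which is exactly where a
fixed threshold starts making coin-flip mistakes, i.e. the near-50:50
effect you describe. A toy illustration with made-up frequencies:

    import math

    spam_freq = {"viagra": 0.02, "chocolate": 0.0001}
    ham_freq  = {"viagra": 0.0001, "chocolate": 0.02}

    def spam_probability(message):
        # Combine per-token log-odds; unseen tokens contribute nothing.
        log_odds = sum(math.log(spam_freq.get(t, 1e-6) / ham_freq.get(t, 1e-6))
                       for t in set(message.lower().split()))
        return 1.0 / (1.0 + math.exp(-log_odds))

    print(spam_probability("buy viagra now"))              # close to 1
    print(spam_probability("chocolate ice cream recipe"))  # close to 0
    print(spam_probability("strawberry cake recipe"))      # 0.5: a pure guess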

Alternatively, if the number of people you filter for simultaneously is
growing, then it is normal for a filter (any filter) to make more mistakes.

-- 
Laird Breyer.
