RE: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?


First post this...

It's actually trivial for anyone to create new nonsense words: for a typical 
word length of 6 characters and an alphabet of 26 letters there are 26^6 
(roughly 309 million) possible combinations. Which is a lot...
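To put a number on that, here is a minimal Python sketch. The function names are mine and purely illustrative, and it assumes lowercase ASCII letters only:

```python
import random
import string

def nonsense_word_count(length=6, alphabet_size=26):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length

def random_nonsense_word(length=6, rng=random):
    """Draw one random lowercase string; almost certainly not a real word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

print(nonsense_word_count())   # 308915776
print(random_nonsense_word())  # a random 6-letter string, different on each run
```

Even a spammer who burns through thousands of words per mailing barely dents a pool of that size.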

New words don't necessarily tip the filter towards spam. Remember: a Bayesian 
filter is a statistical method which calculates the probability of an event 
given the past history of that event.

If the token is new, the Bayesian filter has no past history to go on. Such 
tokens are assigned a default weight based on external assumptions. Paul 
Graham's well-known paper assigns new tokens a weight of 0.4, i.e. slightly 
on the non-spam side of neutral (0.5). IMHO his stated assumption is dubious. 
In my implementation I assign a slightly spammy weighting, on the assumption 
that people's vocabulary is actually quite limited and stable over time, as 
opposed to that of spam.
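A rough sketch of where that default enters the calculation. This is a simplified per-token estimate, not Graham's exact formula and not my actual implementation. The function name, the count dictionaries, and the 0.01/0.99 clamping bounds are assumptions made for illustration:

```python
def token_spam_probability(token, spam_counts, ham_counts,
                           n_spam, n_ham, unknown_prob=0.4):
    """Estimate P(spam | token) from past counts.

    spam_counts / ham_counts: dicts mapping token -> number of occurrences.
    n_spam / n_ham: number of spam / legitimate messages seen (must be > 0).
    unknown_prob: default weight for a token with no history at all.
    """
    b = spam_counts.get(token, 0)
    g = ham_counts.get(token, 0)
    if b + g == 0:
        # No past history: fall back on the external assumption.
        return unknown_prob
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))

spam_counts = {"viagra": 8}
ham_counts = {"meeting": 5}
print(token_spam_probability("xqzvkp", spam_counts, ham_counts, 10, 10))
# 0.4 (unseen token, default leaning towards non-spam)
print(token_spam_probability("xqzvkp", spam_counts, ham_counts, 10, 10,
                             unknown_prob=0.6))
# 0.6 (unseen token, default leaning towards spam)
```

The 0.6 is only a placeholder for "slightly spammy"; the whole disagreement comes down to which side of 0.5 that one parameter sits on.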

The bigger problem for Bayesian filters is not the random words. It's messages 
which contain extracts from literature. For these kinds of messages, using the 
tokens individually results in lower filtering performance. For such messages 
the co-occurrence probability of the tokens has to be used instead. And then it 
wouldn't be a Naïve Bayesian filter anymore...
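To illustrate why treating tokens individually hurts here, below is a sketch of the usual naive combination rule (product of the per-token probabilities against the product of their complements, keeping only the most "interesting" tokens, in the style of Graham's paper). The function name and the token probabilities in the example are made up for illustration:

```python
from math import prod

def combined_spam_probability(token_probs, max_tokens=15):
    """Naively combine per-token spam probabilities.

    Tokens are treated as independent, which is exactly the 'naive'
    assumption. Only the max_tokens values furthest from 0.5 are used.
    Assumes every probability lies strictly between 0 and 1.
    """
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5),
                         reverse=True)[:max_tokens]
    spam_evidence = prod(interesting)
    ham_evidence = prod(1.0 - p for p in interesting)
    return spam_evidence / (spam_evidence + ham_evidence)

sales_pitch = [0.99] * 5            # five strongly spammy tokens
literary_padding = [0.01] * 10      # ten innocent-looking words from a novel

print(combined_spam_probability(sales_pitch) > 0.99)                    # True
print(combined_spam_probability(sales_pitch + literary_padding) < 0.5)  # True
```

Each padded word counts as independent evidence of innocence, so enough of them outvote the sales pitch. Modelling which tokens occur together would avoid this, but at that point the independence assumption is gone.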

Brian Azzopardi

-----Original Message-----
From: asrg-bounces(_at_)ietf(_dot_)org 
[mailto:asrg-bounces(_at_)ietf(_dot_)org] On Behalf Of Laird Breyer
Sent: Saturday, September 18, 2004 2:54 AM
To: ASRG
Subject: Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

On Sep 17 2004, Jim Witte wrote:

   Has anyone tried making a partial spam filter by scanning messages 
for the non-sense words they put in to try to confuse the Bayesian 
filters?


Those nonsense rules backfire with Bayesian filters. Since nonsense
words don't occur in legitimate messages (how is the spammer going to
force people to add such words?), all such words tip the balance
towards spam. When a filter looks up the words to see if they exist in
its database, then either the word is completely new or it has already
occurred in spam.

It's very hard for a spammer to create new nonsense words, all the
time.  After a while, some spammer (not necessarily the same) has
already used that nonsense word, and the filter knows about it.

I don't know of any open source statistical filter that has serious
trouble with nonsense words over time. They stick out like a sore
thumb. It's likely that the real purpose for including them is to
evade hash/signature filters.


-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg
