RE: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?


When I said that there are 26^6 possible combinations, I was referring to what
a random token generator could produce. I can't imagine a tokenizer
being limited to the 26 letters of the alphabet plus the 10 digits. Who would
be silly enough to write a tokenizer which delimits on digits? I think your
discussion on this point is moot.

Tokenizers today have to be much smarter than simply extracting single words. 
They should recognize HTML tags (while ignoring the non-valid ones), IPs, 
dates, and other relevant information. 

The basic Bayesian algorithm is sound - what needs more research though is the
representation of the message that is passed to the Bayesian filter. Currently
most filters use a direct representation - that is, the filter is fed the
tokens found. It does not have to be like this. A trivial example: tokens which
are longer than 12 characters are replaced by a different token, say,
"BAYESIAN_TOKEN_TOO_LONG", which is then fed to the Bayesian filter.
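
A minimal sketch of that trivial example, in Python. The function name
remap_tokens and the sample message are made up for illustration; only the
12-character threshold and the placeholder token come from the paragraph above.

```python
MAX_TOKEN_LENGTH = 12                   # threshold from the example above
TOO_LONG = "BAYESIAN_TOKEN_TOO_LONG"    # placeholder token from the example


def remap_tokens(tokens):
    """Replace every token longer than MAX_TOKEN_LENGTH with the placeholder.

    Hypothetical helper: it sits between the tokenizer and the Bayesian
    filter, so the filter sees the placeholder instead of the raw token.
    """
    return [TOO_LONG if len(t) > MAX_TOKEN_LENGTH else t for t in tokens]


# Made-up sample message; the third word is 14 characters long.
tokens = "cheap pills xkqjwzvbnmplkt today".split()
remapped = remap_tokens(tokens)
print(remapped)
```

The filter then learns one statistic for "overly long token" rather than a
separate, never-repeated statistic for each long random string.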

Most email is too short to extract any meaningful higher-level information from 
it. For example, Zipf's law holds for long texts but emails are usually too 
short for it to apply.

Brian Azzopardi


On Sep 18 2004, Brian Azzopardi wrote:

It's actually trivial for anyone to create new nonsense words: for
a typical word length of 6 characters and an alphabet of 26 letters
there are 26^6 possible combinations. Which is a lot...


Yes and no. In principle, 26^6 is a lot, but in practice it also
depends on other things. For example, the spammer could use (26+10)^6,
including digits, which is an even bigger set of words. Even better?

It depends on the tokenizer. If the tokenizer is alpha only, then the
chance of at least one digit is 1 - (26/36)^6 = 0.85, and this digit
can be anywhere. With one digit, the tokenizer will split the token
into two purely alpha pieces, with respective sizes (0,5), (1,4) or
(2,3). So with probability 0.85 * (4/6), the tokenizer will see an
alpha token of maximum length 2, which has only 26^2 possibilities. If
a paragraph contains 20 nonsense words of length 6, then there are
about 11 words of length <= 2, and about half of those words are of
length 1. 
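
The arithmetic above can be rechecked with a short Python sketch. It uses the
same simplification as the text (a word with a digit is treated as having
exactly one digit, placed uniformly), and all variable names are illustrative.
Computed this way, the chance of at least one digit comes out at roughly 0.86,
close to the 0.85 quoted above.

```python
LETTERS, DIGITS, WORD_LEN, WORDS = 26, 10, 6, 20

combos_alpha = LETTERS ** WORD_LEN               # 26^6 = 308915776
combos_alnum = (LETTERS + DIGITS) ** WORD_LEN    # 36^6 = 2176782336

# Probability that a random alphanumeric word contains at least one digit.
p_digit = 1 - (LETTERS / (LETTERS + DIGITS)) ** WORD_LEN

# Simplification from the text: exactly one digit, at a uniform position.
# An alpha-only tokenizer splits the word into pieces of sizes (i, 5 - i).
splits = [(i, WORD_LEN - 1 - i) for i in range(WORD_LEN)]


def shortest_piece(split):
    """Length of the shortest non-empty alpha piece in a split."""
    return min(n for n in split if n > 0)


short = [s for s in splits if shortest_piece(s) <= 2]   # (1,4) (2,3) (3,2) (4,1)
single = [s for s in short if shortest_piece(s) == 1]   # (1,4) (4,1)

p_short = p_digit * len(short) / len(splits)   # p_digit * (4/6)
expected_short = WORDS * p_short               # short tokens per 20 words

print(combos_alpha, combos_alnum)
print(f"{p_digit:.3f} {p_short:.3f} {expected_short:.1f}")
print(len(single) / len(short))                # share of short tokens of length 1
```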

So the spam contains about 5 examples of single letters, and there are
only 26 letters in the (English) alphabet. But in English, most letters
do not occur alone as a single word, so a handful of emails with these
nonsense words will make most of the possible single letter words
known to the filter, and when single letters are found with uniform
frequencies over the alphabet in a message, these will already strongly
suggest that this message is spam.

This analysis is very simplified, because single letters also occur
often in machine generated tokens, e.g. URLs, and Bayesian filters
look at many other features. But it shows that the useful space of
nonsense words is not under the control of spammers only, because e.g.
they can't choose the filter's tokenizer.

Another reduction of the nonsense word space occurs when spammers
insert random words "by hand" instead of with a random number
generator. In that case, the layout of the keyboard dramatically
reduces the types of likely random sequences produced.

Finally, the length of the nonsense words is often uniform, which is
not at all like real language. But therein lies another catch-22. Does
the spammer pick natural word lengths? In that case, he will include
many small words, and these small words have fewer nonsense
combinations, so they are more easily recognized. Does the spammer use
many long words? Then the surprising length of the words compared to
natural languages indicates something strange is going on.

New words don't necessarily tip the filter towards spam. Remember: a
Bayesian filter is a statistical method which calculates the probability
of an event given the past history of that event.


Yes. I'm certainly not arguing that a filter should know a priori if
nonsense words are nonsense. But as you point out, the filter can
always bias against nonsense words if it wants. The true test is the
number of errors overall.

The bigger problem for Bayesian filters is not the random
words. It's messages which contain extracts from literature. For
these kinds of messages, using the tokens individually results in
lower filtering performance. For such messages the co-occurrence
probability of the tokens has to be used instead. And then it
wouldn't be a Naïve Bayesian filter anymore...  Brian Azzopardi
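
One cheap way to bring some co-occurrence information into the filter is to
feed it adjacent token pairs alongside the single tokens. This is an assumption
made for illustration, not something proposed in this thread, and the helper
name with_pairs is made up. A rough Python sketch:

```python
def with_pairs(tokens):
    """Return the single tokens followed by adjacent-pair features.

    Hypothetical helper: each pair is emitted as one extra token, so an
    otherwise unchanged token-counting filter also sees which words
    appear next to each other.
    """
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + pairs


features = with_pairs(["call", "me", "ishmael"])
print(features)
```

Note that this only approximates co-occurrence with neighbouring pairs, and
overlapping pairs are clearly not independent of the single tokens they
contain, which is the independence concern raised above.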


In principle, that's debatable too. Certainly, it's a better attack
than the nonsense word attack, but a quoted message still doesn't have
the correct token frequencies of any given individual's legitimate
mail. Moreover, newspaper language has its own statistical properties too,
so for individual (not corporate gateway) filters, the newspaper style
should be separable from the "family + friends" style quite quickly.
For corporate gateways, the statistical classification problem will
always have higher error rates, if all other factors are equal.

-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


