
Re: [Asrg] Filtering spam by detecting 'anti-Bayesian' elements?

2004-09-18 06:07:59
On Sep 18 2004, Brian Azzopardi wrote:

> When I said that there are 26^6 possible combinations I was
> referring to what a random token generator could produce. I can't
> imagine a tokenizer being limited to the 26 letters of the alphabet
> + 10 digits. Who would be silly enough to write a tokenizer which
> delimits on digits? I think your discussion on this point is moot.

While the computation I gave was geared towards your example, the gist
of it applies to similar questions in general. In other words,
spammers cannot control the size of their randomly generated chaff
unless they design it to combat a particular anti-spam filter
version. I don't know why you think a tokenizer shouldn't delimit
on digits. Have you tried it?
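To make the point concrete, here is a toy sketch (not taken from any
particular filter) contrasting a whitespace-only tokenizer with one
that also delimits on digits; both function names are my own:

```python
import re

def tokenize_whitespace(text):
    """Split on whitespace only: digits stay glued to letters."""
    return text.split()

def tokenize_on_digits(text):
    """Split on whitespace AND digits: only runs of letters survive."""
    return [t for t in re.split(r"[\s0-9]+", text) if t]

chaff = "v1agra xK7q9z meeting at 10am"
print(tokenize_whitespace(chaff))   # 'v1agra' and 'xK7q9z' kept whole
print(tokenize_on_digits(chaff))    # split into shorter letter runs
```

The digit-delimiting variant turns a nonsense token like "xK7q9z"
into several short, more frequently recurring fragments, which is
exactly the kind of unintuitive trade-off that only testing on real
datasets can settle.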

If all you delimit on is white space, you should check out several of
the open source Bayesian projects, which have extensive discussions in
their mailing list archives and source code about the best token
forms. Perhaps the most interesting lesson from this is that the best
token forms are quite unintuitive, and only become visible with
thorough testing on real datasets. Human intuitions about what works
tend to be suboptimal.

> Tokenizers today have to be much smarter than simply extracting
> single words. They should recognize HTML tags (while ignoring
> invalid ones), IPs, dates, and other relevant information.

I agree with you on this, and that is another facet, unrelated to the
nonsense-word problem. After filtering tags, noting IPs, and so on,
you still have the basic question of what constitutes an acceptable
token.

> The basic Bayesian algorithm is sound - what needs more research,
> though, is the representation of the message that is passed to the
> Bayesian. Currently most filters use a direct representation - that
> is, the filter is fed the tokens found. It does not have to be like
> this. A trivial example: tokens which are longer than 12 characters
> are mapped to a different token, say "BAYESIAN_TOKEN_TOO_LONG",
> which is then fed to the Bayesian.

Indeed. It does all come down to representation, after which point the
question is "which algorithm?". 

What makes the representation problem particularly interesting is that
a full and complete representation is not necessarily better than a
very simple incomplete one. Sometimes, extra information only confuses the
decision procedure, not unlike the saying "too many cooks spoil the broth".
Moreover, the best representation depends on the algorithm, and conversely.
Years of fun ahead!

> Most email is too short to extract any meaningful higher-level
> information from it. For example, Zipf's law holds for long texts
> but emails are usually too short for it to apply.

I don't know. Do you treat header content separately from body
content?  Many Bayesian filters read both header and body together,
which gives quite a lot of information. Some treat each part
separately and then combine them.  Even a spam with an empty body has
routing information and bogus sender details, which adds up to a few
quality tokens. It really depends. 
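As a minimal sketch of that last point, the raw message below is
invented, but it shows how even a body-less spam yields usable header
tokens (the header-name prefixing is a common trick, not a quote from
any specific filter):

```python
import re
from email import message_from_string

raw = (
    "Received: from unknown (HELO mx.example.net) (192.0.2.77)\n"
    "From: \"Best Deals\" <noreply@bulk.example.com>\n"
    "Subject: \n"
    "\n"
)

msg = message_from_string(raw)
header_tokens = []
for name, value in msg.items():
    # Prefix tokens with the header name, so that e.g. a domain seen
    # in From: and in Received: counts as two different pieces of
    # evidence for the classifier.
    header_tokens += [f"{name}:{tok}" for tok in re.findall(r"[\w.\-]+", value)]

print(header_tokens)
```

Even with an empty body and empty subject, the routing line and the
sender details alone produce half a dozen tokens the filter can score.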

-- 
Laird Breyer.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg