On Sep 21 2004, Jose Marcio Martins da Cruz wrote:
Clearly, this is entirely untypical of ordinary language. Like the
nonsense words, this sticks out (e.g. what percentage of legitimate
messages do *you* have that don't contain the word "the"?).
Many. Two examples: As I live in France, most messages don't contain
the word "the" as they are written in French. Also, people at our
organisation send and receive messages in many other languages:
German, Italian, Russian, and even Chinese ... and English of course.
Heh, you're right of course. Moreover, Markus, to whom I replied, lives
in Germany according to his sig. Double whammy ;-)
A statistical filter will recognize all these things automatically.
Maybe, but there are many legitimate senders and even companies which
use this kind of message composition (Buy ... now) to add a footer to
all their messages. So, false positives...
In the short term yes, but in the long term (i.e. with training), the
footer is recognized. No miracles. As a general rule, tokens which
occur commonly in both ham and spam have little effect on a filtering
decision (Bayesian algorithms can vary). The decisions depend much
more on the presence of extreme tokens which (statistically) only
occur in spam, or only occur in ham (that's what I mean by extreme
here). It's very hard for spammers to discover which tokens are
extreme for any given individual.
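To make the point about extreme tokens concrete, here is a minimal
sketch of one common variant (per-token spam probabilities combined
naive-Bayes style over the tokens furthest from 0.5). It is not the
algorithm of any particular filter; the function names, the clamping
bounds, the cutoff k and all the counts are hypothetical, chosen only
for illustration.

```python
from math import prod

def token_spam_prob(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from how many spam/ham messages contain it."""
    s = spam_count / max(n_spam, 1)
    h = ham_count / max(n_ham, 1)
    if s + h == 0:
        return 0.5  # never seen: treat as neutral
    # Clamp so that no single token is treated as absolute proof.
    return min(0.99, max(0.01, s / (s + h)))

def message_spam_prob(tokens, probs, k=15):
    """Combine the k tokens whose probability is furthest from 0.5."""
    extreme = sorted(
        (probs.get(t, 0.5) for t in set(tokens)),
        key=lambda p: abs(p - 0.5),
        reverse=True,
    )[:k]
    spam = prod(extreme)
    ham = prod(1 - p for p in extreme)
    return spam / (spam + ham)

# Hypothetical training data: 100 spam and 100 ham messages.
# "buy" and "now" (the footer words) occur equally often in both,
# "viagra" only in spam, "asrg" only in this user's ham.
counts = {"buy": (50, 50), "now": (50, 50), "viagra": (40, 0), "asrg": (0, 30)}
probs = {t: token_spam_prob(s, h, 100, 100) for t, (s, h) in counts.items()}

ham_plain = message_spam_prob(["asrg"], probs)
ham_with_footer = message_spam_prob(["asrg", "buy", "now"], probs)
spam_with_footer = message_spam_prob(["viagra", "buy", "now"], probs)
```

With these made-up counts the footer tokens come out at 0.5, so they
cancel in the combination and the score is decided by "asrg" or
"viagra" alone. Which tokens play that role differs from one mailbox
to the next.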
In this case, to be acceptable, I define "ALL" as being 100%,
and "MOST OF THE TIME" as being 99.99%.
For how many people simultaneously? Statistical filters are no miracle
workers, and I wouldn't want to give the impression they are. Every
decision procedure has a nonzero error rate. You can approach your
target with personal filters, but of course if you also want to filter
spam on a corporate gateway it's a much more difficult problem.
I was merely pointing out that spammer attacks against statistical
filters are mostly hot air. Some attacks, such as exploiting bugs in
parsers, work. I'm not aware of any statistical attacks which work yet (i.e.
attacks which would make the algorithms useless). Of course, this is
for personal filters. For corporate filters, the problem is harder.
Asrg mailing list