John Stracke <jstracke(_at_)centivinc(_dot_)com> writes:
> The approach described looks only at the 15 words furthest from 0.5;
> it seems likely that most messages that would rank at 0.9 or above
> would have enough spam-words that words at 0.2 wouldn't show up.
I missed that point. Random words indeed wouldn't work then.
Guessing at `right' words still might. If the deployment results in
spammers sending multiple copies of their spew with different sets of
decoy words, the problem would actually get worse.
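For reference, the selection step being discussed can be sketched roughly like this (a toy version of Graham-style scoring, not his exact implementation; token probabilities and the 0.4 default for unknown words are illustrative):

```python
def spam_score(tokens, prob, n=15):
    """prob maps token -> estimated P(spam | token)."""
    # Keep only the n tokens whose probabilities are furthest from
    # neutral (0.5) -- the "15 most interesting words".
    interesting = sorted(tokens,
                         key=lambda t: abs(prob.get(t, 0.4) - 0.5),
                         reverse=True)[:n]
    ps = [prob.get(t, 0.4) for t in interesting]
    # Naive-Bayes combination of the selected probabilities.
    num = den = 1.0
    for p in ps:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)
```

This makes the quoted objection concrete: a message full of tokens near 0.99 crowds decoy words near 0.2 (distance 0.3 from neutral) out of the top-15 list, so mild decoys never enter the computation.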
One can imagine sets of decoy words for given demographics; e.g., for
networking nerds: TCP MPLS duplex route BGP...
One can imagine 1000 sets of decoy words for different categories of
people, with each message sent by the spammer in 1000 copies (so, you
might get it in 0 copies if you're very unusual -- or in 50 copies if
you discuss fishing, computer networking, investing, travel, and such
other categories in email regularly and the spammer has decoy lists
for fishing, etc.).
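The arithmetic behind those copy counts, with hypothetical figures matching the 1000-copy example above:

```python
# All numbers are illustrative: suppose the spammer targets 100
# demographic categories with 10 decoy lists each (1000 copies total),
# and this recipient writes regularly about 5 of those categories.
categories = 100          # distinct categories the spammer targets
lists_per_category = 10   # decoy lists (hence copies) per category
user_interests = 5        # categories present in the recipient's mail

copies_total = categories * lists_per_category
# Only copies whose decoys mimic the recipient's own vocabulary stand
# a chance of slipping past the filter.
copies_that_match = user_interests * lists_per_category
print(copies_total, copies_that_match)  # 1000 50
```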
One can imagine software that then looks for word combinations in
messages rather than individual words, making the filter's state much
larger and the spammer's job harder yet. The spammers would probably
respond by using random subsets of their decoy word sets.
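A minimal sketch of such combination-aware tokenization (word pairs in addition to single words; the function name is made up for illustration):

```python
def tokens_with_bigrams(text):
    words = text.lower().split()
    # Single words plus adjacent pairs.  The feature space grows
    # roughly quadratically with vocabulary size, so decoys must now
    # look innocent in combination, not just individually.
    return words + [a + " " + b for a, b in zip(words, words[1:])]

print(tokens_with_bigrams("free BGP seminar"))
# ['free', 'bgp', 'seminar', 'free bgp', 'bgp seminar']
```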
One thing that would be necessary, and that the author doesn't
mention, would be to decode content-encodings before applying the
filter; otherwise spammers could just base64 all their messages.
Spammers could also obfuscate the interesting parts of messages, such
as URLs, to keep them out of the token stream. Worse, if JavaScript
works in the recipient's MUA, then you have a Turing-complete way of
hiding content, and Rice's theorem works against you: no filter can
decide in general what an arbitrary program will display.
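The decoding step is straightforward with a standard MIME library; a minimal sketch using Python's stdlib email package (the sample message is made up):

```python
import email
from email import policy

# A hypothetical base64-encoded spam message.
RAW = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=us-ascii\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "\r\n"
    "QnV5IG5vdyE=\r\n"   # "Buy now!" in base64
)

msg = email.message_from_string(RAW, policy=policy.default)
# get_content() undoes the Content-Transfer-Encoding, so the filter
# tokenizes the real words rather than base64 noise.
print(msg.get_content())  # Buy now!
```

Feeding `msg.get_content()` rather than the raw body to the tokenizer closes the base64 loophole.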
Stanislav Shalunov http://www.internet2.edu/~shalunov/
Sex is the mathematics urge sublimated. -- M. C. Reed.