On Wed, 10 Sep 2003 18:22:37 -0400, Paul Judge wrote:
> On Wed, 10 Sep 2003 11:50:36 -0500, Terry Sullivan wrote:
>> ... these messages are probably designed to confuse
>> statistical language classifiers. (Again, they don't
>> work, won't work--and ultimately *can't* work...
> I don't know that we can simply say that they don't work.
Just to be clear, I'm not saying that it's not possible to try to
"confuse" statistical classifiers. What I am saying is that these
*particular* messages fail to achieve the goal (and they fail pretty
spectacularly, too). And there is no "tweak" that can be applied to
these (again, particular) messages that will allow them to succeed.
Even if the three different "tells" in these particular messages were
altered or eliminated, there are at least three *more* dead
give-aways just waiting to be picked up by some non-Bayesian
statistical classifier.
With all due respect to folks working on YABC (Yet Another Bayesian
Classifier), Naive Bayes is only one of several effective statistical
language processing (SLP) techniques--and it's a relatively simple
one, at that. (Dave Lewis did a talk at the MIT spam conference that
tried to underscore this very point.) The diversity of SLP
techniques makes it all-but-impossible to craft a message that can
get past *all* of them. In a very real sense, the very act of trying
to "fly under the radar" of one SLP tool has the predictable (maybe
even inevitable) effect of making a message *more* visible to a
different SLP tool.
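Just to make concrete how simple Naive Bayes is, as SLP techniques go, here's a toy add-one-smoothed token classifier. (A sketch only; the tiny training "corpus" and all the names are invented for illustration.)

```python
import math
from collections import Counter

def train(docs):
    """Count token frequencies per class from (tokens, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        totals[label] += 1
    return counts, totals

def classify(tokens, counts, totals):
    """Pick the class maximizing log P(class) + sum of log P(token|class),
    with add-one (Laplace) smoothing for unseen tokens."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        n = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for t in tokens:
            score += math.log((counts[label][t] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("free money click now".split(), "spam"),
    ("cheap meds free offer".split(), "spam"),
    ("meeting notes attached thanks".split(), "ham"),
    ("lunch tomorrow at noon".split(), "ham"),
]
counts, totals = train(docs)
print(classify("free offer click".split(), counts, totals))  # → spam
```

Real filters differ in tokenization, smoothing, and priors, but the core arithmetic really is just this: per-class token counts and a sum of log-probabilities.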
> One question is how long will we have the benefit of such
> a large distinction between the content of spam and non-spam?
You're right, this is a very important question. Right now, the
"smart-money" approach to statistical spam classification focuses on
finding the spam, not the "ham." It's possible to imagine that the
task would "flip," at some point, so that "ham" recognition, not spam
recognition, would become the higher-payback activity. If that day
ever comes, it's easy to imagine that spam elimination would become
much more costly.
> At what point along the line of the convergence of these
> vocabularies does the effectiveness and accuracy of
> Bayesian filters become affected?
Again, as SLP tools go, Bayes is relatively easy to fool. Fooling
the larger class of SLP technologies is much tougher.
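To make the "non-Bayesian tell" idea concrete (a hypothetical example, not the actual statistic I used): random-dictionary padding that slips past per-token Bayes scoring still distorts simple distributional measures, like the type-token ratio.

```python
def type_token_ratio(text):
    """Fraction of distinct words in a text. Random 'word salad'
    padding pushes this far above typical natural prose, where
    function words like 'the' and 'in' repeat constantly."""
    words = text.lower().split()
    return len(set(words)) / len(words)

prose = ("i will send the report in the morning and we can go over "
         "the numbers before the meeting in the afternoon")
salad = "kettle ambient fjord nimbus quartz velvet orchid tundra basalt"

print(type_token_ratio(prose))  # repeated function words lower the ratio
print(type_token_ratio(salad))  # all-distinct padding drives it to 1.0
```

A per-token Bayesian score can be dragged toward neutral by enough "innocent" words, but the padding itself shifts aggregate statistics like this one.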
I recently had occasion to try to do some (decidedly non-Bayesian)
statistical characterization of ham/spam differences. I ended up
with two interesting results:
1) There were four distinct "types" of spam.
Variation within each spam-type was much
smaller than the variation between
spam-types.
2) Only one of the four spam-types was even
remotely close to "ham."
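A toy illustration of what "variation within vs. between spam-types" means (the mini-corpora here are invented; think of each cluster as one spam-type): represent each message as a token-frequency vector, then compare average cosine similarity within a type against similarity across types.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_similarity(group_a, group_b):
    """Average pairwise similarity, skipping self-pairs."""
    pairs = [(x, y) for x in group_a for y in group_b if x is not y]
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

pills = [Counter(m.split()) for m in
         ["cheap pills buy now", "buy cheap pills today", "pills cheap buy"]]
loans = [Counter(m.split()) for m in
         ["low rate loan approved", "loan approved low rate", "approved loan rate"]]

within = (mean_similarity(pills, pills) + mean_similarity(loans, loans)) / 2
between = mean_similarity(pills, loans)
print(within > between)  # → True: tighter inside a spam-type than across types
```

When within-type similarity dwarfs between-type similarity, as above, the "types" are real clusters, not an artifact of the measurement.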
> There are two different exercises here:
> 1. A measurement study of the vocabulary space
> of actual spam mail and non-spam mail and the
> change in these spaces over time.
And, there's a really excellent third point hiding in this one. If,
at some point, the goal becomes "ham" recognition, then what would
vocabulary-driven techniques need to do in order to accommodate
legitimate changes in "ham" vocabulary over time?
> 2. An analysis of how the effectiveness...
> would be affected given certain measures of distinction
> between the two vocabularies. There is probably some
> existing work here that gets us close to an answer.
The proceedings of the annual ACM SIGIR conference are a great place
to start. (In case it isn't obvious, IR is just a "special case" of
classification.) There was also an *excellent* SLP textbook
published about 3 years ago. (I sorta doubt anyone's interested, but
I'll be happy to provide the reference.)
Meanwhile, as I was composing this message, Andrew chimed in...
On Thu, 11 Sep 2003 14:13:50 +0100 (BST), Andrew Akehurst wrote:
> Can I suggest a subtly different approach? Rather than trying to
> characterise spam, why not try and characterise your legitimate
> messages and see if incoming messages match that statistical
> profile?
Right now, it's easier (statistically) to recognize spam. I'm not
saying that that will never change, just that it hasn't so far, and
shows no signs of changing yet. (For purely selfish reasons, I hope
it never does.)
> As the definition of "spam" becomes fuzzier, does the accuracy of
> filtering decrease?
Not necessarily (although it may, someday). High-end SLP techniques
can make some very fine-grained distinctions among very highly
similar documents.
> I'm particularly thinking about false positives here...would
> it not make sense to match against a more stable message profile?
Again, you're following exactly the "right" line of thought. But
keep in mind that, for a message to become a false positive, that
sorta implies, by definition, that it looks "un-ham-like," for
some reason, right? That is, it's not *typical* ham; it looks
somehow uncharacteristic of the stable ham base. The "hard part" in
automatic classification is trying to distinguish between *two*
atypical classes of messages (atypical ham/atypical spam).
(It's also one of the things that makes it really tough to test an
effective SLP tool. Beyond a certain point, there's no payback to
testing against typical messages, regardless of type. I have to
manually hunt down other filters' false negs in order to get good
test cases for my SLP filter. It literally takes me hours to find
even a handful of good test cases.)
> I have a few ideas for statistical spam characteristics but I must
> admit that I lack the in-depth background in statistics to know if
> any of them would work in practice. Some expert input here would
> be welcome.
Let's talk off-list. ;-)
(Sorry. I know I probably bore 99% of folks on the list. But I do
so love this stuff.)
- Terry
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg