
RE: [Asrg] Re: 2a. Analysis - Spam filled with words

2003-09-11 07:18:43
On Wed, 10 Sep 2003 18:22:37 -0400, Paul Judge wrote:

> On Wed, 10 Sep 2003 11:50:36 -0500, Terry Sullivan wrote:
>
>> ... these messages are probably designed to confuse
>> statistical language classifiers.  (Again, they don't
>> work, won't work--and ultimately *can't* work...
>
> I don't know that we can simply say that they don't work.


Just to be clear, I'm not saying that it's not possible to try to 
"confuse" statistical classifiers.  What I am saying is that these 
*particular* messages fail to achieve the goal (and they fail pretty 
spectacularly, too).  And there is no "tweak" that can be applied to 
these (again, particular) messages that will allow them to succeed.  
Even if the three different "tells" in these particular messages were 
altered or eliminated, there are at least three *more* dead 
give-aways just waiting to be picked up by some non-Bayesian 
statistical classifier.

With all due respect to folks working on YABC (Yet Another Bayesian 
Classifier), Naive Bayes is only one of several effective statistical 
language processing (SLP) techniques--and it's a relatively simple 
one, at that.  (Dave Lewis did a talk at the MIT spam conference that 
tried to underscore this very point.)  The diversity of SLP
techniques makes it all but impossible to craft a message that can
get past *all* of them.  In a real sense, the very act of trying to
"fly under the radar" of one SLP tool has the predictable (maybe even
inevitable) effect of making a message *more* visible to a different
SLP tool.
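
To make that concrete, here's a toy sketch of one decidedly
non-Bayesian "tell."  (This is invented for this message, in Python,
and the cutoffs are guesses rather than tuned values--a sketch, not
an implementation.)  Random-dictionary padding almost never repeats
a word, so the word-frequency distribution comes out unnaturally
flat, even when every individual token looks innocent:

# Toy "tell": word salad has an unnaturally flat frequency
# distribution.  Natural English reuses its function words heavily
# (Zipf); dictionary-sampled padding almost never repeats a word.
# The 0.85/0.75 cutoffs are illustrative guesses, not tuned values.
from collections import Counter
import re

def salad_stats(text):
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 30:                        # too short to judge
        return None
    counts = Counter(words)
    type_token = len(counts) / len(words)      # distinct / total words
    hapax = sum(1 for c in counts.values() if c == 1) / len(counts)
    return type_token, hapax

def looks_like_salad(text, tt_cutoff=0.85, hapax_cutoff=0.75):
    stats = salad_stats(text)
    return (stats is not None
            and stats[0] > tt_cutoff and stats[1] > hapax_cutoff)

Real ham lands well below both cutoffs; word salad doesn't.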


> One question is how long will we have the benefit of such
> a large distinction between the content of spam and non-spam?

You're right, this is a very important question.  Right now, the
"smart-money" approach to statistical spam classification focuses on
finding the spam, not the "ham."  It's possible to imagine that the
task would "flip" at some point, so that "ham" recognition, not spam
recognition, would become the higher-payback activity.  If that day
ever comes, it's easy to imagine that spam elimination would become 
much more costly.


> At what point along the line of the convergence of these
> vocabularies does the effectiveness and accuracy of
> Bayesian filters become affected?

Again, as SLP tools go, Bayes is relatively easy to fool.  Fooling 
the larger class of SLP technologies is much tougher.
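
For anyone who hasn't stared at the arithmetic: a Naive Bayes score
boils down to a sum of per-token log-likelihood ratios, so padding a
message with enough "good" tokens simply drags the sum below
threshold.  A toy illustration in Python (the token odds are invented
for the example, not taken from any real filter):

import math

# Toy Naive Bayes: score = sum of log( P(token|spam) / P(token|ham) ).
# A positive total means "call it spam."  The odds are invented.
token_odds = {
    "viagra":   math.log(0.90 / 0.01),
    "mortgage": math.log(0.60 / 0.05),
    "meeting":  math.log(0.02 / 0.30),
    "thanks":   math.log(0.05 / 0.40),
}

def nb_score(tokens):
    # Unknown tokens are treated as neutral (log-ratio 0).
    return sum(token_odds.get(t, 0.0) for t in tokens)

spam_tokens = ["viagra", "mortgage"]
padded = spam_tokens + ["meeting", "thanks"] * 3   # "good word" padding

print(nb_score(spam_tokens))   # roughly +7.0: solidly "spam"
print(nb_score(padded))        # roughly -7.4: padding flips the verdict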

I recently had occasion to try to do some (decidedly non-Bayesian) 
statistical characterization of ham/spam differences.  I ended up 
with two interesting results:

1) There were four distinct "types" of spam.  
   Variation within each spam-type was much 
   smaller than the variation between 
   spam-types.  

2) Only one of the four spam-types was even 
   remotely close to "ham."
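
(I won't bore you with my actual tooling, but for concreteness,
here's a sketch in Python of *one* way to run that kind of exercise.
The surface features and k=4 are illustrative assumptions, not my
real setup.)

# Sketch: cluster spam on a few crude surface statistics, then
# compare within-cluster vs. between-cluster variance.  The features
# and k=4 are assumptions for illustration only.
import re
import numpy as np

def features(text):
    words = re.findall(r"\w+", text.lower())
    n = max(len(words), 1)
    chars = max(len(text), 1)
    return np.array([
        len(set(words)) / n,                     # type-token ratio
        sum(map(len, words)) / n,                # mean word length
        text.count("!") / chars,                 # exclamation density
        sum(c.isupper() for c in text) / chars,  # uppercase density
    ])

def kmeans(X, k=4, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Usage: X = np.vstack([features(m) for m in spam_corpus])
#        labels, centers = kmeans(X)

If within-type variance comes out much smaller than between-type
variance, you've reproduced the "distinct types" observation.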


> There are two different exercises here:
> 1. A measurement study of the vocabulary space
> of actual spam mail and non-spam mail and the
> change in these spaces over time.

And, there's a really excellent third point hiding in this one.  If, 
at some point, the goal becomes "ham" recognition, then what would 
vocabulary-driven techniques need to do in order to accommodate 
legitimate changes in "ham" vocabulary over time?
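
One off-the-cuff mechanic (strictly a sketch of my own, not a fielded
design): keep the "ham" vocabulary profile as exponentially decayed
token counts, so stale vocabulary fades out while legitimate new
vocabulary gets absorbed.  In Python, with a decay rate that's purely
an illustrative guess:

# Sketch: a ham profile that tracks legitimate vocabulary drift by
# exponentially decaying old token counts.  decay=0.99 is a guess.
from collections import Counter
import re

class DriftingHamProfile:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = Counter()
        self.total = 0.0

    def observe(self, text):
        """Fold one known-good message into the profile."""
        for tok in self.counts:          # age the existing vocabulary
            self.counts[tok] *= self.decay
        self.total *= self.decay
        for tok in re.findall(r"[a-z']+", text.lower()):
            self.counts[tok] += 1.0
            self.total += 1.0

    def prob(self, token, vocab_size=100000):
        """Add-one-smoothed token probability under the profile."""
        return (self.counts[token] + 1.0) / (self.total + vocab_size)

Tuning the decay trades stability (slow to forget) against agility
(quick to absorb new vocabulary)--which is really just your question
restated as a parameter.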


> 2. An analysis of how the effectiveness...
> would be affected given certain measures of distinction
> between the two vocabularies. There is probably some
> existing work here that gets us close to an answer.

The proceedings of the annual ACM SIGIR conference are a great place
to start.  (In case it isn't obvious, IR is just a "special case" of 
classification.)  There was also an *excellent* SLP textbook 
published about 3 years ago.  (I sorta doubt anyone's interested, but 
I'll be happy to provide the reference.)


Meanwhile, as I was composing this message, Andrew chimed in...

On Thu, 11 Sep 2003 14:13:50 +0100 (BST), Andrew Akehurst wrote:

> Can I suggest a subtly different approach? Rather than trying to
> characterise spam, why not try and characterise your legitimate
> messages and see if incoming messages match that statistical
> profile?

Right now, it's easier (statistically) to recognize spam.  I'm not
saying that will never change, just that it hasn't so far, and shows
no signs of changing yet.  (For purely selfish reasons, I hope
it never does.)


> As the definition of "spam" becomes fuzzier, does the accuracy of
> filtering decrease?

Not necessarily (although it may, someday).  High-end SLP techniques
can make some very fine-grained distinctions among highly similar
documents.


> I'm particularly thinking about false positives here...would
> it not make sense to match against a more stable message profile?

Again, you're following exactly the "right" line of thought.  But
keep in mind that, for a message to become a false positive, that
sorta implies, by definition, that it looks "un-ham-like" for some
reason, right?  That is, it's not *typical* ham; it looks somehow
uncharacteristic of the stable ham base.  The "hard part" in
automatic classification is trying to distinguish between *two*
atypical classes of messages (atypical ham/atypical spam).
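
To put a number on "un-ham-like," here's one simple formulation (an
illustration I just made up in Python, not anybody's fielded filter):
measure a message's similarity to the ham and spam centroids, and
treat "near neither" as the atypical, hard-to-call case:

# Sketch: "typicality" as cosine similarity to class centroids.
# A message near neither centroid is the hard, atypical case.
# The 0.1 floor is an illustrative guess.
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(n * b.get(t, 0) for t, n in a.items())
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(messages):
    total = Counter()
    for m in messages:
        total.update(vec(m))
    return total

def triage(msg, ham_centroid, spam_centroid, floor=0.1):
    h = cosine(vec(msg), ham_centroid)
    s = cosine(vec(msg), spam_centroid)
    if max(h, s) < floor:
        return "atypical"        # far from both: the hard case
    return "ham" if h > s else "spam"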

(It's also one of the things that makes it really tough to test an 
effective SLP tool.  Beyond a certain point, there's no payback to 
testing against typical messages, regardless of type.  I have to 
manually hunt down other filters' false negs in order to get good 
test cases for my SLP filter.  It literally takes me hours to find 
even a handful of good test cases.)


> I have a few ideas for statistical spam characteristics but I must
> admit that I lack the in-depth background in statistics to know if
> any of them would work in practice. Some expert input here would
> be welcome.

Let's talk off-list.  ;-)

(Sorry.  I know I probably bore 99% of folks on the list.  But I do 
so love this stuff.)

- Terry



_______________________________________________
Asrg mailing list
Asrg@ietf.org
https://www1.ietf.org/mailman/listinfo/asrg


