ietf-asrg

Re: [Asrg] 2a. Analysis - Spam filled with words

2003-09-12 05:58:37
On Thu, 11 Sep 2003 23:15:58 -0600, John Fenley wrote:

From: "Terry Sullivan" <terry(_at_)pantos(_dot_)org>
1) There were four distinct "types" of spam. Variation within each
spam-type was much smaller than the variation between spam-types.

2) Only one of the four spam-types was even remotely close to "ham."


This reminds me of something I heard about a few years ago while 
attending a lecture on multidimensional math.
[snip]
When plotted in N dimensions vertebrae from different 
species of dinosaurs formed distinct clouds that could be 
distinguished easily.

Perhaps a multidimensional Bayesian classifier could find these 
spam/ham groups on its own. Each method for bypassing filters in a 
strange way might be easily discernible as a different cloud.

You read my mind, John.  The "distinct clouds" effect was more or 
less exactly what I was hoping to find.  (Unfortunately, the 
distinctions I found were not even remotely "cloud-like"--each one 
was more like "amorphous blob, well-anchored by a couple of 
outrageous outliers."  Which was still an interesting analytical 
result, but not at all what I had hoped to find.)  

The bad news: while "free" multidimensional methods are great for 
analysis, they are utterly impractical for classification, because 
they do not scale well at all.  (CPU requirements for these methods 
grow *at least* quadratically with the number of items being 
analyzed, and higher-dimensional solutions are even more demanding.)
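
A rough sketch of where the quadratic growth comes from, assuming the method starts from a full item-by-item distance matrix (as classical multidimensional scaling does). The function names and the toy vectors below are made up for illustration and are not taken from the analysis described above.

```python
from itertools import combinations
from math import dist


def pairwise_count(n_items: int) -> int:
    """Number of distinct item pairs a full distance matrix must cover."""
    return n_items * (n_items - 1) // 2


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of feature vectors."""
    return {
        (i, j): dist(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


if __name__ == "__main__":
    # Hypothetical 3-feature vectors, invented for illustration only.
    toy = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
    print(len(pairwise_distances(toy)))  # one entry per pair of items
    for n in (1_000, 10_000, 100_000):
        print(n, pairwise_count(n))
```

Ten times as many messages means roughly a hundred times as many distances to compute and store, and that is before the fitting step itself has done any work.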

Ultimately, your *core* point--that multidimensional characterization 
methods ought to work better than unidimensional methods--is exactly 
right.  (Exactly the opposite is true when one is trying to make 
fine-grained distinctions among already highly similar documents.  In 
that case, unidimensional methods are generally superior.)  
Multidimensional characterization "works better" here because, 
as Andrew recently reminded us, spam differs from regular email in 
lots of different ways (i.e., across multiple dimensions), while 
regular email is much more homogeneous.
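
A toy illustration of that last point, using synthetic numbers and hypothetical feature names (nothing here comes from real mail). It uses a plain distance-from-centroid check, which is much simpler than the multidimensional methods discussed above, and only shows why a single feature can miss spam that several features together catch.

```python
from math import sqrt

# Synthetic vectors: (link_density, caps_ratio, filler_word_ratio).
# All values are invented for illustration.
HAM = [(0.10, 0.05, 0.02), (0.12, 0.04, 0.03), (0.09, 0.06, 0.02)]
SPAM = [
    (0.90, 0.05, 0.02),  # unusual only in link density
    (0.10, 0.80, 0.03),  # unusual only in capitalization
    (0.11, 0.05, 0.85),  # unusual only in filler words
]


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flagged_by_one_feature(points, index, threshold):
    """Unidimensional test: flag a message if one chosen feature is high."""
    return [p[index] > threshold for p in points]


def flagged_by_distance(points, center, radius):
    """Multidimensional test: flag a message far from the ham centroid."""
    return [distance(p, center) > radius for p in points]


if __name__ == "__main__":
    center = centroid(HAM)
    print(flagged_by_one_feature(SPAM, 0, 0.5))   # link density alone
    print(flagged_by_distance(SPAM, center, 0.3))  # all three features
    print(flagged_by_distance(HAM, center, 0.3))
```

With these made-up numbers, the link-density threshold flags only the first spam message, while the distance check flags all three and none of the ham, because the ham vectors sit close together and each spam vector strays along a different axis.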

- Terry


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg