RE: [Asrg] RE: 2.a.1 Analysis of Actual Spam Data - Titan Key reduces spam attacks

2003-08-09 18:51:40
On Sun, 10 Aug 2003 01:37:54 +0200, Brad Knowles wrote:

 Hmm.  How would you research something like this?
[snip]
Also, I'm a bit confused--just what exactly would we be 
researching?  

Peter has a system called Titan Key that responds to all incoming 
email (spam or not) from any unknown source with a 550.  He posted a 
message to the list a while back that made what seemed a fairly 
extraordinary claim (as per the subject line, "Titan Key reduces spam 
attacks").

Now, he had a modest sample of data (one account, one server, for 
~100 days), and the consensus I saw on the list seemed to be "we need 
more data."  Several folks have suggested a design along the same 
lines you did (multiple addresses, of equal visibility, etc.).

But my thoughts keep coming back to the sheer amount of statistical 
noise in something as volatile as spam volume.  That level of noise 
will make meaningful analysis extremely difficult.  (My area of 
specialization is statistical language processing, so I routinely 
encounter the frustrations associated with trying to analyze really, 
really noisy data.)  

In statistical terms, all that noise constitutes unexplained (aka 
"within-group") variance.  So, even if an effect exists, it will be 
very difficult to detect.  (Conversely, if the effect doesn't exist, 
the absence of clear results can always be blamed on the noise in the 
data.)  To make matters worse, given the amount of data that would 
have to be collected, a *statistically* significant effect would not 
necessarily be a *substantively* significant effect.  (The p-value 
might be tiny, but R^2 might be just as tiny.)
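
To make the p-value versus R^2 point concrete, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration (sample size, a baseline of 100 messages, a reduction of 2 messages, Gaussian noise with a standard deviation of 50); none of it comes from Peter's data. The p-value uses a normal approximation to the t distribution, which is reasonable at this sample size.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=200_000, effect=-2.0, noise_sd=50.0, seed=1):
    """Hypothetical data: x marks the 'treated' half, y is a noisy volume."""
    rng = random.Random(seed)
    xs = [i % 2 for i in range(n)]  # 0 = untreated, 1 = treated
    # Assumed baseline of 100; the true effect is tiny next to the noise.
    ys = [100.0 + effect * x + rng.gauss(0.0, noise_sd) for x in xs]
    r = pearson_r(xs, ys)
    # t statistic for testing r = 0, with n - 2 degrees of freedom.
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Two-sided p-value, normal approximation (fine for very large n).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return r * r, p


r_squared, p_value = simulate()
print(f"R^2 = {r_squared:.5f}, p = {p_value:.3g}")
```

Under these assumptions the effect is "significant" by any conventional threshold, yet it explains only a small fraction of one percent of the variance.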

Now, it's always possible that the effect is so astonishingly large 
that it will somehow be able to shine through all that noise.  But if it 
were really all *that* big, it strikes me as odd that no one's ever 
noticed it before (at least in any shared forum).  It would simply 
be a matter of some mail admin at some slowly failing company saying, 
"Wow, have you noticed that, after every new round of layoffs, our 
spam volume plummets, and never bounces back?"  (Sorry, bad pun.)

So the point of my posting was: if the effect's big enough to be 
detected with a practicable amount of effort, then I can't help but 
wonder why nobody's ever tripped over it, even accidentally.  
(Phrased differently: given [any] two variables observed to increase 
monotonically over time, what would it take for the "true 
state-of-nature" to be "huge negative correlation"?) 

It's fairly easy for me to imagine an expenditure of enormous 
quantities of effort, the harvest from which is vast amounts of 
low-quality data... all to no avail, because the "measure" being 
analyzed is so statistically unreliable that inconclusive results are 
just about guaranteed. 
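
A rough way to see how often such a study would come up empty is to simulate it. The sketch below is built entirely on assumed numbers: two addresses observed for 100 days each, a baseline of 100 spams per day, a true 10% reduction on the treated address, and day-to-day Gaussian noise with a standard deviation of 80. Gaussian noise is a simplification (it can even produce negative "counts"), and the 1.96 cutoff treats Welch's t statistic as approximately normal.

```python
import math
import random


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


def estimate_power(days=100, base=100.0, reduction=0.10,
                   noise_sd=80.0, trials=2000, seed=7):
    """Fraction of simulated studies that detect the (real) effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Hypothetical daily volumes for an untreated and a treated address.
        control = [rng.gauss(base, noise_sd) for _ in range(days)]
        treated = [rng.gauss(base * (1.0 - reduction), noise_sd)
                   for _ in range(days)]
        # Count the study as a "detection" at roughly the 5% level.
        if abs(welch_t(control, treated)) > 1.96:
            hits += 1
    return hits / trials


power = estimate_power()
print(f"estimated power: {power:.2f}")
```

With noise on that scale, most simulated studies miss an effect that is really there; changing `noise_sd`, `days`, or `reduction` shows how quickly the picture shifts.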

(But hey... that's just me.)

- Terry



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


