Re: [Asrg] 2. Analysis and characterization work

On Tue, 02 Sep 2003 10:52:13 +0100, Jon Kyme wrote:

> it may be thought useful to make both sets avid for spam, 
> for a period before the study starts, ...

[snip]

> However, this gives us two groups, one with stable 550 behaviour 
> (not550->not550) and one in which we have changed the behaviour 
> (not550->550), so the hypothesis is somewhat different.


Perzactly.  One design tests the effects of 550s; the other tests the 
covariance of 550s-and-time.  Since time is known to be a source of 
extraneous (read: "bad") variance, the second design has very little 
hope of yielding definitive, defensible results.  (And Scott Nelson's 
results should be enough to persuade anyone: time=bad.)

Now, any design that relies on active spam-"seeking" behavior to 
"prime the pump" is inherently more cumbersome to implement, because 
investigator effort is required to ensure that the "seeking" behavior 
is equivalent between groups for the "pre-experimental" period.  It's 
mechanically easier (not to mention "cleaner") to rely on 
otherwise-identical *visibility* between members of a pair, than it 
is to try to balance pre-experimental "seeking" behavior.

> Although "A" seems easier to implement, it may not produce
> big enough numbers.
>
> Or have I got this wrong?


No, you have it exactly right, in terms of the experimental design.  
Now, the question of numbers is 'really' a question of statistical 
power.  

We sort of know already that power is a problem in this instance.  
Statistically speaking, there are only two POSSIBLE explanations for 
the existing failure to detect a robust relationship between 550s and 
volume:

 1) There's no effect to detect.  (Or there's a 
    minimal effect, one that is so utterly 
    trivial that the "conclusion" is the same: 
    550s aren't a robust "solution" to the 
    problem of spam.)

 2) There's a real, robust effect, but there's so 
    much noise in the data that the effect is 
    "masked."  (i.e., lack of statistical power)

(NOTE: This is *not* an invitation to replay the "Omphaloskepsis 
tells me that noise isn't 'really' a problem" refrain.  Believe 
whatever you like.)

Now, statistical power can be increased in a couple of different 
ways.  One way is to increase _N_.  The problem with that method is 
that, given a sufficiently large N, even a trivial effect can achieve 
"statistical" significance.  (This is exactly why the ESP debate 
rages after nearly a century of experiment: yes, there's an effect, 
and yes, it's statistically significant.  But the effect is *tiny*, 
while the number of trials is *huge*.)  
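To put rough numbers on the large-N problem, here is a minimal sketch using the standard large-sample approximation z = d * sqrt(n/2) for two equal-sized groups. The effect size of 0.02 standard deviations and the function names are made up for illustration; nothing here comes from real spam or ESP data.

```python
import math

def approx_z(d, n_per_group):
    """Large-sample test statistic for a standardized mean difference d
    between two equal-sized groups: z = d * sqrt(n / 2)."""
    return d * math.sqrt(n_per_group / 2)

def n_needed(d, z_crit=1.96):
    """Smallest per-group n at which an effect of size d reaches z_crit
    (1.96 is the usual two-tailed 5% cutoff for a z statistic)."""
    return math.ceil(2 * (z_crit / d) ** 2)

tiny_effect = 0.02  # hypothetical: 2% of one standard deviation

# The same trivial effect looks more and more "significant" as n grows.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}   z = {approx_z(tiny_effect, n):6.2f}")

print("n per group needed to cross 1.96:", n_needed(tiny_effect))
```

The effect never gets any bigger; only the pile of observations does.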

Another, subtler approach to increasing power is to craft the 
experimental design in such a way that it allows use of more powerful 
statistical tests on the data once they are gathered.  That's why I 
advocated using (otherwise-identical) *pairs* of addresses.  The use 
of pairs of addresses supports squeezing a bit more power out of the 
analysis, like so: 

Imagine that the experimental data (N=5) look like this:

 E       C 
---     ---
 1       2
 2       3
 3       5
 4       5
 5       6

First, let's imagine that these are two independent samples.  Our 
"null" hypothesis is: Mean[E]-Mean[C]=0.

Okay: yes, there's a difference, but the difference is small, and so 
is N.  Results of an independent-samples test: t=-1.18, df=8, p=0.27.  
Fail to reject.  Simply adding N risks ending up with results that 
look just like ESP research: a huge pile of low-grade evidence, which 
is ultimately not particularly convincing.
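For anyone who wants to check the independent-samples arithmetic, here is a minimal sketch using only the Python standard library. It assumes the classic pooled-variance (Student) test, which has n1 + n2 - 2 = 8 degrees of freedom for these data. The critical value 2.306 is the two-tailed 5% cutoff for 8 df, copied from a standard t table; the function name is mine.

```python
import math
from statistics import mean, variance

E = [1, 2, 3, 4, 5]
C = [2, 3, 5, 5, 6]

def pooled_t(x, y):
    """Student's independent-samples t statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Weighted average of the two sample variances.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df

t_ind, df_ind = pooled_t(E, C)
T_CRIT_DF8 = 2.306  # two-tailed, alpha = 0.05, df = 8 (table value)

print(f"t = {t_ind:.2f}, df = {df_ind}")          # t = -1.18, df = 8
print("reject null:", abs(t_ind) > T_CRIT_DF8)    # reject null: False
```

The mean difference (-1.2) is small next to the spread within each column, which is why the statistic lands nowhere near the cutoff.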

BUT...

By using *pairs* of addresses and working diligently and 
conscientiously to ensure that each member of each pair receives 
identical visibility, the analytical data become the *difference* 
between the *two addresses* in each pair.  That is, we now look at only one 
number per pair (E-C).  So, if the data points above represented 
results obtained from otherwise-identical pairs, the analytical data 
would become:

E-C
---
 -1
 -1
 -2
 -1
 -1

One handy thing about using "correlated" pairs of observations is 
that we know exactly what to expect if the independent variable has 
no effect; our "null" hypothesis becomes: Mean[E-C]=0.  (That is, if the 
null hypothesis is true, then (E-C), when averaged across all pairs, 
will be very close to 0).  

The use of otherwise-identical pairs means that we can run a 
correlated-samples test (as opposed to an independent-samples test) 
on the data.  The results: t=-6.0, df=4, p=0.0039 (two-tailed).  
Reject-city; large N not required.
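The correlated-samples figures can be checked the same way. This is a sketch, not a recommendation of any particular package: it uses only the Python standard library, and the p-value helper (my own, hypothetical name) integrates the Student t density numerically with Simpson's rule rather than calling a statistics library.

```python
import math
from statistics import mean, stdev

E = [1, 2, 3, 4, 5]
C = [2, 3, 5, 5, 6]

def paired_t(x, y):
    """Correlated-samples t statistic computed on the per-pair differences."""
    d = [a - b for a, b in zip(x, y)]   # one number per pair: E - C
    n = len(d)
    se = stdev(d) / math.sqrt(n)        # standard error of the mean difference
    return mean(d) / se, n - 1

def two_tailed_p(t, df, steps=20_000):
    """Two-tailed p-value: 1 - P(-|t| < T < |t|), with the central area
    obtained by Simpson's rule on the Student t density (steps must be even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    b = abs(t)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    half_area = total * h / 3           # P(0 < T < |t|)
    return 1 - 2 * half_area

t_pair, df_pair = paired_t(E, C)
p_pair = two_tailed_p(t_pair, df_pair)
print(f"t = {t_pair:.2f}, df = {df_pair}, p = {p_pair:.4f}")  # t = -6.00, df = 4, p = 0.0039
```

The mean difference is the same -1.2 as in the independent-samples case. What changes is the denominator: E and C rise and fall together across pairs, so the differences barely vary (standard deviation about 0.45), and the standard error shrinks to 0.2.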

Conversely, if (E-C)=0.0, even with dozens of correlated data points 
and crisp separation of the variables, then we can be *equally 
confident* that 550s *don't* substantively affect spam attempts.  The 
results are strong and convincing (even compelling), either way--as 
befits the Herculean effort undertaken by Peter and his crew.

- Terry


