On Tue, 02 Sep 2003 10:52:13 +0100, Jon Kyme wrote:
> it may be thought useful to make both sets avid for spam,
> for a period before the study starts, ...

[snip]

> However, this gives us two groups, one with stable 550 behaviour
> (not550->not550) and one in which we have changed the behaviour
> (not550->550), so the hypothesis is somewhat different.

Perzactly. One design tests the effects of 550s; the other tests the
covariance of 550s-and-time. Since time is known to be a source of
extraneous (read: "bad") variance, the second design has very little
hope of yielding definitive, defensible results. (And Scott Nelson's
results should be enough to persuade anyone: time=bad.)
Now, any design that relies on active spam-"seeking" behavior to
"prime the pump" is inherently more cumbersome to implement, because
investigator effort is required to ensure that the "seeking" behavior
is equivalent between groups for the "pre-experimental" period. It's
mechanically easier (not to mention "cleaner") to rely on
otherwise-identical *visibility* between members of a pair, than it
is to try to balance pre-experimental "seeking" behavior.

> Although "A" seems easier to implement, it may not produce
> big enough numbers.
> Or have I got this wrong?

No, you have it exactly right, in terms of the experimental design.
Now, the question of numbers is 'really' a question of statistical
power.
We sort of know already that power is a problem in this instance.
Statistically speaking, there are only two POSSIBLE explanations for
the existing failure to detect a robust relationship between 550s and
volume:

  1) There's no effect to detect. (Or there's a minimal effect, one
     so utterly trivial that the "conclusion" is the same: 550s
     aren't a robust "solution" to the problem of spam.)

  2) There's a real, robust effect, but there's so much noise in the
     data that the effect is "masked" (i.e., a lack of statistical
     power).

(NOTE: This is *not* an invitation to replay the "Omphaloskepsis
tells me that noise isn't 'really' a problem" refrain. Believe
whatever you like.)
Now, statistical power can be increased in a couple of different
ways. One way is to increase _N_. The problem with that method is
that, given a sufficiently large N, even a trivial effect can achieve
"statistical" significance. (This is exactly why the ESP debate
rages after nearly a century of experiment: yes, there's an effect,
and yes, it's statistically significant. But the effect is *tiny*,
while the number of trials is *huge*.)
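
To put rough numbers on that point, here is a minimal sketch. It
assumes a one-sample (or paired) design in which the t statistic grows
roughly as d*sqrt(N), where d is the standardized effect size. The
value d=0.01 and the function name are hypothetical, chosen only to
stand in for "trivial"; they come from no actual data.

```python
import math

def approx_t(effect_size, n):
    """Rough size of the t statistic for a fixed standardized effect.

    Uses the approximation t ~ d * sqrt(N); it ignores sampling noise.
    """
    return effect_size * math.sqrt(n)

TINY_EFFECT = 0.01  # hypothetical "trivial" effect, in standard-deviation units

for n in (10, 100, 10_000, 1_000_000):
    print(f"N = {n:>9,}  t ~ {approx_t(TINY_EFFECT, n):.2f}")
```

At N=10,000 the statistic is only about 1.0, but at N=1,000,000 it
reaches about 10, far past any conventional cutoff, even though the
effect itself is no bigger than before.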
Another, subtler approach to increasing power is to craft the
experimental design in such a way that it allows use of more powerful
statistical tests on the data once they are gathered. That's why I
advocated using (otherwise-identical) *pairs* of addresses. The use
of pairs of addresses supports squeezing a bit more power out of the
analysis, like so:
Imagine that the experimental data (N=5) look like this:

     E     C
    ---   ---
     1     2
     2     3
     3     5
     4     5
     5     6

First, let's imagine that these are two independent samples. Our
"null" hypothesis is: Mean[E]-Mean[C]=0.
Okay: yes, there's a difference, but the difference is small, and so
is N. Results of an independent-samples test: t=-1.18, p=0.28, df=7.
Fail to reject. Simply adding N risks ending up with results that
look just like ESP research: a huge pile of low-grade evidence, which
is ultimately not particularly convincing.
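
For anyone who wants to check the independent-samples arithmetic, here
is a minimal sketch using only the Python standard library. It assumes
the unequal-variance (Welch) form of the test, which is my assumption
and not something stated above; with equal group sizes it yields the
same t as the pooled form. The function name is made up for the
illustration.

```python
import math
from statistics import mean, variance

E = [1, 2, 3, 4, 5]
C = [2, 3, 5, 5, 6]

def welch_t(a, b):
    """Independent-samples t statistic and Welch-Satterthwaite df."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t(E, C)
print(f"t = {t:.2f}, df = {df:.2f}")  # t = -1.18, df = 7.99
```

The Welch df comes out just under 8; rounding it down would be
consistent with the df=7 quoted above, while the pooled-variance form
would use df=8. Either way the conclusion is the same: fail to reject.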
BUT...
By using *pairs* of addresses and working diligently and
conscientiously to ensure that both members of each pair receive
identical visibility, the analytical data become the *difference*
between the two addresses *within each pair*. That is, we now look at
only one number per pair (E-C). So, if the data points above
represented results obtained from otherwise-identical pairs, the
analytical data would become:

    E-C
    ---
     -1
     -1
     -2
     -1
     -1

One handy thing about using "correlated" pairs of observations is
that we know exactly what to expect if the independent variable has
no effect; our "null" hypothesis becomes: Mean[E-C]=0. (That is, if
the null hypothesis is true, then (E-C), when averaged across all
pairs, will be very close to 0.)
The use of otherwise-identical pairs means that we can run a
correlated-samples test (as opposed to an independent-samples test)
on the data. The results: t=-6.0, df=4, p=0.0039 (two-tailed).
Reject-city; large N not required.
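
The correlated-samples result can be checked the same way. This is a
minimal sketch using only the Python standard library: a paired t-test
is just a one-sample t-test on the per-pair differences. The function
names are made up for the illustration, and the p-value comes from
numerically integrating the Student's t density (Simpson's rule), not
from a statistics package.

```python
import math
from statistics import mean, stdev

E = [1, 2, 3, 4, 5]
C = [2, 3, 5, 5, 6]

def paired_t(a, b):
    """Paired t statistic: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p: 1 minus the t-density area between -|t| and +|t|."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)

t, df = paired_t(E, C)
p = two_tailed_p(t, df)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")  # t = -6.00, df = 4, p = 0.0039
```

Same ten numbers as before; the only change is that the pair-to-pair
variation (which is large) drops out of the error term, leaving only
the variation in the differences (which is small).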
Conversely, if Mean[E-C] stays at or near 0, even with dozens of
correlated data points and crisp separation of the variables, then we
can be *equally confident* that 550s *don't* substantively affect spam
attempts. The results are strong and convincing (even compelling),
either way--as befits the Herculean effort undertaken by Peter and his
crew.

- Terry