This message is intended to address general questions of experimental
design in a "known noisy" environment, and uses "550s-vs-spam volume"
for illustrative purposes only.
If you're confident of your stat background, skip this first
paragraph and go straight to the bullet points. If not, then the
background material in this paragraph provides context for the bullet
points. Viewed through the lens of inferential statistics, there are only
two kinds of variance in the world: explained/unexplained, aka
between-group/within-group, or systematic/random. Robust
experimental design exerts as much control as possible over
everything BUT the independent variable, which is the only thing
that's allowed to vary systematically between the groups. Everything
else that varies, however slightly, between the two conditions hurts
the chances of meaningful results. (If the extraneous variance is
systematic, then the design is confounded, and the results--whatever
they are--are meaningless; if the extraneous variance is random, then
statistical power is compromised.)
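For anyone who'd rather see the explained/unexplained split as
arithmetic, here is a minimal Python sketch. The counts are made up
purely for illustration, and the function name partition_variance is
hypothetical (it is not from any stats library). All it shows is that
the total sum of squares splits cleanly into a between-group piece and
a within-group piece.

```python
from statistics import mean


def partition_variance(groups):
    """Split the total sum of squares into between-group and within-group parts."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group (systematic): how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (random): how far each value sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between, ss_within, ss_total


# Invented daily spam counts for two hypothetical groups of addresses.
control = [12, 15, 11, 14, 13]
experimental = [10, 9, 12, 8, 11]

ss_between, ss_within, ss_total = partition_variance([control, experimental])
print(ss_between, ss_within, ss_total)
```

Anything that inflates the within-group piece costs statistical power;
anything other than the independent variable that leaks into the
between-group piece is a confound.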
- Ensure *crisp* separation of the
independent variables. If the
analytical goal is to study the
effects of 550s, then have that be
the *only* source of systematic
variance. DO NOT "dilute" your
systematic variance by confounding
it with other variables (visibility,
phase of the moon, eye color, etc.)
- (Under the heading of "Hey doc, it
hurts when I do this...") If daily
spam volume is too noisy (and the
DATA, not the statistician, "say"
that it is), then pick a dependent
measure that's more naturally
noise-resistant (say, monthly spam
volume, or even quarterly, if need
be). Reliability of initial
measurement is always preferable to
_post hoc_ "noise reduction."
- Studiously ensure and maintain
homogeneity of the experimental
conditions throughout the course of
the experiment.
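The daily-versus-monthly point can be sketched in a few lines of
Python. Everything here is an assumption made for illustration:
the simulated counts (mean 100, standard deviation 40), the seed,
and the premise that day-to-day noise is independent. Real spam
logs may well be autocorrelated, in which case the gain from
aggregating would be smaller.

```python
import random
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Relative noise: standard deviation as a fraction of the mean."""
    return stdev(values) / mean(values)


# Simulated (not real) daily spam counts for 360 days.
rng = random.Random(550)
daily = [max(0.0, rng.gauss(100, 40)) for _ in range(360)]

# Sum consecutive 30-day blocks to get 12 "monthly" totals.
monthly = [sum(daily[i:i + 30]) for i in range(0, len(daily), 30)]

cv_daily = coefficient_of_variation(daily)
cv_monthly = coefficient_of_variation(monthly)
print(f"daily CV:   {cv_daily:.3f}")
print(f"monthly CV: {cv_monthly:.3f}")
```

The price of the quieter measure is fewer measurements per address,
which is one more reason to have plenty of addresses.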
Mechanics:
- Create an absolute minimum of 60
*pairs* of email addresses. (The
"magic number 30" assumes data to
be noise-free. Statistical power
is a function of the number of
"subjects," not the number of
measurements.) The use of
"otherwise identical" *pairs* of
addresses allows a little more
statistical power to be squeezed
from the data at analysis time.
- Randomly assign each address in a
pair to an experimental condition;
the addresses in the experimental
group never (repeat: never) do
anything but throw 550s; addresses
in the control group "take anything."
* Cautionary aside: If it were
me, I would zealously protect
from the general public any
knowledge of which addresses
were in which experimental
group. As proof against
experimenter mortality, I'd
ensure that 3 different people
knew which-was-which, so that
the study could continue if I
got hit by a bus. But I'd
also limit that knowledge to
*just* those 3 folks. (In
experimental-design jargon,
the study is referred to as
"blind.")
- In a perfect universe, all addresses
in both groups are served from one
and only one mail server. That way,
"server status" affects all address
pairs in both groups identically.
- Insofar as possible, ensure that each
address within a pair achieves/receives
"identical" visibility. Each control
address within a pair should "shadow"
its experimental counterpart as
precisely as possible.
* if one address signs up for
a list, posts to a newsgroup,
appears on a Web page, or
whatever, the other one should
do it too, on the very same day
- While waiting for time to pass,
order a copy of Kanji, G. (1999).
One hundred statistical tests.
ISBN: 0-7619-6151-8
(This is a very handy "cookbook"
that contains raw-score formulae
for just about every inferential
statistical test there is.)
- At the end of the experiment, pull
the pairs apart and compute a
regression equation for each
experimental group.
- Some folks may recall me saying
that the slope of the regression
line is not intrinsically
informative. (And it isn't;
dispersion, not slope, expresses
the degree of relatedness between
regression variables.) However,
the *difference between two slopes*
of otherwise "identical" conditions
can be informative.
* if the beta weight of the
regression line for the
control group is smaller
(even slightly) than that
of the experimental group,
stop. Fail to reject the
null hypothesis and move
on to something else.
* Differences between the slopes
can be compared via t-test. If
that difference-in-slopes
doesn't reach significance at
0.01 (TWO-tailed), stop. Fail
to reject the null hypothesis
and move on to something else.
(Remember, rolling "snake eyes"
with a pair of dice, a 1-in-36
event, is "statistically
significant" at p=0.05.)
- Having determined the "direction" of
the effect, the magnitude of the
effect can be estimated via paired
t-test. Again, the goal is 0.01 or
bust (though 1-tailed 0.01 is now
"within reach").
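The mechanics above can be sketched in Python using only the standard
library. This is a rough sketch under stated assumptions, not a
definitive analysis: all function names are hypothetical, the
regression is assumed to be spam volume on time (say, month number),
and the slope comparison uses a simple unpooled standard error with
n1 + n2 - 4 degrees of freedom, where a textbook formula may pool the
residual variance instead. The functions return a t statistic and its
degrees of freedom; the critical value still has to be looked up in a
t table.

```python
import math
import random
from statistics import mean, stdev


def assign_pairs(pairs, rng):
    """Randomly pick which address in each pair throws 550s."""
    assignments = []
    for a, b in pairs:
        experimental, control = (a, b) if rng.random() < 0.5 else (b, a)
        assignments.append({"experimental": experimental, "control": control})
    return assignments


def slope_and_se(xs, ys):
    """Least-squares slope of ys on xs, plus the slope's standard error."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se


def slope_difference_t(xs1, ys1, xs2, ys2):
    """t statistic and df for the difference between two independent slopes."""
    b1, se1 = slope_and_se(xs1, ys1)
    b2, se2 = slope_and_se(xs2, ys2)
    t = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    df = len(xs1) + len(xs2) - 4
    return t, df


def paired_t(control_values, experimental_values):
    """Paired t statistic and df; values are matched pair by pair."""
    diffs = [c - e for c, e in zip(control_values, experimental_values)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

With 60 pairs the paired test has 59 degrees of freedom, for which the
two-tailed 0.01 critical value is roughly 2.66.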
(End of bullet points... philosophical paragraph follows.)
Some scientist (don't remember who) once described scientific inquiry
as something like, "slaying a beautiful theory with an ugly fact."
I'd gently urge folks to remain mindful of the fact that all the
evidence to date (admittedly all of it still on the level of
anecdote) speaks with a single voice: there is no systematic
relationship between 550s and spam volume. From Walter's "story of
Nadine" to Peter's original data (which I've now seen and analyzed),
not once has any nonexperimental design detected a statistically
significant negative relationship between these two variables.
(Peter's original data show a very modest negative correlation, too
small to support rejection of the null, even at 0.05, one-tailed. My
logs for the last month show ~exactly the same-sized correlation, and
I'm not throwing 550s at all.)
My $0.02...
- Terry
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg