[Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design

2003-08-15 10:34:07

This message is intended to address general questions of experimental 
design in a "known noisy" environment, and uses "550s-vs-spam volume" 
for illustrative purposes only.

If you're confident of your stat background, skip this first 
paragraph and go straight to the bullet points.  If not, then the 
background material in this paragraph provides context for the bullet 
points.  Viewed through the lens of inferential statistics, there are only 
two kinds of variance in the world: explained/unexplained, aka 
between-group/within-group, or systematic/random.  Robust 
experimental design exerts as much control as possible over 
everything BUT the independent variable, which is the only thing 
that's allowed to vary systematically between the groups.  Everything 
else that varies, however slightly, between the two conditions hurts 
chances of meaningful results. (If the extraneous variance is 
systematic, then the design is confounded, and the results--whatever 
they are--meaningless; if the extraneous variance is random, then 
statistical power is compromised.)
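
To make the explained/unexplained split concrete, here is a minimal Python sketch that partitions a total sum of squares into its between-group and within-group parts. The weekly counts are made-up numbers and the function name is hypothetical; the only point is that the two parts add up to the total.

```python
import statistics

def sum_of_squares_partition(groups):
    """Split total sum of squares into between-group and within-group parts."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    # Total variation of every observation around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Systematic part: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Random part: how far each observation sits from its own group mean.
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    return ss_total, ss_between, ss_within

# Hypothetical weekly spam counts for two conditions (made-up numbers).
control = [52, 61, 48, 57, 66]
experimental = [49, 58, 44, 60, 51]

total, between, within = sum_of_squares_partition([control, experimental])
print("total:  ", total)
print("between:", between)
print("within: ", within)
```

Anything that inflates the within-group part without touching the between-group part costs statistical power, which is the motivation for the bullet points.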

- Ensure *crisp* separation of the 
  independent variables.  If the 
  analytical goal is to study the 
  effects of 550s, then have that be 
  the *only* source of systematic
  variance.  DO NOT "dilute" your 
  systematic variance by confounding 
  it with other variables (visibility,
  phase of the moon, eye color, etc.)

- (Under the heading of "Hey doc, it 
  hurts when I do this...")  If daily 
  spam volume is too noisy (and the 
  DATA, not the statistician, "say"
  that it is), then pick a dependent 
  measure that's more naturally 
  noise-resistant (say, monthly spam 
  volume, or even quarterly, if need
  be).  Reliability of initial
  measurement is always preferable to 
  _post hoc_ "noise reduction."

- Studiously ensure and maintain 
  homogeneity of the experimental 
  conditions throughout the course of 
  the experiment.
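
As a rough illustration of the noise-resistance point, here is a minimal Python sketch on simulated (not real) daily counts. The mean, the spread, the 30-day block length, and the helper names are all assumptions for illustration. It also assumes day-to-day noise is independent; real spam volume is probably autocorrelated, so the actual gain from aggregating would be smaller.

```python
import random
import statistics

def coefficient_of_variation(values):
    """Standard deviation relative to the mean (a unitless noise measure)."""
    return statistics.stdev(values) / statistics.mean(values)

def block_totals(daily_counts, block_days):
    """Sum daily counts into consecutive blocks, dropping any partial block."""
    n_blocks = len(daily_counts) // block_days
    return [
        sum(daily_counts[i * block_days:(i + 1) * block_days])
        for i in range(n_blocks)
    ]

# Simulated data only: 360 days of noisy counts around 100 messages/day.
rng = random.Random(0)
daily = [max(0.0, rng.gauss(100, 40)) for _ in range(360)]
monthly = block_totals(daily, 30)

print("daily CV:  ", round(coefficient_of_variation(daily), 3))
print("monthly CV:", round(coefficient_of_variation(monthly), 3))
```

The trade-off is that a coarser measure yields fewer data points per address, which is one more reason the number of address pairs matters.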

Mechanics:

- Create an absolute minimum of 60 
  *pairs* of email addresses. (The
  "magic number 30" assumes data to
  be noise-free.  Statistical power 
  is a function of the number of 
  "subjects," not the number of 
  measurements.)  The use of 
  "otherwise identical" *pairs* of 
  addresses allows a little more 
  statistical power to be squeezed 
  from the data at analysis time.

- Randomly assign each address in a
  pair to an experimental condition; 
  the addresses in the experimental
  group never (repeat: never) do 
  anything but throw 550s; addresses 
  in the control group "take anything." 

    * Cautionary aside: If it were
      me, I would zealously protect 
      from the general public any 
      knowledge of which addresses
      were in which experimental 
      group.  As proof against
      experimenter mortality, I'd 
      ensure that 3 different people 
      knew which-was-which, so that
      the study could continue if I
      got hit by a bus.  But I'd 
      also limit that knowledge to 
      *just* those 3 folks.  (In
      experimental-design jargon,
      the study is referred to as
      "blind.")

- In a perfect universe, all addresses
  in both groups are served from one 
  and only one mail server.  That way, 
  "server status" affects all address
  pairs in both groups identically.

- Insofar as possible, ensure that each 
  address within a pair achieves/receives 
  "identical" visibility.  Each control
  address within a pair should "shadow" 
  its experimental counterpart as 
  precisely as possible.

    * if one address signs up for 
      a list, posts to a newsgroup, 
      appears on a Web page, or 
      whatever, the other one should 
      do it too, on the very same day

- While waiting for time to pass, 
  order a copy of Kanji, G. (1999).
  One hundred statistical tests.
  ISBN: 0-7619-6151-8
  (This is a very handy "cookbook"
  that contains raw-score formulae
  for just about every inferential 
  statistical test there is.)

- At the end of the experiment, pull 
  the pairs apart and compute a 
  regression equation for each  
  experimental group.

- Some folks may recall me saying
  that the slope of the regression
  line is not intrinsically 
  informative.  (And it isn't; 
  dispersion, not slope, expresses
  the degree of relatedness between 
  regression variables.)  However, 
  the *difference between two slopes* 
  of otherwise "identical" conditions 
  can be informative.

    * if the beta weight of the 
      regression line for the 
      control group is smaller 
      (even slightly) than that 
      of the experimental group, 
      stop.  Fail to reject the 
      null hypothesis and move 
      on to something else. 

    * Differences between the slopes 
      can be compared via t-test.  If 
      that difference-in-slopes
      doesn't reach significance at 
      0.01 (TWO-tailed), stop.  Fail 
      to reject the null hypothesis 
      and move on to something else.

      (Remember, throwing "snake eyes" 
      with a pair of dice, a 1-in-36 
      event, is "statistically 
      significant" at p=0.05.)

- Having determined the "direction" of 
  the effect, the magnitude of the 
  effect can be estimated via paired 
  t-test.  Again, the goal is 0.01 or 
  bust (though 1-tailed 0.01 is now 
  "within reach").
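
The mechanics above can be sketched in Python, under some loudly stated assumptions. The post doesn't say what the regression predictor is, so this sketch assumes spam volume regressed on time. The slope comparison uses one common textbook form (pooled residual variance, df = n1 + n2 - 4); other forms exist. All function names and the example.org addresses are hypothetical. No p-values are computed, since the standard library has no t distribution; look up the returned t and df in a table or a cookbook such as the one cited above.

```python
import math
import random
import statistics

def assign_pairs(pairs, seed=None):
    """Randomly pick which address in each pair throws 550s.

    Returns a list of (experimental, control) tuples.
    """
    rng = random.Random(seed)
    return [(a, b) if rng.random() < 0.5 else (b, a) for a, b in pairs]

def fit_line(xs, ys):
    """Least-squares fit; returns (slope, Sxx, residual sum of squares, n)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sxx, ss_res, len(xs)

def slope_difference_t(fit_control, fit_experimental):
    """t and df for (control slope - experimental slope), pooled variance."""
    b_c, sxx_c, ss_c, n_c = fit_control
    b_e, sxx_e, ss_e, n_e = fit_experimental
    df = n_c + n_e - 4
    pooled_var = (ss_c + ss_e) / df
    se = math.sqrt(pooled_var / sxx_c + pooled_var / sxx_e)
    return (b_c - b_e) / se, df

def paired_t(control_totals, experimental_totals):
    """Paired t and df on per-pair differences (control - experimental)."""
    diffs = [c - e for c, e in zip(control_totals, experimental_totals)]
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical address pairs (made-up names).
pairs = [("a1@example.org", "a2@example.org"),
         ("b1@example.org", "b2@example.org")]
print(assign_pairs(pairs, seed=1))
```

With this sign convention, a positive t from slope_difference_t means the control slope is the larger one, which is the direction the 550 hypothesis predicts; a negative t is the "stop" case from the first sub-bullet.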

(End of bullet points... philosophical paragraph follows.)

Some scientist (don't remember who) once described scientific inquiry 
as something like, "slaying a beautiful theory with an ugly fact." 
I'd gently urge folks to remain mindful of the fact that all the 
evidence to date (admittedly all of it still on the level of 
anecdote) speaks with a single voice: there is no systematic 
relationship between 550s and spam volume.  From Walter's "story of 
Nadine" to Peter's original data (which I've now seen and analyzed), 
not once has any nonexperimental design detected a statistically 
significant negative relationship between these two variables.  
(Peter's original data show a very modest negative correlation, too 
small to support rejection of the null, even at 0.05, one-tailed.  My 
logs for the last month show ~exactly the same-sized correlation, and 
I'm not throwing 550s at all.)

My $0.02...

- Terry



_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg