ietf-asrg

RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design

2003-08-18 07:55:25
I feel pretty confident that one box can respond to requests sent to
multiple IP addresses, and therefore can serve as home to an
arbitrarily large number of different domains.  If these email
addresses "live" on 60 different machines, then there will be an
additional mechanical step of "synching" the data from each machine.
Then too, keeping one machine up for the experimental period strikes
me as less "overhead" than keeping 60 machines going.  That I can
see, using multiple boxes only serves as a potential confound,
because server availability affects spam volume in a systematic way.
If one machine (or worse, two) go(es) "hard down" for a week or two,
the results of the larger experiment are placed at risk.  Addresses
served by that/those machine(s) will have a lower spam volume, of
course, but not because of the independent variable.  But, as I
said, this feature qualifies as nice-to-have, but not required.

I think you misread what I was proposing.  Comparisons have to be between
mail addresses which are identical except for their 550 behaviour when
determining whether 550 behaviour affects spam volume.  So you don't compare
mailboxes on two different machines, or in two different TLDs.  What I am
saying in effect is that the experiment needs to be carried out in a number
of TLDs, since it may deliver different conclusions in different TLDs.

However, to the extent that there is some reasonable basis for
believing that spammers respond differentially to 550s from different
TLDs, then that imposes an additional requirement: keep the number of
TLDs small (say, 3: .com/.org/.net), or use a LOT more addresses.

If the one-TLD experiment uses 60 pairs of addresses, then a multi-TLD
experiment must use 60 pairs for each TLD.  Simple as that. Going to even a
small number of TLDs (eg 3 TLDs) while keeping the original number of
addresses as you suggest is going to be a disaster if the TLD does have
some effect, as it reduces the amount of data which can tell you about the
effects of the 550 responses where they are the only independent variable by
a factor of three.
It would be helpful not to restrict the TLDs to those where English is the
prime language, as in the three you list.  Maybe use .com, .uk, .fr, .de
(plus .org and .net maybe).
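
The arithmetic behind that can be sketched as below.  The function names
are made up for illustration, and the square-root remark assumes the usual
case where the standard error of a mean falls as 1/sqrt(n); nothing here
comes from a real experiment.

```python
import math

def pairs_per_tld_if_split(total_pairs, num_tlds):
    """Pairs left within each TLD when a fixed budget is split evenly."""
    return total_pairs // num_tlds

def total_pairs_needed(pairs_per_tld, num_tlds):
    """Pairs needed overall if every TLD gets the full per-TLD sample."""
    return pairs_per_tld * num_tlds

ONE_TLD_PAIRS = 60   # figure from the one-TLD design discussed above
NUM_TLDS = 3         # e.g. .com/.org/.net

split = pairs_per_tld_if_split(ONE_TLD_PAIRS, NUM_TLDS)
needed = total_pairs_needed(ONE_TLD_PAIRS, NUM_TLDS)

# Assumption: standard error scales as 1/sqrt(n), so cutting the
# per-TLD sample by a factor of three widens it by sqrt(3).
se_inflation = math.sqrt(ONE_TLD_PAIRS / split)

print(f"pairs per TLD if 60 are split across 3 TLDs: {split}")
print(f"pairs needed for 60 per TLD across 3 TLDs:   {needed}")
print(f"standard error inflation within one TLD:     {se_inflation:.2f}x")
```

So keeping 60 pairs in total leaves 20 per TLD, while keeping the power of
the one-TLD design within each TLD takes 180 pairs.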

There are four potential gains to using several TLDs, provided that enough
data is collected to make a valid experiment within each individual TLD.
First, we can see whether the 550 method has different effects in different
domains; second, we can get some idea of the effect of TLD on spam volume
(anecdotal evidence conflicts here, and I've seen no solid numbers);  third,
if the TLD does in fact make no difference we have several times as much
data to work with; fourth, if the 550 response does indeed have an effect we
will be able to see if part of that effect is a reduction or increase in the
unexplained variance.
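
A minimal sketch of the first of those gains, assuming each record is one
matched pair of addresses within a single TLD.  The record layout, the
function name and every number are invented for illustration; no real
measurements are implied.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (tld, spam volume at the address with the 550
# behaviour, spam volume at its matched control address).
pairs = [
    (".com", 80, 100),
    (".com", 90, 110),
    (".de", 50, 52),
    (".de", 47, 45),
]

def mean_paired_difference_by_tld(records):
    """Average (550 address - control) difference, computed within each TLD."""
    diffs = defaultdict(list)
    for tld, with_550, control in records:
        diffs[tld].append(with_550 - control)
    return {tld: mean(values) for tld, values in diffs.items()}

by_tld = mean_paired_difference_by_tld(pairs)
for tld, diff in sorted(by_tld.items()):
    print(tld, diff)
```

Differences are only ever taken inside a pair, so a TLD that attracts more
spam overall does not leak into the estimate of the 550 effect; only a
difference between the per-TLD averages would suggest the effect varies
by TLD.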

...I think it's perfectly reasonable to measure daily volumes...

Knock yerself out.  Devote as much time as you like to analyzing
daily volume.  In fact, you can start right now, using Peter's data;
it's a large enough sample to permit a reasonably robust estimate of
the "true" population variance.  You might find analyzing those data
to be a statistically informative exercise; I know I did.

I'm not the least bit interested in trying to do any further analysis on
daily data.  What bothers me about just collecting (say) 90 day volumes is
that an appropriate measure might be seven day volumes or 1 month volumes or
three month volumes or even 90 days (unlikely - 90 days is neither an even
number of weeks nor an even number of months so it won't properly mask
periodic effects based on the calendar, which probably will be present).  If
I have 1 day numbers I can use them to produce the numbers for any period
which is a multiple of 1 day, and see which multiple of 1 day reduces the
unexplained variance best (provided the experiment runs for long enough,
that is) and that way I get to see whether the 550 responses (if they have
any effect at all) produce a flat reduction or a change in shape or both.
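
Roughly what I mean, as a sketch.  The daily series is invented (a pure
weekly cycle with quiet weekends), the helper names are mine, and the
coefficient of variation is used here only as a crude stand-in for
"unexplained variance".

```python
from statistics import mean, pstdev

def window_totals(daily, days):
    """Sum daily counts into consecutive non-overlapping windows of `days`
    days, dropping any incomplete window at the end."""
    usable = len(daily) - len(daily) % days
    return [sum(daily[i:i + days]) for i in range(0, usable, days)]

def relative_spread(values):
    """Coefficient of variation: population standard deviation / mean."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")

# Made-up data: 12 identical weeks, fewer messages on the last two days.
weekly_pattern = [10, 12, 11, 13, 12, 4, 3]
daily = weekly_pattern * 12

for days in (1, 7, 28):
    totals = window_totals(daily, days)
    print(f"{days:>2}-day windows: n={len(totals):>2}, "
          f"spread={relative_spread(totals):.3f}")

# 90 days is not a whole number of weeks, so a 90-day total always
# contains a partial week.
print("days left over in 90 after whole weeks:", 90 % 7)
```

With a purely weekly cycle the 7-day and 28-day totals come out identical
from window to window, whereas the 1-day figures do not.  Real data will
be noisier, but none of this can be tried at all unless the daily numbers
were kept.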

Anyway, you've already seen the comments I made after an initial analysis of
Peter's data - I think I was the first to point out that there was no
evidence of a downward trend, not even evidence of the absence of an upward
trend, and that daily volume is very noisy indeed. And without data for a
properly organised control address to compare it with, little useful analysis
can be done except to note that there is a good deal of unexplained variance
and no visible trend at any reasonable level of significance.

Tom


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg