
RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design

2003-08-18 11:35:10
Just an FYI - My ENTIRE mail volume dropped from an average of 50M a month
to 20M a month, with the same percentage being blocked at the content
filters, after implementing address checking (550 given) at the connection.



Regards,
Damon Sauer



-----Original Message-----
From: Tom Thomson [mailto:tthomson(_at_)neosinteractive(_dot_)com] 
Sent: Monday, August 18, 2003 10:52 AM
To: Terry Sullivan; asrg(_at_)ietf(_dot_)org
Subject: RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design


I feel pretty confident that one box can respond to requests sent to 
multiple IP addresses, and therefore can serve as home to an 
arbitrarily large number of different domains.  If these email 
addresses "live" on 60 different machines, then there will be an 
additional mechanical step of "synching" the data from each machine. 
Then too, keeping one machine up for the experimental period strikes 
me as less "overhead" than keeping 60 machines going.  That I can see, 
using multiple boxes only serves as a potential confound, because 
server availability affects spam volume in a systematic way. If one 
machine (or worse, two) go(es) "hard down" for a week or two, the 
results of the larger experiment are placed at risk.  Addresses 
served by that/those machine(s) will have a lower spam volume, of 
course, but not because of the independent variable.  But, as I said, 
this feature qualifies as nice-to-have, but not required.

I think you misread what I was proposing.  Comparisons have to be between
mail addresses which are identical except for their 550 behaviour when
determining whether 550 behaviour affects spam volume.  So you don't compare
mailboxes on two different machines, or in two different TLDs.  What I am
saying in effect is that the experiment needs to be carried out in a number
of TLDs, since it may deliver different conclusions in different TLDs.

However, to the extent that there is some reasonable basis for 
believing that spammers respond differentially to 550s from different 
TLDs, then that imposes an additional requirement: keep the number of 
TLDs small (say, 3: .com/.org/.net), or use a LOT more addresses.

If the one-TLD experiment uses 60 pairs of addresses, then a multi-TLD
experiment must use 60 pairs for each TLD.  Simple as that. Going to even a
small number of TLDs (eg 3 TLDs) while keeping the original number of
addresses as you suggest is going to be a disaster if the TLD does have
some effect, as it reduces the amount of data which can tell you about the
effects of the 550 responses where they are the only independent variable by
a factor of three. It would be helpful not to restrict the TLDs to those
where English is the prime language, as in the three you list.  Maybe use
.com, .uk, .fr, .de (plus .org and .net maybe).

There are four potential gains to using several TLDs, provided that enough
data is collected to make a valid experiment within each individual TLD.
First, we can see whether the 550 method has different effects in different
domains; second, we can get some idea of the effect of TLD on spam volume
(anecdotal evidence conflicts here, and I've seen no solid numbers);  third,
if the TLD does in fact make no difference we have several times as much
data to work with; fourth, if the 550 response does indeed have an effect we
will be able to see if part of that effect is a reduction or increase in the
unexplained variance.
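
A rough sketch of the per-TLD paired comparison described above, assuming
each record is one matched pair of addresses (same TLD, same machine) that
differ only in their 550 behaviour.  The function name, the record layout
and the example numbers are hypothetical and are not taken from any real
data set:

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def paired_effect_by_tld(pairs):
    """Summarise the paired difference in spam volume separately per TLD.

    pairs: iterable of (tld, control_volume, treated_volume), one entry
    per matched pair of addresses (hypothetical record layout).
    Returns {tld: (number_of_pairs, mean_difference, standard_error)}.
    """
    diffs = defaultdict(list)
    for tld, control, treated in pairs:
        # Difference within a pair; the pair shares everything except 550 behaviour.
        diffs[tld].append(treated - control)

    summary = {}
    for tld, d in diffs.items():
        n = len(d)
        # Standard error of the mean difference; undefined for a single pair.
        se = stdev(d) / sqrt(n) if n > 1 else float("nan")
        summary[tld] = (n, mean(d), se)
    return summary


# Made-up volumes purely to show the call shape.
example = [("com", 100, 80), ("com", 120, 90), ("uk", 50, 50), ("uk", 60, 40)]
result = paired_effect_by_tld(example)
```

Because each TLD is summarised on its own, the standard error for a TLD
depends only on the number of pairs inside that TLD, which is why splitting
a fixed pool of 60 pairs across three TLDs leaves each estimate much weaker.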

...I think it's perfectly reasonable to measure daily volumes...

Knock yerself out.  Devote as much time as you like to analyzing daily 
volume.  In fact, you can start right now, using Peter's data; it's a 
large enough sample to permit a reasonably robust estimate of the 
"true" population variance.  You might find analyzing those data to be 
a statistically informative exercise; I know I did.

I'm not the least bit interested in trying to do any further analysis on
daily data.  What bothers me about just collecting (say) 90 day volumes is
that an appropriate measure might be seven day volumes or 1 month volumes or
three month volumes or even 90 days (unlikely - 90 days is neither an even
number of weeks nor an even number of months so it won't properly mask
periodic effects based on the calendar, which probably will be present).  If
I have 1 day numbers I can use them to produce the numbers for any period
which is a multiple of 1 day, and see which multiple of 1 day reduces the
unexplained variance best (provided the experiment runs for long enough,
that is) and that way I get to see whether the 550 responses (if they have
any effect at all) produce a flat reduction or a change in shape or both.
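
A minimal sketch of that idea, assuming the daily counts for one address
are held as a plain list.  The coefficient of variation of the window
totals is used here only as a crude stand-in for "unexplained variance";
a real analysis would fit a model first.  All names and the example
series are hypothetical:

```python
from statistics import mean, pstdev


def window_totals(daily, k):
    """Sum consecutive non-overlapping k-day windows, dropping a trailing partial window."""
    usable = len(daily) - len(daily) % k
    return [sum(daily[i:i + k]) for i in range(0, usable, k)]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (nan if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")


def compare_windows(daily, lengths=(1, 7, 28)):
    """Return {window_length: coefficient of variation of the window totals}.

    Window lengths that yield fewer than two complete windows are skipped,
    since spread cannot be judged from a single total.
    """
    out = {}
    for k in lengths:
        totals = window_totals(daily, k)
        if len(totals) >= 2:
            out[k] = coefficient_of_variation(totals)
    return out


# Artificial series with a pure weekly cycle, four weeks long.
daily_counts = [1, 2, 3, 4, 5, 6, 7] * 4
spread = compare_windows(daily_counts)
```

With a purely weekly pattern the 7-day totals are identical, so their spread
is zero while the 1-day spread is not; a window that does not line up with
the cycle would leave part of that periodic variation in the totals.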

Anyway, you've already seen the comments I made after an initial analysis of
Peter's data - I think I was the first to point out that there was no
evidence of a downward trend, not even evidence of the absence of an upward
trend, and that daily volume is very noisy indeed. And without data for a
properly organised control address to compare it with, little useful analysis
can be done except to note that there is a good deal of unexplained variance
and no visible trend at any reasonable level of significance.

Tom


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg

