ietf-asrg

RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design

2003-08-18 13:34:14
I rejected only those addresses that did not exist in my address database of
users and honeypot addresses.
If the address does not exist, the error returned is "550 user unknown". I
can't go into any more detail as to what happens to senders that trigger
more than x 550s, or how that data is compiled and applied.
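
A minimal sketch of the address check described above, assuming the
address database can be treated as a simple set of lower-cased strings. The
names KNOWN_ADDRESSES and rcpt_response, and the example addresses, are
hypothetical and are not taken from the actual setup. It covers only the
accept/reject decision, not any tracking of senders.

```python
from typing import Tuple

# Hypothetical address database: real users plus honeypot addresses.
KNOWN_ADDRESSES = {
    "alice@example.com",   # made-up real user
    "trap01@example.com",  # made-up honeypot address
}

def rcpt_response(recipient: str) -> Tuple[int, str]:
    """Return the SMTP reply (code, text) for one RCPT TO address."""
    if recipient.strip().lower() in KNOWN_ADDRESSES:
        return 250, "OK"
    # Unknown recipients are refused during the SMTP conversation.
    return 550, "user unknown"
```

Because the refusal happens during the SMTP conversation, mail for
unknown addresses is never accepted and so never reaches the content
filters.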

Regards, 
Damon Sauer 



-----Original Message-----
From: Yakov Shafranovich [mailto:research(_at_)solidmatrix(_dot_)com]
Sent: Monday, August 18, 2003 3:47 PM
To: Sauer, Damon; 'Tom Thomson'; 'Terry Sullivan'; 'asrg(_at_)ietf(_dot_)org'
Subject: RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental
Design


Can you provide more detailed information about what kind of blocking you
did? Did you reject all incoming emails with 550 and send a challenge
message?

At 02:32 PM 8/18/2003, Sauer, Damon wrote:
Just an FYI - My ENTIRE mail volume dropped from an average of 50M a month
to 20M a month, with the same percentage being blocked at the content
filters, after implementing address checking (550 given) at the connection.



Regards,
Damon Sauer



-----Original Message-----
From: Tom Thomson [mailto:tthomson(_at_)neosinteractive(_dot_)com]
Sent: Monday, August 18, 2003 10:52 AM
To: Terry Sullivan; asrg(_at_)ietf(_dot_)org
Subject: RE: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental
Design


I feel pretty confident that one box can respond to requests sent to
multiple IP addresses, and therefore can serve as home to an
arbitrarily large number of different domains.  If these email
addresses "live" on 60 different machines, then there will be an
additional mechanical step of "synching" the data from each machine.
Then too, keeping one machine up for the experimental period strikes
me as less "overhead" than keeping 60 machines going.  That I can see,
using multiple boxes only serves as a potential confound, because
server availability affects spam volume in a systematic way. If one
machine (or worse, two) go(es) "hard down" for a week or two, the
results of the larger experiment are placed at risk.  Addresses
served by that/those machine(s) will have a lower spam volume, of
course, but not because of the independent variable.  But, as I said,
this feature qualifies as nice-to-have, but not required.

I think you misread what I was proposing.  When determining whether 550
behaviour affects spam volume, comparisons have to be between mail
addresses which are identical except for their 550 behaviour.  So you
don't compare mailboxes on two different machines, or in two different
TLDs.  What I am saying in effect is that the experiment needs to be
carried out in a number of TLDs, since it may deliver different
conclusions in different TLDs.

However, to the extent that there is some reasonable basis for
believing that spammers respond differentially to 550s from different
TLDs, then that imposes an additional requirement: keep the number of
TLDs small (say, 3: .com/.org/.net), or use a LOT more addresses.

If the one-TLD experiment uses 60 pairs of addresses, then a multi-TLD
experiment must use 60 pairs for each TLD.  Simple as that.  Going to even
a small number of TLDs (e.g. 3 TLDs) while keeping the original number of
addresses, as you suggest, is going to be a disaster if the TLD does have
some effect, as it reduces by a factor of three the amount of data which
can tell you about the effects of the 550 responses where they are the
only independent variable.  It would be helpful not to restrict the TLDs
to those where English is the prime language, as in the three you list.
Maybe use .com, .uk, .fr, .de (plus .org and .net maybe).

There are four potential gains to using several TLDs, provided that enough
data is collected to make a valid experiment within each individual TLD.
First, we can see whether the 550 method has different effects in different
domains; second, we can get some idea of the effect of TLD on spam volume
(anecdotal evidence conflicts here, and I've seen no solid numbers); third,
if the TLD does in fact make no difference we have several times as much
data to work with; fourth, if the 550 response does indeed have an effect
we will be able to see if part of that effect is a reduction or increase in
the unexplained variance.

...I think it's perfectly reasonable to measure daily volumes...

Knock yerself out.  Devote as much time as you like to analyzing daily
volume.  In fact, you can start right now, using Peter's data; it's a
large enough sample to permit a reasonably robust estimate of the
"true" population variance.  You might find analyzing those data to be
a statistically informative exercise; I know I did.

I'm not the least bit interested in trying to do any further analysis on
daily data.  What bothers me about just collecting (say) 90 day volumes is
that an appropriate measure might be seven day volumes or 1 month volumes
or three month volumes or even 90 days (unlikely - 90 days is neither a
whole number of weeks nor a whole number of months, so it won't properly
mask periodic effects based on the calendar, which probably will be
present).  If I have 1 day numbers I can use them to produce the numbers
for any period which is a multiple of 1 day, and see which multiple of 1
day reduces the unexplained variance best (provided the experiment runs for
long enough, that is), and that way I get to see whether the 550 responses
(if they have any effect at all) produce a flat reduction or a change in
shape or both.
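
A rough sketch of that re-aggregation, assuming the daily counts are
available as a plain list of integers. The helper names and the sample
numbers are made up for illustration, and the coefficient of variation is
used here only as a crude stand-in for "unexplained variance", not as the
measure a full analysis would use.

```python
from statistics import mean, pstdev
from typing import List

def aggregate(daily: List[int], period: int) -> List[int]:
    """Sum daily counts into consecutive blocks of `period` days.

    A trailing partial block is dropped so every block covers the
    same number of days.
    """
    full = len(daily) - len(daily) % period
    return [sum(daily[i:i + period]) for i in range(0, full, period)]

def relative_spread(values: List[int]) -> float:
    """Coefficient of variation: population std dev divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else float("inf")

# Made-up daily counts with a weekly rhythm (quiet weekends), 4 weeks.
daily = [10, 12, 11, 13, 12, 4, 3] * 4

# Compare how noisy the series looks at different aggregation periods.
for period in (1, 7, 14):
    print(period, relative_spread(aggregate(daily, period)))
```

A period that lines up with a calendar cycle in the data (here, 7 days)
absorbs that cycle into each block, which is the reason for keeping the
1 day numbers rather than only a 90 day total.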

Anyway, you've already seen the comments I made after an initial analysis
of Peter's data - I think I was the first to point out that there was no
evidence of a downward trend, not even evidence of the absence of an upward
trend, and that daily volume is very noisy indeed.  And without data for a
properly organised control address to compare it with, little useful
analysis can be done except to note that there is a good deal of
unexplained variance and no visible trend at any reasonable level of
significance.
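
One simple way to check a daily series for a trend, sketched under the
assumption that an ordinary least-squares line against the day index is an
acceptable first look. This is not necessarily the analysis that was run on
Peter's data, and the function name slope_and_t is hypothetical.

```python
from math import sqrt
from typing import List, Tuple

def slope_and_t(counts: List[float]) -> Tuple[float, float]:
    """Least-squares slope of counts against day index, and its t statistic."""
    n = len(counts)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = (n - 1) / 2
    y_bar = sum(counts) / n
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(counts))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares: the variance the fitted line does not explain.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(counts))
    se = sqrt(rss / (n - 2) / sxx)
    if se == 0:
        t = 0.0 if slope == 0 else float("inf")
    else:
        t = slope / se
    return slope, t
```

A t statistic small in magnitude relative to the usual critical values
means the data cannot distinguish the slope from zero in either direction,
which is a weaker statement than saying there is no trend.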

Tom


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg





