ietf-asrg

Re: [Asrg] 2.a.1 Analysis of Actual Spam Data - Experimental Design

2003-08-17 19:36:36
At 01:32 PM 8/15/2003, Terry Sullivan wrote:
...........
Some scientist (don't remember who) once described scientific inquiry
as something like, "slaying a beautiful theory with an ugly fact."
I'd gently urge folks to remain mindful of the fact that all the
evidence to date (admittedly all of it still on the level of
anecdote) speaks with a single voice: there is no systematic
relationship between 550s and spam volume.  From Walter's "story of
Nadine" to Peter's original data (which I've now seen and analyzed),
not once has any nonexperimental design detected a statistically
significant negative relationship between these two variables.
(Peter's original data show a very modest negative correlation, too
small to support rejection of the null, even at 0.05, one-tailed.  My
logs for the last month show ~exactly the same-sized correlation, and
I'm not throwing 550s at all.)
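
For readers who want to run the same kind of check on their own logs, here is a minimal sketch of the test described above: compute the Pearson correlation between the two series, convert it to a t statistic, and compare that against a one-tailed critical value. The function names and the idea of pairing per-period 550 counts with per-period spam counts are assumptions made for illustration; they are not taken from Peter's or Terry's actual analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For a one-tailed test of a negative relationship at 0.05, the null is rejected only if the t statistic falls below the negative critical value from a t table for n - 2 degrees of freedom. A small negative r on a short series will usually not get there.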

My $0.02...

The "GreyListing" proposal uses a similar mechanism and provides some data as well; see http://projects.puremagic.com/greylisting/. Unfortunately, it lacks some controls (the rejected mail was never checked to see whether it was actually spam). In particular, see the following:

-------------snip----------
Analysis of Effectiveness
Based on testing with the example implementation, over a testing period of about 6 weeks, we had raw numbers of:
Unique triplets seen: 346968
Unique triplets that passed email: 8950
Effectiveness (based on triplets): 97.4%
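
The effectiveness figure can be reproduced from the two triplet counts. The sketch below assumes that "effectiveness" means the share of unique triplets that never passed any email; the function name is made up for illustration.

```python
def triplet_effectiveness(unique_seen: int, unique_passed: int) -> float:
    """Fraction of unique triplets that never passed an email."""
    return 1 - unique_passed / unique_seen


# Counts quoted above: 346968 triplets seen, 8950 passed email.
print(f"{triplet_effectiveness(346968, 8950):.1%}")  # prints 97.4%
```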

So we have a better than 97 percent efficiency assuming that all email is spam, but it's actually better than that, since most of the email that got through was not spam. Unfortunately, telling exactly how much better we did is impossible without individually inspecting each email, which of course we did not do.

Now let's look at our inefficiency:

Total emails passed: 85745
Total deliveries deferred where email was eventually passed: 33586
Percentage of emails delayed: 39.2%

Unfortunately, this is a pretty poor number. But let's correct it a bit. Almost all of these delayed emails were mailing list traffic which used a unique id for the sender address (see above note regarding VERP). So if we disregard all triplets that passed only one email, we should exclude that type of traffic, and we get a new set of numbers:

Total emails passed: 85745
Total deliveries deferred where more than one email was eventually passed: 3512
Percentage of emails delayed (adjusted): 4.1%
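
Both delay percentages are the deferred count divided by the total emails passed. A minimal sketch using the counts quoted above (the function name is an assumption for illustration):

```python
def delayed_fraction(deferred: int, total_passed: int) -> float:
    """Share of passed emails whose first delivery attempt was deferred."""
    return deferred / total_passed


TOTAL_PASSED = 85745

# Unadjusted: every deferral that eventually led to a passed email.
print(f"{delayed_fraction(33586, TOTAL_PASSED):.1%}")  # prints 39.2%

# Adjusted: only triplets that went on to pass more than one email.
print(f"{delayed_fraction(3512, TOTAL_PASSED):.1%}")  # prints 4.1%
```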

This puts things in a much more favorable light, and merely disregards delays for emails that are generally not timely anyway.

Now let's see what effect greylisting would have on network bandwidth, based on some general averages.

Average size of spam emails: 5000 bytes
Average SMTP delivery attempt overhead: 500 bytes

These numbers are based on spam collected via various methods before the testing period. We picked these as nice round numbers that are pretty closely in line with analysis of previously seen spam. As for the SMTP overhead, in most cases it was less than 500 bytes, but we decided to err on the conservative side.

From this, it follows that for every spam blocked using Greylisting, we save enough bandwidth to "pay" for 10 deferred delivery attempts. If we total that up to give a real-world number (using the unadjusted numbers to give a worst case picture):

338018 (# spams) x 5000 bytes = 1.69 Gbytes of bandwidth saved
33586 (# blocks) x 500 bytes = 16.7 Mbytes of bandwidth wasted

This gives us a net gain of over 1.67 Gbytes of traffic that was saved by implementing Greylisting in our tests. And that's just on a fairly small site.
-------------snip----------
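
The bandwidth arithmetic in the excerpt can be checked with a few lines. This sketch assumes, as the excerpt does, that every triplet that never passed email was one 5000-byte spam, and it assumes decimal units (1 Gbyte = 10^9 bytes), which is what the quoted totals imply. The variable names are made up for illustration.

```python
SPAM_BYTES = 5000          # assumed average spam size, from the excerpt
ATTEMPT_BYTES = 500        # assumed SMTP overhead per deferred attempt

spams_blocked = 346968 - 8950      # triplets seen minus triplets that passed
deferred_attempts = 33586          # unadjusted (worst-case) deferral count

saved = spams_blocked * SPAM_BYTES
wasted = deferred_attempts * ATTEMPT_BYTES
net = saved - wasted

# Deferred attempts that one blocked spam "pays" for.
break_even = SPAM_BYTES // ATTEMPT_BYTES

print(spams_blocked, saved, wasted, net, break_even)
```

This gives 338018 blocked spams, 1,690,090,000 bytes saved, 16,793,000 bytes wasted, and a net of 1,673,297,000 bytes. The wasted figure rounds to 16.8 Mbytes, so the 16.7 in the excerpt looks truncated rather than rounded; the net gain of over 1.67 Gbytes holds either way.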

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg