
RE: [Asrg] RE: 2.a.1 Analysis of Actual Spam Data - Titan Key reduces spam attacks

2003-08-10 01:30:52
At 11:00 PM 8/9/03 -0400, Yakov Shafranovich wrote:
At 09:50 PM 8/9/2003, Terry Sullivan wrote:
On Sun, 10 Aug 2003 01:37:54 +0200, Brad Knowles wrote:

 Hmm.  How would you research something like this?
[snip]
Also, I'm a bit confused--just what exactly would we be
researching?

Peter has a system called Titan Key that responds to all incoming
email (spam or not) from any unknown source with a 550.  He posted a
message to the list a while back that made what seemed a fairly
extraordinary claim (as per the subject line, "Titan Key reduces spam
attacks").

Now, he had a modest sample of data (one account, one server, for
~100 days), and the consensus I saw on the list seemed to be "we need
more data."  Several folks have suggested a design along the same
lines you did (multiple addresses, of equal visibility, etc.).

This may also apply to another proposal - Greylisting, which rejects emails
with a 445 response and forces the spammers to redeliver.


(quibble: greylisting uses a 451 response, not 445.)
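For anyone unfamiliar with the mechanism, a minimal sketch of the
greylisting decision follows (illustrative only, not any particular
implementation; real ones also track record expiry, whitelists, and
netblocks):

import time

GREYLIST_DELAY = 300     # seconds an unknown triplet must wait before retrying
seen = {}                # (client_ip, sender, recipient) -> time first seen

def smtp_response(client_ip, sender, recipient):
    triplet = (client_ip, sender, recipient)
    now = time.time()
    first_seen = seen.setdefault(triplet, now)
    if now - first_seen < GREYLIST_DELAY:
        # Temporary failure: a legitimate MTA queues the message and retries;
        # most spamware fires once and never comes back.
        return "451 4.7.1 Greylisted, please try again later"
    return "250 OK"      # retried after the delay, so accept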

Right now, I'm conducting a study of greylisting to try to determine
(or at least gather /some/ evidence of) its effect on the volume of spam.

The approach I'm using is this:
I have a number of spam trap addresses, which I split into four groups:
spam0, spam1, spam2, and spam3.
For the first two weeks, spam0 and spam1 accepted all email,
while spam2 and spam3 used greylisting.
After two weeks, I switched things so that spam0 and spam2 accept all email,
while spam1 and spam3 use greylisting.
(The experiment is still under way; I expect to publish final
 results in a week or two.)
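To make the design concrete, here's a rough restatement of the schedule
(the function and names below are my own illustration, not part of the
actual setup):

# Crossover design: spam0 never greylists, spam3 always does, and
# spam1/spam2 swap treatments at the two-week mark.
def policy(group, phase):
    """Return 'accept' or 'greylist' for a trap group in phase 0 or 1."""
    schedule = {
        "spam0": ("accept",   "accept"),    # never greylists (baseline)
        "spam1": ("accept",   "greylist"),  # switches greylisting on
        "spam2": ("greylist", "accept"),    # switches greylisting off
        "spam3": ("greylist", "greylist"),  # always greylists (baseline)
    }
    return schedule[group][phase]

for phase in (0, 1):
    print(phase, {g: policy(g, phase) for g in ("spam0", "spam1", "spam2", "spam3")})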

I'm hoping that these comparisons will allow me to overcome most of
the "noise" inherent in the data.  I expect that spam0 and spam3,
whose treatment never changes, will each receive close to the same
relative percentage of spam in both periods, and that can be used
to refine the before-and-after difference for the groups that switch.
Of course, I could be dead wrong, but we'll see.
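As a rough illustration of what I mean (the numbers here are made up,
purely to show the arithmetic), the always-accepting group can be used
to factor out overall volume drift between the two periods:

# Hypothetical weekly totals, for illustration only - not real data.
spam0_before, spam0_after = 1200, 1500   # spam0 accepts in both periods
spam1_before, spam1_after = 1150, 900    # spam1 accepts, then greylists

drift = spam0_after / spam0_before                    # overall volume grew ~25%
naive = 1 - spam1_after / spam1_before                # ~22% apparent drop
adjusted = 1 - (spam1_after / spam1_before) / drift   # ~37% drop after correction

print(f"naive drop: {naive:.0%}, drift-adjusted drop: {adjusted:.0%}")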

If it works, a similar procedure could be done for 550 blocking -
four groups: one that never blocks, one that always blocks, and the
other two switching between blocking and non-blocking.
However, I think the time period would need to be longer.
(Though perhaps several experiments could be done concurrently,
 each with a successively longer time period.)
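Sketching what that might look like (my own guess at a layout; the group
labels and phase lengths below are purely illustrative):

# Several concurrent 550-blocking experiments, each with a successively
# longer phase length; group 0 never blocks, group 3 always blocks, and
# groups 1 and 2 cross over at the phase boundary.
PHASE_WEEKS = [2, 4, 8]          # one experiment per phase length

def blocking_policy(experiment, group, week):
    phase = 0 if week < PHASE_WEEKS[experiment] else 1
    schedule = {
        0: ("accept", "accept"),
        1: ("accept", "block"),
        2: ("block",  "accept"),
        3: ("block",  "block"),
    }
    return schedule[group][phase]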


But my thoughts keep coming back to the sheer amount of statistical
noise in something as volatile as spam volume.  That level of noise
will make meaningful analysis extremely difficult.  (My area of
specialization is statistical language processing, so I routinely
encounter the frustrations associated with trying to analyze really,
really noisy data.)


No offense, but how do you know there's a large amount of noise
in the data until after you attempt to measure it?
I mean, sure we all /expect/ there will be a lot of noise,
but has anybody actually tried to measure how much noise there is?

In statistical terms, all that noise constitutes unexplained (aka
"within-group") variance.  So, even if an effect exists, it will be
very difficult to detect.  (Conversely, if the effect doesn't exist,
the absence of clear results can always be blamed on the noise in the
data.)  To make matters worse, given the amount of data that would
have to be collected, a *statistically* significant effect would not
necessarily be a *substantively* significant effect.  (The p-value
might be tiny, but R^2 might be just as tiny.)
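To put a number on that last point (a quick simulation of my own, not
anything from actual spam data): with enough very noisy observations, a
small real effect produces a tiny p-value and an equally tiny R^2.

import random

random.seed(0)
n = 200_000              # daily observations per group (illustrative)
effect = 5               # true difference in mean daily spam count
noise = 200              # day-to-day standard deviation ("noise")

control = [random.gauss(1000, noise) for _ in range(n)]
treated = [random.gauss(1000 - effect, noise) for _ in range(n)]

mean_c, mean_t = sum(control) / n, sum(treated) / n
var_c = sum((x - mean_c) ** 2 for x in control) / (n - 1)
var_t = sum((x - mean_t) ** 2 for x in treated) / (n - 1)

# Two-sample t statistic: with this much data, |t| is roughly 8,
# so the p-value is vanishingly small...
t = (mean_c - mean_t) / (var_c / n + var_t / n) ** 0.5

# ...yet group membership explains almost none of the variance.
grand = (mean_c + mean_t) / 2
ss_between = n * ((mean_c - grand) ** 2 + (mean_t - grand) ** 2)
ss_total = sum((x - grand) ** 2 for x in control + treated)
r_squared = ss_between / ss_total        # on the order of 0.0002

print(f"t = {t:.2f}, R^2 = {r_squared:.5f}")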

Now, it's always possible that the effect is so astonishingly large
that it will somehow be able to shine thru all that noise.  But if it
were really all *that* big, it strikes me as odd that no one's ever
noticed it before (at least in any shared forum).  It would simply be
a matter of some mail admin at some slowly failing company saying,
"Wow, have you noticed that, after every new round of layoffs, our
spam volume plummets, and never bounces back?"  (Sorry, bad pun.)

So the point of my posting was: if the effect's big enough to be
detected by a practicable amount of effort, then I can't but wonder
why nobody's ever tripped over it, even accidentally, at some point.
(Phrased differently: given [any] two variables observed to increase
monotonically over time, what would it take for the "true
state-of-nature" to be "huge negative correlation"?)



Consider the effect you're looking for here:
an address that rejects messages with 550 gets less spam.
To see that, you'd need to compare the spam load on two addresses.
It's not hard for me to imagine that a 40% decrease in spam
volume would go unnoticed until someone looked for it explicitly,
especially when you consider that it might well be a 40% decrease
only in comparison to what the volume would have been had you
not done the magic thing (whatever that might be).
On the other hand, it's also not hard to imagine that people
could be misled by local variances and claim 40% when the
real number is closer to 10%.
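A quick illustration of that last point (again, a simulation of my own,
not real data): with a true 10% reduction and realistic week-to-week
variation, a single before/after comparison on one address can look like
anything from a 40% drop to an outright increase.

import random

random.seed(1)
true_reduction = 0.10    # the real effect
base, noise = 1000, 250  # mean weekly spam count and week-to-week spread

observed = []
for _ in range(10):      # ten independent single-address "measurements"
    before = random.gauss(base, noise)
    after = random.gauss(base * (1 - true_reduction), noise)
    observed.append(1 - after / before)

print(["{:+.0%}".format(x) for x in observed])
# Apparent reductions scatter widely around the true 10%, in both directions.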


It's fairly easy for me to imagine an expenditure of enormous
quantities of effort, the harvest from which is vast amounts of
low-quality data... all to no avail, because the "measure" being
analyzed is so statistically unreliable that inconclusive results are
just about guaranteed.

I think that before commencing a spam study or analysis of spam data, we
would have to discuss the scientific and statistical principles that will
be used to do the study. This would also apply to any data from third-party
sources such as Brightmail.


I would argue for only one principle - any study should be
described in sufficient detail that someone else can repeat the study.


Scott Nelson <scott(_at_)spamwolf(_dot_)com>

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


