RE: [Asrg] RE: 2.a.1 Analysis of Actual Spam Data

At 09:50 PM 8/9/2003, Terry Sullivan wrote:

On Sun, 10 Aug 2003 01:37:54 +0200, Brad Knowles wrote:

>  Hmm.  How would you research something like this?
[snip]
> Also, I'm a bit confused--just what exactly would we be
> researching?

Peter has a system called Titan Key that responds to all incoming
email (spam or not) from any unknown source with a 550.  He posted a
message to the list a while back that made what seemed a fairly
extraordinary claim (as per the subject line, "Titan Key reduces spam
attacks").

Now, he had a modest sample of data (one account, one server, for
~100 days), and the consensus I saw on the list seemed to be "we need
more data."  Several folks have suggested a design along the same
lines you did (multiple addresses, of equal visibility, etc.).

This may also apply to another proposal: greylisting, which rejects
emails with a temporary 4xx response (typically 451) and forces the
spammers to redeliver.

But my thoughts keep coming back to the sheer amount of statistical
noise in something as volatile as spam volume.  That level of noise
will make meaningful analysis extremely difficult.  (My area of
specialization is statistical language processing, so I routinely
encounter the frustrations associated with trying to analyze really,
really noisy data.)

In statistical terms, all that noise constitutes unexplained (aka
"within-group") variance.  So, even if an effect exists, it will be
very difficult to detect.  (Conversely, if the effect doesn't exist,
the absence of clear results can always be blamed on the noise in the
data.)  To make matters worse, given the amount of data that would
have to be collected, a *statistically* significant effect would not
necessarily be a *substantively* significant effect.  (The p-value
might be tiny, but R^2 might be just as tiny.)
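
To make the p-value versus R^2 point concrete, here is a minimal sketch
on purely synthetic data. Everything in it is an assumption made for
illustration: the binary "rejects" flag, the standardized "volume"
measure, the effect size, and the sample size are invented, and the
p-value uses a large-sample normal approximation. None of it comes from
Peter's data.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n):
    """Two-sided p-value for H0: no correlation.

    Uses the t statistic for a correlation and a normal approximation,
    which is reasonable only because n is large here.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


def simulate(n=200_000, effect=-0.1, seed=1):
    """Hypothetical setup: a small true effect buried in unit-variance noise."""
    rng = random.Random(seed)
    rejects = [rng.randint(0, 1) for _ in range(n)]            # 1 = address answers with 550
    volume = [effect * x + rng.gauss(0.0, 1.0) for x in rejects]  # noisy spam measure
    r = pearson_r(rejects, volume)
    return r, r * r, two_sided_p(r, n)


if __name__ == "__main__":
    r, r2, p = simulate()
    print(f"r = {r:.4f}, R^2 = {r2:.4f}, p = {p:.3g}")
```

With this many observations the p-value comes out far below any
conventional threshold, yet R^2 stays well under one percent: the
"effect" is detectable and still explains almost none of the variance.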

Now, it's always possible that the effect is so astonishingly large
that it will somehow be able to shine thru all that noise.  But if it
were really all *that* big, it strikes me as odd that no one's ever
noticed it before (at least in any shared forum).  It would simply
be a matter of some mail admin at some slowly failing company saying,
"Wow, have you noticed that, after every new round of layoffs, our
spam volume plummets, and never bounces back?"  (Sorry, bad pun.)

So the point of my posting was: if the effect's big enough to be
detected by a practicable amount of effort, then I can't but wonder
why nobody's ever tripped over it, even accidentally, at some point.
(Phrased differently: given [any] two variables observed to increase
monotonically over time, what would it take for the "true
state-of-nature" to be "huge negative correlation"?)
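
One possible reading of that question, sketched on made-up numbers: two
series that both rise over time are positively correlated in their raw
levels, so a negative relationship could only show up after the shared
trend is removed, for instance in period-to-period changes. The series
names and the way the increments are constructed are hypothetical and
chosen only to make the contrast visible.

```python
import math
import random
from itertools import accumulate


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def levels_vs_changes(n=100, seed=7):
    """Build two rising series whose increments move in opposite directions."""
    rng = random.Random(seed)
    # Hypothetical per-period increase in rejecting addresses (always positive).
    d_rejecting = [rng.uniform(0.1, 1.0) for _ in range(n)]
    # Hypothetical per-period increase in spam: smaller when the other is
    # larger, but still positive, so the spam series keeps rising too.
    d_spam = [2.0 - d for d in d_rejecting]
    rejecting = list(accumulate(d_rejecting))
    spam = list(accumulate(d_spam))
    return pearson_r(rejecting, spam), pearson_r(d_rejecting, d_spam)


if __name__ == "__main__":
    r_levels, r_changes = levels_vs_changes()
    print(f"levels: r = {r_levels:.3f}   changes: r = {r_changes:.3f}")
```

In this toy case the levels correlate positively while the changes
correlate negatively, which is the kind of structure a study would have
to look for; whether real spam data has it is exactly the open question.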

It's fairly easy for me to imagine an expenditure of enormous
quantities of effort, the harvest from which is vast amounts of
low-quality data... all to no avail, because the "measure" being
analyzed is so statistically unreliable that inconclusive results are
just about guaranteed.

I think that before commencing a spam study or an analysis of spam
data, we would have to discuss the scientific and statistical
principles that will be used in the study.  This would also apply to
any data from third-party sources such as Brightmail.

Yakov


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg
