spf-discuss

RE: Suggest New Mechanism Prefix NUMBER to Accelerate SPF Adoption

2004-08-26 04:30:56
Correcting a typo in my previous post, and clarifying my point once and for all.

Change:

P( SPF_result @ Domain ) != SPF_result+Domain

To:

P( SPF_result @ Domain ) != P( SPF_result+Domain )


Per Meng's criteria for limiting post volume: yes, I do think there is a general 
misunderstanding on this list about feeding SPF results into a Bayesian 
classifier, shared by everyone who posted to this thread except Matthew (I 
obviously cannot speak for the lurkers).  So I feel I must correct this 
misdirection.  Please do not be angry at me *personally* for sharing my 
opposing analysis.  I can choose not to contribute, and I am very close to that 
decision.  Hooray!


What does this mean, and why does it make Stuart's (and probably many others') 
use of Bayesian filtering somewhat less accurate than it could be?  For example, 
say the e-mail headers he feeds into his Bayesian classifier are (as he has 
claimed):

From:...Domain...
[...]
Received-SPF:Domain+SPF_result
Received-SPF:SPF_result

From his posts, we see that "Received-SPF:NEUTRAL" has a 0.89 Bayesian 
probability of being spam.  And we see that his Bayesian tokenizer combines 
only adjacent "words" (tokens), i.e. it does bi-gram token analysis, which is 
better than any Bayesian filter that does not do bi-grams, but it is not full 
n-gram correlation; otherwise his 100 MB of Bayesian storage would be 
terabytes.

So if Stuart queries his 100 MB database for all P( Domain+SPF_result ), he 
will see they are not equal to:

P( Domain ) * P( SPF_result ) / [ P( Domain ) * P( SPF_result ) + 
(1 - P( Domain )) * (1 - P( SPF_result )) ]

Thus, the Bayesian classifier is going to combine the following 3 probabilities:

P( Domain ) = a
P( SPF_result ) = b 
P( Domain+SPF_result ) = c

As:

P( a @ b @ c ) = abc / (abc + (1 - a)(1 - b)(1 - c)),  as given by Paul 
Graham's combining formula on his web site

This will be inaccurate, because P( SPF_result ) and P( Domain ) are already 
correlated within P( Domain+SPF_result ) (evident from the fact that the 
quantities above are not equal), and thus you are feeding probabilities into 
the naive classifier under the assumption that they are uncorrelated.  Even 
Paul Graham admits this independence assumption on his web site.
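To make the combining rule concrete, here is a minimal Python sketch of 
Graham-style naive combination (the function name and structure are my own 
illustration, not Stuart's actual code):

```python
def graham_combine(probs):
    """Paul Graham's naive-Bayes combining rule:
    P = prod(p) / (prod(p) + prod(1 - p)).
    It assumes the individual token probabilities are independent --
    exactly the assumption violated when correlated tokens such as
    P(Domain), P(SPF_result) and P(Domain+SPF_result) are fed in
    together."""
    num, den = 1.0, 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)
```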

So what does this inaccuracy do?


(1) Take an example (insert any numbers you wish, although I estimated these 
for a real AOL case based on AccuSpam and Stuart's posted data).  Say the 
actual probabilities are:

P( Domain ) = 0.58
P( NEUTRAL ) =  0.89
P( Domain+NEUTRAL ) = 0.91

So Stuart's Bayesian classifier will compute a spam probability (for the SPF 
and Domain portion of the Bayesian data) of:

0.58 * 0.89 * 0.91 / [0.58 * 0.89 * 0.91 + (1 - 0.58)(1 - 0.89)(1 - 0.91)] = 
0.991 (roughly a 1 in 113 chance of being legitimate)

But the actual probability is only 0.91 (a 1 in 11 chance of being legitimate).

Thus you can see that AOL's false-positive rate has increased by roughly a 
factor of 10.

Note that this calculation will become more skewed as P( Domain ) increases for 
AOL, which is bound to happen in the future as forgery increases.
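Plugging the example figures into the combining formula can be checked with a 
small illustrative Python snippet (the numbers are the estimates from the 
example above, not measured data):

```python
# Estimated probabilities from the AOL example above (illustrative only)
a = 0.58   # P(Domain)
b = 0.89   # P(NEUTRAL)
c = 0.91   # P(Domain+NEUTRAL), the actual measured bi-gram probability

naive = (a * b * c) / (a * b * c + (1 - a) * (1 - b) * (1 - c))
print(round(naive, 3))      # 0.991 -- naive classifier's spam score
print(round(1 - naive, 4))  # 0.0088 -- implied chance mail is legitimate
print(round(1 - c, 2))      # 0.09 -- actual chance mail is legitimate
```

The implied chance of legitimacy shrinks from about 1 in 11 to about 1 in 113, 
which is the factor-of-ten skew claimed above.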


(2) Stuart can mitigate the above problem by removing the following header from 
the input to his *naive* Bayesian classifier.  This is simply because, as shown 
above, his naive classifier assumes that input tokens are uncorrelated; if he 
violates that assumption, he gets inaccurate results:

Received-SPF:SPF_result


(3) More importantly, for less popular domains that Stuart's classifier has not 
yet seen several times, Stuart will find that he has no P( Domain+NEUTRAL ) 
yet, so his calculation collapses to:

P( NEUTRAL ) =  0.89

But the actual probability could be anywhere between >0 and <1 (more 
realistically, probably between 0.50 and <1).  As I said, fix #2 would mitigate 
this, because Stuart would then not use P( SPF_result ), only 
P( Domain+SPF_result ).  But see #4...


(4) But with fix #2, in this case of a less popular domain, Stuart will not be 
using the return value from SPF at all.  That is one reason I was advocating 
returning a probability: the a priori probability can then contribute to the 
Bayesian (or whatever other algorithm's) calculation even when a domain has not 
been seen often enough to estimate P( Domain+SPF_result ).
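What fixes #2 through #4 imply for token selection can be sketched as follows 
(a hypothetical function of my own; SPF as specified returns a result string, 
not a probability, so `spf_prior` here stands in for the a priori probability I 
was advocating):

```python
def spam_evidence(bigram_p, spf_prior):
    """Fix #2: never use the bare P(SPF_result) token, since it is
    correlated with P(Domain+SPF_result).  Fixes #3/#4: when the domain
    is too rare to have a measured bi-gram probability, fall back on an
    a priori probability supplied via SPF (the value I was advocating),
    rather than on nothing at all."""
    if bigram_p is not None:
        return bigram_p          # measured P(Domain+SPF_result)
    return spf_prior             # a priori value for unseen domains
```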


(5) Please realize that the probability I was advocating would not have to be 
an absolute answer, just as Stuart's measured P( Domain+SPF_result ), in cases 
where he has a measurement, is not absolute.  In probability, all we need to 
know is how well some piece of evidence correlates with the hypothesis we are 
testing (and what its cross-correlation with the other evidence is).  So if a 
domain owner can say that xx% of my users use my SPF-approved mail servers xx% 
of the time, then the value yy% = xx% * xx% would be a useful a priori value to 
combine into the Bayesian calculation.  Stuart can still add his measured 
P( Domain+SPF_result ) values as well, as long as fix #2 above is applied so 
that P( SPF_result ) is not used.
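The owner-published value in #5 is just a product of the two stated 
percentages (a hypothetical helper; the percentages are whatever the domain 
owner declares, shown here with made-up example figures):

```python
def owner_apriori(pct_users_approved, pct_time_approved):
    """If xx% of a domain's users send through SPF-approved servers
    xx% of the time, the product gives a rough a priori probability
    that legitimate mail from that domain passes SPF."""
    return (pct_users_approved / 100.0) * (pct_time_approved / 100.0)

# e.g. 90% of users, 95% of the time
print(owner_apriori(90, 95))
```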


(6) Thus I retract my statement that the owner of a domain needs to be able to 
predict *future* values for #5, just as Stuart's Bayesian cannot predict 
*future* values.  The future is either correlated with the past or it isn't; 
that is a necessary assumption for any classifier (e.g. Bayesian) that uses 
historical evidence.  A little sleep helped me realize I was still correct.


(7) If there are any replies to this post, I urge the reader to come back to 
this post of mine, read it very carefully, and try to understand what I mean.  
I am not going to respond to the probably erroneous replies that will follow 
(again using my Bayesian experience on this list, and following Meng's criteria 
for limiting post volume).  This post will be my canonical post on this thread.  
Thus I do anticipate that the people against me on this list will attempt to 
rip this post to shreds.  Go for it!  Reader, think for yourself.


(8) I do understand that the final Bayesian answer combines data from many 
tokens (often the top 15 tokens), but that does not change my point above about 
the inaccurate skewing influence of the P( SPF_result ) token, or the 
inaccurate influence when P( Domain+SPF_result ) is missing, i.e. when there is 
no a priori data.


Anyway, as I said, never mind my suggestion, because apparently most of you on 
this list either think I am an idiot or dislike me *personally* (again using my 
Bayesian experience on this list).  So just go ahead as you wish.  I just hope 
the reader will read what I wrote above and understand the risks of using SPF 
without "-all", given the apparent level of misunderstanding on this list (I 
cannot speak for the lurkers, and Matthew understands) about the use of 
anti-spam probability algorithms.

To paraphrase Meng, "you can lead a horse to water, but you cannot make it 
drink."  It is futile to try.  I merely want to protect my public reputation by 
making my case abundantly clear to astute readers.

All the best.