Correcting a typo in my post and clarifying my point once and for all.
Change:
P( SPF_result @ Domain ) != SPF_result+Domain
To:
P( SPF_result @ Domain ) != P( SPF_result+Domain )
Per Meng's criteria for limiting post volume, yes, I do think there is a general
misunderstanding on this list about using SPF in Bayesian filtering, at least
among everyone who posted to this thread except Matthew (I cannot speak for the
lurkers, obviously). So I do feel I must correct this misdirection. Please do
not be angry at me *personally* for sharing my opposite analysis. I can choose
not to contribute and I am very close to that decision. Hooray!
What does this mean, and why does it mean that Stuart's (and probably many
others') use of Bayesian is to some degree less accurate than it could be? For
example, say the e-mail headers he feeds into Bayesian are (as he has claimed):
From:...Domain...
[...]
Received-SPF:Domain+SPF_result
Received-SPF:SPF_result
From his posts, we see that "Received-SPF:NEUTRAL" has a 0.89 Bayesian
probability of being spam. And we see that his Bayesian tokenizer is combining
only adjacent "words" (tokens), i.e. it is doing bi-gram token analysis, which
is better than any Bayesian filter that does not do bi-grams, but it is not
n-gram correlation; otherwise his 100 MB of Bayesian storage would be terabytes.
So if Seth will query his 100 Megs database for all P( Domain+SPF_result ), he
will see they are not equal to:
P( Domain ) * P( SPF_result ) / [P( Domain ) * P( SPF_result ) + (1 - P( Domain
)) * (1 - P( SPF_result ))]
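A minimal sketch of that check, assuming the per-token spam probabilities have
already been read out of the database. The function names and the numbers in
the usage lines are hypothetical and are not Stuart's data:

```python
def combine_pair(p_a: float, p_b: float) -> float:
    """Naive-Bayes combination of two token spam probabilities.

    Only valid if the two tokens are independent of each other.
    """
    spam = p_a * p_b
    ham = (1.0 - p_a) * (1.0 - p_b)
    return spam / (spam + ham)


def independence_gap(p_domain: float, p_result: float, p_bigram: float) -> float:
    """Measured bigram probability minus what independence would predict."""
    return p_bigram - combine_pair(p_domain, p_result)


# Hypothetical values, for illustration only.
gap = independence_gap(p_domain=0.60, p_result=0.80, p_bigram=0.95)
print(f"measured minus predicted: {gap:+.3f}")
```

A gap that is consistently far from zero across domains would indicate that
the single tokens and the bigram are not independent pieces of evidence.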
Thus, the Bayesian classifier is going to combine the following 3 probabilities:
P( Domain ) = a
P( SPF_result ) = b
P( Domain+SPF_result ) = c
As:
P( a @ b @ c ) = abc / (abc + (1 - a)(1 - b)(1 - c)), as quoted from Paul
Graham's web site
This will be inaccurate, because P( SPF_result ) and P( Domain ) are already
correlated in P( Domain+SPF_result ) (evident from the fact they are not equal,
as explained above), and thus you are feeding probabilities into the naive
classifier under the assumption that they are not correlated. Even Paul Graham
admits this assumption on his web site.
So what does this inaccuracy do?
(1) Take an example (insert any numbers you wish even though I estimated these
for a real case for AOL based on AccuSpam and Stuart's posted data). Say the
actual probabilities are:
P( Domain ) = 0.58
P( NEUTRAL ) = 0.89
P( Domain+NEUTRAL ) = 0.91
So Stuart's use of Bayesian will get a probability of spam (for this SPF and
Domain portion of the Bayesian data) of:
0.58 * 0.89 * 0.91 / [0.58 * 0.89 * 0.91 + (1 - 0.58)(1 - 0.89)(1 - 0.91)] =
0.991 (about a 1 in 114 chance of being legitimate mail)
But the actual probability is only 0.91 (a 1 in 11 chance of being legitimate
mail).
Thus the estimated chance that this AOL mail is legitimate is understated by a
factor of about 10, which means more false positives for AOL. Note this
calculation will get more skewed as P( Domain ) increases for AOL, which is
bound to happen in future as forgery increases.
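The arithmetic in (1) can be reproduced with a short sketch. It uses the three
probabilities given above and the naive combination with a sum in the
denominator; the function name `combine` is just an illustrative choice:

```python
from math import prod


def combine(probs):
    """Naive-Bayes combination of token spam probabilities.

    Assumes the tokens are independent of one another.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


p_domain, p_neutral, p_bigram = 0.58, 0.89, 0.91  # values from item (1)

p_naive = combine([p_domain, p_neutral, p_bigram])
print(f"naive combination: {p_naive:.3f}")
print(f"ham odds (naive):  1 in {1 / (1 - p_naive):.0f}")
print(f"ham odds (bigram): 1 in {1 / (1 - p_bigram):.0f}")
```

Swapping in other numbers shows how quickly the combined value moves toward 1
when the same evidence is counted more than once.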
(2) Stuart can mitigate the above problem by removing the following header from
his input to *naive*-Bayesian. This is simply because, as shown above, his
naive classifier is assuming that input tokens are not correlated. If he
violates that assumption, then he gets an inaccurate result:
Received-SPF:SPF_result
(3) More importantly, for less popular domains that Stuart's classifier has not
yet seen several times, Stuart will see that he has no P( Domain+NEUTRAL )
yet, thus his calculation becomes:
P( NEUTRAL ) = 0.89
But the actual probability could be anywhere between >0 and <1 (more
realistically, probably between 0.50 and <1). As I said, #2 would mitigate
this, because Stuart would not use P( SPF_result ), only P( Domain+SPF_result
). But see #4...
(4) But then with the #2 fix, in this case of a less popular domain, Stuart will
not be using the return value from SPF at all. That is one reason I was
advocating returning a probability, so the a priori probability can contribute
to the Bayesian (or whatever algorithm) calculation even if the domain has not
been seen enough to formulate a probability of spam given P( Domain+SPF_result ).
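The token selection described in (2) through (4) might be sketched as follows.
Everything here is an assumption made for illustration: the `bigram_table`
layout, the `published_prior` argument (no such value exists in SPF today), the
literal domain strings, and the 0.7 figure. It is not Stuart's code.

```python
from typing import Dict, List, Optional, Tuple


def spf_evidence(
    domain: str,
    result: str,
    bigram_table: Dict[Tuple[str, str], float],
    published_prior: Optional[float] = None,
) -> List[float]:
    """Return the SPF-related spam probabilities to feed the classifier.

    The standalone P(SPF_result) token is never used (fix #2), so it
    cannot be counted on top of P(Domain+SPF_result).
    """
    evidence = []
    measured = bigram_table.get((domain, result))
    if measured is not None:
        evidence.append(measured)         # locally measured bigram
    if published_prior is not None:
        evidence.append(published_prior)  # hypothetical a priori value
    return evidence


table = {("aol.com", "NEUTRAL"): 0.91}  # 0.91 is the figure from item (1)
print(spf_evidence("aol.com", "NEUTRAL", table))
print(spf_evidence("rare.example", "NEUTRAL", table))
print(spf_evidence("rare.example", "NEUTRAL", table, published_prior=0.7))
```

For the rarely seen domain the list is empty unless some a priori value is
supplied, which is the gap item (4) describes.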
(5) Please realize that the probability I was advocating would not have to be
an absolute answer, just as Stuart's measured P( Domain+SPF_result ), in cases
he has a measurement, is not an absolute. In probability, all we need to know
is how well some data correlates to the hypothesis we are testing (and what
its cross-correlation is with other data evidence). So if a domain owner can
say that xx% of my users use my SPF approved mail servers zz% of the time, then
the value yy% = xx% * zz% would be a useful a priori value to combine in with
the Bayesian. Stuart can still add his measured P( Domain+SPF_result ) values
as well, as long as fix #2 is made above to not use P( SPF_result ).
(6) Thus I retract my statement that the owner of a domain needs to be able to
predict *future* values for #5, just as Stuart's Bayesian cannot predict
*future* values. The future is either correlated with the past or it isn't.
That is a necessary assumption for any (e.g. Bayesian) classifier which uses
historical evidence. A little sleep helped me realize I was still correct.
(7) If there are any replies to this post, I urge the reader to come back to
this post of mine and really read it very carefully and try to understand what
I mean. I am not going to respond to the probable erroneous replies that will
follow (again using my bayesian experience in this list) this post (following
Meng's criteria for limiting post volume). This post will be my canonical post
on this thread. Thus I do anticipate that the people against me on this list
will attempt to rip this post to shreds. Go for it! Reader, think for yourself.
(8) I do understand that the final Bayesian answer includes data from many
tokens (often the top 15 tokens), but that does not change my point above about
the inaccurate skewing influence of the P( SPF_result ) token and the
inaccurate influence when P( Domain+SPF_result ) is lacking, i.e. no a priori
data.
Anyway, as I said, just never mind my suggestion, because apparently most of
you here on this list either think I am an idiot or you dislike me *personally*
(again using my Bayesian experience in this list). So just go ahead as you
wish. I just hope the reader will read what I wrote above and understand the
risks of using SPF without "-all" given the apparent level of misunderstanding
on this list (I cannot speak for the lurkers, and Matthew understands) about
the use of anti-spam probability algorithms.
To paraphrase Meng, "you can lead a horse to water, but you can not make it
drink". It is futile to try. I merely want to protect my public reputation
by making my case abundantly clear to astute readers.
All the best.