spf-discuss

RE: Suggest New Mechanism Prefix NUMBER to Accelerate SPF Adoption

2004-08-25 14:39:06

That is a mathematical disaster.

You are assigning the same probabilities for all domains.  But not all
domains will have the same probability.

BINGO!  And not all receivers have the same probabilities for a given
domain either.


If you can measure your forgery rate per domain (which I already asserted you 
cannot do a priori unless you have enough data to cover every domain you will 
ever see), then you can apply that as an additional probability to combine with 
the probability set by the owner.


 And actually, bayes automatically tracks probability for each
domain as well (since the domain appears in the Received-SPF header), but
that table would be way too big to post.


Mathematically false.  One of the assumptions of NAIVE Bayesian filtering is 
that the probabilities of the different words in the same message are not 
correlated.  If you tried to do non-NAIVE, you would probably need a 
supercomputer and terabytes of storage.

Two other errors in your approach:

(1) You do not record the probabilities separately for each domain.

(2) You assume that "probability of spam" == "probability of forgery"




Unless you have the algorithm (it is possible) and the huge volume of data to
feed the algorithm (you will need a significant % of internet email), then you
cannot reliably determine, mathematically, the probabilities for EACH domain.

Wrong.  That is how bayesian filters work.


Trust me, it does not.

Ask someone who has a degree in Mathematics.  I took many classes towards a 
Mathematics minor in college, and Probability and Statistics Theory was one in 
which I got a 98% grade.


 An empirical measurement is
a lot more meaningful than some number made up on the spot.


Agreed, if you are measuring something that *strongly* correlates to what you 
are applying it to.  You are not.

Why do you think a large ISP would declare a value that they had not obtained 
with some measurement and data?  Do you think they want their e-mail to be 
misclassified?  They have "as much" (less but significant) to lose as 
you (their outgoing email and reputation), and that is the basis behind SPF 
overall when they declare any SPF record.


 All
you have to do is ensure that it gets fed meaningful tokens, and the
bayesian algorithm does the rest automatically

It is not magic.  You need to understand the probability theory behind it.  
Google "naive bayesian".


- actually measuring the
statistics instead of making them up.  Numbers supplied by the sender
would be worthless as tokens, unless quantified into a handful of
ranges (called, just for example, NONE,PASS,FAIL,SOFTFAIL,NEUTRAL,UNKNOWN).

You would combine the numbers with the Bayesian result using an equation such 
as the one I have in the opening post of this thread:

http://archives.listbox.com/spf-discuss(_at_)v2(_dot_)listbox(_dot_)com/200408/1063.html

P(a @ b) = P(a) * P(b) / [P(a) * P(b) + (1 - P(a)) * (1 - P(b))] 

So if Bayesian returns .9 and SPF returns 0.8, then:

P( .9 @ .8 ) = .9 * .8 / (.9 * .8 + .1 * .2) = 0.973
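
For anyone who wants to try other values, here is a minimal Python sketch of the equation above.  The function name combine() is made up for illustration and is not part of any SPF or Bayesian filter library, and it assumes the two inputs are independent estimates of the same thing:

```python
def combine(p_a: float, p_b: float) -> float:
    """Combine two probability estimates using
    P(a @ b) = P(a)*P(b) / [P(a)*P(b) + (1 - P(a))*(1 - P(b))].

    Assumes p_a and p_b are independent estimates in the range [0, 1].
    """
    agree = p_a * p_b                      # both estimates say "yes"
    disagree = (1.0 - p_a) * (1.0 - p_b)   # both estimates say "no"
    total = agree + disagree
    if total == 0.0:
        # One input is exactly 0 and the other exactly 1: no answer is defined.
        raise ValueError("estimates contradict each other with certainty")
    return agree / total


if __name__ == "__main__":
    # Bayesian filter returns .9, SPF returns .8
    print(round(combine(0.9, 0.8), 3))  # prints 0.973
```

Note that an input of 0.5 leaves the other estimate unchanged (combine(0.5, 0.8) gives 0.8), so a sender or receiver with no information contributes nothing either way.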