spf-discuss

RE: Suggest New Mechanism Prefix NUMBER to Accelerate SPF Adoption

2004-08-25 13:02:25
At 03:32 PM 8/25/2004 -0400, Stuart D. Gathman wrote:
On Thu, 26 Aug 2004, AccuSpam wrote:

But I do not agree that ~ is 0.5 or that ? is 0.1.  They are forever
ambiguous in terms of probability, because they are already used under the
previous ambiguous definitions.  Actually I think ? means "do not know" or
"neutral", and since you can not do anything with that, it is essentially the
same as PASS, depending on the receiver's interpretation of "neutral".

It is essentially the same as NONE.


No result?  Agree.  Then it means you can not do anything with the result.


 ~ means most probably forged,
it is a softfail, not a softpass.


Agreed because definition says "MTAs SHOULD accept the message but MAY subject 
it to a higher transaction cost, deeper scrutiny, or an unfavourable score."


So the probability of forgery
would be something like 0.9.


I did not see that probability documented in official SPF specs.

How do you know that is what owner of domain intended?

In anti-spam, 0.99 is very, very, very, very different mathematical input than 
0.9.  And 0.95 is still very different.

A simple example would be if you receive 1000 forgeries a day, then 0.99 
certainty means only 10 forgeries get through, whereas 0.9 means 100 forgeries 
get through.  You could really mess up the mathematical accuracy by assuming 
0.9 when it is really 0.99.
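
The arithmetic above can be sketched in a few lines of Python.  This is only an illustration: the function name forgeries_passed is made up here, and it assumes "certainty" means the fraction of forgeries that get caught.

```python
def forgeries_passed(forgeries_per_day, certainty):
    """Expected forgeries that slip through when a fraction `certainty` is caught."""
    return forgeries_per_day * (1.0 - certainty)

# Compare the assumed certainties discussed above for 1000 forgeries a day.
for certainty in (0.9, 0.95, 0.99):
    missed = forgeries_passed(1000, certainty)
    print(f"certainty {certainty}: about {missed:.0f} forgeries get through")
```

Going from 0.9 to 0.99 cuts the number that get through by a factor of ten, which is why the two can not be treated as interchangeable inputs.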

For an ISP with a million users, ranges will be from 0 - 0.999999.  That is many 
dB of range.  You can not just make an assumption and expect to get anywhere 
near the correct result.
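
One way to picture the "dB of range" is to put the residual uncertainty 1 - p on a decibel scale.  This is an assumed reading of the remark, not something defined in the SPF drafts, and the function name certainty_db is hypothetical.

```python
import math

def certainty_db(p):
    """Residual uncertainty (1 - p) expressed in decibels (assumed interpretation)."""
    return -10.0 * math.log10(1.0 - p)

# Each extra "9" of certainty adds roughly 10 dB on this scale.
for p in (0.9, 0.99, 0.999999):
    print(f"p = {p}: {certainty_db(p):.1f} dB")
```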


 The intent is that the only way it would
not be a forgery is if the administrator made a boo boo (which could
happen since they have just implemented SPF).

Where does the draft spec say that?

I always thought it meant, "I can not set -all because I am not 100% sure of my 
customers' compliance, but I am reasonably sure".

The larger point being that, irrespective of each of our interpretations, 
receivers will make many different interpretations, because the definition 
allows such.

My suggestion will fix (minimize) those errors.


I am personally happy with the current coarse grained results.  Why?
Because after rejecting the obvious forgeries (FAIL), they result
in nice tokens in the Received-SPF header.  The bayesian filter then
quickly determines empirically the spam probabilities for NONE,
PASS,NEUTRAL,SOFTFAIL,ERROR,UNKNOWN.

Here are the current stats:

SPF result      probability of spam
----------      -------------------
NEUTRAL         0.898679
NEUTRAL(guess)  0.926437
PASS            0.101463
PASS(guess)     0.257572
SOFTFAIL        0.910824
NONE(guessed)   0.658428
UNKNOWN         0.580007
ERROR           too rare to measure
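
For readers unfamiliar with how such numbers come about, here is a minimal sketch of estimating a spam probability per SPF result from mail that has already been classified.  It is not the actual filter behind the table; real Bayesian filters typically add smoothing and combine many tokens.  The function name and the sample data are made up for illustration.

```python
from collections import Counter

def spam_probability_by_result(messages):
    """messages: iterable of (spf_result, is_spam) pairs from already-classified mail."""
    total = Counter()
    spam = Counter()
    for result, is_spam in messages:
        total[result] += 1
        if is_spam:
            spam[result] += 1
    # Plain frequency estimate: spam count divided by total count per result.
    return {result: spam[result] / total[result] for result in total}

# Made-up training data, for illustration only.
sample = ([("PASS", False)] * 9 + [("PASS", True)]
          + [("SOFTFAIL", True)] * 9 + [("SOFTFAIL", False)])
print(spam_probability_by_result(sample))
```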


That is a mathematical disaster.

You are assigning the same probabilities for all domains.  But not all domains 
will have the same probability.

Unless you have the algorithm (it is possible) and the huge volume of data to 
feed the algorithm (you will need a significant % of internet email), you can 
not reliably determine the probabilities for EACH domain mathematically.
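
To illustrate the objection, here is a small sketch showing how one pooled number per SPF result can hide very different per-domain rates.  The domains, the log, and the helper name spam_rate are all hypothetical.

```python
from collections import defaultdict

def spam_rate(messages, key):
    """Fraction of spam per group; `key` maps a (domain, result, is_spam) record to its group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [spam count, total count]
    for record in messages:
        group = key(record)
        counts[group][1] += 1
        if record[2]:
            counts[group][0] += 1
    return {group: spam / total for group, (spam, total) in counts.items()}

# Hypothetical log: both domains yield SOFTFAIL, but at very different spam rates.
log = ([("isp.example", "SOFTFAIL", True)] * 5
       + [("isp.example", "SOFTFAIL", False)] * 5
       + [("bank.example", "SOFTFAIL", True)] * 10)

pooled = spam_rate(log, key=lambda r: r[1])              # one number per SPF result
per_domain = spam_rate(log, key=lambda r: (r[0], r[1]))  # one number per (domain, result)
print(pooled)
print(per_domain)
```

With this made-up log the pooled SOFTFAIL rate is 0.75, while the per-domain rates are 0.5 and 1.0.  Getting stable per-domain numbers in practice needs enough mail from each domain, which is the data-volume problem described above.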


I could care less what the sender claims the probabilities are, and would
completely ignore any such extension to SPF.  I'll stick with hard data,
not some guess the sender pulled out of their [select 3 letter epithet,
e.g. "HAT"].


Owner of domain tells you what to trust, so you already trust owner.

If you have correct mathematical data to override owner's setting, then go 
ahead, but your Bayesian above is highly flawed because it does not allow that 
each domain has a different probability.

The owner can provide useful data and I already explained why it is very 
important to owners to be able to express what probabilities they want for 
their outgoing email.

You just PROVED the risk that large ISPs (AOL, Earthlink, etc) have in setting 
anything but "?all"

You PROVED that they can never set anything but "?all" or "-all" or "+all".