spf-discuss

RE: Suggest New Mechanism Prefix NUMBER to Accelerate SPF Adoption

2004-08-25 16:16:45

On Thu, 26 Aug 2004, AccuSpam wrote:
 And actually, Bayes automatically tracks the probability for each
domain as well (since the domain appears in the Received-SPF header), but
that table would be way too big to post.


Mathematically false.  One of the assumptions of NAIVE Bayes is that the
probabilities of the different words in the same message are not correlated.  If
you tried to do non-naive Bayes, you would probably need a supercomputer and
terabytes of storage.
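To make that independence assumption concrete, here is a minimal naive Bayes combination step (a hypothetical sketch with made-up probabilities, not DSPAM's code):

```python
from math import prod

# Hypothetical per-token spam probabilities (made-up numbers).
token_spam_prob = {
    "viagra": 0.95,
    "meeting": 0.10,
    "example.com": 0.80,   # a domain token from the Received-SPF header
}

def naive_bayes_spam_score(tokens):
    """Combine per-token probabilities via Bayes' rule, treating each
    token as independent of the others (the NAIVE assumption)."""
    ps = [token_spam_prob.get(t, 0.5) for t in tokens]
    spam = prod(ps)
    ham = prod(1 - p for p in ps)
    return spam / (spam + ham)

score = naive_bayes_spam_score(["viagra", "example.com"])
```

Note that the domain token contributes as just one factor among the others, which is the dilution point made later in this message.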

Get a clue, and check out a real Bayesian system.  Mine just does
adjacent tokens - which handles tracking domains (the tokenizer
recognizes domains) together with the SPF result just fine, without doing anything special.


No.  If you are going to track the domains correlated against all the words, 
then you would have to store every combination, not just the adjacent ones.

Two reasons this is important:

(1) Because otherwise you are only applying the domain-word probability as one 
word among many in the Bayesian calculation, when in fact evidence of forgery is 
much stronger evidence.  If your Bayesian classifier only looks at the top 15 
words, then you have diluted the forgery evidence significantly.

(2) And also because, in that example, you were depending on Bayes to 
cross-correlate the domain and the SPF-result word as two non-adjacent, 
unconnected words.  You could solve this by merging them into one word (or into 
adjacent words, if your tokenizer pairs adjacent tokens).  But the main point is #1.
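The merge suggested in (2) can be sketched like this (a hypothetical tokenizer; the function and token names are mine, not DSPAM's):

```python
def spf_tokens(domain, spf_result):
    """Emit the raw tokens plus a merged domain+SPF-result token, so a
    naive classifier can learn the combination as a single 'word'."""
    return [domain, spf_result, f"{domain}|spf={spf_result}"]

tokens = spf_tokens("example.com", "fail")
```

The merged token lets a naive classifier track "this domain with this SPF result" directly, without any cross-correlation machinery.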


You could track the forgery evidence on the domain separately and give it more 
weight in Bayesian.  That might work reasonably well.  But you still won't have 
evidence on all the domains until you've seen all the domains.  You need a huge 
sample to be effective against future forgery.  With the owner declaring the 
data, you are already prepared for future forgery as it arrives.


The database is about 100 Megs with db3 - not exactly a supercomputer.


Adjacent tokens is only a power of 2 - token pairs.  Try a power of N - every 
combination - and you can see your 100 Megs grow into terabytes.
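Rough arithmetic behind that claim (my numbers, purely illustrative upper bounds; a real table stores only the combinations actually observed):

```python
# Hypothetical vocabulary of V distinct tokens.
V = 100_000

# Adjacent-token tracking is bounded by token PAIRS: V to the power 2.
pairs = V ** 2            # 10**10 possible entries

# Cross-correlating every combination grows as V to the power N;
# even N = 3 already dwarfs the pair table.
triples = V ** 3          # 10**15 possible entries

# At 16 bytes per record (8-byte key + two 4-byte counts), the pair
# table tops out around 160 GB, and triples around 16 PB.
pair_bytes = pairs * 16
triple_bytes = triples * 16
```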


The software is DSPAM 2.6.5.2 with Python wrapper:

http://bmsi.com/python/dspam.html

One of the clever design features is that all tokens are stored as
a 64-bit CRC, so that every token, regardless of length, is stored as an 8-byte
CRC plus two 4-byte counts (plus an activity date to purge stale tokens).
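That fixed-width layout can be sketched as follows (a hypothetical illustration: two salted `zlib.crc32` passes stand in for DSPAM's actual 64-bit CRC, and the activity date is omitted):

```python
import struct
import zlib

def token_key(token: str) -> int:
    """Hash a token of any length down to a fixed 64-bit key.
    (Two salted CRC-32 passes stand in for a real CRC-64.)"""
    lo = zlib.crc32(token.encode())
    hi = zlib.crc32(b"salt:" + token.encode())
    return (hi << 32) | lo

def pack_record(token: str, spam_count: int, ham_count: int) -> bytes:
    # 8-byte key + two 4-byte counts = 16 bytes per token,
    # regardless of how long the token text is.
    return struct.pack(">QII", token_key(token), spam_count, ham_count)

rec = pack_record("example.com|spf=fail", 42, 3)
```

The point of the hash is that arbitrarily long tokens (including merged domain+SPF tokens) all cost the same fixed number of bytes.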

Two other errors in your approach:

(1) You do not record the probabilities separately for each domain.

Sure it does.  That is the beauty of bayes.


But not cross-correlated with the SPF-result words.  You could fix this by 
combining them into one word, or by putting them adjacent if your tokenizer 
pairs adjacent words.


(2) You assume that "probability of spam" == "probability of forgery"

No I don't.  I simply stated that I as a receiver have no use for
some "probability of forgery" number made up by the sender.  In fact,
no receiver has any use for the number.  Even if you came up with some
way for the sender to count actual forgeries and legitimate emails (and you
haven't yet), the sender could still be lying.


The owner could possibly make an estimate simply by sampling the bounces due to 
forged spam, or perhaps by asking customers whether they use SMTP authentication 
and how often they don't, etc.

But if you are correct that the owner does not know, then they can NEVER declare 
"-all".


It's apparently magic to you - as in too advanced for you to understand
how it works.


See my notes above.