spf-discuss

RE: Suggest New Mechanism Prefix NUMBER to Accelerate SPF Adoption

2004-08-25 15:40:37
On Thu, 26 Aug 2004, AccuSpam wrote:
 And actually, bayes automatically tracks probability for each
domain as well (since the domain appears in the Received-SPF header), but
that table would be way too big to post.


Mathematically false.  One of the assumptions of a NAIVE Bayesian classifier
is that the probabilities of the different words in the same message are not
correlated.  If you tried to do non-NAIVE, you would probably need a
supercomputer and terabytes of storage.
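
For readers following along, the independence assumption mentioned above can be sketched roughly as follows. This is a generic textbook naive Bayes combination, not the code of any filter discussed in this thread; the function name, the add-one smoothing and the count dictionaries are all assumptions made for illustration.

```python
import math

def spam_probability(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | tokens), treating every token as independent.

    spam_counts / ham_counts map a token to the number of spam / non-spam
    messages it was seen in; n_spam / n_ham are the message totals.
    All of these names are hypothetical.
    """
    total = n_spam + n_ham
    log_spam = math.log(n_spam / total)   # prior for spam
    log_ham = math.log(n_ham / total)     # prior for non-spam
    for token in tokens:
        # Independence assumption: per-token likelihoods simply multiply
        # (added here in log space).  Add-one smoothing avoids log(0).
        log_spam += math.log((spam_counts.get(token, 0) + 1) / (n_spam + 2))
        log_ham += math.log((ham_counts.get(token, 0) + 1) / (n_ham + 2))
    # Normalise the two scores into a probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))
```

Only one pair of counts per token is needed, which is why storage grows with the number of distinct tokens rather than with the number of token combinations.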

Get a clue, and check out a real bayesian system.  Mine just tracks pairs of
adjacent tokens - which handles tracking domains together with the SPF result
just fine (the tokenizer recognizes domains), without doing anything special.
The database is about 100 MB with db3 - not exactly a supercomputer.
The software is DSPAM 2.6.5.2 with Python wrapper:

http://bmsi.com/python/dspam.html
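
A rough sketch of the adjacent-token idea, for illustration only. It does not reproduce DSPAM's tokenizer or its database; the helper names, the "+" separator and the example.com / example.net domains are made up, and the headers are assumed to be already split into tokens.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Return every single token plus every pair of adjacent tokens."""
    pairs = [a + "+" + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + pairs

spam = Counter()   # token -> number of times seen in spam
ham = Counter()    # token -> number of times seen in non-spam

def train(tokens, is_spam):
    """Add one message's tokens and token pairs to the right table."""
    (spam if is_spam else ham).update(adjacent_pairs(tokens))

# Hypothetical, already-tokenized Received-SPF headers.
train(["Received-SPF", "pass", "example.com"], is_spam=False)
train(["Received-SPF", "pass", "example.net"], is_spam=True)
train(["Received-SPF", "pass", "example.net"], is_spam=True)

# The pair "pass+example.net" now has its own SPAM / NOTSPAM counts,
# separate from "pass+example.com".
print(spam["pass+example.net"], ham["pass+example.net"])
print(spam["pass+example.com"], ham["pass+example.com"])
```

Because the SPF result and the domain sit next to each other in the header, the pair token ends up carrying per-domain counts without any special-case code.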

One of the clever design features is that every token is stored as a
64-bit CRC, so that each token, regardless of length, takes an 8-byte CRC
plus two 4-byte counts (plus an activity date used to purge stale tokens).
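
To make the fixed-size record concrete, here is a minimal sketch of such a layout. The CRC polynomial (CRC-64/ECMA-182), the little-endian byte order and the field order are assumptions chosen for illustration and may well differ from what DSPAM actually uses; the activity date is left out.

```python
import struct

POLY = 0x42F0E1EBA9EA3693          # CRC-64/ECMA-182 polynomial (assumed)
MASK = 0xFFFFFFFFFFFFFFFF          # keep the register at 64 bits

def crc64(data: bytes) -> int:
    """Plain bitwise CRC-64, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ POLY) & MASK
            else:
                crc = (crc << 1) & MASK
    return crc

# 8-byte CRC followed by two 4-byte unsigned counts: 16 bytes per token.
RECORD = struct.Struct("<QII")

def pack_token(token: str, spam_count: int, ham_count: int) -> bytes:
    """Pack one token's hash and its SPAM / NOTSPAM counts into 16 bytes."""
    return RECORD.pack(crc64(token.encode("utf-8")), spam_count, ham_count)

short = pack_token("pass", 3, 40)
long_ = pack_token("pass+some.very.long.domain.example", 7, 0)
print(len(short), len(long_))      # both records have the same size
```

The trade-off is that the original token text cannot be read back from the database, and two different tokens could in principle collide on the same 64-bit value.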

Two other errors in your approach:

(1) You do not record the probabilities separately for each domain.

Sure it does.  That is the beauty of bayes.

(2) You assume that "probability of spam" == "probability of forgery"

No, I don't.  I simply stated that I, as a receiver, have no use for
some "probability of forgery" number made up by the sender.  In fact,
no receiver has any use for the number.  Even if you came up with some
way for the sender to count actual forgeries and legit emails (and you
haven't yet), the sender could still be lying.

Trust me, it does not.

Ask someone who has a degree in Mathematics.  I took many classes towards a
Mathematics minor in college, and Probability and Statistics Theory was one
in which I got a 98% grade.

I can look up any token pair for any of the 40000/day emails I've seen in the
past year or so (less stale records) and read the SPAM and NOTSPAM counts.  It
works just fine.  Your lack of practical knowledge leads me to take any of your
statements with a grain of salt.  BTW, I got 100% in P&S.

 All you have to do is ensure that it gets fed meaningful tokens, and the
bayesian algorithm does the rest automatically.

It is not magic.  You need to understand the probability theory behind it.
Google "naive bayesian".

It's apparently magic to you - as in too advanced for you to understand
how it works.

-- 
              Stuart D. Gathman <stuart(_at_)bmsi(_dot_)com>
    Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.