On Wed, 2005-01-05 at 09:16, Stephane Bortzmeyer wrote:
On Wed, Jan 05, 2005 at 08:44:10AM -0600,
Andy Bakun <spf(_at_)leave-it-to-grace(_dot_)com> wrote
a message of 38 lines which said:
(I am purposely reordering the message)
In the meantime, you risk false positives and ignored email
Do you really know what Bayesian filtering is? If not, consider that a bad
score for one word is not enough to classify a message.
Yes, a small score has little influence on how an incoming message is
classified, but the figures you provided imply a classification in the
reporting, and as such they show correlation, not causation. In general
I find it ridiculous to add rules and patterns that increase the spam
score by some minute amount for any purpose other than recording
generalized trends (that is, getting the Bayesian filter to record
something for reporting purposes without taking action on it).
It might be more interesting to see the score you assign to
SPF-{pass,fail,none} compared to the average score for all your rules
and the classify-as-spam threshold in those figures.
This scheme falls down in the face of spammers using SPF, which we
know at least some do now
Of course. See the figures I provided.
Your figures say to me that the presence of SPF-{pass,fail,none} in the
header is not useful enough to classify spam: 50% of passes are spam,
~55% of none are good, and the sample size for fail is too small to be
interesting. The intent of SPF, which it is well on its way to reaching,
is to make these numbers, taken alone, even more useless than they are
now for classifying spam. Once they are combined with reputation
systems, the numbers become much more interesting.
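
To put rough numbers on that, here is a minimal sketch. It is not
bogofilter's actual scoring; it assumes a simplified Graham-style naive
Bayes combination, and the body token probabilities are made up for
illustration. Only the SPF values come from the figures in this thread
(about 50% of passes are spam, and about 55% of none are good, so about
45% are spam).

```python
from math import prod


def combine(probs):
    """Simplified Graham-style combination of per-token spam probabilities.

    Assumption: tokens are treated as independent. This is an
    illustration only, not bogofilter's real algorithm.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical per-token spam probabilities for the rest of a message.
body_tokens = [0.9, 0.8, 0.3]

# Per-token spam probabilities implied by the figures in this thread.
SPF_PASS = 0.50  # about 50% of SPF-pass mail was spam
SPF_NONE = 0.45  # about 55% of SPF-none mail was good

baseline = combine(body_tokens)
with_pass = combine(body_tokens + [SPF_PASS])
with_none = combine(body_tokens + [SPF_NONE])

print(f"baseline:      {baseline:.4f}")
print(f"with SPF-pass: {with_pass:.4f}")
print(f"with SPF-none: {with_none:.4f}")
```

Under these assumptions a token at 0.5 cancels out of the combination
entirely, and one at 0.45 moves the result by only about a percentage
point, so neither result would change how this message is classified.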
The simplistic mapping above WILL have to be changed when SPF is
more widely deployed.
I do not think so.
If you added the rules to bogofilter solely in order to track statistics
on SPF deployment, then you are correct: they do not need to be changed,
nor should they be. But if your intent is to eventually fold SPF results
into your bogofilter rules by giving them higher scores (which I've seen
people do based solely on the kind of frequency reports you've
generated), then you'll need to keep adjusting them as deployment
increases and SPF alone ceases to show any obvious indication of
spamminess.