RE: [ietf-dkim] Bayesian filters are the pits

I think I am working towards a rather broader critique of the way that 
SpamBayes etc. are applied.

Naïve Bayesian learning schemes are intrinsically vulnerable to 
counter-programming. They work on a small scale only because, at that scale, 
there is not sufficient value in counter-programming them.

I am reminded of the chess match between Kasparov and Deep Blue. One of the 
professors at MIT who works on computer chess told me that they could have 
taught Kasparov how to outwit the machine by exploiting weaknesses in the 
computer's strategy.

In general, any naïve learning approach can be intentionally taught by an 
attacker to treat a chosen characteristic as a strong indicator of spam. Once 
the attacker can control the state of the learning system, there is no end to 
the tricks that can be played.
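
To make the poisoning concrete, here is a minimal sketch in Python. It is a 
toy multinomial naive Bayes filter with Laplace smoothing, not SpamBayes or 
any real product, and the class name, training messages and token choices are 
all invented for illustration. The attacker sends spam that also carries the 
words the victim's legitimate mail uses; once the victim dutifully trains on 
it, a previously clean message scores as spam.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy multinomial naive Bayes spam filter (hypothetical, for illustration)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message under the given label."""
        self.token_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        """Return log P(spam|text) - log P(ham|text); positive means spam."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        score = math.log(self.message_counts["spam"] / self.message_counts["ham"])
        for token in text.lower().split():
            for label, sign in (("spam", 1), ("ham", -1)):
                total = sum(self.token_counts[label].values())
                # Laplace (add-one) smoothing so unseen tokens never give log(0).
                p = (self.token_counts[label][token] + 1) / (total + len(vocab))
                score += sign * math.log(p)
        return score


filt = NaiveBayesFilter()
filt.train("meeting agenda for the dkim working group", "ham")
filt.train("please review the draft before the meeting", "ham")
filt.train("cheap pills buy now", "spam")
filt.train("buy cheap watches now", "spam")

legitimate = "dkim draft review meeting"
before = filt.spam_log_odds(legitimate)

# Counter-programming: the attacker's spam carries the victim's own vocabulary,
# and the victim (correctly) reports each copy as spam.
for _ in range(20):
    filt.train("buy cheap pills dkim draft review meeting", "spam")

after = filt.spam_log_odds(legitimate)
print(f"before poisoning: {before:+.2f}, after poisoning: {after:+.2f}")
```

The score for the same legitimate message moves from negative (ham) to 
positive (spam) without the attacker ever touching the filter directly; all 
of the damage is done through the training data.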

The common theme at the MIT conference is that anti-spam measures are tested 
against a static test corpus. What is left unmeasured is their resistance to 
counter-programming.


I believe that what opponents of the DKIM approach describe as a vulnerability 
of DKIM is in fact an intrinsic weakness of the spam filtering techniques 
described, and that the DKIM exploit is merely one example of a much wider 
class of attacks against those schemes.

This objection is not coming from large-scale anti-spam filtering operations. 
It is coming from people who run SpamAssassin on their personal mail and 
take a look at the rules their system is building.

-----Original Message-----
From: ietf-dkim-bounces(_at_)mipassoc(_dot_)org 
[mailto:ietf-dkim-bounces(_at_)mipassoc(_dot_)org] On Behalf Of J.D. Falk
Sent: Tuesday, August 22, 2006 7:41 PM
To: ietf-dkim(_at_)mipassoc(_dot_)org
Subject: Re: [ietf-dkim] Bayesian filters are the pits

On 2006-08-22 12:56, Hallam-Baker, Phillip wrote:

Third we need to promote the idea that you should not look for the 
existence or even the validity of a DKIM header as being as important 
as the domain that is claiming responsibility. If you can't correlate 
the domain to some form of additional information you should ignore 
the record entirely.


That's generally true in a simplistic spam / not spam 
decision.  If you're making a forged / not forged decision, 
the record is still useful.

This has nothing to do with naive Bayes, but everything to do 
with naive mail administrators looking for simple binary spam 
/ not spam criteria.
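
The distinction drawn in this exchange can be sketched as a small policy 
function in Python. This is only an illustration under stated assumptions: 
the function name, the Assessment type and the domain_reputation table are 
hypothetical, and the signature check itself is assumed to have been done 
elsewhere. A valid signature always answers the authentication question, 
but it feeds the spam decision only when something is known about the 
signing domain.

```python
from typing import Dict, NamedTuple, Optional


class Assessment(NamedTuple):
    authenticated: bool      # a domain verifiably took responsibility
    spam_score_delta: float  # adjustment to the spam score; 0.0 means no signal


def assess_signature(
    signature_valid: bool,
    signing_domain: Optional[str],
    domain_reputation: Dict[str, float],
) -> Assessment:
    """Combine a DKIM verification result with what is known about the domain.

    domain_reputation is a hypothetical local table; positive values mean
    the domain is trusted, negative values mean it is known to send spam.
    """
    if not signature_valid or signing_domain is None:
        # No verified responsible domain: nothing to correlate, no signal.
        return Assessment(authenticated=False, spam_score_delta=0.0)

    reputation = domain_reputation.get(signing_domain.lower())
    if reputation is None:
        # Valid signature from an unknown domain: useful for the
        # forged / not forged question, ignored for spam / not spam.
        return Assessment(authenticated=True, spam_score_delta=0.0)

    # Known domain: good reputation lowers the spam score, bad raises it.
    return Assessment(authenticated=True, spam_score_delta=-reputation)


reputation_table = {"example.com": 2.0, "example.net": -3.0}
known_good = assess_signature(True, "example.com", reputation_table)
known_bad = assess_signature(True, "example.net", reputation_table)
unknown = assess_signature(True, "example.org", reputation_table)
unsigned = assess_signature(False, None, reputation_table)
```

Note that the mere presence or validity of the signature never moves the 
spam score on its own, so there is nothing for a learning filter to be 
taught about the header itself.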

--
J.D. Falk, Anti-Spam Product Manager
Yahoo! Communications Platform Team
_______________________________________________
NOTE WELL: This list operates according to 
http://mipassoc.org/dkim/ietf-list-rules.html

