[Asrg] Collecting IP reputation data from many people

I'd like your thoughts on collecting reputation data (% spam vs. non-spam
originating at every IP) from everyone willing to submit it.

My goal is to be able to tell anyone how much spam and non-spam any
IP address has sent recently, so that they can use that information to
aid spam filtering.

I realize it wouldn't be easy to make this useful, given spammers'
interest in corrupting the data.  The example of Mizhen vs. Microsoft posted
here a few months ago was an excellent illustration, as is the availability
of captcha solving for $2 per 1000.

I realize that the vast majority of "people" submitting data could be
spammer-controlled zombies.  Most of the time they would repeat back to the
system that they've received spam and non-spam from the IPs others have
reported.  Once in a while, many of them could claim that an IP is sending
lots of non-spam and no spam, in the hope that the system would identify
that IP as a non-spammer and let spam from that IP through filters.

I hope I'll be able to identify the quality of the data coming from each
person well enough that this would not be a problem.  Maybe a large number
of zombies could cause some spams to get through for one day, and then
be identified as malicious.  I guess that's my biggest question though.
How many times can the spammers use "a large number of zombies" to get
spam through for a day before they run out?  Obviously I'd take advantage
of things like Spamhaus's XBL (list of zombies).  It's also an interesting
problem of lacking known truth - how do you know which people are
reporting good data when everything you have to compare it to might be
maliciously bad data from spammers?
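
One naive way to frame the reporter-quality question is to weight each
reporter by how often they agree with a small seed of verdicts that are
trusted for other reasons (for example, the operator's own reports).  The
Python sketch below is only an illustration under that assumption; the
function name, the (user_id, ip, verdict) tuple layout and the min_overlap
threshold are all hypothetical.  It does not solve the problem described
above: zombies that echo back correct data would earn a high weight before
they start lying.

```python
def reporter_weights(reports, trusted_verdicts, min_overlap=3):
    """Estimate a 0..1 weight per reporter from agreement with trusted verdicts.

    reports: iterable of (user_id, ip, verdict) tuples.
    trusted_verdicts: dict mapping ip -> "spam" or "non-spam" (assumed seed data).
    Reporters with fewer than min_overlap comparable reports get no weight.
    """
    agree = {}
    total = {}
    for user_id, ip, verdict in reports:
        if ip not in trusted_verdicts:
            continue  # nothing to compare this report against
        total[user_id] = total.get(user_id, 0) + 1
        if verdict == trusted_verdicts[ip]:
            agree[user_id] = agree.get(user_id, 0) + 1

    weights = {}
    for user_id, n in total.items():
        if n >= min_overlap:
            weights[user_id] = agree.get(user_id, 0) / n
    return weights
```

Reporters missing from the result simply have not been seen often enough
to judge, which is different from being judged unreliable.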

I am not interested in trying to block easily identifiable spam.  I'll
continue to run SpamAssassin and use DNSBLs at my MTA, and report to this
system only the spam that makes it into my inbox.

I have been involved with dnswl.org (a DNS-based whitelist used by
SpamAssassin).  My motivation for trying to collect data from many people
is largely the fact that DNSWL appears to cover only a third of non-spam
(according to http://rulequa.spamassassin.org ).

For implementation details, I'm thinking something like:

  User clicks the junk / not-junk button in their MUA.  The MUA sends a user
  ID (randomly generated by the MUA once) and the sender IP, along with "spam"
  or "non-spam", to my server over an encrypted TCP connection on some port
  which is not currently used.
  
  Once a day, mail filters download the full list of IPs, including the
  number of spams and non-spams each IP has sent recently.  Every time an
  email is received, the filter increases or decreases that email's score
  based on the data in that list.  I'd prefer not to deal with the network
  load of serving this data via DNS, but I haven't attempted to calculate
  the break-even point.
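
The two steps above (collect reports, then score mail from the aggregated
list) could be sketched roughly as follows in Python.  This is only an
illustration: the Report fields mirror the description above, but the
class and function names, the max_points and prior parameters, and the
smoothing formula are assumptions, not part of the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass

VERDICTS = ("spam", "non-spam")


@dataclass(frozen=True)
class Report:
    user_id: str    # random ID generated once by the MUA
    sender_ip: str  # IP the message was received from
    verdict: str    # "spam" or "non-spam"


def aggregate(reports):
    """Count spam and non-spam reports per sender IP."""
    counts = defaultdict(lambda: {"spam": 0, "non-spam": 0})
    for r in reports:
        if r.verdict in VERDICTS:  # silently ignore malformed verdicts
            counts[r.sender_ip][r.verdict] += 1
    return dict(counts)


def score_adjustment(counts, ip, max_points=2.0, prior=5.0):
    """Return a score delta in [-max_points, +max_points]; positive is spammier.

    The prior (an assumed tuning knob) pulls IPs with few reports toward a
    neutral spam fraction of 0.5, so a single report moves the score only a little.
    """
    c = counts.get(ip)
    if c is None:
        return 0.0  # unknown IP: leave the score alone
    spam, ham = c["spam"], c["non-spam"]
    frac = (spam + prior / 2) / (spam + ham + prior)
    return max_points * (2 * frac - 1)
```

A filter would call aggregate() on the server side once per publishing
interval, ship the resulting table to clients, and call score_adjustment()
per message with the sender IP.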

I think a basic implementation wouldn't be a lot of work for me: set up the
server to accept data, aggregate it, and serve it (http?  rsync?), and create
modules for SpamAssassin that report the sending IP via "spamassassin
--report" for spam (which I did for dnswl) and "spamassassin --revoke" for
non-spam.  This would take advantage of internal_networks in the SA
configuration to figure out which IP is actually the sender rather than a
relay internal to the person's email provider, and of trusted_networks to
avoid reporting trusted proxies or other external relays.
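
To make the relay-skipping idea concrete, here is a simplified Python
sketch.  It is not SpamAssassin's code and does not parse Received headers;
it assumes the relay IPs have already been extracted, ordered from the most
recent hop to the oldest, and the function name is hypothetical.

```python
import ipaddress


def first_external_ip(relay_ips, internal_networks, trusted_networks):
    """Return the first relay IP that is neither internal nor trusted.

    relay_ips: IP strings ordered from the most recent hop to the oldest.
    internal_networks / trusted_networks: CIDR strings, in the spirit of the
    SpamAssassin settings of the same names.
    Returns None if every hop is internal or trusted.
    """
    skip = [
        ipaddress.ip_network(n)
        for n in list(internal_networks) + list(trusted_networks)
    ]
    for ip_text in relay_ips:
        ip = ipaddress.ip_address(ip_text)
        if any(ip in net for net in skip):
            continue  # our own relay or a trusted proxy: keep walking back
        return ip_text  # first hop we have no reason to trust
    return None
```

For example, with internal network 10.0.0.0/8 and trusted network
192.0.2.0/24, the hop list 10.0.0.5, 192.0.2.25, 203.0.113.40 would yield
203.0.113.40 as the IP to report.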


So do you think it's worth my effort?  

How would you improve this?

Do you think it could be useful enough that you'd be willing to click a
button to send me data occasionally?

-- 
"Government is not reason, it is not eloquence, it is force; like fire,
a troublesome servant and a fearful master. Never for a moment should
it be left to irresponsible action." - George Washington
http://www.ChaosReigns.com