Turning raw data into useful stats

I have a large amount of raw data, and I would like to turn it into something useful.


My goals are:

1. Gather some statistics about how spam is currently being handled.

2. Evaluate whether using SPF would help. I would like to start using SPF to reject incoming email, but first I need to show management that we have a reasonable idea of what will happen, and that we have identified forwarding sources that should be whitelisted.

3. Provide real, useful data back to other interested parties regarding how well SPF works (or might work, if applied to our incoming mail).

Here is the scenario. My company receives about 3.5M email transactions per day. The majority of these are blocked by RBLs and other methods, and only about 7% are allowed past the first mailer (roughly 200K/day). However, I have other data suggesting that the real, non-abusive email is closer to 20K/day, so I would really like to get our current 7% number down to less than 1%. Not an easy task.

The edge mailers are not smart enough to process SPF yet. (Actually an SPF switch exists, but their implementation is known to have some problems and can't be adjusted, whitelisted, etc. This is an appliance box.) Most important, their implementation of SPF doesn't allow for logging only; the only choice is to reject.

So, I will be shipping the log data to another machine and processing it in real time, not to act on the mail, but simply to gather stats and correlate them.



My questions to you fine folks are:

1. What sort of data would be most useful? For privacy reasons I cannot release the raw data showing who is emailing whom, but whatever calculations I can perform on the raw data to get summary numbers, I want to report if I can.

Primarily, I was thinking of something like correlating the actions we take today (Reject, Quarantine, Tag, Accept) with each of the possible SPF results (Pass, Fail, Neutral, etc.) to see if obvious patterns emerge. This would result in a matrix with a percentage in each box.
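
To make the matrix idea concrete, here is a rough sketch of how I imagine computing it. It assumes each transaction has already been reduced to an (action, SPF result) pair; the function names `percentage_matrix` and `format_matrix` are made up for illustration, and each cell is a percentage of all transactions (not of its row or column).

```python
from collections import Counter


def percentage_matrix(pairs):
    """Cross-tabulate (action, spf_result) pairs.

    Returns {action: {spf_result: percent_of_all_transactions}}.
    Rows and columns are taken from whatever values appear in the data.
    """
    counts = Counter(pairs)
    total = sum(counts.values())
    actions = sorted({action for action, _ in counts})
    results = sorted({result for _, result in counts})
    return {
        action: {r: 100.0 * counts[(action, r)] / total for r in results}
        for action in actions
    }


def format_matrix(matrix):
    """Render the matrix as a plain-text table, one row per action."""
    if not matrix:
        return ""
    results = list(next(iter(matrix.values())))
    header = "{:<12}".format("") + "".join("{:>10}".format(r) for r in results)
    lines = [header]
    for action, row in matrix.items():
        cells = "".join("{:>9.1f}%".format(row[r]) for r in results)
        lines.append("{:<12}".format(action) + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    sample = [("Reject", "Fail")] * 6 + [("Accept", "Pass")] * 3 + [("Tag", "Neutral")]
    print(format_matrix(percentage_matrix(sample)))
```

Dividing by the row total instead of the grand total would answer a slightly different question ("of the mail we reject, how much fails SPF?"), so it may be worth reporting both.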


Other ideas for analysis include:
 Change in matrix if "best guess" is applied (moving from Neutral to Pass)
 Top IPs that give off "forged" mail but we almost always accept (probably forwarders)
 Top 20 domains frequently abused
 (anything else?)
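
The two "top N" ideas above could be sketched roughly as follows. This assumes each transaction is already available as a tuple containing the SPF result and the action taken; the function names, the "Fail"/"Accept" labels, and the 90% acceptance threshold are all placeholders I picked for illustration, not anything measured.

```python
from collections import Counter


def top_forwarder_candidates(records, n=20, min_accept_ratio=0.9):
    """records: iterable of (ip, spf_result, action).

    Returns up to n tuples (ip, fail_count, accept_ratio) for IPs whose
    SPF-failing mail we accept at least min_accept_ratio of the time,
    ordered by fail_count, highest first.
    """
    fails = Counter()
    accepted = Counter()
    for ip, spf_result, action in records:
        if spf_result == "Fail":
            fails[ip] += 1
            if action == "Accept":
                accepted[ip] += 1
    candidates = []
    for ip, fail_count in fails.items():
        ratio = accepted[ip] / fail_count
        if ratio >= min_accept_ratio:
            candidates.append((ip, fail_count, ratio))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:n]


def top_abused_domains(records, n=20):
    """records: iterable of (sender_domain, spf_result).

    Returns the n sender domains with the most SPF Fail results.
    """
    fails = Counter(
        domain.lower() for domain, spf_result in records if spf_result == "Fail"
    )
    return fails.most_common(n)
```

An IP that shows up in the first list is only a candidate forwarder; someone would still need to eyeball it before adding it to a whitelist.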

2. Is there already a script, executable, or other test harness that I can feed a large amount of data into and have it do the SPF lookup and report the result? Any tips for doing this in bulk? I will probably set up a dedicated instance of dnscache on the same machine for running the lookups. (For testing the script, I will use a static file as input, but for gathering real live data I will always use the live feed and crunch numbers in real time.)
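
In case it helps frame the question, this is roughly the shape of harness I have in mind. It is only a sketch: `check` is a placeholder for whatever SPF library ends up doing the real lookup (any callable taking an IP and a MAIL FROM address and returning a result string), and caching by (IP, sender domain) is my own assumption, which would give wrong answers for SPF records that use local-part macros.

```python
from collections import Counter


def bulk_spf(records, check):
    """Run an SPF check over many (ip, mail_from) pairs.

    records: iterable of (ip, mail_from) tuples.
    check:   callable(ip, mail_from) -> result string such as "pass".
             This is a stand-in for a real SPF library call.

    Returns (totals, unique_lookups), where totals counts transactions
    per result and unique_lookups is how many times check() was called.
    """
    cache = {}
    totals = Counter()
    for ip, mail_from in records:
        # Everything after the last "@" is treated as the sender domain.
        domain = mail_from.rpartition("@")[2].lower()
        key = (ip, domain)
        if key not in cache:
            cache[key] = check(ip, mail_from)
        totals[cache[key]] += 1
    return totals, len(cache)


if __name__ == "__main__":
    # Stub checker so the sketch runs without DNS; not a real SPF evaluation.
    def stub_check(ip, mail_from):
        return "pass" if ip.startswith("192.0.2.") else "fail"

    sample = [
        ("192.0.2.1", "a@example.com"),
        ("192.0.2.1", "b@example.com"),
        ("198.51.100.7", "c@example.net"),
    ]
    print(bulk_spf(sample, stub_check))
```

For a live feed the cache would need an expiry or size limit, since DNS records change and the dictionary above grows without bound.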


3. Would this data be useful to people?

Thanks
gregc

P.S. Here is a look at what the raw data currently looks like. (This is after applying a script I wrote that summarizes many lines with the same transaction ID.) The important bits for SPF would be IP, From_email (yes, this is MAIL FROM) and the result (action already taken).

lastline=21788, 1119933771-32399-118, lines=3, ip=222.140.76.136, errorcount=1, from_email=nofrom(_at_)email(1119933773), to=, eventlist=from_email,error, result=Sender address rejected: need fully-qualified address, badrecip=0

lastline=21791, 1119933758-23497-209, lines=6, ip=221.13.128.77, errorcount=2, from_email=adiowsx(_at_)msn(_dot_)com(1119933763), to=tfarrer(_at_)hongkong(_dot_)mydomain,tannude513(_at_)hongkong(_dot_)mydomain, eventlist=from_email,to,error,to,error, result=Recipient address rejected: Blocked, badrecip=2

lastline=21826, 1119933747-32478-150, lines=6, ip=61.52.204.152, errorcount=2, from_email=DCUHOVXMG(_at_)yahoo(_dot_)com(1119933749), to=arbrn(_at_)holodeck(_dot_)engr(_dot_)mydomain,arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain, eventlist=from_email,to,error,to,error, result=Recipient address rejected: 5.1.1 <arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain>... User unknown, badrecip=2

lastline=21870, 1119933735-31148-118, lines=10, ip=220.112.126.109, errorcount=0, from_email=horyqtlnlkqgx(_at_)alafarmnews(_dot_)com(1119933738), to=bracamonte(_at_)albuquerque(_dot_)mydomain,eby(_at_)albuquerque(_dot_)mydomain,como(_at_)albuquerque.mydomain,steelman(_at_)albuquerque(_dot_)mydomain,montijo(_at_)albuquerque(_dot_)mydomain, eventlist=from_email,to,to,to,to,to,subject,spam_score,tag, result=Tagged, badrecip=0
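
For anyone who wants to experiment with lines like the ones above, here is a minimal sketch of pulling out the three fields that matter for SPF. It assumes the field names stay as shown (`ip=`, `from_email=` followed by a parenthesized number, `result=` running up to `badrecip=`); `parse_summary` and `deobfuscate` are names I made up, and the `(_at_)`/`(_dot_)` handling is only there to cope with masked addresses.

```python
import re

# Patterns are based only on the sample lines; real logs may differ.
IP_RE = re.compile(r"\bip=(\d{1,3}(?:\.\d{1,3}){3})")
FROM_RE = re.compile(r"\bfrom_email=(.+?)\(\d+\),")
RESULT_RE = re.compile(r"\bresult=(.*?),\s*badrecip=\d+\s*$")


def deobfuscate(addr):
    """Undo the (_at_) / (_dot_) masking seen in the sample lines."""
    return addr.replace("(_at_)", "@").replace("(_dot_)", ".")


def parse_summary(line):
    """Extract ip, mail_from and result from one summary line.

    Returns a dict, or None if any of the three fields is missing.
    """
    ip = IP_RE.search(line)
    sender = FROM_RE.search(line)
    result = RESULT_RE.search(line)
    if not (ip and sender and result):
        return None
    return {
        "ip": ip.group(1),
        "mail_from": deobfuscate(sender.group(1)),
        "result": result.group(1).strip(),
    }
```

The regular expressions search for each field independently rather than splitting on commas, because the recipient and event lists contain commas of their own.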



--
Greg Connor <gconnor(_at_)nekodojo(_dot_)org>