spf-discuss

Turning raw data into useful stats

2005-06-27 22:15:29
I am in possession of a large amount of raw data, and I would like to turn it into something useful.

My goals are:

1. Gather some statistics about how spam is currently being handled.
2. Evaluate whether using SPF would help. I would like to start using SPF to reject incoming email, but first I need to show management that we have a reasonable idea of what will happen, and we have identified forwarding sources that should be whitelisted.
3. Provide real, useful data back to other interested parties regarding how well SPF works (or might work, if applied to our incoming mail).

Here is the scenario. My company receives about 3.5M email transactions per day. The majority of these are blocked by RBLs and other methods, and only about 7% (roughly 200K/day) are allowed past the first mailer. However, I have other data suggesting that the real, non-abusive email is closer to 20K/day, so I would really like to get our current 7% number down to less than 1%. Not an easy task.

The edge mailers are not smart enough to process SPF yet. (Actually, an SPF switch exists, but the implementation is known to have some problems and cannot be adjusted, whitelisted, etc. This is an appliance box.) Most importantly, their implementation of SPF does not allow for logging only; the only choice is to reject.

So, I will be shipping the log data to another machine and processing it in real time, not to act on the mail, but simply to gather stats and correlate them.


My questions to you fine folks are:

1. What sort of data would be most useful? For privacy reasons I cannot release the raw data showing who is emailing whom, but whatever calculations I can perform on the raw data to get summary numbers, I want to report if I can.

Primarily, I was thinking of something like correlating the actions we take today (Reject, Quarantine, Tag, Accept) with each of the possible SPF results (Pass, Fail, Neutral, etc.) to see if obvious patterns emerge. This would result in a matrix with a percentage in each box.
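
A rough sketch of how such a matrix could be computed, assuming each transaction has already been reduced to an (action, SPF result) pair. The function names and the exact list of SPF result labels are illustrative assumptions, not something taken from the appliance or from any SPF tool.

```python
from collections import Counter

# Actions are the four named above; the SPF result labels are an assumed set.
ACTIONS = ["Reject", "Quarantine", "Tag", "Accept"]
SPF_RESULTS = ["Pass", "Fail", "SoftFail", "Neutral", "None", "TempError", "PermError"]


def build_matrix(records):
    """Return {(action, spf_result): percent of all records} for every cell.

    records is an iterable of (action, spf_result) pairs.
    """
    counts = Counter(records)
    total = sum(counts.values())
    return {
        (action, result): (100.0 * counts[(action, result)] / total if total else 0.0)
        for action in ACTIONS
        for result in SPF_RESULTS
    }


def format_matrix(matrix):
    """Render the matrix as plain text, one row per action."""
    lines = [f"{'':<12}" + "".join(f"{r:>11}" for r in SPF_RESULTS)]
    for action in ACTIONS:
        cells = "".join(f"{matrix[(action, r)]:>10.1f}%" for r in SPF_RESULTS)
        lines.append(f"{action:<12}" + cells)
    return "\n".join(lines)
```

Each cell is a share of all mail seen, so a large value in the Accept/Fail box would be the first place to look for forwarders.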

Other ideas for analysis include:
 Change in matrix if "best guess" is applied (moving from Neutral to Pass)
 Top IPs that give off "forged" mail but we almost always accept (probably forwarders)
 Top 20 domains frequently abused
 (anything else?)
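
Two of the ideas above (probable forwarders and frequently abused domains) could be sketched roughly as follows. The thresholds, the function names, and the choice to treat an SPF "Fail" as "forged" are all assumptions made for illustration.

```python
from collections import Counter


def likely_forwarders(records, min_fails=3, min_accept_ratio=0.9, top_n=20):
    """Find IPs whose mail fails SPF but is almost always accepted anyway.

    records is an iterable of (ip, action, spf_result) tuples.
    Returns a list of (ip, fail_count, accepted_ratio), busiest first.
    """
    fails = Counter()
    accepted = Counter()
    for ip, action, spf_result in records:
        if spf_result == "Fail":
            fails[ip] += 1
            if action == "Accept":
                accepted[ip] += 1
    candidates = [
        (ip, n, accepted[ip] / n)
        for ip, n in fails.items()
        if n >= min_fails and accepted[ip] / n >= min_accept_ratio
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:top_n]


def top_abused_domains(records, top_n=20):
    """Count SPF failures per MAIL FROM domain.

    records is an iterable of (domain, spf_result) pairs.
    """
    failures = Counter(domain for domain, spf_result in records if spf_result == "Fail")
    return failures.most_common(top_n)
```

The output of likely_forwarders would only be a list of whitelist candidates to review by hand, not a whitelist in itself.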

2. Is there already a script, executable, or other test harness that I can feed a large amount of data into and have it do the SPF lookup and report the result? Any tips for doing this in bulk? I will probably set up a dedicated instance of dnscache on the same machine for running the lookups. (For testing the script, I will use a static file as input, but for gathering real live data I will always use the live feed and crunch numbers in real time.)
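
For illustration, a minimal bulk harness might look something like the sketch below. It assumes input lines of the form "ip sender", and it takes the SPF checker as a plain callable; that callable is a placeholder for whatever SPF library or tool ends up being used, and none is assumed here. Both function names are hypothetical.

```python
from collections import Counter


def read_pairs(stream):
    """Yield (ip, mail_from) from lines like '192.0.2.1 user@example.com'.

    Works on a static file or a live feed; lines that do not have
    exactly two fields are skipped.
    """
    for line in stream:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], parts[1]


def run_bulk(pairs, check, cache=None):
    """Tally SPF results for many (ip, mail_from) pairs.

    check is a callable (ip, mail_from) -> result string.
    Results are cached per (ip, domain) so repeat offenders cost one lookup.
    This is a simplification: a record using local-part macros could in
    principle give different answers for different senders in one domain.
    """
    cache = {} if cache is None else cache
    tally = Counter()
    for ip, sender in pairs:
        domain = sender.rpartition("@")[2].lower()
        key = (ip, domain)
        if key not in cache:
            cache[key] = check(ip, sender)
        tally[cache[key]] += 1
    return tally
```

Passing the same cache dict across calls keeps results between batches, which complements rather than replaces a local DNS cache.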

3. Would this data be useful to people?

Thanks
gregc

P.S. Here is a look at what the raw data currently looks like. (This is after applying a script I wrote that summarizes many lines with the same transaction ID.) The important bits for SPF would be IP, From_email (yes, this is MAIL FROM) and the result (action already taken).

lastline=21788, 1119933771-32399-118, lines=3, ip=222.140.76.136, errorcount=1, from_email=nofrom(_at_)email(1119933773), to=, eventlist=from_email,error, result=Sender address rejected: need fully-qualified address, badrecip=0

lastline=21791, 1119933758-23497-209, lines=6, ip=221.13.128.77, errorcount=2, from_email=adiowsx(_at_)msn(_dot_)com(1119933763), to=tfarrer(_at_)hongkong(_dot_)mydomain,tannude513(_at_)hongkong(_dot_)mydomain, eventlist=from_email,to,error,to,error, result=Recipient address rejected: Blocked, badrecip=2

lastline=21826, 1119933747-32478-150, lines=6, ip=61.52.204.152, errorcount=2, from_email=DCUHOVXMG(_at_)yahoo(_dot_)com(1119933749), to=arbrn(_at_)holodeck(_dot_)engr(_dot_)mydomain,arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain, eventlist=from_email,to,error,to,error, result=Recipient address rejected: 5.1.1 <arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain>... User unknown, badrecip=2

lastline=21870, 1119933735-31148-118, lines=10, ip=220.112.126.109, errorcount=0, from_email=horyqtlnlkqgx(_at_)alafarmnews(_dot_)com(1119933738), to=bracamonte(_at_)albuquerque(_dot_)mydomain,eby(_at_)albuquerque(_dot_)mydomain,como(_at_)albuquerque(_dot_)mydomain,steelman(_at_)albuquerque(_dot_)mydomain,montijo(_at_)albuquerque(_dot_)mydomain, eventlist=from_email,to,to,to,to,to,subject,spam_score,tag, result=Tagged, badrecip=0
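
As a sketch of how summary lines like these could be picked apart before any SPF lookup, assuming fields are separated by ", " and that the second field is the bare transaction ID. The "txn_id" key and both function names are made up for illustration. It is also an assumption that the (_at_)/(_dot_) masking and the trailing number in parentheses are not part of the address itself.

```python
import re


def deobfuscate(addr):
    """Undo (_at_)/(_dot_) masking and drop a trailing '(digits)' suffix."""
    addr = addr.replace("(_at_)", "@").replace("(_dot_)", ".")
    return re.sub(r"\(\d+\)$", "", addr)


def parse_summary(line):
    """Split one summary line into a dict of its key=value fields.

    The one field without a key (the transaction ID, second on the line)
    is stored under the assumed name 'txn_id'. Any other fragment without
    a key is treated as a continuation of the previous value.
    """
    fields = {}
    last_key = None
    for index, part in enumerate(line.strip().split(", ")):
        key, sep, value = part.partition("=")
        if sep and re.fullmatch(r"[a-z_]+", key):
            fields[key] = value
            last_key = key
        elif index == 1:
            fields["txn_id"] = part
        elif last_key is not None:
            fields[last_key] += ", " + part
    return fields
```

For SPF purposes only the ip, from_email and result fields of the returned dict would be needed.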


--
Greg Connor <gconnor(_at_)nekodojo(_dot_)org>

