Turning raw data into useful stats
2005-06-27 22:15:29
I am in possession of a large amount of raw data, and I would like to turn
it into something useful.
My goals are:
1. Gather some statistics about how spam is currently being handled.
2. Evaluate whether using SPF would help. I would like to start using SPF
to reject incoming email, but first I need to show management that we have
a reasonable idea of what will happen, and that we have identified the
forwarding sources that should be whitelisted.
3. Provide real, useful data back to other interested parties regarding how
well SPF works (or might work, if applied to our incoming mail).
Here is the scenario. My company receives about 3.5M email transactions
per day. The majority of these are blocked by RBLs and other methods, and only
about 7% are allowed past the first mailer (roughly 200K/day). But I have
other data that suggests the real, non-abusive email is closer to 20K/day,
so I would really like to get our current 7% number down to less than 1%.
Not an easy task.
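As a rough sanity check on the volumes above, the arithmetic can be sketched
in a few lines of Python. The variable names are illustrative only, and the
inputs are the rounded figures quoted above, so the outputs are approximate.

```python
# Rounded daily figures quoted above; variable names are illustrative.
total_per_day = 3_500_000      # all email transactions
passed_percent = 7             # share allowed past the first mailer
legit_per_day = 20_000         # estimated real, non-abusive mail

passed_per_day = total_per_day * passed_percent // 100
legit_share = 100 * legit_per_day / total_per_day
one_percent_budget = total_per_day // 100

print(f"passed today:      {passed_per_day:,}/day")
print(f"legitimate share:  {legit_share:.2f}% of all transactions")
print(f"1% target allows:  {one_percent_budget:,}/day")
```

Read this way, 7% of 3.5M is about 245K/day (the same order as the rough
200K figure), the legitimate mail is a little under 0.6% of all
transactions, and a 1% target corresponds to about 35K/day.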
The edge mailers are not smart enough to process SPF yet. (Actually, an SPF
switch exists, but their implementation is known to have some problems and
can't be adjusted, whitelisted, etc. This is an appliance box.) Most
importantly, their implementation of SPF doesn't allow for logging only; the
only choice is to reject.
So, I will be shipping the log data to another machine and processing it in
real time, not to act on the mail, but simply to gather stats and correlate
them.
My questions to you fine folks are:
1. What sort of data would be most useful? For privacy reasons I cannot
release the raw data showing who is emailing whom, but whatever
calculations I can perform on the raw data to get summary numbers, I want
to report if I can.
Primarily, I was thinking of something like correlating the actions we take
today (Reject, Quarantine, Tag, Accept) with each of the possible SPF
results (Pass, Fail, Neutral, etc.) to see if obvious patterns emerge. This
would result in a matrix with a percentage in each box.
Other ideas for analysis include:
- Change in the matrix if "best guess" is applied (moving from Neutral to Pass)
- Top IPs that send "forged" mail that we almost always accept (probably
  forwarders)
- Top 20 most frequently abused domains
- (anything else?)
2. Is there already a script, executable, or other test harness that I can
feed a large amount of data into and have it do the SPF lookups and report
the results? Any tips for doing this in bulk? I will probably set up a
dedicated instance of dnscache on the same machine for running the lookups.
(For testing the script, I will use a static file as input, but for
gathering real live data I will always use the live feed and crunch numbers
in real time.)
3. Would this data be useful to people?
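To make the matrix idea in question 1 more concrete, here is a minimal
Python sketch of how the tally might work. Everything in it is an
assumption for illustration: the function names, the record shape
(ip, sender domain, action), the lowercase result strings, and above all
`fake_spf_lookup`, which is only a stand-in for whatever real SPF checker
ends up being used. Repeated (ip, domain) pairs are cached so each pair
costs one lookup, which may help with bulk processing.

```python
from collections import Counter
from functools import lru_cache


def score(records, spf_lookup, cache_size=100_000):
    """Attach an SPF result to each (ip, sender_domain, action) record.

    spf_lookup(ip, sender_domain) is a caller-supplied function (hypothetical
    here); results are cached so repeated pairs cost one lookup.
    """
    cached = lru_cache(maxsize=cache_size)(spf_lookup)
    return [(ip, dom, action, cached(ip, dom)) for ip, dom, action in records]


def build_matrix(scored):
    """Count records per (action, spf_result) cell."""
    return Counter((action, result) for _ip, _dom, action, result in scored)


def as_percentages(counts):
    """Convert cell counts to percentages of all records."""
    total = sum(counts.values())
    if not total:
        return {}
    return {cell: 100.0 * n / total for cell, n in counts.items()}


def top_forwarder_candidates(scored, n=20):
    """IPs whose mail fails SPF but was accepted anyway, most frequent first."""
    hits = Counter(ip for ip, _dom, action, result in scored
                   if action == "Accept" and result == "fail")
    return hits.most_common(n)


# Made-up sample data using documentation address ranges.
SAMPLE = [
    ("192.0.2.1", "example.com", "Accept"),
    ("192.0.2.1", "example.com", "Accept"),
    ("198.51.100.7", "example.net", "Reject"),
    ("192.0.2.9", "example.org", "Tag"),
]
FAKE_RESULTS = {("192.0.2.1", "example.com"): "fail",
                ("198.51.100.7", "example.net"): "fail"}


def fake_spf_lookup(ip, domain):
    """Stand-in for a real SPF check; unknown pairs report 'none'."""
    return FAKE_RESULTS.get((ip, domain), "none")


scored = score(SAMPLE, fake_spf_lookup)
for cell, pct in sorted(as_percentages(build_matrix(scored)).items()):
    print(cell, f"{pct:.1f}%")
print(top_forwarder_candidates(scored))
```

This version collects everything into a list before counting, which suits
a static test file; for a live feed the same counters could presumably be
updated one record at a time instead.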
Thanks
gregc
P.S. Here is a look at what the raw data currently looks like. (This is
after applying a script I wrote that summarizes many lines with the same
transaction ID.) The important bits for SPF would be ip, from_email (yes,
this is MAIL FROM) and result (the action already taken).
lastline=21788, 1119933771-32399-118, lines=3, ip=222.140.76.136,
errorcount=1, from_email=nofrom(_at_)email(1119933773), to=,
eventlist=from_email,error, result=Sender address rejected: need
fully-qualified address, badrecip=0
lastline=21791, 1119933758-23497-209, lines=6, ip=221.13.128.77,
errorcount=2, from_email=adiowsx(_at_)msn(_dot_)com(1119933763),
to=tfarrer(_at_)hongkong(_dot_)mydomain,tannude513(_at_)hongkong(_dot_)mydomain,
eventlist=from_email,to,error,to,error, result=Recipient address rejected:
Blocked, badrecip=2
lastline=21826, 1119933747-32478-150, lines=6, ip=61.52.204.152,
errorcount=2, from_email=DCUHOVXMG(_at_)yahoo(_dot_)com(1119933749),
to=arbrn(_at_)holodeck(_dot_)engr(_dot_)mydomain,arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain,
eventlist=from_email,to,error,to,error, result=Recipient address rejected:
5.1.1 <arbrn96(_at_)holodeck(_dot_)engr(_dot_)mydomain>... User unknown, badrecip=2
lastline=21870, 1119933735-31148-118, lines=10, ip=220.112.126.109,
errorcount=0, from_email=horyqtlnlkqgx(_at_)alafarmnews(_dot_)com(1119933738),
to=bracamonte(_at_)albuquerque(_dot_)mydomain,eby(_at_)albuquerque(_dot_)mydomain,
como(_at_)albuquerque.mydomain,steelman(_at_)albuquerque(_dot_)mydomain,montijo(_at_)albuquerque(_dot_)mydomain,
eventlist=from_email,to,to,to,to,to,subject,spam_score,tag, result=Tagged,
badrecip=0
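For anyone who wants to experiment with this format, here is a rough
Python sketch of pulling the fields out of one summarized record. It
rests on several assumptions that are not guaranteed by the samples: that
each record is a single line in the real feed (the wrapping above comes
from email), that fields are separated by a comma plus whitespace while
commas inside a value have no whitespace, that the trailing "(digits)" on
from_email is a timestamp, and that the (_at_) and (_dot_) markers are
list-archive obfuscation. The names `parse_record`, `sender_domain`, and
`txn_id` are made up for this example.

```python
import re

# Split on ", " only where a "key=" field or the bare transaction ID follows.
FIELD_SPLIT = re.compile(r",\s+(?=[a-z_]+=|\d+-\d+-\d+(?:,|$))")
# Assumed timestamp suffix on from_email, e.g. "(1119933763)".
TRAILING_STAMP = re.compile(r"\(\d+\)$")


def parse_record(line):
    """Turn one summarized log line into a dict of its fields."""
    record = {}
    for part in FIELD_SPLIT.split(line.strip()):
        key, sep, value = part.partition("=")
        if sep:
            record[key] = value
        else:
            record["txn_id"] = part  # the bare transaction ID token
    if "from_email" in record:
        record["from_email"] = TRAILING_STAMP.sub("", record["from_email"])
    return record


def sender_domain(record):
    """Domain part of MAIL FROM, undoing the assumed archive obfuscation."""
    addr = record.get("from_email", "")
    addr = addr.replace("(_at_)", "@").replace("(_dot_)", ".")
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


# Second sample record from above, joined back onto one line.
LINE = (
    "lastline=21791, 1119933758-23497-209, lines=6, ip=221.13.128.77, "
    "errorcount=2, from_email=adiowsx(_at_)msn(_dot_)com(1119933763), "
    "to=tfarrer(_at_)hongkong(_dot_)mydomain,"
    "tannude513(_at_)hongkong(_dot_)mydomain, "
    "eventlist=from_email,to,error,to,error, "
    "result=Recipient address rejected: Blocked, badrecip=2"
)

rec = parse_record(LINE)
print(rec["ip"], sender_domain(rec), rec["result"])
```

A result string that itself contained ", key=" would confuse this split,
so it should be treated as a starting point rather than a robust parser.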
--
Greg Connor <gconnor(_at_)nekodojo(_dot_)org>