spf-discuss

Re: Turning raw data into useful stats

2005-06-28 23:11:59
Here is my first pass at generating some semi-useful data. I wouldn't call it "useful" yet, though a whole day's worth of data might actually show some trends. For now, "interesting" is probably the better word.

If the data gets used for anything, it will probably be best used for finding out what areas need a lot more data :)


Snapshot of 10,000 transactions received Jun 28 between 21:10:07 and 21:36:47

Format is Action__SPFresult, where Action is what we actually did with the message, and SPFresult is what SPF would have told us about the message if we had consulted it.

Note, however, that a huge amount of our incoming mail is bounces, and I don't have the HELO name in the log data, so I couldn't test SPF at all for those; they just say (null sender).
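
As a minimal sketch of how a listing in this format could be produced (this is not the script used for the numbers below, which was written against Mail::SPF::Query), the Python below tallies Action__SPFresult keys and prints them in the same layout. The names `make_key`, `format_tally`, and the sample records are hypothetical. The percentages are truncated with integer division, which is an assumption based on how the listing appears to behave (for example, 394 of 10,000 is shown as 3%).

```python
from collections import Counter

NULL_SENDER = "(null sender)"

def make_key(action, sender, spf_result):
    """Build an Action__SPFresult key; empty or missing senders get the null label."""
    label = spf_result if sender else NULL_SENDER
    return f"{action}__{label}"

def format_tally(keys):
    """Return listing lines of the form 'count (pct%)      key', sorted by key."""
    counts = Counter(keys)
    total = sum(counts.values())
    lines = []
    for key in sorted(counts):
        n = counts[key]
        pct = f"({n * 100 // total}%)"  # integer division truncates the percentage
        lines.append(f"{n:>11} {pct:>5}      {key}")
    return lines

# Hypothetical records: (action, envelope sender, SPF result)
records = [
    ("rbl", "", None),
    ("rbl", "", None),
    ("rbl", "", None),
    ("delivered", "user@example.org", "pass"),
]
for line in format_tally(make_key(*r) for r in records):
    print(line)
```

Splitting a key back apart is safe with `key.split("__", 1)`, since action names such as blocked_spam only contain single underscores.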

"badrecip" means something was wrong with the recipient: an unknown username, an unknown domain name, a null or mangled address, or an address that is blocked or for internal use only. Note that 15% is the largest number appearing on this list other than RBLs, so this is a fairly sizeable chunk. Even 1% of our incoming mail means 30,000 per day.
       1550 (15%)      badrecip__(null sender)
          9  (0%)      badrecip__error
          8  (0%)      badrecip__fail
        307  (3%)      badrecip__neutral
          2  (0%)      badrecip__pass
        172  (1%)      badrecip__softfail
          2  (0%)      badrecip__unknown

"badsender" means the MAIL FROM address is mangled/illegal or a nonexistent domain or something.
         40  (0%)      badsender__(null sender)
         17  (0%)      badsender__neutral

"blocked_spam" and "blocked_virus" mean the message was rejected due to content (usually "known spammer URL in the body").
         88  (0%)      blocked_spam__(null sender)
          1  (0%)      blocked_spam__fail
         19  (0%)      blocked_spam__neutral
         10  (0%)      blocked_spam__softfail
          2  (0%)      blocked_virus__(null sender)
          1  (0%)      blocked_virus__neutral

"delivered" means the message made it through in the clear, though it's still got a good chance of being spam at this point, sadly.
        260  (2%)      delivered__(null sender)
          3  (0%)      delivered__error
          3  (0%)      delivered__fail
         58  (0%)      delivered__neutral
          5  (0%)      delivered__pass
          7  (0%)      delivered__softfail

"miscerror" covers things like protocol misuse, plus a handful of errors too rare to get their own category (message rejected).
        110  (1%)      miscerror__(null sender)
          1  (0%)      miscerror__error
          2  (0%)      miscerror__fail
         19  (0%)      miscerror__neutral
          1  (0%)      miscerror__pass
          4  (0%)      miscerror__softfail

Wow, we sure don't quarantine stuff very much: 0.01%... hmm.
          1  (0%)      quarantined__(null sender)

"ratelimit" means too many connections, causing us to reject before even checking the RBL (most IPs subject to ratelimit are also on an RBL anyway).
        118  (1%)      ratelimit__(null sender)

RBL is the lion's share right now. I would probably get more detailed numbers for everything else if I just skipped this one, but I wanted to make sure I was comparing apples to apples. RBL means we cut off the connection before HELO.
       6722 (67%)      rbl__(null sender)

misc blah
          1  (0%)      syntaxerr__(null sender)

"tagged" means that the message was allowed through to someone's mailbox (or to be bounced somewhere down the line) but the Subject is altered to say it's probably spam.
         36  (0%)      tagged__(null sender)
          1  (0%)      tagged__error
          1  (0%)      tagged__fail
         16  (0%)      tagged__neutral

"temperror" is a 400-series error due to timed out DNS lookups, couldn't verify recipient exists, etc.
        394  (3%)      temperror__(null sender)
          4  (0%)      temperror__error
          3  (0%)      temperror__neutral
          2  (0%)      temperror__softfail


This was all run with trusted-forwarder.org being checked, AND with "best guess", though I haven't figured out how to tell, using Mail::SPF::Query, whether the "pass" results were due to a guess or not.



Same data sliced a different way... This time arranged by SPF result

About 93% of my transactions (9,322 of 10,000) have a null sender: either MAIL FROM: <> or the transaction doesn't get to the MAIL FROM stage at all. (Again, I wish I had HELO names, but I don't have access to them.)
       1550 (15%)      badrecip__(null sender)
         40  (0%)      badsender__(null sender)
         88  (0%)      blocked_spam__(null sender)
          2  (0%)      blocked_virus__(null sender)
        260  (2%)      delivered__(null sender)
        110  (1%)      miscerror__(null sender)
          1  (0%)      quarantined__(null sender)
        118  (1%)      ratelimit__(null sender)
       6722 (67%)      rbl__(null sender)
          1  (0%)      syntaxerr__(null sender)
         36  (0%)      tagged__(null sender)
        394  (3%)      temperror__(null sender)

SPF processing returned "error"
          9  (0%)      badrecip__error
          3  (0%)      delivered__error
          1  (0%)      miscerror__error
          1  (0%)      tagged__error
          4  (0%)      temperror__error

SPF processing returned "fail" - precious few of these: 0.15%
          8  (0%)      badrecip__fail
          1  (0%)      blocked_spam__fail
          3  (0%)      delivered__fail
          2  (0%)      miscerror__fail
          1  (0%)      tagged__fail

SPF returned "neutral" - 4.4%
        307  (3%)      badrecip__neutral
         17  (0%)      badsender__neutral
         19  (0%)      blocked_spam__neutral
          1  (0%)      blocked_virus__neutral
         58  (0%)      delivered__neutral
         19  (0%)      miscerror__neutral
         16  (0%)      tagged__neutral
          3  (0%)      temperror__neutral

SPF returned "pass" - 0.08%
          2  (0%)      badrecip__pass
          5  (0%)      delivered__pass
          1  (0%)      miscerror__pass

SPF returned "softfail" - this is kinda cool actually: 1.95%
        172  (1%)      badrecip__softfail
         10  (0%)      blocked_spam__softfail
          7  (0%)      delivered__softfail
          4  (0%)      miscerror__softfail
          2  (0%)      temperror__softfail

SPF returned "unknown" - this may mean an unknown mechanism
          2  (0%)      badrecip__unknown
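
The per-result percentages quoted above can be re-derived from the listing itself. The following is a rough sketch in Python, not the original script: it parses lines in the "count (pct%) Action__SPFresult" layout and sums the counts by SPF result. The function name `totals_by_spf_result` and the regular expression are illustrative choices; the data is copied from the listing above.

```python
import re
from collections import defaultdict

LISTING = """
1550 (15%) badrecip__(null sender)
9 (0%) badrecip__error
8 (0%) badrecip__fail
307 (3%) badrecip__neutral
2 (0%) badrecip__pass
172 (1%) badrecip__softfail
2 (0%) badrecip__unknown
40 (0%) badsender__(null sender)
17 (0%) badsender__neutral
88 (0%) blocked_spam__(null sender)
1 (0%) blocked_spam__fail
19 (0%) blocked_spam__neutral
10 (0%) blocked_spam__softfail
2 (0%) blocked_virus__(null sender)
1 (0%) blocked_virus__neutral
260 (2%) delivered__(null sender)
3 (0%) delivered__error
3 (0%) delivered__fail
58 (0%) delivered__neutral
5 (0%) delivered__pass
7 (0%) delivered__softfail
110 (1%) miscerror__(null sender)
1 (0%) miscerror__error
2 (0%) miscerror__fail
19 (0%) miscerror__neutral
1 (0%) miscerror__pass
4 (0%) miscerror__softfail
1 (0%) quarantined__(null sender)
118 (1%) ratelimit__(null sender)
6722 (67%) rbl__(null sender)
1 (0%) syntaxerr__(null sender)
36 (0%) tagged__(null sender)
1 (0%) tagged__error
1 (0%) tagged__fail
16 (0%) tagged__neutral
394 (3%) temperror__(null sender)
4 (0%) temperror__error
3 (0%) temperror__neutral
2 (0%) temperror__softfail
"""

LINE_RE = re.compile(r"\s*(\d+)\s+\(\d+%\)\s+(.+)")

def totals_by_spf_result(listing):
    """Sum transaction counts per SPF result (the part after the double underscore)."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and commentary
        _action, result = match.group(2).strip().split("__", 1)
        totals[result] += int(match.group(1))
    return dict(totals)

totals = totals_by_spf_result(LISTING)
grand_total = sum(totals.values())
for result in ("fail", "neutral", "pass", "softfail"):
    print(f"{result}: {totals[result]} ({totals[result] * 100 / grand_total:.2f}%)")
# fail: 15 (0.15%)
# neutral: 440 (4.40%)
# pass: 8 (0.08%)
# softfail: 195 (1.95%)
```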


The bright side for me is that my script (using Mail::SPF::Query) takes 19 min to process 26 min of real-time data, so hopefully I will be able to keep it running constantly and have a readout in real time of the whole 3 million, not just 10,000.
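
As a back-of-the-envelope check on the 19-versus-26-minute figure, the Python sketch below computes the ratio of processing time to the span of log time covered. The function name `realtime_factor` is hypothetical, and the calculation assumes processing time scales linearly with the amount of log data, which DNS lookup latency may not honor in practice.

```python
def realtime_factor(processing_minutes, window_minutes):
    """Processing time per minute of log data; below 1.0 means the script keeps up."""
    return processing_minutes / window_minutes

factor = realtime_factor(19, 26)
keeps_up = factor < 1.0
headroom = 1.0 - factor  # fraction of wall-clock time left idle

print(f"factor={factor:.2f} keeps_up={keeps_up} headroom={headroom:.0%}")
# factor=0.73 keeps_up=True headroom=27%
```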



--
Greg Connor <gconnor(_at_)nekodojo(_dot_)org>

