Re: Turning raw data into useful stats

Here is my first pass at generating some semi-useful data. (I wouldn't call this data "useful" yet, but perhaps a whole day's worth of data would actually show some trends.) Maybe they could be called "interesting" if not really useful.

If the data gets used for anything, it will probably be best used for finding out what areas need a lot more data :)

Snapshot of 10,000 transactions received Jun 28 between 21:10:07 and 21:36:47

Format is Action__SPFresult, where Action is what we actually did with the message, and SPFresult is what SPF would have told us about the message if we had consulted it.
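For anyone who wants to build the same kind of listing, here is a minimal Python sketch of the tally. It assumes each log record has already been reduced to an (action, spf_result) pair. The function names `tally` and `format_report` are made up for illustration, and this is not the actual script (which uses Mail::SPF::Query). The whole-percent figures are rounded down, which appears to be what the listing does (394 of 10,000 shows as 3%).

```python
from collections import Counter


def tally(records):
    """Count (action, spf_result) pairs under an Action__SPFresult key."""
    counts = Counter()
    for action, spf_result in records:
        counts[f"{action}__{spf_result}"] += 1
    return counts


def format_report(counts):
    """Render one line per key: count, whole percent, Action__SPFresult."""
    total = sum(counts.values())
    lines = []
    for key in sorted(counts):
        n = counts[key]
        # Integer division rounds down, so 172 of 10,000 prints as (1%).
        pct_text = f"({n * 100 // total}%)"
        lines.append(f"{n:>11} {pct_text:>5}      {key}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample records, not real log data.
    sample = [("rbl", "(null sender)")] * 3 + [("delivered", "pass")]
    for line in format_report(tally(sample)):
        print(line)
```

Using "__" as the separator keeps the key easy to split back into its two halves later.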

Note, however, that a huge amount of our incoming mail is bounces, and I don't have the HELO name in the log data, so I couldn't test SPF at all for those - they just say (null sender)

"badrecip" means something was wrong with the recipient, either an unknown username, unknown domain name, null or mangled, or blocked/internal use only. Note that 15% is the largest number appearing on this list other than RBLs, so this is a fairly sizeable chunk. Even 1% of our incoming mail means 30,000 per day.

       1550 (15%)      badrecip__(null sender)
          9  (0%)      badrecip__error
          8  (0%)      badrecip__fail
        307  (3%)      badrecip__neutral
          2  (0%)      badrecip__pass
        172  (1%)      badrecip__softfail
          2  (0%)      badrecip__unknown

"badsender" means the MAIL FROM address is mangled/illegal or a nonexistent domain or something.

         40  (0%)      badsender__(null sender)
         17  (0%)      badsender__neutral

"blocked_spam" and "blocked_virus" mean the message was rejected due to content (usually "known spammer URL in the body")

         88  (0%)      blocked_spam__(null sender)
          1  (0%)      blocked_spam__fail
         19  (0%)      blocked_spam__neutral
         10  (0%)      blocked_spam__softfail
          2  (0%)      blocked_virus__(null sender)
          1  (0%)      blocked_virus__neutral

"delivered" means the message made it through in the clear, though it's still got a good chance of being spam at this point, sadly.

        260  (2%)      delivered__(null sender)
          3  (0%)      delivered__error
          3  (0%)      delivered__fail
         58  (0%)      delivered__neutral
          5  (0%)      delivered__pass
          7  (0%)      delivered__softfail

"miscerror" covers things like protocol misuse, and a handful of errors too rare to get their own category (message rejected)

        110  (1%)      miscerror__(null sender)
          1  (0%)      miscerror__error
          2  (0%)      miscerror__fail
         19  (0%)      miscerror__neutral
          1  (0%)      miscerror__pass
          4  (0%)      miscerror__softfail

wow, we sure don't quarantine stuff very much, .01%... hmm.
          1  (0%)      quarantined__(null sender)

ratelimit means too many connections, causing us to reject before even checking the RBL (most IPs subject to ratelimit are also on an RBL anyway)

        118  (1%)      ratelimit__(null sender)

RBL is the lion's share right now. I would probably get more detailed numbers for everything else if I just skipped this one, but I wanted to make sure I was comparing apples to apples. RBL means we cut off the connection before HELO.

       6722 (67%)      rbl__(null sender)

misc blah
          1  (0%)      syntaxerr__(null sender)

"tagged" means that the message was allowed through to someone's mailbox (or to be bounced somewhere down the line) but the Subject is altered to say it's probably spam.

         36  (0%)      tagged__(null sender)
          1  (0%)      tagged__error
          1  (0%)      tagged__fail
         16  (0%)      tagged__neutral

"temperror" is a 400-series error due to timed out DNS lookups, couldn't verify recipient exists, etc.

        394  (3%)      temperror__(null sender)
          4  (0%)      temperror__error
          3  (0%)      temperror__neutral
          2  (0%)      temperror__softfail

This was all run with trusted-forwarder.org being checked, AND with "best guess", though I haven't figured out how to tell whether the "pass" results were due to a guess or not using Mail::SPF::Query.




Same data sliced a different way... This time arranged by SPF result

About 93% of my transactions (9,322 of 10,000) have a null sender: either MAIL FROM:<> or the transaction doesn't get to the MAIL FROM stage at all. (Again, I wish I had HELO names, but I don't have access to those.)

       1550 (15%)      badrecip__(null sender)
         40  (0%)      badsender__(null sender)
         88  (0%)      blocked_spam__(null sender)
          2  (0%)      blocked_virus__(null sender)
        260  (2%)      delivered__(null sender)
        110  (1%)      miscerror__(null sender)
          1  (0%)      quarantined__(null sender)
        118  (1%)      ratelimit__(null sender)
       6722 (67%)      rbl__(null sender)
          1  (0%)      syntaxerr__(null sender)
         36  (0%)      tagged__(null sender)
        394  (3%)      temperror__(null sender)

SPF processing returned "error"
          9  (0%)      badrecip__error
          3  (0%)      delivered__error
          1  (0%)      miscerror__error
          1  (0%)      tagged__error
          4  (0%)      temperror__error

SPF processing returned "fail" - precious little of these: 0.15%
          8  (0%)      badrecip__fail
          1  (0%)      blocked_spam__fail
          3  (0%)      delivered__fail
          2  (0%)      miscerror__fail
          1  (0%)      tagged__fail

SPF returned "neutral" - 4.4%
        307  (3%)      badrecip__neutral
         17  (0%)      badsender__neutral
         19  (0%)      blocked_spam__neutral
          1  (0%)      blocked_virus__neutral
         58  (0%)      delivered__neutral
         19  (0%)      miscerror__neutral
         16  (0%)      tagged__neutral
          3  (0%)      temperror__neutral

SPF returned "pass" 0.08%
          2  (0%)      badrecip__pass
          5  (0%)      delivered__pass
          1  (0%)      miscerror__pass

SPF returned "softfail" - this is kinda cool actually: 1.95%
        172  (1%)      badrecip__softfail
         10  (0%)      blocked_spam__softfail
          7  (0%)      delivered__softfail
          4  (0%)      miscerror__softfail
          2  (0%)      temperror__softfail

SPF returned "unknown" - this maybe means unknown mechanism
          2  (0%)      badrecip__unknown
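
The by-SPF-result view can be recomputed from the first listing by splitting each key on "__" and summing per result. A minimal Python sketch, with the counts copied from the listing in this message; the names `COUNTS` and `by_spf_result` are just illustrative and this is not the script that produced the numbers:

```python
from collections import defaultdict

# Counts copied from the Action__SPFresult listing (10,000 transactions).
COUNTS = {
    "badrecip__(null sender)": 1550, "badrecip__error": 9,
    "badrecip__fail": 8, "badrecip__neutral": 307,
    "badrecip__pass": 2, "badrecip__softfail": 172,
    "badrecip__unknown": 2,
    "badsender__(null sender)": 40, "badsender__neutral": 17,
    "blocked_spam__(null sender)": 88, "blocked_spam__fail": 1,
    "blocked_spam__neutral": 19, "blocked_spam__softfail": 10,
    "blocked_virus__(null sender)": 2, "blocked_virus__neutral": 1,
    "delivered__(null sender)": 260, "delivered__error": 3,
    "delivered__fail": 3, "delivered__neutral": 58,
    "delivered__pass": 5, "delivered__softfail": 7,
    "miscerror__(null sender)": 110, "miscerror__error": 1,
    "miscerror__fail": 2, "miscerror__neutral": 19,
    "miscerror__pass": 1, "miscerror__softfail": 4,
    "quarantined__(null sender)": 1,
    "ratelimit__(null sender)": 118,
    "rbl__(null sender)": 6722,
    "syntaxerr__(null sender)": 1,
    "tagged__(null sender)": 36, "tagged__error": 1,
    "tagged__fail": 1, "tagged__neutral": 16,
    "temperror__(null sender)": 394, "temperror__error": 4,
    "temperror__neutral": 3, "temperror__softfail": 2,
}


def by_spf_result(counts):
    """Sum the counts per SPF result, ignoring the action half of the key."""
    totals = defaultdict(int)
    for key, n in counts.items():
        _action, spf_result = key.split("__", 1)
        totals[spf_result] += n
    return dict(totals)


totals = by_spf_result(COUNTS)
grand_total = sum(COUNTS.values())
# Largest group first, with the share to two decimal places.
for result, n in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{result:<15} {n:>6} {100 * n / grand_total:6.2f}%")
```

With these counts the totals come to 9,322 null sender, 440 neutral, 195 softfail, 18 error, 15 fail, 8 pass and 2 unknown, which add up to the full 10,000.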

The bright side for me is that my script (using Mail::SPF::Query) takes 19 min to process 26 min of real-time data, so hopefully I will be able to keep it running constantly and have a readout in real time of the whole 3 million, not just 10,000.
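
As a rough check on the "keep it running constantly" idea, here is a small Python sketch of the arithmetic. It assumes processing time scales linearly with the amount of log data, which one 10,000-transaction sample can't really confirm; the function names are made up for illustration.

```python
def utilization(processing_minutes, realtime_minutes):
    """Fraction of wall-clock time the script spends processing.

    Anything below 1.0 means the script finishes a batch before the next
    batch of the same length has accumulated.
    """
    return processing_minutes / realtime_minutes


def keeps_up(processing_minutes, realtime_minutes):
    """True if processing is faster than the data arrives."""
    return utilization(processing_minutes, realtime_minutes) < 1.0


if __name__ == "__main__":
    # Figures from this run: 19 min of processing for 26 min of data.
    print(f"utilization: {utilization(19, 26):.0%}")
    print(f"keeps up: {keeps_up(19, 26)}")
```

On these figures the script would be busy about 73% of the time, so there is some headroom but not a lot if traffic spikes or DNS lookups slow down.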




--
Greg Connor <gconnor(_at_)nekodojo(_dot_)org>