ietf-mxcomp

Re: So here it is one year later...

2005-01-28 13:17:15

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Dean Anderson writes:
On Fri, 28 Jan 2005, Justin Mason wrote:


FWIW, here are the results of a check of 54725 spams and 6680 nonspam mails,
from SpamAssassin's weekly mass-check of network rules (at
http://www.pathname.com/~corpus/NET.age ).

All these messages were received less than 1 month ago, and are taken from
5 people's hand-classified corpora.

  SPF records passing HELO strings: 4.98% of spam, 13.29% of ham
  SPF records passing the MAIL FROM: 3.72% spam, 18.90% of ham

So it certainly looks like that statement is untrue.

Err, no:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME:0-1
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME:1-3
OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME:3-6

  5.377   3.7259  18.9072    0.165   0.23   -0.00  SPF_PASS:0-1
  1.361   0.9087   3.3508    0.213   0.25   -0.00  SPF_PASS:1-3
  1.749   0.5116  18.4304    0.027   0.34   -0.00  SPF_PASS:3-6

As you see, spam + ham does not add up to overall.  It's not clear what
these statistics mean, nor how they were calculated.  But your
interpretation is clearly either wrong or at least not supported by the
page.

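Actually, you're wrong there. This is SpamAssassin's "hit-frequencies"
tool output.  Those are percentages, not message counts, and each is taken
against a different total: SPAM% against the spams only, HAM% against the
hams only, and OVERALL% against all messages.  So simply summing
SPAM%+HAM% will not add up to OVERALL%.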

Here's a quick walk through the pertinent parts. (I'm discarding the 1-3
and 3-6 month ranges -- those are old mails, so that data isn't very useful
for network tests -- and concentrating on just the 0-1 month range.)

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME:0-1
  61405    54725     6680    0.891   0.00    0.00  (all messages):0-1

This means that there were 61405 messages mass-checked in total,
with 54725 spams, and 6680 "hams" (non-spam messages).

  5.377   3.7259  18.9072    0.165   0.23   -0.00  SPF_PASS:0-1

Looking at the SPAM% and HAM% columns, that means that 3.7259% of the spams
checked had SPF_PASS, as did 18.9072% of the hams. For the spams, that gives

  ((3.7259 / 100) * 54725) = 2038.998775

I'd suspect the fractional part is just rounding error in the displayed
percentage, so call it 2039 spam messages passing the SPF check.  Likewise,

  ((18.9072 / 100) * 6680) = 1263.00096

and 1263 hams passed SPF.  Total those, and you get 3302 messages
from the overall corpus passing SPF; to express that as a percentage
of the total overall corpus, in other words "OVERALL%", you compute
(3302 / 61405) * 100 = 5.377.
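
The arithmetic above can be replayed in a few lines of Python. This is only
a sketch of the calculation as described in this mail, not code from the
hit-frequencies tool itself; the function name and argument names are made
up for illustration.

```python
def overall_percent(spam_pct, ham_pct, n_spam, n_ham):
    """Recover hit counts from SPAM%/HAM% and derive OVERALL%.

    Hypothetical helper: it mirrors the walk-through in the mail,
    not the internals of the hit-frequencies tool.
    """
    # The percentages are printed with limited precision, so round
    # back to whole message counts.
    spam_hits = round(spam_pct / 100 * n_spam)
    ham_hits = round(ham_pct / 100 * n_ham)
    # OVERALL% is taken against the combined corpus.
    overall = (spam_hits + ham_hits) / (n_spam + n_ham) * 100
    return spam_hits, ham_hits, overall


# Figures from the SPF_PASS:0-1 and (all messages):0-1 rows.
spam_hits, ham_hits, overall = overall_percent(3.7259, 18.9072, 54725, 6680)
print(spam_hits, ham_hits, round(overall, 3))  # 2039 1263 5.377
```

OVERALL% is a weighted average of SPAM% and HAM%, weighted by the size of
each corpus, which is why it sits between the two rather than equalling
their sum.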

If you have any more questions on the hit-frequencies format, I'll
be happy to fill you in -- I wrote the tool in question ;)

FYI, this is the original post:

---------- Forwarded message ----------
Date: Thu, 9 Sep 2004 15:18:42 +0200
From: Markus Stumpf <maex-lists-email-ietf-mxcomp(_at_)Space(_dot_)Net>
To: ietf-mxcomp(_at_)vpnc(_dot_)org
Subject: SPF abused by spammers

Justin Murdock posted this link on the qmail list:
    http://news.bbc.co.uk/1/hi/technology/3631350.stm
    "CipherTrust [...] found that 34% more spam is passing SPF checks than
    legitimate e-mail."

Sure.  But this was guaranteed to change over time, and vary depending on
corpus composition.  It's pretty radically different now, from where I and
the other SpamAssassin corpus contributors are viewing it.

- --j.

        \Maex

--
SpaceNet AG            | Joseph-Dollinger-Bogen 14 | Fon: +49 (89) 32356-0
Research & Development |       D-80807 Muenchen    | Fax: +49 (89) 32356-299
"The security, stability and reliability of a computer system is reciprocally
 proportional to the amount of vacuity between the ears of the admin"

              --Dean
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFB+p3CMJF5cimLx9ARAlLpAKCAOosS1dSm7hjSgzH0dzRTWNsaBwCgpY5Y
lLmvE2U+4KdCyOXLXAdgYFY=
=HChc
-----END PGP SIGNATURE-----