Re: So here it is one year later...


On Fri, 28 Jan 2005, Justin Mason wrote:


BTW, I'd say that 38% of the SPF use was ham, and 62% was spam.
    1263 ham / 3302  = ~38%
    2039 spam / 3302 = ~62%

So, 24% versus the 34% in September.  Either slightly better than
September, or perhaps the sample is too small and skewed to be useful. Or 
both.  It's still better to block whenever you see SPF.


hmm!  not sure if that's a good assumption -- it's very much dependent on
the comparative ham:spam ratio a domain would see. This set of corpora is
heavily skewed towards receiving more spam than ham; 87.8% of our messages
being spam.

Let's say those corpora were only receiving a third as much spam as they
currently do, possibly because they were younger email addresses or
whatever.  In that case, we'd see 1263 ham, 680 spam (2039/3 ~= 680), in
which case the proportion of SPF-using-spam vs SPF-using-ham would be on
its head: 35% being spam, 65% being ham.
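
The hypothetical above can be checked with a few lines of Python. This is
only a sketch: the function name spf_shares is made up for illustration,
the counts are the ones quoted in this thread, and the one-third figure is
rounded to 680 as in the text.

```python
def spf_shares(spf_ham, spf_spam):
    """Return (ham_share, spam_share) of SPF-bearing mail as fractions."""
    total = spf_ham + spf_spam
    return spf_ham / total, spf_spam / total


# Counts quoted in this thread.
observed = spf_shares(1263, 2039)

# Hypothetical corpus that receives one third as much spam (2039/3 ~= 680).
reduced = spf_shares(1263, 680)

for label, (ham, spam) in (("observed", observed), ("one third spam", reduced)):
    print(f"{label}: {ham:.0%} ham, {spam:.0%} spam")
```

The two shares always sum to 100% because they are fractions of the same
SPF-bearing total; only the ham count is held fixed between the two cases.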


Err, no. If there was less total spam, then instead of some 60k messages,
you might have 50k messages. However, very likely, there would still be
3302 messages that had correct SPF records. Your calculation, while a
useful number, adds in all sorts of other noise, not related to SPF. This
sort of misunderstanding is, unfortunately, all too frequent.

If you decrease the number of spams by 10k messages (roughly 15%), and
3.7% of total spam has SPF, and all spammers do the same thing, then the
SPF spam number would only decrease by 3.7%. Assuming that, we'd expect
our SPF-using spams to drop by 75 messages. (neglecting variance)

So now we'd have:

1263 ham / 3227    = ~39%
1964 spam / 3227   = ~61%

Barely a difference, yet a roughly 15% change in spam volume.

However, the SPF spammers are not doing the same thing as all other
spammers. We have no reason to think so. So a change in the behavior of
the non-SPF using spammers would _not_ affect the behavior of the
SPF-using spammers, so you'd expect the same 3302 messages as before.  Of
course, virus joe-jobbers probably think that by generating more spam,
they alter the statistics of real spammers.  That's only true if there is
no way to distinguish joe-jobs from real spam. But of course, we can make
such distinctions with CAN-SPAM. No doubt, this is precisely why the DMA
was behind CAN-SPAM.

Fake spam violates CAN-SPAM: no real product or service, etc.  Genuine
spam MOSTLY doesn't: There is a strong incentive for genuine spammers to
comply, and compliance isn't hard for a real company. But it is
practically impossible for fake spammers to comply, because if they had
real products or services, they'd be real spammers, but of course, they 
aren't real commercial operations.

SPF added another interesting element to all this, in that genuine
spammers leapt onto the bandwagon early, and essentially added a label to
their spam, a label that fake spammers can't easily add at the moment.  
Eventually, viruses will be able to start using ISP relays identified
by SPF, but for now, they can't do that easily, and so don't have SPF
"protection" or labeling.  The unexpected segregation created during this
transition period should yield a great deal of insight into genuine spam
and virus activity.

BTW, I noticed that John Levine published a diatribe on the "failure of
CAN-SPAM" on CircleID a few days ago. He is wrong again, asserting
incorrectly that CAN-SPAM was meant to outlaw spam. It legalized spam, and
gave a way to distinguish the real (i.e. DMA member) spammers from the fake
(and I suspect anti-spam radical) abusers. I suspect that if you go back
to Congress, the DMA will simply push for enforcement of the criminal
provisions of CAN-SPAM, which should reveal the identities of our virus
operators.  I wonder who that will reveal, and if it will be another ISP
abuse admin, or radical anti-spammer.  I foresee the end of a certain type
of spam, in the same way that open relay abuse ended: The abusers will
just give up. It will be nice if this accounts for 96.3% of all spam.  
But, I've digressed enough.

What I'm trying to illustrate here is that it's important to compare
using figures that compensate for the comparative ham:spam ratio,
because that varies wildly.   Hence, comparing (SPF-bearing-ham / all-ham)
to (SPF-bearing-spam / all-spam) is safer than comparing the message
counts of SPF-bearing-ham and SPF-bearing-spam directly.
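
To see why the direct comparison depends on the mix of mail, here is a
rough sketch in Python. It takes the two per-class rates mentioned in this
thread (about 18% of ham and about 3.7% of spam carrying SPF) and asks what
fraction of SPF-bearing mail would be spam at different overall spam
proportions. The function name and the spam proportions other than 87.8%
are illustrative assumptions, not measurements.

```python
def spam_share_of_spf(spf_rate_ham, spf_rate_spam, spam_fraction):
    """Fraction of SPF-bearing mail that is spam, given the per-class SPF
    rates and the fraction of all mail that is spam (Bayes' rule)."""
    spf_spam = spf_rate_spam * spam_fraction
    spf_ham = spf_rate_ham * (1.0 - spam_fraction)
    return spf_spam / (spf_spam + spf_ham)


# Per-class rates as quoted in this thread (approximate).
RATE_HAM = 0.18
RATE_SPAM = 0.037

# 0.878 is the corpus figure from the thread; the others are hypothetical.
for spam_fraction in (0.878, 0.70, 0.50):
    share = spam_share_of_spf(RATE_HAM, RATE_SPAM, spam_fraction)
    print(f"{spam_fraction:.1%} spam overall -> {share:.0%} of SPF mail is spam")
```

The per-class rates stay fixed in this sketch while the resulting share
moves with the mix, which is the distinction being drawn above.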


Safer?  It's different. Safer has some other implication that's not clear. I
don't know what "safer" means.  It could mean that since 18% of total ham
has SPF records, that perhaps deleting based on the roughly 2 to 1 chance
that "SPF records mean spam", is an "unsafe" bet.  More rules to cover the
ham would be important.

However, when you ask the question "what is the ratio of SPF use that is
spam to SPF use that is ham?", the percentages have to add up to 100;
otherwise you didn't answer the question. You answered some other
question.  Possibly, that question is also useful, but it wasn't the
question asked. The assertion was that the spammers jumped on SPF. The
percentage of SPF use that is spam is 62%.  The percentage of SPF use that
is ham is 38%. The total SPF use is composed of spam and ham. Spam SPF use
+ Ham SPF use must equal 100%.

Your corpus supports the Ciphertrust assertion.  But you said it didn't.  
That was incorrect.  The mistake was that you answered a different 
question.

Still, I'd think the 3.7% of total spam using SPF is still fairly
significant, and probably reflects the relative proportion of genuine
commercial spam to non-commercial spam.  It would be interesting to know,
of those spams that pass SPF, how many are CAN-SPAM compliant?  How many
of the non-SPF spams were CAN-SPAM compliant?  I'd conjecture a strong
correlation.


now that's something I don't have time to get into, CAN-SPAM compliance
not being something that's easy to automate checking for (more's the
pity).


You should have accepted the IEMMC proposal back in 1997. Wallace and co.
proposed everything that's in CAN-SPAM, with the benefit of a special
X-spam header (I think it was called something else, like X-advertisement
or some such).  That sort of labeling would have made the problem easy.  
But the radicals didn't want to be reasonable at the time--thought they
could end spam by technical means.

                --Dean

- --j.


-- 
Av8 Internet   Prepared to pay a premium for better service?
www.av8.net         faster, more reliable, better service
617 344 9000