Meng Weng Wong <mengwong(_at_)dumbo(_dot_)pobox(_dot_)com> writes:
> I can see the value of such an attribute for testing, but the protocol
> was designed as an on-the-wire, realtime test. If we started down
> this road, we might end up with an era syntax with different effective
> SPF expressions during different date ranges, and that seems like too
> much complexity.
That's a bit of a slippery slope fallacy. This would just be a binary
switch for knowing whether a message is testable given the current
definition. I agree that anything else would be too complicated for
both sender and receiver.

Actually, it might make sense to switch the proposal around from "valid
since" to "never valid before" to make it clearer that it is only
advisory.

Besides improving accuracy testing, it could also provide some mildly
useful debugging information: "did this just break or is it just me?"
and such.
When updating your SPF TXT record, you'd have a few options, all equally
valid:
- leave off the date
- use today's date
- only change to today's date if a meaningful change was made
The date could also be used as an mtime of sorts for when you think
you fixed your last source of failures. For example, I'd probably just
stick in the date (*cough* last week) when I stopped forging my From:
from my mobile phone, because that's also when false positives should
have stopped happening for my domain.
> For corpus testing I suggest using the assumption that an SPF record,
> if present in the DNS today, is likely to have been valid for recent
> transactions within, say, a month.
That would probably work okay. Our SPF_FAIL rule is working much better
on recent email than on old email, but I suspect a lot of that is due to
the initial learning curve of SPF.
OVERALL%    SPAM%    HAM%    S/O  RANK  SCORE  NAME
   0.588   0.6388  0.0000  1.000  0.71   1.00  SPF_FAIL:0-1
   1.525   2.3910  0.0351  0.986  0.72   1.00  SPF_FAIL:1-3
   1.384   1.5097  0.2952  0.836  0.62   1.00  SPF_FAIL:3-6
The number range at the end of each rule name is the message age in months.
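
For anyone wanting to reproduce this kind of breakdown, a rough sketch
of the age bucketing follows. The bucket boundaries match the suffixes
above, but the function name and the 30-day month are assumptions for
illustration, not necessarily how these numbers were generated.

```python
from datetime import date
from typing import Optional

# Age ranges in months, matching the rule-name suffixes (0-1, 1-3, 3-6).
BUCKETS = [(0, 1), (1, 3), (3, 6)]
DAYS_PER_MONTH = 30  # assumption: a month is approximated as 30 days


def age_bucket(message_date: date, reference_date: date) -> Optional[str]:
    """Return the age suffix for a message, or None if it falls outside
    every bucket (dated in the future, or roughly 6 months old or more)."""
    age_days = (reference_date - message_date).days
    if age_days < 0:
        return None
    for low, high in BUCKETS:
        if low * DAYS_PER_MONTH <= age_days < high * DAYS_PER_MONTH:
            return f"{low}-{high}"
    return None
```

A rule hit on a message would then be counted under a name such as
"SPF_FAIL:" followed by the returned suffix.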
> How do people solve this problem for corpus-testing DNSBLs?
In my experience testing DNSBLs (which I've done for SpamAssassin), most
DNSBLs tend to have relatively constant accuracy rates. There are
occasional cases where a spammer IP is inherited by a non-spammer, but
those are dwarfed by other forms of collateral damage, even in the best
blacklists like SBL. (This is why we use blacklists, and in fact most
rules, as only part of the input to the final spam/ham result.)

Some DNSBLs like SpamCop retire IP addresses really quickly, so their
accuracy changes faster and they are harder to evaluate. For most,
though, the overall accuracy (the ratio of spam hits to overall hits,
assuming a 50/50 distribution of spam and ham) stays roughly the same
until you get to mail that's 3, maybe 6, months old.
Some examples:
OVERALL%    SPAM%    HAM%    S/O  RANK  SCORE  NAME
   6.509   7.0558  0.1495  0.979  0.82   1.27  RCVD_IN_SBL:0-1
  11.977  18.9237  0.0351  0.998  0.96   1.27  RCVD_IN_SBL:1-3
  11.308  12.5981  0.1476  0.988  0.92   1.27  RCVD_IN_SBL:3-6
(no real change in accuracy)

  60.647  65.8169  0.5629  0.992  0.79   1.10  RCVD_IN_DSBL:0-1
  37.545  59.2916  0.1639  0.997  0.96   1.10  RCVD_IN_DSBL:1-3
  55.455  61.8219  0.3690  0.994  0.98   1.10  RCVD_IN_DSBL:3-6
(no real change in accuracy)

  48.469  52.5977  0.4838  0.991  0.80   2.55  RCVD_IN_SORBS_DUL:0-1
  23.239  36.7166  0.0703  0.998  0.98   2.55  RCVD_IN_SORBS_DUL:1-3
  32.342  36.0542  0.2214  0.994  0.98   2.55  RCVD_IN_SORBS_DUL:3-6
(no real change in accuracy)

  75.277  81.6876  0.7740  0.991  0.77   1.00  RCVD_IN_XBL:0-1
  18.170  28.4469  0.5035  0.983  0.84   1.00  RCVD_IN_XBL:1-3
   7.990   8.8280  0.7380  0.923  0.82   1.00  RCVD_IN_XBL:3-6
(big drop off, probably because this is tracking infected hosts and
the same hosts weren't infected in the past)

  53.575  58.1435  0.4838  0.992  0.81   2.25  RCVD_IN_BL_SPAMCOP_NET:0-1
   7.209  11.2943  0.1874  0.984  0.86   2.25  RCVD_IN_BL_SPAMCOP_NET:1-3
   2.699   2.9939  0.1476  0.953  0.73   2.25  RCVD_IN_BL_SPAMCOP_NET:3-6
(fairly big drop off, due to rapid automated expiry of IPs)

We still only look at the last 6 months when evaluating DNSBLs.
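
For reference, the S/O column in the tables above can be reproduced from
the SPAM% and HAM% columns. Because both are percentages of their own
corpus, dividing one by their sum is what "assuming a 50/50 distribution
of spam and ham" works out to. The sketch below is a minimal
illustration with a made-up function name, not the code that produced
the tables.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """S/O: the share of a rule's hits that are spam, with spam and ham
    weighted equally (the 50/50 assumption).

    spam_pct: percentage of spam messages the rule hit
    ham_pct:  percentage of ham messages the rule hit
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no messages, so S/O is undefined")
    return spam_pct / total


# SPF_FAIL:3-6 from the first table: SPAM% 1.5097, HAM% 0.2952
print(f"{spam_over_overall(1.5097, 0.2952):.3f}")  # prints 0.836
```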
Daniel
--
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting