ietf-mxcomp

PRA Algorithm Stats

2004-07-19 14:32:13
Thanks, Andy, for your stats. I've been running my own this weekend, and I came up with slightly different results. I started with four data sets: my inbox from 2003 (~1500 messages), my last three days of spam (~500 messages), and two from Chuck Mead (ham @ ~1000 messages and spam @ ~4000 messages).

For each message, I computed the PRA and compared its domain to the domain of the envelope from. If they were the same, I did no more checks, since both would use the same record and give the same result (barring records that make distinctions based on local parts). The percentage of messages where they differ is:

ham1 17%
ham2 16%
spam1 8%
spam2 9%
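
The comparison described above can be sketched roughly as follows. This is a simplified illustration, not the script used for these runs: it assumes the PRA is the first non-empty header among Resent-Sender, Resent-From, Sender and From, and it ignores the finer rules about Resent-* block ordering and malformed headers. The names simplified_pra, domain_of and domains_differ are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Assumed header precedence for a simplified PRA lookup.
# Real PRA selection has extra rules (Resent-* ordering, malformed headers).
PRA_HEADERS = ("Resent-Sender", "Resent-From", "Sender", "From")


def domain_of(address):
    """Return the lowercased domain of an address, or None if it has none."""
    _, addr = parseaddr(address or "")
    if "@" not in addr:
        return None
    return addr.rsplit("@", 1)[1].lower()


def simplified_pra(msg):
    """Return the value of the first non-empty PRA candidate header."""
    for name in PRA_HEADERS:
        value = msg.get(name)
        if value and value.strip():
            return value
    return None


def domains_differ(raw_message, envelope_from):
    """True if the PRA domain and the envelope-from domain are not the same."""
    msg = message_from_string(raw_message)
    return domain_of(simplified_pra(msg)) != domain_of(envelope_from)
```

Only messages for which domains_differ returns True would need the second SPF check, since the other messages resolve to the same record for both identities.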

I then ran the SPF record check against both the PRA and the envelope MAIL FROM identities for these messages. I tallied the results two ways: counting only messages where an SPF record was published, and counting all messages, using a guessed SPF record where none was published. Here is the percentage of the time a guess was used:

         PRA    MAIL FROM
ham1     82%    92%
ham2     72%    75%
spam1    67%    94%
spam2    78%    89%

Unlike what Andy saw, my query results for the two identities differed some percentage of the time:

ham1 66%
ham2 27%
spam1 43%
spam2 31%
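
For anyone wanting to reproduce this kind of figure, here is a minimal sketch of the arithmetic. It assumes you already have one (PRA result, env. from result) pair per message checked. The function name and the result strings are illustrative, not taken from my script.

```python
def percent_differing(result_pairs):
    """Percentage of (pra_result, env_result) pairs whose results disagree."""
    if not result_pairs:
        return 0.0
    differing = sum(1 for pra, env in result_pairs if pra != env)
    return 100.0 * differing / len(result_pairs)
```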

Lastly, like Andy, I looked at the Pass and Fail percentages (both for published records only and for published plus guessed records):

Ham1 Summary:      published SPF   + guessed SPF
----------------   -------------   -------------
PRA Pass:           11 ( 24%)       81 ( 31%)
env. from Pass:     18 (100%)      241 ( 94%)

PRA Fail:            0 (  0%)        0 (  0%)
env. from Fail:      0 (  0%)        0 (  0%)

Ham2 Summary:      published SPF   + guessed SPF
----------------   -------------   -------------
PRA Pass:           14 ( 53%)       62 ( 64%)
env. from Pass:     16 ( 66%)       72 ( 75%)

PRA Fail:            7 ( 26%)        7 (  7%)
env. from Fail:      0 (  0%)        0 (  0%)

Spam1 Summary:     published SPF   + guessed SPF
----------------   -------------   -------------
PRA Pass:            2 ( 16%)        6 ( 16%)
env. from Pass:      0 (  0%)       10 ( 27%)

PRA Fail:            5 ( 41%)        5 ( 13%)
env. from Fail:      2 (100%)        2 (  5%)

Spam2 Summary:     published SPF   + guessed SPF
----------------   -------------   -------------
PRA Pass:           40 ( 44%)       94 ( 22%)
env. from Pass:      0 (  0%)       97 ( 23%)

PRA Fail:           22 ( 24%)       22 (  5%)
env. from Fail:     13 ( 29%)       13 (  3%)

While the numbers are low, and I'd be the first to worry about generalizing from these stats, I can begin to see a trend:

1) Over 80% of mail is not complicated: the PRA produces the same domain as env. from. Oddly enough, this is even more true for spam. (I suppose spammers just haven't caught on.)

2) While still only a small percentage of sites have records, more sites identified by PRA had records than sites identified by env. from. This could be because the PRA is more often a bigger site.

3) When the domains differ, the check result from PRA vs. env. from is often different. This is the only stat. that is at odds with Andy's findings.

4) For ham, env. from produces a greater percentage of passes than PRA. It also produces fewer fails, but the numbers here are very low. For ham, more passes and fewer fails is good.

5) For spam, the numbers seem less conclusive, but PRA seems to produce more definite results (passes and fails) than env. from.

I've attached the detailed output of my script runs for anyone who wants to plunge in deeper...

- Mark

Mark Lentczner
http://www.ozonehouse.com/mark/
markl(_at_)glyphic(_dot_)com

Attachment: all-results2
Description: Binary data