Re: PRA algorithm and use of non-standard header fields

On Jul 15, 2004, at 11:39 PM, Mark Lentczner wrote:

Does anyone have a largish database of messages (just need the headers) categorized as spam/not-spam? I would be happy to code up a quick PRA test and run it over the dataset. Of course, without the domains publishing records, we'll have to make some guesses as to whether the identified PRA matches the SMTP client IP...

I reran my tests using PRA and SPF-classic. Here are the results:

HAM - 15678 messages (most from mailing lists, btw)
====
         Msgs w/usable RR    Deny     Pass
         ----------------    -----    -----
SPF-C          32%           3.9%     23.5%
PRA            32%           4.5%     23.7%

SPAM - 15710 messages
====
         Msgs w/usable RR    Deny     Pass
         ----------------    -----    -----
SPF-C          4.6%          74.1%    2.5%
PRA            6.5%          51.0%    1.7%

Other interesting data: of the 31,388 messages, there was not a single one where SPF-C and PRA contradicted each other. Also, 2822.From was not the PRA in 14,945 of the 31,388 messages (48%); among the 15678 HAM messages alone, the figure is 78%.

The only conclusion that I can draw is this: without more domains publishing records, there is no way to know if there is a difference between PRA and SPF-C.

I do have the following comments on the marid-core PRA algorithm:

1) The six steps really ought to be put into pseudo-code, with each step spelled out in a separate routine. I found the textual descriptions a little confusing. If need be, I can contribute my Python code.

2) Many of the steps talk about a "non-Empty" header. Isn't this requirement also fulfilled by step 5, therefore making this requirement in the subsequent steps redundant?

3) It may be necessary to modify the check of the "Sender" header because many systems do not attach a domain name to a mailbox address if the injection and delivery are on the same box. Or perhaps this should go into an applicability statement section 7 regarding checking of email local to the system.
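
To make comments 1 and 3 concrete, here is a minimal sketch of how the header selection might be split into separate routines. It is not the code used for the numbers above. The header priority is an assumption: it loosely follows the ordering later published in RFC 4407, it may differ from the marid-core draft under discussion, and it omits the rule about Received/Return-Path headers sitting between Resent-From and Resent-Sender. The function names are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def first_non_empty(msg, name):
    """Return the first non-empty value of header `name`, or None."""
    for value in msg.get_all(name, []):
        if value.strip():
            return value
    return None


def all_non_empty(msg, name):
    """Return every non-empty value of header `name`, in message order."""
    return [v for v in msg.get_all(name, []) if v.strip()]


def select_header(msg):
    """Pick the single header value that carries the PRA, or None.

    Assumed priority: Resent-Sender, Resent-From, Sender, From.
    """
    for name in ("Resent-Sender", "Resent-From"):
        value = first_non_empty(msg, name)
        if value is not None:
            return value
    for name in ("Sender", "From"):
        values = all_non_empty(msg, name)
        if len(values) == 1:
            return values[0]
        if len(values) > 1:
            return None  # several candidates: treat the message as ill-formed
    return None  # no usable header at all


def extract_pra(header_value):
    """Return the mailbox in the selected header, or None if malformed."""
    addrs = getaddresses([header_value])
    if len(addrs) != 1:
        return None  # zero or multiple mailboxes
    addr = addrs[0][1]
    local, sep, domain = addr.rpartition("@")
    if not (local and sep and domain):
        return None  # e.g. a bare local part with no domain (comment 3)
    return addr


def purported_responsible_address(raw_headers):
    """Run the whole selection on a raw header block."""
    header = select_header(message_from_string(raw_headers))
    return extract_pra(header) if header is not None else None
```

With this split, a message carrying "Sender: root" and a well-formed From yields no PRA at all, because the Sender header is selected first and then rejected for lacking a domain. That is the locally injected case raised in comment 3.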

-andy
