ietf-asrg

Re: [Asrg] 7. Best Practices - DNSBLs - Article

2003-08-13 09:58:37

Brad Knowles writes:
At 5:42 PM -0700 2003/08/12, Justin Mason wrote:

 OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  495260   343948   151312    0.694   0.00    0.00  (all messages)

      Hmm.  This is the total number of messages, and not the total 
number of IP addresses to look up, right?  Do you have any idea how 
many IP addresses there are per message that you have to look up?

That is the total number of messages -- and I'm afraid I have no idea how
many IPs there are per message.  There are about 15-20 individual
contributors behind all that mail; they do not send us the entire messages
during the process, so grepping out that data would be tricky :(
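
For anyone who does hold the full messages, counting the IPs per message could look roughly like the sketch below. It assumes relay addresses appear in square brackets in Received headers, which is common but not guaranteed; the function name received_ips and the sample message are made up for illustration.

```python
import re
from email import message_from_string

# Assumption: relay IPs appear as "[a.b.c.d]" inside Received headers.
IPV4_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def received_ips(raw_message):
    """Return the distinct bracketed IPv4 addresses found in the
    Received headers of one raw message, in header order."""
    msg = message_from_string(raw_message)
    ips = []
    for header in msg.get_all("Received", []):
        for ip in IPV4_RE.findall(header):
            if ip not in ips:
                ips.append(ip)
    return ips

# Made-up sample message using documentation-only address ranges.
SAMPLE = (
    "Received: from a.example (a.example [192.0.2.10])\n"
    "\tby mx.example; Tue, 12 Aug 2003 17:42:00 -0700\n"
    "Received: from b.example (unknown [198.51.100.7])\n"
    "\tby a.example; Tue, 12 Aug 2003 17:41:00 -0700\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

ips = received_ips(SAMPLE)
print(len(ips), ips)
```

Averaging len(received_ips(m)) over a corpus would give the lookups-per-message figure asked about above.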

      I am curious -- is there a reason why you tested with a much 
larger spam archive than your ham archive?

I don't know -- Theo ran the corpus selection ;)   But as far as I know,
there was no particular reason -- probably just that more people submitted
results for spam than for ham.   Since the main purpose was to run the
genetic algorithm, and in our experience the GA is resilient to that kind
of imbalance, it wasn't an issue.

Thinking about this some more, I notice that the MAPS black lists 
do not appear to be tested at all.  For the sake of comparison, I 
believe that they should be included in the tests and ranked against 
the other black lists, or they should be omitted from the list of 
black lists altogether.

This is true.   (They are only included in that list because of my
oversight. oops.)

Moreover, this is less than forty black lists.  My understanding 
is that there are well over a hundred in existence.  This list would 
need to be significantly expanded, in order to cover all known black 
lists and be a more fair comparison.

Good point...  note that it's actually *under* 40 distinct lists, as many
of those "rules" indicate that a multi-value DNSBL -- like SORBS --
returned one of its possible values -- like 127.0.0.6.
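
To make the multi-value point concrete, here is a rough sketch of a DNSBL lookup: reverse the IPv4 octets, append the list's zone, and treat each returned A record as a code. The helper names and the EXAMPLE_CODES table are hypothetical; real lists publish their own code meanings, and this is not how SpamAssassin itself does the lookup.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def dnsbl_lookup(ip, zone):
    """Return the A records for the query name; an empty list means
    the address is not listed (or the lookup failed)."""
    try:
        _name, _aliases, answers = socket.gethostbyname_ex(
            dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return []
    return answers

# Hypothetical code table, for illustration only.
EXAMPLE_CODES = {
    "127.0.0.2": "example-sublist-a",
    "127.0.0.6": "example-sublist-b",
}

def classify(answers, codes):
    """Map each returned value to a sub-list label, one 'rule' per value."""
    return [codes.get(a, "listed (unknown code)") for a in answers]

print(dnsbl_query_name("192.0.2.10", "dnsbl.example"))
print(classify(["127.0.0.6"], EXAMPLE_CODES))
```

One zone can therefore account for several rules in the table, one per return value.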

Those are just the DNSBLs that we found to provide useful results.  Many
of the others either (a) have unusual listing/delisting criteria, (b) were
not totally free to query, (c) did not want to have thousands of
SpamAssassin users hitting their servers (which is pretty reasonable!), or
(d) didn't make the grade in terms of QA.

In particular, referring back to the "overlap" question -- if I recall
correctly Dan dropped a couple of BLs due to large overlap with the ones
in that list.
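
I don't know exactly how the overlap was measured, but one simple way to put a number on it is the fraction of one list's hits that another list also catches. The function name and the message IDs below are invented for illustration.

```python
def overlap(hits_a, hits_b):
    """Fraction of the messages hit by list A that list B also hit.
    Both arguments are sets of message identifiers."""
    if not hits_a:
        return 0.0
    return len(hits_a & hits_b) / len(hits_a)

# Made-up message IDs: B catches two of A's four hits.
list_a = {1, 2, 3, 4}
list_b = {3, 4, 5}
print(overlap(list_a, list_b))
```

A value close to 1.0 would suggest list A adds little on top of list B.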

I wonder -- have you run this comparison with other spam/ham 
corpora?  Do you continue to expand your spam/ham corpus as time goes 
on?

This comparison is made using ham/spam corpora from about 15 to 20 people.

      In terms of analyzing black list performance, all we need is the 
IP address(es) found in the headers of the message.  Everything else 
is superfluous, and indeed gets in our way.

FWIW, SpamAssassin 2.60 will include a method for the user/admin to tag
all messages with a header containing the DNSBL lookup results at the
moment the message was scanned -- like so:

  X-spam-rbl: <dns:244.130.118.163.opm.blitzed.org> [127.1.0.4]

Once that's in the field for a while, it'd be easy enough to grep out
those headers and come up with "live" DNSBL performance data, and the
list of IPs.
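
As a rough sketch of that grep step: the code below parses header lines of the shape shown above and tallies hits per zone and return value. It assumes every header looks like that one example, so the final format may need a different pattern; the function names are made up.

```python
import re
from collections import Counter

# Assumption: headers follow the single example format
#   X-spam-rbl: <dns:REVERSED-IP.ZONE> [RETURN-VALUE]
RBL_RE = re.compile(
    r"^X-spam-rbl:\s*<dns:([^>]+)>\s*\[([0-9.]+)\]", re.IGNORECASE)

def parse_rbl_header(line):
    """Return (ip, zone, answer) for one header line, or None."""
    m = RBL_RE.match(line.strip())
    if not m:
        return None
    qname, answer = m.groups()
    labels = qname.split(".")
    # Need four numeric octet labels followed by at least one zone label.
    if len(labels) < 5 or not all(l.isdigit() for l in labels[:4]):
        return None
    ip = ".".join(reversed(labels[:4]))  # undo the DNSBL octet reversal
    zone = ".".join(labels[4:])
    return ip, zone, answer

def tally(lines):
    """Count hits per (zone, answer) pair across many header lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_rbl_header(line)
        if parsed:
            _ip, zone, answer = parsed
            counts[(zone, answer)] += 1
    return counts

example = "  X-spam-rbl: <dns:244.130.118.163.opm.blitzed.org> [127.1.0.4]"
print(parse_rbl_header(example))
```

Collecting the first element of each parsed tuple also yields the list of IPs.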

--j.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg