Re: [ietf-dkim] New canonicalizations

-----Original Message-----
From: ietf-dkim-bounces(_at_)mipassoc(_dot_)org 
[mailto:ietf-dkim-bounces(_at_)mipassoc(_dot_)org] On Behalf Of Hector Santos
Sent: Wednesday, May 18, 2011 1:49 PM
To: IETF-DKIM
Subject: Re: [ietf-dkim] New canonicalizations

Whatever the actual reason, since it's not the default and the option
exists and serves a purpose, there is a reasonable practical
explanation: a certain population of domains is seeking the path of
least resistance, with fewer accidental <cr><lf> injections and
mutations along the path. These can easily occur given the transport,
gateway and storage I/O differences in our heterogeneous networks of
Unix (LF), Mac (CR) and DOS (CRLF) systems.


I think you're asking for a count of spam-producing domains, broken down by 
the canonicalizations they use.  Here's what we have:

+------------------------+-----------+------------+
| count(distinct domain) | hdr_canon | body_canon |
+------------------------+-----------+------------+
|                    214 |         0 |          0 |
|                      1 |         0 |          1 |
|                     62 |         1 |          0 |
|                   3805 |         1 |          1 |
+------------------------+-----------+------------+

This counts a domain as "spammy" if the mail we've seen signed by that domain 
is labeled as spam by SpamAssassin at least 50% of the time, just as a starting 
point.  But if I instead report on domains labeled as spam less than 50% of the 
time (relatively clean domains), the ratios are about the same:

+------------------------+-----------+------------+
| count(distinct domain) | hdr_canon | body_canon |
+------------------------+-----------+------------+
|                   2703 |         0 |          0 |
|                      6 |         0 |          1 |
|                   2238 |         1 |          0 |
|                  20573 |         1 |          1 |
+------------------------+-----------+------------+

So I don't think a conclusion's really possible here.
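
To make "the ratios" concrete, here is a minimal sketch that turns the two
tables above into per-canonicalization shares. The counts are copied from the
tables; the names SPAMMY, CLEAN, shares and report are illustrative only and
are not part of whatever query produced the tables.

```python
# Distinct-domain counts copied from the two tables above.
# Keys are (hdr_canon, body_canon); 0 = simple, 1 = relaxed.
SPAMMY = {(0, 0): 214, (0, 1): 1, (1, 0): 62, (1, 1): 3805}
CLEAN = {(0, 0): 2703, (0, 1): 6, (1, 0): 2238, (1, 1): 20573}

NAMES = {0: "simple", 1: "relaxed"}


def shares(counts):
    """Return each canonicalization pair's fraction of the group's domains."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def report(label, counts):
    """Print one line per header/body canonicalization pair."""
    print(label)
    for (hdr, body), frac in sorted(shares(counts).items()):
        print(f"  {NAMES[hdr]}/{NAMES[body]}: {frac:.1%}")


report("spammy (>= 50% labeled spam)", SPAMMY)
report("clean (< 50% labeled spam)", CLEAN)
```

Both groups are dominated by relaxed/relaxed; the sketch only restates the
table as percentages and does not test whether the two distributions differ.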

I don't think there is anything reliable there from what I can see, but
it's not unreasonable to hypothesize that there might be a direct
correlation between the number of hops and the tendency to use
relaxed/relaxed. It might be interesting to see whether that is a
motivation for using relaxed/relaxed:

      c-param vs. avg # of hops (Received lines)


+---------------------+-----------+------------+----------+
| avg(received_count) | hdr_canon | body_canon | count(*) |
+---------------------+-----------+------------+----------+
|              1.0976 |         0 |          0 |     2214 |
|              1.0000 |         0 |          1 |        7 |
|              1.0338 |         1 |          0 |     7569 |
|              2.3349 |         1 |          1 |    14086 |
+---------------------+-----------+------------+----------+

In the hdr_canon and body_canon columns, "0" means "simple" and "1" means 
"relaxed".  So for spam there is possibly a correlation between use of 
relaxed/relaxed and the hop count, but I have trouble envisioning that as 
something that signers are actively considering.
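
For readers mapping these columns back to signatures, the following is a
rough sketch of how a "c=" tag value could be turned into the 0/1 pair used in
the tables. The post does not say how the columns were actually populated, so
the function name canon_codes and the mapping itself are assumptions; the
defaulting rules are the ones in RFC 6376 section 3.5.

```python
# Encoding used in the hdr_canon / body_canon columns of the tables.
CODES = {"simple": 0, "relaxed": 1}


def canon_codes(c_tag=None):
    """Map a DKIM-Signature "c=" value to (hdr_canon, body_canon).

    Per RFC 6376 section 3.5, a missing tag means simple/simple, and a
    value naming only one algorithm applies it to the header and leaves
    the body at "simple". Unknown algorithm names raise KeyError.
    """
    if c_tag is None:
        return (0, 0)
    hdr, sep, body = c_tag.strip().partition("/")
    if not sep:
        body = "simple"  # body algorithm omitted: defaults to simple
    return (CODES[hdr], CODES[body])


for value in (None, "relaxed", "simple/relaxed", "relaxed/relaxed"):
    print(value, "->", canon_codes(value))
```

Note that under these rules a signer has to name both algorithms explicitly to
get relaxed/relaxed; it is never what you get by leaving the tag out.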

The same report for non-spam, however, suggests that the difference is 
probably not statistically significant:

+---------------------+-----------+------------+----------+
| avg(received_count) | hdr_canon | body_canon | count(*) |
+---------------------+-----------+------------+----------+
|              1.2570 |         0 |          0 |   220497 |
|              1.0971 |         0 |          1 |      412 |
|              1.4505 |         1 |          0 |   172136 |
|              2.0206 |         1 |          1 |   980337 |
+---------------------+-----------+------------+----------+
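
One way to compare the two hop-count tables is to pool the three
non-relaxed/relaxed rows into a single message-weighted average and set it
against the relaxed/relaxed row. This is a sketch over the published row
values only; the names SPAM, NON_SPAM, pooled_mean and compare are
illustrative.

```python
# Rows copied from the two hop-count tables above:
# (avg_received_count, hdr_canon, body_canon, message_count)
SPAM = [
    (1.0976, 0, 0, 2214),
    (1.0000, 0, 1, 7),
    (1.0338, 1, 0, 7569),
    (2.3349, 1, 1, 14086),
]
NON_SPAM = [
    (1.2570, 0, 0, 220497),
    (1.0971, 0, 1, 412),
    (1.4505, 1, 0, 172136),
    (2.0206, 1, 1, 980337),
]


def pooled_mean(rows):
    """Message-weighted mean of the per-row averages."""
    total = sum(n for _, _, _, n in rows)
    return sum(avg * n for avg, _, _, n in rows) / total


def compare(rows):
    """Return (mean hops for relaxed/relaxed, mean hops for all other rows)."""
    rr = [r for r in rows if (r[1], r[2]) == (1, 1)]
    rest = [r for r in rows if (r[1], r[2]) != (1, 1)]
    return pooled_mean(rr), pooled_mean(rest)


for label, rows in (("spam", SPAM), ("non-spam", NON_SPAM)):
    rr, rest = compare(rows)
    print(f"{label}: relaxed/relaxed {rr:.4f}, others {rest:.4f}, "
          f"gap {rr - rest:.4f}")
```

The tables give only means and counts, so this shows the size of the gap but
cannot say whether it is significant; that would need the per-message
Received counts or at least their variances.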

I don't know where all this is leading, but there you go.

-MSK


_______________________________________________
NOTE WELL: This list operates according to 
http://mipassoc.org/dkim/ietf-list-rules.html