Re: [ietf-dkim] New canonicalizations

Murray S. Kucherawy wrote:

According to what we have, the biggest users of "relaxed/relaxed" are the 
large mailbox providers like Gmail and Yahoo and other legitimate senders, 
not spammers.  The top 20, for example:

+----------------------------------+----------+
| name                             | count(*) |
+----------------------------------+----------+
| gmail.com                        |   421745 |
| yahoo.com                        |   313109 |
| facebookmail.com                 |   233441 |
| yahoogroups.com                  |   104523 |
| auth.ccsend.com                  |    90195 |
| linkedin.com                     |    74710 |
| google.com                       |    59049 |
| reply.newsmax.com                |    53286 |
| ATT.NET                          |    43602 |
| sbcglobal.net                    |    36534 |
| googlegroups.com                 |    34359 |
| e.groupon.com                    |    30350 |
| paypal.com                       |    24568 |
| f74d39fa044aa309eaea14b9f57fe79c |    21019 |
| emailinfo.bestbuy.com            |    17067 |
| ebay.com                         |    16192 |
| 636ae4d78ec2b46248fc59ac1ad737df |    14580 |
| expediamail.com                  |    13058 |
| bellsouth.net                    |    12431 |
| googlemail.com                   |    12426 |
+----------------------------------+----------+

Total relaxed/relaxed signatures received = 3444978; total above = 1626244 
(47%)
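
For anyone who wants to re-check the arithmetic, here is a minimal sketch that recomputes the top-20 subtotal and its share of all relaxed/relaxed signatures. The counts are copied from the table above; the variable names are only illustrative.

```python
# Counts copied from the top-20 table above, in table order.
top20_counts = [
    421745, 313109, 233441, 104523, 90195,
    74710, 59049, 53286, 43602, 36534,
    34359, 30350, 24568, 21019, 17067,
    16192, 14580, 13058, 12431, 12426,
]

# Total relaxed/relaxed signatures received, as stated above.
total_relaxed = 3444978

top20_total = sum(top20_counts)
share = top20_total / total_relaxed

print(f"top 20 = {top20_total}, share = {share:.1%}")
```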

In fact, the first domain name that (statistically) looked likely to be a 
spammer is way down on the list, around #106 (out of 63314), and everything 
before that accounted for 58% of total signatures.  So, our data don't agree 
with the claim, and certainly not with "by far".

But I don't understand why this is a useful line of analysis.  If spammers 
are using relaxed/relaxed, they merely have the same concern as a legitimate 
sender, namely signature survivability.  This shouldn't be a surprise.  I 
hope we're not talking about the idea of filtering based on which 
canonicalization is in use, which is almost certainly a bad idea.


Some good info Murray.

It is all reflective of what's called a Peer or Personal Network 
Community (PCN).

The collection you have is an aggregate of many sites.  However, in 
reality each site will have a different PCN.

I agree; even for my small site collection, the majority of the 
DKIM signed mail volume is from:

     Gmail
     Facebook

and for my PCN, the third is:

     mipassoc.org

But when you normalize it, a small part (3 to 4) of the 
total domains are, by far, good/bad spammers.

When we started our SMTP daily stats collection in 2003, it started on 
a per-site basis and the PCN patterns were obvious.  At some point, we 
automated the collection in an attempt to show an aggregation of 
the total sites.  Almost immediately, the various measurements were 
skewed in one direction or another simply because one or more sites 
had a higher measurement for one thing or another.
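
To illustrate the skew, here is a rough sketch with made-up numbers (the site names and counts are hypothetical, not from our collection). It compares each site's own rate with the rate you get after pooling all sites together.

```python
# HYPOTHETICAL per-site counts, for illustration only:
# (DKIM signed messages, signed messages from spamming domains)
sites = {
    "large_site": (1_000_000, 50_000),
    "small_site_a": (2_000, 1_400),
    "small_site_b": (1_500, 1_200),
}

# Rate seen by each site on its own (its PCN view).
per_site_rate = {
    name: spam / signed for name, (signed, spam) in sites.items()
}

# Rate after lumping every site into one aggregate.
total_signed = sum(signed for signed, _ in sites.values())
total_spam = sum(spam for _, spam in sites.values())
pooled_rate = total_spam / total_signed

for name, rate in per_site_rate.items():
    print(f"{name}: {rate:.1%}")
print(f"pooled: {pooled_rate:.1%}")
```

With numbers like these the pooled figure sits almost on top of the large site's rate, and what the two small sites see disappears from the summary.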

For my PCN, by the time mail is finally accepted, the RFC5322 payload 
is indeterminate (i.e. everything that could be done was done), and 
the analysis of the DKIM signed mail shows that most of it comes from 
spamming domains.

While you may be eager to publicly state this input is insignificant 
and doesn't matter, my 35+ years of producing software for thousands 
of customers and inter-operating with my industry peers says it is 
very significant.  One cannot always use a total aggregation summary 
to reflect what is true or false at the site level. The fact that DKIM 
analysis is in a limbo state is reflective of what I am stating.

Why not try redoing your stats for your PCN only and see what it shows?

Keep in mind we got out of the ESP business in 1998, but we still get a 
lot of dirty, inactive addresses coming our way.  So our PCN will be 
very different.

Finally, in my opinion you have two motivations for C14N and it could 
be based simply on the degrees of separation of what one deems important:

    Private Communications:

        Desire to have the most secure integrity

    Bulk, Public Communications:

        More relaxed, less secure, with a wide range of
        receivers; a relaxed algorithm minimizes C14N
        related issues.
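
To make the trade-off concrete, here is a minimal sketch of the "simple" and "relaxed" body canonicalizations as I read them in the DKIM specification (Section 3.4 of RFC 6376). The function names are my own, header canonicalization is left out, and this is an illustration rather than a reference implementation.

```python
import re


def simple_body(body: bytes) -> bytes:
    """'simple' body C14N: drop empty lines at the end, end with one CRLF."""
    while body.endswith(b"\r\n"):
        body = body[:-2]
    return body + b"\r\n"


def relaxed_body(body: bytes) -> bytes:
    """'relaxed' body C14N: also tolerate whitespace changes within lines."""
    lines = body.split(b"\r\n")
    # Strip whitespace at the end of each line, then collapse runs of
    # spaces/tabs inside the line to a single space.
    lines = [re.sub(rb"[ \t]+", b" ", line.rstrip(b" \t")) for line in lines]
    # Ignore empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    # Every remaining line ends with CRLF; an empty body stays empty.
    return b"".join(line + b"\r\n" for line in lines)


# The same text before and after a hypothetical relay reflowed whitespace.
as_signed = b"Hello \t world  \r\n\r\n"
as_received = b"Hello world\r\n"

print(simple_body(as_signed) == simple_body(as_received))
print(relaxed_body(as_signed) == relaxed_body(as_received))
```

Under "simple" the two bodies canonicalize differently, so the body hash would no longer match; under "relaxed" they canonicalize to the same bytes, which is the survivability a bulk sender is after.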

-- 
Hector Santos, CTO
http://www.santronics.com
http://santronics.blogspot.com


_______________________________________________
NOTE WELL: This list operates according to 
http://mipassoc.org/dkim/ietf-list-rules.html