
Re: Long SPF Records

2005-03-25 15:10:10
David MacQuigg wrote:
> Todd,
>
> Thanks for your excellent and authoritative answer to my questions. My use of the rr.com example was not to imply there was a problem with your setup, which appears to be very well managed. I was using it as an example of a large domain with many sub-domains, the kind of situation where we might expect a problem to occur if there were some future SPF-doom virus.

I am as guilty of using RR's visible config as an example. It is a complex setup that lends itself well to asking 'what if' questions.

If you mind this, I will pick another network as an example.

To your credit, you and megapathdsl.net are the first two ISPs that have shown willingness to improve their SPF records. This kind of demonstrated responsibility, along with the professionalism you've shown, gives me hope that other ISPs will do the same when called upon.

The example you set is a very good weapon against the argument we often hear that those who have already published SPF records want to forget about them, and that we must therefore accommodate them.

> On the question of rr.com caching records for its subdomains, if I understand you correctly, you are saying it's possible, but not necessary, and that you actually do better by distributing the load to nameservers in the subdomains. Your expectation is that queries to rr.com will be cached by the client's nameserver, so an SPF-doom attack would be amplified very little by the few extra queries to rr.com. That seems reasonable, but maybe Radu could comment on this. I can imagine rr.com being included, just because it has a long, though perfectly legal, SPF record.

Since you ask:

When I connect to my ISP, I get assigned an IP address, a gateway, and a couple of name servers. In my case, the nameservers I get assigned and the authoritative name servers for the ISP's domain are different IP addresses.

It's a perfect example of load balancing. Maybe they use a number of caching DNS servers proportional to how many customers they have. Maybe they even use some formula like 1000 customers per caching server to figure out when to add new machines. They probably do this for their SMTP and POP/IMAP services too. It makes sense.

So in the context of the zombies, if 500 customers of my ISP who happen to be assigned the same nameserver get infected, the 500 copies of the virus will be very efficient at finding targets, because once the first copy does a query, all the others will be answered from the server's cache. If the virus writers were any good, they would search the available list of domains in random order, to find all the SPF publishers faster.

The connection between the ISP's caching name server and the authoritative NS servers responsible for the various domains on the hit-list is probably 100Mbit. So the 500 viruses are not limited to scanning the world's DNS servers at 1Mbit each. Just by randomizing their search order they can be much more efficient. Teamwork, it's called. When a virus tries to find out about an SPF record that another virus has already expanded, the query is answered from the ISP's cache, so it can move on to the next target very quickly.
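
To put rough numbers on that teamwork effect, here is a small Python simulation (purely illustrative; the 100,000-domain hit-list and 2,000 lookups per zombie are made-up figures, and nothing here touches real DNS). It counts how many lookups are answered from the shared cache versus how many actually reach authoritative servers:

    import random

    # Illustrative only: zombies behind one caching resolver, scanning a hit-list
    # of domains in random order. All the figures are assumptions for the sketch.
    ZOMBIES = 500
    HITLIST_SIZE = 100_000
    LOOKUPS_PER_ZOMBIE = 2_000

    cache = set()              # domains already resolved and cached by the ISP resolver
    authoritative_queries = 0  # cache misses that reach the authoritative servers
    total_lookups = 0

    for _ in range(ZOMBIES):
        # each zombie walks the hit-list in its own random order
        for domain in random.sample(range(HITLIST_SIZE), LOOKUPS_PER_ZOMBIE):
            total_lookups += 1
            if domain not in cache:
                authoritative_queries += 1  # miss: the resolver asks the authoritative NS
                cache.add(domain)           # ...and the answer is cached for everyone else

    print(f"zombie lookups:        {total_lookups}")
    print(f"authoritative queries: {authoritative_queries}")
    print(f"answered from cache:   {total_lookups - authoritative_queries}")

With those made-up numbers, roughly nine out of ten lookups are cache hits, which is why the discovery phase costs the individual zombies so little.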

Note that the discovery phase essentially has the ISPs' caching servers DDoSing the hit-list at 100Mbit each.

That was the hit-list discovery phase. DNS caching makes it very efficient. Unfortunately, there's nothing one can do, other than disallowing customers from making TXT queries. That would stop the discovery cold. But it would also break a few things, like Sender ID, which runs on the client side and depends on TXT queries. So we can't do that.

Next, the attack phase. Each of the 500 viruses connects to the innocent MTAs (the list includes all domains on the internet) and sends the 60-byte packet to each one. Each virus again randomizes its order, and uses 1Mbit of bandwidth. Actually, my earlier example said each virus would contact 72 MTAs per second. Let's stick with that.

I don't know how many innocent MTAs are out there, but let's say 1,000,000. The 500 viruses each contact those MTAs at a rate of 72 per second (as per the calculation I did earlier). Roughly, they transfer 200 uplink bytes per MTA contact, including all the connection overhead and so on. The downlink usage is probably about 500 bytes, because of the MTA banners that give their name, supported features and so on. So each virus uses 14.4KBps uplink and 36KBps downlink to bug 72 MTAs per second. That's 36,000 MTAs 'bugged' per second.

On the ISP's connectivity, 500 * (14.4KB up + 36KB down) works out to 7.2MBps uplink and 18MBps downlink, slightly more than that 100Mbit duplex pipe can take. So the viruses will be throttled back (perhaps to only bugging 25,000 MTAs per second), while the ISP is essentially 'tapping' MTAs with all its might.
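
Putting the attack-phase arithmetic into a small Python script (same assumptions as in the text: 500 zombies, 72 MTAs contacted per second each, roughly 200 bytes up and 500 bytes down per contact, and a 100Mbit duplex pipe at the ISP):

    # Back-of-envelope numbers from the paragraphs above; every input is an assumption.
    ZOMBIES = 500
    MTAS_PER_SEC_PER_ZOMBIE = 72
    BYTES_UP_PER_CONTACT = 200     # the 60-byte trigger plus connection overhead
    BYTES_DOWN_PER_CONTACT = 500   # MTA banner, supported features, and so on
    ISP_PIPE_BITS_PER_SEC = 100_000_000  # 100Mbit each way

    up_per_zombie = MTAS_PER_SEC_PER_ZOMBIE * BYTES_UP_PER_CONTACT      # 14,400 B/s
    down_per_zombie = MTAS_PER_SEC_PER_ZOMBIE * BYTES_DOWN_PER_CONTACT  # 36,000 B/s

    isp_up = ZOMBIES * up_per_zombie      # 7.2 MB/s uplink
    isp_down = ZOMBIES * down_per_zombie  # 18 MB/s downlink
    mtas_per_sec = ZOMBIES * MTAS_PER_SEC_PER_ZOMBIE  # 36,000 MTAs/s, unthrottled

    print(f"ISP uplink:   {isp_up * 8 / 1e6:.1f} Mbit/s")    # about 57.6 Mbit/s
    print(f"ISP downlink: {isp_down * 8 / 1e6:.1f} Mbit/s")  # about 144 Mbit/s

    # The downlink exceeds the 100Mbit pipe, so the zombies get throttled to
    # roughly 100/144 of their rate, i.e. about 25,000 MTAs per second.
    throttled = mtas_per_sec * ISP_PIPE_BITS_PER_SEC / (isp_down * 8)
    print(f"throttled rate: {throttled:.0f} MTAs/s")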

Remember that the virus only 'taps' the MTA, and then the MTA 'pounds' the DNS. The amplification factor there was 60x. This means that for the 60 bytes of input, the MTA would use 3600 bytes of bandwidth to verify the SPF record pointed to in the 60-byte packet.

So the 1,000,000 MTAs are now under attack, 25,000 per second. 25KHz - that's a buzz that only a cat might hear.

Overall, the 25,000 MTAs will generate 7.2MB * 60 of virus-related traffic per second. This means 432MB (4.32Gbits) in aggregate traffic, i.e. spread evenly across the 25,000 MTAs.

The 432MBytes per second is sucked out of the DNS servers of those publishing expensive records. So even though the attack is launched from 25,000 places on earth, it is focused on a few domains, and on a handful of DNS providers.
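
Here is the same amplification arithmetic as a script (using the figures quoted above: a 60-byte trigger, 60x amplification, and the 7.2 MB/s of virus uplink, i.e. the unthrottled figure the text uses):

    # Amplification arithmetic from the paragraphs above; inputs are assumptions.
    TRIGGER_BYTES = 60        # forged envelope data handed to each MTA
    AMPLIFICATION = 60        # DNS bytes spent per input byte on an expensive SPF record
    VIRUS_UPLINK_BYTES_PER_SEC = 7_200_000  # the 7.2 MB/s aggregate from one small ISP

    dns_bytes_per_trigger = TRIGGER_BYTES * AMPLIFICATION           # 3,600 bytes
    dns_bytes_per_sec = VIRUS_UPLINK_BYTES_PER_SEC * AMPLIFICATION  # 432,000,000 bytes

    print(f"DNS traffic per trigger: {dns_bytes_per_trigger} bytes")
    print(f"aggregate DNS traffic:   {dns_bytes_per_sec / 1e6:.0f} MB/s "
          f"({dns_bytes_per_sec * 8 / 1e9:.2f} Gbit/s)")
    # Note: 432 MB/s is about 3.46 Gbit/s at 8 bits per byte; the text above
    # quotes roughly 4.32 Gbit for the same quantity.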

But if SPF were more widely adopted, there would be more targets, so this attack would be more evenly distributed.

Since most of those innocent MTAs have likely never heard of the domain names in question (those with expensive records), they can do nothing but query the root servers for the NS records of those domains. So the attack percolates all the way up to the root servers.
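
For what it's worth, the cold-cache case is easy to see directly: a resolver that has never heard of a domain has to start at a root server to find out who is authoritative for it. A minimal sketch using the dnspython library (an assumption on my part; no MTA literally runs this code), querying a.root-servers.net:

    import dns.message
    import dns.query
    import dns.rdatatype

    # Cold-cache first hop: ask a root server for the NS records of a domain it has
    # not cached. The root answers with a referral to the TLD servers, and the
    # resolver walks down from there. "example.com" is just a placeholder name.
    query = dns.message.make_query("example.com", dns.rdatatype.NS)
    response = dns.query.udp(query, "198.41.0.4", timeout=5)  # a.root-servers.net

    for rrset in response.authority:
        print(rrset)  # the .com referral: this is the load that lands on the roots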

What I described was a tiny ISP with a 100Mbit connection. Most ISPs have much bigger pipes.


I think this is generally the worksheet. If my assumption numbers are off, and you have a more realistic set of numbers, please provide them. The real numbers may be lower, or they may be higher. How much would I have to be off by to reduce the magnitude of the problem from 4.32Gbps to 100Mbps (which would mean getting rid of the amplification)? And what is the most I could be off by on the low side? How much worse than I imagine could this be?

Regards,
Radu.

