
Re: Long SPF Records

2005-03-25 15:10:10
David MacQuigg wrote:
> Todd,
>
> Thanks for your excellent and authoritative answer to my questions. My use of the rr.com example was not to imply there was a problem with your setup, which appears to be very well managed. I was using it as an example of a large domain with many sub-domains, the kind of situation where we might expect a problem to occur if there were some future SPF-doom virus.

I am as guilty of using RR's visible config as an example. It is a complex setup that lends itself well to asking 'what if' questions.

If you mind this, I will pick another network as an example.

To your credit, you and megapathdsl.net are the first two ISPs that have shown willingness to improve their SPF records. This kind of demonstrated responsibility, along with the professionalism you've shown, gives me hope that other ISPs will do the same when called upon.

The example you set is a very good weapon against the argument we often hear that those who have already published SPF records want to forget about them, and that we must therefore accommodate them.

> On the question of rr.com caching records for its subdomains, if I understand you correctly, you are saying it's possible, but not necessary, and that you actually do better by distributing the load to nameservers in the subdomains. Your expectation is that queries to rr.com will be cached by the client's nameserver, so an SPF-doom attack would be amplified very little by the few extra queries to rr.com. That seems reasonable, but maybe Radu could comment on this. I can imagine rr.com being included, just because it has a long, though perfectly legal, SPF record.

Since you ask:

When I connect to my ISP, I get assigned an IP address, a gateway, and a couple of name servers. In my case, the nameservers I get assigned and the authoritative name servers for the ISP's domain are different IP addresses.

It's a perfect example of load balancing. Maybe they use a number of caching DNS servers proportional to how many customers they have. Maybe they even use some formula like 1000 customers per caching server to figure out when to add new machines. They probably do this for their SMTP and POP/IMAP services too. It makes sense.

So in the context of the zombies, if 500 customers of my ISP who happen to be assigned the same nameserver get infected, the 500 copies of the virus will be very efficient at finding targets, because once the first copy does a query, all the others will be answered from the server's cache. If the virus writers were any good, they would search the available list of domains in random order, to find all the SPF publishers faster.

The connection between the ISP's caching name server and the authoritative NS servers responsible for the various domains on the hit-list is probably 100Mbit. So the 500 viruses are not limited to scanning the world's DNS servers at 1Mbit each. Just by randomizing their search order they can be much more efficient. Teamwork, it's called. When a virus tries to find out about an SPF record that another virus has already expanded, the query is answered from the ISP's cache, so it can move on to the next target very quickly.
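
To put rough numbers on that teamwork effect, here is a small Python simulation (purely illustrative; the 100,000-domain hit-list and 2,000 lookups per zombie are made-up figures, and nothing here touches real DNS). It counts how many lookups are answered from the shared cache versus how many actually reach authoritative servers:

    import random

    # Illustrative only: zombies behind one caching resolver, scanning a hit-list
    # of domains in random order. All the figures are assumptions for the sketch.
    ZOMBIES = 500
    HITLIST_SIZE = 100_000
    LOOKUPS_PER_ZOMBIE = 2_000

    cache = set()              # domains already resolved and cached by the ISP resolver
    authoritative_queries = 0  # cache misses that reach the authoritative servers
    total_lookups = 0

    for _ in range(ZOMBIES):
        # each zombie walks the hit-list in its own random order
        for domain in random.sample(range(HITLIST_SIZE), LOOKUPS_PER_ZOMBIE):
            total_lookups += 1
            if domain not in cache:
                authoritative_queries += 1  # miss: the resolver asks the authoritative NS
                cache.add(domain)           # ...and the answer is cached for everyone else

    print(f"zombie lookups:        {total_lookups}")
    print(f"authoritative queries: {authoritative_queries}")
    print(f"answered from cache:   {total_lookups - authoritative_queries}")

With those made-up numbers, roughly nine out of ten lookups are cache hits, which is why the discovery phase costs the individual zombies so little.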

Note that the discovery phase essentially has the ISPs' caching servers DDoSing the hit-list at 100Mbit each.

That was the hit-list discovery phase. DNS caching makes it very efficient. Unfortunately, there's nothing one can do, other than disallowing customers from making TXT queries. That would stop the discovery cold. But it would also break a few things, like Sender ID, which runs on the client side and depends on TXT queries. So we can't do that.

Next, the attack phase. Each of the 500 viruses connects to the innocent MTAs (the list includes all domains on the internet) and sends the 60-byte packet to each one. Each virus again randomizes its order, and uses 1Mbit of bandwidth. Actually, my earlier example said each virus would contact 72 MTAs per second. Let's stick with that.

I don't know how many innocent MTAs are out there, but let's say 1,000,000. The 500 viruses each contact those MTAs at a rate of 72 per second (as per the calculation I did earlier). Roughly, they transfer 200 uplink bytes per MTA contact, including all the connection overhead and so on. The downlink usage is probably about 500 bytes, because of the MTA banners that give their name, supported features and so on. So each virus uses 14.4KBps uplink and 36KBps downlink to bug 72 MTAs per second. That's 36,000 MTAs 'bugged' per second.

On the ISP's connectivity, 500 * (14.4KB up + 36KB down) works out to 7.2MBps uplink and 18MBps downlink, slightly more than that 100Mbit duplex pipe can take. So the viruses will be throttled back (perhaps to only bugging 25,000 MTAs per second), while the ISP is essentially 'tapping' MTAs with all its might.
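
Putting the attack-phase arithmetic into a small Python script (same assumptions as in the text: 500 zombies, 72 MTAs contacted per second each, roughly 200 bytes up and 500 bytes down per contact, and a 100Mbit duplex pipe at the ISP):

    # Back-of-envelope numbers from the paragraphs above; every input is an assumption.
    ZOMBIES = 500
    MTAS_PER_SEC_PER_ZOMBIE = 72
    BYTES_UP_PER_CONTACT = 200     # the 60-byte trigger plus connection overhead
    BYTES_DOWN_PER_CONTACT = 500   # MTA banner, supported features, and so on
    ISP_PIPE_BITS_PER_SEC = 100_000_000  # 100Mbit each way

    up_per_zombie = MTAS_PER_SEC_PER_ZOMBIE * BYTES_UP_PER_CONTACT      # 14,400 B/s
    down_per_zombie = MTAS_PER_SEC_PER_ZOMBIE * BYTES_DOWN_PER_CONTACT  # 36,000 B/s

    isp_up = ZOMBIES * up_per_zombie      # 7.2 MB/s uplink
    isp_down = ZOMBIES * down_per_zombie  # 18 MB/s downlink
    mtas_per_sec = ZOMBIES * MTAS_PER_SEC_PER_ZOMBIE  # 36,000 MTAs/s, unthrottled

    print(f"ISP uplink:   {isp_up * 8 / 1e6:.1f} Mbit/s")    # about 57.6 Mbit/s
    print(f"ISP downlink: {isp_down * 8 / 1e6:.1f} Mbit/s")  # about 144 Mbit/s

    # The downlink exceeds the 100Mbit pipe, so the zombies get throttled to
    # roughly 100/144 of their rate, i.e. about 25,000 MTAs per second.
    throttled = mtas_per_sec * ISP_PIPE_BITS_PER_SEC / (isp_down * 8)
    print(f"throttled rate: {throttled:.0f} MTAs/s")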

Remember that the virus only 'taps' the MTA, and then the MTA 'pounds' the DNS. The amplification factor there was 60x. This means that for the 60 bytes of input, the MTA would use 3600 bytes of bandwidth to verify the SPF record pointed to in the 60-byte packet.

So the 1,000,000 MTAs are now under attack, 25,000 per second. 25KHz - that's a buzz that only a cat might hear.

Overall, the 25,000 MTAs will generate 7.2MB * 60 of virus-related traffic per second. This means 432MB (4.32Gbits) in aggregate traffic, i.e. spread evenly across the 25,000 MTAs.

The 432MBytes per second is sucked out of the DNS servers of those publishing expensive records. So even though the attack is launched from 25,000 places on earth, it is focused on a few domains, and on a handful of DNS providers.
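
Here is the same amplification arithmetic as a script (using the figures quoted above: a 60-byte trigger, 60x amplification, and the 7.2 MB/s of virus uplink, i.e. the unthrottled figure the text uses):

    # Amplification arithmetic from the paragraphs above; inputs are assumptions.
    TRIGGER_BYTES = 60        # forged envelope data handed to each MTA
    AMPLIFICATION = 60        # DNS bytes spent per input byte on an expensive SPF record
    VIRUS_UPLINK_BYTES_PER_SEC = 7_200_000  # the 7.2 MB/s aggregate from one small ISP

    dns_bytes_per_trigger = TRIGGER_BYTES * AMPLIFICATION           # 3,600 bytes
    dns_bytes_per_sec = VIRUS_UPLINK_BYTES_PER_SEC * AMPLIFICATION  # 432,000,000 bytes

    print(f"DNS traffic per trigger: {dns_bytes_per_trigger} bytes")
    print(f"aggregate DNS traffic:   {dns_bytes_per_sec / 1e6:.0f} MB/s "
          f"({dns_bytes_per_sec * 8 / 1e9:.2f} Gbit/s)")
    # Note: 432 MB/s is about 3.46 Gbit/s at 8 bits per byte; the text above
    # quotes roughly 4.32 Gbit for the same quantity.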

But if SPF were more widely adopted, there would be more targets, so this attack would be more evenly distributed.

Since most of those innocent MTAs have likely never heard of the domain names in question (those with expensive records), they can do nothing but query the root servers for the NS records of those domains. So the attack percolates all the way up to the root servers.
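
For what it's worth, the cold-cache case is easy to see directly: a resolver that has never heard of a domain has to start at a root server to find out who is authoritative for it. A minimal sketch using the dnspython library (an assumption on my part; no MTA literally runs this code), querying a.root-servers.net:

    import dns.message
    import dns.query
    import dns.rdatatype

    # Cold-cache first hop: ask a root server for the NS records of a domain it has
    # not cached. The root answers with a referral to the TLD servers, and the
    # resolver walks down from there. "example.com" is just a placeholder name.
    query = dns.message.make_query("example.com", dns.rdatatype.NS)
    response = dns.query.udp(query, "198.41.0.4", timeout=5)  # a.root-servers.net

    for rrset in response.authority:
        print(rrset)  # the .com referral: this is the load that lands on the roots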

What I described was a tiny ISP with a 100Mbit connection. Most ISPs have much bigger pipes.


I think this is generally the worksheet. If my assumption numbers are off, and you have a more realistic set of numbers, please provide them. The real numbers may be lower, or they may be higher. How much would I have to be off by to reduce the magnitude of the problem from 4.32Gbps to 100Mbps (which would mean getting rid of the amplification)? And what is the most I could be off by on the low side? How much worse than I imagine could this be?

Regards,
Radu.

