Re: Long SPF Records
David MacQuigg wrote:
Thanks for your excellent and authoritative answer to my questions. My
use of the rr.com example was not to imply there was a problem with your
setup, which appears to be very well managed. I was using it as an
example of a large domain with many sub-domains, the kind of situation
where we might expect a problem to occur if there were some future ...
I am just as guilty of using RR's visible config as an example. It is a
complex setup that lends itself well to asking 'what if' questions.
If you mind, I will pick another network as an example.
To your credit, you and megapathdsl.net are the first two ISPs that
have shown a willingness to improve their SPF records. This kind of
demonstrated responsibility, along with the professionalism you've
shown, gives me hope that other ISPs will do the same when called upon.
The example you set is a strong counter to the argument, often heard,
that those who have already published SPF records want to forget about
them, and that we must therefore accommodate them.
On the question of rr.com caching records for its subdomains, if I
understand you correctly, you are saying it's possible, but not
necessary, and that you actually do better by distributing the load to
nameservers in the subdomains. Your expectation is that queries to
rr.com will be cached by the client's nameserver, so an SPF-doom attack
would be amplified very little from the few extra queries to rr.com.
That seems reasonable, but maybe Radu could comment on this. I can
imagine rr.com being targeted by such an attack, just because it has a
long, though perfectly legal, SPF record.
Since you ask:
When I connect to my ISP, I get assigned an IP address, a gateway, and
a couple of name servers. In my case, the nameservers I get assigned
and the authoritative name servers for the ISP's domain are different
IP addresses.
It's a perfect example of load balancing. Maybe they run a number of
caching DNS servers proportional to how many customers they have. Maybe
they even use some formula like 1,000 customers per caching server to
figure out when to add new machines. They probably do this for their
SMTP and POP/IMAP services too. It makes sense.
So in the context of the zombies: if 500 customers of my ISP who happen
to be assigned the same nameserver get infected, the 500 copies of the
virus will be very efficient at finding targets, because once the first
copy of the virus does a query, all the others will just be answered
from the server's cache. If the virus writers were any good, they would
implement a randomized search through the available list of domains, to
find all the SPF publishers faster.
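As a rough sketch of how the shared cache multiplies the zombies' efficiency (the hit-list size, client count, and probe counts below are made-up illustration numbers, not anything observed at a real ISP):

```python
import random

# Hypothetical simulation of the discovery phase: many infected clients
# behind one shared caching resolver probe a hit-list of domains. Once
# any one client has resolved a domain, the answer sits in the shared
# cache, so the authoritative servers are hit only once per domain
# (per TTL). All sizes here are assumptions for illustration.
random.seed(1)

HITLIST = [f"domain{i}.example" for i in range(10_000)]
CLIENTS = 500
PROBES_PER_CLIENT = 100

cache = set()          # domains already held by the shared resolver
upstream_queries = 0   # queries that actually leave the ISP
total_queries = 0      # queries issued by the infected clients

for _ in range(CLIENTS):
    # each client probes a random slice of the hit-list
    for domain in random.sample(HITLIST, PROBES_PER_CLIENT):
        total_queries += 1
        if domain not in cache:
            upstream_queries += 1   # cache miss: goes upstream
            cache.add(domain)

print(f"{total_queries} client queries, {upstream_queries} upstream")
```

With these made-up numbers, the 50,000 client queries collapse into at most 10,000 upstream queries (one per distinct domain), and the ratio only improves as more clients share the same resolver.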
The connection between the ISP's caching name server and the
authoritative NS servers responsible for the various domains on the
hit-list is probably 100 Mbit. So the 500 viruses are not limited to
scanning the world's DNS servers at 1 Mbit each. Just by randomizing
their search order they can be much more efficient. Teamwork, it's
called. When a virus tries to find out about an SPF record that another
virus has already expanded, the queries will be answered from the ISP's
cache, so it can move on to the next target very quickly.
Note that the discovery phase essentially has the ISPs' caching servers
DDoSing the hit-list at 100 Mbit each.
That was the hit-list discovery phase. DNS caching makes it very
efficient. Unfortunately, there's nothing one can do, other than
disallowing customers to make TXT queries. That would stop the
discovery cold. But it would also break a few things, like Sender-ID,
which runs on the client side and depends on TXT queries. So we can't
do that.
Next, the attack phase. Each of the 500 viruses connects to the
innocent MTAs (the list includes all domains on the internet) and sends
the 60-byte packet to each one. Each virus again randomizes its order,
and uses 1 Mbit. Actually, my earlier example said each virus would
contact 72 MTAs per second. Let's stick with that.
I don't know how many innocent MTAs are out there, but let's say
1,000,000. The 500 viruses each contact those MTAs at a rate of 72 per
second (as per the calculation I did earlier). Roughly, they transfer
200 uplink bytes per MTA contact, including all the connection overhead
and so on. The downlink usage is probably about 500 bytes, because of
the MTA banners that give their name, supported features and so on. So
each virus uses 14.4 KB/s of uplink and 36 KB/s of downlink bandwidth
to bug 72 MTAs per second. That's 36,000 MTAs 'bugged' per second.
On the ISP's connectivity, 500 * (14.4 KB/s up + 36 KB/s down) works
out to 7.2 MB/s uplink and 18 MB/s downlink. The downlink alone is
about 144 Mbit/s, slightly more than that 100 Mbit duplex pipe can
take. So the viruses will be throttled back (perhaps to only bugging
25,000 MTAs per second), while the ISP is essentially 'tapping' MTAs
with all its might.
Remember that the virus only 'taps' the MTA, and then the MTA 'pounds'
the DNS. The amplification factor there was 60x. This means that for
the 60 bytes of input, the MTA would use 3,600 bytes of bandwidth to
verify the SPF record pointed to in the 60-byte packet.
So the 1,000,000 MTAs are now under attack, 25,000 per second. 25 kHz:
that's a buzz that only a cat might hear.
Overall, those MTAs will generate the viruses' 7.2 MB/s of uplink
amplified 60x in virus-related traffic per second. That means 432 MB/s
(roughly 4.32 Gbit/s) in aggregate traffic, spread evenly across the
25,000 MTAs contacted each second.
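The amplification step, in the same back-of-envelope style. The 60x factor and the 7.2 MB/s uplink figure come from the text; note that reading 432 MB/s as "4.32 Gbit/s" implies a rough 10-bits-per-byte conversion, since at a strict 8 bits per byte it is closer to 3.5 Gbit/s:

```python
# Amplification arithmetic, using the figures from the text.
TRIGGER_BYTES = 60     # size of the packet the virus sends an MTA
AMPLIFICATION = 60     # DNS bytes the MTA spends per trigger byte

dns_bytes_per_trigger = TRIGGER_BYTES * AMPLIFICATION   # 3600 bytes

uplink_MBps = 7.2                       # the viruses' aggregate uplink
dns_MBps = uplink_MBps * AMPLIFICATION  # 432.0 MB/s of DNS-side traffic
print(dns_bytes_per_trigger, dns_MBps)
# The text's "4.32 Gbit" reading of 432 MB/s uses a rough 10-bits-per-
# byte rule of thumb; at 8 bits/byte it is about 3.46 Gbit/s.
```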
That 432 MB/s is sucked out of the DNS servers of those publishing
expensive records. So even though it is launched from 25,000 places on
earth, it is focused on a few of the domains, and on a handful of DNS
providers.
But if SPF were more widely adopted, there would be more targets, so
this attack would be more evenly distributed.
Since most of those innocent MTAs have likely never heard of the domain
names in question (those with expensive records), they can do nothing
but query the root servers for the NS records of those domains.
So the attack percolates all the way up to the root servers.
What I described was a tiny ISP with a 100Mbit connection. Most ISPs
have much bigger pipes.
I think this is generally the worksheet. If my assumed numbers are off,
and you have a more realistic set of numbers, please provide them. The
real numbers may be lower, but they may also be higher. How far off
would I have to be to reduce the magnitude of the problem from
4.32 Gbit/s to 100 Mbit/s (which would mean getting rid of the
amplification)? And what is the most I can be off by on the low side?
How much worse than I imagine can this be?
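To put a number on that sensitivity question, under the figures used above:

```python
# How far off do the assumptions have to be for the amplification to
# disappear, i.e. for the 4.32 Gbit/s estimate to shrink to the
# 100 Mbit/s the ISP's pipe could emit on its own?
attack_bps = 4.32e9   # estimated aggregate attack traffic (from above)
pipe_bps = 100e6      # the ISP's own 100 Mbit/s pipe

overestimate_factor = attack_bps / pipe_bps
print(overestimate_factor)   # 43.2
```

So the combined assumptions (virus count, contact rate, amplification factor) would have to be overestimated by a factor of about 43 before the attack stops being an amplification at all.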