spf-discuss

Re: short circuiting evaluation

2005-03-25 02:50:13

This TTL discussion about the A records that exists: uses is academic, as
I've changed my position on their absolutely needing to be low.

I think changing the TTL based on load is extremely dangerous. I assume 
you mean that the TTL is increased when the load increases. 

The TTL I was suggesting for these exists: records is either near zero
or significantly smaller than what you'd normally use if you wanted to
avoid significant downtime, so that even a 2x or 3x increase still comes
to less than 5 or 10 minutes.

If you're getting hammered with queries at x per second with a TTL of 10
seconds, then your load would be x/2 if you increase the TTL to 20
seconds.  This isn't significantly different, but when there's an attack
going on, this would reduce your load without significantly hindering
your ability to fail over.
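The arithmetic above can be sketched with a toy model (the resolver count is a made-up number, and the model assumes each caching resolver re-queries exactly once per TTL):

```python
# Toy model (assumption): steady-state query load on an authoritative
# server is roughly (number of caching resolvers) / TTL, since each
# resolver re-queries once per TTL.  10_000 resolvers is an invented figure.
def queries_per_second(resolvers: int, ttl_seconds: int) -> float:
    """Approximate authoritative-server query rate behind caches."""
    return resolvers / ttl_seconds

load_10s = queries_per_second(10_000, 10)   # 1000.0 qps
load_20s = queries_per_second(10_000, 20)   # 500.0 qps
assert load_20s == load_10s / 2  # doubling the TTL halves the load
```

The trade-off is symmetric, of course: the same doubling also doubles how long a failed-over record can linger stale in caches.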

So if 
something fails under the heavier load and you have to relocate it, 
you'll suffer longer downtime because the TTLs are longer.

There's also nothing keeping it from going the other way -- set the TTL
to 1 hour in the normal case.  If it looks like your load is up because
your domain is being forged, decrease it so that in the case you do need
to fail over, your downtime window is decreased.  That is, if you've
been serving TTLs of 1 hour, and you change to 30 minutes, then cache
entries will expire in an average of 45 minutes, thereby reducing the
length of your downtime/inaccessibility due to caching.


I'll think about whether it would actually help in this 
scenario, but for a typical case, say the TTL of a web-server address, 
it's definitely a bad idea to increase the TTL as you're pushing the 
equipment closer to failure. 

Sure, except we are specifically not talking about "typical cases" here,
and especially not web-servers.  If I send email and your server is
overloaded, I may get a DSN from my server saying it temporarily can't
connect, but assuming your load comes down in a reasonable amount of
time, the mail will go through with no action on anyone else's part. 
Web servers are really different, in that if your retail website isn't
responding, people will immediately go to your competitors.

In the "typical case", email goes through.  The atypical case is the
mythical SPF-doom virus that is pounding on mail servers causing mail
servers to pound on DNS through SPF.  I thought we were trying to
optimize for the atypical case here.



Ok, but let's look at a higher connection rate for a second.

If you get 255 connections from around the world, 1 from each A class 
net, you have to do 1 query (TXT) + 243 A (for those which are in 
different class A nets than ebay's servers) + 8 queries for those that 
are in the same class A nets as ebay.

This is a fine thought experiment, but most likely not that realistic. 
Chances are, most, if not all, zombied machines are going to come from
some small (and maybe even predictable) set of class A addresses during
any given single attack.  It seems most of the single digit class As are
out immediately, for example.  I'd think large attacks are going to come
from IP blocks that are hosting connectivity services sold to the
public/consumers.



1 for the exists and 1 for the MX (if the entire MX list fits in the
additional portion of the MX response).  In any case, this gains back
some of the usefulness of the other mechanisms without having to
recompile (or test for needing to recompile) continually and without
forcing their complex evaluation in all instances.  The cache expire
time for the records used in exists should definitely be kept low.

Excellent! So let's look at a 24 hour period. Say that we get 2540 
connections per hour, 10 from each class A network. Let's assume a TTL 
of 24H for the MX, 1H for the exists records, and 1H for the TXT record.

Recall I explained why the exists records have the same TTL as the TXT 
record.

Total traffic with your method:
1*MX + 24*TXT  +  24*254*A = 6121 queries during the 24H period.

In total, you called ns_resolv 24*2540*3 times (182880 times). So the 
cache saved you traffic 96.6% of the time.

This is a very convenient calculation that makes my masking method using
exists look significantly worse.  I don't believe it actually needs to
be that bad.  You assume that all the queries in exists would need a
short TTL, or even the same TTL, and I initially agreed because of the
failover scenario.  One of the advantages of my method, even taking into
account your "I need a short TTL so I can fail over" scenario, is that
all the other SPF mechanisms are usable (as long as they don't cross
administrative boundaries where you don't know how things could change)
without having to recompile the record at all.

You should keep the TTL for any given A record used in exists low if you
plan on using an address in that class A as part of your failover
plan.  Fortunately, most of them won't be used.  If you're the kind of
person who is prepared for failover such that the TTL is a concern, you
already know where you are going to failover to (it may even be one of
the addresses that is already listed in the MX).  Say my MX is on 1/8
and my failover is at a different ISP (which is otherwise unlisted, not
even as a backup MX) on 2/8.  I have these records:

                       24h IN TXT "v=spf1 -exists:%{ir1}._spf.%{d}"
                                  " +mx -all"
                        1h IN MX 10 mailhost
$GENERATE 3-254 $._spf 24h IN  A 127.0.0.1
                2._spf  1h IN  A 127.0.0.1
              mailhost  1h IN  A 1.1.1.1
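For illustration, here is a rough sketch of how that -exists mask behaves. The helper names and the example.com domain are my own, and a Python set stands in for real DNS resolution; %{ir1} expands to the rightmost label of the reversed IP, i.e. the connecting address's first octet:

```python
# Sketch only: a set of owner names stands in for the zone above;
# mask_lookup_name() and blocked_by_mask() are hypothetical helpers.
def mask_lookup_name(ip: str, domain: str) -> str:
    first_octet = ip.split(".")[0]           # what %{ir1} expands to
    return f"{first_octet}._spf.{domain}"

def blocked_by_mask(ip: str, domain: str, zone: set) -> bool:
    """True when -exists matches, i.e. the sender fails immediately."""
    return mask_lookup_name(ip, domain) in zone

# From the records above: 2._spf plus the $GENERATEd 3._spf..254._spf
# exist; 1._spf (the class A actually hosting the MX) deliberately does not.
zone = {f"{n}._spf.example.com" for n in range(2, 255)}
assert blocked_by_mask("9.8.7.6", "example.com", zone)      # fails fast
assert not blocked_by_mask("1.1.1.1", "example.com", zone)  # falls through to +mx
```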

That is, the TXT and 252 of the class A exists records are cacheable for
24 hours, and the ones I need to change if I fail over (two As and the
MX) are 1 hour.  At 2540 connections per hour, 10 from each class A,
this design makes

        24*MX + 1*TXT + 1*252*A + 24*2A = 325 queries

calling ns_resolv 24*2540*3 (182880) times, with a cache hit percentage
of 99.82%.

(As before, I'm assuming the load of 1*MX includes the lookup of the
resultant A records, thus it's fixed.)  By taking our actual
current and failover network information into account, the number of
queries have been reduced by nearly 95% over that 24 hour period, and
the cache hit is significantly better.  And the TTLs that should be
longer can still be without significant hits to our failover plan.
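Those figures can be re-derived mechanically; this is just a transcription of the arithmetic in the text, not new data:

```python
# Recomputing the two 24-hour totals (2540 connections/hour,
# 10 from each of 254 class A nets, 3 lookups per connection).
HOURS, CONNS_PER_HOUR = 24, 2540

# Uniform short TTLs: 1*MX + 24*TXT + 24*254*A
uniform_queries = 1 + 24 + 24 * 254
assert uniform_queries == 6121

# Split TTLs: 24*MX + 1*TXT + 1*252*A + 24*2A
# (the two hourly A records are 2._spf and mailhost)
split_queries = 24 + 1 + 252 + 24 * 2
assert split_queries == 325

lookups = HOURS * CONNS_PER_HOUR * 3
assert lookups == 182_880
assert round((1 - split_queries / lookups) * 100, 2) == 99.82   # cache hit %
assert 1 - split_queries / uniform_queries > 0.94               # ~95% fewer queries
```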

In addition, there is less to go wrong with the failover scheme because
fewer records need to be changed (either manually or through automatic
means); you'd have to change the 2 to a 1 in the 2._spf record and the
IP in the A record for mailhost.  Nothing needs to be recompiled.  The
SPF, the largest of all the records, also doesn't need to be touched, so
there is less chance of screwing up that complex looking thing.  So even
if you screw up the DNS changes for failover, it's not as damaging as
compiling the incorrect information to a record that has a longer cache
time (because you base the cache time on the minimum of the inputs'
TTLs).  Fewer points of failure are always good when you're under the
stress of dealing with a failover.

If I'm more correct about zombie distribution than you are, then the
largest term in the number of queries per day calculation, the 1*252*A,
might be significantly less because of the distribution of zombiable
computers being concentrated on popular class As.


I've included your original calculations for your method below for
reference.

With my method, mask included at the end of the top level TXT, total of 
9 records with the same TTL of 1H. The records are fully compiled and 
contain only IP4 and redirects.

Total traffic with my method:
24*1*TXT = 24 queries, if the mask is top notch.
24*9*TXT = 216 queries, if the mask is useless.

More likely the actual number of queries is between 24 and 216.

In total, I called ns_resolv 24*2540*1=60960 times if the mask 
was top notch and 24*2540*9=548640 times if the mask was crap. So the 
cache saved me traffic exactly 99.96% of the time, whether the mask was 
good or not.

As you can see, there's a huge difference, and most of it is owed to the 
fact that the exists are AGAU, even though 96.6% _looks_ like a pretty 
high cache efficiency.

Let's keep in mind that we are not comparing the same exact records, but
it shouldn't matter much.  If all of ebay's sending IPs can be encoded
in a single A record, you could substitute a lookup of that A record
for the +mx in my sample record.  It would still be the same load.

As a comparison, and for the record, here are the numbers for the same
record without using any kind of masking:

                       24h IN TXT "v=spf1 +mx -all"
                        1h IN MX 10 mailhost
              mailhost  1h IN  A 1.1.1.1

That's 24*MX + 1*TXT = 25 queries, and calling ns_resolv 24*2540*2
(121920) times with a cache hit ratio of 99.9795%.  Note, again, I
didn't include the A record lookup for mailhost, because it wasn't
included in any of the other calculations.  The remaining mechanisms
would have to be really expensive (in terms of number of queries and
query cacheability) to make masking mean something.  The typical case
(legit mail) is made worse by planning to be able to handle the atypical
case (SPF-Doom attack!).  If the numbers you've been preaching are
correct, masking may be a good trade off for complex, amplifying
records.
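The same spreadsheet check works for the unmasked record (again just re-deriving the numbers already given):

```python
# Unmasked record: 24 hourly MX refreshes + one 24h TXT, and every
# connection costs two lookups (TXT + MX) before caching kicks in.
queries = 24 + 1
lookups = 24 * 2540 * 2
assert queries == 25
assert lookups == 121_920
assert round((1 - queries / lookups) * 100, 4) == 99.9795   # cache hit %
```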

I still think this should be evaluated on a case-by-case basis.  Masking
using exists or compiling and using a masking directive can make simple
records worse, especially if they would overflow into multiple records
because of include flattening.


I'm going to sleep on this for a little while, and see how the exists: 
method can be better than the mask method. 

Well, the most obvious way :) it is better is that it is implementable
as soon as yesterday, without having to change the spec, redeploy SPF
evaluators to make them mask-syntax aware, install stunt DNS servers or
upgrade DNS software.  In fact, if you are using bind9, the $GENERATE
construct allows easy and quick generation of the necessary class A
records without using an SPF record compiler or outside script.

BTW, I've been referring to doing either of our methods as "masking". 
My suggestion uses exists to generate the mask; yours uses a new
mechanism (too bad it's order dependent, otherwise it could be a
modifier and thus deployed SPF evaluators would skip over it -- although
redirect= is order dependent, isn't it?)


One obvious way is if all 
forger traffic came from the same A class net all the time, _AND_ the 
specific address was close enough to the servers that the mask would 
miss it. [...] It's pretty unrealistic though, given all the 
restrictive ifs.

Barring some obscenely large hole on ALL networks, I think past patterns
suggest that those who are most vulnerable, and will remain vulnerable,
are those who sell consumer-oriented services (because consumers tend
not to be security oriented, and are thus targets for zombies).  If the
mask being ineffective, however it is implemented, is a concern,
avoiding class As that are shared with home subscribers might be wise.
BUT, using either your method or mine, you could implement even more
restrictive masks.  Using my method, this could be as simple as:

$GENERATE 2-254 $._spf   24h IN  A 127.0.0.1
$GENERATE 2-254 1.$._spf 24h IN  A 127.0.0.1

if you want to allow only the 1.1/16 range to be further evaluated.  But
at some point, my method reaches diminishing returns, because the "normal
case" is not an attack but rather legit email, and restrictions like that
only add to the number of queries performed.  So you have to weigh
your chances of getting attacked (and having to deal with the increased
load) and what should be considered "normal operation".  It largely
depends on where the attacks are coming from.
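Leaving aside the exact macro spelling needed to make the second $GENERATE line match, the intent of the tighter mask reduces to this membership test (the function and variable names are mine, not anything in the records):

```python
# Sketch of the tighter mask's intent: with both $GENERATE sets in
# place, only addresses in 1.1/16 escape the -exists mechanisms and
# reach the rest of the record.  passes_mask() is a hypothetical helper.
def passes_mask(ip: str) -> bool:
    first, second = (int(octet) for octet in ip.split(".")[:2])
    if first != 1:        # first $GENERATE set: block every class A but 1/8
        return False
    return second == 1    # second set: within 1/8, block all but 1.1/16

assert passes_mask("1.1.200.9")       # inside 1.1/16: evaluated further
assert not passes_mask("1.2.0.1")     # inside 1/8 but outside 1.1/16
assert not passes_mask("8.8.8.8")     # different class A: fails fast
```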

If there is a long lived, sustained attack, modifying the SPF record to
include masks may be a good short term solution (until the attacks
subside) as a way to control the load that your SPF record is putting
on receivers' and your own systems.


Whatever we conclude, I really enjoy these thoughtful discussions.

Heh, incidentally, I'm starting to find them tedious, but overall
interesting -- after all, I'm up at 4am (I'm in central time)
responding, so that must mean something. :)

-- 
Andy Bakun <spf(_at_)leave-it-to-grace(_dot_)com>

