
Re: Need for Complexity in SPF Records

2005-03-28 10:08:37
At 09:06 PM 3/27/2005 -0500, Radu wrote:

David MacQuigg wrote:
Radu, I wrote this response yesterday, then today decided it doesn't sound quite right. I'm really not as sure of what I'm saying as it sounds. Show me I'm wrong, and I'll re-double my efforts to find solutions that don't abandon what is already in SPF, solutions like your mask modifier. Examples are the best way to do that. Your example.com below is almost there, but it still doesn't tell me why we really need exists and redirect.

Ok, we'll have a look at all the ideas on the table. That's what the table is for, right ? :)

I won't cut anything out of your message, so that the progression of the explanation is easily seen and reflected upon if necessary.

Looks good in Eudora. I hope the deep indentations don't look too bad in other readers. Also, to keep the length of this main thread to a minimum, I'll split off sub-topics to another thread, like the need for %{i} macros.


At 07:21 PM 3/26/2005 -0500, Radu wrote:

David MacQuigg wrote:

At 04:06 PM 3/26/2005 -0500, Radu wrote:

David MacQuigg wrote:

Now I'm confused. If the reason for masks is *not* to avoid sending multiple packets, and *only* to avoid processing mechanisms that require another lookup, why do we need these lookups on the client side? Why can't the compiler do whatever lookups the client would do, and make the client's job as simple as possible?

Sorry for creating confusion.

Say that you have a policy that compiles to 1500 bytes.

The compiler will split it into 4 records of about 400 bytes each.

example.com     IN TXT \
     "v=spf1 exists:{i}.{d} ip4:... redirect=_s1.{d2} m=-65/8 m=24/8"
_s1.example.com IN TXT "v=spf1 ip4:.... .... ....  redirect=_s2.{d2}"
_s2.example.com IN TXT "v=spf1 ip4:.... .... ....  redirect=_s3.{d2}"
_s3.example.com IN TXT "v=spf1 ip4:.... .... ....  -all"
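
(Just to illustrate the splitting step, here is a rough sketch in Python of how a compiler might chain the records. The compile_chain function, the 400-byte budget, and the literal redirect targets are all made up for illustration; the real compiler would emit the {d2} macros shown above.)

    # Illustration only: split a flat list of mechanisms into daisy-chained
    # TXT records of at most ~400 bytes each, named like the example above.
    def compile_chain(domain, mechanisms, budget=400):
        records = []   # list of (record name, TXT value)
        current = []   # mechanisms accumulated for the record in progress

        def flush(last):
            idx = len(records)
            name = domain if idx == 0 else "_s%d.%s" % (idx, domain)
            tail = "-all" if last else "redirect=_s%d.%s" % (idx + 1, domain)
            records.append((name, "v=spf1 " + " ".join(current + [tail])))

        for mech in mechanisms:
            # leave room for the redirect/-all tail at the end of the record
            if len("v=spf1 " + " ".join(current + [mech])) > budget - 40:
                flush(last=False)
                current.clear()
            current.append(mech)
        flush(last=True)
        return records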

We want the mask to be applied after the exists:{i}.{d}. Since that mechanism was in the initial query and cannot be expanded to a list of IPs, the mask cannot possibly apply to it.

I think what you are saying is that the compiler can't get this down to a simple list of IPs, because we need redirects containing macros that depend on information only the client has. So if we are to put the burden of complex SPF evaluations on the server side, where it belongs, seems like we have to pass all the necessary information to the server in the initial query. We already pass the domain name. Adding the IP address should not be a big burden, and it would have some other benefits we discussed.

If you can find a way to do that and still keep the query cacheable, let me know. If it is compatible with the way DNS works currently, I'll even listen and pay attention. ;)

That 1 UDP packet might not seem like a lot. But currently it is cacheable and most of the time is not even seen on the internet. Making it uncacheable would be a many-fold burden on bandwidth. That's exactly why caching and the TTL mechanism were invented, and now you suggest we give it up?

No, I see your point. If we truly need %{i} macros, and we evaluate them on the server side, that would produce a different response record for every IP address, and it might not make sense to cache such records.
Responses for SPF records with no %{i} macros would cache as always.
The %{d} macros would not impair caching. Even the %{i} responses might be worth caching for a few minutes, if you are getting hammered by one IP.

Actually, all records should have the longest possible TTL (within the constraints of the network design). This avoids caching name servers everywhere asking the same queries too often.

Responses to %{i} queries are no different. Since there are 2^32 possible questions, you want each one to come up as infrequently as possible. If you have a pest, or even regular traffic, every hour, but your %{i} TTL is 59 minutes, then the cache efficiency is 0%. But if you could make it 1 hour and 1 minute, the cache efficiency would be 50%. On the other hand, for steady traffic the cache efficiency would be really high, so even a lower TTL would not make much difference, as the savings are huge compared to the cost. It's a little bit counterintuitive that the "uncacheable" records should have long TTLs. Anyway, this is somewhat philosophical, because you can't cache 2^32 * {number of forged domains that publish %{i}}.
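
(To put numbers on that, here's a toy calculation; the function is just my illustration, nothing from the spec:)

    # Toy model: fraction of queries answered from the local cache when the
    # same question arrives every `interval` minutes and the answer's TTL
    # is `ttl` minutes.
    def cache_hit_rate(ttl, interval, queries=1000):
        hits, expires = 0, -1.0
        for n in range(queries):
            t = n * interval
            if t < expires:
                hits += 1            # answer still cached
            else:
                expires = t + ttl    # miss: fetch again and cache
        return hits / queries

    cache_hit_rate(59, 60)   # -> 0.0, TTL just under the arrival period
    cache_hit_rate(61, 60)   # -> 0.5, every other query is a cache hit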

As an example, let's pretend that yahoo publishes a record with %{i} and a TTL of 10 minutes. Potentially it will receive the same 2^32 questions from all the caching servers of the world, every 10 minutes. I know for sure that ohmi will be asking every 10 minutes, because I get lots of forgeries as yahoo.com (say 1 every 11 minutes). So will all the other little servers. So doubling that TTL means I'll only ask every 20 minutes. This is where the damage is: little servers asking for the information every 10 minutes, but never using it more than once.

But when yahoo users send 300M messages a day to their hotmail friends, hotmail will ask yahoo for the information 144 times, and use its cache the other 299,... million times. So the cost of %{i} as seen by yahoo is not coming from hotmail querying it, but from the swarm of little servers everywhere.

Whether the loss of caching on a few records is too high a price depends on the severity of the threatened abuse. Should we tolerate a small increase in DNS load for the normal flow of email, to limit the worst-case abuse of the %{i} macro? I don't know.

Well, the %{i} is not a small increase. It is even far more expensive than PTR. Let's say that you have a spewing spambox that uses forgery techniques. (let's say it's at 1.1.1.1)

Let's say that all domains used one %{i} mechanism.

The spambox sends ohmi N forgeries from different domains.

If every domain listed a PTR mechanism, I would query the 1.1.1.1.in-addr.arpa address once, and for the remaining N-1 queries I would find it in the local cache. So my cost of the PTR is 1 query per mail source.

But if everyone uses an %{i}, I now have to ask the following questions:

1.1.1.1._spf.domain1.com
1.1.1.1._spf.domain2.com
1.1.1.1._spf.domain3.com
1.1.1.1._spf.domain4.com
...
1.1.1.1._spf.domainN.com

These are distinct queries, and I ask each question exactly once, so even though the local DNS cache does cache the answers, I will never ask for them again. All that traffic will go over my DSL connection to the ISP, to the root servers, and so on. Actually, as Tod pointed out, every time my caching server is asked about a new domain, it generates multiple recursive queries: the 1st one to the root servers, the 2nd to the authority NS servers, the 3rd to the subdomain servers and so on. I hadn't thought about this, or I would have presented a much gloomier SPF-doom scenario.

So every one of those queries costs 3 queries on my DSL line, 3*N in total, compared to the PTR mechanism that only costs 1 query across the DSL. I have the caching server on my side of the DSL modem; I don't use the ISP's. I also get charged for excess bandwidth consumed.

If I used the ISP's caching server, I would ask N questions even for the PTR case. The further the caching server is, the more expensive it is to use it. Also the benefit is lost, as the further it is, the higher the response latency gets. (Assume my DSL connection has a 200ms latency. Asking N questions would take N*200ms, while asking the same N questions from a cache on my side of the modem would be 200ms for the 1st question, and 0.1ms for every subsequent one.) And I'd be paying dollars for the N*200ms performance.
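
(A back-of-the-envelope comparison of the two, with the numbers above; the function and the flat recursion cost of 3 are my own simplification:)

    # Cost, in queries crossing my DSL line, of checking N forged domains all
    # coming from the same IP.  PTR: one cold lookup for 1.1.1.1.in-addr.arpa,
    # then every later check hits the local cache.  %{i}: every
    # <ip>._spf.<domainK> name is unique, nothing is ever reused, and each
    # cold lookup costs ~3 recursive queries (root, TLD, authority).
    def queries_over_dsl(n_domains, mechanism, recursion_cost=3):
        if mechanism == "ptr":
            return 1    # recursion for that single lookup ignored, to match
                        # the comparison above
        if mechanism == "i_macro":
            return recursion_cost * n_domains
        raise ValueError(mechanism)

    queries_over_dsl(1000, "ptr")        # -> 1
    queries_over_dsl(1000, "i_macro")    # -> 3000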

This is a very convincing argument that we need to *deprecate* %{i}. I'm OK with that, and my proposal above was predicated on the assumption that %{i} is truly needed. If someone wants to defend the need for %{i}, let's split this off as a separate sub-topic.

What I *would* do is discourage the widespread use of macros, redirects, and includes, and state in the standard that processing of records with these features SHOULD be lower priority than processing simple records.
That may help to implement a defense mode if these features are abused.

Absolutely, I'm with you on this. I already suggested that the expensive macros should be limited to 1 per record. The %{d} and %{o} macros are not expensive, as they expand the same no matter what the source of the connection is or what the claimed mail-from is.

I would not introduce the concept of 'priority' though.

After all, no one is forcing the postmaster to do 10 queries, or N queries. Even my sendmail implementation of SPF has configuration options for how expensive the check is allowed to get. You can say that checks with %{i} are never done, in which case the policy does not result in an answer, and you can also configure the max number of DNS mechanisms to an arbitrarily low number. If that number is lower than the spec's, and the checker sees more than that in the record, it doesn't try to expand even one, and returns with "record too expensive". In both of those cases, no Received-SPF header is added.
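
(Roughly this kind of gating, sketched in Python; the option names and the parsing are invented here for illustration, not lifted from my implementation:)

    # Sketch of the local cost limits I mean.  Option names are invented.
    MAX_DNS_MECHS = 3          # local limit, may be set lower than the spec allows
    ALLOW_I_MACRO = False      # refuse to expand %{i} at all

    def too_expensive(record):
        """True if local limits say to skip this record without evaluating it."""
        if not ALLOW_I_MACRO and "%{i}" in record:
            return True
        mechs = record.split()[1:]                     # drop "v=spf1"
        dns_mechs = [m for m in mechs
                     if m.lstrip("+-~?").split(":")[0].split("/")[0]
                     in ("a", "mx", "ptr", "exists", "include")]
        return len(dns_mechs) > MAX_DNS_MECHS

    too_expensive("v=spf1 ip4:192.0.2.0/24 -all")                      # False
    too_expensive("v=spf1 a mx ptr include:a.com include:b.com -all")  # True

When it returns True, the checker just gives up without adding a Received-SPF header, as described above.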

We need to close the loop by providing feedback from the point where an SPF record is deemed "too expensive" back to the publisher of that record. One way to do this might be a comment in the authentication header, something like "SPF record from <domain> exceeds complexity limit." Then postmasters downstream can put pressure on the publisher to compile their records.
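
For instance, something along these lines (just a strawman; the result keyword, host name, and exact wording are made up):

    Received-SPF: none (mail.receiver.example: SPF record from example.com
        exceeds complexity limit; policy not evaluated)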

Maybe I'm just not seeing the necessity of setups like the above example.com. I'm sure someone could come up with a scenario where it would be real nice if all SPF checkers could run a Perl script embedded in an SPF record, but we have to ask, is that really necessary to verify a domain name?

The "..." imply a list of ip4: mechanism that is 400-bytes long. That's why the chaining is necessary. ebay.com has something like that. hotmail.com uses something similar too. When you have lots of outgoing servers, you need more space to list them, no?

Why can't they make each group of servers a sub-domain with its own simple DNS records, as rr.com has done with its subdomains? _s3.example.com can have as many servers as can be listed in a 400 byte SPF record, and that includes some racks with hundreds of servers listed in one 20 byte piece of the 400 byte record. With normal clustering of addresses, I would think you could list thousands of servers in each subdomain, with nothing but ip4's in the SPF record.
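
For example, something like this (made-up documentation addresses, keeping the _s names from your example, though real mail subdomains would probably drop the underscore):

    _s1.example.com IN TXT "v=spf1 ip4:192.0.2.0/24 -all"
    _s2.example.com IN TXT "v=spf1 ip4:198.51.100.0/25 ip4:198.51.100.128/26 -all"
    _s3.example.com IN TXT "v=spf1 ip4:203.0.113.0/24 -all"

Each record stands on its own -- no redirect chain, and each fits easily in one UDP packet.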

It may already be that way. If I had that longer list of domains that publish SPF, I could run the spfcompiler on them and find out very quickly what the average, min, and max compiled record lengths would be.

senderbase.org has a nice list of all the biggest email domains. The top three now are comcast.net, yahoo.com, and rr.com. comcast and yahoo don't publish SPF. You might want to look down the list for some big ones that do.

One reason I can see why mail servers can't be clustered too tightly is in an application like ebay's. Their business depends on being able to send "last chance" emails, so they have to have mail servers sprinkled all over for redundancy (load sharing too).

Even their 4-level SPF hierarchy, however, can be flattened to a few subdomains, like s._spf.ebay.com, with nothing but a short list of ip4's in each subdomain's SPF record. ( Check me on this. I may have missed something.) The question is - why can't they do like rr.com, and use DNS (not SPF) to establish the hierarchy? I understand that Ebay may want more centralized control than RR, but having a bunch of decentralized subdomains doesn't mean they can't control the nameservers for those subdomains.

As I understand it, users sending mail from _s3.example.com will still see 'example.com' in their headers, but the envelope address will be the real one _s3.example.com. That's the one that needs to authenticate, and the one that will inherit its reputation from example.com.

I'm afraid you misunderstood. The _s3-like names are generated by the compiler, but nothing in the configuration of the SMTP server is changed to reflect it. So if the next version of the compiler changes to using _p3, there is zero effect on the mail users. Because the _s records are daisy-chained, it's only the root of the chain that can be used as a start of policy. That root is at domain.com.

Also, as the network changes, the contents of _s3 change too. Maybe the whole daisy chain gets shorter or longer. That will not affect the envelope address used on mail. Evaluation must always start at domain.com (the top of the daisy chain).

OK, so _s3.example.com is just a fictional subdomain that exists *only* to split these SPF records into 400-byte chunks. What I'm suggesting is that these be real subdomains, in the sense that they have their own DNS records, and that these subdomains follow the actual structure of the IP blocks assigned to the mail servers. Re-structuring the network would mean changing the MAIL FROM domain on servers that moved to a different subdomain, but the header addresses would still be 'example.com'. Authentication queries would go straight to _s3.example.com, where they would get a simple one-packet response. Is that difficult?

Seems to me this is using DNS exactly the way it was intended, distributing the data out to the lowest levels, and avoiding the need to construct hierarchies within the SPF records. Sure, it can be done, but what is the advantage over just putting simple records at the lowest levels, and letting DNS take care of the hierarchy? Why does ebay.com need four levels of hierarchy in its SPF records?

Currently just for convenience, as they're not using any compiler. In the future, the compiler will flatten the hierarchy. It may be a while till then, so in the meanwhile we need a transition plan.

The migration plan needs to provide some incentive for companies to *compile* their SPF records, even if they don't see a problem with their own DNS loads. Some things I can think of are 1) Make the compiler so easy to use that companies will use it just for convenience. 2) A schedule for deprecation of the troublesome features, or reduction of the number of allowed queries. 3) A well-publicized RECURSION option that will allow mailsystem admins to control the number of queries allowed in their authentication checks.

If we simply can't sell SPF without all these whiz-bang features, I would say put it *all* on the server side. All the client should have to do is ask - "Hey <domain> is this <ip> OK?" We dropped that idea because it doesn't allow caching on the client side, but with a simple PASS/FAIL response, the cost of no caching is only one UDP round trip per email. This seems like small change compared to worries about runaway redirects, malicious macros, etc.

I'll humour you:

This server-side processing would not be happening on a caching server, correct? That would not save anything. I hope you agree.

If the caching server were in the domain which created the expensive SPF record, then it would save traffic to and from the client, at the expense of traffic within the domain that deserves it. If example.com needs 100 queries within their network to answer my simple query "Is this <ip> OK?", then they need to think about how to better organize their records. All I need is a simple PASS/FAIL, or preferably a list of IP blocks that I can cache to avoid future queries. ( This should be the server's choice.)

I see where the misunderstanding started. Let me attempt to clear it up:

Caching servers are rarely, if ever, deployed close to the authoritative servers. Caching servers really only make sense if they are close to where the queries are generated. I showed this above with my 200ms DSL connection example. It was a little exaggerated, but it serves the purpose of explanation well.

I'm probably using the wrong terminology. I should have said "slave servers" instead of "caching servers". If we can get everyone to compile their records, and use subdomains as I suggest above, then the question of whether slave servers or clients should bear the expense of complex SPF records goes away. So I suggest we come back to this point, if necessary, after we resolve the above questions.

< snip discussion on the role of slave servers >

Let's estimate the worst-case load on DNS if we say "no lookups, one packet only in any response". I'm guessing 90% of domains will provide a simple, compiled, cacheable list of IP blocks. This is as good as it gets, with the possible exception of a fallback to TCP if the packet is too long. The 10% with really complex policies may have a big burden from queries and computations within their own network, but what goes across the Internet is a simple UDP packet with a PASS or FAIL.

Oh, but the critical detail is that a lot of firewalls block port 53 TCP, whether by design or configuration. Since this is the state of the world, DNS queries over TCP are inherently unreliable.

I doubt if the 10% of domains with long compiled SPF records will accept that unreliability as a fact of life. They will stick to UDP, which is more or less guaranteed, in the sense that even if a packet is lost once, the next time it will probably make it. The DNS system deals gracefully with temporary problems like this, so not a problem.

But when your record depends on TCP, and some firewall somewhere blocks it, there's no amount of retrying that will get that connection through.

OK, so we should not depend on TCP, but simply set a maximum record length, like 450 bytes, and expect that all DNS servers will be able to handle it. I'm surprised there isn't a well-established number we can use here.

And because of this, we're stuck with daisy-chaining the longest records. In the end, it's done for the sake of reliability, at the expense of some extra traffic.

The alternative to daisy-chaining long SPF records is separating the chunks into subdomains. See above.

That response is not cacheable, but let's compare the added load to some other things that happen with each email. Setting up a TCP connection is a minimum of three packets. SMTP takes two packets for the HELO and response. MAIL FROM is another two. Then we need two for the authentication. At that point we can send a reject (one packet) and terminate the connection (4 packets). Looks to me like the additional load on DNS is insignificant for normal mail, and only a few percent of the minimum traffic per email in a DoS storm. Also, the additional load is primarily on the domain with the expensive SPF records, where it should be.

Please notice, the premise for this discussion on DNS load is "no lookups, one packet only in any response". The following examples show how bad it can get if we *don't* make that assumption.

This is not always the case. Consider a case like:

"v=spf1 ip4:1.1.1.1/28 mx:t-online.de include:isp1.com include:isp2.com include:isp3.com -all"

Say that the 3 ISPs don't even publish SPF yet, but the includes are there just in case they ever do.

This record is very cheap on the publisher's DNS (only 1 TXT query goes to the publisher's DNS). But for every bandwidth penny spent by the publisher, the 3 ISPs have to spend 1 penny each. Poor t-online.de has to spend 10 pennies for each penny that the publisher spends.

And the sad thing is, while the ISPs can minimize the cost by publishing cheap SPF records, there's nothing t-online can do to lower its damage.

What's even worse, is that t-online can't even find out why it sees increased bandwidth levels. It's extremely complicated to track an MX or A query back to an email address.

Even worse than that, the default max-ncache-ttl in BIND is 3 hours. That means that even if the publisher's TXT record has a TTL of 24H, the ISPs will be hit with a query every 3 hours, while the publisher is hit only every 24H.

So no, the cost is not necessarily on the publisher.

Taking into account the TTLs above, and the TTL of t-online's records of 1H, the score would be

1:8:240

So the publisher's record costs t-online.de 2.40 euro for every penny it costs the publisher.

The ISP's pay 8 pennies each.
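
Spelled out (the 10-queries-per-check figure for the mx mechanism is my rough guess from above: one MX query plus an A query for each of t-online's MX hosts):

    publisher:  24h TTL on its TXT record        ->   1 query / 24h
    each ISP:   3h negative-cache TTL (no SPF)   ->  24h / 3h        =   8 queries / 24h
    t-online:   1h TTL, ~10 queries per check    ->  (24h / 1h) * 10 = 240 queries / 24h
                                                     score:  1 : 8 : 240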

Even if this were a spammer domain, and they weren't *really* doing any internal lookups, the load on their DNS server is two packets for every additional two-packet load on the victims. No amplification factor here.

Add that the spammer is actually likely both to use t-online.de's resources and to be stupid enough not to realize that mail doesn't go through the MX exchange. Suddenly, the amplification factor becomes a certainty.

I agree with you entirely, and your examples above add even more weight to the argument that we should deprecate lookups.

How about this: All SPF records SHOULD be compiled down to a list of IPs. If you need more than that, then do as much as you like, but give the client a simple PASS or FAIL. Most domains will then say "Here is our list of IPs. Don't ask again for X hours." Only a few will say "Our policy is so complex, you can't possibly understand it. Send us every IP you want checked."

That's exactly what the exists:{i}.domain mechanism does. It tells the domain about every IP it wants checked, and the server checks it. Unfortunately, it is extremely expensive because it's AGAU.
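
In record form, the two styles look like this (the addresses and the _spf label are made up):

    "Here is our list of IPs":       v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.0/24 -all
    "Send us every IP you check":    v=spf1 exists:%{i}._spf.%{d} -all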

If I were writing an SPF-doom virus, this is where I would start.

I need to get back to designing ICs. :>)

Nah... you've got some great ideas and I value your contribution and feedback.

And I appreciate your time in getting me up to speed on these problems.
I hope one day I can return the favor.

It's a pleasure to be of service. SPF is a good cause, and I think it deserves to be saved.

Saved is the right word. Unless we change course quickly, I think SPF is heading for oblivion. Did you see the article "Stopping Spam" in Scientific American, April 2005? SenderID is a small part of the overall plan, and SPF is a part of SenderID not even worth mentioning. I think the folks at Microsoft don't really care how messy or inefficient the infrastructure is. They will just adapt to whatever standard is eventually adopted. If that ends up some crazy mix of SPF and PRA, nobody but a few programmers will be bothered by it.

Incidentally, I got curious and did some tests, and it appears that yahoo does not do any DNS queries on incoming mail. Hotmail does two, but either doesn't respect TTLs or does queries on a spot-check basis, because even though I have a low TTL, they did not refresh.

It could be that these two have already figured out, even without checking SPF, that DNS is more expensive than storing spam. Fascinating!

This wasn't a scientific test as I would normally do, but a quick check-your-fears check.

So at least for now, I think I know that yahoo and hotmail will not do any spf checks any time soon, based on this little test and a lot of extrapolation. ;)

I would love to get Yahoo to collaborate on a shared protocol that would support both DomainKeys and SPF, but I haven't found any way to get that discussion started. The FTC is talking about doing something if the industry fails, but the current attitude seems to be that they can let the industry work it out. I see a year of confusion and lots of inter-operability problems.

-- Dave
************************************************************     *
* David MacQuigg, PhD      email:  dmquigg-spf at yahoo.com      *  *
* IC Design Engineer            phone:  USA 520-721-4583      *  *  *
* Analog Design Methodologies                                 *  *  *
*                                   9320 East Mikelyn Lane     * * *
* VRS Consulting, P.C.              Tucson, Arizona 85710        *
************************************************************ *

