
Re: Need for Complexity in SPF Records

2005-03-28 10:08:37
At 09:06 PM 3/27/2005 -0500, Radu wrote:

David MacQuigg wrote:
Radu, I wrote this response yesterday, then today decided it doesn't sound quite right. I'm really not as sure of what I'm saying as it sounds. Show me I'm wrong, and I'll re-double my efforts to find solutions that don't abandon what is already in SPF, solutions like your mask modifier. Examples are the best way to do that. Your example.com below is almost there, but it still doesn't tell me why we really need exists and redirect.

Ok, we'll have a look at all the ideas on the table. That's what the table is for, right ? :)

I won't cut anything out of your message, so that the progression of the explanation is easily seen and reflected upon if necessary.

Looks good in Eudora. I hope the deep indentations don't look too bad in other readers. Also, to keep the length of this main thread to a minimum, I'll split off sub-topics to another thread, like the need for %{i} macros.


At 07:21 PM 3/26/2005 -0500, Radu wrote:

David MacQuigg wrote:

At 04:06 PM 3/26/2005 -0500, Radu wrote:

David MacQuigg wrote:

Now I'm confused. If the reason for masks is *not* to avoid sending multiple packets, and *only* to avoid processing mechanisms that require another lookup, why do we need these lookups on the client side? Why can't the compiler do whatever lookups the client would do, and make the client's job as simple as possible?

Sorry for creating confusion.

Say that you have a policy that compiles to 1500 bytes.

The compiler will split it into 4 records of about 400 bytes each.

example.com     IN TXT \
     "v=spf1 exists:{i}.{d} ip4:... redirect=_s1.{d2} m=-65/8 m=24/8"
_s1.example.com IN TXT "v=spf1 ip4:.... .... ....  redirect=_s2.{d2}"
_s2.example.com IN TXT "v=spf1 ip4:.... .... ....  redirect=_s3.{d2}"
_s3.example.com IN TXT "v=spf1 ip4:.... .... ....  -all"
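
(Just to illustrate the splitting step, here is a rough sketch in Python of how a compiler might chain the records. The compile_chain function, the 400-byte budget, and the literal redirect targets are all made up for illustration; the real compiler would emit the {d2} macros shown above.)

    # Illustration only: split a flat list of mechanisms into daisy-chained
    # TXT records of at most ~400 bytes each, named like the example above.
    def compile_chain(domain, mechanisms, budget=400):
        records = []   # list of (record name, TXT value)
        current = []   # mechanisms accumulated for the record in progress

        def flush(last):
            idx = len(records)
            name = domain if idx == 0 else "_s%d.%s" % (idx, domain)
            tail = "-all" if last else "redirect=_s%d.%s" % (idx + 1, domain)
            records.append((name, "v=spf1 " + " ".join(current + [tail])))

        for mech in mechanisms:
            # leave room for the redirect/-all tail at the end of the record
            if len("v=spf1 " + " ".join(current + [mech])) > budget - 40:
                flush(last=False)
                current.clear()
            current.append(mech)
        flush(last=True)
        return records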

We want the mask to be applied after the exists:{i}.{d}. Since that mechanism was in the initial query and cannot be expanded to a list of IPs, the mask cannot possibly apply to it.

I think what you are saying is that the compiler can't get this down to a simple list of IPs, because we need redirects containing macros that depend on information only the client has. So if we are to put the burden of complex SPF evaluations on the server side, where it belongs, seems like we have to pass all the necessary information to the server in the initial query. We already pass the domain name. Adding the IP address should not be a big burden, and it would have some other benefits we discussed.

If you can find a way to do that and still keep the query cacheable, let me know. If it is compatible with the way DNS works currently, I'll even listen and pay attention. ;)

That 1 UDP packet might not seem like a lot. But currently it is cacheable and most of the time is not even seen on the internet. Making it uncacheable would be a many-fold burden on bandwidth. That's exactly why caching and the TTL mechanism were invented, and now you suggest we give it up?

No, I see your point. If we truly need %{i} macros, and we evaluate them on the server side, that would produce a different response record for every IP address, and it might not make sense to cache such records.
Responses for SPF records with no %{i} macros would cache as always.
The %{d} macros would not impair caching. Even the %{i} responses might be worth caching for a few minutes, if you are getting hammered by one IP.

Actually, all records should have the longest possible TTL (within the constraints of the network design). This avoids caching name servers everywhere asking the same queries too often.

Responses to %{i} queries are no different. Since there are 2^32 possible questions, you want each one to come up as infrequently as possible. If you have a pest, or even regular traffic, every hour, but your %{i} TTL is 59 minutes, then the cache efficiency is 0%. But if you could make it 1 hour and 1 minute, the cache efficiency would be 50%. On the other hand, for steady traffic the cache efficiency would be really high, so even a lower TTL would not make much difference, as the savings are huge compared to the cost. It's a little bit counterintuitive that the "uncacheable" records should have long TTLs. Anyway, this is somewhat philosophical, because you can't cache 2^32 * {number of forged domains that publish %{i}}.
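
(To put numbers on that, here's a toy calculation; the function is just my illustration, nothing from the spec:)

    # Toy model: fraction of queries answered from the local cache when the
    # same question arrives every `interval` minutes and the answer's TTL
    # is `ttl` minutes.
    def cache_hit_rate(ttl, interval, queries=1000):
        hits, expires = 0, -1.0
        for n in range(queries):
            t = n * interval
            if t < expires:
                hits += 1            # answer still cached
            else:
                expires = t + ttl    # miss: fetch again and cache
        return hits / queries

    cache_hit_rate(59, 60)   # -> 0.0, TTL just under the arrival period
    cache_hit_rate(61, 60)   # -> 0.5, every other query is a cache hit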

As an example, let's pretend that yahoo publishes a record with %{i} and a TTL of 10 minutes. Potentially it will receive the same 2^32 questions from all the caching servers of the world, every 10 minutes. I know for sure that ohmi will be asking every 10 minutes, because I get lots of forgeries as yahoo.com (say 1 every 11 minutes). So will all the other little servers. So doubling that TTL means I'll only ask every 20 minutes. This is where the damage is: little servers asking for the information every 10 minutes, but never using it more than once.

But when yahoo users send 300M messages a day to their hotmail friends, hotmail will ask yahoo for the information 144 times, and use its cache the other 299,... million times. So the cost of %{i} as seen by yahoo is not coming from hotmail querying it, but from the swarm of little servers everywhere.

Whether the loss of caching on a few records is too high a price depends on the severity of the threatened abuse. Should we tolerate a small increase in DNS load for the normal flow of email, to limit the worst-case abuse of the %{i} macro? I don't know.

Well, the %{i} is not a small increase. It is even far more expensive than PTR. Let's say that you have a spewing spambox that uses forgery techniques. (let's say it's at 1.1.1.1)

Let's say that all domains used one %{i} mechanism.

The spambox sends ohmi N forgeries from different domains.

If every domain listed a PTR mechanism, I would query the 1.1.1.1.in-addr.arpa address once, and for the remaining N-1 queries I would find it in the local cache. So my cost of the PTR is 1 query per mail source.

But if everyone uses an %{i}, I now have to ask the following questions:

1.1.1.1._spf.domain1.com
1.1.1.1._spf.domain2.com
1.1.1.1._spf.domain3.com
1.1.1.1._spf.domain4.com
...
1.1.1.1._spf.domainN.com

These are distinct queries, and I ask each question exactly once, so even though the local DNS cache does cache the answers, I will never ask for them again. All that traffic will go over my DSL connection to the ISP, to the root servers, and so on. Actually, as Tod pointed out, every time my caching server is asked about a new domain, it generates multiple recursive queries: the 1st one to the root servers, the 2nd to the authority NS servers, the 3rd to the subdomain servers and so on. I hadn't thought about this, or I would have presented a much gloomier SPF-doom scenario.

So every one of those queries costs 3 queries on my DSL line, 3*N in total, compared to the PTR mechanism that only costs 1 query across the DSL. I have the caching server on my side of the DSL modem; I don't use the ISP's. I also get charged for excess bandwidth consumed.

If I used the ISP's caching server, I would ask N questions even for the PTR case. The further the caching server is, the more expensive it is to use it. Also the benefit is lost, as the further it is, the higher the response latency gets. (Assume my DSL connection has a 200ms latency. Asking N questions would take N*200ms, while asking the same N questions from a cache on my side of the modem would be 200ms for the 1st question, and 0.1ms for every subsequent one.) And I'd be paying dollars for the N*200ms performance.
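
(A back-of-the-envelope comparison of the two, with the numbers above; the function and the flat recursion cost of 3 are my own simplification:)

    # Cost, in queries crossing my DSL line, of checking N forged domains all
    # coming from the same IP.  PTR: one cold lookup for 1.1.1.1.in-addr.arpa,
    # then every later check hits the local cache.  %{i}: every
    # <ip>._spf.<domainK> name is unique, nothing is ever reused, and each
    # cold lookup costs ~3 recursive queries (root, TLD, authority).
    def queries_over_dsl(n_domains, mechanism, recursion_cost=3):
        if mechanism == "ptr":
            return 1    # recursion for that single lookup ignored, to match
                        # the comparison above
        if mechanism == "i_macro":
            return recursion_cost * n_domains
        raise ValueError(mechanism)

    queries_over_dsl(1000, "ptr")        # -> 1
    queries_over_dsl(1000, "i_macro")    # -> 3000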

This is a very convincing argument that we need to *deprecate* %{i}. I'm OK with that, and my proposal above was predicated on the assumption that %{i} is truly needed. If someone wants to defend the need for %{i}, let's split this off as a separate sub-topic.

What I *would* do is discourage the widespread use of macros, redirects, and includes, and state in the standard that processing of records with these features SHOULD be lower priority than processing simple records.
That may help to implement a defense mode if these features are abused.

Absolutely, I'm with you on this. I already suggested that the expensive macros should be limited to 1 per record. The %{d} and %{o} macros are not expensive, as they expand the same no matter what the source of the connection is or what the claimed mail-from is.

I would not introduce the concept of 'priority' though.

After all, no one is forcing the postmaster to do 10 queries, or N queries. Even my sendmail implementation of SPF has configuration options for how expensive the check is allowed to get. You can say that checks with %{i} are never done, in which case the policy does not result in an answer, and you can also configure the max number of DNS mechanisms to an arbitrarily low number. If that number is lower than the spec's, and the checker sees more than that in the record, it doesn't try to expand even one, and returns with "record too expensive". In both of those cases, no Received-SPF header is added.
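
(Roughly this kind of gating, sketched in Python; the option names and the parsing are invented here for illustration, not lifted from my implementation:)

    # Sketch of the local cost limits I mean.  Option names are invented.
    MAX_DNS_MECHS = 3          # local limit, may be set lower than the spec allows
    ALLOW_I_MACRO = False      # refuse to expand %{i} at all

    def too_expensive(record):
        """True if local limits say to skip this record without evaluating it."""
        if not ALLOW_I_MACRO and "%{i}" in record:
            return True
        mechs = record.split()[1:]                     # drop "v=spf1"
        dns_mechs = [m for m in mechs
                     if m.lstrip("+-~?").split(":")[0].split("/")[0]
                     in ("a", "mx", "ptr", "exists", "include")]
        return len(dns_mechs) > MAX_DNS_MECHS

    too_expensive("v=spf1 ip4:192.0.2.0/24 -all")                      # False
    too_expensive("v=spf1 a mx ptr include:a.com include:b.com -all")  # True

When it returns True, the checker just gives up without adding a Received-SPF header, as described above.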

We need to close the loop by providing feedback from the point where an SPF record is deemed "too expensive" back to the publisher of that record. One way to do this might be a comment in the authentication header, something like "SPF record from <domain> exceeds complexity limit." Then postmasters downstream can put pressure on the publisher to compile their records.
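
For instance, something along these lines (just a strawman; the result keyword, host name, and exact wording are made up):

    Received-SPF: none (mail.receiver.example: SPF record from example.com
        exceeds complexity limit; policy not evaluated)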

Maybe I'm just not seeing the necessity of setups like the above example.com. I'm sure someone could come up with a scenario where it would be real nice if all SPF checkers could run a Perl script embedded in an SPF record, but we have to ask, is that really necessary to verify a domain name?

The "..." imply a list of ip4: mechanism that is 400-bytes long. That's why the chaining is necessary. ebay.com has something like that. hotmail.com uses something similar too. When you have lots of outgoing servers, you need more space to list them, no?

Why can't they make each group of servers a sub-domain with its own simple DNS records, as rr.com has done with its subdomains? _s3.example.com can have as many servers as can be listed in a 400 byte SPF record, and that includes some racks with hundreds of servers listed in one 20 byte piece of the 400 byte record. With normal clustering of addresses, I would think you could list thousands of servers in each subdomain, with nothing but ip4's in the SPF record.
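
For example, something like this (made-up documentation addresses, keeping the _s names from your example, though real mail subdomains would probably drop the underscore):

    _s1.example.com IN TXT "v=spf1 ip4:192.0.2.0/24 -all"
    _s2.example.com IN TXT "v=spf1 ip4:198.51.100.0/25 ip4:198.51.100.128/26 -all"
    _s3.example.com IN TXT "v=spf1 ip4:203.0.113.0/24 -all"

Each record stands on its own -- no redirect chain, and each fits easily in one UDP packet.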

It may already be that way. If I had that longer list of domains that publish SPF, I could run the spfcompiler on them and find out very quickly what the average, min, and max compiled record lengths would be.

senderbase.org has a nice list of all the biggest email domains. The top three now are comcast.net, yahoo.com, and rr.com. comcast and yahoo don't publish SPF. You might want to look down the list for some big ones that do.

One reason I can see why mail servers can't be clustered too tightly is in an application like ebay's. Their business depends on being able to send "last chance" emails, so they have to have mail servers sprinkled all over for redundancy (load sharing too).

Even their 4-level SPF hierarchy, however, can be flattened to a few subdomains, like s._spf.ebay.com, with nothing but a short list of ip4's in each subdomain's SPF record. ( Check me on this. I may have missed something.) The question is - why can't they do like rr.com, and use DNS (not SPF) to establish the hierarchy? I understand that Ebay may want more centralized control than RR, but having a bunch of decentralized subdomains doesn't mean they can't control the nameservers for those subdomains.

As I understand it, users sending mail from _s3.example.com will still see 'example.com' in their headers, but the envelope address will be the real one _s3.example.com. That's the one that needs to authenticate, and the one that will inherit its reputation from example.com.

I'm afraid you misunderstood. The _s3-like names are generated by the compiler, but nothing in the configuration of the SMTP server is changed to reflect it. So if the next version of the compiler changes to using _p3, there is zero effect on the mail users. Because the _s records are daisy-chained, it's only the root of the chain that can be used as a start of policy. That root is at domain.com.

Also, as the network changes, the contents of _s3 change too. Maybe the whole daisy chain gets shorter or longer. That will not affect the envelope address used on mail. Evaluation must always start at domain.com (the top of the daisy chain).

OK, so _s3.example.com is just a fictional subdomain that exists *only* to split these SPF records into 400-byte chunks. What I'm suggesting is that these be real subdomains, in the sense that they have their own DNS records, and that these subdomains follow the actual structure of the IP blocks assigned to the mail servers. Re-structuring the network would mean changing the MAIL FROM domain on servers that moved to a different subdomain, but the header addresses would still be 'example.com'. Authentication queries would go straight to _s3.example.com, where they would get a simple one-packet response. Is that difficult?

Seems to me this is using DNS exactly the way it was intended, distributing the data out to the lowest levels, and avoiding the need to construct hierarchies within the SPF records. Sure, it can be done, but what is the advantage over just putting simple records at the lowest levels, and letting DNS take care of the hierarchy? Why does ebay.com need four levels of hierarchy in its SPF records?

Currently just for convenience, as they're not using any compiler. In the future, the compiler will flatten the hierarchy. It may be a while till then, so in the meanwhile we need a transition plan.

The migration plan needs to provide some incentive for companies to *compile* their SPF records, even if they don't see a problem with their own DNS loads. Some things I can think of are 1) Make the compiler so easy to use that companies will use it just for convenience. 2) A schedule for deprecation of the troublesome features, or reduction of the number of allowed queries. 3) A well-publicized RECURSION option that will allow mailsystem admins to control the number of queries allowed in their authentication checks.

If we simply can't sell SPF without all these whiz-bang features, I would say put it *all* on the server side. All the client should have to do is ask - "Hey <domain> is this <ip> OK?" We dropped that idea because it doesn't allow caching on the client side, but with a simple PASS/FAIL response, the cost of no caching is only one UDP round trip per email. This seems like small change compared to worries about runaway redirects, malicious macros, etc.

I'll humour you:

This server-side processing would not be happening on a caching server, correct? That would not save anything. I hope you agree.

If the caching server were in the domain which created the expensive SPF record, then it would save traffic to and from the client, at the expense of traffic within the domain that deserves it. If example.com needs 100 queries within their network to answer my simple query "Is this <ip> OK?", then they need to think about how to better organize their records. All I need is a simple PASS/FAIL, or preferably a list of IP blocks that I can cache to avoid future queries. ( This should be the server's choice.)

I see where the misunderstanding started. Let me attempt to clear it up:

Caching servers are rarely, if ever, deployed close to the authoritative servers. Caching servers really only make sense if they are close to where the queries are generated. I showed this above with my 200ms DSL connection example. It was a little exaggerated, but it serves the purpose of explanation well.

I'm probably using the wrong terminology. I should have said "slave servers" instead of "caching servers". If we can get everyone to compile their records, and use subdomains as I suggest above, then the question of whether slave servers or clients should bear the expense of complex SPF records goes away. So I suggest we come back to this point, if necessary, after we resolve the above questions.

< snip discussion on the role of slave servers >

Let's estimate the worst-case load on DNS if we say "no lookups, one packet only in any response". I'm guessing 90% of domains will provide a simple, compiled, cacheable list of IP blocks. This is as good as it gets, with the possible exception of a fallback to TCP if the packet is too long. The 10% with really complex policies may have a big burden from queries and computations within their own network, but what goes across the Internet is a simple UDP packet with a PASS or FAIL.

Oh, but the critical detail is that a lot of firewalls block port 53 TCP, whether by design or configuration. Since this is the state of the world, DNS queries over TCP are inherently unreliable.

I doubt if the 10% of domains with long compiled SPF records will accept that unreliability as a fact of life. They will stick to UDP, which is more or less guaranteed, in the sense that even if a packet is lost once, the next time it will probably make it. The DNS system deals gracefully with temporary problems like this, so not a problem.

But when your record depends on TCP, and some firewall somewhere blocks it, there's no amount of retrying that will get that connection through.

OK, so we should not depend on TCP, but simply set a maximum record length, like 450 bytes, and expect that all DNS servers will be able to handle it. I'm surprised there isn't a well-established number we can use here.

And because of this, we're stuck with daisy-chaining the longest records. In the end, it's done for the sake of reliability, at the expense of some extra traffic.

The alternative to daisy-chaining long SPF records is separating the chunks into subdomains. See above.

That response is not cacheable, but let's compare the added load to some other things that happen with each email. Setting up a TCP connection is a minimum of three packets. SMTP takes two packets for the HELO and response. MAIL FROM is another two. Then we need two for the authentication. At that point we can send a reject (one packet) and terminate the connection (4 packets). Looks to me like the additional load on DNS is insignificant for normal mail, and only a few percent of the minimum traffic per email in a DoS storm. Also, the additional load is primarily on the domain with the expensive SPF records, where it should be.

Please notice, the premise for this discussion on DNS load is "no lookups, one packet only in any response". The following examples show how bad it can get if we *don't* make that assumption.

This is not always the case. Consider a case like:

"v=spf1 ip4:1.1.1.1/28 mx:t-online.de include:isp1.com include:isp2.com include:isp3.com -all"

Say that the 3 ISPs don't even publish SPF yet, but the includes are there just in case they ever do.

This record is very cheap on the publisher's DNS (only 1 TXT query goes to the publisher's DNS). But for every bandwidth penny spent by the publisher, the 3 ISPs have to spend 1 penny each. Poor t-online.de has to spend 10 pennies for each penny that the publisher spends.

And the sad thing is, while the ISPs can minimize the cost by publishing cheap SPF records, there's nothing t-online can do to lower its damage.

What's even worse, is that t-online can't even find out why it sees increased bandwidth levels. It's extremely complicated to track an MX or A query back to an email address.

Even worse than that, the default max-ncache-ttl in BIND is 3 hours. That means that even if the publisher's TXT record has a TTL of 24H, the ISPs will be hit with a query every 3 hours, while the publisher is hit only every 24H.

So no, the cost is not necessarily on the publisher.

Taking into account the TTLs above, and the TTL of t-online's records of 1H, the score would be

1:8:240

So the publisher's record costs t-online.de 2.40 euro for every penny it costs the publisher.

The ISP's pay 8 pennies each.
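
Spelled out (the 10-queries-per-check figure for the mx mechanism is my rough guess from above: one MX query plus an A query for each of t-online's MX hosts):

    publisher:  24h TTL on its TXT record        ->   1 query / 24h
    each ISP:   3h negative-cache TTL (no SPF)   ->  24h / 3h        =   8 queries / 24h
    t-online:   1h TTL, ~10 queries per check    ->  (24h / 1h) * 10 = 240 queries / 24h
                                                     score:  1 : 8 : 240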

Even if this were a spammer domain, and they weren't *really* doing any internal lookups, the load on their DNS server is two packets for every additional two-packet load on the victims. No amplification factor here.

Add that the spammer is actually likely both to use t-online.de's resources and to be stupid enough not to realize that mail doesn't go through the MX exchange. Suddenly, the amplification factor becomes a certainty.

I agree with you entirely, and your examples above add even more weight to the argument that we should deprecate lookups.

How about this: All SPF records SHOULD be compiled down to a list of IPs. If you need more than that, then do as much as you like, but give the client a simple PASS or FAIL. Most domains will then say "Here is our list of IPs. Don't ask again for X hours." Only a few will say "Our policy is so complex, you can't possibly understand it. Send us every IP you want checked."

That's exactly what the exists:{i}.domain mechanism does. It tells the domain about every IP it wants checked, and the server checks it. Unfortunately, it is extremely expensive because it's AGAU.
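
In record form, the two styles look like this (the addresses and the _spf label are made up):

    "Here is our list of IPs":       v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.0/24 -all
    "Send us every IP you check":    v=spf1 exists:%{i}._spf.%{d} -all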

If I were writing an SPF-doom virus, this is where I would start.

I need to get back to designing ICs. :>)

Nah... you've got some great ideas and I value your contribution and feedback.

And I appreciate your time in getting me up to speed on these problems.
I hope one day I can return the favor.

It's a pleasure to be of service. SPF is a good cause, and I think it deserves to be saved.

Saved is the right word. Unless we change course quickly, I think SPF is heading for oblivion. Did you see the article "Stopping Spam" in Scientific American, April 2005? SenderID is a small part of the overall plan, and SPF is a part of SenderID not even worth mentioning. I think the folks at Microsoft don't really care how messy or inefficient the infrastructure is. They will just adapt to whatever standard is eventually adopted. If that ends up some crazy mix of SPF and PRA, nobody but a few programmers will be bothered by it.

Incidentally, I got curious and did some tests, and it appears that yahoo does not do any DNS queries on incoming mail. Hotmail does two, but either doesn't respect TTLs or does queries on a spot-check basis, because even though I have a low TTL, they did not refresh.

It could be that these two have already figured out, even without checking SPF, that DNS is more expensive than storing spam. Fascinating!

This wasn't a scientific test as I would normally do, but a quick check-your-fears check.

So at least for now, I think I know that yahoo and hotmail will not do any spf checks any time soon, based on this little test and a lot of extrapolation. ;)

I would love to get Yahoo to collaborate on a shared protocol that would support both DomainKeys and SPF, but I haven't found any way to get that discussion started. The FTC is talking about doing something if the industry fails, but the current attitude seems to be that they can let the industry work it out. I see a year of confusion and lots of inter-operability problems.

-- Dave
************************************************************     *
* David MacQuigg, PhD      email:  dmquigg-spf at yahoo.com      *  *
* IC Design Engineer            phone:  USA 520-721-4583      *  *  *
* Analog Design Methodologies                                 *  *  *
*                                   9320 East Mikelyn Lane     * * *
* VRS Consulting, P.C.              Tucson, Arizona 85710        *
************************************************************ *

