Interpreting Results From Multiple "Identical" RR Sets (Was several threads on type 99)

Dealing with SPF records in two different record types with live data is the
one part of SPF that is, in fact, brand new and experimental.  The
discussion we've been having on this topic makes it clear that there are
some lessons to be learned.  I also think that many of these lessons are
generic to other technologies (e.g. DKIM) that hope to do something similar.
So it isn't surprising we'd learn some things that ought to be folded into
the AUTH48.

The first point I would like to make is that while some people have issues
with "MUST be identical", from a validation/implementation perspective this
is a very simple concept to code.  Type 99 support for pyDNS was, IIRC, 4
lines of code and multiple RR support (TXT/type 99) in pySPF was 6.  If we
say anything other than "MUST be identical", then implementation complexity
will jump considerably.

Just to give one example, some parts of an SPF record are case sensitive and
some aren't.  With identical, it's a simple if x == y test.  Anything else
and I have to decompose the record into the case-sensitive parts and the
case-insensitive parts before I do anything to determine equivalence.
Let's not go down this road.
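
A minimal sketch of the "identical" test, assuming each RR set has already
been fetched as a list of record strings.  The function name is made up for
illustration and is not taken from pySPF:

```python
def rr_sets_identical(txt_records, spf_records):
    """True when the TXT and type 99 (SPF) RR sets hold the same strings.

    The comparison is exact and case-sensitive; only the order of the
    records within each set is ignored.
    """
    return sorted(txt_records) == sorted(spf_records)
```

With this test, "v=spf1 mx -all" and "v=spf1 MX -all" count as different,
even though mechanism names are not case sensitive.  Accepting that pair as
equivalent is exactly the decomposition work the exact match avoids.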

So, I really suggest that we not make any changes to the requirements for
SPF record publishers.  The real challenge is what the receiver is supposed
to do in this situation.

I think that rather than pick one over the other (should check SPF first or
should check TXT first), there should be a hierarchy based on the amount of
information provided.  My thought is that an answer always beats no answer.

There are several scenarios we have to deal with....

1.  Has both types and identical
2.  Has TXT and not SPF
3.  Has SPF and not TXT
4.  Has both types and not identical
5.  Has no SPF record on TXT and no response on SPF
6.  Has TXT and no response on SPF
7.  No response on TXT or SPF

I don't think we need to deal with the DNS servers that refuse to answer on
TXT, but do answer on SPF.  Are there others that we should consider?

I think that the current draft deals with cases 1 through 3 just fine.  Case
4 is what Florian Weimer brought up yesterday.  Case 5 is Stuart Gathman's
from a few days ago (and I think the most serious problem).  Case 6 is a
logical extension of case 5.  Case 7 I've never seen, but I could imagine it
happening.

The first point I would make is that we really shouldn't specify which
record to check first.  The purist approach would be to mandate type SPF
first, but we really have no idea if that's going to take off or not.  In
the meantime, checking type SPF first may have substantial performance
penalties.  Let the market decide which one to check first.

The second point is that once check_host() has found a record it SHOULD NOT
keep looking for another one.  The current draft says check_host() isn't
required to look, but I think it needs to be stronger in the other
direction.  Yes, with good DNS management practices case 4 can largely be
avoided, but why beg for trouble?
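
A rough sketch of "stop at the first record found", with the query order
left to the implementation.  The lookup callable and both function names are
hypothetical; lookup(domain, rrtype) is assumed to return a list of record
strings, possibly empty.  The sketch ignores the case of several SPF records
within one RR type:

```python
def is_spf_record(text):
    """True if the string looks like an SPF version 1 record."""
    lowered = text.lower()
    return lowered == "v=spf1" or lowered.startswith("v=spf1 ")


def first_spf_record(domain, lookup, order=("TXT", "SPF")):
    """Return (rrtype, record) for the first SPF record found.

    RR types are queried in the given order, and the search stops as soon
    as one of them yields a record, so the other type is never queried.
    """
    for rrtype in order:
        for text in lookup(domain, rrtype):
            if is_spf_record(text):
                return rrtype, text
    return None, None
```

Because the second RR type is never fetched once a record is in hand, a
receiver written this way never sees the mismatch in case 4.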

Case 5 above is, in my opinion, the one that has the biggest implication for
the spec.  As written, it's a TempError which should get a 45x response.
Here's the actual case that Stuart reported:

Consider the domain szco.com:

$ host -t txt szco.com
;; no records
$ host -t type99 szco.com
;; connection timed out; no servers could be reached


Since there is a DNS timeout, it's a TempError.  The problem here is that
this is pretty clearly a DNS server that doesn't respond to unknown RR types
(this is, as I understand it, not uncommon).  Eventually this sort of problem
should go away, but it won't be for a long time.

From visual inspection it's pretty easy to see what's going on.  While the
connection timed out, there is a good indication that the remote DNS is in
fact reachable because it answered "no records" for type TXT.  This wouldn't
be very hard to code either.

We can deal with cases 5 and 6 pretty easily if we say that an answer on one
RR type (even if it's no records) overrides a DNS error on the other.

Case 7 isn't really something SPF can deal with directly, but at a higher
level a similar construct might be applied.  If the SPF result is TempError,
but the MTA has successfully retrieved A or MX records for the domain, then
the TempError should be treated like none.
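
At the MTA level that could be as small as the following sketch.  The
function name and the has_a_or_mx flag are assumptions for illustration; how
the MTA learns that its A or MX lookups succeeded is left out:

```python
def effective_spf_result(spf_result, has_a_or_mx):
    """Downgrade TempError to None when other lookups for the domain worked.

    spf_result is the result string from check_host(); has_a_or_mx is True
    when the MTA has already retrieved A or MX records for the same domain.
    """
    if spf_result == "TempError" and has_a_or_mx:
        return "None"
    return spf_result
```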

This isn't as complex as it probably sounds.

The current hierarchy of responses is:

1.  TempError - if either RR returns an error it's a TempError
2.  SPF - if SPF records are present in both SPF and TXT, use SPF
3.  TXT or SPF - if SPF records are only in one RR type, use it
4.  None - If no records are returned and there are no errors, the result is
None.

All I'm really proposing is that we say that any answer on either RR type
means it's not a TempError since we have good evidence that the remote DNS is
up and reachable:

1.  SPF - if SPF records are present in both SPF and TXT, use SPF
2.  TXT or SPF - if SPF records are only in one RR type, use it
3.  None - If no records are returned and there are no errors, the result is
None.
4.  TempError - if either RR returns an error it's a TempError.
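
The two orderings can be compared side by side in a short sketch.  It
assumes each lookup yields either a list of SPF records (possibly empty) or a
sentinel meaning the query failed; the sentinel and function names are made
up for illustration.  The proposed version follows the rule as stated in the
prose: TempError only when neither RR type produced an answer.

```python
LOOKUP_ERROR = object()  # hypothetical sentinel: timeout, SERVFAIL, etc.


def _records(answer):
    """Records from an answer, or an empty list if the lookup failed."""
    return [] if answer is LOOKUP_ERROR else list(answer)


def current_hierarchy(txt, spf):
    """Current draft: an error on either RR type wins over everything."""
    if txt is LOOKUP_ERROR or spf is LOOKUP_ERROR:
        return "TempError"
    if spf:
        return "use SPF"
    if txt:
        return "use TXT"
    return "None"


def proposed_hierarchy(txt, spf):
    """Proposed: any answer on either RR type beats a DNS error."""
    if _records(spf):
        return "use SPF"
    if _records(txt):
        return "use TXT"
    if txt is LOOKUP_ERROR and spf is LOOKUP_ERROR:
        return "TempError"
    return "None"
```

For the szco.com case (no records on TXT, timeout on type 99), the first
function returns TempError and the second returns None.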

The impact of these changes would be confined to paragraph 4.4:

http://www.schlitt.net/spf/spf_classic/draft-schlitt-spf-classic-02.html#anchor19

If this has any legs, I'll write up proposed language for the AUTH48.

Scott K
