ietf-smtp

Re: BCP for handling DNS SERVFAIL results

2005-11-26 21:25:55

Ned,

I got the hotfix from Microsoft (a per request item, not part of the service
packs), applied it, and the SERVFAIL issue is now resolved. :-)

I think what this all says is that many of the legacy servers, including NT,
Windows 2000, and BIND, had SERVFAIL-related issues, and that the (bad) idea
of treating SERVFAIL as NXDOMAIN in order to fall back to the A record
reflected the fact that this issue did exist in the past for many DNS
servers.

I don't agree with this SMTP outgoing DNS logic consideration, but I can
certainly see why it was discussed in some of the net research I did on the
matter.

Thanks for your comments.

--
Hector Santos, Santronics Software, Inc.
http://www.santronics.com


----- Original Message -----
From: "Hector Santos" <hsantos(_at_)santronics(_dot_)com>
To: <ned+ietf-smtp(_at_)mrochek(_dot_)com>
Cc: "IETF-SMTP" <ietf-smtp(_at_)imc(_dot_)org>
Sent: Saturday, November 26, 2005 3:08 AM
Subject: Re: BCP for handling DNS SERVFAIL results



Well Ned! I did some detailed searching, and found this for Windows NT 4.0
and 2000 DNS.EXE server.

http://support.microsoft.com/default.aspx?scid=kb;en-us;295933

Here is its description (for Windows):

----
SYMPTOMS

The resolution of e-mail names and other names may not work if a Windows
2000-based or Windows NT 4.0-based Domain Name System (DNS) server receives
a non-authoritative response from a root hint or forwarder. The Windows
2000-based or Windows NT 4.0-based DNS server sends a "Server Failure"
message to the client when it receives a Start of Authority (SOA) record
from a non-authoritative resource.

Some non-Microsoft DNS servers (for example, some BIND versions) cache empty
authoritative responses, which Windows 2000-based and Windows NT 4.0-based
DNS servers consider to be referrals. When a Windows 2000-based or Windows
NT 4.0-based DNS server receives such a response, the response is ignored
and a "Server Failure" message is sent to the client instead of the SOA
record.
----

I am trying to get this update now; I thought SP6 already covered this.
Apparently not. Hopefully, this will clear the server issue.

BTW, I found SERVFAIL-related issues with BIND too.

http://www.isc.org/pubs/tn/isc-tn-2002-2.html
http://archives.neohapsis.com/archives/bind/2002/0007.html

I downloaded the BIND source to learn about DNS server logic.

--
Hector Santos, Santronics Software, Inc.
http://www.santronics.com




----- Original Message -----
From: <ned+ietf-smtp(_at_)mrochek(_dot_)com>
To: "Hector Santos" <hsantos(_at_)santronics(_dot_)com>
Cc: <ned+ietf-smtp(_at_)mrochek(_dot_)com>; "IETF-SMTP" <ietf-smtp(_at_)imc(_dot_)org>
Sent: Friday, November 25, 2005 11:39 PM
Subject: Re: BCP for handling DNS SERVFAIL results



Thanks Ned. Excellent info and insight.

I do have a few follow-up questions related to this:

2) Use multiple DNS servers.

   An IBM solution was to suggest making sure you have
   additional DNS servers to query.

Well, sure. Having at least one secondary DNS server is pretty much
essential, and they need to be geographically separate. I note in passing
that interlink.com has two servers and they appear to be on separate
networks.

ns5.ecsecure.com. [208.56.100.1] [TTL=86400]
ns6.ecsecure.com. [216.147.1.227] [TTL=86400]

One of the confusing issues about this, and no doubt probably a
misunderstanding on my part, is related to having multiple DNS servers vs
Primary DNS recursion lookups.

First a disclaimer: DNS operations are not exactly my primary area of
expertise either. Hopefully what I say here will be correct and if it isn't
hopefully someone else will chime in and correct me.

With that said...

Be very careful here with your terminology. In the DNS world a "primary"
server is one that provides authoritative information for one or more
domains. A "secondary" is a slave that periodically transfers information
from the primary and makes it available for queries.

The servers for a given domain are specified by NS records in the "upper"
domain. So the way this works is that a resolver starts at the top of the
tree and walks down using NS records at each level to find the servers
below. If there are multiple NS records I believe the approach is to pick
one at random and if that doesn't work try another. This may also depend on
the resolver implementation.
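
The "pick one at random and if that doesn't work try another" approach can
be sketched roughly as below. This is only an illustration, not how any
particular resolver does it: ask_any and the ask callback are hypothetical
names, and ask stands in for whatever actually sends the query and raises
OSError when a server cannot be reached.

```python
import random


def ask_any(servers, ask, rng=random):
    """Try the listed name servers in random order.

    `ask` is a hypothetical callback that sends one query to one server
    and raises OSError if that server does not answer.
    Returns (server, answer) for the first server that responds.
    """
    order = list(servers)
    rng.shuffle(order)  # pick a random starting server
    for server in order:
        try:
            return server, ask(server)
        except OSError:
            continue  # that one did not work, try another
    raise OSError("no name server answered")
```

With the two ecsecure.com addresses listed earlier, a dead first server
would simply mean the answer comes from the second one.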

Caching of course eliminates the need for many of these queries and makes
the load at the upper levels manageable. There's also a bunch of tricky
stuff done at the very top to make things sufficiently performant while
allowing multiple providers - I know very little about all this magic.

How do I best ask this because again, I am not a DNS admin or a server
expert.

Well, I guess I qualify as an admin since I handle a bunch of primary and
secondary domains. But again, I'm no expert.

I guess the question is, can the same results be expected with:

  1) A server with multiple uplinks, versus

I'm afraid this exceeds my level of expertise. I use bind but I don't have
a multihomed environment so I don't know how it or any other server
implementation handles being multihomed.

  2) Multiple Server list

Applications typically don't have a full resolver built in that's capable
of walking the DNS tree. Rather, they have a so-called "stub" resolver that
is given a list of full resolvers to send queries to. The stub resolver
builds a query and sends it to one of the resolvers, gets back a result and
decodes it.
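
To make "builds a query ... and decodes it" a bit more concrete, here is a
minimal sketch of the two ends of that exchange, using the RFC 1035 message
layout. The function names and the fixed query ID are made up for the
example, and nothing is actually sent on the wire; a real stub resolver
also handles timeouts, truncation, and parsing of the answer section.

```python
import struct

QTYPE_A, QTYPE_MX, CLASS_IN = 1, 15, 1
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, qtype, query_id=0x1234):
    """Encode a single-question DNS query (hypothetical helper)."""
    # Header: ID, flags (only RD, "recursion desired", set),
    # QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, CLASS_IN)


def response_rcode(message):
    """Return the RCODE name from the low 4 bits of the flags word."""
    flags = struct.unpack("!H", message[2:4])[0]
    return RCODES.get(flags & 0x000F, "OTHER")
```

The RCODE is what distinguishes the SERVFAIL (2) and NXDOMAIN (3) results
discussed in this thread from a normal NOERROR (0) reply.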

I guess your statement above about having geographically separate servers
makes all this work better to increase the odds of getting a result.

Right. Of course geographic separation doesn't matter when the problem is
that a server is down, but it can save you when a link is down.

But it was my impression that when you query a primary server, if the query
is not available in the zone and not currently cached, that the server will
query its uplinks. No?

Your terminology is confusing here. I think by "primary server" you mean
"full resolver" and by "uplink" you mean "servers for uplevel domains". If
so, then yes, this is more or less how it works, but it works its way down
from the lowest uplevel entry that the resolver has cached.

You see, for my company SMTP server, I have:

      208.247.131.10
      198.6.1.2

208.247.131.10 is where I have my ns.santronics.com primary DNS server,

OK, but that has nothing to do with whether or not you're using that machine
as your full resolver. For example, the primary server for mrochek.com is
mauve.mrochek.com, but DNS queries on that machine are actually forwarded to
a completely different system for resolution. I believe it's considered good
practice not to have your DNS primaries or secondaries performing general
DNS resolution services.

and I have as forwarders the UUNET servers:

      198.6.1.2
      198.6.1.3

I had the impression this provided the uplink queries when the primary did
not have the information.

Maybe. It would depend on how you have things configured. Forwarders are
basically used to offload DNS processing from one machine to another.

I just happened to see this SERVFAIL when I was testing this customer's
db.usinterlink.com MX record against 208.247.131.10 via Windows'
NSLOOKUP.EXE.

I was assisting him remotely from home and didn't see this SERVFAIL against
the bellsouth.net server:

   NSLOOKUP -query=mx -debug db.usinterlink.com ns.santronics.com
   NSLOOKUP -query=mx -debug db.usinterlink.com dns.msy.bellsouth.net

First one returns SERVFAIL, second one NOERROR.

You might consider clearing the cache on your home server and see if that
helps.

I was able to send him a test message because the SMTP server was finally
able to get to the second UUNET server, and thus fall back to a successful
A record result.

But the situation got me wondering what was wrong or different between the
two, and also, if I or other customers didn't have a second DNS server set
up for SMTP, whether it's something to worry about.

4) Ignore SERVFAIL?

Some just said that the SMTP client should be looking at SERVFAIL as an
NXDOMAIN, etc.

Bad idea IMO. Configuration glitches happen, and when they do you
don't want to bounce mail to the domain unnecessarily. Most of the
time these problems get fixed and the mail goes on through with
only a small delay.

I agree. I was wondering, and now realize that it's probably wrong to jump
the gun with this, if it would make sense to do an A record lookup for a
SERVFAIL.

I've actually seen cases where MX record queries got a SERVFAIL but an A
record query got a successful result. (I've always believed this is due to
the server being down or misconfigured and having one record type cached but
not the other, but I've never been able to track down the specific cause.)
This stuff gets very complex because DNS servers do tricky stuff like
piggyback A record information on MX queries, making problems hard to
isolate.

But this doesn't mean that you should do this: Nothing prevents someone from
having an MX for foo.example.com pointing to a completely separate
mail.example.org while having a SMTP server running on foo.example.com that
silently eats everything sent to it. And yes, such a setup would be stupid
and dangerous, but people do stupid and dangerous things all the time.
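
Put as a rough sketch, the handling argued for above might look like the
following. The function name and the action strings are invented for the
example, and it assumes the usual RFC 2821 rule that the A record fallback
applies only when the MX lookup succeeds and returns no MX records; it is
not taken from any particular MTA.

```python
def next_step(mx_rcode, mx_hosts):
    """Decide what an SMTP client does after the MX lookup for a domain.

    mx_rcode is the DNS result code name; mx_hosts is the (possibly
    empty) list of MX targets returned with it.
    """
    if mx_rcode == "SERVFAIL":
        # Temporary DNS trouble: keep the message queued and retry later.
        # Do NOT fall back to the A record, since an MX may well exist.
        return "defer"
    if mx_rcode == "NXDOMAIN":
        return "bounce"  # the domain does not exist
    if mx_rcode == "NOERROR" and mx_hosts:
        return "use-mx"
    if mx_rcode == "NOERROR":
        return "try-a-record"  # lookup worked, no MX: implicit MX via A
    return "defer"  # anything unexpected is treated as temporary
```

The point of the "defer" branch is that a SERVFAIL says nothing about
whether an MX exists, so the foo.example.com trap described above is never
reached.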

Ned