To my small mind, forcing a new DNS lookup in the event of a
TCP session failure and restart would be a good thing.
perhaps, but it won't work reliably as long as there can be more than
one host associated with a DNS name, nor will it work as long as DNS
name-to-address mapping is used to distribute load over a set of hosts.
We already have the DNS hooks to distinguish services from
hosts. We've had them for the last 8 years.
Yes but SRV records weren't really meant to handle this case either.
And they actually can make applications less reliable because they
introduce a new dependency on DNS (another lookup that can fail, in a
different zone and potentially on a different server, another piece of
configuration data that can be incorrect). What we'd really need is a
RR type specifically intended to map service names onto instance
ID+address pairs, and also a special query type that wasn't defined to
return all of the matching RRs, but would instead return a random
subset or a subset based on heuristics, and finally an instance ID to
address mapping service. But arguably DNS isn't the right place to do
that at all - there should instead be a generic referral service at
layer 3 or 4.
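For what it's worth, the "random subset based on heuristics" selection
can at least be sketched client-side over SRV-style (priority, weight,
target) records, using the RFC 2782 weighting rule. A hypothetical
illustration in Python, not a real query type:

```python
import random

def pick_target(records, rng=random):
    """Pick one target from SRV-style (priority, weight, target)
    records: only the lowest-priority group is eligible, and a
    weighted-random choice is made within it (per RFC 2782)."""
    lowest = min(p for p, _, _ in records)
    group = [r for r in records if r[0] == lowest]
    total = sum(w for _, w, _ in group)
    if total == 0:
        # all weights zero: any member of the group is fair game
        return rng.choice(group)[2]
    point = rng.uniform(0, total)
    running = 0
    for _, weight, target in group:
        running += weight
        if point <= running:
            return target
    return group[-1][2]
```

A record with priority 20 is never chosen while priority-10 records
exist, which is exactly the failover-ordering behavior plain A-record
round robin cannot express.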
Of course, part of the reason that people started using A records to
refer to multiple hosts was that a number of applications "just worked"
when they did that. And I remember when people used to object loudly to
such things, and insist that a DNS name and a host name had to be the
same thing. Anyway, this kind of overloading of A records has been such
a widespread practice for so long that I don't see it changing. And
it's not as if we came up with a better way of doing things for IPv6
addresses.
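The "just worked" behavior was usually nothing more than a loop over
the returned addresses, trying each in turn. A minimal sketch, where
the connect callable and the 192.0.2.x addresses are placeholders:

```python
def connect_any(addresses, connect, errors=(OSError,)):
    """Try each resolved address in order and return the first
    successful connection; re-raise the last failure if none
    succeed. This fallback loop is what made multi-address A
    records 'just work' for many clients."""
    last_exc = None
    for addr in addresses:
        try:
            return connect(addr)
        except errors as exc:
            last_exc = exc
    if last_exc is None:
        raise ValueError("no addresses to try")
    raise last_exc
```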
in other words, doing another DNS lookup of the original DNS name only
looks like a good way to solve the problem if you don't look very deep.
now if you somehow got a host-specific (or narrower) identifier as a
result of setting up the initial connection (maybe via a TCP option),
and you had a way to map that host-specific identifier to its current IP
address (assume for now that you're using DNS, though there are still
other problems with that) - then you could do a different kind of lookup
to get the new IP address and use that to do a restart.
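As a sketch of that idea: assume a hypothetical identifier-to-address
mapping service (resolve_id below), and a connect step in which the
peer hands back its host-specific identifier (standing in for the TCP
option). Nothing here corresponds to a deployed protocol:

```python
class Session:
    """Restart a connection by re-resolving the peer's stable
    host identifier to its *current* address, rather than
    repeating the original DNS name lookup (which may now map
    to a different host entirely)."""

    def __init__(self, name, resolve_name, resolve_id, connect):
        self.resolve_id = resolve_id
        self.connect = connect
        addr = resolve_name(name)                 # ordinary name lookup
        self.sock, self.host_id = connect(addr)   # peer returns its ID

    def restart(self):
        # identifier -> current address, then reconnect to the SAME host
        addr = self.resolve_id(self.host_id)
        self.sock, self.host_id = self.connect(addr)
        return self.sock
```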
even then, it wouldn't help the numerous applications which don't have a
way to cleanly recover from dropped TCP connections. (remember,
TCP was supposed to retransmit data as necessary, sort out
duplicates, provide a clean close, that sort of thing. once you
expect apps to handle dropped connections, they have to
re-implement TCP functionality at a higher layer.)
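"Re-implementing TCP functionality at a higher layer" tends to look
like the following resume-from-confirmed-offset loop; send_from and
confirmed are hypothetical application callbacks, not any real API:

```python
def transfer(data, send_from, confirmed, max_retries=3):
    """Push `data` to a peer, surviving dropped connections by
    asking the peer how many bytes it has confirmed and resuming
    from that offset -- retransmission bookkeeping that TCP was
    supposed to hide from the application."""
    offset = 0
    for _ in range(max_retries + 1):
        try:
            send_from(data, offset)   # may raise mid-transfer
            return len(data)
        except ConnectionError:
            offset = confirmed()      # resume from last confirmed byte
    raise ConnectionError("transfer failed after retries")
```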
Applications need to deal with TCP connections breaking for
all sorts of reasons. Renumbering should be a relatively
infrequent event compared to all the other possible ways a
TCP connection can fail.
Mumble. Seems like the whole point of TCP was to recover from such
failures at a lower level. And I remember how people used to say that
TCP was better than X.25 VCs (in part) because TCP would recover from
temporary network outages that would cause hangups in X.25.
I also don't have a lot of faith in "should be", not when I've seen DHCP
servers routinely refuse to renew leases after very short times, nor
when I've heard people say that a site should be able to renumber every
day.
So, someone misconfigured something. Such misconfigurations
usually get fixed fast.
Getting the automation to the state where a daily renumber
is possible is an achievable goal. If we were doing that,
the long-running apps would have been fixed long ago. The
fact that they aren't is more a matter of pressure than
anything else. That's why I started with a large period
when I was suggesting that router and firewall vendors
actually renumber themselves periodically. It was to keep
the problem in the management space rather than the application
space.
Having each vendor work on their part of the problem is the
way to go.
I used to try to get people to specify a minimum amount of time that a
non-deprecated address should be expected to be valid - say a day. Then
application writers and application protocol designers would have an
idea about whether they needed a strategy for recovery from a
renumbering event, and what kind of strategy they needed. But the only
people who seemed to like this idea were application area people.
Until applications deal nicely with the other failure modes,
complaints about renumbering causing problems at the
application level are just noise.
in other words, one design error can be used to justify another? sort
of like the blind leading the blind?
No. People should work on making renumbering work efficiently.
Using TCP failures at the application level as an excuse to
not pursue making renumbering work cleanly is just that, an
excuse.
I see a significant difference between a design flaw in a particular
application that cripples that application, and a design flaw in a lower
layer that cripples all applications.
Reconnect is a reasonable strategy for most applications.
Holding a TCP session open in the presence of ICMP
host/net unreachable is also a reasonable strategy.
Keith
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: Mark_Andrews(_at_)isc(_dot_)org
_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf