To my small mind, forcing a new DNS lookup in the event of a
TCP session failure and restart would be a good thing.
perhaps, but it won't work reliably as long as there can be more than
one host associated with a DNS name, nor will it work as long as DNS
name-to-address mapping is used to distribute load over a set of hosts.
We already have the DNS hooks to distinguish services from
hosts. We've had them for the last 8 years.
Yes but SRV records weren't really meant to handle this case either.
And they actually can make applications less reliable because they
introduce a new dependency on DNS (another lookup that can fail, in a
different zone and potentially on a different server, another piece of
configuration data that can be incorrect). What we'd really need is a
RR type specifically intended to map service names onto instance
ID+address pairs, and also a special query type that wasn't defined to
return all of the matching RRs, but would instead return a random
subset or a subset based on heuristics, and finally an instance ID to
address mapping service. But arguably DNS isn't the right place to do
that at all - there should instead be a generic referral service at
layer 3 or 4.
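For what it's worth, the "random subset based on heuristics" selection
can at least be sketched client-side over SRV-style (priority, weight,
target) records, using the RFC 2782 weighting rule. A hypothetical
illustration in Python, not a real query type:

```python
import random

def pick_target(records, rng=random):
    """Pick one target from SRV-style (priority, weight, target)
    records: only the lowest-priority group is eligible, and a
    weighted-random choice is made within it (per RFC 2782)."""
    lowest = min(p for p, _, _ in records)
    group = [r for r in records if r[0] == lowest]
    total = sum(w for _, w, _ in group)
    if total == 0:
        # all weights zero: any member of the group is fair game
        return rng.choice(group)[2]
    point = rng.uniform(0, total)
    running = 0
    for _, weight, target in group:
        running += weight
        if point <= running:
            return target
    return group[-1][2]
```

A record with priority 20 is never chosen while priority-10 records
exist, which is exactly the failover-ordering behavior plain A-record
round robin cannot express.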
Of course, part of the reason that people started using A records to
refer to multiple hosts was that a number of applications "just worked"
when they did that. And I remember when people used to object loudly to
such things, and insist that a DNS name and a host name had to be the
same thing. Anyway, this kind of overloading of A records has been such
a widespread practice for so long that I don't see it changing. And
it's not as if we came up with a better way of doing things for IPv6
addresses.
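The "just worked" behavior was usually nothing more than a loop over
the returned addresses, trying each in turn. A minimal sketch, where
the connect callable and the 192.0.2.x addresses are placeholders:

```python
def connect_any(addresses, connect, errors=(OSError,)):
    """Try each resolved address in order and return the first
    successful connection; re-raise the last failure if none
    succeed. This fallback loop is what made multi-address A
    records 'just work' for many clients."""
    last_exc = None
    for addr in addresses:
        try:
            return connect(addr)
        except errors as exc:
            last_exc = exc
    if last_exc is None:
        raise ValueError("no addresses to try")
    raise last_exc
```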
in other words, doing another DNS lookup of the original DNS name only
looks like a good way to solve the problem if you don't look very deep.
now if you somehow got a host-specific (or narrower) identifier as a
result of setting up the initial connection (maybe via a TCP option),
and you had a way to map that host-specific identifier to its current IP
address (assume for now that you're using DNS, though there are still
other problems with that) - then you could do a different kind of lookup
to get the new IP address and use that to do a restart.
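As a sketch of that idea: assume a hypothetical identifier-to-address
mapping service (resolve_id below), and a connect step in which the
peer hands back its host-specific identifier (standing in for the TCP
option). Nothing here corresponds to a deployed protocol:

```python
class Session:
    """Restart a connection by re-resolving the peer's stable
    host identifier to its *current* address, rather than
    repeating the original DNS name lookup (which may now map
    to a different host entirely)."""

    def __init__(self, name, resolve_name, resolve_id, connect):
        self.resolve_id = resolve_id
        self.connect = connect
        addr = resolve_name(name)                 # ordinary name lookup
        self.sock, self.host_id = connect(addr)   # peer returns its ID

    def restart(self):
        # identifier -> current address, then reconnect to the SAME host
        addr = self.resolve_id(self.host_id)
        self.sock, self.host_id = self.connect(addr)
        return self.sock
```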
even then, it wouldn't help the numerous applications which don't have a
way to cleanly recover from dropped TCP connections. (remember,
TCP was supposed to retransmit data as necessary, sort out
duplicates, provide a clean close, that sort of thing. once you
expect apps to handle dropped connections, they have to
re-implement TCP functionality at a higher layer.)
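"Re-implementing TCP functionality at a higher layer" tends to look
like the following resume-from-confirmed-offset loop; send_from and
confirmed are hypothetical application callbacks, not any real API:

```python
def transfer(data, send_from, confirmed, max_retries=3):
    """Push `data` to a peer, surviving dropped connections by
    asking the peer how many bytes it has confirmed and resuming
    from that offset -- retransmission bookkeeping that TCP was
    supposed to hide from the application."""
    offset = 0
    for _ in range(max_retries + 1):
        try:
            send_from(data, offset)   # may raise mid-transfer
            return len(data)
        except ConnectionError:
            offset = confirmed()      # resume from last confirmed byte
    raise ConnectionError("transfer failed after retries")
```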
Applications need to deal with TCP connections breaking for
all sorts of reasons. Renumbering should be a relatively
infrequent event compared to all the other possible ways a
TCP connection can fail.
Mumble. Seems like the whole point of TCP was to recover from such
failures at a lower level. And I remember how people used to say that
TCP was better than X.25 VCs (in part) because TCP would recover from
temporary network outages that would cause hangups in X.25.
I also don't have a lot of faith in "should be", not when I've seen DHCP
servers routinely refuse to renew leases after very short times, nor
when I've heard people say that a site should be able to renumber every
day.
So, someone misconfigured something. Such misconfigurations
usually get fixed fast.
Getting the automation to the state where a daily renumber
is possible is an achievable goal. If we were doing that,
the long-running apps would have been fixed long ago. The
fact that they aren't is more a matter of pressure than
anything else. That's why I started with a large period
when I was suggesting that router and firewall vendors
actually renumber themselves periodically. It was to keep
the problem in the management space rather than the application
space.
Having each vendor work on their part of the problem is the
way to go.
I used to try to get people to specify a minimum amount of time that a
non-deprecated address should be expected to be valid - say a day. Then
application writers and application protocol designers would have an
idea about whether they needed a strategy for recovery from a
renumbering event, and what kind of strategy they needed. But the only
people who seemed to like this idea were application area people.
Until applications deal nicely with the other failure modes,
complaints about renumbering causing problems at the
application level are just noise.
in other words, one design error can be used to justify another? sort
of like the blind leading the blind?
No. People should work on making renumbering work efficiently.
Using TCP failures at the application level as an excuse to
not pursue making renumbering work cleanly is just that, an
excuse.
I see a significant difference between a design flaw in a particular
application that cripples that application, and a design flaw in a lower
layer that cripples all applications.
Reconnect is a reasonable strategy for most applications.
Holding a TCP session open in the presence of ICMP
host/net unreachable is also a reasonable strategy.
Keith
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: Mark_Andrews(_at_)isc(_dot_)org
_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf