Re: Last Call: draft-ietf-sipping-overload-reqs (Requirements for Management of Overload in the Session Initiation Protocol) to Informational RFC

Thanks for the comments, Matt. Responses below:

Matt Mathis wrote:

I reviewed draft-ietf-sipping-overload-reqs-02 at the request of the transport
area directors.  Note that my area of expertise is TCP, congestion control and
bulk data transport.  I am not a SIP expert, and have not been following the
SIP documents.

I have serious concerns about this document because it explicitly excludes the
only approach for coping with overload that is guaranteed to be robust under
all conditions.  Although I know it is considered bad form to describe
solutions while debating requirements, I think a sketch of a solution will
greatly clarify the discussion of the requirements.

The only robust overload signal is the natural implicit signal - silently 
discarding excess requests.  Explicit overload messages (code 503) should be 
optional, and must have an explicit rate limit.


Agree. Our intention for the solution was exactly that; we have an 
explicit feedback mechanism (like ECN provides) that can be used, in 
addition to treating lack of any signal as a sign of congestion as well.

Sending additional messages to explicitly indicate overload is intrinsically 
fragile.


Agree too. SIP requests normally generate responses, and so the plan is 
to have a response code which can be used to clearly say "I'm 
overloaded". This is not an additional message - it's the normal SIP 
message that is sent - but with clear meaning.

And of course, lack of any response at all needs to be treated as a sign 
of congestion too.
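
To illustrate how a client might fold both signals into one decision, here
is a minimal Python sketch. It is not text from the draft. The names
Outcome and classify are hypothetical, and overload_code is a placeholder
for whatever dedicated response code a solution ends up defining.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    EXPLICIT_OVERLOAD = "explicit_overload"
    IMPLICIT_OVERLOAD = "implicit_overload"
    OTHER = "other"


def classify(status: Optional[int], overload_code: int) -> Outcome:
    """Map the result of one request onto an overload signal.

    status is the final SIP status code, or None if the transaction
    timed out without any response.
    overload_code is an assumed placeholder; the draft names no code.
    """
    if status is None:
        # No response at all: treat as a sign of congestion.
        return Outcome.IMPLICIT_OVERLOAD
    if status == overload_code:
        # Dedicated response that means "I'm overloaded" and nothing else.
        return Outcome.EXPLICIT_OVERLOAD
    if 200 <= status < 300:
        return Outcome.OK
    # Any other final response is not an overload signal.
    return Outcome.OTHER
```

The point of the sketch is only that both branches lead to the same kind of
reaction at the sender; the explicit branch just gets there sooner.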

My specific objections to the document are as follows: Requirement 6 calls for
explicit overload messages and forbids silently discarding requests, since
they are not unambiguous in their meaning.


That was not the intent of the requirement. The requirement is meant to 
say that any explicit message used to signal overload must be used 
solely for that purpose, and not to signal other, non-overload related 
events. I've reworded it to say:

<t hangText="REQ 6:">When overload is signaled by means of a specific
message, the message must clearly indicate that it is being sent
because of overload, as opposed to other, non-overload based failure
conditions. This requirement is meant to avoid some of the problems
that have arisen from the reuse of the 503 response code for multiple
purposes. Of course, overload is also signaled by lack of response to
requests. This requirement applies only to explicit overload
signals. </t>

Requirement 15 seems to provide a 
loophole (allowing complete failures) but seems to forbid using it as the 
preferred mechanism.


Per above, the intention all along was to treat lack of a response as an 
indication of congestion. The requirement most certainly does not limit 
itself to complete failures; it calls out overload as the first cause of 
this problem. Neither does the requirement forbid lack of a response 
from being the preferred mechanism. The requirement reads:

<t hangText="REQ 15:"> In cases where a network element fails,  is
so overloaded that it cannot process messages, or cannot communicate
due to a network failure or network partition, it will
not be able to provide explicit indications of its levels of
congestion. The mechanism should properly function in these cases.
</t>

I think this is pretty clear and it directly addresses your concern - 
the solution has to work in cases where there is no response whatsoever. 
Can you suggest alternate text that would improve it?

Requirement 8 does not make sense without explicit 
notification.


Reworded to:

<t hangText="REQ 8:"> The mechanism shall ensure that, when a request
was not processed successfully due to overload (or failure) of a
downstream element, the request will not be retried on another
element which is also overloaded or whose status is unknown. This
requirement derives from REQ 1.
</t>

which handles both explicit and implicit overload signals.
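
As a rough sketch of what REQ 8 asks of a sender, the Python below picks
the next element to try after a failure. It is an illustration only: the
names Status and next_target are hypothetical, and how status is learned
(explicit response or lack of response) is left out.

```python
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    AVAILABLE = "available"
    OVERLOADED = "overloaded"
    UNKNOWN = "unknown"


def next_target(candidates: List[str],
                status: Dict[str, Status],
                failed: str) -> Optional[str]:
    """Return the first candidate known to be available, else None.

    Elements that are overloaded, or whose status is unknown, are
    skipped, as is the element that just failed.
    """
    for element in candidates:
        if element == failed:
            continue
        # A missing entry means we know nothing about the element.
        if status.get(element, Status.UNKNOWN) is Status.AVAILABLE:
            return element
    return None
```

Returning None means the request is not retried at all, which is the
behavior the requirement is after when no alternative is known to be healthy.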

Requirements 7, 8 and 9 should note that they can be (are 
already?)  equivalently satisfied by properly structured exponential 
retransmission backoff timers in SIP itself.


Requirements 8 and 9 deal with sending requests to other elements, 
besides the one which was overloaded. That case is not handled by the 
structured exponential backoff timers in SIP, which handle 
retransmissions of a request within a single transaction to a single 
server. These requirements are dealing with behavior across different 
servers and different transactions.

Requirement 7 is partly addressed by SIP's retransmit behavior. However, 
those timers apply independently to each transaction, and in cases of a 
large number of transactions between a pair of servers, they are not 
sufficient to prevent overload. This requirement is meant to improve on 
this situation.
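
A small worked example of why per-transaction timers are not enough. The
Python below assumes the RFC 3261 defaults for an INVITE over UDP (T1 of
500 ms, retransmit interval doubling each time, transaction timeout at
64*T1); the function names are hypothetical and the model ignores
everything except a server that never answers.

```python
from typing import List


def invite_send_times(t1: float = 0.5) -> List[float]:
    """Times (in seconds) at which one unanswered INVITE is sent."""
    times = [0.0]
    interval = t1          # first retransmit interval
    deadline = 64 * t1     # transaction gives up at this point
    while times[-1] + interval < deadline:
        times.append(times[-1] + interval)
        interval *= 2      # exponential backoff within the transaction
    return times


def total_messages(transactions: int, t1: float = 0.5) -> int:
    """Messages sent if every transaction goes unanswered."""
    return transactions * len(invite_send_times(t1))
```

Under these assumptions each unanswered INVITE is sent 7 times within 32
seconds, so 1000 concurrent transactions towards one silent server produce
7000 messages. Backoff spaces out the copies of each request, but it does
nothing to limit how many transactions are started.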


I would like to point out that TCP, IP and several other transport protocols
have evolved in the same direction as I am advocating for SIP: the only robust
indication that an error has occurred is connection failure.


True, and we absolutely need to utilize that. However, I do not believe 
this eliminates the utility of explicit congestion indicators, as ECN 
provides (for example), as a way to further improve performance.

Error messages 
are cached and sometimes accelerate timers (e.g. retransmit now, or go to the 
next IP address now), but do not change basic protocol behavior.  Error 
messages are most often rate limited at the sender and the saved error codes 
are used to provide a clue why something failed, but the fact that it failed 
most likely comes from a timer, not the message itself.  The number of error 
messages that are required for correct operation is declining (note that RFC 
4821 makes ICMP "can't fragment" optional), and may be zero.

Rate limiting all error messages and treating them as advisory improves 
robustness in several ways: fraudulent messages have less impact, error 
messages cannot be used as DDoS attack magnifiers, and overload is addressed 
implicitly by silently discarding requests.
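
A minimal sketch of the rate-limiting idea, in Python, using a token
bucket. This is one possible technique, not something specified in the
draft or in this thread; ErrorRateLimiter, handle_excess_request and the
rate and burst parameters are all hypothetical.

```python
class ErrorRateLimiter:
    """Token bucket that caps how many explicit error responses are sent."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens held
        self.tokens = burst
        self.last = 0.0       # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_excess_request(limiter: ErrorRateLimiter, now: float) -> str:
    """Decide what to do with a request the server cannot process."""
    # Within the limit: send the explicit (advisory) error response.
    # Beyond it: discard silently and let the sender's timers act.
    return "send_503" if limiter.allow(now) else "drop_silently"
```

With rate=1 and burst=2, the first two excess requests at a given instant
get a response and the rest are dropped until tokens refill, so the cost of
error reporting stays bounded no matter how large the offered load is.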

Note that the normal, non-crisis, behavior has not changed significantly: 
error messages are sent, cached and reported to the application.  However, in 
a crisis, the error reporting degrades gracefully, while the throughput goes 
flat, without any negative slope.  This is where SIP (and all other protocols) 
should strive to be.


Right - and the purpose of the explicit signals is to handle these periods 
of overload that fall short of crisis.

Thanks,
Jonathan R.
-- 
Jonathan D. Rosenberg, Ph.D.                   499 Thornall St.
Cisco Fellow                                   Edison, NJ 08837
Cisco, Voice Technology Group
jdrosen(_at_)cisco(_dot_)com
http://www.jdrosen.net                         PHONE: (408) 902-3084
http://www.cisco.com

Re: Last Call: draft-ietf-sipping-overload-reqs (Requirements for Management of Overload in the Session Initiation Protocol) to Informational RFC