Thanks for the comments, Matt. Responses below:
Matt Mathis wrote:
I reviewed draft-ietf-sipping-overload-reqs-02 at the request of the
area directors. Note that my area of expertise is TCP, congestion control and
bulk data transport. I am not a SIP expert, and have not been following the
I have serious concerns about this document because it explicitly excludes the
only approach for coping with overload that is guaranteed to be robust under
all conditions. Although I know it is considered bad form to describe
solutions while debating requirements, I think a sketch of a solution will
greatly clarify the discussion of the requirements.
The only robust overload signal is the natural implicit signal - silently
discarding excess requests. Explicit overload messages (code 503) should be
optional, and must have an explicit rate limit.
Agree. Our intention for the solution was exactly that; we have an
explicit feedback mechanism (like ECN provides) that can be used, in
addition to treating lack of any signal as a sign of congestion as well.
Sending additional messages to explicitly indicate overload is intrinsically
Agree too. SIP requests normally generate responses, and so the plan is
to have a response code which can be used to clearly say "I'm
overloaded". This is not an additional message - it's the normal SIP
message that is sent - but with clear meaning.
And of course, lack of any response at all needs to be treated as a sign
of congestion too.
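
To make the two signals concrete, here is a minimal sketch of how a client might classify the outcome of a request. It is only an illustration under stated assumptions: the names `Signal`, `OVERLOAD_CODE` and `classify` are hypothetical, and 503 is used purely as a placeholder for whatever overload-specific response the eventual mechanism defines.

```python
from enum import Enum
from typing import Optional


class Signal(Enum):
    OK = "ok"
    EXPLICIT_OVERLOAD = "explicit_overload"   # response that says "I'm overloaded"
    IMPLICIT_OVERLOAD = "implicit_overload"   # no response at all
    OTHER_FAILURE = "other_failure"           # failure unrelated to overload


# Assumption: placeholder value only. The requirements do not pick a code;
# they only ask that the explicit indication be unambiguous.
OVERLOAD_CODE = 503


def classify(status: Optional[int]) -> Signal:
    """Map a final response status (or None on timeout) to a signal."""
    if status is None:
        # Lack of any response is treated as a sign of congestion.
        return Signal.IMPLICIT_OVERLOAD
    if status == OVERLOAD_CODE:
        return Signal.EXPLICIT_OVERLOAD
    if status >= 400:
        return Signal.OTHER_FAILURE
    return Signal.OK
```

Both overload outcomes would feed the same load-reduction logic; the explicit one just arrives sooner and with less ambiguity.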
My specific objections to the document are as follows: Requirement 6 calls for
explicit overload messages and forbids silently discarding requests, since
they are not unambiguous in their meaning.
That was not the intent of the requirement. The requirement is meant to
say that any explicit message used to signal overload must be used
solely for that purpose, and not to signal other, non-overload-related
events. I've reworded it to say:
<t hangText="REQ 6:">When overload is signaled by means of a specific
message, the message must clearly indicate that it is being sent
because of overload, as opposed to other, non-overload based failure
conditions. This requirement is meant to avoid some of the problems
that have arisen from the reuse of the 503 response code for multiple
purposes. Of course, overload is also signaled by lack of response to
requests. This requirement applies only to explicit overload
Requirement 15 seems to provide a
loophole (allowing complete failures) but seems to forbid using it as the
Per above, the intention all along was to treat lack of a response as an
indication of congestion. The requirement most certainly does not limit
itself to complete failures; it calls out overload as the first cause of
this problem. Neither does the requirement forbid lack of a response
from being the preferred mechanism. The requirement reads:
<t hangText="REQ 15:"> In cases where a network element fails, is
so overloaded that it cannot process messages, or cannot communicate
due to a network failure or network partition, it will
not be able to provide explicit indications of its levels of
congestion. The mechanism should properly function in these cases.
I think this is pretty clear and it directly addresses your concern -
the solution has to work in cases where there is no response whatsoever.
Can you suggest alternate text that would improve on this?
Requirement 8 does not make sense without explicit
<t hangText="REQ 8:"> The mechanism shall ensure that, when a request
was not processed successfully due to overload (or failure) of a
downstream element, the request will not be retried on another
element which is also overloaded or whose status is unknown. This
requirement derives from REQ 1.
which handles both explicit and implicit overload signals.
Requirements 7, 8 and 9 should note that they can be (are
already?) equivalently satisfied by properly structured exponential
retransmission backoff timers in SIP itself.
Requirements 8 and 9 deal with sending requests to other elements,
besides the one which was overloaded. That case is not handled by the
structured exponential backoff timers in SIP, which handle
retransmissions of a request within a single transaction to a single
server. These requirements are dealing with behavior across different
servers and different transactions.
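
As a rough illustration of the cross-server behavior REQ 8 asks for, the sketch below picks the next element to try after a failure and skips any element that is known to be overloaded or whose status is unknown. The names `Status` and `next_target` and the status table are hypothetical, not part of any proposed mechanism.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class Status(Enum):
    AVAILABLE = "available"
    OVERLOADED = "overloaded"
    UNKNOWN = "unknown"


def next_target(candidates: List[str],
                status: Dict[str, Status],
                tried: Set[str]) -> Optional[str]:
    """Return the first untried element known to be available, else None.

    Elements missing from the status table count as UNKNOWN and are
    skipped, matching the "or whose status is unknown" clause of REQ 8.
    """
    for server in candidates:
        if server in tried:
            continue
        if status.get(server, Status.UNKNOWN) is Status.AVAILABLE:
            return server
    return None
```

Returning None here means the request is rejected upstream rather than retried, which is the point: per-transaction retransmit timers never see this decision.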
Requirement 7 is partly addressed by SIP's retransmit behavior. However,
those timers apply independently to each transaction, and in cases of a
large number of transactions between a pair of servers, they are not
sufficient to prevent overload. This requirement is meant to improve on
I would like to point out that TCP, IP and several other transport protocols
have evolved in the same direction as I am advocating for SIP: the only
indication that an error has occurred is connection failure.
True, and we absolutely need to utilize that. However, I do not believe
this eliminates the utility of explicit congestion indicators, as ECN
provides (for example), as a way to further improve performance.
are cached and sometimes accelerate timers (e.g. retransmit now, or go to the
next IP address now), but do not change basic protocol behavior. Error
messages are most often rate limited at the sender and the saved error codes
are used to provide a clue why something failed, but the fact that it failed
most likely comes from a timer, not the message itself. The number of error
messages that are required for correct operation is declining (note that RFC 4821
makes ICMP can't fragment optional), and may be zero.
Rate limiting all error messages and treating them as advisory improves
robustness in several ways: fraudulent messages have less impact, error
messages cannot be used as DDoS attack magnifiers, and overload is addressed
implicitly by silently discarding requests.
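
One way to picture this is a token bucket in front of the error path: while tokens remain, an excess request gets an explicit error response, and once they run out it is silently discarded. This is a minimal sketch under assumptions; `ErrorRateLimiter`, `handle_excess_request` and the rate and burst parameters are hypothetical, and the clock is passed in so the behavior is easy to check.

```python
class ErrorRateLimiter:
    """Token bucket limiting how many explicit error responses are sent."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec      # tokens added per second
        self.burst = float(burst)     # maximum tokens held
        self.tokens = float(burst)    # start with a full bucket
        self.last = now               # time of the last refill

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_excess_request(limiter: ErrorRateLimiter, now: float) -> str:
    """Decide what to do with a request the server cannot process."""
    if limiter.allow(now):
        return "send_error_response"
    # Over the limit: drop silently and let sender timers do the work.
    return "silent_discard"
```

With a rate of 1 per second and a burst of 2, three excess requests arriving at the same instant would produce two error responses and one silent discard.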
Note that the normal, non-crisis, behavior has not changed significantly:
error messages are sent, cached and reported to the application. However, in
crisis, the error reporting degrades gracefully, while the throughput goes
flat, without any negative slope. This is where SIP (and all other
should strive to be.
Right - and the explicit signals are meant for these periods of
overload, but not for periods of crisis.
Jonathan D. Rosenberg, Ph.D. 499 Thornall St.
Cisco Fellow Edison, NJ 08837
Cisco, Voice Technology Group
http://www.jdrosen.net PHONE: (408) 902-3084
IETF mailing list