
Re: Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt> (Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices) to Informational RFC

2013-01-24 10:05:49
Reviews of draft-ietf-bmwg-sip-bench-term-08 and draft-ietf-bmwg-sip-bench-meth-08

Summary: These drafts are not ready for publication as RFCs.

First, some of the text in these documents shows signs of being old, and the
working group may have been staring at them so long that the problems have become hard to see. The terminology document says "The issue of overload in SIP networks is currently a topic of discussion in the SIPPING WG." (SIPPING was closed in 2009). The methodology document suggests a "flooding" rate that is orders of magnitude below what simple devices achieve at the moment. That these survived working group last call indicates that a different type of WG review may be needed to groom other bugs out of the documents.

Who is asking for these benchmarks, and are they (still) participating in the
group?  The measurements defined here are very simplistic and will provide
limited insight into the relative performance of two elements in a real
deployment. The documents should be clear about their limitations, and it would be good to know that the community asking for these benchmarks is getting tools that will actually be useful to them. The crux of these two documents is in the
last paragraph of the introduction to the methodology doc: "Finally, the
overall value of these tests is to serve as a comparison function between
multiple SIP implementations". The documents punt on providing any comparison
guidance, but even if we assume someone can figure that out, do these
benchmarks provide something actually useful as inputs?

It would be good to explain how these documents relate to RFC6076.

The terminology tries to refine the definition of session, but the definition provided, "The combination of signaling and media messages and processes that support a SIP-based service" doesn't answer what's in one session vs another. Trying to generically define session has been hard and several working groups
have struggled with it (see INSIPID for a current version of that
conversation). This document doesn't _need_ a generic definition of session - it only needs to define the set of messages that it is measuring. It would be much clearer to say "for the purposes of this document, a session is the set of SIP messages associated with an INVITE-initiated dialog and any Associated Media, or a series of related SIP MESSAGE requests". (And looking at the benchmarks, you aren't leveraging related MESSAGE requests - they all appear to be completely independent.) Introducing the concepts of INVITE-initiated sessions and non-INVITE-initiated sessions doesn't actually help define the metrics. When you get to the metrics, you can speak concretely in terms of a series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a short introduction for folks with PSTN backgrounds relating these to "Session Attempts", will be clearer.

To be clear, I strongly suggest a fundamental restructuring of the document to describe the benchmarks in terms of dialogs and transactions, and remove the IS
and NS concepts completely.

The INVITE related tests assume no provisional responses, leaving out the
effect on a device's memory when the state machines it is maintaining transition to the proceeding state. Further, by not including provisionals, and building the tests to search for Timer B firing, the tests ensure there will be multiple retransmissions of the INVITE (when using UDP) that the device being tested has to handle. The traffic an element has to handle, and likely the memory it will consume, will be very different with even a single 100 Trying, which is the more
usual case in deployed networks. The document should be clear _why_ it chose
the test model it did and left out metrics that took having a provisional
response into account. Similarly, you are leaving out the delayed-offer INVITE transactions used by 3pcc and it should be more obvious that you are doing so.
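
To put the retransmission point in concrete terms, here is a rough sketch (my arithmetic, not anything taken from the drafts) of how many copies of a single INVITE a UAC emits over UDP when no response at all arrives, using the RFC 3261 defaults (T1 = 500 ms, Timer A doubling, Timer B firing at 64*T1):

    # Count INVITE transmissions for one attempt that never gets a response.
    # Assumes RFC 3261 default timers; a prompt 100 Trying would stop the
    # retransmissions after the first (or second) packet.
    T1 = 0.5
    timer_b = 64 * T1                  # 32 seconds
    t, interval, sends = 0.0, T1, 1    # initial INVITE at t = 0
    while t + interval < timer_b:
        t += interval                  # Timer A fires: retransmit
        sends += 1
        interval *= 2
    print(sends)                       # 7 INVITEs on the wire per attempt

Seven packets per attempted session, versus one or two when a 100 Trying arrives promptly, is a very different load on the device under test.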

Likewise, the media oriented tests take a very basic approach to simulating
media. It should be explicitly stated that you are simulating the effects of a codec like G.711 and that you are assuming an element would only be forwarding
packets and has to do no transcoding work. It's not clear from the documents
whether the EA is generating actual media or dummy packets. If it's actual
media, the test parameters that assume constant sized packets at a constant
rate will not work well for video (and I suspect endpoints, like B2BUAs, will
terminate your call early if you send them garbage).
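
For reference, here is roughly what that constant-size, constant-rate model amounts to for a G.711-like stream at 20 ms packetization (my numbers, not values from the drafts); a video stream or a variable-rate codec simply doesn't fit this shape:

    # Back-of-the-envelope figures for the "constant packet size at a
    # constant rate" media model, per stream and per direction.
    ptime_ms = 20
    payload  = 160                     # bytes of G.711 audio per packet
    headers  = 12 + 8 + 20             # RTP + UDP + IPv4, ignoring layer 2
    pps      = 1000 // ptime_ms        # 50 packets per second
    bps      = pps * (payload + headers) * 8
    print(pps, bps)                    # 50 packets/s, 80000 bits/s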

The sections on a series of INVITEs are fairly clear that you mean each of them
to have different dialog identifiers.  I don't see any discussion of varying
the To: URI. If you don't, what's going to keep a gateway or B2BUA from
rejecting all but the first with something like Busy? Similarly, I'm not
finding where you talk about how many AoRs you are registering against in the registration tests. I think, as written, someone could write this where all the
REGISTERs affected only one AoR.

The methodology document calls Stress Testing out of scope, but the very nature of the Benchmarking algorithm is a stress test. You are iteratively pushing to see at what point something fails, _exactly_ by finding the rate of attempted
sessions per second that the thing under test would consider too high.
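
In other words, the benchmarking algorithm already amounts to something like the following search (my paraphrase, not the drafts' pseudo-code; attempt_rate_ok() is a stand-in for "offer traffic at this rate and verify that every session succeeded"):

    # Find the highest offered rate that produces no failures - which is a
    # stress test by construction.
    def find_max_rate(attempt_rate_ok, start=100, ceiling=1_000_000):
        lo, hi = 0, start
        while hi <= ceiling and attempt_rate_ok(hi):
            lo, hi = hi, hi * 2        # push until something fails
        while hi - lo > 1:             # then close in on the boundary
            mid = (lo + hi) // 2
            if attempt_rate_ok(mid):
                lo = mid
            else:
                hi = mid
        return lo                      # highest rate with no failed sessions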

Now to specific issues in document order, starting with the terminology
document (nits are separate and at the end):

* T (for Terminology document): The title and abstract are misleading - this is
not general benchmarking for SIP performance. You have a narrow set of
tests, gathering metrics on a small subset of the protocol machinery.
Please (as RFC 6076 did) look for a title that matches the scope of the
document. For instance, someone testing a SIP Events server would be ill-served
with the benchmarks defined here.

* T, section 1: RFC5393 should be a normative reference. You probably also need to pull in RFCs 4320 and 6026 in general - they affect the state machines you
are measuring.

* T, 3.1.1: As noted above, this definition of session is not useful. It
doesn't provide any distinction between two different sessions. I strongly
disagree that SIP reserves "session" to describe services analogous to
telephone calls on a switched network - please provide a reference. SIP INVITE
transactions can pend forever - it is only the limited subset of the use of
the transactions (where you don't use a provisional response) that keeps this communication "brief". In the normal case, an INVITE and its final response can
be separated by an arbitrary amount of time. Instead of trying to tweak this
text, I suggest replacing all of it with simpler, more direct descriptions of
the sequence of messages you are using for the benchmarks you are defining
here.

*T, 3.1.1: How is this vector notion (and graph) useful for this document? I
don't see that it's actually used anywhere in the documents. Similarly, the
arrays don't appear to be actually used (though you reference them from some
definitions) - What would be lost from the document if you simply removed all
this text?

*T, 3.1.5, Discussion, last sentence: Why is it important to say "For UA-type of network devices such as gateways, it is expected that the UA will be driven into overload based on the volume of media streams it is processing." It's not
clear that's true for all such devices. How is saying anything here useful?

*T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session
Attempt. Why not just say INVITE? You aren't actually measuring "session
attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.

*T, 3.1.7: It needs to be explicit that these benchmarks are not accounting
for/allowing early dialogs.

*T, 3.1.8: The words "early media" appear here for the first time. Given the
way the benchmarks are defined, does it make sense to discuss early media in
these documents at all (beyond noting you do not account for it)? If so,
there needs to be much more clarity. (By the way, this Discussion will be
much easier to write in terms of dialogs).

*T, 3.1.9, Discussion point 2: What does "the media session is established"
mean? If you leave this written as a generic definition, then is this when an
MSRP connection has been made? If you simplify it to the simple media model
currently in the document, does it mean an RTP packet has been sent? Or does it
have to be received? For the purposes of the benchmarks defined here, it
doesn't seem to matter, so why have this as part of the discussion anyway?

*T, 3.1.9, Definition: A series of CANCELs meets this definition.

*T, 3.1.10 Discussion: This doesn't talk about 3xx responses, and they aren't
covered elsewhere in the document.

*T, 3.1.11 Discussion: Isn't the MUST in this section methodology? Why is it in
this document and not -meth-?

*T, 3.1.11 Discussion, next to last sentence: "measured by the number of
distinct Call-IDs" means you are not supporting forking, or you would not count
answers from more than one leg of the fork as different sessions, like you
should. Or are you intending that there would never be an answer from more than
one leg of a fork? If so, the documents need to be clearer about the
methodology and what's actually being measured.

*T, 3.2.2 Definition: There's something wrong with this definition. For
example, proxies do not create sessions (or dialogs). Did you mean "forwards
messages between"?

*T, 3.2.2 Discussion: This is definition by enumeration since it uses a MUST, and is exclusive of any future things that might sit in the middle. If that's what you want, make this the definition. The MAY seems contradictory unless you
are saying a B2BUA or SBC is just a specialized User Agent Server. If so,
please say it that way.

*T, 3.2.3: This seems out of place or under-explored.  You don't appear to
actually _use_ this definition in the documents. You declare these things in scope, but the only consequence is the line in this section about not lowering performance benchmarks when they are present. Consider making that part of the
methodology of a benchmark and removing this section. If you think it's
essential, please revisit the definition - you may want to generalize it into
_anything_ that sits on the path and may affect SIP processing times
(otherwise, what's special about this either being SIP Aware, or being a
Firewall)?

*T, 3.2.5 Definition: This definition just obfuscates things. Point to 3261's definition instead. How is TCP a measurement unit? Does the general terminology
template include "enumeration" as a type? Do you really want to limit this
enumeration to the set of currently defined transports? Will you never run
these benchmarks for SIP over websockets?

*T, 3.3.2 Discussion: Again, there needs to be clarity about what it means to "create" a media session. This description differentiates attempt vs success, so what is it exactly that makes a media session attempt successful? When you say number of media sessions, do you mean the number of m= lines or the total number of INVITEs that have SDP with m= lines?

*T, 3.3.3: This would be much clearer written in terms of transactions and dialogs
(you are already diving into transaction state machine details). This is a
place where the document needs to point out that it is not providing benchmarks
relevant to environments where provisionals are allowed to happen and INVITE
transactions are allowed to pend.

*T, 3.3.4: How does this model (A single session duration separate from the
media session hold time) produce useful benchmarks? Are you using it to allow media to go beyond the termination of a call? If not, then you have media only
for the first part of a call? What real world thing does this reflect?
Alternatively, what part of the device or system being benchmarked does this
provide insight to?

*T, 3.3.5: The document needs to be honest about the limits of this simple
model of media. It doesn't account for codecs that do not have constant packet sizes. The benchmarks that use the model don't capture the differences based on
content of the media being sent - a B2BUA or gateway may behave
differently if it is transcoding or doing content processing (such as DTMF
detection) than it will if it is just shoveling packets without looking at
them.

*T, 3.3.6: Again, the model here is that any two media packets present the same
load to the thing under test. That's not true for transcoding, mixing, or
analysis (such as for DTMF detection). It's not clear whether, if you have two streams, each stream has its own "constant rate". You call out having one audio
and one video stream - how do you configure different rates for them?

*T, 3.3.7: This document points to the methodology document for indicating
whether streams are bi-directional or uni-directional. I can't find where the
methodology document talks about this (the string 'direction' does not
occur in that document).

*T, 3.3.8: This text is old - it was probably written pre-RFC5393. If you fork, loop detection is not optional. This, and the methodology document, should be
updated to take that into account.

*T, 3.3.9: Clarify if more than one leg of a fork can be answered successfully
and update 3.1.11 accordingly. Talk about how this affects the success
benchmarks (how will the other legs getting failure responses affect the
scores?)

*T, 3.3.9, Measurement units: There is confusion here. The unit is probably
"endpoints". This section talks about two things, that, and type of forking.
How is "type of forking" a unit, and are these templates supposed to allow more
than one unit for a term?

*T, 3.4.2, Definition: It's not clear what "successfully completed" means. Did you mean "successfully established"? This is a place where speaking in terms of
dialogs and transactions rather than sessions will be much clearer.

*T, 3.4.3, This benchmark metric is underdefined. I'll focus on that in the
context of the methodology document (where the docs come closer to defining it). This definition includes a variable T but doesn't explain it - you have to read
the methodology to know what T is all about. You might just say "for the
duration of the test" or whatever is actually correct.

*T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity". Why?
The argument you give in the next sentence just says the media session hold
time has to be at least as long as the session duration. If they were equal,
and finite, the test result does not change. What's the utility of the infinity
concept here?

*T, 3.4.4: "until it stops responding". Any non-200 response is still a
response, and if something sends a 503 or 4xx with a retry-after (which is
likely when it's truly saturating) you've hit the condition you are trying to
find. The notion that the Overload Capacity is measurable by not getting any
responses at all is questionable. This discussion has a lot of methodology in
it - why isn't that (only) in the methodology document?
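
A sketch of what I mean (my suggestion, not the drafts' methodology): treat an explicit overload signal as the condition you are hunting for, rather than waiting for total silence.

    # Classify a final response as an overload indication.
    def indicates_overload(status_code, has_retry_after):
        if status_code == 503:
            return True                # server explicitly unavailable
        if 400 <= status_code < 500 and has_retry_after:
            return True                # throttled, told to come back later
        return False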

*T, 3.4.5: A normal, fully correct system that challenged requests and
performed flawlessly would have a .5 Session Establishment Performance score. Is that what you intended? The SHOULD in this section looks like methodology. Why is this a SHOULD and not a MUST? (The document should be clearer about why sessions remaining established is important.) Or wait - is this what Note 2 in section 5.1 of the methodology document (which talks about reporting formats) is supposed to change? If so, that needs to be moved to the actual methodology
and made _much_ clearer.
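
The arithmetic behind the .5 claim, as I read the current definition (my illustration; it assumes each INVITE is counted as a separate attempt):

    # A device that 407-challenges every initial INVITE and then accepts the
    # authenticated retry establishes one session per two attempts.
    attempts    = 2 * 1000             # 1000 challenged INVITEs + 1000 retries
    established = 1000                 # every authenticated retry succeeds
    print(established / attempts)      # 0.5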

*T, 3.4.6: You talk of the first non-INVITE in an NS. How are you
distinguishing subsequent non-INVITEs in this NS from requests in some other NS? Are you using dialog identifiers or something else? Why do you expect that to matter? Why is the notion of a sequence of related non-INVITEs useful from a benchmarking perspective? There isn't state kept in intermediaries because of them, so what will make this metric distinguishable from a metric that just focuses on the transactions?

*T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or
some other end-to-end non-INVITE? I suspect it's because you are wanting to
focus on a simple non-INVITE transaction (which is why you are leaving out
SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear
that's why you chose it. You should also talk about whether the payloads of all of the MESSAGE requests are the same size and whether that size is a parameter
to the benchmark. (You'll likely get very different behavior from a MESSAGE
that fragments.)

*T, 3.4.7: The definition says "messages completed" but the discussion talks
about "definition of success". Does success mean an IM transaction completed
successfully?  If so, the definition of success for a UAC has a problem. As
written, it describes a binary outcome for the whole test, not how to determine
the success of an individual transaction - how do you get from what it
describes to a rate?

*T, Appendix A: The document should better motivate why this is here.
Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are
silent on them?  The discussion says you are _selecting_ a Session Attempts
Arrival Rate distribution. It would be clearer to say you are selecting the
distribution of messages sent from the EA. It's not clear how this particular
metric will benefit from different sending distributions.

Now the Methodology document (comments prefixed with an M):

*M, Introduction: Can the document say why the subset of functionality
benchmarked here was chosen over other subsets? Why was SUBSCRIBE/NOTIFY or INFO not included (or INVITEs with MSRP or even simple early media, etc.)?

*M, Introduction paragraph 4: This points to section 4 and section 2 of the
terminology document for configuration options. Section 4 is the IANA
considerations section (which has no options). What did you mean to point to?

*M, Introduction paragraph 4, last sentence: This seems out of place - why is
it in the introduction and not in a section on that specific methodology?

*M, 4.1: It's not clear here, or in the methodology sections whether the tests allow the transport to change as you go across an intermediary. Do you intend to be able to benchmark a proxy that has TCP on one side and UDP on the other?

*M, 4.2: This is another spot where pointing to the Updates to 3261 that change
the transaction state machines is important.

*M, 4.4: Did you really mean RTSP? Maybe you meant MSRP or something else? RTSP
is not, itself, a media protocol.

*M, 4.9: There's something wrong with this sentence: "This test is run for an
extended period of time, which is referred to as infinity, and which is,
itself, a parameter of the test labeled T in the pseudo-code". What value is
there in giving some finite parameter T the name "infinity"?

*M, 4.9: Where did 100 (as an initial value for s) come from? Modern devices
process at many orders of magnitude higher rates than that. Do you want to
provide guidance instead of an absolute number here?

*M 4.9: In the pseudo-code, you often say "the largest value". It would help to
say the largest value of _what_.

*M 4.9: What is the "steady_state" function called in the pseudo-code?

*M 6.3: Expected Results: The EA will have different performance
characteristics if you have them sending media or not. That could cause this
metric to be different from session establishment without media.

*M 6.5: This section should call out that loop detection is not optional when
forking. The Expected Results description is almost tautological - could it
instead say how having this measurement is _useful_ to those consuming this
benchmark?

*M 6.8, Procedure: Why is "May need to run for each transport of interest." in
a section titled "Session Establishment Rate with TLS Encrypted SIP"?

*M 6.10: This document doesn't define Flooding. What do you mean? How is this different than "Stress test" as called out in section 4.8? Where does 500 come from? (Again, I suspect that's a very old value - and you should be providing
guidance rather than an absolute number). But it's not clear how this isn't
just the session establishment rate test that just starts with a bigger number.
What is it actually trying to report on that's different from the session
establishment rate test, and how is the result useful?

*M 6.11: Is each registration going to a different AoR? (You must be, or the
re-registration test makes no sense.) You might talk about configuring the
registrar and the EA so they know what to use.
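
Something along these lines is what the EA and the registrar would have to agree on (the names are purely illustrative, not from the drafts):

    # Provision one distinct AoR per registration in the test.
    def provision_aors(count, domain="example.com"):
        return ["sip:user%d@%s" % (i, domain) for i in range(count)]

    aors = provision_aors(1000)        # registrar and EA share this list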

*M 6.12, Expected Results: Where do you get the idea that re-registration
should be faster than initial registration? How is knowing the difference (or
even that there is a difference) between this and the registration metric
likely to be useful to the consumer?

*M 6.14: Session Capacity, as defined in the terminology doc, is a count of
sessions, not a rate. This section treats it as a rate and says it can be
interpreted as "throughput". I'm struggling to see what it actually is
measuring. The way your algorithm is defined in section 4.9, I find s before I use T. Let's say I've got a box where the value of s that's found is 10000, and I've got enough memory that I can deal with several large values of T. If I run this test with T of 50000, my benchmark result is 500,000,000. If I run with a T of 100000, my benchmark result is 1,000,000,000. How are those numbers telling me _anything_ about session capacity? That the _real_ session capacity is at least that much? Is there some part of this methodology that has me hunt for a maximal value of T? Unless I've missed something, this metric needs more clarification to not be completely misleading. Maybe instead of "Session Capacity" you should simply be reporting "Simultaneous Sessions Measured".

*M 8: "and various other drafts" is not helpful - if you know of other
important documents to point to, point to them.

Nits:

T : The definition of Stateful Proxy and Stateless Proxy copied the words
"defined by this specification" from RFC3261. This literal copy introduces
confusion. Can you make it more visually obvious you are quoting? And even if
you do, could you replace "by this specification" with "by [RFC3261]"?

T, Introduction, 2nd paragraph, last sentence: This rules out stateless
proxies.

T, Section 3: In the places where this template is used, you are careful to say None under Issues when there aren't any, but not so careful to say None under See Also when there isn't anything. Leaving them blank makes some transitions
hard to read - they read like you are saying see also (whatever the next
section heading is).

T, 3.1.6, Discussion: s/tie interval/time interval/

M, Introduction, paragraph 2: You say "any [RFC3261] conforming device", but
you've ruled endpoint UAs out in other parts of the documents.

M 4.9: You have comments explaining send_traffic the _second_ time you use it.
They would be better positioned at the first use.

M 5.2: This is the first place the concept of re-Registration is mentioned. A
forward pointer to what you mean, or an introduction before you get to this
format would be clearer.


On 1/16/13 3:48 PM, The IESG wrote:
The IESG has received a request from the Benchmarking Methodology WG
(bmwg) to consider the following document:
- 'Terminology for Benchmarking Session Initiation Protocol (SIP)
    Networking Devices'
   <draft-ietf-bmwg-sip-bench-term-08.txt> as Informational RFC

The IESG plans to make a decision in the next few weeks, and solicits
final comments on this action. Please send substantive comments to the
ietf@ietf.org mailing lists by 2013-01-30. Exceptionally, comments may be
sent to iesg@ietf.org instead. In either case, please retain the
beginning of the Subject line to allow automated sorting.

Abstract


    This document provides a terminology for benchmarking the SIP
    performance of networking devices.  The term performance in this
    context means the capacity of the device- or system-under-test to
    process SIP messages.  Terms are included for test components, test
    setup parameters, and performance benchmark metrics for black-box
    benchmarking of SIP networking devices.  The performance benchmark
    metrics are obtained for the SIP signaling plane only.  The terms are
    intended for use in a companion methodology document for
    characterizing the performance of a SIP networking device under a
    variety of conditions.  The intent of the two documents is to enable
    a comparison of the capacity of SIP networking devices.  Test setup
    parameters and a methodology document are necessary because SIP
    allows a wide range of configuration and operational conditions that
    can influence performance benchmark measurements.  A standard
    terminology and methodology will ensure that benchmarks have
    consistent definition and were obtained following the same
    procedures.




The file can be obtained via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/

IESG discussion can be tracked via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/ballot/


No IPR declarations have been submitted directly on this I-D.


