
Re: Comments: [AVT] Last Call: RTP Payload for Comfort Noise to Proposed Standard

2002-05-02 13:42:00
On Tue, 30 Apr 2002 James_Renkel(_at_)3com(_dot_)com wrote:

The problem in the above described situation is that the
*receiver* won't know this until it receives the packet after the gap,
which could be a long time, well longer than the depth of the receiver's
jitter buffer. So, when the receiver's jitter buffer underflows, it has
no way of distinguishing between:
1. the transmitter detected silence and just didn't bother to send any
packets, and the receiver should play out silence; and
2. the network is congested, packets are getting lost, and the receiver
should interpolate audio in an attempt to preserve audio quality.

By definition, interpolation can only occur across a gap between two
received packets, not between one received packet and nothing.  And,
in practice, interpolation is only effective across relatively short
gaps.  Therefore, if there is a gap that is shorter than the jitter
buffer, you have the packet before and after the gap, you can tell
that one or a few packets were lost by the sequence number difference,
and you can interpolate.
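
As a rough illustration of the sequence number check, here is a minimal
sketch. The function name is hypothetical, and it assumes only that RTP
sequence numbers are 16 bits wide and wrap around; a delta of half the
sequence space or more is treated as a reordered or duplicate packet
rather than as a loss.

```python
def packets_lost_between(prev_seq: int, next_seq: int) -> int:
    """Count packets missing between two received RTP sequence numbers."""
    # RTP sequence numbers are 16-bit values that wrap at 65536.
    delta = (next_seq - prev_seq) & 0xFFFF
    # Assumption: a delta of 0 or >= 0x8000 means duplicate or reordering.
    if delta == 0 or delta >= 0x8000:
        return 0
    return delta - 1


print(packets_lost_between(10, 14))     # packets 11, 12, 13 are missing
print(packets_lost_between(65534, 1))   # wraparound: 65535 and 0 are missing
```

A small result means the gap is a candidate for interpolation; how small
is a matter for the receiver implementation.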

If the jitter buffer runs dry, you have a gap too long to interpolate
over.  One technique for that situation is to fade to silence (either
absolute silence or comfort noise).  If the lack of packets was
intentional due to VAD at the sender, then the last packet will be, or
at least should be, very near silence so there is not much fading to
do.  I don't see any different behavior you can take independent of
whether you know the lack of packets was intentional or not.
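
One possible shape of the fade technique, as a sketch only: repeat the
last decoded frame with a linearly decreasing gain until it reaches
absolute silence. The function names, the linear ramp, and the use of
plain lists of integer samples are all assumptions for illustration; a
real receiver might fade to comfort noise instead.

```python
def fade_gains(num_frames: int) -> list:
    """Linear gain ramp that ends at 0.0 after num_frames frames."""
    return [1.0 - (i + 1) / num_frames for i in range(num_frames)]


def conceal_underrun(last_frame: list, num_frames: int) -> list:
    """Repeat the last decoded frame, attenuating it a little more each time."""
    return [[int(sample * gain) for sample in last_frame]
            for gain in fade_gains(num_frames)]


# If the last frame was already near silence (VAD at the sender),
# there is very little left to fade.
for frame in conceal_underrun([1000, -1000], 4):
    print(frame)
```

The same code runs whether the gap was intentional or not, which is the
point made above.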

Sure, if CN is sent, then you know silence is intended and you know
what level of noise you should produce.  But if CN is not being sent,
then you need to be prepared to handle equally a case where packets
are lost or where a poor-quality sender stops sending when the sound
level has not yet decayed all the way.

I hope you can all agree with me that action 2., above, is common practice
whether explicit VAD and CN are being used or not. Beyond that, many would
say that action 2. is extremely desirable, that the technique used to
accomplish it is a key differentiator of their product(s), and that for
the general good of VoIP maybe should be considered a recommended practice.

I fully agree.

But the general tone of your comment above, and elsewhere in the
same e-mail, leads me to believe that you (and possibly others) do
not support this, that you support *always* simply playing out
silence if a packet is not available for playout at the required
time (when the jitter buffer underflows).

Absolutely disagree.  I strongly believe that protocol specifications
should not say how to implement the protocol.

That's fine and dandy as your personal view. But the suggested language of
the section of the RFC that you wrote would "standardize" this behavior in
the face of extensive use of exactly the opposite behavior.

I did not intend any words that I wrote to imply that.  Where do you
think the text does so?

The purpose of the comfort noise coding is *exactly* to allow the receiver
to distinguish between cases 1. and 2., above.

Yes, and that is why this generic CN payload format has been defined.
This discussion is about whether the lack of use of CN implies that VAD
and DTX will not be used.  I want to state clearly in my message and
in the specification that this is false.

True, if packets are lost
they could just as well have been CN packets as not (But if the last packet
not lost was a CN packet, the receiver would interpolate comfort noise.).
True, CN packets consume more bandwidth than sending nothing (But less than
sending CODEC encoded near-silence.). Ya want to eliminate that bandwidth
at a potential loss of audio quality when packets are lost, fine, don't
implement or advertise support of CN.

I would assume that most implementations would not send CN in every
frame time, just one at the end of speech.  That CN might occasionally
be lost.  A good implementation should still produce an acceptable
result.

I think before this RFC can go forward, we need to clear this up. I think
the best we can and must say is that if packets aren't received in time,
the result is receiver implementation independent (Interpolate if ya want;
play silence if ya want; play "Yankee Doodle" if ya want. Let the
marketplace decide if they like interpolation, silence, or "Yankee Doodle"
better.). I don't think we can say, or imply, or leave open to
interpretation sans a statement to the contrary, that the intended action
when packets are not received in time is to *always* play silence.

I agree.


On Tue, 30 Apr 2002 Leland_Thompson(_at_)3com(_dot_)com wrote:

It seems that any protocol that actively communicates state transition
information to a system should theoretically, in general, notify the system
at the start of the event not at the end of the event.  If a state
transition has occurred, I may need to take some action or do something
differently.  With Silence Suppression, this is obviously the case.

For instance, of particular concern is not knowing when transitions
actually occur, but just as importantly, now having the possibility that
significant time may elapse without knowing the actual state of the system.
This last issue can cause other issues.  For instance, delays in accurate
state information create additional problems if a system can end in a state
that is not known to all parties.  The possibility of not having a
transition to speech would cause the state information from the previous
transmission (the silence transition) to be lost, because your method relies
on receiving the next Voice packet, which never occurred.

We are not discussing theory.  You may object to the definition of the
Marker bit in RTP indicating the start rather than the end of a
talkspurt.  We debated this question at length when the protocol was
designed years ago.  It is unlikely to be changed now.

The draft in question is defining a CN payload type precisely so that
more information about the silence transition can be conveyed to the
receiver.  Good quality senders will implement CN.  A robust receiver
must behave well without it.

It is very unlikely that you would get agreement from the working
group to change the base specification of RTP to say that VAD and DTX
may not be used without CN.

If necessary, we can have more discussion about signaling so that a
receiver can refuse to accept a call that would use VAD and DTX
without CN.

What happens during the umpteen frame periods that we didn't correctly
identify the silence period?  How does it impact the speech signal/voice
quality?
How is this error reflected by the system in the form of statistics,
counters, etc?  Are the statistics accurate anymore?

Yes.  The receiver reports the highest number received.  The sender
knows it sent a higher number.

If one allows the TimeStamp information along with the Sequence Number to
together tell an RTP Decoder (Receiver) when a Loss Event is really just a
Silence Period, one is presented with the following dilemmas.

1)
    -What is one to do during the first audio frame time when data is not
present?  In absence of a valid CN/SID frame, most (some) compliant
implementations will transition to a Loss State which will cause an
Interpolation of the Codec's decoder to occur.
    - There is no reason to believe, yet, that Comfort Noise Generation
should be activated.
    - Furthermore, if one were to activate CNG, what is to be generated?
You don't even have a minimal noise level to try and match just the back
ground noise level of the channel, let alone the spectral information that
might be typically present.

I discussed this above.

2)
What happens toward the end of a session where an RTP Encoder (Transmitter)
has transitioned to silence, however, the RTP Decoder (Receiver) thinks
this may be a loss event, and the call ends without the RTP Decoder ever
seeing another RTP Packet, which would have told him "BIG Change in
TimeStamp, Little change in Seq Num".  The state transition information is
lost, and now inaccurate statistics could be stored for this call because
of it.  Would this scenario have a potential impact to perceived Quality of
Service for this connection?  Absolutely it might!

The quality of this last silence is no different than the quality of a
silence earlier in the call.
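
For reference, the "big change in timestamp, little change in sequence
number" test that is available once the next packet does arrive could be
sketched as follows. This is only an illustration: the function name and
labels are made up, and it assumes a fixed number of samples per packet.
A gap in sequence numbers is labeled a loss even though silence may also
have been present.

```python
def classify_gap(prev_seq, prev_ts, next_seq, next_ts, samples_per_packet):
    """Classify the gap between two consecutively received RTP packets."""
    seq_delta = (next_seq - prev_seq) & 0xFFFF        # 16-bit sequence number
    ts_delta = (next_ts - prev_ts) & 0xFFFFFFFF       # 32-bit timestamp
    if seq_delta == 1 and ts_delta > samples_per_packet:
        # Time was skipped but no sequence numbers were: sender was silent.
        return "silence"
    if seq_delta > 1:
        # Sequence numbers were skipped: packets were lost (or reordered).
        return "loss"
    return "contiguous"


print(classify_gap(100, 16000, 101, 24000, 160))
print(classify_gap(100, 16000, 104, 16640, 160))
```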

Today there are real implementations of VOIP GWs that operate in real
Carrier Networks that monitor Quality of Service (QOS) in the form of
Excessive Packet Loss indicators for TRAPS and Alarms within a Network
Operations Center (NOC).  It is theoretically very important, therefore, to
actively and accurately monitor state transitions within the system that
would possibly cause a fault or alarm.

A complete RTP implementation will also be sending RTCP Sender Reports
that would let the receiver know, during a very long silence, whether
or not some packets had been transmitted and lost.
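
A sketch of how a receiver might use that: the Sender Report carries the
sender's packet count, a 32-bit total of RTP data packets transmitted,
and the receiver can compare it with its own count. The function name is
hypothetical, and the sketch assumes the receiver has been listening
since the sender started and has not counted duplicates. Packets still
in flight when the report was generated are not accounted for.

```python
def packets_not_received(sr_packet_count: int, packets_received: int) -> int:
    """Packets the sender reports having sent that the receiver has not seen."""
    # The sender's packet count is a 32-bit counter, so subtract modulo 2**32.
    return (sr_packet_count - packets_received) & 0xFFFFFFFF


# During a long silence: zero means nothing was sent and lost,
# non-zero means the quiet period hides some lost packets.
print(packets_not_received(1000, 1000))
print(packets_not_received(1000, 997))
```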

Even if you have CN at the end of a talkspurt, the receiver has no
idea whether it has lost some packets after that point if it receives
nothing more.  It is impossible to answer the question: "Did you
receive my last packet?"

Silence Indication Descriptions in
the form of CN or SID frames are incredibly important in order to robustly
detect these state transitions at the point (time) of occurrence.

Great!  Use CN.  That is why the format is proposed.

 I
strongly recommend we rethink my original statements about RTP Decoding and
how higher level protocol negotiations (i.e.  SIP - SDP, H.323/H.245, etc)
really may only make sense in establishing what an RTP Encoder (transmitter
- to packet network) does.

Therefore, if CN is not negotiated as supported, it should not be activated
or used.

I agree, CN should (must) not be used if it has not been negotiated.
The only reason why your previous proposal would be possible with CN
is that it has a static payload type.  All new codec assignments have
dynamic payload types, so receiving one of those encodings when its
use has not been negotiated will not work.

 VAD should only be allowed when negotiated as supported and the
implementation of an IETF - CN (silence indication method) should comply with
a clearly identifiable transition of state as close to the actual state
transition as possible while communicating all the relevant information to
make Comfort Noise Generation (CNG) possible.

Implementation agreements may make this recommendation.  The RTP
protocol specification will not require it.

                                                        -- Steve


