Hi, Jean-Marc.
... and thanks for the super-quick response! You have been quite busy.
I have had a look through the new draft and I think the additions help
considerably with comprehension for the naive reader (and give new
implementers a way in).
I'll leave you to negotiate with the RFC Editor over the Wikipedia
references. To quote the RFC Style Guide
http://www.rfc-editor.org/rfc-style-guide/rfc-style
Section 4.8, item (x) References, last section:
URLs and DNS names in RFCs
The use of URLs in RFCs is discouraged, because many URLs are not
stable references. Exceptions may be made for normative
references in those cases where the URL is demonstrably the most
stable reference available. References to long-lived files on
ietf.org and rfc-editor.org are generally acceptable.
They are certainly convenient *as long as they remain in place and
aren't corrupted*.
I found a couple of trivial editorial nits in the changes:
s4.3.3 (in the added text):
The CELT layer, however, can adapt over a very wide range of rates,
and thus has a large number of codebooks sizes
s/codebooks/codebook/
s4.3.3, para after Table 57: s?the maximums in bit/sample are
precomputed?the maximums in bits/sample are precomputed?
Also suggest:
s4.3: Add reference for Bark scale: Zwicker, E. (1961), "Subdivision of
the audible frequency range into critical bands," The Journal of the
Acoustical Society of America, 33, Feb., 1961.
A few responses in line below (agreed pieces elided):
Regards,
Elwyn Davies
====================================================================================
On Tue, 2012-05-15 at 20:33 -0400, Jean-Marc Valin wrote:
Hi Elwyn,
Thanks for the very thorough review. We've addressed your issues and
submitted draft version -13. See our response to each of the issues you
raised (aggregated from all the authors) in the attached document.
Thanks very much to all the authors.
Cheers,
Jean-Marc
Elwyn Davies wrote:
Major issues:
Can we accept the code as normative? If not how do we proceed?
The issue with code being normative was specifically addressed in
the guidelines document for this WG (RFC 6569).
Yes. I failed to read this, although Robert Sparks did point out to me
that the code was normative - but I don't think he said this had been
agreed in advance (or maybe I didn't understand what he was saying).
To be honest, I would like to have had the time to tie the text to the
code, but realistically that is several weeks' or even months' work to
do it properly - without that I feel that I have only done half a job.
I may decide that it interests me enough to have a superficial go next
week, but no promises!
Minor issues:
Contrast between decoder descriptions of LP part and CELT part: The
SILK descriptions go into gory detail on the values used in lots of
tables, etc., whereas the CELT part has a very limited treatment of
the numeric values used (assuming reliance on finding the values in
the reference implementation, either explicitly or implicitly). There
are things to be said for both techniques. I was wondering (while
reading the SILK description) if the authors have any means of
automatically generating the tables from the code in the SILK part
(or vice versa) to avoid double maintenance. On the other hand, there
are pieces of the CELT decoder description (especially in s4.3.3,
which relies on knowing numbers of bands, etc.) where some actual
numbers would help comprehension.
We have made many changes to section 4.3 (and 4.3.3 specifically) to
address the specific issues below. As for the tables, they are not
generated automatically.
I think this is addressed satisfactorily now. There is still some
difference but it is much reduced and not so glaring now. The addition
of the band tables really helps.
s2 (and more generally): The splitting of the signal in the frequency
domain into signal (components?) below and above 8 kHz presumably
requires that the signal is subjected to a Discrete Fourier Transform
to allow the signal to be split. I think something is needed in s2 to
explain how this is managed (or, if I don't understand, to explain
why it isn't necessary).
No DFT is used. The lower band is obtained through resampling (which
is already described) and the higher band is obtained by not coding
the lower band with CELT (the text says that CELT starts at band 17 in
hybrid mode). The explanation was reworded to make this as clear as
possible at this point in the text.
[I thought I had reworded this comment in the 2nd version to talk about
MDCT but no matter].
Yes, para 5 of s2 does say that the bands are discarded. I think it
would be useful to have a concrete statement in the new text added to
s4.3 that bands 0 to 16 are discarded in hybrid mode (thereby making
the 17 in the band boost section more obvious). [There is a comment
below that you have added some text about band 17 in section 4.3 but I
can't see it].
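To make the suggestion concrete, something along the following lines
might help (a rough sketch based on my reading of the text, not the
actual reference code - the names NB_BANDS, decode_band() and the
hybrid flag are all made up for illustration):

    #define NB_BANDS 21   /* CELT bands 0..20 */

    /* In hybrid mode the CELT layer simply never codes bands 0..16
     * (roughly 0-8 kHz); the SILK layer supplies that part of the
     * spectrum, and CELT coding starts at band 17 (~8 kHz). */
    int start_band = hybrid ? 17 : 0;
    for (int i = start_band; i < NB_BANDS; i++)
        decode_band(i);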
s4.2.5, para 3:
When switching from 20 ms to 10 ms, the 10 ms
20 ms Opus frame, potentially leaving a hole that needs to be
concealed from even a single packet loss.
How?
As explained in the LBRR text, a 10 ms frame will only contain 10 ms
LBRR data even if the previous frame was 20 ms, so there's 10 ms
"missing".
Indeed - that there would be a hole was clear. The 'How' referred to
how it would be concealed. Having read further by now, this may be down
to Packet Loss Concealment - so maybe all it needs is a forward ref to
s4.4.
s4.3:
As a complete newcomer to CELT, I would have appreciated a more
high-level understanding of what CELT is doing at this point. I tried
reading s4.3 without any additional input and found it very hard
going. Eventually I gave up and went looking for some additional
input. This presentation seems to have a useful view:
http://www.ietf.org/proceedings/79/slides/codec-2.pdf
I think that it would be extremely helpful to have a description
similar to this at this point in the document, even though there is
some material in section 5.3 which could also be forward referenced.
Still, the material in s5.3 does not start from the basic principles
that CELT is using, and since these are essentially novel, it would
be very good to give prospective implementers/users an understanding
of what is going on. Incidentally, I found the above IETF
presentation more useful than
http://www.celt-codec.org/presentations/misc/lca-celt.pdf
Note that the SILK part seems rather less opaque. It would also be
useful to indicate numerically how many bands are involved and what
the numbers of MDCT bins are in the various bands.
The intro of section 4.3 has been expanded with general information
about CELT similar to what the codec-2.pdf slides from 79 included.
I think this is an excellent improvement.
Nits/editorial comments:
global: bytes -> octets
We believe that in the field of audio codecs, the mention of "byte"
without further context is well understood to mean 8 bits.
True. But this is a matter of IETF style. The style is to use octets
where we mean 8 bit bytes. I think you now have a mixture!
global: The form/terminology Q<n> (e.g., Q13, Q15, Q16) ought to be
explained.
This was already defined in the section on notation:
The notation "Q<n>", where n
is an integer, denotes the number of binary digits to the right of
the decimal point in a fixed-point number.
Sorry - I missed that.
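(For other readers who skim past the notation section, it might even be
worth adding a one-line example. As I now read the definition, a Q<n>
value is just an integer scaled by 2^-n, so in C terms something like:

    #include <stdint.h>

    int16_t q13_val  = 8192;                         /* 2^13, i.e. 1.0 in Q13 */
    double  real_val = q13_val / (double)(1 << 13);  /* yields 1.0 */

assuming, of course, that I have now understood it correctly; the
variable names are mine, not the draft's.)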
s3.4: I am not keen on duplicating normative requirements in this way
(double maintenance issue). It would be better to put explicit
numbered requirements in the sections above and reference the
resulting numbers here.
A checklist style summary is quite useful for implementors, and worth
the maintenance burden for a document that is (hopefully) going to be
published once and read many times. The list intentionally does not
include any RFC 2119 keywords, to avoid any conflict should there
(accidentally) be room to interpret the re-statement any differently
from the original statement. Numbering the requirements and
referencing the numbers is still a good idea, but it should be
possible to read the list without flipping back and forth to the
previous sections.
Good solution!
s4.1.1: This is really a global point. This section refers to
entdec.c. Presumably (since we haven't reached the code yet and it is
still compressed) there is some file structure. I don't think this
has been said above. It would be good to provide a list of the file
components (i.e., sectional structure of the code) at the start,
maybe even giving line number positions within the decompressed code.
In cases where it was deemed necessary (e.g., large functions), there
are indeed line numbers in references to the code. As for a list of
files, we did not think it was useful because one already needs to
decompress the code to see the references.
OK. We'll have to live with this situation.
Having looked at the code, I think it is a considerable pity that it
isn't Doxygen commented (or some such) throughout so that the whole
system can be viewed as a Doxygen tree. I can smell roasted programmer
from here... :-)
s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook,
selects the PDFs for the second stage of the VQ/This indexes an
element in a coarse codebook that selects the PDFs for the second
stage of the VQ/
The text as written is correct. The index I1 is what selects the PDFs
for the second stage, not the vector from the coarse codebook in
Tables 23 and 24. I.e., it's saying, "This does A, B, and C."
OK. I think it might be clearer if the three things were separated out
as a list. Now that you point it out I can read it correctly, but it
triggered minor confusion - worth turning the three things into bullet
points.
NEW: s4.3: Add reference for Bark scale: Zwicker, E. (1961),
"Subdivision of the audible frequency range into critical bands," The
Journal of the Acoustical Society of America, 33, Feb., 1961.
Generally the new intro to s4.3 helps *a lot*.
s4.3.2.1: Avoid the equation picture splitting across page
boundaries. In the current version it is unclear what the denominator
is. (Use the needLines processing directive in xml2rfc.) Same applies
to the equation below Table 57 in s4.3.4.3.
It's not quite clear how to use needLines without undesirable
side-effects.
Hopefully this is something the RFC editor should be able to handle.
Indeed... but I don't know what undesirable side effects there are.
AFAIK (and in my own usage) it just ensures there are n lines available
on the current page at some point in the text and forces a page throw
if not.
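For reference - and please check this against the xml2rfc
documentation, as I am quoting from memory - the directive is just a
processing instruction placed immediately before the artwork:

    <?rfc needLines="8" ?>

which forces a page throw unless at least 8 lines remain on the
current page.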
s4.3.3 (was specified as s4.3.2.3, which was wrong): Paragraph on
decoding band boosts: Might be improved by using equations rather than
the wordy descriptions used at present.
Any thoughts on this one?
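To make the suggestion concrete: if I have followed the prose
correctly, the cost in eighth-bits of boosting a band could be stated
as something like

    quanta = min(8*N, max(48, N))

where N is the number of MDCT bins in the band, rather than spelling
the min/max out in words. (Do double-check that formula against the
text; I may well have misread it.)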
(global)s4.3.2.3, para above table 56: s/iff/if and only if/
Fixed.
s4.3.3: <<snip>>.
Added an explanation of band 17
I don't think this happened.
s6.1: This section is perhaps a little 'triumphalist' for the
reference implementation (this may of course be justified!). The
quality metric is a 'relative quality metric', and presumably if
someone does a *really* good job of coding, it is entirely possible
that the new algorithms might get better than 100 on the quality
score (i.e., provide a better answer than the reference
implementation).
Conformance with the specification is defined by a faithful
reproduction of the specified decoder behavior (see RFC 6569 s4.3).
By specifying the decoder, future encoders have the freedom to have
improved quality with the confidence that they know what output a
conforming decoder will produce. The tests in s6.1 are measuring the
quality of the decoded signal against the reference decoded signal,
not against the original audio, so greater than 100% wouldn't be
possible or meaningful. The test signals have been specially prepared
to thoroughly test a decoder implementation, and they sacrifice
encoded quality in order to rapidly exercise the corner cases.
You might want to add this comment to the text.
As regards the 100 limit, I was sort of assuming that the quality
figure was derived from improving on the 48 dB SNR figure. Probably a
misreading. As a matter of interest, would one be able to tell from
the tests that a putative new implementation really was 'better' in
some sense? Or is this now almost a subjective matter that can only be
determined by extensive listening tests? I got the impression we may
be converging on the diminishing returns point.
/Elwyn