Better understanding the feature extraction process #13
Hi again, It looks like the pitch estimation method you implemented is SIFT [1]. Right? Alessio [1] Markel, J. (1972). The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5), 367-377. |
While waiting for an answer, I've checked the code better and managed to answer some of my questions. Plus, I got new ones. This is what I've learned:
My remaining questions are the following:
Alessio |
@alebzk the pitch estimation code is mostly copied from Opus, so it's quite possible that not everything in there is optimal for rnnoise. The idea, as you seem to have guessed, is to compute the auto-correlation on the residual. The (.008f*i)*(.008f*i) term is meant to approximate a Gaussian for lag windowing (it stabilizes the LPC analysis). As for the c1 term, it's meant to convolve the LPC filter with a slight low-pass filter to make the analysis better. I'm sure it would be possible to run the pitch even lower than 12 kHz, but that seemed fine for the use I had.
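For readers following along, here is a minimal sketch of the two autocorrelation tweaks discussed above, built only from the two statements quoted in the question. The function name condition_autocorr is hypothetical, and reading the lag window as the first-order expansion 1 - x^2 of a Gaussian exp(-x^2) with x = 0.008*i is an interpretation of the comment above, not something the code states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: condition ac[0..order] before LPC analysis.
 * The two statements are the ones quoted in this thread. */
static void condition_autocorr(float *ac, int order)
{
    int i;
    assert(ac != NULL && order >= 0);

    /* Noise floor: 1.0001 = 1 + 1e-4, i.e. a floor about 40 dB
     * below the signal power held in ac[0]. */
    ac[0] *= 1.0001f;

    /* Lag window: scale lag i by 1 - (0.008*i)^2, which attenuates
     * the higher lags slightly more than the lower ones. */
    for (i = 1; i <= order; i++)
        ac[i] -= ac[i] * (.008f * i) * (.008f * i);
}
```

With order = 4, as in the call being discussed, the largest attenuation is at lag 4, where the factor is 1 - 0.032^2 = 0.998976.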
This avoids searching for very short periods, which can sometimes cause false detection due to formants. The very short periods are searched through remove_doubling()
Yes, because we want to maximize xy/sqrt(xx*yy), but rather than compute a sqrt(), we just square everything, and since the xx term is constant, we only need to maximize xy^2/yy.
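A small sketch of that criterion, assuming a plain exhaustive search. The name best_lag and its signature are hypothetical and this is not the Opus/rnnoise search routine. It only shows that xy^2/yy can be ranked without a sqrt() and, by cross-multiplying, without a division either.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the lag in [0, max_lag) for which
 * y[lag .. lag+len-1] best matches x[0 .. len-1].
 * y must hold at least len + max_lag - 1 samples. */
static int best_lag(const float *x, const float *y, int len, int max_lag)
{
    int lag, j, best = 0;
    float best_num = -1.f, best_den = 1.f;
    assert(x != NULL && y != NULL && len > 0 && max_lag > 0);

    for (lag = 0; lag < max_lag; lag++) {
        float xy = 0.f;
        float yy = 1e-9f; /* small bias so an all-zero segment never wins */
        for (j = 0; j < len; j++) {
            xy += x[j] * y[lag + j];
            yy += y[lag + j] * y[lag + j];
        }
        /* Only positive correlations are candidates; squaring would
         * otherwise hide the sign of xy. */
        if (xy > 0.f) {
            float num = xy * xy;
            /* num/yy > best_num/best_den, compared without dividing */
            if (num * best_den > best_num * yy) {
                best_num = num;
                best_den = yy;
                best = lag;
            }
        }
    }
    return best;
}
```

The xx term never appears because it is the same for every lag and so cannot change which lag wins.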
They were hand-tuned to look for expected peaks that wouldn't be expected for a different pitch period. For example, if there's a peak at T/6, then we can expect one at 5*T/6, which is a position you wouldn't expect to find a peak if the period was T/3.
Period doubling (or tripling) is a common error for auto-correlation-based pitch estimators since if there's a periodicity at T, then there's also going to be a periodicity at 2T and 3T, ... OTOH, there won't be a periodicity at T/2. |
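The doubling effect is easy to reproduce on a toy signal. The sketch below uses two hypothetical helpers (make_pulse_train and sq_norm_autocorr, neither of which exists in rnnoise) to show that a pulse train of period T correlates perfectly with itself at lags T and 2T, but not at T/2. It uses the squared, sqrt-free score mentioned earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unit pulses every `period` samples. */
static void make_pulse_train(float *x, int n, int period)
{
    int i;
    assert(x != NULL && n > 0 && period > 0);
    for (i = 0; i < n; i++)
        x[i] = (i % period == 0) ? 1.f : 0.f;
}

/* Hypothetical helper: squared normalized autocorrelation at `lag`,
 * i.e. xy^2 / (xx*yy), returning 0 for non-positive correlation. */
static float sq_norm_autocorr(const float *x, int n, int lag)
{
    float xy = 0.f, xx = 0.f, yy = 0.f;
    int i;
    assert(x != NULL && lag >= 0 && lag < n);
    for (i = 0; i + lag < n; i++) {
        xy += x[i] * x[i + lag];
        xx += x[i] * x[i];
        yy += x[i + lag] * x[i + lag];
    }
    if (xy <= 0.f || xx <= 0.f || yy <= 0.f)
        return 0.f;
    return (xy * xy) / (xx * yy);
}
```

For a 64-sample train with period 8, lags 8 and 16 both score 1 while lag 4 scores 0, so a search over lags alone cannot tell T from 2T. That is why a separate check of the shorter candidates is needed after the main search.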
Thanks so much for the detailed answer! |
hi,@alebzk: |
Hi @zhly0, @jmvalin can surely say more, but I'm happy to give a first answer. Yes. PITCH_BUF_SIZE is expressed in samples at 48 kHz; hence, it has to be adapted if the sample rate changes. However, note that the whole pitch estimation algorithm is kind of specific to the 48k case (there are 2 downsampling steps and the estimated pitch is used to compute the spectral cross-correlation features). If you want to avoid resampling from 16k to 48k, then part of the code must be adapted. Alessio |
@alebzk thanks for your write-up, very helpful (although I'm just halfway through), and thanks to jmvalin of course! I don't understand your comment: "The autocorrelation with a maximum lag of 4 is computed since the pitch is estimated not on the audio frames directly, but on their LP residuals." I don't understand where the residual is calculated (residual as in the difference between x_lp and the LPC estimate). Thanks again |
See this call: |
@jmvalin thanks! I have gone through that function plenty of times but double-checked now that you wrote, and I think my problem comes from the fact that the LPC coefficients are defined with the opposite sign compared to how I thought they were. I now assume that you define them as e.g. Matlab does: https://se.mathworks.com/help/signal/ref/lpc.html. Thanks. |
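To make the sign convention concrete, here is a sketch under the convention assumed in the comment above: the predictor is x̂[n] = -(a[0]*x[n-1] + ... + a[p-1]*x[n-p]), so the residual is obtained by adding the weighted past samples to x[n]. The function lpc_residual is hypothetical and treats samples before the start of the buffer as zero. It is an illustration, not the filter used in rnnoise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: e[n] = x[n] + sum_{k=1..order} a[k-1] * x[n-k],
 * with samples before x[0] taken as zero. */
static void lpc_residual(const float *x, const float *a,
                         int n, int order, float *e)
{
    int i, k;
    assert(x != NULL && a != NULL && e != NULL && n >= 0 && order >= 0);
    for (i = 0; i < n; i++) {
        float sum = x[i];
        for (k = 1; k <= order && k <= i; k++)
            sum += a[k - 1] * x[i - k];
        e[i] = sum;
    }
}
```

For example, the sequence 1, 0.5, 0.25, 0.125 obeys x[n] = 0.5*x[n-1], so under this convention its coefficient is a[0] = -0.5 and the residual is zero after the first sample. With the opposite sign convention the same filter would need a[0] = +0.5 and a subtraction.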
@shakingWaves @zhly0 @jmvalin @alebzk |
Hi, first of all thank you for your amazing work!
I've been looking at the code and I am interested in better understanding the feature extraction process.
I have some questions, and I will do my best to be as detailed as possible.
Let me start from compute_frame_features() in denoise.c.
The first step is computing the energy in the Opus bands (done in frame_analysis()) and then the pitch is estimated/tracked.
pitch_downsample() performs the 20ms frames downsampling by halving the samples using a [0.25, 0.5, 0.25] kernel around the even samples. Is this a less expensive way to perform downsampling and low-pass filtering jointly to avoid aliasing?
Then _celt_autocorr() is called with a lag of just 4 samples on the downsampled sequence (hence, 24kHz). Why just 4? What is achieved exactly with such a low lag? And if I look at _celt_autocorr(), I don't understand why the autocorrelation computed by celt_pitch_xcorr() is modified afterwards by summing the autocorrelation for different lags.
After that, the autocorrelation is further modified once _celt_autocorr() returns (below the // Noise floor -40 dB comment). Why is that done? And finally _celt_lpc() is called and the LPC coefficients modified (I mean lpc2) and used to filter the downsampled sequence via celt_fir5().
This whole part is a bit obscure to me and it's also hard to understand where some constants come from (e.g., ac[0] *= 1.0001f; and ac[i] -= ac[i]*(.008f*i)*(.008f*i);) - I've found some possible mappings between the dB and the linear scale, but I'm not fully sure.
Overall, pitch_downsample() looks like a pre-processing step before the pitch is sought in pitch_search(). It would be great if you can share details on what is done.
My apologies if I am asking something that may be obvious to others. I'm to some extent familiar with LPC and auto-correlation, a little bit with pitch tracking. That's probably why I can't grasp all the details in the code.
Cheers,
Alessio
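As a companion to the downsampling question in the post above, here is a minimal sketch of halving the rate with a [0.25, 0.5, 0.25] kernel centered on the even samples. The function downsample_by_2 is hypothetical. Treating the sample before x[0] as zero is an assumption of this sketch, and the actual pitch_downsample() may handle the boundary and channel layout differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[i] = 0.25*x[2i-1] + 0.5*x[2i] + 0.25*x[2i+1],
 * writing len/2 output samples. len must be even and at least 2. */
static void downsample_by_2(const float *x, int len, float *y)
{
    int i;
    assert(x != NULL && y != NULL && len >= 2 && len % 2 == 0);

    /* First output: the sample before x[0] is assumed to be zero. */
    y[0] = 0.5f * x[0] + 0.25f * x[1];

    for (i = 1; i < len / 2; i++)
        y[i] = 0.25f * x[2 * i - 1] + 0.5f * x[2 * i] + 0.25f * x[2 * i + 1];
}
```

The kernel's frequency response is 0.5 + 0.5*cos(w), which is 1 at DC and 0 at the input Nyquist frequency. An alternating +1, -1 input therefore comes out as zero instead of aliasing to DC, while a constant input passes through unchanged. Components just below Nyquist are only partly attenuated, so this is a cheap low-pass and not a sharp anti-aliasing filter.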