APPENDIX
In real-world applications, noise reduction algorithms generally apply suppression gain
functions to the mixture envelopes of speech and noise. A gain value of 0 yields the least residual
noise but the most speech distortion, while a gain value of 1 yields the least speech distortion but
the most residual noise. Hence, a tradeoff between residual noise and speech distortion is needed,
and it is usually achieved by deriving the suppression gain functions from mathematically optimized
criteria. These criteria mostly aim to minimize speech distortion while keeping residual noise below
a threshold. As a result, real-world noise reduction algorithms introduce some degree of speech
distortion while minimizing the effects of noise on speech intelligibility. Achieving this goal also
requires an accurate noise-estimation algorithm, so that the suppression gain functions can reduce
noise without introducing unnecessary speech distortion.
This Appendix only briefly describes the Wiener-filtering algorithm and the logMMSE
algorithm used in the present study, as the subspace and spectral-subtractive noise-reduction methods
have been described in earlier studies (e.g., Loizou, et al., 2005; Yang and Fu, 2005). Both
noise-suppression gain functions rely on one or both of the two SNR estimators, namely, the a priori
SNR, and the a posteriori SNR, both of which in turn depend on the noise spectrum estimation.
1) Estimation of a priori SNR, a posteriori SNR and gain function
The concept of the a priori SNR has been introduced to achieve the best trade-off between speech
distortion and residual noise. The a priori SNR $\xi_k$ is defined as the ratio of the clean-speech power
spectrum to the noise power spectrum. Because the clean-speech power spectrum is not accessible in
practice, $\xi_k$ has to be estimated from the noisy-speech power spectrum. The a posteriori
SNR $\gamma_k$ is defined as the ratio of the noisy-speech power spectrum to the noise power spectrum. In
real-world practice, the a priori SNR is estimated using the recursive decision-directed method
(Ephraim and Malah, 1984), which combines the estimated clean-speech power spectrum in the previous
speech frame with the a posteriori SNR in the current frame.
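The decision-directed recursion described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name, the weighting factor `alpha` (0.98 is a commonly quoted value), and the floor `xi_min` are assumptions, and the same noise estimate is used for the previous and the current frame for simplicity.

```python
def decision_directed_snr(noisy_power, noise_power, prev_clean_power,
                          alpha=0.98, xi_min=1e-3):
    """Estimate the a priori and a posteriori SNR for one frame.

    All three inputs are per-frequency-bin power values (lists of floats).
    alpha and xi_min are illustrative assumptions, not values from the text.
    """
    xi, gamma = [], []
    for y2, d2, s2_prev in zip(noisy_power, noise_power, prev_clean_power):
        # A posteriori SNR: noisy-speech power over noise power.
        g = y2 / d2
        # Weighted sum of the previous frame's clean-speech estimate and
        # the half-wave rectified (gamma - 1) term of the current frame.
        x = alpha * s2_prev / d2 + (1.0 - alpha) * max(g - 1.0, 0.0)
        # Floor the estimate so later gain computations stay well defined.
        xi.append(max(x, xi_min))
        gamma.append(g)
    return xi, gamma
```

A value of `alpha` close to 1 makes the estimate follow the previous frame closely, which smooths the a priori SNR over time at the cost of slower reaction to speech onsets.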
The gain function $g_k$ is defined as the ratio of the estimated clean-speech power spectrum and
the noisy-speech power spectrum, and for the Wiener-filtering algorithm, $g_k$ can be expressed in
terms of the a priori SNR $\xi_k$ as:
$$g_k = \frac{\xi_k}{\xi_k + 1};$$
and for the logMMSE algorithm, $g_k$ can be expressed as:
$$g_k = \frac{\xi_k}{\xi_k + 1} \exp\left\{ \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt \right\}, \qquad v_k = \frac{\xi_k}{\xi_k + 1}\, \gamma_k .$$
After the gain function is estimated, it is straightforward to compute the estimated clean-speech power
spectrum.
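As a rough illustration of the two gain functions, the sketch below evaluates them for a single frequency bin using only the Python standard library. The integral in the logMMSE gain is the exponential integral E1, computed here with a textbook power series for small arguments and a continued fraction for larger ones. The function names, the iteration limits, and the small floor applied to `v` are assumptions made for this example.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def exp1(x, max_iter=200, eps=1e-12):
    """E1(x): integral of exp(-t)/t from x to infinity, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{n>=1} (-x)^n / (n * n!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0  # running value of (-x)^n / n!
        for n in range(1, max_iter + 1):
            term *= -x / n
            delta = -term / n
            total += delta
            if abs(delta) < eps:
                break
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def wiener_gain(xi):
    """Wiener gain as a function of the a priori SNR xi."""
    return xi / (xi + 1.0)


def logmmse_gain(xi, gamma, v_min=1e-10):
    """logMMSE gain from the a priori SNR xi and a posteriori SNR gamma.

    v_min is an assumed floor that keeps E1 finite when v approaches 0.
    """
    v = max(xi / (xi + 1.0) * gamma, v_min)
    return xi / (xi + 1.0) * math.exp(0.5 * exp1(v))
```

Because the exponential factor is never smaller than 1, the logMMSE gain in this sketch is always at least as large as the Wiener gain for the same a priori SNR, and the two converge as `v` grows.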
2) Noise power spectrum estimation
The smoothed power spectrum of the noisy speech is first computed using a first-order recursive
equation involving the short-time power spectrum of the noisy speech and a smoothing constant. Next,
a nonlinear rule is used to track the minimum of the noisy-speech power spectrum by continuously
averaging past spectral values. Then, the speech presence in each frame and frequency will be
determined by comparing the ratio between the noisy-speech power spectrum and its local minimum
to a frequency-dependent threshold. If the above ratio is found to be greater than the threshold, it is
taken as a speech-present frequency bin; otherwise, it is taken as a speech-absent frequency bin. The
above processing is based on the principle that the power spectrum of the noisy speech will be nearly
equal to its local minimum when speech is absent. Hence, the smaller the ratio, the higher the
probability it will be a noise-only region and vice versa. The speech-presence probability is updated
using a first-order recursion that implicitly exploits the correlation for speech presence in adjacent
frames. Using the speech-presence probability estimate, the time-frequency-dependent smoothing
factor is computed. Finally, the noise power spectrum estimate is updated using this
time-frequency-dependent smoothing factor.
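The sequence of steps above can be sketched as a per-frame update. The text does not give the exact update rules or constants, so everything specific here is an assumption chosen for illustration: the form of the nonlinear minimum-tracking rule, the constants `eta`, `min_smooth`, `min_lookahead`, `alpha_p` and `alpha_d`, the state layout, and the function names.

```python
def init_noise_state(first_frame_power):
    """Start the tracker from the first frame (assumed to be noise only)."""
    n = len(first_frame_power)
    return {
        "smoothed": list(first_frame_power),  # smoothed noisy-speech power
        "minimum": list(first_frame_power),   # tracked local minimum
        "presence": [0.0] * n,                # speech-presence probability
        "noise": list(first_frame_power),     # noise power estimate
    }


def update_noise_estimate(state, noisy_power, thresholds,
                          eta=0.7, min_smooth=0.998, min_lookahead=0.8,
                          alpha_p=0.2, alpha_d=0.85):
    """Advance the tracker by one frame; all lists are per frequency bin.

    Every constant here is an illustrative assumption.
    """
    for k, y2 in enumerate(noisy_power):
        p_prev = state["smoothed"][k]
        # 1) First-order recursive smoothing of the noisy-speech power.
        p = eta * p_prev + (1.0 - eta) * y2
        # 2) Nonlinear minimum tracking: rise slowly, fall immediately.
        p_min_prev = state["minimum"][k]
        if p_min_prev < p:
            p_min = (min_smooth * p_min_prev
                     + (1.0 - min_smooth) / (1.0 - min_lookahead)
                     * (p - min_lookahead * p_prev))
        else:
            p_min = p
        p_min = max(p_min, 1e-12)  # guard against division by zero
        # 3) Speech-presence decision from the ratio to the local minimum.
        present = 1.0 if p / p_min > thresholds[k] else 0.0
        # 4) First-order recursion on the speech-presence probability.
        prob = alpha_p * state["presence"][k] + (1.0 - alpha_p) * present
        # 5) Time-frequency-dependent smoothing factor.
        alpha_s = alpha_d + (1.0 - alpha_d) * prob
        # 6) Noise update: nearly frozen when speech is likely present.
        state["noise"][k] = alpha_s * state["noise"][k] + (1.0 - alpha_s) * y2
        state["smoothed"][k] = p
        state["minimum"][k] = p_min
        state["presence"][k] = prob
    return state["noise"]
```

In this sketch a high speech-presence probability pushes the smoothing factor toward 1, so a loud speech frame changes the noise estimate far less than it would under a fixed smoothing constant.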