Voice DSP Processing II
Yaakov J. Stein
Chief Scientist
RAD Data Communications

Voice DSP
Part 1  Speech biology and what we can learn from it
Part 2  Speech DSP (AGC, VAD, features, echo cancellation)
Part 3  Speech compression techniques
Part 4  Speech Recognition

Voice DSP - Part 2
Simplest processing
– Gain
– AGC
– VAD
More complex processing
– pitch tracking
– U/V decision
– computing LPC
– other features
Echo Cancellation
– sources of echo
– echo suppression
– echo cancellation
– adaptive noise cancellation
– the LMS algorithm
– other adaptive algorithms
– the standard LEC

Voice DSP - Part 2a
Simplest voice DSP

Gain (volume) Control
In analog processing (electronics) gain requires an amplifier
Great care must be taken to ensure linearity!
In digital processing (DSP) gain requires only a multiplication y = G x
Need enough bits!

Automatic Gain Control (AGC)
Can we set the gain automatically?
Yes, based on the signal's energy!
E = ∫ x²(t) dt = Σ x_n²
All we have to do is apply gain until we attain the desired energy
Assume we want the energy to be Y
Then y = √(Y/E) x = G x has exactly this energy

AGC - cont.
What if the input isn't stationary (gets stronger and weaker over time)?
The energy is defined over all times (-∞ < t < ∞) so it can't help!
So we define an "energy in window" E(t) and continuously vary the gain G(t)
This is Adaptive Gain Control
We don't want the gain to jump from window to window, so we smooth the instantaneous gain:
G(t) ← a G(t) + (1-a) √(Y/E(t))     (an IIR smoothing filter)

AGC - cont.
The a coefficient determines how fast G(t) can change
In more complex implementations we may separately control integration time, attack time, release time
What is involved in the computation of G(t)?
– squaring of the input value
– accumulation
– square root (or Pythagorean sum)
– inversion (division)
Square root and inversion are hard for a DSP processor,
but algorithmic improvements are possible (and often needed)
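A minimal sketch of the windowed AGC described above, assuming Python with NumPy; the window length, target energy Y and smoothing coefficient a are illustrative choices, not values from the slides:

    import numpy as np

    def agc(x, Y=0.01, win=256, a=0.9, eps=1e-12):
        """Windowed AGC: per-window energy E, instantaneous gain sqrt(Y/E),
        smoothed by the one-pole IIR recursion G <- a*G + (1-a)*sqrt(Y/E)."""
        x = np.asarray(x, dtype=float)
        y = np.empty_like(x)
        G = 1.0
        for start in range(0, len(x), win):
            frame = x[start:start + win]
            E = np.mean(frame ** 2) + eps      # "energy in window" (mean square)
            G_inst = np.sqrt(Y / E)            # gain that would reach energy Y
            G = a * G + (1 - a) * G_inst       # smooth so the gain does not jump
            y[start:start + win] = G * frame
        return y

In a fuller implementation the single coefficient a would typically be split into separate attack and release constants, as noted above.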
Simple VAD
Sometimes it is useful to know whether someone is talking (or not)
– save bandwidth
– suppress echo
– segment utterances
We might be able to get away with an "energy VOX"
Normally this needs a Noise Riding Threshold / Signal Riding Threshold
However, there are problems with energy VOX, since it doesn't differentiate between speech and noise
What we really want is a speech-specific activity detector: a Voice Activity Detector

Simple VAD - cont.
VADs operate by recognizing that speech is different from noise
– speech is low-pass while noise is white
– speech is mostly voiced and so has pitch in a given range
– average noise amplitude is relatively constant
A simple VAD may use:
– zero crossings
– zero crossing "derivative"
– spectral tilt filter
– energy contours
– combinations of the above

Other "simple" processes
Simple = not significantly dependent on details of the speech signal
Speed change of a recorded signal
Speed change with pitch compensation
Pitch change with speed compensation
Sample rate conversion
Tone generation
Tone detection
Dual tone generation
Dual tone detection (needs high reliability)

Voice DSP - Part 2b
Complex voice DSP

Correlation
One major difference between simple and complex processing is the computation of correlations (related to the LPC model)
Correlation is a measure of similarity
Shouldn't we use the squared difference to measure similarity?
D² = < (x(t) - y(t))² >
No, since the squared difference is sensitive to
– gain
– time shifts

Correlation - cont.
D² = < (x(t) - y(t))² > = < x² > + < y² > - 2 < x(t) y(t) >
So when D² is minimal, C(0) = < x(t) y(t) > is maximal, and arbitrary gains don't change this
To take time shifts into account, use C(τ) = < x(t) y(t+τ) > and look for the τ with maximal correlation!
We can even find out how much a signal resembles itself

Autocorrelation
Crosscorrelation   C_xy(τ) = < x(t) y(t+τ) >
Autocorrelation    C_x(τ) = < x(t) x(t+τ) >
C_x(0) is the energy!
Autocorrelation helps find hidden periodicities!
Much stronger than looking in the time representation
Wiener-Khintchine: the autocorrelation C(τ) and the power spectrum S(f) are an FT pair
So the autocorrelation contains the same information as the power spectrum
… and can itself be computed by FFT

Pitch tracking
How can we measure (and track) the pitch?
We can look for it in the spectrum
– but it may be very weak
– it may not even be there (filtered out)
– we need high resolution spectral estimation
Correlation based methods
The pitch periodicity should be seen in the autocorrelation!
Sometimes computationally simpler is the Absolute Magnitude Difference Function < | x(t) - x(t+τ) | >

Pitch tracking - cont.
Sondhi's algorithm for autocorrelation-based pitch tracking:
– obtain a window of speech
– determine whether the segment is voiced (see U/V decision below)
– low-pass filter and center-clip to reduce formant-induced correlations
– compute the autocorrelation at lags corresponding to valid pitch intervals
  • find the lag with maximum correlation, OR
  • find the lag with maximal accumulated correlation over all its multiples
Post processing
Pitch trackers rarely make small errors (the usual error is double pitch)
So correct outliers based on neighboring values
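A rough sketch of the correlation step in such a tracker, assuming Python with NumPy; the sample rate, pitch range and clipping fraction are illustrative, and the voicing decision, low-pass filter and outlier correction of the full algorithm are omitted:

    import numpy as np

    def pitch_by_autocorrelation(frame, fs=8000, f_lo=50.0, f_hi=400.0, clip=0.3):
        """Estimate the pitch of one (voiced) frame from its autocorrelation."""
        x = frame - np.mean(frame)
        # Center-clip to suppress formant-induced correlation peaks.
        c = clip * np.max(np.abs(x))
        x = np.where(np.abs(x) > c, x - np.sign(x) * c, 0.0)
        # Autocorrelation via the Wiener-Khintchine relation (FFT of the power spectrum).
        n = len(x)
        X = np.fft.rfft(x, 2 * n)
        r = np.fft.irfft(np.abs(X) ** 2)[:n]
        # Search only lags corresponding to valid pitch periods.
        lo, hi = int(fs / f_hi), int(fs / f_lo)
        lag = lo + np.argmax(r[lo:hi])
        return fs / lag

The autocorrelation is computed here through the FFT, using the Wiener-Khintchine relation mentioned earlier; the frame must be long enough to contain at least one full period at the lowest pitch of interest.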
Other Pitch Trackers
Miller's data-reduction and Gold & Rabiner's parallel processing methods
– based on zero-crossings, energy, and extrema of the waveform
Noll's cepstrum-based pitch tracker
– the pitch and formant contributions are separated in the cepstral domain
– most accurate for clean speech, but not robust in noise
Methods based on the LPC error signal
– the LPC technique breaks down at pitch pulse onset
– find the periodicity of the error by autocorrelation
Inverse filtering method
– remove the formant filtering by a low-order LPC analysis
– find the periodicity of the excitation by autocorrelation
Sondhi-like methods are the best for noisy speech

U/V decision
The U/V decision sits between VAD and pitch tracking
The simplest U/V decision is based on energy and zero crossings
More complex methods are combined with pitch tracking
There are also methods based on pattern recognition
Is voicing well defined?
– degree of voicing (buzz)
– voicing per frequency band (interference)
– degree of voicing per frequency band

LPC Coefficients
How do we find the vocal tract filter coefficients?
This is a system identification problem: known input, unknown filter, known output
The filter is all-pole (AR)
Connection to prediction:  s_n = G e_n + Σ_m a_m s_{n-m}
We can find G from the energy (so let's ignore it)

LPC Coefficients
For simplicity let's assume three a coefficients
s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}
Need three equations!
s_{n}   = e_{n}   + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}
s_{n+1} = e_{n+1} + a_1 s_{n}   + a_2 s_{n-1} + a_3 s_{n-2}
s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_{n}   + a_3 s_{n-1}
In matrix form
[ s_{n}   ]   [ e_{n}   ]   [ s_{n-1}  s_{n-2}  s_{n-3} ] [ a_1 ]
[ s_{n+1} ] = [ e_{n+1} ] + [ s_{n}    s_{n-1}  s_{n-2} ] [ a_2 ]
[ s_{n+2} ]   [ e_{n+2} ]   [ s_{n+1}  s_{n}    s_{n-1} ] [ a_3 ]
that is,  s = e + S a

LPC Coefficients - cont.
s = e + S a, so by simple algebra  a = S⁻¹ (s - e)
and we have reduced the problem to matrix inversion
S is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm)
Unfortunately, noise makes this attempt break down!
Move to the next time instant and the answer will be different
We need to somehow average the answers
The proper averaging is done before the equation solving: correlation vs. autocovariance

LPC Coefficients - cont.
We can't just average over time: all the equations would be the same!
Let's take the input to be zero:  s_n = Σ_m a_m s_{n-m}
Multiply by s_{n-q} and sum over n:  Σ_n s_n s_{n-q} = Σ_m a_m Σ_n s_{n-m} s_{n-q}
We recognize the autocorrelations:  C_s(q) = Σ_m a_m C_s(|m-q|)
These are the Yule-Walker equations
– autocorrelation method: s_n outside the window are taken to be zero (Toeplitz)
– autocovariance method: use all needed s_n (no window)
Also: pre-emphasis!
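A sketch of the autocorrelation method, assuming Python with NumPy; the model order, window and pre-emphasis coefficient are illustrative, and the Toeplitz system is solved directly for clarity, where a real implementation would use the Levinson-Durbin recursion:

    import numpy as np

    def lpc_autocorrelation(frame, order=10, pre_emphasis=0.97):
        """Estimate LPC coefficients a_1..a_p by the autocorrelation method."""
        # Pre-emphasis, then a window (samples outside the window are zero).
        x = np.append(frame[0], frame[1:] - pre_emphasis * np.asarray(frame[:-1]))
        x = x * np.hamming(len(x))
        # Autocorrelation lags C(0)..C(p)
        C = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
        # Yule-Walker: Toeplitz system  sum_m a_m C(|m-q|) = C(q),  q = 1..p
        R = np.array([[C[abs(m - q)] for m in range(order)] for q in range(order)])
        a = np.linalg.solve(R, C[1:order + 1])
        return a   # predictor: s_n ≈ sum_m a_m s_{n-m}

Because the windowed autocorrelation matrix is Toeplitz, the Levinson-Durbin recursion solves the same system in O(p²) operations and yields the reflection coefficients as a by-product.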
Alternative features
The a coefficients aren't the only possible set of features:
– reflection coefficients (cylinder model)
– log-area coefficients (cylinder model)
– pole locations
– LPC cepstrum coefficients
– Line Spectral Pair frequencies
All theoretically contain the same information (they are algebraic transformations of each other)
Euclidean distance in LPC cepstrum space ~ Itakura-Saito measure, so these are popular in speech recognition
LPC (a) coefficients don't quantize or interpolate well, so they aren't good for speech compression
LSP frequencies are best for compression

LSP coefficients
The a coefficients are not statistically equally weighted
Pole positions are better (geometric), but the radius is sensitive near the unit circle
Is there an all-angle representation?
Theorem 1: Every real polynomial with all its roots on the unit circle is palindromic (e.g. 1 + 2t + t²) or antipalindromic (e.g. t + t² - t³)
Theorem 2: Every polynomial can be written as the sum of a palindromic and an antipalindromic polynomial
Consequence: Every polynomial can be represented by roots on the unit circle, that is, by angles

Voice DSP - Part 2c
Echo Cancellation

Acoustic Echo
(figure)

Line echo
(figure: telephone 1 – hybrid – hybrid – telephone 2)

Echo suppressor
(figure: the two 4-wire paths with switches, a comparator and an inverter)
In practice we need more: VOX, over-ride, reset, etc.

Why not echo suppression?
Echo suppression makes the conversation half duplex
– waste of the full-duplex infrastructure
– conversation is unnatural
– hard to break in
– dead sounding line
It would be better to cancel the echo: subtract the echo signal, allowing the desired signal through,
but that requires DSP.

Echo cancellation?
Unfortunately, it's not so easy
The outgoing signal is delayed, attenuated, distorted
Two echo canceller architectures (figures): the modem-type canceller and the line echo canceller (LEC), each placed so that the path toward the listener is clean of echo

LEC architecture
(figure: block diagram with the hybrid, D/A and A/D converters, the far-end reference X, the hybrid return Y, an adaptive filter H with its adaptation control, a doubletalk detector, and an NLP on the residual sent to the far end)

Adaptive Algorithms
How do we find the echo cancelling filter?
How do we keep it correct even if the echo path parameters change?
We need an algorithm that continually changes the filter parameters
All adaptive algorithms are based on the same idea: the lack of correlation between the desired signal and the interference
Let's start with a simpler case: adaptive noise cancellation

Noise cancellation
(figure: the noise n reaches the listener through an unknown gain h, adding to the desired signal x; a copy of n scaled by e is subtracted from the received signal y)

Noise cancellation - cont.
Assume that the noise is distorted only by an unknown gain h
We correct by transmitting e n, so that the audience hears
y = x + h n - e n = x + (h-e) n
The energy of this signal is
E_y = < y² > = < x² > + (h-e)² < n² > + 2 (h-e) < x n >
Assume that C_xn = < x n > = 0
Then we need only set e to minimize E_y!  (turn the knob until it is minimal)
Even if the distortion is a complete filter h, we set the ANC filter e to minimize E_y

The LMS algorithm
Gradient descent on the energy
The correction to H is proportional to the error d times the input X
H ← H + λ d X

Nonlinear processing
Because of finite numeric precision, the LEC's (linear) filtering cannot completely remove the echo
The standard LEC adds center clipping to remove the residual echo
The clipping threshold needs to be properly set by adaptation

Doubletalk detection
Adaptation of H should take place only when the far end speaks
So we freeze adaptation when there is no far-end speech or there is double-talk, that is, whenever the near end speaks
The Geigel algorithm compares the absolute value of the near-end signal to half the maximum absolute value in the X buffer
If the near-end signal exceeds this threshold it cannot be echo alone, so we assume the near end is speaking
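A sketch tying the last few slides together: an adaptive FIR echo estimate with an LMS-style update, frozen by a Geigel-type doubletalk test (Python with NumPy assumed; the filter length and step size are illustrative, the update shown is the normalized variant of the plain correction H ← H + λ d X above, and the NLP / center clipper stage is omitted):

    import numpy as np

    def lec(far, mic, taps=128, mu=0.05, eps=1e-6):
        """Line echo canceller sketch: estimate the echo of the far-end signal,
        subtract it from the hybrid return, and adapt only when it is safe."""
        far = np.asarray(far, dtype=float)
        mic = np.asarray(mic, dtype=float)
        H = np.zeros(taps)                  # adaptive echo-path estimate
        X = np.zeros(taps)                  # buffer of recent far-end samples
        out = np.empty_like(mic)
        for n in range(len(mic)):
            X = np.roll(X, 1)
            X[0] = far[n]
            y_hat = H @ X                   # estimated echo
            e = mic[n] - y_hat              # residual sent to the far end
            out[n] = e
            # Geigel test: near end judged active if |mic| exceeds half the
            # largest recent |far| sample; freeze adaptation in that case.
            doubletalk = np.abs(mic[n]) > 0.5 * np.max(np.abs(X))
            if not doubletalk:
                H += (mu / (eps + X @ X)) * e * X   # normalized LMS update
        return out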