An online dereverberation algorithm for hearing aids with binaural cues preservation

Boaz Schwartz (1), Sharon Gannot (1), and Emanuël A.P. Habets (2)

(1) Faculty of Engineering, Bar-Ilan University, Ramat-Gan, Israel
(2) International Audio Laboratories Erlangen, University of Erlangen-Nuremberg & Fraunhofer IIS, Erlangen, Germany

boazsh0@gmail.com, sharon.gannot@biu.ac.il, Emanuel.Habets@audiolabs-erlangen.de

Motivation

Previous work [1-3]:
✓ Significant dereverberation was achieved.
✓ The online algorithm was tested on moving speakers.
✗ Single-channel output, whereas a binaural output is required.
✗ The early speech signal was estimated, yielding a dry-sounding output.

In this work we present:
✓ An online algorithm for binaural dereverberation.
✓ Preservation of the desired part of the RIR.

Fig. 1: Left and right RIRs.

Statistical model

In the STFT domain, the speech signal in the t-th time frame and k-th frequency bin is modeled as
$$ s(t,k) \sim \mathcal{N}_C\{0,\, \phi_s(t,k)\}, \qquad \phi_s(t,k) = \phi_w(t) \cdot \left|1 - \mathbf{a}_t^T \mathbf{e}_k\right|^{-2}, $$
an LPC-based spectral model in which $\mathbf{a}_t$ holds the LPC coefficients and $\mathbf{e}_k$ the corresponding complex exponentials. Using the convolutive transfer function (CTF) model for the reverberation, the j-th input signal is
$$ z_j(t,k) \approx \sum_{l=0}^{L-1} h_j(l,k)\, s(t-l,k) + v_j(t,k), \qquad v_j(t,k) \sim \mathcal{N}_C\{0,\, \phi_{v_j}(t,k)\}, \quad j = 0,\ldots,J-1. $$
We use the state-vector representation (k omitted),
$$ z_j(t) = \mathbf{h}_j^T \mathbf{s}_t + v_j(t), \qquad \mathbf{s}_t^T \equiv [s(t-L+1), \ldots, s(t)], \qquad \mathbf{h}_j^T \equiv [h_j(L-1), \ldots, h_j(0)], $$
and for the multi-channel signal we use the matrix form
$$ \mathbf{z}_t \equiv [z_0(t), \ldots, z_{J-1}(t)]^T, \qquad \mathbf{v}_t \equiv [v_0(t), \ldots, v_{J-1}(t)]^T, \qquad \mathbf{z}_t = \mathbf{H}\,\mathbf{s}_t + \mathbf{v}_t, $$
with $\mathbf{H} = (\mathbf{h}_0, \ldots, \mathbf{h}_{J-1})^T$ comprising all J CTFs. In order to apply MMSE estimation, define the matrices
$$ \mathbf{G} = \begin{pmatrix} \phi_{v_0} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_{v_{J-1}} \end{pmatrix}, \qquad \boldsymbol{\Phi} = \begin{pmatrix} \mathbf{0} & \mathbf{I}_{L-1} \\ 0 & \mathbf{0}^T \end{pmatrix}, \qquad \mathbf{F}_t = \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \phi_x(t) \end{pmatrix}. $$

Binaural problem formulation

Define the following signal as the reference,
$$ x(t) = h_\ell(0)\, s(t), $$
where $\ell$ denotes the index of the reference microphone of the left device. We now re-write the state-vector representation as
$$ z_j(t) = \tilde{\mathbf{h}}_j^T \mathbf{x}_t + v_j(t), \qquad \tilde{\mathbf{h}}_j = h_\ell^{-1}(0) \cdot \mathbf{h}_j, $$
with $\mathbf{x}_t$ defined similarly to $\mathbf{s}_t$. The matrix form is now $\mathbf{z}_t = \tilde{\mathbf{H}}\,\mathbf{x}_t + \mathbf{v}_t$ with $\tilde{\mathbf{H}} = (\tilde{\mathbf{h}}_0, \ldots, \tilde{\mathbf{h}}_{J-1})^T$, and the state-space model of the desired source is
$$ \mathbf{x}_t = \boldsymbol{\Phi}\,\mathbf{x}_{t-1} + \mathbf{u}_t, \qquad \mathbf{u}_t \equiv [0, \ldots, 0, x(t)]^T. $$
The target signals are defined by
$$ y_j(t) = \left[\mathbf{W}\,\tilde{\mathbf{h}}_j\right]^T \mathbf{x}_t, $$
where $\mathbf{W}$ is a weighting matrix. An intuitive choice is
$$ \mathbf{W}_\alpha \equiv \mathrm{diag}\left\{e^{0}, e^{-\alpha}, \ldots, e^{-(L-1)\alpha}\right\}, $$
and the two extremes would be
$$ \mathbf{W}_\infty = \mathrm{diag}\{1, 0, \ldots, 0\}, \qquad \mathbf{W}_0 = \mathrm{diag}\{1, 1, \ldots, 1\}. $$
Finally, the binaural target signal is
$$ \mathbf{y}_B(t) = \big(\mathbf{W}\,[\tilde{\mathbf{h}}_\ell,\, \tilde{\mathbf{h}}_r]\big)^T \mathbf{x}_t, $$
where r is the index of the reference microphone on the right device.

Algorithm outline - RKEMD

Fig. 2: General scheme of the proposed algorithm.

E-step: Kalman filter

Predict:
$$ \hat{\mathbf{x}}_{t|t-1} = \boldsymbol{\Phi}\,\hat{\mathbf{x}}_{t-1|t-1}, \qquad \mathbf{P}_{t|t-1} = \boldsymbol{\Phi}\,\mathbf{P}_{t-1|t-1}\,\boldsymbol{\Phi}^T + \mathbf{F}_t. $$

Update:
$$ \mathbf{K}_t = \mathbf{P}_{t|t-1}\,\tilde{\mathbf{H}}_t^H \left[\tilde{\mathbf{H}}_t\, \mathbf{P}_{t|t-1}\, \tilde{\mathbf{H}}_t^H + \mathbf{G}\right]^{-1} $$
$$ \mathbf{e}_t = \mathbf{z}_t - \tilde{\mathbf{H}}_t\,\hat{\mathbf{x}}_{t|t-1} $$
$$ \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t\,\mathbf{e}_t $$
$$ \mathbf{P}_{t|t} = \left[\mathbf{I}_L - \mathbf{K}_t\,\tilde{\mathbf{H}}_t\right]\mathbf{P}_{t|t-1} $$

Statistics and M-step

Sufficient statistics:
$$ \mathbf{R}_{xx}^{(t)} = \beta\,\mathbf{R}_{xx}^{(t-1)} + (1-\beta)\cdot\left(\hat{\mathbf{x}}_{t|t}\,\hat{\mathbf{x}}_{t|t}^H + \mathbf{P}_{t|t}\right) $$
$$ \mathbf{r}_{xz_j}^{(t)} = \beta\,\mathbf{r}_{xz_j}^{(t-1)} + (1-\beta)\cdot\hat{\mathbf{x}}_{t|t}\, z_j^*(t) $$
$$ r_{z_j z_j}^{(t)} = \beta\, r_{z_j z_j}^{(t-1)} + (1-\beta)\cdot |z_j(t)|^2 $$

Parameters:
• $\tilde{\mathbf{h}}_j^{(t)}$ ← linear fit of $\mathbf{x}_t$ and $z_j(t)$ (least-squares).
• $\phi_{v_j}(t)$ ← residual of the linear fit.

Output signal:
$$ \mathbf{y}_B(t) = \big(\mathbf{W}\,[\tilde{\mathbf{h}}_\ell^{(t)},\, \tilde{\mathbf{h}}_r^{(t)}]\big)^T\, \hat{\mathbf{x}}_{t|t} $$
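To make the weighting concrete, here is a minimal NumPy sketch of $\mathbf{W}_\alpha$ and its two extremes; the helper name ctf_weighting is ours (not from the poster), and the CTF taps are assumed ordered $h(0), \ldots, h(L-1)$:

```python
import numpy as np

def ctf_weighting(alpha: float, L: int) -> np.ndarray:
    """W_alpha = diag{e^0, e^-alpha, ..., e^-(L-1)alpha} over the CTF taps.

    alpha = 0 keeps every tap (W_0 = I, the full reverberant reference);
    alpha = np.inf keeps only the direct path (W_inf = diag{1, 0, ..., 0}).
    """
    if np.isinf(alpha):
        w = np.zeros(L)
        w[0] = 1.0  # direct path only
    else:
        w = np.exp(-alpha * np.arange(L))  # exponentially decaying tap weights
    return np.diag(w)
```

Intermediate values of α (the experiments below examine 0.1, 0.35, 0.7, and ∞) retain the early reflections while attenuating the reverberant tail, trading dryness against naturalness.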
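As an illustration only, the following sketch assembles one recursion of the above scheme for a single frequency band. It is written under several assumptions: the speech PSD $\phi_x(t)$ is taken as given (in the poster it follows from the LPC model), the least-squares M-step is spelled out from the normal equations implied by the sufficient statistics, and all function and field names (rkemd_step, the state dict) are our own, not the authors':

```python
import numpy as np

def rkemd_step(z_t, state, alpha=0.35, beta=0.98, ell=0, r=1):
    """One time-frame recursion in a single frequency band (k omitted).

    z_t   : (J,) complex STFT observations of the current frame.
    state : dict with x_hat (L,), P (L,L), H (J,L) relative CTFs
            (taps stored reversed, h(0) last, matching the state vector),
            G (J,J) diagonal noise covariance, phi_x (speech PSD, assumed
            given), and running statistics Rxx (L,L), rxz (L,J), rzz (J,).
    ell, r: indices of the left/right reference microphones.
    Returns the binaural output sample (y_left, y_right).
    """
    H, G = state["H"], state["G"]
    L, J = state["x_hat"].shape[0], z_t.shape[0]

    # --- E-step: Kalman filter ---
    Phi = np.eye(L, k=1)                       # shift matrix: drops the oldest sample
    F = np.zeros((L, L)); F[-1, -1] = state["phi_x"]
    x_pred = Phi @ state["x_hat"]              # x_{t|t-1}
    P_pred = Phi @ state["P"] @ Phi.T + F      # P_{t|t-1}

    S = H @ P_pred @ H.conj().T + G            # innovation covariance
    K = P_pred @ H.conj().T @ np.linalg.inv(S)
    x_hat = x_pred + K @ (z_t - H @ x_pred)    # x_{t|t}
    P = (np.eye(L) - K @ H) @ P_pred           # P_{t|t}

    # --- Recursive sufficient statistics ---
    state["Rxx"] = beta * state["Rxx"] + (1 - beta) * (np.outer(x_hat, x_hat.conj()) + P)
    state["rxz"] = beta * state["rxz"] + (1 - beta) * np.outer(x_hat, z_t.conj())
    state["rzz"] = beta * state["rzz"] + (1 - beta) * np.abs(z_t) ** 2

    # --- M-step: least-squares refit of the relative CTFs and noise PSDs ---
    # Normal equations of z_j(t) = h_j^T x_t + v_j(t): h_j = conj(Rxx^{-1} r_xz_j).
    H = np.linalg.solve(state["Rxx"], state["rxz"]).conj().T   # (J, L)
    for j in range(J):
        resid = state["rzz"][j] - np.real(H[j] @ state["rxz"][:, j])
        G[j, j] = max(resid, 1e-12)            # residual power as noise PSD
    state.update(H=H, x_hat=x_hat, P=P)

    # --- Binaural output: exponentially weighted early part of the RIR ---
    # Taps are stored newest-last, so reverse the weights to align e^0 with h(0).
    w = np.exp(-alpha * np.arange(L))[::-1]
    return (w * H[ell]) @ x_hat, (w * H[r]) @ x_hat
```

In practice the recursion would run independently per frequency bin, with the CTF estimates and noise PSDs carried over from frame to frame via the state dict.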
Experimental Study

Experiments took place in the Speech and Acoustic Lab at Bar-Ilan University:
• Starkey hearing aids mounted on a B&K HATS.
• RIRs were recorded.
• 3 different reverberation times, 3 positions, 5 angles.

Fig. 3: Experimental setup: hearing aid (left), lab setup (right).
Fig. 4: ITD and ILD distributions for the different reverberation levels (rows) and different values of α (columns). These plots relate to the farthest speaker (speaker 3 in the setup above).
Fig. 5: WSNR improvement (dB, w.r.t. the direct speech) for α = 0.1, 0.35, 0.7, ∞, plotted against the DRR range (dB): [-10,-6], [-6,-4], [-4,-2.3], [-2.3,0.5], [0.5,4].

Conclusions

• Speech quality improves as α increases in the range [0, 0.7]: more reverberation is removed.
• Above 0.7 there is a slight degradation due to estimation errors; this degradation is more pronounced in the subjective evaluation.
• As α increases, the signal becomes more directional, as can be deduced from the more concentrated cue scattering.

References

[1] B. Schwartz, S. Gannot, and E. A. P. Habets, “Online speech dereverberation using Kalman filter and EM algorithm,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, Feb. 2015.
[2] B. Schwartz, S. Gannot, and E. A. P. Habets, “Multi-microphone speech dereverberation using expectation-maximization and Kalman smoothing,” in Proc. European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, Sept. 2013.
[3] B. Schwartz, S. Gannot, and E. A. P. Habets, “LPC-based speech dereverberation using Kalman-EM algorithm,” in Proc. International Workshop on Acoustic Echo and Noise Control (IWAENC), Antibes – Juan-les-Pins, France, Sept. 2014.