Full-rank Gaussian modeling of convolutive audio mixtures applied to source separation
Ngoc Q. K. Duong
Supervisors: R. Gribonval and E. Vincent
METISS project team, INRIA, Centre de Rennes - Bretagne Atlantique, France
Nov. 2010

Table of contents
- Problem introduction and motivation
- Considered framework and contributions
- Estimation of model parameters
- Conclusion and perspectives

Under-determined source separation
Use $I$ recorded mixture signals $\mathbf{x}(t) = [x_1(t), \dots, x_I(t)]^T$ to separate $J$ sources $s_j(t)$, where $I < J$.
Convolutive mixing model: denote by $\mathbf{c}_j(t)$ the source images, i.e. the contribution of source $j$ to all microphones, and by $\mathbf{x}(t)$ the vector of mixture signals:
$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t), \qquad \mathbf{c}_j(t) = \sum_{\tau} \mathbf{h}_j(\tau)\, s_j(t - \tau)$$
where $\mathbf{h}_j(t) = [h_{1j}(t), \dots, h_{Ij}(t)]^T$ is the vector of mixing filters from source $j$ to the microphone array.

Baseline approaches
Time-domain mixing:
$$\mathbf{x}(t) = \sum_{j} \mathbf{c}_j(t) = \sum_{j} (\mathbf{h}_j \star s_j)(t)$$
STFT with narrowband approximation:
$$\mathbf{x}(n,f) = \sum_{j} \mathbf{c}_j(n,f) \approx \sum_{j} \mathbf{h}_j(f)\, s_j(n,f)$$
Sparsity assumption: only FEW sources are active at each time-frequency point
- Binary masking (DUET): only ONE source is active at each time-frequency point
- L1-norm minimization: $\forall n, f$,
$$\arg\min \sum_{j=1}^{J} |s_j(n,f)| \quad \text{s.t.} \quad \mathbf{x}(n,f) = \sum_{j=1}^{J} \mathbf{h}_j(f)\, s_j(n,f)$$
These techniques remain limited in realistic reverberant environments, where the narrowband approximation does not hold.

Considered framework
Model the STFT coefficients of the source images as zero-mean multivariate Gaussian random variables, i.e.
$$\mathbf{c}_j(n,f) \sim \mathcal{N}_c\big(\mathbf{0}, \mathbf{R}_{c_j}(n,f)\big), \qquad \mathbf{R}_{c_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f)$$
- $\mathbf{R}_j(f)$: $I \times I$ spatial covariance matrices encoding the spatial position and spatial spread of the sources
- $v_j(n,f)$: scalar source variances encoding the spectro-temporal power of the sources
Spatial covariance models
- Rank-1 model (given by the narrowband assumption): $\mathbf{R}_j(f) = \mathbf{h}_j(f)\, \mathbf{h}_j^H(f)$
- Full-rank unconstrained model: the coefficients of $\mathbf{R}_j(f)$ are unrelated a priori. This is the most general possible model and allows more flexible modeling of the mixing process.

Considered framework
Source separation can be achieved in two steps:
1. Model parameters are estimated in the ML sense
- The Expectation-Maximization (EM) algorithm is a well-known and appropriate choice for ML estimation of Gaussian mixing models
2. Source separation by multichannel Wiener filtering
$$\hat{\mathbf{c}}_j(n,f) = v_j(n,f)\, \mathbf{R}_j(f)\, \mathbf{R}_x^{-1}(n,f)\, \mathbf{x}(n,f)$$
where $\mathbf{R}_x(n,f) = \sum_{j=1}^{J} v_j(n,f)\, \mathbf{R}_j(f)$ is the covariance of the mixture under the model.
Raised issues:
- Parameter initialization for EM
- Permutation alignment (well-known in frequency-domain BSS)

Proposed algorithm
[Figure: Flow of the BSS algorithms. $\mathbf{x}(t)$ -> STFT -> $\mathbf{x}(n,f)$ -> Initialization by hierarchical clustering ($\mathbf{h}_j^{init}(f)$, $\mathbf{R}_j^{init}(f)$) -> Model parameter estimation by EM ($\mathbf{h}_j(f)$, $\mathbf{R}_j(f)$, $v_j(n,f)$) -> Permutation alignment -> Wiener filtering -> $\hat{\mathbf{s}}(n,f)$ -> ISTFT -> $\hat{\mathbf{s}}(t)$]
In each step, we adapt the existing methods for the rank-1 model to our proposed full-rank unconstrained model.

Parameter initialization [S. Winter et al. EURASIP vol.2007]
Principle: perform hierarchical clustering of the mixture STFT coefficients $\mathbf{x}(n,f)$ in each frequency bin after a proper phase and amplitude normalization.
Adaptations to our algorithms:
1. $\mathbf{h}_j^{init}(f)$ and $\mathbf{R}_j^{init}(f)$ are computed from the phase-normalized STFT coefficients instead of from both phase- and amplitude-normalized coefficients.
2. We define the distance between clusters as the average distance between samples instead of the minimum distance between them.
Source variance initialization: $v_j(n,f) = 1, \ \forall j, n, f$

EM algorithm
EM for rank-1 model [C.
Fevotte and J.-F. Cardoso, WASPAA 2005]
- Mixing model: a noise component must be considered
$$\mathbf{x}(n,f) = \sum_{j=1}^{J} \mathbf{h}_j(f)\, s_j(n,f) + \mathbf{b}(n,f)$$
Adaptations to the full-rank model
- Apply EM directly to the noiseless mixing model, i.e. $\mathbf{x}(n,f) = \sum_{j=1}^{J} \mathbf{c}_j(n,f)$
- Derive alternating parameter update rules (M-step) by maximizing the likelihood of the complete data $\{\mathbf{c}_j(n,f)\}_{j,n}$:
$$v_j(n,f) = \frac{1}{I}\, \mathrm{tr}\big(\mathbf{R}_j^{-1}(f)\, \hat{\mathbf{R}}_{c_j}(n,f)\big), \qquad \mathbf{R}_j(f) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{v_j(n,f)}\, \hat{\mathbf{R}}_{c_j}(n,f)$$

Permutation alignment [H. Sawada et al. ICASSP 2006]
Principle: permute the source orders based on the estimated source DoAs and the clustered phase-normalized mixing vectors.
Adaptation to the full-rank model:
- Compute the first principal component $\mathbf{w}_j(f)$ of $\mathbf{R}_j(f)$ by PCA, then apply the algorithm to the "equivalent" mixing vector $\mathbf{w}_j(f)$
- The order of $v_j(n,f)$ is permuted identically to that of $\mathbf{R}_j(f)$
[Figure: Phase of $w_{2j}(f)\, e^{-i \arg w_{1j}(f)}$ before and after permutation alignment, with $T_{60} = 250$ ms]

Experiment setup
Parameter and program settings:
  Number of stereo mixtures    3
  Speech length                8 s
  Sampling rate                16 kHz
  STFT window type             Sine
  Window length                1024
  Number of EM iterations      10
  Number of clusters K         30
Geometry setting:
- Source and microphone height: 1.4 m
- Room dimensions: 4.45 x 3.35 x 2.5 m
- Microphone distance: d = 0.05 m
- Reverberation time: 50, 130, 250, 500 ms
[Figure: Geometry setting. Sources s1, s2, s3 and microphones m1, m2; labels r = 0.5 m, 1.8 m, 1.5 m]

Experimental result
The full-rank model outperforms both the rank-1 model and the baseline approaches in realistic reverberant environments.

Conclusion & future work
Contributions
- Proposed to model the convolutive mixing process by full-rank unconstrained spatial covariance matrices
- Designed model parameter estimation algorithms for the full-rank model by adapting the estimation methods for the rank-1 model
- Showed that the proposed algorithm using the full-rank unconstrained spatial covariance model outperforms state-of-the-art approaches
Current result (in collaboration with S. Arberet and A.
Ozerov)
- Combined the proposed full-rank unconstrained covariance model with an NMF model for the source spectra (to appear in ISSPA, May 2010).
Future work
- Consider the full-rank unconstrained model in the context of source localization.

Thanks for your attention!
Your comments?
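
Backup: sketch of the EM updates and Wiener filtering
The M-step updates on the "EM algorithm" slide and the multichannel Wiener filter on the "Considered framework" slide can be illustrated for a single frequency bin with the minimal NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names `em_full_rank` and `wiener_filter` are hypothetical, the E-step (posterior mean plus posterior covariance of each source image) is the standard Gaussian conditional form and is assumed here since the slides only give the M-step, and the small `eps` regularization is added for numerical safety.

```python
import numpy as np


def em_full_rank(x, R, v, n_iter=10, eps=1e-10):
    """EM for one frequency bin (sketch).

    x: (N, I) complex mixture STFT frames.
    R: (J, I, I) initial spatial covariances R_j(f).
    v: (J, N) initial source variances v_j(n, f).
    """
    N, I = x.shape
    eye = np.eye(I)
    for _ in range(n_iter):
        # Source image covariances R_cj(n) = v_j(n) R_j, shape (J, N, I, I)
        Rc = v[:, :, None, None] * R[:, None, :, :]
        # Mixture covariance R_x(n) = sum_j R_cj(n), lightly regularized
        Rx_inv = np.linalg.inv(Rc.sum(axis=0) + eps * eye)
        # E-step (assumed form): Wiener gain, posterior mean, second moment
        W = Rc @ Rx_inv
        c_hat = np.einsum('jnab,nb->jna', W, x)
        Rc_hat = (c_hat[..., :, None] * c_hat[..., None, :].conj()
                  + (eye - W) @ Rc)
        # M-step: v_j(n) = tr(R_j^{-1} Rc_hat_j(n)) / I
        R_inv = np.linalg.inv(R)
        v = np.einsum('jab,jnba->jn', R_inv, Rc_hat).real / I
        v = np.maximum(v, eps)
        # M-step: R_j = (1/N) sum_n Rc_hat_j(n) / v_j(n)
        R = (Rc_hat / v[:, :, None, None]).mean(axis=1)
    return R, v


def wiener_filter(x, R, v, eps=1e-10):
    """Source image estimates c_j(n) = v_j(n) R_j R_x(n)^{-1} x(n)."""
    Rc = v[:, :, None, None] * R[:, None, :, :]
    Rx_inv = np.linalg.inv(Rc.sum(axis=0) + eps * np.eye(x.shape[1]))
    return np.einsum('jnab,nb->jna', Rc @ Rx_inv, x)
```

In a full system this would be run independently in every frequency bin, starting from the hierarchical-clustering initialization and $v_j(n,f) = 1$, and followed by permutation alignment before the final Wiener filtering. Because the model is noiseless, the estimated source images sum back to the mixture up to the `eps` regularization.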