Oral presentation

Full-rank Gaussian modeling of
convolutive audio mixtures applied to
source separation
Ngoc Q. K. Duong,
Supervisors: R. Gribonval and E. Vincent
METISS project team, INRIA,
Centre de Rennes - Bretagne Atlantique, France
Nov. 2010.
Table of contents

Problem introduction and motivation

Considered framework and contributions

Estimation of model parameters

Conclusion and perspective
Under-determined source separation
Use $I$ recorded mixture signals $x(t) = [x_1(t), \dots, x_I(t)]^T$ to separate $J$ sources $s_j(t)$, where $I < J$.
Convolutive mixing model:
Denote by $c_j(t)$ the source images, i.e. the contribution of each source to all microphones, and by $x(t)$ the vector of mixture signals:
$$x(t) = \sum_{j=1}^{J} c_j(t)$$
$$c_j(t) = \sum_{\tau} h_j(\tau)\, s_j(t - \tau)$$
where $h_j(t) = [h_{1j}(t), \dots, h_{Ij}(t)]^T$ is the vector of mixing filters from source $j$ to the microphone array.
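The convolutive mixing model above can be sketched numerically as follows. This is a toy illustration only: the dimensions, the random signals, and the random filters are assumptions standing in for real sources and room impulse responses.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, T, L = 2, 3, 1000, 64   # microphones, sources, samples, filter length (toy values)

s = rng.standard_normal((J, T))       # source signals s_j(t)
h = rng.standard_normal((J, I, L))    # mixing filters h_ij(tau), random stand-ins

# Source images c_j(t) = sum_tau h_j(tau) s_j(t - tau), one row per microphone
c = np.stack([
    np.stack([np.convolve(s[j], h[j, i])[:T] for i in range(I)])
    for j in range(J)
])                                    # shape (J, I, T)

x = c.sum(axis=0)                     # mixture x(t) = sum_j c_j(t), shape (I, T)
```

With I = 2 and J = 3 the problem is under-determined: there are fewer observed channels than sources.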
Baseline approaches
$$x(t) = \sum_j h_j \star s_j(t), \qquad c_j(t) = h_j \star s_j(t)$$
STFT with narrowband approximation:
$$x(n, f) \approx \sum_j h_j(f)\, s_j(n, f), \qquad c_j(n, f) \approx h_j(f)\, s_j(n, f)$$
Sparsity assumption: only FEW sources are active at each time-frequency point
Binary masking (DUET): only ONE source is active at each time-frequency point
L1-norm minimization: for all $(n, f)$,
$$\arg\min_{s_j(n, f)} \sum_{j=1}^{J} |s_j(n, f)| \quad \text{s.t.} \quad x(n, f) = \sum_{j=1}^{J} h_j(f)\, s_j(n, f)$$
These techniques remain limited in realistic reverberant environments, where the narrowband approximation does not hold
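To make the "only ONE source is active" assumption concrete, here is a minimal sketch of binary masking under the narrowband model. It is not the DUET algorithm itself: it assumes the mixing vectors $h_j(f)$ are already known, and the function name `binary_mask_separate` and the array layouts are hypothetical.

```python
import numpy as np

def binary_mask_separate(X, H):
    """X: (F, N, I) mixture STFT; H: (F, I, J) narrowband mixing vectors (assumed known).
    Returns S_hat: (F, N, J) with a single nonzero source per time-frequency point."""
    norms = np.linalg.norm(H, axis=1)                      # ||h_j(f)||, shape (F, J)
    Hn = H / norms[:, None, :]                             # unit-norm mixing vectors
    proj = np.einsum('fij,fni->fnj', Hn.conj(), X)         # h_j^H x / ||h_j||
    winner = np.abs(proj).argmax(axis=-1)                  # best-aligned source, (F, N)
    mask = np.eye(H.shape[2], dtype=bool)[winner]          # one-hot mask, (F, N, J)
    gain = proj / norms[:, None, :]                        # least-squares s_j = h_j^H x / ||h_j||^2
    return np.where(mask, gain, 0)
```

When exactly one source is active at a point, the projection onto its own mixing vector is the largest, so the mask selects it and the least-squares gain recovers its coefficient. In reverberant conditions the narrowband model behind `H` breaks down, which is the limitation stated above.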
Considered framework
The STFT coefficients of the source images are modeled as zero-mean multivariate Gaussian random variables, i.e.
$$c_j(n, f) \sim \mathcal{N}_c\big(0, R_{c_j}(n, f)\big), \qquad R_{c_j}(n, f) = v_j(n, f)\, R_j(f)$$
- $R_j(f)$: $I \times I$ spatial covariance matrices encoding the spatial position and spatial spread of the sources
- $v_j(n, f)$: scalar source variances encoding the spectro-temporal power of the sources
Spatial covariance models
Rank-1 model (given by the narrowband assumption): $R_j(f) = h_j(f)\, h_j^H(f)$
Full-rank unconstrained model: the coefficients of $R_j(f)$ are unrelated a priori
This is the most general possible model and allows more flexible modeling of the mixing process
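The difference between the two spatial covariance models can be illustrated with a small numerical sketch. The way the full-rank matrix is built here (an average over many random propagation paths) is an assumption chosen only to produce a valid Hermitian positive definite example; the model itself places no such structure on $R_j(f)$.

```python
import numpy as np

rng = np.random.default_rng(0)
I = 2                                                   # number of microphones

# Rank-1 model: R_j(f) = h_j(f) h_j(f)^H for a single mixing vector
h = rng.standard_normal(I) + 1j * rng.standard_normal(I)
R_rank1 = np.outer(h, h.conj())

# Full-rank model: any Hermitian positive definite matrix is allowed.
# ASSUMPTION: built here as an average of outer products over 50 random paths.
paths = rng.standard_normal((50, I)) + 1j * rng.standard_normal((50, I))
R_full = np.einsum('ki,kj->ij', paths, paths.conj()) / len(paths)

rank_1 = np.linalg.matrix_rank(R_rank1, tol=1e-10)
rank_full = np.linalg.matrix_rank(R_full)

# Covariance of the source image at one time-frequency point: R_cj = v_j R_j
v = 0.3
R_c = v * R_full
```

The rank-1 matrix has 2I - 1 real degrees of freedom up to scale, while the unconstrained Hermitian matrix has I^2, which is what gives the full-rank model room to describe reverberation.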
Considered framework
Source separation can be achieved in two steps:
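1. Model parameters are estimated in the maximum likelihood (ML) sense
- The Expectation-Maximization (EM) algorithm is a well-known and appropriate choice for ML estimation of the Gaussian mixing model
2. Source separation by multichannel Wiener filtering
$$\hat{c}_j(n, f) = v_j(n, f)\, R_j(f)\, R_x^{-1}(n, f)\, x(n, f)$$
where $R_x(n, f) = \sum_{j} v_j(n, f)\, R_j(f)$ is the covariance matrix of the mixture, assuming uncorrelated sources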
Raised issues:
- Parameter initialization for EM
- Permutation alignment (well-known in frequency-domain BSS)
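The multichannel Wiener filtering step can be sketched at a single time-frequency point as follows. The function name `wiener_separate` and the array layout are hypothetical, and the mixture covariance is taken as the sum of the source image covariances, which assumes uncorrelated sources.

```python
import numpy as np

def wiener_separate(x, v, R):
    """x: (I,) mixture STFT vector at one (n, f); v: (J,) source variances;
    R: (J, I, I) spatial covariances. Returns c_hat: (J, I) source image estimates."""
    R_c = v[:, None, None] * R        # R_cj = v_j R_j
    R_x = R_c.sum(axis=0)             # mixture covariance (uncorrelated sources assumed)
    y = np.linalg.solve(R_x, x)       # R_x^{-1} x without forming the inverse explicitly
    return R_c @ y                    # c_hat_j = v_j R_j R_x^{-1} x
```

A useful sanity check is that the estimated source images always sum back to the mixture, since the Wiener gains sum to the identity matrix.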
Proposed algorithm
[Figure: flow of the BSS algorithm. $x(t)$ -> STFT -> $x(n, f)$ -> initialization by hierarchical clustering ($h_j^{init}(f)$, $R_j^{init}(f)$) -> model parameter estimation by EM ($h_j(f)$, $R_j(f)$, $v_j(n, f)$) -> permutation alignment -> Wiener filtering -> $\hat{s}(n, f)$ -> ISTFT -> $\hat{s}(t)$]
In each step, we adapt the existing methods for the rank-1 model to
our proposed full-rank unconstrained model
Parameter initialization
[S. Winter et al. EURASIP vol.2007]
Principle: perform the hierarchical clustering of the mixture STFT
coefficients x( n, f ) in each frequency bin after a proper phase and
amplitude normalization
Adaptations to our algorithms:
1. $h_j^{init}(f)$ and $R_j^{init}(f)$ are computed from the phase-normalized STFT coefficients instead of from both phase- and amplitude-normalized coefficients
2. We define the distance between clusters as the average distance between samples instead of the minimum distance between them.
Source variance initialization:
$$v_j(n, f) = 1, \quad \forall j, n, f$$
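The initialization in one frequency bin can be sketched as below. This is a simplified illustration, not the cited procedure: the choice of the first channel as phase reference, the plain Euclidean distance, and the estimates of $h_j^{init}$ and $R_j^{init}$ as the cluster mean and mean outer product are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def phase_normalize(X):
    """Rotate each row of X (N, I) so that its first channel is real and nonnegative."""
    return X * np.exp(-1j * np.angle(X[:, :1]))

def average_linkage(Z, K):
    """Agglomerative clustering of the rows of Z into K clusters.
    The distance between two clusters is the AVERAGE pairwise sample distance."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)   # pairwise distances
    clusters = [[n] for n in range(len(Z))]
    while len(clusters) > K:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)       # b > a, so index a stays valid
    return clusters

def init_from_cluster(X_phase, members):
    """ASSUMPTION: h_init is the cluster mean and R_init the mean outer product
    of the phase-normalized (not amplitude-normalized) coefficients."""
    Xc = X_phase[members]
    h_init = Xc.mean(axis=0)
    R_init = np.einsum('na,nb->ab', Xc, Xc.conj()) / len(Xc)
    return h_init, R_init
```

This brute-force loop is only practical for small toy inputs; it is written for readability rather than speed.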
EM algorithm
EM for rank-1 model [C. Fevotte and J-F Cardoso, WASPAA2005]
- Mixing model: must consider a noise component
$$x(n, f) = \sum_{j=1}^{J} h_j(f)\, s_j(n, f) + b(n, f)$$
Adaptations to the full-rank model
- Apply EM directly to the noiseless mixing model, i.e. $x(n, f) = \sum_{j=1}^{J} c_j(n, f)$
- Derive alternating parameter update rules (M-step) by maximizing the likelihood of the complete data $\{c_j(n, f)\}_{j, n}$:
$$v_j(n, f) = \frac{1}{I} \operatorname{tr}\big(R_j^{-1}(f)\, \hat{R}_{c_j}(n, f)\big)$$
$$R_j(f) = \frac{1}{N} \sum_{n} \frac{1}{v_j(n, f)}\, \hat{R}_{c_j}(n, f)$$
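One EM iteration in a single frequency bin can be sketched as follows. The M-step follows the two update rules above. The E-step is not spelled out on the slide, so the form used here is an assumption: $\hat{R}_{c_j}$ is taken as the posterior second moment of $c_j$ under the Gaussian model, i.e. the outer product of the Wiener estimate plus the posterior covariance $(I - W_j) R_{c_j}$. The function name and array layout are hypothetical.

```python
import numpy as np

def em_iteration(X, v, R):
    """One EM iteration in a single frequency bin.
    X: (N, I) mixture STFT frames; v: (J, N) variances; R: (J, I, I) spatial covariances."""
    N, I = X.shape
    # E-step (ASSUMED form): Wiener estimate and posterior second moment of each c_j
    R_c = v[:, :, None, None] * R[:, None]                   # R_cj(n), shape (J, N, I, I)
    R_x = R_c.sum(axis=0)                                     # mixture covariance, (N, I, I)
    W = R_c @ np.linalg.inv(R_x)                              # Wiener gains, (J, N, I, I)
    c_hat = np.einsum('jnab,nb->jna', W, X)                   # posterior means
    R_hat = (np.einsum('jna,jnb->jnab', c_hat, c_hat.conj())
             + (np.eye(I) - W) @ R_c)                         # posterior second moments
    # M-step: v_j = tr(R_j^{-1} R_hat) / I, then R_j = mean_n R_hat / v_j
    R_inv = np.linalg.inv(R)
    v_new = np.einsum('jab,jnba->jn', R_inv, R_hat).real / I
    R_new = (R_hat / v_new[:, :, None, None]).mean(axis=1)
    return v_new, R_new
```

The spatial covariance update uses the freshly updated variances, matching the alternating structure of the update rules.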
Permutation alignment
[H. Sawada et al. ICASSP2006 ]
Principle: permute the source order based on the estimated source DoAs and the clustered phase-normalized mixing vectors.
Adaptation to the full-rank model: compute the first principal component $w_j(f)$ of $R_j(f)$ by PCA, then apply the algorithm to the “equivalent” mixing vector $w_j(f)$
The order of $v_j(n, f)$ is permuted identically to that of $R_j(f)$
[Figure: phase of $w_{2j}(f)\, e^{-i \arg(w_{1j}(f))}$ before and after permutation alignment, with $T_{60} = 250$ ms]
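The "equivalent" mixing vector can be sketched with an eigendecomposition, since the first principal component of a Hermitian covariance matrix is its eigenvector with the largest eigenvalue. The function name is hypothetical, and normalizing the phase with respect to the first channel is an assumption based on the plotted quantity.

```python
import numpy as np

def equivalent_mixing_vector(R):
    """R: (I, I) Hermitian spatial covariance. Returns its first principal component,
    phase-normalized so that the first entry is real and nonnegative."""
    eigvals, eigvecs = np.linalg.eigh(R)        # eigenvalues in ascending order
    w = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
    return w * np.exp(-1j * np.angle(w[0]))     # w_i * exp(-i arg w_1)
```

For a nearly rank-1 covariance the result is parallel to the underlying mixing vector, so the rank-1 alignment procedure can be applied to it unchanged.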
Experiment setup
Parameter and program settings
Number of stereo mixtures: 3
Speech length: 8 s
Sampling rate: 16 kHz
STFT window type: Sine
Window length: 1024
Number of EM iterations: 10
Number of clusters K: 30
Geometry setting
Source and microphone height: 1.4 m
Room dimensions: 4.45 x 3.35 x 2.5 m
Microphone distance: d = 0.05 m
Reverberation time: 50, 130, 250, 500 ms
[Figure: room layout; labels: sources s1, s2, s3, microphones m1, m2, r = 0.5 m, 1.8 m, 1.5 m]
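The STFT settings listed above translate into the following configuration sketch. The 50% overlap and the half-sample offset in the sine window definition are assumptions; the slide gives only the window type and length.

```python
import numpy as np

fs = 16000            # sampling rate (Hz), from the settings
win_len = 1024        # window length (samples), from the settings
duration = 8          # speech length (s), from the settings
hop = win_len // 2    # ASSUMPTION: 50% overlap, not stated on the slide

# Sine window (ASSUMPTION: half-sample offset variant)
window = np.sin(np.pi * (np.arange(win_len) + 0.5) / win_len)

# With 50% overlap the squared sine windows sum to one, so the same window can be
# used for analysis and synthesis with perfect reconstruction.
overlap_sum = window[:hop] ** 2 + window[hop:] ** 2

n_bins = win_len // 2 + 1                          # number of frequency bins f
n_frames = 1 + (fs * duration - win_len) // hop    # number of frames n, no padding
```

At 16 kHz a 1024-sample window spans 64 ms, which is shorter than most of the reverberation times tested; this is the regime where the narrowband approximation is expected to fail.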
Experimental result
Full-rank model outperforms both the rank-1 model and baseline approaches in realistic reverberant environments
Conclusion & future work
Contributions
- Proposed to model the convolutive mixing process by full-rank unconstrained
spatial covariance matrices
- Designed model parameter estimation algorithms for the full-rank model by adapting existing estimation methods for the rank-1 model
- Showed that the proposed algorithm using the full-rank unconstrained spatial covariance model outperforms state-of-the-art approaches in realistic reverberant environments.
Current result
(in collaboration with S. Arberet and A. Ozerov)
Combined the proposed full-rank unconstrained covariance model with an NMF model for the source spectra (to appear in ISSPA, May 2010).
Future work
Consider the full-rank unconstrained model in the context of source localization.
Thanks for your attention!
& Your comments…?