Using Speech Models for Separation 1. 2.

advertisement
Using Speech Models
for Separation
Dan Ellis
Comprising the work of Michael Mandel and Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Eigenvoice Speaker Models
2. Spatial Parameter Models in Reverb
3. Combining Source + Spatial
Speech Models for Separation - Dan Ellis
2010-04-20 - 1 /19
1. Speech Separation Using Models
• Cooke & Lee’s Monaural Speech Separation Task
pairs of short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white by R 8 again"
task: report letter + number for “white”
• Separation depends on source constraints
the more the better - ASR model
separation
mix
t-f masking
+ resynthesis
identify
target energy
ASR
speech
models
words
mix
vs.
identify
find best
words model
words
speech
models
source
knowledge
Speech Models for Separation - Dan Ellis
2010-04-20 - 2 /19
Speech Mixture Recognition
Kristjansson, Hershey et al. ’06
• Speech recognizers contain speech models
ASR is just argmax P(W | X)
• Recognize mixtures with Factorial HMM
one model+state sequence for each voice
exploit sequence constraints, speaker differences
model 2
model 1
observations / time
separation relies on detailed speaker model
Speech Models for Separation - Dan Ellis
2010-04-20 - 3 /19
Eigenvoices
• Idea: Find
Kuhn et al. ’98, ’00
Weiss & Ellis ’07, ’08, ’09
speaker model parameter space
generalize without
losing detail?
Speaker models
Speaker subspace bases
• Eigenvoice model:
µ = µ̄ + U
adapted
model
mean
voice
Speech Models for Separation - Dan Ellis
eigenvoice
bases
w + B
weights
h
channel channel
bases weights
2010-04-20 - 4 /19
Eigenvoice Bases
additional
components for
acoustic channel
10
20
6
30
4
40
2
b d g p t k jh ch s z f th v dh m n l
8
Frequency (kHz)
•
Eigencomponents
shift formants/
coloration
Mean Voice
50
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 1
8
6
6
4
4
2
2
0
b d g p t k jh ch s z f th v dh m n l
8
Frequency (kHz)
280 states x 320 bins
= 89,600 dimensions
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 2
8
6
6
4
4
2
2
0
b d g p t k jh ch s z f th v dh m n l
8
Frequency (kHz)
• Mean model
Frequency (kHz)
8
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 3
8
6
6
4
4
2
2
0
b d g p t k jh ch s z f th v dh m n l
Speech Models for Separation - Dan Ellis
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
2010-04-20 - 5 /19
Speaker-Adapted Separation
Weiss & Ellis ’08
• Factorial HMM analysis
with tuning of source model parameters
= eigenvoice speaker adaptation
Speech Models for Separation - Dan Ellis
2010-04-20 - 6 /19
Speaker-Adapted Separation
Speech Models for Separation - Dan Ellis
2010-04-20 - 7 /19
Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
speaker adapted (SA) performs midway between
speaker-dependent (SD) & speaker-indep (SI)
SI
SA
SD
Speech Models for Separation - Dan Ellis
2010-04-20 - 8 /19
2. Spatial Models & Reverb
Mandel & Ellis ’07
• 2 or 3 sources
in reverberation
assume just 2 ‘ears’
• Model interaural spectrum of each source
as stationary level and time differences:
Speech Models for Separation - Dan Ellis
2010-04-20 - 9 /19
IPD, ILD Distributions
• Source at 75° in reverberation
IPD
IPD residual
ILD
IPD residual offsets phase by constant ωτ
IPD can be fit by single Gaussian
ILD needs frequency-dependence
Speech Models for Separation - Dan Ellis
2010-04-20 - 10/19
Model-Based EM Source Separation
and Localization (MESSL)
Re-estimate
source parameters
Mandel & Ellis ’09
Assign spectrogram points
to sources
can model more sources than sensors
flexible initialization
Speech Models for Separation - Dan Ellis
2010-04-20 - 11/19
MESSL Results
• Modeling uncertainty improves results
tradeoff between constraints & noisiness
2.45 dB
0.22 dB
12.35 dB
8.77 dB
9.12 dB
Speech Models for Separation - Dan Ellis
-2.72 dB
2010-04-20 - 12/19
MESSL Results
• Speech recognizer (Digits)
Percent correct
100
Ground truth masks
100
80
80
60
60
40
0
−40
40
Human
DP−Oracle
Oracle
OracleAllRev
Mixes
20
−20
0
20
Target−to−masker ratio (dB)
Speech Models for Separation - Dan Ellis
Algorithmic masks
20
40
0
−40
Human
Sawada
Mouba
MESSL−G
MESSL−ΩΩ
DUET
Mixes
−20
0
20
Target−to−masker ratio (dB)
40
2010-04-20 - 13/19
3. Combining Spatial + Speech Models
Weiss, Mandel & Ellis ’08
• Interaural parameters give
•
•
ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω))
Speech source model can give
Pr(Si(t, ω) is speech signal)
Can combine into one big EM framework...
E-step
M-step
Speech Models for Separation - Dan Ellis
u is: Pr(cell from source i)
phoneme sequence
Θ is: interaural params
speaker params
2010-04-20 - 14/19
Parameter estimation
and source
separation
MESSL-SP
(Source
Prior)
Ron Weiss
Underdetermined Source Separation Using Speaker Subspace Models
Speech Models for Separation - Dan Ellis
May 4, 2009
27 / 34
2010-04-20 - 15/19
MESSL-SP Results
• Source models function as priors
• Interaural parameter spatial separation
Frequency (kHz)
Frequency (kHz)
source model prior improves spatial estimate
Ground truth (12.04 dB)
8
DUET (3.84 dB)
8
8
6
6
6
4
4
4
2
2
2
0
0
0.5
1
1.5
MESSL (5.66 dB)
8
0
0
0.5
1
1.5
MESSL SP (10.01 dB)
8
0
6
6
4
4
4
2
2
2
0
0.5
1
Time (sec)
1.5
Speech Models for Separation - Dan Ellis
0
0
0.5
1
Time (sec)
1.5
1
0.8
0
0
0.5
1
1.5
MESSL EV (10.37 dB)
8
6
0
2D FD BSS (5.41 dB)
0.6
0.4
0.2
0
0
0.5
1
Time (sec)
1.5
2010-04-20 - 16/19
Outline
Introduction
MESSL-SP Results
Speaker subspace model
•
Monaural speech separation
Binaural separation
Conclusions
Experiments
Performance as
distractor
angle
SNR–improvement
vs.function
source of
angle
separation
SNR improvement (dB)
2 sources (anechoic)
2 sources (reverb)
15
15
10
10
5
5
0
0
0
20
40
60
80
20
SNR improvement (dB)
3 sources (anechoic)
15
40
60
80
3 sources (reverb)
15
Ground truth
MESSL!EV
MESSL!SP
MESSL
2S!FD!BSS
10
10
5
5
0
0
0
20
40
60
80
Separation (degrees)
DUET
20
40
60
80
Separation (degrees)
SpeechRon
Models
- Dan Ellis
Weiss for Separation
Underdetermined
Source Separation Using Speaker Subspace Models
- 17/19
May 4, 2010-04-20
2009
29 / 34
Future Work
• Better parametric speaker models
limitations of eigenvoices
varying style
• Understanding reverb & ASR
early echoes
what spoils ASR?
• Models of other sources
eigeninstruments?
Speech Models for Separation - Dan Ellis
2010-04-20 - 18/19
Summary & Conclusions
• Source models provide the constraints to
make scene analysis possible
• Eigenvoices (model subspace) can be used
to provide detailed models that generalize
• Spatial parameters can identify more
sources than models in reverb (MESSL)
• Can combine source + spatial models
Speech Models for Separation - Dan Ellis
2010-04-20 - 19/19
Download