Using Source Models in Speech Separation 1. 2.

advertisement
Using Source Models
in Speech Separation
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
1.
2.
3.
4.
http://labrosa.ee.columbia.edu/
Mixtures, Separation, and Models
Monaural Speech Separation
Binaural Speech Separation
Conclusions
Source Models in Separation - Dan Ellis
2007-11-09 - 1 /17
LabROSA Overview
Information
Extraction
Music
Recognition
Environment
Separation
Machine
Learning
Retrieval
Signal
Processing
Speech
Source Models in Separation - Dan Ellis
2007-11-09 - 2 /17
1. Mixtures, Separation, and Models
• Sounds rarely occur in isolation
.. so analyzing mixtures is a problem
.. for humans and machines
freq / kHz
chan 3
freq / kHz
chan 1
4
freq / kHz
chan 0
mr-2000-11-02-14:57:43
4
4
2
20
0
-20
-40
-60
level / dB
2
2
chan 9
freq / kHz
mr-20001102-1440-cE+1743.wav
4
2
0
0
1
2
3
4
5
Source Models in Separation - Dan Ellis
6
7
8
9
time / sec
2007-11-09 - 3 /17
Mixture Organization Scenarios
• Interactive voice systems
human-level understanding is expected
• Speech prostheses
crowds: #1 complaint of hearing aid users
• Archive analysis
freq / kHz
identifying and isolating sound events
dpwe-2004-09-10-13:15:40
8
20
6
0
4
-20
2
0
-40
0
1
2
3
4
5
6
7
8
9
level / dB
time / sec
• Unmixing/remixing/enhancement...
Source Models in Separation - Dan Ellis
2007-11-09 - 4 /17
Separation vs. Inference
• Ideal separation is rarely possible
many situations where overlaps cannot be removed
• Overlaps → Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
P({Si}| X)
P(X |{Si}) P({Si})
combination physics source models
search over {Si} ??
source models → better inference
• Better
.. learn from examples?
Source Models in Separation - Dan Ellis
2007-11-09 - 5 /17
Approaches to Separation
ICA
CASA
•Multi-channel
•Single-channel
•Fixed filtering
•Time-var. filter
•Perfect separation •Approximate
– maybe!
separation
target x
interference n
-
mix
proc
x^
+
mix x+n
^
n
s
stft
proc
mask
istft
x^
Model-based
•Any domain
•Param. search
•Synthetic
output
mix x+n
params
x^
param. fit
synthesis
source
models
or combinations ...
Source Models in Separation - Dan Ellis
2007-11-09 - 6 /17
EM for Model-based Separation
• Expectation-Maximization algorithm
– for solving partially-unknown problems
(only local optimality guaranteed)
• EM for model-based separation
E-step: find distribution of unknowns p(u)
E-step
given current
model parameters Θ
and observations x
M-step
M-step: optimize Θ
to maximize fit to
u is... GMM mixture assignment
x given current p(u)
... T-F cell dominance
... current phone of voice i
...
Source Models in Separation - Dan Ellis
2007-11-09 - 7 /17
What is a Source Model?
• Source Model describes signal behavior
encapsulates constraints on form of signal
(any such constraint can be seen as a model...)
• A model has parameters
model + parameters
→ instance
Excitation
source g[n]
• What is not a source model?
Resonance
filter H(ej!)
n
detail not provided in instance
e.g. using phase from original mixture
constraints on interaction between sources
e.g. independence, clustering attributes
Source Models in Separation - Dan Ellis
2007-11-09 - 8 /17
n
!
2. Monaural Speech Separation
• Cooke & Lee’s Speech Separation Challenge
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white by R 8 again"
task: report letter + number for “white”
special session at Interspeech ’06
• Separation or Description?
separation
mix
t-f masking
+ resynthesis
identify
target energy
ASR
words
speech
models
mix
vs.
identify
find best
words model
speech
models
source
knowledge
Source Models in Separation - Dan Ellis
2007-11-09 - 9 /17
words
Codebook Models
Roweis ’01, ’03
Kristjannson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination
p(x|i1, i2) = N (x; ci1 + ci2, Σ) model
inference of
{i1(t), i2(t)} = argmaxi1,i2 p(x(t)|i1, i2) source
state
can include sequential constraints...
different domains for combining c and defining 
• E.g. stationary noise:
In speech-shaped noise
(mel magsnr = 2.41 dB)
freq / mel bin
Original speech
80
80
80
60
60
60
40
40
40
20
20
20
0
1
2 time / s
0
1
Source Models in Separation - Dan Ellis
2
VQ inferred states
(mel magsnr = 3.6 dB)
0
1
2
2007-11-09 - 10/17
Speech Recognition Models
Kristjansson, Hershey et al. ’06
• Decode with Factorial HMM
i.e. two state sequences,
one model for each voice
exploit sequence constraints,
speaker differences?
model 2
• IBM “superhuman” Iroquois system
model 1
observations / time
fewer errors than people for same speaker, level
exploit grammar constraints - higher-level dynamics
wer / %
Same Gender Speakers
100
50
0
6 dB
SDL Recognizer
3 dB
0 dB
No dynamics
Source Models in Separation - Dan Ellis
- 3 dB
Acoustic dyn.
- 6 dB
- 9 dB
Grammar dyn.
Human
2007-11-09 - 11/17
Speaker-Adapted (SA) Models
• Factorial HMM needs distinct speakers
)*+,-.*/0,1
Weiss & Ellis ’07
(%%
Mixture: t32_swil2a_m18_sbar9n
0
6
4
2
0
'%
!20
!40
Adaptation iteration 1
%
0
!20
!40
0
6
4
2
0
(%%
23341*35
Adaptation iteration 3
!20
'%
!40
Adaptation iteration 5
0
6
4
2
0
!20
!"#
$"#
%"#
!$"#
!!"#
89::-6,7",1
!40
!&"#
SD
-
(%%
SD model separation
0
6
4
2
0
%
!20
acc %
Frequency (kHz)
6
4
2
0
use “eigenvoice” speaker space
iterate
estimating
voice!!"#
& !&"#
!"#
$"#
%"#
!$"#
)*+,-6,7",1
separating speech
SI
performs midway between
speaker-independent (SI) and SA
speaker-dependent (SD)
'%
!40
0
0.5
1
Time (sec)
1.5
%;1*3/,
Source Models in Separation - Dan Ellis
)8
)2
)<
2007-11-09 - 12/17
#*=,/97,
3. Binaural Speech Separation
• 2 or 3 sources in reverberation
assume just 2 ‘ears’
• Tasks:
identify positions of sources (and number?)
recover source signals
Source Models in Separation - Dan Ellis
2007-11-09 - 13/17
Spatial Estimation in Reverb
Mandel & Ellis ’07
• Model interaural spectrum of each source
as stationary level and time differences:
converge via EM to a(), τ for each source
mask is Pr(X(t,ω) dominated by source i)
Left
Left
Right
Right
ILD
ILD
IPD
IPD
Mask
Mask 22
Mask
Mask 11
8.77dB
8.77dB
Source Models in Separation - Dan Ellis
2007-11-09 - 14/17
Spatial Estimation Results
• Modeling uncertainty improves results
tradeoff between constraints & noisiness
2.45 dB
0.22 dB
12.35 dB
8.77 dB
9.12 dB
Source Models in Separation - Dan Ellis
-2.72 dB
2007-11-09 - 15/17
Combining Spatial + Speech Model
• Interaural parameters give
•
•
ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω))
Speech source model can give
Pr(Si(t, ω) is speech signal)
Can combine into one big EM framework...
E-step
M-step
Source Models in Separation - Dan Ellis
u is: Pr(cell from source i)
phoneme sequence
Θ is: interaural params
speaker params
2007-11-09 - 16/17
Summary & Conclusions
• Inferring model parameters is very general
.. and very difficult, in general
• Speech models can separate single channels
.. better match to individual → better results
• Spatial cues can separate binaural signals
.. but account for uncertainty from e.g. reverb
• EM-type approach can integrate them both
Source Models in Separation - Dan Ellis
2007-11-09 - 17/17
Download