Using Speech Models for Separation

Dan Ellis
Comprising the work of Michael Mandel and Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Source Models and Scene Analysis
2. Eigenvoice Speaker Models
3. Spatial Parameter Models in Reverb
4. Combining Source + Spatial
LabROSA Overview
[Diagram: lab research areas at the intersection of Machine Learning and Signal Processing, spanning Speech, Music, and Environment audio, with applications in Recognition, Separation, Retrieval, and Information Extraction]
1. Source Models and Scene Analysis
[Spectrogram: 12 s mixture “02_m+s-15-evil-goodvoice-fade”, 0–4000 Hz, level 0 to −60 dB, and its analysis into components: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sounds rarely occur in isolation
  – so analyzing mixtures (“scenes”) is a problem
  – for humans and machines
Speech Models for Separation - Dan Ellis
2009-10-13 - 3 /28
Approaches to Separation

ICA
• Multi-channel
• Fixed filtering
• Perfect separation – maybe!
[Diagram: target x and interference n are mixed; fixed unmixing recovers estimates x̂ and n̂]

CASA
• Single-channel
• Time-varying filter
• Approximate separation
[Diagram: mix x+n → STFT → processing derives a mask → ISTFT → x̂]

Model-based
• Any domain
• Parameter search
• Synthetic output?
[Diagram: mix x+n → parameter fit against source models → params → synthesis → x̂]
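To make the CASA-style pipeline concrete (STFT, time-varying mask, inverse STFT), here is a minimal Python sketch; mask_separate and crude_mask are illustrative names, and the median-threshold mask is only a stand-in assumption for whatever mask a real model would supply:

    import numpy as np
    from scipy.signal import stft, istft

    def mask_separate(mix, sr, mask_fn, nperseg=1024):
        """Single-channel separation by time-frequency masking:
        STFT -> per-cell mask in [0, 1] -> inverse STFT."""
        _, _, X = stft(mix, fs=sr, nperseg=nperseg)
        mask = mask_fn(np.abs(X))              # one weight per T-F cell
        _, x_hat = istft(mask * X, fs=sr, nperseg=nperseg)
        return x_hat

    # Hypothetical stand-in mask: keep T-F cells above the per-frame median.
    crude_mask = lambda M: (M > np.median(M, axis=0, keepdims=True)).astype(float)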
Separation vs. Inference
• Ideal separation is rarely possible
many situations where overlaps cannot be removed
• Overlaps → Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
  P({Si} | X) ∝ P(X | {Si}) · ∏i P(Si | Mi)

where P(X | {Si}) is the combination physics and the P(Si | Mi) are the source models
– but this implies a search over all source signal sets {Si} ??
• Better source models → better inference
2. Speech Separation Using Models
• Cooke & Lee’s Speech Separation Challenge
pairs of short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white by R 8 again"
task: report letter + number for “white”
(special session at Interspeech ’06)
• Separation or Description?

  Separation: mix → identify target energy → t-f masking + resynthesis
              → ASR (speech models) → words
  vs.
  Description: mix → identify / find best words model
               (speech models + source knowledge) → words
Codebook Models
Roweis ’01, ’03
Kristjannson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)               ← combination model
  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)   ← inference of source state
can include sequential constraints...
• E.g. stationary noise:
[Figure: mel spectrograms of the original speech, the same speech in speech-shaped noise (mel magSNR = 2.41 dB), and the reconstruction from VQ inferred states (mel magSNR = 3.6 dB); 0–80 mel bins over 2 s]
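A minimal sketch of this per-frame inference, assuming a single shared codebook for both sources and an isotropic Gaussian combination model (simplifications; the cited systems use per-source models, richer covariances, and the sequential constraints noted above):

    import numpy as np

    def infer_states(X, codebook):
        """For each frame x(t) of X (T, F), find the codeword pair (i1, i2)
        maximizing N(x; c_i1 + c_i2, sigma^2 I); with isotropic covariance
        this is the minimum squared error to the summed codewords."""
        K, F = codebook.shape
        combos = codebook[:, None, :] + codebook[None, :, :]   # (K, K, F)
        i1 = np.empty(len(X), dtype=int)
        i2 = np.empty(len(X), dtype=int)
        for t, x in enumerate(X):
            sq = ((combos - x) ** 2).sum(axis=-1)              # (K, K)
            i1[t], i2[t] = np.unravel_index(np.argmin(sq), (K, K))
        return i1, i2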
Speech Recognition Models
Varga & Moore ’90
• Speech recognizers contain speech models
ASR is just argmax_W P(W | X)
• Recognize mixtures with Factorial HMM
i.e. two state sequences, one model for each voice
exploit sequence constraints, speaker differences
[Diagram: factorial HMM – two independent state sequences (model 1, model 2) jointly explaining the observations over time]
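A sketch of joint decoding over the product state space of a two-source factorial HMM; note the product space has K1·K2 states, so this brute-force form only scales to small models (real systems prune or approximate):

    import numpy as np

    def factorial_viterbi(loglik, logA1, logA2):
        """loglik: (T, K1, K2) log p(x_t | s1=i, s2=j); logA1/logA2:
        per-source log transition matrices. Returns the jointly most
        likely state paths (s1, s2)."""
        T, K1, K2 = loglik.shape
        # Transition over product states: (i, j) -> (i', j')
        logA = (logA1[:, None, :, None]
                + logA2[None, :, None, :]).reshape(K1 * K2, K1 * K2)
        delta = loglik[0].ravel()
        psi = np.zeros((T, K1 * K2), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + logA      # (from-state, to-state)
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + loglik[t].ravel()
        path = np.empty(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 1, 0, -1):           # backtrace
            path[t - 1] = psi[t, path[t]]
        return np.unravel_index(path, (K1, K2))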
Speech Factorial Separation
Kristjansson, Hershey et al. ’06

• IBM’s 2006 “Iroquois” speech separation system
  Key features:
  – detailed state combinations
  – large speech recognizer
  – exploits grammar constraints
  – 34 per-speaker models
• “Superhuman” performance ... in some conditions

[Slide reproduces the first page of the cited paper. Its abstract: a framework for speech enhancement and robust ASR that exploits the harmonic structure of speech via a high-frequency-resolution speech model in the log-spectrum domain, reconstructing the signal from the estimated posteriors of the clean signal and the phases of the noisy signal; it reports an 8.38 dB SNR gain for enhancement at 0 dB, and on Aurora 2 at 0 dB SNR a 43.75% relative word error rate reduction over the baseline (15.90% over the equivalent low-resolution algorithm). Its Fig. 1 compares a noisy vector, the corresponding clean vector, and the clean-speech estimate with its uncertainty, which grows in the low-SNR valleys between harmonic peaks.]

[Bar chart: word error rate (%) on same-gender speaker mixtures at SNRs from 6 dB down to −9 dB, comparing the SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics against human listeners]
Adapting Source Models
• Power of model-based separation depends on detail of model
• Speech separation relies on prior knowledge of every speaker?
[Plots: separation SNR (dB) vs. number of training speakers (32–250), for target speakers in and not in the training set and for different-gender mixtures, comparing Oracle, speaker-dependent (SD), speaker-adapted (SA), and SA (SD init) models]
• Can this be practical?
Eigenvoices
Kuhn et al. ’98, ’00
Weiss & Ellis ’07, ’08, ’09

• Idea: find a model parameter space that can generalize without losing detail
  [Diagram: individual speaker models and the speaker subspace bases spanning them]

• Eigenvoice model:
    µ = µ̄ + U w + B h
  where µ is the adapted model, µ̄ the mean voice, U the eigenvoice bases with
  weights w, and B the channel bases with weights h
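One plausible way to obtain the mean voice µ̄ and bases U is PCA over the concatenated parameters of the training speakers’ models; a sketch under that assumption (the cited work derives the bases from speaker-dependent model means in essentially this spirit):

    import numpy as np

    def eigenvoice_bases(speaker_means, n_bases):
        """speaker_means: one long parameter vector per training speaker,
        e.g. all state means concatenated. Returns the mean voice and
        the top n_bases directions of speaker variation."""
        M = np.stack(speaker_means)                    # (n_speakers, D)
        mean_voice = M.mean(axis=0)
        _, _, Vt = np.linalg.svd(M - mean_voice, full_matrices=False)
        return mean_voice, Vt[:n_bases].T              # (D,), (D, n_bases)

    def adapt(mean_voice, U, w):
        """Adapted model mu = mean_voice + U @ w (channel term B h omitted)."""
        return mean_voice + U @ w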
Eigenvoice Bases

• Mean model: 280 states × 320 bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
• Additional components for channel

[Figure: four panels – Mean Voice and Eigenvoice dimensions 1–3 – each plotting 0–8 kHz spectra per model state for the phones b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]
Speaker-Adapted Separation
Weiss & Ellis ’08
• Factorial HMM analysis with tuning of source model parameters
  = eigenvoice speaker adaptation
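The adaptation step can be viewed as refitting the eigenvoice weights w to the frames currently attributed to a source by the factorial decode; a least-squares sketch of that single step (illustrative names; the actual system optimizes likelihood rather than a plain least-squares fit):

    import numpy as np

    def update_weights(frames, state_ids, mean_voice_states, U_states):
        """frames: (T, F) spectra attributed to this source; state_ids: (T,)
        decoded state per frame; mean_voice_states: (S, F) mean-voice
        spectrum per state; U_states: (S, F, R) eigenvoice bases per state.
        Solves min_w || (mu_states + U_states w) - frames ||^2."""
        A = U_states[state_ids].reshape(-1, U_states.shape[-1])  # (T*F, R)
        b = (frames - mean_voice_states[state_ids]).ravel()      # (T*F,)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        return w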
Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
speaker adapted (SA) performs midway between
speaker-dependent (SD) & speaker-independent (SI)
[Bar chart: separation performance for SI, SA, and SD models]
3. Spatial Models & Reverb
Mandel & Ellis ’07
• 2 or 3 sources in reverberation, assuming just 2 ‘ears’
• Model the interaural spectrum of each source
  as stationary level and time differences:
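The equation on this slide did not survive extraction; as a reference point, the interaural spectrogram model in the cited Mandel & Ellis papers takes (on my reading, so treat this as a reconstruction) the form

    X_L(ω, t) / X_R(ω, t)  ≈  10^(α(ω)/20) · e^(−jωτ)

with α(ω) the frequency-dependent interaural level difference (ILD, in dB) and τ the interaural time difference.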
ILD and IPD
• Sources at 0° and 75° in reverb
[Figure: left- and right-channel spectrograms with IPD and ILD panels for Source 1, Source 2, and their Mixture]
IPD, ILD Distributions
• Source at 75° in reverberation
[Histograms vs. frequency: IPD, IPD residual, and ILD]
– IPD residual: offset the phase by ωτ for a constant delay τ
– the IPD residual can then be fit by a single Gaussian
– ILD needs frequency-dependence
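Computing the IPD residual is short enough to spell out; a sketch, where XL and XR are the two ears’ complex spectrograms of shape (F, T), freqs holds the bin frequencies in Hz, and tau is the candidate delay in seconds:

    import numpy as np

    def ipd_residual(XL, XR, tau, freqs):
        """IPD with the omega*tau phase ramp of a constant delay removed,
        re-wrapped to (-pi, pi]."""
        ipd = np.angle(XL * np.conj(XR))                  # raw IPD per T-F cell
        resid = ipd - 2 * np.pi * freqs[:, None] * tau    # subtract expected ramp
        return np.angle(np.exp(1j * resid))               # wrap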
Model-Based EM Source Separation and Localization (MESSL)
Mandel & Ellis ’09

[Diagram: EM loop – assign spectrogram points to sources (E-step); re-estimate source parameters (M-step)]

• can model more sources than sensors
• flexible initialization
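A much-simplified EM in the spirit of MESSL, not the full published model (no ILD/IPD coupling, no garbage source, one fixed candidate delay per source): each source gets a Gaussian over its IPD residual and a per-frequency Gaussian over ILD; the E-step soft-assigns T-F points, the M-step refits the parameters:

    import numpy as np
    from scipy.stats import norm

    def messl_like_em(ipd_res, ild, n_iter=20):
        """ipd_res: (S, F, T) IPD residual per candidate source delay;
        ild: (F, T) interaural level difference in dB.
        Returns soft masks of shape (S, F, T)."""
        S, F, T = ipd_res.shape
        mu_p, sd_p = np.zeros(S), np.ones(S)                   # IPD params
        mu_l, sd_l = np.zeros((S, F, 1)), np.ones((S, F, 1))   # ILD params
        for _ in range(n_iter):
            # E-step: posterior of each source at each T-F cell
            logp = (norm.logpdf(ipd_res, mu_p[:, None, None], sd_p[:, None, None])
                    + norm.logpdf(ild[None], mu_l, sd_l))
            logp -= logp.max(axis=0, keepdims=True)
            mask = np.exp(logp)
            mask /= mask.sum(axis=0, keepdims=True)
            # M-step: weighted refit of the per-source Gaussians
            w = mask.sum(axis=(1, 2)) + 1e-9
            mu_p = (mask * ipd_res).sum(axis=(1, 2)) / w
            sd_p = np.sqrt((mask * (ipd_res - mu_p[:, None, None]) ** 2)
                           .sum(axis=(1, 2)) / w) + 1e-3
            wf = mask.sum(axis=2, keepdims=True) + 1e-9
            mu_l = (mask * ild[None]).sum(axis=2, keepdims=True) / wf
            sd_l = np.sqrt((mask * (ild[None] - mu_l) ** 2)
                           .sum(axis=2, keepdims=True) / wf) + 1e-3
        return mask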
MESSL Results
• Modeling uncertainty improves results
tradeoff between constraints & noisiness
[Example masks and resyntheses with SDRs of 2.45 dB, 0.22 dB, 12.35 dB, 8.77 dB, 9.12 dB, and −2.72 dB]
MESSL Results
• Signal-to-Distortion Ratio (SDR)
[Plots: SDR (dB) vs. source angle (0–100°) for anechoic 2-talker, reverberant 2-talker, and reverberant 3-talker mixtures, comparing DP-Wiener, Wiener, MESSL-G, MESSL-ΩΩ, TRINICON, Mouba, Sawada, DUET, and Random]
MESSL Results
• Speech recognizer (Digits)
[Plots: percent correct vs. target-to-masker ratio (−40 to +40 dB); ground-truth masks (Human, DP-Oracle, Oracle, OracleAllRev, Mixes) and algorithmic masks (Human, Sawada, Mouba, MESSL-G, MESSL-ΩΩ, DUET, Mixes)]
4. Combining Spatial + Speech Models
Weiss, Mandel & Ellis ’08
• Interaural parameters give
  ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω))
• Speech source model can give
  Pr(Si(t, ω) is speech signal)
• Can combine into one big EM framework...
  E-step: u is Pr(cell from source i) and the phoneme sequence
  M-step: Θ is the interaural params and the speaker params
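Folding the source model in amounts to adding its log-likelihood term to the spatial one before normalizing the E-step posteriors; a sketch (the weighting knob alpha is my assumption, not from the slides):

    import numpy as np

    def combined_estep(spatial_logp, source_model_logp, alpha=1.0):
        """spatial_logp, source_model_logp: (S, F, T) per-source log-likelihood
        terms from the interaural model and the speech source model.
        Returns normalized soft masks."""
        logp = spatial_logp + alpha * source_model_logp
        logp -= logp.max(axis=0, keepdims=True)
        mask = np.exp(logp)
        return mask / mask.sum(axis=0, keepdims=True)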
MESSL-SP (Source Prior): parameter estimation and source separation

[Diagram (from Ron Weiss, “Underdetermined Source Separation Using Speaker Subspace Models”)]
MESSL-SP Results
• Source models function as priors
  on the interaural-parameter spatial separation
  – the source model prior improves the spatial estimate

[Soft masks for one mixture, with SDRs: Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL-SP (10.01 dB), MESSL-EV (10.37 dB); panels span 0–8 kHz over 1.5 s]
MESSL-SP Results

• Performance as a function of distractor angle:
  SNR improvement vs. source angle separation

[Plots: SNR improvement (dB) vs. separation (0–80°) for 2 and 3 sources, anechoic and reverberant, comparing Ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET (from Ron Weiss, “Underdetermined Source Separation Using Speaker Subspace Models”)]
Future Work
• Better parametric speaker models
  – limitations of eigenvoices
  – varying style
• Understanding reverb & ASR
  – early echoes
  – what spoils ASR?
• Models of other sources
  – eigeninstruments?
Summary & Conclusions
• Source models provide the constraints that make scene analysis possible
• Eigenvoices (a model subspace) can provide detailed models that generalize
• Spatial parameters can identify more sources than sensors in reverb (MESSL)
• Source + spatial models can be combined