Model-Based Separation in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Audio Source Separation
2. Human Performance
3. Model-Based Separation
1. Audio Source Separation
• Sounds rarely occur in isolation
  .. but organizing mixtures is a problem
  .. for humans and machines
[Figure: multi-channel spectrograms (chan 0, 1, 3, 9) of meeting recording mr-2000-11-02-14:57:43; freq/kHz vs. time/sec, level/dB. Audio: mr-20001102-1440-cE+1743.wav]
Audio Separation Scenarios
• Interactive voice systems
  human-level understanding is expected
• Speech prostheses
  crowds: #1 complaint of hearing aid users
• Multimedia archive analysis
  identifying and isolating speech, other events
• Surveillance...
[Figure: spectrogram of personal-audio recording dpwe-2004-09-10-13:15:40; freq/kHz vs. time/sec, level/dB. Audio: pa-2004-09-10-131540.wav]
How Can We Separate?
• By between-sensor differences (spatial cues)
  'steer a null' onto a compact interfering source
  (a two-sensor sketch follows below)
• By finding a 'separable representation'
  spectral? but speech is broadband
  periodicity? maybe – for voiced speech
  something more signal-specific...
• By inference (based on knowledge/models)
  speech is redundant
  → use part to guess the remainder
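The 'steer a null' idea in a minimal two-sensor sketch (an illustration assuming an idealized integer-sample delay geometry, not a general beamformer; all names here are hypothetical):

    import numpy as np

    def steer_null(x1, x2, d):
        # Cancel an interferer that reaches mic 2 exactly d samples after
        # mic 1 (idealized integer delay): y(t) = x2(t) - x1(t-d) aligns
        # the interferer across channels so it subtracts out; a target
        # arriving at a different delay survives, comb-filtered.
        if d == 0:
            return x2 - x1
        y = np.empty_like(x1)
        y[:d] = x2[:d]               # warm-up: no aligned samples yet
        y[d:] = x2[d:] - x1[:-d]
        return y

    # Toy check: target s identical at both mics, interferer n delayed by 5.
    rng = np.random.default_rng(0)
    n = rng.standard_normal(8000)                   # broadband interferer
    s = np.sin(2 * np.pi * 0.01 * np.arange(8000))  # target tone
    x1 = s + n
    x2 = s.copy(); x2[5:] += n[:-5]                 # n delayed at mic 2
    y = steer_null(x1, x2, 5)                       # = s(t) - s(t-5): null steered onto n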
Outline
1. Audio Source Separation
2. Human Performance
scene analysis
speech separation by location
speech separation by voice characteristics
3. Model-Based Separation
Auditory Scene Analysis [Bregman '90]
• Listeners organize sound mixtures
  into discrete perceived sources
  based on within-signal cues (audio + ...) [Darwin & Carlyon '95]
  common onset + continuity
  harmonicity
  spatial, modulation, ...
  learned "schema"
[Figure: spectrogram of the Reynolds-McAdams oboe; freq/kHz vs. time/sec, level/dB. Audio: reynolds-mcadams-dpwe.wav]
Speech Mixtures: Spatial Separation
• Task: Coordinate Response Measure [Brungart et al. '02]
  "Ready Baron go to green eight now"
  256 variants, 16 speakers
  correct = color and number for "Baron" (crm-11737+16515.wav)
• Accuracy as a function of spatial separation:
[Figure: CRM accuracy vs. spatial separation; A, B same speaker; note the range effect]
Separation by Vocal Differences [Brungart et al. '01]
• CRM task (same spatial location),
  varying the level and voice character
  energetic vs. informational masking
  level difference can help in speech segregation:
  listen for the quieter talker
[Figure: 'Monaural Informational Masking': CRM accuracy vs. level differences]
Varying the Number of Voices [Brungart et al. '01]
• Two voices OK;
  more than two voices harder:
  performance drops when a 3rd or 4th talker is added to the stimulus
  (same spatial origin)
  mix of N voices tends to speech-shaped noise...
[Figure: 'Number of Talkers': CRM accuracy for 2, 3, and 4 talkers]
Outline
1. Audio Source Separation
2. Human Performance
3. Model-Based Separation
Separation vs. Inference
The Speech Fragment Decoder
Separation Approaches
ICA:
• Multi-channel
• Fixed filtering
• Perfect separation – maybe!
CASA / Model-based:
• Single-channel
• Time-varying filtering
• Approximate separation
[Diagrams: target s + interference n → mix → processing → source estimate, for each approach]
• Very different approaches...
Separation vs. Inference [Ellis '96]
• Ideal separation is rarely possible
  i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
  scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  i.e. posteriors of sources {Si} given observations X:
  P({Si} | X) ∝ P(X | {Si}) · P({Si})
  (combination physics × source models; one concrete case is sketched below)
• Better source models → better inference
  .. learn from examples?
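To make the two labelled factors concrete, one common special case (an illustrative assumption, not the only model consistent with the slide) takes two sources that combine additively in the feature domain, with independent priors:

    % illustrative special case of the proportionality above
    \[
      P(S_1, S_2 \mid X) \propto
        \underbrace{P(X \mid S_1, S_2)}_{\text{combination physics}} \;
        \underbrace{P(S_1)\, P(S_2)}_{\text{source models}},
      \qquad
      P(X \mid S_1, S_2) = \mathcal{N}\!\big(X;\, S_1 + S_2,\, \Sigma_\epsilon\big).
    \]

Under this model, better-matched priors P(Si) directly sharpen the posterior over source configurations.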
Model-Based Separation
• Central idea: employ strong learned constraints
  to disambiguate possible sources [Varga & Moore '90, Roweis '03...]:
  {Si} = argmax{Si} P(X | {Si})
• e.g. MAXVQ [Roweis '03]:
  fit a speech-trained vector quantizer to the mixed spectrum,
  then separate via a time-frequency mask
  (a minimal sketch follows below)
• Results: denoising (from Roweis '03)
  training: 300 sec of isolated speech (TIMIT) to fit 512 codewords,
  100 sec of isolated noise (NOISEX) to fit 32 codewords;
  testing on new signals at 0 dB SNR
  (mpgr+noise.wav, mpgr+noise+tfmask.wav)
[Figures: frequency/time spectrograms of mixture, fitted codewords, T-F mask, and resynthesis]
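A minimal sketch in the spirit of this VQ approach (not Roweis' exact algorithm: the helper names are hypothetical, the exhaustive codeword-pair search stands in for his factorial inference, and the 'log-max' mixing approximation is assumed):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def fit_codebook(frames, k):
        # k-means codebook over log-magnitude spectral frames
        # (rows = frames, cols = frequency bins)
        codebook, _ = kmeans2(frames, k, minit='++')
        return codebook

    def tf_mask(mix_frames, speech_cb, noise_cb):
        # For each mixture frame, find the (speech, noise) codeword pair
        # whose elementwise max best matches it (log-max approximation),
        # then keep the cells where the speech codeword dominates.
        mask = np.zeros_like(mix_frames, dtype=bool)
        for t, frame in enumerate(mix_frames):
            errs = np.array([[np.sum((frame - np.maximum(s, n)) ** 2)
                              for n in noise_cb] for s in speech_cb])
            si, ni = np.unravel_index(np.argmin(errs), errs.shape)
            mask[t] = speech_cb[si] > noise_cb[ni]   # speech-dominated cells
        return mask

    # Usage, with the codebook sizes from the slide:
    # speech_cb = fit_codebook(speech_logspec, 512)   # from TIMIT
    # noise_cb  = fit_codebook(noise_logspec, 32)     # from NOISEX
    # mask = tf_mask(mix_logspec, speech_cb, noise_cb)
    # ...then apply the mask to the mixture STFT and invert to resynthesize.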
Separation or Description?
• Are isolated waveforms required?
  clearly sufficient, but may not be necessary
  not part of perceptual source separation!
• Integrate separation with application? e.g. speech recognition:
  separation: mix → t-f masking + resynthesis → identify target energy → ASR → words
  (using speech models)
  vs.
  inference: mix → identify best words model → words
  (using speech models + source knowledge;
  output = abstract description of the signal)
  (a sketch of the missing-data scoring this relies on follows below)
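The recognition path needs to score spectra in which only some time-frequency cells carry target energy; that is the missing-data idea of Cooke et al. '01 (cited below). A minimal sketch for one diagonal-Gaussian state, showing the marginalization form only (the full method also bounds unreliable cells by the observed energy):

    import numpy as np

    def missing_data_loglik(y, reliable, mean, var):
        # Log-likelihood of frame y under a diagonal Gaussian, using only
        # the cells flagged reliable; unreliable dimensions are
        # marginalized out (each integrates to 1 under the Gaussian).
        r = np.asarray(reliable, dtype=bool)
        d = y[r] - mean[r]
        return -0.5 * np.sum(d * d / var[r] + np.log(2 * np.pi * var[r]))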
The Speech Fragment Decoder [Barker et al. '05]
• Standard classification chooses between models M to match source features X:
  M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• In mixtures: observed features Y, segregation S,
  all related by P(X | Y, S)
  match the 'uncorrupt' spectrum to ASR models using missing data
[Diagram: observation Y(f), source X(f), and segregation S over frequency]
• Joint classification of model M and segregation S:
  search for the pair maximizing
  P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
  (isolated source model × segregation model;
  P(X) is no longer constant; the derivation is sketched below)
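A hedged reconstruction of where this objective comes from (following Barker et al. '05): treat the clean features X as hidden and marginalize them, assuming M depends on (Y, S) only through X:

    \begin{align*}
    P(M, S \mid Y) &= \int P(M, X, S \mid Y)\, dX
                    = \int P(M \mid X)\, P(X \mid Y, S)\, dX \; P(S \mid Y) \\
                   &= \int \frac{P(X \mid M)\, P(M)}{P(X)}\, P(X \mid Y, S)\, dX \; P(S \mid Y)
                    = P(M) \int P(X \mid M)\, \frac{P(X \mid Y, S)}{P(X)}\, dX \cdot P(S \mid Y).
    \end{align*}

The integral matches ordinary isolated-source acoustic models to the segregated evidence, P(S | Y) is the segregation model, and P(X) no longer cancels, so it cannot be dropped from the maximization.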
Using CASA cues
• Using CASA features:
  P(S | Y) links acoustic information to segregation:
  is this segregation worth considering? how likely is it?
• Opening for CASA-style local features:
  frequency bands belong together
  periodicity/harmonicity...
  onset/continuity...
• CASA can help search:
  consider only segregations made from CASA chunks
  (a time-frequency region must be whole)
• CASA can rate segregation:
  construct P(S | Y) to reward CASA qualities:
  frequency proximity, common onset, harmonicity
  (a toy scoring sketch follows below)
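A toy stand-in for such a P(S | Y) score (illustrative only: the cue weights and functional forms are invented here, not the published segregation model):

    import numpy as np

    def segregation_score(chunks, labels):
        # Reward groupings whose member chunks start together (common
        # onset), fall near one harmonic series (harmonicity), and stay
        # compact in log-frequency (frequency proximity).
        # chunks: list of dicts with 'onset' (sec) and 'freq' (Hz);
        # labels[i] assigns chunk i to a source.
        logp = 0.0
        for source in set(labels):
            group = [c for c, l in zip(chunks, labels) if l == source]
            if len(group) < 2:
                continue
            onsets = np.array([c['onset'] for c in group])
            freqs = np.array([c['freq'] for c in group])
            logp -= np.var(onsets) / 0.01                     # common onset
            ratios = freqs / freqs.min()
            logp -= np.sum((ratios - np.round(ratios)) ** 2)  # harmonicity
            logp -= 0.1 * np.var(np.log(freqs))               # freq proximity
        return logp

    # Rank candidate segregations of the same chunks:
    # best = max(candidates, key=lambda lab: segregation_score(chunks, lab))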
Speech-Fragment Recognition
• CASA-based fragments give extra gain
over missing-data recognition
[Figure: recognition accuracy results from Barker et al. '05]
Evaluating Separation
• Real-world speech tasks
  crowded environments
  M. Cooke & T.-W. Lee "Speech Separation Challenge"
• Metric
  human intelligibility?
  'diarization' annotation (but not transcription)
[Figure: 'Personal Audio - Speech + Noise' spectrogram (freq/kHz vs. time/sec, level/dB) with 'Pitch Track + Speaker Active Ground Truth' (lags vs. time). Audio: ks-noisyspeech.wav]
Summary & Conclusions
• Listeners do well separating speech
using spatial location
using source-property variations
• Machines do less well
difficult to apply enough constraints
need to exploit signal detail
• Models capture constraints
learn from the real world
adapt to sources
• Inferring state (≈ recognition)
is a promising approach to separation
Sources / See Also
• NSF/AFOSR Montreal Workshops ’03, ’04
www.ebire.org/speechseparation/
labrosa.ee.columbia.edu/Montreal2004/
as well as the resulting book...
• Hanse meeting:
www.lifesci.sussex.ac.uk/home/Chris_Darwin/Hanse/
• DeLiang Wang’s ICASSP’04 tutorial
www.cse.ohio-state.edu/~dwang/presentation.html
• Martin Cooke’s NIPS’02 tutorial
www.dcs.shef.ac.uk/~martin/nips.ppt
References 1/2
[Barker et al. ’05] J. Barker, M. Cooke, D. Ellis, “Decoding speech in the presence of other sources,” Speech
Comm. 45, 5-25, 2005.
[Bell & Sejnowski ’95] A. Bell & T. Sejnowski, “An information maximization approach to blind separation and
blind deconvolution,” Neural Computation, 7:1129-1159, 1995.
[Blin et al. ’04] A. Blin, S. Araki, S. Makino, “A sparseness mixing matrix estimation (SMME) solving the
underdetermined BSS for convolutive mixtures,” ICASSP, IV-85-88, 2004.
[Bregman ’90] A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
[Brungart ’01] D. Brungart, “Informational and energetic masking effects in the perception of two simultaneous
talkers,” JASA 109(3), March 2001.
[Brungart et al. ’01] D. Brungart, B. Simpson, M. Ericson, K. Scott, “Informational and energetic masking effects in
the perception of multiple simultaneous talkers,” JASA 110(5), Nov. 2001.
[Brungart et al. ’02] D. Brungart & B. Simpson, “The effects of spatial separation in distance on the informational
and energetic masking of a nearby speech signal”, JASA 112(2), Aug. 2002.
[Brown & Cooke ’94] G. Brown & M. Cooke, “Computational auditory scene analysis,” Comp. Speech & Lang. 8
(4), 297–336, 1994.
[Cooke et al. ’01] M. Cooke, P. Green, L. Josifovski, A. Vizinho, “Robust automatic speech recognition with missing
and uncertain acoustic data,” Speech Communication 34, 267-285, 2001.
[Cooke ’06] M. Cooke, “A glimpsing model of speech perception in noise,” submitted to JASA.
[Darwin & Carlyon ’95] C. Darwin & R. Carlyon, “Auditory grouping” Handbk of Percep. & Cogn. 6: Hearing, 387–
424, Academic Press, 1995.
[Ellis ’96] D. Ellis, “Prediction-Driven Computational Auditory Scene Analysis,” Ph.D. thesis, MIT EECS, 1996.
[Hu & Wang ’04] G. Hu and D.L. Wang, “Monaural speech segregation based on pitch tracking and amplitude
modulation,” IEEE Tr. Neural Networks, 15(5), Sep. 2004.
[Okuno et al. ’99] H. Okuno, T. Nakatani, T. Kawabata, “Listening to two simultaneous speeches,” Speech
Communication 27, 299–310, 1999.
References 2/2
[Ozerov et al. ’05] A. Ozerov, P. Phillippe, R. Gribonval, F. Bimbot, “One microphone singing voice separation using
source-adapted models,” Worksh. on Apps. of Sig. Proc. to Audio & Acous., 2005.
[Pearlmutter & Zador ’04] B. Pearlmutter & A. Zador, “Monaural Source Separation using Spectral Cues,” Proc.
ICA, 2005.
[Parra & Spence ’00] L. Parra & C. Spence, “Convolutive blind source separation of non-stationary sources,” IEEE
Tr. Speech & Audio, 320-327, 2000.
[Reyes et al. ’03] M. Reyes-Gómez, B. Raj, D. Ellis, “Multi-channel source separation by beamforming trained with
factorial HMMs,” Worksh. on Apps. of Sig. Proc. to Audio & Acous., 13–16, 2003.
[Roman et al. ’02] N. Roman, D.-L. Wang, G. Brown, “Location-based sound segregation,” ICASSP, I-1013-1016,
2002.
[Roweis ’03] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” EuroSpeech, 2003.
[Schimmel & Atlas ’05] S. Schimmel & L. Atlas, “Coherent Envelope Detection for Modulation Filtering of Speech,”
ICASSP, I-221-224, 2005.
[Slaney & Lyon ’90] M. Slaney & R. Lyon, “A Perceptual Pitch Detector,” ICASSP, 357-360, 1990.
[Smaragdis ’98] P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Intl. Wkshp. on
Indep. & Artif. Neural Networks, Tenerife, Feb. 1998.
[Seltzer et al. ’02] M. Seltzer, B. Raj, R. Stern, “Speech recognizer-based microphone array processing for robust
hands-free speech recognition,” ICASSP, I–897–900, 2002.
[Varga & Moore ’90] A. Varga & R. Moore, “Hidden Markov Model decomposition of speech and noise,” ICASSP,
845–848, 1990.
[Vincent et al. ’06] E. Vincent, R. Gribonval, C. Févotte, “Performance measurement in Blind Audio Source
Separation,” IEEE Trans. Speech & Audio, in press.
[Yilmaz & Rickard ’04] O. Yilmaz & S. Rickard, “Blind separation of speech mixtures via time-frequency masking,”
IEEE Tr. Sig. Proc. 52(7), 1830-1847, 2004.