Using Learned Source Models to Organize Sound Mixtures

advertisement
Using Learned Source Models
to Organize Sound Mixtures
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
1.
2.
3.
4.
http://labrosa.ee.columbia.edu/
Source Models as Constraints
Examples of Model-Based Systems
Acquiring and Using Models
Biological Relevance?
Models to Organize Sound - Dan Ellis
2006-05-12 - 1 /12
The Problem of Scene Analysis
• How do we achieve ‘perceptual constancy’
freq / kHz
of sources in mixtures?
8
20
6
0
4
-20
2
-40
0
0.5
1
1.5
freq / kHz
0
2
level / dB 0
time / s
0.5
1
1.5
2
time / s
8
6
4
2
0
0
0.5
1
1.5
2
time / s
no obvious segmentation of objects
underconstrained: infinitely many decompositions
time-frequency overlaps cause obliteration
Models to Organize Sound - Dan Ellis
2006-05-12 - 2 /12
Scene Analysis as Inference
Ellis’96
• Ideal separation is rarely possible
i.e. no projection can completely remove overlaps
• Overlaps Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
P({Si}| X)
P(X |{Si}) P({Si})
combination physics source models
• Better source models → better inference
.. learn from examples?
Models to Organize Sound - Dan Ellis
2006-05-12 - 3 /12
An Example: Fingerprinting
• “Impossible” separation task (Avery Wang)
separation and restoration!
Models to Organize Sound - Dan Ellis
2006-05-12 - 4 /12
Fingerprinting: How it Works
•
!"#$%&'(")%'*+,'-.%&/
Library!"#$%&'(&)*+,#)-.
of songs (>1M) described
by hashes
! "#$%&'&()*%+,-.$
/ 01234.$*4+3.&
/ 01234.$*%&5&%,6*
%++'*7)42'38.
/ 01234.$*4+4(34&2%*
73.$+%$3+4
! 9&:%+7-83,(&
/ "5&%)$;341*)+-*
<24$
! =&47*$+*.-%535&*
$;%+-1;*5+38&*
8+7&8
! "#$%&'(()%'*+,-(.%
,/%#01*(2&#03%
(04*+'/%+5%5(246*(%
&'21(
! 7&(%1+.,#024#+0&%
+5%2%&.2--%06.,(*%
89:;<%+5%
1+0&4(--24#+0%'+#04&
! =21>%'+#04%#&%42?(0%
2&%20%@201>+*%
'+#04A
! =21>%201>+*%'+#04%
>2&%2%@42*3(4%B+0(A
• After ~10s, song/segment identified > 98%
• Key ideas:
known-item database of exact waveforms
tiny part of signal used (... the most robust part)
Models to Organize Sound - Dan Ellis
2006-05-12 - 5 /12
or green. The task is to recognize the letter and the digit of the
speaker that said white.
We decoded the two component signals under the assumption
that one signal contains white and the other does not, and vice
versa. We then used the association that yielded the highest combined likelihood.
Log-power spectrum features were computed at a 15 ms rate.
Each frame was of length 40 ms and a 640 point FFT was used
producing a 3195 dimensional log-power-spectrum feature vector.
The second plot in Figure 1 shows WER for the Same
condition. In this condition, recordings from two different
ers of the same gender are mixed together. In this condit
system surpasses human performance in all conditions e
dB and −9 dB.
The third plot in Figure 1 shows WER for the Different
condition. In this condition, our system surpasses human
mance in the 0 dB and 3 dB conditions. Interestingly, te
constraints do not improve performance relative to GMM
dynamics as dramatically as in the same talker case, whi
cates that the characteristics of the two speakers in a short s
a) Same Talker
are effective for separation.
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
The performance of our best system, which uses both
mar and acoustic-level dynamics, is
summarized in Table
t5_bwam5s_m5_bbilzp_6p1.wav
e.g. "bin white at M 5 soon"
system surpassed human lister performance at SNRs of 0
−3 dB on average acrossKristjansson
all speaker conditions.
et al.Averaging
all SNRs, the system surpassed human performance in th
6 dB
3 dB
0 dB
!3 dB
!6 dB
!9 dB
Gender condition. Based on
these initial results, we envis
Interspeech’06
b) Same Gender
super-human performance over all conditions is within rea
Example 2: Mixed Speech Recog.
• Cooke & Lee’s Speech Separation Challenge
short, grammatically-constrained utterances:
100
50
0
wer / %
100
• IBM’s “superhuman” recognizer:
o Model individual speakers
7. References
(512 mix GMM)
o Infer speakers and gain
o Reconstruct speech
o Recognize as normal...
50
0
6 dB
3 dB
0 dB
!3 dB
!6 dB
!9 dB
c) Different Gender
100
50
0
6 dB
3 dB
SDL Recognizer
0 dB
No dynamics
!3 dB
Acoustic dyn.
!6 dB
!9 dB
Grammar dyn.
Models to Organize Sound - Dan Ellis
Human
Figure 1: Word error rates for the a) Same Talker, b) Same Gender
and c) Different Gender cases.
5 the
DC component was discarded
[1] Martin Cooke and Tee-Won Lee,
“Interspeech
separation
challenge,”
http : //www.dcs.shef
∼ martin/SpeechSeparationChallenge.htm, 2006.
[2] T. Kristjansson, J. Hershey, and H. Attias, “Single microphon
separation using high resolution signal reconstruction,” ICASS
[3] S. Roweis, “Factorial models and refiltering for speech separa
denoising,” Eurospeech, pp. 1009–1012, 2003.
[4] P. Varga and R.K. Moore, “Hidden markov model decompo
speech and noise,” ICASSP, pp. 845–848, 1990.
[5] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estim
multivariate Gaussian mixture observations of Markov chains
Transactions on Speech and Audio Processing, vol. 2, no. 2,
2006-05-12 - 6 /12
298, 1994.
[6] Peder Olsen and Satya Dharanipragada, “An efficient integra
der detection scheme and time mediated averaging of gende
dent acoustic models,,” in Proceedings of Eurospeech 2003,
Switzerland, September 1-4 2003, vol. 4, pp. 2509–2512.
• Grammar constraints
a big help
Scene Analysis as Recognition
• We don’t want waveforms
limits to what listeners discriminate
.. especially over long term
• The outcome of perception is percepts
source identities (categories)
.. plus some salient parameters
• Scene analysis: recovering source + params
classification + parameter estimation
.. implies predefined set of classes = source models
Models to Organize Sound - Dan Ellis
2006-05-12 - 7 /12
What are the Models?
•
Models allow world knowledge (experience)
to help perception
Explicit Models (dictionaries)
can represent anything (“non-parametric”)
conceptually simple but inefficient in space/time
• Parametric Models (subspaces)
encapsulate broader constraints (e.g. harmonicity)
rely on actual regularity in the domain
may not be easy to apply (fit)
• Middle ground?
e.g. locally-learned manifolds
or dictionaries + parametric transformations
Models to Organize Sound - Dan Ellis
2006-05-12 - 8 /12
Learning, Representing, Applying
• Models encapsulate experience/environment
evolutionary scale (hardware)
vs. lifetime scale (conventional learning)
• Tradeoff between an efficient domain
and a flexible learner
auditory percepts already factor out e.g. channel
characteristics (phase, reflections, gain)
• Learned knowledge must be easy to apply
e.g. representations that are easier to recall/match
Models to Organize Sound - Dan Ellis
2006-05-12 - 9 /12
Dictionaries vs. CASA
• Source models can learn harmonicity, onset
freq / mel band
... to subsume rules/representations of CASA
VQ800 Codebook - Linear distortion measure
80
-10
-20
-30
-40
-50
60
40
20
0
100
200
300
400
500
600
700
800
level / dB
can capture spatial info too [Pearlmutter & Zador’04]
• Can also capture sequential structure
e.g. consonants follow vowels
... like people do?
• Maybe equivalent results in the end
.. i.e. algorithm, not computational theory
Models to Organize Sound - Dan Ellis
2006-05-12 - 10/12
Biological Relevance of Models
How do we explain illusions?
2,,,
%&$'($)*+
•
3,,,
1,,,
0,,,
pulsation threshold
,
,
,-.
/
/-.
!"#$
0
0-.
0
!"#$
0-.
4
sinewave speech
%&$'($)*+
.,,,
1,,,
4,,,
0,,,
/,,,
,
phonemic restoration
,-.
/
/-.
4-.
1,,,
%&$'($)*+
4,,,
0,,,
/,,,
,
• Something is providing the missing (illusory)
,
,-.
/
!"#$
pieces
Models to Organize Sound - Dan Ellis
2006-05-12 - 11/12
/-.
0
Summary
• Scene Analysis is possible only thanks to
constraints
most sound combinations are unlikely
• Listeners care about individual sources
.. in a wide range of combinations
• Statistical source models can be learned
from the environment
exactly how is more of a detail...
Models to Organize Sound - Dan Ellis
2006-05-12 - 12/12
Download