Using Learned Source Models to Organize Sound Mixtures Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. 2. 3. 4. http://labrosa.ee.columbia.edu/ Source Models as Constraints Examples of Model-Based Systems Acquiring and Using Models Biological Relevance? Models to Organize Sound - Dan Ellis 2006-05-12 - 1 /12 The Problem of Scene Analysis • How do we achieve ‘perceptual constancy’ freq / kHz of sources in mixtures? 8 20 6 0 4 -20 2 -40 0 0.5 1 1.5 freq / kHz 0 2 level / dB 0 time / s 0.5 1 1.5 2 time / s 8 6 4 2 0 0 0.5 1 1.5 2 time / s no obvious segmentation of objects underconstrained: infinitely many decompositions time-frequency overlaps cause obliteration Models to Organize Sound - Dan Ellis 2006-05-12 - 2 /12 Scene Analysis as Inference Ellis’96 • Ideal separation is rarely possible i.e. no projection can completely remove overlaps • Overlaps Ambiguity scene analysis = find “most reasonable” explanation • Ambiguity can be expressed probabilistically i.e. posteriors of sources {Si} given observations X: P({Si}| X) P(X |{Si}) P({Si}) combination physics source models • Better source models → better inference .. learn from examples? Models to Organize Sound - Dan Ellis 2006-05-12 - 3 /12 An Example: Fingerprinting • “Impossible” separation task (Avery Wang) separation and restoration! Models to Organize Sound - Dan Ellis 2006-05-12 - 4 /12 Fingerprinting: How it Works • !"#$%&'(")%'*+,'-.%&/ Library!"#$%&'(&)*+,#)-. of songs (>1M) described by hashes ! "#$%&'&()*%+,-.$ / 01234.$*4+3.& / 01234.$*%&5&%,6* %++'*7)42'38. / 01234.$*4+4(34&2%* 73.$+%$3+4 ! 9&:%+7-83,(& / "5&%)$;341*)+-* <24$ ! =&47*$+*.-%535&* $;%+-1;*5+38&* 8+7&8 ! "#$%&'(()%'*+,-(.% ,/%#01*(2&#03% (04*+'/%+5%5(246*(% &'21( ! 7&(%1+.,#024#+0&% +5%2%&.2--%06.,(*% 89:;<%+5% 1+0&4(--24#+0%'+#04& ! =21>%'+#04%#&%42?(0% 2&%20%@201>+*% '+#04A ! =21>%201>+*%'+#04% >2&%2%@42*3(4%B+0(A • After ~10s, song/segment identified > 98% • Key ideas: known-item database of exact waveforms tiny part of signal used (... the most robust part) Models to Organize Sound - Dan Ellis 2006-05-12 - 5 /12 or green. The task is to recognize the letter and the digit of the speaker that said white. We decoded the two component signals under the assumption that one signal contains white and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood. Log-power spectrum features were computed at a 15 ms rate. Each frame was of length 40 ms and a 640 point FFT was used producing a 3195 dimensional log-power-spectrum feature vector. The second plot in Figure 1 shows WER for the Same condition. In this condition, recordings from two different ers of the same gender are mixed together. In this condit system surpasses human performance in all conditions e dB and −9 dB. The third plot in Figure 1 shows WER for the Different condition. In this condition, our system surpasses human mance in the 0 dB and 3 dB conditions. Interestingly, te constraints do not improve performance relative to GMM dynamics as dramatically as in the same talker case, whi cates that the characteristics of the two speakers in a short s a) Same Talker are effective for separation. <command:4><color:4><preposition:4><letter:25><number:10><adverb:4> The performance of our best system, which uses both mar and acoustic-level dynamics, is summarized in Table t5_bwam5s_m5_bbilzp_6p1.wav e.g. "bin white at M 5 soon" system surpassed human lister performance at SNRs of 0 −3 dB on average acrossKristjansson all speaker conditions. et al.Averaging all SNRs, the system surpassed human performance in th 6 dB 3 dB 0 dB !3 dB !6 dB !9 dB Gender condition. Based on these initial results, we envis Interspeech’06 b) Same Gender super-human performance over all conditions is within rea Example 2: Mixed Speech Recog. • Cooke & Lee’s Speech Separation Challenge short, grammatically-constrained utterances: 100 50 0 wer / % 100 • IBM’s “superhuman” recognizer: o Model individual speakers 7. References (512 mix GMM) o Infer speakers and gain o Reconstruct speech o Recognize as normal... 50 0 6 dB 3 dB 0 dB !3 dB !6 dB !9 dB c) Different Gender 100 50 0 6 dB 3 dB SDL Recognizer 0 dB No dynamics !3 dB Acoustic dyn. !6 dB !9 dB Grammar dyn. Models to Organize Sound - Dan Ellis Human Figure 1: Word error rates for the a) Same Talker, b) Same Gender and c) Different Gender cases. 5 the DC component was discarded [1] Martin Cooke and Tee-Won Lee, “Interspeech separation challenge,” http : //www.dcs.shef ∼ martin/SpeechSeparationChallenge.htm, 2006. [2] T. Kristjansson, J. Hershey, and H. Attias, “Single microphon separation using high resolution signal reconstruction,” ICASS [3] S. Roweis, “Factorial models and refiltering for speech separa denoising,” Eurospeech, pp. 1009–1012, 2003. [4] P. Varga and R.K. Moore, “Hidden markov model decompo speech and noise,” ICASSP, pp. 845–848, 1990. [5] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estim multivariate Gaussian mixture observations of Markov chains Transactions on Speech and Audio Processing, vol. 2, no. 2, 2006-05-12 - 6 /12 298, 1994. [6] Peder Olsen and Satya Dharanipragada, “An efficient integra der detection scheme and time mediated averaging of gende dent acoustic models,,” in Proceedings of Eurospeech 2003, Switzerland, September 1-4 2003, vol. 4, pp. 2509–2512. • Grammar constraints a big help Scene Analysis as Recognition • We don’t want waveforms limits to what listeners discriminate .. especially over long term • The outcome of perception is percepts source identities (categories) .. plus some salient parameters • Scene analysis: recovering source + params classification + parameter estimation .. implies predefined set of classes = source models Models to Organize Sound - Dan Ellis 2006-05-12 - 7 /12 What are the Models? • Models allow world knowledge (experience) to help perception Explicit Models (dictionaries) can represent anything (“non-parametric”) conceptually simple but inefficient in space/time • Parametric Models (subspaces) encapsulate broader constraints (e.g. harmonicity) rely on actual regularity in the domain may not be easy to apply (fit) • Middle ground? e.g. locally-learned manifolds or dictionaries + parametric transformations Models to Organize Sound - Dan Ellis 2006-05-12 - 8 /12 Learning, Representing, Applying • Models encapsulate experience/environment evolutionary scale (hardware) vs. lifetime scale (conventional learning) • Tradeoff between an efficient domain and a flexible learner auditory percepts already factor out e.g. channel characteristics (phase, reflections, gain) • Learned knowledge must be easy to apply e.g. representations that are easier to recall/match Models to Organize Sound - Dan Ellis 2006-05-12 - 9 /12 Dictionaries vs. CASA • Source models can learn harmonicity, onset freq / mel band ... to subsume rules/representations of CASA VQ800 Codebook - Linear distortion measure 80 -10 -20 -30 -40 -50 60 40 20 0 100 200 300 400 500 600 700 800 level / dB can capture spatial info too [Pearlmutter & Zador’04] • Can also capture sequential structure e.g. consonants follow vowels ... like people do? • Maybe equivalent results in the end .. i.e. algorithm, not computational theory Models to Organize Sound - Dan Ellis 2006-05-12 - 10/12 Biological Relevance of Models How do we explain illusions? 2,,, %&$'($)*+ • 3,,, 1,,, 0,,, pulsation threshold , , ,-. / /-. !"#$ 0 0-. 0 !"#$ 0-. 4 sinewave speech %&$'($)*+ .,,, 1,,, 4,,, 0,,, /,,, , phonemic restoration ,-. / /-. 4-. 1,,, %&$'($)*+ 4,,, 0,,, /,,, , • Something is providing the missing (illusory) , ,-. / !"#$ pieces Models to Organize Sound - Dan Ellis 2006-05-12 - 11/12 /-. 0 Summary • Scene Analysis is possible only thanks to constraints most sound combinations are unlikely • Listeners care about individual sources .. in a wide range of combinations • Statistical source models can be learned from the environment exactly how is more of a detail... Models to Organize Sound - Dan Ellis 2006-05-12 - 12/12