Using Sound Source Models to Organize Mixtures
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Ambient Sounds

Sound Source Models - Dan Ellis 2007-05-24 - 1/30

The Problem of Mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• The received waveform is a mixture: 2 sensors, N sources - underconstrained
• Undoing mixtures: hearing's primary goal? ... by any means available

Sound Organization Scenarios
• Interactive voice systems: human-level understanding is expected
• Speech prostheses: crowds are the #1 complaint of hearing-aid users
• Archive analysis: identifying and isolating sound events
  [figure: spectrogram (freq/kHz vs. time/sec, level/dB) of recording dpwe-2004-09-10-13:15:40; audio: pa-2004-09-10-131540.wav]
• Unmixing / remixing / enhancement...

How Can We Separate?
• By between-sensor differences (spatial cues): "steer a null" onto a compact interfering source - the filtering / signal-processing paradigm
• By finding a "separable representation": spectral? sources are broadband but sparse; periodicity? maybe - for pitched sounds; something more signal-specific...
• By inference (based on knowledge/models): acoustic sources are redundant, so use part to guess the remainder - limited possible solutions

Separation vs. Inference
• Ideal separation is rarely possible, i.e.
no projection can completely remove overlaps
• Overlaps → ambiguity: scene analysis = finding the "most reasonable" explanation
• Ambiguity can be expressed probabilistically, i.e. as the posterior of the sources {Si} given the observations X:
  P({Si} | X) ∝ P(X | {Si}) · P({Si})
  where P(X | {Si}) is the combination physics and P({Si}) the source models
• Better source models → better inference ... learn from examples?

A Simple Example
• Source models are codebooks from separate subspaces

A Slightly Less Simple Example
• Sources with Markov transitions

What is a Source Model?
• A source model describes signal behavior: it encapsulates constraints on the form of the signal (any such constraint can be seen as a model...)
• A model has parameters: model + parameters → instance
  [figure: excitation source g[n] driving a resonance filter H(e^jω)]
• What is not a source model? detail not provided in an instance (e.g. using phase from the original mixture); constraints on the interaction between sources (e.g. independence, clustering attributes)

Outline
1. Mixtures and Models
2. Human Sound Organization: Auditory Scene Analysis; using source characteristics; illusions
3. Machine Sound Organization
4. Ambient Sounds

Auditory Scene Analysis (Bregman '90; Darwin & Carlyon '95)
• How do people analyze sound mixtures? break the mixture into small elements (in time-frequency); elements are grouped into sources using cues; sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...): cues include common onset/modulation, harmonicity, ...
  [diagram: sound → frequency analysis → elements → grouping mechanism (onset, harmonicity, and spatial maps) → sources → source properties]
• Also learned "schema" (for speech etc.)
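The "simple example" above - per-source codebooks and MAP inference over their joint states - can be sketched numerically. This is a toy illustration, not the talk's actual system: the codebook sizes, the log-domain max approximation for combining sources, and all variable names are illustrative assumptions.

```python
import numpy as np

# Each source is modeled by a small codebook of log-spectral templates.
rng = np.random.default_rng(0)
n_bins = 64
codebook_a = rng.normal(size=(8, n_bins))   # templates for source A
codebook_b = rng.normal(size=(8, n_bins))   # templates for source B

# Simulate a mixture frame: one codeword from each source, combined with the
# common log-domain approximation (elementwise max), plus a little noise.
true_i, true_j = 3, 5
mix = np.maximum(codebook_a[true_i], codebook_b[true_j]) \
      + 0.01 * rng.normal(size=n_bins)

# Brute-force MAP inference: search every codeword pair for the combination
# that best explains the observed spectrum.
best, best_err = None, np.inf
for i in range(len(codebook_a)):
    for j in range(len(codebook_b)):
        pred = np.maximum(codebook_a[i], codebook_b[j])
        err = np.sum((mix - pred) ** 2)
        if err < best_err:
            best, best_err = (i, j), err

print(best)  # recovers the generating pair when the codewords are distinct
```

The quadratic cost of this joint search is exactly the tractability issue raised later: with N-codeword models per source, inference is O(N²) per frame, which motivates factored dictionaries and faster search.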
(diagram after Darwin 1996)

Perceiving Sources
• Harmonics are distinct in the ear, but perceived as one source ("fused"): depends on common onset; depends on the harmonic relationship
• Experimental techniques: ask subjects "how many"; match attributes, e.g. pitch or vowel identity; brain recordings (EEG "mismatch negativity")

Auditory "Illusions"
  [figure: spectrograms (freq vs. time) of three illusions: the pulsation threshold, sinewave speech, and phonemic restoration]
• How do we explain illusions? Something is providing the missing (illusory) pieces ... source models

Human Speech Separation (Brungart et al. '02)
• Task: Coordinate Response Measure - "Ready Baron go to green eight now"; 256 variants, 16 speakers; correct = color and number for "Baron" (audio: crm-11737+16515.wav)
• Accuracy as a function of spatial separation
  [figure: accuracy vs. separation; A, B same speaker; note the range effect]

Separation by Vocal Differences (Brungart et al. '01)
• CRM varying the level and the voice character (same spatial location)
  [figure: level differences can help in speech segregation - "listen for the quieter talker"]
• energetic vs. informational masking
• more than pitch ... source models

Outline
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization: Computational Auditory Scene Analysis; dictionary source models
4. Ambient Sounds

Source Model Issues
• Domain: parsimonious expression of constraints; nice combination physics
• Tractability: size of the search space; tricks to speed search/inference
• Acquisition: hand-designed vs. learned; static vs.
short-term
• Factorization: independent aspects; hierarchy & specificity

Computational Auditory Scene Analysis (Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...)
• Central idea: segment time-frequency into sources based on perceptual grouping cues
  [diagram: input mixture → front end (signal feature maps: onset, period, freq. mod., ...) → object formation (discrete objects) → grouping rules → source groups]
• the principal cue is harmonicity

CASA Limitations (example from Hu & Wang '04; audio: huwang-v3n7.wav)
• Limitations of T-F masking: it cannot undo overlaps, and it leaves gaps
• Typically driven by local features: limited model scope → no inference or illusions
• Processing is hand-defined, not learned

Can Models Do CASA?
• Source models can learn harmonicity, onset, ... to subsume the rules/representations of CASA
  [figure: VQ800 codebook over mel bands, linear distortion measure]
  they can capture spatial info too [Pearlmutter & Zador '04]
• Can also capture sequential structure, e.g. consonants follow vowels ... like people do?
• But: need source-specific models ... for every possible source; use model adaptation? [Ozerov et al. 2005]

Separation or Description?
• Are isolated waveforms required? clearly sufficient, but perhaps not necessary; they are not part of perceptual source separation!
• Integrate separation with the application? e.g. speech recognition:
  separate: mix → t-f masking + resynthesis (identify target energy) → ASR (speech models) → words
  vs.
identify: mix → find the best words directly (speech models + source knowledge) → words
  here the output is an abstract description of the signal, not a waveform

Dictionary Models (Roweis '01, '03; Kristjansson '04, '06)
• Given models for the sources, find the "best" (most likely) states for the spectra:
  combination: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
  inference: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
  the state can include sequential constraints...; different domains for combining c and defining ...
• E.g. stationary noise:
  [figure: mel spectrograms (freq/mel bin vs. time/s) of the original speech, the speech in speech-shaped noise (mel mag SNR = 2.41 dB), and the VQ inferred states (mel mag SNR = 3.6 dB)]

Speech Recognition Models
• Cooke & Lee Speech Separation Challenge: short, grammatically constrained utterances, <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>, e.g. "bin white by R 8 again"; task: report the letter + number for "white"
• Decode with a factorial HMM, i.e. two state sequences, one model for each voice: exploit sequence constraints; exploit speaker differences
• IBM "superhuman" system [Kristjansson, Hershey et al.
'06]: fewer errors than people for the same speaker and level; exploits the known speakers and the limited grammar

Speaker-Adapted (SA) Models (Ron Weiss)
• The factorial HMM needs distinct speakers
• Use an "eigenvoice" speaker space: iterate between estimating the voice and separating the speech
• SA performs midway between speaker-independent (SI) and speaker-dependent (SD) models
  [figure: spectrogram (freq/kHz vs. time/sec) of mixture t32_swil2a_m18_sbar9n; separations after adaptation iterations 1, 3, and 5 vs. the SD model separation; accuracy for SI, SA, and SD]

(Pitch) Factored Dictionaries (Gandhi & Hasegawa-Johnson '04; Radfar et al. '06)
• Separate representations for "source" (pitch) and "filter": N·M codewords from N+M entries; but: overgeneration...
• Faster search: direct extraction of the pitches; immediate separation of (most of) the spectra

Outline
1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Ambient Sounds: binaural separation; "personal audio" analysis

Binaural Localization by EM (Mike Mandel, NIPS '06)
• 2 or 3 sources in reverberation
• Iteratively estimate ILD and IPD: initialize from a PHAT ITD histogram; the output is a soft TF mask
  [figure: ground-truth and estimated time-frequency masks for an example mixture]

"Personal Audio" Archives
• Continuous recordings with an MP3 player
• Segment / cluster "episodes" by the statistics of ~10 s segments,
for a browsing interface

Personal Audio Speech Detection (Keansub Lee, Interspeech '06)
• Pitch is the last speech cue to disappear: a noise-robust pitch tracker for voice detection; the biggest problem was periodic noise (air conditioning)
  [figure: spectrogram of personal audio (speech + noise) with the pitch track and speaker-active ground truth; audio: ks-noisyspeech.wav]

Repeating Events in Personal Audio (Jim Ogle, ICASSP '07)
• An "unsupervised" feature to help browsing
• A full N×N search is very expensive: use Shazam fingerprint hashes to find repeats
  [figure: phone ring with its Shazam fingerprint landmarks (freq/kHz vs. time/s, level/dB)]
  this only works for exact repeats (alarms, jingles)
• O(N) scan for repeats: fixed-size hash table; multiple common hashes → confident match

Summary & Conclusions
• Listeners do well at separating sound mixtures: using signal cues (location, periodicity); using source-property variations
• Machines do less well: it is difficult to apply enough constraints; they need to exploit signal detail
• Models capture the constraints: learn from the real world; adapt to sources
• Separation is feasible only sometimes; describing source properties is easier
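As a closing illustration, the hash-table repeat scan described for personal-audio events can be sketched as follows. Everything here is a toy stand-in: the landmark hash (pairs of adjacent spectral peaks) is only in the spirit of Shazam-style fingerprinting, and the peak lists are invented data.

```python
from collections import defaultdict, Counter

def landmark_hashes(peaks):
    """Hash consecutive spectral-peak pairs as (f1, f2, time delta)."""
    return {(f1, f2, t2 - t1)
            for (t1, f1), (t2, f2) in zip(peaks, peaks[1:])}

# Invented (time, frequency-bin) peak lists for three audio events.
events = {
    "ring1":  [(0, 40), (1, 55), (2, 40), (3, 55)],
    "speech": [(0, 12), (2, 30), (5, 22), (7, 18)],
    "ring2":  [(0, 40), (1, 55), (2, 40), (3, 55)],  # exact repeat of ring1
}

# Single O(N) pass: each event files its hashes into a shared table.
table = defaultdict(list)          # hash -> events containing it
for name, peaks in events.items():
    for h in landmark_hashes(peaks):
        table[h].append(name)

# Events sharing multiple hashes are confident repeat candidates.
pair_counts = Counter()
for names in table.values():
    for a in names:
        for b in names:
            if a < b:
                pair_counts[(a, b)] += 1

repeats = [pair for pair, c in pair_counts.items() if c >= 2]
print(repeats)  # [('ring1', 'ring2')]
```

As the slide notes, this only fires on exact repeats (alarms, jingles): any variation in the peak pattern changes the hashes, so near-repeats share too few entries to cross the threshold.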