EE E6820: Speech & Audio Processing & Recognition
Lecture 8: Spatial sound

Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
March 27, 2008

Outline
1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds

Michael Mandel (E6820 SAPR), Spatial sound, March 27, 2008

1 Spatial acoustics

Received sound = source + channel
- so far, we have only considered the ideal source waveform
Sound carries information about its spatial origin
- "ripples in the lake"
- evolutionary significance
The basis of scene analysis?
- yes and no: try blocking an ear

Ripples in the lake

[Figure: listener and source; wavefront propagating at c m/s; energy ∝ 1/r²]
Effect of relative position on sound:
- delay = Δr / c
- energy decay ∝ 1/r²
- absorption ~ G(f)^r
- direct energy plus reflections
These give cues for recovering the source position.
A wavefront can be described by its normal.

Recovering spatial information

Source direction as wavefront normal:
- pressure forms a moving plane, found from its timing at 3 points
- path difference Δs = c·Δt = AB·cos θ
[Figure: wavefront crossing sensors A, B, C; need to solve the correspondence]
Space: need 3 parameters, e.g.
2 angles (azimuth θ, elevation φ) plus range r

The effect of the environment

Reflection causes additional wavefronts:
- reflection
- diffraction & shadowing
- scattering, absorption
Many paths → many echoes
Reverberant effect: causal 'smearing' of the signal energy
[Figure: spectrograms (0-8000 Hz) of dry speech 'airvib16' and of the same speech plus reverb from 'hlwy16']

Reverberation impulse response

Exponential decay of reflections: h_room(t) ~ e^(-t/T)
[Figure: spectrogram of 'hlwy16' (128-pt window); energy decays from -10 to -70 dB over about 0.7 s]
Frequency-dependent:
- greater absorption at high frequencies → faster decay
Size-dependent:
- larger rooms → longer delays → slower decay
Sabine's equation: RT60 = 0.049 V / (S·ᾱ)
- the time constant grows with room volume V and falls with surface area S and average absorption ᾱ

2 Binaural perception

[Figure: head between the source and the two ears, showing head shadow (high frequencies) and the path length difference]
What is the information in the 2 ear signals?
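Before moving on to binaural cues, Sabine's equation above can be checked numerically. A minimal sketch; the room dimensions and absorption coefficient are hypothetical illustration values, and the 0.049 constant assumes imperial units (ft³, ft²):

```python
# Sabine's reverberation time: RT60 = 0.049 * V / (S * alpha_bar)
# V: room volume (ft^3), S: surface area (ft^2),
# alpha_bar: average absorption coefficient (dimensionless).

def rt60_sabine(volume_ft3, surface_ft2, alpha_bar):
    """Sabine reverberation time in seconds."""
    return 0.049 * volume_ft3 / (surface_ft2 * alpha_bar)

# A hypothetical 30 x 20 x 10 ft room:
L, W, H = 30.0, 20.0, 10.0
V = L * W * H                       # 6000 ft^3
S = 2 * (L * W + L * H + W * H)     # 2200 ft^2
print(rt60_sabine(V, S, 0.2))       # about 0.67 s
```

Doubling the volume with the same surfaces lengthens RT60, and raising ᾱ shortens it, matching the size and absorption dependences on the slide.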
The two ear signals carry:
- the sound of the source(s) (L+R)
- the position of the source(s) (L−R)
[Figure: example left and right waveforms from the ShATR database ('shatr78m3'), roughly 2.2-2.235 s]

Main cues to spatial hearing

Interaural time difference (ITD)
- from the different path lengths around the head
- dominates at low frequencies (< 1.5 kHz)
- maximum ~750 µs → ambiguous for frequencies above ~600 Hz
Interaural intensity difference (IID)
- from head shadowing of the far ear
- negligible at low frequencies; increases with frequency
Spectral detail (from pinna reflections): useful for elevation & range
Direct-to-reverberant ratio: useful for range
[Figure: spectrogram (0-20 kHz) of claps 33 and 34 from 627M:nf90]

Head-Related Transfer Functions (HRTFs)

Capture the source-to-ear coupling as impulse responses {l_θ,φ,R(t), r_θ,φ,R(t)}
Collection: CIPIC database (http://interface.cipic.ucdavis.edu/)
[Figure: left and right head-related impulse responses for subject 021 at 0° elevation, as a function of azimuth]
Highly individual!

Cone of confusion

The interaural timing cue dominates below ~1 kHz
- it comes from the differing path lengths to the two ears
But it only resolves the source to a cone of approximately equal ITD
- up/down? front/back?
[Figure: cone of confusion around the interaural axis at azimuth θ]

Further cues

Pinna reflections cause elevation-dependent coloration
Monaural perception
- can the coloration be separated from the source spectrum?
Head motion
- synchronized spectral changes
- also disambiguates ITD (front/back), etc.

Combining multiple cues

Both ITD and IID influence azimuth. What happens when they disagree?
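As a side note on the ITD figures above, a common spherical-head approximation (the Woodworth formula, which the slides do not give) reproduces their rough magnitude; the head radius and speed of sound below are typical assumed values:

```python
import math

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Woodworth spherical-head ITD approximation:
    ITD = (a / c) * (theta + sin(theta)), theta in radians."""
    theta = math.radians(azimuth_deg)
    return head_radius_m / c * (theta + math.sin(theta))

# ITD grows from zero on the median plane toward the side:
for az in (0, 30, 60, 90):
    print(f"{az:3d} deg -> {itd_woodworth(az) * 1e6:6.1f} us")
# at 90 deg this gives about 655 us, the same order as the
# ~750 us maximum quoted above
```

Since this maximum exceeds half the period of tones above a few hundred Hz, pure-tone ITD becomes ambiguous there, consistent with the ~600 Hz limit on the slide.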
When the cues conflict they can cancel each other: "time-intensity trading"
- identical signals at both ears → the image is centered
- delaying the right channel (e.g. by 1 ms) moves the image to the left
- attenuating the left channel then returns the image to the center

Binaural position estimation

Imperfect results (Wenzel et al., 1993):
[Figure: judged vs. target azimuth over ±180°, showing scatter and reversals]
- listening through the 'wrong' (nonindividualized) HRTFs → errors
- front/back reversals stay on the cone of confusion

The Precedence Effect

Reflections give misleading spatial cues
[Figure: direct and reflected paths; the reflection arrives R/c later]
But the spatial impression is based on the first wavefront
- localization then 'switches off' for ~50 ms
- ...even if the 'reflections' are louder
- ...this leads to the impression of a room

Binaural Masking Release

Adding noise can reveal a target:
- tone + noise in one ear: the tone is masked
- identical noise added to the other ear: the tone becomes audible
Binaural Masking Level Difference: up to 12 dB
- greatest for noise in phase and tone in anti-phase

3 Synthesizing spatial audio

Goal: recreate a realistic soundfield
- hi-fi experience
- synthetic environments (VR)
Constraints:
- resources
- information (individual HRTFs)
- delivery mechanism (headphones)
Source material types:
- live recordings (actual soundfields)
- synthetic (studio mixing, virtual environments)

Classic stereo

'Intensity panning': no timing modifications, just vary the level over ±20 dB
- works as long as the listener is equidistant from the speakers (IID)
Surround sound: extra channels in the center, sides, ...
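The intensity panning just described can be sketched with a standard constant-power pan law; the slides specify only that levels vary, so the exact cosine/sine law here is an illustrative assumption, not the lecture's prescription:

```python
import math

def constant_power_pan(pan):
    """Constant-power stereo pan law (a common choice).
    pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right.
    Returns (left_gain, right_gain) with L^2 + R^2 = 1, so total
    power stays constant as the image moves."""
    angle = (pan + 1.0) * math.pi / 4.0   # map [-1, 1] -> [0, pi/2]
    return math.cos(angle), math.sin(angle)

l, r = constant_power_pan(0.0)
print(l, r)   # both about 0.707 at the center
```

Applying these gains to identical left/right signals moves the image purely by level difference, which is exactly why the trick only works for a listener equidistant from the two speakers.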
Surround sound gives the same basic effect: pan between pairs of channels

Simulating reverberation

Reverb can be characterized by an impulse response
- spatial cues are important: record in stereo
- IRs of ~1 s → very long convolutions
Image model: treat reflections as duplicate (image) sources
[Figure: source, listener, and virtual image sources reached via reflected paths]
'Early echoes' in the room impulse response: the direct path followed by early echoes in h_room(t)
The actual reflection may be some h_reflect(t), not an ideal δ(t)

Artificial reverberation

Reproduce the perceptually salient aspects:
- early echo pattern (→ impression of room size)
- overall decay tail (→ wall materials, ...)
- interaural coherence (→ spaciousness)
Nested allpass filters (Gardner, 1992)
- allpass: H(z) = (−g + z^(−k)) / (1 − g·z^(−k)); impulse response −g, then (1−g²)·g^(m−1) at lags m·k
[Figure: synthetic reverb built from nested and cascaded allpass sections with feedback gains and a lowpass filter]

Synthetic binaural audio

A source convolved with {L,R} HRTFs gives precise positioning
- ...for headphone presentation
- multiple sources can be combined by adding
Where to get HRTFs?
- measured sets, but these are specific to one individual and to discrete positions
- interpolate by linear crossfade, or use a PCA basis set
- or use a parametric model: delay, shadow, pinna, room echo (Brown and Duda, 1998)
Head motion cues?
- head tracking + fast updates

Transaural sound

Binaural signals without headphones?
The wrap-around signals can be cross-cancelled
- speakers S_L, S_R; ears E_L, E_R; binaural signals B_L, B_R
- goal: present B_L, B_R at E_L, E_R:
  S_L = H_LL^(−1) (B_L − H_RL·S_R)
  S_R = H_RR^(−1) (B_R − H_LR·S_L)
Narrow 'sweet spot'
- what about head motion?
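The allpass section at the heart of the artificial reverberator above can be implemented directly from its difference equation; a minimal sketch, with arbitrary illustrative k and g (a real reverberator would nest and cascade several such sections):

```python
# Allpass H(z) = (-g + z^-k) / (1 - g z^-k) as a difference
# equation: y[n] = -g*x[n] + x[n-k] + g*y[n-k].

def allpass(x, k, g):
    """Run a k-sample-delay, gain-g allpass filter over the list x."""
    y = []
    for n in range(len(x)):
        xd = x[n - k] if n >= k else 0.0   # x[n-k]
        yd = y[n - k] if n >= k else 0.0   # y[n-k]
        y.append(-g * x[n] + xd + g * yd)
    return y

# Impulse response: -g at lag 0, then (1-g^2) * g^(m-1) at lag m*k,
# matching the -g, 1-g^2, g(1-g^2), g^2(1-g^2), ... taps on the slide.
h = allpass([1.0] + [0.0] * 9, k=3, g=0.7)
print(h)  # ~ [-0.7, 0, 0, 0.51, 0, 0, 0.357, 0, 0, 0.25]
```

The exponentially decaying echo train at spacing k is what makes a lattice of these sections a cheap stand-in for a measured room impulse response.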
Soundfield reconstruction

Stop thinking about ears
- just reconstruct the pressure p(x,y,z,t) and its spatial derivatives ∂p/∂x, ∂p/∂y, ∂p/∂z at a point
- ears placed in the reconstructed field receive the same sounds
Complex reconstruction setup (ambisonics)
- can it preserve head motion cues?

4 Extracting spatial sounds

Given access to the soundfield, can we recover the separate components?
- degrees of freedom: recovering more than N signals from N sensors is hard
- but people can do it (somewhat)
Information-theoretic approach:
- use only very general constraints
- rely on precision measurements
Anthropic approach:
- examine human perception
- attempt to use the same information

Microphone arrays

Signals from multiple microphones can be combined to enhance or cancel particular sources
'Coincident' mics with different directional gains:
- m = A·s with the 2×2 gain matrix A = [a11 a12; a21 a22], so ŝ = Â^(−1)·m
Microphone arrays (endfire):
[Figure: endfire array with spacing D; the directional response narrows as λ goes from 4D to 2D to D]

Adaptive Beamforming & Independent Component Analysis (ICA)

Formulate mathematical criteria to optimize
Beamforming: drive the interference to zero
- cancel energy during nontarget intervals
ICA: maximize the mutual independence of the outputs
- from higher-order moments during overlap
- adapt the unmixing parameters along −∂MutInfo/∂a
Limited by the separation model's parameter space
- only N × N?

Binaural models

Do human listeners do better?
- certainly given only 2 channels
Can we extract the ITD and IID cues?
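For the ITD at least, one simple answer to the question above is to cross-correlate the two channels and pick the best lag; this is a toy illustration of the cue, not a full binaural model, and the signals are made-up pulses:

```python
# Estimate ITD (in samples) by brute-force cross-correlation:
# the lag maximizing sum_n left[n] * right[n + lag] is the delay
# of the right channel relative to the left.

def xcorr_itd(left, right, max_lag):
    """Return the lag (samples) by which `right` trails `left`."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(left[n] * right[n + lag]
                  for n in range(len(left))
                  if 0 <= n + lag < len(right))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# A toy pulse arriving 3 samples later at the right ear:
left = [0, 0, 1.0, 0.5, 0, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 0, 1.0, 0.5, 0, 0, 0]
print(xcorr_itd(left, right, max_lag=5))  # 3
```

This is the operation underlying the cross-correlation mechanism mentioned on the next slide; a physiological model would run it within narrow frequency bands rather than on the broadband waveforms.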
Possible mechanisms:
- cross-correlation finds the timing differences
- 'consume' counter-moving pulses
- how to incorporate IID, time-intensity trading, vertical cues, ...

Time-frequency masking

How can sounds be separated based on direction?
- assume one source dominates each time-frequency point
- assign regions of the spectrogram to sources based on probabilistic models
- re-estimate the model parameters from the regions selected
Model-based EM Source Separation and Localization (Mandel and Ellis, 2007)
- models the interaural cues of the ratio L(ω,t)/R(ω,t): IID as its magnitude, IPD as its argument
- these cues are independent of the source, though the source can be modeled separately

Summary

Spatial sound
- sampling at more than one point gives information about the direction of origin
Binaural perception
- time & intensity cues are used between and within the ears
Sound rendering
- conventional stereo
- HRTF-based
Spatial analysis
- optimal linear techniques
- elusive auditory models

References

Elizabeth M. Wenzel, Marianne Arruda, Doris J. Kistler, and Frederic L. Wightman. Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993.
William G. Gardner. A real-time multichannel room simulator. The Journal of the Acoustical Society of America, 92(4):2395, 1992.
C. P. Brown and R. O. Duda. A structural model for binaural sound synthesis. IEEE Transactions on Speech and Audio Processing, 6(5):476–488, 1998.
Michael I. Mandel and Daniel P. Ellis. EM localization and separation using interaural level and phase cues. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 275–278, 2007.
J. C. Middlebrooks and D. M. Green. Sound localization by human listeners. Annual Review of Psychology, 42:135–159, 1991.
Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, fifth edition, 2003. ISBN 0125056281.
Jens Blauert. Spatial Hearing, Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, 1996.
V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF database. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 99–102, 2001.