Lecture 8: Spatial sound

EE E6820: Speech & Audio Processing & Recognition
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
March 27, 2008
Outline

1. Spatial acoustics
2. Binaural perception
3. Synthesizing spatial audio
4. Extracting spatial sounds
Spatial acoustics
Received sound = source + channel
- so far, only considered the ideal source waveform
Sound carries information on its spatial origin
- "ripples in the lake"
- evolutionary significance
The basis of scene analysis?
- yes and no: try blocking an ear
Ripples in the lake
[Figure: source radiating a wavefront (speed c m/s) toward a listener; energy ∝ 1/r²]

Effect of relative position on sound:
- delay = Δr / c
- energy decay ∼ 1/r²
- absorption ∼ G(f)^r
- direct energy plus reflections
These give cues for recovering source position
Describe a wavefront by its normal
Recovering spatial information
Source direction as wavefront normal
- a moving plane can be found from its timing at 3 points:
  c · Δt = Δs = AB · cos θ
- need to solve the correspondence problem between sensors

[Figure: pressure vs. time at sensors A, B, C; wavefront arriving at angle θ]

Space: need 3 parameters
- e.g. 2 angles and range: azimuth θ, elevation φ, range r
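A minimal numeric sketch of that timing relation, assuming free-field propagation; the two-sensor spacing and delay values are illustrative, not from the slide:

    import numpy as np

    def azimuth_from_tdoa(delta_t, spacing, c=343.0):
        """Far-field arrival angle from the time difference between
        two sensors: c * delta_t = delta_s = spacing * cos(theta)."""
        cos_theta = np.clip(c * delta_t / spacing, -1.0, 1.0)
        return np.degrees(np.arccos(cos_theta))

    # a 0.2 ms delay across a 10 cm sensor pair
    print(azimuth_from_tdoa(2e-4, 0.10))  # ~46.7 degrees off the pair's axis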
The effect of the environment
Reflection causes additional wavefronts
- reflection, diffraction & shadowing
- plus scattering and absorption
- many paths → many echoes
Reverberant effect
- causal 'smearing' of signal energy

[Figure: spectrograms (0–8 kHz, 0–1.5 s) of dry speech 'airvib16' and the same speech plus reverb from 'hlwy16']
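As a rough illustration of this smearing, one can convolve a dry signal with a toy impulse response, white noise under an exponential envelope (a common approximation, not the actual 'hlwy16' response):

    import numpy as np
    from scipy.signal import fftconvolve

    def toy_reverb(dry, fs, rt60=0.5):
        """Smear a dry signal with a synthetic room response: white
        noise under the exponential envelope e^(-t/T) of the next slide."""
        t = np.arange(int(rt60 * fs)) / fs
        T = rt60 / np.log(1000.0)             # amplitude falls 60 dB at t = rt60
        ir = 0.1 * np.random.randn(len(t)) * np.exp(-t / T)
        ir[0] = 1.0                           # keep a strong direct path
        return fftconvolve(dry, ir)[:len(dry)]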
Reverberation impulse response
Exponential decay of reflections: h_room(t) ∼ e^(−t/T)

[Figure: impulse response envelope and spectrogram of 'hlwy16' (128-pt window), decaying from 0 to −70 dB over ∼0.7 s]
Frequency-dependent
- greater absorption at high frequencies → faster decay
Size-dependent
- larger rooms → longer delays → slower decay
Sabine's equation:

  RT60 = 0.049 · V / (S · ᾱ)

- reverb time varies with room size and absorption
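Sabine's equation as written uses imperial units (V in cubic feet, S in square feet); a worked example with made-up room dimensions:

    def rt60_sabine(V, S, alpha_bar):
        """Sabine reverberation time RT60 = 0.049 V / (S * alpha_bar),
        with V in ft^3 and S in ft^2."""
        return 0.049 * V / (S * alpha_bar)

    # a 20 x 15 x 10 ft room with average absorption 0.2
    V = 20 * 15 * 10                  # 3000 ft^3
    S = 2 * (20*15 + 20*10 + 15*10)   # 1300 ft^2 of walls, floor, ceiling
    print(rt60_sabine(V, S, 0.2))     # ~0.57 s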
Binaural perception
[Figure: source off to one side of the head; path length difference between the two ears; head shadow of the far ear (high freq)]

What is the information in the 2 ear signals?
- the sound of the source(s) (L+R)
- the position of the source(s) (L−R)
Example waveforms (ShATR database)

[Figure: left and right waveforms of 'shatr78m3', 2.2–2.235 s]
Main cues to spatial hearing
Interaural time difference (ITD)
- from different path lengths around the head
- dominates at low frequencies (< 1.5 kHz)
- max ∼750 µs → ambiguous for frequencies > 600 Hz (see the sketch below)
Interaural intensity difference (IID)
- from head shadowing of the far ear
- negligible at low frequencies; increases with frequency
Spectral detail (from pinna reflections) useful for elevation & range
Direct-to-reverberant ratio useful for range
[Figure: spectrogram (0–20 kHz, 0–1.8 s) of claps 33 and 34 from '627M:nf90']
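The >600 Hz ambiguity quoted above follows directly from the maximum ITD: once half a period is shorter than the largest possible delay, the interaural phase wraps.

    # maximum ITD of ~750 us -> ambiguity above roughly 1/(2 * itd_max)
    itd_max = 750e-6            # s, about head width / c
    print(1 / (2 * itd_max))    # ~667 Hz, matching the limit quoted above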
Head-Related Transfer Functions (HRTFs)
Capture source-to-ear coupling as impulse responses {ℓ_θ,φ,R(t), r_θ,φ,R(t)}
Collection: the CIPIC database (http://interface.cipic.ucdavis.edu/)

[Figure: left- and right-ear HRIRs for subject 021 at 0° elevation, over azimuth −45° to +45° and time 0–1.5 ms, plus the two responses at 0° azimuth]
Highly individual!
Cone of confusion
[Figure: cone of confusion around the interaural axis, a surface of approximately equal ITD at azimuth θ]

Interaural timing cue dominates (below 1 kHz)
- from differing path lengths to the two ears
But: ITD only resolves the source to a cone
- Up/down? Front/back?
Further cues
Pinna causes elevation-dependent coloration
Monaural perception
- can the coloration be separated from the source spectrum?
Head motion
- synchronized spectral changes
- also disambiguates ITD (front/back) etc.
Combining multiple cues
Both ITD and ILD influence azimuth; what happens when they disagree?

[Figure: three pairs of pulse trains l(t), r(t)]
- identical signals to both ears → image is centered
- delaying the right channel (by ∼1 ms) → image moves to the left
- then attenuating the left channel → image returns to center

"Time-intensity trading"
Binaural position estimation
Imperfect results (Wenzel et al., 1993):

[Figure: judged vs. target azimuth scatter plot, ±180°]

- listening to 'wrong' HRTFs → errors
- front/back reversals stay on the cone of confusion
The Precedence Effect
Reflections give misleading spatial cues

[Figure: direct and reflected paths reach the ears as pulse trains l(t), r(t), the reflection arriving R/c later]

But: spatial impression is based on the 1st wavefront
- then 'switches off' for ∼50 ms
- ... even if the 'reflections' are louder
- ... leads to the impression of a room
Binaural Masking Release
Adding noise can reveal a target:
- tone + noise to one ear: the tone is masked
- add identical noise to the other ear: the tone becomes audible

Binaural Masking Level Difference (BMLD) of up to 12 dB
- greatest for noise in phase, tone in anti-phase
Synthesizing spatial audio
Goal: recreate a realistic soundfield
- hi-fi experience
- synthetic environments (VR)
Constraints
- resources
- information (individual HRTFs)
- delivery mechanism (headphones)
Source material types
- live recordings (actual soundfields)
- synthetic (studio mixing, virtual environments)
Classic stereo
[Figure: listener centered between L and R speakers]

'Intensity panning': no timing modifications, just vary level ±20 dB
- works as long as the listener is equidistant (pure ILD)
Surround sound: extra channels in center, sides, ...
- same basic effect: pan between pairs
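A minimal constant-power pan law, one common way to realize intensity panning (the exact law is a design choice, not specified on the slide):

    import numpy as np

    def pan_stereo(mono, pan):
        """Intensity panning: vary only the two channel levels.
        pan in [-1 (hard left), +1 (hard right)]; constant total power."""
        theta = (pan + 1.0) * np.pi / 4.0       # map to [0, pi/2]
        return np.stack([np.cos(theta) * mono,
                         np.sin(theta) * mono], axis=-1)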
Simulating reverberation
Can characterize reverb by its impulse response
- spatial cues are important: record in stereo
- IRs of ∼1 s → very long convolutions
Image model: reflections as duplicate sources

[Figure: virtual (image) sources mirrored across the walls; each reflected path to the listener looks like a direct path from an image source]

'Early echoes' in the room impulse response: h_room(t) shows the direct path followed by discrete early echoes
Actual reflections may be h_reflect(t), not δ(t)
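A sketch of the image model in one dimension (two parallel walls), just to show how the duplicate sources generate the early-echo delays; real rooms need 3-D images and per-reflection filtering:

    import numpy as np

    def image_delays_1d(src, lis, L, order, c=343.0):
        """Mirror a source in two parallel walls at x = 0 and x = L and
        return the arrival delays of the nearby image sources."""
        delays = set()
        for n in range(-order, order + 1):
            for img in (2 * n * L + src, 2 * n * L - src):
                delays.add(abs(img - lis) / c)
        return sorted(delays)

    print(image_delays_1d(src=1.0, lis=4.0, L=6.0, order=2)[:4])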
Artificial reverberation
Reproduce perceptually salient aspects:
- early echo pattern (→ room size impression)
- overall decay tail (→ wall materials . . . )
- interaural coherence (→ spaciousness)
Nested allpass filters (Gardner, 1992):

  H(z) = (z^−k − g) / (1 − g·z^−k)

- impulse response: −g at n = 0, then (1−g²), g(1−g²), g²(1−g²), . . . every k samples

[Figure: allpass block diagram, plus a 'Synthetic Reverb' example built from nested and cascaded allpass sections with (k, g) = (30, 0.7), (50, 0.5), (20, 0.3), feedback gains a0–a2, and a lowpass filter in the loop]
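A direct implementation of one such allpass section; Gardner's reverberators nest and cascade sections like this (the (k, g) values below are the ones in the slide's example diagram):

    import numpy as np

    def allpass(x, k, g):
        """Allpass section H(z) = (z^-k - g) / (1 - g z^-k), i.e.
        y[n] = -g*x[n] + x[n-k] + g*y[n-k]."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            xd = x[n - k] if n >= k else 0.0
            yd = y[n - k] if n >= k else 0.0
            y[n] = -g * x[n] + xd + g * yd
        return y

    # cascade the slide's example sections on an impulse
    h = np.zeros(4000); h[0] = 1.0
    for k, g in [(30, 0.7), (50, 0.5), (20, 0.3)]:
        h = allpass(h, k, g)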
Synthetic binaural audio
Source convolved with {L,R} HRTFs gives precise positioning
. . . for headphone presentation
- can combine multiple sources (by adding)
Where to get HRTFs?
- measured sets, but: specific to the individual, discrete positions
- interpolate by linear crossfade or a PCA basis set
- or: a parametric model of delay, shadow, and pinna (Brown and Duda, 1998)

[Figure: structural model after Brown & Duda: per ear, an azimuth-dependent delay z^−tD(θ), a first-order head-shadow filter (1 − b(θ)z^−1)/(1 − a z^−1), and summed pinna-echo taps Σ p_k(θ,φ)·z^−tPk(θ,φ), plus a room echo K_E·z^−tE]

Head motion cues?
- head tracking + fast updates
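The rendering step itself is just a pair of convolutions; a sketch assuming the HRIRs are already loaded (e.g. from the CIPIC set mentioned earlier):

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(mono, hrir_l, hrir_r):
        """Headphone positioning: convolve a dry source with the
        left/right HRIRs for the desired direction; sum several
        such renderings to place multiple sources."""
        return np.stack([fftconvolve(mono, hrir_l),
                         fftconvolve(mono, hrir_r)], axis=-1)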
Transaural sound
Binaural signals without headphones?
Can cross-cancel the wrap-around signals
- speakers S_{L,R}, ears E_{L,R}, binaural signals B_{L,R}
- goal: present B_{L,R} at E_{L,R}

  S_L = H_LL^−1 · (B_L − H_RL · S_R)
  S_R = H_RR^−1 · (B_R − H_LR · S_L)

[Figure: speakers S_L, S_R reach ears E_L, E_R via the four paths H_LL, H_LR, H_RL, H_RR]

Narrow 'sweet spot'
- head motion?
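Solving the same two equations jointly per frequency bin gives the speaker spectra directly; a sketch with all transfer functions as FFT-domain arrays and a crude regularization of the determinant:

    import numpy as np

    def crosstalk_cancel(BL, BR, HLL, HLR, HRL, HRR, eps=1e-6):
        """Invert E = H S per frequency so the ears receive the binaural
        spectra: EL = HLL*SL + HRL*SR, ER = HLR*SL + HRR*SR."""
        det = HLL * HRR - HLR * HRL
        det = np.where(np.abs(det) < eps, eps, det)  # avoid blow-ups
        SL = (HRR * BL - HRL * BR) / det
        SR = (HLL * BR - HLR * BL) / det
        return SL, SR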
Soundfield reconstruction
Stop thinking about ears
- just reconstruct the pressure p(x, y, z, t) and its spatial derivatives ∂p/∂x, ∂p/∂y, ∂p/∂z
- ears placed in the reconstructed field receive the same sounds
Complex reconstruction setup (ambisonics)
- able to preserve head motion cues?
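First-order ambisonics records exactly these four quantities; a sketch of encoding a mono source at a given direction into traditional B-format (W, X, Y, Z):

    import numpy as np

    def encode_bformat(mono, az, el):
        """Pressure W plus the three pressure-gradient components."""
        w = mono / np.sqrt(2.0)               # traditional -3 dB W scaling
        x = mono * np.cos(az) * np.cos(el)
        y = mono * np.sin(az) * np.cos(el)
        z = mono * np.sin(el)
        return np.stack([w, x, y, z], axis=-1)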
Extracting spatial sounds
Given access to the soundfield, can we recover its separate components?
- degrees of freedom: recovering > N signals from N sensors is hard
- but: people can do it (somewhat)
Information-theoretic approach
- use only very general constraints
- rely on precision measurements
Anthropic approach
- examine human perception
- attempt to use the same information
Microphone arrays
Signals from multiple microphones can be combined to enhance/cancel certain sources
'Coincident' mics with different directional gains:

  [m1; m2] = [a11 a12; a21 a22] · [s1; s2]  ⇒  ŝ = Â⁻¹ · m

[Figure: sources s1, s2 couple into mics m1, m2 with gains a11, a12, a21, a22]

Microphone arrays (endfire):

[Figure: delay-and-sum line array with element spacing D; polar responses (0 to −40 dB) at λ = 4D, 2D, D]
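When the gain matrix is known (or estimated), unmixing the coincident-mic case is a 2 x 2 inversion; a toy sketch with made-up gains:

    import numpy as np

    A = np.array([[1.0, 0.5],      # a11, a12: mic 1's gain to s1, s2
                  [0.4, 1.0]])     # a21, a22: mic 2's gain to s1, s2
    s = np.random.randn(2, 1000)   # two source waveforms
    m = A @ s                      # coincident-mic mixtures
    s_hat = np.linalg.inv(A) @ m   # recovery, exact when A is exact
    print(np.allclose(s, s_hat))   # True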
Adaptive Beamforming &
Independent Component Analysis (ICA)
Formulate mathematical criteria to optimize
Beamforming: drive interference to zero
- cancel energy during nontarget intervals
ICA: maximize mutual independence of the outputs
- from higher-order moments during overlap

[Figure: mixture x = [a11 a12; a21 a22]·[s1; s2] observed at mics m1, m2; unmixing weights adapted along −∂MutInfo/∂a]

Limited by the separation model's parameter space
- only N × N?
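A sketch of ICA on a 2 x 2 instantaneous mixture using scikit-learn's FastICA (one of several ICA algorithms; recovery is only up to permutation and scaling):

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    s = np.stack([np.sign(rng.standard_normal(5000)),  # two non-Gaussian
                  rng.uniform(-1, 1, 5000)])           # source signals
    A = np.array([[1.0, 0.6],
                  [0.5, 1.0]])                         # mixing matrix
    m = A @ s                                          # 2 mic signals
    ica = FastICA(n_components=2, random_state=0)
    s_hat = ica.fit_transform(m.T).T                   # estimated sources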
Binaural models
Human listeners do better?
- certainly given only 2 channels
Extract ITD and IID cues?
- cross-correlation finds timing differences
- 'consume' counter-moving pulses
- how to achieve IID extraction and time-intensity trading?
- vertical cues . . .
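A minimal cross-correlation ITD estimator over physically plausible lags (circular shifts for brevity; a real binaural model would do this per frequency band):

    import numpy as np

    def estimate_itd(left, right, fs, max_itd=750e-6):
        """ITD as the lag maximizing the interaural cross-correlation,
        searched only within +/- max_itd."""
        max_lag = int(max_itd * fs)
        lags = np.arange(-max_lag, max_lag + 1)
        xc = [np.dot(left, np.roll(right, lag)) for lag in lags]
        return lags[int(np.argmax(xc))] / fs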
Time-frequency masking
How to separate sounds based on direction?
- assume one source dominates each time-frequency point
- assign regions of the spectrogram to sources based on probabilistic models
- re-estimate the model parameters based on the regions selected
Model-based EM Source Separation and Localization (MESSL), Mandel and Ellis (2007)
- models include the IID as |L_ω/R_ω| and the IPD as arg(L_ω/R_ω)
- these cues are independent of the source, which can be modeled separately
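Per time-frequency interaural cues of this kind can be computed from the ratio of the two ears' STFTs; a sketch (the STFT parameters are illustrative):

    import numpy as np
    from scipy.signal import stft

    def interaural_cues(left, right, fs):
        """ILD (dB) and IPD (rad) at every time-frequency point,
        from the ratio of left- and right-ear spectrograms."""
        _, _, L = stft(left, fs=fs, nperseg=1024)
        _, _, R = stft(right, fs=fs, nperseg=1024)
        ratio = L / (R + 1e-12)
        return 20 * np.log10(np.abs(ratio) + 1e-12), np.angle(ratio)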
Summary
Spatial sound
- sampling at more than one point gives information on the direction of origin
Binaural perception
- time & intensity cues used between/within ears
Sound rendering
- conventional stereo
- HRTF-based
Spatial analysis
- optimal linear techniques
- elusive auditory models
References
Elizabeth M. Wenzel, Marianne Arruda, Doris J. Kistler, and Frederic L. Wightman.
Localization using nonindividualized head-related transfer functions. The Journal of
the Acoustical Society of America, 94(1):111–123, 1993.
William G. Gardner. A real-time multichannel room simulator. The Journal of the
Acoustical Society of America, 92(4):2395–2395, 1992.
C. P. Brown and R. O. Duda. A structural model for binaural sound synthesis. IEEE
Transactions on Speech and Audio Processing, 6(5):476–488, 1998.
Michael I. Mandel and Daniel P. Ellis. EM localization and separation using interaural
level and phase cues. In IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics, pages 275–278, 2007.
J. C. Middlebrooks and D. M. Green. Sound localization by human listeners. Annu
Rev Psychol, 42:135–159, 1991.
Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press,
fifth edition, April 2003. ISBN 0125056281.
Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound
Localization. The MIT Press, October 1996.
V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF
database. In IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, pages 99–102, 2001.