Auditory Scene Analysis in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks
1. Auditory Scene Analysis
• Sounds rarely occur in isolation
.. but recognizing sources in mixtures is a problem
.. for humans and machines
[Figure: multi-channel spectrograms (chan 0, 1, 3, 9) of a meeting recording, freq/kHz vs. time/sec, level/dB — mr-20001102-1440-cE+1743.wav]
Sound Mixture Organization
• Goal: recover individual sources from scenes
.. duplicating the perceptual effect
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
0
2
4
6
8
10
12
time/s
-60
level / dB
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
Choir
Strings
• Problems: competing sources, channel effects
• Dimensionality loss
need additional constraints
The Problem of Mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Undoing mixtures: hearing’s primary goal?
.. by any means available
Source Separation Scenarios
• Interactive voice systems
human-level understanding is expected
• Speech prostheses
crowds: #1 complaint of hearing aid users
• Archive analysis
identifying and isolating sound events
[Figure: spectrogram of a personal-audio recording, 0-8 kHz over ~9 s, level/dB — pa-2004-09-10-131540.wav]
• Unmixing/remixing/enhancement...
How Can We Separate?
• By between-sensor differences (spatial cues)
‘steer a null’ onto a compact interfering source
• By finding a ‘separable representation’
spectral? sources are broadband but sparse
periodicity? maybe – for pitched sounds
something more signal-specific...
• By inference (based on knowledge/models)
acoustic sources are redundant
→ use part to guess the remainder
Outline
1. The ASA Problem
2. Human ASA
scene analysis
separation by location
separation by source characteristics
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks
Auditory Scene Analysis
• Listeners organize sound mixtures into discrete perceived sources
based on within-signal cues (audio + ...):
common onset + continuity
harmonicity
spatial, modulation, ...
learned “schema”
[Figure: spectrogram of the Reynolds-McAdams oboe example, 0-4 kHz over 4 s, level/dB — reynolds-mcadams-dpwe.wav]
Perceiving Sources
• Harmonics distinct in ear, but perceived as one source (“fused”):
depends on common onset
depends on harmonics
• Experimental techniques
ask subjects “how many”
match attributes e.g. pitch, vowel identity
brain recordings (EEG “mismatch negativity”)
Auditory Scene Analysis (Bregman ’90; Darwin & Carlyon ’95)
• How do people analyze sound mixtures?
break mixture into small elements (in time-freq)
elements are grouped into sources using cues
sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
cues: common onset/offset/modulation,
harmonicity, spatial location, ...
[Diagram: sound → frequency analysis → elements → grouping mechanism → sources → source properties; onset, harmonicity, and spatial maps feed the grouping mechanism (after Darwin 1996)]
Streaming
• Sound event sequences are organized into streams
i.e. distinct perceived sources
difficult to make comparisons between streams
• Two-tone streaming experiments:
[Figure: alternating two-tone sequence, base frequency 1 kHz, tone repetition time (TRT) 60–150 ms, Δf up to 2 octaves — asacd-36-streaming-50s]
ecological relevance?
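To make the stimulus concrete, here is a minimal numpy sketch of the classic ABA- two-tone sequence shown in the figure; the 1 kHz base tone and the TRT range come from the slide, while the tone duration, ramps, and default Δf are illustrative assumptions.

import numpy as np

def aba_streaming_stimulus(sr=16000, f_a=1000.0, delta_octaves=0.5,
                           tone_dur=0.05, trt=0.1, n_triplets=10):
    """Generate an ABA- triplet sequence: A and B tones separated by
    delta_octaves, one tone slot every trt seconds, fourth slot silent."""
    f_b = f_a * 2.0 ** delta_octaves
    tone_n, slot_n = int(tone_dur * sr), int(trt * sr)
    t = np.arange(tone_n) / sr
    ramp = np.minimum(1.0, np.minimum(t, t[::-1]) / 0.005)   # 5 ms on/off ramps
    def tone(f):
        return np.sin(2 * np.pi * f * t) * ramp
    triplet = np.zeros(4 * slot_n)
    for slot, f in enumerate([f_a, f_b, f_a]):                # A B A then silence
        triplet[slot * slot_n: slot * slot_n + tone_n] = tone(f)
    return np.tile(triplet, n_triplets)

Small Δf and/or slow rates tend to be heard as one coherent “galloping” stream; large Δf at fast rates splits into separate A and B streams.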
Illusions & Restoration
• Illusion = hearing more than is “there”
e.g. “pulsation threshold”: the tone is masked during the noise bursts, yet heard as continuing
“old-plus-new” heuristic: existing sources continue
[Figure: pulsation-threshold stimulus — tone alternating with noise bursts, 0–2 kHz over 1.2 s — asacd-26-pulsation]
• Need to infer most likely real-world events
observation equally good match to either case
prior likelihood of continuity much higher
Human Performance: Spatial Separation
• Task: Coordinate Response Measure (Brungart et al. ’02)
“Ready Baron go to green eight now”
256 variants, 16 speakers
correct = color and number for “Baron”
crm-11737+16515.wav
• Accuracy as a function of spatial separation:
[Figure: response accuracy vs. spatial separation, talkers A and B the same speaker; note the range effect]
Separation by Vocal Differences (Brungart et al. ’01)
• CRM varying the level and voice character (same spatial location)
energetic vs. informational masking
[Figure: monaural informational masking — accuracy vs. target/masker level difference; level differences can help in speech segregation (“listen for the quieter talker”)]
Varying the Number of Voices (Brungart et al. ’01)
• Two voices OK; more than two voices harder
(same spatial origin)
mix of N voices tends to speech-shaped noise...
[Figure: monaural informational masking vs. number of talkers — performance drops dramatically when a 3rd or 4th talker is added to the stimulus]
Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
Independent Component Analysis
Computational Auditory Scene Analysis
Model-Based Separation
4. Systems & Examples
5. Concluding Remarks
Scene Analysis Systems
• “Scene Analysis”
not necessarily separation, recognition, ...
scene = overlapping objects, ambiguity
• General Framework:
distinguish input and output representations
distinguish engine (algorithm) and control
(constraints, “computational model”)
Human and Machine Scene Analysis
• CASA (e.g. Brown’92):
Input: periodicity, continuity, onset “maps”
Output: waveform (or mask)
Engine: time-frequency masking
Control: “grouping cues” from input
- or: spatial features (Roman, ...)
• ICA (Bell & Sejnowski et seq.):
Input: waveform (or STFT)
Output: waveform (or STFT)
Engine: cancellation
Control: statistical independence of outputs
- or energy minimization for beamforming
• Human listeners:
Input: excitation patterns ...
Output: percepts ...
Engine: ?
Control: find a plausible explanation
Machine Separation
• Problem: features of combinations are not combinations of features
voice is easy to characterize when in isolation
redundancy needed for real-world communication
[Figure: spectrograms of a solo voice and of a two-voice mixture, each alone and in noise — crm-11737.wav, crm-11737-noise.wav, crm-11737+16515.wav, crm-11737+16515-noise.wav]
Separation Approaches
ICA:
• Multi-channel
• Fixed filtering
• Perfect separation – maybe!
[Diagram: target x + interference n → mix → fixed processing → x̂]
CASA / Model-based:
• Single-channel
• Time-varying filtering
• Approximate separation
[Diagram: mix x+n → STFT → spectrogram → mask → processing → ISTFT → x̂, n̂]
• Very different approaches!
Independent Component Analysis (Bell & Sejnowski ’95; Smaragdis ’98)
• Central idea: search the unmixing space to maximize independence of outputs
[Diagram: sources s1, s2 mixed by A = [a11 a12; a21 a22] into observations x1, x2; unmixing matrix W = [w11 w12; w21 w22] gives estimates ŝ1, ŝ2, adapted via ∂MutInfo/∂wij]
simple mixing
→ a good solution (usually) exists
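To make the search idea concrete, here is a minimal illustrative sketch (an assumption of this write-up, not the infomax rule of Bell & Sejnowski): whiten a 2×2 instantaneous mix, then grid-search the single unmixing rotation angle that maximizes the kurtosis (non-Gaussianity) of the outputs.

import numpy as np

rng = np.random.default_rng(0)

def kurtosis(y):
    """4th-moment measure of non-Gaussianity (0 for a Gaussian)."""
    y = y - y.mean()
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

# two super-Gaussian (speech-like) sources, instantaneously mixed
s = rng.laplace(size=(2, 20000))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
x = A @ s

# whiten the mixtures so unmixing reduces to finding a rotation
x = x - x.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(x))
z = np.diag(evals ** -0.5) @ evecs.T @ x

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# search the unmixing space (here a single angle) for maximally non-Gaussian outputs
theta = max(np.linspace(0, np.pi / 2, 180),
            key=lambda t: sum(abs(kurtosis(y)) for y in rotation(t) @ z))
s_hat = rotation(theta) @ z   # recovered sources, up to permutation and scale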
Mixtures, Scatters, Kurtosis
• Mixtures of sources become more Gaussian
can measure e.g. via ‘kurtosis’ (4th moment)
[Figure: scatter plots of sources s1 vs. s2 and of mixtures x1 vs. x2; marginal kurtosis: p(s1) = 27.90, p(s2) = 53.85; p(x1) = 18.50, p(x2) = 27.90]
ICA Limitations
• Cancellation is very finicky
hard to get more than ~ 10 dB rejection
[Figure: mixture scatter (mix 1 vs. mix 2) and output kurtosis vs. unmixing angle θ/π, from Parra & Spence ’00 — lee_ss_in_1.wav, lee_ss_para_1.wav]
• The world is not instantaneous, fixed, linear
subband models for reverberation
continuous adaptation
• Needs spatially-compact interfering sources
Computational Auditory Scene Analysis (Brown & Cooke ’94; Okuno et al. ’99; Hu & Wang ’04; ...)
• Central idea:
Segment time-frequency into sources
based on perceptual grouping cues
[Diagram: input mixture → front end → signal features (maps over time-frequency: onset, period, freq. modulation) → object formation → discrete objects → grouping rules → source groups]
... principal cue is harmonicity
CASA Preprocessing (Slaney & Lyon ’90)
• Correlogram:
loosely based on known/possible physiology
a 3rd “periodicity” axis
envelope of wideband channels follows pitch
[Diagram: sound → cochlea-approximation filterbank (linear filterbank + static nonlinearity) → envelope follower → short-time autocorrelation (delay line) → correlogram slices over freq × lag × time; c/w modulation filtering [Schimmel & Atlas ’05]]
- zero-delay slice is like spectrogram
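A rough numpy/scipy sketch of the correlogram computation just outlined; the filterbank shape, rectification, and window lengths are illustrative assumptions, not the Slaney & Lyon implementation.

import numpy as np
from scipy.signal import butter, lfilter

def correlogram(x, sr, n_chans=16, fmin=100.0, fmax=4000.0,
                frame=0.025, hop=0.010, max_lag=0.0125):
    """Toy correlogram: bandpass filterbank -> half-wave rectified 'envelope'
    -> short-time autocorrelation per channel; returns (chans, lags, frames)."""
    cfs = np.geomspace(fmin, fmax, n_chans)        # channel centre frequencies
    N, H, L = int(frame * sr), int(hop * sr), int(max_lag * sr)
    starts = np.arange(0, len(x) - N - L, H)
    cgram = np.zeros((n_chans, L, len(starts)))
    for c, cf in enumerate(cfs):
        # crude cochlea approximation: 2nd-order bandpass (needs sr > 2 * 1.3 * fmax)
        b, a = butter(2, [cf / 1.3, cf * 1.3], btype="band", fs=sr)
        env = np.maximum(lfilter(b, a, x), 0.0)    # half-wave rectification
        for j, s0 in enumerate(starts):
            seg = env[s0:s0 + N + L]
            for lag in range(L):
                cgram[c, lag, j] = np.dot(seg[:N], seg[lag:lag + N])
    return cgram, cfs

The zero-lag slice cgram[:, 0, :] behaves like a coarse spectrogram, and summing across channels gives a pitch-like summary autocorrelation.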
“Weft” Periodic Elements (Ellis ’96)
• Represent harmonics without grouping?
[Diagram: correlogram slices (freq × lag over time) → common-period feature → period track + smooth spectrum]
hard to separate multiple pitch tracks
Time-Frequency (T-F) Masking
• “Local Dominance” assumption
[Figure: spectrograms of the original female and male utterances, the mix with oracle labels, and the oracle-based resynthesis — cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav]
oracle masks are remarkably effective!
|mix – max(male, female)| < 3dB for ~80% of cells
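A minimal sketch of the oracle (“ideal binary”) mask idea: given access to the pre-mix sources, label each STFT cell by whichever source dominates, then resynthesize from the mixture. The STFT settings and the use of scipy are assumptions for illustration, not the system behind the figure.

import numpy as np
from scipy.signal import stft, istft

def oracle_mask_separation(s1, s2, sr, nperseg=512):
    """Ideal binary mask: keep mixture cells where source 1 dominates."""
    mix = s1 + s2
    _, _, S1 = stft(s1, sr, nperseg=nperseg)
    _, _, S2 = stft(s2, sr, nperseg=nperseg)
    _, _, M = stft(mix, sr, nperseg=nperseg)
    mask = (np.abs(S1) > np.abs(S2)).astype(float)     # oracle labels
    _, s1_hat = istft(M * mask, sr, nperseg=nperseg)   # resynthesize source 1
    _, s2_hat = istft(M * (1 - mask), sr, nperseg=nperseg)
    return s1_hat, s2_hat, mask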
Combining Spatial + T-F Masking
• T-F masks based on inter-channel properties
[Roman et al. ’02], [Yilmaz & Rickard ’04]
multiple channels make CASA-like masks better
(lee_ss_in_1.wav, lee_ss_duet_1.wav)
• T-F masking after ICA [Blin et al. ’04]
cancellation can remove energy within T-F cells
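A rough sketch of building a T-F mask from inter-channel cues, in the spirit of the methods cited above; the level-difference feature, the threshold, and the STFT settings are illustrative assumptions.

import numpy as np
from scipy.signal import stft, istft

def spatial_tf_mask(left, right, sr, nperseg=512, ild_thresh_db=0.0):
    """Keep T-F cells whose inter-channel level difference favours the left
    channel, i.e. a source assumed to be louder on the left."""
    _, _, L = stft(left, sr, nperseg=nperseg)
    _, _, R = stft(right, sr, nperseg=nperseg)
    ild = 20 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))  # per-cell level difference
    mask = (ild > ild_thresh_db).astype(float)
    _, est = istft(L * mask, sr, nperseg=nperseg)   # resynthesize from the left channel
    return est, mask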
CASA limitations
• Driven by local features
problems with masking, aperiodic sources...
• Limitations of T-F masking
need to identify single-source regions
cannot undo overlaps – leaves gaps
[Figure: masked resynthesis leaves gaps where the target is overlapped — from Hu & Wang ’04, huwang-v3n7.wav]
Auditory “Illusions”
• How do we explain illusions?
pulsation threshold
sinewave speech
phonemic restoration
[Figure: spectrograms of pulsation-threshold, sinewave-speech, and phonemic-restoration stimuli]
• Something is providing the missing (illusory) pieces ... source models
Adding Top-Down Constraints (Ellis ’96)
• Bottom-up CASA: limited to what’s “there”
[Diagram: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
• Top-down predictions allow illusions
[Diagram: input mixture → front end → signal features → compare & reconcile (prediction errors) → hypothesis management → predict & combine (noise components, periodic components) → predicted features]
match observations to a “world-model”...
Separation vs. Inference
• Ideal separation is rarely possible
i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
P({Si} | X) ∝ P(X | {Si}) · P({Si})
(combination physics × source models)
• Better source models → better inference
.. learn from examples?
Simple Source Separation (Roweis ’01, ’03; Kristjansson ’04, ’06)
• Given models for sources, find “best” (most likely) states for spectra:
combination model: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
inference of source state: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
can include sequential constraints...
different domains for combining c and defining Σ
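An illustrative sketch of the per-frame inference step above for two codebook (VQ) source models combined additively under a shared spherical covariance; the codebook form and the brute-force search over all state pairs are assumptions of this sketch, not a specific published system.

import numpy as np

def infer_states(X, C1, C2, sigma2=1.0):
    """Per-frame MAP state pairs for two codebook source models.

    X  : (T, F) observed spectra of the mixture
    C1 : (K1, F) codewords for source 1; C2 : (K2, F) codewords for source 2
    Model: p(x | i1, i2) = N(x; C1[i1] + C2[i2], sigma2 * I)
    """
    combos = C1[:, None, :] + C2[None, :, :]          # all K1*K2 combined spectra
    states = np.zeros((X.shape[0], 2), dtype=int)
    for t in range(X.shape[0]):
        err = np.sum((X[t] - combos) ** 2, axis=-1)   # proportional to -log p(x | i1, i2)
        states[t] = np.unravel_index(np.argmin(err), err.shape)
    return states   # sequential (HMM) constraints could refine these per-frame picks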
• E.g. stationary noise:
[Figure: mel spectrograms of original speech, speech in speech-shaped noise (mel magSNR = 2.41 dB), and VQ inferred states (mel magSNR = 3.6 dB)]
Can Models Do CASA?
• Source models can learn harmonicity, onset
... to subsume rules/representations of CASA
[Figure: VQ800 codebook over mel bands, linear distortion measure]
can capture spatial info too [Pearlmutter & Zador’04]
• Can also capture sequential structure
e.g. consonants follow vowels
... like people do?
• But: need source-specific models
... for every possible source
use model adaptation? [Ozerov et al. 2005]
Separation or Description?
• Are isolated waveforms required?
clearly sufficient, but may not be necessary
not part of perceptual source separation!
• Integrate separation with application?
e.g. speech recognition
[Diagram: separate-then-recognize — mix → identify target energy → T-F masking + resynthesis → ASR → words, using speech models;
vs. recognize directly — mix → identify/find best words model, using speech models as source knowledge]
words output = abstract description of signal
Missing Data Recognition (Cooke et al. ’01)
• Speech models p(x|m) are multidimensional...
i.e. means, variances for every freq. channel
need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk:
p(xk | m) = ∫ p(xk, xu | m) dxu
• Hence, missing data recognition:
P(x | q) = P(x1 | q) · P(x2 | q) · P(x3 | q) · ... over the dimensions marked present
(present-data mask over dimension × time; masked dimensions can still contribute a bound, p(xk | xu < y))
hard part is finding the mask (segregation)
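A small sketch of the marginalized likelihood above for a diagonal-Gaussian state: for such a model, integrating out the missing dimensions amounts to simply dropping them from the sum (the bounded-marginalization refinement p(xk | xu < y) is omitted; all names are illustrative).

import numpy as np

def missing_data_loglik(x, present, mean, var):
    """log p(x_present | state) for a diagonal Gaussian, marginalizing out
    (i.e. dropping) the dimensions whose present-data mask entry is 0."""
    k = np.asarray(present, dtype=bool)
    d = x[k] - mean[k]
    return -0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k]))

# x, present : (F,) observed log-spectrum and present-data mask (1 = reliable)
# mean, var  : (F,) Gaussian parameters of one model state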
The Speech Fragment Decoder (Barker et al. ’05)
• Standard classification chooses between models M to match source features X:
M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• Mixtures: observed features Y, segregation S, all related by P(X | Y, S)
[Diagram: observation Y(f), source X(f), and segregation S over frequency — match the ‘uncorrupt’ spectrum to ASR models using missing data]
• Joint classification of model M and segregation S, to maximize:
P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
(isolated source model × segregation model; P(X) is no longer constant)
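A toy sketch of the joint search idea (not the decoder of Barker et al.): given a handful of candidate spectro-temporal fragments, try every assignment of fragments to the target, score the implied present-data mask with a missing-data likelihood under each source model, and keep the best (model, segregation) pair. The fragment set, the diagonal-Gaussian models, and the exhaustive enumeration are all illustrative assumptions.

import numpy as np
from itertools import product

def best_model_and_segregation(Y, fragments, models):
    """Toy joint search over (model, segregation).

    Y         : (T, F) observed mixture features
    fragments : list of (T, F) boolean masks, each one coherent T-F fragment
    models    : dict name -> (mean, var), diagonal-Gaussian models over F dims
    """
    def loglik(mask, mean, var):
        # missing-data score: per-cell log-likelihoods summed over present cells only
        d = (Y - mean) ** 2 / var + np.log(2 * np.pi * var)
        return -0.5 * np.sum(d[mask])

    best = (-np.inf, None, None)
    for assign in product([False, True], repeat=len(fragments)):
        if not any(assign):
            continue                       # skip the empty segregation hypothesis
        mask = np.zeros_like(Y, dtype=bool)
        for frag, use in zip(fragments, assign):
            if use:
                mask |= frag
        for name, (mean, var) in models.items():
            score = loglik(mask, mean, var)     # the real decoder also adds log P(S|Y)
            if score > best[0]:
                best = (score, name, assign)
    return best   # (score, best model, which fragments belong to the target)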
Using CASA cues
• P(S|Y) links acoustic information to segregation
is this segregation worth considering? how likely is it?
• CASA can help search
consider only segregations made from CASA chunks
(e.g. a time-frequency region must be kept whole)
• CASA can rate segregation
construct P(S|Y) to reward CASA qualities:
frequency proximity, common onset, harmonicity
Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
Periodicity-based
Model-based
Music signals
5. Concluding Remarks
Current CASA (Hu & Wang ’03)
• State-of-the-art bottom-up separation
noise robust pitch track
label T-F cells by pitch
extensions to unvoiced transients by onset
[Figure: spectrograms of the input mixture and the separated output]
Prediction-Driven CASA (Ellis ’96)
• Identify objects in real-world scenes using “sound elements”
[Figure: 10-second city-street sound scene decomposed into sound elements (Wefts 1–4, Weft 5, Wefts 6,7, Weft 8, Wefts 9–12, Noise 1, Noise 2 + Click 1), with element labels and scores: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]
Singing Voice Separation (Avery Wang ’94)
• Pitch tracking + harmonic separation
[Figure: spectrograms before and after harmonic separation, 0–8 kHz over 8 s]
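A simplified sketch of pitch-based harmonic separation in the spirit of the slide above: given a per-frame pitch track, keep STFT bins near its harmonics and resynthesize. The pitch track is assumed to be supplied, and the tolerance and STFT settings are illustrative; this is not Wang’s tracking system.

import numpy as np
from scipy.signal import stft, istft

def harmonic_separation(mix, sr, f0_track, nperseg=1024, tol=0.03):
    """Keep STFT bins lying within +/- tol (relative) of harmonics of the
    per-frame pitch track f0_track (Hz); 0 marks unvoiced frames."""
    f, _, X = stft(mix, sr, nperseg=nperseg)
    mask = np.zeros(X.shape, dtype=float)
    for j, f0 in enumerate(f0_track[:X.shape[1]]):
        if f0 <= 0:
            continue                                   # unvoiced frame: keep nothing
        for h in np.arange(f0, f[-1], f0):             # harmonic frequencies
            mask[np.abs(f - h) < tol * h, j] = 1.0
    _, voice = istft(X * mask, sr, nperseg=nperseg)
    _, accomp = istft(X * (1 - mask), sr, nperseg=nperseg)
    return voice, accomp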
Periodic/Aperiodic Separation (Virtanen ’03)
• Harmonic structure + repetition of drums
[Figure: spectrograms of the original signal and the separated periodic (harmonic) and aperiodic (drum) components]
“Speech Separation Challenge”
• Mixed and Noisy Speech ASR task
defined by Martin Cooke and Te-Won Lee
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white at M 5 soon"
t5_bwam5s_m5_bbilzp_6p1.wav
• Results to be presented at Interspeech ’06
http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm
• See also “Statistical And Perceptual Audition”
workshop
http://www.sapa2006.org/
IBM’s “Superhuman” Separation (Kristjansson et al., Interspeech ’06)
• Optimal inference on mixed spectra
model each speaker (512-mixture GMM)
[Embedded paper excerpt (ASRU 2003): a high frequency-resolution speech model in the log-spectrum domain exploits the harmonic structure of speech, reconstructing the signal from posteriors of the clean signal plus the noisy phases; reports an 8.38 dB SNR gain enhancing speech at 0 dB, and on Aurora 2 at 0 dB SNR a 43.75% relative word-error-rate reduction over the baseline (15.90% over the equivalent low-resolution algorithm). Its figure shows the noisy, clean, and estimated clean vectors, with larger uncertainty in the valleys between harmonic peaks where local SNR is lower.]
• Applied to Speech Separation Challenge:
o Infer speakers and gain
o Reconstruct speech
o Recognize as normal...
• Use grammar constraints (SDL recognizer)
[Figure: word error rate (%) vs. target-to-masker ratio (6, 3, 0, −3, −6, −9 dB) for same-gender speakers, comparing no dynamics, acoustic dynamics, grammar dynamics, and human listeners]
Transcription as Separation
• Transcribe piano recordings by classification
train SVM detectors for every piano note
88 separate detectors, independent smoothing
• Trained on player piano recordings
[Figure: transcription of a Bach 847 Disklavier recording — pitch (A1–A6) vs. time, level/dB]
• Use transcription to resynthesize...
Piano Transcription Results
• Significant improvement from classifier:
Table 1: Frame-level transcription results.
Algorithm            Errs    False Pos   False Neg   d′
SVM                  43.3%   27.9%       15.4%       3.44
Klapuri & Ryynänen   66.6%   28.1%       38.5%       2.71
Marolt               84.6%   36.5%       48.1%       2.35
• Breakdown by frame type:
[Figure: classification error % (false positives and false negatives) vs. number of notes present (1–8)]
• Overall Accuracy: a frame-level version of the metric proposed by Dixon [Dixon, 2000]:
Acc = N / (FP + FN + N)
where N is the number of correctly transcribed frames, FP the number of unvoiced frames transcribed as voiced, and FN the number of voiced frames transcribed as unvoiced
http://labrosa.ee.columbia.edu/projects/melody/
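A tiny sketch of the frame-level accuracy above, applied here to piano-roll cells (the note × frame layout and names are illustrative):

import numpy as np

def frame_accuracy(ref, est):
    """Dixon-style frame-level accuracy: Acc = N / (FP + FN + N),
    with N the count of correctly transcribed (note, frame) cells."""
    ref = np.asarray(ref, dtype=bool)    # ground-truth piano roll (notes x frames)
    est = np.asarray(est, dtype=bool)    # transcribed piano roll (notes x frames)
    n = np.sum(ref & est)                # correct detections
    fp = np.sum(~ref & est)              # spurious
    fn = np.sum(ref & ~est)              # missed
    return n / (fp + fn + n)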
Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks
Evaluation
Evaluation
• How to measure separation performance?
depends what you are trying to do
• SNR?
energy (and distortions) are not created equal
different nonlinear components [Vincent et al. ’06]
• Intelligibility?
rare for nonlinear processing to improve intelligibility
listening tests expensive
• ASR performance?
separate-then-recognize too simplistic;
ASR needs to accommodate separation
[Figure: net effect vs. aggressiveness of processing — reduced interference traded against increased artefacts and transmission errors, with an optimum in between]
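For reference, a minimal sketch of the plain SNR figure discussed above, which weights all error energy alike regardless of its perceptual effect:

import numpy as np

def snr_db(reference, estimate):
    """SNR of a separated estimate against the true source, in dB."""
    reference = np.asarray(reference, dtype=float)
    err = np.asarray(estimate, dtype=float) - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))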
Evaluating Scene Analysis
• Need to establish ground truth
subjective sources in real sound mixtures?
More Realistic Evaluation
• Real-world speech tasks
crowded environments
applications:
communication, command/control, transcription
• Metric
human intelligibility?
‘diarization’ annotation (not transcription)
[Figure: personal audio recording (speech + noise) — spectrogram with pitch track and speaker-active ground truth — ks-noisyspeech.wav]
Summary & Conclusions
• Listeners do well separating sound mixtures
using signal cues (location, periodicity)
using source-property variations
• Machines do less well
difficult to apply enough constraints
need to exploit signal detail
• Models capture constraints
learn from the real world
adapt to sources
• Separation feasible in certain domains
describing source properties is easier
Sources / See Also
• NSF/AFOSR Montreal Workshops ’03, ’04
www.ebire.org/speechseparation/
labrosa.ee.columbia.edu/Montreal2004/
as well as the resulting book...
• Hanse meeting:
www.lifesci.sussex.ac.uk/home/Chris_Darwin/Hanse/
• DeLiang Wang’s ICASSP’04 tutorial
www.cse.ohio-state.edu/~dwang/presentation.html
• Martin Cooke’s NIPS’02 tutorial
www.dcs.shef.ac.uk/~martin/nips.ppt
References 1/2
[Barker et al. ’05] J. Barker, M. Cooke, D. Ellis, “Decoding speech in the presence of other sources,” Speech
Comm. 45, 5-25, 2005.
[Bell & Sejnowski ’95] A. Bell & T. Sejnowski, “An information maximization approach to blind separation and
blind deconvolution,” Neural Computation, 7:1129-1159, 1995.
[Blin et al.’04] A. Blin, S. Araki, S. Makino, “A sparseness mixing matrix estimation (SMME) solving the
underdetermined BSS for convolutive mixtures,” ICASSP, IV-85-88, 2004.
[Bregman ’90] A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
[Brungart ’01] D. Brungart, “Informational and energetic masking effects in the perception of two simultaneous
talkers,” JASA 109(3), March 2001.
[Brungart et al. ’01] D. Brungart, B. Simpson, M. Ericson, K. Scott, “Informational and energetic masking effects in
the perception of multiple simultaneous talkers,” JASA 110(5), Nov. 2001.
[Brungart et al. ’02] D. Brungart & B. Simpson, “The effects of spatial separation in distance on the informational
and energetic masking of a nearby speech signal”, JASA 112(2), Aug. 2002.
[Brown & Cooke ’94] G. Brown & M. Cooke, “Computational auditory scene analysis,” Comp. Speech & Lang. 8
(4), 297–336, 1994.
[Cooke et al. ’01] M. Cooke, P. Green, L. Josifovski, A.Vizinho, “Robust automatic speech recognition with missing
and uncertain acoustic data,” Speech Communication 34, 267-285, 2001.
[Cooke’06] M. Cooke, “A glimpsing model of speech perception in noise,” submitted to JASA.
[Darwin & Carlyon ’95] C. Darwin & R. Carlyon, “Auditory grouping” Handbk of Percep. & Cogn. 6: Hearing, 387–
424, Academic Press, 1995.
[Ellis’96] D. Ellis, “Prediction-Driven Computational Auditory Scene Analysis,” Ph.D. thesis, MIT EECS, 1996.
[Hu & Wang ’04] G. Hu and D.L. Wang, “Monaural speech segregation based on pitch tracking and amplitude
modulation,” IEEE Tr. Neural Networks, 15(5), Sep. 2004.
[Okuno et al. ’99] H. Okuno, T. Nakatani, T. Kawabata, “Listening to two simultaneous speeches,” Speech
Communication 27, 299–310, 1999.
References 2/2
[Ozerov et al. ’05] A. Ozerov, P. Phillippe, R. Gribonval, F. Bimbot, “One microphone singing voice separation using
source-adapted models,” Worksh. on Apps. of Sig. Proc. to Audio & Acous., 2005.
[Pearlmutter & Zador ’04] B. Pearlmutter & A. Zador, “Monaural Source Separation using Spectral Cues,” Proc.
ICA, 2005.
[Parra & Spence ’00] L. Parra & C. Spence, “Convolutive blind source separation of non-stationary sources,” IEEE
Tr. Speech & Audio, 320-327, 2000.
[Reyes et al. ’03] M. Reyes-Gómez, B. Raj, D. Ellis, “Multi-channel source separation by beamforming trained with
factorial HMMs,” Worksh. on Apps. of Sig. Proc. to Audio & Acous., 13–16, 2003.
[Roman et al. ’02] N. Roman, D.-L. Wang, G. Brown, “Location-based sound segregation,” ICASSP, I-1013-1016,
2002.
[Roweis ’03] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” EuroSpeech, 2003.
[Schimmel & Atlas ’05] S. Schimmel & L. Atlas, “Coherent Envelope Detection for Modulation Filtering of Speech,”
ICASSP, I-221-224, 2005.
[Slaney & Lyon ’90] M. Slaney & R. Lyon, “A Perceptual Pitch Detector,” ICASSP, 357-360, 1990.
[Smaragdis ’98] P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Intl. Wkshp. on
Indep. & Artif. Neural Networks, Tenerife, Feb. 1998.
[Seltzer et al. ’02] M. Seltzer, B. Raj, R. Stern, “Speech recognizer-based microphone array processing for robust
hands-free speech recognition,” ICASSP, I–897–900, 2002.
[Varga & Moore ’90] A.Varga & R. Moore, “Hidden Markov Model decomposition of speech and noise,” ICASSP,
845–848, 1990.
[Vincent et al. ’06] E.Vincent, R. Gribonval, C. Févotte, “Performance measurement in Blind Audio Source
Separation.” IEEE Trans. Speech & Audio, in press.
[Yilmaz & Rickard ’04] O.Yilmaz & S. Rickard, “Blind separation of speech mixtures via time-frequency masking,”
IEEE Tr. Sig. Proc. 52(7), 1830-1847, 2004.