Computational Auditory Scene Analysis 1. 2.

advertisement
Computational Auditory
Scene Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. The Scene Analysis problem
2. ASA and CASA
3. Issues in CASA
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 1 /21
LabROSA Overview
Information
Extraction
Music
Environment
Scene
Analysis
Signal
Processing
Machine
Learning
Speech
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 2 /21
1. Scene Analysis
• Recover individual sources from scenes
.. duplicate the perceptual effect
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
0
2
4
6
8
10
12
time/s
-60
level / dB
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
Choir
Strings
• Problems competing sources, channel effects
• Dimensionality loss
need additional constraints
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 3 /21
Scene Analysis Systems
• “Scene Analysis”
not necessarily separation, recognition, ...
scene = overlapping objects, ambiguity
• General Framework:
distinguish input and output representations
distinguish engine (algorithm) and control
(constraints, “computational model”)
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 4 /21
Human and Machine Scene Analysis
• CASA (Brown’92 et seq.):
Input: Periodicity, continuity, onset “maps”
Output: Waveform (or mask)
Engine: Time-frequency masking
Control: “Grouping cues” from input
- or: spatial features (Roman, ...)
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 5 /21
Human and Machine Scene Analysis
• CASA (e.g. Brown’92):
• ICA (Bell & Sejnowski et seq.):
Input: waveform (or STFT)
Output: waveform (or STFT)
Engine: cancellation
Control: statistical independence of outputs
- or energy minimization for beamforming
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 6 /21
Human and Machine Scene Analysis
• CASA (e.g. Brown’92):
• ICA (Bell & Sejnowski et seq.):
• Human Listeners:
Input: excitation patterns ...
Output: percepts ...
Engine: ?
Control: find a plausible explanation
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 7 /21
2. Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
break mixture into small elements (in time-freq)
elements are grouped in to sources using cues
sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
cues: common onset/offset/modulation,
harmonicity, spatial location, ...
Onset
map
Sound
Frequency
analysis
Elements
Harmonicity
map
Sources
Grouping
mechanism
Source
properties
Spatial
map
(after Darwin 1996)
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 8 /21
Grouping cues
freq / Hz
• Main cues: Harmonicity + Onset
8000
6000
4000
2000
0
0
1
2
3
4
5
6
not necessarily consistent!
7
8
9 time / s
(from Pierce 1980)
• Other cues:
spatial information
‘schema’ – learned patterns
• Cues ≈ constraints
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 9 /21
Bottom-up CASA
(Brown’92, Hu & Wang’02)
Segment
Group
input
mixture
Front end
signal
features
(maps)
Object
formation
discrete
objects
Grouping
rules
Source
groups
freq
onset
time
period
frq.mod
• Literal implementation of psychoacoustics
segment time-frequency into elements
group into sources
• Output via time-frequency masking
i.e. time-varying filter
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 10/21
Correlogram front-end
•
(Slaney’90 et seq.)
"#$#!%&'()*+(,!-&'.+//0(1
Periodic modulation as 3rd separating axis
2
"'&&+3'1&456
envelope
to handle unresolved harmonics
7''/+38!94/+,!'(!:(';(<-'//093+!-=8/0'3'18
y
ela
lin
e
short-time
autocorrelation
d
sound
Cochlea
filterbank
frequency channels
"'&&+3'1&45
/30.+
envelope
follower
freq
lag
time
- linear filterbank cochlear approximation
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 11/21
- static nonlinearity
“Weft” Periodic Elements
(Ellis’96)
• Represent harmonics without grouping?
Correlogram
Smooth Spectrum
slices
freq
freq
lag
time
Period Track
commonperiod
feature
period
time
time
hard to separate multiple pitch tracks
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 12/21
CASA Output
• Time-Frequency masked reconstruction
/012%B%CD%E%F514%8:
)*$+%&%,-.
)*$+%&%,-.
/012%3*"4"156%#"7
=
=
2
:8
<
<
>8
?
?
8
;
;
B>8
0
0
B:8
:
:
B08
>
>
B;8
2
8
8
89:
89;
89<
89=
>
>9:
>9;
!"#$%&%'$(
8
8
89:
89;
89<
89=
>
>9:
>9;
!"#$%&%'$(
6$/$6%9&%@A
works surprisingly well (for speech?)
cannot undo overlapping energy (< 20%?)
applicable to reverberation also?
• Or: parametric resynthesis
e.g. ‘wefts’, speech synthesizer
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 13/21
Challenges
for CASA
"#$%&'()!*+,-!.%$,,$(/012!3454
freq
• Circumscribing time-frequency elements
need to have ‘regions’, but hard to find
• Periodicity is the primary cue
how to handle aperiodic energy?
• Bottom-up leaves no ambiguity or context
how to model illusions/interpolations?
• Need to group over longer timespans
time
6
3+#70()7#+%+89!,+('/:#';0'87<!'&'('8,)
- need to have ‘regions’, but hard to find
6
"'#+$=+7+,<!+)!,-'!1#+(>#<!70'
- how to handle aperiodic energy?
6
?')<8,-')+)!@+>!(>)A'=!!&,'#+89
- cannot separate within a single t-f element
6
B$,,$(/01!&'>@')!8$!>(%+90+,<!$#!7$8,'C,
- how to model illusions?
local properties not enough
E6820 SAPR - Dan Ellis
L11 - Signal Separation
Comp. Aud. Scene Analysis - Dan Ellis
2005-04-13 - 15
2005-06-30 - 14/21
Model-based integration
!"#$%&'()*&+,*-'$
high-level constraints?
• How to represent
How to integrate disparate fragments?
• “Speech fragment decoder” (Barker et al. ’05)
.
/-%-(-'$)S)(,)01,2&)3"#$%&'(4)
%#5&4)167,(1&4-4)4&#"+1)("#+(#82&9
Fragments
Hypotheses
w1
w5
w2
w6
w3
w7
w8
w4
time
;<
;=
;>
;?
;@
;A
model 9of=>*'=2$*?$?1"5@2#0($12!2=0($P(S|Y)$A$P(X|M)
source (e.g. speech recognition HMM)
'B2B$C2(0$=*@C'#"0'*#$*?$(25125"0'*#
to say which
parts go together
"#,$@"0=>$0*$(D22=>$@*,2&(
Comp. Aud. Scene Analysis - Dan Ellis
.
2005-06-30 - 15/21
:&"$-'$)167,(1&4&4)2-%-(4)47#+&)*&%#'*4
9 BB$C+0$21"(2($(D2='"=$>'(0*1E
Disambiguation
• Scene multiple possible explanations
Analysis choose most reasonable one
• Most reasonable means...
consistent with grouping cues (CASA)
independent sources (ICA)
consistent with experience ... (human)
max P({Si}| X) ∝ P(X |{Si}) P({Si})
combination physics source models
• i.e. some kind of constraints to disambiguate
Learning as the source of this disambiguation
knowledge
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 16/21
Recognition for Separation
• Speech recognizers embody knowledge
trained on 100s of hours of speech
useEstimation
them as a ‘reasonableness’
measure
Target
Iteration & Filter
Training
12
MIC1
ASR
Alignment
Target 1
MICM
Filter 1
Speaker 1
Filter M
Speaker 1
!
FE
My
Ms
Error
MIC1
MICM
!
speech recognizer’s best-match provides
optimization target
MIC1
Comp. Aud.
Scene Analysis
Target 2 - Dan Ellis
Factorial
Processing
MICM
Filter 1
Speaker 2
Filter M
!
My
2005-06-30
- 17/21Ms
FE
Error
from Manuel Reyes’s
WASPAA 2003
presentation
• e.g. Seltzer, Raj, Reyes:
Obliteration and Outputs
• Perfect separation is rarely possible
e.g. cancellation after psychoacoustic coding?
strong interference will obliterate part of target
• What should the output be?
can fill-in missing-data holes using source models
- ‘pretend’ we observed the full signal
- but: hides observed/inferred distinction
output internal model state instead?
- e.g. ASR output
- synthesize with “minimally informative noise”
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 18/21
Practical CASA?
• When will CASA be useful?
• Obstacles:
Transmission
errors
no agreed way to measure progress!
intelligibility is a novel idea
Net
effect
graceful degradation
- effect of distortions
unpitched sounds
optimum
computation
- look-ahead
integration with multichannel techniques
- sequential or all-at-once?
Comp. Aud. Scene Analysis - Dan Ellis
Increased
artefacts
Reduced
interference
Agressiveness
of processing
2005-06-30 - 19/21
Current CASA work
• Handling unvoiced events (OSU)
• Partial recognition & grouping (Sheffield)
• Model-based separation (Columbia)
• Spatial cues?
• Dereverberation?
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 20/21
Conclusions
• Framework for scene analysis
Input, Output, Engine, Control
• Auditory Scene Analysis
in humans and machines
• Scene analysis as Disambiguation
finding the additional constraints
• Big problems still to overcome
Comp. Aud. Scene Analysis - Dan Ellis
2005-06-30 - 21/21
Download