Sound, Mixtures, and Learning: Lab overview

Sound, Mixtures, and Learning:
LabROSA overview
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
Electrical Engineering Dept., Columbia University, New York
http://labrosa.ee.columbia.edu/
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Music Analysis & Similarity
4. General Sound Organization
5. Future Work
Sound, mixtures, learning - Dan Ellis
2003-03-20 - 1/33
Auditory Scene Analysis
[Figure: spectrogram (frq/Hz vs. time/s, level/dB) analyzed into sources: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
  - ... like listeners do
• Hearing is ecologically grounded
  - reflects ‘natural scene’ properties
  - subjective, not absolute
Sound, mixtures, and learning
[Figure: spectrograms (freq/kHz vs. time/s): Speech + Noise = Speech+Noise mixture]
• Sound
  - carries useful information about the world
  - complements vision
• Mixtures
  - ... are the rule, not the exception
  - medium is ‘transparent’, sources are many
  - must be handled!
• Learning
  - the ‘speech recognition’ lesson: let the data do the work
  - like listeners
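The ‘transparent medium’ point can be illustrated numerically: waveforms of co-occurring sources simply add, and for independent sources spectral powers add to a close approximation, which is why the mixture spectrogram looks like an overlay of its sources. A toy numpy sketch (the 300 Hz tone standing in for speech and the noise level are arbitrary illustration choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 8000
t = np.arange(fs) / fs

speech = np.sin(2 * np.pi * 300 * t)       # toy stand-in for a speech signal
noise = 0.5 * rng.standard_normal(fs)      # wideband background noise
mix = speech + noise                       # transparent medium: waveforms add

power = lambda x: np.sum(np.abs(np.fft.rfft(x)) ** 2)

# For independent sources the cross term is small, so spectral powers
# add to a close approximation:
rel_err = abs(power(mix) - (power(speech) + power(noise))) / power(mix)
print(rel_err)
```

The small residual is the speech-noise cross term, which averages toward zero for independent signals.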
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
  - two sensors, N signals ... underconstrained
• Disentangling mixtures as the primary goal?
  - perfect solution is not possible
  - need experience-based constraints
Human Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
  - break mixture into small elements (in time-freq)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram (after Darwin, 1996): frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties]
Cues to simultaneous grouping
• Elements + attributes
[Figure: spectrogram of a sound scene, freq/Hz vs. time/s]
• Common onset
  - simultaneous energy has common source
• Periodicity
  - energy in different bands with same cycle
• Other cues
  - spatial (ITD/IID), familiarity, ...
The effect of context
Context can create an ‘expectation’:
i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle: a change in a signal will be interpreted as an added source whenever possible
[Figure: tone + noise-burst stimuli (frequency/kHz vs. time/s)]
  - a different division of the same energy depending on what preceded it
Computational Auditory Scene Analysis
(CASA)
[Diagram: sound mixture → CASA → Object 1 / Object 2 / Object 3 ... descriptions]
• Goal: automatic sound organization; systems to ‘pick out’ sounds in a mixture
  - ... like people do
• e.g. voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping ‘rules’
  - ... just implement them?
The Representational Approach
(Brown & Cooke 1993)
• Implement psychoacoustic theory
[Diagram: input mixture → Front end (maps: freq, onset/time, period, frq.mod) → signal features → Object formation → discrete objects → Grouping rules → Source groups]
  - ‘bottom-up’ processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms (frq/Hz vs. time/s) of the mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif)]
Restoration in sound perception
• Auditory ‘illusions’ = hearing what’s not there
• The continuity illusion
[Figures: continuity-illusion stimulus (f/Hz vs. time/s) and sine-wave speech (SWS, f/Bark vs. time/s)]
  - duplex perception
• How to model in CASA?
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
[Diagram: input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups]
  - objects irresistibly appear
• vs. Prediction-driven (top-down)
[Diagram: input mixture → Front end → signal features → Compare & reconcile ↔ Predict & combine (noise components, periodic components), under hypothesis management; prediction errors feed back, predicted features feed forward]
  - match observations with parameters of a world-model
  - need world-model constraints...
Prediction-Driven CASA
(Ellis 1996)
• Explain a complex sound with basic elements
[Figure: ‘City’ sound example decomposed into basic elements (f/Hz vs. time/s, level/dB): Wefts 1-12, Noise1, Noise2, Click1; listener identification rates: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]
Approaches to sound mixture recognition
• Recognize combined signal
  - ‘multicondition training’
  - combinatorics...
• Separate signals
  - e.g. CASA, ICA
  - nice, if you can do it
• Segregate features into fragments
  - then missing-data recognition
Aside: Evaluation
• Evaluation is a big problem for CASA
  - what is the goal, really?
  - what is a good test domain?
  - how do you measure performance?
• SNR improvement
  - not easy given only before-after signals: correspondence problem
  - can do with fixed filtering mask; rewards removing signal as well as noise
• ASR improvement
  - recognizers typically very sensitive to artefacts
• ‘Real’ task?
  - mixture corpus with specific sound events...
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
  - the information in speech
  - Meeting Recorder project
  - speech fragment decoding
3. Music Analysis & Similarity
4. General Sound Organization
5. Future Work
The information in speech
(Patricia Scanlon)
• Mutual Information identifies where the information is in time/frequency:
  - little temporal structure averaged over all sounds
[Figure: phoneme info across time/frequency]
• Better with just vowels:
[Figure: phoneme info and speaker info across time/frequency]
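A plug-in mutual-information estimate over discretized features and labels can sketch the kind of analysis described here (toy labels below; Scanlon's actual time/frequency features and corpus are not reproduced):

```python
import numpy as np

def mutual_info(x, y):
    """Plug-in estimate of I(X;Y) in bits for discrete label arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

labels = np.array([0, 0, 1, 1] * 100)        # stand-in 'phoneme' labels
informative = labels.copy()                  # feature that mirrors the label
useless = np.arange(400) % 2                 # pattern independent of the label

mi_informative = mutual_info(informative, labels)   # 1 bit: fully predictive
mi_useless = mutual_info(useless, labels)           # 0 bits: no information
print(mi_informative, mi_useless)
```

Applied per time/frequency cell against phone (or speaker) labels, such an estimate maps where the information lives.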
The best subword units?
(Eric Fosler)
• Speech recognizers typically use phonemes
  - inherited from linguistics
• An alternative approach is ‘articulatory features’
  - orthogonal attributes defining subwords
• Can we infer a feature set from the data?
  - using e.g. Independent Component Analysis
[Figure: spectrogram of TIMIT test/dr1/faks0/sa2 with principal-component and independent-component basis vectors (vs. frequency/Bark) and phone labels over time]
The Meeting Recorder Project
(CompSci, ICSI, UW, IDIAP, SRI, IBM)
• Microphones in conventional meetings
  - for summarization/retrieval/behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, IDIAP):
  - 100 hours collected, ongoing transcription
• NSF ‘Mapping Meetings’ project
  - also interest from NIST, DARPA, EU
Speaker Turn detection
(Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop mic timing differences
  - use normalized peak value for confidence
[Figures: mr-2000-11-02-1440 PZM cross-correlation lags (lag 1-2 vs. lag 3-4, ms) and an example cross-coupling response, chan3 to chan0]
• Behavioral: look for patterns of speaker turns
[Figure: mr04 hand-marked speaker turns per participant vs. time (min), with auto/manual boundaries]
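The acoustic cue can be sketched with a plain cross-correlation lag estimate between two channels. This toy version (synthetic white-noise ‘talker’, arbitrary sample delay — not the actual PZM processing) shows the lag-plus-normalized-peak idea:

```python
import numpy as np

fs = 4000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)                 # 1 s synthetic 'talker' signal
delay = 10                                    # samples (~2.5 ms) between mics
mic1 = src
mic2 = np.concatenate([np.zeros(delay), src[:-delay]])

# Cross-correlate; the peak lag estimates the inter-mic timing difference
xc = np.correlate(mic2, mic1, mode="full")
lag = int(np.argmax(xc)) - (len(mic1) - 1)

# Normalized peak height as a confidence score (1.0 = perfectly coherent)
conf = xc.max() / (np.linalg.norm(mic1) * np.linalg.norm(mic2))
print(lag, float(conf))
```

Timing differences from several mic pairs can then be triangulated into a rough talker position, with the normalized peak gating unreliable frames.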
Speech Fragment recognition
(Barker & Cooke/Sheffield)
• Standard classification chooses between models M to match source features X:

    M* = argmax_M P(M|X) = argmax_M P(X|M) · P(M) / P(X)

• Mixtures → observed features Y, segregation S, all related by P(X|Y,S)
[Figure: observation Y(f) composed of source X(f) under segregation S, vs. freq]
  - spectral features allow clean relationship
• Joint classification of model and segregation:

    P(M,S|Y) = P(M) · ∫ P(X|M) · P(X|Y,S) / P(X) dX · P(S|Y)
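A minimal missing-data classification sketch, in the spirit of the fragment decoder but far simpler: diagonal-Gaussian models are scored only on the spectral channels that a segregation mask labels reliable. All models, masks, and values below are invented for illustration:

```python
import numpy as np

# Invented 4-channel 'spectral' models: source A is low-frequency-dominant,
# source B high-frequency-dominant; unit-variance diagonal Gaussians.
means = {"A": np.array([10.0, 2.0, 2.0, 2.0]),
         "B": np.array([2.0, 2.0, 2.0, 10.0])}

def log_lik(y, mask, mu, var=1.0):
    """Missing-data log-likelihood: score only mask-reliable channels."""
    d = (y - mu)[mask]
    return -0.5 * np.sum(d ** 2) / var - 0.5 * mask.sum() * np.log(2 * np.pi * var)

y = np.array([10.0, 2.0, 9.0, 9.0])              # source A plus noise up high
mask = np.array([True, True, False, False])      # segregation: top channels unreliable
best = max(means, key=lambda m: log_lik(y, mask, means[m]))
print(best)  # "A": intrusion in the masked channels cannot mislead the decision
```

The full decoder additionally searches over candidate segregations S, weighting each hypothesis by P(S|Y) as in the joint-classification formula above.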
Multi-source decoding
• Search for more than one source
[Diagram: observations Y(t) divided by masks S1(t), S2(t) feeding state sequences q1(t), q2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
  - locally coherent regions
• Theoretical vs. practical limits
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Music Analysis & Similarity
  - musical structure analysis
  - similarity browsing
4. General Sound Organization
5. Future Work
Music Structure Analysis
(Alex Sheh)
• Fine-level information from music
  - for searching
  - for modeling/statistics
• e.g. chord sequences via PCPs (pitch class profiles):
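A PCP can be sketched by folding spectral energy onto the 12 semitone classes. This toy version (A4 = 440 Hz reference, simple bin-wise folding — not Sheh's actual front end) recovers the pitch classes of a synthesized A-major triad:

```python
import numpy as np

def pcp(x, fs, n_fft=4096):
    """Fold spectral energy onto 12 pitch classes (class 0 = A, A4 = 440 Hz)."""
    spec = np.abs(np.fft.rfft(x[:n_fft] * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], spec[1:]):        # skip the DC bin
        pc = int(np.round(12 * np.log2(f / 440.0))) % 12
        chroma[pc] += m ** 2
    return chroma

fs = 8000
t = np.arange(fs) / fs
# Synthesized A-major triad: A4 (440 Hz), C#5 (554.37 Hz), E5 (659.26 Hz)
x = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 554.37, 659.26))
top3 = set(int(i) for i in np.argsort(pcp(x, fs))[-3:])
print(sorted(top3))  # pitch classes 0 (A), 4 (C#), 7 (E)
```

Chord labeling then reduces to matching per-frame PCP vectors against chord templates (or HMM chord models) over time.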
Ground truth for Music Recordings
(Rob Turetsky)
• Machine-learning algorithms need labels
  - but real recordings don’t have labels
• MIDI ‘replicas’ exist
• Alignment locates MIDI notes in real sound:
[Figure: “Don't you want me” (Human League), verse 1: spectrograms of the original and the MIDI replica (freq/kHz vs. time/sec) with aligned MIDI notes (MIDI #)]
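One common way to perform such an alignment is dynamic time warping over feature sequences. This toy DTW on scalar ‘features’ (invented values, not the actual MIDI/audio features) shows that a tempo-warped rendition aligns at zero cost while a different tune does not:

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between two 1-D feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 4.0])       # 'original' feature track
warped = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 4.0])    # same tune, altered timing
other = np.array([5.0, 5.0, 6.0, 7.0, 7.0, 8.0])     # a different tune
print(dtw(ref, warped), dtw(ref, other))
```

Backtracking through the same cost matrix yields the warping path, i.e. where each MIDI note lands in the real recording.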
Music Similarity Browsing
(Adam Berenzweig)
• ‘Anchor models’: music on subjective axes
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Music Analysis & Similarity
4. General Sound Organization
  - alarm detection
  - sound texture modeling
  - recognition of multiple sources
5. Future Work
Alarm sound detection
• Alarm sounds have particular structure
  - people ‘know them when they hear them’
• Isolate alarms in sound mixtures
[Figure: three spectrograms (freq/Hz vs. time/sec) of alarms in sound mixtures]
  - sinusoid peaks have invariant properties
[Figure: speech + alarm spectrogram with Pr(alarm) trace vs. time/sec]
  - cepstral coefficients are easy to model
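A crude version of the ‘invariant sinusoid peak’ idea: flag analysis frames whose spectrum is dominated by a persistent narrowband peak. The frame size, threshold ratio, and the synthetic 1 kHz ‘alarm’ are all arbitrary illustration choices, not the actual detector:

```python
import numpy as np

def alarmish(x, fs, frame=1024, ratio=10.0):
    """Flag frames whose spectrum has a dominant narrowband peak."""
    flags = []
    for start in range(0, len(x) - frame + 1, frame):
        spec = np.abs(np.fft.rfft(x[start:start + frame] * np.hanning(frame)))
        flags.append(spec.max() > ratio * np.median(spec))  # peaky spectrum?
    return np.array(flags)

fs = 8000
rng = np.random.default_rng(0)
t = np.arange(2 * fs) / fs
x = rng.standard_normal(2 * fs)                      # background noise
x[fs:] += 3 * np.sin(2 * np.pi * 1000 * t[fs:])      # 1 kHz 'alarm' in second half

flags = alarmish(x, fs)
print(flags.astype(int))
```

Peak persistence across frames (and cepstral modeling of the background) would then turn these per-frame flags into a Pr(alarm) trace like the one in the figure.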
Sound Texture Modeling
(Marios Athineos)
• Best sound models are based on sinusoids
  - noise residual modeled quite simply
• Noise ‘textures’ have extra temporal structure
  - need a more detailed model
• Linear prediction of the spectrum defines a parametric temporal envelope:
[Figure: mpgr1-sx419, TDLPC envelope (60 poles / 300 ms) over the waveform, time/s]
• High-quality noise-excited resynthesis:
  - audio examples: original / resynth / x2 TSM / c/w PVOC
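The envelope above comes from linear prediction across frequency. As a hedged sketch of that idea: LPC applied to a signal's DCT yields an all-pole model of the signal's *temporal* envelope, so the model's poles localize energy bursts in time. The model order and the noise-burst test signal are invented toy choices, not the 60-pole/300 ms configuration in the figure:

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients [1, -a1, ..., -ap] via the autocorrelation method."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

n = 1000
t = np.arange(n) / n
rng = np.random.default_rng(0)
# Noise burst centred at t = 0.5: its temporal envelope is a Gaussian bump
x = np.exp(-(t - 0.5) ** 2 / (2 * 0.05 ** 2)) * rng.standard_normal(n)

# DCT-II of x, computed directly; the DCT index plays the role of 'time' for LPC
dct = np.cos(np.pi * np.outer(np.arange(n), np.arange(n) + 0.5) / n) @ x
a = lpc(dct, order=12)
env = 1.0 / np.abs(np.fft.rfft(a, 2 * n))   # all-pole envelope; bin k ~ time sample k
peak = int(np.argmax(env))
print(peak)  # index of the envelope peak, expected near the burst centre
```

Driving such an envelope with noise gives the kind of noise-excited texture resynthesis demonstrated above.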
Sound mixture decomposition
(Manuel Reyes)
• Full or approximate Bayesian inference to model multiple, independent sound sources:
[Diagram: generation model (source models M1, M2 → source signals Y1, Y2 → channels C1, C2 → received signals X1, X2 → observations O; model dependence Θ) and analysis structure (observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → likelihood evaluation of p(X|Mi,Ki))]
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Music Analysis & Similarity
4. General Sound Organization
5. Future Work
  - audio-visual information
  - real-world sound indexing
Future work: Automatic audio-video analysis
(Shih-Fu Chang, Kathy McKeown)
• Documentary archive management
  - huge ratio of raw-to-finished material
  - costly manual logging
• Problem: term ↔ signal mapping
  - training corpus of past annotations
  - interactive semi-automatic learning
[Diagram: MM Concept Network: annotations → text processing → terms; A/V data → A/V segmentation and feature extraction → A/V features; concept discovery plus unsupervised A/V-feature clustering yield generic detectors; multimedia fusion mines concepts & relationships for complex spatio-temporal classification and question answering]
The ‘Listening Machine’
• Smart PDA records everything
• Only useful if we have index, summaries
  - monitor for particular sounds
  - real-time description
• Scenarios
  - personal listener → summary of your day
  - future prosthetic hearing device
  - autonomous robots
• Meeting data, ambulatory audio
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Structuring
• Search
• Summarization
• Awareness
• Understanding