Sound, Mixtures, and Learning: LabROSA overview

1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Dan Ellis
Sound, Mixtures & Learning
2003-07-21
1 Sound Content Analysis

[Figure: spectrogram, frequency 0-4000 Hz vs. time 0-12 s, level in dB, feeding an Analysis stage that labels events as Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sound understanding: the key challenge
- what listeners do
- understanding = abstraction
• Applications
- indexing/retrieval
- robots
- prostheses
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Diagram, after Darwin 1996: frequency analysis feeding onset, harmonicity, and position maps, combined by a grouping mechanism into source properties]
Cues to simultaneous grouping

• Elements + attributes

[Figure: spectrogram, frequency 0-8000 Hz vs. time 0-9 s]
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
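The common-onset cue above can be sketched computationally: an onset map marks frames where a band's energy jumps, and bands whose onsets coincide are grouped. Everything below (band energies, threshold, the grouping rule) is an illustrative toy, not the grouping mechanism from the literature.

```python
import numpy as np

def onset_map(band_energy, thresh=0.5):
    """band_energy: (bands, frames). Mark frames where each band's energy
    rises by more than `thresh` (half-wave rectified first difference)."""
    rise = np.maximum(np.diff(band_energy, axis=1), 0.0)
    return rise > thresh

def group_by_common_onset(band_energy):
    """Group bands whose first onset coincides (a crude stand-in for
    the full grouping mechanism)."""
    onsets = onset_map(band_energy)
    groups = {}
    for band, row in enumerate(onsets):
        hits = np.flatnonzero(row)
        key = int(hits[0]) if hits.size else -1
        groups.setdefault(key, []).append(band)
    return groups

# Bands 0 and 2 switch on together; band 1 switches on later.
e = np.array([[0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
              [0.1, 0.1, 1.1, 1.1, 1.1, 1.1]])
groups = group_by_common_onset(e)
```

A real system would also tolerate small onset-time jitter and weigh this cue against harmonicity and location before committing to a grouping.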
Outline
1 Sound Content Analysis
2 Recognizing sounds
- Clean speech
- Speech-in-noise
- Nonspeech
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
2 Recognizing Sounds: Speech

• Standard speech recognition structure:

[Diagram: sound → feature calculation → feature vectors → acoustic classifier (using trained acoustic model parameters) → phone probabilities → HMM decoder (using word models, e.g. s-ah-t, and a language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]

• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
How ASR Represents Speech

• Markov model structure: states + transitions

[Diagram: word models M'1 and M'2 as state networks over phone states (S, A, T, K, E, O) with state transition probabilities (self-loops of 0.8-0.9, exits of 0.1-0.2), alongside the state models (means) and a 0-4 kHz spectrogram over 0-5 s]

• A generative model
- but not a good speech generator!
- only meant for inference of p(X|M)
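The quantity p(X|M) that these models are built to deliver comes from the forward algorithm, summing over all state paths. A minimal sketch on a toy discrete-observation HMM (all probabilities below are made up for illustration):

```python
import numpy as np

# Toy HMM: 2 states, 3 discrete observation symbols (illustrative values).
A = np.array([[0.8, 0.2],    # state transition probabilities
              [0.1, 0.9]])
B = np.array([[0.7, 0.2, 0.1],   # per-state observation probabilities
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward_likelihood(obs):
    """p(X|M): sum over all state paths via the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

p = forward_likelihood([0, 1, 2])
```

For continuous speech features the per-state observation terms B[:, o] would be replaced by Gaussian-mixture likelihoods, but the recursion is unchanged.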
General Audio Recognition
(with Manuel Reyes)

• Searching audio databases
- speech .. use ASR
- text annotations .. search them
- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
[Diagram: the query example and each entry in the sound segment database pass through segment feature analysis; the resulting feature vectors feed a search/comparison stage that returns results]
- features are global for each soundfile, no attempt to separate mixtures
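The proximity search described above amounts to weighted nearest-neighbor matching on one global feature vector per file. A toy sketch (feature values, file names, and weights are all illustrative):

```python
import numpy as np

# Illustrative database: one global feature vector per sound file
# (e.g. brightness, bandwidth, pitch), plus per-dimension weights.
db = {
    "door_slam":  np.array([0.2, 0.9, 0.1]),
    "bird_chirp": np.array([0.9, 0.3, 0.8]),
    "engine":     np.array([0.1, 0.6, 0.3]),
}
weights = np.array([1.0, 0.5, 2.0])  # user-tunable feature weighting

def query(example, k=2):
    """Rank database entries by weighted Euclidean distance to the query."""
    dists = {name: float(np.sqrt(np.sum(weights * (vec - example) ** 2)))
             for name, vec in db.items()}
    return sorted(dists, key=dists.get)[:k]

ranked = query(np.array([0.15, 0.8, 0.15]))
```

Adjusting the weights changes which perceptual dimensions dominate the ranking, which is the browser's main user control.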
Audio Recognition: Results

• Musclefish corpus
- most commonly reported set
• Features
- MFCC, brightness, bandwidth, pitch ...
- no temporal structure
• Results: 208 examples, 16 classes
Global features: 41% correct
HMM models: 81% correct

[Confusion matrices over five groups (Musical, Speech, Environ., Animals, Mechanical) for the two systems; per-cell counts not recoverable here]
What are the HMM states?

• No sub-units defined for nonspeech sounds
• Final states depend on structure, initialization
- number of states
- initial clusters / labels / transition matrix
- EM update objective
• Have ideas of what we’d like to get
- investigate features/initialization to get there
[Figure: spectrogram of dogBarks2, 0-8 kHz over roughly 1.15-2.25 s, overlaid with the decoded HMM state sequence (s2, s3, s4, s5, s7)]
Alarm sound detection
(Ellis 2001)

• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs

[Figure: spectrogram of s0n6a8+20, 0-4 kHz over 0-25 s, level in dB, with alarm events hrn01, bfr02, buz01 marked]
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
Alarms: Results

[Figure: Restaurant + alarms (SNR 0, 6 noises, 8 alarms): spectrogram with MLP classifier output and sound object classifier output, 0-4 kHz over roughly 20-50 s]
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:

Noise    | Neural net system      | Sinusoid model system
         | Del       Ins    Tot   | Del       Ins    Tot
1 (amb)  | 7 / 25      2    36%   | 14 / 25     1    60%
2 (bab)  | 5 / 25     63   272%   | 15 / 25     2    68%
3 (spe)  | 2 / 25     68   280%   | 12 / 25     9    84%
4 (mus)  | 8 / 25     37   180%   |  9 / 25   135   576%
Overall  | 22 / 100  170   192%   | 50 / 100  147   197%
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
- Auditory Scene Analysis
- Parallel model inference
4 Accessing large datasets
5 Music Information Retrieval
3 Organizing mixtures:
Approaches to handling overlapped sound

• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics..
• Recognize with parallel models
- full joint-state space?
- or: divide signal into fragments, then use missing-data recognition
Computational Auditory Scene Analysis:
The Representational Approach
(Cooke & Brown 1993)

• Direct implementation of psych. theory

[Diagram: input mixture → front end (maps of frequency, onset time, period, frequency modulation) → signal features → object formation → discrete objects → grouping rules → source groups]
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:

[Figure: spectrograms of the input mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif), 100-3000 Hz over 0.2-1.0 s]
Adding top-down constraints

Perception is not direct
but a search for plausible hypotheses

• Data-driven (bottom-up)...

[Diagram: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
- objects irresistibly appear

• vs. Prediction-driven (top-down)

[Diagram: input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features, fed back to the comparison]
- match observations with parameters of a world-model
- need world-model constraints...
Prediction-Driven CASA

[Figure: prediction-driven analysis of a city-street recording ("City", 50-4000 Hz, 0-9 s, level in dB) into wefts (Wefts 1-12), noise elements (Noise 1, Noise 2) and a click (Click 1); listener identification scores for the recovered events: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]
Segregation vs. Inference

• Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can’t be separated
- e.g. unvoiced speech
- maybe infer factors from probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
Missing Data Recognition

• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk:

p(xk | m) = ∫ p(xk, xu | m) dxu

• Hence, missing data recognition:

[Figure: a present-data mask over the time-frequency plane selects which dimensions enter the product P(x|q) = P(x1|q) · P(x2|q) · P(x3|q) · ...; missing dimensions xu are integrated out, optionally against the bound xu < y, i.e. p(xk | xu < y)]
- hard part is finding the mask (segregation)
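For the diagonal-covariance Gaussian state models implied above (means and variances per frequency channel), the marginalization reduces to simply dropping the missing dimensions from the product, as this sketch shows (model parameters and data are illustrative):

```python
import numpy as np

# Illustrative diagonal-Gaussian state model over 6 spectral channels.
mean = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 0.2])
var  = np.array([0.5, 1.0, 0.3, 2.0, 0.8, 0.4])

def log_lik_present(x, mask):
    """log p(xk|m): marginalizing a diagonal Gaussian over the missing
    dimensions just drops them from the product."""
    m, v, xk = mean[mask], var[mask], x[mask]
    return float(np.sum(-0.5 * (np.log(2 * np.pi * v) + (xk - m) ** 2 / v)))

x = np.array([1.1, 9.0, 0.4, 9.0, 1.4, 0.3])   # channels 1, 3 corrupted
mask = np.array([True, False, True, False, True, True])
ll = log_lik_present(x, mask)
```

Scoring only the present dimensions keeps the corrupted channels from dragging the likelihood down, which is exactly the point of the technique; the hard part, as the slide says, is finding the mask.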
Comparing different segregations

• Standard classification chooses between models M to match source features X:

M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)

• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)

[Diagram: observation Y(f) partitioned by segregation S into source X(f), as functions of frequency]
- spectral features allow clean relationship

• Joint classification of model and segregation:

P(M, S | Y) = P(M) · [∫ P(X | M) · P(X | Y, S) / P(X) dX] · P(S | Y)

- probabilistic relation of models & segregation
Multi-source decoding

• Search for more than one source

[Diagram: observations Y(t) divided by masks S1(t), S2(t) feeding parallel state sequences q1(t), q2(t)]

• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
- Spoken documents
- The Listening Machine
- Music preference modeling
5 Music Information Retrieval
4 Accessing large datasets:
The Meeting Recorder Project
(with ICSI, UW, IDIAP, SRI, Sheffield)

• Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
Meeting IR tool

• IR on (ASR) transcripts from meetings
- ASR errors have limited impact on retrieval
Speaker Turn detection
(Huan Wei Hee, Jerry Liu)

• Acoustic: triangulate tabletop mic timing differences
- use normalized peak value for confidence

[Figure: PZM cross-correlation lags for meeting mr-2000-11-02-1440 (lag 1-2 vs. lag 3-4, in ms) and an example cross-coupling response, chan3 to chan0]

• Behavioral: look for patterns of speaker turns

[Figure: mr04, hand-marked speaker turns per participant vs. time (0-60 min), with automatic/manual boundaries]
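The timing-difference estimate can be sketched as a normalized cross-correlation between two channels, reporting the lag of the peak and the peak height as a confidence score (synthetic white-noise signals stand in for the meeting audio):

```python
import numpy as np

def xcorr_lag(a, b, max_lag):
    """Estimate the delay of b relative to a: the lag (in samples) at the
    peak of their normalized circular cross-correlation, plus the peak
    value itself as a confidence score."""
    norm = np.sqrt(np.dot(a, a) * np.dot(b, b))
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.dot(np.roll(a, lag), b) / norm for lag in lags]
    best = int(np.argmax(scores))
    return lags[best], float(scores[best])

rng = np.random.default_rng(0)
mic_a = rng.standard_normal(1000)
mic_b = np.roll(mic_a, 7)   # same waveform, 7 samples later at mic b
lag, conf = xcorr_lag(mic_a, mic_b, max_lag=20)
```

With several mics, pairwise lags like this place each talker in a 2-D lag space, which is how the scatter in the figure is built.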
Speech/nonspeech detection
(Williams & Ellis 1999)

• ASR run over entire soundtracks?
- for nonspeech, result is nonsense
• Watch behavior of speech acoustic model:
- average per-frame entropy
- ‘dynamism’: mean-squared 1st-order difference

[Figure: spectrogram and per-class posteriors for a speech / music / speech+music excerpt, and a scatter of dynamism vs. entropy for 2.5-second segments showing the three classes in separate clusters]

• 1.3% error on 2.5 second speech-music testset
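Both posterior-based statistics are easy to state precisely: average per-frame entropy of the classifier posteriors, and dynamism as the mean-squared first-order frame difference. A sketch on synthetic posterior frames (not the actual classifier output):

```python
import numpy as np

def entropy_and_dynamism(post):
    """post: (frames, classes) rows of per-frame posterior probabilities.
    Returns (average per-frame entropy in bits,
             mean-squared first-order frame difference)."""
    eps = 1e-12
    ent = float(np.mean(-np.sum(post * np.log2(post + eps), axis=1)))
    dyn = float(np.mean((post[1:] - post[:-1]) ** 2))
    return ent, dyn

# Speech-like: peaked posteriors that jump between phone classes.
speech = np.array([[0.9, 0.05, 0.05],
                   [0.05, 0.9, 0.05],
                   [0.05, 0.05, 0.9],
                   [0.9, 0.05, 0.05]])
# Music-like: flat, static posteriors.
music = np.full((4, 3), 1.0 / 3.0)

ent_s, dyn_s = entropy_and_dynamism(speech)
ent_m, dyn_m = entropy_and_dynamism(music)
```

Peaked, fast-changing posteriors (speech-like) give low entropy and high dynamism; flat, static posteriors (music-like) give the opposite, which is why the two classes separate in the scatter plot.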
The Listening Machine

• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
Personal Audio

• LifeLog / MyLifeBits / Remembrance Agent:
easy to record everything you hear
• Then what?
- prohibitively time-consuming to search
- but .. applications if access easier
• Automatic content analysis / indexing...
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
- Anchor space
- Playola browser
5 Music Information Retrieval

• Transfer search concepts to music?
- “musical Google”
- finding something specific / vague / browsing
- is anything more useful than human annotation?
• Most interesting area: finding new music
- is there anything on mp3.com that I would like?
- audio is the only information source for new bands
• Basic idea: project music into a space where neighbors are “similar”
• Also need models of personal preference
- where in the space is the stuff I like
- relative sensitivity to different dimensions
• Evaluation problems
- requires large, shareable music corpus!
Artist Classification
(Berenzweig et al. 2001)

• Artists’ oeuvres as similarity-sets
• Train MLP to classify frames among 21 artists
• Using (detected) voice segments:
song-level accuracy improves 56.7% → 64.9%

[Figure: per-frame classifier posteriors over the 21 artists vs. time, with true voice segments marked, for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee) and Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval)]
Artist Similarity

• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...

[Figure: 2-D similarity map of pop artists (roxette, toni_braxton, erasure, mariah_carey, madonna, backstreet_boys, celine_dion, ...)]

• Need subjective ground truth:
collected via web site www.musicseer.com
• Results:
- 1,000 users, 22,300 judgments collected over 6 months
Music similarity from Anchor space

• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity

[Diagram: audio input (class i) and audio input (class j) each pass through a bank of anchor classifiers producing n-dimensional vectors p(a1|x), p(a2|x), ..., p(an|x) in “Anchor Space”; each set of vectors is summarized by GMM modeling, and similarity is computed between the GMMs (KL-divergence, EMD, etc.)]

• “Anchor space” reflects subjective qualities?
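A minimal sketch of the mapping, with a fixed linear-softmax layer standing in for the trained anchor classifiers and plain Euclidean distance standing in for the KL-divergence/EMD comparison between GMMs; all models and numbers below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical anchors: a fixed linear map from 4-D frame features to
# posteriors over 3 anchor classes (stand-ins for trained genre models).
W = np.array([[ 2.0, -1.0, -1.0],
              [ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 0.5,  0.5, -1.0]])

def to_anchor_space(frames):
    """Summarize a clip as the mean of its per-frame anchor posteriors."""
    return softmax(frames @ W).mean(axis=0)

def anchor_distance(frames_a, frames_b):
    """Smaller = more similar: distance between the anchor summaries
    (the slide computes KL-divergence or EMD between full GMMs)."""
    return float(np.linalg.norm(to_anchor_space(frames_a) -
                                to_anchor_space(frames_b)))

rng = np.random.default_rng(1)
artist1  = rng.standard_normal((50, 4)) + [ 2.0, 0, 0, 0]
artist1b = rng.standard_normal((50, 4)) + [ 2.0, 0, 0, 0]  # same "style"
artist2  = rng.standard_normal((50, 4)) + [-2.0, 0, 0, 0]

d_same = anchor_distance(artist1, artist1b)
d_diff = anchor_distance(artist1, artist2)
```

Two clips drawn from the same underlying "style" land close together in anchor space, while a contrasting style lands far away, which is the property the browser exploits.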
Anchor space visualization

• Comparing 2-D projections of per-frame feature points in cepstral and anchor spaces:

[Figure: scatter of madonna vs. bowie frames, as fifth vs. third cepstral coefficient and as Electronica vs. Country anchor dimensions]
- each artist represented by a 5-GMM
- greater separation under MFCCs!
- but: relevant information?
Playola interface (www.playola.org)

• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
Evaluation

• Are recommendations good or bad?
• Subjective evaluation is the ground truth
- .. but subjects aren’t familiar with the bands being recommended
- can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
- e.g. musicseer data:

[Figure: top-rank agreement (%) of several similarity measures (cei, cmb, erd, e3d, opn, kn2, rnd, ANK) against four musicseer ground-truth sets: SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92]
Summary

• Sound
- .. contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- .. are an unavoidable complication when using sound
- look in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
LabROSA Summary

DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures

ROSA
• Object-based structure discovery & learning

APPLICATIONS
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
• Structuring
• Search
• Summarization
• Awareness
• Understanding
Extra Slides
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 et seq.)

• Drive a parameterized separation algorithm to maximize independence of outputs

[Diagram: mixtures m1, m2 pass through an unmixing matrix [[a11, a12], [a21, a22]] to give outputs s1, s2; the coefficients are adapted by −δMutInfo/δa]

• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
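The gradient update sketched on the slide can be written as natural-gradient Infomax, shown here on a synthetic 2×2 instantaneous mixture of super-Gaussian sources (step size, iteration count, and the whitening step are illustrative choices, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent super-Gaussian (Laplacian) sources and a 2x2 mix.
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
X = A @ S

# Whiten the mixtures (zero mean, identity covariance).
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / n)
Xw = E @ np.diag(d ** -0.5) @ E.T @ Xc

# Natural-gradient Infomax: W <- W + lr * (I - tanh(Y) Y^T / n) W,
# pushing the outputs toward independence (cf. -dMutInfo/da on the slide).
W = np.eye(2)
for _ in range(200):
    Y = W @ Xw
    W += 0.1 * (np.eye(2) - np.tanh(Y) @ Y.T / n) @ W

Y = W @ Xw  # recovered sources
```

The recovered rows of Y match the original sources only up to permutation and scaling, which is inherent to ICA and one reason it delivers separation rather than recognition.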