Sound, Mixtures & Learning

Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York http://labrosa.ee.columbia.edu/
[Figure: spectrogram analysis (0-4000 Hz, 0-12 s, level in dB) of a sound mixture, with events labeled: Voice (evil), Stab, Rumble, Voice (pleasant), Choir, Strings]
• Sound understanding: the key challenge
- what listeners do
- understanding = abstraction
• Applications
- indexing/retrieval
- robots
- prostheses
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Frequency analysis (Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: onset map, harmonicity map, and position map feeding a grouping mechanism (after Darwin, 1996)]
Source properties
• Elements + attributes
[Figure: spectrogram, 0-8000 Hz over 0-9 s, showing sound elements and their attributes]
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
Outline
1 Sound Content Analysis
2 Recognizing sounds
- Speech recognition
- Nonspeech
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
• Standard speech recognition structure:
  sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone/word sequence → Understanding/application...
  - the classifier uses acoustic model parameters; the decoder uses word models (e.g. s-ah-t) and a language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
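A minimal sketch of what multicondition training preparation can look like: mix each clean training utterance with noise at several SNRs so the acoustic model sees the target conditions. The function names and SNR list are illustrative, not from the talk.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio is snr_db, then add."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def multicondition_set(speech, noise, snrs=(20, 10, 5, 0)):
    """Return the clean utterance plus noisy copies at each SNR."""
    return [speech] + [mix_at_snr(speech, noise, s) for s in snrs]
```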
• Markov model structure: states + transitions
[Figure: example Markov model M' with states S, A, T, E, K, O; per-state models (means); and state transition probabilities (e.g. self-loops 0.8-0.9, transitions 0.02-0.2)]
• A generative model
- but not a good speech generator!
- only meant for inference of p(X | M)
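To make the p(X | M) inference concrete, here is a minimal log-domain forward algorithm for a discrete-state HMM; a generic sketch (array names and shapes are illustrative), not code from the talk.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_obs, log_trans, log_init):
    """log p(X | M) for an HMM.

    log_obs[t, j]  : log p(x_t | state j), e.g. from an acoustic classifier
    log_trans[i, j]: log p(next state j | state i)
    log_init[i]    : log p(initial state i)
    """
    T, N = log_obs.shape
    alpha = log_init + log_obs[0]        # forward log-probabilities at t=0
    for t in range(1, T):
        # sum over predecessor states, staying in log domain for stability
        alpha = log_obs[t] + logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha)              # marginalize over final states
```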
(with Marios Athineos)
• Common sound models use 10ms frames
- but: sub-10ms envelope is perceptible
[Figure: 'Fireplace' sound: original spectrogram, TDLP resynthesis, CTFLP resynthesis, and CTFLP resynthesis with 4x time-scale modification (0-4 kHz, 0-4 s)]
• Use a parametric (LPC) model on the spectrum
[Figure: waveform and its fine temporal envelope in subbands 0-0.5, 0.5-1, 1-2, and 2-4 kHz]
• Convert to features for ASR
- improvements esp. for stops
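A rough sketch of the underlying idea, linear prediction applied in a transform domain so that fine temporal envelope structure is modeled: here LPC is fit to the DCT of a signal segment, whose poles then track peaks of the temporal envelope. This shows only the general technique; the exact TDLP/CTFLP formulations from the slide are not reproduced.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method linear prediction coefficients."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

def temporal_envelope_lpc(segment, order=40):
    # LPC on the DCT of the waveform models the temporal (Hilbert)
    # envelope, dual to ordinary LPC modeling the spectral envelope.
    c = dct(segment, type=2, norm='ortho')
    return lpc(c, order)
```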
(with Patricia Scanlon)
• Mutual Information in time-frequency:
[Figure: mutual information (MI, in bits) between time-frequency cells and target labels vs. time offset (-200 to +200 ms): panels for all phones, stops, vowels, and speaker identity (vowels)]
• Use to select classifier input features
[Figure: MI-selected feature masks (IRREG, IRREG+CHKBD; 9-133 features) and classification accuracy (45-70%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT (all frames)]
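A histogram-based mutual-information estimate of the kind that could drive such feature selection, sketched under assumptions (equal-occupancy binning of one scalar feature against integer phone labels); this is not necessarily the paper's estimator.

```python
import numpy as np

def mutual_info_bits(feature, labels, n_bins=32):
    """I(feature; label) in bits from a joint histogram."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    f = np.clip(np.searchsorted(edges, feature) - 1, 0, n_bins - 1)
    joint = np.zeros((n_bins, labels.max() + 1))
    np.add.at(joint, (f, labels), 1)          # joint counts
    joint /= joint.sum()
    pf = joint.sum(axis=1, keepdims=True)     # feature marginal
    pl = joint.sum(axis=0, keepdims=True)     # label marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pf @ pl)[nz])))

# Rank each (time offset, frequency) cell by MI and keep the top cells
# as classifier inputs.
```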
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
[Figure: spectrograms of alarm sounds hrn01, bfr02, buz01 (0-25 s, level in dB)]
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X | M)
- sinusoidal model, fragments, P(M, S | Y)
[Figure: 'Restaurant + alarms' mixture (snr 0, ns 6, al 8): MLP classifier output vs. sound-object classifier output, 25-50 s]
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:
Noise      Neural net system        Sinusoid model system
           Del     Ins   Tot        Del     Ins   Tot
1 (amb)    7/25    2     36%        14/25   1     60%
2 (bab)    5/25    63    272%       15/25   2     68%
3 (spe)    2/25    68    280%       12/25   9     84%
4 (mus)    8/25    37    180%       9/25    135   576%
Overall    22/100  170   192%       50/100  147   197%
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
- Auditory Scene Analysis
- Missing data recognition
- Parallel model inference
4 Accessing large datasets
5 Music Information Retrieval
• Separate signals , then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics..
• Recognize with parallel models
- full joint-state space?
- or: divide signal into fragments, then use missing-data recognition
(Cooke & Brown 1993)
• Direct implementation of psych. theory:
  input mixture → Front end (signal feature maps: frequency, onset time, period, freq. modulation) → Object formation (discrete objects) → Grouping rules → Source groups
  - 'bottom-up' processing
  - uses common onset & periodicity cues
[Figure: spectrogram of the input mixture brn1h.aif, 100-3000 Hz, 0-1.0 s]
• Able to extract voiced speech:
[Figure: spectrogram of the extracted voiced speech brn1h.fi.aif, 100-3000 Hz, 0-1.0 s]
Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up):
  input mixture → Front end (signal features) → Object formation (discrete objects) → Grouping rules → Source groups
  - objects irresistibly appear
vs. prediction-driven (top-down):
  input mixture → Front end (signal features) → Compare & reconcile (prediction errors) → Hypothesis management → hypotheses (noise components, periodic components) → Predict & combine → predicted features, fed back to the comparison
- match observations with parameters of a world-model
- need world-model constraints...
[Figure: prediction-driven analysis of a city-street recording (50-4000 Hz, 0-9 s, level -70 to -40 dB): Wefts 1-12 accounting for Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10); Noise2 + Click1 as Crash (10/10); Noise1 as Squeal (6/10) and Truck (7/10)]
• Source separation requires attribute separation
- sources are characterized by attributes
(pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can’t be separated
- e.g. unvoiced speech
- maybe infer factors from probabilistic model?
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset x_k of dimensions:
  p(x_k | m) = ∫ p(x_k, x_u | m) dx_u
• Hence, missing data recognition:
  p(x_k | x_u < y) — unreliable dimensions enter only as bounds from the observation y
Present-data mask:
  P(x | q) = P(x_1 | q) · P(x_2 | q) · P(x_3 | q) · P(x_4 | q) · P(x_5 | q) · P(x_6 | q)
[Figure: time-frequency mask selecting which dimensions enter the product at each time]
- hard part is finding the mask (segregation)
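A minimal sketch of the missing-data likelihood for one diagonal-Gaussian state: present dimensions contribute full Gaussian terms, and (optionally) masked dimensions contribute the bounded term P(x_u < y | q), using the observed mixture level as an upper bound. Names and the diagonal-Gaussian assumption are illustrative.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(x, mask, mean, var, bound=None):
    """log P(x | q) over the 'present' dimensions of the mask."""
    k = np.asarray(mask, bool)
    d = x[k] - mean[k]
    ll = -0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k]))
    if bound is not None:
        # masked dims: the source is only known to lie below the observation y
        u = ~k
        ll += np.sum(np.log(norm.cdf(bound[u], mean[u], np.sqrt(var[u]))
                            + 1e-300))
    return ll
```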
• Standard classification chooses between models M to match source features X:
  M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M)
• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)
[Diagram: observation Y(f), source X(f), and segregation mask S, plotted against frequency]
  - spectral features allow a clean relationship
• Joint classification of model and segregation:
  P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
  - probabilistic relation of models & segregation
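A toy illustration of the joint search: enumerate candidate (model, segregation) pairs, score each model on just the spectral cells its mask claims, and add the model and mask priors. This collapses the integral over X (the masked observation stands in directly for the source features), so it is a cartoon of the equation above, not an implementation of it.

```python
import numpy as np
from itertools import product

def masked_loglik(y, mask, mean, var):
    """Diagonal-Gaussian log-likelihood over masked-in cells only."""
    k = np.asarray(mask, bool)
    d = y[k] - mean[k]
    return -0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k]))

def joint_decode(y, models, masks, log_p_m, log_p_s):
    """argmax over (M, S) of log P(M) + log P(y_S | M) + log P(S | y)."""
    best = (-np.inf, None, None)
    for (i, (mean, var)), (j, s) in product(enumerate(models),
                                            enumerate(masks)):
        score = masked_loglik(y, s, mean, var) + log_p_m[i] + log_p_s[j]
        if score > best[0]:
            best = (score, i, j)
    return best   # (score, model index, mask index)
```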
• Search for more than one source
[Diagram: observation Y(t) jointly explained by masks S1(t), S2(t) tied to state sequences q1(t), q2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask
[Figure: Speaker 1 and Speaker 2 (0-4000 Hz, 0-6 s): state-sequence-derived t-f estimates and the resulting masks]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by modeling subbands (coupled HMM)?
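A sketch of the refiltering step itself (after the state sequences have produced per-speaker spectrogram estimates): whichever model predicts more energy claims each time-frequency cell, and the resulting binary masks applied to the mixture STFT resynthesize the two voices. Parameter values are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def refilter(mixture, est1, est2, fs=16000, nperseg=512):
    """est1, est2: model-based magnitude estimates on the mixture's t-f grid."""
    _, _, Y = stft(mixture, fs, nperseg=nperseg)
    mask = est1 > est2                      # cells where speaker 1 dominates
    _, x1 = istft(Y * mask, fs, nperseg=nperseg)
    _, x2 = istft(Y * ~mask, fs, nperseg=nperseg)
    return x1, x2
```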
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
- Meeting Recordings
- The Listening Machine
5 Music Information Retrieval
(with ICSI, UW, IDIAP, SRI, Sheffield)
• Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
(Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop mic timing differences
- use normalized peak value for confidence
[Figure: example cross-coupling response, chan3 to chan0; PZM cross-correlation lags (±3 ms) over 0-300 s, meeting mr-2000-11-02-1440]
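A minimal version of the cross-correlation timing estimate between two tabletop mics, returning the delay and a normalized peak value as the confidence score described above; window length and lag range are illustrative.

```python
import numpy as np

def tdoa(sig_a, sig_b, fs, max_lag_ms=3.0):
    """Time-difference-of-arrival estimate with normalized-peak confidence."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    xc = np.correlate(sig_a, sig_b, mode='full')
    mid = len(sig_b) - 1                     # index of zero lag
    window = xc[mid - max_lag: mid + max_lag + 1]
    lag = np.argmax(np.abs(window)) - max_lag
    conf = np.max(np.abs(window)) / (np.linalg.norm(sig_a)
                                     * np.linalg.norm(sig_b))
    return lag / fs, conf                    # delay in seconds, confidence
```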
• Behavioral: look for patterns of speaker turns
[Figure: mr04, hand-marked speaker turns vs. time (0-60 min) with automatic/manual boundaries]
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
• LifeLog / MyLifeBits /
Remembrance Agent:
Easy to record everything you hear
• Then what?
- prohibitively time-consuming to search
- but ... applications if access is easier
• Automatic content analysis / indexing...
[Figure: automatically derived index features for long personal-audio recordings, plotted over ~250 minutes and against clock time 14:30-18:30]
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
- Anchor space
- Playola browser
• Transfer search concepts to music?
- “musical Google”
- finding something specific / vague / browsing
- is anything more useful than human annotation?
• Most interesting area: finding new music
- is there anything on mp3.com that I would like?
- audio is the only information source for new bands
• Basic idea:
Project music into a space where neighbors are “similar”
• Also need models of personal preference
- where in the space is the stuff I like
- relative sensitivity to different dimensions
• Evaluation problems
- requires large, shareable music corpus!
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...
[Figure: artist-similarity display with pop artists, e.g. toni_braxton, jessica_simpson, mariah_carey, janet_jackson, eiffel_65, pet_shop_boys, lauryn_hill, roxette, lara_fabian, sade, spice_girls]
• Need subjective ground truth:
Collected via web site www.musicseer.com
• Results:
- 1,000 users, 22,300 judgments collected over 6 months
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity
[Diagram: audio input (class i, class j) → conversion to anchor space via anchor classifiers p(a_1|x), p(a_2|x), ..., p(a_n|x) → n-dimensional vectors in "Anchor Space" → GMM modeling → similarity computation (KL-d, EMD, etc.)]
• “Anchor space” reflects subjective qualities?
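A sketch of the anchor-space conversion using scikit-learn GMMs as the anchor models: each frame's per-anchor likelihoods are normalized into a posterior-like vector, and artist models are then fit in that space. The anchors, features, and normalization here are assumptions for illustration, not the system's exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def to_anchor_space(frames, anchor_gmms):
    """frames: (T, D) features -> (T, n_anchors) normalized anchor scores."""
    logliks = np.stack([g.score_samples(frames) for g in anchor_gmms],
                       axis=1)                     # log p(x | anchor)
    logliks -= logliks.max(axis=1, keepdims=True)  # stabilize exp()
    p = np.exp(logliks)
    return p / p.sum(axis=1, keepdims=True)        # rows sum to 1
```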
• Comparing 2D projections of per-frame feature points in cepstral and anchor spaces:
Cepstral Features Anchor Space Features
[Figure: per-frame feature points for madonna vs. bowie projected onto two cepstral coefficients (left) and two anchor dimensions, e.g. Country (right)]
- each artist represented by a 5-component GMM
- greater separation under MFCCs!
- but: relevant information?
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
• Are recommendations good or bad?
• Subjective evaluation is the ground truth
- .. but subjects aren’t familiar with the bands being recommended
- can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
- e.g. musicseer data:
[Figure: top-rank agreement (%) with musicseer similarity judgments (SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK]
• Sound
- ... contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- .. are an unavoidable complication when using sound
- looking in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
• Broadcast
• Movies
• Lectures
• Meetings
• Personal recordings
• Location monitoring
LabROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
• Structuring
• Search
• Summarization
• Awareness
• Understanding
• Music as a complex, information-rich sound
• Applications of separation & recognition:
- note/chord detection & classification
[Figure: 'DYWMB' alignments to MIDI note 57 mapped to the original audio, with true-voice regions marked; artist set for singing detection: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, Dj Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada]
- singing detection (→ genre identification ...)
[Figure: singing detection for Track 117, Aimee Mann (dynvox=Aimee, unseg=Aimee), over 0-200 s]
(Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
[Diagram: mixtures m1, m2 passed through unmixing parameters a11, a12, a21, a22 to give outputs s1, s2; mutual information between outputs fed back to adapt a]
• Advantages :
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages :
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
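A minimal demonstration of the setup with scikit-learn's FastICA: two sources mixed by an unknown 2x2 matrix (a11..a22 in the diagram) are recovered, up to permutation and scaling, by maximizing independence. The sources here are synthetic placeholders.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),          # source s1: tone
          np.sign(np.sin(2 * np.pi * 3 * t))]   # source s2: square wave
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                      # mixing matrix a11..a22
x = s @ A.T                                     # observed mixtures m1, m2
s_hat = FastICA(n_components=2, random_state=0).fit_transform(x)
# s_hat approximates s up to permutation and scaling; no model priors used
```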