Sound, Mixtures, and Learning:
LabROSA overview
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Sound Content Analysis
[Figure: spectrogram (frq / Hz vs. time / s, level / dB) passed through Analysis to yield labeled sound events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sound understanding: the key challenge
- what listeners do
- understanding = abstraction
• Applications
- indexing/retrieval
- robots
- prostheses
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Approaches to handling sound mixtures
• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics...
• Recognize with parallel models
- full joint-state space?
- or: divide signal into fragments, then use missing-data recognition
Outline
1 Sound Content Analysis
2 Recognizing sounds
- Speech recognition
- Nonspeech
3 Organizing mixtures
4 Accessing large datasets
Recognizing Sounds: Speech
• Standard speech recognition structure:
sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → Understanding / application...
- acoustic model parameters, word models (s-ah-t) and a language model (e.g. p("sat"|"the","cat") vs. p("saw"|"the","cat")) are trained from data
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
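To make the classifier → decoder chain concrete, here is a minimal Viterbi search in numpy over precomputed log-probabilities; an illustrative sketch (state space, transitions, and observation scores are placeholders), not the recognizer used here.

import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely state path through an HMM.
    log_obs:   (T, S) log p(x_t | state), e.g. acoustic classifier output
    log_trans: (S, S) log transition probabilities
    log_init:  (S,)   log initial-state probabilities
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]           # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans   # (from-state, to-state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + log_obs[t]
    path = np.zeros(T, dtype=int)           # trace back the best path
    path[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path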
Novel speech signal representations
(with Marios Athineos)
• Common sound models use 10 ms frames
- but: sub-10 ms envelope is perceptible
• Use a parametric (LPC) model on the spectrum
[Figure: spectrograms (Frequency vs. Time) of "Fireplace - orig" and its TDLP, CTFLP, and CTFLP 4x TSM resyntheses; subband envelopes (2-4 kHz, 1-2 kHz, 0.5-1 kHz, 0-0.5 kHz) plotted above the waveform]
• Convert to features for ASR
- improvements esp. for stops
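A rough sketch of the underlying idea, linear prediction applied across the spectrum so the all-pole fit models the temporal envelope at sub-10 ms resolution; an illustration under stated assumptions, not the TDLP/CTFLP implementation itself.

import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, order=20):
    """All-pole model of the temporal envelope of signal segment x:
    ordinary LPC fit to the cosine transform of the signal, so the
    resulting 'spectrum' lives in the time domain."""
    c = dct(x, type=2, norm='ortho')                  # signal -> 'spectrum'
    r = np.correlate(c, c, mode='full')[len(c) - 1:]  # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    a = np.concatenate(([1.0], -a))                   # prediction polynomial
    n = len(x)
    grid = np.exp(-1j * np.pi * np.outer(np.arange(n) / n, np.arange(order + 1)))
    return 1.0 / np.abs(grid @ a) ** 2                # smooth temporal envelope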
Finding the Information in Speech
(with Patricia Scanlon)
• Mutual Information in time-frequency:
[Figure: mutual information (MI / bits) between time-frequency cells (freq / Bark vs. time offset / ms, -200 to +200) and labels, for All phones, Stops, Vowels, and Speaker (vowels)]
• Use to select classifier input features
[Figure: selected feature masks (freq / Bark vs. time / ms) for decreasing feature counts (133, 95, 85, 57, 47, 28, 19, 9 ftrs), and frame accuracy (%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT-all-frames selections]
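The selection step itself is simple once the MI map exists; a toy sketch using scikit-learn's MI estimator on stand-in data (the real features were phone-labeled time-frequency cells):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 133))      # stand-in: 133 t-f cells per frame
y = rng.integers(0, 40, size=5000)    # stand-in: phone labels

mi = mutual_info_classif(X, y, random_state=0)   # MI of each cell vs. label
keep = np.argsort(mi)[::-1][:47]                 # retain the 47 best cells
X_sel = X[:, keep]                               # reduced classifier input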
Alarm sound detection
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
[Figure: spectrogram (freq / kHz vs. time / s, level / dB) of alarms hrn01, bfr02, buz01 in noise (s0n6a8+20)]
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
Alarms: Results
[Figure: spectrograms (freq / kHz vs. time / sec) of Restaurant + alarms (snr 0, ns 6, al 8), with MLP classifier output and sound object classifier output]
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:

  Noise    | Neural net system     | Sinusoid model system
           | Del       Ins   Tot   | Del       Ins   Tot
  1 (amb)  | 7 / 25      2    36%  | 14 / 25     1    60%
  2 (bab)  | 5 / 25     63   272%  | 15 / 25     2    68%
  3 (spe)  | 2 / 25     68   280%  | 12 / 25     9    84%
  4 (mus)  | 8 / 25     37   180%  |  9 / 25   135   576%
  Overall  | 22 / 100  170   192%  | 50 / 100  147   197%
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
- Auditory Scene Analysis
- Missing data recognition
- Parallel model inference
4 Accessing large datasets
Auditory Scene Analysis
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
[Figure: spectrogram (freq / Hz vs. time / s) overlaid with discrete elements and their attributes]
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
Computational Auditory Scene Analysis:
The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
input mixture → Front end (maps: onset, period, frq.mod over freq × time) → signal features → Object formation → discrete objects → Grouping rules → Source groups
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms (frq / Hz vs. time / s) of the input mixture brn1h.aif and the extracted voiced speech brn1h.fi.aif]
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups
- objects irresistibly appear
• vs. Prediction-driven (top-down)
input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features (fed back to Compare & reconcile)
- match observations with parameters of a world-model
- need world-model constraints...
Prediction-Driven CASA
[Figure: prediction-driven analysis of a City recording (f / Hz vs. time / s, level / dB): the input spectrogram is explained as Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, and Wefts 9-12 (Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10)), Noise2 + Click1 (Crash (10/10)), and Noise1 (Squeal (6/10), Truck (7/10))]
Segregation vs. Inference
• Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can’t be separated
- e.g. unvoiced speech
- maybe infer factors from probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
Missing Data Recognition
• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk:
      p(xk | m) = ∫ p(xk, xu | m) dxu
• Hence, missing data recognition:
[Figure: a present-data mask over dimension × time selects which factors P(x1|q) · P(x2|q) · ... · P(x6|q) contribute to P(x|q); for masked dimensions only the bound p(xk | xu < y) is available]
- hard part is finding the mask (segregation)
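For diagonal-covariance Gaussian state models, marginalizing over missing dimensions amounts to dropping their factors; a minimal sketch (the bounded-marginalization variant using the observed ceiling y is omitted, and all names are illustrative):

import numpy as np

def missing_data_loglik(x, mask, means, variances, log_prior):
    """Per-state log-likelihood using only the 'present' dimensions.
    x: (D,) feature vector; mask: (D,) bool, True where target dominates;
    means, variances: (S, D); log_prior: (S,)"""
    d = x[mask] - means[:, mask]          # residual on present dims only
    v = variances[:, mask]
    ll = -0.5 * np.sum(d * d / v + np.log(2 * np.pi * v), axis=1)
    return ll + log_prior                 # absent dims integrate out to 1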
Comparing different segregations
• Standard classification chooses between models M to match source features X:
      M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)
[Figure: observed spectrum Y(f) decomposed into source X(f) and segregation mask S along freq]
- spectral features allow clean relationship
• Joint classification of model and segregation:
      P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
- probabilistic relation of models & segregation
Multi-source decoding
• Search for more than one source
[Diagram: observations Y(t) jointly explained by state sequences q1(t), q2(t) through mutually-dependent masks S1(t), S2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
What a speech HMM contains
• Markov model structure: states + transitions
[Figure: word models M'1 and M'2 as state transition probabilities over phone states S, A, T, K, E, O (self-loops ~0.8-0.9), state models (means) as spectral templates (freq / kHz), and a spectrogram generated from the model]
• A generative model
- but not a good speech generator!
- only meant for inference of p(X|M)
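One way to see why it is a poor generator: sampling from the model produces frames that are conditionally independent given the state path. A hedged sketch of such sampling (all parameters illustrative):

import numpy as np

def sample_hmm(trans, means, stds, n_frames, seed=0):
    """Draw a state path and spectral frames from a Gaussian HMM."""
    rng = np.random.default_rng(seed)
    S = trans.shape[0]
    q = np.zeros(n_frames, dtype=int)
    q[0] = rng.integers(S)
    for t in range(1, n_frames):
        q[t] = rng.choice(S, p=trans[q[t - 1]])   # follow transitions
    X = rng.normal(means[q], stds[q])             # independent frame draws
    return q, X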
“One microphone source separation”
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask
[Figure: spectrograms (freq / Hz vs. time / sec) of the original voices (Speaker 1, Speaker 2), their mixture, the resynthesis masks, and the state means]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by modeling subbands (coupled HMM)?
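The mask construction is the simple part; a sketch of Roweis-style 'refiltering' given per-speaker spectrogram estimates (e.g. the state means along each decoded path):

import numpy as np

def refilter(mix_spec, est1, est2):
    """Binary t-f mask from competing estimates, applied to the mixture.
    mix_spec, est1, est2: (F, T) magnitude spectrograms."""
    mask1 = est1 > est2                        # cells where speaker 1 dominates
    return mix_spec * mask1, mix_spec * ~mask1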
Subband models
(Reyes, Jojic)
• Reduce the number of states required
- 4000 states × 1 band → 30 states × 19 bands
• Train coupled HMMs via variational approx
[Diagram: coupled-HMM lattices of subband states S_b,t (bands b = 1..5, frames t = 1..6), with dependencies along time and between adjacent bands]
Tracking source normalization
• Standard HMM sound models try to normalize away energy variation
- but: each source in mixture has different ‘gain’
[Diagram: Source 1 and Source 2 data are separately normalized (÷ k1, ÷ k2) to match Model 1 and Model 2; for the composed data no single normalization fits both models: a normalization mismatch]
• Instead, factor out scalar gain for each source
[Diagram: Markov chain S1...S6 with per-frame gains a1...a6 generating observations x1...x6]
- solve with var. approx. to P(S,a)
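In a log-spectral domain a scalar gain becomes an additive offset, and for diagonal Gaussians the best per-state offset has a closed form; the variational machinery is needed for the joint inference over states and gains, but the inner step looks roughly like this sketch:

import numpy as np

def best_gain_loglik(x, means, variances):
    """Per-state log-likelihood after factoring out a scalar (log-domain) gain.
    x: (D,) log-spectral frame; means, variances: (S, D)."""
    w = 1.0 / variances                              # per-dimension precisions
    r = x - means                                    # residual before gain
    a = np.sum(w * r, axis=1) / np.sum(w, axis=1)    # optimal offset per state
    d = r - a[:, None]
    ll = -0.5 * np.sum(d * d * w + np.log(2 * np.pi * variances), axis=1)
    return ll, a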
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
- Meeting Recordings
- The Listening Machine
- Music Information Retrieval
Accessing large datasets:
The Meeting Recorder Project
• Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
Speaker Turn detection
(Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop mic timing differences, as in the sketch below
- use normalized peak value for confidence
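A minimal sketch of the timing-difference estimate for one channel pair (names and windowing are illustrative, not the project's code):

import numpy as np

def xcorr_lag(a, b, sr, max_lag_ms=3.0):
    """Lag (ms) of the cross-correlation peak between two mic channels,
    plus the normalized peak height as a confidence score."""
    max_lag = int(sr * max_lag_ms / 1000)
    a = a - a.mean()
    b = b - b.mean()
    xc = np.correlate(a, b, mode='full')
    mid = len(b) - 1                                  # zero-lag index
    win = xc[mid - max_lag: mid + max_lag + 1]
    k = int(np.argmax(np.abs(win)))
    conf = abs(win[k]) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1000.0 * (k - max_lag) / sr, conf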
[Figure: PZM cross-correlation lags for meeting mr-2000-11-02-1440, plotted as lag 3-4 / ms vs. lag 1-2 / ms and colored by time / s; inset: example cross-coupling response, chan3 to chan0 (100×R skew / samps)]
• Behavioral: look for patterns of speaker turns
[Figure: mr04 hand-marked speaker turns vs. time (participants 1-10 vs. time / min) with automatic and manual boundaries]
The Listening Machine
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
Personal Audio
• LifeLog / MyLifeBits / Remembrance Agent: easy to record everything you hear
• Then what?
- prohibitively time consuming to search
- but .. applications if access easier
• Automatic content analysis / indexing...
[Figure: spectrograms of a day of personal audio (freq / Hz over ~250 min; freq / Bark vs. clock time 14:30-18:30)]
Music Information Retrieval
• Transfer search concepts to music?
- “musical Google”
- finding something specific / vague / browsing
- is anything more useful than human annotation?
• Most interesting area: finding new music
- is there anything on mp3.com that I would like?
- audio is the only information source for new bands
• Basic idea: project music into a space where neighbors are “similar”
• Also need models of personal preference
- where in the space is the stuff I like
- relative sensitivity to different dimensions
• Evaluation problems
- requires large, shareable music corpus!
Music similarity from Anchor space
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity, as sketched below
[Diagram: audio input (Class i, Class j) → anchor classifiers p(a1|x), p(a2|x), ..., p(an|x) → n-dimensional vectors in "Anchor Space" → GMM modeling → similarity computation (KL-d, EMD, etc.)]
• “Anchor space” reflects subjective qualities?
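A sketch of the mapping into anchor space, assuming one GMM per anchor class (scikit-learn names; the actual system compares distributions via KL-divergence or EMD rather than plain vector distance):

import numpy as np
from sklearn.mixture import GaussianMixture

def anchor_vector(frames, anchor_gmms):
    """Track coordinates in anchor space: mean per-anchor posterior
    over the track's frames (equal anchor priors assumed)."""
    ll = np.stack([g.score_samples(frames) for g in anchor_gmms], axis=1)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)      # p(a_i | x) per frame
    return post.mean(axis=0)                     # n-dimensional vector

def similarity(v1, v2):
    return -np.linalg.norm(v1 - v2)              # toy stand-in for KL-d / EMD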
Playola interface ( www.playola.org )
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
Evaluation
• Are recommendations good or bad?
• Subjective evaluation is the ground truth
- .. but subjects aren’t familiar with the bands being recommended
- can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
- e.g. musicseer data:
[Figure: top rank agreement (%) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK against judgment sets SrvKnw (4789×3.58), SrvAll (6178×8.93), GamKnw (7410×3.96), GamAll (7421×8.92)]
Summary
• Sound
- .. contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- .. are an unavoidable complication when using sound
- look in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding
Extra Slides
Music Applications
• Music as a complex, information-rich sound
• Applications of separation & recognition:
- note/chord detection & classification
[Figure: "DYWMB: Alignments to MIDI note 57 mapped to Orig Audio": spectrogram excerpts (freq / kHz vs. time / sec) around the aligned note events]
- singing detection (→ genre identification ...)
[Figure: voice detection for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee): true-voice regions vs. per-artist classifier outputs for 21 artists (Michael Penn, The Roots, The Moles, ..., Boards of Canada) over time / sec]
Artist Similarity
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...
[Figure: 2-D projection of pop artists (roxette, toni_braxton, ron_carter, erasure, lara_fabian, jessica_simpson, mariah_carey, janet_jackson, eiffel_65, celine_dion, pet_shop_boys, christina_aguilera, aqua, lauryn_hill, sade, all_saints, backstreet_boys, madonna, spice_girls, belinda_carlisle, nelly_furtado, annie_lennox, ...)]
• Need subjective ground truth: collected via web site www.musicseer.com
• Results:
- 1,000 users, 22,300 judgments collected over 6 months
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
[Diagram: mixtures (m1, m2) pass through unmixing matrix [[a11, a12], [a21, a22]] to give outputs (s1, s2); the mutual information of the outputs is fed back to adapt a]
• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
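For reference, the standard recipe on a toy instantaneous mixture, using scikit-learn's FastICA (synthetic sources; recovery is only up to permutation and scale):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),            # tone
          np.sign(np.sin(2 * np.pi * 3 * t))]     # square wave
A = np.array([[1.0, 0.6], [0.4, 1.0]])            # mixing matrix
m = s @ A.T                                       # observed mixtures m1, m2

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(m)                      # estimated sources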