LabROSA
Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
1. Real-World Sound
2. Speech Separation
3. Environmental Audio Classification
4. Music Audio Analysis
http://labrosa.ee.columbia.edu/
LabROSA Overview - Dan Ellis
2011-09-09
1 /17
LabROSA Overview
• Getting information from sound
[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Speech, Music, Environment; Recognition, Separation, Retrieval]
1. Real-World Sound
[Figure: spectrogram of 02_m+s-15-evil-goodvoice-fade; frq/Hz 0-4000 vs. time/s 0-12; level / dB 0 to -60. Analysis: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sounds rarely occur in isolation
.. so analyzing mixtures (“scenes”) is a problem
.. for humans and machines
Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman ’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain
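The "2 sensors, N sources" point can be made concrete with a toy instantaneous mixture. This is a minimal sketch, assuming a linear mixing model x = A s with made-up gains (the slides do not specify a mixing model): with three sources and two sensors, two different source vectors give identical sensor signals, so the observation alone cannot identify the sources.

```python
def mix(A, s):
    """Sensor signals x = A s at one time instant."""
    return [sum(a * v for a, v in zip(row, s)) for row in A]

def cross(u, v):
    """Cross product of two 3-vectors; orthogonal to both u and v."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

# Hypothetical gains from 3 sources (columns) to 2 sensors (rows).
A = [[1.0, 0.5, 0.2],
     [0.3, 1.0, 0.8]]

s_true = [1.0, -2.0, 0.5]

# A vector orthogonal to both rows of A is invisible to both sensors.
null = cross(A[0], A[1])
s_alt = [s + 3.0 * n for s, n in zip(s_true, null)]

x_true = mix(A, s_true)
x_alt = mix(A, s_alt)   # same observations from different sources
```

Any multiple of `null` can be added without changing the observation, which is why prior models of the sources are needed to pick one explanation.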
2. Speech Separation
Roweis ’01, ’03
Kristjansson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination model: $p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \Sigma)$
inference of source state: $\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2} p(x(t) \mid i_1, i_2)$
can include sequential constraints...
• E.g. stationary noise:
[Figure: three spectrograms, freq / mel bin (0-80) vs. time / s (0-2): Original speech; In speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]
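The state inference on this slide can be sketched as an exhaustive search over codeword pairs. This is a minimal sketch, assuming an isotropic covariance (Σ = σ²I), under which maximizing the Gaussian likelihood is the same as minimizing the squared distance between the observed frame and the sum of two codewords. The codebooks and the frame below are invented toy values, and no sequential constraints are applied.

```python
from itertools import product

def infer_states(x, codebook1, codebook2):
    """Return (i1, i2) maximizing N(x; c_i1 + c_i2, sigma^2 I).

    With an isotropic covariance this is the pair whose summed
    codewords are closest to x in squared Euclidean distance.
    """
    best, best_cost = None, float("inf")
    pairs = product(enumerate(codebook1), enumerate(codebook2))
    for (i1, c1), (i2, c2) in pairs:
        cost = sum((xv - (a + b)) ** 2 for xv, a, b in zip(x, c1, c2))
        if cost < best_cost:
            best, best_cost = (i1, i2), cost
    return best

# Toy 3-bin "spectra" for two sources (hypothetical values).
codebook1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
codebook2 = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]

frame = [0.1, 0.9, 2.1]          # noisy version of codebook1[1] + codebook2[0]
states = infer_states(frame, codebook1, codebook2)
```

The search cost grows with the product of the codebook sizes, which is one reason practical systems prune or factor the search.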
Eigenvoices
Weiss & Ellis ’09, ’10
• Idea: Find speaker model parameter space
generalize without losing detail?
Speaker models: 280 states x 320 bins = 89,600 dimensions
Speaker subspace bases: 10-30 dimensions
• Eigenvoice model:
$\mu = \bar{\mu} + U w + B h$
adapted model = mean voice + eigenvoice bases x weights + channel bases x channel weights
[Figure: Mean Voice and Eigenvoice dimensions 1, 2, 3; Frequency (kHz), 0-8, vs. phone state (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax)]
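As an illustration of the eigenvoice model, the adapted model is the mean voice plus a weighted sum of eigenvoice bases plus a weighted sum of channel bases. The sketch below uses invented 4-dimensional toy vectors in place of the 89,600-dimensional speaker models; how the bases are trained and how the weights are estimated are not shown.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapt(mean_voice, U, w, B, h):
    """Adapted model: mu = mean_voice + U w + B h."""
    Uw = matvec(U, w)
    Bh = matvec(B, h)
    return [m + a + b for m, a, b in zip(mean_voice, Uw, Bh)]

# Size of the full speaker model quoted on the slide.
n_states, n_bins = 280, 320
full_dim = n_states * n_bins

# Toy stand-ins: 4 model dimensions, 2 eigenvoices, 1 channel basis.
mean_voice = [1.0, 2.0, 3.0, 4.0]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # eigenvoice bases
w = [0.5, -1.0]                                        # speaker weights
B = [[1.0], [1.0], [1.0], [1.0]]                       # channel bases
h = [0.1]                                              # channel weights

mu = adapt(mean_voice, U, w, B, h)
```

In the toy case a speaker is described by 2 weights instead of 4 model values; at the scale on the slide, it is 10-30 weights instead of 89,600.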
Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
speaker-adapted (SA) performs midway between
speaker-dependent (SD) & speaker-independent (SI)
[Figure labels: Mix, SA]
3. Soundtrack Classification
• Short video clips as the
evolution of snapshots
10-100 sec, one location,
no editing
browsing?
• Need information for indexing...
video + audio
foreground + background
MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
similar to speaker ID and music genre classification
• Full Covariance matrix of MFCCs
maps the kinds of spectral shapes present
• Clip-to-clip distances for SVM classifier
by KL or 2nd Gaussian model
[Figure: Video Soundtrack VTS_04_0001 - Spectrogram (freq / kHz vs. time / sec, level / dB); MFCC features (MFCC bin vs. time / sec, value); MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension)]
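One way to read the clip representation above: summarize each clip's feature frames by a mean vector and a full covariance matrix, then compare clips with the KL divergence between the two Gaussians. The sketch below assumes the MFCC frames are already computed (the 2-dimensional frames are invented), uses the standard closed-form Gaussian KL, and symmetrizes it by summing both directions, which is an assumption rather than something the slide states.

```python
import math

def mean_and_cov(frames):
    """Sample mean and full covariance of a list of feature vectors."""
    n, d = len(frames), len(frames[0])
    mu = [sum(f[j] for f in frames) / n for j in range(d)]
    cov = [[sum((f[i] - mu[i]) * (f[j] - mu[j]) for f in frames) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov

def inverse_and_logdet(M):
    """Gauss-Jordan inverse and log|det| of a nonsingular square matrix."""
    d = len(M)
    A = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
         for i, row in enumerate(M)]
    logdet = 0.0
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        logdet += math.log(abs(p))
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col:
                f = A[r][col]
                A[r] = [rv - f * cv for rv, cv in zip(A[r], A[col])]
    return [row[d:] for row in A], logdet

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in nats."""
    d = len(mu0)
    inv1, logdet1 = inverse_and_logdet(cov1)
    _, logdet0 = inverse_and_logdet(cov0)
    trace = sum(inv1[i][j] * cov0[j][i] for i in range(d) for j in range(d))
    diff = [b - a for a, b in zip(mu0, mu1)]
    maha = sum(diff[i] * inv1[i][j] * diff[j]
               for i in range(d) for j in range(d))
    return 0.5 * (trace + maha - d + logdet1 - logdet0)

def symmetric_kl(stats_a, stats_b):
    """Sum of the two directed KL divergences between clip statistics."""
    return kl_gauss(*stats_a, *stats_b) + kl_gauss(*stats_b, *stats_a)

# Toy "MFCC" frames for two clips (hypothetical 2-dimensional features).
clip_a = mean_and_cov([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
clip_b = mean_and_cov([[1.0, 0.0], [3.0, 1.0], [1.0, 3.0], [4.0, 2.0]])
distance = symmetric_kl(clip_a, clip_b)
```

Each clip needs more frames than feature dimensions for the covariance to be invertible; real clips of 10-100 sec easily satisfy this for around 20 MFCC dimensions.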
Classification Results
Chang, Ellis et al. ’07
Lee & Ellis ’10
• All classifiers vs. all labels
some concepts are more audio-related
Mutual Information Proportion:
$\mathrm{MIP} = \dfrac{I(\text{classifier};\, \text{label})}{H(\text{label})}$
[Figure: CCV classifiers vs. labels, two panels: Average Precision (mean=0.300) and Mutual Info Prop (mean=0.175). Concepts: Basketball, Baseball, Soccer, IceSkating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday, WedRecep, WedCerem, WedDance, MusicPerf, NonMusicPerf, Parade, Beach, Playground, RAND]
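The Mutual Information Proportion can be computed from paired classifier decisions and ground-truth labels. This is a minimal sketch, assuming discrete (for example binary) decisions and plug-in entropy estimates from counts; the decision and label lists are invented.

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b), in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def mip(decisions, labels):
    """Fraction of the label entropy explained by the classifier output.

    Undefined when all labels are identical, since H(label) is then 0.
    """
    return mutual_information(decisions, labels) / entropy(labels)

labels = [1, 1, 0, 0]
perfect = mip([1, 1, 0, 0], labels)      # classifier reproduces the labels
uninformative = mip([1, 0, 1, 0], labels)  # classifier independent of labels
```

Because it is normalized by H(label), the measure can be compared across concepts with very different prior frequencies.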
Matching Videos via Fingerprints
Cotton & Ellis ’10
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: spectrograms, freq / kHz (0-4) vs. time / sec: Video IMpLQaiHWbE at 195 s; Video Yi1hkNkqHBc at 218 s]
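A rough sketch of how landmark-pair matching can work, assuming spectral peaks have already been picked as (frame, frequency bin) pairs. The hash layout (first frequency, frequency difference, time difference), the pairing limits, and the peak lists are illustrative assumptions, not the settings used in the cited work. Two recordings match when many shared hashes agree on the same time offset.

```python
from collections import Counter

def landmark_hashes(peaks, max_dt=3, max_df=10):
    """Pair each peak with later nearby peaks.

    Returns (hash, anchor_time) tuples, where the hash
    (f1, f2 - f1, t2 - t1) does not depend on absolute time.
    """
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                      # peaks are time-sorted
            if dt > 0 and abs(f2 - f1) <= max_df:
                out.append(((f1, f2 - f1, dt), t1))
    return out

def best_offset(query, reference):
    """Most common (t_ref - t_query) among matching hashes, with its count."""
    index = {}
    for h, t in reference:
        index.setdefault(h, []).append(t)
    votes = Counter(t_ref - t_q
                    for h, t_q in query
                    for t_ref in index.get(h, []))
    return votes.most_common(1)[0] if votes else None

# Hypothetical peaks; the query is the reference delayed by 100 frames,
# with one peak missing and one spurious peak added.
ref_peaks = [(0, 10), (1, 12), (2, 15), (5, 20), (6, 22), (8, 18)]
query_peaks = [(100, 10), (101, 12), (103, 40), (105, 20), (106, 22), (108, 18)]

match = best_offset(landmark_hashes(query_peaks), landmark_hashes(ref_peaks))
```

Missing or extra peaks only remove or add a few hashes, so the vote for the true offset survives, which is the sense in which the pairs are robust to noise.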
4. Music Audio Analysis
• ... at all levels
from notes to genres
[Figure: Let it Be (final verse), 162-174 s: Signal (freq / kHz, level / dB); Melody (C2-C5); Piano (C2-C5); Onsets & Beats; Per-frame chroma (C, D, E, G, A; intensity 0-1); Per-beat normalized chroma (time / beats, 390-415)]
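To illustrate the chroma representations named in the figure, the sketch below maps a frequency to one of 12 pitch classes and converts per-frame chroma into per-beat chroma by averaging the frames within each beat. Scaling each beat so its largest bin equals 1 is an assumed normalization; the slide does not say which one is used. Beat boundaries are assumed to be given as strictly increasing frame indices.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4=440.0):
    """Index 0-11 (C = 0) of the equal-tempered note nearest freq_hz."""
    midi = round(69 + 12 * math.log2(freq_hz / a4))
    return midi % 12

def beat_sync_chroma(frames, beat_starts):
    """Average 12-bin chroma frames within each beat, then max-normalize.

    frames: list of 12-element lists; beat_starts: frame index of each beat.
    """
    bounds = list(beat_starts) + [len(frames)]
    out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = frames[start:end]
        avg = [sum(f[k] for f in seg) / len(seg) for k in range(12)]
        peak = max(avg)
        out.append([v / peak if peak > 0 else 0.0 for v in avg])
    return out

# Toy input: two frames dominated by C, then two dominated by A.
c_frame = [2.0 if k == 0 else 0.0 for k in range(12)]
a_frame = [4.0 if k == 9 else 1.0 for k in range(12)]
per_beat = beat_sync_chroma([c_frame, c_frame, a_frame, a_frame], [0, 2])
```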
Polyphonic Transcription
Grindlay & Ellis ’09
• Apply the Eigenvoice idea to music
eigeninstruments?
• Subspace NMF
Melodic-Harmonic Mining
Bertin-Mahieux et al. ’10, ’11
• Million Song Dataset
as Echo Nest Analyze
• Frequent clusters of 12 x 8 binarized event-chroma
[Figure: processing stages: Music audio, Beat tracking, Chroma features, Key normalization, Landmark identification, Locality Sensitive Hash Table]
[Figure: the 60 most frequent clusters, each labeled with its count, from #1 (3491) to #60 (531); panel labels: Original, Reconstruction]
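The idea of hashing binarized chroma patches can be sketched with bit sampling, a simple locality-sensitive hash family for Hamming distance: patches that agree on a randomly chosen subset of bits share a bucket. The slide does not say which hash family is used, so the family, the threshold, and the number of sampled bits below are all assumptions.

```python
import random
from collections import defaultdict

N_BITS = 12 * 8   # one bit per chroma bin per beat

def binarize(patch, threshold=0.5):
    """Flatten a 12 x 8 patch (12 rows of 8 beats) into a tuple of 0/1 bits."""
    return tuple(1 if v >= threshold else 0 for row in patch for v in row)

def make_bit_sampler(n_samples, seed=0):
    """Hash function that keeps a fixed random subset of bit positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(N_BITS), n_samples))
    return lambda bits: tuple(bits[p] for p in positions)

def bucket_patches(patches, n_samples=16, seed=0):
    """Group patch indices by their sampled-bit key."""
    hasher = make_bit_sampler(n_samples, seed)
    table = defaultdict(list)
    for idx, patch in enumerate(patches):
        table[hasher(binarize(patch))].append(idx)
    return table

# Toy patches: two identical, one with every bit flipped.
base = [[1.0 if (k + t) % 3 == 0 else 0.0 for t in range(8)] for k in range(12)]
flipped = [[1.0 - v for v in row] for row in base]
table = bucket_patches([base, base, flipped])
largest_bucket = max(table.values(), key=len)
```

Sampling fewer bits makes near-duplicate patches collide more often at the cost of more false matches; several tables with different seeds are a common way to trade these off.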
Results - Beatles
• Over 86 Beatles tracks
• All beat offsets = 41,705 patches
LSH takes 300 sec - approx NlogN in patches?
• High-pass along time
to avoid sustained notes
• Song filter
remove hits in same track
[Figure: chroma patches (chroma bin vs. beat): 05-Here There And Everywhere 12.1-20.5s; 09-Martha My Dear 90.9-98.6s; 12-Piggies 22.0-29.6s; 02-I Should Have Known Better 92.4-97.7s]
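The two filters on this slide can be sketched simply. A first difference along the beat axis is used here as the simplest possible high-pass filter, which is an assumption; the slide does not give the filter actually used. The hit format (track, beat) is also hypothetical.

```python
def highpass_along_time(chroma):
    """First difference along beats; chroma[k][t] is bin k at beat t.

    A note held at constant level contributes zeros after its onset.
    """
    return [[row[t] - row[t - 1] for t in range(1, len(row))] for row in chroma]

def drop_same_track_hits(hits):
    """Keep only matches whose two patches come from different tracks.

    hits: list of ((track_a, beat_a), (track_b, beat_b)) pairs.
    """
    return [h for h in hits if h[0][0] != h[1][0]]

sustained = [[1.0, 1.0, 1.0, 1.0]]           # one bin held for four beats
changing = [[0.0, 1.0, 0.0, 1.0]]            # one bin switching every beat

hits = [(("Piggies", 40), ("Piggies", 72)),
        (("Piggies", 40), ("Martha My Dear", 181))]
cross_track = drop_same_track_hits(hits)
```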
Summary
• LabROSA: getting information from sound
• Speech
monaural separation using eigenvoices
binaural + reverb using MESSL
• Environmental
classification of consumer video
landmark-based events and matching
• Music
transcription of notes, chords, ...
large corpus mining
• http://labrosa.ee.columbia.edu/