Lab Research Overview ROSA

advertisement
LabROSA
Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
1.
2.
3.
4.
http://labrosa.ee.columbia.edu/
Real-World Sound
Speech Separation
Environmental Audio Classification
Music Audio Analysis
LabROSA Overview - Dan Ellis
2010-09-10
1 /19
LabROSA Overview
Information
Extraction
Music
Recognition
Environment
Separation
Machine
Learning
Retrieval
Signal
Processing
Speech
LabROSA Overview - Dan Ellis
2010-09-10
2 /19
1. Real-World Sound
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
-60
level / dB
0
2
4
6
8
10
12
time/s
02_m+s-15-evil-goodvoice-fade
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
Choir
Strings
• Sounds rarely occur in isolation
.. so analyzing mixtures (“scenes”) is a problem
.. for humans and machines
LabROSA Overview - Dan Ellis
2010-09-10
3 /19
Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregmanʼ90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain
LabROSA Overview - Dan Ellis
2010-09-10
4 /19
2. Speech Separation
Roweis ’01, ’03
Kristjannson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination
p(x|i1, i2) = N (x; ci1 + ci2, Σ) model
{i1(t), i2(t)} = argmaxi1,i2 p(x(t)|i1, i2) inference of
source state
can include sequential constraints...
• E.g. stationary noise:
In speech-shaped noise
(mel magsnr = 2.41 dB)
freq / mel bin
Original speech
80
80
80
60
60
60
40
40
40
20
20
20
0
1
LabROSA Overview - Dan Ellis
2 time / s
0
1
2
VQ inferred states
(mel magsnr = 3.6 dB)
0
1
2
2010-09-10
5 /19
Eigenvoices
Weiss & Ellis ’09, ’10
• Idea: Find speaker model
parameter space
generalize without
losing detail?
Speaker models
Speaker subspace bases
Mean Voice
Frequency (kHz)
8
• Eigenvoice model:
20
6
30
4
40
2
50
b d g p t k jh ch s z f th v dh m n l
280 states x 320 bins
= 89,600 dimensions
10-30 dimensions
Frequency (kHz)
8
8
6
6
4
4
2
2
0
mean
voice
eigenvoice
bases
w + B
weights
Frequency (kHz)
adapted
model
8
6
h
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 2
8
6
4
channel4 channel
bases 2 weights
2
0
b d g p t k jh ch s z f th v dh m n l
8
)
LabROSA Overview - Dan Ellis
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 1
b d g p t k jh ch s z f th v dh m n l
µ = µ̄ + U
10
r w y iy ih eh ey ae aa aw ay ah ao owuw ax
Eigenvoice dimension 3
2010-09-10
6 /19
8
Speaker-Adapted Separation
LabROSA Overview - Dan Ellis
2010-09-10
7 /19
Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
speaker adapted (SA) performs midway between
speaker-dependent (SD) & speaker-indep (SI)
SI
SA
LabROSA Overview - Dan Ellis
2010-09-10
8 /19
3. Soundtrack Classification
• Short video clips as the
evolution of snapshots
10-100 sec, one location,
no editing
browsing?
• Need information for indexing...
video + audio
foreground + background
LabROSA Overview - Dan Ellis
2010-09-10
9 /19
MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
similar to speaker ID and music genre classification
• Full Covariance matrix of MFCCs
8
7
6
5
4
3
2
1
0
VTS_04_0001 - Spectrogram
MFCC
Covariance
Matrix
30
20
10
0
-10
MFCC covariance
-20
1
2
3
4
5
6
7
8
9
time / sec
20
18
16
14
12
10
8
6
4
2
1
2
3
4
5
6
7
8
9
time / sec
50
20
level / dB
18
16
20
15
10
5
0
-5
-10
-15
-20
value
MFCC dimension
MFCC
features
MFCC bin
Video
Soundtrack
freq / kHz
maps the kinds of spectral shapes present
14
12
0
10
8
6
4
2
5
10
15
MFCC dimension
• Clip-to-clip distances for SVM classifier
20
-50
by KL or 2nd Gaussian model
LabROSA Overview - Dan Ellis
2010-09-10 10/19
AP
Classification Results
• Wide range:
0.8
Chang, Ellis et al. ’07
Lee & Ellis ’10
1G + KL
1G + Mah
GMM Hist. + pLSA
Guess
0.7
0.6
0.5
0.4
0.3
0.2
0.1
Da Ski
nc
in
g
Bo
at
Ba
by
Be
a
W ch
ed
di
M ng
us
eu
m
Pa
Pl ra
ay de
gr
ou
nd
Su
ns
et
Pi
cn
Bi ic
rth
da
y
Gr
On
e
pe
r
ou son
p
of
3
+
M
us
ic
Ch
e
Gr
ou er
p
of
2
Sp
or
ts
Pa
r
Cr k
ow
d
Ni
gh
An t
im
Si al
ng
in
g
Sh
ow
0
Concepts
audio (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes
LabROSA Overview - Dan Ellis
2010-09-10 11/19
Matching Videos via Fingerprints
are a noise-robust
fingerprint
freq / kHz
• Landmark pairs
Cotton & Ellis ’10
VIdeo IMpLQaiHWbE at 195s
4
3
2
• Use to match
0
freq / kHz
distinct videos
with same
sound ambience
1
195.5
196
196.5
197
197.5
198
198.5
199
time / sec
VIdeo Yi1hkNkqHBc at 218 s
4
3
2
1
0
LabROSA Overview - Dan Ellis
218.5
219
219.5
220
220.5
221
221.5
222
time / sec
2010-09-10 12/19
4. Music Audio Analysis
Signal
freq / kHz
Let it Be (final verse)
4
20
0
2
-20
0
• ... at all levels
from notes
to genres
162
Melody
C5
C4
C3
C2
Piano
C5
C4
C3
C2
164
166
168
170
172
level / dB
174 time / s
Onsets
& Beats
G
Per-frame
chroma
E
D
C
1
0.75
0.5
0.25
0
intensity
A
Per-beat
normalized
chroma
G
E
D
C
A
390
LabROSA Overview - Dan Ellis
395
400
405
410
415
time / beats
2010-09-10 13/19
High-Resolution Transcription
Smit & Ellis ’09
• Extract notes from audio
time alignment &
approximate match
• Find most likely
parameters
fundamental, amplitudes
of all harmonics
Frequency (Hz)
• Use musical score
Spectrogram of triplum track: YIN (red)
700
30
500
20
400
300
10
5
6
7
8
9
10
dB
Spectrogram of combined trackes: YIN (red), Prob (blue)
700
40
600
30
500
20
400
300
200
LabROSA Overview - Dan Ellis
40
600
200
Frequency (Hz)
despite overlapping voices
with precise tuning, timing
10
5
6
7
8
Time (s)
9
10
dB
2010-09-10 14/19
Polyphonic Transcription
• Apply the Eigenvoice idea to music
eigeninstruments?
LabROSA Overview - Dan Ellis
Grindlay & Ellis ’09
• Subspace NMF
2010-09-10 15/19
Chord Recognition Weller, Ellis, Jebara ’09
• StructSVM vs. HMM: better results
LabROSA Overview - Dan Ellis
2010-09-10 16/19
Melodic-Harmonic Mining
• 100,000 tracks from morecowbell.dj
as Echo Nest Analyze
• Frequent
clusters
of 12 x 8
binarized
eventchroma
Bertin-Mahieux et al. ’10
Beat
tracking
Music
audio
Chroma
features
Key
normalization
Locality
Sensitive
Hash Table
Landmark
identification
#1 (3491)
#2 (2775))
#3 (2255)
#4 (1241))
#5 (1224))
#6 (1218))
#7 (1092))
#8 (1084))
#9 (1080))
#10 (1035))
# (1021)
#11
# (1005))
#12
#13 (974)
#14 (942))
#15 (936))
#16 (924))
#17 (920))
#18 (913))
#19 (901))
#20 (897)
#21 (887)
#22 (882))
#23 (881)
#24 (881))
#25 (879))
#26 (875))
#27 (875))
#28 (874))
#29 (868))
#30 (844)
#31 (839)
#32 (839))
#33 (794)
#34 (786))
#35 (785))
#36 (747))
#37 (731))
#38 (714))
#39 (706))
#40 (698)
#41 (682)
#42 (678))
#43 (675)
#44 (657))
#45 (656))
#46 (651))
#47 (647))
#48 (638))
#49 (610))
#50 (593)
#51 (592)
#52 (591))
#53 (589)
#54 (572))
#55 (571))
#56 (550))
#57 (549))
#58 (534))
#59 (534))
#60 (531)
#61 (528)
#62 (525))
#63 (522)
#64 (514))
#65 (510))
#66 (507))
#67 (500))
#68 (497))
#69 (486))
#70 (479)
#71 (476)
#72 (468))
#73 (468)
#74 (466))
#75 (463))
#76 (454))
#77 (453))
#78 (448))
#79 (441))
#80 (440)
LabROSA Overview - Dan Ellis
2010-09-10 17/19
Results - Beatles
• Over 86 Beatles tracks
• All beat offsets = 41,705 patches
LSH takes 300 sec - approx NlogN in patches?
• High-pass
• Song filter
remove hits
in same
track
LabROSA Overview - Dan Ellis
05-Here There And Everywhere 12.1-20.5s
10
10
8
8
chroma bin
12
6
6
4
4
2
2
09-Martha My Dear 90.9-98.6s
12-Piggies 22.0-29.6s
12
12
10
10
8
8
chroma bin
to avoid
sustained
notes
chroma bin
along time
chroma bin
02-I Should Have Known Better 92.4-97.7s
12
6
6
4
4
2
2
5
10
beat
15
20
5
10
beat
15
20
2010-09-10 18/19
Summary
• LabROSA : getting information from sound
• Speech
monaural separation using eigenvoices
binaural + reverb using MESSL
• Environmental
classification of consumer video
landmark-based events and matching
• Music
transcription of notes, chords, ...
large corpus mining
LabROSA Overview - Dan Ellis
2010-09-10 19/19
Download