Some projects in
Real-World Sound Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Real-World Sound
2. Speech Separation
3. Soundtrack Classification
4. Music Audio Analysis
Real-World Sound Analysis - Dan Ellis
2009-12-10 - 1 /30
LabROSA Overview
[Diagram: Information Extraction at the overlap of Machine Learning and Signal Processing, applied to Speech, Music, and Environment audio, for Recognition, Separation, and Retrieval tasks]
1. Real-World Sound
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman ’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain
Approaches to Separation
• ICA: multi-channel, fixed filtering, perfect separation (maybe!)
• CASA: single-channel, time-varying filter, approximate separation
• Model-based: any domain, parameter search, synthetic output?

[Diagrams: ICA: target x + interference n → mix → proc → x̂ + n̂; CASA: mix x+n → STFT → mask → ISTFT → x̂; Model-based: mix x+n → param. fit (against source models) → params → synthesis → x̂]
2. Speech Separation
Roweis ’01, ’03
Kristjansson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination model:  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
inference of source state:  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
can include sequential constraints...
• E.g. stationary noise:
[Figure: mel spectrograms (80 bins × 2 s): original speech; in speech-shaped noise (mel magSNR = 2.41 dB); VQ inferred states (mel magSNR = 3.6 dB)]
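The combination model and state inference above can be sketched in a few lines: for each mixed frame, search all codeword pairs for whose sum best explains the observation. A minimal sketch assuming simple VQ codebooks and an identity covariance; `infer_joint_states` and its arguments are illustrative, not from the talk.

```python
import numpy as np

def infer_joint_states(X, C1, C2):
    """For each frame x(t), find the codeword pair (i1, i2) maximizing
    N(x; c_i1 + c_i2, I), i.e. minimizing squared distance to the sum.
    X: (T, F) log-mel spectra; C1: (K1, F), C2: (K2, F) source codebooks.
    Returns (i1, i2) index arrays of length T."""
    combo = C1[:, None, :] + C2[None, :, :]                      # (K1, K2, F) all sums
    d = ((X[:, None, None, :] - combo[None]) ** 2).sum(axis=-1)  # (T, K1, K2)
    flat = d.reshape(len(X), -1).argmin(axis=1)
    return np.unravel_index(flat, (C1.shape[0], C2.shape[0]))
```

Sequential constraints (as noted on the slide) would replace the per-frame argmin with a Viterbi search over codeword pairs.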
Speech Mixture Recognition
Varga & Moore ’90
Kristjansson, Hershey et al. ’06
• Speech recognizers contain speech models
ASR is just argmax_W P(W | X)
• Recognize mixtures with Factorial HMM
i.e. two state sequences, one model for each voice
exploit sequence constraints, speaker differences
[Figure: factorial HMM trellis: model 1 and model 2 state sequences over observations / time]
• Speaker-specific models...
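The factorial HMM above can be decoded by Viterbi over the product of the two chains' state spaces, given per-frame joint observation likelihoods from a combination model. A sketch under independent-chain assumptions; names and the brute-force product-space search are illustrative (real systems prune this heavily).

```python
import numpy as np

def product_viterbi(logA1, logA2, loglik):
    """Joint Viterbi over the product of two Markov chains.
    logA1: (K1, K1), logA2: (K2, K2) log transition matrices;
    loglik: (T, K1, K2) per-frame joint observation log-likelihoods.
    Returns the two decoded state paths."""
    K1, K2 = logA1.shape[0], logA2.shape[0]
    T = loglik.shape[0]
    # independent chains: product-state transitions factorize
    logA = (logA1[:, None, :, None] + logA2[None, :, None, :]).reshape(K1 * K2, K1 * K2)
    ll = loglik.reshape(T, K1 * K2)
    delta = ll[0].copy()
    back = np.zeros((T, K1 * K2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # (prev, next)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + ll[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return np.unravel_index(path, (K1, K2))
```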
Eigenvoices
Kuhn et al. ’98, ’00
Weiss & Ellis ’07, ’08, ’09
• Idea: find a speaker model parameter space
generalize without losing detail?
[Figure: individual speaker models projected onto speaker subspace bases]
• Eigenvoice model:
μ = μ̄ + U w + B h
(adapted model = mean voice + eigenvoice bases × weights + channel bases × channel weights)
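Given the model μ = μ̄ + Uw + Bh, adapting to a new speaker amounts to estimating the small weight vectors w and h. In the real system this happens inside factorial-HMM EM; as a sketch, if a target mean were observed directly, the weights would follow from least squares (function name hypothetical):

```python
import numpy as np

def fit_eigenvoice_weights(mu_obs, mu_bar, U, B):
    """Least-squares estimate of eigenvoice weights w and channel
    weights h from mu_obs ≈ mu_bar + U w + B h.
    mu_obs, mu_bar: (D,); U: (D, nw); B: (D, nh)."""
    M = np.hstack([U, B])                                  # stacked bases (D, nw + nh)
    coef, *_ = np.linalg.lstsq(M, mu_obs - mu_bar, rcond=None)
    return coef[:U.shape[1]], coef[U.shape[1]:]
```

The point of the subspace is that nw + nh is tiny compared with the 89,600-dimensional model, so very little adaptation data is needed.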
Eigenvoice Bases
• Mean model: 280 states × 320 bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
• additional components for channel
[Figure: Mean Voice and Eigenvoice dimensions 1-3, shown as frequency (0-8 kHz) vs phoneme state (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax)]
Speaker-Adapted Separation
Weiss & Ellis ’08
• Factorial HMM analysis
with tuning of source model parameters
= eigenvoice speaker adaptation
Speaker-Adapted Separation
Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
speaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)
[Figure: separation performance for SI, SA, and SD models]
Spatial Separation
Mandel & Ellis ’07
• 2 or 3 sources in reverberation
assume just 2 ‘ears’
• Model interaural spectrum of each source
as stationary level and time differences:
ILD and IPD
• Sources at 0° and 75° in reverb
[Figure: IPD and ILD at left and right ears for Source 1, Source 2, and their mixture]
Model-Based EM Source Separation
and Localization (MESSL)
Mandel & Ellis ’09
• EM iteration: re-estimate source parameters ↔ assign spectrogram points to sources
can model more sources than sensors
flexible initialization
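The EM loop above can be caricatured with an IPD-only mixture: the E-step assigns each spectrogram point a per-source responsibility, the M-step re-estimates each source's interaural parameters. A toy sketch; real MESSL also models ILD and frequency-dependent phase, and all names here are illustrative.

```python
import numpy as np

def messl_like_em(ipd, n_src=2, n_iter=20, var=0.5):
    """Toy EM in the spirit of MESSL: model the interaural phase
    difference (IPD) at each spectrogram point as a Gaussian mixture,
    one component per source.  Returns the source IPD means and the
    soft time-frequency masks (posteriors)."""
    ipd = ipd.ravel()
    mu = np.linspace(ipd.min(), ipd.max(), n_src)   # spread initial means
    post = np.full((n_src, ipd.size), 1.0 / n_src)
    for _ in range(n_iter):
        # E-step: responsibility of each source for each TF point
        ll = -0.5 * (ipd[None, :] - mu[:, None]) ** 2 / var
        ll -= ll.max(axis=0)
        post = np.exp(ll)
        post /= post.sum(axis=0)
        # M-step: update each source's IPD mean
        mu = (post * ipd).sum(axis=1) / post.sum(axis=1)
    return mu, post
```

The posteriors are exactly the "assign spectrogram points to sources" step: thresholded or used soft, they become separation masks.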
MESSL Results
• Modeling uncertainty improves results
tradeoff between constraints & noisiness
[Figure: example masks and output SNRs across model variants: 2.45 dB, 0.22 dB, 12.35 dB, 8.77 dB, 9.12 dB, -2.72 dB]
MESSL-SP (Source Prior)
Ron Weiss, “Underdetermined Source Separation Using Speaker Subspace Models”
• Parameter estimation and source separation
MESSL-SP Results
• Source models function as priors
• Interaural parameters → spatial separation
source model prior improves spatial estimate
[Figure: separation masks, frequency (0-8 kHz) vs time (0-1.5 s): Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), MESSL EV (10.37 dB)]
3. Soundtrack Classification
• Short video clips as the
evolution of snapshots
10-100 sec, one location,
no editing
browsing?
• Need information for indexing...
video + audio
foreground + background
MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
similar to speaker ID and music genre classification
• Full covariance matrix of MFCCs
maps the kinds of spectral shapes present
[Figure: video soundtrack spectrogram (0-8 kHz, 9 s) → MFCC features (20 dims) → 20 × 20 MFCC covariance matrix]
• Clip-to-clip distances for SVM classifier
by KL or 2nd Gaussian model
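The clip representation and distance above can be sketched directly: summarize each clip as a Gaussian (mean plus full MFCC covariance) and compare clips with symmetrized KL divergence, which can then feed an SVM kernel. Function names are illustrative.

```python
import numpy as np

def clip_stats(mfcc):
    """Summarize a clip's (T, D) MFCC matrix as mean + full covariance."""
    return mfcc.mean(axis=0), np.cov(mfcc, rowvar=False)

def gauss_kl(m0, S0, m1, S1):
    """KL divergence between Gaussians, KL(N(m0,S0) || N(m1,S1))."""
    D = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - D
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def clip_distance(a, b):
    """Symmetrized KL between two (mean, cov) clip summaries."""
    return gauss_kl(*a, *b) + gauss_kl(*b, *a)
```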
Latent Semantic Analysis (LSA)
Hofmann ’99
• Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
.. each clip may have several things going on
• Topic sets optimized through EM:
p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)
[Figure: p(ftr | clip) matrix (GMM histogram ftrs × AV clip) factored as p(ftr | topic) * p(topic | clip)]
use p(topic | clip) as per-clip features
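The EM optimization of the topic decomposition can be sketched as standard pLSA updates over clip-level histograms (a generic implementation, not the system's exact code):

```python
import numpy as np

def plsa(counts, n_topics=4, n_iter=50, seed=0):
    """EM for pLSA: factor p(ftr | clip) = sum_z p(ftr | z) p(z | clip).
    counts: (n_clips, n_ftrs) histogram of GMM-component occupancies.
    Returns p(ftr | topic) and p(topic | clip); the latter is the
    per-clip feature vector from the slide."""
    rng = np.random.default_rng(seed)
    n_clips, n_ftrs = counts.shape
    p_f_z = rng.random((n_topics, n_ftrs)); p_f_z /= p_f_z.sum(1, keepdims=True)
    p_z_c = rng.random((n_clips, n_topics)); p_z_c /= p_z_c.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(z | clip, ftr)
        joint = p_z_c[:, :, None] * p_f_z[None, :, :]        # (clips, z, ftrs)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        ec = counts[:, None, :] * joint
        p_f_z = ec.sum(axis=0); p_f_z /= p_f_z.sum(1, keepdims=True) + 1e-12
        p_z_c = ec.sum(axis=2); p_z_c /= p_z_c.sum(1, keepdims=True) + 1e-12
    return p_f_z, p_z_c
```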
Classification Results
Chang, Ellis et al. ’07
Lee & Ellis ’10
• Wide range:
[Figure: per-concept average precision (AP, 0-0.8) for 1G + KL, 1G + Mahalanobis, GMM Hist. + pLSA, and Guess baselines, over concepts: Dancing, Ski, Boat, Baby, Beach, Wedding, Museum, Parade, Playground, Sunset, Picnic, Birthday, One person, Group of 3+, Music, Cheer, Group of 2, Sports, Park, Crowd, Night, Animal, Singing, Show]
audio (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes
Temporal Refinement
• Global vs. local class models
tell-tale acoustics may be ‘washed out’ in statistics
try iterative realignment of HMMs:
[Figure: YT baby 002 spectrograms (0-4 kHz, 0-15 s): Old Way, all frames contribute to voice / baby / laugh labels; New Way, limited temporal extents (bg, voice, baby, laugh segments)]
“background” (bg) model shared by all clips
Multiple-Instance Learning
Landmark-based Features
• Describe audio with biggest Gabor elements
extract with Matching Pursuit (MP)
neighbor pairs as “features”?
Landmark models for similar events
• Build index of Gabor neighbor pairs
Cotton & Ellis ’09
recognize repeated events with similar pairs
Matching Videos via Fingerprints
Cotton & Ellis ’10?
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: spectrograms (0-4 kHz) of video IMpLQaiHWbE at 195 s and video Yi1hkNkqHBc at 218 s, with matching landmark pairs]
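The landmark-pair fingerprinting can be sketched as: hash spectrogram peak pairs as (f1, f2, Δt), build an inverted index, and vote for the clip with the most time-consistent matching hashes. Parameters and names are illustrative, not the actual system's.

```python
from collections import defaultdict

def landmark_pairs(peaks, max_dt=32):
    """Turn spectrogram peaks [(t, f), ...] into hashable landmark
    pairs (f1, f2, dt) anchored at t1 -- the noise-robust fingerprint."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > max_dt:
                break
            yield (f1, f2, t2 - t1), t1

def build_index(clips):
    """clips: {clip_id: peak list}.  Inverted index hash -> [(clip, t)]."""
    index = defaultdict(list)
    for cid, peaks in clips.items():
        for h, t in landmark_pairs(peaks):
            index[h].append((cid, t))
    return index

def best_match(index, peaks):
    """Vote for the clip sharing the most time-consistent landmarks."""
    votes = defaultdict(int)
    for h, t in landmark_pairs(peaks):
        for cid, t_ref in index.get(h, []):
            votes[(cid, t_ref - t)] += 1
    return max(votes, key=votes.get)[0] if votes else None
```

Voting on (clip, time offset) pairs is what makes the match robust: shared ambience shows up as many hashes agreeing on a single offset.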
4. Music Audio Analysis
• Interested in all levels, from notes to genres
[Figure: “Let it Be” (final verse, 162-174 s): signal spectrogram, melody and piano transcriptions (C2-C5), onsets & beats, per-frame chroma, and per-beat normalized chroma (A-G, beats 390-415)]
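The per-beat normalized chroma in the figure can be sketched as averaging per-frame chroma between beat times and normalizing each beat column (a sketch assuming frame-level chroma and beat frame indices are already computed):

```python
import numpy as np

def beat_sync_chroma(chroma, beat_frames):
    """Average per-frame chroma (12, T) over beat segments and
    normalize each beat column to a peak of 1."""
    cols = []
    bounds = list(beat_frames) + [chroma.shape[1]]
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = chroma[:, a:b].mean(axis=1)          # one column per beat
        cols.append(seg / (seg.max() + 1e-9))      # peak-normalize
    return np.stack(cols, axis=1)                  # (12, n_beats)
```

Beat-synchronous, normalized chroma discards tempo and loudness, leaving the harmonic content that the mining experiments below operate on.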
Polyphonic Transcription
Grindlay & Ellis ’09
• Apply the Eigenvoice idea to music: eigeninstruments?
• Subspace NMF
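The NMF building block behind subspace transcription can be sketched with standard multiplicative updates, where columns of W play the role of note/instrument spectral templates and H holds their activations (a generic sketch, not the eigeninstrument model itself):

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Multiplicative-update NMF for the Frobenius cost, V ≈ W H.
    V: (F, T) nonnegative magnitude spectrogram."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3
    H = rng.random((rank, T)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update templates
    return W, H
```

The subspace version constrains W to lie in a low-dimensional instrument space, exactly as eigenvoices constrain speaker means.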
Chord Recognition
Tsochantaridis et al. ’05
Weller, Ellis, Jebara ’09
• Structural Support Vector Machine
joint features Ψ(x, y) describe the match between x and y
learn weights w so that ⟨w, Ψ(x, y)⟩ is max for the correct y
• Much better results
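The inference step of such a structured model, finding the chord sequence y that maximizes ⟨w, Ψ(x, y)⟩, reduces to Viterbi when Ψ decomposes into per-beat and transition terms. The decomposition below is illustrative, not the paper's exact feature map:

```python
import numpy as np

def chord_viterbi(w_emit, w_trans, features):
    """Find y maximizing <w, Psi(x, y)> when Psi splits into
    per-beat emission features and chord-transition indicators.
    features: (T, D) per-beat feature vectors (e.g. chroma);
    w_emit: (C, D) per-chord weights; w_trans: (C, C) transition weights."""
    scores = features @ w_emit.T                  # (T, C) emission scores
    T, C = scores.shape
    delta = scores[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        s = delta[:, None] + w_trans              # (prev, next)
        back[t] = s.argmax(axis=0)
        delta = s.max(axis=0) + scores[t]
    y = np.zeros(T, dtype=int)
    y[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y
```

Training then adjusts w_emit and w_trans so the correct sequence outscores every alternative by a margin.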
Melodic-Harmonic Mining
• 100,000 tracks from morecowbell.dj as Echo Nest Analyze features
• Pipeline: music audio → beat tracking → chroma features → key normalization → landmark identification → locality-sensitive hash table
• Frequent clusters of 12 × 8 binarized event-chroma patches
[Figure: top 80 cluster patterns with occurrence counts, from #1 (3491) down to #80 (440)]
Summary
• LabROSA: getting information from sound
• Speech
monaural separation using eigenvoices
binaural + reverb using MESSL
• Environmental
classification of consumer video
landmark-based events and matching
• Music
transcription of notes, chords, ...
large corpus mining