Using the Soundtrack
to Classify Videos
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Machine Listening
2. Global Classification
3. Foreground Classification
4. Outstanding Issues
Soundtrack Classification - Dan Ellis
2011-05-04
1 /16
1. Machine Listening
• Extracting useful information from sound
[Figure: machine listening tasks laid out by Task (Detect, Classify, Describe) against Domain (Speech, Music, Environmental Sound). Labels in the figure: VAD, Speech/Music, ASR, Music Transcription, Environment Awareness, Emotion, Music Recommendation, Automatic Narration, “Sound Intelligence”]
The Information in Audio
• Environmental recordings contain info on:
location – type (restaurant, street, ...) and specifics
activity – talking, walking, typing, ...
people – generic (2 males), specific (Chuck & John)
spoken content (sometimes)
• but not:
what people and things “looked like”
day/night ...
... except when correlated with audible features
Applications
• Audio Lifelog Diarization
[Figure: diarized lifelog timelines for 2004-09-13 and 2004-09-14, 09:00 to 18:00, with segments labeled by place or event (preschool, cafe, lecture, office, outdoor, lab, group, meeting2, compmtg, DSP03, L2, postlec) and by person (Ron, Manuel, Mike, Lesser, Arroyo?, Sambarta?)]
• Consumer Video Classification
Consumer Video Dataset
• 25 “concepts” from Kodak user study
boat, crowd, cheer, dance, ...
• Grab top 200 videos from YouTube search
filter for raw, unedited = 1873 videos
manually relabel with concepts
• Concept overlap:
[Figure: concept overlap matrix, Labeled Concepts against Overlapped Concepts, scale 0 to 1. Concepts: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+. Other labels in the extracted figure: 1G+KL2 (10/15), 8GMM+Bha (9/15), pLSA500+lognorm (12/15)]
2. Global Classification
• Baseline for soundtrack classification
[Figure: VTS_04_0001 video soundtrack. Panels: spectrogram (freq / kHz vs time / sec, level / dB), MFCC features (MFCC bin vs time / sec), MFCC covariance matrix (MFCC dimension vs MFCC dimension)]
divide sound into short frames (e.g. 30 ms)
calculate features (e.g. MFCC) for each frame
describe clip by statistics of frames (mean, covariance)
= “bag of features”
• Classify by e.g. Mahalanobis distance + SVM
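The baseline above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: the per-frame features are random stand-ins for MFCCs, the function names are hypothetical, and pooling the two clip covariances before taking the Mahalanobis distance is an assumption. Turning the distance into an `exp(-d)` similarity for an SVM is likewise only one common option.

```python
import numpy as np

def clip_statistics(frames):
    """Summarize a clip by the mean and covariance of its per-frame features.

    frames: array of shape (n_frames, n_dims), e.g. one MFCC vector per 30 ms frame.
    """
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    return mean, cov

def mahalanobis(mean_a, mean_b, cov):
    """Mahalanobis distance between two clip means under a shared covariance."""
    diff = mean_a - mean_b
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Stand-in features (assumption): 300 frames x 20 dimensions per clip.
rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, size=(300, 20))
clip_b = rng.normal(0.5, 1.0, size=(300, 20))

mean_a, cov_a = clip_statistics(clip_a)
mean_b, cov_b = clip_statistics(clip_b)
pooled_cov = 0.5 * (cov_a + cov_b)   # assumption: average the two clip covariances

d_ab = mahalanobis(mean_a, mean_b, pooled_cov)
d_aa = mahalanobis(mean_a, mean_a, pooled_cov)
similarity_ab = np.exp(-d_ab)        # one way to feed a distance to a kernel classifier
```

Because the clip is reduced to a mean and a covariance, the order of the frames is discarded, which is what “bag of features” refers to.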
Codebook Histograms
• Convert nonplanar distributions to multinomial
[Figure: MFCC features in the MFCC(0) vs MFCC(1) plane modeled by a Global Gaussian Mixture Model; Per-Category Mixture Component Histogram (count vs GMM mixture)]
• Classify by distance on histograms
KL, Chi-squared
+ SVM
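A minimal sketch of the histogram representation and the two distances named above. For brevity it assigns each frame to its nearest codeword, which is a simplification of assigning frames to components of a global Gaussian mixture model. The codebook, the data, and the function names are illustrative assumptions.

```python
import numpy as np

def codebook_histogram(frames, centers):
    """Normalized count of frames assigned to each codeword (nearest center)."""
    # Squared Euclidean distance from every frame to every center.
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(centers)).astype(float)
    return counts / counts.sum()

def chi_squared(p, q, eps=1e-12):
    """Chi-squared distance between two histograms."""
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + eps)))

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence, KL(p||q) + KL(q||p), with light smoothing."""
    p = p + eps
    q = q + eps
    return float(np.sum((p - q) * np.log(p / q)))

# Hypothetical 3-word codebook in a 2-D feature space.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
frames_a = np.array([[0.1, 0.0], [0.2, -0.1], [4.9, 5.1], [5.2, 4.8]])
frames_b = np.array([[-5.1, 5.0], [-4.8, 5.2], [0.0, 0.1], [-5.0, 4.9]])

hist_a = codebook_histogram(frames_a, centers)   # half the frames in word 0, half in word 1
hist_b = codebook_histogram(frames_b, centers)   # mostly word 2
dist_chi2 = chi_squared(hist_a, hist_b)
dist_kl = symmetric_kl(hist_a, hist_b)
```

Either distance can then be turned into a kernel for an SVM, in the same way as the distances on clip statistics.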
Global Classification Results
Lee & Ellis ’10
[Figure: average precision per concept class, with legend: Guessing; MFCC + GMM Classifier; Single-Gaussian + KL2; 8-GMM + Bha; 1024-GMM Histogram + 500-log(P(z|c)/P(z))]
• Wide range in performance
audio (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes
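For reference, average precision (AP) for one concept can be computed from a ranked list as sketched below. The function name and the toy scores are illustrative, and this is the usual non-interpolated definition, which may differ in detail from the evaluation code used for the results.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Toy example: positives ranked 1st and 3rd out of four clips.
ap_example = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])

# With a single positive clip, moving it from rank 1 to rank 2 halves the AP.
ap_rare_first = average_precision([0.9, 0.1], [1, 0])
ap_rare_second = average_precision([0.1, 0.9], [1, 0])
```

The last two lines illustrate why AP is so uncertain for infrequent classes: with few positive clips, one ranking swap changes the score substantially.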
Combining with Video
Chang et al. ’07
• Video classification by SIFT codebooks
[Figure: average precision (AP) per concept for Random Baseline, Video Only, Audio Only, and A+V Fusion]
• Audio adds most for dancing, baby, museum ...
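The slides do not say how the audio and video classifiers are fused. One simple possibility, shown purely as an assumption, is late fusion by a weighted average of the per-clip scores. The function name, the weight, and the scores below are hypothetical.

```python
def fuse_scores(audio_scores, video_scores, audio_weight=0.5):
    """Weighted average of per-clip audio and video classifier scores."""
    if len(audio_scores) != len(video_scores):
        raise ValueError("score lists must have the same length")
    video_weight = 1.0 - audio_weight
    return [audio_weight * a + video_weight * v
            for a, v in zip(audio_scores, video_scores)]

# Hypothetical scores for one concept over four clips.
audio = [0.9, 0.2, 0.6, 0.1]
video = [0.3, 0.4, 0.8, 0.2]
fused = fuse_scores(audio, video, audio_weight=0.5)

# Clips ordered from most to least likely to contain the concept.
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

In practice the weight would be tuned per concept on held-out data, since audio helps far more for some concepts than for others.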
3. Foreground Classification
• Global vs. local class models
Lee, Ellis & Loui ’10
tell-tale acoustics may be ‘washed out’ in statistics
try iterative realignment of HMMs:
[Figure: YT baby 002 spectrograms (freq / kHz vs time / s) with segments labeled voice, baby, laugh, and bg. Old Way: All frames contribute. New Way: Limited temporal extents]
“background” model shared by all clips
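The realignment step of such a scheme is a Viterbi decode: given per-frame log-likelihoods under each state model, find the most likely state sequence, then re-estimate each model from the frames assigned to it. Below is a minimal sketch of the decode only. The two-state setup (0 = background, 1 = foreground concept) and all probabilities are made-up illustrations, not values from the cited work.

```python
import math

def viterbi(log_start, log_trans, log_emit):
    """Most likely state sequence for a Markov model.

    log_start[s]: log prior of starting in state s
    log_trans[r][s]: log probability of moving from state r to state s
    log_emit[t][s]: log-likelihood of frame t under state s
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical example: state 0 = shared background, state 1 = foreground concept.
log_start = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
frame_likelihoods = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8),
                     (0.1, 0.9), (0.2, 0.8), (0.95, 0.05)]
log_emit = [[math.log(p) for p in frame] for frame in frame_likelihoods]

path = viterbi(log_start, log_trans, log_emit)
```

Here the decoded path is `[0, 0, 1, 1, 1, 0]`: only the three middle frames are attributed to the foreground state, so only they would contribute when that concept's model is re-estimated.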
Transient Features
Cotton, Ellis, Loui ’11
• Onset detector finds energy bursts
best SNR
• PCA basis to represent each 300 ms x auditory freq patch
• “bag of transients”
• Object-related...
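A rough sketch of an energy-based onset detector of the kind described above: flag a frame when its energy jumps well above the average of the preceding frames. The frame length, history length, threshold, and function names are all assumptions for illustration. The detector in the cited work operates on an auditory-frequency representation rather than on broadband energy.

```python
import math

def frame_energies_db(samples, frame_len):
    """Energy of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-10))
    return energies

def detect_onsets(energies_db, history=3, threshold_db=6.0):
    """Indices of frames whose energy exceeds the recent average by threshold_db."""
    onsets = []
    for t in range(history, len(energies_db)):
        baseline = sum(energies_db[t - history:t]) / history
        if energies_db[t] - baseline > threshold_db:
            # Skip frames too close to the previous onset to avoid double triggers.
            if not onsets or t - onsets[-1] > history:
                onsets.append(t)
    return onsets

# Synthetic signal: ten 100-sample frames, quiet except for bursts in frames 3 and 7.
frame_len = 100
samples = []
for frame_index in range(10):
    amplitude = 1.0 if frame_index in (3, 7) else 0.01
    samples.extend(amplitude * (1 if n % 2 == 0 else -1) for n in range(frame_len))

onsets = detect_onsets(frame_energies_db(samples, frame_len))
```

Each detected onset would then anchor a fixed-length time-frequency patch, and the patches from a clip form its “bag of transients”.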
Nonnegative Matrix Factorization
Smaragdis Brown ’03
Abdallah Plumbley ’04
Virtanen ’07
• Decompose spectrograms into templates + activation
X = W·H
fast forgiving gradient descent algorithm
2D patches
sparsity control...
[Figure: original mixture spectrogram (freq / Hz vs time / s) and its NMF components: Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
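As an illustration of the factorization X = W·H, the sketch below uses the standard multiplicative updates for the squared-error objective. That these updates are the algorithm the slide has in mind is an assumption. The function name, the matrix sizes, and the synthetic data are also illustrative, and no sparsity control or 2D patch structure is included.

```python
import numpy as np

def nmf(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (freq x time) as W @ H.

    W holds spectral templates (one per column) and H their activations
    over time (one per row). Uses multiplicative updates for squared error.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic "spectrogram" built from two known templates and activations.
rng = np.random.default_rng(1)
W_true = rng.random((6, 2))
H_true = rng.random((2, 8))
X = W_true @ H_true

W, H = nmf(X, n_components=2, n_iter=200)
W_one, H_one = nmf(X, n_components=2, n_iter=1)
error_after_200 = np.linalg.norm(X - W @ H)
error_after_1 = np.linalg.norm(X - W_one @ H_one)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout, and the reconstruction error after many iterations is lower than after the first.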
4. Outstanding Issues
• How to separate foreground & background?
• How to exploit prior knowledge of sounds?
• How to make classification credible?
Classifier Insight
• How can we understand classification results?
... to inspire confidence & enrich results
• Temporal breakdown
[Figure: spectrogram of YT_beach_001.mpeg (freq / Hz, 0 to 4k). Manual clip-level labels: “beach”, “group of 3+”, “cheer”. Manual frame-level annotation: drum, wind, cheer, voice, water wave. Viterbi path by 27-state Markov model (25 concepts + 2 global BGs)]
• Cluster examples
[Figure: cluster 123 (baby). Other labels in the extracted figure: speaker2, speaker3, speaker4, unseen1, Background, etc]
Summary
• Machine Listening:
Getting useful information from sound
• Environmental sound classification
... from whole-clip statistics?
• Transients
energy peaks
... separate foreground background
• Classifier insight
... for confidence & analysis
References
• [Chang et al. 2007] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, & J. Luo, “Large-scale multimodal semantic concept detection for consumer video,” Proc. ACM MIR, 255-264, 2007.
• [Lee & Ellis 2010] K. Lee & D. Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Tr. Audio, Speech, Lang. Process. 18(6): 1406-1416, Aug. 2010.
• [Lee, Ellis & Loui 2010] K. Lee, D. Ellis, & A. Loui, “Detecting Local Semantic Concepts in Environmental Sounds using Markov Model based Clustering,” Proc. ICASSP, 2278-2281, Dallas, 2010.
• [Cotton, Ellis & Loui 2011] C. Cotton, D. Ellis, & A. Loui, “Soundtrack classification by transient events,” Proc. ICASSP, to appear, Prague, 2011.
• [Ellis, Zeng & McDermott 2011] D. Ellis, X. Zeng, & J. McDermott, “Classifying soundtracks with audio texture features,” Proc. ICASSP, to appear, Prague, 2011.