Subband Autocorrelation Features for Video Soundtrack Classification
Courtenay Cotton & Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{cvcotton,dpwe}@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Soundtrack Classification
2. Auditory Model Features
3. Results & Future Work
Subband Autocorrelation Features - Cotton & Ellis
2013-05-29
1 /17
1. Soundtrack Classification
• Goal: Describe soundtracks with a vocabulary of user-relevant acoustic events/sources
[Figure: spectrogram of YT_beach_001.mpeg (0-4 kHz, 0-10 s) with annotated events — cheers, voices (speakers sp1-sp4), water wave — over background.]
• Challenges:
Defining acoustic event vocabulary
Overlapping sounds
Ground-truth training data
Classifier accuracy
Environmental Sound Applications
• Audio Lifelog Diarization
[Figure: two days of lifelog audio (2004-09-13/14, 09:00-18:00) automatically segmented into episodes — preschool, cafe, lecture, office, outdoor, lab, meetings — with speaker labels.]
• Consumer Video Classification & Search
• Live hearing prosthesis app
• Robot environment sensitivity
Consumer Video Dataset
Y.-G. Jiang et al. 2011
• Columbia Consumer Video (CCV) set
9,317 videos / 210 hours
20 concepts based on consumer user study
Labeled via Amazon Mechanical Turk
Soundtrack Classification
K. Lee & Ellis 2010
• Baseline soundtrack classification system:
[Figure: pipeline from video soundtrack → spectrogram → per-frame MFCC features → MFCC covariance matrix.]
divide sound into short frames (e.g. 30 ms)
calculate MFCC features for each frame
describe clip by statistics of frames (mean, covariance)
= “bag of features”
• Classify by e.g. Mahalanobis distance + SVM
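The clip-level statistics step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' released code; the MFCC matrix here is random stand-in data. With 20-dimensional MFCCs, the mean plus upper-triangular covariance gives 20 + 210 = 230 dimensions per clip.

```python
import numpy as np

def clip_statistics(mfcc):
    """Summarize a clip's per-frame MFCCs (frames x dims) as a fixed-length
    "bag of features": the mean vector plus the upper-triangular covariance."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False)
    iu = np.triu_indices(cov.shape[0])   # upper triangle incl. diagonal
    return np.concatenate([mu, cov[iu]])

# toy stand-in: 900 frames (9 s at a 10 ms hop) of 20-dim MFCCs
rng = np.random.default_rng(0)
frames = rng.normal(size=(900, 20))
feat = clip_statistics(frames)
print(feat.shape)   # (230,) = 20 means + 20*21/2 covariance entries
```

The fixed-length vector is what makes variable-length soundtracks comparable by a distance measure or an SVM.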
Codebook Histograms
• Instead of Gaussian model,
convert high-dim. distributions to multinomial
Vector Quantization (VQ)
[Figure: per-frame MFCC features → global Gaussian mixture model (16 components in the MFCC(0)-MFCC(1) plane) → per-category mixture-component count histogram.]
• Classify by distance on histograms
KL, Chi-squared
+ SVM
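A minimal sketch of the codeword-histogram idea (illustrative sizes; a random codebook stands in for the means of a trained global mixture model):

```python
import numpy as np

def vq_histogram(frames, codebook):
    """Quantize each frame to its nearest codeword (Euclidean) and return
    the clip's normalized codeword-count histogram."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    codes = d.argmin(axis=1)   # per-frame codeword index
    hist = np.bincount(codes, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 20))   # stand-in for trained GMM means
frames = rng.normal(size=(500, 20))    # stand-in per-frame MFCC features
h = vq_histogram(frames, codebook)
print(h.shape)   # (16,), entries sum to 1
```

Histograms over a shared codebook turn the high-dimensional frame distribution into a multinomial that histogram distances (KL, chi-squared) can compare directly.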
Baseline Limitations
• MFCCs model broad spectral shape
but strip away fine detail
transients, pitch, texture
[Figure: spectrograms (0-4 kHz, 10-15 s) of the original audio vs. its MFCC resynthesis.]
• BoF loses temporal structure...
• Channel & Noise
2. Auditory Model Features
Lyon et al. 2010
• Lyon’s Stabilized Auditory Image (SAI) model
fine structure stabilized by ‘strobing’ on transients
[Figure: generating sparse codes from audio with an auditory front end — audio in → cochlea simulation → strobe detection → temporal integration → auditory image → multiscale segmentation → sparse coding → sparse codes summed into a document feature vector.]
Subband Autocorrelation (SBPCA)
B.-S. Lee & Ellis 2012
• Simplified version of Lyon model
10x faster (RTx5 → RT/2)
• Captures fine time structure in multiple bands
... the information that is lost in MFCC features
[Figure: SBPCA pipeline — sound → cochlea filterbank → per-channel delay-line short-time autocorrelation (correlogram slices over frequency × lag vs. time) → subband VQ → histogram feature vector.]
Filterbank
• Simple four-pole, two-zero bandpass filters
• Constant-Q, log-spacing
very rough approximation to cochlea
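A rough sketch of such a filterbank. Note the hedge: this substitutes 2nd-order Butterworth bandpass sections for the four-pole, two-zero design, and the sample rate, band count, and Q value are illustrative assumptions, not the system's actual parameters.

```python
import numpy as np
from scipy.signal import butter, lfilter

def make_filterbank(sr=8000, n_bands=24, fmin=100.0, q=8.0):
    """Constant-Q, log-spaced bandpass filterbank: a stand-in sketch using
    2nd-order Butterworth sections (the real SBPCA front end uses
    four-pole, two-zero cochlea-style filters)."""
    fmax = sr / 2 * 0.9
    centers = np.geomspace(fmin, fmax / (1 + 1 / (2 * q)), n_bands)
    bank = []
    for fc in centers:
        lo, hi = fc * (1 - 1 / (2 * q)), fc * (1 + 1 / (2 * q))
        bank.append(butter(2, [lo, hi], btype="bandpass", fs=sr))
    return centers, bank

def apply_filterbank(x, bank):
    # one bandpass output per channel: (n_bands, n_samples)
    return np.stack([lfilter(b, a, x) for b, a in bank])

sr = 8000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s of a 440 Hz tone
centers, bank = make_filterbank(sr)
sub = apply_filterbank(x, bank)
print(sub.shape)   # (24, 8000)
```

For the tone input, most energy lands in the channel whose center frequency is nearest 440 Hz, which is the behavior the later autocorrelation stage relies on.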
Subband Autocorrelation
• Autocorrelation stabilizes fine time structure
25 ms window, lags up to 25 ms
calculated every 10 ms
normalized to max (zero lag)
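The windowing scheme can be sketched for a single subband signal. The window, hop, and normalization follow the slide; the sample rate and test signal are assumptions for illustration.

```python
import numpy as np

def subband_autocorr(x, sr=8000, win_ms=25, hop_ms=10):
    """Short-time autocorrelation of one subband signal: 25 ms windows
    every 10 ms, lags 0..window length, normalized by the zero-lag value."""
    win = int(sr * win_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sr * hop_ms / 1000)   # 80 samples
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win]
        ac = np.correlate(seg, seg, mode="full")[win - 1:]   # lags 0..win-1
        frames.append(ac / max(ac[0], 1e-12))                # normalize to zero lag
    return np.array(frames)

# a 440 Hz tone: the first off-zero-lag peak sits near lag sr/440 ≈ 18 samples
sr = 8000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
A = subband_autocorr(x, sr)
print(A.shape)   # (98, 200)
```

Because each row is divided by its zero-lag value, the features are invariant to channel level, which is part of the noise robustness argument for SBPCA.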
PCA Summarization
• Autocorrelation of bandpass signals is constrained
full image = 24 bands x 200 lags
but intrinsic information is much lower
• Reduce dimensionality with Principal Component Analysis (PCA)
calculated on individual bands to retain separability
per-subband PCA bases have little dependence on training material
10 dimensions adequate
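The per-band PCA can be sketched with one SVD per subband. Random stand-in correlogram frames are used here; the 24 bands × 200 lags → 10 dimensions per band follow the slide.

```python
import numpy as np

def per_band_pca(corr, n_keep=10):
    """Fit a separate PCA per subband and keep the top components.
    corr: (n_frames, n_bands, n_lags) autocorrelation frames.
    Returns (n_frames, n_bands, n_keep) projected features."""
    n_frames, n_bands, n_lags = corr.shape
    out = np.empty((n_frames, n_bands, n_keep))
    for b in range(n_bands):
        X = corr[:, b, :] - corr[:, b, :].mean(axis=0)
        # principal directions of this band from the SVD of centered data
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        out[:, b, :] = X @ vt[:n_keep].T
    return out

rng = np.random.default_rng(2)
corr = rng.normal(size=(300, 24, 200))   # toy correlogram frames
feat = per_band_pca(corr)
print(feat.shape)   # (300, 24, 10)
```

Fitting a basis per band, rather than one PCA over the whole 4800-dim image, is what keeps the bands separable for the subsequent subband VQ step.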
Subband VQ
• Summarize features across segments by VQ histogram
• 4 separate bands to provide overlap resistance
non-overlapping groups of 6 subbands x 10 PCA coefs
• Each band quantized into 1000 codewords;
whole soundtrack → codeword histogram
• 4000 dimensions, sparse
SVM classifier with Chi2 kernel
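One common form of the exponentiated chi-squared kernel on codeword histograms, sketched directly; whether this exact variant (vs. the additive chi-squared kernel) matches the system is an assumption.

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=1.0):
    """Exponentiated chi-squared kernel between rows of two histogram
    matrices: K(h, g) = exp(-gamma * 0.5 * sum_i (h_i - g_i)^2 / (h_i + g_i))."""
    d = np.empty((len(H1), len(H2)))
    for i, h in enumerate(H1):
        num = (h - H2) ** 2
        den = h + H2 + 1e-12   # avoid division by zero on empty bins
        d[i] = 0.5 * (num / den).sum(axis=1)
    return np.exp(-gamma * d)

# toy codeword histograms (rows sum to 1)
rng = np.random.default_rng(3)
H = rng.random((5, 16))
H /= H.sum(axis=1, keepdims=True)
K = chi2_kernel(H, H)
print(K.shape)   # (5, 5); the diagonal is exactly 1
```

A precomputed kernel matrix like this can be fed to any SVM implementation that accepts custom kernels.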
Auditory Model Feature Results
• SAI and SBPCA close to MFCC baseline
[Figure: per-concept average precision (AP, 0-1) for MFCC, SAI, SBPCA and their fusions (MFCC + SBPCA, MFCC + SAI, SAI + SBPCA, all 3) over the 20 CCV concepts: Basketball, Baseball, Soccer, IceSkating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday, WeddingRecep, WeddingCerem, WeddingDance, MusicPerfmnce, NonMusicPerf, Parade, Beach, Playground.]
• Fusing MFCC and SBPCA improves mAP by 15% relative
mAP: 0.35 → 0.40
• Calculation time
MFCC: 6 hours
SAI: 1087 hours
SBPCA: 110 hours
Retrieval Examples
• Browsing results are sometimes surprising!
3. Open Issues
• Foreground & Events
transients at a coarser scale
• Better object/event separation
parametric models
spatial information?
computational auditory scene analysis...
[Figure: CASA grouping cues — frequency proximity, common onset, harmonicity (Barker et al. ’05).]
• Better Annotation
granularity in time & concept
Summary
• Soundtrack classification
Acoustic properties of different events
• Standard model
MFCC + bag-of-features + classifier
• Auditory model features
to recapture the fine time structure
combination improves retrieval
code available:
http://labrosa.ee.columbia.edu/projects/calcSBPCA/
References
• Jon Barker, Martin Cooke, & Dan Ellis, “Decoding Speech in the Presence of Other Sources,” Speech Communication 45(1): 5-25, 2005.
• Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A.C. Loui, “Consumer video understanding: A benchmark database and an evaluation of human and machine performance,” in Proc. ACM International Conference on Multimedia Retrieval (ICMR), Apr. 2011, p. 29.
• Byung-Suk Lee & Dan Ellis, “Noise robust pitch tracking by subband autocorrelation classification,” in Proc. INTERSPEECH-12, Sept. 2012.
• Keansub Lee & Dan Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
• R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, pp. 2390-2416, Sept. 2010.
Acknowledgment
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via
Department of Interior National Business Center contract number D11PC20070. The
U.S. Government is authorized to reproduce and distribute reprints for Government
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S.
Government.