Environmental Sound Recognition and Classification

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

Outline:
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues

Environmental Sound Recognition - Dan Ellis, 2011-06-01
1. Machine Listening

• Extracting useful information from sound
  ... like animals do

[Figure: grid of machine-listening tasks (Detect / Classify / Describe) by domain
(Speech / Environmental Sound / Music) - e.g. VAD, speech/music detection, ASR,
environment awareness, music transcription and recommendation, automatic
narration, emotion - together, "Sound Intelligence".]
Environmental Sound Applications

• Audio Lifelog Diarization

[Figure: two days of a personal audio lifelog (2004-09-13/14), 09:00-18:00,
automatically segmented and labeled by environment (office, cafe, outdoor, lab,
preschool) and by events such as lectures, meetings, and individual speakers.]

• Consumer Video Classification & Search
Prior Work

• Environment Classification
  speech / music / silent / machine
  Zhang & Kuo '01

• Nonspeech Sound Recognition
  meeting-room audio event classification
  sports events - cheers, bat/ball sounds, ...
  Temko & Nadeu '06
Consumer Video Dataset

• 25 "concepts" from Kodak user study
  boat, crowd, cheer, dance, ...

• Grab top 200 videos from YouTube search,
  then filter for quality, unedited = 1873 videos;
  manually relabel with concepts

[Figure: frequency of labeled and overlapped concepts across the 1873 videos,
from frequent (group of 3+, music, crowd, cheer) down to rare (picnic, museum).]
Obtaining Labeled Data

• Amazon Mechanical Turk
  Y-G Jiang et al. 2011
  10 s clips
  9,641 videos in 4 weeks
2. Background Classification

• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance)
  = "bag of features"

[Figure: video soundtrack spectrogram → per-frame MFCC features →
per-clip MFCC covariance matrix.]

• Classify by e.g. KL distance + SVM
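The single-Gaussian "bag of features" baseline above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the original LabROSA code; function names are mine, and the frame features are assumed to be precomputed MFCCs (T frames x D coefficients):

```python
import numpy as np

def clip_stats(frames):
    """Summarize a clip's frame features (T x D) by mean and covariance."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, cov

def kl2(stats_a, stats_b):
    """Symmetrized KL divergence between two Gaussian clip models."""
    def kl(m0, S0, m1, S1):
        d = len(m0)
        S1inv = np.linalg.inv(S1)
        diff = m1 - m0
        return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - d
                      + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
    (ma, Sa), (mb, Sb) = stats_a, stats_b
    return kl(ma, Sa, mb, Sb) + kl(mb, Sb, ma, Sa)
```

These pairwise distances can then feed an SVM via, e.g., an exp(-kl2/γ) kernel, as in the "Single-Gaussian + KL2" system on the results slide.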
Codebook Histograms

• Convert high-dim. distributions to multinomial
  train a global Gaussian Mixture Model on MFCC features,
  then describe each clip by its per-category
  mixture-component histogram (counts over GMM components)

[Figure: MFCC(0)/MFCC(1) scatter with 15 global GMM components, and a
resulting per-clip component-count histogram.]

• Classify by distance on histograms
  KL, Chi-squared
  + SVM
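A minimal sketch of the codebook-histogram idea, using hard nearest-codeword assignment instead of a full GMM (an assumed simplification; the slides use soft GMM component occupancy):

```python
import numpy as np

def codebook_histogram(frames, centers):
    """Hard-assign each frame (T x D) to its nearest codeword (K x D)
    and return the normalized count histogram (K,)."""
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    hist = np.bincount(idx, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def chi2_distance(p, q, eps=1e-10):
    """Chi-squared distance between two clip histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```

Histogram distances (chi-squared or KL) again plug into an SVM kernel for classification.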
Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram
  as a mixture of several 'topics'
  .. each clip may have several things going on

• Topic sets optimized through EM
  p(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip)

[Figure: the AV-clip-by-GMM-histogram-feature matrix p(ftr | clip) factored as
p(ftr | topic) times p(topic | clip).]

  use (normalized?) p(topic | clip) as per-clip features
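The pLSA factorization above can be fit with a short EM loop. This is a generic pLSA sketch over a clips x features count matrix, not the exact training setup of the slides:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for pLSA on a clips x features count matrix.
    Returns p(ftr|topic) (topics x F) and p(topic|clip) (clips x topics)."""
    rng = np.random.default_rng(seed)
    C, F = counts.shape
    p_f_z = rng.random((n_topics, F)); p_f_z /= p_f_z.sum(1, keepdims=True)
    p_z_c = rng.random((C, n_topics)); p_z_c /= p_z_c.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(topic | clip, ftr), shape C x Z x F
        joint = p_z_c[:, :, None] * p_f_z[None, :, :]
        post = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: reweight posteriors by the observed counts
        weighted = counts[:, None, :] * post            # C x Z x F
        p_f_z = weighted.sum(0)
        p_f_z /= p_f_z.sum(1, keepdims=True) + 1e-12
        p_z_c = weighted.sum(2)
        p_z_c /= p_z_c.sum(1, keepdims=True) + 1e-12
    return p_f_z, p_z_c
```

The rows of p(topic | clip) are the per-clip features mentioned on the slide.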
Background Classification Results
K Lee & Ellis '10

[Figure: per-concept Average Precision for guessing, Single-Gaussian + KL2,
8-GMM + Bhattacharyya, and 1024-GMM histogram + 500-topic pLSA
(log(P(z|c)/P(z)) features), all on MFCC + GMM classifiers.]

• Wide range in performance
  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
Sound Texture Features
McDermott & Simoncelli '09
Ellis, Zheng, McDermott '11

• Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis

Pipeline: sound → automatic gain control → mel filterbank (18 chans) →
subband envelopes, summarized by:
  moments: mean, var, skew, kurt (18 x 4)
  cross-band envelope correlations (318 samples)
  modulation energy: FFT of envelopes in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6)
then compare clips by e.g. Mahalanobis distance.

[Figure: spectrograms and texture features (subband distributions & envelope
cross-correlations) for clip 1159_10 "urban cheer clap" vs. clip 1062_60
"quiet dubbed speech music".]
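A simplified numpy sketch of the texture statistics, assuming the subband envelopes (T x n_bands) are already computed by a mel filterbank; the exact correlation layout differs from the 318-sample set on the slide:

```python
import numpy as np

def texture_features(envelopes, sr_env=100.0):
    """Texture statistics from subband envelopes (T x n_bands):
    per-band moments, cross-band correlations, octave-bin modulation energy."""
    e = envelopes
    mu = e.mean(0)
    sd = e.std(0) + 1e-12
    z = (e - mu) / sd
    # moments: mean, variance, skewness, kurtosis per band
    moments = np.stack([mu, sd ** 2, (z ** 3).mean(0), (z ** 4).mean(0)], 1)
    # cross-band envelope correlations (upper triangle)
    corr = np.corrcoef(e.T)
    iu = np.triu_indices_from(corr, k=1)
    xbc = corr[iu]
    # modulation energy in octave bins of the envelope spectrum
    spec = np.abs(np.fft.rfft(e - mu, axis=0)) ** 2
    freqs = np.fft.rfftfreq(len(e), d=1.0 / sr_env)
    edges = [0.5, 1, 2, 4, 8, 16, 32]
    mod = np.stack([spec[(freqs >= lo) & (freqs < hi)].sum(0)
                    for lo, hi in zip(edges[:-1], edges[1:])], 1)
    return np.concatenate([moments.ravel(), xbc, mod.ravel()])
```

With 18 bands this yields 72 moment values, 153 correlations, and 108 modulation energies per clip.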
Sound Texture Features

• Test on MED 2010 development data
  10 specially-collected manual labels:
  clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural

[Figure: per-label means and standard deviations of normalized texture features
(moments M V S K, modulation spectra "ms", cross-band correlations "cbc"), and
classification results for M, MVSK, ms, cbc, and their combinations.]

• Contrasts in feature sets track the correlation of labels...

• Perform ~same as MFCCs; combine well
3. Foreground Event Recognition
K Lee, Ellis, Loui '10

• Global vs. local class models
  tell-tale acoustics may be 'washed out' in statistics
  try iterative realignment of HMMs:

[Figure: spectrogram of clip "YT baby 002" with voice / baby / laugh events.
Old way: all frames contribute to each class model.
New way: limited temporal extents - frames are reassigned to
voice / baby / laugh / background segments.]

  "background" model shared by all clips
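The iterative-realignment idea can be sketched with a two-state (background vs. event) Viterbi segmentation that is alternated with model re-estimation. This is a toy spherical-Gaussian version, not the actual HMM training code:

```python
import numpy as np

def viterbi_segment(ll, switch_cost=3.0):
    """Best state path (0=background, 1=event) given per-frame
    log-likelihoods ll (T x 2), penalizing switches of state."""
    T = len(ll)
    score = ll[0].astype(float).copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for s in (0, 1):
            stay, move = score[s], score[1 - s] - switch_cost
            back[t, s] = s if stay >= move else 1 - s
            new[s] = max(stay, move) + ll[t, s]
        score = new
    path = np.empty(T, dtype=int)
    path[-1] = int(score.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def realign(frames, bg_mu, ev_mu, n_iter=5):
    """Iteratively re-estimate the event model mean from only the frames
    that the Viterbi path assigns to the event state."""
    path = np.zeros(len(frames), dtype=int)
    for _ in range(n_iter):
        # unit-variance Gaussian log-likelihoods (up to a constant)
        ll = np.stack([-((frames - bg_mu) ** 2).sum(1),
                       -((frames - ev_mu) ** 2).sum(1)], 1)
        path = viterbi_segment(ll)
        if path.any():
            ev_mu = frames[path == 1].mean(0)
    return ev_mu, path
```

After realignment the event model is trained only on its limited temporal extent, rather than on every frame of the clip.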
Foreground Event HMMs
Lee & Ellis '10

• Training labels only at clip level
  25 concepts + 2 global BGs

• Refine models by EM realignment

• Use for classifying entire video...
  or seeking to relevant part

[Figure: spectrogram of YT_beach_001.mpeg (manual clip-level labels: "beach",
"group of 3+", "cheer"); its manual frame-level annotation (wind, cheer, voice,
water wave, drum, ...); and the Viterbi path through a 27-state Markov model
spanning the 25 concepts plus 2 global background states.]
Transient Features
Cotton, Ellis, Loui '11

• Transients = foreground events?

• Onset detector finds energy bursts
  best SNR

• PCA basis to represent each
  300 ms x auditory-frequency patch

• "bag of transients"
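The transient pipeline above - onset detection, fixed-width patch extraction, PCA basis - might be sketched as follows. The spectral-flux onset picker and all thresholds are illustrative assumptions, not the detector used in the paper:

```python
import numpy as np

def onset_times(spec, thresh=1.0, min_gap=5):
    """Pick onsets as peaks of positive spectral flux on a (T x F)
    magnitude spectrogram; diff index i marks an onset at frame i+1."""
    flux = np.maximum(np.diff(spec, axis=0), 0).sum(axis=1)
    onsets = []
    for t in range(1, len(flux) - 1):
        if flux[t] > thresh and flux[t] >= flux[t - 1] and flux[t] >= flux[t + 1]:
            if not onsets or (t + 1) - onsets[-1] >= min_gap:
                onsets.append(t + 1)
    return onsets

def transient_patches(spec, onsets, width=30):
    """Extract fixed-width patches after each onset, flattened to vectors."""
    return np.array([spec[t:t + width].ravel()
                     for t in onsets if t + width <= len(spec)])

def pca_basis(patches, k=8):
    """PCA via SVD of centered patches; returns the top-k basis and the
    patch coefficients in it (the 'bag of transients' entries)."""
    mu = patches.mean(0)
    _, _, vt = np.linalg.svd(patches - mu, full_matrices=False)
    basis = vt[:k]
    return basis, (patches - mu) @ basis.T
```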
Nonnegative Matrix Factorization
Smaragdis & Brown '03
Abdallah & Plumbley '04
Virtanen '07

• Decompose spectrograms into templates + activations
  X = W · H
  fast, forgiving gradient-descent algorithm
  2D patches
  sparsity control...

[Figure: original mixture spectrogram and three learned bases
(Basis 1: L2; Bases 2 and 3: L1 sparsity) with their activations over 10 s.]
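The X = W·H decomposition can be sketched with the classic Lee-Seung multiplicative updates for the Euclidean objective (one of several NMF algorithms the slides allude to; no sparsity control here):

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0):
    """Euclidean NMF X ≈ W @ H via multiplicative updates.
    X: nonnegative (F x T) spectrogram; W: templates; H: activations."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + 0.1
    H = rng.random((rank, T)) + 0.1
    for _ in range(n_iter):
        # multiplicative updates keep W, H nonnegative by construction
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

Each column of W is a spectral template; each row of H is its activation over time.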
NMF Transient Features
Cotton & Ellis '11

• Learn 20 patches from Meeting Room Acoustic Event data

• Compare to MFCC-HMM detector

[Figure: the 20 learned spectro-temporal patches (0-0.5 s, up to ~3.5 kHz),
and Acoustic Event Error Rate (AEER%) in noise at 10-30 dB SNR for the MFCC,
NMF, and combined systems.]

• NMF more noise-robust; combines well...
4. Speech Separation

• Speech recognition is finding best-fit parameters - argmax_W P(W | X)

• Recognize mixtures with Factorial HMM
  model + state sequence for each voice/source
  exploit sequence constraints, speaker differences

[Figure: two Markov models (model 1, model 2) jointly explaining the
observations over time.]

  separation relies on detailed speaker model
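A generic product-state Viterbi decode illustrates the factorial-HMM idea: the two sources evolve independently, but every frame is scored jointly under each state pair. This is a sketch (the joint observation log-likelihoods are assumed precomputed), not the actual separation system:

```python
import numpy as np

def factorial_viterbi(obs_ll, trans1, trans2):
    """Joint Viterbi decode of two HMMs over a mixture.
    obs_ll: T x K1 x K2 log-likelihood of each frame under each state pair;
    trans1, trans2: per-source log transition matrices.
    Returns the best state sequence for each source."""
    T, K1, K2 = obs_ll.shape
    # product-state transitions: sources evolve independently
    trans = (trans1[:, None, :, None]
             + trans2[None, :, None, :]).reshape(K1 * K2, K1 * K2)
    ll = obs_ll.reshape(T, K1 * K2)
    score = ll[0].copy()
    back = np.zeros((T, K1 * K2), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # prev-state x cur-state
        back[t] = cand.argmax(0)
        score = cand.max(0) + ll[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(score.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path // K2, path % K2
```

Note the product state space is K1·K2, which is why detailed per-speaker models make exact decoding expensive.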
Eigenvoices
Kuhn et al. '98, '00
Weiss & Ellis '07, '08, '09

• Idea: Find speaker model parameter space
  generalize without losing detail?

• Eigenvoice model:
  μ = μ̄ + U w + B h
  adapted model = mean voice + eigenvoice bases × weights
                 + channel bases × channel weights

[Figure: individual speaker models projected onto the speaker subspace bases.]
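Adaptation under the μ = μ̄ + U w model reduces to solving for the few weights w from the observed statistics. A minimal least-squares sketch (the channel term B h is omitted; in practice both would be estimated jointly within EM):

```python
import numpy as np

def adapt_eigenvoice(obs_mean, mean_voice, U):
    """Least-squares eigenvoice weights: find w minimizing
    || obs_mean - (mean_voice + U @ w) ||, return the adapted mean and w."""
    w, *_ = np.linalg.lstsq(U, obs_mean - mean_voice, rcond=None)
    return mean_voice + U @ w, w
```

Because w has only a handful of dimensions, a few seconds of adaptation data suffice, which is what lets the model generalize to unseen speakers without losing detail.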
Eigenvoice Bases

• Mean model
  280 states x 320 bins = 89,600 dimensions

• Eigencomponents shift formants/coloration

• Additional components for the acoustic channel

[Figure: the mean voice and the first three eigenvoice dimensions, each shown
as spectral displays (0-8 kHz) across the phone set
b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax.]
Eigenvoice Speech Separation
Weiss & Ellis '10

• Factorial HMM analysis
  with tuning of source model parameters
  = eigenvoice speaker adaptation
Eigenvoice Speech Separation

• Eigenvoices for Speech Separation task
  speaker-adapted (SA) performs midway between
  speaker-dependent (SD) & speaker-independent (SI)

[Audio examples: Mix, SI, SA, SD.]
Binaural Cues

• Model interaural spectrum of each source
  as stationary level and time differences

• e.g. at 75°, in reverb:

[Figure: IPD, IPD residual, and ILD as functions of frequency.]
Model-Based EM Source Separation and Localization (MESSL)
Mandel & Ellis '09

• EM alternates between re-estimating source parameters
  and assigning spectrogram points to sources

  can model more sources than sensors
  flexible initialization
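The MESSL-style alternation can be illustrated with a deliberately simplified toy: per-point interaural delay estimates modeled as a mixture of per-source Gaussians, where the E-step posteriors double as time-frequency masks. Real MESSL models frequency-dependent IPD and ILD; this one-parameter version only conveys the EM structure:

```python
import numpy as np

def messl_like_em(delays, n_src=2, n_iter=30):
    """Toy MESSL-style EM on per-TF-point interaural delay estimates.
    Returns per-source delay means and posterior masks (n_src x delays.shape)."""
    x = delays.ravel()
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_src))
    var = np.full(n_src, x.var() + 1e-6)
    pi = np.full(n_src, 1.0 / n_src)
    post = np.full((n_src, x.size), 1.0 / n_src)
    for _ in range(n_iter):
        # E-step: posterior of each source at each point (the "mask")
        ll = -0.5 * ((x[None] - mu[:, None]) ** 2 / var[:, None]
                     + np.log(2 * np.pi * var[:, None]))
        post = pi[:, None] * np.exp(ll)
        post /= post.sum(0, keepdims=True) + 1e-12
        # M-step: re-estimate each source's delay parameters
        n = post.sum(1) + 1e-12
        mu = (post @ x) / n
        var = (post * (x[None] - mu[:, None]) ** 2).sum(1) / n + 1e-6
        pi = n / n.sum()
    return mu, post.reshape((n_src,) + delays.shape)
```

Because assignment is per spectrogram point rather than per channel, nothing limits the number of sources to the number of sensors.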
MESSL Results

• Modeling uncertainty improves results
  tradeoff between constraints & noisiness

[Figure: separation SNRs for several variants, ranging from -2.72 dB and
0.22 dB through 2.45 dB up to 8.77, 9.12, and 12.35 dB.]
MESSL with Source Priors
Weiss, Mandel & Ellis '11

• Fixed or adaptive speech models
MESSL-EigenVoice Results

• Source models function as priors

• Interaural parameters give spatial separation
  source model prior improves spatial estimate

[Figure: time-frequency masks (0-8 kHz, 0-1.5 s) for
Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB),
MESSL (5.66 dB), MESSL SP (10.01 dB), MESSL EV (10.37 dB).]
Noise-Robust Pitch Tracking
BS Lee & Ellis '11

• Important for voice detection & separation

• Based on channel selection (Wu & Wang 2003)
  pitch from summary autocorrelation over "good" bands
  trained classifier decides which channels to include

• Improves over simple Wu criterion
  especially for mid SNR

[Figure: mean BER of pitch-tracking algorithms (Wu vs. CS-GTPTHvarAdjC5
trained channel selection) from -10 to 25 dB SNR.]
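The summary-autocorrelation step can be sketched as follows, assuming the filterbank outputs are precomputed and the "good" channel mask comes from the trained classifier (here it is just an input):

```python
import numpy as np

def summary_autocorr_pitch(subbands, sr, good, fmin=60.0, fmax=400.0):
    """Pitch from the summary autocorrelation of selected subband signals.
    subbands: (n_chan, N) filterbank outputs; good: boolean channel mask."""
    acs = []
    for ch in np.flatnonzero(good):
        x = subbands[ch] - subbands[ch].mean()
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]
        acs.append(ac / (ac[0] + 1e-12))       # normalize each channel
    summary = np.sum(acs, axis=0)              # pool over "good" bands only
    lo, hi = int(sr / fmax), int(sr / fmin)    # candidate lag range
    lag = lo + np.argmax(summary[lo:hi])
    return sr / lag
```

Excluding noise-dominated channels from the sum is what makes the pooled peak survive at low SNR.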
Noise-Robust Pitch Tracking

• Trained selection includes more off-harmonic channels

[Figure: example 08_rbf_pinknoise5dB - spectrogram (0-4 kHz) with the channels
selected by the Wu criterion vs. trained channel selection (CS-GT), shown over
period (0-25 ms) and time (0-4 s).]
5. Outstanding Issues

• Better object/event separation
  parametric models
  spatial information?
  computational auditory scene analysis...
  (frequency proximity, common onset, harmonicity)
  Barker et al. '05

• Large-scale analysis

• Integration with video
Audio-Visual Atoms
Jiang et al. '09

• Object-related features
  from both audio (transients) & video (patches)

[Figure: visual atoms tracked over time and audio atoms located in the audio
spectrum, combined by joint Audio-Visual Atom discovery.]
Audio-Visual Atoms

• Multi-instance learning of A-V co-occurrences

[Figure: M visual atoms x N audio atoms → M×N candidate A-V atoms →
codebook learning → joint A-V codebook.]
Audio-Visual Atoms

[Figure: discovered atoms - "black suit + romantic music" (Wedding),
"marching people + parade sound" (Parade), "sand + beach sounds" (Beach);
classification results compared for audio only, visual only, and
visual + audio.]
Summary
• Machine Listening:
Getting useful information from sound
• Background sound classification
... from whole-clip statistics?
• Foreground event recognition
... by focusing on peak energy patches
• Speech content is very important
... separate with pitch, models, ...
References

• Jon Barker, Martin Cooke, & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis, & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng, & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug 2010.
• Keansub Lee, Dan Ellis, & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel, & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.