Extracting Information from Sound
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Machine Listening
2. Global Classification
3. Foreground & Transients
4. Outstanding Issues
Information from Sound - Dan Ellis
2011-03-01
1 /19
1. Machine Listening
• Extracting useful information from sound
... like animals do
[Figure: machine listening applications arranged by task (Detect, Classify, Describe) and domain (Speech, Music, Environmental Sound): VAD, Speech/Music, ASR, Music Transcription, Environment Awareness, Emotion, Music Recommendation, Automatic Narration, “Sound Intelligence”]
Listening to Mixtures
• The world is cluttered
& sound is transparent
mixtures are inevitable
• Useful information is structured by ‘sources’
specific definition of a ‘source’:
intentional independence
Applications
• Audio Lifelog Diarization
[Figure: timeline of lifelog audio from 2004-09-13 and 2004-09-14, 09:00 to 18:00, segmented into labeled episodes (preschool, cafe, lecture, office, outdoor, group, lab, L2, DSP03, compmtg, meeting2, postlec), some annotated with names (Ron, Manuel, Mike, Arroyo?, Sambarta?, Lesser)]
• Consumer Video Classification
Consumer Video Dataset
• 25 “concepts” from Kodak user study
boat, crowd, cheer, dance, ...
• Grab top 200 videos from YouTube search
then filter for quality, unedited = 1873 videos
manually relabel with concepts
• Concept overlap:
[Figure: matrix of Labeled Concepts vs. Overlapped Concepts on a 0 to 1 scale, over the 25 concepts: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+]
[Figure labels also present: 1G+KL2 (10/15), 8GMM+Bha (9/15), pLSA500+lognorm (12/15)]
2. Global Classification
• Baseline for soundtrack classification
divide sound into short frames (e.g. 30 ms)
calculate features (e.g. MFCC) for each frame
describe clip by statistics of frames (mean, covariance)
= “bag of features”
[Figure: Video Soundtrack → VTS_04_0001 spectrogram (freq / kHz vs. time / sec, level / dB) → MFCC features (MFCC bin vs. time / sec) → MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
• Classify by e.g. Mahalanobis distance + SVM
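The baseline above (frame, compute features, summarize, compare) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: log band energies stand in for MFCCs, the function names are hypothetical, and only the distance between clip means is shown (the SVM stage is omitted).

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_band_energies(frames, n_bands=8):
    """Per-frame log energy in equal-width bands (assumed stand-in for MFCCs)."""
    window = np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in bands], axis=1)
    return np.log(energies + 1e-10)

def clip_statistics(features):
    """'Bag of features' summary: mean vector and covariance matrix of the frames."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def mahalanobis(mean_a, mean_b, cov):
    """Mahalanobis distance between two mean vectors under covariance `cov`."""
    diff = mean_a - mean_b
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Example: 30 ms frames with a 15 ms hop at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
clip_a = rng.standard_normal(16000)             # white noise
clip_b = np.cumsum(rng.standard_normal(16000))  # low-frequency-heavy noise
mean_a, cov_a = clip_statistics(log_band_energies(frame_signal(clip_a, 480, 240)))
mean_b, _ = clip_statistics(log_band_energies(frame_signal(clip_b, 480, 240)))
distance_ab = mahalanobis(mean_a, mean_b, cov_a)
```

Because the summary discards frame order, two clips with the same frame statistics are indistinguishable here, which is the sense in which this is a "bag" of features.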
Codebook Histograms
• Convert nonplanar distributions to multinomial
[Figure: MFCC features → Global Gaussian Mixture Model (components 1 to 15 over MFCC(0) vs. MFCC(1)) → Per-Category Mixture Component Histogram (count vs. GMM mixture)]
• Classify by distance on histograms
KL, Chi-squared
+ SVM
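As a rough sketch of the histogram idea: each frame is softly assigned to the components of a global GMM, the assignments are accumulated into a per-clip histogram, and clips are compared with a histogram distance. The diagonal-covariance GMM parameters are assumed to be given (fitting is not shown), and all function names are hypothetical.

```python
import numpy as np

def gmm_posteriors(features, means, variances, weights):
    """Posterior probability of each diagonal-Gaussian component for each frame.

    features: (n_frames, n_dims); means, variances: (n_comp, n_dims); weights: (n_comp,)
    """
    diff = features[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def codebook_histogram(features, means, variances, weights):
    """Normalized histogram of component occupancy for one clip."""
    post = gmm_posteriors(features, means, variances, weights)
    return post.sum(axis=0) / len(features)

def chi_squared(p, q, eps=1e-12):
    """Chi-squared distance between two normalized histograms."""
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + eps)))

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence, KL(p||q) + KL(q||p), with a small floor."""
    p = p + eps
    q = q + eps
    return float(np.sum((p - q) * np.log(p / q)))
```

A matrix of such pairwise distances can then be turned into an SVM kernel; that step is left out here.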
Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA) models each histogram
as a mixture of several ‘topics’
.. each clip may have several things going on
• Topic sets optimized through EM
p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip)
[Figure: matrix view of the same factorization: p(ftr | clip) (GMM histogram ftrs × AV Clip) = p(ftr | topic) (GMM histogram ftrs × “Topic”) * p(topic | clip) (“Topic” × AV Clip)]
use (normalized?) p(topic | clip) as per-clip features
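The EM fit of this factorization can be written compactly in matrix form. The sketch below is a generic pLSA implementation, assuming a clips × features count matrix in which every clip has at least one nonzero count; the function names, iteration count, and random initialization are illustrative choices, not details from the talk.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit p(ftr | topic) and p(topic | clip) to a (n_clips, n_ftrs) count matrix by EM."""
    rng = np.random.default_rng(seed)
    n_clips, n_ftrs = counts.shape
    p_ftr_topic = rng.random((n_topics, n_ftrs))
    p_ftr_topic /= p_ftr_topic.sum(axis=1, keepdims=True)
    p_topic_clip = rng.random((n_clips, n_topics))
    p_topic_clip /= p_topic_clip.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # model[c, f] = sum over topics of p(ftr | topic) p(topic | clip)
        model = np.maximum(p_topic_clip @ p_ftr_topic, 1e-12)
        ratio = counts / model
        # E and M steps folded together: both updates use the old parameters.
        new_ftr_topic = p_ftr_topic * (p_topic_clip.T @ ratio)
        new_topic_clip = p_topic_clip * (ratio @ p_ftr_topic.T)
        p_ftr_topic = new_ftr_topic / new_ftr_topic.sum(axis=1, keepdims=True)
        p_topic_clip = new_topic_clip / new_topic_clip.sum(axis=1, keepdims=True)
    return p_ftr_topic, p_topic_clip

def log_likelihood(counts, p_ftr_topic, p_topic_clip):
    """Log-likelihood of the counts under the factorized model."""
    model = np.maximum(p_topic_clip @ p_ftr_topic, 1e-12)
    return float(np.sum(counts * np.log(model)))
```

Each row of the returned `p_topic_clip` is a per-clip topic weight vector, which is the quantity the slide proposes to use as the clip's features.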
Global Classification Results
Lee & Ellis ’10
[Figure: Average Precision (0 to 1) per concept class, plus MEAN, for: Guessing; MFCC + GMM Classifier; Single-Gaussian + KL2; 8-GMM + Bha; 1024-GMM Histogram + 500-log(P(z|c)/Pz)]
• Wide range in performance
audio (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes
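For reference, average precision for one concept can be computed from a ranked list as below. This is the common non-interpolated definition, assumed here because the slides do not say which variant was used; the function name is illustrative.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each relevant (label == 1) item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Three clips ranked by classifier score; the first and third are true positives.
example_ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])  # (1/1 + 2/3) / 2
```

With only a handful of positive clips, the average runs over very few terms, which is one way to read the uncertainty on infrequent classes.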
3. Foreground & Transients
• Global vs. local class models
tell-tale acoustics may be ‘washed out’ in statistics
try iterative realignment of HMMs:
[Figure: spectrograms of clip YT baby 002 (freq / kHz vs. time / s). Old Way: all frames contribute to the clip's labels (voice, baby, laugh). New Way: limited temporal extents, with segments labeled voice, baby, laugh, or bg]
“background” model shared by all clips
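The alignment half of such a realignment loop can be illustrated with a plain Viterbi decode over per-frame log-likelihoods, where one state per clip label plus a background state compete for each frame. This is only a sketch under assumptions: the per-state models and their re-estimation from the aligned frames are not shown, and the "sticky" transition matrix with `stay_prob` is a hypothetical choice.

```python
import numpy as np

def viterbi_align(frame_loglik, stay_prob=0.9):
    """Most likely state sequence given (n_frames, n_states) log-likelihoods.

    Transitions favour staying in the current state; the remaining probability
    is shared equally among the other states. The initial state prior is uniform.
    """
    n_frames, n_states = frame_loglik.shape
    if n_states == 1:
        return np.zeros(n_frames, dtype=int)
    log_trans = np.full((n_states, n_states),
                        np.log((1.0 - stay_prob) / (n_states - 1)))
    np.fill_diagonal(log_trans, np.log(stay_prob))
    score = frame_loglik[0].copy()
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans          # cand[i, j]: come from i, go to j
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(n_states)] + frame_loglik[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

In an iterative scheme, the frames assigned to each state would then be used to retrain that state's model before decoding again.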
Landmark-based Fingerprints
Shazam ’03
• Sound characterized by time-frequency peaks
robust to channel, background
relies on precise timing
[Figure: landmark peaks on spectrograms (freq / Hz, 0 to 4000, vs. time / sec) for Query audio and for Match: 05-Full Circle at 0.032 sec]
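A toy version of landmark fingerprinting is sketched below: pick local peaks in a spectrogram, hash pairs of nearby peaks as (f1, f2, Δt), and match a query by voting on the time offset at which its hashes line up with a reference. The neighbourhood size, `max_dt`, `fan_out`, and function names are assumptions for illustration, not the parameters of any published system.

```python
from collections import Counter
import numpy as np

def find_peaks(spec, half_width=2, threshold=0.0):
    """(time, freq) bins that are the strict maximum of their local neighbourhood."""
    n_freq, n_time = spec.shape
    peaks = []
    for t in range(n_time):            # time-major order keeps peaks sorted by time
        for f in range(n_freq):
            value = spec[f, t]
            if value <= threshold:
                continue
            patch = spec[max(0, f - half_width):f + half_width + 1,
                         max(0, t - half_width):t + half_width + 1]
            if value == patch.max() and np.sum(patch == value) == 1:
                peaks.append((t, f))
    return peaks

def landmark_hashes(peaks, max_dt=15, fan_out=3):
    """Pair each peak with up to `fan_out` later peaks: ((f1, f2, dt), t1)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return hashes

def best_offset(ref_hashes, query_hashes):
    """Most common (reference time - query time) among matching hashes, with its vote count."""
    index = {}
    for key, t_ref in ref_hashes:
        index.setdefault(key, []).append(t_ref)
    votes = Counter()
    for key, t_query in query_hashes:
        for t_ref in index.get(key, []):
            votes[t_ref - t_query] += 1
    return votes.most_common(1)[0] if votes else None
```

The vote only accumulates when the time differences agree exactly, which is why the approach depends on precise timing.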
Soundtrack Fingerprint Matching
Cotton & Ellis ’10
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: matching landmarks on spectrograms (freq / kHz vs. time / sec) of Video IMpLQaiHWbE at 195 s and Video Yi1hkNkqHBc at 218 s]
Event Landmark Signatures
• Build index of Gabor neighbor pairs
Cotton & Ellis ’09
recognize repeated events with similar pairs
Transient Features
Cotton, Ellis, Loui ’11
• Onset detector finds energy bursts
best SNR
• PCA basis to represent each burst (300 ms x auditory frq)
• “bag of transients”
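The three steps above might be sketched as follows: detect onsets as sudden rises in log energy, cut a fixed-length time-frequency patch after each onset, and project the patches onto a PCA basis. The threshold, minimum gap, and function names are hypothetical, and the input is assumed to be a non-negative (bands × frames) energy array from some auditory-style filterbank.

```python
import numpy as np

def detect_onsets(energy, threshold=2.0, min_gap=5):
    """Frames where the summed rise in log band energy exceeds `threshold`."""
    log_energy = np.log(energy + 1e-6)
    rise = np.maximum(np.diff(log_energy, axis=1), 0.0).sum(axis=0)
    onsets = []
    for i, value in enumerate(rise):
        frame = i + 1                  # rise[i] is the jump into frame i + 1
        if value > threshold and (not onsets or frame - onsets[-1] >= min_gap):
            onsets.append(frame)
    return onsets

def extract_patches(energy, onsets, patch_len):
    """Flattened (bands x patch_len) patch starting at each onset that fits in the clip."""
    n_frames = energy.shape[1]
    return np.array([energy[:, t:t + patch_len].ravel()
                     for t in onsets if t + patch_len <= n_frames])

def pca_project(patches, n_components):
    """Project mean-removed patches onto their leading principal components."""
    centered = patches - patches.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    return centered @ basis.T, basis
```

The resulting low-dimensional patch descriptors can then be pooled per clip, in the same way frame features are pooled in the global approach.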
4. Outstanding Issues
• How to define “transients”?
• How to separate foreground & background?
• How to exploit prior knowledge of sounds?
• How to make classification discriminative?
• Large-scale soundtrack classification
Nonnegative Matrix Factorization
Smaragdis & Brown ’03
Abdallah & Plumbley ’04
Virtanen ’07
• Decompose spectrograms into templates + activation
X = W·H
fast & forgiving gradient descent algorithm
2D patches
sparsity control...
[Figure: Original mixture spectrogram (freq / Hz vs. time / s) with Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]
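A minimal sketch of the X = W·H decomposition, using the standard multiplicative updates for squared error (which can be read as gradient descent with an adaptive step size). The 2D-patch and sparsity extensions mentioned on the slide are not included, and the iteration count and initialization are arbitrary choices.

```python
import numpy as np

def nmf(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative (n_freq, n_time) matrix X into templates W and activations H."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def reconstruction_error(X, W, H):
    """Frobenius norm of the residual X - W·H."""
    return float(np.linalg.norm(X - W @ H))
```

Each column of `W` is a spectral template and the matching row of `H` says when, and how strongly, that template is active.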
Sound Textures
McDermott & Simoncelli ’09
Ellis, Zhang, McDermott ’11
• Characterize sounds by perceptually-sufficient statistics
.. verified by matched resynthesis
• Subband distributions & env x-corrs
Mahalanobis distance ...
[Figure: processing chain. Sound → Automatic gain control → mel filterbank (18 chans) → envelope |x| → Histogram → mean, var, skew, kurt (18 x 4); envelope → FFT → Octave bins 0.5, 1, 2, 4, 8, 16 Hz → Modulation energy (18 x 6); Envelope correlation → Cross-band correlations (318 samples)]
[Figure: Texture features for clips 1159_10 (urban cheer clap) and 1062_60 (quiet dubbed speech music): spectrograms (freq / Hz vs. time / s), moments (M V S K) and modulation energy (mod frq / Hz) per mel band, and cross-band correlations (mel band vs. mel band)]
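The three statistic families in the diagram can be sketched from a matrix of subband envelopes. The gain control and mel filterbank stages are assumed to have been applied already; the octave band edges (center/√2 to center·√2), the normalization of modulation energy by total power, and the function names are illustrative assumptions.

```python
import numpy as np

def moment_stats(env):
    """Mean, variance, skewness, and kurtosis of each band's envelope: (n_bands, 4)."""
    mean = env.mean(axis=1)
    var = env.var(axis=1)
    z = (env - mean[:, None]) / np.sqrt(var[:, None] + 1e-12)
    return np.stack([mean, var, (z ** 3).mean(axis=1), (z ** 4).mean(axis=1)], axis=1)

def modulation_energy(env, env_rate, centers=(0.5, 1, 2, 4, 8, 16)):
    """Fraction of each band's envelope power in octave-wide modulation bins."""
    centered = env - env.mean(axis=1, keepdims=True)
    power = np.abs(np.fft.rfft(centered, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / env_rate)
    total = power.sum(axis=1) + 1e-12
    bins = []
    for center in centers:
        in_band = (freqs >= center / np.sqrt(2)) & (freqs < center * np.sqrt(2))
        bins.append(power[:, in_band].sum(axis=1) / total)
    return np.stack(bins, axis=1)

def cross_band_correlations(env):
    """Correlation coefficient for every unique pair of bands."""
    corr = np.corrcoef(env)
    return corr[np.triu_indices(env.shape[0], k=1)]

def texture_features(env, env_rate):
    """Concatenate all texture statistics into one vector per clip."""
    return np.concatenate([moment_stats(env).ravel(),
                           modulation_energy(env, env_rate).ravel(),
                           cross_band_correlations(env)])
```

For an 18-band envelope matrix this yields 18 x 4 moments and 18 x 6 modulation energies, matching the block sizes in the diagram, followed by the pairwise correlations.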
Real-World Dictionary
• BBC Sound Effects as reference library
1000+ examples ... comprehensive?
similarity via normalized textures (over 10 s chunks)
[Figure: BBC Sound Effects collections, e.g. Footsteps, Crowds, Birds, Trains, Livestock, Aircraft, Ships And Boats, Weather, Household, Transportation, Interior Atmosphere, Exterior Atmosphere, and regional soundscapes]
Summary
• Machine Listening:
Getting useful information from sound
• Environmental sound classification
... from whole-clip statistics?
• Transients & energy peaks
... separate foreground & background
• Useful classification of unconstrained audio
... to combine with video analysis