Analysis of Everyday Sounds

Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu

1. Personal and Consumer Audio
2. Segmenting & Clustering
3. Special-Purpose Detectors
4. Generic Concept Detectors
5. Challenges & Future
LabROSA Overview

[Diagram: the lab's domains — speech, music, and environmental audio — connected to information extraction, recognition, separation, and retrieval, built on signal processing and machine learning.]
1. Personal Audio Archives

• Easy to record everything you hear
  <2 GB / week @ 64 kbps
• Hard to find anything
  how to scan?
  how to visualize?
  how to index?
• Need automatic analysis
• Need minimal impact
Personal Audio Applications

• Automatic appointment-book history
  fills in when & where of movements
• “Life statistics”
  how long did I spend in meetings this week?
  most frequent conversations?
  favorite phrases?
• Retrieving details
  what exactly did I promise?
  privacy issues...
• Nostalgia
• ... or what?
Consumer Video

• Short video clips as the evolution of snapshots
  10-60 sec, one location, no editing
  browsing?
• More information for indexing...
  video + audio
  foreground + background
Information in Audio
• Environmental recordings contain info on:
location – type (restaurant, street, ...) and specific
activity – talking, walking, typing
people – generic (2 males), specific (Chuck & John)
spoken content ... maybe
• but not:
what people and things “looked like”
day/night ...
... except when correlated with audible features
A Brief History of Audio Processing

• Environmental sound classification
  draws on earlier sound classification work
  as well as source separation...

[Diagram: soundtrack & environmental recognition inherits from speech recognition and speaker ID (GMM-HMMs), music audio genre & artist ID, and one-channel / multi-channel source separation (model-based and cue-based).]
2. Segmentation & Clustering

• Top-level structure for long recordings:
  Where are the major boundaries?
  e.g. for diary application
  support for manual browsing
• Length of fundamental time-frame
  60 s rather than 10 ms?
  background more important than foreground
  average out uncharacteristic transients
• Perceptually-motivated features
  .. so results have perceptual relevance
  broad spectrum + some detail
MFCC Features

• Need “timbral” features:
  Mel-Frequency Cepstral Coeffs (MFCCs)
  log-domain, auditory-like frequency warping,
  followed by a discrete cosine transform = orthogonalization
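To make the recipe concrete, here is a minimal MFCC sketch in Python following the steps above — magnitude spectrum, mel-warped triangular filterbank, log, then DCT. The filterbank size, window, and cepstral count are illustrative assumptions, not the configuration used in these experiments.

```python
# Minimal MFCC sketch: |FFT| -> mel filterbank -> log -> DCT.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_mels=24, n_ceps=13):
    """MFCCs of one audio frame (assumed parameter choices)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    n_bins = len(spectrum)
    # Triangular filters spaced evenly on the mel scale
    mel_edges = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0, sr / 2, n_bins)
    fbank = np.zeros((n_mels, n_bins))
    for j in range(n_mels):
        lo, mid, hi = hz_edges[j:j + 3]
        up = (bin_freqs - lo) / (mid - lo)
        down = (hi - bin_freqs) / (hi - mid)
        fbank[j] = np.maximum(0, np.minimum(up, down))
    log_mel = np.log(fbank @ spectrum + 1e-10)   # log-domain, auditory warping
    return dct(log_mel, norm='ortho')[:n_ceps]   # DCT = orthogonalization
```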
 
Long-Duration Features

[Figure: six panels of subband statistics over a ~450-minute recording (time / min vs. freq / bark) — Average Linear Energy, Normalized Energy Deviation, Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation.]

• Capture both average and variation
• Capture a little more detail in subbands...
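A hedged sketch of such long-duration statistics: per-subband means and deviations pooled over long windows of a short-time subband energy representation. The window length (6000 frames ≈ 60 s at an assumed 10 ms hop) and the particular statistics are illustrative, not the exact feature set evaluated here.

```python
# Long-window subband statistics in the spirit of the slides.
import numpy as np

def long_duration_features(aud_spec, frames_per_window=6000):
    """aud_spec: (n_frames, n_bands) short-time subband energies.
    Returns one vector per ~60 s window: average log energy (dB)
    plus normalized energy deviation per band."""
    feats = []
    for start in range(0, len(aud_spec) - frames_per_window + 1,
                       frames_per_window):
        w = aud_spec[start:start + frames_per_window]
        mu = w.mean(axis=0)                    # average linear energy
        mu_db = 10 * np.log10(mu + 1e-10)      # average log energy
        norm_dev = w.std(axis=0) / (mu + 1e-10)  # normalized deviation
        feats.append(np.concatenate([mu_db, norm_dev]))
    return np.array(feats)
```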
Spectral Entropy

• Auditory spectrum: A[n, j] = ∑k=0..NF w_jk X[n, k]
• Spectral entropy ≈ ‘peakiness’ of each band:
  H[n, j] = − ∑k=0..NF (w_jk X[n, k] / A[n, j]) · log (w_jk X[n, k] / A[n, j])

[Figure: FFT spectral magnitude (energy / dB, 0-8 kHz) with the auditory spectrum overlaid, and per-band spectral entropies (rel. entropy / bits) for bands centered at 30-7380 Hz.]
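A direct sketch of the formulas above: the entropy (here in bits) of the normalized FFT magnitudes falling under each auditory band's weighting, so peaky (tonal) bands score low and flat (noisy) bands score high. The band weights `w` are assumed given.

```python
# Per-band spectral entropy over an auditory-band weighting.
import numpy as np

def spectral_entropy(X, w):
    """X: (n_bins,) FFT magnitudes for one frame.
    w: (n_bands, n_bins) auditory band weights w_jk.
    Returns (n_bands,) entropies H[n, j] in bits."""
    wx = w * X[np.newaxis, :]                  # w_jk * X[n, k]
    A = wx.sum(axis=1, keepdims=True) + 1e-12  # A[n, j]
    p = wx / A                                 # normalized distribution
    return -(p * np.log2(p + 1e-12)).sum(axis=1)
```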
BIC Segmentation

• BIC (Bayesian Info. Crit.) compares models:
  log [ L(X1; M1) L(X2; M2) / L(X; M0) ] ≷ (λ/2) log(N) Δ#(M)
  L(X; M0): the whole context under one model;
  L(X1; M1), L(X2; M2): the two halves around a candidate boundary,
  searched from the last segmentation point to the current context limit

[Figure: average log auditory spectrum (AvgLEnergy, 2004-09-10) over 13:30-16:00 with the BIC score below — a boundary passes BIC where the score goes positive; no boundary is found with a shorter context.]
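As a minimal sketch of this criterion, assuming single full-covariance Gaussian segment models (a common choice, and λ = 1.0 as an assumed default): a split is accepted when the log-likelihood gain from two models exceeds the (λ/2) log(N) penalty on the extra parameters.

```python
# BIC change-point test under single-Gaussian models.
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X (n x d) under their own ML Gaussian.
    With the ML covariance, the Mahalanobis terms sum to n*d."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def bic_split_score(X, t, lam=1.0):
    """BIC score for a candidate boundary at frame t within context X."""
    n, d = X.shape
    n_params = d + d * (d + 1) / 2        # extra mean + covariance
    penalty = 0.5 * lam * np.log(n) * n_params
    gain = gauss_loglik(X[:t]) + gauss_loglik(X[t:]) - gauss_loglik(X)
    return gain - penalty                 # > 0 => boundary passes BIC
```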
BIC Segmentation Results

• Evaluate: 62 hr hand-marked dataset
  8 days, 139 segments, 16 categories
  measure Correct Accept % @ False Accept = 2%:

  Feature              Correct Accept
  μdB                  80.8%
  μH                   81.1%
  σH/μH                81.6%
  μdB + σH/μH          84.0%
  μdB + σH/μH + μH     83.6%
  mfcc                 73.6%

[Figure: ROC curves (Sensitivity vs. 1 − Specificity) for the five energy/entropy feature combinations.]
Segment Clustering

• Daily activity has lots of repetition:
  Automatically cluster similar segments
  ‘affinity’ of segments as KL2 distances

[Figure: segment-by-segment affinity matrix; legible category labels include lecture, meeting, campus, street, restaurant, library, supermkt, karaoke, barber, billiard, bowling, home.]
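The KL2 affinity can be sketched as the symmetrized KL divergence between single-Gaussian models of each segment's frames — the single-Gaussian segment model is an assumption here, matching the fixed-size statistics used elsewhere in the talk.

```python
# Symmetrized KL ('KL2') distance between two frame-feature segments.
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ), closed form for Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, ld0 = np.linalg.slogdet(cov0)
    _, ld1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + ld1 - ld0)

def kl2(seg_a, seg_b):
    """Symmetric KL2 distance between two segments of frames (n x d)."""
    d = seg_a.shape[1]
    reg = 1e-6 * np.eye(d)                   # keep covariances invertible
    ma, ca = seg_a.mean(0), np.cov(seg_a, rowvar=False) + reg
    mb, cb = seg_b.mean(0), np.cov(seg_b, rowvar=False) + reg
    return kl_gauss(ma, ca, mb, cb) + kl_gauss(mb, cb, ma, ca)
```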
Spectral Clustering

• Eigenanalysis of affinity matrix: A = U·S·V′
  SVD components: uk·skk·vk′
  eigenvectors vk give cluster memberships
• Number of clusters?

[Figure: the affinity matrix alongside its first four rank-1 SVD components (k = 1..4), each highlighting one block of mutually similar segments.]
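A minimal numpy sketch of this step; reading memberships off the largest-magnitude entry among the top eigenvectors is one simple assumed assignment rule, not necessarily the exact procedure used.

```python
# Spectral clustering by SVD of the segment affinity matrix.
import numpy as np

def spectral_clusters(affinity, n_clusters):
    """affinity: (n_seg, n_seg) symmetric similarity matrix.
    Each rank-1 component u_k * s_kk * v_k' highlights one cluster;
    assign each segment to its dominant eigenvector."""
    U, S, Vt = np.linalg.svd(affinity)
    return np.argmax(np.abs(Vt[:n_clusters]), axis=0)
```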
Clustering Results
• Clustering of automatic segments gives
‘anonymous classes’
BIC criterion to choose number of clusters
make best correspondence to 16 GT clusters
• Frame-level scoring gives ~70% correct
errors when same ‘place’ has multiple ambiences
Browsing Interface

• Browsing / Diary interface
  links to other information (diary, email, photos)
  synchronize with note taking? (Stifelman & Arons)
  audio thumbnails
• Release Tools + “how to” for capture

[Figure: diary view of five days (2004-09-13 to 2004-09-17, 08:00-19:30) of automatically labeled segments; legible labels include preschool, cafe, office, lecture, postlec, meeting, seminar, outdoor, group, lab.]
3. Special-Purpose Detectors: Speech

• Speech emerges as most interesting content
• Just identifying speech would be useful
  goal is speaker identification / labeling
• Lots of background noise
  conventional Voice Activity Detection inadequate
• Insight: Listeners detect pitch track (melody)
  look for voice-like periodicity in noise

[Figure: spectrogram of a coffeeshop excerpt, 0-4 kHz over ~4.5 s.]
Voice Periodicity Enhancement

• Noise-robust subband autocorrelation
• Subtract local average
  suppresses steady background
  e.g. machine noise
  15 min test set; 88% acc (no suppression: 79%)
  also for enhancing speech by harmonic filtering
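A toy sketch of the local-average-subtraction idea, collapsing the subband detail into a single wideband autocorrelation for brevity: frame-wise autocorrelations, minus a moving average over neighboring frames, so steady backgrounds (machine noise) cancel while time-varying voice pitch survives. Window sizes and the pitch-lag range are illustrative assumptions.

```python
# Periodicity detection with local-average subtraction.
import numpy as np

def voice_periodicity(frames, context=20):
    """frames: (n_frames, frame_len) windowed audio frames
    (assumes >= 20 ms frames at 16 kHz so lags to 320 exist).
    Returns per-frame peak normalized autocorrelation after
    subtracting the local average autocorrelation."""
    n, flen = frames.shape
    acs = np.array([np.correlate(f, f, mode='full')[flen - 1:]
                    for f in frames])
    acs /= acs[:, :1] + 1e-10                 # normalize by lag-0 energy
    # Moving average over +/- context frames captures steady background
    kernel = np.ones(2 * context + 1) / (2 * context + 1)
    local_avg = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0, acs)
    enhanced = acs - local_avg
    # Peak in a plausible voice-pitch lag range (50-400 Hz at 16 kHz)
    return enhanced[:, 40:320].max(axis=1)
```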
Detecting Repeating Events
with Jim Ogle
• Recurring sound events can be informative
indicate similar circumstance...
but: define “event” – sound organization
define “recurring event” – how similar?
.. and how to find them – tractable?
• Idea: Use hashing (fingerprints)
index points to other occurrences of each hash;
intersection of hashes points to match
- much quicker search
use a fingerprint insensitive to background?
Shazam Fingerprints

• Prominent spectral onsets are landmarks;
  use relations {f1, f2, t} as hashes
  intrinsically robust to background noise

[Figure: phone-ring spectrogram (0-4 kHz, ~2.5 s) with the Shazam fingerprint landmarks overlaid.]
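In the spirit of the cited Shazam scheme, a toy landmark-hashing sketch: pair nearby spectrogram peaks, pack the (f1, f2, Δt) relation into an integer hash, and keep an index from hash to occurrence times so repeats are found by hash intersection rather than exhaustive scan. The peak picking and bit packing here are deliberately simplistic assumptions.

```python
# Landmark hashing sketch: peak pairs -> integer hashes -> index.
import numpy as np
from collections import defaultdict

def landmarks(spec, n_peaks=30, fan_out=3):
    """spec: (n_freq, n_frames) magnitude spectrogram.
    Yields (hash, t1) pairs from nearby peak pairs.
    Assumes freq indices < 4096 and dt < 256 for clean packing."""
    # Crude peak set: the n_peaks largest time-frequency bins
    flat = np.argsort(spec, axis=None)[-n_peaks:]
    peaks = sorted(zip(*np.unravel_index(flat, spec.shape)),
                   key=lambda p: p[1])        # (freq, time), time order
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            yield (int(f1) << 20) | (int(f2) << 8) | (dt & 0xFF), t1

def build_index(spec):
    index = defaultdict(list)
    for h, t in landmarks(spec):
        index[h].append(t)                    # hash -> occurrence times
    return index
```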
Exhaustive Search for Repeats

• More selective hashes →
  few hits required to confirm match
  (faster; better precision)
  but less robust to background
  (reduces recall)
• Works well when exact structure repeats
  recorded music, electronic alerts
  no good for “organic” sounds e.g. garage door

[Figure: repeated sound events found by searching a scripted 45-minute recording containing repeats of two songs, three telephone rings, two electronic bells, and two each of garage-door and car-door noises, over varied background noise. Query-event time is plotted against repeat-occurrence time: the straight diagonal is the ‘self’ match, off-diagonal points are repeats. The production audio and telephone rings were found easily; the single-tone bell and noise-like sounds were missed.]
Music Detector

• Two characteristic features for music
  strong, sustained periodicity (notes)
  clear, rhythmic repetition (beat)
  at least one should be present!

[Diagram: music audio → pitch-range subband autocorrelation → local stability measure; and rhythm-range envelope autocorrelation → perceptual rhythm model; both feeding a fused music classifier.]

• Noise-robust pitch detector
  looks for high-order autocorrelation
• Beat tracker
  .. from Music IR work
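A stripped-down sketch of the two cues, as a rough stand-in for the fused classifier above: autocorrelation peaks in a pitch-range lag window for the note cue, and in a rhythm-range lag window of the amplitude envelope for the beat cue. All ranges and the max-fusion rule are illustrative assumptions.

```python
# Two music cues from autocorrelation peaks: pitch and beat.
import numpy as np

def autocorr_peak(x, lag_lo, lag_hi):
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    ac = ac / (ac[0] + 1e-10)
    return ac[lag_lo:lag_hi].max()

def music_score(audio, sr=16000):
    # Pitch cue: sustained periodicity at lags for ~80-1000 Hz
    pitch = autocorr_peak(audio, sr // 1000, sr // 80)
    # Rhythm cue: envelope periodicity at 0.3-2 s (~30-200 BPM)
    hop = sr // 100                           # 10 ms envelope frames
    n = (len(audio) // hop) * hop
    env = np.abs(audio[:n]).reshape(-1, hop).mean(axis=1)
    beat = autocorr_peak(env, 30, 200)
    return max(pitch, beat)   # "at least one should be present"
```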
4. Generic Concept Detectors
• Consumer Video application:
How to assist browsing?
system automatically tags recordings
tags chosen by usefulness, feasibility
• Initial set of 25 tags defined:
“animal”, “baby”, “cheer”, “dancing” ...
human annotation of 1300+ videos
evaluate by average precision
• Multimodal detection
separate audio + visual low-level detectors
(then fused...)
MFCC Covariance Representation

• Each clip/segment → fixed-size statistics
  similar to speaker ID and music genre classification
• Full covariance matrix of MFCCs
  maps the kinds of spectral shapes present
• Clip-to-clip distances for SVM classifier
  by KL or 2nd Gaussian model

[Figure: video soundtrack → spectrogram (freq / kHz vs. time / sec) → 20-dimensional MFCC features → full MFCC covariance matrix (level / dB).]
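One plausible reading of this pipeline in code, reusing the kl2() segment distance sketched in the Segment Clustering section and turning pairwise distances into an SVM kernel via exp(−γ·d); the γ value is an assumed hyperparameter, not a reported setting.

```python
# Clip-level classification from pairwise KL-based distances.
import numpy as np
from sklearn.svm import SVC

def distance_kernel(clips, gamma=0.05):
    """clips: list of (n_frames, n_mfcc) arrays -> (n, n) kernel
    matrix from symmetrized-KL distances between clip Gaussians."""
    n = len(clips)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = kl2(clips[i], clips[j])
    return np.exp(-gamma * D)

# Usage sketch:
# clf = SVC(kernel='precomputed').fit(distance_kernel(train_clips), labels)
```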
GMM Histogram Representation

• Want a more ‘discrete’ description
  .. to accommodate nonuniformity in MFCC space
  .. to enable other kinds of models...
• Divide up feature space with a single Gaussian Mixture Model
  .. then represent each clip by the components used

[Figure: MFCC features → global Gaussian mixture model (components 1-15 plotted over MFCC(0) vs. MFCC(1)) → per-category mixture-component histogram.]
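A minimal sketch of this representation: fit one global GMM on pooled MFCC frames, then describe each clip by a normalized histogram of which components its frames use. The component count of 16 matches the figure but is otherwise an assumed choice.

```python
# Global GMM quantization -> per-clip component histograms.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_histograms(all_frames, clips, n_components=16):
    """all_frames: pooled (N, d) training frames.
    clips: list of (n_i, d) frame arrays.
    Returns (n_clips, n_components) occupancy histograms."""
    gmm = GaussianMixture(n_components=n_components).fit(all_frames)
    hists = []
    for clip in clips:
        counts = np.bincount(gmm.predict(clip), minlength=n_components)
        hists.append(counts / counts.sum())   # normalized occupancy
    return np.array(hists)
```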
Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
  .. each clip may have several things going on
• Topic sets optimized through EM
  p(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip)

[Diagram: the clip-by-feature GMM-histogram matrix p(ftr | clip) factored as p(ftr | topic) × p(topic | clip).]

  use p(topic | clip) as per-clip features
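A bare-bones sketch of this factorization (assumed topic count and iteration budget, not the authors' implementation): the clip-by-feature count matrix is explained by p(ftr | topic) and p(topic | clip), alternately re-estimated with EM written in multiplicative form.

```python
# pLSA via EM in multiplicative-update form.
import numpy as np

def plsa(V, n_topics=8, n_iter=100, seed=0):
    """V: (n_clips, n_ftrs) nonnegative counts.
    Returns p_t_c (n_clips, n_topics) and p_f_t (n_topics, n_ftrs)."""
    rng = np.random.default_rng(seed)
    p_t_c = rng.random((V.shape[0], n_topics))
    p_f_t = rng.random((n_topics, V.shape[1]))
    p_t_c /= p_t_c.sum(1, keepdims=True)
    p_f_t /= p_f_t.sum(1, keepdims=True)
    for _ in range(n_iter):
        ratio = V / (p_t_c @ p_f_t + 1e-12)   # E-step responsibilities folded in
        p_t_c *= ratio @ p_f_t.T              # M-step: p(topic | clip)
        p_t_c /= p_t_c.sum(1, keepdims=True)
        ratio = V / (p_t_c @ p_f_t + 1e-12)
        p_f_t *= p_t_c.T @ ratio              # M-step: p(ftr | topic)
        p_f_t /= p_f_t.sum(1, keepdims=True)
    return p_t_c, p_f_t   # p_t_c rows are the per-clip topic features
```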
Audio-Only Results

• Wide range of results:

[Figure: per-concept average precision (AP) for three audio classifiers — 1G + KL, 1G + Mah, GMM Hist. + pLSA — against chance (“Guess”), over concepts from Wedding, Museum, Parade, Playground, Sunset, Picnic, Birthday, Beach, Boat, Ski, Dancing down to Singing and Show.]

  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
How does it ‘feel’?

• Browser impressions: How wrong is wrong?

[Figure: top 8 hits for “Baby”.]
Confusion analysis

• Where are the errors coming from?

[Figure: (a) overlapped manual labels and (b) confusion matrix of classified labels, over all 25 concepts from One person and Group of 3+ through Birthday.]
Fused Results - AV Joint Boosting

• Audio helps in many classes

[Figure: per-concept AP for Random Baseline, Video Only, Audio Only, and A+V Fusion, plus overall MAP, across all 25 concepts from one person and group of 3+ through birthday.]
5. Future: Temporal Focus

• Global vs. local class models
  tell-tale acoustics may be ‘washed out’ in statistics
  try iterative realignment of HMMs:

[Figure: spectrograms of clip YT baby 002 — Old Way: all frames contribute to the clip labels “voice / baby / laugh”; New Way: limited temporal extents, an HMM alignment bg / voice / baby / laugh / bg over the 15 s clip.]

  “background” (bg) model shared by all clips
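To make the realignment idea concrete, a toy numpy sketch (an assumed illustration, not the authors' system): alternate Viterbi alignment of frames against per-state Gaussians with refitting those Gaussians from the frames they claim, so each tag ends up owning a limited temporal extent. Transition handling and initialization are crude assumptions.

```python
# Iterative Viterbi realignment with per-state diagonal Gaussians.
import numpy as np

def viterbi(logB, self_bias=5.0):
    """logB: (T, S) per-frame state log-likelihoods. Toy HMM with
    uniform transitions plus a self-loop bonus; returns a state path."""
    T, S = logB.shape
    logA = np.full((S, S), -np.log(S)) + self_bias * np.eye(S)
    delta = logB[0].copy()
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: state i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

def realign(frames, n_states=4, n_iter=5):
    """frames: (T, d). Alternate fitting a diagonal Gaussian per
    state from its assigned frames and re-aligning by Viterbi."""
    T = len(frames)
    states = np.repeat(np.arange(n_states),
                       int(np.ceil(T / n_states)))[:T]  # uniform init
    for _ in range(n_iter):
        logB = np.empty((T, n_states))
        for s in range(n_states):
            sel = frames[states == s]
            if len(sel) < 2:                # state died: fall back to all
                sel = frames
            mu, sd = sel.mean(0), sel.std(0) + 1e-3
            logB[:, s] = -0.5 * (((frames - mu) / sd) ** 2
                                 + 2 * np.log(sd)).sum(axis=1)
        states = viterbi(logB)
    return states
```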
Handling Sound Mixtures

• MFCCs of mixtures ≠ mix of MFCCs
  recognition despite widely varying background?
  factorial models / Nonnegative Matrix Factorization
  sinusoidal / landmark techniques

[Figure: spectrograms (freq / kHz vs. time / sec, level / dB) of a solo voice and a M+F voice mix, original versus MFCC noise resynthesis.]
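Of the options listed, NMF is the easiest to sketch: factor a magnitude spectrogram V ≈ W·H with the multiplicative KL-divergence updates of Lee & Seung, so different sources tend to land in different columns of W. Rank and iteration count are assumed values for illustration.

```python
# NMF of a magnitude spectrogram with KL-divergence updates.
import numpy as np

def nmf_kl(V, rank=20, n_iter=200, seed=0):
    """V: (n_freq, n_frames) nonnegative magnitudes.
    Returns W (n_freq, rank) spectral bases, H (rank, n_frames) gains."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + 1e-12))) / (W.sum(0)[:, None] + 1e-12)
        W *= ((V / (W @ H + 1e-12)) @ H.T) / (H.sum(1)[None, :] + 1e-12)
    return W, H
```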
Larger Datasets
• Many detectors are visibly data-limited
getting data is ~ hard
labeling data is expensive
• Bootstrap from YouTube etc.
lots of web video is edited/dubbed...
- need a “consumer video” detector?
• Preliminary YouTube results disappointing
downloaded data needed extensive clean-up
models did not match Kodak data
• (Freely available data!)
Conclusions
• Environmental sound contains information
.. that’s why we hear!
.. computers can hear it too
• Personal audio can be segmented, clustered
find specific sounds to help navigation/retrieval
• Consumer video can be ‘tagged’
.. even in unpromising cases
audio is complementary to video
• Interesting directions for better models