Recognizing and Classifying
Environmental Sounds
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Open Issues

Recognizing Environmental Sounds - Dan Ellis, 2012-10-24
1. Machine Listening

• Extracting useful information from sound
  ... like animals do
  [Figure: grid of tasks (Detect / Classify / Describe) against domains
   (Speech / Music / Environmental Sound): VAD and Speech/Music detection;
   ASR, Music Transcription, and Environment Awareness classification;
   Automatic Narration, Music Recommendation, and Emotion description,
   together forming "Sound Intelligence".]
Environmental Sound Recognition

• Goal: Describe soundtracks with a vocabulary
  of user-relevant acoustic events/sources
  [Figure: spectrogram of YT_beach_001.mpeg (0-4 kHz, 10 s) with
   frame-level annotations: background, speakers sp1-sp4, and labeled
   events "cheer", "voice", "water wave".]
• Challenges:
  Defining acoustic event vocabulary
  Overlapping sounds
  Ground-truth training data
  Classifier accuracy
Environmental Sound Applications

• Audio Lifelog Diarization
  [Figure: two days (2004-09-13/14) of lifelog audio, 09:00-18:00,
   automatically segmented into episodes such as preschool, cafe,
   office, outdoor, lab, lectures, and meetings with particular
   speakers.]
• Consumer Video Classification & Search
• Live hearing prosthesis app
• Robot environment sensitivity
Prior Work

• Environment Classification
  speech / music / silent / machine
  Zhang & Kuo '01
• Nonspeech Sound Recognition
  meeting-room Audio Event Classification
  sports events: cheers, bat/ball sounds, ...
  Temko & Nadeu '06
Consumer Video Dataset

• 25 "concepts" from Kodak user study
  boat, crowd, cheer, dance, ...
• Grab top 200 videos from YouTube search
  then filter for quality, unedited = 1873 videos
  manually relabel with concepts
  [Figure: frequency of the 25 labeled concepts (from group of 3+,
   music, crowd, cheer, singing, ... down to museum) and their
   co-occurrence ("Overlapped Concepts") matrix.]
Obtaining Labeled Data

• Amazon Mechanical Turk
  10 s clips
  9,641 videos labeled in 4 weeks
  the Columbia Consumer Video (CCV) set
  Y.-G. Jiang et al. 2011
2. Background Classification

• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance)
  = "bag of features"
  [Figure: video soundtrack spectrogram -> per-frame MFCC features ->
   MFCC covariance matrix summarizing the whole clip.]
• Classify by e.g. Mahalanobis distance + SVM
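The whole-clip statistics step above can be sketched in a few lines of Python. This is a toy version: the frames are random stand-ins for real MFCCs, and a real system would feed the resulting clip-to-clip distances into an SVM (omitted here).

```python
import numpy as np

def clip_stats(features):
    """Summarize frame-level features (n_frames x n_dims) as a
    'bag of features': per-clip mean vector and covariance matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def mahalanobis(mu_a, mu_b, cov_b):
    """Distance from clip A's mean to clip B's feature distribution."""
    d = mu_a - mu_b
    return float(np.sqrt(d @ np.linalg.inv(cov_b) @ d))

# toy stand-in for two clips' 13-dim MFCC frames (~9 s of 30 ms frames)
rng = np.random.default_rng(0)
clip1 = rng.normal(0.0, 1.0, size=(300, 13))
clip2 = rng.normal(0.5, 1.0, size=(300, 13))
mu1, cov1 = clip_stats(clip1)
mu2, cov2 = clip_stats(clip2)
print(mahalanobis(mu1, mu2, cov2))   # distance between the two clips
```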
Codebook Histograms

• Convert high-dim. distributions to multinomial
  MFCC features -> global Gaussian Mixture Model
  -> per-category mixture-component histogram
  [Figure: MFCC(0) vs. MFCC(1) scatter under a 15-component global GMM,
   and the resulting per-component occupancy histogram for one category.]
• Classify by distance on histograms
  KL, Chi-squared
  + SVM
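The slide counts soft occupancy under a global GMM; this sketch swaps in a simpler hard vector-quantization codebook (nearest codeword per frame) to show the same histogram idea, plus the chi-squared histogram distance. The codebook and frames are made-up 2-D data.

```python
import numpy as np

def vq_histogram(frames, codebook):
    """Quantize each frame (n_frames x n_dims) to its nearest codeword
    and return a normalized occupancy histogram (a multinomial)."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between histograms (e.g. for an SVM kernel)."""
    return 0.5 * float(((h1 - h2) ** 2 / (h1 + h2 + eps)).sum())

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])        # 2 codewords, 2-D
frames = np.array([[0.1, -0.2], [9.8, 10.1], [0.2, 0.3]])
h = vq_histogram(frames, codebook)   # two frames near the first codeword
```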
Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram
  as a mixture of several 'topics'
  ... each clip may have several things going on
• Topic sets optimized through EM
  p(ftr | clip) = ∑_topics p(ftr | topic) p(topic | clip)
  [Figure: the (GMM-histogram features x AV clips) matrix p(ftr | clip)
   factored as p(ftr | topic) times p(topic | clip).]
  use (normalized?) p(topic | clip) as per-clip features
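The EM optimization for the factorization above can be sketched directly from the formula: the E-step computes the topic posterior for each (feature, clip) cell, the M-step reweights by the observed counts. This is a minimal illustrative implementation, not the talk's code; the count matrix would be the GMM occupancy histograms.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """EM for pLSA on a (n_features x n_clips) count matrix.
    Returns p(ftr | topic) and p(topic | clip)."""
    rng = np.random.default_rng(seed)
    F, C = counts.shape
    p_f_z = rng.random((F, n_topics)); p_f_z /= p_f_z.sum(axis=0)
    p_z_c = rng.random((n_topics, C)); p_z_c /= p_z_c.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior p(topic | ftr, clip), shape (F, n_topics, C)
        joint = p_f_z[:, :, None] * p_z_c[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: reweight posteriors by observed counts, renormalize
        weighted = counts[:, None, :] * post
        p_f_z = weighted.sum(axis=2)
        p_f_z /= p_f_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_c = weighted.sum(axis=0)
        p_z_c /= p_z_c.sum(axis=0, keepdims=True) + 1e-12
    return p_f_z, p_z_c
```

The columns of p(topic | clip) then serve as compact per-clip features for the SVM stage.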
Background Classification Results

K. Lee & Ellis '10

• Average Precision per concept class, for:
  Guessing
  MFCC + GMM Classifier
  Single-Gaussian + KL2
  8-GMM + Bha
  1024-GMM Histogram + 500-log(P(z|c)/P(z))
  [Figure: per-class AP across the 25 concept classes plus the MEAN,
   from near 0.9 for the best classes down toward chance.]
• Wide range in performance
  audio (music, ski) vs. non-audio (group, night) concepts
  large AP uncertainty on infrequent classes
Sound Texture Features

McDermott & Simoncelli '09
Ellis, Zheng, McDermott '11

• Characterize sounds by perceptually-sufficient statistics
  ... verified by matched resynthesis
• Pipeline: sound -> automatic gain control -> mel filterbank (18 chans)
  -> subband envelope distributions: mean, var, skew, kurt (18 x 4)
  -> modulation energy via FFT in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6)
  -> cross-band envelope correlations (318 samples)
  then e.g. Mahalanobis distance between texture feature vectors
  [Figure: spectrograms and texture features (mel-band moments,
   modulation energies, cross-band correlations) for two clips:
   "1062_60 quiet dubbed speech music" and "1159_10 urban cheer clap".]
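The statistics stages can be sketched as follows, assuming the subband envelopes have already been computed (the AGC and mel filterbank front end is omitted); band count, envelope rate, and octave-bin edges here are illustrative choices, not the exact values of the talk's system.

```python
import numpy as np

def texture_stats(envelopes, fs_env=400.0):
    """Texture summary of subband envelopes (n_bands x n_samples):
    per-band moments, octave-bin modulation energy, cross-band correlation."""
    mu = envelopes.mean(axis=1, keepdims=True)
    sd = envelopes.std(axis=1, keepdims=True) + 1e-12
    z = (envelopes - mu) / sd
    # mean, variance, skewness, kurtosis per band -> (n_bands, 4)
    moments = np.stack([mu[:, 0], sd[:, 0] ** 2,
                        (z ** 3).mean(axis=1), (z ** 4).mean(axis=1)], axis=1)
    # modulation energy in octave bins centred on 0.5..16 Hz -> (n_bands, 6)
    spec = np.abs(np.fft.rfft(envelopes - mu, axis=1)) ** 2
    freqs = np.fft.rfftfreq(envelopes.shape[1], d=1.0 / fs_env)
    centres = (0.5, 1.0, 2.0, 4.0, 8.0, 16.0)
    mod = np.stack(
        [spec[:, (freqs >= c / np.sqrt(2)) & (freqs < c * np.sqrt(2))].sum(axis=1)
         for c in centres], axis=1)
    xcorr = np.corrcoef(envelopes)          # (n_bands, n_bands)
    return moments, mod, xcorr
```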
Sound Texture Features

• Test on MED 2010 development data
  10 specially-collected manual labels:
  Clap, Cheer, Music, Speech, Dubbed, other, noisy, quiet, urban, rural
• Contrasts in feature sets
  moments (MVSK), modulation spectra (ms), cross-band correlations (cbc)
  [Figure: normalized per-label means and stddevs of the texture
   features, and AP for M / MVSK / ms / cbc / MVSK+ms / MVSK+cbc /
   MVSK+ms+cbc feature combinations across the ten labels.]
  ... reflects correlation of labels
• Perform ~same as MFCCs
  combine well
Auditory Model Features

• Based on Lyon/Patterson Auditory Image Model
  simplified version is 10x faster (RTx5 -> RT/2)
• Captures fine time structure in multiple bands
  ... the information that is lost in MFCC features
  sound -> cochlea filterbank -> per-channel short-time autocorrelation
  (correlogram slice) -> subband VQ -> histogram feature vector
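The correlogram slice at the heart of this pipeline is just a normalized short-time autocorrelation per cochlear channel. This sketch assumes the filterbank outputs are given and omits the subsequent VQ and histogram stages; channel count and lag range are illustrative.

```python
import numpy as np

def correlogram_frame(subbands, max_lag=200):
    """Short-time autocorrelation of each cochlear subband for one frame.
    subbands: (n_channels, frame_len). Returns (n_channels, max_lag),
    each row normalized by its zero-lag energy."""
    n_ch, n = subbands.shape
    out = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        x = subbands[c] - subbands[c].mean()
        ac = np.correlate(x, x, mode="full")[n - 1:n - 1 + max_lag]
        out[c] = ac / (ac[0] + 1e-12)
    return out
```

A periodic channel (e.g. a voiced harmonic) shows a strong peak at the lag of its period, which is exactly the fine time structure MFCCs discard.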
Auditory Model Feature Results

• Results vary with class, but fusion helps ~15%
  [Figure: per-class AP for MFCC, SBPCA, and their fusion over the CCV
   classes (Basketball, Baseball, Soccer, IceSkating, Skiing, Swimming,
   Biking, Cat, Dog, Bird, Graduation, Birthday, WeddingReception,
   WeddingCeremony, WeddingDance, MusicPerf, NonMusicPerf, Parade,
   Beach, Playground, ...), plus mAP.]
  Results for CCV (9317 videos)
Real-World Dictionary

• BBC Sound Effects as reference library
  1000+ examples ... comprehensive?
  similarity via normalized textures (over 10 s chunks)
  [Figure: BBC Sound Effects discs clustered by texture similarity into
   groups such as Animals and Birds, Transportation, Household,
   Interior and Exterior Atmospheres, and Audiences/Children/Crowds/
   Footsteps, spanning titles from "Rural Soundscapes" to
   "Explosions Guns Alarms".]
BBC Audio Semantic Classes

• Use BBC Sound Effects Library
  2238 sound files with short keyword descriptions
  (e.g. "Wood Fire Inside Stove", "City Skyline",
  "High Street With Traffic, Footsteps")
• Train classifiers for the 100 most common keywords
  (footsteps, walking, ambience, animals, interior, ...;
  counts from 290 down to ~53)
  select the 45 best-performing
  [Figure: BBC2238 top-25 keywords by mutual AP.]
• Added as "semantic units"
  ... useful for description?
Audio Semantic Results

• Semantic-level feature fusion helps ~15% relative
  systems: mfccs-230, sema100mfcc, sbpcas-4000, sema100sbpca, sema100sum
  [Figure: mAP per TREC MED 2011 event: E001 Attempting a board trick,
   E002 Feeding an animal, E003 Landing a fish, E004 Wedding ceremony,
   E005 Working on a woodworking project, E006 Birthday party,
   E007 Changing a vehicle tire, E008 Flash mob gathering,
   E009 Getting a vehicle unstuck, E010 Grooming an animal,
   E011 Making a sandwich, E012 Parade, E013 Parkour,
   E014 Repairing an appliance, E015 Working on a sewing project.]
  Results on TREC MED 2011 DEVT1 (6314 videos)
Audio Semantic Results

• Browsing results is surprisingly good (sometimes)
3. Foreground: Transient Features

Cotton, Ellis, Loui '11

• Transients = foreground events?
• Onset detector finds energy bursts
  (best SNR)
• PCA basis to represent each
  300 ms x auditory-frequency patch
• = "bag of transients"
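A minimal energy-based onset picker in the spirit of the slide; the hop size, threshold, and refractory window here are illustrative choices, not the talk's detector (which works in auditory frequency bands and extracts 300 ms patches around each onset).

```python
import numpy as np

def onset_times(x, fs, frame=0.03, thresh=2.0):
    """Flag frames whose energy jumps well above the recent average;
    returns onset times in seconds."""
    hop = int(frame * fs)
    n = len(x) // hop
    energy = np.array([np.sum(x[i * hop:(i + 1) * hop] ** 2)
                       for i in range(n)])
    onsets = []
    for i in range(1, n):
        recent = energy[max(0, i - 10):i].mean() + 1e-12
        # energy burst, and not within a few frames of the last onset
        if energy[i] > thresh * recent and (not onsets or i - onsets[-1] > 3):
            onsets.append(i)
    return [i * hop / fs for i in onsets]
```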
Nonnegative Matrix Factorization

Smaragdis & Brown '03
Abdallah & Plumbley '04
Virtanen '07

• Decompose spectrograms into templates + activations
  X = W · H
  fast & forgiving gradient-descent algorithm
  2D patches
  sparsity control
  computation time...
  [Figure: original 10 s mixture spectrogram decomposed into three
   bases, Basis 1 (L2) and Bases 2-3 (L1), with their activations
   over time.]
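The X = W · H decomposition can be sketched with the standard multiplicative updates (Lee & Seung style, Euclidean loss); this is a generic illustration without the 2D patches or sparsity control mentioned on the slide.

```python
import numpy as np

def nmf(X, rank, n_iter=500, seed=0):
    """Multiplicative-update NMF: X (freq x time), all nonnegative,
    approximated as W (freq x rank) @ H (rank x time)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + 0.1
    H = rng.random((rank, T)) + 0.1
    for _ in range(n_iter):
        # updates keep W, H nonnegative and monotonically reduce ||X - WH||
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

Columns of W play the role of spectral templates; rows of H give their activations in time.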
NMF Transient Features

Cotton & Ellis '11

• Learn 20 spectro-temporal patches from
  Meeting Room Acoustic Event data
  [Figure: the 20 learned patches (roughly 442-3474 Hz, ~0.5 s each).]
• Compare to MFCC-HMM detector
  [Figure: Acoustic Event Error Rate (AEER%) vs. SNR (10-30 dB) for
   MFCC, NMF, and combined systems.]
• NMF more noise-robust
  combines well ...
Foreground Event Localization

K. Lee, Ellis, Loui '10

• Global vs. local class models
  tell-tale acoustics may be 'washed out' in whole-clip statistics
  try iterative realignment of HMMs:
  Old Way: all frames contribute
  New Way: limited temporal extents
  [Figure: YT baby 002 spectrogram; under the old scheme every frame
   trains the clip's labels (voice, baby, laugh); after realignment,
   each label claims only its own segments, with the rest assigned
   to "bg".]
  "background" model shared by all clips
Foreground Event HMMs

Lee & Ellis '10

• Training labels only at clip level
  25 concepts + 2 global backgrounds
• Refine models by EM realignment
• Use for classifying entire video...
  or seeking to the relevant part
  [Figure: spectrogram of YT_beach_001.mpeg with manual clip-level
   labels ("beach", "group of 3+", "cheer"), manual frame-level
   annotation (wind, drum, cheer, voice, water wave), and the Viterbi
   path through the 27-state Markov model: the 25 concept states
   (animal, baby, beach, ... wedding) plus global BG1/BG2.]
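Seeking to the relevant part relies on a Viterbi decode over the concept + background states. A generic log-domain Viterbi sketch (a tiny hypothetical 2-state model below, not the talk's 27-state setup):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely state path. log_obs: (T x S) per-frame state
    log-likelihoods; log_trans: (S x S) transition log-probs;
    log_init: (S,) initial log-probs. Returns the state index path."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[prev, next]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```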
Refining Ground Truth

• Audio ground truth is at coarse time resolution
  better-focused labels should give better classifiers?
  but little information in very short time frames
• Train classifiers on shorter (2 sec) segments?
  Iteration 0: initial labels apply to the whole clip
  (e.g. video 0EwtL2yr0y8 tagged "Parade", 03FF1Sz53YU untagged);
  extract features, SVM train, SVM classify
  Iteration 1: relabel based on the most likely segments in the clip,
  then retrain the classifier
Refining Ground Truth

• Refining labels is "Multiple Instance Learning"
  "positive" clips have at least one +ve frame
  "negative" clips are all -ve
• Refine based on the previous classifier's scores
  threshold from CDFs of +ve and -ve frames
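The relabeling step might look like the sketch below. The slide sets the threshold from the CDFs of positive and negative frame scores; for simplicity this version uses a quantile of the positive-clip scores instead, and the function name and `keep_frac` parameter are hypothetical.

```python
import numpy as np

def mil_relabel(scores, clip_is_pos, keep_frac=0.5):
    """Multiple-instance relabeling sketch. scores: list of per-frame
    score arrays, one per clip. Inside each positive clip, keep only
    frames above a global threshold; negative clips stay all-negative."""
    pos_scores = np.concatenate(
        [s for s, p in zip(scores, clip_is_pos) if p])
    thresh = np.quantile(pos_scores, 1.0 - keep_frac)
    labels = []
    for s, p in zip(scores, clip_is_pos):
        lab = (s >= thresh) if p else np.zeros(len(s), dtype=bool)
        # a positive clip must retain at least one positive frame
        if p and not lab.any():
            lab[int(np.argmax(s))] = True
        labels.append(lab)
    return labels
```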
Refining Ground Truth

• Score by aggregating frame labels to whole clip
• Systems improve for 3-5 epochs, ~10% total
  [Figure: per-class AP over the CCV classes for the mfcc-230-2s
   system at epochs E1, E2, E3, plus mAP.]
4. Open Issues

• Better object/event separation
  parametric models
  spatial information?
  computational auditory scene analysis:
  frequency proximity, common onset, harmonicity
  Barker et al. '05
• Integration with video
• Large-scale analysis
Audio Annotation

Burger & Metze, CMU, 2012
(c) IBM Corporation and Columbia University

• Co-ordinating program-wide effort to share labels
  defining the label set
  creating labeled data
  CMU 43-label set as basis:
  animal, anim_bird, anim_cat, anim_ghoat, anim_horse, human_noise,
  laugh, scream, child, mumble, speech, speech_ne, radio, singing,
  music_sing, music, knock, thud, clap, click, bang, beep, sirene,
  water, micro_blow, wind, cheer, crowd, clatter, rustle, scratch,
  hammer, washboard, applause, whistle, squeak, tone, engine_quiet,
  engine_light, power_tool, engine_heavy, white_noise, other_creak
• Efficiency of labeling is key
  audio-only vs. audio+video?
  coarse-level human labels + automatic refinement?
Audio-Visual Atoms

Jiang et al. '09

• Object-related features
  from both audio (transients) & video (region patches)
  [Figure: a video's visual atoms (tracked region patches over time)
   and audio atoms (spectral transients) feeding joint Audio-Visual
   Atom discovery.]
Audio-Visual Atoms

• Multi-instance learning of A-V co-occurrences
  M visual atoms x N audio atoms -> M x N candidate A-V atoms
  -> codebook learning -> joint A-V codebook
Audio-Visual Atoms

• Example discovered atoms:
  Wedding: black suit + romantic music
  Parade: marching people + parade sound
  Beach: sand + beach sounds
  [Figure: retrieval examples comparing audio-only, visual-only, and
   visual + audio atoms.]
Summary
• Machine Listening:
Getting useful information from sound
• Background sound classification
... from whole-clip statistics?
• Foreground event recognition
... by focusing on peak energy patches
• Speech content is very important
... separate with pitch, models, ...
References

• Jon Barker, Martin Cooke, & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis, & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng, & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
• Keansub Lee, Dan Ellis, & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel, & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.
Acknowledgment
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via
Department of Interior National Business Center contract number D11PC20070. The
U.S. Government is authorized to reproduce and distribute reprints for Government
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S.
Government.