Recognizing and Classifying Environmental Sounds

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Environmental Sound Recognition
2. Foreground Events
3. Background Retrieval
4. Labels & Annotation
5. Future Directions
Environmental Sounds - Dan Ellis
2013-06-01
1 /34
1. What is hearing for?
http://news.bbc.co.uk/2/hi/science/nature/7130484.stm
• Hearing = getting information from sound
predators/prey
communication
• Environmental sound recognition is fundamental
Environmental Sound Perception
Ellis 1996

• What do people hear in a city-street ambience?
  [Figure: spectrogram of a city-street recording (0-4 kHz, 0-9 s) with
  listener labels for each event: Horn1 (10/10), Crash (10/10), Horn2 (5/10),
  Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10);
  subjects described e.g. "car horn", "door slam", "truck accelerating",
  "squeal"]
• Listeners parse the scene into discrete sources against a background
  ambience
• Mixtures are the rule
Sound Scene Evaluations

• Evaluations are good for research
  help researchers, help funders
• A decade of evaluations (2003-2013):
  NIST MtgRm, CHIL/CLEAR (meeting rooms, acoustic events)
  ADC, MIREX (music transcription)
  SSC/PASCAL, CHiME (speech separation)
  SASSEC, SiSEC (source separation)
  TRECVID, TRECVID MED, VideoCLEF / MediaEval, Albayzin
  (video soundtrack classification, segmentation)
  D-CASE (acoustic scenes and events)
• Metrics: SNR, Frame Acc, Event Error Rate, mAP
DCASE (2012-2013)
Giannoulis et al. 2013

• 2012 IEEE AASP Challenge:
  "Detection and Classification of Acoustic Scenes and Events"
  2 tasks; systems submitted Mar 2013, results at WASPAA, Oct 2013
• Task 1: Scene classification
  10 classes x 10 examples x 30 s
  bus, busy street, open-air market, office, park, quiet street,
  restaurant, supermarket, tube, tube station
  evaluate on 100 examples (~1 hour total) by classification accuracy
  [Figure: example spectrograms (0-4 kHz, 0-25 s) for each of the ten
  scene classes]
DCASE Event Detection
• Task 2: Event detection (“office live”)
16 events x 20 training examples (~ 20 min total)
knock, laugh, drawer, keys, phone ...
evaluate on ~15 min (?)
  metrics: frame-level AEER & event-level precision-recall
TRECVID MED (2010-2013)
• “Multimedia Event Detection”
Over et al. 2011
e.g. MED2011: 15 events x 200 example videos (~60s)
- Making a sandwich, Birthday party, Parade, Flash mob
evaluate by mean Average Precision
over 10k-100k videos (200-2,000 hours)
audio and video ...
- participants have annotated ~1000 videos (> 10 h)
[Figure: example frames from event E009 "Getting a Vehicle Unstuck"]
Consumer Video Dataset
• Columbia Consumer Video (CCV)
Y-G. Jiang et al. 2011
9,317 videos / 210 hours
20 concepts based on consumer user study
Labeled via Amazon Mechanical Turk
Environmental Sound Motivations

• Audio Lifelog Diarization
  [Figure: personal-audio lifelog for 2004-09-13 and 2004-09-14,
  09:00-18:00, automatically segmented into episodes such as cafe,
  office, lecture, lab, outdoor, preschool, and meetings]
• Consumer Video Classification & Search
• Real-time hearing prosthesis app
• Robot environment sensitivity
• Understanding hearing
2. Foreground Event Recognition

• "Events" are what we hear / notice
• ASR approach?
  Reyes-Gomez & Ellis 2003
  [Figure: spectrogram of "dogBarks2" (0-8 kHz, ~1.15-2.25 s) with
  HMM state alignments overlaid]
  events = words? what are subwords?
  need labeled data
  but ... mature tools are great
Transient Features
Cotton, Ellis, Loui '11

• Transients = foreground events?
• Onset detector finds energy bursts
  best SNR
• PCA basis to represent each 300 ms x auditory freq patch
• "bag of transients"
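The onset-detector bullet can be made concrete with a minimal sketch. This is a generic spectral-flux onset detector, not the exact detector of Cotton, Ellis & Loui '11: it sums the positive frame-to-frame spectral change and picks local peaks above a simple adaptive threshold (the `hop_s` and threshold choices here are illustrative assumptions).

```python
import numpy as np

def onset_times(spec, hop_s=0.010, threshold=None):
    """Detect energy bursts via half-wave-rectified spectral flux.

    spec: magnitude spectrogram, shape (n_freq, n_frames).
    Returns frame times (s) of local flux peaks above threshold.
    """
    # Positive frame-to-frame spectral change, summed over frequency
    flux = np.maximum(np.diff(spec, axis=1), 0.0).sum(axis=0)
    if threshold is None:
        threshold = flux.mean() + flux.std()  # simple adaptive choice
    peaks = []
    for i in range(1, len(flux) - 1):
        if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]:
            peaks.append((i + 1) * hop_s)  # +1: diff shifts frames by one
    return peaks
```

Each detected time would then anchor a 300 ms spectro-temporal patch for the PCA basis.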
Transient Features

• Results show a small benefit
  similar to MFCC baseline?
  [Figure: per-class average precision on consumer-video concepts
  (music, crowd, cheer, singing, parade, beach, wedding, birthday, ski,
  sports, baby, playground, boat, graduation, sunset, animal, picnic,
  museum, night, show, dancing, park, 1 person, group of 2, group of 3+)
  for fusion, event features, MFCC features, and chance]
• Examine clusters looking for semantic consistency...
  link cluster to label class
NMF Transient Features
Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07

• Decompose spectrograms into templates + activations: X = W · H
  well-behaved gradient-descent algorithm
  2D patches
  sparsity control
  computation time...
  [Figure: original mixture spectrogram (0-10 s, 442-5478 Hz) decomposed
  into Basis 1 (L2), Basis 2 (L1), Basis 3 (L1) and their activations]
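The X = W · H decomposition above can be sketched with the classic Lee-Seung multiplicative updates for the Euclidean objective; this is a minimal illustration, not the sparsity-controlled 2-D patch variant the slide refers to.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative spectrogram X (freq x time) as X ~= W @ H
    using Lee-Seung multiplicative updates (Euclidean objective)."""
    rng = np.random.default_rng(seed)
    n_f, n_t = X.shape
    W = rng.random((n_f, rank)) + eps   # spectral templates
    H = rng.random((rank, n_t)) + eps   # per-template activations
    for _ in range(n_iter):
        # Updates keep W, H nonnegative and never increase ||X - WH||^2
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Sparsity control (as in Virtanen '07) would add a penalty term to the H update; the plain form above is enough to see the template/activation structure.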
NMF Transient Features
Cotton & Ellis '11

• Learn 20 patches from CLEAR Meeting Room acoustic events
  [Figure: the 20 learned spectro-temporal patches (442-3474 Hz, 0-0.5 s)]
• Compare to MFCC-HMM detector
  metric: Acoustic Event Error Rate (AEER%)
• NMF more noise-robust, combines well ...
  [Figure: AEER vs. SNR (10-30 dB) for MFCC, NMF, and combined systems]
Why Are Events Hard?
• Events are short
target sounds may occupy only a few % of time
• Events are varied
what is the vocabulary? what are the prototypes?
source & channel variability
• Critical information is in fine-time structure
onset transient etc.
poor match to classic frame-spectral-envelope features
3. Background Retrieval
K. Lee & Ellis 2010

• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance)
  = "bag of features"
  [Figure: video soundtrack spectrogram (0-8 kHz, 0-9 s) → per-frame
  20-dimensional MFCC features → 20x20 MFCC covariance matrix]
• Classify by e.g. Mahalanobis distance + SVM
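The bag-of-features baseline above reduces to a few lines once per-frame features exist. A minimal sketch (the symmetrized, pooled-covariance distance is one simple variant, not necessarily the exact form used in K. Lee & Ellis 2010):

```python
import numpy as np

def clip_statistics(frames):
    """Summarize a clip of per-frame features (n_frames x n_dims)
    by its mean and covariance: the "bag of features" descriptor."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    return mu, cov

def mahalanobis_clip_distance(frames_a, frames_b):
    """Mahalanobis-style distance between two clips' frame-feature
    distributions, using the pooled covariance of the pair."""
    mu_a, cov_a = clip_statistics(frames_a)
    mu_b, cov_b = clip_statistics(frames_b)
    d = mu_a - mu_b
    cov = 0.5 * (cov_a + cov_b)  # pooled covariance
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

These distances (or the flattened mean + covariance vectors themselves) then feed an SVM for classification.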
Retrieval Evaluation

• Rank large test set by match to category
• Precision-Recall curves
  [Figure: CCV precision-recall (mfcc+sbpca) for selected classes:
  MusicPerformance 0.69, Birthday 0.49, Graduation 0.46, IceSkating 0.42,
  Swimming 0.37, Dog 0.27, WeddingReception 0.18]
• mean Average Precision
  [Figure: per-class AP on CCV test with mfcc-230 features (mean = 0.345)
  over the 20 concepts, plus chance]
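The mean-Average-Precision metric used throughout these evaluations is easy to state in code: for one concept, average the precision@k at the ranks where positives appear, then take the mean over concepts. A minimal sketch (ties in scores are broken arbitrarily here):

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision for one concept.

    scores: classifier scores per test clip (higher = more confident).
    labels: 0/1 ground truth per clip.
    Returns the mean of precision@k over ranks k of the positive clips.
    """
    order = np.argsort(-np.asarray(scores))        # rank by score, descending
    ranked = np.asarray(labels)[order]
    hits, precisions = 0, []
    for k, is_pos in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            precisions.append(hits / k)            # precision at this rank
    return float(np.mean(precisions)) if precisions else 0.0
```

mAP is then just the mean of `average_precision` over all concept classifiers.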
Retrieval Examples
• High precision for top hits (in-domain)
Sound Texture Features
McDermott et al. '09; Ellis, Zheng, McDermott '11

• Characterize sounds by perceptually-sufficient statistics
  .. verified by matched resynthesis
• Subband pipeline:
  sound → automatic gain control → mel filterbank (18 chans) →
  subband envelope statistics:
  - moments: mean, var, skew, kurtosis (18 x 4)
  - modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6)
  - cross-band envelope correlations
• Classify by Mahalanobis distance ...
  [Figure: texture features for two clips (1159_10 "urban cheer clap" vs.
  1062_60 "quiet dubbed speech music"): spectrograms (0-10 s), per-band
  moments M V S K, and modulation energies per mel band]
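The per-band moment statistics (the M, V, S, K columns in the figure) are straightforward to compute once the subband envelopes exist. A minimal sketch assuming the envelopes have already been extracted by the filterbank/AGC stages:

```python
import numpy as np

def envelope_moments(env):
    """Per-band moment statistics of subband envelopes.

    env: (n_bands, n_frames) envelope signals.
    Returns (n_bands, 4): mean, variance, skewness, kurtosis per band.
    """
    mu = env.mean(axis=1, keepdims=True)
    dev = env - mu
    var = (dev ** 2).mean(axis=1)
    sd = np.sqrt(var) + 1e-12                 # guard against silent bands
    skew = (dev ** 3).mean(axis=1) / sd ** 3  # asymmetry of envelope values
    kurt = (dev ** 4).mean(axis=1) / sd ** 4  # peakiness (3 for Gaussian)
    return np.stack([mu[:, 0], var, skew, kurt], axis=1)
```

Concatenating these 18 x 4 moments with the modulation energies and cross-band correlations gives the full texture feature vector.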
Sound Texture Features

• Test on MED 2010 development data
  10 audio-oriented manual labels: Clap, Cheer, Music, Speech, Dubbed,
  other, noisy, quiet, urban, rural
  [Figure: per-class AP for feature subsets M, MVSK, ms (modulation
  spectra), cbc (cross-band correlations), and their combinations]
  Perform ~ same as MFCCs (mfcc vs. txtr vs. mfcc+txtr)
  covariance ~ texture?
• Per-class stats
  relate dimensions to classes?
  [Figure: per-class normalized means and stddevs of the texture feature
  dimensions (V, S, K, ms, cbc) on MED 2010]
Auditory Model Features
Lyon et al. 2010; Cotton & Ellis 2013

• Subband Autocorrelation PCA (SBPCA)
  simplified version of Lyon et al. system
  10x faster (RTx5 → RT/2)
• Captures fine time structure in multiple bands
  .. missing in MFCC features
  [Figure: sound → cochlea filterbank → delay-line short-time
  autocorrelation per frequency channel (correlogram slices) →
  per-subband VQ → histogram → feature vector]
Subband Autocorrelation

• Autocorrelation stabilizes fine time structure
  [Figure: cochlea filterbank → delay-line short-time autocorrelation →
  subband VQ → histogram → feature vector; correlogram slice in
  freq/lag/time]
• 25 ms window, lags up to 25 ms
  calculated every 10 ms
  normalized to max (zero lag)
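The per-frame computation described above (25 ms windows every 10 ms, lags up to the window length, normalized by the zero-lag value) can be sketched for a single subband; this is an illustration of the idea, not the exact SBPCA front end:

```python
import numpy as np

def frame_autocorr(x, sr, win_s=0.025, hop_s=0.010):
    """Short-time autocorrelation of one subband signal, one row per frame.

    Each window is correlated against itself for lags 0..win-1 and
    normalized by the zero-lag value, so a periodic signal yields
    peaks at multiples of its period regardless of level.
    """
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    rows = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        ac = np.correlate(frame, frame, mode='full')[win - 1:]  # lags 0..win-1
        rows.append(ac / (ac[0] + 1e-12))  # normalize to max at zero lag
    return np.array(rows)
```

In the full system these per-band autocorrelation rows are then summarized by PCA (or VQ + histogram) rather than used raw.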
Auditory Model Feature Results

• SAI and SBPCA close to MFCC baseline
  [Figure: per-class AP on the 20 CCV concepts for MFCC, SAI, SBPCA,
  and their fusions]
• Fusing MFCC and SBPCA improves mAP by 15% rel
  mAP: 0.35 → 0.40
• Calculation time
  MFCC: 6 hours
  SAI: 1087 hours
  SBPCA: 110 hours
What is Being Recognized?

• Soundtracks represented by global features
  MFCC covariance, codebook histograms
• What are the critical parts of the sound?
  [Figure: spectrogram of clip k5tiYqzvuoc "Dog" (0-4 kHz, ~70 s) with
  response traces over time for all 20 CCV concept classifiers]
Semantic Audio Features

• Train classifiers on related labeled data
  soundtrack → MFCCs → segment statistics → pre-trained SVMs →
  semantic features → event classifiers
  defines a new "semantic" feature space
• Use for target classifier, or combo
  [Figure: mAP for mfccs-230, sbpcas-4000, sema100mfcc, sema100sbpca,
  and sema100sum feature sets]
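The pipeline above maps each clip into a space with one dimension per pre-trained concept classifier. A minimal sketch with hypothetical linear models standing in for the bank of pre-trained SVMs:

```python
import numpy as np

def semantic_features(clip_feature, classifiers):
    """Map a clip's low-level feature vector into a "semantic" space.

    clip_feature: low-level descriptor for the clip (e.g. MFCC statistics).
    classifiers: list of (weights, bias) linear models, standing in for
    SVMs pre-trained on related labeled data (hypothetical here).
    Returns one decision score per concept: the semantic feature vector.
    """
    return np.array([w @ clip_feature + b for w, b in classifiers])
```

The target-event classifier is then trained on these score vectors (or on their concatenation with the low-level features) instead of the raw descriptors.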
4. Labels & Annotation
Burger et al. '12

• "Semantic Features" are a promising approach
  but we need good coverage...
  how to learn more categories?
• Annotation is expensive
  fine time annotation > 10x real-time
  a few hours are available
• What to label?
  generic vs. task-specific
  e.g. "noiseme" classes: animal, anim_bird, anim_cat, anim_ghoat,
  anim_horse, human_noise, laugh, scream, child, mumble, speech,
  speech_ne, radio, cheer, crowd, singing, music_sing, music, clap,
  applause, whistle, click, bang, squeak, beep, tone, knock, thud,
  clatter, rustle, scratch, hammer, washboard, sirene, water,
  micro_blow, wind, white_noise, engine_quiet, engine_light,
  engine_heavy, power_tool, other_creak
BBC Audio Semantic Classes

• BBC Sound Effects Library
  2238 tracks (60 h), short descriptions
  e.g. "Wood Fire Inside Stove 5:07", "City Skyline 9:46",
  "High Street With Traffic, Footsteps", "Car Wash Automatic, Wash Phase
  Inside", "Motor Cycle Yamaha Rd 350 ..."
• Use top 45 keywords as classes
  [Figure: keyword frequency counts (290, 267, 240, 197, 193, ...,
  59, 59, 55, 53, 53) for keywords such as footsteps, animals, ambience,
  interior, men, general, switch, starts, crowds]
  [Figure: per-class AP (mean = 0.678) on BBC2238 traindev1 over the 45
  keyword classes: voices, crowds, household, children, street, traffic,
  electric, crowd, train, urban, transport, boat, car, cars, speech,
  aircraft, horse, electronic, rural, birds, animal, rhythms, animals,
  siren, vocals, cat, steam, sports, babies, ambience, futuristic,
  construction, emergency, horses, men, wood, country, footsteps,
  wooden, warfare, running, women, walking, pavement, stairs]
• Added as "semantic units"
  some redundancy visible in mutual APs
BBC Audio Semantic Classes

• Limited semantic correspondence
  e.g. SFX051-05 (ambience crowds): "Ambience, London Covent Garden
  Exterior, Crowds In The Piazza On A Busy Saturday Afternoon, Occasional
  Busker's Music In The Distance"
  [Figure: spectrogram (0-4000 Hz, 0-250 s) with activations of all 45
  keyword classifiers over the clip]
Label Temporal Refinement
K Lee, Ellis, Loui '10

• Audio ground truth at coarse time resolution
  better-focused labels give better classifiers?
  but little information in very short time frames
• Train classifiers on shorter (2 sec) segments?
  Iteration 0: initial clip-level tags (e.g. "Parade") apply to the whole
  clip → time labels → features → SVM train/classify
  Iteration 1: relabel based on most likely segments in clip
  (classifier posteriors → refined time labels → retrain classifier)
Label Temporal Refinement
• Refining labels is “Multiple Instance Learning”
“Positive” clips have at least one +ve frame
“Negative” clips are all –ve
• Refine based on previous classifier's scores
  threshold from CDFs of +ve and -ve frames
  mAP improves ~10% after a few iterations
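The multiple-instance constraint above ("positive" clips have at least one positive frame, "negative" clips are all negative) makes one relabeling pass very simple. A minimal sketch of a single iteration for one clip (the scalar `threshold` stands in for the CDF-derived threshold described above):

```python
import numpy as np

def refine_labels(frame_scores, clip_is_positive, threshold):
    """One multiple-instance-learning relabeling pass for a clip.

    frame_scores: (n_frames,) scores from the previous classifier.
    Positive clips keep only their high-scoring frames as positives
    (but always at least one); negative clips stay all negative.
    Returns per-frame 0/1 labels for retraining.
    """
    labels = np.zeros(len(frame_scores), dtype=int)
    if clip_is_positive:
        labels[frame_scores >= threshold] = 1
        if labels.sum() == 0:
            # a positive clip must contain at least one positive frame
            labels[int(np.argmax(frame_scores))] = 1
    return labels
```

Iterating train → score → `refine_labels` → retrain is what produces the mAP gain reported above.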
5. Future: Tasks & Metrics
• Environmental sound recognition:
What is it good for?
media content description (“Recounting”)
environmental awareness
• What are the right ways to evaluate?
task-specific metrics: AEER, F-measure
downstream tasks: WER, mAP
real applications: archive search, aware devices
Labels & Annotations

• Training data: quality vs. quantity
  quality costs:
  - DCASE ~ 0.3 h
  - TRECVID MED ~ 10 h
  quantity always wins
• Opportunistic labeling
  Susanne Burger, CMU
  e.g. Sound Effects library, subtitles ...
  need refinement strategies
• Existing annotations indicate interest
Source Separation

• Separated sources would make event detection easy
  "separate then recognize" paradigm:
  mix → separation (identify target energy, t-f masking + resynthesis)
  → ASR (speech models) → words
  vs. integrated: mix → identify / find best words model
  (speech models + source knowledge) → words
  integrated solution more powerful...
• Environmental Source Separation is ill-defined
  relevant "sources" are listener-defined
  environment description addresses this
• Environment recognition for source separation
Summary

• (Machine) Listening:
  getting useful information from sound
• Foreground event recognition
  ... by focusing on peak energy patches
• Background sound retrieval
  ... from long-time statistics
• Data, Labels, and Task
  ... what are the sources of interest?
References 1/2

• Samer Abdallah, Mark Plumbley, "Polyphonic music transcription by non-negative sparse coding of power spectra," ISMIR, 2004, p. 10-14.
• Susanne Burger, Qin Jin, Peter F. Schulam, Florian Metze, "Noisemes: Manual annotation of environmental noise in audio streams," Technical Report LTI-12-017, CMU, 2012.
• Courtenay Cotton, Dan Ellis, Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," IEEE WASPAA, 2011, 69-72.
• Courtenay Cotton & Dan Ellis, "Subband Autocorrelation Features for Video Soundtrack Classification," IEEE ICASSP, Vancouver, May 2013, 8663-8666.
• Dan Ellis, "Prediction-driven computational auditory scene analysis," Ph.D. thesis, MIT Dept. of EECS, 1996.
• Dan Ellis, Xiaohong Zheng, Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange, Mark Plumbley, "Detection and Classification of Acoustic Scenes and Events," Technical Report EECSRR-13-01, QMUL School of EE and CS, 2013.
• Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Dan Ellis, Alex Loui, "Consumer video understanding: A benchmark database and an evaluation of human and machine performance," ACM ICMR, Apr. 2011, p. 29.
References 2/2

• Keansub Lee, Dan Ellis, Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, Dallas, Apr. 2010, 2278-2281.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE TASLP 18(6): 1406-1416, Aug. 2010.
• R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, G. Chechik, "Sound retrieval and ranking using sparse auditory representations," Neural Computation 22(9): 2390-2416, Sept. 2010.
• Josh McDermott, Andrew Oxenham, Eero Simoncelli, "Sound texture synthesis via filter statistics," IEEE WASPAA, 2009, 297-300.
• Paul Over, George Awad, Jon Fiscus, Brian Antonishek, Martial Michel, Alan Smeaton, Wessel Kraaij, Georges Quénot, "TRECVID 2010 - An overview of the goals, tasks, data, evaluation mechanisms, and metrics," Technical Report, NIST, 2011.
• Manuel Reyes-Gomez & Dan Ellis, "Selection, Parameter Estimation, and Discriminative Training of Hidden Markov Models for General Audio Modeling," IEEE ICME, Baltimore, 2003, I-73-76.
• Paris Smaragdis & Judith Brown, "Non-negative matrix factorization for polyphonic music transcription," IEEE WASPAA, 2003, p. 177-180.
• Tuomas Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE TASLP 15(3): 1066-1074, 2007.
Acknowledgment
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via
Department of Interior National Business Center contract number D11PC20070. The
U.S. Government is authorized to reproduce and distribute reprints for Government
purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S.
Government.