Mining Audio 1. 2. 3.

advertisement
Mining Audio
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
1.
2.
3.
4.
Mining Audio - Dan Ellis
http://labrosa.ee.columbia.edu/
Real-World Sound
Music Audio
Environmental Audio
Outstanding Issues
2012-09-14
1 /41
LabROSA Overview
• Getting information from sound
Information
Extraction
Music
Recognition
Environment
Separation
Machine
Learning
Retrieval
Signal
Processing
Speech
Mining Audio - Dan Ellis
2012-09-14
2 /41
1. Real-World Sound
4000
frq/Hz
3000
0
2000
-20
1000
-40
0
-60
level / dB
0
2
4
6
8
10
12
time/s
02_m+s-15-evil-goodvoice-fade
Analysis
Voice (evil)
Voice (pleasant)
Stab
Rumble
Choir
Strings
• Sounds rarely occur in isolation
.. so analyzing mixtures (“scenes”) is a problem
.. for humans and machines
Mining Audio - Dan Ellis
2012-09-14
3 /41
Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain
Mining Audio - Dan Ellis
2012-09-14
4 /41
Machine Listening
• Extracting useful information from sound
Describe
Automatic
Narration
Emotion
Music
Recommendation
Classify
Environment
Awareness
ASR
Music
Transcription
“Sound
Intelligence”
VAD
Speech/Music
Environmental
Sound
Speech
Dectect
Task
... like animals do
Mining Audio - Dan Ellis
Music
Domain
2012-09-14
5 /41
2. Mining Music Audio
Query track
One Million Songs
4000
Frequency
3000
2000
1000
0
0
1
2
3
4
5
Time
6
7
8
9
10
“Similar” tracks
4000
Frequency
3000
2000
4000
1000
4000
0
3000
1
2
3
4
5
Time
6
7
8
9
10
Frequency
0
Frequency
3000
2000
2000
1000
1000
0
0
Mining Audio - Dan Ellis
0
1
2
3
4
5
Time
6
7
8
9
0
1
2
3
4
5
Time
6
7
8
9
10
10
2012-09-14
6 /41
Music Audio Representations
• We need a description of the audio that
contains “suitable” detail
• Speech Recognition uses Mel-Frequency
Cepstral Coefficients (MFCCs)
Let It Be (LIB-1) - log-freq specgram
freq / Hz
5915
1396
329
Coefficient
MFCCs
12
10
8
6
4
2
Noise excited MFCC resynthesis (LIB-2)
freq / Hz
5915
1396
329
0
Mining Audio - Dan Ellis
5
10
15
20
25
time / sec
2012-09-14
7 /41
Warren et al. 2003
Chroma Features
• We’d like to preserve the notes
at least within one octave
Let It Be - log-freq specgram (LIB-1)
freq / Hz
6000
1400
chroma bin
300
Chroma features
B
A
G
E
D
C
Shepard tone resynthesis of chroma (LIB-3)
freq / Hz
6000
1400
300
MFCC-filtered shepard tones (LIB-4)
freq / Hz
6000
1400
300
0
Mining Audio - Dan Ellis
5
10
15
20
25
time / sec
2012-09-14
8 /41
Beat-Synchronous Chroma
• Perform Beat Tracking
.. and record one chroma vector per beat
compact representation of harmonies
Let It Be - log-freq specgram (LIB-1)
freq / Hz
6000
1400
300
chroma bin
Onset envelope + beat times
Beat-synchronous chroma
B
A
G
E
D
C
Beat-synchronous chroma + Shepard resynthesis (LIB-6)
freq / Hz
6000
1400
300
0
Mining Audio - Dan Ellis
5
10
15
20
25
time / sec
2012-09-14
9 /41
Million Song Dataset (MSD)
Thierry Bertin-Mahieux
• Commercial-scale dataset
available to MIR researchers
1M pop songs
250 GB of features
(6 years of listening)
• Many facets:
Features,
Lyrics,
Tags,
Covers,
Listeners ...
http://labrosa.ee.columbia.edu/millionsong
Mining Audio - Dan Ellis
2012-09-14 10/41
MSD Audio Features
• Use Echo Nest “Analyze” features
segment audio into variable-length “events”
supports
a crude
resynthesis:
freq / Hz
2416
Origina
761
240
B
A
G
EN
Featur
E
D
C
freq / Hz
represent by
12 chroma +
12 “timbre”
TRKUYPW128F92E1FC0 - Tori Amos - Smells Like Teen Spirit
2416
Resyn
761
240
0
Mining Audio - Dan Ellis
2
4
6
8
10
12 time / sec 14
2012-09-14 11/41
MSD Metadata
EN Metadata
Last.fm Tags
artist:
release:
title:
id:
key:
mode:
loudness:
tempo:
time_signature:
duration:
sample_rate:
audio_md5:
7digitalid:
familiarity:
year:
100.0 – cover
57.0 – covers
43.0 – female vocalists
42.0 – piano
34.0 – alternative
14.0 – singer-songwriter
11.0 – acoustic
8.0 – tori amos
7.0 – beautiful
6.0 – rock
6.0 – pop
6.0 – Nirvana
6.0 – female vocalist
6.0 – 90s
5.0 – out of genre covers
'Tori Amos'
'LIVE AT MONTREUX'
'Smells Like Teen Spirit'
'TRKUYPW128F92E1FC0'
5
0
-16.6780
87.2330
4
216.4502
22050
'8'
5764727
0.8500
1992
SHS Covers
%5489,4468, Smells
TRTUOVJ128E078EE10
TRFZJOZ128F4263BE3
TRJHCKN12903CDD274
TRELTOJ128F42748B7
TRJKBXL128F92F994D
TRIHLAW128F429BBF8
TRKUYPW128F92E1FC0
Mining Audio - Dan Ellis
5.0
4.0
4.0
4.0
4.0
3.0
3.0
3.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
cover songs
soft rock
nirvana cover
Mellow
alternative rock
chick rock
Ballad
Awesome Covers
melancholic
k00l ch1x
indie
female vocalistist
female
cover song
american
MxM Lyric Bag-of-Words
Like Teen Spirit
Nirvana
Weird Al Yankovic
Pleasure Beach
The Flying Pickets
Rhythms Del Mundo feat. Shanade
The Bad Plus
Tori Amos
12
11
10
9
7
6
6
6
hello
i
a
and
it
are
we
now
6
6
6
4
4
4
3
3
here
us
entertain
the
feel
yeah
to
my
3
3
3
3
3
3
3
3
is
with
oh
out
an
light
less
danger
2012-09-14 12/41
Finding Similar Items
Avery Wang ’03
• If it’s really the same, we can use
freq / Hz
fingerprinting (e.g. Shazam):
Query audio
4000
3500
3000
2500
find local “landmarks”
quantize by pairs
inverted index
2000
1500
1000
500
0
Match: 05−Full Circle at 0.032 sec
4000
3500
3000
2500
2000
1500
1000
500
0
Mining Audio - Dan Ellis
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
time / sec
2012-09-14 13/41
Dynamic Programming Alignment
Serra, Gomez et al. ’08
• We can match the chroma representations
by DP alignment
robust to timing changes
quite efficient
best performer in MIREX Cover Song evaluations
Mining Audio - Dan Ellis
2012-09-14 14/41
Large-Scale Cover Matching
Bertin-Mahieux & Ellis ’11
• How can we find covers in 1M songs?
@ 1 sec / comparison, one search = 11.5 CPU-days
full N2 mining = 16,000 CPU-years
• Hashing?
landmarks from chroma patches (like Shazam)
represent item by distribution of contour fragments
one stage in a filtering hierarchy...
Mining Audio - Dan Ellis
2012-09-14 15/41
Euclidean Cover Matching
Ellis & Poliner ’08
Bertin-Mahieux & Ellis ’12
• Reduce cover matching to nearest-neighbor
search in fixed-size beat-chroma patches
• Right “distance”?
10
8
principal components
learned weightings
6
4
2
120
Alignment/
segmentation?
music segmentation
multiple probes
framing-insensitive
representation
Mining Audio - Dan Ellis
130
140
150
160
170
Between the Bars − Glenn Phillips − pwr .25 − −17 beats − trsp 2
12
10
8
6
4
2
120
chroma bin
•
Between the Bars − Elliot Smith − pwr .25
12
130
140
150
160
170
160
170
pointwise product (sum(12x300) = 19.34)
12
10
8
6
4
2
120
130
140
150
time / beats
2012-09-14 16/41
Pattern Mining
• Cluster beat-synchronous chroma patches
#32273 - 13 instances
chroma bins
G
F
D
C
A
G
F
D
C
A
G
F
D
C
A
G
F
D
C
A
Mining Audio - Dan Ellis
5
a20-top10x5-cp4-4p0
#51917 - 13 instances
#65512 - 10 instances
#9667 - 9 instances
#10929 - 7 instances
#61202 - 6 instances
#55881 - 5 instances
#68445 - 5 instances
10
15
20
5
10
15
time / beats
2012-09-14 17/41
Other Work: MSD Challenge
with Brian McFee
• Using “Taste Profile” data
User-Song-playcount
b80344d063b5ccb3212f76538f3d9e43d87dca9e
b80344d063b5ccb3212f76538f3d9e43d87dca9e
b80344d063b5ccb3212f76538f3d9e43d87dca9e
b80344d063b5ccb3212f76538f3d9e43d87dca9e
b80344d063b5ccb3212f76538f3d9e43d87dca9e
b80344d063b5ccb3212f76538f3d9e43d87dca9e
SOAKIMP12A8C130995
SOAPDEY12A81C210A9
SOBBMDR12A8C13253B
SOCNMUH12A6D4F6E6D
SODACBL12A8C13C273
SODDNQT12A6D4F5F7E
1
1
2
1
1
5
• Challenge:
Rank 1M songs to complete half a playlist
using audio, metadata, lyrics, whatever
• Kaggle.com based contest
self-service evaluation, leader board
• Results at MIREX, AdMIRe
Mining Audio - Dan Ellis
2012-09-14 18/41
Music: Summary
•
Finding Musical Similarity at Large Scale
•
Beat Chroma Representation
•
Million Song Dataset
•
MSD Challenge
Mining Audio - Dan Ellis
2012-09-14 19/41
3. Environmental Audio
•
Audio Lifelog
Diarization
09:00
09:30
10:00
10:30
11:00
11:30
2004-09-14
preschool
cafe
preschool
cafe
Ron
lecture
12:00
office
12:30
office
outdoor
group
13:00
L2
cafe
office
outdoor
lecture
outdoor
DSP03
compmtg
meeting2
13:30
lab
14:00
cafe
meeting2 Manuel
outdoor
office
cafe
office
Mike
Arroyo?
outdoor
Sambarta?
14:30
15:00
15:30
•
2004-09-13
16:00
office
office
office
postlec
office
Lesser
16:30
Consumer Video Classification & Search
Mining Audio - Dan Ellis
17:00
17:30
18:00
outdoor
lab
cafe
2012-09-14 20/41
Prior Work
• Environment
Classification
speech/music/
silent/machine
Zhang & Kuo ’01
• Nonspeech Sound Recognition
Temko & Nadeu ’06
Mining Audio - Dan Ellis
Meeting room
Audio Event Classification
sports events - cheers, bat/ball
sounds, ...
2012-09-14 21/41
Consumer Video Dataset
• 25 “concepts” from
1G+KL2 (10/15)
Kodak user study
boat, crowd, cheer, dance, ...
1
Labeled Concepts
museum
picnic
wedding
animal
birthday
sunset
ski
graduation
sports
boat
parade
playground
baby
park
beach
dancing
show
group of two
night
one person
singing
cheer
crowd
music
group of 3+
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
3mc c s o n g s d b p b p p b s g s s b a w pm 0
Overlapped Concepts
• Grab top 200 videos
8GMM+Bha (9/15)
from YouTube search
then filter for quality,
unedited = 1873 videos
manually relabel with
concepts
pLSA500+lognorm (12/15)
Mining Audio - Dan Ellis
2012-09-14 22/41
Obtaining Labeled Data
• Amazon Mechanical Turk
Y-G Jiang et al. 2011
10s clips
9,641 videos in 4 weeks
Mining Audio - Dan Ellis
2012-09-14 23/41
2. Background Classification
• Baseline for soundtrack classification
8
7
6
5
4
3
2
1
0
VTS_04_0001 - Spectrogram
MFCC
Covariance
Matrix
30
20
10
0
-10
MFCC covariance
-20
1
2
3
4
5
6
7
8
9
time / sec
20
18
16
14
12
10
8
6
4
2
1
2
3
4
5
6
7
8
9
time / sec
level / dB
18
16
20
15
10
5
0
-5
-10
-15
-20
value
14
12
0
10
8
6
4
2
• Classify by e.g. KL distance + SVM
Mining Audio - Dan Ellis
50
20
MFCC dimension
MFCC
features
MFCC bin
Video
Soundtrack
freq / kHz
divide sound into short frames (e.g. 30 ms)
calculate features (e.g. MFCC) for each frame
describe clip by statistics of frames (mean, covariance)
= “bag of features”
5
10
15
MFCC dimension
20
-50
2012-09-14 24/41
Codebook Histograms
• Convert high-dim. distributions to multinomial
8
150
6
4
7
2
MFCC
features
Per-Category
Mixture Component
Histogram
count
MFCC(1)
Global
Gaussian
Mixture Model
0
2
5
-2
6
1
10
14
8
100
15
9
12
13 3
4
-4
50
11
-6
-8
-10
-20
-10
0
10
20
0
1
2
3
MFCC(0)
4
5
6 7 8 9 10 11 12 13 14 15
GMM mixture
• Classify by distance on histograms
KL, Chi-squared
+ SVM
Mining Audio - Dan Ellis
2012-09-14 25/41
Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA) models each histogram
as a mixture of several ‘topics’
.. each clip may have several things going on
• Topic sets optimized through EM
p(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip)
=
GMM histogram ftrs
*
“Topic”
p(topic | clip)
p(ftr | clip)
“Topic”
AV Clip
AV Clip
GMM histogram ftrs
p(ftr | topic)
use (normalized?) p(topic | clip) as per-clip features
Mining Audio - Dan Ellis
2012-09-14 26/41
Background Classification
Results
Average Precision
K Lee & Ellis ’10
1
Guessing
MFCC + GMM Classifier
Single−Gaussian + KL2
8−GMM + Bha
1024−GMM Histogram + 500−log(P(z|c)/Pz))
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
pl
gr
• Wide range in performance
sk
su i
ns
e
bi
rth t
da
an y
im
w al
ed
di
ng
pi
cn
ic
m
us
eu
m
M
EA
N
a
sp t
o
gr
ad rts
ua
tio
n
bo
de
ra
nd
pa
ou
ay
gr
ba
by
rk
h
pa
ac
g
be
in
da
nc
ow
o
sh
tw
t
p
of
ni
gh
on
ou
e
pe
rs
g
in
r
si
ng
ee
d
ch
cr
us
m
ow
on
gr
ou
p
of
3+
0
ic
0.1
Concept class
audio (music, ski) vs. non-audio (group, night)
large AP uncertainty on infrequent classes
Mining Audio - Dan Ellis
2012-09-14 27/41
Sound Texture Features
McDermott Simoncelli ’09
Ellis, Zheng, McDermott ’11
• Characterize sounds by
perceptually-sufficient statistics
\x\
Sound
\x\
Automatic
gain
control
mel
filterbank
(18 chans)
Octave bins
0.5,1,2,4,8,16 Hz
FFT
\x\
\x\
Histogram
\x\
\x\
1159_10 urban cheer clap
distributions
& env x-corrs
Mahalanobis
distance ...
2404
1273
1
617
0
2
4
6
8
10
0
2
4
6
8
10
time / s
Texture features
0
level
15
10
5
M V S K 0.5 2 8 32
moments
Mining Audio - Dan Ellis
1062_60 quiet dubbed speech music
mel band
freq / Hz
• Subband
mean, var,
skew, kurt
(18 x 4)
Cross-band
correlations
(318 samples)
Envelope
correlation
.. verified by matched
resynthesis
Modulation
energy
(18 x 6)
mod frq / Hz
5
10 15
mel band
M V S K 0.5 2 8 32
moments
mod frq / Hz
5
10 15
mel band
2012-09-14 28/41
Sound Texture Features
• Test on MED 2010
development data
cLap
Cheer
Music
Speech
Dubbed
other
noisy
quiet
urban
rural
M
cLap
Cheer
Music
Speech
Dubbed
other
noisy
quiet
urban
rural
M
MED 2010 Txtr Feature (normd) means
0.5
0
−0.5
V
S
K
ms
cbc
MED 2010 Txtr Feature (normd) stddevs
1
0.5
0
V
S
K
ms
cbc
MED 2010 Txtr Feature (normd) means/stddevs
0.5
0
−0.5
V
S
K
ms
Mining Audio - Dan Ellis
cbc
0.9
0.8
10 specially-collected
manual labels
cLap
Cheer
Music
Speech
Dubbed
other
noisy
quiet
urban
rural
M
M
MVSK
ms
cbc
MVSK+ms
MVSK+cbc
MVSK+ms+cbc
1
0.7
0.6
0.5
Rural
Urban
Quiet
Noisy Dubbed Speech Music
Cheer
Clap
Avg
• Contrasts in
feature sets
correlation of labels...
• Perform
~ same as MFCCs
combine well
2012-09-14 29/41
Real-World Dictionary
• BBC Sound Effects as reference library
The Czech Republic - Slovakia
Hungary
Rural South America
Urban South
America
Footsteps
2
Footsteps 1
India Pakistan Nepal-Countrysid
India & Nepal-City Life
Exterior Atmospheres-Rural Background
England
France
Suburbia
Crowds
Birds
Istanbul
Emergency
Cats
Construction
Age Of Steam
Trains
Spain
Schools & Crowds
Horses &
Dogs
Horses
Farm Machinery
Livestock 2
Livestock 1
Adventure Sports
Greece
Equestrian Events
Africa The Natural World
Africa The Human World
Hospitals
Babies
China
Aircraft
America
Ships And Boats 2
Ships And Boats 1
Weather 1
Electronically Generated Sounds
Explosions Guns Alarms
Sport Leisure
Auto
Rural Soundscapes
Big Ben Taxi Bus Atmospheres
Industry
Birds
Rivers Streams Water
Computers Printers Phones
European Soundscapes
Misc
1000+ examples ...
comprehensive?
similarity via
normalized
textures
(over 10s
chunks)
Mining Audio - Dan Ellis
Audiences Children Crowds Foots
Animals and Birds
Transportation
Interior Atmosphere
Household
Exterior Atmosphere
BBC Sound Effects
BEH I T A A MEC RB I BRAS EE W S S A A C B H AAEGA L L F HHSST A C CE I B CSF EE I I F FURHT
2012-09-14 30/41
3. Foreground Event Recognition
• Global vs. local class models
K Lee, Ellis, Loui ’10
tell-tale acoustics may be ‘washed out’ in statistics
try iterative realignment of HMMs:
YT baby 002:
voice
baby
laugh
4
New Way:
Limited temporal extents
freq / kHz
freq / kHz
Old Way:
All frames contribute
3
4
2
1
1
5
10
voice
15
0
time / s
baby
3
2
0
voice
bg
5
voice baby
10
laugh
15
bg
time / s
laugh
baby
laugh
“background” model shared by all clips
Mining Audio - Dan Ellis
2012-09-14 31/41
Foreground Event HMMs
freq / Hz
Spectrogram : YT_beach_001.mpeg,
Manual Clip-level Labels : “beach”, “group of 3+”, “cheer”
• Training labels only
•
Refine models by
EM realignment
• Use for classifying
entire video...
or seeking to relevant
part
Mining Audio - Dan Ellis
etc
speaker4
speaker3
speaker2
unseen1
Background
25 concepts + 2 global BGs
at clip-level
4k
3k
2k
1k
0
global BG2
global BG1
music
cheer
ski
singing
parade
dancing
wedding
sunset
sports
show
playground
picnic
park
one person
night
museum
group of 2
group of 3+
graduation
crowd
boat
birthday
beach
baby
animal
0
0
−50
drum
Manual Frame-level Annotation
wind
cheer
voice
voice
voice
water wave
voice
voice voice
voice cheer
Viterbi Path by 27−state Markov Model
1
2
3
4
5
6
7
8
Lee & Ellis ’10
9
10
time/sec
2012-09-14 32/41
Transient Features
Cotton, Ellis, Loui ’11
• Transients =
•
foreground events?
Onset detector
finds energy bursts
best SNR
• PCA basis to
represent each
300 ms x auditory freq
• “bag of transients”
Mining Audio - Dan Ellis
2012-09-14 33/41
Nonnegative Matrix Factorization
templates
+ activation
freq / Hz
• Decompose spectrograms into
3474
2203
883
442
Basis 1
(L2)
Basis 2
(L1)
Basis 3
(L1)
0
Mining Audio - Dan Ellis
Original mixture
1398
X=W·H
fast forgiving
gradient descent
algorithm
2D patches
sparsity control...
5478
Smaragdis Brown ’03
Abdallah Plumbley ’04
Virtanen ’07
1
2
3
4
5
6
7
8
9
10
time / s
2012-09-14 34/41
NMF Transient Features
• Learn 20 patches from
Acoustic Event Error Rate (AEER%)
•
Compare to MFCC-HMM
detector
3474
Error Rate in Noise
Patch 1
Patch 2
Patch 3
Patch 4
Patch 5
Patch 6
Patch 7
Patch 8
Patch 9
Patch 10
Patch 11
Patch 12
Patch 13
Patch 14 Patch 15
Patch 16
Patch 17
Patch 18
Patch 19 Patch 20
1398
442
3474
1398
442
3474
1398
442
3474
1398
442
0.0
250
0.5 0.0
0.5 0.0
• NMF more
200
150
MFCC
NMF
combined
15
Mining Audio - Dan Ellis
0.5 0.0
0.5 0.0
0.5
time (s)
noise-robust
100
0
10
freq (Hz)
Meeting Room Acoustic
Event data
300
50
Cotton, Ellis ’11
combines well ...
20
SNR (dB)
25
30
2012-09-14 35/41
Matching Videos via Fingerprints
are a noise-robust
fingerprint
freq / kHz
• Landmark pairs
Cotton & Ellis ’10
VIdeo IMpLQaiHWbE at 195s
4
3
2
• Use to match
0
freq / kHz
distinct videos
with same
sound ambience
1
195.5
196
196.5
197
197.5
198
198.5
199
time / sec
VIdeo Yi1hkNkqHBc at 218 s
4
3
2
1
0
Mining Audio - Dan Ellis
218.5
219
219.5
220
220.5
221
221.5
222
time / sec
2012-09-14 36/41
4. Outstanding Issues
• Better object/event separation
parametric models
spatial information?
computational
auditory
scene analysis...
Frequency Proximity
• Large-scale analysis
• Integration with video
Mining Audio - Dan Ellis
Common Onset
Harmonicity
Barker et al. ’05
2012-09-14 37/41
Audio-Visual Atoms
Jiang et al. ’09
• Object-related features
from both audio (transients) & video (patches)
time
visual atom
...
visual atom
...
time
Mining Audio - Dan Ellis
audio atom
frequency
frequency
audio spectrum
audio atom
Joint Audio-Visual Atom
discovery
...
visual atom
...
audio atom
time
2012-09-14 38/41
Audio-Visual Atoms
• Multi-instance learning of A-V co-occurrences
M Visual Atoms
visual atom
visual atom
time
…
…
…
time
codebook
learning
Joint A-V
Codebook
…
audio atom
audio atom
M×N
A-V Atoms
…
N audio atoms
Mining Audio - Dan Ellis
2012-09-14 39/41
Audio-Visual Atoms
black suit
+ romantic
music
marching
people
+ parade
sound
Wedding
sand
+ beach
sounds
Parade
Beach
Audio only
Visual only
Visual + Audio
Mining Audio - Dan Ellis
2012-09-14 40/41
Summary
• Machine Listening:
Getting useful information from sound
• Background sound classification
... from whole-clip statistics?
• Foreground event recognition
... by focusing on peak energy patches
• Speech content is very important
... separate with pitch, models, ...
Mining Audio - Dan Ellis
2012-09-14 41/41
Download