Multimedia Applications of Audio Recognition 1. 2.

advertisement
Multimedia Applications
of Audio Recognition
Dan Ellis
LabROSA
Columbia EE
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
1. Finding speaker turns in meetings
2. Segmenting ‘personal audio’ recordings
3. Eigenrhythms: representing drum tracks
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
1. Finding Speaker Turns in Meetings
mic recordings carry information
• Multiple
on speaker turns
every voice reaches every mic... (?)
... but with differing coupling filters (delays, gains)
• Find turns with minimal assumptions
e.g. ad-hoc sensor setups (multiple PDAs)
differences to remove effect of source signal
- no spectral models, < 1xRT
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Between-Channel Cues:
Timing (ITD) & Level
Speaker activity
Speaker
ground-truth
xocrr peak lags (5pt med filt)
skew/samp
50
Timing diffs (ITD)
(2 mic pairs, 250ms win)
0
-50
125
130
135
140
145
150
norm xcorr pk val
155
160
165
170
175
1
Peak correlation
coefficient r
0.5
0
125
130
135
140
145
150
per-chan E
155
160
165
170
175
-40
Per-channel
energy
dB
-50
-60
-70
-80
125
130
135
140
145
150
chan E diffs
155
160
165
170
175
10
Between-channel
energy differences
dB
5
0
-5
-10
125
130
135
140
145
150
time/s
155
MM workshop ’04 - Dan Ellis
Columbia University
160
165
170
175
2004-06-18
Choosing “Good” Frames
coef. r
• Correlation
~ channel similarity:
!n mi[n] · m j [n + !]
ri j [!] = !
! m2i ! m2j
• Select frames with r in top 50% in both pairs
•
ITD - high-correlation points (435/1201)
0
Skew34 / samples
Skew34 / samples
ITD - all points
-50
-100
0
-50
-100
•
• Cleaner basis for models
-150
-100
-50
Skew12 / samples
0
MM workshop ’04 - Dan Ellis
Columbia University
about 35% of points
-150
-100
-50
Skew12 / samples
0
2004-06-18
Spectral Clustering
of “affinity matrix” A
• Eigenvectors
to pick out similar points:
Affinity matrix A
first 12 eigenvectors (normalized)
2
400
0
350
300
-2
250
-4
200
-6
150
-8
100
-10
50
100
200
300
point index
400
-12
0
100
•
• Ad-hoc mapping to clusters
200
300
point index
400
amn = exp{−"x[m] − x[n]"2/2!2}
Number of clusters K from eigenvalues ≈ points
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Speaker Models & Classification
• Actual clusters depend on ! and K heuristic
Gaussians to each cluster,
• Fit
assign that class to all frames within radius
or: consider dimension independently, choose best
ICSI0: good points
All pts: nearest class
All pts: closest dimension
0
0
0
-20
-20
-20
-40
-40
-40
-60
-60
-60
-80
-80
-80
-100
-100
-50
0
-100
-100
MM workshop ’04 - Dan Ellis
Columbia University
-50
0
-100
-100
-50
0
2004-06-18
Performance Analysis
• Compare reference & system activity maps:
system misses quiet speakers 2,3,4 (deletions)
system splits speaker 6 (deletions+insertions)
many short gaps (deletions)
• NIST RT04s dev set (80 min):
correct: 34.4% ins: 16.5% del: 63.7%
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
2. Segmenting Personal Audio
• Easy to record everything you hear
~100GB / year @ 64 kbps
• Very hard to find anything
how to scan?
how to visualize?
how to index?
• Starting point: Collect data
~ 60 hours (8 days, ~7.5 hr/day)
hand-mark 139 segments (26 min/seg avg.)
assign to 41 classes (8 have multiple instances)
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
Normalized Energy Deviation
Average Linear Energy
120
15
100
10
80
15
40
10
20
5
5
dB
Average Log Energy
60
dB
Log Energy Deviation
120
15
100
10
80
20
freq / bark
20
freq / bark
60
20
freq / bark
freq / bark
20
5
15
15
10
10
5
5
60
dB
dB
Spectral Entropy Deviation
Average Spectral Entropy
0.9
0.8
15
0.7
10
5
•
0.6
0.5
bits
0.5
15
0.4
10
0.3
0.2
5
0.1
50
100
150
200
250
300
350
400
450
time / min
and structure within coarse auditory bands
MM workshop ’04 - Dan Ellis
Columbia University
20
freq / bark
freq / bark
20
2004-06-18
bits
BIC Segmentation
• Untrained segmentation technique
statistical test indicates good change points:
log
L(X1 ;M1 )L(X2 ;M2 )
L(X;M0 )
≷
λ
2
log(N )∆#(M )
• Evaluate against hand-marked boundaries
different features & combinations
Correct Accept % @ False Accept = 2%:
80.8%
81.1%
81.6%
84.0%
83.6%
0.8
0.7
Sensitivity
µdB
µH
σH/µH
µdB + σH/µH
µdB + σH/µH + µH
0.6
0.5
0.3
0.2
0
MM workshop ’04 - Dan Ellis
Columbia University
µdB
µH
!H/µH
µdB + !H/µH
µdB + µH + !H/µH
0.4
0.005
0.01
0.015
0.02
0.025
1 - Specificity
0.03
0.035
2004-06-18
0.04
Future Work
• Clustering of segments
.. using spectral clustering
compare to ground-truth, confusions
• Visualization / browsing:
freq / Bark
Hand-marked
boundaries
20
15
10
5
21:00
22:00
23:00
• Privacy protection
0:00
1:00
2:00
3:00
4:00
Clock time
speaker/speech “search and destroy”
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
3. Eigenrhythms:
Representing Drum Tracks
• Pop songs built on repeating “drum loop”
bass drum, snare, hi-hat
small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
by analyzing lots of (MIDI) data
• Applications
music categorization
“beat box” synthesis
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Aligning the Data
• Need to align patterns prior to PCA...
tempo (stretch):
by inferring BPM &
normalizing
downbeat (shift):
correlate against
‘mean’ template
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Eigenrhythms
20+ Eigenvectors for good coverage
• Need
of 100 training patterns (1200 dims)
• Top patterns:
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Eigenrhythms for Classification
6
in
• Clusters
Eigenspace:
All tracks projected onto 1st two eigenrhythms
ho:inside
hh:gThang
rb:honey
4
rc:whteroom
pp:dlla
l
hh:rufryder
bl:hideaway
rb:heyLover
Eigenrhythm 2
2
0
rc:californ
nw:evcount
s
nw:psboysi
n ho:pvandyk di:danqueen pp:distance
di:boot
yrc:zztop
nw:dontyou
rb:mgirlsat
hh:1mChance
rb:downlow
nw:pur
e
di:satnight
nw:amadeus
pu:blitzkr
gpu:bSedated
rc:jump
di:funkytwn
hh:nEpisode
co:alabama
hh:bigpimpn
pu:rubysoho
rc:money
hh:stan
hh:jackson
bl:crosfire
rc:tuesdays
bl:thrill
pp:lkvirgin
pp:fly
hh:slmshady
pu:beatbrat
rc:hardday
rc:blackdog
nw:deserve
pu:waitinRm
pp:lvprayer
co:SArose
hh:superst
rdi:lafreak
pp:mjBeatit
di:dontstop
co:walkline
nw:bmonday
nw:whipi
trb:chgWorld
pp:loveshck
rc:rolstone
di:carwash bl:meanwoma
nw:dbdance
bl:blues2gm
co:aftermid
co:walkmi
d
ho:modjo
pu:happyguy
pu:bombshel
co:goodlook
bl:onebeer
hh:bigPoppa
bl:dimples
co:byYrMan
bl:chicken
co:texas
rc:layl
a
co:tennesse
rb:volove
di:boogient
ho:bemylove
pu:aWal
k ho:dpworld
rb:lsaround
-2
di:discoinf
di:boogiewl
pp:bholly
ho:onemore
bl:boomboom
-4
co:ringfire
rb:bismine
nw:banvenus
ho:badtouch
-6
-6
-4
-2
0
Eigenrhythm 1
pp:onemore
pu:anarchy
pp:downundr
2
4
• Genre classification? (10 way)
nearest neighbor in 4D eigenspace: 21% correct
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
6
Summary:
Audio Analysis for Multimedia
• Spatial cues for meeting recordings
recover meeting ‘structure’ from ad-hoc recordings
• Segmenting ‘personal audio’ recordings
what are good features for 1 minute frames?
• Eigenrhythm analysis of drum patterns
define the subspace of ‘musical’ rhythms
MM workshop ’04 - Dan Ellis
Columbia University
2004-06-18
Download