Clap Detection and Discrimination for Rhythm Therapy Nathan Lesser & Dan Ellis

advertisement
Clap Detection and Discrimination
for Rhythm Therapy
Nathan Lesser & Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
{nathan,dpwe}@ee.columbia.edu
1.
2.
3.
4.
“Rhythm Therapy”
Clap Range Estimation
Experiments
Conclusions
Clap Detection - Lesser & Ellis
2005-03-22
p. 1 /14
1. “Rhythm Therapy”
• Rhythmic clapping may
help neural development
sensori-motor planning
focus and attention
• “Interactive metronome”
devices
give feedback on synchrony
sensor-based
• Classroom deployment?
acoustic-based?
for multiple simultaneous users??
Clap Detection - Lesser & Ellis
from interactivemetronome.com
2005-03-22
p. 2 /14
Clap Discrimination
• Scenario:
Many students in same classroom
each clapping in time to their own laptop
students wear headphones (but no sensor)
computer hears neighbors
• Goal:
Discriminate between ‘near-field’
and ‘far-field’ claps
‘near-field’ = ~1 meter, on-axis
‘far-field’ = > 2 meters, maybe off-axis
Clap Detection - Lesser & Ellis
2005-03-22
p. 3 /14
Data Collection
• Record isolated claps at various locations
can superimpose them later...
• Grid of seats:
Classroom Plan View
Clapping
0 locations
claps from locations 0..9
record at locations 5 & 9 only
• Multiple rooms
pilot: 1 room,
2 x 5 claps/location
main data: 2 (+2) rooms,
1 x 50 farfield claps/location
+ 300 nearfield claps/rec.loc.
= 1500 claps/room
Clap Detection - Lesser & Ellis
1
2
3
4
5
6
5
7
8
9
9
Recording
locations
Front of Room
2005-03-22
p. 4 /14
2. Clap Range Estimation
• Task:
Discriminate claps from in front of rig
from all others (more distant)
main perceptual cue to distance (range):
direct-to-reverberant ratio (DRR)
how to differentiate direct and reverb?
• Novel problem: Acoustic range estimation
define correlates of DRR
exploit properties of claps (wideband, compact)
.. then just feed to classifier
Clap Detection - Lesser & Ellis
2005-03-22
p. 5 /14
•
reverberation
(RT60 ~ 900ms)
Initial burst for
near-field
“direct sound”
Far-field (327MUDD ff50:4)
0.25
0
-0.25
-0.5
energy (4ms) / dB
~ same
Near-field (327MUDD nf50:4)
0.5
0
-20
-40
-60
-80
freq / kHx
• Absolute level
varies
• Decay slopes
amplitude
Clap Examples
10
8
6
4
2
0
0
Clap Detection - Lesser & Ellis
0.1
0.2
0.3
0.4
0.5
0.6
0.7
time / s
0
0.1
2005-03-22
0.2
0.3
0.4
p. 6 /14
0.5
0.6
0.7
time / s
Processing
• Detection → Features → Classifier
!"#$%&'&(&
)*#+&,%"%-"./0
12/0"&304
5#6&784./
>=5/*,0?6.@0
&CDB=.<;B+
-.+*/01.*,230
45670O.<=L*+
*45678
A=.<;B=.2
9:,*+:;5<=.2
<2#.0.0$&>#?%*;
!"#$%&''(&
)#"%$/2.9#"./0
<2#.0.0$=<%;"
:%"&1%#"82%;
9,6=.=.2
E5;7*0
4*./*,0;F0G6++
4,;++04;,,*56/=;.
Clap!"#$#%&
Detection - Lesser & Ellis
9*+/
;,0
9,6=.=.2
H)E4
G;<*5
9*+/
'()*++*,
IF*6/J,*0K*L/;,M0C0N;<*5 5%;8*";
2005-03-22
p. !7 /14
Clap Detection
• Simple transient detector
limits feature calculation to ‘clap events’
on Δ(Energy20ms)
to get desired
number of claps
known for our data
8
6
4
2
ratio / dB
• Adjust threshold
freq / kHz
Far-field (327MUDD ff50:1-5)
10
0
40
20
0
0
0.5
1
1.5
2
2.5
3
3.5
4
• Backup from maxima to find precise onset
Fielded system will need to adapt threshold
and reject non-claps
Clap Detection - Lesser & Ellis
2005-03-22
p. 8 /14
4.5
5
time / s
Range Features
dB
• Paper: Ctr. of Mass, Slope in 0..20 , 0..100ms
CoM20ms
Near-field (327MUDD nf50:4)
Far-field (327MUDD ff50:4)
CoM100ms
-20
-40
-40
slope20ms
slope100ms
-60
•
• New: Slope in 0..20ms , 20..100ms
-60
-80
0
0.05
0.1
0.15
0
0.05
0.1
time / s
0.15
+ Energy Ratio 0..20ms / 20..100ms
dB
Near-field (327MUDD nf50:4)
Far-field (327MUDD ff50:4)
-20
-40
energy ratio
-40
slope20:100ms
-60
-60
-80
0
0.05
0.1
Clap Detection - Lesser & Ellis
0.15
0
0.05
0.1
2005-03-22
time / s
0.15
p. 9 /14
Range Feature Behavior
327MUDD loc 5
• Original 4 features
6
good separation
except CoM20
Eratio excellent
slope20:100 useless...
• Range estimation?
CoM20, slope20
show promise
4
2
2
1
0
0
20
2-4kHz band
• New features
3
CoM 20
2
4
6
8
0
20
slope 20
0
0
CoM 100
0
0.5
1
1.5
2
-20
-10
0
slope100
-20
-20
-40
-40
-40
30
-20
0
20
40
Eratio
-60
-40
10
20
0
10
-10
0
-20
-30
slope 20:100
p0
-10
0
5
10
15
20
25
-30
-30
p1 p2 p3
-20
4-8kHz band
(each plot shows 4-8 kHz band vs. 2-4 kHz band)
Clap Detection - Lesser & Ellis
2005-03-22
p. 10/14
-10
p4 NF p6
p7 p8 p9
3. Experiments
• Build and test actual near/far-field classifier
• Feature experiments
quantitative feature comparison
best combinations
• Data experiments
training data: amount, locations
test data: same/different room/location
• Regularized Least-Squares Classifier (RLSC)
find a hyperplane in (expanded) feature space
~ simplified Support Vector Machine - no QP
Clap Detection - Lesser & Ellis
2005-03-22
p. 11/14
Feature Comparisons
• Train on room 327Mudd; Test on 627Mudd
Feature comparison: All 3 bands, train on all M327, test on all M627
clap error rate / %
50
40
30
20
10
0
CoM_20
CoM_100
slo_20
slo_100
feature set
slo_20:100
Eratio
• Eratio alone (9/1500 = 0.6% errors) beats
best combination of rest:
(CoM20+ CoM100+ slo20 = 0.9% errors)
difference of ~0.5% required for signficance
Clap Detection - Lesser & Ellis
2005-03-22
p. 12/14
Generalizing Location, Room
• Matrix of 2 rooms x 2 recording locations
CER%
Train
M627L5
Test
M627L9 M327L5
M327L9
M627L5
2.0
0.5
0.4
0.0
M627L9
3.7
0.4
0.7
0.0
M327L5
1.5
0.5
0.4
0.0
M327L9
0.1
0.7
0.4
0.0
627Mudd loc5 is hard data; 327Mudd loc9 is easy!
Cross-room (shaded) cases generalize better !?
Plenty of data: 5 claps/loc (20%) just as good
Clap Detection - Lesser & Ellis
2005-03-22
p. 13/14
4. Conclusions
• Discriminating isolated near- and far-field
claps is feasible (use Eratio 0..20/20..100ms)
• Detection of candidate claps likely to limit
accuracy in practice
but have ‘rhythmic’ expectations...
• Applicability to general range estimation?
Eratio relies on short-duration direct-sound
..but other sounds have clicks (e.g. speech bursts)
CoM20, slope20 closer to proportional to range
Clap Detection - Lesser & Ellis
2005-03-22
p. 14/14
Azimuth Features
• Cross-correlation of L and R for azimuth:
ITD scatter vs. source (for MUDD327 pos 5)
p0
p1 p2 p3
2-4kHz
p4 NF p6
p7 p8 p9
-0.5
-1
-1.5
-1.5
-1
-0.5
0
0.5
4-8kHz
1
1.5
nearby locations distinguished - useful
distant locations (p2) give random results
needs nonlinear feature space expansion!
Clap Detection - Lesser & Ellis
2005-03-22
p. 15/14
Error Analysis
• 627Mudd (record loc 5) is the tough set;
false
accepts
for loc 6
(ambiguous
Eratio)
look at classifier margins:
a few solid
false rejects...
m4[14:16] vs f627
2
1
... really look like
far-field???
0
-1
-2
50 51 52 53 54 56 57 58 59 5n 5n 5n 5n 5n 5n 90 91 92 93 94 95 96 97 98 9n 9n 9n 9n 9n
m6[14:16] vs f627
2
Claps 33 and 34 from 627M:nf90
1
1
0.5
0
0
-0.5
-1
-2
2
0
1
-20
0
-40
-1
2
-2
0
5
10
15
20
25
30
35
40
45
50
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
0
4
x 10
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
0
0.2
0.4
0.6
0.8
1
Time
1.2
1.4
1.6
1.8
1.5
1
0.5
0
Clap Detection - Lesser & Ellis
0
20
50 51 52 53 54 56 57 58 59 5n 5n 5n 5n 5n 5n 90 91 92 93 94 95 96 97 98 9n 9n 9n 9n 9n
Frequency
0
123
456
789
-1
2005-03-22
p. 16/14
Usefulness of Each Position
• Train on 50 near-field claps + 50 far-field
claps from a single location:
Location comparison (Erat ftrs): train M327L5 one loc, test on all M627L5
7
clap error rate / %
6
5
4
3
2
1
0
p0
p1
p2
p3
p4
p6
p7
Far field training examples location
all recorded at location 5
‘behind’ (p7-p9) less useful
right-side (p3, p6) most useful !?
Clap Detection - Lesser & Ellis
p8
p9
all
p0
p1 p2 p3
p4 p5 p6
p7 p8 p9
2005-03-22
p. 17/14
Download