Clap Detection and Discrimination for Rhythm Therapy

Nathan Lesser & Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Engineering, Columbia University, NY USA
{nathan,dpwe}@ee.columbia.edu

1. “Rhythm Therapy”
2. Clap Range Estimation
3. Experiments
4. Conclusions

Clap Detection - Lesser & Ellis 2005-03-22 p. 1 /14

1. “Rhythm Therapy”

• Rhythmic clapping may help
    neural development
    sensori-motor planning
    focus and attention
• “Interactive metronome” devices give feedback on synchrony
    sensor-based
    (image from interactivemetronome.com)
• Classroom deployment?
    acoustic-based?
    for multiple simultaneous users??

Clap Discrimination

• Scenario: Many students in same classroom
    each clapping in time to their own laptop
    students wear headphones (but no sensor)
    computer hears neighbors
• Goal: Discriminate between ‘near-field’ and ‘far-field’ claps
    ‘near-field’ = ~1 meter, on-axis
    ‘far-field’ = > 2 meters, maybe off-axis

Data Collection

• Record isolated claps at various locations
    can superimpose them later...
• Grid of seats:
    claps from locations 0..9
    record at locations 5 & 9 only
• Multiple rooms
    pilot: 1 room, 2 x 5 claps/location
    main data: 2 (+2) rooms, 1 x 50 far-field claps/location + 300 near-field claps/rec. loc. = 1500 claps/room

[Figure: Classroom Plan View. Clapping locations 0..9 on a grid of seats, with the front of the room marked; recording locations are 5 and 9.]

2. Clap Range Estimation

• Task: Discriminate claps from in front of rig from all others (more distant)
    main perceptual cue to distance (range): direct-to-reverberant ratio (DRR)
    how to differentiate direct and reverb?
• Novel problem: Acoustic range estimation
    define correlates of DRR
    exploit properties of claps (wideband, compact)
    .. then just feed to classifier
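For reference, the direct-to-reverberant ratio named above is commonly written in terms of a room impulse response. The symbols here are assumptions introduced for illustration and do not appear in the slides: h(t) is the impulse response from the clap position to the microphone, with t = 0 at the arrival of the direct path, and T_d is a short window taken to contain the direct sound.

```latex
\mathrm{DRR} \;=\; 10 \log_{10}
  \frac{\displaystyle\int_{0}^{T_d} h^{2}(t)\,dt}
       {\displaystyle\int_{T_d}^{\infty} h^{2}(t)\,dt}
  \quad \text{[dB]}
```

Because h(t) is not observed directly in this setting, the slides work with correlates of this ratio measured on the recorded clap itself rather than with the DRR proper.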

Clap Examples

• Absolute level varies
• Decay slopes ~ same
    reverberation (RT60 ~ 900ms)
• Initial burst for near-field
    “direct sound”

[Figure: Near-field (327MUDD nf50:4) and far-field (327MUDD ff50:4) claps: waveform amplitude, energy (4ms) / dB, and spectrogram (freq / kHz) vs. time / s.]

Processing

• Detection → Features → Classifier

[Figure: block diagram of the processing chain. Labels are garbled in the extraction; the legible ones include clap detection, feature calculation (center of mass, slope, cross-correlation), a training/test split, the RLSC model, and results.]

Clap Detection

• Simple transient detector on Δ(Energy20ms)
    limits feature calculation to ‘clap events’
• Adjust threshold to get desired number of claps
    known for our data
• Back up from maxima to find precise onset

Fielded system will need to adapt threshold and reject non-claps

[Figure: Far-field (327MUDD ff50:1-5): spectrogram (freq / kHz) and ratio / dB vs. time / s.]

Range Features

• Paper: Ctr. of Mass, Slope in 0..20ms, 0..100ms
• New: Slope in 0..20ms, 20..100ms + Energy Ratio 0..20ms / 20..100ms

[Figure: energy envelopes (dB vs. time / s) for near-field (327MUDD nf50:4) and far-field (327MUDD ff50:4) claps, annotated with CoM20ms, CoM100ms, slope20ms, slope100ms (top) and energy ratio, slope20:100ms (bottom).]

Range Feature Behavior (327MUDD loc 5)

• Original 4 features
    good separation except CoM20
• New features
    Eratio excellent
    slope20:100 useless...
• Range estimation?
    CoM20, slope20 show promise

[Figure: scatter plots of CoM 20, CoM 100, slope 20, slope100, Eratio, and slope 20:100; each plot shows 4-8 kHz band vs. 2-4 kHz band, with points labeled by source position p0..p4, NF, p6..p9.]

3. Experiments

• Build and test actual near/far-field classifier
• Feature experiments
    quantitative feature comparison
    best combinations
• Data experiments
    training data: amount, locations
    test data: same/different room/location
• Regularized Least-Squares Classifier (RLSC)
    find a hyperplane in (expanded) feature space
    ~ simplified Support Vector Machine - no QP

Feature Comparisons

• Train on room 327Mudd; Test on 627Mudd
• Eratio alone (9/1500 = 0.6% errors) beats best combination of rest (CoM20 + CoM100 + slo20 = 0.9% errors)
    difference of ~0.5% required for significance

[Figure: Feature comparison: all 3 bands, train on all M327, test on all M627. Clap error rate / % for feature sets CoM_20, CoM_100, slo_20, slo_100, slo_20:100, Eratio.]

Generalizing Location, Room

• Matrix of 2 rooms x 2 recording locations

CER%        Test:
Train:      M627L5   M627L9   M327L5   M327L9
M627L5       2.0      0.5      0.4      0.0
M627L9       3.7      0.4      0.7      0.0
M327L5       1.5      0.5      0.4      0.0
M327L9       0.1      0.7      0.4      0.0

627Mudd loc5 is hard data; 327Mudd loc9 is easy!
Cross-room (shaded) cases generalize better !?
Plenty of data: 5 claps/loc (20%) just as good

4. Conclusions

• Discriminating isolated near- and far-field claps is feasible
    (use Eratio 0..20/20..100ms)
• Detection of candidate claps likely to limit accuracy in practice
    but have ‘rhythmic’ expectations...
• Applicability to general range estimation?
    Eratio relies on short-duration direct-sound
    ..but other sounds have clicks (e.g. speech bursts)
    CoM20, slope20 closer to proportional to range

Azimuth Features

• Cross-correlation of L and R for azimuth:
    nearby locations distinguished - useful
    distant locations (p2) give random results
    needs nonlinear feature space expansion!

[Figure: ITD scatter vs. source (for MUDD327 pos 5); 4-8kHz vs. 2-4kHz, points labeled p0..p4, NF, p6..p9.]

Error Analysis

• 627Mudd (record loc 5) is the tough set;
    false accepts for loc 6 (ambiguous Eratio)
• look at classifier margins:
    a few solid false rejects...
    ... really look like far-field???

[Figure: classifier margins m4[14:16] vs f627 and m6[14:16] vs f627, by clap source (50..59, 5n, 90..98, 9n); waveform and spectrogram (Frequency vs. Time) of claps 33 and 34 from 627M:nf90.]

Usefulness of Each Position

• Train on 50 near-field claps + 50 far-field claps from a single location:
    all recorded at location 5
    ‘behind’ (p7-p9) less useful
    right-side (p3, p6) most useful !?

[Figure: Location comparison (Erat ftrs): train M327L5 one loc, test on all M627L5. Clap error rate / % vs. far-field training examples location (p0..p4, p6..p9, all).]
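
The Detection → Features → Classifier chain described in the slides can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: all function names are hypothetical, the 20 dB threshold and the 20%-of-peak onset rule are arbitrary choices, Eratio is computed broadband (the slides compute features per frequency band), the classifier is a linear regularized least-squares fit on a single feature (the slides mention an expanded feature space), and the claps are synthetic noise bursts with an exponential tail standing in for recorded data.

```python
import math
import random


def frame_energies_db(x, win):
    """Energy in dB of consecutive non-overlapping frames of `win` samples."""
    return [10 * math.log10(sum(v * v for v in x[i:i + win]) + 1e-12)
            for i in range(0, len(x) - win + 1, win)]


def detect_onsets(x, sr, thresh_db=20.0, win_s=0.020, min_gap_s=0.2):
    """Flag a clap where 20 ms frame energy jumps by more than thresh_db."""
    win = int(win_s * sr)
    e_db = frame_energies_db(x, win)
    onsets = []
    for i in range(1, len(e_db)):
        if e_db[i] - e_db[i - 1] <= thresh_db:
            continue
        # Crude stand-in for the "back up to the precise onset" step: scan
        # from the previous frame for the first sample at 20% of the local peak.
        lo, hi = (i - 1) * win, (i + 1) * win
        peak = max(abs(v) for v in x[lo:hi])
        onset = next(n for n in range(lo, hi) if abs(x[n]) >= 0.2 * peak)
        if not onsets or onset - onsets[-1] >= min_gap_s * sr:
            onsets.append(onset)
    return onsets


def eratio_db(x, sr, onset, early_s=0.020, late_s=0.100):
    """Energy in 0..20 ms over energy in 20..100 ms after the onset, in dB."""
    a = onset + int(early_s * sr)
    b = onset + int(late_s * sr)
    early = sum(v * v for v in x[onset:a])
    late = sum(v * v for v in x[a:b])
    return 10 * math.log10((early + 1e-12) / (late + 1e-12))


def rlsc_fit_1d(xs, ys, lam=1e-2):
    """Ridge-regularized least squares for score = w*x + b, labels in {-1, +1}."""
    n = len(xs)
    sxx = sum(x * x for x in xs) + lam
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    # Solve the 2x2 regularized normal equations by Cramer's rule.
    det = sxx * (n + lam) - sx * sx
    w = (sxy * (n + lam) - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b


def synth_clap(sr, direct_gain, rng, rt60=0.9, dur=0.3):
    """Synthetic clap: a 5 ms noise burst plus an exponentially decaying tail."""
    out = []
    for n in range(int(dur * sr)):
        t = n / sr
        tail = 0.05 * rng.gauss(0, 1) * math.exp(-6.91 * t / rt60)
        direct = direct_gain * rng.gauss(0, 1) if t < 0.005 else 0.0
        out.append(direct + tail)
    return out
```

A typical use would call `detect_onsets` on a recording, compute `eratio_db` at each onset, and threshold `w * eratio + b` at zero, with positive scores read as near-field. On the synthetic claps a strong direct burst gives a clearly higher Eratio than a weak one, which matches the behavior the slides report for the real recordings only in a qualitative sense.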