Segmenting and Classifying Long-Duration Recordings of “Personal Audio” Dan Ellis and Keansub Lee Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA {dpwe,kslee}@ee.columbia.edu 1. 2. 3. 4. 5. “Personal Audio” Features Segmentation Clustering Future Work Segmenting Personal Audio - Ellis & Lee 2004-10-03 1. Personal Audio • Easy to record everything you hear <2GB / week @ 64 kbps hard to • Very find anything how to scan? how to visualize? how to index? • Need automatic analysis Segmenting Personal Audio - Ellis & Lee 2004-10-03 Applications • Automatic appointment-book history fills in when & where of movements • “Life statistics” how long did I spend in meetings this week vs. last most frequent conversations favorite phrases?? • Retrieving details what exactly did I promise? privacy issues... • Nostalgia? Segmenting Personal Audio - Ellis & Lee 2004-10-03 Data Set • Starting point: Collect data 62 hours recorded (8 days, ~7.5 hr/day) hand-mark 139 segments (26 min/seg avg.) assign to 16 classes (11 have multiple instances) Label Library Campus Restaurant Bowling Lecture 1 Car/Taxi Street total mins 981 750 560 244 234 165 162 Segmenting Personal Audio - Ellis & Lee total segs 27 56 5 2 4 7 16 2004-10-03 2. Features duration recordings • Long may benefit from longer basic time-frames 60s rather than 10ms? • Perceptually-motivated features broad spectrum + some detail? • For diary application... background more important than foreground? smooth out uncharacteristic transients Segmenting Personal Audio - Ellis & Lee 2004-10-03 Feature sets Normalized Energy Deviation Average Linear Energy 120 15 100 10 80 15 40 10 20 5 5 dB Average Log Energy 60 dB Log Energy Deviation 120 15 100 10 80 20 freq / bark 20 freq / bark 60 20 freq / bark freq / bark 20 5 15 15 10 10 5 5 60 dB dB Spectral Entropy Deviation Average Spectral Entropy 0.9 0.8 15 0.7 10 0.6 5 0.5 bits 20 freq / bark freq / bark 20 0.5 15 0.4 10 0.3 0.2 5 0.1 50 100 150 200 250 300 350 400 • Capture both average and variation • Capture a little more detail in subbands... Segmenting Personal Audio - Ellis & Lee 450 time / min 2004-10-03 bits Spectral Entropy NF • Auditory spectrum: A[n, j] = ! w X[n, k] •• Spectral entropy ≈ ‘peakiness’ of each band: ! " jk k=0 NF energy / dB w jkX[n, k] w jkX[n, k] H[n, j] = − ! · log A[n, j] k=0 A[n, j] FFT spectral magnitude 0 -20 Auditory Spectrum -40 -60 rel. entropy / bits 0 1000 2000 3000 4000 5000 6000 7000 8000 0.5 0 per-band Spectral Entropies -0.5 -1 30 340 750 1130 1630 2280 3220 3780 Segmenting Personal Audio - Ellis & Lee 4470 5280 6250 7380 freq / Hz 2004-10-03 3. BIC segmentation • BIC (Bayesian Information Criterion): Compare more and less complex models log L(X1 ;M1 )L(X2 ;M2 ) L(X;M0 ) • For segmentation: ≷ λ 2 log(N )∆#(M ) Grow context window from current boundary For each window, test every possible segmentation When BIC is positive, mark new segment candidate boundary last segmentation point current context limit 0 N L(X1;M1) time L(X2;M2) L(X;M0) Segmenting Personal Audio - Ellis & Lee 2004-10-03 BIC Segmentation Example BIC score AvgLogAudSpec 2004-09-10-1023_AvgLEnergy 20 15 10 5 boundary passes BIC last seg point 0 -100 no boundary found with shorter window -200 13:30 14:00 14:30 15:00 • No training or stored models Segmenting Personal Audio - Ellis & Lee current window limit 15:30 16:00 time / hr 2004-10-03 Segmentation Results • Evaluate: 60hr hand-marked boundaries different features & combinations Correct Accept % @ False Accept = 2%: Correct Accept 80.8% 81.1% 81.6% 84.0% 83.6% 73.6% 0.8 0.7 Sensitivity Feature µdB µH σH/µH µdB + σH/µH µdB + σH/µH + µH avg. mfcc 0.6 0.5 µdB µH !H/µH µdB + !H/µH µdB + µH + !H/µH 0.4 0.3 0.2 0 Segmenting Personal Audio - Ellis & Lee 0.005 0.01 0.015 0.02 0.025 1 - Specificity 0.03 2004-10-03 0.035 0.04 4. Segment clustering activity has lots of repetition: • Daily Automatically cluster similar segments ‘affinity’ of segments as KL2 distances 1 supermkt meeting karaoke barber lecture2 billiard break lecture1 car/taxi home bowling street restaurant library campus 0.5 cmp Segmenting Personal Audio - Ellis & Lee lib rst str ... 0 2004-10-03 Spectral Clustering • Eigenanalysis of affinity matrix: A = U•S•V’ SVD components: uk•skk•vk' Affinity Matrix 900 800 800 600 700 400 600 200 k=1 k=2 k=3 k=4 500 400 800 300 600 200 400 100 200 200 400 600 800 200 400 600 800 200 400 600 800 eigenvectors vk give cluster memberships • Number of clusters? Segmenting Personal Audio - Ellis & Lee 2004-10-03 Clustering Results of automatic segments gives • Clustering ‘anonymous classes’ BIC criterion to choose number of clusters make best correspondence to 16 GT clusters freq / Bark Hand-marked boundaries 20 15 10 5 21:00 22:00 23:00 0:00 1:00 2:00 3:00 4:00 Clock time • Frame-level scoring gives ~70% correct errors when same ‘place’ has multiple ambiences clusters formed by strong foregrounds (voices) Segmenting Personal Audio - Ellis & Lee 2004-10-03 5. Future Work • Visualization / browsing / diary inference link in other information sources Segmenting Personal Audio - Ellis & Lee 2004-10-03 Privacy • Recording conversations conflicts with expectations of privacy critical barrier to progress • Technical solutions to improve acceptance? Speaker/speech “search and destroy” scramble 100ms segs of speech (preserving longer-term statistics) high-confidence speaker ID to bypass Segmenting Personal Audio - Ellis & Lee 2004-10-03 Conclusions • “Personal Audio” is easy & cheap to collect but is it any use? • Boundaries quite easy to spot moving to a new location change in activity (talking <> reading) • Repeated activities can cluster together .. so user’s labels can propagate • Still gaining experience with the data speech is the most interesting part - .. but very hard to transcribe speaker ID, privacy, ... Segmenting Personal Audio - Ellis & Lee 2004-10-03