Minimal-Impact Audio-Based Personal Archives
Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,kslee}@ee.columbia.edu

1. "Personal Audio" Archives
2. Features
3. Segmentation
4. Clustering
5. Privacy
6. Future Work

Audio Personal Archives - Ellis & Lee 2004-10-15

1. Personal Audio
• Easy to record everything you hear
    < 2 GB / week @ 64 kbps
• Very hard to find anything
    how to scan?
    how to visualize?
    how to index?
• Need automatic analysis
• Need minimal impact

Applications
• Automatic appointment-book history
    fills in when & where of movements
• "Life statistics"
    how long did I spend in meetings this week vs. last
    most frequent conversations
    favorite phrases??
• Retrieving details
    what exactly did I promise?
    privacy issues...
• Nostalgia?

Data Set
• Starting point: Collect data
    62 hours recorded (8 days, ~7.5 hr/day)
    hand-mark 139 segments, 16 classes

    Label        total mins   total segs
    Library         981           27
    Campus          750           56
    Restaurant      560            5
    Bowling         244            2
    Lecture 1       234            4
    Car/Taxi        165            7
    Street          162           16

    minimal impact?

2. Features
• Long-duration recordings may benefit from longer basic time-frames
    60s rather than 10ms?
• Perceptually-motivated features
    broad spectrum + some detail?
• For diary application...
    background more important than foreground?
    smooth out uncharacteristic transients

Feature sets
[Figure: six panels, freq / bark vs. time / min (0 to 450): Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation, Average Spectral Entropy, Spectral Entropy Deviation; energy panels in dB, entropy panels in bits]
• Capture both average and variation
• Capture a little more detail in subbands...

Spectral Entropy
• Auditory spectrum:
    \[ A[n,j] = \sum_{k=0}^{N_F} w_{jk}\, X[n,k] \]
• Spectral entropy ≈ 'peakiness' of each band:
    \[ H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk}\, X[n,k]}{A[n,j]} \cdot \log \frac{w_{jk}\, X[n,k]}{A[n,j]} \]
[Figure: top panel, energy / dB vs. freq / Hz (0 to 8000): FFT spectral magnitude and Auditory Spectrum; bottom panel, rel. entropy / bits: per-band Spectral Entropies, bands from 30 Hz to 7380 Hz]

3.
BIC segmentation
• BIC (Bayesian Information Criterion): Compare more and less complex models
    \[ \log \frac{L(X_1;M_1)\, L(X_2;M_2)}{L(X;M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M) \]
• For segmentation:
    Grow context window from current boundary
    For each window, test every possible segmentation
    When BIC is positive, mark new segment
[Figure: time axis from 0 to N showing last segmentation point, candidate boundary and current context limit; L(X1;M1) and L(X2;M2) cover the two sides of the candidate boundary, L(X;M0) covers the whole window]

BIC Segmentation Example
[Figure: AvgLogAudSpec feature (2004-09-10-1023_AvgLEnergy) above BIC score, vs. time / hr (13:30 to 16:00); annotations: last seg point, current window limit, "boundary passes BIC", "no boundary found with shorter window"]
• No training or stored models

Segmentation Results
• Evaluate: 60hr hand-marked boundaries
    different features & combinations

    Correct Accept % @ False Accept = 2%:

    Feature               Correct Accept
    µdB                       80.8%
    µH                        81.1%
    σH/µH                     81.6%
    µdB + σH/µH               84.0%
    µdB + σH/µH + µH          83.6%
    avg. mfcc                 73.6%

[Figure: ROC curves, Sensitivity (0.2 to 0.8) vs. 1 - Specificity (0 to 0.04), for µdB, µH, σH/µH, µdB + σH/µH, µdB + µH + σH/µH]

4. Segment clustering
• Daily activity has lots of repetition:
    Automatically cluster similar segments
    'affinity' of segments as KL2 distances
[Figure: affinity matrix between labeled segments (scale 0 to 1); classes: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]

Spectral Clustering
• Eigenanalysis of affinity matrix: A = U•S•V'
    SVD components: uk•skk•vk'
    eigenvectors vk give cluster memberships
[Figure: Affinity Matrix and its SVD components for k=1, k=2, k=3, k=4]
• Number of clusters?
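
The clustering idea on the two slides above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: each segment is summarized by a single diagonal-covariance Gaussian, KL2 is taken to mean the symmetrized KL divergence, distances are turned into affinities with exp(-distance / scale), and each segment is assigned to the leading SVD component on which it has the largest weight. The function names, the `scale` parameter and the toy data are all hypothetical.

```python
import numpy as np


def kl2_diag_gauss(mu_a, var_a, mu_b, var_b):
    """Symmetrized KL divergence (KL2) between two diagonal Gaussians."""
    diff = mu_a - mu_b
    # KL(a||b) + KL(b||a); the log-determinant terms cancel in the sum.
    return 0.5 * np.sum(var_a / var_b + var_b / var_a - 2.0
                        + diff ** 2 * (1.0 / var_a + 1.0 / var_b))


def affinity_matrix(means, variances, scale=1.0):
    """Map pairwise KL2 distances to affinities in (0, 1] (assumed mapping)."""
    n = len(means)
    aff = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            dist = kl2_diag_gauss(means[i], variances[i],
                                  means[j], variances[j])
            aff[i, j] = np.exp(-dist / scale)
    return aff


def svd_memberships(aff, n_clusters):
    """Assign each segment to the leading SVD component where it weighs most."""
    _, _, vt = np.linalg.svd(aff)
    return np.argmax(np.abs(vt[:n_clusters]), axis=0)


# Toy data (made up): five segments near one ambience, three near another.
means = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1],
                  [10.0, 10.0], [10.1, 10.0], [10.0, 9.9]])
variances = np.ones_like(means)

labels = svd_memberships(affinity_matrix(means, variances), n_clusters=2)
print(labels)
```

The largest-weight rule is only reliable when clusters are well separated, as in the toy data; with overlapping clusters the leading components mix, and choosing the number of clusters remains the open question raised on the slide.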
Clustering Results
• Clustering of automatic segments gives 'anonymous classes'
    BIC criterion to choose number of clusters
    make best correspondence to 16 GT clusters
• Frame-level scoring gives ~70% correct
    errors when same 'place' has multiple ambiences

5. Privacy
• Recording conversations conflicts with expectations of privacy
    critical barrier to progress
• Technical solutions to improve acceptance?
    Speaker/speech "search and destroy"
    scramble 100ms segs of speech (preserving longer-term statistics)
    high-confidence speaker ID to bypass

Speech Scrambling
• Permute 200 ms segments within 1 s blocks
    removes intelligibility
    preserves local structure
    segment features almost unchanged
[Figure: two spectrograms, freq / kHz (0 to 4) vs. time / s (0 to 14), level / dB: Original (dan+kean-ex.wav) and Scrambled (200ms wins over 1s)]

6. Future Work
• Visualization / browsing / diary inference
    link in other information sources
    - diary
    - email
• What is it good for?
    NoteTaker interface

Conclusions
• "Personal Audio" is easy & cheap to collect
    but is it any use?
• Boundaries quite easy to spot
    e.g. moving to a new location
• Repeated activities can cluster together
    .. so user's labels can propagate
• Still gaining experience with the data
    speech, speaker ID, privacy, ...
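
The scrambling step from the Speech Scrambling slide (permute 200 ms segments within 1 s blocks) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `scramble`, its parameters and the choice to leave a trailing partial block untouched are assumptions, and a real system would presumably apply it only to regions detected as speech.

```python
import numpy as np


def scramble(signal, sample_rate, seg_dur=0.2, block_dur=1.0, seed=None):
    """Randomly reorder seg_dur-long segments inside each block_dur-long block.

    Samples inside a segment keep their order, so short-term structure
    survives while the word sequence is broken up.
    """
    rng = np.random.default_rng(seed)
    seg_len = int(round(seg_dur * sample_rate))
    segs_per_block = int(round(block_dur / seg_dur))
    block_len = seg_len * segs_per_block

    out = np.array(signal, copy=True)
    # Only full blocks are scrambled; a trailing partial block is left as is.
    for start in range(0, len(out) - block_len + 1, block_len):
        block = out[start:start + block_len].reshape(segs_per_block, seg_len)
        order = rng.permutation(segs_per_block)
        out[start:start + block_len] = block[order].reshape(-1)
    return out


# Example on a synthetic ramp (stand-in for audio sampled at 8 kHz).
sr = 8000
audio = np.arange(3 * sr + 100, dtype=float)
scrambled = scramble(audio, sr, seed=0)
```

Because every block keeps exactly the same set of segments, statistics computed over 1 s or longer (such as the 60 s features discussed earlier) change very little, which matches the slide's claim that segment features are almost unchanged.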