Enhancement of Very Noisy Speech Signals Dan Ellis & Zhuo Chen Team Swordfish / ICSI / Columbia ! dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. Speech Enhancement 2. Flat-Pitch Enhancement 3. Results & Future COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 1 /14 1. Speech Enhancement • Noisy speech is a challenge: freq / kHz • Surprise channel in surprise language 4 3 2 1 0 0 IARPA_BABEL_OP1_204_73990_20130521_162632_inLine ! 20 ! 0 -20 ! ! -40 5 10 15 20 25 time / sec level / dB • How to distinguish speech and interference? • Energy peaks are speech (spectral subtraction) • Energy troughs are noise (Wiener, log-mmse) • Speech has a known form (Factorial HMM) • Voiced speech is periodic (Pitch-based) Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 2 /14 Noisy Channel Detection • MIC channels are in a different formant • Even after resampling, noisy channels are very distinct: _20130521_162632_inLine OP1_204 - M 20 SNR vad / dB 20 0 -20 0 -40 0 20 20 40 NIST stnr / dB 60 80 25 time / sec Enhancing Very Noisy Speech - Dan Ellis level / dB N=6 70 60 50 40 30 10 20 40 60 80 DynRange 125−250 Hz / dB OP1_206 80 100 50 0 all MIC 20 OP1_204 - HOM level / dB 10 _20130521_162632_outLine −10 0 DynRange 250−500 Hz / dB 50 N=59 579 1177 2137 3882 freq / Hz DynRange 250−500 Hz / dB OP1_204_DEV: STNR & SNRvad 30 level / dB 100 OP1_204 80 70 60 50 40 30 20 10 20 40 60 80 DynRange 125−250 Hz / dB 2014-07-09 - 3 /12 KWS on Noisy Signals • WER is very poor • only nonzero thanks to common word ஆஹ் (“ah”) Enhancing Very Noisy Speech - Dan Ellis OP1_204_13189__inLine OP1_204_61440__inLine OP1_204_73990__inLine OP1_204_78161__inLine OP1_204_90937__inLine OP1_204_91808__inLine Corr | 5.1 | 7.9 | 5.4 | 15.1 | 3.0 | 2.4 Sub 5.1 4.8 2.6 54.4 3.0 10.0 Del 89.8 87.4 92.0 30.5 93.9 87.5 Ins 1.1 0.0 1.2 6.0 0.0 1.2 Err 96.0 92.1 95.9 90.8 97.0 98.8 2014-07-09 - 4 /12 Spectral-Based Enhancement • Classic enhancement boosts loudest parts • evaluated over 6 MIC utterances of OP1_204_DEV Corr CSub S DelD I Ins 10.610.6 82.6 82.6 Original 6.8 6.8 0.7 Wiener 8.8 17.517.5 73.7 8.8 73.7 1.5 1.5 log-MMSE 9.9 9.9 59.0 31.131.1 59.0 1.4 1.4 E.hist norrm 8.6 8.6 68.4 23.123.1 68.4 0.6 0.6 Enhancing Very Noisy Speech - Dan Ellis 0.7 2014-07-09 - 5 /14 RPCA Enhancement • Decompose spectrogram into sparse + low-rank • Sparse activation H of Chen, McFee & Ellis ’14 dictionary W min ! H,L,S H kHk1 L kLk⇤ + ! + S kSk1 + I+! (H) s.t. Y = !W H + L + S • ASR benefits: Corr S D I Ins 0.7 10.831.1 36.5 59.0 52.7 0.5 1.4 wie+RPCA 40.1 15.5 10.446.1 49.5 38.4 2.1 1.6 Orig log MMSE L+WH C Del 10.6 82.6 6.8 10.6 82.6 Original 6.8 Sub RPCA 9.9 Enhancing Very Noisy Speech - Dan Ellis 0.7 2014-07-09 - 6 /14 2. Pitch-Based Enhancement Denbigh & Zhao ’92 • Energy concentrated in harmonics • Given pitch, freq / kHz • Voiced Speech has near-periodic waveform 2 0 freq / kHz • Problems 3 1 keep only those harmonics? • time-varying filter • sinusoidal model 4 4 3 2 1 0 • pitch errors • filtering artefacts • unvoiced speech, graceful degradation Enhancing Very Noisy Speech - Dan Ellis 0 1 2 3 4 5 6 7 time / s 8 after Wang ’95 2014-07-09 - 7 /14 Pitch Estimation in Noise ! • e.g. finding peaks in autocorrelation ! ! ! 3 2 0 ! 6 4 2 0 lag / ms • not robust to noise 4 1 lag / ms periodic structure freq / kHz • Conventional pitch trackers are based on 6 4 2 0 0.5 1 1.5 2 2.5 3 3.5 time / s • Classifier-based approach • don’t predefine nature of pitch • let a classifier learn from examples Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 8 /14 Classification-based Pitch Tracker • Subband Autocorrelation Classification Lee & Ellis ’12 (SAcC) Pitch Tracker: • Trained on noisy speech with true pitch targets ! la de ! yl short-time autocorrelation ine Neural network classifier ck ! ! ! ! ! ! c1 BN frequency channels Sound Cochlea filterbank uv 60.0 61.8 63.6 65.4 Pitch Viterbi smoother B3 B2 B1 freq lag 392.1 403.6 Correlogram slice time • Subband autocorrelation features • PCA to reduce dimensions Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 9 /14 SAcC Results • Excellent in-domain • at low SNRs • errors dominated by V/UV Wu get_f0 10 SAcC 10 −10 ! • Generalization 0 10 20 SNR (dB) 30 Mean PTE of Wu and SAcC on RATS 70 Wu Keele nCH All 60 is good 50 40 PTE • between different RATS channels YIN PTE (%) results ! FDA − RBF and pink noise 100 30 20 10 0 Enhancing Very Noisy Speech - Dan Ellis A B C D E RATS Channel F G H 2014-07-09 - 10/14 Flat-Pitch Processing • Time-varying filtering is tricky • if pitch variation and filter impulse response are on a similar time-scale noisy Flatten the pitch • use local pitch estimate to resample • process constant-pitch • resampling is (near) invertible Noisy signal 1000 SAcC pitch tracker freq / Hz • Solution: speech 500 0 pitch-flatten time map pr(vx) pitch time-varying resampling Resampled to pitch = 200 Hz flat-pitch-domain enhancement inverse time map Filtered − comb time-varying resampling Resampled back to original pitch and mixed with original enhanced speech 0 Enhancing Very Noisy Speech - Dan Ellis 0.5 1 1.5 time / s 2014-07-09 - 11/14 Flat-Pitch Processing • How to ! ! 1000 freq / Hz enhance flat pitch? Noisy signal 500 0 Filtered − wiener • Wiener filtering ! ! Filtered − comb • Comb filtering ! ! • “Phase Vocoder” emphasis Enhancing Very Noisy Speech - Dan Ellis Filtered − pvsmooth 0 5 10 15 20 2014-07-09 - 12/14 Flat-Pitch Results • Over 6 MIC utterances of OP1_204_MIC Corr Sub Del Ins Original 6.8 10.6 82.6 0.7 log MMSE 9.9 31.1 59.0 1.4 L+WH 15.5 46.1 38.4 1.6 flat_pitch-comb 10.0 25.2 64.9 0.7 MMSE+f_p-comb 13.6 42.5 43.9 2.8 f_p-comb+L+WH 15.0 52.4 32.6 3.1 Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 13/14 Summary • Noisy Speech • single distant mic in real-world environments ! • Enhancement • boosting spectrogram energy that appears to be speech • low-rank + sparse dictionary exploits knowledge ! • Flat-Pitch enhancement • trained noise-robust pitch classifier • dynamic resampling to flatten pitch for enhancement Enhancing Very Noisy Speech - Dan Ellis 2014-07-09 - 14/14