Noise Robust Pitch Tracking by Subband Autocorrelation Classification (SAcC)
Byung Suk Lee & Dan Ellis • Columbia University / ICSI • {bsl,dpwe}@ee.columbia.edu
Columbia University • Laboratory for the Recognition and Organization of Speech and Audio

Summary: A neural net classifier is trained to identify the pitch of a frame of subband autocorrelation principal components. Accuracy is greatly improved for noisy, bandlimited speech matched to the training data.

SAcC pipeline: Sound → Auditory Filterbank → Normalized Autocorrelation → Principal Component Analysis → Classifier (MLP) → HMM Smoothing → Pitch

Background
• "Classic" approaches to pitch tracking reveal time-domain waveform periodicity via autocorrelation or related techniques.
• An example is YIN [de Cheveigné & Kawahara 2002]:
  Sound → Autocorrelation-based difference function → Cumulative normalization and threshold → Local peak selection & interpolation → Pitch tracking using HMM* (* not in standard YIN)
• Added interference hurts the autocorrelation, but filtering into subbands can reveal certain bands with better local SNR. The contribution of each subband can also be adjusted in later fusion to select sources.
• An example is [Wu, Wang & Brown 2003] ("the Wu algorithm"):
  Sound → Auditory Filterbank → Normalized Autocorrelation → Channel/Peak Selection → Cross-channel Integration → Pitch Tracking using HMM
• We are specifically interested in pitch tracking for low-quality radio reception data, e.g. hand-held narrow-FM (RATS project). This data has both high noise/distortion and narrow bandwidth.

[Figure: Pitch "scores" (pitch index vs. time) from YIN (1 - YIN similarity), Wu (Wu likelihood), and SAcC (log MLP output).]

Auditory Filterbank
• 48 four-pole, two-zero ERB-scale cochlea filter approximations form auditory-like subbands.

Autocorrelation
• Normalized autocorrelation out to 25 ms (400 samples @ 16 kHz) reveals periodic structure in each subband (see the front-end sketch below).
• The autocorrelation in each subband is highly constrained by the bandlimited input.
• Pitch information is distributed throughout each autocorrelation row; simple peak-picking ignores most of it.
• This work began as an investigation into the peak selection and integration stages of the Wu algorithm. We found that replacing both stages with a trained classifier offered large performance improvements, as well as the chance to train domain-specific pitch trackers.

Principal Component Analysis
• The autocorrelations in a given subband always reflect its center frequency, i.e., they occupy a small part of the 400-dimensional lag space.
• We use per-subband PCA to reduce each band to 10 coefficients, which preserves almost all the variance (see the PCA sketch below).
• PCA bases remain stable when learned from different signal conditions (e.g. clean speech material), so they are fixed.

Classifier
• The core of our system is a trained multi-layer perceptron (MLP) classifier (see the classifier sketch below).
• It takes k (= 10) PCA coefficients for each of s (= 48) subband autocorrelations and estimates posteriors over p (= 67) quantized pitch values.
• The MLP is trained discriminatively on noisy data (with ground-truth pitches).
• Discriminative training virtually eliminates the octave/suboctave errors seen in peak-based pitch tracking.

Hidden Markov Model Smoothing
• The MLP generates posterior probabilities across 66 pitch candidates + "no pitch" for every 10 ms time frame.
• These become a single pitch track via Viterbi decoding through an HMM with pitch states (see the Viterbi sketch below).
• Transition probabilities are set parametrically (pitch-invariant) and tuned empirically.

Training Data
• The pitch classifier needs training data with ground truth. Performance improves when the training data is more similar to the test data.
• We used pitch-annotated data (Keele), then artificially filtered it and added noise to resemble the target domain.
• We also generated pseudo-ground-truth for target-domain data by using only frames where three independent pitch trackers (YIN, Wu, SWIPE') agreed.

[Figure: Radio-band filter (RBF) gain/dB vs. frequency/kHz; example spectrogram and subband autocorrelation (lags 1-400) of corrupted speech over time.]
[Figure: Agreement counts for voiced/unvoiced labels (agV, agU, dVV, dUV) vs. 1 - YIN similarity, used to select pseudo-ground-truth frames.]

Evaluation
• GPE, the most common pitch tracking metric, only considers frames where both ground truth and system report a pitch value. This rewards "punting" on difficult frames.
• We instead score every frame: VE is the error rate over all true-voiced frames, and UE the error rate over all true-unvoiced frames. We evaluate by PTE, the mean of VE and UE (see the metrics sketch below).
• With Nv true-voiced and Nu true-unvoiced frames, where Nvv voiced frames are reported voiced, Evv of those have a gross pitch error, Evu voiced frames are reported unvoiced, and Euv unvoiced frames are reported voiced:
  GPE = Evv / Nvv
  VE  = (Evv + Evu) / Nv
  UE  = Euv / Nu
  PTE = (VE + UE) / 2
• We tested on the pitch-annotated FDA data with radio-band filtering and pink noise added at a range of SNRs.
• SAcC trained on Keele data (with similar corruption) substantially outperformed other pitch trackers.
• Later experiments show that SAcC can generalize to mismatched train/test scenarios.

[Figure: GPE, VE, UE, and PTE (%, log scale) vs. SNR (dB) on FDA with RBF and pink noise, comparing YIN, Wu, get_f0, SWIPE', and SAcC.]
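Front-end sketch. A minimal Python illustration of the subband autocorrelation front end described above. The 48 bands, 25 ms / 400-lag autocorrelation at 16 kHz, and 10 ms frame rate come from the poster; the second-order Butterworth bandpass filters (standing in for the 4-pole, 2-zero cochlea approximations), the ERB-rate spacing formulas, the 100 Hz lowest center frequency, and the 25 ms window length are assumptions.

    import numpy as np
    from scipy.signal import butter, lfilter

    SR = 16000       # sample rate, Hz
    N_BANDS = 48     # subband count, as in the poster
    N_LAGS = 400     # autocorrelation out to 25 ms @ 16 kHz
    FRAME_LEN = 400  # 25 ms analysis window (window length assumed)
    FRAME_HOP = 160  # 10 ms hop, matching the MLP frame rate

    def erb_space(fmin, fmax, n):
        """Center frequencies equally spaced on the ERB-rate scale."""
        erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
        inv = lambda e: (10.0 ** (e / 21.4) - 1) / 0.00437
        return inv(np.linspace(erb(fmin), erb(fmax), n))

    def subband(x, fc):
        """Stand-in bandpass for one cochlear channel (the actual
        system uses 4-pole, 2-zero ERB-scale cochlea filters)."""
        bw = 24.7 * (4.37 * fc / 1000 + 1)        # ERB bandwidth at fc
        lo = max(fc - bw, 10.0) / (SR / 2)
        hi = min(fc + bw, SR / 2 - 10.0) / (SR / 2)
        b, a = butter(2, [lo, hi], btype="band")
        return lfilter(b, a, x)

    def norm_autocorr(frame, n_lags):
        """Autocorrelation of one frame, normalized so lag 0 equals 1."""
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        return ac[:n_lags] / (ac[0] + 1e-12)

    def subband_autocorr(x):
        """Return an (n_frames, N_BANDS, N_LAGS) feature array for x."""
        fcs = erb_space(100.0, 0.9 * SR / 2, N_BANDS)
        bands = [subband(x, fc) for fc in fcs]
        starts = range(0, len(x) - FRAME_LEN + 1, FRAME_HOP)
        return np.array([[norm_autocorr(band[s:s + FRAME_LEN], N_LAGS)
                          for band in bands] for s in starts])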
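PCA sketch. The per-subband dimensionality reduction described above: each band's 400-lag autocorrelation is projected onto 10 fixed principal components. Learning the bases once (the poster uses clean speech material) matches the text; the SVD route and mean-centering are implementation assumptions.

    import numpy as np

    K = 10  # PCA coefficients kept per subband, as in the poster

    def fit_subband_pca(feats):
        """feats: (n_frames, n_bands, n_lags) training autocorrelations.
        Returns one (mean, basis) pair per band; basis is (n_lags, K)."""
        models = []
        for b in range(feats.shape[1]):
            X = feats[:, b, :]                 # (n_frames, n_lags)
            mu = X.mean(axis=0)
            # rows of Vt are the principal directions of the centered data
            _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
            models.append((mu, Vt[:K].T))      # keep the top-K directions
        return models

    def apply_subband_pca(feats, models):
        """Reduce (n_frames, n_bands, 400) to (n_frames, n_bands * K)."""
        n_frames, n_bands, _ = feats.shape
        out = np.zeros((n_frames, n_bands, K))
        for b, (mu, W) in enumerate(models):
            out[:, b, :] = (feats[:, b, :] - mu) @ W
        return out.reshape(n_frames, -1)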
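Classifier sketch. The input/output shape of the MLP described above: 48 x 10 = 480 PCA coefficients in, posteriors over 67 classes (66 pitch bins + "no pitch") out. The hidden layer size, the sklearn implementation, and the random placeholder data are assumptions; the real system is trained discriminatively on PCA features of noisy speech with ground-truth pitch labels.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    N_IN = 48 * 10   # s = 48 subbands x k = 10 PCA coefficients per frame
    N_OUT = 67       # 66 quantized pitch candidates + "no pitch"

    # Placeholder training data; in the real system these are PCA
    # features of noisy speech and quantized ground-truth pitch labels.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(2000, N_IN))
    y_train = rng.integers(0, N_OUT, size=2000)

    # Hidden layer size is illustrative; the poster does not specify it.
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=50)
    mlp.fit(X_train, y_train)                 # discriminative training

    post = mlp.predict_proba(X_train[:5])     # per-frame posteriors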
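Viterbi sketch. The HMM smoothing step described above: per-frame MLP posteriors are decoded into a single pitch track. The pitch-invariant transition structure follows the poster; the Laplacian-on-bin-distance parametric form, the constants, and the use of posteriors directly as emission scores are assumptions.

    import numpy as np

    def viterbi_pitch_track(post, sigma=2.0, p_switch=0.01):
        """post: (n_frames, n_states) posteriors; last state = "no pitch".
        Transitions depend only on pitch-bin distance (pitch-invariant),
        plus a flat probability of entering/leaving the no-pitch state."""
        n_frames, n_states = post.shape
        idx = np.arange(n_states - 1)
        A = np.full((n_states, n_states), p_switch / n_states)
        A[:-1, :-1] += np.exp(-np.abs(idx[:, None] - idx[None, :]) / sigma)
        A /= A.sum(axis=1, keepdims=True)      # rows are distributions
        logA = np.log(A)
        logB = np.log(post + 1e-12)            # emission scores

        # standard Viterbi recursion with backpointers
        delta = logB[0].copy()
        psi = np.zeros((n_frames, n_states), dtype=int)
        for t in range(1, n_frames):
            scores = delta[:, None] + logA
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + logB[t]
        path = np.zeros(n_frames, dtype=int)
        path[-1] = delta.argmax()
        for t in range(n_frames - 1, 0, -1):
            path[t - 1] = psi[t, path[t]]
        return path   # pitch-bin index per frame (last bin = unvoiced)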
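Metrics sketch. The evaluation measures defined above, computed from per-frame reference and estimated F0 (0 Hz = unvoiced). The 20% gross-error tolerance is the common convention for GPE, assumed here rather than stated on the poster.

    import numpy as np

    def pitch_metrics(ref_f0, est_f0, tol=0.2):
        """ref_f0, est_f0: per-frame F0 in Hz, with 0 meaning unvoiced."""
        ref_v = ref_f0 > 0
        est_v = est_f0 > 0
        Nv = ref_v.sum()                      # true-voiced frames
        Nu = (~ref_v).sum()                   # true-unvoiced frames
        Nvv = (ref_v & est_v).sum()           # both report voiced
        gross = np.abs(est_f0 - ref_f0) > tol * ref_f0
        Evv = (ref_v & est_v & gross).sum()   # voiced, but wrong pitch
        Evu = (ref_v & ~est_v).sum()          # voiced reported unvoiced
        Euv = (~ref_v & est_v).sum()          # unvoiced reported voiced
        GPE = Evv / max(Nvv, 1)               # guards against empty sets
        VE = (Evv + Evu) / max(Nv, 1)
        UE = Euv / max(Nu, 1)
        return dict(GPE=GPE, VE=VE, UE=UE, PTE=(VE + UE) / 2)

Note how GPE's denominator (Nvv) shrinks when a tracker declines to report pitch on hard frames, while VE, UE, and PTE are normalized by the fixed ground-truth frame counts, which is why PTE penalizes "punting".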
MATLAB code available: http://labrosa.ee.columbia.edu/projects/SAcC/

References
A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Am., 111(4):1917-1930, April 2002.
M. Wu, D.L. Wang, and G.J. Brown, "A multipitch tracking algorithm for noisy speech," IEEE Tr. Speech and Audio Proc., 11(3):229-241, May 2003.

for Interspeech 2012 • 2012-09-11 • dpwe@ee.columbia.edu