Noise Robust Pitch Tracking by Subband Autocorrelation Classification (SAcC)

Byung Suk Lee & Dan Ellis • Columbia University / ICSI • {bsl,dpwe}@ee.columbia.edu
Laboratory for the Recognition and Organization of Speech and Audio (LabROSA)
Summary: A neural net classifier is trained to identify the pitch of a frame of subband autocorrelation principal components. Accuracy is greatly improved for noisy, bandlimited speech that matches the training data.
[Figure: pitch “scores” from YIN, Wu, and SAcC (log MLP output) for one utterance; Freq / Hz vs. time / s, color scale 0–1]

Background
• “Classic” approaches to pitch tracking reveal time-domain waveform periodicity via autocorrelation or related approaches.
• An example is YIN [de Cheveigné & Kawahara 2002]:
Sound → Autocorrelation-based difference function → Cumulative normalization and threshold → Local peak selection & interpolation → Pitch tracking using HMM* (* not in standard YIN)
Subband
• We are specifically interested in pitch tracking for low-quality radio reception data (e.g., hand-held narrow-FM) from the RATS project. This data has both high noise/distortion and narrow bandwidth.
• Added interference hurts the autocorrelation, but filtering into subbands can reveal certain bands with better local SNR. Also, the contribution of each subband can be adjusted in later fusion to select sources.
• 48 4-pole, 2-zero ERB-scale cochlea filter approximations form auditory-like subbands (a hedged sketch follows this list).
• An example is [Wu, Wang & Brown 2003] (“the Wu algorithm”):
Sound → Auditory Filterbank → Normalized Autocorrelation → Channel/Peak Selection → Cross-channel Integration → Pitch Tracking using HMM
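As a rough illustration of this front end, here is a minimal Python sketch of an ERB-spaced filterbank. It stands in for, and does not reproduce, the poster's 4-pole, 2-zero cochlea approximations: each band is a second-order Butterworth bandpass one ERB wide, and the 100 Hz–6 kHz span and all function names are assumptions.

import numpy as np
from scipy.signal import butter, lfilter

def erb_space(fmin, fmax, n):
    """n center frequencies equally spaced on the ERB-rate scale
    (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)     # Hz -> ERB-rate
    inv = lambda x: (10.0 ** (x / 21.4) - 1.0) / 0.00437   # ERB-rate -> Hz
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def subband_filterbank(x, sr=16000, n_bands=48, fmin=100.0, fmax=6000.0):
    """Split signal x into auditory-like subbands.
    NOTE: Butterworth stand-in for SAcC's 4-pole, 2-zero cochlea filters."""
    cfs = erb_space(fmin, fmax, n_bands)
    bands = []
    for cf in cfs:
        bw = 24.7 * (4.37 * cf / 1000.0 + 1.0)             # ERB bandwidth, Hz
        lo = max(cf - bw / 2.0, 1.0) / (sr / 2.0)
        hi = min(cf + bw / 2.0, sr / 2.0 - 1.0) / (sr / 2.0)
        b, a = butter(2, [lo, hi], btype="bandpass")
        bands.append(lfilter(b, a, x))
    return np.stack(bands), cfs                            # (n_bands, n_samples)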
Autocorrelation
• Normalized autocorrelation out to 25 ms (400 samples @ 16 kHz) reveals periodic structure in each subband (a sketch follows this section).
• The autocorrelation in each subband is highly constrained by the bandlimited input.
• Pitch information is distributed throughout each autocorrelation row; simple peak-picking ignores most of this.
• This work began as an investigation into the peak selection and integration stages of the Wu algorithm. We found that replacing both stages with a trained classifier offered large performance improvements, as well as the chance to train domain-specific pitch trackers.
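A minimal sketch of the short-time normalized autocorrelation for one subband. The 10 ms hop and 400-lag (25 ms) range come from the poster; the 25 ms correlation window, the normalization by the energies of both segments, and the epsilon guard are assumptions.

import numpy as np

def normalized_autocorr(band, max_lag=400, hop=160, win=400):
    """Short-time normalized autocorrelation of one subband signal
    (16 kHz assumed: hop=160 is 10 ms, max_lag=400 is 25 ms)."""
    n_frames = (len(band) - win - max_lag) // hop + 1
    ac = np.zeros((n_frames, max_lag))
    for t in range(n_frames):
        seg = band[t * hop : t * hop + win + max_lag]
        x0 = seg[:win]
        e0 = np.sqrt(np.dot(x0, x0)) + 1e-12
        for lag in range(max_lag):
            xl = seg[lag : lag + win]
            el = np.sqrt(np.dot(xl, xl)) + 1e-12
            ac[t, lag] = np.dot(x0, xl) / (e0 * el)   # in [-1, 1], 1 at lag 0
    return ac                                         # (n_frames, 400)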
Principal Component Analysis
• The autocorrelations in a given subband always reflect the center frequency, i.e., they occupy a small part of the 400-D space.
• We use per-subband PCA to reduce each band to 10 coefficients, which preserves almost all the variance.
• PCA bases remain stable when learned from different signal conditions (e.g., clean speech vs. corrupted material), so they are fixed.
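A sketch of the per-subband dimensionality reduction, assuming a plain SVD-based PCA fit on pooled training frames; the function names are hypothetical.

import numpy as np

def fit_subband_pca(ac_frames, k=10):
    """Learn a fixed PCA basis for one subband from training data.
    ac_frames: (n_frames, 400) autocorrelation rows."""
    mu = ac_frames.mean(axis=0)
    # rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(ac_frames - mu, full_matrices=False)
    return mu, vt[:k].T                     # basis: (400, k)

def apply_subband_pca(ac_frames, mu, basis):
    """Project 400-D autocorrelation rows down to k coefficients."""
    return (ac_frames - mu) @ basis         # (n_frames, k)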
Classifier
• The core of our system is a trained MLP classifier:
Sound → Auditory Filterbank → Normalized Autocorrelation → Principal Component Analysis → Classifier → Pitch tracking using HMM
• It takes k (10) PCA coefficients for each of s (48) subband autocorrelations and estimates posteriors over p (67) quantized pitch values.
• The MLP is trained discriminatively on noisy data (with ground-truth pitches).
• Discriminative training virtually eliminates the octave/sub-octave errors seen in peak-based pitch tracking.
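A sketch of the classifier's forward pass only. The 480-D input (48 x 10) and 67 posterior outputs follow the poster; the single tanh hidden layer, the weight shapes, and the function name are assumptions (the poster does not specify these details), and training is omitted.

import numpy as np

def mlp_posteriors(feats, w1, b1, w2, b2):
    """One-hidden-layer MLP: (n_frames, 480) features ->
    (n_frames, 67) posteriors over 66 pitch bins + 'no pitch'.
    w1: (480, n_hidden), w2: (n_hidden, 67)."""
    h = np.tanh(feats @ w1 + b1)            # hidden activations
    z = h @ w2 + b2                         # output logits
    z -= z.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)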
Hidden Markov Model Smoothing
• The MLP generates posterior probabilities across 66 pitch candidates + “no pitch” for every 10 ms time frame.
• These become a single pitch track via Viterbi decoding through an HMM with pitch states.
• Transition probabilities are set parametrically (pitch-invariant) and tuned empirically.
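A sketch of the smoothing step. The Viterbi recursion is standard; the transition model below (Gaussian fall-off with pitch-bin distance, fixed voiced/unvoiced switch probabilities) is just one way to realize "parametric, pitch-invariant" transitions, with all widths and probabilities as placeholder values to be tuned empirically.

import numpy as np

def pitch_invariant_trans(n_pitch=66, sigma=2.0, p_enter_uv=0.1, p_stay_uv=0.9):
    """(67, 67) transitions: pitch-to-pitch probability depends only on
    bin distance; state n_pitch is 'no pitch'. All parameters are guesses."""
    s = n_pitch + 1
    t = np.zeros((s, s))
    bins = np.arange(n_pitch)
    for i in range(n_pitch):
        row = np.exp(-0.5 * ((bins - i) / sigma) ** 2)
        t[i, :n_pitch] = (1.0 - p_enter_uv) * row / row.sum()
        t[i, n_pitch] = p_enter_uv
    t[n_pitch, :n_pitch] = (1.0 - p_stay_uv) / n_pitch
    t[n_pitch, n_pitch] = p_stay_uv
    return t

def viterbi_pitch_track(post, trans):
    """Decode MLP posteriors (n_frames, 67) into one best state path."""
    n, s = post.shape
    logp, logt = np.log(post + 1e-12), np.log(trans + 1e-12)
    score = np.zeros((n, s))
    back = np.zeros((n, s), dtype=int)
    score[0] = logp[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + logt   # cand[i, j]: from state i to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logp[t]
    path = np.zeros(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path                               # index 66 = unvoiced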
Training Data
• The pitch classifier needs training data with ground truth. Performance improves when the training data is more similar to the test data.
• We used pitch-annotated data (Keele), then artificially filtered it and added noise to resemble the target domain.
• We also generated pseudo-ground-truth for target-domain data by using only frames where three independent pitch trackers (YIN, Wu, SWIPE’) agreed; frames without agreement are rejected (see the sketch below).
[Figure: pseudo-ground-truth labeling by tracker agreement: agreed frames are labeled, others rejected]
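A toy version of the agreement rule for pseudo-ground-truth. The 5% relative-pitch tolerance, the 0-means-unvoiced convention, and NaN for rejected frames are all assumptions; the poster only states that frames are kept when the three trackers agree.

import numpy as np

def agreement_labels(f0_a, f0_b, f0_c, tol=0.05):
    """Per-frame pseudo-ground-truth from three trackers (0 = unvoiced).
    Keep a frame only when all three agree; otherwise reject (NaN)."""
    f0 = np.stack([f0_a, f0_b, f0_c])          # (3, n_frames)
    labels = np.full(f0.shape[1], np.nan)      # NaN = rejected
    labels[(f0 == 0).all(axis=0)] = 0.0        # all agree: unvoiced
    med = np.median(f0, axis=0)
    agree_v = (f0 > 0).all(axis=0) & (np.abs(f0 - med) <= tol * med).all(axis=0)
    labels[agree_v] = med[agree_v]             # all agree: voiced, use median
    return labels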
Evaluation
• GPE, the most common pitch tracking metric, only considers frames where both ground truth and system report a pitch value. This rewards “punting” on difficult frames.
• We instead define VE as the error rate over all true-voiced frames and UE over all true-unvoiced frames, and evaluate by PTE, the mean of VE and UE:
GPE = Evv / Nvv
VE = (Evv + Evu) / Nv
UE = Euv / Nu
PTE = (VE + UE) / 2
where Nv and Nu count true-voiced and true-unvoiced frames, Nvv counts frames voiced in both ground truth and output, Evv counts gross pitch errors among those, Evu counts true-voiced frames reported unvoiced, and Euv counts true-unvoiced frames reported voiced.
• We tested on the pitch-annotated FDA data with radio-band filtering (RBF) and pink noise added at a range of SNRs.
• SAcC trained on Keele data (with similar corruption) substantially outperformed other pitch trackers.
• Later experiments show that SAcC can generalize to mismatched train/test scenarios.
[Figure: GPE, VE, UE, and PTE (%) vs. SNR (dB) on FDA with RBF and pink noise, comparing YIN, Wu, get_f0, and SAcC]
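The four metrics as code, straight from the definitions above; the 20% relative deviation used as the gross-error threshold is the conventional choice and is not stated on the poster.

import numpy as np

def pitch_metrics(ref, est, gross_tol=0.20):
    """GPE, VE, UE, PTE from per-frame pitch arrays (0 = unvoiced)."""
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    v_ref, v_est = ref > 0, est > 0
    n_v, n_u = v_ref.sum(), (~v_ref).sum()
    both = v_ref & v_est                    # N_vv frames
    e_vv = (np.abs(est[both] - ref[both]) > gross_tol * ref[both]).sum()
    e_vu = (v_ref & ~v_est).sum()           # voiced reported as unvoiced
    e_uv = (~v_ref & v_est).sum()           # unvoiced reported as voiced
    gpe = e_vv / max(both.sum(), 1)
    ve = (e_vv + e_vu) / max(n_v, 1)
    ue = e_uv / max(n_u, 1)
    return {"GPE": gpe, "VE": ve, "UE": ue, "PTE": 0.5 * (ve + ue)}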
References
A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111(4):1917–1930, April 2002.
M. Wu, D.L. Wang, and G.J. Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Tr. Speech and Audio Proc., 11(3):229–241, May 2003.

MATLAB code available: http://labrosa.ee.columbia.edu/projects/SAcC/
Poster for Interspeech 2012 • 2012-09-11 • dpwe@ee.columbia.edu