ESTIMATING TRACT VARIABLES
FROM ACOUSTICS VIA
MACHINE LEARNING
CHRISTIANA SABETT
APPLIED MATH, APPLIED STATISTICS, AND SCIENTIFIC COMPUTING (AMSC)
OCTOBER 7, 2014
ADVISOR: DR. CAROL ESPY-WILSON
ELECTRICAL AND COMPUTER ENGINEERING
INTRODUCTION
Automatic speech recognition (ASR) systems remain inadequate in their current forms.
• Coarticulation: the overlap of articulatory actions in the vocal tract, a major source of acoustic variability
TRACT VARIABLES [a]
• Articulatory information: information from the organs along the vocal tract
• Tract variables (TVs): vocal tract constriction variables that describe physical trajectories in time
  • Lip Aperture (LA)
  • Lip Protrusion (LP)
  • Tongue tip constriction degree (TTCD)
  • Tongue tip constriction location (TTCL)
  • Tongue body constriction degree (TBCD)
  • Tongue body constriction location (TBCL)
  • Velum (VEL)
  • Glottis (GLO)
a. Mitra et al., 2010.
TRACT VARIABLES
[Figure: spectrogram (0-8000 Hz) and TV trajectories (LA, TT, TB) over 0-0.8 s for the clearly articulated phrase "perfect memory", with phone labels P ER F EH K T M EH M ER IY]
• TVs can improve the robustness of automatic speech recognition
• TVs are consistent in the presence of coarticulation
PROJECT GOAL
Effectively estimate TV trajectories using artificial neural networks,
implementing Kalman smoothing when necessary.
APPROACH
• Artificial neural networks (ANNs) [b]
• Feed-forward ANN (FF-ANN)
• Recurrent ANN (RANN)
• Motivation:
• Speech inversion is ill-posed: the articulatory-to-acoustic mapping is many-to-one [c]
• ANNs can map m inputs to n outputs
[Figure: retroflex /r/ vs. bunched /r/, two distinct vocal tract shapes producing nearly identical acoustics]
b. Papcun, 1992.
c. Atal et al., 1978.
STRUCTURE [d]
• 3 hidden layers
• Each node uses the sigmoidal activation function f(x) = tanh(x)
• Weights w and biases b
• Input: acoustic feature vector (9x20 or 9x13)
• Output: g_k, an estimate of the TV trajectories at time k (dimension 8x1)
• g_k is a nonlinear composition of the activation functions
d. Mitra, 2010.
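As a rough illustration of this structure, a minimal NumPy sketch of the forward pass, assuming the 9x13 feature windows are flattened into a 117-dimensional input vector; the hidden-layer width (150) and the linear output layer are illustrative assumptions, not the deck's exact configuration.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """One layer's weights w and biases b (small random initialization)."""
    return rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out)

def forward(x, layers):
    """g_k as a nonlinear composition: tanh at each hidden layer,
    linear at the output (an assumption; the slide does not specify)."""
    for i, (W, b) in enumerate(layers):
        z = W @ x + b
        x = z if i == len(layers) - 1 else np.tanh(z)
    return x

rng = np.random.default_rng(0)
sizes = [9 * 13, 150, 150, 150, 8]            # input, 3 hidden layers, 8 TVs
layers = [init_layer(m, n, rng) for m, n in zip(sizes[:-1], sizes[1:])]

x_k = rng.normal(size=9 * 13)                 # one flattened acoustic window
g_k = forward(x_k, layers)                    # 8x1 TV estimate at time k
```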
COST FUNCTION
• Networks are trained by minimizing the sum-of-squares error E_SE
• Training data [x, t] (N = 315 words) [e]
• The network output g_k is the predicted TV position at each time step k
• Weights and biases are updated using the scaled conjugate gradient algorithm and dynamic backpropagation to reduce E_SE
e. Mitra, 2010.
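The error itself appears only as an image in the slides; the standard sum-of-squares form over the N training pairs, which is presumably what is meant, is:

```latex
E_{SE}(w, b) = \frac{1}{2} \sum_{n=1}^{N} \big\lVert g(x_n; w, b) - t_n \big\rVert^{2}
```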
DYNAMIC BACKPROPAGATION [f]
[Weight-update equations and variable definitions shown as images on the original slide]
f. Jin and Gupta, 1999.
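The update equations did not survive extraction. As a rough guide only: dynamic backpropagation propagates state sensitivities forward in time. For a recurrent state update x_{k+1} = f(W x_k + Γ u_k), a generic RTRL-style recursion (an assumption, not necessarily Jin and Gupta's exact formulation) is:

```latex
\frac{\partial x_{k+1}}{\partial w_{ij}}
  = \operatorname{diag}\!\big(f'(n_k)\big)
    \left( e_i\,[x_k]_j + W\,\frac{\partial x_k}{\partial w_{ij}} \right),
\qquad n_k = W x_k + \Gamma u_k
```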
SCALED CONJUGATE GRADIENT (SCG) [g]
• Choose a weight vector w_1 and scalars. Let p_1 = r_1 = -E'_SE(w_1)
• While the steepest-descent direction r_k ≠ 0:
  • If success = true, calculate second-order information
  • Scale s_k, the finite-difference approximation to the second derivative
  • If δ_k ≤ 0, make the Hessian positive definite
  • Calculate the step size α_k = (p_k^T r_k)/δ_k
  • Calculate the comparison parameter Δ_k
  • If Δ_k ≥ 0:
    • w_{k+1} = w_k + α_k p_k, r_{k+1} = -E'_SE(w_{k+1})
    • If k mod M = 0 (M is the number of weights), restart the algorithm: let p_{k+1} = r_{k+1}
    • Else create a new conjugate direction p_{k+1} = r_{k+1} + β_k p_k
  • If Δ_k < 0.25, increase the scale parameter: λ_k = 4λ_k
g. Moller, 1993.
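The loop above, written out as a Python sketch. It follows Moller (1993) for the steps the slide leaves implicit (the finite-difference constant σ, the positive-definiteness fix, the λ decrease on very successful steps); E and grad_E are placeholders for the network error and its gradient over a flattened weight vector.

```python
import numpy as np

def scg(E, grad_E, w, sigma=1e-4, lam=1e-6, tol=1e-6, max_iter=500):
    """Scaled conjugate gradient (after Moller, 1993), simplified as on
    the slide: lam *= 4 when Delta < 0.25, restart every M = len(w) steps."""
    r = -grad_E(w)                      # r_1 = -E'(w_1)
    p = r.copy()                        # p_1 = r_1
    success = True
    M = w.size                          # M = number of weights
    for k in range(1, max_iter + 1):
        if np.linalg.norm(r) < tol:     # stop when r_k is (numerically) zero
            break
        p2 = p @ p
        if success:                     # second-order information
            sigma_k = sigma / np.sqrt(p2)
            s = (grad_E(w + sigma_k * p) + r) / sigma_k   # note r = -grad_E(w)
            delta_raw = p @ s           # finite-difference curvature p'Hp
        delta = delta_raw + lam * p2    # scale s_k
        if delta <= 0:                  # make the Hessian positive definite
            lam_new = 2 * (lam - delta / p2)
            delta = -delta + lam * p2
            lam = lam_new
        mu = p @ r
        alpha = mu / delta              # step size alpha_k
        Delta = 2 * delta * (E(w) - E(w + alpha * p)) / mu**2  # comparison
        if Delta >= 0:                  # error was reduced: accept the step
            w = w + alpha * p
            r_new = -grad_E(w)
            if k % M == 0:              # periodic restart: steepest descent
                p = r_new
            else:                       # new conjugate direction
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p
            r = r_new
            success = True
            if Delta >= 0.75:
                lam *= 0.25             # quadratic model is good: relax scale
        else:
            success = False             # retry same direction with larger lam
        if Delta < 0.25:
            lam *= 4                    # increase scale parameter (per slide)
    return w
```

Training would then look like w_opt = scg(E_SE, grad_E_SE, w0), with all weights and biases packed into one vector.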
KALMAN SMOOTHING
Kalman filtering is used to smooth the noisy trajectory estimates from the ANNs
• TV trajectories modeled as output of a dynamic system
• State-space representation: [equations shown as images on the original slide]
• Parameters:
  Γ : time difference (ms) between two consecutive measurements
  ω_k : process noise
  ν_k : measurement noise
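A plausible reconstruction of the missing equations, assuming the common constant-velocity model (the state x_k stacks one TV's position and velocity, and z_k is the noisy ANN estimate):

```latex
x_{k+1} = \begin{bmatrix} 1 & \Gamma \\ 0 & 1 \end{bmatrix} x_k + \omega_k,
\qquad
z_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + \nu_k
```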
KALMAN SMOOTHING [h]
• Recursive estimator
• Predict phase:
  • Predicted state estimate
  • Predicted estimate covariance
• Update phase:
  • S_k = residual covariance
  • K_k = optimal Kalman gain
  • Update the state estimate
  • Update the estimate covariance
h. Kalman, 1960.
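A minimal NumPy sketch of the predict and update phases for one TV channel, using the constant-velocity model assumed above; Γ and the noise covariances q and r_var are illustrative values, not the project's tuned parameters. A full smoother would add a backward (Rauch-Tung-Striebel) pass over the stored estimates; this is the forward filter half.

```python
import numpy as np

def kalman_filter(z, gamma=10.0, q=1e-3, r_var=1e-2):
    """Filter one noisy TV trajectory z (one sample every gamma ms)."""
    F = np.array([[1.0, gamma], [0.0, 1.0]])   # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])                 # only position is observed
    Q = q * np.eye(2)                          # process noise covariance
    R = np.array([[r_var]])                    # measurement noise covariance
    x = np.array([z[0], 0.0])                  # initial state estimate
    P = np.eye(2)                              # initial estimate covariance
    out = []
    for z_k in z:
        # Predict phase
        x = F @ x                              # predicted state estimate
        P = F @ P @ F.T + Q                    # predicted estimate covariance
        # Update phase
        y = z_k - H @ x                        # measurement residual
        S = H @ P @ H.T + R                    # residual covariance S_k
        K = P @ H.T @ np.linalg.inv(S)         # optimal Kalman gain K_k
        x = x + K @ y                          # updated state estimate
        P = (np.eye(2) - K @ H) @ P            # updated estimate covariance
        out.append(x[0])
    return np.array(out)
```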
IMPLEMENTATION
• Python
• Scientific libraries
  • FANN (Fast Artificial Neural Network)
  • Neurolab
  • PyBrain
• Deepthought/Deepthought2 high-performance computing clusters
TEST PROBLEM
• Synthetic data set (420 words) as model input [x, t]
• Data sampled over nine 10-ms windows
• Generated from a speech production model at Haskins Laboratories (Yale Univ.)
• TV trajectories generated by the TAsk Dynamics Application (TADA) model
• Reproduce estimates of root mean square error (RMSE) and Pearson product-moment correlation coefficient (PPMC)
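For reference, the two metrics, computed between an estimated trajectory g_k and the ground-truth trajectory t_k over K time steps:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (g_k - t_k)^2},
\qquad
\mathrm{PPMC} = \frac{\sum_{k} (g_k - \bar g)(t_k - \bar t)}
                     {\sqrt{\sum_{k} (g_k - \bar g)^2}\,\sqrt{\sum_{k} (t_k - \bar t)^2}}
```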
VALIDATION METHODS
• New real data set:
  • 47 American-English speakers
  • 56 tasks per speaker
  • Obtained from the Univ. of Wisconsin's X-Ray Microbeam Speech Production database
• Feed the data through the model
• Compare error estimates
• Obtain visual trajectories
MILESTONES
• Build a FF-ANN
• Implement Kalman smoothing
• Use synthetic data to test FF-ANN
• Build a recurrent ANN
• Implement smoothing (if necessary)
• Test the recurrent ANN using real data
TIMELINE
• This semester: Build and test an FF-ANN
• October: Research and start implementation.
• November: Finish implementation and incorporate Kalman smoothing.
• December: Test and compile results using synthetic data.
• Next semester: Build and test a recurrent ANN
• January-February: Research and begin implementation (modifying FF-ANN).
• March: Finish implementation. Begin testing.
• April: Modifications (as necessary) and further testing.
• May: Finalize and collect results.
DELIVERABLES
• Proposal presentation and report
• Mid-year presentation/report
• Final presentation/report
• FF-ANN code
• Recurrent ANN code
• Synthetic data set
• Real acoustic data set
BIBLIOGRAPHY
1. Atal, B. S., J. J. Chang, M. V. Mathews, and J. W. Tukey. "Inversion of Articulatory-to-Acoustic Transformation in the Vocal Tract by a Computer-Sorting Technique." The Journal of the Acoustical Society of America 63.5 (1978): 1535-1553.
2. Bengio, Yoshua. "Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)." Notes de cours IFT6266, Hiver 2010. 2 Apr. 2010. Web. 4 Oct. 2014.
3. Jin, Liang, and M. M. Gupta. "Stable Dynamic Backpropagation Learning in Recurrent Neural Networks." IEEE Transactions on Neural Networks 10.6 (1999): 1321-1334. Web. 4 Oct. 2014. <http://www.maths.tcd.ie/~mnl/store/JinGupta1999a.pdf>.
4. Jordan, Michael I., and David E. Rumelhart. "Forward Models: Supervised Learning with a Distal Teacher." Cognitive Science 16 (1992): 307-354. Web. 4 Oct. 2014.
5. Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1960): 35-45. Web. 4 Oct. 2014.
BIBLIOGRAPHY
6. Mitra, Vikramjit. Improving Robustness of Speech Recognition Systems. Dissertation, University of Maryland, College Park, 2010.
7. Mitra, V., I. Y. Ozbek, Hosung Nam, Xinhui Zhou, and C. Y. Espy-Wilson. "From Acoustics to Vocal Tract Time Functions." Acoustics, Speech, and Signal Processing, 2009. ICASSP 2009. (2009): 4497-4500. Print.
8. Moller, M. "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning." Neural Networks 6 (1993): 525-533. Web. 4 Oct. 2014.
9. Nielsen, Michael. Neural Networks and Deep Learning. Determination Press, 1 Sept. 2014. Web. 4 Oct. 2014.
10. Papcun, George. "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-ray Microbeam Data." The Journal of the Acoustical Society of America (1992): 688. Web. 4 Oct. 2014.
BIBLIOGRAPHY
All images taken from
11. Mitra, Vikramjit, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. "Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies." IEEE Journal of Selected Topics in Signal Processing 4.6 (2010): 1027-1045. Print.
12. Espy-Wilson, Carol. Presentation at Interspeech 2013.
13. Espy-Wilson, Carol. Unpublished results.
Sound clips courtesy of
• I Know That Voice. 2013. Film.
• Carol Espy-Wilson, Interspeech 2013.
THANKS!
QUESTIONS?