ESTIMATING TRACT VARIABLES
FROM ACOUSTICS VIA
MACHINE LEARNING
CHRISTIANA SABETT
APPLIED MATH, APPLIED STATISTICS, AND SCIENTIFIC COMPUTING (AMSC)
OCTOBER 7, 2014
ADVISOR: DR. CAROL ESPY-WILSON
ELECTRICAL AND COMPUTER ENGINEERING
INTRODUCTION
Automatic speech recognition (ASR) systems remain inadequate in their current forms.
• Coarticulation: the overlap of articulatory actions in the vocal tract, a major source of acoustic variability
TRACT VARIABLES [a]
• Articulatory information: information from the organs along the vocal tract
• Tract variables (TVs): vocal tract constriction variables that describe physical trajectories in time
  • Lip Aperture (LA)
  • Lip Protrusion (LP)
  • Tongue tip constriction degree (TTCD)
  • Tongue tip constriction location (TTCL)
  • Tongue body constriction degree (TBCD)
  • Tongue body constriction location (TBCL)
  • Velum (VEL)
  • Glottis (GLO)
a. Mitra et al., 2010.
TRACT VARIABLES
[Figure: spectrogram (0-8000 Hz) and TV trajectories (LA, TT, TB) over 0-0.8 s for the clearly articulated phrase "perfect memory", with phone labels P ER F EH K T M EH M ER IY]
• TVs can improve the robustness of automatic speech recognition
• TVs are consistent in the presence of coarticulation
PROJECT GOAL
Effectively estimate TV trajectories using artificial neural networks,
implementing Kalman smoothing when necessary.
APPROACH
• Artificial neural networks (ANNs) [b]
• Feed-forward ANN (FF-ANN)
• Recurrent ANN (RANN)
• Motivation:
• Speech inversion is ill-posed: the articulatory-to-acoustic mapping is many-to-one [c]
• ANNs can map m inputs to n outputs
[Figure: retroflex /r/ vs. bunched /r/, two distinct vocal tract shapes producing nearly identical acoustics]
b. Papcun, 1992.
c. Atal et al., 1978.
STRUCTURE [d]
• 3 hidden layers
• Each node uses the sigmoidal activation function f(x) = tanh(x)
• Weights w and biases b
• Input: acoustic feature vector (9x20 or 9x13)
• Output: g_k, an estimate of the TV trajectories at time k (dimension 8x1)
• g_k is a nonlinear composition of the activation functions
d. Mitra, 2010.
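As a rough illustration of this structure, a minimal NumPy sketch of the forward pass, assuming the 9x13 feature windows are flattened into a 117-dimensional input vector; the hidden-layer width (150) and the linear output layer are illustrative assumptions, not the deck's exact configuration.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """One layer's weights w and biases b (small random initialization)."""
    return rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out)

def forward(x, layers):
    """g_k as a nonlinear composition: tanh at each hidden layer,
    linear at the output (an assumption; the slide does not specify)."""
    for i, (W, b) in enumerate(layers):
        z = W @ x + b
        x = z if i == len(layers) - 1 else np.tanh(z)
    return x

rng = np.random.default_rng(0)
sizes = [9 * 13, 150, 150, 150, 8]            # input, 3 hidden layers, 8 TVs
layers = [init_layer(m, n, rng) for m, n in zip(sizes[:-1], sizes[1:])]

x_k = rng.normal(size=9 * 13)                 # one flattened acoustic window
g_k = forward(x_k, layers)                    # 8x1 TV estimate at time k
```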
COST FUNCTION
• Networks are trained by minimizing the sum-of-squares error E_SE
• Training data [x, t] (N = 315 words) [e]
• The network output g_k is the predicted TV position at each time step k
• Weights and biases are updated using the scaled conjugate gradient algorithm and dynamic backpropagation to reduce E_SE
e. Mitra, 2010.
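The error itself appears only as an image in the slides; the standard sum-of-squares form over the N training pairs, which is presumably what is meant, is:

```latex
E_{SE}(w, b) = \frac{1}{2} \sum_{n=1}^{N} \big\lVert g(x_n; w, b) - t_n \big\rVert^{2}
```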
DYNAMIC BACKPROPAGATION [f]
[Weight-update equations and variable definitions shown as images on the original slide]
f. Jin and Gupta, 1999.
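The update equations did not survive extraction. As a rough guide only: dynamic backpropagation propagates state sensitivities forward in time. For a recurrent state update x_{k+1} = f(W x_k + Γ u_k), a generic RTRL-style recursion (an assumption, not necessarily Jin and Gupta's exact formulation) is:

```latex
\frac{\partial x_{k+1}}{\partial w_{ij}}
  = \operatorname{diag}\!\big(f'(n_k)\big)
    \left( e_i\,[x_k]_j + W\,\frac{\partial x_k}{\partial w_{ij}} \right),
\qquad n_k = W x_k + \Gamma u_k
```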
SCALED CONJUGATE GRADIENT (SCG) [g]
• Choose a weight vector w_1 and scalars. Let p_1 = r_1 = -E'_SE(w_1)
• While the steepest-descent direction r_k ≠ 0:
  • If success = true, calculate second-order information
  • Scale s_k, the finite-difference approximation to the second derivative
  • If δ_k ≤ 0, make the Hessian positive definite
  • Calculate the step size α_k = (p_k^T r_k)/δ_k
  • Calculate the comparison parameter Δ_k
  • If Δ_k ≥ 0:
    • w_{k+1} = w_k + α_k p_k, r_{k+1} = -E'_SE(w_{k+1})
    • If k mod M = 0 (M is the number of weights), restart the algorithm: let p_{k+1} = r_{k+1}
    • Else create a new conjugate direction p_{k+1} = r_{k+1} + β_k p_k
  • If Δ_k < 0.25, increase the scale parameter: λ_k = 4λ_k
g. Moller, 1993.
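The loop above, written out as a Python sketch. It follows Moller (1993) for the steps the slide leaves implicit (the finite-difference constant σ, the positive-definiteness fix, the λ decrease on very successful steps); E and grad_E are placeholders for the network error and its gradient over a flattened weight vector.

```python
import numpy as np

def scg(E, grad_E, w, sigma=1e-4, lam=1e-6, tol=1e-6, max_iter=500):
    """Scaled conjugate gradient (after Moller, 1993), simplified as on
    the slide: lam *= 4 when Delta < 0.25, restart every M = len(w) steps."""
    r = -grad_E(w)                      # r_1 = -E'(w_1)
    p = r.copy()                        # p_1 = r_1
    success = True
    M = w.size                          # M = number of weights
    for k in range(1, max_iter + 1):
        if np.linalg.norm(r) < tol:     # stop when r_k is (numerically) zero
            break
        p2 = p @ p
        if success:                     # second-order information
            sigma_k = sigma / np.sqrt(p2)
            s = (grad_E(w + sigma_k * p) + r) / sigma_k   # note r = -grad_E(w)
            delta_raw = p @ s           # finite-difference curvature p'Hp
        delta = delta_raw + lam * p2    # scale s_k
        if delta <= 0:                  # make the Hessian positive definite
            lam_new = 2 * (lam - delta / p2)
            delta = -delta + lam * p2
            lam = lam_new
        mu = p @ r
        alpha = mu / delta              # step size alpha_k
        Delta = 2 * delta * (E(w) - E(w + alpha * p)) / mu**2  # comparison
        if Delta >= 0:                  # error was reduced: accept the step
            w = w + alpha * p
            r_new = -grad_E(w)
            if k % M == 0:              # periodic restart: steepest descent
                p = r_new
            else:                       # new conjugate direction
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p
            r = r_new
            success = True
            if Delta >= 0.75:
                lam *= 0.25             # quadratic model is good: relax scale
        else:
            success = False             # retry same direction with larger lam
        if Delta < 0.25:
            lam *= 4                    # increase scale parameter (per slide)
    return w
```

Training would then look like w_opt = scg(E_SE, grad_E_SE, w0), with all weights and biases packed into one vector.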
KALMAN SMOOTHING
Kalman filtering is used to smooth the noisy trajectory estimates from the ANNs
• TV trajectories modeled as output of a dynamic system
• State-space representation: [equations shown as images on the original slide]
• Parameters:
  Γ : time difference (ms) between two consecutive measurements
  ω_k : process noise
  ν_k : measurement noise
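A plausible reconstruction of the missing equations, assuming the common constant-velocity model (the state x_k stacks one TV's position and velocity, and z_k is the noisy ANN estimate):

```latex
x_{k+1} = \begin{bmatrix} 1 & \Gamma \\ 0 & 1 \end{bmatrix} x_k + \omega_k,
\qquad
z_k = \begin{bmatrix} 1 & 0 \end{bmatrix} x_k + \nu_k
```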
KALMAN SMOOTHING [h]
• Recursive estimator
• Predict phase:
  • Predicted state estimate
  • Predicted estimate covariance
• Update phase:
  • S_k = residual covariance
  • K_k = optimal Kalman gain
  • Update the state estimate
  • Update the estimate covariance
h. Kalman, 1960.
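A minimal NumPy sketch of the predict and update phases for one TV channel, using the constant-velocity model assumed above; Γ and the noise covariances q and r_var are illustrative values, not the project's tuned parameters. A full smoother would add a backward (Rauch-Tung-Striebel) pass over the stored estimates; this is the forward filter half.

```python
import numpy as np

def kalman_filter(z, gamma=10.0, q=1e-3, r_var=1e-2):
    """Filter one noisy TV trajectory z (one sample every gamma ms)."""
    F = np.array([[1.0, gamma], [0.0, 1.0]])   # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])                 # only position is observed
    Q = q * np.eye(2)                          # process noise covariance
    R = np.array([[r_var]])                    # measurement noise covariance
    x = np.array([z[0], 0.0])                  # initial state estimate
    P = np.eye(2)                              # initial estimate covariance
    out = []
    for z_k in z:
        # Predict phase
        x = F @ x                              # predicted state estimate
        P = F @ P @ F.T + Q                    # predicted estimate covariance
        # Update phase
        y = z_k - H @ x                        # measurement residual
        S = H @ P @ H.T + R                    # residual covariance S_k
        K = P @ H.T @ np.linalg.inv(S)         # optimal Kalman gain K_k
        x = x + K @ y                          # updated state estimate
        P = (np.eye(2) - K @ H) @ P            # updated estimate covariance
        out.append(x[0])
    return np.array(out)
```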
IMPLEMENTATION
• Python
• Scientific libraries
  • FANN (Fast Artificial Neural Network)
  • Neurolab
  • PyBrain
• Deepthought/Deepthought2 high-performance computing clusters
TEST PROBLEM
• Synthetic data set (420 words) as model input [x, t]
• Data sampled over nine 10-ms windows
• Generated from a speech production model at Haskins Laboratories (Yale Univ.)
• TV trajectories generated by the TAsk Dynamics Application (TADA) model
• Reproduce estimates of root mean square error (RMSE) and Pearson product-moment correlation coefficient (PPMC)
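For reference, the two metrics, computed between an estimated trajectory g_k and the ground-truth trajectory t_k over K time steps:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (g_k - t_k)^2},
\qquad
\mathrm{PPMC} = \frac{\sum_{k} (g_k - \bar g)(t_k - \bar t)}
                     {\sqrt{\sum_{k} (g_k - \bar g)^2}\,\sqrt{\sum_{k} (t_k - \bar t)^2}}
```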
VALIDATION METHODS
• New real data set:
  • 47 American-English speakers
  • 56 tasks per speaker
  • Obtained from the Univ. of Wisconsin's X-Ray Microbeam Speech Production database
• Feed the data through the model
• Compare error estimates
• Obtain visual trajectories
MILESTONES
• Build a FF-ANN
• Implement Kalman smoothing
• Use synthetic data to test FF-ANN
• Build a recurrent ANN
• Implement smoothing (if necessary)
• Test the recurrent ANN using real data
TIMELINE
• This semester: Build and test an FF-ANN
• October: Research and start implementation.
• November: Finish implementation and incorporate Kalman smoothing.
• December: Test and compile results using synthetic data.
• Next semester: Build and test a recurrent ANN
• January-February: Research and begin implementation (modifying FF-ANN).
• March: Finish implementation. Begin testing.
• April: Modifications (as necessary) and further testing.
• May: Finalize and collect results.
DELIVERABLES
• Proposal presentation and report
• Mid-year presentation/report
• Final presentation/report
• FF-ANN code
• Recurrent ANN code
• Synthetic data set
• Real acoustic data set
BIBLIOGRAPHY
1. Atal, B. S., J. J. Chang, M. V. Mathews, and J. W. Tukey. "Inversion of Articulatory-to-Acoustic Transformation in the Vocal Tract by a Computer-Sorting Technique." The Journal of the Acoustical Society of America 63.5 (1978): 1535-1553.
2. Bengio, Yoshua. "Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)." Notes de cours IFT6266, Hiver 2010. 2 Apr. 2010. Web. 4 Oct. 2014.
3. Jin, Liang, and M. M. Gupta. "Stable Dynamic Backpropagation Learning in Recurrent Neural Networks." IEEE Transactions on Neural Networks 10.6 (1999): 1321-1334. Web. 4 Oct. 2014. <http://www.maths.tcd.ie/~mnl/store/JinGupta1999a.pdf>.
4. Jordan, Michael I., and David E. Rumelhart. "Forward Models: Supervised Learning with a Distal Teacher." Cognitive Science 16 (1992): 307-354. Web. 4 Oct. 2014.
5. Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1960): 35-45. Web. 4 Oct. 2014.
BIBLIOGRAPHY
6. Mitra, Vikramjit. Improving Robustness of Speech Recognition Systems. Dissertation, University of Maryland, College Park, 2010.
7. Mitra, V., I. Y. Ozbek, Hosung Nam, Xinhui Zhou, and C. Y. Espy-Wilson. "From Acoustics to Vocal Tract Time Functions." Acoustics, Speech, and Signal Processing, 2009. ICASSP 2009. (2009): 4497-4500. Print.
8. Moller, M. "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning." Neural Networks 6 (1993): 525-533. Web. 4 Oct. 2014.
9. Nielsen, Michael. Neural Networks and Deep Learning. Determination Press, 1 Sept. 2014. Web. 4 Oct. 2014.
10. Papcun, George. "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-ray Microbeam Data." The Journal of the Acoustical Society of America (1992): 688. Web. 4 Oct. 2014.
BIBLIOGRAPHY
All images taken from
11. Mitra, Vikramjit, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. "Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies." IEEE Journal of Selected Topics in Signal Processing 4.6 (2010): 1027-1045. Print.
12. Espy-Wilson, Carol. Presentation at Interspeech 2013.
13. Espy-Wilson, Carol. Unpublished results.
Sound clips courtesy of
• I Know That Voice. 2013. Film.
• Carol Espy-Wilson, Interspeech 2013.
THANKS!
QUESTIONS?