Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010

Background
- Bayesian speech synthesis [Hashimoto et al., '08]
  - Represents the whole problem of speech synthesis
  - All processes can be derived from one single predictive distribution
- Approximation for estimating the posterior
  - The posterior is assumed to be independent of the synthesis data
    ⇒ Training and synthesis processes are separated
- This work: integration of training and synthesis processes
  - Derives an algorithm in which the posterior and the synthesis data are updated iteratively

Outline
- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & proposed method
  - Approximation of the posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & future work

Bayesian speech synthesis (1/2)
Model training and speech synthesis:
- ML:
    training:  λ̂ = argmax_λ p(O | λ, W)
    synthesis: ô = argmax_o p(o | λ̂, w)
- Bayes (training & synthesis in a single criterion):
    ô = argmax_o p(o | O, w, W)
where
  o: synthesis data,  O: training data,  λ: model parameters,
  w: label seq. for synthesis,  W: label seq. for training

Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
  p(o | O, w, W) ∝ Σ_{z,Z} ∫ p(o, z | w, λ) p(O, Z | W, λ) p(λ) dλ
where
  z: HMM state seq. for synthesis data,  Z: HMM state seq. for training data,
  p(o, z | w, λ): likelihood of synthesis data,
  p(O, Z | W, λ): likelihood of training data,
  p(λ): prior distribution of model parameters
The sums and the integral are intractable ⇒ variational Bayesian method [Attias, '99]

Variational Bayesian method (1/2)
Estimate an approximate posterior distribution by maximizing a lower bound F of the log marginal likelihood (Jensen's inequality):
  log p(o, O | w, W) ≥ F
  F = ⟨log [ p(o, z | w, λ) p(O, Z | W, λ) p(λ) / Q(z, Z, λ) ]⟩
where
  ⟨·⟩: expectation w.r.t. Q(z, Z, λ)
  Q(z, Z, λ): approximate distribution of the true posterior
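To make the variational idea concrete, here is a minimal toy sketch (an illustration only, not the paper's HMM framework): coordinate-ascent variational Bayes for a 1-D Gaussian with unknown mean and precision, using a factorized approximate posterior q(mu)q(tau) and maximizing a lower bound on the marginal likelihood. All prior values and variable names are assumptions made for this example.

```python
# Toy variational Bayes (not the paper's model): 1-D Gaussian with unknown
# mean mu and precision tau, factorized approximation q(mu, tau) = q(mu) q(tau).
import random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(200)]  # true mu = 2, tau = 1
n = len(data)
sx = sum(data)
sxx = sum(x * x for x in data)

# Independent priors (hypothetical values chosen for the example):
mu0, s0 = 0.0, 10.0        # Gaussian prior mean / variance for mu
a0, b0 = 1.0, 1.0          # Gamma prior parameters for tau

e_tau = a0 / b0            # initial expectation E[tau]
for _ in range(50):        # alternate the two closed-form updates
    # q(mu) = N(m_n, v_n): combines the prior with data weighted by E[tau]
    prec = 1.0 / s0 + n * e_tau
    v_n = 1.0 / prec
    m_n = (mu0 / s0 + e_tau * sx) * v_n
    # q(tau) = Gamma(a_n, b_n): uses E[(x - mu)^2] under q(mu)
    a_n = a0 + 0.5 * n
    b_n = b0 + 0.5 * (sxx - 2.0 * sx * m_n + n * (m_n * m_n + v_n))
    e_tau = a_n / b_n

print(m_n, e_tau)          # posterior mean of mu and E[tau]
```

Each closed-form update can only increase the lower bound, so alternating them until convergence behaves like the EM algorithm; the recovered mean and precision should be close to the values used to generate the data.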
Variational Bayesian method (2/2)
Assume the random variables are statistically independent:
  Q(z, Z, λ) = Q(z) Q(Z) Q(λ)
The optimal posterior distributions are then
  Q(λ) = (1/C_λ) exp⟨log p(o, z, O, Z, λ | w, W)⟩_{Q(z)Q(Z)}
  Q(z) = (1/C_z) exp⟨log p(o, z, O, Z, λ | w, W)⟩_{Q(Z)Q(λ)}
  Q(Z) = (1/C_Z) exp⟨log p(o, z, O, Z, λ | w, W)⟩_{Q(z)Q(λ)}
where C_λ, C_z, C_Z are normalization terms. The distributions depend on each other, so they are updated iteratively, as in the EM algorithm.

Speech parameter generation
Speech parameter generation based on the Bayesian approach: since the lower bound F well approximates the true log marginal likelihood, the synthesis data is generated by maximizing the lower bound:
  ô = argmax_o F

Bayesian speech synthesis
The lower bound F of the log marginal likelihood is maximized consistently by
- estimation of the posterior distributions, and
- speech parameter generation.
⇒ All processes are derived from the single predictive distribution.

Approximation of the posterior
The posterior Q(λ) depends on the synthesis data o, but o is not observed.
Conventional approximation: assume that the synthesis data is independent of Q(λ) [Hashimoto et al., '08]
⇒ The posterior is estimated from the training data only.

Separation of training & synthesis
In the conventional method, the processes are separated:
- Training (from the training data): update the posterior distribution Q(Z) (HMM state sequence of the training data), then Q(λ) (model parameters)
- Synthesis: update the posterior distribution Q(z) (HMM state sequence of the synthesis data), then generate the synthesis data

Use of generated data
- Problem: the posterior distribution depends on the synthesis data, but the synthesis data is not observed
- Proposed method: use the generated data instead of observed data for estimating the posterior distribution
- The posterior distributions and the synthesis data are updated iteratively, as in the EM algorithm

Previous method
Training uses only the training data: update Q(Z), then Q(λ). Synthesis then updates Q(z) and generates the synthesis data.

Proposed method
Q(Z) and Q(λ) are updated from the training data and the generated synthesis data; Q(z) is then updated and the synthesis data is re-generated, so training and synthesis form one loop.

Synthesis data
- The synthesis data can include several utterances and impacts the posterior distributions
- How many utterances should be generated in one update step? Two methods are discussed:
  - Batch-based method: update the posterior distributions over several test sentences
  - Sentence-based method: update the posterior distributions for one test sentence

Update method (1/2): batch-based
- The generated synthesis data of all test sentences is used for updating the posterior distributions
- The synthesis data of all test sentences is generated using the same posterior distributions

Update method (2/2): sentence-based
- The generated synthesis data of one test sentence is used for updating the posterior distributions
- The synthesis data of each test sentence is generated using different posterior distributions

Experimental conditions
  Database:            ATR Japanese speech database B-set
  Speaker:             MHT
  Training data:       450 utterances
  Test data:           53 utterances
  Sampling rate:       16 kHz
  Window:              Blackman window
  Frame size / shift:  25 ms / 5 ms
  Feature vector:      24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
  HMM:                 5-state left-to-right HSMM without skip transitions

Iteration process
Update of the posterior distributions and the synthesis data:
1. The posterior distributions are estimated from the training data
2. The initial synthesis data is generated
3. Context clustering using the training data and the generated synthesis data
4. The posterior distributions are re-estimated from the training data and the generated synthesis data (number of updates: 5)
5.
Synthesis data is re-generated
6. Steps 3, 4, and 5 are iterated

Comparison of the number of updates
Data used for estimating the posterior distributions:
  Iteration 0: 450 training utterances
  Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
  Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
  Iteration 3: 450 utterances + 1 utterance generated in Iteration 2

Experimental results
[Figure: comparison of the number of updates]

Comparison of Batch and Sentence
Training & generation method, and data used for estimating the posterior distributions:
  ML (baseline):         450 utterances
  Bayes:                 450 utterances
  Bayes, batch-based:    450 + 53 generated utterances
  Bayes, sentence-based: 450 + 1 generated utterance (53 different posterior dists.)

Experimental results
[Figure: comparison of Batch and Sentence]

Conclusions and future work
- Integration of training and synthesis processes
  - The generated synthesis data is used for estimating the posterior distributions
  - The posterior distributions and the synthesis data are updated iteratively
  - The proposed method outperforms the baseline method
- Future work
  - Investigate the relation between the amounts of training and synthesis data
  - Experiments on various amounts of training data

Thank you
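The iterative procedure described under "Use of generated data" and "Iteration process" can be sketched structurally with toy stand-ins (hypothetical functions and numbers, not the paper's HSMM machinery): the "posterior" is reduced to a single mean parameter, "generation" simply emits that mean, and the generated data is fed back into posterior re-estimation, in both a batch-based and a sentence-based variant.

```python
# Structural sketch of the iterative training/synthesis loop (toy stand-ins).
def estimate_posterior(train, generated):
    # stand-in for variational posterior re-estimation from training + generated data
    pool = train + generated
    return sum(pool) / len(pool)

def generate(posterior):
    # stand-in for speech parameter generation (here: just the mean itself)
    return posterior

train = [1.0, 2.0, 3.0, 2.5, 1.5]   # hypothetical training "utterances"
n_test = 3                          # hypothetical number of test sentences

# Batch-based: one shared posterior, all test sentences generated per pass
post = estimate_posterior(train, [])
for _ in range(5):                  # iterate re-estimation and re-generation
    synth = [generate(post) for _ in range(n_test)]
    post = estimate_posterior(train, synth)

# Sentence-based: a separate posterior per test sentence
posts = []
for _ in range(n_test):
    p = estimate_posterior(train, [])
    for _ in range(5):
        p = estimate_posterior(train, [generate(p)])
    posts.append(p)

print(post, posts)
```

In this toy the loop reaches a fixed point immediately; in the actual framework each pass re-estimates the variational posteriors using the re-generated synthesis data, so the estimates change across iterations.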