Bayesian Speech Synthesis Framework
Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010
Background

- Bayesian speech synthesis [Hashimoto et al., ’08]
  - Represents the problem of speech synthesis as the estimation of a predictive distribution
  - All processes can be derived from one single predictive distribution
- Approximation for estimating the posterior
  - The posterior is assumed independent of the synthesis data
    ⇒ Training and synthesis processes are separated
- Proposal: integration of training and synthesis processes
  - Derive an algorithm in which the posterior distributions and the synthesis data are iteratively updated
Outline

- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & Proposed method
  - Approximation of posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & Future work
Bayesian speech synthesis (1/2)

Model training and speech synthesis:

- ML:
  Training:  $\hat{\lambda} = \arg\max_{\lambda} p(O \mid S, \lambda)$
  Synthesis: $\hat{o} = \arg\max_{o} p(o \mid s, \hat{\lambda})$
- Bayes:
  Training & Synthesis: $\hat{o} = \arg\max_{o} p(o \mid s, O, S)$

  $o$ : synthesis data              $O$ : training data
  $s$ : label seq. for synthesis    $S$ : label seq. for training
  $\lambda$ : model parameters
Bayesian speech synthesis (2/2)

Predictive distribution (marginal likelihood):

$$p(o, O \mid s, S) = \sum_{z} \sum_{Z} \int p(o, z \mid s, \lambda)\, p(O, Z \mid S, \lambda)\, p(\lambda)\, d\lambda$$

  $z$ : HMM state seq. for synthesis data
  $Z$ : HMM state seq. for training data
  $p(o, z \mid s, \lambda)$ : likelihood of synthesis data
  $p(O, Z \mid S, \lambda)$ : likelihood of training data
  $p(\lambda)$ : prior distribution for model parameters

⇒ Estimated by the variational Bayesian method [Attias, ’99]
Variational Bayesian method (1/2)

Estimate an approximate posterior distribution
⇒ Maximize a lower bound $\mathcal{F}$ obtained by Jensen’s inequality:

$$\log p(o, O \mid s, S) \geq \left\langle \log \frac{p(o, z \mid s, \lambda)\, p(O, Z \mid S, \lambda)\, p(\lambda)}{Q(z, Z, \lambda)} \right\rangle_{Q(z, Z, \lambda)} = \mathcal{F}$$

  $\langle \cdot \rangle_{Q}$ : expectation w.r.t. $Q$
  $Q(z, Z, \lambda)$ : approximate distribution of the true posterior distribution $p(z, Z, \lambda \mid o, O, s, S)$
Variational Bayesian method (2/2)

Assume the random variables are statistically independent:

$$Q(z, Z, \lambda) = Q(z)\, Q(Z)\, Q(\lambda)$$

Optimal posterior distributions:

$$Q(\lambda) = \frac{1}{C_{\lambda}}\, p(\lambda) \exp \left\langle \log p(o, z \mid s, \lambda)\, p(O, Z \mid S, \lambda) \right\rangle_{Q(z)\, Q(Z)}$$

$$Q(z) = \frac{1}{C_{z}} \exp \left\langle \log p(o, z \mid s, \lambda) \right\rangle_{Q(\lambda)}, \qquad Q(Z) = \frac{1}{C_{Z}} \exp \left\langle \log p(O, Z \mid S, \lambda) \right\rangle_{Q(\lambda)}$$

  $C_{\lambda}, C_{z}, C_{Z}$ : normalization terms

⇒ Iterative updates, as in the EM algorithm (a toy sketch follows)
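The full HSMM updates do not fit on a slide, but the alternating structure can be shown on a toy model. Below is a minimal, runnable Python sketch assuming a 1-D Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior; $Q(\mu)$ and $Q(\tau)$ stand in for the factored posteriors above, and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

# Toy VB: factorize the posterior as Q(mu) Q(tau) for a 1-D Gaussian with
# unknown mean mu and precision tau, Normal-Gamma prior. Each update holds
# the other factor fixed, so the lower bound is non-decreasing, as in EM.

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=100)      # "observed data"
n, x_sum, x_sqsum = len(x), x.sum(), (x ** 2).sum()

mu0, kappa0, a0, b0 = 0.0, 1.0, 1.0, 1.0          # Normal-Gamma prior

e_tau = a0 / b0                                   # initial E[tau] under Q(tau)
for _ in range(20):
    # Update Q(mu) = N(mu_n, 1 / lam_n), given the current E[tau]
    mu_n = (kappa0 * mu0 + x_sum) / (kappa0 + n)
    lam_n = (kappa0 + n) * e_tau
    e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n   # E[mu], E[mu^2] under Q(mu)
    # Update Q(tau) = Gamma(a_n, b_n), given the current Q(mu)
    a_n = a0 + 0.5 * (n + 1)
    b_n = b0 + 0.5 * (x_sqsum - 2 * e_mu * x_sum + n * e_mu2
                      + kappa0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2))
    e_tau = a_n / b_n

print(f"E[mu] = {mu_n:.3f}, E[tau] = {e_tau:.3f}")  # near 2.0 and 1/0.5^2 = 4.0
```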
Speech parameter generation

- Speech parameter generation based on the Bayesian approach
  - The lower bound $\mathcal{F}$ well approximates the true marginal likelihood
  - Generate speech parameters by maximizing the lower bound:

$$\hat{o} = \arg\max_{o} \mathcal{F}$$
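For intuition, here is a sketch of what this maximization looks like under the usual window-matrix formulation $o = Wc$ of HMM-based parameter generation. The shorthand $\bar{P}$ and $\bar{r}$ below is illustrative, not the paper's notation; it stands for the state-occupancy-weighted expectations of $\Sigma^{-1}$ and $\Sigma^{-1}\mu$ under $Q(z)\,Q(\lambda)$.

```latex
% Sketch, assuming Gaussian state outputs and the constraint o = Wc:
% setting \partial F / \partial c = 0 gives linear equations analogous
% to ML parameter generation, with the Gaussian parameters replaced by
% their expectations under Q(z) and Q(lambda):
\hat{o} = \arg\max_{o} \mathcal{F}
\quad\Longrightarrow\quad
W^{\top} \bar{P}\, W \hat{c} = W^{\top} \bar{r}
```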
Outline

- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & Proposed method
  - Approximation of posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & Future work
Bayesian speech synthesis

- Maximize the lower bound of the log marginal likelihood consistently for:
  - Estimation of posterior distributions
  - Speech parameter generation
  ⇒ All processes are derived from the single predictive distribution
Approximation of posterior

- The optimal posterior $Q(\lambda)$ depends on the synthesis data
  ⇒ but the synthesis data is not observed
- Assume that $Q(\lambda)$ is independent of the synthesis data [Hashimoto et al., ’08]
  ⇒ Estimate the posterior from training data only
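Concretely, this approximation amounts to dropping the synthesis-data term from the optimal $Q(\lambda)$ of the variational Bayesian updates shown earlier, so that only training-data statistics remain; a sketch:

```latex
Q(\lambda) \approx \frac{1}{C'_{\lambda}}\, p(\lambda)\,
\exp \left\langle \log p(O, Z \mid S, \lambda) \right\rangle_{Q(Z)}
% C'_{lambda}: normalization term; the synthesis-data likelihood is omitted.
```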
Separation of training & synthesis

Training (uses training data only):
  1. Update of posterior distribution $Q(Z)$ (HMM state sequence of training data)
  2. Update of posterior distribution $Q(\lambda)$ (model parameters)
Synthesis (produces synthesis data):
  3. Update of posterior distribution $Q(z)$ (HMM state sequence of synthesis data)
  4. Generation of synthesis data
Use of generated data

- Problem:
  - The posterior distribution depends on the synthesis data
  - The synthesis data is not observed
- Proposed method:
  - Use generated data instead of observed data for estimating the posterior distribution
  - Update the posterior distributions and the synthesis data iteratively, as in the EM algorithm (see the sketch below)
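A toy sketch of this iteration in Python, not the HSMM system itself: a conjugate Gaussian-mean posterior stands in for the posterior distributions, posterior_mean() for posterior estimation, and the generated "utterance" is simply the posterior predictive mean. All names are illustrative.

```python
import numpy as np

# Proposed iteration in miniature: estimate the posterior from training data
# plus previously generated data, re-generate the synthesis data from the
# updated posterior, and repeat.

rng = np.random.default_rng(0)
train = rng.normal(loc=2.0, scale=1.0, size=450)  # "450 training utterances"
mu0, kappa0 = 0.0, 1.0                            # prior mean and strength

def posterior_mean(data):
    """Posterior mean of a Gaussian mean under a conjugate normal prior."""
    return (kappa0 * mu0 + data.sum()) / (kappa0 + len(data))

generated = np.empty(0)                           # synthesis data, initially absent
for it in range(4):
    m = posterior_mean(np.concatenate([train, generated]))
    generated = np.full(10, m)                    # one 10-frame "utterance"
    print(f"iteration {it}: posterior mean = {m:.4f}")
```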
Previous method

Training (uses training data only):
  1. Update of posterior distribution $Q(Z)$ (HMM state sequence of training data)
  2. Update of posterior distribution $Q(\lambda)$ (model parameters)
Synthesis (produces synthesis data):
  3. Update of posterior distribution $Q(z)$ (HMM state sequence of synthesis data)
  4. Generation of synthesis data
Proposed method

Training data and generated synthesis data are both used for the posterior updates:
  1. Update of posterior distribution $Q(Z)$ (HMM state sequence of training data)
  2. Update of posterior distribution $Q(\lambda)$ (model parameters)
  3. Update of posterior distribution $Q(z)$ (HMM state sequence of synthesis data)
  4. Generation of synthesis data
  ⇒ The generated synthesis data is fed back into the posterior updates, and the loop is iterated
Synthesis data

- Synthesis data can include several utterances
  - Synthesis data impacts the posterior distributions
  - How many utterances should be generated in one update step?
- Two methods are discussed:
  - Batch-based method: update the posterior distributions for several test sentences
  - Sentence-based method: update the posterior distributions for one test sentence
Update method (1/2)

- Batch-based method
  - Generated synthesis data of all test sentences is used to update the posterior distributions
  - Synthesis data of all test sentences is generated using the same posterior distributions

[Diagram: sentences 1…N are all generated from one shared set of posterior distributions]
Update method (2/2)

- Sentence-based method
  - Generated synthesis data of one test sentence is used to update the posterior distributions
  - Synthesis data of each test sentence is generated using different posterior distributions

[Diagram: each of sentences 1…N is generated from its own set of posterior distributions]
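The two schedules can be contrasted in a toy Python sketch, with fit() standing in for posterior estimation and generate() for parameter generation; a "sentence" is just a fixed-length frame sequence here, and all names are illustrative.

```python
import numpy as np

# Batch-based vs. sentence-based updates in miniature.

rng = np.random.default_rng(1)
train = rng.normal(loc=2.0, scale=1.0, size=450)
n_sentences, n_frames, n_iters = 53, 10, 3

def fit(data):
    return data.mean()                  # stand-in for posterior estimation

def generate(model, n):
    return np.full(n, model)            # stand-in for parameter generation

# Batch-based: one shared posterior, updated from the data of all sentences.
model = fit(train)
for _ in range(n_iters):
    batch = [generate(model, n_frames) for _ in range(n_sentences)]
    model = fit(np.concatenate([train, *batch]))
batch_outputs = [generate(model, n_frames) for _ in range(n_sentences)]

# Sentence-based: each sentence keeps its own posterior and its own updates.
sentence_outputs = []
for _ in range(n_sentences):
    model_s = fit(train)
    for _ in range(n_iters):
        sent = generate(model_s, n_frames)
        model_s = fit(np.concatenate([train, sent]))
    sentence_outputs.append(generate(model_s, n_frames))

print(len(batch_outputs), len(sentence_outputs))  # 53 sentences from each scheme
```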
Outline

- Bayesian speech synthesis
  - Variational Bayesian method
  - Speech parameter generation
- Problem & Proposed method
  - Approximation of posterior
  - Integration of training and synthesis processes
- Experiments
- Conclusion & Future work
Experimental conditions

Database             ATR Japanese speech database B-set
Speaker              MHT
Training data        450 utterances
Test data            53 utterances
Sampling rate        16 kHz
Window               Blackman window
Frame size / shift   25 ms / 5 ms
Feature vector       24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM                  5-state left-to-right HSMM without skip transitions
Iteration process

Update of posterior distributions and synthesis data:
  1. Posterior distributions are estimated from training data
  2. Initial synthesis data is generated
  3. Context clustering using training data and generated synthesis data
  4. Posterior distributions are re-estimated from training data and generated synthesis data (number of updates is 5)
  5. Synthesis data is re-generated
  6. Steps 3, 4, and 5 are iterated
Comparison of number of updates

              Data for estimation of posterior distributions
Iteration 0   450 training utterances
Iteration 1   450 utterances + 1 utterance generated in Iteration 0
Iteration 2   450 utterances + 1 utterance generated in Iteration 1
Iteration 3   450 utterances + 1 utterance generated in Iteration 2
Experimental results

[Figure: comparison of the number of updates]
Comparison of Batch and Sentence

Method             Training & generation   Data for estimation of posterior distributions
ML                 ML                      450 utterances
Baseline (Bayes)   Bayes                   450 utterances
Batch (Bayes)      Bayes                   450 + 53 generated utterances
Sentence (Bayes)   Bayes                   450 + 1 generated utterance (53 different posterior dists.)
Experimental results

[Figure: comparison of the batch-based and sentence-based methods]
Conclusions and future work

- Integration of training and synthesis processes
  - Generated synthesis data is used for estimating the posterior distributions
  - Posterior distributions and synthesis data are updated iteratively
  - The proposed method outperforms the baseline method
- Future work
  - Investigate the relation between the amounts of training and synthesis data
  - Experiments on various amounts of training data
Thank you