
>> Li Deng: Okay. Welcome to this presentation. It's my great pleasure to welcome
Professor Zhen-Hua Ling to give us a lecture on HMM-based speech synthesis.
Professor Ling currently teaches at the University of Science and Technology of China, and he has a
very strong group over there doing speech synthesis. He recently
arrived at the University of Washington to spend one year working on various aspects of
speech and perhaps some multimedia. So I took the opportunity to invite him to introduce
speech synthesis here, so hopefully you will enjoy his talk. Okay, I'll give the floor to
Professor Ling.
>> Zhen-Hua Ling: Thank you. Thank you, Li, and good morning everyone here and hello
to listeners over the internet. It is a great honor to be here and talk about speech synthesis
technology.
My name is Zhen-Hua Ling. I am currently a visiting scholar at the University of Washington,
and I come from USTC in China. Today the topic of my presentation is HMM-based speech
synthesis: its fundamentals and recent advances.
This is the outline of my presentation. First I'll give a review of the state of the art in
HMM-based speech synthesis techniques, including some background
knowledge, its basic techniques, and some examples of its flexibility in controlling the
voice characteristics of synthetic speech.
Then I will introduce some recent advances of this method which were developed by
our research group. I'll give three examples: the first is the articulatory control of HMM-based
speech synthesis, the second is minimum KL divergence based parameter generation, and the third
one is hybrid unit selection speech synthesis. Okay.
So I'll start with some background introduction. What is speech synthesis? It is defined
as the artificial production of human speech. In recent years, speech
synthesis has become one of the key techniques for creating intelligent human-machine
interaction.
If the input to a speech synthesis system is in the form of text, we also call it text-to-speech,
or TTS for short. It can be considered as a kind of, how to say, inverse function of
automatic speech recognition, because in automatic speech recognition we want to convert
speech to text, but in TTS we do the inverse.
This figure shows the general, how to say, diagram of a TTS system. It is
composed of two main modules. The first one is text analysis. This module converts the input
text into some intermediate linguistic representation; for example, we extract the phoneme
sequence from the input text, the [indiscernible], and some other things. The next module is
waveform generation. With this module we produce speech according to this kind of
linguistic representation.
We also name these two modules the front end and the back end of a TTS system. In this
talk I will focus on the back-end techniques of a TTS system; that means
when I mention speech synthesis in the following introduction it only refers to the back end of
a TTS system.
So, in order to realize a speech synthesis system, many different kinds of approaches have
been proposed in history. Before the 1990s, rule-based formant synthesis was the dominant
method. Here formant means the acoustic features which describe the resonant
frequencies of the speech spectrum.
In this method the formant features of each phonetic unit are designed by human experts
manually, given some rules, and finally the synthetic speech is produced using a
source-filter model.
Then after the 1990s, corpus-based concatenative speech synthesis became more and more
popular. Here corpus means a large database with recorded speech waveforms and
label information. In this method, we concatenate segments of speech
waveform to construct an entire sentence.
There are different kinds of implementation. In some methods, we have only one instance,
one concatenative unit, for each target unit; that is the single-inventory method. In other
methods we have many more instances, many candidates, for each target
unit, which means we need to do some unit selection to pick the most appropriate units to
synthesize the target sentence.
Then in the middle of the 1990s a new method was proposed, that is
corpus-based statistical parametric synthesis. In this method, we estimate acoustic models automatically from the large corpus. Then at synthesis time, we
predict the acoustic features from the trained acoustic models and then use
some kind of source-filter model to reconstruct the waveform.
So this method provides more flexibility compared with the previous waveform
concatenation methods. In most cases, the hidden Markov model is used as the statistical model.
So we commonly name this method HMM-based speech synthesis, or HTS for short,
which is read as H-Triple-S, as you can see in the letters here. So in this talk, I will introduce
this kind of method, including some basic techniques and recent advances of HMM-based
speech synthesis.
Before I introduce the basic techniques of HMM-based speech synthesis, I'd like
to show the overall architecture of this method. We can see
that the HMM-based method can be divided into two parts; the first one is the training part, the second
is the synthesis part.
The training part is very similar to HMM-based automatic speech recognition. We first
extract acoustic features, including the excitation parameters and spectral parameters, from
the speech waveforms recorded in the database. Then we train a set of models using the
extracted parameters and the label information of the database.
At synthesis time the input text first passes through a text analysis module to get this kind
of label information, which is just the linguistic representation we mentioned above. Then we
generate the acoustic parameters, such as the excitation parameters and the spectral
parameters, from this HMM model set according to the label information. Finally these
parameters are sent into a source-filter model to reconstruct the speech signal. So this is how
HMM-based speech synthesis works.
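As a rough structural sketch of the two parts just described (all function names below are hypothetical placeholders, not the actual HTS toolkit API):

```python
# Minimal sketch of the HTS-style pipeline: training extracts vocoder features and
# trains an HMM set; synthesis turns text into labels, generates parameters from the
# HMMs, and reconstructs a waveform. All callables are placeholders supplied by the user.

def train_hts(waveforms, labels, extract_features, train_hmms):
    """Training part: vocoder analysis of each recording, then HMM estimation."""
    features = [extract_features(w) for w in waveforms]   # excitation + spectral params
    return train_hmms(features, labels)                   # context-dependent HMM set

def synthesize_hts(text, hmm_set, text_analysis, generate_parameters, vocoder_synthesis):
    """Synthesis part: text -> labels -> generated parameters -> waveform."""
    context_labels = text_analysis(text)                   # front-end linguistic labels
    params = generate_parameters(hmm_set, context_labels)  # excitation + spectral params
    return vocoder_synthesis(params)                       # source-filter reconstruction
```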
There are roughly four techniques that are very important to HMM-based speech synthesis. I'll give a more
detailed introduction to them one by one. The first technique is speech vocoding. Speech
vocoding means that at training time we extract acoustic features from the speech waveform
and at synthesis time we reconstruct the speech waveform from the synthesized speech
features.
The speech vocoding techniques are motivated by the speech production mechanism of human
beings. We can look at this figure here.
When people speak, first there is airflow produced by the lungs; then this airflow
passes through the vocal cords to generate the source signal of the speech. When
we pronounce a voiced segment, for example a vowel, our vocal cords vibrate periodically, and so we
can get a pulse train as the sound source.
When we speak some unvoiced speech segments, for example consonants, our vocal cords
do not vibrate, so we get a noise-like sound source. This pulse train or noise signal
passes through the vocal tract and generates the final speech.
The function of the vocal tract is very similar to a filter; its frequency transfer
characteristics decide which specific phoneme we are pronouncing.
So mathematically, we can construct the vocoder using this kind of
source-filter model. It consists of a source excitation part and a vocal tract resonance part.
In the source excitation part, we have the signal e(n), which is either a pulse train or white noise.
In the vocal tract resonance part we use a linear time-invariant filter to represent
the vocal tract characteristics of each speech frame. So the final speech x(n) is the
convolution between these two signals, the excitation e(n) and the filter impulse response.
Furthermore, we extract some spectral parameters to represent this vocal tract
filter, or its Fourier transform.
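As a toy illustration of this source-filter idea (the filter below is an arbitrary decaying impulse response standing in for a real vocal-tract estimate):

```python
# Minimal numpy sketch of the source-filter model described above: a pulse-train
# excitation for voiced frames (or white noise for unvoiced ones) convolved with a
# vocal-tract impulse response.
import numpy as np

fs = 16000                      # sampling rate (Hz)
f0 = 120.0                      # fundamental frequency for the voiced example
n = np.arange(int(0.025 * fs))  # one 25 ms frame

# Source e[n]: impulse train at the pitch period for voiced, white noise for unvoiced.
period = int(fs / f0)
voiced_excitation = np.zeros(len(n))
voiced_excitation[::period] = 1.0
unvoiced_excitation = np.random.randn(len(n))

# Vocal-tract part: a toy impulse response standing in for the LTI filter.
h = np.exp(-np.arange(64) / 10.0)

# Final speech x[n] = e[n] convolved with h[n].
voiced_frame = np.convolve(voiced_excitation, h)[: len(n)]
unvoiced_frame = np.convolve(unvoiced_excitation, h)[: len(n)]
```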
Here we can use different kinds of spectral parameters, according to the different kinds of
speech spectrum model we use. For example, we can use an autoregressive model
to represent the speech spectrum.
In this case the spectral parameters c(m) become the linear prediction
coefficients. We can also use an exponential model; in this model
the parameters c(m) become the cepstral coefficients.
Anyway, we can estimate these parameters by some mathematical criterion. In this
equation here, the observation vector O means the speech waveform of each speech frame, and c is
the coefficient vector we want to estimate.
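As a rough sketch of the cepstral option, here is a plain real-cepstrum computation for one frame; the actual mel-cepstral analysis used in HTS additionally warps the frequency axis and fits the coefficients under an ML criterion, which is omitted here:

```python
# Sketch of extracting cepstral coefficients for one frame: take the log magnitude
# spectrum and inverse-transform it, keeping the first few coefficients as a smooth
# description of the spectral envelope.
import numpy as np

def real_cepstrum(frame, order=24):
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n=1024)
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # avoid log(0)
    cepstrum = np.fft.irfft(log_mag, n=1024)
    return cepstrum[: order + 1]                    # c(0) .. c(order)

frame = np.random.randn(400)                        # stand-in for a 25 ms frame at 16 kHz
c = real_cepstrum(frame)
```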
This figure summarizes how the speech vocoder works at training time and at synthesis
time. At training time, we extract the acoustic features, such as the
F0, the unvoiced/voiced label information, and the mel-cepstrum, from the original speech
frame by frame. Then these speech parameters are modeled by hidden Markov models in
the next step, the modeling step.
At synthesis time all these parameters are generated from the trained acoustic
models and sent into the speech synthesizer to reconstruct the speech signal. So that means
we use the speech vocoder to extract acoustic features at training time and to reconstruct
the speech waveform at synthesis time.
The next technique that's very important to HMM-based speech synthesis is speech
parameter modeling and generation. Here, the hidden Markov model is used as the acoustic
model to describe the distribution of the acoustic features.
This figure shows an example of a hidden Markov model which has three states and a left-to-right
topology. Here O and Q stand for the observed acoustic feature sequence and the hidden
state sequence for each sentence, b_q(o_t) is the state output
probability of each state, and a is the state transition probability.
The model training of HMM-based speech synthesis is just the same as for HMM-based
automatic speech recognition. We estimate a hidden Markov model for each phonetic
unit, for example for each phoneme. We can do it by following the maximum likelihood
criterion and using the EM algorithm to update the model parameters iteratively.
Then, at synthesis time, assume the hidden Markov model for the target sentence is given.
Our task is to generate speech parameters from this model. The method we use is to
determine the speech parameter vector sequence by maximizing the output
probability in this way. Here, lambda is the HMM of the target sentence and O is the feature
sequence we want to generate.
We first approximate this equation by using only the most probable state sequence instead
of summing over all possible state sequences. Then we convert this into a two-step maximization problem. The
first step is to predict the hidden state sequence independently of the acoustic feature
sequence.
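Written out, the approximation just described replaces the sum over all state sequences with the single best sequence and then splits the search into two steps (notation follows the talk: lambda is the sentence HMM, O the acoustic feature sequence, Q a state sequence):

```latex
\hat{O} = \arg\max_{O} P(O \mid \lambda)
        = \arg\max_{O} \sum_{Q} P(O, Q \mid \lambda)
  \approx \arg\max_{O} \max_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda),
\qquad
\hat{Q} = \arg\max_{Q} P(Q \mid \lambda), \quad
\hat{O} = \arg\max_{O} P(O \mid \hat{Q}, \lambda).
```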
We have introduced that in our method we use HMMs with a left-to-right topology; that
means determining the state sequence of the hidden Markov model is the
same as determining the duration of each state. So at training time, we estimate state
duration distributions for each state, which is this probability function here. Then, the state sequence Q can
be predicted for the input sentence by maximizing this probability. If we use a single Gaussian
to describe these duration distributions, the optimal duration will just be the mean of this Gaussian. So it's
quite simple.
The next step is that, when the optimal state sequence Q has been determined, we want to
determine the acoustic feature sequence O. This is also based on the maximum output probability
criterion.
Have a look at this figure, which shows an example of parameter generation from a
sentence hidden Markov model. Here we can see that each vertical dotted line represents a
frame, and the interval between two solid lines, for example between this one and
this one, is a state. So we can see that different states have different durations. The
red line shows the mean of the Gaussian distribution for the state output probability and the
pink area shows its variance.
It's obvious that if we want to generate the acoustic feature sequence O which maximizes the output
probability, O will become a sequence of mean vectors, just this stepwise shape.
But this will cause a problem, which is the discontinuity at state boundaries. If we just
reconstruct the speech signal using this kind of output, the quality of the synthetic speech will
degrade because of the discontinuity in the speech waveform. So we solve this problem by
introducing some dynamic features into the acoustic model.
Here, c_t is the static acoustic feature at frame t, and we can calculate delta c_t and delta-square
c_t as the delta and delta-delta features. The delta and delta-delta features correspond to
the first and second derivatives of the static acoustic feature sequence, and they can
be considered as a kind of linear combination of the static features around frame t.
Then, we can put the static and dynamic features together, for example in this way, to get the
complete speech parameter vector o_t.
If we look at the whole sentence, we can see that the complete acoustic feature sequence O
can be considered as a kind of linear transform of the static feature sequence C with the
transform matrix W, which is determined by how we define the dynamic features. Then we can
rewrite our parameter generation criterion in this way.
So that means we want to find the optimal static feature sequence
C that maximizes the output probability of the complete acoustic feature sequence O, which
contains both the static and the dynamic components. By setting the derivative of this
function with respect to C to zero, we can get a solution to this problem. We can solve for C
by solving this group of linear equations.
Its computational complexity is not very high, because W is a very sparse matrix here.
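A minimal sketch of this generation step, assuming a one-dimensional static feature with a single delta stream, diagonal variances, and simple boundary handling (real implementations use the exact delta windows and banded or sparse solvers):

```python
# Sketch of parameter generation under dynamic-feature constraints: W stacks the
# identity (static) and a first-difference operator (delta); the closed-form solution
# solves (W' S^-1 W) c = W' S^-1 mu, where mu and S are the stacked state means and
# (diagonal) variances.
import numpy as np

def mlpg_1d(mu, var):
    """mu, var: per-frame means/variances of [static, delta], shape (T, 2)."""
    T = mu.shape[0]
    W_static = np.eye(T)
    W_delta = np.zeros((T, T))
    for t in range(T):                       # delta(t) = 0.5 * (c[t+1] - c[t-1])
        W_delta[t, max(t - 1, 0)] -= 0.5
        W_delta[t, min(t + 1, T - 1)] += 0.5
    W = np.vstack([W_static, W_delta])                       # (2T, T)
    m = np.concatenate([mu[:, 0], mu[:, 1]])                  # stacked means
    p = np.concatenate([1.0 / var[:, 0], 1.0 / var[:, 1]])    # stacked precisions
    A = W.T @ (p[:, None] * W)                                # W' S^-1 W
    b = W.T @ (p * m)                                         # W' S^-1 mu
    return np.linalg.solve(A, b)                              # smooth static trajectory

# Example: stepwise state means (1.0 then 3.0) produce a smoothed trajectory.
mu = np.zeros((20, 2))
mu[:10, 0], mu[10:, 0] = 1.0, 3.0
var = np.ones((20, 2))
c = mlpg_1d(mu, var)
```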
This figure shows what will happen if the dynamic features are used.
We can see that under the constraint of the dynamic features, the generated acoustic feature trajectory,
here the blue line, becomes much smoother compared to the stepwise
mean sequence, the red line.
Subjective evaluation results have proved that this kind of parameter generation is very effective in
improving the speech quality and the naturalness of synthetic speech by reducing the
discontinuity at state boundaries.
Okay, the --
>>: Okay. Is that the [inaudible]. Is there any property of the generated curve, for example,
[inaudible]?
>> Zhen-Hua Ling: Yeah.
>>: [inaudible] so what part of the solution there that can guarantee that --
>> Zhen-Hua Ling: You mean no fluctuation?
>>: Right.
>> Zhen-Hua Ling: Because we have the first derivative --
>>: [inaudible]
>> Zhen-Hua Ling: Yeah, because here, you see, if we actually use only the first derivative, there may
be some fluctuation.
>>: Oh, I see, this is
>> Zhen-Hua Ling: Yeah, this is how we calculate the dynamic features, because the first
derivative only uses the next frame.
>>: I see.
>> Zhen-Hua Ling: So there will be some fluctuation. But if we use the first order and the second
order together --
>>: Okay.
>> Zhen-Hua Ling: -- we can guarantee the solution is without fluctuation.
So the next technique that's very important for HMM-based speech synthesis is the MSD
HMM for F0 modeling. I mentioned above that in speech vocoding we extract not only the
spectral features but --
>>: Okay, go back to the previous slide.
>> Zhen-Hua Ling: Okay.
>>: So what is the variance for the [inaudible] of the streams together?
>> Zhen-Hua Ling: Yeah. That --
>>: It's three dimensions, not just one?
>> Zhen-Hua Ling: No, no.
>>: Okay, I see, okay. So this has all the variances --
>> Zhen-Hua Ling: Yeah yeah yeah.
>>: [inaudible].
>> Zhen-Hua Ling: So in HMM-based synthesis, it is also very important to predict the F0
feature. F0 means the fundamental frequency; it decides the pitch of the synthetic speech.
This figure shows an example trajectory of observed F0s. We can see that there is a
very important property of F0, which is that F0 only exists in the voiced segments of speech, for
example this segment. For unvoiced segments, we cannot observe F0 at all.
So that means we cannot just apply a continuous or discrete distribution to model the F0 directly,
because in some parts of the F0 trajectory there are no observations.
Here a new form of hidden Markov model has been proposed, which is named the multi-space
probability distribution hidden Markov model, to solve this problem. This figure shows the
structure of a multi-space probability distribution hidden Markov model. We can see that each state
has two or more distributions to describe the acoustic features. It's very similar to
the Gaussian mixture hidden Markov model, for example using a Gaussian mixture model for each
state.
But the difference is that for each state here, in the MSD-HMM, we can have different dimensionalities
for different mixture components. For example, this component may have dimension one, and this may
have dimension two. Some distributions have zero dimension, corresponding to a discrete value.
So it provides more flexibility than the Gaussian mixture hidden Markov model.
Using this kind of model structure for F0, we use two spaces. The first space is a one-dimensional
continuous value which is modeled by a single Gaussian distribution. The second space has
zero dimension, which means it just represents the unvoiced space without
observations.
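A toy sketch of this two-space state output probability, with hypothetical parameter values:

```python
# Sketch of the two-space MSD state output probability described above: a voiced
# space with a one-dimensional Gaussian over log F0 and a zero-dimensional unvoiced
# space, mixed by the state's voiced/unvoiced weights.
import math

def msd_state_likelihood(obs, w_voiced, mean_lf0, var_lf0):
    """obs is a log-F0 value for voiced frames, or None for unvoiced frames."""
    if obs is None:                                   # unvoiced observation
        return 1.0 - w_voiced                         # weight of the zero-dimension space
    gauss = math.exp(-0.5 * (obs - mean_lf0) ** 2 / var_lf0) / math.sqrt(2 * math.pi * var_lf0)
    return w_voiced * gauss                           # voiced space contribution

print(msd_state_likelihood(5.0, 0.9, 4.9, 0.01))      # voiced frame
print(msd_state_likelihood(None, 0.9, 4.9, 0.01))     # unvoiced frame
```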
>>: So this is a final model?
>> Zhen-Hua Ling: Yeah. Actually, yeah. Only we use a weight to represent the
unvoiced space. The voiced and unvoiced weights, how to say, stand for the probability
of the current state being voiced or being unvoiced. They also need to be estimated
from the training data.
>>: The voicing that you get from the F0?
>> Zhen-Hua Ling: Yeah.
>>: So, if there's no voice, [indiscernible]?
>> Zhen-Hua Ling: Yeah. Actually, for each state we have the two distributions, and
whether it's voiced or unvoiced is decided by the weight. For
example, for a vowel, this weight will be very close to one, and this one very close to zero.
So it provides this kind of flexibility.
>>: So -- so in that case, voiced and unvoiced detection is automatically built into the [inaudible].
>> Zhen-Hua Ling: Uh --
>>: [inaudible] do you have the separate voicing?
>> Zhen-Hua Ling: For the training data -- for the training data, the voiced and unvoiced decisions
need to be extracted for each frame at --
>>: Oh, the training segment.
>> Zhen-Hua Ling: Yeah, but, I mean, no. In the feature extraction, we need to decide whether
each frame is voiced or unvoiced. But in the model training we do not need to
decide whether each state is voiced or unvoiced.
>>: I see. Okay.
>> Zhen-Hua Ling: We can assume [indiscernible] the weights are trained automatically.
>>: I see. Okay.
>> Zhen-Hua Ling: This figure shows the final complete acoustic feature vector we
use in our model training, which is composed of two parts. The first is the spectrum part, the
second is the excitation part. The spectrum part consists of, for example, the static, delta, and
delta-delta mel-cepstrum coefficients. We use a multi-dimensional single Gaussian to model
it for each HMM state. The excitation part consists of the static, delta, and delta-delta log F0. For
each dimension we use a two-space MSD distribution to model the F0.
>>: I see. So altogether, each -- each state has several streams?
>> Zhen-Hua Ling: Yeah, altogether four streams. One is for the mel-cepstrum and the others are for
the different components of the excitation.
>>: [inaudible] as if they are separate from each other.
>> Zhen-Hua Ling: Yes.
>>: [inaudible] say that is why we allow with it --
>> Zhen-Hua Ling: Yeah, they are. [indiscernible] will align. That [indiscernible]. Yeah, yeah.
There is no dependency between these streams.
So the last technique that is very important for HMM-based synthesis is the context-dependent
model training and clustering. In HMM-based speech synthesis we want to estimate the
hidden Markov models as accurately as possible to describe the distribution of the acoustic
features.
So that means we need to take into account the factors of speech variation, and here we
use the context information to represent these factors. That means we will train
different HMMs for phonetic units with different contexts, so we get this kind of
context-dependent hidden Markov model.
The simplest form of context is the phoneme identity; in other words, we train one hidden
Markov model per phoneme. For example, for each phoneme in English, we can train a separate
hidden Markov model.
But we can go further. For example, here we can see the phoneme sequence of a sentence. There
are several instances of the same vowel, but we can use different models to represent
each sample, because their surrounding, their neighboring, phonemes are
different. That will lead to the [indiscernible] model.
>>: So CL means function --
>> Zhen-Hua Ling: [inaudible] It's a Japanese.
>>: Oh, Japanese okay.
>> Zhen-Hua Ling: So actually in HMM-based speech synthesis we need a very large
amount of context information, which can be at different levels. At the phoneme level we need to
consider the current phoneme and the neighboring phonemes.
At the syllable level, we need to consider the stress of the syllable or something like that. At the word level,
maybe we need to consider the part of speech or some other context information. At the
phrase level we need to consider the phrase boundary or some other things.
So if we combine all of this kind of context information together, we get a very huge
number of combinations. It definitely leads to a data sparsity problem. That means for each
context-dependent model, we can only find one example, one sample, or even
no sample, in the training database.
So how do we solve this problem? A simple method is that we allow the
context-dependent hidden Markov models to share their parameters if their context
information is similar. In order to decide the strategy of model sharing, the decision-tree based state
clustering method has been widely used.
This is a kind of top-to-bottom method. At first, all the states of the hidden
Markov models share their distribution parameters. Then each node will be
split automatically by a kind of optimal question.
Here, the questions for each node are selected from a question set which is
predefined according to the context definitions of the different language. And the optimal
question is determined by the [indiscernible] criterion under consideration.
So after the decision tree has been constructed, all the context-dependent hidden Markov
models in the same leaf node share their parameters. For example, all these states
[indiscernible] will have the same mean and the same variance.
At synthesis time, the input text first passes through a text analysis module to get the context
information of the target sentence. Then this kind of context information is used to answer the
questions in the tree, and finally we reach a leaf node of the
decision tree, which gives the distribution parameters of each synthesis state.
Then we can connect the states together one by one to get the HMMs for each phoneme, and
concatenate all the phonemes to get the HMM of the whole sentence. Once we have the HMM of the whole
sentence, we can use the parameter generation algorithm which we have introduced before to produce
the speech parameters.
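A minimal sketch of this leaf lookup at synthesis time; the questions and context fields below are made-up placeholders, not the real HTS question set:

```python
# Sketch of descending a clustered decision tree with the context labels of one
# state until a leaf is reached; the leaf holds the shared Gaussian parameters.

class Node:
    def __init__(self, question=None, yes=None, no=None, leaf=None):
        self.question, self.yes, self.no, self.leaf = question, yes, no, leaf

def find_leaf(node, context):
    while node.leaf is None:                 # descend until a leaf is reached
        node = node.yes if node.question(context) else node.no
    return node.leaf                         # (mean, variance) shared by this cluster

# Tiny example tree: "is the current phoneme a vowel?" then "is the syllable stressed?"
tree = Node(
    question=lambda c: c["phoneme"] in {"a", "e", "i", "o", "u"},
    yes=Node(question=lambda c: c["stressed"],
             yes=Node(leaf=([1.2], [0.1])), no=Node(leaf=([0.9], [0.1]))),
    no=Node(leaf=([0.2], [0.3])),
)
print(find_leaf(tree, {"phoneme": "a", "stressed": True}))
```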
>>: Is this consistent? So if the [inaudible] of a previous state --
>> Zhen-Hua Ling: Yeah.
>>: -- is chosen, then the next state should have the same context. So how do you ensure --
>> Zhen-Hua Ling: I think because this context information is decided by the input sentence --
it comes from the text analysis results of the input sentence -- it
should be consistent, I believe --
>>: Inconsistent.
>> Zhen-Hua Ling: Okay.
>>: [indiscernible]
>> Zhen-Hua Ling: Yeah, yeah.
>>: Or in that case.
>> Zhen-Hua Ling: In that case, yes.
And the next thing I want to mention here is that in HMM-based speech synthesis we
use multiple acoustic features; for example, we build statistical models for the mel-cepstrum, for
F0, and for the state durations, and different kinds of acoustic features are affected by
different kinds of context information. For example, the phoneme identity
will be more important for the mel-cepstrum, but the stress or accent information is more important
for F0.
So in our implementation we adopt stream-dependent decision trees. That
means for the mel-cepstrum, for F0, and for the state duration we have different decision trees, to
have better modeling of the different kinds of acoustic features.
From the above introduction, you will see that a very important advantage of the HMM-based
speech synthesis techniques is their flexibility. Different from the waveform concatenation methods, in
the HMM-based method we do not use the speech waveform directly at synthesis time. We just predict
acoustic features from an acoustic model, which means we can control the voice
characteristics of synthetic speech flexibly by just modifying or changing the parameters of the
acoustic model.
Here I will show you several examples of the flexibility of HMM-based speech synthesis. The
first one is what we call adaptation. That means we can build a synthesis system for a
target speaker or a different speaking style with only a small amount of training data from this
speaker or style.
The techniques we use are the model adaptation techniques from [indiscernible] speech recognition, and
they work well in TTS. Here, I can show you some examples of the speaker adaptation
results. But first I'd like to play the original voice of the recorded speech.
>>: "Author of The Danger Trail, Phillip Steels, etc."
>> Zhen-Hua Ling: This is from a female speaker. Then we can train an average voice
model using a multi-speaker database; it sounds like this.
>>: "I wish you were down here with me."
>> Zhen-Hua Ling: So maybe it doesn't sound like anyone in particular -- just an average voice. If we have ten
sentences from the target speaker, we can use this amount of data to adapt the average voice
model, generate an adapted model, and synthesize speech like that.
>>: "I wish you were down here with me."
>> Zhen-Hua Ling: We can perceive that it is much closer to the original voice of the
female speaker. If we have more sentences we can do it better. This is with
[indiscernible].
>>: "I wish you were down here with me;" "I wish you were down here with me."
>> Zhen-Hua Ling: Only one thousand.
>>: "I wish you were down here with me."
>> Zhen-Hua Ling: You can compare to the original voice.
>>: "Author of The Danger Trail, Phillip Steels etc."
>> Zhen-Hua Ling: That's very close to the original voice, compared to the average voice.
So the next example is what we call interpolation. That means if we have some
representative hidden Markov model sets, we can interpolate their model parameters to
gradually change the speaker and the speaking style.
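A toy sketch of the interpolation for one state's output mean (real systems interpolate whole matched model sets, and may also interpolate covariances and duration models):

```python
# Sketch of speaker interpolation: combine the state output means of two matched
# HMM sets with an interpolation ratio.
import numpy as np

def interpolate_means(mean_a, mean_b, ratio):
    """ratio = 0 gives speaker A, ratio = 1 gives speaker B."""
    return (1.0 - ratio) * np.asarray(mean_a) + ratio * np.asarray(mean_b)

male_mean = np.array([1.0, -0.2, 0.5])       # toy mel-cepstral mean for one state
female_mean = np.array([1.4, 0.1, 0.3])
for r in (0.0, 0.25, 0.5, 0.75, 1.0):        # gradual male-to-female transition
    print(r, interpolate_means(male_mean, female_mean, r))
```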
For example, we can do a speaker interpolation to realize a transition from a male speaker
to a female speaker. Here I can show you how the synthetic speech is affected by
the interpolation. This is the male speaker's synthetic voice.
>>: "I wish you were down here with me."
>> Zhen-Hua Ling: This is a female speaker's.
>>: "I wish you were down here with me."
>> Zhen-Hua Ling: If we use some interpolation ratios we can perceive the transition effect of
the voice.
>>: "I wish you were down here with me;" "I wish you were down here with me;" "I wish you
were down here with me;" "I wish you were down here with me."
>> Zhen-Hua Ling: And if you have many representative HMM sets, it will cause a problem:
it is very difficult to set the --
>>: So interpolation means you interpolate, like, all the means for each state?
>> Zhen-Hua Ling: Yes.
>>: So how do you know which one -- oh yeah, because you have the [inaudible] model
already, so the corresponding model --
>> Zhen-Hua Ling: Yeah, it's actually the same context information we use in the interpolation.
>>: I see. And how about the pitch?
>> Zhen-Hua Ling: We can --
>>: The mean of the pitch?
>> Zhen-Hua Ling: In this example we can also perceive the transition of the pitch; the male has a lower
pitch and the female has a higher pitch, so it becomes higher and higher, yeah.
The next is the eigenvoice method. The motivation of this method is that when we have a large
number of representative hidden Markov model sets, it's very difficult to set the interpolation ratio for each
HMM set. So we just treat each HMM set as a supervector and apply principal component analysis
to reduce the dimensionality of the speaker space; then we can control the voice
characteristics simply.
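A rough sketch of the eigenvoice idea, with toy random data standing in for real supervectors of stacked HMM means:

```python
# Sketch of the eigenvoice method: stack each speaker's HMM mean parameters into a
# supervector, run PCA over the supervectors, and control a new voice with a few
# weights in the reduced space.
import numpy as np

rng = np.random.default_rng(0)
supervectors = rng.normal(size=(20, 300))            # 20 speakers x stacked state means

mean_voice = supervectors.mean(axis=0)
_, _, vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
eigenvoices = vt[:3]                                  # keep the top 3 components

weights = np.array([0.5, -0.2, 0.1])                  # low-dimensional voice control
new_voice = mean_voice + weights @ eigenvoices        # supervector of the new voice
```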
The last example is the multiple regression hidden Markov model. In this method we can assign
an intuitive meaning to each dimension of the eigenspace. For example, in emotional
speech synthesis we can give each dimension of the space some meaning: for
example, this one means sad, this one means joyful, this one means rough. Then we can
control the emotion category of synthetic speech using a three-dimensional emotion
vector. It is much easier to control the emotion of the synthetic
speech.
Okay. Before I finish the first part of this talk, I'd like to recommend some resources which
are very useful for HMM-based speech synthesis. The first one is the HTS toolkit for
HMM-based speech synthesis, which can be obtained from this website. This is the software
platform; it's released as a patch code for HTK. It also provides some demo scripts,
so if you have your own training data, it's very easy to build a speech synthesizer on your
own.
The next resource is the HTS slides, which are a very useful tutorial for researchers who are interested in
this topic. Some of the slides here are also adapted from this material.
Okay.
In the second part of my talk I'd like to introduce some recent advances in hidden Markov
model based speech synthesis, which were developed by our research group.
The first topic I'd like to introduce is the articulatory control of HMM-based speech synthesis.
We have introduced that in conventional hidden Markov model based speech
synthesis, we train context-dependent hidden Markov models for acoustic features.
These acoustic features are extracted from the speech waveforms. At synthesis time, we use this
kind of parameter generation to predict the acoustic features and to reconstruct the waveform.
There are many advantages of this method. For example, it's very flexible, it's trainable, and
we can do some model adaptation to change the characteristics of synthetic speech.
But there are also disadvantages of this method. One limitation of HMM-based speech
synthesis comes from the fact that it is a data-driven approach. This means that the
characteristics of synthetic speech are strongly dependent on the training data available, and it's
very difficult to integrate phonetic knowledge into the system building.
For example, if we want to build the voice of a child, that means we need to collect
some data from a child in advance, no matter whether you are doing speaker-dependent training or
adaptation. But on the other hand, we may have some phonetic knowledge which can
describe the difference between an adult's voice and a child's voice: for a child, the vocal tract
is much shorter, and the formants are much higher. But in HMM-based speech synthesis, it's
very difficult to integrate this kind of phonetic knowledge into the system building.
So, we use another kind of feature here, which we name articulatory features. Articulatory features
describe the movements of the articulators when people speak. They can provide a simple
explanation for speech characteristics and they are very convenient for representing
phonetic knowledge.
This kind of articulatory feature can be captured by some modern signal acquisition
techniques. For example, in our experiments we use electromagnetic articulography to capture this
kind of articulatory feature. In this technique some very tiny sensors are attached to the tongue and
the lips of the speaker during pronunciation. Then we can record the real-time positions of
all these sensors while the speaker is talking.
Because of these advantages of articulatory features, we proposed a method to integrate
articulatory features into hidden Markov model based speech synthesis to further improve its
flexibility.
At training time, we estimate a joint distribution of the acoustic and articulatory features at
each frame. Here X denotes the acoustic features and Y denotes the
articulatory features. If we look at the state output probability, we model the cross-stream
dependency between the acoustic features and the articulatory features explicitly. This is because,
from the speech production mechanism, we know that the speech waveform is always caused
by the movement of the articulators. So we model this type of dependency.
At synthesis time, the process is a little bit more complex than the conventional method.
When the unified acoustic-articulatory model has been trained, we first generate the articulatory features at
synthesis time, then we design an articulatory control module by integrating the phonetic
knowledge, and then we use it to affect the generation of the acoustic features.
In terms of the cross-stream dependency modeling, we have proposed two different kinds of
methods. The first one is the state-dependent transform. Here, this conditional
distribution just describes the relationship between acoustic and articulatory features at
each HMM state. We use a linear transform of the articulatory features y_t to determine
the mean of the distribution of the acoustic features.
This transform A is trained for each HMM state, so it has some disadvantages;
for example, it cannot adapt to the modified articulatory features, because at
synthesis time we will generate y_t first and then modify it.
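A toy sketch of this state-dependent conditional distribution, where the acoustic mean for a frame is a linear function A_q y_t + b_q of its articulatory features (shapes and values below are placeholders):

```python
# Sketch of the state-dependent dependency just described:
#   p(x_t | y_t, state q) = N(x_t ; A_q y_t + b_q, Sigma_q),
# so the predicted acoustic mean for a frame is a linear function of its articulatory
# features, with a separate transform per HMM state.
import numpy as np

def acoustic_mean_given_articulation(y_t, A_q, b_q):
    """Mean of the acoustic distribution for one frame in state q."""
    return A_q @ y_t + b_q

y_t = np.array([0.3, -0.1, 0.8])              # e.g. tongue/lip positions for one frame
A_q = np.eye(2, 3)                            # toy 2x3 state-dependent transform
b_q = np.array([0.5, -0.2])
print(acoustic_mean_given_articulation(y_t, A_q, b_q))
```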
>>: [inaudible] for the training time you need to simultaneously collect the [inaudible].
>> Zhen-Hua Ling: Yeah.
>>: The, investigation directly of the --
>> Zhen-Hua Ling: Yeah yeah.
>>: And that's a very expensive process.
>> Zhen-Hua Ling: Yeah, of course. It's very expensive, and difficult.
>>: So [inaudible] how do you generate, [inaudible] [inaudible] so do you articulate the
feature?
>> Zhen-Hua Ling: Yeah, that part of the technique is still -- we have not finished it yet.
>>: Okay, so --
>> Zhen-Hua Ling: Okay, we focus on a single speaker here for now. In the next step we are
trying to find some more feasible approach to do model adaptation for this type
of modeling.
>>: [inaudible] to be linear, but in practice a very very --
>> Zhen-Hua Ling: Yeah, actually here the A is state-dependent, so really it is a kind of
nonlinear, or, in other words, piecewise linear transform.
>>: This is the same model that I have been doing for a few years. It looks
just exactly the same as for recognition. So you infer -- you treat that Y as a
hidden variable and then you integrate it out, and then from that you can infer the
[indiscernible]. I'm very, very curious about what this is.
>> Zhen-Hua Ling: And we also improved on the previous model structure and adopted a
feature-space transform in this method. We first train a Gaussian mixture model, lambda G, to represent
the articulatory feature space. Then we train a group of linear transforms. But the linear
transform A is not estimated for each HMM state but for each mixture component of
the Gaussian mixture model. That means when we modify Y at synthesis time,
this posterior probability will change accordingly, and then it will affect the relationship
between Y and X, so we can get a better representation of this kind of relationship at synthesis
time.
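A rough sketch of this mixture-dependent mapping: the predicted acoustic mean mixes the per-component linear predictions by the posterior P(m | y), so modifying y also changes which transforms dominate (toy two-component model):

```python
# Sketch of the feature-space version: a GMM over the articulatory space, one
# transform (A_m, b_m) per mixture component, and a prediction that weights the
# component predictions by the posterior P(m | y).
import numpy as np

def gaussian_pdf(y, mean, var):
    d = y - mean
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def predict_acoustic(y, weights, means, variances, A, b):
    post = np.array([w * gaussian_pdf(y, m, v)
                     for w, m, v in zip(weights, means, variances)])
    post /= post.sum()                                     # P(m | y)
    return sum(p * (A_m @ y + b_m) for p, A_m, b_m in zip(post, A, b))

# Toy 2-mixture model over a 2-D articulatory space, mapping to 2-D acoustics.
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.ones(2)]
variances = [np.ones(2), np.ones(2)]
A = [np.eye(2), 2 * np.eye(2)]
b = [np.zeros(2), np.array([0.1, -0.1])]
print(predict_acoustic(np.array([0.2, 0.4]), weights, means, variances, A, b))
```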
>>: So same question here. Since you treat this Y to be variable, you don't need to have
data for training [inaudible]
>> Zhen-Hua Ling: Yes, yes, yeah.
>>: [inaudible]
>> Zhen-Hua Ling: [indiscernible] a hidden variable or an observed variable?
>>: No, no. But according to this model...
>> Zhen-Hua Ling: Uh-huh. Yeah.
>>: So you try anything to make that, you know, [inaudible] so because without that
articulatory -- so I assume that since you do have the articulatory features there, you just use
this model.
>> Zhen-Hua Ling: Yeah.
>>: And on top of this [indiscernible].
>> Zhen-Hua Ling: Yeah.
>>: You need to have [indiscernible]?
>> Zhen-Hua Ling: [inaudible] because in our method, at synthesis time, we need to generate Y and
we need to modify it.
>>: Generate?
>> Zhen-Hua Ling: Yeah, we need to generate Y, because the modification on that is that --
>>: [inaudible]
>> Zhen-Hua Ling: This way.
>>: Okay.
>> Zhen-Hua Ling: We first generate the articulatory features, because we think the
articulatory features are more physical and meaningful.
>>: So during sample, we do sampling?
>>: [inaudible] output probability generation.
>>: Okay, so you got the data already.
>> Zhen-Hua Ling: Yeah, yeah.
>>: So once you have that, then how do you get -- but I thought that you [inaudible], so you
can do them together.
>> Zhen-Hua Ling: Yeah.
>>: So you can do them together.
>> Zhen-Hua Ling: Yeah. At synthesis time, we want to integrate this kind of phonetic
knowledge --
>>: [inaudible] modify.
>> Zhen-Hua Ling: This way it is more feasible for articulatory features than for acoustic
features.
>>: [indiscernible] machine techniques, just doing the standard, [indiscernible].
>> Zhen-Hua Ling: Okay.
>>: So this is kind of state by state, not -- okay, okay, okay. Okay.
>> Zhen-Hua Ling: So, here I can show some experimental results. We use an English
database. Because the recording is expensive, we have only a limited number of sentences,
but that's enough for speaker-dependent model training. We use six sensors and each
has two dimensions --
>>: Yeah, so you have [inaudible]?
>> Zhen-Hua Ling: Yes, yes. Actually, this database is a public --
>>: Is a public?
>> Zhen-Hua Ling: Yeah, you can use it [inaudible]. I think there is a website for this database now.
The first experiment is a hyper-articulation study. We want to simulate that
in some noisy conditions the speaker may want to put more effort into what they pronounce, so
that they will open their mouth wider or something like that. So in this experiment, we
generate the articulatory features at first, then
we modify the articulatory features using a one point five scaling factor for the Z coordinate. Z
means the bottom-up direction. We just want to simulate that when you pronounce, your
articulators move in a much larger range --
>>: Can -- can I take a look at the [inaudible] three slides ago? Back. Yeah, back. One
more, one more, one more, one more. Back, okay. One more, one more, one more. This
one. So when you want to simulate, to synthesize hyper-articulation, you are saying that you move all
the articulatory trajectories up.
>> Zhen-Hua Ling: Yeah. The -- a wider range.
>>: Wider range. I see. So including [inaudible], you have one, two, three --
>> Zhen-Hua Ling: We have six sensors. Each sensor [indiscernible] the Z and the Y,
because all these sensors are placed along the midline, so the X axis is omitted. So we keep these two
dimensions, and in this example we just scale up the Z dimension.
>>: [inaudible] dimension. But if you this upper feature [inaudible] in that case.
>> Zhen-Hua Ling: I admit this modification here is a little bit heuristic, yeah. We just
want to see whether or not this kind of modification of the articulatory features can influence the
generated acoustic features.
>>: Oh, I see. So [inaudible], what do you mean by articulatory control [inaudible].
>> Zhen-Hua Ling: Yeah, yeah.
>>: And [inaudible].
>> Zhen-Hua Ling: Yeah. We have -- we'll have some, maybe not very, how to say, it's
not very detailed, we just need -- yeah.
>>: So -- but why is it that if you move Z up, it's going to be hyper-articulated [inaudible]?
>> Zhen-Hua Ling: Not moving up, we just -- the range. We increase the range.
>>: Oh, I see.
>> Zhen-Hua Ling: Increase the range.
>>: So, in the system you just make is various speaker --
>> Zhen-Hua Ling: Yeah, yeah.
>>: [inaudible] artificially manipulate them.
>> Zhen-Hua Ling: No.
>>: Okay, I see. Okay.
>> Zhen-Hua Ling: Yeah. So we can see some change in the spectrogram. We can see
that with this kind of modification the formants become clearer, and the high-frequency
part gets more energy. We also did a subjective listening test, asking the listeners to
do some dictation in noisy conditions, at five dB SNR. And we can see that the
word error rate drops from 52% to 45%.
>>: How do you [inaudible]?
>> Zhen-Hua Ling: This is dictation of the synthetic speech.
>>: Oh, I see. So you use that to generate that.
>> Zhen-Hua Ling: Yeah.
>>: Okay, okay.
>> Zhen-Hua Ling: We give them some speech and ask the listeners to write down the
words that they heard.
>>: The human listener.
>> Zhen-Hua Ling: Human listener.
>>: Oh, I see.
>> Zhen-Hua Ling: Human dictation. Human dictation. We just want to improve the
intelligibility of the speech in noisy conditions.
>>: [inaudible] all you do is that you increase the variance of one variable, right? And when
you do that, why does the following [inaudible] also have to estimate A.
>> Zhen-Hua Ling: Yeah, [inaudible] there are some -- by using the conditional
distribution here, for example --
>>: Yeah, yeah. It's just not --
>> Zhen-Hua Ling: This one, we can -- we can learn some relationship between acoustic and
articulatory features. There are some training samples in the database with a large
variance of the tongue movement, so we can learn this kind of relationship.
That means, if your tongue movement has a wider range, what will your acoustic
features be? There are some samples of this in the training data, so we can
capture this through the dependency. And at synthesis time --
>>: That's how you look --
>> Zhen-Hua Ling: Yeah. This is --
>>: So you just increase the variance for Y --
>> Zhen-Hua Ling: A has been trained on the training database, and at synthesis time we generate
the Y first, then we increase the --
>>: The variance.
>> Zhen-Hua Ling: The variance. A is the same.
>>: A is the same.
>> Zhen-Hua Ling: Yeah.
>>: Oh, okay. That's very interesting.
>> Zhen-Hua Ling: And the second example is vowel modification. In
this experiment our phonetic knowledge is the place of articulation of different vowels.
We can look at this table here. There are some vowels in English, for example the vowels
in "sat," "sit," and "set." The most significant difference among
these vowels is the tongue height. For the vowel in "sit" you have a higher tongue, the vowel in "sat" has a
lower tongue, and the vowel in "set" is in the middle.
So here, we just predict the articulatory features for the middle vowel, the one in "set," at first, then we modify
the tongue height to see whether or not we can change this vowel to be perceived as the
vowel in "sit" or in "sat" respectively.
>>: Oh so what that really good for otherwise you [inaudible].
>> Zhen-Hua Ling: Yeah, here we have some results [inaudible]. For example,
when we increase the height of the tongue, most of the stimuli will be perceived as
the vowel in "sit." If we decrease the height, most of them will be perceived as the vowel in "sat." This is
from human perception, from dictation results.
Here I can show you some examples of the synthetic speech. This one is the
synthetic speech for the original vowel. This is the word "set"; we can hear that.
>>: "Now we'll say set again."
>> Zhen-Hua Ling: It says set. If we increase the tongue height...
>>: "Now we'll say set again;" "Now we'll say set again;" "Now we'll say sit again."
>> Zhen-Hua Ling: Now it becomes "sit." So it's a very clear transformation.
>>: "Now we'll say sat again. Now we'll say sat again. Now we'll say sat again."
>>: Yeah, more like sat.
>> Zhen-Hua Ling: Yeah. So by this experiment we'll show that our --
>>: [inaudible] this is more than just formants, then also they cannot do it [inaudible]
duration, "sat" becomes longer in duration, somehow. Oh, that's very interesting.
>> Zhen-Hua Ling: Yeah.
>>: I just can't believe that it's so powerful, [inaudible] all you do is a linear transformation.
>> Zhen-Hua Ling: A linear transformation in the articulatory feature space.
>>: Yeah, correct correct.
>> Zhen-Hua Ling: In the -- in the articulatory space we just make some simple
modification. But in the --
>>: So it's like when you move that, it's through the transformation that the formants --
>> Zhen-Hua Ling: Yeah yeah.
>>: Yeah that's that's obvious, okay. So.
>> Zhen-Hua Ling: Well, here, I have said that the linear transforms are estimated for each
Gaussian mixture component, each subspace of the articulatory features. So in total we have a nonlinear
transform.
>>: I see. Okay. So in that case you just move the mean.
>> Zhen-Hua Ling: Yeah.
>>: Here, mean of the what -- so tell me -- so what exactly does the system estimate?
Like, six sensors and each sensor has two dimensions?
>> Zhen-Hua Ling: For the tongue movement, there are only three sensors on the tongue.
>>: [inaudible]
>> Zhen-Hua Ling: [inaudible] tongue height.
>>: Okay, so when you move the tongue height, of course your formants are going to change. So, I
basically hear that the formant changes.
>> Zhen-Hua Ling: Yeah, yeah.
>>: I see. So that's just system form diverse speech.
>>: Okay, that's cool. That's very cool.
>> Zhen-Hua Ling: And also, this is a preliminary attempt at integrating this
kind of articulatory feature.
>>: [inaudible] or in China?
>> Zhen-Hua Ling: [inaudible] research [indiscernible].
>>: Oh, I see.
>> Zhen-Hua Ling: So this is an English database. The database was recorded by the -- okay.
The second thing I'd like to introduce is minimum KL divergence based parameter generation. We know
that HMM-based speech synthesis has many advantages; for example, it's trainable,
it's flexible, it's adaptable. But there are also some disadvantages of this method. The most
significant one is that the quality of synthetic speech is degraded; we can perceive some, how to
say, unnatural or uncomfortable things in the synthetic speech.
There are three causes of this kind of speech quality degradation. The first is the
vocoder, the second is the acoustic modeling method, and the third is the parameter
generation algorithm. Because we use the maximum output probability criterion for the generation,
the generated features tend to be over-smoothed. Here, we focus on the third cause of this kind of
problem.
This equation has been introduced before. This is the maximum output probability
parameter generation. Because in our method the state output distribution is represented by a single
Gaussian, if we solve this problem the outputs are very close to the state means. Even if
we have used some dynamic features, they are still very close to the mean of each state. So
the detailed characteristics of the speech parameters are lost in the synthetic speech.
>>: So MOPPG stands for --
>> Zhen-Hua Ling: Maximum output probability parameter generation. Yeah. Just this -- this
one.
Okay. Many methods have been proposed to try to solve this kind of problem. I think
the most successful one is global variance modeling. In this method, we calculate the
sentence-level variance of the acoustic features for each training sentence. Then we estimate
a single Gaussian distribution, lambda v, for this kind of global variance.
Then at synthesis time the objective function is defined in this way, which is a combination
of two parts. The first one is the conventional output probability of the acoustic hidden
Markov model, and the second term is the output probability of the global variance model.
So under the constraint of this model, we can generate acoustic feature
trajectories with larger variance, which alleviates the over-smoothing of the
trajectory and improves the quality of synthetic speech.
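A rough sketch of this combined objective, assuming diagonal covariances and toy values; the full method also includes the dynamic-feature constraint through the matrix W and a tuned weight between the two terms, which are simplified away here:

```python
# Sketch of the GV-augmented objective: the HMM output log-likelihood of the
# trajectory plus a weighted log-likelihood of its sentence-level variance under the
# trained global-variance Gaussian.
import numpy as np

def log_gauss_diag(x, mean, var):
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def gv_objective(C, state_means, state_vars, gv_mean, gv_var, w=1.0):
    """C: (T, D) static trajectory; state_means/vars: per-frame Gaussian targets."""
    hmm_term = sum(log_gauss_diag(c, m, v)
                   for c, m, v in zip(C, state_means, state_vars))
    v_C = np.var(C, axis=0)                        # sentence-level (global) variance
    gv_term = log_gauss_diag(v_C, gv_mean, gv_var)
    return hmm_term + w * gv_term

T, D = 50, 3
C = np.random.randn(T, D)
score = gv_objective(C, np.zeros((T, D)), np.ones((T, D)),
                     np.ones(D) * 0.8, np.ones(D) * 0.1)
```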
>>: Okay. So why don't you just, [inaudible] increase the variance.
>> Zhen-Hua Ling: You mean at synthesis time or?
>>: Synthesis time.
>> Zhen-Hua Ling: Yeah, but doing that directly is too heuristic -- yeah. Here we guide it by
some statistical model.
>>: So you train the --
>> Zhen-Hua Ling: Yeah this is trained. This is trained. For each --
>>: So, after you do this kind of training, do you see the amount of change in variance
consistent across different phones, different states, or are they in some states making too small
[inaudible]?
>> Zhen-Hua Ling: Actually, according to the criterion, this kind of variance
increase would be inconsistent, because --
>>: I see.
>> Zhen-Hua Ling: -- there is a trade off between these two parts.
>>: Yeah, so we call that regularization.
>> Zhen-Hua Ling: Yeah, yes. Yes, yes, yeah. And this method has been proved to be very
effective; in some experiments it can improve the quality of synthetic speech significantly.
>>: So, question. [inaudible]
>> Zhen-Hua Ling: I think maybe that depends on the problem, because here
over-smoothing is the most important problem. We want more variation in the
synthetic speech.
Another method is the [indiscernible] parameter generation method. Here the output probability
is calculated using the mean of the generated segments instead of the output
probability of each frame.
We can see that there is a common point between these two methods: both of these two
methods examine the similarity between distribution parameters derived from the natural features and from the
generated acoustic features.
For example, in the GV method the distribution parameter is the sentence-level variance, and in the other
method it is the segment-level mean. We want this kind of distribution parameter to be similar for both the natural speech
and the generated speech.
So, based on this, we propose another kind of generation algorithm. In this
method, we explicitly measure the distribution similarity between the
generated features and the natural features.
Here, the generated hidden Markov model, which means the sentence HMM estimated from the
generated acoustic features, is expected to be as close as possible to the target
distribution, which is the HMM used for parameter generation. This is the most important
criterion of our proposed method, and the KL divergence is adopted as the [indiscernible] distance between the two
hidden Markov models. And we have implemented two approaches using this criterion, which are
to optimize the generated acoustic features directly, and to estimate a linear transform of the
maximum output probability parameter generation results.
So the first thing is that when we generate a sentence from the hidden Markov model, we
need to estimate the generated HMM. This is not easy, because we have a
data sparsity problem: we need to estimate an HMM from a single sentence. Here we use maximum a posteriori estimation
to estimate the HMM parameters.
So the mean and the variance of each state can be written as a function of
O, where O is the generated speech. That means when the parameter features are generated, we
use the features to estimate the mean and the variance of each HMM state and then get the
sentence HMM of the generated speech. Then we can calculate the [indiscernible] between
these two HMMs; we use the KL divergence in its symmetric form between the target HMM and
the generated HMM. But unfortunately, because of the hidden states, we cannot
calculate this --
>>: [inaudible]
>> Zhen-Hua Ling: Okay. Right here. We have it right here. The generated HMM means the HMM
estimated from the generated acoustic features. For instance, at synthesis time, we
input a text and then we generate an acoustic trajectory for this sentence.
>>: [inaudible] or something.
>> Zhen-Hua Ling: Yeah yeah but we can estimate Hidden Markov model from this trajectory.
>>: Okay yeah, which is not very good --
>> Zhen-Hua Ling: Which is --
>>: -- compared to the real --
>> Zhen-Hua Ling: [indiscernible] but we want to minimize this distortion.
>>: I see I see. Okay.
>> Zhen-Hua Ling: The target HMM means the HMM that generates the acoustic features.
>>: I see, so now you sort of estimate a new set of HM.
>> Zhen-Hua Ling: Yeah.
>>: That minimize the [inaudible] doing that.
>> Zhen-Hua Ling: Yeah something like that. Yeah yeah.
>>: Oh that's cool. Okay.
>> Zhen-Hua Ling: That is something like we --
>>: So why do you want to do that instead of just doing sampling?
>> Zhen-Hua Ling: I think the reason is that, here, actually the motivation is the
over-smoothing effect of the conventional maximum output probability
parameter generation algorithm. If we just use this kind of output probability to evaluate
whether the sequence is good or not good, it will cause some over-smoothing problems.
>>: Okay. [inaudible] in this case you only use the mean.
>> Zhen-Hua Ling: Yes.
>>: But if you also use a variance just like what -- to the -- the end.
>> Zhen-Hua Ling: You mean the -- you mean the -- you mean the -- if you --
>>: If you estimate the global variance of the sequence and then you do the sampling
according to the mean and this variance, you should be able to get a higher [inaudible]
>> Zhen-Hua Ling: I think, in truth, when the global variance model has been
introduced, it is also solved by maximizing this kind of combined output probability.
It's not a sampling method.
>>: I mean you can do the sampling if you already have the variance.
>> Zhen-Hua Ling: Yes, we can, but -- I have some results of just a simple sampling
method, and I actually think it's not too good. That means, currently, using the mean
is much safer than -- than -- [laughter]. It's a little bit dangerous if you don't have a good
constraint --
>>: [inaudible] frame by frame.
>> Zhen-Hua Ling: [inaudible]
>>: So you said that you saw a [inaudible].
>> Zhen-Hua Ling: Yes yes.
>>: So [inaudible] sampling?
>> Zhen-Hua Ling: [inaudible].
>>: Maybe you'd have.
>> Zhen-Hua Ling: It's not so good.
>>: I see okay. So this was kind of, you know, segmental.
>> Zhen-Hua Ling: Actually in this method, we have not touched the problem of whether or not
we should do sampling or use the maximum output probability. We just try to improve the
maximum output probability method by using a new --
>>: [inaudible] using these criterion and announcing these [inaudible].
>> Zhen-Hua Ling: No. Actually --
>>: [inaudible] when you don't use the [inaudible].
>> Zhen-Hua Ling: No, no, no. Actually the criterion is just this way.
>>: Okay.
>> Zhen-Hua Ling: We just try to find the optimal static feature sequence to minimize this kind
of --
>>: Divergence.
>> Zhen-Hua Ling: Divergence.
>>: Divergence measured between two --
>> Zhen-Hua Ling: Yeah.
>>: One of them is using generated.
>> Zhen-Hua Ling: -- an upper-bound approximation of the [inaudible] between two hidden
Markov models. Because this criterion is a function of the static features, we
just try to solve this optimization problem, and C is updated by some iterative method
to get the final results.
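As a sketch of the state-level distance involved, here is the symmetric KL divergence between two diagonal Gaussians, one standing in for a target HMM state and one for the state statistics estimated from a generated trajectory (toy values; the actual method works with an approximation over the whole sentence HMM):

```python
# Sketch of a per-state symmetric KL divergence between two diagonal Gaussians.
# Summing this kind of term over states gives a measure of how far the HMM
# estimated from the generated features is from the target HMM.
import numpy as np

def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL( N(mean_p, var_p) || N(mean_q, var_q) ) for diagonal covariances."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def symmetric_kl(mean_p, var_p, mean_q, var_q):
    return (kl_diag_gauss(mean_p, var_p, mean_q, var_q)
            + kl_diag_gauss(mean_q, var_q, mean_p, var_p))

target = (np.array([1.0, 0.5]), np.array([0.2, 0.2]))        # target state statistics
generated = (np.array([0.9, 0.55]), np.array([0.05, 0.06]))  # over-smoothed estimate
print(symmetric_kl(*target, *generated))
```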
And this is the one method how might use of the criterion who have also tried another method
that this we constrain the final output acoustic features to be some linear transform of
maximum output probability primary generation results in this way. And under this criterion we
just estimate the coefficient and the BIOS [phonetic] of the linear transform. And we want to -we hope that by this method we can get more robust estimation of the transform metrics to
estimate the scene directly.
Furthermore, we can also estimate the linear transform coefficients and the bias on the training set,
which means we can do it globally over the training set. So it will reduce the computation at synthesis
time. Here, the KL divergence is calculated for all the sentences in the training set and summed up
together. And then we can estimate the linear transform that minimizes this KL divergence.
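(The global-transform variant can be sketched the same way. In the toy below, the final features are constrained to an affine transform a*c + b of the baseline generation output, and (a, b) are fitted once by minimizing the summed Gaussian KL divergence over a pretend training set. The statistics are invented and the real system estimates a full transform, so treat this only as a sketch of the idea.)

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kl(m1, v1, m2, v2):
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

    # Invented (mean, variance) pairs: baseline (over-smooth) generation vs. target models,
    # one pair per training sentence/state.
    baseline_stats = [(1.0, 0.010), (2.0, 0.020), (0.5, 0.005)]
    target_stats   = [(1.0, 0.040), (2.1, 0.090), (0.45, 0.010)]

    def total_kl(params):
        a, b = params
        # An affine transform a*c + b maps N(m, v) to N(a*m + b, a*a*v).
        return sum(gaussian_kl(a * m + b, a * a * v + 1e-8, tm, tv)
                   for (m, v), (tm, tv) in zip(baseline_stats, target_stats))

    a_opt, b_opt = minimize(total_kl, x0=[1.0, 0.0], method="Nelder-Mead").x
    print(a_opt, b_opt)
    # At synthesis time only the cheap transform is applied: c_final = a_opt * c_baseline + b_opt

In this toy the fitted scale comes out greater than one, boosting the variance of the over-smooth baseline toward the target statistics, which is the intended effect of the global transform.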
>>: I see. So this is done in training. So after your training you get another HMM,
which is the generation model.
>> Zhen-Hua Ling: Actually this training will only estimate these two parameters. This is
separate from the Hidden Markov model. The Hidden Markov model will not be retrained
anymore.
>>: So what's [inaudible]?
>> Zhen-Hua Ling: Here, we can --
>>: Oh! It's just expiration.
>> Zhen-Hua Ling: Yeah.
>>: Oh! So it's only two parameters, too.
>> Zhen-Hua Ling: Yeah, yeah. Actually that's just an alternative method compared to the
previous one. There, we give full freedom to C: it can change anywhere, just to minimize
this. But here we give some constraint; we require C to be a --
>>: [inaudible] HMM do this, I guess it wasn't do that right.
>> Zhen-Hua Ling: All these methods are only related to the parameter generation part. The
model training, we do not touch that here.
>>: I see okay okay. It's --
>> Zhen-Hua Ling: This is only about how we can generate the parameters once the model has
been given.
>>: I see. Presumably that's a trained HMM, okay. My guess was -- you originally have an HMM,
and, you know, the HMM is maybe too smooth, away from your data, you know, the training data.
So you modify the HMM so that the two become close to each other. That's what I thought.
>> Zhen-Hua Ling: No. Maybe I can go back to it. To here.
I think, for example, we can consider things this way. We have a trained HMM model and we just
keep it. Then we can generate acoustic features for the input text. Then we can estimate the
generated HMM from these generated acoustic features. Then if we, how to say, make some
modifications to the generated acoustic features, the generated HMM will be different.
>>: I see. Correct, correct.
>> Zhen-Hua Ling: So we can find the optimal --
>>: I see.
>> Zhen-Hua Ling: -- feature sequence, which has a generated HMM that is very close to the
target HMM. The target HMM is what we train at training time, and it is always kept the same.
Okay.
Here are some experimental results. We compared five methods. The first is the
conventional maximum output probability parameter generation algorithm. The second is the
classic GV method. And these three are the proposed methods: the first is the minimum KL
divergence parameter generation; the second is the linear transform method; and the third is
also the linear transform, but estimated globally on the training set.
This figure shows the average KL divergence of the different algorithms. If we compare
the natural speech, the MOPPG output, and the GV output, we can see that the natural speech
has the lowest KL divergence compared to these two methods.
I think this means that maybe this KL divergence is a better criterion than the maximum
output probability, because in terms of output probability this red line is the optimal one, yet it
sounds not as good as the natural speech. And this figure shows the KL divergence of the
proposed methods, and we can see that they all become very close.
And this figure shows some sample trajectories; this is for one dimension of the
Mel-cepstrum. We can see that the MOPPG output is very smooth, and the variance is very
small. But after we apply the KL-divergence-based parameter generation, for example
these two methods, we can get trajectories with much more variation in the synthetic speech.
And this figure shows some subjective results. That is, we have some synthetic samples, we
ask people to listen to the samples and give a score from one to five to each sample, and we
calculate the average value. A higher score means better naturalness.
We can listen to some of them here. This is the maximum output probability generated output.
>>: [Chinese dictation]
>> Zhen-Hua Ling: I'm sorry about that; there are only Chinese samples. But maybe you can
perceive the difference between this one and the next one. If we use the GV method, the speech
is much clearer.
>>: [Chinese dictation]
>> Zhen-Hua Ling: [indiscernible] can perceive the difference.
>>: [Chinese dictation]
>> Zhen-Hua Ling: This is our proposal.
>>: [Chinese dictation]
>> Zhen-Hua Ling: We also carried out another evaluation by a preference test, that is, to compare
the GV system with the KL divergence global transform method. The final result is that the global
transform method can achieve better results even than GV, and the difference is significant for a
group of listeners.
So in the last part I will introduce another work we have done before, that is hybrid HMM-based
unit selection speech synthesis. Before that I'd like to give some review of the classic unit selection
algorithm.
In this algorithm the most important thing is that we need to define two cost functions. For
example, here we have a target unit sequence, that is, the target we want to synthesize, and we have
a candidate unit sequence.
Here we have two cost functions. The first is the target cost, which defines how appropriate a
candidate unit is for a given target unit. The other cost function we name the
concatenation cost, which measures how much discontinuity there will be if we connect these two
units together.
Then the final unit selection criterion can be defined in this way, which is the summation of the
target costs and concatenation costs over all target units, and the optimal unit sequence is selected
by minimizing this cost function.
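(Below is a generic, textbook-style dynamic programming sketch of this minimization, with dummy cost functions to be supplied by the caller; it shows the structure of the search, not the specific costs used in any particular system.)

    from math import inf

    def select_units(targets, candidates, target_cost, concat_cost):
        """targets: list of target specs; candidates[i]: candidate units for position i."""
        n = len(targets)
        best = [{} for _ in range(n)]          # best[i][unit] = (cumulative cost, back-pointer)
        for u in candidates[0]:
            best[0][u] = (target_cost(targets[0], u), None)
        for i in range(1, n):
            for u in candidates[i]:
                tc = target_cost(targets[i], u)
                prev, cost = None, inf
                for p, (c_prev, _) in best[i - 1].items():
                    c = c_prev + concat_cost(p, u) + tc
                    if c < cost:
                        prev, cost = p, c
                best[i][u] = (cost, prev)
        # trace back the optimal unit sequence
        u = min(best[-1], key=lambda k: best[-1][k][0])
        path = [u]
        for i in range(n - 1, 0, -1):
            u = best[i][u][1]
            path.append(u)
        return list(reversed(path))

    # Toy usage with ad-hoc costs:
    print(select_units(["a", "b"], [["a1", "a2"], ["b1", "b2"]],
                       target_cost=lambda t, u: 0.0 if u.startswith(t) else 1.0,
                       concat_cost=lambda p, u: 0.5))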
Compared with HMM-based speech synthesis, the unit selection and waveform concatenation
method has its advantages and disadvantages. For example, because real speech segments are used
at synthesis time, this method can produce higher quality at the waveform level than HMM-based
speech synthesis. This is an advantage because the vocoded sound does not feel so comfortable
when you listen to the speech.
>>: [inaudible] distance?
>> Zhen-Hua Ling: Actually it has kind of improved, but --
>>: Still not --
>> Zhen-Hua Ling: It cannot beat this one, because these are natural speech segments.
Especially in a restricted or limited domain, it has very high naturalness if you get good
results.
But the disadvantage is that unit selection needs a large database; it needs a large
footprint. That means at synthesis time we need to store the whole speech database, which
may be several hundred megabytes or even several gigabytes. And there can be some
discontinuity if we put units together that shouldn't be, and it also has unstable
quality. This means for some sentences we may have perfect synthesis results, but for
some sentences there will be some glitch in the sample and you will feel very
uncomfortable because of these synthesis errors.
So in our previous work we have proposed an HMM-based unit selection method. In this
method we try to integrate the advantages of the statistical approach, its automation and robustness,
with the superiority of unit selection, its high-quality synthetic speech, and we want to achieve a
trainable unit selection speech synthesis system.
This figure shows the flowchart of the HMM-based unit selection speech synthesis system. The
training part is almost the same as the HMM-based speech synthesis I introduced
before: we extract features and train context-dependent HMMs. The difference
is at synthesis time. We do not generate acoustic features from the model directly; instead,
we have a statistical-model-based unit selection module to select the optimal
unit sequence based on the criterion given by these models.
Particularly, for the model training part, we first train models for the acoustic features, the
commonly used features including the spectrum and F0 at each frame; then the phone duration; then
the concatenation spectrum and F0 at the phone boundary. I think this last one is maybe not so
important for HMM-based parametric speech synthesis, but if we want to do unit selection we
need this kind of model. And some other models of unit-level features, or some other things.
And the model training is almost the same as for HMM-based speech synthesis: we train the
models under the maximum likelihood criterion, we use decision-tree-based model clustering to
deal with the data sparsity problem, and we use MSD-HMMs for the F0 and the concatenation F0
models.
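(As a schematic summary of this model set, and only a paraphrase of the slide rather than code from the actual system, the hybrid approach trains roughly the following families of statistical models; all names here are illustrative.)

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class HybridModelSet:
        spectrum_hmm: Any        # context-dependent HMMs over frame-level spectral features
        f0_hmm: Any              # frame-level F0 model handling voiced/unvoiced frames
        duration_model: Any      # context-dependent phone duration distributions
        boundary_spectrum: Any   # spectrum concatenation model at phone boundaries
        boundary_f0: Any         # F0 concatenation model at phone boundaries
        unit_level_models: Any   # other models of unit-level features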
And this slide shows the unit selection process. Here, assume the context information of the
target sentence is C, and we have a candidate phone sequence. Then how do we
determine which unit sequence we should select? This equation is the criterion
we used. It can be divided into two parts.
The first part describes the probability of this kind of unit sequence being
generated given the target context information, and the second part describes the similarity
between the context information of the candidate units and the context information
of the target units.
We also use the KL divergence to calculate this part. So if we put these two parts together, our
criterion is that we want to find the unit sequence which has high probability and low KL
divergence in terms of the context information.
So we can rewrite this formula into the conventional unit selection formulation,
including a target cost and a phone-level concatenation cost. We can use a modified dynamic
programming search to do the optimization.
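(A rough sketch of how such a criterion turns into costs: the target cost below combines the negative log-likelihood of a candidate unit's acoustic features under the target-context model with the KL divergence between the candidate's and the target's context-dependent models. Single univariate Gaussians stand in for the real context-dependent HMMs and all numbers are invented; the resulting costs would then feed a dynamic programming search like the one sketched earlier.)

    import numpy as np

    def gaussian_kl(m1, v1, m2, v2):
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

    def neg_log_likelihood(features, mean, var):
        return 0.5 * np.sum(np.log(2 * np.pi * var) + (features - mean) ** 2 / var)

    def target_cost(candidate_features, candidate_model, target_model, kl_weight=1.0):
        """candidate_model / target_model: (mean, var) of the single-Gaussian stand-ins."""
        tm, tv = target_model
        cm, cv = candidate_model
        return (neg_log_likelihood(candidate_features, tm, tv)
                + kl_weight * gaussian_kl(cm, cv, tm, tv))

    # A candidate whose own model matches the target-context model gets a lower cost.
    feats = np.array([1.0, 1.1, 0.9])
    print(target_cost(feats, candidate_model=(1.0, 0.05), target_model=(1.0, 0.05)))
    print(target_cost(feats, candidate_model=(2.0, 0.30), target_model=(1.0, 0.05)))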
Okay. And finally I will introduce the performance of our proposed method in the Blizzard
Challenge speech synthesis evaluation events. The Blizzard Challenge is an evaluation event
which has been held every year since 2005. Its purpose is to promote the progress of speech
synthesis techniques by evaluating systems built using the same database but different
methods.
The USTC team has entered this event since 2006, we have adopted HMM-based unit selection since
2007, and we have achieved excellent performance in all the measurements of the evaluation tests
since that year.
So, let's take the Blizzard Challenge event of this year as an example. This year we had a
very challenging synthesis topic: the speech database we used was not recorded in a
recording room but taken from internet resources. It is an audiobook speech database
which has four stories written by Mark Twain and read by an American English narrator. The
database size is quite big, more than 50 hours, and it has imperfect transcriptions and rich
expressiveness.
This slide shows the results of the Blizzard Challenge this year.
>>: But that's coming from inter[inaudible]
>> Zhen-Hua Ling: Yeah [inaudible].
Here, the first item of the evaluation is the similarity, which measures how similar your
synthetic speech is to the voice of the source speaker. We can see that here A is
the natural speech and C is our system. The mean score is about 4.1, which is the highest
of all participants. For the naturalness score, our system also gets the highest mean score,
which is about 3.8. For intelligibility, we had some human dictation of the synthetic speech,
and the dictation word error rate for our system is around 19%, which is also the lowest of all the
systems. Here we also have some paragraph tests. In these tests --
>>: So what is F?
>> Zhen-Hua Ling: So --
>>: Which group is F?
>> Zhen-Hua Ling: F? I can't remember, actually, this, this, this competition is dealing in an
anonymous way.
>>: Oh, okay.
>> Zhen-Hua Ling: [laughter]. Yeah.
>>: So how do you know that's you?
>> Zhen-Hua Ling: They will give you some data so you know which system is yours, but
they will not tell you -- I think everybody only knows the information about their own work,
but I forgot what F stands for.
>>: I see, okay. So what -- is that worse than the ones under normal conditions? Otherwise the
error rate cannot be so high, 20%.
>> Zhen-Hua Ling: No, it's not a logical connection, because these sentences are what we
call SUS sentences, which means semantically unpredictable sentences -- yeah. That's why it is
difficult. You cannot use the context to guess what the words are. So it's much more
difficult.
>>: Okay, I see. So can we go back two slides? One more.
>> Zhen-Hua Ling: Okay.
>>: So FIB [phonetic] is [inaudible].
>> Zhen-Hua Ling: Yes, but there are also significance test results distributed by the organizers.
In those results our system is significantly better.
>>: I mean those three groups, FIB.
>> Zhen-Hua Ling: Oh you mean.
>>: [inaudible]
>> Zhen-Hua Ling: Actually -- actually this is a box plot, because the scores given are
from one to five, so that means they are discrete. Now, this line means, how to say, the median of the --
>>: The mean?
>> Zhen-Hua Ling: Not mean, the...
>>: [inaudible]
>> Zhen-Hua Ling: Yeah. And this box is, how to say, the variance. The variance,
yeah.
>>: Standard deviation.
>> Zhen-Hua Ling: Standard deviation, yes. Maybe some multiple of the standard deviation, I don't
remember. Actually, it should be explained this way: half of the samples are located in this box.
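(For readers of the figure: a standard box plot of discrete 1-to-5 scores draws the median as the line and the box over the interquartile range, so roughly half of the responses fall inside the box and the mean is not shown directly. A generic matplotlib illustration with made-up scores:)

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    scores = rng.choice([1, 2, 3, 4, 5], size=200, p=[0.02, 0.05, 0.25, 0.45, 0.23])
    print("mean:", scores.mean(), "median:", np.median(scores))  # the mean need not equal the median

    plt.boxplot(scores)          # line = median, box = interquartile range
    plt.ylabel("MOS score")
    plt.savefig("mos_boxplot.png")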
>>: Okay. So this is the usual box plot diagram.
>> Zhen-Hua Ling: Yeah, the box plot diagram.
>>: Okay.
>> Zhen-Hua Ling: Yeah, yeah.
>>: So if you move to the next -- so what is the similarity here? Similarity means what?
Naturalness? Intelligibility?
>>: Similar to the original recordings.
>> Zhen-Hua Ling: Similar to the original speaker's recording.
>>: Okay. So what does naturalness mean, or --
>> Zhen-Hua Ling: Actually --
>>: This one test adaptation --
>> Zhen-Hua Ling: Actually it's related with naturalness. For example, if a speech
sample is very unnatural, it cannot get a high similarity score. But we may have a sample
that is very natural but sounds like another person, not the target person.
>>: Oh, I see. So naturalness has two categories. One is the absolute naturalness, you know,
without --
>> Zhen-Hua Ling: Yeah.
>>: Without speaker identity [inaudible].
>> Zhen-Hua Ling: Yeah yeah yeah.
>>: Okay. I see. So here... naturalness [inaudible]. But why do they have the [inaudible] box?
Normally if it's a bar, then you would actually see something marking the little spot, the O.
So, yeah, that's the same question I thought you were asking. So the MOS score is 3.8,
right? That's your system.
>> Zhen-Hua Ling: That's our system yeah.
>>: That is supposed to be somewhere at 3.8
>> Zhen-Hua Ling: Yeah, something like that.
>>: It doesn't look like that. I don't know. I don't understand. What is the box representation
here?
>> Zhen-Hua Ling: You mean this?
>>: Yeah.
>> Zhen-Hua Ling: That is the median.
>>: The median.
>> Zhen-Hua Ling: Yeah, so it must be one of -- one, two, three, four, five. So that
is the median value, not the mean.
>>: Okay, okay. So what's the 3.8 here?
>> Zhen-Hua Ling: 3.8.
>>: Come for here.
>> Zhen-Hua Ling: No, we cannot get the 3.8 from this figure; it is from other sources. This figure just
shows the distribution of the scores.
>>: I see, okay. I see.
>> Zhen-Hua Ling: And the scores actually are discrete.
>>: So it's just your system and two other groups are doing equally well [inaudible].
>> Zhen-Hua Ling: You mean? I think the medians obviously are the same, but there are also
some results of the significance analysis which can show whether the systems are different
[indiscernible].
>>: Oh, okay. You can't see that from here.
>> Zhen-Hua Ling: Yeah.
>>: [indiscernible]
>> Zhen-Hua Ling: There are some other -- there are more detailed results of the --
>>: I see. So is this for Chinese or for English?
>> Zhen-Hua Ling: It's English.
>>: Oh, example was -- so your group did both English and Chinese and both are probably
the best.
>> Zhen-Hua Ling: In some years we have had more than one language in the evaluation,
but this year we have only English.
>>: English, oh, okay. So does this test have Chinese as well?
>> Zhen-Hua Ling: You mean the listeners for --
>>: Yeah, for this year's test challenge participants.
>> Zhen-Hua Ling: I see. No other listeners are from China.
>>: I see. Okay. Okay.
>> Zhen-Hua Ling: Okay. This is a new measurement this year, the paragraph test.
This means that what the listeners listen to is not sentence by sentence but a whole
paragraph of a novel. So they give impression scores on these seven aspects. And the range,
I remember, is from 10 to 60. This is the score of the natural speech, and this is the score of
our system, which is also the best performance in all the seven aspects compared to other
systems. So I can show you some demos. We have different types of synthetic sentences:
some are book sentences, some news, and some book paragraphs.
>>: But this test [inaudible] SUS. Because this is --
>> Zhen-Hua Ling: There are some; I didn't put those samples here, but because --
>>: So this is only to look for naturalness.
>> Zhen-Hua Ling: Actually these two parts are used for the similarity and naturalness tests.
And this part is used for the paragraph test.
>>: "Once more the original verdict was sustained. All that little world was drunk with joy.
I've liked him ever since he presented this morning. Her research is the second piece by
Edinburg University scientists highlighted recently."
>>: This is the original one --
>> Zhen-Hua Ling: The synthetic speech.
>>: Oh, the synthetic speech.
>> Zhen-Hua Ling: All the samples are synthetic speech from our system.
>>: Oh, okay.
>>: "So it was warm and smug with it, though bleak and wrong without, it was light and bright
with it. Though outside, it was as dark and dreary as if the world had been lit with Hartford
gas. Alonzo smiled feebly to think how his loving vagueries had made him a maniac in the
eyes of the world and was proceeding to pursue his line of thought further when a faint, sweet
strain, the very ghost of sound, so remote and attenuated it seemed struck upon his ear."
>>: So which one -- what is the difference between?
>> Zhen-Hua Ling: No difference, just some samples.
>>: Just some samples.
>> Zhen-Hua Ling: Sorry, I haven't explained yet. I think the text of the book paragraphs is
taken from maybe another story by Mark Twain which is not contained in the training set.
Okay. Finally I will give a summary of this presentation. In this presentation I introduced some
basic techniques of Hidden Markov model-based speech synthesis. I think the
key idea is that two models are used. One is the source-filter model, which is used to extract
acoustic features and to reconstruct the speech waveform; the second is the statistical
acoustic model, which describes the distribution of acoustic features for different context
information and is used to generate acoustic features.
This HMM-based speech synthesis method has become more and more popular in recent
years, but it is still far from perfect. There are still many parts that need to be improved.
Then I introduced some recent advances of this method which were conducted in our lab. In
order to get better flexibility, we proposed a method of integrating articulatory
features into HMM-based speech synthesis. In order to get better speech quality, we
proposed a minimum KL divergence parameter generation algorithm. And in order to get better
naturalness, we use the hybrid HMM-based unit selection approach.
>>: Quality and naturalness, are they similar?
>> Zhen-Hua Ling: They are related, but I think speech quality is more related with
the spectral parameters. For example, you can feel some buzziness or over-smoothing.
And naturalness is more related with some other things.
>>: So quality is a subset of naturalness.
>> Zhen-Hua Ling: Yeah, it can be. For example, if you judge the naturalness of a sentence,
maybe speech quality is one of the factors you need to consider.
Okay. That's the end of my talk. Thank you.
>>: [applause]
>>: Okay, so -- I think we ask all the questions already. Okay. Thank you.