>> Li Deng: Okay. Welcome to this presentation. It's my great pleasure to welcome Professor Zhen-Hua Ling to give us a lecture on HMM-based speech synthesis. Professor Ling currently teaches at the University of Science and Technology of China, and he has a very strong crew over there doing speech synthesis. He recently arrived at the University of Washington to spend one year working on various aspects of speech and perhaps some multimedia. So I take the opportunity to invite him to introduce speech synthesis here, and hopefully you will enjoy his talk. Okay, I'll give the floor to Professor Ling. >> Zhen-Hua Ling: Thank you. Thank you, Li, and good morning everyone, hello everyone here and hello to the listeners over the internet. It is a great honor to be here and to tell you something about speech synthesis technology. My name is Zhen-Hua Ling; I'm now a visiting scholar at the University of Washington, and I come from USTC in China. Today the topic of my presentation is HMM-based speech synthesis, its fundamentals and recent advances. This is the outline of my presentation. At first I'll give a review of the state of the art of HMM-based speech synthesis techniques, including some background knowledge and its basic techniques, and I'll show some examples of its flexibility in controlling the voice characteristics of synthetic speech. Then I will introduce some recent advances of this method which were developed by our research group. I'll give three examples: the first is the articulatory control of HMM-based speech synthesis, the second is minimum KL divergence based parameter generation, and the third one is hybrid unit selection speech synthesis. Okay. So I'll start from some background introduction. What is speech synthesis? It is defined as the artificial production of human speech. In recent years, speech synthesis has become one of the key techniques in creating intelligent human-machine interaction. If the input to a speech synthesis system is in the form of text, we also name it text-to-speech, or TTS for short. It can be considered as a kind of, how to say, inverse function of automatic speech recognition, because in automatic speech recognition we want to convert speech to text, but in TTS we do the inverse thing. This figure shows the general, how to say, diagram of a TTS system. It is composed of two main modules. The first one is text analysis. This module converts the input text into some intermediate linguistic representation; for example, we extract the phoneme sequence from the input text as the [indiscernible] and some other things. The next module is waveform generation. In this module we produce speech according to this kind of linguistic representation. We also name these two modules the front end and the back end of a TTS system. In this talk here, I will focus on the back-end techniques of a TTS system; that means when I mention speech synthesis in the following introduction, it only refers to the back end of a TTS system. In order to realize a speech synthesis system, many different kinds of approaches have been proposed in history. Before the 1990s, rule-based formant synthesis was the dominant method. Here formant means acoustic features which describe the resonant frequencies of the speech spectrum.
In this method, the formant features of each phonetic unit are designed manually by human experts, together with some rules, and finally the synthetic speech is produced using a source-filter model. Then, after the 1990s, corpus-based concatenative speech synthesis became more and more popular. Here corpus means a large database with recorded speech waveforms and label information. In this method, we concatenate segments of speech waveform to construct an entire sentence. There are different kinds of implementation. In some methods, we have only one instance, one candidate unit, for each target unit; that is the single-inventory method. In other methods we have many more instances, many candidates for each target unit, which means we need to do some unit selection to pick the most appropriate units to synthesize the target sentence. Then in the middle of the 1990s a new method was proposed, that is corpus-based statistical parametric synthesis. In this method, we estimate acoustic models automatically from the large training database. Then at synthesis time, we predict the acoustic features from the trained acoustic models and then use some kind of source-filter model to reconstruct the waveform. So this method provides more flexibility compared with the previous waveform concatenation method. In most cases, the hidden Markov model is used as the statistical model, so we commonly name this method HMM-based speech synthesis, or HTS for short, which means "H triple S," which you can see in the letters here. So in this talk, I will introduce this kind of method, including some basic techniques and recent advances of HMM-based speech synthesis. Before I introduce the basic techniques of this kind of HMM-based speech synthesis, I'd like to show the overall architecture of this method. We can see that the HMM-based method can be divided into two parts; the first one is the training part, the second is the synthesis part. The training part is very similar to HMM-based automatic speech recognition. We first extract acoustic features, including the excitation parameters and spectral parameters, from the speech waveforms recorded in the database. Then we train a set of models using the extracted parameters and the label information of the database. At synthesis time, the input text first passes through a text analysis module to get this kind of label information, which is just the linguistic representation we mentioned above. Then we generate the acoustic parameters, such as the excitation parameters and the spectral parameters, from this HMM model set according to the label information. Finally these parameters will be sent into a source-filter model to reconstruct the speech signal. So this is how HMM-based speech synthesis works. There are roughly four techniques that are very important to HMM-based speech synthesis. I'll give a more detailed introduction to them one by one. The first technique is speech vocoding. Speech vocoding means at training time we extract acoustic features from the speech waveform, and at synthesis time we reconstruct the speech waveform from the synthesized speech features. The speech vocoding technique is motivated by the speech production mechanism of human beings. We can look at this figure here.
When people speak, first there is airflow produced by the lungs; then this airflow passes through the vocal cords to generate the source signal of the speech. When we pronounce a voiced segment, for example a vowel, our vocal cords vibrate periodically, so we get a pulse train as the sound source. When we speak some unvoiced speech segment, for example consonants, our vocal cords do not vibrate, so we get a noise-like source. This pulse train or noise signal then passes through the vocal tract and generates the final speech. The function of the vocal tract is very similar to a filter; its frequency transfer characteristics decide which specific phoneme we are pronouncing. So mathematically, we can construct the vocoder using this kind of source-filter model. It consists of a source excitation part and a vocal tract resonance part. In the source excitation part, we have the signal e(n), which is either a pulse train or white noise. And in the vocal tract resonance part we use a linear time-invariant filter to represent the vocal tract characteristics of each speech frame. So the final speech is the convolution between these two parts, the excitation e(n) and the filter response. Furthermore, we will extract some spectral parameters to represent this vocal tract filter or its Fourier transform. Here we can use different kinds of spectral parameters according to the different kinds of speech spectrum model we use. For example, we can use the autoregressive model to represent the speech spectrum; in this case the spectral parameters c_m become the linear prediction coefficients. We can also use an exponential model; in this model the parameters c_m become the cepstral coefficients. Anyway, we can estimate these parameters by some mathematical criterion. In this equation here, the observation vector o means the speech waveform of each speech frame, and c is the coefficient vector we want to estimate. This figure summarizes how the speech vocoder works at training time and at synthesis time. At training time, we extract the acoustic features, such as the F0, the voiced/unvoiced label information, and the Mel-cepstrum, from the natural speech frame by frame. Then these speech parameters will be modeled by hidden Markov models in the next modeling step. At synthesis time, all these parameters will be generated from the trained acoustic models and sent into the speech synthesizer to reconstruct the speech signal. So that means we use the speech vocoder to extract acoustic features at training time and to reconstruct the speech waveform at synthesis time. The next technique that's very important to HMM-based speech synthesis is speech parameter modeling and generation. Here, the hidden Markov model is used as the acoustic model to describe the distribution of acoustic features. This figure shows an example of a hidden Markov model, which has three states and a left-to-right topology. Here the O and the Q stand for the observed acoustic feature sequence and the hidden state sequence for each sentence, b_q(o_t) is the state output probability of each state, and a_ij is the state transition probability. The model training of HMM-based speech synthesis is just the same as in HMM-based automatic speech recognition.
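To make the source-filter description above concrete, here is a minimal, hypothetical Python sketch of frame-level vocoder synthesis: a pulse train (voiced) or white noise (unvoiced) excitation is passed through an all-pole filter standing in for the vocal tract. The function name, filter order, and parameter values are illustrative placeholders only, not the actual HTS vocoder.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(f0, lpc_coeffs, gain, frame_len=200, fs=16000):
    """Generate one frame of speech from a toy source-filter model.

    f0: fundamental frequency in Hz (0 means an unvoiced frame).
    lpc_coeffs: all-pole filter coefficients a_1..a_p (a_0 = 1 is implied).
    """
    if f0 > 0:
        # Voiced excitation: impulses spaced fs / f0 samples apart.
        period = int(round(fs / f0))
        excitation = np.zeros(frame_len)
        excitation[::period] = 1.0
    else:
        # Unvoiced excitation: white noise.
        excitation = np.random.randn(frame_len)
    # Vocal tract: linear time-invariant all-pole filter for this frame.
    return lfilter([gain], np.concatenate(([1.0], lpc_coeffs)), excitation)

# Example: a voiced frame at 120 Hz with a toy 2nd-order filter.
frame = synthesize_frame(120.0, np.array([-1.2, 0.5]), gain=1.0)
```

A real vocoder would additionally carry the excitation phase and filter state across frames so that consecutive frames join smoothly; this sketch only illustrates the source-filter decomposition discussed above.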
Here we estimate a hidden Markov model for each phonetic unit, for example for each phoneme. We can do it by following the maximum likelihood criterion and using the EM algorithm to update the model parameters iteratively. Then, at synthesis time, assume the hidden Markov model for the input sentence is given. Our task is to generate speech parameters from this model. The method we use is to determine the speech parameter vector sequence by maximizing the output probability, in this way. Here, lambda is the HMM of the target sentence and O is the feature sequence we want to generate. We first approximate this equation by using only the most probable state sequence instead of summing over all possible sequences. Then we convert this into a two-step maximization problem. The first step is to predict the hidden state sequence independently of the acoustic feature sequence. We have introduced that in our method we use HMMs with a left-to-right topology; that means determining the state sequence of the hidden Markov model is the same as determining the duration of each state. So at training time, we estimate the state duration distributions for each state, which is the p_i function here. Then the state sequence Q can be predicted for the input sentence by maximizing this duration probability. If we use a single Gaussian to describe this distribution, the predicted duration will just be the mean of this Gaussian, so it's quite simple. The next step is that, when the optimal state sequence Q has been determined, we want to generate the acoustic feature sequence O. This is also based on the maximum output probability criterion. Have a look at this figure, which shows an example of parameter generation from a sentence hidden Markov model. Here we can see that each vertical dotted line represents a frame, and the interval between two solid lines, for example between this one and this one, is a state. So we can see that different states have different durations. The red line shows the mean of the Gaussian distribution for the state output probability, and the pink area shows its variance. It's obvious that if we want to generate the acoustic feature sequence O which maximizes the output probability, O will become a sequence of mean vectors, just this stepwise shape. But this causes a problem, which is the discontinuity at state boundaries. If we just reconstruct the speech signal using this kind of output, the quality of the synthetic speech will degrade because of the discontinuity in the speech waveform. So we solve this problem by introducing some dynamic features into the acoustic model. Here, c_t is the static acoustic feature at frame t, and we can calculate delta c_t and delta-delta c_t as the delta and acceleration features. The delta and acceleration features correspond to the first and second derivatives of the static acoustic feature sequence, and they can be considered as a kind of linear combination of the static features around them. Then we can put the static and dynamic features together, for example in this way, to get the complete speech parameter vector o_t. If we look at the whole sentence, we can see that the complete acoustic feature sequence O can be considered as a kind of linear transform of the static feature sequence C, with the transform matrix W, which is determined by how we define the dynamic features. Then we can rewrite our parameter generation criterion in this way.
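Written out, the rewritten criterion is the standard formulation of this generation step (the symbols here follow common HTS notation; the slide's exact notation is not recoverable from the transcript):

$$ O = W C, \qquad \hat{C} = \arg\max_{C}\; P\bigl(W C \mid Q, \lambda\bigr), $$

where $C = [c_1^\top, c_2^\top, \dots, c_T^\top]^\top$ stacks the static features, $W$ is the matrix that appends the delta and delta-delta features to them, and $Q$ is the state sequence fixed in the first step. Setting the derivative with respect to $C$ to zero gives the linear system

$$ \bigl(W^\top \Sigma_Q^{-1} W\bigr)\,\hat{C} = W^\top \Sigma_Q^{-1} M_Q, $$

where $M_Q$ and $\Sigma_Q$ stack the state means and (diagonal) covariances along $Q$; because $W$ is banded, this system is sparse and cheap to solve.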
So that means we want to find the optimal static feature sequence C that maximizes the output probability of the complete acoustic feature sequence O, which contains both the static and the dynamic components. By setting the derivative of this function with respect to C to zero, we can get the solution of this problem. We can solve for C by solving this group of linear equations. Its computational complexity is not very high, because W is a very sparse matrix. This figure shows what happens when the dynamic features are used. We can see that under the constraint of the dynamic features, the generated acoustic features, here the blue line, become much smoother compared to the stepwise mean sequence, the red line. Subjective evaluation results have proved that this kind of parameter generation is very effective in improving the speech quality and the naturalness of synthetic speech by reducing the discontinuity at state boundaries. Okay, the -- >>: Okay. Is that the [inaudible]. Is there any property of the generated curve, for example, [inaudible]? >> Zhen-Hua Ling: Yeah. >>: [inaudible] so what part of the solution there can guarantee that -- >> Zhen-Hua Ling: You mean no fluctuation? >>: Right. >> Zhen-Hua Ling: Then we have the first derivative -- >>: [inaudible] >> Zhen-Hua Ling: Yeah, because here, if we actually use only the first derivative, there may be some fluctuation. >>: Oh, I see, this is -- >> Zhen-Hua Ling: Yeah, this is how we calculate the dynamic features, because the first derivative only uses the next frame. >>: I see. >> Zhen-Hua Ling: So there will be some fluctuation. If we use the first order and the second order together -- >>: Okay. >> Zhen-Hua Ling: -- we can guarantee the solution is without fluctuation. So the next technique that's very important for HMM-based speech synthesis is MSD-based F0 modeling. I mentioned above that in speech vocoding we extract not only the spectral features but -- >>: Okay, go back to the previous slide. >> Zhen-Hua Ling: Okay. >>: So what is the variance for the -- for the [inaudible] of the streams together. >> Zhen-Hua Ling: Yeah. That -- >>: The three dimensions, not just one. >> Zhen-Hua Ling: No, no. >>: Okay, I see, okay. So this was all the various -- >> Zhen-Hua Ling: Yeah, yeah, yeah. >>: [inaudible]. >> Zhen-Hua Ling: So in HMM-based speech synthesis, it is also very important to predict the F0 feature. F0 means the fundamental frequency; it decides the pitch of the synthetic speech. This figure shows an example of the trajectory of observed F0. We can see that there is a very important property of F0: F0 only exists in the voiced segments of speech, for example this segment. For unvoiced segments, we cannot observe F0 at all. So that means we cannot just apply a continuous or discrete distribution to model the F0 directly, because in some parts of the F0 trajectory there are no observations. Here a new form of hidden Markov model has been proposed, which is named the multi-space probability distribution hidden Markov model, to solve this problem. This figure shows the structure of a multi-space probability distribution hidden Markov model. We can see that each state can have two or more distributions to describe the acoustic features. It's very similar to the mixture-based hidden Markov model, for example using a Gaussian mixture model for each state.
But the difference is that for each state here, in the MSD HMM, we can have different dimensions for different mixture components. For example, this component may have dimension one, and this may have dimension two. Some distributions have dimension zero, a discrete value. So it provides more flexibility than the conventional mixture hidden Markov model. Using this kind of model structure, for F0 we use two spaces. The first space is a one-dimensional continuous space, modeled by a single Gaussian distribution. The second space has zero dimension, which means it just represents the unvoiced space without observations. >>: So this is a final model? >> Zhen-Hua Ling: Yeah. Actually, yeah. Only we use the weights to represent the voiced and the unvoiced spaces. The voiced weight and the unvoiced weight stand for the probability of the current state being voiced or being unvoiced. They also need to be estimated during training. >>: Voiced, that you get from the F0? >> Zhen-Hua Ling: Yeah. >>: So, if there's no voice, [indiscernible]? >> Zhen-Hua Ling: Yeah. Actually, for each state we have the two distributions, and whether it's voiced or unvoiced is decided by the weights. For example, for a vowel, the voiced weight will be very close to one, and the unvoiced weight very close to zero. So it provides this kind of flexibility. >>: So -- so in that case, voiced and unvoiced detection is automatically built into the [inaudible]. >> Zhen-Hua Ling: Uh -- >>: [inaudible] do you have the separate voicing? >> Zhen-Hua Ling: For the training data -- the training data, the voiced and unvoiced features need to be extracted for each frame at -- >>: Oh, the training segment. >> Zhen-Hua Ling: Yeah, but, I mean, no. In the feature extraction, we need to decide whether each frame is voiced or unvoiced. But in the model training we do not need to decide whether each state is voiced or unvoiced. >>: I see. Okay. >> Zhen-Hua Ling: That can be [indiscernible] trained automatically. >>: I see. Okay. >> Zhen-Hua Ling: This figure shows the final complete acoustic feature vector we use in our model training, which is composed of two parts. The first is the spectrum part, the second is the excitation part. The spectrum part consists of, for example, the static and dynamic Mel-cepstrum coefficients. We use a multi-dimensional single Gaussian to model it for each HMM state. The excitation part consists of the static and dynamic log F0. For each dimension we use a two-space MSD distribution to model the F0. >>: I see. So altogether, each state has several streams? >> Zhen-Hua Ling: Yeah, altogether four streams: one for the Mel-cepstrum and the others for the different components of the excitation. >>: [inaudible] as if they are separate from each other. >> Zhen-Hua Ling: Yes. >>: [inaudible] say that is why we allow with it -- >> Zhen-Hua Ling: Yeah, they are. [indiscernible] will align. That [indiscernible]. Yeah, yeah. There is no dependence between these streams. So the last technique that's very important for HMM-based synthesis is context-dependent model training and clustering. In HMM-based speech synthesis we want to estimate the hidden Markov models as accurately as possible to describe the distribution of acoustic features. That means we need to take into account the factors of speech variation, and here we use the context information to represent these factors.
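Going back for a moment to the two-space F0 modeling described above, here is a small hypothetical Python sketch of how one MSD stream could evaluate a single frame: a voiced Gaussian over log F0 plus a zero-dimensional unvoiced space, with the space weights playing the role described in the discussion. Variable names and values are illustrative only.

```python
import math

def msd_state_logprob(f0, w_voiced, mean_lf0, var_lf0):
    """Log output probability of one MSD stream for a single frame.

    f0: observed F0 in Hz, or None for an unvoiced frame.
    w_voiced: weight of the voiced (1-D Gaussian) space; the unvoiced
    (0-D) space gets weight 1 - w_voiced.
    mean_lf0, var_lf0: Gaussian parameters of log F0 for this state.
    """
    if f0 is None:
        # Unvoiced frame: only the zero-dimensional space contributes.
        return math.log(1.0 - w_voiced)
    # Voiced frame: voiced-space weight times a Gaussian density over log F0.
    lf0 = math.log(f0)
    log_gauss = -0.5 * (math.log(2 * math.pi * var_lf0)
                        + (lf0 - mean_lf0) ** 2 / var_lf0)
    return math.log(w_voiced) + log_gauss

# Example: a voiced frame at 180 Hz vs. an unvoiced frame, same state.
print(msd_state_logprob(180.0, w_voiced=0.95, mean_lf0=5.2, var_lf0=0.01))
print(msd_state_logprob(None, w_voiced=0.95, mean_lf0=5.2, var_lf0=0.01))
```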
So that means we will train different HMMs for phonetic units with different contexts; we get this kind of context-dependent hidden Markov model. The simplest form of context is the phoneme identity; in other words, we train one hidden Markov model for each phoneme in the language. But we can go further. For example, here we can see the phone sequence of a sentence. There are several instances of the same vowel, but we can use a different model to represent each sample, because their surrounding, their neighboring phonemes are different. That leads to the [indiscernible] model. >>: So CL means function -- >> Zhen-Hua Ling: [inaudible] It's Japanese. >>: Oh, Japanese, okay. >> Zhen-Hua Ling: So actually in HMM-based speech synthesis we need a very large amount of context information, which can be at different levels. At the phoneme level we need to consider the current phoneme and the neighboring phonemes. At the next level, we need to consider the syllable or something like that. At the word level, maybe we need to consider the part of speech or some other context information. At the phrase level we need to consider the phrase boundaries or some other things. So if we combine all of this context information together, we get a very huge number of combinations. So it definitely leads to a data sparsity problem; that means for each context-dependent model, we may only find one sample, or even no sample, in the training database. So how do we solve this problem? A simple method is to allow the context-dependent hidden Markov models to share their parameters if their context information is similar. In order to decide the strategy of model sharing, the decision-tree based state clustering method is popularly used. This is a kind of top-to-bottom method. At first, all the states of the hidden Markov models share one set of distribution parameters. Then each node will be split automatically by a kind of optimal question. Here the questions, for example the ones shown here, are selected from a question set which is predefined according to the context descriptions of the different language. And the optimal question is determined by the [indiscernible] criterion. So after the decision tree has been constructed, all the context-dependent hidden Markov model states in the same leaf node share their parameters. For example, all of these [indiscernible] will have the same mean and the same variance. At synthesis time, the input text first passes through a text analysis module to get the context information of the current sentence. Then this kind of context information will be used to answer the questions in the tree, and finally we can reach the leaf nodes of the decision tree, which represent the distribution parameters of each synthesis state. Then we can connect the states together one by one to get the HMMs for each phoneme, and connect all these phonemes to get the HMM of the whole sentence. Once we have the HMM of the whole sentence, we can use the parameter generation algorithm which we introduced before to produce the speech parameters. >>: If -- is this consistent, so if the [inaudible] of a previous state -- >> Zhen-Hua Ling: Yeah. >>: -- is chosen, then the next state should have the last context.
So are they ensured -- >> Zhen-Hua Ling: I think because this context information is decided by the input sentence, it comes from the text analysis results of the input sentence, so it should be consistent, I believe -- >>: Inconsistent. >> Zhen-Hua Ling: Okay. >>: [indiscernible] >> Zhen-Hua Ling: Yeah, yeah. >>: Or in that case. >> Zhen-Hua Ling: In that case, yes. And the next thing I want to mention here is that in HMM-based speech synthesis we use multiple acoustic features; for example, we build statistical models for the Mel-cepstrum, for F0, and for the state durations, and the different kinds of acoustic features are affected by different kinds of context information. For example, the phone identity or some [indiscernible] will be more important for the Mel-cepstrum, but the stress or accent information is more important for F0. So in the implementation we build stream-dependent decision trees. That means for the Mel-cepstrum, for F0, and for state duration we have different decision trees, to achieve better modeling of the different kinds of acoustic features. From the above you will see the basic techniques of HMM-based speech synthesis. A very important advantage of HMM-based speech synthesis is its flexibility. Different from the waveform concatenation method, in the HMM-based method we do not use the speech waveform directly at synthesis time. We just predict acoustic features from an acoustic model, which means we can control the voice characteristics of the synthetic speech flexibly by just modifying or changing the parameters of the acoustic model. Here I will show you several examples of the flexibility of HMM-based speech synthesis. The first one we call adaptation. That means we can build a synthesis system for a target speaker or a different speaking style with only a small amount of training data from this speaker or style. The techniques we use in model adaptation come from [indiscernible] speech recognition, but they work well in TTS. Here I can show you some examples of the speaker adaptation results. But first I'd like to play the original voice of the recorded speech. >>: "Author of The Danger Trail, Phillip Steels, etc." >> Zhen-Hua Ling: This is from a female speaker. Then we can train an average voice model using a multi-speaker database; it sounds like this. >>: "I wish you were down here with me." >> Zhen-Hua Ling: So maybe it doesn't sound like anyone in particular, just an average voice. If we have ten sentences from the target speaker, we can use this amount of data to adapt the average voice model, and we can generate an adapted model and synthesize speech like that. >>: "I wish you were down here with me." >> Zhen-Hua Ling: We can perceive that it is much closer to the original voice of the female speaker. If we have more sentences, we can do it better. This is with [indiscernible] sentences. >>: "I wish you were down here with me;" "I wish you were down here with me." >> Zhen-Hua Ling: Only one thousand. >>: "I wish you were down here with me." >> Zhen-Hua Ling: You can compare it to the original voice. >>: "Author of The Danger Trail, Phillip Steels etc." >> Zhen-Hua Ling: That's very close to the original voice compared to the average voice. So the next example we call interpolation. That means if we have some representative hidden Markov model sets, we can interpolate their model parameters to gradually change the speaker and the speaking style.
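As a minimal illustration of this kind of model interpolation (a hypothetical sketch; real systems interpolate all stream parameters of the clustered context-dependent states, and other weighting schemes are possible), mixing two aligned model sets state by state could look like this:

```python
import numpy as np

def interpolate_states(means_a, vars_a, means_b, vars_b, ratio):
    """Interpolate two aligned HMM sets state by state.

    means_*, vars_*: arrays of shape (num_states, dim) holding the Gaussian
    mean and diagonal variance of each clustered state, in the same order
    for both model sets (e.g. speaker A and speaker B).
    ratio: 0.0 gives speaker A, 1.0 gives speaker B.
    """
    means = (1.0 - ratio) * means_a + ratio * means_b
    # A simple choice: interpolate the variances linearly as well.
    variances = (1.0 - ratio) * vars_a + ratio * vars_b
    return means, variances

# Example with toy 2-state, 3-dimensional models and a 30/70 mix.
m_a, v_a = np.zeros((2, 3)), np.ones((2, 3))
m_b, v_b = np.ones((2, 3)), 2 * np.ones((2, 3))
m, v = interpolate_states(m_a, v_a, m_b, v_b, ratio=0.7)
```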
For example, we can do a speaker interpolation to realize a transition from a male speaker to a female speaker. Here I can show you how the synthetic speech is affected by the interpolation. This is the male speaker's synthetic voice. >>: "I wish you were down here with me." >> Zhen-Hua Ling: This is the female speaker's. >>: "I wish you were down here with me." >> Zhen-Hua Ling: If we use some interpolation ratios we can perceive the transition effect of the voice. >>: "I wish you were down here with me;" "I wish you were down here with me;" "I wish you were down here with me;" "I wish you were down here with me." >> Zhen-Hua Ling: And if we have many representative HMM sets, it will cause a problem: it is very difficult to set the -- >>: So interpolation means you took like all the weights for each state? >> Zhen-Hua Ling: Yes. >>: So how do you know which one -- oh yeah, because you have the [inaudible] model already, so the corresponding model -- >> Zhen-Hua Ling: Yeah, that's actually the same context information we use in the interpolation. >>: I see. And how about the pitch? >> Zhen-Hua Ling: We can -- >>: The mean of the pitch. >> Zhen-Hua Ling: In this example we can also perceive the transition from the male's lower pitch to the female's higher pitch; it becomes higher and higher, yeah. The next is the eigenvoice method. The motivation of this method is that when we have a large number of representative hidden Markov model sets, it is very difficult to set the interpolation ratio for each HMM set. So we just treat each HMM set as a supervector and apply principal component analysis to reduce the dimensionality of the speaker space; then we can control the voice characteristics simply. The last example is the multiple regression hidden Markov model. In this method we can assign an intuitive meaning to each eigenspace dimension. For example, in emotional speech synthesis we can give each dimension of the eigenspace some meaning: this one means sad, this one means joyful, this one means rough. Then we can control the emotion category of the synthetic speech using a three-dimensional emotion vector. It will be much easier to control the emotion of the synthetic speech. Okay. Before I finish the first part of this talk, I'd like to recommend some resources which are very useful for HMM-based speech synthesis. The first one is the HTS toolkit for HMM-based speech synthesis, which can be obtained from this web site. This software is released as a patch for HTK. It also provides some demo scripts, so if you have your own training data, it's very easy to build a speech synthesis system on your own. The next resource is the HTS slides, a very useful tutorial for researchers who are interested in this topic. Some of the slides here are also adapted from this material. Okay. In the second part of my talk I'd like to introduce some recent advances in hidden Markov model based speech synthesis which were developed by our research group. The first topic I'd like to introduce is the articulatory control of HMM-based speech synthesis. We have introduced that in conventional hidden Markov model based speech synthesis, we train context-dependent hidden Markov models for acoustic features. These acoustic features are extracted from the speech waveforms. At synthesis time, we use this kind of parameter generation to predict the acoustic features and to reconstruct the waveform.
There are many advantages of this method. For example, it's very flexible, it's trainable, and we can do some model adaptation to change the characteristics of the synthetic speech. But there are also disadvantages of this method. One limitation of HMM-based speech synthesis comes from the fact that it's a data-driven approach. This means that the characteristics of the synthetic speech are strongly dependent on the training data available, and it's very difficult to integrate phonetic knowledge into the system building. For example, if we want to build the voice of a child, that means we need to collect some data from a child in advance, no matter whether we are doing speaker-dependent training or adaptation. But on the other hand, we may have some phonetic knowledge which can describe the difference between an adult's voice and a child's voice: for a child, the vocal tract is much shorter and the formants are much higher. But in HMM-based speech synthesis, it's very difficult to integrate this kind of phonetic knowledge into the system building. So we use another kind of feature here, which we name articulatory features. Articulatory features describe the movement of the articulators when people speak. They can provide a simple explanation for speech characteristics and they are very convenient for representing phonetic knowledge. This kind of articulatory feature can be captured by some modern signal acquisition techniques. For example, in our experiments we use electromagnetic articulography data to capture these articulatory features. In this technique, some very tiny sensors are attached to the tongue and the lips of the speaker during pronunciation; then we can record the real-time positions of all these sensors when he speaks. Because of these advantages of articulatory features, we propose a method to integrate articulatory features into hidden Markov model based speech synthesis to further improve its flexibility. At training time, we estimate a joint distribution of the acoustic and articulatory features at each frame. In this formulation, X denotes the acoustic features and Y denotes the articulatory features. If we look at the state output probability, we model the cross-stream dependency between the acoustic features and the articulatory features explicitly. This is because, from the speech production mechanism, we know that the speech waveform is always caused by the movement of the articulators, so we model this type of dependency. At synthesis time, the process is a little bit more complex than the conventional method. When the unified acoustic-articulatory model has been trained, we first generate the articulatory features at synthesis time, then we design an articulatory control model by integrating the phonetic knowledge, and this then affects the finally generated acoustic features. In terms of the dependency modeling, we have proposed two different kinds of methods. The first one is the [indiscernible]. Here, this conditional distribution describes the relationship between the acoustic and articulatory features at each HMM state. We use a linear transform of the articulatory features y_t to determine the mean of the distribution of the acoustic features. This transform A is trained for each HMM state, so it has some disadvantages; for example, it cannot adapt to the modified articulatory features, because at synthesis time we will generate y_t first and then modify it.
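Reconstructing the first dependency model just described in symbols (the exact handling of the bias term and covariance tying on the slides is not recoverable from the transcript, so treat this as a plausible sketch rather than the paper's exact notation), the acoustic output distribution conditioned on the articulatory features at state q could be written as:

$$ p(\mathbf{x}_t \mid \mathbf{y}_t, q) = \mathcal{N}\bigl(\mathbf{x}_t;\ \mathbf{A}_q\,\mathbf{y}_t + \mathbf{b}_q,\ \boldsymbol{\Sigma}_q\bigr), $$

so the mean of the acoustic Gaussian is a per-state affine function of the articulatory vector $\mathbf{y}_t$. In the improved variant discussed below, the transforms are tied to the mixture components of a GMM over the articulatory space instead of to the HMM states, making the overall mapping from $\mathbf{y}_t$ to $\mathbf{x}_t$ piecewise linear.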
>>: [inaudible] so at training time you need to simultaneously collect the [inaudible]. >> Zhen-Hua Ling: Yeah. >>: The articulation directly of the -- >> Zhen-Hua Ling: Yeah, yeah. >>: And that's a very expensive process. >> Zhen-Hua Ling: Yeah, of course. It's very expensive and difficult. >>: So [inaudible] how do you generate, [inaudible] [inaudible] so do you articulate the feature? >> Zhen-Hua Ling: Yeah, that part of the technique is still something we have not finished yet. >>: Okay, so -- >> Zhen-Hua Ling: Okay, we focus on a single speaker here for now. In the next step we are trying to find some more feasible approach to do model adaptation for this type of modeling. >>: [inaudible] to be linear, but in practice a very very -- >> Zhen-Hua Ling: Yeah, actually here the transform A is state-dependent, so really it is kind of nonlinear, or in other words a piecewise linear transform. >>: This is the same model that I have been doing for a few years. It looks just exactly the same as for recognition. So you infer, you treat that Y as a hidden variable and then you integrate it out, and then from that you can infer that [indiscernible]. I'm very, very curious about what this is. >> Zhen-Hua Ling: And we also improved the previous model structure and adopted this feature-space method. We first train a Gaussian mixture model, lambda G, to represent the articulatory feature space. Then we train a group of linear transforms. But the linear transform A is not estimated for each HMM state; it is estimated for each mixture of the Gaussian mixture model. That means when we modify Y at synthesis time, this posterior probability will change accordingly, and then it will affect the relationship between Y and X, so we can get a better representation of this kind of relationship at synthesis time. >>: So same question here. Since you treat this Y as a variable, you don't need to have data for training [inaudible] >> Zhen-Hua Ling: Yes, yes, yeah. >>: [inaudible] >> Zhen-Hua Ling: [indiscernible] a hidden variable or an observed variable? >>: No, no. But according to this model... >> Zhen-Hua Ling: Uh-huh. Yeah. >>: So you try anything to make that, you know [inaudible] so because you -- without that articulatory -- so I assume that since you do have the articulatory features there, you just use this model. >> Zhen-Hua Ling: Yeah. >>: And on top of this [indiscernible]. >> Zhen-Hua Ling: Yeah. >>: You need to have [indiscernible]? >> Zhen-Hua Ling: [inaudible] because at synthesis time, we need to generate Y and we need to modify it. >>: Generate? >> Zhen-Hua Ling: Yeah, we need to generate Y, because the modification on that is that -- >>: [inaudible] >> Zhen-Hua Ling: This way. >>: Okay. >> Zhen-Hua Ling: We first generate the articulatory features, because we see the articulatory features as more physical and meaningful. >>: So during sampling, we do sampling? >>: [inaudible] output probability generation. >>: Okay, so you got the data already. >> Zhen-Hua Ling: Yeah, yeah. >>: So once you have that, then how do you get -- but I thought that you [inaudible], so you can do them together. >> Zhen-Hua Ling: Yeah. >>: So you can do them together. >> Zhen-Hua Ling: Yeah. At synthesis time, we want to integrate this kind of phonetic knowledge -- >>: [inaudible] modify. >> Zhen-Hua Ling: This way, it is more feasible to modify articulatory features than acoustic features. >>: [indiscernible] machine techniques, just doing the standard, [indiscernible].
>> Zhen-Hua Ling: Okay. >>: So this is very kind of -- it's kind of state by state, not -- okay, okay, okay. Okay. >> Zhen-Hua Ling: So, here I can show some experimental results. We use an English database, because the recording is expensive, so we have only a limited number of sentences, but that's enough for speaker-dependent model training. We use six sensors and each has two dimensions -- >>: Yeah, so you have [inaudible]? >> Zhen-Hua Ling: Yes, yes. Actually, this database is public -- >>: It is public? >> Zhen-Hua Ling: Yeah, you can use [inaudible]. I think there is a website for this database now. The first experiment is a hyper-articulation study. We want to simulate that in some noisy conditions the speaker may put more effort into what they pronounce, so they'll open their mouth wider or something like that. So in this experiment, we generate the articulatory features first, then we modify the articulatory features using a one point five scaling factor for the Z coordinate. Z means the bottom-up direction. We just want to simulate that when you pronounce, your articulators move in a much larger -- >>: Can -- can I take a look at the [inaudible] three slides ago? Back. Yeah, back. One more, one more, one more, one more. Back, okay. One more, one more, one more. This one. So when you want to simulate, to synthesize articulation, you are saying that you move all the speech generation up. >> Zhen-Hua Ling: Yeah. The -- a wider range. >>: Wider range. I see. So including [inaudible], you have one, two, three -- >> Zhen-Hua Ling: We have six sensors. Each sensor [indiscernible] the Z and the Y, because all these sensors are put in the middle, so the X axis is omitted. So we keep these two dimensions, and in this example we just scale up the Z dimension. >>: [inaudible] dimension. But if you -- this upper feature [inaudible] in that case. >> Zhen-Hua Ling: I admit this modification here is a little bit heuristic, yeah. We just want to see whether or not this kind of modification of articulatory features can influence the generated acoustic features. >>: Oh, I see. So [inaudible], what do you mean by articulatory control [inaudible]. >> Zhen-Hua Ling: Yeah, yeah. >>: And [inaudible]. >> Zhen-Hua Ling: Yeah. We have -- we'll have some, maybe not very, how to say, not very detailed, we just need -- yeah. >>: So -- but why is it that if you move Z the way up, it's going to be articulated [inaudible]? >> Zhen-Hua Ling: Not moving up, we just -- the range. We increase the range. >>: Oh, I see. >> Zhen-Hua Ling: Increase the range. >>: So, in the system you just make this -- various speaker -- >> Zhen-Hua Ling: Yeah, yeah. >>: [inaudible] artificially manipulate them. >> Zhen-Hua Ling: No. >>: Okay, I see. Okay. >> Zhen-Hua Ling: Yeah. So we can see some change in the spectrogram. We can see that with this kind of modification the formants do become clearer, and the high-frequency part gets more energy. And also we did a subjective listening test, asking the listeners to do some dictation in noisy conditions at five dB SNR [phonetic]. We can see that the error rate drops from 52% to 45%. >>: How do you [inaudible]? >> Zhen-Hua Ling: This is dictation of the synthetic speech. >>: Oh, I see. So you use that to generate that. >> Zhen-Hua Ling: Yeah. >>: Okay, okay. >> Zhen-Hua Ling: We give them some speech and ask the listeners to write down the words that they heard. >>: The human listener.
>> Zhen-Hua Ling: Human listener. >>: Oh, I see. >> Zhen-Hua Ling: Human dictation. Human dictation. We just want to improve the intelligibility of the speech in noisy conditions. >>: [inaudible] all you do is that you increase the variance of one variable, right? And when you do that, why does the following [inaudible] also have to estimate A. >> Zhen-Hua Ling: Yeah, [inaudible] there are some -- we -- by using the conditional distribution here, for example -- >>: Yeah, yeah. It's just not -- >> Zhen-Hua Ling: With this one, we can learn some relationship between the acoustic and articulatory features. There are some training samples in the database with a large variance of the tongue movement, so we can learn this kind of relationship. That means if your tongue movement has a wider range, what would your acoustic representation be? There are some samples of this in the training data. We can learn this through the dependency. And at synthesis time -- >>: That's how you look -- >> Zhen-Hua Ling: Yeah. This is -- >>: This is increasing the variance for Y. So listen to it. >> Zhen-Hua Ling: A has been estimated on the training database, and at synthesis time we generate Y first, then we increase the -- >>: The variance. >> Zhen-Hua Ling: The variance. A is the same. >>: A is the same. >> Zhen-Hua Ling: Yeah. >>: Oh, okay. That's very interesting. >> Zhen-Hua Ling: And the second example is the vowel modification. In this experiment our phonetic knowledge is the place of articulation for different vowels. We can look at this table here. There are some vowels in English, for example the vowel "A," the vowel "E," or the vowel "Eh," and the most significant difference among these vowels is their tongue height. For the E you have a higher tongue, the A has a lower tongue, and the Eh has a middle tongue height. So here, we just predict the articulatory features using the vowel A at first, then we modify the height of the tongue to see whether or not we can change this vowel to be perceived as the vowel E or the vowel Eh, respectively. >>: Oh, so that's really good, for otherwise you [inaudible]. >> Zhen-Hua Ling: Yeah, here we have some very [inaudible]. For example, when we increase the height of the tongue, most of the stimuli will be perceived as the vowel E here. If we decrease the height, most of them will be perceived as Eh. This is from human perception, also dictation results. Here I can show you some examples of the synthetic speech. This one is the synthetic speech for the vowel A. This is the vowel in "set;" we can hear that. >>: "Now we'll say set again." >> Zhen-Hua Ling: It says set. If we increase the tongue height... >>: "Now we'll say set again;" "Now we'll say set again;" "Now we'll say sit again." >> Zhen-Hua Ling: Now it becomes sit. So it's a very clear transformation. >>: "Now we'll say sat again. Now we'll say sat again. Now we'll say sat again." >>: Yeah, more like sat. >> Zhen-Hua Ling: Yeah. So by this experiment we show that our -- >>: [inaudible] this is more than just the formants, then also [inaudible] the duration, sat becomes longer in duration, somehow. Oh, that's very interesting. >> Zhen-Hua Ling: Yeah. >>: I just can't believe that it's so powerful; [inaudible] all you do is a transformation. >> Zhen-Hua Ling: A linear transformation in the articulatory feature space. >>: Yeah, correct, correct. >> Zhen-Hua Ling: In the modification of the articulatory space we just make some simple modifications.
But in the -- >>: So it's like when you move that, it's through the transformation to the formants -- >> Zhen-Hua Ling: Yeah, yeah. >>: Yeah, that's obvious, okay. So. >> Zhen-Hua Ling: Well, here, as I have said, the linear transforms are estimated for each Gaussian mixture, each subspace of the articulatory features, so in total we have a nonlinear transform. >>: I see. Okay. So in that case you just move the mean. >> Zhen-Hua Ling: Yeah. >>: Here, mean of the what -- so tell me -- so what exactly in the system do you modify, so like the six sensors and each sensor has two dimensions. >> Zhen-Hua Ling: We only modify the Z dimension, only the three sensors on the tongue. >>: [inaudible] >> Zhen-Hua Ling: [inaudible] tongue height. >>: Okay, so when you move the tongue height, of course your formants are going to change. So I basically hear that the formant changes. >> Zhen-Hua Ling: Yeah, yeah. >>: I see. So that's just the system for diverse speech. >>: Okay, that's cool. That's very cool. >> Zhen-Hua Ling: And also this is a preliminary attempt at integrating this kind of articulatory feature. >>: [inaudible] or in China? >> Zhen-Hua Ling: [inaudible] research [indiscernible]. >>: Oh, I see. >> Zhen-Hua Ling: So this is an English database. The database is recorded by the -- okay. The second thing I'd like to introduce is minimum KL divergence based parameter generation. We know that HMM-based speech synthesis has many advantages; for example, it's trainable, it's flexible, it's adaptable. But there are also some disadvantages of this method. The most significant one is that the quality of the synthetic speech is degraded; we can perceive some, how to say, unnatural or uncomfortable things in the synthetic speech. There are three causes of this kind of speech quality degradation. The first is the vocoder, the second is the acoustic modeling method, and the third is the parameter generation algorithm. Because we use the maximum output probability criterion for the generation, as discussed above, the generated parameters will be over-smoothed. Here, we focus on the third cause of this kind of problem. This equation has been introduced before; this is the maximum output probability parameter generation. Because in our method the state output distribution is represented by a single Gaussian, if we solve this problem the outputs are very close to the state means. Even if we have used some dynamic features, they are still very close to the mean of each state. So the detailed characteristics of the speech parameters are lost in the synthetic speech. >>: So MOPPG stands for -- >> Zhen-Hua Ling: Maximum output probability parameter generation. Yeah. Just this -- this one. Okay. Many important methods have been proposed to try to solve this kind of problem. I think the most successful one is global variance modeling. In this method, we calculate the sentence-level variance of the acoustic features for each training sentence. Then we estimate a single Gaussian distribution, lambda v, for this kind of global variance. Then at synthesis time the objective function is defined in this way, which is a combination of two parts. The first one is the conventional output probability of the acoustic hidden Markov model, and the second one is the output probability of the global variance model. So under the constraint of this model, we can generate acoustic feature trajectories with larger variance, alleviate the over-smoothing of the trajectories, and improve the quality of the synthetic speech. >>: Okay.
So why don't you just, [inaudible] increase the variance? >> Zhen-Hua Ling: You mean at synthesis time, or? >>: Synthesis time. >> Zhen-Hua Ling: Yeah, but that's too heuristic -- yeah. Here we guide it by some statistical model. >>: So you train the -- >> Zhen-Hua Ling: Yeah, this is trained. This is trained. For each -- >>: So, after you do this kind of training, do you see the amount of change in variance consistent across different phones, different states, or are they in some states becoming too small [inaudible]? >> Zhen-Hua Ling: Actually, according to the criterion, this kind of variance increase would be inconsistent, because -- >>: I see. >> Zhen-Hua Ling: -- there is a trade-off between these two parts. >>: Yeah, so we call that regularization. >> Zhen-Hua Ling: Yeah, yes. Yes, yes, yeah. And this method has been proved to be very effective; in some experiments, it can improve the quality of synthetic speech significantly. >>: So, question. [inaudible] >> Zhen-Hua Ling: I think maybe it depends on the problem, because here over-smoothing is the most important problem. We want more variation in the synthetic speech. Another method is the [indiscernible] method. Here the output probability is calculated using the mean of the generated segments instead of the output probability of each frame. We can see that there is a common point between these two methods: both of them examine the similarity between the distribution parameters derived from the natural speech and those derived from the generated acoustic features. For example, in the GV method the distribution parameter is the sentence-level variance, or in the other method the segment-level mean. We want this kind of distribution parameter to be similar for both the natural speech and the generated speech. So based on this, we propose another kind of generation algorithm. In this method, we explicitly measure the distribution similarity between the generated features and the natural features. Here, the generated hidden Markov model, which means the sentence HMM estimated from the generated acoustic features, is expected to be as close as possible to the target model, which is the HMM that was used for parameter generation. This is the most important criterion of our proposed method, and the KL divergence is adopted as the [indiscernible] distance between the two hidden Markov models. We have implemented two approaches using this criterion: one is to optimize the generated acoustic features directly, and the other is to estimate a linear transform of the maximum output probability parameter generation results. So the first thing is that when we generate a sentence from the hidden Markov model, we need to estimate the generated HMM from it. It's not easy, because we have a data sparsity problem when estimating an HMM from one sentence. Here we use maximum a posteriori estimation to estimate the HMM parameters. So the mean and the variance of each state can be written as functions of O, where O is the generated speech. That means when the speech parameters have been generated, we use these features to estimate the mean and variance of each HMM state and then get the sentence HMM of the generated speech. Then we can calculate the divergence between these two HMMs; we use the KL divergence in the symmetrical form between the target HMM and the generated HMM.
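Writing the symmetrised divergence out explicitly (a standard definition; the state-level approximation actually used on the slides is discussed next):

$$ D\bigl(\lambda_{\mathrm{gen}},\ \lambda_{\mathrm{tgt}}\bigr) \;=\; \mathrm{KL}\bigl(\lambda_{\mathrm{gen}} \,\|\, \lambda_{\mathrm{tgt}}\bigr) \;+\; \mathrm{KL}\bigl(\lambda_{\mathrm{tgt}} \,\|\, \lambda_{\mathrm{gen}}\bigr), $$

where $\lambda_{\mathrm{gen}}$ is the sentence HMM estimated by MAP from the generated features and $\lambda_{\mathrm{tgt}}$ is the HMM used for parameter generation; the generated static feature sequence is then chosen so that this divergence is as small as possible.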
But unfortunately, because of the hidden states, we cannot calculate this exactly -- >>: [inaudible] >> Zhen-Hua Ling: Okay. Right here. We have, right here, the generated HMM, which means the HMM estimated from the generated acoustic features. For instance, at synthesis time, we input a text and then we generate an acoustic trajectory for this sentence. >>: [inaudible] or something. >> Zhen-Hua Ling: Yeah, yeah, and we can estimate a hidden Markov model from this trajectory. >>: Okay, yeah, which is not very good -- >> Zhen-Hua Ling: Which is -- >>: -- compared to the real -- >> Zhen-Hua Ling: [indiscernible] but we want to minimize this distortion. >>: I see, I see. Okay. >> Zhen-Hua Ling: The target HMM means the HMM that generates the acoustic features. >>: I see, so now you sort of estimate a new set of HMMs. >> Zhen-Hua Ling: Yeah. >>: That minimizes the [inaudible] doing that. >> Zhen-Hua Ling: Yeah, something like that. Yeah, yeah. >>: Oh, that's cool. Okay. >> Zhen-Hua Ling: That is something like we -- >>: So why do you want to do that instead of just doing sampling? >> Zhen-Hua Ling: I think the reason, actually the motivation, is the over-smoothing effect of the conventional maximum output probability parameter generation algorithm: if we just use this kind of output probability to evaluate whether the sequence is good or not good, it will cause some over-smoothing problem. >>: Okay. [inaudible] in this case you only use the mean. >> Zhen-Hua Ling: Yes. >>: But if you also use a variance, just like what -- to the -- the end. >> Zhen-Hua Ling: You mean the -- you mean the -- if you -- >>: If you estimate the global variance of the sequence and then you do the sampling according to the mean and this variance, you should be able to get a higher [inaudible] >> Zhen-Hua Ling: I think, in truth, when the global variance model has been introduced, it is also solved by maximizing this kind of combined output probability. It's not a sampling method. >>: I mean you can do the sampling if you already have the variance. >> Zhen-Hua Ling: Yes, we can, but -- I have some results of just a simple sampling method, but I actually think it's not too good. That means, currently, using the mean may be much safer than -- than -- [laughter]. It's a little bit dangerous if you don't have a good constraint -- >>: [inaudible] frame by frame. >> Zhen-Hua Ling: [inaudible] >>: So you said that you saw a [inaudible]. >> Zhen-Hua Ling: Yes, yes. >>: So [inaudible] sampling? >> Zhen-Hua Ling: [inaudible]. >>: Maybe you'd have. >> Zhen-Hua Ling: It's not so good. >>: I see, okay. So this was kind of, you know, segmental. >> Zhen-Hua Ling: Actually, in this method, we have not touched the problem of whether we should do sampling or use the maximum output probability. We just try to improve the maximum output probability method by using a new -- >>: [inaudible] using this criterion and announcing these [inaudible]. >> Zhen-Hua Ling: No. Actually -- >>: [inaudible] when you don't use the [inaudible]. >> Zhen-Hua Ling: No, no, no. Actually the criterion is just this way. >>: Okay. >> Zhen-Hua Ling: We just try to find the optimal static feature sequence to minimize this kind of -- >>: Divergence. >> Zhen-Hua Ling: Divergence. >>: Divergence measured between two -- >> Zhen-Hua Ling: Yeah. >>: One of them is using generated. >> Zhen-Hua Ling: An upper-bound approximation of the [inaudible] between two hidden Markov models.
Because this criterion is a function of the static features, we can just try to solve this optimization problem, and we update C iteratively to get the final results. This is one way of using the criterion. We have also tried another method: we constrain the final output acoustic features to be some linear transform of the maximum output probability parameter generation results, in this way. Under this criterion we just estimate the coefficients and the bias of the linear transform. We hope that by this method we can get a more robust estimation, by estimating the transform matrix rather than estimating C directly. Furthermore, we can also estimate the linear transform coefficients and the bias on the training set, which means we can do it globally over the training set, so it will reduce the computation at synthesis time. Here, the KL divergence is calculated for all the sentences in the training set and summed up together, and then we can estimate the linear transform that minimizes this total KL divergence. >>: I see. So this is applied in training. So after your training you get another HMM, which is the generation model. >> Zhen-Hua Ling: Actually, this training will only estimate these two parameters. This is separate from the hidden Markov model; the hidden Markov model will not be changed anymore. >>: So what's [inaudible]? >> Zhen-Hua Ling: Here, we can -- >>: Oh! It's just a transformation. >> Zhen-Hua Ling: Yeah. >>: Oh! So it's only two parameters, too. >> Zhen-Hua Ling: Yeah, yeah. Actually, that's just an alternative method compared to the previous one. There, we give full freedom to C: it can change anywhere, just minimizing this. But here we give some constraint; we require C to be a -- >>: [inaudible] HMM do this, I guess it wasn't do that, right. >> Zhen-Hua Ling: All these methods are only related to the parameter generation part. The model training, we do not touch that here. >>: I see, okay, okay. It's -- >> Zhen-Hua Ling: This is only about how we can generate the parameters once the model has been given. >>: I see. Presumably that's a trained HMM, okay. My guess was -- you originally have an HMM, you know, and the HMM maybe is too smooth, away from your data, your training data. So you modify the HMM so that they become close to each other. That's what I thought. >> Zhen-Hua Ling: No. Maybe I can go back to it. To here. I think, for example, we can consider things this way. We have a trained HMM model and we just keep it. Then we can generate acoustic features for an input text. Then we can estimate the generated HMM from these generated acoustic features. Then if we, how to say, make some modifications to the generated acoustic features, the generated HMM will be different. >>: I see. Correct, correct. >> Zhen-Hua Ling: So we can find the optimal -- >>: I see. >> Zhen-Hua Ling: -- feature sequence. It has a generated HMM which is very close to the target HMM. The target HMM is what we trained at training time, and it's always kept the same. Okay. Here are some experimental results. We compared five methods here. The first is the conventional maximum output probability parameter generation algorithm. The second is the classic GV method. And these three methods are the proposed methods: the first is the minimum KL divergence parameter generation, the second is the minimum KL divergence based linear transform, and this one is also the linear transform but estimated globally on the training set.
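Restating the transform-constrained variants described above in symbols (reconstructed from the spoken description; the exact structure of the transform on the slides, for example whether it is applied frame by frame or dimension by dimension, is not recoverable from the transcript):

$$ \hat{C} = \mathbf{A}\,C_{\mathrm{MOPPG}} + \mathbf{b}, \qquad (\hat{\mathbf{A}},\hat{\mathbf{b}}) = \arg\min_{\mathbf{A},\mathbf{b}}\; D\bigl(\lambda_{\mathrm{gen}}(\mathbf{A}\,C_{\mathrm{MOPPG}} + \mathbf{b}),\ \lambda_{\mathrm{tgt}}\bigr), $$

where $C_{\mathrm{MOPPG}}$ is the baseline generated trajectory and $D$ is the symmetric KL divergence above; in the sentence-level variant the minimization is done per utterance at synthesis time, while in the global variant $D$ is summed over all training sentences so that a single $(\mathbf{A},\mathbf{b})$ is estimated once and simply applied at synthesis time.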
This figure shows the average KL divergence of the different algorithms. Comparing natural speech, the MOPPG output and the GV output, we can see that natural speech has the lowest KL divergence of the three. I think this means the KL divergence may be a better criterion than the maximum output probability, because in terms of output probability this red line is the optimum, yet it does not sound as good as the natural speech. And this figure shows the KL divergence of the proposed methods; we can see they all become very close. This figure shows some sample trajectories for one dimension of the Mel-cepstrum. We can see that the MOPPG output is very smooth and its variance is very small, but after we apply the KL divergence based parameter generation, for example these two methods, we get trajectories with much more variation in the synthetic speech. And this figure shows some subjective results: we take some synthetic samples, ask people to listen and give a score from one to five to each sample, and calculate the average value; a higher score means better naturalness. We can listen to some samples here. This is the maximum output probability generated output. >>: [Chinese dictation] >> Zhen-Hua Ling: I'm sorry, there are only Chinese samples, but maybe you can perceive the difference between this one and the next one. With the GV method the speech is much clearer. >>: [Chinese dictation] >> Zhen-Hua Ling: [indiscernible] can perceive the difference. >>: [Chinese dictation] >> Zhen-Hua Ling: This is our proposal. >>: [Chinese dictation] >> Zhen-Hua Ling: We also made another evaluation by a preference test, comparing the GV system and the KL divergence based global transform method. The final result is that the global transform method achieves better results even than GV, and the difference is significant for a group of listeners. So in the last part I will introduce another piece of work we have done before, the hybrid HMM-based unit selection speech synthesis. Before that, I'd like to give a brief review of the classic unit selection algorithm. In this algorithm the most important thing is to define two cost functions. For example, here we have a target unit sequence, the target we want to synthesize, and we have a sequence of candidate units. We have two cost functions: the target cost, which defines how appropriate a candidate unit is for a target unit, and the concatenation cost, which measures how much discontinuity there will be if we connect two candidate units together. The final selection criterion can then be defined in this way, as the summation over the target units of the target costs and the concatenation costs, and the optimal unit sequence is selected by minimizing this cost function (see the sketch below). Compared with HMM-based speech synthesis, the unit selection and waveform concatenation method has its advantages and disadvantages. Because real speech segments are used at synthesis time, it can produce higher quality at the waveform level than HMM-based speech synthesis, where the vocoded sound may feel not so comfortable when you listen to the speech. >>: [inaudible] distance? >> Zhen-Hua Ling: Actually it has improved, but -- >>: Still not -- >> Zhen-Hua Ling: It cannot beat this one, because these are natural speech segments.
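A minimal sketch of that minimum-cost search, assuming generic `target_cost` and `concat_cost` functions supplied by the caller; it illustrates the standard dynamic-programming formulation rather than any particular system's search:

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Classic unit selection: pick one candidate per target unit so that the
    sum of target costs plus concatenation costs is minimized.
    candidates[i] is the list of candidate units for target i."""
    # best[i][j] = (cheapest cumulative cost ending in candidates[i][j],
    #               index of the chosen predecessor in candidates[i-1])
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, prev = min(
                (best[i - 1][k][0] + concat_cost(v, u) + tc, k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((cost, prev))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```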
Especially in a limited domain, it can reach very high naturalness if the selection works well. But the disadvantage of unit selection is that we need a large footprint: at synthesis time we need to store the whole speech database, which may be several hundred megabytes or even several gigabytes. There is also the problem of discontinuity if we join units together that should not be joined, and the quality is somewhat unstable. This means that for some sentences we can get nearly perfect synthesis results, but for other sentences there will be some glitch in the sample, and you will feel very uncomfortable because of these synthesis errors. So in our previous work we proposed an HMM-based unit selection method. In this method we try to integrate the advantages of the statistical approach, its automation and robustness, with the superiority of unit selection, its high-quality synthetic speech, to achieve a trainable unit selection speech synthesis system. This figure shows the flowchart of the HMM-based unit selection speech synthesis system. The training part is almost the same as in the HMM-based speech synthesis I introduced before: we extract features and train context-dependent HMMs. The difference is that at synthesis time we do not generate acoustic features from the models directly. Instead, we have a statistical-model-based unit selection module that selects the optimal unit sequence based on a criterion defined by these models. Specifically, in the model training part we first model the acoustic features, the commonly used features including the spectrum and F0 at each frame; then the phone duration; and then the concatenation spectrum and F0 at phone boundaries. This last one is maybe not so important for HMM-based parametric synthesis, but if we want to do unit selection we need this kind of model, plus some other models for different levels of features and other things. The model training is almost the same as in HMM-based speech synthesis: we train context-dependent models under the maximum likelihood criterion, we use decision-tree-based model clustering to deal with the data sparsity problem, and we use HMMs for the F0 and the concatenation F0 models. This slide shows the unit selection process. Here, assume the context information of the target sentence is C, and we have a candidate phone unit sequence. Then how do we determine which unit sequence to select? This equation is the criterion we used. It can be divided into two parts. The first part describes the probability of this unit sequence being generated given the target context information, and the second part describes the similarity between the context information of the candidate units and the context information of the target units; we use a KL divergence to calculate this part. Putting these two parts together, our criterion is to find the unit sequence which has high probability and low KL divergence in terms of the context information. We can rewrite this formula into the conventional unit selection form, with a target cost and a phone-level concatenation cost, and use a modified dynamic programming search to do the optimization; a sketch of this combined criterion is given below. Okay. And finally I will introduce the performance of our proposed method in the Blizzard Challenge speech synthesis evaluation events. The Blizzard Challenge is an evaluation event which has been held every year since 2005.
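Returning to the selection criterion just described, here is a minimal scoring sketch: a likelihood term for the candidates' acoustic features under the models of the target context, minus a weighted KL divergence penalty for context mismatch. The attributes `features` and `context_model`, the callables `loglik` and `context_kld`, and the weight `w` are all illustrative placeholders, not the actual system's interfaces:

```python
def hybrid_unit_score(unit_seq, target_models, loglik, context_kld, w):
    """Score a candidate unit sequence under the hybrid criterion.
    Higher is better; the negated score can be rearranged into target and
    concatenation costs for a dynamic-programming search."""
    score = 0.0
    for unit, target in zip(unit_seq, target_models):
        # P(candidate's acoustic features | models of the target context)
        score += loglik(unit.features, target)
        # KL divergence between the candidate's own context model and the
        # target context model, acting as a context-mismatch penalty
        score -= w * context_kld(unit.context_model, target)
    return score
```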
Its purpose is to promote the progress of speech synthesis techniques by evaluating systems built on the same database with different methods. The USTC team has entered this event since 2006; we have used HMM-based unit selection since 2007 and have had very good performance in all the evaluation measurements since that year. So, let's take this year's Blizzard Challenge as an example. This year we had a very challenging synthesis task: the speech database was not recorded in a recording studio but taken from internet resources. It is an audio book database containing four stories written by Mark Twain and read by an American English narrator. The database is quite big, more than 50 hours, and it has imperfect transcriptions and rich expressiveness. This slide shows the results of the Blizzard Challenge this year. >>: But that's coming from inter[inaudible] >> Zhen-Hua Ling: Yeah [inaudible]. The first item of the evaluation is similarity, which measures how similar your synthetic speech is to the voice of the source speaker. We can see that A is the natural speech and C is our system. The mean score is about 4.1, the highest of all participants. For the naturalness score, our system also got the highest mean score, which is about 3.8. For intelligibility, we had some human dictation of the synthetic speech, and the word error rate of the dictation for our system is around 19%, which is also the lowest of all the systems. We also had some paragraph tests. In these tests -- >>: So what is F? >> Zhen-Hua Ling: F? >>: Which group is F? >> Zhen-Hua Ling: I can't remember, actually; this competition is run in an anonymous way. >>: Oh, okay. >> Zhen-Hua Ling: [laughter]. Yeah. >>: So how do you know that's you? >> Zhen-Hua Ling: They will give you some data so you know which system is yours, but they will not tell you -- I think everybody knows some information about their own work, but I forgot which group F stands for. >>: I see, okay. So that must be under noisy conditions or something, otherwise the error rate cannot be so high, 20%. >> Zhen-Hua Ling: No, it's not that. These sentences are called SUS sentences, semantically unpredictable sentences. They are that difficult: you cannot use the context to guess what the words are, so it's much harder. >>: Okay, I see. So can we go back two slides? One more. >> Zhen-Hua Ling: Okay. >>: So F, I, B [phonetic] are [inaudible]. >> Zhen-Hua Ling: Yes, but there are also significance test results distributed by the organizer, and in those results our system is significantly better. >>: I mean those three groups, F, I, B. >> Zhen-Hua Ling: Oh, you mean. >>: [inaudible] >> Zhen-Hua Ling: Actually, this is a box plot, because the scores are given discretely from one to five. So the line means the median of the scores. >>: The mean? >> Zhen-Hua Ling: Not the mean, the... >>: [inaudible] >> Zhen-Hua Ling: Yeah. And this box shows the spread. >>: Standard deviation. >> Zhen-Hua Ling: Standard deviation, yes, maybe some multiple of the standard deviation, as far as I remember. Actually, it should be explained this way: half of the samples are located in this box. >>: Okay.
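To make the box-plot reading concrete, a small sketch with hypothetical 1-to-5 ratings: the line inside the box is the median, the box edges are the quartiles (so half of the ratings fall inside the box), and the mean score quoted on the slides is computed separately, so it need not coincide with the median shown in the plot:

```python
import numpy as np

# Hypothetical 1-to-5 listener ratings for one system (illustration only)
scores = np.array([5, 4, 4, 3, 4, 5, 2, 4, 3, 5])

median = np.median(scores)                 # the line inside the box
q1, q3 = np.percentile(scores, [25, 75])   # the box edges (interquartile range)
mean = scores.mean()                       # the separately reported mean score

print(median, (q1, q3), mean)
```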
So this is the box-and-whisker diagram. >> Zhen-Hua Ling: Yeah, the box-and-whisker diagram. >>: Okay. >> Zhen-Hua Ling: Yeah. >>: So if you move to the next one -- what is the similarity here? Similarity means what, naturalness or intelligibility? >>: Similar to the original recordings. >> Zhen-Hua Ling: Similar to the original speaker's recordings. >>: Okay. So what does naturalness -- >> Zhen-Hua Ling: Actually -- >>: This one tests adaptation -- >> Zhen-Hua Ling: Actually it's related to naturalness. For example, if a speech sample is very unnatural, it cannot get a high similarity score. But we may have a sample that is very natural yet sounds like another person, not the target person. >>: Oh, I see. So naturalness has two categories. One is the absolute naturalness, you know, without -- >> Zhen-Hua Ling: Yeah. >>: Without speaker identity [inaudible]. >> Zhen-Hua Ling: Yeah. >>: Okay, I see. So here... naturalness [inaudible]. But why do they have the [inaudible] box? Normally if it's a bar, you would actually see something marking the little speck, the O. So, yeah, that's the same question I thought you were asking. So the MOS score is 3.8, right? That's your system. >> Zhen-Hua Ling: That's our system, yeah. >>: That is supposed to be somewhere at 3.8. >> Zhen-Hua Ling: Yeah, something like that. >>: It doesn't look like that. I don't know. I don't understand. What is the box representation here? >> Zhen-Hua Ling: You mean this? >>: Yeah. >> Zhen-Hua Ling: That is the median. >>: The median. >> Zhen-Hua Ling: Yeah, so it must be one, two, three, four or five. That is the median value, not the mean. >>: Okay, okay. So what's the 3.8 here? >> Zhen-Hua Ling: 3.8? >>: Does it come from here? >> Zhen-Hua Ling: No, we cannot get 3.8 from this figure; it comes from other sources. This figure just shows the distribution of the scores. >>: I see, okay. I see. >> Zhen-Hua Ling: And the scores are actually discrete. >>: So it's just your system and two other groups doing equally well [inaudible]. >> Zhen-Hua Ling: You mean? I think the medians look the same, but there are also results from the significance analysis which show that our system [indiscernible]. >>: Oh, okay. You can't see that from here. >> Zhen-Hua Ling: Yeah. >>: [indiscernible] >> Zhen-Hua Ling: There are some more detailed results of the -- >>: I see. So is this for Chinese or for English? >> Zhen-Hua Ling: It's English. >>: Oh, so your group works with both English and Chinese, and both are probably the best. >> Zhen-Hua Ling: In several years we have had more than one language in the evaluation, but this year we have only English. >>: English, oh, okay. So does this test have Chinese listeners as well? >> Zhen-Hua Ling: You mean the listeners for -- >>: Yeah, for this year's challenge. >> Zhen-Hua Ling: I see. No, the listeners are not from China. >>: I see. Okay. >> Zhen-Hua Ling: Okay. This is a new measurement this year, the paragraph test. This means that what the listeners hear is not sentence by sentence but a whole paragraph of the novel, and they give impression scores on these seven aspects; the range, I remember, is from 10 to 60. This is the score of the natural speech, and this is the score of our system, which is also the best performance in all seven aspects compared to the other systems. So I can show you some demos of this.
We have different types of synthetic sentences: some are book sentences, some are news sentences and book paragraphs. >>: But this test [inaudible] SUS. Because this is -- >> Zhen-Hua Ling: There are some, but I didn't put those samples here, because -- >>: So this is only to look for naturalness. >> Zhen-Hua Ling: Actually, these two parts are used for the similarity and naturalness tests, and this part is used for the paragraph test. >>: "Once more the original verdict was sustained. All that little world was drunk with joy. I've liked him ever since he presented this morning. Her research is the second piece by Edinburgh University scientists highlighted recently." >>: This is the original one -- >> Zhen-Hua Ling: The synthetic speech. >>: Oh, the synthetic speech. >> Zhen-Hua Ling: All the samples are synthetic speech from our system. >>: Oh, okay. >>: "So it was warm and snug within, though bleak and raw without; it was light and bright within, though outside it was as dark and dreary as if the world had been lit with Hartford gas. Alonzo smiled feebly to think how his loving vagaries had made him a maniac in the eyes of the world, and was proceeding to pursue his line of thought further when a faint, sweet strain, the very ghost of sound, so remote and attenuated it seemed, struck upon his ear." >>: So which one -- what is the difference between them? >> Zhen-Hua Ling: No difference, just some samples. >>: Just some samples. >> Zhen-Hua Ling: Sorry, I haven't explained yet. I think the texts of the book paragraphs are taken from another story by Mark Twain which is not contained in the training set. Okay. Finally, I will give a summary of this presentation. In this presentation I introduced some basic techniques of Hidden Markov model based speech synthesis. I think the key idea is that two models are used. One is the source-filter model, used to extract acoustic features and to reconstruct the speech waveform; the second is the statistical acoustic model, which describes the distribution of the acoustic features for different context information and is used to generate acoustic features. This HMM-based speech synthesis method has become more and more popular in recent years, but it is still far from perfect; there are many parts that still need to be improved. Then I introduced some recent advances of this method which were conducted in our lab. In order to get better flexibility, we proposed a method for integrating articulatory features into HMM-based speech synthesis. In order to get better speech quality, we proposed a minimum KL divergence based parameter generation algorithm. And in order to get better naturalness, we use the hybrid HMM-based unit selection approach. >>: Quality and naturalness, are they similar? >> Zhen-Hua Ling: They are related, but I think speech quality is more related to the spectral parameters; for example, you can feel some buzziness or over-smoothness. Naturalness is more related to other aspects of the speech. >>: So quality is a subset of naturalness. >> Zhen-Hua Ling: Yeah, it can be. For example, if you judge the naturalness of a sentence, speech quality may be one of the factors you need to consider. Okay, let's end my talk here. Thank you. >>: [applause] >>: Okay, so -- I think we asked all the questions already. Okay. Thank you.