>> Geoffrey Zweig: It's my pleasure to introduce Jui-Ting Huang today. Jui-Ting is graduating from the University of Illinois Urbana-Champaign, where she's one of Mark Hasegawa-Johnson's students. She was an intern here in 2009 and did some very interesting work with [inaudible] Lee at that time. Basically they did language model training in order to maximize the mutual information between the acoustics and the language model when all you have to start with is the language model itself and the phone confusion matrix. From there she moved on to her thesis work, which she is going to be talking about today, and that focused on mixed supervised and unsupervised training, or semi-supervised learning, and she'll tell us about that now. >> Jui-Ting Huang: Thank you, Geoff. So today I will talk about my thesis research, semi-supervised learning for acoustic and prosodic modeling in speech. I will first give a brief introduction, a brief [inaudible], on semi-supervised learning, and then I will present two kinds of problems where I investigate the use of unlabeled data with some proposed semi-supervised learning algorithms, followed by some conclusions. So we know that the scale of audio data -- sorry, actually there is a [inaudible] here. So that's the talk outline, and -- yeah. The scale of audio data that we can collect is increasing dramatically these days. It turns out that speech data without transcriptions is easy to collect but time consuming to transcribe. This motivates the research on semi-supervised learning, which tries to use unlabeled data in addition to a limited amount of labeled data to improve the supervised model. So the ultimate goal here is to improve speech applications where recognition performance is limited by the amount of transcribed data. There are a few related methods proposed for speech applications. Most related approaches can be categorized as self-training methods, which we conventionally call unsupervised or lightly-supervised training with confidence thresholds: for untranscribed speech, we run an existing recognizer on the untranscribed set, augment the training set with the confident automatic transcriptions, and retrain the model. While this has demonstrated the power of untranscribed data to improve the system, there are some drawbacks. For example, there is no systematic way to determine the threshold on the confidence measure, and, on the other hand, there is no theoretical foundation for the convergence properties. So in my work I'm taking another perspective and trying to propose a principled framework in which we can train our models with reasonable training objectives that consider the influence of both labeled and unlabeled data. Let me give a formal definition of semi-supervised learning. In machine learning, the classification problem is: given data described by some features, map it into one of a set of pre-defined categories. This is called supervised learning because we need a training set of labeled tokens and we learn the mapping from them. In semi-supervised learning, in addition to those labeled tokens, we are also given a set of unlabeled data to learn the mapping. In our framework, the models I'm looking at are either Gaussian mixture models or hidden Markov models.
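To make the setting concrete, in rough notation not taken from the slides, the labeled set, the unlabeled set, and the kind of combined objective the rest of the talk builds on can be written as:

    D_L = \{(x_i, y_i)\}_{i=1}^{L}  (labeled tokens),   D_U = \{x_j\}_{j=1}^{U}  (unlabeled tokens),
    \mathcal{F}(\theta) = \sum_{(x,y) \in D_L} \log p_\theta(x, y) \;+\; \sum_{x \in D_U} \log p_\theta(x),

where theta collects the GMM or HMM parameters. Later parts of the talk either optimize a joint likelihood of this form with EM, or replace the unlabeled term with a regularizer added to a discriminative criterion.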
So by learning I mean the estimation of GMM or HMM parameters. The basic idea is to use unlabeled data to generalize the applicability of the classifier to unseen data. There is a prerequisite: we assume the distribution of unlabeled examples is relevant to the classification problem. If we have a large quantity of unlabeled data, we can leverage the data distribution from the unlabeled data and connect it with the classifier. And there are different perspectives we can use to make this connection. The first kind treats unlabeled data in the context of a missing data problem: the unlabeled data are like incomplete data that are missing the class label information, and one way to handle incomplete data is the EM algorithm. The other kind imposes some measure on the unlabeled data to regularize the [inaudible] problem caused by using labeled data only. My methods in the following are based on one of these two perspectives. >>: How do you measure the relevance? You say the distribution of unlabeled examples is relevant; how do you measure that, right? >> Jui-Ting Huang: There's no quantification of this relevance. By relevant I mean that, for example, if we have a generative model, then unlabeled data are relevant because generative training is going to find a better likelihood to describe your data. In that case it's natural to incorporate unlabeled data into your generative training, and if you get a better generative model, which we are using for the classification problem, then they are relevant. Also, there is another kind of relevance: if the distribution of the unlabeled data can change our thinking about the model we learned from only labeled data, which I will show later, then it's also relevant. So the point I want to raise here is that unlabeled data do not necessarily help; you need the right context and the right model to use them. So in particular I propose a training framework where models are trained to optimize an objective that reflects reasonable assumptions about both labeled and unlabeled data. There are two kinds of training criteria. The first kind is generative training criteria; we will see their ability to discover unseen classes, and we apply this ability to the problem of prosodic break detection. Then I go on to the acoustic modeling problem and propose semi-supervised maximum likelihood criteria for GMM and HMM training. The other kind is regularized discriminative training criteria, meaning that the original discriminative criterion will be augmented with a certain measure on unlabeled data as regularization, and I will apply that to acoustic modeling. So in the first part I will show how we use a generative model for the detection of prosody in Mandarin speech. Let me briefly introduce prosody. Prosody is the variation in pitch, loudness, tempo and rhythm in human speech that conveys extra information in speech communication.
There are many prosodic events across languages, and here I'm interested in the prosodic break, or prosodic boundary, phrase boundary: instead of speaking an utterance in a flat tempo, we usually tend to pause at several locations within an utterance. The boundaries between natural groups of words in speech are prosodic breaks. Here is an example in Mandarin speech where each symbol represents a syllable corresponding to a [inaudible]. Now I'll play this male speaker, and you will find that instead of speaking in a flat tempo, he slows down slightly in some places, as indicated by the blue lines. [Male voice speaking in Mandarin played] So those blue lines are where the prosodic breaks are. Locating those prosodic breaks in speech -- I'm finding my mouse -- is useful for finding natural groupings for speech synthesis, and it also appears that the segmentation [inaudible] prosody corresponds to syntactic structure in speech, which can be used for speech understanding. Furthermore, we could also build prosody-dependent acoustic models, meaning that the acoustic model changes depending on the current prosodic event. Since it's useful, we are interested in the task of automatic prosodic break detection, which is essentially a classifier that receives acoustic correlates and classifies the event as non-break or break. Because it's a classification problem, a supervised learning approach would require prosodically labeled data; we would need prosodic annotation, which is usually done by linguistic experts who are sensitive to those acoustic cues and can efficiently mark those prosodic events for [inaudible]. Because that is even harder than [inaudible] transcription, it motivates our goal, which is to automatically locate prosodic breaks in Mandarin speech without any prosodically labeled data. It can be done because of two things. First, we identify reliable representatives of the non-break class using some simple textual, lexical cues from Mandarin. Then, given that small set of labeled data, we rely on a semi-supervised learning algorithm to learn a classifier using both the labeled set and the rest of the data as the unlabeled set. It's worth noting that we have a special setting of the problem where labeled data are available from only one of the two classes, and we will see that our algorithm is able to take care of the unseen class, the break class. Here's an example of how it works in general: if we have utterances, then a speech recognizer will recognize text, which for us here is a sequence of Chinese characters. Every syllable boundary, corresponding to each character, is where we need to decide if there's a prosodic break or not. According to a rule which I will explain later, we are able to identify some class representatives for the non-break class, indicated as NB here, and we collect these data as the labeled set. The rest of the syllable boundaries, with unknown class identity, go to the unlabeled set. Then our semi-supervised learner uses these two sets to estimate a possible underlying model and predict the most likely prosody for those unlabeled boundaries or even an [inaudible] test set. So let me show you the rule. To find the class representatives, we leverage some information from the text output by the recognizer.
Before showing the rule, let me introduce the concept of the lexical word in Chinese: beyond the single character, the lexical word is a combination of multiple characters which can convey a different lexical meaning than its character components. And it's reasonable for a recognizer to output a word sequence rather than a character sequence; for example, you can have word-based language models. The lexical cue we are relying on is a finding from the literature that prosodic breaks do not occur within a short lexical word, where by short we mean the lexical word contains three or fewer syllables. So according to this cue, for those within-word syllable boundaries we are pretty sure they are from the non-break class, so we mark them as NB. That's how we find the class representatives, and the rest go to the unlabeled set. Then how do we learn from this mixed set? First let me show the generative model we are using, and then I'll show how we train the model. We are using a mixture model, a [inaudible] mixture model with M mixture components that generates data (x, y, I_m). x is the multi-dimensional prosodic feature vector containing N acoustic correlates, y is the class label, and I_m is the indicator of whether the label is missing or not. Then the [inaudible] probability of the data is described by this mixture-based model, where we have a weight and a Gaussian for each component. Because it's a tied mixture, the two classes share the same pool of Gaussian components, so the class discrimination comes from the soft class membership defined by this component-dependent class probability. Also, one important feature here is that we have a class-dependent label-missing probability to match the one-class scenario closely: since we know that in the training data we don't have label information from the break class, we simply apply the constraint that the label-missing probability for the break class is one, and then we learn the rest of the parameters. Because it's a generative model, it's natural to have a maximum likelihood training criterion to estimate all of the parameters; the likelihood is computed over the combination of the labeled set and the unlabeled set, and it can be optimized by the EM algorithm. Classification is then simply finding the class that maximizes the class posterior probability, which can be computed in the following way using the parameters we just learned. We evaluated our semi-supervised approach on a corpus of real speech by a single male Mandarin speaker. In this corpus prosodic annotations are available, but we only used them as ground truth for evaluation. There are two kinds of break labels, but we treat them as the same break class. Regarding features, for each syllable boundary we extract eight acoustic features, including pitch-, duration-, and energy-related features. Yeah. So here -- >>: Is there some reason why it was just one speaker? >> Jui-Ting Huang: Yes, it's just one speaker. >>: But why? Why not use -- >> Jui-Ting Huang: Because this corpus just contained one speaker. >>: Does that make a difference? >> Jui-Ting Huang: It makes a difference because then we would need to take care of speaker normalization for the prosodic features, and it could be [inaudible] to some extent. But for now it's a simpler situation. >>: [inaudible] >> Jui-Ting Huang: Yeah, sure.
Yes, I totally agree with you. One thing, though, is that the number of locations that can be identified by silence varies depending on the corpus. There is not necessarily a sufficient number of syllable boundaries marked by silence; it depends on the corpus. So we think identifying class representatives of the non-break class is more general than finding boundaries by silence. But it's possible to add in those silence-marked data, and then we are back to a normal setting of semi-supervised learning where the labeled data have information about both classes. Actually, I have that result, but I didn't show it here for a complete comparison. >>: [inaudible] classification. So it's a classification, you know the word and you classify -- >> Jui-Ting Huang: Yes, in this case it's a kind of prosodic event classifier. So I know where the boundaries are, meaning that, yes, I assume I gather that information from the recognizer output. >>: So for more complete modeling you may also need to do sequence modeling. >> Jui-Ting Huang: Yes. >>: And also maybe use [inaudible] word-specific boundary information there too. >> Jui-Ting Huang: Yes. I will show that for acoustic modeling there is a sequence model, but here for prosodic modeling I haven't considered the sequence scenario yet. >>: [inaudible] >> Jui-Ting Huang: So in the experiment, the transcription comes from the ground truth. >>: [inaudible] >> Jui-Ting Huang: No. And another thing is -- maybe we can discuss that later. So -- okay. Here I want to demonstrate the model's ability to discover a new class using one of our features as an example. The feature here is the difference in average energy between the syllables after and before a syllable boundary. The top-most plot is the histogram of the feature in the training set regardless of class information. The next two plots are the respective histograms: this one is for the non-break class, this one is for the break class. We see that the break class has a lower mean than the non-break class. Then we show the estimated data distribution given only a subset of labeled tokens from the non-break class, with different numbers of Gaussian components. This is the joint probability computed using the model, and we see that with a sufficient number of Gaussian components, like 16, given sufficient parameters we are able to identify the contrasting class that we haven't seen in the labeled set. So let's look at the detection accuracy results. Let me show two reference numbers first. By the fully-annotated approach I mean we assume all of the prosodic labels are available, so this is the upper-bound performance we can get, 91 percent. The chance rate reflects the [inaudible] distribution of the data: if you simply label every syllable boundary as non-break, you get 77 percent accuracy, but it's not a reasonable method according to the precision and the [inaudible]. Then, by using the non-break data as the labeled tokens, we are able to get a reasonable accuracy, but it's more meaningful to see that we have reasonable precision and recall, summarized by the F score here. So this summarizes that we are able to discover prosodic breaks with [inaudible] and the semi-supervised algorithm.
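For reference, the tied-mixture model and the semi-supervised training objective just described can be sketched roughly as follows; the notation (w_m, \mu_m, \Sigma_m, P(c|m), I) is assumed here and may differ from the slides:

    p_\theta(x, y=c, I=i) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x; \mu_m, \Sigma_m) \, P(c \mid m) \, P(I=i \mid c),   with the one-class constraint  P(I=1 \mid \text{break}) = 1.

    Training:  \max_\theta \; \sum_{(x,y) \in D_L} \log p_\theta(x, y, I=0) \;+\; \sum_{x \in D_U} \log \sum_{c} p_\theta(x, c, I=1),   solved by EM.

    Classification:  \hat{y}(x) = \arg\max_c \sum_{m} w_m \, \mathcal{N}(x; \mu_m, \Sigma_m) \, P(c \mid m).

The two classes share the same pool of M Gaussians, so the class information enters only through the component-dependent class probabilities P(c | m) and the class-dependent label-missing probabilities.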
We now move to the problem of acoustic modeling, and we will start with the generative approach that showed its effectiveness in prosodic modeling. In the context of acoustic modeling with a generative criterion such as the maximum likelihood criterion -- and GMMs and HMMs are both generative models -- unlabeled data can be naturally incorporated into the generative framework. That is, we extend the ML criterion to a semi-supervised version. In particular, the model parameters are now estimated to maximize the likelihood of the joint labeled and unlabeled set, and this can be done by EM or Baum-Welch for GMMs or HMMs. There is an issue here, first raised by Cohen: incorrect model assumptions can cause unlabeled data to degrade performance. But in their paper they assume one Gaussian distribution per class. In our case we're using a more expressive probability model, a GMM, which is already a better model assumption than a single Gaussian, and we haven't observed the degrading effect so far. Also, it's well known that generative models may not optimize classification accuracy well, and since people usually go on to do discriminative training, we will see how we extend supervised discriminative training to a semi-supervised version. The discriminative criterion, if we are using maximum mutual information, is simply to maximize the average log posterior probability. And since discriminative training is prone to over-fitting the training data, it is helpful to include a regularization term to reduce the over-fitting problem. Our semi-supervised algorithms essentially use certain measures on unlabeled data as regularization of the supervised training criterion. There are two regularization measures. For the first one, we propose to augment the supervised criterion with a maximum likelihood measure on the unlabeled set. The interpretation here is that it balances leveraging evidence from the labeled data against clustering information from the whole input data. >>: Can you go back maybe one more? So isn't there a prior on the label sequence, a prior on y, that needs to be in there? If you look at the mutual information, the probability of x and y over the probability of x by itself and the probability of y by itself, and then use the chain rule, you have the probability of x and the probability of y given x; the p of x cancels in the numerator and the denominator, so you have p of y given x over p of y? >> Jui-Ting Huang: I think I don't follow you. >>: Maybe this is like what [inaudible]. >>: What happens to p of y? Isn't that in the denominator? >> Jui-Ting Huang: Yeah, if you -- >>: But for discriminative training p of y is -- >> Jui-Ting Huang: So here it's the classification version of the criterion. Yeah, I think, as you say, there is a p of y term, but -- >>: [inaudible] >> Jui-Ting Huang: Yeah, I think -- >>: Well, let's keep going. >> Jui-Ting Huang: Yeah, but I have to say that I will look at the classification case and the recognition case, and this is the measure that I use for the classification training, and I think it generalizes to the MMI criterion for recognition. >>: Is that different from traditional maximum likelihood? >> Jui-Ting Huang: This one is actually conditional maximum likelihood for a classifier; that's what we usually call it.
And I'm just saying that they are equivalent if you generalize to maximum mutual information for the recognition case. So I first talk about the hybrid training objective. If you look more closely at the estimation problem, the objective is actually composed of multiple likelihood terms, and the corresponding parameter update formulas can be derived based on a modification of the growth transform, or extended Baum-Welch. So essentially it's a modification of the original MMI training, adding the additional statistics from the unlabeled data in this way. And the L and the U are just normalizations for the amounts of training data, [inaudible] the balancing factor. This is the version for classification, where each model is represented by a Gaussian mixture model, so this is the mean vector for class c and component m. I want to first discuss our relation to other work. Several techniques have been developed to deal with the over-fitting problem. For example, the H-criterion globally interpolates the MMI and ML objectives, and I-smoothing interpolates with Gaussian-component-dependent priors based on the ML estimate. Our criterion is different in that we leverage information from unlabeled data. And there is a different, interesting perspective proposed by IBM researchers this year [inaudible]: they proposed a maximum entropy measure on unlabeled data to augment the supervised MMI term, which essentially results in a minus sign here for the unlabeled measure -- it flips the sign. >>: [inaudible] [laughter] >> Jui-Ting Huang: So I haven't -- >>: [inaudible] >> Jui-Ting Huang: They got a slight improvement, but they combined that method with another paradigm, multi-view learning, by combining many systems. So, I mean, it's interesting to try what I would get if I flip the sign. >>: So in words -- correct me if I'm making a mistake here, but I think in words, what the first one means is that you want to maximize the likelihood of the supervised data and make sure that you've got a hypothesis for the unsupervised data that gets high likelihood. And what the second one does, with the minus, is that you want to maximize the likelihood of the supervised data while saying as little as possible about the unsupervised data. Is that qualitatively right? >> Jui-Ting Huang: Yes. And I'm thinking that if you have a very strong supervised model, maybe it's reasonable -- maybe you can try to use unlabeled data to penalize the likelihood of the competing hypotheses. But you will see that in my experiments I assume we have a limited amount of labeled data. My supervised model is poor in the first place, and in that case I rely on the unlabeled data to capture the data distribution, even though there might be some errors in the hypotheses produced by the recognizer. In this context, where I don't have a good enough distribution model to begin with, maximum likelihood regularization might help. >>: [inaudible]. >>: I agree. Doing a minus seems bizarre. >>: I'm really wondering why -- >>: [inaudible]. >>: Then in the entropy there would have been a p of -- there would have been -- >> Jui-Ting Huang: It's p [inaudible], there is no p. >>: [inaudible] >> Jui-Ting Huang: It's just -- >>: P log p. >>: [inaudible] >> Jui-Ting Huang: There's no [inaudible] information here. Yeah.
It's just -- >>: But to compute p of x you have to hypothesize class information, right? Because you have class-based models. >> Jui-Ting Huang: Yeah, you have to -- >>: You're going to have to break that down somehow into a sum over all the possible class labels [inaudible]. >> Jui-Ting Huang: Actually I have -- >>: [inaudible] >> Jui-Ting Huang: Actually I -- >>: [inaudible] >> Jui-Ting Huang: As you said, to approximate it you usually rely on the prior information here. So I think they do the same thing: they use a language model and their acoustic model to approximate this. >>: So here in this equation, the first line, on the unlabeled data, if you find some theta that gives any [inaudible] p of x under theta a very low probability, your maximization will get hugely penalized. So you would not allow those densities that give very low probability, [inaudible] explain why it helps. >> Jui-Ting Huang: Yeah. >>: So you try to make the -- for every xi, you make them almost equal. >> Jui-Ting Huang: I see. >>: Kind of smoothing towards the average. >>: I'm wondering if there's some kind of self-inconsistency: if you have a pile of labeled data and no unlabeled data, you maximize the probability of the labeled data; now imagine that you chop your labeled data in half, move half of it into the unlabeled data and throw away the labels -- [inaudible] probability of that same data whose probability you used to be trying to increase, now all of a sudden you're trying to decrease its probability. It seems kind of inconsistent. >> Jui-Ting Huang: That's interesting. And I think they also see it as corresponding to the maximum entropy idea. So I think it's worth discussing. I already talked about this. Okay. So the second regularization I have is conditional entropy regularization, where conditional entropy is the uncertainty of the class prediction given the features. The paper that first proposed this idea is based on the assumption that unlabeled data are beneficial especially when classes are well separated. So when they estimate a model, they apply a model prior that prefers minimal class overlap, which is equivalent to minimizing the label entropy on unlabeled data. The other interpretation is that the negative conditional entropy term encourages the model to have the greatest possible certainty about its label decisions, so it reinforces the confidence of the supervised classifier we learned from the labels. To compute the conditional entropy, because we don't know the real data distribution, we approximate it with the empirical distribution from the unlabeled data. So the formula is essentially like this: we have the discriminative criterion for labeled data and the negative conditional entropy for unlabeled data. It's in a sense an unlabeled version of the discriminative criterion, so the objective is coherently defined. Because of this term we can no longer use extended Baum-Welch, so we optimize by gradient-based methods, based on the gradient computation of this term. Moreover, we use preconditioned conjugate gradient methods to accelerate the convergence rate of steepest descent. Here I'm just showing the overall gradient of these two terms, which looks like this in the classification case.
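In rough notation, not the slide notation, the two regularized objectives just described look like this, with D_L the labeled set, D_U the unlabeled set, and alpha an assumed balancing weight:

    \mathcal{F}_{\mathrm{ML}}(\theta) = \frac{1}{L}\sum_{(x_i,y_i)\in D_L} \log p_\theta(y_i \mid x_i) \;+\; \frac{\alpha}{U}\sum_{x_j\in D_U} \log p_\theta(x_j)

    \mathcal{F}_{\mathrm{CE}}(\theta) = \frac{1}{L}\sum_{(x_i,y_i)\in D_L} \log p_\theta(y_i \mid x_i) \;-\; \frac{\alpha}{U}\sum_{x_j\in D_U} H_\theta(y \mid x_j),
    \quad H_\theta(y \mid x) = -\sum_{c} p_\theta(c \mid x) \log p_\theta(c \mid x).

The first augments the MMI (conditional likelihood) term with a maximum likelihood measure on the unlabeled set; the second subtracts the empirical conditional entropy, so maximizing the objective pushes the model toward confident label decisions on the unlabeled data.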
>>: At the end you're probably going to compare results using different techniques, and one of those techniques is going to be -- not the optimization, but one of the techniques is maybe just plain ML, which presumably you'd use EM to optimize. Then there's going to be this other technique where you've used preconditioned conjugate gradient descent to optimize, and there's going to be a difference between those techniques. And so the question is, how do you know that the difference comes from the objective function versus the optimization method? Like if you did preconditioned conjugate gradient descent for a regular maximum likelihood objective function, maybe that would give you a different answer. Is there some reason you can tell me, oh, you don't have to worry about that? >> Jui-Ting Huang: I think as long as I'm sure the objective functions are optimized -- >>: But now we're going to get local -- >> Jui-Ting Huang: Yeah, either of them just gets a local optimum. I think -- >>: So maybe one optimization technique is more likely to put you into a better local optimum than another one. >> Jui-Ting Huang: I see what you mean. For this method, I have plotted the training objective, and it looks reasonably saturated after several iterations. I haven't tried to do the same thing for the EM algorithm, but I assume it also converges very fast, right? I cannot comment on whether preconditioned conjugate gradient is better at finding -- the nature of [inaudible] is finding a local maximum, and then there is the issue of finding a better local maximum. So I guess it's fair to compare them. The way I use preconditioned conjugate gradient is to speed up the training; I'm not aware of any property of that optimization that helps find a better local optimum. >>: [inaudible] >>: I disagree. >>: This is the whole reason why people have, like, inertia in their gradient descent techniques, so they can get over these local minima. It really depends on how you follow the gradient -- >>: [inaudible]. >>: If you look at a different field like [inaudible], people put a huge amount of effort into exactly how they're going to optimize these things and what the learning rate is and whether they precondition or don't precondition or what the starting point is. >>: [inaudible]. >>: Even though the data is the same and the objective function is the same, the details of the optimization, at least in that case, seem to make a difference. >>: [inaudible] >> Jui-Ting Huang: One thing is that, at least for MMI, I compared the gradient descent method and the extended Baum-Welch method, and they give almost the same number. >>: Oh, wait, what? Who did that? >> Jui-Ting Huang: I mean I did that. Without the regularization term, if I optimize with the gradient descent method, I get the same result as the one optimized using extended Baum-Welch. >>: Okay. That's what I was asking. >> Jui-Ting Huang: Okay. >>: So you have done that? >> Jui-Ting Huang: Yeah, I've done that; for MMI I've done it. It's similar, because extended Baum-Welch is an extension of Baum-Welch, so it probably provides some evidence about the optimization. It doesn't matter much here. So I'm just saying that -- yeah.
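To make the conditional entropy term concrete, a small Python sketch of the empirical regularizer over an unlabeled set for class-conditional GMM classifiers might look like the following; the function name, the use of scikit-learn's GaussianMixture, and the prior handling are illustrative assumptions rather than the speaker's implementation:

    import numpy as np
    from scipy.special import logsumexp
    from sklearn.mixture import GaussianMixture  # example GMM implementation only

    def conditional_entropy(gmms, log_priors, X_unlabeled):
        """Average entropy of p(class | x) over an unlabeled feature set.

        gmms        : list of fitted GaussianMixture models, one per class
        log_priors  : log class priors, shape (n_classes,)
        X_unlabeled : unlabeled feature vectors, shape (n_samples, dim)
        """
        # log p(x | class) from each class-conditional GMM, stacked as (n_samples, n_classes)
        log_lik = np.stack([g.score_samples(X_unlabeled) for g in gmms], axis=1)
        # unnormalized log joint: log p(x | class) + log p(class)
        log_joint = log_lik + log_priors
        # normalize to class posteriors log p(class | x)
        log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
        post = np.exp(log_post)
        # empirical conditional entropy H(y | x), averaged over the unlabeled set
        return -np.mean(np.sum(post * log_post, axis=1))

    # Example usage with assumed data: fit one GMM per class on labeled features,
    # then evaluate the regularizer on unlabeled features.
    # gmms = [GaussianMixture(n_components=16).fit(X_c) for X_c in per_class_features]
    # H = conditional_entropy(gmms, np.log(class_priors), X_unlabeled)

Subtracting alpha times this quantity from the supervised MMI term gives the conditional-entropy-regularized criterion discussed above; its gradient with respect to the GMM parameters is what the (preconditioned) conjugate gradient optimization follows.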
So the negative conditional entropy is different from the previous interpolation term, and negative conditional entropy has been applied to other discriminative classifiers such as logistic regression and CRFs. Here we apply it as regularization to the discriminative training of GMMs and HMMs. I look at two specific tasks: one is classification, the other is recognition. For classification -- I probably don't need to explain more -- I'm assuming I'm already given the segment, and I just want to recognize the phone identity by the posterior [inaudible] probability, in which case the phone-dependent distributions are described by Gaussian mixture models. I used the TIMIT corpus to create a semi-supervised setting: labels for a certain percentage of the training set are kept, and the rest of the training set becomes the unlabeled set. Then I extracted segmental features for each segment, a fixed-length vector calculated from the PLP features within different regions of the phone segment, plus the preceding and following segments, plus the log duration. So I'm showing now -- the label is not clear, but the blue line is the objective value, and it saturates within 50 iterations. The green line is the phone accuracy calculated on the [inaudible] set, and you see it correlates well. It's not training set accuracy, it's more like test accuracy, but still, regularization helps us get a nice correlation between accuracy and objective. Here we show classification accuracy for the two regularizers: one is the ML regularizer, the other is conditional entropy. The axis is a decreasing amount of labeled data with an increasing amount of unlabeled data; therefore the best performance we can see, which is the dashed line here, decreases along the axis. To have a clearer view, I [inaudible] the performance of the supervised model against the two regularization approaches, and you can see they have very different behaviors. ML regularization helps very much in the case where we only have 20 percent of the labels available -- the right-most number is 5 percent of the labels -- so it outperforms the conditional entropy regularization given a very limited amount of labeled data. On the other hand, conditional entropy regularization consistently improves over MMI training regardless of the quality of the supervised model. >>: [inaudible] >> Jui-Ting Huang: I have -- let me show it. Let me show a graph very quickly and let's see if I can go there. Wait. >>: [inaudible] >> Jui-Ting Huang: No, it's just [inaudible] between -- [inaudible] in the range of 10. And I have a plot showing that the performance is insensitive to the value of alpha. >>: [inaudible] >> Jui-Ting Huang: So it doesn't change much, no. I want to show that; let me see if I can, very quickly. So I have few labels to more labels, and you see that they have similar behaviors across [inaudible] values of alpha. Okay. Let me go back. Okay. Then, for recognition: the other task is to recognize the whole phone sequence given an utterance, and we know that recognition is based on scores from both the language model and the acoustic model. >>: So in the MMI-only case, that's just training with whatever labeled data you have. >> Jui-Ting Huang: Yes. >>: But in all cases is it the same size model or no?
>> Jui-Ting Huang: Yes. >>: [inaudible] >> Jui-Ting Huang: Yes. >>: [inaudible] >> Jui-Ting Huang: Oh, no, no. The number of parameters is fixed even when I'm adding unlabeled data. >>: [inaudible] >> Jui-Ting Huang: Yeah, I will show that actually by adding unlabeled data we are able to grow the number of parameters during generative training, and I will show that shortly. Yes. Okay. So in the current [inaudible] we are interested in optimizing the HMM parameters for the acoustic model. First of all, let me go back to the generative training I mentioned earlier, where I have a likelihood criterion over both labeled and unlabeled data. To approximate the distribution of x_i, I have a confusable set with respect to each utterance x_i, which is derived by recognition, and in the real implementation it can essentially be summarized by a denominator model. In the same sense, this term is summarized by a numerator model, which encodes the full acoustic and language model scores according to the labeled phone sequence. And for unlabeled data we have a recognition model that approximates the distribution. Then the update formula, for example for one of the mean vectors in a state model, looks like this, where we just add the occupancy statistics from the denominator lattices for the unlabeled data. For discriminative training we have to find a way to compute the conditional entropy for each utterance. Here we approximate it with the N-best hypotheses, so now H_i is the N-best hypothesis list [inaudible] obtained by running recognition on each utterance i. Then we apply the same optimization methods as before for discriminative training. So here are the results. Here we assume only 5 percent of the training set is labeled, and the unlabeled data is the remaining 95 percent. I compare with the self-training method, the conventional [inaudible] training where we just find the sufficiently confident transcriptions, add them to the labeled set, and then train the model. We see that the best number of Gaussians we can get from this amount of labeled data is only eight components, and then by adding unlabeled data we are able to grow the model. And we see that the semi-supervised learning is actually better than the self-training ML. Another advantage is that we are not necessarily using confidence measures for our methods. >>: Was this with monophones or triphones? >> Jui-Ting Huang: It's monophones. The second set of results is for discriminative training, where we start with the initial model, the best model we can train with the semi-supervised [inaudible] approach from this plot. Then I also compare with self-training, and as we can see, again, a confidence measure has to be used and tuned for this. And as you can see, semi-supervised MMI is better than the self-training MMI. But it's worth mentioning that when I just use the N-best list to compute the conditional entropy, it has a limitation in that I only have a few competing hypotheses associated with each utterance; it's [inaudible] to approximate the sequence distribution. Okay. So some take-home messages. We see that unlabeled data are useful for finding more accurate distributions or a better classifier, depending on the training criterion. For example, we found that in prosodic modeling we are able to find more accurate likelihood functions corresponding to the true class distributions, and so we are able to discover unseen classes.
And by applying unlabeled data in the regularized discriminative training framework we can get a better classifier. One of the advantages here is that confidence measures are not necessary in our framework for unlabeled data to be useful, but, of course, you can combine this with confidence measures -- try to filter your data -- and with some tuning you can get better results. One current limitation of my thesis work is that I only have experiments on the TIMIT set, and it's worthwhile to try a large vocabulary data set to see how it generalizes. Thank you. [applause] >> Jui-Ting Huang: Any questions and comments? >> Geoffrey Zweig: We have time for one or two questions. >>: The self-training approach, you used the confidence to select data? >> Jui-Ting Huang: Yes. >>: So there's this kind of theory, right? If you select high-confidence things, the transcriptions are more reliable -- but you are adding data which you already know very well, right? So what is your thinking about using confidence to select data? Maybe those speech segments considered [inaudible] may give you more information even [inaudible]. >> Jui-Ting Huang: If you are just using the confidence score, I reached a similar conclusion to yours. For self-training, I need to try different values of the confidence threshold to see which one gives the best performance on the development set, and if this is accuracy on the development set versus the confidence threshold, it's kind of like this, so I usually find something like here. That's why I think it would be helpful if we have some [inaudible]; then we can make use of the data below this threshold. And with my framework, as I said, it's one way to make use of the data regardless of the confidence measure. I guess the confidence measure is a method for very straightforward unsupervised training, but if you apply a different method of semi-supervised learning -- not necessarily my method, maybe multi-view learning -- the data that have lower confidence from one recognizer do not necessarily have the same low confidence from another recognizer in the system. So if you combine multiple systems, you are able to find more interesting training transcriptions, and they can be complementary and useful for the different systems. Do you see what I mean? >>: Yes. >>: So are there plots where the total pot of data was the same size and you just sort of changed the [inaudible] labeled versus [inaudible]? >> Jui-Ting Huang: Yes. >>: So what would the plot look like if the pot of labeled data was fixed and then you had a variable amount of [inaudible]? >> Jui-Ting Huang: Yes, I have a plot about that. For example, I have a fixed amount of 5 percent labeled data, and I tried increasing the amount of unlabeled data [inaudible] in our [inaudible] set. This is for MMI-ML. I haven't made that plot for MMI and CE yet, so I'm sure I should include it in my thesis. >> Geoffrey Zweig: Any other questions? Okay. Let's thank the speaker again. [applause]