
>> Geoffrey Zweig: It's my pleasure to introduce Jui-Ting Huang today. Jui-Ting
is graduating from the University of Illinois Urbana-Champaign where she's one
of Mark Hasegawa-Johnson's students. She was an intern here in 2009 and did
some very interesting work with [inaudible] Lee at that time. Basically they did
language model training in order to maximize the mutual information between the
acoustics and the language model, where all you have to start with is the
language model itself and the phone confusion matrix.
From then she moved on to work on her thesis work, which she is going to be
talking about today, and that focused on mixed supervised and unsupervised
training or semi-supervised learning, and she'll tell us about that now.
>> Jui-Ting Huang: So thank you, Geoff.
So today -- today I will talk about my thesis research, semi-supervised learning
for acoustic and prosodic modeling in speech.
So I will first give a brief introduction, brief [inaudible], on semi-supervised
learning and then I will present two kinds of problems where I investigate the use
of unlabeled data by some proposed semi-supervised learning algorithms
followed by some conclusions.
So we know that the scale of audio data -- sorry. Actually there is a [inaudible]
here. So that's the talk outline, and -- yeah. So the scale of audio data that we
can collect is increasing dramatically these days. It turns out that speech data
without transcriptions is easy to collect but time consuming to transcribe.
This motivates the research on semi-supervised learning which tries to use
unlabeled data in addition to a limited amount of labeled data to improve the
supervised model.
So the ultimate goal here is to improve speech applications where recognition
performance was limited by the amount of transcribed data. And there are a few
related methods proposed for speech applications. Most related approaches can
be categorized into self-training methods which we conventionally call
unsupervised or some lightly-supervised training with confidence thresholds,
meaning that for untranscribed speech we run an existing recognizer on this set
of untranscribed speech, augment the training set with the confident automatic
transcriptions, and then we train the model.
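As a rough illustration of that conventional self-training recipe with a confidence threshold, here is a minimal sketch; the helper functions train_model() and recognize_with_confidence() and the threshold value are hypothetical placeholders, not the actual system described in the talk.

def self_train(labeled_set, untranscribed_audio, threshold=0.9):
    # Train an initial model on the manually transcribed data only.
    model = train_model(labeled_set)
    augmented = list(labeled_set)
    # Decode the untranscribed speech with the existing recognizer.
    for utterance in untranscribed_audio:
        hypothesis, confidence = recognize_with_confidence(model, utterance)
        # Keep only automatic transcriptions above the confidence threshold.
        if confidence >= threshold:
            augmented.append((utterance, hypothesis))
    # Retrain on the manual plus the confident automatic transcriptions.
    return train_model(augmented)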
While this has demonstrated the power of untranscribed data to improve the
system, there are some drawbacks. For example, there is no systematic way to
determine the threshold of the confidence measure. And, on the other hand,
there's no theoretical foundation about the convergence property.
So in my work I'm taking another perspective and trying to propose a principled
framework so that we can train our models with some reasonable training
objectives considering the influence from both labeled and unlabeled data.
So let me give a formal definition of semi-supervised learning. We know that in
machine learning the classification problem is: given data described by some
signals, some features, we want to map it into one of the pre-defined
categories. This is called supervised learning because we need a training set of
labeled tokens and then we learn the mapping from them.
So semi-supervised learning means that in addition to those labeled tokens, we
are also given a set of unlabeled data to learn the mapping. So in our
framework, the models are -- the models I'm looking at are either Gaussian
mixture models or hidden Markov models. So by learning I mean the estimation
of GMM or HMM parameters.
So the basic idea is to use unlabeled data to generalize the applicability of the
classifier to unseen data. There is a prerequisite, which is that we assume the
distribution of unlabeled examples is relevant to the classification problem.
So in this way, if we have a large quantity of unlabeled data, we can leverage
the data distribution from the unlabeled data and connect the data with the
classifier. And there are different perspectives that we can use to make use of
this connection.
The first kind treats unlabeled data in the context of a missing data problem,
meaning that the unlabeled data is like incomplete [inaudible] missing class label
information. And this incomplete data can be -- one way to treat it is the EM
algorithm.
The other kind tries to impose some measure on unlabeled data to regularize the
[inaudible] problem caused by using labeled data only. So in the following, my
methods are based on one of these two perspectives.
>>: How do you measure the relevance? Say the distribution of unlabeled
examples is relevant, how to measure them, right?
>> Jui-Ting Huang: There's no quantification of this relevance. By relevant I
mean that, for example, if we have a generative model, then unlabeled data is
relevant here because generative training is going to find a better likelihood to
describe your data. Then in that case, it's natural to incorporate unlabeled data
into your generative training, and if you have a better generative model, which
we are using for the classification problem, then they are relevant.
Yeah. And, also, there is another kind of relevance. For example, here, if we
think the distribution of the unlabeled data can change our thinking about the
model we learned from only labeled data, which I will show later, then it's also
relevant. Yeah.
So I think the point I raise here is that it's not necessarily the case that
unlabeled data can help. You need to have the correct context, the correct
model, to use it.
So in particular I proposed a training framework where special models are trained
to optimize an objective that reflects reasonable assumptions about both labeled
and unlabeled data. And in particular there are two kinds of training criteria.
The first kind is generative training criteria, and we will find their ability to
discover unseen classes, and we apply this ability to the problem of prosodic
break detection. And then I go on to the acoustic modeling problem
and propose a semi-supervised maximum likelihood criterion for GMM or
HMM training.
And the other kind of training criteria is regularized discriminative training
criteria, meaning that the original discriminative criterion will be augmented with
a certain measure on unlabeled data as regularization, and I will apply it to
acoustic modeling.
So in the first part I will show how we use a generative model to deal with the
detection of prosody in Mandarin speech. Let me briefly introduce
prosody. Prosody is the variation in pitch, loudness, tempo and rhythm in
human speech that conveys extra information in speech communication.
There are many prosodic events across many languages, and here I'm interested
in the prosodic break or prosodic boundary, the phrase boundary, meaning that
instead of speaking an utterance in a flat tempo, we usually tend to pause at
several locations within an utterance. And the boundaries between natural
groups of words in speech are the prosodic breaks.
Here is an example in Mandarin speech where each symbol here represents a
syllable corresponding to a [inaudible]. So now I'll play this male speaker,
and you will find that instead of speaking it in a flat tempo, he slows
down slightly in some places, as indicated by the blue lines.
[Male voice speaking in Mandarin played]
So those blue lines are where the prosodic breaks are. Then locating those
prosodic breaks -- I'm finding my mouse -- locating those prosodic breaks in
speech is useful for finding natural groups for speech synthesis, and also it
appears that the segmentation [inaudible] prosody corresponds to syntactic
structure in speech, which can be used for speech understanding.
Furthermore, we could also build prosody-dependent acoustic models,
meaning that the acoustic model will change depending on the current prosodic
event.
So since it's useful, we are interested in the automatic -- the task of automatic
prosodic break detection, which is essentially a classifier which receives acoustic
correlates and classifies the event as non-break or break.
And because it's a classification problem -- a supervised learning problem which
requires prosodically labeled data -- we will need prosodic annotation, and
it's usually done by linguistic experts who are sensitive to those acoustic cues
and can efficiently mark those prosodic marks for [inaudible].
So it's even harder than [inaudible] transcription, which motivates
our goal, which is to automatically locate prosodic breaks in Mandarin speech
without any prosodically labeled data. It can be done because of two things.
One thing is that we will first identify reliable representatives of the non-break
class using some simple textual lexical cues from Mandarin. Then, given this
small set of labeled data, we rely on a semi-supervised learning algorithm to
learn a classifier using both the labeled set and the rest as the unlabeled set.
And it's worth noting that here we have a special setting of the problem
where the labeled data is available from only one of the two classes, and we
will see that our algorithm is able to take care of this scenario with the unseen
class, the break class.
So here's an example of how it works in general: if we have
utterances, then a speech recognizer will recognize the text. For us here, it's a
sequence of Chinese characters.
Then for every syllable boundary here corresponding to each character is where
we need to decide if there's a prosodic break or not.
And according to some rule which I will explain later, we are able to identify some
class representative for the non-break class indicated as NB here, and we collect
this data as the labeled set.
Then the rest of the syllable boundary with unknown class identity go to the
unlabeled set. Then our semi-supervised learner will use these two sets to
estimate a possible underlying model and predict the most likely prosody for
those unlabeled boundaries or even [inaudible] test set.
So let me show you the rule. To find the class representatives, we leverage some
information from the text which is output from the recognizer. And before showing
the rule, let me introduce the concept of lexical word in Chinese, meaning that
beyond the single character, the lexical word is actually the combination of
multiple characters which can convey different lexical meaning than its character
components.
And it's reasonable for a recognizer to output a word sequence rather than a
character sequence. For example, you can have word-based language
models. Then the lexical cue we are relying on is from the literature: prosodic
breaks do not exist within a short lexical word. By short we mean that the
lexical word contains three or fewer syllables.
So according to this cue, for those within-word syllable boundaries
we are pretty sure they are from the non-break class. So we mark them as NB.
That's how we find the class representatives. And the rest go to the unlabeled
data.
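To make the rule concrete, here is a minimal sketch of that labeling step, assuming the recognizer output is represented simply as the syllable count of each lexical word in the utterance; the function name and data layout are illustrative assumptions, not the actual implementation.

def split_boundaries(word_syllable_counts, max_short_word=3):
    # Within-word syllable boundaries of short words (<= 3 syllables) become
    # reliable non-break (NB) labeled tokens; all other syllable boundaries
    # have unknown class identity and go to the unlabeled set.
    labeled_nb, unlabeled = [], []
    position = 0  # index of the current syllable boundary in the utterance
    for n_syllables in word_syllable_counts:
        for _ in range(n_syllables - 1):
            # Boundary inside a lexical word.
            if n_syllables <= max_short_word:
                labeled_nb.append(position)
            else:
                unlabeled.append(position)
            position += 1
        # Boundary after the word (between words): class unknown.
        unlabeled.append(position)
        position += 1
    if unlabeled:
        unlabeled.pop()  # the utterance-final position is not an internal boundary
    return labeled_nb, unlabeled

For example, split_boundaries([2, 1, 4]) marks the boundary inside the two-syllable word as NB, while the boundaries inside the four-syllable word and all between-word boundaries stay unlabeled.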
Then how do we learn from this mixed set? First let me show the generative
model we are using, and then I'll show how we train the model.
So we are using a mixture model, a [inaudible] mixture model that has M mixture
components that can generate the data x, y, I sub m. X is the multi-dimensional
prosody feature vector containing N acoustic correlates, y is the class
information, and I sub m is the indicator of whether the label is missing or not.
So then [inaudible] the [inaudible] probability of the data is described in this
mixture-based model where we have a weight and a Gaussian for each
component, but because it's a tied mixture, the two classes
share the same pool of Gaussian components, so the class discrimination comes
from the soft class membership defined by this component-dependent class
probability.
Also, one important feature here is that we have a class-dependent label-missing
probability to match closely the one-class scenario, meaning that since we
know that in the training data we don't have label information for the break
class, we simply apply the constraint that the label-missing probability for the
break class is one, and then we learn the rest of the parameters.
So because it's a generative model, it's natural to have a maximum likelihood
training criterion to estimate all of the parameters, and the likelihood is
computed over the combination of the labeled set and the unlabeled set, and it
can be solved by the EM algorithm.
Then the classification is simply finding the class that maximizes the class
posterior probability, which can be computed in the following way using the
parameters we just learned.
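Reconstructed from this description (the exact notation on the slides may differ), the tied-mixture model, the semi-supervised likelihood, and the classification rule might be written as:

p(\mathbf{x}, y; \theta) = \sum_{m=1}^{M} w_m \, P(y \mid m) \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m),
\qquad P(\text{label missing} \mid y = \text{break}) = 1,

\ell(\theta) = \sum_{i \in L} \log p(\mathbf{x}_i, y_i; \theta) + \sum_{j \in U} \log \sum_{y} p(\mathbf{x}_j, y; \theta)
\quad \text{(maximized by EM)},

\hat{y}(\mathbf{x}) = \arg\max_{y} P(y \mid \mathbf{x}; \theta)
= \arg\max_{y} \frac{\sum_{m} w_m \, P(y \mid m) \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)}
{\sum_{y'} \sum_{m} w_m \, P(y' \mid m) \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)},

where both classes share the same pool of M Gaussian components and the discrimination comes only from the component-dependent class probabilities P(y | m).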
So we evaluated our semi-supervised approach on a duration corpus, which is real
speech by a single male Mandarin speaker. In this corpus prosodic
annotations are available, but we just used them as the ground truth for
evaluation.
And there are two kinds of break labels, but we treat
them as the same break class.
Regarding features, for each syllable boundary we extract eight acoustic
features, including pitch, duration, and energy related features. Yeah.
So here --
>>: Is there some reason why it was just one speaker?
>> Jui-Ting Huang: Yes, it's just one speaker.
>>: But why? Why not use --
>> Jui-Ting Huang: Because this corpus just contained one speaker.
>>: Does that make a difference?
>> Jui-Ting Huang: It makes a difference because then we would need to take
care of speaker normalization of the prosodic features, and it could be
[inaudible] to some extent. But now it's a simpler situation.
>>: [inaudible]
>> Jui-Ting Huang: Yeah, sure. Yes, I totally agree with you. And so one thing
is the opposite of this fact. However, the number of locations that can be
identified by silence varies depending on the corpus. There is not necessarily a
sufficient number of syllable boundaries marked by silence; it depends on the
corpus.
So we think identifying class representatives from the non-break class is more
general than finding the boundaries by silence. But it's possible to add in those
data found by silence, and then we are back to a normal setting of
semi-supervised learning where the labeled data has information about both classes.
Yes. Actually, I have that result, but I didn't show here for complete comparison.
>>: [inaudible] classification. So it's a classification, you know the word and you
classify --
>> Jui-Ting Huang: Yes, in this case it's a kind of prosody event classifier. So I
know we see the boundary, meaning that, yes, they are all -- I assume that I
gather the information from the recognizer output.
>>: So for more complete modeling you may need also doing a sequence.
>> Jui-Ting Huang: Yes.
>>: And also maybe using [inaudible] word specific boundary information there
too
>> Jui-Ting Huang: Yes. So I will show that for acoustic modeling there is a
sequence model, but here for prosodic modeling I didn't consider -- I haven't
considered the sequence scenario yet.
>>: [inaudible]
>> Jui-Ting Huang: So in the experiment, the transcription comes from the
ground truth.
>>: [inaudible]
>> Jui-Ting Huang: No.
And another thing is -- another thing is it's -- maybe you can discuss later.
So -- okay. So here I want to demonstrate the model's
ability to discover a new class by using one of our features as an example.
So the feature here is the difference of averaged energy of the syllable after and
before a syllable boundary. The top-most plot is the histogram of the features in
a training set regardless of the class information.
The next two plots are the respective histograms for -- this one is for non-break
class, this one is for break class. We see that the break class has a lower mean
than the non-break class.
Then we show the estimated data distribution given only a subset of labeled
tokens from the non-break class, with different numbers of Gaussian components.
And we see that if we have a sufficient number of Gaussian components, like 16
Gaussian components -- this is the joint probability computed using the
model -- then given sufficient parameters,
we are able to identify the contrastive class that we haven't seen in the labeled set before.
So let's look at the detection accuracy results. So let me show two reference
numbers first.
By the fully-annotated approach I mean that we assume all of the prosodic labels
are available. So this is kind of the upper bound performance we can get, 91
percent.
And then by the chance rate I mean that, because of the [inaudible] distribution
of the data, if you simply assign every syllable boundary as non-break, we get
77 percent accuracy. But it's not a reasonable method according to the precision
and the [inaudible].
Then by using the non-break data as the labeled tokens we are able to get a
reasonable accuracy, but it's more meaningful to see that we have reasonable
precision and recall, summarized by the F score here. So this summarizes that we
are able to discover prosodic breaks by [inaudible] and the semi-supervised
algorithm.
We now move to the problem of acoustic modeling, and we will start with
a generative approach that has shown its effectiveness in prosodic modeling.
So in the context of acoustic modeling with a generative criterion such as the
maximum likelihood criterion -- GMMs and HMMs are both
generative models -- unlabeled data can be naturally
incorporated into the generative framework. That is, we extend the ML criterion
to a semi-supervised version. In particular, the model parameters are now
estimated to maximize the likelihood of the joint labeled and unlabeled set. And
it can be done by EM or Baum-Welch for GMMs or HMMs.
There are some issues here, first raised by Cohen: incorrect model
assumptions can cause unlabeled data to degrade performance.
But in their paper they assume one Gaussian distribution per class. In our
case we're using a more expressive probability model, which is the GMM, and it's
already a better model assumption than a single Gaussian distribution, and we
haven't observed the degrading effect so far.
And, also, it's well known that a generative model may not optimize the
classification accuracy well, and since people usually go on to do discriminative
training, we will see how we extend supervised discriminative training to a
semi-supervised version.
So the discriminative criterion, if we are using maximum mutual information, is
simply finding the maximum of the average log posterior probability. And
since discriminative training is prone to over-fitting the training data, it would be
helpful to include a regularization term to reduce the problem of over-fitting.
Our semi-supervised algorithms essentially use certain measures on
unlabeled data as regularization for the supervised training criteria.
There are two regularization measures. For the first one,
we propose to augment the supervised criterion with a maximum likelihood
measure on the unlabeled set. So the interpretation here is that it balances
between leveraging evidence from the labeled data and clustering information
from the whole input data.
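A plausible written form of this hybrid objective, reconstructed from the description (the balancing notation here is an assumption, not copied from the slides), is:

F(\theta) = \sum_{i \in L} \log P_\theta(y_i \mid \mathbf{x}_i)
\; + \; \alpha \, \frac{|L|}{|U|} \sum_{j \in U} \log p_\theta(\mathbf{x}_j),
\qquad
p_\theta(\mathbf{x}_j) = \sum_{y} p_\theta(\mathbf{x}_j \mid y) \, P(y),

where the first term is the supervised conditional maximum likelihood (MMI) criterion on the labeled set L, the second is the maximum likelihood measure on the unlabeled set U, and alpha together with the |L|/|U| factor plays the role of the balancing and normalization factor mentioned later in the talk.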
>>: So can you go back just maybe one more? So isn't there a prior on the label
sequence, a prior on y, that needs to be in there? If you look at the mutual
information, the probability of x and y over the probability of x by itself and the
probability of y by itself, and then use the chain rule, you have the probability
of x and the probability of y given x; the p of x cancels in the numerator
and the denominator, so you have p of y given x over p of y?
>> Jui-Ting Huang: I think I don't follow you.
>>: Maybe this is like what [inaudible].
>>: What happens to p of y? Isn't that in the denominator?
>> Jui-Ting Huang: Yeah, if you --
>>: But for discriminative training p of y is --
>> Jui-Ting Huang: So here it's -- here it's for -- it's the classification version of
the criterion. Yeah, I think, as you say, there is a p of y term, but.
>>: [inaudible]
>> Jui-Ting Huang: Yeah, I think --
>>: Well, let's keep going.
>> Jui-Ting Huang: Yeah, but I have to say that -- so I will look at the
classification case and the recognition case, and this is the -- this is the measure
that I use for the classification training, and I think it's generalized to the MI
criterion for the recognition.
>>: Is that different from traditional maximum likelihood?
>> Jui-Ting Huang: This one is actually conditional maximum likelihood for
classifier. We usually call it that. And I'm just saying that they are equivalent if
you generalize it to maximum mutual information for the recognition case.
So I first talk about the hybrid training objective. If you look more
closely at the estimation problem, the objective is actually composed of multiple
likelihood terms, and the corresponding parameter update formula can
be derived based on a modification of the growth transform or extended
Baum-Welch.
So essentially it's a modification of the original MMI training, adding additional
statistics from the unlabeled data in this way. And the L and the U are just
normalizations due to the amounts of training data, [inaudible] the balancing
factor. So this is the version for classification where each class model is
represented by a Gaussian mixture model. So this is the mean vector for class c
and component m.
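As a hedged sketch of what such an extended Baum-Welch style update for the class-c, component-m mean vector could look like once the unlabeled-data statistics are added (the exact weighting on the slides may differ), one possible form is:

\hat{\boldsymbol{\mu}}_{c,m} =
\frac{\big(\theta^{\text{num}}_{c,m} - \theta^{\text{den}}_{c,m}\big)
+ \alpha \, \theta^{U}_{c,m}
+ D_{c,m} \, \boldsymbol{\mu}_{c,m}}
{\big(\gamma^{\text{num}}_{c,m} - \gamma^{\text{den}}_{c,m}\big)
+ \alpha \, \gamma^{U}_{c,m}
+ D_{c,m}},

where theta and gamma are the first-order and occupancy statistics accumulated from the labeled numerator and denominator terms and from the unlabeled data under the ML term, alpha absorbs the |L|/|U| balancing, and D_{c,m} is the usual smoothing constant of the growth-transform update.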
So I want to first discuss our relation to other work. Several techniques
have been developed to mitigate the over-fitting problem.
For example, the H-criterion globally interpolates the MMI and ML objectives, and
also I-smoothing is used to interpolate with Gaussian-component-dependent
priors based on the ML estimate.
Our criterion is different in that we leverage information from unlabeled data.
And there is a different interesting perspective proposed by IBM researchers
this year [inaudible]: they proposed a maximum entropy measure on unlabeled
data to augment the supervised MMI term, which essentially results in a
minus sign here, flipping the sign for the unlabeled measure.
>>: [inaudible] [laughter]
>> Jui-Ting Huang: So I haven't --
>>: [inaudible]
>> Jui-Ting Huang: They have a slight improvement, but they combined that
method with another paradigm, which was multi-view learning by combining
many systems. So, I mean, it's interesting to try what I would get if I flip the sign.
>>: So in words -- correct me if I'm making a mistake here, but I think in words,
what the first one means is that you want to maximize the likelihood of the
supervised data and make sure that you've got a hypothesis for the unsupervised
data that gets high likelihood. And what the second one does with the minus is
you want to maximize the likelihood of the supervised data while saying as little
as possible about the unsupervised data. Is that qualitatively right?
>> Jui-Ting Huang: Yes. And I'm thinking that if you have a very strong
supervised model, maybe it's reasonable -- I mean, maybe you can try to use
unlabeled data to penalize the likelihood of the competing hypotheses. But here
you will see that in my experiments I assume we have a limited amount of labeled
data. I mean, my supervised model is poor in the first place,
and in that case I rely on the unlabeled data to capture the data
distribution. Then in that case, regardless of the hypotheses -- there might be
some errors in the hypotheses recognized by the recognizer -- in this context
where I don't have a good enough distribution model in the beginning, maximum
likelihood regularization might help.
>>: [inaudible].
>>: I agree. Doing a minus seems bizarre.
>>: I'm really wondering why --
>>: [inaudible].
>>: Then in the entropy there would have been a p of -- there would have
been --
>> Jui-Ting Huang: It's p [inaudible] there is no p.
>>: [inaudible]
>> Jui-Ting Huang: It's just --
>>: P log p.
>>: [inaudible]
>> Jui-Ting Huang: There's no [inaudible] information here. Yeah. It's just --
>>: But to compute p of x you have to hypothesize class information, right?
Because you have class-based models.
>> Jui-Ting Huang: Yeah, you have to --
>>: You're going to have to break that down to somehow a sum over all the
possible class labels [inaudible].
>> Jui-Ting Huang: Actually I have --
>>: [inaudible]
>> Jui-Ting Huang: Actually I --
>>: [inaudible]
>> Jui-Ting Huang: As you said, to approximate, usually you rely on the prior
information here. Yeah. So I think they do the same thing. They use a language
model and their acoustic model to approximate this. Yeah.
>>: So here in this equation, the first line, on the unlabeled data, if you find
some theta that gives any [inaudible] p theta of x a very low probability, then
your maximization will get hugely penalized. So you would not allow, you know,
those densities that give very low probability, [inaudible] explain why it helps.
>> Jui-Ting Huang: Yeah.
>>: So you try to make the -- for every -- for the xi, you make them almost equal.
>> Jui-Ting Huang: I see.
>>: Kind of smoothing towards average.
>>: I'm wondering if there's some kind of self-inconsistency, like if you have a
pile of labeled data and no unlabeled data, you maximize the probability of the
labeled data, and now imagine that you chop your labeled data in half and move
half of it into the unlabeled data and throw away the labels [inaudible] probability
of that same data whose probability you used to be trying to improve, now all of a
sudden you're trying to decrease its probability. It seems kind of inconsistent.
>> Jui-Ting Huang: So that's interesting. And I think they also think it's kind of
corresponding to the maximum entropy idea. So I think it's worth discussions.
I already talked about this.
Okay. So the second regularization I have is conditional entropy regularization,
which is the uncertainty of class prediction given the features. The paper that
first proposed this idea is based on the assumption that unlabeled data are
beneficial, especially when classes are well separated.
So when they estimate a model, they apply a model prior that prefers minimum
class overlap, which is equivalent to minimizing the label entropy on unlabeled
data.
The other interpretation is that the negative conditional
entropy term encourages the model to have the greatest possible certainty about
its label decisions. So it reinforces the confidence of the supervised classifier we
learned from the labels.
So to compute the conditional entropy, because we don't know the real data
distribution, we approximate it with the empirical distribution from the unlabeled
data. So the formula is essentially like this, meaning that we have the
discriminative criterion for labeled data and the negative conditional entropy for
unlabeled data. And it's in a sense an unlabeled version of this discriminative
criterion, so we have a coherently defined objective.
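Reconstructed from this description (the notation is assumed), the conditional-entropy-regularized criterion for the classification case could be written as:

F(\theta) = \sum_{i \in L} \log P_\theta(y_i \mid \mathbf{x}_i)
\; - \; \beta \sum_{j \in U} H_\theta\big(Y \mid X = \mathbf{x}_j\big),
\qquad
H_\theta(Y \mid X = \mathbf{x}) = -\sum_{y} P_\theta(y \mid \mathbf{x}) \log P_\theta(y \mid \mathbf{x}),

so maximizing F rewards the labeled-data posteriors while pushing the model toward confident, low-entropy predictions on the unlabeled data.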
Because of this term we can no longer use extended
Baum-Welch. So we optimize it by gradient-based methods based on the
gradient computation of this term.
Moreover, we use preconditioned conjugate gradient methods to accelerate the
convergence rate of steepest descent. So here I'm just showing the overall
gradient of these two terms, which looks like this in the classification case.
>>: At the end you're probably going to compare results using different
techniques, and one of those techniques is going to -- not the optimization, but
one of the techniques is, maybe it's just plain ML, which presumably
you'd use EM to optimize. Then there's going to be this other technique where
you've used the preconditioned conjugate gradient descent to optimize, and
there's going to be a difference between those techniques. And so the question
is how do you know that difference comes from the objective function versus the
optimization method? Like if you did preconditioned conjugate gradient descent
for a regular maximum likelihood objective function, maybe that would give you a
different answer. Is there some reason you can tell me, oh, you don't have to
worry about that?
>> Jui-Ting Huang: I think as long as I'm sure the objective functions are
optimized --
>>: But now we're going to get local --
>> Jui-Ting Huang: Yeah, either of them, just getting a local optimum. I think --
>>: So maybe one optimization technique is more likely to put you into a better
local optimum than another one.
>> Jui-Ting Huang: I see what you mean. So, for example, for this method I
have plotted the training objective, and it looks
reasonably saturated after several iterations. I haven't tried to do the same thing
for the EM algorithm. But I assume that it actually converges very fast, right?
But I cannot comment on whether the preconditioned conjugate gradient is better
at finding -- I think both have the nature of [inaudible] finding a local maximum,
and both have the issue of finding a better local maximum. So I guess it's fair to
compare them.
Because the way I use preconditioned conjugate gradient is to speed up
the training, I'm not aware of any optimization property of it that can help to find a
better local optimum.
>>: [inaudible]
>> Jui-Ting Huang:
>>: I disagree.
>>: This is the whole reason why people have, like, inertia in their gradient
descent techniques so they can get over these local minima. It really depends
on how you follow the gradient --
>>: [inaudible].
>>: If you look at a different field like [inaudible], people put a huge amount of
effort in exactly how they're going to optimize these things and what the learning
rate is and whether they precondition or don't precondition or what the starting
point is.
>>: [inaudible].
>>: Even though the data is the same and the objective function is the same, the
details of the optimization, at least in that case, seem to make a difference.
>>: [inaudible]
>> Jui-Ting Huang: One thing is that, at least for MMI, I compared the gradient
descent method and the extended Baum-Welch method, and they give almost the
same number.
>>: Oh, wait, what? Who did that?
>> Jui-Ting Huang: I mean I did that. So, I mean, without the regularization term,
if I do that with the gradient descent method, I have the same result
as the one that I optimized using extended Baum-Welch.
>>: Okay. That's what I was asking.
>> Jui-Ting Huang: Okay.
>>: So you have done that?
>> Jui-Ting Huang: Yeah, I've done that. Yeah, for MMI I've done it. But it's
similar because extended Baum-Welch is an extension of Baum-Welch, so
probably it provides some evidence about the optimization. Doesn't matter much
here.
So I'm just saying that -- yeah. So the negative conditional entropy is
different from the previous interpolation term, and the negative conditional
entropy has been applied to other discriminative classifiers such as logistic
regression and CRFs. And here we apply it as regularization to the discriminative
training of GMMs and HMMs.
So I look at two specific tasks. One is classification, the other is recognition.
Classification is -- probably I don't need to explain more, but I'm assuming I'm
already given the segment. I just want to recognize the phone identity by the
posterior [inaudible] probability, in which case the phone-dependent
distributions are described by Gaussian mixture models.
Then I used the TIMIT corpus to create a semi-supervised setting. Labels of a
certain percentage of the training set are kept, and the rest of the training set
becomes the unlabeled set. Then I extracted segmental features for each
segment, which is a fixed-length vector calculated based on the PLP features
within different regions within the phone segment and also the pre- and
post-segments, plus the log duration.
So I'm showing now -- the label is not
clear, but the blue line is the objective value, and it has saturated within
50 iterations. And the green line is the phone accuracy calculated on the
[inaudible] set. So you see it correlates well.
It's not training set accuracy, it's kind of test accuracy,
but, still, regularization helps us to have a nice correlation between accuracy and
the objective.
Then here we show the classification accuracy for the two regularizers. One is
the ML regularizer. The other is conditional entropy. The axis is the decreasing
amount of labeled data with an increasing amount of unlabeled data. Therefore,
the best performance we can see, which is the dashed line here, decreases along
the axis.
And to have a clearer view I [inaudible] the performance of the
supervised model against the two regularization approaches. And you can see
they have very different behaviors.
Now, ML regularization helps very much for the case where we only have 20
percent of labels available, and the right-most number is 5 percent of
labels. So it outperformed the conditional entropy regularization given a
very limited amount of labeled data.
On the other hand, a conditional entropy regularization consistently improves
over MMI training regardless of the supervised -- the quality of the supervised
model.
>>: [inaudible]
>> Jui-Ting Huang: I have -- let me show it. Let me show a graph very quickly
and let's see if I can go there. Wait.
>>: [inaudible]
>> Jui-Ting Huang: No, it's just [inaudible] between -- [inaudible] at the range
of -- the range of 10. And I have a plot showing that it's quite insensitive to -- so
the performance is insensitive to the value alpha.
>>: [inaudible]
>> Jui-Ting Huang: So it doesn't change much, no. I want to show that. Let me
see if I can -- very quickly.
So I have the labels, from few labels to more labels. You see that they have
similar behaviors for [inaudible] values of alpha. Okay. Let me go back to --
Okay. Then for recognition. Another task is to recognize the whole phone
sequence given an utterance. And we know that the recognition is based on the
score from both language model and the acoustic model.
>>: So when you have in the MMI only case, that's just training with whatever
labeled data you have.
>> Jui-Ting Huang: Yes.
>>: But in all cases the same size model or no?
>> Jui-Ting Huang: Yes.
>>: [inaudible]
>> Jui-Ting Huang: Yes.
>>: [inaudible]
>> Jui-Ting Huang: Oh, no, no. The number of parameters is fixed even when
I'm adding unlabeled data.
>>: [inaudible]
>> Jui-Ting Huang: Yeah, I will show that actually by adding unlabeled data we
are able to grow the number of parameters by generative training, and I will
show that shortly. Yes.
Okay. So in the current [inaudible] we are interested in optimizing the HMM
parameters for the acoustic model. And, first of all, let me go back to the
generative training I mentioned in the first place, where I have the maximum
likelihood criterion for both labeled and unlabeled data.
And to approximate the distribution of x_i, I have a confusable set with respect
to each utterance x_i, which is derived by recognition, and in the real
implementation it essentially can be summarized by a
denominator model. In the same sense, this one is also summarized by a
numerator model, which encodes the full acoustic and language model scores
according to the labeled phone sequence.
And for unlabeled data we have a recognition model that approximates
the distribution.
Then essentially the update formula, for example for one of the mean vectors in
a state model, looks like this, where we just add the occupancy statistics
from the denominator lattices for the unlabeled data.
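A hedged sketch of the kind of update meant here, for the mean of Gaussian m in state j under the semi-supervised ML criterion (the lattice-based statistics are written generically; the slide's exact form may differ):

\hat{\boldsymbol{\mu}}_{j,m} =
\frac{\sum_{i \in L} \sum_{t} \gamma^{\text{num}}_{j,m}(i, t) \, \mathbf{x}_{i,t}
+ \sum_{u \in U} \sum_{t} \gamma^{\text{den}}_{j,m}(u, t) \, \mathbf{x}_{u,t}}
{\sum_{i \in L} \sum_{t} \gamma^{\text{num}}_{j,m}(i, t)
+ \sum_{u \in U} \sum_{t} \gamma^{\text{den}}_{j,m}(u, t)},

where the numerator occupancies come from aligning the labeled utterances against their transcriptions (the numerator model) and the denominator occupancies are collected from the recognition lattices of the unlabeled utterances.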
For discriminative training we have to find a way to compute the conditional
entropy for each utterance. Here we approximate it with the N-best hypotheses.
So now H_i is actually the N-best hypothesis list [inaudible] by running the
recognizer on each utterance i.
Then we apply the same optimization methods for the discriminative training.
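A plausible form of this N-best approximation (notation assumed), where H_i denotes the N-best list for utterance i:

H_\theta\big(W \mid X = \mathbf{x}_i\big) \;\approx\;
-\sum_{W \in H_i} \tilde{P}_\theta(W \mid \mathbf{x}_i) \, \log \tilde{P}_\theta(W \mid \mathbf{x}_i),
\qquad
\tilde{P}_\theta(W \mid \mathbf{x}_i) =
\frac{p_\theta(\mathbf{x}_i \mid W) \, P(W)}
{\sum_{W' \in H_i} p_\theta(\mathbf{x}_i \mid W') \, P(W')},

i.e. the hypothesis posteriors are renormalized over the N-best list before the entropy is computed, which is why having only a few competing hypotheses limits how well the sequence distribution is approximated.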
So here I show the results. Here we assume we only have 5 percent of the
training set available and the unlabeled data is the remaining 95 percent. And I
compare with the self-training method, meaning the conventional [inaudible]
training where we just find the confident-enough transcriptions to add to the
labeled set, then we train the model.
Then we see that the best number of Gaussians we can get from this amount of
labeled data is only eight components, and then by adding unlabeled data we are
able to grow. And we see that the semi-supervised learning is actually better
than the self-training ML. And another advantage is that we are not necessarily
using confidence measures here for our methods.
>>: Was this with monophones or triphones?
>> Jui-Ting Huang: It's monophones.
The second result is for discriminative training, where we start with the initial
model, the best model we can train from the semi-supervised [inaudible]
approach from this plot. And then I also compare with the
self-training. And as we can see, again, the confidence measure has to be tuned
for this. And as you can see, semi-supervised MMI is better than the
self-training MMI.
But it's worth mentioning that when I just used N-best to compute the conditional
entropy, it has its limitation in that I only have a few competing hypotheses
associated with each utterance. It's [inaudible] to approximate the sequence
distribution.
Okay. So some take-home messages. We see that unlabeled data are useful
for finding more accurate distributions or a better classifier depending on the
training criterion.
So, for example, we found that in prosodic modeling we are able to find more
accurate likelihood functions corresponding to true class distributions and so we
are able to discover unseen classes.
And by applying unlabeled data in the regularized discriminative training
framework we can have a better classifier. And one of the advantages
here is that confidence measures are not necessary in our framework for
unlabeled data to be useful, but, of course, you can combine this with -- try to
filter your data, and by some tuning you can get better results.
And one current limitation of my thesis work is that I only have experiments on
TIMIT set, and it's worthwhile to try on a large vocabulary data set to see how it
can be generalized.
Thank you.
[applause]
>> Jui-Ting Huang: Any questions and comments?
>> Geoffrey Zweig: We have time for one or two questions.
>>: The self-training approach, you used the confidence to select data?
>> Jui-Ting Huang: Yes.
>>: So there's this kind of theory, right? Do you select high confidence things
and the transcription with the more reliable -- well, you are adding the data which
you already know very well, right? So what is your thinking about this in using
confidence a lot to select data? Maybe for those speech segments, consider it
[inaudible] they may give you more information even [inaudible].
>> Jui-Ting Huang: If you are just using the confidence score, I found a similar
conclusion to yours. For self-training, because I need to try different
values of the confidence threshold to see which one gives the best performance
on the development set, and it's like -- if this is the accuracy on the development
set as a function of the confidence threshold, it's kind of like this. So I usually
find something like here.
I think it's -- so that's why I think it would be helpful if we have
some [inaudible]; then we can make use of the data that's below this threshold.
Yeah. And so I think with my framework, as I said, it's one way to make use of
the data regardless of the confidence measure. And so I guess the confidence
measure is a method for the very straightforward unsupervised training, but if
you apply a different method of semi-supervised learning, not necessarily my
method, but maybe also multi-view learning, then the one that has lower
confidence from one recognizer doesn't necessarily have the same lower
confidence from the other recognizer in the system. So this is where, if you
combine multiple systems, you are able to find more interesting training
transcriptions, and then they can be complementary and useful for the different
systems. Do you see what I mean?
>>: Yes.
>>: So are there plots where the total pot of data was the same size and you're
just sort of change the [inaudible] label versus [inaudible]?
>> Jui-Ting Huang: Yes.
>>: So what would the plot look like if the pot of label data was fixed and then
you had a variable amount of [inaudible]?
>> Jui-Ting Huang: Yes, I have a plot about that. Yes, it's -- for example, I have
a fixed amount of 5 percent labeled data, and I tried to vary the amount of the
unlabeled data [inaudible] in our [inaudible] set. And this is for MMI-ML. And I
haven't had a plot made for MMI and CE yet. So I'm sure I should include it in my
thesis.
>> Geoffrey Zweig: Any other questions? Okay. Let's thank the speaker again.
[applause]