>> Geoffrey Zweig: Hi. So it's my pleasure... Dirk has had an extremely distinguished career. He started...

advertisement
>> Geoffrey Zweig: Hi. So it's my pleasure to introduce Dirk Van Compernolle today.
Dirk has had an extremely distinguished career. He started in Stanford University and
was there overlapped with, among others, Les Atlas and Mari Ostendorf.
After that he had a stint at IBM working with Fred Jelinek. Eventually he left, then, IBM
and took a position as a professor at the Katholieke Universiteit in Leuven, Belgium,
where he is a professor now.
And, additionally, while there was a vice president of research at Lernout and Hauspie, so
he's got a great industrial perspective on things also. And recently has been working on
the FLaVoR's bottom-up decoder. And even more recently, a template-based speech
recognition, which he's going to talk about today.
>> Dirk Van Compernolle: Thank you, Geoff. So today I'll give you an overview of
what we've been doing in example-based recognition. This is work -- there's a couple of
other names on there, as this has been going on for quite a couple of years.
And, okay, you can see in the slides later there's maybe some of the more important
references to the work.
But I guess the first publications we had on this are in 2004. And things have changed.
We've learned quite a bit by moving up from smaller to bigger problems by doing more
experiments. And it's been an interesting ride.
So I will try to -- not to summarize what we've all done, but to give more or less our
current perspective on what we do it.
So I'll do the talk in two parts. More or less first get to a baseline system, get to some
example-based speech recognition, and then go into the more fine points of it.
Why did we start on this work? Well, there is a couple of different motivations. A major
one is when you go and look into psycholinguistic literature, there's very much evidence
that humans store traces. Almost of the audio or just think of -- I have two small kids.
They speak Dutch at home like me, but when they hear songs in English, they try to
repeat them. This is pure acoustic memory. These things are in there. There is traces.
And with music, that's quite obvious.
But we do this for speech too. How important it is, there's obviously a great deal of early
abstraction in the system, too, and how important those longer references are, that's by far
less clear.
So there is evidence that we use trajectories, templates, episodes, whatever you want to
call them. I put them under one big nomenclature. Then another motivator was the
success of concatenative synthesis where people said, well, we've tried to model things
with a good model for so long, it doesn't work, we've gone away from the model, we just
put in lots of data, and we've played more ignorant than what we were before, and things
work better.
HMMs have worked fine, and they still work fine. And they're still -- they're still the
state of the art. But we try to beat them.
And what have we done in HMMs? For the last 20 years we've tried to overcome all the
errors, the basic errors in the model. At some point you get tired of doing that, of just
tweaking it, and you'd just like to go to another model.
So and another thing was maybe more, okay, let's try to do something that burns up lots
of computer space and lots of memory. Because it's growing so fast and somehow
HMMs don't keep up with it.
So what do we do in HMMs that I don't like or that's -- most of us don't like. HMMs take
data comes in and they say this, okay, here we have a state. That state, we assume this is
a short-term stationary segmentation of speech. So whatever sequence those observations
come in, it doesn't really matter. You can turn the sequences upside down.
Well, what have we done, we've done derivatives on the features, people have tried to put
it into the model. This has helped. This is tweaking of the model. But it's still
fundamentally a problem.
So what do we do? Let's take a small example here. We looked at lots of data from the I
from the word five, and all the dots are individual data points, just projected down to
two-dimensional space. So they're originally -- the feature vector is 39 dimensional or 36
dimensional algorithm in this thing. But then we do an LDA to higher, which is suited
for the picture. So we say, okay, what's the two most important axises of the I.
And so you see the plots. And you see HMM states.
So three states associated for one phoning. Okay. It's [inaudible], there's a lot of
transitional information in there, it's not an R, but it's I. So we're expected to see some
movements.
And we see those dots and HMM says, okay, let's take all the dots belonging to one state
and let's just group them mean invariants, yes, we take multi-Gaussians, we make it a bit
more complex. And that's our representation of each of these states.
>> So these have the top two dimensions in the HLDA?
>> Dirk Van Compernolle: Not the HLDA. Well, we don't use HLDA. We use
something a little bit similar. But not the LDA that you're going to use for the recognizer,
but an LDA to stand out I, specifically trained on the I versus all the rest, so that the I gets
more pronounced.
>> I guess the question that I have is if you work to look at the dimensions and the
recognizor's looking at it, will it because it's higher dimensional space?
>> Dirk Van Compernolle: It would not -- no, you would not be able to make a nice
picture because it's higher dimensional space. And the first two dimensions -- the first
two dimensions might not matter. You might have to look for which dimensions give
you nice pictures.
>> You're trying to say this is just representative or it's just an artifact of the way it was
drawn?
>> Dirk Van Compernolle: I'm looking let's say in the 36 dimensional space, I'm looking
for the 2D intersection where I'm going to get the biggest variants on the points, so which
gives me the nicest picture. It's just one cut in this huge space.
>> [inaudible]
>> Dirk Van Compernolle: I think it's a projection that I would -- did this long time ago.
I think it's projection.
>> It would be nice if these were the dimensions, two of the dimensions that the
recognizer [inaudible].
>> Dirk Van Compernolle: It's not. We've done some plots for capture too, and with
capture it's not too bad.
>> Right. So that's my point. Are you making things look worse than they really are in
practice?
>> Dirk Van Compernolle: Let's say -- you know how to visualize 39 dimensional space.
>> Just show two of them.
>> Dirk Van Compernolle: No, no, that's bad idea. That's a bad idea.
>> Dimensional -- probably the more real -- in a higher dimensional case or if you
compute the LDA on everything, things would be more scrunched up on top of each
other, so it would look worse [inaudible].
>> [inaudible]
>> Dirk Van Compernolle: No, because -- I'm not so sure that it's more believable by
taking the random two dimensions.
>> If I take 39 dimensions and collapse them to two, everything is going to be
[inaudible].
>> Dirk Van Compernolle: Yeah. Let's look at some other things in here.
>> That's not to say that [inaudible].
>> Dirk Van Compernolle: No. Let's look at how now for the same [inaudible]
trajectories look. So they are there. It's not -- the point that I want to make is that on the
left top side or the beginnings of the trajectories, the other side are the end of the
trajectories.
It's not a random sequence of points that -- or when you're in state one you don't get
randomly [inaudible] average the variants indicated the points from that distribution.
That's not the way you draw points to create realizations. That's the point I want to make.
That's the only thing.
So even after doing context-dependent modeling, after doing derivatives, there is still
information in the trajectories at the phone-in level. Even at the very short trajectories.
That's the only point I want to make; that the first-order Markov assumption does
something. Significantly overgeneralizes what speech is.
For example, you could say a black one, as I draw it, is a reasonably good example. A
red one in terms when you would start computing HMM probabilities with points on this
red curve will get a higher likelihood than the black one. But intuitively I would say it's a
worse representative of the class of trajectories.
And that's basically the motivator I have. Let's look at another example. The previous
things were things I drawn by hand. These are real examples. Let's take a real trajectory.
The real trajectory of data, starting point, endpoint. And let's find now in the class of
points the closest trajectory using a DTW algorithm, just a dynamic time-warping
algorithm. Nothing more than that.
And that's the green. That's the closest trajectory that we find. By no means the
sequence of nearest neighbors in it. It's something that goes a little bit the same way. I
don't want to talk about smoothness, because there's some other dimensions to be taken
into account. But this is the DTW then on the full dimensional space. Not that space, not
to match.
So in this space maybe so be careful by saying it's not just the closest points, maybe they
are the closest points in 30 dimensional space. I'm not sure. I have to be careful when I
make it. Because we're looking now at the two-dimensional space. It's not just in each
dimension the closest points. But we are trying to make a sequence of nearest neighbors
in the high-dimensional space.
>> When do you use enabling information to [inaudible] curve?
>> Dirk Van Compernolle: I do a dynamic time warping. So I do points, local distances,
and then I add them up along the full trajectory to see where I get.
>> [inaudible]
>> Dirk Van Compernolle: I'll give you a few more details on the example what it does
in a couple of slides.
So what's the matching paradigm that we use for HMMs in general? And actually we'll
see the time warping thing is not that different from the HMM setup.
We have -- so that's standard Bayesian recognition. We get the word sequence which
maximizes the probability of words given a set of data, and we typically write that in
another way where we find a likelihood function or some function that gives the
matching score of the data given. A word sequence times the probability of a language
model. That's standard formula for anything. It's a generative model.
Now, what do we do in template matching? Because we have all -- they're not just a
single template sequence that can explain word sequence, there's plenty of template of
sequences that can explain the word sequence. So we really should sum over many
possible templates and words and find this thing.
Now, that sum is horrible to compute because there's so many possible examples. So we
take a very -- maybe a very awful shortcut, you might say. We do a Viterbi assumption
here and we'll do a maximum of these things. And then we get ultimately to a formula
that looks very much the same as a thing of both, except that we have an extra turn. We
have a probability that's the template sequence realizes the word sequence.
Of course let's say it has to be -- you have -- because the T here, you have to see as a
sequence of templates that you're going to get out of your database of all possible
templates that you're going to link together. So you're going to splice things together. I
don't put the whole timing information here.
Now, these templates of course have to explain the words, so they have to have the right
phonetic identities, when you look at the words, and they have to explain them in a
phonetic sense. But then, okay, what's the probability. We'll come to that later.
Now, let's look at this. You say, okay, it's about this argmax, that's our Viterbi
assumption. Argmax we need a search engine to do this. But basically this thing doesn't
look very different than anything we've done in HMM, so we will be reusing our search
engine that we have developed for HMMs.
We need the similarity score between data and words and templates. And we need this
prior template. And then we have language model, just the way we have a language
model in speech recognition.
Now, if we compare HMM and examples, actually there are a lot of components in the
full system that are the same. You need some units, phone units, Allophone units.
They're all in example-based systems. We have Allophone units as well. We have phone
units.
We need the local similarity there. There is a big difference there. In HMMs we
typically use multi-Gaussians. In example-based we'll have some local distance
measuring the distance between a point in space, a reference, and a point in space on a
test sample.
The time alignment, we do Viterbian, DTW, or actually the same algorithms. Something
with this computer. Not I'll just move my hands. It sometimes starts typing. This seems
to have built-in speech recognition one way or another. But with a very high error rate.
The search we do -- so I should use this pointer. The beam search is actually the same
one. We use a token passing system to implement these things. And it's the same search
we'll be using for the example.
The training, well, that's where things are very different. The local distances. That's
where things are really different. That's not so different. And on transitions, well, we'll
have some things on transitions which are quite different too.
What do we do for training? The only thing we need to do for training in this, we take a
big database and we have to label it. We have to know which segment of speech
correspond to what, what is it. We can label it at the phoneme-level, we can label at
word level, you -- in principle you're free. In the concept of the system, you're free to do
what you want. But, in practice, all the things I'll be showing will be
phonetically-labeled databases.
>> You mean manually labeled?
>> Dirk Van Compernolle: No, no, no, no. Free automatic. And we use our HMM.
HMMs are nice things. They are good tools. We use them. So we don't -- we started off
this -- in this work we started off by throwing out HMMs completely. But, no, no, no, we
need them. We need them. We need them at many places.
Okay. And maybe we need to do some estimation on our distance matrix. But I put
down optional.
Well, we'll see all this orange. I was going to skip those slides.
Now, let's look at those probabilities and distances. When we have observation
probabilities, really that part of it, so it's a sequence of data given a sequence of states,
and we maximize that. So we say, okay, we have to find the state indicias which
maximize our probability. That's what the Viterbial algorithm does. If you take
logarithm of that, this is still a maximization, becomes a sum.
And if we now assume for this probability function a single Gaussian, well, then we just
have the simple quadratic term, and we have something which is standing before our
exponent and our Gaussian becomes there. As long as that sigma doesn't vary much from
one distribution to another, we can get rid of this. And so then we end up with something
that looks like that, a standard Mahanolobis type of distance.
So single Gaussian with Viterbi by or Mahanolobis distance in some type of DTW, so
now we have to minimize this. We have to find instead of mu's then we're going to now
replace these mu's by examples.
So we treat points as means in the Gaussians in a single Gaussian system.
>> Your distance can include that term, right?
>> Dirk Van Compernolle: I can include this term, and we have included this term.
We've done work with that. We've done a lot of work with that.
>> [inaudible]
>> Dirk Van Compernolle: It used to make different. It doesn't make any difference
anymore. That's the -- I've gone back to some of our old stuff. And then it made a big
difference, and now it doesn't anymore. And I'll continue.
>> [inaudible] only [inaudible] training [inaudible] show that if you do [inaudible] taking
this wrong turn, probably is a little more useful. Well, first of all, make things much
easier to train.
>> Dirk Van Compernolle: Well, it used to be useful. And I don't have all the answers
why it's not useful anymore. But we've moved up the big databases to many more
examples. And to a lot of things, we've included all the refinement in the system. And
the more refinement we've included into the system, the less we need this. But I come
back on that with -- there's some results where it's included and where it's not included.
But so okay. We'll take it -- I just drop it here to say, okay, it looks very similar, the two
things, to -- just for the intuitive understanding of it.
Now, there's a bigger problem with this when we go to distances. So now we have to
minimize over a sequence over, so possibly time steps, and we have to find the best
template or template sequence in the whole database to do this.
Now, when we have a sigma here, what sigma should we put there? And a thing that's
actually a bigger problem than saying we want that term or we don't want that term, but
what sigma should you put there?
>> And you're actually making that class dependent?
>> Dirk Van Compernolle: Well, when you look at the HMM formulation, that sigma is
dependent on the class that you're matching. You're looking for distance from a point to
a class. Or a matching score of a point to a class. Now you're saying I match a point
with a point. Okay. In your reference database you know to which class they belong.
You know I'm in I, I'm in E, I'm in something, and you could say, okay, that's -- and I
want it to be in the correct class. So I'll measure how far I am from that class.
But you're looking at a point. And this variance, okay, what do we know about this
variance in -- what are they in HMMs, it is the global distribution of points in a class.
And that's going to be pretty big, as you saw, on that example. And we'll be working
much more in the local neighborhood and the global distribution of the class really
doesn't make that big an impact on what you do.
And shouldn't. It doesn't matter. Let's say I have a class which looks like this, probably I
want -- if I'm here, I want to have something here in the neighborhood. I don't care about
those points all the way at the other end of the class.
>> And is it going to turn out to be important, what covariance you use? Versus if you
just ->> Dirk Van Compernolle: Let's look at some results later. So but the best similarity,
you can find two HMMs as saying, okay, I infer the class from the reference. And if -- so
I trained single -- diagonal covariance matrixes on classes, and so you could do
phonemically or even more state labeling or whatever type of things. We've done a
number of ways to come up with many covariances.
>> You could argue that maybe you should use a covariance that's associated with the
utterance from which you drew the template. In other words, not something about the
class, but something about the speaker or the speaker at that particular point in time in
that acoustic environment. That's where you get your covariance.
>> Dirk Van Compernolle: I don't know. Doesn't sound logic to me.
>> Okay.
>> Dirk Van Compernolle: One thing there is, when you put something here, and you
define it the Y, well, then, it's not a distance anymore. It's not symmetric, which makes it
already some strange thing to do, because you're really comparing two things. And why
should the thing in the reference database have a label and not the one that you're testing.
They both have a label. So why should you use something asymmetric.
In this nonparametric type of approach that we're working with. It's kind of
counterintuitive to do something asymmetric. Beware we are using features, we are
using features here that have gone through HLDA-type preprocessing. So they are
already reasonably normalized. That's what we're always doing. And this definitely
helps.
So the importance of this -- it should be a second [inaudible] effect. Because we've
already rotated the whole thing into a space where Euclidean distance should make more
or less sense.
>> So not just rotated by the -- divided by the absolute variance of each dimension.
>> Dirk Van Compernolle: Yeah. Everything. We've -- there's a [inaudible]
information, discriminant analysis, there's a rotation to make things as good fitted to
single Gaussian distributions.
>> Can you learn that thing maybe discriminatively or ->> Dirk Van Compernolle: [inaudible] spent a whole Ph.D. on it and didn't get
anywhere. That's the name Mike McDonald that you've seen on these slides. And this
has been one of the more negative results that have come out of it. He spent six years on
this and never managed to get any improvement or whatever else with it.
You might also blame him, but you might at blame the constant. But, yeah, that's what
we thought so, too, that maybe this is something that you should be able to learn
discriminatively.
>> [inaudible] covariance between different classes to see if they actually are very close?
>> Dirk Van Compernolle: There's differences. There's differences. How close? I can't
put any figure on it. The only thing I can do is say, okay, what do we see in terms of
experiments or results. You'll see a few things. Just keep your breath for five more
seconds.
I know. It's going to take a little bit longer. But I have quite a few slides on what then
the different distances do. But we don't have our full initial system yet, and that's what
I'm trying to make. But this distance metric obviously is a problem, and I'll come back
with more things.
Now, okay, I put it a little bit more explicit that we actually have template sequences that
are matching word sequences. And we have there this term probability of the templates
given the word sequence. Okay. Is any template sequence that explains, so you look at
the phonetic labels of the templates, you then need template sequence equally good
enough.
Well, we do the way we like in speech recognition. We say, okay, a template sequence,
we'll put this as a problem of the template given the previous sequences and the word
sequence. And so -- and approximate this as more or less template transition costs.
Okay. There's also some real template prior costing there, really prior of a template. I
have no idea what they are. I have an I here in my database, I have another I in my
database. Is one more likely than other. Should I have a prior probability assigned to
certain things.
I could run big -- leave one out experiment on my database and see how often something
is used on my training database. That's a possibility of estimating things. It's -- we've
debated. We've had plenty of philosophical discussions on what's a prior. We've just
computed the likelihood of that template based on an HMM system.
And then normalized over all templates with the same phonetic signature, which also
might say, okay, how good does it match the class, is that a good one.
It's -- the prior as such hasn't given as much, but there's also this concatenation term.
And that makes more sense. That says, okay, one template following another template,
when we had them phonetically labeled. Well, it's more likely if the whole -- if the larger
phonetic context of these things is consistent.
So the thing of like context-dependent phones, if I have a template of an R, which was
the original recorded between a P and a T, then it's much more likely that the previous
template would have been a P. And even a P which was followed by an A and then
maybe preceded by something else.
So there's definitely phone context that plays a role, and then there's something we call
meta-information. And meta-information can be all kinds of other things, where you say
I like things to be consistent, like it's more likely that all the templates that I will draw are
from the same gender. Why would I start mixing male and female templates. Doesn't
make too much sense, it's more likely that they were recorded in the same background
conditions.
Any abrupt transitions are just unlikely. And I also could look at spectral discontinuity in
those templates. All these things are not likely.
So template transition costs, yes. And actually the most important one we have there is
something we call natural successor. And this is something that you'll see up here more
often. I'll say we use full names as templates, but -- and everybody gives immediately as
a criticism, no, you should use symbols or words or longer units.
Well, we try to overcome this criticism by saying let's try to have as long of units as
possible. But going to work units will sparsify our training database tremendously. So
this will not do a good job there.
>> So the idea that you should be similar to this [inaudible] you just make all possible
[inaudible] if whatever match is the best [inaudible] whatever you have.
>> Dirk Van Compernolle: You choose whatever the best ->> [inaudible] predetermines that [inaudible].
>> Dirk Van Compernolle: You do. You choose whatever the best, but you encourage
segments that are as long as possible. So you encourage bi-phones, tri-phones as they
are.
And so how do we do that encouragement, like in speech recognition, when you start a
new word, you have a word startup cost in order to have not too many new words being
started. That's what every decoder does.
Then you have a phone-in recognition system, plain phone-in system. You need a
phone-in transition startup cost. So we also have a startup cost in the decoder. But when
we have a recording that says that, then, okay, we hypothesize template B, and then we
have to hypothesize "ah" after that.
When we take this one natural segment as it was in the original training database, when
we take a "bah" instead of "bu" and an independent "ah," we say, well, we don't need a
template transition startup cost for this second one. Or if we take "bah," well, also for the
D we don't need a template transition startup cost.
So we give preference to longer segments. So if we don't make transitions, we do that.
So instead of doing this explicitly, we do this implicitly. And it's important. It makes a
big difference.
>> Now, to do that you also need to have segment boundary information for different
template.
>> Dirk Van Compernolle: In the original database, yes.
>> [inaudible]
>> Dirk Van Compernolle: Use the HMM to do that.
>> So doing the decoding, do we use HMM also?
>> Dirk Van Compernolle: They're decoding -- well, we use a token passing decoder,
and the token passing decoder that we use is the same one that we use in the HMM
system. It doesn't look any different.
>> But, I mean, do we actually use [inaudible].
>> Dirk Van Compernolle: During decoding, no.
>> In the TVS, typically [inaudible] to convert from word to hidden and then [inaudible]
guide the synthesis, so use [inaudible] model to actually choose the templates from the
training set. And you can get a much better result, basically.
>> [inaudible] those, since your second [inaudible] uses HMM already [inaudible]
otherwise you may never be consistent. I don't know [inaudible].
>> Dirk Van Compernolle: Okay. Let's look at the overview of DSS. So what do we
have. We have distance measure specific, we have a reference database, we have the
template transitions. We also -- we need a block. Fast template selection, we want to do
a fast match, because we don't want to search the full database. It's too big. And then we
need the search algorithm. And lexicon and language model, that's no different that what
you would have in a standard system.
So all these things are very much the same except this fast template selection, and then
more the things here at the bottom are different than in an HMM system.
Now, an example. Let's listen to the inputs. So you try to find this out.
>> Audio: It's still unclear.
>> Dirk Van Compernolle: Which is natural speech. And now you try to do a
templates-based recognition.
>> Audio: It's still unclear.
>> Dirk Van Compernolle: It's not a horrible things at all. So what did really happen, so
this is the input. Now, you have a segment of speech here, the E, you go and look in your
database that you're working with, you find another E, you take this E, you timeline it
with an E in the test sample, you do this for all individual sounds and you splice it
together. You do some interpolation on the FFDs, and you do an inverse and you get the
sound back.
>> Was same speaker in the training database?
>> Dirk Van Compernolle: I think for this example, yes. I have some more examples
coming.
>> So presumably you need to look for similar contacts [inaudible].
>> Dirk Van Compernolle: You definitely need to look for similar context. But here are
some examples on what really that effect is of this concatenation cost. Where I say we
don't -- we want -- we prefer multi-phone units over single-phone units, because there's
going to be much fewer -- we want to have as few transitions in the matched template
sequence that we have. And that's a pretty old example also here.
>> Audio: There were 13.9 million crimes reported in 1988.
>> Dirk Van Compernolle: That's a Wall Street Journal sentence.
>> Audio: There were 13.9 million crimes reported in 1988.
>> Dirk Van Compernolle: So that's an attempt to resynthesize that. And that's using
just phone templates, not context-dependent templates. That's pretty basic system from
some time ago, but it's nicely illustrative. How many segments do we have? Okay, you
might say, okay, we have 55 phones in that sentence. When what you did now was not
playing at all with those transitions.
So you just try to find a concatenation of the best phones in the database. Well, you find
48 segments by chance. There's a couple of segments which have -- like seven of them
will have two phones in them.
>> What's the difference between a segment and a template?
>> Dirk Van Compernolle: Here with segment I mean a natural segment of speech in the
original training database. Which can consist -- which is a multi-phone unit.
>> I see.
>> So if you [inaudible] how do you choose the best phone?
>> Dirk Van Compernolle: Just based on DTW.
>> Oh.
>> Dirk Van Compernolle: That's just doing DTW. And I say I'm looking for the best
phones and I just do DTW, is the only thing I do. And now I'll play big with those phone
concatenation costs. Say, well, things have to come in the right phonetic context, so I
look at phone-in context sensitivity. And I prefer long segments. And then it sounds a
little bit different.
>> Audio: There were 13.9 million crimes reported in 1988.
>> Dirk Van Compernolle: So if you didn't do the costs ->> Audio: There were 13.9 million crimes reported in 1988. There were 13.9 million
crimes reported in 1988.
>> Dirk Van Compernolle: That's when we add the costs. And you look instead of 12
phonemic errors, you only make three phonemic errors, and the average segment length
has increased to almost three phones per segment.
So in this example you're still -- you're really using multi-phone units.
>> So this implies that, if I understand it right, that the segments that were used in that
no-costs one actually have better DTW scores, just pure DTW scores than the ones in the
bottom line.
>> Dirk Van Compernolle: Yes. When you just look at the DTW score, this one is better
than this.
>> Isn't that surprising?
[multiple people speaking at once]
>> Dirk Van Compernolle: But the DTW score is a bad representation of your match.
Because it's not just a DTW score to reference, but also the DTW score from frame to
frame that you've synthesized, and that gets so unnatural at those joining points that it's
not -- your -- what you're putting in is -- is probably something that I was showing all the
way in the beginning, is something that isn't in your manifold of your points even, of
traces.
You're putting a collection of points that makes no sense, that doesn't exist. So where
now we've done the consistent trajectory within the phone that we've totally forgotten it
on the cross-phone level, where it at has to be consistent.
>> And so this last line, this is using information that is not present in an HMM system,
because it's using information [inaudible].
>> Dirk Van Compernolle: Both. Part of the information is in an HMM, because it uses
contact dependency on the phone units.
>> Okay.
>> Dirk Van Compernolle: But it also has -- but part of the information is not in an
HMM.
>> Like if the two segments came from the same person [inaudible] or they were adjacent
originally in the data.
>> So how do you compute the transition costs?
>> Dirk Van Compernolle: For the natural -- so there is this phone-in -- this template
startup cost which is one for all templates. The same cost for all templates. Like you
have a word startup cost in a decoder. So that's one parameter which is -- which is
optimized on the development.
>> The guys that do speech synthesis now, the modern technique is rather than
computing the cost in an arbitrary way, which is what they did, [inaudible] I'm going to
look at my HMM to tell me on the average when I go [inaudible] this unit with this unit,
where are the scores. And, as you said, within that unit, it's fine, but when you put it
there, the HMM is going to tell you this discontinuity started [inaudible] some score.
That depends on the context dependence and you can precompute it.
>> Dirk Van Compernolle: Yes. We've done a number of things on this. But I'll -- I'll
hold off on giving some more details on this. Because there's ->> [inaudible]
>> Dirk Van Compernolle: How many parameters here? Well, when you have context
dependent things, when you look at the context, it might be lots of parameters. But what
we've done instead, so we used to work with phone templates. We've moved to
context-dependent phones as templates. And ->> [inaudible] I have in manually joint [inaudible]?
>> Dirk Van Compernolle: To determine development set you mean?
>> Manually, not automatically trim.
>> Dirk Van Compernolle: It's all -- it's optimized on the development set, all the
parameters.
Okay. Let's now dig a bit deeper in all the problems that you've seen that are there, like
the distance metrics. Let's look at results there and the distance metrics. There's
something that comes with it that we haven't touched on, that's outliers, and then we do
frame and phone experiments. And then we have some things making the databases
bigger and bigger and bigger.
Well, we've done a couple of things to look at that. So let's look again. So where should
we use this Mahanolobis distance, what should we use, should we use the sigmas of the
classes, should we take this into account or not.
We've done it. We've looked at it in many ways. So we have this local Mahanolobis
distance where we take the C, okay, there should be a minus 1 there. And this has been
our standard approach for a long time. And many years back I've even claimed that this
made a huge difference. But now I say at the bottom this was our approach because
we're not using these anymore.
We've also looked at adaptive kernels. So you say it's really a kernel that you put on
every point that you have. But maybe you should be looking at the local distribution of
the points around that to have -- to define your variance for it.
We've done this too. Then you have some time when you start working with kernels that
you really start pushing your full distribution a little bit out because you have points here,
you put the kernels over there, and you also push your distribution a little bit further out
than what it really is.
So people in kernel-based techniques have used techniques called data sharpening to say
let's draw especially the outlier points a little bit towards the center of the distribution,
because otherwise, if I put kernels over it, I even put them more out. And, moreover,
outliers may not be very good representatives.
But there is a much bigger problem that we've really run into. There's two types of
outliers, and one are mispronunciations, bad pronunciations, and others are rare
pronunciations. And the wrong ones you want to get rid of. They may be transcription
errors, they may be things that get out of your database. The rare ones you want to keep.
And it's damn difficult to make a distinction between those two.
Any good ideas on that, saying, okay, if I find with an HMM a low probability for a
sound or something, okay, sometimes it's so bad that you can say yes, it's a bad piece of
data. But sometimes it's just bad. Now, what is it? Is it bad data or is it rare data. And
rare data I don't want to get rid of.
I don't want to get rid of the person with an occasional 50 hertz pitch, which barely exists,
but there are people who go extremely low in pitch. I don't want to get rid of them.
Their data is going to look a bit different.
So we've -- the good outliers are difficult to distinguish from the bad outliers. So this
data-sharpening concept, well there's actually this very nice in saying, well, maybe you
don't really need to know.
Let's just draw all points from the edges of your distribution, let's draw them a little bit
inside. And not towards the center of the distribution, but you have a global distribution,
you have an outlier here, just look in its local neighborhood.
Just look around there, what points of the class are in my local neighborhood. Don't use
the centroid of the full class. Just look in a local neighbors. Look at the K nearest
neighbors within its class.
So look at its K nearest neighbors within the class, and then just take the mean out of this.
And replace each data point that you have by that mean.
So you do a change all the things you have. Now, let's look ->> [inaudible]
>> Dirk Van Compernolle: I do this point by point.
>> Point. Oh. So after you use [inaudible].
>> Dirk Van Compernolle: Yes, I do it data point by data point. I do it data point by day
that point. And, okay, that's what happens. So, again, some projection in a 2D space.
What happens to things in the middle? You look at these K nearest neighbors. Well,
things are typically distributed nicely, distributed when you're in the middle of your
distribution, and what happens to those points, almost nothing. They barely move. What
happens to points that are at the edges, okay, you look for this point, you look for its K
nearest neighbors, you make a whole sphere around it, well, all the points are on that end.
So whatever is going to happen is this thing moves in that direction. And now if it's -what?
>> The bad point [inaudible] too.
>> Dirk Van Compernolle: The bad point can move in too, but at least it's going to look
much more like the other points of my distribution. Yes, it can move in too. But at least
I've colored it with a flavor of what it should look like. And I don't have to make a hard
decision on what they are.
>> Dirk Van Compernolle: So if you iterate this over and over again [inaudible].
>> Dirk Van Compernolle: No. No, no, no. This might iterate to [inaudible].
Now, we thought, oh, this is very nice. Why don't we do this first and then train our
HMM again on this and shouldn't our HMM now work better or not. Didn't. It didn't.
>> [inaudible]
>> Dirk Van Compernolle: And you've made your variants too small if you start now
training an HMM on this, then the distribution is not good anymore and you've drawn the
points too much inside. But for the nonparametric version of thing, does this work.
Well, the thing is we do this Viterbi approximation. So if you find one good match and
something is badly labeled and things like that, you can really make bad errors.
So we don't -- so it's a little bit -- it's coloring your data with some class information that
you're doing. Instead of using the original data point, you say, okay, I'll color it with a
little bit of -- and if it's a really good member of the class, here in the middle, I won't do
much at all. Or I won't do anything at all. If it's a bad example of the class, I'll do quite a
bit.
>> So what other area of the research used this [inaudible]?
>> Dirk Van Compernolle: This comes from things on nonparametric statistics. We
didn't invent this. It's a -- yes. That's textbook material in nonparametric statistics. It's
not a rare thing.
Okay. We did experiments on plenty of different test sets. And, okay, let's look just at
some frame classification things.
Where we've compared things with data sharpening, no data sharpening, and we've
compared distance metrics, Euclidean, local Mahanolobis, so where you estimate the
Mahanolobis, that, and then also with adaptive kernel where you even go and look in the
local neighborhood of the points, if you should adjust your kernel or not.
And then we did some voting or we added some Parzen stuff. We just look at the single
best. So this is simple frame classification. Don't look at all the results. But look at the
nice things. When you do data sharpening, everything in this column is better than this
column.
And this is not for this experiment. Whatever experiment you will see after this, data
sharpening helps, 5, 10, 15 percent. Not a randomly small number, but a consistent thing.
So for this type of approach, well, it's been consistent -- it's the most consistent trick in
the book that we found that helps us.
Then many of all the other stuff that we've tried, we've had much more -- not so clear
things.
>> Presumably for this kind of task you don't really have many outliers, right,
[inaudible]?
>> Dirk Van Compernolle: But I say we've done five different databases already. Or six
different databases. So -- well, TIMIT isn't all that. You have outliers because of their
phone-ins which aren't really produced, that people have invented in the transcription and
so it's not that clean. It's reasonably clean.
The other stuff have is that when you do some voting, when you look at the K nearest
neighbor type thing, for the simple experiments you always do better than single best one.
And Euclidean is not at all a bad idea. But I'll show some other plots which may shed a
different light on some of these things.
>> [inaudible]
>> Dirk Van Compernolle: Well, there's this quarter set and there's the full test set. So
everything on TIMIT is overoptimized and I shouldn't present the results, according to
Geoff.
So let's look at something elsewhere we didn't do -- well, this is on development set of
Wall Street Journal. Again, we do frame classification. You see huge differences
sometimes -- well, especially like with single best between no data sharpening and data
sharpening. If you do some voting, things get better. But the differences between this
whole left and right column is really very significant always.
>> So what's then the reason used to follow the template?
>> Dirk Van Compernolle: That's the Wall Street Journal 0 database.
>> The training one.
>> Dirk Van Compernolle: The training one, yes.
>> So for each user's training to build the template.
>> Dirk Van Compernolle: To build the templates we use the standard training and test
setups of all fees databases. We don't do anything strange there.
>> [inaudible] I don't understand something. This frame classification ->> Dirk Van Compernolle: Yes.
>> I'm sorry, not the frame classification, the sharpening, was that done by moving whole
templates?
>> Dirk Van Compernolle: Parts. This was parts. I'm pretty sure it's points. I think
we've looked at both of the them. But I think they independently move. They could be
incoherent moved.
>> They could be.
>> Dirk Van Compernolle: They could be.
>> [inaudible] why not do the data sharpening on the template?
>> Dirk Van Compernolle: I'm pretty sure we've done this and it didn't make any
difference. And this is much faster.
>> So [inaudible] frame classification, what does this mean?
>> Dirk Van Compernolle: You look at the frame, you look at the label of the frame, and
you try to ->> But the template, you don't do ->> Dirk Van Compernolle: No, no, there's no DTW. This is frame classification. This is
just for looking at some of these [inaudible]. When I look at a single point in my
database, how do they behave on things like data sharpening, how do they behave
according to these different metrics. That's all that I'm doing. There's no template-based
recognition in this.
>> The reference HMM 63, is that ->> Dirk Van Compernolle: That's basically multi-Gaussian. That's a multi-Gaussian.
>> It does the decoding and then you look at the labeling of the frames?
>> Dirk Van Compernolle: Or is it just a [inaudible].
>> Okay.
>> Dirk Van Compernolle: Now, look at another picture of this. Now, look at the
number of nearest neighbors and you look how many of them are correct. So basically
when you do the classification, you're looking very much here, just to the nearest
neighbor, to the one nearest neighbor or maybe to the ten nearest neighbors or so when
you do -- so -- or 20 nearest neighbors when you do this.
So the dotted line is Euclidean. Now, this says when I look locally, like 62 percent, so
when I look at the first nearest neighbor, 62 percent of the time I'll be correct. When I'll
look at my 500 nearest neighbors, then only something like 50 percent will be in class.
So the further you look around, the more nearest neighbors you go look at, the more
things will be out of class, of course, because they're bigger. And now how do I define
this class. If I define this class Euclidean, then actually this drops quite fast. And if I
don't do any data sharpening, this drops tremendously fast if I go and look far away.
Now, here we see things like local Mahanolobis base, this Mahanolobis based thing, or
adaptive kernel stuff. This does much better and especially when I go and look far away.
But that's what's indeed the HMM view. I'm looking at my whole class. And we know
when we do experiments with this DTW we often get templates which are ranked 50 at
100, 150. So we may be operating very much in this area here.
But still in this area here, Euclidean is doing better than the Mahanolobis-based kernels.
It's only when I look very far away when I really want to model the whole data, the whole
class data, then I have to use my Mahanolobis-based kernel. And that's also what it is.
Or that's the way we trained it. We use the class ->> [inaudible] doesn't make any sense [inaudible] depending on the scale of your data.
>> Dirk Van Compernolle: The data has been scaled, has been LDA'd.
>> [inaudible]
>> Dirk Van Compernolle: Yes.
>> Oh, I see. So effectively Euclidean is just a global Mahanolobis.
>> Dirk Van Compernolle: It's a global -- it's been globally normalized to data.
But you see in nonparametric statistics, people, when they, when you really have lots of
points and when you're looking locally, it seems you shouldn't do anything in trying to
say my space looks different. No, it should be you have the spacing of your points that
does it for you. Don't do it twice. It's already in the spacing of the points. And I was
saying again, my distance, it seems to be not so relevant.
So we made probably mistakes in the beginning. We thought this was very important,
but probably we didn't have enough examples. We had to look too far, and then we
needed Mahanolobis-based distances. If you don't have to look far, if you have another
examples, if you work -- if you're dealing with things that are relevant enough, this
Euclidean distance seems to do the job.
>> So the Euclidean distance is just a special case of these [inaudible].
>> Dirk Van Compernolle: Oh, yeah. So you don't do anything. You get rid of your
covariance matrix. It goes faster and goes better.
>> [inaudible]
>> Dirk Van Compernolle: Well, we trained it on the identity of the reference data. So
we grouped the reference data in classes, like 1,000 classes we trained 1,000 different
covariances.
>> So I'm just like [inaudible] confused like the HMMs that you mentioned, they're just
[inaudible] and they are doing better than this system, they are about like 60 always.
>> Dirk Van Compernolle: Yeah.
>> And how [inaudible] from adaptive kernels using too many frames as nearest
neighbors?
>> Dirk Van Compernolle: Well, the things that are far away don't tell you much.
>> Yeah, but [inaudible] do the same thing.
>> Dirk Van Compernolle: JMM is doing the same thing. Not exactly, no, no, no.
>> [inaudible] and JMM was multi-model thing, like multi-Gaussians ->> Dirk Van Compernolle: It will not -- JMMs will not be too -- or not highly
discriminative, unless you train them discriminatively. They say, okay, here are fine
points of this class, and I have a pretty good membership function.
So what are our conclusions after all this work. And as I said, we spend a lot of time on
this. And, sadly enough, we come back to the simplest thing. No, there's one thing. We
retain data sharpening. We retain. It's been an improvement and all the way at the end I
show you speech recognition results on Wall Street Journal 0+1, so much bigger
databases yet, and it stays on track there. It's very relevant.
Euclidean distance, this is good as whatever thing we can come up with. We've tried
many other stuff. And adaptive kernel, sometimes may give a little bit of improvement.
But we basically stopped that work. It's a lot of effort, and we really see benefits from it.
Okay. Now, what other problems do we have when we go too much bigger databases.
So we start it on [inaudible]. Well, we want to search through all of these templates.
Well, we cannot search all these templates because you might -- for each phone you
might have a thousand, 10,000 examples, 100,000, a million examples in your database.
So that's a problem.
First, in our first work we did bottom-up template selection. We say, okay, we're going
to work fully data driven, we're going to look at the data and try to come up with this.
I'm not going to go into the details of this. It was nice type of stuff. We use some
roadmap in this to find -- to quickly search K nearest neighbors and then try to find nice
trajectories in those.
But it was very time-consuming, whatever way we did. It created dense -- we needed
dense graph, really dense graphs to do something good, and there was sometimes gaps in
the graph that we couldn't overcome.
So what we currently do almost all the time, we use an HMM to generate the phone
graph. Well, fast, it's not that fast, it's -- it's reasonably fast, but it's very efficient way to
do things.
So how does it work? We have -- okay. I would like switch this graph. Because the -so we first do an HMM. We do some phone decoder, or it might be a phone decodering
first from a word decoder. Doesn't really matter. But we get a phone like this. With
timestamps in the notice and labels and scores on the arcs.
Then we say now for each arc let's go and look for examples, very specific examples in
the database. So every phone in the database has a unique ID. We put these ID, and so
we go and now make a graph, not just with one example of this, but actually with many
examples of that phone. And then we're going to rescore this template graph. So from a
phone graph, we make a template graph, then we do a decoding on this, and we might
again recombine these things at the end.
Yeah, talk about a bit about [inaudible] similar [inaudible].
>> Dirk Van Compernolle: Well, our no -- from the phone graph, so now for each arc we
go and look in the database for all these templates, with that label, and we do DTW on
them and we just rank them and we take the 200 best ones. So we have a much bigger
graph. We have a graph with the same number of nodes, but we have 200 times as many
arcs as the phone graph.
And we do a new decoding on this because here going from one phone to another phone
in our HMM-based system, there's no specific cost. But here we need to do a new
decoding because we have costs going from one phone to another. Because we have this
phone transit -- this template transition costs that we need to integrate.
So we're not going to get -- the best phone sequence may not be the same in this and this.
This will be different because we have this new set of transition costs that are applied at
the template level.
>> [inaudible]
>> Dirk Van Compernolle: This one is not [inaudible]. I would like to switch it because
you normally expect the first one to stand on the left side.
Something about our phone graphs. They tend to be pretty good. For example, we can
generate [inaudible] phoneme recognizer if you use a 4-gram phoneme recognizer to get
like .7 percent phone error rate. If you use ->> [inaudible]
>> Dirk Van Compernolle: Um ->> The numbers and [inaudible].
>> Dirk Van Compernolle: That's versus an Oracle. So we did phone transcription with
the HMM. We did forced alignment on the Wall Street Journal because there is not an
official transcription of Wall Street Journal, so we generated one.
>> [inaudible]
>> Dirk Van Compernolle: On the lattice. When we do it for -- oh, there's something
wrong here. This should be 0 -- Wall Street Journal 0+1, so much bigger ones. They'll
go to 20k, and there's a lot of vocabulary words. We will have graph error rates on the
order of 2.7 percent.
>> What does GER mean?
>> Dirk Van Compernolle: Graph error rate.
>> Oh.
>> Dirk Van Compernolle: And that's -- was discussing this with Geoff this morning. So
we have a reasonable density in this thing still, like word densities, five words on average
when you take a cut in the graph at any point. So it's not a sparse. This is just meant to
keep our search space manageable.
Now, there's much more questions. When you go to larger database. I've already
mentioned this thing is a template, pronounced correctly [inaudible] pronunciation and I
want to keep it, if it's bad I would like to throw it out. But if I have ten times the same
thing, I probably want to throw it out too. But do I want to keep on keeping all the same
stuff and the same stuff and the same stuff over and over again.
So then there's other questions. We have this meta-information that we put in the
concatenations. And there -- this was something that was not obvious at all in the
beginning either. And you say, okay, I have a female template, I have a male template,
and I would like to have male templates followed by male templates. It's normal because
I would have a male speaker and I do speaker detection or gender detection that way.
On the other hand, I could say, well, but that's maybe not so good because now I'm going
to be using none of my female templates for a male speaker. If I do something like
VTLN, if they're -- what's important? Is it important that the template is totally the same,
but maybe I have a female speaker who said exactly the same sentence. So I get
examples of words and word combinations that I don't have in my male speakers. Would
I now like to match my male speaker with that example of the female speaker.
Well, then I should do some speaker normalization. Because then I could get with a
smaller database a more efficient usage. Because, okay, what's important really is the
phonetic context and the richness of that.
>> [inaudible] you don't need to do that [inaudible].
>> Dirk Van Compernolle: Yes.
>> [inaudible]
>> Dirk Van Compernolle: Yes. But I like to use this bigger unit. We've seen the longer
the segments that we take -- so if we take multi-phone segments, our results are better
than when we have single phone segments. So I would like as many Quint phones, true
Quint phones to be present in my database. The more the better. The more likely I'm
going to be doing good.
>> But after you use a lot of units, do you have to redo this phone graph, or do you
[inaudible]?
>> Dirk Van Compernolle: No, we don't. It's all -- this is implicit. No, we don't.
Because it's implicit. Well, we've done things with explicit longer units too. But ->> [inaudible]
>> Dirk Van Compernolle: In the decoding you do it.
>> Geoffrey Zweig: Just a note on time. About ten more minutes.
>> Dirk Van Compernolle: Yeah, I know. But we're getting to the end.
So there's a dichotomy here. So what do I want to do. Do I want to keep very specific -another thing is how about noise? Do I want to get rid of the noise or do I want to
duplicate my database for all different noise scenarios? Do I want instead of multi-style
training, like in the noisy situations, do I want to do multi-state or do I want to keep every
different noise thing, because then I don't have to make bad assumptions about anything.
Well, in anything, whatever we want to do, we would like to get our database smaller
because it will go faster. We don't want redundancy. But we also want to maximally
exploit our database.
The thing we never want to do is do this on a template -- on a unit-by-unit -- basis,
because we don't want to introduce gaps in the original data. We always want to keep at
least full sentences. We never want to say, okay, I will store them as phones and I will only
keep the 100,000 most representative a's and the 10,000 most representative b's. No, that's
not what I want to do.
So what have we done? We've moved to context-dependent units. Very similar to what's
done in HMMs, except for one important difference: it's a longer unit. In HMMs we typically
do decision-tree clustering of states, we make a phone out of three states, and we reuse
states across context-dependent phones. We don't do that here. We just do a clustering
where the whole segment has to be unique to the class.
But we use our same software for this. And we've also gone and put VTLN in there, which
was not there in the beginning, because we thought, well, we'd like male things to be
different from female things.
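To make the contrast with HMM state tying concrete, a minimal sketch of this kind of segment clustering might look as follows. This is an illustration only, not the actual software; the template record fields and the min_count back-off threshold are assumptions.

    # Sketch: each template segment belongs to exactly one context-dependent class,
    # keyed by its full phone string plus left/right context; rare classes fall back
    # to a context-independent class instead of sharing sub-phone states.
    from collections import defaultdict

    def cluster_templates(templates, min_count=20):
        """templates: list of dicts with 'phones', 'left_ctx', 'right_ctx' (assumed fields)."""
        classes = defaultdict(list)
        for t in templates:
            key = (t['left_ctx'], tuple(t['phones']), t['right_ctx'])  # whole segment owns the class
            classes[key].append(t)

        clustered = {}
        for key, members in classes.items():
            if len(members) >= min_count:
                clustered[key] = members
            else:
                # back off: drop the context but keep the segment identity intact
                ci_key = (None, key[1], None)
                clustered.setdefault(ci_key, []).extend(members)
        return clustered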
What are our results on that? Some things on VTLN. If we don't use VTLN -- these are the
Wall Street Journal 0 results, it's only ten speakers, but I think it's representative -- look at
gender matching, the fraction of templates whose speaker gender matches. If we don't use
VTLN, you might get an average of 75 percent. If you use VTLN, it's only 60 percent.
Meaning that's about which speakers are used in the templates. It's not a weighted average.
It's just which speakers of a certain gender are used.
Now, this means we are going to use more female speech for male speakers and more male
speech for female speakers. So the number of speakers that ultimately appear, one way or
another, for a certain test speaker goes up from 210 to 230. Not that much.
But by doing this we do cross-gender matching, and our results go up. So creating our full
set of context-dependent variability is better than keeping this gender separation. So we'd
better normalize for speaker: it's really this pronunciation variability, the context dependency,
that's important, and we need as many examples of it as possible.
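For reference, the frequency warping behind VTLN is typically a piecewise-linear warp of the frequency axis, estimated per speaker. The sketch below is one common formulation with illustrative parameter values; it is not necessarily the exact warp used in this system.

    import numpy as np

    def vtln_warp(freqs, alpha, f_nyquist=8000.0, f_cut_ratio=0.85):
        """Piecewise-linear VTLN warp of frequency values (Hz).
        alpha stretches or compresses the low-frequency part of the axis, so that
        after warping, templates from different speakers can be compared directly;
        the cut-off ratio and Nyquist value here are illustrative defaults."""
        f_cut = f_cut_ratio * f_nyquist
        return np.where(
            freqs <= f_cut,
            alpha * freqs,
            alpha * f_cut + (f_nyquist - alpha * f_cut) * (freqs - f_cut) / (f_nyquist - f_cut),
        )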
Then we did quite some work on trying to clean up the database, where we try to classify
things as good, bad, or redundant. So far we've done this only on a speaker basis, mainly for
computational reasons. So we try to find speakers which are good, that we really need in our
database; ones that are bad, that cause many errors; and ones that are redundant, that we
can throw away without changing the performance.
This works reasonably well. You can remove almost 50 percent. I don't have the graphs -- I
apologize for that, as we do have them -- but you can compact the database quite a bit by
throwing speakers away with almost no degradation. But there's always a minimal
degradation. So in a sense it's not a very happy result. We're not very happy with the
current results.
So there are speakers who really are, on average, quite redundant. We can probably get rid
of those, but really bad speakers we cannot find -- otherwise performance should go up
when we remove them. So we really have to redo this whole thing on a sentence-by-sentence
basis instead of a speaker-by-speaker basis.
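A plausible shape for such a speaker-level compaction is a greedy loop that drops whichever speaker hurts development error the least. This is purely a sketch under assumed names; the actual procedure is unpublished (see the question near the end of this talk).

    def prune_speakers(speakers, evaluate_wer, budget):
        """speakers: iterable of speaker IDs; evaluate_wer: callable scoring a dev set
        against the database restricted to a given set of speakers (both assumed)."""
        kept = set(speakers)
        base_wer = evaluate_wer(kept)
        while len(kept) > budget:
            candidate, cand_wer = None, None
            for spk in kept:
                wer = evaluate_wer(kept - {spk})      # re-score without this speaker
                if cand_wer is None or wer < cand_wer:
                    candidate, cand_wer = spk, wer
            if cand_wer > base_wer * 1.01:            # stop if degradation exceeds ~1% relative
                break
            kept.discard(candidate)                   # this speaker was redundant (or bad)
            base_wer = cand_wer
        return kept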
Okay. This slide is a little bit out of proportion. That's where we did the template
activation for that. Sorry. Some further optimizations we did: silence we don't put into the
DTW system anymore. We just do a straight mapping from the HMM score. There's a very
high correlation between the two, which makes sense because of the similarity between the
two systems. And this speeds things up -- it gave us roughly a 10 percent improvement --
and we sometimes use hybrid systems.
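Since the talk only states that the HMM and DTW silence scores are highly correlated and that a straight mapping is used, a sketch of fitting such a mapping could be as simple as a linear fit on held-out data where both scores are available. Function and variable names here are assumptions.

    import numpy as np

    def fit_silence_mapping(hmm_scores, dtw_scores):
        """Fit a straight-line mapping from HMM silence scores to DTW-scale scores."""
        a, b = np.polyfit(hmm_scores, dtw_scores, deg=1)   # slope and intercept
        return lambda s: a * s + b

    # At decode time, silence segments would then be scored by the HMM and converted,
    # so no DTW needs to be run on silence at all.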
So, to conclude, let's look at some more results, [inaudible] larger and larger databases.
Here is the template type, context-independent or context-dependent, and VTLN, and
whether we used data sharpening. And I've included one more result here where we did not
use data sharpening, on Wall Street Journal 0 plus 1. When you keep all the other stuff the
same and don't use the data sharpening, you have 11.4 percent. When you do the data
sharpening, it's 9.2 -- a 20 percent relative difference. So it's huge on this stuff.
We don't --
>> So not being hybrid, that --
>> Dirk Van Compernolle: That means it's just the DTW.
>> The final score is based just on the DTW scores.
>> Dirk Van Compernolle: Yeah.
>> On top of the phone lattice.
>> Dirk Van Compernolle: On Resource Management, the results we have are significantly
better than the HMM alone. On Wall Street Journal 1, they're not; they're very close to the
single system. But look at what a huge difference you get when you go from
context-independent to context-dependent with VTLN.
What has happened by going to these context-dependent templates and to this VTLN is
that a lot of the information in the transition costs that we had -- like we used to have a cost
for gender transitions -- well, after we do VTLN we really don't need the gender transition
anymore. It goes away. The bigger the databases we have, the more consistently the plain
Euclidean distance works on its own. The more context-dependent environments we have --
>> [inaudible] this whole table. Can you explain what your template stuff is, and is there an
HMM baseline anyway?
>> Dirk Van Compernolle: Only in the hybrid system, where we combine the scores of
HMMs and DTW. All the rest, where there's a "no" -- in these ones here there's no HMM
involved at all, except for the original labeling of the training database.
>> We do have HMM baselines for these.
>> Dirk Van Compernolle: We have HMM baselines for these. For Resource Management
they're around 2.8, I guess. Here it's 3 percent, here it's 7.5. So we're not at our HMM level
yet.
>> Is that --
>> Dirk Van Compernolle: But on this one -- around here I don't know what's going wrong
with the merging, because there's something like a 25 percent difference in the errors that
they make. So they do make different errors. So the merging tells us there must be more to
be gained.
>> For the Wall Street Journal experiments, you first use HMMs to produce the test phone
graphs and then [inaudible].
>> Dirk Van Compernolle: Yeah, that we do. So our results are getting better and better --
the first runs we did on Wall Street Journal 0+1 half a year ago, we had 20 percent, I guess.
So doing something new and something bigger is always a challenge. And it's often in the
small points.
But this one has been a big one. So there are a few things which are consistent over all the
stuff we do. But the slightly annoying part is that for many of the things we've looked at, we
ultimately come to the conclusion that we can leave them out of the system, and that the
simpler system with the Euclidean distance works better.
We still have the natural succession across which we place something. But once we've
moved to context-dependent templates, any other contextual information doesn't help
anymore, so the phoneme context transition costs, those are also gone.
So the story is a bit mixed on this. The results get better and better, but some of the things
that we thought would really help, or some of the differences between this nonparametric
and the more parametric stuff -- well, the big gains are in things which are very close to
HMMs, which is saying that it's really this pronunciation variability which is the big problem
in speech.
>> So everything here uses a phone graph --
>> Dirk Van Compernolle: We've always used graphs, but the first three, if I remember
right, did not use phone graphs, only template graphs, and those template graphs were
generated in a bottom-up fashion. All the Wall Street Journal experiments use phone graphs
to start from.
>> So most of the computation in the decoding, as I understand it, is for the translation
from the phone graph into the [inaudible], is it?
>> Dirk Van Compernolle: Yes, that's right. All the DTW computations that you have
there.
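For reference, the core of those per-arc computations is ordinary dynamic time warping between the test frames and a candidate template. The minimal sketch below uses a plain Euclidean frame distance; the local path constraints and length normalization are generic choices, not necessarily the ones used in this system.

    import numpy as np

    def dtw_distance(test, template):
        """test, template: 2-D arrays, rows = frames, columns = feature coefficients."""
        n, m = len(test), len(template)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(test[i - 1] - template[j - 1])  # Euclidean frame distance
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)   # length-normalized path cost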
>> So I have a couple of questions.
>> Dirk Van Compernolle: Sure.
>> So the first question: [inaudible] sharpening, when you use it with HMMs, does it stay,
like you were saying, the same, or does it degrade?
>> Dirk Van Compernolle: Degrades significantly.
>> It does make sense [inaudible] do good for a number of other things, but for HMMs I
thought it makes a lot of difference.
>> Dirk Van Compernolle: Well, if it just pulled in bad outliers, it should do good.
>> All right.
>> Dirk Van Compernolle: So it does do other things than that. It's not just cleaning up
your database.
>> So the [inaudible] question: the data compression thing that you did, is it published
[inaudible] speech, or --
>> Dirk Van Compernolle: Which one?
>> The data compression procedure that you used to throw out the bad speakers --
>> Dirk Van Compernolle: That hasn't been published yet. But it's likely to be submitted to
Interspeech this year.
>> Geoffrey Zweig: We are just about out of time. Any other things you want to
mention?
>> Dirk Van Compernolle: I think I'm pretty much at the end of my [inaudible] what I
have to say.
>> Geoffrey Zweig: All right. Let's thank the speaker.
[applause]
>> Dirk Van Compernolle: Thank you.