>> Geoffrey Zweig: Hi. So it's my pleasure to introduce Dirk Van Compernolle today. Dirk has had an extremely distinguished career. He started at Stanford University and overlapped there with, among others, Les Atlas and Mari Ostendorf. After that he had a stint at IBM working with Fred Jelinek. Eventually he left IBM and took a position as a professor at the Katholieke Universiteit in Leuven, Belgium, where he is a professor now. Additionally, while there he was also vice president of research at Lernout and Hauspie, so he's got a great industrial perspective on things as well. More recently he has been working on the FLaVoR bottom-up decoder, and even more recently on template-based speech recognition, which he's going to talk about today. >> Dirk Van Compernolle: Thank you, Geoff. So today I'll give you an overview of what we've been doing in example-based recognition. This is work -- there are a couple of other names on there, as this has been going on for quite a few years. And, okay, you can see in the slides later there are maybe some of the more important references to the work. But I guess the first publications we had on this are from 2004. And things have changed. We've learned quite a bit by moving up from smaller to bigger problems, by doing more experiments. And it's been an interesting ride. So I will try not to summarize everything we've done, but to give more or less our current perspective on what we do. So I'll do the talk in two parts. More or less first get to a baseline system, get to some example-based speech recognition, and then go into the finer points of it. Why did we start on this work? Well, there are a couple of different motivations. A major one is that when you go and look into the psycholinguistic literature, there's very much evidence that humans store traces, almost of the audio itself. Just think of -- I have two small kids. They speak Dutch at home like me, but when they hear songs in English, they try to repeat them. This is pure acoustic memory. These things are in there. There are traces. And with music, that's quite obvious. But we do this for speech too. How important it is -- there's obviously a great deal of early abstraction in the system too, and how important those longer references are, that's by far less clear. So there is evidence that we use trajectories, templates, episodes, whatever you want to call them. I put them under one big nomenclature. Then another motivator was the success of concatenative synthesis, where people said, well, we've tried to model things with a good model for so long, it doesn't work, we've gone away from the model, we just put in lots of data, and we've played more ignorant than we were before, and things work better. HMMs have worked fine, and they still work fine. They're still the state of the art. But we try to beat them. And what have we done in HMMs? For the last 20 years we've tried to overcome all the errors, the basic errors in the model. At some point you get tired of doing that, of just tweaking it, and you'd just like to go to another model. And another thing was, maybe, okay, let's try to do something that burns up lots of computation and lots of memory, because those are growing so fast and somehow HMMs don't keep up with them. So what do we do in HMMs that I don't like, or that most of us don't like? HMMs take the data as it comes in and say, okay, here we have a state, and for that state we assume a short-term stationary segment of speech.
So whatever sequence those observations come in, it doesn't really matter. You can turn the sequence upside down. Well, what have we done? We've put derivatives on the features, people have tried to put them into the model. This has helped. This is tweaking of the model. But it's still fundamentally a problem. So what do we do? Let's take a small example here. We looked at lots of data for the I from the word five, and all the dots are individual data points, just projected down to a two-dimensional space. So originally the feature vector is 39-dimensional or 36-dimensional in this case. But then we do an LDA down to two dimensions, which is suited for the picture. So we say, okay, what are the two most important axes of the I. And so you see the plots. And you see HMM states. So three states associated with one phoneme. Okay. It's [inaudible], there's a lot of transitional information in there, it's not an R, but it's I. So we expect to see some movement. And we see those dots, and the HMM says, okay, let's take all the dots belonging to one state and let's just model them with a mean and variance -- yes, we take multi-Gaussians, we make it a bit more complex. And that's our representation of each of these states. >> So these are the top two dimensions in the HLDA? >> Dirk Van Compernolle: Not the HLDA. Well, we don't use HLDA. We use something a little bit similar. But not the LDA that you're going to use for the recognizer -- an LDA to make the I stand out, specifically trained on the I versus all the rest, so that the I gets more pronounced. >> I guess the question that I have is, if you were to look at the dimensions the recognizer is looking at, would it look like this? Because it's a higher dimensional space. >> Dirk Van Compernolle: It would not -- no, you would not be able to make a nice picture, because it's a higher dimensional space. And the first two dimensions -- the first two dimensions might not matter. You might have to look for which dimensions give you nice pictures. >> You're trying to say this is just representative, or is it just an artifact of the way it was drawn? >> Dirk Van Compernolle: I'm looking, let's say, in the 36-dimensional space for the 2D intersection where I'm going to get the biggest variance on the points, which gives me the nicest picture. It's just one cut in this huge space. >> [inaudible] >> Dirk Van Compernolle: I think it's a projection. I did this a long time ago. I think it's a projection. >> It would be nice if these were the dimensions, two of the dimensions that the recognizer [inaudible]. >> Dirk Van Compernolle: It's not. We've done some plots for cepstra too, and with cepstra it's not too bad. >> Right. So that's my point. Are you making things look worse than they really are in practice? >> Dirk Van Compernolle: Let's say -- you know how to visualize a 39-dimensional space? >> Just show two of them. >> Dirk Van Compernolle: No, no, that's a bad idea. That's a bad idea. >> Dimensional -- probably the more real -- in a higher dimensional case, or if you compute the LDA on everything, things would be more scrunched up on top of each other, so it would look worse [inaudible]. >> [inaudible] >> Dirk Van Compernolle: No, because -- I'm not so sure that it's more believable by taking two random dimensions. >> If I take 39 dimensions and collapse them to two, everything is going to be [inaudible]. >> Dirk Van Compernolle: Yeah. Let's look at some other things in here. >> That's not to say that [inaudible]. >> Dirk Van Compernolle: No.
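The kind of projection described above is easy to reproduce. A minimal sketch, assuming frame-level feature vectors and phone labels are already available as arrays; the class setup here is an assumption (the talk trains an LDA to make the I stand out, and a multi-class LDA over several phone labels is used below, since a plain two-class "I versus rest" LDA would only yield one dimension):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Stand-ins for real data: 'frames' would be 36- or 39-dimensional feature
    # vectors and 'labels' the phone identity of each frame, both taken from a
    # labeled corpus.  Random data is used here only so the sketch runs.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(5000, 39))
    labels = rng.choice(["ai", "aa", "iy", "s", "t"], size=5000)

    # Supervised LDA: keep the two most discriminant directions and plot only
    # the frames of the phone of interest (the I as in "five") in that plane.
    lda = LinearDiscriminantAnalysis(n_components=2).fit(frames, labels)
    proj = lda.transform(frames)
    ai = labels == "ai"
    plt.scatter(proj[ai, 0], proj[ai, 1], s=4)
    plt.xlabel("discriminant 1")
    plt.ylabel("discriminant 2")
    plt.title("frames of the I in a 2-D LDA projection")
    plt.show()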
Let's look now at how, for the same [inaudible], the trajectories look. So they are there. The point that I want to make is that the top left side are the beginnings of the trajectories, and the other side are the ends of the trajectories. It's not a random sequence of points -- when you're in state one you don't get points drawn randomly [inaudible] from that distribution with the mean and variance indicated. That's not the way you draw points to create realizations. That's the point I want to make. That's the only thing. So even after doing context-dependent modeling, after doing derivatives, there is still information in the trajectories at the phoneme level. Even at the very short trajectories. That's the only point I want to make; that the first-order Markov assumption does something: it significantly overgeneralizes what speech is. For example, you could say the black one, as I drew it, is a reasonably good example. The red one, when you start computing HMM probabilities with points on this red curve, will get a higher likelihood than the black one. But intuitively I would say it's a worse representative of the class of trajectories. And that's basically the motivator I have. Let's look at another example. The previous things were things I drew by hand. These are real examples. Let's take a real trajectory. The real trajectory of data, starting point, endpoint. And let's find now, in the class of points, the closest trajectory using a DTW algorithm, just a dynamic time-warping algorithm. Nothing more than that. And that's the green one. That's the closest trajectory that we find. It is by no means just the sequence of nearest neighbors. It's something that goes a little bit the same way. I don't want to talk about smoothness, because there are some other dimensions to be taken into account. But this is the DTW done in the full-dimensional space, not in this space, so it doesn't have to match here. So be careful with me saying it's not just the closest points -- maybe they are the closest points in the 30-dimensional space. I'm not sure. I have to be careful when I say that, because we're looking now at the two-dimensional space. It's not in each dimension the closest points. But we are trying to make a sequence of nearest neighbors in the high-dimensional space. >> When do you use enabling information to [inaudible] curve? >> Dirk Van Compernolle: I do a dynamic time warping. So I take points, compute local distances, and then I add them up along the full trajectory to see where I get. >> [inaudible] >> Dirk Van Compernolle: I'll give you a few more details on the example, what it does, in a couple of slides. So what's the matching paradigm that we use? For HMMs in general -- and actually we'll see the time warping thing is not that different from the HMM setup -- we have standard Bayesian recognition. We get the word sequence which maximizes the probability of the words given the data, and we typically write that in another way, where we find a likelihood function, or some function that gives the matching score of the data given a word sequence, times the probability from a language model. That's the standard formula. It's a generative model. Now, what do we do in template matching? Because there's not just a single template sequence that can explain a word sequence; there are plenty of template sequences that can explain the word sequence. So we really should sum over many possible templates and words and find this thing.
Now, that sum is horrible to compute because there are so many possible examples. So we take a very -- maybe a very awful shortcut, you might say. We make a Viterbi assumption here and we take a maximum over these things. And then we ultimately get to a formula that looks very much the same as the one above, except that we have an extra term. We have a probability that the template sequence realizes the word sequence. Of course, the T here you have to see as a sequence of templates that you're going to pull out of your database of all possible templates and link together. So you're going to splice things together. I don't put the whole timing information here. Now, these templates of course have to explain the words, so they have to have the right phonetic identities when you look at the words; they have to explain them in a phonetic sense. But then, okay, what's that probability? We'll come to that later. Now, let's look at this. You say, okay, it's about this argmax -- that's our Viterbi assumption. For the argmax we need a search engine to do this. But basically this thing doesn't look very different from anything we've done with HMMs, so we will be reusing the search engine that we have developed for HMMs. We need the similarity score between data and words and templates. And we need this template prior. And then we have a language model, just the way we have a language model in speech recognition. Now, if we compare HMMs and examples, actually there are a lot of components in the full system that are the same. You need some units -- phone units, allophone units. They're also there in example-based systems. We have allophone units as well. We have phone units. We need the local similarity there. There is a big difference there. In HMMs we typically use multi-Gaussians. In example-based systems we'll have some local distance measuring the distance between a point in space, a reference, and a point in space from a test sample. The time alignment, we do Viterbi and DTW, which are actually the same algorithms. Something with this computer -- no, I'll just move my hands. It sometimes starts typing. This seems to have built-in speech recognition one way or another, but with a very high error rate. The search we do -- so I should use this pointer. The beam search is actually the same one. We use a token passing system to implement these things. And it's the same search we'll be using for the examples. The training, well, that's where things are very different. The local distances, that's where things are really different. That part is not so different. And on transitions, well, we'll have some things on transitions which are quite different too. What do we do for training? The only thing we need to do for training in this: we take a big database and we have to label it. We have to know which segment of speech corresponds to what, what it is. We can label it at the phoneme level, we can label it at the word level -- in principle you're free. In the concept of the system, you're free to do what you want. But, in practice, all the things I'll be showing will be phonetically labeled databases. >> You mean manually labeled? >> Dirk Van Compernolle: No, no, no, no. Fully automatic. And we use our HMMs. HMMs are nice things. They are good tools. We use them. So we don't -- in this work we started off by throwing out HMMs completely. But, no, no, no, we need them. We need them. We need them at many places. Okay.
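The two decision rules just described can be written out explicitly. A hedged reconstruction in LaTeX, with X the acoustic observations, W a word sequence, and T a sequence of templates drawn from the database; the notation is chosen here, since the talk only describes the formulas verbally:

    % Standard Bayesian decoding with a generative acoustic model:
    \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \, p(X \mid W)\, P(W)

    % Template-based decoding: in principle a sum over every template sequence T
    % that can realize W, in practice replaced by a maximum (the Viterbi-style
    % shortcut), which leaves the extra term P(T | W):
    \hat{W} = \arg\max_{W} \sum_{T} p(X \mid T)\, P(T \mid W)\, P(W)
            \;\approx\; \arg\max_{W} \max_{T} \, p(X \mid T)\, P(T \mid W)\, P(W)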
And maybe we need to do some estimation on our distance matrix. But I put that down as optional. Well, we'll see -- all this in orange, I was going to skip those slides. Now, let's look at those probabilities and distances. We have observation probabilities, really that part of it, so it's a sequence of data given a sequence of states, and we maximize that. So we say, okay, we have to find the state indices which maximize our probability. That's what the Viterbi algorithm does. If you take the logarithm of that, it's still a maximization, and it becomes a sum. And if we now assume for this probability function a single Gaussian, well, then we just have the simple quadratic term, plus the term that stands in front of the exponent in our Gaussian. As long as that sigma doesn't vary much from one distribution to another, we can get rid of this. And so then we end up with something that looks like that, a standard Mahalanobis type of distance. So a single Gaussian with Viterbi, or a Mahalanobis distance in some type of DTW -- now we have to minimize this. Instead of mu's, we're now going to replace these mu's by examples. So we treat points as the means of the Gaussians in a single-Gaussian system. >> Your distance can include that term, right? >> Dirk Van Compernolle: I can include this term, and we have included this term. We've done work with that. We've done a lot of work with that. >> [inaudible] >> Dirk Van Compernolle: It used to make a difference. It doesn't make any difference anymore. That's the -- I've gone back to some of our old stuff. And back then it made a big difference, and now it doesn't anymore. And I'll continue. >> [inaudible] only [inaudible] training [inaudible] show that if you do [inaudible] taking this wrong turn, probably is a little more useful. Well, first of all, make things much easier to train. >> Dirk Van Compernolle: Well, it used to be useful. And I don't have all the answers why it's not useful anymore. But we've moved up to big databases with many more examples. And we've included a lot of refinements in the system. And the more refinements we've included in the system, the less we need this. But I'll come back to that -- there are some results where it's included and where it's not included. But okay. I just drop it here to say, okay, the two things look very similar -- just for the intuitive understanding of it. Now, there's a bigger problem with this when we go to distances. So now we have to minimize over a sequence, over possibly many time steps, and we have to find the best template or template sequence in the whole database to do this. Now, when we have a sigma here, what sigma should we put there? And that's actually a bigger problem than saying we want that term or we don't want that term: what sigma should you put there? >> And you're actually making that class dependent? >> Dirk Van Compernolle: Well, when you look at the HMM formulation, that sigma is dependent on the class that you're matching. You're looking for the distance from a point to a class. Or a matching score of a point to a class. Now you're saying I match a point with a point. Okay. In your reference database you know to which class they belong. You know I'm in I, I'm in E, I'm in something, and you could say, okay, I want it to be in the correct class, so I'll measure how far I am from that class. But you're looking at a point.
And this variance, okay, what do we know about this variance -- what is it in HMMs? It's the global distribution of points in a class. And that's going to be pretty big, as you saw in that example. And we'll be working much more in the local neighborhood, and the global distribution of the class really doesn't make that big an impact on what you do. And it shouldn't. It doesn't matter. Let's say I have a class which looks like this; if I'm here, I want to have something here in the neighborhood. I don't care about those points all the way at the other end of the class. >> And is it going to turn out to be important, what covariance you use? Versus if you just ->> Dirk Van Compernolle: Let's look at some results later. So the best similarity you can tie to HMMs is saying, okay, I infer the class from the reference. So I train single, diagonal covariance matrices on classes, and you could do that phonemically or even with more state-level labeling or whatever type of thing. We've done a number of ways to come up with many covariances. >> You could argue that maybe you should use a covariance that's associated with the utterance from which you drew the template. In other words, not something about the class, but something about the speaker, or the speaker at that particular point in time in that acoustic environment. That's where you get your covariance. >> Dirk Van Compernolle: I don't know. Doesn't sound logical to me. >> Okay. >> Dirk Van Compernolle: One thing there is, when you put something here and you define it that way, well, then it's not a distance anymore. It's not symmetric, which makes it already a strange thing to do, because you're really comparing two things. And why should the thing in the reference database have a label and not the one that you're testing? They both have a label. So why should you use something asymmetric in this nonparametric type of approach that we're working with? It's kind of counterintuitive to do something asymmetric. Be aware we are using features here that have gone through HLDA-type preprocessing. So they are already reasonably normalized. That's what we're always doing. And this definitely helps. So the importance of this -- it should be a second [inaudible] effect. Because we've already rotated the whole thing into a space where Euclidean distance should make more or less sense. >> So not just rotated by the -- divided by the absolute variance of each dimension. >> Dirk Van Compernolle: Yeah. Everything. There's a [inaudible] information, discriminant analysis, there's a rotation to make things as well fitted to single Gaussian distributions as possible. >> Can you learn that thing maybe discriminatively or ->> Dirk Van Compernolle: [inaudible] spent a whole Ph.D. on it and didn't get anywhere. That's the name Mike McDonald that you've seen on these slides. And this has been one of the more negative results that have come out of it. He spent six years on this and never managed to get any improvement or whatever else with it. You might blame him, but you might also blame the concept. But, yeah, that's what we thought too, that maybe this is something that you should be able to learn discriminatively. >> [inaudible] covariance between different classes to see if they actually are very close? >> Dirk Van Compernolle: There are differences. There are differences. How close? I can't put any figure on it. The only thing I can do is say, okay, what do we see in terms of experiments or results.
You'll see a few things. Just hold your breath for five more seconds. I know. It's going to take a little bit longer. But I have quite a few slides on what the different distances do. But we don't have our full initial system yet, and that's what I'm trying to get to first. But this distance metric obviously is a problem, and I'll come back with more on it. Now, okay, I put it a little bit more explicitly that we actually have template sequences that are matching word sequences. And we have there this term, the probability of the templates given the word sequence. Okay. Is any template sequence that explains the words -- so you look at the phonetic labels of the templates -- is any such template sequence equally good? Well, we do it the way we like to in speech recognition. We say, okay, a template sequence -- we'll write this as the probability of a template given the previous templates and the word sequence. And so we approximate this as, more or less, template transition costs. Okay. There's also a real template prior cost in there, a real prior of a template. I have no idea what they are. I have an I here in my database, I have another I in my database. Is one more likely than the other? Should I have a prior probability assigned to certain things? I could run a big leave-one-out experiment on my database and see how often something is used on my training database. That's a possibility for estimating things. We've debated. We've had plenty of philosophical discussions on what a prior is. We've just computed the likelihood of that template based on an HMM system, and then normalized over all templates with the same phonetic signature, which also might say, okay, how well does it match the class, is it a good one. The prior as such hasn't given us much, but there's also this concatenation term. And that makes more sense. That says, okay, one template following another template -- we have them phonetically labeled -- well, it's more likely if the larger phonetic context of these things is consistent. So like with context-dependent phones: if I have a template of an R which was originally recorded between a P and a T, then it's much more likely that the previous template would have been a P. And even a P which was followed by an A and then maybe preceded by something else. So there's definitely phone context that plays a role, and then there's something we call meta-information. And meta-information can be all kinds of other things, where you say I like things to be consistent, like it's more likely that all the templates that I will draw are from the same gender. Why would I start mixing male and female templates? It doesn't make too much sense. It's more likely that they were recorded in the same background conditions. Any abrupt transitions are just unlikely. And I could also look at spectral discontinuity between those templates. All these things are not likely. So template transition costs, yes. And actually the most important one we have there is something we call the natural successor. And this is something that you'll see come up more often. I'll say we use phonemes as templates, and everybody immediately gives as a criticism, no, you should use syllables or words or longer units. Well, we try to overcome this criticism by saying let's try to have as long units as possible. But going to word units would sparsify our training database tremendously. So that would not do a good job.
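How such template transition costs could be scored is easy to sketch in code. This is an illustrative sketch only, not the actual decoder: the field names and numeric weights are invented placeholders (in the real system the costs are tuned on a development set), but the logic follows what is described here and elaborated in the next answer: a flat template startup cost that is waived when two templates were adjacent in the same training utterance (the natural successor), plus adjustments for phonetic-context and meta-information consistency.

    from dataclasses import dataclass

    @dataclass
    class Template:
        utt_id: str      # utterance the template was cut from
        position: int    # index of the phone within that utterance
        phone: str       # phonetic label of the template
        left_ctx: str    # phone that preceded it in the original recording
        right_ctx: str   # phone that followed it in the original recording
        gender: str      # meta-information, e.g. "m" or "f"

    def transition_cost(prev: Template, nxt: Template,
                        startup: float = 10.0,
                        ctx_bonus: float = 3.0,
                        gender_penalty: float = 5.0) -> float:
        # Natural successor: adjacent in the same training utterance, so no
        # startup cost is charged.  This implicitly favours long stretches of
        # original speech (multi-phone segments) over many small splices.
        if prev.utt_id == nxt.utt_id and nxt.position == prev.position + 1:
            return 0.0
        cost = startup
        # Prefer templates whose recorded context matches the hypothesized
        # neighbour, the same intuition as context-dependent phones.
        if prev.right_ctx == nxt.phone and nxt.left_ctx == prev.phone:
            cost -= ctx_bonus
        # Meta-information consistency: e.g. discourage gender switches.
        if prev.gender != nxt.gender:
            cost += gender_penalty
        return cost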
>> So the idea that you should be similar to this [inaudible] you just make all possible [inaudible] if whatever match is the best [inaudible] whatever you have. >> Dirk Van Compernolle: You choose whatever the best ->> [inaudible] predetermines that [inaudible]. >> Dirk Van Compernolle: You do. You choose whatever is best, but you encourage segments that are as long as possible. So you encourage biphones, triphones as they are. And how do we do that encouragement? Like in speech recognition, when you start a new word, you have a word startup cost in order not to have too many new words being started. That's what every decoder does. And if you have a phoneme recognition system, a plain phoneme system, you need a phoneme transition startup cost. So we also have a startup cost in the decoder. But when we have a recording that says "bad", then, okay, we hypothesize template B, and then we have to hypothesize "ah" after that. If we take this as one natural segment, as it was in the original training database -- if we take a "bah" instead of a "b" and an independent "ah" -- we say, well, we don't need a template transition startup cost for this second one. Or if we take "bah", well, also for the D we don't need a template transition startup cost. So we give preference to longer segments by not charging for the transitions we don't make. So instead of doing this explicitly, we do it implicitly. And it's important. It makes a big difference. >> Now, to do that you also need to have segment boundary information for the different templates. >> Dirk Van Compernolle: In the original database, yes. >> [inaudible] >> Dirk Van Compernolle: We use the HMM to do that. >> So during the decoding, do we use the HMM also? >> Dirk Van Compernolle: During decoding -- well, we use a token passing decoder, and the token passing decoder that we use is the same one that we use in the HMM system. It doesn't look any different. >> But, I mean, do we actually use [inaudible]. >> Dirk Van Compernolle: During decoding, no. >> In TTS, typically [inaudible] to convert from word to hidden and then [inaudible] guide the synthesis, so use [inaudible] model to actually choose the templates from the training set. And you can get a much better result, basically. >> [inaudible] those, since your second [inaudible] uses HMM already [inaudible] otherwise you may never be consistent. I don't know [inaudible]. >> Dirk Van Compernolle: Okay. Let's look at the overview of the system. So what do we have? We have the distance measures, we have a reference database, we have the template transitions. We also need a block for fast template selection -- we want to do a fast match, because we don't want to search the full database. It's too big. And then we need the search algorithm. And the lexicon and language model, that's no different than what you would have in a standard system. So all these things are very much the same except this fast template selection, and it's more the things here at the bottom that are different from an HMM system. Now, an example. Let's listen to the input. So you try to figure this out. >> Audio: It's still unclear. >> Dirk Van Compernolle: Which is natural speech. And now you try to do a template-based recognition. >> Audio: It's still unclear. >> Dirk Van Compernolle: It's not horrible at all. So what really happened -- so this is the input.
Now, you have a segment of speech here, the E; you go and look in the database that you're working with, you find another E, you take this E, you time-align it with the E in the test sample, you do this for all individual sounds and you splice it together. You do some interpolation on the FFTs, and you do an inverse and you get the sound back. >> Was the same speaker in the training database? >> Dirk Van Compernolle: I think for this example, yes. I have some more examples coming. >> So presumably you need to look for similar contexts [inaudible]. >> Dirk Van Compernolle: You definitely need to look for similar context. But here are some examples of what the effect of this concatenation cost really is. Where I say we prefer multi-phone units over single-phone units, because there are going to be much fewer -- we want to have as few transitions in the matched template sequence as we can. And that's a pretty old example also here. >> Audio: There were 13.9 million crimes reported in 1988. >> Dirk Van Compernolle: That's a Wall Street Journal sentence. >> Audio: There were 13.9 million crimes reported in 1988. >> Dirk Van Compernolle: So that's an attempt to resynthesize that. And that's using just phone templates, not context-dependent templates. That's a pretty basic system from some time ago, but it's nicely illustrative. How many segments do we have? Okay, you might say, okay, we have 55 phones in that sentence. Now, what we did here was not playing at all with those transitions. So you just try to find a concatenation of the best phones in the database. Well, you find 48 segments by chance. There are a couple of segments which have -- like seven of them will have two phones in them. >> What's the difference between a segment and a template? >> Dirk Van Compernolle: Here with segment I mean a natural segment of speech in the original training database. Which can consist of -- which is a multi-phone unit. >> I see. >> So if you [inaudible] how do you choose the best phone? >> Dirk Van Compernolle: Just based on DTW. >> Oh. >> Dirk Van Compernolle: That's just doing DTW. I say I'm looking for the best phones and I just do DTW; that's the only thing I do. And now I'll play with those phone concatenation costs. Say, well, things have to come in the right phonetic context, so I look at phoneme context sensitivity. And I prefer long segments. And then it sounds a little bit different. >> Audio: There were 13.9 million crimes reported in 1988. >> Dirk Van Compernolle: So if you didn't do the costs ->> Audio: There were 13.9 million crimes reported in 1988. There were 13.9 million crimes reported in 1988. >> Dirk Van Compernolle: That's when we add the costs. And look: instead of 12 phonemic errors, you only make three phonemic errors, and the average segment length has increased to almost three phones per segment. So in this example you're really using multi-phone units. >> So this implies, if I understand it right, that the segments that were used in that no-costs one actually have better DTW scores, just pure DTW scores, than the ones in the bottom line. >> Dirk Van Compernolle: Yes. When you just look at the DTW score, this one is better than this. >> Isn't that surprising? [multiple people speaking at once] >> Dirk Van Compernolle: But the DTW score is a bad representation of your match.
Because it's not just the DTW score to the reference, but also the DTW score from frame to frame of what you've synthesized, and that gets so unnatural at those joining points that what you're putting together is probably something like what I was showing at the very beginning: something that isn't even in your manifold of points, of traces. You're putting together a collection of points that makes no sense, that doesn't exist. So where we've now made the trajectory consistent within the phone, we've totally forgotten it at the cross-phone level, where it also has to be consistent. >> And so this last line, this is using information that is not present in an HMM system, because it's using information [inaudible]. >> Dirk Van Compernolle: Both. Part of the information is in an HMM, because it uses context dependency on the phone units. >> Okay. >> Dirk Van Compernolle: But part of the information is not in an HMM. >> Like if the two segments came from the same person [inaudible] or they were adjacent originally in the data. >> So how do you compute the transition costs? >> Dirk Van Compernolle: For the natural -- so there is this phoneme -- this template startup cost which is the same for all templates. The same cost for all templates. Like you have a word startup cost in a decoder. So that's one parameter which is optimized on the development set. >> The guys that do speech synthesis now, the modern technique is, rather than computing the cost in an arbitrary way, which is what they did, [inaudible] I'm going to look at my HMM to tell me on average, when I go [inaudible] this unit with this unit, where are the scores. And, as you said, within that unit, it's fine, but when you put it there, the HMM is going to tell you this discontinuity started [inaudible] some score. That depends on the context dependency and you can precompute it. >> Dirk Van Compernolle: Yes. We've done a number of things on this. But I'll -- I'll hold off on giving more details on this. Because there's ->> [inaudible] >> Dirk Van Compernolle: How many parameters here? Well, when you have context dependent things, when you look at the context, it might be lots of parameters. But what we've done instead -- so we used to work with phone templates. We've moved to context-dependent phones as templates. And ->> [inaudible] I have in manually joint [inaudible]? >> Dirk Van Compernolle: On the development set, you mean? >> Manually, not automatically tuned. >> Dirk Van Compernolle: It's all -- it's optimized on the development set, all the parameters. Okay. Let's now dig a bit deeper into all the problems that you've seen are there, like the distance metrics. Let's look at results there on the distance metrics. There's something that comes with it that we haven't touched on, that's outliers, and then we do frame and phone experiments. And then we have some things about making the databases bigger and bigger and bigger. Well, we've done a couple of things to look at that. So let's look again. So should we use this Mahalanobis distance, what should we use, should we use the sigmas of the classes, should we take this into account or not? We've done it. We've looked at it in many ways. So we have this local Mahalanobis distance where we take the C -- okay, there should be a minus 1 there. And this has been our standard approach for a long time. And many years back I even claimed that this made a huge difference.
But now I say at the bottom this was our approach, because we're not using it anymore. We've also looked at adaptive kernels. So you say it's really a kernel that you put on every point that you have, but maybe you should be looking at the local distribution of the points around it to define your variance for it. We've done this too. Then at some point when you start working with kernels you really start pushing your full distribution a little bit outward, because you have points here, you put the kernels over them, and you push your distribution a little bit further out than what it really is. So people in kernel-based techniques have used techniques called data sharpening to say let's pull especially the outlier points a little bit towards the center of the distribution, because otherwise, if I put kernels over them, I push them even further out. And, moreover, outliers may not be very good representatives. But there is a much bigger problem that we've really run into. There are two types of outliers: one are mispronunciations, bad pronunciations, and the others are rare pronunciations. And the wrong ones you want to get rid of. They may be transcription errors, they may be things that should go out of your database. The rare ones you want to keep. And it's damn difficult to make a distinction between those two. Any good ideas on that? Saying, okay, if I find with an HMM a low probability for a sound or something -- okay, sometimes it's so bad that you can say yes, it's a bad piece of data. But sometimes it's just bad. Now, what is it? Is it bad data or is it rare data? And rare data I don't want to get rid of. I don't want to get rid of the person with an occasional 50 hertz pitch, which barely exists, but there are people who go extremely low in pitch. I don't want to get rid of them. Their data is going to look a bit different. So the good outliers are difficult to distinguish from the bad outliers. So this data-sharpening concept -- well, there's actually something very nice in saying, well, maybe you don't really need to know. Let's just take all points at the edges of your distribution and pull them a little bit inside. And not towards the center of the distribution -- you have a global distribution, you have an outlier here, just look in its local neighborhood. Just look around there: what points of the class are in my local neighborhood? Don't use the centroid of the full class. Just look at its local neighbors. Look at the K nearest neighbors within its class, and then just take the mean of those. And replace each data point that you have by that mean. So you change all the data points you have. Now, let's look ->> [inaudible] >> Dirk Van Compernolle: I do this point by point. >> Point. Oh. So after you use [inaudible]. >> Dirk Van Compernolle: Yes, I do it data point by data point. I do it data point by data point. And, okay, that's what happens. So, again, some projection in a 2D space. What happens to things in the middle? You look at their K nearest neighbors. Well, things are typically nicely distributed when you're in the middle of your distribution, and what happens to those points? Almost nothing. They barely move. What happens to points that are at the edges? Okay, you take this point, you look for its K nearest neighbors, you make a whole sphere around it -- well, all the points are on that one side. So whatever is going to happen, this thing moves in that direction.
And now if it's -- what? >> The bad point [inaudible] too. >> Dirk Van Compernolle: The bad point can move in too, but at least it's going to look much more like the other points of my distribution. Yes, it can move in too. But at least I've colored it with a flavor of what it should look like. And I don't have to make a hard decision on what they are. >> So if you iterate this over and over again [inaudible]. >> Dirk Van Compernolle: No. No, no, no. This might iterate to [inaudible]. Now, we thought, oh, this is very nice. Why don't we do this first and then train our HMM again on this, and shouldn't our HMM now work better? It didn't. It didn't. >> [inaudible] >> Dirk Van Compernolle: And you've made your variances too small -- if you now start training an HMM on this, then the distribution is not good anymore and you've pulled the points too far inside. But for the nonparametric version of things, this does work. Well, the thing is we do this Viterbi approximation. So if you find one good match and something is badly labeled and things like that, you can really make bad errors. So it's a little bit -- it's coloring your data with some class information that you're doing. Instead of using the original data point, you say, okay, I'll color it with a little bit of -- and if it's a really good member of the class, here in the middle, I won't do much at all. Or I won't do anything at all. If it's a bad example of the class, I'll do quite a bit. >> So what other area of research uses this [inaudible]? >> Dirk Van Compernolle: This comes from work in nonparametric statistics. We didn't invent this. Yes, that's textbook material in nonparametric statistics. It's not a rare thing. Okay. We did experiments on plenty of different test sets. And, okay, let's look just at some frame classification things, where we've compared things with data sharpening and no data sharpening, and we've compared distance metrics: Euclidean, local Mahalanobis -- so where you estimate the Mahalanobis -- and then also the adaptive kernel, where you even go and look in the local neighborhood of the points to see if you should adjust your kernel or not. And then we did some voting, or we added some Parzen stuff, or we just look at the single best. So this is simple frame classification. Don't look at all the results. But look at the nice thing: when you do data sharpening, everything in this column is better than this column. And this is not just for this experiment. Whatever experiment you will see after this, data sharpening helps, 5, 10, 15 percent. Not a randomly small number, but a consistent thing. So for this type of approach, well, it's been consistent -- it's the most consistent trick in the book that we found that helps us. For much of the other stuff that we've tried, the results have been much less clear. >> Presumably for this kind of task you don't really have many outliers, right, [inaudible]? >> Dirk Van Compernolle: But as I say, we've done five different databases already. Or six different databases. So -- well, TIMIT isn't all that clean. You have outliers because there are phonemes which weren't really produced, that people have invented in the transcription, and so it's not that clean. It's reasonably clean. The other thing we see is that when you do some voting, when you look at the K nearest neighbor type of thing, for the simple experiments you always do better than the single best one. And Euclidean is not at all a bad idea.
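The data-sharpening step itself is only a few lines of code. A minimal sketch, assuming frame vectors and their class labels are available as arrays; whether the point itself is averaged in with its neighbours is a detail the talk does not pin down (it is excluded here):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sharpen(frames: np.ndarray, labels: np.ndarray, k: int = 10) -> np.ndarray:
        """Replace every frame by the mean of its k nearest neighbours drawn
        from the same class, so mid-class points barely move while outliers
        are pulled towards the bulk of their class."""
        sharpened = frames.copy()
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            pts = frames[idx]
            if len(pts) < 2:
                continue
            # k+1 neighbours because each point is its own nearest neighbour;
            # the first column (the point itself) is dropped below.
            nn = NearestNeighbors(n_neighbors=min(k + 1, len(pts))).fit(pts)
            _, neigh = nn.kneighbors(pts)
            sharpened[idx] = pts[neigh[:, 1:]].mean(axis=1)
        return sharpened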
But I'll show some other plots which may shed a different light on some of these things. >> [inaudible] >> Dirk Van Compernolle: Well, there's the core test set and there's the full test set. So everything on TIMIT is overoptimized and I shouldn't present the results, according to Geoff. So let's look at something else where we didn't do that -- well, this is on the development set of Wall Street Journal. Again, we do frame classification. You see huge differences sometimes -- especially with single best, between no data sharpening and data sharpening. If you do some voting, things get better. But the difference between the whole left and right columns is really always very significant. >> So what's the database used to build the templates? >> Dirk Van Compernolle: That's the Wall Street Journal 0 database. >> The training one. >> Dirk Van Compernolle: The training one, yes. >> So you use the training set to build the templates. >> Dirk Van Compernolle: To build the templates we use the standard training and test setups of all these databases. We don't do anything strange there. >> [inaudible] I don't understand something. This frame classification ->> Dirk Van Compernolle: Yes. >> I'm sorry, not the frame classification, the sharpening, was that done by moving whole templates? >> Dirk Van Compernolle: Points. This was points. I'm pretty sure it's points. I think we've looked at both of them. But I think they move independently. They could be incoherently moved. >> They could be. >> Dirk Van Compernolle: They could be. >> [inaudible] why not do the data sharpening on the templates? >> Dirk Van Compernolle: I'm pretty sure we've done this and it didn't make any difference. And this is much faster. >> So [inaudible] frame classification, what does this mean? >> Dirk Van Compernolle: You look at the frame, you look at the label of the frame, and you try to ->> But the template, you don't do ->> Dirk Van Compernolle: No, no, there's no DTW. This is frame classification. This is just for looking at some of these [inaudible]. When I look at a single point in my database, how does it behave with things like data sharpening, how does it behave according to these different metrics? That's all that I'm doing. There's no template-based recognition in this. >> The reference HMM, 63, is that ->> Dirk Van Compernolle: That's basically multi-Gaussian. That's a multi-Gaussian. >> It does the decoding and then you look at the labeling of the frames? >> Dirk Van Compernolle: Or is it just a [inaudible]. >> Okay. >> Dirk Van Compernolle: Now, look at another picture of this. Now look at the number of nearest neighbors and how many of them are correct. So basically when you do the classification, you're looking very much here, just at the nearest neighbor, the one nearest neighbor, or maybe the ten nearest neighbors or so -- or 20 nearest neighbors when you do this. So the dotted line is Euclidean. Now, this says when I look locally -- like 62 percent, so when I look at the first nearest neighbor, 62 percent of the time I'll be correct. When I look at my 500 nearest neighbors, then only something like 50 percent will be in class. So the further you look around, the more nearest neighbors you go look at, the more things will be out of class, of course, because the neighborhood gets bigger. And now how do I define close? If I define it Euclidean, then actually this drops quite fast. And if I don't do any data sharpening, this drops tremendously fast as I go and look far away.
Now, here we see things like the local Mahalanobis-based thing, or the adaptive kernel stuff. This does much better, and especially when I go and look far away. But that's indeed the HMM view: I'm looking at my whole class. And we know when we do experiments with this DTW we often get templates which are ranked 50, 100, 150. So we may be operating very much in this area here. But still, in this area here, Euclidean is doing better than the Mahalanobis-based kernels. It's only when I look very far away, when I really want to model the whole data, the whole class data, that I have to use my Mahalanobis-based kernel. And that's also what it is. Or that's the way we trained it. We use the class ->> [inaudible] doesn't make any sense [inaudible] depending on the scale of your data. >> Dirk Van Compernolle: The data has been scaled, has been LDA'd. >> [inaudible] >> Dirk Van Compernolle: Yes. >> Oh, I see. So effectively Euclidean is just a global Mahalanobis. >> Dirk Van Compernolle: It's a global -- it's been globally normalized data. But you see in nonparametric statistics, when you really have lots of points and you're looking locally, it seems you shouldn't do anything to try to say my space looks different. No, the spacing of your points does it for you. Don't do it twice. It's already in the spacing of the points. And as I was saying, my distance seems to be not so relevant. So we probably made mistakes in the beginning. We thought this was very important, but probably we didn't have enough examples. We had to look too far, and then we needed Mahalanobis-based distances. If you don't have to look far, if you have enough examples, if you're dealing with things that are relevant enough, this Euclidean distance seems to do the job. >> So the Euclidean distance is just a special case of these [inaudible]. >> Dirk Van Compernolle: Oh, yeah. So you don't do anything. You get rid of your covariance matrix. It goes faster and it goes better. >> [inaudible] >> Dirk Van Compernolle: Well, we trained it on the identity of the reference data. So we grouped the reference data into classes -- like 1,000 classes, we trained 1,000 different covariances. >> So I'm just [inaudible] confused -- the HMMs that you mentioned, they're just [inaudible] and they are doing better than this system, they are always at about 60. >> Dirk Van Compernolle: Yeah. >> And how [inaudible] from adaptive kernels using too many frames as nearest neighbors? >> Dirk Van Compernolle: Well, the things that are far away don't tell you much. >> Yeah, but [inaudible] do the same thing. >> Dirk Van Compernolle: The GMM is doing the same thing? Not exactly, no, no, no. >> [inaudible] and a GMM is a multi-modal thing, like multi-Gaussians ->> Dirk Van Compernolle: It will not -- GMMs will not be too -- or not highly discriminative, unless you train them discriminatively. They say, okay, here are the points of this class, and I have a pretty good membership function. So what are our conclusions after all this work? And as I said, we spent a lot of time on this. And, sadly enough, we come back to the simplest thing. No, there's one thing: we retain data sharpening. We retain it. It's been an improvement, and all the way at the end I'll show you speech recognition results on Wall Street Journal 0+1, a much bigger database yet, and it stays on track there. It's very relevant. Euclidean distance is as good as whatever else we can come up with.
We've tried much other stuff. And the adaptive kernel sometimes may give a little bit of improvement. But we basically stopped that work. It's a lot of effort, and we don't really see benefits from it. Okay. Now, what other problems do we have when we go to much bigger databases? So we started on [inaudible]. Well, we want to search through all of these templates. Well, we cannot search all these templates, because for each phone you might have a thousand, 10,000 examples, 100,000, a million examples in your database. So that's a problem. In our first work we did bottom-up template selection. We said, okay, we're going to work fully data driven, we're going to look at the data and try to come up with this. I'm not going to go into the details of this. It was a nice type of stuff. We used some roadmap in this to quickly search K nearest neighbors and then try to find nice trajectories in those. But it was very time-consuming, whatever way we did it. We needed dense graphs, really dense graphs, to do something good, and there were sometimes gaps in the graph that we couldn't overcome. So what we currently do almost all the time: we use an HMM to generate the phone graph. Well, fast -- it's not that fast, it's reasonably fast, but it's a very efficient way to do things. So how does it work? We have -- okay, I would like to switch this graph around. Because the -- so we first do an HMM pass. We do some phone decoding, or it might be a phone graph derived first from a word decoder. Doesn't really matter. But we get a phone graph like this, with timestamps in the nodes and labels and scores on the arcs. Then we say, now for each arc let's go and look for examples, very specific examples in the database. So every phone in the database has a unique ID. We put in these IDs, and so we now make a graph, not just with one example of this, but actually with many examples of that phone. And then we're going to rescore this template graph. So from a phone graph we make a template graph, then we do a decoding on this, and we might again recombine these things at the end. >> Can you talk a bit about [inaudible] similar [inaudible]? >> Dirk Van Compernolle: Well, no -- from the phone graph, for each arc we go and look in the database for all the templates with that label, and we do DTW on them and we just rank them and take the 200 best ones. So we have a much bigger graph. We have a graph with the same number of nodes, but we have 200 times as many arcs as the phone graph. And we do a new decoding on this, because going from one phone to another phone in our HMM-based system, there's no specific cost. But here we need to do a new decoding, because we have costs going from one phone to another. Because we have these template transition costs that we need to integrate. So the best phone sequence may not be the same in this and this. This will be different because we have this new set of transition costs that are applied at the template level. >> [inaudible] >> Dirk Van Compernolle: This one is not [inaudible]. I would like to switch it because you normally expect the first one to stand on the left side. Something about our phone graphs: they tend to be pretty good. For example, we can generate [inaudible] phoneme recognizer -- if you use a 4-gram phoneme recognizer you get like a .7 percent phone error rate. If you use ->> [inaudible] >> Dirk Van Compernolle: Um ->> The numbers and [inaudible].
>> Dirk Van Compernolle: That's versus an oracle. So we did a phone transcription with the HMM. We did forced alignment on Wall Street Journal, because there is no official phonetic transcription of Wall Street Journal, so we generated one. >> [inaudible] >> Dirk Van Compernolle: On the lattice. When we do it for -- oh, there's something wrong here. This should be Wall Street Journal 0+1, so a much bigger one. That goes to 20k, and there are out-of-vocabulary words. We will have graph error rates on the order of 2.7 percent. >> What does GER mean? >> Dirk Van Compernolle: Graph error rate. >> Oh. >> Dirk Van Compernolle: And that's -- I was discussing this with Geoff this morning. So we still have a reasonable density in this thing, like word densities of five words on average when you take a cut through the graph at any point. So it's not sparse. This is just meant to keep our search space manageable. Now, there are many more questions when you go to a larger database. I've already mentioned: if this thing is a template, pronounced correctly, [inaudible] pronunciation, I want to keep it; if it's bad I would like to throw it out. But if I have ten times the same thing, I probably want to throw it out too. Do I want to keep on keeping all the same stuff and the same stuff and the same stuff over and over again? So then there are other questions. We have this meta-information that we put in the concatenations. And this was something that was not obvious at all in the beginning either. You say, okay, I have a female template, I have a male template, and I would like to have male templates followed by male templates. That's normal, because I would have a male speaker and I do speaker detection or gender detection that way. On the other hand, I could say, well, but that's maybe not so good, because now I'm going to be using none of my female templates for a male speaker. If I do something like VTLN -- what's important? Is it important that the template is totally the same? But maybe I have a female speaker who said exactly the same sentence, so I get examples of words and word combinations that I don't have in my male speakers. Would I now like to match my male speaker with that example from the female speaker? Well, then I should do some speaker normalization. Because then with a smaller database I could get more efficient usage. Because, okay, what's really important is the phonetic context and the richness of that. >> [inaudible] you don't need to do that [inaudible]. >> Dirk Van Compernolle: Yes. >> [inaudible] >> Dirk Van Compernolle: Yes. But I like to use these bigger units. We've seen the longer the segments that we take -- so if we take multi-phone segments, our results are better than when we have single-phone segments. So I would like as many quinphones, true quinphones, to be present in my database as possible. The more the better. The more likely I'm going to be doing well. >> But after you use longer units, do you have to redo this phone graph, or do you [inaudible]? >> Dirk Van Compernolle: No, we don't. It's all -- this is implicit. No, we don't. Because it's implicit. Well, we've done things with explicit longer units too. But ->> [inaudible] >> Dirk Van Compernolle: In the decoding you do it. >> Geoffrey Zweig: Just a note on time. About ten more minutes. >> Dirk Van Compernolle: Yeah, I know. But we're getting to the end. So there's a dichotomy here. So what do I want to do? Do I want to keep very specific -- another thing is, how about noise?
Do I want to get rid of the noise or do I want to duplicate my database for all different noise scenarios? Instead of multi-style training, like in the noisy situations, do I want to do multi-style or do I want to keep every different noise condition, because then I don't have to make bad assumptions about anything? Well, whatever we want to do, we would like to get our database smaller, because it will go faster. We don't want redundancy. But we also want to maximally exploit our database. The thing we never want to do is do this on a template-type -- on a unit-by-unit basis, because we don't want to introduce gaps in the original data. We at least always want to keep full sentences. We never want to say, okay, I will store them as phones and I will only keep the 100,000 most representative A's and the 10,000 most representative B's. No, that's not what I want to do. So what have we done? We've moved to context-dependent units. Very similar as in HMMs, except for one important difference: it's a longer unit, because in HMMs we typically do decision-tree clustering of states, we make a phone out of three states, and we reuse states across context-dependent phones. We don't do this here. We just do a clustering where the whole segment has to be unique to the class. But we use our same software for this. And we've also gone and put VTLN in there, which was not there in the beginning, because we thought, well, we like male things to be different from female things. What are our results on that? Some things on VTLN. If we don't use VTLN -- these are the Wall Street Journal 0 results, it's only ten speakers, but I think it's representative. Look at gender matching, the number of templates with matching gender. If we don't use VTLN, you might get an average of 75 percent. If you use VTLN, it's only 60 percent. Meaning that's counted by the speakers that are used in the templates. It's not a weighted average. It's just which speakers of a certain gender are used. So this means we are going to be using more female speech for male speakers, more male speech for female speakers. So the number of speakers that ultimately appear one way or another for a certain speaker goes up from 210 to 230. Not that much. But so by doing it we do cross-gender matching. And our results go up. So the thing is, covering our set of context-dependent variability is better than keeping this gender thing. So we'd better normalize according to speaker; it's really this pronunciation variability, the context dependency, that's important, and we need as many examples of this as possible. Then we did quite some work on trying to do this cleanup of the database, where we try to sort things into good, bad, and redundant. So far we've just done this, and that's mainly for computational reasons, on a speaker basis. So we try to find speakers which are good, that we really need in our database, ones that are bad, that cause many errors, and ones that are redundant, that we can throw away without changing the performance. This works reasonably well. You can throw away almost 50 percent. I don't have the graphs -- I apologize for that, as we do have them. But you can compact the database quite a bit, throwing speakers away with almost no degradation. But there's always a minimal degradation. So in a sense it's not a very happy thing. We're not very happy with the current results.
So it's very possible -- so there are speakers who really are, on average, quite redundant. We'll probably get rid of those, but the really bad speakers we cannot find, otherwise the performance should go up. So we really have to redo this whole thing on a sentence-by-sentence basis instead of a speaker-by-speaker basis. Okay. This slide is a little bit out of proportion. That's where we did the template activation for that. Sorry. I didn't -- some further optimizations we did: silence, actually, we don't put into the DTW system anymore; we just do a straight mapping from the HMM score. There's a very high correlation between the two, which makes sense because of the similarity between the two systems. And this speeds things up -- it gave us roughly a 10 percent improvement -- and we sometimes use hybrid systems. So let's, to conclude, look at some more results. [inaudible] larger and larger databases. Here is the template type, context independent or context dependent, and VTLN, and whether we used data sharpening. And I've included here one more result where we did not use data sharpening, on Wall Street Journal 0+1. When you keep all the other stuff the same and you don't use the data sharpening, you have 11.4 percent. When you do the data sharpening it's 9.2, a 20 percent relative difference. So it's huge on this stuff. We don't -- >> So not being hybrid, that -- >> Dirk Van Compernolle: That means it's just the DTW. >> The final score is based just on the DTW scores. >> Dirk Van Compernolle: Yeah. >> On top of the phone lattice. >> Dirk Van Compernolle: On Resource Management, the results we have are significantly better than the HMM alone. On Wall Street Journal 1, they're not; they're very close to the single system. But look at what happens when you go from context independent to context dependent with VTLN -- what a huge difference you get. What has happened by going to these context-dependent templates and to this VTLN is that a lot of the information that was in the transition costs -- like, we used to have a cost for gender transitions -- well, after we do VTLN, the gender transition cost we really don't need anymore. It goes away. The bigger the databases we have and the more context-dependent environments we have, the more consistently the plain Euclidean distance is enough -- >> [inaudible] this whole table. Like, can you explain what your template stuff is, and is there an HMM baseline anywhere? >> Dirk Van Compernolle: There is -- only in the hybrid system. That's where we combine the scores of HMMs and DTW. All the rest, when there's a "no", that's -- though I don't think -- not in these -- in these ones here there's no HMM involved at all, except for the original labeling of the training database. >> We do have HMM baselines for these. >> Dirk Van Compernolle: We have HMM baselines for these. These are around -- for Resource Management 2.8, I guess. Here it's 3 percent, here it's 7.5. So we're not at the HMM -- at our HMM yet. >> Is that -- >> Dirk Van Compernolle: But on the -- like on this thing -- around here, I don't know what's going wrong with the merging of the things, because there's like a 25 percent difference in the errors that they make. So they do make different errors. So that tells us, from the merging, that there must be more to be gained. >> From the Wall Street Journal experiments, you first use HMMs to produce phone graphs -- test phone graphs -- and then [inaudible]. >> Dirk Van Compernolle: Yeah. That we do.
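Data sharpening makes the single largest difference in that comparison (11.4 versus 9.2 percent), so a rough illustration may help. One common formulation of data sharpening, and it is only an assumption that this matches the recipe used here, pulls each training frame part of the way toward the mean of its nearest neighbours carrying the same phone label, so that outlying frames move back toward the bulk of their class before they are used as templates.

# Illustrative sketch of one k-NN style data-sharpening step; parameter values
# and the exact formulation are assumptions, not the recipe used in the talk.
import numpy as np

def sharpen(frames, labels, k=5, step=0.5):
    """Move each feature frame `step` of the way toward the mean of its k nearest
    neighbours with the same label.  Brute-force O(N^2) per class; sketch only."""
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    out = frames.copy()
    for lbl in np.unique(labels):
        idx = np.where(labels == lbl)[0]
        if len(idx) < 2:
            continue                          # nothing to sharpen against
        X = frames[idx]
        k_eff = min(k, len(idx) - 1)
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)       # a frame is not its own neighbour
        nn = np.argsort(dists, axis=1)[:, :k_eff]
        local_mean = X[nn].mean(axis=1)       # mean of each frame's nearest same-class neighbours
        out[idx] = (1.0 - step) * X + step * local_mean
    return out

# Toy usage: 100 random 39-dimensional frames with two phone labels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 39))
labels = rng.choice(["ay", "f"], size=100)
sharpened = sharpen(frames, labels, k=5, step=0.5)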
So our results are getting better and better. The first runs we did on Wall Street Journal 0+1, half a year ago, we had 20 percent, I guess. So doing something new and something bigger is always a challenge, and it's often in the small points; but this one has been a big one. So there are a few things which are consistent over all the stuff we do. But a slightly annoying part is that for many of the things that we've looked at, we ultimately come to the conclusion that we can leave them out of the system, and that the simpler system with the Euclidean distance works better. We still have the natural success across which place something. But once we've moved to context-dependent templates, any other contextual information doesn't help anymore, so the phoneme-context transition costs, they're also gone. So the story is a bit mixed on this. The results get better and better, but some of the things that we thought would really help, or some of the differences between this nonparametric and the more parametric stuff -- well, the big gains are in things which are very close to HMMs, which is saying it's really this pronunciation variability which is the big problem in speech. >> So everything here uses a phone graph -- >> Dirk Van Compernolle: We all used phone graphs -- but the first three ones, if I remember right, were created -- no, they did not use phone graphs, only template graphs, and those template graphs were generated in a bottom-up fashion. All the Wall Street Journal experiments use phone graphs to start from. >> So most of the computation, as I can see, in the decoding is for the translation from the phone graph into the [inaudible], is it? >> Dirk Van Compernolle: Yes. Yeah. All the DTW computations that you have there. >> So I have a couple of questions. >> Dirk Van Compernolle: Sure. >> So the first question: [inaudible] sharpening, when you use it on HMMs, does it help the same, like you were saying, or does it degrade? >> Dirk Van Compernolle: It degrades significantly. >> It does make sense [inaudible] do good for a number of other things, but for HMMs I thought that it's a lot -- it's a lot of difference. >> Dirk Van Compernolle: Well, if it just pulled in bad outliers, it should do good. >> All right. >> Dirk Van Compernolle: So it does do other things than that. It's not just cleaning up your database. >> So the [inaudible] question, the data compression thing that you did, is it published [inaudible] speech or -- >> Dirk Van Compernolle: Which one? >> The data compression procedure that you used to throw out the bad speakers -- >> Dirk Van Compernolle: That hasn't been published yet. But that's likely to be submitted to Speech this year. >> Geoffrey Zweig: We are just about out of time. Any other things you want to mention? >> Dirk Van Compernolle: I think I'm pretty much at the end of my [inaudible] what I have to say. >> Geoffrey Zweig: All right. Let's thank the speaker. [applause] >> Dirk Van Compernolle: Thank you.