>> Dong Yu: Today, we are glad to have Chao Weng from the Electrical and Computer Engineering Department at Georgia Tech come here to give a talk. He has been an intern at AT&T Labs and Microsoft Research in the past. Today, he will talk about his work on robust speech recognition and understanding. >> Chao Weng: Okay. Thanks for the introduction, and thanks for coming to my talk. I'm very honored to be here again to give this talk on what I have been doing for my PhD dissertation. The topic of my talk is Towards Robust Conversational Speech Recognition and Understanding. This is the outline of my talk. In the introduction section, I will explain the challenges that we are faced with, and then I will explain the motivations. In the second section, I will elaborate on the proposed research, one by one, in four separate parts. And in the last section, I will give the system overview of the whole proposed research and list the remaining work that needs to be done towards the final system. So first, the introduction section. While significant progress has been made in ASR, conversational speech recognition and understanding remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is really ungrammatical and unstructured. So while you can expect over 90% word accuracy on a large-vocabulary task with a dictation speaking style, say, Wall Street Journal, this word accuracy degrades dramatically for a spontaneous speech task, say, Switchboard. These figures show the ASR benchmarks in terms of word error rate for the DARPA-sponsored tasks. We can see that for read speech, both for the Resource Management 1,000-word task and for the Wall Street Journal 5,000- and 20,000-word tasks, we can already bring the word error rate below 10%, but for the conversational speech tasks, the word error rate is still relatively high. Another challenge in recognizing and understanding conversational speech comes from adverse acoustic environments: additive noise, channel distortions and the Lombard effect all lead to a severe mismatch between training and testing data. One of the most challenging unsolved problems is the cocktail party problem, in which one needs to focus auditory attention on one conversation out of many and try to recognize it. Actually, this can be viewed as a specific case of adverse acoustic environments, which means we want to recognize and understand speech in the presence of competing talkers. Apart from the speech recognition part, for the speech understanding here, the general speech understanding problem could be very complicated. Specifically, in this work, we mean that we are trying to extract semantic notions from given conversational speech, and even with perfect ASR this can be challenging due to the interference of fillers, mispronounced words and disfluencies. With all these challenges, we will mainly use two techniques to attack them. The first is WFSTs, weighted finite-state transducers. In this framework, we have a lot of powerful algorithms to take advantage of, and we will see how we can do the lattice generation for [indiscernible] training later on. Also in this framework, the recognition part and the understanding part can be seamlessly integrated with each other. The second technique is, of course, deep neural networks.
We know that the recent success of deep neural networks for ASR has opened a lot of possibilities to further improve system performance. So this is speech recognition and understanding from the WFST view: starting from very low-level speech signals, in the end we want to extract certain semantic notions. For now, I want to say that in this work we will mainly work on two parts, which are highlighted in the figure. The first part is the acoustic models, and the second part is what kind of transducers, or what kind of module, we need to map the output from ASR to certain semantic notions. So the first motivation is that the word error rate for full transcription of conversational speech is relatively high, and observing the fact that the relevant semantic notions are often already embedded in a set of keywords, we want to do better acoustic modeling for keyword spotting for speech understanding. The second motivation is that adverse acoustic environments degrade performance substantially, and we want to utilize DNNs to further improve the noise robustness of the system. And the third motivation is that, since conversational speech is ungrammatical and ill-structured, we need a kind of semantic decoder that can seamlessly interface with ASR outputs, and since we are working in the WFST framework, we need a robust WFST-based semantic decoder. So the final objective is that we want to build a conversational speech recognition and understanding system which is robust to various acoustic environments, and with the three motivations we mentioned, we have three goals to accomplish. The first is that we want to propose a model-training methodology for keyword spotting for speech understanding. The second goal is that we want to propose DNN-based acoustic models that are robust to additive noise, channel distortions and the interference from competing talkers. And the third goal is that we want to propose a robust WFST-based semantic decoder that can be seamlessly coupled with ASR. So, the second part. Remember, the first goal is that we want to propose an acoustic model training methodology for keyword spotting for conversational speech, and there has been a lot of previous work on keyword spotting, which I list here. Among them, the so-called two-stage approach, with the benefit of various LVCSR techniques, delivers decent performance: in the first stage, we just do LVCSR, and in the second stage, we verify and detect the keywords from the output of LVCSR. The drawback of this approach is that the two stages are designed under different criteria, and LVCSR systems are trained to minimize word error rate in general, without placing any emphasis on keywords. Discriminative training is a general technique to boost a speech recognition system, and discriminative training using non-uniform criteria seems a good candidate to address these limitations. In this kind of method, we want to minimize a non-uniform error cost according to various error types rather than just a 0-1 loss. So following the same spirit, we formulate non-uniform MCE for keyword spotting on conversational speech. For certain training frames, we define the frame discriminative function in this form, which has two parts: the first is a state posterior and the second is a frame likelihood.
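A rough LaTeX sketch of the shape of this objective (one plausible rendering consistent with the description here, not necessarily the exact formulation from the dissertation): the frame-level discriminant combines a state log-posterior with a frame log-likelihood, and the non-uniform MCE loss weights the usual sigmoid-smoothed misclassification measure with a frame-level error cost $\varepsilon(t)$:

$$g_t(W)=\log\gamma_t\!\big(s_t^{W}\big)+\kappa\,\log p\big(o_t\mid s_t^{W}\big)$$

$$d_t=-\,g_t(W_{\mathrm{ref}})+\log\Big[\tfrac{1}{|\mathcal{W}|}\sum_{W\neq W_{\mathrm{ref}}}e^{\eta\,g_t(W)}\Big]^{1/\eta},\qquad L_{\mathrm{nMCE}}=\sum_t \varepsilon(t)\,\frac{1}{1+e^{-\alpha d_t}}$$

Here $\gamma_t(\cdot)$ is the state posterior, $s_t^{W}$ is the state aligned to frame $t$ under word string $W$, and $\varepsilon(t)$ is large when frame $t$ falls within a keyword.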
And then we can formulate the non-uniform MCE in this form. First, we have the classification error here, and then, depending on whether the keywords appear in the reference word transcription or in the hypothesis space, we embed a frame-level error cost. >>: I'm sorry, can you just go over that a little more, maybe starting with the first one. So the frame discriminative function, you have the state posterior there. W is a word string. >> Chao Weng: Yes, a word string, yes. >>: And then that's the frame likelihood, okay. >> Chao Weng: The second part is the -- yes. Yes. >>: So this is at the frame level. >> Chao Weng: A discriminative function. >>: Function. What makes it discriminative? >> Chao Weng: The discrimination is here. We have it for the reference word transcription, and for the whole hypothesis space we also have a discriminative function. >>: Oh, I see, so this is basically -- so this is straight MCE training. That's what you're using to discriminate between two different hypotheses. >> Chao Weng: Yes. >>: Or that's the training criterion in there. >> Chao Weng: Yes. >>: So the state posterior is [indiscernible]. The state posterior is the fractional count of this particular state? Okay. This one is compared to the regular [indiscernible]. >> Chao Weng: Yes, yes. >>: But normally, the standard MCE doesn't have the state posterior term. It's only for the -- >> Chao Weng: Only have a what? >>: It's only the likelihood term, the standard MCE. >> Chao Weng: Yes, yes, the standard one. >>: So you put it there for the sake of? >> Chao Weng: Because we want to -- one thing is we want to embed the error cost at the frame level, and in the conventional MCE case, you cannot, because it's based on the whole word string, right? >>: So the training level [indiscernible]. >>: So you can do both the state posterior or either one [indiscernible], so basically you've got it corresponding to a case -- >> Chao Weng: Yes, yes. Yes. For the reference word transcription, yes, exactly. But for the hypothesis space, you would not have only 1 or 0, right? >>: So is what you're doing here somehow summing over the different alignments of the frames to the hypothesis? >> Chao Weng: You mean here? >>: The point of the state posterior. >> Chao Weng: Yes, with the same state, the same state index. Yes. >>: So what makes it non-uniform here? >> Chao Weng: Non-uniform is the error cost. >>: Oh, it's -- >> Chao Weng: [Indiscernible]. >>: Oh, okay, you talk about it on the next slide. >> Chao Weng: Yes. >>: So that's the same as the one that [Fuchun] did in his dissertation. >> Chao Weng: There is a big difference that I will describe later on. >>: It's different? It's different or the same as -- >> Chao Weng: No, it's not the same. It's different, yes. >>: I think -- maybe this is Jeff's question, but is there a place where you sum over state alignments for the correct keyword? Like if I see a state posterior, given the word, and the word is the numerator that you're trying to -- >> Chao Weng: You mean here? >>: Yes. >> Chao Weng: You mean, do we sum over all the -- >>: Do you sum over all the -- >> Chao Weng: No, we just do alignment. >>: Just the best path? >> Chao Weng: Best path, yes. >>: So then why is -- but the state posteriors used are not binary. They're not indicators. >> Chao Weng: In this case, we have different ones -- in the hypothesis space, right? >>: And can you just go over exactly what's being summed there in that triple sum? >> Chao Weng: Here? >>: Right. So J is -- what is J?
Oh, J is down at the bottom. >> Chao Weng: Based on the -- >>: Oh, J is a set -- okay. >> Chao Weng: Yes. We have different ones. >>: So for all the correct -- >> Chao Weng: Yes. So this is -- >>: It's a class, right? It's a class label. >> Chao Weng: Yes. It's a class label, yes. Exactly. >>: Okay, so you can have that small number -- >> Chao Weng: So J is more like the class labels from the reference transcription, and here, I is the labels from the hypothesis space. >>: I see. Are labels different from words? What's the difference between a label and a word? >>: The label here is a state ID. >>: Oh, okay. >> Chao Weng: Yes, but here it may be a little bit confusing. Here, to decide whether I belongs to keywords or non-keywords, we just look at the alignments -- whether a certain state index falls within a certain keyword's time boundary. >>: So you're somehow going to try to improve the likelihood of hypotheses that have a lot of overlap between states that are assigned to keywords and the true alignment of the keywords to the audio. Is that what's going on? >> Chao Weng: Yes. Where the keywords appear, we're going to train more, actually. I will show that later on. Yes. >>: Do you need to have [invest]? >> Chao Weng: Oh, we're doing it on the lattice. >>: On the lattice. >> Chao Weng: Yes. Yes, and if we define an error cost function here, epsilon of T, we can simplify the overall objective function into this form. If we take the gradient of this objective function and compare it to the regular MCE, we can see there are maybe two main differences. The first is that this term corresponds to the derivative of the sigmoid function, and for non-uniform MCE this value changes frame by frame, while in regular MCE it is actually a constant value over the whole training utterance; for a more efficient implementation, we simply fold this difference into the gradient computation. The second difference is that we embed the frame-level error cost here. And we can see that non-uniform MCE, after approximation, is equivalent to MCE on an artificially resampled training set, where the training set is resampled according to the error cost function epsilon of T. Another issue in doing non-uniform MCE training is that the competing hypotheses need to exclude the reference word transcription, and naively removing the arcs corresponding to the reference words will definitely hurt the topological structure of the lattice. This can, of course, be circumvented by subtracting the corresponding statistics from both the numerator and denominator posteriors, but in the WFST framework we can come up with a more elegant solution, which takes advantage of the WFST difference operation. So recall that a regular WFST decoded lattice is essentially a pruned version of the whole decoding graph, and the whole decoding graph for a certain speech utterance can be expressed in this form. We know that the input labels of this pruned version are actually transition identifiers, which apparently are not a valid operand for the difference operation, because we want to exclude the word-level transcription. But there is a compact WFST-based lattice definition, which has already become the standard lattice-generation [indiscernible] encoding.
So in this kind of lattice format, both input and output labels are words, and the acoustic model and language model scores and the state alignments are encoded in the weights using this special semiring. With this special lattice definition, the lattice generation for non-uniform MCE training can be summarized in three steps. We first invert the pruned portion of the whole decoding graph and encode the inverted portion using this special semiring. Then, we do some graph optimization with epsilon removal and determinization. Then, we compile the reference transcription into a WFST, and we simply do the WFST difference, and we get the lattice for non-uniform MCE training. >>: Excuse me, so the C is the scores. >> Chao Weng: Here? Yes, the C is the scores. S is the state alignment. And, as mentioned earlier, non-uniform MCE is actually equivalent to employing regular MCE on a resampled training set, where the training set is reweighted by the error cost function epsilon of T, so boosting-based techniques can be applied here naturally. In this kind of boosting method, a class corresponds to an acoustic tied state, or senone. For each class, we define the class-specific error cost, which means we first look at the state posteriors derived from the lattice we just generated. We accumulate the posteriors whose state index is different from the reference state index, and if this value is beyond 0.5, we consider that frame a mistake. Weighting by the error cost function, we get the class-specific error cost. >>: So you don't use that kind of AdaBoost formulation explanation. >> Chao Weng: Yes, we do use it, actually, in explaining the whole algorithm. And then, with this empirical error cost, we will use that to decide which models, or which iterations, we are going to use to generate acoustic scores during decoding. This is the whole algorithm: first we have the error cost function initialization here, then we collect the state posteriors, both from the reference transcription and from the lattice we just generated, and with the class-specific empirical error cost, we update the error cost function in the AdaBoost way. With the updated error cost function, we can train the new iteration of models using non-uniform MCE. And with the class-specific empirical error cost, we can determine which models we use among all iterations to generate the acoustic scores during decoding. >>: I'm sorry. So I understand -- I can parse boosted non-uniform MCE, but what's the adaptive part, again? >> Chao Weng: Adaptive, because in our original work we just made this fixed. And, actually, the boosted here -- it's a bit confusing -- we don't mean the boosting. Here, as I will explain later, boosted means that we used a certain kind of trick when we did the extended [indiscernible]. >>: Oh, it's that. >> Chao Weng: Which is the same trick as in boosted MMI, which is canceling the -- >>: So it's not about error boost -- >> Chao Weng: Yes, the term is a bit confusing, but yes. Yes, we do the experiments on two tasks. The first is Switchboard, and the other is HKUST Mandarin, which is essentially just a Mandarin counterpart of Switchboard. And then the baseline speech-recognition system is kind of straightforward.
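A rough Python sketch of the AdaBoost-style error-cost update described above. All names, shapes and the exact update rule here are illustrative assumptions, not the actual implementation:

import numpy as np

# ref[t] is the reference senone at frame t; post[t] is a dict mapping senone
# id -> accumulated lattice posterior at frame t; eps is the array of current
# frame-level error costs.

def class_empirical_error(ref, post, eps, num_senones):
    err = np.zeros(num_senones)
    tot = np.zeros(num_senones)
    for t, s_ref in enumerate(ref):
        wrong_mass = sum(p for s, p in post[t].items() if s != s_ref)
        tot[s_ref] += eps[t]
        if wrong_mass > 0.5:               # frame counted as a mistake
            err[s_ref] += eps[t]
    return np.where(tot > 0, err / np.maximum(tot, 1e-12), 0.0)

def adaboost_update(ref, post, eps, num_senones):
    e = np.clip(class_empirical_error(ref, post, eps, num_senones), 1e-8, 1 - 1e-8)
    alpha = 0.5 * np.log((1.0 - e) / e)    # per-class AdaBoost-style confidence
    new_eps = np.array([
        eps[t] * np.exp(alpha[s_ref]       # up-weight frames that were mistaken
                        if sum(p for s, p in post[t].items() if s != s_ref) > 0.5
                        else -alpha[s_ref])
        for t, s_ref in enumerate(ref)])
    return new_eps * (len(new_eps) / new_eps.sum())   # renormalize the costs

The updated costs would then drive the next non-uniform MCE iteration, as in the algorithm described above.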
So we first do the speech recognition experiments, and we can see that in this case MCE got the best result here compared to the other standard discriminative training methods. >>: This [indiscernible]. >> Chao Weng: Yes, it's non -- the test set is [indiscernible] English. >>: So the result is on [indiscernible]. >> Chao Weng: Yes. And then we do the keyword spotting experiments on the credit card subset of Switchboard, and here, boosted -- yes, we just mean the extended Baum-Welch trick. Yes. >>: I'm a little confused as to, if you're talking about -- you have to bear with me, because I'm not a speech recognition person, but if you're talking about conversational speech, why are we talking about keywords? Because I imagine the keywords part might be extremely important in a command and control or that kind of situation, but if you're talking about a conversational situation, then I'm not sure how keywords come into it. >> Chao Weng: Yes. I think it's pretty much in the motivation. Conversational speech usually has a very high word error rate if you're talking about full transcription, and if we just want to get a rough idea of what this conversational speech is talking about, maybe it's enough just to recognize those keywords, such that we can -- >>: Do you have a predefined list of keywords? >>: Think about the NSA. >>: The NSA and the DOD would be very happy to [indiscernible]. >>: Okay. >>: A small list of -- >> Chao Weng: Yes, and we can see the proposed system -- >>: Bomb. >> Chao Weng: -- got quite significant performance gains in terms of figure of merit, and these are the ROC curves of the three systems. And then we do the experiments on HKUST Mandarin. First is still the speech recognition experiment, and we can see that, in this case, boosted MMI gets the best results in terms of the general syllable error rates. And then we also do the keyword spotting experiments, and we can see that the proposed method also gets significant improvements. >>: I'm trying to understand, something like MLE means you just decode it, and then you look for instances of those keywords in the decoded speech. >> Chao Weng: Yes, yes, exactly. Yes. Yes? >>: Is there something about this database? I've heard of it, but I don't know the history. >> Chao Weng: It's about 150 hours of Mandarin speech, and it has different topics introduced for each conversation. >>: What is the purpose of that database? >> Chao Weng: Perplexity? >>: No, no, what's the purpose of that database? >> Chao Weng: Oh, the purpose I think is kind of similar to Switchboard, but it's in Mandarin. And the thing is, it's also kind of a bilingual thing. It's spontaneous speech, and it also has English words, so you have to prepare a bilingual lexicon. >>: So you think about it as a Chinese version of Switchboard. >> Chao Weng: Yes, I think about it that way. Yes. >>: Sorry, go ahead. >>: So a particular [indiscernible] group did this? >> Chao Weng: This is released by the LDC, yes. >>: So is this F measure, or what is the table that the numbers go to? >> Chao Weng: Oh, you mean this one? Figure of merit, which you can think of as kind of an upper bound of the -- how to say? Yes, I think I can explain: it's the percentage of hits averaged over one to ten false alarms per keyword per hour. >>: All of the other things are just straight-up acoustic modeling and the standard -- >> Chao Weng: You mean here? >>: Yes. None of those have anything to do with the keywords? >> Chao Weng: Yes, no.
Just the standard discriminative training, yes. >>: So the fifth word [indiscernible]. Do you have any idea how these systems would compare on a new keyword? >> Chao Weng: A new keyword? >>: So like, the advantage of the first ones is you can take the same acoustic -- >> Chao Weng: Okay, you mean once we change the keyword set. I'd rather do it in the feature space, because in that case we just learn a feature-space transform instead of retraining the models, right? Actually, we can do it in the feature space, with feature-space discriminative training. You mean if we want to change the keyword set, right? >>: Yes. >> Chao Weng: Then, once you have a new keyword set, that means we have to do the whole training from the beginning again. This is not flexible, but we can do it in the feature space, which means we only train a feature transform for each keyword set. Then we can decide which transform we use. Yes, fMPE, fMMI, this kind of stuff. >>: So earlier you mentioned something about boosted -- AdaBoost. You summarized all this. The empirical error, it looks a little bit like AdaBoost, in the formulas -- >> Chao Weng: Yes, the adaptive part actually comes from AdaBoost. >>: Do you have that one? In the earlier -- >> Chao Weng: This actually is the AdaBoost way, right, to update the error cost function. >>: But the formula of these different ones, the formula that was the explanation of minus alpha and alpha as a -- >> Chao Weng: Yes, you have to do certain adaptations, because in this case, we have thousands of [indiscernible]. You cannot just directly use AdaBoost. >>: You can use that formula, though? >> Chao Weng: Yes, we have to find a way that also works for large-vocabulary speech recognition systems. >>: So that probably [indiscernible]. >> Chao Weng: Yes, this part, and I think the formulation of the objective function is also a bit different. >>: That's nothing to do with the boosted MMI, when you talk about MMI. >> Chao Weng: No. Boosted MMI just means we boost the hypothesis space. Okay, so the first part of the second goal is that we want to propose a DNN-based acoustic model that is robust to additive noise and channel distortions, and there has been a lot of good work on this before, which I will not reiterate here. It has been shown [indiscernible] that with DNNs, state-of-the-art performance can be achieved on the Aurora-4 benchmark without multiple decoding passes and model adaptation. Meanwhile, recurrent neural networks have also been explored for noise robustness, but so far, I think they have only been used either as a front-end de-noiser or in a tandem setup; few if any have explored a hybrid recurrent deep neural network setup with reported gains on larger tasks where the language model matters during decoding. So in the following, we try to build a hybrid recurrent DNN system, and we believe that the richer context offered by the recurrent connections can facilitate more robust, deeper representations of noisy speech. This is the hybrid setup, which is kind of straightforward, I think. So the -- >>: Can I ask [indiscernible] go back and -- >> Chao Weng: Which one? >>: The one that was straightforward. >> Chao Weng: This one? >>: That one. So this is a feed-forward net, this is not -- >> Chao Weng: I'm just explaining the hybrid setup. >>: That's what it's for. Do you just set the [inaudible] get something out? >> Chao Weng: Our work means we're just adding -- so here. This is just the same as this one.
We're adding recurrent connections to certain hidden layers, which means it's just the same as this. For the forward propagation process, for efficient learning, we just do it in minibatch mode. We have a minibatch input, and then we just do the propagation with these equations. Once we reach the recurrent hidden layers, we have to do it frame by frame, because the recurrent input needs the feedback from the last training frame. Then, once we have enough outputs within the corresponding minibatch, we do the propagation in minibatch mode again. >>: So with recurrent language models compared to n-gram language models, with an n-gram language model, you take a big chunk of history, like two or three words of history, and then you use that to predict what the next word is going to be. >> Chao Weng: Yes. >>: For recurrent language models, you only look at one word at a time. It reads in one word, updates its hidden state and then predicts what the next word's going to be, reads in one word at a time. In your original picture, there was a big chunk of frames, a lot of context that was going in. >> Chao Weng: Yes. >>: Is it still using the same chunk of -- >> Chao Weng: Yes, the same, like we -- >>: -- like which model, where you're going down to one frame at a time. >> Chao Weng: Yes, we still use the context frames. >>: That is a good band -- >>: Yes, I understand. >> Chao Weng: Yes. And so, yes, for the language model part, although we just take one word, but the [indiscernible] we're going to stop in the -- yes. >>: So how many history -- how many frames in the history do you feed into the [indiscernible]. >> Chao Weng: Minibatch mode. I will explain later. Yes. Yes. For the backpropagation part, first, we need to have a loss function, which could be negative cross entropy, or we can use some discriminative training criteria. Then we take derivatives of the loss function with respect to X subscript N; here, the subscript means the index of the hidden layer. This derivative is the error signal we can backpropagate to the previous hidden layers. For the feed-forward hidden layers, the backpropagation can be done in minibatch mode, but once we reach the recurrent hidden layers, we have to do backpropagation through time, which I will detail later. >>: So is that a typo, or is it you are doing this type of recurrent -- the recurrence only comes from the output or from the hidden layers? So looking at the X, I would expect to have Xt and Xt minus one somewhere on the right-hand side. >> Chao Weng: You mean this equation? >>: Yes. >> Chao Weng: No, it has two portions, right? >>: Yes, so do you have the -- >> Chao Weng: This one is from the last frame [indiscernible], and this one is from the previous hidden layer. >>: Okay, so is the output -- >>: Why is the output of the hidden layer? >>: Oh, I see, not the output of -- I see. Okay, okay. >> Chao Weng: Okay, this one is after the nonlinearity, and this vector is before the nonlinearity. >>: So the output of the neural network for the target is denoted somewhere else. >> Chao Weng: Yes. >>: Okay. >> Chao Weng: And with the error vectors propagated to a certain hidden layer, we can just evaluate the gradients to update the weight matrix, and here the subscript means the index of the training frame. So here the columns one to capital M mean we just concatenate the corresponding error vectors into matrices.
And so when capital M is the total number of training frames, we are doing batch gradient descent, and if we only have one training frame, and we just update the weight matrix after seeing each individual frame, we are doing SGD; minibatch SGD is kind of a compromise between the two. And practically, a reasonable minibatch size makes all the matrices fit into GPU memory, which leads to a more efficient learning process. Yes, for the recurrent hidden layers, we have to do backpropagation through time, and actually, for the exact gradient, we would have to run this process back to the very beginning of the corresponding training [indiscernible]. But for minibatch backpropagation through time, we typically just truncate the whole process within each minibatch, which means we just stop the backpropagation process once we reach the boundary of the minibatch. We can see that in this case, for certain error signals, we actually need two error signals: one is from the upper hidden layer, in this equation, and the other is the error vector from the future frame, in this case, which means we have to do it frame by frame for this kind of minibatch backpropagation through time. So one effect is that, if we consider the error vectors for certain frames, say T plus one and T minus one, then within each minibatch, the error vectors corresponding to individual training frames are actually backpropagated through different numbers of time steps. Say the minibatch size is just three: this error is actually backpropagated through two time steps in this case, if we stop here, while for this case, we actually just backpropagate one time step. So for a more efficient backpropagation through time, we introduce the so-called truncated minibatch backpropagation through time, which means for each individual gradient, we truncate the backpropagation through time to a fixed number of time steps, and typically the number of time steps could be like four or five. One benefit of doing this is that we can change the order of summation when we evaluate the gradients of the recurrent weights. This part corresponds to the minibatch gradient, so this part and this part can both be evaluated in minibatch mode. >>: Yes, so let me clarify. So the minibatch size now is roughly, after each occasion each of the [indiscernible]. Normally, people do minibatches of a certain size because they can parallelize it. >> Chao Weng: No, minibatch -- so we take the minibatch as like 256, and for each individual online gradient, we're going to do four or five time steps, each individual. >>: That's just an approximation for the gradient. >> Chao Weng: Yes, because in the standard way, say, for the last frame of the 256, it's going to do backpropagation through time approximately maybe 255 steps, right, for the last frame. But in our case, we just do a fixed number of time steps, four or five. >>: So when you do the minibatch BPTT, the motivation of doing that, is it mainly for the parallel computing? >> Chao Weng: That's just one reason. >>: Otherwise, you would just use the whole sentence. Because you don't think about -- >> Chao Weng: Yes, that's just one reason. Another reason, as I mentioned, is that each error vector corresponding to each frame is backpropagated through different numbers of time steps, which means the frames kind of influence the gradients with different contributions.
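A minimal Python sketch of the truncated minibatch BPTT just described, assuming a single sigmoid recurrent layer, a fixed truncation depth K, and one weight update per minibatch; all names and shapes here are illustrative, not the actual implementation:

import numpy as np

def truncated_minibatch_bptt(x, h, err, W, R, K=5):
    """x: (T, D) inputs to the recurrent layer; h: (T, H) its sigmoid outputs;
    err: (T, H) error vectors arriving from the layer above; W: (H, D) input
    weights; R: (H, H) recurrent weights. Returns gradients accumulated over
    the whole minibatch, to be applied in a single update afterwards."""
    T = x.shape[0]
    dW, dR = np.zeros_like(W), np.zeros_like(R)
    for t in range(T):
        delta = err[t]
        # every frame's error is backpropagated through at most K time steps,
        # so all frames contribute to the gradient with equal depth
        for k in range(K):
            tau = t - k
            if tau < 0:
                break
            delta = delta * h[tau] * (1.0 - h[tau])   # sigmoid derivative
            dW += np.outer(delta, x[tau])
            if tau == 0:
                break
            dR += np.outer(delta, h[tau - 1])
            delta = R.T @ delta                       # pass error back one frame
    return dW, dR

In the plain minibatch BPTT variant described earlier, the inner loop would instead run back to the minibatch boundary, so later frames in the minibatch get deeper backpropagation than earlier ones.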
And the thing is, in the end, we can also see that this gives better empirical results. So we do the experiments on two data sets. The first is the CHiME challenge and the second is Aurora-4. The CHiME challenge is also a noise-robustness [indiscernible], I think released last year, and it is basically a reverberant and noisy version of the same Wall Street Journal data used in Aurora-4, with maybe more challenging SNR conditions. We get the first alignments from the GMM-HMM system as [indiscernible], and then we train the first DNN system with the alignments we got from the GMM system, and here we use 40-dimensional log Mel-scale filter banks as features. We also do generative pre-training, and the DNN has seven hidden layers, with each hidden layer having 2,046 hidden units. For the learning rate scheduling, we use 256 as the minibatch size and 0.008 as the initial learning rate, and we shrink it by half when the frame-accuracy improvement on the dev set is below 0.5%, and we stop when the frame-accuracy improvement is less than 0.1%. And, yes, with the first trained DNN system, we just do a realignment, and then with the new alignments generated from the DNN system, we train another DNN system. We do this iteratively until the gains from realignment become saturated. >>: Can I ask a question [indiscernible] again. So I'm not in this field. So you have to understand, I'm from programming languages, but I do have a question about this. It seems to me, based on this slide, that there's a lot of parameters in these systems in terms of things you could vary. How many parameters do you estimate there are in this particular problem, or this particular set of -- >>: Hyperparameters. >>: Whatever you call them. Free parameters. >> Chao Weng: So I think a lot of these parameters are almost -- how to say, fairly [indiscernible] parameters, like, say, the 256 minibatch. People typically use 256 or 512 as the minibatch size. This is not so critical in our experiments, actually. >>: Okay, I guess that's the question. I mean, there's two points. One is how many different parameters there are, and the second one is how sensitive are the results to the parameters? So if you could give me that sense, that would be great. >> Chao Weng: For me, the parameter that has a lot of influence -- I think this one cannot be too large, not beyond, say, 512, because in that case maybe it's not easy to use SGD to train a good neural network. And the second part is this, the initial learning rate, which of course is critical for SGD. For the others, you could just as easily shrink by 0.6 or 0.7; I tried that, and it didn't make much difference, actually. This just controls, when you approach the local minimum, what kind of speed you want to proceed with from a certain point. Yes, in the end, it just converges to almost the same point. >>: Are there many parameters that are suppressed here? >>: And this parameter business is a disease that's just endemic to speech recognition. >>: And neural networks. >>: Although I think Arul [ph] will not disagree with me if I say that it's a sin that's shared equally by machine translation. >>: In the end, for me -- and I have an analogy in the programming language space. Basically, in programming languages, you build on top of architectures, so the architecture has a lot of parameters, very similar to this, right?
And you can show a result and you can say this is 5% faster or 2% faster or whatever, given two different configurations, but in the end, it depends on the underlying architecture and the assumptions you make about it. And a lot of times, in fact, in our field, many results that seem to be improvements actually are not, because the underlying assumption is that all these parameters are fixed a certain way. But if you change them just a little bit, you don't get your 2% improvement, you actually get a decrease. So I'm just curious if the same thing happens in this space and how you resist that. >> Chao Weng: Yes, I think there are two purposes why I list all the parameters here. The first is that we want people to be able to redo the experiments, so they can do the experiments and get similar results; that's the first thing. And the second thing is, I think, from my experience, some parameters are more like a zero-one thing. It's not that this setting is better or that one is better; if you change this to, say, 1,000, maybe your system cannot even be trained in the end. It's kind of a zero-one thing. >>: But just looking at that number, Dong probably would know better. In our old code for DNN, for a 256 minibatch size the initial parameter is 0.2, and after it goes through a few iterations, it jumps up to 0.0, so -- >> Chao Weng: The thing is, are you talking about the minibatch rate? >>: So this is the initial number, this is -- >> Chao Weng: Per sample, right? In this case, I'm talking about per minibatch. Yes. >>: And then just another [indiscernible]. >> Chao Weng: Yes, we just see the frame accuracy on the dev set, and then we shrink the learning rate accordingly. >>: [Indiscernible]. >>: So it is using a seeking [ph] mechanism, where each time it tops -- >>: Oh, so it's -- >> Chao Weng: Yes, exactly. >>: So that's a slower thing than [indiscernible]. >> Chao Weng: Yes, this is the DNN system we have trained, and one, two, three corresponds to where we did realignment and trained the system again and again. The best result we get is 29.89% word error rate, which is actually the baseline system we compare the recurrent deep neural networks to. Yes, we use the same realignments as the best DNN system, and for the parameter initialization of the feed-forward layers, we just copy the weights from the DNN training after five epochs, and for the recurrent parameters, we just do random initialization. Because we copy the weights from five training epochs, we kind of shrink the initial learning rate to 0.004. And we do both the minibatch backpropagation through time and the introduced truncated minibatch backpropagation through time. This is the best DNN system we have, and for the first recurrent deep neural network system, we update the recurrent weights using the standard minibatch backpropagation through time, and actually, we cannot get much gain in this case. But once we use the introduced truncated minibatch backpropagation through time, we can get almost 2% absolute word error rate reduction. And here, the one, two, three, four systems do not actually correspond to realignments; they correspond to adding the recurrent connections to different hidden layers. And so, yes, I think maybe you know the state-of-the-art system on this task, reported last year from MERL, and the best result reported is 26.86, but they get this result using discriminatively trained language models and also MBR decoding.
And if we use the same language models, we almost get state-of-the-art performance, but we haven't done any model or speaker adaptive training. >>: So what method did [indiscernible] use to get that result? What method did they use? >> Chao Weng: They used kind of discriminative training, both in the GMM, yes. Yes. >>: But it looks like you end up with 22.8, which is a lot better. >> Chao Weng: No, no, I will explain here. I will explain how we achieve these two systems. For these two systems, we assume that we have the clean version of the noisy speech, so there is a data assumption; the other parts are almost the same, except that we use alignments on the clean speech as labels when we train the DNNs. >>: So in that case the MERL system can do the same thing, or they can splice it or whatever with the stereo data, and they're probably going to improve as well. If you make use of that stereo data -- >> Chao Weng: Yes, but in that challenge, I think this is kind of prohibited. You cannot make the stereo data assumption. >>: So this [indiscernible] indicates that the alignment of the frames is so critical. >> Chao Weng: Yes, yes, that's why I show this. Yes, yes, yes. So you can see not much differences, right, compared to when we only have -- >>: So can you stop for a second. Can you solve this problem? >> Chao Weng: I haven't tried. That's in my future work, actually. And also, in this case, our DNN 5 system also got improvements. >>: So can you explain why this can have better performance than the standard minibatch -- >> Chao Weng: Yes, so the first thing, I think the main thing, is that in the standard minibatch backpropagation through time, the error vectors evaluated at different frames are actually backpropagated through different numbers of time steps, which means the corresponding individual training frames have different contributions to the final gradients. But in our system, for each individual training frame, we use a fixed number of time steps for backpropagation through time, which means the frames have almost equal contributions to the final gradients. >>: So [indiscernible] you don't -- so there is a language model where sometimes we do [indiscernible] every word. Once you have a new word observation, you do BPTT. >> Chao Weng: Yes. >>: And for the next word, you also do BPTT, so you do BPTT and also you update the -- >> Chao Weng: Yes, that's the online way. Yes, I think that's pretty standard in the recurrent neural network language models, yes. >>: But in the minibatch, in the truncated minibatch, the minibatch you mentioned is -- you just accumulate on it. >> Chao Weng: Yes. We don't update right after each frame. We have to accumulate within each minibatch. Then we update once. Yes. >>: So it looks like the take-home message here is that you get 2% improvement when you go from a feed-forward network to a recurrent neural net, and that's a little less than 10% relative improvement. >> Chao Weng: Yes. >>: But it's pretty good. >> Chao Weng: I think maybe for this task, 2% absolute, I think, is already good, yes. >>: Well, the [indiscernible] GMM is the standard here. >> Chao Weng: Yes, but I think -- >>: That's from [indiscernible]. >>: I would say performance is the same, I don't know [indiscernible]. >>: And also adaptation. >>: But not [indiscernible]. >>: Also sequence-level training.
>> Chao Weng: So if you read that paper, obviously, that system is much, much more highly tuned, and -- >>: So is everything on it. >> Chao Weng: Everything -- I think, in my view, they use almost every state-of-the-art technique within Kaldi, almost everything, I think. And -- >>: But [indiscernible]. I mean, the Kaldi, why do they have this deep-learning model in there, right? >> Chao Weng: They already have it. They already have it. This is done in Kaldi, yes. >>: So the MERL system also used Kaldi, or -- >> Chao Weng: Yes, yes. They used all the Kaldi techniques, and I think -- actually, I talked to Shinji [ph], and they actually have DNN results, if you see the papers. Actually, they are much worse than these numbers. Yes. >>: So what's the -- is there something here that shows -- the table says minibatch versus truncated minibatch. I don't see that. I just see numbers. I don't see a difference. So is this the truncated? >> Chao Weng: This one is not truncated. This one is. >>: Oh, the first one. >> Chao Weng: Yes, this one, and it's just the standard minibatch backpropagation through time. >>: So my other question is, do you believe -- so do you believe your results about truncation to a small number of frames -- you're only going to backpropagate -- your truncation is only going through like four or five steps or something. >> Chao Weng: Yes, four or five. Yes, we just truncate. >>: And then you stop. Do you think that would be true if you only had one frame? I mean, do you think that's related to the context window you're putting in? >> Chao Weng: I think it relates to the fact that we are using four or five context frames, left and right context, and these really have correlations, temporal correlations, so I think that might be one of the reasons it works. >>: Do you mean four or five frames? >> Chao Weng: Yes, context frames, both left and right. >>: So I don't [indiscernible], so this minibatch BPTT you are comparing, it just accumulates the error [indiscernible]. It doesn't update the weights for every new frame? So it just accumulates it. >> Chao Weng: It accumulates them within each minibatch and then updates per minibatch, not per frame. Yes. >>: Because how do you [indiscernible] the whole training procedure, because I think you actually initialize it and use it at the end. >> Chao Weng: Yes, I think I mentioned here, we copy the weights from the DNN training. >>: So this may affect how you train the recurrent connection. If you just use BPTT, then both the other weights and the recurrent weights may be treated similarly. But if you do it the other way, you may have different weights. >> Chao Weng: Yes, and I think we also discussed that -- I think several epochs is okay, and I also tried copying from the DNN once it converged at the end, and it doesn't make any improvement. >>: Do you have any thoughts, is it the task? So if another experiment were to [indiscernible] on a large vocabulary task where the recurrent connection didn't help. >> Chao Weng: I haven't tried that, actually, because this is almost all noise robustness. I haven't done [indiscernible] or Switchboard yet. >>: So how difficult is CHiME compared with Aurora-4? >> Chao Weng: Okay, so CHiME has five or six SNR conditions, I think, from minus six to nine dB, and I think in Aurora-4 all the SNR conditions are above zero, right? It's like maybe -- >>: So in this case, how do you reconcile this result with, in comparison, I think, a paper written recently by Dong and Mike applying DNNs on Aurora-4? >> Chao Weng: Yes, I will have a result on Aurora-4.
Actually -- >>: So are they comparable? >> Chao Weng: Yes, the result is here. This is the result I got on Aurora-4. >>: That's better, much better than the best. >> Chao Weng: No, no, no. I think this number is not better than what Dong and Mike got, because the -- >>: No, I'm talking about using a DNN without even using a recurrent network. >> Chao Weng: You mean this number? >>: Yes, that number is better than the best GMM. >> Chao Weng: But the thing is, Dong and Mike are using bigram language models. I'm using trigram language models. They got maybe -- what's the best number, maybe 13.4, right? >>: I think it's now at 12. >> Chao Weng: For the DNN. >>: That's better than the best. >>: Without dropout, without noise-aware training, it's -- >> Chao Weng: It's 13.4, I guess. >>: Yes, doing nothing but -- >> Chao Weng: Yes, but here, we use trigram. >>: [Indiscernible] or better. >>: That's comparable, but then you do dropout and there's noise, and some of the noise stuff is 12.5 or something, 12.6. >>: You mentioned you are using different language models. >> Chao Weng: Yes, yes. I think that's critical, because Dong and Mike use a bigram. I am using a trigram. It's not a fair comparison, yes. >>: It's more robust [indiscernible]. >> Chao Weng: Yes, I think the trigram works much better than the bigram, really, yes. >>: Everyone can use a trigram, but the standard test is a bigram. >>: We have about 20 minutes left. We're sure to derail your talk, so whatever you think is most important, you should probably -- >> Chao Weng: Okay, I think maybe the second part I'm going to just skip, because this is work I did at Microsoft Research during last summer, maybe I think most of -- >>: Some of the interviewers have actually never seen this. >>: Yes, it could be interesting. >>: I mean, there's 20 minutes left. So you just -- >> Chao Weng: Okay, then just quickly. I'll have to be quick. So the problem is that we want to do speech recognition in the presence of a competing talker, and this is closely related to a challenge issued in 2006, and at that time, the IBM Superhuman 2006 system had the best result, which was already kind of beyond what humans can do on this task. >>: Do you have an example of this kind of speech? Do you have an example of this kind of speech, for people who aren't familiar with it? >> Chao Weng: I can show a sample transcription in the experiments. >>: Or maybe you can describe a little bit about what it sounds like? >> Chao Weng: So typically, the training set is just single-talker, clean speech, and both the development and testing data are speech containing two speakers speaking at the same time, under different conditions: say, speaker one has higher energy or speaker two has higher energy. As for the scoring procedure, the grammar of this task is not that complicated; it is simply a small vocabulary, and each utterance consists of maybe six parts, I guess, and the second part is a color. The scoring procedure is like this: the target speaker is always the one speaking the color white, and the final performance is based on the word error rate on the numbers and the letters spoken by the target speaker. Yes, I think I just have to skip here. The main idea is that we use a multi-style training strategy, which means we try to create new training data -- because, originally, we have clean single-talker speech, we have to create new training data that is similar to what will be observed at test time.
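A small Python sketch of the kind of data creation this implies: mixing a clean target utterance with a competing-talker utterance at a chosen target-to-masker ratio. The function name and the exact mixing recipe are illustrative assumptions, not the actual setup:

import numpy as np

def mix_at_tmr(target, masker, tmr_db):
    """Scale the competing talker so that the target-to-masker energy ratio of
    the mixture is tmr_db (in dB), then add the two waveforms."""
    n = min(len(target), len(masker))
    target, masker = target[:n].astype(np.float64), masker[:n].astype(np.float64)
    p_t = np.mean(target ** 2)
    p_m = np.mean(masker ** 2) + 1e-12
    gain = np.sqrt(p_t / (p_m * 10.0 ** (tmr_db / 10.0)))
    return target + gain * masker

Sweeping tmr_db over the conditions seen at test time (for example from 6 dB down to minus 9 dB) would give the multi-condition training sets for the high- and low-energy models discussed next.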
So we try different multi-style training setups. We have high- and low-energy signal models, and we can also use a similar setup to train high- and low-energy front-end de-noisers. And we can also use a speech separation criterion. One potential issue of all these three setups is that, once the two mixed speech signals have similar energy or are otherwise similar, these models are expected to perform very badly. So later on, we tried training models based on the energy of each individual frame, but in this case, we need to determine which of the two DNN outputs belongs to which speaker at each frame. So here we introduce a joint decoder, where the joint decoder tries to find the best two state sequences in a two-dimensional joint state space, and each dimension of the joint state space corresponds to one speaker. The key part of this joint decoder is the joint token passing, or joint token expansion, on the two HCLG graphs. For the joint token passing, suppose now the token for speaker one resides at state one and the token for speaker two resides at state two. For the non-epsilon outgoing arcs, when we expand, we just take the combinations of the two, so two arcs each would give four outgoing joint arcs. But the epsilon outgoing arcs are kind of tricky, because an epsilon outgoing arc does not consume any acoustic frame, so we have to create a new joint state: we can see the first is the one-two joint state, and we have to create a new joint state three-two here and do the similar token expansion. A potential issue of this joint decoder is that we allow the louder speaker to switch frame by frame, and to overcome this, we introduce a constant penalty along a path when the loudest signal has changed from the last frame. A better strategy is to train a DNN to predict where the energy switch point occurs, and we use this posterior probability to form the penalties. Yes, this is just the data set we are working on, which I think I already explained earlier, and this is the experimental result. These two systems correspond to training both the GMM and DNN systems only on the clean single-talker speech, and we can see that the word error rate is very high, especially in some challenging conditions. And once we use the multi-condition high- and low-energy models, we actually do very well in the corresponding conditions here. Then we combine these two systems using the rule that the target speaker always speaks the color white, but we are still far behind the IBM Superhuman system. The reason, actually, is that we do very badly when the two mixed speech signals are very similar. >>: I'm not sure I got that. So the first row there, DNN, is you just trained on all the data indiscriminately. There's one DNN that decodes everything, and that's it. >> Chao Weng: Yes, this one is trained on the single-talker speech. >>: Right. DNN I is trained on the dominant speaker. Is that right? >> Chao Weng: Yes, and we actually trained it on the newly created training set, yes. >>: And so it does very well on the dominant speaker, where there's a high signal-to-noise ratio, but somehow the numbers were so bad for the other ones that you didn't put them on there. >> Chao Weng: Yes, but the purpose of DNN I is to focus on the signal that dominates. >>: I understand.
And then DNN II is the reverse, where it was trained on the quiet speaker, and there you see the reverse, that it's doing very well on the nine dB. And DNN I + II, is that with the joint decoding? >> Chao Weng: Not joint decoding. It's just that we're using the rule -- we're just looking at the second output, whether it's white or not white. If it's white, we're just using that as the -- >>: So you're having that other DNN that's saying pay attention, use the dominant speaker model or use the quiet speaker model. >> Chao Weng: Yes, exactly. >>: And you're combining those two in that one. Now I've got it. >> Chao Weng: Yes, yes. And then, this is the result where we trained the deep front-end de-noiser, and this is the result we trained using high- and low-pitch signal models, but we still -- we're actually doing better on the zero dB case, but we're not doing so well in the other conditions. So this system corresponds to the joint decoder, and we can see that without any penalty introduced, we are slightly better than the IBM Superhuman system; with Joint Decoder I, where we introduce the constant energy-switch penalty, it is slightly better again, and this corresponds to the system with the adaptive penalty. So one observation is that the joint decoder-based systems do very well when the two mixed speech signals have very similar energies, say zero dB or minus three dB, and DNN I + II does very well when the two mixed signals have very large energy differences, say six dB, minus six dB or minus nine dB. So we just combine these two using the front-end de-noisers: given certain test speech, we propagate it through the high-energy front-end de-noiser and the low-energy front-end de-noiser, and we compute the energy ratio. If this ratio is beyond a certain threshold, we use the DNN I + II system, and once this value is below that ratio threshold, we use the Joint Decoder II system. The threshold is determined on the dev set. For the combined system, the final number is 19.7. Yes, the third goal is that we want to propose a robust WFST-based semantic decoder, and there is also some previous good work here. But a lot of methods do topic spotting based on 1-best ASR outputs; for more robust topic spotting, people began to use linguistic features from the lattice, and because we are working in the WFST framework, we are looking for a solution using WFSTs. Here we have the n-gram rational kernels, which first map the WFSTs to an n-gram feature space and then employ an inner product. So what it typically does is just use the common counts of the n-grams contained in the input transducers. Note that here the count is actually the expected count, because a WFST actually defines a distribution over multiple alternative strings. But the n-gram rational kernel ignores the fact that many n-grams are semantically related, and it assumes a uniform contribution from all n-grams to the topic discrimination. This is an illustration of how we can efficiently evaluate the n-gram rational kernels: we typically just construct a certain transducer T, and in the end, evaluating the n-gram rational kernel just consists of several compositions. So, yes, if we consider the WFST as a distribution over documents, a lot of text analysis techniques can be applied here, and now we just -- I think I'll just skip here.
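A rough LaTeX sketch of the kernels being described; the notation here is an assumption based on this description rather than copied from the slides. The n-gram rational kernel compares two lattices through their expected n-gram counts,

$$K_n(A,B)=\sum_{x\in\Sigma^{\le n}} c_A(x)\,c_B(x),$$

where $c_A(x)$ is the expected count of n-gram $x$ under the distribution over strings defined by lattice $A$, and with the counting transducer $T$ this is evaluated as the total weight of the composition $A\circ T\circ T^{-1}\circ B$. The latent semantic rational kernel defined next replaces the implicit identity weighting with a term-to-term similarity,

$$K_{\mathrm{LSRK}}(A,B)=\sum_{x,y} c_A(x)\,S(x,y)\,c_B(y),$$

so $S=I$ recovers the plain n-gram kernel, and inverse document frequencies on the diagonal give tf-idf weighting.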
We just define the latent semantic rational kernels, the general form of LSRK, in these two forms: one is based on the transducer M, and one is based on the transducer S. The transducer M encodes the transform which maps the high-dimensional n-grams to a low-dimensional space, and this can be learned in an unsupervised fashion, which means you can bring in more training text corpora to enhance the speech topic spotting application. The second form, the transducer S, encodes the term-to-term similarity matrix, and here you can bring in more ontology-based knowledge. One simple example: let S be the matrix form of transducer S; when this matrix is the identity matrix, the LSRK just degenerates to the n-gram rational kernel. If you add the inverse document frequencies to the diagonal of this matrix, you are doing tf-idf weighting; we can also do this with LSA techniques, and we can also evaluate the term-to-term similarity purely based on WordNet. In the following, I [indiscernible] mainly focus on how we can generalize LSRK using probabilistic topic models. For PLSA, the number of latent topic variables grows with the number of documents, so one needs to pay attention when learning its parameters. LDA actually cleans these issues up by introducing a global Dirichlet prior such that all latent variables can be integrated out, both in the inference and the learning process. But the downside of LDA is that the learning process cannot be expressed in closed form, so you need variational or sampling-based methods. For the learning, that is okay, because we can always do the parameter learning in offline mode, but for the inference, we want the response of a speech topic spotting system to be as quick as possible, so in this case we use the same folding-in process as in the original PLSA work: we keep the learned probabilities of words given topics fixed, and then it is just an E step and an M step to learn the probability of topics given documents -- in our case, given input transducers. One rigorous use of topic models is that Fisher kernels can be derived from the probabilistic topic models, and they consist of two parts. In the following, I will just give the main ideas of how we can do this. First, you need to define the transducer M-seed, which encodes the probability of certain words given certain topics; you can learn these parameters either from PLSA or LDA, and this is an example. Here we have the vocabulary A, B, C, and we have two topic indices -- say we just have two topics. The input labels correspond to words, the output labels correspond to the topic index, and the weight corresponds to the probability, actually this probability. Then, we also define another transducer with both input and output labels as topic indices, and the weight corresponds to the probability of certain topics given the input transducer. With these definitions, the E step can mostly be done by composition with these two transducers, but we need certain extra normalization steps. And then the M step can be done by the composition of three transducers, which in between also needs certain normalization steps.
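A rough sketch of the folding-in updates this transducer algorithm implements, written here for plain term counts $c_d(w)$ (for a lattice input, these would be the expected counts extracted from the transducer); the word-given-topic probabilities $p(w\mid z)$ stay fixed from the offline training:

$$\text{E step:}\quad p(z\mid d,w)=\frac{p(w\mid z)\,p(z\mid d)}{\sum_{z'}p(w\mid z')\,p(z'\mid d)}$$

$$\text{M step:}\quad p(z\mid d)=\frac{\sum_{w}c_d(w)\,p(z\mid d,w)}{\sum_{w}c_d(w)}$$

Iterating the two steps yields the topic posteriors $p(z\mid d)$ for the input transducer, which is what the composition-plus-normalization steps above compute.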
So in the end, the whole algorithm is like this, and the output of this algorithm is, in the end, the learned probability of topics given the input transducer. With this in hand, we can easily derive the two parts of the Fisher kernels, as I've just shown.
So we first did the experiments on a subset of Switchboard, and we had to filter out some utterances, because a lot of utterances are not appropriate for the topic spotting application. We used n-gram rational kernels as a baseline, and once we have the rational kernels, we just use [indiscernible] to do the classification. So for the accuracy, we can see that the n-gram rational kernels get around 28.2% accuracy, and once we use the latent semantic rational kernels with LSA and tf-idf weighting, we get a large improvement here.
>>: So what part of this is due to all the transducer stuff? What part of this is due to the transducers? For example, if I do latent Dirichlet allocation and then I just look at the likeliest topic of a document according to LDA, what accuracy is that?
>> Chao Weng: I will show that later. Yes, here. And this is the result. When we do LSA, we use WordNet to do the semantic expansion. It means that when we first form the term-document matrix, we use WordNet to do the semantic smoothing, and this shows the influence of whether we do the semantic expansion.
>>: So it looks like the kernel method is really bad in all cases.
>> Chao Weng: No, these are all kernel methods.
>>: When you use the rational kernel.
>> Chao Weng: These are all rational kernels, but the rational kernels are based on different models.
>>: Oh, I see.
>>: Why do I need any of that? If I just do LDA, what do I get? If I just do LDA, no kernels, no transducers, no -- if I run LDA?
>> Chao Weng: What kind of representation are you going to use?
>>: So this is saying what the topic is, right? This is accuracy in identifying the topic.
>> Chao Weng: Yes.
>>: So LDA takes a document, so whatever your input is.
>> Chao Weng: And you would just do the inference, variational inference?
>>: And you do variational inference, it gives you a probability distribution on the topics.
>> Chao Weng: Yes, I think it might be a little bit better, but you need time to do the inference, right? And what kind of input do you take -- the lattice, the n-best, the 1-best?
>>: I see, so this is operating -- the whole point of this is that it's operating on a lattice, not on just a single 1-best.
>> Chao Weng: Yes, on the lattice output from the speech recognition system.
>>: And with the n-best, you could do each of those and select the best one.
>> Chao Weng: Yes, you can do that, but usually, if you look at the n-best, only a very limited number of words are different.
>>: [Indiscernible]. So even if you have a lattice, you switch them on the back of --
>> Chao Weng: Yes, that's why we first have to use the transducer T to extract all the n-gram features. Yes.
>>: So are the probabilities on n-grams or on words?
>> Chao Weng: On words, actually.
>>: So why are you doing n-grams?
>> Chao Weng: In the end, we might need some kind of extension where we use n-grams. And also, recently I'm doing some work on neural-net-learned word representations. In that case, we may use the n-grams, yes.
>>: So is the topic on the sentence, or -- within a sentence, you can have many words.
So, actually, the topic you are spotting -- for these sentences, you have the one topic.
>> Chao Weng: Yes, only one topic. We didn't consider multiple possibilities. And then the second experiment is based on the How May I Help You 0300 corpus. This is the result using only LSA and tf-idf weighting, and one interesting experiment is that -- this result corresponds to doing the LSRK purely with WordNet, and the topic models get the best result. And as for the summary, I think I'll just skip over it.
So I'll just talk about some future work. First, we want to do sequential discriminative training using nonuniform criteria so that we can kind of combine these three parts, too. And, actually, I am working on a kind of LSRK which incorporates neural-net-learned word representations. So that's basically the future work, and the acknowledgments. First, I want to thank Dan Povey at Johns Hopkins for the MCE implementation, and also thank David Thompson and Patrick Haffner at AT&T Labs for the LSRK work, and also thanks to Mike and Jasha for the mentoring, and to Geoffrey, Frank and Kun, who was actually an intern here last summer, for the suggestions and advice. And also thanks to Shinji for providing the training data and some advice on the recurrent deep neural network work. I think that's basically it. Yes. Too long?
>>: Yes, but I just want to verify: for kernel methods, typically the limitation is that you cannot use very large amounts of data, because the kernel matrix becomes the square of the size of the whole training set.
>> Chao Weng: Yes, that's one. And the other limitation is once you have unbalanced data samples for each class, maybe as we --
>>: That means that --
>> Chao Weng: I haven't done that, because the main focus is that we want to find the form of the kernels, not the SVM part.
>>: I see, okay. So the kernels that you are using here, the rational kernels, do they have the same limitations?
>> Chao Weng: You mean in terms of --
>>: The rational kernel, the rational kernel.
>> Chao Weng: In terms of the unbalanced data samples?
>>: Also in terms of the square kind of --
>>: Scalability.
>>: Scalability. Both, both.
>> Chao Weng: I don't think -- I haven't found this kind of issue. I mean, the efficiency is all right. Yes, the kernel matrix can be very large, but once you have learned the parameters, the M transducer or the S transducer, the topic spotting can be done efficiently. The learning can always take time, but in my case it maybe just takes half a day to learn the transducer M.
>>: For that task. But suppose your data is increased tenfold. Then do you think the same thing --
>> Chao Weng: I still think that's not a problem. I still think that's not a problem. Okay, thanks.