>> Dong Yu: Today, we are glad to have Chao Weng from the Electrical and Computer Engineering Department at Georgia Tech to come here to give a talk. He has been an intern at AT&T Labs and Microsoft Research in the past. Today, he will talk about his work on robust speech recognition and understanding.
>> Chao Weng: Okay. Thanks for the introduction, and thanks for coming to my talk. I'm very honored to be here again to give this talk on what I have been doing for my PhD dissertation. The topic of my talk is Towards Robust Conversational Speech Recognition and Understanding. This is the outline of my talk. In the first, introduction section, I will explain the challenges that we are faced with, and then I will explain the motivations. In the second section, I will elaborate on the proposed research, one by one, in four separate parts. And in the last section, I will give the system overview of the whole proposed research and the remaining work that needs to be done towards the final system. So first, the introduction section. While significant progress has been made in ASR, conversational speech recognition and understanding remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is really ungrammatical and unstructured. So while you can expect over 90% word accuracy on a large-vocabulary task with dictation-style speech, say, Wall Street Journal, this word accuracy degrades dramatically for a spontaneous speech task, say, Switchboard.
These figures show the benchmarks of ASR in terms of word error rate for the DARPA-sponsored tasks. We can see that for read speech, both for the Resource Management 1,000-word task and for the Wall Street Journal 5,000- and 20,000-word tasks, we can really get the word error rate below 10%, but for the conversational speech tasks, the word error rate is still relatively high. Another challenge in recognizing and understanding conversational speech comes from adverse acoustical environments: additive noise, channel distortions and the Lombard effect all lead to a severe mismatch between training and testing data. One of the most challenging unsolved problems is the cocktail party problem, in which one needs to focus auditory attention on one conversation out of many and try to recognize it. Actually, this can be viewed as a specific case of adverse acoustical environments, which means we want to recognize and understand speech in the presence of competing talkers. Apart from the speech recognition part, for the speech understanding here, a general speech understanding problem could be very complicated. Specifically, in this work, we mean that we are trying to extract semantic notions from given conversational speech, and even with perfect ASR this can be challenging due to the interference of fillers, mispronounced words and disfluencies. So
with all these challenges we have, we will mainly use two techniques to attack them. The first is the WFST, the weighted finite-state transducer. In this framework, we of course have a lot of powerful algorithms to take advantage of, and we'll see how we can do the lattice generation for [indiscernible] training later on. Also in this framework, the recognition part and the understanding part can be seamlessly integrated with each other. The second technique is of course the deep neural network. We know that the recent success of deep neural networks for ASR has opened a lot of possibilities to further improve the system performance. So this is speech recognition and understanding from the WFST view: starting from very low-level speech signals, in the end we want to extract certain semantic notions. For now, I want to say that in this work we will work mainly on two parts, which are highlighted in the figure. The first part is the acoustic models, and the second part is what kind of transducers, or what kind of module, we need to map the output from ASR to certain semantic notions. So the first motivation is that the word error rate for full transcription of
conversational speech is relatively high, while the relevant semantic notions are often already embedded in a set of keywords. So the first motivation is that we want to do better acoustic modeling for keyword spotting for speech understanding. The second motivation is that adverse acoustical environments degrade performance substantially, and we want to utilize DNNs to further improve the noise robustness of the system. The third motivation is that, since conversational speech is ungrammatical and ill-structured, we need a kind of semantic decoder that can seamlessly interface with ASR outputs, and since we are working in the WFST framework, we need a robust WFST-based semantic decoder. So the final objective is that we want to build a conversational speech recognition and understanding system which is robust to various acoustical environments, and with the three motivations we mentioned, we have three goals to accomplish. The first is that we want to propose an acoustic model training methodology for keyword spotting and speech understanding. The second goal is that we want to propose DNN-based acoustic models that are robust to additive noise, channel distortions and the interference from competing talkers. And the third goal is that we want to propose a robust WFST-based semantic decoder that can be seamlessly coupled with ASR. So, the second part. Remember, the first goal is that we want to
propose an acoustic model training methodology for keyword spotting on conversational speech, and there has been a lot of previous work on keyword spotting, which I have listed here. Among these, the so-called two-stage approach, which benefits from various LVCSR techniques, delivers decent performance: in the first stage we just do LVCSR, and in the second stage we verify and detect the keywords from the output of the LVCSR. The drawback of this approach is that the two stages are designed under different criteria; LVCSR systems are trained to minimize the word error rate in general, without placing any emphasis on keywords. Discriminative training is a general technique to boost a speech recognition system, and discriminative training using non-uniform criteria seems a good candidate to address these limitations. In this kind of method, we want to minimize a non-uniform error cost, weighted according to the various error types, rather than just a 0-1 loss. So following the same spirit, we will formulate non-uniform MCE for keyword spotting on conversational speech. For a certain training frame, we define the frame discriminative function in this form, which has two parts: the first is a state posterior, and the second is a frame likelihood. Then we can formulate the non-uniform MCE in this form. First, we have the classification error here, and then, depending on whether the keywords appear in the reference word transcription or in the hypothesis space, we embed a frame-level error cost.
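[Editor's note: the slide equations are not captured in the transcript. As a hedged reconstruction of the formulation being described -- a state posterior times a frame log-likelihood, a sigmoid classification loss, and a frame-level error cost -- one plausible form is:]

```latex
% Hedged reconstruction; symbols and exact form are assumptions, not the slide.
% Frame discriminative function for word string W and model parameters \Lambda:
g_t(\mathbf{x}_t; W, \Lambda) \;=\; \sum_{s} \gamma_t\!\left(s \mid \mathbf{X}, W, \Lambda\right)\, \log p(\mathbf{x}_t \mid s, \Lambda)

% Frame-level misclassification measure (reference vs. competing hypotheses):
d_t(\mathbf{x}_t; \Lambda) \;=\; -\,g_t(\mathbf{x}_t; W_{\mathrm{ref}}, \Lambda)
  \;+\; \frac{1}{\eta}\log\!\Big[ \tfrac{1}{|\mathcal{W}|} \sum_{W \neq W_{\mathrm{ref}}} \exp\big(\eta\, g_t(\mathbf{x}_t; W, \Lambda)\big) \Big]

% Non-uniform MCE: sigmoid loss weighted by a frame-level error cost
% \varepsilon(t), which is larger when frame t falls inside a keyword:
\mathcal{L}(\Lambda) \;=\; \sum_t \varepsilon(t)\, \frac{1}{1 + e^{-\alpha\, d_t(\mathbf{x}_t; \Lambda)}}
```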
>>: I'm sorry, can you just go over that a little more, maybe starting with the first one. So the
frame discriminative function, you have the state posterior there. W is a word string.
>> Chao Weng: Yes, word string, yes.
>>: And that's the frame likelihood, okay.
>> Chao Weng: The second part is the -- yes. Yes.
>>: So this is a frame level.
>> Chao Weng: Discriminative function.
>>: Function. What makes it discriminative?
>> Chao Weng: The discrimination is here: we have a discriminative function both for the reference word transcription and for the whole hypothesis space.
>>: Oh, I see, so this is basically -- so this is straight MCE training. That's what you're using to
discriminate between two different hypotheses.
>> Chao Weng: Yes.
>>: Or that's the training criteria in there.
>> Chao Weng: Yes.
>>: So the state posterior is [indiscernible]. The state posterior is the fractional count of this particular state? Okay. This one is compared to the regular [indiscernible].
>> Chao Weng: Yes, yes.
>>: But normally, the standard MCE doesn't have the state posterior term. It's only for the --
>> Chao Weng: Only have a what?
>>: It's only the likelihood term, the standard MCE.
>> Chao Weng: Yes, yes, standard.
>>: So you put it there for the sake of?
>> Chao Weng: Because we want to -- one more thing is we want to embed the error cost in the
frame level, and in the conventional MCE case, you cannot, because it's based on the whole word
string, right?
>>: So the training level [indiscernible].
>>: So you can do both -- the state posterior, or either one [indiscernible], so basically you've got it corresponding to a case --
>> Chao Weng: Yes, yes. Yes. For the reference word transcription, yes, exactly. But for the hypothesis space, you would not have only 1 or 0, right?
>>: So is what you're doing here somehow summing over the different alignments of the frames
to the hypothesis?
>> Chao Weng: You mean here?
>>: The point of the state posterior.
>> Chao Weng: Yes, with the same state, with the same state index. Yes.
>>: So what makes it non-uniform here?
>> Chao Weng: Non-uniform is the error cost.
>>: Oh, it's --
>> Chao Weng: [Indiscernible].
>>: Oh, okay, you talk about it on the next slide.
>> Chao Weng: Yes.
>>: So that's the same as the one that [Fuchun] did in his dissertation.
>> Chao Weng: A big difference I will describe later on.
>>: It's different? It's different or the same as --
>> Chao Weng: No, it's not the same. It's different, yes.
>>: I think -- maybe this is Jeff's question, but is there a place where you sum over state
alignments for the correct keyword? Like if I see a state posterior, given the word, and the word
is the numerator that you're trying to --
>> Chao Weng: You mean here?
>>: Yes.
>> Chao Weng: Sum over all the --
>>: Do you sum over all the --
>> Chao Weng: No, we just do the alignment.
>>: Just best path?
>> Chao Weng: Best path, yes.
>>: So then why is -- but the state posteriors used are not binary. They're not indicators.
>> Chao Weng: In this case, we have different -- in the hypothesis space, right?
>>: And can you just go over exactly what's being summed there in that triple sum?
>> Chao Weng: Here?
>>: Right. So J is -- what is J? Oh, J is down in the -- J is down on the bottom.
>> Chao Weng: Based on the --
>>: Oh, J is a set -- okay.
>> Chao Weng: Yes. We have different.
>>: So for all the correct --
>> Chao Weng: Yes. So this is --
>>: It's a class, right? It's a class level.
>> Chao Weng: Yes. It's a class level, yes. Exactly.
>>: Okay, so you can have that small number --
>> Chao Weng: So J is more like the class labels from the reference transcription, and here, I is the labels from the hypothesis space.
>>: I see. Is labels different from words? What's the difference between a label and a word?
>>: The label here is a state ID.
>>: Oh, okay.
>> Chao Weng: Yes, but here it may be a little bit confusing. Here, for the I, whether it belongs to keywords or not, we just look at the alignments -- whether a certain state index falls within a certain keyword's time boundary.
>>: So you're somehow going to try to improve the likelihood of hypotheses that have a lot of
overlap between states that are assigned to keywords and the true alignment of the keywords to
the audio. Is that what's going on?
>> Chao Weng: Yes. When the keywords appear, we're going to train more, actually. I will
show later on. Yes.
>>: Do you need to have N-best?
>> Chao Weng: Oh, we're doing it on the lattice.
>>: On the lattice.
>> Chao Weng: Yes. Yes, if we define the error cost function here, epsilon of t, we can simplify the overall objective function into this form. And if we take the gradient of this objective function and compare it to the regular MCE, we can see there are maybe two main differences. The first corresponds to the [indiscernible] -- the derivative of the sigmoid function -- and this value for non-uniform MCE changes frame by frame, whereas in regular MCE it is actually a constant value across the whole training utterance. For a more efficient implementation, we just fold this difference into the gradient. And the second difference is that we embed the frame-level error cost here.
And we can see that non-uniform MCE, after approximation, is equivalent to MCE on an artificially resampled training set, where this training set is resampled according to the error cost function epsilon of t. Another issue of doing non-uniform MCE training is that the competing hypothesis space needs to exclude the reference word transcription, and naively removing the arcs corresponding to the reference words will definitely hurt the topological structure of the lattice. This can, of course, be circumvented by subtracting the corresponding statistics from both the numerator and denominator posteriors, but in the WFST framework we can come up with a more elegant solution, which takes advantage of the WFST difference operation. So recall that a regular WFST decoded lattice is essentially a pruned version of the whole decoding graph, and the whole decoding graph for a certain speech utterance can be expressed in this form. We know that the input labels of this pruned version are actually transition identifiers, which apparently are not a valid operand for the difference operation, because we want to exclude the word-level transcription. So there is a compact WFST-based lattice definition, which has already become the standard lattice-generation [indiscernible] encoding. In this kind of lattice format, both input and output labels are words, and the acoustic model scores, language model scores and state alignments are encoded in the weight using a special semiring. With this special lattice definition, the lattice generation for non-uniform MCE training can be summarized in three steps. We first invert the pruned portion of the whole decoding graph, and then we encode the inverted portion using this special semiring. Then we do some graph optimization with epsilon removal and determinization. Then we compile the reference transcription into a WFST and simply do the WFST difference. We will get the lattice for non-uniform MCE training.
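[Editor's note: a minimal sketch of the three-step lattice generation just described, using OpenFst's Python wrapper. This is not the actual Kaldi implementation -- in particular it glosses over the special lattice semiring that carries the scores and alignments -- and the file names are hypothetical.]

```python
# Sketch only: assumes plain tropical-semiring FSTs; the real recipe works in a
# special compact-lattice semiring carrying acoustic/LM scores and alignments.
import pywrapfst as fst

# Step 1: take the pruned portion of the decoding graph (the raw lattice) and
# invert it, so that word labels move to the input side.
raw_lattice = fst.Fst.read("pruned_decoding_graph.fst")  # hypothetical path
raw_lattice.invert()

# Step 2: graph optimization -- epsilon removal followed by determinization.
raw_lattice.rmepsilon()
word_lattice = fst.determinize(raw_lattice)

# Step 3: compile the reference transcription into an unweighted, deterministic,
# epsilon-free acceptor and subtract it, leaving only competing hypotheses.
reference = fst.Fst.read("reference_transcription.fst")  # hypothetical path
denominator_lattice = fst.difference(word_lattice, reference)
denominator_lattice.write("nonuniform_mce_lattice.fst")
```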
>>: Excuse me, so the C is the scores.
>> Chao Weng: Here? Yes, the C is the scores, and S is the state alignment. As mentioned earlier, non-uniform MCE is actually equivalent to employing regular MCE on a resampled training set, where the training set is reweighted by the error cost function epsilon of t, so boosting-based techniques can be applied here naturally. In this kind of boosting method, the class corresponds to the acoustic tied state, or senone. For each class, we define a class-specific error cost, which means we first look at the state posteriors derived from the lattice we just generated, and if the accumulated posterior mass whose state index differs from the reference state index is beyond 0.5, we count it as a mistake. Weighted by the error cost function, we then get the class-specific empirical error cost.
>>: So you don't use that kind of Adaboost formulation in the explanation.
>> Chao Weng: Yes, we actually use it in explaining the whole algorithm. Then, with this empirical error cost, we decide which models, or which combinations, we are going to use to generate the acoustic scores during decoding. This is the whole algorithm: first we have the error cost function initialization here, then we collect the state posteriors, both from the reference transcription and from the lattice we just generated, and with the class-specific empirical error cost we update the error cost function, in the Adaboost way. With the updated error cost function, we train the new iteration of the model using non-uniform MCE. And with the class-specific empirical error cost, we can determine which models, among all iterations, we use to generate the acoustic scores during decoding.
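[Editor's note: a rough sketch of the adaptive training loop just described. The exact error-cost update is not given in the transcript, so the AdaBoost-style exponential reweighting below, and every train_/collect_ helper, are assumptions.]

```python
# Sketch only: hypothetical helpers, assumed AdaBoost-like update.
import math

def adaptive_boosted_nonuniform_mce(model, train_set, keywords, n_iters=4):
    # One error-cost weight per acoustic tied state (senone), initialized uniformly.
    error_cost = {c: 1.0 for c in model.classes}
    models = []
    for _ in range(n_iters):
        # Collect per-class state posteriors against the reference alignment
        # from the lattices generated as described above (hypothetical helper).
        posteriors = collect_state_posteriors(model, train_set)
        # Class-specific empirical error cost: posterior mass on wrong states;
        # a class "makes a mistake" if this exceeds 0.5.
        eps = {c: posteriors.wrong_mass(c) for c in model.classes}
        # Assumed AdaBoost-style reweighting: mistaken classes get a larger cost.
        for c in model.classes:
            alpha = 0.5 * math.log((1.0 - eps[c]) / max(eps[c], 1e-6))
            error_cost[c] *= math.exp(alpha if eps[c] > 0.5 else -alpha)
        # Train the next-iteration model with non-uniform MCE, using the updated
        # frame-level error costs (keyword frames weighted up) -- hypothetical helper.
        model = train_nonuniform_mce(model, train_set, keywords, error_cost)
        models.append((model, dict(eps)))
    # At decoding time, for each class pick the iteration with the lowest
    # empirical error cost to generate its acoustic scores.
    return {c: min(models, key=lambda m: m[1][c])[0] for c in model.classes}
```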
>>: I'm sorry. So I understand -- I can parse boosted non-uniform MCE, but what's the adaptive
part, again?
>> Chao Weng: Adaptive, because in our original work we just made this fixed. And actually, the "boosted" here -- it's a bit confusing -- we don't mean boosting in the Adaboost sense. I will explain later; we call it boosted here because we used a certain kind of trick when we did the extended Baum-Welch update.
>>: Oh, it's that.
>> Chao Weng: Which is the same trick as in boosted MMI, which is canceling the --
>>: So it's not about error boosting --
>> Chao Weng: Yes, the term is a bit confusing, but yes. Yes, we do the experiments on two
tasks. First is Switchboard, and the other is the HKUST Mandarin, which is essentially just a
Mandarin counterpart of Switchboard. And then the baseline speech recognition system is
kind of straightforward. So we first do the speech recognition experiments, and we can see in
this case MCE got the best result here compared to other standard discriminative training.
>>: This [indiscernible].
>> Chao Weng: Yes, it's non -- the test is that it should be [indiscernible] English.
>>: So the result is on [indiscernible].
>> Chao Weng: Yes. And then we do the keyword spotting experiments on the credit card subset of Switchboard, and here by boosted -- yes, we just mean that extended Baum-Welch trick. Yes.
>>: I'm a little confused as to, if you're talking about -- you have to bear with me, because I'm
not a speech recognition person, but if you're talking about conversational speech, why are we
talking about keywords? Because I imagine the keywords part might be extremely important in
a command-and-control or that kind of situation, but if you're talking about conversation, then I'm not sure
how keywords come into it.
>> Chao Weng: Yes. I think it's pretty much in the motivation. Conversational speech, usually,
would have -- if you're talking about the full transcription, it has very high word error rate, and if
we just want to kind of get a rough idea of what this conversational speech is talking about,
maybe it's enough just to recognize those keywords, such that we can --
>>: Do you have a predefined list of keywords?
>>: Think about NSA.
>>: The NSA and the DOD would be very happy to [indiscernible].
>>: Okay.
>>: A small list of --
>> Chao Weng: Yes, and we can see the proposed system --
>>: Bomb.
>> Chao Weng: We get kind of significant performance gains in terms of figure of merit, and these are the curves of the three systems. Then we do the experiments on the HKUST Mandarin corpus. First is still the speech recognition experiment, and we can see that, in this case, boosted MMI gets the best result in terms of the general syllable error rate. Then we also do the keyword spotting experiments, and we can see that the proposed method also gets significant improvements.
>>: I'm trying to understand, something like MLE means you just decode it, and then you look
for instances of those keywords in the decoded speech.
>> Chao Weng: Yes, yes, exactly. Yes. Yes?
>>: Is there something about this database? I've heard of it, but I don't know the history.
>> Chao Weng: It's about 150 hours of Mandarin speech, and it also has different topics assigned to each conversation.
>>: What is the purpose of that database?
>> Chao Weng: Perplexity?
>>: No, no, what's the purpose of that database?
>> Chao Weng: Oh, the purpose I think is kind of similar to Switchboard, but it's in Mandarin. And the thing is, it's also kind of a bilingual thing. It's spontaneous speech, and it also has English words, so you have to prepare a bilingual lexicon.
>>: So you think about it as a Chinese version of Switchboard.
>> Chao Weng: Yes, I'm thinking about it. Yes.
>>: Sorry, go ahead.
>>: So a particular [indiscernible] group did this?
>> Chao Weng: This was released by the LDC, yes.
>>: So is this F measure, or what is the table that the numbers go to?
>> Chao Weng: Oh, you mean this one? Figure of merit, which typically you can kind of think of as the upper bound of the -- how to say? Yes, I think I can explain. It's the percentage of hits, averaged over one to ten false alarms per hour per keyword.
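[Editor's note: for reference, a simplified sketch of this figure-of-merit computation -- the hit rate averaged at 1 through 10 false alarms per hour per keyword. This is a generic approximation that ignores the NIST interpolation details, and the input format is assumed.]

```python
def figure_of_merit(detections, n_true, audio_hours):
    """detections: list of (score, is_hit) candidates for one keyword.
    n_true: number of true occurrences; audio_hours: amount of test audio."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    rates = []
    for fa_per_hour in range(1, 11):            # operating points 1..10 FA/hour
        allowed_fa = fa_per_hour * audio_hours  # false-alarm budget at this point
        hits = fa = 0
        for _, is_hit in ranked:                # walk down the ranked detections
            if is_hit:
                hits += 1
            else:
                fa += 1
                if fa > allowed_fa:
                    break                       # budget spent for this point
        rates.append(hits / max(n_true, 1))
    return 100.0 * sum(rates) / len(rates)      # FOM in percent
```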
>>: All of the other things are just straight-up acoustic modeling and the standard --
>> Chao Weng: You mean here?
>>: Yes. None of those have anything to do with the keywords?
>> Chao Weng: Yes, no. Just the standard discriminative training, yes.
>>: So the fifth word [indiscernible]. Do you have any idea how these systems would compare
on a new keyword?
>> Chao Weng: A new keyword?
>>: So like, the advantage of the first ones is you can take the same acoustic --
>> Chao Weng: Okay, you mean once we change the keyword set. I'd rather do it in the feature space, because in that case we just learn a feature-space transform instead of retraining the models, right? Actually, we can do it in the feature space, with feature-space discriminative training. You mean if we want to change the keyword set, right?
>>: Yes.
>> Chao Weng: Then, once you have a new keyword set, that means we have to do the whole training from the beginning again. This is not flexible, but we can do it in the feature space, which means we only train a feature transform for each keyword set, and then we can decide which transform we use. Yes, fMPE, fMMI, this kind of stuff.
>>: So earlier you mentioned something about boosted -- Adaboost. You summarized all this. The error cost, it looks a little bit like Adaboost, in the formulas --
>> Chao Weng: Yes, the adaptive part actually kind of comes from Adaboost.
>>: Do you have that one? In the earlier --
>> Chao Weng: This actually is the Adaboost way, right, to do the error cost function.
>>: But the formula of these different ones, the formula that was the explanation of minus alpha
and alpha as a --
>> Chao Weng: Yes, you have to do certain adaptations, because in this case, we have 1,000
[indiscernible]. You cannot just directly use Adaboost.
>>: You can use that formula, though?
>> Chao Weng: Yes, we have to find a way that also works for the speech recognition systems,
large vocabulary.
>>: So that probably [indiscernible].
>> Chao Weng: Yes, this part, and I think also the formulation of objective function is also a bit
different.
>>: That's nothing to do with the boosted MMI, when you talk about MMI.
>> Chao Weng: No. Boosted MMI just means we boost the hypothesis space. Okay, so the first part of the second goal is that we want to propose a DNN-based acoustic model that is robust to additive noise and channel distortions, and there has been a lot of good work on this before, which we will not reiterate here. So [indiscernible] with DNNs, this performance can be achieved on our benchmarks without multi-pass decoding or model adaptation. Meanwhile, recurrent neural networks have also been explored for noise robustness, but so far, I think, they have only been used either as a front-end de-noiser or in the tandem setup, so few if any have explored the hybrid recurrent deep neural network setup, with proven gains on larger tasks where the language model matters during decoding. So in the following, we try to build a hybrid recurrent DNN system, and we believe that the richer context offered by the recurrent connections can facilitate more robust, deeper representations of noisy speech. And this is the hybrid setup. So the --
>>: Can I ask [indiscernible], go back and --
>> Chao Weng: Which one?
>>: The one that was straightforward.
>> Chao Weng: This one?
>>: That one. So this is a feed-forward net, this is not --
>> Chao Weng: I'm just explaining the overall setup.
>>: That's what it's for. Do you just set the [inaudible] get something out?
>> Chao Weng: In our work, we're just adding -- so, here. This is just the same as this one; we're adding recurrent connections to certain hidden layers, which means it's just the same as this. For the forward propagation process -- for an efficient learning process, we just do it in minibatch mode. We have a minibatch of input, and then we just do the propagation with these equations. Once we approach the recurrent hidden layers, we have to do it frame by frame, because the current input needs the feedback from the last training frame. Then, once we have enough outputs within the corresponding minibatch, we do the propagation in minibatch mode again.
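[Editor's note: a minimal numpy sketch of the forward pass just described; layer sizes, the sigmoid nonlinearity, and all variable names are assumptions rather than the actual implementation.]

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_minibatch(X, W_in, W_rec, W_out, h_prev):
    """X: (T, D) minibatch of T consecutive frames (with spliced context).
    Returns pre-softmax outputs and the last recurrent state."""
    # Feed-forward projection: one matrix product over the whole minibatch.
    Z = X @ W_in                                   # (T, H) pre-activations

    # Recurrent hidden layer: must go frame by frame within the minibatch,
    # because frame t needs the feedback h from frame t-1.
    H = np.zeros_like(Z)
    h = h_prev
    for t in range(Z.shape[0]):
        h = sigmoid(Z[t] + h @ W_rec)
        H[t] = h

    # Remaining feed-forward layer(s): back to full minibatch mode.
    Y = H @ W_out                                  # (T, n_senones), pre-softmax
    return Y, h
```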
>>: So with recurrent language models compared to n-gram language models, with n-gram
language model, you take a big chunk of history, like two or three words of history, and then you
use that to predict what the next word is going to be.
>> Chao Weng: Yes.
>>: For recurrent language models, you only look at one word at a time. It reads in one word,
updates its hidden state and then predicts what the next word's going to be, reads in one word at a
time. In your original picture, there was a big chunk of frames, a lot of context that was going in.
>> Chao Weng: Yes.
>>: Is it still using the same chunk of --
>> Chao Weng: Yes, the same, like we --
>>: -- or like the language model, where you're going down to one frame at a time.
>> Chao Weng: Yes, we're going to use the same context offsets.
>>: That is a good --
>>: Yes, I understand.
>> Chao Weng: Yes. And so, yes, for the language model part, although we just take one word, the [indiscernible] we're going to stop in the -- yes.
>>: So how many history -- how many frames in the history do you feed into the [indiscernible].
>> Chao Weng: Minibatch mode. I will explain later. Yes. For the backpropagation part, first we need to have a loss function, which could be the negative cross entropy, and we can also use some discriminative training criteria. Then we take derivatives of the loss function with respect to the X subscript N, where the subscript means the index of the hidden layer. This derivative is the error signal we backpropagate to the previous hidden layers. For the feed-forward hidden layers, we can do the backpropagation in minibatch mode, but once we approach the recurrent hidden layers, we have to do backpropagation through time, which I will detail later.
>>: So is that a typo, or is it you are doing this type of recurrent -- the recurrent only comes
from the output or from the hidden layers. So looking at the X, I would expect to have Xt and Xt
minus one somewhere on the right-hand side.
>> Chao Weng: You mean this equation?
>>: Yes.
>> Chao Weng: No, it has two portions, right?
>>: Yes, so do you have the --
>> Chao Weng: This one is from the last frame [indiscernible], and this one is from the previous hidden layers.
>>: Okay, so is the output --
>>: Why is the output of the hidden layer?
>>: Oh, I see, not the output of -- I see. Okay, okay.
>> Chao Weng: Okay, this one is after nonlinearity, and this vector is before nonlinearity.
>>: So the output of the neural network for the target is denoted somewhere else.
>> Chao Weng: Yes.
>>: Okay.
>> Chao Weng: And with the error vectors propagated to a certain hidden layer, we can just evaluate the gradients to update the weight matrix, and here the subscript means the index of the training frame. So here, the 1 to capital M means we just concatenate the corresponding error vectors into matrices. When capital M is the number of training frames, we are doing batch gradient descent; if we only have one training frame and we update the weight matrix after seeing each individual frame, we are doing SGD; and minibatch SGD is kind of a compromise between the two. Practically, a reasonable minibatch size makes all the matrices fit into GPU memory, which leads to a more efficient learning process. For the recurrent hidden layers, we have to do backpropagation through time, and actually, for the exact gradient, we would have to run this process back to the very beginning of the corresponding training [indiscernible]. But for minibatch backpropagation through time, we typically just truncate the whole process within each minibatch, which means we just stop the backpropagation process once we reach the boundary of the minibatch. And we can see that in this case, for a certain error signal, we actually need two error signals: one is from the upper hidden layer, in this equation, and the other is the error vector from the future frame, in this case, which means we have to do it frame by frame for this kind of minibatch backpropagation through time. One effect is that, if we consider the error vectors for certain frames, say T plus one and T minus one, then within each minibatch the error vector corresponding to each individual training frame is actually backpropagated through a different number of time steps. Consider both: say now we have a minibatch of size three; this error is backpropagated through two time steps in this case, if we stop here, and for this case, we actually just backpropagate one time step. So for more efficient backpropagation through time, we introduce the so-called truncated minibatch backpropagation through time, which means for each individual gradient, we truncate the backpropagation through time at a fixed number of time steps, and really, the number of time steps could be like four or five. One benefit of doing this is that we can actually change the order of summation when we evaluate the gradients of the recurrent weights, so this part corresponds to a minibatch gradient, and this part and this part can both be evaluated in minibatch mode.
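[Editor's note: a rough numpy sketch of the truncated minibatch BPTT just described, under assumed shapes and a sigmoid nonlinearity; it accumulates the recurrent-weight gradient over the minibatch and backpropagates every frame's error the same fixed number of steps.]

```python
import numpy as np

def truncated_minibatch_bptt(H, E, W_rec, bptt_steps=5):
    """H: (T, H) recurrent-layer outputs for the minibatch.
    E: (T, H) error vectors arriving at the recurrent layer from above.
    Returns the accumulated gradient for the recurrent weight matrix W_rec."""
    T, _ = H.shape
    grad_rec = np.zeros_like(W_rec)
    for t in range(T):
        delta = E[t] * H[t] * (1.0 - H[t])        # sigmoid derivative at frame t
        # Backpropagate this frame's error a fixed number of steps into the past,
        # instead of stopping at the minibatch boundary as standard minibatch BPTT does.
        for k in range(1, bptt_steps + 1):
            if t - k < 0:
                break
            grad_rec += np.outer(H[t - k], delta)  # contribution to dL/dW_rec
            delta = (delta @ W_rec.T) * H[t - k] * (1.0 - H[t - k])
    return grad_rec                                # apply one SGD update per minibatch
```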
>>: Yes, so let me clarify. So the minibatch size now is roughly, after each occasion each of the
[indiscernible]. Normally, people do minibatch certain times because they can parallelize it.
>> Chao Weng: No, minibatch -- so we take the minibatch size as like 256, and for each individual online gradient, we're going to do four or five time steps, each individual.
>>: That's just approximately for the gradient.
>> Chao Weng: Yes, because in the standard way, say, for the last frame of the 256, it's going to do backpropagation through time for maybe 255 steps, right, for that last frame. But in our case, we just do a fixed number of time steps, four or five.
>>: So when you do the minibatch, BPTT, the motivation of doing that, is it mainly for the
parallel computing?
>> Chao Weng: That's just for one reason.
>>: Otherwise, you would just use the whole sentence. Because you don't think about --
>> Chao Weng: Yes, that's just one reason. Another reason, as I mentioned, is that the error vector corresponding to each frame is backpropagated through a different number of time steps, which means the frames influence the gradients with different contributions. And the thing is, in the end, we can also see that this gives better empirical results. So we do the experiments on two data sets. The first is the CHiME challenge and the second is Aurora-4. The CHiME challenge is also a noise robustness [indiscernible] I think released last year, and it is essentially a reverberant and noisy version of the Aurora-4 data, with maybe more challenging SNR conditions. We get the first alignments from the GMM-HMM system as [indiscernible], and then we train the first DNN system with the alignments we got from the GMM system, and here we use 40-dimensional log Mel-scale filterbanks as features. We also do generative pre-training, and the DNN has seven hidden layers, with each hidden layer having 2046 hidden units. For the learning rate scheduling, we use 256 as the minibatch size and 0.008 as the initial learning rate, we shrink it by half when the frame accuracy improvement on the dev set is below 0.5%, and we stop when the frame accuracy improvement is less than 0.1%. Then, with the first trained DNN system, we just do a realignment, and with the new alignments generated from the DNN system, we train another DNN system. We do this iteratively, until the gains from realignments become saturated.
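[Editor's note: a small sketch of the learning-rate schedule just described -- start at 0.008, halve once the dev-set frame-accuracy improvement drops below 0.5%, stop below 0.1%. The train_one_epoch and frame_accuracy helpers are hypothetical placeholders.]

```python
def train_with_schedule(model, train_set, dev_set,
                        init_lr=0.008, halve_below=0.5, stop_below=0.1):
    lr = init_lr
    halving = False
    prev_acc = frame_accuracy(model, dev_set)                   # hypothetical
    while True:
        train_one_epoch(model, train_set, lr, minibatch_size=256)  # hypothetical
        acc = frame_accuracy(model, dev_set)
        improvement = acc - prev_acc                            # absolute percent
        if halving and improvement < stop_below:
            break                       # improvement < 0.1%: stop training
        if improvement < halve_below:
            halving = True              # improvement < 0.5%: start halving
        if halving:
            lr *= 0.5
        prev_acc = acc
    return model
```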
>>: Can I ask a question [indiscernible] again. So I'm not in this field. So you have to
understand, I'm from programming languages, but I do have a question about this. It seems to
me, based on this slide, that there's a lot of parameters in these systems in terms of things you
could vary. How many parameters do you estimate there are in this particular problem, or this
particular set of ->>: Hyper parameters.
>>: Whatever you call them. Free parameters.
>> Chao Weng: So I think a lot of the parameters are kind of -- how to say -- very [indiscernible] parameters. Like, say, the 256 minibatch: people use either 256 or 512, this kind of minibatch size. This is not so critical in our experiments, actually.
>>: Okay, I guess that's the question. I mean, there's two points. One is how many different
parameters there are, and the second one is how sensitive are the results to the parameters? So if
you could give me that sense, that would be great.
>> Chao Weng: For me, the only parameter that has a lot of influence -- I think this one cannot be too large, not beyond, like, 512, because in that case maybe it's not easy to use SGD to train a good neural network. And the second part is this: the initial learning rate, of course, is critical for SGD. For the others, you can easily shrink by 0.6 or 0.7 instead; I tried that, and it didn't make so much difference, actually. This just controls, when you approach the local minimum, what kind of speed you want to proceed with from a certain point. In the end, it just converges to almost the same point.
>>: Is it there are many parameters that are suppressed here?
>>: And this parameter business is a disease that's just endemic to speech recognition.
>>: And neural networks.
>>: Although I think Arul [ph] will not disagree with me if I say that it's a sin that's shared
equally by machine translation.
>>: In the end, for me, and I have an analogy in the programming language space. Basically, in
programming languages, you build on top of architectures, so the architecture has a lot of
parameters, very similar to this, right? And you can show a result and you can say this is 5%
faster or 2% faster or whatever, given two different configurations, but in the end, it depends on
their underlying architecture and the assumptions you make about it. And a lot of times, in fact,
in our field, many results that seem to be improvements actually are not because the underlying
assumption is that all these parameters are fixed a certain way. But if you change them just a
little bit, you don't get your 2% improvement, you actually get a decrease. So I'm just curious if the
same thing happens in this space and how you resist that.
>> Chao Weng: Yes, I think there are two purposes why I list all the parameters here. The first is that we want people to be able to redo the experiments, so they can do the experiments and get similar results; that's the first thing. And the second thing is, for me, from my experience, some parameters are more like a zero-one thing. If you change this to, say, 1,000, maybe your system cannot even be trained in the end. It's not about whether this is better or that's better; it's kind of a zero-one thing.
>>: But just looking at that number -- Dong probably would know better -- in our old code for DNN, for a 256 minibatch size the initial parameter was 0.2, and after it goes through a few iterations, it jumped up to 0.0, so --
>> Chao Weng: The thing is, are you talking about the minibatch or the learning rate?
>>: So this is the initial number, this is --
>> Chao Weng: The learning rate, right? In this case, I'm talking about the minibatch. Yes.
>>: And then just another [indiscernible].
>> Chao Weng: Yes, we just watch the frame accuracy on the dev set, and then we kind of shrink the learning rate accordingly.
>>: [Indiscernible].
>>: So it is using a shrinking mechanism, where each time it drops --
>>: Oh, so it's --
>> Chao Weng: Yes, exactly.
>>: So that's a slower thing than [indiscernible].
>> Chao Weng: Yes, these are the DNN systems we have trained, and one, two, three correspond to applying the realignments and training the system again and again. The best result we get is 29.89% word error rate, which is actually the baseline system we use to compare with the recurrent deep neural networks. We use the same realignments as for the base DNN systems, and for the parameter initialization of the feed-forward layers, we just copy the weights from the DNN training after five epochs, and for the recurrent parameters, we just do random initialization. Because we copy the weights from five training epochs, we shrink the initial learning rate to 0.004. And we do both the standard minibatch backpropagation through time and the introduced truncated minibatch backpropagation through time. This is the best DNN system we have, and for the first recurrent deep neural network system we update the recurrent weights using the standard minibatch backpropagation through time, and actually we cannot get much gain in this case. But once we use the introduced truncated minibatch backpropagation through time, we can get almost two points of absolute word error rate reduction. And here, the one, two, three, four systems do not correspond to realignments; they correspond to adding the recurrent connections to different hidden layers. And, yes, I think maybe you know the state-of-the-art system on this task, reported last year from MERL, and the best result reported is 26.86, but they get this result using discriminatively trained language models and also MBR decoding. If we use the same language models, we almost get state-of-the-art performance, but we haven't done any model-space or speaker adaptive training.
>>: So what method did [indiscernible] use to get that as result? What method did they use?
>> Chao Weng: They used kind of discriminative training, both in the GMM, yes. Yes.
>>: But it looks like you end up with 22.8, which is a lot better.
>> Chao Weng: No, no, I will explain here. I will explain how we achieve these two systems. For these two systems we assume that we have the clean version of the noisy speech, so -- yes, there is a data assumption -- which means everything else is almost the same, except that we use alignments on the clean speech as the labels when we train the DNNs.
>>: So in that case the MERL system can do the same thing -- they can splice it or whatever with the stereo data; they're probably going to improve as well, if you make use of that stereo data --
>> Chao Weng: Yes, but in that challenge, I think this is kind of prevented. You cannot make the stereo data assumption.
>>: So this [indiscernible] indicates the alignment of the frames is so critical.
>> Chao Weng: Yes, yes, that's why I show this. Yes. So you can see there is not much difference, right, compared to when we only have --
>>: So can you stop for a second. Can you solve this problem?
>> Chao Weng: I haven't tried. That's in my future work, actually. And also, in this case, our
DNN 5 system also got improvements.
>>: So can you explain why the standard minibatch can have some better performance than --
>> Chao Weng: Yes, so the first thing, I think the main thing, is that in the standard minibatch backpropagation through time, the error vectors evaluated at different frames are actually backpropagated through different numbers of time steps, which means the corresponding individual training frames have different contributions to the final gradients. But in our system, for each individual training frame, we use a fixed number of time steps for the backpropagation through time, which means the frames have almost equal contributions to the final gradients.
>>: So [indiscernible] you don't -- so there is the language model case, where sometimes we do it [indiscernible] every word. Once you have a new word observation, you do BPTT.
>> Chao Weng: Yes.
>>: And on the next word, you also do BPTT, so you do BPTT and you also update the --
>> Chao Weng: Yes, that's the online way. Yes, I think that's pretty standard in recurrent neural network language models, yes.
>>: But in the minibatch, in the truncated minibatch, the minibatch you actually mentioned is -- you just accumulate on it.
>> Chao Weng: Yes. We don't update right after each frame. We accumulate within each minibatch, then we update once. Yes.
>>: So it looks like the take-home message here is that you get 2% improvement when you go
from a feed-forward network to a recurrent neural net, and that's a little less than 10% relative
improvement.
>> Chao Weng: Yes.
>>: But it's pretty good.
>> Chao Weng: I think maybe for this task, two points absolute, I think it's already good, yes.
>>: Well, the [indiscernible] GMM is the standard here.
>> Chao Weng: Yes, but I think --
>>: That's from [indiscernible].
>>: I would say performance is the same, I don't know [indiscernible].
>>: And also adaptation.
>>: But not [indiscernible].
>>: Also sequence-level training.
>> Chao Weng: So if you read that paper, obviously, that system is very highly tuned, and --
>>: So is everything on it.
>> Chao Weng: Everything -- I think, in my view, they use almost every state-of-the-art technique within Kaldi, almost everything, I think. And --
>>: But [indiscernible]. I mean, Kaldi -- why, do they have this deep-learning model in there, right?
>> Chao Weng: They already have it. They already have it. This is done in Kaldi, yes.
>>: So the MERL system also used Kaldi, or --
>> Chao Weng: Yes, yes. They used all the Kaldi techniques, and I think -- actually, I talked to Shinji, and they do have DNN results, if you see the papers. Actually, they are much worse than these numbers. Yes.
>>: So what's the -- is there something here that shows -- the table says minibatch versus
truncated minibatch. I don't see that. I just see numbers. I don't see a difference. So is this the
truncated?
>> Chao Weng: This one is not truncated. This one.
>>: Oh, the first one.
>> Chao Weng: Yes, this one, and it's just standard minibatch backpropagation through time.
>>: So my other question is, do you believe -- so do you believe your results about truncation to
a small number of frames -- you're only going backpropagate -- your truncate is only going
through like four or five steps or something.
>> Chao Weng: Yes, four or five. Yes, just truncate.
>>: And then you stop. Do you think that would be true if you only had one frame? I mean, do
you think that's related to the context window you're putting in?
>> Chao Weng: I think it relates to the fact that we are using four or five frames of left and right context, and these really have temporal correlations, so I think that might be one of the reasons it works.
>>: Do you mean four or five frames?
>> Chao Weng: Yes, context frames, both left and right.
>>: So I don't [indiscernible], so this minibatch BPTT you are comparing, it just accumulates the error [indiscernible]. It doesn't update the weights for every new frame? So it just accumulates it.
>> Chao Weng: It accumulates them within each minibatch and then updates per minibatch, not per frame. Yes.
>>: Because how do you [indiscernible] the whole training procedure? Because I think you actually initialize it and use it at the end.
>> Chao Weng: Yes, I think I mentioned here, we copy the weights from the DNN training.
>>: So this may affect how you train the recurrent connections. If you just use BPTT, then both the other weights and the recurrent weights may be treated similarly. But if you do it the other way, you may have different weights.
>> Chao Weng: Yes, and I think we also discussed that -- I think several epochs is okay. I also tried -- once we used the fully converged weights at the end, it doesn't make any improvements.
>>: Do you have any thoughts -- is it the task? So if another experiment were to [indiscernible] on large vocabulary tasks where the recurrent connections didn't help.
>> Chao Weng: I haven't tried that, actually, because this is mostly about noise robustness. I haven't done [indiscernible] or Switchboard yet.
>>: So how difficult is CHiME compared with Aurora-4?
>> Chao Weng: Okay, so CHiME has also five or six conditions, I think, from minus six to nine dB, and in Aurora-4, I think, all the SNR conditions are above zero, right? It's like maybe --
>>: So in this case, how do you reconcile this result, in comparison with, I think, a paper written recently by Dong and Mike applying DNNs to Aurora-4.
>> Chao Weng: Yes, I will have a result on Aurora-4. Actually --
>>: So are they comparable?
>> Chao Weng: Yes, the result is here. This is the result I got in the Aurora-4.
>>: That's better, much better than the best.
>> Chao Weng: No, no, no. I think this number is not better than what Dong and Mike got, because the --
>>: No, I'm talking about using a DNN without even using a recurrent network.
>> Chao Weng: You mean this number?
>>: Yes, that number is better than the best GMM.
>> Chao Weng: But the thing is, Dong and Mike are using bigram language models. I'm using
trigram language models. They got maybe -- what's the best number, maybe 13.4, right?
>>: I think it's now at 12.
>> Chao Weng: For the DNN.
>>: That's better than the best.
>>: Without dropout, without noise-aware training, it's --
>> Chao Weng: It's 13.4, I guess.
>>: Yes, doing nothing but --
>> Chao Weng: Yes, but here, we use trigram.
>>: [Indiscernible] or better.
>>: That's comparable, but then you do dropout and the noise-aware training, and some of those numbers are 12.5 or something, 12.6.
>>: You mentioned you are using different language models.
>> Chao Weng: Yes, yes. I think that's critical, because Dong and Mike use bigram. I am using
trigram. It's not a fair comparison, yes.
>>: It's more robust [indiscernible].
>> Chao Weng: Yes, I think trigram works much better than the bigram, really, yes.
>>: Everyone can use a trigram, but the standard test is a bigram.
>>: We have about 20 minutes left. We'll be sure to derail your talk, so whatever you think is most important, you should probably --
>> Chao Weng: Okay, I think maybe the second part I'm going to just skip, because this is work I have done at Microsoft Research during last summer, and maybe I think most of --
>>: Some interviewers have actually never seen this.
>>: Yes, it could be interesting.
>>: I mean, there's 20 minutes left. So you just --
>> Chao Weng: Okay, then, just quickly. I'll have to be quick. So the problem is that we want to
solve the problem of doing speech recognition in the presence of a competing talker, and this is closely related to a challenge issued in 2006. At that time, the IBM superhuman 2006 system had the best result, and it's already kind of beyond what humans can do on this task.
>>: Do you have an example of this kind of speech? Do you have an example of this kind of
speech, for people who aren't familiar with it?
>> Chao Weng: I can show simple transcription in the experiment.
>>: Or maybe you can describe a little bit about what it sounds like?
>> Chao Weng: So typically, the training set is just single-talker, clean speech, and for both the development and testing data, the speech contains two speakers speaking at the same time, in different conditions -- say, speaker one has high energy, or speaker two has high energy. And for the scoring procedure -- because the grammar of this task is not that complicated, it is simply a small vocabulary that consists of maybe six parts, I guess, and the second part is the color. So the scoring procedure is like this: the target speaker always speaks the color white, and the final performance is based on the word error rate on the numbers and the letters spoken by the target speaker. Yes, I think I just have to skip
here. The main idea is that we use the multi-style training strategy, which means we try to create new training data -- because, originally, we have clean single-talker speech, we kind of have to create new training data that can be similar to what will be observed at test time. So we try different multi-style training setups: we have high- and low-energy signal models, we can also use a similar setup to train high- and low-energy front-end de-noisers, and we can also use a speech separation criterion. The one potential issue of all these three setups is that, once the two mixed speech signals have similar energy or similar speech, these models are supposed to perform very badly. So later on, we just tried training models based on the energy of each individual frame, but in this case, we need to determine which of the two DNN outputs belongs to which speaker at each frame. So in this case, we just
introduced a joint decoder here, where the joint decoder tries to find the best two state sequences in the two-dimensional joint state space, and each dimension of the joint state space corresponds to one speaker. The key part of this joint decoder is the joint token passing, or joint token expansion, on the two HCLG graphs. For the joint token passing, suppose now that the token for speaker one is at state one and the token for speaker two is at state two. For the non-epsilon outgoing arcs, when we expand, we just take the combination of the two, so two arcs each would give four outgoing arcs. But for the epsilon outgoing arcs, it's kind of tricky, because an epsilon outgoing arc does not consume any acoustic frame, so we have to create a new joint state. Here we can see the first is the one-two joint state, and we have to create a new joint state three-two here and do the similar token expansion. A potential issue of this joint decoder is that we allow the louder talker to switch frame by frame, and to overcome this, we introduce a constant penalty along certain paths when the loudest signal has changed from the last frame. A better strategy is that we can train a DNN to predict whether an energy switching point occurs, and we use this posterior probability as a form of the penalty.
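[Editor's note: to make the joint token expansion a bit more concrete, here is a rough Python sketch under stated assumptions. The graph and arc interfaces (epsilon_arcs, emitting_arcs, pdf, next, weight), the frame_scores function combining the two DNN outputs, and the single-pass epsilon handling are all simplifications, not the actual decoder.]

```python
def expand_joint_tokens(tokens, graph, frame_scores, switch_penalty):
    """tokens: dict mapping (s1, s2, louder) -> best path cost so far."""
    new_tokens = {}
    for (s1, s2, louder), cost in tokens.items():
        # Epsilon arcs on either side consume no acoustic frame: just create a
        # new joint state such as (s1', s2) or (s1, s2').
        for eps_arc in graph.epsilon_arcs(s1):
            relax(new_tokens, (eps_arc.next, s2, louder), cost + eps_arc.weight)
        for eps_arc in graph.epsilon_arcs(s2):
            relax(new_tokens, (s1, eps_arc.next, louder), cost + eps_arc.weight)
        # Non-epsilon arcs: take the Cartesian product of the two sides; each
        # pair consumes the current frame, scored by the two DNN outputs jointly.
        for a1 in graph.emitting_arcs(s1):
            for a2 in graph.emitting_arcs(s2):
                for who in ("spk1", "spk2"):           # which talker is louder now
                    pen = switch_penalty if who != louder else 0.0
                    new_cost = (cost + a1.weight + a2.weight + pen
                                + frame_scores(a1.pdf, a2.pdf, who))
                    relax(new_tokens, (a1.next, a2.next, who), new_cost)
    return new_tokens

def relax(tokens, state, cost):
    # Keep only the best (lowest-cost) token per joint state.
    if cost < tokens.get(state, float("inf")):
        tokens[state] = cost
```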
Yes, this is just the data set we are working on, and I think I already explained it earlier, so this is the experiment result. These two systems correspond to training both the GMM and DNN systems only on the clean single-talker speech, and we can see that the word error rate is very high, especially in some challenging conditions. Once we use the multi-condition high- and low-energy models, we actually do very well in the corresponding conditions here. Then we combine these two systems using the rule that the target speaker always speaks the color white, but we're still far behind the IBM superhuman system. The reason, actually, is that we do very badly when the two mixed speech signals are very similar.
>>: I'm not sure I got that. So the first row there, DNN, is you just trained on all the data
indiscriminately. There's one DNN that decodes everything, and that's it.
>> Chao Weng: Yes, this is trained on the single-talker speech.
>>: Right. DNN I is trained on the dominant speaker. Is that right?
>> Chao Weng: Yes, and we actually trained it on the newly created training set, yes.
>>: And so it does very well on the dominant speaker, where there's a high signal-to-noise ratio,
but somehow the numbers were so bad for the other ones that you didn't put them on there.
>> Chao Weng: Yes, but the purpose of DNN I is to focus on the signal that dominates.
>>: I understand. And then DNN II is the reverse, where it was trained on the quiet speaker, and
there you see the reverse, that it's doing very well on the nine dB. And DNN I + II, is that with
the joint decoding?
>> Chao Weng: Not joint decoding. It's just that we're using the rule that -- we're just looking at the second output, whether it's white or not white. If it's white, we're just using that as the --
>>: So you're having that other DNN that's saying pay attention, use the dominant speaker
model or use the quiet speaker model.
>> Chao Weng: Yes, exactly.
>>: And you're combining those two in that one. Now I've got it.
>> Chao Weng: Yes, yes. And then, this is the result where we trained the deep front-end de-noiser, and this is the result where we trained using high- and low-pitch signal models, but we're still -- we're actually doing better on the zero dB case, but not so good in the other conditions. This system corresponds to the joint decoder, and we can see that without any penalty introduced, we are slightly better than the IBM superhuman system; with Joint Decoder I, where we introduce the constant energy-switching penalty, it is slightly better; and then this corresponds to the system with the adaptive penalty. One observation is that the joint decoder-based systems do very well when the two mixed speech signals have very similar energies, say zero dB or minus three dB, and the DNN I + II systems do very well when the two mixed signals have very large energy differences, say six dB, or minus six dB, minus nine dB. So we just combine these two using the front-end de-noisers: typically, with a certain test utterance, we just propagate it through the high-energy and low-energy front-end de-noisers, and we compute the energy ratio. If this ratio is beyond a certain threshold, we just use the DNN I + II system, and once this value is below that ratio threshold, we just use the Joint Decoder II system. The threshold is determined on the dev set. For the combined system, the final number is 19.7. Yes, the third goal is that we want to
propose a robust WFST-based semantic decoder, and there is also some previous good work here. A lot of methods do topic spotting based on 1-best ASR outputs, and for more robust topic spotting, people began to use linguistic features from the lattice. Because we are working in the WFST framework, we are looking for a solution using WFSTs, and here that is the n-gram rational kernel, which typically just maps the WFST first to an n-gram feature space and then employs the inner product. What it typically does is just use the counts of the common n-grams contained in the input transducers. Note that here the count is actually the expected count, because a WFST actually defines a distribution over multiple alternative strings. But the n-gram rational kernel ignores the fact that many n-grams are semantically related, and it assumes a uniform contribution from all n-grams to the topic discrimination. This is an illustration of how we can efficiently evaluate the n-gram rational kernels: we typically just construct a certain counting transducer, and in the end the n-gram rational kernel just consists of several compositions. So, if we consider the WFST as a distribution over documents, a lot of text analysis techniques can be applied here, and -- I think I will just skip here. We define the latent semantic rational kernels, the general form of LSRK, in these two forms: one is based on the transducer M, and one is based on the transducer S. The transducer M actually encodes the transform which maps the high-dimensional n-grams to a low-dimensional space, and this can be learned in an unsupervised fashion, which means you can bring in more training text corpora to enhance the speech topic spotting application. The second form, with transducer S, actually encodes the term-to-term similarity matrix, and here you can bring in more ontology-based knowledge.
One simple example: let S be the matrix form of the transducer S. When this matrix equals the identity matrix, the LSRK degenerates to the n-gram rational kernels. If you put the inverse document frequencies on the diagonal of this matrix, you are effectively doing tf-idf weighting; we can also build S with LSA techniques, or evaluate the term-to-term similarity purely from WordNet. These special cases are sketched below.
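Here is a small Python sketch of those special cases in matrix form, reusing expected-count vectors like the ones above. The particular matrices and numbers are made up; in the actual work S is realized as a transducer and the kernel is evaluated by composition rather than by explicit matrix products.

```python
import numpy as np

def lsrk(phi_x, phi_y, S):
    """Latent semantic rational kernel in matrix form: phi_x^T S phi_y,
    where phi_* are expected term-count vectors and S is the
    term-to-term similarity matrix."""
    return float(phi_x @ S @ phi_y)

# Toy vocabulary of three terms and two expected-count vectors (made up).
phi_x = np.array([0.7, 0.7, 0.3])
phi_y = np.array([1.0, 0.0, 1.0])
V = len(phi_x)

# 1) S = identity: degenerates to the plain n-gram rational kernel.
S_identity = np.eye(V)

# 2) Inverse document frequencies on the diagonal: tf-idf weighting.
idf = np.array([1.2, 0.4, 2.0])            # hypothetical idf values
S_tfidf = np.diag(idf)

# 3) A full term-to-term similarity matrix, e.g. from LSA or WordNet.
S_sim = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])

for name, S in [("identity", S_identity), ("tf-idf", S_tfidf), ("similarity", S_sim)]:
    print(name, lsrk(phi_x, phi_y, S))
```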
In the following I will mainly focus on how we can generalize LSRK using probabilistic topic models. For PLSA, the number of latent topic variables grows with the number of documents, so one needs to pay attention when learning its parameters. LDA cleans these issues up by introducing a global Dirichlet prior such that all latent variables can be integrated out, both in the inference and the learning process. But the downside of LDA is that the learning process cannot be expressed in closed form, so you need variational or sampling-based methods. For the learning that is okay, because we can always do the parameter learning in offline mode, but for inference we want the spoken topic spotting system to respond as quickly as possible, so in this case we use the same folding-in process as in the original PLSA work. In the E step we keep the learned probabilities of words given topics fixed, so it is just an E step and an M step to learn the probability of each topic given a document, or in our case given an input transducer. A minimal matrix-form sketch of this folding-in is shown below.
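Here is a minimal matrix-form sketch of the folding-in EM, assuming the word-given-topic probabilities have already been learned offline; all numbers are invented. In the actual system the document is an input transducer and these same steps are carried out with the transducer compositions described next.

```python
import numpy as np

def fold_in(word_counts, p_w_given_z, n_iters=50):
    """Estimate P(z | d) for one new document with P(w | z) held fixed.

    word_counts: length-V vector of (expected) word counts for the document.
    p_w_given_z: K x V matrix of learned word-given-topic probabilities.
    """
    K = p_w_given_z.shape[0]
    p_z_given_d = np.full(K, 1.0 / K)                     # uniform initialization
    for _ in range(n_iters):
        # E step: responsibilities P(z | d, w), topic-word probabilities fixed.
        joint = p_w_given_z * p_z_given_d[:, None]        # K x V
        p_z_given_dw = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M step: re-estimate only the document's topic proportions.
        weighted = p_z_given_dw @ word_counts             # length K
        p_z_given_d = weighted / (weighted.sum() + 1e-12)
    return p_z_given_d

# Toy example: 2 topics, vocabulary of 3 words (all values made up).
p_w_given_z = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
doc_counts = np.array([5.0, 1.0, 0.0])   # expected counts, e.g. from a lattice
print(fold_in(doc_counts, p_w_given_z))
```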
One rigorous use of the topic models is that Fisher kernels can be derived from the probabilistic topic models, and they consist of two parts. In the following I will just give the main ideas of how we can do this. First, you need to define the transducer M, which encodes the probability of each word given each topic; these parameters can be learned either with PLSA or with LDA. Here is an example: we have the vocabulary A, B, C and a two-topic index, so we just have two topics. The input labels correspond to words, the output labels correspond to the topic index, and the weights correspond to these probabilities. Then we also define another transducer whose input and output labels are both topic indices, and whose weights correspond to the probability of each topic given the input transducer. With these definitions, the E step can mostly be done by composition with these two transducers, plus certain extra normalization steps, and the M step can be done by the combination of three transducers, again with certain normalization steps in between.
In the end, the whole algorithm looks like this, and its output is the learned probability of each topic given the input transducer. With this in hand, we can easily derive the two parts of the Fisher kernel, as just shown; a hedged sketch of one standard form is given below.
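For reference, here is a hedged Python sketch of one standard form of such a Fisher kernel, following Hofmann's PLSA derivation; the exact formulation used in the talk may differ in its details. It takes the folded-in topic posteriors and the expected word frequencies of two documents and returns the two parts of the kernel.

```python
import numpy as np

def plsa_fisher_kernel(pz_d1, pz_d2, pw_d1, pw_d2, p_z, p_w_given_z, eps=1e-12):
    """Two-part Fisher kernel for PLSA (one common formulation, assumed here).

    pz_d*      : length-K topic posteriors P(z | d) from folding-in.
    pw_d*      : length-V empirical / expected word frequencies of each document.
    p_z        : length-K topic prior P(z).
    p_w_given_z: K x V matrix of word-given-topic probabilities.
    """
    # Part 1: agreement of the documents' topic proportions.
    k_topic = float(np.sum(pz_d1 * pz_d2 / (p_z + eps)))

    # Responsibilities P(z | d, w) for each document.
    def resp(pz_d):
        joint = p_w_given_z * pz_d[:, None]               # K x V
        return joint / (joint.sum(axis=0, keepdims=True) + eps)

    r1, r2 = resp(pz_d1), resp(pz_d2)

    # Part 2: word-level agreement, weighted by how the topics explain each word.
    inner = np.sum(r1 * r2 / (p_w_given_z + eps), axis=0)  # length V
    k_word = float(np.sum(pw_d1 * pw_d2 * inner))

    return k_topic, k_word
```

The two parts can then be combined, for example by summation, before being passed to the kernel classifier; that combination choice is an assumption here, not taken from the talk.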
We first ran the experiments on a subset of Switchboard, where we had to filter out some utterances, because many utterances are not appropriate for the topic spotting application. We used the n-gram rational kernels as the baseline, and once we have the rational kernels we just use SVMs to do the classification. Looking at the accuracy, we can see that the n-gram rational kernels get around 28.2% accuracy, and once we use the latent semantic rational kernels with LSA and tf-idf weighting, we get a large improvement.
>>: So what part of this is due to all the transducer stuff? What part of this is due to the
transducers? For example, if I do latent Dirichlet allocation and then I just look at the likeliest
topic of a document according to LDA, what accuracy is that?
>> Chao Weng: I will show that later. Yes, here. And this is the result. When we do LSA, we use WordNet to do semantic expansion, which means that when we first form the term-document matrix, we use WordNet to do semantic smoothing, and this shows the influence of whether we do the semantic expansion.
>>: So it looks like the kernel method is really bad in all cases.
>> Chao Weng: No, these are all kernel methods.
>>: When you use the rational kernel.
>> Chao Weng: These are all rational kernels, but the rational kernels are based on different models.
>>: Oh, I see.
>>: Why do I need any of that? If I just do LDA, what do I get? If I just do LDA, no kernels,
no transducers, no -- if I run LDA?
>> Chao Weng: What kind of representation are you going to use?
>>: So this is saying what the topic is, right? This is accuracy in identifying the topic.
>> Chao Weng: Yes.
>>: So LDA takes a document, so whatever your input is.
>> Chao Weng: And you would just do the inference, variational inference?
>>: And you do variational inference, it gives you a probability distribution on the topics.
>> Chao Weng: Yes, I think it might be a little bit better, but you need time to do the inference, right? And what kind of input would you take: the lattice, the n-best, or the 1-best?
>>: I see, so this is operating -- the whole point of this is that it's operating on a lattice, not on
just a single 1-best.
>> Chao Weng: Yes, and also the lattice output from the speech recognition system.
>>: And with the n-best, you can do each of those and select the best one.
>> Chao Weng: Yes, you can do that, but usually, if you look at the n-best, only a very limited number of words are different.
>>: [Indiscernible]. So even if you have a lattice, you switch them on the back of --
>> Chao Weng: Yes, that's why we first have to use the transducer T to extract all the n-gram
features. Yes.
>>: So are the probabilities on n-grams or on words?
>> Chao Weng: On word, actually.
>>: So why are you doing n-gram?
>> Chao Weng: In the end we might need some kind of extension, so we use n-grams. Yes. And also, recently I have been doing some work on neural-network-learned word representations; in that case, we may use the n-grams, yes.
>>: So is the topic on the sentence? A sentence can have many words, so for the topic you are spotting, for these sentences, you have just one topic?
>> Chao Weng: Yes, only one topic; we didn't consider multiple topics. Then the second experiment is based on the How May I Help You 0300 corpus. This is the result using only LSA and tf-idf weighting, and one interesting result is this one, which corresponds to doing the LSRK purely with WordNet, and the topic models get the best result. And I think I will just skip the summary.
So I think I'll just talk about some future work. First, we want to do the sequential discriminative training using non-uniform criteria so that we can combine these three. And I am also working on a kind of LSRK that incorporates neural-network-learned word representations. That is basically the future work, and then the acknowledgments. First, I want to thank Dan Povey at Johns Hopkins for the MCE implementation, and also David Thompson and Patrick Haffner at AT&T Labs for the LSRK work. Thanks also to Mike and Jasha for the mentoring, and to Geoffrey, Frank and Kun, who was actually an intern here last summer, for their suggestions and advice. And thanks to Shinji for providing the training data and some advice on the recurrent deep neural network work. I think that's pretty much it. Yes. Too long?
>>: Yes, but I just want to verify: for kernel methods, typically the limitation is that you cannot use very large amounts of data, because the kernel matrix becomes the square of the size of the whole training data.
>> Chao Weng: Yes, that's one. And the other limitation is once you have unbalanced data samples for each class, maybe as we --
>>: That means that --
>> Chao Weng: I haven't done that, because the main focus is that we want to find the form of kernels, not the SVM part.
>>: I see, okay. So the kernels that you are using here, the rational kernels, do they have the same limitations?
>> Chao Weng: You mean in terms of --
>>: The rational kernel, the rational kernel.
>> Chao Weng: In terms of the unbalanced data samples?
>>: Also in terms of a square kind of --
>>: Scalability.
>>: Scalability. Both, both.
>> Chao Weng: I don't think so -- I haven't found this kind of issue. I mean, the efficiency is all right. You are saying the kernel matrix can be very large, but once you have learned the parameters, the M transducer or the S transducer, the topic spotting can be done very efficiently. The learning can always take time, but in my case it takes maybe half a day to learn a transducer M.
>>: For that task. But suppose the data is increased tenfold. Then do you think the same thing --
>> Chao Weng: I still think that's not a problem. I still think that's not a problem. Okay, thanks.