>> Geoffrey Zweig: Alright everyone, so it’s a pleasure to introduce Vijay Peddinti today. I think
everyone here knows Vijay. Vijay was a star intern here working with Mike. He’s a star student at Johns
Hopkins University where recently he won the best paper award at last year’s Interspeech, turned in the
winning system for the ASpIRE challenge last year. He’s got this long record of super successes. He’s
going to tell us today about some of the technology behind them.
>> Vijay Peddinti: Good morning everyone. At Johns Hopkins University I work with Dan Povey and
Sanjeev Khudanpur. Most of this work was done in collaboration with them. Today I'm just going to talk
about multi-rate neural networks, which I'll try to convince you are very good for efficient acoustic
modeling. These become important in the current scenario where we have a lot of data and not enough
time to train models on this data.
Before I start the actual talk I would like to motivate the research goals behind pursuing this current line
of work. I basically concentrate on what is called the distortion-stable sequence recognition problem. For
the sequence recognition part of this work we basically use models which are capable of tackling long-span
temporal dependencies in acoustic data, which would be any model like an LSTM or BLSTM. For the
distortion-stable part we basically rely on the training data being very representative of the possible
operating scenarios in which the model is going to be used.
This essentially means that in most cases we have an immense amount of training data. We are
faced with the challenge of reducing the training time and also the decode time, because if the models
get really huge it's going to be a big problem in terms of latency during decode.
In order to reduce the training time there are several different techniques. One of them is
to just use a distributed optimization algorithm. But it turns out that even with distributed optimization,
if each of your instances can operate at a really fast speed you do get significant speed ups. That's the
part I'm going to concentrate on.
Models which are capable of fast training and really low-latency decoding are the main focus here. I'll
try to convince you during the course of this talk that multi-rate neural networks are a really
good solution for these problems.
The outline of the talk is going to be like this. I'll initially discuss the problem. Then I'll discuss a
few of the current approaches that are used to tackle this problem, which belong to two different classes:
feature based approaches and model based approaches. Then I'll actually introduce the multi-rate idea.
I'll show the application of this idea in both convolutional neural networks and recurrent architectures.
But most of the results right now are just on the convolutional architecture part of it.
What exactly is the issue in sequence recognition? There are a variety of reasons you have long-span
dependencies in input sequences like a sequence of speech vectors. One reason is the inherent
way in which speech is generated, which is by an articulatory system that has physical constraints:
there cannot be rapid switches between different articulatory states. You have coarticulation effects,
where speech across a wide window is influenced by speech sounds from different phonemes
across several time steps. You also have reasons like reverberation, which are environmental
reasons where you have delayed, scaled versions of the speech signal adding up to the actual
speech being recorded at the microphone.
Finally, you have various kinds of channel distortions where the effects can actually be non-linear. In
some adversarial cases like speech encryption you actually have frequency scrambling techniques
where these distortions can be quite severe.
In order to tackle all these problems we need to design models which are going to remain stable to any
of these distortions. During the past two decades there have been a variety of approaches towards this
problem. Most of these approaches can be classified as feature based approaches.
In these approaches people quickly realized that you cannot just splice a large chunk of speech frames,
dump them into your acoustic model, and hope that your acoustic model is going to learn the necessary
transforms. The reason is that as you increase the span of your feature vector sequence you have
increased variance in the input representation, which is inherent. In addition to that, because of the
non-linear temporal warping that happens in speech, this variance increases further. In order
to learn transforms which are invariant to such highly variant input representations you need a lot of
data.
Later on I'll show you that irrespective of the amount of data it's always going to be a problem. Initially,
the approach towards this was to design representation functions which are going to guarantee some
form of stability to a variety of temporal distortions. This stability is usually defined as a Lipschitz
continuity of the feature transform.
Some of the most widely used approaches which follow this particular line of thought are what are
termed multi-scale feature representations. In multi-scale feature representations you are basically
analyzing features at several different resolutions. A lower resolution feature stream is going to remain
stable to local distortions in the input pattern. But the problem with these really low resolution feature
streams is that they also discard a lot of information which would be very important for classification.
People try to solve this problem by building acoustic models which operate on different feature streams
and combining the judgments from these acoustic models using a variety of techniques. For
example, these two plots are generated by two different multi-resolution techniques. The first one is a
simple Gabor filter based approach. Over here you have three different plots. The first one is the really
coarse feature. You can imagine that any kind of temporal shift or frequency shift in the acoustic events
over here is not going to impact that representation to a great extent.
On the other hand you have two representations with very fine resolution in terms of frequency over
here. Any shift in the acoustic event over here is going to immediately show up in this representation. If
you have an acoustic model which is looking at all these representations simultaneously, or a bag of
acoustic models that are looking at these representations, you can hope to be stable to distortions
while preserving the fine resolution information necessary for classification.
This was one particular approach: over here we are basically taking the input representation and passing it
through a filter bank which has filters of different resolutions. This has also been done in a different
way where instead of using a filter bank you basically use a hierarchy of filters, or what's called a
[indiscernible]. You get a different style of representation. But the critical idea over here is to basically
decompose your representation into multiple resolutions.
In addition to this there are also other kinds of techniques to capture this long term information
in the acoustic signal. One of the really important techniques which has been shown to be very
influential for adaptation is what we call permutation invariant representations. These are
representations which do not preserve any information about the actual sequence of data, but they
capture long term stationary information in the signal, like say the speaker characteristics or the
environment characteristics. This kind of information, which gives us long term detail, is very
popularly used in the form of i-vectors or mean noise estimates.
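[To make the idea concrete: a minimal Python sketch of a permutation-invariant summary, using just the per-utterance mean and variance of the frames as a stand-in for richer statistics such as i-vectors; the array shapes are illustrative assumptions, not from the talk.]

    import numpy as np

    def utterance_summary(frames):
        """frames: (num_frames, feat_dim) array of short-term features (e.g. MFCCs).

        Returns a fixed-size vector that is unchanged by any reordering of the
        frames, so it only captures long-term, stationary information
        (speaker / channel / environment), not the sequence itself.
        """
        mean = frames.mean(axis=0)
        var = frames.var(axis=0)
        return np.concatenate([mean, var])

    # Example: shuffling the frames leaves the summary unchanged.
    feats = np.random.randn(500, 40)                 # 5 seconds of 40-dim features
    shuffled = feats[np.random.permutation(len(feats))]
    assert np.allclose(utterance_summary(feats), utterance_summary(shuffled))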
But there is an alternate approach to tackle the same problem of long-term sequence modeling,
which is the model based approach. How exactly do model based approaches deal with this problem?
The model based approaches assume that you just have a sequence of short-term instantaneous
representations, which is something you see in an MFCC feature vector sequence. Given that you have
this sequence of short-term representations you are trying to model the long-term temporal dynamics.
We already know that we cannot basically splice a long window and dump it into the DNN.
In order to tackle this problem most of the effective models basically learn temporally local transforms.
Each transform is always just looking at a finite chunk of the feature sequence, but the neural network as
a whole is able to process information from a wider context. There are two strategies to achieve this. If
you are using a convolutional architecture you can basically do hierarchical filtering: you can stack
a series of convolutional layers. Each time you go deeper into the network you see a wider context. But
because this wider context has gone through a layer of filtering, its resolution has reduced to a certain
extent because of pooling and other things. The network as a whole is looking at the wider context, but
each transform is temporally local or looking at a lower resolution representation.
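[As a rough illustration of how stacking temporally local layers widens the context one output sees, a minimal sketch; the kernel sizes and strides below are made-up values, not the ones used later in the talk.]

    def total_context(kernel_sizes, strides):
        """Number of input frames seen by one output of a stack of 1-D conv layers.

        Each layer only looks at `kernel` consecutive frames of the layer below,
        but the receptive field of the whole stack grows with depth.
        """
        context, jump = 1, 1
        for kernel, stride in zip(kernel_sizes, strides):
            context += (kernel - 1) * jump
            jump *= stride
        return context

    # e.g. three layers, each looking at 5 frames, with pooling/stride of 2:
    print(total_context([5, 5, 5], [2, 2, 2]))   # -> 29 input frames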
On the other hand you can deal with this problem in the recurrent architectures, where when you are
learning a transform on the input it's temporally local but the long-term information is being captured in
the state vector. This state can be derived using a task dependent loss function. It will preserve
the necessary detail if we are able to train it in the right way.
We are interested in these two kinds of architectures. In both these architectures
there is this additional detail that not only are the transforms temporally local, but they're also tied
across time steps. Whether you look at a convolutional architecture or a recurrent architecture, the
same transform is being used across time steps. This shared statistical strength ensures that you are
learning better transforms compared to the simple DNN based approach.
Because recurrent architectures are so popularly used in tackling this problem, let's initially look at
the recurrent architectures. If you just take a simple vanilla RNN, it has several problems, one of which
is basically limited memory: it's not able to memorize a lot of detail in the past. It also has problems
of vanishing and exploding gradients.
In order to deal with these issues several different kinds of architectural variants have been proposed. But
despite the use of these architectural variants RNNs have this very basic problem where the loss
function that we use to train the RNN cannot be computed in parallel. That is, if you are
computing the outputs of an RNN, all these outputs cannot be computed in parallel. They have
to be evaluated sequentially.
In order to work around this problem people normally use batching of sequences. They basically create the
loss function from several different sequences in parallel. But despite the use of all these tricks the
fastest feed-forward networks are always faster than an RNN. This is the reason we decided to
approach the sequence modeling problem initially using convolutional architectures.
>>: But why does it necessitate sequential processing of the outputs as opposed to just the hidden state
vector? Why couldn't you…
>> Vijay Peddinti: I definitely can’t say [indiscernible], yes.
>>: Okay.
>> Vijay Peddinti: Not the actual posterior outputs.
>>: Yeah, okay.
>> Vijay Peddinti: In order to tackle that problem we actually wanted to explore convolutional
architectures. We wanted to make sure that convolutional architectures were not sufficient for
sequence modeling before we shifted to the RNN architectures.
As I described before, convolutional architectures have a variety of advantages. The main
advantage over here is that if you have the hidden activations at different time steps at a particular
layer, all these activations can be computed in parallel, which is going to give you a huge speed advantage
during training. But the problem over here is if you want to increase the context of the network you
have to compute more hops in your [indiscernible] convolution. At the same time the network as a whole
would have to process this wider context information.
There is going to be a linear increase in the number of parameters and also a linear increase in the
computational cost as you widen the context in convolutional architectures.
A very simple way to deal with this problem is to basically use subsampling. Essentially we are not
computing all outputs at any given layer but just a few of them. You can think of it like this. If you have a
filter with a fixed context and you're not computing every output within the input context of the
filter, then basically the input dimension of that particular transform reduces, which essentially reduces
the number of parameters.
Also, because you're not computing everything in the convolution layer, you have a speed advantage.
This is nothing new, and people normally use this uniform subsampling strategy where they hop their
filter by a specified stride. It is popularly used in all the CNN architectures. But what we want to
achieve is a subsampling which is far more severe than what you have from a uniform subsampling
strategy.
There have been other explorations of non-uniform subsampling strategies. These were previously
done in hierarchical neural networks where you have two independent neural networks cascaded, but
both of them are trained independently. Over there you have the advantage of actually decoding the
output of the lower level neural network, looking at the individual frames, and deciding which of them
are important and which are not so important.
One of the strategies people used was to basically run a first pass decode over the output of the lower
level neural network and just select one frame for every context dependent state that you have. But in our
case the problem is that we have a normal DNN where every layer is trained jointly. You
cannot assume to have any understanding of frame importance ahead of time.
In order to come up with a non-uniform subsampling strategy in this particular case, we designed our
problem like this. We are given a desired context that we want to model and the number of convolution
layers that we want. The number of convolution layers is the number of sequential operations that are
going to happen in your network, which will limit your speed.
We then choose the filter context, that is, the context of the transform at each individual layer, and the
time steps at which these convolutions have to be performed. Because we had no idea of how this had to
be done we came up with a criterion. The criterion was that we want to ensure the propagation of all the
information from the input sequence when we are trying to compute one single output.
We wanted to test whether, if we design a network that satisfies this criterion, we are going to take a
performance hit in terms of the convolutional neural network performance. There are several ways to
find a path from the lower level sequence to the output sequence while ensuring that we have
a minimum number of activations. But we also have our other constraint. If you remember, our convolution
layers have to be always temporally local or operate on lower resolution information.
This gives us another constraint that as we go deeper into the network we are going to increase the
context of the network. It turns out that we can just linearly increase the context of the network and
basically just evaluate the output at the edges of the filter.
In this particular diagram you can see that previously, if you had a normal convolution network, you
would have computed every output at each layer. That would have meant all the computations
represented by lines of all colors, including black. But if you are using the current subsampling
strategy you would just compute the colored, non-black lines at each layer. You can see the
drastic reduction in terms of computation.
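[A small sketch of this non-uniform subsampling, written in terms of per-layer splice offsets. The offsets below mirror what is described here and later in the talk — a dense first layer and sparse offsets like {-7, 2} higher up — but the exact values are only illustrative.]

    def activations_needed(splice_offsets, t=0):
        """Which time indices must be computed at each layer to produce one
        output at time t (splice_offsets lists the lowest layer first)."""
        needed = [{t}]
        for offsets in reversed(splice_offsets):          # walk top layer -> input
            prev = needed[-1]
            needed.append({time + off for time in prev for off in offsets})
        return [sorted(s) for s in reversed(needed)]      # input first

    # Dense lowest layer, then increasingly sparse (subsampled) splices:
    layers = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2]]
    for depth, idx in enumerate(activations_needed(layers)):
        print(f"layer {depth}: {len(idx)} activations at {idx}")
    # Only 7, 4 and 2 activations are needed at the hidden layers, instead of
    # one per frame of the 23-frame input context.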
>>: I have a question about that.
>> Vijay Peddinti: Yeah.
>>: Okay, I guess I have two questions. Well, let me start with the more interesting one. Do you think
you could get an effect similar to drop out by having, say, two different window lengths at each
level? Here the orange nodes are taking two inputs?
>> Vijay Peddinti: Yeah.
>>: Every orange node has exactly two inputs, so there's a filter that takes two things. At that same level
do you think you could, you know, flip a coin and say I'll have a window that
also takes five inputs instead of the two? When you get to a node that you've selected to compute
its value at, at that point you flip a coin. You say, will I use the filter that takes two inputs or will I use the
filter that takes five inputs? You'd do that at every level. You'd essentially be training over multiple
architectures. Do you think that would be analogous to drop out, or that it would increase the
robustness in a similar way?
>> Vijay Peddinti: It would definitely be analogous to drop out, where you have multiple filters in your
CNN at that particular level. But it would essentially increase the computational cost.
>>: [inaudible]
>> Vijay Peddinti: Yeah, you either have a two input filter or a five input filter.
>>: Actually it wouldn’t increase the computational costs. Because it would double the number of
parameters but at each node that you select to evaluate you either use one filter or the other filter.
>> Vijay Peddinti: If you have a wider filter, then in this particular case, when I want to compute
the output at one particular frame in a frame randomized training scenario, you know ahead of time
that you just have to compute these outputs. You'd just be doing that.
But if you had, let's say, a filter with five inputs over here, that would essentially mean that I would need
many more computations at each of the levels beneath it. If I were making that decision at a
really low convolution layer it wouldn't affect a lot of things. But because of the dependencies among
the activations it could lead to a…
>>: You’re right it would happen at two. It’s like being a partition in nodes into the set that’s using the
two inputs and the set that’s using the five inputs and you just do different matrix multiplies.
>> Vijay Peddinti: Yeah, and the other thing we want to ensure is that we need to have an
understanding of which architecture to use both during training and decoding. What you say might be
true, that if I randomly switch the activations that are used to update this filter I might actually have a
more robust transform. But during decode time I need to make a decision on what to use. That would
be an additional problem I would have to solve. I'm not saying it's not possible but, go ahead.
>> Geoffrey Zweig: Did you have another question?
>>: Oh, the technical one, yeah. On the orange ones there, why doesn't it go two, two? The
pattern's irregular. It's almost as if it had gone in from both sides.
>>: It’s regular if you look from the top down.
>>: Right, look at it like a…
>> Vijay Peddinti: Yeah, so over here the activations are being selected like this. At this level I need a
filter which, so the basic point over here is that the dependencies in speech are asymmetric: I always
need filters which have a wider left context. This essentially means that because of the
asymmetry there can be some non-uniformly placed samples. The way we draw this chart is we select a
particular filter and we basically decide what activations are necessary to compute the
output of the top filter. That's how this diagram works.
>>: No, I don’t buy that because I think the reason why it’s asymmetric or why you have this overlap in
the orange layer there is that because it’s not the power of two. Your expansion I think is the reason,
right?
>> Vijay Peddinti: The expansion over here is basically linear. Over here at the initial level
I have a filter which has a context of three plus one. Over here I have a context of six plus one. On
top of that we have a context of nine plus one. Depending on how exactly I am choosing…
>>: [inaudible] circles, there are seven, and if it were eight then you would have a perfect binary tree
[indiscernible], right? My question is…
>> Vijay Peddinti: There are actually eight. It’s just that I’m sharing one of those computations…
>>: That’s my question. Is that necessary for accuracy or is that simply…
>> Vijay Peddinti: No, no.
>>: I need to map it to do something with that.
>> Vijay Peddinti: In particular, it doesn't matter what exact offsets you're using as long as
you're ensuring the propagation. Over here it just happens because of the indices I chose. I chose
indices of minus seven and two at the top layer.
>>: Yeah.
>> Vijay Peddinti: If instead of that I had chosen minus six and three, this would be perfectly uniform;
at each level the nodes would be uniformly placed. It turns out that these small independent
changes do not affect the performance to a great extent.
>>: Would it make a difference, for example, where you put the one input that is used multiple times,
right? The middle node on the yellow layer there is used twice.
>> Vijay Peddinti: Yeah.
>>: It goes up through two paths. Would it make sense to make the one that goes up through two paths
be the current frame, the center frame that you're predicting, which it isn't here, right?
>> Vijay Peddinti: Yeah, here it isn’t.
>>: It seems you’re over representing a little bit of left projects that doesn’t need to be warranted,
right.
>> Vijay Peddinti: Yeah, there were other subsampling methods where we always ensured that the
center frame is present. In order to reduce the parameters we use other tricks when we are taking
context: we would just take a subset of filter nodes. Whenever you ensure certain things, like
ensuring that there is a center frame, or, if you think the left context is important, always
ensuring that there is more representation from the left context, you do get better performance.
>>: One more question goes to Jeff. Are interview questions covered by NDA, or do you think the drop
out idea just sounds really good and we should patent it?
[laughter]
>>: I think they are.
[laughter]
>> Vijay Peddinti: The big advantage, so these are the actual computations that you would do in this
convolutional neural network. By the way, another thing I want to make clear: because we
want to actually compare convolutional architectures and recurrent architectures, we're not talking
about convolution along frequency but just along time. We actually have frame transforms
which are operating on the entire feature vector at a given instant. But these frame transforms are
being shared across time steps.
Over here you can see that there is a drastic reduction in terms of the actual computation that has to be
done. But we still have to see whether this drastic reduction in terms of the number of activations that
are computed is going to affect our performance, which I'll discuss later on. It turns out that using this
particular subsampling strategy we were able to gain up to a five times speed up compared to a normal
TDNN without any form of subsampling.
Now for the basic setup. We started doing experiments on the Switchboard LVCSR task,
just the three hundred hour subset version. I'll briefly describe the setup because I'll be using
this particular setup at several places in this presentation.
We basically use a forty dimensional MFCC representation, which is essentially a filter bank but just with
a DCT transform. We do that because we want to reduce the bandwidth usage when we are
transferring data across nodes. In order to do what is termed instantaneous speaker adaptation we
append a hundred dimensional i-vector for every time frame.
One of the main focuses of the technology that we develop is that we want to ensure that this technology
is always ready for online decoding. All the i-vectors that are being used over here are estimated
using only data that has been seen up till that position in the sequence. Most of the systems that we use
over here are trained with cross entropy. Whenever there is sequence [indiscernible] training I'll
mention it specifically.
All these experiments were basically done using the Kaldi ASR toolkit. We use a distributed
optimization technique where we do model averaging across instances.
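[A minimal sketch of that model-averaging step across parallel jobs; the parameter shapes and number of jobs are illustrative, and the actual Kaldi recipe pairs this with natural-gradient SGD, which is not shown here.]

    import numpy as np

    def average_models(models):
        """models: list of parameter dicts {name: np.ndarray}, one per parallel job.

        Each job runs SGD on its own shard of data for one iteration; the
        averaged parameters are then redistributed to all jobs for the next
        iteration.
        """
        averaged = {}
        for name in models[0]:
            averaged[name] = sum(m[name] for m in models) / len(models)
        return averaged

    # e.g. four jobs, each holding a weight matrix on a 140-dim input
    # (40 MFCC + 100 i-vector) after one iteration of training:
    jobs = [{"layer1.W": np.random.randn(512, 140)} for _ in range(4)]
    new_params = average_models(jobs)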
>>: Why is there no [indiscernible]?
>> Vijay Peddinti: Yeah, so as I said, one of the goals was to ensure that we always had a really good
online system. It turns out even when we are doing online mean-variance estimation there is a certain
amount of latency which is not necessarily due to the model but just due to the mean estimation.
We just wanted to test whether there was a way to eliminate any kind of mean-variance
normalization and ensure that the network learns whatever normalization is necessary. When we
estimate our i-vectors we estimate them on non-mean-variance-normalized features. If there is any
mean offset it is captured and encoded in the i-vector. It turns out that by using our
training recipe the network usually performs about as well as a network which has mean-variance
normalization.
>>: [inaudible] need to introduce latency if you want to estimate i-vector?
>> Vijay Peddinti: Yeah, as I was saying, over here we ensure that the i-vector is always estimated using
data seen till that particular frame. If you have a frame at a given time step, only the data up till that
frame is being used.
>>: It’s an online i-vectors?
>>: Online i-vector.
>> Vijay Peddinti: I do not claim that this is better. It's all towards the same goal that we want a really
fast online system which doesn't have any latency that is not absolutely necessary for the model.
>>: That just means not having the sort of gain normalization that you get from mean subtraction.
>> Vijay Peddinti: We have other solutions for tackling that problem. Yes, that is a really good
question, because if any kind of gain difference was well represented in the training data our model
would be robust to it. But most of the training scenarios that we have use well curated
databases where the gain is perfectly normalized, so we do see issues. In order to tackle those problems
we do perturbation of our data. As soon as we do that we eliminate most of the issues.
>>: When you’re training neural network the mean-variance normalization also helps with this
stochastic gradient to converge better? You’re giving that up. How do you, so how do you deal with
that?
>> Vijay Peddinti: In our case we don't actually see problems like that. We haven't found any such
issues, possibly because of the non-linearities that we are choosing.
>>: [inaudible] business setup [inaudible].
>>: Natural gradients stuff that, does that help?
>> Vijay Peddinti: I cannot say [indiscernible] again because I haven't tested it out. But there is one
additional phase over here; yeah, there is nothing explicitly done to tackle this problem.
>>: One more question. You said [indiscernible] normalization on other [indiscernible]. But I think
what you're talking about [indiscernible], that's completely independent of that. You can still do that
and you will not lose anything, because that only affects the training. It's not something you're
estimating during the [indiscernible], okay. Are you doing it?
>> Vijay Peddinti: No.
>>: Is it also non-global meaning.
>>: As I remember that’s it for the set, the number two.
>> Vijay Peddinti: Yeah.
>>: It actually adds to some kind, close to the [indiscernible] averaging.
>> Vijay Peddinti: Yeah.
>>: You did it because of that.
>> Vijay Peddinti: We haven’t tested it all…
>>: You’re online doesn’t have, let’s say if you run standard model with one GPU it wouldn’t do mod
averaging. It wouldn’t have it based on numbers. This is your baseline.
>> Vijay Peddinti: Yes, so what we see is if you run it on a single GPU you usually get better
performance rather than distributing and doing an average. But the performance hit is not very drastic.
>>: What’s the [inaudible]?
>> Vijay Peddinti: We heard [indiscernible], but in the cases that I did see I would basically get
something like a point one or point two percent difference on the Switchboard three hundred hour
subset. We don't usually worry about that.
>>: The p-norm non-linearity is not as sensitive, what is that?
>> Vijay Peddinti: Over here, for the p-norm non-linearity, in the hidden activation
vector we select groups of ten units, which are basically contiguous sets of ten units. We take some norm
of each group; in this particular case we were taking the l2 norm of these activations. These groups
do not assume any kind of spatial or temporal locality. We basically select sets of
ten units in the hidden activation vector and compute the norm. This was basically seen as a
generalization of the maxout non-linearity that's normally used.
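[A minimal sketch of that p-norm non-linearity as described — groups of ten contiguous units with p = 2 — assuming a plain numpy activation vector; the vector size is illustrative.]

    import numpy as np

    def pnorm(x, group_size=10, p=2):
        """Reduce each contiguous group of `group_size` units to its p-norm.

        Dimensionality drops by a factor of group_size; with p=2 this is the
        l2 norm of each group, a generalization of maxout (p -> infinity).
        """
        groups = x.reshape(-1, group_size)
        return np.linalg.norm(groups, ord=p, axis=1)

    hidden = np.random.randn(3000)      # hidden activation vector
    print(pnorm(hidden).shape)          # -> (300,)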
These are the first set of results that we had. Let me guide you through this table. The initial
comparison that we wanted to do was: given the same amount of context that we normally use with a
DNN, are we going to see any kind of advantage if we had a TDNN architecture?
In this particular comparison I'm taking the actual context of fifteen frames, one hundred and fifty
milliseconds, which is being modeled by a DNN, and distributing it across three different convolution
layers. I have convolution layers which have contexts of four, four, and seven. It turns out that even with
this distribution there is an appreciable gain. But this comparison was not exactly fair, because as soon
as you start using a convolution architecture there is a certain amount of parameter increase. We
conducted another experiment where we ensured that the DNN had the exact same number of
parameters as the TDNN. Even in this case there was a certain amount of increase in performance.
After this, the main goal of this set of experiments was to actually see whether we can
model longer sequences using the TDNN. In order to test that out we started modeling wider and wider
network contexts. It turns out that using the TDNN we were able to model context up to two hundred
and thirty milliseconds, in this particular close-talk telephone speech case.
Later on in our experiments we identified that this also depends on other hyper-parameters of
the network like the non-linearity. But the interactions are not clearly understood, so I'm not bringing
that into the comparison.
Using the TDNN we were actually able to model context up to two hundred and thirty milliseconds. Once
we knew that this context was helpful we actually wanted to see whether the DNN could also do
this. It turns out, as we expected, that anytime you increase the context of the DNN, given the
constrained amount of data that we have, which is three hundred hours in this case, there is always
going to be some kind of detriment in the performance.
>>: That’s a lot of [inaudible] one, two, three, four, five, six, seven, eight…
>> Vijay Peddinti: The…
>>: Which could have like three or four values.
>> Vijay Peddinti: Yeah, and across all these values we really don't see much variance in our
[indiscernible]. The thing to note is that these are not hyper-parameters which have so
many degrees of freedom. We initially select the first layer context that we want. This essentially
determines the context of the other two layers. All that we want is to adjust the offsets. The
adjustment of the offsets depends on the input context of the network that you want to model. If you
want to, say, model just ten frames to the right and three fifty frames to the past, the selection of these
contexts would differ. The other thing that we need to select is the number of convolution layers that
we want. That is usually decided based upon how many sequential steps you are able to tolerate
while you're training.
>>: I know you don’t want to hear it [indiscernible] but go ahead. But just as a ballpark the
[indiscernible] like what’s the best sounds like the [indiscernible] number is? Better than these, what’s
the best number of that grid but the best result for value?
>> Vijay Peddinti: If I remember right there was like a [indiscernible] percent difference on the
Switchboard subset.
>>: [inaudible]
>> Vijay Peddinti: Yeah, in absolute terms, on the Switchboard subset. I mean it was better but not
significantly better.
>>: [indiscernible] a question: these tables indicate some symmetry between the DNN and TDNN
using curly brackets and square brackets and so on. Is the first layer of the TDNN already
convolutional or still dense? The first one, the minus two, plus two…
>> Vijay Peddinti: That is dense.
>>: That’s dense.
>> Vijay Peddinti: Yeah.
>>: The first one is dense and you have all that.
>> Vijay Peddinti: The first one has to be dense because you have to span everything in the
initial sequence.
>>: Oh, because you always use the same filtering in the max pool [indiscernible]?
>> Vijay Peddinti: The reason we want to do that is because we want to propagate information
from every time step. That's the reason we had to make the very first filter dense.
>>: I see, okay, but they're not tied.
>> Vijay Peddinti: Everything is tied.
>>: Oh, okay that was my question.
>>: Like, so although you have the five…
>> Vijay Peddinti: Yeah, everything that is represented with the same color is computed from the same
transform. If an activation is represented with a color, it's computed from the same transform. All the…
>>: All the other nodes show the same [indiscernible], yeah.
>> Vijay Peddinti: Sorry, so at this point of time we basically wanted to compare our TDNN model with
the state of the art results on the Switchboard at that particular point of time. In order to do this
comparison we started adding a lot of bells and whistles to our baseline model.
One of the first things we did was basically adding pronunciation probabilities. Pronunciation
probabilities are basically: once you have a lexicon which has multiple pronunciations for every word
and you have your training data, after a certain amount of training you basically realign your training
data using this lexicon and see what is the probability of each pronunciation. The next thing we did was
basically a simple four-gram LM rescoring, and after that, as I said before, sorry.
>>: I’m wondering what’s v-perturbation, volume perturbation?
>>: Yeah, can you tell us?
>> Vijay Peddinti: Yeah, so as I said before, we are always interested in online technology. We do not do
any kind of speaker adaptive training of our system. This essentially means that we usually take a hit in
terms of performance. In order to compensate for this we basically present the same audio data to our
system while simulating synthetic speakers.
Normally, in order to simulate synthetic speakers, people have been doing vocal tract length
perturbation or tempo perturbation. But it turns out that both of these can be approximated using a
simple signal processing operation, which is basically compressing or expanding your signal. Whenever
you do compression you're perturbing the tempo, and because this compression modifies your frequency
content there is also a perturbation in the MFCC coefficients.
>>: You’re packing the sample rate? You’re not just changing the frame shift.
>> Vijay Peddinti: Yeah, we are changing the sampling rate, yeah. One of the biggest advantages of
this speed perturbation technique is that it's agnostic of the feature representation that you're using.
It turns out that that's a very critical consideration for many people who want to use the technique in
different systems.
>>: I’m following up on Frank’s question. You’re saying that you basically sometimes pretend that
you’re sampling at eight kilohertz sometimes at seven and a half kilohertz…
>> Vijay Peddinti: No, so we resample at different rates. But we always assume that the sampling rate is
eight kilohertz. That essentially gives you the stretching or the compressing effect.
>>: And accordingly you add the synthetic data?
>> Vijay Peddinti: Yeah.
>>: As do your [indiscernible] combine?
>> Vijay Peddinti: Yeah.
>>: Okay.
>> Vijay Peddinti: The volume perturbation is the thing that actually helps us tackle the mean and gain
normalization issue. We basically train our system using several different perturbations of the volume.
It…
>>: Volume of what?
>> Vijay Peddinti: The volume of the entire signal.
>>: Oh, so the gain, oh…
>>: The gain…
>> Vijay Peddinti: Yeah.
>>: In three dimensional volumes…
>>: Yeah.
[laughter]
>>: I still don’t understand. Now you just said you always assume its eight kilohertz.
>> Vijay Peddinti: Yeah.
>>: Are you, no but that’s…
>> Vijay Peddinti: You’re not also basically if you have a signal at x of t.
>>: Yeah.
>> Vijay Peddinti: You now want a signal x of alpha t, where alpha is some number…
>>: Okay, so…
>> Vijay Peddinti: In order to…
>>: That is maybe like seven point five kilohertz.
>> Vijay Peddinti: Yeah.
>>: Then you just resample it back to eight kilohertz and run it through the standard pipeline.
>>: But then you need not activate it.
>> Vijay Peddinti: Yeah, so as soon as you resample it at seven point five and pretend that it is eight
kilohertz you have the compression effect.
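[A minimal sketch of this speed perturbation trick — resample the waveform but keep calling it 8 kHz, which warps it like x(alpha t) — plus the volume perturbation mentioned next. In practice this is done with sox; the scipy polyphase resampler below is a stand-in, and the perturbation factors and gain range are commonly used values assumed here, not quoted from the talk.]

    import numpy as np
    from scipy.signal import resample_poly

    def speed_perturb(wave, alpha):
        """Return a copy of `wave` that behaves like x(alpha * t) when it is
        still treated as 8 kHz audio: resample by 1/alpha and keep the old
        nominal rate, which jointly warps tempo and spectral content."""
        up, down = 100, int(round(100 * alpha))       # rational approx of 1/alpha
        return resample_poly(wave, up, down)

    def volume_perturb(wave, rng):
        """Random gain, the trick used in place of mean/gain normalization."""
        return wave * rng.uniform(0.125, 2.0)         # gain range is an assumption

    rng = np.random.default_rng(0)
    wave = rng.standard_normal(8000)                  # one second of "audio" at 8 kHz
    for alpha in (0.9, 1.0, 1.1):
        print(alpha, len(speed_perturb(wave, alpha)))  # ~8889, 8000, ~7273 samples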
>>: Okay, so, yeah. Did you evaluate that separately? I’m sort of curious whether…
>> Vijay Peddinti: Yeah, so the [indiscernible] paper discusses that. It turns out that
across several different data sizes, and we tested data sizes up to eighteen hundred hours, the usual
gains are six to seven percent relative when we do speed perturbation.
>>: Volumes about the same range?
>> Vijay Peddinti: For volume perturbation, in most of the tasks that we have over here both the training
data and the test data are very well curated.
>>: I see.
>> Vijay Peddinti: You don’t see a lot of gain. But there are other tasks that I’ll present later on where
that essentially led to almost fifteen percent redo gain. Because in that case the gain differences in the
test data were several hundreds of DB.
>>: That’s fifteen percent of loss before because you’re [indiscernible] normalization, right?
>> Vijay Peddinti: Yeah.
[laughter]
>>: [indiscernible]
>>: That didn’t work.
>> Vijay Peddinti: Yeah, so the basic idea here is to do as much as you can while you are training, to
ensure that your decoding setup is as online as possible.
>>: [indiscernible]. I have a question. Why, I mean it’s great for the, yeah…
>>: Steal your stuff, different approach.
>>: That you took all this online approach. But I mean, as a [indiscernible], if you try to win the challenge
without that constraint, why pose that constraint?
>> Vijay Peddinti: No, so as I said, I also have some results from the challenge. When we initially started
with the challenge, if we were constraining ourselves to be online we were basically taking a ten percent
degradation, just because we are imposing the online constraint either when we are
estimating the i-vectors or when we are not doing any kind of normalization of our data.
>>: Okay, but why would you do that if you wanted to win the benchmark?
>> Vijay Peddinti: For the challenge, for the benchmark, we had to relax the constraint.
But the online thing is just a broader goal that we want to achieve for any technology that we are
building; we don't want to invest a lot of effort in doing speaker adaptive training where we require
three or four passes over the data.
>>: [inaudible].
[laughter]
>> Vijay Peddinti: Then we add what we call silence probabilities. Usually the probability of silence
occurring between two words is assumed to be uniform in a normal lexicon. But over here we wanted to
bring in the idea of the context of the silence impacting the duration or the probability of the silence.
It turns out that though the gains from this silence probability estimation are not very high in this
particular case, when we go to a robust speech recognition task we get significant gains because of that
particular thing. In…
>>: I don’t understand. What was it again?
>> Vijay Peddinti: The probability of a silence occurring after any word is normally assumed to be
uniform. But over here we are weighting the probability of the silence depending on the previous and
the next word.
>>: Word, so it’s like a Lotus model sort of?
>> Vijay Peddinti: Yeah.
>>: How do you train that?
>> Vijay Peddinti: Basically this was work done by one of my colleagues [indiscernible]. Over here, if we
actually introduce all these additional arcs in order to model all the different silence
probabilities, you have a V squared increase in the number of arcs depending on the vocabulary. He
basically partitioned this probability into two different parts, where you try to see what is the impact
just because of the previous word and multiply it by the impact of the next word. It's basically modeled
as an independent probability. There are several tricks over there in order to ensure that it's actually
easy to implement.
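[A rough sketch of that factorization: a global silence probability times independent per-word correction factors estimated from alignment counts. The data format and the lack of backoff/interpolation are simplifications, not the actual implementation.]

    from collections import Counter

    def silence_factors(aligned_word_pairs):
        """aligned_word_pairs: iterable of (word, followed_by_silence) pairs
        taken from training alignments.

        P(sil | prev_word, next_word) is approximated as
            P(sil) * F_prev(prev_word) * F_next(next_word).
        Only the "previous word" factor is shown; the "next word" factor is
        estimated the same way from (word, preceded_by_silence) counts.
        """
        sil, total = Counter(), Counter()
        for word, followed_by_sil in aligned_word_pairs:
            total[word] += 1
            sil[word] += int(followed_by_sil)
        p_sil = sum(sil.values()) / sum(total.values())        # global silence prob
        return {w: (sil[w] / total[w]) / p_sil for w in total} # per-word correction

    pairs = [("okay", True), ("okay", True), ("the", False), ("the", False), ("the", True)]
    print(silence_factors(pairs))   # "okay" boosts silence, "the" suppresses it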
On top of that we did sequence training. Further, there is one minor adjustment over here: on top of
the sequence training we do something called prior adjustment.
This is something which was found by one of my colleagues called [indiscernible]. When he was doing
semi-supervised training with neural networks, he identified that after sequence training, when we want
to estimate the likelihoods, instead of computing the prior based on the alignments, if you use the prior
which is estimated from the mean posterior of the network it gives a lot of gain.
Over here you can see that this gain is consistent. It turns out that after his finding we started using this
trick across a variety of [indiscernible] tasks and we saw consistent gains across all these tasks.
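[A minimal sketch of that prior adjustment; the arrays are illustrative, and in practice the averaging is done over a large sample of training frames after sequence training.]

    import numpy as np

    def adjusted_prior(posteriors):
        """posteriors: (num_frames, num_states) network outputs on training data.
        The prior is the mean output of the network itself, instead of the
        state frequencies taken from the alignments."""
        return posteriors.mean(axis=0)

    def pseudo_likelihoods(frame_posteriors, prior):
        """Scaled likelihoods used in decoding: posterior divided by prior
        (done in the log domain for numerical stability)."""
        return np.log(frame_posteriors + 1e-20) - np.log(prior + 1e-20)

    train_post = np.random.dirichlet(np.ones(100), size=5000)  # 5000 frames, 100 states
    prior = adjusted_prior(train_post)
    test_post = np.random.dirichlet(np.ones(100), size=10)
    loglikes = pseudo_likelihoods(test_post, prior)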
>>: [inaudible] made things worse. I had that in a [indiscernible] tool from the very beginning and I had
to switch it off because it didn't make things better, [indiscernible].
>> Vijay Peddinti: Okay.
>>: [indiscernible]
>>: Yeah, right, yeah.
>>: [indiscernible] for sequence [indiscernible].
>>: Ah, oh, right, excellent, okay.
>> Vijay Peddinti: At that point of time these were the papers with the best results on the Switchboard
three hundred hour set. We do see that we are definitely able to perform better than the unfolded
RNN implementation by [indiscernible]. But this is definitely not a fair comparison, or even a proper
comparison, because the setups are different in a variety of ways. Later on I'll give a comparison
between the TDNNs and the LSTMs which share a lot of the same things.
>>: The bottom line seems to be that with a [indiscernible] well trained single system you can get down
to like eleven percent…
>> Vijay Peddinti: Yeah.
>>: With the three hundred hour data?
>> Vijay Peddinti: Yeah.
>>: When you start combining things you can get down to the tens.
>>: But there’s no [indiscernible]…
>> Vijay Peddinti: This system also has speaker adaptation and offline i-vector information. This is one
single i-vector [indiscernible].
>>: Are there i-vectors in the TDNN?
>> Vijay Peddinti: Yeah, sorry I didn’t actually; I already mentioned that they were used in all the
systems that are…
>>: Incremental ones?
>> Vijay Peddinti: Yeah the incremental. The online i-vectors are used in every system that
[indiscernible].
>>: The bottom line does that have RNN?
>> Vijay Peddinti: This one, no.
>>: Yeah, no.
>> Vijay Peddinti: I’ll show you results with RNN later on.
>>: How much did the i-vectors [inaudible]?
>> Vijay Peddinti: The i-vectors usually give us, across tasks, around seven percent [indiscernible]. But if
we go to far-field recognition tasks like AMI or the other tasks that I present here we get bigger gains.
>>: In the incremental fashion?
>> Vijay Peddinti: Whenever we do, yeah even in incremental fashion we see gains. But if we turn off
the incremental thing and start doing offline estimation the gains get slightly better.
Once we did see that we were getting improvements because of the TDNN models, we started testing
them out on [indiscernible] tasks of different data sizes. We found that across all these tasks, which have
data ranging from three to eighteen hundred hours, we do get a five to seven percent relative reduction
in word error rate. There was just some disturbance in our Resource Management task, but we didn't do
a lot of tuning of that. These gains were consistent, and because of this we started using the TDNNs as
the standard DNN recipe in the Kaldi toolkit.
Now, the comparison between the TDNNs and the RNNs. After that paper was published we actually
wanted to compare locally with the best RNN numbers that we could get. It turns out that despite
the claims of the TDNNs being able to tackle long term sequence dependencies, they were
still performing worse than an LSTM.
Over here I have numbers on the Switchboard task, which is the telephone speech task, and the AMI
Single Distant Microphone task. In both of these cases you can see that there is degradation because of
switching from an RNN to a simple convolutional architecture.
>>: Why?
>> Vijay Peddinti: That is one of the things we are looking at right now. Over here you can see that the
best numbers that we have are because of a bi-directional LSTM. This inherently brings a lot of latency
into the model. Even if you do chunk based training there is a lot of computational overhead
because you'd have to compute the right state of the model.
Right now I'm making an effort where the left context is being modeled with an LSTM and the right side
context is being modeled with a TDNN. Essentially, if you want a twenty frame context to the right,
rather than directly dumping it into the recurrent model, right now the trick people use to model right
context with LSTMs is to basically predict the label with a delay. You would update the state of your
LSTM for five more time steps before you predict the output. Right now we are trying to combine these
two things to get to the performance of the BLSTMs.
>>: [indiscernible] another way, which is [indiscernible], would be to just go back from the right context
using the recurrent neural network, back to the central place you want to predict. That can also
model the right hand side context with limited latency.
>> Vijay Peddinti: Yeah, so that’s the chunk based, so that is similar to the chunk based training if I
understand that.
>>: Slightly similar not exactly the same though.
>> Vijay Peddinti: Yeah, so over here if you actually use TDNNs, most of the activations that you
compute for your right context could also be reused. In that particular case you'd have to redo the
propagation step every time you do a shift.
>>: No, no, no, the left hand side context is completely modeled using the normal recurrence of the
LSTM. You [indiscernible] history information, you don't do chunking. Only the right hand side is doing
chunking.
>> Vijay Peddinti: Okay, okay, so if you want to do it for a new chunk you would again redo the…
>>: Only to that [indiscernible].
>> Vijay Peddinti: Okay.
>>: Yeah, the left hand side [indiscernible].
>>: Is there an analogy in the LSTM, so in your model you guys have removed a lot of the
inter-frame computations from your model. You basically have this
[indiscernible]. Is there an analogy [indiscernible]…
>> Vijay Peddinti: Yeah.
>>: You know not every sample…
>> Vijay Peddinti: These LSTMs actually do that. In these LSTMs we do not just do a recurrence with a
step of minus one. The lowest LSTM has a recurrence of, let's say, a step of minus three. The LSTM on
top of that has a step of minus six. The one on top of that has a step of minus nine. If you
unfold this model you can see that we are not taking the states from all the lower layer outputs.
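[A minimal sketch of a recurrent stack whose layers feed their state back from 3, 6 and 9 steps in the past, as just described; for brevity a plain tanh recurrent cell is used in place of the real LSTM cell, and the weight dimensions are arbitrary.]

    import numpy as np

    def delayed_recurrent_layer(inputs, W_in, W_rec, delay):
        """Simplified recurrent layer whose state feeds back from `delay` steps
        in the past instead of the previous step.  (The actual model uses LSTM
        cells with gates; only the delayed recurrence is illustrated here.)"""
        T = inputs.shape[0]
        H = W_rec.shape[0]
        h = np.zeros((T, H))
        for t in range(T):
            past = h[t - delay] if t - delay >= 0 else np.zeros(H)
            h[t] = np.tanh(inputs[t] @ W_in + past @ W_rec)
        return h

    rng = np.random.default_rng(0)
    x = rng.standard_normal((50, 140))                     # 50 frames of input
    dims = [140, 64, 64, 64]
    delays = [3, 6, 9]                                     # as in the talk
    for layer, delay in enumerate(delays):
        W_in = rng.standard_normal((dims[layer], dims[layer + 1])) * 0.1
        W_rec = rng.standard_normal((dims[layer + 1], dims[layer + 1])) * 0.1
        x = delayed_recurrent_layer(x, W_in, W_rec, delay)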
>>: How does that compare to the…
>> Vijay Peddinti: It actually improves over the performance. This is the LSTM which has that. It would
be…
>>: It’s a separate thing if you’re going to you know [indiscernible] yourself. You should just show that
[indiscernible]…
>>: Is that…
>>: Okay where’s the…
>> Vijay Peddinti: The deviation is minor.
>>: Oh, I see.
>> Vijay Peddinti: It’s not substantially different. But the main reason you would do that is because you
would get speed cadence when you are doing sequence level objective training like CDs.
>>: What [indiscernible] did you use to implement the LSTM?
>> Vijay Peddinti: [indiscernible] that’s the reason we got like a later results which were done after the
paper.
>>: A question about that, the TDNN results. You said earlier on that by doing the subsampling trick you
don't lose anything. Is this confirmed on Switchboard, or did you evaluate that [indiscernible]?
>> Vijay Peddinti: I didn’t make the claim that I don’t lose anything.
>>: No…
>> Vijay Peddinti: I still had to verify that. I have experiments which do verify that.
>>: I’m curious where you would be, where this twelve point one number would be if you didn’t do the
subsampling?
>> Vijay Peddinti: Yeah, I…
>>: Actually I think the gap's bigger than I would expect. You don't have the number to…
>> Vijay Peddinti: In a later slide because that’s still preliminary work.
>>: Yeah.
>> Vijay Peddinti: I don’t want to make the really strong claim using that…
>>: Okay.
>>: What’s the reason you think LSTMs do performance better than TDNN? You think you get larger
context with it?
>> Vijay Peddinti: With the TDNN there is this additional hyper-parameter which we have to adjust,
which is the amount of left context that we need. This is usually done empirically. As I discussed before,
there are a lot of interactions in the model. When we switched the non-linearity from a p-norm to a
[indiscernible], it turns out that we could actually model a wider context. Rather than just going till two
hundred and thirty milliseconds I was able to go till three hundred milliseconds. This additional
hyper-parameter tuning brings in a lot of issues.
But because we still had a really fast TDNN, in this particular case I would like to make a comparison of
speed between the TDNN and the LSTM. Over here the TDNN is three times as fast as the LSTM. We
had this really fast model which is still able to tackle context of greater length. Sorry, go ahead.
>>: [inaudible] times faster [inaudible] seconds? What about an LSTM that's three times faster? For
example, if it has three times fewer parameters [indiscernible].
>> Vijay Peddinti: Over here all these models have a similar number of parameters. There is not much
difference in terms of the number of actual multiplies that are being done. We are just doing them
in parallel or sequentially. That's the major difference.
>>: I’m still curious how [indiscernible] that is as fast as TDNN?
>> Vijay Peddinti: I…
>>: I know it [indiscernible] about these comparisons about parameters equals…
>> Vijay Peddinti: The last time I gave this talk over here Jeff made some suggestions and that did lead
to some improvements. I’ll definitely test that out.
Given that we had this subsampled TDNN model, which operates very fast and at the same time is able
to train on a lot of data, we wanted to test it out on the far-field recognition task, which is known to have
a lot of sequential dependencies in the data.
What exactly makes the far-field recognition task hard in terms of dependencies is usually
the late reverberations. Late reverberations are reflections of the actual signal which arrive after a
hundred milliseconds. Usually the late reverberations are non-stationary, so noise robustness
techniques which assume a lot about stationarity, or which do not assume moving sources, are very
difficult to apply when we have reverberated signals.
The other problem with late reverberations is that they are the actual speech signal being added back to
the speech signal. You can assume that they are very correlated with speech. But in most of the normal
techniques people assume that these are not correlated because there is a sufficiently long delay
between the actual signal and the reflection. But that's still an assumption.
If you want to tackle late reverberations you can immediately see that any model which has the ability to
model wider context might be able to learn invariant [indiscernible] representations when we have
these. That's why we wanted to test the TDNNs in the far-field recognition task.
We decided to participate in the challenge which is called the ASpIRE Challenge. One of the
qualities of this challenge was very beneficial to us: there was a well defined mismatch
between the training data, the dev data, and also the eval data. The training data was just clean
telephone speech, eighteen hundred hours of Fisher data. The only information given to us was that the
test sets are going to be reverberated. There was no kind of metadata supplied.
The dev data that was supplied for us to test out our models was ensured to be sufficiently different
from the eval data. In the eval data there was a greater number of rooms, different microphone types,
different speakers, and some speaker-microphone relative positions were very different.
This was a really good task to test out a model which can train on a lot of data.
These are the sources of variability. These are the typical room configurations. If you are interested I
can discuss these further. Let me play some samples from this particular task.
[audio]
You should, there’s a blogger who started when he was like eleven years old, ten years old. He’s a video
blogger. I think it’s called FoodAudities.com and he just…
[audio ended]
That’s the typical example that Mary Harper who connected the challenge wanted us to play. This is the
lowest error rate that we had with the model per speaker error rate which is twenty-six percent. This is
what it sounds like.
[audio]
Huh, yeah, it’s interesting. I actually just recently found out that like some of those weaves are actually
real hair. Like I didn’t even know...
[audio ended]
You can see that the data is very easy for humans but very difficult…
>>: [inaudible]
>> Vijay Peddinti: Yeah, so it’s easy for humans. We had twenty-six percent error rate with our ASR
system. Now the third one it’s still very easy for humans but we started getting higher rate.
[audio]
Hi. How’s it going? Good, I’m Dave. Yeah, yeah, we’ve talked before actually. Yeah…
[audio ended]
This was a really hard sentence for us because we started getting a forty-one percent error rate. I
initially thought maybe it was because of the keyboard clacks and started verifying the actual
hypothesis. That was not the reason.
>>: But that low frequency pounding that's in there should not affect your recognition because you use
telephone filters, right?
>> Vijay Peddinti: As I mentioned, in the training data setup we use a lot of multi-condition simulation in
order to simulate reverberation, noise settings, and a lot of other things.
>>: It’s not just a specific thing. Are you using the full banquet for the MSCCs that you’re extracting? Or
are you using a telephone that’s…
>> Vijay Peddinti: No…
>>: [indiscernible] this should not affect it right this low frequency…
>> Vijay Peddinti: Yes, yeah, because there is a lot. This one was much harder for us.
[audio]
Hey how you doing [indiscernible]? Oh, I’m hanging in there. Um, you know take the…
[audio ended]
[laughter]
>>: Yeah, it…
[laughter]
>>: I told you…
>>: There’s that active speech.
>>: Yeah.
>>: There is the…
>> Vijay Peddinti: The key problem over there is that if we were applying gain normalization techniques they
would have actually amplified the signal a lot more, because there was a close [indiscernible] which was the
actual microphone moving, and there was a speaker in the background. A simple…
>>: [indiscernible]
>> Vijay Peddinti: This is the one I played.
>>: [indiscernible]
>> Vijay Peddinti: The second last one.
[audio]
Hi how you doing?
[audio ended]
I’m playing the second last one again.
[audio]
Oh, I’m hanging in there.
[audio ended]
[laughter]
Yeah, that’s the…
[audio]
[indiscernible] you know [indiscernible]…
[audio ended]
>> Vijay Peddinti: No, if you have a really good headphone and listen to it very carefully you’d…
>>: [indiscernible]
>> Vijay Peddinti: [indiscernible] signal several times before we actually heard it.
>>: It's like forty percent accuracy for this?
>> Vijay Peddinti: No, it was sixty percent error rate.
>>: Error I know.
>> Vijay Peddinti: Yes.
>>: That’s human [indiscernible]. Human can do that well at all…
>> Vijay Peddinti: Yeah. This was one of the things which was very tough for us. Our ASR never
generated any output for this.
[audio]
[indiscernible]
[audio ended]
Sorry?
>>: This is synthesized data, right?
>> Vijay Peddinti: Sorry, no, no, this is the actual data. This is data; these are samples from the dev data
that they gave us.
>>: This is a real data. Oh sorry this is real data.
>> Vijay Peddinti: Yeah but there are…
>>: How can a human transcribe this?
[laughter]
>> Vijay Peddinti: Over here, before that, if you were doing really good gain control you could actually
start hearing the human [indiscernible].
>>: How much of this kind of data is in the training set?
>> Vijay Peddinti: The training set has [indiscernible] data…
>>: [indiscernible]
>> Vijay Peddinti: Because data, training data is just [indiscernible] speech.
>>: [indiscernible]
>> Vijay Peddinti: We have to do a lot of simulation on these things…
>>: [indiscernible] simulate.
>>: This does make a lot of sense. If you don’t have all kind of [indiscernible] why would it actually
make your life harder?
>> Vijay Peddinti: The reason…
>>: [indiscernible]
>>: [indiscernible]
>>: Normal people cannot transcribe anything.
>>: This is [indiscernible] telling them where and when they shouldn’t put their listening devices
[indiscernible].
[laughter]
>> Vijay Peddinti: If you think that was bad you should probably listen to this.
[audio]
[indiscernible]
[audio ended]
All that [indiscernible] was that there was a female speaker…
[laughter]
>>: [indiscernible] to transcribe you’ve got to be you know eighty percent error.
>>: [indiscernible]
>>: [indiscernible]
>>: Zero output is…
>>: Zero is zero.
[laughter]
>> Vijay Peddinti: Yeah, so those are the kind of samples that we were dealing with when we were
starting with this challenge.
>>: [indiscernible] speech [indiscernible]. But the one before that one I tell you a [indiscernible]…
>> Vijay Peddinti: But the one before that if you have gain control you’ll [indiscernible], yeah.
>>: Yeah, okay.
>>: But that’s why I heard she’s [indiscernible] us.
[laughter]
>> Vijay Peddinti: Well, you’re actually able to make out more than any of us heard.
[laughter]
>>: [indiscernible]
>> Geoffrey Zweig: We have about twenty minutes.
>> Vijay Peddinti: Okay, yeah, I am almost done. Over here you can see that there is a lot of mismatch
between the training and the test data. Our only solution to this was basically to collect as many room
impulse responses as we could find openly on the internet and to distort our data using all these
impulse responses. If there was any corresponding isotropic noise along with an impulse response we would
also add that to our data. Doing this we basically created five thousand five hundred hours, so lots of
training data.
To tackle the late reverberation we just hoped that using our TDNN would solve the problem. These
are the first set of results. In the top section of the table the comparison is between a normal DNN and
a TDNN, but over here we are not doing any kind of distortion simulation on our training
data. You can see that the error rate is forty-seven percent.
Just by adding all these different kinds of reverberation we get a thirty-seven percent relative
gain, which is quite drastic. Even with other sites we saw similar gains.
>>: Vijay?
>> Vijay Peddinti: Sorry? Yeah, so we basically increased the amount of training data three times by
choosing different [indiscernible] of room impulse responses and actual speech.
>>: What was the relative improvement from TDNNs on the clean data for or on…
>> Vijay Peddinti: Before it was five to seven percent.
>>: Five to seven percent, so you cannot really make the claim that TDNNs specifically give you a great
gain in the presence of reverberation.
>> Vijay Peddinti: Yeah, we can because of these two columns.
>>: Yeah.
>> Vijay Peddinti: When we do match the training data.
>>: Yeah.
>> Vijay Peddinti: Then we do different kinds of distortions. The performance of the DNN is at thirty-three point one percent. That of the TDNN, the best TDNN that we could get, is thirty point…
>>: That’s seven point five, right?
>> Vijay Peddinti: Yeah.
>>: If already, even on clean speech, going from DNN to TDNN gives you five to seven percent,
then you can't really say that…
>> Vijay Peddinti: No, I'm just making the claim that there is an improvement in the first
place. I'm not saying that in this particular case the model is going to be far better.
>>: But the stuff you’re saying…
>> Vijay Peddinti: But also you have to do…
>>: [indiscernible] specifically, okay, so…
>>: Because we can argue that the TDNN is specifically suitable for as you’re saying the long range
dependency. But now I mean you get seven…
>> Vijay Peddinti: The point there is that long-range dependencies exist in all ASR
scenarios; they are just more predominant in Far-field recognition. There is also another change over
here, which is that the amount of training data increased from three hundred to almost five thousand five
hundred hours.
>>: Yeah, right.
>> Vijay Peddinti: Using this larger amount of training data the DNN could have solved some of the
issues.
>>: But the [indiscernible] that should also benefit.
>> Vijay Peddinti: Yeah.
>>: With more.
>> Vijay Peddinti: Yeah.
>>: But you don’t know do you?
>> Vijay Peddinti: No, at that point in time we did not have a good LSTM implementation. It would definitely have
taken a lot longer to train using LSTMs. Training time was the most critical thing over here because
most of the gains that we are getting were from simulating the data. The most significant gain
over here is the thirty-seven percent relative gain, which comes just from seeing more data.
>>: Instead of doing three x you’re doing [indiscernible]?
>> Vijay Peddinti: Yeah.
>>: Have you tried that?
>> Vijay Peddinti: Not yet, so this is the [indiscernible]. After doing this experiment most of the people
in my lab started hating me because I was taking all their GPUs.
[laughter]
>>: You do not have enough GPUs that’s why?
>> Vijay Peddinti: We have fifty GPUs but I was using all of them for a period of two weeks.
[laughter]
Everyone hated that. I couldn't do any more experiments after this.
>>: The problem with [indiscernible] is slow the x. Your error is three percent [indiscernible]?
>> Vijay Peddinti: Yeah.
[laughter]
>>: [indiscernible]
>>: How many GPUs did you use in one training one?
>> Vijay Peddinti: In this training we used thirty-two.
>>: In one round so you do model averaging over thirty-two, oh, okay?
>> Vijay Peddinti: Yes, we had to do that because we needed a lot. Even with that it took us three
days to train.
>>: That’s not [indiscernible] content that the loss is only point one, point two.
>> Vijay Peddinti: No, no, I’ll show you…
>>: Okay.
>> Vijay Peddinti: Yeah, I did not make these comparisons across every task. It was
just on one very specific task that I checked how modifying the number of jobs affected the results. It was
always in that range.
>>: Even if you go that large, the large number…
>> Vijay Peddinti: I cannot answer that question because I…
>>: You didn’t really [indiscernible] comparison, okay.
>> Vijay Peddinti: One other interesting thing we saw was that when we were using TDNNs in Far-field
recognition, longer and longer context basically helped. When we went to very specific tasks like AMI we
saw that we were even able to model context up to five hundred milliseconds using TDNNs.
Let’s take a look over here at various things that we did. This was our system that we submitted to the
challenge. After the challenge we started adding other things like RNN-LMs. We were still able to
improve on the results. But this is the result that we were comparing with the other sites.
Across all the sites you can see that the results on the dev test are similar. But when we look at the result
on the eval set, which is where the actual comparison is being done, you can see that there is a drastic
increase, almost fifty-seven percent relative. That is just because of the greater variance in the
eval test set that we had.
This shows us that mismatch of data is still a problem despite us having free rein on what we could
add to our training data. There was still a lot of unseen data that we could not deal with. Even with the
dev set over here we were getting good error rates. But we did not even select the room impulse
responses based on the dev set data. We did not use any metadata information when we were doing
our room impulse response selection.
I'll just briefly discuss some of the ongoing work. Let's come back to the same slide, which is a
comparison between LSTMs and TDNNs. To answer your question, when you have TDNNs and you are going to use
them with sequence-level objective functions, where you are going to compute several outputs simultaneously,
you can see that you are actually going to reuse several intermediate computations, several intermediate
activations. Even though they are not needed for the current output, you would anyway be computing them for
the next output. So it is not actually necessary to decrease the sampling rate at an exponential scale as
you go deeper.
Over here you can see that in order to compute the first grey output you are computing this time step
over here. In order to compute the orange one we are anyway going to shift, right? So if we are going
to do this, the cost of computing the intermediate outputs can basically be distributed over the
entire chunk on which we are doing sequence training.
However, if we actually want to use all these outputs we still have a linear increase in the number of
parameters. That still remains a problem. To solve this problem we decided to explore pooling
options. Basically, if we had a filter at the next layer which is just using two outputs and we
wanted to use all the data, we use different kinds of pooling at the lower level in order to reduce the
sampling rate.
We tested three different kinds of pooling, which is basically using a one-dimensional convolution that is
either fixed to be a smoothing filter, or a filter that is learned but common across all the feature indices
(basically a single one-dimensional convolution filter which is shifted across feature indices), or a
different learned filter for each feature index.
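To make the three pooling variants concrete, here is a minimal numpy sketch, an illustration under assumptions rather than the actual implementation: a short window of lower-layer activations is weighted either by a fixed smoothing filter, by a single learned filter shared across all feature indices, or by a separate learned filter per feature index, and only then is the sequence subsampled.

```python
import numpy as np

def pool_before_subsampling(acts, kernels, stride=3):
    """Pool lower-layer TDNN activations over time before subsampling.
    acts:    (T, D) activations at the lower layer.
    kernels: (1, K) filter shared across all feature indices, or
             (D, K) with a separate filter for each feature index.
    Returns pooled activations of shape (num_windows, D)."""
    T, D = acts.shape
    if kernels.shape[0] == 1:
        # One filter shared across feature indices: repeat it over dims.
        kernels = np.repeat(kernels, D, axis=0)
    K = kernels.shape[1]
    pooled = []
    for start in range(0, T - K + 1, stride):
        window = acts[start:start + K]          # (K, D) slice of time steps
        pooled.append(np.sum(window * kernels.T, axis=0))
    return np.stack(pooled)

# A fixed smoothing filter: a simple 3-frame average shared across dims.
fixed_smoother = np.full((1, 3), 1.0 / 3.0)
# Learned variants would have the same shapes, (1, 3) or (D, 3), but with
# trainable values instead of the fixed averaging weights.
```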
These are the results. These were the results with the original cross-entropy objective; there was a
difference between the sub-sampled TDNN and the LSTM. This is the same sub-sampled TDNN which has
this exponentially decreasing sampling rate as we go deeper. If we increase the sampling rate and
ensure that the subsampling rate is the same at all levels, if we just fix it to be three, we see a slight
gain. And when we remove the subsampling completely and use all the frames, but decrease the parameters
using pooling, when we use per-dimension filters we do see significant gains.
To answer your question, there is a detriment from doing subsampling. But if we wanted to do the
same thing in normal frame-based, frame-randomized training there would have been an increase in
training time of almost five to ten times, depending on how…
>>: I’m suspicious that the LSTM only gained three percent, less than three percent relative from
sequence training.
>> Vijay Peddinti: Yeah, so [indiscernible] this has [indiscernible] rules [indiscernible] just [indiscernible]
last week.
>>: I see.
>> Vijay Peddinti: After, so Dan did some debugging and he did find out that there were very specific
issues in using the sequence level objective with the LSTMs. Right now we are trying to rectify that. We
do hope that the LSTMs would give slightly more benefit.
But the main thing that I wanted to highlight over here is that removing the subsampling actually helps.
There is a slight gain between the original TDNN that we had and the TDNN which makes use of
information at all the inputs. This comes to us without any additional cost because we are using a
sequence level objective. We can share the computation across the entire chunk.
Assuming that we do finally, sorry?
>>: [indiscernible] but at decode time it is still run as a subsampled model?
>> Vijay Peddinti: Yeah, even at decode time you always have chunks, right? It's always the non-subsampled
model now. Even before, at decode time we did not gain a lot from the subsampling in the TDNNs, because you
always have chunks.
>>: Okay.
>> Vijay Peddinti: Assuming that we are able to improve the performance of the LSTMs, there
would still be a gap in performance between the TDNN and the RNN model. So we started looking at ways
of actually taking an RNN model and trying to improve its performance to the level of a normal LSTM,
while ensuring that it had all the nice multi-rate properties that we want, which is really low training
time.
It turns out that such a model already exists, which is basically called the clock-work RNN. Let me briefly
describe what it does. Rather than just having a simple set of recurrent units, you partition the recurrent
units into individual groups where each group is running at a different clock rate. Some of the recurrent
units are updated only once every two time steps, while some are updated once every four or once every eight
time steps.
One of the particularities of this architecture is that there are connections only from the slower units to
the faster units. Over here I have a set of units which are operating at R/4 hertz; these just have
connections to the R/2 and R/1 units. There are no connections from the faster units to the slower units.
What does this actually mean? This is the typical recurrent computation that you have. Over here each
block corresponds to one partition of the matrix, and each color corresponds to a particular sampling
rate. If you have a normal recurrent equation you would have a full matrix over here. Instead of a full
matrix you are partitioning the matrix. Over here white means there are no parameters; there are no
parameters because there are no connections from the faster units to the slower units.
If you are computing the fastest-rate output you are using all the partitions from the slower-rate outputs.
But if you are computing a slower-rate output you do not use the partitions coming from the faster units.
This essentially reduces the number of parameters by half, which is a good gain.
On top of that, only this particular block is computed every time step; this particular column is
computed only once every two time steps, and so on. So there is a really great reduction in the amount of
computation that is going on and also in the number of parameters.
But these are not the main things that make this model really attractive. The biggest gain from
this model is that it improves the memory of the network. That happens because the model is basically
looking at input data at different temporal scales; it has several different paths into the past, each with
a different delay.
Koutnik et al. show that this model is very [indiscernible], and on some tasks it is able to
match or even exceed the performance of the LSTMs. Because it was so attractive…
>> [indiscernible] problems, right, what they said?
>> Vijay Peddinti: Yeah and then the [indiscernible] problems. The only speech problem they had was
word level classification.
We tested out the performance of this model on the speech task and it was very bad. It was…
[laughter]
>>: Why so long…
>>: Does this ever outperform anything?
>> Vijay Peddinti: I’ll show you.
>>: Okay.
>> Vijay Peddinti: I’ll discuss some preliminary results. On this acoustic modeling task the performance
was very bad. It was far worse than the normal DNN. We started making modifications.
One of the first things we did: there was subsampling happening in the network, so what would
every signal processing engineer do? Add an anti-aliasing filter to ensure that there is a smoothing
of information, to ensure that information from all the time steps is used by the filter.
The second thing we did was we removed the [indiscernible] activations that were being used in the
[indiscernible] clock-work [indiscernible]; based on the experiments by [indiscernible] Le last summer,
we started using rectified linear units and the diagonal initialization that they suggested.
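As a rough sketch of the anti-aliasing step, assuming a simple moving-average smoothing window (the actual filter used in these experiments is not specified here), the idea is just to low-pass filter the sequence over time before decimating it, so that information from the skipped time steps is not simply thrown away.

```python
import numpy as np

def antialias_then_subsample(x, factor):
    """Smooth a (T, D) feature sequence over time with a low-pass window,
    then decimate by `factor`, so the slower rate still reflects
    information from every original time step."""
    win = np.ones(2 * factor + 1) / (2 * factor + 1)   # moving-average smoother
    smoothed = np.stack(
        [np.convolve(x[:, d], win, mode="same") for d in range(x.shape[1])],
        axis=1,
    )
    return smoothed[::factor]                          # decimate after smoothing
```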
As soon as we started using anti-aliasing filters we saw a good gain. When we used rectified linear units
that was a major gain, and the diagonal initialization gives a slightly larger gain, but one which is not
very consistent.
In order to reduce the output dimension we just started using the fastest units because in acoustic
models you are evaluating the output at every time step. We thought that only the fastest outputs
would be sufficient. All these gave us really good gains.
Finally, we were able to substantially improve over the original clock-work RNN. But the results were
still just as good as TDNNs. I’m not presenting the results right now because the experimentation is not
extensive. I do not think I can make strong claims.
Right now we are still looking at ways to improve the performance of the clock-work RNNs. One option
would be to simply use the clock-work RNN idea in any model that is already performing very well,
for example using it to control the different gates in an LSTM. The point here is that whenever you have
a recurrent matrix you can partition the recurrent matrix, and there are a lot of models where you
have recurrent matrices.
>>: Is [indiscernible] actually the rate, it’s fixed, you have full rate. They’ll have them [indiscernible]
something. I think the part where I understand is using gates and some men are learning to control the
flow of the, and the different text. I think this kind of [indiscernible] kind of power for this
[indiscernible].
>> Vijay Peddinti: Yeah, so, but you can see that it’s not essential to always have a fixed set of rates.
You could actually choose to evaluate these partitions at any rate that you want.
>>: Why don't people do RNNs where the history state you feed back comes from a DNN with max
pooling? Sort of max pooling over the model's history states; that should give you a little bit like a
[indiscernible] or something.
>> Vijay Peddinti: That's what Jeff was suggesting before in the meeting. You basically want different
histories to be aggregated into the state where…
>>: Right.
>> Vijay Peddinti: Where you are making the current prediction.
>>: Yeah.
>> Vijay Peddinti: But in that case, rather than adding additional layers to combine, to concatenate all
these histories, you want to do some kind of pooling operation on…
>>: [indiscernible] pooling.
>> Vijay Peddinti: Yeah.
>>: The time line by a corporation.
>> Vijay Peddinti: Exactly, yeah.
>>: Has anyone ever tried that I don’t know…
>> Vijay Peddinti: Maybe they should train that [indiscernible].
[laughter]
>>: It seems very straightforward actually.
>> Vijay Peddinti: Yeah, definitely, right. Let me try to summarize what I did till now. Basically, we
showed that multi-rate architectures provide us good gains. The models by themselves are not more
powerful than their non-subsampled variants, but the major advantage that you get from
multi-rate architectures is the speed advantage. These speed gains can be trivially converted to
performance gains by using more training data.
In many cases, like the ASpIRE Challenge that I showed, the most critical thing was seeing as
much training data as you could. These architectures give us a really good way to do that.
Another advantage of multi-rate architectures, which I didn't talk about a lot, is that they operate on
multi-scale representations. In this particular clock-work RNN architecture the input vector is being
smoothed out. As soon as you start doing smoothing, if you remember the very first slides I showed, you
actually get these multi-scale properties, and the input representation over here is going to come with
really nice distortion-stability guarantees. Because I can do the smoothing at various levels, as desired,
I do not only have really fast architectures but also architectures which can be really robust, if I am
able to get the training done properly. This is one of the reasons that multi-rate architectures are very
attractive. Even in the TDNN architecture, as we go deeper into the network it is actually seeing lower
and lower resolution information. So the other advantage of multi-rate architectures is that they actually
have access to multi-scale representations.
If you want more details about the things that were discussed in this particular talk you could go
through our Interspeech papers. Let me briefly describe the future work. As I said before the problem
that interests me is the distortion stable sequence recognition problem where you have long term
sequence dependencies and sequential distortions in the data. These challenges exist in all ASR
scenarios, but they are very predominant in Far-field recognition. For that reason, after I was
introduced to the Far-field recognition problem by Mike during my internship over here, I started
working in this area a lot.
This is one particular table. This was the state of [indiscernible] before the [indiscernible] workshop.
During the [indiscernible] workshop we were able to work on the TDNNs and bring it down to some
extent. After the workshop we started using stronger and stronger models. We got down to forty-six point
one percent just using the BLSTM models.
Even if you use a lot of other tricks, like using clean alignments for training or using beamformed audio
data, all that you can get is a move from fifty-three percent to thirty-eight point three
percent. First of all, when you do anything on Far-field recognition the gain looks significant; it's very
satisfying.
But the other good thing about this problem is that despite all the things that we did we were not able
to match the performance of the close-talk microphone system, which was this. There is a sixty-nine
percent relative degradation. And this is not even ground truth in terms of human performance versus system
performance.
These are two systems which are trained on exactly the same data with the same language models. The only
thing that differs over here is the acoustic data, and that gives the sixty-nine percent relative
degradation. It's really good motivation to work on this problem. Most of my focus recently has been on
this particular case. Thank you.
[applause]
>> Geoffrey Zweig: Alright.
>>: I do have another question.
>> Geoffrey Zweig: Go for it.
>>: Essentially what you did, I want to describe it a little bit like you applied the AlexNet image
processing pyramid to speech, in a sense, okay? You're dealing with the same thing that they do for
translation invariance and so on, basically what you are doing in the time direction.
>> Vijay Peddinti: Those things are already done in normal convolution architectures; the main thing is we
also want to do this very fast.
>>: My question is, can you now go backwards? Can you take your subsampling technique and apply it
to AlexNet? Because that probably could give you, now if you say [inaudible] dimensions, right, give
you like a square gain.
>> Vijay Peddinti: Yeah.
>>: But [indiscernible] why it’s worse you’ve got to make sure of that.
>>: Right, but the question is maybe the performance isn't that much worse. Maybe you can actually
get a huge speedup at limited cost if you want to run it on a mobile device or
whatever.
>> Vijay Peddinti: Yeah, so this suggestion was made by [indiscernible] and she showed that this works.
She was suggesting actually using the subsampling technique both on the frequency and the time axes.
>>: Yeah.
>> Vijay Peddinti: Then you had the initial [indiscernible]. I guess that’s what you’re saying, right?
>>: Yeah, except we're applying it to images instead.
>> Vijay Peddinti: Oh, okay.
>>: That's where you would probably get much bigger gains because you don't have forty components
[indiscernible] and…
>> Vijay Peddinti: Yeah.
>>: Fifty-six pixels or something.
>> Vijay Peddinti: Yeah.
>>: But I can save it for later.
>>: No, I just had a question.
[laughter]
Okay, so you started out with this motivation of, I guess, two approaches to tackle Far-field. One
is to try and build models that specifically address the distortions present; that sort of seems to be
your motivation with the temporal modeling. There's this other approach which is sort of raise all boats,
which is basically just do things that improve everywhere. Eventually the gap will still persist, but the
gap will be so small that you don't care, right? You can get from one percent to two percent; if the
absolute [indiscernible] on clean is one percent and [indiscernible] two percent then we'll be happy,
right [indiscernible].
It seems like you started out going for option A and you ended up in option B, because you sort of had
the same improvement across clean and distorted conditions.
>> Vijay Peddinti: Actually the timeline was a bit the other way. Our initial goal was basically having an
improved acoustic model which performed well everywhere. It turned out that these
acoustic models, because they were tackling very specific kinds of things, were working far better in the
Far-field recognition case.
>>: But nobody actually believed it.
>> Vijay Peddinti: Oh, okay, so I think that for the ASR problem as such, if you guys do achieve human
[indiscernible] in a few years you'd basically start relaxing the constraints under which you are
recording speech. I believe that there is no separate robustness problem or Far-field recognition problem;
there is basically a speech recognition problem. I guess any kind of acoustic model which is performing
well would actually give benefits in a lot of these cases as soon as you start relaxing the constraints
under which we assume it is going to operate.
To answer more specifically, assume that you do have certain kinds of techniques, let's say
techniques which work on raw speech and which are showing slight improvements in terms of
performance in the actual clean, telephone speech recognition case. These same techniques
would actually give us far more improvement in the Far-field case.
Papers at the last Interspeech and ASRU have shown that when we apply raw-signal processing, raw-signal
models, in Far-field recognition you do get better gains than what you normally see. I
don't see a really strong distinction between these two problems. But yeah, I definitely think that even
if you just focus on the Far-field recognition problem you would see gains on the normal problem.
>>: Early on in your talk you said, well, look there’s two ways we can approach the problem. We can try
to make long span features, feed long span features directly into the system. Or we can have a model
that somehow we’re going to characterize the distant past in the operation of the model and keep
[indiscernible] state. Basically learn how to characterize the past.
>>: Do you have a sense now of which one is better? Like, how would the modulation features do if you
applied them to the same…
>> Vijay Peddinti: Let me answer within that framework; it would be assumed that we had the same
architecture.
>>: Yeah.
>> Vijay Peddinti: [indiscernible] there was a set of features called scattering features proposed.
Computationally these are a kind of convolutional neural network. The only difference is that over here
you are computing using fixed filters, and the filters are designed for you to achieve all these
distortion-stability guarantees that I was talking about.
You would have all these things. On the top of that you would use whatever acoustic model you
wanted. The other way would be to basically make this a learnable model and basically tune everything.
>>: Right.
>> Vijay Peddinti: There was a recent publication in December twenty fifteen where people showed
that in order to prove that your model has a particular guarantee it is not necessary to restrict
yourself to designed rather than learned filters. If you have a designed filter you have a lot of good
understanding about what the filters are; you know exactly what filter you have, one which dilates as you
go to the larger bandwidths.
But they showed that in order to make proofs about distortion stability and all these guarantees it was
not necessary to have this designed form; it was also possible with learned filters. All that you needed
were very specific characteristics of the model design, which were temporally local or frequency-local
transforms in their case, or the convolution operation as such.
>>: Are the designed ones any better? Because it's a lot faster to just compute some features.
>> Vijay Peddinti: Like I said, there has been a lot of [indiscernible] trying to learn this transform
from raw signals over the past two to three years. It's only at the last Interspeech that they were
actually showing that the learned filters they had were performing slightly better than MFCCs. But even
there, there is no guarantee that these filters are going to perform as well if there is a slight
deviation from the training condition, a mismatch.
>>: I think the filter, the feature-based approach has one fundamental flaw, one thing that it absolutely
cannot model that the model-based approaches you've suggested can. You know what I mean;
otherwise we'll talk a little bit later.
>> Vijay Peddinti: The task-specific thing? The task-specific objective, that they can do.
>>: No, one specific limitation that any like [indiscernible] filter or something’s not able to model or
that…
>> Vijay Peddinti: Yeah, people do acknowledge that fact. They do understand that in feature-based
approaches there is no prior understanding of what is very good for the particular task.
They usually solve that problem by using several different kinds of features. Even if you are doing a
multi-scale approach you would just not use a single filter bank in order to generate your multi-scale
representations.
>>: [indiscernible]
>> Geoffrey Zweig: Okay, I think we’re about out of time. Let’s thank Vijay again.
[applause]
Great.