>> Jasha Droppo: Hi. I'm Jasha. I work here at technology and research in
the speech research group. And I've known John for many years. He was a
postdoc at Microsoft Research for one year ages ago. He has five years doing
noise robustness and other great things at IBM. And these days he leads the
speech and audio group at Mitsubishi Electric Labs down in --
>> John Hershey: Cambridge, Massachusetts.
>> Jasha Droppo: Cambridge, Massachusetts. So John is here to talk about
some neat results that he gave us a preview of last week, and hopefully he'll
go into some details and we'll have lots of fun. John?
>> John Hershey: Thank you, Dr. Droppo. So please feel free to interrupt
and have me writing on the whiteboard and things like that because not
everything may be perfectly clear.
So this is a talk about separation of signals and specifically using a
special new network we came up with for doing embeddings. This is actually
one of those ideas that came up a long time ago, and we were just sort of
like expecting it to get scooped some time before we got around to getting it
out. But as far as we know, it hasn't. So we got lucky. But it's kind of
an obvious idea in retrospect, and so I wouldn't be surprised if someone else
has done similar things.
So the thing about separating sounds is that when you look at the spectrogram
of two signals, so this is just like one frame of a mixture of a male and a
female speaker talking at the same time and both voiced things.
So the black line is the log spectrum of the mixture. The red one is speaker 1
spectrum, the female one, and the blue one is speaker 2. So you can see sort
of those harmonics and some overall structure that people call [inaudible].
But the main thing to notice is that like if you look at the black line
versus the red and blue lines, the black line sits just about at the max of
the two, and only when they're very close to each other do you really see any
discrepancy in that pattern. But in most of the areas, one thing dominates
and the other thing is completely, completely obscured.
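A minimal numerical sketch of the observation above, assuming the two sources are uncorrelated so that their powers simply add (phase is ignored; that additivity is an assumption of this sketch, not something stated in the talk). The function name log_max_gap is hypothetical. It measures how far the log of the summed power sits above the max of the two log powers.

```python
import numpy as np

def log_max_gap(log_p1, log_p2):
    """Elementwise gap between log(p1 + p2) and max(log p1, log p2)."""
    exact = np.logaddexp(log_p1, log_p2)   # log of the summed powers
    approx = np.maximum(log_p1, log_p2)    # the "max" approximation
    return exact - approx

# Hypothetical log-power values for two sources across 1000 bins.
rng = np.random.default_rng(0)
log_p1 = rng.normal(0.0, 3.0, size=1000)
log_p2 = rng.normal(0.0, 3.0, size=1000)
gap = log_max_gap(log_p1, log_p2)
```

Under this assumption the gap is never negative and never exceeds log 2, and it reaches log 2 only where the two sources have equal power, which is where the black line visibly departs from the max.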
So you could sort of say that if you knew that mask, then sort of the
observation under one part of the mask is the source that that mask
corresponds to, and the observations on the other part of the mask are
essentially a very accurate representation of the source that that mask
corresponds to.
And basically once you have the mask, you really have all the information
that the spectrogram contains about each of the sources. There is no other
information because this stuff down here is gone. It's not there. It's
obscured.
So it's kind of just to motivate like why do we care about masks. Well, if
you had the mask, pretty much everything you can know about all of the
sources. So just taking that to a spectrogram, so this is like a mixture of
three sources and this is the masking pattern for those, the sort of Oracle
masking pattern for those sources, where each of those sources dominates in
time and frequency. So time, frequency, log spectrogram, and just sort of
discrete mask.
And you can see it's kind of intricate and they kind of overlap in intricate
ways. So it's not something where region based sort of methods could really
find those special regions.
And the other thing that's special about these regions is that they're a
function of both of the sources, like which one is the maximum requires
looking at all of the magnitudes of all the sources.
But if you had that masking pattern, you could sort of use that to see all of
the visible information for each of the sources. And the rest of the
information for those sources is essentially lost. You have no hope of
getting it anyway.
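As a rough illustration of the Oracle masking pattern described above, here is a minimal sketch, assuming the per-source magnitude spectrograms are available as one array. The function name oracle_mask and the toy numbers are hypothetical.

```python
import numpy as np

def oracle_mask(source_mags):
    """Index of the dominant source in every time-frequency bin.

    source_mags: array of shape (n_sources, n_frames, n_freqs).
    Returns an integer array of shape (n_frames, n_freqs).
    """
    return np.argmax(source_mags, axis=0)

# Toy example: three sources, two frames, two frequency bins.
mags = np.array([
    [[0.9, 0.1], [0.2, 0.3]],
    [[0.5, 0.4], [0.1, 0.7]],
    [[0.1, 0.8], [0.6, 0.2]],
])
mask = oracle_mask(mags)
```

Note that the mask for any one source depends on all of the sources, since the argmax has to look at every magnitude in the bin.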
>>: Your definition of masking is just the max [inaudible].
>> John Hershey: Yes. So one of the things that we discovered recently is
that actually there's not one unique nice way to define a mask for multiple
sources. For two sources it's very clear, one -- whichever signal has the
highest energy.
For multiple sources, it could be the case that it's kind of uniform
distribution over a bunch of sources and then so no one of them dominates
because the sum of any -- of the rest of them is always going to be greater
than each of the individual sources.
So in that case you have to sort of have a neutral value or just set the mask
to zero for that source. So you have multiple sources that have a zero
value. But here we can sort of like pretend that problem doesn't exist.
>>: A follow-up question. You have four colors on this graph.
>> John Hershey: Oh, yes, thank you for pointing that out. So this is the
silent source. Yeah.
>>: The silent source where you assume everything else is dropped below the
background?
>> John Hershey: We actually did special processing to try not to worry
about the silent source, the silence. Normally this would be some background
noise, so you could call it a noise source. And it does sort of dominate in
those regions. So you could treat that as a fourth class. That would be a
reasonable thing to do. But we didn't do that because it was too reasonable.
So we just essentially -- we get rid of that and pretend it's not there.
Because at least that's one thing you can infer from the mixture is you know
where the silence is.
So what I'm going to tell you about is how we developed a neural network that
we trained on two speaker mixtures and then we tested on this mixture and it sort
of does something reasonable for separating three sources.
So this is the Oracle and this is the sort of separated masks. And you can
see it makes some mistakes, but by and large it does pretty well. You know,
like especially the yellow one there I guess must be the female one and
there's two male ones that get a little confused together.
But still it's pretty shocking that you can -- to us that you can train a
neural network that we usually think of as having a sort of fixed number of
inputs and outputs and trained it on two sources which is a particular task
that has a fixed number of labels and then use it in this other task without
modifying the network at all. In fact, we just change one parameter, like
the number of classes you want to get out.
So in a nutshell, I'll just describe what it is, and then we'll backtrack and
go through like the whole history of the world and everything. So basically
what it does is for each of those time frequency bins the output is an
embedding vector that's trained so that if it's -- so if two different time
frequency bins are from the same source, those embedding vectors should
match. And if they're from different sources they should not match.
And by having the -- structuring the output like that, we solve what we call
a permutation problem. Like rather than saying the network should have one
output for a source one, another output for a source two, and another output
for source three and should do like some softmax over those sources, that
would require that it know where to put source one and source two and source
three.
But if they're all in the same class, like how can it determine where it
should put those and how can it make an executive decision like, okay, I
started putting this source here so now I'm going to -- now I have to stay
consistent with that versus putting the other source there and have to stay
consistent with that. So it's a lot of cognitive load for the network that
we're trying to avoid.
So we also hope that this can -- this kind of embedding approach might, well,
first of all, as I showed, sort of help it generalize the different numbers
of sources. But maybe you could even be sort of source independent. Like
you could imagine training this thing on many, many different kinds of
sources. It actually doesn't care what the class labels of the sources are.
All it cares is that you know where they belong in the spectrum. So you
could potentially train it on a lot of different things. We can incorporate
microphone array information. So that's what we're aiming for with this
talk.
So going back into the ancient history, especially since one reason I'm going
back here is because there was some work at MSR recently that kind of solved
the problem of getting neural networks to separate sources, at least for the
purpose of recognition that I was -- Jasha and Mike and Dong Yu and Chow Wang
[phonetics] did some things like that. And it was on the same task that we
had done this work before.
There's a few reasons for presenting this, so I'll just go through it
quickly. It's like one sort of way you could think of handling this problem
is having models for each of the sources. So imagine if it's speech, you
could have a speech model for each of the sources that are there. And that
could be something that you hypothesize. You could say I think there's two
sources so I'll start up a two-source model. Or I think there's three
sources, I'll start a three-source model. You could automatically determine
the model's class using some model selection techniques.
So what kind of models could you use? So one thing that we did in the past
with colleagues at IBM was we had a speech model so we had like kind of a
grammar-based system. It had state transition, so it was like a very sparse
hidden Markov model. And then we had an acoustic model that was log spectrum
domain and that itself had some sort of shared GMM states. And then the
observations are just Gaussians. And then we combine those models together
in a kind of factorial model, one for each source. And then you have an
observation.
And to make things possible to extend to multiple sources and be a little bit
tractable, we use this max interaction function, which just says I'm going to
take the max of all the inputs and that's going to be the observation. So we
say the observation is just delta -- the probability of the observation --
probability density of the observation is just a delta function at the max of
all of the latent inputs sources.
So that's completely tractable to compute that with Gaussians. So posterior
just has this funny shape. It's just a max multiplied by the prior, which is
a Gaussian in the sources for a given frequency.
So now what about actually inferring the states and these masks, this
probability that one source dominates. So, again, we have the max
interaction thing, and then we have the states of the model. So let's say we
have K models and we have a bunch of discrete states. Each model has N
states. And let's say there's a transition matrix, N by N.
And then for each time and frequency, there's that -- in the posterior you
have this probability that one source dominates or the other. And that
posterior, so for each time or frequency you have the K sources. So you have
a -- it's just a multinomial. It has a value, probability value for each of
the K sources.
But unfortunately like if you were to do exact inference, which would be
silly, it would be completely intractable. It's all exponential. Like the
mask states -- so, you know, so if you have N masks and two sources, you have
2 to the N possible masks variables that you would have to explore.
And each of the mask variables depends on the configuration of all the
states, of all the sources. So you have this nice -- oops. Part of my
animation is stale. So you have this nice bipartite graph. And what we
know, like if you sort of think about restricted Boltzmann machine, right, if
you know one of the vectors, then inferring the other one is trivial because
they're all independent conditioned on one of them.
So even though the posterior has this horrible sort of fully connected kind
of structure, if we know one of the things -- like we know the source
states -- then all we have to do is compare those little Gaussians for that
state and see which one is the max.
And if we have the max states, then all we have to do is sort of ignore the
areas that are masked out for each of the sources and then we can look at the
sources independently to determine what their states are. So then we just
had like HMM kind of complexity.
So the two conclusions I wanted to draw from that is like we can do what we
did, which was to do like a variational inference alternating between mask
inferences and source state inferences. Also if we can get good masks, we're
kind of almost done. Like that's like -- it's inferring source states or
whatever else you want to do with those sources, impute the missing values,
whatever it is you want to do, that's all super tractable once you know the
masks, if you can get them correct.
So that was kind of one of the inspirations. Like let's try to develop
algorithms and get the mask and then we'll go back to our crazy graphical
models or whatever to do the refinement after that.
And so we did this stuff, this variational approach, to try to get past just
doing two speakers. So originally we did this two speaker case back in the
day, like this is -- this is -- this is kind of a toy task. So take it with
a grain of salt. So human performance on this task was about here and we
were able to do better than the sort of recorded human performance by doing
exact Viterbi in a two-speaker factorial HMM using a lot of tricks to speed
it up.
And then we had different variations of that. And one of them was this
very -- well, two of these, the red one and the purple one, these are both
different variational approaches about alternating between mask inference
and state inference.
And in particular we came up with ways of where the variational parameters
could have a sort of particular complexity. So you could vary the complexity
and see whether you could get similar results to the exact algorithm.
And so for this case, we were able to, you know -- this was the exact thing
that corresponded to this reduced one. This one had 65K things to consider
in terms of state combinations, and this one just had sort of 256 separate
mask values to consider.
And we do just as well, even though we're reducing the complexity. So that
was great. And then you guys -- Jasha and Mike and Dong and Chow -- recently
did a two-speaker DNN acoustic model. So not for -- our approach was
separated and then recognized and their approach was to just recognize.
And they use some nice techniques to solve the permutation problem partially
when doing state inference and computing the probability of the states of the
different sources and then running a Viterbi to sort of unravel and get the
associations across time to solve the permutation problem. And I'm not sure
exactly where their score would fit in on here, but kind of off the chart.
And then but the direction that we went with this was to -- oops -- as I
said, to do this kind of variational thing so we could get beyond two
speakers. So two different directions. But we wanted something that could
be generalized, an arbitrary number of speakers.
So we did that. It would be very complex. Just give you demos [demo
playing]. You can hear what -- how things used to work. Okay. So it works
really -- pretty well, actually. I don't think we're back to there with the
embedding approach, so we're just kind of catching up.
Of course this thing has a very constrained language model. And all these
things matter a lot. So it's not clear if the two things can be compared.
Also, closed speaker set and so on. So we didn't even use the same data
because it just had too many assumptions. Closed speaker set, small number
of words and so on.
We were able to do four speakers. I won't play the demos. Just back to
here. So back to the -- like if we can get good masks, we're almost done.
That's where we left off.
And now to -- what about DNN approaches for enhancement, because that's
almost like separation, it's basically a separation, only difference being
that you have two different classes, like you have speech and you have
nonspeech.
So what do we do for that. So we train a network to estimate a mask for each
time frequency bin so it has sort of a softmax output that ideally it should
be one for the dominant speaker or dominant source, one for the speaker if
that's the source or one for the noise if that's -- so for that time
frequency bin.
And we have to model the context. LSTMs. Recurrent networks are the
state-of-the-art for that. If you don't know what those are, they're
hideously complex networks that are recurrent, that are made so that the
gradients pass well back in time and they have gates for input, output, and
forgetting. It's not clear whether all those gates are needed. It's just
something that sort of caught on and people use this today. But surely
there's probably a better way, but, hey, it works and it's been implemented
dozens of times. So we can just use that. And that's what we used.
So we've done a bunch of different work just doing speech enhancement using
these kinds of things. And in the more intricate ones, we sort of also took
up this like let's iterate between getting masks or reconstructing sources
and doing recognition and getting states and feeding those things back. And
that all helps incrementally. You get a little bit out of doing that kind of
iteration even though it's not a variational inference algorithm, still can
help.
And these are the kind of results. These are word error rates. So if you
just look at sort of the average results, like starting from basically doing
nothing to doing kind of a vanilla, nonnegative matrix factorization type of
approach, down to doing like a bidirectional LSTM with special objective
function that considers the complex domain match to the signal and feeding
back in some state information after one iteration through a recognizer and
things like that.
Like the numbers keep decreasing. And here's with two-channel system using a
little bit of beamforming to get to a slightly lower number. So just to show
that these networks really work at this task. Even though the nonstationary
noise might even have some speech in it, sometimes -- yes.
>>: John? So what is -- is your goal here going to be for speech
recognition, or do you see this as improving perceptual quality for human
consumption of these signals?
>> John Hershey: It can be either. Like for the -- you mean for the talk in
general or for --
>>: For your research or what you're presenting today.
>> John Hershey: Yeah. I mean, I think it could be for either one. It's
not clear. It depends on -- you know, really test dependent kind of thing.
But you could use it either way, and you can use different objective
functions to train it. These things were all trained sort of in the signal
domain as enhancement things.
So they actually sound pretty good and they help recognition. But if you
were to sort of hook it up to the speech recognition objective function, then
it could be better. It might also sort of wander away from actually doing
something reasonable with the signals and give you something that wouldn't be
listenable if there's some arbitrary transformation of the signals, but might
be good for recognition.
>>: [inaudible].
>> John Hershey: These are with retraining, yeah. I believe. These are
with retraining. So but the problem is like you can't just naively, at
least, apply this to speech on speech problems because it's the same class.
You could do male speech versus female speech. That works. You can even do
speaker A versus speaker B if you have speaker dependent model. That even
kind of works. But for like same speaker independent and both can be male or
female, it just doesn't work.
So as I mentioned for acoustic modeling, this has been kind of solved using
the way that the permutation was disambiguated was to use the state of the
louder source as a target and the state of the quieter source as a separate
target and then to use -- if I'm remembering correctly, to use Viterbi across
time to link up the correct states across time. So that solves the
permutation problem.
It's not clear how to extend that to more than two speakers. Or how to do
enhancement with that. So for enhancement, we tried, just naively tried.
And so what we -- so here using a BLSTM, it's a recurrent model. So you
can't just sort of sort frames frame by frame, so we decided, okay, we'll put
the louder whole utterance or chunk in one of the outputs and put the softer
one in the other target output. That didn't really work.
We're using chunks here of about a hundred frames, just to kind of limit the
globalness of it.
>>: [inaudible].
>> John Hershey: I mean, negative results in neural networks are often --
>>: Right, but I mean [inaudible].
>> John Hershey: Yeah. Yes.
>>: Don't have any way to look at continuity.
>> John Hershey: And it's bidirectional, so we can't look at the output.
Although you could think that maybe by building up layers like each
successive layer is sort of closer to the output then eventually it could
sort of infer the output, so yeah.
It's not clear that it can't be solved this way. But we couldn't do it. And
so we also -- just the simplest thing that we tried which we thought would
work was that at training time we just look at the two masks that were
generated and we use Oracle to pick the one that matches the best matchup
between those outputs and the two targets. We have two targets, two masks,
we match them up optimally according to the objective function, and then we
get gradients after that. That fails.
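The Oracle matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the talk does not say which objective was used, so squared error is assumed here, and the function name best_permutation_loss is hypothetical. As the speaker notes, this scheme did not work for them in practice.

```python
import numpy as np

def best_permutation_loss(est, ref):
    """Match two estimated masks to two target masks in the cheaper order.

    est, ref: arrays of shape (2, n_frames, n_freqs).
    Returns (loss, swapped): the smaller of the two squared-error totals
    (an assumed objective) and whether the swapped pairing was chosen.
    """
    straight = np.sum((est[0] - ref[0]) ** 2) + np.sum((est[1] - ref[1]) ** 2)
    swapped = np.sum((est[0] - ref[1]) ** 2) + np.sum((est[1] - ref[0]) ** 2)
    if swapped < straight:
        return float(swapped), True
    return float(straight), False

# Toy targets, and an estimate that has the two outputs in the other order.
ref = np.stack([np.array([[1.0, 0.0], [0.0, 1.0]]),
                np.array([[0.0, 1.0], [1.0, 0.0]])])
est = ref[::-1]
loss, was_swapped = best_permutation_loss(est, ref)
```

In a training loop the gradient would then be taken with respect to the selected pairing only.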
And then there's also the deeper -- you call it a permutation problem, which
is what about the numbers of sources that are in there, how can we handle
different numbers of outputs in this kind of a rigid framework. But you can
imagine, you know, like you have a -- it's kind of like those Dirichlet prior
models, you have a bank actually and then you put priors so that the ones
that end up zero you ignore and it determines the number of classes. You
could imagine doing something kind of like that.
But, anyway, so this like kind of what we got with the deep net just training
on mask estimation. Like this is the sort of Oracle output mask for two
speakers, and this is what we got. And it kind of -- this kind of looks like
it's doing something from time to time. But, you know, we're asking it to
label speech with one and also label the speech with zero. So it gets a
little confused.
So we decided to come up with something that wouldn't be as hard to get to
work. So I already went through this before. We're going to use embedding
vectors for each time frequency bin. And kind of like comparing to the
class-based approach, like these V are kind of the model outputs and Y.
These are just kind of very generic, so don't pay too much attention.
If these are sort of we're training the network to output some approximations
to the labels, and these are the labels, that's the kind of class-based
approach. Like the labels are just going to be like an indicator for which
class corresponds to what mask.
And then instead we're going to do something that sort of we want to be able
to compare the outputs for different time frequency bins in a case where
they're from the same class. And/or compare them when they're from different
classes. So they'll have an objective function more like that with the
assumption that if we -- once we do that kind of training we'll be able to
then tease apart which things belong together.
So just to make it a little more formal, like if we have this sort of
indicator variable Y, so rectangular matrix, so the row is which time
frequency bin it is, and C is sort of which -- for that utterance, which of
the different sources that went into that utterance it is. Those can be
permuted around because of the way that we're going to make that into an
objective function.
So we want to be independent of the ordering of the columns of Y. So when we
take the outer product of Y, that gives us something that's independent of
the permutations. You multiply Y by a permutation, then take the outer
products, you get the same A.
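A small check of the permutation claim above, assuming Y is a one-hot indicator matrix with one row per time-frequency bin. The labels and the particular permutation are made up for illustration.

```python
import numpy as np

# Hypothetical source labels for five time-frequency bins, three sources.
labels = np.array([0, 2, 2, 1, 0])
Y = np.eye(3)[labels]            # (5, 3) one-hot indicator matrix

# A permutation matrix that reorders the columns (the source ordering).
P = np.eye(3)[[2, 0, 1]]

A = Y @ Y.T                      # outer product with the original ordering
A_perm = (Y @ P) @ (Y @ P).T     # outer product after permuting columns
```

Because P times its transpose is the identity, the two products agree exactly, and entry (i, j) is one exactly when bins i and j carry the same label.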
And it turns out we configure this A as an ideal affinity matrix, if you
think about sort of spectral clustering type of approaches that you have an
affinity matrix. The ideal one would be one where things that are in the
same cluster have a value of one in A. So the AIJ would be one if it's in
the same class and it would be zero otherwise.
And that way it's easy to think about sort of permuting the rows and columns
around of this A so that you get a block diagonal structure, which is exactly
sort of the ideal spectral clustering input. Because that only -- it has
rank as the number of blocks, and each eigenvector is just an indicator for
where that block occurs, where that class occurs.
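The block diagonal structure and the rank statement above can be checked on a toy example. This is a sketch with made-up labels; sorting the bins by label is just one convenient way to permute the rows and columns.

```python
import numpy as np

# Hypothetical labels for seven bins drawn from three sources.
labels = np.array([1, 0, 2, 0, 1, 2, 2])
Y = np.eye(3)[labels]
A = Y @ Y.T                                  # ideal affinity matrix

# Permute rows and columns together so same-source bins are adjacent.
order = np.argsort(labels, kind="stable")
A_sorted = A[np.ix_(order, order)]

rank = np.linalg.matrix_rank(A)              # number of blocks
eigvals = np.linalg.eigvalsh(A)              # ascending eigenvalues
```

Here the nonzero eigenvalues equal the block sizes (2, 2 and 3), and the rank equals the number of sources.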
So if we know A, we can just recover Y. So we're going to approximate A
using some function. If we were to just sort of naively approximate the
whole big A, you know, like saying we have the sort of ideal A and we're
going to approximate it with some output of the network, it would be just too
big. So instead of doing that, we do kind of like similar to what people do in
spectral clustering, we use a low-rank approximation. But instead of -- kind
of like the philosophy nowadays is instead of like having a complicated model
and then approximating it, we're going to start off with a model that would
be considered an approximation to the original thing and train that thing.
We're going to train the approximation because that's our new model. So we
have an approximate model which is that we're just going to use this kind of
a construction for the affinity matrix.
So note that this is nothing like the spectral clustering way of doing
things, which would have it like a local kernel and that leads to a sparse
affinity matrix which then has to be decomposed into eigenvectors in order to
sort of complete the sparse blocks and sort of fill them out into being full
blocks.
Instead of that, we're going to train the network so that this thing gives us
full blocks. That means it's going to give us dense clusters. That's the
goal.
So getting into the precise thing that we actually do, so we have some input
features and we're just sort of here. And since we're using bidirectional
LSTM, you could just think of the output at a given time and frequency as
sort of dependent on the whole spectrogram.
And so H is our network with parameters theta. And this is our objective
function. So we want our VVT to approximate YYT. That's what we said we
would do.
If you sort of go through the algebra, you can see that as saying that the
things that are in the same class we just want them to be close together.
This is just a -- this actually ends up just being N once you do the math and
have the proper weighting on these things.
And then this guy says that if they're in different classes, we'd like their
distances to be further apart, like close to two, even though they can't
quite be that far apart. They're unit vectors in our formulation.
So another way to think about this objective function is that aside from just
some little weights for the number of items in each class, this is exactly
the K-means objective function as a function of Y the assignment of points to
classes.
So at training time, given the assignment, we're training V using the K-means
objective function. And at test time, we hold the parameters constant and we
optimize on Y. We try to find the assignment that reduces the same objective
function. So that way we can feel confident that what we're training it to
do is actually the right thing for the procedure that we do at test time.
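The test-time step described above, holding the embeddings fixed and optimizing the assignment, can be sketched with plain K-means. This is a minimal illustration, not the speaker's implementation: the Lloyd iterations, the farthest-point initialization, the function name kmeans and the synthetic embeddings are all assumptions made for the example.

```python
import numpy as np

def kmeans(V, n_clusters, n_iters=20):
    """Plain Lloyd-style K-means on the rows of V; returns a label per row."""
    # Deterministic farthest-point initialization (assumed, for repeatability).
    centers = [V[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(n_iters):
        # Assignment step: nearest center for every embedding.
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels

# Synthetic unit-norm embeddings: three tight clusters of 50 bins each.
rng = np.random.default_rng(0)
true = np.repeat(np.arange(3), 50)
V = np.eye(3)[true] + 0.05 * rng.normal(size=(150, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
labels = kmeans(V, n_clusters=3)
```

Changing n_clusters is the only thing that has to change to ask for a different number of sources. The recovered labels are only defined up to a relabeling of the clusters, so they are best compared through the affinity matrix rather than directly.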
>>: I want to make sure I understand your notation here. The VIs are
vectors embeddings?
>> John Hershey: Yes.
>>: [inaudible] VIK would be -- K would be like a time-like variable or --
>> John Hershey: I is the time frequency index and K is the embedding
[inaudible] row vectors and they sit sort of as rows inside the V. So VVT
compares for each time frequency, it compares it to each other time
frequency. So VVT is a big matrix, way too big to actually use explicitly.
But fortunately this is just a, you know, Frobenius norm squared error type
of function, and it turns out like actually you never -- you don't have to
actually instantiate that big matrix because of that.
>>: So but [inaudible] of VM is all is one, right?
>> John Hershey: Exactly. Exactly. So each row is --
>>: And then -- [inaudible] VI minus VT as just the cosine --
>> John Hershey: Or times. Times. Not minus. The product of VI and VJ is
the cosine of the angle.
>>: [inaudible] but VI minus VJ is just one plus one plus the cosine --
>> John Hershey: Yeah, the norm of -- yeah.
>>: [inaudible].
>> John Hershey: Exactly. Yeah. Yeah. It's two minus two times the cosine
of the angle.
>>: So the maximum of this cosine should be something between negative one
and one, right? And then you want to --
>> John Hershey: One for the same. Right. Angle will be zero and then the
cosine -- so then you'll get -- you'll get zero. These will be the same
vector. This will be zero.
>>: Just don't understand why you sort of have here negative two
[inaudible].
>> John Hershey: This just -- well, first of all, this is just -- all I'm
doing is doing the output of this to get down here. So this is kind of --
but the intuition you could say, okay, this is a distance over all of the IJ.
So it's saying that we'd like this distance, this squared distance, to be
close to two. So, in other words, we want it to be larger. Because it can
never be two.
>>: But the value you get at just one plus one plus the cosine --
>> John Hershey: Can be two [inaudible].
>>: So what you get is just the cosine [inaudible], right? Because you will
get -- you get what I mean?
>> John Hershey: Yeah. So this will -- this quantity is -- this quantity VI
minus VJ squared is two minus two times the cosine --
>>: [inaudible].
>> John Hershey: Yeah, exactly. So this will be zero if they're the same.
And otherwise it will be larger than zero. Zero is the smallest it can be,
obviously, because it's a distance [inaudible].
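The identity discussed in this exchange, that the squared distance between two unit vectors is two minus two times the cosine of the angle between them, can be checked numerically. The random embeddings below are made up for illustration.

```python
import numpy as np

# Hypothetical embeddings: six rows, normalized to unit length.
rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                                 # cosine of the angle per pair
sq_dist = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
```

The diagonal of sq_dist is zero, since each vector is compared with itself, and every other entry equals 2 - 2 * cos for that pair.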
>>: Because I saw like some other [inaudible] objective function which is
just trying to like maximize the similarity between [inaudible] the same
class and minimize the ones that are the negative player and are not in the
same class. And I'm trying to see what could be the difference in the
optimization between those two objective functions.
>> John Hershey: I see. Well, you know, people often want to put like a
hinge loss on these type of things because like we want something to be far
away, but you don't really care how far away it is as long as it's further
than anything else in that cluster. So with a margin you could have a hinge
loss, and then you don't care after that, which might be more robust.
The problem is like you cannot put a hinge loss on this and then also get the
fact that the low rank construction means that the derivative is very simple
and that the -- in fact, that this likelihood itself is just VTV minus two
times VTY plus YTY, all of which are K by K Frobenius norm. So very simple
and there's no N by N.
So N by N is this astronomically large thing that we want to avoid computing
at all costs. So this thing is all -- it's just a Frobenius if you rearrange
all the terms, make it -- use the -- multiply the large dimensions first,
basically.
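A minimal sketch of the rearrangement described above, comparing the naive objective, which builds the N by N matrices, with the version that multiplies the large dimension first. The function names and the random test data are hypothetical; the expansion itself is just the Frobenius norm identity.

```python
import numpy as np

def affinity_loss_naive(V, Y):
    """Squared Frobenius norm of VV^T - YY^T, building both N x N matrices."""
    return np.sum((V @ V.T - Y @ Y.T) ** 2)

def affinity_loss_lowrank(V, Y):
    """Same value using only small matrices: V^T V, V^T Y and Y^T Y."""
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

# Toy sizes: N = 50 bins, 4 embedding dimensions, 3 sources.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 4))
Y = np.eye(3)[rng.integers(0, 3, size=50)]
naive = affinity_loss_naive(V, Y)
lowrank = affinity_loss_lowrank(V, Y)
```

The second form never allocates anything with N rows and N columns, which is what makes it usable when N is the number of time-frequency bins in an utterance.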
So this is saying for all points we want them to spread out; otherwise,
there's kind of a trivial solution, right? If we don't have this term, then
we could just put them all zero -- or put them all at the same place. And
then all the things that are in the same class will have minimum distance and
we're great. But we want them to be different.
So this one just spreads everything out. This one says things in the same
class must be compact. Then maybe -- there are ways of interpreting this,
too. I kind of skipped some lines there. But this just says what I just
said, so it's same as K-means objective and similar to the spectral
clustering, but now because we're training it to be compact, the clustering
is actually much easier.
You actually don't need to -- so like think of spectral clustering as kind of
like trying to fill out those block diagonal things. You could think of it
as taking the matrix, the affinity matrix to a large power because you need
to spread throughout the cluster.
If you think of a transition matrix as one way of thinking about what
spectral clustering does, you have a sparse transition matrix, but still
things -- you have a connected set for each cluster, and then by following
through that transition matrix and linking up things across multiple
transitions, you get back to a block diagonal thing.
But this should be close -- we're training it to be close to block diagonal
in the first place so it doesn't require as much processing to do the
decoding kind of thing.
So, yeah. So this is just saying what I said. Like you can rearrange this
thing to do the large vector multiplies first and that way it's actually much
less complex than actually -- and memory wise much less space than storing
that whole V V transpose thing. You don't actually have to do that at all. The
gradient is also very simple and cheap to compute.
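As a sketch of why the gradient is cheap, differentiating the Frobenius form with respect to V gives 4 V (V^T V) - 4 Y (Y^T V), which again needs only small matrix products. The function names below are illustrative, and the normalization step is not included.

```python
import numpy as np

def loss(V, Y):
    """||V V^T - Y Y^T||_F^2 written with small Gram matrices."""
    return (np.sum((V.T @ V)**2)
            - 2.0 * np.sum((V.T @ Y)**2)
            + np.sum((Y.T @ Y)**2))

def grad_V(V, Y):
    """Gradient of loss with respect to V; no N-by-N matrix is formed."""
    return 4.0 * (V @ (V.T @ V) - Y @ (Y.T @ V))

rng = np.random.default_rng(1)
V = rng.standard_normal((30, 4))
Y = np.eye(2)[rng.integers(0, 2, size=30)]
G = grad_V(V, Y)   # same shape as V
```

The parenthesization matters: `V @ (V.T @ V)` multiplies the large dimension first, as described above, instead of building `V @ V.T`.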
And the rest of the gradient is just your -- whatever your network is.
Doesn't have to be LSTMs, it could be convolutional networks, whatever. So
we don't get into the gradients for that part.
>>: [inaudible] which gradient --
>> John Hershey: Oh, yeah. We left off the gradient of the normalization.
That's another thing. So you just -- so you have each row of V is a unit
norm. So you just have the gradient of V over its length, which is just
like -- it's like a softmax-like gradient. The only funny thing is since --
>>: [inaudible] so it has to be -- well, although the [inaudible].
>> John Hershey: Well, the way it's implemented is that you have a
transformation coming from the network which is then normalized to be unit
length. And that's just the forward procedure of the neural network. So we
don't consider it an optimization problem with constraint. That's just how
the network works. It outputs something but gets normalized. So when it's
back-propagated, we just take the gradient of it. It's a simple gradient.
The only funny thing is like since it's a sphere kind of normalization, like
all gradients are sort of stepping off of the sphere a little bit. So it
might be good to normalize your weight matrix. Since actually you could --
the weight matrix for each of those rows V, actually it doesn't matter what
the scale is. If you multiply it by some number, the output is going to be
normalized anyway. So it's invariant to that scale.
So all derivatives will be tangent to that surface, but you'll always be
stepping off. But you can always renormalize the weight matrix. So in
essence you can always stay on the sphere.
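A small sketch of the normalization gradient, assuming a single embedding row x that the network outputs before normalization. The helper names are illustrative. The Jacobian of x / ||x|| is (I - v v^T) / ||x|| with v the normalized vector, so the back-propagated gradient has no component along x, which is the "tangent to the sphere" property, and scaling x leaves the output unchanged.

```python
import numpy as np

def normalize(x):
    """Forward pass: project x onto the unit sphere."""
    return x / np.linalg.norm(x)

def normalize_backward(x, grad_v):
    """Back-propagate grad_v (gradient w.r.t. v = x / ||x||) to x."""
    n = np.linalg.norm(x)
    v = x / n
    # Remove the radial component, then rescale by 1 / ||x||.
    return (grad_v - v * (v @ grad_v)) / n

x = np.array([3.0, 4.0, 12.0])
g = np.array([1.0, -2.0, 0.5])
gx = normalize_backward(x, g)

print(gx @ x)                                        # ~0: gradient is tangent
print(np.allclose(normalize(5.0 * x), normalize(x)))  # scale invariance
```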
So what we do is train it on hundred frame chunks. This is just what we did
in the experiment because we were going to try the sort of SVD approach, the
sort of spectral clustering-like approach. And you can't just -- can't do
that on something that's too big. So we train -- limited it to a hundred
frame chunks during training. And then we tried clustering within each chunk
of frames separately and then hooking up the results resolving permutations
after that, or we tried just using global K-means.
So the global K-means is the actual only thing in here that doesn't use any
Oracle information at all. When we did the individual chunks, we used the
best possible permutation of each chunk and we also had a method that did
that automatically. But somehow it didn't end up in the table. I'm not sure
why.
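One way to picture the Oracle permutation step is a brute-force search over label permutations for each chunk, keeping the one that agrees best with the reference assignment. This is a sketch under that assumption. `best_permutation` and its arguments are hypothetical names, and the automatic method mentioned above is not shown.

```python
import numpy as np
from itertools import permutations

def best_permutation(chunk_labels, ref_labels, n_sources):
    """Relabel chunk_labels with the permutation that best matches ref_labels."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(n_sources)):
        mapped = np.asarray(perm)[chunk_labels]   # cluster i -> perm[i]
        hits = int(np.sum(mapped == ref_labels))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return np.asarray(best_perm)[chunk_labels], best_perm

# Cluster indices from one chunk come out swapped relative to the reference.
ref = np.array([0, 0, 1, 1, 0])
chunk = np.array([1, 1, 0, 0, 1])
fixed, perm = best_permutation(chunk, ref, n_sources=2)
```

For two or three sources the factorial search is trivial. It uses the reference labels, which is why it counts as Oracle information.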
But that can be done too. It's not hard to actually come up with a nice
version of K-means that if you have overlapping chunks that you can sort of
satisfy the assignment constraints of K-means so that they're consistent
across chunks but allow the centroids to be different for each different
chunk.
So you can imagine the embedding sort of evolving over time. But you're
linking them up and forcing the same assignments across chunks. Even though
the centroids are different.
Okay. So then we trained it on 30 hours of artificial mixtures of two
speakers, mixing around 0 dB plus or minus 5 dB. And we had two different
evaluations. Because we wanted to have some kind of baseline for this, but
we couldn't find anything handy that was good enough.
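A sketch of how such mixtures could be built, assuming the relative level is set by scaling the second signal to a target signal-to-noise ratio. The uniform draw over [-5, 5] dB is an assumption, since the talk only says "around 0 dB plus or minus 5 dB". `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that power(s1) / power(scaled s2) equals snr_db, then add."""
    p1 = np.mean(s1**2)
    p2 = np.mean(s2**2)
    gain = np.sqrt(p1 / (p2 * 10.0**(snr_db / 10.0)))
    s2_scaled = gain * s2
    return s1 + s2_scaled, s2_scaled

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)         # stand-ins for two utterances
s2 = 0.3 * rng.standard_normal(16000)
snr_db = rng.uniform(-5.0, 5.0)         # assumed sampling of the mixing level
mixture, s2_scaled = mix_at_snr(s1, s2, snr_db)
```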
So what we chose as a baseline was to say, okay, let's train NMF models for
each speaker using all of that speaker's training data, and then we'll use an
Oracle to tell us which two speakers we've got in there and we'll use those
NMF dictionaries. They have like ten frames of context, so they're actually a
fairly powerful NMF thing, nonnegative matrix factorization. For those of
you who don't know, it's just you have a power spectrum and it's easy to
describe that as a sum of a bunch of basis functions that are not negative.
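For readers who have not seen NMF, here is a minimal sketch using the standard multiplicative updates for the Euclidean cost. This is an assumption for illustration. The talk does not say which cost or update rule the baseline used, and the ten frames of context are not modeled here.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor nonnegative X (freq x time) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, rank)) + eps   # basis spectra
    H = rng.random((rank, T)) + eps   # nonnegative activations
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "power spectrum" that really is a sum of three nonnegative bases.
rng = np.random.default_rng(0)
X = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(X, rank=3)
```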
So that's our sort of Oracle baseline. And we could evaluate on the closed
speaker set, but we couldn't evaluate that on the open one because it's -- we
don't have a good method for that.
So let's go through the results that we've gotten so far. So the global
clustering is a little worse than the local Oracle clustering. So Oracle
K-means, that's where we do individual chunks and then find the best
permutation. Should probably say Oracle permutation, I guess. And then
global K-means is doing K-means on the whole thing. So we lose a bit going
from -- sorry.
This is [inaudible] improvement in decibels. So we get around 6 1/2 dB
overall for things that are mixed around zero dB. So we're improving
somewhat. It's not ideal, not what -- you'd like to have like 30 dB, I
guess, ideally. Or 10 would be nice.
So it's still preliminary results. But, anyway, it's kind of interesting
that the global one works at all. Because we really expected things to be
shifting over time, and we only trained it on local chunks of a hundred
frames. So evidently there's enough sort of common information that's more
like speaker ID type of information in the embeddings that allows you to do
that. But you do lose something, and we'll see more of that later on.
We also did singular value decomposition to try to do something sort of
spectral clustering-like. Also, you know, you can think that on the
sphere K-means and SVD are pretty much the same thing. They're doing almost
the same kind of estimation. One is just a relaxed sort of version of the
other. And that had no effect really.
So and then of course an interesting thing is like how many embedding
dimensions do you need to do this, and it seems relatively insensitive. Our
best was at 40. One thing I didn't mention about the structure of the
network is that actually the output layer of the LSTM was a tanh function
before being projected from the hidden vectors of the LSTM to this output
embedding dimensionality. So there's some kind of limiting going on first,
and then we have a projection. And we tried setting that to logistic. Why
we didn't just try a linear version of that I'm not sure. But we didn't.
So kind of revealing is the results on gender. So it's like we can consider
the different combinations, male and male, female and female, male and
female, and altogether you can see like really actually the male and female
is doing great. Doing really, really well. Surprisingly even better than
this Oracle NMF thing by a fairly large margin.
But it suffers a lot with male and male, female and female. And particularly
the global K-means is just starting to really lose a lot of the benefit.
And, again, it kind of makes sense, like now you've got same gender, so a lot
of those may be even close to same speaker. Some may be more different from
each other, two male speakers, some might be more similar.
And if you're relying on this global information that was only trained on a
hundred frames, like maybe the network can't really put in enough information
on the basis of a hundred frame training to really tell the difference
between one male speaker and another very well. But still doing something.
So it's encouraging.
And then I guess I didn't really emphasize this before, but we're not losing
much going from closed speaker set to open speaker set. Obviously you can't
do that at all with the NMF with the Oracle speaker ID, but basically it
means that our model is pretty speaker independent. It's not really relying
on the IDs of the 80 speakers that are in the training set.
>>: It's weird that your male-male and female-female results [inaudible] is
that just chance?
>> John Hershey: Yeah --
>>: [inaudible] better for females as opposed to males --
>> John Hershey: That is -- that is -- it's suspicious. I think it must be
correct. Actually, yeah, that's a good point. It's a bit suspicious.
Actually my theory for why female-female works better is just because the
harmonics are further apart, easier to separate.
We use relatively small windows. We should have used windows large enough
that you can actually see the harmonics of male speech, but we use 25
millisecond windows. So it's getting tough to see differences between
harmonics for male.
Okay. So let's just hear some results, see how it goes [demo playing].
So pretty good for something that has no language model, you know, no
separate models of individual sources at all, just kind of training on this
mixed situation [demo playing]. And so just for comparison [demo playing].
Doesn't really work. Those two speakers are really close. So even though it
had two different dictionaries trained on just those speakers, it's just
impossible to separate based on individual templates of the speakers.
Okay. So then three speakers, we actually tried that, just fed it in the
same network, mixtures of three speakers and just set the clustering at the
end to have three clusters instead of two. And we do pretty well. I mean,
this is sort of comparable to some of the male-male or female-female numbers.
It has a male-male in it.
>>: Oracle NMF means you pick the right speaker?
>> John Hershey: We picked the models corresponding to each of the speakers.
>>: [inaudible] Oracle means you just know cross segments which cluster is
[inaudible].
>> John Hershey: Yes, yes.
>>: [inaudible] segment is, because segment to segment cluster [inaudible].
>> John Hershey: Yeah.
>>: So just put it the right [inaudible].
>> John Hershey: Depends on the initialization, if it's arbitrary, its
permutation, independence. Arbitrary clustering comes out of each one. And
this, this one is just running full [inaudible].
So if we were testing on individual chunks, then I guess like these would be
more similar, right? This one would be better because the Oracle clustering
for one chunk is the same as the nonOracle clustering for one chunk.
>>: But the embeddings for speaker 1 on chunk one, there should be similar
embeddings on same speakers, a different chunk, right? Or no?
>> John Hershey: Well, they can represent the local information, so
there's -- you know, there's like the pitch contour and what phoneme is
happening right then.
>>: Could you overlap chunks to get better permutation information?
>> John Hershey: Yes. Yes. We did originally extract overlapping chunks.
But Joe -- he was an intern at the time who did this -- he just sort of threw
those away because they were taking up too much disk space. It's fine.
Like I say, we did try a method that actually considered consistency across
those in a nice framework. It didn't do as well as the Oracle, but I believe
it was a bit better than the global one. I don't know why we didn't go with
that. It just wasn't ready yet I guess at the time of printing up these
results.
But you could imagine solving these problems with some smart way of linking
things up, dynamic programming kind of thing, across time. So that's the
picture I showed before.
So what do the embeddings mean, though? That's an interesting question.
Like we would really like -- we wish we could do like the sort of
inceptionism, like explorations of what kind of crazy sounds these things
respond to, but we haven't tried or don't really know how to do that.
This is just looking at -- so that's the mixture mask. These are the sort of
three different randomly chosen embedding dimensions. So you can make a
spectrogram out of an embedding dimension because it has a value for every
time and frequency.
So somehow these correspond to what we're seeing in the input. But we were
hoping to see something like, oh, this one is pitchy and this one is like
looking at onsets. You can't see anything like that.
One reason that it's hard to interpret these is what I was saying before, is
that like the V -- the embeddings are really rotation invariant. So if you
multiply V times a rotation matrix, call that V tilde, then V tilde, V tilde
transpose equals VV transpose. So no matter how you rotate them around. So
you're going to lose all the meaning of any particular node in the embedding
by rotating it. So that kind of gives us almost no hope of understanding.
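The invariance is easy to check numerically. In this sketch the orthogonal matrix Q comes from a QR decomposition of a random matrix, so it may include a reflection as well as a rotation, but the identity holds either way because Q Q^T is the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((50, 4))                   # 50 bins, 4 embedding dims
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal matrix
V_tilde = V @ Q                                    # rotated embeddings

# The affinity matrix is unchanged although individual dimensions are mixed.
print(np.allclose(V_tilde @ V_tilde.T, V @ V.T))
print(np.allclose(V_tilde, V))
```

So the objective cannot prefer one set of embedding axes over another, which is why a single embedding dimension has no fixed meaning.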
But we can try to straighten them out or detect something. Or maybe at least
just look at the output matrices and see what kind of patterns they have.
It's kind of a weird thing, right? You've got 600 nodes, each of them is
just multiplied by a matrix to get out to this whole embedding space. So
each of them sort of has its own embedding. So if it's zero, you don't see
that embedding. If it's one, you get that embedding in your output as one
part of a sum.
So looking at what those are might be interesting, but --
>>: [inaudible] frequencies you have during the [inaudible] you have certain
patterns during silence. I'm not sure [inaudible].
>> John Hershey: Yeah. So what I didn't mention so much is that -- but sort
of mentioned before, what we do for the silence is we just put a weight on
the objective function during training that says we're just going to ignore
those silence bins. We don't actually care which source they belong to. So
we're not going to train the network to try to come up with a good embedding
for the silence. Because you can tell the silence by looking at the mixture,
have a threshold kind of thing.
And then at test time we just also look at the mixture, pick that silence
threshold. And then we don't -- when we do the clustering, we don't use
those values to determine the cluster center. So we do a weighted K-means
and we just ignore them when doing the M step of the K-means. We just don't
include them.
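A sketch of that weighted K-means, assuming binary weights (1 for bins above the silence threshold, 0 for silent bins). The farthest-first initialization and the name `weighted_kmeans` are assumptions for illustration. Every bin still receives a label in the assignment step, but silent bins contribute nothing to the centroid update.

```python
import numpy as np

def weighted_kmeans(V, weights, k, n_iter=20):
    """K-means where zero-weight rows are ignored when updating the centers."""
    active = np.flatnonzero(weights > 0)
    # Farthest-first initialization over active points only (an assumption).
    centers = [V[active[0]]]
    for _ in range(1, k):
        d2 = np.min([((V[active] - c)**2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[active[np.argmax(d2)]])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: every row gets a label, silent or not.
        d2 = ((V[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M step: weighted mean, so zero-weight rows drop out.
        for j in range(k):
            w = weights * (labels == j)
            if w.sum() > 0:
                centers[j] = (w[:, None] * V).sum(axis=0) / w.sum()
    return labels, centers

V = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [10, 10], [10.1, 10], [10, 10.1],
              [100, -100]], dtype=float)        # last row plays a "silent" bin
w = np.array([1, 1, 1, 1, 1, 1, 0], dtype=float)
labels, centers = weighted_kmeans(V, w, k=2)
```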
So we're probably close to out of time, but just go through what we did a
little bit this summer in the workshop that some of us were involved in.
There was an idea to try to use microphone array input for this because this
thing is kind of a really tough problem. You're looking at a single
spectrogram and you're trying to tease out which parts of it belong to which
source. And just by the pattern of the amplitudes it's very complicated
because you don't know where one begins and one ends. It's the chicken and
egg problem. That's why you have to iterate between masking and states.
But if you had microphone array input, then there's sort of directions in
space that cause delays between the microphones and those delay patterns can
sort of distinguish one source from another and at least give a clue that
could be used.
So but we want to now handle that kind of permutation problem in the input.
So we want to be sort of independent of the microphone array geometry
ideally. At the very least, be independent of the ordering of the
microphones because that shouldn't have to do with the algorithm.
So the approach that we came up with was to first do an initial microphone
array sort of clustering algorithm that looks at the time frequency bins and
clusters them according to their sort of plausible consistency with a set of
delays between the microphones. So every pair of microphones can have some
delay, and you can sort of detect these delays.
And the pattern of delays is kind of indicative of a particular direction of
arrival or something. And so clustering those things, you can sort of have a
feature that differentiates between different sources.
So then the way we use those clusters is we use each of those patterns of
delays to do beamforming. So we use them as a kind of delay-and-sum
beamformer where there are particular time delays. And then feed in the
beamformed channels. So then you can hope that if there's a beam that's sort of pointing
at one of the sources, the values will be larger than in a beam that's
pointing at a different source. And so the pattern across those dimensions
indicates something about the clustering of the sources.
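A sketch of a frequency-domain delay-and-sum beamformer for one delay pattern, assuming complex STFTs for each microphone and one delay per microphone in seconds. The array shapes and the name `delay_and_sum` are assumptions. The workshop implementation is not described in the talk.

```python
import numpy as np

def delay_and_sum(X, delays, freqs):
    """X: (mics, freqs, frames) complex STFTs; delays: (mics,) in seconds;
    freqs: (freqs,) in Hz. Returns the (freqs, frames) beamformed STFT."""
    # Phase advance that undoes each microphone's delay at each frequency.
    steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.mean(steer[:, :, None] * X, axis=0)

# Simulate one source arriving at three microphones with known delays.
rng = np.random.default_rng(0)
n_freq, n_frames = 8, 5
freqs = np.linspace(100.0, 800.0, n_freq)
delays = np.array([0.0, 1e-4, 2.5e-4])
S = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))
X = S[None] * np.exp(-2j * np.pi * freqs[None, :, None] * delays[:, None, None])

aligned = delay_and_sum(X, delays, freqs)          # beam pointed at the source
misaligned = delay_and_sum(X, np.zeros(3), freqs)  # beam pointed elsewhere
```

With the matching delay pattern the source adds coherently, and with a mismatched pattern it loses energy, which is the contrast across beams that the network is meant to pick up on.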
So it's still not enough because the beams could be in different orders sort
of arbitrarily because you're just using clustering now. And unless you want
to train on so much data that you automatically learn a network that is
insensitive to the order that you put things in, probably you're not going to
fit something to those orderings.
And so we wanted to just be completely independent of that. So we came up
with a network architecture that kind of has local channels for each of the
inputs. So at test time you can have as many channels as you want in the
input because they're all the same, they all have the same weights. And then
from each of the local ones we pool into sort of a global one.
Because ideally you want those local channels to compare to each other,
because you need to say, okay, this one seems louder than this one for this
particular source or something like that. So but you can't do comparisons
without somehow having an indexing between the inputs. So to do
permutation-free sort of comparisons in a neural network, then we use a
structure like this. So same color -- well, at least within one layer, same
color means same weight matrix. So same parameters.
So this one for channel one is actually the same network as for channel two.
The networks are exactly the same. So how can they have different sort of
information. Ultimately we want to have -- be calculating different
information. So the inputs are different, so that's a start. And then we
use kind of a pooling to pool into a global unit and then feedback the global
output into the individual channel.
So this way this guy, for instance, can sort of compare its own
representation to the representations coming from the other units. Or maybe
here's a better example since it's going to be hidden, do the pooling, back
to the hidden. So this guy can see sort of how it's doing relative to the
other channels essentially.
And we don't know what kind of function it's learning inside, but at least it
has a chance of learning something completely permutation independent. And
this can be used for any kind of problem like this where you want to have
permutation independence in the input.
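A sketch of one such layer, assuming mean pooling across channels and a tanh nonlinearity. Both choices, and the names `shared_pool_layer`, `W_local` and `W_global`, are assumptions, since the talk does not specify the pooling function. Every channel uses the same weights, and the pooled summary is fed back to each channel so it can compare itself with the others.

```python
import numpy as np

def shared_pool_layer(X, W_local, W_global):
    """X: (channels, features). Shared weights per channel plus a pooled
    global summary that is broadcast back to every channel."""
    pooled = X.mean(axis=0, keepdims=True)            # (1, features)
    return np.tanh(X @ W_local + pooled @ W_global)   # (channels, hidden)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))          # 4 beamformed channels, 6 features
W_local = rng.standard_normal((6, 5))
W_global = rng.standard_normal((6, 5))

out = shared_pool_layer(X, W_local, W_global)
perm = np.array([2, 0, 3, 1])
out_perm = shared_pool_layer(X[perm], W_local, W_global)
```

Reordering the input channels only reorders the output rows in the same way, and the same weights accept any number of channels.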
So that's pretty much where we left off. We didn't actually get to really
try this out yet. It's just theory at this point. But I thought I would
throw it in there because it's like kind of a future direction that we
already have sort of made some steps towards.
So bottom line, you know, we're solving this output permutation problem,
multiple instances of the same type of source can be handled. Can generalize
different numbers of sources. Maybe even you could imagine using this to
train a kind of universal sound separation engine if you train it on enough
things and have enough capacity. There's no language model, no complex
decoding process.
The whole thing is very fast. It's just a feed-forward network. Well, it's a
BLSTM. But you can imagine doing a convolutional network that would just be
feed forward.
And so we're working on this, incorporating microphone array information. So
stay tuned for further developments. That's my talk.
[applause]
>> Jasha Droppo: Any other questions?
>>: I've got a lot of them, but do you think it might work for image
[inaudible] recognition?
>> John Hershey: Yeah. Yeah. That's definitely something we want to try.
I think that the image segmentation literature is a little more vast than the
deep network source separation literature, so I can't guarantee that
something like this hasn't been done.
There was something a little bit like this. The only thing that we found
that was closest to this was doing an MRI volume segmentation by using a
convolutional network to infer affinities. The affinities were just adjacent
affinities. So it was like you have a point in a cube, a three-dimensional
grid, and so you have six numbers to infer that are the local affinities.
And that's been done.
Nobody's used like this kind of embedding approach that like sort of has long
reach across grids like this. But I do think it would be good for that.
In that literature, there are a lot of datasets that have a ton of classes.
So, you know, you can use more like class-based approaches in those cases.
>>: Right, right. So they sort of maybe don't have a permutation problem
per se because like, you know, the features associated with light force are
different than the person. But, you know, I mean, they also --
>> John Hershey: And you can train the network.
>>: -- learning and things like that.
>> John Hershey: I want [inaudible].
>>: So here like, you know, the signal is same difference, so then you don't
care about the exact classness and --
>> John Hershey: Yeah. I mean, there's kind of -- I've used that image,
class-based image segmentation method as kind of an existence proof for this,
because you could say that vector of probabilities of each class that you
would get out for each pixel, let's think of that as an embedding. Well,
when things are in the same class, then it should match, and when they're in
different classes, they should be different.
The advantage here is that now we can do this without having class labels.
So there's a ton of objects that don't really have a class label. You know, if
you see something on the ground, you can pick it up without knowing what it
is until you get a closer look. So we know that for human perception, it's
not a problem at all to segment something without knowing what it is.
So there's no reason you couldn't have datasets that have unrecognizable
objects in them. It's just that they don't exist. There may be some that
just have segmentations. But, you know, it has to be done by a human, so
it's kind of a nonstarter for me, at least. For this stuff we just mix them
together. So if you have N things, you have N squared training things --
>>: [inaudible] and then render, then you have something.
>> John Hershey: Yeah. Yeah. So depending on how realistic that is.
That's the only thing is the realism.
>>: With respect to the inception thing, so like -- I mean, it seems like
this thing is pairwise, so if I have a real [inaudible] and then I say I want
to move away from this thing but maintain the cosine, I don't want -- I want
it to be similar in the embedding, but otherwise as different as possible.
And then if you also do the inception thing of like preserving the local --
>> John Hershey: Yeah --
>>: -- so you don't get white noise, have you tried it? Do you get
reasonable sounds?
>> John Hershey: We haven't tried that. There's one tiny issue which is
that like the phases are kind of important in sounds in order to sound good.
But that's okay. I mean, you could still I think learn a lot from it. And
we could think about just optimizing the phases a little bit to patch them
up.
Yeah, I think something like that makes sense. Start with a sound, just one,
and then try to stay in the same class but get different -- see what the
equivalence classes are. Yeah. Yeah, that's nice.
>>: Because those guys are super famous. You know the inception guys like
have an art exhibit now.
>> John Hershey: That stuff is mind blowing. Like --
>>: Like right?
>> John Hershey: I still haven't recovered from seeing those videos
of that stuff.
>>: Yeah, I know. It brings back bad memories.
>> John Hershey: Yeah. Lying under a bush after midnight. Indeed.
>>: So back in the missing feature days, there were methods where you
could sort of -- you had the mask, but you could basically -- you could --
>> John Hershey: [inaudible].
>>: Yeah, [inaudible] by looking at local correlations. But it turned out
that the correlations basically fell off pretty fast beyond some, I don't
know, ten frames or something.
And so it seems like you're trying to learn a search affinity over one second
so you're trying to -- there are relationships for high [inaudible] at time T
to a low frequency bandwidth one second later. Maybe not informative in
reality or difficult. Do you think you'll just learn that those things are
hard to do, or do you think you should [inaudible]?
>> John Hershey: No, no. I think that actually this is -- this gets at a
very key point that I didn't really hammer on very much, but it's actually
really crucial to understanding how this can even work at all. Like there
have been approaches to try to do spectral clustering of mixtures of speech, in
fact, before.
So Michael Jordan and Francis Bach had a paper on this using spectral
clustering and training through the spectral clustering objective function.
But the features were, like you said, they're local and they're uninformed by
that masking function. So the features kind of overlap with multiple sources
potentially. Actually their whole method was problematic because they also
had a pitch tracking algorithm as input.
But in any case, one thing that this is not doing is looking at local regions
in the input. The only thing local is that the output has an embedding
assigned to each time frequency bin. The input, it's looking at the whole
spectrogram to determine that embedding for one time frequency bin and
therefore for the pair.
So it's not as if it's just looking locally and trying to match something
that it sees at low frequency at frame one with something at high frequency
at frame a hundred. Like it's looking at the whole thing and sort of
classifying, doing whatever it internally has to do to understand what that
signal is and therefore where it must dominate. It's kind of a global
approach.
We do think that one should try to do convolutional things, but we don't know
how little context you could get away with. I think ultimately by the time
you get to the top, you should have -- like every embedding should basically
be informed by the whole thing. I guess.
>>: [inaudible] and then make him talk some more.
>>: [inaudible].
>> Jasha Droppo: Okay? Well, let's thank the speaker one more time.
[applause]