>> Jasha Droppo: Hi. I'm Jasha. I work here in Technology and Research, in the speech research group. And I've known John for many years. He was a postdoc at Microsoft Research for one year ages ago. He has five years doing noise robustness and other great things at IBM. And these days he leads the speech and audio group at Mitsubishi Electric Labs down in --

>> John Hershey: Cambridge, Massachusetts.

>> Jasha Droppo: Cambridge, Massachusetts. So John is here to talk about some neat results that he gave us a preview of last week, and hopefully he'll go into some details and we'll have lots of fun. John?

>> John Hershey: Thank you, Dr. Droppo. So please feel free to interrupt and have me writing on the whiteboard and things like that, because not everything may be perfectly clear.

So this is a talk about separation of signals, and specifically using a special new network we came up with for doing embeddings. This is actually one of those ideas that came up a long time ago, and we were just sort of expecting it to get scooped some time before we got around to getting it out. But as far as we know, it hasn't. So we got lucky. But it's kind of an obvious idea in retrospect, and so I wouldn't be surprised if someone else has done similar things.

So the thing about separating sounds is what you see when you look at the spectrogram of two signals. So this is just like one frame of a mixture of a male and a female speaker talking at the same time, and both voiced. So the black line is the log spectrum of the mixture. The red one is speaker 1's spectrum, the female one, and the blue one is speaker 2. So you can see sort of those harmonics and some overall structure that people call [inaudible]. But the main thing to notice is that if you look at the black line versus the red and blue lines, the black line sits just about at the max of the two, and only when they're very close to each other do you really see any discrepancy in that pattern. But in most of the areas, one thing dominates and the other thing is completely, completely obscured.

So you could sort of say that if you knew that mask, then the observations under one part of the mask are an essentially very accurate representation of the source that that part of the mask corresponds to, and likewise for the observations under the other part of the mask. And basically once you have the mask, you really have all the information that the spectrogram contains about each of the sources. There is no other information, because this stuff down here is gone. It's not there. It's obscured. So that's kind of just to motivate why we care about masks. Well, if you had the mask, you'd have pretty much everything you can know about all of the sources.

So just taking that to a spectrogram, this is like a mixture of three sources, and this is the masking pattern for those, the sort of Oracle masking pattern for those sources, where each of those sources dominates in time and frequency. So time, frequency, log spectrogram, and just sort of a discrete mask. And you can see it's kind of intricate, and they kind of overlap in intricate ways. So it's not something where region-based sort of methods could really find those special regions. And the other thing that's special about these regions is that they're a function of both of the sources: like, which one is the maximum requires looking at all of the magnitudes of all the sources.
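To make the ideal-mask idea concrete, here is a minimal numpy sketch (made-up toy data, not the system in the talk): at every time-frequency bin, the source with the largest magnitude "owns" that bin, and masking the mixture with that pattern recovers the visible part of each source.

```python
import numpy as np

# Toy sketch of the Oracle mask (made-up data, not the actual system):
# whichever source has the largest magnitude owns each time-frequency bin.
rng = np.random.default_rng(0)
K, F, T = 3, 257, 100                        # sources, frequency bins, frames
S = np.abs(rng.standard_normal((K, F, T)))   # toy source magnitude spectrograms
X = S.sum(axis=0)                            # toy mixture (magnitudes add only approximately in reality)

dominant = S.argmax(axis=0)                  # (F, T): index of the loudest source per bin
masks = np.stack([(dominant == k).astype(float) for k in range(K)])

estimates = masks * X                        # masked mixture ~ the visible part of each source
assert masks.sum(axis=0).min() == 1.0        # every bin is assigned to exactly one source
```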
But if you had that masking pattern, you could sort of use that to see all of the visible information for each of the sources. And the rest of the information for those sources is essentially lost. You have no hope of getting it anyway.

>>: Your definition of masking is just the max [inaudible].

>> John Hershey: Yes. So one of the things that we discovered recently is that actually there's not one unique nice way to define a mask for multiple sources. For two sources it's very clear -- whichever signal has the highest energy. For multiple sources, it could be the case that it's kind of a uniform distribution over a bunch of sources, and then no one of them dominates, because the sum of the rest of them is always going to be greater than each of the individual sources. So in that case you have to sort of have a neutral value, or just set the mask to zero for that source. So you have multiple sources that have a zero value. But here we can sort of pretend that problem doesn't exist.

>>: A follow-up question. You have four colors on this graph.

>> John Hershey: Oh, yes, thank you for pointing that out. Yeah. So this is the silent source.

>>: The silent source where you assume everything else is dropped below the background?

>> John Hershey: We actually did special processing to try not to worry about the silent source, the silence. Normally this would be some background noise, so you could call it a noise source. And it does sort of dominate in those regions. So you could treat that as a fourth class. That would be a reasonable thing to do. But we didn't do that because it was too reasonable. So we just essentially get rid of that and pretend it's not there. Because at least that's one thing you can infer from the mixture: you know where the silence is.

So what I'm going to tell you about is how we developed a neural network that we trained on two-speaker mixtures, and then we tested on this mixture, and it sort of does something reasonable for separating three sources. So this is the Oracle and this is the sort of separated masks. And you can see it makes some mistakes, but by and large it does pretty well. You know, especially the yellow one there, I guess, must be the female one, and there's two male ones that get a little confused together. But still it's pretty shocking -- to us -- that you can train a neural network, which we usually think of as having a sort of fixed number of inputs and outputs, and train it on two sources, which is a particular task that has a fixed number of labels, and then use it in this other task without modifying the network at all. In fact, we just change one parameter, like the number of classes you want to get out.

So in a nutshell, I'll just describe what it is, and then we'll backtrack and go through like the whole history of the world and everything. So basically what it does is, for each of those time-frequency bins, the output is an embedding vector that's trained so that if two different time-frequency bins are from the same source, those embedding vectors should match. And if they're from different sources they should not match. And by structuring the output like that, we solve what we call a permutation problem. Like, rather than saying the network should have one output for source one, another output for source two, and another output for source three, and should do like some softmax over those sources -- that would require that it know where to put source one and source two and source three.
But if they're all in the same class, how can it determine where it should put those, and how can it make an executive decision like, okay, I started putting this source here, so now I have to stay consistent with that, versus putting the other source there and having to stay consistent with that. So it's a lot of cognitive load for the network that we're trying to avoid.

So we also hope that this kind of embedding approach might, well, first of all, as I showed, sort of help it generalize to different numbers of sources. But maybe you could even be sort of source independent. Like, you could imagine training this thing on many, many different kinds of sources. It actually doesn't care what the class labels of the sources are. All it cares is that you know where they belong in the spectrum. So you could potentially train it on a lot of different things. We can incorporate microphone array information. So that's what we're aiming for with this talk.

So going back into the ancient history -- one reason I'm going back here is because there was some work at MSR recently that kind of solved the problem of getting neural networks to separate sources, at least for the purpose of recognition; Jasha and Mike and Dong Yu and Chow Wang [phonetics] did some things like that. And it was on the same task that we had done this work on before. There's a few reasons for presenting this, so I'll just go through it quickly. One sort of way you could think of handling this problem is having models for each of the sources. So imagine if it's speech: you could have a speech model for each of the sources that are there. And that could be something that you hypothesize. You could say, I think there's two sources, so I'll start up a two-source model. Or, I think there's three sources, I'll start a three-source model. You could automatically determine the model class using some model selection techniques.

So what kind of models could you use? So one thing that we did in the past with colleagues at IBM was we had a speech model, so we had kind of a grammar-based system. It had state transitions, so it was like a very sparse hidden Markov model. And then we had an acoustic model in the log spectrum domain, and that itself had some sort of shared GMM states. And then the observations are just Gaussians. And then we combine those models together in a kind of factorial model, one for each source. And then you have an observation. And to make things possible to extend to multiple sources and be a little bit tractable, we use this max interaction function, which just says I'm going to take the max of all the inputs and that's going to be the observation. So we say the probability density of the observation is just a delta function at the max of all of the latent input sources. So it's completely tractable to compute that with Gaussians. So the posterior just has this funny shape. It's just a max multiplied by the prior, which is a Gaussian in the sources for a given frequency.

So now what about actually inferring the states and these masks, this probability that one source dominates. So, again, we have the max interaction thing, and then we have the states of the model. So let's say we have K models and we have a bunch of discrete states. Each model has N states. And let's say there's a transition matrix, N by N.
And then for each time and frequency, there's that -- in the posterior you have this probability that one source dominates or the other. And that posterior, for each time and frequency, is over the K sources. So it's just a multinomial. It has a probability value for each of the K sources. But unfortunately, if you were to do exact inference, which would be silly, it would be completely intractable. It's all exponential. Like the mask states -- so, you know, if you have N masks and two sources, you have 2 to the N possible mask variables that you would have to explore. And each of the mask variables depends on the configuration of all the states of all the sources. So you have this nice -- oops. Part of my animation is stale. So you have this nice bipartite graph.

And what we know, like if you sort of think about a restricted Boltzmann machine, right, if you know one of the vectors, then inferring the other one is trivial because they're all independent conditioned on one of them. So even though the posterior has this horrible sort of fully connected kind of structure, if we know one of the things -- like we know the source states -- then all we have to do is compare those little Gaussians for that state and see which one is the max. And if we have the mask states, then all we have to do is sort of ignore the areas that are masked out for each of the sources, and then we can look at the sources independently to determine what their states are. So then we just had like HMM kind of complexity.

So the two conclusions I wanted to draw from that: one is that we can do what we did, which was a variational inference alternating between mask inferences and source state inferences. Also, if we can get good masks, we're kind of almost done. Inferring source states or whatever else you want to do with those sources -- impute the missing values, whatever it is you want to do -- that's all super tractable once you know the masks, if you can get them correct. So that was kind of one of the inspirations. Like, let's try to develop algorithms to get the mask, and then we'll go back to our crazy graphical models or whatever to do the refinement after that.

And so we did this stuff, this variational approach, to try to get past just doing two speakers. So originally we did this two-speaker case back in the day -- this is kind of a toy task, so take it with a grain of salt. So human performance on this task was about here, and we were able to do better than the sort of recorded human performance by doing exact Viterbi in a two-speaker factorial HMM, using a lot of tricks to speed it up. And then we had different variations of that. And two of these, the red one and the purple one, are both different variational approaches that alternate between mask inference and state inference. And in particular we came up with ways where the variational parameters could have a sort of particular complexity. So you could vary the complexity and see whether you could get similar results to the exact algorithm. And so for this case -- this was the exact thing that corresponded to this reduced one. This one had 65K things to consider in terms of state combinations, and this one just had sort of 256 separate mask values to consider. And we do just as well, even though we're reducing the complexity. So that was great.
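As a heavily simplified illustration of that conditional structure (made-up numbers, point estimates standing in for the Gaussian posteriors, not the actual IBM system), here is a numpy sketch of one step of the alternation: with the source states fixed, the mask falls out of comparing the sources' predicted log spectra, and with the mask fixed, each source can be scored independently on the bins it dominates.

```python
import numpy as np

# Toy sketch (hypothetical numbers): conditional inference in the factorial
# max-interaction model once the other set of variables is held fixed.
mu = np.array([[0.0, 3.0, 1.0, -1.0],    # predicted log spectrum for source 1's state
               [2.0, 1.0, -2.0, 0.5]])   # predicted log spectrum for source 2's state
y = mu.max(axis=0)                        # max-interaction observation (noise-free toy)

mask = mu.argmax(axis=0)                  # masks given states: which source dominates each bin
for k in range(mu.shape[0]):
    own = mask == k                       # states given masks: score each source only on its own bins
    score = -np.sum((y[own] - mu[k, own]) ** 2)
    print(f"source {k}: bins {np.flatnonzero(own)}, score {score:.1f}")
```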
And then you guys -- Jasha and Mike and Dong and Chow -- recently did a two-speaker DNN acoustic model. So our approach was separate and then recognize, and their approach was to just recognize. And they used some nice techniques to solve the permutation problem, partially when doing state inference and computing the probability of the states of the different sources, and then running a Viterbi to sort of unravel and get the associations across time to solve the permutation problem. And I'm not sure exactly where their score would fit in on here, but kind of off the chart.

But the direction that we went with this was -- oops -- as I said, to do this kind of variational thing so we could get beyond two speakers. So two different directions. But we wanted something that could be generalized to an arbitrary number of speakers. So we did that. It would be very complex. I'll just give you demos [demo playing]. You can hear how things used to work. Okay. So it works pretty well, actually. I don't think we're back to there with the embedding approach, so we're just kind of catching up. Of course this thing has a very constrained language model. And all these things matter a lot. So it's not clear if the two things can be compared. Also, closed speaker set and so on. So we didn't even use the same data because it just had too many assumptions. Closed speaker set, small number of words and so on. We were able to do four speakers. I won't play the demos. Just back to here.

So back to the -- if we can get good masks, we're almost done. That's where we left off. And now, what about DNN approaches for enhancement? Because that's almost like separation -- it basically is separation, the only difference being that you have two different classes: you have speech and you have nonspeech. So what do we do for that? We train a network to estimate a mask for each time-frequency bin, so it has sort of a softmax output that ideally should be one for the dominant source for that time-frequency bin -- one for the speaker if that's the source, or one for the noise if that's the source. And we have to model the context. LSTMs. Recurrent networks are the state of the art for that. If you don't know what those are, they're hideously complex networks that are recurrent, that are made so that the gradients pass well back in time, and they have gates for input, output, and forgetting. It's not clear whether all those gates are needed. It's just something that sort of caught on and people use this today. But surely there's probably a better way -- but, hey, it works and it's been implemented dozens of times. So we can just use that. And that's what we used.

So we've done a bunch of different work just doing speech enhancement using these kinds of things. And in the more intricate ones, we also took up this idea of: let's iterate between getting masks or reconstructing sources, and doing recognition and getting states, and feeding those things back. And that all helps incrementally. You get a little bit out of doing that kind of iteration; even though it's not a variational inference algorithm, it still can help. And these are the kind of results. These are word error rates.
So if you just look at sort of the average results: starting from basically doing nothing, to doing kind of a vanilla nonnegative matrix factorization type of approach, down to doing a bidirectional LSTM with a special objective function that considers the complex-domain match to the signal, and feeding back in some state information after one iteration through a recognizer, and things like that -- the numbers keep decreasing. And here's a two-channel system using a little bit of beamforming to get to a slightly lower number. So just to show that these networks really work at this task. Even though the nonstationary noise might even have some speech in it sometimes -- yes.

>>: John? So what is -- is your goal here going to be for speech recognition, or do you see this as improving perceptual quality for human consumption of these signals?

>> John Hershey: It can be either. Like, for the -- you mean for the talk in general, or for --

>>: For your research, or what you're presenting today.

>> John Hershey: Yeah. I mean, I think it could be for either one. It's not clear. It depends on -- you know, it's really a test-dependent kind of thing. But you could use it either way, and you can use different objective functions to train it. These things were all trained sort of in the signal domain as enhancement things. So they actually sound pretty good and they help recognition. But if you were to sort of hook it up to the speech recognition objective function, then it could be better. It might also sort of wander away from actually doing something reasonable with the signals and give you something that wouldn't be listenable, if there's some arbitrary transformation of the signals, but might be good for recognition.

>>: [inaudible].

>> John Hershey: These are with retraining, yeah. I believe. These are with retraining.

But the problem is you can't just naively, at least, apply this to speech-on-speech problems, because it's the same class. You could do male speech versus female speech. That works. You can even do speaker A versus speaker B if you have speaker-dependent models. That even kind of works. But for the speaker-independent case, where both can be male or female, it just doesn't work.

So as I mentioned, for acoustic modeling this has been kind of solved: the way that the permutation was disambiguated was to use the state of the louder source as a target and the state of the quieter source as a separate target, and then -- if I'm remembering correctly -- to use Viterbi across time to link up the correct states across time. So that solves the permutation problem. It's not clear how to extend that to more than two speakers. Or how to do enhancement with that.

So for enhancement, we just naively tried. And so here, using a BLSTM, it's a recurrent model, so you can't just sort frames frame by frame, so we decided, okay, we'll put the louder whole utterance or chunk in one of the outputs and put the softer one in the other target output. That didn't really work. We're using chunks here of about a hundred frames, just to kind of limit the globalness of it.

>>: [inaudible].

>> John Hershey: I mean, negative results in neural networks are often --

>>: Right, but I mean [inaudible].

>> John Hershey: Yes.

>>: Yeah. Don't have any way to look at continuity.

>> John Hershey: And it's bidirectional, so we can't look at the output.
Although you could think that maybe by building up layers, each successive layer is sort of closer to the output, so eventually it could sort of infer the output. So, yeah, it's not clear that it can't be solved this way. But we couldn't do it. And so we also tried the simplest thing, which we thought would work: at training time we just look at the two masks that were generated and we use an Oracle to pick the best matchup between those outputs and the two targets. We have two targets, two masks, we match them up optimally according to the objective function, and then we get gradients after that. That fails.

And then there's also the deeper -- you could call it a permutation problem -- which is, what about the number of sources that are in there? How can we handle different numbers of outputs in this kind of a rigid framework? But you can imagine, you know, it's kind of like those Dirichlet prior models: you have a bank actually, and then you put priors so that the ones that end up zero you ignore, and it determines the number of classes. You could imagine doing something kind of like that.

But, anyway, this is kind of what we got with the deep net just training on mask estimation. This is the sort of Oracle output mask for two speakers, and this is what we got. And this kind of looks like it's doing something from time to time. But, you know, we're asking it to label speech with one and also label the speech with zero. So it gets a little confused.

So we decided to come up with something that wouldn't be as hard to get to work. So I already went through this before. We're going to use embedding vectors for each time-frequency bin. And kind of comparing to the class-based approach: these V are kind of the model outputs and these Y are the labels. These are just kind of very generic, so don't pay too much attention. If we're training the network to output some approximations to the labels, and these are the labels, that's the kind of class-based approach -- the labels are just going to be an indicator for which class corresponds to which mask. And instead we're going to do something where we want to be able to compare the outputs for different time-frequency bins in the case where they're from the same class, and/or compare them when they're from different classes. So they'll have an objective function more like that, with the assumption that once we do that kind of training we'll be able to then tease apart which things belong together.

So just to make it a little more formal: we have this sort of indicator variable Y, a rectangular matrix, where the row is which time-frequency bin it is, and the column C is, for that utterance, which of the different sources that went into that utterance it is. Those can be permuted around because of the way that we're going to make that into an objective function. So we want to be independent of the ordering of the columns of Y. So when we take the outer product of Y, that gives us something that's independent of the permutations. You multiply Y by a permutation, then take the outer product, you get the same A. And it turns out we can regard this A as an ideal affinity matrix, if you think about sort of spectral clustering types of approaches where you have an affinity matrix. The ideal one would be one where things that are in the same cluster have a value of one in A.
So A_ij would be one if i and j are in the same class, and it would be zero otherwise. And that way it's easy to think about sort of permuting the rows and columns of this A around so that you get a block diagonal structure, which is exactly the sort of ideal spectral clustering input. Because it has rank equal to the number of blocks, and each eigenvector is just an indicator for where that block occurs, where that class occurs. So if we know A, we can just recover Y.

So we're going to approximate A using some function. If we were to just sort of naively approximate the whole big A, you know, like saying we have the sort of ideal A and we're going to approximate it with some output of the network, it would be just too big. So instead of doing that, we do something kind of like what people do in spectral clustering: we use a low-rank approximation. But kind of like the philosophy nowadays, instead of having a complicated model and then approximating it, we're going to start off with a model that would be considered an approximation to the original thing and train that thing. We're going to train the approximation, because that's our new model. So we have an approximate model, which is that we're just going to use this kind of a construction for the affinity matrix. So note that this is nothing like the spectral clustering way of doing things, which would have a local kernel, and that leads to a sparse affinity matrix which then has to be decomposed into eigenvectors in order to sort of complete the sparse blocks and fill them out into being full blocks. Instead of that, we're going to train the network so that this thing gives us full blocks. That means it's going to give us dense clusters. That's the goal.

So getting into the precise thing that we actually do: we have some input features, and since we're using a bidirectional LSTM, you can just think of the output at a given time and frequency as sort of dependent on the whole spectrogram. And so H is our network with parameters theta. And this is our objective function. So we want our VV^T to approximate YY^T. That's what we said we would do. If you sort of go through the algebra, you can see that as saying that the things that are in the same class, we just want them to be close together. This term actually ends up just being N once you do the math and have the proper weighting on these things. And then this guy says that if they're in different classes, we'd like their distances to be further apart, like close to two, even though they can't quite be that far apart. They're unit vectors in our formulation.

So another way to think about this objective function is that, aside from just some little weights for the number of items in each class, this is exactly the K-means objective function as a function of Y, the assignment of points to classes. So at training time, given the assignment, we're training V using the K-means objective function. And at test time, we hold the parameters constant and we optimize over Y. We try to find the assignment that reduces the same objective function. So that way we can feel confident that what we're training it to do is actually the right thing for the procedure that we do at test time.

>>: I want to make sure I understand your notation here. The V_i's are vector embeddings?

>> John Hershey: Yes.
>>: [inaudible] V_ik would be -- k would be like a time-like variable, or --

>> John Hershey: i is the time-frequency index and k is the embedding [inaudible]. They're row vectors and they sit sort of as rows inside the V. So VV^T compares each time frequency to each other time frequency. So VV^T is a big matrix, way too big to actually use explicitly. But fortunately this is just a, you know, Frobenius norm squared error type of function, and it turns out you don't have to actually instantiate that big matrix because of that.

>>: So but [inaudible] the norm of V_i is all one, right?

>> John Hershey: Exactly. Exactly.

>>: And then -- so each row is -- [inaudible] V_i minus V_j is just the cosine --

>> John Hershey: Yeah, the norm of -- yeah. Or times the cosine of the angle.

>>: [inaudible]. Times. Not minus. The product of V_i and V_j is [inaudible], but V_i minus V_j is just one plus one plus the cosine --

>> John Hershey: Exactly. Yeah. Yeah. It's two minus two times the cosine of the angle.

>>: So the maximum of this cosine should be something between negative one and one, right? And then you want to --

>> John Hershey: Right. One for the same. The angle will be zero, and then the cosine -- so then you'll get -- you'll get zero. These will be the same vector. This will be zero.

>>: I just don't understand why you sort of have here negative two [inaudible].

>> John Hershey: Well, first of all, all I'm doing is expanding this out to get down here. But the intuition, you could say, okay, this is a distance over all of the i, j. So it's saying that we'd like this distance, this squared distance, to be close to two. So, in other words, we want it to be larger. Because it can never be two.

>>: But the value you get is just one plus one plus the cosine --

>> John Hershey: Can be two [inaudible].

>>: So what you get is just the cosine [inaudible], right? Because you will get -- you get what I mean?

>> John Hershey: Yeah. So this quantity, V_i minus V_j squared, is two minus two times the cosine --

>>: [inaudible].

>> John Hershey: Yeah, exactly. So this will be zero if they're the same. And otherwise it will be larger than zero. Zero is the smallest it can be, obviously, because it's a distance [inaudible].

>>: Because I saw some other [inaudible] objective function which is just trying to maximize the similarity between [inaudible] the same class and minimize it for the negative pairs that are not in the same class. And I'm trying to see what could be the difference in the optimization between those two objective functions.

>> John Hershey: I see. Well, you know, people often want to put like a hinge loss on these types of things, because we want something to be far away, but you don't really care how far away it is as long as it's further than anything else in that cluster. So with a margin you could have a hinge loss, and then you don't care after that, which might be more robust. The problem is you cannot put a hinge loss on this and also keep the fact that the low-rank construction means that the derivative is very simple, and that, in fact, this loss itself is just the squared Frobenius norm of V^T V, minus two times that of V^T Y, plus that of Y^T Y -- all of which are small, K by K Frobenius norms. So very simple, and there's no N by N. The N by N is this astronomically large thing that we want to avoid computing at all costs.
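Here is a minimal numpy sketch of that rearrangement (generic shapes, not the actual implementation): because ||VV^T - YY^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2, the loss needs only small matrix products and never forms the N-by-N affinity matrices.

```python
import numpy as np

# Minimal sketch (not the actual code) of the deep clustering loss
# |VV^T - YY^T|_F^2 computed from small matrix products only.
def dc_loss(V, Y):
    """V: (N, D) unit-norm embeddings; Y: (N, C) one-hot source indicators."""
    VtV = V.T @ V                              # D x D
    VtY = V.T @ Y                              # D x C
    YtY = Y.T @ Y                              # C x C
    return np.sum(VtV**2) - 2 * np.sum(VtY**2) + np.sum(YtY**2)

# Sanity check against the explicit N x N form on a small random example.
rng = np.random.default_rng(0)
N, D, C = 200, 20, 2
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(C)[rng.integers(0, C, size=N)]
assert np.isclose(dc_loss(V, Y), np.sum((V @ V.T - Y @ Y.T)**2))
```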
So this thing is just a Frobenius norm if you rearrange all the terms -- you multiply the large dimensions first, basically. So this term is saying that for all points we want them to spread out; otherwise there's kind of a trivial solution, right? If we don't have this term, then we could just put them all at the same place. And then all the things that are in the same class will have minimum distance and we're great. But we want them to be different. So this one just spreads everything out. This one says things in the same class must be compact. There are ways of interpreting this, too. I kind of skipped some lines there.

But this just says what I just said: it's the same as the K-means objective and similar to the spectral clustering one, but now, because we're training it to be compact, the clustering is actually much easier. So think of spectral clustering as kind of trying to fill out those block diagonal things. You could think of it as taking the affinity matrix to a large power, because you need to spread throughout the cluster. If you think of a transition matrix as one way of thinking about what spectral clustering does, you have a sparse transition matrix, but still you have a connected set for each cluster, and then by following through that transition matrix and linking up things across multiple transitions, you get back to a block diagonal thing. But we're training ours to be close to block diagonal in the first place, so it doesn't require as much processing to do the decoding kind of thing.

So, yeah. So this is just saying what I said. You can rearrange this thing to do the large vector multiplies first, and that way it's actually much less complex, and memory-wise takes much less space than storing that whole VV^T thing. You don't actually have to do that at all. The gradient is also very simple and cheap to compute. And the rest of the gradient is just whatever your network is. Doesn't have to be LSTMs; it could be convolutional networks, whatever. We don't get into the gradients for that part.

>>: So [inaudible] which gradient --

>> John Hershey: Oh, yeah. We left off the gradient of the normalization. That's another thing. So each row of V is unit norm. So you just have the gradient of V over its length, which is like a softmax-like gradient. The only funny thing is since --

>>: [inaudible] so it has to be -- well, although the [inaudible].

>> John Hershey: Well, the way it's implemented is that you have a transformation coming from the network which is then normalized to be unit length. And that's just the forward procedure of the neural network. So we don't consider it an optimization problem with a constraint. That's just how the network works. It outputs something which gets normalized. So it's back propagated; we just take the gradient of it. It's a simple gradient. The only funny thing is, since it's a sphere kind of normalization, all gradients are sort of stepping off of the sphere a little bit. So it might be good to normalize your weight matrix. Since actually -- for the weight matrix for each of those rows of V, it doesn't matter what the scale is. If you multiply it by some number, the output is going to be normalized anyway. So it's invariant to that scale. So all derivatives will be tangent to that surface, but you'll always be stepping off.
But you can always renormalize the weight matrix. So in essence you can always stay on the sphere.

So what we do is train it on hundred-frame chunks. This is just what we did in the experiment, because we were going to try the sort of SVD approach, the sort of spectral clustering-like approach, and you can't do that on something that's too big. So we limited it to hundred-frame chunks during training. And then we tried clustering within each chunk of frames separately and then hooking up the results, resolving permutations after that, or we tried just using global K-means. So the global K-means is the only thing in here that doesn't use any Oracle information at all. When we did the individual chunks, we used the best possible permutation of each chunk, and we also had a method that did that automatically. But somehow it didn't end up in the table. I'm not sure why. But that can be done too. It's not hard to come up with a nice version of K-means where, if you have overlapping chunks, you can sort of satisfy the assignment constraints of K-means so that they're consistent across chunks but allow the centroids to be different for each chunk. So you can imagine the embedding sort of evolving over time, but you're linking them up and forcing the same assignments across chunks, even though the centroids are different.

Okay. So then we trained it on 30 hours of artificial mixtures of two speakers, mixing around 0 dB plus or minus 5 dB. And we had two different evaluations. Because we wanted to have some kind of baseline for this, but we couldn't find anything handy that was good enough. So what we chose as a baseline was to say, okay, let's train NMF models for each speaker using all of that speaker's training data, and then we'll use an Oracle to tell us which two speakers we've got in there, and we'll use those NMF dictionaries. They have like ten frames of context, so they're actually a fairly powerful NMF thing -- nonnegative matrix factorization. For those of you who don't know, you have a power spectrum and it's easy to describe that as a sum of a bunch of basis functions that are nonnegative. So that's our sort of Oracle baseline. And we could evaluate that on the closed speaker set, but we couldn't evaluate it on the open one, because we don't have a good method for that.

So let's go through the results that we've gotten so far. So the global clustering is a little worse than the local Oracle clustering. So Oracle K-means, that's where we do individual chunks and then find the best permutation -- it should probably say Oracle permutation, I guess. And then global K-means is doing K-means on the whole thing. So we lose a bit going from -- sorry, this is [inaudible] improvement in decibels. So we get around 6 1/2 dB overall for things that are mixed around zero dB. So we're improving somewhat. It's not ideal -- you'd like to have like 30 dB, I guess, ideally. Or 10 would be nice. So these are still preliminary results. But, anyway, it's kind of interesting that the global one works at all. Because we really expected things to be shifting over time, and we only trained it on local chunks of a hundred frames. So evidently there's enough sort of common information, more like speaker-ID type of information, in the embeddings that allows you to do that. But you do lose something, and we'll see more of that later on. We also did singular value decomposition to try to do something sort of spectral-clustering-like.
Also, you know, you can think that on the sphere, K-means and SVD are pretty much the same thing. They're doing almost the same kind of estimation. One is just a relaxed sort of version of the other. And that had no effect really.

And then of course an interesting thing is how many embedding dimensions do you need to do this, and it seems relatively insensitive. Our best was at 40. One thing I didn't mention about the structure of the network is that the output layer of the LSTM actually had a tanh function before being projected from the hidden vectors of the LSTM to this output embedding dimensionality. So there's some kind of limiting going on first, and then we have a projection. And we tried setting that to logistic. Why we didn't just try a linear version of that, I'm not sure. But we didn't.

So kind of revealing are the results on gender. We can consider the different combinations -- male and male, female and female, male and female, and altogether -- and you can see that really the male and female case is doing great. Doing really, really well. Surprisingly even better than this Oracle NMF thing by a fairly large margin. But it suffers a lot with male and male, female and female. And particularly the global K-means is just starting to really lose a lot of the benefit. And, again, it kind of makes sense: now you've got same gender, so a lot of those may be even close to the same speaker. Two male speakers -- some may be more different from each other, some might be more similar. And if you're relying on this global information that was only trained on a hundred frames, maybe the network can't really put in enough information on the basis of hundred-frame training to really tell the difference between one male speaker and another very well. But it's still doing something. So it's encouraging.

And then I guess I didn't really emphasize this before, but we're not losing much going from the closed speaker set to the open speaker set. Obviously you can't do that at all with the NMF with the Oracle speaker ID, but basically it means that our model is pretty speaker independent. It's not really relying on the IDs of the 80 speakers that are in the training set.

>>: It's weird that your male-male and female-female results [inaudible] is that just chance?

>> John Hershey: Yeah --

>>: [inaudible] better for females as opposed to males --

>> John Hershey: That is -- that is -- it's suspicious. I think it must be correct. Actually, yeah, that's a good point. It's a bit suspicious. Actually my theory for why female-female works better is just because the harmonics are further apart, easier to separate. We use relatively small windows. We should have used windows large enough that you can actually see the harmonics of male speech, but we use 25 millisecond windows. So it's getting tough to see differences between harmonics for males.

Okay. So let's just hear some results, see how it goes [demo playing]. So pretty good for something that has no language model, you know, no separate models of individual sources at all, just kind of training on this mixed situation [demo playing]. And so just for comparison [demo playing]. Doesn't really work. Those two speakers are really close. So even though it had two different dictionaries trained on just those speakers, it's just impossible to separate based on individual templates of the speakers. Okay.
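Decoding at test time, as described earlier, is just clustering the unit-norm embeddings. Here is a minimal numpy sketch (hypothetical shapes and names, not the actual code) of a weighted K-means on the sphere that turns embeddings into one binary mask per source; the number of clusters is a free parameter, which is all that changes for the three-speaker case discussed next, and the per-bin weights are there for down-weighting silence bins, as described later in the talk.

```python
import numpy as np

# Minimal sketch (hypothetical, not the actual system): weighted K-means on
# unit-norm embeddings; cluster labels become one binary mask per source.
def kmeans_masks(V, n_sources, weights=None, n_iter=50, seed=0):
    """V: (N, D) unit-norm embeddings for N = F*T bins; weights: (N,), e.g. 0 for silence."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(V)) if weights is None else weights
    centers = V[rng.choice(len(V), n_sources, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(V @ centers.T, axis=1)       # nearest center on the unit sphere
        for k in range(n_sources):
            sel = labels == k
            if w[sel].sum() > 0:                        # weighted M-step; zero-weight bins are ignored
                c = (w[sel][:, None] * V[sel]).sum(axis=0)
                centers[k] = c / (np.linalg.norm(c) + 1e-12)
    return np.stack([(labels == k).astype(float) for k in range(n_sources)])

# e.g. masks = kmeans_masks(V, n_sources=3)   # same trained network, just more clusters
```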
So then three speakers -- we actually tried that: just fed mixtures of three speakers into the same network and just set the clustering at the end to have three clusters instead of two. And we do pretty well. I mean, this is sort of comparable to some of the male-male or female-female numbers. It has a male-male pair in it.

>>: Oracle NMF means you pick the right speaker?

>> John Hershey: We picked the models corresponding to each of the speakers.

>>: [inaudible] Oracle means you just know across segments which cluster is [inaudible] segment is, because segment to segment the cluster [inaudible].

>> John Hershey: Yes, yes. Yeah.

>>: So just put it in the right [inaudible].

>> John Hershey: Depends on the initialization -- it's arbitrary, it's permutation independent. An arbitrary clustering comes out of each one. And this one is just running the full [inaudible]. So if we were testing on individual chunks, then I guess these would be more similar, right? This one would be better, because the Oracle clustering for one chunk is the same as the non-Oracle clustering for one chunk.

>>: But the embeddings for speaker 1 on chunk one, there should be similar embeddings for the same speaker on a different chunk, right? Or no?

>> John Hershey: Well, they can represent the local information, so there's, you know, the pitch contour and what phoneme is happening right then.

>>: Could you overlap chunks to get better permutation information?

>> John Hershey: Yes. Yes. We did originally extract overlapping chunks. But Joe -- he was an intern at the time who did this -- he just sort of threw those away because they were taking up too much disk space. It's fine. Like I say, we did try a method that actually considered consistency across those in a nice framework. It didn't do as well as the Oracle, but I believe it was a bit better than the global one. I don't know why we didn't go with that. It just wasn't ready yet, I guess, at the time of printing up these results. But you could imagine solving these problems with some smart way of linking things up, a dynamic programming kind of thing, across time.

So that's the picture I showed before. So what do the embeddings mean, though? That's an interesting question. We would really like -- we wish we could do the sort of inceptionism-like explorations of what kind of crazy sounds these things respond to, but we haven't tried and don't really know how to do that. This is just looking at -- so that's the mixture mask. These are three different randomly chosen embedding dimensions. So you can make a spectrogram out of an embedding dimension, because it has a value for every time and frequency. So somehow these correspond to what we're seeing in the input. But we were hoping to see something like, oh, this one is pitchy and this one is looking at onsets. You can't see anything like that.

One reason that it's hard to interpret these is what I was saying before: the embeddings are really rotation invariant. So if you multiply V times a rotation matrix, call that V tilde, then V tilde V tilde transpose equals VV^T, no matter how you rotate them around. So you're going to lose all the meaning of any particular node in the embedding by rotating it. So that kind of gives us almost no hope of understanding them. But we can try to straighten them out or detect something. Or maybe at least just look at the output matrices and see what kind of patterns they have.
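That rotation invariance is easy to check numerically; a small sketch (generic shapes, made-up data) showing that rotating the embedding space leaves VV^T, and hence the training objective, unchanged:

```python
import numpy as np

# Sketch: the objective only sees VV^T, so any orthogonal rotation of the
# embedding space gives exactly the same affinities (and the same loss).
rng = np.random.default_rng(0)
N, D = 100, 20
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)

Q, _ = np.linalg.qr(rng.standard_normal((D, D)))   # a random orthogonal matrix
V_tilde = V @ Q                                     # rotated embeddings

assert np.allclose(V_tilde @ V_tilde.T, V @ V.T)    # identical affinity matrix
```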
It's kind of a weird thing, right? You've got 600 nodes, and each of them is just multiplied by a matrix to get out to this whole embedding space. So each of them sort of has its own embedding. So if it's zero, you don't see that embedding. If it's one, you get that embedding in your output as one part of a sum. So looking at what those are might be interesting, but --

>>: [inaudible] frequencies you have during the [inaudible] you have certain patterns during silence. I'm not sure [inaudible].

>> John Hershey: Yeah. So what I didn't mention so much, but sort of mentioned before, is that what we do for the silence is we just put a weight on the objective function during training that says we're just going to ignore those silence bins. We don't actually care which source they belong to. So we're not going to train the network to try to come up with a good embedding for the silence. Because you can tell the silence by looking at the mixture -- have a threshold kind of thing. And then at test time we also just look at the mixture and pick that silence threshold. And then when we do the clustering, we don't use those values to determine the cluster centers. So we do a weighted K-means and we just ignore them when doing the M step of the K-means. We just don't include them.

So let's see -- probably close to out of time, but I'll just go through what we did a little bit this summer in the workshop that some of us were involved in. There was an idea to try to use microphone array input for this, because this thing is kind of a really tough problem. You're looking at a single spectrogram and you're trying to tease out which parts of it belong to which source. And just by the pattern of the amplitudes it's very complicated, because you don't know where one begins and one ends. It's the chicken-and-egg problem. That's why you have to iterate between masking and states. But if you had microphone array input, then there are sort of directions in space that cause delays between the microphones, and those delay patterns can sort of distinguish one source from another and at least give a clue that could be used.

But we want to now handle that kind of permutation problem in the input. So we want to be sort of independent of the microphone array geometry, ideally. At the very least, be independent of the ordering of the microphones, because that shouldn't have anything to do with the algorithm. So the approach that we came up with was to first do an initial microphone array clustering algorithm that looks at the time-frequency bins and clusters them according to their plausible consistency with a set of delays between the microphones. So every pair of microphones can have some delay, and you can sort of detect these delays. And the pattern of delays is kind of indicative of a particular direction of arrival or something. And so by clustering those things, you can sort of have a feature that differentiates between different sources. So then the way we use those clusters is we use each of those patterns of delays to do beamforming. So we use them in a kind of delay-and-sum beamformer with those particular time delays, and then feed in the beamformed channel. So then you can hope that if there's a beam that's sort of pointing at one of the sources, the values will be larger than in a beam that's pointing at a different source. And so the pattern across those dimensions indicates something about the clustering of the sources.
So it's still not enough, because the beams could be in different orders sort of arbitrarily, because you're just using clustering now. And unless you want to train on so much data that you automatically learn a network that is insensitive to the order that you put things in, probably you're not going to fit something to those orderings. And so we wanted to just be completely independent of that. So we came up with a network architecture that kind of has local channels for each of the inputs. So at test time you can have as many channels as you want in the input, because they're all the same -- they all have the same weights. And then from each of the local ones we pool into sort of a global one. Because ideally you want those local channels to compare to each other, because you need to say, okay, this one seems louder than this one for this particular source, or something like that. But you can't do comparisons without somehow having an indexing between the inputs.

So to do permutation-free sort of comparisons in a neural network, we use a structure like this. So same color -- at least within one layer -- means same weight matrix. So same parameters. So this one for channel one is actually the same network as for channel two. The networks are exactly the same. So how can they have different sort of information? Ultimately we want them to be calculating different information. Well, the inputs are different, so that's a start. And then we use kind of a pooling to pool into a global unit and then feed back the global output into the individual channels. So this way this guy, for instance, can sort of compare its own representation to the representations coming from the other units. Or maybe here's a better example, since it's going to be hidden: do the pooling, back to the hidden. So this guy can see sort of how it's doing relative to the other channels, essentially. And we don't know what kind of function it's learning inside, but at least it has a chance of learning something completely permutation independent. And this can be used for any kind of problem like this where you want to have permutation independence in the input.

So that's pretty much where we left off. We didn't actually get to really try this out yet. It's just theory at this point. But I thought I would throw it in there because it's kind of a future direction that we've already sort of made some steps towards.

So, bottom line: we're solving this output permutation problem; multiple instances of the same type of source can be handled; we can generalize to different numbers of sources. Maybe you could even imagine using this to train a kind of universal sound separation engine if you train it on enough things and have enough capacity. There's no language model, no complex decoding process. The whole thing is very fast. It's just a feed-forward network -- well, it's a BLSTM, but you can imagine doing a convolutional network that would just be feed forward. And we're working on incorporating microphone array information. So stay tuned for further developments. That's my talk.

[applause]

>> Jasha Droppo: Any other questions?

>>: I've got a lot of them, but do you think it might work for image [inaudible] recognition?
>> John Hershey: Yeah. Yeah. That's definitely something we want to try. I think that the image segmentation literature is a little more vast than the deep network source separation literature, so I can't guarantee that something like this hasn't been done. There was something a little bit like this. The only thing that we found that was closest to this was doing an MRI volume segmentation by using a convolutional network to infer affinities. The affinities were just adjacent affinities. So it was like you have a point in a cube, a three-dimensional grid, and so you have six numbers to infer, which are the local affinities. And that's been done. Nobody's used this kind of embedding approach that sort of has long reach across grids like this. But I do think it would be good for that. In that literature, there are a lot of datasets that have a ton of classes. So, you know, you can use more class-based approaches in those cases.

>>: Right, right. So they sort of maybe don't have a permutation problem per se, because, like, you know, the features associated with light force are different than the person. But, you know, I mean, they also --

>> John Hershey: And you can train the network.

>>: -- learning and things like that.

>> John Hershey: I want [inaudible].

>>: So here, like, you know, the signal is same difference, so then you don't care about the exact classness and --

>> John Hershey: Yeah. I mean, there's kind of -- I've used that class-based image segmentation method as kind of an existence proof for this, because you could say that the vector of probabilities of each class that you would get out for each pixel -- let's think of that as an embedding. Well, when things are in the same class, then it should match, and when they're in different classes, they should be different. The advantage here is that now we can do this without having class labels. So there's a ton of objects that don't really have a class label. You know, if you see something on the ground, you can pick it up without knowing what it is until you get a closer look. So we know that for human perception, it's not a problem at all to segment something without knowing what it is. So there's no reason you couldn't have datasets that have unrecognizable objects in them. It's just that they don't exist. There may be some that just have segmentations. But, you know, it has to be done by a human, so it's kind of a nonstarter for me, at least. For this stuff we just mix them together. So if you have N things, you have N squared training things --

>>: [inaudible] and then render, then you have something.

>> John Hershey: Yeah. Yeah. So depending on how realistic that is. That's the only thing, is the realism.

>>: With respect to the inception thing, so like -- I mean, it seems like this thing is pairwise, so if I have a real [inaudible] and then I say I want to move away from this thing but maintain the cosine -- I want it to be similar in the embedding, but otherwise as different as possible. And then if you also do the inception thing of like preserving the local --

>> John Hershey: Yeah --

>>: -- so you don't get white noise, have you tried it? Do you get reasonable sounds?

>> John Hershey: We haven't tried that. There's one tiny issue, which is that the phases are kind of important in sounds in order to sound good. But that's okay. I mean, you could still, I think, learn a lot from it. And we could think about just optimizing the phases a little bit to patch them up. Yeah, I think something like that makes sense. Start with a sound, just one, and then try to stay in the same class but get different -- see what the equivalence classes are. Yeah. Yeah, that's nice.

>>: Because those guys are super famous. You know the inception guys have an art exhibit now.
>> John Hershey: That stuff is mind blowing.

>>: Like -- like, right?

>> John Hershey: Yeah. I still haven't recovered from seeing those videos of that stuff.

>>: Yeah, I know.

>> John Hershey: It brings back bad memories. Lying under a bush after midnight. Indeed.

>>: So back in the missing feature days, there are methods where you could sort of -- you had the mask, but you could basically -- you could --

>> John Hershey: [inaudible].

>>: Yeah, [inaudible] by looking at local correlations. But it turned out that the correlations basically fell off pretty fast beyond some, I don't know, ten frames or something. And so it seems like you're trying to learn a sort of affinity over one second, so you're trying to -- there are relationships from high [inaudible] at time T to a low frequency band one second later. Maybe not informative in reality, or difficult. Do you think you'll just learn that those things are hard to do, or do you think you should [inaudible]?

>> John Hershey: No, no. I think that actually this gets at a very key point that I didn't really hammer on very much, but it's actually really crucial to understanding how this can even work at all. There have been approaches to try to do spectral clustering of mixtures of speech before. So Michael Jordan and Francis Bach had a paper on this, using spectral clustering and training through the spectral clustering objective function. But the features were, like you said, local, and they're uninformed by that masking function. So the features kind of overlap with multiple sources, potentially. Actually their whole method was problematic because they also had a pitch tracking algorithm as input. But in any case, one thing that this is not doing is looking at local regions of the input. The only thing local is that the output has an embedding assigned to each time-frequency bin. The input -- it's looking at the whole spectrogram to determine that embedding for one time-frequency bin, and therefore for the pair. So it's not as if it's just looking locally and trying to match something that it sees at low frequency at frame one with something at high frequency at frame a hundred. It's looking at the whole thing and sort of classifying, doing whatever it internally has to do to understand what that signal is and therefore where it must dominate. It's kind of a global approach. We do think that one should try to do convolutional things, but we don't know how little context you could get away with. I think ultimately, by the time you get to the top, every embedding should basically be informed by the whole thing. I guess.

>>: [inaudible] and then make him talk some more.

>>: [inaudible].

>> Jasha Droppo: Okay? Well, let's thank the speaker one more time.

[applause]