Ivan Tashev: Good afternoon, everyone. Welcome to the final talk of our intern, Matt Mirsamadi. He is a Ph.D. student at the University of Texas at Dallas under the supervision of Professor John Hansen, and today he is going to talk about DNN-based online speech enhancement using multitask learning and suppression rule estimation.
Without further ado, Matt, you have the floor.
Matt Mirsamadi: All right. Thanks, Ivan, for the introduction.
Hi, everyone. My name is Matt. And I was interning with the [indiscernible] group this summer and my project was on DNN-based speech enhancement. And that's what I'm going to be talking about during this presentation.
All right. Let's start by quickly introducing what speech enhancement is and what we are doing here. Speech enhancement is the task of recovering a clean speech signal based only on noisy and reverberant observations. We are focusing on the single-channel case here, so you just observe one channel of a noisy signal and you want to recover the clean signal. The final goal here is improving the perceptual quality of the signal for the human listener. We are not concerned with doing enhancement for classification or recognition tasks. What we are looking for here is to just improve the perceptual quality, and hopefully the intelligibility, for the human listener.
And pay attention to the fact that I've put the room impulse response here. One of the major goals we had in mind when we were starting was that we wanted to also assess the performance of these enhancement algorithms while reverberation is present. Not necessarily to do de-reverberation, but to see how well we can remove the noise while reverberation is present.
And speech enhancement has a variety of applications that many of us know about. One of the most conspicuous ones is mobile communications. Another major application area is hearing aids and cochlear implants, where the final goal is to actually play a better quality signal for hearing-impaired people.
So this slide is basically a summary of the whole story of my project. It tells you a lot about what I did during this project. You have these conventional speech enhancement algorithms that are based on assumptions about the statistical properties of speech and noise. What they do is estimate the so-called suppression rule, which is a real-valued gain, a number between 0 and 1 for each time-frequency bin, designed so that it attenuates the energy in time-frequency bins that have been affected by noise and leaves the rest untouched, so that after this procedure the signal has been denoised.
It entails multiple different blocks working together. You have the voice activity detector, and then based on information from that, you keep a current noise model, and then you compute quantities known as the prior and posterior SNRs, estimate the suppression rule, and reconstruct your signal.
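As a rough illustration of the suppression-rule idea (a minimal sketch, not the actual code from the project; the STFT settings and the NumPy/SciPy helpers are assumptions), applying a per-bin gain to the noisy magnitude and reusing the noisy phase could look like this:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_suppression_rule(noisy, gain, fs=16000, win_len=512, hop=256):
    """Attenuate noisy time-frequency bins with a per-bin gain in [0, 1].

    `gain` has the same shape as the STFT of `noisy`; how it is estimated
    (statistically or with a DNN) is the subject of the talk.
    """
    _, _, X = stft(noisy, fs=fs, nperseg=win_len, noverlap=win_len - hop)
    mag, phase = np.abs(X), np.angle(X)           # enhance magnitude only
    enhanced_mag = np.clip(gain, 0.0, 1.0) * mag  # apply the suppression rule
    _, x_hat = istft(enhanced_mag * np.exp(1j * phase),  # reuse the noisy phase
                     fs=fs, nperseg=win_len, noverlap=win_len - hop)
    return x_hat
```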
So what we are doing in this project is looking at whether we can replace this enhancement system with a deep neural network and learn this transformation from noisy to clean speech features based on a training database, rather than just performing signal processing on the signal.
And there are basically two major configurations that you can imagine for doing this task. I'm showing them in these two figures, in the center and on the right. The simplest thing is to have an end-to-end DNN regression: a single network that receives the noisy spectrogram, transforms it through its layers, and then at the output you want to get the clean features.
Another approach would be to be a little bit less intrusive here and keep the basic structure of what an enhancement system looks like, but replace many of these blocks with DNN-based alternatives and try to do the same thing: try to estimate the suppression rule, the gain in each time-frequency bin, rather than the magnitude spectrograms themselves. And I'll talk in detail about when and why we prefer each structure to the other during the presentation.
So just to keep in mind what our main goals have been when we were starting the project: one of the primary goals was that we wanted to do causal processing. In many applications of speech enhancement we can't afford to use even a couple of frames in the future if we want the whole procedure to be causal. So this is one goal that we had in mind. The other thing is that we need some degree of generalization. We want to be able to handle, at least to some degree, unseen noise, noise types that haven't been seen during the training phase. And as I told you before, we wanted to do all of the experiments in the case where reverberation is also present. So these are the goals that we had.
So here I'm just listing the main things I'll be talking about during the presentation. I'll very quickly introduce conventional speech enhancement algorithms and how they perform. Then I'll talk a little bit about the training dataset and how we designed it for the DNNs, and about our evaluation metrics for the performance measurements. Then I'll talk about the two basic DNN configurations we had: the end-to-end regression one and the one with suppression rule estimation. And I'll talk about various details, like how we trained the DNN, a couple of issues with causal implementation and feature normalization, and also about multitask learning, which was found to be helpful for the enhancement algorithm.
So here I'm listing a number of statistical assumptions that conventional algorithms make about speech and noise signals. A lot of them have to do with statistical independence. The signal is assumed to be statistically independent from one frame to the next and also across different frequency bins. And the speech signal is obviously assumed to be independent from the noise signal. Then there are the distributional assumptions: most of them assume Gaussianity for the spectral magnitudes. Another assumption is that noise usually changes more slowly than speech, and that's why they have a lot of problems dealing with fast-varying noise, things like clicking, clapping, and those kinds of sounds. Those types of noise basically go through a conventional speech enhancer without any attenuation.
And this last one is actually shared between the DNN and the conventional approaches: we don't touch the phases. Whatever we do is with the magnitudes, all enhancement is done on the magnitudes, and then we combine them with the noisy phases and reconstruct the signal.
So this is a block diagram of how a conventional speech enhancement system works. You transform to the frequency domain, then first of all you have a voice activity detector that operates on each frequency bin, and that's what is used by many of the different sub-blocks. Based on that information you update your noise model: you keep a model of the spectrum of the noise and update it using your VAD information. Then there are quantities called the prior and posterior SNRs. You calculate them, which I'll talk about, and based on those you calculate the suppression rule and apply it to the spectrum to reconstruct the signal. This is how a conventional system works.
So I'm listing here the different suppression rules that people have proposed in the past, and here I'm defining the prior and posterior SNR. The prior SNR is the power of your estimated clean speech divided by the power of the noise, and the posterior SNR is the power of what you observe, the noisy signal, divided by the power of the noise. This is the traditional way these algorithms have been formulated: people compute these two values, and then the final suppression rule, the final gain in each frequency bin, is a function of these two variables. And you have different suppression rules based on what exactly these algorithms optimize, what the final optimization criterion is. The most famous one is the Wiener suppression rule, and there are all the other ones like spectral subtraction, MMSE, and so on.
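As a concrete example, here is a minimal sketch of these two definitions and the Wiener rule (illustrative NumPy code, not taken from the system itself; variable names are my own):

```python
import numpy as np

def prior_posterior_snr(clean_power_est, noise_power_est, noisy_power):
    """Prior SNR: estimated clean power / noise power.
       Posterior SNR: observed noisy power / noise power."""
    xi = clean_power_est / np.maximum(noise_power_est, 1e-12)   # prior SNR
    gamma = noisy_power / np.maximum(noise_power_est, 1e-12)    # posterior SNR
    return xi, gamma

def wiener_gain(xi):
    """Wiener suppression rule: G = xi / (1 + xi), a gain in [0, 1)."""
    return xi / (1.0 + xi)
```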
So let's talk about the exact baseline that I used in this project. I used Ivan's implementation of a conventional speech enhancement system, which is fully described in this book. Just to give you a couple of quick details about it: it uses a model-based voice activity detector from this paper here, which basically assumes two hypotheses, noise only and speech plus noise, constructs a ratio test, and detects speech versus noise. Then it uses the spectral subtraction rule, and the whole algorithm has been optimized based on what is described in this reference that I put here.
So we want to replace this whole system with DNNs. What we want to do is to have a database of synchronized noisy and clean signals, extract features from it, and use them to train a DNN that can learn the transformation from noisy to clean. Then at test time you just use the same DNN to clean up your features.
Talking a little bit about the advantages and disadvantages of these two approaches. With the DNN, the good thing is that you don't make any statistical assumptions about your signals, and that might help you do a better job. Another very good thing about a DNN is that it explicitly handles both temporal and frequency information: the features for our DNN are a concatenation of magnitude spectra from multiple frames, so you have the information from neighboring frequency bins and neighboring time frames. And because of this structure, it does not distinguish between fast-varying and slowly varying noise types, because it's just an instantaneous transformation from the noisy feature to the clean feature.
But then on the other side, the good thing about the statistical approach is that you don't need a training database; you don't have to go through the pain of collecting pairs of noisy and clean data. Another good thing with the conventional approaches is that they provide equal performance on seen and unseen noise types, because they have never seen any kind of noise; everything is done online. With the DNN, on the other hand, you have your training database, so you always lose performance if you are presented with a type of noise that you haven't seen during training. And I'll talk about this more during the presentation.
So let's see how we made the training database for our experiments. What we needed was multi-condition training data with different types of noise, different SNR levels, and different reverberation characteristics. So what I did was I used the clean training set from TIMIT as the speech material, and then we have some room impulse responses collected here in the group, recorded at multiple different distances and directions. So I have on the order of 60 different room impulse responses that I can use to create the reverberation effect. Then I use a collection of 100 different noise types from a research lab at Ohio State University, which is very good for our purpose because it really is a variety of different types of noise: clicking, clapping, pink noise, car noise, café noise, quite varied noise signals. And then I also have the NOISEX-92 dataset, and I basically construct two different test sets. One of them uses these 100 noise types, which have also been used in the training, and the other test set uses only the NOISEX-92 dataset, which has never been seen during training. So I have both seen and unseen test databases.
So to generate our training data, we assume distributions for the power of speech and the power of noise in a room. We assume that the power of speech has a Gaussian distribution with a mean of 60dB and a standard deviation of 8dB, and for the noise you have a mean of 55dB and a standard deviation of 10dB. These distributions are used to generate the power of the signal and the noise for each file I'm going to generate. And then, to get closer to reality, we also compensate for the Lombard effect for noise levels that are above a certain threshold, and finally we limit all the files to have SNRs between minus 5 and 30dB.
And here is the way we generate the actual training data. We randomly pick a clean speech file, randomly select a speech power, scale the speech to that power, and do the same thing with the noise; we randomly select a room impulse response, convolve the speech with it, and then add the noise to get the noisy reverberant signal. One last step you need to do is to synchronize your noisy and clean speech, because the room impulse response has introduced some delay, and what you're going to do with the DNN is chop the signal into frames, so you need synchronous pairs to present to the DNN. So you have to cancel the delay that the room impulse response has caused.
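A rough sketch of this mixing procedure, under the assumptions that the speech and noise powers follow the distributions from the previous slide and that the delay is compensated by aligning to the direct-path peak of the impulse response (the helper names are hypothetical; the Lombard compensation and SNR limiting steps are omitted):

```python
import numpy as np
from scipy.signal import fftconvolve

def scale_to_power_db(x, target_db):
    """Scale a signal to a target average power in dB."""
    cur_db = 10 * np.log10(np.mean(x ** 2) + 1e-12)
    return x * 10 ** ((target_db - cur_db) / 20.0)

def make_noisy_pair(clean, noise, rir, rng):
    """Create one synchronized (noisy-reverberant, clean) training pair.

    `noise` is assumed to be at least as long as `clean`.
    """
    speech_db = rng.normal(60, 8)    # speech power distribution (mean 60dB, std 8dB)
    noise_db = rng.normal(55, 10)    # noise power distribution (mean 55dB, std 10dB)
    clean = scale_to_power_db(clean, speech_db)
    noise = scale_to_power_db(noise, noise_db)

    reverberant = fftconvolve(clean, rir)
    # Compensate the delay introduced by the impulse response so that
    # noisy and clean frames stay synchronous.
    delay = int(np.argmax(np.abs(rir)))
    reverberant = reverberant[delay : delay + len(clean)]

    noisy = reverberant + noise[: len(reverberant)]
    return noisy, clean
```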
So for evaluation I use PESQ, which is designed to correlate well with the human perception of speech quality. You have this subjective measure called MOS, the mean opinion score, and PESQ is designed to correlate as much as possible with that. So we use PESQ, which is a number between minus 0.5 and 4.5, and this is what we use for evaluations. Of course it's not a hundred percent reliable; you can't trust the PESQ numbers all the time. You always have to listen to the audio signals, particularly because when reverberation is present, PESQ is a little less indicative of your performance. But it's still the closest thing we have to an actual subjective evaluation.
So let's talk about the actual DNN training. One major constraint, one major requirement we had from the beginning, was that we want causal processing. The way features for DNNs are usually generated, for example in speech recognition, is that you have symmetric context: you concatenate feature vectors from multiple frames, a couple of frames in the past and a couple of frames in the future, and your goal is to estimate the target for the center frame. This is the tradition when training DNNs in speech.
In speech enhancement, we cannot afford to use a couple of frames in the future. What we are looking at is this alternative setup where we just have past context: we use the current frame and a couple of frames in the past, and we want to estimate the clean version of the last frame that we have here.
So this is the way we do it.
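A minimal sketch of this causal context expansion, assuming the features arrive as a frames-by-dimensions NumPy array; the six past frames here match the causal setup described later in the talk:

```python
import numpy as np

def causal_context(features, n_past=6):
    """Stack each frame with `n_past` previous frames (no future context).

    features: (num_frames, dim) array of, e.g., log-magnitude spectra.
    Returns (num_frames, (n_past + 1) * dim); the target for row t is the
    clean version of frame t, so processing stays causal.
    """
    num_frames, dim = features.shape
    # Pad the start by repeating the first frame so early frames have context.
    padded = np.vstack([np.repeat(features[:1], n_past, axis=0), features])
    stacked = [padded[t : t + n_past + 1].reshape(-1) for t in range(num_frames)]
    return np.asarray(stacked)
```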
Another detail we ran into while implementing the system was the issue of feature normalization. Something that people do when they're training DNNs is mean and variance normalization of the features: DNNs are known to train better if you have normalized means and variances along the feature dimensions. If you have a large variance in one feature dimension and a smaller variance in another, it won't train well. So this is standard practice for DNN training. It is always done, for example, in speech recognition, and there it's not a problem; those means and variances are usually known to carry channel information, which you don't care about, so it's fine to discard them. But here we cannot afford to lose the means and variances, because we're going to reconstruct the signal and we need them. And the naive approach, where you just take the noisy signal, normalize it, save its means and variances, enhance the normalized features, and then reintroduce those statistics, doesn't work, because those means and variances have been substantially influenced by the noise. If you use them, you're just reintroducing the noise.
So what we ended up doing was to use the global clean means and variances from our clean training dataset, for both the normalization and the de-normalization at the end. That actually turned out to have the best performance. So this is how we deal with feature normalization.
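A small sketch of this normalization scheme, assuming the global statistics are computed once over the clean training features (names are illustrative):

```python
import numpy as np

def fit_clean_stats(clean_feats):
    """Global statistics computed once over the *clean* training features.

    Using noisy per-utterance statistics for de-normalization would
    reintroduce the noise, as discussed above.
    """
    return clean_feats.mean(axis=0), clean_feats.std(axis=0) + 1e-8

def normalize(feats, mean, std):
    return (feats - mean) / std

def denormalize(feats, mean, std):
    return feats * std + mean
```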
Here I'm talking about one idea that we used with the enhancement DNN which provided us with improvements in the results, and that is multi-task learning. Multi-task learning is the idea of training the DNN to learn two different tasks at the same time. The original paper that proposed it has this kind of fancy example: if you are trying to teach someone how to play tennis, and from the time a baby is born you only ever teach them tennis, they will probably never learn to play tennis. But if you also teach them a variety of other things, it helps, because for playing tennis you also need to know how to run, how to visually follow an object, how to move your hand, and all of those things. With DNNs it's similar: if you have two related tasks and you use shared hidden layers to do both, you might do better on both jobs than with two different DNNs doing the tasks independently. And talking about speech enhancement, voice activity detection is at the heart of any speech enhancement system. It provides outputs for all the other units, and the noise model directly depends on your VAD accuracy. So voice activity decisions in each frequency bin are the most relevant secondary task we can find for our enhancement DNN. So this is the structure we use: at the output, instead of just predicting the clean magnitudes, we also predict the bin-wise speech presence probability. This provided us with improvements, and I will show you how much the improvements are.
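A minimal sketch of such a multitask network, written here with PyTorch as an assumption (the original system may have used a different toolkit, nonlinearity, output targets, and loss weighting); the point is the shared hidden layers feeding two output heads:

```python
import torch
import torch.nn as nn

class MultiTaskEnhancer(nn.Module):
    """Shared hidden layers; two heads: clean log-magnitude and bin-wise VAD."""
    def __init__(self, in_dim, hidden=2048, out_bins=257):
        # hidden size and bin count are illustrative, not the exact system settings
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.clean_head = nn.Linear(hidden, out_bins)  # regression to clean features
        self.vad_head = nn.Linear(hidden, out_bins)    # bin-wise speech presence

    def forward(self, x):
        h = self.shared(x)
        return self.clean_head(h), torch.sigmoid(self.vad_head(h))

def multitask_loss(clean_pred, vad_pred, clean_target, vad_target, alpha=0.3):
    """Weighted sum of the two task losses; alpha is an illustrative weight."""
    return (nn.functional.mse_loss(clean_pred, clean_target)
            + alpha * nn.functional.binary_cross_entropy(vad_pred, vad_target))
```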
So let's dive into the results now. Here is a summary of the specifications I used for my system. I used ten hours of training data to train a DNN with three layers and almost 2,000 nodes per layer. I used [indiscernible] nonlinearities, which were experimentally verified to consistently provide slightly better performance compared to sigmoids. For windowing I used 32-millisecond windows with 16-millisecond hops and a Hann window. And for context expansion, I used two different cases: the symmetric, non-causal case, for which 11 context frames are used, five in the future, the center frame, and five in the past; and the causal case, for which I used 7 frames, the current frame plus six frames in the past. These two values were experimentally verified to provide the best results.
So let's take a look at the results. I will play a number of audio files for you now. I provide the spectrograms for a visual comparison of how much noise we have removed, and then I'll play the audio files. Here I'm just showing you the performance of the DNN-based speech enhancement system, and I purposely chose a very difficult scenario where you have a negative SNR, and it's recorded at a 4-meter distance, so you have a substantial amount of reverberation.
So here is what the original clean speech looks like.
"One wonders about the applicability to people."
And then contaminated with siren noise.
[siren] "one wonders about the applicability to people."
And so finally here is what the DNN does in the enhancement.
"One wonders about the applicability to people."
So it does an impressive job in removing the noise. But there is a problem you can hear: it's somewhat distorted. You have this kind of frequency smearing here, which results in a distorted signal. This turns out to be a problem I will talk about much more; it was a major problem we had with the DNN system. But it's still very good performance in removing the noise in this difficult situation. And remember, siren noise was included in the training data; the DNN has seen it during training.
So another example with some kind of a dialing noise. The original clean speech.
"She is thinner than I am."
Noisy. [dialing] "she is thinner than I am."
And the result of what the DNN does.
"She is thinner than I am."
Again, you cannot hear the noise signal at all, but if you pay attention, there is some distortion, which is still tolerable here, not that big of a deal, but it will get worse as I continue.
So let's see how the rest of it goes. Here I'm just comparing the performance of our DNN implementation with the conventional speech enhancement system that I introduced earlier in the talk. Visually you can see that, yes, the DNN has done a much better job. We are still in the seen noise types; the DNN has seen this type of noise during training. So here is the noisy signal.
[sound] "don't ask me to carry only red like that."
Then this is what the conventional enhancement does.
[noise] "don't ask me to carry only red like that."
And this is the result of the DNN.
"Don't ask me to carry only red like that."
It is performing considerably better than the statistical noise suppressor.
>>: The DNN actually adds energy at 2.75 seconds? [indiscernible]?
Matt Mirsamadi: Yes, it seems to have added some kind of noise here. The reason we don't have energy here is actually because of the room impulse responses. The room impulse responses are band-limited at a certain frequency, so most of the signals don't have high-frequency energy. So I don't know the exact reason why the DNN has filled that area, but it looks as if most of the things it has seen haven't had this kind of hole, so it fills it with whatever seems familiar to it.
>>: [Indiscernible].
>>: I think you said this earlier [indiscernible], but the input is the log of the spectrum?
Matt Mirsamadi: Yes, it's the log of the spectrum.
>>: And I guess you're not doing anything like, what if the output is larger than the input? You can actually theoretically argue that it shouldn't be larger than the input, so if the output is coming out bigger than the original, you could just keep the old version.
Matt Mirsamadi: Yes.
>>: You create [indiscernible].
Matt Mirsamadi: All right. Let's move to the next example. Here is where I start experimenting with our unseen noise database. I have this high-frequency channel noise here, which has not been introduced to the DNN during training. And this is a 7dB file recorded at 1 meter. So this is the noisy signal.
[noise] [indiscernible] salt and pepper shakers leaving off tall pieces."
And here is what the enhancement system does.
"Vases make [indiscernible] salt and pepper shakers leaving off tall pieces."
And here is what the DNN does.
"Vases make [indiscernible] salt and pepper shakers leaving off tall pieces."
So we have obviously lost some performance, but it's still better. Even for this unseen noise type it has done better than the conventional enhancement system.
So far I've been experimenting with the symmetric scenario; I haven't introduced causality yet. So at least when we are doing symmetric experiments, even on unseen noise we can still do a little better.
Before continuing to the rest of the talk I want to mention something that you see with the reverberation here. One thing that was interesting about the DNN was that it could also do some de-reverberation. The kind of temporal smearing that you see here because of the reverberation has largely been removed here. The statistical noise suppressor cannot do any de-reverberation, but the DNN has been able to do some de-reverberation here.
>>: [Indiscernible].
>>: [Indiscernible].
>>: Because it looks better.
>>: It removes more noise but it's [indiscernible].
Matt Mirsamadi: Yeah, so I'll show you all the PESQ numbers and we'll see exactly when the PESQ is higher for which one.
But in this particular case the PESQ is still a little higher for this one compared to this one.
>>: That sounds better.
Matt Mirsamadi: And that's the problem. With the DNN you usually remove more noise than the conventional enhancement system, but you introduce some distortion as well. So the final PESQ is influenced by both factors.
>>: So there is another factor here. The MMSE suppressor is built on the rules for minimum mean square error estimation, but here the optimizer is [indiscernible]. Well, the DNN is a pure estimator and you optimize it for one [indiscernible]. So there is a gap between the optimization goal and what you want.
>>: You have nothing in mean square.
>>: For the training?
>>: Yeah, there's no difference between the [indiscernible].
Matt Mirsamadi: Yeah, the mean square between the log spectral features.
>>: So it's log V square.
Matt Mirsamadi: So let's talk about this example, which has both sources of difficulty that we had.
The noise hasn't been seen in the training; it's the same type of noise as in the previous slide. And we move to causal experiments here, where we don't use any future context. So here is the noisy speech, which you heard before. Let's play it again.
"Vases make same as salt and pepper shakers leaving off tall pieces."
The noise suppresser.
"Vases make same as salt and pepper shakers leaving off tall pieces."
And the result of the DNN.
"Vases make same as salt and pepper shakers, leaving off top pieces."
So this is where I'm starting to like this one more than the DNN, actually. So the conclusion we had was that when you have these two sources of difficulty at the same time, you want the system to perform causally and at the same time you haven't seen the noise during training, then you have a lot of these distortions going on and you lose PESQ in the comparison with the conventional speech enhancement system.
>>: Has the DNN seen the room impulse response?
Matt Mirsamadi: Yes. The DNN training is done by presenting the noisy and reverberant features at the input and the clean features at the output. So it has seen all the reverberation conditions during training.
>>: So it's the same room impulse response used in training and test?
Matt Mirsamadi: Yes. I wasn't specifically trying to assess the performance on different reverberation conditions, so for reverberation I don't have a complete test set.
So I'll explain the series of steps that we took to overcome this problem, to be able to do better with the DNN even for unseen noise and causal scenarios.
But before getting into that, let's see a little bit of the numbers, how the PESQ performance compares.
So I'll show you the average PESQ on my test datasets. In addition to that, I'll show you these PESQ versus SNR curves: for the test data you have different files, each one producing a different PESQ, and then you can fit a curve to this set of results and compare the performance of two different algorithms at different SNRs.
So here is the comparison between our DNN implementation and the conventional enhancement, which I'll call MMSE here in all the slides; it's actually the advanced implementation of a noise suppressor. This is for seen noise types, and I'm showing you both the symmetric case and the causal case. Although you lose a lot of performance when you move from the symmetric context to the causal case, at least for the seen noises you still do better than the conventional enhancement system, even in the causal case. On average you're doing almost 0.2 better in PESQ, which is considerably higher and completely audible.
So here is the performance on our unseen noise types. If you have an unseen noise, the symmetric case is still doing something, at least at higher SNRs: from roughly 5dB and higher, you do better with the symmetric case. But the causal case cannot really beat the conventional enhancement system when you haven't seen the noise during training. And this is what we heard in the examples.
Yes?
>>: What about, there should be a curve for the case of the original noisy files, before processing. What does the PESQ look like?
Matt Mirsamadi: Yeah, I'll show you the average values for the actual noisy test sets. I haven't included them here in these graphs. For this dataset the noisy PESQ is around 2.0, and we'll see these numbers for the actual noisy test sets in the final results.
A 0.2 improvement in terms of PESQ is something typical for a conventional enhancement system; at best it usually gives around 0.3 maximum.
So in general, the conclusion is that when you have lower SNRs, unseen noise, and you want to do causal processing, you have problems in the end. So the first thing you can do to improve things is to use multitask learning, as I talked about. This is not specifically targeting our problem, but it will in general improve training and shift everything up, so you can expect to have some improvement there. So I'm just comparing the case where you use multi-task learning with the bin-wise VAD information and the case where you don't. You consistently do better at all SNRs. If you look at the average [indiscernible], it's some improvement, not a huge improvement, but something that was consistently observed in all of our experiments. So everything I report for the rest of the talk is using multitask learning.
So in order to overcome the distortion problem that we had, which kind of ruins PESQ, the idea we came up with was to keep the original structure of an enhancement system, as I talked about earlier, but replace the different blocks with DNNs. What you're doing here is estimating the suppression rule, and then you just multiply it with the noisy magnitudes and reconstruct the signal. In this way you can limit the amount of distortion that you get, because in the end it's a real-valued gain that gets multiplied with the noisy magnitude; the output is not directly the result of the nonlinear transformation of a DNN. This case turns out to have much less distortion compared to an end-to-end regression using a single DNN.
The way it works here is that this first DNN is like what we had before: it gets the noisy speech and produces an estimate of the clean features. But instead of using this estimate directly, we use it to follow the rest of the enhancement procedure and compute the prior and posterior SNRs. Then we have another DNN which estimates the bin-wise VAD information, and using those you have your noise model and the prior and posterior SNRs. Then there is a third DNN, which actually converts these prior and posterior SNR values to actual suppression rules in each frequency bin.
The suppression rules in a conventional enhancement system are fixed curves. What we wanted to do here is to use the DNN to automatically learn this curve. The good thing is that you can use frequency context information in deriving that curve as well, instead of just having a fixed curve for each frequency bin. So this is the general structure of what we are proposing here.
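A rough sketch of one frame of this second architecture, assuming the first DNN supplies a clean-power estimate and the VAD DNN supplies bin-wise speech presence probabilities; the smoothing constant and the noise-model update are illustrative, and the final gain rule can be a fixed curve or the third DNN:

```python
import numpy as np

def estimate_gain(noisy_power, clean_power_est, noise_power, speech_prob,
                  alpha=0.95, rule=None):
    """One frame of the suppression-rule pipeline (illustrative, per frequency bin).

    noisy_power:      observed |X|^2 per bin
    clean_power_est:  |S_hat|^2 from the first (regression) DNN
    noise_power:      running noise model, updated where speech_prob is low
    speech_prob:      bin-wise speech presence from the VAD DNN
    rule:             function mapping (prior, posterior) SNR to a gain;
                      a fixed curve (e.g. Wiener) or a third DNN
    """
    # Update the noise model only in bins the VAD considers noise-dominated.
    noise_power = np.where(speech_prob < 0.5,
                           alpha * noise_power + (1 - alpha) * noisy_power,
                           noise_power)
    prior = clean_power_est / np.maximum(noise_power, 1e-12)
    posterior = noisy_power / np.maximum(noise_power, 1e-12)
    if rule is None:
        rule = lambda xi, gamma: xi / (1.0 + xi)   # Wiener curve as a default
    gain = np.clip(rule(prior, posterior), 0.0, 1.0)
    return gain, noise_power
```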
And let's hear what the results are. This one compares the results of the end-to-end regression with the system I just described, for a 0dB signal in the causal processing case, where the noise has not been seen during training. So here's the noisy signal.
[Sound] "they own a big house in a remote countryside."
And the result of the end-to-end regression.
"They own a big house in the remote countryside"
Again, kind of severely distorted because of the problems that we talked about.
And here is like the performance of the second system, which does suppression rule estimation using
DNNs.
"They own a big house in a remote countryside."
So this is much better in terms of quality compared to the end-to-end regression. I'll play it once more.
"They own a big house in the remote countryside."
"They own a big house in the remote countryside."
So this was consistent across different files.
The second architecture does better, at least for the cases where you want to do causal processing on unseen noises.
And here is like the summary of all the evaluation results that we had, and this includes the PESQ values for the actual noisy datasets.
So here I just marked the best performing system in each scenario, in each column, in bold, so you can easily compare. You can see that the only case where the end-to-end regression does better than the second architecture is when you are doing symmetric processing and you have seen the noises; in that case the regression DNN does a better job. But in all the other cases, one of the variants of the suppression rule estimation using DNNs is the one doing the better job.
Another thing I want to mention here in this table is, first of all, just pay attention to the PESQ gain that we get going from the regression DNN to the best performing suppression-rule DNN that we have. It's almost 0.25, which is completely audible in terms of human perception.
Another thing I want to mention is that we tried different fixed suppression rules, like the spectral subtraction and Wiener curves, versus the third DNN that we had. I couldn't get the third DNN to provide substantially better results compared to using a fixed curve. We expected to get some improvement, because with a DNN you're actually using frequency context as well, but in practice I couldn't get the third DNN, which is supposed to transform the prior and posterior SNRs to the suppression rule, to do as well as the spectral subtraction rule that we have.
And the last thing I want to mention here is that you can think of two different ways of incorporating your VAD here. You can have a separate DNN to estimate your bin-wise VAD information, or you can use the secondary output of your first multitask-learning DNN. In the experiments that we did, it turned out that it really doesn't make any difference. Instead of having two separate DNNs doing de-noising and VAD estimation, you can just take the secondary task output from the multi-task learning, instead of discarding it and keeping only the rest of the outputs.
So just to quickly summarize the results that we have. The conclusion was that yes, DNNs can do better than conventional enhancement systems, but there are two difficulties you have to deal with: the problem of causality and the problem of unseen noises. The two architectures you can have are an end-to-end regression, or mimicking the same steps as in a conventional enhancement system using DNNs. For the case when you have completely new types of noise and causal processing, the second architecture does better. And also, you don't have to have separate DNNs for VAD and de-noising; you can just use the result of your multi-task learning DNN.
All right. That's all I have to say. Before finishing I just want to thank my mentor, Ivan, for his continuous support during the summer and lots of great discussions. A lot of the results you see in this project come directly from his expertise in the field of speech enhancement. I also want to thank the other members of our group, Hannes, Mark, and [indiscernible], who is not here with us today. And I also want to thank my fellow interns, Akeesh, Long, and Supret, who actually left a bit earlier. These people really made this summer an unforgettable and very useful experience for me.
Thank you all.
[Applause]
>>: Could you go over the diagram of the DNN with suppression rule estimation? So how do you get the ground truth [indiscernible]?
Matt Mirsamadi: So the [indiscernible] is computed using the clean speech. You use the clean speech and you basically have a threshold on the clean speech; with a fixed threshold on the clean speech you can do a good job of VAD.
>>: So this is kind of created by a VAD.
Matt Mirsamadi: Yeah. But because you use the clean speech it's effectively the ground truth, because it's really accurate.
>>: And about the clean speech [indiscernible], does that take the clean speech spectrum, and what is the output?
Matt Mirsamadi: So the output is actually these two values, the prior and posterior SNRs, for each frequency bin. This is what enhancement systems use to derive the suppression rule; the suppression rule is a function of these two parameters.
>>: But the ground truth, this output [indiscernible] is a DNN. What's the target?
Matt Mirsamadi: So the target is, what you want to do is to attenuate the frequency bins that have been affected by noise. So the actual goal that you have in your target here is the energy of the actual clean speech divided by the energy of your noisy speech. If this gets multiplied by your noisy speech, you remove the noise and you just keep your clean components.
So this is the target that you use.
>>: So you say you compute the target from the clean speech spectrum and the [indiscernible] spectrum.
Matt Mirsamadi: Yes, the division between those two numbers.
>>: Thank you.
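A minimal sketch of computing the target just described from synchronized clean and noisy magnitude spectra (an ideal-ratio-mask-style target; the clipping to [0, 1] is an assumption):

```python
import numpy as np

def suppression_target(clean_mag, noisy_mag):
    """Per-bin target gain: clean energy divided by noisy energy, clipped to [0, 1].

    Multiplying the noisy magnitude by this gain approximately recovers the
    clean component in each time-frequency bin.
    """
    ratio = (clean_mag ** 2) / np.maximum(noisy_mag ** 2, 1e-12)
    return np.clip(ratio, 0.0, 1.0)
```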
>>: Question and a comment. The question is, can you just say briefly the size of the training set and the size of the DNN?
Matt Mirsamadi: So the training set was ten hours of data, and the DNN is three layers with 2,000 nodes per layer. It's kind of on the lower border of what we can use with a three-layer DNN. But I wanted to be able to run experiments with my own limited RAM and the single GPU that I had, so for now I had to stick to this minimal amount of training data.
>>: So all three are [indiscernible] layers, all three DNNs.
Matt Mirsamadi: Yes, for all three DNNs, I use this same basic structure. I didn't do like a full experiment of all the different combinations.
>>: [Indiscernible].
>>: And then the comment, if you're going to write this up, presumably for something: there's a huge literature on what they call the ideal ratio mask, which is basically the [indiscernible] suppression rule. What you do is take the noisy signal as the input, and the output is the ideal ratio mask. So you go from the input all the way to the output with a single DNN, without all this other stuff.
Matt Mirsamadi: Yeah, very, very good --
>>: You need to clearly differentiate yourself, explain why this way is better than that way.
Matt Mirsamadi: Yeah, so this is actually what we were talking about yesterday, yes. That's exactly the third architecture that you can think of. What I believe here is that a lot of the improvement we get in this system is because of the fact that we don't estimate the magnitude spectrograms; we just estimate suppression rules. And if you train one single DNN with the noisy input and the mask as the output, in the end it's going to give you exactly the same kind of suppression rule. So that's actually exactly what I want to run in the couple of days that I have, to be able to compare that third architecture to this one and see whether we need this kind of finer structure in between the [indiscernible] or not, or whether the big gains just come from the fact that we are estimating a gain in each frequency bin rather than the magnitude of the spectrogram itself.
>>: You mentioned it's [indiscernible], so when you have the output of the DNN, it can go way below 0 and way above 1. So technically, even if the training targets are [indiscernible], the output of the DNN can go below 0 or above 1, because it doesn't know that this is [indiscernible]. Yes. It does. If you can limit it, the same thing applies to the suppression rule.
>>: But the output, okay. [Indiscernible].
>>: And you train the DNN on variables between 0 and 1.
>>: [Indiscernible].
>>: It does compute.
Matt Mirsamadi: So the last layer in the DNN is a linear layer; it's not a hyperbolic tangent that limits the values between minus 1 and 1. For the VAD I trained with minus 1 as the silence label and plus 1 as the speech label. So in the output there were cases where it was outside the range of minus 1 to 1. So I just use a sigmoid curve to map from this continuous number, which can be very negative or very positive, to the values of a speech presence probability, which is between 0 and 1.
>>: So if you exclude, if you intentionally exclude all the sirens from your training data, would it be good at suppressing everything but the siren?
Matt Mirsamadi: So I'm assuming that the DNN should also see cases where you just have noise and you don't want to produce anything, because that happens in some of the frequencies anyway. So if you present it with cases like whole frames where it's noise only, I assume that it's going to do better than the case where you don't do that. But I didn't specifically run any experiments to verify this.
>>: Because what I want to achieve is that for maybe hearing impaired people, I want them to hear the sirens.
>>: So that's another question. So technically, we don't know what is going to happen if your primary source is a siren.
>>: Okay.
>>: In this case you should include it in the training: the clean signal mixed with noise and reverberation.
Matt Mirsamadi: So the hope here is that the DNN learns the structure of speech. Because during training, in all the files you see the same kind of speech but different types of noise.
>>: So we hope that it learns the structure of this speech and memorizes that versus all the other things which include noise or other types of information.
>>: Question: if we move the siren into the training as the clean speech, you would [indiscernible] for all of the siren-like sounds to go to the [indiscernible] without being suppressed.
Matt Mirsamadi: Yeah, that's right.
>>: Yeah, one thing you could see in this [indiscernible] is that it was doing a really good job of de-reverberation. And did you get any kind of idea of how well it can work in terms of --
Matt Mirsamadi: So the way we train the DNN is that the input is noisy and reverberant, and the output is clean. So in some way it's also learning the de-reverberation component, which is removing the temporal smearing and the correlations between different time frames and recovering what the actual clean speech is, but that's not well measured by PESQ. It turns out that PESQ is not really a good measure of how much de-reverberation you have done. If you think about it, it makes sense, because reverberation is not necessarily detrimental to the perceptual quality of the speech. It is in fact known that some amount of reverberation, if not too much, is good for the perceptual understanding of speech. And PESQ, being designed to follow subjective perceptual quality, doesn't give you consistent results in terms of de-reverberation performance. But yes, the DNN does some kind of de-reverberation.
>>: Because de-reverberation may be the target of some other application, did you get any kind of idea whether you could generate [indiscernible]?
Matt Mirsamadi: I didn't run any specific experiments to assess the de-reverberation performance. For that you should use other evaluation measures, like the signal-to-reverberation ratio, to see how the direct-to-reverberant ratio compares between your input and output. But I didn't assess the de-reverberation performance in particular here. The reason we included reverberation is that it makes the de-noising more difficult. We wanted to see how well we can do the de-noising when reverberation is present, not necessarily to clean up the reverberation as well.
>>: So to clarify, there are two major approaches for de-reverberation. One of them is to somehow learn the impulse response and [indiscernible], which is enormously difficult. The other approach is to say, okay, I don't know the impulse response, but I know how the reverberation energy behaves, and use suppression-type filtering for suppressing the reverberation. I'm pretty sure that with the way the DNN [indiscernible], it's out of the question to learn the actual [indiscernible], forget about it. It's more of a suppression-type de-reverberation, it's more for the [indiscernible] of the reverberation. You cannot learn the impulse responses.
>>: What if you train the DNN where the desired output is still the reverberant speech?
Matt Mirsamadi: Yes, I did that, and it provided worse results in terms of de-noising. It turned out that presenting the actual clean speech at the output is necessary for the DNN to learn useful information. When you present a blurred output to it, it cannot learn anything about the signal. So yes, I did that, but it failed to provide satisfactory results.
Ivan Tashev: More questions?
Let's thank our speaker again.
[Applause]