>> Ivan Tashev: Good morning, everyone, those who are in the room and those
who are attending remote -- attending this talk remotely. Good evening to
those who are going to watch the recording this evening.
It's my pleasure to present Nikolay Gaubitch, who did his master's degree at
Queen Mary University of London in 2002 and obtained his Ph.D. at Imperial
College London in 2006.
Nikolay presented here, I think in 2008, some results from his Ph.D.
thesis, and today he's going to present multi-microphone dereverberation and
intelligibility estimation in speech processing.
Without further ado, Nikolay, you have the floor.
>> Nikolay Gaubitch: Thank you.
Thanks. Thank you to those standing up and those watching remotely.
So, yeah, basically, as the title suggests, there are two separate topics, the
dereverberation and the intelligibility, and this sort of represents the two areas of
my life in this field. So one is more from my Ph.D. time and the other bit is
something that I've been working on more recently.
So to begin with, I just wanted to sort of have an overall look at what I'm trying
to solve, or what we're trying to solve. And basically the key idea is that we have some
wanted speech signals somewhere produced by a talker, and we have a
recorded sound at the end, and the idea is that this wanted speech should be
understood by a listener at either a different time or different location.
And in the meantime, this wanted speech passes through a lot of things. You
have an acoustic channel, some additional noise due to other sources possibly
with their own channel effects, microphones, amplifiers, maybe some codecs, and so
on and so forth, and by the time you've got there, there can be
a lot of things happening which, in general, if you have a human listener,
reduce the intelligibility, so what the talker was trying to convey is
not really understood by the listener. There's also lower quality of the signal, so
it's not that pleasant to listen to.
And then I have this productivity thing in question marks which is something that
people worked on related to the project that I'm working on now, to try to see whether
the productivity, say, for example, of a transcriber is affected by noise and noise
reduction algorithms. But there's no sort of conclusive evidence of any of that
yet, so that's why I have the question mark.
Or you can have a machine listener at the other end like speech recognition or
speaker identification. They also perform badly once you have this scenario.
So as I said, the talk will be in two parts. In the first part I will talk about
multi-microphone dereverberation, one method which is a source-filter model
based method and one which uses blind channel identification and inversion, and
in the second part I'll talk about intelligibility estimation. And this is mainly
subject-based intelligibility estimation, so we have subjects in the loop. And I'll
talk about how we use these methods for evaluating speech enhancement
algorithms.
So let's just quickly go over what the problem of dereverberation is that we're
trying to solve. We have some speech signal, s of n, that is produced in a room.
Actually, before I continue, if you have any questions, please interrupt me
because there's quite a few different topics, and if you wait until the end it gets
messy.
So we have this speech signal, s of n, produced in some reverberant room, and
it's observed by a bunch of microphones, and the talker is at some distance from
the microphone array, so we formulate the observed signal at each
microphone as the convolution between the speech signal and the room impulse
response plus some additive noise.
And the aim of dereverberation is to find some estimate of the speech signal s,
possibly delayed or attenuated; that's not a problem. And in most cases
dereverberation is a blind problem, so we don't know anything else but x, just
the observed signal. So we try to do everything based on that.
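As a minimal sketch of the signal model being described here (the function and variable names, the microphone index m, and the noise level are illustrative assumptions, not the speaker's code), the observation at one microphone could be simulated as:

```python
import numpy as np
from scipy.signal import fftconvolve

def observe(s, h_m, noise_std=0.01, rng=np.random.default_rng(0)):
    """Simulate x_m(n) = s(n) * h_m(n) + v_m(n) for one microphone.

    s         : clean speech samples
    h_m       : room impulse response from the talker to microphone m
    noise_std : standard deviation of the additive sensor noise
    """
    x = fftconvolve(s, h_m)[: len(s)]              # convolution with the room impulse response
    v = rng.normal(0.0, noise_std, size=len(x))    # additive noise
    return x + v
```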
So I think there are generally three classes of methods for doing dereverberation.
One is beamforming, where you have some kind of array of
microphones and you do some processing to all these microphones and you sum
them in a way to try and form a beam in the direction of the talker that you want.
And beamforming is obviously a multi-channel, multi-microphone approach
exclusively.
Then there's something I like to call speech enhancement methods where you
may use some model of the speech signal or -- and then modify your reverberant
speech observation in a way so that you better represent this model, or you can
have some model of impulse response and then use that to process your
reverberant observation. An example of this is spectral subtraction type of
dereverberation methods.
And, finally, you have blind system identification and equalization where you
would try and actually estimate the impulse responses and then form some
inverse filters based on this. And, of course, you can have a combination of all of
these things to form methods.
So the first method that I will talk about is based on the speech enhancement
methods where it uses a model of the speech signal. It's quite a straightforward
method really. It starts by looking at the LPC formulation of speech. So if you
have a clean speech signal, you can write it in terms of a linear predictor like this
where a are the prediction coefficients and e is the prediction residual that you
get from this.
And you could have a reverberant observation which was formulated in the same
way. And then what you can look at is the relationship now between the
different -- the two different parameters of LPC. So what is the relationship
between the coefficients a and b and what is the relationship between the
prediction residuals of the two.
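To make the LPC relationship concrete, here is a rough numpy sketch of the autocorrelation method (order 13 is the order quoted later in the talk; everything else, including the helper name, is my own illustration rather than the speaker's code):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_analysis(s, order=13):
    """Estimate LP coefficients a_1..a_p by the autocorrelation method and
    return the prediction residual e(n) = s(n) - sum_i a_i s(n-i)."""
    r = np.correlate(s, s, mode="full")[len(s) - 1:len(s) + order]   # autocorrelation lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                 # Yule-Walker equations
    e = lfilter(np.concatenate(([1.0], -a)), [1.0], s)     # inverse filter A(z) gives the residual
    return a, e
```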
So here is an example of a bit of voiced speech which could be quite familiar. So
you have -- at the top left is a clean speech signal, some reverberant speech
signal, and here are the residuals for the two signals. So you can see this one is
probably the sort of familiar structure where you have peaks in the residual that
approximately represent the instants of glottal closure and you have the
reverberant residual -- it becomes messy because this structure is destroyed due
to the reverberation.
So reverberation affects the residual quite a lot, but, on the other hand, it's been
found by several people generally that the LP coefficients are much less affected
by the reverberation.
So looking at these things, you can think of a way of doing dereverberation in
three steps. You do LPC analysis of your reverberant observations, you get a
bunch of residuals, you process the residuals, and then you use the LPC
coefficients that you got from the reverberant speech and then you
resynthesize the speech.
So one of the advantages of this is partly that you have much simpler structure of
the clean speech residual and you know what it is, so you can maybe take
advantage of that to some extent. And, yeah, it assumes that the coefficients
will not be as affected.
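A schematic of that three-step analysis-modify-resynthesis loop, reusing the lpc_analysis helper sketched above (process_residual stands in for whatever enhancement is applied to the residual; this illustrates the structure only and is not the speaker's implementation):

```python
import numpy as np
from scipy.signal import lfilter

def dereverb_by_residual_processing(x, order=13, process_residual=lambda e: e):
    """1) LPC-analyse the reverberant observation x,
       2) process the residual,
       3) resynthesize with the (reverberant) LPC coefficients."""
    b, e = lpc_analysis(x, order)                               # step 1: reverberant LPC analysis
    e_hat = process_residual(e)                                 # step 2: enhance the residual
    s_hat = lfilter([1.0], np.concatenate(([1.0], -b)), e_hat)  # step 3: synthesis filter 1/B(z)
    return s_hat
```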
So the way we looked at this thing is something called -- something that
combines spatial and temporal averaging of the residuals. So this is some
motivation behind what the algorithm that I want to talk about is doing.
So, again, if you have a clean speech residual and you have a reverberant
speech residual, you see that the peaks that are from the original residual are
kind of obscured and it's difficult to see where they are.
If you were to take a microphone array and get the residual from the output of
just a delay-and-sum beamformer, you start to get a much better picture of the
structure, right? So you can see these peaks much more clearly, but there's still a lot of
rubbish in between, so the structure between the peaks is quite far from this
thing.
So the first observation is that if we do spatial averaging, we should be able to
identify these instants of glottal closure.
>>: The last one is [inaudible] and then doing LPC or doing LPC [inaudible]?
>> Nikolay Gaubitch: No, this is doing [inaudible] some beamformer and then
doing an LPC.
Yeah, the other observation you can see is that actually the signal between these
peaks is pretty similar, so it changes quite slowly, while the output of the
beamformer residual seemed to be much less correlated in between.
So the idea could be that you could segment these larynx cycles between the
peaks and then do some kind of moving average operation so that you would
take this one and average it with its neighbors and it should reduce the effects of
this and get back to this structure a bit more.
So that's really the idea behind this approach. And in order to do this larynx
averaging you need to be able to identify these peaks, and for that I've used
something called DYPSA, which is -- it's very robust for identifying GCIs in clean
speech, and at the output of a beamformer it performs pretty well as well,
satisfyingly enough.
And also when you do the averaging of these larynx cycles, you don't want to
include the actual peaks in the averaging because they are quite sensitive. So if
you distort some of these peaks you can hear it in the resynthesized speech
afterwards.
So to avoid this, basically what you can do is use some kind of windowing
function around the cycles. So if you take -- for example, what I was using
is just a time-domain Tukey window. You can center it within the larynx cycle
and that would exclude -- basically it would exclude the glottal closure instant
with some margin on the side so that you have some allowance for errors.
So then you basically segment your larynx cycles, average across them without
the peaks, and then put the peaks back. That's what this does. And in the second
step what you can do is instead of actually using these averaged larynx cycles
you can try and calculate -- you can calculate an inverse filter based on your
original larynx cycle and the one that you've just averaged and you update this
filter slowly. So in this way you can partly reduce the sensitivity of GCI detection
errors, but also basically this whole thing that I've described mainly works for
voiced speech, and when you come to unvoiced segments you can continue using
this filter because it should be a relatively good estimate of what happens to
reduce the reverberation.
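A rough sketch of the larynx-cycle averaging step as I read it from the description above: given detected GCIs (from DYPSA or any other detector), each cycle is averaged with its neighbors under a Tukey window that protects the glottal peaks. The window parameter, the number of neighbor cycles, and the resampling of neighbors to a common length are my assumptions, not necessarily how SMERSH does it.

```python
import numpy as np
from scipy.signal.windows import tukey

def average_larynx_cycles(e, gci, n_neighbors=1, alpha=0.5):
    """Moving average over larynx cycles of a residual, excluding the peaks.

    e           : prediction residual (e.g., of the beamformer output)
    gci         : sample indices of the detected glottal closure instants
    n_neighbors : cycles averaged on each side of the current one
    alpha       : taper fraction of the Tukey window (keeps the peaks intact)
    """
    out = e.astype(float).copy()
    cycles = [e[gci[i]:gci[i + 1]].astype(float) for i in range(len(gci) - 1)]
    for i, cyc in enumerate(cycles):
        w = tukey(len(cyc), alpha)              # ~0 at the cycle edges (the peaks), ~1 in between
        acc, count = np.zeros_like(cyc), 0
        for j in range(i - n_neighbors, i + n_neighbors + 1):
            if 0 <= j < len(cycles):
                # resample the neighbor cycle to the current cycle length
                neigh = np.interp(np.linspace(0, 1, len(cyc)),
                                  np.linspace(0, 1, len(cycles[j])), cycles[j])
                acc += neigh
                count += 1
        avg = acc / count
        # keep the original samples near the peak, use the average in between
        out[gci[i]:gci[i + 1]] = (1 - w) * cyc + w * avg
    return out
```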
Yes?
>>: So is the averaging done over a fixed number of windows or are you doing
that adaptively based on how much the residual is brought down?
>> Nikolay Gaubitch: I did it as a fixed. So normally sort of one or two larynx
cycles on each side of the bit you are looking at.
>>: And the other question I have is that it seems like a lot of times if you have a
very reverberant environment, the peaks around the cycles and the larynx
signature get kind of jittered around in time a little bit, because there are
reflections that are coming at different times, so you see, like, this reflection is a bit
higher. And I wonder to some degree, you know, is this averaging time-aligned
in any way beyond the glottal peak? Do you try to do some alignment within the
larynx signature as well after you've done the windowing or is it --
>> Nikolay Gaubitch: No, it's just aligned with the peak itself, yeah. I see what
you mean, because you normally wouldn't have exactly the same --
>>: Right. And you might be zeroing out the signal by just having slight shifts in
the --
>> Nikolay Gaubitch: Yeah. But you try to basically align them with the peak. And
also you keep the original peak position. That's why I don't include the
peak in the averaging process itself. So you just -- you just average with
everything between. So, yeah, I just found that that works best, anyway, for the
different ones that I looked at.
>>: So g is just a filter that approximates the difference between e and
[inaudible]?
>> Nikolay Gaubitch: Yeah. Exactly.
>>: And then that's what you actually use to get --
>> Nikolay Gaubitch: That's what you actually apply, yeah. It's just partly to
know what to do when you have unvoiced speech coming, for example,
because -- yeah. And also it reduces errors that [inaudible].
So this is a diagram basically of -- sorry.
>>: So what does g actually do? Is it basically a [inaudible] filter or what --
>> Nikolay Gaubitch: G is just -- it's a --
>>: What does it come up to be?
>> Nikolay Gaubitch: Well, it should be something which relates to the inverse
filter of the room transfer function to some extent. It's not really a --
>>: You're operating at a smaller --
>> Nikolay Gaubitch: Yeah, yeah, exactly. So it's not going to be enough
to invert the full thing, but it will be kind of a very smoothed version of
that. If you just have a very small size it will --
>>: And the other thing is that that filter is not being applied to the peaks, it's
being applied to the whole signal?
>> Nikolay Gaubitch: It's applied to the peaks as well, yeah. It's applied to
everything.
>>: And I wonder if you might have some -- because there is something like --
you can imagine there might be certain pathologies -- and I'm not saying this
necessarily happens, but you can imagine certain pathologies where the filter
ends up doing a good job of modeling whatever the averaging does, you know,
but then that resulting filter might damage the peaks as well because they weren't
part of the formulation.
>> Nikolay Gaubitch: Well, the peaks are part of the formulation of calculating
the filter.
>>: That's true.
>> Nikolay Gaubitch: So they will hopefully stay the same.
>>: Oh, they are? Okay. So I see. So you keep that in there -- I see. It's
still in there.
>> Nikolay Gaubitch: Yeah.
>>: You don't apply the windowing for when you compute the [inaudible] the
entire thing in that case?
>> Nikolay Gaubitch: When you apply -- yeah, when you apply the filter, yes. It's
only when you do the averaging that you exclude it, so you don't average -- and
then you get it back in.
>>: If you're going to do analysis of this filter and estimate the frequencies
[inaudible], what would it look like? High pass? Low pass? That was the question.
>> Nikolay Gaubitch: Yeah, I understood the question, but --
>>: Do you suppress the high part more than the lower part?
>>: I mean, it seems like you have this low pass kind of thing that's your pitch
periods and this very high frequency noise, so my guess is it's just going to be a
high pass filter, but --
>> Nikolay Gaubitch: Yeah, but it's not really. I mean, it doesn't have any sort of
high pass/low pass structure from what I've seen. It looks like a --
>>: [inaudible] of the glottals are a series of Dirac pulses; they have a flat
frequency spectrum. And you model the noise, which is closer to white noise;
that thing also has a flat frequency spectrum. So what is the [inaudible] here by applying --
>>: [inaudible].
>>: A sequence of deltas in time will also be [inaudible]. So you'd see --
>>: If your frame is longer [inaudible] --
>>: [inaudible].
>>: If it's a short frame it's flat and the noise is flat.
>> Nikolay Gaubitch: Yeah, but the noise isn't really flat. I mean, if you look at it,
the noise isn't really flat. To some extent this noise should have some relation --
>>: If it contains more lower frequencies, then Mike is right, this is -- if you apply a
kind of high pass filter you can improve the --
>>: I see what you're saying.
>> Nikolay Gaubitch: Yeah. But the noise isn't really flat in between. It should
be --
>>: [inaudible] filter applied to just [inaudible] periods [inaudible].
>> Nikolay Gaubitch: No, no, you wouldn't. No.
>>: So it's doing something that's helping dereverberation, but it's doing
[inaudible].
>> Nikolay Gaubitch: Yeah. It's still directly connected to this. But these peaks
here are connected to the reverberation, right? Because you sort of excite the
room with -- that's the way I see it. So every little clap of the glottis sort of excites
the room to some extent.
>>: Maybe a different way to look at the question would be that once you do the
averaging kind of, you know, get your signal into the nice temporal residual that
you want to see, you could actually estimate that g filter over a much longer
window. There's more research that you'd need to do [inaudible].
>> Nikolay Gaubitch: No, no. You could do it over a longer period.
>>: [inaudible]
>> Nikolay Gaubitch: That's true. I mean, the motivation behind this is that
people that did this before, what they did was to try and somehow identify the
peaks and then just flatten out whatever's in between and --
>>: I mean, you had this Gillespie citation which I assume is, like, not trying to
look at the glottis but just trying to say we want a peaky signal.
>> Nikolay Gaubitch: Yeah.
>>: And that's the [inaudible] thing.
>> Nikolay Gaubitch: Exactly. But even that actually flattens out the portions in
between quite a lot when it works well. That's what I find. Because you just look
at a peaky signal, right? So the ideal solution would be --
>>: It seems like that's maybe a little more robust because you don't have to
estimate all this stuff like glottal closure and stuff.
>> Nikolay Gaubitch: Yeah. Yeah. No, it can be. But the problem is once -- if
your ideal is that you just have a peak and flatness in between, at best you get
some kind of robotic solution to the whole thing. That's the problem. So that's
what I was trying to avoid here is to try and sort of get back to the naturalness of
the speech.
So, yeah -- so that's the overall -- the overall idea here. So just to show you a
couple of results of that with five male, five female --
>>: Did you spend a lot of time on acronyms?
>> Nikolay Gaubitch: Sorry?
>>: Does your group spend a lot of time coming up with acronyms?
>> Nikolay Gaubitch: Actually, we do, yeah. This wasn't from our group. I think
it was done in [inaudible].
>>: [inaudible].
>> Nikolay Gaubitch: Yeah, we have a bit of -- there's going to be a lot of
acronyms, actually. That's a good early point.
Anyway, so [inaudible] database with a bunch of male and five female talkers,
and this is with simulated room impulse responses with the image method with
varying T60. And you see there the talker is about two and a half meters away.
So this is an output if you compare just the delay-and-sum beamformer on its own
versus this SMERSH. Yeah, that's actually the acronym for spatiotemporal
averaging method for enhancement of reverberant speech. So, yeah, this is in
terms of segmental signal-to-reverberation ratio. You see that you get some
benefits over just the delay-and-sum beamformer.
>>: So what's the order of the LPC analysis and the [inaudible]?
>> Nikolay Gaubitch: The LPC analysis, 13 here is for 7 kilohertz frequency, and
g is 256 [inaudible] filter.
>>: How are you guys measuring the [inaudible]?
>> Nikolay Gaubitch: So in this case you just take the direct path, which is quite
easy since it's a simulated thing. So I know that --
>>: Whatever is left is your reverberation?
>> Nikolay Gaubitch: Exactly. So when you simulate it, it's actually quite easy to
do.
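For reference, a segmental signal-to-reverberation ratio can be sketched along these lines when the direct-path component is available from the simulation (the frame length and exact normalization are my assumptions; the numbers in these plots may use a slightly different definition):

```python
import numpy as np

def segmental_srr(direct, processed, frame_len=256, eps=1e-12):
    """Mean frame-wise SRR in dB: direct-path energy over everything else.

    direct    : known direct-path component, time-aligned with `processed`
    processed : dereverberated output signal
    """
    n_frames = len(direct) // frame_len
    srr = []
    for i in range(n_frames):
        d = direct[i * frame_len:(i + 1) * frame_len]
        r = processed[i * frame_len:(i + 1) * frame_len] - d   # residual reverberation (+ noise)
        srr.append(10 * np.log10((np.sum(d ** 2) + eps) / (np.sum(r ** 2) + eps)))
    return float(np.mean(srr))
```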
And also in terms of bark spectral distortion, which is something I used to use
quite a lot and supposedly should be closer to what we hear, so, again, you see
there seems to be no -- so you get more benefit when there's more reverberation
from this than when there's nothing, when there's little reverberation.
>>: So go back a second [inaudible].
>> Nikolay Gaubitch: Oh, this is in bark spectral distortion --
>>: Can we see the previous? I'm guessing my next question. If not, I'm going
to ask it.
>> Nikolay Gaubitch: Okay.
>>: So here you're getting [inaudible]?
>> Nikolay Gaubitch: Yeah.
>>: So is that audible?
>> Nikolay Gaubitch: You can hear it, actually, yeah. I mean, yeah, you can
definitely hear the reduction in space. I have some samples on my laptop. So if
you're interested in hearing them later -- there's no point in playing it -- it wouldn't
be audible here, so it's just -- but if you listen to it on headphones, you can
actually hear -- yeah.
>>: [inaudible] if you apply something more sophisticated like let's say [inaudible]
are audibly better than the delay-and-sum beamformer, then what would be that
difference?
>> Nikolay Gaubitch: I don't know. I haven't checked it. This is just compared to
a delay-and-sum beamformer. Yeah, I haven't really --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Yeah.
>>: I have a question about your -- we can go on, but -- so you have these -- it
seemed like you did this study [inaudible] that the LPC coefficients for
reverberant speech are not quite as affected by reverberation -- the LPC
coefficients of speech are not affected by reverberation. I imagine it's -- well, let's
assume that's true. This is a noise-free case, so in the real world there's noise,
and generally noise does have some ability to [inaudible].
>> Nikolay Gaubitch: Absolutely, yeah.
>>: So how -- in the noise case, would you have to change something? You
have to have the room with noise and reverberation, so you change [inaudible]?
>> Nikolay Gaubitch: Well, actually -- yeah, you're right that it wouldn't -- that the
noise does other things, but the way -- in this case the way I calculate the LPC
coefficients is just using multi-channel LPC. So you basically get all the
correlation matrices based on the multiple microphones. So it's eight
microphones here, and that helps actually in noise as well.
And actually -- yeah, that's --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. So you have either the beamformer, but actually I
think when you do the multi-channel LPC it's even better than doing beamforming
first and then -- so actually, yeah, you can hear. The interesting thing is that this
averaging of the larynx cycles actually helps also reduce noise if you have it
because -- so that comes there. So the combined effects you can hear quite well
in --
>>: One more question. So this is assuming that you know precisely where the
source is for the beamformer?
>> Nikolay Gaubitch: In this case, yes.
>>: Did you have a chance to do any sensitivity analysis to how error in that
sound source estimate might --
>> Nikolay Gaubitch: I haven't done that. But we actually -- a bit later after this came
up, we implemented this and did some tests with -- well, we set up the whole
system with -- you know, when you do an actual estimation of the source, actual
source localization using just GCC-PHAT and things like this and real
voice activity detection and all this. And you get pretty close to these results
as well.
So I didn't do much further analysis of this. It was just an idea. We tried it, and it
worked nicely to some extent. And it was kind of fun to try it in a more realistic
scenario, so it seems to perform more or less the same when you do that.
>>: [inaudible]
>> Nikolay Gaubitch: Great. So I'll move on to the next bit, which is on the
channel estimation and equalization, starting with this cross relation between two
channels, basically, which often happens to be the beginning of multi-channel
estimation.
So all it says is that if you have an observation at one microphone and you
convolve it with the second microphone's impulse response, it is the same as taking
the second microphone's observation and convolving it with the first microphone's
impulse response.
So starting from that, you can set up a set of linear equations with R being a
correlation matrix and h all the impulse responses that you're looking at. And you
can identify h by just finding the eigenvector corresponding to the smallest
eigenvalue of R.
And so this works really well if you have no common zeros between the channels
and if the autocorrelation matrix of the source signal is of full rank. So that's the
basic idea of this.
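A small two-channel sketch of that cross-relation idea (x1 * h2 = x2 * h1), assuming a known channel length L and equal-length observations; the data-matrix construction and the function name are my own illustration of the standard formulation, not code from the talk:

```python
import numpy as np
from scipy.linalg import convolution_matrix

def cross_relation_identify(x1, x2, L):
    """Blind two-channel identification from the cross relation x1*h2 = x2*h1.

    x1, x2 : equal-length observations at the two microphones
    L      : assumed channel length (number of taps)
    Returns h1, h2 up to a common scale factor.
    """
    X1 = convolution_matrix(x1, L, mode="full")   # X1 @ h2 == convolve(x1, h2)
    X2 = convolution_matrix(x2, L, mode="full")
    A = np.hstack([X1, -X2])                      # A @ [h2; h1] should be (close to) zero
    R = A.T @ A                                   # correlation-like matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    v = eigvecs[:, 0]                             # eigenvector of the smallest eigenvalue
    return v[L:], v[:L]                           # h1, h2
```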
And then you have -- you can continue with this and do an adaptive formulation,
which is good since acoustics generally change quite a lot when you move
around, and trying to do it adaptively seems like a good idea. And by minimizing
this error you can get to something which looks quite similar to an LMS filter, and
there are also other versions that implement this.
So this is all good, and it works well if your conditions are very ideal, but once
you start having noise, this whole formulation becomes quite messy. So
basically what your error function -- if you're trying to minimize this, it also tries to --
it tries to minimize two things. One is this original thing that you had and the
other one is this bit which includes noise.
And it causes -- it's one of the causes of something which makes the convergence
behavior of these filters very strange, so you start converging, and then at some
point it misconverges into something else. And it does this independently of how
good your signal is, more or less, unless it's perfect. So you can see that even
things like 35dB and 40dB SNR, which is way too good for any realistic thing, it
has this strange behavior.
So one way we looked at trying to [inaudible] this is to add some constraints,
some extra knowledge, about what you're trying to estimate, and one of them
would be to just assume that you know the direct path, say that you've estimated
it from something else, so you add the constraint that you keep the direct path in
your estimate as the correct one.
Or another one which could be -- is to assume that your energy distribution in
your room transfer function is uniformly distributed over frequencies, so it's
relatively flat, and keep that as a constraint.
So one of the misconvergence things that you see is that it actually produces
some kind of extra filtering, and it makes the estimates sort of skewed here and
there. So that's where this constraint came from.
>>: K is frequency or --
>> Nikolay Gaubitch: K is the frequency bin, yeah. So this would be the
magnitude -- sorry, the power per band in the room transfer function.
>>: Isn't this assumption that you have a basically cut off [inaudible]?
>> Nikolay Gaubitch: Yeah.
>>: So pretty much this means that these surfaces [inaudible]?
>> Nikolay Gaubitch: Yeah, more or less. Yeah. So overall it would be kind of
flattish when you look at it, especially within the frequency bands that you're
looking at --
>>: [inaudible]
>> Nikolay Gaubitch: Well, also in -- yeah. I think in common rooms it's fine if
you don't look at too high frequencies. So if you're within the telephone
bandwidth, you'll be okay, generally, with this. So this was all done with sort of
telephone bandwidth in mind, so I think it's fine. Yeah, once you start going into
higher frequencies, then you get the frequency dependence and it's different.
>>: [inaudible]
>> Nikolay Gaubitch: Yes, exactly. So we did telephone bandwidth, which is -- I
think 8 kilohertz is still pretty hard-core for these algorithms to try and estimate
impulse responses in terms of the number of taps that you get.
And either way, I mean, the idea is that you try and add some extra information
about what you're trying to estimate, and you add some extra -- basically it can
result in an extra penalty term in your adaptation algorithm to try and control this.
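Schematically, and in my own notation rather than the talk's, the constrained adaptation minimizes the usual cross-relation error plus a weighted penalty, for example a spectral-flatness penalty on each estimated channel:

\[
J(\hat{h}) \;=\; \sum_{i<j}\big\| x_i * \hat{h}_j - x_j * \hat{h}_i \big\|^2
\;+\; \lambda \sum_{m}\sum_{k}\Big( |\hat{H}_m(k)|^2 - \bar{P}_m \Big)^2,
\qquad \bar{P}_m = \frac{1}{K}\sum_{k} |\hat{H}_m(k)|^2 .
\]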
And if you take -- it looks pretty much the same. Whichever constraint you take
of these, you have this behavior which was without the -- I don't know if you can
see the plots here. So this was without the constraint and then if you do have the
constraint it sort of stops where it should, although it slows it down a little bit.
And here is what the channels look like for this case. You have five artificial
channels. So each of these columns is one channel, basically, and this was the
original bit that we had, and when you're trying to estimate it, this is what it looks like
without any constraints, and if you do put some constraints in, it helps, so you get
a pretty good estimation. But, yeah, these are truncated to 128 taps, these
channels.
>>: A 128-tap reverberation?
>> Nikolay Gaubitch: Yeah. It's pretty short.
>>: So and the MC -- the red [inaudible] is --
>> Nikolay Gaubitch: So the red [inaudible] is the misconvergence. So it's at this
point.
>>: So what does it look like at the [inaudible]?
>> Nikolay Gaubitch: At this point it looks like this. The problem is, though, how
you find this point and stop.
>>: What's MPM?
>> Nikolay Gaubitch: Oh, yes, I should have mentioned. So MPM is
misalignment, but it's normalized --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, it's used quite often. So it's the same as
misalignment as used in normal channel estimation, but it normalizes so that you
avoid any scaling effects. So if the channel is scaled differently from the
estimate, it takes that into account.
Yeah, so that's -- so, yeah, I mean, I have looked at some ways of trying to find
where this point is, because that would be a good idea as well, if you can find
this and stop here, but -- yeah, it becomes quite tricky.
Okay. So that's -- I think that's about all I was thinking of saying about channel
ID. It's still a very tricky problem, and to do it with very long channels, it becomes
difficult.
What I want to look at quickly is also, say that you do have an estimate of the
channel. What can we do to actually remove this channel from the reverberant
speech? And the sort of original work on the multi-channel thing was this -- trying
to equalize all the channels simultaneously using an equalizer g, and you have
some desired output of what you want to get once you've equalized your
channels, which is an impulse that can have arbitrary amplitude and some
arbitrary delay.
And in this you can set up some least squares formulation on that, and you can
calculate estimates of your inverse filters. So that's the fundamental idea, which
sounds all good, but as everything else with this, it's full of -- it can be full of
problems.
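A toy sketch of that least-squares design (in the spirit of MINT-style multichannel equalization; the target delay, equalizer length, and matrix construction here are my illustrative choices, not the exact formulation on the slide):

```python
import numpy as np
from scipy.linalg import convolution_matrix

def design_equalizers(channels, Lg, delay=0):
    """Least-squares multichannel equalizer design.

    channels : list of M (equal-length) estimated room impulse responses h_m
    Lg       : length of each equalizer g_m
    delay    : delay of the desired impulse target
    Minimizes || sum_m h_m * g_m - d ||^2 over the stacked equalizers g.
    """
    M = len(channels)
    # Each block maps g_m to h_m * g_m; stack the blocks side by side.
    H = np.hstack([convolution_matrix(h, Lg, mode="full") for h in channels])
    d = np.zeros(H.shape[0])
    d[delay] = 1.0                                  # desired equalized response: a delayed impulse
    g, *_ = np.linalg.lstsq(H, d, rcond=None)       # least-squares inverse filters
    return g.reshape(M, Lg)
```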
So one of the problems, of course, is even if you have a good estimate of your
channels, you have several thousand taps in your channels, so you'll make the
actual design of the filters very long. You can have -- you usually wouldn't have
perfectly known channels. You have inaccuracies in your channels, and trying to
equalize these perfectly with inaccurate estimates, you'll get into a lot of
problems as well.
And you can also boost noise. So if you have noise and you have your -- you
have your channels trying to invert it like this, it will also increase the noise that
you had in the beginning.
So one of the things I looked at was to try and do something about these two
problems. So it was basically just quite a straightforward translation of the
problem into subbands. So if you have -- this is the full band formulation of the
equalization, so you have your speech signal going through the transfer functions
which you now know, and you design the inverse filters.
You can just translate the whole thing by assuming that this whole thing happens
in the subbands. So the only trick that is needed in this is to find the relationship
really between the full band estimates and the subband estimates of these -- of
the room transfer functions and then you can design the equalizers.
So you basically design and equalize in each subband separately and then you
reconstruct. So that reduces both the complexity quite a bit and also it makes it
less sensitive, to some extent, to inaccuracies on the channel estimates that you
may have.
So here's an example of how that works. You have -- in this case we had,
like -- we had a GDFT filterbank with 32 subbands decimated by a factor of 24,
and with this you get a reduction of about 120 times in terms of the floating point
operations that you would need to design the filters. So it's quite an
improvement.
So here we have simulated the mismatch. So if you have an estimation error, it's
just simulated by adding some noise to the true channels and evaluate the
outcome in terms of magnitude and -- magnitude deviation and linear phase
deviation. So what you look at is if your -- how far your equalized spectrum is
from flat and how far your equalized phase is from linear.
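These two measures can be sketched as follows (a rough reading of "distance from flat magnitude" and "distance from linear phase"; the exact definitions behind the plots may differ):

```python
import numpy as np

def equalization_deviations(equalized_ir, nfft=4096):
    """Deviation of an equalized response from an ideal (scaled, delayed) impulse.

    Returns the magnitude deviation in dB and the linear-phase deviation in
    radians, both as standard deviations across frequency bins.
    """
    H = np.fft.rfft(equalized_ir, nfft)
    mag_db = 20 * np.log10(np.abs(H) + 1e-12)
    mag_dev = np.std(mag_db)                        # spread around a flat magnitude response
    phase = np.unwrap(np.angle(H))
    k = np.arange(len(phase))
    slope, intercept = np.polyfit(k, phase, 1)      # best linear-phase fit
    phase_dev = np.std(phase - (slope * k + intercept))
    return mag_dev, phase_dev
```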
So this is the outcome of that, which -- so this was done with five microphones,
4800 tap channels, again, simulated with image method, and this is -- the results
average over 100 channel realizations, so basically you have the room and you
keep your microphones and your source at the same relative position and you spin
them around in the room to get various realizations of an impulse response.
So you can see the circles are the full band version of the equalization in terms of
magnitude and phase, distortion, and if you do it in the subbands you get some
improvement, and you have system mismatch along the x axis.
So there's still an improvement, and especially if you go sort of above minus 32 --
below minus 32dB system mismatch you can get quite decent results with this.
And if you look at this in terms of dereverberation, just going back to repeat the
same slide, it's exactly the same setup as the dereverberation with
SMERSH, just to see what you could achieve.
In this case the impulse responses are again not -- they're not fully estimated, so
you assume that you know the impulse response, but you add some distortion to
sort of simulate different mismatch that you get.
And this is what you get in terms of segmental SRR with that different MPM, so
different misalignment. So this one hundred is the reverberant case, then you
have the beamformer, so this is the [inaudible] beamformer just as before, and
these are with progressively smaller estimation error of the channel; the better the
channel estimate you have, the better dereverberation you get. And so, yeah, if you had the
perfect channel and everything was great, you could actually achieve pretty good results.
But even down here, that's sort of zero to minus 32dB, you still perform quite
well with that.
And in terms of bark spectral distortion, it's pretty much the same. So, again, this
is the reverberant one, the delay-and-sum beamformer, and in here all the versions, where you
see that with anything above minus 30 -- below minus 30dB misalignment you get
pretty good audible results. And it's something you can hear as well.
So just to conclude that part of the talk -- yes?
>>: Do you know if, like -- I'm just trying to get a sense of the realisticness of
using all these techniques. So if I am here and I [inaudible] and I move my head
six inches to the left, how much -- do you have any idea how many MPM -- dB of
MPM between those two filters there will be?
>> Nikolay Gaubitch: I don't know it in terms of MPM, but I know that even if you
move very little, trying to equalize with what you had here already, you can
provide quite a lot of distortion. But I don't know in terms of MPM exactly what it
is.
>>: See it seems like by the time you figure out what the filter is, unless the
person is really staying still, the chances --
>>: [inaudible].
>>: No, right. But I'm saying in a real scenario it seems --
>> Nikolay Gaubitch: In a real scenario, yeah, it's -- it's very difficult, and that's --
yeah. That was -- as I said, this was one part of my Ph.D. work, and one of the
conclusions that I did get to is actually -- I think this is nice because it gives you
hope that at some point you might be able to do perfect dereverberation, you
know, that it's there in a very artificial scenario. But at the moment, the way
things work, it's not really possible because you need to be able to track the
impulse response within seconds -- less than seconds.
So I think in general, between these two algorithms, you have on one end of the
spectrum this SMERSH which gives a little bit of dereverberation, but you could
actually implement it on a computer and it will do something, and then you have
this blind system identification and equalization which potentially can give you
really good dereverberation, but at the moment, yeah, you need to calculate for a
few days at one location or something to actually get something out of it.
So I'm going to move on to the next bit, which is intelligibility estimation.
So I have until 12, right? Okay.
So just with a bit of background, intelligibility estimation is -- what we're trying to
do is -- going into noise now -- is to find a relationship between SNR and how
much you can understand what is being said --
>>: You mean a human being?
>> Nikolay Gaubitch: In my case it's human being, yes. Yeah. So, yeah, this
stems from the project that I work on now.
And so the question is how you estimate this. Of course, one way is to have
automatic estimation, so that would be really good if you could have things. And
there are some methods to do that, some of which are intrusive. So intrusive
estimation means that you have the clean signal and the noisy signal, and you
can plug them into an algorithm and it gives you out a number.
And then you have the sort of oracle of everything which is non-intrusive where
you should be able to just give it a noisy signal and it gives you out a number of
how good the intelligibility is.
And, yeah, there's quite a few methods out there for intrusive and very little for
non-intrusive. There's been some recent attempts on that, but not fully verified
yet.
And then it's the subject-based estimation where you, instead, have listeners,
people sitting there and listening and somehow indicating what they hear.
So in that case you can have two ways of looking at it. One is that you fix your
SNR for a given noise. So you play samples at the fixed SNR and you count
how well these subjects understand what's being said, then you get out the
percentage of correct recognition. Or you can have a variable SNR and a fixed
performance that you want. So you set it and you say I want to find the SNR for
which the recognition rate or the intelligibility rate is 75 percent, and then you
vary the SNR until you find the right spot.
So this constant stimuli version where you fix the SNR is quite slow and this one
is somewhat faster, and in the work that we were trying to do this subject-based
SNR is what we went for and what we actually needed. So I'm going to talk a
little bit about that.
So the objectives of what we were trying to do is basically trying to assess the
effect of speech enhancement algorithms on the intelligibility and also to maybe
find -- see if there are safe regions of operation in these methods, so if somebody
gives you some speech cleaning algorithm, to tell them does it harm
intelligibility or not. And often you get the speech enhancement algorithm with a
slider where you can adjust some parameters. So you can try and look at the
parameter space and see if there are some regions in this parameter space
where it could be relatively safe to use this.
Now, although these automatic estimation algorithms exist, they're good at
estimating the intelligibility of speech and noise when nothing else has
happened, so it's just speech and noise. But they're not very good at predicting
what happens after you've processed the speech. So that's one of the problems.
And this is ongoing work to try and actually come up with good automatic
methods to predict intelligibility for speech enhancement methods.
The subject-based methods, on the other hand, for this type of problem result in
a lot of people having to listen to a lot of data, which takes a lot of time. And if
you pay these people, which you should for ethical reasons, it gets quite
expensive.
So we needed something that you can do quickly with people, and -- so this is
the basic procedure of this adaptive test for intelligibility. So in general that's
what you do. You have a trial, a trial n, you present the listener with something,
a sample, and the listener has to indicate what they think they heard. And you
present this at a certain SNR.
The listener indicates in one way or the other I heard this or that, and you score
that with 1 if it was correct or 0 if it wasn't.
Then you take this knowledge, you have the knowledge of your SNR and what
the subject indicated, and you have to adjust the SNR somehow to present it
back to the subject.
And typically this would be -- what people use is a fixed step-size way of doing it.
So you basically say your step-size is 1dB, and if you're looking for intelligibility
performance of 50 percent, you play the sample, and if the subject is right you
decrease the SNR by 1dB, and if the subject is wrong you increase it by 1dB,
and at some point you will converge hopefully to something which gives you 50
percent intelligibility points in terms of SNR.
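A minimal sketch of that fixed-step, one-up/one-down staircase, which hovers around the 50 percent point (the respond callback, starting SNR, and trial count are placeholders, not details from the talk):

```python
def fixed_step_staircase(respond, snr_start=0.0, step_db=1.0, n_trials=30):
    """Simple 1-up/1-down adaptive procedure.

    respond(snr) : callback returning True if the listener answered correctly
                   at the given SNR in dB, False otherwise.
    """
    snr, history = snr_start, []
    for _ in range(n_trials):
        correct = respond(snr)
        history.append((snr, correct))
        snr += -step_db if correct else step_db    # harder if right, easier if wrong
    return snr, history
```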
>>: Are you just giving one probe and getting their response or are you doing an
average over --
>> Nikolay Gaubitch: No, you have to do an average over quite a few. So that's
the whole point. So because you don't know what the SNR you're looking for is,
you have to start somewhere arbitrarily.
>>: No, no, I meant that let's say right now I'm at SNR equals, you know, 2.
Then are you giving, like, 10 probes to the user at that point and then averaging
to get their performance rate or are you just doing one and getting back a single
response?
>> Nikolay Gaubitch: You get -- well, it depends a little bit on the data. In our
case you do one and you get a single response. So you do --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. And then you can step after that.
So, really, actually the question that I'm getting at is how to best adjust the SNR
at each probe so that you get to the right SNR for a given intelligibility as quickly as possible.
That's really the fundamental problem here.
And one way to try and do this is to look at something -- this is something called
the psychometric function of the speech, which is quite commonly used in
psychometrics and psychoacoustics.
So basically this is what the relationship between SNR and intelligibility looks
like. You have a sigmoid function which links the two, and that sigmoid has kind
of a point of intelligibility that you're interested in, say 75 percent, you have a
slope of that function, and you would have a threshold, which is the SNR.
And, also, there's shifts and scaling. So, for example, you have this one, which is
the guess rate, so people can sometimes guess the right answer depending on
how many options you have, and you have the lapse rate, which is basically a point
when you should hear what was being said but you don't because you're thinking
about lunch or something, for example.
So starting with this thing, we use this background model rather than just doing a
step-by-step adjustment of the SNR trying to select the SNR so that we get the
most information about the underlying psychometric function at a given
intelligibility level.
So you can take the psychometric function and with all these parameters of --
yeah, the lapse rate and the guess rate, so you have the [inaudible] this, and
then this is your sigmoid function, and you can take more or less anything which
represents a sigmoid function. In this case we used the cumulative normal
distribution function, which is a nice sigmoid, and basically the slope and the
threshold are controlled by the mean and the variance here.
So now trying to estimate the underlying psychometric function is really just by
estimating the mean and the variances of that and find where it fits best.
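Written out in my own notation, the psychometric function described here, with guess rate gamma, lapse rate lambda, and a cumulative normal Phi whose mean and standard deviation set the threshold and slope, is:

\[
\psi(\mathrm{SNR}) \;=\; \gamma + (1 - \gamma - \lambda)\,
\Phi\!\left(\frac{\mathrm{SNR}-\mu}{\sigma}\right)
\]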
So using this as a background, what we can do is say, well, select the next SNR
value of the next probe so that we minimize the variance of the slope and these
threshold estimates. Right?
And just to briefly sort of go through how we do this, you start off with a
2-dimensional PDF where -- yeah, so you have the slope and the thresholds for
all possible psychometric functions, you initialize this PDF to something. In the
next step you can calculate the probability of getting a response r at a given
SNR. Then you estimate the posterior probabilities for each of these
psychometric functions that you have underlying it. And using this, you can find
the variance of these two parameters, the parameters of slope and threshold,
and also the expected variance that we were looking for.
So then you basically select the next SNR that gives you the smallest expected
variance for the estimation of the two parameters. Then you present --
>>: So you don't know that the probe will be -- oh, I see. You're looking at either
if it's a [inaudible] --
>> Nikolay Gaubitch: Exactly. Yeah. [inaudible] because I don't know what it's
going to be.
So what you do then is you present it to the listener, you get the answer, and
then you update your PDF based on the answer and the SNR that you tested.
And, yeah, this is called BASIE, speaking of acronyms. It took a while.
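A compact sketch of this kind of Bayesian adaptive procedure, in the spirit of BASIE as described above (the grid discretization, the guess and lapse values, and the function names are mine, not the published implementation):

```python
import numpy as np
from scipy.stats import norm

GUESS, LAPSE = 0.001, 0.02   # example guess and lapse rates (assumptions)

def psi(snr, mu, sigma):
    """Psychometric function: probability of a correct response at a given SNR."""
    return GUESS + (1 - GUESS - LAPSE) * norm.cdf((snr - mu) / sigma)

def choose_next_snr(prior, mus, sigmas, candidate_snrs):
    """Pick the SNR that minimizes the expected posterior variance of the
    threshold (mu) and slope (sigma) estimates. `prior` is a 2-D grid PDF
    over (mu, sigma)."""
    MU, SG = np.meshgrid(mus, sigmas, indexing="ij")
    best_snr, best_cost = None, np.inf
    for snr in candidate_snrs:
        p_correct = psi(snr, MU, SG)
        cost = 0.0
        for lik in (p_correct, 1 - p_correct):       # hypothetical correct / incorrect response
            post = prior * lik
            p_r = post.sum()                         # probability of that response
            post = post / p_r
            var_mu = (post * MU**2).sum() - (post * MU).sum()**2
            var_sg = (post * SG**2).sum() - (post * SG).sum()**2
            cost += p_r * (var_mu + var_sg)          # expected posterior variance
        if cost < best_cost:
            best_snr, best_cost = snr, cost
    return best_snr

def update_posterior(prior, mus, sigmas, snr, correct):
    """Bayes update of the (mu, sigma) grid after observing one response."""
    MU, SG = np.meshgrid(mus, sigmas, indexing="ij")
    lik = psi(snr, MU, SG) if correct else 1 - psi(snr, MU, SG)
    post = prior * lik
    return post / post.sum()
```

In use, the prior would start as a uniform grid over plausible thresholds and slopes, choose_next_snr would be called before each trial, and update_posterior after each response.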
So, yeah. So that's the idea. And in this way you basically get the quicker -- the
idea is that with this one you should get a quicker estimate and a more accurate
estimate of the SNR you're looking at.
So the other thing -- sorry?
>>: [inaudible]
>> Nikolay Gaubitch: At the moment I just pick a bunch of SNRs and find the best
one. Yeah, exactly.
And, actually -- well, actually I didn't add that on the slide, but in practice you also
resample and rescale the PDF at every iteration so that you get closer to the
actual one. It's quite -- yeah, I'm not sure about a closed form, because you need
the inputs of the --
>>: [inaudible]
>> Nikolay Gaubitch: No. No. Because it's more of an inductive filter with a
person sitting there. So you need this interaction. You just have this underlying
model of the person that you're trying to match.
>>: What I'm saying is you can choose the -- depending on what the form of the
[inaudible] variance was, you can just take the derivative of that expression with
respect to [inaudible].
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, actually --
>>: [inaudible].
>>: In fact, it might even be quite convex. In this case it's not clear that you
actually need to just sample a bunch of points. You might be able to use, you
know, [inaudible]
>> Nikolay Gaubitch: Yeah, maybe. I haven't looked at that yet, but, yeah, that's
one of the ideas, to look at the logistic function to see if you can do better.
>>: And one thing that might be interesting is that -- I mean, I would suspect
[inaudible] is fine because I expect that that bound is probably going to recognize
we have steps of like .1dB and then you look around that to see how much
variation you'd get in the bound based on a small movement, it probably wouldn't
be that big a deal. So it probably just [inaudible] it's probably pretty smooth, I'm
guessing --
>> Nikolay Gaubitch: Yeah. No, it looks -- yeah, [inaudible].
>>: So the question about that -- so you're saying that you choose a bunch of
points. Are you changing the sampling of that, so maybe you did it at some fixed,
you know, .1dB steps or something like that, but do you change the spacing of
that grid as you move along in your algorithm, or is it always the same?
>> Nikolay Gaubitch: It's always the same spacing. It's a fixed --
>>: [inaudible] some fixed kind of sampling along the entire range from 0 to --
>> Nikolay Gaubitch: Yeah.
>>: -- [inaudible].
>> Nikolay Gaubitch: Yeah, you need to sort of fix the ranges as well with the
SNRs, so yeah. But normally, you know, it's quite reasonable to fix it, yeah. You
can have a pretty fine grid which is more or less enough. I mean, you will never
really get that accurate estimates based on the humans anyway, so -- yeah.
And then the other bit which you could do with this is, of course, you could make
it estimate more than one psychometric function at once, which is -- I'll show you
later why that's a good idea.
Yeah, actually -- so the idea of this using -- estimating more than one
psychometric function is that you can simultaneously estimate processed and
unprocessed speech with one subject if you try and speed these things up, and
it's also -- while doing this, you're playing samples to the listener which have
been processed and which have been not tampered with, so it sort of helps with
the attention and keeps the listener happier. That's the one of the thoughts that
we've had.
>>: So what would be the word clue in that case? So they would hear, like, type
one sample and they would give a response, and then they'd hear, like, type two
sample and give another response?
>> Nikolay Gaubitch: No. So that's the next point, actually. The answer to your
question is that you choose which sample to play which gives you the best
minimization of the cost function that you have.
>>: So you're choosing now not only between SNR points but also between
which PF you're choosing at every point?
>> Nikolay Gaubitch: I mean, so if you've already converged very quickly with
one of them, there's no point in keeping playing back samples on that, right?
>>: But is this -- but it seems like these could be completely independent
problems. Like, I mean, does it make a difference to --
>> Nikolay Gaubitch: No, no, it doesn't make any difference to --
>>: To serialize them and do, like, first PF1 and then PF2?
>> Nikolay Gaubitch: You could do that, yeah. Exactly. The only difference it
makes is that there's no point in playing -- in collecting more data if you know that
you've already converged quite well. So if your error is minimized, then there's
another -- so if you have two conditions, unprocessed and processed, for
instance, then if you've done it quite quickly -- if you've got to the right point quite
quickly with the processed one, but the unprocessed one hasn't, you want to use more
time to try and converge with the unprocessed one. So that's the only reason. But it
doesn't actually do anything else --
>>: They don't really help each other?
>> Nikolay Gaubitch: No. The only help -- the idea is that by interleaving
samples in some kind of --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Exactly. That's the idea.
>>: Okay.
>> Nikolay Gaubitch: Yeah. So here is just one example of a simulated version
of this BASIE. So this is just -- just to see that it works if your underlying model is
correct. So this is simulated with the psychometric function model, so you just do
the inverse of that and you generate user responses according to it. That's the
idea with this.
And it's compared with a fixed step-size of 0.5dB, which is pretty good. Normally
people use around sort of 1, 2dB step-sizes for these.
And it's averaged over 500 runs. So here is the BASIE one. So you can see
here is the number of trials that you run, and here are the -- here's the threshold
estimation bias that you get. So you can see that you get pretty accurate
estimation pretty quickly with BASIE, but with the fixed step-size you're going to
have quite a large bias, and also the variance is pretty large with the fixed
step-size, something you can't get away from.
>>: I'm surprised with the fixed step -- I would have expected to kind of more
jump around back and forth across those [inaudible] bias point. Why is it
always -- because, I mean, if you go with the fixed step procedure as you
described earlier, they're getting a probe and then the person says something
and you take a step, you give a probe and they take a step, so you can see how
that would result in continuous jumping around. But I'm surprised that it's always
biased in one direction here. Do you have any thoughts about why that is?
>> Nikolay Gaubitch: I think -- yeah, I'm not sure exactly why. I mean, I think the
reason is probably that you have a very ideal user which does something strange
with that. So you might converge -- because if you look at it --
>>: [inaudible]
>> Nikolay Gaubitch: Because in reality, it wouldn't be like that. In reality you do
get actually those jumps that you're talking about.
So this is really with a very ideal user, so in some sense BASIE is more suitable
to test with this ideal user because it's trying to estimate exactly that function,
while this fixed step-size, when you do that you don't really necessarily have this
underlying assumption of the threshold.
So, I mean, one of the main things that I wanted to see with this is how many
trials, as a minimum, you need to get some convergence which is satisfactory,
and the answer was that you get it at around 30 iterations. You get an answer
which is sort of between plus and minus 1dB, which is usually good enough for
estimation.
>>: I guess one thing that I'm [inaudible] about the simulation here is that what is
the model for the human response here? They'll just respond exactly the
probability that the model says, so if you're -- okay, I see. So you have, like, this
sigmoid here --
>> Nikolay Gaubitch: Exactly.
>>: -- and you know the model is correct, you have some parameters that are
predefined for the simulation, and the next thing is they'll report with 70 percent
accuracy here, and then they just report with 70 percent accuracy and you just
sample from that?
>> Nikolay Gaubitch: Exactly. It's a very ideal condition for BASIE. That's it,
yeah, exactly.
>>: And this still says -- this assumes a single user?
>> Nikolay Gaubitch: Yeah. Well -- yeah. Because you have one model. So it
doesn't really incorporate anything that is more user specific than that.
>>: So one trial is typically, what, a phrase?
>> Nikolay Gaubitch: So -- well, that's a long ongoing discussion in this
community about what one trial is. So it can be a sentence or it
can be a word. As I'll mention a little bit later, the thing I was using for a while
was digit triplets. So you hear --
>>: Three numbers.
>> Nikolay Gaubitch: -- three numbers, and then you have to enter the three
numbers that you heard.
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Exactly. That's the idea. So I'll show you how I use
this for evaluating a couple of methods.
Yeah, so this is where it actually goes to when it was applied for estimating
intelligibility. So this interleaved bit, it doesn't really help, as I said, with the
estimation anyway, but you can sit down one user and he does one trial and you
get information about several different settings, for example, in one algorithm or
any other thing.
Because, actually, this you can apply with anything. You can -- all we do is vary
the noise and you can have -- so you can have reverberation in the background,
you can have a codec after the noise. So I looked at various things with this. So,
you know, the noise is just a variable. And very often it's a very sort of, you
know, decisive variable.
And the way we measure this is something called -- well, it's not called -- I've
called it relative tolerance to added noise, RTAN. People call this kind of thing by different
names, but what it means really is that you look at the threshold SNR for the noisy
speech and the threshold SNR for the processed one, take the difference, and
that gives you sort of the tolerance. So if you have a positive RTAN of 1dB, it
means that you can lower your SNR by 1dB and still get the same performance
in intelligibility. So that's really the whole thing.
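In formula form (my notation), with both thresholds measured at the same target intelligibility level, here 75 percent:

\[
\mathrm{RTAN} \;=\; \mathrm{SNR}^{75\%}_{\text{noisy}} - \mathrm{SNR}^{75\%}_{\text{processed}}
\]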
And this is how people were using this. So it's digit triplets that we're using, and
you have an interface, so you listen to stuff and you just enter your three
numbers.
Yeah, there's a long talk just on how you would actually select all -- the data that
you're going to listen to but I'm not going to go into that.
So here is a relatively small study with this, but it's just to sort of demonstrate
what the idea was. You have six people. The samples -- yeah, these digit
samples were normalized, and you have car noise in the background, and people
listened to this with headphones. And we tested this with noise reduction from a
commercial audio workstation, which I won't mention -- it's not good -- and just
standard spectral subtraction with minimum statistics noise estimates.
And you have to choose the parameters of this commercial audio workstation -- for
example, it has five, six, seven parameters that you can vary, but actually the one
that makes the most difference when you listen to it is the maximum amount of
noise reduction that you apply.
And also it's similar with the spectral subtraction. That's where you can really
hear things happening.
So I took a bunch of settings for that between minus 1dB, so you have hardly any
subtraction that you can hear, to minus 40dB where it really tries to destroy
everything. And if there were any other parameters in the algorithm, I just kept
them fixed to whatever was recommended.
And so, yeah -- so this is really pushing it to 150 iterations, so you do 150
presentations of these samples to the people, which include the estimation of all
of these things, which is pretty quick. So it's 10 minutes --
>>: [inaudible]
>> Nikolay Gaubitch: And you get -- yeah. You get quite a lot of information in
10 minutes from one person, which is pretty good.
And this is the outcome of this. So really the absolute values here are maybe not
as important as the highlights -- well, first of all, generally the spectral
subtraction and this one as well, they destroyed the intelligibility quite a lot
straightaway. Even with small settings, it seems like -- yeah, there's quite a large
effect on the intelligibility. But this commercial system one is definitely a very -has a very strong effect. And, I mean, you can hear it if you set this parameter to
minus 40dB.
Of course, this maximum noise attenuation, I know what it means in this one because
it's a MATLAB implementation, and I know what's done in there. With this one,
I have no idea what it means. So anyway, when you do it at minus 40 it
destroys more or less anything that you can hear and --
>>: So a positive RTAN is when it's making things better and a negative RTAN is
when it's making things worse?
>> Nikolay Gaubitch: Exactly, yeah. So I highlighted these two things here,
which are the positive RTANs.
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Well, so this RTAN is actually the SNRs, so you
vary the SNR all the time, right? So --
>>: But, I mean, what was your range?
>> Nikolay Gaubitch: Well, roughly between plus 10dB and minus 20dB. Because
what I'm looking for here is 75 percent intelligibility. So for car noise, for that you
need to be somewhere at minus -- I think it's minus 11dB or something like this to
actually get -- unless it's a very good car.
>>: Did you count r as one [inaudible]?
>> Nikolay Gaubitch: No, I did all three digits. So all three digits have to be
correct, yeah.
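In other words, a trial is scored all-or-nothing; a hypothetical scoring function for one trial (names mine) would simply be:

```python
def triplet_correct(presented, response):
    """Score one digit-triplet trial: correct only if all three digits match, in order."""
    presented, response = list(presented), list(response)
    return len(presented) == 3 and presented == response

print(triplet_correct([4, 7, 2], [4, 7, 2]))  # True
print(triplet_correct([4, 7, 2], [4, 1, 2]))  # False -- one wrong digit fails the whole trial
```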
>>: So let me make sure I understand. So it seems like -- in your formulation
you said that, like, I have a target which is, like, 70 percent recognition, and then
your RTAN is measuring that, like, okay, if I care about having that 70 percent
accuracy, this number of dB can drop in the process and still have this thing. So
an alternative way to look at it -- I mean, that makes sense. I think that -- but in
an alternative way you could say -- so the thing is that, like, there would be a
different chart essentially for each target recognition rate that you have. It could
be said that, okay, now you're at about 80 percent accuracy and intelligibility rather
than 70 percent, and it's a different chart.
>> Nikolay Gaubitch: Yeah.
>>: And I wonder -- so let's say now [inaudible] wants to characterize the system
or something like that. Like an alternative way would be that you could say that,
like, each of these settings of dB actually just measure -- instead of [inaudible]
counting the actual intelligibility rate, you know, in each case -- because then you
have a fixed table -- I mean, it's not as good in some ways -- I'm just curious what
your -- do you see what I'm saying as the alternative?
>> Nikolay Gaubitch: Yeah, yeah. Basically you keep fixed SNRs and then you measure the --
>>: Fixed SNRs. Like you just walk through this and now you get down to your
table and then you say that, like, well, given this, what is the -- I'm just curious as
to your thoughts around it. Because I like what you're saying about the RTAN,
but I wonder what your thoughts are about, like, you know, doing RTAN versus
just measuring, like, a fixed, you know, [inaudible], what the pluses and minuses
might be around doing that.
>> Nikolay Gaubitch: So if I understand you correctly it's that -- it's really looking
at the alternative that you have fixed SNRs, a bunch of SNRs, and you have to
measure that.
>>: [inaudible]
>> Nikolay Gaubitch: So, first, it takes longer to do it. The second one is that
you wouldn't really -- it's kind of difficult to know what SNR range to look at, right?
So in this case you actually -- yeah, I mean, it's easier to know what intelligibility
level you're looking for than knowing what SNR range you're looking into,
because the SNR range will change --
>>: But there is this psychometric function that gets you above -- I don't want to
say [inaudible] prior distribution on what these thresholds are.
>> Nikolay Gaubitch: No, but they change for every noise type. So the intelligibility level --
>>: Hugely?
>> Nikolay Gaubitch: Yeah. So if you have babble noise, for example, to get
75 percent intelligibility you have to be somewhere, well, in the vicinity of 0dB, sort of
minus 5, 6dB, I think it is. And if you have car noise, you can be down
to minus 12.
>>: I misunderstood one part of your table. I see. So this noise attenuation,
that's a setting in the algorithm?
>> Nikolay Gaubitch: Yeah, yeah.
>>: Oh, okay. So given a setting in the algorithm --
>> Nikolay Gaubitch: That's it.
>>: -- then you're looking across a bunch of different --
>> Nikolay Gaubitch: Exactly.
>>: Oh. I missed that part. Sorry. I thought this was just a fixed --
>> Nikolay Gaubitch: No, no, no.
>>: I see. This is just --
>> Nikolay Gaubitch: Basically you change the setting of the algorithm and then
you find the SNR for which your intelligibility level is 75 percent.
>>: And now you're looking across a bunch of -- I see.
>> Nikolay Gaubitch: Exactly.
>>: So I think it's kind of [inaudible].
>> Nikolay Gaubitch: Yeah, yeah.
>>: [inaudible]
>> Nikolay Gaubitch: No, I wasn't trying to prove that at all.
>>: [inaudible]
>> Nikolay Gaubitch: Right.
>>: And so -- but there are speech processing algorithms that claim at least to
improve intelligibility like these binary [inaudible] things and these other things
[inaudible].
>> Nikolay Gaubitch: Yeah. Exactly.
>>: [inaudible]. So have you tried -- it would be interesting to see if you could
verify something that correlates with some published results with a positive
outcome rather than a negative outcome.
>> Nikolay Gaubitch: I mean, the trials I've done are something that I cannot
present here, but I have correlated it to -- so I correlated this way to really more
robust intelligibility estimates. So there's actually -- this is quite a weak test for
intelligibility, but it's fast.
So another way to do it is you have a listener who listens to a whole sentence
with keywords and so on and so forth, which is -- [inaudible] does that a lot with
his tests.
And I know that for spectral subtraction and a few other algorithms, we
definitely get really good correlation between this more rigorous test and this
simple test.
We have looked at -- we have actually looked at some of Philip [inaudible]'s
methods and tried to reproduce some of the results, but we haven't really been
able to get this intelligibility improvement that he claims, to be honest, yet.
Although -- yeah, I mean, binary masks, for example, if you have the ideal binary
mask you can easily demonstrate something which sounds really good, right,
because you have this oracle and it sounds fantastic. And it's clear that you do
improve intelligibility, but to actually do something with it, it gets tricky.
I haven't -- because -- yeah, it's because of the nature of the work that we've
been doing. We've tried different things.
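For context, the oracle being referred to is typically the ideal binary mask, which requires the clean speech and the noise separately and keeps only the time-frequency cells where the speech dominates. A minimal sketch, with the 0 dB local-SNR criterion as an illustrative choice rather than anything stated in the talk:

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(clean, noise, fs, nperseg=512, lc_db=0.0):
    """Oracle mask: keep a time-frequency cell only where the local SNR
    (known clean speech vs. known noise) exceeds lc_db; zero it otherwise."""
    _, _, S = stft(clean, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
    mask = (local_snr_db > lc_db).astype(float)
    _, masked = istft(mask * (S + N), fs, nperseg=nperseg)  # S + N is the mixture spectrum
    return masked
```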
>>: So one other question. There's a subtlety here around averaging over users
because -- are you just, like, taking the scores over all the users and just taking
the mean and variance for each, or are you doing an ANOVA --
>> Nikolay Gaubitch: In this case it's just the mean and variance, but, yeah, we tend
to do an ANOVA --
>>: Because especially in this one -- let's say that you have listener A and
listener B, and listener A just kind of has crappy hearing and listener B is good.
You want to account for that in the variation so that --
>> Nikolay Gaubitch: In general we do that, so --
>>: Oh, you do?
>> Nikolay Gaubitch: Yeah. This is -- as I said, this is just a case to sort of
demonstrate what we're trying to do. But, yeah, in reality we have many more
listeners and --
>>: So you are [inaudible]?
>> Nikolay Gaubitch: Exactly. So --
>>: [inaudible].
>>: I'm sorry?
>>: [inaudible] [laughter]
>> Nikolay Gaubitch: So, yeah -- so basically, just to sum up that part, that was the
presentation of the BASIE tool, which can help you estimate
the intelligibility relatively quickly; in about 30 trials you can get an accuracy of
plus-minus 1dB. And I showed you how you can actually use this to try to search
in a sort of space of settings with subjects relatively quickly.
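The talk doesn't spell out the BASIE procedure itself, so the following is only a generic sketch of the adaptive idea -- a weighted up-down track on SNR that converges near 75 percent correct -- and not the Bayesian method the tool actually uses; `run_trial`, the step sizes, and the trial count are all illustrative:

```python
def adaptive_srt_track(run_trial, start_snr_db=0.0, step_down=1.0, n_trials=30, target=0.75):
    """Weighted up-down track (Kaernbach): step up = step down * target / (1 - target),
    so the SNR track converges towards the target proportion correct.
    run_trial(snr_db) should present one digit triplet at that SNR and return
    True if the listener entered all three digits correctly."""
    step_up = step_down * target / (1.0 - target)   # 3 dB up-step for a 1 dB down-step at 75%
    snr, track = start_snr_db, []
    for _ in range(n_trials):
        correct = run_trial(snr)
        track.append(snr)
        snr = snr - step_down if correct else snr + step_up
    # Crude threshold estimate: mean SNR over the second half of the track.
    half = track[n_trials // 2:]
    return sum(half) / len(half)
```

In practice `run_trial` would drive the listening interface described above; a Bayesian procedure additionally maintains a posterior over the threshold and slope and picks each trial's SNR to be maximally informative.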
So thank you for listening.
[applause].
>> Ivan Tashev: Do you actually have more questions?
>>: Yeah, I have --
>> Ivan Tashev: Go [laughter].
>>: So one question is -- I really like this subject. I think that was quite
interesting [inaudible] at this point now you can say, oh, well, we don't want to
use this test, we want to use, you know, more -- the stronger sentence-based
keyword testing and just slip that in [inaudible]. But one question I was going to
ask is what do you know about -- or can you characterize how good that model
is, like that -- you know, that particular sigmoidal model using the normal -- are
there other measurements or is there evidence that shows that that is a really
good model across a variety of situations to use?
>> Nikolay Gaubitch: There is quite -- yeah, there's quite a large amount of
research around that, yeah. Yeah, there's a vast amount of literature on
intelligibility testing, and so --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, it looks like it. And, actually, the other thing which
we looked at which is less obvious is if you have a signal processing algorithm,
would it just shift the whole psychometric function if you have an improvement or
would it act differently in different parts?
>>: So that's a good question [laughter]
>> Nikolay Gaubitch: [inaudible].
>>: That's okay. That's the wrong answer. Now you're making your life more
difficult [laughter]. Right. Because you could have a certain kind of algorithm
now which effectively changes the shape of what [inaudible] --
>>: [inaudible].
>>: Right. But then that sigmoidal one is no longer the right one.
>> Nikolay Gaubitch: Well, I didn't say that. So, actually, the interesting thing
is --
>>: [inaudible].
>>: Yeah, well -- I mean, but it could -- so it could be different, right? It could be
the case that something -- it might not be monotonic anymore because it could
be the case that, you know, [inaudible] which is really great under a certain level
of noise, but if things get too clean, then it actually does worse, you know,
[inaudible].
>> Nikolay Gaubitch: But so far what we've found, actually, is that -- so doing
this with more rigorous testing, we actually found that with quite a few of the
standard sort of speech enhancement methods that exist, there is just a shift. So
so far, so good. I haven't seen any evidence --
>>: He's talking about minus 11dB for [inaudible]. That's relatively --
>>: [inaudible].
>>: No, that's okay. It's just that, I mean, like -- and it doesn't even have to just
be a shift, right, because you're modeling both the [inaudible]. My concern was
just that if something -- if there were some algorithm -- there may not be. I may
just be positing, you know, something that doesn't exist, but if there were
something where the actual, you know, overall profile of the response with
respect to noise level changed completely -- like if it actually got worse under low
noise, then you might no longer have this thing. You might actually --
>> Nikolay Gaubitch: In some sense, in defense of the case, you actually just -- although we have the whole model of the function, you're still kind of looking at
one point. So I don't know what the answer is, but you look at some kind of
slope. So that kind of shift -- I'm not sure, but --
>>: One thing that I would see is, you know -- another thing that I would say
which is good about your method which you didn't highlight, but maybe you
should, is that if you believe the model and you do your [inaudible], you're not just
getting a point, right? You're actually getting at the curve. So one thing that I
would say is that, like, you know, that you could argue -- and I was wondering if
you would say that when I asked the earlier question -- you could, but maybe you
feel like you'd be reaching too far -- you could say that -- you know how I was
asking that, like, if I changed my threshold and made a whole different chart, you
could argue that if you fit the model well, you could just create that new chart
without running any more samples because you've actually modeled the entire
curve.
>> Nikolay Gaubitch: No, no, exactly. That's right. That's it.
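A rough sketch of that idea (my own parametrization, ignoring guess and lapse rates, and not necessarily the exact model used in the tool): fit a sigmoidal psychometric function to the per-trial outcomes once, then read the SNR off the fit for any target level without new listening trials:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(snr_db, srt_db, slope):
    """Cumulative-Gaussian psychometric function: P(correct) versus SNR,
    with midpoint srt_db (the 50 percent point) and slope in 1/dB."""
    return norm.cdf(slope * (snr_db - srt_db))

def threshold_from_fit(snrs_db, correct, target=0.75):
    """Least-squares fit to per-trial 0/1 outcomes (a maximum-likelihood fit
    would be more principled), then invert for the SNR at the target level."""
    (srt, slope), _ = curve_fit(psychometric, np.asarray(snrs_db, dtype=float),
                                np.asarray(correct, dtype=float), p0=[-5.0, 0.3])
    return srt + norm.ppf(target) / slope
```

So the same fitted curve gives, say, the 70 and 80 percent thresholds as well, which is the point being made above.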
>>: That's a great benefit --
>> Nikolay Gaubitch: No, no, it is --
>>: -- over kind of --
>> Nikolay Gaubitch: I didn't talk much about the slope estimate because I don't
really use it for anything yet. But, yeah, that's another good advantage of this
because if you do the fixed step-size and if you didn't need the whole curve, you
have to sort of sample a few points along the curve, which is basically, to some
extent, what you said. And then you can sort of fit --
>>: I mean, there is an issue of -- you could argue that -- so because you're
focusing on that point, it might only be really valid around that, that region, and
maybe not [inaudible].
>> Nikolay Gaubitch: Yes.
>>: I see. That's what you're saying about the fact that it's region 2. So this
curve is valid only within that --
>> Nikolay Gaubitch: Exactly.
>>: -- the neighborhood of where you care about, that's all that really matters.
>> Nikolay Gaubitch: Because what could happen is that if you have this
algorithm, you can still be valid at the top and [inaudible] --
>>: The problem is that your -- I mean, I think your setup might be convex-ish
right now, but then if things were really weird, like you might end up in some state
where there might be multiple places, you know, which satisfy your
[inaudible].
>> Nikolay Gaubitch: It's very tricky. And once you have humans involved in it,
it's quite difficult, yeah, and the processing. But I think so far it's been fine,
and so far the only evidence that we do have from a lot of people is that this
model actually holds.
>>: The other thing that might be interesting -- maybe I'll save it for [inaudible].
>>: [inaudible].
>>: I'll stop talking. Go ahead.
>>: [inaudible].
>>: Oh, okay. All right. Okay. Good.
>> Ivan Tashev: No more questions? Thank you, Nikolay.
>> Nikolay Gaubitch: Thank you very much.
[applause]