>> Eyal Ofek: Today we have Kailash Patil, who is at the end of his PhD studies at Johns Hopkins University in Baltimore, and he is going to talk about auditory object recognition. Without further ado, Kailash, you have the floor. >> Kailash Patil: Do I need to be standing here? >>: You can walk around. >> Eyal Ofek: You can walk as far as the wireless microphone. >> Kailash Patil: Okay. Thank you all for coming today. Today I'll be talking about auditory object recognition. This is a brief outline of what I'll be going through. I will give you a brief idea of what auditory objects are and what the different kinds of representations are for these auditory objects, and after that I'll talk about some statistical analysis where I did some experiments to evaluate these auditory object representations. I'd also like to talk about the concept of attention in terms of auditory objects. So what are auditory objects? In the Oxford English Dictionary, an object in general is defined as something that can be placed before or presented to the eyes or other senses. You can see from this that the definition or identity of an object is interlinked with the perception of that object; there is no unique identity of an object by itself, it is intertwined with how it is perceived. So you can see why there could be difficulty representing different auditory objects. Moreover, consider this image. An object could be, say, one particular instrument here, or you could consider this whole section of violins, or you could consider the whole image as a visual object. This leads to difficulties in representing different objects. There is also another layer of variability: for visual objects you can have different orientations of the objects, different manufacturers of these instruments, different lighting conditions and so on. Similarly, for auditory objects, two different instruments can produce different sounds, whereas the same instrument can produce different kinds of sounds depending on the playing style, for example vibrato versus a plain mode of playing. So there can be a lot of variation in auditory objects as well. The main difficulty comes from the temporal aspect of auditory objects. Consider a scene where three different acoustic objects are playing at the same time: all three objects overlap, and that mixture is what enters your ear. This temporal overlap of objects can be a difficult thing, yet we are easily able to perceive the different instruments and the different objects of this acoustic scene. I would now like to move on to the kinds of auditory object representations. I'll give you a brief overview of what is out there and what we are proposing here. One simple way of representing auditory objects is just the waveform itself; that's the topmost row here. People have tried to extract different features from the waveform, such as zero crossings or energy amplitude and so on. Another way of representing these objects is at the spectrum level, which is basically the energy present at different frequencies for that object. Again, people have used different features at this level, like spectral flux or spectral slope and so on. 
Finally, as a hybrid between these two, the bottom representation here is a time-frequency representation, and this is the more standard technique used for the majority of auditory object analysis. People generally consider a time slice of the spectrogram, derive different features from each time slice, and then shift this window over time, so you get representations like MFCCs or LPCCs. What you can notice is that these don't capture all the variations and nuances in the spectrogram; they only capture whatever occurs in that small time window, so they may not be very effective at capturing the characteristics of an auditory object. Also, neurophysiological studies have shown that neurons in the auditory cortex are sensitive to various kinds of features, not only frequency and time but different kinds of patterns in the spectrogram; you can have sensitivity to pitch, sensitivity to harmonicity and so on. This suggests that we should look at something that considers more than just a short time window. In the auditory pathway there are millions of connections. We call the processes below the auditory cortex, all the processes up to that stage, the subcortical processes, and the processes in the auditory cortex the cortical processes. The subcortical processes are well studied and well documented, so there are models for the different processes up to that stage. In this talk we will mainly be focusing on how we can model and derive inspiration from the processes in the auditory cortex. Just to go over the brief steps of modeling the subcortical processes: the incoming waveform is analyzed by a bank of asymmetric filters which are uniformly placed on a logarithmic frequency axis. This analyzes the energy present at different frequencies. To sharpen these filters we then apply lateral inhibition; you can consider this as a [indiscernible] frequency, so this effectively sharpens the bandwidth of the filters. And finally there is a smoothing operation which mimics the processing in the midbrain, where there is a loss of phase locking. We end up with something that is similar to a spectrogram; we call this the auditory spectrogram. Here we're going to talk about what we can do to model the cortical processes. We came up with a filter bank which looks like this. What it does is analyze the spectrogram for various patterns. The patterns look like this: the filters look for energy rises and energy falls and so on, at different scales and different modulation rates. For example, this one is very slow in time, whereas this one is extremely fast in time, so these capture the different modulations in time, and similarly you can see the same thing along frequency. Once we have this bank of filters, what it effectively does is map the time-frequency representation into a high-dimensional space where the different axes could be the spectral profile, the temporal profile, time and frequency, and you could even think of adding other feature dimensions here like pitch and so on. As one quick motivation for why this could be a useful representation, we collected many examples of cello and flute notes and averaged the representation. 
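As an illustration of this cortical stage, here is a minimal Python sketch of a generic 2-D modulation filter bank applied to an auditory spectrogram. The Gabor-style filter shape, the particular rates and scales, and the frame rate and channel spacing are illustrative assumptions, not the exact filters used in the talk.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_2d(rate_hz, scale_cpo, frame_rate=100.0, chans_per_octave=24,
             t_len=0.5, f_len=2.0):
    """One spectro-temporal modulation filter (a 2-D Gabor), a generic
    stand-in for the cortical filters described in the talk."""
    t = np.arange(-t_len / 2, t_len / 2, 1.0 / frame_rate)           # seconds
    f = np.arange(-f_len / 2, f_len / 2, 1.0 / chans_per_octave)     # octaves
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T / t_len * 3) ** 2 - (F / f_len * 3) ** 2)
    carrier = np.cos(2 * np.pi * (rate_hz * T + scale_cpo * F))
    return envelope * carrier

def cortical_representation(aud_spec, rates=(2, 4, 8, 16), scales=(0.5, 1, 2, 4)):
    """Map a (time x frequency) auditory spectrogram into a
    (time x frequency x rate x scale) space by filtering with each pattern."""
    out = np.zeros(aud_spec.shape + (len(rates), len(scales)))
    for i, r in enumerate(rates):
        for j, s in enumerate(scales):
            out[:, :, i, j] = fftconvolve(aud_spec, gabor_2d(r, s), mode="same")
    return out

# Toy usage: 3 s of a 128-channel auditory spectrogram at 100 frames/s.
spec = np.abs(np.random.randn(300, 128))
print(cortical_representation(spec).shape)   # (300, 128, 4, 4)
```

The output at each (rate, scale) pair measures how strongly the spectrogram matches that modulation pattern at each time-frequency point, which is the sense in which the representation becomes high-dimensional.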
This is the average representation for these two instruments in that high-dimensional space, or the cortical space, and you can see that most of the energy for the cello lies in a different region compared to the flute. This clearly shows that if you move to this high-dimensional space, two auditory objects which overlap in time are now really well separated, and you can think of doing multiple things in this domain. Now I'll talk about how we analyze the effectiveness of such a representation for different tasks. The first task I'm considering here is speech recognition. Then I'll later talk about timbre recognition and so on. >>: So these object representations, this is the same as the [indiscernible] all the people that have been doing [indiscernible] same [indiscernible] >> Kailash Patil: Yeah, it's representative. >>: [indiscernible] >> Kailash Patil: No. When we move to the cortical stage here, that's different from their model. They have filters which are modeled on the actual neural responses, whereas mine is just a general 2-D [indiscernible] filter bank. So it's similar… >> Eyal Ofek: [indiscernible] this is quite typical. This is the [indiscernible] >> Kailash Patil: This is the [indiscernible] >> Eyal Ofek: Okay. And the next slide. So this is all something [indiscernible] >> Kailash Patil: Yeah. >> Eyal Ofek: So how many features can you extract from those filters? How many [indiscernible] filters? >> Kailash Patil: We have 200 of these filters, or 220 filters, and we have multiple frequency channels, so it's going to map the signal into a really high-dimensional space. >> Eyal Ofek: So pretty much you apply that filter on the spectrogram and the output is something like this in that particular time-frequency space? >> Kailash Patil: Yes. The output is going to show how much of a match we have between this kind of pattern and the spectrogram. >> Eyal Ofek: So you have four dimensions: time, frequency, spectral and temporal? >> Kailash Patil: Yeah. >>: So I'm still a little confused. I thought you had like this one of those successful… >> Kailash Patil: So the difference is in the characterization of these filters. They use filters which actually model the neural responses, whereas mine are generic. >>: Oh, so the shape. >> Kailash Patil: Filter shapes, yeah. >>: Whether it's a specific model or just a filter? >> Kailash Patil: Yeah. >>: But the motivation is the same? >> Kailash Patil: Yeah. >>: So I mean in both cases you end up with a representation like that, 3-D or [indiscernible] okay. >> Kailash Patil: One note on this: when we do some kind of statistical analysis, a high-dimensional representation has its own pros and cons. The machine learning, or backend, stage cannot effectively handle such a high-dimensional space. We have around 28,000 features at the end, so it cannot effectively handle that many features, so we have to do something depending on the task. For speech, we noticed that preserving the temporal information is extremely crucial, because if you want to recognize speech you need the sequence of phones, which are on the order of 100 milliseconds, and these temporal characteristics are extremely crucial. 
Also, the machine learning setup cannot handle a high number of feature dimensions, so instead of modeling these neurons as a filter bank, we are going to model a whole group of neurons. To give you the motivation for what I mean by that, look at the spectrogram here. We take a cross-section of the spectrogram along time, and this is one second in duration. You can see that there are five distinct peaks in the overall temporal profile, and similarly in this spectral profile you can count about four peaks. If you take a two-dimensional FFT of this spectrogram, all the energy is going to be localized because of these characteristics: the main peak is around 5 Hz along time, and along the spectral axis it's below what we call one cycle per octave. So this motivates that in this 2-D FFT domain, instead of having multiple filters, we can capture just the region which contains the speech information. To determine which regions are the most effective for speech recognition, we tried three different filters. We call these sub-optimal one, speech-centric and sub-optimal two. The actual filter shapes are given here on the temporal axis and the frequency axis; the dotted line is the shape of the filter. What we are testing here is how much of this region we need to capture, and the results of these three filters will tell us that. To test this, I'll give you the details of the experiment. The spectrogram was 128-dimensional, so after the modulation filtering you still have a 128-dimensional representation; we then append delta features, so we end up with roughly a 500-dimensional feature representation, and we tested it on a phoneme recognition experiment on [indiscernible]. During testing this is a mismatched condition: we train on clean speech and test on additive noise, so only the test set is corrupted with additive noise from the [indiscernible] database. For the backend we have a hybrid [indiscernible] MLP setup, and the MLPs here are staggered: we have two MLPs which are stacked, so in the end we have around six layers of neurons. For the results of that experiment, if you are just considering the clean case, obviously keeping the entire information is the best possible case; if you reduce information you are going to lose some performance. But when you go to the noisy case, when you are adding different kinds of additive noise, if you keep the whole space all the noise is included and you take a huge hit in performance. When you reduce the space and concentrate only on the speech, we found that the speech-centric filter is the best. One thing to note here is that sub-optimal one is always worse than sub-optimal two. If you go back, sub-optimal one removes some of the speech regions over here, while sub-optimal two includes some of the noise, so it turns out that removing some of the speech regions is more hurtful than including a little bit of noise. And then we… >>: [indiscernible] MLP? >> Kailash Patil: Yes. 
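A rough sketch of the modulation-domain filtering described here, assuming the speech-centric filter can be approximated by a low-pass region in the 2-D FFT (rate/scale) domain; the cut-off values of 16 Hz and 2 cycles per octave are illustrative, not the actual filter shapes from the slides.

```python
import numpy as np

def modulation_filter(aud_spec, frame_rate=100.0, chans_per_octave=24,
                      max_rate_hz=16.0, max_scale_cpo=2.0):
    """Hypothetical 'speech-centric' filter: zero out everything in the
    2-D modulation (rate/scale) domain outside a low-pass region, then
    return to the time-frequency domain."""
    M = np.fft.fft2(aud_spec)
    rate = np.fft.fftfreq(aud_spec.shape[0], d=1.0 / frame_rate)         # Hz
    scale = np.fft.fftfreq(aud_spec.shape[1], d=1.0 / chans_per_octave)  # cyc/oct
    keep = (np.abs(rate)[:, None] <= max_rate_hz) & \
           (np.abs(scale)[None, :] <= max_scale_cpo)
    return np.real(np.fft.ifft2(M * keep))

spec = np.abs(np.random.randn(300, 128))   # time x frequency (toy input)
print(modulation_filter(spec).shape)       # (300, 128)
```

The filtered output stays 128-dimensional per frame, which is why delta features can then be appended without the dimensionality blowing up, as described in the experiment.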
So what we do is, in the first MLP we take these 512 features and map them to the phonemes, and in the second one we include context, I think a 10-frame context, so we are trying to include some… >>: Ten frames of posteriors? >> Kailash Patil: Yes. >>: [indiscernible] then predict? >> Kailash Patil: Then I can predict. >> Eyal Ofek: Can you go a couple of slides back? When you show the filter, okay, here. Technically what you do is you take the [indiscernible] spectrogram, [indiscernible] apply the filter, convert it back to a spectrogram and then filter the regular speech from that? >> Kailash Patil: Yeah. >> Eyal Ofek: So is it [indiscernible] a speech enhancement operator? Because from the spectrogram [indiscernible] and see kind of the noise missing. >> Kailash Patil: It's a little bit tricky to go back from there; this is the auditory spectrogram, and we tried to go back from there and actually listen to the sound, and it wasn't that obvious that it's doing speech enhancement at the signal level. >> Eyal Ofek: But you did [indiscernible] >> Kailash Patil: Yeah. >> Eyal Ofek: So [indiscernible] a little bit noisier than sub-optimal two? >> Kailash Patil: If you compare just the noisy reconstructions, we couldn't find an obvious difference in the quality of the sound. >>: Interesting. >> Kailash Patil: Yeah. >>: But there is no… >> Kailash Patil: The steps for going back from this auditory spectrogram to the signal involve a little bit of estimation; it's not straightforward. >> Eyal Ofek: [indiscernible] speech which is 500 milliseconds of frame and 10 milliseconds [indiscernible], right? >> Kailash Patil: Yeah. >>: But there's nothing in this approach that requires any of that other representation, I guess is my point. So it seems like you take a spectrogram, do a 2-D [indiscernible] >> Kailash Patil: Yeah. We are considering only those modulations that are important for speech. Effectively, the representations I was talking about sample points in this space, and if you do that you increase the feature dimensionality, so another way of thinking of the same thing is deciding where you want to sample and just extracting that region. >>: Oh, so based on this you are deciding which of those other things to pick? Based on your… >> Kailash Patil: No. We don't pick those features, because the speech backend cannot handle that much dimensionality. Instead of picking those points in this region, we pick the [indiscernible] region in this space. The motivation is kind of the same, but we are not using that feature set. >>: Okay. >> Kailash Patil: We then compared this to standard approaches where we apply MVA on MFCCs and PLPs, and we were able to outperform both of these standard techniques on clean speech and all the considered noise cases. And just a note that the feature dimensions don't vary that much between the baseline and our approach. >>: So what is [indiscernible] and what is MVA? >> Kailash Patil: MVA is the mean-variance-ARMA technique: it combines cepstral mean subtraction, variance normalization and a kind of ARMA filtering which is similar to RASTA. >>: [indiscernible] enhancement techniques? [indiscernible] >> Kailash Patil: No. We compared it with some other features which I'll talk about a little bit later, but I don't know what you have in mind. >>: [indiscernible] >> Kailash Patil: Yeah. 
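For reference, a sketch of one common formulation of MVA post-processing (cepstral mean subtraction, variance normalization, then ARMA smoothing across frames); the smoothing order and the exact ARMA recursion are assumptions and may differ from the baseline actually used.

```python
import numpy as np

def mva(cepstra, order=2):
    """MVA post-processing: mean subtraction, variance normalization,
    then a simple ARMA smoothing filter over frames (order is illustrative)."""
    x = cepstra - cepstra.mean(axis=0)            # cepstral mean subtraction
    x = x / (x.std(axis=0) + 1e-8)                # variance normalization
    y = np.copy(x)
    for t in range(order, len(x) - order):        # ARMA smoothing across frames
        y[t] = (y[t - order:t].sum(axis=0) +
                x[t:t + order + 1].sum(axis=0)) / (2 * order + 1)
    return y

mfcc = np.random.randn(500, 13)   # frames x coefficients (toy input)
print(mva(mfcc).shape)            # (500, 13)
```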
We compared it to that [indiscernible] >>: And there's one other difference. Go back. Yes. Your system has an input that has nine [indiscernible] features and the others [indiscernible] >>: How come [indiscernible] >>: I mean, people usually do the [indiscernible] dimensionality. >> Kailash Patil: So on average we see that speech-centric filtering outperforms this MVA technique by 12 to 14 percent. There is some evidence from studies of auditory processing that the processing actually happens in multiple streams: the brain divides the incoming information into slow temporal dynamics versus fast dynamics, processes them separately, and the evidence is combined later at a higher level. So we wanted to apply a similar divide-and-conquer strategy so that the machine learning stage can take maximum advantage of each of these streams. How we divide this space into different streams is based on three principles. The first is information encoding: each stream has to have enough information. The second is the complementary information principle: each stream should be unique compared to the other streams. And finally, noise robustness: each stream should contain only the high-SNR speech regions. The way we do that is, again in the same modulation domain, we divide the space into these three streams. Following the information encoding principle, each stream has at least 60 percent of the energy in the entire region. Each stream has some unique part compared to the other streams, and finally they are still localized on the speech region, which we hope gives noise robustness. An example of how these three filtered spectrograms look is shown here. The first one includes only the low temporal and spectral modulations, so it's smooth in time and frequency. The second one allows for higher frequency modulations, so you can see it's a little bit faster along frequency. And the third one allows for somewhat faster modulations along time, so it's not as smooth as the first one. Again, the backend is almost the same as in the first experiment: we have the same setup for each of the three streams and we combine the evidence using the product rule, so we are combining the posteriors at the very end. We test this not only on additive noise, but also on channel noise and artificial reverberation test conditions. In this case we compared it with another baseline, the ETSI advanced front end, which does voice activity detection, estimates the noise and does some sort of noise cancellation, as well as with ETSI. ETSI, if you look at it, is built for additive noise, so it does really well on additive noise but not as well on the other kinds of noise. When we compare our approach, we beat all of the systems, not only on clean speech but on all other noise conditions as well, by a significant margin. As I said, ETSI has the additional advantage of voice activity detection; we don't have anything like that and we are still able to beat it by around 4 to 19 percent depending on the noise condition. And if you analyze the different streams, we saw that stream one alone can beat the baseline on most of the noise conditions. Streams two and three then add onto this to give us the best advantage. 
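A minimal sketch of the product-rule combination of per-stream posteriors mentioned above; the number of streams and classes in the toy example are arbitrary.

```python
import numpy as np

def combine_streams(posteriors, eps=1e-12):
    """Product-rule combination of per-stream phoneme posteriors.
    `posteriors` is a list of (frames x classes) arrays, one per stream."""
    log_sum = sum(np.log(p + eps) for p in posteriors)      # product == sum of logs
    combined = np.exp(log_sum)
    return combined / combined.sum(axis=1, keepdims=True)   # renormalize per frame

# Toy usage: three streams, 10 frames, 40 phoneme classes.
streams = [np.random.dirichlet(np.ones(40), size=10) for _ in range(3)]
print(combine_streams(streams).shape)   # (10, 40)
```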
So now I'll talk about how we use this auditory object representation for timbre recognition. Typically, when people look at timbre, they try to identify the dimensions of timbre. Timbre is defined as anything that identifies the instrument apart from loudness and pitch. A lot of studies collect human ratings of how differently subjects perceive two instruments, and based on this perceptual distance they map the instruments into a two-dimensional or three-dimensional space. Then they try to find features which correlate with the dimensions of that space; that's how they go about finding the dimensions of timbre. Here we take the reverse approach: we define the dimensions of timbre first, and then we see how well they match the perceived distances. But first we'll do a quick analysis of this representation on a classification task. Here we keep the full representation, the full high-dimensional space, because to identify the timbre of an instrument it's really key to have all of the spectral and temporal variations, and we can't afford to lose the temporal fine structure. We keep the whole space and we test this on the RWC musical instrument database, which has 11 instruments with close to 2,000 notes per instrument. For each instrument there are three different manufacturers, multiple playing styles and all the possible pitch values on that instrument, so we are really testing whether our system can generalize. What we do is take this high-dimensional representation and reduce the dimensions to a manageable amount using a tensor SVD. For the backend we use an SVM classifier with RBF kernels; I'll talk about the last one later. We compare this representation with two others. The first is just spectral features, the auditory spectrogram itself; we also tested features like MFCCs, and they all give something in the range of 79 percent. The second feature set uses actual neuron transfer functions recorded in [indiscernible]; we borrowed 1,200 of these neuron transfer functions, applied them to the auditory spectrogram and used the resulting activity to classify the timbre, and we saw that it does quite well compared to the spectrogram. But the problem is that it doesn't have the full range of [indiscernible]: it's just on the order of a thousand neurons, while the brain has millions, so it's not able to accurately capture the entire range of possible neurons. If we use our model instead, you see that we are able to get a really good representation for these instruments, which shows that these instruments are really well separated in this space. Since the instruments are well separated in this space, we wanted to see if the model also captures the perceptual distances that human subjects report. We had human subjects rate how differently they perceive these instruments from the RWC database, so we have a matrix of how different these instruments are. 
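A sketch of the kind of tensor-SVD reduction plus RBF-kernel SVM backend described here, using a simple HOSVD-style per-mode truncation; the mode sizes, ranks and scikit-learn SVC backend are illustrative assumptions rather than the exact pipeline from the talk.

```python
import numpy as np
from sklearn.svm import SVC

def hosvd_bases(tensors, ranks):
    """Estimate one orthonormal basis per tensor mode (HOSVD-style) from a
    list of training tensors, truncated to the given ranks."""
    bases = []
    for mode in range(tensors[0].ndim):
        # Unfold every training tensor along this mode and stack the columns.
        unfolded = np.concatenate(
            [np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1) for t in tensors],
            axis=1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        bases.append(U[:, :ranks[mode]])
    return bases

def project(tensor, bases):
    """Project a tensor onto the reduced bases and flatten to a feature vector."""
    core = tensor
    for U in bases:
        core = np.tensordot(core, U, axes=([0], [0]))  # contract the next mode
    return core.ravel()

# Toy usage: 40 cortical tensors (freq x rate x scale), two classes.
rng = np.random.default_rng(0)
data = [rng.standard_normal((64, 8, 5)) for _ in range(40)]
labels = np.array([0] * 20 + [1] * 20)
bases = hosvd_bases(data, ranks=(10, 4, 3))
X = np.stack([project(t, bases) for t in data])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))
```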
We then take the distances in our model and correlate them with these human distances. With the baseline features they don't correlate that well with the human distances, but with our model the correlation is quite high. This shows that this representation is not only able to separate instruments, but is also able to preserve the distances between instruments as we perceive them. This was based on isolated tones of instruments, so we thought about how we could use this in a more general situation. If you go and buy a musical recording you might have something like this. [music] This is an artist playing an instrument, whereas our model was trained on something like this, a clean tone which has the full rise and decay of the note. [music] The note is allowed to decay completely and we capture all of the dynamics there. If you test our system on the continuous recording using just a uniform window, so you don't care what's actually there, the classification accuracy drops drastically from 98 to 70 percent. These recordings were collected from commercially available CDs, so there could have been differences in recording equipment and so on, but since our training notes are isolated notes, we think we can do better if we can extract the notes from the continuous recording. We tried existing techniques for onset detection, which tell you the boundaries between two notes, but they are not able to generalize well to this database because we have recordings from various different studios and different CDs; we think it's because most of those systems work on a signal-level representation. So we came up with this idea. If you look at one particular note, you can see that the harmonics are placed at different points along the frequency axis. If we come up with a template which mimics the placement of these harmonics along the frequency axis, and we see how well the template for each pitch matches the spectrogram, we get a pitch estimate, and the regions where the pitch estimate is changing are possible times for onsets. Also, the degree to which the template matches we call harmonicity, and we typically see that at the boundary between two notes the harmonicity is low because of the overlapping harmonics, so we also mark those candidates where the harmonicity is low as possible onset boundaries. We then combine the evidence from both of these streams to get reliable estimates for the note boundaries. We then segment the continuous recording along these boundaries. If we do the classification on these extracted notes, we get a performance improvement of close to 9 percent. But to take care of the differences in recordings and other changes in the statistical distribution, we can also adapt the backend. We take a few examples of these extracted notes and adapt the SVM boundaries to the new distribution, and if we do that we can further improve the performance by around 6 percent. This obviously has the advantage of having seen a few examples, whereas the note extraction does not have that advantage. 
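A simplified sketch of the pitch-template and harmonicity idea for onset candidates: harmonic templates are matched against each spectrogram frame, and frames where the pitch estimate changes or the harmonicity dips are flagged. The Gaussian template shape, number of harmonics and threshold are assumptions for illustration only.

```python
import numpy as np

def harmonic_template(f0, freqs, width=0.03):
    """Gaussian bumps at the first few harmonics of f0 on a linear
    frequency axis; a simplified version of the pitch templates described."""
    template = np.zeros_like(freqs)
    for k in range(1, 9):
        template += np.exp(-0.5 * ((freqs - k * f0) / (width * k * f0)) ** 2)
    return template / np.linalg.norm(template)

def pitch_and_harmonicity(spec, freqs, candidates):
    """Per frame, pick the best-matching pitch template (pitch track)
    and keep the match strength (harmonicity)."""
    templates = np.stack([harmonic_template(f0, freqs) for f0 in candidates])
    frames = spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-8)
    match = frames @ templates.T                     # frames x pitch candidates
    return candidates[match.argmax(axis=1)], match.max(axis=1)

def onset_candidates(pitch, harmonicity, harm_thresh=0.3):
    """Frames where the pitch estimate changes or harmonicity is low are
    candidate note boundaries; the talk fuses both kinds of evidence."""
    pitch_change = np.flatnonzero(np.abs(np.diff(pitch)) > 0)
    low_harm = np.flatnonzero(harmonicity < harm_thresh)
    return np.union1d(pitch_change, low_harm)

# Toy usage on a random "spectrogram" (frames x frequency bins).
freqs = np.linspace(50, 8000, 256)
spec = np.abs(np.random.randn(200, 256))
pitch, harm = pitch_and_harmonicity(spec, freqs, np.arange(80, 800, 10.0))
print(onset_candidates(pitch, harm)[:10])
```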
A few observations: we were able to not only represent musical timbre in this space, but also to capture the distances between the instruments well. When we did a study of marginal representations, asking whether we need to keep the whole space or can marginalize it along different dimensions, we saw that the whole space is indeed required, and we also saw how it can be modified to be robust to out-of-domain conditions. Now I would like to talk about the concept of attention in this space. When I say attention, there are usually two kinds. One is bottom-up attention, where you are not doing anything voluntarily, but, say, there is a loud bang somewhere in the background and you immediately divert your attention to it. We are going to talk about top-down attention, where I give you the task to pay attention to some particular sound or some particular object; without the task you won't be able to recognize it, but once you are given the task you will be able to. That's the top-down attention we are going to talk about. For example, in the visual domain, if this is the visual scene and I ask you to pay attention to this painting on the wall, you do something like restricting a field of view, where you restrict your faculties to that particular object. And you are also able to move this field of view to a different object, for example if I ask you to pay attention to the flowers. So how does this work in the auditory domain? You may not be able to tell right away that there are multiple objects here, so I'll just play this sound to you. [audio begins] [audio ends] As you noticed, there are multiple speakers in this recording, and if I now ask you to repeat what the main speaker said, I'm pretty sure nobody is able to do that. But now I give you the task to pay attention only to the male speaker, and let's try that again. [audio begins] [audio ends] You were able to pay attention to the male speaker and you were able to say it's three, eight, five, five and so on. This clearly shows that something is happening where you can focus your view from one object to another in the auditory domain, similar to what happens in the visual domain. We wanted to implement this in our model. Studies of what actually happens in the brain show, for example, the transfer function of one particular neuron in a ferret: when the ferret is asked to pay attention to some particular event, and while it is doing that task, the transfer function changes. This clearly shows that the representation in the auditory cortex is not static, whereas our representation so far was static, so we want to implement some kind of adaptive technique to take care of this. The motivation is that, as we saw, two different objects occupy two different regions of this space. Say we had animals and music and they occupy different regions; if we have to pay attention to music, and we have some prior knowledge that music lives in that location, we can focus on or highlight that region of the space, because the objects are separated. 
That's the hypothesis, and the way we implement it is: we collect some prior data for that instrument, or that object, and we have the mean representation of the activations for all the filters. If the task is to pay attention to animals, we use that activation pattern as a boosting, or you can call it a gain adaptation, for these filters: we boost the filters which are most active for the target class and suppress the filters which are not so active. There is also evidence that the highest stages of processing undergo changes during top-down attention, and we can think of that as our statistical model adapting to the new task. We do a MAP adaptation in this case: given a few examples, we adapt the target distribution to the new conditions. The way we tested this is with the BBC sound effects database. This is quite a big database with many classes and four to six hours of data, and during testing we actually mix the target class with other classes, so there are multiple sources occurring at the same time; we tell our model to pay attention only to the target and we see how well it does at this recognition. In this case we were able to reduce the feature dimension to 113, compared to the music task where we had to keep more. The backend, as I said, is a GMM classifier, and we use MAP adaptation to adapt the models. The performance here is measured in d-prime, which takes into account not only the true positives but also the false alarms: we are not just looking at how many times the target is correctly identified, but also how many times it is wrongly identified, so it's a more robust measure. We compared it with an MFCC baseline where we average the statistics over the entire duration of the recording: we take the first, second and third moments of the MFCC features and use that as the feature vector. We also recently compared with more standard acoustic features from the literature and found that they are all comparable to this range of performance. With our model without any attention, no boosting, no adaptation of the GMMs, we see that we already get a big gain in performance. This suggests that our model already has these objects well separated. When we apply boosting, that is when we apply a gain on the filters, we see the performance improves, but not by much; this suggests there is some change in the statistics that the GMM backend is not able to handle. >>: So boosting is not the machine learning version of this or how I mean it's not the same like we learn… >> Kailash Patil: No. This is gain boosting. And when we do just the adaptation and no boosting, we see a higher jump in performance; this may be because the GMM adaptation requires some sample examples from the test conditions, so that might be why it does better. Finally, when we combine both techniques we get the best performance. This shows that when we boost the filters, we need to adapt the GMMs as well, so we get the best performance when we do both. >>: [indiscernible] unsupervised in that application or supervised? >> Kailash Patil: It's supervised. >>: Supervised. You already have some samples? >> Kailash Patil: Some samples. 
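Two small sketches related to this part: the standard d-prime measure computed from hit and false-alarm rates, and a hypothetical gain-boosting function that maps mean filter activations for the target class into a gain range (the 0.5 to 1 range echoes the talk, but the exact mapping is an assumption).

```python
import numpy as np
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate, eps=1e-4):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate); rates are
    clipped away from 0 and 1 so the inverse normal stays finite."""
    h = np.clip(hit_rate, eps, 1 - eps)
    fa = np.clip(false_alarm_rate, eps, 1 - eps)
    return norm.ppf(h) - norm.ppf(fa)

def attention_gains(mean_activation, lo=0.5, hi=1.0):
    """Map each filter's mean activation for the target class to a gain in
    [lo, hi]: filters most active for the target are boosted relative to
    the rest, which are suppressed."""
    a = np.asarray(mean_activation, dtype=float)
    a = (a - a.min()) / (np.ptp(a) + 1e-8)
    return lo + (hi - lo) * a

print(d_prime(0.9, 0.1))                       # about 2.56
print(attention_gains([0.2, 1.5, 0.7, 0.1]))   # per-filter gains in [0.5, 1]
```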
>>: How did you [indiscernible] like exactly or [indiscernible] >> Kailash Patil: Let's say we have many filters. Given a target class, we collect from the training data the average representation, the mean activity, for these filters, and we normalize it to be in a range of, let's say, 1 to 0.5; we can control this parameter, so these activations are now in some range. We use that as the gain parameters for the filters we want to pay attention with. Say filter one was active for speech and filter two was not active for speech but was active for machines. When we apply this and we are asked to pay attention to speech, filter one will still be on, it will still be active, versus filter two. >>: Is there any like normalizing the responses [indiscernible] average response to that? >> Kailash Patil: Yeah. So we saw that this cortical representation gave us around 45 percent improvement over the baseline, and the representation was also able to facilitate an attentional mechanism where we could selectively attend to the targets. We showed that these attentional mechanisms provided an 81 percent relative gain, which is a huge improvement. And more recently, we were not only able to change the gain of the filters, but also to adapt the shape of the filters depending on the target. We showed that this is more useful in adverse conditions when the target-to-background ratio is really low. >> Eyal Ofek: [indiscernible] filters? >> Kailash Patil: Yeah. >> Eyal Ofek: So you are thinking about making a different set of [indiscernible] >> Kailash Patil: Let's say we have a model like this. Given a target, we change the orientation of these filters a little bit, in a manner such that they are most active during that target class, so if we change… >> Eyal Ofek: [indiscernible] more suitable alphabet. >> Kailash Patil: Yeah. >>: So when you design those filters [indiscernible] approach or [indiscernible] >> Kailash Patil: You can think of it like a 2-D wavelet transform, so we have the basis functions and we just… >>: [indiscernible] >> Kailash Patil: So in conclusion, we presented a multidimensional, possibly overcomplete space which can represent auditory objects, motivated by processing in the auditory system. This space is sensitive to frequency, temporal profiles and spectral profiles, and we showed how it can be a basis for many different sound technologies like speech recognition, timbre classification and scene classification, how it facilitates the creation of attentional mechanisms, and how we think such a mechanism can be a basis for tasks like event detection or keyword spotting and so on. These are some of the publications which resulted from this work. Do you have any questions? [applause] >>: [indiscernible] >>: [indiscernible] so I'm just trying to compare the differences. So the MFCC or LPCC systems were GMM systems? >> Kailash Patil: For the speech? >>: Yeah. So if you've got [indiscernible] speech recognition for the initial [indiscernible] so those first two columns were GMMs? >> Kailash Patil: No. The backend is still the same hybrid [indiscernible] >>: Okay. So [indiscernible] and that one is just [indiscernible] and that one is just minimum [indiscernible] >> Kailash Patil: No. The backend is the same for MFCCs and for our system. >>: Right. 
But the speech [indiscernible] MLP [indiscernible] is the two here, the two-stage [indiscernible] >> Kailash Patil: Yeah. We still have that for MFCC as well. So it's exactly the same; only the input features change. >>: And the other question is, it looked like on the other tasks you used this tensor as [indiscernible] but you didn't use it here? >> Kailash Patil: No, because here the feature dimensions were manageable. The feature dimensionality was fine, so we have an MLP with an input layer of [indiscernible] >>: That's because you did this manual extraction [indiscernible] so you already tried taking the thing in and [indiscernible] >> Kailash Patil: Yeah. That was the very first main approach. >>: I didn't know if it was there or not. >> Kailash Patil: Yeah. I think in that case what happens is the features are redundant in some way and we are not able to capture this, so we lose some sort of temporal structure. Say we have filters which are quite large and we are doing the convolution in time; in the output we lose the temporal structure, so we get a smooth response and we are not able to predict the phoneme sequence from that. Yes? >>: One of your early slides, your motivation slide, showed that when you listen to an orchestra there are three musicians playing and you can recognize the instruments. Can a computer do that? Can we do this automatically? Are there programs out there? Has there been research published that has successfully taken an orchestra with multiple instruments and separated the sound of each instrument? >> Kailash Patil: There are some studies. I am not aware of studies that do exactly that, identifying the instruments, but in this multisource environment they do different tasks like identifying the genre of the music or identifying different characteristics of the music, but maybe not exactly finding out which instruments are playing. >>: The work you showed on attention, couldn't that just be applied to this problem, so you could separate the male from the female speaker or [indiscernible] sources? Could you just apply that system to that problem? >> Kailash Patil: The way we formulated it is, given a target, you have to pay attention to, say, piano, and we were able to tell to a better extent whether the piano was present or not. So we would have to apply this to the whole set of instruments, and in that case we could tell which instruments are present; we would have to run the system on different targets and see which of these targets are present. >>: Okay. I think that would be a very interesting application to see how far you could go. >> Eyal Ofek: And once you start that, to talk about other applications, this is pretty much a new feature of what you could extract from the audio signal. How applicable is this for actually getting the nonverbal cues from human speech: emotion, speaker identification, even gender, age? Do you see an application of this research to such a problem? >> Kailash Patil: I think definitely I can see some advantages of using such a representation, because in timbre we were after capturing anything apart from pitch and volume, so we are trying to capture all of the characteristics of that instrument. 
For example, with the musical instruments I tried doing broad class recognition, broad instrument recognition, so we could recognize whether the instrument was a stringed instrument or a wind instrument or a percussion instrument, using similar features. That shows that it's not only capturing one characteristic; we were able to generalize it to capture different characteristics of the sound. So the way you are saying, if you wanted to capture… >>: Let me rephrase the question a little bit. My understanding is you can separate or detect the presence of whether there is a cello or whatever. How would your recognition do if there were maybe three cellos and I asked you which of the three was playing when? What do you think your recognition rate would be on which one of the three cellos it is? >> Kailash Patil: I think if you are given three different cellos which have slightly different characteristics, we should be able to capture the minute differences between the instruments as well. >>: You think with the timbre it's again like 95 percent recognition which one it is? >> Kailash Patil: I think if you have different manufacturers -- so you are talking about different kinds of instruments, right? For a cello you might have different… >>: I don't know much about cellos. I know there is a really expensive one. >>: [indiscernible] >> Kailash Patil: I think another good example would be that given a cello you can play it in different styles, so you can have the vibrato style, and musicians use that to kind of… >>: The same musician maybe. >> Kailash Patil: Yeah. The same musician, the same instrument, but he's trying to express his emotions through a different style. >>: Do you really think you could get the same kind of recognition rate separating the three cellos as separating the cello from a flute? >> Kailash Patil: It might not be in the 99 percent range, but I think if there are unique characteristics, we can do a really good job of capturing them. >>: That would be impressive. So I guess the obvious other one is, now you have three similar 30-year-old males speaking; do you think you could recognize which one is which? >> Kailash Patil: Yeah. I think we can. It's like the timbre of human speech. >>: Yeah. Okay. That's optimistic, because the timbre between a male and a female is quite a bit different than between [indiscernible] so I disagree. >>: [indiscernible] >> Kailash Patil: We actually did speaker identification experiments using features similar to this and we were able to beat most of the standard approaches. >>: But did you do it under the cocktail party condition where there were two people speaking and you said, out of the ten people in the database, tell me the two people who are speaking on top of each other? >> Kailash Patil: We had background noise and so on. >>: [indiscernible] adaptation the noise is very different if it's a low rumble or music or something. It's a very attractive approach, both the idea of filtering for the signal and the [indiscernible] thing, but the fascinating stuff is how far you can push it. The idea is elegant, but I don't have any sense that you really know how widely it applies [indiscernible] I think under more challenging conditions. >> Eyal Ofek: Even if you can distinguish [indiscernible] speech, no speech. 
Because in most of the cases [indiscernible] is talking to the [indiscernible] speech or no speech. Or even if you can tell that this voice belongs to speaker one through eight, or none of those. Also consider your [indiscernible] process of getting [indiscernible] systems. More questions? [applause] >> Kailash Patil: Thank you.