
>> Eyal Ofek: Today we have Kailash Patil who is at the end of his PhD studies at Johns Hopkins
University in Baltimore, and he is going to talk about auditory object recognition. Without
further ado, Kailash, you have the floor.
>> Kailash Patil: Do I need to be standing here?
>>: You can walk around.
>> Eyal Ofek: You can walk as far as the wireless microphone.
>> Kailash Patil: Okay. Thank you all for coming today. Today I'll be talking about auditory
object recognition. This is a brief outline of what I'll be going through. I will give you a brief
idea of what auditory objects are, what the different kinds of representations are for these
auditory objects and after that I'll talk about some statistical analysis where I did some
experiments to evaluate these auditory object representations. I'd also like to talk about the
concept of attention in terms of auditory objects. So what are auditory objects? The Oxford English Dictionary defines an object, in general, as something that can be placed before or presented to the eyes or other senses. You can see from this that the identity of an object is interlinked with the perception of that object; there is no unique identity of an object by itself. So you can see why it could be difficult to represent different auditory objects. Moreover, consider this image: an object can be, say, one particular instrument here, or this whole section of violins, or you can consider the whole image as one visual object, and the same goes for auditory objects. This leads to difficulties in representing different objects. There is also another added layer of variability. For visual objects you can have different orientations of the objects, different manufacturers of these instruments, different lighting conditions and so on. Similarly, for auditory objects, two different instruments can produce different sounds, whereas the same instrument can produce different kinds of sounds depending on the playing style, for example vibrato versus a plain mode of playing.
There can be a lot of variation in auditory objects as well. The main difficulty comes from the temporal aspect of auditory objects. Consider this scene where there are three different acoustic objects playing at the same time: all three objects overlap, and that mixture is what enters your ear. The temporal overlap of these objects can be a difficult thing, yet we are able to easily perceive the different instruments and the different objects in this acoustic scene. Now I would like to move on to the kinds of auditory object representations. I'll give you a brief overview of what is out there and what we are proposing here.
One simple way of representing auditory objects is the waveform itself; that's the topmost row here. People have tried to extract different features from the waveform, such as zero crossings or energy and amplitude. Another way of representing these objects is at the spectrum level, which is basically the energy present at different frequencies for that object. Again, people have used different features at this level, like spectral flux or spectral slope. Finally, as a hybrid between these two, there is the bottom representation here, which is a time-frequency representation, and this is the more standard technique used for the majority of auditory object analysis. People generally consider a time slice of the spectrogram, derive different features from each time slice, and then shift this window over time, which gives representations like MFCCs or LPCCs. What you can notice is that these don't capture all the variations and nuances in the spectrogram; they only capture whatever occurs in that small time window, so they may not be
extremely effective at capturing the characteristics of an auditory object. Also,
neurophysiological studies have shown that in the auditory cortex neurons are sensitive to
various kinds of features: not just frequency and time, but different kinds of patterns in the spectrogram, and there can be sensitivity to pitch, to harmonicity and so on. This suggests that we should look at something that considers more than just a short time window. In the auditory pathway there are millions of connections; all the processing up to just below the auditory cortex is what we call the subcortical processes, and the processing in the auditory cortex is what we call the cortical processes. The subcortical processes are well studied and well documented, so there are models for the different stages up to that point. In this talk we will mainly focus on how we can
model and derive inspiration from the processes in the auditory cortex. Just to go over the
brief steps of modeling the subcortical processes: the incoming waveform is analyzed by a bank of asymmetric filters that are uniformly placed on a logarithmic frequency axis. This analyzes the energy present at different frequencies. Then, to sharpen these filters, we apply lateral inhibition; you can think of this as a [indiscernible] frequency, and it effectively sharpens the bandwidth of the filters. Finally, there is a smoothing operation which mimics the processing in the midbrain, where there is a loss of phase locking. So we end up with something that is similar to a spectrogram; we call this the auditory spectrogram.
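As a rough illustration of this subcortical front end, here is a minimal sketch, assuming a generic bank of log-spaced bandpass filters rather than the specific cochlear model used in the talk; the channel count, frequency range and frame size are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrogram(x, fs, n_chan=128, fmin=100.0, fmax=4000.0, frame_ms=8.0):
    """Sketch of a subcortical front end: log-spaced bandpass filter bank,
    half-wave rectification, lateral inhibition across channels, then
    smoothing/downsampling to mimic the midbrain loss of phase locking."""
    cfs = np.geomspace(fmin, fmax, n_chan)                 # log-spaced center frequencies
    env = np.zeros((n_chan, len(x)))
    for i, cf in enumerate(cfs):
        lo, hi = cf / 2 ** (1 / 6), min(cf * 2 ** (1 / 6), 0.45 * fs)
        b, a = butter(2, [lo, hi], btype="band", fs=fs)    # stand-in for the cochlear filters
        env[i] = np.maximum(lfilter(b, a, x), 0.0)         # half-wave rectification
    env = np.maximum(np.diff(env, axis=0, prepend=env[:1]), 0.0)  # lateral inhibition
    hop = max(1, int(fs * frame_ms / 1000.0))
    n_frames = env.shape[1] // hop
    return env[:, :n_frames * hop].reshape(n_chan, n_frames, hop).mean(axis=2)

# usage: a 440 Hz tone should light up the channels near 440 Hz
fs = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
spec = auditory_spectrogram(tone, fs)
```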
Now we are going to talk about what we can do to model the cortical processes. We came up with a filter bank that looks like this. What it does is analyze the spectrogram for various different patterns. The patterns look like this: the filters look for energy rises, energy falls and so on, and we do that at different scales, or what we call different modulations. For example, this one is very slow in time, whereas this one is extremely fast in time, so these capture the different modulations in time, and similarly you can see the same thing along frequency. Once we have this bank of filters, it effectively maps this time-frequency representation into a high-dimensional space, where the different axes are the spectral profile, the temporal profile, time and frequency, and you can even think of adding other feature dimensions like pitch. As one quick motivation for why this could be a useful representation, we collected many examples of cello and flute notes and averaged the representations. This is the average representation for these two instruments in that high-dimensional, or cortical, space, and you can see that most of the energy for the cello lies in a different region than for the flute. This clearly shows that if you move to this high-dimensional space, two auditory objects which overlap in time are now really well separated, so you can think of doing multiple things in this domain. Now I'll talk about how we analyze the effectiveness of such a representation for different tasks. The first task I'm considering here is speech recognition; later I'll talk about timbre recognition and so on.
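The cortical stage is described as a generic 2-D filter bank (later in the Q&A it is likened to a 2-D wavelet transform). A minimal sketch using 2-D Gabor-style spectro-temporal filters, which is one common way to build such a bank, follows; the rates, scales, kernel size and frame rate are illustrative, not the roughly 220 filters used in the talk.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(rate_hz, scale_cpo, frame_rate, chan_per_oct, size=(31, 31)):
    """One 2-D Gabor-like spectro-temporal filter: `rate_hz` is the temporal
    modulation in Hz, `scale_cpo` the spectral modulation in cycles/octave."""
    f = np.arange(size[0]) - size[0] // 2          # frequency-channel axis
    t = np.arange(size[1]) - size[1] // 2          # time-frame axis
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-0.5 * ((F / (size[0] / 6)) ** 2 + (T / (size[1] / 6)) ** 2))
    phase = 2 * np.pi * (scale_cpo * F / chan_per_oct + rate_hz * T / frame_rate)
    k = envelope * np.cos(phase)
    return k - k.mean()                            # zero mean: flat regions give no response

def cortical_features(aud_spec, frame_rate=125.0, chan_per_oct=24,
                      rates=(2, 4, 8, 16, 32), scales=(0.5, 1, 2, 4)):
    """Filter a (channels x frames) auditory spectrogram with every (rate, scale)
    pair; the output has shape (n_rates, n_scales, n_channels, n_frames)."""
    out = np.empty((len(rates), len(scales)) + aud_spec.shape)
    for i, r in enumerate(rates):
        for j, s in enumerate(scales):
            k = gabor_kernel(r, s, frame_rate, chan_per_oct)
            out[i, j] = np.abs(fftconvolve(aud_spec, k, mode="same"))
    return out
```

Each filter's output measures how strongly its particular spectro-temporal pattern is present at each time-frequency point, which is the "match" discussed in the questions that follow.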
>>: So these object representations, this is the same as the [indiscernible] all the people that
have been doing [indiscernible] same [indiscernible]
>> Kailash Patil: Yeah, it's representative.
>>: [indiscernible]
>> Kailash Patil: No. When we move to the cortical stage, that's where it differs from their model: they have filters which are modeled on actual neural responses, whereas mine is just a general 2-D [indiscernible] filter bank. So it's similar…
>> Eyal Ofek: [indiscernible] this is quite typical. This is the [indiscernible]
>> Kailash Patil: This is the [indiscernible]
>> Eyal Ofek: Okay. And the next slide. So this is all something [indiscernible]
>> Kailash Patil: Yeah.
>> Eyal Ofek: So how many features can you extract from those filters? How many [indiscernible] filters?
>> Kailash Patil: If you want to think about it that way, we have about 200 of these filters, or 220 filters, and since they are applied across multiple frequency channels, the representation ends up in a really high-dimensional space because of that.
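(For a rough sense of scale, assuming each of the roughly 220 modulation filters is applied to every channel of a 128-channel auditory spectrogram, a single time frame already carries about 220 × 128 ≈ 28,000 values, which is consistent with the feature count mentioned later in the talk.)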
>> Eyal Ofek: So pretty much you applied that filter on the spectrogram and the output is doing
something like this in that particular time frequency space?
>> Kailash Patil: Yes. The output is actually going to show how much of a match there is between this kind of pattern and the spectrogram.
>> Eyal Ofek: So you have four dimensions, time, frequency, spectral and temporal?
>> Kailash Patil: Yeah.
>>: So I'm still a little confused. I thought you had like this one of those successful…
>> Kailash Patil: So the difference is in the characterization of these filters, so they use the
filters which are actually modeling like the neural responses, whereas mine is generic.
>>: Oh, so the shape.
>> Kailash Patil: Filter shapes, yeah.
>>: Whether it's a specific model or just a filter?
>> Kailash Patil: Yeah.
>>: But the motivation is the same?
>> Kailash Patil: Yeah.
>>: So I mean in both cases you end up with a representation like that, 3-D or [indiscernible]
okay.
>> Kailash Patil: One note on this: when we do some kind of statistical analysis, a high-dimensional representation has its own pros and cons. The machine learning, or backend, stage cannot effectively handle a very high-dimensional space; we have around 28,000 features at the end, and it cannot effectively handle that many, so we have to do something depending on the task. For speech we notice that preserving the temporal information is extremely crucial, because to recognize speech you need the sequence of phones, which are on the order of 100 milliseconds, so these temporal characteristics are extremely important. Again, the machine learning setup cannot handle a high number of feature dimensions, so instead of modeling these neurons as a filter bank, we are going to model whole groups of neurons. To give you the motivation, look at the spectrogram here: we take a cross-section of the spectrogram along time, one second in duration, and you can see that there are five distinct peaks in the overall temporal profile; similarly, in this spectral profile you can count about four peaks. If you take a two-dimensional FFT of this spectrogram, all the energy is going to be localized because of these characteristics: the main peak is around 5 Hz in the temporal direction, and in the spectral direction it is below what we call one cycle per octave. So this motivates that, instead of having multiple filters in this 2D FFT domain, we can just capture the region which contains only the speech information.
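A minimal sketch of this kind of modulation-domain filtering follows, assuming illustrative cutoffs (temporal modulations below about 16 Hz around the dominant ~5 Hz peak, spectral modulations below about 2 cycles per octave) rather than the exact speech-centric filter shapes used in the experiments.

```python
import numpy as np

def speech_centric_filter(spec, frame_rate=100.0, chan_per_oct=24,
                          max_rate_hz=16.0, max_scale_cpo=2.0):
    """Keep only low temporal (rate) and spectral (scale) modulations of a
    (channels x frames) spectrogram by masking its 2-D FFT and inverting."""
    n_chan, n_frames = spec.shape
    scale = np.fft.fftfreq(n_chan, d=1.0 / chan_per_oct)   # cycles per octave
    rate = np.fft.fftfreq(n_frames, d=1.0 / frame_rate)    # Hz
    mask = (np.abs(scale)[:, None] <= max_scale_cpo) & (np.abs(rate)[None, :] <= max_rate_hz)
    return np.real(np.fft.ifft2(np.fft.fft2(spec) * mask))
```

The filtered spectrogram stays at its original dimensionality, which is why the feature count remains manageable for the backend described next.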
To find which regions were the most effective for speech recognition, we tried three different filters, which we call sub-optimal one, speech-centric and sub-optimal two. The actual filter shapes are given here on the temporal axis and the frequency axis; the dotted line is the shape of the filter. What we are testing is how much of this region we need to capture, and the results of these three filters will tell us that. To test this, I'll give you the details of the experiment. The spectrogram is 128-dimensional, and after the modulation filtering you still have a 128-dimensional representation; we then append delta features, so we end up with a 512-dimensional feature representation. We tested it in a phoneme recognition experiment on [indiscernible], and this is a mismatched condition: we train on clean speech and test on additive noise, so only the test set is corrupted with additive noise from the [indiscernible] database. For the backend we have a hybrid [indiscernible]-MLP setup, and the MLPs are staggered, so we have two MLPs stacked, which means in the end we have around six layers of neurons.
The results of that experiment: if you just consider the clean case, obviously keeping the entire information is the best possible case, so if you try to reduce the information you lose some performance. But when you go to the noisy case, when you are adding different kinds of additive noise, if you keep the whole space all the noise is included and you take a huge hit in performance. When you use this space and concentrate only on the speech, we found that the speech-centric filter is the best. One thing to note is that sub-optimal one is always worse than sub-optimal two: if you go back, sub-optimal one removes some of the speech regions here, while sub-optimal two includes some of the noise, so it turns out that removing some of the speech regions hurts more than including a little bit of noise. And then we…
>>: [indiscernible] MLP?
>> Kailash Patil: Yes. In the first MLP we take these 512 features and map them to the phonemes, and in the second one we include context, I think a 10-frame context, so we are trying to include some…
>>: Ten frames of posterior?
>> Kailash Patil: Yes.
>>: [indiscernible] then predict?
>> Kailash Patil: Then I can predict.
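To make the two-stage setup concrete, here is a minimal sketch of a hierarchical MLP in which a second network sees a window of phoneme posteriors produced by the first; scikit-learn's MLPClassifier stands in for the actual networks, and the layer sizes, context width and iteration counts are purely illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def context_window(frames, half=5):
    """Stack each frame with +/- `half` neighbouring frames (edges are padded)."""
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * half + 1)])

def train_stacked_mlps(X, y, half=5):
    """First MLP: frame features -> phoneme posteriors.  Second MLP: a window
    of those posteriors around each frame -> refined phoneme predictions."""
    mlp1 = MLPClassifier(hidden_layer_sizes=(500,), max_iter=200).fit(X, y)
    post = mlp1.predict_proba(X)                       # frame-level posteriors
    mlp2 = MLPClassifier(hidden_layer_sizes=(500,), max_iter=200).fit(
        context_window(post, half), y)
    return mlp1, mlp2

def predict_stacked(mlp1, mlp2, X, half=5):
    return mlp2.predict(context_window(mlp1.predict_proba(X), half))
```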
>> Eyal Ofek: Can you go a couple of slides back? When you show the filter, okay, here.
Technically what you do is you get the [indiscernible] spectrogram [indiscernible] apply the
filter, convert it back to spectrogram and then filter the regular speech from that?
>> Kailash Patil: Yeah.
>> Eyal Ofek: So [indiscernible] speech enhancement operator? Because from the spectrogram
[indiscernible] and see kind of the noise missing.
>> Kailash Patil: It's a little bit tricky to go back from this auditory spectrogram to the signal. We did try to go back and actually listen to the sound, and it wasn't that obvious that it's doing speech enhancement at the signal level.
>> Eyal Ofek: But you did [indiscernible]
>> Kailash Patil: Yeah.
>> Eyal Ofek: So [indiscernible] a little bit noisier than sub-optimal two?
>> Kailash Patil: If you compare just the noisy reconstruction, which was this one, we couldn't
find that much obvious difference in the quality of the sound.
>>: Interesting.
>> Kailash Patil: Yeah.
>>: But there is no…
>> Kailash Patil: Going back from this auditory spectrogram to the signal involves a bit of estimation; it's not straightforward.
>> Eyal Ofek: [indiscernible] speech which is 500 milliseconds of frame and 10 milliseconds
[indiscernible], right?
>> Kailash Patil: Yeah.
>>: But there's nothing in this approach that requires any of that other representation, I guess
is my point. So it seems like you take a spectrogram, do a 2-D [indiscernible]
>> Kailash Patil: Yeah. We are considering only those modulations that are important for speech. Effectively, the representations I was talking about sample points in this space, and if you do that you increase the feature dimensionality, so another way of thinking about the same thing is to decide where you want to sample and just extract that region.
>>: Oh, so based on this you are deciding which of those other things to pick? Based on your…
>> Kailash Patil: No. We don't pick those features, because the speech backend cannot handle that much dimensionality. So instead of picking those points, we pick the [indiscernible] region in this space. The motivation is kind of the same, but we are not using that feature set.
>>: Okay.
>> Kailash Patil: So we then compared this to standard approaches where we apply MVA on MFCCs and PLPs, and we were able to outperform both of these standard techniques on clean speech and on all the considered noise cases. Just a note: the feature dimensions don't vary that much between the baselines and our approach.
>>: So what is [indiscernible] and what is MVA?
>> Kailash Patil: MVA is the mean, variance and ARMA technique: it combines cepstral mean subtraction, variance normalization and some kind of ARMA filtering, which is similar to RASTA.
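For reference, here is a minimal sketch of that kind of MVA post-processing on a matrix of cepstral features; the smoothing below uses one common formulation of the ARMA step, and the filter order is illustrative.

```python
import numpy as np

def mva(cepstra, order=2):
    """MVA post-processing of a (frames x coefficients) cepstral matrix:
    mean subtraction, variance normalization, then ARMA smoothing over time."""
    z = (cepstra - cepstra.mean(axis=0)) / (cepstra.std(axis=0) + 1e-8)
    out = z.copy()
    for t in range(order, len(z) - order):
        # y[t] averages the previous `order` outputs and the next `order`+1 inputs
        out[t] = (out[t - order:t].sum(axis=0) + z[t:t + order + 1].sum(axis=0)) / (2 * order + 1)
    return out
```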
>>: [indiscernible] enhancement techniques? [indiscernible]
>> Kailash Patil: No. We compared it with some other features which I'll talk about a little bit
later, but I don't know what you have in mind.
>>: [indiscernible]
>> Kailash Patil: Yeah. We compared it to that [indiscernible]
>>: And there's one other difference. Go back. Yes. Your system has an input that has nine
[indiscernible] features and the others [indiscernible]
>>: How come [indiscernible]
>>: I mean, people usually do the [indiscernible] dimensionality.
>> Kailash Patil: So on average we see that the speech-centric filtering outperforms this MVA technique by 12 to 14 percent. There is some evidence in studies of auditory processing that the processing actually happens in multiple streams: the brain divides the incoming information into slow temporal dynamics versus fast dynamics, processes them separately, and the evidence is combined later at a higher level. So we wanted to use a similar divide-and-conquer strategy so that the machine learning stage can take maximum advantage of each of these streams. How we divide this space into different streams is based on three principles. The first is information encoding: each stream has to carry enough information. The second is the complementary information principle: each stream should be unique compared to the other streams. And finally, noise robustness: each stream should be concentrated on the speech regions. The way we do that is, again, in the same modulation domain, which we divide into these three streams. Following the information encoding principle, each stream contains at least 60 percent of the energy of this entire region; each stream has some unique part compared to the other streams; and they are still localized on the speech region, so we hope that this gives noise robustness.
An example of how these three filtered spectrograms would look is this: the first one includes only the low temporal and spectral modulations, so it is smooth in time and frequency; the second one allows for higher frequency modulations, so you can see it varies a bit faster along frequency; and the third one looks for somewhat faster changes along time, so it is not as smooth as the first one. Again, the backend is almost the same as in the first experiment, so we have the same setup for each of the three streams, and we combine the evidence using the product rule, so we are combining the posteriors at the very end.
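A minimal sketch of that product-rule combination of per-stream posteriors (equivalently, summing log posteriors and renormalizing each frame); the stream posteriors are assumed to come from each stream's own classifier.

```python
import numpy as np

def combine_product_rule(stream_posteriors):
    """Combine a list of (frames x classes) posterior matrices, one per stream,
    by multiplying them frame-wise and renormalizing each frame."""
    logp = sum(np.log(p + 1e-12) for p in stream_posteriors)
    logp -= logp.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

# example: three streams, two frames, three phoneme classes
streams = [np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]) for _ in range(3)]
combined = combine_product_rule(streams)
```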
We test this not only on additive noise, but also on channel noise and artificial reverberation test conditions. In this case we compared it with one other baseline, the ETSI advanced front end. That one does voice activity detection, estimates the noise and does some sort of noise cancellation as well. ETSI, if you look at it, is built for additive noise, so it does really well on additive noise but not as well on the other kinds of noise. When we compare our approach, we beat all of these systems, not only on clean but on all the other noise conditions as well, by a significant margin. As I said, ETSI has the additional advantage of voice activity detection; we don't have anything like that, and we are still able to beat it by around 4 to 19 percent depending on the noise condition. And if you analyze the different streams, we saw that stream one alone can beat the baseline on most of the noise conditions, and streams two and three then add on to this to give the best advantage.
So now I'll talk about how we applied this auditory object representation to timbre recognition. Typically, when people look at timbre, they try to identify the dimensions of timbre. When I say timbre, it is defined as anything that identifies the instrument other than volume or pitch, so anything apart from volume and pitch which you can use to identify the instrument. A lot of studies collect human ratings, where human subjects rate how differently they perceive two instruments, and based on this perceptual distance between the instruments they map them into a two-dimensional or three-dimensional space. Then they try to find features which correlate with the dimensions of that space; that is how they go about finding the dimensions of timbre. Here we take the opposite approach: we first define the dimensions of timbre, and then we see how well they match the perceived distances. But first we'll do a quick analysis of this representation on the classification task.
Here we keep the full representation, the full high-dimensional space, because to identify the timbre of an instrument it is really key to have all of the spectral and temporal variations; we can't afford to lose the temporal fine structure. We keep the whole space and test this on the RWC musical instrument database, which has 11 instruments and close to 2000 notes per instrument. For each instrument there are three different manufacturers, multiple playing styles, and all the possible pitch values on that instrument, so we are really testing whether our system can generalize to this. What we do is take this high-dimensional representation and reduce the dimensions to a manageable amount using a tensor SVD. For the backend we use an SVM classifier with RBF kernels, and I'll talk about the last one later.
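A minimal sketch of one way to do this kind of multilinear (tensor SVD, HOSVD-style) reduction followed by an RBF-kernel SVM; the mode sizes and the toy data are purely illustrative, and this is not claimed to be the exact pipeline used in the thesis.

```python
import numpy as np
from sklearn.svm import SVC

def mode_bases(tensor, keep):
    """HOSVD-style bases: SVD of each non-sample mode's unfolding,
    keeping the leading `keep[m]` left singular vectors."""
    bases = []
    for m in range(1, tensor.ndim):                    # mode 0 indexes the samples
        unfold = np.moveaxis(tensor, m, 0).reshape(tensor.shape[m], -1)
        U, _, _ = np.linalg.svd(unfold, full_matrices=False)
        bases.append(U[:, :keep[m - 1]])
    return bases

def project(sample, bases):
    """Project one (rate x scale x channel) sample onto the reduced bases."""
    out = sample
    for m, U in enumerate(bases):
        out = np.moveaxis(np.tensordot(U.T, np.moveaxis(out, m, 0), axes=1), 0, m)
    return out.ravel()

# toy usage, with random numbers standing in for cortical features of notes
X = np.random.rand(40, 5, 4, 16)          # 40 notes: 5 rates x 4 scales x 16 channels
y = np.random.randint(0, 3, size=40)      # 3 instrument classes
bases = mode_bases(X, keep=(3, 3, 6))
Xr = np.array([project(x, bases) for x in X])
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(Xr, y)
```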
When we do this, we compare this representation with two other representations. The spectral features here are just the auditory spectrogram, and we also tested features like MFCCs; they all give something in the range of 79 percent. The second feature set we compared against uses actual neuron transfer functions recorded in [indiscernible], so we borrowed those; there are about 1200 neuron transfer functions, and we just applied them to the auditory spectrogram and used their activity to classify the timbre. We saw that it does quite well compared to the spectrogram, but the problem is that it doesn't cover the full range of [indiscernible]: it is only on the order of a thousand neurons, whereas the brain has millions, so it is not able to accurately capture the entire range of possible neurons. If we use our model instead, you see that we get a really good representation for these instruments, which shows that the instruments are really well separated in this space. Since the instruments are separated in this space, we wanted to see whether the model also captures the perceptual distances that human subjects report. We went ahead and had human subjects rate how differently they perceive the instruments from this RWC database, which gives a matrix telling us how different these instruments are. We then take the distances in our model and correlate them with these perceptual distances. With the baseline features the correlation with the human distances is not that good, but with our model the correlation is pretty high. This shows that this representation is not only able to separate instruments, but is also able to preserve the distances between the instruments as we perceive them. Now, all of this was based on isolated tones of instruments, so we
thought about how we can use this in a more generalized situation. If you go and buy a musical recording you might have something like this. [music] This is an artist playing an instrument, whereas our model was trained on something like this, a clean tone which has the full rise and decay of the note. [music] The note is allowed to decay completely and we capture all of its dynamics. If you test our system on the continuous recording using just a uniform window, without caring about what is actually there, the classification accuracy drops drastically from 98 to 70 percent. These recordings were collected from commercially available CDs, so there could have been some differences in recording equipment and so on, but, as I said, since the training notes are isolated notes, we think we can do better if we can extract the notes from this continuous recording. We tried existing techniques for onset detection, which tell you the boundaries between two notes, but they were not able to generalize well to this database, since we have recordings from various studios and different CDs; we think that is because most of those systems work on a signal-level representation. So we came up with the following idea.
If you look at one particular note, you can see that the harmonics are placed at different points along the frequency axis. If we build a template which mimics the placement of these harmonics along the frequency axis, and we see how well the template for each pitch matches the spectrogram, we get a pitch estimate, and the regions where the pitch estimate is changing are possible times for onsets. Also, the degree to which the template matches is what we call harmonicity, and we typically see that at the boundary between two notes the harmonicity is low, because of the overlapping harmonics, so we also mark the candidates where the harmonicity is low as possible onset boundaries. We then combine the evidence from both of these streams to get reliable estimates for the note boundaries.
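A minimal sketch of this pitch-template-plus-harmonicity idea; the Gaussian harmonic templates, the candidate pitch grid and the thresholds below are illustrative choices, not the exact detector described here.

```python
import numpy as np

def harmonic_template(f0, freqs, n_harm=10, width_hz=20.0):
    """Unit-norm template with Gaussian bumps at the harmonics of f0."""
    t = np.zeros_like(freqs, dtype=float)
    for h in range(1, n_harm + 1):
        t += np.exp(-0.5 * ((freqs - h * f0) / width_hz) ** 2)
    return t / np.linalg.norm(t)

def pitch_and_harmonicity(spec, freqs, f0_grid):
    """Best-matching template pitch per frame and its match score (harmonicity)."""
    f0_grid = np.asarray(f0_grid, dtype=float)
    templates = np.stack([harmonic_template(f0, freqs) for f0 in f0_grid])
    frames = spec / (np.linalg.norm(spec, axis=0, keepdims=True) + 1e-9)
    match = templates @ frames                 # (pitches x frames) correlations
    return f0_grid[match.argmax(axis=0)], match.max(axis=0)

def onset_candidates(pitch, harmonicity, pitch_tol=1.0, harm_quantile=0.2):
    """Frames where the pitch estimate jumps or the harmonicity dips."""
    jumps = np.abs(np.diff(pitch)) > pitch_tol
    dips = harmonicity[1:] < np.quantile(harmonicity, harm_quantile)
    return np.where(jumps | dips)[0] + 1
```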
We then segment the continuous recording along these boundaries, and if we do the classification on these extracted notes, we see a performance improvement of close to 9 percent. But to take care of the differences in recordings and other changes in the statistical distribution, we can also adapt the backend: we take a few examples of these extracted notes and adapt the SVM boundaries to the new distribution, and if we do that we can further improve the performance by around 6 percent. This obviously has the advantage of having seen a few examples, whereas the note extraction alone does not have that advantage. A few observations: we were able not only to represent musical timbre in this space, but also to capture the distances between the instruments well. When we did a study of marginal representations, asking whether we need to keep the whole space or whether we can marginalize it along different dimensions, we saw that the whole space is indeed required, and we also saw how the representation can be modified to be robust to out-of-domain conditions. Now I would like to talk about the concept of attention in
this space. When I say attention, there are usually two kinds of things that are thought about.
One is the bottom up attention where you are not doing anything voluntarily but let's say there
is a loud bang somewhere in the background, you immediately divert your attention to that.
That is the bottom up attention and we are going to talk about something called the top-down
attention where I give you the task to pay attention to some particular sound or some
particular object and without this task you won't be able to recognize this, but once you are
given the task you will be able to recognize it. That's the top-down attention that we are going
to talk about. For example, in the visual domain, if this is the visual scene that is there and I
asked you to pay attention to this painting on the wall, so you do something like a field of view
sort of thing where you are able to restrict your faculties only to that particular object. And you
are also able to change, or move this field of view to a different object like if I ask you to pay
attention to the flowers. So how does this work in the auditory domain? For example, just from listening you won't immediately be able to tell that there are multiple objects, so I'll just play this sound for you. [audio begins] [audio ends] As you noticed, there are multiple speakers in this
recording and if I now ask you to repeat what the main speaker said, I'm pretty sure nobody is
able to do that. But now I give you the task to pay attention to only to the male speaker and
let's see if we can do that again. [audio begins] [audio ends] You were able to pay attention to
the male and you were able to say it's three, eight, five, five and so on, so this clearly shows
that there is something that's happening where you can focus your view from one object to
another even in the auditory domain similar to what is happening in the visual domain. We
wanted to implement this in our model. Some studies of what actually happens in the brain show the transfer function of one particular neuron in a ferret; when the ferret is asked to pay attention to some particular event, the transfer function changes while it is doing that task. This clearly shows that the representation in the auditory cortex is not static; it changes, whereas our representation was essentially static, so we wanted to implement some kind of adaptive technique to take care of this. The motivation is that, as we saw, two different objects occupy two different regions of this space. Let's say we had animals and music occupying different regions: if we had to pay attention to music, and we had some prior knowledge that music lies in that location, we could focus on or highlight that region of the space, because the objects are separated. That's the hypothesis. The way we implement it is that we collect some prior data for that instrument, or that object, and compute the mean activation of all the filters. If the task is then to pay attention to, say, animals, we use that activation pattern as a boost, or you could call it a gain adaptation, for these filters: we boost the filters which are most active for the target class and suppress the filters which are not so active. There is also evidence that the highest stages of processing undergo changes during top-down attention, so we can think of that as our statistical model adapting to the new task; we do a MAP adaptation in this case, where, given a few examples, we adapt the target distribution to the new conditions.
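A minimal sketch of the gain-boosting step just described, assuming the per-filter gains are the target class's mean filter activations rescaled into a fixed range (the 1-to-0.5 range mentioned later in the talk); the GMM backend would then be MAP-adapted on the gain-scaled features.

```python
import numpy as np

def attention_gains(target_features, g_min=0.5, g_max=1.0):
    """Per-filter gains: the mean activation of each filter on prior examples
    of the target class, rescaled into [g_min, g_max]."""
    mean_act = target_features.mean(axis=0)          # (n_filters,) mean activations
    lo, hi = mean_act.min(), mean_act.max()
    return g_min + (g_max - g_min) * (mean_act - lo) / (hi - lo + 1e-12)

def apply_attention(features, gains):
    """Boost the filters most active for the target class and suppress the
    rest by scaling each filter output by its gain."""
    return features * gains                          # (frames x filters) * (filters,)
```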
The way we tested this is on the BBC sound effects database. This is quite a big database that has many classes and four to six hours of data. During testing we actually mix the target class with other classes, so there are multiple sources occurring at the same time, and we tell our model to pay attention only to the target and see how well it does at recognition. In this case we were able to reduce the feature dimension to 113, compared to the music task where we had to keep more. The backend, as I said, is a GMM classifier, and we use MAP adaptation to adapt the models. The performance here is measured in d-prime, which takes into account not only the true positives but also the false alarms: we are not just looking at how many times the target is correctly identified, but also how many times it is wrongly identified, so it is a more robust measure.
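For reference, the usual definition of d-prime (which I take to be the one meant here) combines the hit rate and the false alarm rate through the inverse of the standard normal CDF:

```python
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """d' = Z(hit rate) - Z(false alarm rate), with Z the inverse normal CDF."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# example: 90% hits with 10% false alarms gives d' of about 2.56
print(d_prime(0.90, 0.10))
```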
We compared it with an MFCC baseline where we average the statistics over the entire duration of the recording: we take the first, second and third moments of the MFCC features and use those as the feature vector. We also recently compared with more standard acoustic features from the literature, and we found that they are all comparable to this range of performance. Our model without any attention, with no boosting and no adaptation of the GMMs, already gives a really big gain in performance; this suggests that our model already has these objects well separated. When we apply boosting, that is, when we apply a gain on the filters, the performance improves, but not by much. This suggests that there is some change in the statistics that the GMM backend is not able to handle.
>>: So this boosting is not the machine learning version of boosting? I mean, it's not the same as where we learn…
>> Kailash Patil: No. This is gain boosting. And when we just do the adaptation with no boosting, we see a higher jump in performance; this may be because, to do the GMM adaptation, you need some example samples of the test conditions, so that might be why it does better. Finally, when we combine both of these techniques we get the best performance. This shows that when we boost the filters, we also need to adapt the GMMs; we get the best performance when we do both.
>>: [indiscernible] unsupervised in that application or supervised?
>> Kailash Patil: It's supervised.
>>: Supervised. You already have some samples?
>> Kailash Patil: Some samples.
>>: How did you [indiscernible] like exactly or [indiscernible]
>> Kailash Patil: Let's say we have many filters. Given a target class, we collect from the training data the average representation, the mean activity, of these filters, and we normalize it to be in a range of, let's say, 1 to 0.5; we can control this parameter, so these activations are now in some range. We use those as the gain parameters for the filters when we want to pay attention. Say filter one was active for speech and filter two was not active for speech but was active for machines. When we apply this and we are asked to pay attention to speech, filter one will still be on, it will still be active, versus filter two.
>>: Is there any like normalizing the responses [indiscernible] average response to that?
>> Kailash Patil: Yeah. So we saw that this cortical representation got around a 45 percent improvement over the baseline, and the representation was also able to facilitate an attentional mechanism where we could selectively attend to targets. We showed that these attentional mechanisms provided an 81 percent relative gain, which is a huge improvement. More recently, we were also able not only to change the gains of the filters but also to adapt the shapes of the filters depending on the target, and we showed that this was more useful in adverse conditions, when the ratio of the target to the background is really low.
>> Eyal Ofek: [indiscernible] filters?
>> Kailash Patil: Yeah.
>> Eyal Ofek: So you are thinking about to make a different set of [indiscernible]
>> Kailash Patil: Let's say we have a model like this. Given a target, we change the orientation of these filters a little bit, in such a way that they are most active during that target class, so if we change…
>> Eyal Ofek: [indiscernible] more suitable alphabet.
>> Kailash Patil: Yeah.
>>: So when you design those filters [indiscernible] approach or [indiscernible]
>> Kailash Patil: You can think of it like a 2-D wavelet transform, so we have the basis functions and we just…
>>: [indiscernible]
>> Kailash Patil: So in conclusion, we presented a multidimensional, possibly overcomplete space which can represent auditory objects, motivated by the processing in the auditory system. This space is sensitive to frequency and to temporal and spectral profiles, and we showed how it can be a basis for many different sound technologies like speech recognition, timbre classification and scene classification. We also showed how this space facilitates the creation of attentional mechanisms, and how we think such a mechanism can be a basis for tasks like event detection or keyword spotting. These are some of the publications which resulted from this work. Do you have any questions? [applause]
>>: [indiscernible]
>>: [indiscernible] so the, I'm just trying to compare the differences. So the MFCC or LPCC
systems were GMM systems?
>> Kailash Patil: For the speech?
>>: Yeah. So if you've got [indiscernible] speech recognition for the initial [indiscernible] so
those first two first columns were GMMs?
>> Kailash Patil: No. The backend is still the same, the hybrid [indiscernible]
>>: Okay. So [indiscernible] and that one is just [indiscernible] and that one is just minimum
[indiscernible]
>> Kailash Patil: No. The backend is the same for MFCCs and for our system.
>>: Right. But the speech [indiscernible] NLP [indiscernible] is the two here, the two stage
[indiscernible]
>> Kailash Patil: Yeah. We still have that for MFCC as well, so it's exactly the same; only the input features are changing.
>>: And the other question is, it looked like on the other tasks you used this tensor as
[indiscernible] but you didn't use it here?
>> Kailash Patil: No, because here the feature dimensions were manageable. The feature dimensionality was fine, so we have an MLP with an input layer of [indiscernible]
>>: That's because you did this manual extraction [indiscernible] so you already tried to take the thing in and [indiscernible]
>> Kailash Patil: Yeah. That was the very first main approach.
>>: I didn't know if it was there or not.
>> Kailash Patil: Yeah. So I think in that case what happens is that the features are redundant in some way and we lose some of the temporal structure. Let's say we have filters which are quite large and we do the convolution in time; in the output we lose the temporal structure, so we get some smooth response and we are not able to predict the phoneme sequence from that. Yes?
>>: On one of your early slides, your motivation slide, you showed that when you listen to an orchestra there are three musicians playing and you can recognize the instruments. Can a computer do that? Can we do this automatically? Are there programs out there? Has there been research published that has successfully taken an orchestra with multiple instruments and separated the sound of each instrument?
>> Kailash Patil: There are some studies that do. I am not aware of studies that exactly identify the instruments, but in this multisource setting they do different kinds of assessment, like identifying the genre of the music or identifying different characteristics of the music, but maybe not exactly finding out which instruments are playing.
>>: The work you showed on attention, right, couldn't that just be applied to the problem so you could separate the male from the female speaker or [indiscernible] sources? Could you just apply that system to that problem?
>> Kailash Patil: I think the way we formulated it is: given a target, you have to pay attention to, say, piano, and we were able to tell to a better extent whether the piano was present or not. So we would then have to apply this to the whole set of instruments, and in that case we could tell which instruments are present; we would have to run the system on the different targets and see which of those targets are present.
>>: Okay. I think that would be a very interesting application to see how far you could go.
>> Eyal Ofek: And once you start that, it opens up other applications, since it's pretty much a new feature that you can extract from the audio signal. How applicable is this to actually getting the nonverbal cues from human speech: emotion, speaker identification, even gender or age? Do you see an application of this research for such tasks?
>> Kailash Patil: I can definitely see some advantages of using such a representation, because in timbre we were after capturing anything apart from pitch and volume, so we are trying to capture all of the characteristics of that instrument. For example, with the musical instruments I also tried doing broad-class instrument recognition, recognizing whether the instrument was a stringed instrument, a wind instrument or a percussion instrument, using similar features. That shows it is not only capturing one characteristic; we were able to generalize it to capture different characteristics of the sound. So, the way you are saying, if you wanted to capture…
>>: Let me rephrase the question a little bit. My understanding is you can separate or detect the presence of, say, a cello or whatever. How good would your recognition be where you [indiscernible] if there were maybe three cellos and I asked you which of the three was playing when? What do you think your recognition rate would be on which one of the three cellos it is?
>> Kailash Patil: I think if you are given three different cellos which have slightly different characteristics, we should be able to capture the minute differences between the instruments as well.
>>: You think with the timbre you'd again get like 95 percent recognition of which one it is?
>> Kailash Patil: I think if you have different manufacturers -- so you are talking about different kinds of instruments, right? For a cello you might have different…
>>: I don't know much about cellos. I know there is a really expensive one.
>>: [indiscernible]
>> Kailash Patil: I think another good example of that would be: given a cello, you can play it in different styles, so you can have the vibrato style, which musicians use to kind of…
>>: The same musician maybe.
>> Kailash Patil: Yeah. The same musician, the same instrument, but he's trying to express his emotions with a different style.
>>: Do you really think you could get that same kind of recognition rate separating the three
cellos the same as separating the cello from a flute?
>> Kailash Patil: It might not be in the 99 percent range, but given that there are unique characteristics, I think we can do a really good job of capturing those characteristics.
>>: That would be impressive. So I guess the obvious other one is now you have three similar
30-year-old males speaking and do you think you could recognize which one is which?
>> Kailash Patil: Yeah. I think we can. It's like the timbre of human speech, so.
>>: Yeah. Okay. That's optimistic, because the timbre between a male and a female is quite a
bit different than between [indiscernible] so I disagree.
>>: [indiscernible]
>> Kailash Patil: We actually did speaker identification experiments using features similar to
this and we were able to beat most of the standard approaches.
>>: But did you do it under the cocktail party condition, where there were two people speaking and you had to say, out of the ten people in the database, which two people are speaking on top of each other?
>> Kailash Patil: We had conditions with background noise and so on.
>>: [indiscernible] adaptation the noise is very different if it's a low rumble or music or
something. It's a very attractive approach both in the idea of filtering for the signal and the
[indiscernible] thing, but the fascinating stuff is how far you can push it. The idea is elegant, but
I don't have any sense that you really know how widely it applies [indiscernible] I think under
more challenging conditions.
>> Eyal Ofek: Even if you can distinguish [indiscernible] speech, no speech. Because in most of
the cases [indiscernible] is talking to the [indiscernible] speech or no speech. Or even if you can
tell that this voice belongs to speaker one through eight, or none of those. Also consider your
[indiscernible] process of getting [indiscernible] systems. More questions? [applause]
>> Kailash Patil: Thank you.