
>> Mike Seltzer: Okay. Welcome, everybody. It is my great pleasure to welcome Joakim Anden here from École Polytechnique in Paris. We sort of got lucky. He was in town visiting some
folks at University of Washington and had some free time and we were able to snatch him and
have him give us a talk on his very interesting work on scattering transform which is sort of
having some interesting applications in audio and image processing and without further ado, I'll
let him take it away. And we should all say congratulations. I believe he has defended his
thesis recently and in two days he will start a postdoc in Princeton, so he only has two more
days of freedom left between jobs.
>> Joakim Anden: Thank you. Yes, my name is Joakim. I met some of you already, and I'm
going to talk about some of the work that I've been doing together with Stéphane, my advisor, during my PhD these last few years. My thesis has been mainly concerned with
classification of audio signals. One of the problems in classification is we want to model our
different classes of signals very well in order to be able to identify in what class to put an
unknown signal. The problem is we have a lot of different types of variability in audio signals
that's not necessarily very important. We can have a shifting in time. We can have stretching
in time. We can have transposition in frequency and so on. That's not necessarily that
important, but nonetheless takes up a lot of training data if we want to construct a good model
of it. For example, we have these three sounds which are the same word pronounced
differently, and as you can hear…
>>: Ask, ask ask.
>> Joakim Anden: It's pretty obvious that they are the same word, but we see on the
spectrogram that we have very different properties in terms of pitch, duration and position
relative to the center. We have all this data and we want to be able to model it well. We want
to be able to model all this variability well, but it's not very useful for determining the class of
the signal. This is one of the problems. What's usually done to solve this is instead of trying to
model the wave forms directly, we have an intermediate representation that reduces the
variability of the data in order to allow us to customize models without having as much training
data. We don't want to reduce too much information because we want to be able to keep
what's important for classification, but this is not necessarily very obvious. We don't
necessarily know what type of information is going to be important for the classification task.
What we're going to do is we're going to take a conservative approach and we're going to say
we know what we don't need in audio representation. We know what's not going to be useful
for the classification task and this is going to be transformations of the signal that don't affect
its class and we want these transformations not to affect our representations either. This can
be time shifting. This can be frequency transposition and so on. On the other hand, we want
them to keep as much discriminability as possible. Just a quick overview of my talk. I'm first
going to talk about time shift invariance and how it applies to all your representations,
specifically the scattering transform. And then look at how the scattering transform is able to
capture certain discriminative properties of audio signals. After that we're going to look at
transposition invariance, how we can extend this representation to get something that is
invariant to pitch and shift, which gives us another type of scattering transform. Finally, apply
these to two different audio classification problems. The way we formalize this mathematically
is we just take our signal. We shift it in time by a constant and that's our time shifting. What
we say is that a problem is invariant to time shifting if a given signal is still in the same class
after we shift it in time, and then we require our representation also to be invariant.
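In symbols (a gloss on this formalization, using notation not taken from the slides): writing $\Phi$ for the representation and $x_c(t) = x(t - c)$ for a time shift, invariance means $\Phi(x_c) = \Phi(x)$ for every shift $c$; a time warping is instead $x_\tau(t) = x(t - \tau(t))$ for a smooth, position-dependent displacement $\tau(t)$.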
We also have a more general class of transformations of the signal, which are time warpings. This
corresponds to taking a signal and then deforming it slightly in its position, so it doesn't have to
be a constant deformation, which would correspond to a translation, but it can vary with
position. This means that it also includes dilations. A local time warping like this is going to be a deformation, and the larger the deformation, generally, the more different the sound will be; it can induce changes in pitch, changes in duration. What we want to say is we don't want to be invariant to time warpings. Instead, we want to be stable, in the sense that small time warpings, which have little effect on how we perceive the sound and little effect on the class, should have little effect on the representation, but for larger time warpings, where the sound is going to sound very different, we want the representation to change a lot.
We have what's called a Lipschitz continuity condition or a stability condition to time warping
that we are also going to require. This is, in fact, a much stronger condition. We have lots of
representations that will satisfy time shift invariance, but not necessarily time warping
invariance. An example of this is the spectrogram which is commonly used but not often for
classification directly. The reason for this is that while it is time shift invariant, in the sense that it is insensitive to time shifts that are small with respect to the window size, we don't have stability to time warping.
Specifically, if we take a pure dilation where we have a linear time warping, we see that for a
sample signal like this where we have a harmonic stack, in the low frequencies we overlap
significantly in the original signal and in the warped signal, but in the higher frequencies we're
not going to have the same thing because the shift is going to be proportional to the center
frequency. A solution to this is to average the spectrogram in frequency, which gives us the
Mel frequency spectrogram. And the important thing is when we average it, we want it to be
constant Q at high frequencies, because this constant Q filter bank is going to ensure that we
compensate for the increased movement in the high-frequency components. The result is that
after averaging we have something that overlaps just as much in the high frequency as it does
in the low frequencies, and we can see that for this example the difference between the Mel
frequency spectrograms is going to be on the order of epsilon. What's interesting is that this
representation has been motivated mainly by psychoacoustic and biological considerations, by
saying that this is something that we see generally in the cochlea, but it turns out that there's
also a mathematical justification in terms of time warping stability why we want to have this
type of averaging. As a summary, whenever we have a signal and we want to create a
representation of it, if we use a spectrogram we're going to have problems. We're going to
have invariance to time shifting but we're going to have instability to time warping, but a
constant Q average will stabilize this problem. In order to look at this a little deeper, the
problem with the Mel frequency spectrogram is that it's not very useful to characterize larger
scale structures by itself. To see why this is we're going to slightly reformulate the definition.
The way we calculate a Mel frequency spectrogram is we take a spectrogram and we average it
in frequency like we saw just before. Another way to get something that's very similar that's
not exactly the same but contains a similar type of information, is to take these filters that we
use to average and instead convolve the original signal with them, which gives us this time
frequency representation that has very high resolution in the high frequencies and is much more regular in the low frequencies. Then we average this in time to get a regular time-frequency grid, like in the case of the Mel frequency spectrogram. There is
a way to show this mathematically that we have equivalence between the averaged filter
response amplitude and the Mel frequency spectrogram, but I'm not going to go into that right
now. A way to view this is to say that we have the wavelet transform. Now a wavelet
transform is sometimes thought of as this sort of dyadic decomposition where you have one
octave that's taken up by the wavelet coefficients and then you decompose and take the low
pass filter and then take the high pass octave of that and so on, but there's actually a lot more
flexibility that we can put into it. For example, we can have much more narrow response in
frequency which gives us this constant Q wavelet filter bank. So it's going to be constant Q in
the high frequencies and linear in the low frequencies, but it turns out that this is not
necessarily that big of a problem. We're still going to call them wavelets even though they're
not exactly wavelets, but they have a lot of the nice properties that we need from them.
What's also interesting is this is going to give us basically a Mel frequency scale as well just by
having this type of constant Q filter bank. Using this we can define a wavelet transform which
is a low pass filter which captures the low frequencies that are not in the bandpass filters
together with the bandpass filters. If we make sure that we cover the whole frequency axis
we get an invertible transform. We can now say that the Mel frequency spectrogram can be
thought of as a time averaged wavelet coefficient amplitude. If we look at the wavelet
decomposition, the modulus of these, we see that when we average them in time we lose a lot
of information in terms of temporal structure. This is one of the reasons that Mel frequency
spectrograms are not very good if we want to characterize completely a large scale structure,
because if we increase the averaging scale, we're going to get something that's more invariant,
but we're not going to get something that has, that captures the temporal dynamics of the
signal, so we have a loss of temporal structure. We need to capture this fine structure and the
question is how. One approach that has been used is modulation spectrograms, but the
problem is if you take a modulation spectrogram which is going to consist of taking a
spectrogram of each of the separate channels, we're going to run into problems again because
we're going to have the instability to time warping. We need to look at a different method for
capturing this high-frequency information.
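As a concrete illustration of the averaging step described above -- a magnitude spectrogram pooled along frequency with filters whose bandwidth grows in proportion to their center frequency -- here is a minimal numpy sketch; the window length, filter shapes, Q and frequency range are illustrative choices, not those used in the talk.

```python
import numpy as np

def spectrogram(x, win=512, hop=128):
    """Magnitude spectrogram |STFT|^2 with a Hann window (illustrative parameters)."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (n_frames, win//2 + 1)

def constant_q_filterbank(n_bins, sr, n_filters=40, q=8.0, f_min=200.0):
    """Triangular band filters on a geometric grid of center frequencies,
    with half-width fc / q, i.e. constant-Q (a Mel-like layout at high frequencies)."""
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    centers = f_min * 2.0 ** (np.arange(n_filters) / q)       # geometric spacing
    fb = np.zeros((n_filters, n_bins))
    for k, fc in enumerate(centers):
        fb[k] = np.maximum(0.0, 1.0 - np.abs(freqs - fc) / (fc / q))
    return fb

def averaged_spectrogram(x, sr):
    """Average the spectrogram along frequency with the constant-Q bank."""
    S = spectrogram(x)
    return S @ constant_q_filterbank(S.shape[1], sr).T        # (n_frames, n_filters)

# A small dilation moves high-frequency partials a lot in the raw spectrogram,
# but the frequency-averaged version changes comparatively little.
sr = 16000
t = np.arange(sr) / sr
x = np.sign(np.sin(2 * np.pi * 440 * t))                      # harmonic-rich signal
x_eps = np.sign(np.sin(2 * np.pi * 440 * 1.01 * t))           # ~1% dilation
diff = np.linalg.norm(averaged_spectrogram(x, sr) - averaged_spectrogram(x_eps, sr))
```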
>>: Just so we're clear, these pictures on the right is [indiscernible] normal [indiscernible] and
on the left a series of sort of actual data [indiscernible]
>> Joakim Anden: Exactly. And then they are averaged to get a Mel frequency spectrogram.
It's not technically the same. Just to digress a little bit: the information is still contained there. If we reduce the window size, we do retain more information in the Mel frequency spectrogram, in the sense that we have a sequence of vectors, but the problem is we're then going to need a classifier model on the back end to learn which specific sequences correspond to a certain class and so on, and this is what the HMM does. And what we want is we want to
relieve the model of the responsibility of learning the temporal dynamics. We want a
representation that covers a large duration of time but that captures as much information as
possible. And we see that if we try to do that with a Mel frequency spectrogram, we're going to
run into trouble because we're just going to lose more and more information.
>>: [indiscernible] Mel frequency spectrogram, [indiscernible] going to destroy the
information?
>> Joakim Anden: It's not going to destroy the information, but it's going to capture more
information than we would like. It's going to capture a lot of details that are not necessarily important, and this is kind of the idea of stability to time warping. If we're unstable with respect to time warping, that means that we sense changes in the signal that are not necessarily that important, corresponding to these high frequencies that can move very quickly even though
there's not a big change going on. The modulation spectrogram, because it takes a spectrogram of these signals and the spectrogram has this instability property, means that we're going to run into problems. Now, what you can do is take the modulation spectrogram and average it along modulation frequencies using a constant Q filter bank, and then you get, some people have been doing this, something like a constant Q modulation spectrogram. And then
you do have a stable representation and that's going to be similar to what we're going to end
up doing actually.
>>: [indiscernible]
>> Joakim Anden: Yes. I mean he's done both I think. I think he calls it modulation scale when
he does the constant Q averaging.
>>: That's okay.
>> Joakim Anden: Yeah. It's better. It's much better because then you lose the sensitivity that
you don't really want. That's similar to what we are going to end up doing. The issue is that
when we take this wavelet modulus and we average it, we are losing information, so we're
going to see how we can recover that information. As a small sidebar, these filters that we are
considering, they're analytic filters, so they have no negative frequencies. This means that
when we take the modulus, we're actually going to get a nice, somewhat nice envelope
estimation of the signal, because it's going to correspond to taking a real wavelet coefficient and then computing a Hilbert transform of it. So when we take the modulus, we're going to get the Hilbert envelope of each subband. Then what we do to get what's equivalent to -- did you
have a question? No. What we do to get our Mel frequency spectrogram is we average this in
time and so we basically just recover the low frequencies of this envelope. The rest of the
information is going to be contained in the high frequencies and we see that since the wavelet
transform is invertible, we can think of this low pass part as just the low pass part of a wavelet
transform. We know that the rest of the information is contained in the high frequency
coefficients of the second wavelet transform that we apply to the envelope. The problem is
that these coefficients don't have the invariance and stability conditions that we wanted, so we take the modulus again and we average, and so we get these second order scattering coefficients. So we have first-order scattering coefficients and second order scattering coefficients. We have as many first-order coefficients as we have first-order frequencies, and these correspond to acoustic frequencies; each corresponds to a certain subband that we're looking at. And then we have this second decomposition, which then corresponds more to something like modulation frequency. We look at each subband and we decompose it
and we look at the different types of oscillations present in there. Another way to formulate
this is to look at a very similar transform, which is a nonlinear transform: the wavelet modulus transform. We take the wavelet transform. We apply the modulus to the wavelet
coefficients and this gives us a basic building block for constructing these types of coefficients.
We essentially get a cascade that we apply to our original signal x and out of that we get an
average which is just the average of the signal which in audio we're not going to care about
most of the time; it's going to be 0. Then we get these first-order wavelet modulus coefficients, which is what we call them. Then we reapply the operator to these coefficients and we get what's called first-order scattering coefficients and second order wavelet modulus coefficients. Then we reapply, get second order scattering coefficients and so on. So
technically this can go on to infinity, but we're going to see that this won't be necessary
thankfully. So the structure of this is basically a convolutional network. We have convolutions
followed by nonlinearities followed by convolutions. One of the differences is that here we
capture information at each layer as opposed to just at the end, and also the filters aren't
learned. Instead, they're fixed to be wavelets. And this is not to replace any sort of deep
neural network methodology. Rather, it's a complementary approach. The reason that we
don't learn here is because we know that we need the invariance condition. We need the
stability condition and then once we have a representation that satisfies this, the idea is that
we can then plug it into a deep neural network that will then have fewer problems learning these invariance and stability properties, because we have already encoded them in the representation.
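To make the cascade concrete, here is a rough, self-contained numpy sketch of a two-order scattering computation -- not the authors' implementation. The filters are Gaussian bumps on the positive frequency axis standing in for proper analytic wavelets, the scales and Q values are only loosely matched to those mentioned later in the talk (about eight wavelets per octave for the first bank, one for the second), and no subsampling or pruning of negligible second-order paths is done.

```python
import numpy as np

def analytic_bump(N, fc, bw):
    """One analytic band-pass filter: a Gaussian bump at normalized frequency fc
    (cycles/sample), zero on negative frequencies so the modulus acts like an
    envelope detector."""
    f = np.fft.fftfreq(N)
    h = np.exp(-0.5 * ((f - fc) / bw) ** 2)
    h[f < 0] = 0.0
    return h

def filter_bank(N, q, f_max=0.45, f_min=0.003):
    """Constant-Q bank: geometrically spaced centers, bandwidth proportional to fc."""
    n = int(np.floor(q * np.log2(f_max / f_min)))
    return [analytic_bump(N, f_max * 2.0 ** (-j / q), f_max * 2.0 ** (-j / q) / q)
            for j in range(n)]

def lowpass(N, T):
    """Gaussian low-pass phi whose width sets the averaging (invariance) scale T samples."""
    return np.exp(-0.5 * (np.fft.fftfreq(N) * T) ** 2)

def wavelet_modulus(u, bank):
    """U[lambda] = |u * psi_lambda|, via pointwise multiplication in the Fourier domain."""
    U = np.fft.fft(u)
    return [np.abs(np.fft.ifft(U * h)) for h in bank]

def scattering(x, T=2 ** 12, q1=8, q2=1):
    N = len(x)
    phi = lowpass(N, T)
    avg = lambda u: np.real(np.fft.ifft(np.fft.fft(u) * phi))  # convolution with phi
    bank1, bank2 = filter_bank(N, q1), filter_bank(N, q2)
    U1 = wavelet_modulus(x, bank1)                              # first-order envelopes
    S1 = [avg(u1) for u1 in U1]                                 # first-order scattering
    # In practice only second-order filters below the bandwidth of the first-order
    # filter carry energy; here we naively compute them all.
    S2 = [[avg(u2) for u2 in wavelet_modulus(u1, bank2)] for u1 in U1]
    return S1, S2

S1, S2 = scattering(np.random.randn(2 ** 14))
```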
>>: [indiscernible] you talk about [indiscernible]
>> Joakim Anden: It's the modulus. It's simple modulus. That's all it does.
>>: [indiscernible] condition do you get invariance?
>> Joakim Anden: If you, if we remove the modulus our invariants are going to be all 0 because
we're going to have all sorts of linear operators and the only linear invariant that you can get
from a signal is its average. We will get 0. So the nonlinearity allows us to create more and
more invariance from a signal.
>>: [indiscernible] particular type of linearity, nonlinearity is [indiscernible] some other kind of [indiscernible]
>> Joakim Anden: Yeah. I mean you could; we have done a little bit of experiments in audio and it turns out that the choice of nonlinearity is not all that important. You could replace it with a local max. They have been doing other stuff with images and it turns out that it's not that important, the difference. The modulus is nice from a mathematical perspective because it turns out that if you want this stability condition and if you want your transform to be contractive, it turns out that you're restricted to a pointwise unitary operator, and so it's going to be the modulus, maybe multiplied by some complex number. There are ways of showing that, but experimentally it doesn't seem to matter all that much.
>>: So this only guarantees invariance. How can you make sure that you still preserve
information throughout this combination?
>> Joakim Anden: Yeah. I'm going to get to that because it's not entirely obvious that you're
discriminative from this. The idea is that whenever we lose information from averaging, we
recover some of it with the high frequencies of the wavelet decomposition. But then we lose
information again and so we recover it again. The idea is that whenever we lose information
we recover some of the lost information and so that's how we remain discriminative, but it's…
>>: [indiscernible] understand what is the meaning of that lasts [indiscernible]
>> Joakim Anden: Yeah. That's an averaging in time.
>>: It's just averaging in time?
>> Joakim Anden: Yeah. All of these convolutions are in time, so here we capture just the
average of the signal. Here we take the signal. We take a certain bandpass filter. We take the
modulus. We compute the envelope and then we average it in time to get something that is
invariant and stable. And then since we've lost information here in the average, we recover it
in the high frequencies here with the second wavelet decomposition, take the modulus, and average once again. So it's to get these invariants that we need to average, because otherwise
we're very sensitive to shifting and warping.
>>: [indiscernible]
>> Joakim Anden: Yeah. These are both; the first one is a Mel scale filter bank, sort of. It has about eight wavelets per octave. It turns out that we don't need to use the same wavelets in the second decomposition as in the first decomposition, so here it's more like a traditional wavelet filter bank. You have one wavelet per octave, or two, which is what we found works best, but these are the basic parameters of the representation: how many wavelets per octave and how
large you want to make the averaging. Yeah?
>>: Are you going to cover, I think it was Lipschitz, the dilation, how that comes into play in
terms of…
>> Joakim Anden: A little bit, yeah. I think it's the next slide. Yeah, hold on. Yes?
>>: Just one question. With the exception of the modulus this looks like a dyadic filter bank.
>> Joakim Anden: It's not completely dyadic because we have, well it depends on what you
mean.
>>: I mean if you were to choose a weight that is such that [indiscernible] scaling function
[indiscernible] to the previous [indiscernible]. And you don't, you no longer have the
[indiscernible] does that matter in this case?
>> Joakim Anden: I'm not quite sure what you mean.
>>: Perhaps we could go over it off-line?
>> Joakim Anden: It's, these wavelets are not necessarily the same wavelets that you would
have in orthogonal wavelet decomposition. In fact, these wavelets are very redundant and we
would like them to be. If we had wavelets that were orthogonal, we would run into problems it
turns out. These are actually very redundant wavelets. They overlap significantly in frequency
and they have a nonzero scalar product. It's not necessarily the same framework that you
would have in a dyadic or orthogonal wavelet decomposition.
>>: You try to minimize the amount of [indiscernible] and redundancy and here you are trying
to get more of it?
>> Joakim Anden: You want to be more redundant because if you are more redundant you are
more flexible to these types of changes. Time warping won't suddenly shift one coefficient into
another bin and then you're completely lost. So you want to have a certain type of
redundancy. It also makes sure that when you take the modulus you don't lose quite as much
information, so it has all sorts of nice properties, the fact that we are redundant. These are
wavelets, but they're wavelets in a very general sort of sense. They're wavelets in the sense that they're dilations of a mother wavelet, and they're barely even that in this case. The
important thing is that they're constant Q. That's what we care about, so it's less related to like
orthogonal decompositions and so on. We can continue this cascade and then, in general, for a given m we get an mth order scattering coefficient. Put all these together into one big vector and we get what is called the scattering transform or scattering representation. We define a norm on that, and then with this norm we get a theorem which proves the Lipschitz stability condition that we were talking about earlier.
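Schematically -- simplifying the actual theorem and dropping constants and secondary terms -- the bound in question has the flavor of

$$\| S x_\tau - S x \| \;\le\; C \, \|x\| \left( 2^{-J} \sup_t |\tau(t)| \;+\; \sup_t |\tau'(t)| \right), \qquad x_\tau(t) = x(t - \tau(t)),$$

where $2^J$ is the time-averaging scale, so that a small deformation produces a proportionally small change in the representation.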
The reason that this holds true is basically that if you have a wavelet and you apply it to a deformed signal, you can do a change of
variables so that it's equivalent to applying a deformed wavelet to the original signal. This is not going
to be drastically different from the original signal, because a deformed wavelet and the original
wavelet are going to be relatively close in terms of the Euclidean norm. This will not be true in
the Fourier decomposition, for example. A sinusoid and a deformed sinusoid are going to
diverge significantly at a certain point. That is kind of the heart of the argument; it's a proof that runs for 10 pages, so don't ask me to go into detail. But that's the basis of why it
works. So we have this nice stability condition for this representation. We also have lots of
other nice properties like conservation of energy. If we have a wavelet transform that
conserves energy, we also have as a consequence of this, that the amount of energy in each
order decays and so it turns out for most applications we can just look at first and second order
coefficients. We also have that when we demodulate, we get something that's essentially
low-frequency and so we don't need to keep the same sampling as we had in the original signal,
so we get like this graph I was showing earlier. And the high frequencies were very irregular
and the low frequencies were very regular, so it lets us save in computation and space. It also
means that once we decompose this using the second order wavelet decomposition, we don't
need to take -- here we need to take a lot of high frequencies, but here we can get away with
just looking at the low frequencies and we see this, for example, looking at the second-order
decomposition here, where we plotted the first-order frequency, the acoustic frequency, and
then the second-order frequency, the modulation frequency. We see that a lot of this is going
to be 0 and so we can neglect to compute them. The result is we can get an algorithm that runs
in about n log n to calculate these coefficients. The question is what sort of information do we
have in the scattering transform. We've seen that we are discriminative in the sense that we
capture a lot of the high frequencies, a lot of the temporal structure that we've lost, but what
exactly does this constitute. We're going to look in the case of audio, a simple model which
consists of an excitation filtered by an impulse response and then modulated in time by an
amplitude A. The question then becomes what sort of information can we see is recovered by
the first and second order coefficients in this type of model which is a relatively constrained
model, but it still allows us to look at some simple phenomena. For example, here we have
three sounds with the same excitation, harmonic excitation with different amplitude envelopes.
First is a smooth envelope. Then we have a sharp envelope with a sharp attack and then we
have a sinusoidal amplitude modulation, a tremolo. We can listen to them. We hear they
sound quite different, and we see here in the scalogram, in the wavelet modulus decomposition,
that they look quite different. But when we calculate the scattering coefficients, what we're
going to do is for each of these channels we are going to isolate them and we're going to
average them in time, and so we're going to lose the distinct characteristics of each of these
envelopes, and they're going to turn out to look more or less the same if the window size is
large enough. If we need a certain amount of invariance we're not going to be able to capture
this information directly. This is essentially what we would see in the Mel frequency
spectrogram with this window size. However, if we capture then the second-order coefficients
which is going to be this signal decomposed using a second wavelet decomposition, then the modulus, and then averaged in time, we do capture the differences in the envelope, in the sense that here we have mainly low frequencies and that's represented here at this end of the display in the second-order coefficients. Here we have a transient, so we have high frequencies, and here we have the sinusoidal modulation, so that's represented as a maximum around that frequency. We see that we capture these different properties of the envelope in the second-order coefficients, whereas the first-order will mainly give us information about the type of excitation and the filter, the spectral envelope, the filter h. And we can replace this by a
stochastic excitation, a white noise excitation for e, and we're going to end up with similar results. We can hear that these are also relatively different sounding signals. But yet, the first-order does not necessarily see any difference between these three signals, whereas in the
second-order we're able to differentiate between the three. The difference between this and
the harmonic case is that here we do have a low level noise floor that's due to the stochastic
excitation also as seen in the second-order. We do have a bit of a bleed over there, but we do
capture the information on the envelope as well.
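The three examples follow the source-filter-amplitude model described above, $x(t) = a(t)\,(e \star h)(t)$. A minimal numpy sketch of such test signals (pulse-train excitation, an arbitrary decaying filter, and made-up envelope parameters -- not the actual stimuli played in the talk) might be:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr                                        # one second

# Excitation e: a harmonic pulse train (replace with np.random.randn(len(t))
# for the stochastic, white-noise case).
f0 = 200.0
e = np.zeros_like(t)
e[::int(sr / f0)] = 1.0

# Filter h: a short decaying impulse response shaping the spectral envelope.
h = np.exp(-np.arange(256) / 32.0)
colored = np.convolve(e, h, mode="same")

# Three amplitude envelopes a(t): same first-order behaviour, different second order.
a_smooth = np.hanning(len(t))                                 # smooth rise and fall
a_attack = (t > 0.05) * np.exp(-3.0 * (t - 0.05))             # sharp attack, slow decay
a_tremolo = 0.5 * (1.0 + np.sin(2 * np.pi * 8.0 * t))         # 8 Hz amplitude modulation

signals = {"smooth": a_smooth * colored,
           "attack": a_attack * colored,
           "tremolo": a_tremolo * colored}
```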
>>: So [indiscernible] audio is two orders?
>> Joakim Anden: Yes. For right now we've only looked at two orders, and in classification
results it's usually enough to get good results. I'm going to get to that in a little bit.
>>: You showed the, on the bottom figure in the previous slide, what kind of value was it
evaluated for lambda one?
>> Joakim Anden: It was, yes. That's important. These are, I fixed one lambda one for this
particular display and so it's at this frequency, at 2400 Hz. But in reality you have the whole
sequence of these displays, but you would need some sort of 3-D representation that is difficult
to do. So yeah, this is for one fixed frequency channel that I showed this for, but you're going
to see something similar in all of them, because the envelope is the same at all frequencies in
this model, so you're going to see that.
>>: So how many dimensions do you have for the lambda one?
>> Joakim Anden: It depends. I mean, in the tests that we looked at, it could be around 40, 50
to 70, and in the second-order it could be in the 100’s, 200’s. This is, I calculated this for a very
dense sampling, so this is not representative.
>>: [indiscernible] 100, do you mean for each value in the first-order you have 100 and then…
>> Joakim Anden: No. For each timeframe, for each point in time I have about 50 first-order
coefficients and 200 second-order coefficients, just about. It depends on the parameters. But
on that order, yes. We can also look at frequency modulation, as you heard just now. We have
a sinusoidally modulated pitch and a constant pitch with a sinusoidal amplitude modulation, and what we have is, this is also represented in the second-order in terms of this
harmonic structure here, which is different from the amplitude modulation when we just had
the sinusoidal. The problem is we can also create an amplitude modulation here, which sounds
different from the frequency modulation but gives us very similar second-order coefficients,
and so this shows us that there is a certain drawback to this type of scattering representation in
that it doesn't really show the difference between AM and FM, but it will detect the presence of
either one.
>>: Those would be the same with every lambda one?
>> Joakim Anden: No. They would not, exactly. So this is for this fixed one, it would be the
same, but it's not possible to make them exactly the same for most of them. They're difficult
enough to tell apart that I think it will cause a problem in modeling them. You don't have a very
clear distinction between the two. But yet, this is just for a fixed lambda one. The final
example that we're going to look at is looking at frequency components interfering. If we have
two notes that are played together as opposed to being played separately, in the first case they
will interfere and you'll have a beating phenomenon in the envelope that's going to give us a
maximum in the second-order coefficients. We can listen to them. In a sense, the dissonance
that you hear between the two notes is captured here in the second order coefficients, a
certain kind of roughness. Another way to think about this is to say that since we have a certain
bandwidth of the first order frequency filter bank, we lose information about absolute position,
but the second-order is going to give us relative position information about different frequency
components in each frequency band. To summarize in terms of this model, what we can see is
that first-order coefficients tend to capture very short time structure like looking at excitation,
looking at the filter which is going to be in these cases around, up to 10 milliseconds of
duration. Whereas, the second-order coefficients are going to characterize larger scale
structures like amplitude modulation, frequency modulation and interference. We have this
kind of separation scales between the first and second order coefficients and what they
characterize. Another way to understand what type of information is captured in this
representation is to look at reconstruction from scattering coefficients. What we have is we
have a result that says for certain wavelets we can, in fact, invert the wavelet modulus operator, and since the scattering transform is a cascade of this operator, we can cascade this to
get an inverse of the scattering transform. The problem is to calculate this inverse is a
nonconvex optimization problem. It can be relaxed somewhat, but it's still intractable for the
types of audio signals that we're looking at, so what we're doing is actually we're using an
approximate heuristic approach to recover this, which is a basic alternating projection algorithm like a Griffin and Lim sort of approach. So to get the original signal we can cascade this inverse
wavelet modulus operator. The problem being at the end we won't have the necessary wavelet
modulus coefficients and so we have to use a deconvolution to get our first order estimates.
There are errors involved in this process, obviously, and since this is approximate, we get errors and so this is not a perfect reconstruction. I would not recommend using this in some sort of fancy audio codec or anything, but what's nice is it gives us an idea of what type of
information is captured in the first as opposed to the second-order. We can see this for a
speech signal where we have the original signal. We have the reconstruction from just the first-order, where you see that certain parts have been smoothed out; you lose a lot of the transients that we have here, for example. But if we add the second-order coefficients to the first-order coefficients and reconstruct from that, we do have a large amount of artifacts, but we hear that
the attacks and the transients have been restored to a certain extent in the reconstruction from
a second-order. This kind of confirms what we saw with the models in that we do capture that
type of temporal dynamics in the second order.
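For reference, here is what a "Griffin and Lim sort of" alternating projection looks like on an ordinary STFT magnitude, using scipy. This is only an illustration of that generic phase-retrieval idea, not the inverse-scattering procedure described above (which deconvolves and cascades an inverse wavelet modulus operator); the signal and parameters are arbitrary.

```python
import numpy as np
from scipy.signal import stft, istft

def alternating_projection(mag, n_iter=100, nperseg=512, noverlap=384, seed=0):
    """Griffin-Lim-style recovery of a signal from an STFT magnitude `mag`:
    alternately enforce the target magnitude, invert, and re-analyze."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))        # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, Z = stft(x, nperseg=nperseg, noverlap=noverlap)
        Z = Z[:, : mag.shape[1]]                              # guard frame-count mismatch
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# Usage: analyze a signal, keep only the magnitude, and reconstruct from it.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
_, _, Z = stft(x, nperseg=512, noverlap=384)
x_hat = alternating_projection(np.abs(Z))
```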
>>: [indiscernible] if you have m equals infinity, then you can recover the original signal? Is
that what it means?
>> Joakim Anden: Yes. If you had a good, yes. The answer is yes.
>>: [indiscernible] to infinity?
>> Joakim Anden: Yes. But in reality, no, because what happens is each time you take the
modulus, you're pushing your energy, you're pushing your information to the low frequencies
and so at some point most of your information will be captured by the low pass filter and so
you won't have to continue to decompose any more. At that point, your deconvolution is going
to be trivial. You're going to be able to recover your wavelet modulus coefficients fine because
your averaging won't be, won't have done anything. Then you can theoretically recover. The
problem is we're using a not very good algorithm here, and so once we cascade it far enough,
we're going to get all sorts of errors that will accumulate, and so adding the third order to this is
not going to improve. If we had a good algorithm to invert, then yes, it should.
>>: In theory, do you have any proof that as the term gets higher and higher the errors start
getting smaller aside from this [indiscernible]
>> Joakim Anden: In theory, yes. It should be true, yes, but I don't think we proved it or looked into the details of it, but intuitively, it should, yes. We have lots of types of variability in audio signals. We're only going to touch on a few of them here, and so besides time shift invariance, we also might want to have frequency transposition invariance. Here, for
example, we have one word pronounced by two different speakers and we have a difference in
pitch, but also differences in formant frequencies, so we can listen to them.
>>: Encyclopedias.
>>: Encyclopedias.
>> Joakim Anden: And so we hear that they are the same word, but we don't have the same pitch information, and we see here in the formants we don't have the same formant information. What's especially interesting is we don't have a pure scaling of the frequencies. We don't have a shift in pitch or a shift in formants that's constant over the whole spectrum. Instead we have a different shift at different parts of it. The distances between the formant peaks are going to vary, and so this corresponds more to a frequency warping than a pure
frequency transposition. This tells us that we might not just want to be invariant to frequency
transposition but also stable to frequency warping in the same way we saw with translations in
time.
>>: [indiscernible]
>>: [indiscernible] so how do you [indiscernible]
>> Joakim Anden: The idea is…
>>: [indiscernible]
>> Joakim Anden: Sorry. The idea is that we're not completely invariant to these types of
frequency warpings. We're stable to them, which means that a small frequency warping is not going to change your representation very much, so a small change in the formant frequencies is not going to change it very much. A large change is going to change it more, so your representation will hopefully be able to separate important changes in formant frequencies from the less important ones. But if you have, you guys probably know this better than I do. But if you have a formant frequency that shifts slightly, and that's going to change the identity of your vowel, then you're going to run into trouble with this type of thing here.
>>: [indiscernible] constrain your warping situation is that these special cases [indiscernible]
occur [indiscernible]
>> Joakim Anden: The answer is not explicitly, but what's nice with this continuity condition is
it tells us that we have a continuity with respect to these time warpings, and what that means is that a warping is going to be realized locally as a linear transformation in the
representation space. What's nice then is that if we have a linear discriminative classifier at the
back end it's going to be able to project against certain directions and in that way create a real
invariance to certain time warpings while maybe increasing sensitivity to other time warpings.
Since we linearize time warpings or frequency warpings, in this case, this classifier should be
able to decide which ones are important given enough training data, of course. So that's kind of
a nice property that comes out of it. Since we don't have, if we had pure invariance then, yeah,
that would become a problem because then you might run into these types of issues.
>>: I have a remote question all the way from California. [indiscernible] was actually in the Bay
Area so Malcolm [indiscernible] asking me a question. I believe the [indiscernible] transform
only makes sense as an envelope follower if there is only a single component in the subband
signal [indiscernible]. So does the theory still make sense if the -- I guess his question is does
this theory still make sense if the [indiscernible] transform does not actually give you the truth?
>> Joakim Anden: The idea here is it's nice to formalize it as looking at the envelope of a
specific frequency component, but it's not necessarily crucial to have it be specifically an
envelope and we saw, for example, in the cases where we do have two frequency components
in the same filter bank, in the same filter response, what we do see is we don't get the
envelopes of each one separately. Instead we get the beating phenomenon between them,
which in our case is nice because it gives us information about the internal frequency structure
in the filter bank, which we would not necessarily have if we just looked at the individual
envelopes present in there. It's not necessarily dependent on having an exact estimate of the
envelope in this case. What's important is having a nonlinearity that brings us to the lower
frequencies. It just happens that in some cases we can interpret it as an envelope when we just
have one frequency component. I hope that answers your question. [laughter]. Okay. There
have been some approaches that have been done to get this type of transposition invariance.
A first perspective is to say that a frequency transposition, which is a scaling of frequency, is going to be a translation in log frequency. In the high frequencies, Mel frequency corresponds to log frequency, and so at these frequencies we can simply average the Mel frequency spectrum and we get something that has this transposition invariance, and this is basically what the Mel frequency cepstrum does in taking the low-frequency coefficients of the DCT, which amounts to an averaging along the Mel frequency scale. The problem is once we start averaging along frequency, we lose information, and so we can gain some invariance by doing this, but if we try to gain more and more invariance to capture larger and larger scales, we're going to lose more and more information, so we're kind of limited in that sense.
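As a small illustration of that point (with a synthetic frame and made-up numbers, not real data): keeping only the lowest DCT coefficients of a log-Mel frame is a smoothing along the Mel/log-frequency axis, so a small transposition -- roughly a shift along that axis -- changes the retained coefficients relatively little.

```python
import numpy as np
from scipy.fft import dct

def cepstral_average(log_mel_frame, n_keep=13):
    """Keep only the lowest DCT coefficients: an averaging along log frequency."""
    return dct(log_mel_frame, type=2, norm="ortho")[:n_keep]

# Synthetic log-Mel frame with two formant-like bumps, and a crude "transposition"
# implemented as a shift of two Mel bands.
mel = np.arange(40.0)
frame = np.log(1e-3
               + np.exp(-0.5 * ((mel - 12) / 3.0) ** 2)
               + 0.7 * np.exp(-0.5 * ((mel - 25) / 4.0) ** 2))
shifted = np.roll(frame, 2)
delta = np.linalg.norm(cepstral_average(frame) - cepstral_average(shifted))
```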
Another approach, which you are probably familiar with, is vocal tract length normalization, where you try to realign the spectrum at a given point with that of a reference
speaker using a sort of scaling coefficient. An advantage of this is that we don't lose
information because we're just simply rescaling the whole spectrum, but it's sometimes
problematic to estimate the scaling coefficient and if we don't estimate it exactly, we don't
have the desired invariance property. Another problem is that we don't have stability to frequency warping unless we start considering nonlinear warpings of the spectrum; then you run into problems of having a large search space and so on. These approaches have their advantages, but in the first case we lose information and in the second case we have a certain instability. The
approach that we're going to take is that instead of averaging along log frequency, we're going to take a scattering transform along the log frequency scale. We fix a time point t and we go along log frequency and we compute another scattering transform and so on. This gives us the
separable time and frequency scattering transform, or just a separable scattering transform.
It's nice in the fact that it has the invariance that we want to time shifting and frequency
transposition and that, of course, depends on the scale that we consider here with the
scattering transform. We could decide to have a small-scale or a large-scale, but it captures
more information than if we just did a simple averaging. In experiments we usually can restrict
ourselves to the first-order coefficients. It's usually enough at the scales that we consider. The
problem is that this approach and actually the approach of the original scattering transform
loses a lot of important information because we average, we decompose each frequency band
separately and we average each frequency band separately. For example, these two signals, if
we look at the regular scattering transform with a window size that covers the whole signal, they're going to give the exact same representation, because we have the same frequency bands here as here, but shifted in time. Since we don't capture any
information on the correlation between frequency bands, it's not going to be able to tell the
difference. We lose this joint time frequency structure. A way to solve this is instead of
decomposing each frequency band separately, we would like to do as in S. Shamma’s cortical
representations where we decompose this time frequency scalogram in two dimensions in time
and frequency simultaneously instead of doing as in the separable case where we decompose
just once along the temporal domain and then average and then once along the frequency
domain. Here we get then a two-dimensional wavelet that we use to decompose the
scalogram. We can then form like we did before a wavelet modulus transform and cascade this
on to the scalogram and this gives us what's called the joint time frequency scattering
transform or the joint scattering transform. This has the advantage of having this same
invariance stability as the separable scattering transform but it captures more information in
that it's able to distinguish between these cases where we have shifted individual frequency
components and so it captures correlations between frequency bands. We could also create a
representation that is sensitive to frequency transposition by omitting the averaging along
frequency and just averaging along time. This gives us something that has the same invariance
properties as the regular temporal scattering transform, but that captures more information.
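A rough numpy sketch of one such joint layer, treating the scalogram as a (time x log-frequency) image, filtering it with two-dimensional analytic bumps -- separable Gaussians standing in for proper 2D Gabor or cortical filters, with made-up rates and scales -- and then averaging along time only, so that sensitivity to frequency transposition is kept:

```python
import numpy as np

def bump2d(shape, center, bw):
    """A 2D Gaussian bump in the (time, log-frequency) Fourier plane: one 'joint'
    wavelet tuned to a temporal modulation rate and a frequential modulation scale;
    opposite signs of the frequential center give up/down orientation."""
    wt = np.fft.fftfreq(shape[0])[:, None]      # modulation along time
    wf = np.fft.fftfreq(shape[1])[None, :]      # modulation along log-frequency
    return np.exp(-0.5 * (((wt - center[0]) / bw[0]) ** 2
                          + ((wf - center[1]) / bw[1]) ** 2))

def joint_scattering_layer(scalogram, rates=(0.01, 0.03), scales=(0.1, 0.25), T=256):
    """|scalogram * psi_(rate, +/-scale)| averaged in time: a sketch of one joint
    time-frequency scattering layer applied to a (time x log-frequency) scalogram."""
    F = np.fft.fft2(scalogram)
    t_lp = np.exp(-0.5 * (np.fft.fftfreq(scalogram.shape[0]) * T) ** 2)[:, None]
    out = []
    for r in rates:
        for s in scales:
            for sign in (+1, -1):               # up- vs down-going structure
                psi = bump2d(scalogram.shape, (r, sign * s), (r / 2, s / 2))
                u = np.abs(np.fft.ifft2(F * psi))
                # average along time only, staying sensitive to transposition
                out.append(np.real(np.fft.ifft(np.fft.fft(u, axis=0) * t_lp, axis=0)))
    return np.stack(out)                        # (n_filters, time, log-frequency)

scalogram = np.abs(np.random.randn(1024, 64))   # stand-in |x * psi_lambda1| array
coeffs = joint_scattering_layer(scalogram)
```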
Finally, we're going to look at some classification results to try to validate some of these claims.
The first is to… Yeah?
>>: Sorry, could you go back? What's the difference between what this time frequency
scattering transform and [indiscernible]
>> Joakim Anden: We were using Gabor filters, yes. It's a Gabor filter bank, yes. This has been done by…
>>: I guess, what is the second [indiscernible]
>> Joakim Anden: The first-order is going to be x here is -- I had it in the previous slide. x is the
scalogram, the wavelet modulus coefficients. The amplitudes from the wavelet coefficients
from the first-order. The first-order wavelet decomposition stays the same and then on that
the second-order decomposition, which includes this low pass filter and then this wavelet
decomposition with the modulus and another low pass filter forms the second-order
decomposition. It's going to affect the first-order as well. But the decomposition is a regular wavelet transform, then a two-dimensional wavelet transform, and then you reapply the two-dimensional one and so on to get the higher orders. We're just going to concern ourselves with
the second order for now. Does that answer your question?
>>: Enough for now.
>> Joakim Anden: We can come back. To try this out we look at the TIMIT task of the phone
segment classification. What we do is we try to classify a segment where we're given the beginning and the end, so it's technically cheating, but it makes for an easy problem to code and to test, so that's what we did. It does give us interesting results on the different types of representations. What we do is we take the scattering transform over about 160 milliseconds and we compute it using 32 millisecond frames, and we put this into one big vector that we
then put into an SVM and we train a classifier with that. The goal is then to try to classify the
phone that's in the center of the segment. The results that we have first is we see that we
performed similarly to the Delta MFCCs with the first order coefficients. Here L=1 means that
we have first-order. L=2 means we have first and second order, and then first, second and third
order for L=3. The first-order coefficient performs similar to the MFCCs which is what we
expect because they carry similar type of information. The second-order we do see certain
improvement of almost 2 percent absolute, which shows us that the temporal dynamics that
we capture within each of these frames is actually useful for this type of classification. The
third order we actually do worse and the reason for this is that when we look at small
timescales, as I said earlier, when we apply the modulus we are going to force energy towards
the lower frequencies and that means that once we get to the third order we are already very
regular, already very low frequency, and so it's not going to be very important to capture the high frequencies with the wavelets. In the third order we're going to have a lot of noise. We're
not going to have very much information and so it's not going to help us classify in this case.
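Schematically, the classification back end amounts to something like the following scikit-learn pipeline; the features, labels, number of classes and hyperparameters here are placeholders, since the talk does not spell them out.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: one row per phone segment, each row the concatenation of the
# scattering frames covering that segment (real feature extraction and the TIMIT
# labels are not shown here).
n_segments, n_features = 300, 250
X = np.random.randn(n_segments, n_features)
y = np.random.randint(0, 48, size=n_segments)        # e.g. folded phone classes

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
predicted = clf.predict(X)
```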
>>: [indiscernible] does it mean that you feed that into the SVM?
>> Joakim Anden: Yes. In the same way that, we take Delta MFCCs over 32 milliseconds and
then we concatenate them over this whole big segment and then we feed them exactly as in
the scattering case to make it comparable.
>>: I thought that the error [indiscernible] wouldn't be so low.
>> Joakim Anden: We also include the log duration of the segment. That makes about 2 percent
absolute difference, so that might be why.
>>: What is the state-of-the-art?
>> Joakim Anden: The state of the art I think is Jim Glass, so it's like a committee-based
classifier with different segment lengths and stuff and so they get 16.7 percent.
>>: [indiscernible]
>> Joakim Anden: Delta MFCCs in the first-order, it's around 50 I think. Second-order, like I
said, is around 200. Third order, I think is around 500, but I'm not sure about the third order. I
know it's about 200 for the second-order.
>>: Some of the features.
>> Joakim Anden: Yeah. But we're also using an SVM so it's not necessarily the end of the
world. When we started this I was using GMMs, because I figured that's what everybody does, and then I found out that the results were not very good. So you are going to have trouble if you try to use some sort of generative model like that, but
the SVM does handle it pretty nicely.
>>: The [indiscernible] think about the [indiscernible]
>> Joakim Anden: Yeah. I did some experiments where I did like huge vectors of 10,000 and it was able to do it. This is what I was talking about. If we want transposition invariance, what we
can do which is quite nice is we can skip the frequency averaging. Since we're using an SVM,
the SVM is going to optimize a locally linear discriminative plane at each point and so if we have
enough data it should be able to learn how much it wants to average for each classifier. That
makes it so we don't have to decide how much transposition invariance to force on our
representation, but instead, we can depend on the SVM to learn it if we have enough data. So
we tried fixing the averaging scale and then not averaging in frequency and letting the SVM try
to learn it and it turned out that that was much better, so that's what these results show. To
compare: for the second order scattering coefficients, it's 17 percent. If we add a scattering transform on top of that, along log frequency, we get 16.5 percent. That's almost a 1 percent absolute improvement. And then for the joint case, where we capture more time frequency
structure instead of just temporal structure, we do improve somewhat again over the separable
case. We also looked at musical genre classification because that's something that everybody
does and it's nice to be able to compare. So this basically consists of having 30 seconds of
music and trying to classify it into 1 of 10 genres. What we do is we do for each 700
milliseconds reclassify that and we vote over the whole track to try to get a result. Again, we
get results similar for the first-order and the Delta MFCCs. Once we add the second-order we
get a big improvement of about 40 percent relative, and the difference between this and the
previous case is here we're looking at much larger timescales. The difference between what
the first-order captures and what the second-order captures is much greater. We have a lot
more temporal dynamics that we need to recover in the second-order and so we do have a
significant improvement here for the second-order. The third order does a little bit better, but
we're still not in the regime where we expect the third order to carry a lot of energy and a lot of
information.
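The track-level decision is just a vote over the per-frame predictions, roughly like this (illustrative labels only):

```python
import numpy as np

def track_label(frame_predictions):
    """Majority vote over the ~43 per-frame (700 ms) genre predictions of a 30 s clip."""
    labels, counts = np.unique(frame_predictions, return_counts=True)
    return labels[np.argmax(counts)]

print(track_label(np.array([3, 3, 7, 3, 1, 3])))   # -> 3
```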
>>: So think of the [indiscernible] here is [indiscernible]
>> Joakim Anden: State-of-the-art, I can't remember. It's a paper from 2009 where they used
modulation spectra and then they do all sorts of LDAs and they sum along certain dimensions and stuff, and they get like a 9.4 percent error rate. I can't remember. I can send you the paper. So this
is just using an SVM like kind of feeding it in there. One of our colleagues who was at Princeton
previously and who is now with Stéphane's group managed to get 8.8 percent using a sparse
representation classifier on this using the second-order coefficient. By being a little bit more
clever you can squeeze some more out of it. For the transposition invariant case we see that
we have a certain improvement with the separable scattering transform. In the case of the joint scattering transform, we actually don't do better. We do about the same. This can be attributed to the fact that we don't -- here the issue is not necessarily that we want to be able to characterize very fine joint time frequency structure. Here the issue is that we have data points that are very far apart and we want to be able to create good invariance between them, and so adding more specific information about the joint time frequency structure is not necessarily
going to help us when what we want is more invariance, not more discriminability in this case.
Yeah?
>>: What is, what is the type of signal? You would expect that the third order would
[indiscernible]
>> Joakim Anden: The third order, you would expect the third order to play a larger role when
you are looking at larger timescales. Here we are looking at less than one second and you can
see by looking at how much energy is contained in the third order that it is not very important.
Once you start looking at larger timescales, you expect it to do better. The reason that we didn't look at larger timescales in this case was because we didn't -- even if we added the third order, going beyond 700 milliseconds resulted in degraded performance. That has
more to do with the nature of the problem. Once we go beyond that we start, if we have
something that's about 500 milliseconds we have a greater chance of getting a segment where
we have maybe one instrument and so on and so it's easier to characterize as opposed to if you
get a larger segment where you have to characterize mixtures of instruments. It becomes
something that's more complicated for the classifier to deal with. If we had a problem where
we had a very large scale structure like on the order of maybe 6 seconds that we wanted to
capture, then, yeah, the third order would be expected to do better. But we haven't seen any
tasks where it actually does a lot better yet. That's kind of the thought. In conclusion, just
looking at invariance and stability conditions for representations is a very powerful framework for trying to analyze the different properties that they have and relate them to their classification
performance. The scattering transform provides such a representation that has these nice
properties and it turns out that it actually captures important audio phenomena like amplitude
modulation, frequency modulation and so on. We can extend the scattering transform to get
the joint scattering transform, which captures joint time frequency structure and also gives us
transposition invariance in a discriminative way. And finally, we see that once we test this in
numerical experiments, we do get state-of-the-art results in the two tasks that we look at. I
should add that on top of this we've also looked at classification of medical signals in terms of
heart rate variability for fetuses. We've also had some nice results in that area, but not in this
presentation unfortunately. So that's it. Thank you very much for listening. [applause]
>>: [indiscernible] examples [indiscernible] speech or mixed speech?
>> Joakim Anden: No. That's kind of one of the next things to look at to see how we perform in
terms of talker separation and in terms of the presence of noise and so on. I mean TIMIT is very,
very nice. It's very clean and there's no noise, so that's not something that we have looked at
but it's something that would be interesting to look at in the future. But also looking at other
types of distortion like clipping or masking and so on and see how robust it is to that type of
change and how we can make it more robust to them.
>> Mike Seltzer: Any other questions? Several people will be joining this afternoon. If you do
not get in there I would like you to let me know. All right. Thanks again.
>> Joakim Anden: Thank you. [applause].