>> Mike Seltzer: Okay. Welcome, everybody. It is my great pleasure to welcome Joakim Andén here from École Polytechnique in Paris. We sort of got lucky: he was in town visiting some folks at the University of Washington and had some free time, and we were able to snatch him and have him give us a talk on his very interesting work on the scattering transform, which is having some interesting applications in audio and image processing. Without further ado, I'll let him take it away. And we should all say congratulations: I believe he defended his thesis recently, and in two days he will start a postdoc at Princeton, so he only has two more days of freedom left between jobs.

>> Joakim Andén: Thank you. Yes, my name is Joakim. I've met some of you already, and I'm going to talk about some of the work that I've been doing together with Stéphane, my advisor, during my PhD these last few years. My thesis has been mainly concerned with classification of audio signals. One of the problems in classification is that we want to model our different classes of signals very well in order to be able to identify in which class to put an unknown signal. The problem is that we have a lot of different types of variability in audio signals that is not necessarily very important. We can have shifting in time, stretching in time, transposition in frequency, and so on. That is not necessarily that important, but it nonetheless takes up a lot of training data if we want to construct a good model of it. For example, we have these three sounds, which are the same word pronounced differently, and as you can hear…

>>: Ask, ask, ask.

>> Joakim Andén: It's pretty obvious that they are the same word, but we see on the spectrogram that we have very different properties in terms of pitch, duration and position relative to the center. We have all this data and we want to be able to model it well. We want to be able to model all this variability well, but it's not very useful for determining the class of the signal. This is one of the problems. What's usually done to solve this is, instead of trying to model the waveforms directly, we have an intermediate representation that reduces the variability of the data in order to allow us to customize models without needing as much training data. We don't want to reduce too much information, because we want to keep what's important for classification, but this is not necessarily very obvious. We don't necessarily know what type of information is going to be important for the classification task. What we're going to do is take a conservative approach and say we know what we don't need in an audio representation. We know what's not going to be useful for the classification task, and this is going to be transformations of the signal that don't affect its class, and we want these transformations not to affect our representation either. This can be time shifting, this can be frequency transposition and so on. On the other hand, we want to keep as much discriminability as possible. Just a quick overview of my talk: I'm first going to talk about time-shift invariance and how it applies to audio representations, specifically the scattering transform, and then look at how the scattering transform is able to capture certain discriminative properties of audio signals.
After that, we're going to look at transposition invariance and how we can extend this representation to get something that is invariant to pitch shifts, which gives us another type of scattering transform. Finally, we apply these to two different audio classification problems. The way we formalize this mathematically is we take our signal and shift it in time by a constant, and that's our time shifting. What we say is that a problem is invariant to time shifting if a given signal is still in the same class after we shift it in time, and then we require our representation also to be invariant. We also have a more general class of transformations of the signal, which are time warpings. This corresponds to taking a signal and deforming its position slightly, so it doesn't have to be a constant displacement, which would correspond to a translation; it can vary with position. This means that it also includes dilations. A local time warping like this is going to be a deformation, and the larger the deformation, generally, the more different the sound will be; it is able to induce changes in pitch and changes in duration. What we want to say is that we don't want to be invariant to time warpings. Instead, we want to be stable, in the sense that small time warpings, which have little effect on how we perceive the sound and little effect on the class, should have little effect on the representation, but for larger time warpings, where the sound is going to sound very different, we want the representation to change a lot. So we have what's called a Lipschitz continuity condition, or a stability condition with respect to time warping, that we are also going to require. This is, in fact, a much stronger condition. We have lots of representations that will satisfy time-shift invariance, but not necessarily time-warping stability. An example of this is the spectrogram, which is commonly used, but not often for classification directly. The reason for this is that while it is time-shift invariant, in the sense that a time shift that is small with respect to the window size doesn't change it much, we don't have stability to time warping. Specifically, if we take a pure dilation, where we have a linear time warping, we see that for a sample signal like this, where we have a harmonic stack, in the low frequencies the original signal and the warped signal overlap significantly, but in the higher frequencies we're not going to have the same thing, because the shift is going to be proportional to the center frequency. A solution to this is to average the spectrogram in frequency, which gives us the Mel-frequency spectrogram. And the important thing is that when we average it, we want it to be constant Q at high frequencies, because this constant-Q filter bank is going to ensure that we compensate for the increased movement of the high-frequency components. The result is that after averaging we have something that overlaps just as much in the high frequencies as it does in the low frequencies, and we can see that for this example the difference between the Mel-frequency spectrograms is going to be on the order of epsilon. What's interesting is that this representation has been motivated mainly by psychoacoustic and biological considerations, by saying that this is something that we see generally in the cochlea, but it turns out that there's also a mathematical justification, in terms of time-warping stability, for why we want to have this type of averaging.
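To make the constant-Q averaging concrete, here is a minimal NumPy sketch (not the speaker's code): it averages a magnitude spectrogram with triangular filters whose bandwidths grow in proportion to their center frequencies, and compares a harmonic stack with a 1 percent dilated version. The filter count, sample rate and test signal are made-up values for illustration.

```python
import numpy as np

def constant_q_average(spec, freqs, n_filters=40, f_min=100.0):
    """Average a magnitude spectrogram along frequency with triangular filters
    whose bandwidth grows in proportion to their center frequency (constant Q),
    the averaging that stabilizes the spectrogram to small time warpings."""
    centers = np.geomspace(f_min, freqs[-1], n_filters + 2)
    out = np.zeros((spec.shape[0], n_filters))
    for k in range(1, n_filters + 1):
        lo, c, hi = centers[k - 1], centers[k], centers[k + 1]
        w = np.minimum(np.clip((freqs - lo) / (c - lo), 0, 1),
                       np.clip((hi - freqs) / (hi - c), 0, 1))  # triangle on [lo, hi]
        out[:, k - 1] = spec @ w / max(w.sum(), 1e-12)
    return out

def spectrogram(x, win_len=1024, hop=512):
    win = np.hanning(win_len)
    frames = np.stack([x[i:i + win_len] * win
                       for i in range(0, len(x) - win_len, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))

sr, T = 16000, 16384
t = np.arange(T) / sr
harmonics = lambda eps: sum(np.sin(2 * np.pi * 200 * (k + 1) * (1 + eps) * t)
                            for k in range(20))
freqs = np.fft.rfftfreq(1024, 1 / sr)
S0, S1 = spectrogram(harmonics(0.0)), spectrogram(harmonics(0.01))  # 1% dilation
M0, M1 = constant_q_average(S0, freqs), constant_q_average(S1, freqs)
print("raw spectrogram change:", np.linalg.norm(S1 - S0) / np.linalg.norm(S0))
print("constant-Q averaged change:", np.linalg.norm(M1 - M0) / np.linalg.norm(M0))
# The averaged representation changes much less under the small dilation.
```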
As a summary, whenever we have a signal and we want to create a representation of it, if we use a spectrogram we're going to have problems: we're going to have invariance to time shifting, but instability to time warping, and a constant-Q average will stabilize it. To look at this a little deeper: the problem with the Mel-frequency spectrogram is that it's not very useful, by itself, for characterizing larger-scale structures. To see why this is, we're going to slightly reformulate the definition. The way we calculate a Mel-frequency spectrogram is that we take a spectrogram and average it in frequency, like we saw just before. Another way to get something that's very similar, not exactly the same but containing a similar type of information, is to take these filters that we use to average and instead convolve the original signal with them, which gives us a time-frequency representation that is very high resolution at high frequencies and much more regular at low frequencies. Then we average this in time to get a regular time-frequency grid, like in the case of the Mel-frequency spectrogram. There is a way to show mathematically that we have an equivalence between the averaged filter response amplitude and the Mel-frequency spectrogram, but I'm not going to go into that right now. A way to view this is to say that we have a wavelet transform. Now, a wavelet transform is sometimes thought of as this sort of dyadic decomposition where you have one octave that's taken up by the wavelet coefficients, and then you decompose the low-pass filter output, take the highest octave of that, and so on, but there's actually a lot more flexibility that we can put into it. For example, we can have a much narrower response in frequency, which gives us this constant-Q wavelet filter bank. It's going to be constant Q in the high frequencies and linear in the low frequencies, but it turns out that this is not necessarily that big of a problem. We're still going to call them wavelets, even though they're not exactly wavelets, but they have a lot of the nice properties that we need from them. What's also interesting is that this is going to give us basically a Mel frequency scale as well, just by having this type of constant-Q filter bank. Using this we can define a wavelet transform, which is a low-pass filter that captures the low frequencies that are not in the bandpass filters, together with the bandpass filters. If we make sure that we cover the whole frequency axis, we get an invertible transform. We can now say that the Mel-frequency spectrogram can be thought of as a time-averaged wavelet coefficient amplitude. If we look at the wavelet decomposition, or the modulus of it, we see that when we average it in time we lose a lot of information in terms of temporal structure. This is one of the reasons that Mel-frequency spectrograms are not very good if we want to completely characterize a large-scale structure: if we increase the averaging scale, we're going to get something that's more invariant, but we're not going to get something that captures the temporal dynamics of the signal, so we have a loss of temporal structure. We need to capture this fine structure, and the question is how.
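A rough sketch of the constant-Q analytic filter bank and of these first-order coefficients, |x * ψ_λ| averaged by a low-pass φ, implemented with FFTs in NumPy; the Gaussian (Morlet-like) filters, the Q value and the cutoff frequencies are illustrative choices, not the ones used in the talk.

```python
import numpy as np

def morlet_filterbank(N, sr, Q=8, f_min=50.0):
    """Analytic, roughly constant-Q band-pass filters defined directly in the
    frequency domain (Gaussian bumps on positive frequencies only), plus a
    Gaussian low-pass phi.  A simplification of the wavelets in the talk."""
    freqs = np.fft.fftfreq(N, 1 / sr)
    centers = []
    fc = sr / 4
    while fc > f_min:
        centers.append(fc)
        fc /= 2 ** (1 / Q)                 # Q filters per octave
    psis = []
    for fc in centers:
        bw = fc / Q                        # bandwidth proportional to center => constant Q
        psis.append(np.exp(-0.5 * ((freqs - fc) / bw) ** 2) * (freqs > 0))
    phi = np.exp(-0.5 * (freqs / f_min) ** 2)
    return np.array(psis), phi, np.array(centers)

def first_order_scattering(x, sr, Q=8):
    """S1 x(t, lambda) = (|x * psi_lambda| * phi)(t): wavelet modulus followed
    by time averaging -- information comparable to a Mel-frequency spectrogram."""
    N = len(x)
    X = np.fft.fft(x)
    psis, phi, centers = morlet_filterbank(N, sr, Q=Q)
    U1 = np.abs(np.fft.ifft(X[None, :] * psis))                       # scalogram |x * psi|
    S1 = np.real(np.fft.ifft(np.fft.fft(U1, axis=1) * phi[None, :], axis=1))
    return U1, S1, centers

sr = 16000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
U1, S1, centers = first_order_scattering(x, sr)
print(U1.shape, S1.shape)   # (n_channels, n_samples) each
```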
One approach that has been used is modulation spectrograms, but the problem is that if you take a modulation spectrogram, which consists of taking a spectrogram of each of the separate channels, we're going to run into problems again, because we're going to have the instability to time warping. We need to look at a different method for capturing this high-frequency information.

>>: Just so we're clear, these pictures on the right is [indiscernible] normal [indiscernible] and on the left a series of sort of actual data [indiscernible]

>> Joakim Andén: Exactly. And then they are averaged to get a Mel-frequency spectrogram. It's not technically the same. Just to digress a little bit: the information is contained. If we reduce the window size, we do retain more information in the Mel-frequency spectrogram, in the sense that we do have the sequence of vectors, but the problem is that we're then going to need a classifier model on the back end to learn which specific sequences correspond to a certain class and so on, and this is what the HMM does. What we want is to relieve the model of the responsibility of learning the temporal dynamics. We want a representation that covers a large duration of time but that captures as much information as possible. And we see that if we try to do that with a Mel-frequency spectrogram, we're going to run into trouble, because we're just going to lose more and more information.

>>: [indiscernible] Mel frequency spectrogram, [indiscernible] going to destroy the information?

>> Joakim Andén: It's not going to destroy the information, but it's going to capture more information than we would like. It's going to capture a lot of details that are not necessarily important, and this is kind of the idea of stability to time warping. If we're unstable with respect to time warping, that means that we sense changes in the signal that are not necessarily that important, that correspond to these high frequencies that can move very quickly even though there's not a big change going on. Because the modulation spectrogram takes a spectrogram of these signals, and the spectrogram has this instability property, we're going to run into problems. Now, what you can do is take the modulation spectrogram and average it along modulation frequencies using a constant-Q filter bank, and then you get (some people have been doing this) a kind of constant-Q modulation spectrogram. Then you do have a stable representation, and that's going to be similar to what we're going to end up doing, actually.

>>: [indiscernible]

>> Joakim Andén: Yes. I mean, he's done both, I think. I think he calls it modulation scale when he does the constant-Q averaging.

>>: That's okay.

>> Joakim Andén: Yeah. It's better. It's much better, because then you lose the sensitivity that you don't really want. That's similar to what we are going to end up doing. The issue is that when we take this wavelet modulus and we average it, we are losing information, so we're going to see how we can recover that information. As a small sidebar: these filters that we are considering are analytic filters, so they have no negative frequencies. This means that when we take the modulus, we're actually going to get a somewhat nice envelope estimate of the signal, because it's going to correspond to taking a real wavelet coefficient and computing a Hilbert transform of it. So when we take the modulus, we're going to get the Hilbert envelope of each subband.
Then, to get what's equivalent to -- did you have a question? No. To get our Mel-frequency spectrogram, we average this in time, and so we basically just recover the low frequencies of this envelope. The rest of the information is going to be contained in the high frequencies, and since the wavelet transform is invertible, we can think of this low-pass part as just the low-pass part of a wavelet transform. We know that the rest of the information is contained in the high-frequency coefficients of a second wavelet transform that we apply to the envelope. The problem is that these coefficients don't have the invariance and stability conditions that we wanted, so we take the modulus again and we average, and we get what are called second-order scattering coefficients. These we call first-order scattering coefficients, and these second-order scattering coefficients. We have as many first-order coefficients as we have first-order frequencies, and these correspond to acoustic frequencies; each corresponds to a certain subband that we're looking at. Then we have this second decomposition, which corresponds more to something like modulation frequency: we look at each subband, decompose it, and look at the different types of oscillations present in there. Another way to formulate this is to look at a very similar transform, which is a nonlinear transform, the wavelet modulus transform. We take the wavelet transform and apply the modulus to the wavelet coefficients, and this gives us a basic building block for constructing these types of coefficients. We essentially get a cascade that we apply to our original signal x, and out of that we get an average, which is just the average of the signal, which in audio we're not going to care about most of the time; it's going to be 0. Then we get these first-order wavelet modulus coefficients, as we call them. Then we reapply the operator to these coefficients and we get what are called first-order scattering coefficients and second-order wavelet modulus coefficients. Then we reapply, get second-order scattering coefficients, and so on. Technically this can go on to infinity, but we're going to see that this won't be necessary, thankfully. So the structure of this is basically a convolutional network: we have convolutions followed by nonlinearities followed by convolutions. One of the differences is that here we capture information at each layer, as opposed to just at the end, and also the filters aren't learned. Instead, they're fixed to be wavelets. And this is not meant to replace any sort of deep neural network methodology. Rather, it's a complementary approach. The reason that we don't learn here is because we know that we need the invariance condition and the stability condition, and once we have a representation that satisfies these, the idea is that we can plug it into a deep neural network that will then have fewer problems learning these invariance and stability properties, because we already encoded them in the representation.
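Continuing the sketch above (and reusing its hypothetical `morlet_filterbank` and `first_order_scattering` helpers), this is one way the cascade could look in code: a second, coarser wavelet decomposition of each envelope channel, followed by the modulus and the same time averaging. The Q values, the averaging scale and the pruning rule are simplified guesses, not the talk's settings.

```python
import numpy as np
# assumes morlet_filterbank() and first_order_scattering() from the sketch above

def second_order_scattering(x, sr, Q1=8, Q2=1, f_min2=2.0):
    """Cascade of wavelet-modulus operators:
       U1 = |x * psi_{l1}|, S1 = U1 * phi,
       U2 = ||x * psi_{l1}| * psi_{l2}|, S2 = U2 * phi."""
    N = len(x)
    U1, _, c1 = first_order_scattering(x, sr, Q=Q1)
    psis2, phi, c2 = morlet_filterbank(N, sr, Q=Q2, f_min=f_min2)
    # Re-average U1 with the coarse low-pass phi so that both orders share the
    # same (large) averaging scale, roughly T ~ 1 / f_min2.
    S1 = np.real(np.fft.ifft(np.fft.fft(U1, axis=1) * phi[None, :], axis=1))
    S2 = {}
    for i, f1 in enumerate(c1):
        V = np.fft.fft(U1[i])
        for j, f2 in enumerate(c2):
            if f2 >= f1 / Q1:          # second-order band above the envelope
                continue               # bandwidth of psi_{l1}: negligible, skip
            U2 = np.abs(np.fft.ifft(V * psis2[j]))
            S2[(f1, f2)] = np.real(np.fft.ifft(np.fft.fft(U2) * phi))
    return S1, S2, c1, c2

sr = 16000
t = np.arange(2 * sr) / sr
# Same carrier, two different envelopes: S1 is nearly identical for both once
# averaged at this scale, while S2 separates them (the tremolo produces large
# coefficients in the modulation band near 8 Hz for the channels around 440 Hz).
smooth  = np.sin(2 * np.pi * 440 * t)
tremolo = np.sin(2 * np.pi * 440 * t) * (1 + 0.9 * np.sin(2 * np.pi * 8 * t))
S1a, S2a, _, _ = second_order_scattering(smooth, sr)
S1b, S2b, _, _ = second_order_scattering(tremolo, sr)
```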
>>: [indiscernible] you talk about [indiscernible]

>> Joakim Andén: It's the modulus. It's a simple modulus. That's all it does.

>>: [indiscernible] condition do you get invariance?

>> Joakim Andén: If we remove the modulus, our invariants are going to be all 0, because we're going to have all sorts of linear operators, and the only linear invariant that you can get from a signal is its average. We will get 0. So the nonlinearity allows us to create more and more invariance from a signal.

>>: [indiscernible] particular type of linearity, nonlinearity is [indiscernible] some other kind of discernible

>> Joakim Andén: Yeah. I mean, you could; we haven't done many experiments on this in audio, we have a few, and it turns out that the choice of nonlinearity is not all that important. You could replace it with a local max. People have been doing other things with images, and it turns out that the difference is not that important. The modulus is nice from a mathematical perspective, because it turns out that if you want this stability condition and you want your transform to be contractive, you're restricted to a pointwise unitary operator, so it's going to be the modulus, maybe multiplied by some complex number. There are ways of showing that, but experimentally it doesn't seem to matter all that much.

>>: So this only guarantees invariance. How can you make sure that you still preserve information throughout this combination?

>> Joakim Andén: Yeah. I'm going to get to that, because it's not entirely obvious that this is discriminative. The idea is that whenever we lose information from averaging, we recover some of it with the high frequencies of the wavelet decomposition. But then we lose information again, and so we recover it again. The idea is that whenever we lose information we recover some of the lost information, and that's how we remain discriminative, but it's…

>>: [indiscernible] understand what is the meaning of that last [indiscernible]

>> Joakim Andén: Yeah. That's an averaging in time.

>>: It's just averaging in time?

>> Joakim Andén: Yeah. All of these convolutions are in time, so here we capture just the average of the signal. Here we take the signal, we take a certain bandpass filter, we take the modulus, we compute the envelope, and then we average it in time to get something that is invariant and stable. And then, since we've lost information here in the average, we recover it in the high frequencies here in the second wavelet decomposition, take the modulus and average once again. It's to get these invariants that we need to average, because otherwise we're very sensitive to shifting and warping.

>>: [indiscernible]

>> Joakim Andén: Yeah. These are both; the first one is a Mel-scale filter bank, sort of. It has about eight wavelets per octave. It turns out that we don't need to use the same wavelets in the second decomposition as in the first decomposition, so here it's more like a traditional wavelet filter bank: one wavelet per octave, or two, is what we found works best. These are the basic parameters of the representation: how many wavelets per octave, and how large you want to make the averaging. Yeah?

>>: Are you going to cover, I think it was Lipschitz, the dilation, how that comes into play in terms of…

>> Joakim Andén: A little bit, yeah. I think it's the next slide. Yeah, hold on. Yes?

>>: Just one question. With the exception of the modulus this looks like a dyadic filter bank.

>> Joakim Andén: It's not completely dyadic because we have, well, it depends on what you mean.

>>: I mean if you were to choose a weight that is such that [indiscernible] scaling function [indiscernible] to the previous [indiscernible]. And you don't, you no longer have the [indiscernible] does that matter in this case?

>> Joakim Andén: I'm not quite sure what you mean.

>>: Perhaps we could go over it off-line?
>> Joakim Andén: These wavelets are not necessarily the same wavelets that you would have in an orthogonal wavelet decomposition. In fact, these wavelets are very redundant, and we would like them to be. If we had wavelets that were orthogonal, we would run into problems, it turns out. These are actually very redundant wavelets: they overlap significantly in frequency and they have a nonzero scalar product. It's not necessarily the same framework that you would have in a dyadic or orthogonal wavelet decomposition.

>>: You try to minimize the amount of [indiscernible] and redundancy and here you are trying to get more of it?

>> Joakim Andén: You want to be more redundant, because if you are more redundant you are more flexible with respect to these types of changes. A time warping won't suddenly shift one coefficient into another bin and leave you completely lost. So you want a certain amount of redundancy. It also makes sure that when you take the modulus you don't lose quite as much information, so the fact that we are redundant has all sorts of nice properties. These are wavelets, but wavelets in a very general sense. They're wavelets in the sense that they're dilations of a mother wavelet, and they're barely even that in this case. The important thing is that they're constant Q. That's what we care about, so it's less related to orthogonal decompositions and so on. We can continue this cascade, and for a given m we get an mth-order scattering coefficient. Put all of these together into one big vector and we get what is called the scattering transform, or scattering representation. We define a norm on it, and with this norm we get this theorem, which proves the Lipschitz stability condition that we were talking about earlier. The reason this holds is basically that if you have a wavelet and you apply it to a deformed signal, you can do a change of variables so that it's equivalent to applying a deformed wavelet to the original signal. This is not going to be drastically different from the original wavelet, because a deformed wavelet and the original wavelet are going to be relatively close in terms of the Euclidean norm. This would not be true for the Fourier decomposition, for example: a sinusoid and a deformed sinusoid are going to diverge significantly at some point. That is kind of the heart of the argument; it's a proof that runs for 10 pages, so don't ask me to go into detail, but that's the basis of why it works. So we have this nice stability condition for this representation.
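For reference, the stability statement alluded to here has, in the published scattering papers, roughly the following schematic form; this is a paraphrase from memory rather than the exact theorem and constants on the slide.

```latex
% Schematic form of the time-warping stability bound (a paraphrase, not the
% exact statement on the slide): for a warping x_\tau(t) = x(t - \tau(t))
% with \sup_t |\tau'(t)| < 1,
\| S x_\tau - S x \|
  \;\lesssim\;
  \| x \| \left( \frac{\sup_t |\tau(t)|}{T} + \sup_t |\tau'(t)| \right)
% where T is the averaging scale: pure translations (constant \tau) have a
% vanishing effect as T grows, and small deformations (|\tau'| small) produce
% proportionally small changes in the representation.
```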
We also have lots of other nice properties, like conservation of energy. If we have a wavelet transform that conserves energy, we also have, as a consequence, that the amount of energy in each order decays, and so it turns out that for most applications we can just look at first- and second-order coefficients. We also have that when we demodulate, we get something that's essentially low frequency, and so we don't need to keep the same sampling rate as in the original signal; we get something like the graph I was showing earlier, where the high frequencies were very irregular and the low frequencies were very regular, so it lets us save in computation and space. It also means that once we decompose this using the second-order wavelet decomposition, we don't need to take -- here we need to take a lot of high frequencies, but here we can get away with just looking at the low frequencies, and we see this, for example, looking at the second-order decomposition here, where we plot the first-order frequency, the acoustic frequency, against the second-order frequency, the modulation frequency. We see that a lot of these coefficients are going to be 0, so we can neglect to compute them. The result is that we get an algorithm that runs in about n log n to calculate these coefficients. The question is what sort of information we have in the scattering transform. We've seen that we are discriminative in the sense that we capture a lot of the high frequencies, a lot of the temporal structure that we've lost, but what exactly does this constitute? We're going to look, in the case of audio, at a simple model which consists of an excitation filtered by an impulse response and then modulated in time by an amplitude A. The question then becomes what sort of information we can see is recovered by the first- and second-order coefficients in this type of model, which is a relatively constrained model, but it still allows us to look at some simple phenomena. For example, here we have three sounds with the same harmonic excitation but different amplitude envelopes: first a smooth envelope, then a sharp envelope with a sharp attack, and then a sinusoidal amplitude modulation, a tremolo. We can listen to them. We hear that they sound quite different, and we see here in the scalogram, the wavelet modulus decomposition, that they look quite different. But when we calculate the scattering coefficients, what we're going to do for each of these channels is isolate it and average it in time, and so we're going to lose the distinct characteristics of each of these envelopes, and they're going to turn out to look more or less the same when the window size is large enough. If we need a certain amount of invariance, we're not going to be able to capture this information directly. This is essentially what we would see in the Mel-frequency spectrogram with this window size. However, if we compute the second-order coefficients, which are this signal decomposed using a second wavelet decomposition, then the modulus, and then averaged in time, we do capture the differences in the envelope, in the sense that here we have mainly low frequencies, and that's represented here in this display of the second-order coefficients. Here we have a transient, so we have high frequencies, and here we have the sinusoidal modulation, so that's represented as a maximum around that frequency. We see that we capture these different properties of the envelope in the second-order coefficients, whereas the first order will mainly give us information about the type of excitation and the filter, the spectral envelope, the filter h.
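A small NumPy sketch of the kind of test signals described here, built from the model x(t) = A(t) · (e * h)(t) (with * denoting convolution), one harmonic excitation and three amplitude envelopes; all parameter values are illustrative guesses rather than the ones used for the slides.

```python
import numpy as np

# Toy signals for the source-filter-modulation model x(t) = A(t) * (e conv h)(t):
# same harmonic excitation e and filter h, three different amplitude envelopes A.
sr = 16000
t = np.arange(2 * sr) / sr

# Harmonic excitation at a 200 Hz fundamental; the decaying harmonic weights
# play the role of the spectral envelope of the filter h.
f0, n_harm = 200.0, 30
eh = sum((0.9 ** k) * np.sin(2 * np.pi * f0 * (k + 1) * t) for k in range(n_harm))

envelopes = {
    "smooth":  np.exp(-0.5 * ((t - 1.0) / 0.3) ** 2),    # slow Gaussian bump
    "attack":  np.exp(-4.0 * t),                         # sharp onset, exponential decay
    "tremolo": 0.5 * (1.0 + np.sin(2 * np.pi * 8 * t)),  # 8 Hz amplitude modulation
}
signals = {name: A * eh for name, A in envelopes.items()}
# With a long enough averaging window, first-order (Mel-like) coefficients of the
# three signals are nearly identical; the second order separates them: mostly low
# modulation frequencies, a broadband transient, and a peak near 8 Hz, respectively.
```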
And we can replace this with a stochastic excitation, a white noise excitation for e, and we're going to end up with similar results. We can hear that these are also relatively different-sounding signals. But again, the first order does not necessarily see any difference between these three signals, whereas in the second order we're able to differentiate between the three. The difference between this and the harmonic case is that here we do have a low-level noise floor, due to the stochastic excitation, which is also seen in the second order. We do have a bit of bleed-over there, but we do capture the information on the envelope as well.

>>: So [indiscernible] audio is two orders?

>> Joakim Andén: Yes. For right now we've only looked at two orders, and in classification it's usually enough to get good results. I'm going to get to that in a little bit.

>>: You showed, on the bottom figure in the previous slide, what kind of value was it evaluated for lambda one?

>> Joakim Andén: Yes, that's important. I fixed one lambda one for this particular display, so it's at this frequency, at 2400 Hz. But in reality you have a whole sequence of these displays; you would need some sort of 3-D representation, which is difficult to do. So yeah, this is one fixed frequency channel that I showed this for, but you're going to see something similar in all of them, because the envelope is the same at all frequencies in this model.

>>: So how many dimensions do you have for the lambda one?

>> Joakim Andén: It depends. In the tests that we looked at, it could be around 40, 50 to 70, and in the second order it could be in the hundreds, around 200. I calculated this for a very dense sampling, so this is not representative.

>>: [indiscernible] 100, do you mean for each value in the first-order you have 100 and then…

>> Joakim Andén: No. For each time frame, for each point in time, I have about 50 first-order coefficients and about 200 second-order coefficients. It depends on the parameters, but on that order, yes. We can also look at frequency modulation, as you heard just now. We have a sinusoidally modulated pitch and a constant pitch with a sinusoidal amplitude modulation, and this is also represented in the second order in terms of this harmonic structure here, which is different from the amplitude modulation case where we just had the sinusoid. The problem is that we can also create an amplitude modulation here, which sounds different from the frequency modulation but gives us very similar second-order coefficients, and so this shows us that there is a certain drawback to this type of scattering representation: it doesn't really show the difference between AM and FM, but it will detect the presence of either one.

>>: Those would be the same for every lambda one?

>> Joakim Andén: No, they would not, exactly. For this fixed one it would be the same, but it's not possible to make them exactly the same for most of them. They're difficult enough to tell apart that I think it will cause a problem in modeling them. You don't have a very clear distinction between the two. But again, this is just for a fixed lambda one. The final example that we're going to look at is frequency components interfering. If we have two notes that are played together, as opposed to being played separately, in the first case they will interfere and you'll have a beating phenomenon in the envelope that's going to give us a maximum in the second-order coefficients. We can listen to them. In a sense, the dissonance that you hear between the two notes is captured here in the second-order coefficients, a certain kind of roughness.
Another way to think about this is to say that since the first-order filter bank has a certain bandwidth, we lose information about absolute position, but the second order is going to give us information about the relative position of different frequency components in each frequency band. To summarize in terms of this model, what we can see is that first-order coefficients tend to capture very short-time structure, like the excitation and the filter, which in these cases is up to around 10 milliseconds of duration, whereas the second-order coefficients are going to characterize larger-scale structures like amplitude modulation, frequency modulation and interference. We have this kind of separation of scales between the first- and second-order coefficients and what they characterize. Another way to understand what type of information is captured in this representation is to look at reconstruction from scattering coefficients. We have a result that says that for certain wavelets we can, in fact, invert the wavelet modulus operator, and since the scattering transform is a cascade of this operator, we can cascade this to get an inverse of the scattering transform. The problem is that calculating this inverse is a nonconvex optimization problem. It can be relaxed somewhat, but it's still intractable for the types of audio signals that we're looking at, so what we're actually doing is using an approximate heuristic approach, a basic alternating projection algorithm, like a Griffin and Lim sort of approach. So to get the original signal we can cascade this inverse wavelet modulus operator. The problem is that at the end we won't have the necessary wavelet modulus coefficients, and so we have to use a deconvolution to get our first-order estimates. There are errors involved in this process, obviously, and since this is approximate, this is not a perfect reconstruction. I would not recommend using this in some sort of fancy audio codec or anything, but what's nice is that it gives us an idea of what type of information is captured in the first as opposed to the second order. We can see this for a speech signal, where we have the original signal and the reconstruction from just the first order, where you see that certain parts have been smoothed out; you lose a lot of the transients that we have here, for example. But if we add the second-order coefficients to the first-order coefficients and reconstruct from that, we do have a large amount of artifacts, but we hear that the attacks and the transients have been restored to a certain extent in the reconstruction from the second order. This kind of confirms what we saw with the models: we do capture that type of temporal dynamics in the second order.
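A rough sketch of the kind of alternating-projection inversion mentioned here, applied to a single wavelet modulus layer and reusing the hypothetical `morlet_filterbank` helper from the earlier sketch; it is a stand-in for the heuristic the speaker describes, not his implementation, and convergence is not guaranteed.

```python
import numpy as np
# assumes morlet_filterbank() from the earlier sketch

def invert_wavelet_modulus(M, psis, n_iter=200):
    """Alternating-projection (Griffin-and-Lim-style) phase retrieval: look for
    a real signal whose analytic wavelet coefficient moduli match M
    (one row per channel)."""
    n_ch, N = M.shape
    denom = np.sum(np.abs(psis) ** 2, axis=0) + 1e-8     # Fourier-domain LS inverse
    x = np.random.randn(N)
    for _ in range(n_iter):
        W = np.fft.ifft(np.fft.fft(x)[None, :] * psis)   # analysis
        W = M * np.exp(1j * np.angle(W))                 # keep phases, impose moduli
        X = np.sum(np.fft.fft(W, axis=1) * np.conj(psis), axis=0) / denom
        # Factor 2: the analytic filters only cover positive frequencies of a
        # real signal, so the real part carries half the amplitude.
        x = 2.0 * np.real(np.fft.ifft(X))
    return x

sr, N = 16000, 16384
t = np.arange(N) / sr
target = np.sin(2 * np.pi * 440 * t) * np.hanning(N)
psis, _, _ = morlet_filterbank(N, sr, Q=8)
M = np.abs(np.fft.ifft(np.fft.fft(target)[None, :] * psis))
x_rec = invert_wavelet_modulus(M, psis)
# x_rec should resemble the target up to sign/shift ambiguity and approximation error.
```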
>>: [indiscernible] if you have m equals infinity, then you can recover the original signal? Is that what it means?

>> Joakim Andén: Yes. If you had a good, yes. The answer is yes.

>>: [indiscernible] to infinity?

>> Joakim Andén: Yes. But in reality, no, because what happens is that each time you take the modulus, you're pushing your energy, your information, to the low frequencies, and so at some point most of your information will be captured by the low-pass filter and you won't have to continue to decompose any more. At that point, your deconvolution is going to be trivial. You're going to be able to recover your wavelet modulus coefficients fine, because your averaging won't have done anything. Then you can theoretically recover the signal. The problem is that we're using a not-very-good algorithm here, and so once we cascade it far enough, we're going to get all sorts of errors that accumulate, so adding the third order to this is not going to improve things. If we had a good inversion algorithm, then yes, it should.

>>: In theory, do you have any proof that as the order gets higher and higher the errors start getting smaller, aside from this [indiscernible]

>> Joakim Andén: In theory, yes, it should be true, but I don't think we proved it or looked into the details of it; intuitively, it should, yes. We have lots of types of variability in audio signals; we're only going to touch on a few of them here. Besides time-shift invariance, we also might want to have frequency-transposition invariance. Here, for example, we have one word pronounced by two different speakers, and we have a difference in pitch, but also differences in formant frequencies, so we can listen to them.

>>: Encyclopedias.

>>: Encyclopedias.

>> Joakim Andén: We hear that they are the same word, but we don't have the same pitch information, and we see here in the formants that we don't have the same formant information. What's especially interesting is that we don't have a pure scaling of the frequencies. We don't have a shift in pitch or a shift in formants that's constant over the whole spectrum. Instead we have a different shift at different parts of it. The distances between the formant peaks vary, and so this corresponds more to a frequency warping than a pure frequency transposition. This tells us that we might not just want to be invariant to frequency transposition, but also stable to frequency warping, in the same way we saw with translations in time.

>>: [indiscernible]

>>: [indiscernible] so how do you [indiscernible]

>> Joakim Andén: The idea is…

>>: [indiscernible]

>> Joakim Andén: Sorry. The idea is that we're not completely invariant to these types of frequency warpings. We're stable to them, which means that a small frequency warping is not going to change your representation very much, so a small change in the formant frequencies is not going to change it very much. A large change is going to change it more, so your representation will hopefully be able to separate the important changes in formant frequencies from the less important ones. But if you have, and you guys probably know this better than I do, a formant frequency that shifts slightly and that changes the identity of your vowel, then you're going to run into trouble with this type of thing.

>>: [indiscernible] constrain your warping situation is that these special cases [indiscernible] occur [indiscernible]

>> Joakim Andén: The answer is not explicitly, but what's nice with this continuity condition is that it tells us that we have continuity with respect to these warpings, and what that means is that a warping is going to be realized locally as a linear transformation in the representation space. What's nice then is that if we have a linear discriminative classifier at the back end, it's going to be able to project out certain directions and in that way create a real invariance to certain time warpings, while maybe increasing sensitivity to other time warpings.
Since we linearize time warpings, or frequency warpings in this case, this classifier should be able to decide which ones are important, given enough training data, of course. So that's kind of a nice property that comes out of it. If we had pure invariance, then, yeah, that would become a problem, because then you might run into these types of issues.

>>: I have a remote question all the way from California. [indiscernible] was actually in the Bay Area so Malcolm [indiscernible] asking me a question. I believe the [indiscernible] transform only makes sense as an envelope follower if there is only a single component in the subband signal [indiscernible]. So does the theory still make sense if the -- I guess his question is does this theory still make sense if the [indiscernible] transform does not actually give you the truth?

>> Joakim Andén: The idea here is that it's nice to formalize it as looking at the envelope of a specific frequency component, but it's not necessarily crucial for it to be exactly an envelope, and we saw, for example, in the cases where we do have two frequency components in the same filter response, that we don't get the envelopes of each one separately. Instead we get the beating phenomenon between them, which in our case is nice, because it gives us information about the internal frequency structure within the filter, which we would not necessarily have if we just looked at the individual envelopes present in there. It's not necessarily dependent on having an exact estimate of the envelope. What's important is having a nonlinearity that brings us to the lower frequencies. It just happens that in some cases we can interpret it as an envelope, when we have just one frequency component. I hope that answers your question. [laughter]. Okay. There have been some approaches to getting this type of transposition invariance. A first perspective is to say that a frequency transposition, which is a scaling in frequency, is going to be a translation in log frequency. At high frequencies, Mel frequency corresponds to log frequency, and so at these frequencies we can simply average the Mel-frequency spectrum and we get something that has this transposition invariance, and this is basically what the Mel-frequency cepstrum does by keeping the low-frequency coefficients of the DCT, which amounts to an averaging along the Mel-frequency scale. The problem is that once we start averaging along frequency, we lose information, and so we can gain some invariance by doing this, but if we try to gain more and more invariance to capture larger and larger scales, we're going to lose more and more information, so we're limited in that sense.
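A minimal sketch of the equivalence mentioned here between keeping low-order DCT coefficients (the MFCC recipe) and smoothing along the Mel/log-frequency axis; the frame itself and the number of retained coefficients are made up for the example.

```python
import numpy as np
from scipy.fft import dct, idct

# Keeping only the first few DCT coefficients of a log-Mel spectrum (the MFCC
# recipe) is equivalent to smoothing that spectrum along the Mel/log-frequency
# axis, which is one way to gain some transposition invariance at the price of
# losing spectral detail.  Toy example with a fake 40-band log-Mel frame:
n_bands, n_keep = 40, 13
log_mel = np.log(1e-3 + np.abs(np.random.randn(n_bands)).cumsum())  # fake frame
c = dct(log_mel, type=2, norm='ortho')
c[n_keep:] = 0.0                      # low-quefrency truncation = the MFCCs kept
smoothed = idct(c, type=2, norm='ortho')
# `smoothed` is a frequency-smoothed version of `log_mel`; a small shift of the
# spectrum along the band axis changes c[:n_keep] much less than it changes log_mel.
```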
Another approach, which you are probably familiar with, is vocal tract length normalization, where you try to realign the spectrum at a given point with that of a reference speaker using a sort of scaling coefficient. An advantage of this is that we don't lose information, because we're simply rescaling the whole spectrum, but it's sometimes problematic to estimate the scaling coefficient, and if we don't estimate it exactly, we don't have the desired invariance property. Another problem is that we don't have stability to frequency warping: as we start considering nonlinear warpings of the spectrum, you run into problems of having a large search space and so on. These approaches have their advantages, but in the first case we lose information and in the second case we have a certain instability. The approach that we're going to take is that instead of averaging along log frequency, we're going to take a scattering transform along the log-frequency axis. We fix a time point t, we go along log frequency, and we compute another scattering transform, and so on. This gives us the separable time and frequency scattering transform, or just the separable scattering transform. It's nice in that it has the invariance that we want to time shifting and frequency transposition, and that, of course, depends on the scale that we consider for this scattering transform. We could decide to have a small scale or a large scale, but it captures more information than if we just did a simple averaging. In experiments we can usually restrict ourselves to the first-order coefficients; it's usually enough at the scales that we consider. The problem is that this approach, and actually the original scattering transform as well, loses a lot of important information, because we decompose each frequency band separately and we average each frequency band separately. For example, for these two signals, the regular scattering transform with a window size that covers the whole signal is going to give the exact same representation, because we have the same frequency bands here as here, but shifted in time. Since we don't capture any information on the correlation between frequency bands, it's not going to be able to tell the difference. We lose this joint time-frequency structure. A way to solve this is, instead of decomposing each frequency band separately, to do as in S. Shamma's cortical representations, where we decompose this time-frequency scalogram in two dimensions, in time and frequency simultaneously, instead of doing as in the separable case, where we decompose once along the temporal domain, then average, and then once along the frequency domain. Here we then have a two-dimensional wavelet that we use to decompose the scalogram. We can then form, like we did before, a wavelet modulus transform, cascade this on the scalogram, and this gives us what's called the joint time-frequency scattering transform, or the joint scattering transform. This has the advantage of having the same invariance and stability as the separable scattering transform, but it captures more information, in that it's able to distinguish between these cases where we have shifted individual frequency components, and so it captures correlations between frequency bands. We could also create a representation that is sensitive to frequency transposition by omitting the averaging along frequency and just averaging along time. This gives us something that has the same invariance properties as the regular temporal scattering transform, but that captures more information.
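A crude sketch of the separable case described above, reusing the hypothetical helpers from the earlier sketches: the first-order scattering output is filtered along its channel (log-frequency) axis, the modulus is taken, and the result is averaged across channels. Treating the channel axis circularly and using only a handful of frequency filters are simplifications, not the configuration used in the talk.

```python
import numpy as np
# assumes first_order_scattering() and morlet_filterbank() from the earlier sketches

def frequency_scattering(S1, Q_fr=1):
    """Separable frequency scattering, first order only: for each time frame,
    filter the channel axis (which stands in for log-frequency) with a small
    wavelet filter bank, take the modulus, and average over all channels.
    Circular along the channel axis for simplicity."""
    n_ch, n_t = S1.shape
    psis, _, centers = morlet_filterbank(n_ch, 1.0, Q=Q_fr, f_min=2.0 / n_ch)
    F = np.fft.fft(S1, axis=0)                       # FFT across channels
    out = [np.abs(np.fft.ifft(F * psi[:, None], axis=0)).mean(axis=0)
           for psi in psis]                          # modulus, then global freq average
    return np.array(out), centers                    # shape (n_freq_bands, n_t)

sr = 16000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 440 * 2 ** 0.25 * t)
_, S1, _ = first_order_scattering(x, sr)
S_fr, _ = frequency_scattering(S1)
print(S_fr.shape)                                    # (n_freq_bands, n_samples)
```

The joint version would instead filter the scalogram with genuinely two-dimensional wavelets over time and log-frequency before the modulus and averaging; this sketch only covers the separable case.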
Finally, we're going to look at some classification results to try to validate some of these claims. The first is to… Yeah?

>>: Sorry, could you go back? What's the difference between this time frequency scattering transform and [indiscernible]

>> Joakim Andén: We were using Gabor filters, yes. It's a Gabor filter bank, yes. This has been done by…

>>: I guess, what is the second [indiscernible]

>> Joakim Andén: The first order is going to be, x here is -- I had it in the previous slide -- x is the scalogram, the wavelet modulus coefficients, the amplitudes of the wavelet coefficients from the first order. The first-order wavelet decomposition stays the same, and then on top of that the second-order decomposition, which includes this low-pass filter and then this wavelet decomposition with the modulus and another low-pass filter, forms the second-order decomposition. It's going to affect the first order as well. But the decomposition is a regular wavelet transform, then a two-dimensional wavelet transform, and then you reapply the two-dimensional one and so on to get the higher orders. We're just going to concern ourselves with the second order for now. Does that answer your question?

>>: Enough for now.

>> Joakim Andén: We can come back. To try this out, we look at the TIMIT task of phone segment classification. What we do is we try to classify a segment where we're given the beginning and the end, so it's technically cheating, but it makes for an easy problem to code and to test, so that's what we did. It does give us interesting results on the different types of representations. We take the scattering transform over about 160 milliseconds, computed using 32-millisecond frames, we put this into one big vector that we then put into an SVM, and we train a classifier with that. The goal is then to try to classify the phone that's in the center of the segment.
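A toy version of that back end (illustrative only, with fake data and made-up dimensions, not the configuration behind the quoted TIMIT numbers): scattering frames for a segment are concatenated into one vector and fed to a kernel SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def segment_features(frames):
    """frames: (n_frames, n_coeffs) scattering coefficients for one segment;
    log-compress and flatten into one feature vector."""
    return np.log1p(np.abs(frames)).ravel()

rng = np.random.default_rng(0)
n_segments, n_frames, n_coeffs, n_classes = 200, 5, 250, 4   # made-up sizes
X = np.stack([segment_features(rng.random((n_frames, n_coeffs)))
              for _ in range(n_segments)])
y = rng.integers(0, n_classes, size=n_segments)              # fake labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X[:150], y[:150])
print("toy accuracy:", clf.score(X[150:], y[150:]))          # chance level on fake data
```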
In the results, we first see that we perform similarly to the Delta-MFCCs with the first-order coefficients. Here L=1 means that we have first order only, L=2 means we have first and second order, and then first, second and third order for L=3. The first-order coefficients perform similarly to the MFCCs, which is what we expect, because they carry a similar type of information. With the second order we do see a certain improvement, of almost 2 percent absolute, which shows us that the temporal dynamics that we capture within each of these frames are actually useful for this type of classification. With the third order we actually do worse, and the reason for this is that when we look at small timescales, as I said earlier, when we apply the modulus we force energy towards the lower frequencies, and that means that once we get to the third order we are already very regular, already very low frequency, and so it's not going to be very important to capture the high frequencies with the wavelets. In the third order we're going to have a lot of noise and not very much information, so it's not going to help us classify in this case.

>>: [indiscernible] does it mean that you feed that into the SVM?

>> Joakim Andén: Yes. In the same way: we take Delta-MFCCs over 32 milliseconds, we concatenate them over this whole big segment, and we feed them exactly as in the scattering case, to make it comparable.

>>: I thought that the error [indiscernible] wouldn't be so low.

>> Joakim Andén: We also include the log duration of the segment. That makes about a 2 percent absolute difference, so that might be why.

>>: What is the state-of-the-art?

>> Joakim Andén: The state of the art I think is Jim Glass, with a committee-based classifier with different segment lengths and so on, and they get 16.7 percent.

>>: [indiscernible]

>> Joakim Andén: Delta-MFCCs and the first order, it's around 50, I think. The second order, like I said, is around 200. The third order, I think, is around 500, but I'm not sure about the third order. I know it's about 200 for the second order.

>>: Some of the features.

>> Joakim Andén: Yeah. But we're also using an SVM, so it's not necessarily the end of the world. When we started this I was using GMMs, because I figured that's what everybody does, and then I found out that the results were not very good. So you are going to have trouble if you try to use some sort of generative model like that, but the SVM does handle it pretty nicely.

>>: The [indiscernible] think about the [indiscernible]

>> Joakim Andén: Yeah. I did some experiments where I had huge vectors of around 10,000 dimensions and it was able to handle it. This is what I was talking about: if we want transposition invariance, what we can do, which is quite nice, is skip the frequency averaging. Since we're using an SVM, the SVM is going to optimize a locally linear discriminative plane at each point, and so if we have enough data it should be able to learn how much it wants to average for each class. That means we don't have to decide how much transposition invariance to force on our representation; instead, we can depend on the SVM to learn it, if we have enough data. So we tried fixing the averaging scale, and also not averaging in frequency and letting the SVM try to learn it, and it turned out that the latter was much better, so that's what these results show. To compare: for the second-order scattering coefficients, it's 17 percent. If we add a scattering transform on top of that, along log frequency, we get 16.5 percent. That's almost 1 percent absolute improvement. And then for the joint case, where we capture more time-frequency structure instead of just temporal structure, we do improve somewhat again over the separable case. We also looked at musical genre classification, because that's something that everybody does and it's nice to be able to compare. This basically consists of having 30 seconds of music and trying to classify it into 1 of 10 genres. What we do is, for each 700 milliseconds, we classify that frame, and we vote over the whole track to get a result. Again, we get similar results for the first order and the Delta-MFCCs. Once we add the second order we get a big improvement, of about 40 percent relative, and the difference between this and the previous case is that here we're looking at much larger timescales. The difference between what the first order captures and what the second order captures is much greater. We have a lot more temporal dynamics that we need to recover in the second order, and so we do have a significant improvement here for the second order. The third order does a little bit better, but we're still not in the regime where we expect the third order to carry a lot of energy and a lot of information.

>>: So think of the [indiscernible] here is [indiscernible]

>> Joakim Andén: State-of-the-art, I can't remember. It's a paper from 2009 where they use modulation spectra, and then they do all sorts of LDAs and sum along certain dimensions, and they get something like a 9.4 percent error rate. I can't remember; I can send you the paper. So this is just using an SVM, kind of feeding it in there. One of our colleagues who was at Princeton previously, and who is now with Stéphane's group, managed to get 8.8 percent using a sparse representation classifier on the second-order coefficients. By being a little bit more clever you can squeeze some more out of it. For the transposition-invariant case, we see that we have a certain improvement with the separable scattering transform. In the case of the joint scattering transform, we actually don't do better.
We do about the same. This can be attributed to the fact that here the issue is not necessarily that we want to characterize very fine joint time-frequency structure. Here the issue is that we have data points that are very far apart and we want to create good invariance between them, and so adding more specific information about the joint time-frequency structure is not necessarily going to help us when what we want is more invariance, not more discriminability, in this case. Yeah?

>>: What is the type of signal for which you would expect that the third order would [indiscernible]

>> Joakim Andén: The third order? You would expect the third order to play a larger role when you are looking at larger timescales. Here we are looking at less than one second, and you can see, by looking at how much energy is contained in the third order, that it is not very important. Once you start looking at larger timescales, then you expect it to do better. The reason that we didn't look at larger timescales in this case was that, even if we added the third order, going beyond 700 milliseconds resulted in degraded performance. That has more to do with the nature of the problem. If we have something that's about 500 milliseconds, we have a greater chance of getting a segment where we have maybe one instrument, and so it's easier to characterize, as opposed to a larger segment where you have to characterize mixtures of instruments. That becomes something more complicated for the classifier to deal with. If we had a problem where we had a very large-scale structure, on the order of maybe 6 seconds, that we wanted to capture, then, yeah, the third order would be expected to do better. But we haven't seen any task where it actually does a lot better yet. That's kind of the thought. In conclusion, looking at invariance and stability conditions for representations is a very powerful framework for analyzing how their different properties relate to their classification performance. The scattering transform provides a representation that has these nice properties, and it turns out that it actually captures important audio phenomena like amplitude modulation, frequency modulation and so on. We can extend the scattering transform to get the joint scattering transform, which captures joint time-frequency structure and also gives us transposition invariance in a discriminative way. And finally, we see that once we test this in numerical experiments, we get state-of-the-art results on the two tasks that we looked at. I should add that on top of this we've also looked at classification of medical signals, in terms of heart rate variability for fetuses. We've also had some nice results in that area, but not in this presentation, unfortunately. So that's it. Thank you very much for listening. [applause]

>>: [indiscernible] examples [indiscernible] speech or mixed speech?

>> Joakim Andén: No. That's one of the next things to look at: to see how we perform in terms of talker separation and in the presence of noise and so on. I mean, TIMIT is very, very nice. It's very clean and there's no noise, so that's not something that we have looked at, but it's something that would be interesting to look at in the future. Also looking at other types of distortion, like clipping or masking and so on, and seeing how robust it is to that type of change and how we can make it more robust to them.
>> Mike Seltzer: Any other questions? Several people will be meeting with him this afternoon; if you did not get in there, I would like you to let me know. All right. Thanks again.

>> Joakim Andén: Thank you. [applause]