>> Jasha Droppo: So today I am pleased to have Khalid here to give us a presentation. He has this technique which I'm interested to learn more about, and he has a long history. Whereas most people in speech are either engineers or computer scientists, we're going to be addressed today by a mathematician who has been at Paris University, spent some time in Montreal, spent some time at MIT as a postdoc, and is currently at INRIA. And it's all yours. >> Khalid Daoudi: Thank you, Jasha. Thanks again for welcoming me here and giving me this opportunity to share this fundamental research with you practical guys. And as I said to Jasha -- not yesterday, but when we had [inaudible] -- I need some honest feedback on what I'm going to talk about, because of course the purpose is to go further into more interesting stuff. What I'm going to present today is part of the Ph.D. work of Vahid Khanagha, who is going to defend in a few weeks, with the contribution of Oriol Pont, who is a top researcher in our group, and Hussein Yahia, my colleague with whom I cofounded the GEOSTAT team. So, the GEOSTAT team -- okay, that's the first slide. >>: [inaudible] >> Khalid Daoudi: [inaudible] The GEOSTAT team in Bordeaux works on nonlinear analysis of complex systems and signals using principles and methods coming from statistical physics, and we have typical applications in geophysics and astrophysics and also some [inaudible] data. What I'm going to talk about today are these new methods that were developed for remote sensing applications; I found at some point that it was worth giving them a try on the speech signal. So the outline of my talk will be the following. I'm going to motivate why we are doing this work -- essentially because of the nonlinear aspects of speech, and because we want to look at speech as the realization of a complex system.
Then I introduce the formalism that we work with, which is called the Microcanonical Multiscale Formalism, or MMF; I'll summarize the basic principles behind this formalism and see how we can apply it to speech signals. Then I will show what we have achieved so far in terms of applications: phonetic segmentation, GCI detection, sparse prediction, and multi-pulse excitation. And then I'll draw some conclusions, and I hope the perspectives will generate a lot of exchange. I don't need to underline -- you all know -- that classical speech processing techniques live in a linear world, inherently making a bunch of assumptions: basically the [inaudible] decoupling between the vocal tract and the speech source, the laminarity of the airflow through the vocal tract, and the periodicity of the vibration of the vocal folds. But it is established theoretically and experimentally that there are several nonlinear effects in the speech production mechanism. Typically, there is a turbulent source signal during the production of voiced fricatives, so the laminar flow assumption fails. There is nonlinear feedback during voicing, and thus the glottal pulses are not exactly periodic and they are skewed in shape. For plosives, the excitation is time-spread and has a turbulent component. And there is coupling between the glottal airflow and the vocal tract system when the glottis is open, so the source and the filter do not function independently. These nonlinear phenomena [inaudible] are emphasized in the case of voice disorders, where you have strong complex nonlinear aperiodicity and turbulent non-Gaussian randomness. So given all this, we would like to look at speech from a nonlinear signal processing perspective. The people who have been interested in nonlinear speech processing were mainly looking at speech in the framework of nonlinear dynamical systems.
And typically, dynamical systems that share characteristics with turbulence are called chaotic, and there has been a bunch of tools developed in this area, which is called chaotic signal processing. I'll give you a brief recap. In chaotic signal processing, the speech production mechanism is supposed to be a dynamical system governed by the equation there: Y is a multidimensional system, but one which is unknown, and the measurements we capture about speech are just a 1D projection of this unknown high-dimensional system. Then in '91 there was this existence theorem, called the embedding theorem, which says that if you take your speech signal, take time delays of the signal, and consider this D-dimensional space, then this multidimensional signal X shares common properties with the unknown dynamics of Y of N. And these common features are typically the correlation dimension, fractal dimensions -- >>: What was S again? >> Khalid Daoudi: S is the speech signal. Sorry, I didn't -- >>: Okay. >> Khalid Daoudi: Yeah. S is the 1D speech signal. >>: [inaudible] some unknown [inaudible]. >>: So where -- oh, okay. So S -- I see. >>: So we could think of Y as the articulators and S as the time series, and then this X somehow is related to the original -- >> Khalid Daoudi: Yeah. There are some common features. And probably the best known of them is the Lyapunov exponent, which is a characteristic of the dynamical system that remains intact under the embedding procedure, and this exponent characterizes the degree of chaos of the system. So if you compute your Lyapunov exponent and find that it is greater than 0, then you can say that your system is chaotic, and you characterize its degree of chaoticity by the value of this positive exponent. When it is negative, it means that the system is not chaotic. Okay. >>: [inaudible] how good is the assumption [inaudible]?
>> Khalid Daoudi: Yeah. I was going to say this here: [inaudible] this theorem doesn't tell you how to choose. It's an existence theorem. It doesn't tell you how to choose the time delay T and the dimension D, and these are very important quantities for getting good estimates of these exponents. Moreover, it assumes stationarity; it's only in that case that these exponents have a real meaning. And as I said, since the embedding theorem is an existence theorem, it is really very difficult to estimate this exponent. There is a bunch of methods to do so, there is no consensus on which is the best way, and you find contradictory results and conclusions about it. And even if you assume that you can estimate them, these quantities are just global quantities that give a global description of the system. I think it's because of this difficulty in estimating these global measurements, and the weakness of the information they gather, that these methods didn't have [inaudible] an impact in nonlinear speech processing. >>: Essentially that just tells you the size of the subspace, right? >> Khalid Daoudi: Yeah, it measures -- >>: Dimensionality. >> Khalid Daoudi: Yeah. And the divergence rate of nearby orbits. So it assesses the complexity of the system without really saying more. So the question that has been asked -- well, not by me, but by physicists -- is: is all this necessary, all these complicated things to deal with? Can we do something simpler and still have more comprehension of our dynamical systems? >>: So when you say is this necessary -- I mean, [inaudible] pretty simple measure, doesn't really tell you anything useful. >> Khalid Daoudi: Yeah. And it's difficult to estimate. >>: Right. >> Khalid Daoudi: Yeah. And it doesn't tell you that much.
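The delay-embedding construction under discussion -- building D-dimensional vectors from time-delayed copies of the 1D signal -- can be sketched in a few lines. This is an illustrative sketch, not code from the talk; the function name `delay_embed` and its arguments are my own. As the speaker notes, the embedding theorem gives no rule for choosing the delay `tau` or dimension `dim`.

```python
import numpy as np

def delay_embed(s, dim, tau):
    """Time-delay embedding of a 1D series s: the n-th embedded vector is
    x[n] = (s[n], s[n+tau], ..., s[n+(dim-1)*tau])."""
    s = np.asarray(s, dtype=float)
    n = len(s) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this (dim, tau)")
    # Stack the dim delayed copies as columns of an (n, dim) matrix.
    return np.stack([s[i * tau : i * tau + n] for i in range(dim)], axis=1)
```

Quantities like the correlation dimension or the Lyapunov exponent are then estimated from the geometry of these embedded vectors, which is exactly where the estimation difficulties the speaker mentions arise.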
But a lot of work in nonlinear [inaudible] has been dealing with this kind of stuff. It turns out that these kinds of methods belong to what physicists call the first phase of complex systems theory, where only global measurements can be assessed. This starts with the Kolmogorov model of turbulence, [inaudible] of structure functions, the canonical representation of fractal or multifractal processes, detrended fluctuation analysis. These methods recognize the existence of interacting multiscale motions, but there is no access to them. Put more simply, they recognize the complexity of the system, but they don't tell you that much about it. Since the '90s there is a new trend in the complex systems community which says: okay, we are now going to try to localize geometrical complexity -- that is, where the complexity emerges and how it organizes itself. And there is a school which adopts the notion of predictability as a way to characterize complexity, and this is where the GEOSTAT team stands. Oh. What's going on? Why is this moving? >>: By hitting space you got it to go [inaudible] slideshow [inaudible]. >> Khalid Daoudi: Oh. Okay. So in this school that adopts predictability as a way to characterize complexity, the basic principle is to say that for a certain class of complex signals and systems there are some thermodynamic observables that have a power-law behavior. Even more, what Antonio Turiel and Hussein Yahia have developed is that this power-law behavior can exist at any point of the signal domain. Typically, in mathematical wording: given your signal S of X, there exists at least one multiscale functional T sub R such that when you analyze it around the point X, you find a power-law behavior in [inaudible]. D here is just the dimension of the signal S, so for speech it will be just 1. And the important quantity here is this H of X, which is called the singularity exponent.
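The pointwise power law just described can be written out explicitly. This is a sketch consistent with the surrounding discussion (the functional evaluated on a ball of radius r around x, the dimension d appearing in the exponent, and the multiplicative factor that later "cancels"); the exact form of the exponent offset is an assumption on my part, not stated verbatim in the talk.

```latex
% Microcanonical power law at a point x of a d-dimensional signal s
% (d = 1 for speech). T_r is the multiscale functional evaluated over
% a ball of radius r around x; h(x) is the singularity exponent and
% alpha(x) the multiplicative factor discussed below.
\[
  \mathcal{T}_r(s)(x) \;=\; \alpha(x)\, r^{\,d + h(x)}
  \;+\; o\!\left(r^{\,d + h(x)}\right), \qquad r \to 0 .
\]
% Taking logarithms, log alpha(x) / log r vanishes as r -> 0, so the
% exponent is recovered from a log-log slope -- which is why alpha
% "cancels" in the estimation of h:
\[
  d + h(x) \;=\; \lim_{r \to 0} \frac{\log \mathcal{T}_r(s)(x)}{\log r}.
\]
```

The second line is the reason, raised in the Q&A below, that the different alpha at each point does not matter: only the power-law scaling survives the limit.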
And the hypothesis is that for this class of systems there exists some geometrical superstructure that completely dominates the dynamics of the system, and that we can access this geometrical structure by analyzing the level sets, or some level sets, which are characterized as the points where the singularity exponent is the same. In other words, you take your signal, you evaluate this multiscale functional, you find these Hs at each point, and then you form the sets of points that have the same exponent -- the level sets -- and some of them are actually the most important quantities that carry information. In this way, the singularity exponent unlocks the relation between geometry and statistics, because basically what they say is: we are going to access this structure geometrically, and this structure contains most of the statistical information. So it's a way to bring together statistical signal processing and nonlinear dynamical signal processing, which is the deterministic counterpart of statistical signal processing. >>: I think this is important, so I want to understand. >> Khalid Daoudi: Yeah. >>: There's a signal S of X, or of time, and you're saying that at some particular point in time that signal has a property that can be expressed by this power law. >> Khalid Daoudi: Yeah. >>: And I should care about that property because that property is measuring how predictable the signal is there. Is that why it's important? >> Khalid Daoudi: Yeah. What is nice about the work of Antonio Turiel and his fellow colleagues is that they provide a way of precisely estimating this H at each point of the signal domain, and it also gives an interpretation of this exponent as a way to quantify local predictability. >>: But okay.
Now, why shouldn't I just be simpleminded and say, well, I'll do an LPC and I'll see what the error is right around here, and that's the predictability -- or some other model that predicts a time series, and just apply the model and see what the predictability is? >>: You can do this, for example. Okay. You can take -- >>: [inaudible] >> Khalid Daoudi: Excuse me? >>: Yeah, why not just do that? If I want predictability, why not just measure it? >> Khalid Daoudi: No, but, for instance, what you are saying is that you are defining your own notion of predictability. In particular you're saying: I'm going to look at the residual, and where I have the highest peaks, those are the least predictable points. Okay? But first, you are assuming a notion of predictability that may or may not make sense. Second, if you take just the highest peaks and use them as the source signal for your filter, it's not going to work that well, because you have polarities in these peaks. So it's all about what you mean by predictability. >>: Okay. So maybe you might get your estimate of H by doing something like LPC, even though that might be a bad or a crude approximation? >> Khalid Daoudi: But don't forget also that there is this notion of scale. If you do LPC, you are [inaudible] of the signal and you are going to have the definition of predictability at the finest resolution. Here -- I mean, I don't want to go into all the hypotheses -- it tells you how the energy concentrates at the point X while it transfers across scales. All this comes from the notion of turbulence and [inaudible] turbulence. >>: So to make sure I'm still following: H is the underlying dynamics of the system that you understand, that you can characterize, and the second term there is all the stuff that you can't characterize, all the noise?
>> Khalid Daoudi: The second term is negligible. I mean, they say that for small scales you can neglect this term here. >>: Okay. So I don't understand where the statistics are coming from. Is it statistics of -- >> Khalid Daoudi: Yeah, I'm coming to this later. This first component is the definition of the singularity exponent. >>: So for each X you can expand your original function into this form, right, for each X? >>: Yeah. >>: But for each X you have a different alpha and a different H. Is that true? >> Khalid Daoudi: Yeah. You have a different alpha, a different H. But alpha actually can cancel, and what governs the dynamics is this H here. There is a way to show -- I mean, this alpha is also not that important. You can make it cancel. >>: So even if they are different, you can still cancel it? >> Khalid Daoudi: Hmm? >>: Even if alpha is different [inaudible]? >> Khalid Daoudi: Yeah. I mean, alpha is there, but you can get rid of it in the estimation of H. And the important quantity, according to [inaudible], is this power-law scaling, not this constant multiplicative factor. >>: Okay. And this only applies as R goes close to -- >> Khalid Daoudi: Yeah. For small scales. >>: By the way, [inaudible] is R just a distance? >> Khalid Daoudi: R is just the scale. >>: R is distance? >> Khalid Daoudi: It's the scale -- the resolution at which you are looking at the signal. >>: [inaudible] okay. >>: So now I'm getting more confused. Because if there are multiple scales, and at each scale you've got your own H, then you've -- [multiple people speaking at once] >> Khalid Daoudi: No, no, no. It doesn't depend on scale. >>: H does not depend on scale. >>: Magical H. >>: Okay. H is a function. >> Khalid Daoudi: Yeah. >>: Okay.
>> Khalid Daoudi: So typically from S we are going to get access to this H of X and see what we can do with it for the moment. Now, the choice of this multiscale functional is important, because if you choose it, for example, just as linear increments, you run into [inaudible] exponents that are mainly used in fractal and multifractal theory. >>: Okay. So these are [inaudible] in the original signal spaces. >>: That's one definition. >> Khalid Daoudi: One definition. But we do not adopt this [inaudible], because these quantities are only directional, they are unstable, and they are not robust to noise. And they don't really give you an interpretation in terms of predictability, only in terms of the geometrical regularity of the signal. These quantities are well known in the canonical framework of multifractal theory, but they are impractical and very difficult to estimate. What Antonio Turiel and his colleagues proposed instead is to define the multiscale functional as a measure, called the gradient-modulus measure, which is defined from a typical characterization of intermittency in turbulence and which actually measures the local dissipation of kinetic energy. So you sum up all the variations of your signal in a ball around X, and then with this measure you try to estimate the H. >>: What is the gradient being taken with respect to? It's the gradient of your one-dimensional signal? >> Khalid Daoudi: Here I'm in whatever dimension. >>: [inaudible] signal with respect to what? >> Khalid Daoudi: With respect to all your dimensions. So here it's -- >>: Oh, okay. So in one dimension, the previous exponent you were talking about would be what approximation for the gradient? Or -- >> Khalid Daoudi: If you write it down, it's not that -- it's -- >>: [inaudible] >> Khalid Daoudi: Yeah.
>>: [inaudible] >>: This one here, that seems like an approximation for the modulus of the gradient as R goes to 0. >> Khalid Daoudi: Well, you can see it as an approximation, but here you sum up all the variations in the ball. >>: I see. >> Khalid Daoudi: And if you want, you can cancel this. But actually just this measure here makes a lot of difference, because they show that it describes the variations around a point better than looking only at directional increments. >>: Is that -- if I got it, that's an integral over a surface -- >> Khalid Daoudi: Yeah. >>: -- [inaudible] is that right? >> Khalid Daoudi: Yeah. >>: Okay. >> Khalid Daoudi: So by taking this multiscale functional, what these people say is that they can characterize the degree of local predictability at each point. How do they do this? They say: if we are correct in saying that there exists this geometrical structure that governs the dynamics, then from this geometrical structure we can have access to all the information about the signal. So this is the original claim. And what they have done is show that for a particular class of signals this is true. So they take -- >>: Wouldn't it be that the bigger the value, the lower the predictability? Because if you're looking at the gradient, that's how fast things are changing. And if things are changing a lot all over the place and you do that integral, then you end up with a big number. And that would seem to mean that it's not predictable. Because if you go over here, you're moving in this direction; if you go over here, you're going in this direction. >> Khalid Daoudi: Yes. Yeah, but this is what it is. So to estimate, for example, an easy way is to look at the log-log ratio, and things reverse. I don't know. >>: If you go back a slide. >> Khalid Daoudi: Yeah. >>: This key [inaudible]. [multiple people speaking at once] >>: That's not the exponent. >> Khalid Daoudi: This is not the exponent. >>: I see. Okay.
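As a rough 1D illustration of the gradient-modulus measure just discussed -- summing the absolute variations of the signal over a ball around a point -- here is a minimal sketch. The function name and the first-difference approximation of the gradient are my choices, not the speakers'.

```python
import numpy as np

def gradient_modulus_measure(s, center, radius):
    """Sketch of the gradient-modulus measure for a 1D signal:
    the sum of |s'(y)| over the ball of the given radius (in samples)
    around `center`, with the derivative approximated by first
    differences."""
    grad = np.abs(np.diff(np.asarray(s, dtype=float)))
    lo = max(center - radius, 0)
    hi = min(center + radius, len(grad))
    return grad[lo:hi].sum()
```

As the questioner observes, this is large where the signal varies sharply inside the ball, which is why the raw measure itself is not the exponent: the exponent comes from how the measure scales with the radius, via the log-log ratio mentioned here.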
>>: That's not a statement of T, that's a statement of [inaudible]. >>: Are you [inaudible] would also be larger, right, [inaudible]? >> Khalid Daoudi: No, no, it's the opposite. It's when [inaudible]. Even if -- >>: I mean, because I is less than 1. Okay. >> Khalid Daoudi: Yeah. It's very small. >>: I don't get [inaudible]. >> Khalid Daoudi: So the claim is that if we look at a particular geometrical structure, we have access to most of the information. And here -- to answer your question about the statistics -- they are going to show that this is indeed true for some classes of signals. You take these exponents -- you assume that you can compute them, of course -- and you look at the so-called most singular manifold. The most singular manifold is the subset of points having the smallest singularity exponents. So you compute your Hs, you take the smallest ones, and you form the subset of points corresponding to the smallest exponents. This is called the MSM, the most singular manifold. And as I said, they claim that this MSM corresponds to the points where information concentrates as it transfers across scales, and that it governs the dynamics of the system. And if we are right, then we should be able to reconstruct. They actually developed a reconstruction method, which is here. For natural images, they say that you can fully recover your signal by looking just at the gradient of the signal restricted to the MSM -- this symbol means the restriction to the MSM -- and diffusing it with an appropriate kernel. In the case of natural images, this kernel is universal and has this form, which complies with the scaling property of the power spectrum of natural images. And there are a lot of applications and strong results in image processing using this formalism. So, I recall, you now have two components.
This measure; the singularity exponents that we obtain from it; and the MSM, the most singular manifold, which allows us to fully reconstruct the signal from only the knowledge of the properties of the signal on the MSM. >>: So what's the definition of fully reconstructed? I don't think -- we still have errors, right? >> Khalid Daoudi: Yeah, in the real world there are still errors; you can never have perfect reconstruction. But they stated they have very good reconstructions. >>: Okay. So you're saying that to the extent the model is correct you have perfect reconstruction? >> Khalid Daoudi: Yeah. >>: And then because the model isn't perfect, you don't have perfect reconstruction -- is that where the imperfection comes from? >>: It's got to be more, because there's this subset. It has to do with the size of the subset. >> Khalid Daoudi: Yeah. Actually, they have shown that in the continuous case you have perfect reconstruction. For real-world signals, as you said, it depends on how much information you gather here. Depending on what kind of signal it is, the cardinality of this set can be small or big, and that gives you a better or worse reconstruction. >>: So for a continuous signal you can completely reconstruct? >> Khalid Daoudi: Yeah. Because actually [inaudible] this MSM is dense [inaudible] in the signal domain, and you can recover everything from it. But this is the continuous case. So if we summarize, what we have is this -- I would say easy-to-compute -- quantity, the singularity exponent. Even though it's not that easy in the case of images: the way they compute it there is a little bit involved, because it uses a notion of local reconstruction to estimate the Hs. They evaluate a local reconstruction around the point to give a value to H.
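Forming the most singular manifold from already-computed exponents can be sketched very simply: keep the points whose H falls in the smallest fraction of values. This is an illustrative sketch; the 5% default quantile is my placeholder, not a threshold given in the talk.

```python
import numpy as np

def most_singular_manifold(h, quantile=0.05):
    """Sketch of MSM extraction: indices whose singularity exponent h
    is among the smallest `quantile` fraction of all values."""
    h = np.asarray(h, dtype=float)
    threshold = np.quantile(h, quantile)
    return np.flatnonzero(h <= threshold)
```

The size of this subset is exactly the point raised in the exchange above: the smaller the MSM relative to the signal, the more is being asked of the reconstruction.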
So if you have good local reconstruction, it means the H is big, and it means that the point is predictable; otherwise, when the local reconstruction is poor, the H is very low. And this is how you can claim that this most singular manifold actually corresponds to the least predictable manifold. Now, this has been done in image processing. But if we look at this reconstruction formula, it looks a little bit analogous to what we know about speech. Imagine for one second that we can do the same thing for speech signals. This would tell us that by looking at this H we can probably have [inaudible] access to the source signal, and if we can find the good kernel, then we can reconstruct. So it would be a nonlinear alternative to our classical source-filter model. >>: [inaudible] objections earlier, which was that the source and the filter are coupled. >> Khalid Daoudi: Yes. This is -- >>: They're not coupled here. >> Khalid Daoudi: Yeah, they are -- they are not, or they are? >>: It seems like they're not, so two different expressions. >> Khalid Daoudi: Yeah. But when you look at this equation here, you are obtaining the source after a nonlinear operation on the full signal. So what exactly you obtain, I don't know, because it's not really decoupled. But, anyway, if we can use this formalism this way under the assumption that they are decoupled, why not. So let's now see what's going on in the case of speech signals. I took exactly the same components, but I replace X by T. And let's use the same analogy as in the 2D case and take the same multiscale functional.
And we want to use this new formalism to analyze speech, but keeping in mind what we want to do. This is very important because, as I said before, the nonlinear techniques developed so far are very complicated and people get scared of them -- and probably this is why they don't have that much impact in speech processing, besides the fact that, as I showed before, we don't get access to a lot of information. So I want to try this formalism on the speech signal, but with the philosophy of doing simple things that everybody understands, and also of developing efficient algorithms -- things that people can easily implement. And the problem is, we don't have this nice reconstruction formula as in the 2D case. We cannot hope to have this universal kernel that reconstructs the speech signal. So -- >>: That's a problem in speech, not a problem [inaudible]? >> Khalid Daoudi: Yeah. But still, let's try to do something with these Hs. Since we don't have the reconstruction formula, we cannot estimate the Hs exactly the way they do. Instead we rely on theoretical results developed by Oriol Pont which say that if you assume there is an energy cascade behind your process, plus some other technical conditions -- that the cascade variable is infinitely divisible -- then you can estimate your exponent just by summing up the so-called transition exponents, which in our case we compute simply as the ratio between the logarithm of the multiscale functional and the logarithm of the scale, and then you normalize. >>: I have to admit I'm completely lost. >> Khalid Daoudi: Okay. >>: Can you translate this into speech? I mean, what does H correspond to in a speech signal or a speech system? What information is in there?
>> Khalid Daoudi: So this is the point. In the 2D case we can fairly say that these exponents quantify predictability around a point. Okay? Here we want to say the same thing, but at least personally I feel that we don't have the right to say it, because we don't have the reconstruction formula that allows you to give this interpretation of local predictability. So I'm just going to call them singularity exponents and try to find a way to -- >>: If I have a 2D image and I shrink it this way -- you know, I chop off some of the rows on the bottom, it's a little bit smaller, and I shrink it [inaudible] still have a 2D image -- but if I shrink it down all the way to one row, then I have a 1D signal. Why does it all of a sudden break in that case? >> Khalid Daoudi: What will break? I mean, if -- >>: [inaudible] the difficulty comes from the [inaudible], not because of dimensionality. If it's from the dimensionality, then [inaudible]. [multiple people speaking at once] >> Khalid Daoudi: Actually, they have done it [inaudible] to a certain extent for stock market series, where the dynamics are not the same as in speech, of course. But I think that for the speech signal -- I'm very confident that we can achieve the reconstruction formula. We just have to take -- >>: Is there a particular aspect of speech that makes it [inaudible] -- [multiple people speaking at once] >>: Right. Especially as an image. So why is that -- >> Khalid Daoudi: No, but it's not an image coming from turbulence. >>: Of course it is. >>: [inaudible] much more turbulent than -- >>: Okay. I have -- seriously, why -- >> Khalid Daoudi: Yeah? >>: I mean, why -- >> Khalid Daoudi: No, this one doesn't work for -- I mean, you cannot say that you can apply it to any image and you'll have this information about the physical process behind it. No. It's not magical.
>>: So if they took a picture of the exhaust coming out of the tailpipe on a cold day. >> Khalid Daoudi: Of what? >>: Billowing smoke. >>: [inaudible] >>: Is that what you're talking about when you're talking about an image of turbulence? >> Khalid Daoudi: Yeah. The smoke. Like if you look at cigarette smoke and -- >>: And so the theory tells you that there are some points in this image that you can take and then reconstruct. >>: Yeah. >>: But -- >>: The spectrogram -- >>: -- it doesn't -- >>: Specifically reconstruct. >>: -- specifically reconstruct, right? The spectrogram doesn't look like that. >>: Right. >>: And so that's why you're saying that you can't directly apply this to speech. >> Khalid Daoudi: Yeah, of course. >>: It's deterministic [inaudible]. >>: Wait a second. I find it hard to believe that the turbulent image would be more compressible than the nonturbulent one. And turning the image into a few of these points, that's a lot like compression. >>: It's a stochastic reconstruction, so it produces something that resembles [inaudible], not pixel accurate. Is that right? >> Khalid Daoudi: Yeah. It's not going to be pixel for pixel, for sure, but -- >>: [inaudible] >>: No, no, but I think it's more than that. I think he actually means that it reconstructs it -- not just something with the same statistical properties, but actually something that's like a mean-square-error-close approximation. >>: I don't think so. [multiple people speaking at once] >>: What do you mean by the word reconstruction for images? >> Khalid Daoudi: For images, you take the image, you compute your exponents, you find the lowest ones, and you apply the kernel here -- >>: Right. >> Khalid Daoudi: -- and you have a good reconstruction of your image. >>: And what does good mean? >> Khalid Daoudi: Yeah. So good, it's -- >>: I mean, no, seriously.
>>: I mean, if you had a bunch of images of turbulence and you had the reconstruction, would you be able to find the source image? Would it look similar? >> Khalid Daoudi: If you have -- >>: So let's say I have ten pictures of turbulence, and I take one and try to reconstruct it, and I see this reconstruction. Can I as a human pick out which of the ten I used as the source of that reconstruction? I mean, do the images look the same? >> Khalid Daoudi: Well, at least this is what they claim -- that they really look the same, that for turbulent images you can recover all the -- >>: Statistics. >> Khalid Daoudi: -- all the important fronts -- >>: [inaudible] >> Khalid Daoudi: Yeah. Unfortunately I don't have -- if I had known, I would have shown you examples of ocean dynamics. >>: You're not arguing for pixel-by-pixel mean squared error [inaudible]. >> Khalid Daoudi: Well, actually, for example -- yeah. If you are thinking about compression -- I mean, these physicists -- >>: If we took a picture of Lena -- >> Khalid Daoudi: Yeah. And actually -- but -- >>: -- and we did this and then we reconstructed it, would we say, oh, that's Lena? >> Khalid Daoudi: Yeah, actually this is true. But, you know, they consider Lena a natural image. >>: Yeah. >> Khalid Daoudi: You see? [multiple people speaking at once] >>: That's my question. [laughter] >>: Right. It seems -- >>: And that's something that is not good, because it's not turbulent? >> Khalid Daoudi: So if I [inaudible] -- I mean, you want to take the spectrogram and reconstruct it. Is that what you mean? >>: Yes. >>: [inaudible] >>: So if you can do this and reconstruct Lena, why can't we do this and reconstruct the spectrogram? >> Khalid Daoudi: First -- imagine if we cannot reconstruct the spectrograms, then what? >>: Now you can go back to the [inaudible].
>> Khalid Daoudi: I mean, how are you going -- you are going to pick some points, some MSM in the frequency domain and ->>: [inaudible] >> Khalid Daoudi: Excuse me? >>: In some respect [inaudible] reconstruct away from. >> Khalid Daoudi: Yeah. But ->>: So it's isomorphic. It has the same information. >>: All right. So the point is if we can reconstruct LENA, we can reconstruct the spectrogram. If we reconstruct the spectrogram, we can reconstruct the waveform, and therefore it works for the 1D signal. >> Khalid Daoudi: Well, probably you're right [inaudible]. I will -- actually, I have at some point, but I thought that what would [inaudible] and we look at the spectrogram, what would [inaudible] this MSM to what it will correspond to or ->>: I have no idea. >>: Yeah. >>: [inaudible] >>: [inaudible] images. >>: But there's enough of both. >>: Well, it's a 2D image, so, yeah, there's a lot of [inaudible] constrained, not everything which is a valid spectrogram. Because it's a 1D signal. So it's a constrained image. >>: Yeah, so [inaudible]. >> Khalid Daoudi: But probably I will [inaudible] send you the reconstruction and see what you think. >>: All right. >> Khalid Daoudi: So where were -- so, anyway, in summary, for the 1D case, we just adopted this way to estimate the exponents, which is very easy, very simple. >>: So I have a question so I understand how this computation has [inaudible]. These individual H sub-R sub-I are calculated by the log ratio of this T operator ->> Khalid Daoudi: Yeah. >>: -- which is your gradient modulus. >> Khalid Daoudi: Yeah. >>: So it's like how quickly is it changing at the scale RI? >> Khalid Daoudi: Yeah. >>: And the denominator is the scale divided by what, the [inaudible]? >> Khalid Daoudi: Yeah [inaudible] frequency. >>: So that's a constant. >> Khalid Daoudi: Yeah. >>: And then RI is measured in time? >> Khalid Daoudi: Exactly. So you just -- it's your [inaudible] looking at. >>: Okay. 
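[Editor's note: to make the exchange above concrete, here is a minimal Python sketch of the 1D exponent estimator as described. It assumes the simplest gradient-modulus measure, T_r(t) = |s(t+r) - s(t-r)|, evaluated at the finest scale (r = one sample) and normalized by its mean; the exact measure and normalization in the formalism may differ.]

```python
import numpy as np

def singularity_exponents(s, r=1, eps=1e-12):
    """Rough per-sample singularity exponent estimate h(t).

    Assumes the simplest 1D gradient-modulus measure discussed in the
    talk: T_r(t) = |s(t+r) - s(t-r)| at the finest scale r (one
    sample).  h(t) is the log of the measure, normalized by its mean,
    divided by the log of the scale expressed as a fraction of the
    signal length (so the denominator log(r/N) is negative).
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    # gradient-modulus measure at scale r (circular at the edges)
    T = np.abs(np.roll(s, -r) - np.roll(s, r))
    # normalize by the mean so the log ratio is dimensionless
    ratio = (T + eps) / (T.mean() + eps)
    # scale relative to signal length: r/N < 1, so log(r/N) < 0
    return np.log(ratio) / np.log(r / n)
```

With this sign convention, sharp transitions give a large measure and hence the most negative exponents, which matches the later discussion of the lowest exponents being the most singular.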
>> Khalid Daoudi: So you see it's very simple to compute. >>: It's trying to figure out what's happening across frequency, how quickly is the signal changing? >> Khalid Daoudi: Yeah. Across [inaudible] scale. Like for scale, yeah. >>: Okay. >> Khalid Daoudi: So I said now we have [inaudible] compute H, and let's see how they look. So this is a speech signal from TIMIT. And the vertical red lines are just the manual transcription of phonemes in TIMIT. And if you look at the time-conditioned distribution of the Hs, so here you have time, so take a 32-millisecond window and you look at the histogram of the Hs in each window. And you can see -- well, here it's not very clear, but if you look closer, you see that actually the distribution of the Hs is changing from phone to phone, but of course it is not very clear from here. So we said, okay, since the distribution is changing, let's just look at the simplest [inaudible] statistic. So we'll look at the local mean, and since we want to keep the resolution the finest possible, we just look at the running mean of the Hs. So we look at this function ACC, as accumulation, and you can see that we have this nice behavior that it's almost a piecewise linear functional with breaking points at the boundaries of phonemes, and you can even see for shorter ones, see here, the slopes of this piecewise-like linear functional changing [inaudible]. So we say, okay, let's do a simple thing and fit a piecewise linear functional to this curve and identify the breaking points. >>: But wouldn't this be [inaudible] technique, you still can't find this kind of behavior? >> Khalid Daoudi: Well, we've looked. And can you tell me of a simple technique that will do this? >>: Just, for example, just the energy you've already gotten. >> Khalid Daoudi: If you're looking at the energy, you are not going to have this kind of piecewise linear functional with such a resolution. >>: So the 0 to 2 is [inaudible]. >> Khalid Daoudi: Yeah. 
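[Editor's note: the ACC curve and the breakpoint idea described above can be sketched as follows. This is a hypothetical illustration, not the method from the thesis: the piecewise-linear fit is replaced by a simpler two-sided slope comparison, and the window length `win` (about 10 ms at 16 kHz) is an assumed value.]

```python
import numpy as np

def acc(h):
    """Running mean of the exponents: ACC(t) = (1/t) * sum_{i<=t} h(i)."""
    h = np.asarray(h, dtype=float)
    return np.cumsum(h) / np.arange(1, len(h) + 1)

def boundary_score(a, win=160):
    """Hypothetical breakpoint detector for the piecewise-linear ACC curve.

    At each point, fit a line over `win` samples to the left and to the
    right; the absolute slope difference peaks at the breaking points,
    which the talk associates with phoneme boundaries.
    """
    a = np.asarray(a, dtype=float)
    n = len(a)
    score = np.zeros(n)
    x = np.arange(win)
    for t in range(win, n - win):
        sl, _ = np.polyfit(x, a[t - win:t], 1)   # slope of left window
        sr, _ = np.polyfit(x, a[t:t + win], 1)   # slope of right window
        score[t] = abs(sr - sl)
    return score
```

A peak-picking pass over `boundary_score` (local maxima above a threshold) would then give candidate phoneme boundaries.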
No, but ->>: [inaudible] never get three steps, right? >> Khalid Daoudi: Yeah, but it doesn't mean here where you start, because ->>: [inaudible] >>: So the slope can be interpreted as a local mean ->> Khalid Daoudi: Yeah. >>: -- of the -- of the H? >>: Right. >>: But there's usually a flaw in the A-to-D, because usually a low-pass -- I mean a high-pass filter at the front end, so the local mean should be 0, so it doesn't pass [inaudible]. I'm a little confused as to what this is getting at, because it's essentially the first -- it's essentially the lowest bin, DC bin spectrogram [inaudible] time it's a cumulative. So it's the integral. >>: [inaudible] >>: Yeah. So it's a low-pass filtering. Perfect. >> Khalid Daoudi: Okay. So I'm curious how you look at this. What are you saying? >>: I mean, you've taken the speech signal and you're integrating over time, which is equivalent to, in the frequency domain, multiplying by 1 over F. >> Khalid Daoudi: Yeah, but here we are not taking the speech signal. It's not the speech. >>: What is that? Oh, that's for H. >> Khalid Daoudi: Yeah. >>: Ah. Okay. I have no idea, then. Sorry. I missed that. >>: Low-pass filter here, H. >> Khalid Daoudi: Yeah. >>: So let's go back a slide, then. I'm sorry. >>: [inaudible] it's smoothing out the ripple. >>: Go back a slide. So you said that H would be low when you have maximum predictability. But around the 0.3 it looks like a fricative, which would be the least predictability. >> Khalid Daoudi: No. It's the opposite. When it's high, it's ->>: Well, I don't [inaudible] wave points very well, but I think this looks like a fricative. It should be most unpredictable. I mean, this is definitely a vowel. >>: Well, he has it labeled in the previous slide what that is. >>: [inaudible] >>: Go back one slide. >>: Back or forward? No, back. Other way. [multiple people speaking at once] >>: That's a fricative. >>: That's maximum randomness. >> Khalid Daoudi: Yeah. 
>>: But it's at the lowest H. >> Khalid Daoudi: Yeah. This is [inaudible] unpredictable. >>: Can you just go back? >>: But I thought small H was predictable. >>: Go one before it. >> Khalid Daoudi: No, no, it's the opposite. Smallest H is unpredictable. >>: Towards negative infinity is unpredictable. >> Khalid Daoudi: Yeah. >>: That's what you're saying. [multiple people speaking at once] >>: If you go back one, the one before it's a vowel, and that's also low. >>: Yeah. There's not much difference between this vowel, which is very predictable, and this fricative, which is completely noise and very random. >>: Compared to some of the other things that are going on. >>: But there seems to be some effective [inaudible] because the most predictable thing is silence. >>: Silence is really predictable. >>: Yep. >>: That part works. >> Khalid Daoudi: But let me remind you that I don't allow myself to talk about predictability at this point for these exponents. I talked about it really for images, but still here you can see that it's a little bit what we are expecting. I don't know if you agree. >>: I'll go with that for now. >>: But the -- so the previous -- the next -- so you take the mean -- wait. Is this the mean -- [inaudible] the mean vertically and the ->>: It's over H, so ->>: [inaudible] only over tau, H of -- right, all the -- in the previous slide I thought your H of tau was a sum of other little Hs. >>: Well, H in the previous one was a histogram. >>: Of all the Hs [inaudible]. >>: [inaudible] >> Khalid Daoudi: Yeah. This is just a histogram. >>: Okay. >>: Go back one more. >>: Yeah. So I thought H running over time is a sum of ->> Khalid Daoudi: Yeah, this is just a way to compute it. You can just ->>: Okay. So that's ->> Khalid Daoudi: So these Hs are other transitionals here. We are just [inaudible] ->>: So everything -- so that's just H. Okay. >> Khalid Daoudi: Forget this one. I'm just saying how we compute in this case. 
[inaudible] you saw it is easy actually. >>: So one of the things you're asking is how would somebody in speech recognition look at a graph like this? It seems like the only really useful thing is like a spectral measure where you're trying to determine whether the sounds around it are similar or different. Then in the places where you have a linear progression of this integrator, you expect the expected value of the H in that region is the same. So it seems like you can ->>: Yeah, kind of eyeball it. >>: If the sound isn't changing, then the H isn't changing. >>: Right. >>: And there's some correlation there between the two. >>: But is there also like [inaudible] like there's an A at about .5, there's another A at about 2.5? >>: They're both sloped down. >>: [inaudible] >>: Oh, go backwards, go backwards. I mean, I think this. >>: Go back a slide. >>: The slope is the informative thing. Because it's not an absolute number. >>: So this previous thing is based on the fact -- this next slide is based on the fact this is constant, this is constant, this is constant, [inaudible] there. This is constant. [inaudible] over time, that gives you a particular slope. >>: Yeah. >>: You're just interpolating it into constant [inaudible] right? >>: Right. I think. >>: So what happened to 0.4 on the next graph? Because here it's sloping a lot. So that's -- well, it's changing I guess ->>: Yeah, it's a steep slope. >>: Yeah. I don't know what that means. >>: [inaudible] yeah. >>: [inaudible] for H is 0. On the previous slide you've got [inaudible]. >> Khalid Daoudi: Yeah. >>: What's the significance of the [inaudible]? >> Khalid Daoudi: So [inaudible] the negative ones are the most important ones. It's where you have low predictability. And the positive ones are the opposite. So -- and here we are looking at all of them. We are not at the MSM yet. >>: It seems like context-dependent predictability. Because like the two [inaudible] are totally different. >>: [inaudible] the integrator. 
>> Khalid Daoudi: Yeah, so the integrator. So if you look -- [multiple people speaking at once] >> Jasha Droppo: So, Khalid, I want to draw everybody's attention to the clock. >> Khalid Daoudi: Yes. >>: Yes. Oh, that's right [inaudible] three o'clock to the other meeting. >> Khalid Daoudi: Yeah, yeah. So -- and here -- so by this simple operation [inaudible] so we compare to the state-of-the-art phonetic segmentation -- don't look at the third row, this is just an incremental improvement on this integrator, this ACC functional. And so we compare to what we found is the best, this paper of Dusan and Rabiner, and we compare on the full TIMIT database so at least we can have a fair comparison. And we have better results. But what is more important -- we've been discussing this before -- is that, if this stuff makes sense, probably we should trust the segmentation by this method more than relying on the manual transcripts of TIMIT. And so probably we should look at typical examples where even manual transcription cannot find the transitions and see if this method can find them. But unfortunately we didn't do this because we wanted to go look at other aspects. So let's look at the MSM, this measure component of the formalism. So here is a voiced speech signal, and here are the Hs now, just the signal h(t). And you see that -- from this picture here you see that -- the red vertical lines, they correspond to GCI locations. They are extracted from EGG signals. So they are here also. And you see that the lowest exponents are around these GCIs. So this is directly by looking at the H. Here is a better example where we zoom. If we now take the MSM as the 5 percent lowest exponents, you have this behavior. So here again is another voiced signal. And you see that the MSM is located around the GCIs. The lowest exponents are around the GCIs. There is even another nice behavior: the lowest exponent in these regions corresponds to the exact location of the GCI as given by the EGG signal. 
So it seems like this MSM corresponds to -- has a physical meaning in speech, which apparently here is the GCI. So we said, okay, let's compare to what is going on in this field of GCI detection. So this is basically this level-change functional: you just compare the mean of the Hs around the point, to allow only one MSM point per glottal cycle. So it gives this red curve. And then we take the zero crossings of this red -- this green curve, sorry, and we take the lowest MSM -- the lowest exponent points which are the closest to the peak. So this is to guarantee that we are taking the MSM point, the lowest one, but the ones that are within one glottal cycle. So here are the performances. So this was defended last year by Thomas Drugman. And here are the results in clean speech. So you have two measures -- I think Mark knows this stuff better than I do -- reliability measures and accuracy measures. So the reliability measures are almost the same, but the accuracy is always better. And according to Thomas Drugman, SEDREAMS [inaudible] is the best algorithm and the most robust so far, according to his thesis. And particularly when you look at shorter tolerance windows, we have better results. So we see again this -- when we want to look at high resolution, we always have this benefit because the [inaudible] based. And let me remind you that this is just by simple transformations of the H. So here are the results in noise, because in clean speech most methods do well. So here you have accuracy -- sorry, the hit rate, and here, in the last figure, you have the accuracy. So this is the case of babble noise, and you see that [inaudible] for low SNR we are ->>: This is adding noise to the EGG? >> Khalid Daoudi: Yeah. >>: Speech? >> Khalid Daoudi: To the speech signal. >>: I thought that you said it was done -- [multiple people speaking at once] >>: You have a speech signal and a parallel [inaudible]. >> Khalid Daoudi: Yeah. 
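[Editor's note: a rough Python sketch of this MSM-based GCI picking. The 5 percent fraction follows the talk; the `min_period` spacing constraint (2 ms here, i.e. a 500 Hz pitch ceiling) is an assumed stand-in for the level-change functional actually used to keep one MSM point per glottal cycle.]

```python
import numpy as np

def gci_candidates(h, fs, msm_fraction=0.05, min_period=0.002):
    """Hypothetical GCI detection from singularity exponents.

    Takes the MSM as the `msm_fraction` lowest exponents, then enforces
    at most one detection per glottal cycle by keeping, within every
    `min_period`-second neighborhood, only the sample with the lowest
    exponent.
    """
    h = np.asarray(h, dtype=float)
    thresh = np.quantile(h, msm_fraction)
    msm = np.flatnonzero(h <= thresh)      # most singular manifold
    gap = int(min_period * fs)             # minimum spacing in samples
    gcis = []
    for idx in msm[np.argsort(h[msm])]:    # lowest exponents first
        if all(abs(int(idx) - g) >= gap for g in gcis):
            gcis.append(int(idx))
    return sorted(gcis)
```

The greedy pass visits the most singular samples first, so within each neighborhood the deepest exponent wins, mirroring the "lowest exponent closest to the peak" rule described in the talk.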
And what is interesting is that in the case of [inaudible] noise the results seem to be very good. So it means that there is some robustness. And it has actually been shown in image processing, and also theoretically, that the lowest Hs are robust to noise. So, well, I have to be quick. So we took this MSM and proposed it as a way to provide sparse linear prediction. So the idea is -- the L-0 norm is the impractical ideal solution. Another [inaudible] to the solution is the L-1 norm, but it involves convex programming. So we propose just to downweight the mean square error minimization on the GCIs. And actually Wednesday morning there was a presentation by Paavo Alku [phonetic] who is doing exactly the same thing. Fortunately I had submitted the paper three months ago. So he uses the same principle, with his own way to compute GCIs. So we downweight it. And I showed, after we made a statistical test, that we have better sparsity by doing this method. But this is -- I would say that here we are just playing around with our MSM stuff to obtain interesting results. So since we were talking about performance, we made a test: on a synthetic vowel you have an exact estimate of the original transfer function, while the L-2 norm is shifting [inaudible] a little bit. And we have also used this MSM for multi-pulse excitation. We know that classic [inaudible] MPE operates in two stages, where in the first stage, if you want k pulses, you have to do k searches of order N to extract the pulse locations, one after the other. And then you optimize all the pulses [inaudible] jointly at the end. And this first stage takes a lot of time. So we replace this first stage directly by taking the MSM as the pulse locations. We say that the MSM gives us the locations of the pulses. And then we keep the second stage of MPE the same. And here is an example of what we find. 
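[Editor's note: the downweighting idea for sparse linear prediction mentioned above can be sketched as a weighted least-squares problem. This is an illustrative reading with assumed weight and neighborhood values, not the exact scheme from the submitted paper.]

```python
import numpy as np

def weighted_lp(s, order, gcis, w_gci=0.01, radius=2):
    """Sketch of downweighted least-squares linear prediction.

    Standard LP minimizes sum_t e(t)^2.  Here, samples within `radius`
    of a detected GCI get weight `w_gci` << 1, so the optimization
    tolerates large residuals there and the prediction error becomes
    sparse (spiky at the GCIs).
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    w = np.ones(n)
    for g in gcis:
        w[max(0, g - radius):min(n, g + radius + 1)] = w_gci
    # covariance-method regression: predict s[t] from s[t-1..t-order]
    X = np.column_stack([s[order - k - 1:n - k - 1] for k in range(order)])
    y = s[order:]
    wv = w[order:]
    # weighted normal equations: (X^T W X) a = X^T W y
    A = X.T @ (wv[:, None] * X)
    b = X.T @ (wv * y)
    return np.linalg.solve(A, b)   # a[k] multiplies s(t-1-k)
```

Because the large residuals at the GCIs are nearly ignored, the resulting prediction error concentrates in spikes at the glottal closures, which is the sparsity the talk refers to.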
So this is the original signal, these are the pulses -- the pulse locations are given by the MSM, the amplitudes are computed using the second stage of MPE -- and this is the reconstructed signal. And here we compare the performances. So the best one is MPE -- reoptimized MPE, because there was a first version in '82 which is relatively bad in SNR performance. Ours is here. And this one was published last year using [inaudible] methods. So we see that for the low case we have better results than these two here. But MPE is better. But MPE does intrinsically minimize SNR, so it's not surprising that it is doing a good job in SNR. But if you look at perceptual measures, we have almost the same quality. And we win a lot in [inaudible] complexity. So I want to [inaudible] one conclusion. Actually, my interest is not in the performance; whatever performance we gain, the point is that I showed you a very simple nonlinear operation on a signal that gives easy access to some interesting local dynamics of the speech signal. So we saw that we can have access to GCIs, to phone transitions. Does this mean that this formalism is potentially good for nonlinear speech processing? I hope that the answer will come from you. So this is the perspective for the short-term incremental research. Actually Thomas Drugman wants to use this for synthesis in his system. He has a synthesis system. I heard that GOIs were very complicated to estimate, so probably we can take a look at them. >>: [inaudible] might even have them. [inaudible] that second pulse looked very much like it might have been the GOI. >> Khalid Daoudi: Really? >>: Yeah. So maybe you solved it. >> Khalid Daoudi: What does the MSM mean for unvoiced speech? Normally it should have better meaning there, because the assumption is that there is turbulence there. Is it just some random location in [inaudible] speech, or does it have a physical meaning as we have found for voiced speech? 
Are we looking at the right intensive variable? I mean, I have shown you this gradient-modulus measure as our intensive variable. It is derived -- we have just copied it from image processing. For speech, is it the right variable to look at? Can we have a better estimate of the singularity exponents? Probably the idea [inaudible] the spectrum. And I want to say a word -- so this is not in this talk, but Oriol Pont is a physicist and mathematician in our group. He's working on a way to infer the energy cascade directly, and this is through the notion of optimal wavelets. So he has all the summaries, but we are waiting for the final results where we can infer the cascade of the energy completely using the notion of optimal wavelets. If we have this, of course we will have another way than the one I presented to you today to look at these systems. But what would be very interesting is to have a reconstruction formula that is similar to the 2D case, and here it's open. Should we consider some signal-dependent kernels? And this would lead us then to what is interesting for you: some new ways for feature extraction and so on. And also, since we are a small group and we don't have a lot of manpower, it is very important to me to know what the good problems or the good applications to look at are, instead of being tied down by the lack of manpower. So I hope that you will tell me whether this stuff makes sense. If it doesn't, you can tell me; don't be nice to me. And what are the best problems to look at. And I think I'm all right, aren't I? Thank you. [applause] >> Jasha Droppo: So are there any further questions from the audience? There were a lot during the presentation. >> Khalid Daoudi: Yeah. >> Jasha Droppo: I know a lot of us have to be somewhere at 3:00. >> Khalid Daoudi: Ah, you have the meeting at 3:00, yeah. Yeah. 
>> Jasha Droppo: One thing that Khalid doesn't know, being a mathematician, is whether this work is interesting for speech or what kind of -- he's looking for feedback on what kind of directions to take. Is that right? >> Khalid Daoudi: Yeah. Yeah. To have feedback from you practical guys who know the area better than me. >>: So one thing I would say is I think, you know, we are very [inaudible] scale, so it's like the phone level or the frame level [inaudible] imagine the dynamics of conversational speech [inaudible] compared to [inaudible] ->>: [inaudible] more interested in the dynamics of speech and articulators than we are of [inaudible]. >>: Right. >> Khalid Daoudi: So it's higher-level transitions that are of greater interest. >>: Yeah, the dynamics of the physical process that's generating speech rather than the manifestation of that physical process, which is a little closed [inaudible]. >>: I think in terms of segmental decoding, having some hints as to the segment boundaries could be very useful if those sorts of boundaries that you are identifying there hold up in a little bit of noise and conversational speech and -- >>: If it's generally true that this gives boundaries with reasonable accuracy, I think that's interesting. >> Khalid Daoudi: Yeah, okay. >>: [inaudible] because looking at the whole [inaudible]. >> Khalid Daoudi: No, no. >>: It's a local computation. >> Khalid Daoudi: No, it's a local computation. >>: Thank you. >> Khalid Daoudi: Thank you very much.