
>> Jasha Droppo: So today I am pleased to have Khalid here to give us a presentation. He has
this technique which I'm interested to learn more about. He has a long history. Whereas most
people in speech are either engineers or computer scientists, we're going to be addressed today
by a mathematician who has been at Paris University, spent some time in Montreal, spent some
time at MIT as a postdoc, and is currently at INRIA. And it's all yours.
>> Khalid Daoudi: Thank you, Jasha. So thanks again for welcoming me here and giving me
this opportunity to share this fundamental research with you practical guys.
And as I said, Jasha, yesterday -- not yesterday, but when we [inaudible] -- I need some
honest feedback about what you think of what I'm going to talk about, because of course the
purpose is to go further into more interesting stuff.
So what I'm going to present today is a part of the Ph.D. work of Vahid Khanagha, who is going
to defend in a few weeks, with the contribution of Oriol Pont, who is a top researcher in our
group, and Hussein Yahia, my colleague with whom I cofounded the GEOSTAT team. So
the GEOSTAT team in -- okay. That's the first slide.
>>: [inaudible]
>> Khalid Daoudi: [inaudible] and the GEOSTAT team in Bordeaux is working on nonlinear
analysis of complex systems and signals using principles and methods coming from statistical
physics. We have typical applications in geophysics and astrophysics, and also
some [inaudible] data.
And what I'm going to talk about today is actually these new methods that have been developed
in the field of remote sensing applications. At some point I found it worth giving them a
try on speech signals.
So the outline of my talk will be the following. I'm going to motivate why we are doing this
work: typically it's because of the nonlinear aspects of speech, and because we want to look at
speech as the realization of a complex system.
Then I introduce the formalism that we work with, which is called the Microcanonical Multiscale
Formalism, or MMF. I will summarize and describe the basic principles behind this formalism and
see how we can apply it to speech signals.
And then I will show what we have achieved so far in terms of applications: phonetic
segmentation, GCI detection, sparse prediction, and multi-pulse excitation.
And then I'll draw some conclusions, and I hope that the perspectives will generate a lot of exchange.
So I don't need to underline that you all know that classical speech processing techniques live in
a linear world, inherently making a bunch of assumptions: basically [inaudible] decoupling between
the vocal tract and the speech source, the laminarity of the airflow through the vocal tract, and the
periodicity of the vibration of the vocal folds.
But it is established theoretically and experimentally that there are several nonlinear
effects in the speech production mechanism. Typically, there is a turbulent
source signal during the production of voiced fricatives, so the laminar flow assumption
fails. And there is nonlinear feedback during voicing, and thus the glottal pulses are not
exactly periodic and they are skewed in shape.
For plosives, the excitation is time-spread and has a turbulent component. And there is
coupling between the glottal airflow and the vocal tract system when the glottis is open.
So the source and the filter do not function independently. And these
nonlinear phenomena [inaudible] are emphasized in the case of voice disorders, where you have
strong complex nonlinear aperiodicity and turbulent non-Gaussian randomness.
So given all this, we would like to look at speech from a nonlinear signal processing perspective.
Mainly, the people who have been interested in nonlinear speech processing were
looking at speech in the framework of nonlinear dynamical systems. Typically, the
dynamical systems that share characteristics with turbulence are called chaotic, and there has
been a bunch of tools developed in this area, which is called chaotic signal processing.
So I'll give you a brief recap of this. In chaotic signal processing, the speech production
mechanism is supposed to be a dynamical system governed
by the equation there, where Y is a multidimensional system
which is unknown, and the measurements that we capture about speech are just 1D
projections of this unknown high-dimensional system.
And then in '91 there was this existence theorem called the embedding theorem, which says that
if you take your speech signal, you take time delays of the signal, and you consider this D
dimensional space, then this multidimensional signal X shares common properties
with the unknown dynamics of Y of N. And these common features are typically correlation
dimension, fractal dimensions --
>>: What was S again?
>> Khalid Daoudi: S is the speech signal. Sorry, I didn't --
>>: Okay.
>> Khalid Daoudi: Yeah. S is the 1D speech signal.
>>: [inaudible] some unknown [inaudible].
>>: So where -- oh, okay. So S -- I see.
>>: So like we could think of Y as the articulators and S as the time series, and then this X
somehow is related to the original --
>> Khalid Daoudi: Yeah. There are some common features. And probably the best known of
them is the Lyapunov exponent, which is a characteristic of the dynamical system that remains intact
in the embedding procedure, and this exponent characterizes the degree of chaos of the system.
So if you compute your Lyapunov exponent and find that it is greater than 0, then you can say
that your system is chaotic, and you characterize its degree of chaos by the value of this
positive exponent. When it is negative, it means that the system is not chaotic. Okay.
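To make the embedding step concrete, here is a minimal Python sketch of the time-delay construction described above (the function name and the sine toy signal are illustrative assumptions, not from the talk):

```python
import numpy as np

def delay_embed(s, dim, tau):
    """Time-delay embedding: X(n) = [s(n), s(n + tau), ..., s(n + (dim - 1) * tau)].
    `dim` is the embedding dimension D and `tau` the time delay T in samples;
    the embedding theorem guarantees existence only -- it does not say how
    to choose D and T."""
    n = len(s) - (dim - 1) * tau          # number of embedded points
    return np.stack([s[i * tau: i * tau + n] for i in range(dim)], axis=1)

# toy example: embed a slow sine in 3 dimensions
s = np.sin(2 * np.pi * 0.01 * np.arange(1000))
X = delay_embed(s, dim=3, tau=25)        # X has shape (950, 3)
```

In practice one would feed X to a Lyapunov-exponent or correlation-dimension estimator; the quality of those estimates hinges on the choice of `dim` and `tau`, which is exactly the difficulty the talk turns to next.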
>>: [inaudible] how good is the assumption [inaudible]?
>> Khalid Daoudi: Yeah. So I was going to say this here [inaudible]: actually this
theorem doesn't tell you how to choose. It's an existence theorem. It doesn't tell you how to
choose the time delay T and the dimension D. And these are very important quantities for getting a
good estimate of these exponents.
Moreover, it assumes stationarity. It's
only in this case that these exponents have a real meaning.
And as I said, since the embedding theorem is an existence theorem, it is really very difficult
to estimate this exponent. There is a bunch of methods to do so, there is no
consensus on which is the best way to estimate this exponent, and you find contradictory results
and conclusions about it.
And even if you assume that you can estimate them, these quantities are
just global quantities that give a global description of the system. And I think that it's because
of this difficulty in estimating these global measurements, and the weakness of
the information they gather, that these methods didn't have [inaudible] an impact on
nonlinear speech processing.
>>: Essentially that just tells you the size of the subspace, right?
>> Khalid Daoudi: Yeah, it is your -- it measures --
>>: Dimensionality.
>> Khalid Daoudi: Yeah. And the divergence rate of nearby orbits. So it assesses the complexity
of the system without really saying more.
So the question that has been asked -- well, not by me, but by physicists -- is: is all
this necessary, to have these complicated things to deal with, and can we do something simpler and
still have more comprehension of our dynamical systems?
>>: So when you say is this necessary, I mean, [inaudible] pretty simple measure, doesn't really
tell you anything useful.
>> Khalid Daoudi: Yeah. Yeah. And it's difficult to estimate.
>>: Right.
>> Khalid Daoudi: Yeah. And it's not -- it doesn't tell you that much.
But a lot of work in nonlinear [inaudible] has been dealing with this kind of stuff. So it turns out
actually that these kinds of methods belong to what the physicists call the first phase of complex
system theory, where only global measurements can be assessed. This starts with the
Kolmogorov model of turbulence, [inaudible] of structure functions, the canonical representation
of fractal or multifractal processes, and detrended fluctuation analysis.
So these methods recognize the existence of interacting multiscale motions, but there is no
access to them. Put more simply, they recognize the complexity of the system, but they don't tell you
that much about it.
Since the '90s there is a new trend in the community of complex system theory which says,
okay, we are now going to try to localize geometrical complexity; that is, where the
complexity emerges and how it organizes itself. And there is a school which adopts the notion of
predictability as a way to characterize complexity. And this is where the GEOSTAT team
stands.
Oh. What's going on? What happens? Why is this moving?
>>: By hitting space you got it to go [inaudible] slideshow [inaudible].
>> Khalid Daoudi: Oh. Okay.
So in this school that adopts predictability as a way to characterize complexity, the basic principle
is to say that for a certain class of complex signals and systems there are some thermodynamic
observables that have a power-law behavior. And even more, what Antonio Turiel and
Hussein Yahia have developed is that this power-law behavior can exist at any point of the signal
domain.
So typically, in mathematical wording, given your signal S of X there exists at least one
multiscale functional T_r such that when you analyze it around the point X, you find a power-law
behavior in [inaudible]. D here is just the dimension of the signal S, so for speech it will be just
1.
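Reconstructed from the description above (so treat the exact form as an assumption), the power-law behavior of the multiscale functional $\mathcal{T}_r$ around a point $x$ reads:

```latex
\mathcal{T}_r\, s(x) \;=\; \alpha(x)\, r^{\,h(x) + d} \;+\; o\!\left(r^{\,h(x) + d}\right),
\qquad r \to 0,
```

where $d$ is the dimension of the signal domain ($d = 1$ for speech) and $h(x)$ is the singularity exponent at $x$.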
And the important quantity here is this H of X, which is called the singularity exponent. The
hypothesis is that for this class of systems there exists some geometrical superstructure that
completely dominates the dynamics of the system, and that we can access this
geometrical structure by analyzing the level sets, or some level sets, which are characterized
as the sets of points where the singularity exponent is the same.
So, in other words, you take your signal, you evaluate this multiscale functional, you try to find
these Hs at each point, and then you form the sets of points that have the same exponents, the
level sets, and some of them are actually the most important sets that carry information.
So in this way, this singularity exponent unlocks the relation between geometry and statistics,
because basically what they say is that we are going to access this structure geometrically,
and this structure contains most of the statistical information.
So it's kind of a way to bring together statistical signal processing and nonlinear
dynamical signal processing, which is the deterministic counterpart of statistical signal
processing.
>>: I think this is important, so I want to understand.
>> Khalid Daoudi: Yeah.
>>: There's a signal S of X or of time, and you're saying that in some particular point in time
that signal has a property that can be expressed by this power law.
>> Khalid Daoudi: Yeah.
>>: And I should care about that property because that property is measuring how predictable
the signal is there. Is that why it's important?
>> Khalid Daoudi: Yeah. I mean, what is nice about
the work of Antonio Turiel and his fellow colleagues is that they provide a way of precisely
estimating this H at each point of the signal domain, and it also gives an interpretation of this
exponent as a way to quantify local predictability.
>>: But okay. Now, why shouldn't I just be like simpleminded and say, well, I'll do an LPC and
I'll see what the error is right around here and that's the predictability or some other model that
predicts a time series and just apply the model and see what the predictability is?
>>: It's -- you can do this, for example. Okay. You can take --
>>: [inaudible]
>> Khalid Daoudi: Excuse me?
>>: Yeah, why not just do that? If I want predictability, why not just measure it?
>> Khalid Daoudi: No, but, for instance -- what you are saying is that you are defining your own
notion of predictability. In particular, you're saying: I'm going to look at the residual, and
where I have the highest peaks, those are the less predictable points. Okay? But first you are
assuming a notion of predictability that may or may not make sense.
The second thing: if you take just the highest peaks
and use them as the source signal for your filter, it's not going to work that well, because you
have polarities in these peaks. So, I mean, it's all about what you
mean by predictability.
>>: Okay. So maybe you might get your estimate of H by doing something like LPC, even
though that might be a bad or a crude approximation?
>> Khalid Daoudi: But don't forget also that there is this notion of scale. So if
you do LPC, you are [inaudible] of the signal and you are going to have the definition of
predictability at the finest resolution.
Here -- I mean, I don't want to go into all the hypotheses. It tells you how the
energy concentrates at the point X as it transfers across scales. So all this is based on the
notion of turbulence and [inaudible] turbulence.
>>: So to make sure I'm still following. So H is the underlying dynamics of the system that you
understand that you can characterize, and the second term there is all the stuff that you can't
characterize, all the noise?
>> Khalid Daoudi: The second term -- this term is negligible. I mean, they say that for small
scales, this term here, you can neglect it.
>>: Okay. So I don't understand where the statistics are coming from. Is it statistics of --
>> Khalid Daoudi: Yeah, I'm coming to this later. Because this first
component is the definition of the singularity exponent.
>>: So for each X you can expand your original function into this form, right, for each X?
>>: Yeah.
>>: And but for each X you have different alpha and different H. Is that true?
>> Khalid Daoudi: Yeah. You have a different alpha and a different H. But alpha actually can
cancel, and what governs the dynamics is this H here. Because there is a way to show how -- I
mean, this alpha is also not that important. You can make it cancel.
>>: So even if they are different, you can still cancel it?
>> Khalid Daoudi: Hmm?
>>: Even if alpha is different [inaudible]?
>> Khalid Daoudi: Yeah. I mean, alpha is there, but
you can get rid of it in the estimation of H. And the important quantity according to
[inaudible] is this power-law scaling and not this constant multiplicative factor.
>>: Okay. And this only applies if R is going close to --
>> Khalid Daoudi: Yeah. For -- for small scales.
>>: By the way [inaudible] is R just as a distance?
>> Khalid Daoudi: R is just the scale.
>>: R is distance?
>> Khalid Daoudi: It's the scale at which -- the resolution at which you are looking at the signal.
>>: [inaudible] okay.
>>: So now I'm getting more confused. Because if there's multiple scales and in each scale
you've got your own H, then you've --
[multiple people speaking at once]
>> Khalid Daoudi: No, no, no. It doesn't depend on scale.
>>: H does not depend on scale.
>>: Magical H.
>>: Okay. H is a function.
>> Khalid Daoudi: Yeah.
>>: Okay.
>> Khalid Daoudi: So typically from S we are going to have access to this H of X and see what
we can do with it for the moment.
So the choice of this multiscale functional is important, because if you choose it, for
example, just as linear increments, you run into [inaudible] exponents that are mainly used in
fractal and multifractal theory.
>>: Okay. So these are [inaudible] in the original signal spaces.
>>: That's one definition.
>> Khalid Daoudi: One definition. But we do not adopt this [inaudible] because actually these
quantities are only directional, they are unstable, and they don't resist noise.
And they don't really give you an interpretation about predictability, just about the geometrical
regularity of the signal. These quantities are well known in the canonical framework of
multifractal theory, but they are impractical and very difficult to estimate.
But what Antonio Turiel and his colleagues have proposed is to define this measure as the
multiscale functional, which is called the gradient-modulus measure. It is defined from a
typical characterization of intermittency in turbulence, and it actually measures the local
dissipation of kinetic energy.
So you sum up all the variations of
your signal in a ball around X. And then with this measure you try to estimate
the H.
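A minimal 1-D sketch of these two steps, assuming the measure is the sum of gradient magnitudes over a ball of radius r and that H is read off the slope of a log-log fit (both are one reading of the description above, not the authors' exact estimator):

```python
import numpy as np

def gradient_modulus_measure(s, center, r):
    """Assumed 1-D version of the gradient-modulus measure: sum of |s'|
    over the ball (interval) of radius r samples around `center`."""
    grad = np.abs(np.gradient(s))
    return grad[max(0, center - r): center + r + 1].sum()

def singularity_exponent(s, center, radii):
    """Estimate h at `center` from the slope of log Gamma_r vs log r,
    using Gamma_r ~ r^(h + d) with domain dimension d = 1."""
    gamma = np.array([gradient_modulus_measure(s, center, r) for r in radii])
    gamma = np.maximum(gamma, 1e-12)              # guard against empty measure
    slope = np.polyfit(np.log(radii), np.log(gamma), 1)[0]
    return slope - 1.0                            # subtract d = 1

# toy check: a square-root cusp is far more singular than a smooth region
t = np.arange(2048)
s = np.sqrt(np.abs(t - 1024.0))
radii = [2, 4, 8, 16, 32]
h_cusp = singularity_exponent(s, 1024, radii)     # strongly negative
h_smooth = singularity_exponent(s, 1500, radii)   # close to 0
```

The cusp point, where the signal varies most violently, gets a markedly smaller exponent than the smooth region, which is the ordering the formalism relies on.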
>>: With what is the gradient being taken with respect to? It's the gradient of your
one-dimensional signal?
>> Khalid Daoudi: Here I'm in whatever dimension.
>>: [inaudible] signal with respect to what?
>> Khalid Daoudi: It's to all your dimensions. So here it's --
>>: Oh, okay. So if in one dimension the previous exponent you're talking about would be what
approximation for the gradient? Or --
>> Khalid Daoudi: It's -- if you write it down, it's not that -- it's --
>>: [inaudible]
>> Khalid Daoudi: Yeah.
>>: [inaudible]
>>: This one here, that seems like an approximation for the modulus of the gradient that R goes
to 0.
>> Khalid Daoudi: Well, you can see it as an approximation, but here you sum up all the
variations in the ball.
>>: I see.
>> Khalid Daoudi: So if you want you can cancel this. But actually just this
measure makes a lot of difference, because they show that it describes the variations
around a point better than looking just at directional increments.
>>: Is that the -- if I got it, that's an integral over a surface --
>> Khalid Daoudi: Yeah.
>>: -- [inaudible] is that right?
>> Khalid Daoudi: Yeah.
>>: Okay.
>> Khalid Daoudi: So by taking this multiscale functional, these people say that they
can characterize the degree of local predictability at each point. How to do this? They say: if
we are correct in saying that there exists this geometrical structure that governs the dynamics,
then from this geometrical structure we can have access to all the information about the signal.
So this is the original claim. And they have shown that for a particular class
of signals this is true. So they take --
>>: Wouldn't it be the bigger the value the lower predictability? Because if you're looking at the
gradient, that's how fast things are changing. And if things are changing a lot all over the place
and you do that integrally, then you end up with a big number. And that would seem to mean
that it's not predictable. Because if you go over here, you're moving in this direction, if you go
over here, you're going in this direction.
>> Khalid Daoudi: Yes. Yeah, but this is what it is. So to estimate, for example, an easy
way is to look at the log-log ratio, and then things reverse. I don't know.
>>: If you go back a slide.
>> Khalid Daoudi: Yeah.
>>: This key [inaudible].
[multiple people speaking at once]
>>: That's not the exponent.
>> Khalid Daoudi: This is not the exponent.
>>: I see. Okay.
>>: That's not a statement of T, that's a statement of [inaudible].
>>: Are you [inaudible] would also be larger, right, [inaudible]?
>> Khalid Daoudi: No, no, it's the opposite. It's when [inaudible]. It's -- it's -- even if --
>>: I mean, because I is less than 1. Okay.
>> Khalid Daoudi: Yeah. It's very small.
>>: I don't get [inaudible].
>> Khalid Daoudi: So the claim is that if we look at a particular geometrical
structure, we have access to most of the information. And here, to answer your question about the
[inaudible] statistics: they are going to show that this is indeed true for some classes of signals.
So you take these exponents, you assume that you can compute them of course, and you look at
the so-called most singular manifold. The most singular
manifold is the subset of points having the smallest singularity exponents.
So you compute your Hs, you take the smallest ones, and you form the subset of points that
corresponds to the smallest exponents. This is called the MSM, the most singular manifold.
And as I said, they claim that this MSM corresponds to the points where information concentrates
as it transfers across scales, and that it governs the dynamics of the system.
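The "take the smallest exponents" step can be sketched as follows (the quantile threshold is an assumed implementation detail; the talk only says the MSM is the set of points with the smallest exponents):

```python
import numpy as np

def most_singular_manifold(h, q=0.05):
    """Assumed MSM extraction: indices whose singularity exponent falls
    below the q-quantile of the exponent field (the 'smallest' exponents)."""
    return np.flatnonzero(h <= np.quantile(h, q))

# toy exponent field: only the two negative values end up in the MSM
h = np.array([0.9, 0.1, 0.8, -0.4, 0.7, -0.5, 0.6])
msm = most_singular_manifold(h, q=0.3)   # -> indices 3 and 5
```

The choice of quantile (here 5% by default) controls the cardinality of the MSM, which is exactly the trade-off discussed later about real-world reconstruction quality.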
So if we are right, then we should be able to reconstruct. And actually they developed a
method for signal reconstruction, which is here. For natural images, they say that you can
fully recover your signal by looking just at the gradient of the signal restricted -- so this
symbol means that it is a restriction to the MSM -- and you diffuse it with an appropriate kernel.
In the case of natural images, this kernel is universal and has this form, which complies
with the scaling property of the power spectrum of natural images.
And there are a lot of applications and strong results in image processing using this formalism. So,
to recall, you now have two components: the singularity
exponents that we obtain from this measure, and the MSM, the most singular manifold, which allows us to
fully reconstruct the signal from only the knowledge of the properties of the signal on the MSM.
>>: So what's the definition of fully reconstructed? I don't think -- we still have errors, right?
>> Khalid Daoudi: Yeah, yeah, there are. In fact, in the real world there are still errors; you
can never have perfect reconstruction. But they stated they have very
good reconstructions.
>>: Okay. So you're saying that to the extent the model is correct you have perfect
reconstruction?
>> Khalid Daoudi: Yeah.
>>: And then because the model isn't perfect, you don't have perfect reconstruction, is that --
is that where the imperfection comes from?
>>: It's got to be more, because there's this subset. It has to do with the size of the subset.
>> Khalid Daoudi: Yeah. Actually they have shown that this is true: in the
continuous case you have perfect reconstruction. For real-world signals, as you said, it depends
on how much information you gather here. Depending on what kind of
signal you have, the cardinality of this set can be small or big, and this gives you more or less quality in
your reconstruction.
>>: So for continuous signal you can completely reconstruct?
>> Khalid Daoudi: Yeah. Because actually [inaudible] this MSM is dense
in the signal domain, and you can recover everything from it. But this is the continuous case.
So if we summarize now, what we have is this, I will say, easy-to-compute
quantity, which is the singularity exponent. Even though it's not that easy in the case of
images: the way they compute it is a little bit involved, because it uses a notion of
local reconstruction to estimate the Hs. They evaluate the local reconstruction around a point to
give a value to H.
So if you have good local reconstruction, it means the H is big, and so the point is
predictable; otherwise, when you have poor local reconstruction, the H is very low. And this is
how you can claim that this most singular manifold actually corresponds to an unpredictable manifold.
So, now, this has been done in image processing. But if we look at this reconstruction
formula, it looks a little bit analogous to what we know about speech. Imagine for one second
that we can do the same thing for speech signals. This would tell us that, oh, by looking at this H,
we can probably have [inaudible] access to the source signal, and if we can find the good kernel,
then we are going to reconstruct. So it would be a nonlinear alternative to our
classical source-filter model.
>>: [inaudible] objections earlier which was that the source and the filter are not -- are coupled.
>> Khalid Daoudi: Yes. This is --
>>: They're not coupled here.
>> Khalid Daoudi: Yeah, they are -- they are not or they are?
>>: It seems like they're not, so two different expressions.
>> Khalid Daoudi: Yeah. But when you look at this equation here, you are obtaining the source
after a nonlinear operation on the full signal. So what you obtain, I don't know,
because it's not really decoupled.
But, anyway, if we can use this formalism this way under the assumption that
they are decoupled, why not.
So let's now see what's going on in the case of speech signals. I took exactly the same
components, but I now replace X by T. Let's use the same analogy as in the 2D case and take the
same multiscale functional.
And we want to use this new formalism to analyze speech, but keeping in mind what we want to
do. This is very important because, as I said before, the nonlinear techniques that have
been developed so far are very complicated and people get scared of them, and
probably this is why they don't have that much impact in speech processing,
besides the fact that, as I showed before, we don't have access to a lot of information.
So I want to try this formalism on speech signals, but with the philosophy of doing simple things that
everybody understands, and also of developing efficient algorithms, doing things that people
can easily implement.
And the problem is we don't have this nice reconstruction formula as in the 2D case. Okay. We
cannot hope to have this universal kernel that reconstructs the speech signal. So --
>>: That's a problem in speech, not a problem [inaudible]?
>> Khalid Daoudi: Yeah. But still, let's try to do something with these Hs. Since we
don't have the reconstruction formula, we cannot estimate the
Hs exactly the same way they do. Instead we rely on theoretical results developed by Oriol Pont which say that
if you assume that there is an energy cascade behind your process, and some other critical conditions --
that the cascade variable is infinitely divisible -- then you can estimate your exponent just by
summing up the so-called transition exponents, which in our case we compute just as
the ratio between the logarithm of the multiscale functional and
the logarithm of the scale. And you normalize.
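A sketch of what such a single-scale log-ratio estimate might look like for a 1-D signal (the exact normalization is an assumption; only the "logarithm of the functional over logarithm of the scale" shape comes from the talk):

```python
import numpy as np

def punctual_exponents(s):
    """Assumed single-scale estimate following the log-ratio description:
        h(t) ~ log( Gamma(t) / <Gamma> ) / log(r0)
    where Gamma(t) = |s'(t)| is the finest-scale multiscale functional,
    <Gamma> its mean (the normalization), and r0 the finest scale
    relative to the signal length."""
    gamma = np.abs(np.gradient(s)) + 1e-12   # avoid log(0)
    r0 = 1.0 / len(s)                        # finest resolution
    return np.log(gamma / gamma.mean()) / np.log(r0)

# the sharpest variations get the smallest (most singular) exponents
t = np.arange(2048)
s = np.sqrt(np.abs(t - 1024.0))
h = punctual_exponents(s)                    # minima sit next to the cusp
```

Because log(r0) is negative, points where the functional is large relative to its mean come out with small (most singular) exponents, which is the ordering needed to extract an MSM.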
>>: I have to admit I'm completely lost.
>> Khalid Daoudi: Okay.
>>: Can you translate this into speech? I mean, what does H correspond to in a speech signal or
a speech system? What information is in there?
>> Khalid Daoudi: So this is actually the point: in the 2D case here, we
can fairly say that these exponents quantify predictability around a
point. Okay? Here we want to say the same thing. But at least personally I feel that we don't
have the right to say this, because we don't have the reconstruction formula that allows you to
give this interpretation of local predictability.
But I'm just going to call them singularity exponents. Try to find a way how to --
>>: If I have a 2D image and I shrink it this way, you know, I chop off some of the rows on the
bottom, it's a little bit smaller, and I shrink it [inaudible] still have a 2D image. But if I shrink it
down all the way to one row, then I have a 1D signal. Why does it all of a sudden break in that case?
>> Khalid Daoudi: What will break? I mean, if --
>>: [inaudible] the difficulty comes from the [inaudible] not because of dimensionality. If it's
from the dimensionality, then [inaudible].
[multiple people speaking at once]
>> Khalid Daoudi: Actually, they have done it [inaudible] to a certain extent for stock market
series, where of course the dynamics are not the same as in speech. But I think
that for speech signals -- I'm very confident that we can achieve the
reconstruction formula. But we just have to take --
>>: Is there a particular aspect of speech that makes it [inaudible] --
[multiple people speaking at once]
>>: Right. Especially as an image. So why is that --
>> Khalid Daoudi: No, but it's not an image coming from turbulence.
>>: Of course it is.
>>: [inaudible] much more turbulent than --
>>: Okay. I have -- seriously, why --
>> Khalid Daoudi: Yeah?
>>: I mean, why --
>> Khalid Daoudi: No, this one doesn't work for -- I mean, you cannot say that you can apply it
to any image and you'll have this information about the physical process behind it. No. It's not
magical.
>>: So if they took a picture of the exhaust coming out the tailpipe on a cold day.
>> Khalid Daoudi: Of what?
>>: Billowing smoke.
>>: [inaudible]
>>: Is that what you're talking about when you're talking about an image of turbulence?
>> Khalid Daoudi: Yeah. Yeah. The smoke. Like if you look at cigarette smoke and --
>>: And so the theory tells you that there's some points in this image that you can take and then
reconstruct.
>>: Yeah.
>>: But --
>>: The spectrogram --
>>: -- it doesn't --
>>: Specifically reconstruct.
>>: -- specifically reconstruct, right? The spectrogram doesn't have -- doesn't look like that.
>>: Right.
>>: And so that's why you're saying that speech is -- you can't directly apply this to speech.
>> Khalid Daoudi: Yeah, of course.
>>: It's deterministic [inaudible].
>>: Wait a second. I find it hard to believe that the turbulent image would be more
compressible than the nonturbulent. And turning the image into a few of these points, that's a lot
like compression.
>>: It's a stochastic reconstruction, so it has something that resembles [inaudible] pixel accurate.
Is that right?
>> Khalid Daoudi: Yeah. It's not -- it's not going to be pixel for pixel, for sure, but --
>>: [inaudible]
>>: No, no, but I think it's more than that. I think he actually means that it reconstructs it --
not that it reconstructs something with the same statistical properties, but that it actually reconstructs
something that's like a close approximation in mean squared error.
>>: I don't think so.
[multiple people speaking at once]
>>: What do you mean by the word reconstruction for images?
>> Khalid Daoudi: For images, it's that you take the image, you compute your exponents, you find
the lowest ones, and you apply the kernel here --
>>: Right.
>> Khalid Daoudi: -- and you have a good reconstruction of your image.
>>: And what does good mean?
>> Khalid Daoudi: Yeah. So good, it's --
>>: I mean, no, seriously.
>>: I mean, if you had a bunch of images of turbulence, would you be able to find -- and you
had the reconstruction, would you be able to find the source image? Would it look similar?
>> Khalid Daoudi: If you have --
>>: So let's say I have ten pictures of turbulence and I take one and I try and reconstruct it and I
see this reconstruction, can I as a human pick out which of the ten I used as a source of that
reconstruction? Is it -- I mean, do the images look the same?
>> Khalid Daoudi: Well, at least this is what they claim, that they really look the same,
that you can recover -- for turbulent images, you can recover all the --
>>: Statistics.
>> Khalid Daoudi: -- all the important fronts --
>>: [inaudible]
>> Khalid Daoudi: Yeah. Unfortunately I don't have -- if I had known, I would have
shown you examples of ocean dynamics.
>>: You're not arguing for pixel by pixel mean squared error [inaudible].
>> Khalid Daoudi: Well, actually, for example, in -- yeah. If you are thinking about
compression or -- I mean, this physicist --
>>: If we took a picture of LENA --
>> Khalid Daoudi: Yeah. And actually -- but --
>>: -- and we did this and then we reconstructed it, would we say, oh, that's LENA? Or
would --
>> Khalid Daoudi: Yeah, yeah, actually this is true. But, you know, they consider LENA as a
natural image.
>>: Yeah.
>> Khalid Daoudi: You see?
[multiple people speaking at once]
>>: That's my question.
[laughter]
>>: Right. It seems --
>>: And that's something that is not good, because it's not turbulent?
>> Khalid Daoudi: It's -- so if I [inaudible] I mean, you want -- you want that -- take the
spectrogram and reconstruct it. This is what you mean?
>>: Yes.
>>: [inaudible]
>>: So if you do this, then it can reconstruct LENA, why can't we do this and reconstruct the
spectrogram?
>> Khalid Daoudi: First -- and -- and imagine if we cannot reconstruct the spectrograms, then
what?
>>: Now you can go back to the [inaudible].
>> Khalid Daoudi: I mean, how are you going -- you are going to pick some points, some MSM
in the frequency domain and --
>>: [inaudible]
>> Khalid Daoudi: Excuse me?
>>: In some respect [inaudible] reconstruct away from.
>> Khalid Daoudi: Yeah. But --
>>: So it's isomorphic. It has the same information.
>>: All right. So the point is if we can reconstruct LENA, we can reconstruct the spectrogram.
If we reconstruct the spectrogram, we can reconstruct the waveform, and therefore it works for
the 1D signal.
>> Khalid Daoudi: Well, probably you're right [inaudible]. I will -- actually, I have at some
point, but I thought that what would [inaudible] and we look at the spectrogram, what would
[inaudible] this MSM to what it will correspond to or --
>>: I have no idea.
>>: Yeah.
>>: [inaudible]
>>: [inaudible] images.
>>: But there's enough of both.
>>: Well, it's a 2D image, so, yeah, there's a lot of [inaudible] constrained, not everything which
is a valid spectrogram. Because it's a 1D signal. So it's a constrained image.
>>: Yeah, so [inaudible].
>> Khalid Daoudi: But probably I will [inaudible] send you the reconstruction and see what
you think.
>>: All right.
>> Khalid Daoudi: So where were we -- so, anyway, in summary, for the 1D case, we just adopted
this way to estimate the exponents, which is very easy, very simple.
>>: So I have a question so I understand how this computation has [inaudible]. These individual
H sub-R sub-I are calculated by the log ratio of this T operator --
>> Khalid Daoudi: Yeah.
>>: -- which is your modulus gradient.
>> Khalid Daoudi: Yeah.
>>: So it's like how quickly is it changing at the scale RI?
>> Khalid Daoudi: Yeah.
>>: And the denominator is the scale divided by what, the [inaudible]?
>> Khalid Daoudi: Yeah [inaudible] frequency.
>>: So that's a constraint.
>> Khalid Daoudi: Yeah.
>>: And then RI is measured in time?
>> Khalid Daoudi: Exactly. So you just -- it's your [inaudible] looking at.
>>: Okay.
>> Khalid Daoudi: So you see it's very simple to compute.
>>: It's trying to figure out what's happening across frequency, how quickly is the signal
changing?
>> Khalid Daoudi: Yeah. Across [inaudible] scale. Like for scale, yeah.
>>: Okay.
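The two-scale estimate discussed in this exchange can be sketched roughly as follows. This is an illustrative sketch, not GEOSTAT's actual estimator: it assumes the gradient-modulus measure is the gradient modulus summed over a window of radius r, estimates h(t) from the log ratio of the measure at two scales, and the scale choices and the "-1" normalization are arbitrary choices made here, not taken from the talk.

```python
import numpy as np

def singularity_exponents(s, r0=1, r1=4):
    """Rough per-sample singularity exponent from a gradient-modulus
    measure at two scales (illustrative sketch only)."""
    g = np.abs(np.gradient(s))           # gradient modulus of the signal
    eps = 1e-12                          # avoid log(0) in silent regions

    def measure(r):
        # mu_r(t): gradient modulus summed over a window of radius r
        kernel = np.ones(2 * r + 1)
        return np.convolve(g, kernel, mode="same") + eps

    # h(t) ~ log(mu_r1 / mu_r0) / log(r1 / r0), shifted so an isolated
    # impulse (measure constant across scales) gets the lowest value
    return np.log(measure(r1) / measure(r0)) / np.log(r1 / r0) - 1.0
```

With this convention, an isolated discontinuity yields a low (most singular) exponent, while smooth regions where the gradient spreads over the whole window yield higher values, matching the discussion that the lowest exponents mark the least predictable points.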
>> Khalid Daoudi: So I said now we have [inaudible] compute H, and let's look at how they look.
So this is a speech signal from TIMIT. And the vertical red lines are just the manual
transcription of phonemes in TIMIT. And if you look at the time-conditioned distribution of the
Hs -- so here you have time -- you take a 32 millisecond window and you look at the histogram of
the Hs in each window.
And you can see -- well, here it's not very clear, but if you look closer, you see that actually
the distribution of the Hs is changing from phone to phone, but of course it is not very clear
from here.
So we said, okay, since the distribution is changing, let's just look at the simplest [inaudible]
statistic. So we'll look at the local mean, and since we want to keep the resolution as fine as
possible, we just look at the running mean of the Hs.
So we look at this function ACC, as accumulation, and you can see that we have this nice
behavior that it's almost a piecewise linear functional with breaking points at the boundaries of
phonemes, and you can even see for shorter ones -- see here -- that the slopes of this
piecewise-like linear functional are changing [inaudible].
So we said, okay, let's do a simple thing and fit a piecewise linear functional to this curve and
identify the breaking points.
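The ACC functional and the breakpoint search can be sketched as below. The ACC is just the running mean of the exponents; the breakpoint test here (compare a one-line fit against a two-line fit split at the candidate point) is one simple way to do the piecewise-linear fit, and the window size and threshold are guesses, not values from the talk.

```python
import numpy as np

def acc(h):
    """Running mean of the exponents: ACC(t) = mean(h[0..t])."""
    return np.cumsum(h) / np.arange(1, len(h) + 1)

def breakpoints(curve, win=100, ratio=4.0):
    """Flag t where two lines split at t fit the surrounding window
    much better than one line (window and threshold are guesses)."""
    x = np.arange(2 * win)
    found = []
    for t in range(win, len(curve) - win, win // 2):
        seg = curve[t - win:t + win]
        # residual of a single straight-line fit over the whole window
        e1 = np.polyfit(x, seg, 1, full=True)[1][0]
        # residual of two independent lines, split at t
        e2 = (np.polyfit(x[:win], seg[:win], 1, full=True)[1][0]
              + np.polyfit(x[win:], seg[win:], 1, full=True)[1][0])
        if e1 > ratio * (e2 + 1e-12):
            found.append(t)
    return found
```

On a curve that really is piecewise linear, the test fires only near the slope changes, which is the behavior described for the phone boundaries.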
>>: But wouldn't this be [inaudible] technique, you still can't find this kind of behavior?
>> Khalid Daoudi: Well, we've looked. Can you tell me a simple technique that will do
this?
>>: Just, for example, just the energy you've already gotten.
>> Khalid Daoudi: If you're looking at the energy, you are not going to have this kind of
piecewise linear functional with such a resolution.
>>: So the 0 to 2 is [inaudible].
>> Khalid Daoudi: Yeah. No, but --
>>: [inaudible] never get three steps, right?
>> Khalid Daoudi: Yeah, but it doesn't mean here where you start, because --
>>: [inaudible]
>>: So the slope can be interpreted as a local mean --
>> Khalid Daoudi: Yeah.
>>: -- of the -- of the H?
>>: Right.
>>: But there's usually a flaw in the A to D, because usually a low-pass -- I mean a high-pass
filter at the front end, so the local mean should be 0, so it doesn't pass [inaudible].
I'm a little confused as to what this is getting at, because it's essentially the first -- it's essentially
the lowest bin, the DC bin of the spectrogram [inaudible] time it's cumulative. So it's the integral.
>>: [inaudible]
>>: Yeah. So it's a low-pass filtering. Perfect.
>> Khalid Daoudi: Okay. So I'm curious how you look at this. So what are you
saying?
>>: I mean, you've taken the speech signal and you're integrating over time, which is equivalent
to, in the frequency domain, multiplying by 1 over F.
>> Khalid Daoudi: Yeah, but here we are not taking the speech signal. It's not the speech.
>>: What is that? Oh, that's for H.
>> Khalid Daoudi: Yeah.
>>: Ah. Okay. I have no idea, then. Sorry. I missed that.
>>: Low-pass filter here, H.
>> Khalid Daoudi: Yeah.
>>: So let's go back a slide, then. I'm sorry.
>>: [inaudible] it's moving out the ripple.
>>: Go back a slide. So you said that H would be low when you have maximum predictability.
But around the 0.3 it looks like a fricative, which would be the least predictability.
>> Khalid Daoudi: No. It's the opposite. When it's high, it's --
>>: Well, I don't [inaudible] wave points very well, but I think this looks like a fricative. It
should be most unpredictable. I mean, this is definitely a vowel.
>>: Well, he has it labeled in the previous slide what that is.
>>: [inaudible]
>>: Go back one slide.
>>: Back or forward? No, back. Other way.
[multiple people speaking at once]
>>: That's a fricative.
>>: That's maximum randomness.
>> Khalid Daoudi: Yeah.
>>: But it's at the lowest H.
>> Khalid Daoudi: Yeah. This is [inaudible] unpredictable.
>>: Can you just go back?
>>: But I thought small H was predictable.
>>: Go one before it.
>> Khalid Daoudi: No, no, it's the opposite. Smallest H is unpredictable.
>>: Towards negative infinity is unpredictable.
>> Khalid Daoudi: Yeah.
>>: That's what you're saying.
[multiple people speaking at once]
>>: If you go back one, the one before it's a vowel, and that's also low.
>>: Yeah. There's not much difference between this vowel, which is very predictable, and this
fricative, which is completely noise and very random.
>>: Compared to some of the other things that are going on.
>>: But there seems to be some effective [inaudible] because the most predictable thing is
silence.
>>: Silence is really predictable.
>>: Yep.
>>: That part works.
>> Khalid Daoudi: But let me remind you that I don't allow myself to talk about predictability at
this point for these exponents. I talk about it really for images, but still here you can see that
it's a little bit what we are expecting. I don't know if you agree.
>>: I'll go with that for now.
>>: But the -- so the previous -- the next -- so you take the mean -- wait. Is this the mean --
[inaudible] the mean vertically and the --
>>: It's over H, so --
>>: [inaudible] only over tau, H of -- right, all the -- in the previous slide I thought your H of tau
was a sum of other little Hs.
>>: Well, H in the previous one was a histogram.
>>: Of all the Hs [inaudible].
>>: [inaudible]
>> Khalid Daoudi: Yeah. This is just a histogram.
>>: Okay.
>>: Go back one more.
>>: Yeah. So I thought H running over time is a sum of --
>> Khalid Daoudi: Yeah, this is just a way to compute it. You can just --
>>: Okay. So that's --
>> Khalid Daoudi: So these Hs are other transitionals here. We are just [inaudible] --
>>: So everything -- so that's just H. Okay.
>> Khalid Daoudi: Forget this one. I'm just saying how we compute it in this case. [inaudible]
you saw it is easy, actually.
>>: So one of the things you're asking is how would somebody in speech recognition look at a
graph like this? It seems like the only really useful thing is like a spectral measure where you're
trying to determine whether the sounds around it are similar or different. Then the places where
you have a linear progression of this integrator, you expect the expected value of the H in that
region is the same. So it seems like you can --
>>: Yeah, kind of eyeball it.
>>: If the sound isn't changing, then the H isn't changing.
>>: Right.
>>: And there's some correlation there between the two.
>>: But is there also like [inaudible] like there's an A at about .5, there's another A at about 2.5?
>>: They're both sloped down.
>>: [inaudible]
>>: Oh, go backwards, go backwards. I mean, I think this.
>>: Go back a slide.
>>: The slope, the informative thing. Because it's not an absolute number.
>>: So this previous thing is based on the fact -- this next slide is based on the fact this is
constant, this is constant, this is constant, [inaudible] there. This is constant. [inaudible] over
time, that gives you a particular slope.
>>: Yeah.
>>: You're just interpolating it into constant [inaudible] right?
>>: Right. I think.
>>: So what happened to 0.4 on the next graph? Because here it's sloping a lot. So that's --
well, it's changing I guess --
>>: Yeah, it's a steep slope.
>>: Yeah. I don't know what that means.
>>: [inaudible] yeah.
>>: [inaudible] for H is 0. On the previous slide you've got [inaudible].
>> Khalid Daoudi: Yeah.
>>: What's the significance of the [inaudible]?
>> Khalid Daoudi: So [inaudible] the negative ones are the most important ones. It's where you
have low predictability. And the positive ones are the opposite. So -- and here we are looking at
all of them. We are not at the MSM yet.
>>: It seems like context dependent predictability. Because like the two [inaudible] are totally
different.
>>: [inaudible] the integrator.
>> Khalid Daoudi: Yeah, so the integrator. So if you look --
[multiple people speaking at once]
>> Jasha Droppo: So, Khalid, I want to draw everybody's attention to the clock.
>> Khalid Daoudi: Yes.
>>: Yes. Oh, that's right [inaudible] three o'clock to the other meeting.
>> Khalid Daoudi: Yeah, yeah. So -- and here -- so by this simple operation [inaudible] we
compared to the state-of-the-art in phonetic segmentation -- don't look at the third row, this is
just an incremental improvement of this integrator, this ACC functional. And we -- so we
compared to what we found is the best.
So this paper of Dusan and Rabiner, which we compare to on the full TIMIT database, so at least
we can have a fair comparison. And we have better results. But what is more important -- we've
been discussing this before -- is that, if this stuff makes sense, probably we should trust the
segmentation by this method more than relying on the manual transcripts of TIMIT. And so
probably we should look at typical examples where even manual transcription cannot find the
transitions and see if this method can find them.
But unfortunately we didn't do this because we wanted to go on to look at other aspects. So
let's look at the MSM, the measure component of the formalism. So here is a voiced speech
signal and here are the Hs now, just the signal of the Hs.
And you see that -- from this picture here you see that -- and the red vertical lines, they
correspond to GCI locations. They come from EGG signals. So they are here also.
And you see that the lowest exponents are around these GCIs. So this is directly by looking at
the H. Here is a better example where we zoom. If we take now the MSM, the 5 percent lowest
exponents, you have this behavior. So here again is another voiced signal. And you see
that the MSM is located around the GCIs. The lowest exponents are around the GCIs. There is
even another nice behavior: the lowest exponent in these regions corresponds to the exact
location of the GCI found in the EGG signal.
So it seems like this MSM corresponds to -- has a physical meaning in speech, which apparently
here is the GCI. So we said, okay, let's compare to what is going on in this field of GCI detection.
So this -- basically this level-change functional, where you compare the mean of the Hs around
the point, is just to allow it to have only one MSM point per glottal cycle. So it gives this red
curve. And then we take the zero crossings of this -- this green curve, sorry, and we take the
lowest MSM -- the lowest exponent point which is the closest to the peak. So this is just to
guarantee that we are taking the lowest MSM point, but one that is within one
glottal cycle.
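The MSM-based GCI picking described here can be sketched as follows. This is a simplified stand-in: it keeps the lowest 5 percent of exponents as candidates and enforces one pick per glottal cycle with a minimum-distance rule, rather than the level-change functional and zero-crossing logic from the talk; the percentile, sampling rate, and maximum F0 are assumptions.

```python
import numpy as np

def msm_gci(h, fs=16000, f0_max=400, pct=5):
    """GCI candidates = samples in the lowest `pct` percent of the
    exponents, thinned so at most one survives per minimum period."""
    cand = np.flatnonzero(h <= np.percentile(h, pct))
    min_period = int(fs / f0_max)        # shortest plausible glottal cycle
    kept = []
    for i in cand[np.argsort(h[cand])]:  # most singular candidates first
        if all(abs(i - j) >= min_period for j in kept):
            kept.append(i)
    return sorted(kept)
```

Greedily accepting the most singular candidates first guarantees that each retained point is the lowest exponent within its own neighbourhood, mirroring the "lowest MSM point per glottal cycle" constraint.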
So and here are the performances. So this is defended last year by Thomas Drugman. And here
are the results in clean speech. So you have two measures -- I think Mark knows this stuff better
than I do -- reliability measures and accuracy measures.
So reliability measures are almost the same, but accuracy is always better. And according to
Thomas Drugman's thesis, SEDREAMS [inaudible] is the best algorithm and the most
robust so far.
And particularly when you look at shorter windows, we have better results. So we see
again that when we want to look at high resolution, we always have this benefit, because the
[inaudible] based. And I remind you that this is just by simple transformations of the H.
So here are the results in noise, because in clean speech most methods do well. So here you
have the accuracy -- sorry, the -- the hit rate, and here, in the last figure, you have the accuracy.
So this is in the case of babble noise, and you see that [inaudible] for low SNR we are --
>>: This is adding noise to the EGG?
>> Khalid Daoudi: Yeah.
>>: Speech?
>> Khalid Daoudi: To the speech signal.
>>: I thought that you said it was done --
[multiple people speaking at once]
>>: You have a speech signal and a parallel [inaudible].
>> Khalid Daoudi: Yeah. And what is interesting is that in the case of [inaudible] noise the
results seem to be very good. So it means that there is some robustness. And it has actually
been shown in image processing, and also theoretically, that the lowest Hs are robust to noise.
So, well, I have to be quick. So this is -- so we took this MSM and proposed it as a way to
provide sparse linear prediction. So the idea is -- so the L-0 norm is the impractical ideal
solution. Another [inaudible] to the solution is the L-1 norm, but it involves convex
programming.
So we propose just to downweight the mean square error minimization on the GCIs. And
actually Wednesday morning there was a presentation by Pavo Alco [phonetic] who is doing
exactly the same thing. Fortunately I had submitted the paper three months ago. So he uses the
same principle, but with his own way to compute the GCIs. So we downweight it. And I showed,
after we made a statistical test, that we have better sparsity by doing this method. But this is --
I would say here that we are just playing about with our MSM stuff to obtain interesting
results.
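The downweighting idea can be sketched as a weighted least-squares linear prediction, where residuals at GCI samples get a weight below one so large prediction errors are tolerated exactly where the excitation pulses sit. This is one plausible reading of the description; the prediction order and the weight value are assumptions, not figures from the talk.

```python
import numpy as np

def sparse_lp(s, gcis, order=12, w_gci=0.1):
    """LP coefficients a minimizing sum_n w_n (s[n] - sum_k a_k s[n-k])^2,
    with w_n = w_gci < 1 at GCI samples and 1 elsewhere."""
    N = len(s)
    # regression matrix: row for sample n holds s[n-1], ..., s[n-order]
    X = np.column_stack([s[order - k:N - k] for k in range(1, order + 1)])
    y = s[order:]
    w = np.ones(N - order)
    for g in gcis:
        if order <= g < N:
            w[g - order] = w_gci     # tolerate a big residual at the GCI
    sw = np.sqrt(w)                  # apply sqrt weights for weighted LS
    a, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return a
```

With all weights equal this reduces to ordinary L-2 linear prediction; lowering the weights at the GCIs lets the residual concentrate there, which is what makes it sparser.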
So then, talking about performance: we made the test that on a synthetic vowel you have an
exact estimate of the original transfer function, while the L-2 norm is shifting it [inaudible] a
little bit.
And also we have used this MSM to -- we know that classic multipulse excitation (MPE)
[inaudible] operates in two stages, where in the first stage, if you want k pulses, you have
to do k searches of order N to extract the pulse locations, one after the other. And then you
optimize all the pulses [inaudible] jointly at the end.
And this first stage takes a lot of time. So we replace this first stage directly by taking the
MSM as the pulse locations. We say that the MSM gives us the location of the pulses. And then we
keep the second stage of MPE the same.
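The retained second stage can be sketched as a joint least-squares solve for the pulse amplitudes, with the pulse positions fixed in advance (here they would come from the MSM instead of the k sequential searches). The synthesis filter's impulse response and the pulse positions are inputs; this is a schematic of the joint-optimization step, not the exact MPE formulation.

```python
import numpy as np

def joint_pulse_amplitudes(target, h_imp, positions):
    """Second MPE-style stage: least-squares amplitudes for pulses at
    fixed positions, each contributing a shifted copy of h_imp."""
    N = len(target)
    A = np.zeros((N, len(positions)))
    for j, p in enumerate(positions):
        L = min(len(h_imp), N - p)
        A[p:p + L, j] = h_imp[:L]    # shifted impulse response column
    # jointly minimize || target - A @ amps ||^2 over all amplitudes
    amps, *_ = np.linalg.lstsq(A, target, rcond=None)
    return amps
```

Because the positions are fixed, the whole first stage collapses into building one matrix and solving a single small least-squares problem, which is where the complexity saving mentioned later comes from.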
And here is an example of what we find. So this is the original signal, these are the pulses given
by the MSM -- their amplitudes are computed using the second stage of MPE -- and this is the
reconstructed signal.
And here we compare the performances. So the best one is MPE -- reoptimized MPE,
because there was a first version in '82 which is relatively bad in SNR performance. Ours is
here. And this one was published last year using [inaudible] methods.
So we see that for the low case we have better results than these two here. But MPE is better.
But MPE intrinsically minimizes the error, so it's not surprising that it is doing a good job in
SNR. But if you look at perceptual measures, we have almost the same quality. And we win a
lot in [inaudible] complexity.
So I want to [inaudible] one conclusion. Actually my interest is not in the performance or
whatever improvement we gain; it is that I showed you a very simple nonlinear operation on a
signal that allows easy access to some interesting local dynamics of the speech signal.
So we saw that we can have access to GCIs and to phone transitions.
Does this mean that this formalism is potentially good for nonlinear speech processing? I hope
that the answer will come from you.
So this is the perspective for the short-term incremental research. Actually Thomas Drugman
wants to use this for synthesis in his system -- he has a synthesis system. I heard that GOIs are
very complicated to estimate, so probably we can take a look at them.
>>: [inaudible] might even have them. [inaudible] that second pulse looked very much like it
might have been the GOI.
>> Khalid Daoudi: Really?
>>: Yeah. So maybe you solved it.
>> Khalid Daoudi: What does it mean, the MSM, for unvoiced speech? Normally it should have
better meaning there, because the assumption is that there is turbulence there. Is it just some
random location in [inaudible] speech or does it have a physical meaning, as we have found for
voiced speech? Are we looking at the right intensive variable? I mean, I have showed you this
gradient-modulus measure as our intensive variable. It is derived -- we have just copied it from
image processing -- is it the right variable for speech? Can we have a better estimate of the
singularity exponents? Probably the idea [inaudible] the spectrum.
And I want to say a word that -- so this is not in this talk, but Oriol Pont, a physicist and
mathematician in our group, is working on a way to infer the energy cascade directly, through
the notion of optimal wavelets. So he has all the summaries, but we are waiting for the final
results, where we can infer the cascade of the energy completely using the notion of optimal
wavelets. So if we have this, of course we will have another way than the one I presented you
today to look at these systems.
But what would be very interesting is to have a reconstruction formula that is similar to the 2D
case, and here it is open: should we consider some dependent kernels? Because this would lead
us to what is interesting for you -- to have some new ways for feature extraction and so
on.
And also, here you see that since we are a small group and we don't have a lot of manpower, it
is very important to me to know what are the good problems or the good applications to look at,
instead of just being tied down by the lack of manpower. So I hope that you will tell me whether
this stuff makes sense. If it doesn't, you can tell me and not be nice to me. And what are the
best problems to look at. And I think I'm all right on time, aren't I?
Thank you.
[applause]
>> Jasha Droppo: So are there any further questions from the audience? There were a lot during
the presentation.
>> Khalid Daoudi: Yeah.
>> Jasha Droppo: I know a lot of us have to be somewhere at 3:00.
>> Khalid Daoudi: Ah, you have the meeting at 3:00, yeah. Yeah.
>> Jasha Droppo: One thing that Khalid doesn't know, being a mathematician, is whether this
work is interesting for speech or what kind of -- he's looking for feedback on what kind of
directions to take. Is that right?
>> Khalid Daoudi: Yeah. Yeah. To have feedback from you practical guys who know better
than me the area.
>>: So one thing I would say is I think, you know, we are very [inaudible] scale, so it's like the
phone level or the frame level [inaudible] imagine the dynamics of conversational speech
[inaudible] compared to [inaudible] --
>>: [inaudible] more interested in the dynamics of speech and articulators than we are of
[inaudible].
>>: Right.
>> Khalid Daoudi: So it's higher-level transitions that are of more interest.
>>: Yeah, the dynamics of the physical process that's generating speech rather than the
manifestation of that physical process, which is a little closed [inaudible].
>>: I think in terms of segmental decoding and having some hints as to the segment boundaries
could be very useful if those sorts of boundaries that you are identifying there hold up in a little
bit of noise and conversational speech and --
>>: If it's generally true that this gives boundaries with reasonable accuracy, I think that's
interesting.
>> Khalid Daoudi: Yeah, okay.
>>: [inaudible] because looking at the whole [inaudible].
>> Khalid Daoudi: No, no.
>>: It's a local computation.
>> Khalid Daoudi: No, it's a local computation.
>>: Thank you.
>> Khalid Daoudi: Thank you very much.