>> Larry Zitnick: Hi. It's my pleasure to introduce Hossein Mobahi here, joining
us from UIUC. His advisor is Yi Ma. He's currently actually visiting us here at
MSR about once a week. If you want to talk to him at any point let me know.
He's hanging around and doing research here because his wife is actually
currently working at Microsoft as well.
So Hossein has done a lot of interesting work. He's done some work on image
segmentation, won the best student paper at ACCV and has a follow-up paper at
IJCV on that work. He's done face recognition.
One of my favorite works that he's done is on deep learning using temporal
coherence. You should check that paper out. And today he's going to be talking
about some of his more recent theoretical work, which he submitted to CVPR,
and I'll let him describe it.
>> Hossein Mobahi: Okay. Thank you, Larry. Hello everyone. And welcome to
the talk. So I'm going to talk about nonconvex optimization with the Gaussian
blurring technique and continuation. And I will discuss some of its applications
to image alignment as well.
This is joint work with my Ph.D. advisor Yi Ma, Yuri of MSRA, and Larry Zitnick
from Building 99 at MSR.
So the talk is in two parts. At the beginning I will just give you some basic
definitions, working definitions, and then some preliminary results from the
theory side, and then the second half of the talk will be more about the
application to image alignment.
Okay. So talking about optimization, we need to address convexity and
nonconvexity. We all know that nonconvex optimization in general is not tractable.
And because of that, there's always pressure on engineers to try to model the
problem as a convex one, even if the actual process is not very similar to a
convex process. Or to write it down as a nonconvex objective function and try to
relax it to a convex function.
That is one way to deal with it. But if the actual objective function is far from
convex, of course, this is going to be very crude and not very useful.
But unfortunately a lot of real world problems are nonconvex, so we cannot ignore
this class of functions. And one piece of good news is that real world problems
usually have some kind of structure and regularity; they're not random. This
means that even if they are nonconvex, there's a possibility that if you exploit
that structure in your problem right, then you can solve that problem efficiently.
There are examples, like integer programming, where if you have some conditions
on the problem, then although it's nonconvex, it can be solved efficiently.
There's actually a vast literature on this, people trying to exploit different
features of the problem. Here you can see sum of squares or difference of convex
functions or many others.
But today I'm going to talk about one specific approach that also tries to
exploit some structure, and that's the smoothness of your objective function. It
has been very popular; people have been using it in practice. But surprisingly,
there's little theoretical understanding of fundamental issues with this method.
And I have some preliminary results for this. I'm working on this theory, so this
is a presentation based on ongoing work. It's by no means a complete theoretical
framework, but I think even the pieces I have so far are interesting to the
audience.
>>: Let me ask a question: do you know whether this technique has been used
in neural network optimization?
>> Hossein Mobahi: Yes, there is an author, I think Gorse, G-o-r-s-e. I think
that's an old paper from around the '80s. So, yeah.
Let's be precise about what we mean by smoothing. So smoothing happens in
nature and in physics. Perhaps a representative example is the heat equation.
The heat equation is simply a partial differential equation; here you can see
its definition.
So if you set the initial condition and if there is a boundary, if your space has
boundary, then you set the boundary condition and then it gives you a solution.
So you can say, okay, let's say -- let's be concrete here.
Let's say you have a bar, and then there's some distribution of temperature along
this bar, right, at time zero. If you leave this bar on its own, the heat starts to
propagate and spread across the bar.
And eventually it reaches equilibrium point where the entire temperature on the
bar is the same.
So let's define this function g in two variables x and t, where t is time and x
is the location on the bar. And let's say the initial distribution of temperature
on this bar is denoted by f(x). So that becomes your initial condition.
Now, to see what this has to do with smoothing, you need to look at what the
solution looks like. If you solve this problem for the case that your space is
unbounded, that gives the simplest formal solution. The solution would look like
this.
So this is essentially a convolution of your original function and a Gaussian
kernel. If you just replace t with sigma squared, it will probably become more
familiar to you. So I have a video: this is the location x and here is the
temperature, and in this setting I just run that differential equation to
show how it starts to smooth this function.
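To make the picture concrete, here is a rough sketch in code of that correspondence (the grid, the initial profile, and the diffusivity convention below are illustrative assumptions, not the talk's): integrating the heat equation u_t = 0.5 u_xx for time t gives the same result as convolving the initial condition with a Gaussian of sigma = sqrt(t).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical "temperature" profile f(x) on a periodic grid.
x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]
f = np.exp(-x**2) * np.cos(3 * x)

# Route 1: explicit finite-difference integration of u_t = 0.5 * u_xx.
# With this diffusivity convention, time t corresponds to sigma^2.
t_final = 0.25
dt = 0.4 * dx**2                                  # stable explicit step
u = f.copy()
for _ in range(int(round(t_final / dt))):
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u += dt * 0.5 * lap

# Route 2: convolve with a Gaussian kernel of sigma = sqrt(t).
g = gaussian_filter1d(f, sigma=np.sqrt(t_final) / dx, mode="wrap")

print(np.max(np.abs(u - g)))                      # tiny: the two agree
```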
So very simple idea. But how can we use this in optimization? To see how it's
useful for optimization, let's play this video backward in time.
So we start from a highly smoothed function; what we initially do is we find its
minimizer and try to trace this minimizer back to the original function. At each
point, okay, we are in a local minimum, and then for the next infinitesimal
change in the function, we try to follow that minimizer, and basically follow
that path until we get back to time zero.
So remember this was the original function, and this red point is showing the
minimizer you are tracking. This function, of course, is not convex; it has two
local minima that you can see, and one of them is the global minimum. When you
trace the minimizer like this, it actually finds the global one for you.
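As a rough sketch of that tracing procedure in code (the two-minimum objective and the blur schedule below are made up for illustration; tracking is done by simple downhill sliding on a grid):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy nonconvex objective: a shallow local minimum near x = +2 and the
# global minimum near x = -2 (hypothetical example, not from the talk).
x = np.linspace(-6, 6, 2001)
f = 0.05 * x**2 - np.exp(-(x + 2)**2) - 0.6 * np.exp(-(x - 2)**2)

i = None
for sigma in [400, 200, 100, 50, 25, 12, 6, 3, 1, 0]:  # blur in grid cells
    g = gaussian_filter1d(f, sigma) if sigma > 0 else f
    if i is None:
        i = int(np.argmin(g))       # start at the heavily blurred minimizer
    while 0 < i < len(g) - 1:       # follow the minimizer downhill
        if g[i - 1] < g[i]:
            i -= 1
        elif g[i + 1] < g[i]:
            i += 1
        else:
            break

print("continuation lands at x =", x[i])   # near -2, the global minimum
```

Plain descent from a bad start would stop in the shallow basin; the continuation path stays in the basin that deepens into the global minimum.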
Okay. This idea, as I said, is theoretically little understood. But I can tell
you it's very popular, even across different disciplines. For example, in
computer vision, which is my field, it's called graduated optimization; you can
look it up. There's some explanation in an old book by Andrew Blake and Andrew
Zisserman; I think it's called Visual Reconstruction or something like that.
And in chemistry people call it optimization by the diffusion equation. In
scientific computing, it's called homotopy continuation, and in engineering it's
called deterministic annealing. There are slight differences between them, but
the core and essence is the same.
They start from a smooth function and trace the solution. So it's surprising that
despite its recognition, there is not much known about this problem. And here
are just some of the interesting, yet very fundamental, questions that you might
want to ask yourself when you work with this kind of algorithm.
For example, the idea is that we want to make the objective function smooth.
When is that useful? When you smooth it enough, it becomes convex, right? Of
course, that doesn't happen for every kind of function.
So you want to know for what functions, if you push your sigma, the amount of
blurring in your Gaussian, toward infinity, the function starts to look convex.
So that's the first question. The next question is that, assuming your function
is asymptotically convex, you want to find the minimizer of the convexified
function.
Now, that's your start point for this process. But as the problem gets smoother
and smoother, the objective function becomes almost like a flat curve, and
highly unstable: numerically it is very unstable to apply any kind of numerical
technique for finding the solution, even for a convex function.
So that's not the best way to do it. But we hope that there's a closed form for
that asymptotic minimizer, so that we can get it explicitly without running any
numerical procedure.
The other question is: okay, what is the size of this class of asymptotically
convex functions? Is it large relative to the class of convex functions, and if
so, how large? And then, because we are tracing the path from this smooth
function back to the original function, we want to know that this path is
traceable. What if in the middle of the path you see a bifurcation and then you
don't know which way to choose?
I'm assuming that this is a deterministic, non-recursive algorithm, so you don't
want to branch; you just want one direct path back to the solution. And again I
would like to emphasize that this setting, this approach, has very deep roots.
It has connections to very fundamental differential equations and fundamental
processes in physics. So there should be something out there. I believe it's a
very fertile, unexplored ground, and hopefully something will come out of this
research.
>>: You didn't mention the dimensionality.
>> Hossein Mobahi: Not yet.
>>: The ones you showed are one dimensional.
>> Hossein Mobahi: That's for visualization.
>>: In high dimensions, it will be much more --
>> Hossein Mobahi: No, the concept is the same. Nothing changes conceptually;
the one dimension is just for visualization purposes. Now, we'll address some of
these questions along the talk. But before starting we have to be a little bit
precise about what we mean by asymptotic convexity. Loosely speaking, it means
that if you smooth your function -- by smoothing, remember, I mean convolution
with a Gaussian, which has a sigma parameter: the larger, the smoother -- and
you push this sigma to infinity, then the function starts becoming convex. But
more rigorously, what we mean is that for any radius m that you pick, there
exists some blurring amount sigma star such that for any pair of points that
lie inside this ball and for any sigma that is greater than sigma star, it
satisfies the convexity condition within that ball.
This is just the definition of a convex function.
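In symbols, the definition just stated reads as follows (my transcription of it; g denotes the Gaussian-smoothed f):

```latex
% Asymptotic convexity (g denotes the Gaussian-smoothed f, g = f * k_sigma):
% for every radius m there is a blur level sigma* beyond which g satisfies
% the convexity inequality on the ball B(0, m).
\forall m > 0 \;\exists \sigma^{\ast} > 0 :\;
\forall x_1, x_2 \in B(0, m),\;
\forall \lambda \in [0, 1],\;
\forall \sigma > \sigma^{\ast} :
\quad
g\bigl(\lambda x_1 + (1 - \lambda) x_2,\, \sigma\bigr)
\le
\lambda\, g(x_1, \sigma) + (1 - \lambda)\, g(x_2, \sigma).
```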
>>: G --
>> Hossein Mobahi: G is --
>>: It's the equation.
>> Hossein Mobahi: Yes. So g is, in other words, the smoothed version of your
original function f(x). F was depending only on x, but g has two parameters:
one is x, one is sigma. So when sigma goes to 0 it gives you f.
>>: Sigma here is just the -- it's kind of a scale parameter.
>> Hossein Mobahi: Yes.
>>: It could be done by some other equation, maybe.
>> Hossein Mobahi: But for this talk it's the same. So basically all the
smoothing I'm talking about in this talk is just convolving your f with a
Gaussian kernel. And that's the solution of the heat equation. But you can
explore other kinds of smoothing.
>>: This solves the heat equation. It just happens that [inaudible] has taken
this form.
>> Hossein Mobahi: Yes, exactly, if it's unbounded domain.
>>: So high dimensional heat equation you would have Gaussian.
>> Hossein Mobahi: Yes. So actually, getting back to the heat equation, here
you have the Laplacian operator, and that acts on multiple variables.
All right. Now, a couple of simple propositions. Any convex function that you
have, under very mild conditions, is also asymptotically convex. Simply write
down the definition of convexity for f; because your Gaussian kernel is
non-negative, it doesn't change the inequality, so you can multiply both sides
and integrate, and you get the definition for g.
So from this we know that, under mild conditions, the class of asymptotically
convex functions is at least as big as the class of convex functions.
Now, the other thing that is again not very difficult to figure out is that if
you're talking about twice differentiable functions, then you know the Hessian
of that function at any point needs to be positive semidefinite in order for
the function to be convex, right? Here, for asymptotic convexity, again you have
a similar characterization. It says if your sigma is large enough, larger than
sigma star, then the Hessian is positive definite everywhere within this ball of
radius m.
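And the second-order version in symbols (again my transcription):

```latex
% For twice differentiable f: beyond some blur level sigma*, the Hessian of
% the smoothed function g is positive definite everywhere in the ball.
\forall m > 0 \;\exists \sigma^{\ast} > 0 :\quad
\sigma > \sigma^{\ast}
\;\Longrightarrow\;
\nabla_x^2\, g(x, \sigma) \succ 0
\quad \text{for all } x \in B(0, m).
```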
Now, let me give you a very bizarre example; in my opinion it's a very bizarre
example. Consider this function; it doesn't matter mathematically what it looks
like. Visually it looks like this: it's almost flat, then goes down very
quickly, and then comes back up. Why it's important is that it's very close to
the L0 norm, so people interested in sparse representation will recognize it.
Now, from the convexity point of view, what's important about it is that as you
choose epsilon smaller and smaller, the function becomes concave everywhere on
the entire domain except a very tiny region around this point. I mark the
regions that are convex in pink. Here you can see it's almost like the convex
region doesn't exist; I mean, it's discretized by pixels, so it doesn't show up.
But what if you start smoothing your objective function? It changes to this: now
there is a tiny region where you can see it is convex. And as you increase
sigma, the amount of blurring, this region of convexity starts to expand. And it
just has no bound; that's why it's called asymptotically convex. You can make it
convex on as large a domain as you like.
But the important point about this example was that your initial function was
concave essentially everywhere. So this seems very difficult for
convex-relaxation-type approaches, but it's actually an asymptotically convex
function. Okay. That was very good news. I'm going to give you a couple more
pieces of good news about these functions. Is there any easy-to-check
sufficient condition for saying a function is going to be asymptotically
convex? The answer is yes. You just need to integrate the original function, no
smoothing at all, and look at the value of this integral.
If it is bounded and if it is strictly less than zero, then it's asymptotically convex.
It's a very simple condition. And it's derivative-free. So to me it was very
interesting.
And as I mentioned earlier in the talk, as you approach the asymptote the
smoothed function becomes flat, so it's really difficult to find its minimizer
by a numerical procedure. So is there any closed form solution for the
minimizer of that function? Again, to me a very interesting closed form
expression exists for it, which tells you that it's just the center of mass of
the function.
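Here is a quick numerical sanity check of both claims (the test function is my own; it is integrable with a strictly negative integral, so the asymptotic minimizer should be its center of mass, the ratio of the integral of x f(x) to the integral of f(x)):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# A made-up integrable function with strictly negative integral.
x = np.linspace(-200, 200, 40001)
f = -np.exp(-(x - 1)**2) - 0.5 * np.exp(-(x + 3)**2)

# Closed-form asymptotic minimizer: the center of mass of f
# (the grid spacing cancels in the ratio).
com = np.sum(x * f) / np.sum(f)

# Heavy Gaussian blur; sigma is in grid cells (3000 cells = 30 x-units).
g = gaussian_filter1d(f, sigma=3000)

# The blurred minimizer approaches the center of mass as sigma grows.
print(com, x[np.argmin(g)])
```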
So if you want to derive this mathematically, I have to be a little bit
informal, because, first, there's no time to go through all the details and,
second, it would become dry. My arguments here are very high level; you cannot
consider them complete proofs. But if there is any question about a more
rigorous argument, please feel free to ask.
>>: What if this solution happens to be the same as -- actually, we know this
is the solution for minimum mean squared error estimation, in those
applications.
>> Hossein Mobahi: Uh-huh.
>>: Does it have any bearing? So if you want to minimize the mean squared error
of an unbounded function, the solution -- the squared criterion is actually
convex?
>> Hossein Mobahi: Uh-huh. No, I didn't know about that. Yeah, I'm not sure.
Maybe there is.
>>: [inaudible] the Gaussian, the Gaussian totally dominates, where the
parameter is [inaudible] it's like -- you're minimizing the Gaussian pretty much.
Right?
>>: Yeah, I see. That may be the connection. That may be the connection. But
can you just kind of sketch the proof at a high level -- I'm surprised --
>> Hossein Mobahi: So the way you can prove that is you write down the
definition of g: g is the convolution of f with your Gaussian kernel,

g(x, σ) = ∫ f(t) k(x − t; σ) dt.

Now you take the second derivative of that, and you can put it on f or on the
Gaussian kernel. We put it on the Gaussian kernel, and what comes out of it is
two terms:

∂²g/∂x² = ∫ f(t) [ (x − t)²/σ⁴ − 1/σ² ] k(x − t; σ) dt.

So when you differentiate the kernel twice, you get this. Now, as sigma gets
large, you can ignore the first term, since it falls off faster, and the
remaining term carries a negative sign: as sigma goes to infinity, it becomes
like you're integrating over f alone, but with a negative sign. So in order for
this second derivative to be positive, you want the integral of f to be
negative, right? But you should work out the details, of course. Is it clear
how --
>>: The same proof, the squared. No, the first time.
>> Hossein Mobahi: Yeah. Is it clear?
>>: One thing about the condition: it seems like you could shift it by a
constant, and you happen to have a bound.
>> Hossein Mobahi: Yes.
>>: The function has a finite integral, for example, could it be there?
>> Hossein Mobahi: Actually, we want the function to have a finite integral.
Otherwise we cannot --
>>: If the integral is finite but positive.
>> Hossein Mobahi: Then it's asymptotically concave.
>>: Concave?
>> Hossein Mobahi: Yes.
>>: If it's positive, you subtract a constant.
>> Hossein Mobahi: No, no, but then it's not integrable on the entire domain.
See what I mean? If you add one to the Gaussian and integrate it, it's no longer
bounded. But remember, here I explicitly mentioned we want the integral to be
bounded. And this is only sufficient, so you can find asymptotically convex
functions that do not satisfy it. But it's handy because you can easily test
some classes of functions. And here's a concrete example. So, again, suppose f
is this function.
Again, that form is not that important, but it looks like this red curve. And it
has one minimum here, one minimum here, and a maximum there, right?
And for this problem, based on the things that I just told you, you can easily
figure out that it's asymptotically convex: you compute the integral, and it
doesn't even have to be precise; you see that it's negative. So it's
asymptotically convex. And where is the asymptotic minimizer? Again by the
definition; the yellow bar shows this point. And now here it shows different
plots.
So this is the original function, this is a little bit smoother, and this is
even smoother. You see that the minimizer is moving toward this one and it's
also looking more convex.
>>: Of course, once you do that, how do you map the minimizer back to the
original one? You know what I'm saying.
>> Hossein Mobahi: We haven't gotten there yet. So you remember there were a
couple of questions listed; the first one was asymptotic convexity, and we were
only talking about that right now. So for now, based on the proposition I gave
you earlier, you know about this under mild conditions, but I'm going to give
you even more good news.
So, again, these arguments are not rigorous for this talk, but they should
communicate the idea; a more rigorous proof is available for those interested.
So the claim is that the measure of functions that are convex is 0. If you want
to be concrete, we can limit our class of functions to this one: univariate,
twice differentiable functions whose second derivative vanishes at infinity and
is bounded in magnitude by capital F everywhere.
Now, the argument is like this. Because the second derivative vanishes at
infinity, you can find some radius delta that captures most of the signal; you
can make it as large as you want. Anything outside of it you set to 0, you
don't care. So you can make the error of this approximation to the original f
double prime as small as you want by enlarging delta.
So we can work with just this approximation. This gives us bounded support for
the function, and everywhere on this function we also have the magnitude bound.
Now let's divide this bounded support region into N equally spaced cells.
Okay, what is the situation in which this function becomes convex? When all the
cells have positive value, right? If you look at each cell, each cell is one
evaluation of the function. Because we're talking about the second derivative,
it's convex if all these cells are positive, right? The chance of seeing a
positive number in each cell is a half, right? Because we are saying that
there's no preference in the sign; we just say the magnitude is bounded, so it
can go either way.
So overall the chance of seeing a convex function, meaning that all of these are
positive, is half to the power of N, right? Now, as we increase this N to get closer
and closer to the actual function, this density approaches 0. Right? So we can
say, from this rough argument, that the measure of convex functions is 0 on
this function class. But what about asymptotically convex functions? Well, very
good news: half of the functions that you pick from this space are
asymptotically convex. Why is that? Because here it needs to be double prime, I
think; yeah, you need double prime here. Because, again, there's no bias in our
assumptions about the class of functions, half of the functions have the
integral of their second derivative less than 0, and half have it positive. And
you know that as you make the function smoother, by the same argument, it gets
closer to its integral. So roughly half of the functions are asymptotically
convex. This is very good news.
One other thing, maybe somewhat related to your question: now suppose we have
an asymptotically convex function, and suppose we found its asymptotic
minimizer. The process requires us to follow the path back to the original
problem, right? How do we know that we can do this? Because if somewhere along
the path there is a choice, then we are confused. We don't know which path to
take, and we don't want to branch, as I said. So we need to avoid this kind of
situation.
And mathematically what that means is that on the smoothed function we don't
want the Hessian to be singular anywhere along the path, right? So, again, you
can do that with a little manipulation here. First you can derive the path of
the minimizer. It's very simple to derive; I think it should be clear. Write
down the equation for points that satisfy the stationarity condition,
differentiate it with respect to sigma, which is like time here, and then
rearrange the terms and you get that, right?
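Spelled out, the manipulation is just implicit differentiation of the stationarity condition (my reconstruction of the step):

```latex
% Along the path x(sigma) of minimizers, the gradient stays zero:
%   \nabla_x g(x(\sigma), \sigma) = 0 .
% Differentiating with respect to sigma and rearranging gives the path:
\nabla_x^2 g \;\dot{x}(\sigma)
+ \frac{\partial}{\partial \sigma}\nabla_x g = 0
\quad\Longrightarrow\quad
\dot{x}(\sigma) = -\bigl[\nabla_x^2 g\bigr]^{-1}
\frac{\partial}{\partial \sigma}\nabla_x g ,
% which is well defined exactly as long as the Hessian stays nonsingular.
```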
Now, in order to make sure that along this path you don't see any singularity,
you need to make sure that your Hessian remains positive definite along this
path. At the asymptotic minimizer we already know the Hessian is strictly
positive definite, right? I mean, all the eigenvalues are strictly greater than
zero. So we want to maintain that situation. Let's say lambda here denotes the
smallest eigenvalue. You want to see the evolution of this eigenvalue over
time, right, and then see whether it's getting smaller or larger, in which
direction it's moving. You don't want it to go down, right?
Given the eigenvalue itself, can anything be said -- is there any relationship
at all between this eigenvalue and its evolution in time? Yes, there is. I
cannot get into the details because of time, but I'll just say you need to use
two things. One is the property of the heat equation that relates this operator
to this operator.
So differentiation in time becomes the Laplacian in space, and therefore you
can read this that way. The question becomes: if I have information about the
second derivative, the eigenvalues, can I say anything about fourth order
derivatives? And again yes, you can, if the function is smooth enough; just use
the negative definiteness of the Laplace operator. And to give you intuition,
for those not familiar with that, why is that the case? If a function is very
smooth, let's say it consists of one sinusoid, say sine of x. So this is your f
double prime. Now, I differentiate this twice more, and what I get is minus
sine of x. So it's just the flipped version; there is a very tight coupling.
But as you add more terms to this, like high frequencies, then this is not
quite the case. As long as your function is smooth, though, so that these
low-frequency terms are dominant, you can relate these derivatives.
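Written out, the two facts being used are these (my transcription; the heat-equation constant depends on the convention relating t and sigma):

```latex
% 1) Heat-equation property: differentiation in "time" is a spatial
%    Laplacian (here with the convention t = sigma^2):
\frac{\partial}{\partial t}\, g(x, t) = \tfrac{1}{2}\, \Delta_x\, g(x, t).
% 2) Single-sinusoid intuition relating second and fourth derivatives:
f''(x) = \sin x \;\Longrightarrow\; f''''(x) = -\sin x ,
% i.e. two more derivatives just flip the sign of a smooth, low-frequency
% term, which is what lets the eigenvalue evolution be related back to the
% eigenvalue itself.
```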
>>: I have a question here. On the previous slide you had this nice animation
showing that this little concave bump magically appears exactly where the
minimum of the initial smoothed convex function is. That's why you got this
branching behavior that you're trying to --
>> Hossein Mobahi: Right.
>>: It seems like if you had added a little bit of noise -- this is a singular
point -- you could have made another animation where the bump shows up just a
little to the right or left, and you would not have ever had this --
>> Hossein Mobahi: Yes.
>>: That's a perturbation analysis, and it could have solved --
>> Hossein Mobahi: I get your point, but the goal is not to randomly pick --
maybe it's misleading because they're both symmetric. Let's say one of these
becomes different eventually. You want to follow that path while this one stays
a local minimum.
So if you want to like choose one direction by chance, then maybe you choose
the wrong one, right?
>>: Nothing by chance. So let's step back in that case. Proving that your
entire trajectory has no branching points is weaker than proving you'll
actually converge to the minimum of the nonconvex function. You could have
something that appears very, very far away from where you are, which turns out
to be the global minimum. So that's a separate issue. All we're saying now is
we're asking for a lesser thing. We're not trying to find the minimum of the
nonconvex function; we're just trying to prove that as we do this smoothing
thing we never hit a conflict or a fork in the road where we have to choose.
I'm saying just avoid that. It seems like a perturbation analysis, namely
showing this will happen with measure zero, or adding a little bit to the
function to prevent this, seems to prove that, right?
>> Hossein Mobahi: This is not the entire goal; this is one condition that we
need to have some control over. We need to basically say which path it chooses
so that it gets to the global minimum. I haven't discussed that yet, but there
are different ways you can control and prevent this; that is one. But for
making sure that this path leads to the global minimizer, we really need
something more than local perturbation or adding noise. Because by adding
noise, you're not using any assumption about which direction is better for
leading you to the global optimum. It's just noise. But we can discuss that
later if you want.
Because I think I need to get to the application part. Actually, I'm done with
the theory. So if there are any questions I would be happy to answer them after
the talk. But now let's get to the application of this idea.
So it's image alignment. It's a very fundamental problem in computer vision: if
you do structure from motion, invariant recognition, or tracking in videos, you
usually have to hit this problem. Now, there are two major approaches.
Actually, if anyone wants to read a little bit more about alignment, there's an
excellent tutorial by Rick Szeliski here, and all the details are there.
But there are two major approaches that you can use for alignment. The first
one is feature based: you select a bunch of sparse feature points, try to
establish correspondences between them, and then use those to infer the
geometric transformation between the two images.
And the other one is called intensity or direct method, and that's you just subtract
the two images, all the pixels get involved, and then you get a residual. And then
you try to minimize this residual.
So you find the alignment that minimizes this residual. The intensity-based
method seems more tempting because it uses all the information in the image,
whereas in the first one you're throwing out a lot of information. So this one
seems richer in some sense. But unfortunately, when it comes to practice, you
usually have a lot of local optima, so it doesn't help that much.
Now, again, to get you a little familiar with the setting, let's say we have a
very simple alignment problem: f1 and f2 are two images that differ by a
displacement d. The alignment task can be formulated like this: minimize this
objective to get the optimal d.
Now, this is nonconvex because f is out of our control; it's an image, it can
be a very crazy-looking function, and therefore the objective in d can be a
very crazy-looking function.
So you need to somehow get around this. One way is to linearize f with respect
to the optimization variable, which in this case happens to be d.
With a first-order Taylor expansion you get this, and then plug it in there.
Now it gives you a convex quadratic function; you can solve it even in closed
form. And then you get some d hat. It's just an estimate.
And then apply this to the image. But remember there is an approximation here,
so if this approximation is poor, then this is far from your true solution,
right? And then, according to Taylor's remainder theorem, you can bound this
difference. The first factor is obvious: because we are linearizing around the
origin, the larger d, the worse it gets. And the other depends on a higher
order term; actually, it's the largest eigenvalue in magnitude of the Hessian.
So you want that to be small as well. Well, that depends on the image, and what
if it's not small? What can you do? You can smooth your image. You can actually
blur your image, and the effect is that that capital lambda will become small,
helping you so that the estimate of d you get is closer to d star.
But, again, there is a problem here, because as you blur your image, you lose
some of its details, right? But you can do this iteratively. This is called the
Lucas-Kanade algorithm. You start from a very coarse blur and make the
alignment, and then apply that alignment to bring things a little bit closer.
Your d is now reduced a little bit because you're working in a smaller domain.
We want this total bound to be small, so as you optimize over d and d gets
smaller and smaller toward the desired value, you're allowed to increase the
other factor, and that can happen by making the image less blurred.
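Here is a minimal 1D sketch of that coarse-to-fine loop for pure displacement (the signals, the blur schedule, and the function name are illustrative assumptions, not the talk's implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, shift as nd_shift

def lk_translation_1d(f1, f2, sigmas=(8, 4, 2, 1), iters=5):
    """Coarse-to-fine Lucas-Kanade sketch: estimate d with f1(x + d) ~ f2(x),
    linearizing f1(x + d) ~ f1(x) + d * f1'(x) at each blur level."""
    d = 0.0
    for sigma in sigmas:                          # decreasing blur schedule
        a = gaussian_filter1d(f1, sigma)
        b = gaussian_filter1d(f2, sigma)
        for _ in range(iters):
            aw = nd_shift(a, -d, order=3)         # current guess of f1(x + d)
            grad = np.gradient(aw)                # gradient w.r.t. d
            r = b - aw                            # residual
            d += grad @ r / (grad @ grad + 1e-12) # closed-form quadratic step
    return d

# Toy usage: f2 is f1 displaced by 3.5 samples.
x = np.arange(512, dtype=float)
f1 = np.exp(-((x - 250.0) / 30.0)**2)
f2 = np.exp(-((x - 246.5) / 30.0)**2)             # f2(x) = f1(x + 3.5)
print(lk_translation_1d(f1, f2))                  # close to 3.5
```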
Okay. There's even a proof that this idea works -- well, that it can recover
the correct displacement, as long as we're talking only about displacement. But
I have seen people using it even for motion models that are not displacement,
and that is wrong.
And it's easy to see why; by wrong, I mean it's not optimal. So let's say that
we
change the problem setting. Now instead of displacement, we are talking about
scaling.
So now we have f1(sx). This is the objective function: f1(sx) minus f2, which
we square and integrate.
So, again, we can linearize it and we get this form, and we can find a closed
form solution, a convex quadratic. And look at the error bound; this is very
important. In the error bound you still have that capital lambda, and you still
have the deviation of this variable from the linearization point, but this new
term appeared here, which shows that the quality of this approximation depends
on the location in the image where you do the linearization as well. In other
words, as you get farther and farther from the origin, this approximation
becomes poor.
This suggests that if you want to do any kind of blurring, it should be more
intense and aggressive for points farther from the origin, right?
So remember this point here, because we will get back to it. I just wanted to
mention the Lucas-Kanade algorithm: we need some kind of blur that is spatially
varying, but what the Lucas-Kanade algorithm does is just blur the images with
an isotropic Gaussian. So every region is treated equally.
But let's look at the eye, the human eye. Let's see what it looks like there.
So there are some color receptors called cones, and this is the density of
those points. This is the center of the eye, and as we go toward the periphery,
you see that the density rapidly decreases.
The implication is that whatever you see at the center of your visual field, in
your fovea, has the highest quality, and as you move toward the periphery it
becomes blurrier and blurrier. So from biology there's evidence that you need
this kind of spatially varying blur. Also, people in vision have heuristically
come up with ideas, like Berg and Malik, who based on their intuition use
blurring kernels that are not spatially invariant; but it's heuristic. Today we
are going to derive these kernels in a very principled way for the first time.
So again, to be concrete and illustrative, let's take the same example. It's 1D
scale alignment; that's the actual task. What people in vision traditionally do
is smooth the signal, so this convolution is over the space x, and then try to
solve it with the Lucas-Kanade algorithm.
But what I am suggesting today is to smooth the objective function, because
that's where you want the local minima to disappear, right? So that should
really be your goal. And let's look at the landscape of this optimization
objective to see what it looks like. The signals are these, one blue, one red;
I hope it's visible. They are very simple functions, and they are just mirror
flips of each other.
So the optimal scaling is minus 1, so that you flip them. Now, by signal
smoothing, you get this picture. What is this? This is where you have the
highest blur, and this is your choice of scale factor. So it has two local
minima here. As you make it blurry, you still have two local minima, so you
have to choose one of them randomly. If you trace the path of that back to the
original by reducing the blur, there is a chance you hit either the global
minimizer or the local minimizer.
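As a toy version of this comparison (my own construction: two mirrored bumps whose optimal scale is s = -1, with the blur applied directly to the tabulated objective in s):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Two hypothetical signals that are mirror flips: f1(s*x) = f2(x) at s = -1.
x = np.linspace(-4, 4, 2001)
dx = x[1] - x[0]
f2 = np.exp(-(x - 1)**2)
f1 = np.exp(-(-x - 1)**2)

# Tabulate the alignment objective E(s) = integral of (f1(s x) - f2(x))^2.
s_grid = np.linspace(-2, 2, 801)
E = np.array([np.sum((np.interp(s * x, x, f1) - f2)**2) * dx
              for s in s_grid])

# Smooth the *objective* in the scale parameter s and track its minimizer.
for sigma in [150, 60, 20, 0]:                  # blur in s-grid cells
    Es = gaussian_filter1d(E, sigma) if sigma else E
    print(sigma, s_grid[np.argmin(Es)])         # stays in the s = -1 basin
```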
That was for the Lucas-Kanade one. But if you blur the objective function, then
there's only one start point. And for this particular example, it's actually
the one that leads to the global minimum. There's another path that starts
somewhere in the middle of the road, but we never get that because --
>>: It depends on how you blur. If you blur it the other way, would it go to
the other side?
side?
>> Hossein Mobahi: What do you mean the other way?
>>: Here, when you trace it down, you trace it on the left. I suppose that's
due to the same problem you mentioned earlier about the application? Maybe I
misunderstood.
>> Hossein Mobahi: Well, no bifurcation happens on this one or that one;
there's no bifurcation. But this is --
>>: Under what condition does it trace back to the wrong --
>> Hossein Mobahi: So that's the most difficult question. If you remember, I
was listing a bunch of theoretical results that we have for this problem, and
that is something I'm still working on. I don't have concrete results to
present, but I really hope that we can get at least some conditions on that as
well, to really connect the theory to the application.
>>: So basically, after you do the blurring, you find the global optimum;
that's an empirical finding?
>> Hossein Mobahi: Yes.
>>: Wait for a different example to show?
>> Hossein Mobahi: Yeah. Okay. Now, what is the practical challenge here? Well,
if you want to blur your objective function, let's say your transformation is a
homography in the plane, so you have eight degrees of freedom. That means you
need to evaluate an eight dimensional integral, because this convolution is
going over the parameter space now, right? And that can be expensive. So the
question is: is there any alternative two dimensional function such that if you
compute this integral transform in 2D, it becomes equivalent to the eight
dimensional integral for smoothing? And fortunately that's the case, at least
for most of the transformations we care about.
And I call this guy the transformation kernel. How do I derive it? Well, you
use Fourier analysis and get to a very simple proposition here. Based on this
proposition, if you plug in tau -- tau is your transformation model, so tau
takes a point x and returns another point y, for example -- then you get it in
Fourier form and then convert it back to spatial form. The proof of this
proposition is simple: you use the Fourier representation and then Parseval's
theorem. If you do that for each transformation, you derive the corresponding
kernel. So, for example, translation is x plus d, and this is the kernel you
get.
Also you can see it for homography; this is the transformation. I didn't put
these in full because they are long expressions, but this one is some
exponential form and this one is some rational form. And this is a
visualization of some of these kernels: these two are for affine, these are for
homography. Now, again, the point I want to emphasize is that these kernels,
when you compute their integral transform, involve 2D integrals, not eight
dimensional ones, but they have the same effect as integrating your function
over that eight dimensional space. So that brings some efficiency.
And, of course, I cannot go through the details of the derivation here, but I
can at least tell you two necessary conditions you can use to check their
correctness.
One is that, because the original intention was to smooth with a Gaussian
kernel, it has to obey the properties of the heat equation, right? So first it
needs to satisfy the relationship between the Laplacian and differentiation in
sigma, and second, as the smoothing amount goes to zero, you should get back
the original transformation.
And if you test it on this table, you see that it's the case. Okay. So now, in
general, if we are not talking about just a particular transformation model,
you can write the correlation -- correlation is another measure, like the L2
error, that you can use for alignment; you try to maximize it.
So if this is your original correlation objective function, you can start
smoothing that objective function with this Gaussian kernel in the parameter
space and then use that kernel to make it an integral over the space of x. This
is the important point here.
So now this became two dimensional. And there's a simple algorithm that just
follows the path of the minimizer or maximizer. Now, perhaps we're getting
close to the end. One point that is worth mentioning is, again, getting back to
the Lucas-Kanade algorithm and seeing how it compares with these kernels. For
translation only, if you plug in the translation kernel, what it looks like is
that you are actually convolving your image with an isotropic Gaussian. Again,
this reemphasizes that if the motion is just translation, then using an
isotropic Gaussian with a fixed sigma is okay. But once you move to other
transformations, say affine, then it becomes an integral transform that is not
necessarily a convolution, and it's spatially varying; you need to do that.
The experiments are very limited because these are really preliminary results.
But here you can see the results of alignment. These are the images we used for
alignment, and this axis shows how drastic the transformation was; the
transformation class was homography. So this one was the most difficult because
it had the largest homography change.
And this curve shows the correlation coefficient that the algorithm converged
to after alignment; the bigger the better. And you can see that no blur and
Gaussian image blur do almost the same, and they are way below blurring the
objective function.
Of course, all of them become worse as you make the problem harder, but the
point is that this one is always doing better than the others. So that was all,
and I just want to acknowledge Vadim and John for their help. Thank you very
much. I think we have five minutes if there are any questions.
[applause]
>> Larry Zitnick: Any additional questions?
>>: What happens if you use something other than a Gaussian?
>> Hossein Mobahi: Other than a Gaussian, no. And the reason is that I really
leverage the properties of the heat equation -- if you remember that discussion
about traceability. So I convert the evolution over time to the Laplacian. And
that's very important, because then everything becomes static; you don't care
about time evolution anymore.
And then you can use that to say something about how you're losing convexity
along the curve and so on. But, of course, you can consider others; that's
probably more difficult, I suppose.
>>: I suppose for these other kernel functions you won't be able to get such a
simple solution as [inaudible] the center of gravity solution?
>> Hossein Mobahi: If it's another kernel?
>>: Another kernel, like a Laplacian kernel instead of the Gaussian kernel.
>> Hossein Mobahi: This here is not Gaussian.
>>: I'm talking about -- one of the solutions you have for asymptotic
convexity, the result is the center of gravity.
>> Hossein Mobahi: Oh, you're talking about that example?
>>: [inaudible].
>> Hossein Mobahi: This one?
>>: Yeah. So if you use another kind of kernel to do the smoothing, you
probably won't get that?
>> Hossein Mobahi: I'm not sure, as I just said. I only explored the Gaussian,
and the reason was that it has a lot of nice properties that make the analysis
easier, yeah. But I think, yeah, you can; it will just be more difficult.
>>: If you use a different kind of kernel, empirically do you get a different
solution?
>> Hossein Mobahi: I didn't try. Yes.
>>: So you derived the kernels for particular types of transformations, but
there's a very large space of possible transformations, and some of them are
difficult to parameterize. So how about trying to learn these from example
images?
>> Hossein Mobahi: Ah....
>>: For the alignment, the images are given to you. You're supposed to find
kernels that have the optimum at the right point.
>> Hossein Mobahi: Right. Well, I think in principle that is doable. But I
believe you have to do it by some numerical process. I don't think you can do
much in closed form, because it highly depends on the form of your data, and if
you cannot make that much of an assumption about your data, then it can be
anything.
>>: Step back from trying to guarantee anything; it's a procedure. But for
comparison: for the examples where you can derive things, would the learning
procedure give the same kernels anyway, and then for the other things, where
you can't derive them, the learning procedure would give you something
reasonable.
>> Hossein Mobahi: Uh-huh. Those are experiments that would be very interesting
to do, especially the first one: cast this problem as a learning task, limit
the transformation model to those whose kernels we know, and see whether the
learned kernels converge to the same thing.
Because they are a little bit different, I think, in terms of their objective.
If you use the learning one, I think the goal is -- eventually, do we want to
do classification? I mean, is the optimization criterion optimized for --
>>: You would optimize for reconstruction or -- whatever your function was,
you're trying to find the blurring kernel that's going to minimize the function
at the right point. But you are given example images with ground truth.
>> Hossein Mobahi: That's definitely a very interesting experiment to do. I
know, through informal chats with people including Neri and with you, that you
have all had this observation: based on empirical results you get something
similar to this blur kernel. But doing a really conclusive experiment on that
would be very interesting.
>>: I think in practice you really want to know -- there are eight degrees of
freedom in a homography, right? And do you want the standard deviation among
those eight degrees to be all the same? Probably not. Translation might have
wider variance than scale, et cetera. I think through training data you would
actually learn the standard deviations of those eight parameters, plug them
into the same model, and that would give you the right kernels. Rather than
doing it truly nonparametrically, you really just learn a small subset of
parameters.
>> Hossein Mobahi: Yeah, I think you're referring to regularization, right?
>>: Just makes --
>> Hossein Mobahi: In the CVPR paper -- we don't know yet if it gets accepted,
but if it gets accepted you will see -- we had one more thing called
regularization of the solution. And that just adds some prior. But we use a
very simple prior, just the identity transform, no special bias. We use that to
prevent converging to really weird transformations. And I think what Larry is
suggesting is that with learning you can now model that prior more accurately,
because it's specific to the data.
But because of time I couldn't talk about regularization here.
>> Larry Zitnick: Thank you.
>> Hossein Mobahi: Thank you.
[applause].