>> Larry Zitnick: Hi. It's my pleasure to introduce Hossein Mobahi here, joining
us from UIUC. His advisor is Yi Ma. He's currently actually visiting us here at
MSR about once a week. If you want to talk to him at any point let me know.
He's hanging around and doing research here because his wife is actually
currently working at Microsoft as well.
So Hossein has done a lot of interesting work. He's done some work on image
segmentation, won the best student paper at ACCV and has a follow-up paper at
IJCV on that work. He's done face recognition.
One of my favorite works that he's done is on deep learning using temporal
coherence. You should check that paper out. And today he's going to be talking
about some of his more recent theoretical work, which he submitted to CVPR,
and I'll let him describe it.
>> Hossein Mobahi: Okay. Thank you, Larry. Hello everyone. And welcome to
the talk. So I'm going to talk about nonconvex optimization with the Gaussian
blurring technique and continuation. And I will discuss some of its applications
to image alignment as well.
This is joint work with my Ph.D. advisor Yi Ma, Yuri of MSRA, and Larry Zitnick
from Building 99 at MSR.
So the talk is in two parts. At the beginning I will just give you some basic
definitions, working definitions, and then some preliminary results from the
theory side, and then the second half of the talk will be more about the
application to image alignment.
Okay. So talking about optimization, we need to address convexity and
nonconvexity. We all know that nonconvex optimization in general is not tractable.
And because of that, there's always pressure on engineers to try to model the
problem as a convex one, even if the actual process is not very similar to a
convex process. Or to write it down as a nonconvex objective function and try to
relax it to a convex function.
That is one way to deal with it. But if the actual objective function is far from
convex, of course, this is going to be very crude and not very useful.
But unfortunately a lot of real world problems are nonconvex, so we cannot ignore
this class of functions. And one piece of good news is that real world problems
usually have some kind of structure and regularity; they're not random. This
means that even if they are nonconvex, there's a possibility that if you exploit
that structure in your problem right, then you can solve that problem efficiently.
There are examples, like integer programming, where if you have some conditions
on the problem, then although it's nonconvex, it can be solved efficiently.
There's actually a vast literature on this, people trying to exploit different
features of the problem. Here you can see sum of squares or difference of convex
functions or many others.
But today I'm going to talk about one specific approach that also tries to
exploit some structure, and that's the smoothness of your objective function. It
has been very popular; people have been using it in practice. But surprisingly,
there's little theoretical understanding of fundamental issues with this method.
And I have some preliminary results for this. I'm working on this theory, so this
is a presentation based on ongoing work. It's by no means a complete theoretical
framework, but I think even the pieces I have so far are interesting to the
audience.
>>: Let me ask a question: do you know whether this technique has been used
in neural network optimization?
>> Hossein Mobahi: Yes, there is an author, I think Gorse, G-o-r-s-e. I think
that's an old paper from around the '80s. So, yeah.
Let's be precise about what we mean by smoothing. So smoothing happens in
nature and in physics. Perhaps a representative example is the heat equation.
The heat equation is simply a partial differential equation; here you can see
its definition.
So if you set the initial condition and if there is a boundary, if your space has
boundary, then you set the boundary condition and then it gives you a solution.
So you can say, okay, let's say -- let's be concrete here.
Let's say you have a bar, and then there's some distribution of temperature along
this bar, right, at time zero. If you leave this bar on its own, the heat starts to
propagate and spread across the bar.
And eventually it reaches equilibrium point where the entire temperature on the
bar is the same.
So let's define this function g in two variables x and t, where t is time and x
is the location on the bar. And let's say the initial distribution of temperature
on this bar is denoted by f(x). So that becomes your initial condition.
Now, to see what this has to do with smoothing, you need to look at what the
solution looks like. If you solve this problem for the case that your space is
unbounded, that gives the simplest formal solution. The solution would look like
this.
So this is essentially a convolution of your original function and a Gaussian
kernel. If you just replace t with sigma squared, it will probably become more
familiar to you. So I have a video: this is the location x and here is the
temperature, and in this setting I just run that differential equation to
show how it starts to smooth this function.
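To make the picture concrete, here is a rough sketch in code of that correspondence (the grid, the initial profile, and the diffusivity convention below are illustrative assumptions, not the talk's): integrating the heat equation u_t = 0.5 u_xx for time t gives the same result as convolving the initial condition with a Gaussian of sigma = sqrt(t).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical "temperature" profile f(x) on a periodic grid.
x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]
f = np.exp(-x**2) * np.cos(3 * x)

# Route 1: explicit finite-difference integration of u_t = 0.5 * u_xx.
# With this diffusivity convention, time t corresponds to sigma^2.
t_final = 0.25
dt = 0.4 * dx**2                                  # stable explicit step
u = f.copy()
for _ in range(int(round(t_final / dt))):
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u += dt * 0.5 * lap

# Route 2: convolve with a Gaussian kernel of sigma = sqrt(t).
g = gaussian_filter1d(f, sigma=np.sqrt(t_final) / dx, mode="wrap")

print(np.max(np.abs(u - g)))                      # tiny: the two agree
```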
So very simple idea. But how can we use this in optimization? To see how it's
useful for optimization, let's play this video backward in time.
So we start from a highly smoothed function; what we initially do is we find its
minimizer and try to trace this minimizer back to the original function. At each
point, okay, we are in a local minimum, and then for the next infinitesimal
change in the function, we try to follow that minimizer, and basically follow
that path until we get back to time zero.
So remember this was the original function, and this red point is showing the
minimizer you are tracking. This function, of course, is not convex; it has two
local minima that you can see, and one of them is the global minimum. When you
trace the minimizer like this, it actually finds the global one for you.
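As a rough sketch of that tracing procedure in code (the two-minimum objective and the blur schedule below are made up for illustration; tracking is done by simple downhill sliding on a grid):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy nonconvex objective: a shallow local minimum near x = +2 and the
# global minimum near x = -2 (hypothetical example, not from the talk).
x = np.linspace(-6, 6, 2001)
f = 0.05 * x**2 - np.exp(-(x + 2)**2) - 0.6 * np.exp(-(x - 2)**2)

i = None
for sigma in [400, 200, 100, 50, 25, 12, 6, 3, 1, 0]:  # blur in grid cells
    g = gaussian_filter1d(f, sigma) if sigma > 0 else f
    if i is None:
        i = int(np.argmin(g))       # start at the heavily blurred minimizer
    while 0 < i < len(g) - 1:       # follow the minimizer downhill
        if g[i - 1] < g[i]:
            i -= 1
        elif g[i + 1] < g[i]:
            i += 1
        else:
            break

print("continuation lands at x =", x[i])   # near -2, the global minimum
```

Plain descent from a bad start would stop in the shallow basin; the continuation path stays in the basin that deepens into the global minimum.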
Okay. This idea, as I said, is theoretically little understood. But I can tell
you it's very popular, even across different disciplines. For example, in
computer vision, which is my field, it's called graduated optimization; you can
look it up. There's some explanation in an old book by Andrew Blake and Andrew
Zisserman; I think it's called Visual Reconstruction or something like that.
And in chemistry people call it optimization by the diffusion equation. In
scientific computing, it's called homotopy continuation, and in engineering it's
called deterministic annealing. There are slight differences between them, but
the core and essence is the same.
They start from a smooth function and trace the solution. So it's surprising that
despite its recognition, there is not much known about this problem. And here
are just some of the interesting, yet very fundamental, questions that you might
want to ask yourself when you work with this kind of algorithm.
For example, the idea is that we want to make the objective function smooth.
When is that useful? When you smooth it enough, it becomes convex, right? Of
course, that doesn't happen for every kind of function.
So you want to know for what functions, if you push your sigma, the amount of
blurring in your Gaussian, toward infinity, the function starts to look convex.
So that's the first question. The next question is that, assuming your function
is asymptotically convex, you want to find the minimizer of the convexified
function.
Now, that's your start point for this process. But as the problem gets smoother
and smoother, the objective function becomes almost like a flat curve, and
highly unstable: numerically it is very unstable to apply any kind of numerical
technique for finding the solution, even for a convex function.
So that's not the best way to do it. But we hope that there's a closed form for
that asymptotic minimizer, so that we can get it explicitly without running any
numerical procedure.
The other question is: okay, what is the size of this class of asymptotically
convex functions? Is it large relative to the class of convex functions, and if
so, how large? And then, because we are tracing the path from this smooth
function back to the original function, we want to know that this path is
traceable. What if in the middle of the path you see a bifurcation and then you
don't know which way to choose?
I'm assuming that this is a deterministic, non-recursive algorithm, so you don't
want to branch; you just want one direct path back to the solution. And again I
would like to emphasize that this setting, this approach, has very deep roots.
It has connections to very fundamental differential equations and fundamental
processes in physics. So there should be something out there. I believe it's a
very fertile, unexplored ground, and hopefully something will come out of this
research.
>>: You didn't mention the dimensionality.
>> Hossein Mobahi: Not yet.
>>: The ones you showed are one dimensional.
>> Hossein Mobahi: That's for visualization.
>>: In high dimensions, it will be much more --
>> Hossein Mobahi: No, the concept is the same. Nothing changes conceptually;
the one dimension is just for visualization purposes. Now, we'll address some of
these questions along the talk. But before starting we have to be a little bit
precise about what we mean by asymptotic convexity. Loosely speaking, it means
that if you smooth your function -- by smoothing, remember, I mean convolution
with a Gaussian, which has a sigma parameter: the larger, the smoother -- and
you push this sigma to infinity, then the function starts becoming convex. But
more rigorously, what we mean is that for any radius m that you pick, there
exists some blurring amount sigma star such that for any pair of points that
lie inside this ball and for any sigma that is greater than sigma star, it
satisfies the convexity condition within that ball.
This is just the definition of a convex function.
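In symbols, the definition just stated reads as follows (my transcription of it; g denotes the Gaussian-smoothed f):

```latex
% Asymptotic convexity (g denotes the Gaussian-smoothed f, g = f * k_sigma):
% for every radius m there is a blur level sigma* beyond which g satisfies
% the convexity inequality on the ball B(0, m).
\forall m > 0 \;\exists \sigma^{\ast} > 0 :\;
\forall x_1, x_2 \in B(0, m),\;
\forall \lambda \in [0, 1],\;
\forall \sigma > \sigma^{\ast} :
\quad
g\bigl(\lambda x_1 + (1 - \lambda) x_2,\, \sigma\bigr)
\le
\lambda\, g(x_1, \sigma) + (1 - \lambda)\, g(x_2, \sigma).
```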
>>: G --
>> Hossein Mobahi: G is --
>>: It's the equation.
>> Hossein Mobahi: Yes. So g is, in other words, the smoothed version of your
original function f(x). F was depending only on x, but g has two parameters:
one is x, one is sigma. So when sigma goes to 0 it gives you f.
>>: Sigma here is just the -- it's kind of a scale parameter.
>> Hossein Mobahi: Yes.
>>: It could be done by some other equation, maybe.
>> Hossein Mobahi: But for this talk it's the same. So basically all the
smoothing I'm talking about in this talk is just convolving your f with a
Gaussian kernel. And that's the solution of the heat equation. But you can
explore other kinds of smoothing.
>>: This solves the heat equation. It just happens that [inaudible] has taken
this form.
>> Hossein Mobahi: Yes, exactly, if it's unbounded domain.
>>: So high dimensional heat equation you would have Gaussian.
>> Hossein Mobahi: Yes. So actually, getting back to the heat equation, here
you have the Laplacian operator, and that acts on multiple variables.
All right. Now, a couple of simple propositions. Any convex function that you
have, under very mild conditions, is also asymptotically convex. Simply write
down the definition of convexity for f; because your Gaussian kernel is
non-negative, it doesn't change the inequality, so you can multiply both sides
and integrate, and you get the definition for g.
So from this we know that, under mild conditions, the class of asymptotically
convex functions is at least as big as the class of convex functions.
Now, the other thing that is again not very difficult to figure out is that if
you're talking about twice differentiable functions, then you know the Hessian
of that function at any point needs to be positive semidefinite in order for
the function to be convex, right? Here, for asymptotic convexity, again you have
a similar characterization. It says if your sigma is large enough, larger than
sigma star, then the Hessian is positive definite everywhere within this ball of
radius m.
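And the second-order version in symbols (again my transcription):

```latex
% For twice differentiable f: beyond some blur level sigma*, the Hessian of
% the smoothed function g is positive definite everywhere in the ball.
\forall m > 0 \;\exists \sigma^{\ast} > 0 :\quad
\sigma > \sigma^{\ast}
\;\Longrightarrow\;
\nabla_x^2\, g(x, \sigma) \succ 0
\quad \text{for all } x \in B(0, m).
```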
Now, let me give you a very bizarre example; in my opinion it's a very bizarre
example. Consider this function; it doesn't matter mathematically what it looks
like. Visually it looks like this: it's almost flat, then goes down very
quickly, and then comes back up. Why it's important is that it's very close to
the L0 norm, so people interested in sparse representation will recognize it.
Now, from the convexity point of view, what's important about it is that as you
choose epsilon smaller and smaller, the function becomes concave everywhere on
the entire domain except a very tiny region around this point. I mark the
regions that are convex in pink. Here you can see it's almost like the convex
region doesn't exist; I mean, it's discretized by pixels, so it doesn't show up.
But what if you start smoothing your objective function? It changes to this: now
there is a tiny region where you can see it is convex. And as you increase
sigma, the amount of blurring, this region of convexity starts to expand. And it
just has no bound; that's why it's called asymptotically convex. You can make it
convex on as large a domain as you like.
But the important point about this example was that your initial function was
concave essentially everywhere. So this seems very difficult for
convex-relaxation-type approaches, but it's actually an asymptotically convex
function. Okay. That was very good news. I'm going to give you a couple more
pieces of good news about these functions. Is there any easy-to-check
sufficient condition for saying a function is going to be asymptotically
convex? The answer is yes. You just need to integrate the original function, no
smoothing at all, and look at the value of this integral.
If it is bounded and if it is strictly less than zero, then it's asymptotically convex.
It's a very simple condition. And it's derivative-free. So to me it was very
interesting.
And as I mentioned earlier in the talk, as you approach the asymptote the
smoothed function becomes flat, so it's really difficult to find its minimizer
by a numerical procedure. So is there any closed form solution for the
minimizer of that function? Again, to me a very interesting closed form
expression exists for it, which tells you that it's just the center of mass of
the function.
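Here is a quick numerical sanity check of both claims (the test function is my own; it is integrable with a strictly negative integral, so the asymptotic minimizer should be its center of mass, the ratio of the integral of x f(x) to the integral of f(x)):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# A made-up integrable function with strictly negative integral.
x = np.linspace(-200, 200, 40001)
f = -np.exp(-(x - 1)**2) - 0.5 * np.exp(-(x + 3)**2)

# Closed-form asymptotic minimizer: the center of mass of f
# (the grid spacing cancels in the ratio).
com = np.sum(x * f) / np.sum(f)

# Heavy Gaussian blur; sigma is in grid cells (3000 cells = 30 x-units).
g = gaussian_filter1d(f, sigma=3000)

# The blurred minimizer approaches the center of mass as sigma grows.
print(com, x[np.argmin(g)])
```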
So if you want to derive this mathematically, I have to be a little bit
informal, because, first, there's no time to go through all the details and,
second, it would become dry. My arguments here are very high level; you cannot
consider them complete proofs. But if there is any question about a more
rigorous argument, please feel free to ask.
>>: What if this solution happens to be the same as -- actually, we know this
is the solution for minimum mean squared error estimation, in those
applications.
>> Hossein Mobahi: Uh-huh.
>>: Does it have any bearing? So if you want to minimize the mean squared error
of an unbounded function, the solution -- the squared criterion is actually
convex?
>> Hossein Mobahi: Uh-huh. No, I didn't know about that. Yeah, I'm not sure.
Maybe there is.
>>: [inaudible] the Gaussian, the Gaussian totally dominates, where the
parameter is [inaudible] it's like -- you're minimizing the Gaussian pretty much.
Right?
>>: Yeah, I see. That may be the connection. That may be the connection. But
can you just kind of sketch the proof at a high level -- I'm surprised --
>> Hossein Mobahi: So the way you can prove that is you write down the
definition of g: g is the convolution of f with your Gaussian kernel,

g(x, σ) = ∫ f(t) k(x − t; σ) dt.

Now you take the second derivative of that, and you can put it on f or on the
Gaussian kernel. We put it on the Gaussian kernel, and what comes out of it is
two terms:

∂²g/∂x² = ∫ f(t) [ (x − t)²/σ⁴ − 1/σ² ] k(x − t; σ) dt.

So when you differentiate the kernel twice, you get this. Now, as sigma gets
large, you can ignore the first term, since it falls off faster, and the
remaining term carries a negative sign: as sigma goes to infinity, it becomes
like you're integrating over f alone, but with a negative sign. So in order for
this second derivative to be positive, you want the integral of f to be
negative, right? But you should work out the details, of course. Is it clear
how --
>>: The same proof, the squared. No, the first time.
>> Hossein Mobahi: Yeah. Is it clear?
>>: One thing about the condition: it seems like you could shift it by a
constant, and you happen to have a bound.
>> Hossein Mobahi: Yes.
>>: The function has a finite integral, for example, could it be there?
>> Hossein Mobahi: Actually, we want the function to have a finite integral.
Otherwise we cannot --
>>: If the integral is finite but positive.
>> Hossein Mobahi: Then it's asymptotically concave.
>>: Concave?
>> Hossein Mobahi: Yes.
>>: If it's positive, you subtract a constant.
>> Hossein Mobahi: No, no, but then it's not integrable on the entire domain.
See what I mean? If you add one to the Gaussian and integrate it, it's no longer
bounded. But remember, here I explicitly mentioned we want the integral to be
bounded. And this is only sufficient, so you can find asymptotically convex
functions that do not satisfy it. But it's handy because you can easily test
some classes of functions. And here's a concrete example. So, again, suppose f
is this function.
Again, that form is not that important, but it looks like this red curve. And it
has one minimum here, one minimum here, and a maximum there, right?
And for this problem, based on the things that I just told you, you can easily
figure out that it's asymptotically convex: you compute the integral, and it
doesn't even have to be precise; you see that it's negative. So it's
asymptotically convex. And where is the asymptotic minimizer? Again by the
definition; the yellow bar shows this point. And now here it shows different
plots.
So this is the original function, this is a little bit smoother, and this is
even smoother. You see that the minimizer is moving toward this one and it's
also looking more convex.
>>: Of course, once you do that, how do you map the minimizer back to the
original one? You know what I'm saying.
>> Hossein Mobahi: We haven't gotten there yet. So you remember there were a
couple of questions listed; the first one was asymptotic convexity, and we were
only talking about that right now. So for now, based on the proposition I gave
you earlier, you know about this under mild conditions, but I'm going to give
you even more good news.
So, again, these arguments are not rigorous for this talk, but they should
communicate the idea; a more rigorous proof is available for those interested.
So the claim is that the measure of functions that are convex is 0. If you want
to be concrete, we can limit our class of functions to this one: univariate,
twice differentiable functions whose second derivative vanishes at infinity and
is bounded in magnitude by capital F everywhere.
Now, the argument is like this. Because the second derivative vanishes at
infinity, you can find some radius delta that captures most of the signal; you
can make it as large as you want. Anything outside of it you set to 0, you
don't care. So you can make the error of this approximation to the original f
double prime as small as you want by enlarging delta.
So we can work with just this approximation. This gives us bounded support for
the function, and everywhere on this function we also have the magnitude bound.
Now let's divide this bounded support region into N equally spaced cells.
Okay, what is the situation in which this function becomes convex? When all the
cells have positive value, right? If you look at each cell, each cell is one
evaluation of the function. Because we're talking about the second derivative,
it's convex if all these cells are positive, right? The chance of seeing a
positive number in each cell is a half, right? Because we are saying that
there's no preference in the sign; we just say the magnitude is bounded, so it
can go either way.
So overall the chance of seeing a convex function, meaning that all of these are
positive, is half to the power of N, right? Now, as we increase this N to get closer
and closer to the actual function, this density approaches 0. Right? So we can
say, from this rough argument, that the measure of convex functions is 0 on
this function class. But what about asymptotically convex functions? Well, very
good news: half of the functions that you pick from this space are
asymptotically convex. Why is that? Because here it needs to be double prime, I
think; yeah, you need double prime here. Because, again, there's no bias in our
assumptions about the class of functions, half of the functions have the
integral of their second derivative less than 0, and half have it positive. And
you know that as you make the function smoother, by the same argument, it gets
closer to its integral. So roughly half of the functions are asymptotically
convex. This is very good news.
One other thing, maybe somewhat related to your question: now suppose we have
an asymptotically convex function, and suppose we found its asymptotic
minimizer. The process requires us to follow the path back to the original
problem, right? How do we know that we can do this? Because if somewhere along
the path there is a choice, then we are confused. We don't know which path to
take, and we don't want to branch, as I said. So we need to avoid this kind of
situation.
And mathematically what that means is that on the smoothed function we don't
want the Hessian to be singular anywhere along the path, right? So, again, you
can do that with a little manipulation here. First you can derive the path of
the minimizer. It's very simple to derive; I think it should be clear. Write
down the equation for points that satisfy the stationarity condition,
differentiate it with respect to sigma, which is like time here, and then
rearrange the terms and you get that, right?
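Spelled out, the manipulation is just implicit differentiation of the stationarity condition (my reconstruction of the step):

```latex
% Along the path x(sigma) of minimizers, the gradient stays zero:
%   \nabla_x g(x(\sigma), \sigma) = 0 .
% Differentiating with respect to sigma and rearranging gives the path:
\nabla_x^2 g \;\dot{x}(\sigma)
+ \frac{\partial}{\partial \sigma}\nabla_x g = 0
\quad\Longrightarrow\quad
\dot{x}(\sigma) = -\bigl[\nabla_x^2 g\bigr]^{-1}
\frac{\partial}{\partial \sigma}\nabla_x g ,
% which is well defined exactly as long as the Hessian stays nonsingular.
```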
Now, in order to make sure that along this path you don't see any singularity,
you need to make sure that your Hessian remains positive definite along this
path. At the asymptotic minimizer we already know the Hessian is strictly
positive definite, right? I mean, all the eigenvalues are strictly greater than
zero. So we want to maintain that situation. Let's say lambda here denotes the
smallest eigenvalue. You want to see the evolution of this eigenvalue over
time, right, and then see whether it's getting smaller or larger, in which
direction it's moving. You don't want it to go down, right?
Given the eigenvalue itself, can anything be said -- is there any relationship
at all between this eigenvalue and its evolution in time? Yes, there is. I
cannot get into the details because of time, but I'll just say you need to use
two things. One is the property of the heat equation that relates this operator
to this operator.
So differentiation in time becomes the Laplacian in space, and therefore you
can read this that way. The question becomes: if I have information about the
second derivative, the eigenvalues, can I say anything about fourth order
derivatives? And again yes, you can, if the function is smooth enough; just use
the negative definiteness of the Laplace operator. And to give you intuition,
for those not familiar with that, why is that the case? If a function is very
smooth, let's say it consists of one sinusoid, say sine of x. So this is your f
double prime. Now, I differentiate this twice more, and what I get is minus
sine of x. So it's just the flipped version; there is a very tight coupling.
But as you add more terms to this, like high frequencies, then this is not
quite the case. As long as your function is smooth, though, so that these
low-frequency terms are dominant, you can relate these derivatives.
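Written out, the two facts being used are these (my transcription; the heat-equation constant depends on the convention relating t and sigma):

```latex
% 1) Heat-equation property: differentiation in "time" is a spatial
%    Laplacian (here with the convention t = sigma^2):
\frac{\partial}{\partial t}\, g(x, t) = \tfrac{1}{2}\, \Delta_x\, g(x, t).
% 2) Single-sinusoid intuition relating second and fourth derivatives:
f''(x) = \sin x \;\Longrightarrow\; f''''(x) = -\sin x ,
% i.e. two more derivatives just flip the sign of a smooth, low-frequency
% term, which is what lets the eigenvalue evolution be related back to the
% eigenvalue itself.
```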
>>: I have a question here. On the previous slide you had this nice animation
showing that this little concave bump magically appears exactly where the
minimum of the initial smoothed convex function is. That's why you got this
branching behavior that you're trying to --
>> Hossein Mobahi: Right.
>>: It seems like if you had added a little bit of noise -- this is a singular
point -- you could have made another animation where the bump shows up just a
little to the right or left, and you would not have ever had this --
>> Hossein Mobahi: Yes.
>>: That's a perturbation analysis, and it could have solved --
>> Hossein Mobahi: I get your point, but the goal is not to randomly pick --
maybe it's misleading because they're both symmetric. Let's say one of these
becomes different eventually. You want to follow that path while this one stays
a local minimum.
So if you want to like choose one direction by chance, then maybe you choose
the wrong one, right?
>>: Nothing by chance. So let's step back in that case. Proving that your
entire trajectory has no branching points is weaker than proving you'll
actually converge to the minimum of the nonconvex function. You could have
something that appears very, very far away from where you are, which turns out
to be the global minimum. So that's a separate issue. All we're saying now is
we're asking for a lesser thing. We're not trying to find the minimum of the
nonconvex function; we're just trying to prove that as we do this smoothing
thing we never hit a conflict or a fork in the road where we have to choose.
I'm saying just avoid that. It seems like a perturbation analysis, namely
showing this will happen with measure zero, or adding a little bit to the
function to prevent this, seems to prove that, right?
>> Hossein Mobahi: This is not the entire goal; this is one condition that we
need to have some control over. We need to basically say which path it chooses
so that it gets to the global minimum. I haven't discussed that yet, but there
are different ways you can control and prevent this; that is one. But for
making sure that this path leads to the global minimizer, we really need
something more than local perturbation or adding noise. Because by adding
noise, you're not using any assumption about which direction is better for
leading you to the global optimum. It's just noise. But we can discuss that
later if you want.
Because I think I need to get to the application part. Actually, I'm done with
the theory. So if there are any questions I would be happy to answer them after
the talk. But now let's get to the application of this idea.
So it's image alignment. It's a very fundamental problem in computer vision: if
you do structure from motion, invariant recognition, or tracking in videos, you
usually have to hit this problem. Now, there are two major approaches.
Actually, if anyone wants to read a little bit more about alignment, there's an
excellent tutorial by Rick Szeliski here, and all the details are there.
But there are two major approaches that you can use for alignment. The first
one is feature based: you select a bunch of sparse feature points, try to
establish correspondences between them, and then use those to infer the
geometric transformation between the two images.
And the other one is called intensity or direct method, and that's you just subtract
the two images, all the pixels get involved, and then you get a residual. And then
you try to minimize this residual.
So you find the alignment that minimizes this residual. The intensity-based
method seems more tempting because it uses all the information in the image,
whereas in the first one you're throwing out a lot of information. So this one
seems richer in some sense. But unfortunately, when it comes to practice, you
usually have a lot of local optima, so it doesn't help that much.
Now, again, to get you a little familiar with the setting, let's say we have a
very simple alignment problem: f1 and f2 are two images that differ by a
displacement d. The alignment task can be formulated like this: minimize this
objective to get the optimal d.
Now, this is nonconvex because f is out of our control; it's an image, it can
be a very crazy-looking function, and therefore the objective in d can be a
very crazy-looking function.
So you need to somehow get around this. One way is to linearize f with respect
to the optimization variable, which in this case happens to be d.
With a first-order Taylor expansion you get this, and then plug it in there.
Now it gives you a convex quadratic function; you can solve it even in closed
form. And then you get some d hat. It's just an estimate.
And then apply this to the image. But remember there is an approximation here,
so if this approximation is poor, then this is far from your true solution,
right? And then, according to Taylor's remainder theorem, you can bound this
difference. The first factor is obvious: because we are linearizing around the
origin, the larger d, the worse it gets. And the other depends on a higher
order term; actually, it's the largest eigenvalue in magnitude of the Hessian.
So you want that to be small as well. Well, that depends on the image, and what
if it's not small? What can you do? You can smooth your image. You can actually
blur your image, and the effect is that that capital lambda will become small,
helping you so that the estimate of d you get is closer to d star.
But, again, there is a problem here, because as you blur your image, you lose
some of its details, right? But you can do this iteratively. This is called the
Lucas-Kanade algorithm. You start from a very coarse blur and make the
alignment, and then apply that alignment to bring things a little bit closer.
Your d is now reduced a little bit because you're working in a smaller domain.
We want this total bound to be small, so as you optimize over d and d gets
smaller and smaller toward the desired value, you're allowed to increase the
other factor, and that can happen by making the image less blurred.
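Here is a minimal 1D sketch of that coarse-to-fine loop for pure displacement (the signals, the blur schedule, and the function name are illustrative assumptions, not the talk's implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, shift as nd_shift

def lk_translation_1d(f1, f2, sigmas=(8, 4, 2, 1), iters=5):
    """Coarse-to-fine Lucas-Kanade sketch: estimate d with f1(x + d) ~ f2(x),
    linearizing f1(x + d) ~ f1(x) + d * f1'(x) at each blur level."""
    d = 0.0
    for sigma in sigmas:                          # decreasing blur schedule
        a = gaussian_filter1d(f1, sigma)
        b = gaussian_filter1d(f2, sigma)
        for _ in range(iters):
            aw = nd_shift(a, -d, order=3)         # current guess of f1(x + d)
            grad = np.gradient(aw)                # gradient w.r.t. d
            r = b - aw                            # residual
            d += grad @ r / (grad @ grad + 1e-12) # closed-form quadratic step
    return d

# Toy usage: f2 is f1 displaced by 3.5 samples.
x = np.arange(512, dtype=float)
f1 = np.exp(-((x - 250.0) / 30.0)**2)
f2 = np.exp(-((x - 246.5) / 30.0)**2)             # f2(x) = f1(x + 3.5)
print(lk_translation_1d(f1, f2))                  # close to 3.5
```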
Okay. There's even a proof that this idea works -- well, that it can recover
the correct displacement, as long as we're talking only about displacement. But
I have seen people using it even for motion models that are not displacement,
and that is wrong.
And it's easy to see why; by wrong, I mean it's not optimal. So let's say that
we
change the problem setting. Now instead of displacement, we are talking about
scaling.
So now we have f1(sx). This is the objective function: f1(sx) minus f2, which
we square and integrate.
So, again, we can linearize it and we get this form, and we can find a closed
form solution, a convex quadratic. And look at the error bound; this is very
important. In the error bound you still have that capital lambda, and you still
have the deviation of this variable from the linearization point, but this new
term appeared here, which shows that the quality of this approximation depends
on the location in the image where you do the linearization as well. In other
words, as you get farther and farther from the origin, this approximation
becomes poor.
This suggests that if you want to do any kind of blurring, it should be more
intense and aggressive for points farther from the origin, right?
So remember this point here, because we will get back to it. I just wanted to
mention the Lucas-Kanade algorithm: we need some kind of blur that is spatially
varying, but what the Lucas-Kanade algorithm does is just blur the images with
an isotropic Gaussian. So every region is treated equally.
But let's look at the eye, the human eye. Let's see what it looks like there.
So there are some color receptors called cones, and this is the density of
those points. This is the center of the eye, and as we go toward the periphery,
you see that the density rapidly decreases.
The implication is that whatever you see at the center of your visual field, in
your fovea, has the highest quality, and as you move toward the periphery it
becomes blurrier and blurrier. So from biology there's evidence that you need
this kind of spatially varying blur. Also, people in vision have heuristically
come up with ideas, like Berg and Malik, who based on their intuition use
blurring kernels that are not spatially invariant; but it's heuristic. Today we
are going to derive these kernels in a very principled way for the first time.
So again, to be concrete and illustrative, let's take the same example. It's 1D
scale alignment; that's the actual task. What people in vision traditionally do
is smooth the signal, so this convolution is over the space x, and then try to
solve it with the Lucas-Kanade algorithm.
But what I am suggesting today is to smooth the objective function, because
that's where you want the local minima to disappear, right? So that should
really be your goal. And let's look at the landscape of this optimization
objective to see what it looks like. The signals are these, one blue, one red;
I hope it's visible. They are very simple functions, and they are just mirror
flips of each other.
So the optimal scaling is minus 1, so that you flip them. Now, by signal
smoothing, you get this picture. What is this? This is where you have the
highest blur, and this is your choice of scale factor. So it has two local
minima here. As you make it blurry, you still have two local minima, so you
have to choose one of them randomly. If you trace the path of that back to the
original by reducing the blur, there is a chance you hit either the global
minimizer or the local minimizer.
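As a toy version of this comparison (my own construction: two mirrored bumps whose optimal scale is s = -1, with the blur applied directly to the tabulated objective in s):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Two hypothetical signals that are mirror flips: f1(s*x) = f2(x) at s = -1.
x = np.linspace(-4, 4, 2001)
dx = x[1] - x[0]
f2 = np.exp(-(x - 1)**2)
f1 = np.exp(-(-x - 1)**2)

# Tabulate the alignment objective E(s) = integral of (f1(s x) - f2(x))^2.
s_grid = np.linspace(-2, 2, 801)
E = np.array([np.sum((np.interp(s * x, x, f1) - f2)**2) * dx
              for s in s_grid])

# Smooth the *objective* in the scale parameter s and track its minimizer.
for sigma in [150, 60, 20, 0]:                  # blur in s-grid cells
    Es = gaussian_filter1d(E, sigma) if sigma else E
    print(sigma, s_grid[np.argmin(Es)])         # stays in the s = -1 basin
```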
That was for the Lucas-Kanade one. But if you blur the objective function, then
there's only one start point. And for this particular example, it's actually
the one that leads to the global minimum. There's another path that starts
somewhere in the middle of the road, but we never get that because --
>>: It depends on how you blur. If you blur it the other way, would it go to
the other side?
side?
>> Hossein Mobahi: What do you mean the other way?
>>: Here, when you trace it down, you trace it on the left. I suppose that's
due to the same problem you mentioned earlier about the application? Maybe I
misunderstood.
>> Hossein Mobahi: Well, no bifurcation happens on this one or that one;
there's no bifurcation. But this is --
>>: Under what condition does it trace back to the wrong --
>> Hossein Mobahi: So that's the most difficult question. If you remember, I
was listing a bunch of theoretical results that we have for this problem, and
that is something I'm still working on. I don't have concrete results to
present, but I really hope that we can get at least some conditions on that as
well, to really connect the theory to the application.
>>: So basically, after you do the blurring, you find the global optimum;
that's an empirical finding?
>> Hossein Mobahi: Yes.
>>: Wait for a different example to show?
>> Hossein Mobahi: Yeah. Okay. Now, what is the practical challenge here? Well,
if you want to blur your objective function, let's say your transformation is a
homography in the plane, so you have eight degrees of freedom. That means you
need to evaluate an eight dimensional integral, because this convolution is
going over the parameter space now, right? And that can be expensive. So the
question is: is there any alternative two dimensional function such that if you
compute this integral transform in 2D, it becomes equivalent to the eight
dimensional integral for smoothing? And fortunately that's the case, at least
for most of the transformations we care about.
And I call this guy the transformation kernel. How do I derive it? Well, you
use Fourier analysis and get to a very simple proposition here. Based on this
proposition, if you plug in tau -- tau is your transformation model, so tau
takes a point x and returns another point y, for example -- then you get it in
Fourier form and then convert it back to spatial form. The proof of this
proposition is simple: you use the Fourier representation and then Parseval's
theorem. If you do that for each transformation, you derive the corresponding
kernel. So, for example, translation is x plus d, and this is the kernel you
get.
Also you can see it for homography; this is the transformation. I didn't put
these in full because they are long expressions, but this one is some
exponential form and this one is some rational form. And this is a
visualization of some of these kernels: these two are for affine, these are for
homography. Now, again, the point I want to emphasize is that these kernels,
when you compute their integral transform, involve 2D integrals, not eight
dimensional ones, but they have the same effect as integrating your function
over that eight dimensional space. So that brings some efficiency.
And, of course, I cannot go through the details of the derivation here, but I
can at least tell you two necessary conditions you can use to check their
correctness.
One is that, because the original intention was to smooth with a Gaussian
kernel, it has to obey the properties of the heat equation, right? So first it
needs to satisfy the relationship between the Laplacian and differentiation in
sigma, and second, as the smoothing amount goes to zero, you should get back
the original transformation.
And if you test it on this table, you see that it's the case. Okay. So now, in
general, if we are not talking about just a particular transformation model,
you can write the correlation -- correlation is another measure, like the L2
error, that you can use for alignment; you try to maximize it.
So if this is your original correlation objective function, you can start
smoothing that objective function with this Gaussian kernel in the parameter
space and then use that kernel to make it an integral over the space of x. This
is the important point here.
So now this became two dimensional. And there's a simple algorithm that just
follows the path of the minimizer or maximizer. Now, perhaps we're getting
close to the end. One point that is worth mentioning is, again, getting back to
the Lucas-Kanade algorithm and seeing how it compares with these kernels. For
translation only, if you plug in the translation kernel, what it looks like is
that you are actually convolving your image with an isotropic Gaussian. Again,
this reemphasizes that if the motion is just translation, then using an
isotropic Gaussian with a fixed sigma is okay. But once you move to other
transformations, say affine, then it becomes an integral transform that is not
necessarily a convolution, and it's spatially varying; you need to do that.
The experiments are very limited because these are really preliminary results.
But here you can see the results of alignment. These are the images we used for
alignment, and this axis shows how drastic the transformation was; the
transformation class was homography. So this one was the most difficult because
it had the largest homography change.
And this curve shows the correlation coefficient that the algorithm converged
to after alignment; the bigger the better. And you can see that no blur and
Gaussian image blur do almost the same, and they are way below blurring the
objective function.
Of course, all of them become worse as you make the problem harder, but the
point is that this one is always doing better than the others. So that was all,
and I just want to acknowledge Vadim and John for their help. Thank you very
much. I think we have five minutes if there are any questions.
[applause]
>> Larry Zitnick: Any additional questions?
>>: What happens if you use something other than a Gaussian?
>> Hossein Mobahi: Other than a Gaussian, no. And the reason is that I really
leverage the properties of the heat equation -- if you remember that discussion
about traceability. So I convert the evolution over time to the Laplacian. And
that's very important, because then everything becomes static; you don't care
about time evolution anymore.
And then you can use that to say something about how you're losing convexity
along the curve and so on. But, of course, you can consider others; that's
probably more difficult, I suppose.
>>: I suppose for these other kernel functions you won't be able to get such a
simple solution as [inaudible] the center of gravity solution?
>> Hossein Mobahi: If it's another kernel?
>>: Another kernel, like a Laplacian kernel instead of the Gaussian kernel.
>> Hossein Mobahi: This here is not Gaussian.
>>: I'm talking about -- one of the solutions you have for asymptotic
convexity, the result is the center of gravity.
>> Hossein Mobahi: Oh, you're talking about that example?
>>: [inaudible].
>> Hossein Mobahi: This one?
>>: Yeah. So if you use another kind of kernel to do the smoothing, you
probably won't get that?
>> Hossein Mobahi: I'm not sure, as I just said. I only explored the Gaussian,
and the reason was that it has a lot of nice properties that make the analysis
easier, yeah. But I think, yeah, you can; it will just be more difficult.
>>: If you use a different kind of kernel, empirically do you get a different
solution?
>> Hossein Mobahi: I didn't try. Yes.
>>: So you derived the kernels for particular types of transformations, but
there's a very large space of possible transformations, and some of them are
difficult to parameterize. So how about trying to learn these from example
images?
>> Hossein Mobahi: Ah....
>>: For the alignment, the images are given to you. You're supposed to find
kernels that have the optimum at the right point.
>> Hossein Mobahi: Right. Well, I think in principle that is doable. But I
believe you have to do it by some numerical process. I don't think you can do
much in closed form, because it highly depends on the form of your data, and if
you cannot make that much of an assumption about your data, then it can be
anything.
>>: Step back from trying to guarantee anything; it's a procedure. But for
comparison: for the examples where you can derive things, would the learning
procedure give the same kernels anyway, and then for the other things, where
you can't derive them, the learning procedure would give you something
reasonable.
>> Hossein Mobahi: Uh-huh. Those are experiments that would be very interesting
to do, especially the first one: cast this problem as a learning task, limit
the transformation model to those whose kernels we know, and see whether the
learned kernels converge to the same thing.
Because they are a little bit different, I think, in terms of their objective.
If you use the learning one, I think the goal is -- eventually, do we want to
do classification? I mean, is the optimization criterion optimized for --
>>: You would optimize for reconstruction or -- whatever your function was,
you're trying to find the blurring kernel that's going to minimize the function
at the right point. But you are given example images with ground truth.
>> Hossein Mobahi: That's definitely a very interesting experiment to do. I
know, through informal chats with people including Neri and with you, that you
have all had this observation: based on empirical results you get something
similar to this blur kernel. But doing a really conclusive experiment on that
would be very interesting.
>>: I think in practice you really want to know -- there are eight degrees of
freedom in a homography, right? And do you want the standard deviation among
those eight degrees to be all the same? Probably not. Translation might have
wider variance than scale, et cetera. I think through training data you would
actually learn the standard deviations of those eight parameters, plug them
into the same model, and that would give you the right kernels. Rather than
doing it truly nonparametrically, you really just learn a small subset of
parameters.
>> Hossein Mobahi: Yeah, I think you're referring to regularization, right?
>>: Just makes --
>> Hossein Mobahi: In the CVPR paper -- we don't know yet if it gets accepted,
but if it gets accepted you will see -- we had one more thing called
regularization of the solution. And that just adds some prior. But we use a
very simple prior, just the identity transform, no special bias. We use that to
prevent converging to really weird transformations. And I think what Larry is
suggesting is that with learning you can now model that prior more accurately,
because it's specific to the data.
But because of time I couldn't talk about regularization here.
>> Larry Zitnick: Thank you.
>> Hossein Mobahi: Thank you.
[applause].