
>> Cha Zhang: Hello. It's my great pleasure today to welcome Professor Hao Jiang from Boston College here to give a talk. The title of the talk is: Efficient Global Methods for Robust Computer Vision. Hao actually has two PhDs, one in Computer Science from Simon Fraser University, and a PhD in [inaudible] from Tsinghua University in China. Before he joined Boston College, he worked at the University of British Columbia as a postdoctoral research fellow and also at MSR Asia as an associate researcher. Without further ado, please.

>> Hao Jiang: Thanks. All right. Okay. So today I'm going to talk about some work I did in the past several years. I'm going to focus on some efficient global methods.

Okay. So, but first let's talk about the motivation: why do we need these kinds of global methods? In computer vision there are many problems where we need some kind of global model instead of purely local reasoning. For example, there is a very simple application where we want to match a template to a target image. On the template we have a lot of feature points, and in the target image there are many, many possible candidates. For all these feature points, not only do we require them to have good local feature matches; we also need them to satisfy some kind of global constraint.

For example, we need the matches to follow a uniform transformation, so the global scale and the global rotation stay consistent during the matching. That's something we need to make the matching robust. And it's not only matching; in fact, there are many other problems where we need this kind of global model instead of local models.

Okay. Unfortunately, a global model is very hard to deal with, because this kind of global constraint introduces a lot of things that are hard to handle. In fact, there are a lot of previous methods that try to deal with this situation. For example, we can use RANSAC, or we can use the Hough transform to deal with these global models. But unfortunately, a lot of the previous methods have some issues. One is that some of them are not fast enough for real applications. And some others have to do some kind of local search instead of a real global search, and that's going to affect the quality of the algorithm.

So we're going to use a different way to solve these global models. Here's our approach. In computer vision we can usually formulate the problem as some kind of energy minimization problem, so we have a criterion that tells us, for example, whether a given matching is good or not. But in computer vision, we don't really need this model to be very accurate. So we can manipulate the model itself so that it has some interesting property we can use: it should not only reflect the trend of the problem, but also be easier to solve. That's basically our approach. We manipulate the formulation so that we can

look at the formulation, look at the problem, and find some kind of structure so that we can solve the problem very efficiently. In fact, we can use this idea to solve different problems: we can do matching, we can do object detection, human pose, [inaudible] tracking, and even segmentation. I'm going to talk about some of these applications in this talk. Okay.

So let's talk about the first method we propose. We call it the lower convex hull method. I'm going to use one application to explain how this works. But in fact, this method is quite general, so we can use it for other problems if the problem formulation has the same kind of property. Object matching is a very natural application for this, so let's use it as the example.

Here's the setting for the object matching application. On the top row you can see there is a bee, and the model basically is a mesh on the bee. You have feature points, and between the feature points you have all these pairwise constraints; it's like you have nodes and edges that indicate how these feature points relate to each other. The goal is to find the correspondences of all these feature points from this template to the target image. So this is the target image, and the matching basically finds all the correspondences. You can see this bee is different from that one because there's a rotation, maybe a scaling change, and there could be deformation. So it's a hard problem: we have to find the correct matching and also deal with this kind of scale and rotation, everything together.

>>: How do you generate the mesh?

>> Hao Jiang: I'm sorry?

>>: How do you generate this mesh?

>> Hao Jiang: Okay. So this one, because we find all the point correspondences, we basically transform that mesh to here. So you use the same kind of pairwise-

>>: How do you get the original mesh?

>> Hao Jiang: Oh, for that one, for the original one, we use the [inaudible] triangulation.

You could do some other things, but the [inaudible] triangulation is an easy way to get a mesh. So basically this is a kind of mesh matching problem: you have a mesh and you want to find the correspondences of all the nodes in the target image. That's the basic problem. And the feature, in fact, is not very important; you can use, say, SIFT features, you can use shape context, you can do different things. I'm going to show you some examples using different features, but the formulation is very similar.

Okay. Let's do some formulation now. We're going to do this using an energy minimization framework. Basically, we define a criterion for which matching is good or bad, using a number. This number basically tells you, okay, if you do this matching, whether it is a good one. For example, this is the template here and this is the target. A good matching should have two properties. First, each of the feature points should have a low matching cost itself; that's the first term. So if you find a good matching, the point-wise matching cost will be very low. The second term captures the global structure of the model: all the points should rotate consistently, so there should be some consistent rotation for all of these points, and all the points should also scale consistently. That's the reason we have the second term, and there are parameters we need to estimate here: F, which is all the point-to-point correspondences; the small s, which is the scale; and R, the rotation matrix.

>>: So, you just mentioned the feature doesn't matter [inaudible]. In some way you could also select [inaudible] features to speed up [inaudible] global matching. [inaudible]?

>> Hao Jiang: Right. I mean, the feature is important, but this formulation is independent of the feature itself. You can use, say, SIFT features, or you can use some other features. Of course, some features are better than others, but the whole framework is basically the same. Features are important: if you use good features you get better results, and if you use weaker features the result may be a little bit worse, but the whole framework is the same. All we need is a cost: for this template point, if you can compute a cost to match each candidate target point, that's all we want, no matter what kind of feature you use to get that matching cost.

Okay. So the second term, we call it the pairwise consistency term. Basically, for all these pairs, these vectors, after the rotation, you can see there's a rotation there, and the scaling, they should be consistent. So all the pairs, this PQ and these two pairs, should have almost the same, I mean they don't have to be exactly the same, but really similar, scaling and rotation.
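[A reconstruction of the objective being described, in rough notation; the exact norm, weighting $\lambda$, and symbols on the slides may differ:

$$\min_{f,\;s,\;R}\ \sum_{p} c\bigl(p, f(p)\bigr) \;+\; \lambda \sum_{(p,q)\in \mathcal{E}} \bigl\| s\,(f(p) - f(q)) - R\,(p - q) \bigr\|_{1},$$

where $c(p, f(p))$ is the local matching cost of assigning template point $p$ to target point $f(p)$, $s$ is the single global scale applied on the target side, $R$ is the single global rotation, and the second sum runs over the mesh edges.]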

>>: So you're assuming [inaudible] in-plane rotations?

>> Hao Jiang: Right. So if there is an out-of-plane rotation, it's going to show up as deformation, basically. But this model, in fact, allows a little bit of this kind of deformation, not much.

>>: Why not [inaudible]?

>> Hao Jiang: So, in fact, there are some others. So you can see that there's an s in there.

There's an R in there. So an easy way, in fact, as you said, is to put R and s together, so you can use the [inaudible] transformation property, and in fact we could use a general linear transform there. But the problem is, if you do that, the result has a [inaudible] to match to a single point. Because if everything matches to a single point, the second term has zero energy, which is very small, and we don't want that to happen. That's the reason we split R and s into two different places, and then we avoid this kind of situation.

Basically we scale the target instead of scaling the template. And that, in fact, makes the model much more complex to deal with. But you have to do this. All right.

>>: So how does your second term [inaudible] disappearance of certain features and

[inaudible] second features because they [inaudible] 2D objects and 3D?

>> Hao Jiang: Okay. So in fact, if some new features come out, that's fine; we only match the features you see on the template. And if some features disappear, that's fine too. We don't really disallow one feature matching to a single location; for example, if some object shrinks to a smaller size, maybe two feature points go to a single point, and we allow that kind of situation. So that's probably not really a problem. I mean, outliers are a problem; I'm going to show some videos. If you put a hand in front of some object, [inaudible] something, that's going to be an issue, but you can still match the target if some other features help you.

Okay. So we're going to use this objective function. The hard part is, you just write this out and hopefully you can solve it; but unfortunately, this thing is very hard to solve, because there is something not very desirable in here. This is not a linear model. It's not a linear thing; in fact, it's a quadratic thing. All the ovals I put in there mark something like a quadratic term. You can see this s times X: because s and X, this [inaudible] X, are both variables. Okay. Let me explain this formulation a little bit. In fact, it is exactly the same as the previous one. The only difference is that we use matrix notation to represent everything, to make it simpler, so we can take a closer look at this thing and [inaudible], and also so we can manipulate it a little bit more easily.

So this X is basically an assignment matrix. Because we want to find the correspondences from each feature point on the template to the target points, this X, for example, if you have N points on the template and M points on the target, is an N by M matrix, and each element of this matrix is zero or one. So if you have a matching from one template point to some target feature point, there's a one there, and every [inaudible] is zero. So basically, that's a zero-one matrix, this [inaudible] X. And this [inaudible] C, that's the cost matrix. So if you take this trace, the trace of C transpose times X, it's a simple trick to sum all the corresponding products, and you get the overall matching cost for the matching. Some other things are also there: for example, this rotation matrix R, and this s is the scaling factor. And this M and this T here, they are basically just the template points stacked together, which gives you this M matrix, and the target points stacked together, which gives the T matrix. And this E that we introduce, you can see there is a subtraction of these pairs in here, so this E matrix basically introduces that effect: for all the pairs we somehow introduce a minus.

>>: So E effectively turns on and off certain rows, correct?

>> Hao Jiang: Turns on, in fact, for all the edges in there. If there's an edge between two points-

>>: Right. M is sort of [inaudible].

>> Hao Jiang: Right.

>>: All the points, correct? Because for every single point you need to encode every single possible combination of other template points and all the matching points. So E is effectively going to turn on and off those points such that only a single pairwise [inaudible].

>> Hao Jiang: Right. Something like that. Yeah. So as you can see, this constraint, the first one, makes sure that [inaudible] the matrix has a single one on each row, because each template point should only match to a single target point; you don't want a single feature point to match to two of them. So that's one constraint. And it's also a binary matrix. And this R has an orthonormality constraint, which is a quadratic constraint. So you can see this thing is not linear in the objective function, and not linear in the constraints either; it's not very easy to deal with, and it's an integer optimization problem. So we have to do something so that we can easily solve this. The idea of [inaudible] is to somehow turn everything into something linear, and hopefully, even though it's probably an approximation, because it's linear it's easier to deal with. That's basically the idea. So we need to do a bunch of things to somehow linearize it.
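[In the matrix form just described, reconstructing symbols from the narration, the constraints amount to roughly:

$$X \in \{0,1\}^{N \times M}, \qquad X\,\mathbf{1}_{M} = \mathbf{1}_{N}, \qquad R^{\top} R = I,$$

i.e. a binary assignment matrix with exactly one match per template point, plus orthonormality of the rotation.]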

So the first thing we need to do is linearize this absolute value term, the term in this [inaudible], this oval here. That is a standard linear programming trick: if you want to minimize the absolute value of something, you can introduce two [inaudible] variables Y and Z, minimize Y plus Z in the objective function, and add the constraint that Y minus Z equals X. If you make sure both are nonnegative, it's in fact equivalent to minimizing the original, because at the optimum one of them has to be zero, and so Y plus Z has to be the absolute value of X. Here we don't have a scalar Y and Z, but you can use the same trick with two matrices, one is [inaudible] Y, the other one is [inaudible] Z. And you do the same thing, basically: whatever you want to put inside the absolute value term, you constrain the difference of Y and Z to equal that term, you put Y plus Z into the objective function, and that realizes this kind of absolute value minimization.
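[A minimal runnable sketch of the absolute-value LP trick just described; the toy objective |x - 3| and the bounds are illustrative, not from the talk:]

```python
# Toy illustration of the |x| linearization: minimize |x - 3| over 0 <= x <= 10.
# Variables are [x, y, z] with y, z >= 0, y - z = x - 3, and objective y + z.
import numpy as np
from scipy.optimize import linprog

c = np.array([0.0, 1.0, 1.0])             # objective: y + z
A_eq = np.array([[1.0, -1.0, 1.0]])       # x - y + z = 3, i.e. y - z = x - 3
b_eq = np.array([3.0])
bounds = [(0, 10), (0, None), (0, None)]  # x in [0, 10], y >= 0, z >= 0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[0], res.fun)                  # x -> 3, |x - 3| -> 0
```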

>>: So for this matrix, how many new constraints do you have?

>> Hao Jiang: So, in fact, let me see. It's however many pairs of points you have; that's how many of these things you have to put in there.

>>: So it's M by T?

>> Hao Jiang: Yeah. There are quite a lot of constraints that you have to put in there, so I'm going to talk about how we can deal with that. The whole thing is going to be huge, so you have to do something to solve this. But this only solves part of the problem, because even though this term is no longer in the objective function, it is still here. We still have to somehow deal with this scale thing, because s times X is a quadratic term. To turn this into a linear function, what we do is, instead of matching a point on the template to some target point, we match the point from template to target at some specific scale. So essentially we introduce another dimension for the matching: a scale. Here, basically, we extend the matching: the original assignment matrix is 2D, and we add another dimension, the scale, so you have a bunch of layers; in fact, you want to find the matching and the scale at the same time. And if you collapse everything together, you get the original X. So this X is basically a bunch of slices of this 3D thing. The original term is nonlinear, but if you increase the dimensionality, then hopefully we can turn it into something linear. That's basically the idea here.

>>: Do you have a constraint then that effectively only one scale can be selected at a time?

>> Hao Jiang: Yeah. We need that kind of thing. So after we do that, basically we just do this.

This is an approximation. This s_l is a bunch of [inaudible] scales. Even though this is a discretized thing, finally we can relax it into a continuous variable, so in fact you get a lot of values in the middle; you get a continuous scale in the end. So that's not a big concern, even though it roughly quantizes the scale here. After we do this, we turn this into a bunch of linear things, but the variable is not X anymore; it becomes these X_l, basically slices of this 3D assignment matrix. And also, as you said, we need these constraints to make sure all these matchings, each point's matching, are at the same scale s. So that's this s in there.
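[Schematically, reconstructing from the description (the slide's exact constraints may differ), the layered assignment replaces the bilinear term by a sum over discrete scale levels:

$$s\,X \;\approx\; \sum_{l=1}^{L} s_{l}\, X_{l}, \qquad \sum_{l=1}^{L} X_{l}\,\mathbf{1} = \mathbf{1}, \qquad X_{l} \ge 0,$$

with additional linear constraints tying each point's selected scale, $\sum_{l} s_{l} X_{l}\mathbf{1}$, to the single global scale variable $s$.]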

So now we have turned most of the things into linear form. There's only one thing left: this rotation matrix. What we do here is an approximation, using a square to replace the circle. If you parameterize this rotation matrix using two variables, what you get is that the original constraint is u^2 + v^2 = 1. That's a circle if you plot it on a graph. And this circle is apparently not convex, right? I mean, a disc is convex, but the circle, just the boundary, is not convex at all. So what we do is approximate it with a square. The square is essentially |u| + |v| = 1, which approximates the u^2 + v^2 = 1 constraint. Unfortunately, this square is obviously still not convex, but it's easy to fix: we just break the square into four line segments, and each line segment is a convex set. So we break the original problem into four smaller subproblems, but fortunately it's not too many, just four of them, so we can still solve them quite efficiently.
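[Concretely, a reconstruction of the four cases from the description: the four convex subproblems restrict $(u, v)$ to one side of the square $|u| + |v| = 1$:

$$u + v = 1,\ u, v \ge 0; \qquad -u + v = 1,\ u \le 0 \le v; \qquad -u - v = 1,\ u, v \le 0; \qquad u - v = 1,\ v \le 0 \le u.$$]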

So, okay. After all these steps, we turn this into a linear thing. What I want to show you here is that even though this looks a little more complex, in fact it's much simpler than the original problem, because the objective function and the constraints all become linear. So we can just throw this into some linear solver and hopefully solve it quite fast. But in a real situation this is not very useful, because it's going to be a huge linear program, and if you directly solve it using a solver it's going to take a long, long time; it takes forever. And also, maybe the problem is too big and cannot fit into the memory of your computer at all. So you have to do something to make this very fast.

So what we do is look at this thing closely, and we found there are some properties we can use. We don't really need to solve the original problem; in fact, we can solve a simplified version of the problem, and this simplified version has exactly the same solution as the previous one. The property we use is, in fact, the lower convex hull property; that's the reason this method is called the lower convex hull method. Okay. What I show here are the matching costs [inaudible] one point. So if you take one feature point on the template and match it against all the feature points on the target image, each of them has a cost; this [inaudible] here shows the costs. When you build the linear optimization problem, originally you would have to use all of these, right? So you're going to have lots of variables. But fortunately, we don't really need all of them. Because we linearize it and relax it into some kind of linear program, the linearization and the relaxation themselves in fact turn that set of costs into a surface. And this surface is essentially the lower convex hull of those points. It's the tightest [inaudible] that supports all the points; it's like using a rubber sheet to support all these points from below, and only the supporting points, the lowest supporting points, are important; they are on the surface. Everything else is above the surface and, in fact, useless. You can throw away everything above and just keep the supporting points.

>>: [inaudible]. Why are they useless?

>> Hao Jiang: Because the linear program itself, in fact, just uses the surface to represent everything. I mean, this is a linear program. The original X is [inaudible], right? But we don't really want to solve this [inaudible] integer problem.

>>: Why not?

>> Hao Jiang: Because in general this [inaudible] linear program is simply hard. You don't really want to solve that kind of thing. So we turn X into something between zero and one.

>>: Yeah. I mean have you tried, you know, some cutting plane-

>> Hao Jiang: Oh, it's really slow. Yeah. We tried that. It's really slow, so it's not really solvable in practice. That's really why we want to linearize it and relax it to make it easier to solve. And with the relaxation, in fact, you don't solve the original problem; in fact, you use this surface.

You can prove that; that is just the case. So that means we don't really need to keep all these variables. We can throw away most of the variables and just keep something we call the effective variables. Here's an example: for one surface, if you don't do any simplification, you might have like 2500 variables to represent the optimization. But if you use this lower convex hull trick, you can see the vertices, only these blue points, and there are very few of them, only like 19. So if you use this trick, you can just use 19 variables to represent the problem itself. It becomes much smaller, and that means you can solve this thing really, really fast.

This plot shows what happens as you increase the number of target candidates, how many of these points you have originally. If you use the lower convex hull, you can see that for this one there are, in fact, only 28 effective ones. And if you increase the candidates to like a million of them, there are only about a thousand effective ones. So even though the number of target candidates increases very fast, the number of effective candidates doesn't increase very fast. So you can still deal with problems of this scale without any trouble.
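[A minimal sketch, not the speaker's code, of the pruning idea: in (x, y, cost) space only candidates on the lower convex hull can matter to the LP relaxation, so everything above it can be discarded:]

```python
# Keep only the "effective" target candidates: those on the lower convex hull
# of the lifted (x, y, cost) points.  Everything above the hull cannot be part
# of the relaxation's optimal surface, so it can be dropped.
import numpy as np
from scipy.spatial import ConvexHull

def effective_candidates(xy, cost):
    """xy: (M, 2) candidate locations; cost: (M,) matching costs.
    Returns indices of candidates lying on the lower convex hull."""
    pts = np.column_stack([xy, cost])            # lift to 3-D
    hull = ConvexHull(pts)
    keep = set()
    for eq, facet in zip(hull.equations, hull.simplices):
        if eq[2] < -1e-9:                        # outward normal points down in cost
            keep.update(facet.tolist())          # facet belongs to the lower hull
    return np.array(sorted(keep))

# Example: a few thousand random candidates typically shrink to a few dozen.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(2500, 2))
cost = rng.uniform(0, 1, size=2500)
print(len(effective_candidates(xy, cost)))
```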

So that means we can have a lot of target feature candidates and still do this kind of thing very fast and efficiently, without worrying that the complexity is too high. And the good thing is, the result is exactly the same as before; you don't lose anything by doing this. Here's an example that shows how this works. What I'm going to show you is a very toy example: we have only four feature points, and that's the graph we use. This is the target image; there are a bunch of, I didn't show the original image, only the feature points, the yellow points. If you do the first matching using this relaxation, what you get is something like this. These blue dots are the true points, and the circles are what you get. You can see that the first matching is, in fact, not very accurate; that is because [inaudible] we do the relaxation, we didn't solve the original problem.

So we need to do something to make this more accurate. What we do is use something we call the trust region shrinking method. Basically, after you do the first matching, you don't stop there. You shrink your trust region, because you know the target is probably around the point you found. You make your trust region smaller, redo everything, and then you can make your result much better. This, in fact, shows how the trust region shrinking works; let me show it again. After you do this, that rectangle is basically the region you trust, and the trust region becomes smaller and smaller, and finally you get a much better matching.

>>: How do you determine that trust region?

>> Hao Jiang: Oh. We basically just use a rectangle centered around that point. You could use some other kind of criterion; in fact, it's not that critical. We tried different ways and they worked similarly.

>>: So this trust region is a region in which you are only going to use possible feature matchings [inaudible]?

>> Hao Jiang: Right. So basically you're throwing away everything kind of outside of the region.

So that, in fact, makes your convex hull more accurate, because the region is smaller and your convex hull becomes more similar to the original surface. That's another reason it becomes more accurate.

>>: So you're solving this as an LP then, correct?

>> Hao Jiang: Right.

>>: So how do you resolve the fractional solutions?

>> Hao Jiang: Okay. So we don't do rounding here. Basically, we just linearly combine all these target points. Each of them has a floating-point weight, right? So we just linearly combine them and get the target location [inaudible]. That's another good thing: we don't really need to tweak a rounding threshold (see the small sketch after this example). Here's another example that uses different features; it uses shape context. That's the template, and there's a fish; you can see it's embedded in a lot of clutter. After the first matching, in fact, it's not so good, it's just roughly there. But if you do this again and again, as the region becomes much smaller, you get a better and better matching and finally you get the target. In fact, with almost 100 percent clutter in there, you can still find the matching quite well.
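[A tiny sketch of the fractional readout mentioned above, with illustrative names: instead of rounding, each template point's estimate is the weight-averaged location of its candidates:]

```python
# Weighted-average readout of a relaxed (fractional) assignment: no rounding
# threshold is needed; each template point's target is the weighted mean of
# its candidate locations.
import numpy as np

def fractional_targets(X, target_pts):
    """X: (N, M) relaxed assignment weights, each row summing to 1;
    target_pts: (M, 2) candidate locations.  Returns (N, 2) estimates."""
    return X @ target_pts

X = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.5, 0.5]])
pts = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])
print(fractional_targets(X, pts))
```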

And here's another example where, instead of shape context, we used [inaudible] features. This is the template we used, and this is the target image. We used all the [inaudible] features in the target image with a zero threshold, so you can see there is a [inaudible] feature everywhere. Then we do the first matching, keep doing this trust region shrinking, solve the linear program again and again, and finally we find the matching quite well. And you can see the original one is smaller and the target is bigger, so we can handle the scale and the rotation at the same time.

So there are some videos that show how this works on real data. Not only can we deal with rotation and scale, we can also deal with deformation, like the butterfly there, and this magazine also deforms. So we can do all these things together. The good thing is you don't really need to worry about which feature is good and which feature is bad; basically, we just select as many features as possible and let [inaudible] decide which ones to use. That's a good thing, especially when there are occlusions. [inaudible] that video; in fact, it's not in here. It's on my website. There's another video where I put a hand in front of the book or something. If you're interested, you can go to my website and look at that video.

It's not totally accurate here because we don't use any motion continuity, so you can see that sometimes there's suddenly a wrong matching. But, I mean, that's good and bad: sometimes you want the matching error not to propagate, and that's what you get if you just match each single frame independently. Okay. So we do some comparisons. Here we compare with a very fast method using local search, ICM, iterated conditional modes. These show that method's result; you can see that it doesn't work really well, because the feature is weak sometimes and it won't find the right target. But the proposed method works very well on this real data.

An interesting part is that not only does the proposed method get the better result, it's in fact also faster than this greedy method. The reason we can do this very fast is that we use the special structure of the problem, this lower convex hull property. This, in fact, shows it takes a fraction of the time of ICM, and the performance is also several times better than this greedy thing.

>>: Two questions.

>> Hao Jiang: Okay.

>>: So how, what's the absolute running time per frame?

>> Hao Jiang: Okay. So a typical time is if you have like 100 feature points on a target, each iteration probably a second, something like that.

>>: Okay.

>> Hao Jiang: Usually like five iterations is good enough.

>>: [inaudible]?

>> Hao Jiang: Yeah. But if you use fewer template points it becomes much faster. So it's quite sensitive to the number of template points, but not very sensitive to how many target points you have. That's the good thing about it. If you have just tens of points, like 20 or 50 feature points, it could be much faster; it could be real time.

>>: And how do you evaluate the accuracy? Like, how is accuracy measured?

>> Hao Jiang: Okay. So for these videos, in fact, we don't really have the [inaudible], so basically this is a [inaudible] inspection. But for the point matching experiments, I'll show the result here, we use point sets where we know the [inaudible], we know where the targets are, and we just compute the average distance error or something like that. So this is, in fact, an error histogram: on the left-hand side are the small errors, and on the right-hand side the big errors, so this is a histogram of all the matching errors. A good algorithm will have everything on the left-hand side and [inaudible] on the right-hand side.

And this black curve is where it gets worse. So we compare our method with some other methods, like the thin plate spline (TPS) based method and a spectral graph matching method or something. And that method, in fact, works quite well when there is not much noise in the image.

If you have a lot of clutter in there, it breaks very quickly. The proposed method works well in that case.

>>: Does your template adapt to every frame or [inaudible]?

>> Hao Jiang: It doesn't have to be the first frame as the template. You can randomly choose one. So if there's an object like-

>>: [inaudible]?

>> Hao Jiang: No. It doesn't change.

>>: So the object has deformation [inaudible]?

>> Hao Jiang: I mean, in fact, I tried putting a hand in front of the book. The hand basically occludes half of the book, but it still works in that case. The reason it works is there are some good features left. For the bad part, I mean the outliers, because there's no target there, the template point is going to match everywhere with a similar cost, so essentially the bad features won't really count when you do the optimization. The outliers basically [inaudible] the good features automatically when you do this. But if you have more than half occlusion, we have trouble. In our experiments, half occlusion seems okay; it won't affect the result very much.

Okay. So we also compare with some robust methods like RANSAC. RANSAC, in fact, works quite well when the features are good; it's not too bad. But here, for this pair, the features are pretty weak; the reason is you can see this fur has a similar appearance everywhere. So RANSAC won't find a good hypothesis; in fact, almost every hypothesis is wrong, no matter how many times you try. But our method basically relies on all the features together, so it doesn't depend on a few good features. Even if a match is only the second best for each feature, if you put them together they have the lowest matching cost, and we're going to select that. RANSAC won't get that kind of result. Okay. So that's the first method. In fact, there are still some problems we didn't deal with in the first method, and that's the reason we propose another one we call the decomposition method.

So here basically we're going to solve a few extra problems that we didn't solve before. I'm going to still use this matching problem as an example, but this formulation is also quite general.

You can use a similar method if the structure of [inaudible] some problem matches. Okay. The problem we didn't quite solve in the previous version is that we have to quantize the scale, so we need to know at least the range of the scale change from the template to the target; for example, within a factor of two, so from half the size to two times bigger. But unfortunately, sometimes you don't really know how much the scale could change. I mean, if you download the image from the Internet, the size can be very different from your template. So how do you deal with that? You probably don't really want to try every scale, because it's going to become really slow if you quantize very finely. And maybe you also have to try all the rotations; that makes things even worse.

So the question is, is there a way we don't really need to know the scale and can still do the matching? Is that possible? In fact it is possible, and I'm going to show you one way to do that. And another thing about the model: in the previous model, for the pairwise constraints we had to use something like a convex [inaudible], so basically we used [inaudible] there. But sometimes we want to use general pairwise constraints; how do we handle that case? And also, our model is not a tree at all. If it's a tree, you can basically do dynamic programming; for example, for this blue part, the blue edges, if it's a tree by itself, you can use very complex constraints in the dynamic programming. But if it's not a tree, that's not usable.

So what we do is, okay, let's make some kind of compromise. We still use a tree, but this time we use it as a skeleton, and we introduce some extra edges, this dotted thing, which are basically the weak edges. The weak edges are a special thing: they can be represented as linear constraints. So basically, the blue thing, the skeleton, can carry arbitrary, very complex constraints; it could be anything. But every extra edge you put in there has to be something linear. So this model, in fact, is not a tree anymore; it's a loopy graph, but it's not a totally generic loopy graph. It's something we call the linearly augmented tree. The interesting thing is, even though it's not a completely general loopy graph, it has the good features of a loopy graph: you can introduce many constraints that a tree does not have.

And at the same time, we have a good algorithm to solve this using a bunch of tree algorithms. So we get the good things from both sides: we have the expressive power of a non-tree model, and we can still solve it very efficiently using tree methods. That's the reason we want to use this kind of linearly augmented tree.

Okay. So we still need to do a little bit of formulation, but this is very similar to the previous one. The objective function again has this unary term: for each feature point you want to find a low-cost target point. And the second term is again a spatial consistency term: we need everything to rotate by a similar angle and to scale by a similar factor. And here, okay, we can also use matrix notation to represent everything; this is the second term. The difference of the second term from the previous model is that here we only constrain things on the tree. We have a skeleton tree, and basically if we constrain the tree edges to rotate consistently and scale consistently, the whole model roughly has that property too; in fact, this is important. If you don't do this, the model is going to fail. We only apply this kind of constraint on the tree edges. This turns out to be representable using matrix notation. If you want to look at the details, the paper has them, but in fact there's not much more; it's just notation.

>>: But, can I try to rephrase what you're saying?

>> Hao Jiang: Okay.

>>: You're saying in order to cut down the number of constraints, instead of ensuring that the rotation and scale are consistent along every single edge-

>> Hao Jiang: Right.

>>: Just select a subset, and in this case that subset happens to be a tree.

>> Hao Jiang: Right.

>>: Such that the constraints of [inaudible].

>> Hao Jiang: Right. The reason we do this, there's another, I'm going to show you, there's a later slide that shows the structure of the thing. In fact, we want that kind of structure to appear.

>>: So does it have to be a tree necessarily? Or to put it another way, why a tree and not random edges, let's say?

>> Hao Jiang: In fact, it becomes a little bit clearer later, so let's wait a little bit and I'm going to show you. So this is the structure. Here, basically, we want to make sure all these tree edges have the same scaling; that's this term, the same scaling s_0. And we also want the same rotation; there's a zeta in there. But the trick here is that we don't force all the edges to rotate by the same zeta directly; we force them to have the same cosine and sine of zeta. The reason we do this is that when the angle wraps across zero, it jumps back to like 360, and that's very hard to deal with. But if you use sine and cosine, you don't have this kind of trouble. So even though you have two more matrices in there, it's fine. So basically, u_0 is something like the target cosine of zeta and this v_0 is the target sine of zeta, and we try to minimize everything here to make sure the whole thing is consistent.

And lastly, you can add some other terms too; there could be some global thing, like an area constraint or something. If it's something linear, you can have it in there; it won't affect the whole model. Okay. So this is basically the structure of the whole thing. If we do everything we talked about before, you're going to have a very special structure here. You have an objective function and you have the constraints, and there are two parts in the constraints: one part is the easy ones, and there are hard constraints added on top of these easy ones. Okay. The easy ones, in fact, are the tree constraints.

Everything on the tree. And the harder ones are put on top of this tree. This part is linear; if you remove it, the whole thing is a tree optimization problem, and you can use dynamic programming to solve it. But if you put the hard constraints back, you've added something to the tree formulation, and that makes the thing a little bit harder to deal with. Still, it's not too different from a tree. So if we can figure out tree solutions, we can probably try to make this work and get a solution much faster than solving the whole thing directly. That's basically the idea.

I mean, everything we did before was basically to make this kind of structure appear. All right. Okay. So let me briefly illustrate how this can be done faster using these tree solutions. The basic idea is, let's assume you have all the tree solutions: you remove the hard part, and if you can somehow magically get all the tree solutions, we represent them using these green dots. Because you removed the hard constraints, the [inaudible] region becomes bigger; that's the reason we draw a bigger area there, and the real feasible solutions are probably only in that triangle. The red dot is the real global optimal solution. As you can see, you can always take a linear combination of these green dots to represent that red dot. And hopefully this can be done more easily than directly solving the hard problem, because these green dots may be easier to get. I mean, this won't work by itself, because you cannot afford to get all of these green dots.

Fortunately, we don't really need to get all of them at the same time. What we do is use a method called column generation; in fact, it's a method from [inaudible] decomposition. It's quite an old method, from 20 or 30 years ago, probably even older than that. But that method tells us how to deal with these kinds of problems: if you know how to solve the simple subproblem, you can do a little bit more to make it work for this really hard problem.

So the idea is, you can start from a bunch of proposals. For example, you take two dots instead of all the green dots, and you find the best linear combination of these two dots that satisfies the hard constraints. You get an okay solution, not the best; that's basically the red dot. After you find that, we can introduce more proposals. So let's say, okay, probably we should introduce another one. Here, in fact, you already get the final solution if you take a linear combination of only these three. The procedure of column generation is, in fact, how you introduce more proposals to make sure you always improve your result. There's a very systematic method to do that.

In fact, in our problem you don't really need to solve the subproblem using any kind of linear program. What you do is use the structure: the subproblem is a tree, so you just do dynamic programming, which is really fast. After you've solved that, you solve another small linear program to generate new proposals. And after a bunch of iterations you find the result much quicker than solving the whole thing directly. So that's the basic idea of this method.
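[In standard column-generation terms, and in my own notation rather than the talk's: with tree proposals $x_1, \dots, x_K$ generated so far, the restricted master problem is roughly

$$\min_{\lambda \ge 0}\ \sum_{k=1}^{K} \lambda_k\, (c^{\top} x_k) \quad \text{s.t.}\quad \sum_{k=1}^{K} \lambda_k\, H x_k \le h, \qquad \sum_{k=1}^{K} \lambda_k = 1,$$

where $Hx \le h$ are the hard non-tree linear constraints. Given the master's dual prices $\pi$ (hard constraints) and $\mu$ (convexity constraint), the next proposal is the tree solution minimizing the reduced cost $(c - H^{\top}\pi)^{\top} x - \mu$, which is exactly a tree problem solvable by dynamic programming; proposals are added while this reduced cost is negative.]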

>>: [inaudible] randomly try [inaudible]?

>> Hao Jiang: You don't randomly try the proposals. In fact, there's a systematic method that tells you how to choose the next proposal. It's part of this [inaudible] decomposition method.

>>: So can you elaborate briefly [inaudible]?

>> Hao Jiang: The proposal, basically, okay, you want to select the one that, if you introduce it, reduces your objective function the most. That's the proposal you want to select. In fact, to do that you solve a very small linear program, and after you do that you can compute the proposal instead of randomly choosing one.

>>: So does that proposal [inaudible] structure of the mesh [inaudible]? For example, if it's a human, perhaps those points are more [inaudible] of the human [inaudible]. [inaudible] slide into those anchor points. You get the decreased cost [inaudible]. Is there any analysis on linkage?

>> Hao Jiang: You mean the structure, whether the structure is going to affect-

>>: Structure of the natural, of the geometry?

>> Hao Jiang: Okay. I think that's a different problem. The structure definitely matters: if you choose a different structure, it's going to affect your solution, and some structures are better than others. But here, basically, we just do the [inaudible] triangulation and get a planar graph, and this works okay for mostly rigid objects. If you have an articulated object, this kind of graph may not be very good; you should choose some other kind of graph. But the optimization method is all the same; it doesn't matter what kind of graph you use. So you could choose some better graph; for example, if you have articulated parts, maybe you want the linkage in here, you want to have the linkage kind of inside the area. But it won't affect the method itself.

>>: The reason I'm saying that is one of the possible things you could do is segmentation [inaudible]. [inaudible] make matching easier? I mean, let's establish this kind of [inaudible] structure of the object. Then that structure, perhaps, could feed back to [inaudible] in some [inaudible]?

>> Hao Jiang: Okay. I mean, you can automatically generate the structure to help matching something.

>>: Something like that. I was just wondering.

>> Hao Jiang: Okay. That's a good idea.

>>: [inaudible].

>> Hao Jiang: So probably we can do something in the future along that direction. All right. So that's basically the rough idea. I didn't talk about the details of how the proposal is selected or how the iteration works, but that is the basic high-level picture. There are a lot of details in the paper about how the proposals are selected.

In fact, there's a [inaudible] structure where there's no guessing, no random selection; everything is just determined. You can just follow that structure and the procedure just does the iterations.

The example I show here is basically what this looks like when you do these iterations. There's a template image, and what we want to do is match regions. If you run some superpixel method on this image, you're going to get a lot of these regions, the superpixels. The superpixel method works really fast, and there are very few superpixels here. So if you use those as features to match the target, we can probably do close to real-time matching; it's probably not very accurate, but we can localize the object very quickly and know its rough pose.

So basically that's the motivation for using this kind of feature. And we use a very simple graph: the tree here is the star with all the red edges. We only have like 10 regions here, not many. This is the target, and you can see that even though this guy is the same object, the regions are a little bit different. Each time you run this superpixel [inaudible] on the same object, you'll probably get not exactly the same regions, maybe some parts the same, roughly the same. So it's, in fact, a quite challenging problem: you're matching unstable regions.

Here's an example showing the first iteration. If you do a greedy thing and find the best matching from each region to the target using whatever criterion, you're going to get something really wrong; that's basically what you get here. But if you use the iterative method we propose, after a few iterations you get a much more accurate matching result; that's the final result. You may wonder why I show two of them, the top one and the bottom one. Because this is a linear program, everything we get is floating-point numbers. For the top one we do a hard rounding: we use a threshold to round it to a binary result. For the bottom one we do a linear combination. As you can see, the results are quite similar. If hard rounding and the linear combination agree, that means the result is quite close to a zero-one solution, so it's a good solution. Another thing you can see is that as we add these proposals, the energy function keeps going down and the result keeps getting better, but some intermediate results are very wrong. So even though the energy keeps going lower and lower, if it's not the real matching, it could be a very wrong result.

So this shows that this kind of global optimization really helps: you need to find the global optimum instead of something in the middle, because something in the middle could be way wrong. Also, how it works is a little bit surprising: it doesn't look like it approximates the target and gradually refines it. In fact, it can wander quite widely and then finally get the target. That's something I didn't expect at the beginning. Here's another example that shows how it works for a slightly articulated object, a human. This is the template and this is the target; I purposely put in this large [inaudible].

And we do this matching; as you can see, this is the first matching, and it also doesn't work very well. But if you iterate, after a little while you get a very good matching at the end. In fact, the arms move a little bit, so not a big articulation, but a little bit. So this works for a human being if there is a small deformation.

As for the complexity comparison of this decomposition method with the direct solution method, the improvement is in fact quite big. If you don't do the decomposition with this tree method, the complexity increases like that: as the problem size gets bigger, the complexity [inaudible] very quickly becomes very high, which means you cannot solve anything really big. But our method's curve is very flat, which means we can solve something much bigger than with the direct solution.

And we do some comparison with previous methods. So, in fact, that one is the proposed method, A is the template, and that's the matching result of our method. And there are some other methods. You can see from the comparison that if a method doesn't match correctly, it can get a very wrong result, because the feature is very weak.

So you need a very strong way to find the global solution, global optimal solution in there.

So our method has about a 90 percent chance of getting a correct matching, and some of the others are much worse. In fact, this one is the first method we talked about in the first part of the talk, and here it's very bad. The reason it's bad is that we do the lower convex hull, and that's not very desirable when the feature is bad: if you do the lower convex hull you get a very flat surface, which means all the details are lost, and if you optimize you get nothing. So the lower convex hull is a very good optimization method if the feature is not too bad, but for these kinds of [inaudible] features you need something better. And there's another one. So you can see our method is at like 90, 91 percent, and for the other methods the best is 73 percent matching accuracy.

So it's quite a big improvement. In fact, there is some ground-truth point matching comparison in the paper, but I didn't show it here; it also shows a big improvement compared with other methods. There are some videos I'll show. One application we think this could be used for is human pose or something like that: because these regions will probably correspond to the arm, the head, and the torso, it's going to help us track the person and understand the movement. The reason we can use this, which is not an articulated model but a deformable model, to handle articulation is that the regions are big; the whole arm is one region, so there's basically no articulation within the model itself. But if the [inaudible], it may not work in that case.

So, I mean, this method still has limitations. If you look at this video closely, you can see there are some errors, and the errors are related to the features themselves. The features are weak, and the feature itself I think is the big thing: even though the model is strong, if the feature is very bad, you still cannot get a good result. And this method is, in fact, useful not only for the same object; we use it for matching faces, matching things like leaves or the back of a car or something. For the faces, we use one person's face to match all the other faces, like 400 faces, and you get quite accurate matching for the eyes, the nose, and the mouth, everything. It's deformable, so you get quite accurate alignment for that.

>>: Did you try to match one object to a different object and see what kind of result it would get?

>> Hao Jiang: Exactly. Let me show you some other applications here. For example, we can use this kind of matching for action recognition. Here, we basically use one person's movement to match another person's movement. If the feature is kind of [inaudible] independent, then it works. Here, basically, we do something with trajectories.

If you move, every point on the body traces some trajectory in space-time, these dots, through the video. The shape of this kind of trajectory is quite consistent even when the person changes; for example, there's a man here and a woman here doing the same kind of action, and you can see that the trajectories are, in fact, quite similar no matter who does it.

And one application of this kind of thing is searching for a specific action in some video. This is a broadcast video we recorded of a golf game, and suppose you are interested in some kind of event, for example the swing action. We want to skip everything else: we just use one template, run it through the hours-long video, and localize all the instances. All the shots here are the ones we localized. It's not totally accurate; you can see something quite weird, there's a face in there. The reason is that we don't have any semantic meaning here; basically we just match features, and for some reason that face is judged to be similar. For this application, if you look at the recall-precision break-even point, it's like 70 percent accuracy or something. So it's not too bad; you find a lot of them. But there's still a lot more to do for that kind of thing.

And here is another application we did, on a sports data set of about 5000 baseball images. We tried to discover which kinds of actions are most common. We use this matching method to compare each pair of images, compute a similarity distance between every two images, and then do clustering based on that similarity matrix. We find, for example, that there are a lot of faces, so faces turn out to form a big cluster. There is also a throwing action; you can see a lot of these throwing or other poses. And there's some kind of-

>>: [inaudible]?

>> Hao Jiang: Oh, yeah, those are the images. Right. Indeed, we can do something like this. It probably only works for these kinds of professional sports images, because photographers tend to shoot very similar events all the time and choose specific poses to capture. Amateur images may not have this kind of pattern. But it's something interesting that we can do automatically.
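As a rough illustration of the pipeline described here (pairwise matching scores, then clustering), the sketch below groups images from a similarity matrix. The matrix itself is assumed to come from the matching method; hierarchical clustering is just one reasonable choice, not necessarily the one used in the work.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def cluster_images(similarity, n_clusters=10):
        # similarity: symmetric matrix of pairwise matching scores in [0, 1],
        # one entry per pair of images in the collection
        distance = 1.0 - np.asarray(similarity, dtype=float)
        np.fill_diagonal(distance, 0.0)
        condensed = squareform(distance, checks=False)   # condensed pairwise distances
        tree = linkage(condensed, method='average')      # agglomerative clustering
        return fcluster(tree, t=n_clusters, criterion='maxclust')  # cluster id per image

The big clusters returned this way would correspond to the common poses the speaker mentions, such as faces or throwing actions.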

All right. Those are the two main themes I've talked about so far for how to discover structure. Now I want to talk about some other applications that maybe don't share one specific common theme, but where we use something we call relaxation and branch and bound. These are integer optimization problems: we want to find a 0-1 solution, but we can't do that directly, so we relax the problem and then do an efficient exhaustive [inaudible] search using branch and bound. That's the kind of thing we do here.
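For readers unfamiliar with the recipe, here is a generic, minimal sketch of LP relaxation plus branch and bound for a 0-1 minimization. The formulations in the talk (tracking, max covering, and so on) have their own problem-specific constraints and bounds; this only shows the overall mechanism.

    import numpy as np
    from scipy.optimize import linprog

    def branch_and_bound(c, A_ub, b_ub, n):
        # Minimize c @ x subject to A_ub @ x <= b_ub with x binary (0/1).
        # Each node fixes some variables; the LP relaxation of the rest gives
        # a lower bound that is used to prune the search.
        best_val, best_x = np.inf, None
        stack = [{}]
        while stack:
            fixed = stack.pop()
            bounds = [(fixed.get(i, 0), fixed.get(i, 1)) for i in range(n)]
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
            if not res.success or res.fun >= best_val:
                continue                                  # infeasible or pruned by the bound
            frac = np.abs(res.x - np.round(res.x))
            j = int(np.argmax(frac))
            if frac[j] < 1e-6:                            # relaxation is already integral
                best_val, best_x = res.fun, np.round(res.x)
            else:                                         # branch on the most fractional variable
                stack.append({**fixed, j: 0})
                stack.append({**fixed, j: 1})
        return best_x, best_val

The key point the talk makes is that when the relaxation is tight, which the reformulated models are designed to encourage, very little branching is needed and the method runs fast.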

So the first application is tracking. Before that, maybe I'll show some video here. A standard tracking method is dynamic programming: in each frame we detect a lot of candidates, and what you need to do is link these detection candidates so that the track is smooth over time. That's basically the classic approach. But if you do that independently for each single object, you get a lot of problems. If you apply sequential dynamic programming, you can see that identities 0, 1, 2 are sometimes assigned to the same person; multiple labels get assigned to one person, because if you run the dynamic program for each target separately there is no coupling between the people. One track doesn't know what the other IDs are doing, so it doesn't know it shouldn't assign another ID to the same person. Our proposed method basically does everything together: all the people are tracked jointly, so we can [inaudible] that kind of situation.

Okay. So what we do: the model is in fact multiple paths. For each single path you can do dynamic programming, right? But if you run each of them separately you don't have the coupling. So let's link them together and add some coupling: basically, if you detect an object in some region, then in the region very close around it you shouldn't have another object, and if there is one there, it should be a [inaudible]. So basically we have a spatial exclusion kind of constraint. That's the model. Unfortunately, this cannot be solved directly with dynamic programming, but we can still approximate it with a linear program, do some rounding, and hopefully get a result that is not too bad. In fact, this works quite well. There are some other videos I can show you of how the results look. This squash video is in fact quite challenging, because there are four players and the players on the same team wear almost the same clothing, so it's quite hard. But if you look closely, this video in fact has an ID switch almost at the end. So it's not perfect, and that's a problem we still have to deal with: our method cannot solve that kind of ID switching yet. We would need some other information to resolve it. Okay, that's it for tracking.
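To show what the per-target dynamic program looks like before any coupling is added, here is a minimal sketch. The data format and weighting are assumptions made for illustration; the speaker's actual method replaces this with a joint linear-program relaxation over all targets plus rounding, which is what removes the duplicate-ID problem.

    import numpy as np

    def link_detections(frames, motion_weight=1.0):
        # frames[t] is a list of (x, y, score) detection candidates in frame t.
        # Dynamic programming picks one candidate per frame so that detection
        # scores are high and the track moves smoothly; this is the uncoupled,
        # single-target baseline, not the joint multi-person model.
        T = len(frames)
        cost = [np.array([-s for (_, _, s) in f], dtype=float) for f in frames]
        back = [np.zeros(len(f), dtype=int) for f in frames]
        for t in range(1, T):
            for j, (xj, yj, _) in enumerate(frames[t]):
                trans = np.array([np.hypot(xj - xi, yj - yi)
                                  for (xi, yi, _) in frames[t - 1]])
                total = cost[t - 1] + motion_weight * trans
                back[t][j] = int(np.argmin(total))
                cost[t][j] += total[back[t][j]]
        path = [int(np.argmin(cost[-1]))]          # best candidate in the last frame
        for t in range(T - 1, 0, -1):              # trace the back-pointers
            path.append(int(back[t][path[-1]]))
        return path[::-1]                          # chosen detection index per frame

Running this once per person gives smooth but independent tracks, which is exactly the setting where two tracks can claim the same detection.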

And another set of applications we do is human pose. Basically, we want to find the person in the image: where the head is, where the torso is, and where the legs are. The model I use here is like [inaudible], so it's a jigsaw-puzzle problem. Basically, I give you a bunch of rectangles; one rectangle represents the torso and the others represent the arms and legs. What I want is to put all these rectangles on the image so that they cover the target as much as possible, and also so that they link together into a human-body linkage. That's basically the idea.

So essentially, what do we need for this method? You roughly need a foreground segmentation. You can probably do something like background subtraction to get the foreground; for example, you do a background [inaudible] and you roughly get the foreground.

After you get that, you can put rectangles on top of it and try to form a human-body shape. So essentially this is finding the best configuration of all these body rectangles. We call this model max covering, because we try to cover the foreground as much as possible, and the parts should also be consistent because it's a human-body model, so we call it consistent max covering. Fortunately, this can also be solved with the branch and bound approach: if you formulate it as an optimization, you can relax it to get a bound, do branch and bound, and get the result. It turns out this method works better than the single-tree method. Let me show you some comparison results. You do need to get the foreground out somehow, because we need a rough foreground, and the better the foreground, the better the result. But this method doesn't need a totally accurate foreground, just a rough one.
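As a toy illustration of the covering objective only (not the full consistent max-cover model, which also enforces the body-linkage constraints and is optimized by relaxation and branch and bound), one could score a candidate set of part rectangles against the rough foreground like this:

    import numpy as np

    def coverage_score(foreground, rectangles, overlap_penalty=0.5):
        # foreground: binary mask (a rough background subtraction is enough).
        # rectangles: candidate body-part boxes as (x0, y0, x1, y1).
        # Reward foreground pixels covered by the union of parts and discount
        # pixels claimed by several parts, so two part boxes do not pile onto
        # the same limb.
        count = np.zeros(foreground.shape, dtype=float)
        for (x0, y0, x1, y1) in rectangles:
            count[y0:y1, x0:x1] += 1.0
        fg = foreground > 0
        covered = np.logical_and(count > 0, fg).sum()
        double = np.maximum(count - 1.0, 0.0)[fg].sum()
        return float(covered - overlap_penalty * double)

The overlap discount is what pushes the second leg rectangle toward the uncovered leg instead of stacking on the one already found.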

The first row and the third row show the previous method, the [inaudible] structure method. Basically, you use a tree to represent the human body. The tree method has a problem with double counting: because there is no coupling between the two arms or between the two legs, sometimes you lose the detection of one leg, since nothing tells the model there is still a leg there, and both leg parts end up on the same leg. Our method, because it tries to cover the foreground, knows that the foreground is only partially covered, so the other leg should go there, and we can get everything. Potentially we get better results because we use more information, but the model is not a tree anymore; it's a global model, so we have to find a way to solve it efficiently.

All right. This is another recent work we did, also for human pose. Instead of detecting rectangles, we detect regions. The motivation is that if you download an image from the Internet, you don't really know the scale of the target, and all the previous rectangle body-part detection methods need to know the rough size of the target, say 100 pixels tall or 200 pixels tall.

But if you download an image, how do you know how tall the target is? So the goal is: given a downloaded image where you know there is probably a person, we want to roughly get the torso, leg, and arm regions. In this model we gather regions and use the relative sizes of the regions, so we don't really need to know the size of the target; we can still get the right configuration and detect the person's pose. The idea is to detect a lot of proposals, which we call region proposals. In fact, Barry Cohen has a paper, and we use his method to detect these object-class-independent proposals, [inaudible] regions that look like objects. Here we basically detect all the parts: some regions look like body parts, and we put them together and assemble them into a human body. That's the basic idea. And we also use the branch and bound approach to solve this problem.

Okay. Another application we deal with, for human body tracking, is detecting the hands, the head, and the feet, which we call the key points, in these [inaudible] videos. It turns out this is a very challenging problem. As you can see, if you extract frames from these videos you get a lot of blur, because the movement in these gymnastic routines is really fast. And blur is big trouble if you want to detect rectangular body parts.

So here we use a different model, which we call the walk model. Instead of using big features, we use pixel-level features. It's like you start from the head and walk down the image: you follow some pattern, go along some kind of track, and arrive at the key points, like the hands or the feet. That's basically the model. We also have some tricks to make the model resistant to scale change: the walk can be longer or shorter, and it can articulate somehow. So you don't need many templates; you just need one, and you can achieve all these properties at the same time.

Here are some examples that show how this works; in fact, there are some videos I'll show you. We use a dot to represent each key point, the hands or the feet or whatever. Some of the targets may not even be a human subject; it could be a face or something. Basically, you can detect any kind of point you are interested in on some object, and if the object has some articulation, this method can deal with that. It can also handle some scale change and even camera-angle change, because we do this for each single frame, so we don't need to worry about the camera angle changing; we can still track the target in these kinds of videos. It's not 100 percent accurate, and there are still quite a few errors, but you roughly get the movement of the target, so we think it's quite useful for action recognition or for understanding the performance of the subject. It should be [inaudible] for that kind of purpose.

We also compared this method with some previous [inaudible] methods, and the proposed method works much better. One reason is the blur: the previous methods don't work very well with it. The previous methods also have a hard time dealing with rotation and scale changes, but our proposed method can handle these very nicely.

All right. There's another application, also for human movement. In this case we want to extract the foreground from the background. There are some previous tools we can use; for example, in Adobe After Effects there's a tool called Roto Brush.

If you label a region in a single image, the software basically tracks through the video and finds the same region in future frames. But this method has a drifting problem, so you basically have to sit there and watch the result: if it drifts a little, you have to modify it a little and hope it can track through long video sequences.

This result basically shows that; sometimes it doesn't work at all. For example, in this video there are camera-angle changes, so the motion is not continuous, and in that case it doesn't work. We want to handle this: we want a method that can track the object and still find it even when there are camera-angle changes. That's basically what we propose here. The basic idea is that we do a partition first: we partition the image into superpixels, and then we group the superpixels together so that the region we find looks like the object we're trying to follow.

With this method you don't need to worry about motion continuity, because we do it on each single frame. Another good property is that there is a guarantee the region you find is always connected. This is in fact a hard thing to enforce, but it is very important, because it is very likely you could find two disconnected regions whose combined appearance looks like the target but that in many cases would not be the real target. So this is a very strong constraint.

We solve that using something like the cutting-plane method: we keep adding cutting planes whenever the result is not connected. This figure shows how it works. The first image is the template, the object you want to look for in all the video frames. Then, given a second video frame, how do you find the target? The method first solves the segmentation; that's the first image in the second row. It's not connected: you can see the color histogram in fact matches perfectly, but the result is not connected and is pretty bad. So what we do is add constraints to force the result to be connected. We keep adding these cutting planes, the result improves gradually, and at the end it's like that: a fully connected region, and the match finds the target very well.
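The loop structure of such a cutting-plane scheme can be sketched as follows. This is only an illustration: the solver callback and the exact form of the recorded cuts are placeholders, whereas the paper's cutting planes are specific linear constraints derived from the disconnected components.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def segment_with_connectivity(solve_relaxation, adjacency, max_rounds=20):
        # solve_relaxation(cuts) -> boolean selection over superpixels; it is a
        # stand-in for the appearance-matching optimization described in the talk.
        # adjacency: symmetric 0/1 matrix saying which superpixels touch.
        # Whenever the selected superpixels split into several connected pieces,
        # each stray piece is handed back to the solver as a new cut.
        cuts = []
        selected = solve_relaxation(cuts)
        for _ in range(max_rounds):
            idx = np.flatnonzero(selected)
            sub = csr_matrix(adjacency[np.ix_(idx, idx)])
            n_comp, labels = connected_components(sub, directed=False)
            if n_comp <= 1:
                break                                  # the region is connected; done
            for comp in range(n_comp):
                cuts.append(idx[labels == comp])       # record each disconnected piece
            selected = solve_relaxation(cuts)
        return selected

Each round tightens the feasible set a little, which matches the gradual improvement visible in the example frames.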

The object's size can change too, and what we want is to handle everything together: make the method scale invariant and enforce connectivity at the same time. Our paper shows that in theory we can do that, and it's in fact much faster than doing the two things separately. Here are some comparisons; the yellow box is our method. We tried a bunch of different videos, and there are some other methods shown here. I didn't show their names, but I just want to show that they can be quite wrong, because in this problem we don't really know the background information; what we know is basically only the foreground information, and that makes the problem much harder to deal with.

Okay. There are some videos showing the result here. For all these sports videos, you can see the background in fact changes a lot, because the subject moves from one scene to another, so the background information is not reliable. You cannot just model the foreground and the background and do a graph cut or something; it won't work in this case. Basically, you have to make sure the foreground has some property, like its shape, and work from that.

Here you can see that this method in fact works even when there is motion discontinuity. There are some errors; some of them are due to errors in the superpixel segmentation, which links two parts that shouldn't be connected, and that causes some mistakes. But still, the proposed method can deal with these cases in general, so it's quite reliable for this kind of application.

Apart from human detection, this kind of energy-minimization method, a global way to minimize the energy function, can be used for other things. For instance, here what we want to do is detect cuboid objects using Kinect images, RGB-D images. So you have color images and also depth images, and the question is how to detect all these rectangular, cuboid things in indoor scenes. These cuboids give you candidates for further recognition or something. This row shows that if you don't use the 3-D information, you get something like this; for example, this table is not detected at all. And this row shows that if you use the 3-D information and do the cuboid detection, you get much better results when you compare these two rows.

The basic idea of this method is still very similar to the previous ones we talked about. We formulate the problem as some kind of minimization: we try to find a criterion such that minimizing it satisfies the constraints, and then you just need to find the structure and solve it efficiently. Here we again do the relaxation plus branch and bound, and it turns out it works well and is quite fast.

Here's the summary of this global approach. Global methods look more complex than local methods, but in fact there is a lot of structure we can use in them. If we look closely at these problems, these global models, and we can find the specific structure, we can make them really fast, sometimes even faster than greedy methods. So a complex model is not necessarily bad; in fact it is sometimes good, because it is more accurate and you have more control over the quality of the results, and we want these more accurate models. But we can tweak the model a little to make it suitable for efficient solution. That is something that can be applied to many different kinds of applications.

In fact, I've only scratched the surface so far. I've shown different applications, but there are many others we can do with these kinds of global models. For example, for human pose I did the 2-D case, but there is also 3-D pose; and there are very different applications like computational photography, where many problems, for example reconstructing some kind of image or 3-D model, this kind of reconstruction problem [inaudible] can be formulated as an optimization problem. So this framework is applicable to all these things; there are many, many things we can do. If we formulate suitable models, hopefully we can solve some [inaudible] problems easily, and we can also gather [inaudible]. That's something we hope to do in the future. And, yeah. That's it. Thank you.
