>> Richard Hartley: So I'd like to introduce Tiberio Caetano, who works at National ICT Australia, where I also work. He comes from Brazil originally, by way of Calgary.
>> Tiberio Caetano: Edmonton.
>> Richard Hartley: Sorry, Edmonton, Alberta. So extremes of temperature. He went through,
did his Ph.D. there and graduated in 2004 before coming to work in Australia.
Now, I hired him into the Vision Group there but he jumped ship into the Machine Learning Group
at some point so he has expertise in both vision and machine learning and I think he's going to
give us a bit of a talk which bridges the two areas today.
>> Tiberio Caetano: Thanks, Richard, for the introduction.
So essentially I'll try to divide this talk in two parts. So the first part will be a very generic talk
about the type of work we are doing in computer vision, and essentially using machine learning
methods to solve computer vision problems.
And the second part, which is probably going to be a little bit shorter, is basically going to be a bit
more specific about one recent project.
Okay. So I would like to thank my collaborators: Jose Ferris from the IBM T.J. Watson Research Center; some of my Ph.D. students, Julian McAuley and Devon Shi; and Quoc Le, Li Cheng, and Alex Smola over there, the collaborators involved in the projects I will describe. I think I'm missing a picture of Dale Schuurmans, who is now a collaborator on the project that I'm going to present to you.
Okay. As I mentioned, I'll divide the presentation into two parts. Part one will be, essentially, preaching structured prediction in computer vision. I'll quickly describe what it means and give examples of it.
The second part then will be a little bit more specific about recent work we've just submitted.
So let's focus on the first part, structured prediction in computer vision. In this part I'll talk about what I call easy vision problems. So here's an easy vision problem: I want to find the class of this object.
I'll say, okay, this is a bike. I'm calling this an easy vision problem. So imagine what I would call
a difficult vision problem.
So here's yet another easy vision problem. What's the class of this object? So I find it easy.
That's a car. Typical car. Another easy vision problem. What's the class of this object here? So
that's a ship. Also an easy vision problem.
So it's the design of a new ship that's going to be around very soon. It basically has a very
sophisticated hotel in it. And, well, you need to actually arrive at it by plane.
But yet it is a ship. So these are easy vision problems. I hope you agree with me that these are
all easy vision problems. So for these easy vision problems, problems that are classification, right, you can do two things. A very traditional approach, popular in the '60s and '70s, is the knowledge-driven approach. Something like this.
So this would be rule-based, model-based, grammar-based: you would represent the object in terms of its parts and basically create a complicated model to represent the object, and then try to figure out what the object is, whether it's a ship or a car or whatever.
This was popular in the beginnings of computer vision. So you would typically have complicated descriptions of objects and almost no data leveraging whatsoever.
If you look at how things have been happening more recently, things have basically been changing from a knowledge-driven approach to a data-driven approach. To do classification these days, for example, one option is to use things like support vector machine technology, and that's basically what has driven most of the activity in classification in computer vision in the '90s and recent years.
So essentially the idea here is that instead of going for those complicated models you go for very simple models, but you leverage data heavily.
So the question I have for you is: What would you do? Would you go back to this age here and
do complex models and no data leveraging or would you just continue what people are doing at
the moment, with simple models and huge data leveraging? What would you do?
That's the question. You would just continue doing this.
>>: [inaudible] what you're doing.
>> Tiberio Caetano: What's too boring?
>>: Maybe the first one.
>> Tiberio Caetano: The first one is too boring. So basically this is what almost everyone is doing, right? Okay. So let's focus on the second approach here. So there you go. One option, if you want to do simple classification -- again, just to remind you, we are working here on simple vision problems -- is to just use support vector machines, an extremely well known, very successful classification technique.
And there you go. Well, you don't need to use support vector machines; there are many ways to do classification. One example is logistic regression: instead of learning the boundary, you learn these discriminant functions separately and then you decide according to their intersection.
So logistic regression, for example, is also a discriminative approach, but still based on density estimation techniques.
There are many other things you could do. Okay. Why am I calling these problems easy vision problems? It seems I'm being arrogant here. But they are easy vision problems in the following sense: the output is low dimensional. It's the number of classes.
For example, I want to classify an object. Well, I may have two classes, 10 classes, 50 classes, even 100 classes, but I would call this extremely low dimensional.
We usually think these problems are difficult not because of the output but because the input is high dimensional -- because images are very complicated objects. That's why computer vision is difficult. One of my points in this talk is that of course this makes life difficult. That's true.
But this here makes life not as difficult as it could be. If the output were also high dimensional, then we would really have hard vision problems.
>>: But in recognition today, people are starting to talk about labeling pixels, doing segmentation together with recognition.
>> Tiberio Caetano: Yeah, that is the point of this talk -- the recent developments in that direction. I'm just trying to describe the mainstream up to recent years, and then survey what has been done in the last couple of years. But that's a fair point. That's the point of this talk.
Okay. So inference here is trivial. What I mean by inference is that you want to predict the class membership of a given image. Inference is trivial in the sense that, for a support vector machine, you just check which class has the highest margin score. You can easily enumerate the classes because you have a very small number of them.
And in logistic regression, you just check which class has the highest probability under your individual regression functions.
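As a minimal sketch of why inference is trivial here (with hypothetical per-class linear scoring functions, not any particular system from the talk): you just enumerate the classes and take the best score.

```python
import numpy as np

def predict_class(x, W, b):
    """Enumerate the classes and return the one with the highest score.

    x : feature vector, shape (d,)
    W : per-class weight vectors, shape (num_classes, d)
    b : per-class biases, shape (num_classes,)
    For a multiclass SVM these are margin scores; for multinomial logistic
    regression applying a softmax would not change the argmax.
    """
    scores = W @ x + b             # one score per class
    return int(np.argmax(scores))  # trivial: enumerate and pick the best

# Toy usage: 3 classes, 4-dimensional features, made-up numbers
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
print(predict_class(rng.normal(size=4), W, b))
```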
Okay. So now let's talk about harder vision problems, which is what Rick mentioned people are doing. This is a hard vision problem: depth estimation. Here you have an image, a well-known stereo pair, and here you have the ground truth for it.
Why am I calling this a hard vision problem? Because the input is high dimensional, but the output is also high dimensional -- the output is another image. It's not a simple class label; it's extremely high dimensional. Likewise here: the input is an image, a noisy image, and the output is yet another image.
So can you get from here to there efficiently and with high accuracy? That's a hard vision problem. Another hard vision problem: image segmentation. The input is an image; the output is high dimensional as well, another image.
Okay. This is bad, because this is my favorite theme, which is matching, and the image is not showing up. Okay, no problem, I'll show an example of matching later, the graph matching case. Basically what was up there was a pair of images and matches between features of those images. The input is a pair of images, and the output is also something very complicated, which is a permutation of the nodes, of specific points.
>>: But dimensional analysis doesn't capture everything, right? Because adding two images together has a high dimensional input and a high dimensional output, and that's an easy problem, right? Two images, taking their average: that's a high dimensional input and a high dimensional output. So just classifying things by dimension doesn't really tell you what the hardness of the problem is.
>> Tiberio Caetano: I'm not sure I'm getting your point.
>>: I'm saying -- you're saying what makes it a hard problem. A hard problem has a high
dimensional input and high dimensional output, right. The problem of taking two images and
computing their average satisfies your definition of what a hard problem is.
>> Tiberio Caetano: Don't take this too strictly. Don't take this too strictly. This is a more pictorial kind of description of what I'm talking about. So essentially you can also, in principle, think of two generic approaches, like we mentioned previously: a knowledge-driven approach or a data-driven approach for this class of problems.
You could qualify these differently. I'm just using easy and hard to make you remember one month from now that you've heard easy and hard, because if I had used other qualifiers you wouldn't remember.
So the question is: what would you do? My point here is that for these hard vision problems people are still mostly playing this game here.
Most approaches are knowledge-based. For example, energy functions are typically handcrafted. Examples: MRFs for stereo, MRFs for segmentation, matching, and a bunch of other discrete optimization problems where you've got a complicated, sophisticated predictor. Instead of just a support vector machine, where you instantiate each of the classes and check which has the highest margin score, you have to run an entire complicated algorithm in order to make a prediction: here you need to solve an MRF, here you need to solve an assignment problem, here other things.
So you work really very hard at trying to find the best solution for a model that you don't know in the first place is the right model or not, because you handcrafted the energy functions.
So my point is that good inference is not enough. We need to learn the energy functions themselves. That's what we should be doing. Basically, if you were an advocate of the data-driven approach for easy vision problems, why shouldn't you be an advocate of the data-driven approach for the difficult problems as well? There's an inconsistency. That's what we need to do: we need to find good solutions for the right problem, not the wrong problem.
And the questions are: why should we be working so hard to improve inference incrementally, which is what a substantial proportion of the vision community is doing in discrete optimization, when we know that the global optima are poor anyway? It's so much effort; to me it doesn't make much sense.
Why bother so much about being optimal when the criterion being optimized is simply a reasonable guess of what a good criterion would be? So I think we should be asking those questions.
So here's an example. This is a beautiful example from a paper in 2005. Here's the original image, the Tsukuba image. Here's the ground truth, and here's the global optimum according to a simple pairwise MRF with handcrafted parameters. So look at the differences between this and that.
This is the global optimum of the energy function. This is what people are writing more and more and more papers to achieve, these results here.
And if you compare this result to that one, you'd say this is really crap. If you are extremely picky, this is crap. Look at the camera. There's no camera here. I can't see any camera.
So shouldn't we be balancing a little bit better how much we work on finding the right solution to a given problem versus finding the right question to be asking in the first place?
Here are a bunch of other examples of these very issues, also from [inaudible] at ICCV 2005, where you have a bunch of images, the solution obtained with belief propagation, the solution with graph cuts, and the global optimum. They actually managed to find the global optimum for a specific pairwise MRF energy function. And if you look at them, in some cases they're quite far from what a human labeler would provide.
So why are so many people concentrating their efforts on optimizing a given energy function, if at the end you're going to get something that's not that good anyway?
Here's something I want you to think about. In red you have the quality of the algorithms. Here's the level of sophistication of the algorithm: here's algorithm one, here's algorithm two. Every point on this line is a single paper in ICCV or CVPR, so people are pushing along this direction here.
So, algorithm sophistication. On this red line you have the quality with respect to a poor energy function. Of course, that's improving.
On this blue line you have the quality with respect to real perception. That's stuck. That's stuck. It's been stuck for a long while here. And in the end people are publishing papers and being paid for those papers. So what's the point?
>>: Got a problem with the dataset?
>> Tiberio Caetano: Well -- that's --
>>: The dataset --
>> Tiberio Caetano: Exactly. So people are taking these energy functions for granted, right? Of course you're improving with respect to that energy function, but so what? That energy function was handcrafted in the first place.
People are being extremely religious here. Come on, we're scientists. In the machine learning field, learning energy functions has a name: it's called structured estimation or structured learning. Finding the global optimum of a given energy function is simply inference. This has the same name in computer vision. When people talk about structured prediction they're really talking about the whole business of predicting high dimensional outputs, and doing learning in the process of doing so as well.
Not only hand-crafting those energy functions but doing learning, leveraging data, in order to find what the right problem to be solved is in the first place.
So what would be popular estimators? At the beginning of the talk we discussed support vector machines and logistic regression. Likewise, here people have created extensions of those models, of those estimators, for the structured case as well. Today we have structured support vector machines, which are a cousin of the SVM: structured SVMs. So basically these two references are what you need to look at; these are the papers that introduce the tools for doing learning when the output is very, very high dimensional.
Because we are talking about really high dimensional: imagine an image -- you cannot enumerate all possible images, yet that's the number of classes you have.
So how do you solve that? The learning problem becomes extremely difficult, much more challenging than doing inference. Likewise, you can do something different. You can ask: what's the analogue of logistic regression for the structured case? That has a popular name: conditional random fields. That's just logistic regression with a different feature vector.
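In standard notation (this is the generic form of a conditional random field, not a formula from his slides), the analogy he is drawing can be written as:

```latex
p(y \mid x; w) \;=\; \frac{\exp\big(\langle w, \phi(x, y) \rangle\big)}{Z(x; w)},
\qquad
Z(x; w) \;=\; \sum_{y' \in \mathcal{Y}} \exp\big(\langle w, \phi(x, y') \rangle\big).
```

When the output set is a handful of class labels this is just multinomial logistic regression; when y is an image-sized labeling, the sum defining Z(x; w) runs over exponentially many configurations, which is exactly the partition-function difficulty he returns to below.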
>>: You mentioned that the cost function for matching or stereo is not correct. But how do you know that the structure of the cost function you learn [inaudible] is correct? Because you said --
>> Tiberio Caetano: Because you just train against labeled data, right? So you can have, for example -- and this is what you have these days -- either good hand segmentations or correct stereo ground truth.
>>: But you're given the parameters to fit that model. But, for example, for the conditional random field you define the potential functions; you must assume something.
>> Tiberio Caetano: You define the features. You don't define -- you define the model class, and then you search for the element in your model class that maximizes the evidence of the observations, whereas in traditional models you just select an element of your model class: you hand-craft your parameters. So even if you have a large model class that you could realize entirely by changing your parameters, you don't explore that entire model class; you just fix a specific model and that's the best you can do. You work hard at optimizing it, yes.
>>: If you're going to address this later just ignore my question, but what about learning parameters in MRFs, like Marshall Tappen has done?
>> Tiberio Caetano: Sure. Absolutely. But the point is that you want to optimize those parameters to minimize the final loss, which is, for example, the loss for stereo -- for example, a Hamming loss between what you are predicting, let's say, and a hand label, just the disagreement between them.
And minimizing that loss can be tricky.
>>: It can be very challenging. That's one reason Marshall worked with Gaussian MRFs: it was tractable, and he could do image denoising where, with synthetic data, he was optimizing the loss between the learned MRF's output and the final output.
>> Tiberio Caetano: The basic problem, you see, is that if you really want to minimize a loss that's interesting, like, for example, a Hamming loss, that loss is discontinuous. You have a piecewise constant optimization problem. There's no hope of using standard continuous optimization techniques to solve that.
So what some people say is: instead, optimize the likelihood, which is what people are more familiar with -- logistic regression, everyone knows how it works. You just create a more sophisticated model, but then you have a complicated partition function to deal with.
So that has a problem, the problem of computing the partition function. But this other one is an interesting alternative, because here you have no partition function issues. You just look at the boundary, you just look at a few examples, and this will be much faster than that in general.
And the principle here is to upper bound that crazy objective you have. Because the ideal loss you want to minimize would be a very crazy Hamming loss, which is piecewise constant; there's really no way to optimize that efficiently.
But if you have a decent surrogate loss that kind of preserves the structure of that original loss, and that surrogate loss is amenable to efficient optimization, that's what this is all about. And this has actually been extremely successful.
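To make the surrogate idea concrete, here is a minimal, hedged sketch of the structured hinge loss used by structured SVMs. The feature map, the data, and the tiny enumerable output space are all made up for illustration; the whole point of the papers above is how to handle output spaces far too large to enumerate.

```python
import itertools
import numpy as np

def hamming(y, y_true):
    """Fraction of disagreeing positions: piecewise constant, hard to optimize."""
    return float(np.mean(np.array(y) != np.array(y_true)))

def joint_feature(x, y):
    """Toy joint feature map phi(x, y) for a length-3 binary labeling (made up)."""
    y = np.array(y, dtype=float)
    return np.concatenate([x * y.sum(), y])

def structured_hinge(w, x, y_true, labels=(0, 1), length=3):
    """max_y [ Delta(y, y_true) + <w, phi(x, y)> - <w, phi(x, y_true)> ].

    Convex in w, and an upper bound on the Hamming loss of the predictor
    argmax_y <w, phi(x, y)>.  The max is "loss-augmented inference"; here the
    output space is tiny, so we simply enumerate it.
    """
    best = -np.inf
    for y in itertools.product(labels, repeat=length):
        score = hamming(y, y_true) + w @ (joint_feature(x, y) - joint_feature(x, y_true))
        best = max(best, score)
    return best

x, y_true, w = np.array([0.5, -1.0]), (1, 0, 1), np.zeros(5)
print(structured_hinge(w, x, y_true))   # with w = 0 this is max_y Delta = 1.0
```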
Although the first paper that I know of that used this in computer vision is only from 2005. We did this in 2007, and now people have suddenly realized, yes, we're really into this stuff.
Conditional random fields people have been using a bit longer in computer vision; they were first proposed in 2001. And they have very nice statistical properties: this is a consistent estimator, whereas this one is not a consistent estimator, for example. But it has serious problems because of the partition function. You need to build an entire probability distribution over images, and when you want to learn, to do gradient descent, you need to estimate the partition function. That's very bad.
This one seems to be a practical alternative, the first one. Both have advantages and disadvantages. Okay, I'll move forward here.
So just one example: graph matching. The graph matching problem is a really hard problem. If you want to take into account pairwise constraints, for example, you get a quadratic assignment formulation.
That can be really hard. The point is that traditionally people just hand-craft these objective functions for graph matching. You just say: look, these unary similarities will look at SIFT features, these quadratic similarities will look at relative distances, or whatever.
Then you fix the parameters and work really very hard to solve that combinatorial problem, and you didn't know in the first place whether your objective was good or not.
The point is you can actually learn those energy functions. So in this work, for example, we used structured SVMs to learn those energy functions.
And then we found some very nice things. For example, you can model graph matching with complicated combinatorial settings like quadratic assignment, or with much simpler combinatorial settings like linear assignment, which looks only at matches between individual points instead of having pairwise constraints.
That's a tractable model. It turns out that if you learn that simple model, you obtain results which are as good as you would have obtained with the complex model without learning. This is something we figured out in these examples here.
This is what you get: if you model graph matching as linear assignment, which is efficient to compute, with learning it performs similarly to quadratic assignment without learning.
Why? Because you are finding the right weights for the SIFT features, for example, in such a way that your final result is what you want it to be, because you have training data. So you engineer it. Basically you automatically tune the relevance of your 128 SIFT dimensions, for example, so that the joint weights, when you solve this combinatorial problem, produce the output you want them to produce, because you have the ground-truth output -- either by hand labeling or by laser scans, for example, in the case of depth estimation.
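A minimal sketch of the linear-assignment version he describes: score each candidate correspondence by a weighted difference of (for instance) SIFT-like descriptors, where the per-dimension weights are exactly what structured learning would tune against ground-truth matches, and then solve the assignment exactly. The descriptors and weights below are synthetic placeholders, not his model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(desc1, desc2, w):
    """Linear assignment matching with a learned weighting of descriptor dims.

    desc1, desc2 : (n, d) arrays of per-point descriptors (e.g. 128-dim SIFT)
    w            : (d,) non-negative weights; the quantity learning would tune
    Returns, for each point in image 1, the index of its match in image 2.
    """
    diff = desc1[:, None, :] - desc2[None, :, :]    # (n, n, d) pairwise differences
    cost = np.einsum('ijd,d->ij', diff ** 2, w)     # (n, n) weighted cost matrix
    rows, cols = linear_sum_assignment(cost)        # Hungarian algorithm, exact
    return cols

# Toy usage with random descriptors and uniform weights
rng = np.random.default_rng(0)
d1 = rng.normal(size=(30, 128))
d2 = d1[rng.permutation(30)] + 0.01 * rng.normal(size=(30, 128))
print(match_points(d1, d2, w=np.ones(128)))
```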
So this is one example. Okay, now my matches are showing up. This is my favorite subject, graph matching. This is just to show that, okay, in green here you have correctly matched points and in red you have wrongly matched points.
This is exactly the same model class; it's just that here we have handcrafted the matching score, and here we have learned it. So: handcrafted here, learned matching scores here.
So you see that there are many mistakes that you can avoid just by tuning to the right model within your model class. Of course, here, how many classes do you have? n factorial. In this case you have 30 points here and 30 points here, and I have as many classes as there are possible matches, because my prediction is the match. I'm parameterizing the algorithm that's producing the match. It's like having a support vector machine with 30 factorial classes. How do I solve that? That's why it's not trivial. You need to look at the paper to see the techniques people are using.
But the point is you can improve. Yes, you can improve. So here's another example, which we have at CVPR this year: shape classification. We want to classify shapes, okay?
One thing that people do in shape classification is to use matching scores -- to use the results of matching to do classification. If you remember the paper by [inaudible] in 2002, that was basically shape context features; maybe people are familiar with that. Essentially what they were trying to do was classify objects based on matching scores.
But they were hand-crafting the matching score. So you hand-craft the matching scores, you produce a match that's completely a function of how good your matching scores were, and only then do you parameterize the final matches and do learning, after that stage. You trust your matches completely, and you do learning only after the match has been produced.
So you may be doing learning on completely crap features, because you've done the shape matching in the first place completely trusting your uncalibrated linear assignment algorithm.
So what we're doing in this paper is everything in a single shot: we optimize the matching score itself so that the classification loss is minimized. Okay. So here is an example with no learning. You produce a match -- you can match this scan against this dog -- but the algorithm has wrongly classified the dog and the camel as belonging to the same class.
And here, if you do learning, you still use the matching as a means to classification, but you parameterize the matching itself. And then you can recover: we can say these two guys are camels, they belong to the same class.
>>: So there has been work on learning better feature descriptors, or distances in feature space, and things like that.
>> Tiberio Caetano: Sure.
>>: And you have training databases of correct and incorrect matches, right? So how does this differ from learning just a better matching score?
>> Tiberio Caetano: Because this is optimizing the matching loss; that one is not. This is structured prediction. In that case you are tuning your matching scores so as to minimize a very low-dimensional surrogate. Here you go directly to the output space and say: look, I want this matching matrix, this permutation matrix. The loss I want to optimize is the Hamming loss between two permutation matrices, which is extremely high dimensional. You go to the problem you'd actually like to solve instead of creating a surrogate low-dimensional version.
>>: In that case.
>> Tiberio Caetano: Which paper are you talking about.
>>: Winder and Matthew Brown -- they've done a series of things where they have SIFT-like features and they're trying to learn better descriptors. And they have a database of which features actually match: several million patch pairs that are known to be true matches, and other pairs that are known to be false. So basically they have a loss function which is correct or incorrect matching.
>> Tiberio Caetano: Right. So the question I have is: what's the algorithm that predicts the -- what's the predictor in this case?
>>: [inaudible] might be learning how to make a single match.
>>: That's right.
>>: In this particular case you don't have complete matches.
>> Tiberio Caetano: It's a joint.
>>: So you're saying if you wrap a more complex algorithm around it.
>> Tiberio Caetano: You can still --
>>: You don't have access to what's a correct individual match; you only have access to the final thing. You wrap your matcher in a bigger black box and you evaluate the output of the black box. That's the only difference, right?
>> Tiberio Caetano: The main thing is that I'm not optimizing the scores of individual matches. I'm not doing that. I'm optimizing the joint match, the quality of the joint match, because these two points cannot match to the same point.
For example, I have a bijection constraint -- these are essentially bijections -- so there are factorially many possible things that I need to optimize over. I'm not saying, well, this guy is similar to this one, so let's score that single correspondence in isolation. No.
>>: More complex.
>> Tiberio Caetano: It's a more complex prediction model.
>>: The ground truth, mapping from one to the other.
>> Tiberio Caetano: You need the ground truth.
>>: But you know that?
>> Tiberio Caetano: You need to provide that manually. That can be an issue in many cases. But there are two ways to circumvent it. One way is to do a semi-supervised version of this: in the semi-supervised version you provide some of the matches and don't provide the others, and you can generalize the setting to marginalize over the hidden variables for the ones you don't provide.
That's one option. The other is that if you're doing shape matching in a simplistic setting like this one, you could easily create a semi-automated labeling procedure. You say this is going to match to this and this to that, you label three or four points, and then you run a simple algorithm that completes the matching and use that for labeling.
We need to move forward here to make sure we get everything done. Here is yet another example, another piece of work we did, at NIPS last year. These are other types of matching problems, near-isometric matching cases, so you can also improve by doing this structured learning for near-isometric matching -- you just change the types of features you have and the type of inference you use, junction trees, and structured SVMs as well.
This is another thing we just submitted, but I discovered today that it wasn't accepted at ICCV. Anyway, it's trying to do something interesting. Recently a group in San Diego used conditional random fields to do joint object categorization. What's joint object categorization?
Imagine, just for the sake of simplicity, that you could segment this image properly. You could segment it into -- you have here four disjoint regions, but you don't know the labels of those regions. You don't know what they are. But you could segment them properly. As a matter of fact, in order to assess the quality of the different algorithms you need to assume that the segmentation is independent of the model.
So let's just assume you can segment this properly, and the question is: which object is this object here? Which object is this, and this, and that?
So of course you could think: you know, I will create a classifier for each label -- a vegetation classifier, a sky classifier, a ground classifier -- run them and predict those things independently, right?
But this misses something, because if you see a certain object it's likely to be above the ground and not under it, or likely to be under the sky and not over it.
So different object labels don't occur independently in images. So if you want to predict things, you would like to leverage a dataset where you have labeled things and said these [inaudible] are under the sky and over the ground; or if you see water and you have something big on the water, you know it's likely to be a ship, not a car, something like that.
So you want to leverage those correlations, right? So when you do prediction, you predict an entire combinatorial structure in the output; you don't predict this and that independently of that guy. But that presents problems, because you have a combinatorial problem to solve, right?
Essentially you have a sophisticated predictor, and you can parameterize it using training data. That's the point.
In this work we're substantially improving on the work of [inaudible] and collaborators. So now we have to find other venues to stick this stuff in.
Here is something else we just submitted. What we are trying to do in this case -- no images here, but this is the point. Assume you want to do graph matching. Typically in graph matching you make some assumptions: for example, I want isomorphisms, I want homomorphisms, I want isometries. There are all sorts of assumptions that people like to make when they do graph matching, depending on the type of problem.
So what we do here is become completely agnostic with respect to those assumptions. We parameterize everything, right? And we have a unified model where we just learn the weight of everything.
For example, in this case here we have shape context features, 60 of them, and we are learning their importance. And in this corner here we are learning the importance of isomorphic features, isometric features, homomorphic features -- all the different kinds of graph matching that you can think of.
There's no point saying this is an isomorphic graph matching problem or a homomorphic graph matching problem. You parameterize everything, you learn, and you end up with a soft description of the matching problem. Of course, when you do that you substantially improve on existing results and all that.
Concluding this first part -- and this is the main message I want to pass on today; the second part is a bit more technical -- but this is the main message: when you have simple prediction problems like classification, where we just want to classify objects, people in general use simple models like support vector machines, right? And they perform learning properly. Computer vision people have already learned how to master these techniques, and the results are improving, improving, improving.
However, when you have complex prediction problems -- think of graph matching, think of image segmentation and MRFs and stereo and all that stuff -- people use complex models and, yes, they're starting to use learning, but they're still not leveraging all the data that could be leveraged here. So most people actually still overlook learning in this case.
And maybe you can still get away with overlooking learning if you have a stereo matching problem where you assume a monochromatic image. But imagine five or ten years down the road, when every image you get, even from your webcam, is hyperspectral. You will have 72 channels.
How would you manually tune the stereo parameters for 72 channels? You'd better start thinking ahead of time and learn all this stuff automatically.
Okay? So the take-home message is: use techniques that allow you to do both things, complex models and learning. That's the first thing you should take from this talk.
And the second thing is that in this trade-off between model complexity and data leveraging, I think computer vision is still too far to the first side: we are using very complicated models, but we are not leveraging as much data as we could. Okay. So I think a better balance is possible here. We should think about it and ask these questions.
>>: Just to point out, though, for a long time we didn't have the ground truth data --
>>: Tiberio Caetano: I agree with you. That's true.
>>: Label --
>>: Tiberio Caetano: That's true.
>>: A lot of things came online just the past few years.
>> Tiberio Caetano: That's a completely legitimate argument. But there are two points here. First, that's true, but it's not the whole story. And second, even with a lack of completely labeled data, you can still use this stuff.
So basically you can use a semi-supervised version of this, but for that you need to work hard, because you need to look at what has been presented at NIPS over the last two years and understand all that.
>>: That's the problem, you guys have [inaudible] [laughter].
>> Tiberio Caetano: Different questions. We've been working on different questions.
Since I'm mostly a machine learning person applying my stuff to vision, the way I see this is: talk to the vision guys and say, look, we have these tools, and we can really change these things by using them.
So let's try to talk to people and excite people about using these tools.
Yes?
>>: My question is that while you're leveraging the [inaudible] data [inaudible] model such that it can fit the data [inaudible], the risk is that the model will be tuned specifically to that particular set of data and -- [inaudible].
>> Tiberio Caetano: In practice you just do regularization. This is not a technical talk, right, but in practice you don't minimize only the loss on the training set; you minimize the loss plus a regularizer that basically constrains the model class.
It's saying: well, I'm not going to learn arbitrarily complex functions. I'm going to learn functions that do well on the training set but are also sufficiently simple. You just do regularization in practice.
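In symbols, the regularized training objective he is describing, in its generic form (the symbols here are the usual ones, not taken from his slides), is:

```latex
\min_{w}\;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f_w(x_i)\big) \;+\; \lambda\, \Omega(w),
\qquad \text{e.g. } \Omega(w) = \tfrac{1}{2}\lVert w \rVert^2 ,
```

where lambda trades off fitting the training set against keeping the function simple.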
Okay. So that is the main message I had for today. Since I still have 15 minutes, let me tell you a few other things we're doing. This is something I'm excited about because we just submitted it; it's very recent.
And it has an application in computer vision: convex relaxations of mixture regression applied to motion segmentation. Here's a motion segmentation problem. This is supposed to be a single image of an image sequence.
So I have this guy moving, the other car there is moving, and you have the background here. And the point is that you want to segment these motions. This is the ground truth of the segmentation, let's put it this way: you know that's one motion and that's another motion.
The motion models that we have -- you want to estimate the fundamental matrix and all this stuff -- are all linear regression problems. So if we knew to which of these three motions every one of these features belongs, we could fit a simple linear model that estimates that motion, simply by doing least squares regression. The problem is we don't know. The problem is we don't know that this guy here and this guy here belong to the same motion.
We don't have this information. In fact, the motion membership for every one of these points is a latent variable that can take three values in this case, but we don't know what it is.
So how do you solve this problem? If you observe the memberships, you have a nice convex problem that you can solve in closed form, least squares. But if you don't know them, how do you solve it? This becomes a very complicated optimization problem, because it is a combinatorial optimization problem. You would ideally have to look at all the possible assignments of these points to three motions, which is an exponentially large set, solve the regression problem for each one of them, and find at the end which configuration gave you the best fit, the best least squares solution. Of course you can't do that in practice, so what do you do?
I'll tell you what people do. So this is the setting, in a more visual, mathematical way: you have mixture regression. Here's a bunch of points, here's another bunch of points, and another bunch of points. If you knew that these belong to the same line, these belong to the same line, and these belong to the same line, you would just solve three independent regression problems. But you don't know that.
Okay, let's just assume for the sake of simplicity that there are three models, so you don't have a model selection problem: only three models, and you know they are straight lines. But how do you solve this? What's the typical answer? The typical answer is that you have a latent variable for each one of those guys, which is the membership of that point, and people just use EM. That's what people do. So what do you do? You pretend you know the memberships and solve the regression problems, use that solution to solve for the memberships again, then go back and re-estimate the models, then solve again, and so on.
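A minimal EM sketch for this mixture-of-linear-regressions setting, just to make the alternation he describes concrete (a standard algorithm under simplifying assumptions: fixed known noise variance, K components, no model selection):

```python
import numpy as np

def em_mixture_regression(X, y, K=3, iters=50, sigma2=0.1, seed=0):
    """EM for a mixture of K linear regressions with known noise variance.

    E-step: soft memberships (responsibilities) under the current lines.
    M-step: weighted least squares per component; mixing weights from counts.
    The result depends heavily on the random initialization, as discussed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(K, d))                  # random initial regressors
    pi = np.full(K, 1.0 / K)                     # mixing weights
    for _ in range(iters):
        # E-step: responsibility of component k for point i
        resid = y[:, None] - X @ W.T             # (n, K) residuals
        logp = np.log(pi) - resid ** 2 / (2.0 * sigma2)
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least squares for each component
        for k in range(K):
            Xw = X * R[:, k:k + 1]
            W[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y)
        pi = R.mean(axis=0)
    return W, R.argmax(axis=1)                   # fitted lines, hard memberships
```

Running this from many random initializations and keeping the best fit is exactly the restart strategy the convex relaxation is later compared against.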
Well, we know very well that the function EM is trying to optimize is a highly non-convex function, full of peaks and valleys, and in some ways it's discrete, because we're optimizing over a discrete space. It's piecewise constant. It's the worst possible optimization problem you can have.
So is EM going to do well? If you can afford to run EM over --
>>: Is it discrete, or can you have soft assignments in EM?
>> Tiberio Caetano: No, in EM everything is soft assigned, and you can then discretize.
>>: But it's not a discrete optimization, combinatorial optimization?
>> Tiberio Caetano: One thing is the problem itself; the other thing is the model. The problem is discrete. The model is soft, [inaudible] to be exact.
I'm sure many of you have at least implemented EM once in your lifetime, or whatever, and run EM. You know that EM can do very badly. It can do very well, but it can do really badly. The reason EM can do badly is that it finds a really poor local minimum. So how do you go about solving this? Okay, you can try EM, or you can try to do something else.
So here's the something else we're trying to do. We have a very complicated function here, very high dimensional and multi-modal, and you want to find the global optimum here -- the global optimum.
And EM -- EM is basically doing local descent. If you initialize here you'll find this point; if you initialize here you'll find this point. But there are millions, billions of valleys here. So depending on the initialization you're going to do well, or you're not going to do well.
So what do we do? We construct a convex upper bound on this function, and instead of solving this problem we solve that one. So we change the problem. Is that bad? Well, of course you're not solving the original problem.
But the question is: if we happen to construct a convex upper bound whose minimum is not too far from this minimum here, you've really gained something. Because solving this is easy, and you obtain a solution close to the one you would have obtained if you could have solved the original problem.
Before telling you the good news, I'll tell you the bad news. The bad news is that there's no existing result I'm aware of that tells you the deviation between this point and this optimum. All you can do at the moment is construct a convex upper bound which is sound in an intuitive sense, but you cannot really give many guarantees on how close this solution is to that solution.
So that's the bad news; the good news is coming now. So we construct this convex upper bound on this objective function here. But remember, this picture here is very misleading, because this problem is extremely high dimensional.
Still, that's a convex function, and we know how to optimize convex functions; there's nice technology to do that. It's not simple, though. This is not a simple unconstrained optimization problem. This is a semidefinite program.
And among the many possible convex optimization problems, this is one of the worst you can have. It is not going to scale. Basically you can solve this for 300 or 400 observations; you cannot solve it for a thousand observations. So this is not good. Typically interior-point methods, that is, barrier-type methods, solve this problem: essentially you enforce the constraints in the objective function via a logarithmic barrier on those constraints. This is extremely expensive because you operate on a square matrix over the observations. So it's [inaudible] really going to be bad.
So here's what we do. We reformulate the problem as a semidefinite program, which is really hard to solve, but we figured out two ways of solving this problem exactly without doing semidefinite programming.
Okay. So we completely bypass the need for semidefinite programming. And this is only possible because of the very particular structure of this problem; it wouldn't be possible in general.
I only have five minutes, so I won't go much into detail, but I want to give you the intuition of what's happening. Every one of those observations I mentioned previously is called an x_i. You have a latent variable for each of them which tells you to which motion that particular observation belongs.
Now you construct this v_i variable, which is a product of these two guys. So this is not known, this is partially known: this is known, but this is not known.
So this vector here is a product of this thing that you know with this thing you don't know, so it's something you only know partially; you don't know it completely.
And with this notation you can easily write the mixture of regressions problem simply as a regression problem.
Here is your prediction, here is your observation, here is your model. You just have the inner product of your parameter vector with v, and you assume, in this case, least squares, so you basically assume Gaussian noise in your data.
So here is the noise model. Okay, classical regression. The key difference is that here we don't have just x_i; we have c_i in there as well, which is something you don't know completely, whereas in traditional regression you would only have x_i, because you don't have this class membership issue.
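A small sketch of the lifting he is describing, using a Kronecker product as one concrete way to form the product of the (unknown) membership indicator with the observation; the letters and the exact construction on his slide may differ:

```python
import numpy as np

def lift(x, c, K):
    """v = e_c (kron) x : place x in the block selected by the membership c.

    x : (d,) observation; c : integer membership in {0, ..., K-1}.
    With w the concatenation of the K per-motion regressors, <w, v> equals
    w_c^T x, so observed memberships turn the mixture into one big regression.
    """
    e = np.zeros(K)
    e[c] = 1.0
    return np.kron(e, x)                         # shape (K * d,)

# If the memberships were observed, ordinary least squares recovers all K models.
rng = np.random.default_rng(0)
K, d, n = 3, 2, 300
W_true = rng.normal(size=(K, d))
X = rng.normal(size=(n, d))
c = rng.integers(0, K, size=n)
y = np.einsum('nd,nd->n', X, W_true[c]) + 0.01 * rng.normal(size=n)
V = np.stack([lift(X[i], c[i], K) for i in range(n)])
w_hat, *_ = np.linalg.lstsq(V, y, rcond=None)
print(np.allclose(w_hat.reshape(K, d), W_true, atol=0.1))    # True, up to noise
```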
Okay. Well, it turns out you can reformulate all this stuff in a nice matrix form. You have a quadratic term and a linear term, a nice Gaussian exponential family, and it becomes linear. And this is the regularization that was mentioned earlier: we want to penalize models that are complicated, so we have a Gaussian prior on the parameters.
And what you want to do -- well, there's a small notation issue here -- these are the variables, see here. We want to solve a traditional least squares problem on W.
Okay, for a fixed C. But C is not fixed. So we need to jointly solve this linear regression problem and the combinatorial problem. So this is the original objective.
That, as I mentioned before, is very difficult; you have this discreteness in all this stuff. So after lots of math you can prove that this is a nice convex relaxation of that problem. I won't give you the details, but essentially what you do is compute the dual of the original problem, use duality and a few math tricks, and you obtain this optimization problem here.
This problem is a convex problem. Essentially you are optimizing over C. This function here is concave in C, because you have a negative C squared here -- concave in C. And then you have a semidefinite constraint here, because this matrix M needs to be positive semidefinite, needs to have nonnegative eigenvalues and all that, blah, blah, blah. So you have a semidefinite program, but it's hard to solve.
This is an exact reformulation of that problem. So if you solve this problem here, you will recover the same solution as if you solved that problem there. But solving that problem is very expensive, and solving this problem is very easy, because this part here can be solved by basically eigenvector computations -- it's cubic, so you can solve it efficiently -- and this outer optimization can be done by gradient descent.
So, essentially, you alternate eigenvector computations and descent steps, and this process is fast. This version of our reformulation runs about as fast as EM, basically.
And then here's the algorithm, blah, blah, blah. You need as input those parameters of your noise model, you need to do eigenvector computations, and you use non-smooth optimization, standard convex optimization tools.
But this is much more efficient than doing semidefinite programming with interior-point methods. It's much faster, and it's scalable: you can solve this for thousands of points.
Okay, that's the first reformulation. You can find another convex reformulation of the original semidefinite program, and this is the ugly-looking version of that thing. But you can prove that this is also a convex optimization problem. You also have a semidefinite constraint here, but that semidefinite problem can be solved efficiently as well.
It's not as fast as the other one; it's about three times slower than EM. Okay. So here's the algorithm. I won't go much into the details, but here are some pictures. Let's close with some images.
Here I have a few datasets for motion, and here's the first dataset. This is the ground truth, and then you run EM, you run the first reformulation, and the second convex reformulation. This is sort of an easy problem: EM finds the optimal solution already, and these guys also end up finding the optimal solution.
Now you have a slightly more difficult problem here, and you can start to see the differences. You observe that EM makes a few mistakes here and other mistakes here. This convex reformulation seems to be at least as good as EM; it is doing quite well. This one should technically find the same solution, because these two are solving the same problem; the differences here are basically numerical.
Here's the ground truth. Here EM does pretty badly; it makes all these mistakes here and over there. And here, with this convex relaxation, you are still far from the ground truth, but at least you make fewer mistakes. You can quantify this; we have tables showing the details.
>>: How do you initialize EM?
>>: Tiberio Caetano: We run EM 100 times. Initialized randomly. That's the setting. We could
run EM.
>>: Can't figure out --
>>: Tiberio Caetano: This is the best out of 100 iterations.
>>: The other one's only once.
>> Tiberio Caetano: Once. Convex problem. If you solve it 100 times you get the same solution.
>>: The min max.
>> Tiberio Caetano: Max min. Same speed.
>>: Same iteration of EM or 100 iterations?
>>: Tiberio Caetano: 100 iterations, being fair here. You could run EM for 1,000 iterations but
then it's going to be slow, right.
>>: Yeah.
>>: If you've got that many points, the number of initializations you start with won't make a lot of difference, because the number of possible configurations goes like two to the 30th or 40th or 50th.
>> Tiberio Caetano: In theory, yes. In practice you see you can get a difference if you use 10 or 100 -- we found a difference between 100 and 10.
>>: EM.
>> Tiberio Caetano: But we didn't really run experiments with 1,000 iterations to know the answer.
>>: What happens when you take the max-min convex solution and use that to initialize EM?
>> Tiberio Caetano: You would find the same solution. Well, you can -- you mean use this solution to initialize EM? We haven't done that, but you don't really want to do that, because it would be really expensive: you'd have to solve the problem twice, the --
>>: Just one iteration. Like --
>> Tiberio Caetano: Yes. Yes, but, you know, use this -- this is something that people do, you know? Initialize with this, because you could end up at this point, maybe, right? Maybe we're here. It's a very good question. We haven't done it -- you could end up at this point here.
>>: Yeah.
>> Tiberio Caetano: So you are minimizing this function, right? But you are not minimizing the function that you really care about. That's something you could do; we haven't done that. But yes, definitely you can use this convex optimization to initialize EM.
>>: The other thing is to take the energy under your model of the solutions from each of these techniques and see which one's lower.
>> Tiberio Caetano: That's true. But there is a slight catch to that, which is the following: there are hyperparameters in our model. You have this alpha and this gamma here. You can choose to cross-validate those parameters. If you cross-validate or do some parameter search, at the end the best EM result and the best convex relaxation result may be for slightly different models, and then your scores may differ.
Okay. But we are still in the initial experiments with this. One thing we want to do is stick to the same hyperparameters and do a very thorough analysis and see what --
>>: The parameters -- the sigma is the noise?
>> Tiberio Caetano: Sigma is the noise. So gamma -- one caveat here: we need one extra parameter compared to EM. We need this gamma. What this gamma tells you is the class balance, the balance between the sizes of the classes.
This gamma is an upper bound on the size of the largest class. So if you want no class to be larger than 50 percent of the points, you need to set that gamma accordingly.
We don't really have a solution for avoiding that. But this alpha and this -- so sigma is totally a property of your data, but alpha is not; alpha is a parameter of the model.
So you may end up with slightly different models if you cross-validate them; if not, then your point is valid. Okay. So overall -- and we have many other motions, I'm just showing three here; in our submission we have appendices with many other things -- especially the max-min, because the max-min is quite fast and it seems to preserve quite a lot of the structure of the problem. All right. Okay. So, yes?
>>: Question on this. What's moving here? The sequence.
>> Tiberio Caetano: The camera. That's why you have all these things, all this green here. The
car and that car.
>>: The image -- I know but this little one, I mean it's twice C. There are already a small number
of images there. What about this one on the bottom? Just those images?
>>: Tiberio Caetano: No, there are more. You have to ask the student who made this. I don't
know. I have to check this. I don't know exactly how many points here.
>>: What about the number of mixture components --
>> Tiberio Caetano: That's a big question. We don't. These are two fundamentally different questions. This is a model selection question, so there's no way you can use a likelihood criterion to optimize that, because the likelihood is monotonic in the complexity: if you make more and more complex models you're always going to obtain a better likelihood.
That's a fundamental question, a philosophical question -- it's the problem of induction in science. So it's an open problem. You can use different model selection criteria, MDL or whatever, or you can try to be Bayesian and do all this hierarchical Bayesian stuff, because if you are completely Bayesian then you have no model selection problem: you just put the prior in and keep running.
So --
>>: [inaudible].
>>: Tiberio Caetano: Can you speak up a little?
>>: For EM you can use AIC, BIC criteria.
>> Tiberio Caetano: AIC and BIC are model selection criteria; that's exactly what I'm saying.
>>: [inaudible].
>> Tiberio Caetano: You're right. You're completely right. The advantage of EM is that since we have a probabilistic model you can be as Bayesian as you want, so you can do what you're saying.
In this setting it's not obvious how to do that. I don't know the answer. Any other questions?
>>: I remember a long time ago, more than 10 years ago, I think Blake worked on something called graduated non-convexity.
>> Tiberio Caetano: Graduated assignment.
>>: They called it graduated non-convexity; it was a continuation method, popular in the vision community in the mid to late '80s -- [inaudible] worked on this. But it was for regularization problems with non-quadratic penalties.
>> Tiberio Caetano: Non-convex as well? Non-convex?
>>: Non-convex. So you would gradually introduce the non-convexity.
>> Tiberio Caetano: I see.
>>: So it's a continuation method --
>>: I don't know whether it's --
>>: I don't know if it's related or not. It might be, because there's this idea, at least at the philosophical level, of having an approximating function that's smoother.
>>: You've compared this with EM. I think maybe that's not absolutely the state-of-the-art method for this particular problem. Maybe your technique's more --
>> Tiberio Caetano: What would be the state of the art?
>>: If you look at the recent work of [inaudible].
>> Tiberio Caetano: Yes, yes.
>>: He and I have a couple of papers.
>> Tiberio Caetano: I know.
>>: And well they're sort of methods that [inaudible].
>> Tiberio Caetano: Yes.
>>: Which pose an affine model on this, and then the whole thing becomes linear, and this is like what René Vidal calls a generalized PCA problem.
>> Tiberio Caetano: Yes.
>>: And then we see it in that -- he also has more recent papers; I think it's CVPR 2009, I believe.
>> Tiberio Caetano: That's good.
>>: That's one we've seen.
>> Tiberio Caetano: Maybe I'll talk to him at CVPR.
>>: He uses a compressed sensing technique in a way to solve the GPCA problem. So looking at the results, the numbers in his paper are fairly amazing. It makes me feel ashamed that he and I published a previous paper, because it just kills it, right.
You should --
>> Tiberio Caetano: I need to look at that, because, remember, this is a hammer for which we are looking for a nail. We're not solving the motion segmentation problem; we are solving the mixture regression problem, and modeling motion segmentation as mixture regression, where you have a quadratic criterion. Probably those are different models.
They are not the same model I proposed here in the very first slides I put up. This is basically a quadratic regression problem. So they are different models, and I would have to formulate new convex relaxations for every different objective.
>>: Okay.
>> Tiberio Caetano: I'm actually technically done. So if you --
>>: Okay. Tiberio is here today, and we have some time; we've arranged a lot of talks while he's here. It would be nice if you could work with some of the interns and [inaudible], so let me know and I'll [inaudible].
>> Tiberio Caetano: Thanks, Rick. Thanks, everyone.
[applause]