>> Larry Zitnick: Hi. It's my pleasure to introduce Dhruv Batra. He's a research
assistant professor at TTI Chicago. Last spring he spent some time as a visiting
researcher at MIT with Bill Freeman, where he's working on learning latent
structural SVMs. Before that he also interned at MSR Cambridge with Pushmeet.
And before that he was a Ph.D. student at CMU, where he was advised by Tsuhan
Chen and worked on co-segmentation using interactive methods.
>> Dhruv Batra: Thanks, Larry. Thank you all for coming. So I gave a talk at
MSR a couple of years ago. And I think one of the things I said was it's really
nice to be here at Microsoft, and one of the things I've always wanted to do was
give a talk at Microsoft with a Mac. That was two years ago.
I've been doing that ever since -- giving talks with a Mac. This trip I forgot my
dongle at home, and so yesterday I had to go to an Apple store and buy one. So
I guess karma got its way with me for making that remark.
But, okay. So let me begin. If I had to summarize what we were doing in machine
learning 20 years ago, this picture essentially captures what we were interested
in, right? We were interested in partitioning two classes, and finding the best
ways of partitioning two classes. Faces from nonfaces, digit three from digit six,
chairs from tables. That was the canonical thing we were interested in. And if I
had to say what changed in the last 20 years, I would say the thing that changed
is that now we're interested in much larger output spaces. We're interested in
exponentially large output spaces.
And I'll show what I mean by that. So in segmentation, which is a problem that
I'm interested in, you're given an image and maybe some user scribbles. Your
output space is, at each pixel, either a binary label -- 0/1, foreground/background
-- or one of the K categories that you know exist in the database. So the space
of possible outputs is the number of labels raised to the power of the number of
pixels. That's the output space we're essentially searching over.
Or in object detection, where to find an object we work with parts-based models.
So we say a bicycle is made up of maybe a wheel and a bicycle rim, or a person
is made up of these parts. We search for where these parts are located and do
spatial reasoning over them.
So the output space in this case is the number of pixels -- the possible locations
of these parts -- raised to the number of parts. Again, an exponentially large
output space that we have to search over. And this can be not just in a single
frame but in video. So if you're trying to do person layout in video, the exponent
goes up by the number of frames that you're dealing with.
In super resolution, one of the early, one of the fundamental models that we
have is a graphical model which says: I'm trying to resolve this image into a much
higher resolution, so what I'll do is collect a dictionary of low- and high-res
patches.
Then for each input patch I'll find the closest low-res patch in my dictionary and
replace it by the corresponding high-res patch. So the space of outputs that
you're searching over is basically your dictionary size raised to the number of
input patches.
Again, an exponentially large output space. And this is not just vision. People in
NLP are really interested in these problems of dependency parsing. There's a
parse tree that tells you this word modifies this other word in the sentence. So
it's a directed tree on the words of the sentence.
How many directed trees can I form? It's N to the N minus 2, where N is the
number of words in the sentence. So that's the space you're searching over.
And it could be information retrieval. I go to my favorite search engine, Bing, and
I search for some documents. The output space in this case is the number of
documents factorial -- that's the space of rankings that I have to search over.
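A minimal recap of the output-space sizes just mentioned, written as formulas (K, P, n, N, D are just shorthand for the quantities named above, nothing new):

```latex
|\mathcal{Y}_{\mathrm{seg}}|   = K^{P}    \quad (K \text{ labels, } P \text{ pixels}), \qquad
|\mathcal{Y}_{\mathrm{parts}}| = P^{n}    \quad (P \text{ locations, } n \text{ parts}),
|\mathcal{Y}_{\mathrm{trees}}| = N^{N-2}  \quad (N \text{ words, Cayley's formula}), \qquad
|\mathcal{Y}_{\mathrm{rank}}|  = D!       \quad (D \text{ documents}).
```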
So in some sense, if we're on this side of machine learning, if we have to search
over exponentially large output spaces, then we need to revisit some of the same
issues that we addressed for the two-label case. We have to understand how we
hold distributions over exponentially large objects. My running example will be
segmentation because it sort of makes sense. How do I hold a distribution over
the space of all possible segmentations -- which contains checkerboard
segmentations, the stuff I'm interested in, all white, all black, all possible
segmentations?
Given this model, how do I perform inference in it? How do I find the most likely
segmentation, the one I'm interested in? And learning: how do I learn this model
from data? No expert is going to be able to hand it to me.
All right. So my work has been on all three of these issues, with structured
output models. And today I want to talk about a couple of things, which is
essentially I'll talk about this one work which we're calling the M best mode
problem. And I'll go into details of that.
In the second part, so the first part will be a modeling and inference question.
The second part will be a pure inference question. And time permitting, maybe
there will be some other teasers of things that I'm working on.
So let's get started. So here's the problem that I want to talk about. This is what
we're calling the M best mode problem. And in order to tell me -- in order for me
to tell you that problem, let me give you the model. And the model that we're
going to be working on is the conditional random field, just so that I hand you the
notation. We're given an image, there are some variables. So let's say pixels.
I'm showing a grid graph structure, but I'm not making any assumptions on the
graph structure. This is just a running example.
Each pixel, or node in my graph, has some set of labels that I have to assign to
it. So this pixel might be a car, a road, grass, a person. So a set of K labels.
Also, I'm handed an energy function, which scores all possible outputs -- somehow
it gives me the cost of each output.
And that is represented by a collection of node energies or local costs. So it
might be a vector of costs like 10, 10, 10, 0, 100, and because we're minimizing
cost, this variable prefers the label with cost 0, which is grass.
Also at each edge I'm holding some sort of smoothness prior: nodes that are
adjacent to each other in this graph tend to take the same label. So this is just
saying if you take the same label, your cost is 0; if you take different labels,
your cost is 100.
It's just encoding my prior information. Now -- and instead of thinking of this as
a cost function, you could think of it as a distribution. If you just exponentiate
the negative energy and normalize by a constant, this becomes the distribution
that I'm showing.
That's fairly straightforward. It's a discrete space, so you can sum it out and it's
really easy to write down. Computing that summation might be really hard, but
it's easy to think of this as a distribution.
Now, the task that we're typically interested in is: given this distribution, find me
the best segmentation under it.
And that can be expressed as: given this cost function, find me the lowest-cost
assignment of these variables. I have some node potentials and edge potentials;
let me find the lowest-cost assignment -- that's my best segmentation.
And this is too general a problem; in general it's NP-hard. I can reduce max cut
to this. I can reduce vertex coloring to this. There are inapproximable problems
I can reduce to this.
So in the general case this is hopeless. And faced with an NP hard problem you
bring out your standard tool set. You either write down your heuristic algorithms,
greedy algorithms or convex relaxations.
Before we do that, before it turns into an optimization talk, let's think for a
second. Is computer vision really an optimization problem? Or is it just an
optimization problem? Right? If you did have an oracle that could solve this
optimization problem, would you be happy? Would computer vision be solved?
We have addressed that question before, and the answer is no; we've done
large-scale studies. In this paper by Meltzer and Yanover, they found that if you
take current instances on some datasets, take your models, run exponential-time
algorithms, and find the global optimum under existing models, those global
optima still have some of the same problems that approximate solutions do.
They tend to oversmooth. They miss certain objects. They're not good enough.
In fact, even worse than that -- this was work done by Rick and others -- if you
compare the energy of the ground truth, it turns out that the ground truth has
much higher energy than the map. Your model thinks that the ground truth is
much less probable than this other thing that it believes in.
So not only are our models not perfect; when we spend more time coming up
with better approximate inference algorithms, we somewhat move away from the
ground truth. More time spent on better inference algorithms takes you away
from the ground truth in some sense, and that's sort of disappointing.
The reason for this is somewhat obvious. Our models aren't perfect. They're
inaccurate. They're not completely garbage, though. They have some
reasonable information in them. So one solution to this problem might be to just
learn better models, right? Go ahead and learn better models. Seems simple
enough.
What I'm going to say is that you should be asking for more than just the map.
You've learned this rich distribution from data or from an expert or from some
source. Why extract just a point estimate? Some people have looked at this
problem in the context of combinatorial optimization; it's called the M best map
problem. Instead of just the best solution, they find the top M best solutions.
Can anyone think of a problem with this approach? If you were to find the top M
solutions, what is the problem you would run into?
>>: [inaudible].
>> Dhruv Batra: They'd be nearly identical. Any reasonable objective function
will have some peak; when you ask for the top M solutions, they'll be nicely
clustered around that peak, and these solutions are essentially useless for you.
What you would like to solve is this M best mode problem, where you can do
some sort of mode hopping, where you want to know the other things that your
model believes in. And this is the problem we're calling the M best mode problem.
I want to be careful: we're working with a discrete space, this is not a continuous
distribution, so what does mode mean? I'll formalize that in a second.
But before I tell you how we can solve this M best mode problem: what would
you do if you did have an algorithm that produced some diverse set of
hypotheses? What would you do with it? One thing you can do: anytime you're
working with interactive systems, anytime there's a user in the loop -- so this is
interactive segmentation, a person scribbles on the image and you present to the
person the best segmentation -- instead of just the one best you can present
some five best.
But you have to ensure that those bests are sufficiently different from each other,
that it's a diverse set of hypotheses. So anytime there's a user in the loop you can
present those solutions and the user can just pick one. So you minimize the
user's interaction time.
The other thing is you can rerank those solutions. You can produce some
diverse hypotheses and run some expensive post-processing step that ranks
them. This is the current state-of-the-art segmentation algorithm on the Pascal
segmentation challenge.
What it does is take an image and produce close to 10,000 segmentation
hypotheses. These are highly correlated, highly redundant, but there are many
segmentation hypotheses, and it uses an external segment-ranking mechanism
to rerank them. You might ask: if I have access to a ranker, why don't I just
optimize the ranker, why don't I search for the thing that would be ranked best?
Ranking can be expensive. You want to evaluate the expensive thing on only a
small number of candidates.
Okay. So if we're now convinced that this is an interesting problem, let me show
you how to do this. I'm going to present the formulation of the problem. I'm going
to be working with an overcomplete representation.
I said that each pixel had a variable that could take a label from some discrete
set of labels. Instead of representing it as a single variable, I will represent it as
a Boolean vector of length K. So there's a vector whose length is the number of
classes that this node can be labeled with, and an entry of one in one of the
positions means that that's the class this variable takes.
So if the one is in the first entry, then X I is set to one. If it's in the second entry,
then X I is set to 2. And we disallow configurations where there is more than one
1 in the vector, or zero 1s in the vector.
You can do the same thing for an edge, for a pair of variables. Now the vector
just becomes much longer -- it becomes K squared. You're encoding all K
squared pairs of labels that these two variables can take.
So if you have a one in the first place, that means X I is set to 1 and X J is set
to 1. If you have a one in the second place, it means X I is set to 2 and X J to 1,
and so on for all K squared combinations.
And notice that these are not independent decisions, right? The decision you
take at an edge has to agree with the decision that you take at the node. If this
encoding says that X I is set to 2 and this encoding says X I is set to 1, that is
not allowed.
Why do we do this? Why do we blow up the set of variables? The reason is that
the energy function, the cost function I showed you, can now be written as a dot
product. I pull out the cost of each label, multiply that with this Boolean vector,
and it exactly picks out the cost of this labeling. And the same thing at the edges.
So it's a nice dot product.
And here's the integer program that you're trying to solve. That energy
minimization is just minimizing this sum of dot products at nodes and edges,
subject to the mu I and mu IJ being those Boolean vectors. That's the integer
programming problem.
This will find you the best segmentation. In order to find the second best
segmentation, here's the simple modification that I'm making. All the variables
stay the same. Mu 1 is your map -- that's the best segmentation that you found.
And I have introduced a new inequality: delta is some diversity function that
measures the distance between two labelings, two configurations, and I force
that distance to be greater than K.
K is a parameter to the problem. Delta is something that you choose. I will talk
about both of those in a second. But it's just something that forces you to be
different.
Visually, what does that look like? Here's what it looks like. This is the space of
all exponentially many segmentations. You were searching over that space; you
minimized over the convex hull of this space. This is the best segmentation
under your model -- that's the map. You disallow the other segmentations that
lie less than K distance away from it.
And now when you minimize over the remaining configurations, you find the
second best solution. Right? That's what it looks like visually.
So now this is the problem that we're interested in solving. This is the M best
mode problem. For this part of the talk, I'm going to assume that somehow there
is a black box that solves the map inference problem. In the second part I'll go
into how we solve that. But given a model, there is some algorithm that solves
the map inference problem. But this new problem is almost like the map
inference problem, except there's an extra constraint.
So you can't exactly plug in your existing algorithms. What you can do is dualize
this constraint, which means that instead of handling it in the primal, in constraint
form, you add it to the objective with a Lagrange multiplier. That means instead
of enforcing a hard constraint, you pay a penalty of lambda every time you
produce a solution that's not K distance away -- every time you produce a
solution that's less than K distance away, you pay a penalty of lambda.
So this is the dual problem. The reason why we do this is because this is now
looking like something we know how to handle. And --
>>: The previous one -- so you're actually penalizing -- the further away you are
from K, the more you like the solution, right? It's not just an inequality constraint.
>> Dhruv Batra: Yes, if you search over the best lambda, you will converge to
the things that if there is something that's -- yeah, I'll get to that in a second. But,
yeah. The reason why we do this is because this objective function now starts to
look like things we know how to handle.
Right? This is an additional term that we know how to handle. In the literature,
if you've seen this, it's the loss-augmented minimization problem. We handle this
every time we have to train SVMs or structured SVMs: you add the loss and
minimize the original energy.
So this is the problem I'm calling M best modes. There are still two things I
haven't told you about. In addition to the primal variables, there's now a new
dual variable, lambda. And delta, the diversity function, which I haven't specified
yet.
And you can think of it this way: for each setting of lambda, which is the dual
variable, this relaxation provides you a lower bound on the original problem that
you were solving. And to get to Rick's question, you can try and maximize this
lower bound.
What does this function look like -- so I give you lambda, you minimize this; what
does it look like as a function of lambda? You can easily show that it's a
piecewise-linear concave function, so you can maximize it over the space of
lambdas.
So let's see. What's the delta? Let's nail that down. What's the diversity
function? There are a number of nice special cases. If your diversity function is a
0-1 function -- if you exactly match the map your distance is 0, and if you're
anything else, even one pixel labeled differently, you're a different configuration
at distance 1 -- then this is exactly the M best map problem.
So we generalize that. There are some other generalizations that I won't talk
about. We allow a large class of delta functions, and I'll talk about one of them
specifically today, which corresponds to Hamming distance. Here's the delta
function that I'm going to be working with. It says: you sum over each node in
your graphical model, each node in your graph, and you take the dot product of
mu I with the mu I of the map.
What does this mean? Mu I, remember, is a Boolean vector that encodes what
label you took at node I. Mu is your new variable. So this is just counting how
many pixels took the same label as last time.
If you take a dot product of two Boolean indicator vectors, you get a 1 only if they
agree; otherwise you get a 0. So this is essentially Hamming distance -- strictly, it
counts agreements rather than disagreements, but that's the same thing up to a
sign and a constant. Why is this interesting?
Well, it's interesting because if I now look at the problem that I was trying to
solve -- my original energy function minus this Lagrange multiplier times the
diversity term -- this term is now linear in mu. It's a linear function of the variable
that I'm minimizing, and it nicely decomposes across nodes. So all that happens
is that my original node potentials get modified by lambda in one entry.
Mu I of the map is a Boolean vector of length K; only one of the entries is set to
1, which is the map entry. So the cost for the map label just went up by lambda.
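Here is a minimal sketch of that observation in code. It assumes a generic black-box map solver (`map_solver` is a placeholder, not a specific library), unaries stored as a nodes-by-labels cost array, and the Hamming-style diversity term; under those assumptions the second-best mode only requires bumping the unary cost of each node's previous label by lambda and re-running the solver.

```python
import numpy as np

def perturb_unaries(unaries, prev_labeling, lam):
    """Return a copy of the node costs in which the label chosen previously
    at each node is made lam more expensive (the Hamming diversity term
    folded into the unaries)."""
    perturbed = unaries.copy()                 # shape (num_nodes, num_labels)
    nodes = np.arange(unaries.shape[0])
    perturbed[nodes, prev_labeling] += lam     # repeating the old label now costs extra
    return perturbed

# Usage sketch. map_solver(unaries, pairwise) -> labeling is the assumed black box
# (graph cuts, TRW-S, alpha expansion, ...); its structure is untouched because
# only the unary costs change:
#
#   labeling_1 = map_solver(unaries, pairwise)                                    # the map
#   labeling_2 = map_solver(perturb_unaries(unaries, labeling_1, lam), pairwise)  # 2nd mode
```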
>>: Just going back to your original formulation, the mu IJs are independent of
your mu Is? They're additional extra variables?
>> Dhruv Batra: Yes, but there are some constraints that I've hidden away. The
mus are not actually independent; there are constraints that tie mu IJ to mu I.
Those constraints exist. I've sort of hidden them away because they weren't
relevant here.
>>: Mu is always an integer?
>> Dhruv Batra: In the integer programming formulation it will always be an
integer.
>>: [inaudible].
>> Dhruv Batra: In order to solve it -- so that's the black box that I hid away. In
order to solve it, you relax it to an LP relaxation.
>>: The constraint that forces one bit to be one is also in that family of
constraints?
>> Dhruv Batra: The constraint that forces only one bit to be one?
>>: To be one.
>> Dhruv Batra: Yes, that's also a constraint that's hidden away in this mu. So
there are constraints -- forcing one bit to be one is sort of a normalization
constraint; after you relax it, it will become a normalization constraint. And the
constraints that they agree with each other will become marginalization
constraints. But they both exist.
>>: One last thing. So you just talked about second best mode.
>> Dhruv Batra: Yes.
>>: But for third and fourth, do you have different lambdas for different ones,
or --
>> Dhruv Batra: Yeah. So in the primal case, what you'll have to do is add
different inequalities: find me the next best, which is at least K away from the
first, the second, and the third. So you can either have different Ks or you can
have some standard setting of Ks.
So that would say I just plop down these Hamming hypercubes that disallow
some solutions.
In practice, this is going to be a question of how do I set K, and I'll get to how we
handle that in practice.
>>: Is lambda single across all of them? [inaudible].
>> Dhruv Batra: So there's a different Lagrange multiplier for each inequality.
>>: And you optimize.
>> Dhruv Batra: You'll have to optimize over those, yeah, right. All right. So this
is nice: all I have to do now is modify some potentials. So if node I was set to
label 1, then the cost of node I taking label 1 just went up by lambda.
And I just rerun the same machinery that you have for map. So if you had a
black box, that black box still runs; I just have to perturb the potentials a little bit.
Even better, since I did not modify the edge potentials, theta IJ, if there was some
structure in the original edge potentials, I preserve that structure. So if your
original problem was a pairwise binary submodular minimization problem, for
which you have exact inference algorithms, this new modified problem is still
pairwise binary submodular.
So if there was an exact algorithm for the first problem, then there's an exact
algorithm for the second. And I think that's the most interesting part.
If you have invested some work to extract one solution out of your model, this
lets you extract multiple solutions out of your model. So what does this look like
in practice? Here's an image, and somebody scribbled on it: one color of scribble
indicates foreground, the other color indicates background.
Here's the ground truth on these images. This is what presumably you'd like to
extract. We encoded this with a pairwise binary submodular Potts model: color
potentials look at the colors of the foreground and background scribbles to set up
the node potentials, plus a Potts model on the edges.
This is the map that you extract from these scribbles on these images -- the best
segmentation. And this is the exact second best map, the literal definition of
second best. Does anyone even see the difference between the first and the
second best?
>>: Only in the top row.
>> Dhruv Batra: Yeah. Because there are these pixels here that get turned on.
The others are different as well -- maybe there's one pixel that turns on.
So I wasn't making those figures up. This really happens in practice: you run
your entire algorithm again and the second best solution is essentially useless.
This is the second best mode that we can extract. So in the first case, we're able
to extract the other instance of the object -- this entire thing was absent in the
first and second best.
In this one, we're able to fill in the arm of the person that was cut off. Right? All
by forcing Hamming dissimilarity.
>>: The third row there's not much difference.
>> Dhruv Batra: Right, the third row there's not much difference.
>>: There must be at least K pixels that are different.
>> Dhruv Batra: Sorry. Yeah. So the way we're solving this is, instead of setting
K, we're setting a fixed lambda, which means that you don't actually enforce
diversity; you trade off diversity against the original energy with some fixed
weighting term. Which means that if your model strongly believes in the original
solution, you will still get the first solution back.
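As a follow-up to the earlier sketch, here is roughly what that fixed-lambda procedure looks like when extracting M solutions greedily: each previously found solution adds its own Hamming penalty to the unaries before the black box is re-run. This reuses the hypothetical `perturb_unaries` and `map_solver` from the sketch above and is not the exact code behind these experiments.

```python
def diverse_m_best(unaries, pairwise, map_solver, m, lam):
    """Greedily extract m diverse solutions with a fixed diversity weight lam."""
    solutions = []
    current = unaries.copy()
    for _ in range(m):
        labeling = map_solver(current, pairwise)           # black-box map call
        solutions.append(labeling)
        # penalize agreeing with this solution in all later rounds
        current = perturb_unaries(current, labeling, lam)
    return solutions
```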
So here's a second experiment that we did. This is the Pascal segmentation
challenge. For those unfamiliar with it, this is a large international challenge
that's been running for a few years now.
The organization running the challenge releases the images, but not the test set
annotations. There are 20 categories in this dataset, plus a background class.
For each image, what you're expected to produce is this. This is the ground
truth: you're expected to produce a labeling of one of these categories, or
background, which is shown in black here, for all of the pixels. Right?
And what we did was take an existing model for this problem, the hierarchical
CRF model of Ladicky, Kohli, and Torr. They developed this over a sequence of
papers. If you haven't seen this model before, it's just a hierarchical CRF: there
are some potentials on pixels, neighboring pixels are joined with Potts-style
models, and there are superpixels and some scene-level things. It's a
hierarchical CRF.
It took them a couple of papers to develop a good inference algorithm for this
model, and all we had to do was modify some terms and run their same
algorithm again. Right? So here's what I'm showing: an image, and the ground
truth -- blue is boat; this is sheep. And this is the best labeling, the map, under
their model.
So in the first case there's this large region that's labeled as boat, so the
segmentation is wrong -- the support is wrong. They've labeled it boat, but the
support is wrong. It's one of the mistakes this model tends to make: whenever it
finds evidence for an object, it tends to smear it across the image. It labels
everything as sheep.
In this case it only found one of the instances of sheep. What we did was extract
five additional solutions in addition to the map. Right? I'm showing you the best
of those and the worst of those -- best and worst according to ground truth. So
we are checking the ground truth to see which one's best and which one's worst.
And in this case what happens is that you're able to accurately crop out that
segmentation. Anytime you change the segmentation -- this is Hamming -- you
get rewarded for being different from the map. So in this case you're actually
able to recover the original support of the object, get an accurate segmentation.
In this case you're able to get the other instance of the object: not just one
sheep was present, but a second one as well. Right? These are examples where
we find large improvements in the additional solutions that we extract.
These are examples of medium improvement, where again this is the ground
truth and this is what the map says. In this case the horse and rider are both
labeled as horse in the map. We extracted the rider out, and lost some part of
the horse.
In this case everything was labeled sofa. We were able to extract the person out,
and lost some bit of the sofa here. Right? And in this case we extract an object
out here from the map. And these are examples of cases where it doesn't make
a big difference to run these additional solutions.
>>: At the beginning of the talk you mentioned approaches that just run
unsupervised segmentation and generate 10,000 hypotheses -- you can do the
same thing, take the best relative to ground truth.
>> Dhruv Batra: Yes.
>>: Did you do that?
>> Dhruv Batra: Yes.
>>: Does that do better? Or even with 10,000 random segmentations, does it not
do as well as your approach?
>> Dhruv Batra: Yes. So you're asking whether we did a reranking on these?
Yes, we did. So what I'm showing here are the empirical results. This is the
intersection-over-union criterion: you produce a mask, there's a ground truth
mask, and you measure intersection over union -- that's how accurate the mask
is. This is the average over all categories.
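For reference, a small sketch of the intersection-over-union criterion as described (per category, then averaged; the Pascal-specific bookkeeping such as ignore labels is left out):

```python
import numpy as np

def mean_intersection_over_union(pred, gt, num_classes):
    """Average IoU over classes for integer label masks of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```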
Here's how well just the map under this model performs: it's just under 25. M
best map in this case is a nonstarter. That algorithm is actually more difficult to
implement than M best mode, because it's not just one more map computation;
it's order N more computations, where N is the number of pixels. We did a
back-of-the-envelope calculation: it would take us ten years without
parallelization to get the additional solutions. So we just don't have that, and in
practice it doesn't seem to make a difference anyway. We did implement a
baseline which takes the most confused pixels and flips them to their next best
label.
Here are the oracle numbers for M best modes: extract five additional solutions
in addition to the map and take the best one by looking at the ground truth. So
that's cheating -- this number is not a valid entry to Pascal, because you can't
look at the ground truth. But it tells you how much signal there is in those
additional five solutions: you go from 25 or so to over 36.
And to give you the scale of these numbers, the state of the art on this dataset is
just about 36. Right? So a model that was nowhere near state of the art is now
beating state of the art -- well, beating in the sense that there is such a solution
somewhere in these five or six. So the goal now is: can we rerank these
solutions? Can we take the six and run a reranker? We have an initial
experiment on that already. We've taken the reranker from that other work and
applied it to these six. It already improves on the map, so we're able to do better
than the map, but we haven't yet realized this full potential.
So we're still tweaking that reranker to see if we can do better. Intuitively, it feels
like it should be an easier problem than reranking 10,000 segmentations,
because now it's just six. One out of six reranking is much easier, and we hope
this number --
>>: For the 10,000, it would be great if you had a graph -- from 10,000 down -- to
see how many random --
>> Dhruv Batra: Ranking.
>>: -- how many you need to get up to --
>> Dhruv Batra: Yes. So I don't know. This line is them reranking 10,000. So
this is them. The state of the art is them, but them reranking 10,000. I don't
know if there's a curve for that ranking as you go fewer and fewer.
>>: Were you using their ranker or were you using your own?
>> Dhruv Batra: Right now we're using their ranker. We're adding new features
and retraining it on our own. This is not the latest that we've been able to
achieve, but off the shelf, if we take their ranker and run it on ours, this is what it
does.
>>: But there's some number between their 10,000 and your 5, right?
>> Dhruv Batra: Yes.
>>: Whether it's forcing them to generate fewer or forcing your algorithm to
generate more -- the two curves, your best K and their random K, or whatever
they're doing -- do they converge?
>> Dhruv Batra: I don't have access to -- I can't really generate -- sorry, I'll have
to check their paper to see if that curve is available.
I would expect that they really need many, many segmentations, because if you
only do a coarse job -- all they do is run s-t min cuts: they say, I assume this
pixel is foreground, I assume this other pixel is background, and I run a
segmentation on this, and they do this for many placements of the source and
sink.
It's a completely bottom-up procedure, and I'm fairly certain you need a lot of
placements of S and T to do this.
>>: And this applies to different MRFs than these, right?
>> Dhruv Batra: Yes. So the thing I'm interested in, and that we have been doing
with this, is pose estimation problems, where the goal is not segmentation but:
where is the arm of this person, where is the head, where is the leg?
So this is a different MRF. It's actually a tree graph, but your labels are locations
of these parts.
And there has been some preliminary work on trying to find multiple hypotheses.
We have the implementation of [inaudible] for the best case, and we're modifying
his code and finding improvements on that as well. But there are a lot of
applications, I think, that can benefit from this. I think this can really improve
parameter learning as well. The way we train our models right now is we run a
loop: you use your current setting of parameters to ask the model what's the best
segmentation it believes in. If it's not the ground truth, you modify the parameters
a little until the ground truth is what wins. And if you had access to not just the
map but also some other modes that it believes in, you could converge much
sooner. Right? Because you're extracting other things at each step.
And there are connections to rejection sampling as well. But just to summarize
this part of the talk, here's what I would like you to take home, right? You're
working on some problem -- think about the problem that you're interested in,
whether it's ranking documents in some retrieval setting or structure from motion.
There is some discrete aspect to it, right? And the key question is: are you happy
with the single best solution that you have? If you don't have perfect models,
then you're not happy with it.
And if you're not happy, then you should look at extracting multiple solutions out
of your model. And we're hoping that this can help additional applications as
well. Are there any questions about this part before I move on?
Okay. So the next few things will go much quicker, because I'll go into fewer
details. Here's the second part of this talk. I want to introduce a notion and give
a high-level picture. I won't go into all three of the things; I'll just give the
high-level idea and go through it.
We came up with this idea of focused inference and we applied it to a few
different applications. I will just tell you what focused inference is.
So I told you there's this integer program that we're solving. In the first part of
the talk I assumed there's a black box that can solve it. Typically these things are
solved with, for example, linear programming relaxations. So you are trying to
minimize a linear function over some discrete variables. Right? That was a linear
function, so I can take all your parameters and make them one really long vector,
take all your variables and make another really long vector, and the objective is
just a dot product.
And the constraints that I had hidden away, that I didn't talk about before, are
also linear constraints. So this is the exact form of optimization problem that
we're studying: a linear objective function, linear constraints, Boolean variables.
And what people study is an LP relaxation of this problem, where you replace the
Boolean constraints by 0-to-1 constraints. This is a continuous relaxation. What
it essentially involves is replacing the convex hull of the solutions by an outer
bound.
This outer bound looks like a more complicated structure in 2-D, but it's actually
a simpler structure to optimize over, because this is a very high-dimensional
space. In 2-D it looks more complicated, but it's actually much simpler. The way
we solve these linear programs is by message passing algorithms. I won't give
you the details, but what it involves is that you localize each part of the graph,
solve each part exactly, and pass messages: here's what I think my neighbor
should be. And that takes you to the linear programming optimum -- you can
interpret these messages as dual ascent algorithms. But in a sense this is a
highly inefficient procedure.
What we wanted to do -- this is what is done right now -- was improve this
procedure, because our observation was that data does not look like this. We
don't have complexity at all scales, at all nodes, at all relative locations.
Our data really looks like this: there are regions of complexity, but there are
large regions which are essentially simple. I can look at the local potentials and I
know what the answer is going to be.
If I had to give you an analogy, I would say that the first approach, passing
messages everywhere, is sort of like a carpet bombing approach to inference:
indiscriminate deployment of computation, everywhere. And what we would like
is a more focused deployment of computation, where you find the important parts
of the problem and only focus your computation there.
So here's the key hammer, and I'll explain it really simply. We're going to solve a
linear program. Corresponding to the primal linear program there's a dual one --
that's LP duality theory, the Lagrange multipliers of your problem. We know from
duality theory that the primal decreases and the dual increases as you spend
more computation, and each setting of the dual gives you a lower bound. If the
two meet at any point in time, you know you've converged; you've solved the
problem exactly.
Moreover, because these are linear programs, the complementary slackness
conditions are exactly the conditions you can check for convergence. When
those conditions are satisfied, you know that you have solved your original
problem. Right? In our work, what we essentially did was, instead of using them
to check for convergence, we use them to guide where messages should be
passed.
We distribute them at various locations in the graph, and they tell you which are
the important regions of the graph. Right? These conditions distribute nicely
over the graph. And that's the most important concept here. That's the key
hammer.
And in a couple of papers, what we were able to show is that we can say some
precise theoretical things about this. We can say that it is a generalization of the
complementary slackness conditions.
We can say that it's exactly a notion of a distributed primal-dual gap: at any point
in time you have a best primal and a best dual, and these local scores sum to
that primal-dual gap. And they're really cheap to compute -- constant time per
edge to compute the score. It's not like you have to spend a lot of computation
on the scores that are supposed to save you computation. It's really cheap.
So we applied this idea to a few different things. One was distributed message
passing: how do we speed it up? In this case there's an image and a current
segmentation, and you update the model somehow.
In this case, let's say a user indicates that all the white pixels here are where
your model should be updated -- the user says, I think this is background. Or
you might have data streaming in for the next frame, so the model has changed
everywhere.
And you want the next segmentation. This is the key result here. This is what
would happen -- the dual and the primal -- if you were to pass messages
everywhere to go from here to here.
And this is what happens when you use our method to find the important parts of
the problem and only pass messages there. You essentially converge much
sooner -- and notice the X axis is in log scale. Here we're converging 350 times
sooner.
What is this baseline that I'm talking about? That's the TRW-S implementation
of Kolmogorov. And if you've played around with that implementation, it's an
extremely efficient implementation and not an easy one to beat.
In this case we're able to beat it by this big margin, and in the other case we're
also able to beat it by a big margin. The reason we're able to beat it is precisely
this figure right here.
It's showing you where we passed messages. So the white pixels are where
potentials were updated -- a small number of updates here, a large number of
updates there -- but it only passes messages where things really matter, where
the segmentations are changing between the two frames.
And that's exactly why you converge so much sooner.
>>: Why does your purple graph start off higher than the red graph?
>> Dhruv Batra: I think what's happening is that -- so this is log scale, and I'm
zooming into the region that's closer to convergence. It might be worse off
initially but it beats the baseline to convergence. And you can alternate between
the two: you can start off here and then at some point switch to the other
algorithm.
This algorithm, the baseline -- what it's doing is passing messages horizontally
and then vertically. So initially it makes big improvements, but later it gets stuck
in this process where it has to pass lots of messages. Our method is finding the
edges where messages need to be passed and only passing them there. So
initially it doesn't make a lot of improvement, because it keeps passing messages
locally -- some segmentations have to change.
So somehow this node has changed; it needs to let its neighbor know, and that
neighbor's neighbor know. That takes a lot of time initially, but it converges much
sooner because you're only focusing on that.
>>: The classic, at least that style of TRW, has a horizontal and a vertical sweep.
Have people looked at hierarchical, pyramid-based -- whatever you want to call
them -- techniques where the propagation looks like it's happening --
>> Dhruv Batra: At multiple scales?
>>: Yeah.
>> Dhruv Batra: Not that I've seen. I mean, scheduling, which is what I'm
presenting, certainly isn't new. People have looked at scheduling before. And
some people have looked at: let's see where the messages are changing more.
So if the message you'd send this time is essentially the same as the one you
sent last time, then maybe you shouldn't send this message; maybe somebody
else should send theirs.
I have not yet seen hierarchical -- so I know Pedro looked at this a little bit, but
that's essentially constructing a hierarchical graph. You have to construct a
different graph.
>>: If you're going to exploit hierarchy, you have to construct a different graph,
because there are no connections to jump across -- not until you introduce them.
If you have sort of an auxiliary graph that's supposed to mimic a lower-resolution
version, in whatever sense, it could be used as a hint graph, right? Sort of
propagate stuff up at a coarser level and back down, and then my LP at the
finest level can move forward faster because of that information. So raster-order
propagation is extremely efficient if you have a tree, right? That's optimal.
>> Dhruv Batra: That's optimal.
>>: But in general it's not a bad heuristic. In the linear-system-solving literature,
that kind of sweep-based iteration is decades old, but proven to not work nearly
as well as multigrid, right?
>> Dhruv Batra: Uh-huh.
>>: And now adaptive multigrid, multigrid that adapts to the intrinsic complexity.
And it's something I'm very interested in. I've only worked on the linear case,
which would be equivalent to quadratic energies, the Gaussian MRF versions --
that's all I've worked on. I've been dying to start working on an analog for
general inference problems.
>> Dhruv Batra: Yeah. I think that would make a lot of sense, because that way
you can make large regions flip their labels by just going one layer up.
>>: What I'd be worried about with this method, when you have a large region, is
that it wouldn't choose that region to actually update messages, because each
edge only has a little bit of -- a little bit of -- is that a problem?
>> Dhruv Batra: So one thing I showed was that the scoring function -- the way I
wrote down the LP, I said, look, I can score every edge -- our formulation
extends to scores over regions.
So you can compute scores over large regions as well. So even if every edge
has only a little bit of score, the sum might still be the most important part. So
you might decide to go up, if you had written down a hierarchical --
>>: It would be higher.
>> Dhruv Batra: But if you haven't written down a hierarchical graph then you're
kind of stuck.
>>: It would always be kind of below that threshold and never --
>> Dhruv Batra: Yeah. But in order for them to be below a threshold there has
to be something else that's always winning -- some big edge that thinks, here's
where the real problem is. Yeah. All right. So these things really help.
You can make things really fast. In fact, we applied the same idea in another
direction. I said, look, you can compute scores on these edges, and I can tell you
which ones are the important edges.
But that was all assuming your original LP was a good LP to begin with -- a good
relaxation. This is an NP-hard problem, so a lot of cases are going to look like
this: the best lower bound you can extract is nowhere near the best primal that
you can extract.
And in our formulation, you know, there's not a single LP. There are tighter and
tighter LPs, because you can add more constraints to the original linear program.
In our formulation, the first LP was saying that edges are consistent with nodes --
that the labeling you give at the edges is consistent with the labeling you give at
the nodes.
You can write down tighter LPs by saying that triplets are consistent with edges --
that the labelings you give at three nodes are consistent with the labelings you
give at the edges.
>>: The original pairwise constraint is that mu IJ has the right marginals?
>>: Yes.
>>: What's the triplet version?
>> Dhruv Batra: For a triplet you would introduce a new variable, mu IJK. It
would be a K-cubed-long Boolean vector, and you would force it to be consistent
with the edge variables. Your original energy is still pairwise, so you don't care
about optimizing over mu IJK -- its coefficient in the objective function is still 0 --
but it plays a role in the constraints.
And that tightens the LP, because now there are more constraints in the LP. But
the question here is: while we could reasonably think about adding all edges to
the original LP, we can't think about adding all triplets -- N nodes, lots of triplets.
And you can think about tighter relaxations on four nodes. How many of these
things are we going to add? So if there was a way to score --
>>: Why can't you add all triplets? If it's originally a mesh graph, it's still order N
squared --
>> Dhruv Batra: If you restrict yourself to only the triplets that are present in the
original graph, then perhaps you can think of adding them all.
But long-range triplets can also tighten your LP, which might be interesting to
add. You can include edges that don't exist in your graph but that can still help
tighten the LP. Then you have to consider all N cubed, or N choose 3, of them.
>>: Philosophically you're making a big jump, because the original thing was
that you encode the problem as a continuous optimization or integer program
where, assuming the constraints are met, it's exact, right? Now you're saying
let's just throw in lots of extra constraints so that the solution proceeds faster,
right?
>> Dhruv Batra: No, but even with these constraints it is still the original
problem. Think about what is happening -- think about it this way. What is the
worst I can do? I introduce a variable that depends on all the pixels, so instead
of being K squared or K cubed it is K to the N.
What are the constraints I can add to this? That it sums to one over all possible
labelings -- you choose only one labeling -- and that each of these labelings is
consistent with the sub-labelings that you have.
That type of constraint would still be a valid constraint for your LP, right? In fact,
if you threw that in for all the cliques in a tree decomposition -- if you wrote down
a junction tree of this graph and added those constraints for its cliques -- then
your LP is guaranteed to be tight. And we can show that: in the worst case you
need to add constraints over cliques as large as the treewidth for the LP to be
tight.
So we're essentially moving towards that by adding more and more constraints.
And the thing to think about is that we can't add all triplets or all sets of four, and
so if we could somehow score these things, that would be helpful: which ones
should we add to the relaxation? When you add a constraint it will help,
obviously, because it tightens the LP. But can you, before adding it, know how
much it is going to help, or have an estimate of how much?
>>: When you say you can't add all of them -- if the triangles consist of two
edges from the original graph plus an extra edge, and you start with a regular
planar grid, there's only a constant number of such triangles, right? Why would
you want to take three nodes that are all over the map and make up a
hypothetical triangle of those three? In other words, just using locality is almost
as good as using something based on local gaps, right?
>> Dhruv Batra: Right. Right. So you can enumerate -- what you're saying is it's
really easy to enumerate over all the local triangles. Yes, you can do that. But
what about when you go much higher -- four, five, six? Then that space becomes
much larger, and even enumerating over it becomes complicated.
And what I'm trying to say is: is there a way to construct these cliques where I
can localize where my primal-dual gap is coming from? If all the edges in this
neighborhood have a little bit of score, can I just add this entire region as a
clique into my relaxation? And that's what we looked at here.
And that's essentially what we did. For this problem there was an original image,
and we had access to a blurry, noisy version of it. We set up an MRF problem
for denoising and deconvolution. This is the best primal we could extract out of
the pairwise linear program -- what I was showing before.
This is the best primal I can extract from the triplet LP; if I add triplets into this, it
becomes tight. And this is the actual integer map. So that's fine; we can extract
this from this.
Here's the primal-dual gap decreasing as a function of time as I add more and
more constraints. If you add constraints randomly -- not random triplets, but you
enumerate over them and randomly throw one in -- then here's how the
primal-dual gap shrinks as the relaxation gets tighter. And here's if you choose
using our score: it converges to 0 much sooner. On some of these problems
you're essentially three times faster.
At the end, you know, this one has converged and this one is not even close to
convergence. That's the idea.
>>: What's the intuition -- when you watch it select triplets, what is it typically
selecting?
>> Dhruv Batra: Typically it's selecting things that are at boundaries, object
boundaries. So here it might select some triplets here. So in a sense it is using
the locality of the problem, but it's getting at it one step above edges.
>>: Right. So is it locality based on the actual smoothness graph in the blurry
image, or in the solution? Which is it looking at -- does it tend to look at the
current solution and that's what drives it, or does it look more at the original
input?
>> Dhruv Batra: It's looking at the best solution it can extract and the best dual
value it can extract, which is a function of both. So it's looking at both: the best
primal and the best dual.
>>: You're reasoning about triplets but not all subsets --
>> Dhruv Batra: In this case I was reasoning only about triplets. But the
formulation extends to arbitrary-size subsets.
>>: Can you do them all together, like 3s and 4s?
>> Dhruv Batra: You can construct them from their subsets. So you can score --
at any point in time you can only score the things that exist in the relaxation.
So if only edges exist in the relaxation, I can compute scores on edges, and by
summing up the scores of the edges in a triplet I can compute the score of that
triplet.
So if I had to compute a score for a set of 5, I would have to look at all
five-choose-two of the edges that exist now and sum their scores up. Does that
make sense? All right, let me try to finish up quickly. I won't go into the last part,
but we also took an algorithm, alpha expansion, which on the surface looks
nothing like an LP relaxation -- it looks like a greedy algorithm -- but people have
interpreted it as solving that same LP. And we were able to use this idea to
speed up standard alpha expansion by factors of 2 to 3 as well.
And that's all I'll say about that. So in general I'm interested in extending this to
QP relaxations; I talked only about linear programming relaxations. A lot of the
methods I described are natively parallelizable, so one of the things I want to do
is build parallel implementations. There's this really nice work coming out of
CMU called GraphLab, which lets you work at a really high level: you specify
your algorithm and it does all the low-level parallelization, for multicore and for
distributed settings. And there's a good chance I'll be working with Carlos
Guestrin at CMU and taking my graphs to GraphLab.
And I want to look at focused learning -- trying to do scheduling for learning
problems -- and I think there's some interesting scope there. Okay. Let me show
some teasers, and I think we should be done in a minute or so.
In my Ph.D. thesis -- I think a lot of people here have seen this before -- we
worked on this problem of interactive co-segmentation, where you have a large
collection of images with the same object appearing in them. You can build a
system -- we built a system -- where someone can scribble on a single image or
a few images, saying this is the foreground, this is the object I'm interested in,
and the system goes ahead and segments that object not just in that image but
in the entire collection of images.
In this case you have to look at all the images, and so we also looked at active
learning formulations where the system tells you where to go next -- where it
needs to see scribbles next.
This was mostly for building an automatic collage: you scribble once and build a
collage. But what we were also able to do, with Adarsh Kowdle, who was an
intern here that people are familiar with, was extend this to volumetric
reconstruction using a shape-from-silhouette algorithm. You use a standard
structure-from-motion pipeline to find the camera poses, back-project the
silhouettes into 3-D, and carve out a volumetric reconstruction.
And this is the cutest part. I'll skip this video. This is the cutest part: Adarsh got
hold of a 3-D printer, so he was able to actually produce these little tiny
structures from this. These are printed on a 3-D printer using our algorithms.
So from just a couple of images and your scribbles, it was able to produce these
3-D models, printed on a 3-D printer as well.
We've worked on other problems like single-image depth estimation; maybe we
can talk about that if we end up discussing it. We have a really nice algorithm
coming out that's the first max-margin learning algorithm for Laplacian CRFs --
models that are effective for this problem but that haven't been looked at before
because the algorithms didn't exist.
>>: What's a Laplacian CRF?
>> Dhruv Batra: A Laplacian CRF is a CRF where the edge potentials have L1
norm terms. When you have L1 norm terms it's not a log-linear model, and some
of the existing algorithms don't work because they make a log-linear assumption.
So we came up with the first approximate max-margin algorithm for training
these things.
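As a rough sketch of what such an energy looks like -- my notation and a guess at the general form, not the exact model from that work -- with continuous labels y_i (e.g. depths), unary terms, and L1 pairwise terms:

```latex
E(\mathbf{y}) \;=\; \sum_{i} \theta_i(y_i) \;+\; \sum_{(i,j) \in \mathcal{E}} w_{ij}\,\lvert y_i - y_j \rvert ,
```

the L1 terms on neighboring values being the Laplace-distributed pairwise factors the name refers to.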
In the past I've also looked at some retrieval problems, where you have an
image and you're trying to find out what the content of the image is: I give you an
image, you give me a textual description. We built an algorithm that first
segments, then searches for images with respect to the segments, and then does
some textual analysis on those to answer your query, essentially.
And there's been some work on similarity learning as well. So with that I'll stop.
Here are the people involved in some of these things, and that's it. Thanks.
[applause].
>> Larry Zitnick: Any additional questions?
>>: Interesting talk.
>> Dhruv Batra: Thanks.