>> Larry Zitnick: Okay. It's my pleasure to introduce Dhruv. He's currently a
student at CMU, fifth-year student of Tsuhan Chen who moved to Cornell. So
Dhruv is actually at Cornell right now finishing up his Ph.D.
He has done a lot of great work. I think most of it has been in image labeling.
Some in segmentation and also in activity recognition. Today he'll be talking
about his recent work on MRF inference, part of which he presented at NIPS and
part of which is a submission to CVPR this year.
>> Dhruv Batra: Thanks, Larry. Thank you for coming. First of all, I must admit I
feel a little bit like Morgan Freeman in The Bucket List. It's one of my things, give
a talk at MSR using a Mac - check. So having done that, here's what I'm going to
be talking about today.
This is the work on outer-planar decomposition. This is work on MRF inference
that I'll mostly be focusing on. This is joint work with Andrew Gallagher, who is at
Kodak now; Devi Parikh, who is at TTIC; and our advisor, Tsuhan Chen. So all of
us have been at CMU at some point or another.
Time remaining, I'll also talk about some applications to interactive
co-segmentation, which has been some recent work, and an application to 3-D
reconstruction, which I think will be pretty relevant here. But the
focus of the talk is going to be the first one.
So let me begin. Let me try and convince you that a number of problems in
vision can be formulated as discrete labeling problems. The classical two-class
segmentation problem, in which, either unsupervised or via some sort of
semi-supervised information -- you have some scribbles -- you're trying to label
every single pixel as foreground or background. Those are the two classes.
Your labeling sites are pixels; your labels are foreground and background. This
could be, of course, a multi-class segmentation problem -- semantic segmentation on
the MSRC dataset, where for every single pixel it's not foreground and
background now; you're trying to label it with one particular class that you see in
this dataset.
This could also be a geometric labeling problem. This was work done by Derek
Hoiem and also Ashutosh Saxena at Stanford, where, for every single pixel,
you're trying to label a geometric class -- a rough
geometric class, saying it's a ground plane, sky, or a vertical surface
facing left or right. So rough geometric information. But once again it's a
discrete labeling problem.
There's also been work in name face association which can be thought of as a
labeling problem. So your labeling sites are these faces that you found in images
and your labels are names that you have extracted through captions associated
with those images.
There's also been work in our group by Andrew Gallagher, who has tried to do
this work where you have an image, you have certain labels, which are image-level
labels, and you're trying to propagate them to face-level labels.
A priori, this transfer is ambiguous, of course. You have all possible answers.
But in his work he tried to look at age classifiers and at U.S. Social
Security data to find first-name priors. So he found Mildred was a really
popular name in the 1940s and less popular later on. And if you have some sort of
age classifier running on the face set you can do a better job trying to assign
names to faces.
But again at the heart of it, discrete labeling problem. Classical vision. This
hardly needs any explanation here. Stereo is the disparity labeling problem. In
optical flow, instead of a one-dimensional labeling problem, you have a
two-dimensional one -- a discrete motion flow labeling problem.
Denoising, which on the surface doesn't seem like a labeling problem: you're
trying to assign every single pixel one of the labels 0 to 255. That's your
label space.
So all these problems have of course been well studied under the framework of
Markov random fields, MRF. Hardly needs an explanation here, but just so I can
get my notation right we're going to be working with a set of discrete random
variables.
There is a pairwise MRF or energy function that we're going to describe on this
MRF, and we're going to be interested in MAP inference. Given a discrete
energy function which is composed of node energies and edge energies, I'm
going to want to minimize the energy function and find the best labeling under
this energy function.
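To make the notation concrete, here is a minimal sketch in Python of the pairwise energy being minimized, plus a brute-force MAP solver that is only feasible on tiny toy instances; the names and data layout are illustrative, not from the talk or paper.

```python
# Sketch of E(x) = sum_i theta_i(x_i) + sum_(i,j) theta_ij(x_i, x_j).
import itertools

def energy(labels, node_energies, edge_energies):
    """labels: tuple of ints; node_energies[i][s] is theta_i(s);
    edge_energies[(i, j)][s, t] is theta_ij(s, t)."""
    e = sum(node_energies[i][s] for i, s in enumerate(labels))
    e += sum(theta[labels[i], labels[j]]
             for (i, j), theta in edge_energies.items())
    return e

def brute_force_map(num_nodes, num_labels, node_energies, edge_energies):
    """Exact MAP by enumeration -- exponential, for tiny examples only."""
    return min(itertools.product(range(num_labels), repeat=num_nodes),
               key=lambda x: energy(x, node_energies, edge_energies))
```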
So given that's the problem that we're interested in, it's well known that this
problem is in its full generality NP-hard. Faced with an NP-hard problem, you have
two choices: solve a sub-class exactly, or come up with an
approximate algorithm that works on the entire class. Or there is a third option,
that you solve the P equals NP problem, but we'll leave that to the next lecture.
So given that, if you're trying to come up with an exact
algorithm for a certain sub-class, there has been work done early on which showed
that if your graph is a tree, we can solve it exactly. In vision, of
course, there's work done by [indiscernible]: if your energy functions are
submodular, then we can solve this problem exactly, irrespective of what the graph
structure is like.
More recently, there's also been work done at NIPS last year showing that if your
graph structure is outer-planar, you can solve it exactly. I'll describe
outer-planar graphs and what this solution is in a second.
Coming to approximate algorithms, the first step, of course, was let's take BP
applied to a problem which has loops. And that's, of course, the naive loopy
belief propagation. There's been work on tree-reweighted message passing by
Martin Wainwright, Kolmogorov, and Komodakis recently, and this is where our work
on outer-planar decomposition, or OPD, will fit in. It's going to be an approximate
algorithm that will work for the general case. And I'll point out the connections
between outer-planar graphs and outer-planar decomposition in just a second.
Interestingly, in our paper, we point out that a lot of these approximate inference
algorithms can be thought of as decomposition methods. So you give me a
general graph -- you give me a problem on a graph -- and what I'm going to try and
do is break it down into tractable sub-components I can
solve. This is a trivial visualization; you could have solved the problem on this
four-node network anyway. I'll try and break it down into tractable sub-graphs, which
might be trees or chains, that I can solve exactly, and then I'm going to try and merge
these solutions together to get a global solution.
You can think of it as breaking down the energy function into a sum of energy
functions and solving each one of them first.
So how did these previous works fit into this framework? If you think about BP,
you take a graph and you're trying to propagate these
messages among nodes. So your local problems are nodes and their neighborhoods:
you're computing the solution exactly on each
neighborhood, and you're passing messages across these neighborhoods.
Tree-reweighted message passing takes a particular graph and breaks it down into
trees -- into spanning trees found in this graph. It solves each one of these
problems on these trees exactly, because we have methods that do that, and
then uses a message passing algorithm to combine solutions from these trees.
And in this way, we note a natural progression of neighborhoods that are
increasing. There's going to be BP, which is solving the star-like problems
exactly. There's going to be TRW, which is solving tree-based problems exactly.
And there's going to be outer-planar decomposition which will solve problems on
a larger neighborhood called outer-planar graphs, which are strictly larger than
trees.
So now let me come to what are outer-planar graphs and why we're interested in
them. Outer-planarity is a notion from classical graph theory. A graph is
outer-planar, first of all, if it allows a planar embedding -- I can draw it in a plane.
So that was not a planar embedding; this is a planar embedding, where I've drawn the
graph in a plane.
In addition, all nodes must be accessible from the outside; they must lie on the
external, unbounded face. So this is not an outer-planar graph, because I
can't access this node from the outside -- the unbounded face around the
graph -- without any edge crossing. An alternate
definition that is sometimes more useful to think about is that I should be
able to add an extra node to the graph, connected to all other nodes, and the
result should be planar. They're equivalent: when you can do this, the graph is
outer-planar, and it's an if-and-only-if condition.
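As a concrete version of that apex-node characterization, here is a hedged sketch using networkx's planarity test; the helper name and apex label are ours, not from the talk.

```python
# G is outer-planar iff G plus one new "apex" node adjacent to every
# original node is planar -- the equivalent definition given above.
import networkx as nx

def is_outerplanar(G):
    H = G.copy()
    apex = "__apex__"            # assumed not to collide with G's node names
    H.add_edges_from((apex, v) for v in G.nodes)
    is_planar, _ = nx.check_planarity(H)
    return is_planar

# Example: K4 is planar but not outer-planar, since K4 plus an apex is K5.
assert not is_outerplanar(nx.complete_graph(4))
assert is_outerplanar(nx.cycle_graph(5))
```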
So the definition is topological, but let's look at some examples: what do these
graphs look like? Well, first of all, all trees are outer-planar. You
can draw trees on planes and they're all accessible from the outside.
But the class contains much more. It contains things that are not trees, that have
loops in them. I took this one graph and just dropped an edge, and that's outer-planar.
If you're trying to visualize how this is outer-planar, all you have to do is take this
node and plop it on the other side. Now it's drawn in a plane and suddenly everything is
accessible from the outside.
So in your mind, if you're trying to think of outer-planar graphs, visualize sort of a
polygon that has all its chords on the inside without crossing; everything is
accessible from the outside.
So that's a good way of thinking about these graphs. Why do we care? Because
we can do exact inference on outer planar graphs. So before we come to -- let
me try and just explain to you how this exact inference algorithm works. This
was work done by Schraudolph, et al., at NIPS, and they said we're going to take
an outer-planar graph and add an extra node. We have
this construction that adds an extra node and connects it to everything else.
And there's a way to go from energies to weights on this graph, with certain
special properties. The property is that every single cut here is going to correspond
to a labeling of your nodes. Every time you're in the same segment as this
source node, you're labeled zero; every time you're not, you're labeled one.
And the cost of a cut is going to correspond to the cost of the labeling.
If you've seen these sort of arguments before, which I'm sure a lot of you have, it
reminds you of the Kolmogorov construction. You're right, this is similar to that.
There are two key differences.
These guys are constructing an undirected graph instead of a
directed graph, so you only have to add one node. And you're searching for a
global min cut, not an s-t min cut. More importantly, the difference is you're
not appealing to submodularity: there are no constraints on the parameters; your
energy functions are arbitrary.
Also, if you're not appealing to submodularity, your edge weights can be
negative, so this min cut can't be solved with max-flow techniques.
In fact, this is solved with an algorithm that appeals to perfect matching. I won't
go into how these problems are solved; I'll be happy to talk to you after the talk
if you want to know the details.
But there is a restriction. I don't want you to think that any problem can be solved
with this construction. The restriction is the graph on the right must be planar.
That's when we can solve this problem. And that restriction means that if I make my
original problem a little denser -- take the four nodes, make them completely
connected -- what I get on the right is K5, the five-node graph which is fully
connected. That's not planar, so I can't solve this problem. That's why this planar-cut
sort of algorithm implies an outer-planarity constraint. That's where outer-planarity
comes from: we can only work with planar graphs.
Given that we can solve energy minimization problems on outer-planar graphs exactly,
are we done? Shouldn't vision be solved? Well, it's not, because even
though outer-planarity is larger than trees, it's still a restricted class. In vision
problems we're typically dealing with graphs that look like this, that are not
outer-planar. A grid graph, for example, as soon as it's bigger than two-by-two, is
not outer-planar; you'll have these landlocked nodes that you can't access.
So you can't solve this problem exactly. And if you form your graphs on superpixels,
which a lot of the time we do -- break the image down into regions and make every
region a node -- you'll again have graphs that are not outer-planar.
So what do we do? We are going to propose an algorithm. We're going to
leverage this new class which is amenable to exact inference and propose an
approximate inference algorithm called outer-planar decomposition. Let me
quickly go over what the algorithm entails. It's going to follow the same pattern as
the decomposition methods I mentioned earlier.
We're going to take a non-outer-planar graph. This is the smallest
non-outer-planar graph we can find, so I'm just using it as an example. And
I'm going to represent it as a collection of outer-planar graphs that cover this
graph.
Now, this is, of course, an overcomplete representation. I just needed two of
these to cover the edge set of my original graph. I'm going to discuss how many
of these sub graphs you need later. But think about this, all of these graphs on
the right are outer-planar. Some of them have not been drawn as outer-planar
graphs. For example, there are edge crossings happening. But that's only to
visualize the correspondence about which edge has been dropped.
So these are all outer-planar graphs. The reason I made this
decomposition was so that I could add that extra node back to each one, get planar
graphs, and do min cut in each one of them. So I've solved the problems exactly
on each one of these sub graphs. And you should think of it again as a
decomposition problem where I had a problem that I couldn't solve. I
decomposed it into these sort of overlapping sub problems and I solved each one
of them exactly.
They're not going to agree on the labelings they give the nodes, obviously. If they
agreed, we'd be done; we would have solved the problem. And that's why we're
going to develop a message passing algorithm that works on top of these
decompositions.
The message passing algorithm that we use -- we actually present not one but
four message passing algorithms, which generalize popular message passing
algorithms used in the literature. For example, the first one is OPD-BP, which is
a generalized version of belief propagation.
So it takes the message passing that belief propagation does and lifts it to
outer-planar graphs. In fact, for particular choices of outer-planar graphs you will
get belief propagation back. It will reduce to belief propagation. So it contains
belief propagation as a special case.
Similarly, we have an algorithm called OPD dual decomposition; dual
decomposition was introduced by Komodakis, et al., and the same property
holds. We're sort of lifting the message passing scheme to outer-planar graphs.
If you were to choose the outer-planar graphs to be trees, you would get their work
back.
So all the guarantees hold -- it would reduce to [indiscernible] DD. Similarly, we can
do this for other message passing schemes, but I'll focus on the first two, and
that's what the experiments we're going to see use.
So let me quickly go over the message passing scheme. There's going to be --
>>: It's not s-t min cut, it's a global min cut -- so what prevents you from labeling
all the nodes zero?
>> Dhruv Batra: You can do that. That solution would give you an energy of
zero. But you're trying to minimize energy, and cutting an edge might give you a
negative cost -- there might be negative weights. That's precisely because we're
not restricted to positive edge weights.
So let me just quickly go over what this message passing algorithm will look like.
We're sort of lifting belief propagation to outer-planar graphs. What we're going
to do is, we have these decompositions, and I'm going to introduce these agreement
variables that are sort of forcing these decompositions to agree on the labelings
that they assign to these nodes. I'm trying to force them to agree.
So I'm sort of introducing these agreement variables. And there are going to be
messages that these decompositions send to the agreement variables, and the
agreement variables send back.
So like I mentioned there are going to be two types of messages. There's going
to be a message from the decompositions to these agreement variables, and there's
going to be a message from the agreement variables to the decomposition.
And if you've seen belief propagation type things before these messages are
going to look intuitive. If you haven't, I'll try to walk you through the intuition. The
intuition is, one of the messages is really simple. What does the agreement variable
send to the decomposition? It sums up the messages from everybody else and sends
that to the decomposition. Think of each message as a confidence.
So it's a confidence that you assign: how confident am I of this particular
labeling? If I assign a particular node state zero, how confident am I of that?
What the agreement variables are doing is just sending you the confidences of
everybody else. If everybody else thinks this node should be state zero, you're
probably better off incorporating that into your decision.
The message from a decomposition to agreement variables, as I mentioned, is a
sort of confidence measure. The equation is there. But let me walk you through
the intuition.
The intuition is, the decomposition is going to pick a particular node -- say the
first node. It finds the min-cost cut, the best labeling, that
assigns this node state zero. Then it finds the best labeling that assigns this node
state one.
Writes those two energies down as a vector and then repeats this process for all
other nodes. What does this mean? Why is this a measure of
confidence? It's a measure of confidence because, if you consider a node, and
assigning it state zero and assigning it state one achieve the same
minimum energy, you're not confident; you're ambivalent about what the labeling
should be. But if on assigning it state one you can reach a significantly lower
energy than assigning it state zero, then you're more confident about state one.
Then you're better off labeling it state one.
>>: These are min-marginals?
>> Dhruv Batra: Yes. This is exactly the concept of min marginals. You're sort
of constraining your energy function at a particular variable and constraining it to
a particular state and optimizing over everything else.
Okay. And then you pass these messages back and forth until you converge.
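To make these min-marginal messages concrete, here is a hedged sketch. In the real algorithm each decomposition would get its min-marginals from the exact planar-cut solver; this toy version substitutes brute-force enumeration so it is self-contained, and all names are ours.

```python
# min_marginals(...)[i][s] = best energy achievable when node i is pinned
# to state s; the gap between the entries for a node is its confidence.
import itertools
import numpy as np

def min_marginals(num_nodes, num_labels, energy_fn):
    mm = np.full((num_nodes, num_labels), np.inf)
    for x in itertools.product(range(num_labels), repeat=num_nodes):
        e = energy_fn(x)
        for i, s in enumerate(x):
            mm[i, s] = min(mm[i, s], e)
    return mm

# Agreement variable -> decomposition: for a shared node, sum the
# confidence vectors of every *other* decomposition and send that back.
def agreement_to_decomposition(messages_from_decompositions, recipient):
    return sum(m for name, m in messages_from_decompositions.items()
               if name != recipient)
```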
An interesting question to ask is where do these sub graphs come from? I just
started with a toy example and showed you some sub graphs that are
outer-planar.
If you were doing a tree decomposition, one way to do it is a minimum
spanning tree. In this case it's an interesting question to ask: can I find the
densest sub-graph that's outer-planar? Or even better, the right objective function
here would be, I have an energy function over this graph; can I find the
outer-planar sub-graph that captures most of this energy, that retains most of this
energy? That would be the ideal thing to do.
Unfortunately, we can't do that, because in general this maximum outer-planar
sub-graph problem is NP-hard. If you had such an algorithm, the maximum outer-planar
sub-graph problem would be a special case of it. So that algorithm can't exist.
What is easy is the maximal outer-planar sub-graph problem -- sorry about the two
similar words. The difference between maximum and maximal
is that maximal says I can't add another edge to it and still be outer-planar. So it's
sort of a local optimality; there can be many maximals, and the maximum of
those maximals is the maximum. Sorry about the tongue twister.
So we can check for outer-planarity, and that leads to an interesting heuristic:
we just start with a spanning tree, keep adding edges until it stops being
outer-planar, remove those edges from your graph, and repeat.
And this is what the process looks like. I have an extremely dense graph. I'm
going to start with a spanning tree, keep adding edges until it stops being
outer-planar, then remove those edges and repeat the process a couple of times. So
I get these outer-planar sub-graphs that are contained in the original graph.
And you can think of randomized schemes where you decrease edge weights, so the
sub-graphs don't have to be mutually exclusive in their edges, and things like that.
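Here is a hedged sketch of that greedy heuristic, reusing the is_outerplanar() helper from the earlier sketch; the edge ordering and the randomized weighting just mentioned are left out for simplicity.

```python
# Cover G's edges with outer-planar subgraphs: start from a spanning tree,
# greedily add remaining edges while the subgraph stays outer-planar,
# remove the used edges from the pool, and repeat until nothing is left.
import networkx as nx

def outerplanar_cover(G):
    remaining = nx.Graph(G.edges)
    cover = []
    while remaining.number_of_edges() > 0:
        sub = nx.Graph(nx.minimum_spanning_tree(G).edges)
        for e in list(remaining.edges):
            sub.add_edge(*e)
            if not is_outerplanar(sub):   # helper from the earlier sketch
                sub.remove_edge(*e)
        remaining.remove_edges_from(list(sub.edges))
        cover.append(sub)
    return cover
```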
So now that I've talked about that sort of method, another thing to do -- here's a
decomposition scheme for grids. For grids, I can show you an
exact, deterministic decomposition scheme that takes a grid and converts it into
two outer-planar graphs that contain all its edges. The first one here
captures all horizontal edges; the second one captures all vertical edges. So
the union is obviously the grid graph.
They have interesting comb-like structures -- sort of ladder structures joined at
the top by another ladder. But I want
you to notice that you can do exact inference on this component. You can do
exact inference on this graph. So if you imagine an image, you have a pixel
here and a pixel at the top right, and they can communicate. Of course,
the communication is limited: all the information has to flow
through a [indiscernible] channel, and that of course is necessary because you
can't solve the general problem. But you can get the exact answers on both of
these components and then merge them together.
Interestingly, this is not just a hack. Every planar graph can be decomposed into
two outer-planar graphs. That was a theorem proved recently. And there's a
linear time algorithm to do this.
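For intuition, here is one plausible comb-style cover of an H x W grid, hedged: each subgraph takes all edges of one orientation plus a two-wide ladder to tie the chains together, which keeps it outer-planar. The paper's exact comb construction may differ in its details.

```python
import networkx as nx

def grid_combs(H, W):
    """Two outer-planar 'combs' whose union covers the H x W grid graph."""
    horiz, vert = nx.Graph(), nx.Graph()
    for r in range(H):
        for c in range(W - 1):
            horiz.add_edge((r, c), (r, c + 1))     # every horizontal edge
    for r in range(H - 1):
        for c in range(min(2, W)):
            horiz.add_edge((r, c), (r + 1, c))     # left 2-column ladder spine
    for c in range(W):
        for r in range(H - 1):
            vert.add_edge((r, c), (r + 1, c))      # every vertical edge
    for r in range(min(2, H)):
        for c in range(W - 1):
            vert.add_edge((r, c), (r, c + 1))      # top 2-row ladder spine
    return horiz, vert
```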
And our heuristic is able to find two-sub-graph decompositions
surprisingly often. In one of the experiments I'm going to show, on a Delaunay
planar graph, we're able to find decompositions that are both outer-planar,
and only two of them.
All right. So let me just talk about some experiments. We, first of all, wanted
to -- we ran this on a few typical vision problems. For example, the kind I
showed before. But we also wanted to control the energy function. So we tried a
synthetic energy problem.
So the node energies were just sampled from Gaussians: node energies for
states zero and one, sampled from Gaussians.
For the edge energies, we set the diagonal terms to zero, and the off-diagonals were
sampled from Gaussians of increasing variance.
Why did we increase the variance? One way to think about these increasing
variances is that your edge terms are becoming stronger and stronger. So the
interaction potentials are increasing and the problems are getting harder.
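A sketch of that synthetic setup, with assumed conventions (binary labels, unit-variance node terms); the function name and layout are ours.

```python
# Node energies ~ N(0, 1); edge energies have zero diagonal and
# off-diagonal entries ~ N(0, sigma^2), so larger sigma means stronger
# interactions and harder instances.
import numpy as np

def synthetic_problem(G, sigma, seed=0):
    rng = np.random.default_rng(seed)
    node_energies = {i: rng.normal(0.0, 1.0, size=2) for i in G.nodes}
    edge_energies = {}
    for (i, j) in G.edges:
        theta = np.zeros((2, 2))
        theta[0, 1] = rng.normal(0.0, sigma)
        theta[1, 0] = rng.normal(0.0, sigma)
        edge_energies[(i, j)] = theta
    return node_energies, edge_energies
```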
Let me just show you some results. There's a lot to take in here. So let me walk
you through it. So, first of all, the three graphs from left to right, the problems are
becoming harder.
The sigmas are increasing, and your edge interaction terms are becoming
stronger. We're trying to solve an energy minimization problem, so
lower is better.
>>: What's the graph structure here? Is it?
>> Dhruv Batra: It's K-4. So this is a toy example.
>>: Okay. This is a 4. Okay.
>> Dhruv Batra: It's a 4. I'm going to show you larger ones, too. So this is a K-4
two labeling problem. Sigma is being changed from left to right. It's an energy
minimization problem, so lower is better. I'm comparing to TRW, BP, and QPBO,
which is a generalization of graph cuts, and I'm also plotting the lower bounds.
So you have an energy minimization problem and you have a lower bound that
increases with each iteration. So the lower bound is coming up, the energy is
going down, and when the two meet, you know you've solved the problem.
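That stopping test is just a zero-duality-gap check; a one-line sketch, with an assumed tolerance:

```python
# Solved to provable optimality when the best energy found meets the bound.
def solved(best_energy, lower_bound, tol=1e-6):
    return best_energy - lower_bound <= tol
```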
And the interesting thing to note is when sigma is small, almost every method
can solve this problem really quickly. Really, really quickly. When sigma is
large, as sigma gets large, here's TRW lower bound. Here's the energy.
And there's a big gap between the two. So the problems are hard. These
methods are not going to be able to solve these problems. These are extremely
hard non-submodular problems. And that's where OPD, our method, comes in.
Here's the lower bound of OPD and here's the energy.
So you're sort of not even -- you're cutting into the lower bound, and you're
reducing the energy. So you're better off, because now you've shrunk that area
where the energy can be.
Any questions on this one? Okay. And this was the toy example that I showed
before. We also did this on a 30-by-30 grid. Again, the results are similar. The
convergence rates are slightly slower, because it's a larger graph. But on the
left the problems are really easy and every method converges, and on the right
our lower bound is slightly tighter than the TRW lower bound and the energies
are lower.
And this was again, this was the decomposition scheme followed for these
methods.
We also did these experiments on the gender face -- I'm sorry?
>>: I'm curious why BP did better than TRW. Is there any --
>> Dhruv Batra: That was actually very surprising to us, too. It's interesting. We
find that BP has been very badly maligned; it's not that bad. It's better than a lot of
methods. I don't know, honestly. But it performs surprisingly well.
So here's an application that I showed before. This is not a name-face
association problem; this is a gender face labeling problem. So we have these faces
detected in these images.
So this was some work that Andy was doing in his thesis. He has these group
shots. We run a face detector and find out where
the faces are in the image. The goal is to label gender. There are weak
classifiers based on facial features that give you your node potentials, which say how
confident you are that this face is male or female.
In addition, we have a labeled dataset where we have genders
labeled for all the faces. We construct a graph here by a planar Delaunay
triangulation and find pairwise features that describe the relative location and
scale of these faces. So it's like saying, if you have a person that's slightly taller
than the person standing right next to them, you go to the dataset, do a
nearest-neighbor lookup, and find that it's more likely that this person is male and
this person is female. It's just incorporating those priors into your labeling.
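A hedged sketch of that nearest-neighbor pairwise prior; the feature choices and names are illustrative guesses, not the actual system.

```python
# Turn a pairwise face feature (relative location / scale) into an edge
# energy by looking at the gender pairs of its nearest neighbors in a
# gender-labeled training set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pairwise_gender_energy(pair_feature, train_features, train_label_pairs, k=25):
    nn = NearestNeighbors(n_neighbors=k).fit(train_features)
    _, idx = nn.kneighbors(pair_feature.reshape(1, -1))
    counts = np.ones((2, 2))                 # Laplace smoothing; 0=male, 1=female
    for a, b in train_label_pairs[idx[0]]:
        counts[a, b] += 1
    return -np.log(counts / counts.sum())    # energy = negative log prior
```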
Again, I'm showing the minimization of the energy defined over this graph. This
is the TRW lower bound; that's the energy. Our method solves the
problem exactly within the first iteration -- there's only one line, and it's converged.
Here, the correctly labeled faces are in solid and the incorrect ones are in
dashed lines. Following the natural convention -- I'm sure I will be criticized for
this -- male is blue and female is pink.
And here it is again on a larger graph structure. We also tried this on a multi-class
object labeling problem. So this is the semantic segmentation problem. You
have an image pre-segmented into segments -- this is one segment, this is
another segment, this is another segment -- and we have nodes that represent
those segments and a fully connected graph on those nodes. We have
local color and texture classifiers that extract features
from these segments and say whether this is one of these categories or not.
We have pairwise features that describe co-occurrence, relative location,
and scale, and you can learn parameters on them from a given dataset and
describe an energy function.
So what's happening is you're again incorporating your priors on these
edges. I've seen a tree next to a road. And if I see a road, if I see a building, if I
see sky, it's more likely that this position should be filled by a tree. And of
course here we were wrong.
But you got everything else right. And again here are the energies that are
achieved by various methods. In this case we see a significant gap between the
TRW lower bound and the TRW energy, and we're sort of better here; we're
getting lower energies. And this is a pretty interesting application, because it
naturally encodes some of the repulsive potentials that I was talking about
before, where, given that you have found a road -- if this is a road -- then your
potentials are saying that this other segment cannot be, let's say, water, or
things like that. Those are extremely [indiscernible] that are hard to encode with
submodular parameters, and that's where most energy minimization methods
fail.
And it's interesting to see that we can get some improvement here.
>>: Do you know what the ground-truth minimum energy is for these? Is OPD
getting close to that energy?
>> Dhruv Batra: I don't know. We did check accuracies on this application, and
OPD did slightly better. But again, there was no parameter learning, so it's not
guaranteed to do any better.
>>: Seems like it doesn't need -- after the first iteration --
>> Dhruv Batra: Basically, this is a small graph, and so the problem is fairly
easily solved. It would be really interesting if this were on pixels, and I'm sure
convergence would be slower.
And we also tried this on the standard Middlebury dataset. We have the optical
flow problem: every single pixel you're trying to label with a two-dimensional
motion flow vector, and these are images, as I mentioned, from the Middlebury
dataset. We follow the same energy functions that are in the standard papers by
Szeliski, Baker, et al. And it's interesting to see that these datasets
are fairly well solved by standard methods.
TRW, for example, here's the lower bound and here's the energy. And you see
that within a few iterations you've solved the problem.
For OPD the improvement, if any, is that we decrease the energy sooner.
But there's not much left to improve in these problems.
So I'd say that the take-home messages here were that we took the first step
towards structures that are topologically more complex than trees. And that's
sort of the most exciting thing to me.
We've restricted ourselves to tree-based methods because we could solve
problems on trees. I want to argue no, we can solve more things than trees. We
just haven't incorporated them.
And OPD is really useful for hard non-submodular problems, and I think object
labeling -- semantic segmentation -- is a good example of that kind of problem. Anytime
you're trying to incorporate priors that say this and this cannot be together, and
there's a strong repulsive potential, it might be useful to look at OPD.
And interestingly, traditional benchmarks might be getting saturated. It might be
time to throw them away.
I want to mention that we're also thinking about future work. If you think
about what we have right now: we can use max-flow min-cut -- that's the graph cut
idea -- and that forces us to work with submodular energies; that's the definition of
submodularity, by the way. And in this work, what I was leveraging was this planar-cut
idea: if your graphs are planar, then we can work with them, and that forces us to
outer-planarity. These two constraints are sort of orthogonal: one is a
constraint on the parameters, and one is a constraint on the structure of the problem.
In general, your problems are neither going to be outer-planar nor submodular.
But it's still interesting that we have these black boxes that can solve certain sub-
problems exactly. So it might be time to think about decompositions, following
some of the ideas that I mentioned here, that take an arbitrary problem, break it
down into the submodular part of the problem and the outer-planar part of the
problem, solve those exactly with whatever black-box algorithm you can come up
with, and merge those solutions together. And that, I think, might take us to
better approximations of the kinds of problems that we do want to solve. Because
submodularity, although it has gotten us really, really far -- a lot of problems
display submodularity -- there are still a lot of interesting vision problems that
don't, and it might be useful to think about these kinds of issues.
I do want to talk about something really quickly. This is some work that I've been
doing over the past year on interactive co-segmentation. And some applications
to 3-D reconstruction. I'm going to go through this really quickly -- just some
videos, basically. And I'd be happy to talk to you more about this.
So this is mostly joint work with students [indiscernible] at Cornell. So let me just
quickly go through it. We all know interactive single-image segmentation; that
problem has sort of been beaten to death. But people don't take images like that.
Typical image collections look like this: they take multiple images of the same
scene, of the same game that they went to.
And it's an interesting question to ask: well, if I know these images are related,
that they contain the same thing, can I sort of do better? Can I segment all of them
at the same time? Can I co-segment, and can the user guide me in the co-segmentation
process? That's the problem we tackled in this paper.
We sort of started with a collection of images. You have scribbles on these
images, and you're trying to segment not only a single
image but this entire collection at the same time.
It follows the same basic principles as the single-image setup. The idea is you
scribble on one or multiple images, we fit some appearance models to those
scribbles, and then we set up an energy minimization problem that's solved with
graph cuts over all these images. So that's the basic setup; a sketch of the
appearance-model step follows.
But the interesting things happen in how you let a user do this. Let me just
quickly show some videos. So we actually built this system that allows a user to
cut out an object out of
multiple images. So this is the interface that we built. This is an HP
TouchSmart, and you see images that are all related in the sense that they contain
the same foreground object. The user says, this is what I'm interested in, this is
foreground, and marks out, this is background. And our system, as you'll see
next, not only segments that particular image, but segments all the images in
this collection.
So you'll see cutouts out here on the right -- cutouts not only from that
image but from all those images.
The interesting problem here is not this part; this part is fairly
straightforward -- you can generalize single-image segmentation. The interesting
part is, how do you ask users to make corrections to 50 images? You give them
some results.
They don't like the results that you reported. How do you ask them to give you
feedback on 50 images, or 100 images, or extremely large collections? The
naive setup would say, what do I do for a single image? I show them the
image and ask them to correct it. So I'll show them 50 images and ask them to
correct those. What we said was, why not have the segmentation system guide you
through this process? Why not have the segmentation system say, here's where I'm
confused; why don't you give me more scribbles here?
And that's what we developed: the idea of intelligent scribble guidance. Why
not have the segmentation algorithm quantify how certain it is about the
segmentation? Mind you, we're not saying that the system knows where it's
incorrect; there's no way of knowing correct or incorrect segmentations.
It's just certainty. And if you can quantify your uncertainty, you can reduce the
amount of time it takes for people to get to the segmentations they require.
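A hedged sketch of the guidance rule; the scoring here is our paraphrase of "quantify the uncertainty," not the paper's exact recommendation measure, and the names are ours.

```python
# Score each region by how close its foreground/background min-marginals
# are to a tie, and ask the user to scribble where confidence is lowest.
def most_uncertain_region(region_beliefs):
    """region_beliefs: dict region_id -> (E_fg, E_bg) min-marginal pair."""
    def margin(e):
        return abs(e[0] - e[1])        # small margin = low confidence
    return min(region_beliefs, key=lambda r: margin(region_beliefs[r]))
```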
So I'm just showing you some examples where the user says, you know, this
is foreground, this is background. The system might make a mistake -- it thinks
that grass region was also foreground, so it gets labeled foreground. But
it's not too confident about it, so it shows you a box that says, why don't you give me
more scribbles here? And once that happens, it does a good job of segmenting
everything else.
Interestingly, once we had this setup -- sorry about that -- my
co-author immediately saw applications to 3-D reconstruction. His idea
was, I see an object out in the real world; I go and take a bunch of pictures of
that object. All I have to do is scribble on that object -- just say this is
foreground, this is background -- and I can use this interactive co-segmentation
technique to get some sort of silhouettes or segmentations. And then I can use a
shape-from-silhouette algorithm to get a rough volumetric 3-D model of that
object. And you can do this for objects you can't take back to a controlled setting;
you can't take the stone back to a structured-lighting setup and try to get a
dense reconstruction. We thought it was just a cute, cool idea that you could
take a bunch of images, mark out a few scribbles, and get dense reconstructions
from that.
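A sketch of the shape-from-silhouette step, assuming calibrated 3x4 camera matrices from structure from motion and binary silhouette masks; the names and voxel handling are illustrative.

```python
# Keep a voxel only if it projects inside the object silhouette in every
# calibrated view (assumes all points lie in front of the cameras).
import numpy as np

def carve(voxels, cameras, silhouettes):
    """voxels: (N, 3) points; cameras: list of 3x4 projection matrices;
    silhouettes: list of binary (H, W) masks aligned with the cameras."""
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])
    for P, mask in zip(cameras, silhouettes):
        proj = homog @ P.T                          # project into the image
        xy = (proj[:, :2] / proj[:, 2:3]).round().astype(int)
        inside = (xy[:, 0] >= 0) & (xy[:, 0] < mask.shape[1]) & \
                 (xy[:, 1] >= 0) & (xy[:, 1] < mask.shape[0])
        keep &= inside                              # outside any view: drop
        keep[inside] &= mask[xy[inside, 1], xy[inside, 0]].astype(bool)
    return voxels[keep]
```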
Let me skip to something that might be more interesting. This
was a statue at Cornell -- we're at Cornell now, so we had to do that. We
also got the Statue of Liberty dataset from Noah. So what this algorithm is doing
is, given these silhouettes, it's also running a structure-from-motion algorithm to
find the camera parameters. So you know where the cameras are that took those
pictures, and you're projecting those silhouettes back into space.
And mind you, these are not extremely dense reconstructions.
This is not the quality you would get out of a laser range scanner, but they're better
than sparse point clouds, and that's sort of the idea here.
We also did this experiment where we took a video of a person standing on
the ground. We're not using the fact that this is a video; we're just sub-sampling
some frames from the video sequence. We took a bunch of these frames, scribbled
on them to say this is foreground, this is what I'm interested in, and we
were able to extract a somewhat accurate 3-D reconstruction of that person.
And so this was just for one object. The thing that might be
more interesting to this audience, given that there was some work done here,
is that my colleague also extended this beyond any particular object
to do a planar reconstruction of a scene. And there's, of course,
been a lot of work done on that here. So we used some of the ideas of
co-segmentation to take a group of images and mark out planes in the scene --
this is one particular plane, this is another particular plane -- to get a planar
co-segmentation. So you're segmenting out not objects but planes
in the scene.
Let me quickly jump to -- and we're able to not only render one
particular object volumetrically into the scene but also get a planar
approximation of the world that we're seeing.
And, again, it uses some of the same ideas of the co-segmentation setup. You're
posing it as a discrete labeling problem over these planes.
Let me see if we have -- yeah. This one is, of course, a toy example. But it's still
interesting to look at: you have a collection of these images, the user just
marks out a few scribbles, and you get this fairly accurate reconstruction of the
world.
That's me. So there's the foreground being segmented, and there's a planar
segmentation of the scene. We had some problems with that result -- there's
actually a visual hull coming out of the body.
All right. So with that I'll stop. If any of this seemed interesting, I'd be happy to
talk to you more after this. All right. Thank you.
[applause]
>> Larry Zitnick: Questions?
>>: So for co-segmentation, is the basic idea that the color model or color
texture model is being blurred across multiple images at once?
>> Dhruv Batra: So the basic idea is that, yes, the appearance is transferable
across these images. But it's not just that. We actually have both a unary
appearance model and a pairwise appearance model: not only fitting what the
foreground is supposed to look like as a GMM and what the background is supposed to
look like as a GMM, but also learning which colors are supposed to be similar
to each other and which are not. There's a learning algorithm that learns that
these two colors appear in the foreground together, so they should be similar to
each other; you'd rather keep them together. But at the heart of it, yes, there's
a transferability assumption that what I see in one image should hold true in
another image. If you change your mind about what the foreground is between
these images, then, yes, there's nothing we can do.
>>: On your MRF inference, OPD -- it looked like instead of a tree having single
chains, you had double chains.
>> Dhruv Batra: Right.
>>: Is that the difference? Does that make a big difference in practice?
>> Dhruv Batra: So, that was just one particular structure for grids, just
one particular structure that we experimented with. Yes, that structure alone
does make a difference, but in general this approach has nothing to do with that
particular structure. It's that any larger structure you can find which is still
outer-planar is denser than trees, so you're solving a bigger
chunk of the problem exactly.
And it's all about pushing that envelope of what you can solve exactly. When you can
solve a bigger sub-problem exactly, then when you merge the solutions
together, you're closer to the original problem in some sense.
>>: Have you thought at all about [indiscernible] choosing the structure -- which
structures, especially given the underlying graph, like visible edges in the
image and whatnot --
>> Dhruv Batra: Yeah, so we have been thinking about extracting
structures based on the problem. So imagine a problem in which I construct a graph
that is outer-planar -- I only construct a
square graph; there are no edges in between -- but I throw in some random
edges with zero weights. If you're not looking at the edge weights and you do a
decomposition, then you're far off. But if you somehow knew that those edge
weights were zero and you chose this as a decomposition, you would have solved
the problem exactly.
So we are looking at -- so that heuristic I talked about is along the same lines.
You find a spanning tree that keeps most of the energy, then you keep adding
edges based on how strong those edge weights are. So it's hard to describe that
structure independent of the problem.
But given a problem, we can find a structure that sort of represents most of the --
>>: What were some of the other decompositions besides the comb?
What do they look like? Because the comb, intuitively, only seems to propagate
vertically or horizontally, and the double-wide comb seems like it would be a
marginal improvement over having a single chain.
>> Dhruv Batra: Right. Let me just try and address the first question, which is
what these decompositions look like. Some of them are not that intuitive, in
the sense that -- this is a Delaunay planar graph, where
the blue dots are the nodes, and you're sort of finding these structures. They're
both outer-planar and they're sub-graphs of that one. So the interesting thing is,
you look at the structure, and if it looks like a tree of triangles,
that is an accurate representation. Outer-planar graphs, if
you think about them in terms of tree width, have tree width two. So they're
strictly the next largest structure after trees.
>>: What is the tree width?
>> Dhruv Batra: The tree width is -- if you were to think of this in terms of a junction
tree approach, or in terms of triangulations of your graph, then the tree width is
given by the largest clique that would be formed in a triangulation. So trees have
tree width one, and outer-planar graphs have tree width two. So on that axis you're not
very far away from trees; it's strictly the next step from trees.
So if it feels like we haven't gotten too far beyond trees, you're right. We haven't.
But it's still interesting to see that in our applications that does make a difference.
You're better off.
>>: These triangles could make a difference. This is just an abstract graph, but
if it was structure from motion or something, triplet relationships are much stronger
than pairwise relationships.
>> Dhruv Batra: Yes. And you're incorporating those sort of constraints at a
long-range level.
>>: So we did skeletal graphs to make structure from motion faster, but that's when
you're basically solving a linear system, so it's not a graph cut, right?
>> Dhruv Batra: Uh-huh.
>>: And this is -- in each case you're always solving a binary labeling problem,
even if you're running it on flow or something?
>> Dhruv Batra: Right. Right, but the point to remember is, for all the message
passing algorithms, if you did have an algorithm that solved a multi-label problem
exactly, they would all be able to incorporate that without any
change.
So I showed you a cut-based algorithm, but if you were willing to use
something else, like a junction tree or something, then all these message passing
algorithms would work on that, too. But right now we're restricted to solving
binary problems at each step.
>>: So a junction tree -- many people have rediscovered this, but Pearl's
algorithm may be a width-one junction tree algorithm, right?
>> Dhruv Batra: In a sense. Exactly -- it's exponential in tree width plus one. So
you're willing to pay an exponential cost, but it's exponential in the tree width, so
solving it on trees is easy, and solving it on outer-planar graphs is just slightly
trickier. The tree width goes up by one, basically.
>>: Okay. So you haven't done that, but how hard -- I'm trying to
figure out how useful this is beyond -- because in addition to occasionally working
on binary problems, like segmentation and things like that, we also do a fair
amount of sparse linear solving and other problems which aren't
binary. Does this planar decomposition give you something that could be used, for
example, as a preconditioner for a linear system solver?
>> Dhruv Batra: Frankly, I haven't thought about that. But if I were to change the
question to, would this make a difference for multi-label problems, my intuition is it
would make a difference even for that. Because right now the way we
solve multi-label problems is with an alpha-expansion step on top of these binary
engines, so even with alpha expansion, we can see some gains over TRW or BP, which
would solve the multi-label problem to begin with. So I'm definitely confident that if
we were to throw away the alpha expansion and this two-label engine and just come up
with something that solved the multi-label problem exactly, it would do better than
the alpha expansion plus two-label thing. It should work, theoretically.
>> Larry Zitnick: Okay. Dhruv, thank you very much.
>> Dhruv Batra: Thanks.
[applause]