>> Sudipta Sinha: Good afternoon. So it's a pleasure to have Alexander Schwing here with us
today. Alexander is a fourth-year PhD student at ETH Zurich, where he is doing his PhD under
Marc Pollefeys in the Visual Geometry Group, and he's also co-advised by Tamir Hazan and
Raquel Urtasun at the Toyota Technological Institute, TTI Chicago, and he has been doing some
really interesting work on efficient structured prediction and applying it to 3D scene
understanding and stereo matching, and he's here to talk about it today.
>> Alexander Schwing: Thanks, Sudipta, for the introduction. So what do I mean by efficient
inference and learning for structured models? So suppose you are given this indoor image on the
left-hand side, where you are interested in predicting the location of the walls. You could go for
a pixel-wise independent prediction approach and obtain a result similar to the one illustrated in
the middle. Obviously, that's kind of noisy. So wouldn't it be kind of cool if you could jointly,
yet efficiently, reason about physically plausible and hence structured configurations and obtain
a result similar to the one illustrated on the right-hand side?
Or, to give you another example, suppose you are given this rectified façade image, and we are
interested in finding the location of the windows, the balconies, the wall and so on. So, again,
you could go for a pixel-wise independent prediction approach, but I would argue that reasoning
within the space of physically plausible configurations actually gives you results that are visually
a lot more pleasing.
So what does it require to do reasoning in those physically plausible, in those structured spaces?
Well, first of all, we need to do this inference task somewhat efficiently, and one way of doing
the inference task efficiently is by distributing it onto all the computational resources that you
have available, and this is going to be the first part of my talk, where I'm going to show you one
way of doing that.
Not only do you want to do inference somewhat efficiently. You also want to learn in those
structured models, and those structured models are usually fairly general, and so the second part
of my talk, I'm going to give you some ideas on what you could do if you have general graphical
models, also with latent variables and how to learn in those. And in the third part, I would then
want to give you some details regarding the motivating layout application, before concluding
with also some hints on possible future work.
So let's look around us a little bit, and let's see what happened with technology in the past couple
of years. So if you look into your cellphone, for example, your cellphone probably has two
cores, four cores or even eight cores, these days, so multicore environments are everywhere, and
obviously we want to leverage those for the problems we are looking at, so we want to distribute
our tasks. In addition, our cellphones acquire more information than ever before. We have
sensors and acquisition rates that produce bigger and bigger data sets, and we don't want to throw away
this information, so we want to reason within this large-scale setting. And, obviously, I
already mentioned that we want to reason in physically plausible -- that is, somewhat structured --
spaces, and it would be a waste if we didn't leverage the structure, which usually constrains
the solution somehow. So the task, or the question I'm going to ask, is how can we formulate
this reasoning task, this inference task, to be distributed with respect to both computation and
memory?
So I first need to start by telling you what I mean by this reasoning, what I actually
mean by this inference task. So by inference, I mean a standard maximum a posteriori, or like a
score-maximization task, so we're interested in a set of variables, S sub 1 through S sub N, and
we want to maximize some scoring function. This scoring function consists of a couple of terms,
local score terms, theta sub B, and higher-order scoring terms, theta sub alpha, where by higher
order, I mean terms that depend on two or more variables.
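In symbols, writing the local terms with index v and the higher-order terms with index alpha (this notation is assumed here for concreteness, matching the talk's local/higher-order split), the inference task is:

```latex
\max_{s_1,\dots,s_N} \;\; \sum_{v} \theta_v(s_v) \;+\; \sum_{\alpha} \theta_\alpha(s_\alpha)
```

where s_alpha collects the states of the two or more variables that the factor alpha depends on.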
I'm also going to assume that the variables we are interested in are actually discrete variables. So
if the variables are discrete, then these local scoring functions, these theta sub Bs, are nothing other
than simple lookup tables. And the structure of the problems, the structure I've been talking
quite a bit about at the very beginning, is fully encoded in those higher-order scoring functions.
And a nice way of visualizing this structure is via what is known as a factor graph. So I gave one
example of a factor graph here on these slides, and we have four variables in this example, S sub
1 through S sub 4, and two factors, one being dependent on three variables, and the other one
being illustrated as a green rectangle, depending on two variables. And we draw an edge between
those rectangles and their nodes, the variables, if the factor depends on this variable. And since
all the variables are discrete, also the factors obviously just depend on discrete variables and
therefore are just lookup tables, as well.
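As a minimal illustration of that point (all variable names and scores below are made up, not from the talk), such lookup tables could simply be dictionaries:

```python
# Hypothetical example: two binary variables and one pairwise factor.
# Local scores (one theta table per variable), indexed by the variable's discrete state.
theta_1 = {0: 0.2, 1: -0.5}
theta_2 = {0: 0.0, 1: 0.7}
# Higher-order score (one theta table per factor), indexed by the joint state it depends on.
theta_12 = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}

def score(s1, s2):
    """Total score of one joint configuration: local terms plus the factor term."""
    return theta_1[s1] + theta_2[s2] + theta_12[(s1, s2)]

# Exhaustive maximization only works for tiny models; it just illustrates the objective.
best = max(((s1, s2) for s1 in (0, 1) for s2 in (0, 1)), key=lambda s: score(*s))
print(best, score(*best))
```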
Now, how do we solve this maximization task? The first thing to notice is probably that this
maximization task is equivalent to an integer linear program, and to show that, I want to walk
you through a small example, namely, the example of those two nodes being connected by one
factor. And in order to rephrase the problem, we are going to introduce some variables. I
denoted them by B sub V and B sub alpha for beliefs. So now suppose we are going to multiply
every belief variable with the corresponding score and we maximize this task. Obviously, we
don't quite get what we'd like, because the value of that cost function would be unbounded.
We would simply select a belief to be plus infinity or minus infinity, depending on whether the
score theta is positive or negative. So we need to introduce some constraints. What can we do?
Well, the first thing that could be done is we could ask for the beliefs to be either zero or one. If you
are doing that, then the cost function is at least bounded. We are still not quite where we'd want
to go, because we could either select multiple or none of the scores for a variable. So we need to
somehow constrain that and restrict ourselves to selecting at least one score per factor
and one score per local scoring function. How are we going to do that? Well, we are going to
enforce that by saying that the beliefs, both the local beliefs and the factor beliefs, have to sum to
one.
Considering the above constraint, we are required therefore to pick at least one and also at most
one of the scores. We are almost at the place where we want to be. The one thing that is still
missing is that the factor beliefs should be somewhat consistent or should be consistent with the
local beliefs, and the way to enforce that is via what is known as a marginalization constraint,
which I am going to illustrate here in a general form. So the cost function is linear. The
constraints are linear, and there are some integrality constraints. Hence, we obtain an integer
linear program, which we need to solve. So here's the integer linear program again, and now let
me actually rewrite that and simplify it somewhat. So instead of always writing this as a product
for the cost function, I'm going to just use the sum notation. It might look a little bit messy, but
it's nothing else than an [inner] product, actually. And then I'm going to refer to the
marginalization constraint by its name, and I'm also going to say that this requirement that the
local beliefs and the factor beliefs have to sum to one, I'm going to say that I want those beliefs
to actually be local probability distributions.
I'm also going to include in this local probability constraint the fact that I want the beliefs to be
larger or equal to zero. And then, if I in the next step throw away the integrality constraint, what
I'm going to end up with is what is known as an LP relaxation. So, now, we are ready to state
our initial problem, the problem I posed or the question I asked, where I said, "How can we
distribute inference?" We are now in a position to describe that a little bit more or make it a little
bit more specific, so what is the goal? What do we want to achieve?
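For reference, the LP relaxation we will be working with can be written as follows (a sketch in the notation just introduced):

```latex
\max_{b \ge 0} \;\; \sum_{v}\sum_{s_v} b_v(s_v)\,\theta_v(s_v) \;+\; \sum_{\alpha}\sum_{s_\alpha} b_\alpha(s_\alpha)\,\theta_\alpha(s_\alpha)
\quad\text{s.t.}\quad
\sum_{s_v} b_v(s_v) = 1,\;\;
\sum_{s_\alpha} b_\alpha(s_\alpha) = 1,\;\;
\sum_{s_\alpha \setminus s_v} b_\alpha(s_\alpha) = b_v(s_v) \;\;\forall\, \alpha,\, v \in \alpha
```

That is, the beliefs are local probability distributions, the factor beliefs marginalize down to the local beliefs, and the integrality constraints have been dropped.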
So, first of all, we want to optimize this LP relaxation objective, and obviously, we want to
leverage the problem structure, the problem structure that is encoded within these higher-order
scoring functions, and from the initial task description, we want to distribute the memory and the
computation requirements. Obviously and importantly, we want to maintain convergence and
optimality guarantees of existing algorithms. How are we going to do that? Well, we are going
to go for something that is known as dual decomposition approach, so let me try to give you the
intuition of what do I mean by dual decomposition in this case.
So, suppose we are given this grid-like graph of two-by-four random variables or two-by-four
nodes, where the edges are now the factors, the connections between the variables, and I want to
distribute that problem, this graphical model, onto two machines, kappa one -- or onto two
computers, kappa one and kappa two. So now, a computer has to hold a belief only if the
variable is also assigned to that computer, so computer kappa-1 doesn't need to know anything
about the variables B(v), so B(3) to B(8), for example. They're just there, just not assigned to that
computer.
Similarly, a computer has to hold a factor belief if the factor depends on at least one variable that
is assigned to this computer. So that already tells you or shows you that there is some
distribution with respect to the memory going on. We don't have all the beliefs on all the
computers, but, obviously, there is a catch. So on every computer, we naturally need to enforce
marginalization constraints and local belief constraints, the one I have been introducing earlier,
but we can see that there might be factor beliefs that occur on my two computers, kappa-1 and
kappa-2, independently. Naturally, we need to enforce, we need to make sure that those beliefs
are consistent. I'm actually going to, or I'm showing that via those two parallel lines here. The
beliefs that occur on computers independently are illustrated by those parallel lines, and we need
to somehow make sure that, upon convergence, the value on both computers is actually equal.
So with this additional consistency constraint, we are ready to state our program we want to
solve.
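Spelled out, the distributed program has roughly this shape (a sketch; in particular, how the score of a factor shared by several computers is split into per-machine shares theta^kappa is one possible choice, not detailed in the talk):

```latex
\max_{b \ge 0} \;\; \sum_{\kappa} \Bigg( \sum_{v \in V_\kappa} \sum_{s_v} b^\kappa_v(s_v)\,\theta_v(s_v)
\;+\; \sum_{\alpha \in A_\kappa} \sum_{s_\alpha} b^\kappa_\alpha(s_\alpha)\,\theta^\kappa_\alpha(s_\alpha) \Bigg)
```

subject to, on every computer kappa, the local probability and marginalization constraints from before, plus the consistency constraints b^kappa_alpha = b^kappa'_alpha for every factor alpha that appears on two computers kappa and kappa'. Here V_kappa and A_kappa denote the variables and factors held by computer kappa.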
Again, it's a linear program, which we can write in a way that it's parallel in terms of the
computers kappa, just by rearranging the linear cost function, and also we're going to have local
probability constraints on every computer. We're going to have on every computer
marginalization constraints, and we're going to have those consistency constraints on every
computer. So now you might say, wait a minute, this looks like everything is decoupled now.
That cannot be the case. And, obviously, you're right. We cannot completely decouple this
problem. It was originally coupled. The coupling occurs only in those consistency constraints.
Those were the constraints which tied together the individual problems.
The nice thing now is that this program allows us to derive an algorithm, and I want to show you
the intuition of what the algorithm looks like. In fact, the algorithm consists of two parts.
The first part of the algorithm is standard message passing, independently on all the computers
kappa, and therefore possibly in parallel. So you can do message passing in parallel on all the
computers you have. But, to make sure that, upon convergence, the variables are
consistent, you need to exchange information occasionally, before then going back and doing
some more message passing and iterating between those two steps. So you iterate back and forth
between message passing and exchanging information, and again some message passing. In fact,
you can see that the exchange of information is nothing else than another type of message
passing on a separate graph, actually.
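A minimal sketch of that back-and-forth, with the per-machine solver left as a stub (everything here is illustrative: the function names are mine, and the consistency step is shown as simple averaging of duplicated factor beliefs, which is one way such an exchange can be realized):

```python
def local_message_passing(machine, num_sweeps):
    """Stub: run message-passing sweeps on this machine's sub-graph, updating its
    local and factor beliefs in place. Placeholder for the actual per-machine solver."""
    pass

def exchange(machines, shared_factors):
    """Consistency step: reconcile factor beliefs that are duplicated across machines."""
    for f in shared_factors:
        copies = [m["beliefs"][f] for m in machines if f in m["beliefs"]]
        avg = sum(copies) / len(copies)
        for m in machines:
            if f in m["beliefs"]:
                m["beliefs"][f] = avg

def distributed_inference(machines, shared_factors, rounds=100, sweeps_per_round=5):
    for _ in range(rounds):
        for m in machines:                      # possibly in parallel, one process per machine
            local_message_passing(m, sweeps_per_round)
        exchange(machines, shared_factors)      # occasional communication between machines
    return machines

# Toy usage: two "machines" sharing one duplicated factor belief (here just a number).
machines = [{"beliefs": {"alpha_shared": 0.3}}, {"beliefs": {"alpha_shared": 0.7}}]
distributed_inference(machines, ["alpha_shared"], rounds=3, sweeps_per_round=5)
```

In the actual system the exchange goes over the network, which is why exchanging less frequently can be faster in wall-clock time, as discussed below.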
So now that poses a few questions. First of all, can we really do large scale with that approach?
Second, how often do we have to exchange information? And third, how does it compare to
other state-of-the-art algorithms? So back then, when we did this work -- that was in 2010 -- there
was a library for discrete approximate inference, which we could compare against, and there was
an early version of the GraphLab framework, and since it was an early version, it wasn't yet
distributed, so it only operated in shared-memory environments. That's why we couldn't really
compare our distributed algorithm to GraphLab at that point in time, and the numbers are from
2010 or 2011.
So if you look at the runtime of the library for discrete approximate inference, I divided that by
four, because I had a four-core machine and the library for discrete approximate
inference didn't really use all those four cores. In terms of comparison to GraphLab, we
can see that our convex belief propagation approach, the general version of it, actually performs
in the same order of magnitude, so it's equally efficient, and efficiency is measured in how many
nodes it processes per microsecond. Yes?
>>: When you were doing this experiment -- when you did this experiment, you assumed that all
the theta variables are precomputed and stored, then just reused, fixed across all of them? So when
you're evaluating the function, they're like order one -- each theta-B and theta-alpha is just an
order-one lookup?
>> Alexander Schwing: Yes, exactly. So, to repeat the question, the question was whether the
cost function was pre-given. That is, with all the lookup tables, the theta-Bs and the theta-alphas
were given and precomputed, and yes, that's the case when doing this comparison. I don't
include times for computing those function values. That's correct.
So this upper part of the table, therefore, is like a comparison, like kind of a fair comparison,
where you can then see that the primal energies are in about the same order of magnitude. So
now if you go about and derive or code a dedicated algorithm for the task we were looking at --
by dedicated, I mean an algorithm that works on pairwise Markov random fields only -- so if I
take this dedicated algorithm, then I can see that I get some further improvement, naturally,
because I don't need to be that general.
If I now take this algorithm and further distribute it onto multiple machines, then I can get an
even higher speedup. The one thing to notice is that the primal energy is not quite at the level of the one that
was not distributed, so why is that? And to see that, we look at how often we have to
exchange information between the different machines? So we're maximizing the primal
function, and maximizing the primal means we have an equivalent problem that minimizes the
dual function, and when we measure the dual with respect to the number of iterations, we can
obviously see that exchanging information every single iteration, which is illustrated by the
yellow line, performs better than when exchanging information every five, 10, 20, 50 or 100
iterations.
But the story is different if we actually plot these same curves with respect to time. Naturally,
because when measuring with respect to the number of iterations, we didn't really include the
time into the measurement, and the time for the distributed algorithm comes also from the fact
that we need to exchange, that we need to transmit information between computers. So in our
experiments, you could see that exchanging information only every five, 10 or 20 iterations was
actually better than exchanging information every single iteration.
Naturally, that depends on the connection between the computers, which was in our case a
standard local area network connection, and it also depends on the problem we were looking at.
And the problem we were looking at was a standard Markov random field type of problem, like a
grid graph. For this example, the question was how many machines we were using, and for this
example we were using nine machines. So the problem we were looking at was a disparity
estimation problem. Here, you're given two images, in this case, and we wanted to compute the
disparity map. The images were larger than 10 megapixels. We used 250 discrete states per
pixel, so the graph was about 10 million nodes, 24 million edges, and due to the 250 discrete
states we used, the disparity map was actually kind of smooth, very smooth, actually, and we
could capture small deviations.
With that, I'm also going to conclude the first part of my talk, where we looked at how can we
distribute this score-maximization task, where the scores were those theta functions, those local
theta-Bs and those higher-order theta-alphas. So where did those thetas actually come from?
Well, in this case, someone gave them to us, or we computed them from an image, but
oftentimes, we have some data which we want to leverage in order to get to those thetas. So
within this second part, I'm now going to show you what you could do in order to get those
scoring functions. And I'm going to assume that we have a training set of data pairs, where X is
an image and S is, for example, a segmentation. So the inference task from before was this
maximization of a scoring function consisting of two types of theta terms, local terms and higher-order
terms.
Instead of now maximizing this scoring function, we can equivalently rewrite it as a
maximization over some inner product between a weight vector we are interested in and a feature
vector phi. Now, what we are interested in when learning is this weight vector W. So how do
we choose the weight vector W? Well, we are given some training set, right? So we should
leverage this training set somehow, and what we can do, or what we can ask for, is that a score W
transposed times phi should actually be smaller for any
possible segmentation we can find than the score for the ground-truth segmentation that is given
to us. Or, put differently, when maximizing the scores, we should always have a lower score
than the ground-truth configuration that is given to us, lower or the same in case the ground-truth
segmentation is within the space of the segmentations we are maximizing over.
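In symbols, writing s-hat for an arbitrary competing segmentation, the requirement is

```latex
w^\top \phi(x, \hat{s}) \;\le\; w^\top \phi(x, s) \qquad \text{for all } \hat{s},
```

with equality allowed when s-hat is the ground-truth segmentation itself.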
Or, put the other way around, we want to penalize within a cost function whenever we find a
maximizing score that scores larger than the ground-truth configuration, and that exactly yields
what is known as the hinge loss when we also include some margin, so we want to linearly
penalize whenever the maximum is within a margin L of the data score, or in equations, we
maximize over the space of segmentations some inner product between our weight vector and the
feature vector, plus some margin. And whenever this score is larger than the ground-truth
configuration, we want to linearly penalize, and that yields the cost function that is known
from the max margin Markov networks or the structured support vector machines, where we
minimize with respect to W some regularization term -- in this case, I used just standard L2
regularization -- and then I sum the difference of the
maximization minus the ground-truth score for all the training -- for all the pairs in the training
set, so for all the (x, s) pairs.
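Putting that together, the cost function just described reads (a sketch, with C denoting the regularization weight and L(s, s-hat) the margin term):

```latex
\min_{w} \;\; \frac{C}{2}\,\|w\|_2^2 \;+\; \sum_{(x,s)} \Big[ \max_{\hat{s}} \big( w^\top \phi(x,\hat{s}) + L(s,\hat{s}) \big) \;-\; w^\top \phi(x,s) \Big]
```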
But we are not quite where we want to be yet. Why? Because that would require us to annotate
every single image in our training set, and obviously, you know that annotating every single
image is very time consuming, it's costly and sometimes might not even be possible. So what we
ideally want to do is we want to work on training sets, or on data sets, that are not fully labeled.
I illustrate that via the black pixels in this image, for example, where a black pixel means we
don't know what label this pixel should be taking. So the data we are looking at, the complete
data we are looking at, consists of two parts -- the annotated part, which I denote by Y, and the
latent or hidden part, which I denote by H. So, again, the complete data consists of those two
parts, the annotated part Y and the hidden part H, and now, what do we want to achieve? What
do we want to optimize?
Therefore, we are looking at the weakly labeled hinge. We want to penalize whenever the
best overall prediction, so the best prediction over the joint space, over the complete data space,
exceeds the best prediction with the annotations being clamped. In equations, that means
whenever the score when maximizing over the joint space Y and H, which is the complete data
space, whenever that score is larger than when we just maximize over the latent space and clamp
the annotation to be whatever the user gave to us. Whenever the difference between the two is
actually larger than zero, we want to penalize. And that yields what is known as the latent
structured support vector machine framework, where we have a cost function which, again,
has an L2 regularization term, plus -- and now we sum again over the entire data set we are given
-- and we have the difference between those two maximization scores.
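In symbols, with the complete output split into the annotated part y and the hidden part h, the weakly labeled cost function is (a sketch in the notation used so far, with the margin left out just as on the slide):

```latex
\min_{w} \;\; \frac{C}{2}\,\|w\|_2^2 \;+\; \sum_{(x,y)} \Big[ \max_{\hat{y},\hat{h}} \, w^\top \phi(x,\hat{y},\hat{h}) \;-\; \max_{h} \, w^\top \phi(x,y,h) \Big]
```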
So here is the cost function again. I didn't change anything. I just rewrote it, slightly. And what
I want to do now is I want to generalize it, and the first thing I'm going to do is I want to
introduce the soft-max function. The soft-max function is a one-parameter extension of the max
function, and the parameter is epsilon. Whenever epsilon approaches zero, this soft-max
function smoothly approximates the max function, and whenever epsilon equals zero, the soft-max
function is equal to the max function.
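One standard choice for such a one-parameter extension, and presumably the one meant here, is the scaled log-sum-exp:

```latex
\operatorname{softmax}_{\varepsilon} \, f \;=\; \varepsilon \, \log \sum_{\hat{y},\hat{h}} \exp\!\big( f(\hat{y},\hat{h}) / \varepsilon \big),
```

which tends to the max as epsilon goes to zero and, for epsilon equal to one, becomes the log-partition function underlying the log loss.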
So we have the difference of max functions. Now I can just plug in this soft-max function and
we have one additional parameter, epsilon, right? Sure.
>> Sudipta Sinha: One (inaudible), in this setting, you're now calling this structured prediction
of latent variables, but this is not like saying that your output space is structured -- so what is the -- I'm
trying to understand in what sense this is structured prediction with latent variables?
>> Alexander Schwing: Okay. Let me try to rephrase the question. So the question was in what
sense, in what way, is this structured prediction where the output space consists of certain
structures? Is that right?
>> Sudipta Sinha: When you started, at the beginning, you showed an example of a place where
you wanted to do structured predictions, so I understood what you meant there, by saying that
you want to constrain the problem by using the structure. But in the setting here, where you
are learning, we're trying to do the segmentation, and the variables are -- could you explain how
this is a structured prediction problem?
>> Alexander Schwing: Okay, I see. So the question is, rather, in the beginning of my talk, I
was saying that we want to infer or want to do estimation in structured spaces, so what do I mean by
structured spaces with respect to the structure I'm looking at here? Because I guess the segmentation
problem is not per se the structured problem I was talking about at the very beginning. Let me
actually come back to that question at the end of the second part of my talk and the beginning of
the third part, where I'm going to explain more rigorously what I mean by this structure I was talking
about. So in this setting, I'm just talking about pure segmentation for now, but I will get to that
question. Good question.
So now, we can just replace those max functions with these soft-max functions, and this is what
we would get. Why would that make sense? Why is that a reasonable thing to do? Well, the
nice thing is that we now have two seemingly different frameworks within one cost
function, namely, the latent structured support vector machines and the hidden conditional
random fields, so for epsilon equal to zero, that's how I did the derivation and that's how I went
through my talk -- for epsilon equal to zero, we used the hinge loss or the weakly labeled hinge
loss. And, similarly, we could say we tried to go for a max margin framework, and what we got
was the latent structured support vector machine formulation.
Now, if I would use instead the log loss and go through the entire derivation using the log loss
and go for a maximum-likelihood approach, I would get the hidden conditional random field,
which would be equivalent to setting epsilon equal to one. And, in addition to having those two
seemingly different frameworks within one cost function, we actually have a whole set of
additional cost functions for different epsilons in between and beyond one.
And, obviously, what we could also do is we could again include the margin, which I left out for
clarity, but when including that, you get -- the cost function looks a little bit more complicated,
but nothing fancy happens. So, now, we want to optimize this cost function. What are the
challenges? So there are two challenges, two problems we need to solve. First of all, we
have these summations -- here are the summations -- over exponentially sized sets. So for
a segmentation problem, for example, we would need to sum over all possible segmentations, which is
an exponentially sized summation. And, in addition, we have the difference between two soft-max
functions, the difference between two max functions, and that means we are dealing with a
non-convex problem. So what can we do?
Let's look into what people did in the literature. So, to address those difficulties, let's look at how to
address exponentially sized sum or max operations, which we could assume to be efficiently
and exactly computed, and that would be the case if you're looking at foreground-background
segmentation, for example, and have submodular energies. Then we could use graph cuts
and the maximum operation would be exactly computed, or if you want to solve matching
problems, or if it's a tree-structured problem.
But, oftentimes -- in particular, oftentimes when talking about physically plausible spaces, we
don't have those structured problems, so we need to go for other approximations, and other
approximations could be local type of approximations, entropy type of approximations. In fact,
this is exactly what we are looking into. We are doing local approximations. The difference
with respect to previous work is that we are using convex duality-based entropy approximations.
And I'm going to explain in a minute why this is beneficial.
For the non-convexity -- that is, for the non-convexity, the difference between the max
operations, we are going to go for a standard approach, expectation maximization or the
concave-convex procedure type of approaches. What do the algorithms look like? Or how do the
algorithms with convex duality-based entropy approximations compare to existing algorithms?
So existing algorithms for hidden conditional random fields and latent structured support vector
machines, here illustrated on the left side, have kind of a double-loop type of structure. You
have an outer loop, and within the outer loop, you first need to solve the latent variable
prediction problem. That is, for everything that is not annotated, you need to figure out
either the maximizing state or the probability distribution over those variables.
And then, once you solve this latent variable prediction problem, which is an inference problem
that you need to solve to convergence, you can update your parameters. And the one
thing to make sure of is that you need to update your parameters to convergence before you go back
and re-solve the latent variable prediction problem, and you iterate between those two problems,
basically.
If you look at convex entropy approximations, we can get rid of this second requirement to
converge. So we still have to solve the latent variable prediction problem, as
before. But now, it's sufficient to just do a single step in the parameter vector direction before
then going back and solving the latent variable prediction problem. I'm saying it's sufficient,
because obviously you could also do more than a single parameter update. Obviously, this raises a
question. The advantage is that we have fewer inner loops -- we got rid of one convergence
requirement -- but how many updates should we do, or how often should we update before, again,
solving the more complex latent variable prediction problem? That's a question that we didn't
answer.
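Schematically, the two loop structures differ as follows (pure illustration; the solver calls are stubs with made-up names, not the actual implementation):

```python
# Stubs standing in for the real sub-solvers.
def solve_latent_prediction(w, x, y):
    return None   # placeholder: beliefs/assignment over the latent variables of (x, y)

def update_parameters_to_convergence(w, data, latent):
    return w      # placeholder: full inner optimization of w

def single_parameter_step(w, data, latent):
    return w      # placeholder: one gradient-type update of w

def classic_double_loop(w, data, outer_iters=10):
    """Existing latent SSVM / hidden CRF style: both inner problems run to convergence."""
    for _ in range(outer_iters):
        latent = [solve_latent_prediction(w, x, y) for (x, y) in data]
        w = update_parameters_to_convergence(w, data, latent)
    return w

def convex_approximation_loop(w, data, outer_iters=10, steps_per_round=1):
    """With convex duality-based entropy approximations: a single parameter step
    (or a few) between latent variable prediction problems suffices."""
    for _ in range(outer_iters):
        latent = [solve_latent_prediction(w, x, y) for (x, y) in data]
        for _ in range(steps_per_round):
            w = single_parameter_step(w, data, latent)
    return w
```

How large steps_per_round should be is exactly the open question mentioned above.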
Importantly, using those approximations, those convex entropy approximations, we got rid of the
assumption to solve those sub-problems, like matching problems or tree-structured problems,
exactly. So we don't have this requirement in the algorithm anymore, which contrasts with, for
example, latent structured support vector machines. So is there a benefit, or can we show that there is a
benefit for using those entropy approximations? I guess I wouldn't be talking about that if we
couldn't, and the benefit, first of all, is that since we have fewer inner loops, the training time, the
average training time, is lower compared to other algorithms, to standard algorithms.
Importantly, we could also show that in terms of performance, it pays off. What we did was we
increased the amount of latent variables like to 90% in this case, and we tried to train a simple
segmentation problem here.
If you compare our approach to the latent structured support vector machine, we see that latent
structured support vector machines actually learned the wrong model in this case. We wanted to
learn that neighboring pixels should have an equivalent labeling, but latent structured support
vector machine learned that neighboring pixels have the opposite label. So why did that happen?
What was the reason for that to happen? Well, the reason was that, due to the ties during the
latent variable prediction problem, the empirical means vector pointed in the wrong direction, or
had signs that were just opposite to what they should have been. And therefore,
latent structured support vector machine learned the wrong model.
That doesn't happen in the approach we use or we propose to use, and it also didn't happen in the
hidden conditional random fields framework. It didn't happen in the hidden conditional random
fields framework because we used the convex version of it, as well. There, we achieved equal
performance at the tradeoff of having a higher or longer training time. So, naturally, that is a
kind of a setup example, and what we want to show is that these approximations are actually also
useful in the real world. And one way to show that they are useful in the real world is if we can
extract information from weakly labeled data -- so what I want to show next is I want to look at
an example where we were given a fixed set of fully annotated samples -- and if you throw in
additional weakly labeled samples, then we can improve the performance.
So the task I am going to look at in this example is the indoor layout prediction task. So the task
was, given a single image, we want to predict the 3D parametric box that best describes the
observed room layout. So if you look into this room here, for example, you'll notice that a box, a
cube-like structure is a very good approximation for this room, for example, and that holds for
many other rooms that we are living in.
So how can we find this parametric box? And that also hinges a little bit on Sudipta's question,
where we were asking what these physically plausible configurations now are. And the
problem there is that we need to find an adequate parameterization: how can we actually
parameterize the problem such that we don't reason about general labelings in a structured
manner, but how do we reason about configurations or just the configurations we are interested
in. So how can we constrain that problem?
The challenge is to design a parameterization that makes sure that this is the case, and that is
often the key thing to look into. So what parameterization did we use? In our case, we
assumed the vanishing points to be given. When I'm saying we assumed them to be given, I
mean we ran a vanishing point detector. We didn't hand label them. We ran a vanishing point
detector and assumed those to be the true vanishing points. Obviously, there were also errors in
the vanishing points, but we assumed them still to be correct, because we wanted to assess what
is the real-world performance. Now, assuming the vanishing points to be given, the prediction of
a box is equivalent to predicting four variables, S sub 1 through S sub 4. Those four variables
essentially correspond to predicting the angles or the deviations between the connections of two
vanishing points and the ray we are interested in. So ray 1 through ray 4 in this example.
So, given those four rays, we know what the 3D parametric box looks like. So, now the next
question to answer is, once we have a parameterization that encodes this physically plausible
configuration, what are the features? What are the measurements we are interested in? And the
features we are looking into have been designed by other people, and those are orientation maps
and geometric context. Orientation maps are shown in this illustration here, and what our
features are going to do is -- I am going to do that with the hypothesized left
wall, which is shown via those brown rays, as an example. So the left wall will be something like
this region here. And so our features are going to count what our image cues, the
orientation maps and geometric context, are giving us as estimates for, for example, a left wall or
a front wall. So we are counting, effectively, colors. How much yellow is in a hypothesized left
wall? How much green is in a hypothesized left wall? How much red is in a hypothesized left
wall, and so on.
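As a toy version of that counting (assuming we already have a per-pixel label map from an image cue and a binary mask for the hypothesized wall; all names here are hypothetical):

```python
import numpy as np

def wall_counts(cue_labels, wall_mask, num_classes):
    """Count how many pixels inside a hypothesized wall region the image cue assigns
    to each class (e.g. left wall, front wall, ...). These counts act as the features."""
    return np.bincount(cue_labels[wall_mask], minlength=num_classes)

# Toy usage: a 4x4 label map with 3 classes and a mask covering the left half of the image.
cue_labels = np.array([[0, 1, 2, 2],
                       [0, 0, 1, 2],
                       [1, 0, 2, 2],
                       [0, 1, 1, 0]])
wall_mask = np.zeros((4, 4), dtype=bool)
wall_mask[:, :2] = True
print(wall_counts(cue_labels, wall_mask, num_classes=3))   # e.g. [5, 3, 0]
```

The real features are computed from the orientation map and geometric context cues over the hypothesized quadrilaterals; the sketch above only shows the counting idea.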
So the model we are going to get with those variables is then a loopy graphical model -- in this
case, luckily, not of very high order. In our case, it was either a pairwise graph, or we could
decompose it such that it was a pairwise graph, and we could model that. So now we have this
pairwise graphical model, loopy, which is neither submodular nor does it possess a treelike
structure. So that's why I talked in the second part about learning in those settings, and for
simplicity, I used the segmentation problem, although I could have just as well used this as an
example here.
So now what I wanted to look at is how does this -- how can we extract information from weakly
labeled data? And in order to show that, I'm going to compare to a fixed approach, a fixed
approach that trains only on fully annotated samples -- on five fully annotated samples, 10, 20
and 50 fully annotated samples. And I'm going to measure the performance, the prediction error
in terms of pixel-wise classification. So, obviously, we can see with the fixed approach, in blue,
that the more samples we throw into the training procedure, the better the performance, in this
case, because we only have a small number of fully annotated samples. Now, if, in addition, I
throw in 25, 50 or 100 weakly labeled samples with the number of angles being thrown away
being either two or three on the right-hand side, so 50% latent variables or 75% latent variables,
you can see that the prediction error goes down. So despite the fact we did some
approximations, we were capable of reducing the prediction error. So we were able to extract
some additional information.
>>: Is this -- why is it worse for the green one when you went all the way over to 50 annotated
examples and you have an additional 25 examples, which are weakly labeled? It looks like it's
slightly worse. Do you think that's just noise or do you think there's actually something to that?
>> Alexander Schwing: Okay, so the question was, why is it for the green curve, where we
throw in an additional 25 samples in case of the 50, why is it slightly worse than when just
training with a fixed number of samples? All those curves are averaged over 10 runs, I believe,
because within the data set, we obviously had to pick a subset of them. I would say it's mostly
noise that happens. I mean, this is then the error shown on the test set, so you can train on the
training set, yes, but what then happens on the test set is kind of a different story, right? Maybe
if I would average over 20 runs, it might be better. The important thing is that the differences are
actually minor.
So with that curve, I then also want to conclude the second part of my talk, where I wanted to
show that certain approximations can help to train, even if information is not fully provided. And
in the third part, I want to give you some more details on this estimation in physically plausible
configurations, meaning are there specific inference algorithms, or are we capable, or how are we
capable, of designing specific inference algorithms for certain problems. So, again, the problem is the
3D indoor scene understanding task I've been looking at, and what we want to do is, now, given
a linear model W, so in this part I'm going to assume someone did the training for me and I'm
given this linear model. Given this linear model W and some image X, how can we estimate the
layout using a feature map, where this feature map was, again, this counting of how much
yellow, how much green, and so on, was within the image cues.
The parameterization we are going to use, again, consists of those four variables, because this is
a fairly structured configuration that allows us to do exactly what we want it to do. And, further
on, the layout is fully specified if we know the four variables and if we know the vanishing
points. Obviously, what we are going to assume is that the Manhattan world assumption holds,
meaning that the walls are parallel, the floor and the ceiling are parallel, et cetera. And what we
are going to do is, instead of working with or reasoning about singly hypothesized layouts, we are
going to work in interval product spaces. So we are going to assume we are given a set of
layouts, and I'm going to specify the set of layouts via the letter S, and the set of layouts is
denoted by minimum angles and maximum angles. So for every one of the four random variables, we
are going to have a minimum angle and a maximum angle, and the minimum angle is illustrated
in this graph on the right-hand side down here via the black rays. And the maximum angle is
illustrated via the red rays.
So we're considering all the possible layouts for every angle being within one of those intervals,
so what could be a possible inference procedure? Well, we could go for a branch and bound
approach. How would that work? Well, suppose we are taking a set of layouts. We are going to
compute an upper bound for the score of the set of layouts, and we are going to put that into a
queue, in addition to the set. Then, we are going to take from this queue the highest-scoring set. We
are going to split it into two parts. We are going to score the two parts independently, put them
back into the queue, and keep going until the highest-scoring element we retrieve consists only of a
single hypothesized layout. And then the algorithm terminates, because there is no way of
splitting a single layout any further. When would that work?
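As a concrete picture of the procedure just described, here is a minimal branch-and-bound sketch over interval sets of the four angles. The bounding function is left as a stub here (illustrative only); its actual construction, and the two properties it must satisfy, are described next.

```python
import heapq

def upper_bound(interval_set):
    """Stub: return an upper bound on the score of every layout in the set.
    For a set containing a single layout it must return the exact score."""
    return 0.0  # placeholder for the bound built from the positive and negative parts

def is_single_layout(interval_set, tol=1e-3):
    return all(hi - lo <= tol for (lo, hi) in interval_set)

def split(interval_set):
    """Split the set along its widest angle interval into two halves."""
    i = max(range(len(interval_set)), key=lambda j: interval_set[j][1] - interval_set[j][0])
    lo, hi = interval_set[i]
    mid = 0.5 * (lo + hi)
    left, right = list(interval_set), list(interval_set)
    left[i], right[i] = (lo, mid), (mid, hi)
    return tuple(left), tuple(right)

def branch_and_bound(initial_set):
    # Max-priority queue via negated bounds: always expand the highest-scoring set.
    queue = [(-upper_bound(initial_set), initial_set)]
    while queue:
        neg_bound, s = heapq.heappop(queue)
        if is_single_layout(s):
            return s, -neg_bound        # a single layout cannot be split any further
        for child in split(s):
            heapq.heappush(queue, (-upper_bound(child), child))

# Toy usage: four angle intervals (e.g. in degrees), one per ray/variable.
best_set, best_score = branch_and_bound(((0.0, 90.0),) * 4)
```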
We are guaranteed to obtain the global optimum of our scoring function if two properties hold
for the scoring function. First of all, the scoring function has -- this upper-bounding scoring
function, F-bar, has to be a true upper bound for the cost function. So, for any possible layout
within any possible set, if we score this set, we have a true upper bound to the cost function.
And, in addition, what has to hold is exactness for a single hypothesis. So whenever we throw a
single hypothesis into the bounding function, we get the exact score.
Given those two properties, the above-mentioned branch and bound approach gives us the exact -- the global
optimum. So then the next question is, can we actually find such a bounding function, such an
F-bar, being a true upper bound? And what do we need to do in order to find one? Well, we
had this scoring function being this inner product, this inner product between our weight vector
W and our feature vector phi. We can divide this inner product into two parts, a positive scoring
part and a negative scoring part. And the positive scoring part, F-plus, consists of the weight
vector, or the weight vector element, multiplied by its corresponding feature vector, if the weight
vector element is positive. And the negative scoring function, F-minus, consists of the weight
vector multiplied by a feature vector if the weight vector element is smaller or equal to zero.
Why do we have two, a positive part and a negative part happening? The reason that works is
that we have -- that our feature vector is just counting, counting colors, and counting is always
positive. If counting is positive, the sign is uniquely determined by the sign of the weight vector.
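In symbols, with y a single layout and S a set of layouts, the split just described and the bound it is assembled into look like this (a sketch following the verbal description that comes next):

```latex
f(x, y) \;=\; f^{+}(x, y) \;+\; f^{-}(x, y),
\qquad
\bar{f}(x, S) \;=\; \max_{y \in S} f^{+}(x, y) \;+\; \min_{y \in S} f^{-}(x, y),
```

so that F-bar(x, S) is at least f(x, y) for every y in S, with equality when S contains a single layout.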
So now, having split up this function into two parts, how do we design this bounding function
F-bar? Well, what we need to do is we need to sum the maximum positive contribution of the
constituents and subtract the minimal negative part. For me, that is a pretty important
statement, so I'll iterate over that again. We want to sum the maximum positive contribution of
some interval, and we want to subtract the minimum negative part. So now consider a front wall.
Consider this front wall here, where we are given maximum angles by red rays and minimal
angles by black rays, and now we want to sum the maximal positive contribution. Well, the sum
of the maximal contribution is just summing up the quadrilateral -- everything that is positive
within the red quadrilateral. We can't get more positive than everything within this red
quadrilateral here.
What is the minimal negative part? Well, the minimal negative part is everything negative
within the black quadrilateral. We can't get -- this is the minimal negative part that we always
need to subtract. For all the other cases, we need to subtract more. So this bounding function
fulfills exactly the two properties from before. It's exact for a single hypothesis. If you have a
single hypothesis, then the red quadrilateral and the black quadrilateral are actually equal, so we
are going to extract -- the positive part and the negative part are exactly considering the same
areas, so the cost function is exactly recovered.
Also, it's a true upper bound, because we are taking always more positive and less negative, or
the least amount of negative. It works for the front wall case, and similarly, it works for all the
other walls. I'm not going to go through it. There is an illustration for the left wall there, as well,
but I'll skip that part. So now, we designed this bounding function, and we need to efficiently
compute that. We rely on prior work for that, and I'm going to skip that part. I'm going to show
you next a little bit about the performance.
I said already what we use as image cues, as features, are orientation maps and geometric
context. And what we are going to measure -- I'm going to show some images of orientation
maps and geometric context later. And what we are measuring is the pixel-wise prediction error
on the images, on some specific data set, which is the layout data set. There was obviously some
previous work on that task, and people usually use different image cues, like orientation maps,
geometric context, or both of them, and we obtain quite decent performance in actually a fraction
of a second. So the branch and bound algorithm is quite fast in this case, and it achieves a
reasonable performance. So how do those results look, visually? Well, here are some examples.
So I show some images, and in the middle column, I'm going to show the orientation maps. On
the right-hand side, you can see the geometric context, which are some of the feature cues. And for the
feature cues on their own, at least for geometric context, you can see here the performance would
be 28.9%.
So on the images, you can see overlaid our prediction result in red, and in blue we are going to
show what will be the best-possible prediction if we would have the ground truth available. So
I'm showing the blue because that illustrates a little bit the discretization artifacts. We are
discretizing everything, but there are not really many errors to see with respect to
discretization.
>>: The right-most line, again, is that your result? The right most?
>> Alexander Schwing: The right most is like this one?
>>: Yes.
>> Alexander Schwing: No, this is an input image. Like this is geometric context, which we are
using as an input feature. So the colors here -- exactly. The red is...
>>: Blue is a ground truth.
>> Alexander Schwing: Yes. You could see that blue is the ground truth.
>> Sudipta Sinha: So I have a question about this setting. I don't know if it's this particular data
set or if there are more like this, but in these data sets, there are objects in the scene.
>> Alexander Schwing: Yes.
>> Sudipta Sinha: And in some sense, the objects are like distractors, so the ground truth, as I
understand, does not mask out these pixels. All the pixels are just included in the three walls
around the axes.
>> Alexander Schwing: Yes.
>> Sudipta Sinha: So then when people do learning on this data set, on ground truth, the data
sets, how is the learning not getting affected by all these objects that are distractors? In some
cases, they are lined up with the orientation of the box, of the room, but in many cases they are
not. Do you have a sense of what happens?
>> Alexander Schwing: So to repeat the question, the question was in many scenes there are
objects there, and objects kind of distract also the learning algorithm, so how does the learning
not get affected? The answer is, the learning does get affected. In fact, it does get distracted,
and also inference does get distracted. As you can see in the two failure cases on the bottom
here, we're actually predicting the red layout, which is kind of obviously distracted by the objects
in that case.
There is -- at least if you don't model the object, then it will distract you, correct. Another source
of failures is wrong vanishing points, so for this image on the right-hand side, you could see that
the vanishing point should be somewhere in the middle of the image, but it's actually far to the
right, outside the image, so since we assume those to be given and fixed, we cannot really do
much about that.
Yes, I'm actually -- I can show you also some video here, for some of the results. So, again,
what we assume is that given a single image, like this one here, what we want to do is, we want
to predict the layout. But if you then just have the layout as illustrated here on the right-hand
side, and you want to render scenes from a new viewpoint, what happens is that you're going to
obtain results or renderings that are actually rather distracting. So the goal, or an interesting
additional thing to be done, would be to kind of model not only the layout itself, but also the
objects on top of it. I haven't talked much about that, but it's
not really published work yet.
Again, on this example, you could see that, given a single image, you can see that you get
significant artifacts if you just render the scene as a single box. And then on this last image,
what you can also see is that, obviously, for some of the pixels, we don't have any texture
information, so what happens if we are predicting objects on top is that we do get some black
regions for which we don't really know what's happening. Obviously, we could do some
inpainting there, but we haven't yet looked into that task.
With that video, I also want to conclude my talk and quickly recap. The whole motivation was
to work in physically plausible configuration spaces. What we need to do, because those can
get fairly large, is inference that is somewhat efficient. One way of doing it efficiently
is to distribute the task onto as many computers as you have available. In the second
part, I then told you that those models generally possess a very general structure, so we are
required to learn in general graphical models even if we don't have everything annotated, because
annotation in large environments is time consuming and costly. And in the third part, I then
showed you some results regarding a 3D indoor scene understanding task. So thanks for your
attention, and I'm looking forward to your questions.
Any more questions?
>>: We're going to be chatting later.
>> Alexander Schwing: Sure. Thanks.