>> Sudipta Sinha: Good afternoon. It's a pleasure to have Alexander Schwing here with us today. Alexander is a fourth-year PhD student at ETH Zurich, where he is doing his PhD under Marc Pollefeys in the Computer Vision and Geometry Group, and he is also co-advised by Tamir Hazan and Raquel Urtasun at the Toyota Technological Institute at Chicago, TTI-Chicago. He has been doing some really interesting work on efficient structured prediction, applying it to 3D scene understanding and stereo matching, and he's here to talk about it today. >> Alexander Schwing: Thanks, Sudipta, for the introduction. So what do I mean by efficient inference and learning for structured models? Suppose you are given this indoor image on the left-hand side, where you are interested in predicting the location of the walls. You could go for a pixel-wise independent prediction approach and obtain a result similar to the one illustrated in the middle. Obviously, that's kind of noisy. So wouldn't it be kind of cool if you could jointly, yet efficiently, reason about physically plausible and hence structured configurations and obtain a result similar to the one illustrated on the right-hand side? Or, to give you another example, suppose you are given this rectified façade image, and we are interested in finding the location of the windows, the balconies, the wall and so on. Again, you could go for a pixel-wise independent prediction approach, but I would argue that reasoning within the space of physically plausible configurations actually gives you results that are visually a lot more pleasing. So what does it take to do reasoning in those physically plausible, structured spaces? Well, first of all, we need to do this inference task somewhat efficiently, and one way of doing the inference task efficiently is by distributing it onto all the computational resources that you have available. This is going to be the first part of my talk, where I'm going to show you one way of doing that. Not only do you want to do inference somewhat efficiently; you also want to learn in those structured models, and those structured models are usually fairly general. So in the second part of my talk, I'm going to give you some ideas on what you can do if you have general graphical models, also with latent variables, and how to learn in those. And in the third part, I want to give you some details regarding the motivating layout application, before concluding with some hints on possible future work. So let's look around us a little bit, and let's see what has happened with technology in the past couple of years. If you look at your cellphone, for example, it probably has two cores, four cores or even eight cores these days, so multicore environments are everywhere, and obviously we want to leverage those for the problems we are looking at, so we want to distribute our tasks. In addition, our cellphones acquire more information than ever before. We have sensors that acquire bigger and bigger data sets at ever higher rates, and we don't want to throw away this information, so we want to reason within this large-scale setting. And, obviously, I already mentioned that we want to reason in physically plausible -- that is, somewhat structured -- spaces, and it would be a waste if we didn't leverage the structure, which usually constrains the solution somehow.
So the task, or the question I'm going to ask, is: how can we formulate this reasoning task, this inference task, so that it can be distributed with respect to both computation and memory? I first need to tell you what I mean by this reasoning, by this inference task. By inference, I mean a standard maximum a posteriori, or score-maximization, task: we are interested in a set of variables, s sub 1 through s sub N, and we want to maximize some scoring function. This scoring function consists of a couple of terms: local score terms, theta sub v, and higher-order scoring terms, theta sub alpha, where by higher order I mean terms that depend on two or more variables. I'm also going to assume that the variables we are interested in are discrete variables. If the variables are discrete, then these local scoring functions, these theta sub v's, are nothing else than simple lookup tables. And the structure of the problem, the structure I've been talking quite a bit about at the very beginning, is fully encoded in those higher-order scoring functions. A nice way of visualizing this structure is via what is known as a factor graph. I give one example of a factor graph here on the slide: we have four variables in this example, s sub 1 through s sub 4, and two factors, one depending on three variables, and the other one, illustrated as a green rectangle, depending on two variables. And we draw an edge between a rectangle and a node, that is, a variable, if the factor depends on that variable. Since all the variables are discrete, the factors also just depend on discrete variables and therefore are just lookup tables, as well. Now, how do we solve this maximization task? The first thing to notice is probably that this maximization task is equivalent to an integer linear program, and to show that, I want to walk you through a small example, namely the example of those two nodes being connected by one factor. In order to rephrase the problem, we are going to introduce some variables. I denote them by b sub v and b sub alpha, for beliefs. Now suppose we multiply every belief variable with the corresponding score and maximize. Obviously, we don't quite get what we'd like, because the value of that cost function would be unbounded: we would simply push a belief to plus infinity or minus infinity, depending on whether the score theta is positive or negative. So we need to introduce some constraints. What can we do? Well, the first thing that could be done is to ask for the beliefs to be either zero or one. If we do that, then the cost function is at least bounded. We are still not quite where we want to be, because we could select multiple scores, or none, for a variable. So we need to somehow constrain ourselves to selecting at least one state per factor and one state per local scoring function. How are we going to do that? Well, we are going to enforce that by requiring that the beliefs, both the local beliefs and the factor beliefs, sum to one. Combined with the constraint above, we are therefore required to pick at least one and also at most one of the scores. We are almost at the place where we want to be.
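In symbols, the program assembled so far for the two-variable, one-factor example reads roughly as follows (a reconstruction from the description above; the notation is a guess at what is on the slides):

\[
\begin{aligned}
\max_{b}\quad & \sum_{v \in \{1,2\}} \sum_{s_v} b_v(s_v)\,\theta_v(s_v) \;+\; \sum_{s_1,s_2} b_\alpha(s_1,s_2)\,\theta_\alpha(s_1,s_2)\\
\text{s.t.}\quad & b_v(s_v) \in \{0,1\},\qquad b_\alpha(s_1,s_2) \in \{0,1\},\\
& \sum_{s_v} b_v(s_v) = 1 \;\;\text{for } v = 1,2, \qquad \sum_{s_1,s_2} b_\alpha(s_1,s_2) = 1.
\end{aligned}
\]

The remaining ingredient, introduced next, is the marginalization constraint tying the factor beliefs to the local beliefs, e.g. \(\sum_{s_2} b_\alpha(s_1,s_2) = b_1(s_1)\) for every \(s_1\), and symmetrically for \(s_2\).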
The one thing that is still missing is that the factor beliefs should be consistent with the local beliefs, and the way to enforce that is via what is known as a marginalization constraint, which I illustrate here in a general form. So the cost function is linear, the constraints are linear, and there are some integrality constraints. Hence, we obtain an integer linear program, which we need to solve. So here's the integer linear program again, and now let me rewrite it and simplify it somewhat. Instead of always writing out this inner product for the cost function, I'm going to just use the sum notation. It might look a little bit messy, but it's nothing else than an inner product, actually. Then I'm going to refer to the marginalization constraint by its name, and the requirement that the local beliefs and the factor beliefs have to sum to one -- I'm going to say that I want those beliefs to be local probability distributions. I'm also going to include in this local probability constraint the fact that I want the beliefs to be larger than or equal to zero. And then, if in the next step I throw away the integrality constraint, what I end up with is what is known as an LP relaxation. So now we are ready to state our initial problem, the problem I posed, the question I asked, where I said, "How can we distribute inference?" We are now in a position to make that a little bit more specific. So what is the goal? What do we want to achieve? First of all, we want to optimize this LP relaxation objective, and obviously, we want to leverage the problem structure, the structure that is encoded within these higher-order scoring functions. From the initial task description, we want to distribute the memory and the computation requirements. And, importantly, we want to maintain the convergence and optimality guarantees of existing algorithms. How are we going to do that? Well, we are going to go for what is known as a dual-decomposition approach, so let me try to give you the intuition of what I mean by dual decomposition in this case. Suppose we are given this grid-like graph of two-by-four random variables, or two-by-four nodes, where the edges are now the factors, the connections between the variables, and I want to distribute this graphical model onto two machines, onto two computers, kappa-1 and kappa-2. Now, a computer has to hold a belief only if the variable is also assigned to that computer, so computer kappa-1 doesn't need to know anything about the beliefs b(v) of the variables that are not assigned to it -- b(3) to b(8), for example; those are simply not assigned to that computer. Similarly, a computer has to hold a factor belief if the factor depends on at least one variable that is assigned to this computer. That already shows you that there is some distribution with respect to memory going on. We don't have all the beliefs on all the computers, but, obviously, there is a catch. On every computer, we naturally need to enforce the marginalization constraints and local belief constraints, the ones I introduced earlier, but we can see that there might be factor beliefs that occur on both computers, kappa-1 and kappa-2, independently. Naturally, we need to make sure that those beliefs are consistent.
I'm showing that via those two parallel lines here. The beliefs that occur on computers independently are illustrated by those parallel lines, and we need to make sure that, upon convergence, the value on both computers is actually equal. With this additional consistency constraint, we are ready to state the program we want to solve. Again, it's a linear program, which we can write in a way that is parallel in terms of the computers kappa, just by rearranging the linear cost function, and we're going to have local probability constraints on every computer, marginalization constraints on every computer, and those consistency constraints on every computer. So now you might say, wait a minute, this looks like everything is decoupled now -- that cannot be the case. And, obviously, you're right. We cannot completely decouple this problem; it was originally coupled. The coupling occurs only in those consistency constraints. Those are the constraints which tie together the individual problems. The nice thing now is that this program allows us to derive an algorithm, and I want to give you the intuition of what the algorithm looks like. In fact, the algorithm consists of two parts. The first part of the algorithm is standard message passing, independently on all the computers kappa, and therefore possibly in parallel. So you can do message passing in parallel on all the computers you have. But, to make sure that, upon convergence, the variables are consistent, you need to exchange information occasionally, before then going back and doing some more message passing, and iterating between those two steps. So you iterate back and forth between message passing and exchanging information, and then some more message passing. In fact, you can see that the exchange of information is nothing else than another type of message passing, on a separate graph, actually. Now that poses a few questions. First, can we really do large scale with that approach? Second, how often do we have to exchange information? And third, how does it compare to other state-of-the-art algorithms? So back then, when we did this work, that was in 2010, there was a library for discrete approximate inference (libDAI), which we could compare against, and there was an early version of the GraphLab framework, and that early version wasn't yet distributed, so it only operated in shared-memory environments. That's why we couldn't really compare our distributed algorithm to GraphLab at that point in time, and the numbers are from 2010 or 2011. If you look at the runtime of the library for discrete approximate inference, I divided that by four, because I had a four-core machine and the library didn't really make use of all four cores. In terms of the comparison to GraphLab, we can see that our convex belief propagation approach, the general version of it, actually performs in the same order of magnitude, so it's equally efficient, where efficiency is measured in how many nodes it processes per microsecond. Yes? >>: When you were doing this experiment -- when you did this experiment, you assumed that all the theta variables are precomputed and stored, then reused, just fixed across all of them? So you're evaluating the function -- each theta-v and theta-alpha is just an order-one lookup? >> Alexander Schwing: Yes, exactly.
So, to repeat the question: the question was whether the cost function was pre-given, that is, whether all the lookup tables, the theta-v's and the theta-alphas, were given and precomputed, and yes, that's the case in this comparison. I don't include the time for computing those function values. That's correct. So this upper part of the table is a kind of fair comparison, where you can then see that the primal energies are in about the same order of magnitude. Now, if you go about and derive or code a dedicated algorithm for the task we were looking at -- by dedicated, I mean an algorithm that works on pairwise Markov random fields only -- then I get some further improvement, naturally, because I don't need to be that general. If I now take this algorithm and further distribute it onto multiple machines, then I get an even higher speedup. The one thing to notice is that the primal energy is not quite at the level of the one that was not distributed, so why is that? To see that, we look at how often we have to exchange information between the different machines. We're maximizing the primal function, and maximizing the primal means we have an equivalent problem that minimizes the dual function, and when we measure the dual with respect to the number of iterations, we can obviously see that exchanging information every single iteration, which is illustrated by the yellow line, performs better than exchanging information every five, 10, 20, 50 or 100 iterations. But the story is different if we plot these same curves with respect to time. Naturally, that's because when measuring with respect to the number of iterations, we didn't really include the time in the measurement, and the time for the distributed algorithm also comes from the fact that we need to transmit information between computers. So in our experiments, you could see that exchanging information only every five, 10 or 20 iterations was actually better than exchanging information every single iteration. Naturally, that depends on the connection between the computers, which in our case was a standard local area network connection, and it also depends on the problem we were looking at. And the problem we were looking at was a standard Markov random field type of problem, like a grid graph. For this example -- the question was how many machines we were using -- we were using nine machines. The problem we were looking at was a disparity estimation problem. Here, you're given two images, in this case, and we wanted to compute the disparity map. The images were larger than 10 megapixels, and we used 250 discrete states per pixel, so the graph had about 10 million nodes and 24 million edges, and due to the 250 discrete states, the disparity map was actually very smooth and we could capture small deviations. With that, I'm also going to conclude the first part of my talk, where we looked at how we can distribute this score-maximization task, where the scores were those theta functions, those local theta-v's and those higher-order theta-alphas. So where did those thetas actually come from? Well, in this case, someone gave them to us, or we computed them from an image, but oftentimes we have some data which we want to leverage in order to get to those thetas.
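As a rough illustration of the structure of this first part -- local message passing on every machine, with an occasional exchange step to make the duplicated beliefs consistent -- here is a schematic Python sketch. The per-machine update and the averaging-based exchange are placeholders standing in for the actual convex belief propagation and consistency updates, not the implementation used in the talk:

```python
import numpy as np

# Schematic sketch only: the per-machine update below is a generic placeholder,
# not the convex belief propagation of the talk; it just illustrates the loop
# structure of "message passing in parallel, exchange information occasionally".

def local_update(belief, local_scores, lr=0.1):
    """Placeholder for one round of message passing on a single machine."""
    return belief + lr * (local_scores - belief)

def exchange(beliefs):
    """Consistency step: duplicated beliefs are averaged across machines,
    mimicking the role of the consistency constraints."""
    avg = np.mean(beliefs, axis=0)
    return [avg.copy() for _ in beliefs]

def distributed_inference(per_machine_scores, iters=100, exchange_every=10):
    beliefs = [np.zeros_like(s) for s in per_machine_scores]
    for t in range(iters):
        # message passing runs independently (hence possibly in parallel)
        beliefs = [local_update(b, s) for b, s in zip(beliefs, per_machine_scores)]
        if (t + 1) % exchange_every == 0:
            # occasional exchange over the network
            beliefs = exchange(beliefs)
    return exchange(beliefs)  # final consistency step

if __name__ == "__main__":
    scores = [np.random.randn(5) for _ in range(2)]  # two machines, toy scores
    print(distributed_inference(scores, exchange_every=5)[0])
```

The `exchange_every` parameter is exactly the knob discussed above: exchanging every iteration converges in fewer iterations, but exchanging less often can be faster in wall-clock time when communication is expensive.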
So within this second part, I'm now going to show you what you can do in order to get those scoring functions. I'm going to assume that we have a training set of data pairs, where x is an image and s is, for example, a segmentation. The inference task from before was this maximization of a scoring function consisting of two kinds of terms: local terms and higher-order terms. Instead of maximizing this scoring function directly, we can equivalently rewrite it as a maximization over an inner product between a weight vector we are interested in and a feature vector phi. Now, what we are interested in when learning is this weight vector w. So how do we choose the weight vector w? Well, we are given a training set, right? So we should leverage this training set somehow, and what we can ask for is that the score, w transposed times phi, should actually be smaller for any possible segmentation we can find than the score for the ground-truth segmentation that is given to us. Or, put differently, when maximizing the scores, we should always have a lower score than the ground-truth configuration that is given to us -- lower or the same, in case the ground-truth segmentation is within the space of segmentations we are maximizing over. Or, put the other way around, we want to penalize within a cost function whenever we find a maximizing score that scores larger than the ground-truth configuration, and that yields exactly what is known as the hinge loss when we also include some margin: we want to linearly penalize whenever the maximum is within a margin L of the data score. Or, in equations, we maximize over the space of segmentations the inner product between our weight vector and the feature vector, plus some margin, and whenever this score is larger than that of the ground-truth configuration, we linearly penalize. That yields the cost function known from max-margin Markov networks or structured support vector machines, where we minimize with respect to w some regularization term -- in this case, just standard L2 regularization -- plus the sum of the differences between the maximization and the ground-truth score for all the pairs in the training set, so for all the (x, s) pairs. But we are not quite where we want to be yet. Why? Because that would require us to annotate every single image in our training set, and obviously, you know that annotating every single image is very time consuming, it's costly, and sometimes it might not even be possible. So what we ideally want to do is work on training sets, or data sets, that are not fully labeled. I illustrate that via the black pixels in this image, for example, where a black pixel means we don't know what label this pixel should take. So the complete data we are looking at consists of two parts: the annotated part, which I denote by y, and the latent or hidden part, which I denote by h. So, again, the complete data consists of those two parts, the annotated part y and the hidden part h, and now, what do we want to achieve? What do we want to optimize? For this, we are looking at the weakly labeled hinge loss: we want to penalize whenever the best overall prediction, so the best prediction over the joint space, over the complete data space, exceeds the best prediction with the annotations being clamped.
In equations, that means: whenever the score when maximizing over the joint space of y and h, which is the complete data space, is larger than the score when we just maximize over the latent space and clamp the annotation to whatever the user gave us -- whenever the difference between the two is larger than zero -- we want to penalize. And that yields what is known as the latent structured support vector machine framework, where we have a cost function which, again, has an L2 regularization term, plus -- and now we sum again over the entire data set we are given -- the difference between those two maximization scores. So here is the cost function again. I didn't change anything; I just rewrote it slightly. What I want to do now is generalize it, and the first thing I'm going to do is introduce the soft-max function. The soft-max function is a one-parameter extension of the max function, and the parameter is epsilon. As epsilon approaches zero, this soft-max function smoothly approximates the max function, and for epsilon equal to zero, the soft-max function is equal to the max function. So we have the difference of max functions, and now I can just plug in this soft-max function, and we have one additional parameter, epsilon, right? Sure. >> Sudipta Sinha: One (inaudible), in this setting, you're now calling this structured prediction with latent variables, but this is not like saying that your output space is structured -- so what is the -- I'm trying to understand in what sense this is structured prediction with latent variables. >> Alexander Schwing: Okay. Let me try to rephrase the question. So the question was in what sense, in what way, is this structured prediction, where the output space consists of certain structures? Is that right? >> Sudipta Sinha: When you started, at the beginning, you showed an example of a place where you wanted to do structured prediction, so I understood what you meant there, by saying that you want to constrain the problem, as opposed to the unstructured approach. But in the setting here, where you are learning, we're trying to do the segmentation, and the variables are -- could you explain how this is a structured prediction problem? >> Alexander Schwing: Okay, I see. So the question is, rather: at the beginning of my talk, I was saying that we want to infer, or do estimation, in structured spaces, so what do I mean by structured spaces with respect to the structure I'm looking at here -- because I guess the segmentation problem is not per se the kind of structured problem I was talking about at the very beginning. Let me come back to that question at the end of the second part of my talk and the beginning of the third part, where I'm going to explain more precisely what I mean by this structure. So in this setting, I'm just talking about pure segmentation for now, but I will get to that question. Good question. So now, we can just replace those max functions with these soft-max functions, and this is what we get. Why would that make sense? Why is that a reasonable thing to do? Well, the nice thing now is that we have two seemingly different frameworks within one cost function, namely the latent structured support vector machine and the hidden conditional random field. For epsilon equal to zero -- that's how I did the derivation and that's how I went through my talk -- we used the hinge loss, or the weakly labeled hinge loss.
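For reference, the soft-max referred to here is the usual temperature-smoothed maximum, written as a log-sum-exp (the exact form on the slides may differ):

\[
\operatorname{softmax}^{\epsilon}_{s \in \mathcal{S}} f(s) \;=\; \epsilon \,\log \sum_{s \in \mathcal{S}} \exp\!\left(\frac{f(s)}{\epsilon}\right),
\qquad
\lim_{\epsilon \to 0}\; \operatorname{softmax}^{\epsilon}_{s \in \mathcal{S}} f(s) \;=\; \max_{s \in \mathcal{S}} f(s),
\]

while for \(\epsilon = 1\) it is the log-partition function that appears in (hidden) conditional random fields.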
And, similarly, we could say we tried to go for a max-margin framework, and what we got was the latent structured support vector machine formulation. Now, if I instead use the log loss, go through the entire derivation with the log loss, and go for a maximum-likelihood approach, I get the hidden conditional random field, which is equivalent to setting epsilon equal to one. And, in addition to having those two seemingly different frameworks within one cost function, we actually have a whole set of additional cost functions for different epsilons in between and beyond one. Obviously, what we could also do is again include the margin, which I left out for clarity; when including it, the cost function looks a little bit more complicated, but nothing fancy happens. So, now, we want to optimize this cost function. What are the challenges? There are two challenges, two problems we need to solve. First of all, we have these summations -- here are the summations -- over exponentially sized sets. For a segmentation problem, we would need to sum over all possible segmentations, which is an exponentially sized summation. And, in addition, we have the difference between two soft-max functions, the difference between two max functions, and that means we are dealing with a nonconvex problem. So what can we do? Let's look into what people did in the literature. To address those difficulties, let's first look at how to address the exponentially sized sum or max operations. We could assume them to be efficiently and exactly computable, and that would be the case if you're looking at foreground-background segmentation, for example, and have submodular energies -- then we could use graph cuts and the maximum operation would be computed exactly -- or if you want to solve matching problems, or if it's a tree-structured problem. But oftentimes -- in particular, oftentimes when talking about physically plausible spaces -- we don't have those nicely structured problems, so we need to go for other approximations, and those could be local type approximations, entropy type approximations. In fact, this is exactly what we are looking into: we are doing local approximations. The difference with respect to previous work is that we are using convex duality-based entropy approximations, and I'm going to explain in a minute why this is beneficial. For the non-convexity -- that is, for the difference between the max operations -- we are going to go for a standard approach, expectation-maximization or concave-convex procedure type approaches. What do the algorithms look like? Or, how do the algorithms with convex duality-based entropy approximations compare to existing algorithms? Existing algorithms for hidden conditional random fields and latent structured support vector machines, illustrated here on the left side, have a kind of double-loop structure. You have an outer loop, and within the outer loop, you first need to solve the latent variable prediction problem. That is, for everything that is not annotated, you need to figure out either the maximizing state or the probability distribution over those variables. And then, once you have solved this latent variable prediction problem, which is an inference problem that you need to solve to convergence, you can update your parameters.
And the one thing to note is that you need to update your parameters to convergence before you go back and re-solve the latent variable prediction problem, and you iterate between those two problems, basically. If we use convex entropy approximations, we can get rid of this second requirement to converge. We still have to solve the latent variable prediction problem, as before, but now it's sufficient to just do a single update step on the parameter vector before going back and solving the latent variable prediction problem again. I'm saying it's sufficient because, obviously, you could also do more than a single parameter update. And obviously this raises a question: the advantage is that we have fewer inner loops -- we got rid of one convergence requirement -- but how many updates should we do, or how often should we update, before again solving the more complex latent variable prediction problem? That's a question that we didn't answer. Importantly, using those convex entropy approximations, we also got rid of the assumption that we can solve those sub-problems, like matching problems or tree-structured problems, exactly. We don't have this requirement in the algorithm anymore, which contrasts with, for example, latent structured support vector machines. So is there a benefit, or can we show that there is a benefit, to using those entropy approximations? I guess I wouldn't be talking about this if we couldn't, and the benefit, first of all, is that since we have fewer inner loops, the average training time is lower compared to other, standard algorithms. Importantly, we could also show that it pays off in terms of performance. What we did was increase the fraction of latent variables, to 90% in this case, and we tried to train a simple segmentation model. If you compare our approach to the latent structured support vector machine, we see that the latent structured support vector machine actually learned the wrong model in this case: we wanted to learn that neighboring pixels should have the same labeling, but the latent structured support vector machine learned that neighboring pixels have the opposite label. So why did that happen? What was the reason? Well, the reason was that, due to ties during the latent variable prediction problem, the empirical means vector pointed in the wrong direction, or had signs that were just opposite to what they should have been, and therefore the latent structured support vector machine learned the wrong model. That doesn't happen in the approach we propose to use, and it also didn't happen in the hidden conditional random fields framework -- it didn't happen there because we used the convex version of it, as well. There, we achieved equal performance, at the tradeoff of a longer training time. Now, naturally, that is kind of a set-up example, and what we want to show is that these approximations are also useful in the real world. One way to show that they are useful in the real world is to show that we can extract information from weakly labeled data -- so what I want to show next is an example where we are given a fixed set of fully annotated samples, and if we throw in additional weakly labeled samples, we can improve the performance. The task I am going to look at in this example is the indoor layout prediction task.
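Before turning to that task, here is a schematic Python sketch of the single-loop training scheme just described: solve the latent-variable prediction problem, then take a single parameter step, and repeat. The inference oracles are brute-force placeholders, the margin/loss term is omitted, and a plain hard-max subgradient stands in for the convex-dual updates of the actual method:

```python
import numpy as np

# Schematic sketch only: brute-force inference oracles, hard max instead of the
# convex soft-max dual, and no margin/loss term; it just illustrates "solve the
# latent-variable prediction problem, then take a single parameter step".

def infer_latent(w, x, y, feat, latent_space):
    """Latent-variable prediction: best h with the annotation y clamped."""
    return max(latent_space, key=lambda h: w @ feat(x, y, h))

def infer_joint(w, x, label_space, latent_space, feat):
    """Best overall prediction over the complete (y, h) space."""
    return max(((y, h) for y in label_space for h in latent_space),
               key=lambda yh: w @ feat(x, *yh))

def train(data, feat, label_space, latent_space, dim, epochs=20, lr=0.01, reg=1e-3):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            h_hat = infer_latent(w, x, y, feat, latent_space)       # inference to convergence
            y_star, h_star = infer_joint(w, x, label_space, latent_space, feat)
            grad = feat(x, y_star, h_star) - feat(x, y, h_hat)      # subgradient of the difference
            w -= lr * (grad + reg * w)                              # a single parameter update
    return w

if __name__ == "__main__":
    # Toy usage: x is a 2-vector, y and h are binary labels.
    feature = lambda x, y, h: np.array([x[0] * y, x[1] * h, y * h], dtype=float)
    samples = [(np.array([1.0, 0.5]), 1), (np.array([-1.0, 0.2]), 0)]
    print(train(samples, feature, label_space=[0, 1], latent_space=[0, 1], dim=3))
```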
So the task is: given a single image, we want to predict the 3D parametric box that best describes the observed room layout. If you look at this room here, for example, you'll notice that a box, a cube-like structure, is a very good approximation for this room, and that holds for many other rooms that we live in. So how can we find this parametric box? And that also connects a little bit to Sudipta's question, where we were asking what these physically plausible configurations are. The key there is that we need to find an adequate parameterization: how can we parameterize the problem such that we don't reason about general labelings, but rather, in a structured manner, only about the configurations we are interested in? How can we constrain the problem? The task is to design a parameterization that makes sure this is the case, and that is often the key thing to look into. So what parameterization did we use? In our case, we assumed the vanishing points to be given. When I say we assumed them to be given, I mean we ran a vanishing point detector; we didn't hand-label them. We ran a vanishing point detector and assumed those to be the true vanishing points. Obviously, there were also errors in the vanishing points, but we assumed them to be correct nonetheless, because we wanted to assess the real-world performance. Now, assuming the vanishing points to be given, predicting a box is equivalent to predicting four variables, s sub 1 through s sub 4. Those four variables essentially correspond to the angles, or the deviations, between the line connecting two vanishing points and the ray we are interested in -- rays 1 through 4 in this example. Given those four rays, we know what the 3D parametric box looks like. So the next question is: once we have a parameterization that encodes the physically plausible configurations, what are the features? What are the measurements we are interested in? The features we are using have been designed by other people; they are orientation maps and geometric context. Orientation maps are shown in this illustration here, and what our features are going to do -- I'm going to explain that with the hypothesized left wall, which is shown via those brown rays, as an example, so the left wall would be something like this region here. Our features are going to count what our image cues, the orientation maps and the geometric context, say, giving us estimates for, for example, a left wall or a front wall. So we are counting, effectively, colors: how much yellow is in a hypothesized left wall? How much green is in a hypothesized left wall? How much red is in a hypothesized left wall, and so on? The model we get with those variables is then a loopy graphical model -- in this case either higher order or, in our case, a pairwise graph, or at least we could decompose it such that it became a pairwise graph. So now we have this loopy pairwise graphical model, which is neither submodular nor possesses a tree-like structure. That's why I talked in the second part about learning in those settings, and for simplicity I used the segmentation problem there, although I could have just as well used this as the example. So what I want to look at now is how we can extract information from weakly labeled data.
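As an aside, the counting features just described might be sketched as follows. This is a hypothetical helper, not the actual feature extraction code; a brute-force convex-polygon mask stands in for whatever rasterization or integral-geometry trick the real system uses:

```python
import numpy as np

def convex_poly_mask(h, w, poly):
    """Boolean mask of the pixels inside a convex polygon given as an ordered
    list of (x, y) vertices, e.g. the quadrilateral of a hypothesized wall."""
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    inside = np.ones(pts.shape[0], dtype=bool)
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        # the cross product tells on which side of the edge each pixel lies
        cross = (x1 - x0) * (pts[:, 1] - y0) - (y1 - y0) * (pts[:, 0] - x0)
        inside &= cross >= 0
    return inside.reshape(h, w)

def counting_features(label_map, poly, num_labels):
    """Count, per cue label, how many pixels fall inside the hypothesized wall
    ("how much yellow / green / red is in the hypothesized left wall")."""
    mask = convex_poly_mask(*label_map.shape, poly)
    return np.bincount(label_map[mask], minlength=num_labels)

if __name__ == "__main__":
    labels = np.random.randint(0, 4, size=(60, 80))       # toy orientation-map labels
    wall = [(10, 5), (50, 8), (55, 50), (12, 45)]          # hypothesized quadrilateral
    print(counting_features(labels, wall, num_labels=4))
```

With those counting features in mind, back to the question of extracting information from weakly labeled data.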
And in order to show that, I'm going to compare to a fixed approach, an approach that trains only on fully annotated samples -- on 5, 10, 20 and 50 fully annotated samples -- and I'm going to measure the performance, the prediction error, in terms of pixel-wise classification. Obviously, we can see with the fixed approach, in blue, that the more samples we throw into the training procedure, the better the performance, in this case because we only have a small number of fully annotated samples. Now, if in addition I throw in 25, 50 or 100 weakly labeled samples, with the number of angles being thrown away being either two or, on the right-hand side, three -- so 50% or 75% latent variables -- you can see that the prediction error goes down. So despite the fact that we made some approximations, we were capable of reducing the prediction error; we were able to extract some additional information. >>: Why is it worse for the green one when you go all the way over to 50 annotated examples and you have an additional 25 examples which are weakly labeled? It looks like it's slightly worse. Do you think that's just noise, or do you think there's actually something to that? >> Alexander Schwing: Okay, so the question was: for the green curve, where we throw in an additional 25 samples in the case of 50 annotated ones, why is it slightly worse than when just training with the fixed number of samples? All those curves are averaged over 10 runs, I believe, because within the data set we obviously had to pick a subset. I would say it's mostly noise. I mean, this is the error shown on the test set, so you can train on the training set, yes, but what then happens on the test set is kind of a different story, right? Maybe if I averaged over 20 runs, it might be better. The important thing is that the differences are actually minor. So with that curve, I also want to conclude the second part of my talk, where I wanted to show that certain approximations can help with training even when the information is not fully provided. In the third part, I want to give you some more details on this estimation in physically plausible configurations, meaning: are there specific inference algorithms, or how are we capable of designing specific inference algorithms, for certain problems? So, again, the problem is the 3D indoor scene understanding task I've been looking at, and what we want to do now is, given a linear model w -- in this part I'm going to assume someone did the training for me and I'm given this linear model -- and some image x, estimate the layout using a feature map, where this feature map was, again, this counting of how much yellow, how much green, and so on, is within the image cues. The parameterization we are going to use, again, consists of those four variables, because this is a fairly structured configuration that allows us to do exactly what we want it to do. Furthermore, the layout is fully specified if we know the four variables and the vanishing points. Obviously, we are going to assume that the Manhattan-world assumption holds, meaning that opposing walls are parallel, the floor and the ceiling are parallel, et cetera. And what we are going to do is, instead of working with, or reasoning about, single hypothesized layouts, we are going to work with interval product spaces.
So we are going to assume we are given a set of layouts, and I'm going to specify the set of layouts via the letter S; the set of layouts is described by minimum angles and maximum angles. So for every one of the four random variables, we have a minimum angle and a maximum angle. The minimum angle is illustrated in this graph on the right-hand side down here via the black rays, and the maximum angle is illustrated via the red rays. So we're considering all the possible layouts with every angle lying within one of those intervals. So what could be a possible inference procedure? Well, we could go for a branch-and-bound approach. How would that work? Suppose we take a set of layouts. We compute an upper bound on the score of this set of layouts, and we put that into a priority queue, together with the set. Then we take from this queue the highest-scoring set, split it into two parts, score the two parts independently, put them back into the queue, and keep going until the highest-scoring element we retrieve consists only of a single hypothesized layout. At that point the algorithm terminates, because there is no way of splitting a single layout any further. When does that work? We are guaranteed to obtain the global optimum of our scoring function if two properties hold. First, the upper-bounding scoring function, f-bar, has to be a true upper bound on the cost function: for any possible layout within any possible set, if we score this set, we get a true upper bound on the cost function. And, in addition, what has to hold is exactness for a single hypothesis: whenever we plug a single hypothesis into the bounding function, we get the exact score. Given those two properties, the branch-and-bound approach described above gives us the global optimum. So the next question is: can we actually find such a bounding function, such an f-bar, that is a true upper bound? And what do we need to do in order to find one? Well, we had this scoring function being an inner product between our weight vector w and our feature vector phi. We can divide this inner product into two parts, a positive scoring part and a negative scoring part. The positive scoring part, f-plus, consists of the weight vector elements multiplied by their corresponding features whenever the weight vector element is positive, and the negative scoring part, f-minus, consists of the weight vector elements multiplied by their corresponding features whenever the weight vector element is smaller than or equal to zero. Why can we split it into a positive part and a negative part? The reason that works is that our feature vector is just counting -- counting colors -- and counts are always non-negative. If the features are non-negative, the sign of each term is uniquely determined by the sign of the weight vector element. So now, having split up this function into two parts, how do we design the bounding function f-bar? Well, we need to sum the maximum positive contribution of the constituents and subtract the minimal negative part. For me, that is a pretty important statement, so I'll say it again: we want to sum the maximum positive contribution over the interval, and we want to subtract the minimum negative part.
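The branch-and-bound loop itself might be sketched like this (schematic Python; the intervals are discretized angles, and the brute-force bound in the toy usage is only a stand-in -- the whole point of the positive/negative decomposition just described is to evaluate such a bound cheaply):

```python
import heapq
import itertools

# Schematic sketch only: intervals are discretized angles, and the brute-force
# bound in the toy usage below is just a stand-in; the point of the positive /
# negative decomposition in the talk is to evaluate such a bound cheaply.

def branch_and_bound(intervals, upper_bound, score):
    """Maximize `score` over the product of integer intervals [lo, hi] (one per
    layout angle), given an admissible `upper_bound` that is exact on singletons."""
    heap = [(-upper_bound(intervals), intervals)]
    while heap:
        neg_bound, box = heapq.heappop(heap)           # highest-scoring set first
        if all(lo == hi for lo, hi in box):
            return [lo for lo, _ in box], -neg_bound   # single hypothesis: done
        i = max(range(len(box)), key=lambda j: box[j][1] - box[j][0])
        lo, hi = box[i]
        mid = (lo + hi) // 2                            # split the largest interval
        for child in (box[:i] + [(lo, mid)] + box[i + 1:],
                      box[:i] + [(mid + 1, hi)] + box[i + 1:]):
            heapq.heappush(heap, (-upper_bound(child), child))

if __name__ == "__main__":
    # Toy stand-in for the layout score over four discretized angles.
    score = lambda s: -sum((si - ti) ** 2 for si, ti in zip(s, (7, 3, 11, 2)))
    # Trivially admissible (and expensive) bound: exact maximum by enumeration.
    ub = lambda box: max(score(s) for s in
                         itertools.product(*[range(lo, hi + 1) for lo, hi in box]))
    print(branch_and_bound([(0, 15)] * 4, ub, score))
```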
So now consider a front wall -- this front wall here, where the maximum angles are given by the red rays and the minimum angles by the black rays -- and we want to sum the maximal positive contribution. Well, the maximal positive contribution is obtained by summing up everything that is positive within the red quadrilateral; we can't get more positive than everything within this red quadrilateral here. What is the minimal negative part? Well, the minimal negative part is everything negative within the black quadrilateral; this is the minimal negative part that we always need to subtract, and for all the other cases we need to subtract more. So this bounding function fulfills exactly the two properties from before. It's exact for a single hypothesis: if you have a single hypothesis, then the red quadrilateral and the black quadrilateral are actually equal, so the positive part and the negative part consider exactly the same areas, and the cost function is exactly recovered. Also, it's a true upper bound, because we always take more positive contribution and subtract only the least amount of negative. It works for the front wall case, and similarly it works for all the other walls. I'm not going to go through them; there is an illustration for the left wall as well, but I'll skip that part. So now we have designed this bounding function, and we need to compute it efficiently. We build on prior work for that, and I'm going to skip that part. I'm going to show you next a little bit about the performance. I said already that what we use as image cues, as features, are orientation maps and geometric context; I'm going to show some images of orientation maps and geometric context later. And what we are measuring is the pixel-wise prediction error on the images of a specific data set, which is the layout data set. There was obviously some previous work on this task, and people usually use different image cues, like orientation maps, geometric context, or both of them, and we obtain quite decent performance in actually a fraction of a second. So the branch-and-bound algorithm is quite fast in this case, and it achieves a reasonable performance. How do those results look, visually? Well, here are some examples. I show some images, and in the middle column I'm going to show the orientation maps. On the right-hand side, you can see the geometric context, which are the feature cues. And for the feature cues on their own -- at least for geometric context -- you can see here the performance would be 28.9%. On the images, you can see our prediction result overlaid in red, and in blue we show what would be the best possible prediction if we had the ground truth available. I'm showing the blue because it illustrates a little bit the discretization artifacts -- we are discretizing everything -- but there is not really much error to see with respect to the discretization. >>: The right-most one, again, is that your result? The right-most? >> Alexander Schwing: The right-most is this one? >>: Yes. >> Alexander Schwing: No, this is an input image. This is geometric context, which we are using as an input feature. So the colors here -- exactly. The red is... >>: Blue is the ground truth. >> Alexander Schwing: Yes. You can see that blue is the ground truth. >> Sudipta Sinha: So I have a question about this setting.
I don't know if it's this particular data set or if there are more like this, but in these data sets there are objects in the scene. >> Alexander Schwing: Yes. >> Sudipta Sinha: And in some sense, the objects are like distractors, so the ground truth, as I understand it, does not mask out these pixels. All the pixels are just included in the three walls around us. >> Alexander Schwing: Yes. >> Sudipta Sinha: So then, when people do learning on this data set, on the ground truth, how is the learning not getting affected by all these objects that are distractors? In some cases they are lined up with the orientation of the box, of the room, but in many cases they are not. Do you have a sense of what happens? >> Alexander Schwing: So to repeat the question: in many scenes there are objects, and objects also kind of distract the learning algorithm, so how does the learning not get affected? The answer is, the learning does get affected. In fact, it does get distracted, and inference gets distracted, too. As you can see in the two failure cases on the bottom here, we're actually predicting the red layout, which is obviously distracted by the objects in that case. At least if you don't model the objects, they will distract you, correct. Another source of failures is wrong vanishing points: for this image on the right-hand side, you can see that the vanishing point should be somewhere in the middle of the image, but it's actually far to the right, outside the image, and since we assume those to be given and fixed, we cannot really do much about that. I can actually also show you some video here, for some of the results. So, again, what we assume is that, given a single image like this one here, we want to predict the layout. But if you then just have the layout, as illustrated here on the right-hand side, and you want to render the scene from a new viewpoint, what happens is that you obtain results, or renderings, that are actually quite distracting. So an interesting additional thing to do would be to model not only the layout itself, but also the objects on top of it. I haven't talked much about that, and it's not really published work yet. Again, on this example, you can see that, given a single image, you get significant artifacts if you just render the scene as a single box. And on this last image, what you can also see is that, obviously, for some of the pixels we don't have any texture information, so what happens if we are predicting objects on top is that we get some black regions for which we don't really know what's happening. Obviously, we could do some inpainting there, but we haven't looked into that yet. With that video, I also want to conclude my talk and quickly recap. The whole motivation was to work in physically plausible configuration spaces. Because those can get fairly large, we need to do inference somewhat efficiently, and one way of doing it efficiently is to distribute the task onto as many computers as you have available. In the second part, I told you that those models usually possess a very general structure, so we are required to learn in general graphical models, also when we don't have everything annotated, because annotating at large scale is time consuming and costly.
And in the third part, I showed you some results regarding a 3D indoor scene understanding task. So thanks for your attention, and I'm looking forward to your questions. Any more questions? >>: We're going to be chatting later. >> Alexander Schwing: Sure. Thanks.