>> Richard Hartley: So I'd like to introduce Tiberio Caetano, who works at National ICT Australia, where I also work. So he comes from Brazil originally by way of Calgary. >> Tiberio Caetano: Edmonton. >> Richard Hartley: Sorry, Edmonton, Alberta. So extremes of temperature. He did his Ph.D. there and graduated in 2004 before coming to work in Australia. Now, I hired him into the Vision Group there, but he jumped ship into the Machine Learning Group at some point, so he has expertise in both vision and machine learning, and I think he's going to give us a talk which bridges the two areas today. >> Tiberio Caetano: Thanks, Richard, for the introduction. So essentially I'll divide this talk into two parts. The first part will be a very generic talk about the type of work we are doing in computer vision, essentially using machine learning methods to solve computer vision problems. And the second part, which is probably going to be a little bit shorter, is going to be a bit more specific about one recent project. Okay. So I would like to thank my collaborators: Jose Ferris from the IBM T.J. Watson Research Center; some of my Ph.D. students, Julian McAuley, Devon Shi, Cau Lee and Bing Chin; and Alex Mau over there, a collaborator involved in this project here that I will describe. I think I've missed a picture of Dale Sherman, who is now a collaborator on this project that I'm going to present to you. Okay. As I mentioned, I'll divide the presentation into two parts. Part one will essentially be preaching structured prediction in computer vision. I'll quickly describe what it means and give examples of this. The second part will then be a little bit more specific about recent work we've just submitted. So let's focus on the first part, structured prediction in computer vision. In this part I'll talk about what I call easy vision problems. So here's an easy vision problem. I want to find the class of this object.
I'll say, okay, this is a bike. I'm calling this an easy vision problem. So imagine what I would call a difficult vision problem. So here's yet another easy vision problem. What's the class of this object? I find it easy. That's a car. A typical car. Another easy vision problem. What's the class of this object here? That's a ship. Also an easy vision problem. It's the design of a new ship that's going to be around very soon. It basically has a very sophisticated hotel in it. And, well, you need to arrive at it by plane. But yet it is a ship. So these are easy vision problems. I hope you agree with me that these are all easy vision problems. For these easy vision problems, which are classification problems, you can do two things. A very traditional approach that was popular in the '60s and '70s is the knowledge-driven approach. Something like this. This would be rule-based, model-based, grammar-based: you would represent the object in terms of its parts and basically create a complicated model to try to represent the object and figure out what the object is, whether it's a ship or a car or whatever. This was popular in the beginnings of computer vision. So you would typically have complicated descriptions of objects and almost no data leveraging whatsoever. If you look at how things have been happening more recently, things have basically been changing from a knowledge-driven approach to a data-driven approach. To do classification these days, for example, one option is to use things like support vector machine technology, and that's basically driven most of the activity in classification in computer vision recently, in the '90s and recent years. So essentially the idea here is that instead of going for those complicated models you go for very simple models, but you leverage data heavily. So the question I have for you is: What would you do?
Would you go back to this age here and do complex models and no data leveraging, or would you just continue what people are doing at the moment, with simple models and huge data leveraging? What would you do? That's the question. You would just continue doing this. >>: [inaudible] what you're doing. >> Tiberio Caetano: What's too boring? >>: Maybe the first one. >> Tiberio Caetano: The first one is too boring. So basically this is what almost everyone is doing, right? Okay. So let's focus on the second approach here. So there you go. I mean, one option if you want to do simple classification -- again, just to remind you, we are working here on simple vision problems -- one option is just to use support vector machines. Extremely well known, a very successful classification technique. And there you go. Well, you don't need to do support vector machines. There are many ways to do classification; one example is to do logistic regression, where instead of learning the boundary, you just learn these discriminative functions separately and then you decide according to the intersection. That's, for example, logistic regression: also a discriminative approach but still based on density estimation techniques. There are many other things you could do. Okay. Why am I calling these problems easy vision problems? It seems I'm being arrogant here. But they are easy vision problems in the following sense: the output is low dimensional. It's the number of classes. For example, I want to classify an object. Well, I may have two classes, 10 classes, 50 classes, even 100 classes, but I would call this extremely low dimensional. Well, we usually think these problems are difficult. Not because the output is low dimensional, but because the input is high dimensional. Because images are very complicated objects. That's why computer vision is difficult. One of my points in this talk is that of course this makes life difficult. That's true.
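The "inference is trivial" point for these easy problems can be sketched in a few lines: enumerate the handful of classes and take the one with the highest score. The weight matrix below is invented purely for illustration, not taken from the talk; in practice it would come from SVM or logistic regression training.

```python
import numpy as np

# Hypothetical learned weights of a 3-class linear classifier over 4 features.
# (Illustrative numbers only; real weights come from training.)
W = np.array([[ 1.0, -0.5, 0.2,  0.0],   # class 0: "bike"
              [-0.3,  0.8, 0.1,  0.4],   # class 1: "car"
              [ 0.2,  0.1, 0.9, -0.6]])  # class 2: "ship"

def predict(x):
    """Trivial inference: score every class and return the argmax.
    Enumeration is cheap because there are only a few classes."""
    scores = W @ x
    return int(np.argmax(scores))

x = np.array([0.1, 1.0, 0.2, 0.5])
label = predict(x)   # class with the highest margin score
```

The whole point is that this argmax is over a tiny set; for the "hard" problems discussed next, the analogous argmax ranges over an exponentially large output space.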
But this here makes life not as difficult as it could be. If the output were also high dimensional, then we would really have hard vision problems. >>: In some variants of recognition today, people are starting to talk about labeling pixels, doing segmentation together with recognition. >> Tiberio Caetano: Yeah, that is the point of this talk, the recent developments of that. I'm just trying to draw basically the mainstream up to recent years, and then try to survey what has been done in the last couple of years. But that's a fair point. That's the point of this talk. Okay. So inference here is trivial. What I mean by inference: you want to predict the class membership of a given image. Inference is trivial in the sense that, well, for a support vector machine you just check which class has the highest margin score. You can easily enumerate the classes because you have a very small number of classes. And in logistic regression, you just check which class has higher probability under your individual regression functions. Okay. So now let's talk about harder vision problems, which is what Rick mentioned that people are doing. This is a hard vision problem: depth estimation. Here you have a well-known stereo pair. Here you have the ground truth for it. Why am I calling this a hard vision problem? Because the input is high dimensional but the output is also high dimensional; the output is another image. It's not a simple class label. It's extremely high dimensional. Likewise here. The input is an image. The output is yet another image. This is image denoising. So how could you get from here to there efficiently and with high accuracy? That's a hard vision problem. Another hard vision problem: image segmentation. The input is an image. The output is high dimensional as well, another image. Okay. This is bad, because this is my favorite theme, which is matching, and the image is not showing up. Okay. No problem.
I'll show it later, an example of matching, the graph matching case. What was up there is basically a pair of images and matches between features of those images. The input is a pair of images and the output is also something very complicated, which is a permutation of the nodes, of specific points. >>: But dimensional analysis doesn't capture everything, right? Because adding two images together is basically high dimensional input and high dimensional output, and that's an easy problem, right? Two images, taking their average: that's a high dimensional input, high dimensional output. So just classifying things by dimension doesn't really tell you what the hardness of the problem is. >> Tiberio Caetano: I'm not sure I'm getting your point. >>: I'm saying -- you're saying what makes it a hard problem. A hard problem has a high dimensional input and high dimensional output, right. The problem of taking two images and computing their average satisfies your definition of what a hard problem is. >> Tiberio Caetano: Don't take this too strictly. Don't take this too strictly. This is a more pictorial kind of description of what I'm talking about. So essentially you can also in principle think of two generic approaches, like we've mentioned previously: a knowledge-driven approach or a data-driven approach for this class of problems. You could qualify these differently. I'm just using easy and hard to make you remember one month from now that you've heard easy and hard, because if I used other qualifiers you wouldn't remember. So the question is: what would you do? My point here is that for these hard vision problems people are still mostly playing this game here. Most approaches are knowledge-based, so, for example, energy functions are typically handcrafted. Examples: MRFs for stereo, MRFs for segmentation. We have matching.
And a bunch of other discrete optimization problems where you've got this complicated, sophisticated predictor. Instead of just a support vector machine, where you instantiate each of those classes and check which has the highest margin score, you have to run an entire complicated algorithm in order to make a prediction: here, well, you need to solve an MRF; here you need to solve an assignment problem; here, you know, other things. So you work really very hard at trying to find the best solution for a model when you don't know in the first place whether that's the right model or not. Because you handcrafted the energy functions. So my point is that good inference is not enough. We need to learn the energy functions themselves. That's what you should be doing. So basically, if you were an advocate of the data-driven approach for easy vision problems, why shouldn't you be an advocate of the data-driven approach for the difficult problems as well? There's an inconsistency. That's what we need to do. We need to find good solutions for the right problem, not the wrong problem. And the questions are: why should we be working so hard to improve inference incrementally, which is what a substantial proportion of the vision community is doing in discrete optimization, when we know that the global optima are poor anyway? I mean, it's so much effort doing this. To me it doesn't make much sense. Why bother so much about being so optimal when the criterion being optimized is simply a reasonable guess of what a good criterion would be? So I think we should be asking those questions. So here's an example. This is a beautiful example from a paper in 2005. Here's the original Tsukuba image. Here's the ground truth, and here's the global optimum according to a simple pairwise MRF with handcrafted parameters. So look at the differences between this and that. This is really -- this is the global optimum of the energy function.
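To make the "handcrafted energy function" concrete: a minimal sketch of the kind of pairwise MRF energy being criticized, for binary segmentation on a grid. The unary costs and the smoothness weight `lam` below are invented for illustration; `lam` is exactly the kind of hand-tuned parameter the talk argues should be learned from data.

```python
import numpy as np

# Pairwise MRF energy on a 4-neighbour grid:
#   E(y) = sum_i unary_i(y_i) + lam * sum_{i~j} [y_i != y_j]   (Potts model)
def energy(labels, unary, lam=1.0):
    h, w = labels.shape
    # data term: cost of each pixel's chosen label
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # smoothness term: penalize disagreeing neighbours
    e += lam * np.sum(labels[:, 1:] != labels[:, :-1])   # horizontal edges
    e += lam * np.sum(labels[1:, :] != labels[:-1, :])   # vertical edges
    return float(e)

h, w = 2, 2
unary = np.zeros((h, w, 2))
unary[..., 1] = 0.5                      # labelling any pixel "1" costs 0.5
labels = np.array([[0, 0], [0, 1]])      # a candidate segmentation
e = energy(labels, unary)
```

Inference algorithms (graph cuts, belief propagation) search for the labeling minimizing this energy; the talk's point is that the minimizer is only as good as the handcrafted `unary` and `lam`.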
So this is what people are writing more and more and more papers to achieve, these results here. And if you compare the result to that, you'd say this is really crap. If you're extremely picky, this is crap. Look at the camera. There's no camera here. I can't see any camera. So shouldn't we be balancing a little bit more how much we work hard on finding the right solution to a given problem versus finding the right question to be asking in the first place? So here are a bunch of other examples of these very issues, from [inaudible] at ICCV 2005, where you have a bunch of images and the solution obtained with belief propagation. Here's the solution with graph cuts. Here's the global optimum. They actually managed to find the global optimum for a specific MRF, a pairwise MRF function. And if you look at the results, in some cases they're quite far from what a human labeler would provide. So why are so many people concentrating their efforts on optimizing a given energy function, if at the end you're going to get something that's not that good anyway? Here's something I want you to think about. So in red you have the quality of the algorithms. Here's the level of sophistication of the algorithms. Here's algorithm one, here's algorithm two. Every point on this line is a single paper in ICCV or CVPR, so people are pushing this direction here. So, algorithm sophistication. In this red line here you have the quality with respect to a poor energy function. Of course, that's improving. In this blue line you have the quality with respect to real perception. That's stuck. That's stuck. That's stuck. It's been stuck for a long while here. And in the end people are publishing papers and being paid for those papers. So what's the point? >>: Got a problem with the dataset? >> Tiberio Caetano: Well -- that's -- >>: The dataset -- >> Tiberio Caetano: Exactly. So people are taking these energy functions for granted, right?
Of course you're improving with respect to that energy function, but so what? That energy function was handcrafted in the first place. People are being extremely religious here. Come on, we're scientists. In the machine learning field, learning the energy function has a name. It's called structured estimation or structured learning. Finding the global optimum of a given energy function is simply inference. This has the same name in computer vision. When people talk about structured prediction they're really talking about the whole business of predicting high dimensional output, and doing learning in the process as well: not only handcrafting those energy functions but doing learning, leveraging data in order to find what would be the right problem to be solved in the first place. So what would be popular estimators? At the beginning of the talk we discussed support vector machines and logistic regression. Likewise here, people are creating extensions of those estimators for the structured case as well. Today we have structured support vector machines, which are a cousin of the SVM. Structured SVMs. So basically these two references are what you need to look at; they are the papers that introduce essentially the tools for how to do learning when the output is very, very high dimensional. Because we are talking about really high dimensional: imagine an image -- you cannot enumerate all possible images. Yet that's the number of classes you have. So how do you solve that? The learning problem becomes extremely difficult. It's much more challenging than doing inference. Likewise, you can do something different. You can ask what's the analogue of logistic regression for the structured case. That has a popular name: conditional random fields. That's only logistic regression for a different feature vector. >>: You mentioned that the cost function, for matching, for stereo, is now correct.
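The "logistic regression for a different feature vector" remark, and the partition function problem raised shortly afterwards, can be illustrated on a toy CRF over a length-3 binary chain, where the normalizer Z is still enumerable. All scores below are made up; the point is only that Z sums over every possible output, which for an N-pixel image means 2^N terms.

```python
import numpy as np
from itertools import product

# A tiny CRF over a binary chain of length 3.
unary = np.array([[0.0, 1.0],    # illustrative per-node scores
                  [0.5, 0.0],
                  [0.0, 0.2]])
pair = 0.7                       # assumed reward for agreeing neighbours

def score(y):
    """Unnormalized log-score of a joint labeling y."""
    s = sum(unary[i, yi] for i, yi in enumerate(y))
    s += pair * sum(y[i] == y[i + 1] for i in range(len(y) - 1))
    return s

# Partition function: a sum over ALL 2^3 labelings. Enumerable here,
# intractable for image-sized outputs -- the talk's objection to CRFs.
Z = sum(np.exp(score(y)) for y in product([0, 1], repeat=3))
prob = np.exp(score((1, 0, 1))) / Z   # conditional probability of one labeling
```

With only two scoring terms this is literally multiclass logistic regression whose "classes" are joint labelings; gradient-based learning needs Z (or expectations under it), which is where the intractability bites.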
But how do you know that the structure of that cost function [inaudible] is correct, that you have the correct cost function? Because you said -- >> Tiberio Caetano: Because you just train against labeled data, right? So you can have, for example, what you have these days: good hand segmentations or correct stereo. >>: But you're given the parameters to fit that model. But, for example, in the conditional random field, the form of the potential functions you must assume. >> Tiberio Caetano: You define the features. You don't define -- you define the model class, and then you search for the element in your model class that maximizes the evidence of the observations, whereas in traditional models you just select an element of your model class. You just select -- you handcraft your parameters. So therefore, if you have a large model class, you could realize the entire model class by changing your parameters; but you don't explore that entire model class, you just fix a specific model and that's the best you can do. You work hard at optimizing it, yes. >>: If you're going to address this later just ignore my question, but what about learning parameters in MRFs, like Marshall Tappen has done? >> Tiberio Caetano: Sure. Absolutely. But the point is that you want to optimize those parameters to minimize the final loss, which is, for example, the loss of stereo. For example, a Hamming loss between what you are predicting, let's say, and a hand labeling. And minimizing that loss can be tricky. >>: It can be very challenging. One reason is because Marshall worked with Gaussian MRFs, it was tractable, and he could do image denoising where, with synthetic data, he was optimizing the loss between the output of the learned MRF and the final output. >> Tiberio Caetano: The basic problem, you see, is that if you really want to minimize a loss that's interesting, like, for example, a Hamming loss, that loss is discontinuous. You have a piecewise constant optimization problem.
There's no hope of using standard optimization techniques to solve that. So what I'm going to say is: instead, optimize likelihood, which is what people are more familiar with -- logistic regression, everyone knows how it works. You just create a more sophisticated model and the problem is solved. But that has a problem, because you have the problem of computing the partition function. But this is an interesting alternative, because here you have no partition function issues. You just look at the boundary. You just look at a few examples. And this will be much faster than this in general. And the principle here is to upper bound that crazy objective you have. Because the ideal loss you want to minimize would be a very crazy Hamming loss, which is piecewise constant. There's really no way to do that efficiently. But if you have a decent surrogate loss that kind of preserves the structure of that original loss, and that surrogate loss is amenable to efficient optimization, that's what this is all about. And this has actually been extremely successful. Although the first paper that I know of that used this in computer vision is only from 2005. But we did this in 2007, and now people have suddenly realized, yes, we're really into this stuff. Conditional random fields people have been using a bit longer in computer vision. This was first proposed in 2001. It has very nice statistical properties. This is a consistent estimator; this one is not a consistent estimator, for example. But it has serious problems because of the partition function. You need to create an entire probability distribution over images, and when you want to learn, to do gradient descent, then you need to estimate the partition function. It's very bad. This one seems to be the practical alternative, the first one. Both have advantages and disadvantages. Okay. I'll move forward here. So just one example: graph matching. The graph matching problem is really a hard problem.
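The upper-bounding idea can be checked numerically on a toy structured problem with an enumerable output space: the structured hinge, max over outputs of (Hamming loss plus score) minus the ground-truth score, always sits above the piecewise-constant task loss of the argmax predictor. The feature map and weights below are invented for illustration.

```python
import numpy as np
from itertools import product

Y = list(product([0, 1], repeat=4))     # all 16 possible outputs
y_star = (1, 0, 1, 1)                   # ground-truth labeling

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

def phi(y):
    # toy joint feature map: (number of ones, number of agreeing neighbours)
    return np.array([sum(y), sum(y[i] == y[i + 1] for i in range(3))], float)

def hinge(w):
    # structured hinge: max_y [Delta(y, y*) + <w, phi(y)>] - <w, phi(y*)>
    best = max(hamming(y, y_star) + w @ phi(y) for y in Y)
    return best - w @ phi(y_star)

def task_loss(w):
    # the piecewise-constant loss we actually care about
    y_hat = max(Y, key=lambda y: w @ phi(y))
    return hamming(y_hat, y_star)

w = np.array([0.3, -0.1])
```

In real structured SVMs the max inside `hinge` is computed by (loss-augmented) inference rather than enumeration, which is why a tractable predictor matters so much.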
If you want to take into account pairwise constraints, for example, you get a quadratic assignment formulation. That can be really hard. The point is that traditionally people just handcraft these objective functions for graph matching. You just say, look, these unary similarities will look at SIFT features, these quadratic similarities will look at relative distances or whatever. Then you just fix the parameters and then you work really very hard to solve that combinatorial problem, and you didn't know in the first place whether your objective was good or not. The point is you can actually learn those energy functions. In this work, for example, we use structured SVMs to learn those energy functions. And then we find some very nice things. For example, you can model graph matching with complicated combinatorial settings like quadratic assignment, or with simple combinatorial settings like linear assignment, which looks only at matches between individual points instead of having pairwise constraints. That's a tractable model. It turns out that if you learn that simple model, you obtain results which are as good as you would have obtained with a complex model without learning. This is something we figured out in these examples here. This is what you get. If you model graph matching, for example, as linear assignment, which is efficient to compute, it will perform similarly to quadratic assignment without learning. Why? Because you were finding the right weights of the SIFT features, for example, in such a way that your final result is what you want it to be, because you have training data. So you engineer it. Basically you automatically tune the relevance of your 128 SIFT features, for example, so that the joint weighting will, when you solve this combinatorial problem, produce a match that is what you want it to produce, because you have the output, either by hand labeling or by laser scans in, for example, the case of depth estimation. So this is one example. Okay.
So here my matches show up. This is my favorite subject: graph matching. This is just to show that, okay, in green here you have correctly matched points and in red you have wrongly matched points, okay? This is exactly the same model class. It's just that here we have handcrafted the matching scores, and here we have learned them. So you see that there are many mistakes that you can avoid just by tuning which is the right model in your model class. Of course, here how many classes do you have? n factorial. In this case you have 30 points here, 30 points here. I have as many classes as possible matches here, because my prediction is the match. Because I'm predicting a match. I'm parameterizing the algorithm that's producing the match. It's like having a support vector machine with 30 factorial classes. How do I solve that? That's why it's not trivial. You need to look at the paper to see the techniques people are using. But the point is you can improve. Yes, you can improve. So here's another example we have in CVPR this year, which is shape classification. We want to classify shapes, okay? One thing that people are doing in shape classification, for example, is to use matching scores -- to use the results of matching to do classification. If you remember the paper by [inaudible] et al. in 2002, that was basically shape context features; maybe people are familiar with that. Essentially what they were trying to do was to classify objects based on matching scores. But they were handcrafting the matching score. So you handcraft the matching scores. You produce a match that's completely a function of how good your matching scores were. Then you parameterize the final matches and you learn after that stage. You do this. You trust your matches completely.
Then you do learning: after you produce the match, you parameterize and then you do learning. But you may be doing learning on completely crap features, because you've done the shape matching in the first place, completely trusting your uncalibrated linear assignment algorithm. So what we're doing in this paper is everything in a single shot. We optimize the matching score itself so that the classification loss is minimized. Okay. So here is an example with no learning. You produce a match. You can match this scan against this dog, but the algorithm has wrongly classified the dog and camel as belonging to the same class. And here, if you do learning -- so you use the matching as a means to classification, but you parameterize the matching itself -- then you can recover; we can say these two guys are camels, they belong to the same class. >>: So there has been work on learning better feature descriptors or distances in feature space and things like that. >> Tiberio Caetano: Sure. >>: And you have training databases of correct and incorrect matches, right? So how does this differ from learning just a better matching score? >> Tiberio Caetano: Because this is optimizing the matching loss, not that one. This is structured prediction. In that case you were tuning your matching scores so as to minimize a very low dimensional surrogate. This is going directly to the output space, saying, look, I want this matching matrix, this permutation matrix. The loss I want to optimize is the Hamming loss between two permutation matrices, which is extremely high dimensional. You go to the problem you'd like to solve instead of creating a surrogate low dimensional version. >>: In that case. >> Tiberio Caetano: Which paper are you talking about? >>: Winder, Matthew Brown. They've done a series of things where they have SIFT-like features and they're trying to learn better descriptors. And they have a database of which features actually match.
There are several million matches of patches that are known to be true matches, and other matches that are known to be false. So basically they have a loss function which is correct or incorrect matching. >> Tiberio Caetano: Right. So the question I have is: what's the algorithm that predicts the -- what's the predictor in this case? >>: [inaudible] might be learning how to make a single match. >>: That's right. >>: In this particular case you don't have complete matches. >> Tiberio Caetano: It's a joint. >>: So you're saying if you wrap a more complex algorithm around it. >> Tiberio Caetano: You can still -- >>: You can have access to what's a correct match. You only have access to the final thing. You wrap your match in a bigger black box, and you evaluate the output of the black box. That's the only difference, right? >> Tiberio Caetano: The main thing is that I'm not optimizing the scores of individual matches. I'm not doing that. I'm optimizing the joint match, the quality of the joint match. Because these two points cannot match to the same point. For example, I have a bijection constraint. It's essentially a bijection, so there are factorially many possible things that I need to optimize over. I'm not saying, well, this guy is similar to this one, let's score a single correspondence, no. >>: More complex. >> Tiberio Caetano: It's a more complex prediction model. >>: The ground truth, mapping from one to the other. >> Tiberio Caetano: You need the ground truth. >>: But you know that? >> Tiberio Caetano: You need to provide that manually. And that can be an issue in many cases. But there are two ways to circumvent it. One way is to do a semi-supervised version of this. In the semi-supervised version you provide some of the matches and don't provide the others. You can generalize the setting to marginalize over the hidden variables for the others. That's one example.
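The joint loss being contrasted with per-correspondence scoring can be written down directly: compare two complete matchings (permutations) by how many points are matched differently. The example permutations are arbitrary.

```python
import numpy as np

def perm_matrix(p):
    """0/1 matrix form of a permutation: row i has a 1 in column p[i].
    The bijection constraint is built in: one 1 per row and per column."""
    n = len(p)
    P = np.zeros((n, n))
    P[np.arange(n), p] = 1.0
    return P

def hamming_perm(p, q):
    """Hamming loss between two joint matchings: the number of points
    whose correspondence differs."""
    return int(sum(pi != qi for pi, qi in zip(p, q)))

y_true = [2, 0, 1, 3]     # ground-truth matching of 4 points
y_pred = [2, 1, 0, 3]     # predicted matching
loss = hamming_perm(y_true, y_pred)
```

This is the high-dimensional structured loss: it is defined on whole permutations (a factorially large output space), not on one candidate correspondence at a time.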
The other example is, if you're doing shape matching, if you have a simplistic setting like this, you could easily create a semi-automated labeling algorithm. You say this is going to match to this and this to that; you label three or four points and run a simple algorithm that completes the matching, and you use that for labeling. We need to move forward here to make sure we get everything done. Here is yet another example. This is another work we've done, at NIPS last year. Other types of matching problems include near-isometric matching, so you can also improve by doing this structured learning for near-isometric matching -- you just change the types of features you have and the type of inference you use, junction trees and structured SVMs as well. This is another thing. We just submitted this, but I discovered today that it wasn't accepted at ICCV. But anyway, this is trying to do something interesting. Recently a group in San Diego used conditional random fields to do joint object categorization. What's joint object categorization? Imagine, just for the sake of simplicity, that you could segment this image properly. You could segment this into -- you have here four disjoint regions, but you don't know the labels of those regions. You don't know what they are. But you could segment them properly. As a matter of fact, in order to assess the quality of different algorithms, you need to assume that the segmentation algorithm is independent of the model. So let's just assume you can segment this properly, and the question is: which object is this object here? Which object is this and this and this and that? So of course you could think, you know, I will create classifiers -- a vegetation classifier, a sky classifier, a ground classifier -- and predict those things independently, right? But this misses something, because actually an object is likely to be above and not under the ground.
It's likely to be under the sky and not over the sky. So different labels of objects don't occur independently in images. So if you want to predict things, you would like to leverage a dataset; in that particular dataset you have labeled these and said these [inaudible] are under the sky, over the ground; or if you see water and you have something big over the water, you know it's likely to be a ship, not a car, something like that. So you want to leverage those correlations, right? So when you do prediction, you predict an entire combinatorial structure in the output. You don't predict this and that independently of that guy. But that presents problems, because you have a combinatorial problem to solve, right? Essentially you have a sophisticated predictor, but you can train it using training data. That's the point. In this work we're substantially improving on the work of [inaudible] and collaborators. So we have to find a lot of venues to stick this stuff in. So here, this we just submitted as well. What we are trying to do in this case -- no images here, but this is the point. Assume you want to do graph matching. Typically in graph matching you make some assumptions like, for example, I want isomorphism, I want homomorphism, I want isometries. There are all sorts of assumptions that people like to make when they do graph matching, depending on the type of problem. What we do here is become completely agnostic with respect to those assumptions. And we parameterize everything, right? And we have a unified model where we just learn the weight of everything. For example, in this case here we have shape context features, 60 of them here; we are learning the importance of those. And in this corner here we are learning the importance of isomorphic features, isometric features, homomorphic features -- all the different kinds of graph matching that you can think of.
There's no point talking about whether this is an isomorphic graph matching problem or a homomorphic graph matching problem. You parameterize everything. You learn, and you get a soft description of the matching problem. Of course, when you do that you substantially improve on existing results, the existing tradition and all that. Concluding this first part -- and this is the main message I want to pass today; the second part is a bit more technical -- this is the main message I wanted to pass: you have simple prediction problems like classification, where we just want to classify objects. People in general use simple models like support vector machines, right? And perform learning properly. Computer vision people have already learned how to master these techniques here. And the results are improving, improving, improving. However, when you have complex prediction problems -- think of graph matching, think of image segmentation and MRFs and stereo, all that stuff -- people use complex models and, yes, they're starting to use learning, but they're still not leveraging all the data that they could leverage here. Most people actually still overlook learning in this case. And maybe you can still overlook learning if you have a stereo matching problem where you assume you have a monochromatic image. But imagine five or 10 years down the road, when every image you get, even from your webcam, will be hyperspectral. You will have 72 channels. How would you manually tune the stereo parameters for 72 channels? You'd better start thinking ahead of time and automatically learn all this stuff. Okay? So the take-home message is: use techniques that allow doing both things, complex models and learning. That's the first thing you should take from this talk. And the second thing is that in this trade-off between model complexity and data leveraging, I think computer vision is still too far to the first side: we are using very complicated models, but we are not leveraging as much data as we could.
Okay. So I think a better balance is possible here. We should think about it. Ask these questions. >>: Just to run by, with the degree of data we had, we didn't have ground truth as a surrogate -- >> Tiberio Caetano: I agree with you. That's true. >>: Labels -- >> Tiberio Caetano: That's true. >>: A lot of things came online just in the past few years. >> Tiberio Caetano: That's a completely legitimate argument. But there are two points here. First, that's true, but that's not the whole story. And, second, even if you lack completely labelled data, you can still use this stuff. So basically you can use a semi-supervised version of this, but for that you need to work hard, because you need to look at the last two years of papers at NIPS and understand all that. >>: That's the problem, you guys have [inaudible] [laughter]. >> Tiberio Caetano: Different questions. We've been working on different questions. I just think that, since I'm mostly a machine learning person, I apply my stuff to vision. The way I see this is: look, talk to the vision guys and say, look, we have these tools, and we can really change these things by using these tools. So let's try to talk to people and excite people about using these tools. Yes? >>: My question is that while you're leveraging the [inaudible] data [inaudible] model such that it can fit the data [inaudible], the risk is that the model will be tuned specifically to that particular set of data and -- [inaudible]. >> Tiberio Caetano: In practice you just do regularization. This is not a technical talk, right. But in practice you don't minimize only the loss on the training set; you minimize the loss plus a regularizer that basically constrains the model class. It's saying, well, I'm not going to learn arbitrarily complex functions; I'm going to learn functions that do well on the training set but are also sufficiently simple. You just do regularization in practice. Okay.
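The "loss plus a regularizer" recipe in that answer can be sketched in a few lines for plain least squares. The synthetic data, the value of `lam`, and the closed-form ridge solve are illustrative assumptions, not anything specific to the talk's models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 training points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0  # regularization strength: constrains the model class
# Minimize ||Xw - y||^2 + lam * ||w||^2 -- ridge regression, closed form:
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Larger `lam` shrinks `w_hat` toward zero: the learner is restricted to simpler functions, which is exactly the "sufficiently simple but fits the training set" trade-off he describes.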
So this is the main message I had for today. Since I still have 15 minutes, let me tell you a few other things we're doing. This is something I'm excited about because we just submitted this stuff; it's very recent. And it has an application in computer vision: convex relaxations of mixture regression applied to motion segmentation. Here's a motion segmentation problem. This is supposed to be just a single image of an image sequence. So I have this guy moving, the other car there is moving, and you have the background here. And the fact is that you want to segment these motions. So you want to classify -- this is the ground truth, let's put it this way, of the segmentation: you know that's one motion, that's another motion. The motion models that we have, the things you want to estimate, the fundamental matrix and all this stuff -- they're all linear regression problems. So if we knew to which of these three motions every one of these features belongs, we could solve a simple linear model that would estimate that motion, simply, just by doing least squares regression. The problem is we don't know. The problem is we don't know that this guy here and this guy here belong to the same motion. We don't have this information. In fact, the membership, the motion membership, for every one of these points is a latent variable that can attain three values in this case, but we don't know what it is. So how do you solve this problem? If you observe the membership, you have a nice convex problem that you can solve in closed form: least squares. But if you don't know, how do you solve it? Obviously this will be a very complicated optimization problem, because this will be a combinatorial optimization problem. You would ideally have to look at all the possible configurations of three motions of these points, which is an exponentially large set.
And solve the regression problem for each one of those and find at the end which configuration gave you the best fit, the best least squares solution. Of course you can't do that in practice, so what do you do? I'll tell you what people do. So this is the setting, in a more visual, less mathematical way: you have mixture regression. Here's a bunch of points, here's another bunch of points, and another bunch of points. If you knew that this belonged to the same line, that this belongs to the same line, and this belongs to the same line, you would just solve three independent regression problems. But you don't know that. Okay. Let's just assume for the sake of simplicity there are three models, so you don't have a model selection problem. There are only three models, and you know they are straight lines. But how do you solve this? What's the typical answer? The typical answer is you have a latent variable for each one of those guys, which is the membership of that point. So some people just -- okay, this would be a good solution. And people just use EM. That's what people do. So what do you do? You pretend you know the memberships and then you solve the regression problem. You use that solution to initialize and then solve for the membership again, and then go back and re-estimate the model. And then solve again and re-estimate the model. Well, we know very well that this function that EM is trying to optimize is a highly non-convex function, which is full of peaks and valleys, and in some ways it's discrete, because we're optimizing over a discrete space. It's piecewise constant. It's the worst possible optimization problem you can have. So EM is going to do well if you can afford EM running over -- >>: Is it discrete, or can you have soft assignments in EM? >> Tiberio Caetano: No, in EM everything is soft assigned, and you can discretize. >>: But it's not a discrete optimization, combinatorial optimization? >> Tiberio Caetano: The one thing is the problem itself.
The other thing is the model. The problem is discrete; the model is soft, [inaudible] to be exact. I'm sure many of you have at least implemented EM once in your lifetime and run EM. You know that EM can do very badly. It can do very well, but it can do really badly. The reason EM can do badly is because it's finding a really poor local minimum. So how do you go and solve this? Okay, you can try to do EM, or you can try to do something else. So here's the something else we're trying to do. We have a very complicated function here -- very high dimensional, multi-modal -- and you want to find the global optimum. And EM is great at descent, basically. If you initialize here you'll find this point; if you initialize here you'll find this point. But there are millions, billions of valleys here. So depending on initialization you're going to do well, or you're not going to do well. So what do we do? We construct a convex upper bound on this function, and instead of solving this problem we solve that problem. So we change the problem. Is that bad? Well, of course, you're not solving the original problem. But the question is: if we happen to construct a convex upper bound whose minimum is not too far from this minimum here, you've really gained something, because solving this is easy, and you obtain nearly the same solution that you would have obtained if you could have solved the original problem. So before telling you the good news, I'll tell you the bad news. The bad news is that there's no existing algorithm I'm aware of that tells you the deviation between this point and that optimum. All you can do at the moment is construct a convex upper bound which is sound in an intuitive sense, but you cannot really give many guarantees on how close this solution is to that solution. So that's the bad news. The good news is coming now.
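The EM-style baseline described above, for a mixture of linear regressions, can be sketched with hard assignments. This is a minimal alternating-least-squares version on synthetic data (a simplification: the actual EM he mentions uses soft responsibilities and a noise model, and this sketch inherits exactly the local-minimum sensitivity he criticizes):

```python
import numpy as np

def mixreg_hard_em(X, y, k=3, iters=50, seed=0):
    """Alternating ('hard-EM') fit of k linear regression models.
    Sketch only: random init, squared loss, no restarts or noise model."""
    rng = np.random.default_rng(seed)
    z = rng.integers(k, size=len(y))          # random initial memberships
    W = np.zeros((k, X.shape[1]))
    for _ in range(iters):
        for j in range(k):                    # M-step: least squares per class
            idx = z == j
            if idx.any():
                W[j] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        resid = (X @ W.T - y[:, None]) ** 2   # E-step: reassign each point
        z = resid.argmin(axis=1)              # to its best-fitting line
    return W, z
```

Each iteration only descends, so the final answer depends entirely on the random initialization; that dependence is the motivation for replacing the objective with a convex upper bound.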
So we construct this convex upper bound on this objective function here. But remember, this picture here is very misleading, because this problem is extremely high dimensional. Still, that's a convex function, and we know how to optimize convex functions; there's nice technology to do that. But it's not simple. This is not a simple unconstrained optimization problem; this is a semidefinite program. And among the many possible convex optimization problems, this is one of the worst you can have. This is not going to scale. Basically you can solve this for 300 observations, 400 observations. You cannot solve this for a thousand observations. So this is not good. Typically it's interior point methods, barrier-type methods, that solve this problem: essentially you enforce the constraints in the objective function by a logarithmic barrier on those constraints. This is extremely expensive because it operates on the square matrix. So it's [inaudible] really going to be bad. So here's what we do. We reformulate the problem as a semidefinite program, which is really hard to solve, but we figure out two ways of solving this problem exactly without doing semidefinite programming. Okay? So we bypass completely the need for semidefinite programming. And this is only possible because of the very structure of this problem; it wouldn't be possible in general. So here -- I only have five minutes -- I won't go much into detail, but I want to give you the intuition of what's happening. Every one of those observations I mentioned previously is called an XI. You have a latent variable for each of them which tells you to which motion that particular observation belongs. So now you construct this variable VI, which is the product of these two guys. So this is not known; this is partially known. This is known, but this is not known. So this vector here, the product of this thing that you know with this thing you don't know, is something you know only partially.
You don't know it completely. And now, if you have this notation, you can easily write the mixture of regressions problem simply as a regression problem. So here is your prediction, here is your observation, here is your model. You have just a product of your parameter vector with V, and assuming in this case least squares, you basically assume Gaussian noise in your data. So here is the noise model. Okay, classical regression. The key difference is that here we don't have XI; we have CI, which is something you don't know completely, whereas in traditional regression you would only have XI, because you don't have this class membership issue. Okay. Well, it turns out you can reformulate all this stuff in a nice matrix form. Here you have a quadratic objective, a nice Gaussian exponential family, and this is the regularization I mentioned earlier: we want to penalize models that are complicated, essentially, so we have a Gaussian prior on the parameters. And what you want to do -- well, there's some notation problem here, these variables, see here -- we want to solve a traditional least squares problem on W, for a fixed C. But C is not fixed. So we need to jointly solve this linear regression problem and the combinatorial problem. So this is the original objective. That's, as I mentioned before, very difficult; you cannot just enumerate all this stuff. So after lots of math you can prove that this is a nice convex relaxation for that problem. I won't give you the details, but essentially what you do is compute the dual of the original problem, use duality and do a few math tricks, and you obtain this optimization problem here. This problem is a convex problem. Essentially you are optimizing over C. This function here is concave in C, because it's minus C squared here. Concave in C.
And then you have a semidefinite constraint here, because this matrix M needs to be positive semidefinite and all that, blah, blah, blah. The problem is you have a semidefinite program, and it's hard to solve. This is an exact reformulation of that problem: if you solve this problem here, you will recover the same solution as if you solved that problem there. But solving that problem is very expensive, and solving this problem is very easy, because this part here can be solved by basically eigenvector computations -- it's cubic, you can solve this part efficiently -- and this optimization can be done by gradient descent. So, essentially, you alternate eigenvector computations and descent, and this process goes fast. This version of our reformulation runs as fast as EM, basically. And then here's the algorithm, blah, blah, blah: you need the input, which are those parameters of your noise model; you need to solve eigenvector computations; you use non-smooth, traditional, standard convex optimization tools. But this is much more efficient than doing semidefinite programming using interior point methods. It's much faster. This is scalable: you can solve this for thousands of points. Okay. That's the first reformulation. You can find another convex reformulation of the original semidefinite program. And this is the ugly-looking version of that thing. But you can prove that this is also a convex optimization problem. You also have a semidefinite constraint here, but that semidefinite problem can be solved efficiently as well. It's not as fast as the other one; it's about three times slower than EM. Okay. So here's the algorithm. I won't go much into details, but here are some pictures. Let's close with some images. So here I have a few datasets from motion. Here's the first dataset. This is the ground truth, and then if you run EM, you run the first reformulation,
the second convex reformulation, you obtain: this is sort of an easy problem. EM is finding the optimal solution already, and these guys also end up finding the optimal solution. Now you have a slightly more difficult problem here, and you can start to see the differences. You observe that EM makes a few mistakes here, other mistakes there. This convex reformulation seems to be at least as good as EM; it's doing quite well. This one technically should find the same solution, because these two guys here are solving the same problem; the differences here are basically numerical differences. Here's the ground truth. Here EM does pretty badly: it makes all these mistakes here and over there. And here, with this convex relaxation, you're still far from the ground truth, but at least you make fewer mistakes. You can quantify this; we have tables showing details. >>: How do you initialize EM? >> Tiberio Caetano: We run EM 100 times, initialized randomly. That's the setting. We could run EM -- >>: Can't figure out -- >> Tiberio Caetano: This is the best out of 100 restarts. >>: The other one's only run once. >> Tiberio Caetano: Once. It's a convex problem; if you solve it 100 times you get the same solution. >>: The min max. >> Tiberio Caetano: Max min. Same speed. >>: Same iterations as EM, or 100 iterations? >> Tiberio Caetano: 100 iterations, being fair here. You could run EM for 1,000 iterations, but then it's going to be slow, right. >>: Yeah. >>: If you've got that many points, the number of restarts you start with won't make a lot of difference, because the possible configurations -- you're at two to the 30th or 40th or 50th. >> Tiberio Caetano: In theory, yes. In practice you can see a difference; we found a difference between 100 and 10 restarts. >>: For EM. >> Tiberio Caetano: But we didn't really run experiments with 1,000 restarts to know the answer. >>: What happens when you take the max min convex solution and use that to initialize EM?
>> Tiberio Caetano: You would find the same solution. Well -- you mean use this solution to initialize EM? We haven't done that, but you don't really want to do that, because that would be really expensive: you have to run the problem twice, the -- >>: Just one iteration. Like -- >> Tiberio Caetano: Yes. Yes, but -- this is something that people do, you know? Initialize with this, because you could be at this point, maybe, right? Maybe we're here. It's a very good question. We haven't done it -- you could be at this point here. >>: Yeah. >> Tiberio Caetano: So you are minimizing this function, right? But you are not minimizing the function that you really care about. That's something you can do; we haven't done that. But yes, definitely you can use this convex optimization to initialize EM. >>: The other thing is to take the energy of your model for the solutions from each of these techniques and see which one's lower. >> Tiberio Caetano: That's true. But there is a slight catch to that, which is the following: there are hyperparameters in our model. You have this alpha and this gamma here. You can choose to cross-validate those parameters. If you cross-validate or do some parameter search at the end, the best EM result and the best convex relaxation result may be for slightly different models, and then your scores may be different. Okay. But we are still in the initial experiments with this. One thing we want to do is just stick to the same hyperparameters and do a very thorough analysis and see what -- >>: The parameters -- the sigma is the noise? >> Tiberio Caetano: Sigma is the noise. So gamma -- one caveat here: we need one extra parameter compared to EM. We need this gamma. What this gamma is telling you is the class balance, the balance between the sizes of the classes. This gamma is an upper bound on the size of the largest class.
So if you want no class to be larger than 50 percent of the points, you need to set that gamma. We don't really have a solution for avoiding that. But this alpha -- so sigma is totally a property of your data, but alpha is not; alpha is a parameter of the model. So you may end up with slightly different values if you cross-validate them. If not, then your point is valid. Okay. So overall -- and we have many other motions, I'm just showing three here; in our submission we have appendices and all this stuff with many other things -- especially the max min, because the max min is quite fast, and it seems to preserve quite a lot of the structure of the problem. All right. Okay. So, yes? >>: Question on this. What's moving here? The sequence. >> Tiberio Caetano: The camera. That's why you have all these things, all this green here. And the car and that car. >>: The image -- I know, but this little one, I mean it's twice C. There are already a small number of images there. What about this one on the bottom? Just those images? >> Tiberio Caetano: No, there are more. You have to ask the student who made this; I don't know. I have to check this. I don't know exactly how many points are here. >>: On the number of mixtures -- >> Tiberio Caetano: That's a big question. We don't. So these are two fundamentally different questions. This is a model selection question, so there's no way you can use a likelihood criterion to optimize that, because likelihood is monotonic in the complexity: if you make more and more complex models you're always going to obtain a better likelihood. That's a fundamental question; it's a philosophical question, the problem of induction in science. So it's an open problem. You can use several different model selection criteria, MDL or whatever, or you can try to be Bayesian and do all this hierarchical Bayesian stuff, because if you're completely Bayesian then you have no model selection problem.
You just put the prior in and keep running. So -- >>: [inaudible]. >> Tiberio Caetano: Can you speak up a little? >>: For EM you can use AIC, BIC criteria. >> Tiberio Caetano: AIC, BIC are model selection criteria; that's exactly what I'm saying. >>: [inaudible]. >> Tiberio Caetano: You're right. You're completely right. The advantage of EM is that since we have a probabilistic model you can be as Bayesian as you want, so you can do what you're saying. In this setting, it's not obvious how to do that. I don't know the answer. Any other questions? >>: I remember a long time ago, more than 10 years ago, I think Blake worked on something called graduated non-convexity. >> Tiberio Caetano: Graduated assignment. >>: They called it graduated non-convexity; it was a continuation method, popular in the vision community in the mid to late '80s; [inaudible] worked on this. But they were for regularization problems with nonquadratic penalties. >> Tiberio Caetano: Nonconvex as well? Nonconvex? >>: Nonconvex. So you would introduce graduated nonconvexity. >> Tiberio Caetano: I see. >>: So it's a continuation method -- >>: I don't know whether it's -- >>: I don't know if it's related or not. It might be, because there's this idea, at least at the philosophical level, of having an approximating function that's smoother. >>: You've compared this with EM. I think maybe that's not absolutely the state of the art method here, right, for this particular problem. Maybe your technique's more -- >> Tiberio Caetano: What would be the state of the art? >>: You look at the recent types of [inaudible]. >> Tiberio Caetano: Yes, yes. >>: He and I have a couple of papers. >> Tiberio Caetano: I know. >>: And well, there are sort of methods that [inaudible]. >> Tiberio Caetano: Yes. >>: Which pose an affine model on this, and then the whole thing becomes linear, and this is like what René calls a generalized PCA problem. >> Tiberio Caetano: Yes.
>>: And then we see it in that -- he also has more recent papers, I think at CVPR 2009, I believe. >> Tiberio Caetano: That's good. >>: That's one we've seen. >> Tiberio Caetano: Maybe I'll talk to him at CVPR. >>: Compressed sensing techniques, in a way, to solve the GPCA problem. So looking at the results, the numbers in his paper are fairly amazing. It makes me feel ashamed that he and I published a previous paper, because it just kills it, right. You should -- >> Tiberio Caetano: I need to look at that, because, remember, this is a hammer looking for a nail. We're not solving the motion segmentation problem; we are solving the mixture regression problem, modeling motion segmentation as mixture regression where you have a quadratic criterion, as in the very first slides I put up. So this is basically a quadratic regression problem. Those are probably different models; they are not the same model I proposed here. So they are different models, and I would have to formulate new convex relaxations for every different objective. >>: Okay. >> Tiberio Caetano: I'm actually technically done. So if you -- >>: Okay. Tiberio is here today and we have time; before he goes away we've arranged a lot of talks. It would be nice if you could work with some of the interns and [inaudible], so let me know and I'll [inaudible]. >> Tiberio Caetano: Thanks, Rick. Thanks, everyone. [applause]