>>: And our next speaker is Jeff Bilmes from the University of Washington.

>> Jeff Bilmes: Thanks very much. Can everybody hear me? The title that was initially sent was Why Submodularity Should Be of Interest to People Who Are Interested in Machine Learning, and that is more or less what I'm going to talk about today, but the real talk is entitled Applications of Submodular Semigradients in Machine Learning. Here is an outline, and for those of you who have not heard of, do not remember, or do not know what submodular functions are, there will be a little bit of background. Then we'll talk about discrete submodular semigradients and what they are, and then three applications of these semigradients: one is a problem that was applied in computer vision, another is an optimization problem of a particular style, and lastly our discrete generalizations of Bregman divergences. Before I begin I'd like to acknowledge both former and current students. This is all joint work with a current student of mine, Rishabh Iyer, who I'd like to point out; Rishabh is right there. And then former students Stefanie Jegelka and Mukund Narasimhan.

Some background. Submodular functions are discrete functions on subsets of some underlying ground set V, and basically they have the property of diminishing returns. That means that when you think of a submodular function as something that assigns a value to a set and you're thinking of adding a new item to the set, the change in value from that new item, the gain of that new item, diminishes as the context in which you are considering it grows. So basically what you don't have becomes less valuable as what you have grows, and what you don't have becomes more valuable as what you have shrinks; that's the concept of diminishing returns. Of course, what is value? Value depends on the particular function that you are talking about. Here's an example where the value of an urn is the number of distinct colors of balls that lie within that urn, so the value of the left urn would be two and the value of the right urn would be three, because there are two colors on the left and three colors on the right. It doesn't matter how many balls are in there. The gain of adding a blue ball to the left urn is one, because you add a new color and so you've gained one color. Whereas if you add a blue ball to the right urn the gain is zero, because you've not gained any diversity in the colors of the balls, so this is therefore a submodular function.

Something that's going to be very relevant to our lunch today is consumer cost. Consumer costs are submodular. Here is an example: the cost of, say, McDonald's fries and a Coke minus the cost of fries is greater than the cost of the Happy Meal minus the cost of the hamburger and fries, and this is very typical if anyone has ever been to McDonald's before. If you buy a hamburger and fries, then they will throw in the Coke for free. On the other hand, if you just buy fries, you actually have to pay for the Coke, so this is therefore submodularity, and that basically means that there have been 20 billion submodular functions sold in the world [laughter]. So rearranging terms, we just rearrange the terms and we get a function that looks like this.
This is sort of saying that f of this set plus f of this other set is greater than or equal to f of the union, which is the fries, hamburger and Coke of course, plus f of the intersection, which is just the fries. This is a form that is often used to define submodular functions. It's equivalent to diminishing returns and it basically says that for any sets A and B, f of A plus f of B is greater than or equal to f of the union plus f of the intersection. Usually what happens at this point is people say, well, this is less intuitive, and I think this is actually just as intuitive, and to hopefully prove myself right I'm going to try to demonstrate in the next 30 seconds why this equation is intuitive.

The idea is to think of submodular functions as information functions. They provide the information or value of a subset. So we have A and B, which are sets, and each element is an index of some item which has some value or some information. The common index is essentially the intersection of A and B: the set of items that is indexed both by A and by B. And let's say that there exists some set C which has the common value, the common information, that exists within both A and B. So there is a difference between the common index and the common value: A intersect B is the common index, and C is the common value. Now, for any sort of information function or value function, intuitively at least, it should be the case that the information in the common index can't be greater than the information in the common value. Why? Because there could be different items with different indices which have similar value or even the same value, but they don't live within the common index, which is A intersect B. So you can plot this graphically in a Venn-diagram-like style where area corresponds to value or information: A corresponds to this red circle (I guess I have a pointer I can use), B corresponds to this green circle, the intersected region corresponds to the common information or common value, and the magenta ellipse corresponds to the common index. That means that the common index is upper bounded by the common value. Given this intuition, you've got this picture, which is a pictorial view of the submodular equation: you have f of A plus f of B, so you've got the common information counted twice, two times f of C, and that is greater than or equal to f of the union, where the common information is counted only once, plus f of the intersection, which only has the common index. And since the magenta part is less than the blue part, you therefore have submodularity, and there you have it. So what do you think? Did it work? Is this now very intuitive?

>>: I don't know what a common index is.

>> Jeff Bilmes: Common index, okay, so we're going to have to move on, sorry [laughter]. This is not my class. It didn't work. I will never do this again [laughter], unless I have three hours. Someone asked if I have everything I want and I said, could I have more time for my talk? So anyway, there are many, many applications of submodular function optimization.
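To make the two definitions above concrete, here is a small Python sketch, purely an editorial illustration rather than anything from the talk (the function and variable names are mine): the colored-balls value function, checked against both the diminishing-returns form and the union/intersection inequality.

```python
import itertools

def f(urn):
    # Value of an urn = number of distinct colors among its balls; how many
    # balls there are does not matter (a coverage function, which is submodular).
    return len({color for _, color in urn})

left = frozenset({(1, "red"), (2, "green")})                # value 2
right = frozenset({(3, "red"), (4, "green"), (5, "blue")})  # value 3
blue_ball = (6, "blue")

# Diminishing returns: the new blue ball adds a color to the left urn but
# adds nothing to the right urn, which already has a blue ball.
assert f(left | {blue_ball}) - f(left) == 1
assert f(right | {blue_ball}) - f(right) == 0

# Equivalent definition: f(A) + f(B) >= f(A | B) + f(A & B) for all A, B.
balls = list(left | right | {blue_ball})
subsets = [frozenset(s) for r in range(len(balls) + 1)
           for s in itertools.combinations(balls, r)]
for A, B in itertools.product(subsets, repeat=2):
    assert f(A) + f(B) >= f(A | B) + f(A & B)
print("the colored-balls value function satisfies both forms of submodularity")
```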
One of the reasons why they are so useful is that on the one hand they are very widely applicable, and on the other hand it turns out that if there is some form of submodularity involved, you can usually optimize, either exactly or approximately, and often very, very efficiently, sometimes extremely efficiently. For example, if you are trying to do cardinality-constrained submodular maximization of a monotone submodular function, monotone non-decreasing, then it has long been known that there is a 1 minus 1 over e constant-factor approximation for this problem. And in fact, very recently, a tight one-half approximation bound for unconstrained nonnegative submodular function maximization was shown by someone who is now here -- Schwartz is now in the theory group. He gave a very nice talk, I think last week or two weeks ago, at UW, so this is very exciting; it was very exciting for all of us when this paper came out. There are many, many machine learning applications of submodular function maximization, including sensor placement, feature selection, extractive document summarization, and selecting the most influential set of individuals, which actually relates a little bit to the previous talk that we just saw.

Now, submodular function minimization is another thing that you might want to do. It's a very different kind of problem, and very different techniques are used to minimize versus maximize submodular functions. Submodular functions can be minimized in polynomial time, and while it's not as efficient as the approximation algorithms for maximizing submodular functions, for special cases of submodular functions you can often do this extremely efficiently. There are many, many applications of this as well in machine learning, including Viterbi inference in probabilistic models, or what's called most-probable-explanation or MAP inference, image segmentation, clustering, data subset selection, transductive semi-supervised learning, and many others as well.

Another point on submodular functions: I think of them as a sort of antithesis of, or contrary to, graphical models. Submodular functions are the opposite of graphical models, so why is that? In a graphical model what you do is you have a graph of some sort; there are lots of different kinds of graphical models. You have some graph, and basically the graph encodes factorization assumptions about any probability distribution that abides by that particular graph. In this particular case, in equation five, you've got a graph which has a set of cliques, and any probability distribution that lives within the family associated with the graph has to factorize with respect to that set of cliques. You've got factorization and decomposition. With a submodular function you can also instantiate probability distributions; for example, p of x is equal to one over Z times this quantity. Now in this equation there are absolutely no factorization assumptions required. Submodular distributions are what is called nongraphical: you can't say anything about them with a graph. And you might think, well, that means that basically it's just a large clique and inference is hopeless, but it turns out it's not hopeless, because you are making very, very different kinds of restrictions on the model. You are making submodular-like restrictions, and it's still possible to say things and to do inference.
And this is very powerful, because the factorization assumptions are usually nice computationally, but oftentimes that's not what exists in nature. So this is very powerful, and therefore they are sort of the opposite of graphical models. That's the end of the background on submodularity.

What I want to talk about are some properties of submodular functions that we have been discovering and then exploiting over the past couple of years, some of what I think are quite remarkable properties, and those are the semigradients that they have. I think it should be fairly well understood in this group that if you know about convexity and concavity, you know that convex functions have subgradients: any convex function has a subdifferential, and if it's a differentiable continuous convex function the subgradient is just the gradient; and of course concave functions have the same thing, as supergradients. And here's the picture in case that's not clear. You have a convex function f and a subgradient; in this particular case it's a subgradient tight at b, a linear lower bound that touches the convex function at b. Any convex function has that, and if it's a polyhedral convex function, multiple linear lower bounds can exist, say at a vertex of the polyhedral convex function. And then a concave function has the same kind of thing, except that it's a supergradient: something that touches at a particular point and is everywhere above the concave function.

It's been well known for a while, and I think in some sense this goes back to the work of Jack Edmonds, that submodular functions have subdifferentials; this is probably most eloquently articulated in Fujishige's book, and you can define a subdifferential in this particular way. Moreover, there are particular subgradients that are unbelievably easy to compute. Is that the 20-minute mark or the 25-minute? That's 20 minutes, okay, good. I just want to know how bad I'm going to be. So a subgradient can be computed using the greedy algorithm. The basic idea is that you choose an ordering of the elements -- I think it's on the next slide -- you choose an ordering of the elements here, and then you just compute this, which is basically like a linear function, a discrete linear function, what's known as a modular function. This is a function that only has n degrees of freedom. It touches the submodular function at a particular point, and it's everywhere below the submodular function, just like the subgradient of a convex function.

Let's go back to the continuous world for a minute and ask the following question: can there be both a tight linear upper bound and a tight linear lower bound on a convex or concave, or for that matter any continuous, function? Thinking about this pictorially, here's an upper bound and here is a lower bound. If we have a tight linear upper bound and a tight linear lower bound, there is very little space in between [laughter] to fit a function, other than of course an affine function. Therefore, if you have a tight linear upper bound and a tight linear lower bound at any one point, you must have an affine function. Now the question is: does this also hold for discrete functions?
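As an illustrative aside (not code from the talk, and with names of my own choosing), the greedy subgradient construction just described can be sketched in a few lines of Python, assuming the submodular function is given as a callable on frozensets and is normalized so that f of the empty set is zero.

```python
def modular_lower_bound(f, V, X):
    # Greedy subgradient: order the ground set so the elements of X come
    # first, then record each element's marginal gain in that order.
    # Assumes f is submodular and normalized so that f(frozenset()) == 0.
    order = [v for v in V if v in X] + [v for v in V if v not in X]
    h, prefix = {}, frozenset()
    for v in order:
        h[v] = f(prefix | {v}) - f(prefix)   # gain of v in this ordering
        prefix = prefix | {v}
    # m(Y) = sum_{y in Y} h(y) satisfies m(Y) <= f(Y) for every Y,
    # with equality at Y = X, just like a subgradient of a convex function.
    return lambda Y: sum(h[y] for y in Y)

# Tiny usage example: a coverage function over items with colors.
COLOR = {"a": "red", "b": "red", "c": "blue"}
f = lambda S: len({COLOR[x] for x in S})

X = frozenset({"a", "c"})
m = modular_lower_bound(f, list(COLOR), X)
assert m(X) == f(X)                                   # tight at X
assert all(m(Y) <= f(Y) for Y in
           [frozenset(), frozenset({"b"}), frozenset(COLOR)])
```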
So something that we discovered not too long ago, which I think is a remarkable property of submodular functions, is that any submodular function not only has a tight linear lower bound at any point, but also has a tight linear upper bound at any point; like I said, in the continuous case that would restrict the function to be linear. First of all, how does this work? In 1978, another classic paper, by Nemhauser, Wolsey and Fisher, showed that either of the following two equations is necessary and sufficient to define submodularity. What we did is take these equations and relax them a little bit, so we subtract off a little bit less here, or we add on a little bit more here, in such a way that we can define two modular functions that are tight at a particular set X and everywhere else upper-bound the submodular function. And that's for any submodular function.

How does this work? Well, what is a modular function? A modular function has n plus one degrees of freedom. Here is a particular example of a modular function. A modular function must satisfy the submodular inequality with equality: m of X plus m of Y has to equal m of the union plus m of the intersection, or equivalently, m can be written in the following form: some constant plus the sum, over all of the elements of X, of the individual value of that element. And here is one of the two modular upper bounds that I described on the previous slide. This can be written as a modular function by breaking this term into two terms: one goes over here and becomes a constant with respect to Y -- notice that there is no Y in it -- and the rest becomes something that involves each individual element of Y, split into Y intersect X and Y minus X, so those are all of the elements of Y that could be considered, and you have a value for every single individual element of Y. In fact, more recently we have shown that there is an entire submodular superdifferential; this is in a paper that is going to appear at NIPS this year, and you can define a superdifferential and do all sorts of things with points within the superdifferential.

Okay, so the summary is that we have submodular functions, we have very, very efficiently computable linear lower bounds, and very, very efficiently computable linear upper bounds. What can we do? The first application is cooperative cut and image segmentation. The basic idea of cooperative cut is a generalization of graph cut. In graph cut you've got an edge-weighted graph, with weights on the edges, and the goal is to come up with a cut that minimizes the cost of the cut, where the cost of the cut is the sum of the edge weights. The idea of cooperative cut is to replace the sum of edge weights with a submodular function defined on the edges. You still want to find the cut that minimizes the cost, but it's no longer the case that the edges of the cut must not interact: the edges can communicate with each other, or as we call it, they can cooperate with each other. So we can write it this particular way: we want to find a cut that minimizes this submodular cost on the edges. It's critical to understand that this submodular function is defined on the edges.
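Here is a sketch of one standard pair of such tight modular upper bounds; the exact forms are my reconstruction of this style of construction (built from Nemhauser-Wolsey-Fisher-type inequalities), not copied from the slides, and the function names are mine. The notation gain(f, j, S) means the marginal gain f(S plus j) minus f(S).

```python
def gain(f, j, S):
    # Marginal gain of element j in the context of set S: f(S + j) - f(S).
    return f(S | {j}) - f(S)

def modular_upper_bounds(f, V, X):
    # Two modular functions that are tight at X and upper-bound f everywhere
    # (assuming f is submodular); exact forms are my reconstruction.
    V, X = frozenset(V), frozenset(X)
    def m1(Y):
        Y = frozenset(Y)
        return (f(X)
                - sum(gain(f, j, X - {j}) for j in X - Y)
                + sum(gain(f, j, frozenset()) for j in Y - X))
    def m2(Y):
        Y = frozenset(Y)
        return (f(X)
                - sum(gain(f, j, V - {j}) for j in X - Y)
                + sum(gain(f, j, X) for j in Y - X))
    return m1, m2

# Usage with the small coverage function from the previous sketch.
COLOR = {"a": "red", "b": "red", "c": "blue"}
f = lambda S: len({COLOR[x] for x in S})

m1, m2 = modular_upper_bounds(f, list(COLOR), {"a"})
assert m1({"a"}) == f(frozenset({"a"})) == m2({"a"})          # tight at X
assert m1({"a", "b", "c"}) >= f(frozenset({"a", "b", "c"}))   # upper bound
assert m2({"a", "b", "c"}) >= f(frozenset({"a", "b", "c"}))
```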
I guess some of you might be aware that if you think of it as a node function, the standard graph cut is submodular on the nodes, but cooperative cut loses submodularity on the nodes, even though in some sense it has a submodular function embedded in it which defines the problem; here the submodular function is defined on the edges. So how do we use supergradients? There were many methods that we developed to solve this problem, but one of the most efficient, and actually the one that was used in practice -- it's very, very practical and it works very well -- is to use supergradients. The basic idea is a majorization-minimization algorithm, very similar in some sense to the EM algorithm. We start with an initial cut, which in this case can be anything; it can be the empty set. We find a modular upper bound that is tight at that particular cut, and if the function is nonnegative and nondecreasing then we get a standard graph cut problem. We solve the standard graph cut problem, we get a cut, and that cut is then used as the tight point for the next modular upper bound, and we repeat. We just repeat this process. And as I said, it works very, very fast, and if you have fast graph cut solvers, you don't even have to write the code for them.

First of all, difficulty: the actual problem of cooperative cut is hard. In fact, it's not possible to approximate it better than a factor of the square root of the number of edges, and this particular majorization-minimization algorithm has a bound of O of the number of edges, so the difference is, of course, a factor of square root of the number of edges; it's obviously not a tight bound in any sense of the word. But on the other hand it's so simple and so easy to get working that this was one of the algorithms for this problem that we were most excited about. Then we applied it, and it can apply to very, very large problems, for example problems in image segmentation.

Here's an example of an insect that comes from the Seattle Insect Museum, so I just want to give a plug to the Seattle Insect Museum. If you have children and you want to go see some bugs that look like this, go to downtown Seattle and take pictures of the insects. They are very nice and friendly. [laughter] The typical problem that happens when you are trying to segment your insect is that you label some of the points in the background and then you label some points in the foreground, and what happens is that the antennae or these protrusions get cut off. This is something called the shrinking bias problem in computer vision. Here's another example where you've got some calligraphy which undergoes some contrast gradients. It's a little harder to see on the screen here than it is on a computer screen, but you get the idea that you have a gradual decrease in lighting, which basically causes the segmentation algorithm to clump things together where you don't have a lot of contrast. Here's another example of a fan; you can see that a lot of the fine structure was completely lost in this area here. Here's a little chili plant. And these are state-of-the-art image segmentation methods.
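For reference, here is a simplified, unconstrained sketch of that majorization-minimization loop (mm_minimize and the toy objective are my own illustrative names, not from the talk). The actual cooperative-cut application solves a constrained graph cut at each inner step; this stand-in just minimizes the tight modular upper bound directly, which in the unconstrained case is trivial.

```python
def mm_minimize(f, V, X0=frozenset(), max_iters=100):
    # Iteratively minimize a tight modular upper bound of a submodular f.
    # Each step cannot increase f, since f(X_next) <= bound(X_next)
    # <= bound(X) = f(X); in general it converges to a local solution.
    X = frozenset(X0)
    for _ in range(max_iters):
        # Per-element weights of a modular upper bound tight at X: elements
        # inside X are charged their gain given X minus the element,
        # elements outside X their gain given the empty set.
        w = {j: f(X) - f(X - {j}) if j in X
             else f(frozenset({j})) - f(frozenset())
             for j in V}
        # Minimizing a modular function is trivial: keep negative-weight items.
        X_next = frozenset(j for j in V if w[j] < 0)
        if X_next == X:
            break
        X = X_next
    return X

# Example: a toy submodular objective, 2 * (number of colors covered) minus a
# per-item reward; the loop converges to {"a", "b"} here.
COLOR = {"a": "red", "b": "red", "c": "blue"}
REWARD = {"a": 3, "b": 3, "c": 1}
f = lambda S: 2 * len({COLOR[x] for x in S}) - sum(REWARD[x] for x in S)
print(mm_minimize(f, list(COLOR)))   # frozenset({'a', 'b'})
```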
What we did is apply cooperative cut to this problem, and the intuition as to why this works is that, with so much of the function defined on the edges, once you have the highly reliable portions of the image segmented, say the boundary of the body of the object, then because of submodularity the additional edges that are similar to those already in the cut become cheaper to use and to include in the cut. Therefore when you iterate this process, you first get the boundary, and suddenly all of these other edges that lie along the antenna become cheap, so they get added to the cut and you get improved results. Just a couple of other results: here is the calligraphy; here is the fan, which we were quite happy about, and this is quite remarkable; this is a vacuum cleaner, a typical thing you want to segment, of course; chili plants; and then here are some benchmark results. I don't have time to go into the details, but basically on images that do have these elongated structures it is much better, and on images that don't have elongated structures or contrast gradients it does the same, which is exactly what we want it to do.

Now the next application of semigradients that I want to talk about is another problem, which is minimizing or maximizing the difference between two submodular functions. This is an important problem: you want to minimize the difference between two submodular functions. There are many, many applications of this. For example, consider sensor placement with submodular costs. In sensor placement it's usually the case that you place the sensor where you gain the most information. But what if there's a cost associated with the placement of the sensor that is itself submodular? This is a very likely model, because it might be the case that placing sensors in a particular region has economies of scale. If you buy the equipment to place a sensor on the roof, or if you get the ladder out of storage, then you already have a ladder in the room and you can move it around the room. Or if you want to place a sensor in a particularly precarious environment, you invest in the equipment necessary to install that sensor, and thereafter you have the equipment already. So this is -- I still have 5 minutes until the next talk; is that right? So [laughter] I'm being warned. I know that there are no questions, right?

A couple of other applications: there are discriminatively structured graphical models, structure learning in graphical models where you want the graph structure to somehow perform well for classification purposes. Feature selection, where you have submodular costs; this is also a typical thing. For example, you might have spectral features, and let's say a group of features are computed using the FFT. Once you select the first one, then all of them are essentially cheap, but if you don't need any of these particular spectral features then you don't need to compute the FFT at all, and therefore there is a diminishing return associated with the feature costs. Most feature selection formulations have not associated this sort of interactive cost with the features. And there is also graphical model inference: if you just take p of x to be an exponential model where the energy is some arbitrary, not necessarily submodular, function, then the function that you want to optimize can be written as a difference of submodular functions.
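As a toy illustration of the feature-cost structure just described (my own construction, with hypothetical feature names, not anything from the talk): the cost of a feature set is submodular because features in the same group, say all FFT-derived features, share a one-time computation cost, and the selection objective then becomes a difference of two submodular functions.

```python
import math

# Hypothetical features: three FFT-derived spectral features and one raw one.
GROUP = {"fft_1": "fft", "fft_2": "fft", "fft_3": "fft", "raw_1": "raw"}
GROUP_COST = {"fft": 5.0, "raw": 1.0}     # one-time cost per group touched
RELEVANCE = {"fft_1": 3.0, "fft_2": 2.5, "fft_3": 1.0, "raw_1": 2.0}

def cost(S):
    # Pay once per group used: economies of scale, hence submodular.
    return sum(GROUP_COST[g] for g in {GROUP[x] for x in S})

def usefulness(S):
    # Concave of a modular function: submodular, with diminishing returns
    # as more (partially redundant) features are added.
    return math.sqrt(sum(RELEVANCE[x] for x in S))

def objective(S):
    # Feature selection as a difference of submodular functions: we want to
    # minimize cost(S) - usefulness(S).
    return cost(S) - usefulness(S)

print(objective({"fft_1"}), objective({"fft_1", "fft_2"}))
```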
And now I am trying to rethink what I want to talk about next [laughter]. In 2005 we showed, first of all, that any set function can be represented as a difference between two submodular functions, and we developed an algorithm which was a form of majorization-minimization: we take the function f minus g, and since we want to minimize this function, we replace g with its modular lower bound, which makes the whole thing an upper bound, and then we iteratively optimize. But what we can do with these modular upper bounds is the same thing in the other direction: we take the f function and replace it with its modular upper bound, so whereas on the previous page each iteration was a submodular function minimization problem, here each iteration is a submodular function maximization problem. And furthermore, we can do both: we can replace f by its modular upper bound and g by its modular lower bound, and then each iteration of this majorization-minimization procedure becomes a modular minimization, which is incredibly cheap; it's basically order n to do. There are a lot of other properties; this was in a paper at UAI this year. We tried this in the context of feature selection and showed that when you have submodular-cost features, plotting the pattern recognition error as a function of the submodular cost, all of these algorithms did well. The one where each iteration is modular we were particularly happy about doing well, because that one is so easy to optimize, whereas these other greedy heuristics do not do as well.

So the last thing I want to talk about is another application of semidifferentials, which is generalizations of Bregman divergences: discrete Bregman divergences. Bregman divergences are well known in machine learning. They've been studied and used for many applications in clustering, proximal minimization and online learning. Bregman divergences generalize things like the squared two-norm and the [inaudible] divergence. What we wanted to do was develop a discrete family of divergences, and similar to the way that continuous Bregman divergences can be seen as something involving the function and its subgradients, in the submodular case we can involve the semigradients of the submodular function. In fact we have two versions, because now we have both the subgradient version of the submodular Bregman and the supergradient version of the submodular Bregman. I just want to say that you can do all sorts of things with these. You can, for example, get Hamming distance, which is a discrete divergence, but of course Hamming distance isn't the most interesting thing. There are many other ones, like recall and weighted recall, something that is called the alignment error rate which is used in machine translation, and you can generate conditional mutual information. With the supergradient versions you can again do Hamming, precision measures, and other measures like Itakura-Saito-like and generalized-KL-divergence-like things, and a number of interesting ones based on cuts which we don't necessarily have a good interpretation of yet, but still it's a form of divergence.

So I'm going to end with this slide, which is a possible application of these Bregman divergences, which is to do k-means-style clustering. Here, rather than clustering real-valued vectors, you are clustering binary vectors, and so you have problems like the left-mean problem and the right-mean problem.
It turns out that, depending on which submodular Bregman you choose, computing the left mean ordinarily would be very, very difficult, but if you use a subgradient-based submodular Bregman then this problem becomes a submodular minimization problem. In other words, you have a large number of binary vectors and you want to find the bit vector which is closest to all of them collectively; with a subgradient-based submodular Bregman, that's a submodular minimization problem. And if you use the supergradient-based submodular Bregman, the same problem with the right mean is a submodular maximization problem. And that is the end of the talk. I want to thank again the students, and that's it. [applause] Did I go over time?