>>: And our next speaker is Jeff Bilmes from University of Washington.
>> Jeff Bilmes: Thanks very much. Can everybody hear me? The title that was initially sent was Why Submodularity Should Be of Interest to People Who Are Interested in Machine Learning, and that is roughly what I'm going to talk about today, but the real talk is entitled Applications of Submodular Semigradients in Machine Learning. Here is an outline, and for those of you who have not heard of, do not remember, or do not know what submodular functions are, there will be a little bit of background. Then we'll talk about discrete submodular semigradients and what they are, and then three applications of these semigradients: one is a problem that was applied in computer vision, another is an optimization problem of a particular style, and lastly our discrete generalizations of Bregman divergences. Before I begin I'd like to acknowledge both former and current students. This is all joint work with a current student of mine, Rishabh Iyer, who I'd like to point out; Rishabh is right there. And then former students Stefanie Jegelka and Mukund Narasimhan.

Some background: submodular functions are discrete functions on subsets of some underlying ground set V, and they have the property of diminishing returns. That means that when you think of a submodular function as something that assigns a value to a set, and you're thinking of adding a new item to the set, the change in value of that new item, the gain of that new item, diminishes as the context in which you are considering it grows. So what you don't have becomes less valuable as what you have grows, and what you don't have becomes more valuable as what you do have shrinks; that's the concept of diminishing returns. Of course, what value means depends on the particular function you are talking about. Here's an example where the value of an urn is the number of distinct colors of the balls that lie within it, so the value of the left urn would be two and the value of the right urn would be three, because there are two colors on the left and three colors on the right; it doesn't matter how many balls are in there. Then the gain of adding a blue ball to the left urn is one, because you add a new color and so you've gained one color, whereas if you add a blue ball to the right urn the gain is zero, because you've not gained any diversity in the colors of the balls. So this is a submodular function.
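To make the urn example concrete, here is a minimal Python sketch of that color-coverage function; the names num_colors and gain are illustrative and not from the talk.

    # Value of a set of balls = number of distinct colors it covers.
    # Coverage functions like this one are submodular.
    def num_colors(balls):
        return len(set(balls))

    def gain(f, context, item):
        """Marginal gain of adding `item` to the set `context` under set function f."""
        return f(list(context) + [item]) - f(context)

    left_urn = ["red", "green"]            # value 2
    right_urn = ["red", "green", "blue"]   # value 3

    # Diminishing returns: the gain of a blue ball shrinks as the urn grows.
    print(gain(num_colors, left_urn, "blue"))   # 1: blue is a new color here
    print(gain(num_colors, right_urn, "blue"))  # 0: blue adds no new color here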
Something that's going to be very relevant to our lunch today is consumer cost. Consumer costs are submodular. For example, the cost of, say, McDonald's fries and a Coke minus the cost of the fries is greater than the cost of the Happy Meal minus the cost of the hamburger and fries, and this is very familiar to anyone who has ever been to McDonald's. If you buy a hamburger and fries, they will throw in the Coke for free; on the other hand, if you just buy fries, you actually have to pay for the Coke. So this is submodularity, and that basically means that there have been 20 billion submodular functions sold in the world [laughter]. If we rearrange the terms, we get a function that looks like this: f of this set plus f of this other set is greater than or equal to f of the union, which is the fries, hamburger, and Coke of course, plus f of the intersection, which is just the fries. This is a form that is often used to define submodular functions. It's equivalent to diminishing returns, and it says that for any sets A and B, f of A plus f of B is greater than or equal to f of the union plus f of the intersection. Usually what happens at this point is people say this form is less intuitive, but I think it is actually just as intuitive, and to try to prove myself right I'm going to demonstrate in the next 30 seconds why this equation is intuitive.
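For reference, the two equivalent conditions just described can be written out as follows (standard notation, not copied from the slides):

    \[ f(A) + f(B) \;\ge\; f(A \cup B) + f(A \cap B) \qquad \forall\, A, B \subseteq V, \]

or, in the diminishing-returns form,

    \[ f(A \cup \{v\}) - f(A) \;\ge\; f(B \cup \{v\}) - f(B) \qquad \forall\, A \subseteq B \subseteq V,\; v \in V \setminus B. \]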
The idea is to think of submodular functions as information functions: they provide the information or value of a subset. So we have A and B, which are sets, and each element is an index of some item which has some value or some information. The common index is essentially the intersection of A and B: the set of items that is indexed both by A and by B. And let's say that there exists some set C which has the common value, the common information, that exists within both A and B. So there is a difference between the common index and the common value: C is the common value, and A intersect B is the common index. Now, for any sort of information function or value function, intuitively at least, it should be the case that the information in the common index can't be greater than the information in the common value. Why? Because there could be different items with different indices which have similar or even the same value, but they don't live within the common index, which is A intersect B. So then you can plot this graphically in a Venn-diagram-like style where area corresponds to value or information: A corresponds to this red circle (I guess I have a pointer I can use), B corresponds to this green circle, the intersected region corresponds to the common information or common value, and the magenta ellipse corresponds to the common index. That means the common index is upper bounded by the common value. Given this intuition, you've got this picture, which is a pictorial view of the submodular inequality: f of A plus f of B counts the common information twice, so two times f of C, and that is greater than or equal to f of the union, where the common information is counted only once, plus f of the intersection, which only has the common index. And since the magenta part is less than the blue part, you therefore have submodularity, and there you have it. So what do you think? Did it work? Is this now very intuitive?
>>: I don't know what a common index is.
>> Jeff Bilmes: Common index, okay, so we're going to have to move on, sorry [laughter]. This is not my class. It didn't work; I will never do this again [laughter], unless I have three hours. Someone asked if I had everything I wanted, and I said, could I have more time for my talk?

So anyway, there are many, many applications of submodular function optimization. One of the reasons they are so useful is that on the one hand they are very widely applicable, and on the other hand, if there is some form of submodularity involved, you can usually optimize either exactly or approximately, and often very efficiently, sometimes extremely efficiently. For example, if you are doing cardinality-constrained submodular maximization with a monotone non-decreasing submodular function, it has long been known that there is a 1 minus 1 over e constant factor approximation for this problem. And in fact, very recently a tight one-half approximation bound for nonnegative submodular function maximization was finally shown; one of the people who showed it, Schwartz, is now here, and I would like to stop by the theory group. He gave a very nice talk, I think last week or two weeks ago, at UW, and it was very exciting for all of us when this paper came out.
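The 1 minus 1 over e guarantee mentioned here is the one achieved by the classic greedy algorithm; here is a minimal sketch, with my own naming, assuming f is a monotone submodular set function that accepts a list of elements.

    def greedy_max(f, ground_set, k):
        """Greedy maximization of a monotone non-decreasing submodular f
        subject to the cardinality constraint |S| <= k; gives a (1 - 1/e)
        factor approximation (Nemhauser, Wolsey and Fisher, 1978)."""
        S = []
        for _ in range(k):
            remaining = [v for v in ground_set if v not in S]
            if not remaining:
                break
            # Add the element with the largest marginal gain f(S + v) - f(S).
            best = max(remaining, key=lambda v: f(S + [v]) - f(S))
            S.append(best)
        return S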
There are many, many machine learning applications of submodular function maximization, including sensor placement, feature selection, extractive document summarization, and selecting the most influential set of individuals, which actually relates a little bit to the previous talk that we just saw.

Now, submodular function minimization is another thing you might want to do. It's a very different kind of problem, and very different techniques are used to minimize versus maximize submodular functions. Submodular functions can be minimized in polynomial time, and while that is not as efficient as the approximation algorithms for maximizing submodular functions, for special cases you can often do it extremely efficiently. There are many, many applications of this as well in machine learning, including Viterbi inference in probabilistic models, or what's called most probable explanation or MAP inference, image segmentation, clustering, data subset selection, transductive semi-supervised learning, and many others as well.

Another point on submodular functions: I think of them as sort of contrary to graphical models; submodular functions are the opposite of graphical models. Why is that? In a graphical model you have a graph of some sort (there are lots of different kinds of graphical models), and the graph encodes factorization assumptions about any probability distribution that abides by the particular graph. In this particular case, in equation five, you've got a graph which has a set of cliques, and any probability distribution that lives within the family associated with the graph has to factorize with respect to that set of cliques. You've got factorization and decomposition. With a submodular function you can also instantiate probability distributions, for example p of x equal to one over Z times this quantity, and in this equation there are absolutely no factorization assumptions required. Submodular distributions are what is called nongraphical: you can't say anything about them with a graph.
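The quantity on the slide is not reproduced in the transcript, but a common way to instantiate such a distribution from a submodular function f, offered here only as an assumed example, is

    \[ p(X) \;=\; \frac{1}{Z} \exp\bigl(-f(X)\bigr), \qquad Z \;=\; \sum_{X \subseteq V} \exp\bigl(-f(X)\bigr), \]

where X ranges over subsets of the ground set V and no factorization of f is assumed.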
And you might think, well, that means it's basically just one large clique and inference is hopeless, but it turns out it's not hopeless, because you are making very different kinds of restrictions on the model: you are making submodularity-like restrictions, and it is still possible to say things and to do inference. This is very powerful, because factorization assumptions are usually nice computationally, but oftentimes that's not what exists in nature. So this is very powerful, and therefore submodular functions are sort of the opposite of graphical models.

That's the end of the background on submodularity. What I want to talk about are some properties of submodular functions that we have been discovering and then exploiting over the past couple of years, some of which I think are quite remarkable, namely the semigradients that they have. I think it should be fairly well understood in this group that if you know about convexity and concavity, you know that convex functions have subgradients: any convex function has a subdifferential, and even where a continuous convex function is not differentiable you still have a subgradient; and of course concave functions have the same thing in the form of supergradients. Here's the picture in case that's not clear. You have a convex function f and a subgradient; in this particular case it's a subgradient tight at b, the linear lower bound that touches the convex function at b. Any convex function has that, and for a polyhedral convex function multiple linear lower bounds can exist, say at a vertex of the polyhedral convex function. A concave function has the same kind of thing, except that it's a supergradient, something that touches at a particular point and is everywhere above the concave function.
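In standard notation (not taken from the slides), g is a subgradient of a convex function f at a point y if

    \[ f(x) \;\ge\; f(y) + \langle g,\, x - y \rangle \qquad \text{for all } x, \]

and a supergradient of a concave function satisfies the same inequality with the direction reversed.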
It's been well known for a while, and I think in some sense this goes back to the work of Jack Edmonds, that submodular functions have subdifferentials; this is probably most eloquently articulated in Fujishige's book, and you can define the subdifferential in this particular way. Moreover, there are particular subgradients that are unbelievably easy to compute. Is that until the 20 minute mark or the 25 minute? That's 20 minutes, okay, good; I just want to know how bad I'm going to be. So a subgradient can be computed using the greedy algorithm. The basic idea is that you choose an ordering of the elements, I think it's on the next slide: you choose an ordering of the elements here, and then you just compute this, which is basically a discrete linear function, what's known as a modular function. This is a function that only has n degrees of freedom; it touches the submodular function at a particular point and is everywhere below the submodular function, just like a subgradient of a convex function.
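Here is a minimal Python sketch of that greedy construction; the function names are mine, and f is assumed to be a normalized submodular function (f of the empty set is zero) that accepts a list of elements.

    def greedy_subgradient(f, ground_set, X):
        """Modular (additive) lower bound of a submodular f, tight at the set X.
        Assumes f([]) == 0. Uses an ordering that lists the elements of X first."""
        order = list(X) + [v for v in ground_set if v not in X]
        weights, prefix = {}, []
        for v in order:
            # Weight of v is its marginal gain given everything ordered before it.
            weights[v] = f(prefix + [v]) - f(prefix)
            prefix.append(v)
        # The modular function Y -> sum(weights[v] for v in Y) satisfies
        # h(Y) <= f(Y) for all Y, with equality at Y = X (telescoping sum).
        return weights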
Let's go back to the continuous world for a minute and ask the following question: can there be both a tight linear upper bound and a tight linear lower bound on a convex or concave, or for that matter any continuous, function? Thinking about it pictorially, here's an upper bound and here's a lower bound, and if we have a tight linear upper bound and a tight linear lower bound there is very little space in between [laughter] to fit a function, other than of course an affine function. Therefore, if you have a tight linear upper bound and a tight linear lower bound at any one point, you must have an affine function. Now the question is, does this also hold for discrete functions? Something that we discovered not too long ago, which I think is a remarkable property of submodular functions, is that any submodular function not only has a tight modular lower bound at any point but also has a tight modular upper bound at any point; like I said, in the continuous case that would restrict the function to be affine.

First of all, how does this work? In 1978, another classic paper, by Nemhauser, Wolsey and Fisher, showed that either of the following two inequalities is necessary and sufficient to define submodularity. What we did is take these inequalities and relax them a little bit, subtracting off a little bit less here or adding on a little bit more there, in such a way that we can define two modular functions that are tight at a particular set X and everywhere else upper bound the submodular function, and that's for any submodular function. How does this work? Well, what is a modular function? A modular function has n plus one degrees of freedom. Here is a particular example: a modular function must satisfy the submodular inequality with equality, m of X plus m of Y has to equal m of the union plus m of the intersection, or equivalently, m can be written in the following form, some constant plus the sum over all of the elements of X of the individual value of that element. So here is one of the two modular functions that I described on the previous slide. It can be written as a modular function by just breaking this term into two terms: the one going over here becomes a constant with respect to Y (notice that there is no Y here), and the rest becomes something that involves the elements of Y, namely X intersect Y and Y minus X, which together are all of the elements of Y that could be considered, so you have a value for every single individual element of Y.
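The slide formulas are not in the transcript; the two modular upper bounds being described are usually written as below, where f(j | S) denotes f(S ∪ {j}) − f(S). These are the standard forms derived from the Nemhauser, Wolsey and Fisher inequalities, quoted from the literature rather than from the slide itself.

    \[ m^{1}_{X}(Y) \;=\; f(X) \;-\; \sum_{j \in X \setminus Y} f(j \mid X \setminus \{j\}) \;+\; \sum_{j \in Y \setminus X} f(j \mid \emptyset), \]
    \[ m^{2}_{X}(Y) \;=\; f(X) \;-\; \sum_{j \in X \setminus Y} f(j \mid V \setminus \{j\}) \;+\; \sum_{j \in Y \setminus X} f(j \mid X). \]

Both are tight at X, meaning m(X) = f(X), and satisfy m(Y) ≥ f(Y) for every Y ⊆ V.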
In fact, more recently we have shown that there is an entire submodular superdifferential; this is in a paper that is going to appear at NIPS this year, and you can define a superdifferential and do all sorts of things with points within it. Okay, so the summary is that for submodular functions we have very efficiently computable linear lower bounds and very efficiently computable linear upper bounds. What can we do with them?

The first application is cooperative cut for image segmentation. The basic idea of cooperative cut is a generalization of graph cut: in graph cut you've got an edge-weighted graph, with weights on the edges, and the goal is to come up with a cut that minimizes the cost of the cut, where the cost is the sum of the weights of the cut edges. The idea of cooperative cut is to replace that sum over the edges with a submodular function defined on the edges. You still want to find the cut that minimizes the cost, but it's no longer the case that the edges of the cut must not interact; the edges can communicate with each other, or as we call it, they can cooperate with each other. So we can write it this way: we want to find a cut that minimizes this submodular cost on the edges. It's critical to understand that this submodular function is defined on the edges. Some of you might be aware that if you think of it as a node function, the standard graph cut is submodular on the nodes, but cooperative cut loses submodularity on the nodes, even though in some sense it has a submodular function embedded in it which defines the problem; here the submodular function is defined on the edges.

So how do we use supergradients? There were many methods that we developed to solve this problem, but one of the most efficient, and actually the one that was used in practice (it's very practical and it works very well), is to use supergradients. The basic idea is a majorization-minimization algorithm, very similar in some sense to the EM algorithm. We start with an initial cut, which in this case can be anything; it can be the empty set. We find a modular upper bound that is tight at that particular cut, and if the function is nonnegative and nondecreasing then we get a standard graph cut problem. We solve the standard graph cut problem, we get a cut, and that cut is then used as the tight point for the next modular upper bound, and we repeat; we just repeat this process. As I said, it works very fast, and if you have fast graph cut solvers you don't even have to write the code for the fast graph cut solvers.
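A minimal sketch of that majorization-minimization loop; the arguments modular_upper_bound and solve_min_cut stand in for the two pieces just described (a bound routine and any standard graph cut solver) and are not a particular library's API.

    def cooperative_cut_mm(f_edges, modular_upper_bound, solve_min_cut,
                           init_cut=frozenset(), max_iters=20):
        """Iteratively minimize a submodular function on edge sets (cooperative cut).
        Each iteration majorizes f_edges by a modular upper bound tight at the
        current cut, which reduces the step to a standard min-cut problem."""
        cut = init_cut
        for _ in range(max_iters):
            edge_weights = modular_upper_bound(f_edges, cut)  # per-edge weights, tight at `cut`
            new_cut = solve_min_cut(edge_weights)             # ordinary (modular) graph cut
            if f_edges(new_cut) >= f_edges(cut):              # no improvement, stop
                break
            cut = new_cut
        return cut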
First of all, difficulty: the actual problem of cooperative cut is hard. In fact, it's not possible to approximate it better than a factor of the square root of E, where E is the number of edges, and this particular majorization-minimization algorithm has a bound of O of E, so the gap from the square root of E means it is obviously not a tight bound in any sense of the word; but on the other hand it's so simple and so easy to get working that this was one of the algorithms for this problem that we were most excited about. Then we applied it, and it applies to very large problems, for example problems in image segmentation.

Here's an example of an insect that comes from the Seattle Insect Museum, so I just want to give a plug to the Seattle Insect Museum: if you have children and you want to go see some bugs that look like this, go to downtown Seattle and take pictures of the insects. They are very nice and friendly. [laughter] The typical problem that happens when you are trying to segment your insect is that you label some of the points in the background, then you label some points in the foreground, and what happens is that the antennae, these protrusions, get cut off. This is something called the shrinking bias problem in computer vision. Here's another example where you've got some calligraphy which undergoes some contrast gradients; it's a little harder to see on this screen than on a computer screen, but you get the idea that there is a gradual decrease in lighting, which causes the segmentation algorithm to just clump things together where you don't have a lot of contrast. Here's another example of a fan; you can see that a lot of the fine structure was completely lost in this area here. Here's a little chili plant. And these are state-of-the-art image segmentation methods.

What we did is apply cooperative cut to this problem. The intuition as to why this works is that, with so much of the function defined on the edges, once you have the highly reliable portions of the image segmented, based on say the boundary of the body of the object, submodularity makes additional edges that are similar to those cheaper to use and to include in the cut. Therefore when you iterate this process, you first get the boundary, and suddenly all these other edges that lie along the antenna become cheap, they get added to the cut, and you get improved results. Just a couple of other results: here is the calligraphy; here is the fan, which we were quite happy about, and this is quite remarkable; this is a vacuum cleaner, a typical thing you want to segment of course; chili plants; and then here are some benchmark results. I don't have time to go into the details, but basically on images that do have these elongated structures it is much better, and on images that don't have elongated structures or contrast gradients it does the same, which is exactly what we want it to do.
Now the next application of semigradients that I want to talk about is another problem: minimizing or maximizing the difference between two submodular functions. This is an important problem, and there are many, many applications of it. For example, consider sensor placement with submodular costs. In sensor placement it is usually the case that you place the sensor where you gain the most information, but what if there is a cost associated with the placement of the sensor that is itself submodular? This is a very likely model, because it might be the case that placing sensors in a particular region has economies of scale. If you buy the equipment to place a sensor on the roof, and you get the ladder out of storage, then you already have the ladder out in the room and you can move it around the room. Or if you want to place a sensor in a particularly precarious environment, you invest in the equipment necessary to install that sensor, and thereafter you have the equipment already. So this is -- I still have 5 minutes until the next talk; is that right? So [laughter] I'm being warned. I know that there are no questions, right?

A couple of other applications: discriminatively structured graphical models, that is, structure learning in graphical models where you want the graph structure to somehow perform well for classification purposes. Feature selection, where you have submodular costs; this is also typical. For example, you might have spectral features, and let's say a group of features is computed using the FFT. Once you choose the first such feature for a pattern recognizer, all of the others are essentially cheap, but if you don't need any of these particular spectral features then you don't need to compute the FFT at all, and therefore there is a diminishing return associated with the feature costs; most feature selection work has not associated this sort of interactive cost with the features. And also graphical model inference: if you take p of x to be an exponential model where v is some non-submodular, arbitrary set function, then that is the function you want to optimize. And now I am trying to remember what I want to talk about next [laughter].
In 2005 we showed, first of all, that any set function can be represented as a difference between two submodular functions, and we developed an algorithm which was a form of majorization-minimization: we take the function f minus g, which we want to minimize, and we replace g with its modular lower bound, which makes the whole thing an upper bound, and then we iteratively optimize. But what we can now do with these modular upper bounds is the same thing on the other side: we take the f function and replace it with its modular upper bound, so where on the previous slide each iteration was a submodular function minimization problem, this turns each iteration into a submodular function maximization problem. And furthermore, we can do both: we replace f by its modular upper bound and g by its modular lower bound, and then each iteration of this majorization-minimization procedure becomes a modular minimization, which is incredibly cheap; it's basically order n to do.
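A minimal sketch of the fully modular variant just described; modular_upper_bound and modular_lower_bound are passed in as arguments (for example, the greedy construction sketched earlier), so this is a schematic rather than the authors' code.

    def ds_minimize(f, g, ground_set, modular_upper_bound, modular_lower_bound,
                    init=frozenset(), max_iters=50):
        """Minimize f(X) - g(X) for submodular f and g by majorization-minimization.
        Each step replaces f by a modular upper bound and g by a modular lower bound,
        both tight at the current set X, so the inner problem is purely modular."""
        X = set(init)
        for _ in range(max_iters):
            mf = modular_upper_bound(f, X)  # dict: element -> weight (constants dropped)
            mg = modular_lower_bound(g, X)  # dict: element -> weight (constants dropped)
            # Minimizing a modular function just keeps the negatively weighted elements.
            new_X = {v for v in ground_set if mf[v] - mg[v] < 0}
            if f(new_X) - g(new_X) >= f(X) - g(X):  # no improvement, stop
                break
            X = new_X
        return X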
So there are a lot of other properties; this was in a paper at UAI this year. We tried it in the context of feature selection and showed that, when you have submodular-cost features and you plot the pattern recognition error as a function of the submodular cost, all of these algorithms did well. We were particularly happy that the variant where each iteration is modular did well, because that one is so easy to optimize, whereas other greedy heuristics do not do that well.

The last thing I want to talk about is another application of semidifferentials, which is generalizations of Bregman divergences: discrete Bregman divergences. Bregman divergences are well known in machine learning; they have been studied and used for many applications in clustering, proximal minimization, and online learning, and they generalize things like the squared two-norm and the KL divergence. What we wanted to do was develop a discrete family of divergences, and similar to the way that continuous Bregman divergences can be seen as something involving the function and its subgradients, in the submodular case we can involve the subgradients of the submodular function. In fact we have two versions, because now we have both the subgradient version of the submodular Bregman and the supergradient version of the submodular Bregman. I just want to say that you can do all sorts of things with these. You can, for example, get Hamming distance, which is a discrete divergence, but of course Hamming distance isn't the most interesting thing; there are many other ones, like recall and weighted recall, something called the alignment error rate which is used in machine translation, and you can generate conditional mutual information. With the supergradient versions you can again get Hamming, precision and precision-like measures, and other measures like Itakura-Saito and generalized-KL-divergence-like things, and a number of interesting ones based on cuts which we don't necessarily have a good interpretation of yet, but it's still a form of divergence.
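For reference, the generic forms of the two families, in standard notation rather than copied from the slides: with a subgradient h_Y of f that is tight at Y, and a supergradient g_Y that is tight at Y, one can define

    \[ d_{f}^{h}(X, Y) \;=\; f(X) - f(Y) - h_{Y}(X) + h_{Y}(Y) \;\ge\; 0, \]
    \[ d_{f}^{g}(X, Y) \;=\; f(Y) + g_{Y}(X) - g_{Y}(Y) - f(X) \;\ge\; 0, \]

where h_Y(X) denotes the sum of the subgradient's weights over the elements of X (and similarly for g_Y); different choices of the sub- or supergradient recover the specific divergences just listed.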
So I'm going to end with this slide, which is a possible application of these Bregman divergences: k-means style clustering. Here, rather than clustering real-valued vectors, you are clustering binary vectors, and so you have problems like the left-mean problem and the right-mean problem. Computing the left mean would ordinarily be very, very difficult, but it turns out that, depending on which submodular Bregman you choose, if you use a subgradient-based submodular Bregman then this problem becomes a submodular minimization problem. In other words, you have a large number of binary vectors and you want to find the bit vector which is collectively closest to all of them; with a subgradient-based submodular Bregman, that's a submodular minimization problem. And if you use the supergradient-based submodular Bregman, the same problem with the right mean is a submodular maximization problem. And that is the end of the talk. I want to thank again the students, and that's it. [applause] Did I go over time?