>> Mohit Singh: Okay. So hello, everyone. It's
good to have Roy back again over here. Roy was a
post-doc here for a couple of years, and now he's
a post-doc at Princeton University and will be headed to
the Technion as an assistant professor in a couple of
months, I guess. And today he'll talk -- tell us
about some fast and simple algorithms for submodular
maximization, a topic he has been working on for the
past few years.
>> Roy Schwartz: Yes. Thank you all for coming. So
as Mohit said, I'll talk about fast and simple
algorithms. Towards the end of the talk the algorithms will be
slightly less fast and slightly less simple, but
we'll start with things which are fast and simple.
So what do we have in submodular maximization?
So we have a ground set, and we have some nonnegative
set function that is defined over this ground set which
is submodular, and let's say you have some collection
of subsets M that describes all the feasible
solutions. So in general in submodular maximization
you want to maximize the submodular function F under
the constraint that the output is feasible. Okay?
So this encodes many problems, many classical problems
in combinatorial optimization. And you can think of
many constraints, for example, the unconstrained
case, the cardinality case, which corresponds to
maximum K coverage, for example, a partition matroid,
knapsack, and so on.
So what is the goal of this talk? So the goal is to
see if we can find fast and simple algorithms. So
fast is well defined. You have the running time, so
the faster the better. Simple is not always well
defined and sometimes it's a matter of taste. But
I'll try to convince you that some of the solutions
are indeed simple, simple in the sense that if
someone needs to implement this, they don't need to
read a 30-page paper to understand the algorithm. So
the algorithm is very, very simple.
And the reason -- so one might ask, okay, why are
you interested in simple algorithms? Fast is always
nice. So the reason is that in recent years there
are many applications that arise in machine learning
and data mining, and even in the machine learning
community there's a subcommunity that deals with the
specific applications of submodular optimization to
problems in learning.
So here are some examples where algorithms that were
developed in the theory community for submodular
maximization, depending on the constraint, were used --
algorithms that, first of all, had provable guarantees, because most of
these problems are NP-hard, so we want something
that bounds the quality of the output, and
also are fast and simple, simple
enough that you can actually implement them.
So these are examples of works that actually wanted
to solve some practical problem, or a more practical
problem, and they had datasets for that problem, and
they could have thrown in any algorithm or heuristic, but
actually algorithms that were developed in theory were
used here.
So one of the more amusing ones, at least in my case,
is the second one. So it does read gang violence
there. So there is this work that -- they worked, I
think it's the Chicago Police Department and they
modeled the gangs there as a network, and we'll see
in a second the example of network influence.
And their goal was to reduce gang
violence, and they wanted to see which members, which
are actually vertices in the network, they need to
take out from the network in order to maximize the
good influence. Once you take something out of the
network it's good in this case.
>>:
They are taking out the --
>> Roy Schwartz: Taking out, yeah, it doesn't mean
shooting in the head. It means usually
rehabilitating, not putting behind bars, as far as I
understood from very briefly looking at the paper,
but I didn't read their model all the way through,
yeah. But this is one of the more amusing ones, but
there are more, let's say, classic or well understood
examples, but just this is for the fun.
So I mentioned networking.
So I'll just give you one
example, and actually, I think most of these
examples -- at least three of them -- in some
sense are network influence problems.
So what is network influence, or how does influence spread
through the network? So this is, again, just one
motivation, but it's actually a motivation that was
used in practice. So we have a network which is a
graph. It could be a social network, a biological
network, any kind of network you like. And initially
some set of vertices or the nodes are activated. So
those turn red.
And now there is some random process usually that
runs, and the influence spreads according to this
process until some point where the influence stops.
So for example, if this is a biological network, it
could be the spread of disease.
So what models are used, so two classical models that
are used are what is called the independent cascade
and the linear threshold models. So in the
independent cascade model, there are probabilities on
the edges. Once a vertex or a node is activated, it
has a chance to activate each of its neighbors
according to the probability on the edge, and only
when it first becomes activated.
And this describes how the influence spreads. And in
the linear threshold model, each vertex has a random
weight or threshold, and once the total weight of its
activated neighbors increases above the threshold, it becomes
activated. And this is how the influence spreads.
And these models were also studied in sociology and
in many other areas.
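Just to make the dynamics concrete, here is a minimal simulation sketch of the independent cascade model as just described; the adjacency representation and the function name are illustrative assumptions, not from the talk. Averaging the size of the returned set over many runs estimates the expected influence, which is the submodular objective discussed next.

```python
import random

def independent_cascade(graph, seeds):
    """One run of the independent cascade model.

    graph: dict mapping each node to a list of (neighbor, prob)
           pairs -- the activation probabilities on the edges.
    seeds: the initially activated set S (the red nodes).
    Returns the set of all nodes activated once the process stops.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        # A node gets one chance, when it first becomes activated,
        # to activate each inactive neighbor with the edge probability.
        for neighbor, prob in graph.get(node, []):
            if neighbor not in active and random.random() < prob:
                active.add(neighbor)
                frontier.append(neighbor)
    return active
```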
So how does this relate to submodular maximization or
submodularity at all? So in the paper of Kempe,
Kleinberg and Tardos, they show for these classical
models that if you look at the random variable X of S,
which is the number of vertices that are activated --
that are red -- once the dynamics stop, where S is the
initially infected set, the red ones, then the
expected number of vertices activated is
basically a submodular function in S.
So the ground set are the nodes and the expected
number of activated nodes at the end is submodular.
So if you want, let's say, to count the expected
number of newly activated vertices, that corresponds
to a non-monotone submodular function, for example.
And Kempe, Kleinberg and Tardos asked this question
relating to how to market things in social networks,
and this relates to [indiscernible] theory by
Hartline, et al., but this is only again one example
of a heavily used application.
Of course, if there are questions, feel free to stop
me anytime.
Okay. So as I said, I decided in this talk to talk
about constrained problems, and I'll start with one
of the simplest constraints, which is the cardinality
constraint. So in this case, what is the problem?
We have the ground set N, a nonnegative submodular
function, and some cardinality constraint K, and the
goal is to output any subset that contains up to K
elements and maximizes the value of the function,
right?
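Written out, this is the problem

$$\max \{ f(S) : S \subseteq N,\ |S| \le K \}, \qquad f : 2^N \to \mathbb{R}_{\ge 0} \text{ submodular.}$$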
So in the network influence example, this
corresponds, for example, if you want to market
something online, then you have a budget of K, let's
say, things you can give up for free and you hope
that the influence spreads, and once the influence
spreads, more people buy, let's say, your product.
So for this problem, which is very classical in
combinatorial optimization, in the late '70s,
Nemhauser, Wolsey and Fisher showed that you can get a one
minus one over E guarantee by the simple greedy
algorithm, and this is known to work for monotone
functions.
And this result is also tight. In what sense is it
tight? So Nemhauser and Wolsey also showed in the
late '70s that one minus one over E is the absolute
best guarantee you can have.
So what does absolute mean here? So no one asked
me how the function is given. So usually you assume
you have a black box that is called a value oracle:
you provide this oracle a subset S and you get the
value of S, namely f of S. So if the algorithm has
access to the submodular function only through this
value oracle, then any algorithm that performs a
polynomial number of queries cannot get an
approximation better than one minus one over E.
So in that sense, this is absolute hardness.
For the special case where, let's say, the function is
given explicitly as a coverage function, then Uri
Feige showed that this bound is also tight, but there
you assume that P doesn't equal NP, right? So in
some sense this seems very good, because greedy
is a simple algorithm. It's very fast and it gives
you the best guarantee you can have when the function
is monotone.
So what happens when the function is not monotone?
For example, in the network influence problem, when we
want to maximize the number of newly influenced
vertices? So does greedy work here? So the answer
is no. And let's see a very simple example.
So our ground set will have n elements. Let's call
them U1 up to UN, and one special extra element V. So
we're going to look at these two types
of functions. There's a function for V, which is just
the indicator that V is in the solution, in the
subset. And this is, of course, a submodular
function. You can check that it has the
diminishing marginals property, and for every UI
we'll have an indicator that UI is in the subset but
V is not. So this is a non-monotone function which
you can actually check is also submodular.
So now if you take any nonnegative combination of
those, it's also submodular. So we're going to put a
weight that is slightly bigger than one on the
special function for V, just a weight of one for all
the rest, and sum them up.
So what will the greedy algorithm do? It will first
choose V, right? It will run for K steps, but
once the algorithm chose V, it's doomed. It cannot
get a value more than one plus epsilon, but the
optimal solution is to choose K of the UIs, which has
value K. So this is a gap of essentially K,
which shows that greedy completely
fails here because K can be very large. Okay?
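To see the failure concretely, here is a small sketch of that bad instance; the element naming and the choice epsilon = 0.1 are just for illustration. Running plain greedy on it reproduces the gap of essentially K:

```python
def make_instance(n, eps=0.1):
    """Hard instance for greedy: ground set {u_1,...,u_n, v} and
    f(S) = (1+eps)*[v in S] + |{i : u_i in S and v not in S}|."""
    ground = [f"u{i}" for i in range(1, n + 1)] + ["v"]

    def f(S):
        S = set(S)
        if "v" in S:
            return 1 + eps  # every u_i indicator is killed by v
        return len(S)       # just counts the u_i's present

    return ground, f

def greedy(ground, f, k):
    S = []
    for _ in range(k):
        # add the element with the largest marginal gain
        best = max((e for e in ground if e not in S),
                   key=lambda e: f(S + [e]) - f(S))
        S.append(best)
    return S

ground, f = make_instance(n=100)
print(f(greedy(ground, f, k=10)))          # 1.1 -- greedy grabs v first
print(f([f"u{i}" for i in range(1, 11)]))  # 10  -- the optimum
```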
So what is known about, let's say, this case? So
there are works that are known. So Jan Vondrák
showed that a continuous version of local search gets
0.309, and I don't remember the other digits after
that, and then Shayan and Jan, extending, let's say,
the simulated annealing approach, were able to
show a bound of 0.325, which actually holds for any
matroid constraint, so it comes from a much more
general problem. And after that, together with Moran
and Seffi we showed you can get one over E, and that
again comes from the general matroid constraint. So
this is not specifically tailored for a cardinality
constraint.
So this is nice, but the problem is that all these
algorithms are, let's say, efficient in theory. So
all of them are polynomial. Essentially K is at most
N, but they are polynomial in N. I think the best
one, this continuous greedy, is something like N to the
sixth or N to the fifth. So --
>>: Did you define exactly what a submodular function
was?
>> Roy Schwartz: No, because last time I gave a talk
here about submodular functions, Yuval told me you
don't need to define what a submodular function is.
Everyone knows. So Yuval is not here. Okay.
So yes.
So a submodular function, let's do the diminishing
returns property. So if you have a subset S of
elements and there's a larger subset T that contains
that set, and you take an element X, and now you ask
what's the change in the value of the function once
you add X to S and once you add X to T.
So diminishing returns means that the more you have,
the less you'll earn, right? Because if you have one
dollar and I give you one more dollar, you'll be happier.
If you had one million dollars and I gave you one
more dollar, you'll still be happy, but it won't
change your happiness level much, right? So it means
that the change in the value of the function once
you add X to S is at least as large as once you add --
sorry -- X to the larger subset T.
And this needs to hold for every S, and T that
contains S, and X that is not in T, right? So there
are several ways to define submodular functions, but
for me this is the most intuitive one. Okay?
And this -- we'll use the notation
later -- is sometimes called the marginal of S with respect to
element X. It is the change in the value of the
subset S once you add element X to it, okay? So
thank you for the comment. Sure, yeah,
everyone remembers.
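Written out, the diminishing returns property, with the marginal written as $f_S(x)$, is

$$f_S(x) \;=\; f(S \cup \{x\}) - f(S) \;\ge\; f(T \cup \{x\}) - f(T) \;=\; f_T(x) \qquad \text{for all } S \subseteq T \subseteq N,\ x \notin T.$$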
So the question is what can we do here, because there
are practical applications, and can we get something
with a provable guarantee that is actually also
fast and simple in some sense. So what I will
describe to you now is a simple randomized algorithm
whose running time is order NK, which is exactly the
running time of the greedy algorithm, right? You
have K iterations and each one takes linear time.
But it is oblivious to whether the objective
is monotone or not. And if the objective is
monotone, you get the tight one minus one over E
guarantee. And if it's not monotone, you get the one
over E guarantee. Okay?
>>: No, because -- do you use the same algorithm or not?
>> Roy Schwartz: It's the same algorithm. It's
oblivious, but this works only for the cardinality
constraint. The continuous greedy gives you a
fractional solution for any down-monotone closed
polytope, but this is tailored for the cardinality
constraint.
>>:
Sure.
>> Roy Schwartz: Okay? So what's the general idea?
So if we want the greedy-like algorithm, let's say
this is our ground set and S is the current solution
the algorithm holds, so what do you do in greedy?
You look at all the elements that are outside the
solution, let's say everything that is outside the
purple set, and you pick the one with the largest
marginal, right? The one that adding it to the
subset will have the largest gain and you add it.
So what we'll do now is we'll choose a suitable list
of candidates, let's say the yellow dots, and we're
going to choose the list greedily, and then choose
an element uniformly at random from that list. Let's
say this element, okay?
So apparently we were not aware of this. This type
of, let's say, approach has a name which is called
GRASP, and David Shmoys pointed this out to us, which
stands for Greedy Randomized Adaptive Search Procedure,
okay? And it was introduced in the OR community by
Feo and Resende, and there it was used as a heuristic.
So as far as I know in the examples that this
approach was used, there were no proofs, no rigorous
guarantees. And over here there are several -- these
three were actually surveys, if anyone is interested
in looking this up. So we didn't know it's called
GRASP and we just named it randomized greedy because
it's simpler.
So now let's give a precise description of the
algorithm I just mentioned.
>>: Maybe it's a very trivial question, but in
expectation [indiscernible], right?
>> Roy Schwartz: Yes.
>>: And also for the greedy case?
>> Roy Schwartz: You mean the monotone case?
>>: Greedy is deterministic, right?
>> Roy Schwartz: Yeah, so if the function is
monotone and you run the greedy algorithm, it's not
in expectation. It's always.
>>:
But all guarantees --
>> Roy Schwartz: Yeah, it's randomized so the
guarantees are in expectation.
>>:
But is it easier to de-randomize?
>> Roy Schwartz: I don't know. The question is can
you de-randomize it and keep the running time,
because if you want to de-randomize it, if you want a
deterministic algorithm, you could use the continuous
greedy that is adapted to non-monotone functions,
right? You'll get the one over E guarantee, even for
general matroid, but the running time will be
[indiscernible].
So the point here is to get something very fast and
simple. So if you can de-randomize and keep it fast
and simple, that would be great, but -- what?
>>: Yours has a dual guarantee. For monotone you
get one minus one over E. So is it also true for the
continuous greedy case?
>> Roy Schwartz: That?
>>: If you run continuous greedy on a monotone
function, does that give you --
>> Roy Schwartz: Yes. The adapted continuous
greedy. Not the original one.
>>: I see. So it does give?
>> Roy Schwartz: Yes. Okay. So now let's see an
exact description, and we'll see a picture in a second. So we
start with, let's call it S0. In the beginning we
have nothing, and we have K iterations, I equals one,
two, three up to K. So what do we do in these iterations?
Well, first we need to choose the list -- the yellow
points, the list of candidates. So we choose those
greedily.
So what does that mean? We look at all the elements
and we want to find a subset of them of size K -- MI, a
subset of the unchosen elements, right.
And we just want those that have the largest
marginals. So for every element you have a number,
which is the marginal with respect to the solution so
far -- let's say in the first iteration S0, which is
the empty set -- and we choose -- so the list of
candidates in each iteration is the best K candidates
according to what you have so far.
And then what do you do? You just choose one
uniformly at random. You don't look even -- let's
say one has a huge marginal and one has a very tiny
marginal, you ignore that. You just choose one
uniformly at random and you add it to the solution to
create SI, which is the solution to the next
iteration. And you do this K times. Okay? Very
simple.
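As a reference, here is a minimal sketch of the randomized greedy just described, assuming `f` is a value oracle taking a set and `ground` is the ground set as a Python set; the dummy elements discussed a bit later are folded in directly, so a dummy pick is a wasted iteration:

```python
import heapq
import random

def randomized_greedy(ground, f, k):
    """Randomized greedy for max f(S) subject to |S| <= k."""
    S = set()
    for _ in range(k):
        # Marginal of every unchosen element with respect to S.
        marginals = {e: f(S | {e}) - f(S) for e in ground - S}
        # Candidate list M_i: the k largest marginals...
        M = heapq.nlargest(k, marginals, key=marginals.get)
        # ...padded with zero-marginal dummy elements so that a
        # negative marginal is never forced into the solution.
        M = [e for e in M if marginals[e] > 0]
        M += ["DUMMY"] * (k - len(M))
        u = random.choice(M)   # the uniformly random choice
        if u != "DUMMY":       # a dummy pick wastes the iteration
            S.add(u)
    return S
```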
>>: Ignoring the relative ratios between the marginals,
is that for a reason or is it just that --
>> Roy Schwartz: First of all --
>>: -- it works.
>> Roy Schwartz: Right. That is a very good reason.
And it's very simple.
>>: Is it the case that choosing proportional to the marginals
doesn't work?
>> Roy Schwartz: I don't know. I don't think we
could prove that. Actually, I think it -- okay. I
think it might work, but you don't want to start
choosing things -- this is easier. Just choose one,
right? You don't care about the values, anything.
Just output what you have at the end.
>>: Maybe I missed this. What is K?
>> Roy Schwartz: K is the cardinality constraint.
You can choose up to K elements. Okay? And that is
also the size of the candidate list. So the list of
yellow -- the number of yellow points in every step -- is
exactly K. So let's see an example, and ignore this
assumption for now.
So in the beginning our solution doesn't contain
any of the blue dots, right? So now we look at the
first list of candidates. Let's say the red set and
the yellow are the candidates. Let's say there are
exactly seven of those, so K is seven, and we choose
one uniformly at random. Let's say the red one here,
and we add it to S1. And now again we look at seven
candidates that do not contain the one we already
chose. Let's say this is M2, the second set of
candidates. We choose again one uniformly at random,
and let's say this point, the red one. We add it,
and we move to another set, and on and on.
>>:
So why is this size of the set MI equal to K?
>> Roy Schwartz: Uh-huh.
>>: Is there an intuitive reason why --
>> Roy Schwartz: Actually, so because it's simple,
the proof is also simple. So I'm going to completely
prove this algorithm in a few minutes and you'll see
exactly why you need to keep it K, okay? Yes.
>>: [indiscernible] at that point, right? And then
you apply that function for each and pick
the K maximum, right? But in the
[indiscernible] scenario the graph is so large, so is
it possible to apply the function for all --
>> Roy Schwartz: So actually you're asking how you
implement the value oracle in different applications
essentially is what you're saying. In order to
calculate the marginals, you need to know this and
this.
>>:
That's true.
>> Roy Schwartz: So there are specific works, for
example, in network influence, that show how to
implement the value oracle essentially, but that depends
also on the specific application you have. So
that's an independent question. So now we assume we
have this black box. If I cannot access the
function, then I'm in trouble. I don't know what
I can do, right?
>>: But even if you have the function, it's possible
that you don't have the knowledge of the whole graph,
right?
>> Roy Schwartz: Oh, you mean the entire ground set?
>>: That's true.
>> Roy Schwartz: Okay.
>>: [indiscernible].
>> Roy Schwartz: Yeah. Okay. So here we assume
that the ground set is known in advance. Okay? So
there are a few works, but those are only recent, that
deal with this essentially. Maybe the closest thing I can
think of is the online case where you slowly
reveal -- the ground set is slowly revealed to you,
but there, usually, what is done in the online setting is there's an
adversary that controls the order. But that's a good
question. Again, it depends on the application you
have, but here we take the combinatorial optimization
approach for this, okay?
Was there another question? I thought there was
another question here. You had a question?
>>:
Oh, no.
>> Roy Schwartz:
Okay.
>>: So you really just want the top K ground set elements by
marginal?
>> Roy Schwartz: Sorry?
>>: So this MI is really just a -- you want --
>> Roy Schwartz: You can, yeah. In some cases you can do that.
>>: So --
>> Roy Schwartz: In linear time. You're just
finding the [indiscernible], right?
>>:
I don't know.
>> Roy Schwartz: You might be right; with some
applications it might be a problem, but there are
applications where this was implemented. So again,
it depends on the application you have before you,
okay? But it's a good point.
So no one asked me, the function is not necessarily
monotone, right? So what happens if the best K
marginals, some of them are negative? We don't want
to choose those. So we assume that there are dummy
elements -- you can always add dummy elements to the ground
set that always have a marginal value of zero.
Essentially they have a linear contribution of zero.
So you add K or 2K of those, and that is enough to
ensure that all the elements you choose always have
nonnegative marginals. So what does it mean if the
algorithm chooses by chance a dummy element?
It means that it was a wasted iteration.
You have exactly K iterations, and in one of those
you actually didn't choose anything from the original
ground set. So this means that if the function is
non-monotone, the output could contain less than K
elements. Okay?
So now I'm going to show you how to analyze this
algorithm. And let's start with the monotone case.
So intuitively this seems more wasteful than the
greedy algorithm, right, because we choose randomly,
which could be much worse, right? We don't
even choose according to probabilities
proportional to the marginals.
So now let's prove that it gives you the tight
guarantee, and anyone who knows the
original greedy proof will see that it's essentially
almost the same proof. But if you don't remember,
we'll see the proof right now.
So what we're going to need for the proof is the
following. So if you remember SI minus one is what
we had in the beginning of the Ith iteration or the
algorithm so far. And let's say MI is the list of
candidates. So what is TI? TI is all the elements
that are in OPT that the algorithm didn't choose so
far, right?
So let's think of them as the green subset, right?
TI doesn't contain the things the algorithm chose
so far. It might have -- I don't know -- some
intersection with the candidate list, but it's
everything that is in OPT that the algorithm didn't
choose so far.
So what we're going to do: assume S -- what the
algorithm chose so far -- is known to you, and we're
going to see what the expected gain of the algorithm
in the Ith iteration is. If we lower bound this, we're
done. So how much do we gain -- so if you remember
UI is the element, the uniformly random element from
the candidate list.
So what is the gain here, the expected gain? So we
choose uniformly at random, right? So it's just one over
K times the sum over all the candidates of their
marginals, right? So how did we choose the list of
candidates? Greedily, right? We chose the ones that
maximize the sum of the marginals. So we can
substitute it now with -- all the elements in OPT
that the algorithm didn't choose so far, right? And
of course, we can pad it so its size is exactly K.
So this is at least the same as if we had chosen all
the elements in OPT that the algorithm didn't choose so far,
right? So we can substitute the list of candidates
with TI.
And now this might contain some dummy elements which
have marginal zero. So this is exactly just summing
over the elements in OPT we didn't choose so far. We
summed their marginals.
So how can you lower bound this? So remember that
the submodular function has the decreasing marginals
property. So if I take a subset and I ask what
happens if I add, let's say, element A and then element B,
or if I take the same subset and add A and B
together. So of course, the gain from adding A
and B together will be smaller than the sum of the individual gains.
So I just add everything that was in OPT and that we
didn't choose so far to the function, adding this
as the difference. So this is lower bounded by
taking the current solution and just dumping
everything that is in OPT there. Right?
And now by monotonicity, this is at least F of OPT,
right? And essentially we're done. So we've proved
now that the expected gain in every step is at least
one over K times the difference between the optimal value
and what the algorithm holds so far. And this answers
your question of why you need the size to be exactly K in every
iteration.
So what did we get now? So we saw that the expected
change in the value of the function in the Ith
iteration is at least one over K times how far we were from
OPT at the beginning of the iteration, right? When
you solve this recursive formula, you get exactly
this guarantee, which gives you at least one minus one
over E times OPT. Okay? So this is the standard recursive
formula.
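In symbols, the step bound and the recursion it solves to are

$$\mathbb{E}\bigl[f(S_i) - f(S_{i-1})\bigr] \;\ge\; \frac{1}{K}\Bigl(f(OPT) - \mathbb{E}\bigl[f(S_{i-1})\bigr]\Bigr)
\;\;\Longrightarrow\;\;
\mathbb{E}\bigl[f(S_K)\bigr] \;\ge\; \Bigl(1 - \bigl(1 - \tfrac{1}{K}\bigr)^{K}\Bigr) f(OPT) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) f(OPT).$$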
So this is the monotone case, which is not very
interesting. We get the same running time because we
don't need to sort the elements in every iteration.
It's like just choosing the median, so every
iteration takes order N. We have K iterations, so the
algorithm runs in order NK time.
But the interesting question is what happens in the
non-monotone case, right? Because we saw that greedy
fails.
So what do we do in the non-monotone case? So the
only place in the proof that we use the fact that the
function is monotone is when we actually lower
bounded this by the optimal value, right? But if the
function is non-monotone, you take OPT and you dump
other things inside, for example, what the algorithm
chose so far, you might lose, or might lose a lot, in
value.
But the question is how we lower bound this. And
we're going to do it in the following way. And this
is where the random choices of the algorithm come in
handy. So no one asked me why you choose
randomly, why there's no, let's say, greedy choice
here. So we're going to use a lemma, which I'm
stating now, which is very similar to a lemma
appearing in Feige-Mirrokni-Vondrák. So if you have
a subset A and you choose a random subset B out of A,
so how do you choose it? You can choose it in a very
complicated way. I don't care. The only thing I
know is that the probability of every element in A to
belong to the subset B is at most P.
Right? But the elements might have weird
dependencies. I don't know. So we just upper bound the
marginal probability of every element of A to belong
to the subset B. So if you have that, then the
expected value of f of the random subset B is at least one
minus P times the value of the empty set.
Let's assume for now that this is true. We're going
to prove that in a second. You looked a little bit
surprised, but I'll prove this.
Yeah, I said that all the functions are nonnegative.
We assume that the submodular functions are
nonnegative, because multiplicative guarantees, once
the function is not nonnegative, are useless, because you
can always shift it so only the optimal value is
above zero, and then either you solve the problem
exactly or not. These problems are all NP-hard.
So let's assume for now that this is true and we'll
see why this completes the proof. So what happens in
the algorithm, the Ith iteration. In the Ith
iteration, what's the probability that an element is
not chosen. So that's at least one minus one over K.
Why? Because if it's in the candidate list with
probability one over K it is chosen, and if it's not
in the candidate list, it's never chosen. So this
lower bounds the probability that some element in the
Ith iteration is not chosen.
So what's -- now I ask you, what's the probability
that in the Ith iteration some element, or at the end
of the Ith iteration, what's the probability that an
element belongs to the algorithm solution.
So the random choices among the different iterations
are independent. The lists are not, but how you
choose within the list is independent. So you get
that the probability that an element belongs to the
algorithm solution, the end of the Ith iteration is
at most one minus this to the I. Right?
So what does that mean? So now look at the following
function. Let's call it G of S, which is just our
submodular function where we take the
union with OPT, so we force S to take in
addition everything that is in OPT in case it didn't
do that already.
So this is also a submodular function; you can prove
that, it's easy. So what's the expected value of
this, right? Because we want to lower bound this,
right? To see this is exactly our goal.
So this is the expected value of G of what the
algorithm has at the end of the Ith iteration, right?
And this is a submodular function with this
property, which we can plug into the lemma, and it's at
least one minus one over K to the power I, times G of the empty
set. But G of the empty set is exactly the value of
the optimal solution. Right?
So we lower bound the expected value of the optimal
solution once you add to it whatever the algorithm
chose so far. And why are we done now? So this is
where we stopped before we used monotonicity in the
analysis before, right? The expected change in the
value of the function is at least one over K.
Instead of being the distance of what we have so far
to OPT, now we have the value of OPT and what the
algorithm chose so far.
So now we just lower bound this. Instead of F of
OPT, we have this factor in addition, right? Exactly
what we had before. You solve this recursion,
you get this. This is true for every iteration I.
And if you plug in I equals K, you get at least one
over E times the value of the optimal solution.
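In symbols, the chain of bounds just described is

$$\Pr[u \in S_i] \;\le\; 1 - \Bigl(1 - \tfrac{1}{K}\Bigr)^{i}
\;\;\Longrightarrow\;\;
\mathbb{E}\bigl[g(S_i)\bigr] = \mathbb{E}\bigl[f(S_i \cup OPT)\bigr] \;\ge\; \Bigl(1 - \tfrac{1}{K}\Bigr)^{i} f(OPT),$$

so the per-step gain satisfies

$$\mathbb{E}\bigl[f(S_i) - f(S_{i-1})\bigr] \;\ge\; \frac{1}{K}\Bigl(\bigl(1 - \tfrac{1}{K}\bigr)^{i-1} f(OPT) - \mathbb{E}\bigl[f(S_{i-1})\bigr]\Bigr),$$

whose solution is $\mathbb{E}[f(S_i)] \ge \tfrac{i}{K}\bigl(1 - \tfrac{1}{K}\bigr)^{i-1} f(OPT)$; plugging in $i = K$ gives $\bigl(1 - \tfrac{1}{K}\bigr)^{K-1} f(OPT) \ge \tfrac{1}{e}\, f(OPT)$.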
So now the only thing that is left is to prove that
lemma you didn't like, I guess. So let's prove it. And
we'll see that the proof is again very simple and
short. Okay. Happiness is [inaudible]. We'll see.
So this is just a restatement of the lemma. We have
a subset A and we randomly in some way choose a
subset B of A, and the only thing we know about the
distribution of B is that the marginal probability
of every element to belong to this random subset B is
at most P. And we want to prove this, right? Yes.
>>:
Okay.
>> Roy Schwartz: Yeah, so I wanted it to fit in
one line, so I just abbreviated it, but you're
right. This is not extremely precise, but for every
element, the marginal probability is at most P.
Okay. So this is the only notation we'll need for
the proof and the proof will end here, so we won't
need even more than the rest of this slide to prove
this. I think so, if I remember well.
>>:
Why don't you skip to that?
>> Roy Schwartz: What? You'll see in a second.
Yes. So let's say that A, where we choose the random
subset from, let's call it U1, U2 up to UL. And
let's sort the elements so that
element number one appears with the largest
probability P1, the second one with probability P2, and
so on up to PL.
So these are the exact probabilities, the exact
marginals, right?
We just said that these are upper
bounded by P; we sort of renamed the elements in the
lemma so this holds. And now let's say AI will just
be the prefix of the first I elements, and XI is the
indicator that the Ith element is inside the
random subset, right? So now we just want to
calculate the expected value of F of B.
So how can we calculate this? So now we can think of
the following process in some sense. So we can look
at the following telescopic sum, right? We start
with an empty set and then you look at the first
element. Is it in B or not? If it's in B, right,
then X1, the indicator, will be one, and then -- how
much do we gain? We gain the marginal of the first
element with respect to what we had so far, right?
And if it's not, then this is zero and we don't care
and then we go to the second element. Just have a
telescopic sum and we have the indicator saying
whether we add that element or not.
So this is why we take the intersection of the
prefix with what really appears in the random subset,
okay? This is just rewriting what the value of B is
as a random variable.
So now we have these marginals, right? So I can look
at the smaller subset and -- sorry -- the larger
subset and only lose in the marginals, so I'm going to
drop B because it's not very convenient. I want to
look at the deterministic subsets.
So this is what we're going to do. So it's just the
expectation of this, again a telescopic sum, because
these are our marginals, but now with respect only to the
prefix, so they can only be smaller, right? We might
have increased the set.
So now we just plug in. We know what the probability of
each XI is, so it's just F of the empty set, and then with
probability P1 we gain the first element according to
the prefix, and then with P2 the second element, and so
on and on.
And now I just want to see what the coefficients
of the different subsets I have here are. Let me rewrite
this. So we get one minus P1 times the first set, which is
empty, P1 minus P2 times the second one, and on and on, and
we just rewrite this.
So how did we order the elements? So the order
here is very important. So we know that P1 is the
largest probability, and on and on. So all these
coefficients are nonnegative, right? So we can drop those terms. So this is
at least one minus P1, which is at least one minus P, times the
value of the empty set.
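The whole computation in one chain, with $A_0 = \emptyset$ and $P_1 \ge P_2 \ge \dots \ge P_L$:

$$\mathbb{E}\bigl[f(B)\bigr] \;\ge\; f(\emptyset) + \sum_{i=1}^{L} P_i \bigl(f(A_i) - f(A_{i-1})\bigr)
\;=\; (1 - P_1)\, f(\emptyset) + \sum_{i=1}^{L-1} (P_i - P_{i+1})\, f(A_i) + P_L\, f(A_L)
\;\ge\; (1 - P)\, f(\emptyset),$$

where the first inequality is submodularity (marginals with respect to the full prefix instead of its intersection with B), and the last step drops the nonnegative middle terms and uses $P_1 \le P$ and $f \ge 0$.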
Are you happy? Just making sure. Okay. Good. So
this is the entire proof. What? It's a telescopic
sum, essentially. Yes.
So essentially this is the entire proof and we're
done. So I gave you the algorithm which I hope you
think is simple. It's fast because its running time
is asymptotically the same as the greedy algorithm,
and it works for non-monotone functions, where
greedy fails. So what we have so far, as I said: in
order NK time we can get the one minus one over E for
monotone, which we know, and the one over E for
non-monotone objective. And the question is can we
do faster than that. Yes.
>>: So in your example there you had one extra element and
K elements. That also shows that anything
sublinear in K, if you do the same algorithm with
sublinear in K elements and --
>> Roy Schwartz: Sublinear in the running time or --
>>: No, no. I mean, you are choosing K
elements, choose --
>> Roy Schwartz: Oh, the size of the candidate list
will be smaller.
>>: It would work, right, because of that example?
>> Roy Schwartz: Could be. I didn't check it on
that example. Linear functions, right?
>>: No, even linear, but, okay, yeah, but anything
less than K won't work because of that example.
>> Roy Schwartz: I think you might be right, yes.
So no one asked me, okay, what's the meaning of the
randomization here in some sense. So intuitively you
can think of it as some kind of insurance. Because
if the function is monotone, then it might make sense
to choose greedily, but if the function is
non-monotone, you might regret it afterwards because
you might lose value because you chose something that
was better in the current step, because the function is
non-monotone. So randomization is intuitively some
kind of insurance that tells you, okay, that might
happen, okay, but not with a very high probability,
okay?
So but now [inaudible] with all this, the question is
can we do better? So I saw that in 2014, Ashwin and
Jan, if the function is monotone, gave an
algorithm that essentially for any constant epsilon,
or any epsilon, gets the tight guarantee. It
loses an epsilon in the approximation factor, but the
runtime, instead of being order NK, is order of N
over epsilon log N over epsilon.
>>: So when you say running time, you mean value
oracle --
>> Roy Schwartz: Yes. We'll get to that at the end
of the talk. There will be some -- but for now I'm
counting the value oracle calls, and the number of
arithmetic operations is bounded by this also. So
it's not that you have 2 to the N arithmetic
operations -- that would be cheating -- but
something where I'm counting the number of value
oracle calls, which relates to what you asked me,
whether it's easy to implement that or not. Yes.
>>: So can I assume that the K that you selected
because of the [indiscernible], so do you expect that
if you have a larger list, larger than K, a
parameter other than K in this case, do you
get a better result?
>> Roy Schwartz: So you are asking, say, what the
meaning of a [indiscernible] solution. So optimum
contains up to K elements but you allow the algorithm
to output [indiscernible] more than K? Can you get a
much better approximation?
>>: Still you have K, but the [indiscernible] that
you're selecting the K largest and
selecting randomly -- say you increase it to, let's say,
more than K. Do you expect to get better or not?
>> Roy Schwartz: So that's a good question. The
question is what -- okay. The size I chose
is, I think, the best for the cardinality constraint,
as far as I can remember, but I'm not 100 percent
sure. But that's a very good question how you choose
the size of the candidate list. So I think for
cardinality constraint of K, this is the best. But
I'm not 100 percent sure, but I'm quite sure.
So if you ignore all the epsilons essentially it's O
epsilon of N log N running time, which can be faster
than order NK. And a faster algorithm
that actually we showed with the [indiscernible], and
which actually also independently appeared in Mirzasoleiman,
Badanidiyuru, Karbasi, Vondrák and Krause in some AI
conference whose name I don't remember exactly,
shows how to essentially get this in order N log one
over epsilon.
So essentially this is linear now and you lose
only -- so the dependence of the running time on
epsilon is very mild. It's log one over epsilon, and
you lose that epsilon in the approximation factor.
And if I'll have time -- I think I'll have a few
minutes -- I'll show you how this algorithm works,
which is very simple.
But this is only for monotone F. So you can ask can
you speed this up for non-monotone functions, can you
get better than order NK running time and lose only
epsilon in the approximation guarantee. So you can
do that, but here, as you can see, there are
essentially two algorithms whose running time is
incomparable depending on the value of K. So if you
ignore this one, which is, let's say, less
interesting, you can get the N log one over epsilon,
but you lose one over epsilon squared in the running
time if F is not monotone, okay?
And again, this is all for cardinality constraint.
And by the way, I think they actually, their paper
actually has also experimental results. So they
actually implemented this algorithm. All right?
Yes.
>>:
So is it the same algorithm, the two papers?
>> Roy Schwartz:
Yes, the exact same algorithm.
>>: So in practice I guess you always know if your F
is monotone or non-monotone, right?
>> Roy Schwartz: Usually you know, yes. For this
algorithm you need to know, because this algorithm --
>>: That's why I'm asking this.
>> Roy Schwartz: Yes. The fact that the previous
algorithm was oblivious to the fact was only -- let's
say it was a side effect. We didn't aim for that,
but it's nice to have, but it's not that important.
You're right.
Okay. So how do -- let's say not this algorithm,
which works in a completely different way, but let's
say these two -- work? So the idea now is just to
sample. These are sampling-based algorithms.
So the main idea, without going into the parameters,
and maybe for the specific case of a monotone
objective, we'll go over the details, is that first
of all, instead of looking at the entire ground set,
you sample candidate list M just randomly. So let's
say if this is a solution the algorithm had so far,
you sample the yellow set. And we don't even sample
it such that it doesn't contain anything -- any of
the elements the algorithm chose so far. So the
sample set -- you just sample from the entire ground set, so it may
contain things the algorithm already chose.
You sample a random subset of some specific
size, and then you look at the fraction alpha of the
top marginals in that set and you choose one of
those uniformly at random.
So if the sampled set was, let's say, the entire
ground set or something of that sort, it's similar.
It's not exact, but it will be similar to what you saw
before, that instead of looking in the entire ground
set, you look at some random part of the ground set
and then you choose uniformly at random from some top
fraction of the marginals there. When I say top
elements, the intention is that top according to the
marginals.
>>:
So when you say sample, how do you sample?
>> Roy Schwartz: Randomly. So if I tell you that
the M is of size five, you take a uniform random
subset of five. Uniformly but depends on the size,
of course.
>>:
Yeah.
>> Roy Schwartz: Yes. Given the size it's uniformly
at random, yes. And it might even contain things you
already chose. You don't need to remove those. Just
uniformly at random a subset of that size. And then
the question is how you choose the threshold alpha of
the top marginals within the sample, from which
you choose uniformly at random. And that also is not
random; it's fixed.
So the faster algorithms, say, for the monotone case,
which I'll describe now, and maybe go over the --
maybe I'll skip the proof of the --
>>: So it's just a one-shot thing?
>> Roy Schwartz: No, you do this K times. This is
you choose one element. You do all of this and you
choose uniformly at random one element and you repeat
this K times.
>>:
Then you choose a new M?
>> Roy Schwartz:
Yes, you choose a new M, yes.
>>: Oh, I see. So the running K [indiscernible]
comes because every time you're looking at a random
subset M.
>> Roy Schwartz: And M is much smaller than the
entire --
>>: M is constant size.
>> Roy Schwartz: It's not -- you'll see in a second.
Now I'll -- so this is the entire approach for all
those algorithms, even when it's non-monotone and the
complicated running times, but I'll show exactly what
this means in, let's say, the case that the function
is monotone, okay?
So the faster algorithm in the case that the function
is monotone. So again, we start with an empty
solution and we have K iterations as before. So now
instead of greedily choosing the top K marginals that
are outside what the algorithm chose so far, we
choose a random sample -- let's call it MI -- a random
subset of this size: N over K times some log one
over epsilon, so it's not even a
constant size; it depends on K. We'll see in a second why we need this.
And now what is the alpha? What is the fraction of
the elements we choose? So we choose alpha in this
specific case such that we just choose the best
marginal in the sample. So there's no uniform random
choice in this case, right? So alpha is essentially
one over the size of M in this case.
So that's UI, the element with the largest marginal
with respect to what the algorithm chose so far. The
best element in this sample. So it's even simpler in
the case that the objective is monotone. You add it
to the solution and you do this K times and you
output whatever you have.
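A minimal sketch of this sample greedy for the monotone case, under the same assumed oracle interface as before; each of the K rounds makes about (N/K) log(1/epsilon) oracle calls, so roughly N log(1/epsilon) in total:

```python
import math
import random

def sample_greedy(ground, f, k, eps=0.1):
    """Sample greedy for monotone f: 1 - 1/e - eps in expectation."""
    ground = list(ground)
    n = len(ground)
    size = min(n, math.ceil((n / k) * math.log(1 / eps)))
    S = set()
    for _ in range(k):
        # Sample uniformly from the *entire* ground set; the sample
        # may contain already-chosen elements (marginal zero).
        M = random.sample(ground, size)
        # Add the sampled element with the largest marginal.
        u = max(M, key=lambda e: f(S | {e}) - f(S))
        S.add(u)
    return S
```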
So this is the specific instantiation of that
approach to the case where the objective is monotone.
So intuitively it will be easier to think of P as the
probability -- the marginal probability of an element
belonging to the sample, right? That doesn't change
over the iterations. So if the size of the sample is
N times something, then that something is the
probability, which is essentially one over K times
log one over epsilon, okay?
And we'll assume that epsilon -- of course, the
epsilon cannot be more than one minus one over E
because then there's no guarantee in terms of the
approximation, right? Because we subtract epsilon.
And epsilon is not even too small. It's at least E
to the minus K, because if it's smaller than that,
then log of one over epsilon becomes larger than K
and you can just apply the greedy algorithm.
So this gives you something better if epsilon, of course, is
not really, really small. And if it's too large,
then you don't care what happens.
Okay. So I have a few minutes for the analysis and
after that I'll show you or just tell you what is
known for more complicated constraints.
So what did we prove for the greedy or the randomized
greedy? What was the proof plan? Again, the proof
plan here is very simple. We proved that the
expected gain in the Ith iteration is at least one over
K times how far we were from the optimal value so far,
right? What I'll convince you of, or tell you, is that you can
actually show that you're not far away from that.
You just lose a factor of one minus epsilon
multiplicatively in the gain in every step.
And once you have that, you can solve the recursive
formula, and under the assumption that epsilon is not
large, which is exactly what we have, the output
after the Kth iteration is at least one minus one over
E minus epsilon times the value of OPT.
So all we're going to do is just show that we're
losing a factor of one minus epsilon in the gain in
every step. Once we have that, we're done.
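In symbols:

$$\mathbb{E}\bigl[f(S_i) - f(S_{i-1})\bigr] \;\ge\; \frac{1-\varepsilon}{K}\Bigl(f(OPT) - \mathbb{E}\bigl[f(S_{i-1})\bigr]\Bigr)
\;\;\Longrightarrow\;\;
\mathbb{E}\bigl[f(S_K)\bigr] \;\ge\; \Bigl(1 - \tfrac{1}{e} - \varepsilon\Bigr) f(OPT).$$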
So how are we going to do that? So again, the proof
is not complicated but it's slightly tricky. So now
again we're going to sort the elements according to
their marginals. So we have the current solution the
algorithm has so far, which is SI minus one.
So now let's look, as if we were the randomized
greedy as before, at the top K elements in the
entire ground set even. It could be -- you can look
at elements chosen so far, but then the marginal
will be zero. So you look at elements in the entire
ground set and you pick the best -- the K
best marginals. So let's say V1 is the best one, V2
after that, and on and on. Right?
So in the randomized greedy, this was our candidate
list, but now we just use it for the proof, not for
the algorithm.
And now we're going to ask what's the probability
that our sample hits one of these, right?
Intuitively, if it's high, we're in good shape. So we're
going now to define the event that if we look at the
prefix of the first J candidates according to this
order, that our sample actually contains at least one
of them. So this is what is written here, that the
sample contains at least one of the J largest
marginals, right? So XJ is the indicator for that.
So there are two observations here. I'll just state them
and tell you how you prove them, but I
won't go into the exact proof. So the first one is
actually that, conditioned on what you have so far,
you can decompose the marginal
value of UI, the element the algorithm eventually added. You
can decompose it again into that telescopic sum and
think of it in the following way.
So let's say V3, for example, is the smallest index
here that is in the sample. So what is X1? It's
zero, right? X2 will be zero. X3 will be one. What
will be X4?
>>:
One.
>> Roy Schwartz: It will all be one because we look
at the prefix. So once -- let's say from this list,
let's say V3 is the first one that appears if you
start scanning it, then X3, X4 up to the end XK will
all be one, right? So it's like a threshold
function.
So now think of it the following way. You start with
what you have so far and you look at the marginal of
the worst one on this list. And this is essentially
the indicator that, I don't know, either this one or
someone before it appears. And then you keep summing
these indicators. So essentially what you have here
is just, on the right-hand side, written as random
variables the element that the algorithm chose,
right? Because the minimum index will say where the
Xs start -- stop being zero and start becoming one.
Right?
But in the worst case it also -- might also be that
the algorithm didn't add any one of these. So this
is why there's an inequality here, right? And here
we use the fact that the function is monotone,
right? Because choosing someone else, I don't know,
there might be negative marginals, so this is one way
to rewrite this.
And the second observation which is simple is that P,
the probability we noted before, what's the
probability, let's say, that XJ is not zero. So it
means that from the first J, intuitively, at least one
is chosen. So it's one minus the probability that none of
them is chosen, which is essentially this, if P is
the probability that an element is in the sample.
There are some dependencies here because the
elements are not included independently, but this
inequality is true. Right? Once you have this, we're
nearly done. So with these two very simple
observations, the proof is complete. Why? You take
the expected gain in the Ith iteration and you just
plug those two inequalities together. We have a
lower bound on the probability of XJ and we have this
telescopic sum, so it's at least this.
And this, the question is how do you lower bound
this? So I'll skip this very quickly. But you have
a summation over products. You have two
sequences: this one minus P to the power of
something, and the marginals, and we ordered them
according to the values of the marginals. They're
all decreasing, so now we can use the Chebyshev's sum
inequality and get this, but the bottom line here is
just a geometric sum.
But when you do this, you get exactly what we wanted
to prove, right? So I'm skipping this very quickly.
And this completes the proof, because from this
point on, I claim, it's just copy/paste from the previous
proof: you just carry this term along and it all
works.
So this was just a sample from the sampling
algorithms, what happens when the function is
monotone, and as I said, I think in the parallel
paper they even implemented this algorithm, as far as
I can remember.
So now what happens with additional constraints? So
I'll go over this very, very quickly. So one
interesting constraint is the partition matroid, and
one of the motivations for this is actually the
submodular welfare problem, which I'm not going to go
into all the previous work, but on the very high
level you have these smiley faces which are the
players, and you have these goods which are
indivisible, and you want to assign every player a
subset of the goods, and each player is equipped with
a utility function which is submodular and monotone
like in auctions, and just want to distribute the
elements to the players and maximize the utility,
total welfare, essentially.
So this is a special case of a matroid constraint, so
one minus one over E was known in polytime by the
continuous greedy of Călinescu, Chekuri, Pál and
Vondrák. And in that same paper from 2014 I mentioned before,
Ashwin and Jan again -- and this comes from the general
matroid -- lose an
epsilon, and the running time, if we ignore polylog
factors and the dependence on epsilon, is NK. That's
the fast algorithm. So the question is can you do
faster, and the answer is that you can break this
barrier, and actually you can get -- again, this is
with Niv and Moran -- the same guarantee, and if you
ignore the polylog factors and the dependence on epsilon, you
get K square root N plus N for maximizing a monotone
submodular function over a partition matroid.
So maybe I'll skip what the [indiscernible] in
general matroid. I'll just mention that in a general
matroid there is an interesting question: so
far, till this point, like Mohit said, we were aiming
to minimize the number of value oracle queries. But
in general matroids, how do you know that a solution
is feasible or not, right? For cardinality
constraint it's easy. Partition matroid is easy, but
in a general matroid, you have what is called an
independence oracle that actually tells you whether
a given subset is feasible or not. And there's
actually -- we have an algorithm that shows that
there's a tradeoff between the number of value oracle
queries and the number of independence oracle queries. You have the same approximation guarantee,
but it depends on the application you have.
And we suspect that this tradeoff is essentially an
artifact of the algorithm, but we don't know of any
lower bound and we don't know how to get rid of it.
So there is something interesting that can actually save
the total number of queries you have, but that, if
someone is interested, I can tell you more about that
later. So thank you.
[applause]
>>:
Any more questions?
>>:
So are you still working in this area or what?
>> Roy Schwartz: So on maximization we're simply not
working much. There is a project about submodular
minimization, but again, this is completely different
because it's not a hard problem. So you can solve it
exactly, but I can tell you more about that later if
you're interested.
>>:
So that is through convex optimization or --
>> Roy Schwartz: Okay. In general, yes, that's one
way to do this. But the faster algorithms that are
known today, again, it depends on whether you want
strongly polynomial or polynomial. Because the
strongly polynomial algorithms have no dependence on
the different values the function can take, but their
dependence on the size of the ground set is worse
than the polynomial.
The polynomial ones, the dependence on N is much
better, but there's an extra dependence on what is
called capital M. And usually it's log capital M,
and capital M is the following: if you assume that the
submodular function is integral, then it's the
largest -- its largest value in absolute
value. So it could be negative, I don't
know. Because you can solve the problem exactly,
it doesn't bother you if you shift it.
So if you, let's say -- so if you normalize it so
that the empty set is zero, you can always do that,
then it's between minus M and M and it's integral.
So the polynomial algorithms have some dependence on
N and log M. Usually it's only log M -- not even
polylog, just log M. But again,
it depends on what kind of algorithm you're
interested in.
>>:
So strongly polynomial will not depend on M at all?
>> Roy Schwartz: Will not depend on M, but those are
usually -- okay. I can just cite what one of the
authors of those algorithms said. They're highly non
[indiscernible]. So I don't know. They're probably
not very simple algorithms to exactly understand
what's going on. In some sense they are extensions
of max flow, right? Because max flow, under the graph
or [indiscernible], doesn't matter, is equivalent
to min cut, and min cut is a special case of
submodular minimization.
So the strongly polynomial algorithms are some -- on
the very high level some extensions of the strongly
polynomial time algorithms for max flow.
>>:
Any more questions?
>> Roy Schwartz: Okay. Thanks.
[applause]