>> Mohit Singh: Okay. So hello, everyone. It's good to have Roy back again over here. Roy was a post-doc here for a couple of years, and now he's a post-doc at Princeton University and will be headed to the Technion as an assistant professor in a couple of months, I guess. And today he'll tell us about some fast and simple algorithms for submodular maximization, a topic he has been working on for the past few years. >> Roy Schwartz: Yes. Thank you all for coming. So as Mohit said, I'll talk about fast and simple algorithms. Towards the end of the talk the algorithms will be slightly less fast and slightly less simple, but we'll start with things which are fast and simple. So what do we have in submodular maximization? We have a ground set, and we have some nonnegative set function that is defined over this ground set and is submodular, and let's say you have some collection of subsets M that describes all the feasible solutions. So in general in submodular maximization you want to maximize the submodular function F under the constraint that the output is feasible. Okay? So this encodes many classical problems in combinatorial optimization. And you can think of many constraints, for example, the unconstrained case, the cardinality case, which corresponds to maximum K coverage, for example, a partition matroid, knapsack, and on and on. So what is the goal of this talk? The goal is to see if we can find fast and simple algorithms. So fast is well defined: you have the running time, so the faster the better. Simple is not always well defined, and sometimes it's a matter of taste. But I'll try to convince you that some of the solutions are indeed simple, simple in the sense that if someone needs to implement this, they don't need to read a 30-page paper to understand the algorithm. The algorithm is very, very simple. So one might ask, okay, why are you interested in simple algorithms? Fast is always nice.
So the reason is that in recent years there are many applications that arise in machine learning and data mining, and even in the machine learning community there's a subcommunity that deals with the specific applications of submodular optimization to problems in learning. So here are some examples where algorithms that were developed in the theory community for submodular maximization, depending on the constraint, were used. First of all, they had provable guarantees, because most of these problems are NP-hard, so we want something that bounds the quality of the output, and also the algorithms are fast and simple, simple enough that you can actually implement them. So these are examples of works that actually wanted to solve some practical problem, or a more practical problem, and they had datasets for that problem, and they could have thrown in any heuristic, but algorithms that were developed in theory were actually used here. So one of the more amusing ones, at least in my eyes, is the second one. It does say gang violence there. So there is this work -- they worked, I think, with the Chicago Police Department, and they modeled the gangs there as a network, and we'll see in a second the example of network influence. Yeah, of course. And their goal was to reduce gang violence, and they wanted to see which members, which are actually vertices in the network, they need to take out from the network in order to maximize the good influence. Once you take something out of the network, it's good in this case. >>: They are taking out the -- >> Roy Schwartz: Taking out, yeah, it doesn't mean shooting in the head. It usually means rehabilitating, not putting behind bars, as far as I understood from very briefly looking at the paper, but I didn't read their model all the way through, yeah. But this is one of the more amusing ones. There are more, let's say, classic or well-understood examples, but this one is just for fun. So I mentioned network influence.
So I'll just give you one example, and actually, I think most of these examples -- at least three of them -- are in some sense network influence problems. So what is network influence, or how does influence spread in a network? This is, again, just one motivation, but it's actually a motivation that was used in practice. So we have a network, which is a graph. It could be a social network, a biological network, any kind of network you like. And initially some set of the vertices or nodes are activated. So those turn red. And now there is some random process, usually, that runs, and the influence spreads according to this process until some point where the influence stops. So for example, if this is a biological network, it could be the spread of disease. So what models are used? Two classical models are what are called the independent cascade and the linear threshold models. In the independent cascade model, there are probabilities on the edges, and once a vertex or node is activated, it has a chance to activate each of its neighbors according to the probability on the edge, only when it becomes activated. And this describes how the influence spreads. In the linear threshold model, each vertex has a random weight or threshold, and once the total weight of its activated neighbors increases above the threshold, it becomes activated. And this is how the influence spreads. These models were also studied in sociology and in many other areas. So how does this relate to submodular maximization, or submodularity at all? In the paper of Kempe, Kleinberg and Tardos, they show that for these classical models, if you look at the random variable X of S, which is the number of vertices that are activated -- that are red -- once the dynamics stop, where S is the initially infected set, the red ones, then the expected number of vertices activated is basically a submodular function of S.
So the ground set is the nodes, and the expected number of activated nodes at the end is submodular. And if you want, let's say, to count the expected number of newly activated vertices, that corresponds to a non-monotone submodular function, for example. And Kempe, Kleinberg and Tardos asked this question in relation to how to market things in social networks, and this relates to [indiscernible] theory by Hartline, et al., but this is again only one example of a heavily used application. Of course, if there are questions, feel free to stop me anytime. Okay. So as I said, I decided in this talk to talk about constrained problems, and I'll start with one of the simplest constraints, which is the cardinality constraint. So in this case, what is the problem? We have the ground set N, a nonnegative submodular function and some cardinality constraint K, and the goal is to output any subset that contains up to K elements and maximizes the value of the function, right? So in the network influence example, this corresponds, for example, to the case where you want to market something online: you have a budget of K, let's say, items you can give out for free, and you hope that the influence spreads, and once the influence spreads, more people buy, let's say, your product. So for this problem, which is very classical in combinatorial optimization, in the late '70s Nemhauser, Wolsey and Fisher showed that you can get a one minus one over E guarantee by the simple greedy algorithm, and this is known to work for monotone functions. And this result is also tight. In what sense is it tight? So Nemhauser and Wolsey also showed in the late '70s that one minus one over E is the absolute best guarantee you can have. So what does absolute mean here? So no one asked me how the function is given. Usually you assume you have a black box that is called a value oracle: you provide this oracle a subset S and you get the value of S, namely F of S.
So if the algorithm has access to the submodular function only through this value oracle, then any algorithm that performs a polynomial number of queries cannot get an approximation better than one minus one over E. So in that sense, this is absolute hardness. For the special case where, let's say, the function is given explicitly as a coverage function, Uri Feige showed that this bound is also true, but there you assume that P doesn't equal NP, right? So in some sense this seems very good, because greedy is a simple algorithm. It's very fast and it gives you the best guarantee you can have when the function is monotone. So what happens when the function is not monotone? For example, in the network influence problem when we want to maximize the number of newly influenced vertices? Does greedy work here? The answer is no. And let's see a very simple example. So our ground set will have N elements -- let's call them U1 up to UN -- and one special extra element V. So we're going to look at these two types of functions. There's the function for V, which is just the indicator that V is in the solution, in the subset. And this is, of course, a submodular function; you can check that it has the diminishing-marginals property. And for every UI, we'll have an indicator that UI is in the subset but V is not. So this is a non-monotone function which you can check is also submodular. Now, if you take any nonnegative combination of those, it's also submodular. So we're going to put a weight that is slightly bigger than one -- one plus epsilon -- on the special function for V, a weight of one on all the rest, and sum them up. So what will the greedy algorithm do? It will first choose V, right? It will keep going for K steps, but once the algorithm chose V, it's doomed. It cannot get a value of more than one plus epsilon, but the optimal solution is to choose K of the UIs, which has value K.
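As a quick sanity check, the bad instance just described can be run directly. This is a minimal sketch; the helper names and the choice of n, K and epsilon below are illustrative, not from the talk:

```python
# Bad instance for greedy: f(S) = (1 + eps)*[v in S] + sum_i [u_i in S and v not in S].
# Greedy grabs v first (marginal 1 + eps > 1) and is then stuck, while the
# optimum takes K of the u_i's for value K.
def make_f(n, eps):
    def f(S):
        has_v = 'v' in S
        return (1 + eps) * has_v + sum(1 for i in range(n) if f'u{i}' in S and not has_v)
    return f

def greedy(f, ground, K):
    S = set()
    for _ in range(K):
        # pick the element outside S with the largest marginal gain
        x = max(ground - S, key=lambda e: f(S | {e}) - f(S))
        S.add(x)
    return S

n, K, eps = 10, 5, 0.01
f = make_f(n, eps)
ground = {f'u{i}' for i in range(n)} | {'v'}
S = greedy(f, ground, K)
print(f(S))                              # greedy is stuck at 1 + eps
print(f({f'u{i}' for i in range(K)}))    # the optimum has value K = 5
```

The gap between the two printed values grows linearly with K, which is the point of the example.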
So this is a gap on the order of K, essentially, which shows that greedy completely fails here, because K can be very large. Okay? So what is known about this case? So there are works that are known. Jan Vondrák showed that a continuous version of local search gets 0.309, and I don't remember the other digits after that, and then Shayan and Jan, extending, let's say, the simulated annealing approach, were able to show a bound of 0.325, which actually holds for any matroid constraint, so it comes from a much more general problem. And after that, together with Moran and Seffi, we showed you can get one over E, and that again comes from the general matroid constraint. So this is not specifically tailored for a cardinality constraint. So this is nice, but the problem is that all these algorithms are, let's say, efficient only in theory. All of them are polynomial -- essentially K is at most N, but they are polynomial in N. I think the best one, the continuous greedy, is N to the sixth or N to the fifth, something like that. So -- >>: Did you define exactly what a submodular function is? >> Roy Schwartz: No, because last time I gave a talk here about submodular functions, Yuval told me you don't need to define what a submodular function is. Everyone knows. So Yuval is not here. Okay. So yes. So for a submodular function, let's use the diminishing returns property. So if you have a subset S of elements and there's a larger subset T that contains that set, and you take an element X, you can ask what's the change in the value of the function once you add X to S versus once you add X to T. So diminishing returns means that the more you have, the less you'll earn, right? Because if you have one dollar and I give you one more dollar, you'll be happier. If you had one million dollars and I gave you one more dollar, you'd still be happy, but it wouldn't change your happiness level much, right?
So it means that the change in the value of the function once you add X to S is at least as large as the change once you add X to the larger subset T. And this needs to hold for every S, every T that contains S, and every X that is not in T, right? So there are several ways to define submodular functions, but for me this is the most intuitive one. Okay? And this -- we'll use the notation later -- is sometimes called the marginal of X with respect to S: the change in the value of the subset S once you add element X to it, okay? So thank you for the comment. We're sure, yeah, everyone remembers. So the question is what can we do here, because there are practical applications, and can we get something with a provable guarantee that is actually also fast and simple in some sense. So what I will describe to you now is a simple randomized algorithm whose running time is order NK, which is exactly the running time of the greedy algorithm, right? You have K iterations and each one takes linear time. But it is oblivious to whether the objective is monotone or not. And if the objective is monotone, you get the tight one minus one over E guarantee. And if it's not monotone, you get a one over E guarantee. Okay? >>: You use the same algorithm or not? >> Roy Schwartz: It's the same algorithm. It's oblivious. But this works only for cardinality constraints. The continuous greedy gives you a fractional solution for any down-monotone closed polytope, but this is tailored for the cardinality constraint. >>: Sure. >> Roy Schwartz: Okay? So what's the general idea? So if we want a greedy-like algorithm, let's say this is our ground set and S is the current solution the algorithm holds. So what do you do in greedy? You look at all the elements that are outside the solution, let's say everything that is outside the purple set, and you pick the one with the largest marginal, right?
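As a tiny illustration of the diminishing-returns definition, here is a made-up coverage function (coverage functions are submodular; the sets below are arbitrary examples, not from the talk):

```python
# f(S) = number of items covered by the chosen sets; a coverage function.
sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}}

def f(S):
    return len(set().union(*(sets[e] for e in S)))

def marginal(S, x):
    # f_S(x): the change in value when x is added to S
    return f(S | {x}) - f(S)

# S = {'a'} is contained in T = {'a', 'b'}; the marginal of 'c'
# can only shrink as the set grows
print(marginal({'a'}, 'c'), marginal({'a', 'b'}, 'c'))  # 3 2
```

Here adding 'c' to the smaller set gains three new items, but to the larger set only two, exactly the "the more you have, the less you'll earn" behavior.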
The one such that adding it to the subset gives the largest gain, and you add it. So what we'll do now is choose a suitable list of candidates, let's say the yellow dots -- we're going to choose the list greedily -- and then choose an element uniformly at random from that list. Let's say this element, okay? So apparently -- we were not aware of this -- this type of approach has a name, which is GRASP, and David Shmoys pointed this out to us. It means Greedy Randomized Adaptive Search Procedure, okay? And it was introduced in the OR community by Feo and Resende, and there it was used as a heuristic. So as far as I know, in the examples where this approach was used, there were no proofs, no rigorous guarantees. And over here there are several -- these three are actually surveys, if anyone is interested in looking this up. So we didn't know it's called GRASP and we just named it randomized greedy, because it's simpler. So now let's give a precise description of the algorithm I just mentioned. >>: Maybe it's a very trivial question, but the guarantee is in expectation [indiscernible], right? >> Roy Schwartz: Yes. >>: And also for the greedy case? >> Roy Schwartz: You mean the monotone case? Greedy is deterministic, right? Yeah, so if the function is monotone and you run the greedy algorithm, it's not in expectation. It's always. >>: But all guarantees -- >> Roy Schwartz: Yeah, this one is randomized, so the guarantees are in expectation. >>: But is it easy to de-randomize? >> Roy Schwartz: I don't know. The question is can you de-randomize it and keep the running time, because if you want to de-randomize it -- if you want a deterministic algorithm -- you could use the continuous greedy that is adapted to non-monotone functions, right? You'll get the one over E guarantee, even for a general matroid, but the running time will be [indiscernible]. So the point here is to get something very fast and simple.
So if you can de-randomize it and keep it fast and simple, that would be great. What? >>: Yours has a dual guarantee. For monotone you get one minus one over E. Is that also true for the continuous greedy case? >> Roy Schwartz: That? >>: If you run continuous greedy on a monotone function, does that give you -- >> Roy Schwartz: Yes. The adapted continuous greedy. Not the original one. >>: I see. So it does give? >> Roy Schwartz: Yes. Okay. So now let's see an exact description, and we'll see a picture in a second. So we start with, let's say it's called S0. In the beginning we have nothing, and we have K iterations, I equals one, two, three up to K. So what do we do in these iterations? Well, first we need to choose the list -- the yellow points, the list of candidates. And we choose those greedily. So what does that mean? We look at all the unchosen elements and we want to find a subset of them of size K. For every element you have a number, which is its marginal with respect to the solution so far -- say, in the first iteration, S0, which is the empty set -- and the list of candidates in each iteration is the best K candidates according to what you have so far. And then what do you do? You just choose one uniformly at random. You don't even look -- let's say one has a huge marginal and one has a very tiny marginal; you ignore that. You just choose one uniformly at random and you add it to the solution to create SI, which is the solution for the next iteration. And you do this K times. Okay? Very simple. >>: Ignoring the relative ratios between the marginals, is that for a reason, or is it just that -- >> Roy Schwartz: First of all, it works. >>: Right. >> Roy Schwartz: That is a very good reason. And it's very simple. >>: Is it the case that choosing proportional to the marginals doesn't work? >> Roy Schwartz: I don't know. I don't think we could prove that. Actually, I think it -- okay.
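The procedure just described fits in a few lines. A minimal sketch (the function and parameter names are mine; dropping negative marginals stands in for the dummy elements discussed later in the talk):

```python
import random

def randomized_greedy(f, ground, K, rng=random):
    """K iterations; in each, take the K unchosen elements with the
    largest marginals and add one of them uniformly at random."""
    S = set()
    for _ in range(K):
        margs = [(f(S | {x}) - f(S), x) for x in ground - S]
        top = sorted(margs, key=lambda t: t[0], reverse=True)[:K]
        # keep only nonnegative marginals: a dummy element would have
        # marginal zero, so choosing it just wastes the iteration
        candidates = [x for g, x in top if g >= 0]
        if candidates:
            S.add(rng.choice(candidates))
    return S
```

Each iteration touches every unchosen element once (and selecting the top K marginals can be done in linear time without a full sort), so the oracle-call count matches greedy's order NK.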
I think it might work, but you don't want to start weighing things -- this is easier. Just choose one, right? You don't care about the values or anything. Just output what you have at the end. >>: Maybe I missed this. What is K? >> Roy Schwartz: K is the cardinality constraint. You can choose up to K elements. Okay? And that is also the size parameter of the list: the number of yellow points in every step is exactly K. So let's see an example, and ignore this assumption for now. So in the beginning the solution doesn't contain any of the blue dots, right? So now we look at the first list of candidates. Let's say the yellow ones are the candidates, and let's say there are exactly seven of those, so K is seven. And we choose one uniformly at random, let's say the red one here, and we add it to get S1. And now again we look at seven candidates that do not contain the one we already chose. Let's say this is M2, the second set of candidates. We choose again one uniformly at random, let's say this point, the red one. We add it, we get another set, and so on. >>: So why is the size of the set MI equal to K? >> Roy Schwartz: Uh-huh. >>: Is there an intuitive reason why -- >> Roy Schwartz: Actually, because the algorithm is simple, the proof is also simple. So I'm going to completely analyze this algorithm in a few minutes, and you'll see exactly why you need to keep it K, okay? Yes. >>: [indiscernible] at that point, right? And then you apply the function for each and pick the K maximum, right? But in the [indiscernible] scenario the graph is so large, so is it possible to apply the function for all -- >> Roy Schwartz: So actually you're asking how you implement the value oracle in different applications, essentially. In order to calculate the marginals, you need to know this and this. >>: That's true.
>> Roy Schwartz: So there are specific works, for example on network influence -- but that depends also on the specific application you have -- that show how to implement the value oracle, essentially. So that's an independent question. Here we assume we have this black box. If I cannot access the function, then I'm in trouble. I don't know what I can do, right? >>: But even if you have the function, it's possible that you don't have knowledge of the whole graph, right? >> Roy Schwartz: That's true. Oh, you mean the entire ground set? >>: [indiscernible]. >> Roy Schwartz: Yeah. Okay. So here we assume that the ground set is known in advance. Okay? So there are a few works, but those are only recent, that deal with this; the closest thing I can think of is the online case, where the ground set is slowly revealed to you, but there, usually, what is done is that there's an adversary that controls the order. But that's a good question. Again, it depends on the application you have, but here we take the combinatorial optimization approach, okay? Was there another question? I thought there was another question here. You had a question? >>: Oh, no. >> Roy Schwartz: Okay. >>: So you really just want the top K ground set elements by marginal? So this MI is really just a -- >> Roy Schwartz: Sorry? >>: In some cases you can do that. >> Roy Schwartz: You can, yeah, in linear time. You're just finding the [indiscernible], right? >>: I don't know. >> Roy Schwartz: You might be right that in some applications it might be a problem, but there are applications where this was implemented. So again, it depends on the application you have before you, okay? But it's a good point. So no one asked me: the function is not necessarily monotone, right? So what happens if, among the best K marginals, some are negative? We don't want to choose those.
So we assume that there are dummy elements -- you can always add dummy elements to the ground set that always have a marginal value of zero. Essentially they have a linear contribution of zero. So you add K or 2K of those, and that is enough to ensure that all the elements you choose always have a nonnegative marginal. So what does it mean if the algorithm chooses a dummy element by chance? It means that it was a wasted iteration: you have exactly K iterations, and in one of those you actually didn't choose anything from the original ground set. So this means that if the function is non-monotone, the output could contain fewer than K elements. Okay? So now I'm going to show you how to analyze this algorithm. And let's start with the monotone case. So intuitively this seems more wasteful than the greedy algorithm, right? With some probability we choose things which could be much worse -- we don't even choose according to probabilities proportional to the marginals. So now let's prove that it gives you the tight guarantee, and anyone who knows the original greedy proof will see that it's essentially almost the same proof. But if you don't remember, we'll see the proof right now. So what we're going to need for the proof is the following. If you remember, SI minus one is what the algorithm has at the beginning of the Ith iteration, and MI is the list of candidates. So what is TI? TI is all the elements that are in OPT that the algorithm didn't choose so far, right? So let's think of them as the green subset. TI might have -- I don't know -- some intersection with the candidate list, but it's everything that is in OPT that the algorithm didn't choose so far. So what we're going to do: assume what the algorithm chose so far is known to you, and we're going to see what the expected gain of the algorithm in the Ith iteration is.
If we lower bound this, we're done. So how much do we gain? If you remember, UI is the element chosen uniformly at random from the candidate list. So what is the expected gain? We choose uniformly at random, right? So it's just one over K times the sum of the marginals over all the candidates, right? And how did we choose the list of candidates? Greedily, right? We chose the set that maximizes the sum of the marginals. So we can substitute it now with all the elements in OPT that the algorithm didn't choose so far, right? And of course, we can pad that set so its size is exactly K; the padding is dummy elements, which have marginal zero. So this is at least the same as if we had summed over all the elements in OPT the algorithm didn't choose so far, right? So we can substitute the list of candidates with TI, and this is exactly summing the marginals of the elements in OPT we didn't choose so far. So how can you lower bound this? Remember that the submodular function has the decreasing marginals property. So if I take a subset and ask what happens if I add, let's say, element A and then element B separately, versus taking the same subset and adding A and B together, then of course the gain from adding A and B together will be smaller. So I just add everything that is in OPT that we didn't choose so far to the current solution and take the difference. So the sum of marginals is lower bounded by taking the current solution and just dumping everything that is in OPT there, right? And now by monotonicity, this is at least F of OPT, right? And essentially we're done. So we've proved that the expected gain in every step is at least one over K times the difference between the optimal value and what the algorithm holds so far. And this, to your earlier question, is why you need the list to have size exactly K in every iteration. So what did we get now?
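The chain of inequalities just described can be written out in one line, with $f_S(u) = f(S \cup \{u\}) - f(S)$ the marginal, $M_i$ the candidate list, and $T_i = \mathrm{OPT} \setminus S_{i-1}$ padded with dummies to size $K$:

```latex
\mathbb{E}\bigl[f(S_i) - f(S_{i-1}) \bigm| S_{i-1}\bigr]
  = \frac{1}{K}\sum_{u \in M_i} f_{S_{i-1}}(u)
  \;\ge\; \frac{1}{K}\sum_{u \in T_i} f_{S_{i-1}}(u)
  \;\ge\; \frac{1}{K}\bigl(f(S_{i-1} \cup \mathrm{OPT}) - f(S_{i-1})\bigr)
  \;\ge\; \frac{1}{K}\bigl(f(\mathrm{OPT}) - f(S_{i-1})\bigr).
```

The equality is the uniform choice from $M_i$, the first inequality is the greedy choice of $M_i$, the second is diminishing returns, and the last is monotonicity.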
So we saw that the expected change in the value of the function in the Ith iteration is at least one over K times how far we were from OPT at the beginning of the iteration, right? When you solve this recursive formula, you get exactly this guarantee, which gives you at least one minus one over E times OPT. Okay? So this is the standard recursive formula. So this is the monotone case, which is not very interesting. We get the same running time because we don't need to sort the elements in every iteration -- it's like just choosing the median, so every iteration takes order N. We have K iterations, so the algorithm runs in order NK time. But the interesting question is what happens in the non-monotone case, right? Because we saw that greedy fails. So what do we do in the non-monotone case? So the only place in the proof where we used the fact that the function is monotone is when we lower bounded this by the optimal value, right? But if the function is non-monotone and you take OPT and dump other things inside -- for example, what the algorithm chose so far -- you might lose, maybe a lot, in the value. So the question is how we lower bound this. And we're going to do it in the following way. And this is where the random choices of the algorithm come in handy. So no one asked me why we choose randomly, why there's no, let's say, greedy choice here. So we're going to use a lemma, the lemma I'm stating now, which is very similar to a lemma appearing in Feige-Mirrokni-Vondrák. So suppose you have a subset A and you choose a random subset B out of A. How do you choose it? You can choose it in a very complicated way, I don't care. The only thing I know is that the probability of every element of A to belong to the subset B is at most P. Right? But the elements might have weird dependencies, I don't know. So we just upper bound the marginal probability of every element of A to belong to the subset B.
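Solving the standard recursive formula gives the claimed bound:

```latex
\mathbb{E}[f(S_i)] \ge \Bigl(1 - \bigl(1 - \tfrac{1}{K}\bigr)^{i}\Bigr) f(\mathrm{OPT})
\quad\Longrightarrow\quad
\mathbb{E}[f(S_K)] \ge \Bigl(1 - \bigl(1 - \tfrac{1}{K}\bigr)^{K}\Bigr) f(\mathrm{OPT})
  \ge \bigl(1 - \tfrac{1}{e}\bigr) f(\mathrm{OPT}).
```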
So if you have that, then the expected value of the random subset B is at least one minus P times the value of the empty set. Let's assume for now that this is true; we're going to prove it in a second. You look a little bit surprised, but I'll prove this. Yeah, I said that all the functions are nonnegative. We assume that the submodular functions are nonnegative, because multiplicative guarantees are useless unless the function is nonnegative: you can always shift it so that only the optimal value is above zero, and then either you solve the problem exactly or not, and they're all NP-hard. So let's assume for now that this is true and we'll see why this completes the proof. So what happens in the algorithm in the Ith iteration? In the Ith iteration, what's the probability that a given element is not chosen? It's at least one minus one over K. Why? Because if it's in the candidate list, it is chosen with probability one over K, and if it's not in the candidate list, it's never chosen. So this lower bounds the probability that a given element is not chosen in the Ith iteration. So now I ask you: at the end of the Ith iteration, what's the probability that an element belongs to the algorithm's solution? The random choices in the different iterations are independent -- the lists are not, but how you choose within each list is independent. So you get that the probability that an element belongs to the algorithm's solution at the end of the Ith iteration is at most one minus this quantity to the power I. Right? So what does that mean? Now look at the following function. Let's call it G of S, which is just our submodular function where we take the union with OPT: we force S to take, in addition, everything that is in OPT, in case it didn't do that already. So this is also a submodular function; you can prove that, it's easy. So what's the expected value of this, right?
Because we want to lower bound this, right? That's exactly our goal. So this is the expected value of G of what the algorithm has at the end of the Ith iteration, right? And this is a submodular function with the property we can plug into the lemma, so it's at least one minus one over K, raised to the power I, times G of the empty set. But G of the empty set is exactly the value of the optimal solution. Right? So we lower bounded the expected value of the optimal solution once you add to it whatever the algorithm chose so far. And why are we done now? This is where we stopped when we used monotonicity in the analysis before, right? The expected change in the value of the function is at least one over K times, instead of the distance from what we have so far to OPT, the value of OPT together with what the algorithm chose so far. So now we just lower bound this: instead of F of OPT, we have this extra factor, right? Exactly what we had before. You solve the recursion, you get this -- this is true for every iteration I -- and if you plug in I equals K, you get at least one over E times the value of the optimal solution. So now the only thing that is left is to prove the lemma you didn't like, I guess. So let's prove it. And we'll see that the proof is again very simple and short. Okay. Happiness is [inaudible]. We'll see. So this is just a restatement of the lemma. We have a subset A and we randomly, in some way, choose a subset B of A, and the only thing we know about the distribution of B is that the marginal probability of every element to belong to this random subset B is at most P. And we want to prove this, right? Yes. >>: Okay. >> Roy Schwartz: Yeah, so I wanted it to fit in one line, so I was just abbreviating, but you're right. This is not extremely precise, but for every element, the marginal probability is at most P. Okay.
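Combining the lemma with the per-iteration gain, the non-monotone recursion and its solution (by induction on $i$, with $\mathbb{E}[f(S_0)] \ge 0$) look like this:

```latex
\mathbb{E}[f(S_i)] - \mathbb{E}[f(S_{i-1})]
  \;\ge\; \frac{1}{K}\Bigl(\bigl(1 - \tfrac{1}{K}\bigr)^{i-1} f(\mathrm{OPT})
      - \mathbb{E}[f(S_{i-1})]\Bigr),
\qquad
\mathbb{E}[f(S_i)] \;\ge\; \frac{i}{K}\bigl(1 - \tfrac{1}{K}\bigr)^{i-1} f(\mathrm{OPT}),
```

and at $i = K$ the factor is $(1 - 1/K)^{K-1} \ge 1/e$, which is the claimed guarantee.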
So this is the only notation we'll need for the proof, and the proof will end here, so we won't need more than the rest of this slide to prove this. I think so, if I remember well. >>: Why don't you skip to that? >> Roy Schwartz: What? You'll see in a second. Yes. So let's say that A, where we choose the random subset from, is U1, U2 up to UL. And let's sort and rename the elements so that element number one appears with the largest probability P1, the second one with probability P2, and so on up to PL. So these are the exact probabilities, the exact marginals, right? We just said that these are upper bounded by P, and we renamed the elements so this ordering holds. And now let AI be the prefix of the first I elements, and XI the indicator that the Ith element is inside the random subset, right? So now we just want to calculate the expected value of F of B. So how can we calculate this? We can think of the following process, in some sense. We can look at the following telescoping sum, right? We start with an empty set and then we look at the first element. Is it in B or not? If it's in B, then X1, the indicator, will be one, and how much do we gain? We gain the marginal of the first element with respect to what we had so far, right? And if it's not, then this term is zero and we don't care, and then we go to the second element. We just have a telescoping sum, and we have the indicator saying whether we add that element or not. So this is why we take the intersection of the prefix with what actually appears in the random subset, okay? This is just rewriting the value of B as a random variable. So now the marginals here are with respect to the random subsets, right? By submodularity I can replace the smaller random subset with the larger deterministic prefix and only lose in the marginals. The random subsets are not very convenient; I want to look at deterministic subsets. So this is what we're going to do.
So it's just the expectation of this, again a telescopic sum, but with the marginals taken with respect to the deterministic prefixes, which can only be smaller, right? Because we might have increased the set. So now we just plug in the probabilities: with probability one we have F of the empty set, with probability P1 we gain the marginal of the first element, with probability P2 the second element, and so on. And now I want to see what the coefficients of the different subsets here are, so let me rewrite this. We get one minus P1 times the first set, which is the empty set, P1 minus P2 times the second one, and on and on; we just rewrite this. So how did we order the elements? The order here is very important. We know that P1 is the largest probability, and so on, so all these coefficients are nonnegative, right? So we can drop all the terms but the first. So this is at least one minus P1, which is at least one minus P, times the value of the empty set. Are you happy? Just making sure. Okay. Good. So this is the entire proof. What? It's a telescopic sum, essentially. Yes. So essentially this is the entire proof and we're done. So I gave you the algorithm, which I hope you think is simple. It's fast, because its running time is asymptotically the same as the greedy algorithm, and it works for non-monotone functions, where the greedy fails. So what we have so far, as I said: in order NK time we can get the one minus one over E for a monotone objective, which we knew, and one over E for a non-monotone objective. And the question is, can we do faster than that? Yes. >>: So in your example there you had one X element and K elements. That also shows that anything sublinear in K -- if you do the same algorithm with a candidate list sublinear in K elements -- >> Roy Schwartz: Sublinear in the running time, or -- >>: No, no. I mean, you are choosing K elements; if you choose -- >> Roy Schwartz: Oh, the size of the candidate list will be smaller. >>: It would work, right, because of that example? >> Roy Schwartz: Could be. 
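The lemma itself can be checked exactly on a toy non-monotone submodular function. A minimal sketch, assuming independent sampling (one particular distribution satisfying the marginal condition) and a made-up graph cut objective, with g(S) = cut(T ∪ S) for a fixed T, which is the shape in which the lemma is used in the analysis:

```python
from itertools import combinations

# Toy non-monotone submodular function: the cut function of a small graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (0, 4)]

def cut(S):
    S = set(S)
    return sum(1 for u, v in edges if (u in S) != (v in S))

T = {0}                     # plays the role of OPT in the analysis
A = [1, 2, 3, 4]            # the set we sample the random subset B from
g = lambda S: cut(T | set(S))

p = 0.3  # each element of A lands in B independently with probability p
# Exact expectation of g(B), enumerating all subsets of A with their
# product probabilities.
exp_g = 0.0
for r in range(len(A) + 1):
    for B in combinations(A, r):
        exp_g += p ** r * (1 - p) ** (len(A) - r) * g(B)

assert exp_g >= (1 - p) * g(())  # the lemma's guarantee, E[g(B)] >= (1-p) g(empty)
```

Here the check is exact rather than Monte Carlo, since the ground set is tiny; the lemma of course allows arbitrary correlated distributions as long as every marginal is at most p.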
>> Roy Schwartz: I didn't check it on that example. For linear functions, right? >>: No, even linear -- but, okay, yeah, anything less than K won't work because of that example. >> Roy Schwartz: I think you might be right, yes. So no one asked me what the meaning of the randomization here is, in some sense. Intuitively, you can think of it as some kind of insurance. If the function is monotone, then it might make sense to choose greedily, but if the function is non-monotone, you might regret it afterwards: you might lose value because you chose something that looked better at the current step, because the function is not monotone. So the randomization is intuitively some kind of insurance that tells you, okay, that might happen, but not with very high probability, okay? So with all this, the question is, can we do better? So in 2014, Badanidiyuru and Vondrák showed that if the function is monotone, there is an algorithm that essentially, for any epsilon, gets the tight guarantee. It loses epsilon in the approximation factor, but the running time, instead of being order NK, is order of N over epsilon log N over epsilon. >>: So when you say running time, you mean oracle -- >> Roy Schwartz: Yes. We'll get to that at the end of the talk. There will be some -- but for now I'm counting the value oracle calls, and the number of arithmetic operations is bounded by this also. So it's not that you have two to the N arithmetic operations -- that would be cheating -- I'm counting the number of value oracle calls, which relates to what you asked me, whether it's easy to implement this or not. Yes. >>: So can I ask about the K that you selected because of the [indiscernible]? If you take a parameter larger than K, do you expect to get a better result? >> Roy Schwartz: So you are asking, say, what the meaning of a [indiscernible] solution is. 
So the optimum contains up to K elements, but you allow the algorithm to output more than K? Can you get a much better approximation? >>: Still you have K, but instead of selecting randomly from the K largest, you increase the candidate list to, let's say, more than K. Do you expect to get better or not? >> Roy Schwartz: So that's a good question: how do you choose the size of the candidate list? I think for a cardinality constraint of K, this is the best choice, as far as I can remember, but I'm not 100 percent sure. But I'm quite sure. So if you ignore all the epsilons, essentially it's O epsilon of N log N running time, which can be faster than order NK. And there is a faster algorithm, which we showed with [indiscernible], and which also independently appeared in Mirzasoleiman, Badanidiyuru, Karbasi, Vondrák, and Krause in some AI conference whose exact name I don't remember; they showed how to get this essentially in order N log one over epsilon. So essentially this is linear now, and the dependence of the running time on epsilon is very mild -- it's log one over epsilon -- and you lose that epsilon in the approximation factor. And if I have time -- I think I'll have a few minutes -- I'll show you how this algorithm works, which is very simple. But this is only for monotone F. So you can ask, can you speed this up for non-monotone functions? Can you get better than order NK running time and lose only epsilon in the approximation guarantee? You can do that, but here, as you can see, there are essentially two algorithms whose running times are incomparable, depending on the value of K. 
So if you ignore this one, which is, let's say, less interesting, you can get the N log one over epsilon, but you lose a one over epsilon squared factor in the running time if F is not monotone, okay? And again, this is all for a cardinality constraint. And by the way, I think their paper actually also has experimental results, so they actually implemented this algorithm. All right? Yes. >>: So is it the same algorithm in the two papers? >> Roy Schwartz: Yes, the exact same algorithm. >>: So in practice, I guess you always know if your F is monotone or non-monotone, right? >> Roy Schwartz: Usually you know, yes. For this algorithm you need to know, because this algorithm -- >>: That's why I'm asking this. >> Roy Schwartz: Yes. The fact that the previous algorithm was oblivious to monotonicity was, let's say, a side effect. We didn't aim for that; it's nice to have, but it's not that important. You're right. Okay. So how do these two algorithms work -- let's say not this one, which works in a completely different way, but these two? So the idea now is to sample; these are sampling-based algorithms. The main idea, without going into the parameters -- and for the specific case of a monotone objective we'll go over the details -- is that, first of all, instead of looking at the entire ground set, you sample a candidate list M just randomly. So if this is the solution the algorithm has so far, you sample the yellow set. And we don't even sample it so that it avoids the elements the algorithm chose so far: you sample from the entire ground set, so the sample may contain things the algorithm already chose. You sample a random subset of some specific size, then you look at a fraction alpha of the top marginals in that set and you choose one of those uniformly at random. So if the sampled set were, let's say, the entire ground set or something of that sort, it would be similar. 
It's not exactly, but it's similar to what you saw before, except that instead of looking at the entire ground set, you look at some random part of the ground set and then choose uniformly at random from some top fraction of the marginals there. When I say top elements, the intention is top according to the marginals. >>: So when you say sample, how do you sample? >> Roy Schwartz: Randomly. So if I tell you that M is of size five, you take a uniformly random subset of size five. Uniformly, but it depends on the size, of course. >>: Yeah. >> Roy Schwartz: Yes, given the size, it's uniformly at random. And it might even contain things you already chose; you don't need to remove those. Just a uniformly random subset of that size. And then the question is how you choose the threshold alpha, the fraction of top marginals within the sample from which you choose uniformly at random. And that alpha is not random; it has to be chosen carefully. So the faster algorithm, say, for the monotone case, I'll describe now, and maybe I'll skip the proof of the -- >>: So is it just a one-shot thing? >> Roy Schwartz: No, you do this K times. This is how you choose one element: you do all of this, you choose uniformly at random one element, and you repeat this K times. >>: Then you choose a new M? >> Roy Schwartz: Yes, you choose a new M, yes. >>: Oh, I see. So the randomness comes in because every time you're looking at a random subset M. >> Roy Schwartz: Yes. >>: And M is much smaller than the entire -- is M constant size? >> Roy Schwartz: It's not -- you'll see in a second. So this is the general approach for all those algorithms, even when the objective is non-monotone and the running times are complicated, but I'll show exactly what this means in the case that the function is monotone, okay? So, the faster algorithm in the case that the function is monotone. Again, we start with an empty solution and we have K iterations as before. 
So now, instead of greedily choosing the top marginal among the elements outside what the algorithm chose so far, we sample a random subset, call it MI, of this size: N over K -- so it's not even a constant size, it depends on K -- times some log one over epsilon, and we'll see in a second why we need that. And now, what is the alpha? What fraction of the elements do we choose from? In this specific case we choose alpha such that we just take the best marginal in the sample, so there's no uniformly random choice here, right? Alpha is essentially one over the size of M in this case. So UI is the element of the sample with the largest marginal with respect to what the algorithm chose so far -- the best element in the sample. So it's even simpler in the case that the objective is monotone. You add it to the solution, you do this K times, and you output whatever you have. So this is the specific instantiation of that approach in the case where the objective is monotone. Intuitively, it will be easier to think of P as the marginal probability of an element belonging to the sample, which doesn't change over the iterations. So if the size of the sample is N times something, then that something is the probability, which is essentially one over K times log one over epsilon, okay? And we'll assume that epsilon -- of course, epsilon cannot be more than one minus one over E, because then there's no guarantee in terms of the approximation, right? Because we subtract epsilon. And epsilon is not too small either: it's at least E to the minus K, because if it's smaller than that, then log of one over epsilon becomes larger than K and you can just apply the greedy algorithm. So this gives you something better only if epsilon is not really, really small. And if it's too large, then you don't care what happens. Okay. So I have a few minutes for the analysis, and after that I'll show you, or just tell you, what is known for more complicated constraints. 
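The monotone-case algorithm just described can be sketched in a few lines. A minimal sketch -- the function names and the toy coverage objective are mine, not from the talk -- with sample size roughly (N/K) log(1/ε) and the best marginal chosen in each sample:

```python
import math
import random

def sampled_greedy(ground, f, k, eps=0.1, seed=0):
    """Sketch of the sampled greedy above: K iterations; each iteration
    draws a uniformly random sample of size ~ (n/k) * log(1/eps), which
    may contain already-chosen elements, and adds the element with the
    best marginal in the sample."""
    rng = random.Random(seed)
    n = len(ground)
    s = min(n, math.ceil((n / k) * math.log(1 / eps)))
    S = set()
    for _ in range(k):
        sample = rng.sample(ground, s)
        best = max(sample, key=lambda e: f(S | {e}) - f(S))
        S.add(best)  # a no-op if best was already chosen
    return S

# Toy monotone objective: coverage (element -> set of items it covers).
cover = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4, 5}, 4: {6}, 5: {2}}
f = lambda S: len(set().union(*[cover[e] for e in S])) if S else 0
S = sampled_greedy(list(cover), f, k=3)
```

Per the analysis sketched in the talk, this achieves (1 - 1/e - ε) in expectation for monotone f with only O(n log(1/ε)) value oracle calls; the data above is made up purely for illustration.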
So what did we prove for the greedy, or the randomized greedy? What was the proof plan? Again, the proof plan here is very simple. We proved that the expected gain in the Ith iteration is at least one over K times how far we were from the optimal value so far, right? What I'll convince you of, or just tell you, is that you can actually show you're not far from that here: you only lose a factor of one minus epsilon multiplicatively in the gain of every step. And once you have that, you can solve the recursive formula, and under the assumption that epsilon is not too large, which is exactly what we have, the output after the Kth iteration is at least one minus one over E minus epsilon times the value of OPT. So all we're going to do is show that we lose a factor of one minus epsilon in the gain of every step. Once we have that, we're done. So how are we going to do that? Again, the proof is not complicated, but it's slightly tricky. So once more we're going to sort the elements according to their marginals. We have the current solution the algorithm has so far, which is S I minus one. Now let's look, as the randomized greedy did before, at the top K elements in the entire ground set -- it could even include elements chosen so far, but then the marginal would be zero. So you look at elements in the entire ground set and you pick the K best marginals. Let's say V1 is the best one, V2 after that, and so on, right? In the randomized greedy this was our candidate list, but now we use it only for the proof, not for the algorithm. And now we're going to ask, what's the probability that our sample hits one of these? Intuitively, if it's high, we're in good shape. So we're going to define the event that, if we look at the prefix of the first J candidates according to this order, our sample actually contains at least one of them. 
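Written out, the proof plan above works out as follows (my reconstruction of the chain of inequalities; the slide's exact notation may differ):

```latex
\mathbb{E}[f(S_i)] - \mathbb{E}[f(S_{i-1})]
  \;\ge\; \frac{1-\varepsilon}{K}\bigl(f(OPT) - \mathbb{E}[f(S_{i-1})]\bigr)
\;\Longrightarrow\;
f(OPT) - \mathbb{E}[f(S_i)]
  \;\le\; \Bigl(1 - \tfrac{1-\varepsilon}{K}\Bigr)\bigl(f(OPT) - \mathbb{E}[f(S_{i-1})]\bigr)
\;\Longrightarrow\;
\mathbb{E}[f(S_K)]
  \;\ge\; \Bigl(1 - \bigl(1 - \tfrac{1-\varepsilon}{K}\bigr)^{K}\Bigr) f(OPT)
  \;\ge\; \bigl(1 - e^{-(1-\varepsilon)}\bigr) f(OPT)
  \;\ge\; \Bigl(1 - \tfrac{1}{e} - \varepsilon\Bigr) f(OPT),
```

where the last step uses $e^{\varepsilon} \le 1 + e\varepsilon$ for $\varepsilon \in [0,1]$, which is where the assumption that epsilon is not too large comes in.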
So this is what is written here: XJ is the indicator that the sample contains at least one of the J largest marginals, right? So there are two observations here. I'll just state them and explain, and tell you how you prove them, but I won't go into the exact proof. The first one is that, conditioned on what you have so far, you can decompose the marginal value of UI -- the element the algorithm eventually added -- again into that telescopic sum, and think of it in the following way. Let's say V3, for example, is the smallest index here that is in the sample. So what is X1? It's zero, right? X2 will be zero. X3 will be one. What will be X4? >>: One. >> Roy Schwartz: They will all be one, because we look at the prefix. So once -- let's say from this list, V3 is the first one that appears when you start scanning it -- then X3, X4, up to XK will all be one, right? So it's like a threshold function. So now think of it the following way. You start with what you have so far, and you look at the marginal of the worst one on this list, and this indicator says whether this one or someone before it appears. And then you keep summing these indicators. So essentially what you have here on the right-hand side is, written as random variables, the element that the algorithm chose, right? Because the minimum index tells you where the Xs stop being zero and start becoming one. Right? But in the worst case, it might also be that the algorithm didn't add any one of these, which is why there's an inequality here, right? And here we use the fact that the function is monotone, right? Because choosing some element not on the list could have a negative marginal otherwise, so this is one way to rewrite this. And the second observation, which is simple, concerns P, the probability we noted before: what's the probability that XJ is not zero? 
So it means that from the first J elements, at least one is chosen. So it's one minus the probability that none of them is chosen, which is essentially this, if P is the probability that each element is in the subset. There are some dependencies here, because the elements are not chosen independently, but this inequality is still true. Right? Once you have this, we're nearly done. So with these two very simple observations, the proof is complete. Why? You take the expected gain in the Ith iteration and you just plug those two inequalities in together. We have a lower bound on the probability of XJ and we have this telescopic sum, so it's at least this. And the question is, how do you lower bound this? I'll skip this very quickly. You have a summation over a product of two sequences: this one minus P to the power of something, and the marginals, which we ordered according to their values. They're both monotone, so we can use Chebyshev's sum inequality and get this, and the bottom line here is just a geometric sum. When you do this, you get exactly what we wanted to prove, right? So I'm skipping this very quickly. And this completes the proof, because from this point on, I claim, you just copy/paste from the previous proof, carry this extra term along, and it all works. So this was just a sample from the sampling algorithms -- what happens when the function is monotone -- and as I said, I think in the parallel paper they even implemented this algorithm, as far as I can remember. So now, what happens with additional constraints? I'll go over this very, very quickly. 
So one interesting constraint is the partition matroid, and one of the motivations for this is the submodular welfare problem. I'm not going to go into all the previous work, but at a very high level, you have these smiley faces, which are the players, and you have these goods, which are indivisible, and you want to assign every player a subset of the goods. Each player is equipped with a utility function which is submodular and monotone, like in auctions, and you want to distribute the elements to the players and maximize the total welfare, essentially. So this is a special case of a matroid constraint, so one minus one over E was known in polynomial time by the continuous greedy of Călinescu, Chekuri, Pál, and Vondrák. And in that same 2014 paper I mentioned before, Badanidiyuru and Vondrák -- and this holds for a general matroid -- lose an epsilon, and the running time, if we ignore polylog factors and the dependence on epsilon, is NK. That's the fast algorithm. So the question is, can you do faster? And the answer is that you can break this barrier. You can get -- so this is with Niv Buchbinder and Moran Feldman -- the same guarantee, and if you ignore the polylog factors and the dependence on epsilon, the running time is K square root N plus N for maximizing a monotone submodular function over a partition matroid. So maybe I'll skip what happens in a general matroid. I'll just mention that for a general matroid there is an interesting question: up to this point, like Mohit said, we were aiming to minimize the number of value oracle queries, but for a general matroid, how do you know whether a solution is feasible or not, right? For a cardinality constraint it's easy, and a partition matroid is easy, but for a general matroid you have what is called an independence oracle. 
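As an aside, the plain greedy for submodular welfare -- not the fast algorithm from the talk -- is easy to sketch, and for monotone submodular utilities it is known to give a 1/2-approximation. The players, goods, and coverage-style utilities below are made up for illustration:

```python
# Plain greedy for submodular welfare: repeatedly assign the (item, player)
# pair with the best marginal gain in that player's utility.
def welfare_greedy(items, utilities):
    assign = {p: set() for p in utilities}      # player -> assigned goods
    remaining = set(items)
    while remaining:
        gain, i, p = max(((u(assign[q] | {j}) - u(assign[q]), j, q)
                          for j in remaining
                          for q, u in utilities.items()),
                         key=lambda t: t[0])
        assign[p].add(i)
        remaining.remove(i)
    return assign

# Toy instance: two players, three goods; each player's utility is how
# many distinct features of interest their bundle covers.
cov = {'alice': {1: {'a', 'b'}, 2: {'b'},      3: {'c'}},
       'bob':   {1: {'a'},      2: {'b', 'c'}, 3: {'c', 'd'}}}

def make_utility(p):
    return lambda S: len(set().union(*[cov[p][i] for i in S])) if S else 0

utilities = {p: make_utility(p) for p in cov}
assign = welfare_greedy([1, 2, 3], utilities)
```

The continuous greedy mentioned above improves this 1/2 to the optimal 1 - 1/e, at the cost of a much heavier (though still polynomial) algorithm.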
That actually tells you whether a given subset is feasible or not. And we have an algorithm that shows there's a tradeoff between the number of value oracle queries and the number of independence oracle queries: you get the same approximation guarantee, but which end of the tradeoff you want depends on the application you have. And we suspect that this tradeoff is essentially an artifact of the algorithm, but we don't know of any lower bound and we don't know how to get rid of it. So there is something interesting there that can actually save in the total number of queries, but if someone is interested, I can tell you more about that later. So thank you. [applause] >>: Any more questions? >>: So are you still working in this space? >> Roy Schwartz: So on maximization we're currently not working much. There is a project about submodular minimization, but again, that is completely different, because it's not a hard problem -- you can solve it exactly -- but I can tell you more about that later if you're interested. >>: So that is through convex optimization, or -- >> Roy Schwartz: Okay. In general, yes, that's one way to do this. But for the faster algorithms that are known today, again, it depends on whether you want strongly polynomial or polynomial. The strongly polynomial algorithms have no dependence on the different values the function can take, but their dependence on the size of the ground set is worse than the polynomial ones. For the polynomial ones, the dependence on N is much better, but there's an extra dependence on what is called capital M -- usually it's log of capital M. Capital M is the following: if you assume that the submodular function is integral, then it's the function's largest value in absolute value. The function could be negative -- I don't know; since you can solve the problem exactly, it doesn't bother you to shift it. 
So if you normalize it so that the value of the empty set is zero -- you can always do that -- then it's between minus M and M and it's integral. So the polynomial algorithms have some dependence on N and on log M. Usually it's only log M, not even poly log M -- just log M. But again, it depends on what kind of algorithm you're interested in. >>: So the strongly polynomial ones will not depend on M at all? >> Roy Schwartz: They will not depend on M, but those are usually -- okay, I can just cite what one of the authors of those algorithms said: they're highly non-[indiscernible]. So I don't know. They're probably not very simple algorithms, in terms of exactly understanding what's going on. In some sense they are extensions of max flow, right? Because max flow -- on a directed or undirected graph, it doesn't matter -- is equivalent to min cut, and min cut is a special case of submodular minimization. So the strongly polynomial algorithms are, at a very high level, extensions of the strongly polynomial time algorithms for max flow. >>: Any more questions? >> Roy Schwartz: Thanks. Okay. [applause]