>> Eric Horvitz: Okay. We're honored today to have Andreas Krause visiting us. Andreas we know very well; he was one of our fabulous interns over the years. We had him here a couple of years ago now — three years ago maybe, three summers ago; time really flies. Andreas is an assistant professor at Caltech, and he finished his Ph.D. at CMU in 2008. He has a number of awards, including an NSF CAREER award. And in sitting with him, hearing about his latest results, I have to say that I'm very excited about the direction he's gone in generalizing the work he's so well known for from his dissertation era. This new approach on adaptive submodularity really gets at how we apply some of those interesting submodularity results to the actual case where information is revealed dynamically over time, which is, indeed, the general case. So go for it.

>> Andreas Krause: Thank you so much, Eric, for inviting me and for this kind introduction. Welcome, everyone, to my talk. It's always a pleasure to be here at Microsoft Research. Today I'm going to tell you about some brand new results in the area of active learning and stochastic or interactive optimization, using a generalization of the notion of submodular functions to adaptive policies. This is mainly joint work with Daniel Golovin, who is a post-doc at Caltech, and the last part of the talk is also joint work with Deb Ray, a grad student at Caltech.

Before I get to the new stuff, I'll bring everyone up to speed with a quick introduction to submodularity and some of the applications I've been working on in the past. I've been really interested in problems of optimized information gathering. For example, we've been working with roboticists at USC and UCLA on using robotic boats to monitor rivers and lakes for pollution. I've also been collaborating with civil engineers back at Carnegie Mellon on placing sensors in drinking water distribution networks for detecting contaminations. We've also looked at sensor placement problems in activity recognition and intelligent buildings, and at more general notions of what sensing means, applying some of these ideas to information gathering and information retrieval problems on the web.

In all these problems, our goal is to learn something about the state of the world — say, water quality in a particular geographic region — and we can do so by placing sensors, making measurements, or conducting experiments, which are typically expensive, so we can only make a limited number of them. The key question in all these problems is: how can we most cost-effectively get the most useful information? This is really, in essence, a fundamental challenge in machine learning and AI — the problem of how we can automate notions of curiosity and serendipity. Since this is such a fundamental problem, it's been studied in a lot of areas, including experimental design, operations research, AI, machine learning, spatial statistics, robotics and sensor networks. Most of the existing techniques out there can be broadly grouped into two categories. There are heuristic algorithms that work really well in some applications, but they're not theoretically well understood and can potentially do arbitrarily badly, which for some applications could be problematic.
There are also algorithms that have the more ambitious goal of trying to find optimal solutions, and they include techniques such as mixed integer programming and solving partially observable Markov decision processes. But these techniques are typically very difficult to scale to large problems. What I'm really interested in is developing algorithms that both have strong theoretical guarantees and scale to really large problems. And rather than just working on the theoretical side, I'd really like to apply the results in actual applications.

As a running example, let's think about deciding where to put a bunch of sensors in a building to monitor temperature — for example, to detect whether there's a fire or not. In general, we take a probabilistic approach. One way to do this is, for example, to have a random variable at every location S that models the temperature at that particular location. These are spatially correlated, so we have a joint distribution modeling the correlation among the temperatures, which is typically informed by some physical understanding of the phenomenon. Now, we can't observe the temperature directly, but we can make noisy observations by putting out sensors. So if you have a sensor at location S, then you get the sensor value Y_S, which is some noisy copy of the true underlying temperature X_S. We work in the Bayesian setting, where we have a prior distribution over the temperatures modeling the correlation, along with a likelihood function that characterizes our assumptions about the noise of the sensors.

Once you have such a model, you can start talking about the utility of making observations. Suppose we start with the uniform distribution and assume, for example, that the temperature is cold, normal or hot with equal probability at all locations. Now, if a sensor placed at location 1 tells us there's high temperature there, we can do Bayesian inference to calculate the posterior distribution, which may indicate that it's quite likely that at location X1 the temperature is higher than usual — and, through the correlation, probably also at locations close by. Typically, we have to make decisions or take actions based on this posterior distribution, and therefore we prefer posterior distributions that help us make these decisions more effectively. So for now, we will just assume that we have some reward function that takes this posterior distribution and tells us how useful it is; I'll give you examples as we go on in this talk. If you make a different observation — say, cold temperature at location 3 — then you get a different posterior distribution, which gives you a different reward.

There are various examples of reward functions that have been considered. One: if you're in a situation where you want to decide whether there's a fire or not, then you have to ask the question, should we raise an alarm? So we have two actions — raise an alarm or not raise an alarm — and the world can be in two states: there could be a fire or there could not be a fire. If there is a fire and we don't raise an alarm, bad things happen. Similarly, if there is no fire but we do raise an alarm, you have a false positive; if you have too many of those, people won't believe our system anymore, and eventually bad things will happen too.
Now, if we knew the correct state the world is in, we could just take the optimal action. The problem is that we don't know that; we only have a belief about the state of the world, the posterior distribution, and the best thing we can really do is take the action that maximizes the expected utility. And this gives us a way of quantifying the usefulness of a particular posterior distribution: we can just use the maximum expected utility when acting optimally based on our posterior distribution. That's called the decision-theoretic value of information, which has been an extremely useful and powerful concept throughout AI and decision theory.

Now, in some applications we may not a priori have a utility function, so we may just want posterior distributions that are as certain as possible. One way of quantifying that is the notion of entropy. If you think about spatial prediction problems — if you use robots to study lakes, or try to figure out what the temperature is everywhere in this building — you might think about the mean squared prediction error based on our observations. These are all ways of taking a posterior distribution and turning it into a utility, and there are other objective functions that are useful and have been used in practice.

Now we have these reward functions, and we can use them to quantify how useful any given set of sensor locations would be. The issue is that, a priori, when we place a sensor somewhere, we don't know what it is going to tell us. So the only thing we can really do is average with respect to the observations these sensors are likely to make, weighted by the reward we would get under those particular observations. This gives us an expected value of information for any set of sensors we may want to place, and now this is an objective function we can try to optimize. The simplest question we can ask is: what's the best set of K locations to place sensors?

The first thing we did was try to understand for what kinds of problems we can solve this exactly, and it turns out that for some problems you actually can; it depends on the structure of the underlying model. If the underlying graphical model is a chain, like a Markov chain — for example, a conditional random field or hidden Markov model where you'd like to label the hidden states — then you can find the optimal value of information efficiently. But it turns out that if you try to just slightly generalize that, from chains to trees, the problem suddenly becomes really, really hard. And as soon as you start talking about spatial correlations, you have much more complex dependencies than simple chains, so the problem becomes really hard — NP^PP-complete — and we can't expect to find the optimal solution.

So instead of trying to find the optimal solution, let's try to at least find a good approximate solution. Probably the simplest approximate algorithm we can think of is the greedy algorithm, which is used a lot: we start with no sensors and iteratively place a sensor at the location where it increases our value the most. So you place one sensor at a time, see how the objective function increases, and you stop after you've placed all K sensors, but you never change any decisions you've already made.
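To make the procedure concrete, here is a minimal Python sketch of that greedy selection. It is not code from the talk; `candidates`, `expected_reward` (the set function F being maximized, for example the expected value of information) and `k` are placeholder names for the quantities just described.

```python
def greedy_placement(candidates, expected_reward, k):
    """Greedily pick k sensor locations: each step adds the location with the
    largest marginal gain in the objective F, and earlier choices are never revised."""
    selected = set()
    for _ in range(k):
        base = expected_reward(selected)
        # Evaluate the marginal benefit of every remaining location and keep the best.
        best = max(candidates - selected,
                   key=lambda s: expected_reward(selected | {s}) - base)
        selected.add(best)
    return selected
```

Each of the k rounds evaluates the marginal gain of every remaining candidate, so the run costs on the order of n·k evaluations of F; the lazy evaluation trick discussed later in the talk avoids recomputing most of them.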
And the question is, how well does this simple algorithm do? One way of trying to answer that is to run experiments. We can take some temperature data, run the greedy algorithm to maximize, for example, the information gain, and if the problem is small enough, we can actually find the optimal solution through exhaustive enumeration. It turns out that the greedy algorithm gets really close to the optimal solution, and, in fact, we see this in a number of different problems. So the question is: is there any justification for why the greedy algorithm should do well?

It turns out that the key insight for analyzing this greedy algorithm is the following natural notion of diminishing returns. Suppose you have two placements, A and B. In A, you've placed two sensors, 1 and 2. In setting B, you have three more sensors, 3, 4 and 5. Now think about the additional value that a new sensor would give us in either of those situations. If you add the sensor S to the first set, we gain a lot of additional coverage, a lot of additional information; whereas if you place a sensor at the same location in the second deployment, you only get a little bit of additional information.

We can formalize this diminishing returns property using the notion of submodular functions. A set function F takes a set of locations and outputs a value. F is submodular if, whenever a set A is contained in a superset B and we consider adding a new element S to either of those sets, we gain more by adding this element to the small set than by adding it to the large set. This exactly captures what's going on here, and in formulas it just means that F of A union {S} minus F of A is greater than or equal to F of B union {S} minus F of B. For the sake of notation, we'll just write delta of S given A for the marginal benefit of adding the element S to the set A.

Why did I tell you about this? First of all, we can actually show that in the sensor placement application, the information gain is, in fact, a submodular function. And it's known that you can maximize submodular functions using the greedy algorithm and get guarantees: whatever placement the greedy algorithm gives you obtains a constant fraction — about 63% — of the optimal value. In fact, for information gain, that's the best possible ratio you can get from any efficient algorithm. So in some sense, it's a really good algorithm to use for this kind of problem.

In my dissertation research, we pushed this idea in a number of different directions. Basically, all these applications can be cast as the problem of maximizing some submodular objective function subject to some interesting constraints, working out algorithms and characterizing when these problems are submodular and so on. But what all of these have in common is that you find a set of observations you want to make in advance, before obtaining any measurements. For example, you decide on the locations where you want to place your sensors before you actually get to see any measurements. In that sense, these are results about non-adaptive optimization problems; they're non-adaptive with respect to the observations you would make. Now, there are a lot of really interesting information gathering problems where you want to be adaptive. One example is medical diagnosis.
Suppose you're a veterinarian and you'd like to diagnose a sick puppy. What you can do is run some tests — for example, you could measure the puppy's heart rate — and then, depending on the outcome, decide on the next test that you want to run; take an x-ray, and depending on how that looks, decide on the next test, take a blood sample, and so on. You want to diagnose this puppy as cheaply as possible, but all the tests you run can depend on the measurements you've made in the past. So now you're no longer interested in finding a fixed set of tests, but a policy — a decision tree that adapts to the observations you've already made. The issue is that we're now trying to optimize over these policies instead of over sets, so you can't use the notion of submodular set functions to analyze these algorithms anymore. And so the question we asked is: is there some natural generalization of the notion of submodular functions to these adaptive policies?

To give you some intuition, let's look at a really simple submodular optimization problem, one of the most natural ones — the set cover problem — and think about it in the context of sensor placement. We have a bunch of possible locations, 1 through N, and each of these locations is associated with some kind of sensing region. So if you have a sensor at location 1, then you're going to see the green area, the set W1. A sensor placement is just a subset A of the locations, and the total value of that placement is the total area covered by the regions associated with the elements you pick. So you have a function F that takes a set of locations and outputs the area of the union of the regions you pick. That's a simple example of a submodular function; it's fairly easy to see.

And the set cover problem has a very natural adaptive analog. Suppose you're in a setting where you don't know what the sensing regions are in advance. If you put out a sensor — for example, a camera — then it could either observe all of the hallway, or there could be some obstacle, doors could be open and so on, and the sensing region gets reduced by that. Or the sensor could fail, and you don't get to see anything at all. And you don't know this ahead of time, before you actually place the sensor. In this situation, you can think of every sensor as being associated with a collection of sets — a collection of sensing regions — and a random variable; for example, for location 3, there's a random variable X3 that tells us which of those sets gets realized. In this example there are two sets, the yellow set and the green set: if the variable X3 takes value one, the yellow set gets realized; if it takes value zero, the green set gets realized. So the set that is picked now depends on the realization of those random variables, and we can think about coming up with adaptive policies where you pick a location, the set gets revealed to you, you pick another location, again the set gets revealed to you, and so on. It's a very natural adaptive analog. And now, if you have a bunch of these sensors, you can define an objective.
And now the value of the placement A in a particular world state — a joint realization of all those random variables — is just the area covered by the union of the sets that are realized for the elements you pick. So it's a very natural analog of the set cover problem that's been studied in the literature.

More formally, the kind of optimization problem we're going to study is a setting where we have a collection of items, 1 through N. Each of those items has some random variable associated with it, and we have some objective function F that takes a subset of items we pick and a world state — the joint realization of all the random variables, which sets get realized where — and tells us how useful that is. And now we can quantify the value of a policy. A policy is some kind of decision tree: depending on the outcomes of the tests, you decide which item you pick next. So for every possible world state x_V, the policy could realize a different set; there's some set pi of x_V that's realized if the world is in state x_V. The value of a policy is just the expected value of the sets picked by the policy under the respective world states, averaged over all the states the world can be in. You can think of it this way: you have a decision tree, the realization of the world tells you which path you take down this decision tree, at the leaves you get some value, and you look at the average value. And now you can try to maximize over all these policies — that's a well-defined optimization problem.

Now, the issue is that there are a lot more policies than sets, and this is a strict generalization, because if you just fix all the outcomes to one single value, the problem reduces to the classical set function optimization problem. So clearly the problem is hard, hard to approximate — there are very strong hardness results for this problem. Since it's a hard problem, we can't expect to find the optimal solution in general, so let's try to find a good solution.

What's the natural algorithm we could try? Well, we could try some kind of adaptive variant of the greedy algorithm. How would that work? Suppose we have already made some observations. In the sensor placement problem, it means we've seen the realizations of some of the sets — for example, sensors 1 and 3 have been realized with these green sets — and now we can look at not the marginal benefit but the expected marginal benefit of adding a new sensor, S2, conditioned on those observations. We use the notation delta of S given x_A to denote this expected marginal benefit conditioned on the particular observation x_A. Yes?

>>: So you don't know whether you're going to get the green or the yellow when you place S2. Do you know what the green and the yellow are?

>> Andreas Krause: You know what they are, and you have the distribution over how likely they are.

>>: Okay.

>> Andreas Krause: Good. And so once you have these expected marginal benefits, we can really easily implement an adaptive greedy algorithm that starts with nothing selected and then iteratively adds the item that maximizes the expected marginal benefit conditioned on what it has seen. Once it picks an element, it observes the outcome, does a Bayesian update based on that observation, and adds the element to the set. That's the adaptive greedy algorithm.
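As a rough sketch, the adaptive greedy policy just described looks like this in Python. The names are placeholders, not from the talk: `expected_marginal_benefit(s, obs)` stands for delta of S given x_A, the oracle that averages over the still-unknown outcomes given everything observed so far, and `observe(s)` stands for actually deploying the sensor (or running the test) and seeing its realization.

```python
def adaptive_greedy(candidates, expected_marginal_benefit, observe, k):
    """Adaptive greedy policy: repeatedly pick the item whose expected marginal
    benefit, conditioned on all outcomes seen so far, is largest; then observe
    its outcome and condition on it for the next pick."""
    observations = {}  # item -> realized outcome (e.g. which sensing region appeared)
    for _ in range(k):
        remaining = candidates - set(observations)
        best = max(remaining, key=lambda s: expected_marginal_benefit(s, observations))
        observations[best] = observe(best)  # the Bayesian update lives inside the benefit oracle
    return observations
```

Unlike the non-adaptive sketch earlier, the output is not a fixed set but a trace of one path through the implicit decision tree; rerunning the same policy in a different world state would pick different items.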
To see how this works in the sensor selection problem: you have these three sensors and you want to select two of them. The one with the highest marginal benefit, given nothing, is sensor S2, so the greedy policy starts with sensor S2. Now that we've picked it, one of those sets gets realized — for example, the green one. Conditioned on that, I can look at the marginal benefit of S1 or S3, and since there's quite a lot of overlap of the green set with S3, I may want to pick S1. So, conditioned on the green outcome, I pick S1, and once I've picked it, I get to see which set gets realized — maybe this yellow one. But if we rewind, and instead of the green one the yellow one gets realized, then it may now be better to pick sensor S3 instead of S1. So we pick that one, one of its sets gets realized, and that's our value. So the greedy algorithm now doesn't construct a set, but a policy. And the question is, how does this policy compare against the optimal policy?

We know that in the classical setting, where we want to pick sets, if the objective function is submodular, then the greedy algorithm gives us a good solution. So the question is whether there's some adaptive version of submodularity. Here's the generalization we came up with, and it's based on these expected marginal benefits. We call an objective function F, together with a distribution P over world states, adaptive submodular if these conditional expected marginal benefits are monotonically decreasing as we make more observations. What this basically means is that if you compare two situations — x_A, where we've made some observations, and x_B, where we've made those observations and more — then the marginal benefit of any new item S conditioned on x_A has to be greater than or equal to the marginal benefit conditioned on x_B. It's just the natural generalization of the set function case: there you had delta of S given A greater than or equal to delta of S given B; now you don't just condition on the set, you condition on the set plus the resulting observations.

>>: [inaudible] how much more restrictive that is, I mean, to find that kind of submodularity.

>> Andreas Krause: That's a really good question. I'll give you some examples arguing that this is a useful notion in a bunch of applications. But of course it's a really new concept, and I think there's a lot to be studied to say which problems satisfy this property.

>>: I mean, sitting here, I can imagine cases where it's -- [inaudible] -- it's not. Can you give easy examples where it's not? Talk about the [indiscernible].

>> Andreas Krause: [inaudible]. Good. I should also say that you need monotonicity — there's an adaptive monotonicity condition which says the expected marginal benefits have to be nonnegative: whenever I add an element, it increases my value in expectation. And now what you can show is that if F is adaptive submodular and adaptive monotone with respect to this distribution P, then the nice result we saw about the greedy algorithm still carries over: the adaptive greedy policy obtains a constant fraction, about 63%, of the optimal value. In fact, a lot of the other nice properties that classical submodular functions have also carry over to the adaptive setting, and I'll give you some examples as we go on.
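Written out in symbols (my notation, not lifted from the slides), with x_A a partial realization of the items in A:

```latex
% Expected marginal benefit of item s given the partial observation x_A:
\Delta(s \mid x_A) \;=\; \mathbb{E}\!\left[\, f(A \cup \{s\}, X_V) - f(A, X_V) \;\middle|\; x_A \,\right]

% Adaptive monotonicity:   \Delta(s \mid x_A) \ge 0   for every item s and observation x_A.
% Adaptive submodularity:  whenever x_B extends x_A (same observations plus more),
\Delta(s \mid x_A) \;\ge\; \Delta(s \mid x_B)

% Consequence stated in the talk (analogue of the classical greedy bound):
F_{\mathrm{avg}}(\pi_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\, F_{\mathrm{avg}}(\pi^{\ast})
```

Here F_avg denotes the expected value of a policy averaged over world states, and pi* the optimal policy with the same budget.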
So let's see what this means for the adaptive set covering problem, which has been studied quite a bit — both the maximization version, where you want to find a collection of sets with maximum value, which has been studied by [indiscernible]; in this case the greedy algorithm gets the one minus one over e guarantee. But you can also think about a notion of coverage, where, for example, you want to find the cheapest policy that covers the entire building — that always makes sure all locations are covered. That is a natural generalization of the set cover problem, and for that the analysis also gives you a guarantee, which is basically optimal: there's a matching lower bound from classical set cover.

So this is set cover. Those were theoretical results, and you may not care about theorems, so here are some practical results that may actually be even more useful from an applications perspective. It's the fact that you can use lazy evaluations to run the greedy algorithm — something that's been shown to be extremely useful in the classical set function case — and that also carries over to the adaptive setting. The way it works is this. You have a set A of observations, S1 through Si, and what the greedy algorithm has to do in every iteration is pick the item S_{i+1} that maximizes the expected marginal benefit, conditioned on what it has seen. So it looks at all the expected marginal benefits and finds the maximum. Now, adaptive submodularity implies a really interesting fact about these marginal benefits over the course of the greedy algorithm: if you fix a particular item S, then its expected marginal benefit has to be monotonically decreasing over the course of the run. That's an easy consequence of adaptive submodularity. It basically means that if at some iteration you have this yellow marginal benefit for item S, then at any subsequent iteration the marginal benefit can never be more than it was before — these marginal benefits can never increase.

Why is this useful? You can exploit it in a really interesting variant of the greedy algorithm called the lazy greedy algorithm. The original version of this for classical submodular functions is due to Minoux in 1978, and we show that you can generalize and make use of the same insight in the adaptive setting. The first iteration is business as usual: we calculate the marginal benefits with respect to all items, and the best one, say, may be item A, so we pick A. Now, in the next iteration, naively you would have to recompute the marginal benefit for all four remaining elements — four function evaluations, which could be expensive. What you can do instead is, rather than recomputing all of them, just use the previous values: sort them and take as the best guess the one that had the highest value before. So we first look at how good D would be. Of course, the old value is only an upper bound on the true value, so we have to recompute it, and by recomputing it, the value could go down. If it goes down, you just re-sort it and put it back into this priority queue. Now the next best guess would be B, and if you recompute it and it still has the same value, then we know it has to be the best remaining item — we never have to look at E and C again. In this simple example that saves two function evaluations, which doesn't seem like a lot, but in practice it can matter a whole lot.
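A compact way to implement that bookkeeping is with a max-heap of stale marginal benefits. This is a hedged sketch under the same placeholder names as before; it relies only on the property just stated, that adaptive submodularity makes every item's expected marginal benefit non-increasing over the run, so a stale value is always an upper bound.

```python
import heapq

def lazy_adaptive_greedy(candidates, expected_marginal_benefit, observe, k):
    """Lazy adaptive greedy: keep stale benefits in a max-heap (negated for heapq)
    and only recompute the item currently at the front of the queue."""
    observations = {}
    heap = [(-expected_marginal_benefit(s, observations), s) for s in candidates]
    heapq.heapify(heap)
    for _ in range(k):
        while True:
            _, s = heapq.heappop(heap)              # item with the largest stale bound
            fresh = expected_marginal_benefit(s, observations)
            if not heap or fresh >= -heap[0][0]:    # still beats every other upper bound
                observations[s] = observe(s)
                break
            heapq.heappush(heap, (-fresh, s))       # value dropped: re-insert and retry
    return observations
```

In the worst case nothing is saved, but typically only a handful of items near the top of the queue ever get re-evaluated, which is where speedups like the ones reported next come from.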
Here's one preliminary experiment we did, on an adaptive sensor placement problem — I won't go into too much detail. We take data from 350 traffic sensors on a highway in California, and the goal is to adaptively select sensors in order to maximize the value. If you compare the naive greedy algorithm against the lazy greedy algorithm, we get performance improvements by a factor of 30 to 40 on this problem, and that can make a difference in practice. In the classical setting, there have been studies showing the improvements can be even bigger. So practically, this is a really important benefit you get if your objective function is adaptive submodular.

Another nice consequence of adaptive submodularity is that you can calculate data-dependent bounds. I told you about the one minus one over e bound; in some sense, that is a worst-case bound that holds no matter what the problem instance is. But the instances you work with in practice may not be as adversarial as in this worst-case analysis. What you can do is calculate instance-dependent or problem-dependent bounds that are often much tighter than the offline worst-case bound: you run your algorithm and then use submodularity to get a certificate on how close you are to the optimal solution. That's known for classical submodular functions too, and it also carries over to the adaptive setting. I won't go into too much detail, but this is again from the placement problem: the x-axis is the number of sensors picked, the y-axis is the objective value. The blue curve is the adaptive greedy algorithm, the black curve is what you get from the one minus one over e bound I told you about before, and the red curve is what you get from the data-dependent bound — you can see it's tighter than the one minus one over e bound. So in practice, you can get problem-specific bounds just from knowing that your objective function is adaptive submodular. And you can get these bounds for any algorithm you may run, not just the adaptive greedy algorithm. So these are some more reasons why adaptive submodularity is useful.
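For the non-adaptive case, one standard form of such a certificate (my summary, not a formula shown in the talk) is the following: for a monotone submodular F, any set A you have computed, and the optimal set A*_k of size k,

```latex
F(A^{\ast}_{k}) \;\le\; F(A) \;+\; \sum_{i=1}^{k} \Delta(s_i \mid A),
\qquad s_1,\dots,s_k \text{ the items with the } k \text{ largest gains } \Delta(s \mid A).
```

So after running any algorithm at all, you can read off an instance-specific upper bound on the optimum from quantities you have already computed; the adaptive bounds mentioned above play the same role for policies.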
Now I'll tell you about some more applications, because so far I've only talked about set selection and the adaptive set covering problem. One really interesting application is viral marketing. Suppose we'd like to get a new product on the market, and we want to convince people to buy it. The idea behind viral marketing is that you can give the product for free to a bunch of people, and they hopefully convince their friends, who convince their friends, and so you maximize the overall impact. Of course, the question is which set of people we should give the product to for free to maximize our expected influence. That's a problem that was studied by David Kempe, Jon Kleinberg and Éva Tardos in a really nice KDD paper. They have a particular model of how influence propagates: they take a social network of people and annotate all the edges with probabilities. So suppose we give the free product — maybe a phone — to Alice. Then Alice can try to convince her friends: for example, she has a 30% chance to influence Bob, which may fail, and a 50% chance to influence Charlie, which may succeed. Charlie is influenced, buys the phone himself, and then he tries to influence Bob again, which may fail, and Dorothy, and so on. Over time, you see how the influence propagates through this network.

What Kempe, Kleinberg and Tardos showed is that the expected number of people influenced is a submodular function of the initial set of people you select. So if you want to run an advertisement campaign where you have a budget to give out ten phones, or some number of phones, then you can use the greedy algorithm to find a near-optimal set of people that maximizes the expected influence. But this was again the non-adaptive setting: you have to commit to the people in advance. There's a very natural adaptive analog. In practice, what you may want to do is run your marketing campaign in stages: you give the phone to a bunch of people first, see how successful they are in influencing others, learn from that, then pick another set of people conditioned on what you've seen, and so on. So here, for example, we may pick Alice first, get to see how successful she was in influencing people, and based on that pick Fiona second, get to see how effective she is, and so on. It turns out that's an adaptive submodular problem — maybe a more compelling application than the adaptive set covering problem. So the adaptive greedy algorithm gets you a one minus one over e guarantee relative to the optimal adaptive policy, and you can also use these nice tricks with lazy evaluations, online bounds and so on. So now we've started talking about information gathering and active learning problems, and now let's finally get to active learning. Yes?

>>: I have a question about this one minus one over e issue. You're comparing to the best adaptive solution. Can you say something about how good you are compared to the best nonadaptive solution?

>> Andreas Krause: That's a good question — the adaptivity gap. It turns out that for maximization, it's not entirely clear how big this adaptivity gap is. For coverage — if you, for example, want to achieve, say, 90% market coverage, or in the set covering problem you want to cover everything — you can show that there are very large gaps, so you can do a lot better by being adaptive than by not being adaptive. For example, for adaptive set cover there's an adaptivity gap, due to [indiscernible], of n over log n. I don't know it for the viral marketing problem, but for set cover it's large. Good.

So that was viral marketing. Now let's talk about active learning — in particular, about diagnostic problems. We'd like to diagnose a disease, and we can run tests. You start with a bunch of hypotheses; these pictures are just different hypotheses for diseases the puppy may have. Now we take a Bayesian approach: we have a prior over hypotheses and some likelihood function for the outcomes. And let's start with the setting where the observations are deterministic conditioned on the true hypothesis — any particular disease uniquely determines the outcome of every test, so there's no noise whatsoever. If you're in this setting, then any test cuts away part of the hypotheses.
For example, if you find that X1 equals one, it eliminates some of the hypotheses. If you find that X3 equals zero, it again cuts off part of the hypotheses. The problem is, of course, that a priori we don't know the outcome of a test. In particular, if you pick test X2, then we could either eliminate the two hypotheses on the left or the two hypotheses on the right, and you don't know a priori which it will be. This looks like a kind of adaptive set cover problem, right? Because we'd like to cover — eliminate — all the hypotheses except the true one. And now, of course, the question is how we should test. One natural objective is to look at the expected reduction in mass of the hypotheses that we eliminate with a test: for every test, look at both outcomes, see how much hypothesis mass we rule out, and average over the outcomes weighted by their likelihood. That is called generalized binary search; it turns out to be equivalent to greedily maximizing information gain in this noise-free setting. And it turns out it's adaptive submodular.

I want to quickly show you why it is adaptive submodular. What you need to show is that the value of some test X monotonically decreases as we gather more and more information. Suppose initially we have some prior probability mass for the hypotheses on the left — the three on the left — call it B naught, B for blue; and on the right we have some prior probability mass for the green hypotheses, G naught. It's not hard to show that the initial expected marginal benefit of this test X can be calculated as two times G naught times B naught, divided by G naught plus B naught. I won't go through this in detail, but it's really easy to show. Now suppose we run some tests and gather some information, so we rule out some of the hypotheses; both the blue and the green mass decrease. Now we can look at the expected marginal benefit of this test X after we've seen those observations, and that turns out to be two times G1 times B1, divided by G1 plus B1. And it's fairly easy to show that whenever B naught is greater than or equal to B1, and G naught is greater than or equal to G1 — which is always the case when you cut away hypothesis mass — then the final marginal benefit has to be less than or equal to the initial marginal benefit, which proves adaptive submodularity. So the proof fits on one slide.

That means the greedy algorithm for generalized binary search is near-optimal. Of course, that's not a new insight; there has been a lot of work on this problem, and the guarantees have been improved over time. The currently best known approximation ratio for this optimal decision tree problem, noise-free, is four times log of one over p-min, where p-min is the smallest probability among the hypotheses. And it turns out that using the insight that the objective function is adaptive submodular, you can improve this bound and get rid of the factor of four. So the adaptive submodularity analysis is tighter than these existing analyses. But what I think is more interesting is that the guarantee for the adaptive greedy algorithm follows as a simple consequence of the fact that the objective function is adaptive submodular.
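The one-slide argument, spelled out slightly more explicitly in my own notation: suppose test X splits the surviving hypotheses into a group of total mass B (eliminated by one outcome) and a group of total mass G (eliminated by the other outcome). Then

```latex
% Expected mass eliminated by test X, before any observations:
\Delta(X) \;=\; \frac{B_0}{G_0+B_0}\,G_0 \;+\; \frac{G_0}{G_0+B_0}\,B_0
         \;=\; \frac{2\,G_0 B_0}{G_0 + B_0}

% The same quantity after observations have shrunk the masses to G_1 \le G_0,\; B_1 \le B_0:
\Delta(X \mid \text{observations}) \;=\; \frac{2\,G_1 B_1}{G_1 + B_1}

% Since g(G,B) = 2GB/(G+B) is nondecreasing in each argument
% (\partial g/\partial G = 2B^2/(G+B)^2 \ge 0), shrinking the masses can only shrink the benefit:
\frac{2\,G_1 B_1}{G_1 + B_1} \;\le\; \frac{2\,G_0 B_0}{G_0 + B_0}
```

which is exactly the adaptive submodularity condition for this objective.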
All these existing algorithms had to set up machinery specifically for analyzing this particular problem — there's been a lot of work on analyzing adaptive set cover, analyzing these active learning problems, analyzing the [indiscernible] problem and so on. It turns out the reason they work is that the objective function is adaptive submodular, and you don't lose anything by the abstraction — you actually get better bounds.

>>: So how does the adaptive greedy compare to naive myopic, just picking the next best?

>> Andreas Krause: Let me get to that.

>>: Okay.

>> Andreas Krause: Yes?

>>: My question is about the domain you're working on. So are you assuming that each test you make eliminates some possible --

>> Andreas Krause: Yes.

>>: At least one?

>> Andreas Krause: Yes.

>>: But there could be some cases where, like, you learn something but it doesn't really eliminate anything.

>> Andreas Krause: That's what I'm going to get at.

>>: Okay.

>> Andreas Krause: So far I'm talking about the classical setting — the optimal decision tree problem — which assumes that the tests are noise-free. That means that every observation rules out some hypotheses, because you multiply the prior by a likelihood that is zero or one.

>>: But there can also be cases where a combination of tests eliminates.

>> Andreas Krause: Of course, of course — that can happen. That is all modeled here. Good. So this is the noise-free case. In practice, there's always noise: if you have sensors, you always have sensor noise, and these medical tests can have false positives, false negatives and so on. Unfortunately, all these results for the noise-free case don't carry over.

>>: Is it really active learning when you've got a fixed model that you're using to do diagnosis, or are you using active learning to [indiscernible]?

>> Andreas Krause: It depends on who you ask. You could call it sequential Bayesian experimental design, you could call it active learning, you could call it adaptive value of information.

>>: I'm not trying to split hairs, but some of us reserve that term for how you're going to use your methods to actually choose new data that's unlabeled, for example.

>> Andreas Krause: Yeah, it turns out you can easily cast pool-based active learning from that perspective — we can talk about that offline — but it's the same model, it's just using --

>>: But I'm suggesting you might want to distinguish how you describe the application.

>> Andreas Krause: Okay. So in the paper we actually talk about the learning problem; here I just want to tell you about the value of information.

>>: I haven't personally ever used active learning to describe the task of diagnosis with a fixed model.

>> Andreas Krause: So it's called active --

>>: Call it diagnosis.

>> Andreas Krause: Okay, diagnosis. Good. So, diagnosis. The problem now is that if you have noise, then exactly what you mentioned can happen: the tests no longer eliminate diseases, they only make them less likely. And it turns out that breaks all the analyses, and it's not even clear what the right optimization problem is anymore. Before, you had the task of eliminating all but the correct hypothesis. Now, intuitively, what you want to do is gather enough information to make the right decision. Here's one way to make this precise. Suppose I run all the tests, and I get to see the outcome of every single test. Then I still have some uncertainty about what the true action is.
The posterior distribution may still have probability mass on several hypotheses, so the best I can really hope to do is take the action that maximizes my expected utility; let's call that A star. Now I can ask, do I really have to run all these tests? How can I gather just enough information to prove to myself that I'm still going to make the right decision? How can I test cheaply while guaranteeing that after stopping, I'm going to choose the right action? This is a natural generalization of the noise-free optimization problem.

Now we can try to understand how some of the existing approaches work on this problem. For example, one natural guess would be to use generalized binary search, or we could try to maximize information gain, or myopically maximize the value of information, and so on. It turns out that none of those is adaptive submodular if there's noise. That by itself wouldn't rule out that they work, but they can actually do badly empirically, as I'll show you later, and you can theoretically prove that they can incur a cost of about n over log n times the cost of the optimal policy. So that means we have to look for a new criterion.

Here's our proposal. The idea — and it's an idea common to a lot of [indiscernible] problems — is that you replace the noisy problem with a noiseless problem, essentially by introducing slack. What we can do is create noisy copies of our hypotheses and annotate each noisy copy with the outcomes of all the tests. So suppose, in the case of this green disease here, the second and third tests always come out zero, but the first test could come out either zero or one — maybe zero is more likely than one. In this case, we make two copies of the green disease, one over here and one over here, annotated with different vectors of test outcomes. The same for this orange disease: say the second and third tests come out zero and one, but the first one could be either zero or one — this is just for illustration. So now we have reduced the noisy case to the noiseless case, because each of these noisy hypotheses deterministically determines, by construction, the outcomes of all the tests, and we could run generalized binary search on this problem. Of course, the big issue is that these noisy hypotheses encode a lot more information than we need, because all we really need to do is distinguish between noisy hypotheses that lead to different decisions. So what we can do is take all those annotated noisy hypotheses and group them into equivalence classes based on which action we would take. And now we only need to distinguish between these equivalence classes rather than between the individual elements within them. One way to do this is to build a graph: we introduce an edge between any pair of noisy hypotheses in different equivalence classes, and the weight of each edge we choose as the product of the individual probabilities of the two hypotheses. So if you have two very likely hypotheses, like this green one and this red one here, then the edge has heavy weight; but if you look at these two hypotheses over here, the edge has very little weight.
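A small sketch of that construction (placeholder names again, not code from the paper): `action_of(h)` maps a fully annotated noisy hypothesis to the decision you would make if it were true, and `prob_of(h)` is its probability.

```python
from itertools import combinations

def build_edge_weights(noisy_hypotheses, action_of, prob_of):
    """One node per annotated noisy hypothesis; an edge between every pair that
    lies in different equivalence classes (i.e. would lead to different decisions),
    weighted by the product of the two probabilities."""
    return {
        (h1, h2): prob_of(h1) * prob_of(h2)
        for h1, h2 in combinations(noisy_hypotheses, 2)
        if action_of(h1) != action_of(h2)
    }

def mass_of_cut_edges(edges, consistent):
    """The quantity the talk turns into the objective next: total weight of edges
    already cut, where an edge is cut once either endpoint is ruled out by the
    observations so far (`consistent(h)` says whether h is still possible)."""
    return sum(w for (h1, h2), w in edges.items()
               if not (consistent(h1) and consistent(h2)))
```

Observing a test outcome removes every noisy hypothesis inconsistent with it, and with it all adjacent edges — which is exactly how progress is measured here.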
And now suppose we see the outcome of one test — say we see X1 equals 1. In this case, we eliminate some of those noisy hypotheses, and of course we can also get rid of all their adjacent edges. So every test now eliminates not nodes in this graph, but edges, and these edges basically measure our progress in being able to distinguish between the equivalence classes. Also notice that any optimal policy has to cut all the edges in this graph, because as long as there's at least one edge left, there's some positive probability of confusing those two equivalence classes. So we have to get rid of all the edges, and we can define the objective function as just the total mass of the edges cut under a particular observation. It turns out that's an adaptive submodular function, and that means the greedy algorithm gives us a near-optimal policy for cutting all these edges: the cost of the greedy policy is at most a logarithmic factor more than that of the optimal policy, where the factor depends on the probabilities of the hypotheses. This is the first approximation algorithm for non-myopic value of information in general graphical models. And the idea is that you want to solve this non-myopic problem, but you define an alternative, substitute objective: you don't directly optimize the value of information, but some proxy objective function which turns out to be adaptive submodular and somehow guides you in the right direction. So now for an application -- yes?

>>: I have a question on this. It seems like the cost of computing the greedy choice is going to be tremendously high.

>> Andreas Krause: Very good point. In practice, you would never actually implement the algorithm exactly as stated, but it turns out that you only need to estimate the amount of mass eliminated by the tests, and you can estimate that using sampling — a rejection sampling approach for estimating how much mass you eliminate. That gives a very efficient approximation to this objective function which actually works really well in practice. Good.

So now for an actual application. I've started to collaborate with Colin Camerer at Caltech, who is a neuroeconomist, a behavioral economist. One interesting paradigm they study, in order to understand how people make decisions under uncertainty, is called the Iowa gambling task. In that task, subjects are presented with two decks of cards that they can flip through. The cards have different values, so you could either win, lose, or gain nothing. You flip through the cards in both decks and estimate how likely you are to win, lose or gain nothing with each of them. After you've gathered some information, you have to decide on one of the decks, draw a card, and get paid based on that draw. In the setting here, you get an equal number of trials with both decks to gather information about the probabilities. So in some sense, each deck encodes a particular probability distribution over rewards. Suppose we have these two different distributions: in the first setting, you win ten dollars with 70% chance and lose ten dollars with 30% chance.
In the second setting, you win ten dollars with 30% chance and gain or lose nothing with 70% chance. Who would prefer the left gamble?

>>: How many draws do I get?

>> Andreas Krause: One.

>>: Only one trial?

>> Andreas Krause: Yes. Who would prefer the second one? Okay. So it turns out there is some heterogeneity. The one with the higher expected value is the first one, of course — four dollars in expectation versus three for the other one. On the other hand, with the right one you can only win. So there are different competing hypotheses about how people make these decisions: for example, the hypothesis that people just maximize expected utility; prospect theory, which basically says that people may weigh losses more strongly than gains; and portfolio optimization, which says people weigh expected value, variance versus skewness — other moments of the distribution — differently. These are basically different ways of looking at features of the probability distributions and turning them into utilities. An interesting question in behavioral economics is understanding the variability in the population across these different theories: does everybody behave the same, or what makes people behave in a certain way in what kinds of situations?

Now, every test requires actually having a subject run through this setting — you have to come up with a pair of these distributions, these decks of cards, and gather the responses — and that is expensive. So we'd like to gather data as efficiently as possible, and you can cast this as a Bayesian experimental design problem. You have one latent variable, the theory we're trying to identify — prospect theory, expected value and so on — and this theory has parameters that we also don't know. All the tests we can run, the observations X1 through XN, are pairs of gambles: any test we can run is a particular pair of those gambles, those decks of cards we can show, and the different theories and parameters give us different utilities for each of those gambles. We then model the user's pick as a kind of noisy indicator of the perceived difference between the utilities of the two gambles — a softmax function that is used a lot in behavioral economics. That's the observation model, and now we can try to figure out how we should test in order to identify the truth.

Now you can run different optimization algorithms to solve this. Here are results based on simulations — we sample from the model and compare the different criteria, so we can get lots of trials. The x-axis is the number of tests, the y-axis is the accuracy of identifying the correct hypothesis. Here something interesting happens: it turns out that random does fairly well, which is something you see quite a bit in active learning, and some of these optimized criteria actually do worse than random on this application. I can speculate offline about why I think this happens, but of course it's related to the fact that they optimize myopically — you don't do any look-ahead.
For example, what happens with uncertainty sampling is easy to understand: uncertainty sampling just picks the test whose outcome you're most unsure about. But for the task of distinguishing these different theories, that is quite problematic, because if you have a particular pair of gambles where the utilities under two different theories are equal, that maximizes the uncertainty about the outcome, but running that test doesn't tell you anything about the true theory. So in this case, outcome uncertainty and the amount of information gathered about the theory are negatively correlated. That's one reason why uncertainty sampling is really bad, and there are other reasons for the other objectives. Information gain, the classical experimental design criterion, does better than random, but the adaptive submodular criterion actually outperforms information gain on this particular task.

>>: So basically the output of this adaptive submodular approach would be a sequence of studies in humans, choosing these [inaudible].

>> Andreas Krause: Basically, yeah. It gives you an algorithm, a decision rule, for generating this tree.

>>: Right. But you wouldn't actually write down the tree --

>> Andreas Krause: Exactly, right, exactly. You would only expand the tree on the fly, and that means the algorithm has to be very efficient. To show that it's efficient enough, we actually used it in some human subject experiments. This is very preliminary, but we've started to run about 11 subjects on this trial. The x-axis is the number of tests, the y-axis is the probability of the class — the type of decision made at the end — and you can see that people actually behave differently. A fairly large fraction of people chose according to maximum expected utility, but there are also two subjects whose behavior follows prospect theory, and there are some examples of other theories as well. The question is whether we can estimate the heterogeneity in the population, and figure out how it depends on certain features of the environment, and so on.

>>: [inaudible] decision science course [inaudible].

>> Andreas Krause: Yeah, this is the Caltech undergrad population, which may not be an unbiased sample. So this is very, very preliminary, and the main intention here is not to draw any conclusions about these theories — it's only to show that this algorithm is actually practical to run in real time.

Before I conclude, I just want to tell you about one other project that's currently going on, and how these ideas connect. One problem we're interested in is using community-held sensors to sense and respond to crises. For example, you could use advanced meters plugged into the smart grid to try to detect cascading failures, or personal navigation devices to detect traffic jams. One project that we have at Caltech, which is becoming quite large now, is using accelerometers in mobile phones to detect earthquakes — building a community seismic network. The idea is that earthquakes basically produce two waves: the primary wave, the P wave, which is a sound wave, a compression wave, that travels a couple of kilometers per second; and the S wave, the secondary wave — the shear wave, the one that does the main damage — which travels more slowly than the P wave.
So if you can detect this P wave, it actually gives you some time to react and possibly do early warning. Just to give you some idea, this is roughly the resolution of the current seismic network in the L.A. area. If you could increase the resolution — if you were able to get data from many more sensors — you would be able to estimate much more finely how the seismic waves propagate over time. One way to gather data is to use shake tables: you can put phones on these tables, play back recorded earthquakes, and see how the sensors respond to these recordings. One big challenge in this problem is that these phones produce a lot of data: if you take about a million phones, they would produce about 30 terabytes of data each day, and AT&T or Sprint wouldn't be very happy to constantly transmit that much data over the network. So you really have to make decisions about what you should send. Of course, that has decision-theoretic implications — you have to make a decision, should you raise an alarm or not — but now you have to solve --

>>: [inaudible]

>> Andreas Krause: Once again?

>>: How long --

>> Andreas Krause: It depends on the scenario. For some of the scenarios considered, you can get about two minutes, which doesn't sound like a lot, but it can actually help you stop elevators, give warnings to cars, and protect infrastructure and so on. So there is potential for responding to these events. One approach we take is this: we have accelerometer data recorded from people — this may be a person walking — and we look at what seismic events look like played back on the phone, superimpose them to get this overlaid data, and then train a model of what normal activity looks like on the usual data, using density estimation techniques, and look at the likelihood of the observations, for example during an earthquake. You can see that bigger events you may even be able to detect during some activities. Typically, you would only try to detect earthquakes while the phone is lying still, but for the major ones you may be able to figure out that something is going on even while people are using their phones. But now, of course, the question is how you calibrate this network.

>>: [inaudible]

>> Andreas Krause: Yeah, yeah, exactly. But -- yeah?

>>: [indiscernible] is the probability of seeing that things are normal, basically.

>> Andreas Krause: Right, exactly. But now, of course, the question is how you choose this threshold. That's actually a really interesting problem, because you have to choose it locally, but in such a way that you make globally efficient decisions — you have to somehow calibrate that. We're currently looking at using these ideas of adaptive and online optimization of submodular functions to calibrate this network; this is very much work in progress, just a quick idea of what's currently going on.

So, to conclude: I told you about adaptive submodularity, which I think is a useful generalization of submodularity to adaptive problems. A lot of the useful properties we love for classical submodular functions extend to the adaptive setting, like guarantees for the greedy algorithm, lazy evaluations, and data-dependent bounds.
There's actually a number of applications that can be shown to be adaptive submodular, like stochastic set cover, viral marketing, active learning, and Bayesian experimental design. And what is nice is that it provides a unified view of the analysis for these different problems, so you can recover a number of results known in the literature and get some extensions to them, but it also leads to new algorithms, like the Bayesian experimental design I mentioned to you at the end. That's it.
>>: How sensitive are your results to knowing that [indiscernible]?
>> Andreas Krause: Good question. So it turns out that for the maximization case, it's difficult to say something. But for the coverage case, for example, if you would like to achieve a certain amount of market coverage in the viral marketing setting, or in the active learning setting, you get guarantees even against adversarially chosen realizations. So even if you start with a distribution, a uniform distribution, you still get guarantees against worst-case chosen realizations.
>>: Can you quantify the robustness?
>> Andreas Krause: That's an interesting question for future work. So this is one end of the spectrum: it's completely adversarial and you can still say something. But, of course, for what happens in between, exactly matching your prior versus being completely off, there's a lot of possible room for improvement. Any other questions?
>>: What's next?
>> Andreas Krause: I told you some of the problems that are next. So one -- in the theory?
>>: In terms of the theory, yeah.
>> Andreas Krause: So I think one interesting question is, in some sense you can cast all these information gathering problems as general [indiscernible] processes, right, but just using general purpose black box algorithms for that is, I think, really challenging, because the state space is exponential in the number of observations you can make, and the belief space, which is distributions over that state space, is doubly exponential in the number of tests you want to run, so it's extremely intractable. And nevertheless, you can get approximation guarantees for this class of [indiscernible]. One question is, can someone push this further, can you say something about more general planning problems. So this is one --
>>: [indiscernible] approximation, for example, adaptive to [indiscernible].
>> Andreas Krause: Yes, exactly, right. So can you come up with approximation algorithms for certain types of POMDPs? So this is an interesting problem. But also there are the robustness issues which were just raised and are a really interesting discussion. And also just getting a better idea of which problems are adaptive submodular. I mean, this is a new concept, right? So we don't know how general this is.
>>: So you have a domain that you don't know much about, and you are thinking about applying these adaptive submodular algorithms to it.
>> Andreas Krause: Yes.
>>: So what is the first clue, or how do you make the decision to say whether this domain is submodular or not?
>> Andreas Krause: So in general, trying to prove submodularity can be a bit tricky. But in some cases it's not so hard, right. So I showed you the proof on one slide for the active learning case, right. So, I mean, this is a really new concept, and that means that I think there's a number of low hanging fruit, problems that are out there that people just haven't looked at.
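For reference, the condition one would try to verify can be sketched as follows; this restates the definition from the Golovin and Krause paper in compressed notation, so the symbols here are a paraphrase rather than the talk's own. For a partial realization $\psi$ (the set of test-outcome pairs observed so far) and the random realization $\Phi$, the conditional expected marginal benefit of a test $e$ is

    \Delta(e \mid \psi) \;=\; \mathbb{E}\bigl[\, f(\mathrm{dom}(\psi)\cup\{e\},\,\Phi) - f(\mathrm{dom}(\psi),\,\Phi) \,\bigm|\, \Phi \sim \psi \,\bigr],

and $f$ is adaptive monotone if $\Delta(e \mid \psi) \ge 0$, and adaptive submodular if $\Delta(e \mid \psi) \ge \Delta(e \mid \psi')$ whenever $\psi$ is a subrealization of $\psi'$; that is, observing more can only decrease the expected gain of any remaining test.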
But so in general, I'm not sure what the right toolboxes are for proving adaptive submodularity, but there are some operations that do preserve adaptive submodularity, so you can build more complicated adaptive submodular functions from simple adaptive submodular functions, and in the paper we discuss some of those. Maybe [indiscernible].
>>: What I thought would be [indiscernible] design [indiscernible] effects and this property and diagnosis in the future, with no guarantees. Actually design machines, design machines that get this property, as part of the design process, for maintenance and diagnosis. But with guarantees.
>> Andreas Krause: That's interesting, can someone come up with that, yeah.
>>: Kind of a wild idea.
>> Andreas Krause: Yeah, yeah, yeah, but there may be something there. We can talk about this offline, yeah?
>>: Are there any additional constraints on submodularity that give you an even more optimal solution, strictly submodular?
>> Andreas Krause: Very good question. So for the classical notion of set functions, you can quantify the performance guarantee of the greedy algorithm in terms of a quantity called the curvature, which basically measures how strong the diminishing returns property is, how quickly the function flattens out, and there's a way of quantifying that. So the extreme example is linear functions, right, which don't have any diminishing returns; it turns out the greedy algorithm is optimal for linear functions. Okay. And so between linear, where greedy is optimal, and the one minus one over e case, there's some room, and you can quantify where you are. We don't know what you can do for the adaptive submodular case, so that's an interesting direction.
>>: And also, the other thing I was going to ask: in the submodular case, they show that the greedy bound is the best possible. Do you have the same result for the adaptive case?
>> Andreas Krause: Well, it's a strict generalization of the classical case, right, so unless you make further restricting assumptions, you can't really do better than that. Any submodular optimization problem is an adaptive submodular problem where all the observations are deterministic. In other words, all the hardness results for submodular optimization carry over. Since the lower bounds and the upper bound guarantees for the greedy algorithm match up, there's not really room for doing better. Yes?
>>: Any idea how the quality of the solutions degrades with failure of the adaptive submodularity assumption? Do you have something like epsilon-submodularity, where the inequality only holds up to an epsilon on one side?
>> Andreas Krause: Basically, if you implement the greedy algorithm so that it's only off by a factor of alpha, then all the guarantees still -- there's a bunch of results known for the classical case, and they still carry over to the adaptive case. So either if the objective function you try to optimize is close to a submodular function, or if you can't implement the greedy algorithm exactly, you can still say something.
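For context, the classical, non-adaptive guarantees being alluded to can be written out as follows; these are standard bounds for greedy maximization of a monotone submodular function under a cardinality constraint, stated here as background rather than as results from the talk:

    f(S_{\mathrm{greedy}}) \;\ge\; \frac{1}{\kappa}\bigl(1 - e^{-\kappa}\bigr)\, f(S^{*}),
    \qquad
    f(S^{\alpha}_{\mathrm{greedy}}) \;\ge\; \bigl(1 - e^{-\alpha}\bigr)\, f(S^{*}).

Here $\kappa \in [0,1]$ is the total curvature, so $\kappa = 0$ (a linear function) makes greedy optimal and $\kappa = 1$ recovers the usual $1 - 1/e$ factor, and $S^{\alpha}_{\mathrm{greedy}}$ is the output of an approximate greedy algorithm each of whose steps achieves at least an $\alpha$ fraction of the best available marginal gain. The talk's claim is that analogues of the second kind of statement carry over to the adaptive setting.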
>>: I guess it might also be an empirical question in addition to a theoretical question. Think about boosting, for example, which makes this strong assumption of weak learnability, and that's not satisfied by most of the learners we boost, yet boosting in practice works extremely well over a wide range of problems. I'm wondering if the same thing might be true here. That, in fact, adaptive submodularity is great when you can get it, but you actually do quite well using these methods even when it's not really quite true.
>> Andreas Krause: That's very good. So it's really difficult to answer that theoretically. So --
>>: As it is for boosting.
>> Andreas Krause: As it is for boosting, yes.
>>: Seeing adaptive submodularity as risk taking.
>> Andreas Krause: Right. So an anecdotal example: I told you about the [indiscernible] prediction error, the variance reduction. If you say you want to monitor temperature in the building, right, or in the lake, then you may care about the expected reduction in mean squared prediction error. That objective function is not always submodular. There are counterexamples, and it depends a bit on -- so for Gaussians, it depends a bit on some of the structure of the [indiscernible], and there are stability results there, and so there's some indication that for some applications it is submodular. But these conditions are often violated in practice, even though the algorithms work really well, work better than the state-of-the-art existing techniques. So, I mean, I don't know how the same story plays out in the adaptive case, but at least in the classical case, there's hope in this direction.
>>: So then a related question: do you think there might be cases where relaxing the adaptive submodularity requirement might actually yield improved solutions? I mean, for example, this does happen in boosting, right? So typically, if we go from something that's provably weak to something that's not so weak, we often actually get better results with boosting. I'm wondering if there could be a similar thing here, where adaptive submodularity is great from a theoretical point of view, but in fact, in practice, we do somewhat better if we relax it in a certain way.
>> Andreas Krause: But what do you mean -- so what is [indiscernible] a suitable candidate algorithm? One issue is that people love greedy algorithms and they often work really well, and the question is why. And that may be an answer to at least some of those problems, right? So I think one of the answers could be that in some applications that maybe lack submodularity but still do well, it's because the function is close to adaptive submodular, or at least in the small space of solutions that the greedy algorithm searches, which somehow contains sufficiently many good solutions, the submodularity is satisfied, and that gives an indication of why you should have [inaudible].
>> Eric Horvitz: Thanks very much.