>> Eric Horvitz: Okay. We're honored today to have Andreas Krause visiting us.
Andreas, we know very well. He was one of our fabulous interns over the years. We
had him here a couple years ago now, three years ago maybe, three summers ago. Time
really flies. Andreas is an assistant professor at Caltech, and he finished his
Ph.D. at CMU in 2008. He has a number of awards, including an NSF CAREER award.
And in sitting with him, hearing about his latest results, I have to say that I'm
very excited about the direction he's gone in generalizing the work that he's so
well known for during his dissertation that he did for that era of his life.
This new approach on adaptive submodularity really gets to the notion of how we apply
some of those interesting submodularity results to the actual case where information
is revealed dynamically over time which is, indeed, the general case. So go for
it.
>> Andreas Krause: Thank you so much, Eric, for inviting me and for this kind
introduction. Welcome everyone to my talk. It's always a pleasure to be here at
Microsoft research.
So today, I'm going to tell you about some brand new results in the area of active
learning and stochastic or interactive optimization, using a generalization of the
notion of submodular functions to adaptive policies. I'll tell you about this.
This is mainly based on joint work with Daniel Golovin, who is a post-doc at Caltech, and the last part of the talk is also joint work with Dave Ray, a grad student at Caltech. Okay. Before I get to the new stuff, I'll bring everyone up to speed by giving a quick introduction to [indiscernible] and some of the applications I've been working on in the past.
So I've been really interested in problems of optimized information gathering. For
example, we've been working with roboticists at USC and UCLA on using robotic boats
in order to monitor rivers and lakes for pollution. Also been collaborating with
civil engineers back at Carnegie Mellon on placing sensors in drinking water
distribution networks for detecting contaminations. I've also been looking at sensor placement problems in activity recognition and intelligent buildings. But also
looking at more general notions of what sensing means and applying some of the ideas
to information gathering problems and information retrieval problems on the
[indiscernible].
So all these problems, our goal is to learn something about the state of the world.
Such as water quality in a particular geographic region. And we can do so by placing sensors or making measurements, conducting experiments, which are typically expensive, so we can only make a limited number. And the key question cutting across all these problems is how we can most cost-effectively get the most useful information.
And this is really, in essence, a fundamental challenge in machine learning and AI,
the problem of how can we automate notions of curiosity and serendipity.
Since this is such a fundamental problem, it's been studied in a lot of areas
including experimental design, operations research, AI, machine learning,
spatial statistics, robotics and sensor networks.
Most of the existing techniques that are out there can be broadly grouped into two
categories. There's heuristic algorithms that work really well in some
applications but they're not theoretically well understood and can potentially do
arbitrarily badly and for some applications it could be problematic. There's also
algorithms that have the more ambitious goal of trying to find the optimal solutions, and they include techniques such as mixed integer programming and solving partially observable Markov decision processes. But these techniques are typically very
difficult to scale to large problems.
So what I'm really interested in is developing algorithms that both have strong theoretical guarantees and scale to really large problems. And rather than just working on the theoretical side, I'd really like to apply the results to actual applications.
So as a running example, let's think about the example of deciding where to put a
bunch of sensors in a building to monitor temperature; for example, to detect whether
there's a fire or not, okay? So in general, we take a probabilistic approach. One way to do this is to, for example, have a random variable at every location S that models the temperature at this particular location. And they're spatially correlated, so we have a joint distribution that models this correlation among the temperatures, which is typically informed by some physical understanding of the phenomenon.
Now, we can't observe the temperature directly, but we can make noisy observations
by putting out sensors, okay? So if you have a sensor at location S, then you would
get the sensor value YS, which is some noisy copy of the true underlying temperature,
XS. Okay? And so we are in the Bayesian setting, where we have a prior distribution over the temperatures modeling the correlation, along with a likelihood function that characterizes the assumptions about the noise of the sensors.
Okay? Once you have such a model, you can start talking about utility of making
observations. So suppose we start with the uniform distribution and we assume, for
example, that the temperature is cold, normal or hot with equal probability at all
locations. Now, if a sensor placed at location 1 tells us, through observation Y1, that there's high temperature there, we can do Bayesian inference to calculate the posterior distribution, which may indicate that it's likely that at location X1 there's higher than usual temperature. And also, at locations close by, the temperature is probably higher than usual, through the correlation.
Now, typically, we have to make decisions or take actions based on this posterior
distribution, and therefore would prefer posterior distributions that help us make
these decisions more effectively. So for now, we will just assume that we have some
reward function that takes this posterior distribution and tells us how useful it
is. And I'll give you examples as we go on in this talk.
If you make a different observation, say cold temperature at location 3, then we
get a different posterior distribution, which gives us a different reward. There's
various different examples of those reward functions that have been considered. One
is if you're in a situation where you want to decide whether there's a fire or not,
then we have to ask the question, should we raise an alarm? All right. So we have
two actions. We could raise an alarm or not raise an alarm, and the world could
be in two states.
There could be a fire or there could not be a fire, and if there is a fire and we
don't raise an alarm, bad things happen. Similarly, if there is no fire, but we
do raise an alarm, you have a false positive. If you have too many of those, then
people won't believe our system anymore, and eventually bad things will happen too,
okay?
Now, if we knew the correct state the world is in, we could just take the optimal action. But the problem is that we don't know that. We only have a belief about what the state of the world is, the posterior distribution, and the best thing we can really do is take the action that maximizes the expected utility.
But this gives us a way of quantifying the usefulness of a particular posterior distribution: we can just use the maximum expected utility when acting optimally based on our posterior distribution, and that's called the decision-theoretic value of information, which has been an extremely useful and powerful concept all through AI and decision theory.
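As a rough sketch of what this maximum-expected-utility reward looks like in the fire/alarm example, here is one way to compute it; the posterior and the utility numbers below are made up for illustration and are not from the talk.

```python
# Sketch: decision-theoretic reward of a posterior = max over actions of expected utility.
# The posterior and utility table are illustrative values, not the speaker's numbers.

def meu_reward(posterior, utility):
    """Max over actions of the expected utility under the posterior."""
    return max(
        sum(posterior[state] * utility[(action, state)] for state in posterior)
        for action in {a for (a, _) in utility}
    )

posterior = {"fire": 0.2, "no_fire": 0.8}                      # belief after some observations
utility = {("alarm", "fire"): 0, ("alarm", "no_fire"): -10,    # false positive is mildly costly
           ("no_alarm", "fire"): -1000, ("no_alarm", "no_fire"): 0}

print(meu_reward(posterior, utility))   # -8: raising the alarm is the best action here
```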
Now, in some applications, we may not a priori have a utility function, so we may just want posterior distributions that are as certain as possible.
Okay. One way of quantifying this notion is the notion of entropy. If you think
about spatial prediction problems, if you use robots to study lakes or try to figure
out what the temperature is everywhere in this building, you may think about the
mean squared prediction error based on our observations. So these are all ways of taking a posterior distribution and turning it into a utility. But there's other
objective functions that are useful and have been used in practice.
Now, we have these reward functions, and now we can use these in order to quantify
how useful any given set of sensor locations would be. The issue is that, a priori, we don't know, when we place a sensor somewhere, what it is going to tell us. Okay? So the only thing we can really do is average over the observations that these sensors are likely to make, weighted by the reward we would get under those particular
observations, okay? So this gives us an expected value of information for any set
of sensors that we may want to place.
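A minimal sketch of this expected value of information, assuming a toy model with one binary hidden state and conditionally independent, identical noisy binary sensors; all numbers and the simple "confidence" reward are illustrative.

```python
from itertools import product
from math import prod

def posterior(prior, likelihood, readings):
    """Bayes' rule for one discrete hidden state, given a tuple of sensor readings."""
    post = {s: prior[s] * prod(likelihood[s][y] for y in readings) for s in prior}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

def expected_voi(prior, likelihood, num_sensors, reward):
    """F(A) = sum over outcomes y of P(y) * reward(posterior given y)."""
    total = 0.0
    for readings in product([0, 1], repeat=num_sensors):
        p_y = sum(prior[s] * prod(likelihood[s][y] for y in readings) for s in prior)
        total += p_y * reward(posterior(prior, likelihood, readings))
    return total

prior = {"fire": 0.1, "no_fire": 0.9}
likelihood = {"fire": {1: 0.8, 0: 0.2},        # P(sensor reads 1 | state), same for every sensor
              "no_fire": {1: 0.1, 0: 0.9}}
certainty = lambda post: max(post.values())    # a simple reward: confidence in the MAP state
print(expected_voi(prior, likelihood, num_sensors=2, reward=certainty))
```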
And now this is an objective function that we can try to optimize. And the simplest
question we can ask is, well, what's the best set of K locations to place sensors?
Now, the first thing we did is we actually tried to understand for what kind of
problems can we solve this problem exactly? And it turns out for some problems, you actually can solve it exactly, and this depends on the structure of the underlying model. So it turns out that if the underlying graphical model is a chain, like a Markov chain (for example, a conditional random field or hidden Markov model where you'd like to label the hidden states), then you can actually find the optimal value of information efficiently.
Okay? But it turns out that if you just try to slightly generalize that from chains
to trees, this problem suddenly becomes really, really hard, right? And as soon
as you start talking about spatial correlations, you have much more complex
dependencies than just simple chains. So the problem suddenly becomes really hard, NP^PP-complete. We can't expect to find the optimal solution.
So instead of trying to find the optimal solution, let's try to at least find a good
solution, a good approximate solution.
So probably the simplest approximate algorithm we can think about is the greedy
algorithm, which is used a lot. So we would start having no sensors, and we would iteratively place the sensor at the location where it increases our value the most. So you place one sensor at a time, see how the objective function increases, and you stop after you've placed all K sensors, but you never change any decisions you've already made.
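A minimal sketch of that greedy procedure for a toy coverage objective; the locations and their covered regions below are made-up example data.

```python
# Non-adaptive greedy sensor placement for a coverage-style objective (toy data).

def coverage(selected, regions):
    """F(A): number of points covered by the chosen locations."""
    return len(set().union(*(regions[s] for s in selected)) if selected else set())

def greedy(regions, k):
    chosen = []
    for _ in range(k):
        # pick the location with the largest marginal gain, given what is already chosen
        best = max((s for s in regions if s not in chosen),
                   key=lambda s: coverage(chosen + [s], regions) - coverage(chosen, regions))
        chosen.append(best)
    return chosen

regions = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {6, 7}}
print(greedy(regions, k=2))   # [1, 3], covering six of the seven points
```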
And the question is, how well does the simple algorithm do? So one way of trying
to answer this is to run experiments. So we could take some temperature data, run
the greedy algorithm to maximize, for example, the improvement in information gain,
and if you have a small enough problem, we can actually find the optimal solution
through exhaustive enumeration, and it turns out that the greedy algorithm gets us really close to the optimal solution.
And, in fact, we see this in a number of different problems, so the question is: is there any justification for why the greedy algorithm should do well? Okay? And it turns
out that the key insight for analyzing this greedy algorithm is the following natural
notion of diminishing returns.
So suppose you have two placements, A and B. In A, you've placed two sensors, Y1
and Y2. In setting B, you have three more sensors, 3, 4 and 5. Now, let's think about the additional value that a new sensor would give us in either of those situations.
If you add the sensor S to the first set, we gain a lot of additional coverage, a
lot of additional information. Versus if you place the sensor to the same location
in the second deployment, you only get a little bit of additional information. We
can formalize this diminishing returns property using the [indiscernible] notion of submodular functions.
A set function F takes a set of locations and outputs a value. F is submodular if, whenever a set A is contained in a superset B and we consider adding a new element S to either of those sets, we gain more by adding this element to the small set than by adding it to the large set. So this exactly captures what's going on here, and in formulas it just means that F of A union S minus F of A is greater than or equal to F of B union S minus F of B.
Okay. So for the sake of notation, we just use the notation delta of S given A for the marginal benefit of adding this element S to the set A. Okay.
Why did I tell you about -- first of all, we can actually show that in the sensor
placement application, this information gain is, in fact, a submodular function.
And so why is this useful? Why did I tell you about this? Well, it turns out that
it's known that you can maximize submodular functions using the greedy algorithm and get guarantees about that. So whatever the greedy algorithm gives you, the placement of the greedy algorithm obtains a constant fraction, one minus one over e, of about 63% of the optimal value.
In fact, for information gain, that's the best possible ratio you can get among any efficient algorithm. So in some sense, it's a really good algorithm to use for this kind of problem.
Okay? And so in my dissertation research, we've been pushing this idea in a number
of different directions. And so basically, all these applications here can be cast
as the problem of maximizing some submodular objective function, subject to some
interesting constraints, working out algorithms and characterizing when these problems are submodular, and so on. But what all these have in common is that you
want to find a set of observations you want to make in advance before obtaining any
kind of measurements.
So for example, you'd like to decide on the locations you want to place your sensors
before you actually get to see any measurements. In that sense, these results are
results about non-adaptive optimization problems. They're non-adaptive with respect to the observations you would possibly make.
Now, there's a lot of really interesting information gathering problems where you want to be adaptive. So one example is medical diagnosis. Suppose you're a
veterinarian and you'd like to diagnose a sick puppy. So what you can do is you
can run some tests. For example, you could measure the puppy's heart rate. And
then depending on the outcome, whether the heart rate is elevated or not, decide on the
next test that you want to run. So for example, take an x-ray. Depending on how
that looks, we decide on the next test to run, take a blood sample and so on. What
you want to do is you want to diagnose this puppy as cheaply as possible, but all
the tests that you run can depend on the measurements you've done in the past. Okay?
So now you're interested in finding no longer a set of tests, a fixed set of tests,
but a policy, a decision tree that's adaptive to the observations we've already made.
Okay? Now, the issue is that now we're trying to optimize over those policies instead of over these sets. So you can't use the notion of submodular set functions in order to analyze these algorithms anymore. And so the question that we asked is: is
there some natural generalization of this notion of submodular functions to these
adaptive policies?
To give you some intuition for this problem, let's talk about a really simple
submodular optimization problem, one of the most natural ones, the set cover problem,
and just think about it in the context of sensor placement. So we have a bunch of
possible locations, 1 through N, and all these locations are associated with some
kind of sensing regions, right.
So if you have a sensor at location 1, then you're going to cover the green area, the set W1.
Okay. Now, a sensor placement is just a subset of the locations A, and the total
value of that placement is just a total area covered by the regions associated with
the elements you pick. Okay. So you have the function F that takes sets of
locations and outputs just a union of the elements containing all the sets that you
pick. Now, that's a simple example of a submodular function. Fairly easy to see.
And the set cover problem has very natural adaptive analog, okay?
So what you could think about is suppose you're in a setting where you don't know
about what these sensing regions are in advance. So if you put out the sensor, for
example a camera, then it could either observe all of the hallway, for example, or
there could be some obstacle. Doors could be open and so on, and the sensing region
gets reduced by that. Or the sensor could fail, and you don't get to see anything at all. And you don't know ahead of time, before you actually place the sensor.
Okay? Now in this situation, what you can think about is every sensor is actually
associated with a collection of sets, a collection of sensing regions, and a random variable; for example, for location 3, it's a random variable X3 that tells us which of those sets gets realized. Okay. And in this example, there's two sets, the yellow
set and the green set. And if variable X3 takes value one, then the yellow set gets
realized. If it takes value zero, the green set gets realized.
Okay? So the set that is picked depends now on the realization of those random
variables, and now you can think about trying to come up with adaptive policies where
you pick a location, then the set gets revealed to you, then you pick another
location, again the set gets revealed to you and so on. It's a very natural adaptive
analog.
And now you can define an objective -- so you can do this if you have a bunch of sensors, right, all these sensors. And now the value of the placement A in a particular world state, a joint realization of all those random variables, is just the area of the union of all those sets that are parameterized by these random variables. Okay? So it's
a very natural analog of the set cover problem that's been studied in the literature.
And more formally, what we're going to do is we have -- so the kind of optimization
problems we're going to study is settings where we have a collection of items, 1
through N. And with each of those items, we have some random variable associated with it, and you have some objective function F that takes a subset of items we pick and a world state, which is the joint realization of all the random variables (which sets get realized where), and tells us how useful that is.
Okay? And now what we can do is we can quantify the value of the policy. So the
situation is that, depending on which -- so the policy is some kind of decision tree, right. So depending on the test outcomes, you decide which item you pick next.
Now, you can -- so for every possible world state XV, the policy could pick a different set. So there's some set pi of XV that's picked if the world is in state XV. So the value of a policy is just the expected value of the sets that are
picked by the policy under the respective world state. And you average over all
states the world can be in.
Okay? So you can think about basically you have a decision tree, the realization
of the world tells you which path you take down this decision tree. At the end,
at the leaves, you get some value and look at the average value. Okay? And now
you can try to maximize all these policies. That's a well defined optimization
problem.
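As a small illustration of this definition, here is the expected value of a toy policy under a toy stochastic-coverage model, averaging f over all world states; the items, regions, and the simple policy are invented for illustration.

```python
from itertools import product

# Each item i, in world state x, covers the region regions[i][x[i]] (toy data).
regions = {1: {0: set(), 1: {"a", "b"}},        # item 1 either fails or covers {a, b}
           2: {0: {"b"}, 1: {"b", "c"}}}

def f(picked, x):
    """Area covered by the picked items under world state x."""
    return len(set().union(*(regions[i][x[i]] for i in picked)) if picked else set())

def policy(x):
    """A toy policy: pick item 1; if it failed (x[1] == 0), also pick item 2."""
    return [1] if x[1] == 1 else [1, 2]

# Average value over all world states, here independent fair coins for illustration.
states = [dict(zip(regions, bits)) for bits in product([0, 1], repeat=len(regions))]
print(sum(0.25 * f(policy(x), x) for x in states))   # 1.75
```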
Now, the issue is that there's a lot more policies than sets, and this is a strict generalization, because you could just set all the outcomes to one single value, in which case this problem reduces to the classical set function optimization problem. And so that already tells you the problem is hard, hard to approximate. There are very strong hardness results for this problem.
Okay? Now, since this is a hard problem, we can't expect to find the optimal solution
in general so let's try to find a good solution. All right. So what's the natural
algorithm we could try? Well, we could try to use some kind of adaptive variant
of the greedy algorithm.
So how would that work? So suppose we have already made some observation. So in
the sensor placement problem, it means we've seen the realization of some of the
sets. For example, sensor 1 and 3 have been realized with this green set, and now
we can look at not the marginal benefit but the expected marginal benefit of adding
new sensor, S2, conditioned on those observations. And we use the notation delta
S given XA to denote this expected marginal benefit conditioned on the particular
observation XA.
Yes?
>>: So you don't know whether you're going to get the green or the yellow when you
place S2. Do you know what the green and the yellow are?
>> Andreas Krause: You know what they are, and you have the distribution of how likely they are.
>>: Okay.
>> Andreas Krause: Okay? Good. And so now once you have these, these marginal
benefits, we can really easily implement an adaptive greedy algorithm that just starts with nothing selected, and then iteratively just adds the item that maximizes the expected marginal benefit conditioned on what you've seen.
Right. That's the greedy algorithm.
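Here is a minimal sketch of that adaptive greedy loop on a toy stochastic-coverage instance; the items, regions, and probabilities are invented, and a random draw stands in for the world revealing which set got realized.

```python
import random

# Toy stochastic coverage: each item covers one of several possible regions at random.
outcomes = {1: {frozenset("ab"): 0.5, frozenset(): 0.5},      # item 1 may fail entirely
            2: {frozenset("bc"): 1.0},
            3: {frozenset("cd"): 0.7, frozenset("c"): 0.3}}

def value(observed):
    """Area covered by the regions realized so far (observed: item -> region)."""
    return len(set().union(*observed.values()) if observed else set())

def expected_gain(item, observed):
    """Expected marginal benefit of `item`, conditioned on the outcomes seen so far."""
    return sum(p * (value({**observed, item: region}) - value(observed))
               for region, p in outcomes[item].items())

def adaptive_greedy(k):
    observed = {}
    for _ in range(k):
        remaining = [i for i in outcomes if i not in observed]
        best = max(remaining, key=lambda i: expected_gain(i, observed))
        region = random.choices(list(outcomes[best]), weights=outcomes[best].values())[0]
        observed[best] = region       # the world reveals which set was realized
    return observed

print(adaptive_greedy(k=2))
```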
Then once it picks this element, it observes it, does a Bayesian update based on the observation, and adds it to the set. And to see how this works in this sensor selection kind of problem: here you have these three sensors, and you want to select two of them. The one that has the highest expected marginal benefit, given nothing, is
sensor S2. So the greedy policy starts with sensor S2, okay?
Now that we've picked it, one of those sets gets realized. For example, the green
one gets realized, okay. Now, conditioned on that, I can look at the marginal benefit of S1 or S3, and since there's quite a lot of overlap of the green set with S3, I may want to pick S1. So conditioned on the green outcome, I pick this one. And once I've picked it, I get to see which set gets realized. Maybe this yellow one. Okay?
But if we rewind and instead of the green one, the yellow one gets realized, right,
then it may now be better to actually pick sensor S3 instead of S1. Right? So we
pick this one, and one of those gets realized, and that's our value, okay? So the
greedy algorithm now doesn't construct a set, but actually a policy. And the
question is, how does this policy compare against the optimal policy?
Okay. So we know that in the classical setting, where we want to pick sets, if the
objective function is submodular, then the greedy algorithm is going to give us good
solution. So the question is, is there some notion of adaptive version of
submodularity. And here's the generalization we came up with, and it's based on
these expected marginal benefits. We call an objective function F and some distribution adaptive submodular if these conditional expected marginal benefits are monotonically decreasing as we make more observations. What this basically means is that if you compare two situations, xA, where you've made some observations, and xB, where you've made more observations than in xA, then the expected marginal benefit of any new item S conditioned on xA has to be greater than or equal to the expected marginal benefit conditioned on xB. It's just a natural generalization of the set function case. Right. In the set function case, you just had delta of S given A greater than or equal to delta of S given B. Now you don't just condition on the set, but you condition on the set plus the resulting observations.
>>: [inaudible] how much more restrictive that is, I mean, to find that kind of
submodularity.
>> Andreas Krause: So that's a really good question. I'll give you some examples
arguing that this is a useful notion in a bunch of applications, okay? But, of
course, that's a really new concept, and I think there's a lot to be studied to somehow
say which problems satisfy this property.
>>: I mean, sitting here, I can imagine cases where it's --
>> Andreas Krause: [inaudible] it's not.
>>: Give easy examples where it's not. Talk about the [indiscernible].
>> Andreas Krause: [inaudible]. Good. And also, you need monotonicity, so we have a notion of adaptive monotonicity, which says the expected marginal benefits have to be positive: whenever I add elements, it increases my value in expectation, okay? And now what you can show is that if F is adaptive submodular and adaptive monotone with respect to this distribution P, then the nice result we saw about the greedy algorithm still carries over. The greedy algorithm obtains a constant fraction of 63% of the optimal value, okay?
So this still holds. In fact, a lot of other nice properties that classical submodular functions have still carry over to the adaptive setting. I'll give you some examples as we go on. So let's just see what this means for the adaptive set covering
problem that has been studied quite a bit. Both the maximization version, where you want to find a collection of sets that has maximum value, which has been studied by [indiscernible]. In this case, the greedy algorithm gets this [indiscernible] guarantee. But you can also think about some notion of coverage, where, for example, you want to find the cheapest policy that covers the entire building, all right, that always makes sure that all locations are covered. Okay. And that is a natural
generalization of the set cover problem.
And for that, it also gives you a guarantee, which is basically optimal, as there's a matching lower bound from classical set cover. So this is about set cover. Now let's talk
about -- so this is some theoretical results so you may not care about theorems,
but here are some practical results that may actually be even more useful from an
applications perspective. And it's the fact that you can use lazy evaluations to
run the greedy algorithm, and that's something that's been shown to be extremely
useful in the classical set function case and also carries over to the adaptive
setting.
And the way it works is this. So you start with a set A of observations, S1 through Si, and now what the greedy algorithm does, in every iteration, is pick the item S(i+1) that maximizes the expected marginal benefit, conditioned on what it has seen. Okay. So it looks at the expected marginal benefits and just tries to find the maximum of those.
Now, adaptive submodularity implies some really interesting fact about these
marginal benefits through the course of the greedy algorithm. It implies if you
fix a particular item S, then its expected marginal benefits have to be monotonically
decreasing over the course of the greedy algorithm.
Okay? So that's an easy consequence of submodularity, which basically means that if at some iteration you have this yellow marginal benefit for item S, then at any subsequent iteration, the marginal benefit can never be more than it was in the previous iteration. So these marginal benefits can never increase.
So why is this useful? Well, you can exploit it in a really interesting variant
of the greedy algorithm called the lazy greedy algorithm, and the original version
of that for classical submodular functions is due to [indiscernible] in 1978, but now we show that you can actually generalize and make use of the same insights for the adaptive setting.
And so the first iteration is just business as usual. So we can calculate the
marginal benefits with respect to all sets, okay? And now here the best one, for
example, may be item A. So we pick A. And now, in the next iteration, naively you would have to recompute the marginal benefit for all four remaining elements. So you have to do four function evaluations. This could be
expensive.
So what you can do instead, instead of recomputing all of them, is just use the previous values. You just sort them and take as the best guess the one that had the highest value before. So you try to look at how good D would be. Of course, the last value is only an upper bound on its true value. So we have to recompute it, okay? And by recomputing it, the value could go down. But if it goes down, then you just re-sort it and put it back into this queue.
Now, the next best guess would be B. And if you recompute it and it still has the
same value, then we now know that it has to be the second best item. So we don't
ever have to look at E and C again.
Okay? So it means that in this simple example, you save these two function
evaluations. That doesn't seem like a lot, but in practice it can matter a whole lot.
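A sketch of that lazy greedy idea, keeping cached marginal benefits in a priority queue and only recomputing the top candidate; the coverage instance at the bottom is a made-up example, and the only assumption is that marginal gains never increase, which is exactly what submodularity guarantees.

```python
import heapq

def lazy_greedy(items, k, marginal_gain):
    """marginal_gain(item, chosen) must be monotonically decreasing as `chosen` grows."""
    chosen = []
    heap = [(-marginal_gain(i, chosen), i) for i in items]   # negate: heapq is a min-heap
    heapq.heapify(heap)
    while len(chosen) < k and heap:
        neg_gain, item = heapq.heappop(heap)
        fresh = marginal_gain(item, chosen)                  # recompute only this candidate
        if not heap or fresh >= -heap[0][0]:
            chosen.append(item)                              # still the best: take it
        else:
            heapq.heappush(heap, (-fresh, item))             # stale: reinsert and retry
    return chosen

# Toy coverage example.
regions = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {6, 7}}
cov = lambda s: len(set().union(*(regions[i] for i in s)) if s else set())
gain = lambda i, s: cov(s + [i]) - cov(s)
print(lazy_greedy(list(regions), k=2, marginal_gain=gain))   # ['a', 'c']
```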
Okay? And here, just one preliminary experiment that we did is we look at the
adaptive sensor placement problem. I won't talk too much in detail about this. We
take data from 350 traffic sensors on a highway in California, and what you want
to do is you want to adaptively select the sensors in order to maximize the value.
And if you compare the naive greedy algorithm versus the lazy greedy algorithm, we get performance improvements by a factor of 30 to 40 in this problem. That can make a difference in practice. And in the classical setting, there have been some studies showing even bigger improvements.
So practically, that is a really important benefit that you get if your objective function is adaptive submodular.
Another nice consequence of adaptive submodular functions is that you can calculate data-dependent bounds. I told you about the one minus one over e bound. In some sense, this bound is a worst-case bound that holds no matter what the [indiscernible] is, what all the [indiscernible] are. Okay. But the [indiscernible] that you work with in practice may not be as adversarial as in this worst-case analysis.
What you can do is calculate some more instance-dependent or problem-dependent bounds that are often much tighter than these offline worst-case bounds. You can run your algorithm and then use submodularity to get some certificate of how close you are to the optimal solution, okay. And that's something that's known for classical submodular functions too, and it also carries over to the adaptive setting.
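For the classical case, such a certificate can be computed as follows; the coverage instance is a toy example, and the bound used is the standard one for monotone submodular functions, namely that the optimum is at most the current value plus the k largest remaining marginal gains.

```python
# Data-dependent bound: for any solution A, OPT_k <= F(A) + sum of k largest gains Delta(s | A).

def online_bound(A, items, k, F):
    gains = sorted((F(A + [s]) - F(A) for s in items if s not in A), reverse=True)
    return F(A) + sum(gains[:k])

regions = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {6, 7}}
F = lambda s: len(set().union(*(regions[i] for i in s)) if s else set())
A = [1, 3]                                   # e.g. a greedy solution
print(F(A), "<= OPT <=", online_bound(A, list(regions), k=2, F=F))   # 6 <= OPT <= 7
```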
Won't talk too much about detail, but this is again from this placement problem.
X axis is the sensors that I picked. Y axis is the number of problems. The blue
curve is the adaptive greedy algorithm and the black curve is if you use the one
minus one over e bound that I told you about before. And the red curve here is what you get from these data-dependent bounds. You can see that it's tighter than this one minus one over e bound.
So in practice, you can get problem specific bounds just knowing that your objective
function is submodular. And what you can also do is you can get these bounds for
any algorithm you may run, not just the adaptive greedy algorithm.
Okay. So these are some more reasons why adaptive submodularity is useful. Now I'll tell you about some more applications, because so far I've only talked about the sensor selection and adaptive set covering problems.
So let me talk about some other applications. And one really interesting
application is in viral marketing. So suppose we'd like to get a new product on
the market, and we want to convince people to buy it, okay? So the idea behind viral marketing is that you can give the product for free to a bunch of people, and they hopefully convince their friends, who convince their friends, and so we maximize the overall impact. Of course, the question is which set of people should we give the product to for free to maximize our expected influence?
And that's a problem that's been studied by David Kempe, Jon Kleinberg and Eva Tardos in a really nice KDD paper. And they show -- they have a particular model of how influence propagates. So they take a social network of people and annotate all the edges with probabilities. So suppose we give the free product, maybe a phone, to Alice. Then Alice can try to convince her friends. For example, she has a 30% chance to influence Bob, which may fail, and a 50% chance to influence Charlie, which may succeed.
Then Charlie's influenced, buys the phone himself, and then he tries to influence Bob again, which may fail, and Dorothy, and so on. Over time, you see how this influence
propagates through this network. And so what Kempe, Kleinberg and Tardos showed
is that the expected number of people influenced is a submodular function of the
initial set of people that you select.
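As a sketch of how one might estimate that expected influence, here is a Monte Carlo simulation of the independent cascade model from the Kempe, Kleinberg and Tardos paper; the small social graph and edge probabilities are illustrative only.

```python
import random

graph = {"Alice":   [("Bob", 0.3), ("Charlie", 0.5)],
         "Charlie": [("Bob", 0.2), ("Dorothy", 0.4)],
         "Bob":     [], "Dorothy": []}

def simulate(seeds):
    """One random cascade: each newly influenced person gets one chance per friend."""
    influenced, frontier = set(seeds), list(seeds)
    while frontier:
        person = frontier.pop()
        for friend, p in graph[person]:
            if friend not in influenced and random.random() < p:
                influenced.add(friend)
                frontier.append(friend)
    return len(influenced)

def expected_influence(seeds, trials=10000):
    return sum(simulate(seeds) for _ in range(trials)) / trials

print(expected_influence({"Alice"}))   # a submodular function of the seed set
```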
Okay. So if you want to run an advertisement campaign where you say you have a budget
to give out ten phones, right, or some number of phones, then you can use the greedy algorithm to find a near-optimal set of people that maximizes the expected influence.
Okay? But this was all about the nonadaptive setting, where you have to commit to the people in advance. There is a very natural adaptive analog, right. So in practice, what you may want to do is actually run your marketing campaign in stages, right.
So you say you give the phone to a bunch of people first then see how successful
they are in influencing.
You learn from that, then you pick another set of people conditioned on what you've
done and so on, right. So it's a very, very natural adaptive analog. And so here, for example, we may pick Alice first, get to see how successful she was in influencing people, and maybe based on that then pick Fiona second, and so on, and get to see how effective she is, okay?
And it turns out that's an adaptive submodular problem. Okay? So that's already maybe a more compelling application than this adaptive set covering problem. And so this adaptive greedy algorithm gets you a one minus one over e guarantee relative to the optimal adaptive policy, and you can also use these nice tricks with lazy evaluations, online bounds and so on.
Okay. So now, we've been talking about information gathering and active learning problems, and now let's finally get to active learning. Yes?
>>: I have a question with this one minus one over e issue. You're comparing to the best optimal adaptive solution. Can you say something about how good you are
compared to the best nonadaptive solution.
>> Andreas Krause: That's a good question. So that's the adaptivity gap. So it turns out for maximization, it's not entirely clear how big this adaptivity gap is. For coverage, if you for example want to achieve, say, 90% market segmentation, right, or in the set covering problem you want to cover everything, you can show that there are very large gaps. So you can do a lot better by being adaptive than by not being adaptive.
So for example, for adaptive set cover, there's an adaptivity gap due to [indiscernible] of N over log N. I don't know it for the viral marketing problem. But for set cover it is, okay? Good.
So this is the -- so the viral marketing. Now let's talk about active learning,
okay? In particular, let's talk about diagnostic problems. We'd like to diagnose
a disease. We can run tests. You start with a bunch of hypotheses. So these
pictures are just different hypotheses for diseases that the puppy may have. Okay.
And now you take a Bayesian approach. You have a prior over hypotheses, and you have some likelihood function for the outcomes. And let's start with the setting where the observations are deterministic conditioned on the true hypothesis. Okay? So any particular disease uniquely determines the outcomes of the tests, so there's no noise whatsoever.
Okay? If you're in this setting, then any test cuts away part of the hypotheses. For example, if you find that X1 equals one, it eliminates some of the hypotheses. If you find out X3 equals zero, it again cuts off part of the hypotheses, okay? But the problem is, of course, a priori, we don't know the outcome of the test. Okay. In particular, if you pick test X2, then we could either eliminate the two hypotheses on the left or the two hypotheses on the right. You don't know a priori which is which, right?
Okay. And, of course, this looks like a kind of adaptive set cover problem, right? Because we'd like to cover all the hypotheses except the true one, okay?
And now, of course, the question is how should we test, right. And one natural objective is just to look at the expected reduction in mass of the hypotheses that we eliminate with a test. You look at every test, look at both outcomes, see how much hypothesis mass you rule out, and average over the outcomes weighted by the likelihood of each outcome. Okay? That is called generalized binary search.
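A minimal sketch of that generalized binary search score on a toy noise-free hypothesis space; the hypotheses, priors, and test outcomes are invented for illustration.

```python
# Each hypothesis has a prior and deterministically fixes every test outcome (noise-free case).
prior   = {"h1": 0.4, "h2": 0.3, "h3": 0.2, "h4": 0.1}
outcome = {"h1": {"t1": 1, "t2": 0}, "h2": {"t1": 1, "t2": 1},
           "h3": {"t1": 0, "t2": 1}, "h4": {"t1": 0, "t2": 0}}

def gbs_score(test, alive):
    """Expected prior mass eliminated by running `test`, given the surviving hypotheses."""
    mass = {o: sum(prior[h] for h in alive if outcome[h][test] == o) for o in (0, 1)}
    total = mass[0] + mass[1]
    # P(outcome) times the mass of hypotheses inconsistent with that outcome
    return sum((mass[o] / total) * mass[1 - o] for o in (0, 1))

alive = set(prior)
print(max(["t1", "t2"], key=lambda t: gbs_score(t, alive)))   # greedily chosen first test: t2
```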
Turns out to be [indiscernible] maximizing information gain in the [indiscernible]
sense of this problem. And turns out it's adaptive submodular, okay? And I want
to quickly show you why it is adaptive submodular. So let's take -- so what you
need to do is you need to show that the value of some test X has to monotonically
decrease as we gather more and more information.
Suppose initially we have some prior probability mass for these hypotheses, the three on the left, and we call it B0, B for blue. And on the right we have some prior probability mass for the green hypotheses, right, G0. And it's not hard to show that the initial expected marginal benefit of this test X can be calculated as two times G0 times B0, divided by G0 plus B0. I won't go through this in detail, but it's really easy to show.
Now suppose you run some tests, gather some information, so we rule out some of the [indiscernible] hypotheses. So both the blue and the green mass decrease. So now we can look at what the expected marginal benefit of this test X is after we've seen these observations. And that turns out to be two times G1 times B1, divided by G1 plus B1. Okay?
Now, it turns out it's fairly easy to show that whenever B0 is greater than or equal to B1, and G0 is greater than or equal to G1, which is always the case if you cut away mass of these hypotheses, then the final marginal benefit has to be less than or equal to the initial marginal benefit, which proves adaptive submodularity, okay? So the proof fits on one slide, okay?
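A quick numeric sanity check of that one-slide argument, assuming nothing beyond the formula just stated: shrinking the blue and green masses can only decrease two times G times B over G plus B.

```python
import random

def benefit(g, b):
    return 2 * g * b / (g + b) if g + b > 0 else 0.0

for _ in range(100000):
    g0, b0 = random.random(), random.random()
    g1, b1 = random.uniform(0, g0), random.uniform(0, b0)   # masses can only shrink
    assert benefit(g1, b1) <= benefit(g0, b0) + 1e-12
print("no counterexample found")
```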
Now, that means that the greedy algorithm for this generalized binary search is near-optimal. Of course, that's not a new insight. There has been a lot of work on this problem, some extensions, and the guarantees have been improved over time. The currently best known approximation ratio for this optimal decision
tree problem, noise-free, is four times log of one over P min. P min is this smallest
probability among any hypothesis.
And it turns out that using this insight, that the objective function is adaptive submodular, you can improve this [indiscernible] and get rid of this factor of four. Okay. So that means that this adaptive submodularity analysis is tighter than all these existing analyses. But what I think is more interesting is the fact that the guarantee for this adaptive greedy algorithm is a simple consequence of the fact that the objective function is adaptive submodular.
Okay. So all these existing analyses have to set up machinery specifically for analyzing this particular problem. So there's been a lot of work in trying to analyze adaptive set cover, analyze these active learning problems, analyze the [indiscernible] problem and so on. It turns out the reason why they all work is that the objective function is adaptive submodular, and you don't lose anything by this abstraction; you actually get better bounds.
>>: So how does adaptive [indiscernible] compare to naive myopic, the next best?
>> Andreas Krause: Let me get to that.
>>: Okay.
>> Andreas Krause: Yes?
>>: My question is about the domain you're working on. So are you assuming that each test you make eliminates some possible --
>> Andreas Krause: Yes.
>>: At least?
>> Andreas Krause: Yes.
>>: But there could be some cases where, like, you learn something but it doesn't really eliminate anything.
>> Andreas Krause: That's what I'm going to get at.
>>: Okay.
>> Andreas Krause: So I'm talking about the classical setting; it's called the [indiscernible] problem. The underlying mathematical problem assumes that the tests are noise-free. That means that every observation rules out some hypotheses, right?
That's because you multiply with [indiscernible], right, if the likelihood of
[indiscernible].
>>:
But there can also be cases where a combination of tests eliminates.
>> Andreas Krause: Of course, of course, that can happen. That is all modelled here. Okay? Good. So this is the noise-free case, okay? In practice, there's always
noise. So if you have sensors, you always have sensor noise, these medical tests
could have false positives, false negatives and so on. Unfortunately, all these
results in the noise-free case don't carry over.
>>: [indiscernible] active learning when you get [indiscernible] model that you're
using to do diagnoses, or using the active learning.
>> Andreas Krause: It depends on who you ask. So if you -- so you could call
it sequential Bayesian experimental design. You could call it active learning.
You could call it adaptive VOI.
>>: [indiscernible] I'm not being cutting, splitting hairs. There's also one
[indiscernible] reserve that term for how you're going to use your methods to
actually choose new data that's unlabeled, for example.
>> Andreas Krause: Yeah, so it turns out you can easily cast pool-based active learning from that perspective. We can talk about that offline. But it's the same model, it's just using --
>>: But I'm suggesting you might want to distinguish how you describe the application.
>> Andreas Krause: Okay. So in the paper, we actually talk about the learning problem; here I just want to tell you about VOI.
>>: I haven't personally ever used active learning to describe the task of diagnosis
of a fixed model.
>> Andreas Krause: So it's called active --
>>: [indiscernible] diagnosis.
>> Andreas Krause: Okay, diagnosis. Good. So diagnosis. So the problem now, if you have noise, then exactly what you mentioned can happen. So the tests no longer eliminate the diseases. They only make them less likely. And it turns out
that breaks all the analyses. And it's not even clear what the right optimization
problem is anymore. Before, you had the task of eliminating all but the correct
hypothesis. Now what you want to do is intuitively gather enough information to
make the right decision.
So here's one way to make this precise. So suppose I run all the tests, okay? I
get to see the outcome of every single test. Then I still have some uncertainty about what the truth is. So the posterior distribution may still have probability
mass for different hypotheses. So the best I can really hope to do is take the action
that maximizes my expected utility.
Okay? Let's call that A star. Now I can ask, do I really have to run all these
tests? How can I gather enough information so that I've proved to myself that I'm
still going to make the right decision? Okay? So how can I cheaply test to
guarantee that after stopping, I'm going to choose the right action? Okay?
So this is the natural generalization of this [indiscernible] optimization problem.
And now what we could do is we could try to understand how some of these existing
approaches that are out there work in this problem. So for example, one natural
guess would be to try generalized binary search for this problem, right? Or we could try to maximize information gain or myopically maximize value of information and so on. It turns out none of those is adaptive submodular if there's noise. That wouldn't rule out that they work, but actually they can empirically do badly, and I'll show you that later.
And you can actually theoretically prove that they can cost about N over log N times the cost of the optimal policy. Okay? Good.
So that means we have to look for a new criterion. So here's our proposal. So the
idea is -- and that's an idea common to a lot of [indiscernible] problems is that
you replace the noisy problem with a noiseless problem, essentially by introducing
slack. So what we do is -- what we can do is we can basically create noisy copies
of our hypotheses and annotate all of our noisy hypotheses with the outcomes of all
the tests. Okay?
So what you basically do is, suppose in the case of this green disease here, the second and third tests always come out zero, but the first test could either come out zero or one, okay? Maybe zero is more likely than one. In this case, we have two copies of this green disease, one over here, one over here, but they're annotated with different vectors of outcomes of these tests.
Okay? The same thing for this orange disease, right. For this orange disease, for example, the second and third tests come out zero and one, but the first one could either be zero or one. Okay? This is just for illustration purposes here.
Now, what we -- so now we have reduced the noisy case to the noiseless case, right. Because any of those hypotheses deterministically determines, by construction, the outcome of all the tests. So we could run generalized binary search on this problem. Of course, the big issue is that these noisy hypotheses encode a lot more information than we need, right, because all we really need to do is distinguish between noisy hypotheses that lead to making different decisions.
So what we can do is we can take all those noisy annotated hypotheses and group them into equivalence classes based on which action we would take in each case. And now we only need to distinguish between these equivalence classes rather than between the individual elements in these equivalence classes. So one way to do this is to build a graph. We introduce an edge between any pair of noisy hypotheses in different equivalence classes, and the weight of these edges we're just going to choose as the product of the individual probabilities of these hypotheses, okay. So
if you have two very likely hypotheses, like this green one, this red one here, then
the edge would have heavy weight. But if you look at these two examples over here,
then the edge would have very little weight, okay?
And now suppose we see the outcome of one test. So we see X1 equals 1. In this case, we eliminate some of those noisy hypotheses. And, of course, now we can also get rid of all the adjacent edges. Okay? So every test now eliminates not just nodes
in this graph, but edges in this graph. And these edges basically measure our
progress in being able to distinguish between these equivalence classes. And also
notice that any optimal policy has to cut all the edges in this graph. Right?
Because if there's still at least one edge, then there's some positive probability
of confusing these two equivalence classes, right? So we have to cover, get rid
of all those edges and so we can define the objective function as just a total mass
of all the edges cut under a particular observation.
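A sketch of that edge-cutting surrogate objective on a toy noisy-diagnosis instance; the hypotheses, probabilities, decision classes, and test outcomes below are all made up for illustration.

```python
from itertools import combinations

# Edges (weighted by the product of probabilities) connect noisy hypotheses in
# different decision classes; the objective is the mass of edges cut so far.
hyps = {"h1": {"prob": 0.4, "action": "A", "tests": {"t1": 1}},
        "h2": {"prob": 0.3, "action": "A", "tests": {"t1": 0}},
        "h3": {"prob": 0.2, "action": "B", "tests": {"t1": 1}},
        "h4": {"prob": 0.1, "action": "B", "tests": {"t1": 0}}}

def consistent(h, observations):
    return all(hyps[h]["tests"][t] == o for t, o in observations.items())

def edges_cut(observations):
    """Total weight of edges with at least one endpoint ruled out by the observations."""
    total = 0.0
    for h, g in combinations(hyps, 2):
        if hyps[h]["action"] != hyps[g]["action"]:            # only cross-class edges exist
            if not (consistent(h, observations) and consistent(g, observations)):
                total += hyps[h]["prob"] * hyps[g]["prob"]
    return total

print(edges_cut({}), edges_cut({"t1": 1}))   # 0.0 before any test; 0.13 after observing t1 = 1
```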
Now, it turns out that's an adaptive submodular function, and that means the greedy algorithm is going to give us a near-optimal solution for cutting all these edges. Okay. So the cost of the greedy policy is at most a logarithmic factor more than the optimal policy.
Okay? And this factor depends on the smallest probability of the hypotheses. Okay? And so this is the first approximation algorithm for non-myopic VOI in general graphical [indiscernible]. And the idea is that you want to solve this non-myopic problem, but you define some alternative, a substitute for the objective function. You don't really optimize value of information, but you optimize some proxy objective function, which turns out to be adaptive submodular and somehow guides you in the right direction.
Okay? So now for an application, so -- yes, yes.
>>: I have a question on this. So it seems like the cost of computing the green
is going to be tremendously high.
>> Andreas Krause: So very good point. So in practice, you would never actually
implement the algorithm as such, but turns out what you can do is you only need to
estimate the amount of mass eliminated through the tests, and you can estimate that
using sampling. This is a rejection sampling approach for estimating how much mass you cut. That gives a very efficient approximation to this objective function, which actually works really well in practice.
Okay? Good. So now for an actual application: I started to collaborate with Colin Camerer at Caltech, who is a neuroeconomist, a behavioral economist at Caltech. And one interesting paradigm that they study, in order to understand how people
make decisions under uncertainty is called the Iowa gambling task. And in that task, subjects are presented with two decks of cards they can flip through. These cards basically have -- so the cards have different values, so you could either win, you could lose, or gain nothing. Okay?
You flip through the cards and you have these two decks of cards. You can look at
both of them, then estimate how likely you are to win, lose or gain nothing with
respect to these cards. After you've gathered some information about that, you then have to decide on one of those decks of cards, draw a card, and get paid based on that observation. You get the equal -- for the setting here, you get an equal number of trials for both of those decks. So you gather information about these probabilities.
So in some sense, what each deck encodes is a particular probability distribution over rewards. Okay. And suppose we have these two different distributions. In
the first setting, you win ten dollars with 70% chance, lose ten dollars with 30%
chance. In the second setting, you win ten dollars with 30% chance and gain nothing
or lose nothing with 70% chance. Who would prefer the left gamble?
>>: How many draws do I get?
>> Andreas Krause: One.
>>: Only one trial?
>> Andreas Krause: Yes. Who would prefer the second one? Okay. So it turns out there is some heterogeneity. So the one that has higher expected value is the first one, of course, right? Four dollars in expectation, three in the other one. But in the right one, you can only win. So there are different competing hypotheses of how people make these decisions, such as, for example, the hypothesis that people just maximize expected utility.
There's prospect theory, which basically says that people may weigh losses more strongly than gains. There's portfolio optimization, which basically says people weigh expected value, variance, skewness, and other moments of the distribution differently. So there are basically different ways of looking at features of the probability distributions and encoding utilities. And an interesting question in behavioral economics is trying to understand what the variability in the population among those different theories is. Does everybody behave the same, or what makes them behave in a certain way in what kind of situations?
Okay. And now, of course, every test requires actually having the subject run through this setting. So you have to come up with a set of these PDFs, of these stacks of cards, and gather some information. That is expensive. So what we'd like to do is gather data as efficiently as possible.
So you can cast this as a Bayesian experimental design problem, where you have one latent variable, the theory that we try to identify -- prospect theory, expected value and so on -- and each theory has parameters that we also don't know. And all the tests
that we can run, the observations X1 through XN, are basically pairs of gambles, okay. So any test that we can run is a particular pair of those gambles, those decks of cards you can show, and now all these different theories and their parameters give us different utilities for each of those gambles. So we can try to model the pick of the user as some kind of noisy indicator of the perceived difference between the utilities of those gambles. So this is some kind of soft-max function that is used a lot in behavioral economics. This is just the observation model, and now we can try to figure out how I should test in order to figure out what's the truth here.
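A sketch of such a soft-max response model; the temperature parameter and the use of plain expected values as the utilities are assumptions made just for this illustration.

```python
import math

def choice_prob(u1, u2, temperature=1.0):
    """P(subject picks gamble 1) = 1 / (1 + exp(-(u1 - u2) / temperature))."""
    return 1.0 / (1.0 + math.exp(-(u1 - u2) / temperature))

# Expected-value theory applied to the two gambles from the talk:
ev_left  = 0.7 * 10 + 0.3 * (-10)    # win $10 w.p. 0.7, lose $10 w.p. 0.3 -> 4
ev_right = 0.3 * 10 + 0.7 * 0        # win $10 w.p. 0.3, else nothing      -> 3
print(choice_prob(ev_left, ev_right, temperature=2.0))   # about 0.62
```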
Okay. And now you can run different optimization algorithms to solve this. Here are the results that we get. This is based on simulations: we sample from the model and compare these different criteria, so you can get lots of trials. And the X axis is
the number of tests, Y axis is the accuracy of identifying what the correct hypothesis
is.
Here something interesting happens. It turns out that random does fairly well, something that you see quite a bit in active learning sometimes. And it turns out that some of these optimized criteria for this problem actually do worse than random for this application. Okay. And I can speculate offline about why I think this happens, but, of course, it's partly due to the fact that they optimize myopically, so you don't do any look-ahead, okay.
And so for example, what happens with uncertainty sampling is easy to understand: uncertainty sampling just picks the test about whose outcome you're most unsure, okay. But in this task of distinguishing these different theories, that is quite problematic, because if you have a particular pair of gambles where the utility under two different theories is equal, then that maximizes the uncertainty about the outcome, but also, if you run this test, it doesn't tell you anything about the true theory. So in this case, outcome uncertainty and the amount of information gathered about the theory are negatively correlated.
That is one reason why uncertainty sampling is really bad. But there's other reasons
for other objectives. So information-gain-based optimal design is doing better than random, but the adaptive submodular criterion is actually outperforming information gain on this particular task.
>>:
So basically the output of this, of the adaptive submodular approach, would
be a sequence of studies in humans and choosing these [inaudible].
>> Andreas Krause: Basically, yeah, right.
>>: So you wouldn't actually write down --
>> Andreas Krause: Basically it gives you an algorithm, gives you a decision rule, this tree.
>>: But for generating these [inaudible].
>> Andreas Krause: Exactly, right, exactly.
>>: But you wouldn't actually write down -- okay.
Okay.
>> Andreas Krause: You only would somehow expand the tree on the fly. And that
means that the algorithm has to be very efficient. And to show that it's efficient,
we actually use it in some human subject experiments where the -- so this is very
preliminary, but we started to run about 11 subjects on this trial. So the X axis
is the number of tests. Y axis is the probability of the class, the type of decision
made at the end, and you can see that people actually behave differently.
So actually, a fairly large fraction of people chose according to maximum expected utility. But here, there are also two subjects whose behavior is according to prospect theory, and there are some examples of other theories as well. The question is: can we estimate the heterogeneity in the population? Can we figure out how it depends on certain features of the environment, and so on?
>>:
[inaudible] decision science course [inaudible].
>> Andreas Krause: Yeah, this is a Caltech undergrad population, which may not be an unbiased sample. So okay. So this is very, very preliminary. But the main intention of this is not to draw any conclusions about these theories. It's only to show that this algorithm is actually practical to run in real time, okay?
So this is, this is this. So before I go on to conclude, I just want to tell you about one other project that's currently going on, and how these ideas connect. One problem we're interested in is using community-held sensors to sense and respond to crises, okay. And so for example, you could use advanced meters plugged into the smart grid to try to detect cascading failures, or personal navigation devices to detect traffic jams.
One project that we have at Caltech, and which is becoming large now, is using accelerometers in mobile phones in order to detect earthquakes, building a community seismic network. The idea is that earthquakes basically produce two waves: the primary wave, which is a sound wave, a compression wave, that travels a couple of kilometers per second, and the S wave, the secondary wave, which is the one that does the main damage, the shear wave, which travels less fast than the P wave. Okay.
So it means that if you can detect this P wave, it actually gives you some time to react and possibly do early warning. Okay. And just to give some idea, UCLA in California currently has a seismic network with about the resolution you would get in the L.A. area. And if you increase the resolution, if you were able to get more data from more sensors, you would be able to estimate in a much more fine-grained way how the seismic waves propagate over time.
And one way to gather data is that we use these shake tables. You can put phones
on these tables, play back recorded earthquakes, and see how the sensors behave as
we get these recordings. And one big challenge in this problem is that these phones
produce a lot of data. Okay. So if you take about a million phones, they would
produce about 30 terabytes of data each day. And AT&T or Sprint wouldn't be very
happy to constantly transmit that much data over the phone network. You really have
to make decisions about what you should send. Of course, that has decision theoretic
implications. You have to make a decision, should you raise an alarm or not. But
now you have to solve --
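A back-of-the-envelope check of that data volume figure, under illustrative
assumptions not stated in the talk (a 3-axis accelerometer sampled at 50 Hz with
2-byte samples, streamed continuously):

phones = 1_000_000
bytes_per_second = 3 * 50 * 2                      # axes * Hz * bytes per sample
per_phone_per_day = bytes_per_second * 86_400      # roughly 26 MB per phone per day
fleet_per_day_tb = phones * per_phone_per_day / 1e12
print(per_phone_per_day / 1e6, fleet_per_day_tb)   # ~26 MB, ~26 TB: same order as the ~30 TB quoted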
>>: [inaudible].
>> Andreas Krause: Once again?
>>: How long --
>> Andreas Krause: It depends on the scenario. For some of the scenarios considered,
you can get about two minutes, which actually doesn't sound like a lot, but it can
actually help you stop elevators, give warnings to cars to stop, and also protect
infrastructure and so on. So there's potential for responding to these events. Okay?
And so one approach we take is, we actually have accelerometer data recorded from
people, so this may be a person walking. What you can do is look at what seismic
events look like when played back on the phone, and superimpose them to get this
overlay data. And now we can basically train a model of what normal activity looks
like on the usual data, using density estimation techniques, and look at the
likelihood of the observations, for example, during an earthquake. And you can
actually see that for bigger events, you may even be able to detect them during some
activities. I mean, typically, you would only try to detect earthquakes while the
phone is lying still. But for the major ones, you may be able to figure out that
something is going on while people are using the phones. But now, of course, the
question is, how do you calibrate this network?
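A rough, hypothetical sketch of the "learn what normal looks like, then flag
unlikely readings" idea; the window features, the Gaussian mixture model, the
synthetic traces, and the 1% quantile threshold below are illustrative stand-ins,
not the actual pipeline.

import numpy as np
from sklearn.mixture import GaussianMixture

def window_features(acc, win=128):
    """Split a (T, 3) accelerometer trace into windows and take simple summary stats."""
    n = (len(acc) // win) * win
    w = acc[:n].reshape(-1, win, 3)
    return np.concatenate([w.mean(axis=1), w.std(axis=1)], axis=1)

rng = np.random.default_rng(0)
normal_trace = rng.normal(0.0, 0.02, size=(50_000, 3))   # stand-in for everyday phone data
model = GaussianMixture(n_components=5, random_state=0).fit(window_features(normal_trace))

# Flag windows whose log-likelihood falls below the 1st percentile of normal data.
threshold = np.quantile(model.score_samples(window_features(normal_trace)), 0.01)

shaking_trace = rng.normal(0.0, 0.2, size=(2_000, 3))     # stand-in for a shaking event
scores = model.score_samples(window_features(shaking_trace))
print("fraction flagged:", np.mean(scores < threshold))

The threshold is exactly the quantity the next exchange is about: each phone has to
pick it locally, yet the local choices have to add up to sensible network-wide
decisions.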
>>: [inaudible].
>> Andreas Krause: Yeah, yeah, exactly. But -- yeah?
>>: [indiscernible] is the probability of seeing that things are normal, basically.
>> Andreas Krause: Right, exactly, right. But now, of course, the question is how
do you choose this threshold. That's actually a really interesting problem, because
you have to choose it locally, but you have to choose it so that you make globally
efficient decisions. Right. So you have to somehow calibrate that.
And so we're currently looking at -- this is very much a work in progress -- using
these ideas of adaptive and online optimization of submodular functions in order
to calibrate this network. That's just a quick idea of what's currently going on.
So to conclude, I told you about adaptive submodularity, which is a generalization
of submodularity to adaptive problems. A lot of the useful properties we love for
classical submodular functions extend to the adaptive setting, like getting
guarantees for the greedy algorithm, doing lazy evaluations, and getting these data
dependent bounds.
There's actually a number of applications that can be shown to be adaptive
submodular, like stochastic set cover, viral marketing, active learning, and Bayesian
experimental design. And what is nice is that it provides a unified view on the
analysis of these different problems, so you can recover a number of results known
in the literature and get some extensions to them, but it also leads to new
algorithms, like the Bayesian experimental design I mentioned at the end. That's it.
>>:
How sensitive are your results to knowing that [indiscernible]?
>> Andreas Krause: Good question. So it turns out that for the maximization case,
it's difficult to say something. But for the coverage case, for example, if you
would like to achieve a certain amount of market coverage in the viral marketing
or in the active learning setting, you get guarantees even against adversarially
chosen realizations. So you can even -- if you start with a distribution, a uniform
distribution, then you still get guarantees even against worst case chosen
realizations.
>>:
Can you quantify the robustness?
>> Andreas Krause: That's an interesting question for future work. So this is some
kind of -- at the one end of the spectrum, it's completely adversarial and you can
still say something. But, of course, for what happens in between, between exactly
matching your prior versus being completely off, there's a lot of possible room for
improvement. Any other questions?
>>: What's next?
>> Andreas Krause: I told you some of the problems that's next. So I think one -- the theories?
>>: In terms of the theory, yeah.
>> Andreas Krause: So I think one interesting question is -- so in some sense you
can cast all these information gathering problems as general [indiscernible]
processes, right, but just using general purpose black box algorithms for that is,
I think, really challenging, because the state space is exponential in the number
of observations you can make, and the belief space -- not just the state space --
the belief space is doubly exponential in the number of tests you want to run, so
it's extremely intractable. And nevertheless, you can get approximation guarantees
for this class of [indiscernible]. One question is, can someone push this further,
that is, can you say something about more general planning problems? So this is one --
>>:
[indiscernible] approximation, for example, adaptive to [indiscernible].
>> Andreas Krause: Yes, exactly. Right. So can you come up with approximation
algorithms for certain types of POMDPs? So this is an interesting problem. But also
there are the robustness issues which were just raised, and that's a really
interesting discussion. And also, just getting a better idea of which problems are
adaptive submodular. I mean, this is a new concept, right? So I don't know how
general this is.
>>: So you have a domain that you don't know much about, and you are thinking about
applying these adaptive submodular algorithms on that.
>> Andreas Krause:
Yes.
>>: So what is the first clue, or how do you make a decision like to say this domain
is submodular or not?
>> Andreas Krause: So in general, trying to prove adaptive submodularity can be a
bit tricky. But in some cases it's not so hard, right. So I showed you a proof on
one slide for the active learning case, right. So, I mean, this is a really new
concept, and that means that I think there's a number of low hanging fruit for some
problems that are out there that people just haven't looked at.
But in general, I'm not sure what the right toolbox is for proving adaptive
submodularity. But there are some operations that do preserve adaptive
submodularity, so you can build more complicated adaptive submodular functions from
simple adaptive submodular functions. And in the paper we discuss some of those.
Maybe [indiscernible].
>>: What I thought would be [indiscernible] design [indiscernible] effects and this
property and diagnosis in the future, with no guarantees. Actually design machines
-- design machines such that they get this property, as part of the design process,
for maintenance and diagnosis. But with guarantees.
>> Andreas Krause: That's interesting, can someone come up, yeah.
>>: Kind of a wild idea.
>> Andreas Krause: Yeah, yeah, yeah, but there may be something. We can talk about this offline, yeah?
>>: Are there any additional constraints on submodularity that give you an even
more optimal solution, strictly submodular?
>> Andreas Krause: Very good question. So for the classical notion of set functions,
you can quantify the performance guarantee of the greedy algorithm in terms of a
quantity called the curvature, which basically measures how strong the diminishing
returns property is -- how quickly it flattens out.
And there's a way of quantifying that, and you can -- so an extreme example is linear
functions, all right, that don't have any diminishing returns. It turns out the
greedy algorithm is optimal for linear functions. Okay. And so now between linear,
right, where greedy is optimal, and the one minus one over e case, there's some room,
and you can quantify where you are. And we don't know what you can do for the
adaptive submodular case. So it's an interesting direction.
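For reference, the classical non-adaptive result being alluded to, stated from
memory rather than from the talk: with total curvature kappa, greedy under a
cardinality constraint satisfies the bound below, which tends to optimality for
linear functions (kappa = 0) and to 1 - 1/e as kappa approaches 1.

% Classical curvature bound for monotone submodular maximization (non-adaptive setting).
\[
  \kappa \;=\; 1 - \min_{e \in \Omega}\,
    \frac{f(\Omega) - f(\Omega \setminus \{e\})}{f(\{e\})},
  \qquad
  f(S_{\mathrm{greedy}}) \;\ge\; \frac{1}{\kappa}\left(1 - e^{-\kappa}\right) f(S^{*}).
\]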
>>: And also, the other thing I was going to ask: in submodularity, they show that
the greedy bound is the best possible. Do you have the same result for the adaptive case?
>> Andreas Krause: Well, it's a strict generalization of the classical case, right.
So unless you make further restricting assumptions, you can't really do better than
that. So you can always -- any submodular optimization problem is an adaptive
submodular problem where all the observations are deterministic. In other words,
all the hardness results for submodular functions carry over. Since the lower bound
and the upper bound, the guarantee [indiscernible] the greedy algorithm, match up,
there's not really room for doing better. Yes?
>>: Any idea how the quality of the solutions degrades with failure of the adaptive
submodularity assumption? Do you have something like epsilon submodularity, where
the less-than-or-equal-to holds with an epsilon on one side?
>> Andreas Krause: It's basically -- if you implement a greedy algorithm that's only
off by a factor of alpha, then you still get guarantees. There's a bunch of results
known for the classical case, and they still carry over to the adaptive case. So
either if the objective function you try to optimize is close to a submodular
function, or if you can't implement the greedy algorithm exactly, you can still do
something.
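One concrete version of this robustness, again for the classical non-adaptive
setting and stated from memory: if every greedy step attains at least an alpha
fraction (0 < alpha <= 1) of the best available marginal gain, the
cardinality-constrained guarantee degrades gracefully rather than disappearing.

% Approximate-greedy guarantee (classical setting), recalled rather than quoted from the talk.
\[
  f(S_k) \;\ge\; \left(1 - e^{-\alpha}\right) f(S^{*}),
  \qquad 0 < \alpha \le 1 .
\]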
>>: So I guess it might also be an empirical question in addition to a theoretical
question. Think about boosting, for example, which makes this strong assumption of
weak learning, and that's not satisfied by most learning methods we boost, yet
boosting in practice works extremely well over a wide range of problems. I'm
wondering if the same thing might be true here. That, in fact, adaptive
submodularity is great when you can get it, but, in fact, you do quite well using
these methods even when it's not really quite true.
>> Andreas Krause: That's very good. So it's really difficult to answer that theoretically. So --
>>: As it is for boosting.
>> Andreas Krause: As it is for boosting, yes.
>>: Saying adaptive submodularity as risk taking.
>> Andreas Krause: Right. So an anecdotal example: I told you about the
[indiscernible] prediction error, the variance reduction. If you say you want to
monitor temperature in a building, right, or in a lake, then you may want to care
about the expected reduction in mean squared error. That objective function is not
always submodular. There are counterexamples, and it depends a bit on -- so for
Gaussians, it depends a bit on some of the structure of the [indiscernible], and
there are stability results and so on, so there's some indication that for some
applications it is submodular. But these conditions are often violated in practice,
even though the algorithms work really well -- work better than state of the art
existing techniques. So I mean, I don't know whether there's the same example in
the adaptive case, but at least for the classical case, there's hope in this direction.
>>: So then a related question, do you think there might be cases where relaxing
the adaptive submodularity requirement might actually yield improved solutions? I
mean, for example, this does happen in boosting, right? So typically, if we go from
something that's provably weak to something that's not so weak, we often actually
get better results with boosting. I'm wondering if there could be a similar thing
here, where adaptive submodularity is great from a theoretical point of view, but
in fact, in practice, we do somewhat better if we relax it in a certain way.
>> Andreas Krause: But what do you mean -- so what is -- [indiscernible] a suitable
candidate algorithm. One issue is that people love greedy algorithms and they often
work really well, and the question is why. And that may be an answer to at least
some of those problems, right? So I think one of the answers could be that in some
applications that maybe lack submodularity but where greedy still does well, it's
because the function is close to adaptive submodular, or at least in the small space
of solutions that the greedy algorithm searches, which somehow contains sufficiently
many good solutions, the submodularity is satisfied, and that gives some indication
of why you should have [inaudible].
>> Eric Horvitz: Thanks very much.