Optimizing Sensing using Adaptive Submodularity
Andreas Krause
Joint work with
Daniel Golovin and Deb Ray
IPAM Machine Reasoning Workshop, October 2010
rsrg@caltech … where theory and practice collide
Information gathering problems
Environmental monitoring using AUVs
Sensing in buildings
Sensor placement for activity recognition
Water distribution networks
Selecting blogs & news
Information gathering problems
Want to learn something about the state of the world:
estimate water quality in a geographic region, detect outbreaks, …
We can choose (partial) observations:
make measurements, place sensors, choose experimental parameters …
… but they are expensive / limited (hardware cost, power consumption, …)
Want to cost-effectively get the most useful information!
Fundamental challenge in ML/AI:
How do we automate curiosity and serendipity?
Related work
Information gathering problems considered in
Experimental design (Lindley ’56, Robbins ’52…), Decision theory
(Howard ‘66,…), Operations Research (Nemhauser ’78, …),
AI (Horvitz et al ‘84, …), Machine Learning (MacKay ’92, …),
Spatial statistics (Cressie ’91, …), Robotics (Sim&Roy ’05, …),
Sensor Networks (Zhao et al ’04, …), …
Existing algorithms typically
Heuristics: No guarantees! Can do arbitrarily badly.
Find optimal solutions (Mixed integer programming, POMDPs):
Very difficult to scale to bigger problems.
Research in my group
Theoretical:
Approximation algorithms that have
theoretical guarantees and scale to large problems
Applied:
Empirical studies with real deployments
and large datasets
Running example: Detecting fires
Want to place sensors to detect fires in buildings
A Bayesian's view of sensor networks
[Figure: graphical model with hidden temperatures X1, …, X6 and noisy sensor readings Y1, …, Y6]
Xs: temperature at location s
Ys: sensor value at location s
Ys = Xs + noise
Joint probability distribution:
P(X1,…,Xn, Y1,…,Yn) = P(X1,…,Xn) P(Y1,…,Yn | X1,…,Xn)
                       (prior)      (likelihood)
Utility of making observations
[Figure: posterior beliefs over temperature (cold / normal / hot) at each location after observing Y1 = hot]
Observing Y1 = hot leaves us less uncertain: Reward[ P(X | Y1 = hot) ] = 0.2
A different outcome…
[Figure: posterior beliefs after instead observing Y3 = cold]
Reward[ P(X | Y3 = cold) ] = 0.1
Example reward functions
Should we raise a fire alert?
Utility U(x, a) of action a when the temperature X is in state x:

  Action      | Fiery hot | Normal/cold
  No alarm    |   -$$$    |      0
  Raise alarm |     $     |     -$

We only have a belief about the temperature, P(X = hot | obs)
→ choose a* = argmax_a Σ_x P(x | obs) U(x, a)
Decision-theoretic value of information:
Reward[ P(X | obs) ] = max_a Σ_x P(x | obs) U(x, a)
Other example reward functions
Entropy:
Reward[ P(X) ] = -H(X) = Σ_x P(x) log2 P(x)
Expected mean squared prediction error (EMSE):
Reward[ P(X) ] = -(1/n) Σ_s Var(Xs)
Many other objectives are possible and useful…
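As an illustration of the entropy reward (not from the slides; `belief` is a hypothetical array of posterior probabilities), a minimal sketch:

```python
import numpy as np

def entropy_reward(belief):
    """Reward[P(X)] = -H(X) = sum_x P(x) log2 P(x); higher means less uncertain."""
    p = np.asarray(belief, dtype=float)
    p = p[p > 0]  # terms with P(x) = 0 contribute 0 to the sum
    return float(np.sum(p * np.log2(p)))

# A fully certain belief has reward 0; a uniform belief over 4 states has -2.
print(entropy_reward([1.0, 0.0, 0.0]))           # 0.0
print(entropy_reward([0.25, 0.25, 0.25, 0.25]))  # -2.0
```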
Value of information [Lindley '56, Howard '64]
For any set A of sensors, its sensing quality is
F(A) = Σ_{yA} P(yA) Reward[ P(X | yA) ],
the expected reward over the observations YA = yA made by the sensors in A.
Maximizing value of information
[Krause, Guestrin, Journal of AI Research '09]
Want to find A* = argmax_{A: |A| ≤ k} F(A)
Theorem: Complexity of optimizing value of information
For chains (HMMs, etc.): optimally solvable in polytime ☺
For trees: NP^PP complete ☹
Approximating Value of Information
Given: finite set V of possible locations
Want: A* = argmax_{A: |A| ≤ k} F(A)
Typically NP-hard!
Greedy algorithm:
Start with A := ∅
For i = 1 to k:
  pick s* := argmax_{s ∈ V \ A} F(A ∪ {s}); set A := A ∪ {s*}
How well can this simple heuristic do?
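A minimal sketch of the greedy selection above, with a toy coverage objective standing in for the sensing quality F (the sets and names are made up for illustration):

```python
def greedy(V, F, k):
    """Greedily add the element with the largest marginal gain.  Near-optimal
    when F is monotone submodular (Nemhauser et al. '78: factor 1 - 1/e)."""
    A = []
    for _ in range(k):
        best = max((s for s in V if s not in A), key=lambda s: F(A + [s]))
        A.append(best)
    return A

# Toy coverage objective: each location covers a few points.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
F = lambda A: len(set().union(*(cover[s] for s in A)))
print(greedy([1, 2, 3], F, 2))  # [3, 1]: location 3 first, then 1
```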
Performance of greedy
[Figure: optimal vs. greedy sensor placements on an office floor plan; temperature data from a sensor network deployment]
Greedy is empirically close to optimal. Why?
Key observation: Diminishing returns
[Figure: selection A = {Y1, Y2} vs. selection B = {Y1, …, Y5}]
Adding a new sensor Ys to the small selection A will help a lot (large improvement);
adding Ys to the large selection B doesn't help much (small improvement).
Submodularity: for all A ⊆ B and s ∉ B,
F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
Theorem [Krause and Guestrin, UAI '05]: If the Ys are cond. ind. given X, then the
information gain F(A) = H(X) - H(X | YA) is submodular!
One reason submodularity is useful
Theorem [Nemhauser et al '78]:
The greedy algorithm gives a constant-factor approximation:
F(Agreedy) ≥ (1 - 1/e) F(Aopt)   (≈ 63%)
The greedy algorithm gives a near-optimal solution!
For information gain: this guarantee is the best possible unless P = NP!
[Krause & Guestrin '05]
Submodular Sensing Problems
Sensing in buildings, environmental monitoring, sensor placement for activity recognition, selecting blogs & news, water distribution networks:
Non-adaptively select sensor locations A to maximize a monotone submodular function F(A).
The greedy algorithm finds a set of k near-optimal locations.
Adaptive Optimization
[Figure: a diagnosis policy as a decision tree: run test x1, branch on its outcome (0/1) to choose the next test x2 or x3, and so on]
Want to effectively diagnose while minimizing the cost of testing!
Interested in a policy (decision tree), not a set.
Can't use submodular set functions!
Can we generalize submodularity to adaptive problems?
Example: Set cover
[Figure: locations s1, s2, s3, each covering a region]
Set V = {1, 2, …, n} of possible locations
Each location s is associated with a set Ws of points it covers
Value of a placement A: f(A) = |∪_{s ∈ A} Ws|
f is a submodular function!
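A quick numerical check of the diminishing-returns property for this coverage objective (toy sets, illustrative only):

```python
def f(A, W):
    """Coverage value f(A) = |union of the sets W_s for s in A|."""
    return len(set().union(*(W[s] for s in A)))

W = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
A, B, s = [1], [1, 2], 3           # A is a subset of B, and s is in neither
gain_A = f(A + [s], W) - f(A, W)   # marginal gain of s w.r.t. the smaller set
gain_B = f(B + [s], W) - f(B, W)   # marginal gain of s w.r.t. the larger set
print(gain_A, gain_B)  # 2 1: the gain shrinks as the base set grows
```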
Adaptive set cover
[Figure: the region covered by s3 depends on its random state, e.g. large if X3 = 1 and small if X3 = 0]
Set V = {1, 2, …, n} of possible locations
Independent random variables X1, X2, …, Xn
Each set Ws = Ws(Xs) depends on Xs, revealed only after selecting s
Value of A in state XV: f(A, XV) = |∪_{s ∈ A} Ws(Xs)|
Problem Statement
Given:
Items V = {1, …, n}, associated with random variables X1, …, Xn taking values in O
Objective f: 2^V × O^V → R
Value of a policy π: F(π) = E_{xV}[ f(S(π, xV), xV) ], where S(π, xV) is the set of tests run by π if the world is in state xV
Want: π* = argmax_π F(π), subject to π running at most k tests
NP-hard (and also hard to approximate!)
Adaptive greedy algorithm
Suppose we've seen XA = xA.
Expected benefit of adding test s:
Δ(s | xA) = E[ f(A ∪ {s}, XV) - f(A, XV) | XA = xA ]
Adaptive greedy algorithm:
Start with A := ∅
For i = 1 to k:
  pick s* := argmax_s Δ(s | xA)
  observe Xs* = xs*
  set A := A ∪ {s*}
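A sketch of adaptive greedy for the adaptive set cover example (the coverage model, probabilities, and names are made up; the expected benefit Δ(s | xA) simplifies because the Xs are independent of everything observed so far):

```python
import random

# Toy adaptive set cover: the region W[s][x] covered by location s is revealed
# only after s is selected and its state X_s = x is observed.
W = {1: {0: {"a"}, 1: {"a", "b"}},
     2: {0: {"b"}, 1: {"b", "c"}},
     3: {0: {"c", "d"}, 1: {"d"}}}
p = {1: 0.5, 2: 0.9, 3: 0.2}  # P(X_s = 1); the X_s are independent

def expected_gain(s, covered):
    """Delta(s | x_A): expected number of newly covered points, averaging
    over the two possible outcomes of X_s."""
    return p[s] * len(W[s][1] - covered) + (1 - p[s]) * len(W[s][0] - covered)

def adaptive_greedy(k, draw_outcome):
    """Pick k locations; observe each X_s right after selecting s."""
    covered, chosen = set(), []
    for _ in range(k):
        s = max((t for t in W if t not in chosen),
                key=lambda t: expected_gain(t, covered))
        x = draw_outcome(s)     # observation happens only after selection
        covered |= W[s][x]
        chosen.append(s)
    return chosen, covered

print(adaptive_greedy(2, lambda s: int(random.random() < p[s])))
```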
Example: Adaptive greedy policy
[Figure: the greedy policy as a decision tree over s1, s2, s3, branching on observed outcomes]
When does this adaptive greedy algorithm work?
Adaptive submodularity
[Golovin & Krause, COLT 2010]
f is called adaptive submodular iff
Δ(s | xA) ≥ Δ(s | xB) whenever xB observes more than xA (i.e., xA is a subrealization of xB)
f is called adaptive monotone iff Δ(s | xA) ≥ 0 for all xA
Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then
F(πgreedy) ≥ (1 - 1/e) F(πopt)
Many other results about submodular set functions can also be "lifted" to the adaptive setting!
Results for Adaptive Set Cover
[generalizes Asadpour et al., Goemans & Vondrák '06, Liu et al.]
[Figure: the greedy policy tree over s1, s2, s3]
Theorem:
Maximization: use the adaptive greedy algorithm to pick k sets
Coverage: run adaptive greedy until everything is covered
Lazy evaluations
In round i+1, we have picked tests s1, …, si and observed their outcomes, and pick
s* := argmax_s Δ(s | xA),
i.e., we maximize the "conditional marginal benefit".
Key observation: adaptive submodularity implies that marginal benefits can never increase as more observations are made!
"Lazy" greedy algorithm
Lazy greedy algorithm:
First iteration as usual
Keep an ordered list of marginal benefits Δi from the previous iteration
Re-evaluate Δi only for the top element
If Δi stays on top, use it; otherwise re-sort
[Figure: priority list of elements ordered by their stale marginal benefits]
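The lazy re-evaluation can be implemented with a max-heap of stale gains; a sketch, assuming F is any monotone submodular set function (toy coverage sets for illustration):

```python
import heapq

def lazy_greedy(V, F, k):
    """Greedy selection that re-evaluates only the top stale gain.  Correct for
    monotone submodular F, since true gains only shrink as the set grows."""
    A = []
    base = F(A)
    # max-heap (via negation) of stale gains, seeded with first-round gains
    heap = [(-(F([s]) - base), s) for s in V]
    heapq.heapify(heap)
    while len(A) < k and heap:
        neg_stale, s = heapq.heappop(heap)
        fresh = F(A + [s]) - F(A)       # re-evaluate only this element
        if not heap or fresh >= -heap[0][0]:
            A.append(s)                 # still on top: its gain is maximal
        else:
            heapq.heappush(heap, (-fresh, s))
    return A

cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
F = lambda A: len(set().union(*(cover[s] for s in A)))
print(lazy_greedy([1, 2, 3], F, 2))  # same picks as plain greedy: [3, 1]
```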
Results for lazy evaluations
[Figure: speedup from lazy evaluations; data from 357 traffic sensors along I-880 South]
Problem: sensor selection with failing sensors
f(A, xV) = information gain of the active sensors
Data-dependent bounds
[Figure: bounds on the same I-880 traffic-sensor data]
Adaptive submodularity also allows us to obtain data-dependent bounds, often much tighter than (1 - 1/e).
Example: Influence in social networks
[Kempe, Kleinberg, Tardos KDD '03]
[Figure: social network over Alice, Bob, Charlie, Dorothy, Eric, Fiona; edge weights (0.2–0.5) are probabilities of influencing a neighbor]
Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = expected number of people influenced when targeting A
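F(A) can be estimated by Monte Carlo simulation of the independent cascade model of Kempe et al. (a sketch; the tiny graph and probabilities below are made up, not read off the figure):

```python
import random

# edges[u] = list of (neighbor, probability that u influences them); invented.
edges = {"Alice":   [("Bob", 0.5), ("Dorothy", 0.3)],
         "Bob":     [("Charlie", 0.5)],
         "Dorothy": [("Eric", 0.2)],
         "Charlie": [], "Eric": [], "Fiona": []}

def simulate_cascade(A, rng):
    """One run of the independent cascade model from seed set A."""
    influenced, frontier = set(A), list(A)
    while frontier:
        u = frontier.pop()
        for v, p in edges[u]:
            if v not in influenced and rng.random() < p:
                influenced.add(v)
                frontier.append(v)
    return influenced

def F(A, runs=10000, seed=0):
    """Monte Carlo estimate of the expected number of people influenced."""
    rng = random.Random(seed)
    return sum(len(simulate_cascade(A, rng)) for _ in range(runs)) / runs

# Exact value for seeding Alice: 1 + .5 + .3 + .5*.5 + .3*.2 = 2.11
print(round(F({"Alice"}), 2))
```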
Application: Adaptive Viral Marketing
[Figure: the same social network; after targeting a person, we see whom they influence]
Adaptively select promotion targets.
After each selection, see which people are influenced.
Application: Adaptive Viral Marketing
[Figure: the cascade revealed so far after the first selections]
Theorem: the objective is adaptive submodular!
Hence, the adaptive greedy choice is a (1 - 1/e) ≈ 63% approximation to the optimal adaptive solution.
Application: Optimal Decision Trees
Prior over diseases P(Y), e.g. Y = "Sick"
Likelihood of test outcomes P(XV | Y) for tests X1 ("Fever"), X2 ("Rash"), X3 ("Cough")
Suppose that P(XV | Y) is deterministic (noise free).
Each test eliminates the hypotheses y inconsistent with its outcome.
How should we test to eliminate all incorrect hypotheses?
Expected benefit of a test s = expected mass of hypotheses ruled out by s if we know xA.
This is "generalized binary search", equivalent to maximizing information gain.
[Figure: the resulting decision tree, branching on X1, X2, X3]
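In the noise-free case, generalized binary search picks the test that rules out the most probability mass in expectation; a sketch with a hypothetical disease/test table:

```python
# Hypotheses: disease -> (prior probability, deterministic test outcomes).
H = {"flu":     (0.5, {"fever": 1, "rash": 0, "cough": 1}),
     "measles": (0.3, {"fever": 1, "rash": 1, "cough": 0}),
     "healthy": (0.2, {"fever": 0, "rash": 0, "cough": 0})}
TESTS = ["fever", "rash", "cough"]

def gbs(truth):
    """Generalized binary search: repeatedly run the test that rules out the
    most probability mass in expectation, until one hypothesis remains."""
    alive, trace = set(H), []
    while len(alive) > 1:
        def expected_ruled_out(t):
            mass = {0: 0.0, 1: 0.0}
            for y in alive:
                mass[H[y][1][t]] += H[y][0]
            total = mass[0] + mass[1]
            # sum over outcomes o of P(o) * (mass of the branch ruled out)
            return (mass[0] * mass[1] + mass[1] * mass[0]) / total
        t = max((u for u in TESTS if u not in trace), key=expected_ruled_out)
        outcome = H[truth][1][t]                 # observe the true outcome
        alive = {y for y in alive if H[y][1][t] == outcome}
        trace.append(t)
    return trace, alive.pop()

print(gbs("measles"))  # (['cough', 'fever'], 'measles')
```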
ODT is Adaptive Submodular
Objective = probability mass of the hypotheses you have ruled out. It's adaptive submodular.
[Figure: running a test (e.g. test v vs. test w) rules out the hypotheses inconsistent with the observed outcome (0 or 1)]
Guarantees for Optimal Decision Trees
[Figure: the example decision tree over tests x1, x2, x3, comparing approximation guarantees from prior work (Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010) with the guarantee obtained from the adaptive submodular analysis]
The result requires that tests are exact (no noise)!
Noisy Bayesian active learning
[with Daniel Golovin, Deb Ray]
In practice, observations typically are noisy, and the results for the noise-free case do not generalize ☹
[Figure: the diagnosis model with disease Y ("Sick") and tests X1 ("Fever"), X2 ("Rash"), X3 ("Cough")]
Key problem: tests no longer rule out hypotheses (they only make them less likely).
Intuitively, we want to gather just enough information to make the right decision!
Noisy active learning
Suppose I run all tests, i.e., observe XV = xV.
The best I can do is to choose the action a* maximizing expected utility given xV.
Key question: how can I cheaply test and still guarantee that I choose a*?
Existing approaches: generalized binary search? Maximize information gain? Maximize value of information?
None of these is adaptive submodular in the noisy setting!
Theorem: all these approaches can have cost more than n / log n times the optimal cost!
A new criterion for nonmyopic VOI
Strategy: replace the noisy problem with a noiseless one.
Key idea: make the test outcomes part of the hypothesis.
[Figure: each noisy hypothesis is expanded into vectors of test outcomes, e.g. [0,1,0], [1,1,0], [0,0,0], [0,0,1], [1,0,0], [1,0,1]]
A new criterion for nonmyopic VOI
Only need to distinguish between noisy hypotheses that lead to different decisions!
[Figure: graph over the outcome vectors [1,0,1], [0,0,0], [1,1,1], [1,0,0], [0,1,0]; an edge connects two hypotheses that lead to different decisions, and the weight of an edge is the product of the hypotheses' probabilities]
Objective: the total mass of edges cut by the observations made so far.
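A sketch of this edge-cutting objective (hypotheses, decisions, and probabilities are invented; an edge is counted as cut once either endpoint is inconsistent with an observed outcome):

```python
from itertools import combinations

# Noisy hypotheses: name -> (probability, decision it leads to, test outcomes).
H = {"h1": (0.4, "treat",   (1, 0, 1)),
     "h2": (0.3, "treat",   (1, 1, 1)),
     "h3": (0.2, "dismiss", (0, 0, 0)),
     "h4": (0.1, "dismiss", (1, 0, 0))}

# Edges connect hypotheses leading to *different* decisions;
# the weight of an edge is the product of its endpoints' probabilities.
EDGES = [(a, b, H[a][0] * H[b][0])
         for a, b in combinations(H, 2) if H[a][1] != H[b][1]]

def edges_cut(observed):
    """Total mass of edges cut: an edge is cut once either endpoint is
    inconsistent with some observed test outcome (test index -> outcome)."""
    def alive(h):
        return all(H[h][2][t] == x for t, x in observed.items())
    return sum(w for a, b, w in EDGES if not (alive(a) and alive(b)))

print(edges_cut({}))                # 0: nothing observed, nothing cut yet
print(round(edges_cut({0: 1}), 2))  # 0.14: edges touching h3 (test 0 reads 0)
```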
Theoretical guarantees
[with Daniel Golovin, Deb Ray]
Theorem: the edge-cutting objective is adaptive submodular and adaptive monotone.
Under a condition on the prior probabilities of the hypotheses, the greedy policy's expected cost is provably close to that of the optimal policy.
First approximation guarantees for nonmyopic VOI in general graphical models!
Example: The Iowa Gambling Task
What would you prefer?
[Figure: two gambles, each a probability distribution (.7 / .3) over the payoffs -10$, 0$, +10$]
Various competing theories on how people make decisions under uncertainty:
Maximize expected utility? [von Neumann & Morgenstern '47]
Constant relative risk aversion? [Pratt '64]
Portfolio optimization? [Hanoch & Levy '70]
Prospect theory? [Kahneman & Tversky '79]
How should we design tests to distinguish these theories?
Iowa Gambling as BED
Every possible test Xs = (gs,1, gs,2) is a pair of gambles.
Theories are parameterized by θ; each theory predicts a utility U(g, y, θ) for each gamble.
[Figure: graphical model with theory Y, parameters θ, and tests X1 = (g1,1, g1,2), X2 = (g2,1, g2,2), …, Xn = (gn,1, gn,2)]
Simulation Results
The adaptive submodular criterion outperforms existing approaches.
Preliminary Experimental Results
[in collaboration with Colin Camerer]
Run designs on 11 naïve subjects
Indication of heterogeneity in population
Algorithm useful in real-time settings
Current work: Community Sense & Respond
Krause, Chandy, Clayton, Heaton
How can we use community-held sensors to sense and respond to crises?
[Figure: community members contribute sensor data to the Community Seismic Network]
Sense: cascading failures, traffic jams, earthquakes
Respond: regulate grid and traffic; stop trains, elevators
Community Seismic Network
[Figure: P-wave and S-wave propagating from the epicenter to community-held sensors]
1M phones = 30 TB of data / day → need to decide what to send!
Deciding when/what to send
[Figure: accelerometer traces of "walking" data vs. a seismic event, the overlaid data, and the resulting data likelihood]
Need to adaptively calibrate the sensitivity of the network!
Use adaptive / online submodular optimization! ☺
Conclusions
Adaptive submodularity: a natural generalization of submodularity to adaptive policies
Many useful properties extend to the adaptive setting:
performance guarantees for greedy maximization and coverage, lazy evaluations, data-dependent bounds
Many applications are adaptive submodular:
stochastic set cover, viral marketing, active learning, Bayesian experimental design, …
Provides a unified view of several existing results, and allows us to develop natural generalizations
Thanks!