Optimizing Sensing using Adaptive Submodularity
Andreas Krause
Joint work with Daniel Golovin and Deb Ray
IPAM Machine Reasoning workshop, October 2010

Information gathering problems
  Environmental monitoring using AUVs
  Sensing in buildings
  Sensor placement for activity recognition
  Water distribution networks
  Selecting blogs & news

Information gathering problems
Want to learn something about the state of the world
  Estimate water quality in a geographic region, detect outbreaks, …
We can choose (partial) observations…
  Make measurements, place sensors, choose experimental parameters, …
… but they are expensive / limited
  Hardware cost, power consumption, …
Want to cost-effectively get the most useful information!
Fundamental challenge in ML/AI: How do we automate curiosity and serendipity?

Related work
Information gathering problems considered in Experimental design (Lindley ’56, Robbins ’52, …), Decision theory (Howard ’66, …), Operations Research (Nemhauser ’78, …), AI (Horvitz et al. ’84, …), Machine Learning (MacKay ’92, …), Spatial statistics (Cressie ’91, …), Robotics (Sim & Roy ’05, …), Sensor Networks (Zhao et al. ’04, …), …
Existing algorithms typically fall into one of two classes:
  Heuristics: No guarantees! Can do arbitrarily badly.
  Find optimal solutions (mixed integer programming, POMDPs): Very difficult to scale to bigger problems.

Research in my group
  Theoretical: Approximation algorithms that have theoretical guarantees and scale to large problems
  Applied: Empirical studies with real deployments and large datasets

Running example: Detecting fires
Want to place sensors to detect fires in buildings

Bayesian’s view of sensor networks
[Figure: graphical model with temperature variables X1, …, X6 and sensor values Y1, …, Y6]
  \( X_s \): temperature at location s
  \( Y_s \): sensor value at location s
  \( Y_s = X_s + \text{noise} \)
Joint probability distribution:
  \( P(X_1,\dots,X_n,Y_1,\dots,Y_n) = P(X_1,\dots,X_n)\, P(Y_1,\dots,Y_n \mid X_1,\dots,X_n) \)
  The first factor is the prior, the second the likelihood.

Utility of making observations
[Figure: graphical model with a cold/normal/hot (C/N/H) belief at each location, after observing Y1 = hot]
Less uncertain
  \( \mathrm{Reward}[P(X \mid Y_1 = \text{hot})] = 0.2 \)

A different outcome…
[Figure: the same model after observing Y3 = cold]
  \( \mathrm{Reward}[P(X \mid Y_3 = \text{cold})] = 0.1 \)

Example reward functions
Should we raise a fire alert?
  Actions \ Temp. X     Fiery hot    Normal/cold
  No alarm              -$$$         0
  Raise alarm           $            -$
Only have a belief about the temperature, \( P(X = \text{hot} \mid \text{obs}) \), so choose
  \( a^* = \arg\max_a \sum_x P(x \mid \text{obs})\, U(x,a) \)
Decision theoretic value of information:
  \( \mathrm{Reward}[P(X \mid \text{obs})] = \max_a \sum_x P(x \mid \text{obs})\, U(x,a) \)

Other example reward functions
Entropy:
  \( \mathrm{Reward}[P(X)] = -H(X) = \sum_x P(x) \log_2 P(x) \)
Expected mean squared prediction error (EMSE):
  \( \mathrm{Reward}[P(X)] = -\frac{1}{n} \sum_s \mathrm{Var}(X_s) \)
Many other objectives possible and useful…

Value of information [Lindley ’56, Howard ’64]
For any set A of sensors, its sensing quality is
  \( F(A) = \sum_{y_A} P(y_A)\, \mathrm{Reward}[P(X \mid y_A)] \)
  where \( y_A \) ranges over the observations made by sensors A, and \( \mathrm{Reward}[P(X \mid y_A)] \) is the reward when observing \( Y_A = y_A \).

Maximizing value of information [Krause, Guestrin, Journal of AI Research ’09]
Want to find a set \( A^* \) s.t. \( A^* = \arg\max_{|A| \le k} F(A) \)
Theorem: Complexity of optimizing value of information
  For chains (HMMs, etc.): optimally solvable in polytime
  For trees: \( \mathrm{NP}^{\mathrm{PP}} \)-complete

Approximating Value of Information
Given: finite set V of locations
Want: \( A^* \subseteq V \) such that \( A^* = \arg\max_{|A| \le k} F(A) \)
Typically NP-hard!
Greedy algorithm:
  Start with \( A = \emptyset \)
  For i = 1 to k
    \( s^* := \arg\max_s F(A \cup \{s\}) \)
    \( A := A \cup \{s^*\} \)
How well can this simple heuristic do?
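
The greedy algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function name `greedy_select`, the toy `COVERAGE` data, and the stand-in objective `toy_quality` (number of rooms covered) are all hypothetical, chosen only so that the example runs without a probabilistic model.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(ground_set: Sequence[str],
                  F: Callable[[FrozenSet[str]], float],
                  k: int) -> List[str]:
    """Pick up to k items, each time adding the item with the largest marginal gain."""
    selected: List[str] = []
    for _ in range(k):
        current_value = F(frozenset(selected))
        best_item, best_gain = None, float("-inf")
        for s in ground_set:
            if s in selected:
                continue
            # Marginal gain of adding s to the current selection.
            gain = F(frozenset(selected) | {s}) - current_value
            if gain > best_gain:
                best_item, best_gain = s, gain
        if best_item is None:  # nothing left to add
            break
        selected.append(best_item)
    return selected


# Hypothetical toy data: each candidate sensor location "sees" a set of rooms.
COVERAGE = {
    "s1": {1, 2, 3},
    "s2": {3, 4},
    "s3": {4, 5, 6, 7},
    "s4": {1, 7},
}


def toy_quality(A: FrozenSet[str]) -> float:
    """Stand-in for F(A): number of rooms covered by the chosen locations."""
    covered = set()
    for s in A:
        covered |= COVERAGE[s]
    return float(len(covered))


chosen = greedy_select(list(COVERAGE), toy_quality, k=2)
print(chosen, toy_quality(frozenset(chosen)))
```

Each round evaluates F once per remaining candidate, so the sketch makes on the order of k·|V| calls to F. Any other set function with the same signature, such as an information-gain estimate, could be passed in place of `toy_quality`.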

Performance of greedy
[Figure: office floor plan of the sensor deployment, comparing the optimal and the greedy placement]
Temperature data from sensor network
Greedy empirically close to optimal. Why?

Key observation: Diminishing returns
Selection A = {Y1, Y2}: adding a new sensor \( Y_s \) will help a lot!
Selection B = {Y1, …, Y5}: adding \( Y_s \) doesn’t help much.
Theorem [Krause and Guestrin, UAI ’05]: If the \( Y_s \) are conditionally independent given X, the information gain \( F(A) = H(X) - H(X \mid Y_A) \) is submodular!
Submodularity: for \( A \subseteq B \),
  \( F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B) \)
  Adding s to A gives a large improvement; adding s to B gives a small improvement.

One reason submodularity is useful
Theorem [Nemhauser et al. ’78]: The greedy algorithm gives a constant factor approximation
  \( F(A_{\text{greedy}}) \ge (1 - 1/e)\, F(A_{\text{opt}}) \), i.e., about 63%
Greedy algorithm gives near-optimal solution!
For information gain: Guarantees best possible unless P = NP! [Krause & Guestrin ’05]

Submodular Sensing Problems
  Sensing in buildings
  Environmental monitoring
  Sensor placement for activity recognition
  Selecting blogs & news
  Water distribution networks
Non-adaptively select sensor locations A to maximize a monotonic submodular function F(A)
Greedy algorithm finds a set of k near-optimal locations

Adaptive Optimization
[Figure: decision tree over tests x1, x2, x3 with outcomes 0 and 1]
Want to effectively diagnose while minimizing cost of testing!
Interested in a policy (decision tree), not a set.
Can’t use submodular set functions!!
Can we generalize submodularity to adaptive problems?

Example: Set cover
[Figure: three overlapping sets s1, s2, s3]
Set V = {1, 2, 3, …, n} of possible locations
Each location s is associated with a set \( W_s \)
Value of placement A: \( f(A) = \left| \bigcup_{s \in A} W_s \right| \)
f is a submodular function!
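
The diminishing-returns property of set cover can be checked by brute force on a small instance. The sketch below assumes that the value of a placement is the number of elements covered; the instance `W` and the helper names `f`, `subsets`, and `is_submodular` are hypothetical. The check enumerates all pairs A ⊆ B, so it is only feasible for tiny ground sets.

```python
from itertools import combinations

# Hypothetical instance: W[s] is the set of elements covered by location s.
W = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}


def f(A):
    """Coverage value: number of elements covered by the locations in A."""
    covered = set()
    for s in A:
        covered |= W[s]
    return len(covered)


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(func, V):
    """Check func(A + s) - func(A) >= func(B + s) - func(B) for all A <= B, s not in B."""
    for B in subsets(V):
        for A in subsets(B):
            for s in V:
                if s in B:
                    continue
                if func(A | {s}) - func(A) < func(B | {s}) - func(B):
                    return False
    return True


V = set(W)
print(is_submodular(f, V))                        # coverage passes the check
print(is_submodular(lambda A: len(A) ** 2, V))    # |A|^2 has increasing returns and fails
```

The second call uses a deliberately non-submodular function as a contrast: adding an item to a larger set helps more, not less, so the inequality is violated.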
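
Adaptive set cover
[Figure: sets s1, s2, s3, where the extent of s3 depends on whether X3 = 1 or X3 = 0]
Set V = {1, 2, 3, …, n} of possible locations
Independent random variables \( \{X_1, X_2, X_3, \dots, X_n\} \)
Each set \( W_s = W_s(X_s) \) depends on \( X_s \), which is revealed after selection
Value of A in state \( x_V \): \( f(A, x_V) = \left| \bigcup_{s \in A} W_s(x_s) \right| \)

Problem Statement
Given: Items V = {1, …, n}, associated with random variables \( X_1, \dots, X_n \) taking values in O
Objective \( f : 2^V \times O^V \to \mathbb{R} \)
Value of policy π: \( F(\pi) = \sum_{x_V} P(x_V)\, f(\pi(x_V), x_V) \)
  where \( \pi(x_V) \) is the set of tests run by π if the world is in state \( x_V \)
Want \( \pi^* \in \arg\max_{|\pi| \le k} F(\pi) \)
NP-hard (also hard to approximate!)

Adaptive greedy algorithm
Suppose we’ve seen \( X_A = x_A \).
Expected benefit of adding test s:
  \( \Delta(s \mid x_A) = \mathbb{E}\left[ f(A \cup \{s\}, x_V) - f(A, x_V) \mid x_A \right] \)
Adaptive Greedy algorithm:
  Start with \( A = \emptyset \)
  For i = 1:k
    Pick \( s^* \in \arg\max_s \Delta(s \mid x_A) \)
    Observe \( X_{s^*} = x_{s^*} \)
    Set \( A \leftarrow A \cup \{s^*\} \)

Example: Adaptive greedy policy
[Figure: greedy policy drawn as a tree over the sets s1, s2, s3]
When does this adaptive greedy algorithm work?

Adaptive submodularity [Golovin & Krause, COLT 2010]
f is called adaptive submodular iff \( \Delta(s \mid x_A) \ge \Delta(s \mid x_B) \) whenever \( x_B \) observes more than \( x_A \)
f is called adaptive monotone iff \( \Delta(s \mid x_A) \ge 0 \)
Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then
  \( F(\pi_{\text{greedy}}) \ge (1 - 1/e)\, F(\pi_{\text{opt}}) \)
Many other results about submodular set functions can also be “lifted” to the adaptive setting!

Results for Adaptive Set Cover [generalizes Asadpour et al., Goemans & Vondrak ’06, Liu et al.]
[Figure: greedy policy drawn as a tree over the sets s1, s2, s3]
Theorem:
  Maximization: Use the adaptive greedy algorithm to pick k sets
  Coverage: Run until everything is covered

Lazy evaluations
In round i+1, have picked \( A_i = \{s_1, \dots, s_i\} \), have observed \( X_{A_i} = x_{A_i} \), and pick \( s_{i+1} = \arg\max_s \Delta(s \mid x_{A_i}) \)
I.e., maximize the “conditional marginal benefit”
Key observation: Adaptive submodularity implies \( \Delta(s \mid x_{A_i}) \ge \Delta(s \mid x_{A_{i+1}}) \)
Marginal benefits can never increase!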
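
Choosing the item with the largest conditional marginal benefit can be sketched for adaptive set cover as follows. This is a toy illustration under stated assumptions: the outcome table `OUTCOMES`, the two example worlds, and the function names are hypothetical, and the random variables are assumed independent so that the expected benefit of a set depends only on its own outcome distribution and on what is already covered.

```python
# Hypothetical instance: OUTCOMES[s] lists (probability, covered elements) pairs,
# one pair per possible realization of the random variable attached to set s.
OUTCOMES = {
    "s1": [(0.6, {1, 2, 3, 4}), (0.4, set())],
    "s2": [(1.0, {1, 2})],
    "s3": [(0.5, {5, 6}), (0.5, {5})],
}


def expected_benefit(s, covered):
    """Expected number of newly covered elements if s is selected next."""
    return sum(p * len(W - covered) for p, W in OUTCOMES[s])


def adaptive_greedy(k, observe):
    """Run k rounds; observe(s) reveals the realized set of s after it is picked."""
    picked, covered = [], set()
    for _ in range(k):
        candidates = [s for s in OUTCOMES if s not in picked]
        if not candidates:
            break
        best = max(candidates, key=lambda s: expected_benefit(s, covered))
        realized = observe(best)  # outcome is revealed only after selection
        picked.append(best)
        covered |= realized
    return picked, covered


# Two hypothetical states of the world, each fixing the outcome of every set.
world_good = {"s1": {1, 2, 3, 4}, "s2": {1, 2}, "s3": {5, 6}}
world_bad = {"s1": set(), "s2": {1, 2}, "s3": {5}}

print(adaptive_greedy(2, world_good.__getitem__))
print(adaptive_greedy(2, world_bad.__getitem__))
```

In both worlds the policy starts with `s1`, whose expected benefit is highest. The second pick differs: once `s1` turns out to be empty, `s2` becomes the better choice, which is the sense in which the result is a policy and not a fixed set.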

“Lazy” greedy algorithm
Lazy greedy algorithm:
  First iteration as usual
  Keep an ordered list of marginal benefits \( \Delta_i \) from the previous iteration
  Re-evaluate \( \Delta_i \) only for the top element
  If \( \Delta_i \) stays on top, use it, otherwise re-sort
[Figure: bar chart of marginal benefits for items a to e, re-sorted after the top element is re-evaluated]

Results for lazy evaluations
Data from 357 traffic sensors along I-880 South
Problem: Sensor selection with failing sensors
  \( f(A, x_V) \) = information gain of active sensors

Data dependent bounds
Data from 357 traffic sensors along I-880 South
Adaptive submodularity also allows us to obtain data-dependent bounds, often much tighter than (1-1/e)

Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD ’03]
[Figure: social network over Alice, Bob, Charlie, Dorothy, Eric, and Fiona; each edge is labeled with a probability of influencing between 0.2 and 0.5]
Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = Expected number of people influenced when targeting A

Application: Adaptive Viral Marketing
[Figure: social network over Alice, Bob, Charlie, Daria, Eric, and Fiona with influence probabilities on the edges]
Adaptively select promotion targets
After each selection, see which people are influenced.

Application: Adaptive Viral Marketing
Theorem: The objective is adaptive submodular!
Hence, the adaptive greedy choice is a (1-1/e) ≈ 63% approximation to the optimal adaptive solution.

Application: Optimal Decision Trees
[Figure: graphical model with disease Y (“Sick”) and tests X1 (“Fever”), X2 (“Rash”), X3 (“Cough”)]
Prior over diseases P(Y)
Likelihood of test outcomes \( P(X_V \mid Y) \)
Suppose that \( P(X_V \mid Y) \) is deterministic (noise free)
Each test eliminates hypotheses y
How should we test to eliminate all incorrect hypotheses?
“Generalized binary search”: pick the test s that maximizes \( \Delta(s \mid x_A) \) = expected mass ruled out by s if we know \( x_A \)
Equivalent to maximizing information gain
[Figure: decision tree branching on X1 = 1, X3 = 0, X2 = 0, X2 = 1]

ODT is Adaptive Submodular
Objective = probability mass of hypotheses you have ruled out.
It’s adaptive submodular.
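
Generalized binary search in the noise-free setting can be sketched as below. The prior, the outcome table, and all names are hypothetical illustrations and do not come from the talk. At every step the sketch picks the test with the largest expected prior mass ruled out among the hypotheses that are still consistent with the observations.

```python
# Hypothetical prior over diseases and deterministic (noise-free) test outcomes.
PRIOR = {"flu": 0.4, "measles": 0.2, "cold": 0.3, "allergy": 0.1}
OUTCOME = {  # OUTCOME[test][hypothesis] is the outcome the test would show
    "fever": {"flu": 1, "measles": 1, "cold": 0, "allergy": 0},
    "rash": {"flu": 0, "measles": 1, "cold": 0, "allergy": 1},
    "cough": {"flu": 1, "measles": 0, "cold": 1, "allergy": 0},
}


def expected_mass_ruled_out(test, alive):
    """Expected prior mass eliminated by running `test`, given the hypotheses still alive."""
    total = sum(PRIOR[h] for h in alive)
    mass_by_outcome = {}
    for h in alive:
        o = OUTCOME[test][h]
        mass_by_outcome[o] = mass_by_outcome.get(o, 0.0) + PRIOR[h]
    # Outcome o occurs with probability m / total and rules out mass total - m.
    return sum((m / total) * (total - m) for m in mass_by_outcome.values())


def generalized_binary_search(true_hypothesis):
    """Greedily run tests until one hypothesis remains or no test can separate the rest."""
    alive, tests_run = set(PRIOR), []
    while len(alive) > 1:
        remaining = [t for t in OUTCOME if t not in tests_run]
        if not remaining:
            break
        best = max(remaining, key=lambda t: expected_mass_ruled_out(t, alive))
        if expected_mass_ruled_out(best, alive) <= 1e-12:
            break  # no remaining test distinguishes the surviving hypotheses
        outcome = OUTCOME[best][true_hypothesis]  # noise-free observation
        alive = {h for h in alive if OUTCOME[best][h] == outcome}
        tests_run.append(best)
    return tests_run, alive


for h in PRIOR:
    print(h, generalized_binary_search(h))
```

On this toy instance the first test is always `fever`, because it splits the prior mass most evenly (0.6 versus 0.4), and one further test identifies the hypothesis.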
[Figure: tests x, v, and w with outcomes 0 and 1, each ruling out part of the hypothesis space]

Guarantees for Optimal Decision Trees
[Figure: decision tree over tests x1, x2, x3 with outcomes 0 and 1]
Prior work: Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010
With adaptive submodular analysis!
Result requires that tests are exact (no noise)!

Noisy Bayesian active learning [with Daniel Golovin, Deb Ray]
[Figure: graphical model with disease Y (“Sick”) and tests X1 (“Fever”), X2 (“Rash”), X3 (“Cough”)]
In practice, observations typically are noisy
Results for the noise-free case do not generalize
Key problem: Tests no longer rule out hypotheses (they only make them less likely)
Intuitively, want to gather enough information to make the right decision!

Noisy active learning
Suppose I run all tests, i.e., see \( X_V = x_V \)
Best I can do is to choose the action a* that maximizes expected utility
Key question: How should I cheaply test to guarantee that I choose a*?
Existing approaches:
  Generalized binary search?
  Maximize information gain?
  Maximize value of information?
Not adaptive submodular in the noisy setting!
Theorem: All these approaches can have cost more than n/log n times the optimal cost!

A new criterion for nonmyopic VOI
Strategy: Replace the noisy problem with a noiseless problem
Key idea: Make test outcomes part of the hypothesis
[Figure: hypotheses expanded into vectors of test outcomes such as [0,1,0], [1,1,0], [0,0,0], [0,0,1], [1,0,0], [1,0,1]]

A new criterion for nonmyopic VOI
Only need to distinguish between noisy hypotheses that lead to different decisions!
[Figure: graph whose nodes are outcome vectors such as [1,0,1], [0,0,0], [1,1,1], [1,0,0], [0,1,0], with edges between hypotheses that lead to different decisions]
Weight of edge = product of the hypotheses’ probabilities
Suppose we find \( X_A = x_A \): the objective is the total mass of edges cut if observing \( x_A \)

Theoretical guarantees [with Daniel Golovin, Deb Ray]
Theorem: The edge-cutting objective is adaptive submodular and adaptive monotone.
Suppose … for all … . Then it holds that …
First approximation guarantees for nonmyopic VOI in general graphical models!

Example: The Iowa Gambling Task
What would you prefer?
[Figure: two gambles, each a probability distribution (with probabilities such as .7 and .3) over the payoffs -10$, 0$, and +10$]
Various competing theories on how people make decisions under uncertainty:
  Maximize expected utility? [von Neumann & Morgenstern ’47]
  Constant relative risk aversion? [Pratt ’64]
  Portfolio optimization? [Hanoch & Levy ’70]
  Prospect theory? [Kahneman & Tversky ’79]
How should we design tests to distinguish theories?

Iowa Gambling as Bayesian experimental design (BED)
[Figure: graphical model with theory Y and parameters θ as parents of the tests X1 = (g1,1, g1,2), X2 = (g2,1, g2,2), …, Xn = (gn,1, gn,2)]
Every possible test \( X_s = (g_{s,1}, g_{s,2}) \) is a pair of gambles
Theories are parameterized by θ
Each theory predicts a utility for each gamble, \( U(g, y, \theta) \)

Simulation Results
Adaptive submodular criterion outperforms existing approaches

Preliminary Experimental Results [in collaboration with Colin Camerer]
Run designs on 11 naïve subjects
Indication of heterogeneity in population
Algorithm useful in real-time settings

Current work: Community Sense & Respond
Krause, Chandy, Clayton, Heaton
How can we use community-held sensors to sense and respond to crises?
  Contribute sensor data: Community Seismic Network
  Sense: Cascading failures, traffic jams, earthquakes
  Respond: Regulate grid, traffic, stop trains, elevators

Community Seismic Network
[Figure: P-wave and S-wave spreading from the epicenter]
1M phones = 30TB data / day
Need to decide what to send

Deciding when/what to send
[Figure panels: “Walking” data, Seismic event, Overlaid data, Data likelihood]
Need to adaptively calibrate the sensitivity of the network!
Use adaptive / online submodular optimization!

Conclusions
Adaptive submodularity: A natural generalization of submodularity to adaptive policies
Many useful properties extend to the adaptive setting:
  Performance guarantees for greedy maximization and coverage
  Lazy evaluations
  Data dependent bounds
Many applications are adaptive submodular:
  Stochastic set cover, viral marketing, active learning, BED, …?
Provides a unified view on several existing results, and allows us to develop natural generalizations
Thanks: