Active Search with Complex Actions and Rewards
Yifei Ma
www.cs.cmu.edu/~yifeim/active_search.pdf (or *.pptx)
Collaborators: Jeff Schneider, Roman Garnett, Tzu-Kuo Huang, Dougal Sutherland

Active Search
Assume: a set of instances with some smoothness structure
Action: query the user/environment for a label value
Goal: find all positive instances as soon as possible [Azimi 2011]

Active Search: Social Networks
All users on a social network; a human inspector searches for frauds
©Tricia Aaderud 2012

Active Search: Environmental Monitoring
A body of water; measurements; find the polluted regions
©ying yang 2015

Active Search: Applications
Search for all positive items with active feedback
• Pollution in open water [Scerri 2014]
• Localization with an attention window [Houghten 2008, Haupt 2009]
• Public opinion search
• Active personalization [Li 2010, Yue 2011]
• Hyper-parameter optimization [Swersky 2012]

My Research
State of the art: Euclidean space, point action, point reward
Roadmap (point vs. group actions; point vs. region rewards):
Point action:
• Σ-optimality for active learning on graphs [Ma 2013]
• Active search on graphs [Ma 2014]
• Active area search [Ma 2014]
• Active pointillistic pattern search [Ma* 2015]
Group action:
• Active search for sparse rewards (in progress)
• Theory of everything?

Active Search on Graphs
Assume: a known graph with a smooth label function
Task: find all positive nodes interactively
Applications: finding target emails, public opinion search, etc.
Approach: exploit + explore
My focus: What is good exploration? [NIPS 2013] How to incorporate it into active search? [UAI 2015]

Good Exploration Is Similar to Experimental Design
Optimal design [Smith 1918]: design experiments to optimize some criterion (e.g. variance, entropy), blind to the actual observations
Criteria: D-optimality, V-optimality, Σ-optimality
E.g. regression y_i = x_i^T β: design the x_i, observe y_i, learn β
Also applies to kernel regression / Gaussian processes
[Photo: NOAA, public domain, via Wikimedia Commons]

Gaussian Random Fields [Zhu 2004]
Prior: f_i is the node value; A is the adjacency matrix; L = D − A is the graph Laplacian.
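A minimal numerical sketch of this prior and its posterior update, assuming the commonly used regularized covariance (L + δI)⁻¹ (L alone is singular) and i.i.d. Gaussian observation noise; the function name and constants are illustrative, not the paper's code:

```python
import numpy as np

def grf_posterior(A, s, y_s, delta=0.01, noise=0.1):
    """Posterior of a Gaussian random field on a graph.

    Prior (assumed): f ~ N(0, (L + delta*I)^{-1}) with L = D - A;
    delta > 0 regularizes the singular Laplacian. Observing y_s at the
    nodes in `s` with N(0, noise^2) errors gives a Gaussian posterior.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                # graph Laplacian D - A
    K = np.linalg.inv(L + delta * np.eye(n))      # prior covariance
    Kss = K[np.ix_(s, s)] + noise**2 * np.eye(len(s))
    Ks = K[:, s]
    mean = Ks @ np.linalg.solve(Kss, y_s)
    cov = K - Ks @ np.linalg.solve(Kss, Ks.T)
    return mean, cov

# toy 4-node path graph, one positive label observed at node 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
mean, cov = grf_posterior(A, s=[0], y_s=np.array([1.0]))
```

On the path graph the posterior mean decays smoothly with distance from the observed node, which is exactly the smoothness assumption the deck relies on.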
Design: pick nodes s, observe y_s, learn f.
L has the node degrees on the diagonal and −1 for each edge (e.g. on a 2-D grid, each row has a 4 on the diagonal and −1 in the four neighbor columns); conditioning the Gaussian prior on the observations yields the posterior covariance matrix.

D-Optimality (Information Criterion)?
Mutual information is roughly marginal variance
• Near-optimal sensor placement [Krause 2008]
• GP-bandit [Srinivas 2010]
• Level set estimation [Gotovos 2013]
• Bandits on graphs [Valko 2014]
Problem: wastes samples at boundaries

Visualization: D-Optimality Picks Outliers
High effective dimension; chooses the periphery
Coauthorship graph: 1711 nodes, 2898 edges. Labels (author area): machine learning, data mining, information retrieval, database

V-Optimality [Ji 2012]
Marginal variance minimization; the variance may not be bounded

Σ-Optimality for Active Surveying
Bayesian optimal active search and survey [Garnett 2012]: modeling the class proportion, Bayesian risk minimization
What happens if applied to graphs? Cluster centers? [Ma 2013]

Σ-Optimality on Graphs [Ma 2013]
Cluster centers! Better active learning accuracy
Monotone & submodular, so greedy selections work
[Plot: accuracy, higher is better; methods: unc, ig, mig, eer, vopt, sopt]

DBLP Coauthorship
1711 nodes (scholars); labels: machine learning, data mining, information retrieval, database; 2898 coauthorship edges; 2-D embedding by OpenOrd; used in Ji & Han 2012
[Plot: accuracy, higher is better]

Cora Citation Graph
2485 nodes (papers).
Labels: case based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, theory. 5069 undirected citation edges. Created by McCallum et al. 2000
[Plot: accuracy, higher is better]

Citeseer Citation Graph
2109 nodes (papers). Labels: Agents, AI, DB, IR, ML, HCI. 3665 edges (presence of a citation)
[Plot: accuracy, higher is better]

Handwritten Digits
1797 images at 8x8 resolution; 4-nearest-neighbor graph
Random subsample of 70% of the digits; first query fixed at the largest-degree node
[Plot: accuracy, higher is better]

Isolet 1+2+3+4
6238 spoken-letter recordings; 617-dimensional frequency features; 5-nearest-neighbor graph from the raw input
Random subsample of 70% of the instances; first query fixed at the largest-degree node
[Plot: accuracy, higher is better]

Active Surveying
[Plots: surveying performance, higher is better, on DBLP Coauthorship, Cora Citation, and Citeseer Citation]

Insights? Break It Down to a Greedy Application
Write down the current covariance matrix; the one-step look-ahead covariance is a rank-1 update.
The idea: an L1 objective (column sums) instead of an L2 objective, for robustness.

Active Search on Graphs [Ma 2015]
Trade off exploitation (posterior mean) against exploration (posterior standard deviation).
Prior work selects observations with GP-UCB-style rules: near-optimal sensor placements [Krause et al. 2008], level-set estimation [Gotovos et al. 2013], and Spectral-UCB [Valko et al. 2014] with a graph kernel from the regularized Laplacian (algorithm unchanged, improved theoretical guarantees). In practice these still explore boundaries first.
[Background figure: Krause et al. 2008, entropy vs. mutual-information sensor placements on Intel temperature data]
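The greedy rank-1 recipe above can be sketched as follows. The gain formulas instantiate the L1-vs-L2 idea: Σ-optimality scores a candidate by the squared column sum of the covariance (an L1-style quantity, since Laplacian-based covariances have nonnegative entries), V-optimality by the squared column norm. This is a hedged sketch, not the papers' exact pseudocode:

```python
import numpy as np

def greedy_select(Sigma, k, noise=0.1, criterion="sigma"):
    """Greedy batch selection via rank-1 look-ahead covariance updates.

    Observing node v (noise variance noise^2) updates the posterior as
        Sigma <- Sigma - Sigma[:,v] Sigma[v,:] / (Sigma[v,v] + noise^2).
    "sigma": pick v maximizing (sum_i Sigma[i,v])^2 / (Sigma[v,v] + noise^2),
             the one-step Sigma-optimality gain (L1-style column score).
    "v":     pick v maximizing ||Sigma[:,v]||^2 / (Sigma[v,v] + noise^2),
             the V-optimality gain (L2-style column score).
    """
    Sigma = Sigma.copy()
    chosen = []
    for _ in range(k):
        denom = np.diag(Sigma) + noise**2
        if criterion == "sigma":
            gain = Sigma.sum(axis=0)**2 / denom
        else:
            gain = (Sigma**2).sum(axis=0) / denom
        gain[chosen] = -np.inf                    # no repeated queries
        v = int(np.argmax(gain))
        chosen.append(v)
        Sigma -= np.outer(Sigma[:, v], Sigma[v, :]) / denom[v]
    return chosen

# prior covariance on a 5-node path graph (regularized Laplacian inverse)
A = np.diag(np.ones(4), 1)
A = A + A.T
K = np.linalg.inv(np.diag(A.sum(axis=1)) - A + 0.01 * np.eye(5))
picked = greedy_select(K, k=2)
```

On the path graph the Σ-optimality rule starts from an interior node rather than an endpoint, matching the cluster-centers-over-boundaries behavior shown on these slides.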
Main Contributions
• New exploration term: favor cluster centers over boundaries, measured by the improvement of Σ-optimality [Ma et al. 2013]
• GP-SOPT: trades off exploitation (posterior mean) against the Σ-optimality exploration term

Experiment: Populated Places
Nodes: 5000 populated places (countries, cities, towns, and villages); edges: Wikipedia links
Search target: 725 administrative regions; targets spread over components of varying sizes

Wikipedia Pages on Programming Languages
Nodes: 5271 programming languages; targets: 202 object-oriented-programming pages
Most targets reside in one large hub

Citation Network
Nodes: 14117 papers; targets: 1844 NIPS papers
Targets appear in many small components

Enron Scandal Emails
Nodes: 20112 emails; targets: 803 emails related to the downfall of Enron

Regret Analysis
High-probability regret bounds for the GP-SOPT.TT / GP-SOPT.TOPK variants: define the regret and the information gain of a non-repeating active policy, then bound the regret in terms of the information gain and compare with [ref 5].

Further Questions
Modeling covariance in high dimensions: side information [Alon et al. 2014]; sparse additive models [Kirthevasan 2015]

Further Questions
Markov decision processes, restless bandits, reinforcement learning
Further Questions
• Dynamics? Partial observation?
• Adjacency (A) vs. graph Laplacian covariance?
• Matching lower bound?
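Before moving on, the GP-SOPT-style selection rule from the graph section can be written down as one score per node: posterior mean plus an exploration bonus given by the one-step Σ-optimality gain. The exact functional form in the UAI 2015 paper may differ; this is an assumption-laden sketch:

```python
import numpy as np

def gp_sopt_score(mean, Sigma, alpha=1.0, noise=0.1):
    """UCB-style trade-off for active search on graphs (sketch).

    Exploitation: posterior mean of each node.
    Exploration (assumed form): the one-step Sigma-optimality reduction
    1^T Sigma[:,v] / sqrt(Sigma[v,v] + noise^2), which is large at
    cluster centers and small at cluster boundaries.
    """
    explore = Sigma.sum(axis=0) / np.sqrt(np.diag(Sigma) + noise**2)
    return mean + alpha * explore

# zero-mean posterior on a 4-node path graph: pure exploration
A = np.diag(np.ones(3), 1)
A = A + A.T
K = np.linalg.inv(np.diag(A.sum(axis=1)) - A + 0.01 * np.eye(4))
scores = gp_sopt_score(np.zeros(4), K)
```

With a zero mean the score reduces to the exploration term alone, which prefers interior nodes over the degree-1 endpoints.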
Table of Contents
• Graph-Based Region Rewards [wikipedia]
• Group Actions [Charginghawk, wikipedia]

Region Rewards
Have regions of water been polluted? Use a telephone poll to search for the opinions of a group of people? [crw-cmu] [wikipedia]

Related Settings
Bayesian optimization (BO) [Mockus 1989; Jones 1998; Brochu 2010]
Simple reward (BO) vs.
cumulative rewards (active search) [Azimi 2011]
[Image: Ying Yang]

Active Area Search (AAS) [Ma 2014]
Point actions on the upper level: pay to observe; assume a GP connection
Region rewards on the lower level: reward when a region integral exceeds a threshold
Inputs: GP prior, region definitions, threshold

1D Toy Example
Circles: collected observations; blue: GP posterior; gray/green: posteriors of the region integrals
Method: maximize the one-step look-ahead expected reward

Properties of Action Selection
Point selection: design x* to minimize the variance of the region integral [Minka]; connects to Σ-optimality
Region selection: high posterior mean AND a large ratio of variance reduction after the one-step look-ahead (under simplified assumptions)

Water Quality (Dissolved Oxygen)
Ground-truth dots: raw measurements; target regions: low values (black)
Measurements are re-picked via active search with region rewards
[Plots: recall of target regions; the re-picked measurements]

PA Election (Races vs. Precincts)
Search for positive electoral races with precinct queries
Dots: precinct centers (same color = same race); regions: races
Kernel on precincts built from demographic information

Identify Fluid-Flow Vortices via Point Observations [Ma 2015]
Observations: point vectors
Objective: overlapping 11x11 windows that contain a vortex
Classifier: a 2-layer neural net

Region Rewards, Further Thoughts
Define region rewards in multi-armed bandits?
• Use a bandit algorithm (UCB) to select regions
• Use mutual information to select points within a region
Can we get a better regret rate?
Relation to multi-task Bayesian optimization [Swersky 2013]? A region integral (AAS) is roughly an average value over multiple tasks (BO).

Table of Contents
• Graph-Based Region Rewards
• Group Actions

Group Actions
Test mixtures of drugs? Fly high to obtain noisy observations?
[Image: Charginghawk, wikipedia]

Related Work
• Adaptive compressive sensing does not help [Davenport 2011], except for log terms [Haupt 2009]
• Adaptive constrained sensing? Fixed batch-size MAB [Yue 2011]; twenty questions [Jadynak 2012]
• Other constraints? Non-i.i.d. signals? [Hedge 2015]

Demo of the IG Criterion
Actions: any subset of arms; outcome: the subset's mean value ± constant noise
Search objective: the non-zero entry of a 1-sparse vector
Method: maximize the information gain (IG) of an action
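The demo's model can be made concrete. Everything below is reconstructed from the slide text and hedged accordingly: one arm carries value θ, the rest are zero; querying a subset returns the subset's mean plus Gaussian noise; the expected information gain is computed by numerical integration over the outcome:

```python
import numpy as np
from itertools import combinations

def info_gain(prior, subset, theta=1.0, sigma=0.3, grid=None):
    """Expected information gain of querying a subset of arms.

    Assumed model: exactly one arm has value theta, the rest are 0;
    querying subset S returns the subset mean plus N(0, sigma^2) noise,
    i.e. theta/|S| if the target is in S, else 0.
    IG = H(prior over the target) - E_y[H(posterior | y)],
    with the expectation integrated numerically on an outcome grid.
    """
    if grid is None:
        grid = np.linspace(-2.0, 3.0, 400)
    n = len(prior)
    mu = np.array([theta / len(subset) if i in subset else 0.0
                   for i in range(n)])
    # Gaussian likelihood of each grid outcome under each hypothesis
    lik = np.exp(-(grid[None, :] - mu[:, None])**2 / (2 * sigma**2))
    lik /= sigma * np.sqrt(2 * np.pi)
    marg = prior @ lik                                # marginal density
    post = prior[:, None] * lik / np.maximum(marg, 1e-300)
    ent = lambda p: -np.sum(p * np.log(np.maximum(p, 1e-300)), axis=0)
    dy = grid[1] - grid[0]
    return ent(prior) - np.sum(marg * ent(post)) * dy

n = 8
prior = np.full(n, 1.0 / n)
best = max((frozenset(c) for r in range(1, n)
            for c in combinations(range(n), r)),
           key=lambda S: info_gain(prior, S))
```

Querying all arms is uninformative (every hypothesis predicts the same mean θ/n), so the IG criterion automatically avoids it; smaller subsets trade a stronger per-query signal against a coarser split of the hypotheses.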
Thank You!
Roadmap (point vs. group actions; point vs. region rewards):
Point action:
• Σ-optimality for active learning on graphs [Ma 2013]
• Active search on graphs [Ma 2014]
• Active area search [Ma 2014]
• Active pointillistic pattern search [Ma* 2015]
Group action:
• Active search for sparse rewards (in progress)
• Theory of everything?