Computational Intro: Conservation and Biodiversity
Wildlife Corridor Design

Carla P. Gomes
Joint work with Jon Conrad, Bistra Dilkina, Willem van Hoeve, Ashish Sabharwal, and Jordan Suter
Topics in Computational Sustainability, Spring 2010

Outline
– Wildlife corridor design problem: problem definition
– How hard is it to solve? Concepts of problem complexity
– How to model it? Mixed Integer Programming formulation and other issues
– How to solve it? How to scale up solutions?
– Experimental results
– Research questions

Problem Definition

Conservation and Biodiversity: Wildlife Corridors
– Wildlife corridors preserve wildlife against land fragmentation.
– They link core biological areas, allowing animal movement between areas.
– Limited budget; must maximize environmental benefits/utility.
[Image: New York Times (Science), 2006]

Conservation and Biodiversity: Grizzly Bear Wildlife Corridors
– Wildlife corridors link core biological areas, allowing animal movement between areas.
– Typically: low budgets to implement corridors.
– Example goal: preserve grizzly bear populations in the U.S. Northern Rockies by creating wildlife corridors connecting 3 reserves: Yellowstone National Park, Glacier Park, and the Salmon-Selway Ecosystem.

Grizzly Bear Corridor in Northern Rockies
– Real-world instance: corridor for grizzly bears in the Northern Rockies, connecting Yellowstone, the Salmon-Selway Ecosystem, and Glacier Park.
– Study area ~ 320,000 sq km.
– Estimating habitat suitability can be a challenging Machine Learning problem.
[Figure label: Cost]

Wildlife Corridor Design: Problem Definition (Informal English Definition)
Instance:
– A set of parcels and their neighborhood relationships
– A set of reserves or terminals (a subset of the parcels)
– The cost and the utility (habitat suitability) of each parcel
[Figure legend: Reserve; Land parcel. Cost and utility info omitted.]
Question:
– What is the set of connected parcels, containing the reserves, that maximizes the utility such that the total cost does not exceed a given budget C?
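The informal definition above can be made concrete with a small sketch. The code below is only an illustration under assumed names (is_connected, best_corridor, and the 2x3 toy grid are hypothetical, not from the lecture): it checks the three requirements (contains the reserves, stays within budget, is connected) and searches all parcel subsets exhaustively, which is exponential and only usable on tiny instances.

```python
from itertools import combinations

def is_connected(nodes, adj):
    """Return True if `nodes` induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes

def best_corridor(adj, cost, utility, reserves, budget):
    """Exhaustive search over parcel subsets (exponential; toy sizes only)."""
    parcels = sorted(adj)
    best_set, best_util = None, -1
    for k in range(len(reserves), len(parcels) + 1):
        for subset in combinations(parcels, k):
            chosen = set(subset)
            if not reserves <= chosen:
                continue  # must contain every reserve
            if sum(cost[p] for p in chosen) > budget:
                continue  # must respect the budget C
            if not is_connected(chosen, adj):
                continue  # must form one connected corridor
            total = sum(utility[p] for p in chosen)
            if total > best_util:
                best_set, best_util = chosen, total
    return best_set, best_util

# Hypothetical 2x3 grid of parcels:   0 1 2
#                                     3 4 5
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
cost = {0: 1, 1: 3, 2: 1, 3: 1, 4: 1, 5: 1}
utility = {0: 1, 1: 5, 2: 1, 3: 1, 4: 2, 5: 1}
reserves = {0, 5}
print(best_corridor(adj, cost, utility, reserves, budget=5))
```

On this toy grid a budget of 3 is infeasible (no connected set containing both reserves is that cheap), which is the feasibility side of the problem that the later experiments focus on.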
Example
[Figure: grid of parcels annotated with utility and cost]
– Min-cost solution: Cost = 7; Utility = 5
– Budget 10: Cost = 10; Utility = 9
– Budget 11: Cost = 11; Utility = 10

Wildlife Corridor Design (Graph Representation)
Input:
– A set of parcels and their neighborhood relationships
– A set of reserves or terminals (a subset of the parcels)
– The cost and the utility (habitat suitability) of each parcel
Output:
– A set of connected parcels, containing the reserves, that maximizes the utility such that the total cost does not exceed a given budget C
Undirected graph representation: G = (V,E)
[Figure legend: Reserve; Land parcel. Cost and utility info omitted in the pictures.]

The Connection Subgraph Problem (Optimization Version)
Instance:
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Vertex cost function c(v); utility function u(v)
– Cost bound / budget C
Question: What is the subgraph H of G with maximum utility such that
– H is connected and contains T
– cost(H) ≤ C?
Utility optimization version: given C, maximize utility.
Cost optimization version: given U, minimize cost.

The Connection Subgraph Problem (Decision Version)
Instance:
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Vertex cost function c(v); utility function u(v)
– Cost bound / budget C; desired utility U
Question: Is there a subgraph H of G such that
– H is connected and contains T
– cost(H) ≤ C and utility(H) ≥ U?

Connection Subgraph: Other Possible Applications
Social networks
– What characterizes the connection between two individuals? The shortest path? The size of the connected component? A “good” connected subgraph?
– If a person is infected with a disease, who else is likely to be?
– Which people have unexpected ties to any members of a list of other individuals?
– Vertices in the graph are people; an edge indicates that two people know each other [Faloutsos, McCurley, Tompkins ’04]
Project: Find other applications of the connection subgraph problem and its variants, and apply/extend the ideas presented in this lecture.

Concepts of Problem Complexity: Easy vs. Hard Problems

How hard (complex) is it to solve the connection subgraph problem?
Before answering this question…

How do computer scientists differentiate between good (efficient) and bad (inefficient) algorithms?
The yardstick: any algorithm that runs in no more than polynomial time is an efficient algorithm; everything else is not.

Functions ordered by their growth rates
Order   Function          Name
1       c                 constant
2       lg n              logarithmic
3       lg^c n            polylogarithmic
4       n^r, 0<r<1        sublinear
5       n                 linear
6       n^r, 1<r<2        subquadratic
7       n^2               quadratic
8       n^3               cubic
9       n^c, c≥1          polynomial
10      r^n, r>1          exponential
Orders 1 to 9: efficient algorithms. Order 10: not efficient algorithms.

Roughly Speaking…
[Figure: cost (run time) versus size of instance N, with curves for constant, logarithmic, linear, quadratic, and exponential growth]

Polynomial vs. exponential growth (Harel 2000)
[Figure: exponential versus polynomial growth. Exponential: binary B&B algorithm. Polynomial: N^2, LP’s interior point, min. cost flow algorithm, transportation algorithm, assignment algorithm, Dijkstra’s algorithm.]

How can we show a problem is efficiently solvable?
– We can show it constructively: we provide an algorithm and show that it solves the problem efficiently.
– Shortest path problem: Dijkstra’s algorithm runs in polynomial time. Therefore the shortest path problem can be solved efficiently.
– Linear Programming: the interior point method has polynomial worst-case complexity. Therefore linear programming can be solved efficiently. (*)
(*) The simplex method has exponential worst-case complexity. However, in practice the simplex algorithm seems to scale as m^3, where m is the number of functional constraints.

How can we show a problem is not efficiently solvable?
– How do you prove a negative? Much harder!
– This is the aim of complexity theory.
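The constructive argument for the shortest path problem can be illustrated with a short sketch of Dijkstra's algorithm. This is a minimal textbook version using a binary heap, assuming nonnegative edge weights; the function name dijkstra and the example graph are illustrative choices, not taken from the lecture. With a binary heap the running time is O((|V| + |E|) log |V|), which is polynomial in the size of the instance.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source`.

    graph[u] is a list of (v, weight) pairs with weight >= 0.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical directed example graph.
example = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
print(dijkstra(example, "a"))
```

Each edge is relaxed at most once per successful distance update, and each heap operation costs logarithmic time, which is where the polynomial bound comes from.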
Easy (Efficiently Solvable) Problems vs. Hard Problems
– Easy problems: we consider a problem X to be “easy”, or efficiently solvable, if there is a polynomial-time algorithm A for solving X. We denote by P the class of problems solvable in polynomial time.
– Hard problems: everything else. Any problem for which there is no polynomial-time algorithm is an intractable problem.

NP-Complete and NP-Hard Problems: Explosive Combinatorics
[Figure: application areas of hard computational problems: Planning and Scheduling; Supply Chain Management; Experiment Design; Satisfiability, e.g. (A or B) (D or E or not A); Data Analysis and Data Mining; Protein Folding; Capital Budgeting and Financial Applications; Medical Applications; Combinatorial Auctions; Information Retrieval; Software & Hardware Verification; Fiber optics routing; many more applications!]
– Hard computational problems scale exponentially in the worst case (exponential-time algorithms).
– Tackling practical-size instances requires powerful computational and mathematical tools!
[Figure: exponential function versus polynomial function]

How hard (complex) is the connection subgraph problem?
– The connection subgraph problem is NP-hard.
– Unfortunately, that means we don’t know of good, efficient (polynomial-time) algorithms to solve this problem.
– We believe the connection subgraph problem is intractable: computer scientists only know of exponential-time algorithms to solve it (and strongly believe that no polynomial-time algorithm will ever be found, but there is no proof either way).
Reference: Conrad, J., C. Gomes, W.-J. van Hoeve, A. Sabharwal, and J. Suter. Connections in networks: Hardness of feasibility versus optimality. Proc. CPAIOR 07, 2007, pages 16–28.

The connection subgraph problem is NP-hard! Should we give up on finding good solutions?
– This is a worst-case result!
– Real-world problems are not necessarily worst case; they often possess hidden sub-structure that can be exploited, allowing solutions to scale up.

Encoding the Connection Subgraph Problem as a Mixed Integer Programming Problem

Single Commodity Flow Encoding
– Variables: x_i, a binary variable for each vertex i (1 if included in the corridor; 0 otherwise); y_ij, a continuous variable for the flow on each edge (i,j)
– Cost constraint: Σ_i c_i x_i ≤ C
– Utility optimization function: maximize Σ_i u_i x_i
– Connectedness: use a single commodity flow encoding
[Figure: example flow network with root r; max flow = 9]

Single Commodity Flow: MIP
[Model shown as an image; its annotated parts are:]
– Max utility (objective)
– Budget constraint
– Reserves
– Total flow
– Flow balance
– Incoming edges allowed only if selected
– Callout: “This is what makes the problem hard”
Note: E’ is the set of directed edges obtained by replacing each undirected edge of E with two directed edges.

Solving the Mixed Integer Programming Encoding
[Diagram: connection subgraph instance → MIP model → CPLEX (feasibility + optimization) → solution]
– CPLEX: state-of-the-art MIP solver
– Branch and Bound
– LP relaxation
– Cut generation

Experimental Results

Synthetic Instances for Evaluation
– Problem evaluated on semi-structured graphs
– m x m lattice / grid graph with k terminals
– Inspired by the conservation corridors problem
– Place one terminal each at the top-left and bottom-right corners (maximizes grid use)
– Place the remaining terminals randomly
– Assign uniform random costs and utilities from {0, 1, …, 10}
[Figure: example with m = 4, k = 4]

Standard MIP Results: Without Terminals
– No terminals: “find the connected component that maximizes the utility within the given budget”
– Pure optimization problem; always feasible
– Still NP-hard
– A clear easy-hard-easy pattern with uniform random costs & utilities
Note 1: plot in log-scale for better viewing of the sharp transitions
Note 2: each data point is the median over 100+ random instances
[Plot: runtime (log scale) versus budget fraction for 6x6, 8x8, and 10x10 lattices]

Standard MIP: 3 Terminals (Feasibility vs. Optimization)
– Split instances into feasible and infeasible; plot median runtime
– For feasible ones: computation involves proving optimality
– For infeasible ones: computation involves proving infeasibility
– Infeasible instances take much longer than the feasible ones!

Results: With Terminals
[Diagram: connection subgraph instance → MIP model → CPLEX (feasibility + optimization) → solution]
– Problem? MIP + CPLEX is really weak at feasibility testing
– Poor scaling: couldn’t even get close to handling real data
– Can we do better?
Ashish Sabharwal, CP-AI-OR '08, May 23, 2008

A Related Problem (Ignoring Utilities): Minimum Cost Solution
The Steiner Tree Problem
Input:
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Edge cost function c(e)
Question: What is the subgraph H of G with minimum cost such that H is connected and contains T?
If the edge costs are all positive, then the resulting subgraph is necessarily a tree: any cycle would contain an edge that could be dropped to lower the cost.

The Steiner Tree Problem: Min Cost Tree Connecting the Terminals
Also NP-hard, but:
– With only two terminals it reduces to a shortest path problem (e.g., Dijkstra’s algorithm or an algorithm based on dynamic programming)
– With a bounded number of terminals there is a fixed-parameter tractable algorithm

The Steiner Tree Problem: Min Cost Tree Connecting the Terminals
Three terminals (as in the case of our grizzly bear problem)
– Algorithm: to connect the three terminals, find where to place the root of the tree by computing all-pairs shortest paths (an easy algorithm based on dynamic programming, or even Dijkstra’s)
– The algorithm is also used as the starting point of a greedy solution: start with the minimum-cost corridor and extend it greedily, picking nodes in order of decreasing utility/cost ratio to use the remaining budget
– The algorithm is also used for pruning: nodes so far away that connecting them to the terminals would exceed the budget can be pruned

Solving the Connection Subgraph Problem: Two-Phase Approach
1st Phase – computes the minimum Steiner tree and, based on it, produces a greedy solution.
This phase runs in polynomial time for a constant number of terminal nodes.
2nd Phase – refines the greedy solution to produce an optimal solution with CPLEX.

Solving the Connection Subgraph Problem: Phase I
1st Phase – minimum Steiner tree based algorithm:
– Produces the minimum-cost solution
– Produces shortest-path information used for pruning the search space: the all-pairs-shortest-paths (APSP) matrix
– Produces a greedy (and often sub-optimal) solution for feasible instances (the parcels with the highest utility/cost ratio are selected to use the remaining budget)
This phase runs in polynomial time for a constant number of terminal nodes.

Solving the Connection Subgraph Problem: Phase II
Refines the greedy solution to produce an optimal solution with CPLEX:
– The greedy solution is passed to CPLEX as the starting solution (CPLEX can change it).
– The all-pairs-shortest-paths matrix computed in Phase I is also passed on to Phase II. It is used to statically (i.e., at the beginning) prune away all nodes that are easily deduced to be too far away to be part of a solution (e.g., if the minimum Steiner tree containing that node and all of the terminal vertices already exceeds the budget). This significantly reduces the search space size, often by 40-60%.
– Computes an optimal solution (or the optimal extended-min-cost solution) to the utility-maximization version of the connection subgraph problem.

Solving the Connection Subgraph Problem: Exploiting Structure (A Hybrid MIP/CP Approach)
[Diagram: connection subgraph instance → ignore utilities → compute min-cost Steiner tree → min-cost solution and APSP matrix → MIP model with 40-60% pruned; greedily extend the min-cost solution to fill the budget (“like” knapsack: max u/c) → higher-utility feasible solution → CPLEX starting solution; dynamic pruning; CPLEX (feasibility, optimization) → solution]
Example APSP matrix:
0 3 6 2 8
3 0 7 4 1
6 7 0 5 9
2 4 5 0 1
8 1 9 1 0
Conrad, G., van Hoeve, Sabharwal, Suter 2008

10x10 random lattices, 3 reserves
– Infeasible instances solved instantaneously!
– ~20x improvement in runtime on feasible instances

10x10 random lattices, 3 reserves
– Gap between optimal and extended-optimal solutions
– Peak of hardness still strongly correlated with budget slack

Experimental Results: Yellowstone Case

Grizzly Bear Corridor in Northern Rockies
– Real-world instance: corridor for grizzly bears in the Northern Rockies, connecting Yellowstone, the Salmon-Selway Ecosystem, and Glacier Park.
– Study area ~ 320,000 sq km.
– Estimating habitat suitability can be a challenging Machine Learning problem.
[Figure label: Cost]

Min Cost Solution for Different Granularities

Real Data, 50x50 km Parcels
– Gap between optimal and extended-optimal solutions peaks in a critical region right after min-cost

Real Data, 40x40 km Parcels
– Gap between optimal and extended-optimal solutions peaks in a critical region right after min-cost

Research Issues

Encodings
– Complete methods (proof of optimality)
   – Other MIP formulations that scale better in practice?
   – Other formulations that allow us to prove optimality faster?
   – Other paradigms (e.g., constraint based, SAT modulo theories, extensions of SAT solvers, mixed logic programming)?
– Incomplete methods (cannot prove optimality but may find good solutions)
   – Simulated annealing, genetic algorithms, etc.
– Hybrid complete/incomplete methods

Approximation Results
– Cost optimization: NP-hard to approximate within a factor of 1.36
– Utility version?
Related work
– Moss & Rabani 2001/2007: Node-Weighted Steiner Tree; costs and utilities on nodes; approximation results
– Costa et al. 2006/2008/2009: Steiner Tree with Budget, Revenues and Hop Constraints; costs and utilities on edges; Directed Steiner Tree encoding and Branch-and-Cut
Bistra Dilkina is interested in these issues.

Models Are Important!
Single Commodity Flow
– Quite compact (polynomial size)
Directed Steiner Tree
– Exponential number of constraints
– Better captures the connectedness structure
– Provides good upper bounds!
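To make the compactness of the single commodity flow model concrete, here is one standard way to write it out. This is an assumed reconstruction, not the lecture's exact model: it follows the variables x_i, y_ij and the directed edge set E’ defined on the encoding slide, and adds two assumptions of its own, a root r chosen among the terminals (so T is assumed nonempty) and an auxiliary variable s for the flow injected at r. The bracket [i = r] is 1 when i is the root and 0 otherwise.

```latex
\begin{align*}
\max \quad & \sum_{i \in V} u_i x_i
  && \text{(max utility)} \\
\text{s.t.} \quad & \sum_{i \in V} c_i x_i \le C
  && \text{(budget constraint)} \\
& x_t = 1 \quad \forall t \in T
  && \text{(reserves)} \\
& s = \sum_{i \in V} x_i
  && \text{(total flow injected at the root } r \in T\text{)} \\
& [i = r]\, s + \sum_{(j,i) \in E'} y_{ji} = x_i + \sum_{(i,j) \in E'} y_{ij} \quad \forall i \in V
  && \text{(flow balance)} \\
& y_{ij} \le |V|\, x_j \quad \forall (i,j) \in E'
  && \text{(incoming edges allowed only if selected)} \\
& x_i \in \{0,1\}, \quad y_{ij} \ge 0, \quad s \ge 0
\end{align*}
```

Every selected vertex consumes one unit of flow, all flow originates at r, and flow may only enter selected vertices, so each selected vertex must be reachable from r through selected vertices. The model has |V| binary variables, 2|E| + 1 continuous variables, and a number of constraints linear in |V| + |E|, which is the sense in which it is of polynomial size.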
Conrad, Dilkina, Gomes, van Hoeve, Sabharwal, Suter 2007, 2008, 2009

A Broad Class of Applications for Projects
A family of problems: spatially targeted interventions
– Conservation and biodiversity: site selection, reserve network design, wildlife corridors
– Social welfare: portfolios of asset-based poverty interventions
Bistra Dilkina 2009

Spatially Targeted Interventions
Select a subset A of spatially explicit actions U
– Maximize a sustainability function F
– Such that the cost of the actions does not exceed a limited budget B
max F(A) s.t. C(A) ≤ B
Complexity added by:
– Spatial constraints (connectivity, distance, etc.)
– Data uncertainty
– Dynamics: meta-population models, climate change

Additional Levels of Complexity: Stochasticity, Uncertainty, Large-Scale Data Modeling
• How to estimate population distributions and habitat suitability? Where and how to collect data?
• Multiple species (hundreds or thousands), with interactions (e.g., predator/prey).
• Biological and ecological issues (for a species and within species)
• Movements and migrations
• Climate change
• Other factors (e.g., different models of land conservation, such as purchase, conservation easements, and auctions, typically over different time periods)
• What different objective functions can we consider for preserving species and biodiversity?
[Figure labels: Information Sciences; Maxent (Steven Phillips, Miro Dudik & Rob Schapire); Eastern Phoebe Migration. Source: Daniel Fink. Bagged Decision Trees: Daniel Fink, Wesley Hochachka, Art Munson, Mirek Riedewald, Ben Shaby, Giles Hooker, and Steve Kelling, 2009.]

Summary
– Wildlife corridor problem: problem formulation; computational complexity issues; models and solution approaches
– Research questions
– Our approaches clearly outperform approaches reported in the literature!

The End!

Theoretical Results: 1
NP-completeness: reduction from the Steiner Tree problem, preserving the cost function.
Idea:
– The Steiner tree problem is already very similar
– Simulate edge costs with node costs
– Simulate terminal vertices with the utility function
NP-complete even without any terminals
– Recall: the Steiner tree problem is poly-time solvable with a constant number of terminals
Also holds for planar graphs

Theoretical Results: 2
NP-hardness of approximating cost optimization (factor 1.36): reduction from the Vertex Cover problem
– Reduction motivated by Steiner tree work [Bern, Plassmann ’89]
[Figure: reduction gadget with vertices v1, v2, v3, …, vn]
– There is a vertex cover of size k iff there is a connection subgraph with cost bound C = k and utility U = m
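The remark that the Steiner tree problem is poly-time solvable with a constant number of terminals can be sketched for the three-terminal case used in Phase I. The code below is a minimal illustration under assumptions, not the authors' implementation: it uses edge costs (as in the Steiner tree slide), Floyd-Warshall for the all-pairs-shortest-paths matrix, and hypothetical function names. With nonnegative edge costs, an optimal tree for three terminals consists of shortest paths from a single branching vertex to each terminal, so trying every vertex as that root gives the minimum cost.

```python
def all_pairs_shortest_paths(n, edges):
    """Floyd-Warshall on an undirected graph with vertices 0..n-1.

    edges is a list of (u, v, weight) triples with weight >= 0.
    """
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def three_terminal_steiner(n, edges, terminals):
    """Return (min cost, root vertex) of a tree connecting three terminals."""
    dist = all_pairs_shortest_paths(n, edges)
    t1, t2, t3 = terminals
    # Try each vertex as the branching point (root) of the tree.
    return min((dist[v][t1] + dist[v][t2] + dist[v][t3], v) for v in range(n))

# Hypothetical example: terminals 0, 1, 2; vertex 3 is a cheap hub.
edges = [(0, 3, 1), (1, 3, 1), (2, 3, 1), (0, 1, 3), (1, 2, 3)]
print(three_terminal_steiner(4, edges, (0, 1, 2)))
```

The same distance matrix supports the static pruning described for Phase II: a vertex whose cheapest connection to all terminals already exceeds the budget cannot be part of any feasible solution.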