Computational Intro:
Conservation and Biodiversity
Wildlife Corridor Design
Carla P. Gomes
Joint work with Jon Conrad, Bistra Dilkina, Willem van Hoeve,
Ashish Sabharwal, and Jordan Suter
Topics in Computational Sustainability
Spring 2010
Outline
• Wildlife corridor design problem
– Problem Definition
• How hard is it to solve?
– Concepts of Problem Complexity
• How to model it?
– Mixed Integer Programming formulation and other issues
• How to solve it?
– How to scale up solutions?
• Experimental Results
• Research Questions
Problem Definition
Conservation and Biodiversity:
Wildlife Corridors
• Preserve wildlife against land fragmentation
• Link core biological areas, allowing animal movement between areas
• Limited budget; must maximize environmental benefits/utility
New York Times (Science) 2006
Conservation and Biodiversity:
Grizzly Bear Wildlife Corridors
Wildlife corridors link core biological areas, allowing animal movement between areas.
Typically: low budgets to implement corridors.
Example:
Goal: preserve grizzly bear populations in the U.S. Northern Rockies by creating wildlife corridors connecting 3 reserves:
– Yellowstone National Park
– Glacier Park
– Salmon-Selway Ecosystem
Grizzly Bear Corridor in Northern Rockies
Real-world instance: corridor for grizzly bears in the Northern Rockies, connecting Yellowstone, the Salmon-Selway Ecosystem, and Glacier Park.
Study area: ~320,000 sq km
[Figure: maps of habitat suitability and cost over the study area]
Estimating habitat suitability can be a challenging Machine Learning problem.
Wildlife Corridor Design:
Problem Definition
(Informal English Definition)
• Instance:
– A set of parcels and their neighborhood relationships
– A set of reserves or terminals (subset of the parcels)
– The cost and the utility (habitat suitability) per parcel
• Question:
– What is the set of connected parcels, containing the reserves, maximizing the utility, such that the total cost does not exceed a given budget C?
[Figure: grid of land parcels with the reserves marked; cost and utility info omitted]
Example
[Figure: parcel grid labeled with utility and cost]
Budget 10: Cost = 10; Utility = 9
Budget 11: Cost = 11; Utility = 10
Example
[Figure: parcel grid labeled with utility and cost]
Min-cost solution: Cost = 7; Utility = 5
Budget 10: Cost = 10; Utility = 9
Budget 11: Cost = 11; Utility = 10
Wildlife Corridor Design:
(Graph Representation)
• Input:
– A set of parcels and their neighborhood relationship
– A set of reserves or terminals (subset of the parcels)
– The cost and the utility (habitat suitability) per parcel
• Output:
– A set of connected parcels, containing the reserves, maximizing the utility, such that the total cost does not exceed a given budget C
Undirected graph representation: G = (V,E)
[Figure: parcel grid and its graph, with reserves and land parcels as vertices; cost and utility info omitted in the pictures]
The Connection Subgraph Problem
(Optimization Version)
Instance
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Vertex cost function: c(v); utility function: u(v)
– Cost bound / budget C
Question
What is the subgraph H of G with maximum utility such that
– H is connected and contains T
– cost(H) ≤ C?
Utility optimization version: given C, maximize utility
Cost optimization version: given U, minimize cost
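To make the utility optimization version concrete, here is a minimal brute-force sketch. It is not the method used in the lecture: it simply enumerates every vertex subset, which takes exponential time and is only usable on toy graphs. The function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def is_connected(adj, nodes):
    """Return True if `nodes` induces a connected subgraph (the empty set counts as connected)."""
    nodes = set(nodes)
    if not nodes:
        return True
    stack = [next(iter(nodes))]
    seen = set(stack)
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def brute_force_max_utility(adj, terminals, cost, utility, budget):
    """Try all 2^|V| vertex subsets; return (best_set, best_utility), or None if infeasible."""
    vertices = list(adj)
    required = set(terminals)
    best = None
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if not required <= chosen:
                continue  # must contain every terminal
            if sum(cost[v] for v in chosen) > budget:
                continue  # must respect the budget C
            if not is_connected(adj, chosen):
                continue  # must be connected
            total = sum(utility[v] for v in chosen)
            if best is None or total > best[1]:
                best = (chosen, total)
    return best


# Toy instance (hypothetical): a path 0 - 1 - 2 whose two endpoints are reserves.
adj = {0: [1], 1: [0, 2], 2: [1]}
cost = {0: 1, 1: 1, 2: 1}
utility = {0: 1, 1: 5, 2: 1}
print(brute_force_max_utility(adj, [0, 2], cost, utility, budget=3))
print(brute_force_max_utility(adj, [0, 2], cost, utility, budget=2))
```

With budget 3 the whole path is selected; with budget 2 the two reserves cannot be connected, so the instance is infeasible.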
The Connection Subgraph Problem
(Decision Version)
Instance
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Vertex cost function: c(v); utility function: u(v)
– Cost bound / budget C; desired utility U
Question
Is there a subgraph H of G such that
– H is connected and contains T
– cost(H) ≤ C; utility(H) ≥ U?
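A proposed answer to the decision version is easy to check, even though finding one may be hard. The sketch below verifies a candidate vertex set H in time linear in the size of the graph. The function name, the adjacency-dictionary input, and the choice to reject an empty H are assumptions made for illustration.

```python
from collections import deque


def certifies_yes(adj, terminals, cost, utility, H, C, U):
    """Check that vertex set H is connected, contains T, costs at most C, and has utility at least U."""
    H = set(H)
    if not H or not set(terminals) <= H:
        return False  # assumption: an empty candidate is rejected
    # Breadth-first search restricted to H tests connectivity of the induced subgraph.
    start = next(iter(H))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in H and w not in seen:
                seen.add(w)
                queue.append(w)
    if seen != H:
        return False
    total_cost = sum(cost[v] for v in H)
    total_utility = sum(utility[v] for v in H)
    return total_cost <= C and total_utility >= U


# Toy instance (hypothetical): a path 0 - 1 - 2 whose two endpoints are terminals.
adj = {0: [1], 1: [0, 2], 2: [1]}
cost = {0: 1, 1: 1, 2: 1}
utility = {0: 1, 1: 5, 2: 1}
print(certifies_yes(adj, [0, 2], cost, utility, {0, 1, 2}, C=3, U=7))
print(certifies_yes(adj, [0, 2], cost, utility, {0, 2}, C=3, U=2))
```

The second call fails because {0, 2} is not connected without vertex 1.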
Connection Subgraph:
other possible applications
Social networks
• What characterizes the connection between two individuals? The shortest path? Size of the connected component? A “good” connected subgraph?
• If a person is infected with a disease, who else is likely to be?
• Which people have unexpected ties to any members of a list of other individuals?
• Vertices in graph: people; edges: know each other or not
[Faloutsos, McCurley, Tompkins ’04]
Project: Find other applications of the connection subgraph problem and its variants, and apply/extend ideas presented in this lecture.
Concepts of Problem Complexity:
Easy vs. hard problems
How hard (complex) is it to solve the connection subgraph problem?
Before answering this question…
How do computer scientists differentiate between good (efficient) and bad (inefficient) algorithms?
The yardstick: any algorithm that runs in no more than polynomial time is considered efficient; everything else is not.
Ordered functions by their growth rates
Order   Function        Name
1       c               constant
2       lg n            logarithmic
3       lg^c n          polylogarithmic
4       n^r, 0<r<1      sublinear
5       n               linear
6       n^r, 1<r<2      subquadratic
7       n^2             quadratic
8       n^3             cubic
9       n^c, c≥1        polynomial
10      r^n, r>1        exponential
Efficient algorithms: orders 1 to 9
Not efficient algorithms: order 10
Roughly Speaking…
[Figure: cost (run time) versus size of instance N, with curves for constant, logarithmic, linear, quadratic, and exponential growth]
Polynomial vs. exponential growth
(Harel 2000)
[Figure: exponential versus polynomial (N^2) growth. Exponential: binary B&B algorithm. Polynomial: LP’s interior point, min-cost flow algorithm, transportation algorithm, assignment algorithm, Dijkstra’s algorithm]
• How can we show a problem is efficiently solvable?
– We can show it constructively: we provide an algorithm and show that it solves the problem efficiently. E.g.:
  • Shortest path problem: Dijkstra’s algorithm runs in polynomial time. Therefore the shortest path problem can be solved efficiently.
  • Linear Programming: the interior point method has polynomial worst-case complexity. Therefore linear programming can be solved efficiently. (*)
(*) The simplex method has exponential worst-case complexity. However, in practice the simplex algorithm seems to scale as m^3, where m is the number of functional constraints.
• How can we show a problem is not efficiently solvable?
– How do you prove a negative? Much harder!
– This is the aim of complexity theory.
Easy (efficiently solvable) problems vs
Hard Problems
• Easy problems: we consider a problem X to be “easy” or efficiently solvable if there is a polynomial time algorithm A for solving X. We denote by P the class of problems solvable in polynomial time.
• Hard problems: everything else. Any problem for which there is no polynomial time algorithm is an intractable problem.
NP-Complete and NP-Hard Problems
[Figure: “Explosive combinatorics”. Application areas shown: experiment design; planning and scheduling and supply chain management; satisfiability, e.g. (A or B) (D or E or not A); data analysis & data mining; protein folding and medical applications; capital budgeting and financial applications; combinatorial auctions; information retrieval; software & hardware verification; fiber optics routing; many more applications. A plot contrasts an exponential function with a polynomial function.]
Hard computational problems scale exponentially in the worst case: exponential-time algorithms.
Tackling practical size instances requires powerful computational and mathematical tools!
How hard (complex) is the connection subgraph problem?
The connection subgraph problem is NP-Hard.
Unfortunately, that means we do not know of any efficient (polynomial time) algorithm to solve this problem.
We believe the connection subgraph problem is intractable: computer scientists only know of exponential time algorithms to solve it (and strongly believe that no polynomial time algorithm will ever be found, but there is no proof either way).
Connections in networks: Hardness of feasibility versus optimality. Conrad, J., C. Gomes, W.-J. van Hoeve, A. Sabharwal,
and J. Suter. Proc. CPAIOR 07, 2007 pages 16–28.
The connection subgraph problem is NP-Hard!
Should we give up on finding good solutions?
Worst Case Result!
Real-world problems are not necessarily
worst case and they possess
hidden sub-structure
that can be exploited allowing
scaling up of solutions.
Encoding the connection subgraph problem as a
Mixed Integer Programming Problem
Single Commodity Flow Encoding
– Variables: x_i, binary variable for each vertex i (1 if included in the corridor; 0 otherwise); y_ij, continuous variable for the flow on each edge (i,j)
– Cost constraint: Σ_i c_i x_i ≤ C
– Utility optimization function: maximize Σ_i u_i x_i
– Connectedness: use a single commodity flow encoding
[Figure: example flow on a small graph, sent from the root (r); max flow = 9]
Single Commodity Flow: MIP
[Formulation: objective and constraints, labeled as follows]
– Max utility (objective)
– Budget constraint
– Reserves
– Total flow
– Flow balance
– Incoming edges allowed only if selected
Annotation on the formulation: “This is what makes the problem hard”
Note: E’ is the set of directed edges, obtained by replacing each undirected edge of E with two directed edges.
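The labels on this slide (max utility, budget constraint, reserves, total flow, flow balance, incoming edges allowed only if selected) are consistent with a single commodity flow formulation along the following lines. This is a sketch under stated assumptions, not a transcription of the slide: n = |V|, an artificial source vertex 0 is joined to a root terminal r ∈ T by an arc (0, r) that is treated as part of E’ in the flow constraints, x_0 is a continuous variable absorbing unused flow, and T is assumed non-empty.

```latex
\begin{align*}
\max \quad & \sum_{i \in V} u_i x_i && \text{(max utility)} \\
\text{s.t.} \quad & \sum_{i \in V} c_i x_i \le C && \text{(budget constraint)} \\
& x_t = 1 \quad \forall t \in T && \text{(reserves)} \\
& x_0 + y_{0r} = n && \text{(total flow)} \\
& \sum_{i:(i,j) \in E'} y_{ij} = x_j + \sum_{k:(j,k) \in E'} y_{jk} \quad \forall j \in V && \text{(flow balance)} \\
& 0 \le y_{ij} \le n \, x_j \quad \forall (i,j) \in E' && \text{(incoming edges allowed only if selected)} \\
& x_i \in \{0,1\} \quad \forall i \in V, \qquad 0 \le x_0 \le n
\end{align*}
```

Under these assumptions each selected vertex consumes one unit of flow, and flow can only enter selected vertices, so every selected vertex is reachable from r through selected vertices. That is what enforces connectedness.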
Solving the
Mixed Integer Programming Encoding
[Figure: connection subgraph instance → MIP model → CPLEX (feasibility + optimization) → solution]
• Cplex: state-of-the-art MIP solver
– Branch and Bound
– LP relaxation
– Cut generation
Experimental Results
Synthetic Instances
for Evaluation
Problem evaluated on semi-structured graphs
• m x m lattice / grid graph with k terminals
– Inspired by the conservation corridors problem
• Place a terminal each on top-left and bottom-right
– Maximizes grid use
• Place remaining terminals randomly
• Assign uniform random costs and utilities from {0, 1, …, 10}
[Figure: example lattice with m = 4, k = 4]
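The instance generator described on this slide can be sketched as follows. The function name, the return format, and the use of a seed are assumptions made for illustration. The slide does not say whether terminals get special costs, so they are drawn like any other parcel here.

```python
import random


def make_lattice_instance(m, k, seed=None):
    """Build an m x m grid graph with k terminals and random costs/utilities in {0, ..., 10}."""
    if not 0 <= k <= m * m:
        raise ValueError("k must be between 0 and m*m")
    rng = random.Random(seed)
    vertices = [(r, c) for r in range(m) for c in range(m)]
    adj = {v: [] for v in vertices}
    for r, c in vertices:
        # Link each cell to its lower and right neighbours (4-neighbourhood overall).
        for nr, nc in ((r + 1, c), (r, c + 1)):
            if nr < m and nc < m:
                adj[(r, c)].append((nr, nc))
                adj[(nr, nc)].append((r, c))
    # First two terminals: top-left and bottom-right corners.
    terminals = [(0, 0), (m - 1, m - 1)][:k]
    remaining = [v for v in vertices if v not in terminals]
    # Remaining terminals are placed uniformly at random.
    terminals += rng.sample(remaining, k - len(terminals))
    cost = {v: rng.randint(0, 10) for v in vertices}      # randint is inclusive on both ends
    utility = {v: rng.randint(0, 10) for v in vertices}
    return adj, terminals, cost, utility


adj, terminals, cost, utility = make_lattice_instance(4, 4, seed=0)
print(len(adj), "parcels; terminals:", terminals)
```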
Standard MIP
Results: without terminals
• No terminals → “find the connected component that maximizes the utility within the given budget”
– Pure optimization problem; always feasible
– Still NP-hard
• A clear easy-hard-easy pattern with uniform random costs & utilities
[Figure: runtime (log scale, 0.01 to 10000) versus budget fraction (0 to 0.8), for 6x6, 8x8, and 10x10 lattices]
Note 1: plot in log-scale for better viewing of the sharp transitions
Note 2: each data point is median over 100+ random instances
Standard MIP:
3 terminals (feasibility vs. optimization)
Split instances into feasible and infeasible; plot median runtime
• For feasible ones: computation involves proving optimality
• For infeasible ones: computation involves proving infeasibility
Infeasible instances take much longer than the feasible ones!
Results: with terminals
[Figure: connection subgraph instance → MIP model → CPLEX (feasibility + optimization) → solution]
Problem?
• MIP+Cplex really weak at feasibility testing
• Poor scaling: couldn’t even get close to handling real data
Can we do better?
(Slide: Ashish Sabharwal, CP-AI-OR '08, May 23, 2008)
A Related Problem (ignoring utilities):
Minimum Cost Solution: The Steiner Tree Problem
Input
– An undirected graph G = (V,E)
– Terminal vertices T ⊆ V
– Edge cost function: c(e)
Question
What is the subgraph H of G with minimum cost such that
– H is connected and contains T?
If the edge costs are all positive, then the resulting subgraph is necessarily a tree (any cycle could be broken by dropping one of its edges, which lowers the cost).
The Steiner Tree Problem:
Min cost tree connecting the terminals
Also NP-Hard, but
• When we only have two terminals → shortest path (e.g., Dijkstra’s algorithm or an algorithm based on dynamic programming)
• Bounded number of terminals → fixed-parameter tractable algorithm
The Steiner Tree Problem:
Min cost tree connecting the terminals
Three terminals (as in the case of our grizzly bear problem)
• Algorithm: to connect the three terminals, find where to place the root of the tree → compute all-pairs shortest paths (easy algorithm based on dynamic programming or even Dijkstra’s)
• The algorithm is also used as the starting point of a greedy solution: start with the minimum cost corridor and extend it greedily, picking nodes in order of decreasing util/cost ratio to use up the remaining budget
• The algorithm is also used for pruning: nodes that are so far away that connecting them to the terminals is beyond the budget can be pruned
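The three-terminal case can be sketched as follows for the edge-cost Steiner tree problem: run Dijkstra from each terminal, then pick the root vertex that minimizes the sum of its three shortest-path distances. This is a minimal illustration assuming non-negative edge costs and an adjacency dictionary of the form {u: {v: cost}}; the function names are hypothetical, and the lecture's parcel model uses vertex costs, which would need a node-weighted variant.

```python
import heapq
from itertools import count


def shortest_paths(adj, source):
    """Dijkstra on adj = {u: {v: edge_cost}}; returns (dist, parent) dictionaries."""
    dist, parent = {source: 0}, {source: None}
    tie = count()  # tie-breaker so the heap never has to compare vertices
    heap = [(0, next(tie), source)]
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, next(tie), v))
    return dist, parent


def steiner_three_terminals(adj, terminals):
    """Return (cost, root, edges) of a min-cost subgraph connecting exactly three terminals."""
    if len(terminals) != 3:
        raise ValueError("this sketch handles exactly three terminals")
    runs = [shortest_paths(adj, t) for t in terminals]
    best_cost, root = float("inf"), None
    for v in adj:
        if all(v in dist for dist, _ in runs):
            total = sum(dist[v] for dist, _ in runs)
            if total < best_cost:
                best_cost, root = total, v
    if root is None:
        return None  # the terminals are not all connected
    edges = set()
    for _, parent in runs:
        v = root
        while parent[v] is not None:  # walk from the root back to the terminal
            edges.add(frozenset((v, parent[v])))
            v = parent[v]
    return best_cost, root, edges


# Toy graph (hypothetical): hub 0 joined to 1, 2, 3 at cost 1, plus a costly direct edge 1 - 2.
adj = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1, 2: 3}, 2: {0: 1, 1: 3}, 3: {0: 1}}
print(steiner_three_terminals(adj, [1, 2, 3]))
```

Trying every vertex as the root costs three Dijkstra runs plus one pass over the vertices, so the whole procedure is polynomial.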
Solving the connection subgraph problem:
Two Phase Approach
• 1st Phase: runs the minimum Steiner tree based algorithm and produces a greedy solution
This phase runs in polynomial time for a constant number of terminal nodes.
• 2nd Phase: refines the greedy solution to produce an optimal solution with Cplex
Solving the connection subgraph problem:
Phase I
• 1st Phase: runs the minimum Steiner tree based algorithm
– Produces the minimum cost solution
– Produces shortest path information used for pruning the search space: the all-pairs-shortest-paths matrix
– Produces a greedy (and often sub-optimal) solution for feasible instances (highest util/cost ratio parcels are selected to use the remaining budget)
This phase runs in polynomial time for a constant number of terminal nodes.
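The greedy extension step can be sketched as below. It starts from a connected set that already contains the terminals (for example the min-cost solution) and repeatedly adds the affordable neighbouring parcel with the best utility/cost ratio. Restricting candidates to neighbours of the current set is an assumption made here so that the result stays connected; the function name and the handling of zero-cost parcels are also illustrative choices.

```python
def greedy_extend(adj, cost, utility, start_set, budget):
    """Greedily grow a connected vertex set by best utility/cost ratio until the budget is used up."""
    chosen = set(start_set)
    spent = sum(cost[v] for v in chosen)
    if spent > budget:
        return None  # even the starting solution is unaffordable

    def ratio(v):
        # Zero-cost parcels are always worth taking first (assumption).
        return float("inf") if cost[v] == 0 else utility[v] / cost[v]

    while True:
        # Affordable parcels adjacent to the current set keep the solution connected.
        frontier = {
            w
            for v in chosen
            for w in adj[v]
            if w not in chosen and spent + cost[w] <= budget
        }
        if not frontier:
            break
        best = max(frontier, key=lambda w: (ratio(w), utility[w]))
        chosen.add(best)
        spent += cost[best]
    return chosen


# Toy instance (hypothetical): a path 0 - 1 - 2 - 3, starting from parcel 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cost = {0: 1, 1: 2, 2: 2, 3: 5}
utility = {0: 0, 1: 4, 2: 2, 3: 100}
print(greedy_extend(adj, cost, utility, {0}, budget=5))
```

In this toy run the high-utility parcel 3 is never reached because the budget runs out first, which illustrates why the greedy solution is often sub-optimal and why a second, exact phase is needed.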
Solving the connection subgraph problem:
Phase II
• Refines the greedy solution to produce an optimal solution with Cplex
– The greedy solution is passed to Cplex as the starting solution (Cplex can change it).
– The all-pairs-shortest-paths matrix computed in Phase I is also passed on to Phase II. It is used to statically (i.e., at the beginning) prune away all nodes that are easily deduced to be too far to be part of a solution (e.g., if the minimum Steiner tree containing that node and all of the terminal vertices already exceeds the budget). This significantly reduces the search space size, often in the range of 40-60%.
• Computes an optimal solution (or the optimal extended-min-cost solution) to the utility-maximization version of the connection subgraph problem.
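Static pruning can be illustrated with a simpler test than the Steiner tree check mentioned above. Any solution containing a node v and a terminal t must contain a path between them, so if the cheapest such path already costs more than the budget, v can be discarded. The sketch below applies this weaker (but still sound) bound with vertex costs; the function names are hypothetical.

```python
import heapq
from itertools import count


def node_weighted_distances(adj, cost, source):
    """Cheapest total vertex cost of a path from `source` to each vertex, endpoints included."""
    dist = {source: cost[source]}
    tie = count()  # tie-breaker so the heap never has to compare vertices
    heap = [(dist[source], next(tie), source)]
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + cost[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v))
    return dist


def prune_far_nodes(adj, cost, terminals, budget):
    """Keep only vertices whose cheapest path to every terminal fits within the budget."""
    keep = set(adj)
    for t in terminals:
        dist = node_weighted_distances(adj, cost, t)
        keep = {v for v in keep if dist.get(v, float("inf")) <= budget}
    return keep


# Toy instance (hypothetical): a path 0 - 1 - 2 - 3 - 4 with unit costs and one terminal at 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cost = {v: 1 for v in adj}
print(prune_far_nodes(adj, cost, [0], budget=3))
```

The surviving vertices define a smaller graph on which the MIP model can be built before it is handed to the solver.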
Solving the Connection Sub-Graph Problem:
Exploiting Structure (A Hybrid MIP/CP Approach)
[Figure: solution pipeline]
– connection subgraph instance → ignore utilities → compute min-cost Steiner tree → min-cost solution and APSP matrix
– min-cost solution → greedily extend to fill budget (“like” knapsack: max u/c) → higher utility feasible solution → starting solution for CPLEX
– APSP matrix → dynamic pruning of the MIP model (40-60% pruned)
– MIP model → CPLEX → solution
– The Steiner tree stage handles feasibility; CPLEX handles optimization
Example APSP matrix shown in the figure:
0 3 6 2 8
3 0 7 4 1
6 7 0 5 9
2 4 5 0 1
8 1 9 1 0
Conrad, Gomes, van Hoeve, Sabharwal, Suter 2008
10x10 random lattices, 3 reserves
• Infeasible instances solved instantaneously!
• ~20x improvement in runtime on feasible instances
10x10 random lattices, 3 reserves
• Gap between optimal and extended-optimal solutions
• Peak of hardness still strongly correlated with budget slack
Experimental Results:
Yellowstone case
Min Cost Solution for Different Granularities
Real Data, 50x50km Parcels
Gap between optimal and extended-optimal solutions peaks in a critical region right after min-cost
Real Data, 40x40km Parcels
Gap between optimal and extended-optimal solutions peaks in a critical region right after min-cost
Research Issues
• Encodings
– Complete Methods (proof of optimality)
  • Other MIP formulations that scale better in practice?
  • Other formulations that allow us to prove optimality faster?
  • Other paradigms (e.g., constraint based, SAT modulo theories, extensions of SAT solvers, Mixed logic programming)?
– Incomplete Methods (cannot prove optimality but may find good solutions)
  • Simulated annealing, genetic algorithms, etc.
– Hybrid complete/incomplete methods
Approximation results
• Cost optimization is NP-hard to approximate within a factor of 1.36
– Utility version?
Related Work
• Moss & Rabani 2001/2007
– Node-Weighted Steiner Tree: costs and utilities on nodes
– Approximation results
• Costa et al 2006/2008/2009
– Steiner Tree with Budget, Revenues and Hop Constraints
– Costs and utilities on edges
– Directed Steiner Tree encoding and Branch-and-Cut
Bistra Dilkina is interested in these issues
Models Are Important!!!
Single Commodity Flow
– Quite compact (poly size)
Directed Steiner Tree
– Exponential number of constraints!
– Captures the connectedness structure better!
– Provides good upper bounds!
Conrad, Dilkina, Gomes, van Hoeve, Sabharwal, Suter 2007, 2008, 2009
A broad class of applications for projects
• A family of problems: spatially targeted interventions
• Conservation and Biodiversity
– Site Selection, Reserve Network Design, Wildlife Corridors
• Social Welfare
– Portfolios of asset-based poverty interventions
Bistra Dilkina 2009
Spatially targeted interventions
• Select a subset A of spatially-explicit actions U
– Maximize a sustainability function F
– Such that the cost of the actions does not exceed a limited budget B
• max F(A) s.t. C(A) ≤ B
• Complexity added by:
– Spatial constraints (connectivity, distance, etc.)
– Data uncertainty
– Dynamics: meta-population models, climate change
Additional Levels of Complexity:
Stochasticity, Uncertainty, Large-Scale Data Modeling
• How to estimate population distributions and habitat suitability? Where and how to collect data?
• Multiple species (hundreds or thousands), with interactions (e.g. predator/prey).
• Biological and ecological issues (for a species and within-species)
• Movements and migrations
• Climate change
• Other factors (e.g., different models of land conservation (e.g., purchase, conservation easements, auctions), typically over different time periods).
What different objective functions can we consider for preserving species and biodiversity?
[Figure: Maxent. Steven Phillips, Miro Dudik & Rob Schapire]
[Figure: Eastern Phoebe migration, Bagged Decision Trees. Source: Daniel Fink. Daniel Fink, Wesley Hochachka, Art Munson, Mirek Riedewald, Ben Shaby, Giles Hooker, and Steve Kelling, 2009.]
Summary
• Wildlife corridor problem
– problem formulation
– computational complexity issues
– models and solution approaches
• Research questions
Our approaches clearly outperform approaches reported in the literature!
The End!
Theoretical Results: 1
• NP-completeness: reduction from the Steiner Tree problem, preserving the cost function. Idea:
– Steiner tree problem already very similar
– Simulate edge costs with node costs
– Simulate terminal vertices with utility function
• NP-complete even without any terminals
– Recall: Steiner tree problem poly-time solvable with constant number of terminals
• Also holds for planar graphs
Theoretical Results: 2
• NP-hardness of approximating cost optimization (factor 1.36): reduction from the Vertex Cover problem
• Reduction motivated by Steiner tree work [Bern, Plassmann ’89]
[Figure: reduction gadget with vertices v1, v2, v3, …, vn]
Vertex cover of size k iff connection subgraph with cost bound C = k and utility U = m