Cascading Failures in
Infrastructure Networks
David Alderson
Ph.D. Candidate
Dept. of Management Science and Engineering
Stanford University
April 15, 2002
Advisors: William J. Perry, Nicholas Bambos
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Background
• Most of the systems we rely on in our daily lives are
designed and built as networks
– Voice and data communications
– Transportation
– Energy distribution
• Large-scale disruption of such systems can be
catastrophic because of our dependence on them
• Large-scale failures in these systems
– Have already happened
– Will continue to happen
Recent Examples
• Telecommunications
– ATM network outage: AT&T (February 2001)
– Frame Relay outage: AT&T (April 1998), MCI (August 1999)
• Transportation
– Union Pacific Service Crisis (May 1997 – December 1998)
• Electric Power
– Northeast Blackout (November 1965)
– Western Power Outage (August 1996)
• All of the above
– Baltimore Tunnel Accident (July 2001)
Public Policy
• U.S. Government interest from 1996 (and earlier)
• Most national infrastructure systems are privately
owned and operated
– Misalignment between business imperatives (efficiency)
and public interest (robustness)
• Previously independent networks now tied together
through common information infrastructure
• Current policy efforts directed toward building new
public-private relationships
– Policy & Partnership (CIAO)
– Law Enforcement & Coordination (NIPC)
– Defining new roles (Homeland Security)
Research Questions
Broadly:
• Is there something about the network structure of these
systems that contributes to their vulnerability?
More specifically:
• What is a cascading failure in the context of an infrastructure
network?
• What are the mechanisms that cause it?
• What can be done to control it?
• Can we design networks that are robust to cascading failures?
• What are the implications for network-based businesses?
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Union Pacific Railroad
• Largest RR in North America
– Headquartered in Omaha, Nebraska
– 34,000 track miles (west of Mississippi River)
• Transporting
– Coal, grain, cars, other manifest cargos
– 3rd party traffic (e.g. Amtrak passenger trains)
• 24x7 Operations:
– 1,500+ trains in motion
– 300,000+ cars in system
• More than $10B in revenue annually
Union Pacific Railroad
• Four major resources constraining operations:
– Line capacity
(# parallel tracks, speed restrictions, etc.)
– Terminal capacity (in/out tracks, yard capacity)
– Power (locomotives)
– Crew (train personnel, yard personnel)
• Ongoing control of operations is mainly by:
– Dispatchers
– Yardmasters
– Some centralized coordination, primarily through a
predetermined transportation schedule
Union Pacific Railroad
• Sources of network disruptions:
– Weather
(storms, floods, rock slides, tornados, hurricanes, etc.)
– Component failures
(signal outages, broken wheels/rails, engine failures, etc.)
– Derailments (~1 per day on average)
– Minor incidents (e.g. crossing accidents)
• Evidence for system-wide failures
– 1997-1998 Service Crisis
• Fundamental operating challenge
UPRR Fundamental Challenge
Two conflicting drivers:
• Business imperatives necessitate a lean operation
that maximizes efficiency and drives the system
toward high utilization of available network
resources.
• An efficient operation that maximizes utilization is
very sensitive to disruptions, particularly because of
the effects of network congestion.
Railroad Congestion
There are several places where congestion may be
seen within the railroad:
• Line segments
• Terminals
• Operating Regions
• The Entire Railroad Network
• (Probably not locomotives or crews)
Congestion is related to capacity.
Line Segment Velocity
UPRR Capacity Model Concepts
[Figure: empirically-derived relationship between line segment velocity and volume (trains per day), illustrating the effect of forcing volume in excess of capacity]
Factors affecting observed performance:
• Dispatcher / Corridor Manager expertise
• On-line incidents / equipment failure
• Weather
• Temporary speed restrictions
Implications of Congestion
Concepts of traffic congestion are important for two key
aspects of network operations:
– Capacity Planning and Management
– Service Restoration
In the presence of service interruptions, the objective of
Service Restoration is to:
– Minimize the propagation across the network of any
disturbance caused by a service interruption
– Minimize the time to recover to fluid operations
Modeling Congestion
We can model congestion using standard models from
transportation engineering.
Define the relationships between:
• Number of items in the system (Density)
• Average processing rate (Velocity)
• Input Rate
• Output Rate (Throughput)
Modeling Congestion
Velocity vs. Density:
Assume that velocity decreases (linearly) with the traffic density:

    v(n) = K (1 − n/N)

[Figure: velocity v(n) decreasing linearly from K at density n = 0 to zero at n = N]
Modeling Congestion
Throughput vs. Density:
Throughput = Velocity × Density

    μ(n) = v(n) · n = K (n − n²/N)

Throughput is maximized at n = N/2, with value μ* = N/4 (for K = 1).
[Figure: throughput μ(n) as an inverted parabola on densities 0 to N, peaking at μ* at n = N/2]
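A minimal numeric sketch of this linear model may help; the values of K and N below are illustrative assumptions, not from the talk.

```python
import numpy as np

# Linear congestion model from the slides: v(n) = K*(1 - n/N), mu(n) = n*v(n).
# K and N are illustrative assumptions.
K, N = 1.0, 100.0

n = np.linspace(0.0, N, 1001)     # traffic density
v = K * (1.0 - n / N)             # velocity falls linearly with density
mu = n * v                        # throughput = density * velocity

i = mu.argmax()
print(f"peak throughput {mu[i]:.2f} at density {n[i]:.1f}")   # K*N/4 = 25.00 at N/2 = 50.0
```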
Modeling Congestion
[Figure: three companion plots summarizing the model – throughput vs. density (peaking at μ* and falling to zero at N), velocity vs. density (falling from K to zero at N), and velocity vs. throughput (two velocity branches, uncongested and congested, for every throughput below μ*)]
Modeling Congestion
Let p represent the intensity of congestion onset:

    v(n) = K (1 − (n/N)^p)

    μ(n) = n · v(n) = n K (1 − (n/N)^p)

[Figure: plots of v(n) = 1 − (n/10)^p and μ(n) = n · v(n) for p = 0.1, 0.25, 0.5, 1, 2, 4, 10 (K = 1, N = 10); larger p holds velocity near K until density approaches N, then drops sharply]
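A short sketch of how the parameter p shifts the throughput peak; K and N match the figure's values, and everything else is an illustrative choice.

```python
import numpy as np

# Tunable congestion-onset family: v(n) = K*(1 - (n/N)**p).
# Larger p delays the onset of congestion but makes the collapse sharper.
K, N = 1.0, 10.0    # values from the slide's figure

n = np.linspace(0.0, N, 1001)
for p in (0.1, 0.25, 0.5, 1, 2, 4, 10):
    mu = n * K * (1.0 - (n / N) ** p)    # throughput
    i = mu.argmax()
    print(f"p = {p:>4}: peak throughput {mu[i]:.2f} at density {n[i]:.2f}")
```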
Modeling Congestion
It is clear that, as p → ∞,

    μ(n) = n K (1 − (n/N)^p)

becomes

    μ(n) = n K (1 − χ_N(n)),   where χ_N(n) = 1 if n ≥ N, and 0 otherwise.

In the limit, throughput grows linearly as nK right up to the capacity N, then drops to zero.
[Figure: the family of μ(n) curves approaching this limiting profile as p grows]
Modeling Congestion
[Figure: the three-panel summary of throughput vs. density, velocity vs. density, and velocity vs. throughput, revisited for the generalized model]
UP Service Crisis
• Initiating Event
– 5/97 derailment at a critical train yard outside of Houston
• Additionally
– Loss of BNSF route that was decommissioned for repairs
– Embargo at Laredo interchange point to Mexico
• Complicating Factors
– UP/SP merger and transition to consolidated operations
– Hurricane Danny, fall 1997
– Record rains and floods (esp. Kansas) in 1998
• Operational Issues
– Tightly optimized transportation schedule
– Traditional service priorities
UP Service Crisis
[Chart: Union Pacific Railroad total system inventory (cars), December 1996 – November 1998, on a scale from 280,000 to 370,000 cars; annotations mark congestion in the Houston–Gulf Coast region, Southern California, and the Central Corridor (Kansas–Nebraska–Wyoming)]
Source: UP filings with the Surface Transportation Board, September 1997 – December 1998
Case Study: Union Pacific
Completed Phase 1 of case study:
• Understanding of the factors affecting system capacity, system
dynamics
• Investigation of the 1997-98 Service Crisis
• Project definition: detailed study of Sunset Route
• Data collection, preliminary analysis for the Sunset Route
Ongoing work:
• A detailed study of their specific network topology
• Development of real-time warning and analysis tools
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Basic Network Concepts
• Networks allow the sharing of distributed resources
• Resource use → resource load
– Total network usage = total network load
• Total network load is distributed among the components of the network
– Many networking problems are concerned with finding a “good” distribution of load
• Resource allocation → load distribution
Infrastructure Networks
• Self-protection as an explicit design criterion
• Network components themselves are valuable
– Expensive
– Hard to replace
– Long lead times to obtain
• Willingness to sacrifice current system performance
in exchange for future availability
• With protection as an objective, connectivity between
neighboring nodes is
– Helpful
– Harmful
Cascading Failures
Cascading failures occur in networks where
– Individual network components can fail
– When a component fails, the natural dynamics of the
system may induce the failure of other components
Network components can fail because of
– Accidents
– Internal failures
– Attacks
These are the initiating events.
A cascading failure is not
– A single point of failure
– The occurrence of multiple concurrent failures
– The spread of a virus
Related Work
Cascading Failures:
– Electric Power: Parrilo et al. (1998), Thorp et al. (2001)
– Social Networks: Watts (1999)
– Public Policy: Little (2001)
Other network research initiatives:
– “Survivable Networks”
– “Fault-Tolerant Networks”
Large-Scale Vulnerability:
– Self-Organized Criticality: Bak (1987), many others
– Highly Optimized Tolerance: Carlson and Doyle (1999)
– Normal Accidents: Perrow (1999)
– Influence Models: Verghese et al. (2001)
Our Approach
• Cascading failures in the context of flow networks
– conservation of flow within the network
• Overloading a resource leads to degraded
performance and eventual failure
• Network failures are not independent
– Flow allocation pattern → resource interdependence
• Focus on the dynamics of network operation and
control
• Design for robustness (not protection)
Taxonomy of Network Flow Models
Coarse-grained to fine-grained models (reference: Janusz Filipiak):

Quantity of Interest         Modeling Approach
Long-Term Averages           Static Flow Models
Time-Dependent Averages      Fluid Approximations
Averages & Variances         Diffusion Approximations
Probability Distributions    Queueing Models
Event Sequences              Simulation Models

The relevant decisions range from Capacity Planning (coarsest models) through Failure & Recovery to Ongoing Operation (Processing & Routing) (finest models).
Time Scales in Network Operations
Relevant decisions, from short to long time scales:

                          Ongoing Operation         Failure &          Capacity
                          (Processing & Routing)    Recovery           Planning
Computer Routing          milliseconds to seconds   minutes to hours   days to weeks
Railroad Transportation   minutes to hours          days to weeks      months to years
What Are Network Dynamics?

Type of Network Dynamics    Underlying Assumption
Dynamics OF Networks        Network topology is CHANGING (Failure & Recovery)
Dynamics ON Networks        Network topology is STATIC
Network Flow Optimization
• Original work by Ford and Fulkerson (1956)
• One of the most studied areas for optimization
• Three main problem types
– Shortest path problems
– Maximum flow problems
– Minimum cost flow problems
• Special interpretation for some of the most
celebrated results in optimization theory
• Broad applicability to a variety of problems
Single Commodity Flow Problem
Notation:
N     set of nodes, indexed i = 1, 2, …, N
A     set of arcs, indexed j = 1, 2, …, M
d_i   demand (supply) at node i
f_j   flow along arc j
u_j   capacity along arc j
A     node-arc incidence matrix, with entries
      a_ij = +1 if arc j exits node i, −1 if arc j enters node i, 0 otherwise

A set of flows f is feasible if it satisfies the constraints:
      A_i f = d_i      for all i ∈ N    (flows balanced at node i; supply/demand satisfied)
      0 ≤ f_j ≤ u_j    for all j ∈ A    (flow on arc j no more than capacity)
Single Commodity Flow Problem

Push δ units of flow from source s to sink t.

Feasible region, denoted F(δ):
      A_i f = d_i = δ if i = s,  −δ if i = t,  0 otherwise    (flows balanced at node i)
      0 ≤ f_j ≤ u_j    for all j ∈ A    (flow on arc j feasible)
Minimum Cost Problem

Push δ units of flow from source s to sink t at minimum cost.
Let c_j = cost on arc j.

Minimize_f   Σ_{j ∈ A} c_j f_j
subject to:
      A_i f = d_i = δ if i = s,  −δ if i = t,  0 otherwise    (flows balanced at node i)
      0 ≤ f_j ≤ u_j    for all j ∈ A    (flow on arc j feasible)
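As a concrete illustration, this LP can be solved directly with scipy's linprog on a toy network; all node, arc, cost, and capacity data below are assumptions chosen for the example, not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of the minimum cost flow LP (all data are illustrative).
# Arcs as (tail, head), with s = node 0 and t = node 3.
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
cost = np.array([2.0, 4.0, 1.0, 6.0, 3.0])    # c_j
cap  = np.array([4.0, 3.0, 2.0, 3.0, 5.0])    # u_j
delta = 5.0                                    # flow to push from s to t

# Node-arc incidence matrix: +1 if arc j exits node i, -1 if it enters.
A = np.zeros((4, len(arcs)))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = +1.0, -1.0

d = np.array([delta, 0.0, 0.0, -delta])       # d_s = +delta, d_t = -delta

res = linprog(cost, A_eq=A, b_eq=d, bounds=list(zip(np.zeros(len(arcs)), cap)))
print(res.x, res.fun)   # optimal arc flows f_j and total cost (33.0 here)
```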
Shortest Path Problem

Let costs c_j correspond to “distances”; set δ = 1 and u_j = 1.

Minimize_f   Σ_{j ∈ A} c_j f_j
subject to:
      A_i f = d_i = 1 if i = s,  −1 if i = t,  0 otherwise    (flows balanced at node i)
      0 ≤ f_j ≤ u_j = 1    for all j ∈ A    (flow on arc j feasible)
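With δ = 1 and unit capacities the problem reduces to an ordinary shortest path, so it can also be solved directly; here is a sketch with networkx on the same illustrative toy network (edge costs are assumptions).

```python
import networkx as nx

# Shortest path as the delta = 1, unit-capacity special case of min-cost flow.
G = nx.DiGraph()
G.add_edge("s", "1", cost=2)
G.add_edge("s", "2", cost=4)
G.add_edge("1", "2", cost=1)
G.add_edge("1", "t", cost=6)
G.add_edge("2", "t", cost=3)

path = nx.shortest_path(G, "s", "t", weight="cost")
length = nx.shortest_path_length(G, "s", "t", weight="cost")
print(path, length)   # ['s', '1', '2', 't'] with distance 6
```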
Maximum Flow Problem

Find the largest δ that can be pushed from s to t.

Maximize_{f, δ}   δ
subject to:
      A_i f = d_i = δ if i = s,  −δ if i = t,  0 otherwise    (flows balanced at node i)
      0 ≤ f_j ≤ u_j    for all j ∈ A    (flow on arc j feasible)
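A sketch of the maximum flow problem on the same illustrative network, solved with networkx; node names and capacities are assumptions carried over from the earlier toy example.

```python
import networkx as nx

# Maximum flow: the largest delta for which F(delta) is nonempty.
G = nx.DiGraph()
G.add_edge("s", "1", capacity=4)
G.add_edge("s", "2", capacity=3)
G.add_edge("1", "2", capacity=2)
G.add_edge("1", "t", capacity=3)
G.add_edge("2", "t", capacity=5)

flow_value, flow = nx.maximum_flow(G, "s", "t")
print(flow_value, flow)   # flow_value = 7 for this toy instance
```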
Network Optimization
Traditional Assumptions:
– Complete information
– Static network (capacities, demands, topology)
– Centralized decision maker
Solution obtained from global optimization algorithms
Relevant issues:
– Computational (time) complexity
• Function of problem size (number of inputs)
• Based on worst-case data
– Parallelization (decomposition)
– Synchronization (global clock)
New Challenges
Most traditional assumptions no longer hold…
• Modern networks are inherently dynamic
– Connectivity fluctuates, components fail, growth is ad hoc
– Traffic demands/patterns constantly change
• Explosive growth → massive size scale
• Faster technology → shrinking time scale
• Operating decisions are made with incomplete,
incorrect information
• Claim: A global approach based on static
assumptions is no longer viable
Cascading Failures & Flow Networks
• In general, we assume that network failures result
from violations of network constraints
• Node feasibility (flow conservation)
• Arc feasibility (arc capacity)
• That is, failure ≡ infeasibility
• The network topology provides the means by which
failures (infeasibilities) propagate
• In the optimization context, a cascading failure is a
collapse of the feasible region of the optimization
problem that results from the interaction of the
constraints when a parameter is changed
Addressing New Challenges
• Extend traditional notions of network optimization to
model cascading failures in flow networks
– Allow for node failures
– Include flow dynamics
• Consider solution approaches based on
– Decentralized control
– Local information
• Leverage ideas from dual problem formulation
• Identify dimensions along which there are explicit
tensions and tradeoffs between vulnerability and
performance
Dual Problem Formulation

Primal Problem:                  Dual Problem:
Min   c^T f                      Max   λ^T d − μ^T u
s.t.  A f = d                    s.t.  λ^T A − μ^T ≤ c^T
      0 ≤ f ≤ u                        λ unrestricted,  μ ≥ 0

• Dual variables λ, μ have interpretation as prices at nodes and arcs
• Natural decomposition as a distributed problem
  – e.g., nodes set prices based on local information
• Examples:
  – Kelly, Low, and many others for TCP/IP congestion control
  – Boyd and Xiao for dual decomposition of the SRRA problem
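To make the price interpretation concrete, here is a hedged sketch of dual (subgradient) decomposition for the min-cost flow LP on the earlier toy data: each arc reacts only to the prices at its two endpoints, and node prices rise where flow balance is violated. The step-size rule, iteration count, and primal averaging are standard textbook choices, not from the talk.

```python
import numpy as np

# Dual decomposition sketch: dualize the flow-balance constraints A f = d.
# Each arc minimizes its own Lagrangian term using only endpoint prices;
# node prices are updated by subgradient ascent on local imbalance.
# All data and tuning choices are illustrative assumptions.
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]    # s = 0, t = 3
cost = np.array([2.0, 4.0, 1.0, 6.0, 3.0])
cap  = np.array([4.0, 3.0, 2.0, 3.0, 5.0])
d    = np.array([5.0, 0.0, 0.0, -5.0])

A = np.zeros((4, len(arcs)))
for j, (tail, head) in enumerate(arcs):
    A[tail, j], A[head, j] = +1.0, -1.0            # +1 exits, -1 enters

lam = np.zeros(4)                 # node prices (unrestricted dual variables)
f_avg = np.zeros(len(arcs))
for k in range(1, 20001):
    reduced = cost - A.T @ lam                  # arc-local reduced costs
    f = np.where(reduced < 0, cap, 0.0)         # arc-local Lagrangian minimizer
    lam += (d - A @ f) / np.sqrt(k)             # raise price where balance violated
    f_avg += (f - f_avg) / k                    # averaged primal iterates

print(np.round(f_avg, 2))   # should approach the optimal flows [2, 3, 2, 0, 5]
```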
Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work
Node Dynamics
• Consider each node as a simple input-output system running in discrete time,
  with arrivals a(k), load n(k), and departures d(k)
• Let n(k) = flow being processed in interval k
• Node dynamics:  n(k+1) = n(k) + a(k) − d(k)
• Processing capacity θ
• State-dependent output:
      d(k) = n(k) if 0 ≤ n(k) ≤ θ,  and 0 if n(k) > θ
  (this is the p → ∞ limit of the congestion family, with K = 1 and N = θ)
• a(k) − d(k) indicates how n(k) is changing
• For constant a(k), n* is an equilibrium point; the system is feasible for a(k) < θ
• Node “fails” if n(k) > θ
[Figure: performance d(k) vs. load n(k), with equilibrium n* below the capacity θ]
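A tiny simulation sketch of this node model; the capacity θ (here `theta`), the input values, and the horizon are illustrative assumptions.

```python
# Node model sketch: n(k+1) = n(k) + a(k) - d(k), with hard-threshold output
# d(k) = n(k) when n(k) <= theta and 0 once the node is overloaded.
theta = 10.0

def d(n: float) -> float:
    """State-dependent output: full service below capacity, collapse above."""
    return n if n <= theta else 0.0

def simulate(a: float, n0: float = 0.0, steps: int = 30) -> list[float]:
    n, path = n0, []
    for _ in range(steps):
        n = n + a - d(n)      # node dynamics
        path.append(n)
    return path

print(simulate(a=8.0)[-1])    # a < theta: settles at the equilibrium n* = 8.0
print(simulate(a=12.0)[:4])   # a > theta: [12.0, 24.0, 36.0, 48.0] -- the node fails
```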
Network Dynamics
• The presence of an arc between adjacent nodes couples their behavior:
  node 1’s departures are node 2’s arrivals, d1(k) = a2(k), limited by the
  arc capacity u1
• Arc capacities limit both outgoing and incoming flow
[Figure: two nodes in tandem, with external arrivals a1(k) into node 1 and d1(k) = a2(k) ≤ u1 flowing across the connecting arc into node 2]
Network Dynamics
• The failure of one node can lead to the failure of another
• When a node fails, the capacity of its incoming arcs drops effectively to zero (u1 = 0)
• The upstream node loses the capacity of that arc
• In the absence of control, the upstream node fails too
Result: node failures propagate “upstream”…
Question: how will the network respond to perturbations?
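A hedged sketch of this upstream propagation on a chain of threshold nodes; the chain length, capacities, input rate, and the timing of the perturbation are all illustrative assumptions.

```python
# Upstream cascade sketch on a chain: head -> node 0 -> node 1 -> ... -> sink.
# Threshold node model as above; a failed node (load above theta) chokes off
# its incoming arc, starving the node upstream of outgoing capacity.
theta = 10.0          # processing capacity of every node
M = 5                 # chain length
n = [0.0] * M         # load at each node
u = [9.0] * M         # capacity of the arc leaving node i (u[M-1] feeds the sink)
a_in = 8.0            # constant input to the head of the chain

for k in range(100):
    if k == 30:
        u[M - 1] = 0.0                    # perturbation: the exit arc is cut
    for i in range(1, M):
        if n[i] > theta:
            u[i - 1] = 0.0                # failed node blocks its incoming arc
    d = [min(n[i], u[i]) if n[i] <= theta else 0.0 for i in range(M)]
    for i in range(M):
        inflow = a_in if i == 0 else d[i - 1]
        n[i] += inflow - d[i]

print([round(x, 1) for x in n])   # overload spreads from the tail to the head
```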
Network Robustness
Consider the behavior of the network in response to a
perturbation to arc capacity:
1. Does the disturbance lead to a local failure?
2. Does the failure propagate?
3. How far does it propagate?
Measure the consequences in terms of:
– Size of the resulting failure island
– Loss of network throughput
Key factors:
– Flow processing sensitivity to congestion
– Network topology
– Local routing and flow control policies
– Time scales
Congestion Sensitivity
In many real network systems, components are sensitive to congestion.
[Figure: system performance vs. system load, with performance collapsing once load passes a threshold (evidence of congestion)]
• Using the aforementioned family of functions we can tune the sensitivity to congestion
• Direct consequences for local dynamics, stability, and control
• Tradeoff between system efficiency and fragility
• Implications for local behavior
Qualitative Behavior
[Figure: output rate vs. total system load, plotted against the input rate, showing a stable equilibrium at x1*, an unstable equilibrium at x2*, and congestion collapse beyond x2*]
Qualitative Behavior
[Figure: the same curve divided into regions of fluid processing (below x1*), mild congestion (between x1* and x2*), and severe congestion (beyond x2*)]
System response to changes in input rate is opposite in the fluid vs. congested regions.
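A small numeric sketch locating the stable and unstable equilibria where a constant input rate crosses the congestion-family output curve; the parameter values are illustrative assumptions.

```python
import numpy as np

# Find x1* (stable) and x2* (unstable): the loads where a constant input
# rate a equals the output curve mu(n) = n*K*(1 - (n/N)**p).
K, N, p, a = 1.0, 10.0, 2.0, 2.0    # illustrative values

def mu(n):
    return n * K * (1.0 - (n / N) ** p)

# mu rises to a peak and then falls, so a line at height a < mu_max
# crosses it twice; bisect on each side of the peak.
grid = np.linspace(0.0, N, 100001)
peak = grid[np.argmax(mu(grid))]

def bisect(lo, hi):
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (mu(lo) - a) * (mu(mid) - a) <= 0:   # keep the sign-changing half
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

x1 = bisect(0.0, peak)   # stable: mu increasing through a
x2 = bisect(peak, N)     # unstable: mu decreasing through a
print(round(x1, 3), round(x2, 3))   # roughly 2.09 and 8.79 for these values
```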
Qualitative Behavior
“Efficiency” results in “Fragility”
[Figure: raising the input rate moves the stable and unstable equilibria from x1* and x2* to y1* and y2*, shrinking the safety margin before congestion collapse]
Ongoing Work
• Modeling behavior of flow networks
– Vulnerability to cascading failures
– Sensitivity to congestion
• Bringing together notions from network optimization,
dynamical systems, and distributed control
• Exploring operating tradeoffs between
– efficiency and robustness
– global objectives vs. local behavior
– system performance vs. system vulnerability
• Collectively, these features provide a framework for
study of real systems
– UPRR case study
– Computer networks
Future Directions
• Development of decision support tools to support
real-time operations
– Warning systems
– Incident recovery
• Investigation of issues related to topology
• Notions from economics
– Network complements and substitutes
– Node cooperation and competition
Key Takeaways
• Large-scale failures happen
– Elements of vulnerability associated with connectivity
– But we are moving to connect everything together…
• Critical tradeoff for network-based businesses
– Business profitability from resource efficiency
– System robustness
• Two fundamental aspects to understanding large-scale failure behavior
– Networks
– Dynamics
• Relevance to a wide variety of applications
Thank You
alderd@stanford.edu