Cascading Failures in Infrastructure Networks

David Alderson
Ph.D. Candidate, Dept. of Management Science and Engineering, Stanford University
April 15, 2002
Advisors: William J. Perry, Nicholas Bambos

Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work

Background
• Most of the systems we rely on in our daily lives are designed and built as networks
  – Voice and data communications
  – Transportation
  – Energy distribution
• Large-scale disruption of such systems can be catastrophic because of our dependence on them
• Large-scale failures in these systems
  – Have already happened
  – Will continue to happen

Recent Examples
• Telecommunications
  – ATM network outage: AT&T (February 2001)
  – Frame Relay outages: AT&T (April 1998), MCI (August 1999)
• Transportation
  – Union Pacific Service Crisis (May 1997 – December 1998)
• Electric Power
  – Northeast Blackout (November 1965)
  – Western Power Outage (August 1996)
• All of the above
  – Baltimore Tunnel Accident (July 2001)

Public Policy
• U.S. Government interest from 1996 (and earlier)
• Most national infrastructure systems are privately owned and operated
  – Misalignment between business imperatives (efficiency) and the public interest (robustness)
• Previously independent networks are now tied together through a common information infrastructure
• Current policy efforts are directed toward building new public-private relationships
  – Policy & Partnership (CIAO)
  – Law Enforcement & Coordination (NIPC)
  – Defining new roles (Homeland Security)

Research Questions
Broadly:
• Is there something about the network structure of these systems that contributes to their vulnerability?
More specifically:
• What is a cascading failure in the context of an infrastructure network?
• What are the mechanisms that cause it?
• What can be done to control it?
• Can we design networks that are robust to cascading failures?
• What are the implications for network-based businesses?

Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work

Union Pacific Railroad
• Largest railroad in North America
  – Headquartered in Omaha, Nebraska
  – 34,000 track miles (west of the Mississippi River)
• Transporting
  – Coal, grain, cars, and other manifest cargo
  – Third-party traffic (e.g. Amtrak passenger trains)
• 24x7 operations:
  – 1,500+ trains in motion
  – 300,000+ cars in the system
• More than $10B in revenue annually

Union Pacific Railroad
• Four major resources constrain operations:
  – Line capacity (# of parallel tracks, speed restrictions, etc.)
  – Terminal capacity (in/out tracks, yard capacity)
  – Power (locomotives)
  – Crew (train personnel, yard personnel)
• Ongoing control of operations is mainly by:
  – Dispatchers
  – Yardmasters
  – Some centralized coordination, primarily through a predetermined transportation schedule

Union Pacific Railroad
• Sources of network disruptions:
  – Weather (storms, floods, rock slides, tornadoes, hurricanes, etc.)
  – Component failures (signal outages, broken wheels/rails, engine failures, etc.)
  – Derailments (~1 per day on average)
  – Minor incidents (e.g. crossing accidents)
• Evidence for system-wide failures
  – 1997-1998 Service Crisis
• Fundamental operating challenge
UPRR Fundamental Challenge
Two conflicting drivers:
• Business imperatives necessitate a lean operation that maximizes efficiency and drives the system toward high utilization of available network resources.
• An efficient operation that maximizes utilization is very sensitive to disruptions, particularly because of the effects of network congestion.

Railroad Congestion
Congestion may be seen in several places within the railroad:
• Line segments
• Terminals
• Operating regions
• The entire railroad network
• (Probably not locomotives or crews)
Congestion is related to capacity.

Line Segment Velocity
UPRR Capacity Model Concepts: an empirically-derived relationship.
[Figure: train velocity vs. volume (trains per day), illustrating the effect of forcing volume in excess of capacity]
Factors affecting observed performance:
• Dispatcher / corridor manager expertise
• On-line incidents / equipment failure
• Weather
• Temporary speed restrictions

Implications of Congestion
Concepts of traffic congestion are important for two key aspects of network operations:
– Capacity Planning and Management
– Service Restoration
In the presence of service interruptions, the objective of Service Restoration is to:
– Minimize the propagation across the network of any disturbance caused by a service interruption
– Minimize the time to recovery to fluid operations

Modeling Congestion
We can model congestion using standard models from transportation engineering. Define the relationships between:
• Number of items in the system (density)
• Average processing rate (velocity)
• Input rate
• Output rate (throughput)

Modeling Congestion
Velocity vs. density: assume that velocity decreases (linearly) with traffic density:
  v(n) = K (1 − n/N)
[Figure: velocity v(n) falling linearly from K at density n = 0 to zero at n = N]

Modeling Congestion
Throughput vs. density: throughput = velocity · density:
  μ(n) = n v(n) = n K (1 − n/N)
Throughput is maximized at n = N/2 with value μ* = N/4 (for K = 1).
[Figure: concave throughput curve μ(n) peaking at density n = N/2]

Modeling Congestion
[Figure: three panels relating the model quantities: velocity vs. density, throughput vs. density, and velocity vs. throughput]

Modeling Congestion
Let p represent the intensity of congestion onset:
  v(n) = K (1 − (n/N)^p)
  μ(n) = n v(n) = n K (1 − (n/N)^p)
[Figure: v(n) and μ(n) plotted for p = 0.1, 0.25, 0.5, 1, 2, 4, 10, with K = 1 and N = 10]

Modeling Congestion
It is clear that, as p grows,
  lim_{p → ∞} n K (1 − (n/N)^p) = n K χ_N(n),
where χ_N(n) = 1 if n < N and 0 otherwise.
[Figure: in the limit, throughput rises linearly as nK and drops to zero at n = N]
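This family of curves is easy to explore numerically. Below is a minimal sketch (not part of the original talk) that reproduces the velocity and throughput curves, assuming K = 1 and N = 10 as in the plots above:

  import numpy as np

  K, N = 1.0, 10.0          # free-flow rate and jam density used in the slides

  def velocity(n, p):
      """Velocity falls from K toward 0 as density n approaches N;
      p sets how abruptly congestion onsets."""
      return K * (1.0 - (n / N) ** p)

  def throughput(n, p):
      """Throughput = density * velocity."""
      return n * velocity(n, p)

  n = np.linspace(0.0, N, 101)
  for p in (0.1, 0.25, 0.5, 1, 2, 4, 10):
      mu = throughput(n, p)
      print(f"p={p:>5}: peak throughput {mu.max():.2f} at n = {n[mu.argmax()]:.1f}")

For p = 1 this recovers the peak μ* = KN/4 at n = N/2; as p grows, the peak moves toward n = N, so the system carries more traffic before congestion sets in but collapses more abruptly once it does.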
UP Service Crisis
• Initiating event
  – 5/97 derailment at a critical train yard outside of Houston
• Additionally
  – Loss of a BNSF route that was decommissioned for repairs
  – Embargo at the Laredo interchange point to Mexico
• Complicating factors
  – UP/SP merger and transition to consolidated operations
  – Hurricane Danny, fall 1997
  – Record rains and floods (esp. Kansas) in 1998
• Operational issues
  – Tightly optimized transportation schedule
  – Traditional service priorities

UP Service Crisis
[Chart: Union Pacific total system inventory (cars), December 1996 – November 1998, on a scale of 280,000 to 370,000 cars, with congested regions highlighted: Southern California, Central Corridor (Kansas-Nebraska-Wyoming), and Houston-Gulf Coast. Source: UP filings with the Surface Transportation Board, September 1997 – December 1998]

Case Study: Union Pacific
Completed Phase 1 of the case study:
• Understanding of the factors affecting system capacity and system dynamics
• Investigation of the 1997-98 Service Crisis
• Project definition: detailed study of the Sunset Route
• Data collection and preliminary analysis for the Sunset Route
Ongoing work:
• A detailed study of their specific network topology
• Development of real-time warning and analysis tools

Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work

Basic Network Concepts
• Networks allow the sharing of distributed resources
• Resource use ↔ resource load
  – Total network usage = total network load
• Total network load is distributed among the components of the network
  – Many networking problems are concerned with finding a "good" distribution of load
• Resource allocation ↔ load distribution

Infrastructure Networks
• Self-protection as an explicit design criterion
• The network components themselves are valuable
  – Expensive
  – Hard to replace
  – Long lead times to obtain
• Willingness to sacrifice current system performance in exchange for future availability
• With protection as an objective, connectivity between neighboring nodes is both
  – Helpful
  – Harmful

Cascading Failures
Cascading failures occur in networks where
– Individual network components can fail
– When a component fails, the natural dynamics of the system may induce the failure of other components
Network components can fail because of initiating events:
– Accident
– Internal failure
– Attack
A cascading failure is not
– A single point of failure
– The occurrence of multiple concurrent failures
– The spread of a virus

Related Work
Cascading failures:
– Electric power: Parrilo et al. (1998), Thorp et al. (2001)
– Social networks: Watts (1999)
– Public policy: Little (2001)
Other network research initiatives:
– "Survivable Networks"
– "Fault-Tolerant Networks"
Large-scale vulnerability:
– Self-Organized Criticality: Bak (1987), many others
– Highly Optimized Tolerance: Carlson and Doyle (1999)
– Normal Accidents: Perrow (1999)
– Influence Models: Verghese et al. (2001)

Our Approach
• Cascading failures in the context of flow networks
  – Conservation of flow within the network
• Overloading a resource leads to degraded performance and eventual failure
• Network failures are not independent
  – Flow allocation pattern → resource interdependence
• Focus on the dynamics of network operation and control
• Design for robustness (not protection)

Taxonomy of Network Flow Models
(coarse-grained at the top, fine-grained at the bottom)
  Quantity of interest         Modeling approach           Relevant decisions
  Long-term averages           Static flow models          Capacity planning
  Time-dependent averages      Fluid approximations        Failure & recovery
  Averages & variances         Diffusion approximations
  Probability distributions    Queueing models             Ongoing operation
  Event sequences              Simulation models             (processing & routing)
Reference: Janusz Filipiak

Time Scales in Network Operations
(short time scales at the top, long time scales at the bottom)
  Relevant decision                         Computer routing          Railroad transportation
  Ongoing operation (processing & routing)  milliseconds to seconds   minutes to hours
  Failure & recovery                        minutes to hours          days to weeks
  Capacity planning                         days to weeks             months to years

What Are Network Dynamics?
  Type of network dynamics    Underlying assumption
  Dynamics OF networks        Network topology is CHANGING (failure & recovery)
  Dynamics ON networks        Network topology is STATIC

Network Flow Optimization
• Original work by Ford and Fulkerson (1956)
• One of the most studied areas in optimization
• Three main problem types:
  – Shortest path problems
  – Maximum flow problems
  – Minimum cost flow problems
• Special interpretation for some of the most celebrated results in optimization theory
• Broad applicability to a variety of problems

Single Commodity Flow Problem
Notation:
  N : set of nodes, indexed i = 1, 2, …, N
  A : set of arcs, indexed j = 1, 2, …, M
  d_i : demand (supply) at node i
  f_j : flow along arc j
  u_j : capacity along arc j
  A : node-arc incidence matrix, with
      a_ij = +1 if arc j enters node i, −1 if arc j exits node i, 0 otherwise
A set of flows f is feasible if it satisfies the constraints:
  A_i f = d_i   for all i in N   (flows balanced at node i, and supply/demand is satisfied)
  0 ≤ f_j ≤ u_j   for all j in A   (flow on arc j no more than capacity)
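To make the notation concrete, here is a small feasibility check (an illustration, not part of the original deck); the three-node network, demands, capacities, and candidate flow are all invented:

  import numpy as np

  # Toy network: nodes 0..2, arcs (tail, head): 0->1, 1->2, 0->2
  arcs = [(0, 1), (1, 2), (0, 2)]
  num_nodes = 3

  # Node-arc incidence: a_ij = +1 if arc j enters node i, -1 if it exits node i
  A = np.zeros((num_nodes, len(arcs)))
  for j, (tail, head) in enumerate(arcs):
      A[tail, j] = -1.0
      A[head, j] = +1.0

  d = np.array([-2.0, 0.0, 2.0])   # node 0 supplies 2 units, node 2 demands 2
  u = np.array([1.5, 1.5, 1.0])    # arc capacities
  f = np.array([1.0, 1.0, 1.0])    # candidate flow

  balanced = np.allclose(A @ f, d)             # A_i f = d_i at every node
  within_capacity = np.all((0 <= f) & (f <= u))
  print("feasible:", balanced and within_capacity)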
Single Commodity Flow Problem
For a single source s and sink t, the feasible region for sending δ units from s to t, denoted F(δ), is:
  A_i f = δ if i = s, −δ if i = t, 0 otherwise   (flows balanced at node i)
  0 ≤ f_j ≤ u_j   for all j in A   (flow on arc j feasible)

Minimum Cost Problem
Let c_j = cost on arc j.
  minimize_f  Σ_{j in A} c_j f_j
  subject to:
    A_i f = δ if i = s, −δ if i = t, 0 otherwise   (flows balanced at node i)
    0 ≤ f_j ≤ u_j   for all j in A   (flow on arc j feasible)

Shortest Path Problem
Let the costs c_j correspond to "distances", and set δ = 1:
  minimize_f  Σ_{j in A} c_j f_j
  subject to:
    A_i f = 1 if i = s, −1 if i = t, 0 otherwise   (flows balanced at node i)
    0 ≤ f_j ≤ u_j = 1   for all j in A   (flow on arc j feasible)

Maximum Flow Problem
  maximize_f  δ
  subject to:
    A_i f = δ if i = s, −δ if i = t, 0 otherwise   (flows balanced at node i)
    0 ≤ f_j ≤ u_j   for all j in A   (flow on arc j feasible)

Network Optimization
Traditional assumptions:
– Complete information
– Static network (capacities, demands, topology)
– Centralized decision maker
The solution is obtained from global optimization algorithms.
Relevant issues:
– Computational (time) complexity
  • A function of problem size (number of inputs)
  • Based on worst-case data
– Parallelization (decomposition)
– Synchronization (global clock)
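All three formulations are linear programs, so a centralized solver handles them directly. A sketch of the minimum cost problem using scipy.optimize.linprog, on the same invented toy network as before:

  import numpy as np
  from scipy.optimize import linprog

  # Toy network: arcs 0->1, 1->2, 0->2 with node-arc incidence matrix A
  arcs = [(0, 1), (1, 2), (0, 2)]
  A = np.zeros((3, 3))
  for j, (tail, head) in enumerate(arcs):
      A[tail, j], A[head, j] = -1.0, 1.0

  c = np.array([1.0, 1.0, 3.0])    # arc costs
  u = np.array([1.5, 1.5, 1.0])    # arc capacities
  d = np.array([-2.0, 0.0, 2.0])   # send 2 units from node 0 to node 2

  # Minimum cost flow:  min c^T f  s.t.  A f = d,  0 <= f <= u
  res = linprog(c, A_eq=A, b_eq=d, bounds=list(zip(np.zeros(3), u)))
  print("optimal flow:", res.x, "cost:", res.fun)

Here the cheap two-arc path 0→1→2 is filled to its 1.5-unit capacity and the remaining 0.5 units take the expensive direct arc.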
New Challenges
Most traditional assumptions no longer hold…
• Modern networks are inherently dynamic
  – Connectivity fluctuates, components fail, growth is ad hoc
  – Traffic demands and patterns constantly change
• Explosive growth → massive size scale
• Faster technology → shrinking time scale
• Operating decisions are made with incomplete, incorrect information
• Claim: a global approach based on static assumptions is no longer viable

Cascading Failures & Flow Networks
• In general, we assume that network failures result from violations of network constraints
  – Node feasibility (flow conservation)
  – Arc feasibility (arc capacity)
• That is, failure ⇔ infeasibility
• The network topology provides the means by which failures (infeasibilities) propagate
• In the optimization context, a cascading failure is a collapse of the feasible region of the optimization problem that results from the interaction of the constraints when a parameter is changed

Addressing New Challenges
• Extend traditional notions of network optimization to model cascading failures in flow networks
  – Allow for node failures
  – Include flow dynamics
• Consider solution approaches based on
  – Decentralized control
  – Local information
• Leverage ideas from the dual problem formulation
• Identify dimensions along which there are explicit tensions and tradeoffs between vulnerability and performance

Dual Problem Formulation
  Primal problem:  min c^T f          s.t.  A f = d,  0 ≤ f ≤ u
  Dual problem:    max λ^T d − μ^T u  s.t.  λ^T A − μ^T ≤ c^T,  λ unrestricted,  μ ≥ 0
• The dual variables λ, μ have an interpretation as prices at nodes and arcs
• Natural decomposition as a distributed problem
  – e.g. nodes set prices based on local information
• Examples:
  – Kelly, Low, and many others for TCP/IP congestion control
  – Boyd and Xiao for dual decomposition of the SRRA (simultaneous routing and resource allocation) problem

Outline
• Background and Motivation
• Union Pacific Case Study
• Conceptual Framework
• Modeling Cascading Failures
• Ongoing Work

Node Dynamics
• Consider each node as a simple input-output system running in discrete time
• Let n(k) = flow being processed in interval k (the load), with input a(k) and output d(k) (the performance)
• Node dynamics: n(k+1) = n(k) + a(k) − d(k)
• Processing capacity is state-dependent: the output is d(k) = μ(n(k)) if n(k) > 0, and 0 otherwise
• a(k) − d(k) indicates how n(k) is changing
• The system is feasible for constant input a(k) < μ*; n* is the corresponding equilibrium point
• The node "fails" if n(k) exceeds a threshold
[Figure: node as an input-output system, with equilibrium n* where μ(n*) matches the input rate]

Network Dynamics
• The presence of an arc between adjacent nodes couples their behavior
[Figure: two nodes in tandem; node 1's output is node 2's input, d1(k) = a2(k), over a connecting arc of capacity u1(k)]
• Arc capacities limit both outgoing and incoming flow

Network Dynamics
• The failure of one node can lead to the failure of another
[Figure: tandem nodes with the arc capacity u1 = 0 after the downstream node fails]
• When a node fails, the capacity of its incoming arcs drops effectively to zero
• The upstream node loses the capacity of that arc
• In the absence of control, the upstream node fails too
Result: node failures propagate "upstream"…
Question: how will the network respond to perturbations?
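The upstream propagation mechanism can be simulated in a few lines. A minimal sketch (all numbers invented), using the p = 1 congestion curve μ(n) = n(1 − n/N) as the state-dependent processing rate; at step 15 the downstream node is failed, the arc capacity drops to zero, and the uncontrolled upstream node overloads in turn:

  # Two nodes in tandem: node 1's departures are node 2's arrivals.
  N, FAIL = 10.0, 10.0          # jam density; a node "fails" when n exceeds FAIL

  def mu(n):
      # Congestion family with K = 1, p = 1: throughput n * (1 - n/N)
      return max(0.0, n * (1.0 - n / N))

  n1, n2 = 2.0, 2.0             # initial loads
  a = 1.5                       # constant external input to node 1
  u12 = 2.0                     # capacity of arc 1 -> 2

  for k in range(40):
      if k == 15:               # disturbance: node 2 fails, arc capacity -> 0
          u12 = 0.0
      d1 = min(mu(n1), u12)     # node 1's output limited by the arc
      d2 = mu(n2)
      n1 += a - d1              # n(k+1) = n(k) + a(k) - d(k)
      n2 += d1 - d2
      if n1 > FAIL:
          print(f"step {k}: upstream node fails (n1 = {n1:.1f})")
          break

Between the disturbance and the upstream failure there is a window of a few steps, which is exactly where local flow control could intervene.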
Network Robustness
Consider the behavior of the network in response to a perturbation to arc capacity:
1. Does the disturbance lead to a local failure?
2. Does the failure propagate?
3. How far does it propagate?
Measure the consequences in terms of:
– Size of the resulting failure island
– Loss of network throughput
Key factors:
– Flow processing sensitivity to congestion
– Network topology
– Local routing and flow control policies
– Time scales

Congestion Sensitivity
In many real network systems, components are sensitive to congestion.
[Figure: system performance vs. system load, showing evidence of congestion]
• Using the aforementioned family of functions, we can tune the sensitivity to congestion
• Direct consequences for local dynamics, stability, and control
• Tradeoff between system efficiency and fragility
• Implications for local behavior

Qualitative Behavior
[Figure: output rate vs. total system load for a fixed input rate, showing a stable equilibrium x1*, an unstable equilibrium x2*, and congestion collapse beyond it]

Qualitative Behavior
[Figure: the same curve divided into fluid processing, mild congestion, and severe congestion regions around x1* and x2*]
The system's response to changes in the input rate is opposite in the fluid vs. congested regions.

Qualitative Behavior
"Efficiency" results in "fragility".
[Figure: raising the input rate moves the equilibria x1*, x2* inward to y1*, y2*, shrinking the safety margin between them]

Ongoing Work
• Modeling the behavior of flow networks
  – Vulnerability to cascading failures
  – Sensitivity to congestion
• Bringing together notions from network optimization, dynamical systems, and distributed control
• Exploring operating tradeoffs between
  – Efficiency and robustness
  – Global objectives and local behavior
  – System performance and system vulnerability
• Collectively, these features provide a framework for the study of real systems
  – UPRR case study
  – Computer networks

Future Directions
• Development of decision support tools for real-time operations
  – Warning systems
  – Incident recovery
• Investigation of issues related to topology
• Notions from economics
  – Network complements and substitutes
  – Node cooperation and competition

Key Takeaways
• Large-scale failures happen
  – Elements of vulnerability are associated with connectivity
  – But we are moving to connect everything together…
• Critical tradeoff for network-based businesses
  – Business profitability from resource efficiency
  – System robustness
• Two fundamental aspects to understanding large-scale failure behavior
  – Networks
  – Dynamics
• Relevance to a wide variety of applications

Thank You
alderd@stanford.edu