Discussion Paper: Risk Tolerance for Widespread BES Outages with Significant Socio-Economic Impacts

This discussion paper presents a probabilistic framework for considering and managing known and potential risks to achieving an Adequate Level of Reliability (ALR) of the Bulk Electric System (BES). The framework weighs the potential socio‐economic costs associated with such risks against the often equally uncertain costs of mitigating them. It could be used to support NERC’s efforts to measure and, when necessary, qualitatively assess trends in BES performance, and to learn from such performance metrics and from other lessons learned.

Changes to the ALR definition could be implemented in a phased approach to more fully realize the goals that the MRC has established. At this stage of development, the ALR Task Force (ALRTF) proposes to continue describing ALR primarily in terms of deterministic criteria. NERC may consider proceeding to a second phase of development where additional probabilistic data and methods could be established to enable ALR to be described in terms of a composite risk tolerance statistic, in alignment with NERC’s goals of using a risk‐based approach to reliability standards and compliance monitoring and enforcement.

The MRC’s BES/ALR Policy Issues Task Force White Paper1 includes two recommendations:

1. “Assess the reliability objectives of ALR criteria and explicitly calculate the cost‐effectiveness of requirements within a reliability standard to meet the reliability objectives”
2.
“Revise ALR defining criteria to address loss of supply, transmission and controlled/uncontrolled load loss as a function of operational planning and operator preparations, as well as the resulting normal and abnormal operating states.”

Although the first recommendation is to perform cost‐effectiveness analyses on a requirement‐by‐requirement basis, presumably during standards development, it would be beneficial to provide higher‐level guidance to standard drafting teams (SDTs) concerning the socio‐economic costs of BES load loss versus the costs that may be incurred to reduce the expected probability that such load loss will occur over the same period. The ALR Task Force does not believe that the tools or data needed to conduct such probabilistic evaluations of socio‐economic costs currently exist, nor is it likely that such estimation will be feasible in the near future. Nonetheless, it may be valuable for SDTs and industry stakeholders to make qualitative assessments of the relative benefits of reducing certain risks in order to enable such cost‐effectiveness analyses and to help prevent inconsistencies between standards. However, for even qualitative comparisons to be sound, there needs to be a common quantitative metric of risk mitigation that can be applied across reliability standards to assess relative cost‐effectiveness. In other words, SDTs are only able to calculate incremental cost‐effectiveness, and without a higher‐level perspective on the integrated cost‐effectiveness of the standards as a whole, it is very difficult to determine whether the incremental effort is truly beneficial.

1 MRC’s BES/ALR Policy Issues Task Force draft White Paper Outline: Cost/Benefit, Load Loss, Cascading Task Team, undated, July 2011. [http://www.nerc.com/docs/standards/AgendaItem_13‐attach‐1.pdf]
Probabilistic risk management may provide a practical approach or strategy to more fully address both of the MRC Policy Task Force recommendations outlined above, by addressing the cost‐effectiveness of NERC standards. To calculate cost‐effectiveness, we need to understand the risk that we are managing through BES reliability standards in terms of both probability and severity (“loss of supply, transmission and … load loss as a function of operational planning and operator preparation”).

Common understanding of ALR would be improved through the use of a composite probabilistic statistic that describes our risk tolerance from a Continental or Interconnection perspective. Such a statistic would be similar in nature to a Cash Flow at Risk metric, which is used to manage company financials and is a composite of all the underlying financial risks, or to a 30‐year flood plain metric used to describe the risks of living in a certain location. In many jurisdictions, such risk tolerance methods are already used to establish state or provincial resource adequacy requirements through the use of a LOLE / LOLP methodology to establish planning reserve margins (typically described as a one‐day‐in‐ten‐years chance of rolling blackouts due to capacity shortage). The idea would be to expand this concept to encompass all risks of widespread BES outages with significant socio‐economic impacts. The analysis would exclude non‐BES events, such as hurricanes and other severe weather events that cause severe damage to distribution systems. (Non‐BES events could be analyzed to assess comparative risk tolerances and mitigation costs for distribution outages.)

In order to develop such a probabilistic risk management methodology and establish a risk tolerance, we need to increase our understanding of the probability and severity of the risks as well as the costs and effectiveness of mitigation.
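As a rough illustration of what a composite statistic built from probability and severity might look like, the following sketch sums probability times severity across a small portfolio of risks and compares the expected cost avoided by a mitigation with the cost of that mitigation. The risk names, probabilities, dollar figures and function names are all hypothetical. The paper does not prescribe this formula, and expected value is only one of several ways such a statistic could be defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One outage risk; all figures are hypothetical placeholders."""
    name: str
    annual_probability: float  # chance the event occurs in a given year
    severity_cost: float       # socio-economic cost if it occurs ($)


def expected_annual_cost(risks):
    """Composite statistic: sum of probability x severity over all risks."""
    return sum(r.annual_probability * r.severity_cost for r in risks)


def benefit_cost_ratio(before, after, annual_mitigation_cost):
    """Expected annual cost avoided per dollar spent on mitigation."""
    avoided = expected_annual_cost(before) - expected_annual_cost(after)
    return avoided / annual_mitigation_cost


# Hypothetical portfolio before mitigation.
baseline = [
    Risk("cascading outage", annual_probability=0.10, severity_cost=5e9),
    Risk("capacity shortage", annual_probability=0.10, severity_cost=2e8),
]
# Hypothetical effect of a standard that halves the cascading-outage probability.
mitigated = [
    Risk("cascading outage", annual_probability=0.05, severity_cost=5e9),
    Risk("capacity shortage", annual_probability=0.10, severity_cost=2e8),
]

ratio = benefit_cost_ratio(baseline, mitigated, annual_mitigation_cost=1e8)
print(f"Expected annual cost before: ${expected_annual_cost(baseline):,.0f}")
print(f"Expected annual cost after:  ${expected_annual_cost(mitigated):,.0f}")
print(f"Benefit-cost ratio: {ratio:.2f}")
```

A ratio above 1 would suggest, under these assumed inputs, that the mitigation avoids more expected cost than it incurs. In practice the inputs themselves carry the uncertainty that the rest of this paper discusses.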
The Standards are Used to Manage Socio-Economic Risks

The ALR Reliability Objectives and Performance Outcomes do not operate in isolation. Rather, reliable planning and operation of the BES is governed by: 1) the statutory framework established by Section 215 of the Federal Power Act and by corresponding obligations established by Canadian provincial authorities, 2) the expectations of numerous other institutions, including state and local regulators which have jurisdiction over rates charged to ultimate consumers, 3) public and consumer expectations that they will receive reliable yet affordable electric service, and 4) the industry’s broader public interest obligation to ensure reliable operation of one of modern society’s most critical infrastructures. Indeed, it is widely recognized by electric sector policymakers, and by government and industry generally, that wide area outages of the electric power system, due to major BES events or to extreme events affecting local distribution systems (which are generally weather‐related), impose a substantial burden on modern society.

Hence, the basic question is: are we using the standards to manage local, customer‐centric risks, or macro/socio‐economic risks? The ALR Task Force believes that the larger purpose of NERC Reliability Standards and the NERC statutory construct is to manage socio‐economic risks, that is, the risks that an unreliable BES poses to the “common good”. The common good describes a specific good that is shared and beneficial for all (or most) members of a given community. Recognizing that it is cost prohibitive to build electric infrastructure that provides more than one level of service, common good means establishing a level of reliability that properly balances reliability vs.
the cost of providing that reliability for most of society, rather than meeting the needs of those with the highest demand for reliability, because the latter would unduly burden the remainder of society with unnecessary costs.

By way of analogy, there are some customers of water systems who require distilled water. It is not in the interest of the “common good” to design water systems that provide distilled water to every customer; rather, it is in the interest of the “common good” to deliver potable water to every customer, and it is prudent for individual customers who need distilled water to install their own water treatment facilities at their own cost so that society as a whole is not burdened with unnecessary costs. Hence, the measure of adequate water quality is not the needs of individual customers (e.g., distilled water) but rather the threshold that defines “potable” for the “common good”. In a similar vein, in establishing a risk tolerance, we need to consider socio‐economic impacts to the “common good”, not micro‐economic impacts to individual customers.

As a result of recent adverse reliability events, policy‐makers and regulators are questioning whether the industry has struck the right balance and whether the “bar should be raised” on BES reliability to a higher level of ALR. By analogy, is the reliability provided by the BES “potable” to society, or are there too many imperfections and contaminants in the existing supply? Or should individual customers bear the risks and costs of a higher level of service, e.g., distilled household drinking water? A probabilistic risk management framework could be used to answer this question. Again by analogy, if a city has mitigated the risks of being located in a 30‐year flood plain, how do we frame the question of whether additional investment in dikes, dams and drainage is prudent to reduce the probability of damages from an extremely severe flood that may occur only once in 100 years?
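One simple way to compare the 30-year and 100-year flood cases in this analogy is to convert each return period into the chance of at least one occurrence over a planning horizon. The sketch below assumes that years are independent and that a 1-in-T-year event has a 1/T probability in any given year. The 30-year horizon and the function names are illustrative assumptions, not values taken from this paper.

```python
def annual_exceedance_probability(return_period_years):
    """A 1-in-T-year event has a 1/T chance of occurring in any one year."""
    return 1.0 / return_period_years


def probability_within(return_period_years, horizon_years):
    """Chance of at least one occurrence within the horizon,
    assuming independent, identically distributed years."""
    p = annual_exceedance_probability(return_period_years)
    return 1.0 - (1.0 - p) ** horizon_years


horizon = 30  # hypothetical planning horizon in years
for period in (30, 100):
    chance = probability_within(period, horizon)
    print(f"1-in-{period}-year flood over {horizon} years: {chance:.1%}")
```

Under these assumptions the rarer event remains far from negligible over a multi-decade horizon, which is why the question of additional investment cannot be settled by the return period alone.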
Probabilistic socio‐economic risk analyses can be performed to determine the costs / benefits to society of such investments. However, the complexities of the BES network and the multiplicity of Disturbances that might occur make such calculations exceedingly difficult for the electric industry.

Alignment of a Probabilistic Risk Management Framework with the Federal Power Act Section 215

The ALR Task Force has framed the definition of ALR by defining Reliability Objectives (end state, goal) and Performance Outcomes (that achieve the end state goal). The Federal Power Act (FPA) Section 215, which lays out the regulatory schema for NERC in the United States, uses this same construct in section 215(a)(4):

“The term ‘reliable operation’ means …”
• How – strategy – Performance Outcomes: “operating the elements of the bulk‐power system within equipment and electric system thermal, voltage, and stability limits …”
• What – goal – Reliability Objective: “so that instability, uncontrolled separation, or cascading failures of such system will not occur as a result of a sudden disturbance, including a cyber security incident, or unanticipated failure of system elements.”

In point of fact, these statutory objectives are at least one step removed from the performance outcomes sought by both policy‐makers and the public, which is a relative freedom from, or low probability of experiencing, a widespread performance failure on the BES that results in a widespread outage or blackout.
In defining Adequate Level of Reliability, we must also consider the “public interest.” FPA Section 215(d)(2) states:

“The Commission may approve, by rule or order, a proposed reliability standard or modification to a reliability standard if it determines that the standard is just, reasonable, not unduly discriminatory or preferential, and in the public interest.” (Emphasis added)

Society cannot afford a power system that is immune to widespread outages, e.g., immune to natural disasters. Such a system would not be in the “public interest” because its cost would be disproportionate to the corresponding benefits to society and would consume resources that could otherwise be applied to achieving other essential societal benefits, potentially slowing the economy. Hence, an ALR ought to balance the socio‐economic benefits of reducing the probability of widespread outages that result from “instability, uncontrolled separation, or cascading failures” of the BES against the socio‐economic costs of preventing the “instability, uncontrolled separation, or cascading failures” that result in widespread outages. For purposes of this discussion paper, we will use the terms “widespread outage” and “blackout” interchangeably to mean the performance outcome(s) of BES Disturbances that result in “instability, uncontrolled separation, or cascading failures” of the BES.

A Risk Management Framework and Establishing a Risk Tolerance

It is impossible to prevent widespread outages from ever happening. However, we can take steps to reduce the expected frequency, scope and duration of widespread BES outages. As a result, two questions naturally arise:
1. How often is too often?
2. What distinguishes a widespread outage from a local area outage?

To address these questions, a “risk tolerance” could be established consisting of:
1.
A target probability of occurrence (expected frequency) that balances the socio‐economic risks/costs of actual Blackouts vs. the socio‐economic impacts of increased cost of electric service to prevent Blackouts.
2. The expected severity (measured in scope and duration) of the risks to be managed. This is analogous to choosing an insurance deductible below which we are willing to assume the risk of financial loss. In this case, these thresholds would help establish the scope and duration of a Blackout managed through the NERC regulatory construct vs. the outages that are managed either through the assumption of risk or, alternatively, through other mitigation programs undertaken under state/local jurisdictional obligations.

In order to establish this threshold of magnitude, a socio‐economic probability/severity threshold can be established above which a widespread outage is deemed to have a significant socio‐economic impact.

As an end‐state, a BES Risk Management Objective for a potential phase 2 of the ALR definition could be described in terms of a risk tolerance similar to the following:

BES Risk Management Objective 1
Protect the socio‐economic fabric of North America by managing the expected frequency and severity (scope and duration) of widespread outages such that these events are expected to occur less frequently than once in XX years per MW of BES load.

Such an industry‐wide over‐arching metric would help industry, policy makers and regulators understand more concretely what level of reliability we strive to achieve in North America.

There are several challenges in establishing such a risk tolerance metric:

• Measurement unit for socio‐economic impact – how would we measure socio‐economic impact? 1) By socio‐economic data such as impact on GDP or population that lost power; 2) by electric quantities as proxies for socio‐economic data, such as MWh; or 3) by some other measure?
Another significant challenge is how to estimate expected frequency, e.g., a widespread outage occurs once in X years per Province / State, or per some other statistical measure of population over which the frequency is calculated.

• Establishing the socio‐economic impact threshold – establishing a threshold magnitude of socio‐economic impact will require significant input from policy‐makers (federal, state, provincial and local) and from economists.

• Establishing the frequency threshold – establishing a threshold essentially requires a high‐level cost‐effectiveness analysis to find the fulcrum point, by order of magnitude, between the socio‐economic impacts of the risk of widespread outages vs. the socio‐economic impacts of reducing the frequency or magnitude of widespread outages. Some estimates of highly improbable but very high impact events would also need to be considered in this calculation (e.g., a Japan earthquake / tsunami type event).

• Differentiating between BES and major non‐BES events – many Acts of Nature do tremendous damage to distribution systems while leaving the BES relatively unscathed (e.g., Category 1 or 2 hurricanes, ice storms, etc.). The ALR definition is intended to address risks to the BES; hence, methods would need to be developed to clearly distinguish between BES and non‐BES events.

Quantifying high impact, low frequency BES event risk in the face of uncertainty

Probabilistic analyses raise risks of false inferences from historical data/experience because there are no probability distributions for high impact, low frequency (HILF) events. As with econometric load forecasts, actual experience will always differ from the forecast. Many potential events that pose a risk of severe BES impacts may never occur. Event severity may be equally uncertain/variable. Prior BES events may or may not provide a valid indication of future risks.
Conversely, emerging threats and technological/market changes may pose significant new risks to the BES without any historical antecedents.

Probability Assessment Impact on Current Practices

Rigorous methods to assess and combine probability assessments can impact existing industry practices and standards. It is important to understand how probability assessments are performed, how the results are expected to change existing standards for BES design and operation, and the expected performance outcomes for these changes (e.g., changes to BES performance, risk reduction, or organizational capabilities).

Establishing a risk tolerance in terms of both probability and severity also helps us determine how to manage certain types of risks. Risk management is typically thought about in terms of a grid similar to the following:

                  Low Probability                       High Probability
High Severity     3. High Impact, Low Probability       1. High Impact, High Probability
Low Severity      4. Low Impact, Low Probability        2. Low Impact, High Probability

We have already used this terminology and framework in our discussions at NERC, e.g., the term High Impact, Low Frequency (HILF) corresponds to the 3rd quadrant above and is part of the NERC vocabulary. Because our industry also uses the term frequency to mean something different, e.g., 60 Hz, the term probability is used in this report. We also use probability because the measures we seek to define are forward‐looking, while frequency is a measure of past performance.

Depending on where the risk falls in that grid, very different risk management strategies are developed to manage those risks. For example:
1. High Impact, High Probability ‐ one goal of risk management is to prevent any risks from residing in the high impact, high probability quadrant, by taking measures to reduce the severity or to reduce the probability of the risk, or both.
This is accomplished through establishing planning, design and operating criteria that eliminate events from this quadrant by moving them into another quadrant, generally quadrants 2 or 3.
2. Low Impact, High Probability events are typically managed through establishing deterministic criteria for managing those risks based on the probabilistic risk assessment, e.g., managing use of facilities to not exceed thermal limits for single contingencies, developing a reserve margin based on LOLE/LOLP analysis, or managing to a Cash Flow at Risk metric typical for a purchasing and selling entity.
3. High Impact, Low Probability events are typically managed in one of two ways depending on the type of risk:
   a. Emergency Preparedness and Response programs are used to help mitigate the severity of the event. This strategy is typically used when it is not prudent to plan, design and operate the BES to prevent these types of events from occurring, due to the costs or infeasibility of prevention. However, in many (but not all) cases, it is practical to take prudent steps to prepare for such events. Examples include acts of nature (e.g., hurricane), acts of aggression (e.g., physical attack) and bad luck (e.g., a string of multiple unrelated contingencies that occur by happenstance). Strategies include containing the geographic scope of the event, minimizing damage to equipment and advanced preparation to ensure rapid restoration.
   b. Defense in Depth is a strategy used to help further reduce the probability that an event will occur or to reduce the severity of the event. Examples may include preventing or mitigating acts of aggression (e.g., cyber attack) through cyber standards, and reducing human error through training, procedures and human‐machine interface design.
4.
Low Impact, Low Probability events are often used as learning opportunities to help prevent low impact, low frequency events from contributing to high impact or high frequency events.

These risk management strategies clearly map to the “Reliability Risk Management Concept” curve developed elsewhere within NERC, which has been used to characterize the severity and frequency of various major BES events analyzed by the NERC Reliability Assessments Department. See in particular the Integrated Bulk Power System Risk Assessment Concepts White Paper dated August 30, 2010.2

[Figure: Reliability Risk Management Concepts]

By using such a probabilistic risk management framework, it is plausible to reduce the Reliability Objectives to two Performance Outcomes, with associated Risk Management Strategies supporting those outcomes in alignment with the four quadrant approach described above. The Performance Outcome of quadrants 1, 2 and 4 is to prevent a widespread outage. This leaves describing the Performance Outcome for quadrant 3, High Impact, Low Probability events.

2 http://www.nerc.com/docs/pc/rmwg/Integrated_Bulk_Power_System_Risk_Assessment_Concepts_Final.pdf

Risk Management Framework for High Impact, Low Probability Events

By their very nature, High Impact, Low Probability (HILP) events are difficult to characterize and measure. This class of events is often viewed as uninsurable because actuarial science is not applicable. These events are also sufficiently infrequent that they do not significantly impact the public risk tolerances discussed above and are not typically included in probabilistic/stochastic risk management methodologies.
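To make the four‐quadrant grid described earlier in this paper concrete, the following sketch places a risk in a quadrant and looks up the corresponding management strategy. The numeric thresholds, the MWh severity unit and all names in the code are hypothetical; this paper does not define cut‐off values, and the strategy strings are short paraphrases of the quadrant discussion rather than official NERC language.

```python
from enum import Enum


class Quadrant(Enum):
    HIGH_IMPACT_HIGH_PROBABILITY = 1
    LOW_IMPACT_HIGH_PROBABILITY = 2
    HIGH_IMPACT_LOW_PROBABILITY = 3
    LOW_IMPACT_LOW_PROBABILITY = 4


# Strategy summaries paraphrased from the quadrant discussion in this paper.
STRATEGY = {
    Quadrant.HIGH_IMPACT_HIGH_PROBABILITY:
        "Reduce severity and/or probability to move the risk out of this quadrant",
    Quadrant.LOW_IMPACT_HIGH_PROBABILITY:
        "Manage through deterministic criteria derived from probabilistic assessment",
    Quadrant.HIGH_IMPACT_LOW_PROBABILITY:
        "Emergency Preparedness and Response, or Defense in Depth",
    Quadrant.LOW_IMPACT_LOW_PROBABILITY:
        "Use as a learning opportunity",
}


def classify(probability, severity, probability_threshold, severity_threshold):
    """Place a risk in the grid; the thresholds are assumed inputs, not NERC values."""
    high_impact = severity >= severity_threshold
    high_probability = probability >= probability_threshold
    if high_impact and high_probability:
        return Quadrant.HIGH_IMPACT_HIGH_PROBABILITY
    if high_probability:
        return Quadrant.LOW_IMPACT_HIGH_PROBABILITY
    if high_impact:
        return Quadrant.HIGH_IMPACT_LOW_PROBABILITY
    return Quadrant.LOW_IMPACT_LOW_PROBABILITY


# Hypothetical risk: 1% annual probability, 50,000 MWh of unserved energy.
quadrant = classify(0.01, 50_000, probability_threshold=0.1, severity_threshold=10_000)
print(quadrant.name, "->", STRATEGY[quadrant])
```

The hard boundaries are a simplification. In practice the placement of a risk near a threshold would be a judgment call, particularly for quadrant 3 events whose probabilities are poorly known.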
In financial risk management, these types of risks are managed through scenario and decision tree analyses as opposed to stochastic / probabilistic methods such as Cash Flow at Risk.

What we can characterize is that these types of events are beyond the design, planning and operating criteria of the BES. As such, Adverse Reliability Impacts are likely if such an event were to occur. Typical risk management practices for HILP risks are to:
1. Inventory the risks
2. On a risk‐by‐risk basis, determine if the risk is prudently manageable
3. If the risk is prudently manageable, develop pragmatic strategies to manage those risks.

For instance:
• Unmanageable risks – in the extreme, unmanageable risks include events such as a large asteroid impact and eruption of the Yellowstone super‐volcano. These are the types of risks that are not pragmatic to mitigate due to very low probability, extreme cost, impractical risk mitigation, etc.
• Manageable risks – these are the risks that through pragmatic review have been determined to justify some level of investment to manage, recognizing that there is a continuum of risk mitigation strategies and that different levels of investment are justified for different risks based on the probability and severity of the risk and the costs of the alternative risk mitigations. As discussed above, typical risk mitigations include:
   o Emergency Preparedness and Response – as reflected in: 1) the current EOP standards, 2) utility emergency plans for hurricanes and other acts of nature, 3) spare equipment strategies, and 4) fuel switching strategies
   o Defense in Depth – such as cyber security risk management reflected in the current NERC CIP standards.

To establish a pragmatic risk mitigation strategy for individual risks, a sensible review on a risk‐by‐risk basis is necessary.
Such a review would likely need to be a collaboration between policy‐makers, regulators and industry to fully understand the socio‐economic risks and the cost‐effectiveness of potential risk mitigation measures. For instance, a current concern of some policy‐makers is whether the industry has sufficient spare transformers to survive an extreme solar flare / geomagnetic disturbance (GMD).3 An analysis of the costs of such spare transformers vs. the probability of such an event would be a pragmatic activity to determine if the risk is worth managing, and if so, by what strategy.

A second BES Risk Management Objective could be phrased as:

BES Risk Management Objective 2
High Impact, Low Probability risks are pragmatically addressed to mitigate manageable risks and characterize unmanageable risks.

Conclusion

A probabilistic risk management approach could become a second phase of the ALR definition effort. A probabilistic risk‐based approach would more fully describe “loss of supply, transmission and … load as a function of … planning and … preparation” (emphasis added) and provide a framework for “cost‐effectiveness” studies of risk mitigation strategies contained in standards and their requirements, providing over‐arching guidance to standard drafting to more fully meet the two recommendations from the MRC’s BES/ALR Policy Issues Task Force White Paper. A risk‐based approach also provides an appropriate framework for prudence reviews of investment, similar in nature to how probabilistic LOLE / LOLP methods are used to establish reserve margins that in turn justify investments in integrated resource plans under state / local jurisdiction. The review methodologies may vary for High Impact, Low Probability events, but prudence demands that risk characterizations and cost analyses always precede risk mitigation, except in cases of extreme imminent threats.
In addition, a risk‐based approach would help industry determine the “biggest bang for the buck”, enabling industry to better prioritize risk mitigations in terms of costs, time to implement, and effectiveness.

3 2012 Special Reliability Assessment Interim Report: Effects of Geomagnetic Disturbances on the Bulk Power System, February 2012, posted at: http://www.nerc.com/files/2012GMD.pdf
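As a closing illustration of the prioritization described in the conclusion, the sketch below ranks candidate mitigations by expected cost avoided per dollar spent and uses time to implement as a tie‐breaker. The mitigation names, costs, risk reductions and timelines are invented for illustration only, and ranking by a single ratio is an assumed simplification rather than a method proposed in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    """A candidate mitigation; every figure here is a made-up placeholder."""
    name: str
    annual_cost: float            # cost of the mitigation ($/year)
    annual_risk_reduction: float  # expected socio-economic cost avoided ($/year)
    years_to_implement: float

    @property
    def bang_for_buck(self):
        """Expected cost avoided per dollar spent."""
        return self.annual_risk_reduction / self.annual_cost


def prioritize(mitigations):
    """Rank by cost-effectiveness; break ties in favor of faster implementation."""
    return sorted(mitigations, key=lambda m: (-m.bang_for_buck, m.years_to_implement))


candidates = [
    Mitigation("spare transformers", 4e7, 6e7, 3.0),
    Mitigation("operator training", 5e6, 2e7, 1.0),
    Mitigation("new transmission line", 2e8, 2.5e8, 8.0),
]

ranked = prioritize(candidates)
for m in ranked:
    print(f"{m.name}: {m.bang_for_buck:.2f} avoided per dollar, "
          f"{m.years_to_implement} years to implement")
```

A fuller treatment would likely weigh implementation time and uncertainty in the risk‐reduction estimates more explicitly than a tie‐breaker allows.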