A Project to Develop a Handbook of DOE Operational Safety Event and Accident Investigation Techniques

Concept White Paper

D. Pegram and E. Carnes
Office of Health, Safety and Security (HSS), Office of Corporate Safety Programs
HS-31 Accident Prevention and Investigation Program
October 15, 2010

The objective of this project is to convert the existing DOE Accident Investigation (AI) “Workbook” into a DOE Technical Standard as defined in DOE-TSPP-5, dated July 1, 2009, and to update the material to include current thinking, methods, and approaches for analysis and the conduct of investigations. The DOE AI Workbook has not been updated since 1999. To accomplish this, HS-31 has formed a task team composed of HSS Subject Matter Experts, EFCOG Contractor Subject Matter Experts, and the DOE Program Offices’ Accident Investigation Points of Contact.

This white paper addresses the overall concepts to be introduced, the approach, and the focus of the Handbook on Accident Prevention, Highly Reliable Organizations (HRO)/Organizational Learning, and Human Performance Improvement (HPI).

This activity supports the establishment of a DOE Community of Interest that promotes the role of stakeholders in owning, developing, and using the “Handbook” as a living document, to be updated as knowledge matures from using the techniques.

Why Adopt the Principles of a High Reliability Organization (HRO) and Employ Operational Safety Event and Accident Analysis?

“It’s impossible to solve significant problems using the same level of knowledge that created them!” A. Einstein.

A key characteristic of successful technology-centered organizations is their ability to learn and improve. The importance of quality organizational learning has been researched and written about by W. Edwards Deming, Peter Senge, and a host of others.
The Integrated Safety Management Feedback and Improvement function envisions organizations that continually monitor performance, identify deviations or questionable conditions, self-assess, and use quality analysis to improve.

Operational Safety Event and Accident Analysis is a process that uses a set of tools built through consensus, academic disciplines, industry, application, and practice. These tools have been found valuable in determining safety management system and human performance conditions, deficiencies, and related causal factors, and can be used to prevent accidents.

Organizational/Operational Safety analysis is an essential component of learning, as is learning from accidents and near misses. We often call this activity self-assessment, a familiar process within DOE. Highly reliable organizations seek to learn from non-consequential deviations from expectations. Treating small, unexpected things as possible precursors of declining conditions is a characteristic of organizations that excel. These organizations develop learning capabilities and analytical tools that span the full range of operational performance, from normal operations to accidents and all points in between.

DOE has compiled a suite of analytical tools and years of experience in accident investigation. DOE’s accident investigation program and tools have served as models for many other government accident investigation programs. The broader organizational learning applicability of the investigation tools and their theoretical underpinnings is not, however, widely understood within DOE. The proposed new technical standard intends to bring together, in a single reference document, the strengths of the existing AI Workbook, the experience gained in accident investigation and in analyses of lower-level events, and the performance improvement insights from HROs and the DOE HPI applications.
The intent is to communicate how these tools and concepts may be used to self-assess and improve across the full operational performance spectrum.

The High Reliability Organization (HRO)

A 2009 study on the evaluation of nuclear power organizations stresses key aspects of organizational learning in safety-critical domains. Typical self-assessments

“…do not pay enough attention to the low probability high consequence incidents. Those “obscure hazards” should be identified better since they are also the probable causes of accidents after the more easily observed high probability hazards have been controlled. This requires going behind the surface levels and analyzing the hazards the organization initially considers not significant. It also requires a good understanding of the technical features of the systems as well as the social system (creating hazards through human action or inaction). Further, going beyond the surface level of the organization requires adequate evaluation tools combined with an ability to use them correctly.” (Teemu Reiman & Pia Oedewald: Evaluating Safety Critical Organizations, VTT, 2009)

Organizational learning is a core competency of what are known as high reliability organizations (HROs). Research on HROs was introduced into DOE ISM work through the 2004 DNFSB document DNFSB/TECH-35, Safety Management of Complex, High-Hazard Organizations. The DOE ISM Manual also discusses what DOE might learn from HROs: “Experience and research with safety cultures and high-reliability organizations (HRO) over the past ten or more years have raised new insights and deeper understanding relevant to the desired work environment for effective safety management.” High reliability concepts also generated a number of the concepts and techniques found in the DOE Human Performance Improvement Program enhancements used to add further robustness to earlier ISM implementation.
A High Reliability Organization (HRO) is one that, in spite of dealing with hazardous, high-consequence operations, does so successfully and demonstrates a trend of continuous safety performance improvement. The HRO recognizes that near misses and adverse trends in safety performance are indicators of the potential for high-consequence events. We seek to look at work as it is “actually” performed, rather than how we “imagine” it to be performed. A key attribute of being an HRO is learning from the organization’s mistakes. Operational Safety/Accident Analysis tools, both those already developed and those currently being documented and refined by the joint DOE/EFCOG project, will be used to learn from areas of management system/performance concern, information-rich events, occurrences, and accidents.

HRO requires a “Just Culture”

An environment that recognizes the human potential for error and clearly and consistently defines acceptable performance and safe behavior.
Recognition of fairness related to the identification and resolution of human performance problems.
Distinction between honest mistakes and intentional shortcuts with respect to discipline.
Free flow of plant information across all levels of an organization.
High level of self-reporting.

HRO Concepts

Accidents can be prevented by proper organizational design supported by a proactive learning culture and operational safety management systems.

1) Manage the system, not the parts.
2) Introduce stability: reduce variability in the HRO system.
3) Foster a culture of reliability.
4) “The Learning Organization” learns and adapts as an organization.
5) Focus is on the organization, not individuals.
6) Recognition that everyone makes mistakes.
7) Encourage and reward personnel for disclosing errors, without adverse consequences for doing so.
The Role of Human Performance Improvement (HPI)

Human performance as it applies to the individual is a series of behaviors executed to accomplish specific task objectives. Organizationally, human performance is the aggregate system of processes, influences, behaviors, and results that become manifest in the way work is conducted. We accept that the greatest cause of human error is weaknesses in the organization’s management systems, not lack of skill or knowledge. Latent organizational weaknesses are hidden deficiencies in management control processes or values that create workplace conditions that can provoke an error and/or degrade the integrity of defenses.

HPI Event Analysis - A High Reliability Organization fixes the organization, not “the people”.

Review the event from both perspectives: how the people involved in the event acted (context) and how the safety management system performed.
Evaluate the organization’s performance prior to and leading up to the event.
Recognize that the event or accident is the effect or symptom of embedded, latent organizational deficiencies and deeper trouble in the organization, not a random chance event.
If the management response to the event focuses only on the single apparent or root cause, the other management system and human performance errors that led to contributing causes will not be addressed.

The recommended approach and the analytical tools for performing this safety management system analysis will be collected in the handbook to be developed: DOE-HDBK-XXXX-2011, HANDBOOK FOR DOE OPERATIONAL SAFETY EVENT ANALYSIS AND ACCIDENT INVESTIGATION. The basis for the Handbook will be the conversion of the existing Accident Investigation Workbook into a DOE Technical Standard. The project will be a joint activity sponsored by HSS, the PSO AI POCs, and EFCOG.

HPI investigation techniques are being added to our current event and accident investigation processes.
The value of investigating an event or accident using HPI is in identifying organizational weaknesses not found by other methods. HPI’s broader view identifies human performance and systemic causes in management systems that impact the organization, rather than only addressing the areas needed to prevent the specific event from recurring.

The concepts and principles recommended are contained in: DOE-HDBK-1028-2009, June 2009, DOE STANDARD HUMAN PERFORMANCE IMPROVEMENT HANDBOOK VOLUME 1: CONCEPTS AND PRINCIPLES.

The examples of the Analytical Methods and Approaches recommended are contained in: DOE-HDBK-1028-2009, June 2009, DOE STANDARD HUMAN PERFORMANCE IMPROVEMENT HANDBOOK VOLUME 2: HUMAN PERFORMANCE TOOLS FOR INDIVIDUALS, WORK TEAMS, AND MANAGEMENT.

HRO/HPI and ISM Integration Pathway

The HRO concept needs to be integrated into the site’s ISM System Description. Incorporating HRO/HPI strengthens our ISM system. ISM System attributes need to include HRO/HPI concepts, analysis methods, and feedback and improvement practices.

HRO/HPI Principles that complement ISM Core Functions and Guiding Principles are:

1) Each employee instinctively feels responsible for safety.
2) Leaders demonstrate commitment to safety.
3) Trust towards each other.
4) Decision-making reflects safety as the overriding priority.
5) An inquisitive operational safety attitude and behavior is essential.
6) A disciplined authorization basis system is in place to ensure all hazards are identified and mitigated before work begins.
7) Organizational learning is embraced.
8) We openly examine our operations and solicit feedback on errors in defining work that can lead to mistakes in analyzing hazards and produce human performance errors.
9) We dig to determine the human performance errors in the presence of flawed defenses that can have high consequences.
ISM - Define the Scope of Work / HPI - Determine the Context
Determine the location, day of week, and time of day; whether the work is considered routine or special; error precursors; latent organizational weaknesses; and the relationship to other scheduled work. (Utilize the HPI Human Error Precursor List.)

ISM - Identify and Analyze the Hazards / HPI - Identify Critical Steps
Use Operational Safety Event and Accident Analysis Methods to identify where a human error, if it has occurred or will occur, would result in an unwanted outcome.

ISM - Develop and Implement Hazard Controls / HPI - Integrate HPI Tools
Review the HPI tool box for the tools best suited to the given critical steps and integrate them into the job safety hazard analysis, task analysis, and work control process.

ISM - Perform Work within Controls / HPI - Utilize HPI Tools
Implement enhanced training, pre-job briefings, and job aids that incorporate the use of Human Performance Improvement Tools.

ISM - Provide Lessons Learned Feedback and Improvement / HPI - Capture and Communicate Lessons Learned
Feedback information on the adequacy of Human Performance Tools is gathered; opportunities for improving the definition and planning of work are identified and implemented. Safety Management Systems, barriers, and defenses are evaluated for integration of lessons learned.