CHERNOBYL

Human Reliability
Definition: “The probability that a person will correctly perform some system-required activity during a given time period (assuming time is a limiting factor) without performing any extraneous activity that can degrade the system” (Hollnagel, 2002)

Human Reliability Assessment (HRA)
- Assessment of the impact of human errors on system safety and, if warranted, the specification of ways to reduce human error impact and/or frequency.
- HRA is far from being a precise science, but it is a useful means of identifying and prioritizing safety vulnerabilities to human error (HE), thereby reducing the frequency of accidents.
- Hybrid area:
  - Engineering & reliability
  - Psychology
  - Ergonomics

Probabilistic Safety Analysis (PSA)
- Engineering approach
- Quantifies the expected frequencies of accidents, which are then compared against predefined risk criteria.
- HRA must be incorporated into PSA if risk is to be properly estimated…hence the need for the theoretical framework (psychology and ergonomics)

HRA History
- Started early 1960s…expanded greatly since 1979. Why?
- Followed exactly the same procedure as conventional reliability analysis, with human tasks substituted for equipment failures
- Greater variability and interdependence for human performance (‘human factor’)
- How can we get this ‘variability’ and ‘interdependence’ information?

Understanding HRA
- The accident sequence being analyzed is represented as an event tree (see next slide).
- Nodes represent specific tasks/functions with 2 outcomes (success/failure)
- Engineering approaches (PRA/PSA) can calculate failure probabilities in terms of material & process information, but HRA must also account for the “human factor” to determine if the human AS A COMPONENT will fail.
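The event-tree idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the slides: the task names and failure probabilities are invented, and the node outcomes are assumed to be independent, which is precisely the simplification that human ‘interdependence’ calls into question.

```python
from itertools import product

# Hypothetical task nodes with illustrative failure probabilities (not real data).
nodes = [
    ("detect alarm", 0.01),
    ("diagnose fault", 0.05),
    ("execute recovery", 0.02),
]

def enumerate_paths(nodes):
    """Return every success/failure path through the tree with its probability.

    Assumes the outcome of each node is independent of the others.
    """
    paths = []
    for outcomes in product((True, False), repeat=len(nodes)):
        p = 1.0
        for (_, p_fail), ok in zip(nodes, outcomes):
            # Multiply by the success or failure probability of this node.
            p *= (1.0 - p_fail) if ok else p_fail
        paths.append((outcomes, p))
    return paths

paths = enumerate_paths(nodes)

# Probability that every task succeeds, and that at least one fails.
p_all_success = next(p for outcomes, p in paths if all(outcomes))
p_any_failure = 1.0 - p_all_success

# Sanity check: the path probabilities of a complete tree sum to 1.
total = sum(p for _, p in paths)
```

Each node doubles the number of paths, so three binary nodes give eight paths whose probabilities sum to 1.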
Figure: Event tree structure (Hollnagel, 2002)

Understanding HRA…cont (2)
Traditional approach:
- Determine the HEP (human error probability) using tables (from collected data), HR models, or expert judgement
- Account for the influence of Performance Shaping Factors (PSFs)
- Calculate the probability of an erroneous action using: [formula on slide]

Understanding HRA…cont (3)
Using this formula we must make the following assumptions:
1.) The probability of failure can be determined for specific types of actions independently of context.
2.) The effects of context are additive (the various performance conditions don’t influence each other)

Understanding HRA…cont (4)
As a result, several models have been used to improve HRA:
1.) Behavioral (Human Factors) models
- Focus on simple manifestations (error modes)
- Described in terms of omissions, commissions, extraneous actions
- Derive the probability that specific errors will occur
Problems:
- Causal models very simple
- Weak in accounting for context

Understanding HRA…cont (5)
2.) Information processing models
- Focus on internal mechanisms, i.e. decision making, reasoning, etc.
- Explain the flow of causes and effects through models
Problems:
- Models often complex
- Limited predictive power (hypothetical basis)
- Little concern for quantification
- Context not considered explicitly
- Better suited for retrospective analysis than for prediction

Understanding HRA…cont (6)
3.) Cognitive models
- Focus on the relation between error modes and causes
- Models are (relatively) simple and context specific
- Premise: Cognition is the reason why performance is efficient (or limited)
- Operator seen as acting in anticipation of future events
- Well suited for both prediction and retrospective analysis

HRA Framework
10 Steps with 3 MAIN GOALS
GOALS:
1.) Human error identification (What can go wrong?)
2.) Human error quantification (How often will a human error occur?)
3.) Human error reduction (How can human error be prevented from recurring, or its impact on the system be reduced?)
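The formula for the traditional approach (a tabulated HEP adjusted by PSFs) is not reproduced in this text. The sketch below assumes one commonly cited form that is consistent with the two stated assumptions: P_EA = HEP_EA × Σ (PSF_k × W_k) + C, where each PSF rating is weighted and the contributions are simply added. The function name, the example ratings and weights, and the clamping to [0, 1] are all assumptions made for illustration.

```python
def adjusted_hep(nominal_hep, psfs, constant=0.0):
    """PSF-adjusted probability of an erroneous action.

    Assumed form (not confirmed by the slides):
        P_EA = HEP_EA * sum(PSF_k * W_k) + C

    nominal_hep: context-independent HEP for the action type
    psfs:        iterable of (rating, weight) pairs, one per PSF
    constant:    the additive constant C
    """
    if not 0.0 <= nominal_hep <= 1.0:
        raise ValueError("nominal_hep must be a probability")
    # Additive model: each PSF contributes independently of the others.
    psf_sum = sum(rating * weight for rating, weight in psfs)
    p = nominal_hep * psf_sum + constant
    # Clamping is an added safeguard so the result stays a valid probability.
    return min(1.0, max(0.0, p))

# Illustrative values only: two PSFs as (rating, weight) pairs.
example_psfs = [(2.0, 0.5), (4.0, 0.25)]
p_ea = adjusted_hep(nominal_hep=0.003, psfs=example_psfs)
```

Because the PSF terms are only summed, this form cannot express one performance condition amplifying another, which is the weakness that the behavioral, information processing, and cognitive models try to address.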
HRA Generic Methodology
- Systematic way of approaching HRA logically
- Will help ensure the problem is dealt with reliably while minimizing error (biases)
- Encompasses 10 steps, from identifying the problem to final documentation

Steps in the HRA process:

HRA Steps
1.) Defining the problem
Define precisely the problem and its setting in terms of the system goals and the overall forms of human-caused deviations from these goals.
2.) Task analysis
Define explicitly the data, equipment, behaviour, plans and interfaces used by the operators to achieve system objectives, and identify factors affecting human performance within tasks.
3.) Human error analysis
Identify all significant human errors affecting performance of the system, and find ways in which human errors can be recovered.

HRA Steps cont…
4.) Representation
Model human errors and recovery paths in a logical manner for quantitative measurement (integrate human errors with hardware failures)
5.) Screening
Define the level of detail and effort with which the quantification will be conducted by defining all significant human errors and interactions and ruling out insignificant errors
6.) Quantification
Quantify human error probabilities and human error recovery probabilities (to determine the likelihood of success in achieving system goals)

HRA Steps cont…
7.) Impact assessment
Determine the significance of human reliability in achieving system goals, decide whether improvements in human reliability are required, and (if so) identify the primary errors/factors negatively affecting the system.
8.) Error reduction
Identify error reduction mechanisms and the likelihood of error recovery, improving human performance in achieving system goals.
9.) Quality assurance
Ensure the enhanced system satisfactorily meets system performance criteria NOW and in the future.

HRA Steps cont…
10.) Documentation
Detail all information necessary to allow the assessment to be understandable, auditable, and reproducible.

1. Problem Definition
2 parts of defining a problem:
1) Identify the HR problem
2) Identify the HR context
- Once the HR problem is identified and defined within the system context, discussions with designers, engineers & operational managers should occur
- Define system goals at the various levels where operator action is required. This defines the higher goals the operator was aiming for, and can reveal the operator’s intentions at the time of the event and the root of the problem

Problem Definition cont. (2)
- Must investigate the “safety culture” of a plant: this can dramatically influence HR and is important to defining the problem
- If HRA is being carried out as part of an overall risk assessment, the HR analyst will probably be given a set of scenarios to assign risk and HE to.
- By the end of the process the problem should be explicitly defined in its respective system context:
  - Scenarios to be addressed
  - Overall tasks required to achieve safety goals within each scenario

2. Task Analysis
Purpose: Provide a complete, comprehensive description of the tasks that have to be performed by the operator to achieve system goals
- Many forms of TA; some notable ones include:
1. Sequential: chronological order of events
2. Hierarchical: considers tasks in a hierarchy (importance)
3. Tabular: dynamic situations (operator’s actions during a power plant emergency)

Task Analysis cont. (2)
Methods of deriving info for task analysis:
- Interaction with all levels (operators, maintenance, supervisors, managers, system designers, etc.)
- Observation (structured & unstructured interviews)
- Procedure analysis
- Incident analysis
- Walkthrough/explanation of procedures from operator(s)
- Examination of system documentation

Task Analysis cont. (3)
- Important not to rely completely on procedures/operating instructions: practical, real-life procedures often differ
- Operator/employee knowledge (tacit knowledge) gained through experience is vital in the TA process. Why?

3. Human Error Analysis (HEA)
- Stage to identify all errors associated with the task!!
- Most critical part of HRA. WHY?...
  - If significant errors are omitted, they will not appear in the analysis, which may seriously UNDERESTIMATE THE EFFECTS of human error on the system

3. HEA cont…(2)
Method example #1
Simplest approach to HEA…consider ‘external error modes’
1.) Error of omission:
- Act omitted (not carried out)
2.) Error of commission:
- Act carried out inadequately
- Act carried out in wrong sequence
- Act carried out too early/late
- Error of quality (too little/too much)
3.) Extraneous error:
- Wrong (unrequired) act performed

3. HEA cont…
Once the factors which influence HRA are identified, the next step is to represent them in a way that indicates their effects on the system goals…

4. Representation
- Visual representation of events/actions in a scenario
- Can be used to represent simple or complex failure paths
- Skill & proper knowledge of “tree” construction are needed: trees can become extremely complex and lose focus if not carefully put together
- A smaller scenario with a low number of errors may not need or benefit from this type of representation

4. Representation cont…(2)
Fault Tree
- The typical representation of HE & its effect on a system is a fault tree
- Logical structure that defines what events must occur in order for an undesirable outcome to occur
- Undesirable event located at the top of the “tree” (most important)
- 2 different types of gate allow events to proceed to the next level:
1) “OR” gate: the event above occurs if any of the events joined below it by this gate occur
2) “AND” gate: the event above occurs only if all of the events joined below it by this gate occur

Figure: Fault Tree

5. Screening
- Identifies where major effort in the quantification analysis should be applied.
- Filters out tasks/scenarios which may contribute little to system failure
- Screening methods can inadvertently eliminate important errors and interactions from the analysis
- As a general rule, when applying any screening technique: if in doubt, leave the human error in the fault/event tree

6. Quantification
- Human reliability needs to be quantified as something that can be compared across the HRA spectrum
- The metric for HRA quantification is Human Error Probability (HEP)
  HEP = (# of errors occurred)/(# of opportunities for error to occur)
- Expressed as a number between 0 and 1.
- Little recorded industrial HEP data is available because of:
  - Difficulty in estimating opportunities for error in realistic complex tasks (denominator error)
  - Confidentiality and unwillingness to publish poor performance data
  - Lack of awareness regarding the usefulness of human error data (hence no fiscal incentives)
- There are lots of ways to determine HEP
- For this course we keep it simple!

6. Quantification…cont.(2)
Problems with simulator data in determining HEPs:
- Personnel using simulators are usually highly motivated and know what’s on the training curriculum
- Reliability of emergency training/responses on a simulator compared to the real situation (‘cognitive fidelity issue’).
- Experiments are also usually controlled, investigating only one or two variables, so generalization is risky!
Lack of ‘generalizability’ has led to:
- Non-data-dependent approaches, i.e. expert opinion/judgment

7. Impact Assessment
- System risk or reliability is calculated
- Compared to the acceptable levels/standards established
- Each event is analyzed and classified into a fault tree
- Both HE & hardware/software analysis are taken into account to find the best combination to improve the system
- If HE dominates, error reduction methods must be investigated
- If HE cannot be reduced to acceptable levels, redesign of the system is necessary

8. Error Reduction
Not required if:
- Human reliability is adequate (what’s adequate?)
- It’s not the most effective means of achieving system performance (other modifications more suitable)
- Not within the scope of the assessment
If required, then:
- Focus on reducing the impact/frequency of human errors
- Implement a more general error reduction strategy (how?)

8. Error Reduction…cont (2)
Ways of reducing the impact of critical errors on the system:
- Prevent errors through hardware/software changes
- Increase system tolerance
- Enhance error recovery
- Reduce error at source

8. Error Reduction…cont (3)
Additional considerations:
- Positive error-reducing strategies should be factored back into the quantitative analysis
- Check that the HEP(s) and the overall calculated system risk become acceptable
As part of the quality assurance phase:
- Provide an ‘operational definition’ for each error reduction strategy
- Ensure the strategy is properly implemented and maintained over time

9. Quality Assurance
Effectiveness of error reduction mechanism implementation should be ensured by:
- Monitoring
- Performance verification (at a later stage)
- Reliability/Validity Analysis (can be hard…why?)
Continuous performance monitoring systems are powerful quality assurance tools. Why?
- Gradual performance standard degradation
- Increased maintenance loading
- Loss of personnel
- Impromptu changes (since startup)
- Increasing ‘retrofit’ changes

9. Quality Assurance cont…(2)
Long-term performance monitoring allows for:
- Identifying WHEN the results of an HRA may be outdated
- Signalling the need for further (or new) evaluation
- Justifying the acceptability of the risk associated with the system
- Avoiding gradual deterioration of safety barriers, e.g. BHOPAL SYNDROME (India)

10. Documentation
- Formally document all results of the study
- Ensure auditability and justifiability of results
- Can provide a database for future investigation and monitoring.
- Ensure assumptions and judgments are included!!
- Aid new/unconnected personnel to understand
- Enable independent examination, updating, and reproduction
- Allows for learning from mistakes

Future Directions
1) Low technology risk
2) Cognitive errors and misdiagnosis
3) Management, organizational and sociotechnical contributions to risk

Low Technology Risk
- HRA traditionally used for high risk, high technology industries
- HRA tends to focus on massive accidents that happen less frequently but have large consequences
- Not applied as much to high risk, low technology sectors (e.g. mining), which have a larger number of “small” accidents
- Can have very valuable applications in low technology industries

Cognitive Errors & Misdiagnosis
- Operators may misdiagnose a situation, not realize the mistake, and continue to interpret the feedback incorrectly
- An operator who overrides a safety system can make matters worse than if nothing had been done at all

Management, Organizational & Sociotechnical Contributions to Risk
- HRA should be applied not only to operators/workers on the job site but also to management and the organizational design of the plant
- Bhopal, Challenger Shuttle & Chernobyl all involved significant human error, BUT current HRA techniques would not have detected the risk prior to the accidents because the error was neither procedural nor diagnostic
- Economics, time constraints, social pressures, communication breakdowns, personality conflicts, etc. all add pressure on a system and its safety
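Worked example for steps 4 and 6: the HEP ratio from the Quantification step and the “OR”/“AND” gates from the fault tree in the Representation step can be combined in a short Python sketch. Everything here is illustrative: the function names, the tree shape, and the numbers are assumptions, and the gate arithmetic assumes the input events are independent.

```python
from math import prod

def hep(errors, opportunities):
    """Human error probability: errors observed / opportunities for error."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    if not 0 <= errors <= opportunities:
        raise ValueError("errors must lie between 0 and opportunities")
    return errors / opportunities

def and_gate(probabilities):
    """Output event occurs only if ALL input events occur (independence assumed)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Output event occurs if ANY input event occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Illustrative numbers only, not real plant data.
p_operator_error = hep(errors=3, opportunities=1000)
p_recovery_fails = 0.1
p_hardware_fails = 0.001

# Hypothetical tree: the top event occurs if the hardware fails OR
# the operator errs AND the error is not recovered.
p_unrecovered_error = and_gate([p_operator_error, p_recovery_fails])
p_top_event = or_gate([p_hardware_fails, p_unrecovered_error])

print(f"HEP = {p_operator_error}")
print(f"P(top event) = {p_top_event:.7f}")
```

Comparing p_top_event against an acceptance criterion, and seeing which branch contributes most, is the kind of check the Impact Assessment step describes.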