Part 2
THE MANAGEMENT OF SAFETY
DRAFT
Chapter 4
UNDERSTANDING SAFETY
Outline
GENERAL
CONCEPT OF RISK
ACCIDENTS vs. INCIDENTS
ACCIDENT CAUSATION
 Traditional view of causation
 Contemporary view of causation
 Findings and causes
 Incidents: Precursors of accidents
— 1:600 Rule
CONTEXT FOR ACCIDENTS AND INCIDENTS
 Equipment design
 Supporting infrastructure
 Human factors
— SHEL model
 Cultural factors
— National
— Professional
— Organizational
 Corporate safety culture
— Positive safety cultures
— Indications of a positive safety culture
– Informed culture
– Learning culture
– Reporting culture
– Just culture
— Error tolerance
— Blame and punishment
HUMAN ERROR
 Error Types
— Planning errors (mistakes)
— Execution errors (slips and lapses)
— Errors vs. Violations
 Control of human error
— Error reduction
— Error capturing
— Error tolerance
ACCIDENT PREVENTION CYCLE
COST CONSIDERATIONS
 Cost of accidents
 Cost of incidents
 Costs of accident prevention
Chapter 4
UNDERSTANDING SAFETY
GENERAL
As discussed in Chapter 1, safety is a condition in which the risk of harm or damage is limited to an
acceptable level. The safety hazards creating risk may become evident after a recognized breach of safety
such as an accident or incident, or they may be pro-actively identified through formal safety programmes
– before an actual safety event occurs. Having identified a safety hazard, the risks must be assessed. With
a clear understanding of the nature of the risk, a determination is made as to the “acceptability” of the
risk. Those found to be unacceptable must be acted upon.
Safety management is centred on such a systematic approach to hazard identification and risk
management, in the interests of minimizing the loss of human life, property damage, and financial,
environmental and societal losses.
THE CONCEPT OF RISK
Since safety is defined in terms of risk, any consideration of safety must necessarily involve the concept
of risk.
There is no such thing as absolute safety. Before any assessment of whether or not a system is safe can be
made, it is first necessary to determine what is an acceptable level of risk.
Risks are often expressed as probabilities; however, the concept of risk involves more than this. To
illustrate this with a hypothetical example, let us assume that the probability of the supporting cable of a
100-passenger cable car failing and allowing the car to fall was assessed as being the same as the
probability of a 12-passenger elevator failing and allowing the car to fall four floors. While the
probabilities of the events occurring may be the same, the potential consequences of the cable car
accident are much more severe. Risk is therefore two-dimensional. Evaluation of the acceptability of a
given risk associated with a particular hazard must always take account of both the likelihood of
occurrence of the hazard, and the severity of its potential consequences.
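As a sketch of this two-dimensional view, the snippet below combines likelihood and severity into a single risk index. The five-point scales and the multiplication rule are illustrative assumptions for demonstration only, not values prescribed by this manual:

```python
# Illustrative only: two-dimensional risk as likelihood x severity.
# The category names and 1-5 scales are assumptions, not prescribed values.
LIKELIHOOD = {"extremely improbable": 1, "improbable": 2, "remote": 3,
              "occasional": 4, "frequent": 5}
SEVERITY = {"negligible": 1, "minor": 2, "major": 3,
            "hazardous": 4, "catastrophic": 5}

def risk_index(likelihood: str, severity: str) -> int:
    """Combine both dimensions of risk into a single comparable index."""
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

# Same likelihood, different severity -> different risk, as in the
# cable-car vs. elevator example above:
elevator = risk_index("remote", "major")          # 3 * 3 = 9
cable_car = risk_index("remote", "catastrophic")  # 3 * 5 = 15
```

The point of the sketch is only that two occurrences with equal probability can carry very different risk once severity is taken into account.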
Perceptions of risk can be divided into three broad categories:
a) Risks that are so high that they are unacceptable;
b) Risks that are so low that they are acceptable; and
c) Risks in between these two categories, where consideration needs to be given to the various
trade-offs between risks and benefits.
If the risk does not meet the pre-determined acceptability criteria, an attempt must always be made to
reduce it to a level that is acceptable, using appropriate mitigation procedures. If the risk cannot be
reduced to or below the acceptable level, it may be regarded as tolerable if:
a) the risk is less than the pre-determined unacceptable limit;
b) the risk has been reduced to a level which is as low as reasonably practicable; and
c) the benefits of the proposed system or changes are sufficient to justify accepting the risk.
Note. All three of the above criteria should be satisfied before a risk is classed as
tolerable.
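The three criteria can be expressed as a simple predicate: a risk is tolerable only when all three hold. This is an illustrative sketch only; the function name and its boolean inputs are hypothetical, and in practice each criterion is established by a full risk assessment rather than a flag:

```python
# Hedged sketch of the tolerability test. Inputs are hypothetical
# placeholders standing in for the outcome of a proper risk assessment.
def is_tolerable(risk: float,
                 unacceptable_limit: float,
                 reduced_alarp: bool,
                 benefits_justify_risk: bool) -> bool:
    """A risk is classed as tolerable only if ALL three criteria hold:
    a) it is less than the pre-determined unacceptable limit;
    b) it has been reduced as low as reasonably practicable (ALARP); and
    c) the benefits of the proposed system justify accepting it."""
    return (risk < unacceptable_limit
            and reduced_alarp
            and benefits_justify_risk)
```

Note that failing any single criterion makes the risk non-tolerable, mirroring the note above.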
Even where the risk is classed as acceptable, if any measures which could result in further reduction of
the risk are identified, and these measures require little effort or resources to implement, they should be
implemented.
The acronym ALARP is often used to describe a risk which has been reduced to a level that is as low as
reasonably practicable. In determining what is “reasonably practicable” in this context, consideration
should be given to both the technical feasibility of further reducing the risk, and the cost. All such cases
should be evaluated individually.
Showing that a system is ALARP means that any further risk reduction is either impracticable or grossly
outweighed by the costs. It should, however, be remembered that when an individual or society ‘accepts’
a risk, this does not mean that the risk is eliminated. Some level of risk remains; however, the individual
or society has accepted that the residual risk is sufficiently low that it is outweighed by the benefits.
These concepts are illustrated diagrammatically in the Tolerability of Risk (TOR) triangle shown in
Figure 4-1. (In this diagram, the degree of risk is represented by the width of the triangle.)
[Figure 4-1. Tolerability of Risk (TOR) Triangle: the unacceptable region at the top narrows through the tolerable (ALARP) region to the acceptable region and negligible risk at the bottom.]
Further guidance regarding risk management is contained in Chapter 6.
ACCIDENTS vs. INCIDENTS
ICAO Annex 13 provides definitions of accidents and incidents that may be summarized as follows:
a) An accident is an occurrence during the operation of an aircraft that entails:
1) A fatality or serious injury;
2) Substantial damage to the aircraft involving structural failure or requiring major repair of the
aircraft; or
3) The aircraft is missing.
b) An incident is an occurrence, other than an accident, associated with the operation of an aircraft
that affects or could affect the safety of operation. A serious incident is an incident involving
circumstances indicating that an accident nearly occurred.
The ICAO definitions use the word “occurrence” to indicate an accident or incident. From the
perspective of safety management, there is a danger in concentrating on the difference between accidents
and incidents through definitions that may be arbitrary and limiting. Many incidents occur every day
which may, or may not, be reported to the investigation authority, but which come very close to being
accidents – often exposing significant risks. Because there is no injury, or little or no damage, they might
not be investigated. This is unfortunate because the investigation of an incident may yield better results
for accident prevention than the investigation of an accident. The difference between an accident and an
incident may just be an element of chance, rather than effective risk management. Indeed, an incident
may be thought of as an undesired event that, under slightly different circumstances, could have resulted
in harm to people or damage to property and thus would have been classified as an accident.
ACCIDENT CAUSATION
The strongest evidence of a serious breach of a system’s safety is an accident. Since safety management
aims to reduce the probability and consequences of accidents, an understanding of accident and incident
causation is essential to understanding safety management. Because accidents and incidents are so closely
related, no attempt is made to differentiate accident causation from incident causation.
Traditional view of causation
Following a major accident, the questioning process begins. For example:
a) How and why did competent personnel make the errors necessary to precipitate the accident? and
b) Could something like this happen again?
Traditionally, investigators have examined a chain of events or circumstances which ultimately led to
someone doing something inappropriate thereby triggering the accident. This inappropriate behaviour
may have been an error in judgment (such as a deviation from standard operating procedures), an error
due to inattention, or a deliberate violation of the rules.
Following the traditional approach, the investigative focus was more often than not on finding someone to
blame (and punish) for the accident. At best, safety management efforts were concentrated on finding
ways of reducing the risk that such unsafe acts would be committed in the first place. However, the errors
or violations that trigger accidents seem to occur randomly. With no particular pattern to pursue, safety
management efforts to reduce or eliminate random events may be ineffective.
Analysis of accident data all too often reveals that the situation prior to the accident was “ripe for an
accident”. Safety-minded persons may even have been saying that it was just a matter of time before
these circumstances led to an accident. And when the accident occurs, all too often healthy, qualified,
experienced, motivated and well-equipped personnel have committed the errors that triggered it. They
(and their colleagues) may have committed these errors or unsafe practices many times before without
adverse consequences. Further, some of the unsafe conditions in which they were operating may have
been present for years, again without causing an accident. In other words, there is an element of chance
present.
Sometimes these unsafe conditions were the consequence of decisions by management; they recognized
the risks, but often priorities required a trade-off. Indeed, front-line personnel work in a context that is
defined by organizational and management factors often beyond their control. The front-line employees
are but part of a larger system.
To be successful, safety management systems require an alternative understanding of accident causation,
one that depends on examining the total context (i.e. the system) in which these people work.
Contemporary view of causation
According to contemporary thinking, accidents require the coming together of a number of enabling
factors, each one necessary, but in itself not sufficient to breach system defences. Major equipment
failures, or operational personnel errors, are seldom the sole cause of breaches in safety defences. Often
these breakdowns are the consequence of human failures in decision-making. The breakdowns may
involve active failures at the operational level, or they may involve latent conditions conducive to
facilitating a breach of the system’s inherent safety defences. Most accidents include both active and
latent failures.
Figure 4-2 portrays an accident causation model1 that assists in understanding the interplay of
organizational and management factors (i.e. system factors) in accident prevention. Various “defences”
are built into the aviation system to protect against inappropriate performance or poor decisions at all
levels of the system: the front-line workplace, the supervisory levels and senior management. This model
shows that while organizational factors, including management decisions, can create latent failure
conditions that could lead to an accident, they also contribute to the system defences.
Errors and violations having an immediate adverse effect can be viewed as unsafe acts; these are
generally associated with front-line personnel (pilots, controllers, mechanics, etc.). These unsafe acts may
1 Adapted from James Reason, “Collective Mistakes in Aviation: The Last Great Frontier”, Flight Deck, Summer 1992, Issue 4.
penetrate the various defences put in place to protect the aviation system by company management, the
regulatory authorities, ICAO, etc., resulting in an accident. These unsafe acts may be the result of normal
errors, or they may result from deliberate violations of prescribed procedures and practices. The model
recognizes that there are many error-producing or violation-producing conditions in the work
environment that may affect individual or team behaviour.
[Figure 4-2. Accident Causation Model (adapted from Dr. James Reason): organization and management decisions create workplace error- and violation-producing conditions; crew/team errors and violations then test the system’s defences, and the outcome ranges from reportable incidents to an accident.]
These unsafe acts are committed in an operational context which includes latent unsafe conditions (latent
failures). A latent failure is the result of an action or decision made well before an accident. Its
consequences may remain dormant for a long time. Individually, these latent failures are usually not
harmful since they are not perceived as being failures in the first place.2
Latent unsafe conditions may only become evident once the system’s defences have been breached. They
may have been present in the system well before an accident and are generally created by decision-makers, regulators and other people far removed in time and space from the accident. Front-line
operational personnel can inherit defects in the system, such as those created by poor equipment or task
design; conflicting goals (e.g. on-time service vs. safety); defective organizations (e.g. poor internal
communications); or bad management decisions (e.g. deferral of a maintenance item). Effective safety
management efforts aim to identify and mitigate these latent unsafe conditions on a system-wide basis,
rather than by localized efforts to minimize unsafe acts by individuals. Such unsafe acts may only be
symptoms of safety problems, not causes.
Most latent unsafe conditions start with the decision-makers, even in the best-run organizations. These
decision-makers are also subject to normal human biases and limitations, as well as to very real
constraints of time, budget, politics, etc. Since some of these unsafe decisions cannot be prevented, steps
must be taken to detect them and to reduce their adverse consequences.
Fallible decisions by line-management may take the form of inadequate procedures, poor scheduling or
neglect of recognizable hazards. They may lead to inadequate knowledge and skills or inappropriate
operating procedures. How well line-management and the organization perform their functions sets the
2 Wood, Richard H., Aviation Safety Programs: A Management Handbook, 3rd edition, Englewood, CO: Jeppesen, 2003.
scene for error- or violation-producing conditions. For example, how effective is management with
respect to setting attainable work goals, organizing tasks and resources, managing day-to-day affairs,
communicating internally and externally, etc.? The fallible decisions made by company management and
regulatory authorities are too often the consequence of inadequate resources. However, avoiding the costs
of strengthening the safety of the system can facilitate accidents that are so expensive as to bankrupt the
operator.
Findings and causes
One of the difficulties in determining accident causes involves the level of confidence that can be placed
in the investigative findings. In criminal proceedings, the burden of proof is heavy, seeking virtual
certainty (e.g. “beyond reasonable doubt”). However, in accident investigations, it is often impossible to
determine causes with absolute certainty.
Following an investigation, some stakeholders, especially the general public, media and legal fraternity
are primarily interested in “the cause” of the accident. More enlightened stakeholders will seek an
understanding of all the causes. However, excessive emphasis on determining the causes may be at the
expense of accident prevention. If the investigation went deep enough, beyond the determination of the
causes, it may reveal systemic safety deficiencies that must be corrected in the interests of accident
prevention. A thorough investigation will reveal many findings, perhaps including latent unsafe
conditions that had no role whatsoever in the accident causation. From a safety management point of
view, significant safety deficiencies found in an investigation that may be unrelated to the accident must
be eliminated, or reduced.
A thorough investigation will reveal a number of findings. Some of these findings could be considered
causal. Others may have contributed to the accident in some way (sometimes referred to as contributory
factors). On the basis that Annex 13 defines the objective of an investigation as accident prevention, a
growing number of investigations are determining findings without attempting to identify which of the
findings are causal.
Incidents: Precursors of accidents
Regardless of the accident causation model used, typically there would have been precursors evident
before the accident. All too often, these precursors only become evident with hindsight. Latent unsafe
conditions may have existed at the time of the occurrence. Identifying and validating these unsafe
conditions requires an objective, in-depth risk analysis. Although it is important to fully investigate
accidents with high numbers of fatalities, this may not be the most fruitful means for identifying safety
deficiencies. Care must be taken to ensure that the “blood priority” (often prevalent in the media after
significant loss of life) does not detract from a rational risk analysis of unsafe conditions in aviation.
While using accident investigations to identify hazards is important, it is reactive.
1:600 Rule3. Research into industrial safety in 1969 indicated that for every 600 reported
occurrences with no injury or damage, there were some:
— 30 incidents involving property damage,
— 10 accidents involving serious injuries, and
— 1 major or fatal injury.

3 Frank E. Bird, Jr., 1969.
[Figure 4-3. 1:600 rule, depicted as a triangle: 1 fatal accident at the apex, above 10 accidents, 30 incidents and 600 occurrences at the base.]
The 1-10-30-600 ratio shown in Figure 4-3 is indicative of a wasted opportunity if investigative efforts
are focused only on those rare occurrences where there is serious injury, or significant damage. The latent
factors contributing to such accidents may be present in hundreds of incidents, and could be identified –
before serious injury or damage ensues. Effective safety management requires that all staff and
management identify and analyse hazards before they result in accidents.
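The 1-10-30-600 ratio can be used as a rough proportional estimate of the opportunities being missed. The sketch below is illustrative only: it simply scales the 1969 ratios, and real occurrence data will not follow them exactly:

```python
# Illustrative sketch of the 1:600 rule as a proportional estimate.
# The ratios come from the 1969 research cited in the text; treating
# them as exact multipliers is an assumption for demonstration.
RATIO = {
    "no injury or damage": 600,
    "property damage": 30,
    "serious injury": 10,
    "major or fatal injury": 1,
}

def expected_counts(no_damage_reports: int) -> dict:
    """Scale the pyramid from a count of reported no-damage occurrences."""
    scale = no_damage_reports / RATIO["no injury or damage"]
    return {level: n * scale for level, n in RATIO.items()}

# 1200 reported no-damage occurrences would, on these ratios,
# accompany roughly 60 damage incidents, 20 serious-injury accidents
# and 2 major or fatal injuries.
counts = expected_counts(1200)
```

The safety-management point is the inverse reading: each fatal accident sits on hundreds of minor occurrences in which the same latent conditions could have been found first.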
In aviation incidents, injury and damage (and therefore, liability) are generally less significant than in
accidents. Accordingly, there is less publicity associated with these occurrences. In principle, more
information regarding such occurrences should be available (e.g. live witnesses, undamaged flight
recorders, etc.). Without the threat of substantial damage suits, there also tends to be less of an adversarial
atmosphere during the investigation. Thus, there should be a better opportunity to identify why the
incidents occurred and, equally, how the defences in place prevented them from becoming accidents. In
an ideal world, the underlying safety deficiencies can all be identified, and preventive measures to
ameliorate these unsafe conditions initiated, before an accident occurs.
CONTEXT FOR ACCIDENTS AND INCIDENTS
Accidents and incidents occur within a defined set of circumstances and conditions. These include the
aircraft and other equipment, the weather, the airport and flight services, etc. They also include the
regulatory, industry and corporate operating climate. Above all, they include the permutations and
combinations of human behaviour. At any given time, some of these factors may converge in such a way
as to create conditions that are ripe for an accident. Understanding the context in which accidents occur is
fundamental to safety management. Some of the principal factors shaping the context for accidents and
incidents include equipment design, supporting infrastructure, human and cultural factors, corporate
safety culture and cost factors. Each is discussed in this chapter.
Equipment design
Equipment (and job) design is fundamental to both safe operations and maintenance. Simplistically, the
designer is concerned with such questions as:
a) Does the equipment do what it is supposed to do?
b) Does the equipment interface well with the operator? Is it “user-friendly”?
c) Does the equipment fit in the allocated space? etc.
From the equipment operator’s perspective, the equipment must “work as advertised”. The ergonomic
design must minimize the risk (and consequences) of errors. Are the switches accessible? Is the
controlling action intuitive? Are the dials and displays adequate under all operating conditions, etc.? Is the
equipment resistant to mistakes, e.g. “Are you sure you want to delete this file?” Each such factor has
accident potential.
The designer also needs to consider the equipment maintainer’s perspective; there must be sufficient
space available to permit access for required maintenance under typical working conditions and with
normal human strength and reach limitations. The design must also incorporate adequate feedback to
warn of an incorrect assembly.
With advances in automation, design considerations become even more apparent. Whether it is the pilot
in the cockpit, air traffic controllers at their consoles, or a maintenance engineer using automated
diagnostic equipment, the scope for new types of human errors has expanded significantly. Equipment
operators may no longer be certain at all times of what is happening, hence the question: “What is the
computer doing now?” Their situational awareness may be affected accordingly. Although increased
automation has reduced the potential for many types of accidents, safety managers now face new
challenges induced by that automation, such as lack of situational awareness, boredom, etc.
Supporting infrastructure
From an operator or service provider’s perspective, the availability of adequate supporting infrastructure
is essential to the safe operation of aircraft. This includes the adequacy of the State’s performance with
respect to such things as:
a) Personnel licensing;
b) Certification of aircraft, operators, service providers and aerodromes;
c) Ensuring the provision of required services;
d) Investigation of accidents and incidents; and
e) Providing operational safety oversight.
From a pilot’s perspective, supporting infrastructure includes such things as:
a) Airworthy aircraft suitable for the type of operation;
b) Adequate and reliable CNS services;
c) Adequate and reliable aerodrome, ground handling, and flight planning services; and
d) Effective support from the parent organization with respect to initial and recurrent training,
scheduling, flight dispatch or flight following system, etc.
An air traffic controller is similarly concerned with such things as:
a) Availability of operable (CNS) equipment suitable for the operational task;
b) Effective procedures for the safe and expeditious handling of aircraft; and
c) Effective support from the parent organization with respect to initial and recurrent training,
rostering and general working conditions.
Human factors4, 5
In a high technology industry like aviation, the focus of problem solving is often on technology.
However, the accident record repeatedly demonstrates that at least three out of four accidents involve
performance errors made by apparently healthy and appropriately qualified individuals. In the rush to
embrace new technologies, the fallible mortals who must interface with and use this equipment are often
overlooked.
The sources of some of the problems causing or contributing to these accidents may be traced to poor
equipment or procedure design, or to inadequate training or operating instructions. But whatever the
origin, understanding normal human performance capabilities, limitations and behaviour in the
operational context is central to understanding safety management. An intuitive approach to Human
Factors is no longer appropriate.
The human element is the most flexible and adaptable part of the aviation system, but it is also the most
vulnerable to influences that can adversely affect its performance. With the majority of accidents resulting
from less than optimum human performance, there has been a tendency to merely attribute them to human
4 Adapted from Chapter 2 of the Human Factors Guidelines for Safety Audits Manual (Doc 9806).
5 Readers are referred to the Human Factors Training Manual (Doc 9683) for a more comprehensive coverage of the theoretical and practical aspects of Human Factors.
error. However, the term “human error” is of little help in safety management. Although it may indicate
where in the system the breakdown occurred, it provides no guidance as to why it occurred.
An error attributed to humans may have been design-induced, or stimulated by inadequate equipment or
training, badly designed procedures, or a poor layout of checklists or manuals. Further, the term “human
error” allows concealment of the underlying factors that must be brought to the fore if accidents are to be
prevented. In contemporary safety thinking, human error is the starting point, rather than the stopping
point in accident investigation and prevention. Safety management initiatives seek ways of minimizing or
preventing human errors that might jeopardize safety. This requires an understanding of the operating
context in which humans err, (i.e. an understanding of the factors and conditions affecting human
performance in the workplace).
SHEL Model
The workplace typically involves a complex set of interrelated factors and conditions, which may
affect human performance. The SHEL model6 (sometimes referred to as the SHELL model) can
be used to help visualize the interrelationships among the various components of the aviation
system. The SHEL model is a development of the traditional “man-machine-environment”
system (the name being derived from the initial letters of its components). SHEL places
emphasis on the human being and the human’s interfaces with the other components of the
aviation system. The following nomenclature is applied:
a) Liveware (L) (humans in the workplace),
b) Hardware (H) (machine and equipment),
c) Software (S) (procedures, training, support, etc.), and
d) Environment (E) (the operating circumstances in which the rest of the L-H-S system
must function).
Figure 4-4 depicts the SHEL model. This building block diagram is intended to provide a basic
understanding of the relationship of the human to other factors in the workplace.
6 The SHEL concept was first developed by Professor Elwyn Edwards in 1972, with a modified diagram to illustrate the model developed by Frank Hawkins in 1975.
[Figure 4-4. SHEL Model: a block diagram of the four components, S (software), H (hardware), E (environment) and L (liveware), arranged around the central human.]
Liveware. In the centre of the model are those persons at the front line of operations. Although
people are remarkably adaptable, they are subject to considerable variations in performance.
Humans are not standardized to the same degree as hardware; so the edges of this block are not
simple and straight. People do not interface perfectly with the various components of the world in
which they work. To avoid tensions that may compromise human performance, the effects of
irregularities at the interfaces between the various SHEL blocks and the central Liveware block,
must be understood. The other components of the system must be carefully matched to humans if
stresses in the system with accident potential are to be avoided.
Several different factors put the rough edges on the Liveware block; some of the more important
factors affecting individual performance are:
Physical Factors include the individual’s physical capabilities to perform the required tasks,
(e.g. strength, height, reach, vision, and hearing).
Physiological Factors include those factors which affect the human’s internal physical
processes, which can compromise the crew’s physical and cognitive performance, e.g.
oxygen availability, general health and fitness, disease or illness, tobacco, drug or alcohol
use, personal stress, fatigue, or pregnancy.
Psychological Factors include those factors affecting the psychological preparedness of the
individual to meet all the circumstances that might occur during a flight, e.g. adequacy of
training, knowledge and experience, visual illusions and workload. The individual’s
psychological fitness for duty includes motivation and judgement, attitude towards risky
behaviour, confidence, stress, etc.
Psycho-social Factors include all those external factors in the individual’s social system that
bring pressure to bear on them, both in their work and their non-work environments, e.g.
argument with a supervisor, labour-management disputes, a death in the family, personal
financial problems or other domestic tension.
The SHEL model is particularly useful in visualizing the interfaces between the various
components of the aviation system. These include:
Liveware-Hardware (L-H). The interface between the human and the machine is the one
most commonly considered when speaking of Human Factors. It determines how the human
interfaces with the physical work environment, e.g. design of seats to fit the sitting
characteristics of the human body, displays to match the sensory and information processing
characteristics of the user, controls with proper movement, coding and location. However,
there is a natural human tendency to adapt to L-H mismatches. This tendency may mask
serious deficiencies, which may only become evident after an accident.
Liveware-Software (L-S). The L-S interface is the relationship between the individual and the
supporting systems found in the workplace, e.g. the regulations, manuals, checklists,
publications, standard operating procedures and computer software. It includes such “user
friendliness” issues as currency, accuracy, format and presentation, vocabulary, clarity,
symbology, etc. Increasingly, cockpit automation has altered the nature of crew duties.
Workload may have been increased to such an extent during some phases of flight that crew
members’ attitudes towards each other may be affected (i.e. the L-L interface).
Liveware-Liveware (L-L). The L-L interface is the relationship between the individual and
other persons in the workplace. Flight crews, air traffic controllers, maintenance technicians
and other operational personnel function as groups and group influences play a role in
determining human behaviour and performance. This interface is concerned with leadership,
crew cooperation, teamwork and personality interactions. In aviation, the advent of Crew
Resource Management (CRM) has resulted in considerable focus on this interface. CRM
training promotes teamwork and focuses on the management of normal human errors. The L-L interface goes well beyond the crew relationship in the cockpit. Staff/management
relationships are also within the scope of this interface, as are corporate culture, corporate
climate and company operating pressures that can all significantly affect human performance.
Liveware-Environment (L-E). This interface involves the relationship between the individual
and the internal and external environments. The internal workplace environment includes
such physical considerations as temperature, ambient light, noise, vibration, air quality, etc.
The external environment (for pilots) includes such things as visibility, turbulence, terrain,
etc. Increasingly, the work environment for flight crews includes disturbances to normal
biological rhythms, e.g. sleep patterns. Further, the aviation system operates within a context
of broad political and economic constraints, which in turn affect the overall corporate
environment. Included here are such factors as the adequacy of physical facilities and
supporting infrastructure, the local financial situation, regulatory effectiveness, etc. Just as a
crew’s immediate work environment may create pressures to take short cuts, inadequate
infrastructure support may also compromise the quality of crew decision-making.
For the most part, the rough edges of these interfaces can be managed. For example:
a) The designer can ensure the performance reliability of the equipment under specified
operating conditions;
b) During the certification process, the regulatory authority can define the conditions under
which that equipment may be used;
c) The organization’s management can specify standard operating procedures and provide
initial and recurrent training for the safe use of the equipment; and
d) Individual equipment operators can ensure their familiarity and confidence in using the
equipment safely under all required operating conditions, etc.
Cultural factors7
Culture influences the values, beliefs and behaviours that we share with the other members of our various
social groups. Culture serves to bind us together as members of groups and to provide clues as to how to
behave in both normal and unusual situations. Some see culture as the “collective programming of the
mind”. It is the complex social dynamic that sets the rules of the game, or the framework for all our
interpersonal interactions. Culture is the sum total of the way people conduct their affairs in a particular
social milieu, and it provides the context in which things happen. For safety management, understanding
this context is important, since culture is a significant determinant of human performance and its
limitations.
The Western world’s approach to management is often based on an emotionally detached rationality,
which is considered to be “scientifically” based. It assumes that human behaviour in the workplace, like
the laws of physics and engineering, is universal in application. This assumption reflects a Western
cultural bias.
Aviation safety must transcend national boundaries, including all the cultures therein. On a global scale,
the aviation industry has achieved a remarkable level of standardization across aircraft types, countries
and peoples. Nevertheless, it is not difficult to detect differences in how people respond in similar
situations. As people in the industry interact (the Liveware-Liveware (L-L) interface), their transactions
are affected by the differences in their cultural backgrounds. Different cultures have different ways of
dealing with common problems.
Organizations are not immune to cultural considerations. Organizational behaviour is subject to these
influences at every level. The following three levels of culture have relevance to safety management
initiatives:
a) National culture differentiates the national characteristics and values system of particular
nations. People of different nationalities differ, for example, in their response to authority, how
they deal with uncertainty and ambiguity, and how they express their individuality. They are not
all attuned to the collective needs of the group (team or organization) in the same way. In
collectivist cultures, there is acceptance of unequal status and deference to leaders. Such factors
may affect the willingness of individuals to question decisions or actions — an important
consideration in Crew Resource Management (CRM) for example. Crew assignments that mix
national cultures may also affect team performance by creating misunderstandings.
b) Professional culture differentiates the behaviour and characteristics of particular professional
groups (e.g. the typical behaviour of pilots vis à vis that of air traffic controllers, or maintenance
engineers). Through personnel selection, education and training, on-the-job experience, etc.,
professionals (e.g. doctors, lawyers, pilots, controllers) tend to adopt the value system and
develop behaviour patterns consistent with their peers; they learn to “walk and talk” alike. They
generally share a pride in their profession and are motivated to excel in it. On the other side of the
coin, they frequently have a sense of personal invulnerability, e.g. they feel that their performance
is not affected by personal problems, or they do not make errors in situations of high stress.
7 This section is adapted from Human Factors Guidelines for Safety Audits Manual (Doc 9806).
c) Organizational culture differentiates the behaviour and values of particular organizations (e.g.
the behaviour of members of one company vs. that of another company, or government vs. private
sector behaviour). Organizations provide a shell for national and professional cultures. In an
airline for example, pilots may come from different professional backgrounds (e.g. military vs.
civilian experience, bush or commuter operations vs. development within a large carrier). They
may also come from different organizational cultures due to corporate mergers or lay-offs.
Generally, personnel in the aviation industry enjoy a sense of belonging. They are influenced in
their day-to-day behaviour by the values of their organization. Does the organization recognize
merit? Promote individual initiative? Encourage risk taking? Tolerate breaches of SOPs? Promote
open two-way communications, etc.? Thus, the organization is a major determinant of employee
behaviour.
The greatest scope for creating and nourishing a culture of safety is at the organizational level.
This is commonly referred to as corporate safety culture and is discussed further below.
The three cultural sets described above are important to safe flight operations. They determine how
juniors will relate to their seniors, how information is shared, how personnel will react under stress, how
particular technologies will be embraced and used, how authority will be acted upon, how organizations
react to human errors (e.g. punish offenders, or learn from experience). Culture will be a factor in how
automation is applied in flight operations; how procedures (SOPs) are developed and implemented; how
documentation is prepared, presented, and received; how training is developed and delivered; how crew
assignments are made; relationships between pilots, operations and ATC; relationships with unions, etc.
In other words, culture impacts on virtually every type of interpersonal transaction. In addition, cultural
considerations creep into the design of equipment and tools. Technology may appear to be culture-neutral,
but it reflects the biases of the manufacturer (e.g. consider the English language bias implicit in much of
the world’s computer software). Yet, there is no right and no wrong culture; they are what they are and
they each possess a blend of strengths and weaknesses.
The challenge for safety managers is to understand how culture affects both individuals and aviation
organizations and how that relationship can put safety at risk, or serve to enhance it.
Corporate safety culture8
As seen above, many factors create the context for human behaviour in the workplace. Organizational or
corporate culture sets the boundaries for accepted human behaviour in the workplace by establishing the
behavioural norms and limits. Thus, organizational or corporate culture provides a cornerstone for
managerial and employee decision-making; “This is how we do things here!”
Safety culture is a natural by-product of corporate culture. The corporate attitude towards safety influences
employees’ collective approach to safety. Safety culture consists of shared beliefs, practices and attitudes.
The tone for safety culture is set and nurtured by the words and actions of senior management. Corporate
safety culture then is the atmosphere created by management which shapes workers’ attitudes towards
safety and accident prevention.
Safety culture is affected by such factors as:
a) Management’s action and priorities;
8 For a more comprehensive discussion of safety culture, see Human Factors Guidelines for Safety Audits Manual (Doc 9806).
b) Policies and procedures;
c) Supervisory practices;
d) Safety planning and goals;
e) Actions in response to unsafe behaviours;
f) Employee training and motivation; and
g) Employee involvement or “buy in”.
The ultimate responsibility for safety rests with the directors and management of the organization –
whether it is an airline, a service provider (e.g. airports and ATS) or an Approved Maintenance
Organization (AMO). The safety ethos of an organization is established from the outset by the extent to
which senior management accepts responsibility for safe operations and for the management of risk.9
Positive safety culture10
Although compliance with safety regulations is fundamental to safety, contemporary thinking is that
much more is required. Operators that simply comply with the minimum standards set by the regulations
are not well situated to identify emerging safety problems.
An effective way to promote a safe operation is to ensure that an operator has a positive safety culture.
Simply put, all staff must take responsibility for safety and consider the safety implications of everything
they do. This way of thinking must be so deep-rooted that it truly becomes a ‘culture’. All decisions,
whether made by the Board of Directors, by a driver on the ramp, or by an engineer, need to consider
their implications for safety.
A positive safety culture must be generated from the ‘top down’ and relies on a high degree of trust and
respect between workers and management. Workers must believe that they will be supported in any
decisions made in the interests of safety. They must also understand that intentional breaches of safety
that jeopardize the operation will not be tolerated.
In order to ensure a positive safety culture, management must convince employees that while schedule
delivery and costs are important, safety is paramount. Some organizations go so far as to enunciate a
formal corporate policy on their commitment to a positive safety culture.
Indications of a positive safety culture
A positive safety culture demonstrates such attributes as:
a) Senior management place strong emphasis on safety as part of the strategy of controlling risks
(i.e. minimizing losses);
9 Adapted from CASA Aviation Safety Management: An Operator’s Guide to Building a Safety Program.
10 Adapted from Airbus Safety Strategy (OSPH Draft: Sec 1 - Introduction, Sep 1999).
b) Decision-makers and operational personnel hold a realistic view of the short- and long-term
hazards involved in the organization’s activities;
c) Those in top positions:
1) Foster a climate in which there is a positive attitude towards criticisms, comments and
feedback from lower levels of the organization on safety matters;
2) Do not use their influence to force their views on subordinates; and
3) Implement measures to contain the consequences of identified safety deficiencies;
d) Senior management promote a non-punitive working environment; they tolerate legitimate errors
and systematically attempt to derive safety lessons from them;
e) There is an awareness of the importance of communicating relevant safety information at all
levels of the organization (both within and with outside entities);
f) There is promotion of realistic and workable rules relating to hazards, to safety and to potential
sources of damage;
g) Personnel are well trained and fully understand the consequences of unsafe acts; and
h) There is a low incidence of risk-taking behaviour, and a safety ethic which discourages such
behaviour.
Positive safety cultures typically are:
a) Informed cultures. Management fosters a culture where people understand the hazards and risks
inherent in their area of operations. Personnel are provided with the necessary knowledge, skills
and job experience to work safely, and they are encouraged to identify the threats to their safety
and seek the changes necessary to overcome them.
b) Learning cultures. Learning is seen as more than a requirement for initial skills training; rather it
is valued as a lifetime process. People are encouraged to develop and apply their own skills and
knowledge to enhance organizational safety. Staff are updated on safety issues by management
and safety reports are fed back to staff so that everyone can learn the pertinent safety lessons.
c) Reporting cultures. Managers and operational personnel freely share critical safety information
without the threat of punitive action. This is frequently referred to as creating a corporate
reporting culture. Personnel are able to report hazards or safety concerns as they become aware of
them, without fear of sanction or embarrassment.
d) Just cultures. While a non-punitive environment is fundamental for a good reporting culture, the
workforce must know and agree on what is acceptable and what is unacceptable behaviour.
Negligence or deliberate violations must not be tolerated by either management or by the
workers. A culture that recognizes that, in certain circumstances, there may be a need for punitive
action is considered a just culture. Personnel tend to be self-disciplined in a just culture.
Table 4-1 below summarizes three corporate responses to safety issues ranging from a poor safety culture,
through the bureaucratic approach which only meets minimum acceptable requirements, to the ideal
positive safety culture.
Table 4-1. Characteristics of different safety cultures

Characteristics                   Poor                     Bureaucratic                      Positive
Hazard information is:            Suppressed               Ignored                           Actively sought
Safety messengers are:            Discouraged or punished  Tolerated                         Trained and encouraged
Responsibility for safety is:     Avoided                  Fragmented                        Shared
Dissemination of safety
information is:                   Discouraged              Allowed but discouraged           Rewarded
Failures lead to:                 Cover ups                Local fixes                       Inquiries and systemic reform
New ideas are:                    Crushed                  New problems (not opportunities)  Welcomed
Error tolerance
An important dimension of a positive safety culture is the organization’s attitude towards errors and the
perceptions that its responses to errors create among staff. Error tolerance is the term used to
describe the ability of a system to accept an error without serious consequence. The concept of ‘error
tolerance’ was first applied to the ergonomic design of equipment which incorporated physical defences
against inappropriate human acts, for example, air/ground logic to prevent inadvertent gear retraction on
the ground. In addition, procedural actions, such as checklists, crosschecks and readbacks, provide error
tolerance by identifying unsafe conditions before a mishap occurs.
Increasingly, the concept of error tolerance is being extended beyond equipment and job design into
corporate safety culture.
Creating a positive corporate safety culture is so dependent on effective two-way communications
between management and front-line personnel, that organizations are increasingly recognizing the value
of voluntary incident reporting systems that provide immunity to the reporter. However, the effectiveness
of such reporting systems depends largely on the ‘error tolerance’ of the company.
Blame and punishment
Once an investigation has identified the cause of an occurrence, it is usually evident who “caused” the
event. Traditionally, blame (and punishment) could then be assigned. While the legal environments vary
widely between States, many States still focus their investigations on determining blame and apportioning
liability. For them, punishment remains a principal tool.
Philosophically, punishment is appealing from several points of view, such as:
a) Seeking revenge for a breach of trust;
b) Protecting society from repeat offenders;
c) Altering individual behaviour; or
d) Setting an example for others.
Punishment may have a role to play in dealing with violations where crews intentionally contravene the
“rules”. Arguably, such sanctions may deter the perpetrator of the violation (or others in similar
circumstances) from jeopardizing safe flight operations. In principle, such punishment should be applied
regardless of the outcome of the violation, i.e. it should not be reserved for those whose misdemeanour
leads to actual losses for the company.
If an accident was the result of an error in judgement or technique, it is almost impossible to punish
effectively for that error. Instead, changes could be made to selection or training processes, or the system
could be made more tolerant of such errors. If punishment is chosen in such cases, two outcomes are
almost certain. Firstly, no further reports of such errors will be received. Secondly, since nothing has been
done to change the underlying situation, the same accident can be expected to recur.
Perhaps society needs to use punishment in order to mete out justice. However, global experience
suggests that punishment has little, if any, systemic value for accident prevention. Except in cases of
wilful negligence or deliberate violation of the norms, punishment serves little purpose from a safety
perspective.
In much of the international aviation community, a more enlightened view of the role of punishment is
emerging. In part, this parallels a growing understanding of the causes of human errors (as opposed to
violations). Errors are now being viewed as the results of underlying situations or circumstances, rather
than as causes in themselves. As a result, managers are beginning to seek out the unsafe conditions that facilitate
such errors. They are beginning to find that the systematic identification of organizational weaknesses and
safety deficiencies pays a much higher dividend for safety management than punishing individuals. (That
is not to say that these enlightened organizations are not required to take action against individuals who
fail to improve after counseling and/or extra training.)
While many airlines are taking this more positive approach to the active management of safety, others
have been slow to adopt and implement effective ‘non-punitive policies’. Still others have been slow to
extend their non-punitive policies beyond flight operations on a corporate-wide basis (to include
maintenance and ground operations).11
HUMAN ERROR
Human error is cited as being a causal or contributing factor in the majority of aviation occurrences. All
too often, competent personnel commit such errors, although clearly they did not plan to have an accident.
Errors are not some type of aberrant behaviour; they are a natural by-product of virtually all human
11 From IATA Non-Punitive Policy Survey, March 2002.
endeavour. Error must be accepted as a normal component of any system where humans and technology
interact. “To err is human.”
The factors discussed above create the context in which humans commit errors. Given the rough
interfaces of the aviation system (as depicted in the SHEL model), the scope for human errors in aviation
is enormous. Understanding how normal people commit errors is fundamental to safety management.
Only then can effective measures be implemented to minimize the effects of human errors on safety.
Even if not altogether avoidable, human errors are manageable through the application of improved
technology, relevant training and appropriate regulations and procedures. Most measures aimed at error
management involve front-line personnel. However, the performance of pilots, controllers, technicians,
etc. can be strongly influenced by organizational, regulatory, cultural and environmental factors affecting
the workplace. For example, organizational processes such as inadequate communication facilities,
ambiguous procedures, unsatisfactory scheduling, insufficient resources and unrealistic budgeting
constitute the breeding grounds for many predictable human errors; all of these are processes that the
organization can control. Figure 4-5 summarizes some of the factors contributing to human errors – and to
accidents.
[Figure 4-5. Contributing factors to human error – a diagram with human errors at the centre, fed by
contributing factors (culture, training, personal factors, procedures, equipment design, organizational
factors and other factors) and leading to incidents and accidents.]
Error types
Errors may occur at the planning stage, or during the execution of the plan. Planning errors lead to
mistakes; either the person follows an inappropriate procedure for dealing with a routine problem or
builds a plan for an inappropriate course of action to cope with a new situation. Even when the planned
action is appropriate, errors may occur in the execution of the plan. The human factors literature on such
errors in execution generally draws a distinction between slips and lapses. A slip is an action which is not
carried out as planned, and will therefore always be observable. A lapse is a failure of memory, and may
not necessarily be evident to anyone other than the person who experienced the lapse.
a) Planning errors (Mistakes)
In problem solving, we intuitively look for a set of known rules (standard operating procedures, rules
of thumb, etc.) that have been used before and appear appropriate to the problem at hand. Mistakes
can occur in two ways: the application of a rule that is not appropriate to the situation, or the correct
application of a rule that is itself flawed.
Misapplication of good rules. This usually happens when an operator is faced with a situation that
exhibits many features common to the circumstances for which the rule was intended, but with
some significant differences. If the significance of the differences is not recognized, an
inappropriate rule may be applied.
Application of bad rules. This involves the use of a procedure that past experience has shown to
work, but contains unrecognized flaws. If such a solution works in the circumstances under which
it was first tried, it may become part of the individual’s regular approach to solving that type of
problem.
When a person does not have a ready-made solution based on previous experience and/or training, he
or she must draw on personal knowledge and experience. Developing a solution to a problem using this
method will inevitably take longer than applying a rule-based solution, as it requires reasoning based
on knowledge of basic principles. Mistakes can occur because of lack of knowledge, or because of
faulty reasoning. The application of knowledge-based reasoning to a problem will be particularly
difficult in circumstances where the individual is busy, as his or her attention is likely to be diverted
from the reasoning process to deal with other issues. The probability of a mistake occurring becomes
greater in such circumstances.
b) Execution errors (slips and lapses)
The actions of experienced and competent personnel tend to be routine and highly practiced; they
are carried out in a largely automatic fashion, except for occasional checks on progress. Slips and
lapses can take several forms:
Attentional slips occur as the result of a failure to monitor the progress of a routine action at some
critical point. These are particularly likely when the planned course of action is similar to a
routinely used procedure, but not identical. If attention is allowed to wander or a distraction
occurs at the critical point where the action differs from the usual procedure, the result can be that
the operator will follow the usual procedure rather than the one intended in this instance.
Memory lapses occur when we either forget what we had planned to do, or omit an item in a
planned sequence of actions.
Perceptual errors are errors in recognition. They occur when we believe we saw or heard
something which is different from the information actually presented.
c) Errors vs. Violations
Errors (which are a normal human activity) are quite distinct from violations. Both can lead to a
failure of the system. Both can result in a hazardous situation. The difference lies in the intent.
A violation is a deliberate act, while an error is unintentional. Take, for example, a situation in
which a controller allows an aircraft to descend through the level of a cruising aircraft when the
DME distance between them is 18 NM, and this occurs in circumstances where the correct
separation minimum is 20 NM. If the controller made a mistake in calculating the difference in
the DME distances advised by the pilots, this would be an error. If the controller calculated the
distance correctly, and allowed the descending aircraft to continue through the level of the
cruising aircraft knowing that the required separation minimum did not exist, this would be a
violation.
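The separation scenario above can be reduced to a simple decision sketch. This is an illustration only, not an operational tool; the 20 NM minimum and 18 NM spacing are taken from the example in the text, and the function name is hypothetical:

```python
# Illustrative sketch of the DME separation example: the same unsafe
# clearance is classified as an "error" or a "violation" depending only
# on the controller's intent, i.e. what the controller believed at the time.

SEPARATION_MINIMUM_NM = 20.0  # minimum from the example in the text

def classify_occurrence(actual_spacing_nm: float,
                        spacing_controller_believed_nm: float) -> str:
    """Classify a loss-of-separation occurrence by intent.

    If the controller miscalculated and believed the minimum existed, the
    act was unintentional: an error. If the controller calculated correctly
    and cleared the descent anyway, the act was deliberate: a violation.
    """
    if actual_spacing_nm >= SEPARATION_MINIMUM_NM:
        return "no occurrence"          # required separation existed
    if spacing_controller_believed_nm >= SEPARATION_MINIMUM_NM:
        return "error"                  # miscalculation, no intent
    return "violation"                  # knew the minimum did not exist

print(classify_occurrence(18.0, 21.0))  # miscalculated -> error
print(classify_occurrence(18.0, 18.0))  # knew the distance -> violation
```

The point of the sketch is that the observable outcome (18 NM of spacing) is identical in both branches; only intent separates the two classifications.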
Violations should not be tolerated. In a number of accidents, the existence of a corporate culture which
tolerated, or in some cases even encouraged, the taking of short cuts rather than following published
procedures in full was identified as a major contributory factor.
Control of human error
Fortunately, few errors lead to adverse consequences, let alone accidents. Typically, errors are identified
and corrected with no undesirable outcomes, for example, selecting an incorrect frequency or setting the
altitude bug to the wrong altitude. On the understanding that errors are normal in human behaviour, the
total elimination of human error would be an unrealistic goal. The challenge then is not merely to prevent
errors, but to learn to safely manage the inevitable errors.
Three strategies for managing human errors are briefly discussed below. Such strategies are relevant in
flight operations, air traffic control or aircraft maintenance.12
a) Error reduction strategies intervene directly at the source of the error by reducing or eliminating
the contributing factors to the error. They aim at eliminating any adverse conditions that increase
the risk of error. Examples of error reduction strategies include improving the access to an aircraft
component for maintenance, improving the lighting in which the task is to be performed, reducing
environmental distractions and providing better training.
b) Error capturing assumes the error has already been made. The intent is to “capture” the error
before any adverse consequences of the error are felt. Error capturing is different from error
reduction in that it does not directly serve to reduce or eliminate the error. Examples of error-capturing
strategies include crosschecking to verify correct task completion, functional test flights, etc.
c) Error tolerance refers to the ability of a system to accept an error without serious consequence.
Examples of measures to increase error tolerance are the incorporation of multiple hydraulic or
12 From Human Factors Training Manual (Doc 9683).
electrical systems on an aircraft to provide redundancy, or a structural inspection programme that
provides multiple opportunities to detect a fatigue crack before it reaches critical length.
Some airlines have implemented error management strategies that have significantly reduced human
errors in such areas as rushed or non-stabilized approaches and incorrect use of checklists. Reducing the
frequency and consequences of human error provides enormous scope for safety management.
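The value of redundancy as an error-tolerance measure can be illustrated with simple arithmetic; the per-flight failure probability below is an invented figure for illustration, not an actual reliability statistic:

```python
# Illustrative arithmetic only: redundancy provides failure tolerance
# because the function is lost only if BOTH independent channels fail.
# Assumes the two systems fail independently; the probability is invented.

p_single = 1e-3          # assumed probability that one system fails on a flight
p_both = p_single ** 2   # probability that both independent systems fail

print(p_both)            # 1e-06: a thousand-fold improvement over one system
```

The same logic underlies procedural tolerance measures: an independent crosscheck fails to capture an error only when both the original action and the crosscheck fail together.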
ACCIDENT PREVENTION CYCLE

Accident prevention begins with an appreciation of the operational context in which accidents occur.
Given the number and potential relationships of all the factors that may affect safety, an effective
management system is required for accident prevention. An example of the type of systematic process
required is shown in Figure 4-6, Accident Prevention Cycle. A brief description of the cycle follows.
[Figure 4-6. Accident Prevention Cycle – a continuous loop comprising the stages Identify Hazard,
Assess Risks, Control Options, Risk Communication, Take Action and Monitor Progress.]
Hazard identification is the critical first step in managing safety. Hard evidence of hazards is required and
may be obtained in a number of ways, from a variety of sources, for example:
a) Hazard and incident reporting systems;
b) Investigation and follow-up of reported hazards and incidents;
c) Trend analysis;
d) Feedback from training;
e) Flight data analysis;
f) Safety surveys and operational oversight safety audits;
g) Monitoring normal line operations;
h) State investigation of accidents and serious incidents; and
i)
Information exchange systems.
Each hazard identified must be evaluated and prioritized. This evaluation requires the compilation and
analysis of all available data. The data is then assessed to determine the extent of the hazard: is it a
“one-of-a-kind”, or is it systemic? A database may be required to facilitate the storage and retrieval of the
data. Appropriate tools are then needed to analyse the data.
Having validated a safety deficiency, decisions must then be made as to the most appropriate action to be
taken to reduce or eliminate the hazard. The solution must take into account the local conditions, as “one
size” does not fit all situations. Care must be taken that the solution does not introduce new hazards. This
is the process of risk management.
Once appropriate safety action has been implemented, performance must be monitored to ensure that the
desired outcome has been achieved, for example:
a) The hazard has been eliminated (or at least been reduced in severity);
b) The action taken permits coping satisfactorily with the hazard; and
c) No new hazards have been introduced into the system.
If the outcomes are unsatisfactory, the whole process must be repeated.
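The cycle described above can be sketched as a simple control loop. This is an illustrative skeleton only: every function name and the severity-times-likelihood scoring are hypothetical placeholders for the organizational activities in the text, not part of any published method:

```python
# Hypothetical skeleton of the Accident Prevention Cycle (Figure 4-6).
# The logic is deliberately trivial so the loop is runnable end to end.

def identify_hazards(reports):
    # Hazard identification: reporting systems, audits, flight data analysis...
    return [r for r in reports if r.get("hazard")]

def assess_risk(hazard):
    # Evaluate and prioritize each hazard (toy severity x likelihood score).
    return hazard.get("severity", 0) * hazard.get("likelihood", 0)

def select_action(risk_score):
    # Risk management: choose a control option suited to the assessed risk.
    if risk_score >= 6:
        return "eliminate"
    return "mitigate" if risk_score >= 2 else "accept"

def monitor(hazard, action):
    # Monitoring: was the hazard addressed, or is the residual risk tolerable?
    return action != "accept" or assess_risk(hazard) < 2

reports = [
    {"hazard": True, "severity": 3, "likelihood": 3},   # serious, systemic
    {"hazard": False},                                  # no hazard reported
    {"hazard": True, "severity": 1, "likelihood": 1},   # minor, acceptable
]

for h in identify_hazards(reports):
    score = assess_risk(h)
    action = select_action(score)
    # If monitoring shows an unsatisfactory outcome, the whole cycle repeats.
    print(score, action, monitor(h, action))
```

The skeleton mirrors the text: identification feeds assessment, assessment drives the choice of control option, and monitoring closes the loop by triggering a repeat when the outcome is unsatisfactory.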
Accident prevention is a dynamic process that requires change. People are inherently resistant to change.
Management may be especially resistant to change if it costs money. Supposition and conjecture are
unlikely to convince decision-makers to spend money on measures with ill-defined prospects for
significant benefit. Thus, providing a compelling argument for change is a fundamental challenge for
those managing safety.
COST CONSIDERATIONS
Operating a profitable, yet safe airline or service provider requires a constant balancing act between the
need to fulfill production goals (such as on-time departures) vs. safety goals (which may require taking
extra time to ensure that a door is properly secured). The aviation workplace is filled with potentially
unsafe conditions which will not all be eliminated; yet, operations must continue.
Some airlines adopt a goal of “zero accidents” and state that “safety is their number one priority”. The
reality is that airlines (and other commercial aviation organizations) are in the business of making money.
Profit or loss is the immediate indicator of the company’s success in meeting its production goals.
However, safety is a prerequisite for a sustainable aviation business, as a company tempted to cut corners
will eventually realize. For most companies, safety can best be measured by the absence of accidental
losses. Companies may realize they have a safety problem following a major accident or loss, in part
because it will impact on the profit/loss statement. However, a company may operate for years with many
potentially unsafe conditions without adverse consequence. In the absence of effective safety
management to identify and correct these unsafe conditions, the company may assume that it is meeting
its safety objectives as evidenced by the “absence of losses”. In reality, it has been lucky.
[Figure 4-7. Safety vs. costs – a graph of costs against risk reduction, showing the cost of losses falling
and the cost of protection rising as risk reduction increases, with the total cost curve reaching a minimum
at an intermediate point.]
Safety and profit are not mutually exclusive. Indeed, quality organizations realize that expenditures on the
correction of unsafe conditions are an investment toward long-term profitability. Losses cost money. As
money is spent on risk reduction measures, costly losses are reduced – as shown in Figure 4-7. However,
by spending more and more money on risk reduction, the gains made through reduced losses may not be
in proportion to the expenditure. Companies must balance the costs of losses and expenditures on risk
reduction measures. In other words, some level of loss is acceptable from a straight profit and loss point
of view. However, few organizations can survive the economic consequences of a major accident. Hence,
there is a strong economic case for an effective safety management system. Such a system may require
energy and persistence, but not always a large budget.
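The trade-off sketched in Figure 4-7 can be illustrated with a toy numerical model. The figures and the loss-decay assumption below are invented purely for illustration:

```python
def expected_losses(spend, base_losses=10_000_000, halving=500_000):
    """Assumed model: accidental losses halve for every `halving` spent on risk reduction."""
    return base_losses * 0.5 ** (spend / halving)

def total_cost(spend):
    # Total cost = money spent on protection + expected accidental losses (Figure 4-7).
    return spend + expected_losses(spend)

# Search a grid of spending levels for the minimum of the total cost curve.
levels = range(0, 5_000_001, 100_000)
best = min(levels, key=total_cost)
print(f"cost-minimizing spend: {best:,}")
print(f"total cost there: {total_cost(best):,.0f} vs {total_cost(0):,.0f} with no protection")
```

Under this assumed model, spending beyond the cost-minimizing point still reduces losses, but each further unit of spend buys less loss reduction than it costs, which is the diminishing-returns point made above. A real analysis would also weigh the survivability argument: the catastrophic tail of a major accident justifies protection well beyond the simple cost minimum.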
Cost of accidents
There are three types of costs associated with an accident or serious incident: Direct, Indirect and
Industry/social costs.
Direct costs:
These are the obvious costs, which are fairly easily determined. They mostly relate to physical
damage and include the costs of compensating for injuries and of rectifying or replacing aircraft,
equipment and property. The high direct costs of an accident can be reduced by insurance coverage. (Some large
organizations effectively self-insure by putting funds aside to cover their risks.)
Indirect costs:
While insurance may cover specified accident costs, there are many uninsured costs. An
understanding of these uninsured costs (or indirect costs) is fundamental to understanding the
economics of safety and hence, the viability of measures for accident prevention.
Indirect costs include all those things that are not directly covered by insurance and usually total
much more than the direct costs resulting from an accident. Such costs are sometimes not obvious and
are often delayed. Some examples of uninsured costs that may accrue from an accident include:13
a) Loss of business and damage to the reputation of the organization. Many organizations will
not allow their personnel to fly with an operator with a questionable safety record.
b) Loss of use of equipment, which equates to lost revenue. Replacement equipment may have to be
purchased or leased. Companies operating a one-of-a-kind aircraft may find that their spares
inventory and the people specially trained for such an aircraft become surplus.
c) Loss of staff productivity. If people are injured in an accident, and are unable to work, many
States require that they continue to be paid. Also, they will need to be replaced at least for the
short term, incurring the costs of wages, overtime (and possibly training), as well as imposing
an increased workload on the experienced workers.
d) Investigation and clean-up are usually uninsured costs. Operators may incur costs from the
investigation including the costs of their staff involvement in the investigation, the costs of
tests and analysis, wreckage recovery, restoring the accident site, etc.
e) Insurance deductibles. The policyholder’s obligation to cover the first portion of the cost of
any accident must be met. A claim will also put a company into a higher risk category for
insurance purposes, and therefore may result in increased premiums. (Conversely, the
implementation of a comprehensive safety management system could help a company to
negotiate a lower premium.)
13. Wood, Richard H., Aviation Safety Programs: A Management Handbook, 3rd edition, Englewood, CO: Jeppesen, 2003.
f) Legal action and damage claims. Legal costs can accrue rapidly. While it is possible to take
out insurance for public liability and damages, it is virtually impossible to cover the cost of
time lost handling legal action and damage claims.
g) Fines and citations imposed by governmental authorities, possibly including the shutting
down of unsafe operations.
Industry and social costs:
In addition to the dollar costs identified above, an aviation disaster may damage the reputation and
market of the aviation industry as a whole, well beyond the airline involved in the accident. (Events
after 11 September 2001 are instructive.) Travellers switching to alternative means of transport
(such as road travel) may be exposed to additional risks – a social cost.
Costs of incidents
Serious aviation incidents, which result in minor damage or injuries, can also incur many of these indirect
or uninsured costs. Typical cost factors arising from such incidents can include:
a) Flight delays and cancellations;
b) Alternate passenger transportation, accommodation, complaints, etc.;
c) Crew change and positioning;
d) Loss of revenue and reputation, etc.;
e) Aircraft recovery, repair and test flight; and
f) Incident investigation.
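As a rough sketch of how such indirect costs add up, the following tally uses the cost factors listed above with purely invented figures:

```python
# Hypothetical tally of the uninsured (indirect) costs of a serious incident.
# All figures are invented for illustration only.
incident_costs = {
    "flight delays and cancellations": 120_000,
    "alternate passenger transport and accommodation": 45_000,
    "crew change and positioning": 15_000,
    "loss of revenue and reputation": 200_000,
    "aircraft recovery, repair and test flight": 350_000,
    "incident investigation": 60_000,
}
total = sum(incident_costs.values())
for item, cost in sorted(incident_costs.items(), key=lambda kv: -kv[1]):
    print(f"{item:<48} {cost:>10,}")
print(f"{'total uninsured cost':<48} {total:>10,}")
```

Even with modest assumed values for each factor, the aggregate quickly reaches a scale that a profit-and-loss statement cannot ignore, which is why these indirect costs usually exceed the insured direct costs.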
Costs of accident prevention
The costs of accident prevention are probably even more difficult to quantify than the full costs of
accidents — in part, because of the difficulty in assessing the value of accidents that have been prevented.
Nevertheless, some airlines have attempted to quantify the costs and benefits of introducing safety
management systems (SMS). They have found the cost savings to be substantial. Performing a cost
benefit analysis is complicated due to the small number of accidents. However, it is an exercise that
should be undertaken, as senior management are not inclined to spend money if there is no quantifiable
benefit. One way of addressing this issue is to separate the costs of the safety management system from
the cost of correcting safety deficiencies, by charging the safety management costs to the safety
department and the safety deficiency costs to the line management most responsible. This exercise
involves senior management in considering costs and benefits.
If you think safety is expensive,
try an accident.
————————
Chapter 5
BASICS OF SAFETY MANAGEMENT
Outline
THE PHILOSOPHY OF SAFETY MANAGEMENT
 General
 Core business function
 Systems approach
 System safety
FACTORS AFFECTING SYSTEM SAFETY
 System failures
 Active and latent failures
 Equipment faults
 Human Errors
 System design (Defences-in-depth)
SAFETY MANAGEMENT CONCEPTS
 Hazard identification
 Cornerstones of safety management
 Strategies for safety management
 Key safety management activities
 Safety management process
 Safety oversight
 Safety Culture
 Safety performance indicators and targets
 Safety Management Systems (SMS)
 Threat Error Management (TEM)
Appendices:
1. Three Cornerstones of Safety Management
2. Précis – Threat and Error Management
Chapter 5
BASICS OF SAFETY MANAGEMENT
THE PHILOSOPHY OF SAFETY MANAGEMENT
General
In view of the total costs of a major accident, management needs to conserve the organization’s assets and
minimize its losses. This requires management of all the factors which can impact on safety – so as to
reduce the risks of accidental loss to an acceptable level. Safety management then is the systematic
management of the risks associated with flight operations, related ground operations and aircraft
engineering or maintenance activities to achieve high levels of safety performance.
For the purposes of this manual, safety was defined in Chapter 1 as “the state in which the risk of harm to
persons or property damage is reduced to, and maintained at or below, an acceptable level through a
continuing process of hazard identification and risk management.”
Core business function
In successful aviation organizations, safety management is a core business function — just as financial
management is. The application of contemporary risk management methods facilitates the identification
of an organization’s weaknesses and guides management towards the cost-effective resolution of unacceptable risks.
Effective information management is required to support safety analyses and for the sharing of safety
lessons and best practices across the industry. Finally, some system of performance measurement is
required to confirm the effectiveness of the organization’s safety management system and to test the
validity of steps taken to reduce risks.
Effective safety management requires a realistic balance between safety and production goals. Thus, a
coordinated approach, in which all the organization's goals and resources are analysed, helps ensure that
decisions concerning safety are realistic and complementary to the operational needs of the organization.
The finite limits of financing and operational performance must be accepted in any industry. Defining
acceptable and unacceptable risks is therefore important for cost-effective safety management. If properly
implemented, safety management measures not only increase safety, but also improve the operational
effectiveness of an organization.
Experience in other industries and lessons learned from the investigation of aircraft accidents have
emphasized the importance of managing safety in a systematic, proactive, and explicit manner.
Systematic means that safety management activities will be in accordance with a pre-determined
plan, and will be applied in a consistent manner throughout the organization.
Proactive means the adoption of an approach which emphasizes prevention, through the
identification of hazards and the introduction of risk mitigation measures before the risk-bearing
event occurs and adversely affects safety performance.
Explicit means that all safety management activities should be documented, visible and
performed independently from other management activities.
Addressing safety in a systematic, proactive and explicit manner ensures, on a long-term basis, that safety
becomes an integral part of the day-to-day business of the organization, and that the safety-related
activities of the organization are directed to the areas where the benefits will be greatest.
Systems approach
To ensure that all possible sources of hazards which could affect safety are identified, safety management
requires a systems approach. The “system” includes all the people, procedures and technology needed to
operate or support the aviation system.
Modern approaches to safety management have been shaped by the concepts introduced in Chapter 4
(Understanding Safety), and in particular, the role of organizational issues as contributory factors in
accidents and incidents. Safety cannot be achieved simply by introducing rules or directives concerning
the procedures to be followed by operational staff.
The scope of safety management encompasses almost the whole range of activities of the organization
concerned. It is for this reason that safety management must start from the senior level of management,
and that potential effects on safety must be examined at all levels of the organization.
System safety
System safety may be described as the systematic application of specific technical and managerial skills
to the identification and control of hazards under pre-determined conditions with an acceptable minimum
of accidental loss. The primary objective of system safety is accident prevention and it is achieved by
focusing on the identification and control of hazards associated with a system or product. By proactively
identifying, assessing, and eliminating or controlling safety-related hazards, acceptable levels of safety
are achievable.
System safety was developed as an engineering discipline for aerospace and missile defence systems in
the 1950s. Its practitioners were safety engineers, not operational specialists. As a result, their focus
tended to be on designing and building fail-safe systems. On the other hand, civil aviation tended to focus
on flight operations, and safety managers often came from the ranks of pilots. In pursuing improved
safety, it became necessary to view aviation safety as more than just the aeroplane and its pilots. Aviation
is a total system that includes everything needed to keep an aeroplane in the air safely. The ‘system’
includes the airport, air traffic control, maintenance, flight attendants, ground operational support,
dispatch and others. Thus, the total aviation system is more than just the aeroplane and its pilots, and
sound safety management is required for all parts of it.
FACTORS AFFECTING SYSTEM SAFETY
The factors affecting safety within the defined system can be seen from two points of view; first, a
discussion of those factors which may result in situations in which safety is compromised, and second, an
examination of how an understanding of these factors can be applied to the design of systems to reduce
the likelihood of occurrences which may compromise safety.
The search for factors that could compromise safety must include all levels of the organization
responsible for both operations and the provision of supporting services. As outlined in Chapter 4
(Understanding Safety), safety starts at the highest level of the organization. Any situation that results in
safety standards being compromised can be considered a failure of the system to meet the primary goals
of the organization.
System failures
A system failure is any occurrence that results in the inability to provide the level of service that the
system is intended to provide. Failure of one element of the system does not necessarily result in a system
failure. If an alternate that provides the required level of service is available, and there is no interruption
of service in the changeover, the system as a whole has not failed.
As seen in Chapter 4, a failure rarely has a single cause. There will always be at least one (and often more
than one) initiating event. Initiating events occur close in time to the failure, and are obviously linked to
it. There are usually also a number of contributory events. Sometimes the link between contributory
events and the failure may be obvious (at least after the event), however there is another group of
contributory events where the link is not always obvious. These may include decisions regarding system
design or organizational policy, taken months, or even years, before the failure.
The causal factors for a system failure may be related to equipment, or the result of human error, or a
combination of both. In this manual, the term failure will be reserved for system failures as defined
above. The term fault will be used to describe malfunctions in equipment or facilities. The term error will
be used only in the context of human error.
A system failure is any occurrence which results in the inability to
provide the level of service that the system is intended to provide.
A fault is any occurrence that results in an equipment malfunction.
An error is a failure in human performance.
Active and latent failures
When a failure actually occurs in a system, its results can be observed. Such a failure is classified as an
active failure. The term latent failure is used to describe the existence in a system of hazards or unsafe
conditions which, given a particular sequence of triggering events, could lead to a failure of the system.
The direct causes of active failures are generally the result of equipment faults or errors committed by
operational personnel. Latent failures, however, always have a human element. They may be the result of
undetected design flaws. They may be related to unrecognized consequences of officially approved
procedures. There have also been a number of cases where latent failures have been the direct result of
decisions taken by the management of the organization. For example, latent failures exist when the
culture of the organization encourages taking short cuts rather than always following approved
procedures. The direct cause of a failure associated with taking short cuts would be the failure of a person
at the operational level to follow correct procedures. However, if there is general acceptance of this sort
of behaviour amongst operational personnel, and management are either unaware of this or take no action,
there is a latent failure in the system at the management level.
Equipment faults
The determination of the likelihood of system failures due to equipment faults is the domain of reliability
engineering. The probability of system failure is determined by analyzing the failure rates of individual
components of the equipment. The causes of the component failures may include electrical, mechanical
and software faults.
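The core arithmetic of such a reliability analysis is simple, even though real analyses rest on far more detailed failure-rate data. A minimal sketch, assuming independent component failures:

```python
def series_reliability(reliabilities):
    """System works only if every component works (no redundancy): reliabilities multiply."""
    r = 1.0
    for ri in reliabilities:
        r *= ri
    return r

def parallel_reliability(reliabilities):
    """Redundant arrangement fails only if every component fails."""
    p_all_fail = 1.0
    for ri in reliabilities:
        p_all_fail *= 1.0 - ri
    return 1.0 - p_all_fail

# Three components in series, each 99% reliable over the analysis period:
print(series_reliability([0.99, 0.99, 0.99]))   # ≈ 0.9703
# Duplicating a single 99%-reliable unit in parallel:
print(parallel_reliability([0.99, 0.99]))       # ≈ 0.9999
```

The two functions capture the basic design lever: chaining components erodes reliability, while redundancy recovers it, provided the redundant elements really are independent.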
A safety analysis is required to consider both the likelihood of failures during normal operations and the
effects of continued unavailability of any one element on other aspects of the system. The analysis should
include the implications of any loss of functionality or redundancy as the result of equipment being taken
out of service for maintenance. It is therefore important that the scope of the analysis and the definition of
the boundaries of the system for purposes of the analysis be sufficiently broad that all necessary
supporting services and activities are included. As a minimum, a safety analysis should consider the
elements of the SHEL model outlined in Chapter 4; specifically it should include the following sources of
faults:
 Hardware faults,
 Software malfunctions,
 Environmental conditions,
 Dependencies on external services, or
 Operational and maintenance procedures.
It is possible that a single fault may cause the loss of more than one service or system function. An
example would be a situation where both primary and secondary VHF communications equipment shared
common primary and stand-by power supplies. In the event of a failure of both primary and stand-by
power, both primary and secondary VHF would not be available. Such multiple failures resulting from a
single fault are called common mode failures. The linkages that can cause a common mode failure are
not always obvious. The possibility of common mode failures needs careful investigation in the early
stages of system design.
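The VHF example can be put in numbers. The probabilities below are invented, and the point is only qualitative: even a modest common-mode dependency can dominate the failure probability of an otherwise highly redundant arrangement.

```python
P_RADIO_FAIL = 0.01   # probability a single radio fails (invented figure)
P_POWER_FAIL = 0.005  # probability the shared power chain fails (invented figure)

# Naive view: comms are lost only if both independent radios fail.
p_both_radios = P_RADIO_FAIL ** 2

# With the shared dependency: losing the common power supply alone silences
# both radios, so it must be combined as an independent way of losing comms.
p_loss_of_comms = 1 - (1 - p_both_radios) * (1 - P_POWER_FAIL)

print(f"both radios fail independently:        {p_both_radios:.6f}")
print(f"loss of comms including shared power:  {p_loss_of_comms:.6f}")
```

With these assumed figures, the shared power chain accounts for roughly fifty times more of the total loss-of-communications probability than the redundant radios themselves, which is why common-mode linkages deserve scrutiny early in design.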
The techniques for estimating the probability of overall system failure as a result of equipment faults and
estimating parameters such as availability and continuity of service are well established and are described
in standard texts on reliability and safety engineering. These issues will not be addressed further in this
manual.
Human errors
An error occurs when the outcome of a task being performed by a human is not the intended outcome.
The way in which a human operator approaches a task depends on the nature of the task, and how familiar
the operator is with it. Human performance may be skill-based, rule-based or knowledge-based. Errors
may be the consequence of lapses in memory, slips in doing what was intended, or the result of mistakes
which are conscious errors in judgement. A distinction should also be made between honest or normal
errors committed in the fulfilment of assigned duties and deliberate violations of prescribed procedures or
accepted safe practices. In order to understand human error and how it occurs, it is therefore necessary to
understand the mechanisms involved in human performance. These are discussed in Chapter 4
(Understanding Safety).
System design
Chapter 4 also described the context for accidents. Given the complex interplay of so many human,
materiel, and environmental factors in normal operations, the complete elimination of risk is an
unachievable goal. Even in organizations with the best of training programmes and a positive safety
culture, human operators will occasionally make errors. The best-designed and maintained equipment will
occasionally fail. System designers must therefore take account of the inevitability of errors and failures.
It is important that the system be designed and implemented in such a way that, to the maximum extent
possible, errors and equipment failures will not result in an accident; that is, the system is ‘error tolerant’.
The hardware and software components of a system are generally designed to meet specified levels of
availability, continuity and integrity. The techniques for estimating system performance in terms of these
parameters are well established. When necessary, redundancy can be built into the system, to provide
alternatives in the event of failure of one or more elements of the system.
The performance of the human element cannot be specified as precisely as this; however, it is essential
that the possibility of human error be considered as part of the overall design of the system. This requires
an analysis to identify potential weaknesses in the procedural aspects of the system, taking into account
the normal shortcomings in human performance. The analysis should also take into account the fact that
accidents rarely have a single cause. As noted earlier, they usually occur as part of a sequence of events in
a complex situational context; so the analysis needs to consider combinations of events and
circumstances, to identify sequences that could possibly result in safety being compromised.
Developing a safe and error tolerant system requires that the system contain multiple defences, barriers
and safeguards to ensure that, as far as possible, no single failure or error will result in an accident, and
that when a failure or error occurs, it will be recognized and remedial action taken before a sequence of
events leading to an accident can develop. The need for a series of defences rather than just a single
defensive layer arises from the possibility that the defences themselves may not always work perfectly.
This design philosophy is called defences-in-depth.
Failures in the defensive layers of an operational system can be thought of as creating gaps in the
defences. As the operational situation or equipment serviceability states change, gaps may occur as a
result of:
 Undiscovered and longstanding shortcomings in the defences,
 The temporary unavailability of some elements of the system as the result of maintenance action,
 Active failures of equipment, or
 Human errors or violations.
For an accident to occur in a well designed system, these gaps must develop in all the defensive layers of
the system at the critical time when that defence should have been capable of detecting the earlier error or
failure. How an accident event must penetrate all defensive layers is shown diagrammatically in
Figure 5-1.
Figure 5-1. Defences-in-depth
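The defences-in-depth idea can be sketched numerically. The layer effectiveness values below are invented, and real defensive layers are rarely independent (a gap in one layer often correlates with gaps in another), so a calculation like this understates the true risk:

```python
# Each layer's probability of stopping a given error (invented values):
layers = {
    "equipment design": 0.90,
    "SOPs and checklists": 0.95,
    "automated alerts (e.g. STCA, ACAS)": 0.99,
}

# An accident requires the error to slip through every layer at once.
p_penetrate_all = 1.0
for name, p_catch in layers.items():
    p_penetrate_all *= 1.0 - p_catch

print(f"probability an error penetrates all layers: {p_penetrate_all:.6f}")
```

No single layer here is especially strong, yet the stack in combination is far stronger than any one of them, which is the argument for multiple imperfect defences over a single supposedly perfect one.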
Table 5-1 below shows some examples of defences, barriers and safeguards at the technological,
operational and organizational levels.
Table 5–1. Examples of Defences, Barriers and Safeguards
1. Equipment
 Redundancy:
o Full redundancy providing the same level of functionality when operating on the alternate
system.
o Partial redundancy resulting in some reduction in functionality, e.g. a local copy of essential
data from a centralized network database.
 Independent checking of design calculations and assumptions.
 System designed to ensure a gradual degradation of capability (not total loss of capability) in the
event of failure of individual elements.
 Policy and procedures regarding maintenance which minimize loss of functionality in the active
system, or loss of redundancy.
 Automated work aids, e.g. STCA, MSAW, Runway Entry Monitors, ACAS. These should always
be subsidiary to safe and effective operating procedures, which form the primary defence against
the events which these automated aids are designed to detect.
2. Operating Procedures
 Standard Operating Procedures (SOPs).
 Read-back of critical items in clearances and instructions.
 Use of checklists.
3. Organizational Factors
 Clear safety policy, implemented with adequate funding for safety management activities.
 Oversight to ensure correct procedures are followed, with no tolerance for violations or short-cuts.
 Adequate control over the activities of contractors.
The discussion of accident causation and the operational context in which accidents occur demonstrates the
need for these defences-in-depth to include equipment design and operating procedures which take
account of the natural limits of humans and material, including human and cultural factors as well as
environmental factors.
These principles may be applied to the task of reducing risk in both proactive and reactive ways. By
careful analysis of a system, it is possible to identify sequences of events where a combination of
equipment faults and human errors could lead to an accident or an incident. Identification of the active
and latent failures revealed by this type of analysis will enable remedial action to be taken to strengthen the
system’s defences.
When planning significant change to a system, a formal assessment of the potential safety implications is
warranted. Chapter 13 outlines the process for conducting safety assessments.
SAFETY MANAGEMENT CONCEPTS
Hazard identification
Hazard identification is the process of identifying those situations or conditions that have the potential to
cause harm to persons or damage to property. The scope for hazards is wide, including aspects of design
and manufacture, operations and maintenance. Hazards may involve system design faults, active or latent
failures, isolated or systemic equipment faults or human errors, etc.
Isolated hazards and risks may not be a significant problem. However, when hazards or risks exist
concurrently at many levels, there is an increased probability of an accident or incident. Therefore, a
structured approach to the identification of hazards is required to ensure that, as far as possible, all
potential hazards are identified. Indeed, hazard identification is the sine qua non of safety management. It
requires objective, accurate information which is free from emotion.
Hazards may be identified through several diverse safety management activities; some are reactive to
known events, others are proactive. For example:
a) Incident reporting systems;
b) Occurrence investigations;
c) Trend monitoring and safety analyses;
d) Proactive safety processes such as Flight Data Analysis (FDA) and Line Operations Safety Audits
(LOSA); and
e) Safety audits and formal safety assessments.
The defences-in-depth described above protect against hazards. Part 4 — Applied Safety Management
provides additional guidance on the creation and operation of several processes for effective hazard
identification.
The key features of an effective hazard identification system are:
a) Proactive and reactive identification of unsafe conditions;
b) Collection of current and applicable hazard information;
c) A procedure for receiving and actioning reports of hazards;
d) Measures to protect the identity of persons identifying the hazards;
e) A reliable method of accurately recording, storing, and retrieving hazard data;
f) The capability to analyze hazard reports, both individually as well as in aggregate;
g) A procedure for distributing lessons learned to affected staff and contractors; and
h) The capability of being audited.
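A system with several of these features can be sketched in a few lines. All names here are hypothetical, and a production system would need persistent, auditable storage rather than an in-memory list:

```python
from dataclasses import dataclass

@dataclass
class HazardReport:
    category: str      # e.g. "runway incursion", "maintenance"
    description: str
    reporter: str = ""

reports: list[HazardReport] = []

def submit(report: HazardReport) -> None:
    report.reporter = ""   # de-identify on receipt to protect the reporter's identity
    reports.append(report)

def count_by_category() -> dict[str, int]:
    # Aggregate analysis: how many reports per hazard category.
    counts: dict[str, int] = {}
    for r in reports:
        counts[r.category] = counts.get(r.category, 0) + 1
    return counts

submit(HazardReport("runway incursion", "Vehicle crossed without clearance", "J. Doe"))
submit(HazardReport("runway incursion", "Read-back error at the holding point", "A. N. Other"))
print(count_by_category())
```

Even this toy version shows the point of aggregate analysis: individual reports may look unremarkable, while the category counts reveal a recurring hazard worth investigating.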
Cornerstones of safety management
In its simplest terms, safety management involves hazard identification and the closing of any
potential gaps in the defences of the system (see Figure 5-1 above). Effective safety management is
multidisciplinary, requiring the systematic application of a variety of techniques and activities across the
spectrum of aviation activities, in a continuous process. It builds upon three defining cornerstones:
A comprehensive corporate approach sets the tone for the management of safety. The corporate
approach builds upon the safety culture of the organization and embraces the organization’s safety
policies, objectives and goals, and perhaps most importantly, senior management’s commitment to
safety;
Effective organizational tools are needed to deliver the necessary activities and processes to advance
safety. This cornerstone includes how the organization arranges its affairs to fulfil its safety policies,
objectives and goals, how it establishes standards and allocates resources, etc. The principal focus is on hazards and
their potential effects on safety-critical activities; and
A system for safety oversight to confirm the organization’s continuing fulfilment of its corporate
safety policy, objectives, goals and standards. Feedback mechanisms facilitate continuous
improvement. These include such activities as monitoring and analysis of operational data, safety
audits and performance analysis, the identification and adoption of best industry practices, etc.
A more detailed examination of each of these cornerstones is provided at Appendix 1 to this Chapter
(Three Cornerstones of Safety Management).
Strategies for safety management
The strategy that an organization adopts for its safety management system will reflect its corporate safety
culture and may range from purely reactive, responding only to accidents, through to strategies that are
highly proactive in their search for safety problems. The traditional or reactive process is dominated by
retrospective repairs (i.e. fixing the stable door after the horse has bolted). Under the more contemporary
or proactive approach, prospective reform plays the leading part (i.e. building a stable from which no horse
could bolt, or would even want to). Depending on the strategy adopted, different methods and tools need to be
employed.
a) Reactive strategy: Investigate accidents and reportable incidents. This strategy is useful for
situations involving failures in technology or unusual events. The utility of the reactive approach
for accident prevention purposes depends on the extent to which the investigation goes beyond
determining the causes, to include an examination of all the contributory factors. The reactive
approach tends to be marked by:
1) Management’s safety focus being on compliance with minimum requirements;
2) Safety measurement being based on reportable accidents and incidents with such
limitations in value as:
— Any analysis is limited to examining actual failures;
— Insufficient data is available to accurately determine trends, especially those
attributable to human errors; and
— Little insight is available into the ‘root causes’ and latent unsafe conditions,
which facilitate human errors; and
3) Constant catching up is required to match human inventiveness for new types of errors.
b) Proactive safety strategy: Prevent accidents by aggressively seeking information from a variety of
sources which may be indicative of emerging safety problems. Organizations pursuing a proactive
strategy for accident prevention believe that the risk of accidents can be minimized by identifying
vulnerabilities before they lead to failures, and by taking the necessary actions to reduce those risks. To do
this, they actively seek systemic unsafe conditions through such tools as:
1) Hazard and incident reporting systems which promote the identification of latent unsafe
conditions;
2) Safety surveys to elicit feedback from front-line personnel about areas of dissatisfaction
and unsatisfactory conditions which may have accident potential;
3) Flight data recorder analysis for identifying operational exceedances and confirming
normal operating procedures;
4) Operational inspections or audits of all aspects of operations, to identify vulnerable areas
before accidents, incidents, or minor safety events confirm a problem exists; and
5) A policy for consideration and embodiment of manufacturers’ service bulletins; etc.
Key safety management activities
Those organizations which manage safety most successfully practise several common activities. They
tend to do specific things; for example:
• Organization. They are organized to establish a culture of safety and reduce their accidental losses. Larger organizations may implement a formal safety management system as outlined in Part 3 of this manual.
• Safety assessments. They systematically analyze proposed changes to equipment or procedures to identify and mitigate weaknesses before the change is implemented.
• Occurrence reporting. They have established formal procedures for reporting safety occurrences and other unsafe conditions.
• Hazard identification schemes. They employ both reactive and proactive schemes for identifying safety hazards throughout their organization, such as voluntary incident reporting, safety surveys, operational safety audits and safety assessments. Part 4 outlines several safety processes that are effective for the identification of safety hazards; for example, Flight Data Analysis (FDA) and Line Operations Safety Audits (LOSA).
• Investigation and analysis. They follow up on reported occurrences and unsafe conditions, initiating competent safety investigations and safety analyses where necessary.
• Performance monitoring. They actively seek the feedback necessary to close the loop of the accident prevention cycle (outlined in Chapter 4 – Understanding Safety), using such techniques as trend monitoring and internal safety audits.
• Safety promotion. They actively disseminate the results of safety investigations and analyses, sharing safety lessons learned both internally within the organization and outside it if warranted.
• Safety oversight. They systematically monitor and assess safety performance.
All these activities are described elsewhere in this manual.
Safety management process
Conceptually, the safety management process parallels the cycle of accident prevention described in
Figure 4-6. Both involve a continuous loop process as represented in Figure 5-2 below.
[Figure 5-2 depicts the safety management process as a continuous loop: Collect Data → Analyze Data → Prioritize Unsafe Conditions → Develop Strategies → Approve Strategies → Assign Responsibilities → Implement Strategies → Re-evaluate Situation → Collect Additional Data, with latent unsafe conditions feeding into the data collection steps.]

Figure 5-2. Safety management process
Safety management is evidence-based, in that it requires the analysis of actual data to identify hazards.
Using risk assessment techniques, priorities are set for reducing the potential consequences of the hazards.
Strategies to reduce or eliminate the hazards are then developed and implemented with clearly established
accountabilities. The situation is reassessed on a continuing basis and additional measures are
implemented as required.
Following are brief descriptions of the steps of the safety management process outlined in Figure 5-2.
a) Data collection. The first step in the safety management process is the acquisition of relevant
safety data – the evidence necessary to determine system safety performance or to identify latent
unsafe conditions (safety hazards). The data may derive from any corner of the system: the
equipment used, the people involved in the operation, their work procedures, their
human/equipment/procedures interactions, etc.
b) Analyse the data. By analysing all the pertinent information, safety hazards can be identified. The
conditions under which the hazards pose real risks, their potential consequences and the
likelihood of occurrence can be determined; in other words: What can happen? How? When?
This analysis can be both qualitative and quantitative. The inability to quantify a hazard, and/or
the lack of historical data on it, does not exclude the hazard from the analysis.
c) Prioritize the unsafe conditions. The seriousness of hazards is determined by considering them
relative to each other, and against an agreed set of acceptability criteria. Those posing the greatest
risks are considered for safety action. This may require a cost-benefit analysis.
d) Develop strategies. Beginning with the highest priority risks, several options for managing the
risks may be considered, for example:
1) Spread the risk across as large a base of risk-takers as practicable. (This is the basis
of insurance.)
2) Transfer the risk (or liability) to some other organization or agency (such as through
lease vs. buy decisions).
3) Eliminate the risk entirely (possibly by ceasing that operation or practice).
4) Accept the risk, and continue operations unchanged.
5) Mitigate the risk by implementing measures to reduce the risk or at least facilitate
coping with the risk.
In selecting a risk management strategy, care is required to avoid introducing new (and
perhaps worse) risks. (See Chapter 6 for further guidance on risk management.)
e) Approve strategies. Having analyzed the risks and decided on an appropriate course of action,
management’s approval is required to proceed. The challenge here is the formulation of a
convincing argument for (perhaps expensive) change.
f) Assign responsibilities and implement strategies. Following the decision to proceed, the
“nuts and bolts” of implementation must be worked out. This includes a determination of
resource allocation, assignment of responsibilities, scheduling, revisions to operating
procedures, etc.
g) Re-evaluate situation. Implementation is seldom as successful as initially envisaged.
Feedback is required to close the loop. What new problems may have been introduced? How
well is the agreed strategy for risk reduction meeting performance expectations? What
modifications to the system or process may be required?
h) Collect additional data. Depending on the re-evaluation step, new information may be
required and the full cycle reiterated to refine the safety action.
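The prioritization step c) above is often implemented as a simple risk ranking of likelihood against severity. The following sketch illustrates the idea under assumed 1 to 5 scales; the hazards and scores shown are invented for illustration only:

```python
# Illustrative sketch of step c), prioritizing unsafe conditions.
# Hazards, likelihood (1-5) and severity (1-5) scores are hypothetical.

hazards = [
    {"hazard": "faded runway markings", "likelihood": 4, "severity": 3},
    {"hazard": "confusing SOP wording", "likelihood": 3, "severity": 2},
    {"hazard": "unreliable radio altimeter", "likelihood": 2, "severity": 5},
]

# Simple risk index: likelihood x severity (a common risk-matrix scheme).
for h in hazards:
    h["risk"] = h["likelihood"] * h["severity"]

# The highest-risk hazards are considered first for safety action.
ranked = sorted(hazards, key=lambda h: h["risk"], reverse=True)
for h in ranked:
    print(f'{h["risk"]:2d}  {h["hazard"]}')
```

Real risk assessment schemes (see Chapter 6) use agreed acceptability criteria rather than a bare product, but the ranking principle is the same.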
Safety management requires analytical skills that may not be routinely practiced by management. The
more complex the analysis, the greater the need for appropriate analytical tools. The closed-loop
process of safety management also requires feedback to ensure that management can test the validity of
its decisions and assess the effectiveness of their implementation. (Chapter 9 provides guidance on
safety analysis.)
Safety oversight
Safety oversight or safety performance monitoring activities are an essential component of the
organization’s overall safety management strategy. Safety oversight provides the means by which an
organization can verify how well it is fulfilling its safety objectives. An effective safety oversight
programme increases the probability of detecting any weaknesses in the system defences before an active
failure leads to an accident or serious incident. To do this, data must be collected and analysed.
Some of the requirements for a safety performance monitoring system will already be in place in many
States. For example, States would normally have published regulations relating to mandatory reporting of
accidents and incidents, in accordance with Annex 13 — Aircraft Accident and Incident Investigation,
and would have procedures in place for processing and investigating these reports.
Identifying weaknesses in the system defences requires more than just collecting retrospective data and
producing summary statistics. The underlying causes of reported occurrences are not necessarily
immediately apparent, therefore investigation of safety occurrence reports, and any other information
concerning possible hazards, should go hand in hand with safety performance monitoring.
The implementation of an effective safety oversight programme requires that organizations:
a) Determine relevant safety performance indicators (see below);
b) Establish a safety occurrence reporting system;
c) Establish a system for the investigation of safety occurrences;
d) Develop procedures for the integration of safety data from all available sources; and
e) Develop procedures for the analysis of the data and the production of periodic safety
performance reports.
Chapter 10 – Safety Performance Monitoring provides guidance on the safety oversight function.
Safety culture
Some organizations believe that they have adequately addressed “safety management” by appointing a
Director of Safety or a Flight Safety Manager. Often, this person is expected to “manage safety” and
“prevent accidents” without a clear set of objectives and priorities, with limited guidance about how to
do the work and a lack of resources to adequately undertake the task. Effective safety management is not
a single function carried out by a designated organizational element. It needs to be a “way of thinking”,
shared by all elements of the organization.
Effective safety management requires more than setting up an organizational structure and promulgating
rules or directives specifying the procedures to be followed. It requires a genuine commitment to safety
on the part of senior management, and an organizational culture such that staff at all levels are safety
conscious in their approach to their tasks.
There is a close correlation between the philosophy of safety management and the concept of a safety
culture. The philosophy defines a way of thinking about safety. The safety culture is a result of this way
of thinking being translated into actions, so that the organizational culture becomes safety oriented. In
essence, a positive safety culture makes safety “everybody’s business”. The safety culture will mature as
exposure to and experience with safety management increases. (The characteristics of an organization
with a positive safety culture were outlined in Chapter 4.)
Only by having an internal system with adequate safeguards, and a corporate culture which ensures that
these safeguards will not be circumvented, can acceptable levels of safety performance be assured. The
identification of hazards, implementation of safeguards and development of a safety-oriented
organizational culture can only be done from within an organization. An external body, such as the Safety
Regulator, can audit the system to see whether it complies with ICAO and national requirements, but it
cannot actively manage the day-to-day operation of the safety management system.
Safety performance indicators and targets
As described above, the safety management process is closed loop (like the accident prevention cycle
described earlier in the manual). The process requires feedback to provide a baseline for assessing the
system’s performance so that necessary adjustments can be made to effect the desired levels of safety.
This requires a clear understanding of how results are to be evaluated; what will be the quantitative or
qualitative indicators that will be employed to determine that the system is working? Having decided the
factors by which success can be measured, safety management implies the rational setting of specific
safety goals and objectives. For the purposes of this discussion, this Manual uses the following
terminology.
Safety performance indicator. A measure (or metric) used to express the level of safety
performance required or achieved in a system.
Safety performance target. The required level of safety performance for a system. A safety
performance target comprises one or more safety performance indicators, together with desired
outcomes expressed in terms of those indicators.
It is necessary to draw a distinction between the criteria used to assess operational safety performance
through monitoring, and the criteria used for the assessment of planned new systems or procedures. The
process for the latter is known as safety assessments which are discussed in Chapter 13.
Safety performance indicators
In order to set safety performance targets, it is necessary to first decide on appropriate safety
performance indicators. Safety performance indicators are generally expressed in terms of the
frequency of occurrence of some event causing harm. Typical measures which could be used include:
• aircraft accidents per flight hour;
• aircraft accidents per movement;
• aircraft accidents per year;
• serious incidents per flight hour; and
• fatalities due to aircraft accidents per year.
There is no single safety performance indicator which is appropriate in all circumstances. The
indicator chosen to express a safety performance target must be matched to the application in which it
will be used, so that it will be possible to make a meaningful evaluation of safety in the same terms as
those used in defining the safety performance target.
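The arithmetic behind such rate-based indicators can be illustrated as follows; the accident counts and exposure figures below are invented for illustration and do not represent real data:

```python
# Illustrative computation of rate-based safety performance indicators.
# The accident count and exposure figures are invented, not real data.

accidents = 4
flight_hours = 2_000_000
movements = 1_500_000

# Rates are usually normalized to a convenient exposure base,
# e.g. per 100 000 flight hours or per 100 000 movements.
rate_per_100k_hours = accidents / flight_hours * 100_000
rate_per_100k_movements = accidents / movements * 100_000

print(f"{rate_per_100k_hours:.2f} accidents per 100 000 flight hours")
print(f"{rate_per_100k_movements:.2f} accidents per 100 000 movements")
```

The choice of exposure base (hours, movements, years) changes the numerical value, which is why the indicator must match the application in which it will be used.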
The safety performance indicator(s) chosen to express global, regional and national targets will not
generally be appropriate for application to individual organizations. Since accidents are relatively rare
events, they do not provide a good indication of safety performance – especially at the local level.
Even at the global level, accident rates vary considerably from year to year. An increase or decrease
in accidents from one year to the next does not necessarily indicate a change in the underlying level
of safety.
The frequency of occurrence of certain types of incidents may provide a better indicator of the “safety
health” of the system. While these will not have resulted in an accident, they may reveal the existence
of weaknesses or flaws in the system which, given a slightly different set of circumstances, may have
contributed to an accident.
Another safety indicator might reflect the identification and reporting of hazards which may not yet
have resulted in either an accident or an incident. Such hazards may be identified through the safety
assessment of new or planned systems or operating procedures, through safety oversight programmes
or through occurrence reporting programmes (i.e. hazard or incident reporting systems).
Safety performance targets
Having decided on appropriate safety indicators, it is then necessary to decide on what represents an
acceptable outcome.
ICAO has set global safety performance targets in the objectives of the Global Aviation Safety Plan
(GASP). These are to:
a) reduce the number of accidents and fatalities, irrespective of the volume of air traffic; and
b) achieve a significant decrease in worldwide accident rates, placing the emphasis on regions
where these remain high.
The desired safety outcome may be expressed either in absolute or relative terms. ICAO’s global
targets are an example of relative targets. A relative target could also incorporate a desired percentage
reduction in accidents or particular types of safety occurrences in a defined time period.
In setting safety performance targets based on fatal aircraft accidents for use at the national or
regional level, use could be made of acceptability criteria for societal risks which have been
developed for other industries.
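A relative target expressed as a percentage reduction can be checked with simple arithmetic, as the following sketch illustrates; the baseline and observed counts are assumed values, not real data:

```python
# Illustrative check of a relative safety performance target:
# "reduce occurrences of a given type by 20% within the period".
# Baseline and observed counts are hypothetical.

baseline = 50          # occurrences in the reference period (assumed)
target_reduction = 0.20
observed = 38          # occurrences in the current period (assumed)

# Maximum number of occurrences consistent with meeting the target.
target_ceiling = baseline * (1 - target_reduction)
achieved = observed <= target_ceiling

reduction = (baseline - observed) / baseline
print(f"Reduction achieved: {reduction:.0%} (target {target_reduction:.0%})")
print("Target met" if achieved else "Target not met")
```

Because occurrence counts at the local level are small, such comparisons should be interpreted over sufficiently long periods to avoid reacting to normal year-to-year variation.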
Safety Management Systems (SMS)
As stated at the outset, management of safety within an organization may embrace a broad range of
activities aimed at fulfilling the organization’s safety objectives. In managing safety in such large
systems, these diverse safety activities must be coordinated and integrated into a coherent safety
programme. Safety management systems (SMS) are created to integrate the organization’s policies,
procedures, responsibilities, organizational structures and safety activities in order to meet its safety
objectives.
Safety Management Systems have particular application in organizations devoted to flight operations,
aircraft engineering and maintenance, air traffic services and aerodrome operations. They help to support
several of the broad functions (outlined above) across the entire organization. Their principal value is in
the identification of both actual and potential systemic safety hazards. They are also useful in the
determination of adequate remedial action to maintain safety at an acceptable level of risk. In addition,
they facilitate continuous safety performance monitoring across the system.
In creating an SMS, the “system” to be managed must be carefully defined. An SMS provides a logical
means for establishing the boundaries and defining the objectives of the system. All the system’s various
components can be systematically reviewed, including the interactions among people, procedures, tools,
materials, equipment, facilities, software and the environment. All the necessary data to support the safety
analyses can be compiled, including the impact of applicable regulations, documented goals, objectives,
and performance specifications.
An effective SMS is becoming as vital to the survival of an aviation business as is an effective financial
management system. An SMS creates the corporate culture in which employees at all levels practice
safety in their assigned responsibilities and it makes safety “everybody’s business”, albeit with explicit
objectives and lines of accountability. Consistent with the systems approach to safety, an SMS must
include accountability for those suppliers, sub-contractors and business partners with the potential to
affect the company’s safety performance. The safety culture will mature as exposure to and experience
with safety management increases.
Generally the requirements, procedures and practices of an SMS are considered under the following
headings:
a) The organization’s safety policy;
b) The core safety management activities which are:
• safety monitoring;
• safety assessment;
• safety auditing;
• safety promotion; and
c) Its support elements, including:
• the organizational structure;
• the role of the safety manager;
• safety responsibilities and accountabilities; and
• the training and competency of personnel.
The relationship between these various components of a safety management system is illustrated in Figure 5-3 below.
[Figure 5-3 illustrates the concept of a safety management system: the philosophy of safety management and the safety culture underpin the safety policy; the core activities of safety monitoring, safety assessment, safety auditing and safety promotion, supported by the organizational requirements, deliver the maintenance or improvement of safety performance.]

Figure 5-3. The Concept of a Safety Management System
Each of these core safety management activities and their supporting elements is addressed in this
Manual. In particular, Part 3 provides a broader discussion of the concepts underlying safety management
systems and guidance on implementing and operating an effective safety management system.
Threat and Error Management (TEM)
Threat and error management (TEM) is an evolving concept closely tied to effective safety management. TEM
builds upon a positive safety culture to provide a bridge between aviation operations and human
performance. The most important factor influencing human performance in dynamic work environments
is the real-world operational context (i.e., organizational, regulatory and environmental factors, etc.)
within which people discharge their operational duties. Since operational challenges and complexities
directly affect safety, TEM aims to provide a principled approach to the examination of this operational
context of human performance.
TEM provides a conceptual framework for understanding, from an operational perspective, the interrelationships between safety and human performance in dynamic operational contexts. TEM facilitates
both the understanding of human and system performance and the quantification of the complexities of the
operational context. Potential areas of application for TEM include:
a) Investigating single accidents or incidents;
b) Understanding systemic patterns within a large set of events (such as operational audits);
c) Defining required competencies for actual operating contexts for licensing purposes;
d) Defining operational training requirements; etc.
The TEM framework considers three basic components:
a) Threats, which are circumstances beyond the control of operational personnel that complicate the
fulfilment of their tasks (such as weather, traffic or equipment serviceability);
b) Errors, which are actions or inactions by operational personnel that lead to deviations from expected
performance (such as slips, lapses and mistakes); and
c) Undesired states, which are operational conditions where an unintended situation results in a reduction
in the margins of safety (such as climbing above an assigned altitude).
While these unsafe conditions may not always lead to an accident or incident, they require careful
management to remain within accepted margins of safety. A TEM model has evolved for the management
of such unsafe conditions. Although originally developed for flight deck operations, the TEM model
offers a conceptual basis for use across most functions in aviation with potential for impacting on
operational safety. The TEM model is described further in the précis at Appendix 2 to this Chapter.
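When coding events from audits or occurrence reports, the three TEM components can be treated as a simple classification scheme. The following sketch illustrates the idea; the event keywords are a hypothetical, much-reduced subset of the taxonomy in the tables of Appendix 2:

```python
# Minimal sketch of coding observed events into the three TEM
# components. The event list and keyword mapping are hypothetical
# and far coarser than the taxonomy in Appendix 2 to this chapter.

TEM_COMPONENTS = {
    "threat": {"thunderstorm", "traffic congestion", "short runway"},
    "error": {"missed callout", "wrong mode selected", "misheard clearance"},
    "undesired_state": {"unstable approach", "altitude deviation"},
}

def classify(event):
    """Return the TEM component an observed event belongs to, if known."""
    for component, examples in TEM_COMPONENTS.items():
        if event in examples:
            return component
    return "unclassified"

observed = ["thunderstorm", "missed callout", "unstable approach"]
print([(e, classify(e)) for e in observed])
```

Coding events consistently in this way is what allows systemic patterns to be identified across a large set of events, as in operational audits such as LOSA.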
Figure 5-4 provides a simplified graphic summary of the Threat and Error Management model.14
Figure 5-4. Simplified Version of Threat Error Management Model
————————
14
The full version of the TEM model can be found in the Human Factors Training Manual (Doc 9683) and in the ICAO Line
Operations Safety Audit (LOSA) Manual (Doc 9803).
Appendix 1 to Chapter 5
THREE CORNERSTONES OF SAFETY MANAGEMENT 15
Effective safety management systems comprise three defining cornerstones. The characteristics for each
are outlined below:
a) A comprehensive corporate approach to safety which provides for such things as:
— Ultimate accountability for corporate safety is assigned to the Board of Directors and
Chief Executive Officer (CEO) with evidence of corporate commitment to safety from
the highest organizational levels;
— A clearly enunciated safety philosophy, with supporting corporate policies, including a
non-punitive policy for disciplinary matters;
— Corporate safety goals, with a management plan for meeting these goals;
— Well defined roles and responsibilities with specific accountabilities for safety published
and available to all personnel involved in safety;
— A requirement for an independent safety manager;
— Demonstrable evidence of a positive safety culture throughout the organization;
— Commitment to a safety oversight process which is independent of line management;
— A system of documentation of those business policies, principles, procedures and
practices with safety implications;
— Regular review of safety improvement plans; and
— Formal safety review processes.
b) Effective organizational tools for delivering on safety standards through such activities as:
— Risk-based resource allocation;
— Effective selection, recruitment, development and training of personnel;
— Implementation of Standard Operating Procedures (SOPs) developed in cooperation with
affected personnel;
— Corporate definition of specific competencies (and safety training requirements) for all
personnel with duties relating to safety performance;
15
CAP 712 Op. Cit.
— Defined standards for, and auditing of, asset purchases and contracted services;
— Controls for the early detection of - and action on - any deterioration in the performance
of safety-significant equipment, systems or services;
— Controls for monitoring and recording the overall safety standards of the organization;
— The application of appropriate hazard identification, risk assessment and effective
management of resources to control identified risks;
— Provision for the management of major changes in such areas as the introduction of new
equipment, procedures or types of operation, turnover of key personnel, mass layoffs or
rapid expansion, mergers and acquisitions;
— Arrangements enabling staff to communicate significant safety concerns to the
appropriate level of management for resolution and feedback on actions taken;
— Emergency response planning and simulated exercises to test the plan’s effectiveness;
and
— Assessment of commercial policies with regard to their impact on safety.
c) A formal system for safety oversight with such desirable elements as:
— A system for analysing flight recorder data for the purpose of monitoring flight
operations and for detecting unreported safety events;
— An organization-wide system for the capture of reports on safety events or unsafe
conditions;
— A planned and comprehensive safety audit review system which has the flexibility to
focus on specific safety concerns as they arise;
— A system for the conduct of internal safety investigations, the implementation of remedial
actions and the dissemination of such information to all affected personnel;
— Systems for the effective use of safety data for performance analysis and for monitoring
organizational change as part of the risk management process;
— Systematic review and assimilation of best safety practices from other operations;
— Periodic review of the continued effectiveness of the safety management system by an
independent body;
— Line managers’ monitoring of work in progress in all safety critical activities to confirm
compliance with all regulatory requirements, company standards and procedures, with
particular attention to local practices;
— A comprehensive system for documenting all applicable aviation safety regulations,
corporate policies, safety goals, standards, SOPs, safety reports of all kinds, etc. and for
making such documentation readily available for all affected personnel; and
— Arrangements for ongoing safety promotion based on measured internal safety
performance.
————————
Appendix 2 to Chapter 5
Précis on the
THREAT ERROR MANAGEMENT MODEL
Threat error management is a pro-active method of identifying and dealing with developing unsafe
conditions in a dynamic operational situation, thereby preserving accepted margins of safety. As threats to
safety develop, as normal errors are committed by operational personnel and as the operational situation
degenerates into an undesired state, countermeasures are taken to contain the situation. However, failure
to deal with the situation appropriately may result in an incident or even an accident.
TEM components
There are three basic components of the TEM model: threats, errors and undesired states. An acceptable
level of safety can only be maintained by managing each of these components.
Threats may be defined as “events or errors that occur beyond the influence of the operational
personnel (flight crews, air traffic controllers, etc.) which increase operational complexity, and which
must be managed to maintain the margins of safety”.
Threats may involve such operational complexities as degraded equipment capabilities, meteorology,
topography, airspace congestion, etc. These threats must be managed because they all have the
potential to negatively affect operations by reducing margins of safety. The degree of threat warning
available may dictate the most appropriate action to manage the situation. For example:
• Flight crews and air traffic controllers can prepare their response to expected thunderstorm activity;
• Skills and knowledge acquired through training and operational experience must be applied in coping with unexpected threats such as sudden equipment malfunctions; and
• Safety analysis may reveal less obvious threats (such as equipment design issues, optical illusions or shortened turn-around schedules) which have the potential to compromise operations.
Table 5-2 below presents examples of threats, grouped under two basic categories. The dynamic
environment in which operations take place creates both expected and unexpected environmental
threats which must be managed in real time. Organizational threats are usually latent in nature. Once
identified they can be controlled (i.e. reduced or eliminated) at source by the organization. However,
operational personnel remain the last line of defence if the organizational defences are inadequate.
Table 5-2. Examples of threats (List not inclusive)

Environmental Threats
— Weather: thunderstorms, turbulence, icing, wind shear, cross/tailwind, very low/high temperatures.
— ATC: traffic congestion, TCAS RA, ATC command, ATC error, ATC language difficulty, ATC non-standard phraseology, ATC runway change, ATIS communication, units of measurement (QFE/meters).
— Airport: contaminated/short runway, contaminated taxiway, lack of/confusing/faded signage/markings, birds, aids U/S, complex surface navigation procedures, airport constructions.
— Terrain: high ground, slope, lack of references, “black hole”.
— Other: similar call-signs.

Organizational Threats
— Operational pressure: delays, late arrivals, equipment changes.
— Aircraft: aircraft malfunction, automation event/anomaly, MEL/CDL.
— Cabin: flight attendant error, cabin event distraction, interruption, cabin door security.
— Maintenance: maintenance event/error.
— Ground: ground handling event, de-icing, ground crew error.
— Dispatch: dispatch paperwork event/error.
— Documentation: manual error, chart error.
— Other: crew scheduling event.
Errors are “actions or inactions by operational personnel that lead to deviations from their intended
or their organization’s expected performance”.
Unmanaged and/or mismanaged errors frequently lead to undesired states, thereby reducing the
margins of safety and increasing the probability of adverse events. Examples of errors include the
failure to maintain stabilized approach parameters, selection of an inappropriate equipment mode,
failure to make a required callout, or misunderstanding an ATC clearance. (Chapter 4 (Understanding
Safety) includes a discussion of human errors.)
Avoiding a potentially unsafe condition is dependent upon the detection and response to the error by
front-line personnel. TEM focuses on this detection and response process. Errors that are properly
managed in a timely manner become operationally inconsequential. On the other hand, mismanaged
errors induce additional errors or undesired states.
Table 5-3 presents examples of errors, grouped under three basic categories: aircraft handling errors,
procedural errors and communications errors. Errors are classified based upon the personnel primarily
involved in handling the equipment, executing the procedure or making the communication at the
moment the error is committed.
Errors may be unintentional or involve intentional non-compliance. Proficiency considerations (i.e.,
skill or knowledge deficiencies, training system deficiencies) may underlie many errors.
Table 5-3. Examples of flight crew errors (List not inclusive)

Aircraft handling errors
— Manual handling/flight controls: vertical/lateral and/or speed deviations, incorrect flaps/speedbrakes, thrust reverser or power settings.
— Automation: incorrect altitude, speed, heading or autothrottle settings, incorrect mode executed, or incorrect entries.
— Systems/radio/instruments: incorrect packs, incorrect anti-icing, incorrect altimeter, incorrect fuel switch settings, incorrect speed bug, incorrect radio frequency.
— Ground navigation: attempting to turn down wrong taxiway/runway, taxiing too fast, failure to hold short, missed taxiway/runway.

Procedural errors
— SOPs: failure to cross-verify automation inputs.
— Checklists: wrong challenge and response; items missed; checklist performed late or at the wrong time.
— Callouts: omitted/incorrect callouts.
— Briefings: omitted briefings; items missed.
— Documentation: wrong weight and balance, fuel information, ATIS or clearance information recorded; misinterpreted items on paperwork; incorrect logbook entries; incorrect application of MEL procedures.

Communication errors
— Crew to external: missed calls, misinterpretations of instructions, incorrect read-back, wrong clearance, taxiway, gate or runway communicated.
— Pilot to pilot: inter-crew miscommunication or misinterpretation.
Undesired states are "operational conditions where an unintended situation results in a reduction in
margins of safety". Undesired states that result from ineffective threat and/or error management may
lead to compromising situations and reduce margins of safety in operations. Because an undesired state
is often the last stage before an incident or accident, it is essential to manage undesired states
effectively, restoring desired margins of safety. Examples of undesired states include an aircraft
climbing or descending to a level other than that intended, or an aircraft lining up for the incorrect
runway during approach and landing.
Table 5-4 presents examples of undesired aircraft states, grouped under three basic categories: aircraft
handling, ground navigation and aircraft configuration. Similar categories could be applied to other
areas of operations, such as air traffic control.
Table 5-4. Examples of undesired aircraft states (list not inclusive)

Aircraft handling:
- Aircraft control (attitude).
- Vertical, lateral or speed deviations.
- Unnecessary weather penetration.
- Unauthorized airspace penetration.
- Operation outside aircraft limitations.
- Unstable approach.
- Continued landing after unstable approach.
- Long, floated, firm or off-centreline landing.

Ground navigation:
- Proceeding towards wrong taxiway/runway.
- Wrong taxiway, ramp, gate or hold spot.

Incorrect aircraft configurations:
- Incorrect systems configuration.
- Incorrect flight controls configuration.
- Incorrect automation configuration.
- Incorrect engine configuration.
- Incorrect weight and balance configuration.
Undesired states are different from outcomes. Undesired states are transitional states between a
normal operational state (such as a stabilized approach) and an outcome or end-state such as a
reportable occurrence. For example: a stabilized approach (normal operational state) turns into an
unstabilized approach (undesired aircraft state) that results in a runway excursion (outcome). As long
as the undesired state can be managed so as to recover normal operations, margins of safety can be
preserved. Once the undesired state progresses to an outcome, restoration of normal margins of safety
is not possible.
Countermeasures
As part of the normal discharge of their duties, personnel involved in line operations must employ
countermeasures to keep threats, errors and undesired states from reducing margins of safety. Some
countermeasures are based strictly on human performance; others require the application of skills and
knowledge in conjunction with the use of equipment. Enhanced TEM draws upon both systemic and
individual/team countermeasures.

Applying countermeasures to preserve margins of safety during operations requires significant amounts of
time and energy. For example, empirical evidence gathered through observations during training and
check flights suggests that as much as 70 per cent of flight crew activities may be related to measures for
protecting against threats and errors. Examples include the use of checklists, briefings, call-outs
and SOPs, as well as personal strategies and tactics.
Although all countermeasures are initiated by operational personnel (such as flight crews and air traffic
controllers), their response may build upon resources provided by the aviation system. These resources
are already in place in the system before operations begin. The following are some examples of systemic
countermeasures:

a) Standard operating procedures (SOPs);
b) Checklists;
c) Briefings;
d) Airborne Collision Avoidance System (ACAS);
e) Ground Proximity Warning System (GPWS); and
f) Training.
Other countermeasures rely more on the skills, knowledge and attitudes developed through human
performance training and operational experience. These involve personal strategies and tactics as well as
team countermeasures, such as Crew or Team Resource Management (CRM or TRM). There are
basically three categories of individual and team countermeasures:
1) Planning countermeasures: essential for managing anticipated and unexpected threats;
2) Execution countermeasures: essential for error detection and error response;
3) Review countermeasures: essential for managing the changing conditions of a flight.
Table 5–5 below presents examples of individual / team countermeasures for each category.16
Table 5-5. Examples of individual and team countermeasures

Planning countermeasures:
- SOP BRIEFING: The required briefing was interactive and operationally thorough (concise, not rushed, and met SOP requirements; bottom lines were established).
- PLANS STATED: Operational plans and decisions were communicated and acknowledged (shared understanding about plans: "Everybody on the same page").
- WORKLOAD ASSIGNMENT: Roles and responsibilities were defined for normal and non-normal situations (workload assignments were communicated and acknowledged).
- CONTINGENCY MANAGEMENT: Crew members developed effective strategies to manage threats to safety (threats and their consequences were anticipated; all available resources were used to manage threats).

16 Further guidance on countermeasures can be found in the sample assessment guides for terminal training objectives (PANS TRG, ------) as well as in the ICAO Line Operations Safety Audit (LOSA) Manual (Doc 9803).
Execution countermeasures:
- MONITOR/CROSS-CHECK: Crew members actively monitored and cross-checked systems and other crew members (aircraft position, settings, and crew actions were verified).
- WORKLOAD MANAGEMENT: Operational tasks were prioritized and properly managed to handle primary flight duties (avoided task fixation; did not allow work overload).
- AUTOMATION MANAGEMENT: Automation was properly managed to balance situational and/or workload requirements (automation set-up was briefed to other members; effective recovery techniques from automation anomalies).

Review countermeasures:
- EVALUATION/MODIFICATION OF PLANS: Existing plans were reviewed and modified when necessary (crew decisions and actions were openly analyzed to make sure the existing plan was the best plan).
- INQUIRY: Crew members asked questions to investigate and/or clarify current plans of action (crew members were not afraid to express a lack of knowledge: a "nothing taken for granted" attitude).
- ASSERTIVENESS: Crew members stated critical information and/or solutions with appropriate persistence (crew members spoke up without hesitation).
Chapter 6
RISK MANAGEMENT
Outline

General
Hazard Identification
Risk Assessment
– Problem definition
– Probability of adverse consequences
– Severity of consequences of occurrence
– Risk acceptability
Risk Mitigation
– Defence analysis
– Risk mitigation strategies
– Brainstorming
– Evaluating risk mitigation options
Risk Communication
Risk Management Considerations for State Administrations
– Occasions warranting risk management by State
– Communications by State administrations
– Benefits of risk management by State administrations
Chapter 6
RISK MANAGEMENT
Risk management serves to focus safety efforts on those hazards
posing the greatest risks.
General
The aviation industry faces a diversity of risks every day, many capable of compromising the viability of
an operator and some even posing a threat to the industry. Indeed, risk is a by-product of doing business.
Not all risks can be eliminated, nor are all conceivable risk mitigation measures economically feasible.
The risks and costs inherent in aviation necessitate a rational process for decision-making. Daily,
decisions are made in real time, weighing the probability and severity of any adverse consequences
implied by a risk against the expected gain of taking it. This process is known as risk
management. For the purposes of this manual, risk management can be defined as:
Risk Management: the identification, analysis and elimination (and/or mitigation to an acceptable
level) of those hazards, as well as the subsequent risks that threaten the viability of an organization.
In other words, risk management facilitates the balancing act between assessed risks and viable risk
mitigation. Risk management is an integral component of safety management. It involves a logical
process of objective analysis, particularly in the evaluation of the risks. However, in striving for the
highest level of safety, it must be accepted that absolute safety is unachievable.
An overview of the process for Risk Management is summarized in the flow chart at Figure 6-1 below.
HAZARD IDENTIFICATION: Identify the hazards to equipment, property, personnel or the organization.

RISK ASSESSMENT (Severity/Criticality): Evaluate the seriousness of the consequences of the hazard occurring.

RISK ASSESSMENT (Probability of Occurrence): What are the chances of it happening?

RISK ASSESSMENT (Acceptability): Is the consequent risk acceptable and within the organization's safety performance criteria?
– If YES, accept the risk.
– If NO, take action to reduce the risk to an acceptable level (RISK MITIGATION).

Figure 6-1. Risk management process
As Figure 6-1 indicates, risk management comprises three essential elements: hazard identification, risk
assessment and risk mitigation. The concepts of risk management have equal application in decision-making
by management in flight operations, air traffic control, maintenance, airport management,
manufacturing (e.g. the life of components) and State administrations.
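The essential flow of Figure 6-1 can be sketched in code. This is a minimal illustration, assuming a simple hazard register and a made-up acceptability threshold; the function names and criteria are not taken from this manual, and real criteria would come from the organization's safety performance standards:

```python
# Illustrative sketch of the Figure 6-1 loop: identify hazards, assess each
# risk, then either accept it or act to reduce it. The hazard register
# format and the acceptability test are assumptions for demonstration.

def identify_hazards(system):
    # In practice: safety events, surveys, audits, flight data analysis, etc.
    return system["reported_hazards"]

def is_acceptable(severity, probability):
    # Placeholder for the organization's safety performance criteria.
    return severity * probability <= 4

def manage_risks(system):
    actions = {}
    for hazard, (severity, probability) in identify_hazards(system).items():
        if is_acceptable(severity, probability):
            actions[hazard] = "accept"
        else:
            actions[hazard] = "mitigate to an acceptable level"
    return actions

example = {"reported_hazards": {
    "runway incursion at taxiway B": (5, 2),  # hypothetical entries
    "minor ramp FOD": (2, 2),
}}
print(manage_risks(example))
```

The loop structure mirrors the flow chart: assessment feeds an accept/mitigate decision, and mitigated hazards would be re-assessed on the next pass.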
HAZARD IDENTIFICATION
The concept of hazard identification was introduced in Chapter 5 (Basics of Safety Management). Given
that a hazard may involve any situation or condition that has the potential to cause adverse consequences,
the scope for hazards in aviation is wide. For example:
a) Design factors, including equipment and task design;
b) Procedures and operating practices, including their documentation and checklists, and their
validation under actual operating conditions;
c) Communications, including the most suitable medium, terminology and such barriers to effective
communications as language;
d) Personnel factors, such as company policies for recruitment, training, crew scheduling and
remuneration;
e) Organizational factors, such as the compatibility of production and safety goals, the allocation of
resources, operating pressures, effectiveness of internal communications and the corporate safety
culture;
f) Work environment factors, such as ambient noise and vibration, temperature, lighting and the
availability of protective equipment and clothing;
g) Regulatory oversight factors, including the applicability and enforceability of regulations; the
certification of equipment, personnel and procedures; and the adequacy of surveillance audits;
and
h) Defences, including such factors as the provision of adequate detection and warning systems, the
error tolerance of equipment and the extent to which the equipment is hardened against failures.
As seen earlier, hazards may be recognized through actual safety events (accidents or incidents), or they
may be identified through proactive processes aimed at identifying hazards before they precipitate an
occurrence. In practice, both reactive measures and proactive processes provide an effective means of
identifying hazards for accident prevention.
Reported safety events are clear evidence of problems in the system and therefore provide an opportunity
to learn valuable safety lessons. Safety events should be investigated to identify the hazards
putting the system at risk. This involves investigating all the factors, including the organizational and
human factors that played a role in the event. Guidance for investigating safety events for accident
prevention is included in Chapter 8 (Safety Investigations).
Organizations with safety management systems will also use proactive processes for the identification of
hazards. Methods of identifying hazards proactively include:
a) Safety assessments;
b) Trend monitoring and analysis of safety data;
c) Analysis of incident reports, maintenance service difficulty reports, etc;
d) Analysis of operational data from flight recorders, line operational safety audits, etc.;
e) Safety surveys of employees; and
f) Safety inspections and audits.
Several proactive methods of hazard identification are discussed in Part 4 – Applied Safety Management.

In a mature safety management system, hazard identification should arise from several sources as an
ongoing process. However, there are times in an organization's life when special attention to hazard
identification is warranted on a system-wide basis. Safety assessments (which are discussed in
Chapter 13) provide a structured and systemic process for hazard identification when:
a) There is an unexplained increase in safety-related events or safety infractions;
b) Major operational changes are planned, including changes to key personnel, routes, aircraft types
and other major equipment or systems, etc.;
c) The organization is undergoing significant change, such as rapid growth or contraction; or
d) Corporate mergers, acquisitions or downsizing are planned.
Having identified a hazard suspected of posing a risk to the organization, the characteristics of the unsafe
condition need to be defined. This requires considering factors such as:
a) The extent to which the hazard exists in the organization (or aviation system): for example, is it
limited to a particular group of personnel, type of equipment, operation or location?
b) The adequacy of existing hazard control measures (or defences); and
c) The extent to which the identified hazard is already being addressed.
Such questions help define the characteristics of the problem and reduce the likelihood of initiating
premature or inappropriate corrective action.
RISK ASSESSMENT
Having confirmed the presence of a safety hazard, some form of analysis is required to assess its potential
for harm or damage. Typically, this assessment of the hazard involves three considerations:
a) The probability of the hazard precipitating an unsafe event (i.e. the probability of adverse
consequences should the underlying unsafe conditions be allowed to persist);
b) The severity of the potential adverse consequences, or the outcome of such an unsafe event; and
c) The rate of exposure to the hazards. The probability of adverse consequences increases through
increased exposure to the unsafe conditions (thus, exposure may be viewed as another dimension
of probability).
Risk is the assessed potential for adverse consequences resulting from a hazard. It is the probability that
during a defined period of activity, the hazard will result in an accident with definable consequences. It is
the likelihood that the hazard’s potential to cause harm will be realized.
Risk assessment involves consideration of both the probability and the severity of any adverse
consequences; in other words, the loss potential is determined. In carrying out risk assessments, it is
important to distinguish between hazards (the potential to cause harm) and risk (the likelihood of that
harm being realized in a specified period of time). A risk assessment matrix (such as that provided in
Table 6-1 below) is a useful tool for prioritizing the hazards most warranting attention.
There are many ways to approach the analytical aspects of risk assessment — some more formal than
others. For some risks, the number of variables and the availability of both suitable data and mathematical
models may lead to credible results with quantitative methods (requiring mathematical analysis of
specific data). However, few hazards in aviation lend themselves to credible analysis solely through
numerical methods. Typically, these analyses are supplemented qualitatively through critical and logical
analysis of the known facts and their relationships.
Considerable literature is available on the types of analysis used in risk assessment. Chapter 9 (Safety
Analysis and Studies) includes some common analytical methods, including statistical analytical
techniques and analytical tools for assessing potential hazards. For the risk assessments discussed in this
manual, sophisticated methods are not required; a basic understanding of a few methods will suffice.
Whatever methods are used, there are various ways by which risks may be expressed. For example:
a) Number of deaths, loss of revenue, or loss of market share (i.e. absolute numbers);
b) Loss rates (e.g. number of fatalities per 1,000,000 seat miles flown);
c) Probability of serious accidents (e.g. 1 every 50 years);
d) Severity of outcomes (for example, injury severity); and
e) Expected dollar value of losses vs. annual operating revenue (e.g. $1 million loss per $200
million revenue).
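A brief worked example of item b) above, computing a loss rate per million seat-miles; the traffic and fatality figures below are invented purely for illustration:

```python
# Expressing risk as a loss rate (item b): fatalities per million
# seat-miles flown. All figures are hypothetical.

fatalities = 3
seat_miles_flown = 1_500_000_000  # 1.5 billion seat-miles in the period

rate_per_million = fatalities / (seat_miles_flown / 1_000_000)
print(f"{rate_per_million:.3f} fatalities per million seat-miles")
```

Normalizing losses by exposure in this way is what allows risks to be compared across fleets or periods of very different sizes.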
Problem definition
In any analytical process, the problem must first be defined. Even after a hazard has been perceived,
defining its characteristics as a problem for resolution is not always easy. People from
different backgrounds and experience will likely view the same evidence from different perspectives.
Views on what poses a significant risk will reflect those different backgrounds, exacerbated by
normal human biases. Thus, engineers will tend to see problems in terms of engineering deficiencies,
medical doctors in terms of medical deficiencies, psychologists in terms of behavioural problems, and so
on. The anecdote in the following box exemplifies the multifaceted nature of defining a problem.
Charlie’s Accident
Charlie has an emotional argument with his wife and proceeds to the local bar where he
consumes several drinks. He departs the bar in his car at high speed. Minutes later, he loses
control on the highway and is fatally injured. We know WHAT happened; we must now determine
WHY it happened.
The investigation team comprises six specialists, each of whom has a completely different
perspective on the root safety deficiency.
The sociologist identifies a breakdown in interpersonal communications within the marriage. An
enforcement officer from the Liquor Control Board notes the illegal sale of alcoholic beverages by
the bar on a "two for one" basis. The pathologist determines that Charlie's blood alcohol was in
excess of the legal limit. The highway engineer finds inadequate road banking and protective
barriers for the posted speed. An automotive engineer determines that Charlie's car had a loose
front end and bald tires. The policeman determines that the automobile was traveling at
excessive speed for the prevailing conditions.
Each of these different perspectives may result in a different definition of the underlying hazard.
Any or all of the factors cited in this example may be valid, underlining the nature of multi-causality.
How the safety issue is defined, however, will affect the course of action taken to reduce or eliminate
the hazards. In assessing the risks, all potentially valid perspectives must be evaluated and only the
most suitable pursued.
Probability of adverse consequences
Regardless of the analytical methods used, the probability of causing harm or damage must be assessed.
This probability will depend on answers to such questions as:
a) Is there a history of occurrences like this, or is this an isolated occurrence?
b) How many other aircraft, or components of this type, might have similar defects?
c) How many operating or maintenance personnel are following, or are subject to, the procedures in
question?
d) What percentage of the time is the suspect equipment, or the questionable procedure, in use? and
e) To what extent are there organizational, management or regulatory implications that might reflect
larger threats to public safety?
Based on such considerations, the likelihood of an event occurring can be assessed. For example:
a) Unlikely to occur. Failures that are “unlikely to occur” include isolated occurrences, and risks
where the exposure rate is very low or the fleet size is small. The complexity of the
circumstances necessary to create an accident situation may be such that it is unlikely such a
chain of events will arise again. For example, it is unlikely that independent systems would
fail concurrently. However, even if the possibility is only remote, the consequences of such
concurrent failures may warrant follow-up.
There is a natural tendency to attribute unlikely events to “coincidence”. Caution is advised.
While coincidence may be statistically feasible, coincidence must not be used as an excuse
for the absence of due analysis.
b) May occur. Failures that “may occur” derive from hazards with a reasonable probability that
similar patterns of human performance can be expected under similar working conditions, or
that the same material defects exist elsewhere in the system.
c) Probably will occur. Such occurrences reflect a pattern (or potential pattern) of material
failures that have not yet been rectified. Given the design or maintenance of the equipment,
its strength under known operating conditions, etc., continued operations will likely lead to
failure. Similarly, given the empirical evidence on some aspect of human performance, it can
be expected with some certainty that normal individuals, operating under similar working
conditions, would likely commit the same errors or be subject to the same undesirable
performance outcome.
Severity of consequences of occurrence
Having determined the probability of occurrence, the nature of the adverse consequences if the event does
occur must be assessed. The potential consequences govern the degree of urgency attached to the safety
action required. If there is significant risk of catastrophic consequences, or if the risk of serious injury,
property or environmental damage is high, urgent follow-up action is warranted. In assessing the severity
of the consequences of occurrence, the following types of questions could apply:
a) How many lives are at risk? Employees, fare-paying passengers, bystanders or the general
public;
b) What is the likely extent of property or financial damage? Direct property loss to the operator,
damage to aviation infrastructure, third-party collateral damage, financial and economic
impact for the State;
c) What is the likelihood of environmental impact? Spill of fuel or other hazardous products, and
physical disruption of natural habitat; and
d) What are the likely political implications and/or media interest?
Risk acceptability
Based on the risk assessment, the risks can be prioritized relative to other, unresolved safety hazards. This
is critical in making rational decisions to allocate limited resources against those hazards posing the
greatest risks to the organization.
Prioritizing risks requires a rational basis for ranking one risk vis à vis other risks. Criteria or standards
are required to define what is an acceptable risk and what is an unacceptable risk. By weighing the
likelihood of an undesirable outcome against the potential severity of that outcome, the risk can be
categorized within a risk assessment matrix. Many versions of risk assessment matrices are available in
the literature. While the terminology or definitions used for the different categories may vary, such
tables generally reflect the ideas summarized in Table 6-1 below:
Table 6-1. Risk assessment matrix17

Severity of consequences (aviation definitions):
- Catastrophic (value 5): Aircraft destroyed. Multiple deaths.
- Hazardous (value 4): A large reduction in safety margins, physical distress or a workload such that the crew cannot be relied upon to perform their tasks accurately or completely. Serious injury or death of a proportion of the occupants. Major aircraft damage.
- Major (value 3): A significant reduction in safety margins, a reduction in the ability of the flight crew to cope with adverse operating conditions as a result of an increase in workload or of conditions impairing their efficiency. Aircraft serious incident. Injury to occupants.
- Minor (value 2): Nuisance. Operating limitations. Use of emergency procedures. Aircraft minor incident.
- Negligible (value 1): Little consequence.

Likelihood of occurrence (qualitative definitions):
- Frequent (value 5): Likely to occur many times.
- Occasional (value 4): Likely to occur sometimes.
- Remote (value 3): Unlikely, but possible to occur.
- Improbable (value 2): Very unlikely to occur.
- Extremely improbable (value 1): Almost inconceivable that the event will occur.

17 Based on: An Introduction to Flight Safety Risk Assessment (paper by Richard Profit, UK CAA).

In this version of a risk assessment matrix:
a) Severity of risk is ranked as Catastrophic, Hazardous, Major, Minor or Negligible with a
descriptor for each indicating the potential severity of consequences. As mentioned earlier, other
definitions can be used, reflecting the nature of the operation being analysed;
b) Probability (or Likelihood) of occurrence is also ranked through five different levels of
qualitative definition and descriptors are provided for each likelihood of occurrence; and
c) Values may be assigned numerically, to weigh the relative importance of each level of severity
and probability. A composite assessment of risk, to assist in comparing risks, may then be derived
by multiplying the severity and probability values.
Having used a risk matrix to assign values to risks, a range of values may be assigned in order to
categorize risks as acceptable, undesirable or unacceptable.
a) Acceptable means that no further action needs to be taken, (unless the risk can be reduced further
at little cost or effort).
b) Undesirable (or tolerable) means that the affected persons are prepared to live with the risk in
order to have certain benefits, on the understanding that the risk is being mitigated as well as
possible.
c) Unacceptable means that operations under the current conditions must cease until the risk is
reduced to at least the Tolerable level.
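The numeric scheme described above (a composite value from severity and probability, mapped to acceptability bands) can be made concrete in a short sketch. The band cut-offs (4 and 12) and the function name are illustrative assumptions; the manual leaves the actual value ranges to each organization:

```python
# Composite risk value (severity x likelihood) per Table 6-1, mapped to
# acceptable / undesirable / unacceptable bands. The cut-off values are
# illustrative only; each organization must set its own ranges.

SEVERITY = {"negligible": 1, "minor": 2, "major": 3,
            "hazardous": 4, "catastrophic": 5}
LIKELIHOOD = {"extremely improbable": 1, "improbable": 2, "remote": 3,
              "occasional": 4, "frequent": 5}

def assess(severity: str, likelihood: str) -> tuple[int, str]:
    value = SEVERITY[severity] * LIKELIHOOD[likelihood]  # ranges 1..25
    if value <= 4:
        band = "acceptable"
    elif value <= 12:
        band = "undesirable (tolerable)"
    else:
        band = "unacceptable"
    return value, band

print(assess("major", "occasional"))
print(assess("catastrophic", "remote"))
```

Multiplying the two values gives the composite assessment mentioned in item c) above; the banding step then turns that number into a prioritization decision.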
A less numeric approach to determining the acceptability of particular risks includes consideration of
such factors as:
a) Managerial. Is the risk consistent with the organization’s safety policy and standards?
b) Affordability. Does the nature of the risk defy cost-effective resolution?
c) Legal. Is the risk in conformance with current regulatory standards and enforcement capabilities?
d) Cultural. How will the organization’s personnel and other stakeholders view this risk?
e) Market. Will the company’s competitiveness and well-being vis-à-vis other companies be
compromised by not reducing or eliminating this risk?
f) Political. Will there be a political price to pay for not reducing or eliminating this risk?
g) Public. How influential will the media or special interest groups be in affecting public opinion
regarding this risk?
RISK MITIGATION
As stated earlier, where risk is concerned, there is no such thing as absolute safety. Risks have to be
managed to a level "as low as reasonably practicable" (ALARP18). This means that the risks must be
balanced against the time, cost and difficulty of taking measures to reduce or eliminate them.

18 From SRG SMS Policies and Guidelines Document: Safety Management Systems (Sponsor PS Griffith, Head A&ATSD), p. 7.
When the acceptability of the risk has been found to be Undesirable or Unacceptable, control measures
need to be introduced — the higher the risk, the greater the urgency. The level of risk can be reduced
either by reducing the severity of the potential consequences, by reducing the likelihood of occurrence, or
by reducing the exposure to that risk.
Safety action is required to mitigate the risks. Optimal solutions will vary, depending on the local
circumstances and exigencies. In formulating meaningful safety action, a good understanding of the
adequacy of existing defences is required.
Defence analysis19
A major component of any safety system is the defences put in place to protect people, property or the
environment. These defences can be used to:
a) Reduce the probability of unwanted events occurring; and
b) Reduce the severity of the consequences associated with any unwanted events.
Defences can be categorized into two types:
a) Physical defences include objects that discourage or prevent inappropriate action, or which
mitigate the consequences of events (for example, squat switches, switch covers, firewalls,
survival equipment, warnings and alarms); and
b) Administrative defences include procedures and practices that mitigate the probability of an
accident (for example, safety regulations, standard operating procedures, supervision and
inspection, and personal proficiency).
Before selecting appropriate risk mitigation strategies, it is important to understand why the existing
system of defences was inadequate. The following line of questioning may pertain:
a) Were defences provided to protect against such hazards?
b) Did the defences function as intended?
c) Were the defences practical for use under actual working conditions?
d) Were affected staff aware of the risks and the defences in place? and
e) Are additional risk mitigation measures required?
Risk mitigation strategies
There is a range of strategies available for risk mitigation. For example:
19 From TSB's Integrated Safety Investigation Methodology.
a) Exposure avoidance. The risky task, practice, operation or activity, is avoided because the risk
exceeds the benefits.
b) Loss reduction. Activities are taken to reduce the frequency of the unsafe events or the magnitude
of the consequences.
c) Segregation of exposure (separation or duplication). Action is taken to isolate the effects of the
risk or build in redundancy to protect against the risks, i.e. reduce the severity of the risk. For
example, protecting against collateral damage in the event of a material failure, or providing
back-up systems to reduce the likelihood of total system failure.
Brainstorming
Generating the ideas necessary to create suitable risk mitigation measures poses a challenge. Developing
risk mitigation measures frequently requires creativity, ingenuity and above all an open mind to consider
all possible solutions. The thinking of those closest to the problem (usually those with the most
experience) is often coloured by set ways and natural biases. Broad participation, including representatives of the
various stakeholders, tends to help overcome rigid mind-sets. Thinking “outside the box” is essential to
effective problem solving in a complex world. All new ideas should be weighed carefully before rejecting
any of them.
Evaluating risk mitigation options
Not all alternatives for risk mitigation have equal potential for reducing risks. The
effectiveness of each option needs to be evaluated before a decision can be taken. It is important that the
full range of possible control measures is considered and that trade-offs between measures are considered
to find an optimal solution. Each proposed risk mitigation option should be examined from such
perspectives as:
a) Effectiveness. Will it reduce or eliminate the identified risks? To what extent do alternatives
mitigate the risks? Effectiveness can be viewed as being somewhere along a continuum, as
follows:
1) Level One: (Engineering actions) The safety action will eliminate the risk, for example,
interlocks to prevent thrust reverser activation in-flight;
2) Level Two: (Control actions) The safety action accepts the risk but adjusts the system to
mitigate the risk by reducing it to a manageable level, such as, imposing more restrictive
operating conditions; and
3) Level Three: (Personnel actions) The actions taken accept that the hazard can neither be
eliminated (Level One) nor controlled (Level Two), so personnel must be taught how to
cope with it, such as, the addition of a warning, revised check list and extra training.
b) Cost / benefit. Do the perceived benefits of the option outweigh the costs? Will the potential
gains be proportional to the impact of the change required?
c) Practicality. Is it doable and appropriate in terms of available technology, financial feasibility,
administrative feasibility, governing legislation and regulations, political will, etc.?
6-13
Part 2
DRAFT
Chapter 6 - Risk Management
d) Challenge. Can the risk mitigation measure withstand critical scrutiny from all stakeholders?
(Employees, managers, stockholders/State administrations, etc.)
e) Acceptability to each stakeholder. How much buy-in (or resistance) from stakeholders can be
expected? (Discussions with stakeholders during the Risk Assessment phase may indicate their
preferred risk mitigation option.)
f) Enforceability. If new rules (SOPs, regulations, etc.) are implemented, are they enforceable?
g) Durability. Will the measure withstand the test of time? Will it be of temporary benefit or will it
have long-term utility?
h) Residual Risks. After the risk mitigation measure is implemented, what will be the residual risks
relative to the original hazard? What is the ability to mitigate any residual risks?
i) New problems. What new problems, or new (perhaps worse) risks, will be introduced by the
proposed change?
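As a rough illustration, the evaluation perspectives above can be combined in a simple weighted scoring exercise. This is a sketch only: the criteria weights, the 0–5 ratings and the option names below are invented for the example and are not values prescribed by this manual.

```python
# Illustrative sketch only: criteria weights and 0-5 ratings are assumed
# for the example, not prescribed by this manual.

CRITERIA_WEIGHTS = {
    "effectiveness": 3,   # a) does it reduce or eliminate the risk?
    "cost_benefit": 2,    # b) do benefits outweigh costs?
    "practicality": 2,    # c) technically and administratively doable?
    "acceptability": 1,   # e) expected stakeholder buy-in
    "durability": 1,      # g) long-term utility
}

def score_option(name, ratings):
    """Return (name, weighted total); higher totals suggest stronger options."""
    total = sum(CRITERIA_WEIGHTS[c] * ratings[c] for c in CRITERIA_WEIGHTS)
    return name, total

options = [
    score_option("Engineering fix (Level One)",
                 {"effectiveness": 5, "cost_benefit": 2, "practicality": 2,
                  "acceptability": 4, "durability": 5}),
    score_option("Operating restriction (Level Two)",
                 {"effectiveness": 3, "cost_benefit": 4, "practicality": 4,
                  "acceptability": 3, "durability": 3}),
    score_option("Extra training (Level Three)",
                 {"effectiveness": 2, "cost_benefit": 4, "practicality": 5,
                  "acceptability": 4, "durability": 2}),
]

# Rank the options; a high score is a prompt for discussion, not a decision.
for name, total in sorted(options, key=lambda o: o[1], reverse=True):
    print(f"{name}: {total}")
```

Such a table does not replace judgment; it simply makes the trade-offs between options, and the reasoning behind a choice, explicit and discussable.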
Obviously, preference should be given to corrective actions that will completely eliminate the risk.
Regrettably, such solutions are often the most expensive. At the other end of the spectrum, when
resources or organizational will are lacking, the problem is often deferred to the training
department to teach staff to cope with the risks. In such cases, management may be avoiding hard
decisions by delegating responsibility for the risk to subordinates.
RISK COMMUNICATION
Risk communication includes any exchange of information about risks, i.e. any public or private
communication that informs others about the existence, nature, form, severity or acceptability of risks.
The information needs of the following groups may require special attention:
a) Management must be apprised of all risks which present loss potential to the organization;
b) Those exposed to the identified risks must be apprised of their severity and likelihood of
occurrence;
c) Those who identified the hazard need feedback on action proposed;
d) Those affected by any planned changes need to be apprised of both the hazards and the rationale
for the action taken; and
e) Others with potential information needs regarding specific risks include:
1) Regulatory authorities;
2) Suppliers;
3) Industry associations; and
4) General public, etc.
Effective communication of the risks (and plans for their resolution) adds value to the Risk Management
process. The stakeholders can assist the decision-maker(s) if the risks are communicated early in a fair,
objective and understandable way. Thus, risk communications should be:
a) Two-way;
b) Factual concerning the risks, benefits and uncertainties, and take into account the needs and
concerns of the various stakeholders;
c) Tested for acceptability by the decision-maker(s) and stakeholders; and
d) Documented to reduce misunderstandings.
Failure to communicate the safety lessons learned in a clear and timely fashion will undermine
management’s credibility in promoting a positive safety culture. For safety messages to be credible, they
must be consistent with the facts, with previous statements from management and with the messages from
other authorities. These messages need to be framed in terms the stakeholders understand.
RISK MANAGEMENT CONSIDERATIONS FOR STATE ADMINISTRATIONS20
Risk management techniques have implications for State administrations in areas ranging from policy
development through to the “go/no-go” decisions confronting front-line State civil aviation inspectors.
For example:
a) Policy: To what extent should a State accept the certification paperwork of another State?
b) Regulatory Change: From the many (often-conflicting) recommendations made for regulatory
change, how are decisions made?
c) Priority Setting: How are decisions made for determining those areas of safety warranting
emphasis during safety oversight audits?
d) Operational Management: How are decisions made when insufficient resources are available to
carry out all planned activities?
e) Operational Inspections: At the front line, how are decisions made when critical errors are
discovered outside of normal working hours?
Occasions warranting risk management by State administrations
Some situations should alert State aviation administrations to the possible need for applying risk
management methods, for example:
a) Start-up or rapidly expanding companies;
b) Corporate mergers;
20. Adapted from TP 13095 Risk Management & Decision Making, Transport Canada.
c) Companies facing bankruptcy or other financial difficulties;
d) Companies facing serious labour-management difficulties;
e) Introduction of major new equipment by an operator;
f) Certification of a new aircraft type, new airport, etc.;
g) Introduction of new communication, navigation or surveillance equipment and procedures; and
h) Significant change to air regulations or other laws potentially impacting on aviation safety, etc.
Risk management by State administrations will be affected by such factors as:
a) Time available to make the decision (grounding an aircraft, revoking a certificate, etc.);
b) Resources available to effect the necessary actions;
c) Numbers of people affected by required actions (company-wide, fleet-wide, regional, national,
international, etc.);
d) The potential impact of the State’s decision for action (or inaction); and
e) Cultural and political will to take the action required.
Communications by State administrations
Effective communications by State administrations are an important part of the risk management process.
Establishing and maintaining good communications with stakeholders paves the way for effective
decision-making. Stakeholders’ perceptions about risk will often differ from those of the regulatory
authority.
Communicating the reasons for public safety decisions requires care because such decisions often arouse
strong emotions. Gaining acceptance of planned changes is more likely when key stakeholders are
involved in the decision-making process. Such action reflects the safety culture of the State
administration: is it one of management by decree, or one that values openness, respect, consultation,
collaboration and partnership?
Benefits of risk management for State administrations
Applying risk management techniques in decision-making offers benefits for State administrations,
including:
a) Avoiding costly mistakes during the decision-making process;
b) Ensuring all aspects of the risk are identified and considered when making decisions;
c) Ensuring the legitimate interests of affected stakeholders are considered;
d) Providing decision-makers with a solid defence in support of decisions;
e) Making decisions easier to explain to stakeholders and the general public; and
f) Providing significant savings in time and money.
____________________
Chapter 7
HAZARD AND INCIDENT REPORTING
Outline
Introduction to Reporting Systems
 Value of safety reporting systems
 ICAO requirements
Types of Incident Reporting Systems
 Mandatory incident reporting systems
 Voluntary incident reporting systems
 Confidential reporting systems
Principles of Effective Incident Reporting
 Trust
 Non-punitive
 Inclusive reporting base
 Independence
 Ease of reporting
 Acknowledgment
 Promotion
International Incident Reporting Systems
 ICAO Accident/Incident Reporting System (ADREP)
 European Co-ordination Centre for Aviation Incident Reporting Systems (ECCAIRS)
State Voluntary Incident Reporting Systems
 Aviation Safety Reporting System (ASRS)
 Confidential Human Factors Incident Reporting Programme (CHIRP)
Company Reporting Systems
Implementation of Incident Reporting Systems
 What to report?
 Who should report?
 System management
 Reporting method and format
 Limitations on the use of incident data
Appendix
1. Limitations on the use of data from voluntary incident reporting systems
Chapter 7
HAZARD AND INCIDENT REPORTING
INTRODUCTION TO REPORTING SYSTEMS
As outlined in previous chapters, safety management systems involve the reactive and proactive
identification of safety hazards. Accident investigations reveal a great deal about safety hazards, but
fortunately aviation accidents are rare events. They are, however, generally investigated more thoroughly
than incidents. When safety initiatives rely exclusively on accident data, the limitations of small samples
apply. As a result, the wrong conclusions may be drawn, or inappropriate corrective actions taken.
Research leading to the 1:600 Rule showed that the number of incidents is significantly greater than the
number of accidents for comparable types of occurrences. The causal and contributory factors associated
with incidents may also culminate in accidents. Often, only good fortune prevents an incident from
becoming an accident. Unfortunately, these incidents are not always known to those responsible for
reducing or eliminating the associated risks. This may be due to the unavailability of reporting systems, or
people not being sufficiently motivated to report incidents.
Value of safety reporting systems
Recognizing that knowledge derived from incidents could provide significant insights into safety hazards,
several types of incident reporting systems have been developed. Some safety databases contain a large
quantity of detailed information. Although these occurrences may not be investigated to any depth, the
anecdotal information they provide can offer meaningful insight into the perceptions and reactions of
pilots, cabin attendants, mechanics and air traffic controllers.
Safety reporting systems should not just be restricted to incidents, but should include provision for the
reporting of hazards, i.e. unsafe conditions which have not yet caused an incident. For example, some
organizations have programmes for reporting conditions deemed unsatisfactory from the perspective of
experienced personnel (Unsatisfactory Condition Reports for potential technical faults). In some States,
Service Difficulty Reporting (SDR) systems are effective in identifying airworthiness hazards.
Aggregating data from such hazard and incident reports provides a rich source of experience to support
other safety management activities.
Data from incident reporting systems can facilitate an understanding of the causes of hazards, help define
intervention strategies, and help verify the effectiveness of interventions. Depending on the depth to which they are
investigated, incidents can provide a unique means of obtaining first-hand evidence on the factors
associated with mishaps from the participants themselves. Reporters can describe the relationships
between stimuli and their actions. They may provide their interpretation of the effects of various factors
affecting their performance, such as fatigue, interpersonal interactions and distractions. Furthermore,
many reporters are able to offer valuable suggestions for remedial action. Incident data have also been
used to improve operating procedures, display and control design, and provide a better understanding of
human performance associated with the operation of aircraft and air traffic control.
ICAO requirements21
ICAO requires that States establish a mandatory incident reporting system to facilitate collection of
information on actual or potential safety deficiencies. In addition, States are encouraged to establish a
voluntary incident reporting system, adjusting their laws, regulations and policies so that the voluntary
programme:
a) Facilitates the collection of information that may not be captured by a mandatory incident
reporting system;
b) Is non-punitive; and
c) Affords protection to the sources of the information.
TYPES OF INCIDENT REPORTING SYSTEMS
In general, an incident involves an unsafe, or potentially unsafe, occurrence or condition that does not
involve serious personal injury or significant property damage; that is, it does not meet the criteria for an
accident, but could have. When an incident occurs, the individual(s) involved may or may not be required
to submit a report. The reporting requirements vary with the laws of the State where the incident
occurred. Even if not required by law, operators may require reporting of the occurrence to the company.
Mandatory incident reporting system
In a mandatory system, people are required to report certain types of incidents. This necessitates detailed
regulations outlining who shall report and what shall be reported. The number of variables in aircraft
operations is so great that it is difficult to provide a comprehensive list of items or conditions which
should be reported. For example, loss of a single hydraulic system on an aircraft with only one such
system is critical. On a type with three or four systems, it may not be. A relatively minor problem in one
set of circumstances can, in different circumstances, result in a hazardous situation. However, the rule
should be: “If in doubt — report it”.
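The hydraulic-system example can be made concrete with a tiny sketch. The rule and its threshold below are illustrative assumptions for the sake of the example, not actual reporting criteria from any State; the governing principle remains "if in doubt — report it".

```python
def hydraulic_loss_is_critical(total_systems: int, systems_lost: int) -> bool:
    """Illustrative rule of thumb only: treat the loss as critical when one
    or fewer independent hydraulic systems remain. Real reporting criteria
    are set by each State's regulations; when in doubt, report."""
    remaining = total_systems - systems_lost
    return remaining <= 1

# Single-system aircraft: losing the only system is critical.
print(hydraulic_loss_is_critical(total_systems=1, systems_lost=1))
# Aircraft with three systems: losing one leaves two, so this rule
# would not flag it as critical.
print(hydraulic_loss_is_critical(total_systems=3, systems_lost=1))
```

The point of the sketch is that "the same failure" changes criticality with context (here, redundancy); a fixed list of reportable items cannot capture every such combination of circumstances.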
Because mandatory systems deal mainly with “hardware” matters, they tend to collect more information
on technical failures than on human factors aspects. To help overcome this problem, States with
well-developed mandatory reporting systems are introducing voluntary incident reporting systems aimed
specifically at acquiring more information on the human factors aspects of occurrences.
Voluntary incident reporting systems
Annex 13 recommends that States introduce voluntary incident reporting systems to supplement the
information obtained from mandatory reporting systems. In such systems, the reporter, without any legal
or administrative requirement to do so, submits a voluntary incident report. In a voluntary reporting
system, regulatory agencies may offer an incentive to report. For example, enforcement action may be
waived for unintentional violations that are reported. The reported information should not be used against
the reporters, i.e. such systems must be non-punitive to encourage the reporting of such information.
21. See Annex 13, Chapter 8.
Confidential reporting systems
Confidential reporting systems aim to protect the identity of the reporter. This is one way of ensuring that
voluntary reporting systems are non-punitive. Confidentiality is usually achieved by de-identification,
often by not recording any identifying information about the occurrence. One such system returns to the user
the identifying part of the reporting form and no record is kept of these details. Confidential incident
reporting systems facilitate the disclosure of human errors, enabling others to learn from mistakes made,
without fear of retribution or embarrassment.
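De-identification of this kind can be sketched as a simple scrubbing pass over a report before it is stored. The field names, date generalization and narrative patterns below are illustrative assumptions, not any actual system's procedure; real programmes apply far more careful, often manual, review.

```python
import re

def deidentify(report: dict) -> dict:
    """Strip or generalize identifying details before a report is stored:
    drop name/employer/contact fields, generalize the date to a quarter, and
    mask airline-style flight numbers in the narrative (illustrative only)."""
    clean = {k: v for k, v in report.items()
             if k not in ("reporter_name", "employer", "contact")}
    # Generalize an ISO date (YYYY-MM-DD) to year and quarter.
    year, month, _ = report["date"].split("-")
    clean["date"] = f"{year}-Q{(int(month) - 1) // 3 + 1}"
    # Mask flight numbers such as "AB123" in the free-text narrative.
    clean["narrative"] = re.sub(r"\b[A-Z]{2}\d{1,4}\b", "[FLIGHT]",
                                report["narrative"])
    return clean

report = {
    "reporter_name": "J. Smith", "employer": "Example Air", "contact": "x",
    "date": "2004-05-17",
    "narrative": "While working AB123 we received a late runway change.",
}
print(deidentify(report))
```

Note that automated scrubbing alone cannot guarantee anonymity; combinations of date, location and aircraft type may still identify an occurrence, which is why de-identification is usually reviewed by a person.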
PRINCIPLES FOR EFFECTIVE INCIDENT REPORTING SYSTEMS
People are understandably reluctant to report their mistakes to the company that employs them, or to the
government department that regulates them. Too often following an occurrence, investigators learn that
many people were aware of the unsafe conditions before the event. For whatever reasons, however, they
did not report the perceived hazards, perhaps because of:
a) Embarrassment in front of their peers;
b) Self-incrimination, especially if they were responsible for creating the unsafe condition;
c) Retaliation from their employer for having spoken out; or
d) Sanction (such as enforcement action) by the regulatory authority.
Use of the following principles helps overcome the natural resistance to safety reporting.
Trust
Persons reporting incidents must trust that the receiving organization (whether the State or company) will
not use the information against them in any way. Without such confidence, people will be reluctant to
report their mistakes and they may also be reluctant to report other hazards they are aware of.
Trust begins with the design and implementation of the reporting system. Employee input into the
development of a reporting system is vital. A positive safety culture in the organization generates the kind
of trust necessary for a successful incident reporting system. Specifically, the culture must be error
tolerant and non-punitive. In addition, incident reporting systems need to be perceived as being fair in
how they treat unintentional errors or mistakes. (Most people do not expect an incident reporting system
to exempt criminal acts, or deliberate violations, from prosecution or disciplinary action.) Some States
consider such a process to be an example of a “Just Culture”.
Non-punitive
Non-punitive reporting systems are based on confidentiality. Before employees will freely report
incidents, they must receive a commitment from the regulatory authority or from top management that
reported information would not be used punitively against them. The person reporting the incident (or
unsafe condition) must be confident that anything said will be kept in confidence. In some States, “Access
to Information” laws make it increasingly difficult to guarantee confidentiality. Where this happens,
reported information will tend to be reduced to the minimum to meet mandatory reporting requirements.
Sometimes reference is made to anonymous reporting systems. Reporting anonymously is not the same as
confidential reporting. Most successful reporting systems have some type of callback capability in order
to confirm details, or obtain a better understanding of the occurrence. Reporting anonymously makes it
impossible to ensure understanding and completeness of the information provided by the reporter. There
is also a danger that anonymous reporting may be used for purposes other than safety.
Inclusive reporting base
Early voluntary incident reporting systems were targeted at flight crews. Pilots are in a position to observe
a broad spectrum of the aviation system, and are therefore well situated to comment on the system’s
health. Nonetheless, incident reporting systems that focus solely on the flight crew’s perspective tend
to reinforce the idea that everything comes down to pilot error. Taking a systemic approach to safety
management requires that safety information be obtained from all parts of the operation.
In State-run incident reporting systems, collecting information on the same occurrence from different
perspectives facilitates forming a more complete impression of events. For example, ATC instructs an
aircraft to ‘go around’ because there is a maintenance vehicle on the runway without authorization.
Undoubtedly, the pilot, the controller and the vehicle operator would all have seen the situation from
different perspectives. Relying on one perspective only may not provide a complete understanding of the
event.
Independence
Ideally, State-run voluntary incident reporting systems are operated by an organization separate from the
aviation administration responsible for the enforcement of aviation regulations. Experience in several
States has shown that voluntary reporting benefits from a trusted “third party” managing the system. The
“third party” receives, processes and analyses the incident reports and feeds the results back to the
aviation administration and the aviation community. With “mandatory” reporting systems, it may not be
possible to employ a “third party”. Nevertheless, it is desirable that the aviation administration gives a
clear undertaking that any information received will be used for accident prevention purposes only. The
same principle applies to an airline or any other aviation operator that uses incident reporting as part of its
safety management system.
Ease of reporting
The task of submitting incident reports should be as easy as possible for the reporter. Reporting forms
should be readily available so that anyone wishing to file a report can do so easily. They should be simple
to complete, with adequate space for a descriptive narrative, and they should encourage suggestions on how
to improve the situation or prevent a recurrence. To simplify completion, classifying information, such
as the type of operation, light conditions, type of flight plan, weather, etc., can use a “tick-off” format.
Acknowledgment
The reporting of incidents requires time and effort by the reporter and should be appropriately
acknowledged. To encourage further reports, one State includes a blank report form with the
acknowledgment letter. In addition, the reporter naturally expects feedback about actions taken in
response to the reported safety concern.
Promotion
The (de-identified) information received from an incident reporting system should be made available to
the aviation community in a timely manner. This may also help to motivate people to report further
incidents. Such promotion activities may take the form of monthly newsletters or periodic summaries.
Ideally a variety of methods would be used with a view to achieving maximum exposure.
INTERNATIONAL INCIDENT REPORTING SYSTEMS
ICAO Accident/Incident Data Reporting System (ADREP)
In accordance with Annex 13, States report to ICAO information on all aircraft accidents involving
aircraft of a maximum certified take-off mass of over 2,250 kg. ICAO also gathers information on aircraft
incidents (involving aircraft over 5,700 kg) considered to be important for safety and accident
prevention. This reporting system is known as ADREP. States report specific data in a pre-determined
(and coded) format to ICAO. When ADREP reports are received from States, the information is checked
and electronically stored, constituting a databank of worldwide occurrences.
ICAO does not require States to investigate incidents. However, if a State does investigate a serious
incident, it is requested to submit formatted data to ICAO. The types of serious incidents of interest to
ICAO include:
a) Multiple system failures;
b) Fires or smoke on-board an aircraft;
c) Terrain and obstacle clearance incidents;
d) Flight control and stability problems;
e) Take-off and landing incidents;
f) Flight crew incapacitation;
g) Decompression; and
h) Near collisions and other serious air traffic incidents.
European Co-ordination Centre for Aviation Incident Reporting Systems (ECCAIRS) 22
Many aviation authorities in Europe have collected information about aviation accidents and incidents.
However, the number of significant occurrences in individual States was usually not sufficient to give an
early indication of potentially serious hazards or to identify meaningful trends. Since many States had
incompatible data storage formats, pooling of safety information was almost impossible. To improve this
situation, the European Union (EU) introduced occurrence-reporting requirements and developed the
ECCAIRS safety database system. The objective of these moves was to improve aviation safety in Europe
through the early detection of potentially hazardous situations. The ECCAIRS system includes
capabilities for analysing and presenting the information in a variety of formats. The database is
compatible with some other incident reporting systems, such as ADREP. Some non-European States have
also chosen to implement the ECCAIRS system to take advantage of common classification taxonomies,
etc.
STATE VOLUNTARY INCIDENT REPORTING SYSTEMS
A number of States operate successful voluntary incident reporting systems that share common features.
Two such systems are described below.
Aviation Safety Reporting System (ASRS)23
The United States operates a large aviation occurrence reporting system, known as the Aviation Safety
Reporting System (ASRS). The ASRS operates independently from the Federal Aviation Administration
(FAA) and is administered by NASA. Pilots, air traffic controllers, cabin crew, mechanics, ground
personnel, and others involved in aviation operations may submit reports when aviation safety has been
considered to be compromised. Samples of reporting forms are at the ASRS Website.
Reports sent to the ASRS are held in strict confidence. All reports are de-identified before being entered
into the database. All personal and organizational names are removed. Dates, times and related
information, which might reveal an identity, are either generalized or eliminated. ASRS data are used to:
a) Identify systemic hazards in the national aviation system for remedial action by appropriate
authorities;
b) Support policy formulation and planning in the national aviation system;
c) Support research and studies in aviation, especially including human factors safety research; and
d) Provide information to promote accident prevention.
The FAA recognizes the importance of voluntary incident reporting to accident prevention and offers
ASRS reporters some immunity from enforcement actions, waiving penalties for unintentional violations
reported to ASRS. With over 300,000 reports now on file, this database supports research in aviation
safety, especially relating to human factors.
22. For more information on ECCAIRS visit their website at http://eccairs-www.jrc.it
23. The ASRS website is at http://asrs.arc.nasa.gov
Confidential Human Factors Incident Reporting Programme (CHIRP)24
CHIRP contributes to the enhancement of flight safety in the United Kingdom by providing a
confidential reporting system for all individuals employed in aviation. It complements the United
Kingdom’s Mandatory Occurrence Reporting system. Noteworthy features of CHIRP include:
a) Independence from the regulatory authority;
b) Broad availability (including flight crew members, air traffic control officers, licensed aircraft
maintenance engineers, cabin crew and the general aviation community);
c) Confidentiality of reporters’ identities;
d) Analysis by experienced safety officers;
e) Newsletters with broad distribution to improve safety standards by sharing safety information;
and
f) Participation by CHIRP representatives on several aviation safety bodies to assist in resolving
systemic safety issues.
COMPANY REPORTING SYSTEMS25
In addition to State-operated incident reporting systems (both mandatory and voluntary), many airlines,
ATS providers and airport operators are implementing ‘in-house’ reporting systems for the reporting of
safety hazards and incidents. If reporting is available to all personnel (not just flight crews), company
reporting systems help promote a positive company-wide safety culture. Chapter 16 (Aircraft Operations)
includes a deeper examination of company hazard and incident reporting systems.
IMPLEMENTATION OF INCIDENT REPORTING SYSTEMS
If implemented in a non-punitive work environment, an incident reporting system can go a long way
toward creating a positive safety culture. Depending on the size of the organization, the most expedient
method for incident and hazard reporting is to use existing “paperwork” such as safety reports and
maintenance reports. However, as the volume of reports increases, some sort of computerized system will
be required to handle the task.
What to report?
Any hazard which has the potential to cause damage or injury or which threatens the organization’s
viability should be reported. Hazards and incidents should be reported if it is believed that:
a) Something can be done to reduce the accident potential;
b) Other aviation personnel could learn from the report; or
24. Visit the CHIRP website at http://www.chirp.co.uk/air
25. FSF Digest, Aviation Safety: US Efforts to Implement Flight Operational Quality Assurance Programs, Jul-Sep 1998, p. 51.
c) The system and its inherent defences did not work “as advertised.”
In short, if in doubt as to an event’s safety significance, report it. (Those incidents and accidents that are
required to be reported in accordance with State laws or regulations governing accident or incident
reporting should also be included in an operator’s reporting database.) Appendix 2 to Chapter 17 (Aircraft
Operations) provides examples of the types of events that should be reported to an operator’s incident
reporting system.
Who should report?
To be effective, incident reporting systems should include a broad reporting base. For specific
occurrences, the perceptions of different participants in, or witnesses to, an event may be quite different –
yet all relevant. Thus, State systems for voluntary incident reporting should encourage the participation of
flight and cabin crews, air traffic controllers, airport workers and maintenance technicians.
System management
In establishing a reporting system, management and senior operations personnel need to determine the
operating procedures for the system. For example:
a) What types of occurrence, event or hazard should be reported?
b) Where do all the reports go initially?
c) How will the reporting be acknowledged?
d) Who is responsible for any investigation that may be required?
e) How will confidentiality be protected?
f) What degree of immunity will be provided, and what criteria will apply for such immunity?
g) What is the role of the Safety Manager and the safety committee?
h) What are the criteria for bringing a (de-identified) report to management’s attention?
i) How will any safety lessons learned and actions taken be disseminated to relevant staff?
Such procedures should be widely disseminated to encourage use of the reporting system.
Reporting method and format
The method and format chosen for a reporting system matter little as long as they encourage personnel to
report all hazards or incidents. The reporting process should be as simple as possible, and be well
documented, including details as to what, where and when to report.
In designing reporting forms, the layout should facilitate the submission of information. Sufficient space
should be provided to encourage reporters to identify suggested corrective actions. Other factors to be
considered in designing a system and reporting forms include:
a) Operational personnel are generally not prolific writers; therefore, the form should be kept as short
as possible;
b) Reporters are not safety analysts; therefore, the questions should be in simple, everyday language;
c) Use of non-directive questions instead of leading questions. (Non-directive questions include:
What happened? Why? How was it fixed? What should be done?);
d) Prompts may be required for the reporter to think about ‘system failures’ (how close were they to
an accident?), and to consider their error management strategies;
e) Focus should be on the detection and recovery from an unsafe situation or condition; and
f) Reporters should be encouraged to consider the wider safety lessons inherent in the report, e.g.
how the organization and the aviation system could benefit from it.
Regardless of the source or method of submission, once the information is received it must be stored in a
manner suitable for easy retrieval and analysis.
Limitations on the use of incident data
Data gathered through voluntary hazard and incident reporting systems may be subject to a number of
limitations. In analysing such data, care must be taken to avoid erroneous conclusions and inappropriate
safety actions. Some of the factors to be considered in using incident data include:
a) Information may not have been verified;
b) Reporters are naturally biased (for example, pilots may see the circumstances of an occurrence
differently to an air traffic controller because of their different perspectives);
c) Reporting forms may inadvertently reflect bias (such as eliciting irrelevant information and
avoiding more crucial information);
d) Incident-reporting databases may be biased (by forcing the artificial classification of some data);
and
e) Shortcomings in trend analyses due to the nature and structure of the recorded information.
Appendix 1 to this chapter contains further guidance on limitations in the use of data from voluntary
incident reporting systems.
————————
Appendix 1 to Chapter 7
LIMITATIONS ON THE USE OF DATA FROM
VOLUNTARY INCIDENT REPORTING SYSTEMS26, 27
Care needs to be taken when using data from voluntarily submitted incident reports. In drawing
conclusions based on such data, analysts should be aware of the following limitations.
Information not validated. In some States, voluntary, confidential reports can be fully investigated
and information from other sources brought to bear on the incident. However, the confidentiality
provisions of smaller programmes (such as company reporting systems) make it difficult to
adequately follow up on the report without compromising the identity of the reporter. Thus, much of
the reported information cannot be substantiated.
Reporters may have a tendency to understate their errors and blame the occurrence on other parties.
Incidents may also be embellished to benefit the reporters. For example, during contract negotiations
the number of incidents reported may be artificially inflated in order to support one side’s bargaining
position.
Reporter biases. Two factors may bias voluntary incident data: who reports and what gets reported.
For example, pilots expect air traffic controllers to prevent traffic conflicts at tower-equipped airports.
When a conflict occurs, pilots may view it as a problem with the air traffic system and therefore feel
it is useful to report the incident. At non-tower airports, pilots are responsible for seeing and avoiding
the other aircraft. If they fail to do so, the same pilots may feel culpable and perceive little benefit in
reporting the incident; particularly, if no one else is in a position or likely to follow-up on the error.
Some of the factors contributing to the subjective nature of voluntary incident reports include:
a) Reporters must be familiar with the reporting system and have access to reporting forms or
phone numbers;
b) Reporters’ motivation to report may vary due to the following factors:
1) Level of commitment to safety;
2) Awareness of the reporting system;
3) Perception of the associated risks (local vs. systemic implications);
4) Operational conditions (some types of incident receive more attention than others);
5) Denial, ignorance of safety implications, a desire to hide the problem, or fear of
recrimination or even disciplinary action (despite guarantees to the contrary);

26. Development of a Methodology for Operational Reporting and Analysis Systems (OIRAS), Appel d'offres DGAC No 96/01, by Jean Paries and Ashleigh Merritt.
27. Adapted in part from a paper by Dr. Sheryl Chappell of NASA ASRS entitled "Using Voluntary Incident Reports for Human Factors Evaluations", Aviation Psychology in Practice, Avebury Press, 1994.
c) Different occupational groups see things differently, both in terms of interpreting the same
event and in terms of deciding what is important; and
d) Reporters must be aware of an incident to submit a report. Errors that go undetected are not
reported.
Report forms. Typically, incident reporting forms induce bias (including bias against reporting at all):
a) A report form must be sufficiently short and easy to use that operational personnel are
encouraged to use it; thus, the number of questions is necessarily limited;
b) Completely open questions (i.e. narratives only) can fail to elicit useful data;
c) Questions can guide the reporter, but they can also distort perceptions by leading the reporter
to biased conclusions; and
d) The range of possible events is so broad that a standard structured form cannot capture all
information. (Therefore, analysts may have to contact the reporter to gain specific
information.)
Incident reporting databases. Information must be categorized in accordance with a pre-determined
structure of keywords or definitions for entry into the database for later retrieval. Typically this
introduces bias into the databases, compromising their utility. For example:
a) Unlike objective physical flight parameters, descriptions of events and any causal attributions
are more subjective;
b) Categorization requires a system of pre-determined keywords or definitions biasing the
database:
1) Reports are analysed to “fit” the keywords; details that do not fit are ignored;
2) It is impossible to create an exhaustive list of keywords for classifying information;
3) Keywords are either present or not present, providing a poor approximation of the real
world;
4) Information is retrieved according to how it is stored; hence categorization
determines the output parameters. For example, if there is no keyword called
‘technical failure’ then ‘technical failure’ will never be found to be the cause of
incidents from that database;
5) The categorization system creates a ‘self-fulfilling prophecy’. For example, many
incident reporting systems bias the keyword categorization toward CRM.
Consequently, CRM is often cited as both the cause of the problem and its cure (more
CRM training will redress the perceived CRM deficiency);
c) Much of the information in the databases is never retrieved once it is entered; and
d) Given the generality of keywords, the analyst must frequently go back to the original report
to understand contextual details.
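The keyword problem described in b) above can be illustrated with a toy example. The keyword list and narrative below are invented for illustration; real taxonomies are far larger, but the failure mode is the same: whatever the keyword list cannot express is silently lost to later queries.

```python
# Hypothetical keyword taxonomy -- note there is no "technical failure" keyword.
KEYWORDS = {"CRM", "weather", "fatigue"}

def categorize(narrative: str) -> set:
    """Tag a report with whichever pre-determined keywords appear in it.
    Details that match no keyword are silently dropped."""
    return {kw for kw in KEYWORDS if kw.lower() in narrative.lower()}

report = "Hydraulic pump technical failure during approach; crew coordination good."
tags = categorize(report)

# The hydraulic failure is now invisible to any later query of the database:
assert "technical failure" not in tags
```

A query against this database could never find “technical failure” as a cause, exactly as described in item 4) above.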
Relative frequency of occurrence. Since voluntary incident reporting systems do not receive
information of the type needed to compute useful rate figures, any attempt to put the incident in the
perspective of a frequency of occurrence vis-à-vis other occurrences will be an educated guess at best.
For valid frequency comparisons, three types of data are required: the number of persons actually
experiencing similar incidents (not just the reported incidents), the size of the population at risk of
similar occurrences, and a measurement of the time period under consideration.
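As a hypothetical illustration of how the three data types combine in a rate calculation (all figures invented for this sketch):

```python
def incident_rate(actual_incidents: int, population_at_risk: int,
                  exposure_hours: float) -> float:
    """Incidents per 100 000 exposure hours for the population at risk.

    Requires ALL incidents actually experienced (not just those reported),
    the size of the population at risk, and the time period considered --
    precisely the figures a voluntary reporting system cannot supply."""
    return actual_incidents / (population_at_risk * exposure_hours) * 100_000

# Invented figures: 12 incidents among 300 pilots flying 50 hours each.
rate = incident_rate(actual_incidents=12, population_at_risk=300,
                     exposure_hours=50.0)
# rate is 80.0 incidents per 100 000 flight hours
```

Because a voluntary system only sees reported incidents, the numerator is unknowable, and any rate derived from it understates the true figure by an unknown amount.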
Trend analysis. Attempts at meaningful trend analysis of the more subjective parameters recorded in
incident reporting databases have not been particularly successful. Some of the reasons follow:
a) Difficulties in using structured information;
b) Limitations in capturing the context for the incident through keywords;
c) Inadequate levels of detail and accuracy of recorded data;
d) Poor reliability of one report when compared against another;
e) Difficulties in merging data from different databases; and
f) Difficulties in formulating meaningful queries for the database.
____________________
Chapter 8
SAFETY INVESTIGATIONS
Outline
Introduction
 State investigations
— Accidents
— Serious incidents
 In-house investigations
Scope of Safety Investigations
 How much information is enough?
Information Sources
 Primary sources of information
 Secondary sources of information
Interviews
 Conducting interviews
 Caveat regarding witness interviews
Investigation Methodology
Investigating Human Performance Factors
Investigating Procedural Deviations
Safety Recommendations
Appendices
1. Interviewing Techniques
2. Line of Reasoning for Investigating Human Errors
3. Investigating Procedural Deviations with PEAT and MEDA
Chapter 8
SAFETY INVESTIGATIONS28
Investigation. A process conducted for the purpose of accident prevention
which includes the gathering and analysis of information, the drawing of
conclusions, including the determination of causes and, when appropriate, the
making of safety recommendations.
ICAO Annex 13
INTRODUCTION
Effective safety management systems depend on the investigation and analysis of safety issues. The
accident prevention value of an accident, hazard or incident report is proportional to the quality of the
investigative effort.
In the ICAO Annex 13 context, investigations are associated with accidents and serious incidents. For
reportable accidents and serious incidents, the State will normally provide an investigative team.
However, the State investigators will often call upon the special knowledge of the Safety Manager (SM)
and other staff for input to the investigation29.
Beyond the State investigations, the SM will likely lead investigations into less serious occurrences,
which may nonetheless have even deeper safety significance. Ideally, each reported hazard or occurrence
is investigated to the point where all associated risks are identified.
The SM and those assigned as investigators must be technically competent both in the area under
investigation, as well as in the methods of safety investigation. The SM must earn the confidence and
respect of staff, and demonstrate the highest levels of integrity while objectively attempting to determine
why things occurred and how potential hazards might impact on safety. Consistent with the principles of
safety management, a systems approach is required.
State investigations
Accidents
Accidents provide compelling and incontrovertible evidence of the severity of hazards. Too often, it
takes the catastrophic and costly nature of an accident to spur the allocation of resources, to an extent
otherwise unlikely, to reduce or eliminate unsafe conditions.
By definition, accidents result in damage and/or injury. If we concentrate on investigating only the
results of accidents, not the hazards or risks that cause them, we are being reactive. Reactive
investigations are rather inefficient from an accident prevention perspective, in that serious latent
unsafe conditions posing significant safety risks may be overlooked.

28. Based on TC TP 13881, Safety Management Systems for Flight Operations and Aircraft Maintenance.
29. The Manual of Aircraft Accident Investigation (Doc 6920) contains much useful information for SMs during formal State investigations.
The focus of an accident investigation should therefore be directed towards effective risk control.
With the investigation directed away from “the chase for the guilty party” and towards effective risk
mitigation, cooperation will be fostered among those involved in the accident, facilitating the
discovery of the underlying causes. The short-term expediency of finding someone to blame is
detrimental to the long-term goal of preventing accidents.
Serious incidents
The term “serious incident” is used for those incidents which only good fortune narrowly prevented from
becoming accidents, for example, a near collision with another aircraft, or with the ground, involving a
large passenger aircraft. Because of the seriousness of such incidents, they should be thoroughly
investigated. Some States treat these serious incidents as if they had been accidents. Thus
they use an accident investigation team to carry out the investigation, including the publication of a
Final Report and the forwarding to ICAO of an ADREP incident data report. This type of full-scale
incident investigation has the advantage of providing hazard information to the same standard as that
of an accident investigation, without the associated loss of life, aircraft or property.
In-house investigations
Most occurrences do not warrant investigations by either the State investigative or regulatory authorities.
Many incidents are not even required to be reported to the State. Nevertheless such incidents may be
indicative of potentially serious hazards – perhaps systemic problems that will not be revealed unless the
occurrence is properly investigated.
For every accident or serious incident there may be hundreds of minor occurrences, many of which have
the potential to become an accident. It is important that all reported hazards and incidents be reviewed and
a decision taken on which ones should be investigated and how deeply. SMs will frequently be involved
in investigating these occurrences with a view to accident prevention.
For in-house investigations, the SM (or investigating team) may require specialist assistance, depending
on the nature of the occurrence being investigated. For example:
a) Cabin safety specialists for in-flight turbulence encounters, smoke or fumes in the cabin, galley
fire, etc.;
b) Air Traffic Services for loss of separation, near collisions, frequency congestion, etc.;
c) Maintenance engineers for incidents involving material or system failures, smoke or fire, etc.; and
d) Airport management advice for incidents involving FOD, snow and ice control, airfield
maintenance, vehicle operations, etc.
SCOPE OF SAFETY INVESTIGATIONS
How far should an investigation go into minor incidents and hazard reports? The extent of the
investigation should depend on the actual or potential consequences of the occurrence or hazard. Hazard
or incident reports that indicate high-risk potential should be investigated in greater depth than those with
low potential.
The depth and detail of the investigation should be that which is required to clearly identify and validate
the underlying hazards. Understanding why something happened requires a broad appreciation of the
context for the occurrence. To develop this understanding of the unsafe conditions, the investigator should
take a systems approach, perhaps drawing on the SHELL model outlined in Chapter 4.
In deciding upon the scope of work required, SMs must accept that resources are limited. Thus the effort
expended should be proportional to the perceived benefit, in terms of potential for identifying systemic
hazards and risks to the organization.
How much information is enough?
The investigative process should be comprehensive, attempting to address all the factors that contributed
to the situation. Active failures (sometimes called triggering events) take place immediately prior to the
event. They have a direct impact on system safety because of the immediacy of their effects. However,
they are not usually the root cause of the event. As such, applying corrective actions to these active
failures may not address the real cause(s) of the problem. A more detailed analysis is normally required to
understand all the factors that may have contributed.
How much data should be collected and analysed in order to develop an accurate picture of the
contributing factors? How many employees, support personnel, supervisors, etc. should be interviewed?
How far back in time should activities be investigated? To what extent should inter-personal relationships
be examined? At what point does past behaviour cease to influence current behaviour? To what level of
management should the investigation progress?
There are no clear answers to such questions. If the information will not help explain why something
happened (or happens), then it is not relevant to the investigation. The questioning process can be carried
to extremes. In practice, it is reasonable to stop the investigation at a point where corporate and regulatory
authorities no longer have any power to change the underlying situation in the interests of safety.
Although the investigation should focus on the factors most likely to have influenced actions, the dividing
line between relevance and irrelevance is often blurred. Data that initially may seem to be unrelated to the
investigation could later prove to be extremely relevant after relationships between particular elements of
the occurrence are better understood.
INFORMATION SOURCES
Information relevant to a safety investigation can be acquired from a variety of sources which can be
considered as primary and secondary sources of information. Primary sources of information include
similarly configured equipment, documentation, recorder tapes, interviews, direct observation of
personnel activities, and simulations. Secondary sources include occurrence databases, technical literature
and the expertise of professionals or specialists.
Primary sources of information
Physical examination of the equipment used during the safety event may yield useful information. This
may include examining the front-line equipment used, its components, or the workstations and equipment
used by supporting personnel (e.g. air traffic controllers, maintenance and servicing personnel).
Documentation spanning a broad spectrum of the operation is available, for example:
a) Maintenance records and logs;
b) Personal records/logbooks;
c) Certificates and licences;
d) In-house personnel and training records, work schedules, etc.;
e) Operator’s manuals and standard operating procedures;
f) Training manuals and syllabi;
g) Manufacturer’s data and manuals;
h) Regulatory authority records;
i) Weather forecasts, records and briefing material; and
j) Flight planning documents, etc.
Recordings (flight recorders, ATC radar and voice tapes, etc.) may provide useful information for
determining the sequence of events. In addition to traditional flight data recordings, maintenance
recorders in new generation aircraft are a potential additional source of information.
Interviews conducted with individuals directly or indirectly involved in the safety event can provide a
principal source of information for any investigation. In the absence of measurable data, interviews may
be the only source of information. (Thus, SMs involved in investigations need to be skilled in
interviewing techniques.)
Direct observation of actions performed by operating or maintenance personnel in their work
environment can reveal information about potential unsafe conditions. However, the persons being
observed must be aware of the purpose of the observations.
Simulations permit reconstruction of an occurrence and can facilitate a better understanding of the
sequence of events that led up to the occurrence, and the manner in which personnel responded to the
event. Computer simulations can be used to reconstruct events using data from on-board recorders, air
traffic control tapes, radar recordings and other physical evidence.
Secondary sources of information
Additional information collected from other sources can facilitate the analysis of factual information.
The SM and in-house investigators cannot be experts in every field related to the operational
environment. It is important that they realize their limitations. When necessary, they must be willing to
consult with other professionals during an investigation. Depending on the nature of the hazard and any
required analysis, advice from such disciplines as engineering, medicine, psychology and statistics may
be required.
Some of the most useful sources of supporting information come from accident/incident databases,
in-house hazard and incident reporting systems, confidential reporting programmes, systems for
monitoring line operations (e.g. flight data analysis and LOSA programmes), manufacturers’ databases, etc.
A word of caution about databases: before using any data, the prudent SM will assess the limitations
inherent in the database. Consideration should be given to the source of the data, its intended purpose, the
definitions used in recording the data, its currency, etc. Chapter 15 (Practical Considerations) provides
guidance on the management of safety information including limitations on the use of databases.
INTERVIEWS
Information acquired through interviews can help clarify the context for unsafe acts and conditions. It can
be used to confirm, clarify or supplement information learned from other sources. Interviews can help to
determine what happened. More importantly, interviews are often the only way to answer the important
“why” questions which, in turn, can facilitate appropriate and effective safety recommendations.
In preparation for an interview, the interviewer must expect that individuals will perceive and recall
things differently. The details of a system defect reported by operational personnel may differ from those
observed by maintenance personnel during a service check. Supervisors and management may perceive
issues differently than line personnel. The interviewer must accept all views as worthy of further
exploration. However, even qualified, experienced and well-intentioned witnesses could be mistaken in
their recollection of events they have witnessed. In fact, when a number of persons are interviewed about
the same event and no differing perspectives emerge, there may be grounds to suspect the validity of the
information being received.
Conducting interviews
The effective interviewer adapts to these differing views, remaining objective and avoiding making an
early evaluation of the content of the interview. An interview is a dynamic situation, and the skilled
interviewer knows when to continue a line of questioning and when to back off.
For best results, interviewers will likely employ a process as follows:
a) Careful preparation and planning for the interview;
b) Conducting the interview in accordance with a logical, well-planned structure; and
c) Assessing the information gathered in the context of all other known information.
Appendix 1 to this chapter provides further guidance for conducting effective interviews.
Caveat regarding witness interviews
Sorting through the often-conflicting nature of witness interviews requires caution. Intuitively, an
interviewer may weigh the value of an interview dependent on the background and experience of the
person being interviewed. However:
a) Persons judged as “good witnesses” may allow their perceptions to be influenced by their
experience (i.e. they see and hear what they would “expect”). Consequently, their description of
events may be biased; and
b) On the other hand, people who have no knowledge of an occurrence they have witnessed are
often able to accurately describe the sequence of events. They may be more objective in their
observations.
The skilled investigator does not overly rely on a single witness – even the testimony of an expert. Rather,
information from as many sources as practical needs to be integrated to form an accurate perception of the
situation.
Credibility is difficult to establish, easy to lose,
and almost impossible to regain.
INVESTIGATION METHODOLOGY
Accident prevention requires a range of activities related to the investigative process. The traditional field
phase of the investigation is used to identify and validate perceived safety hazards. Competent safety
analysis is required to assess the risks, and effective communications are required to control the risks. In
other words, effective safety management requires an integrated approach to safety investigations.
Some occurrences and hazards originate from material failures, or occur in unique environmental
conditions. However, the majority of unsafe conditions are generated through human errors. When
considering human error, an understanding of the unsafe conditions that may have affected human
performance or decision-making is required. These unsafe conditions may be indicative of systemic
hazards that put the entire aviation system at risk. Consistent with the systems approach to safety, an
integrated approach to safety investigations considers all aspects that may have contributed to unsafe
behaviour or created unsafe conditions.
The logic flow for an integrated process for safety investigations is depicted in Figure 8 - 1 Integrated
Safety Investigation Methodology (ISIM). Such a model can guide the SM or safety investigator from
initial hazard or incident notification, through to the communication of safety lessons learned.
The methodology proceeds through the following stages:

a) Hazard or occurrence notification and assessment: assess the notification and decide whether or
not to investigate;
b) Data collection: identify the events and underlying factors;
c) Sequence of events: reconstruct the logical progression of occurrence events;
d) Integrated investigation: analyse the facts and determine findings regarding underlying factors
and hazards;
e) Risk assessment: estimate the risk and determine its acceptability for each hazard;
f) Defence analysis: identify defences that are missing or inadequate;
g) Risk control analysis: identify and evaluate risk control options; and
h) Safety communication: communicate the safety message to stakeholders.

Figure 8-1. Integrated Safety Investigation Methodology (ISIM)30
Effective investigations do not follow a simple step-by-step process that starts at the beginning and
proceeds directly through each phase to completion. Rather, investigation is an iterative process that may
require going back and repeating steps as new data is acquired and/or as conclusions are reached. The SM
and the investigative team must remain flexible, moving forwards and backwards from data collection
through analysis and back again.

30. Adapted from the ISIM developed by the TSB of Canada.
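Purely as an illustration, this iterative flow might be sketched in code. The step names mirror Figure 8-1, while the loop structure and the `needs_more_data` condition are hypothetical stand-ins for the investigative team's judgement, not a prescribed algorithm.

```python
# Steps follow Figure 8-1 (ISIM); names are illustrative identifiers.
STEPS = [
    "notification_assessment",
    "data_collection",
    "sequence_of_events",
    "integrated_investigation",
    "risk_assessment",
    "defence_analysis",
    "risk_control_analysis",
    "safety_communication",
]

def run_isim(needs_more_data) -> list:
    """Walk the ISIM steps in order, looping back to data collection
    whenever an analysis phase raises new questions."""
    trail, i = [], 0
    while i < len(STEPS):
        step = STEPS[i]
        trail.append(step)
        # The analysis phases may send the team back for more data.
        if step in ("integrated_investigation", "risk_assessment") and \
                needs_more_data(step, trail):
            i = STEPS.index("data_collection")
            continue
        i += 1
    return trail
```

The key design point is the `continue` back to data collection: analysis is not a one-way gate but a place where gaps in the evidence become visible.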
INVESTIGATING HUMAN PERFORMANCE FACTORS
Investigators have been quite successful in analysing the measurable data as it pertains to human
performance, e.g. strength requirements to move a control column, lighting requirements to read a
display, ambient temperature and pressure requirements, etc. Unfortunately, the majority of safety
deficiencies derive from issues that do not lend themselves to simple measurement and are thus not
entirely predictable. As a result, the information available does not always allow an investigator to draw
indisputable conclusions.
Several factors typically reduce the effectiveness of a human performance analysis.31 These include:
a) The lack of normative human performance data to use as a reference against which to judge
observed individual behaviour. (FDA and LOSA data are now providing a baseline for better
understanding normal day-to-day performance in flight operations.)
b) The lack of a practical methodology for generalizing from the experiences of an individual (crew
or team member) to an understanding of the probable effects on a large population performing
similar duties.
c) The lack of a common basis for interpreting human performance data among the many disciplines
(e.g. engineering, operations and management) that make up the aviation community.
d) The ease with which humans can adapt to different situations, further complicating the
determination of what constitutes a breakdown in human performance.
The logic necessary to convincingly analyse some of the less tangible human performance phenomena is
different from that required for other aspects of an investigation. Deductive methods are relatively easy to
present and lead to convincing conclusions. For example, if a measured wind shear produced a calculated
aircraft performance loss, the conclusion could be reached that the wind shear exceeded the aircraft’s
performance capability. Such straight cause-and-effect relationships cannot be so easily established for
some human performance issues, such as complacency, fatigue, distraction or judgement.
For example, if an investigation revealed that a crewmember made an error leading to an occurrence
under particular conditions (such as fatigue, distractions, or complacency), it does not necessarily follow
that the error was made because of these preconditions. There will inevitably be some degree of
speculation involved in such a conclusion. The viability of such speculative conclusions is only as good
as the reasoning process used and the weight of evidence available.
Inductive reasoning involves probabilities. Inferences can be drawn on the most probable or most likely
explanations of behavioural events. Inductive conclusions can always be challenged and their credibility
depends on the weight of evidence supporting them. Accordingly, they must be based upon a consistent
and accepted reasoning method.
Other factors to be considered when investigating human performance issues are:
31. A manager from the Boeing Commercial Airplane Company (Fadden), 1984.
a) How to assess the relevance of certain behaviour or actions deemed “abnormal” or “non-standard”;
b) Sensitivity and privacy considerations vs. accident cause/prevention significance; and
c) Avoidance of speculation vs. logical and reasonable explanations.
Analysis of the human factors must take into account the objective of the investigation (i.e. understanding
why something happened). Occurrences are seldom the result of a single cause. Although individual
factors when viewed in isolation may seem insignificant, in combination they can result in a sequence of
events and conditions that culminate in an accident. The SHEL model provides a systematic approach to
examining the constituent elements of the system as well as the interfaces between them. The interactive
system suggested by Reason also provides an excellent framework by which investigators can ensure that
the analysis addresses all levels of the production system.
Understanding the context in which humans err is fundamental to understanding the unsafe conditions
that may have affected their behaviour and decision-making. These unsafe conditions may be indicative
of systemic risks posing significant accident potential.
The Error-Type Decision Tree provides a useful guide for understanding human errors.
In the decision tree, four questions are posed in sequence; the first YES answer determines the error type:

a) Was the error caused by a miscommunication, misinterpretation, or failure to communicate
pertinent information? If YES: Communication Error;
b) Was the error caused by a lack of knowledge or basic proficiency with equipment or procedures?
If YES: Proficiency Error;
c) Was the error a decision that increased risk and for which there were no written regulations or
procedures? If YES: Operational Decision Error;
d) Was the error associated with an intention not to follow written regulations or procedures? If
YES: Intentional Non-compliance Error; and
e) If the answer to all of the above is NO, the error is classified as a Procedural Error.

Figure 8-2. Error-type decision tree
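This branching logic can be expressed directly in code. The sketch below is a plain transcription of the decision tree in Figure 8-2; the answer keys are names invented here for illustration and are not part of any established investigation tool.

```python
def classify_error(answers: dict) -> str:
    """Classify a human error per the Figure 8-2 decision tree.
    `answers` maps each question (hypothetical keys) to the
    investigator's yes/no judgement; missing keys default to NO."""
    if answers.get("miscommunication"):
        return "Communication Error"
    if answers.get("lack_of_knowledge_or_proficiency"):
        return "Proficiency Error"
    if answers.get("risky_decision_no_written_procedure"):
        return "Operational Decision Error"
    if answers.get("intentional_non_compliance"):
        return "Intentional Non-compliance Error"
    # All four questions answered NO: the default classification.
    return "Procedural Error"

assert classify_error({"lack_of_knowledge_or_proficiency": True}) == "Proficiency Error"
assert classify_error({}) == "Procedural Error"
```

Note that question order matters: a miscommunication that also involved intentional non-compliance is classified as a communication error, because the tree asks that question first.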
Typically, incidents and accidents involve several, often inter-linked errors. Each error should be
investigated to determine the most serious underlying contributory factors. With a better understanding of
the context for the human errors (or violations), the SM or investigator can determine the extent to which
a systemic hazard exists. The normal risk assessment process can then be used to determine whether the
risk is unacceptable and therefore warrants safety action.
Appendix 2 to this chapter provides a line of reasoning suitable for the investigation of human errors.
INVESTIGATING PROCEDURAL DEVIATIONS
Research has determined that a majority of hull-loss accidents could have been prevented had established
procedures been followed. Since incidents are often separated from accidents only by a fine margin of
luck, non-compliance with established procedures is also likely to be causal in most incidents.
Simply listing procedural deviations as a contributing factor to an occurrence implies that further
exhortations to “be safe” will alter human performance. However, understanding the causal and
contributory context for such deviations from SOPs and accepted safe work practices is fundamental to
accident prevention.
The reasons behind non-compliance with procedures are not well understood. They may include
ambiguously written or poorly understood procedures, inadequate training, time pressures, design
shortcomings, incompatible work environments, unexpected operational situations, or just plain poor
judgement. Fortunately, following an in-flight incident, the crew is available to share their experience,
insights and rationale.
Ironically, occurrences occasionally arise where deviations from accepted work practices do prevent an
accident. These may also warrant investigation to identify the weaknesses in the established procedures.
Boeing has developed two investigation methodologies aimed at preventing similar types of procedural
deviations: a Procedural Event Analysis Tool (PEAT) and a Maintenance Error Decision Aid (MEDA).
Both follow an iterative (integrated) approach.
PEAT focuses on the key event elements and the underlying cognitive factors that contributed to the
event. In contrast, MEDA looks at organizational factors that contribute to human error, such as poor
communication, inadequate information and poor lighting.
PEAT and MEDA recognize that traditional efforts to investigate errors are often aimed at identifying the
employee who made the error. The usual result is that the employee is subjected to a combination of
disciplinary action and recurrent training – with little systemic safety benefit. Therefore, both PEAT and
MEDA are based on the philosophy that personnel seldom deliberately deviate from prescribed
procedures, especially when doing so creates a safety risk.
PEAT and MEDA are both software-based analytical tools, requiring specialized forms for following the
process in a systematic way. Consequently, specific training in the use of PEAT and MEDA is required to
ensure their successful application. Appendix 3 to this Chapter includes further information on PEAT and
MEDA.
SAFETY RECOMMENDATIONS
When an investigation identifies hazards or unmitigated risks, safety action is required. The need for
action must be communicated by means of safety recommendations to those with the authority to expend
the necessary resources. Failure to make appropriate safety recommendations may leave the risk
unattended. For those formulating safety recommendations, the following considerations may apply:
Action agency. Who can best take the necessary corrective action? Who has the necessary authority
and resources to intervene? Ideally, problems should be addressed at the lowest possible level of
authority, such as the departmental or company level as opposed to the national or regulatory level.
However, if several organizations are exposed to the same unsafe conditions, extending the
recommended action may be warranted. State and international authorities, or multinational
manufacturers may best be able to initiate the necessary safety action.
What vs. how. Safety recommendations should clearly articulate what must be done, not how to do it.
The focus is on communicating the nature of the risks requiring control measures. Detailed safety
recommendations that spell out exactly how the problem should be fixed should be avoided. The
responsible manager should be in a better position to judge the specifics of the most appropriate
action for the current operating conditions. The effectiveness of any recommendation will be
measured in terms of the extent to which the risks have been reduced, rather than strict adherence to
the wording in the recommendation.
General vs. specific wording. Since the purpose of the safety recommendation is to convince others
of an unsafe condition putting some or all of the system at risk, specific language should be used in
summarizing the scope and consequences of the identified risks. On the other hand, since the
recommendation should specify what is to be done (not how to do it), concise wording is preferable.
Recipient’s perspective. In recommending safety action, the following considerations pertain from the
recipient’s perspective:
a) The safety recommendation is addressed to the most appropriate action authority (i.e. they
have the jurisdiction and authority to effect the necessary change).
b) There are no surprises (i.e. there has been prior dialogue concerning the nature of the assessed
risks).
c) It articulates what must be done, while leaving the action authority with the latitude to
determine how best to meet that objective.
Failure to observe these basic considerations may prevent the risk from receiving the necessary
safety attention.
Formal safety recommendations warrant written communications. This ensures that everyone is “singing
from the same song sheet” and provides the necessary baseline for evaluating the effectiveness of
implementation. However, it is important to remember that safety recommendations are only effective if
the responsible managers implement them.
————————
Appendix 1 to Chapter 8
INTERVIEWING TECHNIQUES32
GENERAL
Information from interviews can be used to confirm, clarify, or supplement information obtained from
other sources. Moreover, in the absence of measurable data, interviews may be the only source of
information.
The interviewer’s role is to obtain information from the interviewee that is as accurate, complete, and
detailed as possible. To accomplish this, the interviewer must:
a) Be prepared.
b) Have a clear objective.
c) Have a good knowledge of the hazard or incident under investigation.
d) Be able to adapt to the interviewee’s personal style.
e) Be willing to go beyond simple facts.
Interviews, particularly those involving human performance factors, must go beyond the “What and
When” of the occurrence; they must also attempt to find out “How and Why” it occurred.
To facilitate the interview, the interviewer must:
a) Keep the number of people participating in the interview to a minimum.
b) Include anyone (such as a technical expert, supervisor or union representative) that the
interviewee has requested to be present.
c) Brief others on their expected behaviour during the interview.
d) Not tolerate disruptions of the interview by others.
PREPARING FOR THE INTERVIEW
Personal preparations
The success of the interview will closely relate to personal preparation. Tailor the preparations to the
interview. For example, depending on the nature of the reported hazard or incident, review the following:
a) The facts relating to any safety event.
b) Recorded information, if any.
32. Adapted from the Transportation Safety Board of Canada (TSB) Manual of Investigations.
c) The aircraft systems, if applicable.
d) Any operational peculiarities in procedures.
e) The crew’s personal records.
f) Communication, navigation, and approach facilities, as applicable; etc.
Other preparatory factors include:
a) If possible, visit the occurrence site.
b) Define the general objectives of the interview.
c) Prepare a set of appropriate questions to address areas of concern.
d) Assess the audience and dress accordingly.
e) Arrive for the interview with any required equipment or reference material.
f) Consider requesting expert assistance for interviews of a highly technical nature.
Timing
Timing is critical. In follow-up to an incident or safety event, interviews should be conducted as soon as
practicable to avoid:
a) Loss of perishable information from fading memory.
b) Interpretation and rationalization of events.
c) Contamination caused by exposure to others’ recollections or interpretations of events.
If an immediate interview is impractical, request a written statement to ensure information is recorded
while fresh in the interviewee’s mind.
Location
If possible, select a location which is:
a) Quiet and comfortable.
b) Free from interruptions.
c) For a witness, the location where he or she witnessed the event, if at all possible.
The interview
The opening of the interview should reassure the interviewee about:
a) The purpose of the interview (investigation).
b) Interviewee’s rights.
c) Your role as the interviewer.
d) The procedures to be followed.
Establish a rapport with the witness at the outset. Consider the following:
a) Be polite.
b) Behave in a natural manner; do not make the interview seem artificial.
c) Keep interruptions to a minimum.
d) Strive for an atmosphere of friendly conversation.
e) Intervene only enough to steer the conversation in the desired direction.
f) Display a sincere interest.
The success of the interview will depend on both the timing and the structure of the questions. Begin the
interview with a “free-recall” question, letting the individual talk about what he or she knows of the
occurrence or subject matter. A free-recall question allows the witness:
a) To ease into the interview in a more relaxed manner.
b) To feel that what he or she has to say is significant.
c) Most importantly, to provide information which is uncontaminated by the interviewer.
Sequence the questions from:
a) Easier to harder.
b) General to specific.
As the interview progresses, use a mixture of other types of questions:
a) Open-ended or “trailing-off” questions evoke rapid and accurate descriptions of the events, and
lead to more participation by the interviewee. (For example: “You said earlier that your training
was…?”)
b) Specific questions are necessary to obtain detailed information and may also prompt the person to
recollect further details.
c) Closed questions produce “yes” or “no” answers (providing little insight beyond the response).
d) Indirect questions might be useful in delicate situations. (For example, “You mentioned that the
first officer was uneasy about flying that approach. Why?”)
A good closing to the interview includes:
a) Summary of the key points.
b) An opportunity for the interviewee to expand on any points previously covered, or to add further
points.
c) Reassurance to the interviewee and thanks for cooperation.
d) Determination of availability for further interviews (if required).
QUESTIONING TECHNIQUES
Following are some of the common traps interviewers may encounter.
a) Avoid questions with the definite article unless the object in question has already been mentioned
by the witness.
1) Use: “Did you see A broken strut?”
2) Not: “Did you see THE broken strut?”
b) When asking a question, avoid leading questions, i.e. any question that contains the answer.
Instead, use neutral sentences without adjectives or figurative verbs.
1) Use: “Which way was the aircraft travelling?”
2) Not: “Was the aircraft travelling west?”
c) A question which mentions some object (whether the object existed or not) causes a tendency for
a witness to assert that he/she saw the object. Design your questions so that they do not mention
objects before the interviewee mentions them.
d) If you have to ask questions of a personal nature, it is even more important to ask indirect
questions. For example:
1) Use: “Was there anything upsetting you on the day of the occurrence?”
2) Not: “How did your marital situation affect you that day?”
ADDITIONAL GUIDANCE
Following are some additional hints for effective interviews:
a) The interview is a dynamic process that requires continuing adaptation to the situation and to the
interviewee.
b) People approach any occurrence or situation from different perspectives.
c) Remain objective and avoid making evaluations early in the interview; concentrate on the
questions to be asked.
d) Be aware of possible biases when assessing what was said during the interview.
e) Do not allow the interviewee’s personality to influence interpretation of the interview.
f) Do not accept any information gained in an interview at face value. Use it to confirm, clarify, or
supplement information from other sources.
In some circumstances there may be many witnesses to be interviewed. The resultant (often conflicting)
information must be summarized, sorted and compiled in a useful format. For example:
a) Prepare a summary of each interview.
b) If appropriate, prepare a data-matrix.
c) Summarize each interview by consolidating information under meaningful headings.
d) Write an overall description from each set of summaries.
e) Do not draw conclusions from evidence which is too inconsistent to support them.
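The data-matrix mentioned above is simply a grid of witnesses against topics, used to expose agreement and conflict at a glance. The following sketch is purely illustrative; the witness names, topics and statements are invented:

```python
# A witness-by-topic matrix for sorting interview information.
witnesses = ["Witness A", "Witness B", "Witness C"]
topics = ["aircraft heading", "engine sound", "weather"]

# Each cell holds that witness's statement on that topic (None if
# the witness said nothing about it).
matrix = {w: {t: None for t in topics} for w in witnesses}
matrix["Witness A"]["aircraft heading"] = "travelling west"
matrix["Witness B"]["aircraft heading"] = "travelling north-west"
matrix["Witness C"]["engine sound"] = "intermittent, sputtering"

def statements_on(topic):
    """Consolidate every statement recorded against one topic."""
    return {w: row[topic] for w, row in matrix.items() if row[topic]}
```

Reading across a row summarizes one interview; reading down a column (as `statements_on` does) collects all, possibly conflicting, testimony on one point.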
TEN COMMANDMENTS OF GOOD LISTENING
Good interviews require good listening skills.
1. Stop talking: You cannot listen if you are talking. “Give every man thy ear, but few thy
voice” (Polonius).
2. Put the talker at ease: Help witnesses feel that they are free to talk.
3. Show them you want to listen: Look and act interested. (For example: Do not review documents
while they talk.)
4. Remove distractions: Do not doodle, tap, or shuffle papers.
5. Empathize with them: Try to put yourself in their place so that you can see their points of view.
6. Be patient: Allow plenty of time. Do not interrupt.
7. Hold your temper: An angry person gets the wrong meaning from words.
8. Do not criticize or embarrass: Criticism puts respondents on the defensive. Do not embarrass
witnesses by commenting on their lack of technical knowledge, education, choice of words. They
may “clam up” or get angry. Do not argue: even if you win, you lose.
9. Ask questions: Asking questions encourages the respondent and shows that you are listening. It
also helps to develop points further.
10. Stop talking: This is first and last because all the other commandments depend on it. You cannot
do a good listening job while you are talking.
RECORDING THE INTERVIEW
Each interview should be documented for future reference. Records may consist of transcripts, interview
summaries, notes and recordings (if the interview was taped).
For major (State) investigations, complete transcripts of interview testimony may be required. A qualified
stenographer should be hired specifically to provide a verbatim record of the interview which must not be
edited.
For formal interviews, both the interviewer and the interviewee may agree that the interview be taped.
Recorded interviews may be summarized in whole or in part depending on the degree of detail required to
support investigative findings. Consider the following factors in summarizing a taped interview:
a) The meaning is not changed;
b) Subjective interpretations are not made;
c) The interviewee’s attitude is not obscured;
d) Significant testimony is not omitted; and
e) The nature of the summary is indicated in the record.
Those portions of the recorded interviews not relevant to the occurrence may be summarized.
The majority of interviews will not be recorded. However, a summary of the interview should still be
prepared, again ensuring that:
a) Subjective interpretations are not made;
b) The interviewee’s attitude is not obscured; and
c) Significant testimony is not omitted.
Information relevant to the hazard or situation under investigation derived from informal interviews or
conversations may be summarized in concise notes.
————————
Appendix 2 to Chapter 8
LINE OF REASONING
FOR
INVESTIGATING HUMAN ERRORS
Systematically investigating human errors requires an integrated line of inquiry. In attempting to identify
the underlying factors affecting human behaviour which may have led to an occurrence, the SM (or
investigator) might use the following reasoning:
a) Was the behaviour or decision intentional or unintentional?
b) If it was unintentional:
1) Was it a slip, whereby the person knew what was required but inadvertently did
something else (e.g. selected the wrong frequency)?
2) Was it a lapse whereby the person again knew what was required, but forgot (e.g. missed
an altitude call-out)?
3) Was the unintentional behaviour the result of:
— Unfamiliarity with a new procedure or new work environment?
— Lack of skill or currency?
— Inattention (due to distraction, fatigue, pre-occupation, complacency, etc.)?
c) If it was intentional:
1) Was it a genuine mistake (perhaps due to an inappropriate plan)?
2) Was it a deliberate violation of prescribed operating procedures?
3) Were any mistakes in decision-making due to:
— Inadequate knowledge for the conditions?
— Ambiguous or inadequate rules governing the procedure?
— Biases in decision-making?
— Memory problems?
4) Were any deliberate violations due to:
— Misunderstanding of correct procedures?
— Adaptations of SOPs (i.e. routine application of a non-standard, but accepted,
practice)?
— Corporate climate of indifference?
— Exceptional deviation due to actual conditions (e.g. weather, aircraft serviceability,
security issue)?
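The top levels of this line of reasoning can be sketched as a simple classifier, useful when tagging occurrence records in a safety database. This is a minimal sketch; the function and parameter names are illustrative, not part of any standard taxonomy:

```python
def classify_behaviour(intentional: bool,
                       knew_but_did_other: bool = False,
                       knew_but_forgot: bool = False,
                       deliberate_violation: bool = False) -> str:
    """Classify an erroneous behaviour following the appendix's
    line of reasoning: intent first, then the sub-questions."""
    if not intentional:
        if knew_but_did_other:
            return "slip"          # knew what was required, did something else
        if knew_but_forgot:
            return "lapse"         # knew what was required, but forgot
        return "other unintentional error"
    if deliberate_violation:
        return "violation"         # deliberate deviation from procedures
    return "mistake"               # intended act, inappropriate plan
```

The deeper questions (unfamiliarity, inattention, biases, corporate climate, etc.) remain judgement calls for the investigator; only the branching structure lends itself to coding.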
————————
Appendix 3 to Chapter 8
INVESTIGATING PROCEDURAL DEVIATIONS
WITH
PEAT AND MEDA
PEAT33 is designed to provide airlines with a tool for safety and risk management. It aims at enhancing
reliability, consistency and effectiveness of the investigation process. The PEAT form is designed to
facilitate the investigation of events involving non-adherence to procedures. The primary focus is to find
out why a serious event occurred and if a procedural deviation was involved.
The PEAT methodology comprises three elements:
Process. PEAT provides a structured process that guides investigators through the identification of
key contributing factors and the development of effective recommendations aimed at the elimination
of similar errors in future. This includes collecting information about the event, analysing the event
for errors, classifying the error and identifying preliminary recommendations.
Data Storage. To facilitate analysis, PEAT provides a database for the storage of procedurally related
data. Although designed as a structured tool, PEAT also provides the flexibility to analyse narrative
information as needed. The database also facilitates tracking progress in addressing issues revealed by
PEAT analyses and the identification of emerging trends.
Analysis. Using the PEAT tool in a typical analysis of a procedurally related event, a trained
investigator will consider the following areas and assess their significance in contributing to flight
crew decision errors:
a) Flight Phase where the error occurred;
b) Equipment factors;
1) Role of automation;
2) Flight deck indications;
3) Aircraft configuration;
c) Other stimuli (beyond immediate indications);
d) Environmental factors;
e) Procedure from which the error resulted:
1) Status of the procedure;
2) Onboard source of the procedure;
3) Procedural factors (e.g. negative transfer, impractical, complexity);
33. For further information on PEAT, see www.boeing.com/commercial/flighttechservices/ftssafety.html.
4) Crew interpretation of the relevant procedure;
f) Current policies and guidelines aimed at prevention of the event;
g) Crew factors:
1) Crew interpretation of the relevant procedure;
2) Crew intentions;
3) Crew understanding of situation at the time of the procedural deviation;
4) Situational awareness factors (e.g. vigilance, attention, etc.);
5) Factors affecting individual performance (e.g. fatigue, workload, etc.);
6) Personal and corporate stressors, management or peer pressures, etc;
7) Crew coordination/communication; and
8) Technical knowledge/skills/experience.
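A database entry supporting this kind of analysis could be captured in a simple structured record. The sketch below is hypothetical; the field names are illustrative and do not reflect the actual PEAT schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProceduralEvent:
    """Illustrative record for one procedurally related event."""
    flight_phase: str
    procedure: str
    error_description: str
    contributing_factors: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

# Invented example data, for illustration only.
event = ProceduralEvent(
    flight_phase="approach",
    procedure="stabilized approach criteria",
    error_description="late configuration",
    contributing_factors=["time pressure", "ambiguous SOP wording"],
)
# Recommendations are added as the analysis matures.
event.recommendations.append("clarify SOP wording")
```

Storing events in a common structure like this is what makes the later trend identification across many analyses possible.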
MEDA34
Maintenance and inspection errors remain a significant cause of aircraft accidents worldwide. The
percentage of commercial aircraft accidents in which maintenance error was a primary cause has risen
steeply in recent years. Again, it is accepted that the same type of factors that contribute to accidents can
be contributory to incidents and lesser safety events. Consequently, aviation authorities are seeking
methods for better managing maintenance errors.
MEDA is designed to systematically investigate the causes of maintenance errors that lead to occurrences
involving equipment damage or personal injury or to unplanned events (such as a return to the gate, a
flight cancellation or delay, a diversion, etc.). MEDA is discussed in greater detail in Chapter 19 (Aircraft
Maintenance).
__________________
34. For further information on MEDA, see www.boeing.com/commercial/flighttechservices/ftssafety.html.
Part 2
Chapter 9 – Safety Analysis and Studies
DRAFT
Chapter 9
SAFETY ANALYSIS AND STUDIES
Outline
Introduction
 ICAO requirement
 Safety analysis – what is it?
 Analytical thinking
 Reasoning
— Objectivity and bias
— Challenge process
Information sources
Analytical methods and tools
 General methods and tools
— Statistical analyses
— Trend analysis
— Normative comparisons
— Simulation and testing
— Expert panel
— Cost-benefit analysis
 Specialized methods and tools
— Event reporting and analysis systems
— Occurrence investigation and analysis
— Human factors analysis
— Flight data analysis
— Other analytical tools
Analysis process
 Data retrieval
 Sorting the data
 Analysing the data
 Drawing conclusions
 Common analytical errors
Safety studies
 Selecting study issues
 Information gathering
 Participation
Significant Issues Lists (SIL)
Appendices
1. Understanding bias
2. Basics of statistical analysis
Chapter 9
SAFETY ANALYSIS AND STUDIES
INTRODUCTION
Safety investigations and the various hazard identification programmes collect and record voluminous
safety data; however, meaningful and supportable conclusions for accident prevention can only be
reached through safety analysis. Reducing the data to simple statistics serves little useful purpose
unless the practical significance of those statistics in defining a resolvable problem is evaluated.
Often the expertise of a number of disciplines is required to analyse identified hazards, following the
collaborative and iterative steps of risk management. During the risk assessment phase, the data are
analysed, the probability and severity of risks evaluated, and the degree of acceptability of the risks
determined.
Ideally, this analysis process proceeds with certainty. In reality, facts are often less than “certain”. Some
of the data may be contradictory, other data may be biased or misrepresented and much of the information
needed will be missing. Such problems defy easy resolution.
ICAO requirement35
ICAO recognizes the linkages between sound safety analysis and accident prevention. ICAO promotes
accident prevention by the analysis of accident and incident data and by the prompt exchange of safety
information. Having established an accident and incident database and an incident reporting system,
States are required to analyse the information contained in their accident/incident reports and their
databases to determine any preventive actions required. ICAO also recognizes the value of safety studies
in recommending changes needed for accident prevention.
Safety analysis — what is it?
Analysis is the process of organizing facts using specific methods, tools or techniques. It may be used to:
a) Verify the utility and limitations of available data;
b) Assist in deciding what additional facts are needed;
c) Establish consistency, validity and logic;
d) Ascertain causal and contributory factors;
e) Assist in reaching valid conclusions; etc.
35. See Annex 13 — Aircraft Accident and Incident Investigation.
Safety analysis is based on factual information, possibly from several sources. Relevant data must be
collected, sorted and stored in such a manner that it is easily retrievable. Analytical methods and tools
suitable to the analysis are then selected and applied.
Safety analysis may involve the use of mathematical calculations and/or models to simulate particular
events. Safety analysis is often iterative, requiring multiple cycles. It may be quantitative or qualitative.
The absence of quantitative baseline data may force a reliance on more qualitative methods of analysis.
Regardless, credible safety analysis must evaluate all hypotheses and remain objective, free from
conjecture, emotions and politics.
Analytical thinking
Analytical thinking is the systematic way by which order is drawn from all the available information.
Through analytical thinking, conclusions are reached based on data or known information that a particular
event occurred (or might occur). Typically, four activities are involved:
a) Observation (collecting and studying data relevant to the safety problem);
b) Hypothesis (theorizing on any patterns noted in the data);
c) Prediction (concluding what might happen, if the hypothesis under consideration is valid); and
d) Verification (collecting new facts to test the predictions based on the hypothesis).
Reasoning
Credible safety analysis is built upon logical (unemotional) reasoning. The reasoning necessary for the
reconstruction of an accident sequence differs from the kinds of “what if” reasoning used in safety
assessments. Following are some of the reasoning processes typically encountered in safety analysis:
a) Logic. Logic is implicit in all reasoning. It helps us link seemingly unrelated information into
meaningful patterns. For example, logic is used to support conclusions that, given these
circumstances, this event will happen with certainty. Logic is used to determine what conditions
are necessary for an event to occur and to what extent these conditions are sufficient for the event
to occur. For example, fuel is necessary for a fire; however, the actual amount of fuel present in
an occurrence may not have been sufficient to cause a fire of the size which occurred.
b) Deduction. There are two principal logic processes used in everyday life: deduction and
induction. Deductive logic goes from the broadly known down to narrow or specific conclusions.
Specific facts, theories or events are used to explain how various information pieces are tied
together (that is, their interrelationships). In reconstructing the chronology and causes of an
occurrence or event, deductive reasoning is required to move from the generally known to form
increasingly specific conclusions. This process is essential in understanding why and how
something happened.
c) Induction. Induction is a process of drawing general conclusions from specific observations or experience.
Induction implies that what happened once will happen again under the same circumstances. For
effective safety analysis, inductive reasoning should be based upon a large body of consistent
information; isolated events may not be sufficient to form a general conclusion. Inductive
conclusions are really just theories which can only be confirmed through practical experience.
d) Hypothesis. Early in the analysis process, hypotheses or scenarios are created to attempt to
connect the common elements of available data. They require imagination. Preconceived notions
are abandoned and as many hypotheses as possible should be considered. No hypothesis can be
accepted as truth until it has been tested against available data and information. A hypothesis
should fit the facts; the urge to make the facts fit the hypothesis must be resisted. As the analysis
proceeds, initial hypotheses may have to be abandoned in view of contrary data and information.
(However, new data may subsequently come to light to support reconsidering a discarded
hypothesis.)
Objectivity and bias
Not all information is equally reliable. It is necessary to treat information with judgement; a high degree
of skepticism is appropriate. An open mind is required to give due consideration to all relevant
information. Notwithstanding the desire to remain objective, time does not always permit the collection
and careful evaluation of sufficient data to ensure objectivity. Intuitive conclusions may be reached which
are not consistent with the objectivity required for credible safety analysis.
We are all subject to some level of bias in our judgement (and therefore, in the weighting and evaluation
of information). Past experience will influence our judgement, as well as our creativity in establishing
hypotheses. One of the most frequent forms of judgement error is known as “confirmation bias”. This is
the tendency to seek and retain information that confirms what we already believe to be true. As soon as
we create a hypothesis, we become susceptible to confirmation bias.
Appendix 1 to this chapter includes a discussion of several concepts of bias that are relevant to the
drawing of questionable conclusions in safety analysis.
Challenge process
Fundamental to strong analytical thinking is a challenge process. For any analysis to be credible, it must
have gone through an introspective process, starting with rigorous review by peers. The conclusions of
safety analyses must stand up to the challenges by managers/specialists/experts. Only if “severe
self-criticism” has been exercised throughout the analytical process will the argument for change stand up to
the rigours of external challenge. Anyone engaged in safety analysis should encourage constructive
criticism as a normal part of their work.
INFORMATION SOURCES
Much information is available from a variety of sources to support effective safety analysis. However,
new problems are continually being introduced by changes in technology, environmental concerns and
operational conditions. Thus the past may not be an ideal indication of future problems.
The following are the more widely used sources of safety information:
a) Review of current databases including:
1) Investigation reports of accidents and incidents;
2) Safety database(s), both internal and external to the organization. These include
mandatory occurrence reporting records (e.g. ADREP), service difficulty reports, industry
safety data exchanges (such as STEADES36), etc.;
3) Voluntary incident reports (recognizing that such information is anecdotal and may not
be statistically valid);
4) Operational monitoring systems such as FDA and LOSA; and
5) Engine conditions monitoring programmes.
b) Review of regulations (those of the governing State as well as of other States) to help identify
inadequate safety defences. (Note that some of these regulations may be out of date, incomplete,
ambiguous and even contradictory);
c) Review of current corporate situation including:
1) Safety reports and committee minutes;
2) Workplace opinions; and
3) Safety assessment and audit reports;
d) Specialist advice from recognized industry experts such as manufacturer's representatives or
academics (e.g. sleep researchers). Their perspective may be invaluable in defining the scope and
potential consequences of a particular hazardous situation; and
e) Literature searches (professional and industry journals and published academic papers) often
provide an appreciation of the underlying causes and effects of particular hazards.
Indeed, so much safety-related information is available that managers and regulatory officials may face
an over-abundance of information — not all of which is useful. Care is required to select only
data and information that are valid, reliable and relevant for the purposes of the safety analysis.
ANALYTICAL METHODS AND TOOLS
Having acquired the information needed to support a safety analysis, the choice of available analytical
methods and tools is wide. Some of the methods used in safety analysis are automated, some are not.
Several software-based tools (requiring different levels of expertise for effective application) are also
available. Some of the relevant methods and tools are general, having broad application beyond the
specific needs of aviation safety; and some are unique to aviation.37
36. Administered by IATA.
37. The website for the Global Aviation Information Network (GAIN) includes descriptions of some of the industry’s best
practices with respect to methods and tools effective in aviation safety analysis (www.gainweb.org).
General methods and tools
Some of the available general methods and tools include:
Statistical analyses. Many of the analytical methods and tools used in safety analysis are built upon
statistical procedures and concepts. For example, risk analysis utilizes concepts of statistical
probability. Statistics play a major role in safety analysis by helping to quantify situations, providing
insight through numbers. This generates more credible results for a convincing safety argument.
The type of safety analysis conducted at the level of company safety management systems requires
basic skills for analysing numeric data, identifying trends and for making basic statistical
computations such as arithmetic means, percentiles and medians. Statistical methods are also useful
for graphical presentations of these analyses.
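As a minimal illustration of these basic computations, the following sketch uses Python's standard library; the monthly occurrence counts are hypothetical:

```python
import statistics

# Hypothetical monthly counts of reported occurrences over one year
monthly_counts = [4, 7, 5, 9, 6, 5, 8, 12, 6, 5, 7, 4]

mean = statistics.mean(monthly_counts)      # arithmetic mean
median = statistics.median(monthly_counts)  # middle value when sorted
# 90th percentile: the level below which roughly 90% of the months fall
p90 = statistics.quantiles(monthly_counts, n=10)[-1]

print(f"mean={mean:.2f} median={median} 90th percentile={p90}")
```

Any spreadsheet package offers equivalent functions; the point is that such summary measures condense a year of data into a few comparable numbers.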
Computers can handle the manipulation of large volumes of data. Most statistical analysis procedures
are available in commercial software packages (such as MS Excel). Using such applications, data can
be entered directly into a pre-programmed procedure. While a detailed understanding of the statistical
theory behind the technique is not necessary, the analyst should understand what the procedure does
and what the results are intended to convey.
While statistics are a powerful tool for safety analysis, they can also be misused, leading to erroneous
conclusions. Care must be taken in the selection and use of the data used in statistical analysis. To
ensure appropriate application of the more complex methods, the assistance of specialists in statistical
analysis may be required. Such specialists may be helpful in the following circumstances:
a) Conducting more complex statistical analytical procedures;
b) Developing sampling techniques;
c) Interpreting statistical outputs particularly when data samples are small;
d) Advising on the use of appropriate normative data;
e) Assisting in the use of specialized databases, extraction and analysis tools;
f) Detecting data corruption;
g) Advising on the use and interpretation of data from external sources, etc.; and
h) Consolidating data, checking its homogeneity and relevance.
A summary of some of the basics of statistical analysis is included in Appendix 2 to this chapter.
Trend analysis. By monitoring trends in safety data, predictions may be made about coming events.
Emerging trends may be indicative of embryonic hazards. Statistical methods can be used to assess
the significance of perceived trends. Upper and lower limits of acceptable performance may be
defined against which to compare current performance. Trend analysis can be used to trigger
“alarms” when performance is about to depart from accepted limits.
Desktop analysis software programmes, such as MS Excel, have the capability to support
many types of trend analysis (such as linear, exponential and forecasting).
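A minimal sketch of such an alarm, assuming hypothetical monthly event rates and using the common control-chart convention of limits at the mean plus or minus two standard deviations:

```python
import statistics

# Hypothetical monthly event rates (events per 1 000 flights)
rates = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.1, 2.5, 2.2, 2.0, 2.4, 3.4]

baseline = rates[:-1]                  # history used to set the limits
mean = statistics.mean(baseline)
sd = statistics.stdev(baseline)        # sample standard deviation
upper, lower = mean + 2 * sd, mean - 2 * sd

current = rates[-1]                    # latest month's performance
if not (lower <= current <= upper):
    print(f"ALARM: current rate {current} outside [{lower:.2f}, {upper:.2f}]")
```

Here the final month's rate falls outside the upper limit, triggering the alarm; the choice of two standard deviations is a convention, and each organization would set limits appropriate to its own data.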
Normative comparisons. Sufficient data may not be available to provide a factual basis against which
to compare the circumstances of the event or situation under examination with everyday experience
(which routinely copes with the same conditions). The absence of credible normative data often
compromises the utility of safety analyses. In such cases, it may be necessary to sample real world
experience under similar operating conditions, using such methods as:
a) Direct observation of skilled personnel performing similar tasks under similar conditions, or
observation of like components in service;
b) Sampling through the use of questionnaires, surveys or interviews of employees engaged in
similar activities; and
c) Simulation and testing.
Note: By sampling actual flight operations, FDA and LOSA programmes provide much useful
normative data for the analysis of unsafe practices in flight operations. Both these programmes are
discussed in Part 4 of this manual.
Simulation and testing. In particularly difficult cases, the underlying safety hazard may become more
evident through testing. For example, laboratory testing may be required for analysing material
defects. For suspect operational procedures, simulation in the field under actual operating conditions,
or in a simulator may be warranted.
Expert panel. Given the diverse nature of safety hazards, the variables in assessing the inherent risks,
and the different perspectives possible in evaluating any particular unsafe condition, the views of
others should be sought, including peers and specialists. When a multidisciplinary team is formed to
evaluate the evidence of an unsafe condition, it can also assist in identifying and evaluating the best
course for corrective action. (The safety assessment process described in Chapter 13 depends on
hazard identification by a multi-disciplinary group.)
Cost-benefit analysis. The acceptance of recommended risk control measures may be dependent on
credible cost-benefit analyses. The costs of implementing the proposed measure are weighed against
the expected benefits over time. Indeed, cost-benefit analysis may suggest that accepting the risk is
preferable to the time, effort and cost of implementing corrective action.
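A minimal sketch of such a weighing, with all figures hypothetical: the expected annual benefits are discounted over the horizon and their present value compared with the implementation cost.

```python
# Hypothetical figures for a proposed risk control measure
implementation_cost = 250_000   # one-time cost of the measure
annual_benefit = 60_000         # expected yearly savings (avoided losses)
horizon_years = 5
discount_rate = 0.05            # accounts for the time value of money

# Present value of the benefit stream over the horizon
pv_benefits = sum(
    annual_benefit / (1 + discount_rate) ** year
    for year in range(1, horizon_years + 1)
)

net_benefit = pv_benefits - implementation_cost
print(f"net benefit over {horizon_years} years: {net_benefit:,.0f}")
```

A negative net benefit would support the observation above: accepting the risk may be preferable to the cost of corrective action.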
Specialized methods and tools
In addition to the methods and tools having general application for safety analysis, specialized methods
and tools have proven their value and are regularly used by the aviation community throughout the world.
GAIN and the OFSH both provide information and guidance on such analytical methods and tools.38 Such
inventories of methods and tools are designed to help safety analysts identify existing computer
programmes and/or methodologies that can be used to turn aviation safety data into safety information
that is suitable for decision-making.
38. Operational Flight Safety Handbook (OFSH), Appendix C, and Global Aviation Information Network (GAIN) Working Group B.
The level of complexity and sophistication of these methods and tools varies widely. Some go well
beyond desktop PC skills and capabilities. Thus, the equipment and skill requirements for successfully
applying them vary.
Event reporting and analysis systems include a capability for analysing occurrence data. BASIS
(originally designed as an easy-to-use reporting system for safety events) is a good example of such a
system. Add-on BASIS modules permit analysis beyond the data of the event reporting capability to
include analysing human factors and inflight recordings.39
Occurrence investigation and analysis. Occurrence investigation increasingly employs a variety of
standard analytical methods and tools. By adhering to a systematic methodology, consistency in the
identification of hazards and assessment of risks is enhanced. Some of the tools not only help identify
root causes of occurrences, but also tie-in to organizational occurrence databases thereby facilitating
trend analysis.
Human factors analysis. Linked closely to the methods and tools coming into use for occurrence
investigation and analysis, are specialized methods and tools for better understanding the impact of
human factors on specific occurrences, as well as families of similar occurrences.
Flight data analysis. Given the rapid evolution of FDA (FOQA) programmes, there has been a
parallel development of analytical methods and tools for capitalizing on data recorded during a flight,
permitting trend analysis and the identification of systemic hazards. With the advent of LOSA,
methods and tools are being developed to integrate the data from FDA and LOSA.
Other analytical tools. A number of other analytical tools, not specific to aviation, are available;
however, when pursuing these specialized types of analysis, it would be wise to utilize an experienced analyst:
a) Events and Causal Factor (E&CF) Analysis;
b) Change Analysis;
c) Hazard-Barrier-Target (HBT) Analysis; and
d) Fault Tree Analysis (FTA), etc.
ANALYSIS PROCESS
Regardless of the analytical tools selected, the safety analysis process involves some or all of the
following activities.
Data retrieval
Safety analysis generally involves an examination of many data points. Different search strategies may be
required to get the appropriate data and care is required in determining the search criteria.40 Two types of
errors may occur during data retrieval that may compromise the validity of the safety analysis:
39. For further information on BASIS, visit the BASIS website at http://www.winbasis.com/
40. See Chapter 15 (Practical Considerations) for a discussion of data quality.
False positive returns may appear appropriate but on further examination are found to be irrelevant;
and
False negative returns are relevant data that are not identified during the retrieval process.
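The two error types can be illustrated with a deliberately naive keyword search over a few hypothetical occurrence records:

```python
# Hypothetical occurrence records; 'narrative' is free text
records = [
    {"id": 1, "narrative": "runway incursion by vehicle during taxi"},
    {"id": 2, "narrative": "bird strike on runway 27 after landing"},
    {"id": 3, "narrative": "aircraft crossed the hold-short line without clearance"},
]

# A naive keyword search intended to find runway incursions...
hits = [r for r in records if "runway" in r["narrative"]]

# ...returns record 2 (a false positive: a bird strike, not an incursion)
# and misses record 3 (a false negative: an incursion worded differently).
print([r["id"] for r in hits])
```

Refining the search criteria (synonyms, event-type codes, manual review of returns) reduces both error types, at the cost of more retrieval effort.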
Sorting the data
The data retrieved must be sorted and arranged in a way that will support the analysis. Irrelevant data
(such as false positives) need to be rejected. Judgements are thus made as to the adequacy of the
remaining data. If necessary, additional information is sought, perhaps using different methods or tools.
Analysing the data
Having retrieved and compiled the requisite data, the most appropriate analytical methods and tools and
analytical thought processes are applied.
Safety analysis is often an iterative process, sometimes requiring new or additional data or the selection of
alternative methods or tools. Hazards are sometimes not visible unless viewed in a particular way. Many
cross sections through the data may be required, examining the data from different perspectives. For
example, it may not be possible to uncover a safety problem when looking at all recorded events, but a
distribution by location may reveal an unacceptable number of occurrences at a particular location, while
the other locations have a good safety record.
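The example above can be sketched with hypothetical records, where an overall count reveals little but a distribution by location singles out one aerodrome:

```python
from collections import Counter

# Hypothetical occurrence records: (event type, location code)
events = [
    ("altitude deviation", "AAA"), ("altitude deviation", "BBB"),
    ("altitude deviation", "BBB"), ("altitude deviation", "CCC"),
    ("altitude deviation", "BBB"), ("altitude deviation", "BBB"),
]

# Looking at all recorded events together shows only a total...
print(len(events))

# ...but a cross section by location reveals a concentration at one site
by_location = Counter(loc for _, loc in events)
print(by_location.most_common(1))   # location with the most occurrences
```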
Drawing conclusions
The analyst begins to form impressions from the outset. Preliminary findings are made as to what is
known with certainty, the relationships between facts, areas where insufficient or inadequate data have
been found, etc. Care must be taken to avoid jumping to premature conclusions based on strongly held
impressions or beliefs. The analyst should seek to maximize certainty by ensuring an objective and
comprehensive understanding of the facts. Preliminary conclusions must withstand the test of such
questions as “why?” and “so what?”.
Caution is advised when relying largely on occurrence data. Safety databases generally contain only
reported information; little data is available on events which were not reported. For example, a database
may hold more reports of traffic conflicts at airports with air traffic control towers than at airports
without towers. This does not imply that the towers are the cause of the conflicts. Airports with towers
generally have more traffic, providing more opportunities for aircraft to come too close to each other.
Also, the controller and/or pilots are more likely to report such occurrences.
Common analytical errors
The following are some common pitfalls in analytical thinking:
a) Basing a general conclusion on too little evidence, e.g. using words such as "always" or "never";
b) Drawing premature conclusions from insufficient data;
c) Stacking the evidence by withholding facts so that the evidence points to only one conclusion;
d) Linking two events as if one caused the other when the relationship between them may be more
complex;
e) Assuming that, because one event follows another, it was caused by the first event;
f) Assuming that a complicated question has only two possible answers;
g) Drawing a conclusion that bears no logical relation to the facts (the “apples and oranges”
argument);
h) Suggesting that, because two things or situations share some similarities, they must be alike in
other ways; and
i) Relying on an expert for issues beyond that person’s qualifications and experience.
SAFETY STUDIES
Some complex or pervasive safety issues can best be understood through an examination in the broadest
possible context. At the top of the aviation safety spectrum, systemic safety concerns may be addressed
on an industry-wide or a global scale. For example, the industry collectively has been concerned with the
frequency and severity of approach and landing accidents and has undertaken major studies, made many
safety recommendations and implemented global measures to reduce the risks of accidents during the
critical approach and landing phases of flight.
The convincing argument necessary to achieve large or systemic changes requires significant data,
information, analysis and effective communication. A safety argument based on isolated occurrences and
anecdotal information will fall on deaf ears. Data from many sources must be coherently integrated and
synthesized to permit drawing valid conclusions.
For the purposes of this Manual, these larger, more complex safety analyses are referred to as Safety
Studies. The term includes many types of studies and analysis conducted by State authorities, airlines,
manufacturers, and professional and industry associations. ICAO recognizes that safety recommendations
may arise not only from the investigation of accidents and serious incidents, but also from safety
studies.41 Safety studies have application in hazard identification and analysis relating to issues arising
from flight operations, maintenance, cabin safety, air traffic control, airport operations, etc.
Safety studies of industry-wide concerns generally require a major sponsor. The Flight Safety Foundation,
in collaboration with major aircraft manufacturers, ICAO, NASA and other key industry stakeholders has
taken a leading role in many such studies. Civil aviation authorities of specific States have also conducted
major safety studies, many identifying safety risks of global interest. Several State authorities have also
used safety studies for identifying and resolving hazards in their national aviation systems. Although it is
unlikely that small or medium-sized operators would undertake a major safety study, large operators and
regulatory officials may well be involved in identifying widespread systemic safety issues through such
means.
41. See Annex 13, Chapter 8.
Consistent with normal safety management processes, safety studies and safety surveys typically include
the following steps:
a) Information gathering;
b) Recording of pertinent data;
c) Preliminary analysis and hazard identification;
d) Risk assessment, including prioritization of risks;
e) Development of risk control strategies;
f) Implementation of preferred risk control options; and
g) Monitoring and evaluation to determine the effectiveness of the actions taken, the residual risks
and any further action required.
Selecting study issues
Large operators, manufacturers, safety organizations and regulatory authorities may maintain a list of
significant safety issues. (The topic of maintaining a Significant Issues List (SIL) is described later in this
Chapter.) Such lists may be based on the accident and incident record in such areas as runway incursions,
ground proximity warnings, TCAS advisories, etc. These issues may be prioritized in terms of the risks to
the organization or the industry.
Given the degree of collaboration and information-sharing necessary to conduct an effective safety study,
issues selected for study must enjoy a broad base of support among participants and contributors.
Information gathering
The information sources cited earlier with respect to safety analysis are equally applicable for safety
studies. Several additional methods (outlined below) are available for acquiring the relevant information
necessary to support the broad-based analysis of a safety study.
Review of occurrence records. Investigated occurrences may be reviewed by selecting those
occurrences which meet some pre-defined characteristics such as Controlled Flight Into Terrain, cabin
fires or crew fatigue. By reviewing all available material on file, specific elements may be identified
that are suitable for further analysis.
Structured interviews. Much useful information can be acquired through structured interviews. Care
must be taken in selecting the interviewees to avoid biasing the results. Starting from a series of
carefully crafted and sequenced questions, the interviewer can probe particular aspects as deeply as
warranted. Structured interviews depend very much on trust between the parties and must be
conducted in a non-punitive environment. While they can be time-consuming, such interviews offer
the potential for acquiring quality information, even though it may not be a statistically representative
sample. Success with structured interviews will depend on the quality of the questions and follow-up
questions, and the ability of the analyst to reduce much anecdotal information to useful data for
analytical purposes.
Directed field investigations of relatively insignificant occurrences (which might normally not be
investigated) may uncover sufficient additional information to permit a deeper analysis than that
obtained from mandatory reporting systems. By investigating a sample of like-occurrences over a
defined period, specific information can be collected in a structured way. Although few of these
investigations, considered individually, would contribute much to the collective knowledge of the
factors underlying such occurrences, taken together they may reveal behavioural patterns which are
compromising safe operations. Field investigations, supplemented by some of the
other methods described here, provide an efficient means for validating systemic risks.
Literature search. Whether the safety issues under examination have to do with particular equipment,
technology, maintenance, human performance, environmental factors, or organizational and
management issues, undoubtedly much has already been written on the subject by “experts”. Safety
analysts conducting major studies require library search skills to locate and review the most
compelling works written by scholars and other subject matter experts. Careful use of the Internet
provides a vast source of knowledge. Prior to commencing a safety study, it may be appropriate to
carry out a literature search on the issue under consideration. This search may be useful in deciding
whether any further action is required; specifically, whether it is a safety issue to which meaningful
value might be added.
Experts’ testimony. Direct contact with recognized subject matter experts may be warranted. For
example, if crew fatigue is the issue, it may be appropriate to consult directly with those conducting
scientific sleep research. Such experts may be contacted informally over the Internet or telephone;
they may also be invited to provide more formal input through submissions to a hearing or public
inquiry.
Public inquiries. For major safety issues that must be considered from many perspectives, State
authorities may convene some form of public inquiry. These provide all stakeholders, individually or
as representatives of particular interest groups, an opportunity to present their views in an open,
impartial process. However, prior to deciding upon convening a public inquiry, considerable
preparatory work is required to demonstrate the need for such an expenditure of resources. Care must
be taken in drafting the terms of reference for any public inquiry to ensure that the focus is on
identifying safety issues – not identifying persons to blame.
Hearings. Less formal meetings (than public inquiries) may be convened with a view to hearing the
different (and often divergent) views of the major aviation stakeholders, perhaps involving unions or
professional associations, regulatory authorities, operators’ and industry associations. As opposed to a
public inquiry, the stakeholders are heard in camera (or private); in this way, they may be more
candid in stating their positions.
Participation
To maximize the benefits from a safety study, the widest possible participation is advisable. Inputs from a
broad cross section of interest groups will ensure that all perspectives are considered. The solution to one
interest group’s problem may create problems and hazards for another. Similarly, the information
gathered should be examined from a multidisciplinary perspective.
SIGNIFICANT SAFETY ISSUES LIST (SIL)
Some State regulatory authorities, investigative agencies and large operators have found that maintaining
a list of high priority safety issues is an effective means for highlighting areas warranting further study
and analysis. These lists are sometimes referred to as the “Top Ten” or the “Most Wanted” lists. Such
lists prioritize those safety issues that put the aviation system (or the organization) at risk. As such, they
may be useful in identifying issues for safety assessment, for safety survey or safety study. If SILs are to
be of value in guiding the work of those involved in safety management, they must not chronicle every
perceived hazard. Thus, SILs should be limited to not more than ten issues.
Typical issues that may warrant inclusion on an SIL might include:
a) Frequency of GPWS warnings;
b) Frequency of TCAS advisories;
c) Runway incursions;
d) Altitude deviations (busts);
e) Call sign confusion;
f) Unstabilized approaches; and
g) Air proximities (near misses) at selected aerodromes, etc.
SILs should be reviewed and updated annually, adding new high-risk issues and deleting lesser risk
issues.
————————
Appendix 1 to Chapter 9
UNDERSTANDING BIAS42
Everyone’s judgement is shaped by his or her personal experience. Notwithstanding the quest for
objectivity, time does not always permit the collection and careful evaluation of sufficient data to ensure
objectivity. Based on a lifetime of personal experiences, we all develop mental models that generally
serve us well in quickly evaluating everyday situations, “intuitively” without a complete set of facts.
Unfortunately, many of these mental models reflect personal bias. Bias is the tendency to apply a
particular response regardless of the situation. Following are some of the basic biases that can affect the
validity of safety analyses:
Frequency bias: We tend to over- or under-estimate the probability of occurrence of a particular
event because our evaluation is based solely on our personal experience. We assume that our limited
experience is representative of the global situation.
Selectivity bias: Our personal preferences give us a tendency to select items based on a restricted core
of facts. We have a tendency to ignore those facts which do not quite fit the pattern we expect. We
may focus our attention on physically important characteristics, or obvious evidence (e.g. loud,
bright, recent) and ignore cues that might provide more relevant information about the nature of the
situation.
Familiarity bias: In any given situation, we tend to choose the most familiar solutions and patterns.
Those facts and processes which match our own mental models (or preconceived notions) are more
easily assimilated. We tend to do things in accordance with the patterns of our previous experience,
even if they are not the optimum solutions for the current situation, e.g. the route we pick to go
somewhere may not always be the most efficient under changing circumstances.
Experience can be valuable in helping us focus our attention on those things which are most likely to
be problematic, but we should recognize that in following these familiar patterns we may be
overlooking critical information. Today, management gurus exhort us to “think outside the box”.
Conformity bias: We have a tendency to look for results which support our decision rather than
information which would contradict it. As the strength of our mental model increases, we are
reluctant to accept facts which do not line up nicely with what we already “know”. Time pressures
can lead to erroneous assumptions that do not accurately reflect the current reality.
We tend to seek information that will confirm what we already believe to be true. Information that is
inconsistent with our chosen hypothesis is then ignored or discounted.
A frequently cited causal factor in aviation accidents is “expectancy”, i.e. individuals see what they
want to, or expect to see, and they hear what they want to, or expect to hear. Expectancy is a form of
conformity bias.
Group conformity or ‘Group think’: A variation on conformity bias is “group think”. Most of us
have a tendency to agree with majority decisions; we yield to group pressures to bring our own
thinking in line with that of the group. We do not want to break the group’s harmony by upsetting the
prevalent mental model. In the interests of expediency, it is a natural pattern to fall into.
42. Adapted from the Human Factors Guidelines for Safety Audits Manual (Doc 9806).
Overconfidence bias. There is a tendency for people to overestimate their knowledge of the situation
and its outcome. The result is that attention is placed only on information that supports their choice
and ignores contradictory evidence. The defining characteristic of an overconfidence bias is that
attention is given to certain information because an individual overrates his/her knowledge of the
actual situation. Without the tempering afforded by on-the-job experiences, an inexperienced
individual may overrate the value of "classroom" theory versus the more "work-shop"-oriented
knowledge used by peers. On the other hand, more seasoned personnel may also let overconfidence
bias affect their judgement, having “seen it all before”.
————————
Appendix 2 to Chapter 9
BASICS OF STATISTICAL ANALYSIS
Purpose
This précis provides supplemental material in support of statistical analysis for safety management.
Data collection
To begin an analysis, a cursory review of overall counts of the available data and simple
cross-tabulations of the data is normally conducted. However, the data that are readily available through
routine databases may be insufficient to carry out a credible analysis or risk assessment. Specific
additional data may be required. Two methods for acquiring the extra data are:
Sampling is one way of acquiring sufficient information for a valid analysis. If a sample of data
representative of the larger population of data is compiled, credible conclusions may be drawn from
the statistical analysis. Some of the considerations to be taken into account in preparing a valid
sample include:
a) Size of the sample (the larger the sample size, the more precise the conclusion);
b) Selection method (e.g. random samples produce unbiased estimates);
c) Representativeness of data (to avoid comparing apples with oranges);
d) Homogeneity of sample data (in terms of data definitions, time-frames, operating conditions,
etc.); and
e) Completeness of data (sufficient to be truly representative of the full population).
Because of the resource implications, the sampling process must be limited to only that data
necessary to support the scope of the analysis. It may be limited in terms of:
a) Duration (e.g. collect data for one year);
b) Area (e.g. limit sample to a particular location or fleet); and
c) Data type (e.g. use readily obtainable numeric data vs. deriving numbers from narratives).
If the sample is a valid representation of the population, the statistical analysis should yield credible
conclusions on all like-occurrences at a much lower cost than observing or testing every item in the
population.
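A minimal sketch of the idea, using a hypothetical population of flight records: a modest random sample yields an estimate close to the population mean at a fraction of the cost of examining every record.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: 10 000 flight records, each with a taxi-out
# time in minutes (normally distributed for the purposes of the sketch)
population = [random.gauss(15.0, 3.0) for _ in range(10_000)]

# A simple random sample produces an unbiased estimate of the mean
sample = random.sample(population, 200)
estimate = sum(sample) / len(sample)

full_mean = sum(population) / len(population)
print(f"sample estimate {estimate:.2f} vs population mean {full_mean:.2f}")
```

The larger the sample, the closer the estimate tends to be to the population value, which is the trade-off between precision and collection cost noted above.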
Surveys. One way of sampling a large population is through surveys or questionnaires. In addition to
the sampling principles described above, some other considerations for survey design include:
a) Population to be surveyed? How are they to be selected?
b) Sample size (vs. accuracy requirements)?
c) When? (not after an event that could influence the results);
d) Sampling method? (Contacting participants? Format (oral/written)? Tick box vs. narrative?)
e) Questions to be asked? (Careful formulation of questions is necessary to get the information
really wanted, with non-ambiguous answers).
Surveys are also a useful means for the collection of normative data (which otherwise might not be
available).
Describing and presenting data
Quantitative information must be presented in a way suitable for facilitating analysis and promoting
understanding. Tables are usually the first step in organizing data and graphs provide the most vivid
presentation of the overall picture. However, the more specific aspects of the data contained therein such
as their averages, variability, and interrelationships are most succinctly summarized by appropriately
chosen numerical measures (such as means and medians).
Tabulation
Tables are most frequently used to organize data and present numeric information, cross tabulating
different variables. For example, events of a particular type by month, or by location, compared to
similar data for preceding years. Tables are suitable for summarizing data, identifying trends and
conveying analytical conclusions. Tables can also be used for examining single variables (incidents
vs. age of fleet) or multiple variables (incidents vs. age of fleet by year and by phase of flight).
The following considerations apply when preparing a table:
Orientation: A well-chosen title, row and column headings supplemented by explanatory
information and footnotes help to orient the reader. Assumptions and recognized anomalies in the
data should be provided to avoid incorrect conclusions. There should also be an indication of the
data sources.
Analysis: The overall features of the table should be examined to confirm that the table works.
For example, an appropriate measuring system and units have been used, overall sums/averages
are reasonable, and there is consistency between summary rows and columns (e.g. their
association follows logically and their variability is credible).
Graphical presentations
Graphs or charts summarize tabular data in a way that facilitates visualizing the relationship between
the variables. Graphs reveal facts about data that would otherwise require careful study to detect in a
table. It is important that the initial impression portrayed by a graph be an accurate impression. Here
too, some basic principles have to be kept in mind in their preparation to avoid misleading
conclusions. For example:
a) Omission of the line representing zero units on the vertical scale can significantly magnify a
change; if it must be omitted it should be clearly illustrated (perhaps by a jagged break).
b) Perspective diagrams (e.g. three-dimensional) can present a distorted representation of any
fluctuations.
c) Disproportionate scales and X/Y axes can exaggerate trends or relative changes.
d) Inappropriate labelling of scales, legends and titles can compromise the graph's credibility.
Many types of graphs can be used in presenting numeric data, depending on the nature of the data and
the purpose intended.
a) Line graphs are useful for illustrating time trends, with time on the horizontal axis.
b) Bar graphs allow several variables to be compared over time or against one another.
c) Scatter plots are used to graph data with two variables, the independent variable being on the
horizontal axis. These are sometimes used to develop a simple trend for activity data to get an
approximate forecast of an accident rate.
d) Pie charts describe relative proportions of the components of a data set.
e) Histograms, which look like bar charts, depict the distribution of data.
Arithmetic means and medians
Having organized and presented the data in a useful format, it is also important to understand their
relationships.
Arithmetic mean. The most commonly used numeric relationship is the arithmetic mean
(commonly referred to as the average). The arithmetic mean represents a complete set of data
through a single value. When data are arranged according to magnitude, the average lies
somewhere near the centre of the data set; therefore averages are often referred to as a measure of
centre. The median (discussed below) is also a measure of centre.
Median. Medians are useful for ranked data, for example people organized by age. The median
does not employ the actual numerical values of the data; rather, the median of a set of numbers is
the number located at the midpoint when they are arranged in increasing order. For example,
suppose that the ages of pilots involved in accidents range between 25 and 60 years. The mean age
may be computed to be 45, which might suggest that older pilots have more accidents. However,
the median age may be only 30, which shows that half the pilots involved in accidents are actually
under 30: the mean was influenced by a few pilots who were perhaps closer to 60.
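The pilot-age example can be reproduced numerically. The ages below are invented purely to show how a few extreme values pull the mean away from the median:

```python
import statistics

# Hypothetical ages of pilots involved in accidents: most are young,
# but two pilots near 60 pull the mean upwards
ages = [25, 26, 27, 28, 29, 30, 31, 33, 58, 60]

mean_age = statistics.mean(ages)      # influenced by the extreme values
median_age = statistics.median(ages)  # midpoint of the ranked ages

print(f"mean = {mean_age:.1f}, median = {median_age:.1f}")
# The median stays near 30 while the mean is several years higher
```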
Measures of variation of data
Measures of centre are usually incomplete and misleading without some accompanying indication of
how spread-out the data are. The degree to which data tend to spread about an average value is called
the variation or dispersion of the data. As the example of the mean age of pilots illustrates, a measure
of centre used alone lumps the minimum and maximum extreme cases together. Measures of dispersion
include percentiles, variance and standard deviation.
When using the median to measure centre, use is made of percentiles to indicate the variability or
spread. The nth percentile of a data set is a value such that n percent of the observations fall below it.
The median is therefore the 50th percentile; the 90th percentile is the value below which 90 out of 100
ranked observations fall.
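Percentile definitions vary slightly between textbooks and software packages; the sketch below uses the common nearest-rank convention, with invented flying-hour data:

```python
def percentile(data, n):
    """Value at or below which roughly n per cent of the ranked
    observations fall (nearest-rank method)."""
    ranked = sorted(data)
    # Nearest-rank index of the n-th percentile in the ranked list
    k = max(0, min(len(ranked) - 1, round(n / 100 * len(ranked)) - 1))
    return ranked[k]

# Invented annual flying hours for ten pilots
hours = [120, 150, 180, 200, 240, 260, 300, 340, 400, 900]
print(percentile(hours, 90))  # 90% of the ranked values fall at or below this
```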
Help from statistical specialists may be required to better understand what the variation in a particular
data set means.
Time series and trend analysis
A set of observations or measurements of the same variable made at different times, usually at equal
intervals, is called a time series. They can be represented pictorially by constructing a graph of the
variable versus time; for example, the number of incident reports received per month.
When the frequency of a particular event is measured over time, it is normal to note sequential
increases or decreases, known as runs. The question is whether a given run (up or down) is significant.
There are many formulas for evaluating runs; one such rule suggests that a run of seven or more
sequentially increasing or sequentially decreasing points be further analyzed. On the other hand,
shorter runs may also be informative, suggesting that somehow the system is compensating for the
change and returning to more “normal” levels.
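The "run of seven" rule of thumb can be checked mechanically. A minimal sketch follows; the threshold and the series of monthly report counts are illustrative only:

```python
def longest_run(counts):
    """Length of the longest strictly increasing or strictly
    decreasing run in a sequence of counts."""
    best = run = 1
    direction = 0  # +1 rising, -1 falling, 0 undecided
    for prev, cur in zip(counts, counts[1:]):
        step = (cur > prev) - (cur < prev)
        if step != 0 and step == direction:
            run += 1              # run continues in the same direction
        elif step != 0:
            direction, run = step, 2  # new run starts with this pair
        else:
            direction, run = 0, 1     # equal values break the run
        best = max(best, run)
    return best

# Invented monthly incident-report counts
monthly_reports = [4, 5, 5, 6, 7, 9, 10, 12, 13, 11]
if longest_run(monthly_reports) >= 7:
    print("run of seven or more: analyze further")
```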
Making valid comparisons
For useful conclusions to be drawn from statistics, valid comparisons of the various data are required.
For example, in comparing accident and incident statistics, more valid comparisons are obtained using
rate information than raw numbers of occurrences. If two types of aircraft are compared and type A flies
one million hours in one year resulting in one accident and type B flies five million hours in a year
resulting in five accidents, the accident rate based on hours flown is the same for both types (one accident
per one million flying hours).
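The type A/type B comparison above works out as follows; this is a minimal sketch of the rate calculation:

```python
def rate_per_million_hours(accidents, flying_hours):
    """Accidents per one million flying hours."""
    return accidents / flying_hours * 1_000_000

# Example from the text: two types with the same underlying rate
type_a = rate_per_million_hours(1, 1_000_000)   # 1 accident in 1M hours
type_b = rate_per_million_hours(5, 5_000_000)   # 5 accidents in 5M hours
print(type_a, type_b)  # both 1.0 accident per million flying hours
```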
Rates express the numerical proportions of two sets of data. In accident statistics this usually involves
numbers of accidents, incidents, injuries or damage as one set and some measure of exposure, such as
flying hours, or numbers of flights as the other. Generally, rates are suited to establishing the general
measure of safety of an operation, rather than evaluating specific prevention measures. For a rate to be
valid, the sets of data and time frame used must be compatible. For example, long and short haul
operators do not fly the same numbers of flights. Their flights are usually of different duration, using
aircraft that may have widely varying performance capabilities, and carrying differing numbers of
passengers. Therefore, a comparison of the relative safety of these two types of operation will largely
depend on whether the exposure base includes numbers of flights, flying hours, miles, passengers or some
combination of these.
Statistics must be used with caution. For instance, they may show that pilots in a certain age group or with
a certain number of flying hours have the most accidents. Arithmetically, these figures may be correct;
however, they do not by themselves show that the pilots in these groups are “less safe” than pilots in other
groups. Before such a determination can be made, it is necessary to determine the total number of pilots in
each age group, since the pilot group with the highest number of accidents may also contain the greatest number of pilots.
This principle needs to be considered whenever comparisons are made. Many accident statistics are of
little value for safety management because they provide no valid means of comparison.
Traditionally, airline safety statistics have used seat-kilometres (or miles) as an exposure base. With
wide-bodied aircraft the number of available seats has increased significantly although the passenger
loading can vary considerably. Also, the longer range of such aircraft means that they spend a greater
proportion of each flight in the cruise phase. Thus seat kilometres are not a particularly useful basis for
measuring safety. Most accidents occur during the take-off and landing phases, and every flight involves
one take-off and one landing, irrespective of other factors. In addition, any flight that results in an accident can be seen to
represent a failure of the safety process, irrespective of how long it has been airborne, how far it has
flown, or how many seats it has. Accordingly, when comparing safety levels of airline operations, the rate
of accidents to numbers of flights (or departures) may be a more appropriate measure than the use of
flight hours or seat-kilometres.
For most general aviation operations (which normally have shorter flights), the number of accidents or
incidents and an exposure base measured in flight hours are often used. The resulting rate is then an
expression of accidents or incidents per 1 000 000 hours, 100 000 hours, or 10 000 hours. The base
chosen will depend on the amount of aviation activity being considered, so that the resulting rate is an
easily handled number. In special operations involving unique hazards, it may be desirable to
use a different exposure base. For example, aerial application operations usually involve many flights per
hour. The large number of take-offs and landings substantially increases the likelihood of an accident
during these critical phases of flight. For such operations it may be more useful to compare numbers of
accidents with numbers of flights.
When considering safety trends, the current record is often compared against a base period to determine
whether there has been an improvement or decline in safety. This analysis method can also provide a
useful hazard alerting technique. However, when the numbers of accidents/incidents are comparatively
small, slight changes in numbers, for instance from one year to another, can provide an erratic and
virtually meaningless result. To overcome this, some form of averaging can be used. For instance, the
number of accidents in the subject year can be compared to the average number of accidents in a
preceding three or five-year period. Alternatively, the number of accidents in the subject year can be
added to the number of accidents of the previous two or four years, from which a so-called rolling three-
or five-year average is calculated.
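The rolling average described above can be sketched as follows; the annual accident counts are invented for illustration:

```python
def rolling_average(annual_counts, window):
    """Rolling mean over the most recent `window` years, producing one
    value per year once enough history is available."""
    return [
        sum(annual_counts[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(annual_counts))
    ]

# Hypothetical accidents per year for a seven-year period
accidents_per_year = [6, 2, 4, 9, 3, 5, 4]
print(rolling_average(accidents_per_year, 3))
```

Comparing the smoothed series rather than the raw annual counts damps out the erratic year-to-year changes noted in the text.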
While the foregoing methods may give an indication of the safety trends of a particular organization, or
operation, it has to be remembered that accidents are infrequent and random events. Thus caution should
be applied when using such methods as an indication of relative safety.
Common pitfalls
As will be apparent, there are many potential pitfalls in the use of statistics to reach meaningful
conclusions. The following outlines a few:
Understanding of numbers. Numbers can be misleading unless they are used with appropriate care
concerning their accuracy, their associated units and the manipulation they are subjected to during the
analysis process. Following are examples of common weaknesses in the use of numbers that may
compromise the credibility of the statistical analysis:
a) Absence of a unit of measure to give the number meaning;
b) Manipulation of numbers with inconsistent units of measure;
c) Using unsupportable levels of accuracy (especially when using data from more than one
source);
d) Inconsistency in rounding-off the level of significance of numbers; and
e) Making extrapolations beyond the confidence limits of the available data.
Rules of accuracy. Understanding can be enhanced by respecting some basic rules of accuracy in the
use of numbers. For example:
a) Quote numbers that can be understood (e.g. avoid numbers with exponential powers if
possible);
b) Indicate the measurement unit (e.g. accidents per 100 000 departures);
c) Ensure consistency in the type of measurement used (e.g. metric vs. imperial); and
d) Round data when appropriate (e.g. 5 deaths is more meaningful than 5.21).
Significant figures. Misunderstanding in the use of numbers can arise from the misuse of significant
figures. A differentiation needs to be made between what may be precisely measured to many significant
figures and the message that is conveyed. Common sense is often the best guide. For example,
although it may be determined that the aircraft had 19 989.5 litres of fuel on board, for the purposes of
a safety argument is this not the same as 20 000 litres? If there have been only four occurrences of a
particular type, two of which were fatal, is it meaningful to imply great precision by saying that 50.00% of
such occurrences were fatal? Likewise, a numeric drop in CFIT accidents from 10 to 8 (20%) should not be
described as significant if the activity rate declined proportionately.
Different data sources. Not all safety databases are the same. Each database possesses unique
characteristics. When using data from different sources (such as other organizations, manufacturers or
States) caution is advised. Failure to take the following issues into consideration may result in
“comparing apples with oranges”.
a) Terminology. Differences in definitions of occurrence type, event, or causal and contributory
factors (for example, near-collision, non-fatal accident);
b) Storage. Differences in the capture or updating of data;
c) Requirements. Differences in reporting requirements;
d) Operations. Differences in the size of the infrastructure, number of employees, the volume of
traffic, maintenance and safety programmes, etc.;
e) Traffic type. Short haul vs. long haul, passenger vs. freight, etc.; and
f) Equipment. Age and extent of modernization of the fleet.
Trend analysis. Incorrect conclusions can easily be drawn in examining data covering an extended
time period. These include:
a) A comparison may be made between annual totals several years apart. The base year for
calculating the change may be a "blip" (with an unusually low or high frequency of the event)
thereby giving a false impression of an increase or decrease.
b) Comparisons may be made between two time periods to show the effect of a change (in the
operating environment). The result will be meaningless unless the periods chosen adequately
reflect the different operating environments. For example, if the reporting rules for gathering
the data have changed over the period, compensation must be made.
c) In computing averages over a number of years, the individual data points within the time span
should be examined to adjust for factors contributing to data points well outside the expected
range (e.g. a labour-management dispute, an unusually high number of casualties from a
single accident, etc.).
d) In quoting an annual average over a recent number of years, the inclusion of a recent random
increase or decrease in the calculation may result in misleading figures.
e) When describing a trend over a long time period, it may be misleading to ignore large
fluctuations particularly in the opposite direction – even though the general trend is clear.
___________________
Chapter 10
SAFETY PERFORMANCE MONITORING
Outline
Introduction
Safety health
 Assessing safety health
— Symptoms of poor safety health
— Indicators of improving safety health
— Statistical safety performance indicators
— Minimum levels of safety
Safety oversight
 Inspections
 Surveys
 Safety monitoring
 Quality assurance
 Safety audits
ICAO Universal Safety Oversight Audit Programme (USOAP)
Regulatory safety audits
Self audit
Safety programme review
Appendices
1. Sample indicators of safety health
2. Airline management self-audit
Chapter 10
SAFETY PERFORMANCE MONITORING
INTRODUCTION
Safety management systems require feedback on safety performance to complete the closed loop of the
safety management cycle. Through such feedback, system performance can be evaluated and any
necessary changes effected. In addition, all stakeholders require an indication of the level of safety within
an organization. For example:
a) Staff may need confidence in their organization’s ability to provide a safe work environment;
b) Line management requires feedback on safety performance to assist in the allocation of resources
between the often-conflicting goals of production and accident prevention;
c) Passengers are concerned with their own mortality;
d) Senior management seeks to protect the corporate image (and market share); and
e) Shareholders wish to protect their investment, etc.
Although the stakeholders in an organization’s safety process want feedback, their individual perspectives
as to “what is safe?” vary considerably. Deciding what reliable indicators there are of acceptable safety
performance depends largely upon how one views “safety”. For example:
a) Senior management may seek the unrealistic goal of “zero accidents”. Unfortunately, as long as
aviation involves risk, there will be accidents, even though the accident rate may be very low.
b) Regulatory requirements normally define minimum “safe” operating parameters; for example,
cloud base and flight visibility limitations. Operations within these parameters contribute to
“safety”, however, they do not guarantee it.
c) Statistical measures are often used to indicate a level of safety, e.g. the number of accidents per
hundred thousand hours, or fatalities per thousand sectors flown. Such quantitative indicators
mean little by themselves but they are useful in assessing whether safety is getting better or worse
over time.
SAFETY HEALTH
Recognizing the complex interactions affecting safety and the difficulty of defining what is safe and what
is not, some safety experts are referring to the “safety health” of an organization. The term safety health is
an indication of an organization’s resistance to unexpected conditions or acts by individuals. It reflects the
systemic measures put in place by the organization to defend against the unknown. Further, it is an
indication of the organization’s ability to adapt to the unknown. In effect, it reflects the safety culture of
the organization.
[Figure 10-1 (below) plots an organization's Safety Health (Resistance to Misadventure) over time, annotated with milestone events such as: operating certificate granted; operational experience and management builds; Safety Incident Reporting System implemented; Flight Data Analysis Programme implemented; merger and labour dispute; new management; Safety Management System adopted.]
Although the absence of safety-related events (accidents, incidents) does not necessarily indicate a “safe”
operation, some operations are considered to be “safer” than others. Safety deals with risk reduction to an
acceptable (or at least tolerable) level. The level of safety in an organization is unlikely to be static and
will vary over time. As an organization adds defences against safety hazards, its safety health may be
considered to be improving. However, various factors (hazards) may compromise that safety health,
requiring additional measures to strengthen the organization’s resistance to misadventure. The concept of
the safety health of an organization varying during its life cycle is depicted in Figure 10-1 below.
[Figure: the safety health curve varies over time, moving between a Healthy Zone and an Unhealthy Zone separated by the Regulatory Compliance line (minimum acceptable level).]
Figure 10–1. Variation in safety health
Assessing safety health
In principle, the characteristics and safety performance of the “safest” organizations can be identified.
These characteristics, which reflect industry’s best practices, can serve as benchmarks for assessing safety
performance.
Symptoms of poor safety health
Poor safety health may be indicated by symptoms that put elements of the organization at risk.
Appendix 1 to this Chapter provides examples of symptoms which may be indicative of poor safety
health. Weakness in any one area may be tolerable; however, the presence of many symptoms indicates
serious systemic risks compromising the safety health of the organization.
Indicators of improving safety health
Appendix 1 also provides indications of improving safety health. These reflect the industry’s “best
practices” and a good safety culture. Organizations with the best safety records tend to “maintain or
improve their safety fitness” by implementing measures to increase their resistance to the unforeseen.
They consistently go beyond the minimum regulatory requirements.
Identifying such symptoms may provide a valid impression of an organization’s safety health. However,
they may lack the depth necessary for effective decision-making. Additional tools are required for
measuring safety performance in a systematic and convincing way.
Statistical safety performance indicators
Statistical safety performance indicators illustrate historic safety achievement; they provide a
“snapshot” of past events. Presented either numerically or graphically, they provide a simple, easily
understood indication of the level of safety in a given aviation sector, in terms of the number or rate
of accidents, incidents or casualties over a given time frame. At the highest level, this could be the
number of fatal accidents per year over the past ten years. At a lower (more specific) level, the safety
performance indicators might include such factors as the rate of specific technical events (e.g. losses
of separation, engine shut-downs, TCAS advisories, etc.).
Statistical safety performance indicators can be focused on specific areas of the operation to monitor
safety achievement or to identify areas of interest. This “retrospective” approach is useful in trend
analysis, hazard identification, risk assessment and even the choice of risk control measures.
Since accidents (and serious incidents) are relatively random and rare events in aviation, assessing
safety health based solely on statistical safety performance indicators may not provide a valid
predictor of safety performance, especially in the absence of reliable exposure data. Looking
backwards does little to assist organizations in their quest to be proactive, putting in place those
systems most likely to protect against the unknown. The safest organizations employ additional
means for assessing safety performance in their operations.
Minimum levels of safety
Aviation organizations must meet regulatory requirements to ensure minimum acceptable levels of
safety. Organizations that just meet these minimal requirements may not really be healthy from a
safety point of view, but they have reduced their vulnerabilities to the unsafe acts and conditions most
conducive to accidents. In short, they have taken minimum precautionary measures.
Weak organizations that fail to meet the minimum standards will be removed from the aviation
system, either proactively, by the regulator withdrawing their operating permit, or reactively, in
response to commercial pressures, such as the high cost of accidents or serious incidents, or consumer
resistance.
SAFETY OVERSIGHT
One of the three cornerstones for an effective Safety Management System is a formal system for safety
oversight. Safety oversight involves regular (if not continuous) monitoring of all aspects of an
organization’s operations. On the surface, safety oversight demonstrates compliance with State and
company rules, regulations, standards, procedures, etc. However, its value goes much deeper. Monitoring
provides another method for proactive hazard identification, validation of the effectiveness of safety
actions taken and continuing evaluation of safety performance.
Safety oversight can be conducted at the company level, the State (regulatory) level or the international
level. The “monitoring” functions of safety oversight take many forms with varying degrees of formality.
International level. At the international level, the ICAO Universal Safety Oversight Audit
Programme (USOAP) (described later in this Chapter) monitors the safety performance of all
contracting States. International organizations like IATA are also engaging in the safety oversight of
airlines through an audit programme.
State level. At the State level, effective safety oversight can be maintained through a mix of some of
the following elements:
a) No-notice inspections to sample the actual performance of various aspects of the national
aviation system;
b) Formal (scheduled) inspections that follow a protocol clearly understood by the
organization being inspected;
c) Discouraging non-compliant behaviour through enforcement actions (sanctions or fines);
d) Monitoring quality of performance associated with all licensing and certification applications;
e) Tracking the safety performance of the various sectors of the industry;
f) Responding to occasions warranting extra safety vigilance (such as major labour disputes,
airline bankruptcies, rapid expansion or contraction of activity, etc.); and
g) Conducting formal safety oversight audits of airlines or service providers such as air traffic
control, approved maintenance organizations, training centres or airport authorities, etc.
Organizational level. The size and complexity of the organization will determine the best methods for
establishing and maintaining an effective safety oversight programme. Companies with adequate
safety oversight employ some or all of the following methods:
a) First-line supervisors maintaining vigilance (from a safety perspective) over day-to-day activities;
b) Regular inspections (formal or informal) of day-to-day activities in all safety-critical areas;
c) Sampling employee views on safety (from both a general and a specific point of view) through
safety surveys;
d) Systematic review of, and follow-up on, all reports of identified safety issues;
e) Systematic capture of data reflecting actual day-to-day performance (such as FDA and
LOSA);
f) Macro-analyses of safety performance (safety studies);
g) A regular operational audit programme (including both internally and externally
conducted safety audits); and
h) Communication of safety results to all affected personnel.
Safety oversight at the company level essentially includes oversight at the ‘individual level’.
Inspections
Perhaps the simplest form of safety oversight involves the Safety Manager (SM) carrying out informal
“walk-arounds” of all operational areas of the company. Talking to workers and supervisors, witnessing
actual work practices, etc. in a non-structured way provides the SM with valuable insights into safety
performance “at the coal face”. As the SM’s frequent presence becomes the norm, a level of trust can be
established. The resulting feedback should help in fine-tuning the safety management system.
To be of value to the organization, the focus of an inspection should be on the quality of the “end
product”. Unfortunately, many inspections simply follow a tick-box format. These may be useful for
verifying compliance with particular requirements, but are less effective for assessing systemic safety
risks. Rather than a tick-box format, a checklist can be used as a guide to help ensure that parts of the
operation are not overlooked.
Management and line supervisors may also conduct safety inspections to assess adherence to
organizational requirements, plans and procedures. However, such inspections may only provide a spot
check of the operations, with little potential for systemic safety oversight.
Surveys
Surveys of operations and facilities can provide management with an indication of the levels of safety and
efficiency within its organization. Understanding the systemic hazards and inherent risks associated with
everyday activities allows an organization to minimize unsafe acts and respond proactively by improving
the processes, conditions and other systemic issues that lead to unsafe acts. Safety surveys are one way to
systematically examine particular organizational elements or the processes used to perform a specific
operation — either generally, or from a particular safety perspective. They are particularly useful in
assessing attitudes of selected populations, say line pilots for a particular aircraft type, or air traffic
controllers working a particular position.
Such surveys, which attempt to determine the underlying hazards in a system, are usually independent of
routine inspections by government or company management. Surveys completed by operational personnel
can provide important diagnostic information about daily operations. They can provide an inexpensive
mechanism to obtain significant information regarding many aspects of the organization, including:
a) Perceptions and opinions of operational personnel;
b) Level of teamwork and cooperation among various employee groups;
c) Problem areas or bottlenecks in daily operations;
d) Corporate safety culture; and
e) Current areas of dissent or confusion.
Note: Adapted from TP 13881, Safety Management Systems for Flight Operations and Aircraft Maintenance Organizations, Transport Canada.
Safety surveys usually involve the use of checklists, questionnaires and informal confidential interviews.
Surveys, particularly those using interviews, may elicit information that cannot be obtained any other
way.
Specific data suitable for safety analysis, including the assessment of safety performance, can typically
be acquired through well-structured and well-managed surveys. However, the validity of all survey information
obtained may need to be verified before corrective action is taken. Like voluntary incident reporting
systems, surveys are subjective, reflecting individuals’ perceptions. As such they are subject to the same
kinds of limitations, such as the biases of the author, biases of the respondents, biases in interpreting the
data, etc.
The activities of safety surveys can span the complete risk management cycle from hazard identification,
through risk assessment into safety oversight and quality assurance. They are most likely to be conducted
by organizations that have truly made the transition from a reactive to a proactive safety culture.
Chapter 15 (Practical Considerations for Operating a Safety Management System) includes guidance to
Safety Managers on the conduct of safety surveys.
Safety monitoring
Traditionally, understanding of the unsafe conditions inherent in aviation developed largely from the
evidence of accident and incident investigations. While an important element of aviation safety, this
reactive approach overlooks the experience of the thousands of flights each day that progress without
incident. Learning strictly through accident/incident investigations is like closing the barn door after the
horses have escaped.
Methods are evolving to learn more from the experience of typical flight operations. What strategies work
best for operational personnel in coping with the dynamic situations of flight operations, effectively
managing the normal threats, errors and undesirable states with the potential to erode accepted margins of
safety? By observing normal operations, normative data can be acquired and analyzed to identify hazards
that might otherwise not come to the attention of safety managers.
One method of monitoring normal operations on the flight deck is known as LOSA (Line Operations
Safety Audits), described in Chapter 16 (Aircraft Operations). Following the Threat and Error
Management (TEM) model, analysis of safety data acquired through this methodology has provided
several airlines with important insight into the threats and errors that flight crews have to manage in
their day-to-day operations. The same principles are being extended to over-the-shoulder monitoring of
air traffic control operations; the equivalent concept for ATS, called NOSS (Normal Operations Safety
Surveys), is described further in Chapter 17 (Air Traffic Services).
Quality assurance
A quality assurance system (QAS) defines and establishes an organization’s quality policy and objectives.
It ensures that the organization has in place those elements necessary to improve efficiency and reduce
risks. If properly implemented, a QAS ensures that procedures are carried out consistently, that problems
are identified and resolved, and that the organization continuously reviews and improves its procedures,
Note: Adapted from TP 13881, Safety Management Systems for Flight Operations and Aircraft Maintenance Organizations, Transport Canada, 2002.
products and services. A QAS should proactively identify problems and improve procedures in order to
meet corporate objectives.
In a safety management system (SMS), these same functions are applied to understanding the human and
organizational issues that can impact on safety. An SMS employs similar methods to identify safety
problems within the organization and to reduce or eliminate the potential for related accidents through
improved procedures. Safety management focuses on the identification and control of safety risks.
The functions of a quality assurance programme help ensure that the requisite systemic measures have
been taken to meet the organization’s safety goals. However, quality assurance does not “assure safety”.
Rather, quality assurance measures help management ensure that the necessary systems are in place
within their organization to reduce the risk of accidents.
A quality assurance programme includes procedures for monitoring the performance of all aspects of an
organization including such elements as:
a) Well designed and documented procedures (e.g. standard operating procedures);
b) Inspection and testing methods;
c) Monitoring of equipment and operations;
d) Internal and external audits;
e) Monitoring of corrective actions taken; and
f) The use of appropriate statistical analysis, when required.
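Item f) can be as simple as basic descriptive statistics. As a purely illustrative sketch (not drawn from this manual; the data and the two-standard-deviation threshold are assumptions made for the example), a quality assurance programme might flag reporting periods with an unusually high count of audit findings:

```python
# Illustrative only: flag months whose count of audit findings exceeds the
# historical mean by more than k standard deviations. The data below are
# fabricated; a real programme would tune the threshold to its own history.
from statistics import mean, stdev

def flag_unusual_months(findings_per_month, k=2.0):
    """Return indices of months whose finding count exceeds mean + k*stdev."""
    m, s = mean(findings_per_month), stdev(findings_per_month)
    return [i for i, x in enumerate(findings_per_month) if x > m + k * s]

findings = [4, 5, 3, 6, 4, 5, 14, 4]   # month 6 shows an unusual spike
print(flag_unusual_months(findings))   # -> [6]
```

Such a flag does not by itself establish a safety deficiency; it only directs attention to where further investigation may be warranted.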
A number of internationally accepted quality assurance standards are in use today. The most appropriate
system depends on the size, complexity and product of the organization. ISO 9000 is one set of
international standards used by many companies to implement an in-house quality system. Such systems
can also help ensure that the organization’s suppliers have appropriate quality assurance systems in place.
Safety audits
Safety auditing is a core safety management activity. Like financial audits, safety audits provide a means
for systematically assessing how well the organization is meeting its safety objectives. The safety audit
programme, together with other safety oversight activities, provides feedback to both managers of
individual units and senior management concerning the safety performance of the organization. This
feedback provides evidence of the level of safety performance being achieved and, where safety
deficiencies have been identified, enables appropriate corrective measures to be developed. In this sense,
safety auditing is a proactive safety management activity, providing a means of identifying potential
problems before they have an impact on safety.
A safety audit should verify compliance with standards and procedures, including compliance with the
organization’s safety management procedures. However, compliance with standards and procedures is not
necessarily sufficient to ensure safety.
Safety audits may be conducted internally by the organization, or by an external safety auditor.
Demonstrating safety performance for State regulatory authorities is the most common form of external
safety audit. Increasingly, however, other stakeholders may require an independent audit as a precondition
to providing a specific approval, such as for financing, insurance, partnerships with other airlines, entry
into foreign airspace, etc. Regardless of the driving force for the audit, the purpose, activities and products
from both internal and external audits are similar. While the procedures used in the conduct of these two
types of audits are similar, the scope is quite different.
Safety audits should be conducted on a regular and systematic basis in accordance with the organization’s
safety audit programme. Guidance on the conduct of safety audits is included in Chapter 14 (Safety
Auditing).
ICAO UNIVERSAL SAFETY OVERSIGHT AUDIT PROGRAMME (USOAP)
ICAO recognizes the need for States to exercise effective safety oversight of their aviation industries.
Thus, ICAO has established the Universal Safety Oversight Audit Programme (USOAP). The primary
objectives of USOAP are:
a) To determine the degree of conformance by States in implementing ICAO Standards;
b) To observe and assess the States’ adherence to ICAO Recommended Practices, associated
procedures, guidance material and safety-related practices;
c) To determine the effectiveness of States’ implementation of safety oversight systems through the
establishment of appropriate legislation, regulations, safety authorities and inspections, and
auditing capabilities; and
d) To provide Contracting States with advice in order to improve their safety oversight capability.
A first USOAP audit cycle of most ICAO Contracting States addressing Annex 1 — Personnel Licensing,
Annex 6 — Operation of Aircraft and Annex 8 — Airworthiness of Aircraft has been completed.
Summary reports of the audits containing an abstract of the findings, recommendations and the proposed
State corrective actions are published and distributed by ICAO to enable other Contracting States to form
an opinion on the status of aviation safety in the audited State. Future USOAP audit cycles will use a
systemic approach, focusing on the safety-critical SARPs of all safety-related Annexes. The audit findings
to date have revealed many shortcomings in individual States’ compliance with ICAO SARPs.
REGULATORY SAFETY AUDITS
For some States, the ICAO USOAP audits are the only assessment made of the States’ aviation safety
performance. However, many States do carry out a programme of safety audits to ensure the integrity of
their national aviation system. Performance measurements and the appraisal of safety management
documentation, including the processes for hazard identification and risk assessment, should be part of the
regulatory safety audit.
Audits conducted by a safety regulatory authority would take a broad view of the safety management
procedures of the organization as a whole. The key issues in such an audit would be:
Note: Guidance material is available from ICAO to assist States in preparing for USOAP audits; see the
Safety Oversight Audit Manual (Doc 9735) and the Human Factors Guidelines for Safety Audits Manual
(Doc 9806).
f) Surveillance and compliance. The regulatory authority needs to ensure that the required
international, national or local standards are complied with before any licence or approval is
issued, and that compliance will be maintained for the duration of the licence or approval. The
regulator determines an acceptable means for demonstrating compliance. The organization being
audited is then required to provide documentary evidence that the regulatory requirements can and
will be met.
g) Areas and degree of risk. A regulatory audit should ensure that the organization’s safety
management system is based on sound principles and procedures, and that organizational systems
are in place to review procedures periodically so that all safety standards continue to be met.
Assessments should be made of how risks are identified and how any necessary changes are
made. The audit should confirm that the individual parts of the organization are performing as an
integrated system. Regulatory safety audits must therefore be conducted in sufficient depth and
scope to ensure that the organization has considered the various interrelationships in its
management of safety.
h) Competence. The organization should have adequate, trained staff to ensure that the safety
management system functions as intended. In addition to confirming the continuing competency
of all staff, the regulatory authority needs to assess the capabilities of personnel in key positions.
The possession of a licence granting specific privileges does not necessarily measure the holder’s
competence to perform the managerial tasks assigned by the organization; for example,
competence as a pilot may not equate to managerial acumen. Where there are short-term skills
gaps, the organization will need to satisfy the regulator that it has a viable plan to mitigate the
situation as soon as practicable. In addition, the regulator should be interested in the involvement
of the highest level of management with responsibility for safety in the day-to-day safety of the
organization.
i) Safety management. Major safety issues are managed effectively, and the organization is
generally meeting its safety performance targets.
SELF-AUDIT
Critical self-assessment (or self-audit) is a tool that management can employ to measure safety margins.
A comprehensive questionnaire to assist airline management to conduct a self-audit of those factors
affecting accident prevention is included at Appendix 2 to this chapter. This self-audit checklist is
designed for use by senior airline management to identify organizational events, policies, procedures or
practices that may be indicative of safety hazards.
There are no right or wrong answers applicable to all situations, nor are all the questions relevant to every
type of operation. However, the thrust of the responses to a line of questioning may be revealing of the
organization’s safety health.
Although this self-audit was originally designed for use in flight operations, the line of questioning is
relevant for the management of most operational aspects of civil aviation. Thus, this audit checklist can be
adapted for application in a variety of situations with the potential to contribute to an accident.
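As a purely illustrative aid, not part of the checklist itself, responses to a self-audit of this kind can be tallied by section so that management attention is drawn first to the weakest areas. The section names and answers below are hypothetical, and the scoring assumes questions phrased so that “yes” is the desirable answer:

```python
# Hypothetical tally of self-audit responses. For questions phrased so that
# "yes" is the desirable answer, the fraction of "yes" responses per section
# gives a rough indication of where management attention may be needed.

def section_scores(responses):
    """responses: {section: [True/False, ...]} -> {section: fraction of 'yes'}."""
    return {sec: round(sum(ans) / len(ans), 2) for sec, ans in responses.items()}

audit = {
    "Management structure":  [True, True, False, True],
    "Workforce":             [True, False, False, False],
    "Training and checking": [True, True, True, True],
}

scores = section_scores(audit)
print(scores)
# -> {'Management structure': 0.75, 'Workforce': 0.25, 'Training and checking': 1.0}
```

A low score does not prove a hazard exists, just as a high score does not prove safety; the value lies in prompting a closer look at the weaker sections.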
Increasingly, airlines are entering into “alliances” and code-sharing agreements. Under such
arrangements, the fare-paying public may have difficulty differentiating one airline from another; the
expectation is that an equivalent level of service will prevail, including an equivalent level of safety.
Cooperative audits among the participants in these agreements help ensure a consistent level of safety
across the alliance partners. The self-audit checklist may provide a common basis for such collective audits.
SAFETY PROGRAMME REVIEW
Any system requires feedback on the fulfillment of the system’s objectives in order to adjust the various
inputs and processes. A programme review validates the safety management system; it confirms that a
systemic approach is being taken to safety management (as opposed to a patchwork of uncoordinated and
unrelated safety initiatives). Through a regular review process, management can pursue continuous
improvement in the system.
When a safety management system is first implemented, typically there is enthusiasm, with new hopes
and aspirations. As the various elements of the safety management system come into effect, managing it
shifts from implementation to maintenance. The initial enthusiasm may wane. As safety performance data
builds, analysis may provide misleading conclusions unless a broad, system-wide perspective is taken.
For example, the number of reports to the hazard and incident reporting system may increase during
implementation, then reduce somewhat. This does not necessarily mean that the number of actual hazards
has been reduced. Perhaps the system is not operating as intended. Staff may have lost confidence in the
non-punitive nature of the organization’s safety culture, or in management’s readiness to address
identified safety deficiencies. It is the responsibility of the Chief Executive and the SM to ensure that this
does not happen.
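The caution above about interpreting raw report counts can be made concrete: normalizing the number of reports by operational exposure (for example, flight hours) often tells a different story than the counts alone. The figures below are fabricated for illustration:

```python
# Hypothetical illustration: raw counts of hazard/incident reports can fall
# simply because flying decreased, not because hazards were eliminated.
# Normalizing by exposure (here, flight hours) gives a clearer trend signal.

def report_rate_per_1000_hours(reports, flight_hours):
    """Return hazard/incident reports per 1 000 flight hours for each period."""
    if len(reports) != len(flight_hours):
        raise ValueError("series must cover the same periods")
    return [round(1000.0 * r / h, 2) for r, h in zip(reports, flight_hours)]

# Illustrative (fabricated) quarterly data:
reports      = [40, 52, 45, 30]        # reports received
flight_hours = [20000, 21000, 15000, 10000]

rates = report_rate_per_1000_hours(reports, flight_hours)
print(rates)  # -> [2.0, 2.48, 3.0, 3.0]
```

Here the raw count falls in the final quarters while the normalized rate holds steady or rises, which is consistent with reduced flying rather than reduced hazards.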
————————
Part 2
Chapter 10 – Safety Performance Monitoring
DRAFT
Appendix 1 to Chapter 10
SAMPLE INDICATORS OF SAFETY HEALTH
CAA

Poor safety health:
• Inadequate governing legislation and regulations;
• Potential conflicts of interest (such as the regulator also being a service provider);
• Inadequate civil aviation infrastructure and systems;
• Inadequate fulfillment of regulatory functions (such as licensing, surveillance and enforcement);
• Inadequate resources and organization for the magnitude and complexity of regulatory requirements;
• Instability and uncertainty within the CAA, compromising the quality and timeliness of regulatory performance;
• Absence of formal safety processes such as incident reporting and safety oversight;
• Stagnation in safety thinking (such as reluctance to embrace proven best practices).

Improving safety health:
• National incident reporting programmes (both mandatory and voluntary);
• National safety monitoring programmes, including incident investigations, accessible safety databases, trend analysis, etc.;
• Regulatory oversight, including routine surveillance, regular safety audits and monitoring of best industry practices;
• Risk-based resource allocation for all regulatory functions; and
• Safety promotion programmes to assist operators.

Operational organization

Poor safety health:
• Inadequate organization and resources for current operations;
• Instability and uncertainty due to recent organizational change;
• Poor financial situation;
• Unresolved labour-management disputes;
• Record of regulatory non-compliance;
• Low operational experience levels for the type of equipment or operations;
• Fleet inadequacies, such as age and mix;
• Poorly defined (or no) corporate flight safety function;
• Inadequate training programmes;
• Corporate complacency regarding safety record, current work practices, etc.;
• Poor safety culture.

Improving safety health:
• Proactive corporate safety culture;
• Investment in human resources in such areas as non-mandatory training;
• Formal safety processes for maintaining a safety database, incident reporting, investigation of incidents, safety communications, etc.;
• Operation of a comprehensive safety management system (i.e. appropriate corporate approach, organizational tools and safety oversight);
• Strong internal two-way communications in terms of openness, feedback, reporting culture and dissemination of lessons learned;
• Safety education and awareness in terms of data exchange, safety promotion, participation in safety fora and training aids.
————————
Appendix 2 to Chapter 10
AIRLINE MANAGEMENT SELF-AUDIT
(Adapted from Flight Safety Foundation: Flight Safety Digest, May 1999)

OBJECTIVE

This self-audit may be used by airline management to identify administrative, operational and
maintenance processes and related training that might indicate safety hazards. The results can be used to
focus management attention on those issues possibly posing a risk of incidents or accidents.

MANAGEMENT AND ORGANIZATION

Management structure

• Does the company have a formal written statement of corporate safety policies and objectives?
• Are these adequately disseminated throughout the organization? Is there visible senior management support for these safety policies?
• Does the organization have a safety department or a designated Safety Manager (SM)?
• Is this department or SM effective?
• Does the safety department or SM report directly to senior corporate management?
• Does the organization support the periodic publication of a safety report or newsletter?
• Does the organization distribute safety reports or newsletters from other sources?
• Is there a formal system for regular communication of safety information between management and employees?
• Are there periodic safety meetings?
• Does the company participate in industry safety activities, such as those sponsored by the Flight Safety Foundation (FSF), the International Air Transport Association (IATA) and others?
• Does the organization formally investigate incidents and accidents? Are the results of these investigations disseminated to managers and operational personnel?
• Does the organization have a confidential, non-punitive hazard and incident-reporting programme?
• Does the organization maintain an incident database?
• Is the incident database routinely analysed to determine trends?
• Does the company operate a Flight Data Analysis (FDA) programme?
• Does the company operate a Line Operations Safety Audit (LOSA) programme?
• Does the company conduct safety studies as a means of proactively identifying safety deficiencies?
• Does the organization use outside sources to conduct safety reviews or audits?
• Does the organization solicit input from aircraft manufacturers’ product support groups?

Management and corporate stability

• Have there been significant or frequent changes in ownership or senior management within the past three years?
• Have there been significant or frequent changes in the leadership of operational divisions within the organization in the past three years?
• Have any managers of operational divisions resigned because of disputes about safety matters, operating procedures or practices?

Financial stability of the organization

• Has the organization recently experienced financial instability, a merger, an acquisition or other major reorganization?
• Was consideration given to safety matters during and following the period of instability, merger, acquisition or reorganization?
• Are safety-related technological advances implemented before they are directed by regulatory requirement, i.e. is the organization proactive in using technology to meet safety objectives?

Management selection and training

• Are there well-defined management selection criteria?
• Are operational background and experience a requirement in the selection of management personnel?
• Are first-line operational managers selected from operationally qualified candidates?
• Do new management personnel receive formal safety indoctrination and training?
• Is there a well-defined career path for operational managers?
• Is there a formal process for the annual evaluation of managers?

Workforce

• Have there been recent layoffs by the organization?
• Are a large number of personnel employed on a part-time or contractual basis?
• Does the company have formal rules or policies to manage the use of contract personnel?
• Is there open communication between management, the workforce and unions about safety issues?
• Is there a high rate of personnel turnover in operations or maintenance?
• Is the overall experience level of operations and maintenance personnel low or declining?
• Is the distribution of age or experience level within the organization considered in long-term organizational planning?
• Are the professional skills of candidates for operations and maintenance positions evaluated formally during the selection process?
• Are multicultural processes and issues considered during employee selection and training?
• Is special attention given to safety issues during periods of labour-management disagreements or disputes?
• Have there been recent changes in wages or work rules?
• Does the organization have a corporate employee health maintenance programme?
• Does the organization have an employee assistance programme that includes treatment for drug and alcohol abuse?

Fleet stability and standardization

• Is there a company policy concerning cockpit standardization within the organization’s fleet?
• Do pilots and flight operations personnel participate in fleet acquisition decisions?

Relationship with the regulatory authority

• Are safety standards set primarily by the organization or by the appropriate regulatory authority?
• Does the organization set higher standards than those required by the regulatory authority?
• Does the organization have a constructive, cooperative relationship with the regulatory authority?
• Has the organization been subject to recent safety-enforcement action by the regulatory authority?
• Does the organization consider the differing experience levels and licensing standards of other States when reviewing applications for employment?
• Does the regulatory authority routinely evaluate the organization’s compliance with required safety standards?

Operations specifications

• Does the organization have formal flight-operations control, e.g. dispatch or flight following?
• Does the organization have special dispatch requirements for extended-range twin-engine operations (ETOPS)?
• Are fuel/route requirements determined by the regulatory authority? If not, what criteria does the company use?
• Does each flight crew member have a copy of the pertinent operations specifications?

OPERATIONS AND MAINTENANCE TRAINING

Training and checking standards

• Does the organization have written standards for satisfactory performance?
• Does the organization have a defined policy for dealing with unsatisfactory performance?
• Does the organization maintain a database of training performance?
• Is this database periodically reviewed for trends?
• Are check pilots periodically trained and evaluated?
• Does the organization have established criteria for instructor/check pilot qualification?
• Does the organization provide specialized training for instructors/check pilots?
• Are training and checking performed by formally organized, independent departments?
• How effective is the coordination among flight operations, flight training and flight standards?
Operations training

• Does the company have appropriate training and checking syllabi?
• Does this training include:
– Line-oriented flight training (LOFT)?
– Crew resource management (CRM)?
– Human factors?
– Wind shear?
– Dangerous goods?
– Security?
– Adverse weather operations, e.g. anti-icing and de-icing procedures?
– Altitude and terrain awareness?
– Aircraft performance?
– Rejected take-offs?
– ETOPS?
– Instrument Landing System (ILS) Category II and Category III approaches?
– Emergency-procedures training, including pilot/flight attendant interaction?
– International navigation and operational procedures?
– Standard International Civil Aviation Organization (ICAO) radiotelephone phraseology?
– Volcanic-ash avoidance/encounters?
• If a ground-proximity warning system (GPWS), a traffic alert and collision avoidance system (TCAS) or other special systems are installed, is specific training provided for their use? Are there clearly established policies for their use?
• Are English language skills evaluated during training and checking?
• Is English language training provided?
• At a minimum, are the procedures contained in the manufacturer’s aircraft operations manual covered in the training programme?
• Are there formal means for modification of training programmes as a result of incidents, accidents or other relevant operational performance?

Training devices

• Are approved simulators available and used for all required training?
• Is most of the organization’s training performed in the simulator?
• Do simulators include GPWS, TCAS, background communications and other advanced features?
• Are simulator and/or training device configurations controlled?
• Has the organization established a simulator/training device quality assurance programme to ensure that these devices are maintained to acceptable standards?
• Does the regulatory authority formally evaluate and certify simulators?
Flight attendant training

• Do flight attendants receive comprehensive initial and recurrent safety training?
• Does this training include hands-on use of all required emergency and safety equipment?
• Is the safety training of flight attendants conducted jointly with pilots?
• Does this training establish policies and procedures for communications between cockpit and cabin crew?
• Are evacuation mock-up trainers that replicate emergency exits available for flight attendant training? Do they include smoke simulation?

Maintenance procedures, policies and training

• Does the regulatory authority require licensing of all maintenance personnel?
• Is formal maintenance training provided by the organization for all maintenance personnel? Is such training done on a recurrent basis? How is new equipment introduced?
• Does the organization have a maintenance quality assurance programme?
• If contract maintenance is used, is it included in the quality assurance programme?
• Is hands-on training required for maintenance personnel?
• Does the organization use a minimum equipment list (MEL)? Does the MEL meet or exceed the master MEL?
• Does the organization have a formal procedure covering communication between maintenance and flight personnel?
• Are “inoperative” placards used to indicate deferred-maintenance items? Is clear guidance provided for operations with deferred-maintenance items?
• Are designated individuals responsible for monitoring fleet health?
• Does the organization have an aging-aircraft maintenance programme?
• Is there open communication between the maintenance organization and other operational organizations, such as dispatch? How effective is this communication?
• Does the organization use a formal, scheduled maintenance programme?
• Are policies established for flight and/or maintenance personnel to ground an aircraft for maintenance?
• Are flight crew members ever pressured to accept an aircraft that they believe should be grounded?
Scheduling practices

• Are there flight and duty time limits for pilots?
• Are there flight and duty time limits for flight attendants?
• Do the flight and duty time limits meet or exceed regulatory requirements?
• Does the organization train flight crew members to understand fatigue, circadian rhythms and other factors that affect crew performance?
• Does the organization allow napping in the cockpit?
• Are on-board crew-rest facilities provided?
• Are there minimum standards for the quality of layover rest facilities?
• Does the organization have a system for tracking flight and duty time limits?
• Has the organization established minimum crew-rest requirements?
• Are augmented crews used for long-haul flights?
• Are circadian rhythms considered in constructing flight schedules?
• Are there duty time limits and rest requirements for maintenance personnel?

Crew qualifications

• Does the organization have a system to record and monitor flight crew currency?
• Does the record-keeping system include initial qualification, proficiency checks and recurrent training, special airport qualifications and line-check observations for:
– Pilots-in-command?
– First and second officers?
– Flight engineers?
– Instructors and check pilots?
– Flight attendants?
• Does the regulatory authority provide qualified oversight of instructor and check-pilot qualifications?
• Are the simulator instructors line-qualified pilots?
• Does the organization permit multiple-aircraft qualification for line pilots?
• Do organizational check pilots have complete authority over line-pilot qualification, without interference from management?
• If the organization operates long-haul flights, is there an established policy for pilot currency, including instrument approaches and landings?
• Does the organization have specific requirements for crew pairing in crew scheduling?
PUBLICATIONS, MANUALS AND PROCEDURES

• Are all flight crew members issued personal copies of their type operations manuals/FCOM and any other controlled publications?
• How are revisions distributed?
• How are the issue and receipt of revisions recorded?
• Does the organization have an airline operations manual?
• Is the airline operations manual provided to each crew member?
• Is the airline operations manual periodically updated?
• Does the airline operations manual define:
– The minimum number of flight crew members?
– Pilot and dispatcher responsibilities?
– Procedures for hand-over of control of the aircraft?
– Stabilized approach criteria?
– Dangerous goods procedures?
– Required crew briefings for selected operations, including cockpit and cabin crew members?
– Specific pre-departure briefings for flights in areas of high terrain or obstacles?
– Sterile-cockpit procedures?
– Requirements for the use of oxygen?
– Access to the cockpit by non-flight crew members?
– Company communications?
– Controlled flight into terrain (CFIT) avoidance procedures?
– Procedures for operational emergencies, including medical emergencies and bomb threats?
– Aircraft anti-icing and de-icing procedures?
– Procedures for handling hijacking and disruptive passengers?
– Company policy specifying that there will be no negative consequences for go-arounds and diversions when required operationally?
– The scope of the captain’s authority?
– A procedure for independent verification of key flight-planning and load information?
– Weather minimums, and maximum cross-wind and tail-wind components?
– Special minimums for low-time captains?
• Are all manuals and charts subject to a review and revision schedule?
• Does the organization have a system for distributing time-critical information to the personnel who need it?
• Is there a company manual specifying emergency response procedures?
• Does the organization conduct periodic emergency response drills?
• Are airport facility inspections mandated by the company?
• Do airport facility inspections include reviews of notices to airmen (NOTAMs)? Signage and lighting? Runway condition, such as rubber accumulation, foreign object damage (FOD), etc.? Aircraft rescue and fire-fighting (ARFF)? Navigational aids (NAVAIDs)? Fuel quality?
DISPATCH, FLIGHT FOLLOWING AND FLIGHT CONTROL

• Does initial/recurrent dispatcher training meet or exceed regulatory requirements?
• Are operations during periods of reduced ARFF equipment availability covered in the flight operations manual?
• Do dispatchers/flight followers have duty-time limitations?
• Are computer-generated flight plans used?
• Are ETOPS alternates specified?
————————
Part 2
DRAFT
Chapter 11 - Emergency Response Planning
Chapter 11
EMERGENCY RESPONSE PLANNING
Outline
Introduction
ICAO requirements
Plan Contents
• Governing policies
• Organization
• Notifications
• Initial response
• Additional assistance
• Crisis Management Centre (CMC)
• Records
• Accident site
• News media
• Formal investigations
• Family assistance
• Post-critical incident stress counseling
• Post-occurrence review
Aircraft Operator’s Responsibilities
Checklists
Training and Exercises
Involvement of the Safety Manager
Chapter 11
EMERGENCY RESPONSE PLANNING
INTRODUCTION
Perhaps because aviation accidents are rare events, few organizations are prepared when one occurs.
Many organizations do not have effective plans in place to manage events during an emergency or crisis.
How an organization fares in the aftermath of an accident can depend on how well it handles the first few
hours and days following a major safety occurrence. An emergency response plan outlines in writing what
should be done after an accident occurs, and who is responsible for each action.
At first glance, the initial response to an emergency may appear to have little to do with safety
management. However, emergency response provides an opportunity to learn as well as to apply safety
lessons aimed at the reduction of damage or injury.
Successful response to an emergency begins with effective planning for a range of possible events. An
Emergency Response Plan (ERP) provides the basis for a systematic approach to managing the
organization’s affairs in the aftermath of a significant unplanned event – in the worst case, a major
accident.
The purpose of an emergency response plan is to ensure that there is:
a) Orderly and efficient transition from normal to emergency operations;
b) Delegation of emergency authority;
c) Assignment of emergency responsibilities;
d) Authorization by key personnel for actions contained in the plan;
e) Coordination of efforts to cope with the emergency; and
f) Safe continuation of aircraft operations or return to normal operations as soon as possible.
ICAO requirements
Any organization conducting or supporting flight operations should have an emergency response plan.
For example:
a) ICAO Annex 14 — Aerodromes states that, before operations commence at an airport, an
emergency response plan should be in place to deal with an aircraft accident occurring on, or in
the vicinity of, the airport.
b) The ICAO Manual entitled Preparation of an Operations Manual (Doc 9376) states that the
operations manual of a company should give instructions and guidance on the duties and
obligations of personnel following an accident. It should include guidance on the establishment
and operation of a central accident/emergency response centre – the focal point for crisis
management. In addition to guidance for accidents involving company aircraft, guidance should
also be provided for accidents involving aircraft for which it is the handling agent (for example,
through code-sharing agreements or contracted services). Larger companies may choose to
consolidate all this emergency planning information in a separate volume of their operations
manual.
c) The ICAO Airport Services Manual, Part 7 — Airport Emergency Planning (Doc 9137) gives
guidance to both airport authorities and aircraft operators on preplanning for emergencies, as well
as on coordination between the different airport agencies, including the operator.
To be effective, an ERP should:
a) Be relevant and useful for the people who are likely to be on duty at the time of an accident;
b) Include checklists and quick reference contact details of relevant personnel;
c) Be regularly tested through exercises; and
d) Be updated when details change.
PLAN CONTENTS
An Emergency Response Plan (ERP) would normally be documented in the format of a manual. It should
set out the responsibilities and required roles and actions for the various agencies and personnel involved
in dealing with emergencies. An ERP should take account of such considerations as:
Governing policies. The ERP should provide direction for responding to emergencies, such as
governing laws and regulations for accident investigations, agreements with local authorities,
company policies and priorities, etc.
Organization. The ERP should outline management’s intentions with respect to the responding
organizations by:
a) Designating who will lead and who will be assigned to the responding teams;
b) Defining the roles and responsibilities for personnel assigned to the response teams;
c) Clarifying the reporting lines of authority;
d) Setting up of a Crisis Management Centre (CMC);
e) Establishing procedures for receiving a large number of requests for information, especially
during the first few days after a major accident;
f) Designating the corporate spokesperson for dealing with the media;
g) Defining what resources will be available, including financial authorities for immediate
activities;
h) Designating the company representative to any formal investigations undertaken by State
officials; and
i) Defining a call-out plan for key personnel, etc.
An organization or flow chart could be used to show organizational functions and communication
relationships.
Notifications. The plan should specify who in the organization should be notified of an emergency,
who will make external notifications and by what means. The notification needs of the following
should be considered:
a) Management (chief pilot, chief maintenance engineer and senior management team, etc.);
b) State authorities (Search and Rescue, regulatory authority, accident investigation board, etc.);
c) Local emergency response services (airport authorities, fire fighters, police, ambulances,
medical agencies, etc.);
d) Relatives of victims — a sensitive issue, in many States handled by the police with
information provided by the operator;
e) Company personnel;
f) Media; and
g) Legal, accounting, insurers, etc.
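As an illustration only, a call-out plan such as the one described above could be maintained as structured data so that each duty holder's notification responsibilities can be validated and printed into the ERP checklists. This is a minimal sketch; all roles, groups, numbers and the `notified_by` assignments below are hypothetical placeholders, not part of any actual plan.

```python
# Sketch of a machine-readable call-out list for ERP notifications.
# All entries are invented placeholders for illustration.
from dataclasses import dataclass

@dataclass
class Contact:
    role: str          # e.g. "Chief Pilot"
    group: str         # e.g. "Management", "State authorities"
    phone: str         # placeholder number
    notified_by: str   # who in the organization makes this call

CALL_OUT_LIST = [
    Contact("Chief Pilot", "Management", "+00-000-0000", "Duty Manager"),
    Contact("Accident Investigation Board", "State authorities", "+00-000-0001", "Safety Manager"),
    Contact("Airport Fire Service", "Local emergency services", "+00-000-0002", "Duty Manager"),
]

def calls_for(caller: str) -> list[Contact]:
    """Return the notifications a given person is responsible for making."""
    return [c for c in CALL_OUT_LIST if c.notified_by == caller]

for c in calls_for("Duty Manager"):
    print(f"{c.role} ({c.group}): {c.phone}")
```

Keeping the list in one structured source makes it straightforward to regenerate the printed checklists whenever contact details change, which supports the regular-update requirement noted earlier.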
Initial response. Depending on the circumstances, an initial response team may be dispatched to the
accident site to augment local resources and oversee the operator’s interests. Factors to be considered
for such a team include:
a) Who would lead the emergency response team?
b) Who would be included on the initial response team?
c) Who would speak for the company at the accident site?
d) What would be required by way of special equipment, clothing, documentation,
transportation, accommodation, etc.?
Additional assistance. Company employees with appropriate training and experience can provide
useful support during the preparation, exercising and updating of a company’s ERP. Their expertise
may be useful in planning and executing such tasks as:
a) Acting as passengers in crash exercises;
b) Handling of survivors;
c) Assisting with interviewing passenger witnesses; and
d) Dealing with next of kin, etc.
Crisis Management Centre (CMC). A CMC should be established at the operator’s headquarters
once the activation criteria have been met. In addition, a command post (CP) may be established at or
near the accident site. The ERP should address how the following requirements for the CMC and CP
are to be met:
a) Staffing (perhaps 24 hours a day, 7 days a week, during the initial response period);
b) Communications equipment (telephones, fax, Internet, etc.);
c) Documentation requirements, maintenance of emergency activities logs;
d) Impounding related company records;
e) Office furnishings and supplies;
f) Reference documents (such as emergency response checklists and procedures, company
manuals, airport emergency plans, telephone lists, etc.); and
g) Collection of latest information for daily updates, internal distribution, etc.
The services of a crisis centre may be contracted from another airline or other specialist organization
to look after the airline’s interests in a crisis away from home base. Company personnel would
normally supplement such a contracted centre as soon as possible.
Records. In addition to maintaining logs of events and activities after the occurrence, the
organization will also be required to provide information to any State investigation team. The ERP
should provide for the following types of information to be made available to investigators:
a) All records relevant to the aircraft, the flight crew and the operation;
b) Lists of points of contact and any personnel associated with the occurrence;
c) Notes of any interviews (and statements) with anyone associated with the event; and
d) Any photographic or other evidence, etc.
Accident site. After a major accident, representatives from many jurisdictions have legitimate reasons
for accessing the site, for example, police, fire fighters, medics, airport authorities, coroners (medical
examining officers) to deal with fatalities, State accident investigators, relief agencies such as the Red
Cross and even the media. Although coordination of the activities of these stakeholders is the
responsibility of the State’s police and/or investigating authority, the aircraft operator should clarify
the following aspects of activity at the accident site:
a) Nominating a senior company representative at the accident site:
1) If at home base;
2) If away from home base;
3) If offshore or in a foreign State.
b) Management of surviving passengers;
c) Needs of relatives of victims;
d) Security of wreckage;
e) Handling of human remains and personal property of the deceased;
f) Preservation of evidence;
g) Provision of assistance (as required) to the investigation authorities; and
h) Removal and disposal of wreckage; etc.
News media. How the company responds to the media may affect how well it recovers
from the event. Clear direction is required. For example:
a) What information is protected by statute (FDR data, CVR and ATC recordings, witness
statements etc.);
b) Who may speak on behalf of the parent organization at head office and at the accident site
(Public Relations Manager, Chief Executive Officer or other senior executive, manager,
owner);
c) Direction regarding a prepared statement for immediate response to media queries;
d) What information may be released (and what should be withheld);
e) The timing and content of the company’s initial statement; and
f) Provisions for regular updates to the media.
Formal investigations. Guidance for company personnel dealing with State accident investigators
and police should be provided.
Family assistance. The ERP should also include guidance on the organization’s approach to assisting
the families of accident victims (crew and passengers). This guidance may include such things as:
a) State requirements for the provision of family assistance services;
b) Travel and accommodation arrangements to visit the accident location and survivors;
c) Programme coordinator and point(s) of contact for each family;
d) Provision of up-to-date information;
e) Grief counselling, etc.;
f) Immediate financial assistance to victims and their families; and
g) Memorial services, etc.
Some States define the types of assistance to be provided by an airline operating within and into the
country.
Critical incident stress counselling. For personnel working in stressful situations, the ERP may
include guidance specifying duty limits and providing for post-incident stress counselling.
Post occurrence review. Direction should be provided to ensure that, following the emergency, key
personnel carry out a full debrief and record all significant lessons learned which may result in
amendments to the ERP and associated checklists.
AIRCRAFT OPERATOR’S RESPONSIBILITIES
The aircraft operator’s emergency response plan should be coordinated with the airport emergency plan
so that the operator’s personnel know which responsibilities the airport will assume and what response is
required by the operator.47 As part of their emergency response planning, aircraft operators, in conjunction
with the airport operator, are expected to:48
a) Provide training to prepare personnel for emergencies;
b) Make arrangements to handle incoming telephone queries concerning the emergency;
c) Designate a suitable holding area for uninjured persons (“meeters and greeters”);
d) Provide a description of duties for company personnel (e.g. person in command, receptionists for
receiving passengers in holding areas);
e) Gather essential passenger information and coordinate the fulfillment of their needs;
f) Develop arrangements with other operators and agencies for the provision of mutual support
during the emergency;
g) Prepare and maintain an emergency kit containing:
1) Necessary administrative supplies (forms, paper, name tags, computers, etc.); and
2) Critical telephone numbers (doctors, local hotels, linguists, caterers, airline transport
companies, etc.).
In the event of an aircraft accident on or near the airport, operators will be expected to take such actions
as:
a) Report to the airport command post to coordinate the aircraft operator’s activities;
b) Assist in the location and recovery of any flight recorders;
47 See Chapter 18, Aerodrome Operations, for further discussion of airport emergency response planning.
48 See also Airport Services Manual, Part 7 — Airport Emergency Planning (Doc 9137), Op. Cit.
c) Assist investigators with the identification of aircraft components and ensure that hazardous
components are made safe;
d) Provide information regarding passengers and flight crew, and the existence of any dangerous
goods on board;
e) Transport uninjured persons to the designated holding area;
f) Make arrangements for any uninjured persons who may intend to continue their journey, or who
need accommodations or other assistance;
g) Release information to the media in coordination with the airport public information officer and
police; and
h) Remove the aircraft (and/or wreckage) upon the authorization of the investigation authority.
While this section is oriented towards the airline operator, some of the concepts would also apply to
emergency planning by aerodrome operators and air traffic service providers.
CHECKLISTS
Everyone involved in the initial response to a major aircraft accident will be suffering from some degree
of shock. Therefore, the emergency response process lends itself to the use of checklists. These checklists
can form an integral part of the company’s Operations or Emergency Response Manuals. To be effective,
checklists must be regularly:
a) Reviewed and updated (for example, currency of callout lists, contact details, company
documentation for the CMC, etc.); and
b) Tested through realistic exercises.
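As a minimal sketch of the review requirement above, the currency of each checklist could be tracked against a chosen review interval so that overdue items are flagged automatically. The 180-day interval, the checklist names and the dates below are assumptions made for the example, not requirements of the plan.

```python
# Sketch: flag ERP checklists whose last review exceeds a chosen interval.
# Interval, names and dates are illustrative assumptions only.
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=180)  # assumed review cycle

checklists = {
    "Call-out list": date(2005, 1, 10),   # last reviewed
    "CMC setup": date(2004, 3, 2),
}

def overdue(last_reviewed: dict, today: date) -> list[str]:
    """Return the names of checklists past their review interval."""
    return sorted(name for name, reviewed in last_reviewed.items()
                  if today - reviewed > REVIEW_INTERVAL)

print(overdue(checklists, date(2005, 6, 1)))  # → ['CMC setup']
```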
TRAINING AND EXERCISES
An emergency response plan is, on paper, a statement of intent. Ideally, much of an ERP will never be
tested under actual conditions. Training is required to ensure that these intentions are backed by
operational capabilities. Since training has a short “shelf life”, regular drills and exercises are advisable.
Some portions of the ERP, such as the callout and communications plan can be tested by “desktop”
exercises. Other aspects, such as “on-site” activities involving other agencies, need to be exercised at
regular intervals. Such exercises have the advantage of demonstrating deficiencies in the plan, which can
be rectified before an actual emergency.
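Deficiencies demonstrated by such exercises are only useful if they are tracked to closure as plan amendments. The following sketch shows one way this could be recorded; the record structure and the example findings are hypothetical, not drawn from any actual exercise.

```python
# Sketch: track deficiencies found during ERP exercises until they are
# closed by a plan amendment. Findings below are invented examples.
from dataclasses import dataclass

@dataclass
class Finding:
    exercise: str     # e.g. "Desktop call-out test"
    deficiency: str
    closed: bool = False

log: list[Finding] = []

def record(exercise: str, deficiency: str) -> None:
    log.append(Finding(exercise, deficiency))

def open_findings() -> list[Finding]:
    """Deficiencies still awaiting rectification in the ERP."""
    return [f for f in log if not f.closed]

record("Desktop call-out test", "Two contact numbers out of date")
record("On-site exercise", "CMC fax line not staffed")
log[0].closed = True            # first deficiency rectified
print(len(open_findings()))     # → 1
```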
INVOLVEMENT OF THE SAFETY MANAGER
When there is an accident, management will often turn to the Safety Manager (SM) for advice. The SM’s
broad knowledge of operations and maintenance, his skill set and network of contacts are an invaluable
resource to the organization in the event of any emergency. As such, the SM should play a major role in
any emergency response. Following are some of the tasks that may be assigned to the SM in Emergency
Response Planning:
a) Advise senior management of the need for an ERP, on appropriate contents, and on training and
exercising of the plan;
b) Assist operational management in planning their portions of the ERP;
c) Liaise with the Airport Manager(s) to coordinate emergency response requirements;
d) Liaise with companies or organizations that may provide emergency response service;
e) Ensure any necessary training has been provided;
f) Ensure coordination of ERP with that of other airline partners; and
g) Ensure the plan is tested regularly and is updated as required.
__________________