Software System Safety
Nancy G. Leveson, MIT Aero/Astro Dept.
Copyright by the author, November 2004.

Accident with No Component Failures

[Figure: a batch chemical reactor. A computer opens and closes the catalyst valve and the cooling-water valve; labeled elements include the reactor, gearbox, condenser, reflux line, vent, and the level and flow controllers (LA, LC).]

Types of Accidents

Component failure accidents:
- Single or multiple component failures
- Usually assume random failure

System accidents:
- Arise in the interactions among components
- No components may have "failed"
- Caused by interactive complexity and tight coupling
- Exacerbated by the introduction of computers

Safety ≠ Reliability

Accidents in high-tech systems are changing their nature, and we must change our approaches to safety accordingly.

Confusing Safety and Reliability

From an FAA report on ATC software architectures:
"The FAA's en route automation meets the criteria for consideration as a safety-critical system. Therefore, en route automation systems must possess ultra-high reliability."

From a blue ribbon panel report on the V-22 Osprey problems:
"Safety [software]: ... Recommendation: Improve reliability, then verify by extensive test/fix/test in challenging environments."

Does Software Fail?

Failure: Nonperformance or the inability of a system or component to perform its intended function for a specified time under specified environmental conditions. A basic abnormal occurrence, e.g., a burned-out bearing in a pump, or a relay not closing properly when voltage is applied.

Fault: A higher-order event, e.g., a relay closing at the wrong time due to the improper functioning of an upstream component.

All failures are faults, but not all faults are failures.

Reliability Engineering Approach to Safety

Reliability: The probability that an item will perform its required function in the specified manner over a given time period and under specified or assumed conditions. (Note: Most software-related accidents result from errors in the specified requirements or function and from deviations from the assumed conditions.)

Concerned primarily with failures and failure-rate reduction:
- Parallel redundancy
- Standby sparing
- Safety factors and margins
- Derating
- Screening
- Timed replacements

Reliability Engineering Approach to Safety (2)

Assumes accidents are the result of component failure:
- Techniques exist to increase component reliability.
- Failure rates in hardware are quantifiable.

Omits important factors in accidents, and may even decrease safety. Many accidents occur without any component "failure":
- Accidents may be caused by equipment operating outside the parameters and time limits upon which the reliability analyses are based.
- Or they may be caused by interactions among components that are all operating according to specification.

Highly reliable components are not necessarily safe.

Software-Related Accidents

Are usually caused by flawed requirements:
- Incomplete or wrong assumptions about the operation of the controlled system or the required operation of the computer.
- Unhandled controlled-system states and environmental conditions.

Merely trying to get the software "correct" or to make it reliable will not make it safer under these conditions.

Software-Related Accidents (con't.)

Software may be highly reliable and "correct" and still be unsafe:
- It correctly implements its requirements, but the specified behavior is unsafe from a system perspective.
- The requirements do not specify some particular behavior required for system safety (they are incomplete).
- The software has unintended (and unsafe) behavior beyond what is specified in the requirements.
A Possible Solution

Enforce discipline and control complexity:
- The limits have changed from the structural integrity and physical constraints of materials to intellectual limits.

Improve communication among engineers.

Build safety in by enforcing constraints on behavior.

Example (batch reactor):
- System safety constraint: Water must be flowing into the reflux condenser whenever catalyst is added to the reactor.
- Software safety constraint: The software must always open the water valve before the catalyst valve.

The Problem to be Solved

The primary safety problem in computer-based systems is the lack of appropriate constraints on design. The job of the system safety engineer is to identify the design constraints necessary to maintain safety and to ensure that the system and software design enforces them.

An Overview of the Approach

"Engineers should recognize that reducing risk is not an impossible task, even under financial and time constraints. All it takes in many cases is a different perspective on the design problem."
(Mike Martin and Roland Schinzinger, Ethics in Engineering)

System Safety

A planned, disciplined, and systematic approach to preventing or reducing accidents throughout the life cycle of a system: "organized common sense" (Mueller, 1968).

The primary concern is the management of hazards: hazard identification, evaluation, elimination, and control, through analysis, design, and management. (MIL-STD-882)

System Safety (2)

Hazard analysis and control is a continuous, iterative process throughout system development and use: from conceptual development through design, development, and operations, covering hazard identification, hazard resolution, verification, change analysis, and operational feedback.

Hazard resolution precedence:
1. Eliminate the hazard.
2. Prevent or minimize the occurrence of the hazard.
3. Control the hazard if it occurs.
4. Minimize damage.

Process Steps
1. Perform a Preliminary Hazard Analysis (produces the hazard list).
2. Perform a System Hazard Analysis, not just a failure analysis (identifies potential causes of hazards).
3. Identify appropriate design constraints on the system, software, and humans.
4. Design at the system level to eliminate or control hazards.
5. Trace unresolved hazards and system hazard controls to software requirements.

Specifying Safety Constraints
- Most software requirements specify only nominal behavior; off-nominal behavior must be specified too.
- Need to specify what the software must NOT do; what it must not do is not the inverse of what it must do.
- Derive these constraints from the system hazard analysis.

Process Steps (2)
6. Software requirements review and analysis:
- Completeness
- Simulation and animation
- Software hazard analysis
- Robustness (environment) analysis
- Mode confusion and other human error analyses
- Human factors analyses (usability, workload, etc.)

Process Steps (3)
7. Implementation with safety in mind (see the sketch below):
- Defensive programming
- Assertions and run-time checking
- Separation of critical functions
- Elimination of unnecessary functions
- Exception handling, etc.
8. Off-nominal and safety testing

Process Steps (4)
9. Operational analysis and auditing:
- Change analysis
- Incident and accident analysis
- Performance monitoring
- Periodic audits
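Tying step 7's assertions and run-time checking back to the batch-reactor example: below is a minimal Python sketch of enforcing the software safety constraint (water valve open before catalyst valve) at run time. The class and method names are hypothetical; the slides do not give an implementation.

```python
class ValveInterlockError(Exception):
    """Raised when a command would violate a safety constraint."""

class ReactorController:
    """Sketch of a controller enforcing the constraint: the water
    valve must already be open before the catalyst valve opens."""

    def __init__(self):
        self.water_open = False
        self.catalyst_open = False

    def open_water(self):
        self.water_open = True

    def open_catalyst(self):
        # Run-time check of the safety constraint, rather than
        # trusting that callers always use the correct order.
        if not self.water_open:
            raise ValveInterlockError(
                "catalyst valve commanded open while water valve closed")
        self.catalyst_open = True

controller = ReactorController()
controller.open_water()      # correct order: water first
controller.open_catalyst()   # permitted; reversing the calls raises an error
```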
A Human-Centered, Safety-Driven Design Process

[Figure: parallel system engineering, system safety, and human factors activities. The system safety thread runs from a preliminary hazard analysis and hazard list through fault tree analysis, safety requirements and constraints, system hazard analysis, completeness/consistency analysis, state machine hazard analysis, deviation analysis (FMECA), safety verification and safety testing (software FTA), to operational analysis (change analysis, incident and accident analysis, performance monitoring, periodic audits). The human factors thread runs from a preliminary task analysis and operator goals and responsibilities through task allocation principles, operator task and training requirements, operator task analysis, simulation and experiments, usability analysis, mode confusion analysis, human error analysis, other human factors evaluation (workload, situation awareness, etc.), and timing and other analyses.]

Preliminary Hazard Analysis
1. Identify the system hazards.
2. Translate the system hazards into high-level system safety design constraints.
3. Assess the hazards if required to do so.
4. Establish the hazard log.

System Hazards for Automated Train Doors
- Train starts with a door open.
- Door opens while the train is in motion.
- Door opens while the train is improperly aligned with the station platform.
- Door closes while someone is in the doorway.
- A door that closes on an obstruction does not reopen, or a reopened door does not reclose.
- Doors cannot be opened for emergency evacuation.

System Hazards for Air Traffic Control
- Controlled aircraft violate minimum separation standards (NMAC).
- An airborne controlled aircraft enters an unsafe atmospheric region.
- A controlled airborne aircraft enters restricted airspace without authorization.
- A controlled airborne aircraft gets too close to a fixed obstacle other than a safe point of touchdown on its assigned runway (CFIT).
- A controlled airborne aircraft and an intruder in controlled airspace violate minimum separation.
- A controlled aircraft operates outside its performance envelope.
- An aircraft on the ground comes too close to moving objects, collides with stationary objects, or leaves the paved area.
- An aircraft enters a runway for which it does not have clearance.
- A controlled aircraft executes an extreme maneuver within its performance envelope.
- Loss of aircraft control.
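The PHA above calls for establishing a hazard log (step 4). As a sketch of what a log entry might record for hazards like those just listed, here is a minimal Python structure; the field names and status values are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass, field

@dataclass
class HazardLogEntry:
    """Illustrative hazard-log record tracking a hazard from
    identification through resolution (field names are assumptions)."""
    hazard_id: str
    description: str                 # e.g., "Door opens while train in motion"
    severity: str                    # I Catastrophic .. IV Negligible
    likelihood: str                  # A Frequent .. F Impossible
    design_constraints: list = field(default_factory=list)
    resolution: str = "open"         # eliminated / controlled / open
    verification: str = ""           # how enforcement was verified

log = [
    HazardLogEntry(
        hazard_id="H-2",
        description="Door opens while train is in motion",
        severity="I", likelihood="C",
        design_constraints=["Doors must remain closed while train is in motion"],
    )
]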
Exercise: Identify the System Hazards for This Cruise-Control System

The cruise-control system operates only when the engine is running. When the driver turns the system on, the speed at which the car is traveling at that instant is maintained. The system monitors the car's speed by sensing the rate at which the wheels are turning, and it maintains the desired speed by controlling the throttle position. After the system has been turned on, the driver may tell it to start increasing speed, wait a period of time, and then tell it to stop increasing speed; throughout that period, the system increases the speed at a fixed rate and then maintains the final speed reached. The driver may turn off the system at any time. The system turns off if it senses that the accelerator has been depressed far enough to override the throttle control. If the system is on and senses that the brake has been depressed, it ceases maintaining speed but does not turn off. The driver may tell the system to resume speed, whereupon it returns to the speed it was maintaining before braking and resumes maintenance of that speed.

Hazards Must Be Translated into Design Constraints
1. Hazard: Train starts with a door open.
   Design criterion: The train must not be capable of moving with any door open.
2. Hazard: Door opens while the train is in motion.
   Design criterion: Doors must remain closed while the train is in motion.
3. Hazard: Door opens while improperly aligned with the station platform.
   Design criterion: Doors must be capable of opening only after the train is stopped and properly aligned with the platform, unless an emergency exists (see below).
4. Hazard: Door closes while someone is in the doorway.
   Design criterion: Door areas must be clear before door closing begins.
5. Hazard: A door that closes on an obstruction does not reopen, or a reopened door does not reclose.
   Design criterion: An obstructed door must reopen to permit removal of the obstruction and then automatically reclose.
6. Hazard: Doors cannot be opened for emergency evacuation.
   Design criterion: Means must be provided to open doors anywhere when the train is stopped for emergency evacuation.

Example PHA for ATC Approach Control
1. Hazard: A pair of controlled aircraft violate minimum separation standards.
   Requirements/constraints:
   1a. ATC shall provide advisories that maintain safe separation between aircraft.
   1b. ATC shall provide conflict alerts.
2. Hazard: A controlled aircraft enters an unsafe atmospheric region (icing conditions, windshear areas, thunderstorm cells).
   Requirements/constraints:
   2a. ATC must not issue advisories that direct aircraft into areas with unsafe atmospheric conditions.
   2b. ATC shall provide weather advisories and alerts to flight crews.
   2c. ATC shall warn aircraft that enter an unsafe atmospheric region.

Example PHA for ATC Approach Control (2)
3. Hazard: A controlled aircraft enters restricted airspace without authorization.
   Requirements/constraints:
   3a. ATC must not issue advisories that direct an aircraft into restricted airspace unless avoiding a greater hazard.
   3b. ATC shall provide timely warnings to aircraft to prevent their incursion into restricted airspace.
4. Hazard: A controlled aircraft gets too close to a fixed obstacle or terrain other than a safe point of touchdown on its assigned runway.
   Requirement/constraint: ATC shall provide advisories that maintain safe separation between aircraft and terrain or physical obstacles.
5. Hazard: A controlled aircraft and an intruder in controlled airspace violate minimum separation standards.
   Requirement/constraint: ATC shall provide alerts and advisories to avoid intruders if at all possible.
6. Hazard: Loss of controlled flight or loss of airframe integrity.
   Requirements/constraints:
   6a. ATC must not issue advisories outside the safe performance envelope of the aircraft.
   6b. ATC advisories must not distract or disrupt the crew from maintaining safety of flight.
   6c. ATC must not issue advisories that the pilot or aircraft cannot fly or that degrade the continued safe flight of the aircraft.
   6d. ATC must not provide advisories that cause an aircraft to fall below the standard glidepath or intersect it at the wrong place.
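To make the hazard-to-constraint translation concrete, here is a small Python sketch of train-door logic enforcing several of the design criteria above (no opening while in motion, clear doorway before closing, reopen on obstruction). The interface is invented for illustration.

```python
class TrainDoorController:
    """Sketch of door logic enforcing the design criteria above."""

    def __init__(self):
        self.door_open = False

    def request_open(self, train_moving, aligned, emergency):
        # Never open while the train is in motion; when stopped,
        # open only if aligned with the platform or in an emergency.
        if not train_moving and (aligned or emergency):
            self.door_open = True
        return self.door_open

    def request_close(self, doorway_clear, obstruction):
        if not doorway_clear:
            return False           # door areas must be clear before closing
        if obstruction:
            self.door_open = True  # obstructed door reopens first
            return False           # caller recloses after removal
        self.door_open = False
        return True

doors = TrainDoorController()
assert doors.request_open(train_moving=True, aligned=True, emergency=False) is False
assert doors.request_open(train_moving=False, aligned=True, emergency=False) is True
```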
Classic Hazard Level Matrix

LIKELIHOOD       I Catastrophic   II Critical   III Marginal   IV Negligible
A Frequent       I-A              II-A          III-A          IV-A
B Moderate       I-B              II-B          III-B          IV-B
C Occasional     I-C              II-C          III-C          IV-C
D Remote         I-D              II-D          III-D          IV-D
E Unlikely       I-E              II-E          III-E          IV-E
F Impossible     I-F              II-F          III-F          IV-F

Another Example Hazard Level Matrix

[Figure: a matrix assigning numeric hazard levels (1 through 12) to combinations of severity (I Catastrophic, II Critical, III Marginal, IV Negligible) and likelihood (A Frequent, B Probable, C Occasional, D Remote, E Improbable, F Impossible); the individual cell values are not recoverable from the extraction.]

Hazard Causal Analysis
- Used to refine the high-level safety constraints into more detailed constraints.
- Requires some type of model (even if it exists only in the head of the analyst).
- Almost always involves some type of search through the system design (model) for states or conditions that could lead to system hazards: top-down, bottom-up, forward, or backward.

Forward vs. Backward Search

[Figure: in forward search, each initiating event (A, B, C, D) is traced forward to its final state (W nonhazard, X HAZARD, Y nonhazard, Z nonhazard); in backward search, a hazardous final state is traced backward to the initiating events that can produce it.]

FTA and Software
- Fault tree analysis is appropriate for qualitative analyses, not quantitative ones.
- System fault trees are helpful in identifying potentially hazardous software behavior and can be used to refine the system design constraints.
- FTA can be used to verify code: it identifies any paths from inputs to hazardous outputs, or provides some assurance that they don't exist.
- We are not looking for failures, but for incorrect paths (functions).
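The backward search just described can be sketched concretely: starting from a hazardous state, invert the transition relation and collect every state from which the hazard is reachable (the idea behind the state-machine backward-reachability analysis mentioned later in these notes). The transition relation below is invented for illustration.

```python
from collections import defaultdict

def backward_reachable(transitions, hazard_state):
    """Return all states from which hazard_state is reachable,
    by searching the transition relation backwards."""
    parents = defaultdict(set)            # invert the transition relation
    for src, dst in transitions:
        parents[dst].add(src)

    reached, frontier = set(), [hazard_state]
    while frontier:
        state = frontier.pop()
        for prior in parents[state]:
            if prior not in reached:
                reached.add(prior)
                frontier.append(prior)
    return reached

# Hypothetical model: which initiating states can lead to the hazard?
transitions = [("A", "W"), ("B", "X"), ("C", "X"), ("X", "HAZARD")]
print(backward_reachable(transitions, "HAZARD"))   # B, C, and X
```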
Fault Tree Example

[Figure: fault tree for a pressurized-tank explosion. As best the garbled diagram can be read: the explosion occurs when the pressure is too high and relief valve 1 does not open and relief valve 2 does not open. Relief valve 1 does not open because of a valve failure, OR because the computer does not open valve 1 (computer output too late, sensor failure, or the computer does not issue the command to open valve 1). Relief valve 2 does not open because of a valve failure, OR because the operator does not know to open valve 2 (operator inattentive AND the valve 1 position indicator fails on AND the open indicator light fails on).]

Example Fault Tree for ATC Arrival Traffic (two slides)

[Figure: a fault tree for ATC arrival traffic spanning two slides; the node text is not recoverable from the extraction.]
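A minimal sketch of representing and qualitatively evaluating an AND/OR fault tree like the pressurized-tank example above. The tree structure follows my reading of the garbled figure, so treat it as illustrative rather than a faithful transcription.

```python
# Minimal AND/OR fault tree: a node is an event name (leaf) or
# a tuple ("AND" | "OR", [children]).
def occurs(node, basic_events):
    if isinstance(node, str):
        return node in basic_events
    gate, children = node
    results = (occurs(child, basic_events) for child in children)
    return all(results) if gate == "AND" else any(results)

relief_valve_1_fails = ("OR", [
    "valve 1 failure",
    ("OR", ["computer output too late", "sensor failure",
            "no open command issued"]),
])
relief_valve_2_fails = ("OR", [
    "valve 2 failure",
    ("AND", ["operator inattentive", "position indicator fails on",
             "open indicator light fails on"]),
])
explosion = ("AND", ["pressure too high",
                     relief_valve_1_fails, relief_valve_2_fails])

# Does this combination of basic events lead to the top event?
print(occurs(explosion, {"pressure too high", "valve 1 failure",
                         "valve 2 failure"}))   # True
```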
Requirements Completeness

Most software-related accidents involve software requirements deficiencies: accidents often result from unhandled and unspecified cases.

We have defined a set of criteria to determine whether a requirements specification is complete:
- Derived from accidents and basic engineering principles.
- Validated (at JPL) and used on industrial projects.

Completeness: The requirements are sufficient to distinguish the desired behavior of the software from that of any other undesired program that might be designed.

Requirements Completeness Criteria (2)

How were the criteria derived?
- Mapped the parts of a control loop to a state machine.
- Defined completeness for each part of the state machine (states, inputs, outputs, transitions): mathematical completeness.
- Added basic engineering principles (e.g., feedback).
- Added what we have learned from accidents.

Requirements Completeness Criteria (3)

About 60 criteria in all, including human-computer interaction (we won't go through them all; they are in the book):
- Startup, shutdown
- Mode transitions
- Inputs and outputs
- Value and timing
- Load and capacity
- Environment capacity
- Failure states and transitions
- Human-computer interface
- Robustness
- Data age
- Latency
- Feedback
- Reversibility
- Preemption
- Path robustness

Most are integrated into the SpecTRM-RL language design, or simple tools can check them (see the sketch below).
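As an example of the kind of simple tool check mentioned above, here is a sketch of a mathematical-completeness check for a state-machine specification: every (state, input) pair must have a defined transition, and the gaps are exactly the unhandled cases that completeness criteria target. The toy specification is an invented illustration, not SpecTRM-RL.

```python
from itertools import product

def completeness_gaps(states, inputs, transitions):
    """Return (state, input) pairs with no defined transition,
    i.e., the unspecified cases."""
    defined = {(s, i) for (s, i, _next) in transitions}
    return [pair for pair in product(states, inputs) if pair not in defined]

# Toy cruise-control-like specification (illustrative only).
states = {"off", "maintaining", "braking"}
inputs = {"brake_pressed", "resume", "driver_off"}
transitions = [
    ("maintaining", "brake_pressed", "braking"),
    ("braking", "resume", "maintaining"),
    ("maintaining", "driver_off", "off"),
    ("braking", "driver_off", "off"),
]
for state, event in completeness_gaps(states, inputs, transitions):
    print(f"unspecified: input '{event}' in state '{state}'")
```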
Requirements Analysis
- Model execution, animation, and visualization
- Completeness
- State machine hazard analysis (backward reachability)
- Software deviation analysis
- Human error analysis
- Test coverage analysis and test case generation
- Automatic code generation?

Model Execution and Animation
- SpecTRM-RL models are executable, and model execution is animated.
- The results of execution could be input into a graphical visualization.
- Inputs can come from another model or simulator, and outputs can go into another model or simulator.

Design for Safety
- The software design must enforce the safety constraints.
- It should be possible to trace from the requirements to the code, and vice versa.
- The design should incorporate basic safety design principles.

Safe Design Precedence
1. HAZARD ELIMINATION: substitution, simplification, decoupling, elimination of human errors, reduction of hazardous materials or conditions.
2. HAZARD REDUCTION: design for controllability; barriers (lockins, lockouts, interlocks); failure minimization (safety factors and margins, redundancy).
3. HAZARD CONTROL: reducing exposure, isolation and containment, protection systems and fail-safe design.
4. DAMAGE REDUCTION.
(The original figure annotates the precedence with the axis labels "decreasing cost" and "increasing effectiveness.")

STAMP (Systems Theory Accident Modeling and Processes)

STAMP is a new theoretical underpinning for developing more effective hazard analysis techniques for complex systems.

Accident models provide the basis for:
- Investigating and analyzing accidents
- Preventing accidents: hazard analysis, design for safety
- Assessing risk (determining whether systems are suitable for use)
- Performance modeling and defining safety metrics

Chain-of-Events Models
- Explain accidents in terms of multiple events, sequenced as a forward chain over time.
- The events almost always involve component failure, human error, or an energy-related event.
- Form the basis of most safety-engineering and reliability-engineering analysis (e.g., fault tree analysis, probabilistic risk assessment, FMEA, event trees) and design (e.g., redundancy, overdesign, safety margins).

Chain-of-Events Example

[Figure: event chain for a pressurized-tank rupture. Moisture leads to corrosion; corrosion produces weakened metal; weakened metal under operating pressure leads to tank rupture; fragments are projected; personnel are injured and equipment is damaged. The accompanying annotation text is not recoverable from the extraction.]

Chain-of-Events Example: Bhopal
- E1: Worker washes pipes without inserting a slip blind.
- E2: Water leaks into the MIC tank.
- E3: Explosion occurs.
- E4: Relief valve opens.
- E5: MIC is vented into the air.
- E6: Wind carries the MIC into the populated area around the plant.

Limitations of Event Chain Models: Social and Organizational Factors in Accidents
- Models need to include the social system as well as the technology and its underlying science.
- System accidents
- Software error
[The accompanying figure text is not recoverable from the extraction.]

Limitations of Event Chain Models (2)

Human error:
- Deviation from normative procedure vs. established practice.
- Human behavior cannot be modeled effectively by decomposing it into individual decisions and actions and studying it in isolation from the physical and social context, the value system in which it takes place, and the dynamic work process.

Adaptation:
- Major accidents involve the systematic migration of organizational behavior, under pressure toward cost-effectiveness, in an aggressive, competitive environment.
[Figure: Rasmussen's analysis of the Zeebrugge ferry capsize. Independent decisions in vessel design (stability analysis), harbor design (berth design at Calais and Zeebrugge), cargo management (equipment load added, excess load routines), passenger management (excess numbers), traffic scheduling (transfer of the Herald to Zeebrugge, time pressure), and vessel operation (standing orders, docking procedure, change of docking procedure, crew working patterns, unsafe heuristics) combined to produce impaired stability and the capsizing. Operational decision making: decision makers from separate departments in an operational context very likely will not see the forest for the trees. Accident analysis: the combinatorial structure of possible accidents can easily be identified.]

Ways to Cope with Complexity

Analytic reduction (Descartes):
- Divide the system into distinct parts for analysis purposes, and examine the parts separately.
- Three important assumptions:
  1. The division into parts will not distort the phenomenon being studied.
  2. The components are the same when examined singly as when playing their part in the whole.
  3. The principles governing the assembling of the components into the whole are themselves straightforward.

Ways to Cope with Complexity (con't.)

Statistics:
- Treat the system as a structureless mass with interchangeable parts.
- Use the Law of Large Numbers to describe behavior in terms of averages.
- Assumes the components are sufficiently regular and random in their behavior that they can be studied statistically.

What About Software?
- Too complex for complete analysis: separation into non-interacting subsystems distorts the results, and the most important properties are emergent.
- Too organized for statistics: there is too much underlying structure, which distorts the statistics.

Systems Theory
- Developed for biology (Bertalanffy) and cybernetics (Norbert Wiener).
- For systems that are too complex for complete analysis (separation into non-interacting subsystems distorts the results; the most important properties are emergent) and too organized for statistical analysis.
- Concentrates on the analysis and design of the whole as distinct from the parts (the basis of system engineering).
- Some properties can only be treated adequately in their entirety, taking into account all social and technical aspects. These properties derive from the relationships between the parts of systems: how they interact and fit together.

Systems Theory (2)

Two pairs of ideas. 1. Emergence and hierarchy:
- Levels of organization, each more complex than the one below.
- Levels are characterized by emergent properties, which are irreducible and represent constraints upon the degree of freedom of the components at the lower level.
- Safety is an emergent system property: it is NOT a component property, and it can only be analyzed in the context of the whole.

Systems Theory (3)

2. Communication and control:
- Hierarchies are characterized by control processes working at the interfaces between levels.
- A control action imposes constraints upon the activity at one level of the hierarchy.
- Open systems are viewed as interrelated components kept in a state of dynamic equilibrium by feedback loops of information and control.
- Control in open systems implies the need for communication.
A Systems Theory Model of Accidents
- Safety is an emergent system property.
- Accidents arise from interactions among people, societal and organizational structures, engineering activities, and physical system components that violate the constraints on safe component behavior and interactions.
- Not simply chains of events or linear causality, but more complex types of causal connections.
- Need to include the entire socio-technical system.

STAMP (Systems-Theoretic Accident Model and Processes)
- Based on systems and control theory.
- Systems are not treated as a static design: a socio-technical system is a dynamic process continually adapting to achieve its ends and to react to changes in itself and its environment.
- Preventing accidents requires designing a control structure to enforce constraints on system behavior and adaptation.

STAMP (2)

Views accidents as a control problem:
- e.g., the O-ring did not control propellant gas release by sealing the gap in the field joint.
- The software did not adequately control the descent speed of the Mars Polar Lander.

Events are the result of inadequate control:
- They result from a lack of enforcement of the safety constraints.
- To understand accidents, we need to examine the control structure itself to determine why it was inadequate to maintain the safety constraints and why the events occurred.

[Figure: general socio-technical safety control structure, with parallel hierarchies for system development and system operations. Each hierarchy runs from Congress and legislatures, through government regulatory agencies, industry associations, user associations, unions, insurance companies, and the courts, to company management and project or operations management, and down to the operating process (human controllers, automated controller, actuators, sensors, and the physical process) or to design and documentation, implementation and assurance, manufacturing, and maintenance and evolution. Downward channels carry legislation, regulations, standards, certification, legal penalties, policy, resources, work instructions, and operating procedures; upward channels carry accident and incident reports, operations and maintenance reports, change reports, whistleblower information, status reports, risk assessments, audit and test results, and problem reports.]

Note: the model does not imply the need for a "controller":
- Component failures may be controlled through design (e.g., redundancy, interlocks, fail-safe design) or through process (manufacturing processes and procedures, maintenance procedures).
- But it does imply the need to enforce the safety constraints in some way.
- The new model includes what we do now, and more.

Accidents occur when:
- The design does not enforce the safety constraints (unhandled disturbances, failures, dysfunctional interactions).
- Control actions are inadequate.
- The control structure degrades over time (asynchronous evolution).
- Control actions are inadequately coordinated among multiple controllers.

[Figure: two coordination problems among multiple controllers. Boundary areas: two controllers each control their own process, but the boundary between the processes is unmanaged. Overlap areas: two controllers both control the same process, and the side effects of their decisions and control actions conflict.]
Process Models

[Figure: a human supervisor (controller) holds a model of the process and a model of the automation, commanding an automated controller through controls and reading displays; the automated controller holds a model of the process and a model of the interfaces, commanding the controlled process through actuators and reading measured variables from sensors; the controlled process has process inputs, process outputs, and disturbances.]

Process models must contain:
- The required relationship among the process variables
- The current state (the values of the process variables)
- The ways the process can change state

Relationship between Safety and Process Model

Accidents occur when the models do not match the process and incorrect control commands are given (or correct ones are not given).

How do the models become inconsistent?
- Wrong from the beginning: e.g., uncontrolled disturbances, unhandled process states, inadvertently commanding the system into a hazardous state, unhandled or incorrectly handled system component failures. [Note: these are related to what we called system accidents.]
- Missing or incorrect feedback, so the model is not updated correctly.
- Time lags not accounted for.

This explains most software-related accidents.
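A toy illustration of the process-model point above: the controller issues commands based on its internal model of the process, so a single missed feedback update can produce an incorrect (and potentially hazardous) command even though no component has "failed." All names and numbers are invented.

```python
class TankController:
    """Toy controller whose commands are based on its internal
    process model, not on the process itself."""

    def __init__(self):
        self.modeled_level = 0.0          # controller's process model

    def feedback(self, measured_level):
        self.modeled_level = measured_level

    def command(self):
        # Unsafe if the model has drifted from the true level.
        return "close_inflow" if self.modeled_level >= 10.0 else "open_inflow"

true_level = 9.0
ctrl = TankController()
ctrl.feedback(true_level)        # model matches process
true_level += 2.0                # process changes...
# ...but the feedback message is lost, so the model is now stale:
print(ctrl.command())            # "open_inflow" -- wrong for true_level=11.0
```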
Safety and Human Mental Models

Explains developer errors. Developers may have an incorrect model of:
- the required system or software behavior,
- the development process,
- the physical laws, etc.

Also explains most human-computer interaction problems. Pilots and others are not understanding the automation:
- What did it just do? Why did it do that? What will it do next?
- How did it get us into this state? How do I get it to do what I want?
- Why won't it let us do that? What caused the failure? What can we do so it does not happen again?

Or they do not get the feedback needed to update their mental models, or they disbelieve it.

Validating and Using the Model
- Can it explain (model) accidents that have already occurred?
- Is it useful in accident and mishap investigation, and in preventing accidents (hazard analysis, designing for safety)?
- Is it better for these purposes than the chain-of-events model?

Using STAMP in Accident and Mishap Investigation and Root Cause Analysis

Modeling Accidents Using STAMP

Three types of models are needed:
1. Static safety control structure: safety requirements and constraints, flawed control actions, context (social, political, etc.), mental model flaws, coordination flaws.
2. Dynamic structure: shows how the safety control structure changed over time.
3. Behavioral dynamics: the dynamic processes behind the changes, i.e., why the system changed.

Example: Ariane 5

[Figure: Ariane 5 launcher control structure. The OBC (on-board computer) executes the flight program and controls the nozzles of the solid boosters and the Vulcain cryogenic engine, sending nozzle and main-engine commands. The SRI (inertial reference system) measures the attitude of the launcher and its movements in space, taking horizontal velocity from the strapdown inertial platform and sending diagnostic and flight information to the OBC. The backup SRI performs the same function and takes over if the SRI is unable to send guidance information.]

Ariane 5: A rapid change in attitude and high aerodynamic loads, stemming from a high angle of attack, created aerodynamic forces that caused the launcher to disintegrate 39 seconds after the command for main engine ignition (H0).

Nozzles: Full nozzle deflections of the solid boosters and the main engine led to an angle of attack of more than 20 degrees.

Self-destruct system: Triggered (as designed) by the boosters separating from the main stage, at an altitude of 4 km and 1 km from the launch pad.

OBC (on-board computer):
- Safety constraint violated: Commands from the OBC to the nozzles must not result in the launcher operating outside its safe envelope.
- Unsafe behavior: A control command was sent to the booster nozzles, and later to the main engine nozzle, to make a large correction for an attitude deviation that had not occurred.
- Process model: The model of the current launch attitude was incorrect, i.e., it contained an attitude deviation that had not occurred, resulting in incorrect commands being sent to the nozzles.
- Feedback: Diagnostic information received from the SRI.
- Interface model: Incomplete or incorrect (there is not enough information in the accident report to determine which); it did not account for the diagnostic information from the SRI that is available on the databus.
- Control algorithm flaw: Interpreted diagnostic information from the SRI as flight data and used it for flight control calculations. With both the SRI and the backup SRI shut down, and therefore no possibility of getting correct guidance and attitude information, loss was inevitable.

SRI (inertial reference system):
- Safety constraint violated: The SRI must continue to send guidance information as long as it can get the necessary information from the strapdown inertial platform.
- Unsafe behavior: At 36.75 seconds after H0, the SRI detected an internal error and turned itself off (as it was designed to do) after putting diagnostic information on the bus.
- Control algorithm: Calculates the horizontal bias (an internal alignment variable used as an indicator of alignment precision over time) using the horizontal velocity input from the strapdown inertial platform. Conversion from a 64-bit floating-point value to a 16-bit signed integer led to an unhandled overflow exception while calculating the horizontal bias. The algorithm was reused from Ariane 4, where the horizontal bias variable does not get large enough to cause an overflow.
- Process model: Did not match Ariane 5 (it was based on Ariane 4 trajectory data) and assumed smaller horizontal velocity values than are possible on Ariane 5.

Backup SRI:
- Safety constraint violated: The backup SRI must continue to send guidance information as long as it can get the necessary information from the strapdown inertial platform.
- Unsafe behavior: At 36.75 seconds after H0, the backup SRI detected an internal error and turned itself off (as it was designed to do).
- Control algorithm and process model: Same as the SRI. Because the algorithm was the same in both SRI computers, the overflow resulted in the same behavior: the backup also shut itself off.
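The SRI flaw is easy to reproduce in miniature: converting a 64-bit floating-point value to a 16-bit signed integer overflows once the value exceeds 32767, which Ariane 4 trajectories never produced and Ariane 5 trajectories did. A Python sketch (the actual on-board code was Ada; the names and values here are illustrative):

```python
import struct

INT16_MAX = 32767   # largest value a 16-bit signed integer can hold

def to_int16_unprotected(value: float) -> int:
    """Mimics an unguarded 64-bit float -> 16-bit signed int conversion;
    raises (like the Ariane 5 operand error) when the value doesn't fit."""
    result = int(value)
    struct.pack("h", result)     # struct.error if outside the int16 range
    return result

horizontal_bias_ariane4 = 12_345.0   # fits: Ariane 4 trajectories stayed small
horizontal_bias_ariane5 = 50_000.0   # Ariane 5's higher horizontal velocity

to_int16_unprotected(horizontal_bias_ariane4)   # fine
try:
    to_int16_unprotected(horizontal_bias_ariane5)
except struct.error:
    # The real SRI treated this as a fault and shut itself down --
    # and the identical backup SRI failed in exactly the same way.
    print("overflow: value does not fit in a 16-bit signed integer")
```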
Example: Titan 4/Centaur/Milstar

[Figure: safety control structure for the Titan 4/Centaur/Milstar launch. Development side: the Space and Missile Systems Center Launch Directorate (SMC) and the Defense Contract Management Command (responsible for administration of the LMA contract: contract administration, software surveillance, overseeing the process); the prime contractor LMA (responsible for design and construction of the flight control system), with LMA system engineering, LMA quality assurance, software design and development, and the LMA FAST Lab (system test of the INU); Analex performing IV&V (Analex-Cleveland verifying the design, Analex Denver performing IV&V of the flight software); Honeywell (IMS software); and Aerospace Corp. monitoring software development and test. Operations side: the Third Space Launch Squadron (3SLS), responsible for ground operations management at CCAS. The remaining annotation text is not recoverable from the extraction.]

Example: public water system

[Figure: static safety control structure for a municipal water system (apparently the Walkerton, Ontario contamination accident), showing the system hazard and system safety constraints and the control relationships among the public, the provincial government and its ministries, the local utility and its operations, and the water system itself. Most of the node and edge text is not recoverable from the extraction; legible fragments include the design flaws "shallow well location" and "no chlorinator" and the environmental conditions "porous bedrock," "minimal overburden," and "heavy rains."]
Dynamic Structure

[Figure: the same water-system control structure redrawn to show how it changed over time; the node text is largely not recoverable from the extraction beyond the same design flaws (no chlorinator, shallow location) and environmental conditions (porous bedrock, minimal overburden, heavy rains).]

Modeling Behavioral Dynamics

[Figure: a system dynamics model of the processes behind the changes in the safety control structure; the variable and loop labels are not recoverable from the extraction.]

A (Partial) System Dynamics Model of the Columbia Accident

[Figure: interacting feedback loops relating budget cuts (and budget cuts directed toward safety), the rate of safety increase, external pressure, performance pressure, expectations, complacency and its rate of increase, perceived safety, risk, launch rate, success and success rate, accident rate, system safety efforts, and the priority of the safety programs.]

Steps in a STAMP analysis:
1. Identify: the system hazards; the system safety constraints and requirements; the control structure in place to enforce the constraints.
2. Model the dynamic aspects of the accident: the changes to the static safety control structure over time; the dynamic processes in effect that led to the changes.
3. Create the overall explanation for the accident: the inadequate control actions and decisions; the context in which the decisions were made; mental model flaws; control flaws (e.g., missing feedback loops); coordination flaws.
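To convey the flavor of the behavioral-dynamics models above, here is a toy system-dynamics loop in the spirit of the Columbia model: continued success raises perceived safety, which erodes the priority of the safety programs, which lets actual risk drift upward. The equations and coefficients are invented for illustration and are not from the slides.

```python
# Toy stock-and-flow simulation (Euler integration, dt = 1 time unit).
perceived_safety = 0.5   # stocks, each normalized to the range 0..1
safety_priority = 0.9
risk = 0.1

for t in range(20):
    # Success (absence of accidents) slowly raises perceived safety...
    perceived_safety += 0.05 * (1 - risk) * (1 - perceived_safety)
    # ...which, under performance pressure, erodes safety priority...
    safety_priority += 0.1 * (0.5 - perceived_safety)
    safety_priority = min(max(safety_priority, 0.0), 1.0)
    # ...and lower safety effort lets actual risk drift upward.
    risk += 0.05 * (1 - safety_priority) - 0.02 * safety_priority
    risk = min(max(risk, 0.0), 1.0)
    if t % 5 == 0:
        print(f"t={t:2d} perceived_safety={perceived_safety:.2f} "
              f"priority={safety_priority:.2f} risk={risk:.2f}")
# Perceived safety climbs while actual risk drifts up: the gap between
# the two is the drift toward ineffectiveness that STAMP tries to detect.
```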
STAMP vs. Traditional Accident Models
- Examines interrelationships rather than linear cause-effect chains.
- Looks at the processes behind the events.
- Includes the entire socio-economic system.
- Includes behavioral dynamics (changes over time).
- We want not merely to react to accidents and impose controls for a while, but to understand why controls drift toward ineffectiveness over time, to change those factors if possible, and to detect the drift before accidents occur.

Using STAMP to Prevent Accidents
- Hazard analysis
- Safety metrics and performance auditing
- Risk assessment

STAMP-Based Hazard Analysis (STPA)
- Provides information about how the safety constraints could be violated.
- Used to eliminate, reduce, and control hazards in system design, development, manufacturing, and operations.
- Assists in designing safety into the system from the beginning, not just after-the-fact analysis.
- Includes software, operators, system accidents, management, and regulatory authorities.
- Can use a concrete model of control (SpecTRM-RL) that is executable and analyzable.

STPA - Step 1: Identify the hazards and translate them into high-level requirements and constraints on behavior.

TCAS hazards:
1. A near mid-air collision (NMAC): a pair of controlled aircraft violate minimum separation standards.
2. A controlled maneuver into the ground.
3. Loss of control of the aircraft.
4. Interference with other safety-related aircraft systems.
5. Interference with the ground-based ATC system.
6. Interference with ATC safety-related advisories.

STPA - Step 2: Define the basic control structure.

[Figure: TCAS control structure. The FAA sits above airline operations management and local ATC operations management. The air traffic controller, using radar and a flight data processor, provides advisories to aircraft by radio. On each aircraft, the pilot receives displays and aural alerts from TCAS, sets its operating mode, and flies the aircraft; TCAS exchanges own and other aircraft information with the TCAS units on other aircraft.]

STPA - Step 3: Identify potential inadequate control actions that could lead to a hazardous process state.

In general:
1. A required control action is not provided.
2. An incorrect or unsafe control action is provided.
3. A potentially correct or adequate control action is provided too late (at the wrong time).
4. A correct control action is stopped too soon.

For the NMAC hazard:

TCAS:
1. The aircraft are on a near-collision course and TCAS does not provide an RA (resolution advisory).
2. The aircraft are in close proximity and TCAS provides an RA that degrades vertical separation.
3. The aircraft are on a near-collision course and TCAS provides an RA too late to avoid an NMAC.
4. TCAS removes an RA too soon.

Pilot:
1. The pilot does not follow the resolution advisory provided by TCAS (does not respond to the RA).
2. The pilot incorrectly executes the TCAS resolution advisory.
3. The pilot applies the RA, but too late to avoid the NMAC.
4. The pilot stops the RA maneuver too soon.
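Step 3 can be partially mechanized by crossing each control action with the four generic inadequate-control-action types to produce a candidate list for analysis, which is how the TCAS and pilot lists above are structured. A small sketch (the controller and action names are just examples):

```python
from itertools import product

GENERIC_FLAWS = [
    "not provided when required",
    "provided but incorrect or unsafe",
    "provided too late (at the wrong time)",
    "stopped too soon",
]

def candidate_inadequate_actions(controllers):
    """Cross each (controller, control action) with the four generic
    inadequate-control-action types; each row is a prompt for analysis."""
    rows = []
    for controller, actions in controllers.items():
        for action, flaw in product(actions, GENERIC_FLAWS):
            rows.append((controller, action, flaw))
    return rows

controllers = {
    "TCAS": ["provide resolution advisory (RA)"],
    "Pilot": ["execute RA maneuver"],
}
for controller, action, flaw in candidate_inadequate_actions(controllers):
    print(f"{controller}: '{action}' {flaw}")
```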
STPA - Step 4: Determine how the potentially hazardous control actions could occur; eliminate them from the design, or control or mitigate them in the design or in operations.

In general:
- Can use a concrete model in SpecTRM-RL.
- Assists with communication and with the completeness of the analysis.
- Provides a continuous simulation and analysis environment to evaluate the impact of faults and the effectiveness of mitigation features.

Step 4a: Augment the control structure with process models for each control component.
Step 4b: For each inadequate control action, examine the parts of the control loop to see whether they could cause it.
- Guided by a set of generic control loop flaws.
- Where a human or organization is involved, must evaluate the context in which decisions are made and the behavior-shaping mechanisms (influences).
Step 4c: Consider how the designed controls could degrade over time.

[Figure: part of the SpecTRM-RL process model for TCAS. Inputs from own aircraft include radio altitude and its status, aircraft altitude limit, barometric altitude and altimeter status, the various climb-inhibit discretes, own Mode S address, air status, altitude rate, and traffic display settings. The own-aircraft model includes state variables such as Climb Inhibit and Descent Inhibit {Inhibited, Not Inhibited, Unknown}, Increase Climb/Descent Inhibit, Altitude Reporting, air status {On Ground, Airborne, Unknown}, Sensitivity Level {1..7}, Current RA Sense {None, Climb, Descend, Unknown}, RA strength and vertical speed limits (VSL 0/500/1000/2000, Nominal 1500, Increase 2500), Altitude Layer {1..4, Unknown}, and Crossing and Reversal status; the Unknown value appears to be taken at system start or when a fault is detected. The model of each other aircraft (up to 30) includes its classification {Other Traffic, Proximate Traffic, Potential Threat, Threat, Unknown}, RA sense, bearing and altitude with validity flags, range, Mode S address, sensitivity level, and equippage.]

STPA - Step 4b: Examine the control loop for the potential to cause the inadequate control actions.

[Figure: the generic control loop from the Process Models slide, repeated as the framework for this examination.]

Inadequate control actions (enforcement of constraints):
- The design of the control algorithm (process) does not enforce the constraints.
- The process models are inconsistent, incomplete, or incorrect (lack of linkup):
  - Flaw(s) in the creation or updating process
  - Inadequate or missing feedback: not provided in the system design, a communication flaw, or inadequate sensor operation (incorrect or no information provided)
  - Time lags and measurement inaccuracies not accounted for
- Inadequate coordination among controllers and decision makers (boundary and overlap areas).

Inadequate execution of the control action:
- Communication flaw
- Inadequate "actuator" operation
- Time lag

STPA - Step 4c: Consider how the designed controls could degrade over time.
- E.g., specified procedures ==> effective procedures.
- Use system dynamics models?
- Use the information to design protection against changes, e.g.: operational procedures; controls over changes and maintenance activities; auditing procedures and performance metrics; management feedback channels to detect unsafe changes.
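A minimal sketch of the SpecTRM-RL process-model style shown in the figure above: enumerated state variables default to Unknown at startup so the controller cannot act on information it does not yet have. The variables are a tiny subset of the TCAS model, and the Python rendering is my own, not SpecTRM-RL syntax.

```python
from enum import Enum

class AirStatus(Enum):
    ON_GROUND = "on ground"
    AIRBORNE = "airborne"
    UNKNOWN = "unknown"      # value at startup or after a detected fault

class RASense(Enum):
    NONE = "none"
    CLIMB = "climb"
    DESCEND = "descend"
    UNKNOWN = "unknown"

class OwnAircraftModel:
    """Tiny subset of the TCAS own-aircraft process model; every
    variable starts Unknown until feedback establishes its value."""

    def __init__(self):
        self.air_status = AirStatus.UNKNOWN
        self.current_ra_sense = RASense.UNKNOWN

    def may_issue_ra(self):
        # Do not act on an unknown process state -- a startup
        # completeness criterion from earlier in these notes.
        return self.air_status == AirStatus.AIRBORNE

model = OwnAircraftModel()
print(model.may_issue_ra())      # False: state still Unknown at startup
model.air_status = AirStatus.AIRBORNE
print(model.may_issue_ra())      # True once feedback updates the model
```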
Comparisons with Traditional HA Techniques
- Top-down (vs. bottom-up, like FMECA).
- Considers more than just component failures and failure events.
- Provides guidance in doing the analysis (vs. FTA).
- Handles dysfunctional interactions, software, management, etc.
- Uses a concrete model, not one just in the analyst's head; not the physical structure (as in HAZOP), but the control (functional) structure.
- Uses a general model of inadequate control: HAZOP guidewords are based on a model of accidents being caused by deviations in system variables; STPA includes the HAZOP model but is more general.
- Compared with the TCAS II fault tree (MITRE), the STPA results were more comprehensive.