Developing Safety Critical Software: Fact and Fiction
John A McDermid

Overview
– Fact – costs and distributions
– Fiction – get the requirements right
– Fiction – get the functionality right
– Fiction – abstraction is the solution
– Fiction – safety critical code must be "bug free"
– Some key messages

Part 1
– Fact – costs and distributions
– Fiction – get the requirements right

Costs and Distributions
– Examples of industrial experience
  – A specific example
  – Some more general observations
– The example covers
  – Cost by phase
  – Where errors are introduced
  – Where errors are detected, and the relationships between the two

Effort/Cost by Phase
[Pie chart: effort/cost by phase, from system specification via software engineering to system integration]
– System Specification 25%
– System Integration 17%
– Low Level Software Test 17%
– Software Implementation 10%
– Management 8%
– Reviews and Inspections 8%
– Software Integration Test 7%
– Software Design 3%
– Other Software 3%
– Software Static Analysis 1%
– Hardware Software Integration 1%

Error Introduction
[Chart: errors raised against the phase in which they were introduced – user requirements and system requirements – with document traceability down to hardware and software; errors are classified as FE (functional effect), Min FE (minor functional effect, typically a data change) and No FE (no functional effect)]

Errors Raised by Phase
[Bar chart: errors raised in each of the phases on the pie chart – software reviews and inspections, software implementation, low level software testing, hardware/software integration, software integration, system integration, pre-flight testing, flight test and airframe testing]

Finding Requirements Errors
– Requirements testing tends to find requirements errors
[Chart: errors raised – FE, Min FE and No FE – peak at system validation]

Result – High Development Cost
[Chart: requirements errors with functional effect, minor functional effect and no functional effect are introduced early in development, but are not found until system validation]
– And this is after following a safety critical development process

Software and Money
– Typical productivity
  – 5 lines of code (LoC) per person day, i.e. about 1 kLoC per person year
  – Measured from requirements to the end of module test
– Typical avionics "box"
  – 100 kLoC, so 100 person years of effort
  – Circa £10M for the software – so £500M on a modern aircraft?
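Unpacking the slide's arithmetic (the loaded cost of roughly £100k per person-year is implied by the figures rather than stated, and the final step is only an inference from the quoted totals):

    5 LoC/day x ~200 working days/year ~= 1 kLoC per person-year
    100 kLoC / (1 kLoC per person-year) = 100 person-years ~= £10M
    £500M / £10M per box  =>  on the order of 50 software-intensive units per aircraft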
US Aircraft Software Dependence
[Chart: percentage of aircraft functions performed in software against year of introduction, rising from under 10% for the F4 (1960), through the A-7, F-111, F-15 and F-16, to the B-2 and about 80% for the F-22 (2000)]
– DoD Defense Science Board Task Force on Defense Software, November 2000

Increasing Dependence
– Software is often the determinant of function
– Software operates autonomously
  – Without opportunity for human intervention, e.g. Mercedes Brake Assist
– Software is affected by other changes
  – e.g. a new weapons fit on Eurofighter
– Software has high levels of authority
  – Inappropriate centre of gravity (CofG) control in the fuel system can reduce the fatigue life of the wings

Growing Dependency
– The problem is growing
  – Software is now about a third of aircraft development costs
  – It is an increasing proportion of car development – around 25% of the capital cost of a new car is in electronics
  – The problem is made more visible by the rate of improvement in tools for "mainstream" software development

Growth of Airborne Software
[Chart: airborne code size (kLoC, log scale) against in-service date, rising by several orders of magnitude from 1980 to a projected 2014 entry into service – approximately £1.5B of software at current productivity and costs]

The Problem – Size Matters
[Chart: probability of a software project being cancelled against project size in function points, rising steeply towards 50% at around 10,000 function points; 1 function point = 80 SLOC of Ada, or 128 SLOC of C]
– Capers Jones, Becoming Best in Class, Software Productivity Research, 1995 briefing

Is Software Safety an Issue?
– Software has a good track record
  – A few high profile accidents: Therac 25, Ariane 501, Cali (strictly data, not software)
  – An analysis of 1,100 "computer related deaths" attributed only 34 to software
– Chinook – Mull of Kintyre
  – Was this caused by FADEC software?

But Don't be Complacent
– Many instances of "pilot error" are system assisted
– Software failures typically leave no trace
– Software complexity and authority are increasing
– Software safety can't be measured (there is no agreement on how)
– Commercial software is unreliable
– Safety critical software is costly

Summary
– Safety critical software is a growing issue
  – Software-based systems are the dominant source of product differentiation
  – Starting to become a major cost driver
  – Starting to become the drive (drag) on product development – can't cancel, have to keep on spending!
– But it is not a major contributor to fatal accidents, although there have been many incidents

Requirements Fiction
– Fiction stated
  – Get the requirements right, and the development will be easy
– Facts
  – Getting requirements right is difficult
  – Requirements are the biggest source of errors
  – Requirements change
  – Errors occur at organisational boundaries

Embedded Systems
– A computer system embedded in a larger engineering system
– Requirements come from
  – "Flow down" from the system
  – Design decisions (commitments)
  – Safety and reliability analyses – derived safety requirements (DSRs)
  – Fault management/accommodation – as much as 80% of the requirements for control applications

Almost Everything on One Picture (1)
– The platform: REQ = a restriction on NAT
  – Control loops, high level modes, end to end response times, etc.
– REQ specifies what the system must do, stated mainly in terms of inputs from and outputs to the platform, as directed by commands from the user/operator (if any)
– NB based on Parnas' four variable model

Almost Everything on One Picture (2)
– Physical decomposition of the system into sensors (S1, S2, S3) and actuators (A1) plus the controller, with a control interface
– SOFTREQ specifies what the control software must do
– REQ = IN SOFTREQ OUT
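The slide's equation reads most naturally as relational composition over Parnas' four variable model. A hedged reconstruction (the variable names m, i, o and c come from the model, not from the slides):

    REQ = IN ; SOFTREQ ; OUT

    IN      : monitored variables (m) -> software inputs (i)      [sensors]
    SOFTREQ : software inputs (i)     -> software outputs (o)     [the control software]
    OUT     : software outputs (o)    -> controlled variables (c) [actuators]

That is, software satisfying SOFTREQ, wrapped in the sensor and actuator mappings, must implement REQ over the platform's monitored and controlled variables.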
Almost Everything on One Picture (3)
– Functional decomposition of the software: mapping of the control functions onto a generic architecture
  – Input Fn, including signal validation
  – Output Fn, including loop closing
  – Resting on a HAL (hardware abstraction layer)
– SOFTREQ = I/P SPEC O/P – a redefinition of SOFTREQ allowing for digitisation noise, sensor management and actuator dynamics

Almost Everything on One Picture (4)
– Controller structure: physical decomposition of the controller
  – Application, data selection, FMAA (fault management and accommodation) and HAL layers, behind the control interface
– This defines the FMAA structure

Types of Layer
– Some layers have design meaning
  – Abstraction from the computing hardware – the "system" HAL: time in ms from a reference, or ... – not interrupts or bit patterns from the clock hardware
  – Above the HAL: "raw" sensed values, e.g. pressure in psia – not bit patterns from analogue to digital converters
  – FMAA to Application: validated values of platform properties
– Layers may also have computational meaning, e.g. a call to the HAL forces a scheduling action

Commitments
– Development proceeds via a series of commitments
  – A design decision which can only be revoked at significant cost
  – Often associated with an architectural decision or a choice of component: use of triplex redundancy, choice of pump, power supply, etc.
  – Commitments can be functional or physical; it is most common to make physical commitments

Derived Requirements
– Commitments introduce derived requirements (DRs)
  – The choice of pump gives DRs for the control algorithm and iteration rate, plus requirements for initialisation, etc.
  – There are also derived safety requirements (DSRs), e.g. detection and management of sensor failure

System Level Requirements
– Allocated requirements
  – System level requirements which come from the platform
  – May be (slightly) modified due to design commitments, e.g.
    – Platform: control engine thrust to within ±0.5% of that demanded
    – System: control EPR or N1 to within ±0.5% of that demanded

Stakeholder Requirements
– Direct requirements from stakeholders, e.g.
  – The radar shall be able to detect targets travelling at up to Mach 2.5 at 200 nautical miles, with 98% probability
– In principle allocated from the platform; in practice often stated in system terms
– Need to distinguish legitimate requirements from "solutioneering"
  – Legitimacy depends on the stakeholder, e.g. CESG and cryptos

Requirements Types
– Main requirements types
  – Invariants, e.g. forward and reverse thrust will not be commanded at the same time (sketched in code below)
  – Functional – transform inputs to outputs, e.g. thrust demand from thrust-lever resolver angle
  – Event response – action on an event, e.g. activate ATP on passing a signal at danger
  – Non-functional (NFR) – constraints, e.g. timing, resource usage, availability
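An invariant of this kind typically becomes a monitored check in code. A minimal sketch in C (the names and the bare-bones check are illustrative, not from the slides; a real controller would route a violation into a defined fault response rather than just report it):

    #include <stdbool.h>

    /* Hypothetical thrust command state. */
    typedef struct {
        bool forward_thrust_cmd;
        bool reverse_thrust_cmd;
    } ThrustCmd;

    /* Invariant: forward and reverse thrust shall not be
     * commanded at the same time. */
    static bool thrust_invariant_ok(const ThrustCmd *cmd)
    {
        return !(cmd->forward_thrust_cmd && cmd->reverse_thrust_cmd);
    }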
Changes to Types
– Note that requirements types can change, e.g. NFR to functional
  – System: achieve fewer than 10^-5 unsafe failures per hour
  – Software: detect failure modes x, y and z of the pressure sensor P30 with 99% coverage, and mitigate by ...
– Requirements notations/methods must be able to reflect these types

Requirements Challenges
– Even if the system requirements are clear, the software requirements
  – Must deal with quantisation (sensors)
  – Must deal with temporal constraints (iteration rates, jitter)
  – Must deal with failures
– System requirements are often tricky in themselves
  – Open-loop control under failure
  – Incomplete understanding of the physics

Requirements Errors
– Project data suggests
  – Typically more than 70% of errors found after unit test are requirements errors
  – F22 (and other data sets) put requirements errors at 85%
– Finding errors drives change
  – The later they are found, the greater the cost
  – Some data, e.g. F22: 3 LoC written for every one delivered

The Certainty of Change
[Chart: cumulative % change per module, ranging up to 300% – so some code may be verified three times! Overall change is around 20%]
– Change is mainly due to requirements errors
– The majority of modules are stable
– Cumulative change is high cost, due to reverification in the presence of dependencies

Requirements and Organisations
– Requirements errors are often based on misinterpretations ("it's obvious that ...")
  – Thus errors are more likely to happen at organisational/cultural boundaries: systems to software, safety to software, ...
– A study at NASA by Robyn Lutz found that 85% of requirements errors arose at organisational boundaries

Summary
– Getting requirements right is a major challenge
  – Software is deeply embedded – discretisation, timing, etc. are an issue
  – The physics is not always understood
– Requirements (genuinely) change
– The notion that one can get the requirements right is simplistic; the notion of "correct by construction" is optimistic

Part 2
– Fiction – get the functionality right
– Fiction – abstraction is the solution
– Fiction – safety critical code must be "bug free"
– Some key messages

Functionality Fiction
– Fiction stated
  – Get the functionality right, and the rest is easy
– Facts
  – Functionality doesn't drive the design: non-functional requirements (NFRs) are critical, and functionality isn't independent of NFRs
  – Fault management is a major aspect of complexity

Functionality and Design
– Functionality
  – System functions allocated to software – the elements of REQ which end up in SOFTREQ (NB, most of them)
  – At the software level, the requirements have to allow for the properties of sensors, etc.
– Consider an aero engine example

Engine Pressure Sensor
– The aero engine measures P0, atmospheric pressure
  – A key input to fuel control, etc.
– Example input, P0Sens
  – A byte from the A/D converter
  – Resolution: 1 bit = 0.055 psia; base = 2 psia, 0 = low (high value about 16 psia)
  – Update rate = 50 ms

Pressure Sensing Example
– Simple requirement
  – Provide a validated P0 value to other functions and the aircraft
– Output data item, P0Val: 16 bits
  – Resolution: 1 bit = 0.00025 psia; base = 0, 0 = low (high value about 16.4 psia)

Example Requirements
– Simple functional requirement
  – RS1: P0Val shall be provided within 0.03 bar of the sensed value
  – R1: P0Val = P0Sens [± 0.03] (software level)
– Note the simple algorithm (sketched in code below): P0Val = (P0Sens * 0.055 + 2) / 0.00025
  – P0Sens = 0 → P0Val = 8000 = 0001 1111 0100 0000 binary
  – P0Sens = 1111 1111 (i.e. 16.025 psia) → P0Val = 64100 = 1111 1010 0110 0100 binary
– Does R1 meet RS1? Does the algorithm meet R1?
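A minimal sketch of that algorithm in C, assuming the stated formats (8-bit input at 0.055 psia/bit with a 2 psia base; 16-bit output at 0.00025 psia/bit with a 0 base). A real controller would more likely use fixed-point arithmetic with an analysed error bound than floating point:

    #include <stdint.h>

    static uint16_t p0_scale(uint8_t p0_sens)
    {
        double psia = 2.0 + (double)p0_sens * 0.055;  /* raw byte -> psia */
        return (uint16_t)(psia / 0.00025 + 0.5);      /* psia -> output format, rounded */
    }

    /* p0_scale(0)    == 8000   (0x1F40)
     * p0_scale(0xFF) == 64100  (0xFA64) */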
A Non-Functional Requirement
– Assume duplex sensors: P0Sens1 and P0Sens2
– System level
  – RS2: no single point of failure shall lead to loss of function (assume P0Val is covered by this requirement)
  – This will be a safety or availability requirement
– NB in practice there may be different sensors wired to different channels, with cross channel comms

Software Level NFR
– Software level
  – R2: if |P0Sens1 - P0Sens2| < 0.06 then P0Val = (P0Sens1 + P0Sens2) / 2, else P0Val = 0
– Is R2 a valid requirement? In other words, have we stated the right thing?
– Does R2 satisfy RS2?

Temporal Requirements
– Timing is often an important system property
  – It may be a safety property, e.g. sequencing in weapons release
– System level
  – RS3: the validated pressure value shall never lag the sensed value by more than 100 ms
  – NB such requirements are not uncommon, to ensure quality of control

Software Level Timing
– Software level requirement, assuming scheduling on 50 ms cycles
  – R3: P0Val(t) = P0Sens(t-2) [± 0.03], where t is quantised in units of 50 ms, representing cycles
– Is R3 a valid requirement? Does R3 satisfy RS3?
– NB data on processor timing is needed to validate this

Timing and Safety
– Software level (R4 is sketched in code below)
  – R4: if |P0Sens1(t) - P0Sens2(t)| < 0.06
        then P0Val(t+1) = (P0Sens1(t) + P0Sens2(t)) / 2
        else if |P0Sens1(t) - P0Sens1(t-1)| < |P0Sens2(t) - P0Sens2(t-1)|
        then P0Val(t+1) = P0Sens1(t)
        else P0Val(t+1) = P0Sens2(t)
– What does R4 respond to (can you think of an RS4)?
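A hedged rendering of R4 in C (values are assumed to be already converted to psia, with one call per 50 ms cycle; the two-sample history arrays and the names are illustrative, and the result is published at cycle t+1 as R4 requires):

    #include <math.h>

    /* s1, s2: [0] = current sample (t), [1] = previous sample (t-1). */
    static double p0_select(const double s1[2], const double s2[2])
    {
        if (fabs(s1[0] - s2[0]) < 0.06) {
            /* Sensors agree: use the average. */
            return (s1[0] + s2[0]) / 2.0;
        } else if (fabs(s1[0] - s1[1]) < fabs(s2[0] - s2[1])) {
            /* Sensors disagree: prefer the one changing more slowly. */
            return s1[0];
        } else {
            return s2[0];
        }
    }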
Requirements Validation
– Is R4 a valid requirement?
  – Is R4 "safe" in the system context? (Assume that misleading values of P0 could lead to a hazard, e.g. a thrust roll-back on take off)
– Does R4 satisfy RS3? Does R4 satisfy RS2? Does R4 satisfy RS1?

Real Requirements
– The example is still somewhat simplistic
  – It needs to store sensor state, i.e. knowledge of what has failed
– Typically timing, safety, etc. drive the detailed design
  – Aspects of the requirements, e.g. error bands, depend on the timing of the code
  – Requirements involve trade-offs between, say, safety and availability

Requirements and Architecture
– NFRs also drive the architecture
  – Failure rate 10^-6 per hour: probably just duplex (especially if fail stop), with functions for cross comms and channel change
  – Failure rate 10^-9 per hour: probably triplex or quadruplex, with changes in redundancy management
– NB a change in failure rate affects low level functions

Quantification
– The "system level" functionality is in the minority
  – Typically over half the software is fault management
  – Eurofighter example: the FCS is 1/3 MLoC, of which the control laws are 18 kLoC
– Note: fault management is very hard to validate
  – The 777 flight incident in Australia was due to an error in fault management, and a software change

Boeing 777 Incident near Perth
– The problem was caused by the Air Data Inertial Reference Unit (ADIRU)
  – The software contained a latent fault which was revealed by a change
– June 2001: accelerometer #5 fails with erroneously high output values; the ADIRU discards its output values
– A power cycle of the ADIRU occurs each time the aircraft electrical system is restarted
– Aug 2006: accelerometer #6 fails, and the latent software error allows use of the previously failed accelerometer #5

Summary
– Functionality is important, but it is not the primary driver of the design
– Key drivers of design
  – Safety and availability – which turn into fault management at the software level
  – Timing behaviour
– Functionality is not independent of the NFRs – requirements change to reflect them

Abstraction Fiction
– Fiction stated
  – Careful use of abstraction will address the problems of requirements, etc.
– Facts
  – Most forms of abstraction don't work in embedded control systems
  – State abstraction is of some use
  – The devil is in the detail

Data Abstraction
– Most data is simple
  – Boolean, integer, floating point
  – Complex data structures are rare – they may exist in a maintenance subsystem (e.g. records of fault events)
– Systems engineers work in low-level terms, e.g. pressures, temperatures, etc.
  – Hence requirements are in these terms, and the control models are low level

Looseness
– A key objective is to ensure that requirements are complete, i.e. specify behaviour under all conditions
  – Normal behaviour (everything working)
  – Fault conditions – single faults, and combinations
  – Impossible conditions
– So that the design is robust against incompletely understood requirements/environment

Despatch Requirements
– A system can be despatched (used) "carrying" failures
  – Despatch analysis is based on a Markov model
  – Evaluate the probability of being in a non-despatchable state, e.g. only one failure from hazard
– This links the safety/availability process to the software design

Fault Management Logic
– Fault-accommodation requirements may use a four valued logic
  – Working (w), undetected (u), detected (d) and confirmed (c)
  – Used for analysis
– The table illustrates "logical and" ([.]):

    [.] | w  u  d  c
    ----+-----------
     w  | w  u  d  c
     u  | u  u  d  c
     d  | d  d  d  c
     c  | c  c  c  c

Example Implementation
– An implementation using three of the values (w, d, c):

    [.] | w  d  c
    ----+--------
     w  | w  d  c
     d  | d  d  c
     c  | c  c  c
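Under the ordering w < u < d < c, both tables are simply "take the worse status", which makes the implementation trivial. A minimal sketch in C (the enum and function names are illustrative):

    /* w = working, u = undetected, d = detected, c = confirmed. */
    typedef enum { FS_W = 0, FS_U = 1, FS_D = 2, FS_C = 3 } FaultStatus;

    /* Four valued "logical and": the worse of the two statuses.
     * Reproduces every entry of the 4x4 table above. */
    static FaultStatus fs_and(FaultStatus a, FaultStatus b)
    {
        return (a > b) ? a : b;
    }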
State Abstraction
– Some state abstraction is possible – mainly from low-level state to operational modes
– Aero engine control
  – Want to produce thrust proportional to demand (thrust lever angle in the cockpit)
  – Can't measure thrust directly, but can use various "surrogates" for thrust
  – Work with the best value available, with reversionary modes

Thrust Control
– Engine pressure ratio (EPR) – the ratio between the atmospheric and exhaust pressures
  – The best approximation to thrust
  – Depends on P0
– Low level state models the "health" of the P0 sensor
  – If P0 fails, revert to using N1 (fan speed)
  – The control modes – EPR, N1, etc. – abstract away from the details of the sensor fault state (see the sketch below)
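This mode reversion is a small state machine over abstracted sensor health. A minimal sketch in C (names are illustrative; real mode logic would also handle failure confirmation, hysteresis and further reversionary levels):

    #include <stdbool.h>

    typedef enum { MODE_EPR, MODE_N1 } ControlMode;

    /* p0_healthy: abstracted health of the P0 sensor,
     * as reported by fault management. */
    static ControlMode select_thrust_mode(bool p0_healthy)
    {
        /* EPR is the best surrogate for thrust but depends on P0;
         * if P0 has failed, revert to the N1 (fan speed) mode. */
        return p0_healthy ? MODE_EPR : MODE_N1;
    }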
Summary
– The opportunity for abstraction is much more limited than in "IT" systems
  – This hinders many classical approaches
– Abstraction is of some value
  – Mainly state abstraction, relating low-level state information (e.g. sensor "health") to system level control modes
– NB formal refinement, à la B, is helped by this, as there is little data refinement

"Bug Free" Fiction
– Fiction stated
  – Safety critical code must be "bug free"
– Facts
  – It is hard to correlate fault density and failure rate
  – Fewer than 1 fault per kLoC is pretty good!
  – Being "bug free" is unrealistic, and there is a need to "sentence" faults

Close to Fault Free?
– DO-178A Level 1 software (an engine controller) – now it would be DAL A
  – Natural language specifications and macro-assembler
  – Over 20,000,000 hours without hazardous failure
  – But on version 192 (the last time I knew) – changes to "trims" to reflect hardware properties

Pretty Buggy
– DO-178B Level A software (an aircraft system)
  – Natural language, control diagrams and a high level language
  – 118 "bugs" found in the first 18 months, 20% of them critical
  – Flight incidents, but no accidents
  – Informally "less safe" than the other example, but still flying, still no accidents

Fault Density
– So far as one can get data
  – Fewer than 1 flaw per kLoC for safety critical software is pretty good
  – Commercial software is much worse – perhaps as high as 30 faults per kLoC
  – Some "extreme" cases: Space Shuttle, 0.1 per kLoC; a Praxis system, 0.04 per kLoC
– But will a hazardous situation actually arise?

Faults and Failures
– Why doesn't software "crash" more often?
  – Paths miss "bugs", as they don't get the critical data
  – Testing "cleans up" the common paths
  – There are also "subtle faults" which don't cause a crash
[Figure: program execution space, with bugs lying off the common program path; pictures © 3BP.com]
– NB in IBM OS data, 1/3 of failures were "3,000 year events"

Commercial Software
– Examples of data dependent faults?
– Loss of availability is acceptable commercially
– Most safety critical systems have to operate through faults
  – They can't "fail stop" – even reactor protection software needs to run for circa 24 hours for heat removal

Retrospective Analysis
– Retrospective analysis of a US civil product for UK military use
  – Analysis of over 500 kLoC, in several languages
  – Found 23 faults per kLoC, 3% of them safety critical
  – The vast majority were not safety critical
– NB most of the 3% related to assumptions, i.e. they were requirements issues

Find and Fix
– If a fault is found it may not be fixed
  – First it will be "sentenced"; if it is not critical, it probably won't be fixed
  – Potentially critical faults will be analysed: can the fault give rise to a problem in practice?
  – If the decision is not to change, the reasons are documented
– Note: changes may bring (unknown) faults, e.g. the Boeing 777 near Perth

Perils of Change
[Figure: module dependency structure – a change to one module ripples through dependent modules]

Summary
– Probably no safety critical software is fault free
  – Fewer than 1 fault per kLoC is good
  – It is hard to correlate fault density with failure rate (especially for unsafe failures)
– In practice
  – Sentence faults, and change only if there is a net benefit
  – One needs to show the presence of faults, to decide whether they need to be removed

Summary of the Summaries
– Safety critical software
  – Has a good track record
  – Increased dependency, complexity, etc. mean that this may not continue
– Much of the difficulty is in requirements
  – This is partly a systems engineering issue
  – Many of the problems arise from errors in communication
  – Classical CS approaches have limited utility

Research Directions (1)
– Advances may come at the architectural level
  – Improve notations to work at the architectural level, and implement via code generation
  – Develop approaches, e.g. good interfaces and product lines, to ease change
  – Focus on V&V, recognising that the aim is fault-finding
– AADL is an interesting development

Research Directions (2)
– Advances may come at the requirements level
  – Work with systems engineering notations; improve them to address the issues needed for software design and assessment (NB PFS)
  – Produce better ways of mapping requirements to architecture
  – Try to find ways of modularising, to bound the impact of change, e.g. contracts
  – Focus on V&V, e.g. simulation
– Developments of Parnas/Jackson ideas?

Research Directions (3)
– Work on automation, especially for V&V
  – Design remains creative
  – V&V is 50% of life-cycle cost, and can be automated
  – Examples include auto-generation of test data and test oracles, and model-checking for consistency/completeness
– Perhaps the best way to apply "classical" CS?

Coda
– Safety critical software research is always "playing catch up"
  – Aspirations for applications are growing fast
– To be successful
  – Focus on the "right problems", i.e. where the difficulties arise in practice
  – If possible, work with industry – to try to provide solutions to their problems