Developing Safety Critical Software: Fact and Fiction

John A McDermid
Overview
- Fact – costs and distributions
- Fiction – get the requirements right
- Fiction – get the functionality right
- Fiction – abstraction is the solution
- Fiction – safety critical code must be “bug free”
- Some key messages

Part 1
Fact – costs and distributions
Fiction – get the requirements right
Costs and Distributions
- Examples of industrial experience
– Specific example
– Some more general observations
- Example covers
– Cost by phase
– Where errors are introduced
– Where errors are detected
– and their relationships
Process Phases – Effort/Cost by Phase
[Pie chart, running from System Specification, via the software engineering phases, to System Integration:]
– System Specification: 25%
– Software Design: 3%
– Software Implementation: 10%
– Software Static Analysis: 1%
– Reviews and Inspections: 8%
– Low Level Software Test: 17%
– Software Integration Test: 7%
– Hardware/Software Integration: 1%
– System Integration: 17%
– Other Software: 3%
– Management: 8%
Error Introduction
[Diagram: errors raised – classified as FE, Min FE or No FE – against where they are introduced: user requirements, system requirements, hardware and software, linked by document traceability.]
FE = Functional Effect
Min FE is typically a data change
[Bar chart: errors raised in each of the phases on the pie chart, from software reviews and inspections, through implementation, low level software testing and the integration test phases, to preflight and flight test.]
Finding Requirements Errors
Requirements testing – i.e. system validation – tends to be what finds requirements errors
Result – High Development Cost
[Chart: requirement errors – classified as having a Functional Effect (FE), a Minor Functional Effect (Min FE) or No Functional Effect (No FE) – are introduced at the requirements phases but are not found until system validation, after following the full safety critical development process.]
Software and Money
- Typical productivity
– 5 Lines of Code (LoC) per person day ≈ 1 kLoC per person year
– Measured from requirements to the end of module test
- Typical avionics “box”
– 100 kLoC
– 100 person years of effort
– Circa £10M for software, so perhaps £500M on a modern aircraft? (see the worked example below)
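A quick check of the arithmetic behind those figures – a minimal sketch in which the working days per year, the cost per person year and the number of boxes per aircraft are illustrative assumptions chosen to be consistent with the numbers above, not data from the talk:

#include <stdio.h>

int main(void)
{
    /* Productivity: 5 LoC per person-day, ~200 working days/year (assumed) */
    const double loc_per_day   = 5.0;
    const double days_per_year = 200.0;                           /* assumed */
    const double loc_per_year  = loc_per_day * days_per_year;     /* ~1 kLoC */

    /* Typical avionics "box": 100 kLoC */
    const double box_kloc     = 100.0;
    const double person_years = box_kloc * 1000.0 / loc_per_year; /* ~100 py  */
    const double cost_per_py  = 100000.0;      /* ~GBP 100k per py (assumed)  */
    const double box_cost     = person_years * cost_per_py;       /* ~GBP 10M */

    /* A modern aircraft with ~50 such boxes (assumed) */
    const double aircraft_cost = 50.0 * box_cost;                 /* ~GBP 500M */

    printf("LoC per person year  : %.0f\n", loc_per_year);
    printf("Effort per box       : %.0f person years (~GBP %.0fM)\n",
           person_years, box_cost / 1e6);
    printf("Software per aircraft: ~GBP %.0fM\n", aircraft_cost / 1e6);
    return 0;
}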
US Aircraft Software Dependence
[Chart: percentage of aircraft functions performed by software against in-service year, rising steeply from the F-4 (1960), through the A-7, F-111, F-15, F-16 and B-2, to the F-22 (2000).]
DoD Defense Science Board Task Force on Defense Software, November 2000
Increasing Dependence
- Software is often the determinant of function
- Software operates autonomously
– Without opportunity for human intervention, e.g. Mercedes Brake Assist
- Software is affected by other changes
– e.g. new weapons fit on EuroFighter
- Software has high levels of authority
– e.g. inappropriate CofG control in the fuel system can reduce the fatigue life of the wings
Growing Dependency
- The problem is growing
– Now about a third of aircraft development costs
– An increasing proportion of car development: around 25% of the capital cost of a new car is in electronics
– The problem is made more visible by the rate of improvement in tools for “mainstream” software development
Growth of Airborne Software
[Chart: airborne code size (kLoC, log scale) against in-service date, growing roughly exponentially from 1980 towards a projected 2014 figure – approx £1.5B of software at current productivity and costs.]
The Problem - Size matters
[Chart: project size in function points (up to ~12,000) against the probability of the software project being cancelled, from 5% to 50% – the larger the project, the more likely it is to be cancelled. 1 function point ≈ 80 SLOC of Ada, or 128 SLOC of C.]
Capers Jones, Becoming Best In Class, Software Productivity Research, 1995 briefing
Is Software Safety an Issue?
- Software has a good track record
– A few high profile accidents: Therac 25, Ariane 501, Cali (strictly data, not software)
– Analysis of 1,100 “computer related deaths”: only 34 attributed to software
- Chinook – Mull of Kintyre: was this caused by FADEC software?
But Don’t be Complacent
- Many instances of “pilot error” are system assisted
- Software failures typically leave no trace
- Increasing software complexity and authority
- Can’t measure software safety (no agreement)
- Unreliability of commercial software
- Cost of safety critical software
Summary
- Safety critical software is a growing issue
– Software-based systems are the dominant source of product differentiation
– Starting to become a major cost driver
– Starting to become the driver of (drag on) product development: can’t cancel, have to keep on spending!!!
– Not a major contributor to fatal accidents, although there are many incidents
Requirements Fiction
- Fiction stated
– Get the requirements right, and the development will be easy
- Facts
– Getting requirements right is difficult
– Requirements are the biggest source of errors
– Requirements change
– Errors occur at organisational boundaries
Embedded Systems
- Computer system embedded in a larger engineering system
- Requirements come from
– “Flow down” from the system
– Design decisions (commitments)
– Safety and reliability analyses: derived safety requirements (DSRs)
– Fault management/accommodation: as much as 80% for control applications
Almost Everything on One Picture
[Diagram, built up over several slides and based on Parnas’ four variable model: the platform, the controller/operator, and the control system & software, the latter decomposed to sensors (S1, S2, S3), an actuator (A1), input and output functions, the control interface, the application, FMAA and the HAL. The relations are restated below the list.]
- REQ = restriction on NAT
– Control loops, high level modes, end to end response times, etc.
– REQ specifies what the system must do, stated mainly in terms of inputs from and outputs to the platform, as directed by commands from the user/operator (if any)
- Physical decomposition of the system, to sensors and actuators plus a controller
– SOFTREQ specifies what the control software must do
– REQ = IN • SOFTREQ • OUT
- Functional decomposition of the software: mapping of control functions to a generic architecture
– SOFTREQ = I/P • SPEC • O/P
– Input function, including signal validation and data selection
– Output function, including loop closing
– Redefinition of SOFTREQ, allowing for digitisation noise, sensor management and actuator dynamics
- Physical decomposition of the controller
– The controller structure defines the FMAA structure: I/P, O/P, control I/F, application, data selection, HAL
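Reading the relations on that picture as relational composition (a gloss on Parnas’ four variable model, not something stated explicitly on the slide): IN maps platform quantities onto the software’s inputs, SPEC maps inputs to outputs, and OUT maps outputs back onto effects on the platform, so that

REQ ⊆ NAT (the requirement is a restriction on what nature allows)
REQ = IN • SOFTREQ • OUT
SOFTREQ = I/P • SPEC • O/P

i.e. the end to end behaviour of the software, seen through the sensors and actuators, is what has to meet the system requirement.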
Types of Layer
- Some layers have design meaning
– Abstraction from the computing hardware: time in ms from a reference, or ... – not interrupts or bit patterns from the clock hardware
– The “System” HAL: “raw” sensed values, e.g. pressure in psia – not bit patterns from the analogue to digital converters
– FMAA to Application: validated values of platform properties
- Layers may also have computational meaning
– e.g. a call to the HAL forces a scheduling action (see the sketch below)
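A minimal sketch of that layering in code – the function and type names are illustrative, not from any particular system, and the FMAA layer here does no real fault handling:

#include <stdint.h>
#include <stdio.h>

/* --- "System" HAL: abstraction from the computing hardware ----------- */
/* Time in ms from a reference, not interrupts from the clock hardware.  */
static uint32_t hal_time_ms(void) { return 0; }               /* stub */

/* "Raw" sensed P0 in psia, not the bit pattern from the A/D converter.  */
static double hal_p0_psia(void) { return 14.7; }              /* stub */

/* --- FMAA to Application: validated values of platform properties ---- */
typedef struct {
    double value_psia;   /* best estimate of P0             */
    int    valid;        /* non-zero if the value is usable */
} validated_p0_t;

static validated_p0_t fmaa_p0(void)
{
    validated_p0_t v = { hal_p0_psia(), 1 };   /* no real fault handling here */
    return v;
}

/* --- Application: written purely in platform terms ------------------- */
int main(void)
{
    validated_p0_t p0 = fmaa_p0();
    printf("t = %lu ms, P0 = %.2f psia, valid = %d\n",
           (unsigned long)hal_time_ms(), p0.value_psia, p0.valid);
    return 0;
}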
Commitments
- Development proceeds via a series of commitments
– A design decision which can only be revoked at significant cost
– Often associated with an architectural decision or choice of component: use of triplex redundancy, choice of pump, power supply, etc.
– Commitments can be functional or physical: it is most common to make physical commitments
Derived Requirements
- Commitments introduce derived requirements (DRs)
– Choice of pump gives DRs for the control algorithm and iteration rate, plus requirements for initialisation, etc.
– Also get derived safety requirements (DSRs), e.g. detection and management of sensor failure for safety
System Level Requirements
- Allocated requirements
– System level requirements which come from the platform
– May be (slightly) modified due to design commitments, e.g.
   Platform: control engine thrust to within ±0.5% of demanded
   System: control EPR or N1 to within ±0.5% of demanded
Stakeholder Requirements
- Direct requirements from stakeholders, e.g.
– The radar shall be able to detect targets travelling at up to Mach 2.5 at 200 nautical miles, with 98% probability
– In principle allocated from the platform; in practice often stated in system terms
– Need to distinguish legitimate requirements from “solutioneering”
   Legitimacy depends on the stakeholder, e.g. CESG and cryptos
Requirements Types
- Main requirements types
– Invariants, e.g. forward and reverse thrust will not be commanded at the same time (see the sketch below)
– Functional – transform inputs to outputs, e.g. thrust demand from thrust-lever resolver angle
– Event response – action on an event, e.g. activate ATP on passing a signal at danger
– Non-functional (NFR) – constraints, e.g. timing, resource usage, availability
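A minimal sketch of checking an invariant of that kind at run time – the names and the use of a plain assert are illustrative only, not how any particular system does it:

#include <assert.h>
#include <stdbool.h>

/* Commanded thrust state, as might be produced by the control logic. */
typedef struct {
    bool forward_thrust_commanded;
    bool reverse_thrust_commanded;
} thrust_cmd_t;

/* Invariant: forward and reverse thrust are never commanded together. */
static bool thrust_invariant_holds(const thrust_cmd_t *cmd)
{
    return !(cmd->forward_thrust_commanded && cmd->reverse_thrust_commanded);
}

int main(void)
{
    thrust_cmd_t cmd = { .forward_thrust_commanded = true,
                         .reverse_thrust_commanded = false };

    /* In a real system this would be a monitored safety property rather */
    /* than a bare assert; assert() is used here purely for illustration. */
    assert(thrust_invariant_holds(&cmd));
    return 0;
}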
Changes to Types
- Note requirements types can change – NFR to functional
– System – achieve < 10⁻⁵ per hour unsafe failures
– Software – detect failure modes x, y and z of the pressure sensor P30 with 99% coverage, and mitigate by …
- Requirements notations/methods must be able to reflect requirements types
Requirements Challenges
- Even if the system requirements are clear, the software requirements
– Must deal with quantisation (sensors)
– Must deal with temporal constraints (iteration rates, jitter)
– Must deal with failures
- System requirements are often tricky
– Open-loop control under failure
– Incomplete understanding of the physics
Requirements Errors
- Project data suggests
– Typically more than 70% of errors found post unit test are requirements errors
– F22 (and other data sets) put requirements errors at 85%
– Finding errors drives change
   The later they are found, the greater the cost
   Some data, e.g. F22, show 3 LoC written for every one delivered
The Certainty of Change
[Chart: cumulative % change per module, from around 20% up to roughly 300% – may verify all code 3 times!]
– Change is mainly due to requirements errors
– The majority of modules are stable, but cumulative change brings high cost due to re-verification in the presence of dependencies
Module
- Requirements errors are often based on misinterpretations (“it’s obvious that …”)
– Thus errors are more likely to happen at organisational/cultural boundaries: systems to software, safety to software, …
– A study at NASA by Robyn Lutz found that 85% of requirements errors arose at organisational boundaries
Summary
- Getting requirements right is a major challenge
– Software is deeply embedded: discretisation, timing etc. are an issue
– Physics not always understood
- Requirements (genuinely) change
– The notion that the requirements can simply be got right is simplistic
– The notion of “correct by construction” is optimistic
Part 2
Fiction – get the functionality right
Fiction – abstraction is the solution
Fiction – safety critical code must be “bug free”
Some key messages
Functionality Fiction
- Fiction stated
– Get the functionality right, and the rest is easy
- Facts
– Functionality doesn’t drive design
   Non-Functional Requirements (NFRs) are critical
   Functionality isn’t independent of NFRs
– Fault management is a major aspect of complexity
Functionality and Design
- Functionality
– System functions allocated to software
– Elements of REQ which end up in SOFTREQ (NB, most of them)
– At the software level, requirements have to allow for the properties of sensors, etc.
- Consider an aero engine example
Engine Pressure Sensor
[Picture: engine pressure block]
- Aero engine measures P0
– Atmospheric pressure
– A key input to fuel control, etc.
- Example input P0Sens
– Byte from the A/D converter
– Resolution: 1 bit ≈ 0.055 psia
– Base = 2 psia, 0 = low (high value ≈ 16 psia)
– Update rate = 50 ms
Pressure Sensing Example
- Simple requirement
– Provide a validated P0 value to other functions and the aircraft
- Output data item: P0Val
– 16 bits
– Resolution: 1 bit ≈ 0.00025 psia
– Base = 0 psia, 0 = low (high value ≈ 16.4 psia)
Example Requirements
- Simple functional requirement
– RS1: P0Val shall be provided within 0.03 bar of the sensed value
– R1: P0Val = P0Sens [± 0.03] (software level)
– Note: simple algorithm (see the sketch below)
   P0Val = (P0Sens * 0.055 + 2) / 0.00025
   P0Sens = 0 → P0Val = 8000 = 0001 1111 0100 0000 binary
   P0Sens = 1111 1111 (= 16.025 psia) → P0Val = 64100 = 1111 1010 0110 0100 binary
– Does R1 meet RS1? Does the algorithm meet R1?
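A minimal sketch of that scaling algorithm, checking the two example points (the function name and the rounding to the nearest output count are my assumptions):

#include <stdint.h>
#include <stdio.h>

/* Rescale the raw sensor byte (1 bit ~ 0.055 psia, base 2 psia) to the */
/* 16-bit output (1 bit ~ 0.00025 psia, base 0 psia).                    */
static uint16_t p0_scale(uint8_t p0_sens)
{
    double psia   = p0_sens * 0.055 + 2.0;   /* engineering units       */
    double counts = psia / 0.00025;          /* output resolution       */
    return (uint16_t)(counts + 0.5);         /* round to nearest count  */
}

int main(void)
{
    printf("P0Sens = 0x00 -> P0Val = %u\n", p0_scale(0x00));   /*  8000 */
    printf("P0Sens = 0xFF -> P0Val = %u\n", p0_scale(0xFF));   /* 64100 */
    return 0;
}

The slide’s questions still stand: the arithmetic is exact at these points, but R1’s ±0.03 carries no explicit unit while RS1 is stated in bar – exactly the kind of mismatch requirements validation has to catch.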
A Non-Functional Requirement
- Assume duplex sensors
– P0Sens1 and P0Sens2
- System level
– RS2: no single point of failure shall lead to loss of function (assume P0Val is covered by this requirement)
   This will be a safety or availability requirement
   NB in practice there may be different sensors wired to different channels, with cross channel comms
Software Level NFR
- Software level
– R2: If | P0Sens1 - P0Sens2 | < 0.06
      then P0Val = (P0Sens1 + P0Sens2) / 2
      else P0Val = 0
– Is R2 a valid requirement? In other words, have we stated the right thing?
– Does R2 satisfy RS2? (See the sketch below.)
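A minimal sketch of R2 as stated, in engineering units (the function name is illustrative):

#include <stdio.h>

/* R2: average the two sensors if they agree to within 0.06,             */
/* otherwise output 0. Values are in psia.                                */
static double p0_val_r2(double p0_sens1, double p0_sens2)
{
    double diff = (p0_sens1 > p0_sens2) ? p0_sens1 - p0_sens2
                                        : p0_sens2 - p0_sens1;
    return (diff < 0.06) ? (p0_sens1 + p0_sens2) / 2.0
                         : 0.0;             /* miscompare: no valid value */
}

int main(void)
{
    printf("%.3f\n", p0_val_r2(14.70, 14.72));   /* agree  -> 14.710 */
    printf("%.3f\n", p0_val_r2(14.70, 15.00));   /* differ -> 0.000  */
    return 0;
}

Writing it out makes the slide’s question sharper: on any miscompare R2 outputs 0 and the function is lost, so it is not obvious that R2 satisfies RS2.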
Temporal Requirements
- Timing is often an important system property
– It may be a safety property, e.g. sequencing in weapons release
- System level
– RS3: the validated pressure value shall never lag the sensed value by more than 100 ms
   NB such requirements are not uncommon, to ensure quality of control
Software Level Timing
- Software level requirement, assuming scheduling on 50 ms cycles
– R3: P0Val(t) = P0Sens(t-2) [± 0.03]
– If t is quantised in units of 50 ms, representing cycles
– Is R3 a valid requirement? Does R3 satisfy RS3?
   NB need data on processor timing to validate (see the note below)
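A one-line check of how R3 lines up with RS3 – assuming the output really is produced within the cycle in which it is computed, which is exactly what the processor timing data has to show:

lag ≤ 2 cycles × 50 ms/cycle = 100 ms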
Timing and Safety
- Software level
– R4: If | P0Sens1(t) - P0Sens2(t) | < 0.06
      then P0Val(t+1) = (P0Sens1(t) + P0Sens2(t)) / 2
      else if | P0Sens1(t) - P0Sens1(t-1) | < | P0Sens2(t) - P0Sens2(t-1) |
      then P0Val(t+1) = P0Sens1(t)
      else P0Val(t+1) = P0Sens2(t)
– What does R4 respond to (can you think of an RS4)? (See the sketch below.)
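A minimal sketch of R4 in code (values in psia; keeping the previous cycle’s samples in a small state structure, and leaving the one-cycle output delay to the scheduler, are my assumptions about how the t-1 and t+1 terms would be realised):

#include <math.h>
#include <stdio.h>

/* Previous cycle's samples, needed for the rate comparison in R4. */
typedef struct {
    double p0_sens1_prev;
    double p0_sens2_prev;
} p0_state_t;

/* R4: average if the sensors agree to within 0.06; otherwise select the */
/* sensor that has changed least since the previous cycle.               */
static double p0_val_r4(p0_state_t *s, double p0_sens1, double p0_sens2)
{
    double p0_val;

    if (fabs(p0_sens1 - p0_sens2) < 0.06) {
        p0_val = (p0_sens1 + p0_sens2) / 2.0;
    } else if (fabs(p0_sens1 - s->p0_sens1_prev) <
               fabs(p0_sens2 - s->p0_sens2_prev)) {
        p0_val = p0_sens1;
    } else {
        p0_val = p0_sens2;
    }

    s->p0_sens1_prev = p0_sens1;      /* remember this cycle's samples */
    s->p0_sens2_prev = p0_sens2;
    return p0_val;
}

int main(void)
{
    p0_state_t s = { 14.70, 14.70 };
    /* Sensor 2 jumps by ~1 psia in one cycle: R4 selects sensor 1. */
    printf("P0Val = %.3f\n", p0_val_r4(&s, 14.71, 15.70));
    return 0;
}

Seen like this, R4 responds to a sudden divergence of one sensor while still providing a value – a hint at what an RS4 about maintaining the function under a single sensor fault might say.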
Requirements Validation
- Is R4 a valid requirement?
– Is R4 “safe” in the system context? (Assume that misleading values of P0 could lead to a hazard, e.g. a thrust roll-back on take off.)
- Does R4 satisfy RS3?
- Does R4 satisfy RS2?
- Does R4 satisfy RS1?
Real Requirements
- The example is still somewhat simplistic
– Need to store sensor state, i.e. knowledge of what has failed
- Typically timing, safety, etc. drive the detailed design
– Aspects of the requirements, e.g. error bands, depend on the timing of the code
– Requirements involve trade-offs between, say, safety and availability
Requirements and Architecture
- NFRs also drive the architecture
– Failure rate 10⁻⁶ per hour
   Probably just duplex (especially if fail stop)
   Functions for cross comms and channel change
– Failure rate 10⁻⁹ per hour
   Probably triplex or quadruplex
   Changes in redundancy management
   NB a change in failure rate affects low level functions
Quantification
- The “system level” functionality is in the minority
– Typically over half is fault management
– EuroFighter example
   FCS ≈ 1/3 MLoC
   Control laws ≈ 18 kLoC
   Note: very hard to validate
– 777 flight incident in Australia due to an error in fault management, exposed by a software change
Boeing 777 Incident near Perth
- Problem caused by the Air Data Inertial Reference Unit (ADIRU)
– The software contained a latent fault which was revealed by a change
- June 2001: accelerometer #5 fails with erroneously high output values; the ADIRU discards its output values
- A power cycle of the ADIRU occurs each time the aircraft electrical system is restarted
- Aug 2006: accelerometer #6 fails; the latent software error allows use of the previously failed accelerometer #5
Summary
- Functionality is important
– But not the primary driver of design
- Key drivers of design
– Safety and availability: turn into fault management at the software level
– Timing behaviour
- Functionality is not independent of NFRs
– Requirements change to reflect NFRs
Abstraction Fiction
- Fiction stated
– Careful use of abstraction will address the problems of requirements etc.
- Fact
– Most forms of abstraction don’t work in embedded control systems
   State abstraction is of some use
   The devil is in the detail
Data Abstraction
- Most data is simple
– Boolean, integer, floating point
– Complex data structures are rare
   May exist in a maintenance subsystem (e.g. records of fault events)
– Systems engineers work in low-level terms, e.g. pressures, temperatures, etc.
   Hence requirements are in these terms
Control Models are Low Level
Looseness
- A key objective is to ensure that requirements are complete
– Specify behaviour under all conditions
– Normal behaviour (everything working)
– Fault conditions: single faults, and combinations
– Impossible conditions: so the design is robust against an incompletely understood requirements/environment
Despatch Requirements
- Can despatch (use) the system “carrying” failures
– Despatch analysis is based on a Markov model
– Evaluate the probability of being in a non-despatchable state, e.g. only one failure from a hazard (see the sketch below)
– Link between the safety/availability process and the software design
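A minimal sketch of the kind of Markov calculation involved, for a duplex sensor set – the three-state model, the failure rate and the period are illustrative assumptions, not figures from the talk:

#include <stdio.h>

int main(void)
{
    /* States: 0 = no failures (despatchable),                          */
    /*         1 = one sensor failed (only one failure from a hazard),  */
    /*         2 = both failed (hazard). Illustrative model only.       */
    double p[3] = { 1.0, 0.0, 0.0 };

    const double lambda = 1e-4;   /* per-hour sensor failure rate (assumed) */
    const int    hours  = 1000;   /* period between maintenance (assumed)   */

    for (int h = 0; h < hours; h++) {               /* one-hour steps */
        double p0 = p[0], p1 = p[1];
        p[0] -= 2.0 * lambda * p0;                  /* either sensor fails     */
        p[1] += 2.0 * lambda * p0 - lambda * p1;    /* second sensor fails too */
        p[2] += lambda * p1;
    }

    printf("P(one failure from hazard after %d h) = %.4f\n", hours, p[1]);
    printf("P(hazard after %d h)                  = %.2e\n", hours, p[2]);
    return 0;
}

A despatch requirement would then bound the probability of being in state 1 (and, of course, state 2) at the point the system is released to service.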
Fault Management Logic
- Fault-accommodation requirements may use a four valued logic
– Working (w), undetected (u), detected (d) and confirmed (c)
– The table illustrates “logical and” ([.])
– Used for analysis

  .  | w  u  d  c
  w  | w  u  d  c
  u  | u  u  d  c
  d  | d  d  d  c
  c  | c  c  c  c
Example Implementation
  .  | w  d  c
  w  | w  d  c
  d  | d  d  c
  c  | c  c  c

(Realised as a lookup in the sketch below.)
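A minimal sketch of that implementation table as a lookup in code (the enum and array names are illustrative):

#include <stdio.h>

/* Three-valued status used in the implementation table above:          */
/* working (w), detected (d), confirmed (c).                            */
typedef enum { W = 0, D = 1, C = 2 } status_t;

/* "Logical and" ([.]) exactly as tabulated. */
static const status_t status_and[3][3] = {
    /*        W  D  C */
    /* W */ { W, D, C },
    /* D */ { D, D, C },
    /* C */ { C, C, C },
};

int main(void)
{
    const char *name[] = { "w", "d", "c" };
    printf("w . d = %s\n", name[status_and[W][D]]);   /* d */
    printf("d . c = %s\n", name[status_and[D][C]]);   /* c */
    return 0;
}

Because the values are ordered w < d < c, the table is simply “take the worse status”, which makes it easy to review against the four valued analysis table.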
State Abstraction
- Some state abstraction is possible
– Mainly low-level state to operational modes
- Aero engine control
– Want to produce thrust proportional to demand (thrust lever angle in the cockpit)
– Can’t measure thrust directly
– Can use various “surrogates” for thrust
   Work with the best value, but have reversionary modes
Thrust Control
- Engine pressure ratio (EPR) – the ratio between atmospheric and exhaust pressures
– Best approximation to thrust
– Depends on P0
   Low level state models the “health” of the P0 sensor
– If P0 fails, revert to using N1 (fan speed)
– Have control modes
   EPR, N1, etc., which abstract away from the details of the sensor fault state (see the sketch below)
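A minimal sketch of that state abstraction – detailed sensor health is folded into a control mode that the rest of the system works with (the names and the simple selection rule are illustrative):

#include <stdio.h>

typedef enum { SENSOR_OK, SENSOR_FAILED } sensor_health_t;
typedef enum { MODE_EPR, MODE_N1 } control_mode_t;

/* Low-level sensor health is folded into a control mode: use EPR while */
/* P0 is healthy, revert to N1 (fan speed) if the P0 sensor has failed. */
static control_mode_t select_control_mode(sensor_health_t p0_health)
{
    return (p0_health == SENSOR_OK) ? MODE_EPR : MODE_N1;
}

int main(void)
{
    printf("P0 ok     -> %s\n",
           select_control_mode(SENSOR_OK) == MODE_EPR ? "EPR mode" : "N1 mode");
    printf("P0 failed -> %s\n",
           select_control_mode(SENSOR_FAILED) == MODE_EPR ? "EPR mode" : "N1 mode");
    return 0;
}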
Summary
- The opportunity for abstraction is much more limited than in “IT” systems
– Hinders many classical approaches
- Abstraction is of some value
– Mainly state abstraction, relating low-level state information, e.g. sensor “health”, to system level control modes
   NB formal refinement, à la B, is helped by this, as there is little data refinement
“Bug Free” Fiction
- Fiction stated
– Safety critical code must be “bug free”
- Facts
– It is hard to correlate fault density and failure rate
– <1 fault per kLoC is pretty good!
– Being “bug free” is unrealistic, and there is a need to “sentence” faults
Close to Fault Free?
- DO 178A Level 1 software (engine controller) – now would be DAL A
– Natural language specifications and macro-assembler
– Over 20,000,000 hours without hazardous failure
– But on version 192 (last time I knew)
   Changes are “trims” to reflect hardware properties
Pretty Buggy
- DO 178B Level A software (aircraft system)
– Natural language, control diagrams and a high level language
– 118 “bugs” found in the first 18 months, 20% critical
– Flight incidents but no accidents
– Informally “less safe” than the other example, but still flying, still no accidents
Fault Density
- So far as one can get data
– <1 flaw per kLoC is pretty good for safety critical software
– Commercial software is much worse – may be as high as 30 faults per kLoC
– Some “extreme” cases
   Space Shuttle – 0.1 per kLoC
   Praxis system – 0.04 per kLoC
– But will a hazardous situation arise?
Faults and Failures
- Why doesn’t software “crash” more often?
– Paths miss “bugs” as they don’t get the critical data
– Testing “cleans up” common paths
– Also “subtle faults” which don’t cause a crash
   NB IBM OS: 1/3 of failures were “3,000 year events”
[Diagram: a program path threading through the program execution space, missing the bugs. Pictures © 3BP.com]
Commercial Software
- Examples of data dependent faults?
– Loss of availability is acceptable
– Most safety critical systems have to operate through faults
   Can’t “fail stop” – even reactor protection software needs to run for circa 24 hours for heat removal
Retrospective Analysis
- Retrospective analysis of a US civil product for UK military use
– Analysis of over 500 kLoC, in several languages
– Found 23 faults per kLoC, 3% safety critical
– The vast majority were not safety critical
   NB most of the 3% related to assumptions, i.e. were requirements issues
Find and Fix
- If a fault is found it may not be fixed
– First it will be “sentenced”
   If not critical, it probably won’t be fixed
– Potentially critical faults will be analysed
   Can it give rise to a problem in practice?
   If the decision is not to change, the reasons are documented
– Note: changes may introduce (unknown) faults
   e.g. Boeing 777 near Perth
Perils of Change
[Chart: dependencies per module – changes ripple through the dependency structure.]
Summary
- Probably no safety critical software is fault free
– Less than 1 fault per kLoC is good
– Hard to correlate fault density with failure rate (especially unsafe failures)
- In practice
– Sentence faults, and change only if there is a net benefit
- Need to show the presence of faults
– To decide whether they need to be removed
Summary of the Summaries
- Safety critical software
– Has a good track record
– Increased dependency, complexity, etc. mean that this may not continue
- Much of the difficulty is in requirements
– Partly a systems engineering issue
– Many of the problems arise from errors in communication
– Classical CS approaches have limited utility
Research Directions (1)
- Advances may come at the architecture level
– Improve notations to work at the architecture level and implement via code generation
– Develop approaches, e.g. good interfaces and product lines, to ease change
– Focus on V&V, recognising that the aim is fault-finding
   AADL is an interesting development
Research Directions (2)
- Advances may come at the requirements level
– Work with systems engineering notations
   Improve them to address the issues needed for software design and assessment, NB PFS
   Produce better ways of mapping to architecture
   Try to find ways of modularising, to bound the impact of change, e.g. contracts
– Focus on V&V, e.g. simulation
   Developments of Parnas/Jackson ideas?
Research Directions (3)
- Work on automation, especially for V&V
– Design remains creative
– V&V is 50% of life-cycle cost, and can be automated
– Examples include
   Auto-generation of test data and test oracles
   Model-checking consistency/completeness
   The best way to apply “classical” CS?
Coda
- Safety critical software research
– Always “playing catch up”
– Aspirations for applications are growing fast
- To be successful
– Focus on the “right problems”, i.e. where the difficulties arise in practice
– If possible work with industry – to try to provide solutions to their problems