Dependability TDDB47 Real Time Systems Lecture 3: Dependability and fault tolerance Calin Curescu

Real-Time Systems Laboratory
Department of Computer and Information Science
Linköping University, Sweden

The lecture notes are partly based on lecture notes by Simin Nadjm-Tehrani, Jörgen Hansson, Anders Törne. They also loosely follow Burns and Wellings' book “Real-Time Systems and Programming Languages”. These lecture notes should only be used for internal teaching purposes at Linköping University.

Dependability
• How can we produce systems that do their job, and how do we measure how well they do it?
  – How do things go wrong and why?
  – What can we do about it?
• This lecture: basic overview of fault-tolerant systems
Accidents/Mishaps
• 1987
  – Six patients in the USA and Canada received very high doses of radiation and severe burns from the cancer treatment system Therac-25.
  – Doses as high as 15,000-20,000 radiation units were given, compared with the normal level (~200 units). Three patients died.
• 1994
  – TCAS is a system designed to avoid mid-air collisions between passenger planes.
  – Two commercial aircraft came as close as 1.6 km to each other while flying over Oregon, USA.
    • Minimum safety rule: ~4.8 km.

What is dependability?
Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]
Dependability [Sv. Pålitlighet]

Attributes of dependability
• Safety
  – non-occurrence of catastrophic consequences on the environment
• Availability
  – readiness for usage
• Integrity
  – non-occurrence of unauthorized alteration of information
• Reliability
  – continuity of correct service
• Confidentiality
  – absence of unauthorized disclosure of information
• Maintainability
  – ability to undergo repairs and modifications
IFIP WG 10.4
Reliability [Sv. Tillförlitlighet]
• Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time.
  – Typical measure in aerospace: 10^-9 failures/hour

Faults, Errors & Failures
• Fault [Sv. felkälla]
  – a defect within the system or a situation that can lead to failure
• Error [Sv. felyttring]
  – manifestation (symptom) of the fault: an unexpected behaviour
• Failure [Sv. misslyckande]
  – system not performing its intended function
• System state point of view
  – Failure: an unspecified external state
  – Error: an unspecified internal state
  – Fault: the reason for the illegal state transition
Transitions
• Failure → fault → error → failure → fault
• Systems are composed of components which are themselves systems, hence
  – a failure in a subcomponent is a fault in the component
What is a bug?
Fault sources
• Four sources of faults which can result in system failure
  – Inadequate specification
  – Design errors in software
  – Hardware component failure
  – Interference in the communication subsystem

Examples
Fault Types
• Transient fault
  – Starts at a particular time, remains in the system for some period and then disappears.
  – Many faults in communication systems are transient.
    • E.g. hardware components which have an adverse reaction to radioactivity.
• Permanent fault
  – Remains in the system until repaired.
    • E.g. a broken wire or a software design error.
• Intermittent fault
  – Is a transient fault that occurs repeatedly.
    • E.g. a heat-sensitive hardware component: it works for a time, stops working, cools down and then starts to work again.
Failure types
• Value domain
  – Value error
  – Constraint error
• Timing domain
  – Early
  – Late
    • Fail late
  – Omission
    • Fail silent
    • Fail stop
    • Fail controlled
• Impromptu/Commission
• Arbitrary / Fail uncontrolled
Another failure classification
• Node failures
  – Crash
  – Omission
  – Byzantine
• Channel failures
  – Crash (and potential partitions)
  – Message loss
  – Erroneous/arbitrary messages

Fault, Error, Failure
• Goal of system verification and validation is to eliminate faults
• Goal of safety/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear
  – In order to eliminate or delay failures
Measuring Reliability and Availability
• Reliability: time from an initial instant to the next failure event
• Typical measures:
  – MTTF: mean time to failure
  – MTBF: mean time between failures
  – MTTR: mean time to repair
• Availability
  – Ratio of service time to elapsed time
    • MTTF/(MTTF+MTTR)
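The availability ratio can be checked with a few lines of Python. This is only a sketch: the function names and the example figures (999 hours between repairs, 1 hour per repair) are illustrative assumptions, and the relation MTBF = MTTF + MTTR is one common convention rather than something stated on the slide.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Ratio of service time to elapsed time: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Assumed convention: one failure cycle = time to failure + time to repair."""
    return mttf_hours + mttr_hours


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * 365 * 24


# Hypothetical system: fails on average every 999 h, takes 1 h to repair.
A = availability(999.0, 1.0)
print(f"availability = {A:.3f}")
print(f"MTBF = {mtbf(999.0, 1.0):.0f} h")
print(f"downtime = {downtime_hours_per_year(999.0, 1.0):.2f} h/year")
```

Note that a short repair time improves availability just as a long time to failure does, whereas reliability depends only on the time to failure.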
Achieving Reliability
• Fault prevention
  – attempts to eliminate any possibility of faults creeping into a system before it goes operational.
  – Stage 1: Fault avoidance techniques
  – Stage 2: Fault removal techniques
• Fault tolerance enables a system to continue functioning even in the presence of faults.
• Both approaches attempt to produce systems which have desirable failure modes.
Note: it is hard to differentiate an arbitrary failure from a commission failure followed by an omission failure.
Fault Avoidance
• Limit the introduction of faults during system construction
• Hardware
  – use of the most reliable components within the given cost and performance constraints
  – use of thoroughly-refined techniques for the interconnection of components and the assembly of subsystems
  – packaging the hardware to screen out expected forms of interference
• Software
  – rigorous, if not formal, specification of requirements
  – use of proven design methodologies
  – use of languages with facilities for data abstraction and modularity
  – use of software engineering environments to help manipulate software components and thereby manage complexity
Fault Removal
• In spite of fault avoidance, design errors in both hardware and software components will exist.
• Fault removal
  – design reviews
  – program verification
  – code inspections
  – system testing
• Limitations
  – A test can only show that a fault exists, not that no faults exist
  – It is sometimes impossible to test under realistic conditions
  – Errors introduced at the requirements stage may not show up until the system is operational
Fault tolerance
• Means that a system provides a degraded (but acceptable) function
  – Even in the presence of faults
  – For a period defined by certain model assumptions
• Due to
  – Incomplete fault removal
  – Hardware failure
    • Inaccessible or infeasible repair and maintenance
Redundancy
• Redundant elements: extra elements that would not be required in a perfect system
  – Detect and recover from faults
  – Used by all fault-tolerant techniques
• Increases overall complexity
  – Can lead to less reliable systems.
• Static vs. dynamic redundancy
• Address unforeseen faults?
Static redundancy
• Used in any case
• Error masking properties
• Hardware: N-modular redundancy
• Software: N-version programming

Dynamic redundancy
• Used only in case of error appearance
• Error detecting properties
• Software: recovery methods
• Hardware: switching to a back-up or recovery module
• Data: self-correcting codes
• Time: re-computing a result
• Influence on timeliness?
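Time redundancy and its cost in timeliness can be illustrated with a small sketch. The helper `recompute_on_error`, the validity check and the simulated sensor readings are all hypothetical names and values chosen for illustration; the point is only that a result is re-computed when an error is detected, and that the worst case costs `max_attempts` computations.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def recompute_on_error(compute: Callable[[], T],
                       looks_valid: Callable[[T], bool],
                       max_attempts: int = 3) -> Tuple[T, int]:
    """Re-run `compute` until its result passes the check.

    Returns the accepted result and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        result = compute()
        if looks_valid(result):          # error detection
            return result, attempt
    raise RuntimeError(f"no valid result after {max_attempts} attempts")


# Simulated transient fault: the first reading is garbage, the second is fine.
readings = iter([-999.0, 21.5])
value, attempts = recompute_on_error(
    compute=lambda: next(readings),
    looks_valid=lambda v: -50.0 <= v <= 150.0,
)
print(value, attempts)  # 21.5 2
```

In a real-time setting the schedulability analysis would have to budget for the worst case, that is `max_attempts` times the execution time of one computation, which is one way dynamic redundancy influences timeliness.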
N-Version
• From D. Lardner, Edinburgh Review, 1834:
  – “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.”
• * people who compute
N-Version Programming
• Design diversity
• Independent generation of N (N > 2) functionally equivalent programs from the same initial specification
• No interactions between groups.
• The programs execute
  – concurrently
  – with the same inputs
• Their results are compared by a driver process
  – Results should be identical
  – If they differ, the consensus result is taken to be correct.
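The driver process can be sketched as a majority voter. This is a minimal illustration, not a full N-version framework: the three "versions" below are toy functions written for this example (one is deliberately faulty), and they are called one after another rather than concurrently.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority result."""
    results = [version(x) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:       # no strict majority: cannot mask
        raise RuntimeError(f"no consensus among results {results}")
    return value


# Three independently written ways to sum 1..n (hypothetical versions).
def version_a(n: int) -> int:
    return n * (n + 1) // 2              # closed formula


def version_b(n: int) -> int:
    return sum(range(1, n + 1))          # straightforward loop


def version_c(n: int) -> int:
    return sum(range(1, n))              # design fault: off by one


print(n_version_vote([version_a, version_b, version_c], 10))  # 55
```

The faulty version is outvoted, so its error is masked without ever being located. If all versions disagree, or a majority shares the same fault (for instance one inherited from the common specification), voting cannot help.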
Ultimate N-Version
• N-version specification
• N-version development teams
• N-version programming languages
• N-version hardware
• Ultra-reliable hardware voting

Consistent comparison problem
• To what extent can votes be compared?
• Text or integer arithmetic produces identical results
• Real-number arithmetic usually produces different values
• Need for inexact voting techniques
  – Temperature threshold example
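A sketch of inexact voting and of the temperature threshold example follows. The tolerance, the threshold of 100.0 and the three readings are made-up values, and voting around the median is just one possible inexact voting technique, chosen here for brevity.

```python
from statistics import median
from typing import Sequence


def inexact_vote(values: Sequence[float], tolerance: float) -> float:
    """Accept values within `tolerance` of the median; return their mean."""
    mid = median(values)
    agreeing = [v for v in values if abs(v - mid) <= tolerance]
    if len(agreeing) <= len(values) // 2:
        raise RuntimeError(f"no majority within tolerance: {list(values)}")
    return sum(agreeing) / len(agreeing)


THRESHOLD = 100.0                        # hypothetical alarm temperature
readings = [99.99, 100.01, 100.02]       # three versions, all "correct"

# Consistent comparison problem: each version compares its own value
# against the threshold, so correct versions take different branches.
decisions = [r > THRESHOLD for r in readings]
print(decisions)                         # [False, True, True]

# Voting on the value first gives one agreed value and one decision.
agreed = inexact_vote(readings, tolerance=0.05)
print(agreed > THRESHOLD)                # True

# A clearly faulty value is excluded by the tolerance.
filtered = inexact_vote([20.1, 20.0, 35.0], tolerance=0.5)
```

Inexact voting does not remove the problem entirely: the choice of tolerance is itself a design decision, and values close to its boundary can still be classified inconsistently.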
Software Dynamic Redundancy
• Phases
  – Error detection
    • No fault tolerance scheme can be utilised until the associated error is detected.
  – Damage confinement and assessment
    • To what extent has the system been corrupted?
    • May depend on detection delay
  – Error recovery
    • Get the system to a state from which it can continue its normal operation (perhaps with degraded functionality).
    • Forward vs. backward error recovery
  – Fault treatment and continued service
    • An error is only a symptom of a fault; what should be done about the fault?

Error Detection
• Environmental detection (run-time system or hardware)
  – Hardware: e.g. illegal instruction
  – O.S.: e.g. null pointer
• Application detection
  – Replication checks, e.g. 2-version difference check
  – Timing checks
  – Watchdog timers: reset the timer periodically, else error
  – Reversal checks, e.g. square root check by squaring
  – Coding checks, e.g. checksums
  – Reasonableness checks, e.g. range of variables
  – Structural checks, e.g. redundant pointers in data structures
  – Dynamic reasonableness check
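Some of the application-level checks listed above can be sketched in a few lines. All names here (`DetectedError`, `Watchdog`, the tolerance values) are illustrative assumptions; a real watchdog is normally a hardware timer or a separate task, whereas this one is only polled.

```python
import math
import time
import zlib


class DetectedError(Exception):
    """Raised when an application-level check detects an error."""


def sqrt_with_reversal_check(x: float, rel_tol: float = 1e-9) -> float:
    """Reversal check: square the result and compare with the input."""
    root = math.sqrt(x)
    if not math.isclose(root * root, x, rel_tol=rel_tol, abs_tol=rel_tol):
        raise DetectedError(f"sqrt({x}) failed the reversal check")
    return root


def reasonableness_check(value: float, low: float, high: float) -> float:
    """Reasonableness check: the value must lie in its expected range."""
    if not low <= value <= high:
        raise DetectedError(f"{value} outside [{low}, {high}]")
    return value


def coding_check(payload: bytes, expected_crc: int) -> bool:
    """Coding check: recompute the CRC-32 checksum and compare."""
    return zlib.crc32(payload) == expected_crc


class Watchdog:
    """Polled watchdog: `kick` must be called at least every `timeout_s`."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        self._last_kick = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_kick > self._timeout_s
```

The `clock` parameter is there so that the watchdog can be exercised with a simulated clock instead of waiting in real time.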
Forward error recovery
• Continues from an erroneous state by making selective corrections to the system state
  – May include making safe the controlled environment, which may be hazardous or damaged because of the error
  – It is system specific and depends on accurate predictions of the location and cause of errors (i.e., damage assessment)
• Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming codes

Backward Error Recovery
• Relies on restoring the system to a previous safe state and executing an alternative section of the program
  – This has the same functionality but uses a different algorithm (cf. N-Version Programming)
  – Recovery point & checkpointing
    • Saving appropriate system state
• Advantage: the erroneous state is cleared and it does not rely on finding the location or cause of the fault
  – Can be used to recover from unanticipated faults, including design errors
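Backward error recovery can be sketched as a recovery block: save a recovery point, run the primary algorithm, apply an acceptance test, and on failure roll back and try an alternate. The function names, the dictionary used as "system state" and the deliberately faulty primary are all assumptions made for this illustration.

```python
import copy
from typing import Callable, Sequence


def recovery_block(state: dict,
                   alternates: Sequence[Callable[[dict], None]],
                   acceptance_test: Callable[[dict], bool]) -> dict:
    """Try each alternate in turn, restoring the checkpoint after a failure."""
    checkpoint = copy.deepcopy(state)            # establish recovery point
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state                     # result accepted
        except Exception:
            pass                                 # treated as a failed attempt
        state.clear()                            # roll back to recovery point
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sort(state: dict) -> None:
    # Design fault (deliberate): silently drops the smallest element.
    state["data"] = sorted(state["data"])[1:]


def alternate_sort(state: dict) -> None:
    # Different algorithm with the same intended function: insertion sort.
    data = state["data"]
    for i in range(1, len(data)):
        item, j = data[i], i - 1
        while j >= 0 and data[j] > item:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = item


def acceptable(state: dict) -> bool:
    data = state["data"]
    in_order = all(a <= b for a, b in zip(data, data[1:]))
    return in_order and len(data) == state["n"]


state = {"data": [3, 1, 2], "n": 3}
recovery_block(state, [primary_sort, alternate_sort], acceptable)
print(state["data"])  # [1, 2, 3]
```

The rollback never needs to know what went wrong in the primary, which is why the scheme also covers unanticipated faults. Its weak point is the acceptance test: the one above would accept a sorted list with a wrong element in it.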
Static (NV) vs. Dynamic (RB) Redundancy
• Design overheads
  – both require alternative algorithms; NV requires a driver, RB requires an acceptance test
• Runtime overheads
  – NV requires N * resources
  – RB requires establishing recovery points
• Diversity of design
  – both are susceptible to errors in requirements
• Error detection
  – vote comparison (NV)
  – acceptance test (RB)
• Atomicity
  – NV votes before it outputs to the environment
  – RB must be instructed to output only after passing an acceptance test.