
TDDB47 Real Time Systems
Lecture 4: Dependability and fault tolerance
Mikael Asplund
Real-Time Systems Laboratory
Department of Computer and Information Science
Linköping University, Sweden
The lecture notes are partly based on lecture notes by Calin Curescu, Simin Nadjm-Tehrani, Jörgen Hansson, Anders Törne. They also loosely follow Burns and Wellings' book “Real-Time Systems and Programming Languages”. These lecture notes should only be used for internal teaching purposes at Linköping University.
Accidents/Mishaps
• Therac 25
– Six patients in USA and Canada got very high doses of radiation and severe burns from the cancer treatment system Therac 25.
– Doses as high as 15,000-20,000 radiation units were given, compared with the normal levels (~ 200 units). Five died.
• Ariane 5
– Software reuse
– ~$500 million lost
What is dependability?
Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]
Exercise
Attributes of dependability [Sv. Pålitlighet]
• Safety
– non-occurrence of catastrophic consequences on the environment
• Availability
– the readiness for usage
• Integrity
– non-occurrence of unauthorized alteration of information
• Reliability
– continuity of correct service
• Confidentiality
– absence of unauthorized disclosure of information
• Maintainability
– ability to undergo repairs and modifications
Faults, Errors & Failures
• Fault [sv. felkälla]
– a defect within the system or a situation that can lead to failure
• Error [sv. felyttring]
– manifestation (symptom) of the fault - an unexpected behaviour
• Failure [sv. misslyckande]
– system not performing its intended function
Transitions
• Failure → fault → error → failure → fault
• Systems are composed of components which are themselves systems, hence
– a failure in a subcomponent is a fault in the component
What is a bug?
Fault, Error, Failure
• Goal of system verification and validation is to eliminate faults
• Goal of safety/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear
Measuring Reliability and Availability
• Time from an initial instant to the next failure event
• Typical measures:
– MTTF – mean time to failure
– MTBF – mean time between failures
– MTTR – mean time to repair
– In order to eliminate or delay failures
• Availability
– Ratio of service time to elapsed time
• MTTF/(MTTF+MTTR)
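The availability ratio above can be evaluated numerically. The following is a minimal sketch; the function names and the example figures (999 hours MTTF, 1 hour MTTR) are assumptions chosen for illustration and do not come from the lecture.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as the ratio MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(a: float) -> float:
    """Expected unavailable hours in a 365-day year for availability a."""
    return (1.0 - a) * 365 * 24


# Hypothetical system: fails on average every 999 h, takes 1 h to repair.
a = availability(999.0, 1.0)
print(f"availability  = {a:.4f}")
print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")
```

With these assumed figures the service is up 99.9 % of the time, which still corresponds to several hours of downtime per year.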
Dependability
How to avoid failures?
• Fault prevention
• Fault tolerance
• Fault removal
• Fault forecasting
IFIP WG 10.4
External factors
Movie.
Time and Duration of Faults
• Transient fault
– E.g. hardware components which have an adverse reaction to radioactivity.
• Permanent fault
– E.g. a broken wire or a software design error.
• Intermittent fault
– E.g. a hardware component that is heat sensitive: it works for a time, stops working, cools down and then starts to work again.
Failure types
• Value domain
• Timing domain
– Early
– Late
– Omission
• Commission
• Arbitrary
Fault models
• Node failures
– Crash
– Omission
– Byzantine
• Channel failures
– Crash (and potential partitions)
– Message loss
– Erroneous/arbitrary messages
Redundancy
• Necessary!
• Increases overall complexity
• Static
– Error masking properties
– Hardware – N-modular redundancy
– Software – N-version programming
– Influence on timeliness?
• Dynamic
– Error detecting properties
– Software – recovery methods
– Hardware – switching to back-up or recovery module
– Data – self-correcting codes
– Time – re-computing a result
N-Version
• From D. Lardner: Edinburgh Review, year 1824:
– “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.”
* people who compute
N-Version Programming
• Design diversity
• Independent generation of N (N > 2) functionally equivalent programs from the same initial specification
• No interactions between groups.
• The programs execute
– concurrently
– with the same inputs
• Their results are compared by a driver process
– Results should be identical
– If different – the consensus result is taken to be correct.
Ultimate N-Version
• N-version specification
• N-version development teams
• N-version programming languages
• N-version hardware
• Ultra-reliable hardware voting
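The driver process in N-version programming can be sketched as a majority voter. This is an illustrative sketch only: the names `vote` and `driver` and the three "sum of 1..n" versions are hypothetical, and the versions run sequentially here for brevity although the slide assumes concurrent execution.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(results: Sequence[int]) -> int:
    """Return the value produced by a strict majority of the versions."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError(f"no majority among results {list(results)}")
    return value


def driver(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every version and vote on the outputs."""
    return vote([version(x) for version in versions])


# Three hypothetical, independently written versions of "sum of 1..n".
def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2


def sum_faulty(n: int) -> int:
    return sum(range(1, n))  # deliberate off-by-one design fault


print(driver([sum_loop, sum_formula, sum_faulty], 10))  # prints 55
```

The faulty version is outvoted, so the fault is masked. If all versions disagree, this voter raises an error instead of guessing; a real system would need a policy for that case.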
Consistent comparison problem
• To what extent can votes be compared?
• Text or integer arithmetic produces identical results
• Real numbers usually produce different values
Software Dynamic Redundancy
• Error detection
– Environmental
– Application
• Damage confinement and assessment
• Error recovery
– Forward vs. backward error recovery
• Fault treatment and continued service
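The consistent comparison problem mentioned earlier can be made concrete with floating-point arithmetic. In this sketch, three hypothetical versions compute the same quantity in different ways and then compare it with a threshold; the threshold value and function names are assumptions for illustration.

```python
import math

THRESHOLD = 0.3  # hypothetical alarm limit


def version_a() -> float:
    return 0.1 + 0.2   # rounds to a value slightly above 0.3


def version_b() -> float:
    return 0.3


def version_c() -> float:
    return 0.6 / 2     # halving is exact, so this equals the float 0.3


values = [version_a(), version_b(), version_c()]

# Exact voting fails: the real-valued results are not all identical.
exact_agree = len(set(values)) == 1

# Inexact voting with a tolerance accepts them as the same result.
close_agree = all(math.isclose(v, values[0], rel_tol=1e-9) for v in values)

# But a discrete decision taken before the vote can still diverge.
decisions = [v > THRESHOLD for v in values]

print(exact_agree, close_agree, decisions)
```

Although the three values differ only in the last bits, one version decides that the threshold is exceeded and the other two do not, so the versions may follow different execution paths from that point on.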
Forward error recovery
• Continues from an erroneous state by making selective corrections to the system state
• Examples:
– redundant pointers in data structures and
– the use of self-correcting codes such as Hamming codes
Backward error recovery
• Relies on restoring the system to a previous safe state and executing an alternative section of the program
• Advantages
– the erroneous state is cleared and it does not rely on finding the location or cause of the fault
– can be used to recover from unanticipated faults including design errors
• Disadvantages
– Takes time
– Might not be acceptable
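As an instance of the self-correcting codes named under forward error recovery, the following sketch implements a Hamming(7,4) code that corrects any single flipped bit in place, without rolling back. The function names are hypothetical, and the bit layout (parity bits at positions 1, 2 and 4) is one common convention that the lecture does not specify.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, or 0
    if syndrome:
        c[syndrome - 1] ^= 1         # selective correction of the state
    return [c[2], c[4], c[5], c[6]]


codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1  # simulate a transient fault flipping one bit
print(hamming74_correct(codeword))  # prints [1, 0, 1, 1]
```

This code corrects single-bit errors only; two flipped bits in the same codeword would be miscorrected.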
Dynamic redundancy and exceptions
• An exception can be defined as the occurrence of an error
• Exception handling is a forward error recovery mechanism
• An exception handling facility can, however, also be used to provide backward error recovery
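One way to picture backward error recovery built on exception handling is a recovery-block-style loop: save a checkpoint, try an alternative, and on any exception or failed acceptance test restore the checkpoint and try the next alternative. This is a sketch under assumptions; `recovery_block`, the two sorting alternatives and the acceptance test are hypothetical names, not the lecture's code.

```python
import copy
from typing import Callable, Dict, List, Sequence

State = Dict[str, List[int]]


def recovery_block(state: State,
                   alternatives: Sequence[Callable[[State], None]],
                   acceptance_test: Callable[[State], bool]) -> State:
    checkpoint = copy.deepcopy(state)  # recovery point
    for alternative in alternatives:
        try:
            alternative(state)
            if acceptance_test(state):
                return state
            raise ValueError(f"{alternative.__name__} failed the acceptance test")
        except Exception:
            # Backward recovery: discard the damaged state, restore the checkpoint.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternatives failed")


def primary_sort(state: State) -> None:
    state["items"].pop()                        # corrupts the state ...
    raise ValueError("simulated design fault")  # ... and then fails


def secondary_sort(state: State) -> None:
    state["items"].sort()


state = {"items": [3, 1, 2]}
n = len(state["items"])


def acceptable(s: State) -> bool:
    return len(s["items"]) == n and s["items"] == sorted(s["items"])


result = recovery_block(state, [primary_sort, secondary_sort], acceptable)
print(result["items"])  # prints [1, 2, 3]
```

The handler never inspects what went wrong; it only rolls back. That is why this style can cope with unanticipated faults, at the cost of the time spent checkpointing and re-executing.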
Reading
• Chapters 5, 6.5 and 12.8 of Burns & Wellings
• Dependable Systems: IFIP terminology as described in [Avizienis et al. 2004]