Implementing Fault Tolerance in Physical Architecture

Principles of
Engineering System Design
Dr T Asokan
asok@iitm.ac.in
Implementing Fault Tolerance in Physical
Architecture Development
Case Study: Aircraft crash- Iowa
• United 232: a three-engine aircraft crashed on 19/7/1989
while making an emergency landing after losing one of
its three engines. 110 people died, 185 survived.
• Three redundant hydraulic systems, each powered by a
different engine, were available for aircraft stabilisation.
• The three hydraulic systems converged at a location near
the tail where the fan disk ripped out, making it the single
point of failure for all the hydraulic systems.
Error detection Functions
Failure: Deviation in behavior between the system and its
requirements.
Error: A subset of the system state which may lead to
system failure.
Fault: A defect in the system that can cause an error.
Fault tolerance is the ability of a system to tolerate
faults and continue performing.
• Fault tolerance can be achieved only for those errors
that are observed.
Functions associated with fault tolerance are:
Error detection
Damage confinement
Error recovery
Fault isolation and reporting
• Error detection means defining possible errors
(deviations of a subset of the system’s state
from the desired state) in the design phase, before
they occur, and establishing a set of functions that
check for the occurrence of each error.
– Type checks, range checks, timing checks
• Damage confinement is protecting the system from the
possible spread of failure to other parts of the system.
• Firewalls
• Error recovery attempts to correct the error after
the error has been detected and its extent
defined.
• Backward recovery, forward recovery
• Fault isolation and reporting attempts to determine
where in the system the fault occurred that
generated the error.
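The error detection checks and backward recovery named above can be sketched as follows. This is a minimal illustration, not a method taken from the lecture: the names `check_reading` and `CheckpointedState`, the sensor limits, and the staleness threshold are all hypothetical.

```python
import copy


def check_reading(value, low, high, age_s, max_age_s):
    """Return the list of errors detected for one reading (hypothetical helper)."""
    errors = []
    # Type check: the reading must be a number (bool is rejected on purpose).
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number:
        errors.append("type")
    # Range check: only meaningful once the type check has passed.
    elif not (low <= value <= high):
        errors.append("range")
    # Timing check: the reading must not be older than the allowed age.
    if age_s > max_age_s:
        errors.append("timing")
    return errors


class CheckpointedState:
    """Backward recovery: keep a known-good copy and roll back to it on error."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        # Record the current state as the new known-good checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


tank = CheckpointedState({"level_m": 2.0})
tank.state["level_m"] = 99.0  # simulated corrupted update
if check_reading(tank.state["level_m"], 0.0, 5.0, age_s=0.1, max_age_s=1.0):
    tank.rollback()  # error detected, so return to the last good state
```

Backward recovery restores an earlier state, as here. Forward recovery would instead try to move to a new, error-free state without going back.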
Redundancy to Achieve Fault Tolerance
A primary source of high availability and fault tolerance is
redundancy: Hardware, software, information, and time.
Hardware redundancy uses extra hardware to enable the
detection of errors as well as to provide additional
operational hardware components after errors have
occurred.
Hardware redundancy can be implemented in Passive,
Active, and Hybrid forms
Passive hardware redundancy masks or hides the occurrence of
errors rather than detecting them. Recovery is achieved by
having extra hardware available when needed.
The most common implementation is Triple Modular
Redundancy (TMR), which relies on a majority voting scheme
to mask an error in one of the three hardware units.
[Figure: TMR. Inputs 1 to 3 feed Components 1 to 3; a single
voter takes the three component outputs and produces the output.]
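The majority voting idea behind TMR can be sketched in software as below. The function names `majority_vote` and `tmr` are hypothetical, and the sketch assumes that module outputs are hashable and can be compared exactly; real voters often need a tolerance for analog or floating-point values.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value was produced by more than half of the modules.
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(components, x):
    """Run every redundant component on the same input and vote on the results."""
    return majority_vote([component(x) for component in components])


def healthy(x):
    return x * 2


def faulty(x):
    return -1  # simulated failed unit


result = tmr([healthy, faulty, healthy], 21)  # the single bad output is masked
```

Note that the voter masks the error without reporting which unit failed, which is why passive redundancy is described as hiding errors rather than detecting them.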
Triplicated TMR
[Figure: Inputs 1 to 3 feed Components 1 to 3; three voters,
one per channel, produce Outputs 1 to 3.]
Software implementation of voting for TTMR
[Figure: three identical channels, one per input (Inputs 1 to 3);
each channel is labelled sampler, two-port memory, processor,
two-port memory.]
Active hardware redundancy
Active hardware redundancy attempts to perform all four
functions: detect errors, confine damage, recover from
errors, and isolate and report faults.
• Hardware duplication with comparison
• Hot standby sparing
• Cold standby sparing
• Pair-and-a-spare
Hardware duplication with comparison
Hardware duplication with comparison is the basic
building block for active redundancy.
[Figure: one input feeds Component 1 and Component 2; a comparator
receives both component outputs and produces an agree/disagree
signal alongside the output.]
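Duplication with comparison can be sketched as follows. The function name `duplicate_and_compare` and the `tolerance` parameter are assumptions made for illustration, and the choice to forward the first component's output is one possible arrangement.

```python
def duplicate_and_compare(component_1, component_2, x, tolerance=0.0):
    """Run two identical components on the same input and compare the results.

    Returns (output, agree). The output is taken from component_1; agree is
    False when the two results differ by more than the tolerance.
    """
    out_1 = component_1(x)
    out_2 = component_2(x)
    agree = abs(out_1 - out_2) <= tolerance
    return out_1, agree


def square(x):
    return x * x


def stuck_at_zero(x):
    return 0  # simulated faulty unit


ok_output, ok_flag = duplicate_and_compare(square, square, 3)
bad_output, bad_flag = duplicate_and_compare(square, stuck_at_zero, 3)
```

A disagreement shows that an error exists but not which of the two units is faulty, so this scheme detects errors without being able to recover from them on its own.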
Hot standby sparing and Cold standby sparing
Most common approaches to hardware redundancy
[Figure: the input feeds Components 1 to N, each with its own
error detection; an N-to-1 switch selects one component's output
as the system output.]
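Standby sparing with an N-to-1 switch can be sketched as below. The class name `StandbySparing` and the `detect_error` callback are hypothetical. In this sketch a spare is only invoked after a switchover, which is closer to cold standby; in hot standby the spares would run in parallel with the active unit so that they are ready to take over immediately.

```python
class StandbySparing:
    """N-to-1 switch: use one module at a time, switch to a spare on error."""

    def __init__(self, modules, detect_error):
        self.modules = list(modules)      # module 0 starts as the active unit
        self.detect_error = detect_error  # callback: (input, output) -> bool
        self.active = 0                   # index currently selected by the switch

    def run(self, x):
        while self.active < len(self.modules):
            out = self.modules[self.active](x)
            if not self.detect_error(x, out):
                return out
            # Error detected: isolate the failed module and select the next spare.
            self.active += 1
        raise RuntimeError("all modules have failed")


def dead_unit(x):
    return None  # simulated failed unit that produces no output


def working_unit(x):
    return x + 1


system = StandbySparing([dead_unit, working_unit],
                        detect_error=lambda x, out: out is None)
value = system.run(1)
```

After the call, `system.active` records which module the switch has settled on, which doubles as a simple fault isolation report.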
Pair and a spare active redundancy
[Figure: the input feeds Components 1 to N, each with its own
error detection; an N-to-2 switch selects two components, and a
comparator checks their outputs and produces an agree/disagree
signal alongside the output.]
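The pair-and-a-spare arrangement combines the two previous ideas: two modules run as a compared pair, and a spare replaces whichever one is found faulty. The sketch below is one possible reading of the figure; the class name `PairAndSpare` and the `detect_error` callback are hypothetical.

```python
class PairAndSpare:
    """N-to-2 switch: run a compared pair, swap in a spare for a faulty member."""

    def __init__(self, modules, detect_error):
        self.pool = list(modules)         # needs at least two modules
        self.detect_error = detect_error  # callback: (input, output) -> bool
        self.pair = [0, 1]                # indices currently selected by the switch
        self.next_spare = 2

    def run(self, x):
        while True:
            outs = [self.pool[i](x) for i in self.pair]
            if outs[0] == outs[1]:
                return outs[0]            # comparator reports "agree"
            # Comparator reports "disagree": find the faulty member(s).
            replaced = False
            for slot, out in enumerate(outs):
                if self.detect_error(x, out):
                    if self.next_spare >= len(self.pool):
                        raise RuntimeError("no spare left")
                    self.pair[slot] = self.next_spare
                    self.next_spare += 1
                    replaced = True
            if not replaced:
                raise RuntimeError("disagreement, but no module was flagged")


def good(x):
    return x * x


def bad(x):
    return -1  # simulated faulty unit


unit = PairAndSpare([good, bad, good], detect_error=lambda x, out: out < 0)
answer = unit.run(3)
```

A limitation of this sketch is that two modules failing in the same way would still agree, so the comparator alone would not catch that case.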
Hybrid Hardware Redundancy
• Combination of N-modular redundancy with spares, or TMR
with duplication with comparison.
• Critical computation systems normally use active or hybrid
redundancy.
• Active redundancy reduces the life of the system.
• Hybrid redundancy is the costliest.
• Software redundancy
– N-version programming
– Capability checks: periodic hardware
tasks with known answers
– Consistency checks: compare the output of a
component with known characteristics
• Information redundancy: achieved by adding extra bits
of information to enable error detection
– Helps to catch system-induced errors
– Parity checks
• Time redundancy
– Standby systems error detection
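The parity check listed under information redundancy can be sketched as below, assuming even parity over a list of bits. The function names `add_even_parity` and `parity_ok` are hypothetical.

```python
def add_even_parity(data_bits):
    """Append one extra bit so that the total number of 1s is even."""
    parity = sum(data_bits) % 2
    return data_bits + [parity]


def parity_ok(word):
    """Return True when the word (data plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])

# A single flipped bit changes the parity and is detected.
one_flip = word.copy()
one_flip[0] ^= 1

# Two flipped bits cancel out, so a single parity bit cannot detect them.
two_flips = word.copy()
two_flips[0] ^= 1
two_flips[1] ^= 1
```

A single parity bit detects any odd number of bit errors but cannot locate or correct them, which is why it enables error detection only.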
Design Flexibility
• The mark of a long-lived system is that it has been
upgraded successfully many times.
• The system should have an adaptable platform for such upgrades
(e.g., the Windows NT operating system).
• Engineering systems should be designed to be “changeable” in the
future.
• Four aspects of changeability are:
– Flexibility
– Agility
– Robustness
– Adaptability
• Flexibility represents the property of the system to be changed easily.
Changes from outside have to be incorporated to cope with changing
environments.
– Computers with various interface ports can interface with many
external systems.
– Flexibility is important for future upgrades.
• Agility characterizes a system’s ability to be changed rapidly.
– Race cars are designed to be agile, to enable easy modifications to
suit the tracks.
• Robustness represents a system’s ability to be insensitive to
changing environments.
– An all-terrain vehicle such as a Jeep is robust enough to run on
different terrains.
• Adaptability characterizes a system’s ability to adapt itself to
changing environments. No changes from outside have to be
incorporated to cope with changing environments.
– Some intelligent software and operating systems are designed to
learn and adapt to different users.
Summary
• Development of Physical architecture from
Functional architecture
• Generic and Instantiated architecture
• Morphological box
• Fault tolerance in physical architecture
• Redundancy for fault tolerance
– Hardware, software, information, time
• Passive, Active redundancy
• Hot standby, cold standby, pair and spare
• Design Flexibility