Fault-Tolerant Computing Motivation, Background, and Tools

advertisement
Fault-Tolerant Computing
Motivation,
Background,
and Tools
Oct. 2006
Terminology, Models, and Measures
Slide 1
About This Presentation
This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. © Behrooz Parhami
Oct. 2006
Edition
Released
First
Oct. 2006
Revised
Terminology, Models, and Measures
Revised
Slide 2
Terminology, Models, and Measures
for Dependability
Oct. 2006
Terminology, Models, and Measures
Slide 3
Oct. 2006
Terminology, Models, and Measures
Slide 4
Impairments to Dependability
Oct. 2006
Terminology, Models, and Measures
Slide 5
The Fault-Error-Failure Cycle
Includes both
components
and design
Aspect
Impairment
Structure

State

Behavior
Fault

Error

Failure
Fault
0
Correct
signal
0
0
Replaced
with
NAND?
Schematic diagram of the Newcastle hierarchical model
and the impairments within one level.
Oct. 2006
Terminology, Models, and Measures
Slide 6
The Four-Universe Model
Universe
Impairment
Physical

Logical

Informational

External
Failure

Fault

Error

Crash
Cause-effect diagram for Avižienis’ four-universe
model of impairments to dependability.
Oct. 2006
Terminology, Models, and Measures
Slide 7
Unrolling the Fault-Error-Failure Cycle
Aspect
Impairment
Structure

State

Behavior
Fault

Error

Failure
First
Cycle
Second
Cycle
Abstraction
Impairment
Component

Logic

Information

System

Service

Result
Defect

Fault

Error

Malfunction

Degradation

Failure
LowLevel
MidLevel
HighLevel
Cause-effect diagram for an extended six-level
view of impairments to dependability.
Oct. 2006
Terminology, Models, and Measures
Slide 8
Multilevel Model
Component
Logic
Defective
Legend:
Legned:
Service
Result
Oct. 2006
Low-Level
Impaired
Faulty
Initial
Entry
Entry
Information
System
Ideal
Erroneous
Mid-Level
Impaired
Deviation
Malfunctioning
Remedy
Degraded
Tolerance
Failed
Terminology, Models, and Measures
High-Level
Impaired
Slide 9
Analogy for the Multilevel Model
An analogy for our
multi-level model of
dependable computing.
Defects, faults, errors,
malfunctions,
degradations, and
failures are
represented by pouring
water from above.
Valves represent
avoidance and
tolerance techniques.
The goal is to avoid
overflow.
Oct. 2006
Terminology, Models, and Measures
Slide 10
Why Our Concern with Dependability?
Reliability of n-transistor system, each having failure rate l
R(t) = e–nlt
There are only 3 ways of making systems more reliable
1.0
Reduce l
.9999
.9990
.9900
.9048
0.8
Reduce n
Reduce t
0.6
–n l t
e
0.4
Alternative:
Change the reliability
formula by introducing
redundancy in system
Oct. 2006
.3679
0.2
0.0
104
106
Terminology, Models, and Measures
nt
108
1010
Slide 11
Highly Dependable Computer Systems
Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity
High-availability: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance enhancement techniques gradually migrate from
supercomputers to desktops, so too dependability enhancement
methods find their way from exotic systems into personal computers
Oct. 2006
Terminology, Models, and Measures
Slide 12
Aspects of Dependability
Oct. 2006
Terminology, Models, and Measures
Slide 13
Concepts from Probability Theory
Probability density function: pdf
f(t) = prob[t  x  t + dt] / dt = dF(t) / dt
Liftimes of 20
identical systems
Cumulative distribution function: CDF
F(t) = prob[x  t] =  0t f(x) dx
Expected value of x
+
Ex =  - x f(x) dx = k xk f(xk)
0
10
20
30
40
50
30
40
50
30
40
50
1.0
0.8
CDF
0.6
Variance of x
+
2
sx =  - (x – Ex)2 f(x) dx
= k (xk – Ex)2 f(xk)
Time
0.4
F(t)
0.2
0.0
0
10
20
Time
0.05
Covariance of x and y
yx,y = E [(x – Ex)(y – Ey)]
= E [x y] – Ex Ey
pdf
0.04
0.03
f(t)
0.02
0.01
0.00
0
Oct. 2006
10
Terminology, Models, and Measures
20
Time
Slide 14
Some Simple Probability Distributions
F(x)
1
CDF
CDF
CDF
CDF
Normal
Binomial
f(x)
pdf
Uniform
Oct. 2006
pdf
Exponential
pdf
Terminology, Models, and Measures
Slide 15
Reliability and MTTF
Reliability: R(t)
Probability that system remains in the
“Good” state through the interval [0, t]
Two-state
nonrepairable
system
R(t + dt) = R(t) [1 – z(t) dt]
Hazard function
R(t) = 1 – F(t)
Start
State
Failure
Failed
CDF of the system lifetime, or its unreliability
Constant hazard function z(t) = l  R(t) = e–lt
(system failure rate is independent of its age)
Mean time to failure: MTTF
+
+
MTTF =  0 t f(t) dt =  0 R(t) dt
Expected value of lifetime
Oct. 2006
Good
Exponential
reliability law
Area under the reliability curve
(easily provable)
Terminology, Models, and Measures
Slide 16
Failure Distributions of Interest
Exponential: z(t) = l
R(t) = e–lt
MTTF = 1/l
Discrete versions
Geometric
R(k) = q k
Rayleigh: z(t) = 2l(lt)
R(t) = e(-lt)2
MTTF = (1/l) p / 2
Weibull: z(t) = al(lt) a–1
R(t) = e(-lt)a
MTTF = (1/l) G(1 + 1/a)
Discrete Weibull
Erlang:
MTTF = k/l
Gamma:
Erlang and exponential are special cases
Normal:
Reliability and MTTF formulas are complicated
Oct. 2006
Terminology, Models, and Measures
Binomial
Slide 17
Comparing Reliabilities
Reliability difference: R2 – R1
Reliability gain: R2 / R1
Reliability improvement factor
RIF2/1 = [1–R1(tM)] / [1–R2(tM)]
Reliability functions
for Systems 1/2
System Reliability (R)
Example:
[1 – 0.9] / [1 – 0.99] = 10
Reliability improv. index
RII = log R1(tM) / log R2(tM)
Mission time extension
MTE2/1(rG) = T2(rG) – T1(rG)
Mission time improv. factor:
MTIF2/1(rG) = T2(rG) / T1(rG)
Oct. 2006
Terminology, Models, and Measures
Slide 18
Availability, MTTR, and MTBF
(Interval) Availability: A(t)
Fraction of time that system is in the
“Up” state during the interval [0, t]
Two-state
repairable
system
Steady-state availability: A = limt A(t)
Pointwise availability: a(t)
Probability that system available at time t
A(t) = (1/t)  0t a(x) dx
Repair
Start
State
Down
Up
Failure
Availability = Reliability, when there is no repair
Availability is a function not only of how rarely a system fails (reliability)
but also of how quickly it can be repaired (time to repair)
Repair rate
MTTF
MTTF
m
A=
=
=
1/m = MTTR
MTTF + MTTR MTBF
l+m
In general, m >> l, leading to A  1
Oct. 2006
Terminology, Models, and Measures
(Will justify this
equation later)
Slide 19
System Up and Down Times
Short repair time implies
good Maintainability (serviceability)
Time to first failure
Repair
Start
State
Down
Up
Failure
Time between failures
Repair time
Up
Down
0
t1
t 2 t'2
t'1
t
Time
Oct. 2006
Terminology, Models, and Measures
Slide 20
Performability and MCBF
Performability: P
Composite measure, incorporating
both performance and reliability
Three-state
degradable system
Repair
Start
State
Up 2
Up 1
Partial repair
Down
Failure
Partial failure
Simple example
Worth of “Up2” twice that of “Up1”
t
pUpi = probability
system is in state Upi
Question:
What is system
availability here?
P = 2pUp2 + pUp1
pUp2 = 0.92, pUp1 = 0.06, pDown = 0.02, P = 1.90
(system performance equiv. To that of 1.9 processors on average)
Performability improvement factor of this system (akin to RIF) relative
to a fail-hard system that goes down when either processor fails:
PIF = (2 – 2  0.92) / (2 – 1.90) = 1.6
Oct. 2006
Terminology, Models, and Measures
Slide 21
System Up, Partially Up, and Down Times
Important to prevent
direct transitions to the
“Down” state (coverage)
Repair
Start
State
Up 2
Up 1
Partial failure
Partial repair
Down
Failure
Partial
Failure
Up
Partially Up
Total
Failure
Partial
Repair
t2
t'2
Time
MCBF
Down
0
Oct. 2006
t1
Terminology, Models, and Measures
t'1
t 3 t'3 t
Slide 22
Integrity and Safety
Risk: Prob. of being in “Unsafe Failed” state
There may be multiple unsafe states,
each with a different consequence (cost)
Simple analysis
Lump “Safe Failed” state with “Good”
state; proceed as in reliability analysis
Three-state
fail-safe system
Failure
Start
State
Good
Failure
More detailed analysis
Even though “Safe Failed” state is more
desirable than “Unsafe Failed”, it is still
not as desirable as the “Good” state;
so keeping it separate makes sense
Safe
Failed
Unsafe
Failed
For example, if a repair transition is introduced between “Safe Failed”
and “Good” states, we can tackle questions such as the expected
outage of the system in safe mode, and thus its availability
Oct. 2006
Terminology, Models, and Measures
Slide 23
Download