System reliability - University of Calgary Webdisk Server

advertisement
SENG 637
Dependability,
Dependability Reliability &
Testing of Software
Systems
Ch t 3:
Chapter
3 System
S t
R
Reliability
li bilit
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far
(far@ucalgary.ca)
http://www.enel.ucalgary.ca/People/far/Lectures/SENG637/
far@ucalgary.ca
1
Dependability Models

Dependability models capture conditions that make a
system fail in terms of the structural relationships
between the system components
Dependability models
Reliability Graph
Fault Tree model
Reliability Block Diagram
far@ucalgary.ca
2
System Reliability /1



A system usually consists of components
Each component consists of sub-components
Components may have



Different reliability
Different dependencies between them
System reliability is a function of the
reliabilities of the (sub-)
(sub ) components and of
the relationships between the components
far@ucalgary.ca
3
Complexity Concerns




For a system consisting of n components
Every component can be in one of the two
conditions: working or failed
How many possible combinations of the status
of these n components?
How do you calculate the reliability of the
system given the probability of each
componentt being
b i working
ki (or
( failed)?
f il d)?
Reliability Block Diagram can help!
4
Reliability Block Diagram (RBD)






Reliability Block Diagram (RBD) is a graphical
representation
p
of how the components
p
of a system
y
are
connected from reliability point of view.
RBD is used to model the various series-parallel and complex
block combinations (paths) that result in system success.
success
Reliability of the system is derived in terms of reliabilities of
its individual components.
The most common configurations of an RBD are the series
and parallel configurations.
A system is usually composed of combinations of serial and
parallel configurations.
configurations
RBD analysis is essential for determining reliability,
availability and down time of the system.
far@ucalgary.ca
5
Reliability Block Diagram (RBD)


Reliability Block Diagram helps reliability analysis using a
functional diagram to portray and analyze the reliability
relationship of components in a system.
Each element of a system is represented by a block that is in
some way interconnected with or through the other blocks of
the system at a desired level of assembly. The basic
relationship between components are depicted as lines that
may be:



Serial: if a single failure results in entire assembly or system failure
P ll l if all
Parallel:
ll redundant
d d units
i failure
f il
causes system failure
f il
Or a combination of these two
SENG637 (Winter 2008)
far@ucalgary.ca
6
Reliability Block Diagram (RBD)


In a serial system configuration, the elements must
all
ll workk for
f the
th system
t to
t workk andd the
th system
t fails
f il
if one of the components fails. The overall reliability
of a serial system is lower than the reliability of its
individual components.
In parallel configuration, the components are
considered to be redundant and the system will still
cease to work if all the parallel components fail. The
overall reliability of a parallel system is higher than
the reliability of its individual components.
far@ucalgary.ca
7
Reliability Block Diagram (RBD)




In developing the reliability block diagram, a complete
understanding of a system
system’ss mission definition and service
use profile is required.
In addition, the system is broken down into subsystems/
functions that are necessary for mission success.
Each essential function is represented as a separate block in
the diagram,
diagram and blocks are grouped according to
series/parallel combinations. Lines are drawn connecting the
blocks in a logical order for mission success.
RBD can be used to


Calculate system reliability
Possible single point of failure (SPF)
far@ucalgary.ca
8
RBD: Process Steps
1.
2.
3.
4.
Define boundary of the system for analysis
Break system into functional components
Determine serial
serial-parallel
parallel combinations
Represent each components as a separate
block in the diagram
5. Draw lines connecting the blocks in a logical
order
d ffor mission
i i success
far@ucalgary.ca
9




No redundancy
ALL component of the
system
y
are needed to make
the system function
properly
If any one of the
components fails, the
system fails
Example
far@ucalgary.ca
System block
b
diagrram
Serial System Reliability /1
Power Unit
Processor
Monitor
10
Serial System Reliability /2



Reliability Block Diagram
S t is
System
i composedd off n independent
i d
d t serially
i ll
connected components.
Failure of any component has a cross system effect,
effect
i.e., results in failure of the whole system.
R1/1

R2 /2
...
Example
Power Unit
R1 /1
Processor
R2 /2
far@ucalgary.ca
Monitor
R3 /3
11
Combining Reliabilities /1


Serial system reliability can be calculated from
component reliabilities, if the components fail
independently of each other.
For serial systems:
Qp
Qp number of components
k 1
Rk component
p
reliabilityy
R   Rk


Components reliabilities (Rk) must be expressed with
respect to a common interval.
A serial system has always smaller reliability than its
components (because Rk  1).
far@ucalgary.ca
12
Example: Serial System

A system is composed of 3 independent
serially connected components




R1 = 0.95
R2 = 0.87
R3 = 0.82
Note: all Rs must be given
for a common duration,
e.g., 10 hours of operation.
Rsystem = 0.95  0.87  0.82 = 0.6777
Serial system reliability is smaller than any
individual reliability of the components
far@ucalgary.ca
14
Parallel System Reliability /1




Syste w
System
with
t Redundancy
edu da cy
Only ONE out of N
((identical)) components
p
is
needed to make the
system function properly
System block diagram
If ALL of the
components fail, the
Server 1
Server 2
system fails
f il
Example
far@ucalgary.ca
15
Parallel System Reliability /2


System is composed of n independent
components connected in parallel.
Failure of all components
p
results in the failure
of the whole system (principle of active
redundancy).
R1 /1
n
Unreliabil ityy Fp((t))   Fi((t))
i1
n
i1
i1
Reliabilit y Rp(t)  1   Fi(t)  1   (1 - Ri(t))
far@ucalgary.ca
R2 /2
...
n
16
Parallel System Example


The system is composed of 2 identical servers
connected in parallel

R1 = 0.6777

R2 = 0.6777
Server 1
Rs1 /s1
Server 2
Rs2 /s2
Rsystem = 1 – ((1 – 0.6777) 
(1 – 0.6777)) = 0.8961
Parallel system reliability is greater than any
individual reliability of the components
far@ucalgary.ca
17
Sequential System Reliability



When one component fails, the next one is assigned
t fulfill
to
f lfill the
th job
j b (e.g.,
(
t l h
telephone
switching
it hi circuits)
i it )
This is done in sequence until no components left
li bili ffor such
h system:
Reliability
C1
C2
Ci
n 1
R sys (t)  
i0
λ i t 
i
i!
e
 λ it
Cn
far@ucalgary.ca
18
Mixed System Reliability /1




Syste w
System
with
t Redundancy
edu da cy
Only ONE out of N set
of components
p
is needed
System block diagram
to make the system
function properly
Po e U
Power
Unit
it
Po e Unit
Power
U it
If ALL sets of
components fail, the
Processor
Processor
system fails
f il
Example
Monitor
far@ucalgary.ca
Monitor
19
Mixed System Example

Reliability Block Diagram
System level
Server 1
Rs1 /s1
Server 2
Rs2 /s2
Component level
Power Unit
R1 //1
Processor
R2 //2
Monitor
R3 //3
Power Unit
R1 /1
Processor
R2 /2
Monitor
R3 /3
far@ucalgary.ca
20
Various Configurations /1

Parallel-Series configuration
Power Unit
Power Unit
Processor
Processor
Monitor
Monitor
far@ucalgary.ca
21
Various Configurations /2

Reliability Block Diagram for Parallel-Series
configuration
Power Unit
R1 //1
Processor
R2 //2
Monitor
R3 //3
Power Unit
R1 /1
Processor
R2 /2
Monitor
R3 /3
far@ucalgary.ca
22
Parallel--Series System
Parallel
R11
R12
R1j
R1n
R21
R22
R2j
R2n
Ri1
Ri2
Rij
Rin
Rm1
Rm2
Rmj
Rmn
path
n
Path reliabilit y Ri   Rij
j 1
m
m
n
i1
i1
j 1
System reliabilit y R  1   (1 - Ri)  1   (1 - Rij)
far@ucalgary.ca
23
Various Configurations /3

Series-Parallel configuration
Power
o e Unit
U t1
Processor
ocesso 1
Monitor
o to 2
Bus 1
B 2
Bus
Power Unit 2
Processor 1
far@ucalgary.ca
Monitor 2
24
Various Configurations /4

Reliability Block Diagram for Series-Parallel
configuration
Power Unit
R1 /1
Processor
R2 /2
Monitor
R3 /3
Power Unit
R1 /1
Processor
R2 /2
Monitor
R3 /3
far@ucalgary.ca
25
Series--Parallel System
Series
R11
R12
R1j
R1n
R21
R22
R2j
R2n
Ri1
Ri2
Rij
Rin
Rm1
Rm2
Rmj
Rmn
subsystem
m
Subsystem reliabilit y Rj  1   (1 - Rij)
i1
m


System reliability R   R j   1   (1 - Rij)
j 1
j 1 
i1

n
far@ucalgary.ca
n
26
Active Redundancy






Employs parallel systems.
All components are active at the same time.
Each component is able to meet the functional
requirements of the system.
Each component satisfies the minimum
reliability condition for the system.
Only one component is required to meet the
functional requirements of the system.
System only fails if all components fail.
far@ucalgary.ca
28
m – out of – n System


System has n components.
At least m components need to
work correctly for the system to
function properly (m  n).



m=n: serial
i l system
m=1: parallel system
R1
R2
m/n
Ri
e.g.: airplane
p ew
with 4 eengines
g es can
c
fly with only 2 engines.
Rn
n
R sys (t)  1     R(t)i (1  R(t)) ni
i0  i 
m 1
Assumption:
All components have the same reliability.
far@ucalgary.ca
n
n!
  
 i  i! (n  i)!
29
RBD: Example /1
Possible single
point of failure
(SPF)
far@ucalgary.ca
30
RBD: Example /2
Possible over
redundancy
far@ucalgary.ca
31
SPF: How to Avoid?





Adopt redundancy  Use dissimilar methods –
consider
id common-cause vulnerability
l
bilit
Adopt a fundamental design change
Use equipment
i
which
hi h is
i extremely reliable/robust
li bl / b
Perform frequent Preventive Maintenance/
R l
Replacement
t  repair
i before
b f
failure
f il
happens
h
Reduce or eliminate service and/or environmental
stresses  extreme
e treme stress leads to failure
fail re
far@ucalgary.ca
32
RBD: When and How

When to use RBD?


How to use RBD?




When reliability
Wh
li bilit off a complex
l system
t mustt be
b calculated
l l t d
and the reliability-wise weaknesses of the system must be
identified.
Draw the RDB diagram.
Calculate the system reliability using the RBD diagram.
diagram
Perform calculation such as availability and downtime.
There
e e are
a e a number
u be oof auto
automated
ated tools,
too s, integrated
teg ated
with the other methods, such as Fault Tree Analysis
(FTA), to generate the diagram and to analyze it.
far@ucalgary.ca
33
RBD: Benefits & Limitations


RBD is the simplest way of visualizing the reliability of the
complex systems.
The benefits of the RBD are:


Establishes reliability goals; Evaluates component failure impact on
overall system safety; Provides a basis for “what
what if
if” analysis;
Allocates component reliability by calculating system MTBF;
Provides cost savings in large system trouble-shooting; Estimates
system reliability; Analyzes various system configurations in trade-off
studies; Identifies potential design problems; Determines system
sensitivity to component failures
Disadvantages are that some complex constructs, such as
standby,
db branching
b
hi andd load
l d sharing,
h i etc., cannot be
b clearly
l l
represented using the traditional RBD constructs.
far@ucalgary.ca
34
RBD: Conclusion

Restricting
assumptions:



Statistical
independence
Failures
i d
independence
d
Repairs
i d
independence
d
far@ucalgary.ca
35
Example 1

In reliability prediction for an assembled system we usually
use a “bottoms-up”
p approach
pp
byy estimatingg the failure rate for
each subsystem and then combining the failure rates for the
entire assembly. The following figure illustrates a system in
which the subsystems A, B, C, and D are in a serial
configuration Each subsystem is composed of several parts
configuration.
which are connected as serial or parallel as shown.
far@ucalgary.ca
36
Example 1 (cont’d)
(cont d)

The failure rates of serial parts 1, 2, and 3 are
0.1, 0.3, and 0.5 per hour, respectively.
Determine the failure rate and MTTF for
subsystem A.
A  1  2  3
A  0.1  0.3  0.5  0.9 failures per hour
1
1
MTTFA  
 1.11 hours
 0.9
far@ucalgary.ca
37
Example 1 (cont’d)
(cont d)

The failure rates of parallel parts 4, 5, and 6 in
subsystem B are 0.2, 0.4, and 0.25 per hour,
respectively. Determine the failure rate and MTTF
for subsystem B.
RB  e  Bt  1  (1  R4 )(1  R5 )(1  R6 )
0 20
0 40
0 25
RB  e  Bt  1  (1  e 0.20
)(1  e 0.40
)(1  e 0.25
)
RB  e  Bt  0.9868
B = 0.0133 failures per hour
MTTFB 
1
B
 75.1879 hours
far@ucalgary.ca
38
Example 1 (cont’d)
(cont d)

The failure rate of parallel parts 7 is constant and is
0.2 per hour. Determine the failure rate and MTTF
for subsystem C in which at least 2 out of 3 items
must be working.
n
n i
RC  1     R7 i 1  R7 
R7  e 0.2  0.8187
i 0  i 
 3  0
 3 1
3
2
RC  1    R7 (1  R7 )    R7 (1  R7 )   1  (1  R7 )3  3R7 (1  R7 ) 2 
1 
 0 

m 1
RC  1  ((1  0.8187))3  3  0.8187(1
(  0.8187)) 2  =0.9133
RC  e  t  0.9133
MTTFC 
1
C
C =1.004
failures per hour
 0.9960 hours
far@ucalgary.ca
39
Example 1 (cont’d)
(cont d)

The failure rates of parallel parts 8 and 9 in
subsystem
b t D are 00.25
25 and
d 0.2
0 2 per hhour,
respectively. Determine the failure rate and
MTTF for subsystem D.
D
RD  e  Dt  1  (1  R8 )(1  R9 )
RD  e  Dt  1  (1  e 0.25 )(1  e 0.20 )
RD  e  Dt  0.9599
D = 0.0409 failures per hour
MTTFD 
1
D
 24.4498
24 4498 hours
h
far@ucalgary.ca
40
Example 1 (cont’d)
(cont d)

Determine the overall failure rate and MTTF
for the whole system.
system  A  B  C  D  0.9  0.0133  1.004  0.0409
system
 1.9582 failures per hour
y
MTTFsystem 
1
system
y
 0.5107 hours
far@ucalgary.ca
41
Example 2

For the mixed mode
system composed of
serial and parallel
components shown
below, calculate the
overall system
reliability.
li bilit The
Th
numbers are
reliability for each
module.
SENG521 (Fall 2004)
far@ucalgary.ca
42
Example 2 (cont’d)
(cont d)
R1 = 1 – (1 – 0.9)(1 – 0.8)(1 – 0.9) = 0.998
R2 = 1 – (1 – 0.9)(1 – 0.95) = 0.995
R3 = 1 – (1 – 0.9)(1 – 0.9) = 0.990
R4 = 0.998
0 998 x 0.995
0 995 x 0.99
0 99 = 0.983
0 983
R5 = 0.95 x 0.99 x 0.95 = 0.893
Rsystem = 1 – (1 – 0.983)(1 – 0.893) = 0.998
SENG521 (Fall 2004)
far@ucalgary.ca
43
Example 3
The fastening of two mechanical parts in an
automobile brake system should be reliable.
It is done by means of two flanges which are
pressed together with 4 bolt and nut pairs E1
to E4 placed 90 degrees to each other.

Experience shows that the fastening holds when
at least two opposing bolt and nut pairs work.
E3
E2
E1
E4
Example 3 (cont’d)
(cont d)
a) Draw the reliability block diagram (RBD) for this
fi ti off the
fixation
th system.
t
Example 3 (cont’d)
(cont d)
b) Compute reliability of the fixation described
in (a) if all bolt and nut pairs have the
reliability of R=0.90.
R1 = R2 = 0.9 × 0.9 = 0.81
Rsystem = 1 – (1 – R1) (1 – R2)
= 1 – (1 – 0.81)
0 81) (1 – 0.81)
0 81) = 0.9639
0 9639
Example 3 Q1 (cont’d)
(cont d)
c) Name the redundancy type if only any two of
the bolt and nuts were sufficient for the
required function.
Two-out-of-four redundancy.
Example 3 (cont’d)
(cont d)
d) Compute reliability for case (c) if all bolt and
nut pairs have the reliability of R=0.90.
 4 
 4
0
4
1
3
R  1     0.9  1  0.9      0.9  1  0.9   
1 
 0 

4
3
1   0.1  3.6   0.1   0.9963


Hazard Analysis
“A common mistake in engineering, is
to put too much confidence in
software. There seems to be a feeling
among non-software professionals
that software will not or cannot fail,
fail
which leads to complacency and over
reliance on computer functions.”
Nancy G. Leveson
Leveson, N.G., Safeware – System Safety and Computers, Addison-Wesley, 1995.
far@ucalgary.ca
49
Hazard Analysis

Goal




Identify events that may eventually lead to accidents
Identify individual elements/operations within a system
that render it vulnerable
vulnerable…(Single
(Single Point Failures)
Determine impact on system
Techniques





FMEA: Failure Modes and Effects Analysis
FMECA: Failure Modes, Effects and Criticality Analysis
ETA: Event Tree Analysis
FTA: Fault Tree Analysis
HAZOP: HAZard and OPerability studies
far@ucalgary.ca
50
FMEA


FMEA is a technique to identify and prioritize how
items fail,
fail and identify the effects of failure.
failure
FMEA is used when




Designing products or processes, to identify and avoid
failure-prone designs.
Investigating why existing systems have failed and to
identify possible causes.
causes
Investigating possible solutions, to help select one with an
acceptable risk.
Planning actions, in order to identify risks in the plan and
hence identify countermeasures.
far@ucalgary.ca
51
FMEA: Usage

Industries that frequently use FMEA:






Consumer products: toys/home appliances/home
electronics
Automotive industry
Aero/space: Boeing, NASA
Defence industry: DoD
Process industries, e.g., oil and gas, chemical
plants  industrial control systems
Software industry?  not quite popular yet
far@ucalgary.ca
52
FMEA: When?

FMEA cannot be done until design has
proceeded to the point that system elements
have been selected at the level the analysis is
to explore  in software system after
software architecture is finalized (requirement
volatility kills FMEA!)

FMEA can bbe ddone anytime
ti in
i the
th system
t
lifetime, from initial design onward
far@ucalgary.ca
53
FMEA: Process /1

Define the system to be analyzed, and obtain necessary
drawings charts
drawings,
charts, descriptions
descriptions, diagrams
diagrams, component lists
lists.


Break the system down into convenient and logical elements.



Know exactly what you’re analyzing; is it an area, activity,
equipment? – all of it, or part of it? What targets are to be considered?
What mission phases are included?
System breakdown can be either functional (according to what the
System elements “do”), or geographic/architectural (i.e., according to
where the system elements “are”), or both (i.e., functional within the
geographic, or vice versa).
Establish a coding system to identify system elements.
Analyze (FMEA) the elements.
far@ucalgary.ca
54
FMEA: Process /2
FMEA Outputs
 – Spreadsheet documenting each





failure mode
possible causes
possible
poss
b e co
consequences
seque ces
possible remedies
– Usually computer records kept
far@ucalgary.ca
55
FMEA Questions



Identification: How ( i.e., in what ways) can
this element fail (failure modes)?
Ramification: What will happen
pp to the
system and its environment if this element
does fail in each of the ways available to it
(failure effects)?
Prevention: What needs to be done to prevent
or mitigate the problem?
far@ucalgary.ca
56
FMEA: How to /1
Interface matrix
Components
Reflects how
components of the
system are
connected &
function together
Components

far@ucalgary.ca
57
FMEA: How to /2
far@ucalgary.ca
58
FMEA Template

Sample FMEA template
N
o
.
Part Name
Part No.
Function
1
Position
Controller
Receive
a
demand
position
Failure
Mode
Mechanisms
& Causes of
Failure
Effect(s)
of Failure
•Loose
cable
connection
•Incorrect
demand
signal
•Wear and
tear
•Operator
error
•Motor fails
to move
•Position
controller
breakdown
in a long-run
Current
Control
P.R.A.
P
S
D
R
2
4
4
4
1
3
8
48
Recommended
Corrective
Action
•Replace faulty
wire
•QC checked
•Intensive
training for
operators.
P = Probabilities (chance) of occurrences
S = Seriousness of failure
D = Likelihood that the defect will reach the customer
R = Risk priority measure (P x S x D)
1 = very low or none
2 = low or minor
3 = moderate or significant
4 = high
5 = very high or catastrophic
far@ucalgary.ca
59
Action
Taken
FMEA: Benefits



Discover potential single-point failures.
Assesses risk (FMECA) for potential, single-element
failures for each identified target, within each
mission phase.
phase
Knowing these things helps to:





Optimize
O
ti i reliability,
li bilit hence
h
mission
i i accomplishment.
li h
t
Guide design evaluation and improvement.
Guide design of system to “fail
fail safe
safe” or crash softly
softly.
Guide design of system to operate satisfactorily using
equipment of “low” reliability.
Guide component/manufacturer selection.
far@ucalgary.ca
60
FMEA: Limitations


FMEA will find and summarize system vulnerability
i terms
in
t
off SPFs,
SPF andd it requires
i lots
l t off time,
ti
money,
and effort to do so.
What if several SPFs are identified?



Developing SPF paranoia  leading to total system
redundification!
But SPFs are abundant and we encounter them daily,
yyet continue to function.
Most system nastiness failures come from complex
threats, not from SPFs  don’t ignore SPFs, just
keep them in perspective.
far@ucalgary.ca
61
FMEA: Summary


Method: from known or predicted failure modes of
components,
t determine
d t
i possible
ibl effects
ff t on system
t
Good for hazard identification early in development,
by considering possible failures of system functions:





loss of function (omission failure)
function performed incorrectly
function performed when not required
(commission failure)
No good for concurrent multiple failures
No ggood when requirements
q
change
g
far@ucalgary.ca
62
Fault Tree Analysis (FTA)



Fault tree analysis is a graphical representation of the major
faults or critical failures associated with a pproduct, the causes
for the faults, and potential countermeasures. FTA helps
identify areas of concern for new system design or for
improvement of existing systems. It also helps identify
corrective actions to correct or mitigate problems.
problems
FTA can also be defined as a graphic “model” of the
pathways within a system that can lead to a foreseeable,
undesirable event
event. The pathways interconnect contributory
events and conditions, using standard logic symbols.
Fault tree analysis is useful both in designing new systems,
products or services or in dealing with identified problems in
existing products/services. As part of process improvement, it
can be used to help identify root causes of trouble and to
g remedies and countermeasures.
design
far@ucalgary.ca
63
Steps in FTA

FMEA (Failure Modes and Effects Analysis) determines the failure
modes that are likely to cause failure events. Then it determines what
single
i l or multiple
l i l point
i failures
f il
could
ld produce
d
those
h
top level
l l
events. FMEA asks the question, “What can go wrong?” even if the
product meets specification. In order to perform FTA one first needs
to perform FMEA.
far@ucalgary.ca
65
FTA: Example 1
far@ucalgary.ca
66
FTA: Example 2
Tank overflow
AND
Inlet open
Outlet
closed
Inlet
Valve B
OR
Inlet
valve failed
X
Controller
Outlet
Valve A
Controller
failed
Y
far@ucalgary.ca
Wrong control
t inlet
to
i l t valve
l
OR
AND
Sensor
X
fails
Sensor
Y
fails
67
FTA: Benefits & Limitations


Benefits: The primary advantages of fault tree
analyses
l
are the
th meaningful
i f l data
d t they
th produce
d
which
hi h
allow evaluation and improvement of the overall
reliability of the system.
system It also evaluates the
effectiveness of and need for redundancy.
Limitation: The undesired event evaluated must be
foreseen and all significant contributors to the failure
must be anticipated. This effort may be very time
consuming and expensive. Finally, the overall
success of the process depends on the skill of the
analyst
l t involved.
i
l d
far@ucalgary.ca
68
FTA: Summary

Method: trace faults stepwise back through system design to
possible causes







a tree with a top event at the root
logic gates at branches, linking each event with its “immediate” causes
initiating faults at leaves (eventually)
Good for tracing system hazards to component failures, and
allocating safety requirements
Good for systems with multiple failures
Good for checking completeness of safety requirements
Can be difficult, time-consuming, hard to maintain
far@ucalgary.ca
69
far@ucalgary.ca
70
Download