
Safety and Reliability
ISSN: 0961-7353 (Print) 2469-4126 (Online) Journal homepage: http://www.tandfonline.com/loi/tsar20
Introduction to Redundancy
David M. Clarke & Ian Hollister
To cite this article: David M. Clarke & Ian Hollister (2010) Introduction to Redundancy, Safety and
Reliability, 30:4, 4-15, DOI: 10.1080/09617353.2010.11690919
To link to this article: http://dx.doi.org/10.1080/09617353.2010.11690919
Published online: 11 Mar 2016.
Download by: [Australian Catholic University]
Date: 31 August 2017, At: 04:17
Introduction to Redundancy
Downloaded by [Australian Catholic University] at 04:17 31 August 2017
David M. Clarke and Ian Hollister
Many issues arise from the implementation of redundancy in different ways and
in different types of systems. The objectives of this introduction are to:
• Outline the basic concepts of redundancy;
• Provide some background on the special topics of redundancy in software
and human systems;
• Describe the approach to redundancy in the design and safety case
process;
• Indicate the costs and limitations of redundancy;
• Put redundancy in the context of other approaches to design for reliability.
The foregoing will set the scene for the remaining articles in this issue of the
journal.
BASIC CONCEPTS
Definition of Redundancy
Redundancy is used in system design with the purpose of increasing reliability.
It is found in many engineering systems such as bridges, distribution systems,
process plants and communication systems.
Redundancy may be defined as the provision of more than one means of
performing a required function. In a redundant system, more than the minimum
number of equipment items necessary to perform the function is available.
An illustration of the use of redundancy is provided by Figures 1 and 2
(adapted from Green and Bourne, 1972). Assuming general compatibility
between the load and structure, the non-redundant structure in Figure 1 will fail
if a member fails while the redundant structure in Figure 2 will not.
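The effect of providing more than the minimum number of items can be illustrated with a short calculation. The following is a minimal sketch, assuming identical items that fail independently of one another (an idealisation that the discussion of dependent failures qualifies). The function name `k_out_of_n_reliability` and the item reliability of 0.9 are illustrative assumptions, not values from the article.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical, independent items work.

    r is the reliability of a single item (an illustrative assumption).
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    # Sum the binomial probabilities of exactly i working items, i = k..n.
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


# One item is the minimum needed to perform the function.
single = k_out_of_n_reliability(1, 1, 0.9)      # no redundancy
duplicated = k_out_of_n_reliability(1, 2, 0.9)  # one redundant item
print(f"single: {single:.3f}, duplicated: {duplicated:.3f}")
```

Under the independence assumption, adding one redundant item reduces the probability of losing the function from 0.1 to 0.01.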
Full and Partial Redundancy
The functional design can be duplicated in whole or in part, corresponding to
full and partial redundancy respectively.
Implied Redundancy, Component Redundancy and System Redundancy
The use of a factor of safety in design entails, in effect, some level of
redundancy. For example, using twice as many wires in a cable as are
required to support a load results in an implied redundancy. Several
components connected in parallel can provide an explicit component
redundancy. Similarly, system redundancy results where more than one system
can perform the same function.
Figure 1. Non-redundant structure - will not remain rigid if any member fails.
Figure 2. Redundant structure - will remain rigid if any member fails.
Active and Standby Redundancy
Redundancy can be classified as active or standby. In active redundancy, all
items are operating but failure of one or more items (the number depending on
the specifics of the system) does not result in system failure. In standby
redundancy, the redundant component must be switched in to take the place of
the main item when it fails.
Figure 3. Standby redundancy (blocks shown: main unit and standby unit).
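The distinction between active and standby redundancy can be illustrated numerically. The sketch below rests on textbook idealisations that are not stated in the article: two identical units with a constant failure rate, independent failures, perfect detection and switching, and a standby unit that cannot fail while dormant. The function names and numerical values are illustrative.

```python
from math import exp


def active_parallel_reliability(failure_rate: float, t: float) -> float:
    """Two identical units both operating; the system works if either survives to t."""
    r = exp(-failure_rate * t)
    return 1 - (1 - r) ** 2


def standby_reliability(failure_rate: float, t: float) -> float:
    """Main unit plus one dormant standby, assuming perfect switching.

    The system survives if at most one failure occurs by time t.
    """
    x = failure_rate * t
    return exp(-x) * (1 + x)


rate = 1e-3       # failures per hour (illustrative)
mission = 1000.0  # hours (illustrative)
single = exp(-rate * mission)
active = active_parallel_reliability(rate, mission)
standby = standby_reliability(rate, mission)
print(f"single {single:.3f}, active {active:.3f}, standby {standby:.3f}")
```

Under these assumptions the standby arrangement comes out ahead because the standby unit accumulates no operating time before it is needed. An imperfect switching mechanism would erode that advantage.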
Dependent Failures
Improvements in reliability from redundancy are limited by dependent failures.
For example, features of design and environment can lead to common cause
failure of redundant items. Similarly, two or more human operators concerned
with the same function can fail through dependence effects.
Diversity
Redundant items may be identical or diverse. Diversity is the provision of
dissimilar means of achieving the same function. For example, two identical
pumps could provide a redundant system while a motor-driven pump and a
turbine-driven pump could provide a diverse system.
Diversity provides one way of reducing the chances of failure of a redundant
system through dependent failures. Other ways include physical separation by
distance or barriers. Removing all dependencies in redundant or diverse
systems can be very difficult.
Redundancy Structures
A useful high-level description of the implementation of redundancy and
diversity is provided by the notion of redundancy structures (Brand and
Matthews, 1993):
• Single channel (no redundancy);
• Redundant system;
• Diverse system;
• Two diverse systems.
The reliability of such arrangements depends on specifics of the system.
However, when moving down the list of redundancy structures above, the
general trend is one of increasing reliability.
Reliability and Safety
Reliability and safety overlap but they are not identical. Redundancy is aimed at
increasing reliability. While improving reliability often improves safety, neither
redundancy nor design for reliability in general can be considered a complete
solution for system safety.
REDUNDANCY IN SOFTWARE SYSTEMS
Redundancy and diversity can be used in software to provide a level of fault
tolerance (Pham, 2006). Different means of achieving the same function are
provided by diverse software elements generated using different designs,
programming teams, programming languages, etc. Two ways of implementing
diversity are recovery blocks and N-version programming (NVP).
In the recovery block scheme, the output of a primary module is tested for
acceptability. If the output is not acceptable, then the state of the system
is restored to that existing before the fault occurred (backward error recovery).
A second, alternative module then executes the function. If the output is not
acceptable, then a third module comes into play, and so on. If all modules
are exhausted and no acceptable outputs are produced, then the system
fails.
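The recovery block scheme described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `recovery_block`, `RecoveryBlockFailure`, `faulty_primary` and `alternate` are hypothetical, system state is represented by a plain dictionary, and a module that raises an exception is treated in the same way as one that produces an unacceptable output.

```python
import copy
import math
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when every module has been tried without an acceptable output."""


def recovery_block(
    state: dict,
    modules: Sequence[Callable[[dict], float]],
    acceptable: Callable[[float], bool],
) -> float:
    checkpoint = copy.deepcopy(state)  # state existing before any module runs
    for module in modules:
        try:
            result = module(state)
            if acceptable(result):
                return result
        except Exception:
            pass  # a crash is handled like an unacceptable output
        # Backward error recovery: restore the checkpointed state.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("all modules exhausted")


def faulty_primary(state: dict) -> float:
    state["scratch"] = "partial work"  # side effect that must be undone
    return state["x"] / 2              # wrong square root for most inputs


def alternate(state: dict) -> float:
    return math.sqrt(state["x"])


state = {"x": 9.0}
result = recovery_block(
    state,
    [faulty_primary, alternate],
    acceptable=lambda y: abs(y * y - 9.0) < 1e-9,  # acceptance test
)
print(result, state)
```

The scheme is only as good as its acceptance test: an incorrect output that passes the test is returned unchallenged.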
In NVP, two or more functionally equivalent programmes are developed
independently to fulfil the same specification. The N programme versions are
executed in parallel and results obtained by voting on the outputs from the
different programmes. The NVP scheme differs from the recovery block
approach in that the former executes versions simultaneously while the latter
executes alternate programmes in sequence.
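A voter for the NVP scheme might look like the sketch below. Several simplifications are assumed: the versions are called one after another rather than in parallel, outputs are compared by exact equality (numerical outputs would normally need a tolerance), and a strict majority of all versions is required. The names `n_version_vote`, `NoMajorityError` and the three example versions are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


class NoMajorityError(Exception):
    """Raised when no output is produced by more than half of the versions."""


def n_version_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            continue  # a crashed version casts no vote
    if outputs:
        value, count = Counter(outputs).most_common(1)[0]
        if count > len(versions) / 2:
            return value
    raise NoMajorityError(f"no majority among outputs {outputs!r}")


# Three independently written versions of "absolute value".
def version_a(x: int) -> int:
    return abs(x)


def version_b(x: int) -> int:
    return x if x >= 0 else -x


def version_c(x: int) -> int:
    return x  # deliberately faulty for negative inputs


versions = [version_a, version_b, version_c]
voted = n_version_vote(versions, -5)
print(voted)
```

The voter masks the fault in `version_c` only because the other two versions agree. If two versions fail in the same way on the same input, the vote returns the wrong answer.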
Hybrid schemes that combine the recovery block and NVP schemes can be
used.
Diversity can also be applied to other phases of the software lifecycle,
including tests (IAEA, 2000).
REDUNDANCY IN HUMAN SYSTEMS
Redundancy in human systems can be used to increase human reliability
associated with an operator function by providing an opportunity for one
individual to detect the error of another, ultimately leading to correction of the
error. Examples of redundancy in human systems occur with respect to the
following:
• The First Officer and Captain on the flight-deck of a commercial airliner;
• The bridge team and pilot on a ship;
• A supervisor and operators in the control room of a civil nuclear power plant
or chemical plant;
• The shift technical advisor and normal operating team in the control room of
a civil nuclear power plant.
A variety of factors influence the likelihood that human redundancy will result in
the recovery of an error (Clarke, 2005). Examples of factors of particular
importance are:
• Presence of alerting factors that highlight the occurrence of an error;
• Whether checks are active or passive (active checks generally being more
reliable);
• Checker's knowledge of the operator's competence (which can create an
expectation in the checker that performance is satisfactory, which tends to
undermine the effectiveness of a check);
• Excessive professional courtesy between individuals (which can inhibit an
individual from highlighting the error of another);
• Level of cognitive diversity (see below).
The effectiveness of human redundancy can be supported, and hence human
reliability enhanced, by use of appropriate forms of human redundancy. For
example, in some situations, a single individual providing redundancy with
respect to an operator function may be sufficient. In contrast, where a very
high level of reliability is required, an operating team supported by a supervisor
and an independent observer, with procedural requirements for checks at
critical stages of operations, may be appropriate.
For a given form of human redundancy, overall reliability can be enhanced
through control of influencing factors. Examples of design measures that can
be taken are:
• Design of equipment such that unwanted system states resulting from
operator errors are clearly visible to a supervisor;
• Use of checks that are active rather than passive;
• Development of a culture in which it is acceptable for individuals to
challenge the actions of others;
• Use of cognitive diversity.
Cognitive diversity, by reducing the chances of failures of common origin, can
be used to increase the effectiveness of human redundancy. It is characterized
as the availability of different cognitive behaviours to fulfil a required function,
where differences originate either within operators or within their environment
(Westerman et al. 1995, 1997). Use can be made of the potential for diversity
in the areas of information, equipment, intrinsic human capabilities, skill and
organization (Hollywell 1996). For example, where an operator achieves a given
function using one source of information, a checker could use a different
source.
DESIGN AND SAFETY CASE METHODOLOGY
Implementation of Redundancy
In principle, the addition of redundant components increases the reliability of
systems as multiple failures are required to induce system failure. The HSE
guidance document, T/AST/036 (HSE, 2009), provides a useful discussion of
the implementation of redundancy and diversity in safety-critical systems for
mechanical plant. Whilst targeted at the nuclear industry, the guidance provided
is equally applicable to other industries.
One of the threats that can undermine the potential increase in reliability from
use of redundancy is dependent failure. Dependent failures are those that
cause simultaneous failure of multiple components. The susceptibility of
redundant systems to dependent failures can be reduced by the incorporation
of diverse provisions. Diversity in the design of complex plant can be achieved
in a variety of ways:
• System functional diversity: the provision of the same function at system
level by differing methods, e.g. incorporation of an air cooling system as a
back-up to a duty water cooling system;
• Equipment functional diversity: the provision of the same function at
equipment level by differing methods, e.g. pumps driven by electrical motors
and others in the same system driven by hydraulic pressure;
• Equipment supplier diversity: nominally similar equipment procured from
different suppliers;
• Support diversity: the reduction in potential for dependent failures caused
by the support systems, e.g. mandating that redundant components are
maintained by different staff to reduce the possibility of common
maintenance errors.
Certain dependent failures in redundant systems may be reduced by
segregation, where components are separated by distance or physical barriers
such as fire barriers.
Redundancy must be considered for all activities and stages associated with a
system's lifecycle. As an illustration, the planning and execution of
Examination, Maintenance, Inspection and Testing (EMIT) activities should
account for redundancies in systems. Examples of appropriate practices,
subject to practicality and costs in specific instances, include:
• Staggering of tests to minimize the chances of redundant trains failing at
the same time;
• Ensuring that failures in redundant trains are identified despite the potential
for systems incorporating redundancy to pass a simple functional test while
equipment in that system is in a failed state.
In terms of incorporation of redundancy in safety cases, the deterministic and
probabilistic elements can be considered separately.
Deterministic Safety Case
A key element of the deterministic case is the identification and justification of
the duty safety systems and preventative, protective and mitigating safety
measures that maintain the safety functions of the plant. A key principle of the
deterministic case is that the safety systems and measures are as
independent of each other as possible to minimise the potential for dependent
failures.
Thus, a safety case should identify the key safety systems and measures and
the fault sequences in which they are claimed. A fault schedule will often be
used to record such information. A systematic assessment of the degree of
independence between the safety systems and measures should be
undertaken. This is best structured around the key engineering safety
principles relating to equipment failures, notably:
• Redundancy;
• Diversity;
• Segregation;
• Single failure criterion;
• Dependent failures.
Probabilistic Safety Case
The key objective of a probabilistic case is to determine a numerical value for
the frequency of accidents of a defined nature. This assessment is commonly
undertaken using fault trees and/or event trees. Both techniques involve the
multiplication of the probability of events. It is valid to multiply such
probabilities only if the events are independent. Therefore any probabilistic
assessment must allow for dependency between events. This can be via
explicit modelling of events such as loss of common supporting systems or via
an estimative technique that provides a numerical estimate of the proportion of
failures due to dependent failures.
Common cause failures in redundant systems can be assessed using one of
several techniques (Smith, 2005), including:
• The beta model;
• The partial beta model;
• The beta plus model (Smith, 2005; 2008);
• The multiple Greek letter model.
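To indicate why the allowance for dependency matters, the sketch below applies the simplest of these techniques, the beta model, in its commonly quoted approximate form to a redundant pair in which either unit can perform the function. A fraction beta of each unit's failure probability is attributed to common cause and defeats both units together; the remainder is treated as independent. The function names and the values q = 0.001 and beta = 0.1 are illustrative assumptions, and the two contributions are simply added, which is reasonable only when both are small.

```python
def independent_pair_failure(q: float) -> float:
    """Naive estimate: multiply the unit failure probabilities."""
    return q * q


def beta_model_pair_failure(q: float, beta: float) -> float:
    """Beta model estimate for a redundant pair (rare-event approximation)."""
    independent_part = ((1 - beta) * q) ** 2  # both units fail independently
    common_cause_part = beta * q              # one cause fails both units
    return independent_part + common_cause_part


q = 1e-3    # failure probability of one unit (illustrative)
beta = 0.1  # common cause fraction (illustrative)
naive = independent_pair_failure(q)
with_ccf = beta_model_pair_failure(q, beta)
print(f"naive: {naive:.3e}, beta model: {with_ccf:.3e}")
```

With these illustrative numbers the common cause term dominates, and the estimated failure probability of the pair is about a hundred times the naive product.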
COSTS AND LIMITATIONS OF REDUNDANCY
Costs
In hardware systems, redundancy can give rise to a variety of costs, including
the costs of redundant or diverse components, additional power equipment,
spares and maintenance. For example, use of diverse pumps will increase the
spares holdings and tooling required. There can also be penalties from
increased size and weight.
Similarly, in software and human systems, extra resources are required to
develop redundant and diverse system elements. For example, multi-version
programming gives rise to costs of producing the extra version(s).
The costs of redundancy must be balanced against the benefits to reliability.
Also to be considered in the overall trade-off is the question of the cost and
effectiveness of redundancy compared to the cost and effectiveness of
reducing failures by other means.
Accidents
Limitations of redundancy are indicated by many accidents in which the failure
of redundant systems played a part. In 1989, United Flight 232 crashed when
engine failure generated fragments that resulted in failure of independent
hydraulic systems crucial for normal flight control. Prior to the accident, the
likelihood of the redundant system failing had been considered extremely
remote. A contribution to the Clapham Junction railway accident was a failure
on the part of an electrician to carry out a wiring task satisfactorily. A failure of
human redundancy occurred when a supervisor did not detect the error made
by the electrician.
Dependent Failures
Dependent failure represents an important limitation on redundant systems. To
illustrate:
• A redundant hardware system may fail as a result of design error affecting
multiple components, as a consequence of failure of a common power line,
because of a fire or explosion, or because of projectiles (as in the loss of
hydraulic systems on United Flight 232);
• There is evidence that different teams that generate different versions in
multi-version programming do not create software that fails independently.
For example, different programmers tend to make mistakes in the same,
more difficult parts of a problem. One experiment (Brilliant et al, 1990)
found that versions using different algorithms and containing different
programming errors still failed on the same inputs;
• A supervisor and operator may both misread an indication (due to a
shortcoming in a display to which both individuals refer) or misinterpret
a piece of information (perhaps due to a weakness in training undertaken
by both individuals), giving rise to a dependent failure in system
operation.
Effects on Other Failure Modes
When redundancy is implemented with respect to a given failure mode, it may
have adverse effects on other failure modes. For example, adding a valve in
series with a pressure control valve will improve reliability with respect to fail
open leading to overpressure, but reduce reliability with respect to fail shut
leading to loss of supply.
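The valve example can be put into numbers. The sketch below assumes that each valve independently fails open with probability `p_open` and fails shut with probability `p_shut` when demanded; the function names and the value of 0.01 used for both probabilities are illustrative assumptions.

```python
def overpressure_probability(p_open: float, n_valves: int) -> float:
    """Overpressure requires every valve in the series line to fail open."""
    return p_open ** n_valves


def loss_of_supply_probability(p_shut: float, n_valves: int) -> float:
    """Supply is lost if any one valve in the series line fails shut."""
    return 1 - (1 - p_shut) ** n_valves


p_open = 0.01  # illustrative
p_shut = 0.01  # illustrative
for n in (1, 2):
    print(
        f"{n} valve(s): overpressure {overpressure_probability(p_open, n):.4f}, "
        f"loss of supply {loss_of_supply_probability(p_shut, n):.4f}"
    )
```

With these figures the second valve reduces the overpressure probability by a factor of one hundred while roughly doubling the probability of loss of supply.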
Complexity
A criticism that has been made of the use of redundancy in design is that its
use can increase the complexity of the system. The concern is that increased
complexity may result in failures, defeating the purpose of redundancy.
False Sense of Security
Another concern is that, when the limitations of redundancy are not recognized,
its use can lead to a false sense of security. Different groups of people, such
as managers and operators, may perceive a system to be more reliable or safe
than it actually is.
The Redundancy Debate
There has been significant debate in recent times about the value of
redundancy as a technique for achieving reliability and safety. That debate has
focussed on limitations of redundancy described above.
One specific criticism has been that redundancy 'backfires' through common
cause failures. In this view, common cause failures are seen as undermining
redundancy, which is therefore regarded as an ineffective system design
technique. Hoepfer et al (2009) present an analysis suggesting that, while
caution about the benefits of redundancy is warranted, unqualified criticisms of
redundancy are inappropriate. Rather, the value of redundancy should be
viewed as contingent on the prevalence of common cause failures, the level of
redundancy considered, and cost-benefit. Where redundant systems are used,
a key part of the design effort is the identification and minimization of common
cause failures.
OTHER DESIGN APPROACHES
Given the limitations of redundancy, the technique must be seen, not as a total
solution to design for reliability, but rather as one approach amongst several
others. Other techniques include:
• Over-design;
• De-rating;
• Timed replacements;
• Simplification;
• Maintainability techniques;
• Design review;
• Corrective action;
• Testing and debugging of software;
• Human error prevention approaches.
The task of the designer is to select the most appropriate solution in given
circumstances.
CONCLUSIONS
Concepts of redundancy applicable to mechanical, software and human
systems have been described. The treatment of redundancy in design and
safety case methodology has been outlined. Redundancy can be an effective
way of enhancing reliability but a variety of costs and constraints must be
recognized and taken into account when considering its use as a design
technique.
ACKNOWLEDGEMENTS
The authors would like to acknowledge helpful comments from David J. Smith
on an earlier draft of this introduction.
REFERENCES
References Cited in the Text
Brand, V.P. and Matthews, R.H. (1993) Dependent failures - when it all goes wrong at once, Nuclear
Energy, vol. 32, no. 3.
Brilliant, S., Knight, J.C. and Leveson, N. (1990) Analysis of faults in an N-version software
experiment, IEEE Transactions on Software Engineering, SE-16(2), February.
Clarke, D.M. (2005) Human redundancy in complex, hazardous systems: a theoretical framework,
Safety Science, vol. 43, pp. 655-677.
Green, A.E. and Bourne, A.J. (1972) Reliability Technology, London: Wiley-Interscience.
HSE (2009) Diversity, redundancy, segregation and layout of mechanical plant, T/AST/036 - Issue
02, Health and Safety Executive.
Hoepfer, V.M., Saleh, J.H. and Marais, K.B. (2009) On the value of redundancy subject to
common-cause failures: Toward the resolution of an on-going debate, Reliability Engineering and
System Safety, vol. 94, pp. 1904-1916.
Hollywell, P.D. (1996) Incorporating human dependent failures in risk assessments to improve
estimates of "actual risk", Safety Science, vol. 22, nos. 1-3, pp. 177-194.
IAEA (2000) Software for Computer Based Systems Important to Safety in Nuclear Power Plants (IAEA
Safety Standards Series), Vienna: International Atomic Energy Agency.
Pham, H. (2006) System Software Reliability, London: Springer-Verlag.
Smith, D.J. (2005) Reliability, Maintainability and Risk, 8th edition, Oxford: Butterworth-Heinemann.
Smith, D.J. (2008) Developments in the common cause failure quantification, Safety and Reliability,
vol. 28, no. 1, pp. 32-36.
Westerman, S.J., Shryane, N.M., Crawshaw, C.M., Hockey, G.R.J. and Wyatt-Millington, C.W. (1995)
Cognitive diversity: a structured approach to trapping human error. In: Safecomp '95,
Proceedings of the 14th International Conference on Computer Safety, Reliability and Security,
Springer, London, pp. 142-155.
Westerman, S.J., Shryane, N.M., Crawshaw, C.M. and Hockey, G.R.J. (1997) Engineering cognitive
diversity. In: Redmill, F. and Anderson, T. (Eds.) Safer Systems: Proceedings of the Fifth
Safety-Critical Systems Symposium, Springer, London, pp. 111-120.
USEFUL GENERAL REFERENCES
Elsayed, E. (1996) Reliability Engineering, Reading, Massachusetts: Addison Wesley Longman.
Kumamoto, H. and Henley, E.J. (1996) Probabilistic Risk Assessment and Management for Engineers
and Scientists, Second Edition, Piscataway, NJ: IEEE Press.