Safety and Reliability
ISSN: 0961-7353 (Print) 2469-4126 (Online)
Journal homepage: http://www.tandfonline.com/loi/tsar20

Introduction to Redundancy
David M. Clarke & Ian Hollister

To cite this article: David M. Clarke & Ian Hollister (2010) Introduction to Redundancy, Safety and Reliability, 30:4, 4-15, DOI: 10.1080/09617353.2010.11690919
To link to this article: http://dx.doi.org/10.1080/09617353.2010.11690919
Published online: 11 Mar 2016.
Download by: [Australian Catholic University] Date: 31 August 2017, At: 04:17

Many issues arise from the implementation of redundancy in different ways and in different types of systems. The objectives of this introduction are to:

• Outline the basic concepts of redundancy;
• Provide some background on the special topics of redundancy in software and human systems;
• Describe the approach to redundancy in the design and safety case process;
• Indicate the costs and limitations of redundancy;
• Put redundancy in the context of other approaches to design for reliability.

The foregoing will set the scene for the remaining articles in this issue of the journal.

BASIC CONCEPTS

Definition of Redundancy

Redundancy is used in system design with the purpose of increasing reliability. It is found in many engineering systems such as bridges, distribution systems, process plants and communication systems. Redundancy may be defined as the provision of more than one means of performing a required function. In a redundant system, more than the minimum number of equipment items necessary to perform the function is available.
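As a rough numerical illustration of this definition, suppose a function needs k working items out of n installed, each item has the same reliability r, and items fail independently. These assumptions, the function name k_out_of_n_reliability and the figures used are illustrative only and are not taken from the article; in particular, the independence assumption is optimistic, since dependent failures limit the benefit in practice.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical items work.

    Assumes (hypothetically) that items fail independently and that
    each item works with the same probability r.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    # Sum the binomial probabilities of exactly i items working, i = k..n.
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


# One item required, one installed: no redundancy.
single = k_out_of_n_reliability(1, 1, 0.9)
# One item required, two installed: one redundant item.
duplicated = k_out_of_n_reliability(1, 2, 0.9)
```

Under these assumptions, adding one redundant item to a single item of reliability 0.9 raises the reliability of the function to about 0.99.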
An illustration of the use of redundancy is provided by Figures 1 and 2 (adapted from Green and Bourne, 1972). Assuming general compatibility between the load and structure, the non-redundant structure in Figure 1 will fail if a member fails while the redundant structure in Figure 2 will not.

Figure 1. Non-redundant structure - will not remain rigid if any member fails.

Figure 2. Redundant structure - will remain rigid if any member fails.

Full and Partial Redundancy

The functional design can be duplicated in whole or in part, corresponding to full and partial redundancy respectively.

Implied Redundancy, Component Redundancy and System Redundancy

The use of a factor of safety in design entails, in effect, some level of redundancy. For example, using twice as many wires in a cable as are required to support a load results in an implied redundancy. Several components connected in parallel can provide an explicit component redundancy. Similarly, system redundancy results where more than one system can perform the same function.

Active and Standby Redundancy

Redundancy can be classified as active or standby. In active redundancy, all items are operating but failure of one or more items (the number depending on the specifics of the system) does not result in system failure. In standby redundancy, the redundant component must be switched in to take the place of the main item when it fails.

Figure 3. Standby redundancy (main unit and standby unit).

Dependent Failures

Improvements in reliability from redundancy are limited by dependent failures. For example, features of design and environment can lead to common cause failure of redundant items. Similarly, two or more human operators concerned with the same function can fail through dependence effects.

Diversity

Redundant items may be identical or diverse.
Diversity is the provision of dissimilar means of achieving the same function. For example, two identical pumps could provide a redundant system while a motor-driven pump and a turbine-driven pump could provide a diverse system. Diversity provides one way of reducing the chances of failure of a redundant system through dependent failures. Other ways include physical separation by distance or barriers. Removing all dependencies in redundant or diverse systems can be very difficult.

Redundancy Structures

A useful high-level description of the implementation of redundancy and diversity is provided by the notion of redundancy structures (Brand and Matthews, 1993):

• Single channel (no redundancy);
• Redundant system;
• Diverse system;
• Two diverse systems.

The reliability of such arrangements depends on specifics of the system. However, when moving down the list of redundancy structures above, the general trend is one of increasing reliability.

Reliability and Safety

Reliability and safety overlap but they are not identical. Redundancy is aimed at increasing reliability. While improving reliability often improves safety, neither redundancy nor design for reliability in general can be considered a complete solution for system safety.

REDUNDANCY IN SOFTWARE SYSTEMS

Redundancy and diversity can be used in software to provide a level of fault tolerance (Pham, 2006). Different means of achieving the same function are provided by diverse software elements generated using different designs, programming teams, programming languages, etc. Two ways of implementing diversity are recovery blocks and N-version programming (NVP).

In the recovery block scheme, the output of a primary module is tested for acceptability. If the output is not acceptable, then the state of the system is restored to that existing before the fault occurred (backward error recovery).
A second, alternative module then executes the function. If the output is not acceptable, then a third module comes into play, and so on. If all modules are exhausted and no acceptable outputs are produced, then the system fails.

In NVP, two or more functionally equivalent programmes are developed independently to fulfil the same specification. The N programme versions are executed in parallel and results obtained by voting on the outputs from the different programmes. The NVP scheme differs from the recovery block approach in that the former executes versions simultaneously while the latter executes alternate programmes in sequence.

Hybrid schemes that combine the recovery block and NVP schemes can be used. Diversity can also be applied to other phases of the software lifecycle, including tests (IAEA, 2000).

REDUNDANCY IN HUMAN SYSTEMS

Redundancy in human systems can be used to increase human reliability associated with an operator function by providing an opportunity for one individual to detect the error of another, ultimately leading to correction of the error. Examples of redundancy in human systems occur with respect to the following:

• The First Officer and Captain on the flight-deck of a commercial airliner;
• The bridge team and pilot on a ship;
• A supervisor and operators in the control room of a civil nuclear power plant or chemical plant;
• The shift technical advisor and normal operating team in the control room of a civil nuclear power plant.

A variety of factors influence the likelihood that human redundancy will result in the recovery of an error (Clarke, 2005).
Examples of factors of particular importance are:

• Presence of alerting factors that highlight the occurrence of an error;
• Whether checks are active or passive (active checks generally being more reliable);
• Checker's knowledge of the operator's competence (which can create an expectation in the checker that performance is satisfactory, which tends to undermine the effectiveness of a check);
• Excessive professional courtesy between individuals (which can inhibit an individual from highlighting the error of another);
• Level of cognitive diversity (see below).

The effectiveness of human redundancy can be supported, and hence human reliability enhanced, by use of appropriate forms of human redundancy. For example, in some situations, a single individual providing redundancy with respect to an operator function may be sufficient. In contrast, where a very high level of reliability is required, an operating team supported by a supervisor and an independent observer, with procedural requirements for checks at critical stages of operations, may be appropriate.

For a given form of human redundancy, overall reliability can be enhanced through control of influencing factors. Examples of design measures that can be taken are:

• Design of equipment such that unwanted system states resulting from operator errors are clearly visible to a supervisor;
• Use of checks that are active rather than passive;
• Development of a culture in which it is acceptable for individuals to challenge the actions of others;
• Use of cognitive diversity.

Cognitive diversity, by reducing the chances of failures of common origin, can be used to increase the effectiveness of human redundancy. It is characterized as the availability of different cognitive behaviours to fulfil a required function, where differences originate either within operators or within their environment (Westerman et al., 1995, 1997).
Use can be made of the potential for diversity in the areas of information, equipment, intrinsic human capabilities, skill and organization (Hollywell, 1996). For example, where an operator achieves a given function using one source of information, a checker could use a different source.

DESIGN AND SAFETY CASE METHODOLOGY

Implementation of Redundancy

In principle, the addition of redundant components increases the reliability of systems as multiple failures are required to induce system failure. The HSE guidance document, T/AST/036 (HSE, 2009), provides a useful discussion of the implementation of redundancy and diversity in safety-critical systems for mechanical plant. Whilst targeted at the nuclear industry, the guidance provided is equally applicable to other industries.

One of the threats that can undermine the potential increase in reliability from use of redundancy is dependent failure. Dependent failures are those that cause simultaneous failure of multiple components. The susceptibility of redundant systems to dependent failures can be reduced by the incorporation of diverse provisions. Diversity in the design of complex plant can be achieved in a variety of ways:

• System functional diversity: the provision of the same function at system level by differing methods, e.g. incorporation of an air cooling system as a back-up to a duty water cooling system;
• Equipment functional diversity: the provision of the same function at equipment level by differing methods, e.g. pumps driven by electrical motors and others in the same system driven by hydraulic pressure;
• Equipment supplier diversity: nominally similar equipment procured from different suppliers;
• Support diversity: the reduction in potential for dependent failures caused by the support systems, e.g. mandating that redundant components are maintained by different staff to reduce the possibility of common maintenance errors.
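The way dependent failures erode the benefit of a duplicated system can be sketched with a simple beta-factor style calculation, in which a fraction beta of each item's failure probability p is assumed to come from causes that fail both items together. The function name, the parameter values and the rare-event approximation are assumptions made for illustration; they are not taken from T/AST/036 or from the analysis in this article.

```python
def duplicated_system_failure(p: float, beta: float) -> float:
    """Approximate failure probability of a two-item redundant system.

    Hypothetical beta-factor sketch:
      - p is the failure probability of one item;
      - beta is the assumed fraction of p due to common causes,
        which fail both items at once;
      - the remaining (1 - beta) * p is treated as independent.
    Uses the rare-event approximation, valid only for small p.
    """
    if not 0.0 <= p <= 1.0 or not 0.0 <= beta <= 1.0:
        raise ValueError("p and beta must lie between 0 and 1")
    common = beta * p                 # one shared cause fails both items
    independent = (1.0 - beta) * p    # per-item independent failure
    # System fails on the common cause or on two independent failures.
    return common + independent**2


fully_independent = duplicated_system_failure(0.01, 0.0)
with_common_cause = duplicated_system_failure(0.01, 0.1)
```

With p = 0.01, the approximate failure probability of the pair rises from about 1e-4 with no common cause contribution to about 1.1e-3 with beta = 0.1. On these assumed figures the common cause term dominates, which suggests why measures that reduce dependence, such as the forms of diversity listed above, matter as much as the redundancy itself.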
Certain dependent failures in redundant systems may be reduced by segregation, where components are separated by distance or physical barriers such as fire barriers.

Redundancy must be considered for all activities and stages associated with a system's lifecycle. As an illustration, the planning and execution of Examination, Maintenance, Inspection and Testing (EMIT) activities should account for redundancies in systems. Examples of appropriate practices, subject to practicality and costs in specific instances, include:

• Staggering of tests to minimize the chances of redundant trains failing at the same time;
• Ensuring that failures in redundant trains are identified despite the potential for systems incorporating redundancy to pass a simple functional test while equipment in that system is in a failed state.

In terms of incorporation of redundancy in safety cases, the deterministic and probabilistic elements can be considered separately.

Deterministic Safety Case

A key element of the deterministic case is the identification and justification of the duty safety systems and preventative, protective and mitigating safety measures that maintain the safety functions of the plant. A key principle of the deterministic case is that the safety systems and measures are as independent of each other as possible to minimise the potential for dependent failures. Thus, a safety case should identify the key safety systems and measures and the fault sequences in which they are claimed. A fault schedule will often be used to record such information.

A systematic assessment of the degree of independence between the safety systems and measures should be undertaken. This is best structured around the key engineering safety principles relating to equipment failures, notably:

• Redundancy;
• Diversity;
• Segregation;
• Single failure criterion;
• Dependent failures.
Probabilistic Safety Case

The key objective of a probabilistic case is to determine a numerical value for the frequency of accidents of a defined nature. This assessment is commonly undertaken using fault trees and/or event trees. Both techniques involve the multiplication of the probability of events. It is valid to multiply such probabilities only if the events are independent. Therefore any probabilistic assessment must allow for dependency between events. This can be via explicit modelling of events such as loss of common supporting systems or via an estimative technique that provides a numerical estimate of the proportion of failures due to dependent failures.

Common cause failures in redundant systems can be assessed using one of several techniques (Smith, 2005), including:

• The beta model;
• The partial beta model;
• The beta plus model (Smith, 2005; 2008);
• The multiple Greek letter model.

COSTS AND LIMITATIONS OF REDUNDANCY

Costs

In hardware systems, redundancy can give rise to a variety of costs, including the costs of redundant or diverse components, additional power equipment, spares and maintenance. For example, use of diverse pumps will increase the spares holdings and tooling required. There can also be penalties from increased size and weight.

Similarly, in software and human systems, extra resources are required to develop redundant and diverse system elements. For example, multi-version programming gives rise to costs of producing the extra version(s).

The costs of redundancy must be balanced against the benefits to reliability. Also to be considered in the overall trade-off is the question of the cost and effectiveness of redundancy compared to the cost and effectiveness of reducing failures by other means.

Accidents

Limitations of redundancy are indicated by many accidents in which the failure of redundant systems played a part.
In 1989, United Flight 232 crashed when engine failure generated fragments that resulted in failure of independent hydraulic systems crucial for normal flight control. Prior to the accident, the likelihood of the redundant system failing had been considered extremely remote.

A contribution to the Clapham Junction railway accident was a failure on the part of an electrician to carry out a wiring task satisfactorily. A failure of human redundancy occurred when a supervisor did not detect the error made by the electrician.

Dependent Failures

Dependent failure represents an important limitation on redundant systems. To illustrate:

• A redundant hardware system may fail as a result of design error affecting multiple components, as a consequence of failure of a common power line, because of a fire or explosion, or because of projectiles (as in the loss of hydraulic systems on United Flight 232);
• There is evidence that different teams that generate different versions in multi-version programming do not create software that fails independently. For example, different programmers tend to make mistakes in the same, more difficult parts of a problem. One experiment (Brilliant et al., 1990) found that versions using different algorithms and containing different programming errors still failed on the same inputs.
• A supervisor and operator may both misread an indication (due to a shortcoming in a display to which both individuals refer) or misinterpret a piece of information (perhaps due to a weakness in training undertaken by both individuals), giving rise to a dependent failure in system operation.

Effects on Other Failure Modes

When redundancy is implemented with respect to a given failure mode, it may have adverse effects on other failure modes.
For example, adding a valve in series with a pressure control valve will improve reliability with respect to fail open leading to overpressure, but reduce reliability with respect to fail shut leading to loss of supply.

Complexity

A criticism that has been made of the use of redundancy in design is that its use can increase the complexity of the system. The concern is that increased complexity may result in failures, defeating the purpose of redundancy.

False Sense of Security

Another concern is that, when the limitations of redundancy are not recognized, its use can lead to a false sense of security. Different groups of people, such as managers and operators, may perceive a system to be more reliable or safe than it actually is.

The Redundancy Debate

There has been significant debate in recent times about the value of redundancy as a technique for achieving reliability and safety. That debate has focussed on limitations of redundancy described above. One specific criticism has been that redundancy 'backfires' through common cause failures. In this view, common cause failures are seen as undermining redundancy, which is therefore regarded as an ineffective system design technique.

Hoepfer et al. (2009) present an analysis suggesting that, while caution about the benefits of redundancy is warranted, unqualified criticisms of redundancy are inappropriate. Rather, the value of redundancy should be viewed as contingent on the prevalence of common cause failures, the level of redundancy considered, and cost-benefit. Where redundant systems are used, a key part of the design effort is the identification and minimization of common cause failures.

OTHER DESIGN APPROACHES

Given the limitations of redundancy, the technique must be seen, not as a total solution to design for reliability, but rather as one approach amongst several others.
Other techniques include:

• Over-design;
• De-rating;
• Timed replacements;
• Simplification;
• Maintainability techniques;
• Design review;
• Corrective action;
• Testing and debugging of software;
• Human error prevention approaches.

The task of the designer is to select the most appropriate solution in given circumstances.

CONCLUSIONS

Concepts of redundancy applicable to mechanical, software and human systems have been described. The treatment of redundancy in design and safety case methodology has been outlined. Redundancy can be an effective way of enhancing reliability but a variety of costs and constraints must be recognized and taken into account when considering its use as a design technique.

ACKNOWLEDGEMENTS

The authors would like to acknowledge helpful comments from David J. Smith on an earlier draft of this introduction.

REFERENCES

References Cited in the Text

Brand, V.P. and Matthews, R.H. (1993) Dependent failures - when it all goes wrong at once, Nuclear Energy, vol. 32, no. 3.
Brilliant, S., Knight, J.C. and Leveson, N. (1990) Analysis of faults in an N-version software experiment, IEEE Transactions on Software Engineering, SE-16(2), February.
Clarke, D.M. (2005) Human redundancy in complex, hazardous systems: a theoretical framework, Safety Science, vol. 43, pp. 655-677.
Green, A.E. and Bourne, A.J. (1972) Reliability Technology, London: Wiley-Interscience.
HSE (2009) Diversity, redundancy, segregation and layout of mechanical plant, T/AST/036 - Issue 02, Health and Safety Executive.
Hoepfer, V.M., Saleh, J.H. and Marais, K.B. (2009) On the value of redundancy subject to common-cause failures: Toward the resolution of an on-going debate, Reliability Engineering and System Safety, vol. 94, pp. 1904-1916.
Hollywell, P.D. (1996) Incorporating human dependent failures in risk assessments to improve estimates of "actual risk", Safety Science, vol. 22, nos. 1-3, pp. 177-194.
IAEA (2000) Software for Computer Based Systems Important to Safety in Nuclear Power Plants (IAEA Safety Standards Series), Vienna: International Atomic Energy Agency.
Pham, H. (2006) System Software Reliability, London: Springer-Verlag.
Smith, D.J. (2005) Reliability, Maintainability and Risk, 8th edition, Oxford: Butterworth-Heinemann.
Smith, D.J. (2008) Developments in the common cause failure quantification, Safety and Reliability, vol. 28, no. 1, pp. 32-36.
Westerman, S.J., Shryane, N.M., Crawshaw, C.M., Hockey, G.R.J. and Wyatt-Millington, C.W. (1995) Cognitive diversity: a structured approach to trapping human error. In: Safecomp '95, Proceedings of the 14th International Conference on Computer Safety, Reliability and Security, Springer, London, pp. 142-155.
Westerman, S.J., Shryane, N.M., Crawshaw, C.M. and Hockey, G.R.J. (1997) Engineering cognitive diversity. In: Redmill, F. and Anderson, T. (Eds.) Safer Systems: Proceedings of the Fifth Safety-Critical Systems Symposium, Springer, London, pp. 111-120.

USEFUL GENERAL REFERENCES

Elsayed, E. (1996) Reliability Engineering, Reading, Massachusetts: Addison Wesley Longman.
Kumamoto, H. and Henley, E.J. (1996) Probabilistic Risk Assessment and Management for Engineers and Scientists, Second Edition, Piscataway, NJ: IEEE Press.