CHAPTER 1
Introduction to functional safety

Abstract
To manage hazards in the process industries, the associated risk of undesired incidents needs to be evaluated and managed. As an illustration, a fictional incident on a badly managed process plant is narrated, along with proceedings at the subsequent board of enquiry (also fictional). Functional safety, which is safety achieved by means of automatic systems, is one approach available to manage risk. Functional safety standards applicable to a range of industry sectors are available, in particular the process sector standard IEC 61511. Key functional safety concepts, in particular the functional safety lifecycle, random and systematic failures, and competency management, are introduced and explained.

Keywords: Competency; Functional safety; Functional safety lifecycle; Harm; Hazard; IEC 61511; Intrinsically safer design; Random failure; Risk; Systematic failure.

1.1 What could possibly go wrong?

All’s quiet in the control room. A routine Sunday evening, and the information screens glow in bright colours; status information on all the tanks and pumps outside to the operator’s left, and a half-finished Solitaire game to the right. A flow quantifier ticks over silently on number 1 tank, registering an incoming transfer of flammable solvent via pipeline from another site several kilometres away.

The operator flips through the file of open maintenance work orders on the desk. That faulty level sensor again; the work order has been open for three months now. Maybe they’ll get round to it eventually. The high level trip bypass warning light has been glowing for so long that nobody even notices any more. Anyway, who cares? We’ve got backup systems, the operator thinks. This place is safe as a rock.

An alarm sounds; a discreet baap-baap from a speaker on top of the console. Irritated by the disturbance, the operator stretches out a lazy finger and stabs the well-worn Acknowledge button.
Just that low oil pressure alarm on number 4 cooling pump again, I suppose. Checked it out twice before, false alarm every time. Anyway, half the alarms that come up, I don’t even know what they mean. Forget it. Back to the Solitaire game.

Functional Safety from Scratch. https://doi.org/10.1016/B978-0-443-15230-6.00013-6 Copyright © 2023 Elsevier Inc. All rights reserved.

Solvent flows into number 1 tank, just like it always does on a Sunday night transfer. The level transmitter registered a fault an hour ago, so the operator switched over to a backup transmitter. The level creeps up, 10 cm/min. The actual level’s already too high and the operator should have shut it off by now, but the level is showing only 65% on the screen. The backup transmitter has never been used before and it is miscalibrated, set up for a denser solvent that used to occupy this tank. The operator, satisfied that the transfer still has an hour to go, flips back to the gaudily coloured site overview screen.

The level creeps up to the high level alarm sensor. It hasn’t worked for years and nobody can test it, because the wiring is on an inaccessible part of the top of the tank. Up, up goes the level to the high level trip point. This time, the last-chance level sensor works but the trip is bypassed; that same work order the operator just flipped through was closed a month ago but nobody reset the bypass.

Solvent hits the tank’s overflow pipe and starts to pour out into the spill containment bund. A flammable vapour detector picks it up and raises an alarm in the control room, but the operator ignores it because of all the false alarms. Every time the wind blows, the vapour alarm rings. They should fix that someday.

It’s a warm, still summer’s evening. Not a breath of wind. The solvent, gushing out of the tank into the bund, evaporates to form a relentlessly expanding cloud of vapour, an invisible ball of disaster waiting to strike.
Spreading outwards, now 50 m, now a hundred metres from the tank, it silently envelops the site and creeps over the fence to the neighbouring facility.

Next door’s nightwatchman is out on patrol. So many rats around here; what if they chew the cables, he wonders. An unexpected chemical smell catches his attention. Glue? Paint? Who could be painting at this time of night? He walks across to a storeroom at the back of the warehouse, facing the tank storage site. The smell is really strong just here. Maybe the rats knocked over some can of chemical waste? He pulls his flashlight from his belt and flips it on.

There is a spark.

1.2 Hazard and risk

1.2.1 What is a hazard?

The chairperson pulls her desktop microphone closer and flicks the switch. “Good morning, everybody. This is day four of the board of enquiry into the explosion and fire at ABC Solvents on 16th August last year. Today, we have an expert witness from the National Safety Council, Mr Ben Kim. Welcome, Mr Kim.”

The witness nods and settles in his seat, looking round the room.

“Mr Kim, yesterday, we heard from another witness that the safety features on the tank were hazardous because they were not in proper working order. Can you tell us, in your view, what should have been done to keep them working properly?”

“Thank you, Madam Chairperson. May I offer a correction to your question? The term hazard cannot correctly be applied to the safety features themselves. First, we should understand what a hazard means: it is some physical aspect of our equipment which has the potential to cause something we don’t want to happen. In this case, it is best to think of the solvent, not the equipment, as the hazard. To explain what I mean: suppose the tank were filled with water instead of solvent. Could the fire have happened? Of course not, as the hazard arises from the nature of the solvent.
“The failure of the safety features is better defined as an initiating event, because it initiates a chain of events made possible by the existence of the hazard.

“Actually, the international standard IEC 61511, to which we will refer later in this enquiry, defines a hazard very concisely as a ‘potential source of harm.’ By harm we mean, essentially, any consequence that we don’t want to happen.” Mr Kim pauses for a sip of water.

1.2.2 What is harm?

“Mr Kim, thank you for the clarification. Can you give an example of what we mean by harm? Is it the overflow of the tank?”

Mr Kim continues, “Harm is the final, undesired outcome at the end of a chain of events. The various types of unwanted event we should consider can be grouped according to what suffers the ill effects: for example, people, environment or profits. These are known as risk receptors. If I may, I’d like to show some relevant examples of harm, classified by risk receptor, on the screen.”

Mr Kim straightens his tie and continues. “When analysing the harm accruing to risk receptors, an operating company will generally select a small subset of these types of harm; usually not more than 5-6 items are relevant and significant for their specific situation. The selected harm types need to be quantified (for example, in money terms) or classified (in 3-5 severity categories).” (Readers of this book can find more detail in Chapter 5.)

1.2.3 What is risk?

“Mr Kim, thank you, I’m much clearer on hazards and harm now. Another term we have heard from previous witnesses is risk. Now, I understand that risk is the ‘combination of the frequency of occurrence of harm and the severity of that harm’ (according to IEC 61511). Can you explain why risk is an important concept for our enquiry?”

“With pleasure, Madam Chairperson. One of the major advances in safety management in recent decades has been a shift in focus from hazard to risk.
The concept of risk says that we pay more attention to harmful outcomes that are more serious, more likely to occur, or both. This means that we can focus our effort, and expenditure, where it will give the biggest safety return. However, it also means that we need to identify both the frequency and severity of harm that can arise from an incident.” (Chapters 3 and 5 of this book cover these points in detail.)

“That leads us to the question of deciding how safe a facility should be; or, to put it another way, how much risk can be tolerated.”

The chairperson reaches for her microphone. “Indeed, Mr Kim, that is one of the points I want to ask. How is the level of risk tolerance generally determined in the process industry?”

Table 1.1: Typical risk receptors and types of harm considered in the process industry.
Risk receptor People Types of harm that may be considered in functional safety analysis Injury to personnel Illness of personnel Types of harm not typically considered in functional safety analysis Psychological harm such as stress, low morale Injury to visitors onsite Injury to persons offsite Illness of persons offsite (as a direct result of a specific incident)
Surrounding environment (biological effects) Surrounding environment (chemical and physical effects) Harm to significant populations of wildlife, especially in the long term, due to release of substances (e.g. harmful gases, hot water effluent) Short term events with no long term impact, e.g. emergency depressurization venting of hydrocarbons Illness of persons offsite (as an indirect result of release of harmful substances, e.g. contamination of watercourses) Long term impacts that are part of normal operations and addressed in other ways, e.g. CO2 emissions Damage due to release of chemicals (e.g. corrosion or blackening of nearby structures due to acid gases, soot) One-time planned impacts such as plant construction spoiling the view of local residents Physical damage (leading to financial loss, injury or complaints), whether onsite or offsite, e.g. noise, earth tremors from mining or fracking Breach of permit conditions, e.g. excessive flaring
Financial Equipment damage (direct and indirect, e.g. due to fire) Costs associated with idle time (e.g. personnel salaries, lease or depreciation of equipment) Loss of production capacity (may be calculated in gross or net terms) Long term loss of business due to inability to supply customers Loss of materials (e.g. destruction of product inventory, damage to catalyst) Generation of additional waste Cost of rework Consequential losses such as demurrage of ships in port waiting for loading or unloading
Legal Fines and compensation as a result of an incident Cost of defending legal cases Jailing of senior staff
Reputation Adverse publicity Loss of shareholder value Requirement for public notification or evacuations Loss of privilege to operate Withdrawal of operating licenses Loss of public confidence or acceptance Withdrawal of environmental permits

1.2.4 What is tolerable risk?

Mr Kim nods. “Good question, Madam Chairperson. If risks exist in our facility (as they surely will), we must determine whether they are tolerable. At first sight, the idea that any kind of risk can be tolerated is counter-intuitive, and may even seem inhuman if the risk in question could lead to fatalities. However, tolerance of risk is a reasonable and, in fact, entirely necessary part of everyday life.
We took a calculated risk, for example, by taking some form of transport to come here today, determining subconsciously that the benefits of getting here outweigh the risk of an accident.” (For a detailed discussion of the sociopolitical aspects of the tolerable risk concept, see Ref. [1], p. 29ff.)

“So, by determining the amount of risk the facility is willing to tolerate, we are able to make reasoned judgments on questions like these:
• Do the benefits of operating the facility outweigh the risks?
• How well controlled are the risks?
• Can I justify the safety case of the facility to the government and the general public?
• Am I using my safety resources optimally?
• Do I need to add more risk control measures, and if so, how well must they perform?

“Deciding on a level of tolerable risk is a critical aspect of risk control strategy. Arguably, this is more a question of politics than of engineering, as it touches on sensitive questions like the relative tolerability of human fatalities and lost profits. Fortunately, it is rarely necessary for an individual organization to go through a traumatic decision-making process about tolerable risk. There is now widespread consensus on tolerable risk levels, enshrined in local best practice and, in some places, mandated by law.”

1.2.5 Risk management through functional safety

Mr Kim gathers his thoughts and continues. “An operating company needs to perform analysis to determine the current risk levels in their process, and then compare them with the defined tolerable risk levels. If there is a significant gap between the actual and tolerable risk, this may indicate that the risk is not adequately controlled.
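(As a rough numerical illustration of the comparison Mr Kim describes, the sketch below divides an estimated frequency of harm by the tolerable frequency for harm of the same severity. The function name and all the figures are hypothetical, chosen only for illustration; they are not taken from IEC 61511 or from the enquiry.)

```python
def risk_gap(actual_per_year: float, tolerable_per_year: float) -> float:
    """Return how many times the actual frequency of harm exceeds the tolerable one.

    Both arguments are frequencies (events per year) for the SAME severity of harm.
    A result above 1 suggests the risk is not yet adequately controlled.
    """
    if actual_per_year < 0 or tolerable_per_year <= 0:
        raise ValueError("actual must be >= 0 and tolerable must be > 0")
    return actual_per_year / tolerable_per_year


# Hypothetical figures: an overfill leading to a fire is estimated at once in
# 100 years, while the company tolerates such an event once in 10,000 years.
gap = risk_gap(actual_per_year=1e-2, tolerable_per_year=1e-4)
print(f"Frequency of harm must be reduced by a factor of about {gap:.0f}")
```

(A ratio at or below 1 would mean the estimated risk already sits within the tolerable level for that severity of harm.)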
“At this point, the operating company should consider a hierarchy of risk management measures.” Mr Kim’s assistant displays a slide on the screen, showing the following series of questions:
• Have we explored feasible options for reducing the inherent risk, such as substituting less hazardous materials, reducing inventories or improving the segregation of hazards and risk receptors?
• Are the risks already as low as we can reasonably make them? (This is the ALARP concept, which is covered further in Chapter 3.)
• Do we need further risk reduction measures? If so, should they be implemented through:
  • design upgrades, e.g. increase in design pressure;
  • passive protective systems, e.g. relief valves;
  • improved operating/maintenance procedures and training;
  • alarms, with defined response from operational personnel;
  • mitigation systems to reduce the severity of an incident, e.g. fire protection systems;
  • active protection systems, which automatically detect a dangerous condition and act to keep the plant in a safe state?

Mr Kim explains further. “Madam Chairperson, as you can see, a series of protective measures are available. The last of these, active protection systems, belongs to the realm of functional safety: that is, risk control measures implemented through an active Safety Instrumented System or SIS. That’s what I’m here primarily to discuss with your panel today.”

The chairperson writes the words Functional Safety on her jotter and circles them. “Mr Kim, are there any generally accepted standards covering the management of such systems?”

“Indeed there are.
Sound management of functional safety is the objective of the international standards IEC 61508 and IEC 61511, which, with your permission, I’ll introduce to the panel now.”

1.3 Functional safety standards: IEC 61508 and IEC 61511

1.3.1 Purpose of the standards

As Mr Kim explained in the fictional board of enquiry above, functional safety is the task of achieving risk reduction by means of an automatic system, which is designed to respond automatically to prevent an incident or to maintain safe operation. It covers a range of activities: risk analysis, safety system design, construction, commissioning, testing, operation, maintenance, and modification. A management system is put in place to ensure everything is done correctly throughout the project lifecycle.

Achieving functional safety is a complex task, requiring cooperation between numerous parties: design and instrument engineers, equipment designers and vendors, safety consultants, software specialists, and operations and maintenance personnel, to name a few. International standards help to clarify expectations between the various parties, and provide a level playing field throughout the industry and across national boundaries. For this reason, the IEC released the first complete edition of IEC 61508, its framework standard on functional safety, in 2000, with a significantly updated second edition issued in 2010 [2].

IEC 61508 covers the entire spectrum of functional safety in general terms, with particular emphasis on the development of hardware and software for functional safety applications. The intent is that specific industry sectors will develop their own flavours of this standard, couched in terms applicable to their sector, and focusing on the most relevant aspects of functional safety. The resulting sector-specific standards are listed in Table 1.2. Some countries have implemented their own national standards, which are essentially identical to the IEC standards and can be treated as such. An example is ANSI/ISA-61511:2018, the US implementation of IEC 61511:2016.

Table 1.2: Sector-specific functional safety standards.
Standard | Industry sector | Latest year of issue as of 2021 | Major focus
IEC 61508 | General | 2010 | Hardware and software design, risk analysis
IEC 61511 | Process | 2016 | Risk analysis, SIS design, SRS, FSA (note a)
IEC 61513 | Nuclear power | 2011 | I&C architecture using hardwired and/or computer-based systems
IEC 62061 | Machinery | 2021 | Design, integration and validation of safety-related control systems
ISO 26262 | Automotive | 2018 | Development cycle
IEC 62279 | Rail | 2015 | Software for railway control and protection
ISO 13849 | Machinery | Part 1: 2015, Part 2: 2012 | Design and validation. All safety technologies, not just E/E/PE
IEC 62304 | Medical devices | 2006 + 2015 amendment | Software development
EN 50129 | Rail | 2018 | Hardware and software, design and implementation
ISO 25119 | Machinery for agriculture and forestry | 2018-19 + 2020 amendments | Safety lifecycle
Note a: Refer to the abbreviations list at the beginning of this book.

1.3.2 Scope of IEC 61511

The standard for the process industry sector covers electrical, electronic and programmable electronic (often abbreviated to E/E/PE) safety equipment. Purely mechanical and/or pneumatic systems are, strictly speaking, outside the scope of IEC 61511, but the principles in the standard are often useful in managing such systems. Also out of scope are conventional process control systems (e.g. PCS, DCS, BPCS) unless they are required to play a part in high-integrity risk control measures (which, usually, they should not).

In practice, the standard is normally applied to Safety Instrumented Systems (SISs) implemented using:
• a safety-rated PLC; or
• safety relay logic.
IEC 61511 is intended to protect specific risk receptors: only “protection of personnel, protection of the general public or protection of the environment” are explicitly within its scope. However, it can be (and often is) applied to other risk receptors, as listed in Table 1.1.

1.3.3 Why comply with IEC 61511?

One of the most significant features of the IEC series of functional safety standards is that they are mostly performance-based, rather than prescriptive. That means they expect entities to set their own safety targets, meet those targets, and demonstrate that the targets are met, without specifying the way in which this is achieved. Older prescriptive standards laid down rules constraining some quite specific aspects of design such as how many redundant items of hardware were required, irrespective of the actual safety performance achieved thereby.

The advantages of the performance-based approach translate directly into benefits for the end user:
• Solutions to risk management problems can be tailored to suit specific situations.
• This often results in better safety performance at less cost.
• Analytical methods can be selected to provide the optimal balance between analysis costs and design costs (this point is covered further in Chapter 5).
• Local and best practice can evolve over time, taking advantage of experience gained in real-world applications.

Compliance with IEC 61511 is not mandatory under law, but is widely regarded as representing best practice. As such, stakeholders such as end users, insurers and holding companies regard IEC 61511 compliance as evidence of “all reasonable measures” being taken to protect health and safety and avoid losses [3].

1.4 IEC 61511 key concepts

1.4.1 The functional safety lifecycle

Developing and implementing a Safety Instrumented System (SIS) is a stepwise process. First, we must identify the hazards within the scope of the project, and determine the risks they generate.
Next, risk reduction measures must be developed, and assessed to ensure they are adequate. If the risk reduction measures require a SIS, we design the SIS and check the design meets the risk reduction needs. Then the SIS is installed and commissioned. During its operational lifetime, it may need to be reassessed and modified according to changing circumstances. Eventually, parts of the SIS will be decommissioned, and we must make sure this does not compromise the safety performance of the remaining systems.

Successful execution of each step requires completion of all previous steps. Thus, the standard requires a plan, detailing the steps required, the actions to be performed in each one, and how the sequence as a whole will be executed and managed.

The steps of the lifecycle are known as phases. For convenience, in this book we will sometimes group phases together into 3 periods: the analysis period, the design period and the operational period. Fig. 1.1 shows the periods of the lifecycle, and Fig. 1.2 shows the lifecycle phases typically included in each period.

Figure 1.1 Main periods of the functional safety lifecycle.

Figure 1.2 Phases included in each main period of the functional safety lifecycle.

Earlier, we noted that a key aspect of the standard is to demonstrate that safety performance targets are met. To do this, we must measure performance and compare with the goals that were set. If targets are not achieved, we should return to earlier steps and revise the work that was done. This means that looping back within the sequence of steps is an intrinsic part of managing functional safety. For this reason, the steps are arranged in a functional safety lifecycle.

A recommended scheme for a safety lifecycle is set out in IEC 61511 (and a slightly different version in its parent standard, IEC 61508).
In keeping with its performance-based philosophy, the standard does not compel us to use its recommended lifecycle; we are free to substitute one of our own, as long as it achieves all the same objectives. However, in practice, the lifecycle model set out in the standards is almost universally adopted, as it is clear, comprehensive and intuitive.

Another important reason for adopting a cyclic, rather than linear, approach to safety design is that operational needs change over time. Processes may need to be altered for a number of reasons, such as:
• Changing process parameters as operational experience is gained (e.g. optimization of yield or manpower utilization, maintenance problems, avoidance of unnecessary tripping)
• Obsolescence or deterioration of equipment
• Adoption of new technology
• Changing product profile to match customer demands
• Changes in aspects of plant management, such as equipment utilization or manning
• Changes in environmental protection requirements

Any process change should prompt a return to early phases of the safety lifecycle, so that the impact on the demands and performance of the SIS can be assessed. We’ll come back to this topic in greater detail in Chapter 11.

1.4.2 Intrinsically safer design

In a typical functional safety project, the hazard identification and risk analysis phases start with a substantially frozen design already embodied in P&IDs and equipment data sheets. However, this tends to squeeze out the opportunity to apply ‘intrinsically safer’ design principles: the concept that it is generally better to eliminate (or at least reduce) hazards in the design, rather than managing the risks generated by those hazards. During early-stage hazard identification studies such as HAZID and HAZOP, the analysis team should be given the chance to question whether hazards could be better managed by design changes rather than relying on layers of protection.
Examples of intrinsically safer design principles include:
• Replacing a hazardous material with a less hazardous one
• Reducing the inventory of hazardous materials
• Applying less hazardous operating conditions (e.g. lower temperatures and pressures)
• Increasing the design pressure of piping and equipment, so that upset conditions are less likely to lead to a loss of containment
• Reducing the opportunity for human errors, e.g. eliminating hose changeovers between items of equipment in a batch process

1.4.3 The safety requirements specification (SRS)

This crucial component of functional safety is a document (or set of documents) spelling out exactly what the SIS must do. It lists the design intent of the SIS, every detail of its design specification, and a slew of information needed during the operational phase, such as maintenance requirements.

The SRS is first drafted when the need for the SIS is identified; this takes place immediately after risk analysis is completed. Then, after the full design details of the SIS have been elaborated, the SRS is updated to contain all the information necessary for complete execution of the lifecycle.

The purpose of having a centralised document of this type is to provide a single point of reference for all parties responsible for each phase of the lifecycle. Since the people involved are likely to be spread across many departments and organizations, it is critical to have an unambiguous definition of the SIS’s function and operation. Indeed, some of them will be performing their duties many years after the SIS is commissioned.

Another important function of the SRS is to provide a benchmark, against which the SIS itself can be validated, and its performance assessed. This allows reviewers to confirm or revise assumptions made during the safety analysis and SIS design periods. Extensive coverage of the SRS is provided in Chapter 7.
1.4.4 Assuring that functional safety is achieved

A key aspect of the standard is that we must demonstrate successful control of risk. There are two main aspects to this:
• minimizing the scope for undetected human error, and
• ensuring that each phase of the lifecycle has been completed competently.

The standard identifies four separate activities for assuring this has been achieved, as outlined briefly in Table 1.3. This is one of the more challenging areas of functional safety, and often causes confusion. Areas of misunderstanding typically include:
• The differences between the various activities
• What is involved in each activity
• When they should be performed, and how often
• Whether they can be delegated or outsourced to consultants
• Whether the activities need to be undertaken by independent parties

We’ll cover these topics in detail in Chapter 10.

Table 1.3: Activities for assuring that functional safety is achieved.
Activity | Brief description
Verification | The inputs and outputs required for each lifecycle phase should be defined. Verification involves confirming that the required output has been generated.
Validation | During the analysis and design periods of the lifecycle, a document known as the safety requirements specification (SRS) is generated. Validation confirms that the commissioned SIS (including hardware, software and operating and maintenance procedures) meets the stipulations of the SRS.
Functional Safety Assessment (FSA) | FSA is a wide-ranging assessment of how effectively the functional safety lifecycle is followed. It can be executed at up to five stages of the lifecycle, although it is compulsory at only one stage: between commissioning and process startup.
Audit | A review of evidence to demonstrate compliance with site-specific procedures relating to functional safety.
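As a minimal sketch of the verification activity described in Table 1.3, the check below confirms whether every required output of a lifecycle phase has been generated. The phase names and document names are hypothetical placeholders invented for this illustration; they are not a list taken from IEC 61511.

```python
# Hypothetical required outputs per lifecycle phase (illustrative only).
REQUIRED_OUTPUTS = {
    "risk analysis": {"hazard study report", "risk assessment report"},
    "SIS design": {"safety requirements specification", "design calculations"},
}


def missing_outputs(phase: str, produced: set[str]) -> set[str]:
    """Return the required outputs of `phase` that have not yet been produced."""
    return REQUIRED_OUTPUTS[phase] - produced


# Only one of the two hypothetical risk analysis outputs exists so far.
produced_so_far = {"hazard study report"}
gaps = missing_outputs("risk analysis", produced_so_far)
print(sorted(gaps))  # anything listed here blocks verification of the phase
```

An empty result would mean that, under these assumed definitions, the phase has generated everything it was required to.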
1.4.5 Random and systematic failures

The safety lifecycle approach recognises that there are two fundamentally different ways in which the SIS can fail to perform its intended function. These are known as random and systematic failures. Because this concept underpins every aspect of the safety lifecycle, a clear understanding of failure types is crucial.

Random failures are hardware failures. Every item of equipment has a finite lifetime, during which some component within the equipment may break due to natural wear-and-tear processes caused by fatigue. This is true even if the equipment is installed correctly, operated within specification, and maintained properly.

Random failures can never be eliminated entirely, but they can be handled mathematically. Although it is impossible to predict when any individual item of equipment will fail, we can know a great deal about typical failure behaviour, given data from a large enough population of equipment in service. For example, we can determine the item’s useful lifetime, and the probability that it will fail during a given period of time. This information is essential during risk analysis, because it allows us to calculate the extent of risk reduction that a particular design of SIS can be expected to provide, and hence whether it is sufficient to meet the tolerable risk target (as we discussed in Section 3.3).

Systematic failures are device failures ultimately caused by human errors.
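By contrast, the random failures described earlier lend themselves to simple calculation. As a minimal sketch, the function below estimates the probability that an item fails within a given period. It assumes a constant failure rate during the item's useful lifetime, which is a common simplifying assumption but is not stated in the text above, and the failure rate used is a made-up figure.

```python
import math


def probability_of_failure(failure_rate_per_year: float, years: float) -> float:
    """Probability that an item fails at least once within `years`.

    Assumes a constant failure rate (an exponential lifetime model).
    """
    return 1.0 - math.exp(-failure_rate_per_year * years)


# Hypothetical level sensor with a failure rate of 0.02 failures per year.
p = probability_of_failure(0.02, years=1.0)
print(f"Chance of failure within one year: {p:.2%}")
```

No comparable formula is available for systematic failures, for the reasons set out in the rest of this section.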
The lifecycle presents numerous possibilities for human errors to occur; a few examples are:
• Incorrect risk analysis (failing to identify hazards, underestimating risks)
• Administrative errors (working from out-of-date versions of documents, incorrect drafting of documents, miscommunication)
• Incorrect design of SIS
• Software bugs
• Incorrect installation of SIS
• Failure to maintain equipment, or errors during maintenance (such as failing to remove overrides after completing the maintenance procedure)

While some of these are under the direct control of the process plant owner or design and construction contractor, others are not. For example, a safety equipment manufacturer may make a design error, which could lie hidden for many months or years until a particular combination of circumstances brings it to light. When the error is finally revealed, severe consequences could occur without warning; for example, an emergency trip may fail to operate on demand, leading to a fire or explosion.

Unlike random failures, systematic failures cannot currently be mathematically modelled. Since it is impossible to test every combination of circumstances and events that could ever arise, we can never know for sure whether errors exist in our SIS, how many, or how serious they are. Statistical treatment is of little value, since error rate data collected in one environment is unlikely to be applicable to another.

The only practical way to address systematic failures is to minimise them. The two main ways of doing this are:
• Reduce the number of errors made in the first place, for example by ensuring individuals are competent, providing clear requirements and procedures, and reducing the number of opportunities for error (fewer and simpler operations); and
• Provide opportunities to detect errors, for example by verification and review, and by recording and investigating every unexpected incident involving the SIS.
For this reason, IEC 61511 places great emphasis on software development techniques, management procedures, cross-checking of work completed (as discussed in Section 7.3) and competency of individual safety practitioners. Practical ways of addressing systematic failures are listed in Tables 1.4 and 1.5, while Table 1.6 and Fig. 1.3 suggest ways to distinguish between random and systematic failures.

Table 1.4: Practical methods for reducing errors that can cause systematic failures.

| Type of method | Practical steps involved | Chapter in this book |
|---|---|---|
| Ensure competency | Define the competency level required for each lifecycle task, including qualifications, experience and knowledge | 1, 6 |
| | Assign individuals to tasks for which they are competent | |
| | Encourage individuals to query any information or instructions they do not understand or agree with | |
| Information availability | Ensure resources are available, e.g. access to up-to-date versions of standards and codes of practice | 6 |
| | Provide and implement a document control system, to ensure everyone works from the latest version of each document. (This is often part of an ISO 9000 quality management system.) | |
| | Use the SRS and other key lifecycle documents as the sole means of transferring information between individuals | |
| | Use adequate labelling (of equipment and wiring) and commenting (of software code) | |
| | Ensure procedures and manuals are available and fit for purpose: clear, unambiguous, complete, and provided in the local language. | |
| Simplification | Do not use equipment with more features than actually required | 9 |
| | Make unneeded features (especially software features) unavailable | |
| | Use passwords and other means of access control to limit the number of individuals that can change things (such as documents, wiring and software settings) | |
| | Use restrictive languages for the application program | |
| | Avoid unnecessary diversity. Use the same brand or type of equipment and software for all similar applications where practical.ᵃ | |
| Familiarity | Avoid unnecessary novelty. Use well-established and familiar equipment, procedures and methods | 9 |
| Suitability | Use equipment and software only for its intended function. Pay attention to any restrictions listed in the equipment’s Safety Manual. | 8 |
| | Use SIL-certified equipment and validated tools (software development tools, analytical software, test equipment). Alternatively, use equipment with a good, documented track record of prior use (see Chapter 9 for detailed coverage) | |

ᵃ However, this can conflict with avoidance of common cause failures. See Chapter 8 for further discussion.

Table 1.5: Practical methods for detecting errors that can cause systematic failures.

| Type of method | Practical steps involved | Chapter in this book |
|---|---|---|
| Review | Follow a properly designated review procedure, especially for software development. Ensure an adequate degree of independence between the executing engineer and the reviewer. Record deviations and errors found, not for disciplinary purposes but to allow an assessment of whether systematic failures are properly under control. | 10 |
| | Compare the expected and actual performance of the SIS, especially in terms of trip rate (real trips and spurious trips). If the actual trip rate is much higher than expected (based on random failure rate calculations), it indicates the presence of systematic failures in the design and/or implementation of the SIS. | 11 |
| Investigation | Record and investigate all incidents of unexpected SIS behaviour, especially unwanted (spurious) trips, diagnostic alarms, test failures, issues found during maintenance, and events when the SIS is found to be in an abnormal state (e.g. unauthorised bypasses, parameters changed). Most of these will indicate the presence of some kind of systematic failure. | 11 |
| Maintenance | When maintaining the SIS, always inspect and test it before carrying out any maintenance works such as cleaning and repair. Record the ‘as-found’ condition of the SIS, since this more accurately represents the ‘real’ state of the SIS during the majority of its working lifetime. Investigate the root cause of issues such as loose connections, corrosion or other physical damage, unauthorised or unexpected alterations from design (compare back with the SRS), and any other finding that could compromise the functioning of the SIS. | 11 |

Table 1.6: Guidelines for classifying failures as random or systematic.

| Random failures | Systematic failures |
|---|---|
| Unconnected to any specific causal event | May be associated with a design error |
| Occurs within the design envelope of the SIS | May be associated with exceeding the design envelope of the SIS |
| Not attributable to a specific design or operating error | Attributable to a specific root cause |
| | May be avoidable by a design change |
| | May be controlled by improved training and procedures |

Figure 1.3 Decision flow diagram: classifying a failure as random or systematic.

Why is this type of failure known as systematic? The term arises from the idea that the underlying error will systematically lead to a failure when a given set of conditions arise, step by step, with essentially 100% probability. For example, if there is a division-by-zero error in a line of computer code that runs as part of a housekeeping procedure once a month, the program will crash on a specific date. Unfortunately, the term systematic is prone to confusion, because it can also refer directly to a failure in a system (e.g. management system). The terms deterministic, causative or induced would be preferable.

1.4.6 Competency
Competency is a core concept of IEC 61511.
It requires us to
• determine the level of competency required to perform each safety lifecycle task, and
• assign individuals only to tasks for which they meet the competency requirements.
The competency requirements should be defined in terms of qualifications, general experience, directly relevant experience, background knowledge (e.g. of functional safety concepts and relevant regulations and codes of practice), and specific knowledge (of the process, equipment and procedures concerned). All this should be documented, to provide an audit trail for verifying that systematic failure controls are effective. Fig. 1.4 shows the core aspects of competency required by IEC 61511.

Figure 1.4 IEC 61511 competency requirements.

The standard requires only that each person is competent for the tasks they are performing. It is not necessary for every engineer in the project to have a full in-depth knowledge of every aspect of the SIS. The standard does not make any specific stipulation about what competence actually means in practice: that is up to each individual organization to decide.
One of the reasons for placing so much emphasis on assuring competency is that a great many serious incidents in the past have been traced back to competency failures. One chilling example relates to the collapse of a coal mine spoil heap at Aberfan, Wales, in 1966. According to Trevor Kletz, “responsibility for the siting, management, and inspection of tips was given to mechanical rather than civil engineers. The mechanical engineers were unaware that tips on sloping ground above streams can slide and have often done so.” [4] The official report of the board of inquiry described the Aberfan Disaster as
“a terrifying tale of bungling ineptitude by many men charged with tasks for which they were totally unfitted, of failure to heed clear warnings, and of total lack of direction from above. Not villains but decent men, led astray by foolishness or by ignorance or by both in combination, are responsible for what happened at Aberfan” [4].
The result was 144 fatalities, 116 of whom were children in a nearby junior school.
Another reason for requiring evidence of competency is that many lifecycle activities are heavily outsourced. An end user will typically delegate a bundle of safety engineering activities to an EPC (engineering, procurement and construction) contractor. The EPC will, in turn, purchase SIS components from manufacturers, and hire consultants to help with safety analysis and verification activities. In each case, responsibility for ensuring competency is effectively being transferred from one entity to another, further and further from the final end user. Unless there are clear definitions of what constitutes competency or how it is controlled, the end user (who is ultimately responsible for safety) has no way of assuring the effectiveness of the safety products and services provided.
Chapter 7 explains how competency management can be achieved in practice.

1.5 The structure of IEC 61511
The IEC 61511 standard itself does not make easy reading, especially if English is not your mother tongue. It is, thankfully, more digestible than its parent standard, IEC 61508, whose readability suffers from having to be comprehensive and cover every kind of industry and situation. It is probably unnecessary for the individual safety engineer to read the standard from cover to cover; however, each user should at least understand the Safety Lifecycle, the documentation and verification requirements, and the aspects of the standard applicable to their own responsibilities. It may be helpful, then, for us to take a quick tour of the standard here.
First, the standard is in three parts. Part 1 is the core of the standard and addresses the whole lifecycle, explaining the purpose and requirements for each phase.
The phases covered in the greatest detail are SIS design and software development. It also contains a brief discussion of management and documentation issues. Importantly, it includes a substantial glossary of abbreviations and definitions. Unlike the similar glossary in Part 4 of IEC 61508, it has the advantage of being arranged, for the most part, in alphabetical order.
Part 2 is a series of Annexes. Annex A contains clause-by-clause guidance on many of the clauses in Part 1, although the guidance is of limited practical value for most users. Annex F is a worked example of the entire functional safety lifecycle. The remaining annexes cover special topics, mainly around application program development.
Part 3 focuses mainly on risk analysis methods. It can be treated as a textbook of background knowledge required for the risk analysis phase of the safety lifecycle.
For a first-time reader, it would be most helpful to focus on Part 1, clauses 1 to 7 and 19; the clauses and annexes of Parts 1 and 2 most relevant to your own role; and, if you are involved in risk analysis, the relevant clauses and annexes of Part 3.

1.6 The origins of IEC 61511
One important characteristic of IEC 61511 and its parent standard, IEC 61508, is that they place a strong emphasis on developing reliable software. The need for a focus on software reliability became apparent during the 1980s, as increasingly sophisticated control hardware became available. While it was easy to write elaborate software to provide safety functions, it proved extremely difficult to prove the software was reliable. The difficulty lay in two separate aspects: getting the specification right, and writing applications that met the specification under all possible conditions.
At the same time as these software difficulties were becoming obvious, hardware was rapidly advancing in complexity, to such an extent that it became impossible to demonstrate hardware integrity by testing alone. Without being able to demonstrate safety in both hardware and software of instrumented safety systems, end users could not have confidence that major hazards were adequately controlled. This problem was further compounded by the ever-growing trend towards automated plants managed remotely by a small number of operators in a control room. Since the operational staff were increasingly dependent on self-contained trip systems to manage major upset conditions, the importance of confirming the dependability of those systems was clear.
The response of the International Electrotechnical Commission (IEC), an independent body based in Geneva with member committees representing the interests of 89 countries plus 84 affiliate members, was to set up separate groups to study the issue for hardware and software. The aim was that each group would develop a standard to assist developers and end users in claiming safety capability in their respective applications. The studies were merged in the early 1990s, giving birth eventually to an umbrella standard IEC 61508 that covered both hardware and software integrity in detail.
The merging of the two aspects of functional safety in a single standard was a recognition that many of the issues are the same: overall safety management, competency, the lifecycle approach, and configuration management are just a few of the aspects pertinent to both. The major differences between hardware and software lie in the methods used to achieve and demonstrate integrity; this is reflected in the two separate parts (Part 2 and Part 3, respectively) that IEC 61508 dedicates to them. IEC 61511 was then developed as a specialization of IEC 61508 for the process industry, as we described earlier in Section 1.3.1.

Exercises
1. Consider the fictional incident in Section 1.1. Select two of the equipment failures described. Are they likely to be random or systematic failures?
2. Traditionally, a cable trailed across the floor in an office environment has been regarded as a “hazard.” How does this fit with the concepts of “hazard” discussed in this chapter?
3. Look up the definition of one of this chapter’s safety management concepts (such as hazard, risk, harm and risk receptor) in Wikipedia. Given that Wikipedia aims at a broad, non-specialist readership, how does its discussion compare with the one here? What does this say about society’s attitude to safety?
4. Classify the following failures as random or systematic, according to the discussion in Section 1.4.5. For each failure, describe how it should best be addressed (to minimize the chance of it causing harm).
(a) A shutdown valve sticks open when required to close. The valve is suitably designed for its operating environment and process fluid, and is within its usable life.
(b) Same case as (a) except the valve missed its last proof test.
(c) Same case as (a) except the valve has exceeded its useful life.
(d) A pressure transmitter has worked loose on its mountings. As a result, it is vibrating severely. It fails due to a crack in the PCB (printed circuit board).
(e) A software bug causes a safety function in a SIS to receive an incorrect “override” signal. As a result, the safety function does not trip when required.
(f) A manual “override” key switch is faulty and overrides a safety function incorrectly. As a result, the safety function does not trip when required.
(g) A forklift truck strikes an instrument air line. As a result, air supply to a shutdown valve is lost, and the valve closes spuriously.

Answers
Question 1: Answer
The failure of the primary level measurement in the solvent storage tank, and the false vapour alarms triggered by the wind, could be random failures.
All the other failures mentioned are systematic, as they are associated with specific errors in design, operations or maintenance. The high level trip in the solvent tank is not really a failure, as it is not faulty, but bypassed. However, the root cause (poor management of bypasses) could be addressed by the same type of solution as systematic failures, so it could usefully be categorized as a systematic failure.

Question 2: Answer
Categorizing a trailing cable as a hazard is a simple concept, but it rather unhelpfully focuses attention on the cable itself, leading us towards imperfect risk management solutions such as “tape the cable to the floor” or “put the cable under the carpet”. If we trace the causal chain back to the “equipment with potential to cause harm”, it helps us address our attention to the copying machine attached to the cable. Relocating the machine could be a better solution, and might also draw our attention to other related issues such as noise, dust and ozone emanating from the machine. In other words, treating the machine as the hazard may yield a more effective analysis of the risk. (This can also help turn our attention towards intrinsically safer risk management solutions.)
Another possible approach is to look for the ‘reservoir of energy’ with the potential to cause harm. The injury resulting from tripping over the cable derives from the potential energy of the person’s body. Again, this helps us change our focus to the real issue: the problem is not the cable, but the person having to cross the cable. Can we find a means to separate people from cables? If so, we have removed one route by which the stored energy can cause harm.

Question 4: Answer
Not all practitioners agree on the boundaries between random and systematic failures, so your answers may differ from those suggested here.
(a) Random
(b) Random. The fault is not related to whether the valve was tested, provided the valve is still within its useful life.
(c) Systematic. The valve should not be used beyond its useful life. It is very likely to fail as a result of using it beyond its useful life.
(d) As the transmitter is probably not designed for a high vibration environment, this would likely count as a systematic failure.
(e) Systematic. All failures arising from software bugs are systematic failures.
(f) Random.
(g) Systematic. However, this does not count as a dangerous failure, because it causes a spurious trip. So there is little purpose in classifying it as a random or systematic failure. (See Chapter 2 for discussion of the term “dangerous failure.”)

References
[1] E. Marszal, E. Scharpf, Safety Integrity Level Selection, ISA, Research Triangle Park, 2002.
[2] Anon, An Introduction to Functional Safety and IEC 61508. Application Note AN9025, MTL Instruments Group, 2002. http://www.mtl-inst.com/images/uploads/datasheets/App_Notes/AN9025.pdf (retrieved on 8 January 2022). An admirably clear and readable starting point for readers who are unfamiliar with the area of functional safety.
[3] P. Clarke, Setting the Standard, Control Engineering Asia, May 2011, pp. 12–18. Contact the author for a copy via www.xsericon.world. Focuses on the benefits of compliance for SIS designers and end users.
[4] E. Davies, Report of the Tribunal Appointed to Inquire into the Disaster at Aberfan on October 21st 1966, HL 316, HC 553, HMSO, 1967. https://www.dmm.org.uk/ukreport/553-04.htm (retrieved on 6 January 2022).

Further reading
The following resources provide wide-ranging coverage of the functional safety lifecycle.
[1] Center for Chemical Process Safety (CCPS), Guidelines for Safe and Reliable Instrumented Protective Systems, Wiley, New York, 2011 (Chapter 1 provides an especially lucid introduction to the role of functional safety in risk management).
[2] P. Gruhn, S. Lucchini, Safety Instrumented Systems: A Life-Cycle Approach, ISA, Research Triangle Park, 2019 (Detailed and extensive coverage of SIS design and implementation, for process applications. Especially detailed on hardware and software design and SIS validation).
[3] K. Kirkcaldy, Exercises in Process Safety, Available from: Amazon, Self-published, Milton Keynes, 2016.
[4] K. Kirkcaldy, D. Chauhan, Functional Safety in the Process Industry: A Handbook of Practical Guidance in the Application of IEC 61511 and ANSI/ISA-84, Available from: Amazon, Self-published, Milton Keynes, 2012 (See especially chapters 13–19).
[5] SINTEF, Guidelines for the Application of IEC 61508 and IEC 61511 in the Petroleum Activities on the Continental Shelf (Guideline GL 070), Offshore Norge, Stavanger, 2018 (Concise coverage of many aspects of functional safety for the oil & gas industry).
[6] D. Smith, K. Simpson, Safety Critical Systems Handbook: A Straightforward Guide to Functional Safety, IEC 61508 (2010 Edition) and Related Standards, Including Process IEC 61511, Machinery IEC 62061 and ISO 13849, third ed., Butterworth-Heinemann, Oxford, 2011 (Chapter 4 on software is particularly useful).