Taking Safe Decisions Worked Example
Final drive failures

Summary: This worked example describes how risk assessment and monitoring can be used together to support the iterative design and implementation of risk reduction improvements, considering both business and safety factors. This includes the replacement of existing controls where they are assessed to be no longer required and the implementation of new risk controls for business reasons. The worked example also illustrates some of the challenges of decision making processes involving a variety of stakeholders.

Key learning points: This worked example illustrates:
• Decision making can be an iterative process, with risk controls implemented, updated and removed in response to monitoring and other sources of new information or changing circumstances.
• Decision making using competent professional judgement.
• Circumstances where a risk control may be introduced for business reasons, such as reputation, even though analysis shows that the control is not required to meet legal obligations.
• A large number of parties can make decision making a lengthy process, so this should be factored into timescales.
• Lessons learned when making a change can be transferable to the wider business.

1. Origin of review
Historically, monitoring of the final drives on certain types of rolling stock has indicated a range of issues which can cause their input bearings to heat up rapidly, leading to failure of the input shaft.

A Railway Undertaking (RU) had experienced a series of final drive failure incidents, the most notable of which led to the detachment of the drive shaft. The RU was concerned because the incident had not only damaged the train and caused minor injuries to a customer on the station platform, who was struck by ballast thrown up by the drive shaft when it detached from the train, but also presented a derailment risk.
[Figure 2: Scoping. The figure rates the nature of the decision on six scales: risk owner (owned by one organisation to shared by many organisations), worst credible case consequences (insignificant to multiple fatalities), operational experience (extensive to none), technology (mature to novel), complexity (very simple to highly complex) and ability to monitor and act post change (can identify problems and resolve quickly to difficult to monitor and/or intervene). Decisions towards the upper end of these scales are more likely to be categorised as significant, and the approach for making the decision then involves more senior level decision taking, more consultation, more extensive and detailed analysis, and more time to agree and implement the decision.]

www.rssb.co.uk

The independent investigation into this final drive failure focussed on early life issues and led the RU to introduce several new risk controls, including additional monitoring of new drives. A subsequent failure, however, was found to be caused by a different issue: the reliability of the final drive oil pump. Additional checks were therefore also introduced for pump operation. A further failure, not related to early life or oil pump malfunction, led the RU to conclude that there were a number of potential failure modes. As a result, the RU conducted a fleet-wide check to identify the possible options for risk controls, and decided to conduct a risk assessment and review of the risk controls in place to try to prevent a repeat of this type of failure.

2. Analysing and selecting options
The RU proceeds with the risk assessment and review of the risk controls. Although the decision on which option to implement is taken within the RU, it is supported by wider industry consultation at a national level, including the Office of Rail Regulation (ORR), the rolling stock operating companies (ROSCOs) and suppliers. There is already considerable operational experience of the technology involved in final drives, but the understanding of their in-service behaviours across the industry is initially found to be weak.
This understanding is quickly strengthened by review of the incidents previously noted. The conclusions from these investigations provide a broader knowledge base for the RU and lead to the introduction of further short term risk controls, together with revision of existing controls, over a period of months.

Table 1: Summary of the business case options

Option 1: Do nothing and continue with mitigations already introduced in response to the failures.
Costs and benefits: Discounted. The mitigations introduced in response to the failures previously noted include a checking regime, an oil pump priming check and oil sampling. These are not considered sustainable as they are highly labour intensive and are expected to become less effective as staff become less vigilant with the checks over time. The business case cost benefit analysis suggests that their gross cost is substantially higher than their safety benefit.

Option 2: Remove mitigations already introduced in response to the failures.
Costs and benefits: Discounted. In light of the failures, removal of risk controls without providing alternatives is not considered to be a viable option.

Option 3: Fit the modification and continue with mitigations already introduced in response to the failures.
Costs and benefits: Discounted. It is assessed that there is no cost benefit case for this option due to the intensive nature of the mitigations already introduced in response to the failures.

Option 4: Fit the modification and remove all mitigations already introduced in response to the failures.
Costs and benefits: Discounted. The business case cost benefit analysis indicates that there is a case for fitting the modification, but this option is not considered to give sufficient assurance around failures due to pump priming.

Option 5: Fit the modification and remove the mitigations already introduced in response to the failures, except oil pump priming and the use of oil sampling.
Costs and benefits: Selected. The business case cost benefit analysis indicates that there is a case for fitting the modification for this option. Oil pump priming and oil sampling are selected as mitigation measures to retain as they are less labour intensive than the other measures. The modification does not monitor pump priming, so the priming check provides additional value, and oil sampling provides a backup for the modification until it is shown to be working as intended. The deciding factor in selecting this option is the potentially severe reputational risk should another failure occur when no further mitigations had been put in place.

The preferred long term solution is initially to fit alternative drives, but the RU estimates that this will take 12-18 months. These timescales are considered prohibitive and the option is not pursued further at this point.

The fleet-wide check identifies fitting a Temperature Intervention Modification (referred to as ‘the modification’) as a possible method for detecting failures from a range of causes. The modification is not intended to resolve design issues or issues of drive performance, but is configured to help identify the symptoms of catastrophic failure of the drive and potential drive shaft detachment (the events leading to the failure of the final drives can occur rapidly). The modification is designed to detect a rise in temperature and alert the driver to bring the train to a stop. If the warning is ignored, the modification is designed to shut the engine down well before the temperature reaches the range for likely catastrophic failure. The RU therefore proceeds to assess the use of this modification as an element of possible options for reducing the risk from catastrophic final drive failure.

The possible combinations of risk controls that could be applied as a longer term solution are developed by a team of operational and technical experts into five options to evaluate for the business case (Table 1) used to support the final overall design of the risk controls.
These focus on whether or not to fit the modification and which other risk controls should be retained.

Safety considerations are the starting point for assessing the business case options, but commercial and business risks also play a significant role in the decision making. The business case cost-benefit analysis (CBA) initially indicates that there is justification for fitting the modification in either option 4 or option 5, based on the avoidance of costs of future incidents. Also, the checking regime introduced in response to previous failures is very labour intensive, which makes it easier to justify fitting the modification instead.

Given the number of incidents, it is considered plausible that in the event of a future incident there may be a call from shareholders to remove the affected fleet, which accounts for half of the RU’s rolling stock, to allay fears of further failures. The RU determines that the loss of revenue, compensation payable for cancellation of services and the additional replacement bus costs for just one day would pay for the modification, even accounting for the savings on fuel and materials. Also, the repair costs for a train failure with the drive shaft detached are similar to the cost of installing the modification. The RU therefore determines that preventing a single further incident would also justify funding the modification.

Overall, potential reputational damage becomes the deciding factor for the RU to fit the modification. Put simply, it is concluded that the risk of not fitting the modification and a further incident then occurring is unacceptable:
• Another incident would have a small potential to cause injury or death to customers, which would bring severe reputational damage and divert extensive resources to any associated investigation, with the potential also for prosecution.
• Speed restrictions could be placed on all affected vehicles following another incident, impacting performance, causing delays and leading to the payment of penalties to the Infrastructure Manager (IM).
• There could be pressure from stakeholders requesting that the affected vehicles are withdrawn because they are seen to be unsafe and, in the event of an incident, there could also be pressure from ORR to remedy the failure, which may result in the fleet being withdrawn from service.

On the balance of the cost benefit analysis presented in the business case, and consideration of the potential reputational risks, it is decided that the modification should be fitted and option 5 is finally selected.

3. Making a change
The selected option is assessed using the company’s internal risk assessment methodology, with a combination of quantitative and qualitative risk estimation techniques, to identify and classify hazards, evaluate whether the risk is acceptable and identify any further controls that should be applied to reduce the risk. The RU also holds discussions with the relevant ROSCOs and suppliers to identify additional hazards and controls. Once the business case is approved, the modification is implemented fleet-wide.

The risk assessment process also identifies an updated final drive overhaul regime as a risk control to address the root cause of the failures, alongside the fitment of the modification, which detects the symptom of the problem. The new specification for the overhaul regime, conducted roughly every 2 years, is subsequently reviewed and strengthened in light of the information from investigation of 3 incidents.

A cross-industry stakeholder working group is formed to contribute to these reviews. It includes the ROSCOs and suppliers, to make use of their technical expertise and gain cooperation for making the change at an early stage. Other RUs who are using the same final drive are also included to draw on a wider range of operational experience.
Other key players in the industry, such as the IM and ORR, are kept informed of progress but are not an active part of the group.

The RU involves participants from a variety of groups to provide a range of expertise. Because coordination between the groups makes the decision making process lengthy, the RU factors this into the planned timescales. The RU includes time to manage the supply chain, as each ROSCO holds different views regarding the long term strategy for specification of the final drive, so the short term changes do not fully align with the long term strategies of all the involved parties. Engaging the other RUs who use the same type of final drive also takes time. As they have not had similar incidents, the review is not a high priority for them and, despite participating in the process, they eventually decide not to fit the modification, as it is primarily a business decision for the RU and not a legal requirement.

4. Monitoring safety
Subsequent monitoring leads the RU to modify the actions taken on the basis of the risk assessment. The RU removes oil sampling from the mitigation measures after sufficient monitoring shows that the modification is working as intended. Whilst the modification picks up failures rapidly, oil sampling has a five day turnaround, which is inconsistent with the speed at which defects can escalate to a catastrophic loss of integrity.

The modification picks up a potential incident shortly after implementation, thus justifying its cost. The modification also effectively monitors the performance of the final drives and so is beneficial for monitoring the effectiveness of the new final drive overhaul regime.
This indicates an improvement in performance and system reliability: leading up to the incident involving the detached drive shaft there were 7 failures over 10 months; following the changes there are no incidents for over 18 months. Monitoring further identifies occasional operational disruption due to the failsafe nature of the modification, as failures caused by damage to the sensors bring trains to a stop. However, this is judged tolerable by the RU given the safety and reputational benefits of the modification.

[Figure 3: Risk acceptance principles selected. The figure shows the selection of a risk acceptance principle from codes of practice, similar reference system and explicit risk estimation, leading respectively to application of codes of practice, similarity analysis with reference system(s), and identification of scenarios and associated safety measures, judged against qualitative or quantitative safety criteria.]

The assessment of the final drive failures and subsequent monitoring of the mitigation measures post-implementation feeds into the RU’s wider risk assessment programme. The lessons learned from the final drive failures are extended to the management of final drive condition across the RU’s entire fleet.
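
The monitoring figures in section 4 (7 failures over 10 months before the changes, none for over 18 months afterwards) can be given a rough sense of scale with a short calculation. The sketch below is illustrative only and is not part of the RU's assessment: it assumes failures arrive independently at a constant rate (a Poisson model, which is an assumption made here), and the function name `prob_no_failures` is hypothetical. It estimates how likely a failure-free period of 18 months would be if the earlier failure rate had continued unchanged.

```python
import math


def prob_no_failures(prior_failures, prior_months, quiet_months):
    """Chance of zero failures in quiet_months if the earlier rate still applied.

    Assumes a constant-rate Poisson process (an illustrative assumption,
    not something stated in the worked example).
    """
    if prior_months <= 0 or quiet_months < 0:
        raise ValueError("prior_months must be positive and quiet_months non-negative")
    rate_per_month = prior_failures / prior_months   # 7 / 10 = 0.7 failures per month
    expected_failures = rate_per_month * quiet_months  # 0.7 * 18 = 12.6
    return math.exp(-expected_failures)               # Poisson probability of zero events


# Figures taken from section 4; 18 months is used as a lower bound on the quiet period.
p = prob_no_failures(prior_failures=7, prior_months=10, quiet_months=18)
print(f"Probability of no failures at the old rate: {p:.1e}")
```

Under these assumptions the probability is of the order of a few in a million, which is consistent with the RU's reading that performance has improved. The calculation cannot say which control produced the improvement, since the modification and the updated overhaul regime were introduced together.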