CURRENT RESEARCH ACTIVITY ON THE IMPACT OF NO FAULT FOUND (NFF) ON MAINTENANCE EFFECTIVENESS THROUGH LIFE

Wg Cdr CJ Hockley OBE, CEng, MRAeS, RAF(Rtd)
EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services, Cranfield University, Bedford, MK43 0AL, United Kingdom
Telephone: +44 (0) 1793 785337
E-mail: c.hockley@cranfield.ac.uk

Dr Paul Phillips
EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services, Cranfield University, Bedford, MK43 0AL, United Kingdom

Abstract: Maintenance effectiveness and efficiency need to be one hundred percent if service and availability are to be delivered to customers’ expectations. However, the occurrence of faults where the cause cannot be determined, usually described in the UK as No Fault Found (NFF) and in the USA as Retest OK (RTOK), can often be a huge and disruptive pressure on the successful delivery of support to the customer. The NFF problem affects many industries, often in different ways, yet good lessons are often not being shared. What is not in doubt is that huge sums of money are still being wasted by not sharing best practice and not understanding the true cost of the problem. This paper provides a summary of the NFF problem and explains the common causes; it describes the impact and some of the solutions identified in the project underway at the EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services at Cranfield University to research the problem across different industries.

Keywords: Cannot duplicate (CND); Fault-not-found (FNF); Maintenance effectiveness; No-Fault-Found (NFF); Retest OK (RTOK); Through-life engineering services

1. Establishing the Problem:
Initially we should understand what the terms NFF, RTOK and Cannot Duplicate (CND, another common US term) mean. Most research in this area has been done in the commercial airline industry, and this is where most of the evidence of the problem comes from.

Definitions of NFF, RTOK and CND differ as to which level and line of servicing or test equipment is involved, so for the purposes of this paper we will start with a simple definition of NFF provided by current experts in the field, Cockram and Huby [1]: “A reported fault for which the root cause cannot be found.” Whilst this is straightforward, it suggests that NFF is a diagnostic failure, and without doubt many NFF occurrences will fit this definition. However, there is perhaps more to it than this. For instance, we need to include examples for which there never was a root cause in the first place. Such examples might be described as merely an incorrect interpretation of what was reported, i.e. communication problems where the user perhaps did not describe or report the fault correctly, leading to the maintenance team chasing a spurious fault and wasting valuable maintenance effort.

Another well-used definition, from the commercial airline industry, is: “Removals of equipment from service for reasons that cannot be verified by the maintenance process (shop or elsewhere)” [2]. But the definition also needs to cover the whole logistic support chain, where maintenance and logistic effort might be expended yet yield no results for any number of reasons. So NFF should also cover any reported fault which incurs some nugatory maintenance and logistic cost and effect but where no fault is found.

To return to establishing root causes: diagnostic success depends on several things, but first of all it needs correct and comprehensive reporting and interpretation of the reported symptoms. Given that it will also depend on many other things, such as maintenance and test equipment efficiency, technical and training skill levels and even environment and facilities, it can be seen that the whole problem is hugely complex. Only if all these factors are in our favour will diagnostic success be possible.

2. Background of NFF:
The problems and the costs of NFF are difficult to establish. Some organisations, whilst understanding they have a problem, seem unwilling to establish actual costs. The reason is possibly a belief that solutions are too difficult or elusive, or perhaps too costly to implement. Old data from the commercial air industry is worth mentioning; in 1993 British Airways (BA)¹ analysed their NFF costs and highlighted that the major problem was with their older aircraft, which suffered the highest rates: Concorde, Tristar and Boeing 747. The task force that they then established found that the problem was not quite as bad as they had initially suspected; it had been thought that 33% of all unscheduled removals were NFF, where the same fault reoccurred very soon after. However, for the year 1992 their data showed that only 13.8% of all unscheduled removals could be categorised as NFF. The cost though was still £17.6M per year. Some unsurprising data also emerged; for instance, avionics components made up 80.4% of all NFF, which in addition represented 26.6% of all avionics removals. The task force also established that maintenance personnel were removing items for the wrong reason, and those in the workshops were not repairing the faults that had been reported but might be repairing something else that they found wrong. There were many reasons identified but the most important were:
• Diagnostic training was not good enough.
• Pressure on quantity of items processed rather than quality of repair.
• Lack of emphasis on tracking the history of components by serial number.
• Data on repetitive defects needed to be shared between user and repair organisation.

¹ All BA data from a presentation by BA to an ERA Conference (1996). Results and conclusions only retained in a lecture by the author.

Solutions were put in place, one of the most important being to track what they called “rogue units” [3], which were identified as units that seemed to rotate between aircraft and workshop very frequently. Once they had such data they discovered that 360 units could be described as rogue units, as they had been involved in 2300 removals at the aircraft in a single year. This also highlighted the need to share data between themselves and their repair contractors. In 1998 they revisited the study to see how things had improved once the new systems had been implemented. There had been an improvement in the mean time between unscheduled removals of 20-30%. They had instituted a robust system to both identify and track the rogue units, having set specific criteria to identify them. These criteria were:
• The component had a history of repeated ‘short service’² periods.
• It demonstrated repeated, identical system faults.
• The fault could not be detected by standard testing procedures.
• A replacement of the same type solved the problem.
Rogue units were quarantined and then put through a detailed testing regime and, if a fault was identified, other components that showed the same fault were then tested.

Another system introduced at the same time was aimed at lessening the effect on the repair organisation. This was the “Subject to Aircraft Check” (STAC) procedure, which gave licensed engineers the authority to change a component and replace it with a new item. If the aircraft then suffered the same problem on the next flight, it could be assumed that the original item had not been the problem and that item could be returned to stock as “STAC serviceable”. BA thought that by adopting this procedure they had saved £17M per year.³ This procedure was quickly adopted by other airlines and has produced similar savings.
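
The four rogue-unit criteria described above lend themselves to a simple filter over removal records held by serial number. The following is only a minimal sketch under stated assumptions, not BA's actual system: the record fields, the function names and the reading of "repeated" as at least two occurrences are all hypothetical, and the 250-hour threshold is the 747 figure quoted for a short service period, which would differ by aircraft type.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Iterable, List

SHORT_SERVICE_HOURS = 250  # short service period quoted for a 747; type-dependent
MIN_REPEATS = 2            # assumption: "repeated" means at least two occurrences


@dataclass
class Removal:
    """One unscheduled removal of a component (hypothetical record layout)."""
    serial_number: str
    hours_since_fit: float         # flying hours between fitting and removal
    fault_code: str                # system fault reported at the aircraft
    shop_finding: str              # "NFF" if standard testing found nothing
    replacement_cured_fault: bool  # did a same-type replacement clear the fault?


def is_rogue(removals: List[Removal]) -> bool:
    """Apply the four criteria to the removal history of one serial number."""
    # Criterion 1: repeated short-service periods.
    short = [r for r in removals if r.hours_since_fit <= SHORT_SERVICE_HOURS]
    if len(short) < MIN_REPEATS:
        return False
    # Criterion 2: repeated, identical system faults.
    fault, count = Counter(r.fault_code for r in short).most_common(1)[0]
    if count < MIN_REPEATS:
        return False
    same_fault = [r for r in short if r.fault_code == fault]
    # Criterion 3: standard testing never detected the fault.
    undetected = all(r.shop_finding == "NFF" for r in same_fault)
    # Criterion 4: a replacement of the same type solved the problem.
    cured = all(r.replacement_cured_fault for r in same_fault)
    return undetected and cured


def rogue_units(history: Iterable[Removal]) -> List[str]:
    """Group removals by serial number and return the rogue candidates."""
    by_serial = defaultdict(list)
    for removal in history:
        by_serial[removal.serial_number].append(removal)
    return sorted(sn for sn, rs in by_serial.items() if is_rogue(rs))


# Illustrative data only; none of these records come from the BA study.
example_history = [
    Removal("SN001", 120, "NAV-FAIL", "NFF", True),
    Removal("SN001", 200, "NAV-FAIL", "NFF", True),
    Removal("SN001", 90, "NAV-FAIL", "NFF", True),
    Removal("SN002", 1800, "NAV-FAIL", "REPAIRED", True),
    Removal("SN003", 150, "DISPLAY-BLANK", "NFF", False),
    Removal("SN003", 100, "DISPLAY-BLANK", "NFF", False),
]

print(rogue_units(example_history))  # prints ['SN001']
```

In this sketch SN003 is not flagged because replacing it never cleared the fault, which points to a cause elsewhere in the system rather than to the unit itself. A filter of this kind only works if removals are tracked by serial number and the data is shared between operator and repair organisation, as the task force recommended.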

However, finding published data on costs and cost savings has so far been difficult, yet current research suggests that costs in many industries are still a huge problem. In 2007 it was reported in Aviation Week that avionic problems amounted to 75% of all NFF instances in the airline industry [4]. In military aviation, the problem is believed to be even bigger. A commercial aircraft is built to fail-safe design principles, so there is an element of redundancy and safety built in which may alleviate some risk and make a NFF instance a little more acceptable. A military aircraft built to safe-life principles has less redundancy, if any at all, so the risk level is higher and something must therefore be found if at all possible to minimise the risk. Sometimes this may result in more components being changed as a form of saturation maintenance, so generating items in the repair pool that were not originally at fault, but with the expectation that the faulty item has been caught.

² A short service period was defined then as 250 flying hours (approx 25 trips for a 747); however, it would be different for each type, dependent on the type of service flown.
³ BA Presentation to ERA Conference (1996). Only results and conclusions retained in a lecture by the author.

Research carried out in the RAF some years ago⁴ drew some interesting conclusions about the causes of NFF in RAF aircraft. Operational pressure to provide availability was a significant reason in some operational theatres. The NFF rate for the same item could vary from 25% where there was little operational pressure to over 90% where operational commitments were highest. Built-in-Test (BIT) and Built-in-Test Equipment (BITE) seemed to be responsible for a good proportion of NFF arisings as well. Looking at particular aircraft fleets also gave different arising rates, as might be expected, but there were some more general reasons that could be established.
Lack of knowledge and familiarity, usually when equipment was new into service, was an obvious reason; less obvious were discrepancies in procedures and even, in some cases, incorrect techniques. Poor diagnostic procedures, poor test equipment design and/or tolerances and incorrect software could also be cited as causes. Another main cause was intermittent faults, which manifested themselves under particular environmental conditions and specific usage, and these would invariably be impossible to reproduce and find on the ground.

The research that both BA and the RAF had done into the problem showed that NFF costs occur throughout the maintenance and supply chain. They are also generated because poor communication might direct the maintenance staff to the wrong system, which then proves serviceable on the ground because the conditions and environment in which the fault occurred cannot be simulated. Similarly, items may be removed for the wrong reasons, resulting in maintenance work at subsequent levels of the repair organisation, which then fails to find anything and so logs it as a NFF; all of this causes further costs for transport and investigation further down the supply chain.

3. Can NFF be Classified?
It is clear that a great many NFF occur in avionic, electrical and electro-mechanical systems, but initial research shows that software is also causing problems. Mechanical systems and structure do not give rise to many NFF, and those that do occur are usually confined to maintenance actions such as Non-Destructive Testing (NDT) inspections. NDT checks range from simple visual inspections to much more technical ones such as eddy-current and ultrasonic inspections. A NFF in these cases will be caused by lack of skill or training, but also perhaps by a badly designed inspection. The issue here is that whilst we usually think of NFF as applying to avionic, electrical and electro-mechanical systems, it can indeed be much wider and demands solutions.

Consequently the earlier simple definition of NFF as a “reported fault for which the root cause cannot be found” suggests it is merely a diagnostic failure and therefore needs some expansion. Relating symptoms to the root cause is clearly important to ensure the right maintenance activities are set in motion, but the causes of NFF and the solutions are not always quite that simple. It is thus vital to understand the causes first and to see if they can be classified into groups that will aid our understanding and lead to the best solutions. In the past it has been accepted that there are three main categories or classes of faults that are prevalent in the study of NFF:

⁴ The investigation was carried out by the Central Servicing Development Establishment at RAF Swanton Morley under direction of the author, but the report is unfortunately no longer available. The results and summary conclusions though have been retained by the author in a lecture prepared at the time.

Intermittent – This category is fairly obvious and describes a fault in a component or system that only occurs infrequently. It manifests itself as some loss of function or disruption of a connection and may only occur for a minimal or limited time. The causes can be diverse but are usually vibration or flexing of the mountings, or possibly corrosion.

Integration – This category describes components and systems that work effectively when tested in isolation but demonstrate a fault when they are integrated into another system. The most obvious example is a bent pin on an electrical connector plug. Integration problems may occur as a result of the way the components or subsystems have been built, or as a result of software compatibility, and are often not a fault in the original component or subsystem itself; nevertheless the component or subsystem often takes the blame, resulting in wasted maintenance and logistic effort.

Testing – The final generally accepted category concerns BIT/BITE and testing in general. Tests may be done as part of the routine system checks within the equipment itself or as part of the diagnosis when a fault is reported. The effectiveness and success of the BIT/BITE then becomes important, in that it needs to be able to assist the technician in diagnosing the cause and identifying the necessary maintenance action. In the same way, testing in the workshop needs to be able to isolate the cause successfully and identify the correct repair required. Test effectiveness and efficiency are therefore important, and the tests have to be designed correctly in the first place. In addition, the alarm rates need to be set at the correct level because if they are too low, faults will not be isolated, and if too high, faults of little consequence or effect will trigger spurious alarms.

These three categories or classes, however, appear to be too narrow from the initial research underway at the EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services. There are many causes that do not fall into the above three categories, and there are some that might only be relevant to particular industries. For instance, we need to consider other causes such as poor design, lack of communication, mis-communication, wrong diagnostic methods and processes being specified, poor training, wrong processes applied and operational pressure. All will generate NFF which do not fall into one of the generally accepted three classes, and categorising them is necessary if we are to solve this costly problem comprehensively. Consequently it may be necessary to generate other specific categories or classes in due course. Perhaps the issue is not whether we can categorise all NFF into three or more classes, but whether we can relate causes, solutions and classes as an aid to understanding how to solve or reduce the cost of the problem.

4. Additional causes of NFF:
It is necessary therefore to look at all the other causes that can be identified to see how they can be grouped. Initial work shows that there are many additional causes that must be considered if total and effective solutions are to be provided. These additional causes are broadly grouped under organisational and culture, procedures and rules, technical inefficiencies, and finally workforce behaviour. However, within each of these there are many sub-sets and causes, and they are briefly mentioned below.

Organisational and culture. Some organisations are too big, and often therefore too bureaucratic, to recognise or admit that they have a problem, and even when they do, there is a reluctance to grasp the need for fundamental change. Often in these situations there is a lack of evidence on the true cost of the problem. Certain levels in the organisation know of the problem but do not have the authority or motivation to help solve it. Some organisations have a structure that does not allow good communication between maintenance teams and between the different technical trades. The organisation does not have a culture that encourages solutions to the NFF problem. Data and information are often lacking and would require a great deal of commitment, effort and expense to generate. There may even be an unwillingness to confront the problem from below for fear of alerting management to poor practices. Job protection can also be an issue.

Time pressures on maintenance operations. When there is great pressure to return the equipment to service quickly, time for diagnosis is critical and might simply not be available. Consequently we see an overwhelming pressure on maintenance staff, which then often results in some speculative maintenance, as this is the quickest way to return the equipment to service. What happens is that instead of diagnosis to determine which of three components is at fault, all three are changed.
This means that one of the three should have its fault found further down the repair chain, but the other two should each generate a NFF occurrence. Whilst changing the three components was the quickest solution, it has now caused expense and effort throughout the support chain.

Inadequate training and lack of training equipment. Technician training has to be sufficient to cope with difficult diagnosis tasks, and there has to be adequate training equipment and facilities to support such training, otherwise faults will not be diagnosed successfully. Maintenance personnel also have to be given time to build up experience, particularly on new equipment, otherwise the wrong processes or tests may be done with little chance of finding the true cause or fault.

Communication, information exchange and lack of historical data. Components need to be tracked by serial number so that there is historical evidence to pinpoint rogue units. All those involved also need to share knowledge and experience of NFF, which means designers, producers, maintenance providers and users, as there will be a wealth of information that can help reduce systematic faults across fleets. Sharing information and historical data though is always easier said than done and may require some investment.

Procedures and rules. In many organisations such as the military there will be procedures and rules that deal with NFF, or alternatively, either cause some of the NFF incidents, or exacerbate and perpetuate their occurrence. The ability to have the STAC procedure previously described is an example of a procedure that deals with the occurrences for fail-safe designs and has been shown to reduce the subsequent cost of NFF.
On the other hand, military rules and procedures may make it too difficult to change the rules for some fleets and not for others, as this would require a whole raft of new rules and procedures, with the attendant safety provisions, to ensure the change is applied only to the authorised fleets; in such a large organisation with so many diverse fleets, this has so far proved impossible to adopt in the military. Similarly, there has been no real incentive in the past to monitor rogue units because the overhaul and repair costs were covered in-house and thus largely hidden. Indeed, more repair work often justified the need to keep the service as an organic capability rather than contract it out to a third party. Times have changed, with much of this support activity now being outsourced to industry, which has a much greater incentive to drive out waste and increase profits. Other considerations loosely under the heading of procedures and rules are:

Supply chain effect. NFF can have a huge effect on the supply chain. A frequently occurring fault may lead to the same item being replaced each time, yet the fault keeps recurring, usually because the cause is intermittent and lies in a completely different component. The supply chain sees large numbers of this item being used, so orders more to keep up with demand; because there are then more items in the supply chain, technicians believe it must be the unreliable candidate and change it as the first choice, and so the problem becomes self-perpetuating. This is known as the supply chain effect or phantom supply chain.

Discrepancies and errors in test routines. Errors in test routines and procedures may not be obvious but will keep causing NFFs until the procedure or routine is corrected. Similarly, some tests or procedures may not be as comprehensive as they need to be to diagnose some faults properly, and so result in a NFF.

Inaccurate reporting.
Faults must be clearly and unambiguously reported so that maintenance personnel have the best chance of diagnosing the cause. Without a clear understanding, the diagnosis may follow a completely spurious path leading to a NFF.

Technical inefficiencies. Causes arising from technical inefficiencies can be anything from a lack of test equipment, through poorly designed test equipment or processes, to incorrect diagnostic routines for novel faults not previously experienced.

BIT/BITE. The pass/fail criteria for BIT/BITE must be set at the right level, otherwise false alarms will be generated if set too low, or faults will not be correctly isolated if set too high.

Incorrect repairs. Again, a fault, usually one caused by intermittency, may not be diagnosable, but during the investigation a different fault is found in the system which actually had nothing to do with the original one. It is repaired nevertheless, but the original fault persists, and the original job should therefore be classed as NFF.

Operating environment. Without the relevant information on the environment, the search for the reported fault may be fruitless and the result will be a NFF. Furthermore, a representative operating environment, with the correct operating stresses properly replicated, may be impossible to provide during diagnosis.

Poor system design. Some designs will not lend themselves to easy testing and diagnosis and so may make NFFs inevitable.

Workforce behaviour. This cause of NFF also impinges on the cultural aspects previously mentioned, but not from an organisation’s point of view; rather it is more about the individual’s attitudes and behaviour. For instance, maintenance personnel often get into bad habits, and not necessarily deliberately. They adopt short-cuts or accepted divergences from procedures, known in the aviation world as “Norms”. Very often managers have even tacitly approved such divergences.
This tacit approval then becomes the norm but can exacerbate the number of NFF instances or hide the true path to successful diagnosis.

Communication. At the individual level, maintenance personnel may not appreciate the importance of good communication and the sharing of experience and knowledge. Poor communication can in itself cause NFF by sending the next shift off on a wild goose chase or merely by not sharing information about a commonly experienced problem.

Resistance to change. Individuals are often reluctant to change what they believe are well-tried practices and procedures.

Poor maintenance practice. Technicians may have got into bad habits but no-one recognises them as poor maintenance practice. Simple disregard of procedures, perhaps due to technician over-confidence and arrogance, may cause NFFs.

It can be seen from the above that all these issues will produce an increase in cost and certainly a waste of effort and resources. It is also clear that NFF problems will impact badly on maintenance effectiveness and through-life engineering services.

5. Cost of NFF throughout the support chain:
It is still difficult, however, to assess overall costs for the NFF problem. Companies seem reluctant to declare what the costs are, perhaps fearing the worst without the means to solve the problem. It may be that the issues are further down the support chain and therefore possibly not under their control. There are various points at which the costs are going to appear: the user; the next level repair organisation, which might be quite close to the user; or the original equipment manufacturer (OEM) or Maintenance Repair Organisation (MRO), both of which are usually also remote from the user.

At the user, no fault might be found despite a great deal of diagnostic effort, but no component is replaced and therefore no actual solution is established. All the tests prove satisfactory, so the equipment is declared serviceable but designated NFF.
Nevertheless, the fault recurs during the next mission or very soon afterwards, so further diagnosis is now required. Also at the equipment, the wrong diagnosis or investigation might get underway for a number of reasons, resulting in the wrong solution being applied; work has been done so it is expected the fault has been fixed, but it again recurs on the next mission. In this case the first fault is not usually classed NFF but should be! Finally, very often in desperation, a component is speculatively replaced in the hope that it will fix the fault, but at least something has been changed! Unfortunately the removed component is now classified faulty, but at the next or subsequent level of repair it will be classified NFF.

At the subsequent level of repair, therefore, there could be many wasted man-hours spent looking for faults that are not there because the items had been removed ‘on spec’; similarly, man-hours may be spent looking for intermittent faults that cannot be replicated because the environmental conditions cannot be applied as part of the bench test. Finally, all the same technical inefficiencies previously described will cause unproductive effort.

But the effects and cost will not just be felt at the repair sites; there will be costs and effort expended throughout the supply and support chain. More items will need to be provisioned into stock to cope with the numbers of items undergoing diagnosis that are in fact NFF but which are consuming time and effort and hence cost. The supply chain effect causes more items to be in the supply chain and thus additional inventory cost.

Diagnostic Success:
Whilst the costs are difficult to establish, there has been better information available in recent years on diagnostic success, particularly with respect to avionics.
Based on some research done by Copernicus Technology Ltd. [5], the outcomes for avionics tend to divide into diagnostic success, diagnostic failure resulting in speculative replacement, and diagnostic failure with a functional test only.

[Figure 1: Diagnostic Success Rates for Avionics System Repairs. Segments: Diagnostic Success; Diagnostic Failure, Speculative Replacement; Diagnostic Failure, Functional Test Only.]

Fig 1 shows that, considering all faults, both hard and intermittent, diagnostic success is comparatively low in avionics, perhaps only 40%. Diagnostic failures account for more than 50% of occurrences. The ‘Functional Test Only’ number is the case where the technician cannot positively identify the fault but, by establishing through the functional test that there is no fault, can declare the equipment serviceable again. The other diagnostic failure group covers all the remaining reasons previously identified.

6. Mitigation of NFF:
Having tried to establish something about the costs, we must turn attention to mitigation and solutions. There has been a great deal of research over the years, but solutions and mitigations are certainly not universal even within some companies, let alone within industries or across different industries. There is consequently much still to do, and hence the need for the consolidated three-year research effort now being undertaken at the newly formed EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services. Some of this effort is being directed at the design and production stages, where there is a need to create more fault-tolerant systems which perhaps incorporate inbuilt redundancy. There is a requirement for some thorough research effort into intermittency, its causes and solutions. Understanding intermittent faults will rely on the ability to describe the various interactions accurately and how mechanical, software and electronic elements all have to interact together.
Modelling of intermittent faults will be required but will need to include probabilities of fault detection. A thorough understanding of individual systems will be required in order to provide fault models and models that deal with false BIT alarms and the root causes of BIT deficiency.

In some industries and individual companies, adopting better Health and Usage Monitoring Systems (HUMS) and prognostics would ensure that important operational parameters were monitored at all times to identify adverse and out-of-limits variations. These technologies would help to change from a policy of reactive maintenance to a predictive policy, which would concentrate on providing vital information on the root causes of faults that is not provided by traditional BIT/BITE. Other technology improvements such as RFID, which is certainly not a new development [6], can be adopted to track units within the supply chain and to monitor the complete service history of items while they are in the supply chain. Such technology solutions will go some way to mitigating NFF, but what is needed is a comprehensive approach dealing with organisational, procedural and behavioural issues as well as all the technical issues.

7. Current Research Effort:
The EPSRC Centre for Innovative Manufacturing in Through-life Engineering Services has a three-year project, which started in November 2011, that seeks to address the problem of NFF and make some progress with solutions. It has the following ambitious but, we believe, realistic objectives:
• Identify procedural, process and behavioural issues that need to be changed, learning from best practice in each industry.
• Develop an in-situ health monitoring approach at the board level to detect, characterize and locate NFF intermittent failures and deliver a fault localisation mechanism and demonstrator at the board and sub-system level.
• Devise strategies, methodologies and system design rules to mitigate the intermittent NFF failure mechanisms and to demonstrate their effectiveness in reducing the likelihood of NFF occurrences.
• Develop a multi-disciplinary approach at the system level for the effective analysis of the root causes of NFF in order to assist design activity across domains.
• Develop a handbook and a system design evaluation standard with procedures that will reduce the problem of NFF.

Conclusion
This initial research at the EPSRC Centre has confirmed that there is still a serious issue with NFF and shows that there is still much to do in some organisations to change the culture and behaviours if we are to avoid the huge costs that occur. One early conclusion is that we should start that culture change by describing the issue as a Fault Not Found (FNF), which has a more positive behavioural sense, rather than NFF, which suggests an attitude of resignation that there was probably no fault there anyway. FNF suggests that there is still a problem to solve and an acceptance that something must be done. Whilst this research project has only just started, it will continue for nearly three more years and, if all the objectives are met, will certainly achieve a huge step forward with this perennial problem.

References
[1] Cockram, J., Huby, G. (2009), “No Fault Found (NFF) occurrences and intermittent faults: improving availability of aerospace platforms/systems by refining maintenance practices, systems of work and testing regimes to effectively identify their root causes”, CEAS European Air and Space Conference, 26th-29th October, Manchester, UK
[2] ARINC Report 672 (2008), “Guidelines for the reduction of No Fault Found (NFF)”, Avionics Maintenance Conference, Aeronautical Radio Inc.
[3] Söderholm, P. (2007), “A system view of the No Fault Found (NFF) phenomenon”, Reliability Engineering & System Safety, Vol 92, No 1, pp. 1-14
[4] Burchell, B. (2007), “Untangling No Fault Found”, Aviation Week, 9th Feb
[5] Huby, G., Cockram, J. (2010), “The system integrity approach to reducing the cost impact of ‘No Fault Found’ and intermittent faults”, RAeS Airworthiness and Maintenance Conference, 16th Sept, Cranfield University, UK
[6] Narsing, A. (2005), “RFID and supply chain management: An assessment of its economic, technical and productive viability in global operations”, The Journal of Applied Business Research, Vol 21, No 2, pp. 75-80