AN INTELLIGENT VALVE FRAMEWORK FOR INTEGRATED SYSTEMS HEALTH MANAGEMENT ON ROCKET ENGINE TEST STANDS

by
Michael Russell

A Thesis Submitted in partial fulfillment of the requirements of the Master of Science Degree of The Graduate School at Rowan University

October 2010

Thesis Chair: Shreekanth Mandayam, Ph.D.

© 2010 Michael Russell

ABSTRACT

Michael J. Russell
AN INTELLIGENT VALVE FRAMEWORK FOR INTEGRATED SYSTEMS HEALTH MANAGEMENT ON ROCKET ENGINE TEST STANDS
2009/2010
Shreekanth Mandayam, Ph.D.
Master of Science in Engineering

Intelligent sensors can play a critical role in the monitoring of complex test systems such as those used to inspect rocket engine components. Such sensors have the capability not only to provide raw data, but also to indicate the data’s reliability and its effect on system health at various levels in the system hierarchy. A major concern at NASA Stennis Space Center (SSC) in Mississippi is the failure of critical components in the rocket engine test stand during a test cycle. Test cycles can run for extended periods of time, and it is nearly impossible to perform maintenance on mission-critical components once testing has commenced. Valves play a critical role in rocket engine test stands because they are essential to the cryogen transport mechanisms that are vital to test operations. Sensors placed on valves monitor the pressure, temperature, flow rate, valve position, and any other features that are required for diagnosing their functionality. Integrated systems health management (ISHM) algorithms have been used to identify and evaluate anomalous operating conditions of systems and sub-systems (e.g., valves and valve components) on complex structures such as rocket test stands. In order for such algorithms to be useful, there is a need to develop realistic models for the most common and problem-prone elements.
Furthermore, the user needs to be provided with efficient tools to explore the nature of an anomaly and its possible effects on the element, as well as its relationship to the overall system state. This thesis presents the development of an intelligent valve framework that is capable of tracking and visualizing events of the large linear actuator valve (LLAV) in order to detect anomalous conditions. Specifically, the research work presented in this thesis describes a diagnostic process that receives and stores incoming sensor data; calculates operating statistics; compares the data with existing analytical models; and visualizes faults, failures, and operating conditions in a 3D GUI environment. A suite of diagnostic algorithms has been developed that can detect anomalous behavior in the valve and other system components of the rocket engine test stand. The framework employs a combination of technologies including a DDE data transfer protocol, auto-associative neural networks, empirical and physical models, and virtual reality environments. The diagnostic procedure that has been developed can be integrated into existing ISHM systems and can reduce information overload in the typically crowded environments of complex system control rooms. The augmentation to ISHM capabilities that is presented in this thesis can provide significant benefits for ground-based spacecraft monitoring and has the potential to be ultimately adapted for providing on-board support for spacecraft.

ACKNOWLEDGEMENTS

The support of my MS program by the NASA Graduate Student Researchers Program (GSRP) award No. NNX08AV98H in 2008 and 2009 is gratefully acknowledged. The research work presented in this thesis was also supported by NASA Stennis Space Center under Grant/Cooperative Agreement No. NNX08BA19A. I also acknowledge Dr. Shreekanth Mandayam for being a great advisor and providing the funding to get me through my master's program. To Dr. Schmalzel and Dr.
Merrill, I thank you for your guidance as part of my thesis committee. To Hak attack, Fillman, Metin, Rane, Elwell and Freddie for helping me pass undergrad and making life bearable during those all-nighters. To Will and Steven for being my best non-nerd friends through college. I would also like to thank my family, who have supported me in my academic journey. My mom and dad for always encouraging me to push myself in life and faith. My siblings and siblings-in-law for always being there for me. My grandparents for supporting me in my internship at NASA, which started this research.

In Memoriam: Dr. Robert (Bob) Field was one of the many engineers at Stennis Space Center who contributed to the development of improved system models—one of which is a core element in the intelligent valve. A mechanical engineer adept at thermal system design and analysis, he brought a depth of experience and insight gained from his many years at Pratt & Whitney designing turbomachinery blades and solving other equally complex problems. At NASA, he applied his deep understanding of thermal systems design and analysis to many facets of test stand design and optimization. In addition to his thermal technical expertise, he was the leader of many a stimulating conversation into the finer—and fringier—points of the enterprises of engineering, science, and the unknown. He was always ready to talk to young engineers and students. Bob retired from NASA in October 2009 and passed away in February 2010.

In memory of Gladys Russell and William Kolb, the best grandparents, parents and spouses I have ever known.

TABLE OF CONTENTS

Acknowledgements
List of Figures
List of Tables

CHAPTER 1: INTRODUCTION
1.1 APPLICATIONS
1.2 MOTIVATION
1.3 OBJECTIVES
1.4 SCOPE
1.5 ORGANIZATION
1.6 EXPECTED CONTRIBUTIONS

CHAPTER 2: BACKGROUND
2.1 HEALTH ANALYSIS
2.2 FRAMEWORK FOR HEALTH ANALYSIS
2.3 DESIGN AND TRADE STUDIES
2.4 FAILURE MODE ANALYSIS
2.5 CBM TESTING, DATA COLLECTION, AND DATA ANALYSIS
2.6 ALGORITHM DEVELOPMENT - DIAGNOSTICS
2.6.1 Preprocessing and Feature Extraction
2.6.2 Techniques for Diagnostics
2.7 ALGORITHM DEVELOPMENT - PROGNOSTICS
2.8 RELIABILITY CENTERED MAINTENANCE
2.9 SYSTEM IDENTIFICATION TECHNIQUES
2.9.1 Autoregressive Models
2.9.2 Kalman Filters

CHAPTER 3: APPROACH
3.1 FAILURE MODES
3.2 INTELLIGENT VALVE FRAMEWORK
3.2.1 Data Acquisition
3.2.2 Preprocessing
3.2.3 Failure Mode Detection and Diagnosis
3.2.4 Valve Operational Statistics
3.2.5 Auto-associative Neural Networks for Sensor Validation
3.2.6 Thermal Modeling
3.2.7 Adaptive Thresholding
3.3 PROGNOSTIC SURVEY
3.4 DIAGNOSTIC PROCESS

CHAPTER 4: RESULTS
4.1 DIAGNOSTIC VALIDATION DATA
4.1.1 Thermal Model Data
4.1.2 Sensor Validation Data
4.1.3 Adaptive Threshold Data
4.2 THERMAL MODEL VALIDATION
4.2.1 Thermal Modeling
4.2.2 Simulation Metrics
4.3 SENSOR VALIDATION
4.4 ADAPTIVE THRESHOLD
4.5 VALVE STATISTICS
4.6 HEALTH VISUALIZATIONS
4.7 PROGNOSTICS
4.8 PROGNOSTICS DATA
4.8.1 Canonical Data
4.8.2 LLAV Data
4.9 PROGNOSTIC PERFORMANCE
4.10 DIAGNOSTIC PROCESS

CHAPTER 5: CONCLUSIONS
5.1 SUMMARY OF ACCOMPLISHMENTS
5.2 RECOMMENDATIONS FOR FUTURE WORK

References

LIST OF FIGURES

Figure 1 - Integrated approach for system health analysis.
Figure 2 - The four types of failure mode and effect analysis (FMEA).
Figure 3 - Reliability analysis procedure for bottom-up and top-down FMEA approaches.
Figure 4 - System decomposition for CBM testing, data collection, and data analysis.
Figure 5 - Diagnostic and Prognostic Flowchart.
Figure 6 - Model-based and Data-driven diagnostic techniques.
Figure 7 - Approaches for prognosis.
Figure 8 - The system identification loop.
Figure 9 - LLAV with regions of interest labeled.
Figure 10 - Prioritization of LLAV failure modes (see Equations 2.1 and 2.2 for y-axis calculation).
Figure 11 - System level flowchart of the Intelligent Valve framework.
Figure 12 - Health analysis framework for the Intelligent Valve.
Figure 13 - Valve statistics algorithm.
Figure 14 - Training method for auto-associative neural networks for sensor validation.
Figure 15 - Adaptive threshold algorithm for designing and choosing ARMA models.
Figure 16 - Adaptive threshold algorithm simulation on real-time data.
Figure 17 - Intelligent Valve database schema.
Figure 18 - Software framework for the Intelligent Valve.
Figure 19 - MTTP Trailer used for validating sensor faults.
Figure 20 - Simulation data using thermal modeling for base run.
Figure 21 - Data acquisition setup for thermal modeling fault detection.
Figure 22 - Simulation data using thermal modeling for faulty connections in Tustin amplifier input.
Figure 23 - Fault classification using thermal modeling for faulty connections in Tustin amplifier input.
Figure 24 - Simulation data using thermal modeling for amplifier power downs and Tustin input disconnections.
Figure 25 - Fault detection using thermal modeling for amplifier power down and Tustin input disconnection.
Figure 26 - Simulation data using thermal modeling for faulty input connections in the digitizer.
Figure 27 - Fault detection using thermal modeling for amplifier power down and Tustin input disconnection.
Figure 28 - Simulation data using thermal modeling for simulated frost insulation test 1.
Figure 29 - Fault detection using thermal modeling for frost insulation test 1.
Figure 30 - Simulation data using thermal modeling for simulated frost insulation test 2.
Figure 31 - Fault detection using thermal modeling for frost insulation test 2.
Figure 32 - Modified data acquisition setup for thermal modeling fault detection.
Figure 33 - Simulation data using thermal modeling for temperature junction reference errors.
Figure 34 - Fault detection using thermal modeling for temperature junction reference errors.
Figure 35 - Simulation data using thermal modeling for thermocouple and power disconnections.
Figure 36 - Fault detection using thermal modeling for thermocouple and power disconnections.
Figure 37 - Simulation data using thermal modeling for thermocouple disconnections and shorts.
Figure 38 - Fault detection using thermal modeling for thermocouple disconnections and shorts.
Figure 39 - Simulation data using thermal modeling for transmitter power failures.
Figure 40 - Fault detection using thermal modeling for transmitter power failures.
Figure 41 - Simulation data using thermal modeling for unaccounted thermocouple junctions.
Figure 42 - Fault detection using thermal modeling for unaccounted thermocouple junctions.
Figure 43 - Comparison of predicted and actual frost line.
Figure 44 - Example of a hard fault.
Figure 45 - Example of a soft fault.
Figure 46 - Example dataset from LLAV and downstream pressure sensor.
Figure 47 - Hard fault detection using AANN.
Figure 48 - Soft fault detection by AANN.
Figure 49 - Fault detection of a simulated hard fault in a pressure sensor.
Figure 50 - Fault detection of a soft fault in a pressure sensor.
Figure 51 - Detection of a simulated disconnect in a pressure transducer.
Figure 52 - Legend for AANN estimations: (a) top estimation plots and (b) bottom error plots.
Figure 53 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors under normal operating conditions.
Figure 54 - AANN Estimation for PE-1143-GO and PC1 pressure sensors under normal operating conditions.
Figure 55 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors under normal operating conditions.
Figure 56 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with hard fault in PE-1143.
Figure 57 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with hard fault in PE-1143.
Figure 58 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with hard fault in PE-1143.
Figure 59 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with level shift in PE-1143-GO.
Figure 60 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with level shift in PE-1143-GO.
Figure 61 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with level shift in PE-1143-GO.
Figure 62 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with noise in PC1.
Figure 63 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with noise in PC1.
Figure 64 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with noise in PC1.
Figure 65 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with noise in VPV-1139-FB.
Figure 66 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with noise in VPV-1139-FB.
Figure 67 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with noise in VPV-1139-FB.
Figure 68 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with simultaneous faults in PE-1143-GO and PC1.
Figure 69 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with simultaneous faults in PE-1143-GO and PC1.
Figure 70 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with simultaneous faults in PE-1143-GO and PC1.
Figure 71 - Set point transitions for adaptive thresholding testing.
Figure 72 - Set point transition #1 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 73 - Set point transition #2 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 74 - Set point transition #3 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 75 - Set point transition #4 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 76 - Set point transition #5 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 77 - Set point transition #6 with fault detection while operating in: (a) normal OS and (b) faulty OS.
Figure 78 - Average fault values for different parameters of the ARMA model thresholding method over all tests.
Figure 79 - Training data with final threshold fit.
Figure 80 - Fault detection of simulated obstruction fault using adaptive thresholding.
Figure 81 - Frost line visualization of LLAV.
Figure 82 - Cross sectional and exploded view with flow and position visualizations.
Figure 83 - Frost line visualization of LLAV with thermocouple values.
Figure 84 - Linear equation with 0 mean and 1 variance.
Figure 85 - Linear time series with 0 mean and 10 variance.
Figure 86 - Original model time series.
Figure 87 - AR prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 88 - AR prediction of first time signal at 5 prediction steps and SNR = 25 dB.
Figure 89 - AR prediction of first time signal at 5 prediction steps and SNR = -5 dB.
Figure 90 - AR MSE performance on 0 mean, 1 variance signal.
Figure 91 - ARMA prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 92 - ARMA prediction of first time signal at 1 prediction step and SNR = -5 dB.
Figure 93 - ARMA prediction of first time signal at 5 prediction steps and SNR = -5 dB.
Figure 94 - ARMA MSE performance on 0 mean, 1 variance signal.
Figure 95 - Kalman filter prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 96 - Kalman filter prediction of first time signal at 5 prediction steps and SNR = 25 dB.
Figure 97 - Kalman filter prediction of first time signal at 5 prediction steps and SNR = -5 dB.
Figure 98 - Kalman filter MSE performance on 0 mean, 1 variance signal.
Figure 99 - Original time series model #2.
Figure 100 - AR MSE performance on 0 mean, 10 variance signal.
Figure 101 - ARMA MSE performance on 0 mean, 10 variance signal.
Figure 102 - Kalman filter performance on 0 mean, 10 variance signal.
Figure 103 - ARX prediction of the LLAV data to 30 time steps.
Figure 104 - Performance for ARX model based on LLAV data.
Figure 105 - ARMAX prediction of the LLAV data to 30 time steps.
Figure 106 - Performance for ARMAX model based on LLAV data.
Figure 107 - Kalman prediction of the LLAV data to 30 time steps.
Figure 108 - Performance for Kalman filter based on LLAV data.
Figure 109 - Intelligent Valve statistics tab.
Figure 110 - Intelligent Valve thermocouple tab.
Figure 111 - Intelligent Valve setup tab.
LIST OF TABLES

Table 1 - An example morphological matrix of a redesigned rail bogie.
Table 2 - Description of the four types of failure mode and effect analysis (FMEA).
Table 3 - Possible values of the parameters used in an FMEA.
Table 4 - Diagnostic algorithms from the literature.
Table 5 - Prognostic algorithms from the literature.
Table 6 - Failure modes and effects for LLAV.
Table 7 - Thermocouple types and ranges.
Table 8 - Data server class interface.
Table 9 - Adaptive threshold simulation parameters.
Table 10 - Physical parameters obtained from least-squares optimization curve fit of base run.
Table 11 - Performance metrics for faulty connection in amplifier input.
Table 12 - Performance metrics for amplifier power down and Tustin input disconnect.
Table 13 - Performance metrics for input disconnection on the digitizer.
Table 14 - Performance metrics for frost insulation test 1.
Table 15 - Performance metrics for frost insulation test 2.
Table 16 - Performance metrics for temperature junction reference error.
Table 17 - Performance metrics for thermocouple and power disconnection.
Table 18 - Performance metrics for thermocouple disconnections and shorts.
Table 19 - Performance metrics for transmitter power failures.
Table 20 - Performance metrics for unaccounted thermocouple junction.
Table 21 - Average performance metrics for all thermocouple fault tests.
Table 22 - Performance metrics for fault detection using AANN under normal operating conditions.
Table 23 - Performance metrics for fault detection using AANN with injected hard fault in PE-1143-GO.
Table 24 - Performance metrics for fault detection using AANN with injected level shift fault in PE-1143-GO.
Table 25 - Performance metrics for fault detection using AANN with injected noise in PC1.
Table 26 - Performance metrics for fault detection using AANN with noise in VPV-1139-FB.
Table 27 - Performance metrics for fault detection using AANN with simultaneous faults in PE-1143-GO and PC1.
................................................................................................... 116 Table 28 - Operating Statistics for LLAV ................................................................................... 132 xiii GLOSSARY OF TERMS 1. Health Management - A comprehensive system that detects, isolates, and quantifies faults as well as predicts future failures in an engineering system 2. Condition based maintenance - The use of machinery run-time data to determine the machinery condition and hence its current fault/failure condition, which can be used to schedule required repair and maintenance prior to breakdown 3. Prognostics and health management - The prediction of future failure conditions and remaining useful life of a system, subsystem, or component 4. Reliability centered maintenance - The process that is used to determine the most effective approach to maintenance 5. Failure Conditions - States of components and subsystems that are indicative of a fault occurring in the overall system. 6. Dimensionality Reduction - The process of reducing the number of random variables under consideration in order to create a more accurate set of feature vectors. 7. Fuzzy Logic - A form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than accurate 8. Intelligent Component - A component in a system that relays not only raw data, but some sort of analysis on the data, e.g. FFT, DSP, moving average, fault and failure conditions, etc. 9. Artificial Neural Network - A mathematical model or computational model that is inspired by the structure and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. 10. 
Integrated Systems Health Management - A set of system capabilities that in aggregate perform: determination of condition for each system element, detection of anomalies, diagnosis of causes for anomalies, and prognostics for future anomalies and system behavior.
11. Fault Diagnosis - Detecting, isolating, and identifying an impending or incipient failure condition; the affected component (subsystem, system) is still operational, even though in a degraded mode.
12. Failure Diagnosis - Detecting, isolating, and identifying a component (subsystem, system) that has ceased to operate.
13. Fault Detection - Detection of the occurrence of faults in the functional units of the process, which lead to undesired or intolerable behavior of the whole system.
14. Fault Isolation - Localization (classification) of different faults.
15. Fault Analysis or Identification - Determination of the type, magnitude, and cause of the fault.
16. Failure Modes, Effects, and Criticality Analysis (FMECA) - A procedure in product development and operations management for analysis of potential failure modes within a system, for classification by the severity and likelihood of the failures.

Chapter 1: INTRODUCTION

As system complexity increases, the amount of data required to monitor system failures also increases. Originally, a human operator could view the raw time-series data to find sensor faults that could be traced back to a root cause in the system. In modern systems, however, the increase in sensor data can make it difficult, if not impossible, for a human operator to find anomalies in a timely fashion [1]. Therefore, reliability engineers are deploying automated algorithms that detect failure modes in complex, dynamic systems.
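One of the simplest automated checks such algorithms perform is flagging sensor readings that violate fixed limits. A minimal sketch of that idea follows; the limits and pressure readings are hypothetical, not SSC data:

```python
def find_threshold_violations(readings, low, high):
    """Return (index, value) pairs for samples outside the band [low, high]."""
    return [(i, x) for i, x in enumerate(readings) if x < low or x > high]

# Hypothetical pressure readings (psi) with two out-of-range samples.
pressure = [101.2, 100.8, 150.3, 99.9, 42.0, 100.1]
violations = find_threshold_violations(pressure, low=90.0, high=120.0)
print(violations)  # [(2, 150.3), (4, 42.0)]
```

As the following paragraphs note, modern health analysis goes well beyond such fixed-limit checks toward quantifying degradation and predicting faults.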
The goal of these algorithms has been extended from detecting threshold violations in the sensors to identifying and quantifying the degradation of health in a system, and even to predicting faults before they occur. Numerous techniques have been developed with the help of extensive research and funding in the field of health analysis. The United States military has taken particular interest in health analysis in order to provide its troops with robust and reliable systems. Military studies found that maintenance protocols were based on a schedule rather than on degradation of performance. Scheduled maintenance led to many components being replaced before their operational lifetime had ended. Health analysis allows preventive maintenance to be performed based on the current health state of the component; it offers considerable cost benefits while also keeping the operators of the machines safe. In any system, the proper health analysis technique must be determined, and the choice depends on several design parameters such as application, severity, accuracy, historical data, constraints, deadlines, and the complexity of the physical dynamics of the system. For instance, in a manufacturing plant certain critical components can cause a shutdown for days. The shutdown can cause considerable delays in the shipping of the manufactured products. Such systems would require a highly accurate algorithm such as a physics model, but these algorithms also take the longest time and cost the most to develop. In the realm of health analysis, four major technologies have arisen: Condition Based Maintenance, Prognostics and Health Management, Reliability Centered Maintenance, and Integrated Systems Health Management. Condition Based Maintenance (CBM) is defined as "the use of machinery run-time data to determine the machinery condition and hence its current fault/failure condition, which can be used to schedule required repair and maintenance prior to breakdown" [2].
Prognostics and Health Management (PHM) refers to the prediction of future failure conditions and remaining useful life of a system, subsystem, or component. Integrated Systems Health Management (ISHM) "describes a set of system capabilities that in aggregate perform: determination of condition for each system element, detection of anomalies, diagnosis of causes for anomalies, and prognostics for future anomalies and system behavior" [3]. Reliability Centered Maintenance (RCM) "is the process that is used to determine the most effective approach to maintenance" [4]. As systems grow in size and complexity, there will be a need to develop algorithms that have higher accuracy, are more general, and have longer prediction intervals than current system health analysis techniques.

1.1 Applications

Diagnostics and prognostics make up the core components of the health analysis framework. These two technologies are not limited to engineering, but are also used in medical and business applications. While the goals of the analysis may differ, the techniques used in these diverse fields are often the same. The medical field uses diagnostics to try to determine the health of a patient and the disease that is affecting the patient. Once a diagnosis has been made, remedial procedures can be created to help the patient as much as possible. Three of the methods used by medical professionals are exhaustive, algorithmic, and pattern-recognition. The exhaustive method asks every possible question and runs all possible tests in order to create the most comprehensive diagnosis possible. The algorithmic method follows the steps of a proven strategy to diagnose the disease based on the symptoms the patient is experiencing. The final method, pattern-recognition, uses past experience to recognize a pattern of clinical characteristics in order to diagnose the patient.
While the procedures are different for each method, the goal of finding the disease and devising a treatment based on the available symptoms and test data is the same [5]. The global economy is constantly in a state of flux, with companies' stocks rising and falling every day. The ability of an investor to predict these changes would mean success for his company. Therefore, algorithms are being designed that attempt to analyze, model, and even forecast the stock market, in an attempt to find trends that will tell investors when it is the best time to buy or sell their stocks. Financial forecasting is also used by top management for planning and implementing long-term strategic objectives. The methods used by forecasters usually rely on probabilistic models such as regression and Markov models. The main drawback of most of these models is the difficulty of taking into account all the variables, as well as the functions or relationships of those variables, that contribute to something as complex as the global market [6].

1.2 Motivation

The National Aeronautics and Space Administration (NASA) was formed in 1958 and quickly established itself as a worldwide leader in air and space research. The accomplishments of the public space agency include a number of firsts: an interplanetary flyby, pictures from another planet, manned landings on the moon, and the assembly and launch of a space station. Since its inception, NASA has placed an emphasis on the safety of its astronauts during their voyages into space. However, even with safety procedures and equipment, the manned space flight program has suffered several catastrophic losses, including the crews of Apollo 1, STS-51-L, and STS-107. Since the two space shuttle disasters, NASA has focused research on the development of an Integrated Systems Health Management (ISHM) platform to ensure the highest level of safety possible for future endeavors in space [7, 8].
NASA defines Integrated System Health Management (ISHM) as "a capability that focuses on determining the condition (health) of every element in a complex System (detect anomalies, diagnose causes, prognosis of future anomalies), and provide data, information, and knowledge (DIaK) - not just data - to control systems for safe and effective operation" [9]. NASA's vision is to incorporate ISHM from the beginning of conceptual design through the end of the manufacturing cycle for future missions. By allowing safety to influence conceptual design, engineers can catch potential failures and anomalies in systems before they are fully designed. By catching these flaws early enough, the best opportunity for cost savings can be exploited at the earliest stages of development. The development and implementation of ISHM in the complex systems designed by NASA can also create additional costs if not applied correctly. Therefore, risk analysis tools are being created that find a balance between cost, performance, safety, and reliability throughout the system lifecycle. NASA Stennis Space Center (SSC) in Mississippi is the location of one group researching ISHM technologies. NASA-SSC's primary responsibility is the testing of all rocket engines before they are launched from NASA Kennedy Space Center. This includes the Space Shuttle Main Engine (SSME) and the new J-2X, both of which are critical to the success of their respective missions. While the SSME is being phased out, the J-2X is a brand new engine in its first stages of testing. The engines require highly combustible fluids such as liquid hydrogen and liquid oxygen to create thrust of up to 294,000 lbs [10]. The engines are bolted down to massive superstructures and are fired for the exact amount of time the engine stays lit during a live launch. To date, no shuttle mission has ever been delayed or aborted due to an engine failure [11].
To continue this perfect track record, an ISHM module is being created for the newly renovated A-Complex test stands. One of the important aspects of the ISHM module is the determination of failures and anomalies in the valves of the test stand. These valves are responsible for maintaining the precise flow of cryogenic fluids needed to fuel a test article. The cost of these test articles is extremely high, and even a small discrepancy in the flow rate of cryogenic fluids can cause catastrophic events. The cost to run a test program can be in the millions of dollars, and extended delays can cause the cancellation of an entire program. The restrictive constraints placed upon the operation of the test stands require the test engineers to continually monitor valve operations; at the first sign of degradation, repairs must be made quickly and efficiently. Currently, human engineers perform the analysis on the valve data, but the implementation of an intelligent framework with algorithms to monitor the health of the valves in the system could give additional insight to the engineers at NASA-SSC. The valuable statistical, diagnostic, and prognostic information introduced with such a framework could generate advisories that, when combined with the domain experts' opinions, would produce the greatest accuracy in maintenance decisions.

1.3 Objectives

The objectives of this thesis are:
1. To design a framework for the detection of faults and failure modes in the large linear actuator valves that are used on the rocket engine test stands at NASA-SSC.
2. To develop a diagnostic process that -
a. Receives and stores incoming sensor data;
b. Performs calculation of operating statistics;
c. Compares with existing analytical models; and,
d. Visualizes faults, failures, and operating conditions in a 3D GUI environment.
3. To develop a suite of diagnostic algorithms that can detect anomalous behavior in the valve and other system components of the rocket engine test stand.
4. To expand the capability of the diagnostic algorithms to perform prognosis in specific contexts.

1.4 Scope

The survey of current diagnostic and prognostic techniques focused on how to apply these algorithms to specific applications, and is presented in the background section of this thesis. The steps of a health analysis framework are also presented in the background section. The development of these algorithms is defined in the approach section, with specific applications to NASA-SSC's E-Complex test stand. In particular, the valve in question is the large linear actuator valve, which is a critical component of the test stands at NASA-SSC. The algorithms are tested on both actual data from the rocket engine test stands and simulated data from forward analytic models.

1.5 Organization

This thesis is organized as follows. Chapter 1 provides introductory information on NASA's history and the motivation of the agency to develop an Integrated System Health Management framework for its rocket engine test stands. Possible applications, objectives, and expected contributions are also discussed. Chapter 2 provides a thorough background on the development process of health analysis algorithms and frameworks. An overview of the framework is given, and the succeeding sections provide detailed information for each step. Chapter 3 outlines the approach taken to develop diagnostic and prognostic algorithms for the detection of anomalies in sensor data from the valves at NASA-SSC. Chapter 4 is an account of the results of creating the functional database and intelligent valve framework, following the premises outlined in Chapter 3. Chapter 5 is a summary of the accomplishments presented in this thesis, as well as future research recommendations.

1.6 Expected Contributions

This thesis will provide a detailed summary of existing methods for health analysis with applications to the ISHM components used at NASA-SSC.
It will also provide a literature review of existing diagnostic and prognostic algorithms. A functional database that utilizes neural networks will be integrated into the existing ISHM framework. This database will detect alarms in the subsystems and components of the test stand. These alarms can then be used for root cause analysis to pinpoint faults and failures in the complex test stands. This thesis will also present the approach and results of an intelligent valve framework. The framework will be used in two modes: health analysis algorithms and a diagnostic process. The diagnostic process, which is run in real time during tests, will be responsible for capturing operating statistics, thermal model diagnostics, and a 3D model of the valves. The health analysis algorithms, which will be run after a test series has completed, are responsible for the development and validation of advanced diagnostic and prognostic algorithms for the determination of the remaining useful life of the valves. These algorithms will eventually convey advisory information to NASA engineers for maintenance options on the valves at the E-Complex test stand.

Chapter 2: BACKGROUND

The following section contains a summary of previous work performed in the areas of fault diagnosis, fault detection, and prognostics. A detailed method for the entire health analysis framework is given. Finally, a discussion of various system identification techniques is presented.

2.1 Health Analysis

As engineering systems have become more complex, the cost to maintain these systems has also increased dramatically. Therefore, research in the area of system health analysis has emerged over recent years to help alleviate the cost of these expensive machines. The research has been split into two major areas: condition-based maintenance (CBM) and prognostics and health management (PHM) [2].
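The remaining-useful-life idea at the heart of PHM can be illustrated with a deliberately naive sketch: fit a straight line to a degrading health indicator and extrapolate to the time it crosses a failure threshold. Real PHM algorithms are far more sophisticated; the indicator values and threshold below are invented for illustration only:

```python
def remaining_useful_life(times, health, failure_level):
    """Least-squares line fit to a degrading health indicator,
    extrapolated to the time it crosses failure_level."""
    n = len(times)
    mt = sum(times) / n
    mh = sum(health) / n
    slope = sum((t - mt) * (h - mh) for t, h in zip(times, health)) / \
            sum((t - mt) ** 2 for t in times)
    intercept = mh - slope * mt
    t_fail = (failure_level - intercept) / slope
    return t_fail - times[-1]  # time left after the last observation

# Hypothetical: health degrades from 1.0 toward a failure threshold of 0.2.
t = [0, 1, 2, 3, 4]
h = [1.0, 0.9, 0.8, 0.7, 0.6]
print(remaining_useful_life(t, h, 0.2))  # approximately 4.0 time units remain
```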
CBM focuses on the detection of faults in a system and then labeling the specific component that caused the fault. This methodology replaces traditional scheduled maintenance, which commonly resulted in working parts being replaced before their useful life had expired. PHM algorithms attempt to determine the remaining useful life of a system after a fault has occurred. Knowing the remaining life of a system can minimize the downtime risk for critical systems in manufacturing plants [12]. As systems become more advanced, physical modeling has become too expensive to develop in a timely fashion and can become too specific to be useful in health management. Therefore, systems are broken into smaller subsystems that can be modeled more easily. Ideally, these subsystems can be modeled by first-order physics equations. If this degree of complexity is not sufficient, system identification techniques are used to model a system based on historical data. These techniques and their application in system health management will be explored in the following sections of this thesis.

2.2 Framework for Health Analysis

Modeling the entire health of a system can be very complex and is impractical in most cases. Therefore, the health analysis approach divides systems, subsystems, and components into sections arranged in a pipeline that streamlines the entire process. While the input and output formats of the sections are defined, each is treated as a black box where only pertinent information is passed on to the next level of the analysis. The entire pipeline can be seen in Figure 1 [2], with a description of each section following.

Figure 1 - Integrated approach for system health analysis.

2.3 Design and Trade Studies

The first step of the health analysis process is to examine the system from the top level and determine the best approach for each failure mode identified. In 2002, a formal methodology was accepted by the U.S.
Department of Defense called integrated product and process design (IPPD) [3]. IPPD defines the following tasks:
- Define the problem
- Establish value
- Generate feasible alternatives
- Evaluate alternatives
- Recommend a decision
The IPPD framework is applied to the system during the design phase. Its main purpose is to provide guidance to the engineers designing the system. IPPD uses a morphological matrix that lists the functions of the system and proposes alternative methods of accomplishing those functions. An example of a morphological matrix for the redesign of a rail bogie can be seen in Table 1 [13].

Table 1 - An example morphological matrix of a redesigned rail bogie.
Function: To connect the wheel-set and the carriage | Solutions: Carriage spring; Bogie; Bogie with single-stage suspension
Function: To allow the primary suspensions to work simultaneously | Solutions: Coaxial helical springs + shock absorber; Helical springs working in parallel + shock absorber; Helical springs working in parallel + shock absorber
Function: To reduce oscillations between the bogie and the carriage frame | Solutions: Helical springs working in parallel with shock absorber; Coaxial pressure spring + rubber small block

When CBM/PHM is allowed to contribute to the design at an early stage, more reliable systems can be built based on past experience of which failures occur most often in which equipment. While the morphological table presents the best technology to perform each function in a system, it is not always feasible within a budget to build a system with the most state-of-the-art components. Therefore, the morphological table must be presented side by side with a quantitative analysis of the benefits of each component. Decision analysis is a well-studied field that provides several techniques to quantify the options available to the design engineers. A mathematical model has been developed for the selection of the best alternative attributes based on incomplete preference information to assess attribute weights [14].
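One common way to quantify such design alternatives is simple additive weighting: defuzzified attribute scores for each morphological-matrix alternative are multiplied by expert-assigned weights and summed. The sketch below is a toy illustration; the attribute names, weights, scores, and alternative labels are hypothetical, not taken from the cited studies:

```python
def weighted_score(scores, weights):
    """Simple additive MADM score: sum of attribute score x attribute weight."""
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical attributes: cost, reliability, maintainability (weights sum to 1).
weights = [0.5, 0.3, 0.2]
alternatives = {
    "coaxial helical springs + shock absorber": [7, 9, 6],
    "parallel helical springs + shock absorber": [8, 7, 7],
}
ranked = sorted(alternatives,
                key=lambda a: weighted_score(alternatives[a], weights),
                reverse=True)
print(ranked[0])  # highest-scoring design alternative
```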
There are various methods of multiple attribute decision making (MADM) which are ideal for quantifying the attributes in the morphological matrix [15]. To completely satisfy the tasks of the design and trade studies phase, all design aspects of a system must be chosen using the techniques described above. Final design choices should be made only after expert opinions have been solicited or simulation studies performed [2]. All design alternatives should be accompanied by some technique of numerical ranking in order to best select the attributes which solve the functions required by the system while staying within the budget constraints placed on the system. After these choices have been made, the output of the design and trade study section is a design of a system and subsystems which accomplish the task with the greatest reliability possible. The next stage then analyzes these designs from a health standpoint in order to understand the failure modes of the system.

2.4 Failure Mode Analysis

Understanding not only what component fails in a system, but why it fails, is critical to any health analysis platform. To perform complete health analysis, these failures must be classified by their criticality in the system. This field of study has become known as failure modes and effects analysis (FMEA), and many methods have been presented in the literature. NASA Ames Research Center developed a failure mode mechanism through clustering analysis. The analysis includes a statistical clustering procedure to retrieve information on the set of predominant failures that a function experiences [16]. The Society of Automotive Engineers (SAE) has also developed an FMEA procedure specifically for the automotive industry. They split their approach into separate procedures for the design phase and for the manufacturing and assembly phase. It contains recommendations for appropriate terms, requirements, ranking charts, and worksheets.
The SAE standard is not as general as the other mentioned methodologies, which makes it usable only for the automotive industry. Therefore, the work in the remainder of this thesis will focus on general standards that can be applied to any health analysis framework [17]. The United States military developed a procedure for performing FMEA in Military Procedure MIL-P-1629. The evaluation criteria of this standard determined the effect of system and equipment failures. The criteria were extended in MIL-STD-1629A in order to add criticality analysis to the failure modes. NASA formally developed and applied the 1629A method in the 1960s to improve the reliability of its space program. The 1629A standard has become the most widely accepted method used throughout the military and commercial industry [18]. Even though 1629A is considered a standard, in many applications it is applied more as a template that is altered and updated to meet the needs of the project. For example, similar to the SAE standards, the design process is separated into multiple phases: System FMEA (SFMEA), Design FMEA (DFMEA), Process FMEA (PFMEA), and Service FMEA. A diagram of each is shown in Figure 2, with a description following in Table 2 [19].

Figure 2 - The four types of failure mode and effect analysis (FMEA).

Table 2 - Description of the four types of failure mode and effect analysis (FMEA).
Type    | Focus                                                   | Objectives and Goals
System  | Minimize failure effects on the system                  | Maximize system quality, reliability, cost, and maintainability
Design  | Minimize failure effects on the design                  | Maximize design quality, reliability, cost, and maintainability
Process | Minimize process failures on the total process (system) | Maximize the total process (system) quality, reliability, cost, maintainability, and productivity
Service | Minimize service failures on the total organization     | Maximize customer satisfaction through quality, reliability, and service

To create a complete and thorough standard process that can be used in a wide variety of applications, certain terminology has been defined in the MIL-STD-1629A document in order to simplify the communication channels between the design and FMEA teams. The overall objective of the FMEA process is to discover all of the ways a process or product can fail. Failures occur not only because of design or manufacturing flaws, but also through misuse of the product by the operator. That is why it is essential to investigate all four types of FMEA, which leads to a study that follows a product from concept and design through manufacturing and distribution. While these evaluations are not guaranteed to be comprehensive, any customer complaints can be addressed due to the understanding of the system based on the failure modes and effects that have been analyzed [20]. An FMEA is a straightforward process that allows a system to be broken down into easily analyzed parts where failure modes are identifiable. A formal definition given by NASA Lewis Research Center distinguishes three specific components as the objectives of an FMEA [21]:
1. Analyze and discover all potential failure modes of a system;
2. The effects these failures have on the system;
3. How to correct and/or mitigate the failures or their effects on the system.
The effects of these failure modes can be more difficult to determine from a system level.
A design FMEA can be conducted by a bottom-up approach, where the lowest level component is analyzed, or by a top-down approach, where an upper level failure is chosen and then the lower level effects are analyzed. Figure 3 shows these two approaches to failure analysis [21].

Figure 3 - Reliability analysis procedure for bottom-up and top-down FMEA approaches.

Once the FMEA approach has been selected, failure modes are classified based on a set of parameters including severity, frequency of occurrence, and testability. It is common for these criteria to be classified by fuzzy values rather than numerical values, as seen in Table 3. The fuzzification of the values allows the FMEA to be performed on systems without large amounts of quantitative data on the faults of a system. The study also identifies the symptoms that the system exhibits while under the fault condition, as well as recommendations for the observers that can monitor and track the fault as it occurs [2]. The observer selected to identify a fault may not always be a physical sensor, but rather the features that can be extracted from data in order to build a diagnostic algorithm. To identify these key components, domain experts must contribute to the FMEA study, particularly those experts who have experience with the exact components being used to perform a specific function in the system. After the parameters are given values, the priority of each is listed on a scale or in a table in order to identify the key components of a system where health diagnostic algorithms should be developed. Once these algorithms are developed, the system is reevaluated by the same parameters, but with improved testability and occurrence scores for those failure modes which have been addressed [20].

Table 3 - Possible values of the parameters used in a FMEA.
Parameter               | Possible Values
Severity                | Catastrophic; Critical; Marginal; Minor
Frequency of occurrence | Likely; Probable; Occasional; Unlikely
Testability             | Comments based on domain expert's knowledge

Two downsides arise from the use of fuzzy logic in the FMEA process. The first is that no quantitative priority number can be deduced from the fuzzy values. A very straightforward solution, and the most commonly applied, is to defuzzify the values into a scalar range from 1-10 for each of the parameters. The resulting values are then multiplied together to form a priority number, commonly known as the risk priority number (RPN) [22]. In the same manner as before, after a diagnostic model of the failure mode with the highest RPN is designed and verified, the RPN is readjusted based on the newly evaluated occurrence and testability parameters. Eqs. (2.1) and (2.2) give the formulas for the RPN and the readjusted RPN [23].

RPN = Occurrence × Severity × Testability    (2.1)

% Reduction in RPN = (RPN_initial − RPN_reduced) / RPN_initial    (2.2)

The other downside of the fuzzy FMEA approach is the lack of a biasing tool for the parameters. For example, if one component has a high likelihood of occurrence but low severity and testability, and another component has high severity but low occurrence and testability, their priorities may fall at exactly the same location in the FMEA. While in some applications this priority scoring may be desired, some failure modes must be identified based strictly on their severity to the system. Therefore, a criticality analysis is added, which weights the parameters based on the application's goals and the design team's concerns. The addition of the criticality parameter results in a ranking system based on the severity classification of a failure mode, as well as the probability of occurrence based on historical data.
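Numerically, Eqs. (2.1) and (2.2) with defuzzified 1-10 scores work out as in this short sketch; the parameter values for the failure mode are hypothetical:

```python
def rpn(occurrence, severity, testability):
    """Risk priority number, Eq. (2.1): parameters defuzzified to a 1-10 scale."""
    return occurrence * severity * testability

def pct_reduction(rpn_initial, rpn_reduced):
    """Readjusted RPN, Eq. (2.2), expressed as a fraction of the initial value."""
    return (rpn_initial - rpn_reduced) / rpn_initial

# Hypothetical failure mode before and after a diagnostic model improves testability:
before = rpn(occurrence=7, severity=9, testability=8)  # 504
after = rpn(occurrence=7, severity=9, testability=3)   # 189
print(before, after, pct_reduction(before, after))     # 504 189 0.625
```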
If there is no historical data, a qualitative approach must again be used, but the more desirable approach is again the quantitative number scaling used above. Depending on the failure modes and effects criticality analysis (FMECA) standard being used for the system, the scaling will change based on specifications put forth by the designer. Failure mode and effects criticality analysis is a very important, but often overlooked, section of the health analysis framework. This may be partially due to the amount of time, effort, and research that must be put into the collection and analysis of data. Collaboration in a FMECA system is essential for it to be performed correctly, particularly when there are varying systems that require advice from different domain experts. Also, FMECAs should be done iteratively throughout the life of a component to guarantee that all failure modes are identifiable and recommended actions can be performed in the event of a real failure when the product is being used by the customer [2].

2.5 CBM Testing, Data Collection, and Data Analysis

After the potential failure modes of a system have been identified, the next step in the health analysis framework is the design of the required instrumentation and data-acquisition system in order to gather baseline data under real operations. One system-level approach to the design task is to decompose a system into six distinct parts. This hierarchy, developed by Pennsylvania State University's Applied Research Laboratory, allows data acquisition to be planned at the lowest level before the system is even constructed. The hierarchy is comprised of areas of focus that can be examined by multiple levels of engineers and scientists [24]:

Figure 4 - System decomposition for CBM testing, data collection, and data analysis.

By dividing a system into these six specific levels, the range of applicable health analysis algorithms is broadened to support many different fields of engineering.
For example, by analyzing the material of a subsystem, non-destructive evaluation can be used to determine degradation, whereas at the system level it would be more difficult to see the applicability of such techniques. Studies have also been performed on how materials degrade under hostile conditions [25, 26]. These previous studies can be applied to different CBM applications in order to minimize the amount of redundant research being performed. Another method, developed at the University of South Carolina (USC), relies on historical data and a relational database to tag key anomalies while maintenance is being performed. The historical data is retrieved from Maintenance Management Systems (MMS), which are traditional maintenance records that hold information on faults and the repair actions performed. These systems are used by companies and manufacturers to optimize and control the maintenance of their facilities [27]. With an abundance of fault and failure data in an MMS, a link can be built between the MMS and a Health and Usage Monitoring System (HUMS) in order to monitor vehicle component parameters. The system developed by USC attempts to create this data link by extracting metadata from the MMS textual descriptions and combining it with the statistical analysis performed by the HUMS. The integrated service benefits greatly from large amounts of both qualitative and quantitative historical data; however, without a common data format for both the MMS and HUMS, USC's MMS-HUMS link is very application-specific and difficult to apply to existing structures [25]. Example implementations of USC's method can be found in [28].

2.6 Algorithm Development - Diagnostics

Once faults have been seeded, proper sensor instrumentation has been selected, and data has been obtained, algorithms must be developed to detect failure modes as early as possible.
Diagnosis is a subject studied not only for machine systems, but also in other disciplines such as medicine, science, business, and finance [29-31]. While the application and objective of each is different, the methodology of detecting anomalous conditions using appropriate sensor data is the same. Due to the vast number of applications for diagnosis, this research area has been well studied in recent years. The two areas of focus in fault diagnostics are [2]:

Fault Diagnosis: Detecting, isolating, and identifying an impending or incipient failure condition; the affected component (subsystem, system) is still operational, albeit in a degraded mode.

Failure Diagnosis: Detecting, isolating, and identifying a component (subsystem, system) that has ceased to operate.

The overall concept consists of three essential tasks [29]:

Fault Detection: detection of the occurrence of faults in the functional units of the process, which lead to undesired or intolerable behavior of the whole system.

Fault Isolation: localization (classification) of the different faults.

Fault Analysis or Identification: determination of the type, magnitude, and cause of the fault.

Applications require different approaches depending on the nature of the faults and the system. The amount of historical data available also strongly influences the choice of method. If large amounts of fault data have been collected, automatic clustering algorithms can be utilized with fuzzy logic or neural networks to detect known faults in a system or component [30]. Conversely, if little fault data is available, model-based approaches must be used in an attempt to create an accurate physical representation of the system. Figure 5 shows the diagnostic and prognostic framework [2]. The following sections present an in-depth, though not exhaustive, summary of various diagnostic methods that have been applied in various health frameworks.

Figure 5 - Diagnostic and Prognostic Flowchart.
2.6.1 Preprocessing and Feature Extraction

Diagnostic algorithms can be considered a subset of pattern recognition and machine learning, since the objective is to classify the current state of a machine based on incoming sensor data. As such, the raw data provided will not always yield the greatest classification percentage; instead, the data must be analyzed in different forms and combinations in order to extract information useful for a given fault. Inconsistencies in the data, such as process and measurement noise, must also be considered during processing in order to provide accurate and reliable results. When handling these anomalies, careful attention must be given to maintaining a proper balance between signal integrity and information loss. Preprocessing techniques normally involve tradeoffs, and based on the application the engineer must be able to judge the degree of noise reduction that must occur. In some instances a technique as simple as a low-pass filter will be sufficient, but in others more advanced techniques such as Kalman filters, wavelets, and artificial neural networks must be applied to the signal [31].

A classic problem in pattern recognition is the lack of information found in raw data. Therefore, pertinent information must be extracted from the sensor data using feature extraction. Many times, a very difficult problem can be reduced to a few variables by dimensionality reduction. These reductions in the size of the feature vector allow redundant data to be ignored while focusing the diagnostic algorithms on the information that is relevant to the problem being solved. Several feature extraction techniques will be discussed in the upcoming sections, but only as they pertain to the specific applications of interest in this research. References [2, 32, 33] provide additional insight regarding feature extraction.
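The preprocessing and feature-extraction pipeline described above can be sketched as follows. This is a minimal illustration only: the moving-average smoother stands in for the "simple low-pass filter" mentioned in the text, and the mean/RMS/peak features are generic examples, not the specific features used in this research.

```python
import math

def moving_average(signal, window=5):
    """Simple low-pass preprocessing: smooth measurement noise in a sample window."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def extract_features(signal):
    """Reduce a raw window of samples to a small feature vector (dimensionality reduction)."""
    mean = sum(signal) / len(signal)
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    peak = max(abs(x) for x in signal)
    return {"mean": mean, "rms": rms, "peak": peak}

# Noisy raw samples -> smoothed signal -> compact feature vector
raw = [0.0, 1.2, 0.1, 1.1, -0.2, 1.3, 0.0, 1.0]
features = extract_features(moving_average(raw, window=3))
print(features)
```

The choice of window length embodies the tradeoff noted above: a wider window removes more noise but discards more signal information.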
2.6.2 Techniques for Diagnostics

Fault diagnostics requires a careful choice of implementation based on the objectives of the project as well as the data provided to the CBM designer. These implementation choices have been the subject of numerous investigations in recent decades. Reference [2] lays out several major objectives that a CBM system must encompass:

Ensure enhanced maintainability and safety while reducing operation and support cost.
Be designed as an open systems architecture.
Closely control PHM weight.
Meet reliability, availability, maintainability, and durability (RAM-D) requirements.
Meet monitoring, structural, cost, scalability, power, compatibility, and environmental requirements.

The technologies utilized to accomplish these objectives are split into two major areas: model-based and data-driven. Model-based approaches involve the development of an accurate physical model of the system under evaluation. Incoming sensor data is then monitored and compared against the model in order to find residuals. The major benefit of the model-based approach is its ability to detect unanticipated faults; for mission-critical systems, this ability is an invaluable resource. Conversely, the major drawback of model-based approaches is the complexity of modern machine systems. If a system's dynamics are too complex, developing a model that is accurate enough to find faults without false positives may prove too difficult or costly to be a viable solution. In cases where determining a model is impractical due to the complexity of a system, a data-driven approach is an alternative that relies on parameter estimation from historical data to create a mathematical model of the system. Machine learning and system identification techniques are very common methods in such circumstances.
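The model-based residual comparison described above can be sketched in a few lines. The physical model here is hypothetical (a simple linear gain standing in for, say, valve flow versus command), and the threshold value is illustrative; a real system would use a validated dynamic model and a tuned threshold.

```python
def model_output(u, gain=2.0):
    """Hypothetical physical model of the monitored component:
    predicted response to input u (illustrative linear gain)."""
    return gain * u

def detect_fault(inputs, measurements, threshold=0.5):
    """Model-based detection: flag samples whose residual
    (measurement minus model prediction) exceeds the threshold."""
    flags = []
    for u, y in zip(inputs, measurements):
        residual = y - model_output(u)
        flags.append(abs(residual) > threshold)
    return flags

inputs       = [1.0, 2.0, 3.0, 4.0]
measurements = [2.1, 3.9, 6.0, 9.5]   # last sample deviates from the model
print(detect_fault(inputs, measurements))  # [False, False, False, True]
```

Because the flag is raised for any large residual, not just residuals matching known fault signatures, this structure is what gives model-based methods their ability to catch unanticipated faults.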
Machine learning techniques such as artificial neural networks, support vector machines, and fuzzy logic allow engineers to classify faults based on sensor data without any knowledge of the underlying system. System identification techniques such as regression, black-box, and state-space models create a mathematical representation of the system by estimating known physical parameters. It is essential for a design engineer to understand that while the mathematical model can accurately depict the output of a system, it contains no information about the physical dynamics of the system, in contrast to the model-based approach. Moreover, both machine learning and system identification rely on a large amount of historical data to create a robust algorithm that allows accurate classification of faults. This requirement is one of the major drawbacks of these technologies, because a CBM system is often designed in parallel with the hardware of a system, when little to no operational data exists [34]. Figure 6 shows flow charts for both techniques [2].

Figure 6 - Model-based and Data-driven diagnostic techniques.

As discussed in the previous section, there is often a lack of information in raw sensor data. Therefore, a feature vector must be constructed that contains enough information to determine the current operating mode. In a model-based system, that vector will usually contain the physical parameters that define the system's dynamics; in a data-driven method, statistical regression and system identification will most likely be used. Once the parameters that make up the feature vector have been found, they can be compared with a library of fault vectors to determine the current fault state of the machinery. Many times, a complex problem can be simplified to a few parameters extracted from raw data. Once the fault has been found, advisories can be generated based on a database of corrective maintenance actions for each individual fault.
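The comparison of a feature vector against a library of fault vectors can be sketched as a nearest-signature lookup. The fault names, two-dimensional feature space, and Euclidean-distance matching rule below are all illustrative assumptions; the text does not prescribe a particular matching method.

```python
import math

# Hypothetical library of fault signatures in a 2-D feature space
FAULT_LIBRARY = {
    "nominal":      [1.0, 0.1],
    "bearing_wear": [1.4, 0.6],
    "seal_leak":    [0.6, 0.9],
}

def classify(feature_vector):
    """Match the extracted feature vector against the fault library
    by Euclidean distance; return the nearest fault state."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(FAULT_LIBRARY, key=lambda name: dist(feature_vector, FAULT_LIBRARY[name]))

print(classify([1.35, 0.55]))  # nearest signature is "bearing_wear"
```

The returned fault state could then index into a corrective-maintenance database to generate an advisory, as the text describes.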
The next section discusses prognostic algorithms that attempt to determine the remaining useful life (RUL) once a fault has occurred [2, 29]. The following table shows recent research in the literature and describes several diagnostic approaches along with their applications.

Table 4 - Diagnostic algorithms from the literature.

[35] V. Puig, J. Quevedo, T. Escobet, F. Nejjari, and S. de las Heras, "Passive Robust Fault Detection of Dynamic Processes Using Interval Models," 2008: Model-based fault detection based on interval models that generate adaptive thresholds using three schemes (simulation, prediction, and observation).

[36] H. Bassily, R. Lund, and J. Wagner, "Fault Detection in Multivariate Signals With Applications to Gas Turbines," 2009: Compares multivariate autocovariance functions of two independently sampled signals to create a model-based algorithm that detects faults in a gas turbine.

[37] C. H. Lo, Eric H. K. Fung, and Y. K. Wong, "Intelligent Automatic Fault Detection for Actuator Failures in Aircraft," 2009: Utilizes a fuzzy-genetic algorithm to detect different types of actuator failures in a nonlinear F-16 aircraft model.

[38] G. Spitzlsperger, C. Schmidt, G. Ernst, H. Strasser, and M. Speil, "Fault Detection for a Via Etch Process Using Adaptive Multivariate Methods," 2005: Uses an adaptive method to overcome false alarms in slowly degrading manufacturing processes monitored with Hotelling T² and squared prediction errors.

[39] W. R. A. Ibrahim and M. M. Morcos, "An Adaptive Fuzzy Self-Learning Technique for Prediction of Abnormal Operation of Electrical Systems," 2006: Details an intelligent adaptive fuzzy system with self-learning functions that monitors electrical equipment.

[40] S. Huang and K. K. Tan, "Fault Detection and Diagnosis Based on Modeling and Estimation Methods," 2009: Uses multiple radial basis functions to estimate both the unknown nonlinear dynamics and the fault characteristics of a simulated system.

[41] J. Yun, K. Lee, K. Lee, S. B. Lee, and J. Yoo, "Detection and Classification of Stator Turn Faults and High-Resistance Electrical Connections for Induction Machines," 2009: Proposes a stator-winding turn-fault detection algorithm using sensorless zero-sequence voltage or negative-sequence current measurements.

The authors of [35] demonstrated a technique using interval models to detect faults. The paper compares and contrasts several different interval models; in particular, it shows the benefits and drawbacks of simulation, observation, and prediction interval models. The authors applied their fault detection algorithm to the European Research Training Network DAMADICS servo motor. They used several different parameter configurations for each model and an optimization criterion to create an adaptive threshold; when the input signal crosses the threshold, a fault indicator is set to a high state. The paper shows that the simulation method had the greatest accuracy because its model does not depend on current inputs, whereas the adaptive threshold in the prediction and observation methods follows the input sensor values too closely and produces either too many false alarms or too many false negatives. Quantitative and qualitative analyses are given for each method.

The autocovariance function of any zero-mean stationary d-dimensional signal can be used to determine whether two independently sampled signals are statistically identical. Reference [36] presents the theory behind this claim and goes on to provide insight on how to use the property for diagnostics.
The authors develop a statistical measure to determine signal equality, then apply the measure to bivariate white noise and compare the results with the empirical probabilities of several simulated models to ensure the feasibility of the statistical measure. To show applicability to machinery, the method is applied to a gas turbine at Clemson University. Several artificially induced faults are tested, including added synthetic noise, partial blockage, and compressor relief valve failure. The final results were compared to standard dynamic principal component analysis, and it was found that the statistical measure was able to detect the faults earlier and with greater accuracy.

The authors of [37] developed an automatic fault detection system using genetic algorithms and fuzzy logic. The fuzzy-genetic algorithm is proposed to eliminate the need for hardware redundancy in aircraft, suggesting instead that analytic redundancy is sufficient when a robust algorithm is applied to the dynamic behaviors of such a system. The algorithm claims the capability to detect four failure modes: no fault, elevator failure, aileron failure, and rudder failure. It detects these failures by first fuzzifying the residuals and then evaluating them with an inference mechanism using if-then rules. To optimize the rule table referenced by the inference mechanism, a genetic search algorithm is used. The fuzzy rule table is coded into a chromosome and the fault modes are represented as integers. The chromosomes are set to the size of the fuzzy rule table and decoded for the fuzzy evaluation system. The fitness value of each chromosome is compared to the optimal objective function in order to determine the optimal fuzzy rule table for each fault based on the residuals of the sensor data. The algorithm is applied to a simulation study of faults in a nonlinear F-16 aircraft model.
The system is compared to a linear classifier and a neural network. The results show that the fuzzy-genetic algorithm performs well on all faults and is very resistant to measurement noise in the residuals; the proposed algorithm outperforms both the linear classifier and the neural network in all cases.

In the semiconductor industry, Hotelling T² and the squared prediction error are gaining acceptance for monitoring data provided by modern process tools. These methods require models based on the covariance matrix of the training data set, and problems arise from the slow drift of modern manufacturing processes: false alarms are created during the estimation process, which effectively negates the benefits of the diagnostic algorithms. To counteract these drawbacks, an adaptive method for multivariate models is developed [38]. The authors take a current adaptive method of centering and scaling and expand it to incorporate domain knowledge as well as remodeling using a moving-window approach. In each case the Hotelling T² chart is tuned based on the current drift of the system. The results were not seen as promising, and it was found that engineering knowledge was more important to the updating of the individual univariate models than the automatic updating of the proposed methods.

The presence of fuzzy logic in diagnostic environments is gaining acceptance due to its flexibility and soft classification of faults. One of the drawbacks of fuzzy logic is the domain knowledge and historical data required to create an accurate and robust fuzzy rule table. Reference [39] attempts to overcome these requirements by creating a self-learning process that can predict failure modes in a monitored system. The algorithm first determines the number of data points required to find the underlying trend successfully. The next step is to determine how long a period is required to fully define a trend.
Wavelet denoising is then applied to the signal to create a clean signal for the fuzzy logic predictor. Two fuzzy techniques are considered by the authors. The first uses a single fuzzy system to not only learn a trend, but also indicate whether or not the trend is part of a trend previously learned by the system. The second technique uses one fuzzy system that learns the specific data trend, and a second, general fuzzy system that compares the incoming data to the trend produced by the first. Both techniques perform well on long- and short-term simulations; both were able to select an adequate number of data points and an adequate period for detecting fault trends in simulated data. The authors compare the two techniques and describe the applications for each.

Artificial neural networks (ANNs) have been accepted as a way to perform function approximation without knowledge of the underlying system. The ability of an ANN to adjust its weights using optimization techniques and activation functions makes it an ideal candidate for diagnostic algorithms. The authors of [40] propose an algorithm that uses multiple radial basis function (RBF) neural networks to not only detect faults, but also diagnose them. The first RBF network is trained on nominal system data in order to learn the mechanics of the system. If sufficient data is provided and the optimal weight vector is found, the RBF network then becomes a state observer, and residuals can be calculated from the incoming sensor data. These residuals are compared to an existing threshold, and if they fall outside the bounds of error, the observer indicates that the system is in a failure-mode state. At this point fault detection has been performed, and a second RBF network must be used to perform fault classification and diagnosis. The second RBF network uses online tuning methods to diagnose the current failure mode of the system.
This second RBF network is initialized with its output weights set to zero in order to force the initial state to be a "no failure" case. As it is trained on the online data, its failure feature is compared with those of well-understood failures to diagnose the system's fault. The neural network was tested with simulation data of a linear motor in order to demonstrate its feasibility for real-world use; the authors consider validation on a real robotic system to be future work.

Among the leading root causes of failure in an industrial plant are open- and short-circuit faults in the electrical circuits of the motor and electrical-distribution system. These failures must be continually monitored to guarantee reliability. Reference [41] identifies a monitoring technique to find stator-winding turn faults using sensorless methods. From simple current and voltage measurements, the faults can be detected by identifying modes of zero-sequence voltage and negative-sequence current, which can be related to turn faults and high-resistance connections. The authors used the dynamic model of the motor, which had been derived in references given by the paper. Experiments were performed on a 4P 380-V 10-hp induction motor in order to demonstrate the feasibility of the algorithm. The stator-winding turn faults and high-resistance electrical connection faults were detected and diagnosed using the proposed method. The results promise added benefits and flexibility for maintenance schedules in industrial plants.

2.7 Algorithm Development - Prognostics

The ability to predict faults and failures in a machine can yield great benefits for both manufacturers and users. Prognostics is the field of study that attempts to solve the very difficult problem of predicting the future states of systems. There are many more challenges in predicting failures than in simply identifying the current state of the machinery.
In addition, once a failure has been detected or predicted, prognostic algorithms must also determine the propagation of the fault through the rest of the system. As with diagnostics, the ability to predict the future state of a system is widely researched in fields other than engineering and health analysis. Financial researchers have long attempted to forecast the stock market and provide investors with insight into how the market fluctuates [42]. Meteorologists use artificial intelligence along with advanced radars and sensors to predict storm paths and the formation of natural disasters such as tornadoes and hurricanes [43]. Even with years of research, advances in sensor technologies, and developments in the mathematical models of such systems, prediction is still probabilistic, and multiple scenarios must be taken into account to ensure the highest reliability.

Prognostics can be broken into three categories: experience-based, evolutionary or trending models, and physical model-based. Experience-based prognostics is the most general of the three, in that the algorithms are applicable to almost every machine system. This class of algorithms usually relies on the expert analysis of engineers who have worked extensively with the system. With expert domain knowledge, a maintenance schedule can be developed, and engineers can make decisions with assistance from statistical measures and probability functions. The evolutionary model requires enough historical data to develop an accurate mathematical representation of the system. When the model is implemented in the health analysis framework, future values are predicted based on the operating conditions and previous sensor data inputs; from these values, the model can predict future outputs and when faults will occur.
Since these models are based strictly on the input data, synthetic inputs can be built to simulate different operating conditions that the system may encounter during its operational lifetime. However, because the model is built from historical data, it is difficult for it to predict what will happen under abnormal conditions. Therefore, any simulations run outside normal operating conditions should be taken as an uncertain advisory rather than an actual prediction of what will happen. The more historical data available to create the mathematical model, the greater its accuracy and robustness when deployed.

The final method, building physical models, is the most costly yet most accurate approach to prognostics. Physical modeling requires a dynamic model that extracts parameters from the system in order to predict its future state. Once a physical model has been made, prediction technologies such as autoregressive moving-average techniques or Kalman filters can be applied to predict future states of the system. Physics-based models require knowledge of both past and current conditions in order to create a dynamic model that can be applied at any point during the lifetime of a component. One benefit of physical models is their ability to predict the remaining useful life (RUL) of a component without any knowledge of faults that have occurred in the system, though such information can increase the overall accuracy of the prognostics system. Physical models require thorough engineering knowledge of the system to find quantitative measures of material properties and physical parameters that represent the health of the system. These measures are then predicted based on the current operating conditions and are accompanied by a probabilistic model which provides an uncertainty factor [44]. Figure 7 shows the three prognostic approaches along with their scope of work, cost, and accuracy [2].
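A minimal instance of the trending (evolutionary) category is extrapolating a degradation feature to a failure threshold. The sketch below assumes a linear trend and made-up wear data; as noted above, real prognostic systems use richer models (e.g., ARMA techniques or Kalman filters) and attach an uncertainty estimate to the prediction.

```python
def fit_linear_trend(times, values):
    """Least-squares line fit to a degradation feature (e.g., a wear level)."""
    n = len(times)
    mt = sum(times) / n
    mv = sum(values) / n
    num = sum((t - mt) * (v - mv) for t, v in zip(times, values))
    den = sum((t - mt) ** 2 for t in times)
    slope = num / den
    return slope, mv - slope * mt

def remaining_useful_life(times, values, failure_threshold):
    """Extrapolate the fitted trend to the failure threshold and
    return the predicted time remaining after the last observation."""
    slope, intercept = fit_linear_trend(times, values)
    if slope <= 0:
        return float("inf")  # no degradation trend detected
    t_fail = (failure_threshold - intercept) / slope
    return max(0.0, t_fail - times[-1])

hours = [0, 10, 20, 30, 40]
wear  = [0.10, 0.14, 0.19, 0.23, 0.28]   # monotonically increasing feature
print(remaining_useful_life(hours, wear, failure_threshold=0.50))
```

The point estimate illustrates why such predictions should be paired with a probabilistic model: the further the extrapolation, the larger the uncertainty.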
Figure 7 - Approaches for prognosis.

The following table shows recent research in the literature and describes several prognostic approaches along with their applications.

Table 5 - Prognostic algorithms from the literature.

[44] F. Peysson, M. Ouladsine, R. Outbib, J. B. Leger, O. Myx, and C. Allemand, "A Generic Prognostic Methodology Using Damage Trajectory Models," 2009: Presents a prognostic framework that decomposes a system into three levels (environment, mission, and process); decision and data fusion between the three levels is used to create predictions.

[45] Z. Sun, J. Wang, D. Howe, and G. Jewell, "Analytical Prediction of the Short-Circuit Current in Fault-Tolerant Permanent-Magnet Machines," 2008: Describes an analytical technique to predict the short-circuit current in a fault-tolerant permanent-magnet machine under partial-turn short-circuit fault conditions.

[46] Y. Zhang, G. W. Gantt, M. J. Rychlinski, R. M. Edwards, J. J. Correia, and C. E. Wolf, "Connected Vehicle Diagnostics and Prognostics, Concept, and Initial Practice," 2009: Presents a complete end-to-end framework for diagnostics and prognostics of General Motors vehicles, along with initial results of the implemented framework.

[47] M. Baybutt, C. Minnella, A. E. Ginart, P. W. Kalgren, and M. J. Roemer, "Improving Digital System Diagnostics Through Prognostics and Health Management (PHM) Technology," 2009: Integrates prognostics and diagnostics across engineering disciplines to provide minimally invasive onboard monitoring of digital systems.

[48] P. Lall, M. N. Islam, M. K. Rahim, and J. C. Suhling, "Prognostics and Health Management of Electronic Packaging," 2006: Investigates methods to determine material state in complex systems and subsystems in order to determine RUL; electronic packaging is specifically targeted as a candidate for such methods.

[49] S. K. Yang, "A Condition-Based Failure-Prediction and Processing-Scheme for Preventive Maintenance," 2003: Uses an application-specific integrated circuit (ASIC) to perform preventive maintenance using Petri nets and Kalman filter prediction; the application of the ASIC is a thermal plant.

[50] A. H. Al-Badi, S. M. Ghania, and E. F. El-Saadany, "Prediction of Metallic Conductor Voltage Owing to Electromagnetic Coupling Using Neuro Fuzzy Modeling," 2009: Presents a fuzzy algorithm that can predict the voltage level of a metallic conductor; provides simulation results and validation for three scenarios.

The authors of [44] present an overview of the prognostic approaches described in the previous section. They utilize these different technologies in the design of a prognostics system for a ship, and they extend the technologies by incorporating not only operating conditions and sensor readings, but also the environmental conditions under which the system is placed during its lifetime. The framework creates a formal method for modeling a complex system based on the mission (operating condition), environment, and process. The process is decomposed into resources, where a resource is a piece of equipment or a set of equipment. The mission is defined as the use of the system during a time period; the framework analyzes the start and end dates of the mission as well as the set of places where each task is performed. The environment is the area where the system operates; it can be characterized by a set of environmental variables that include air temperature, air humidity, and wind force. The environment variables are then fuzzified, which allows a definition of the impact an environment has on the system. A rule base can then be defined in order to perform fault diagnosis and prognosis. The fusion of all three elements (mission, environment, and process) provides a damage trajectory that predicts the degradation of resources, subsystems, and the overall system. A simulation was created in which a ship was traveling on a tour of Africa.
Different missions and processes were created to test the degradation of the ship during the voyage. Initial results showed that the framework was indeed a feasible method for predicting the degradation of a complex system.

Fault-tolerant permanent-magnet machines are showing promise in the aerospace and automotive sectors. Fault models have been developed for such machines, but thus far only lengthy processes have been available for predicting failures. Reference [45] presents an analytical approach that quantifies various parameters of the machines. These parameters are then used to identify worst-case short-circuit scenarios at the design stage and to formulate remedial actions. The derivation of the short-circuit current is provided, along with a validation by finite element analysis. Experimental validation was also conducted by seeding various failure modes into the machines and checking whether the analytic model correctly identified the short-circuit current. The results showed promise for the short-circuit current as a viable feature for fault detection and prediction; in particular, a Kalman filter was recommended to extract the fundamental components of the feature and predict future faults.

The authors of [46] present a methodology for diagnostics and prognostics for vehicles, specifically those manufactured by General Motors (GM). Three key challenges are faced by vehicle manufacturers: unexpected new faults, infrequent and intermittent faults, and prediction of system RUL. Many vehicle manufacturers develop maintenance schedules for consumers, but parts are often replaced before their operational life is actually completed. To compensate for scheduled maintenance, a concept called Connected Vehicle Diagnostics and Prognostics (CVDP) has been developed, in which fault data is stored in onboard electronics and downloaded by the manufacturer during maintenance services.
The fault data is then analyzed to determine root causes for the intermittent faults of the vehicles. If-then rules are applied to the data of a battery management system in order to detect any failure modes. The conditions of the rules are currently specified by domain experts, but future work will allow adaptive thresholds to be computed once sufficient data is acquired. A weighting of the parameters that caused the failure is then computed based on the number of if-then rules violated. These weights allow engineers to determine the root cause of intermittent failures that would previously not have been detected. Preventive maintenance can then be performed when the same parameters in other cars of the same model are seen to be degrading. The system has been deployed in a GM manufacturing plant.

Digital systems are now present in the everyday life of most consumers. Since manufacturing techniques are not fault-proof, systems often fail before the end of their expected lifetime. In mission-critical situations, especially in the military or manufacturing sectors, these faults can produce catastrophic events. Therefore, the authors of [47] present a technique for the detection and prediction of faults in digital electronic systems. The focus of the paper is on the degradation of MOSFET devices and four particular failure modes: thermal cycling, hot carrier effects, time-dependent dielectric breakdown (TDDB), and electromigration. The system used to test the PHM methods is an MPC7447 processor, and faults are seeded to accelerate its degradation. Aggregate power of the processor is tracked as the main feature of degradation. Multiple histograms are calculated over time and compared to analyze the feature and identify different failure modes for the processor. Based on the statistical feature vector, a percentage of life consumed is calculated from the amount of time the processor operates at a specific temperature.
From this percentage, RUL is calculated from a life consumption model and fault-to-failure progression data. The authors of [48] present a novel method of prognosis based on the damage caused by prior stress histories of electronic packaging. The paper states that the U.S. Air Force throws away 1000 components to remove a single unknown one that is predicted to be in a failed state based on a theoretical model. If analysis of the post-stress conditions of such components could be performed, the cost impact of prognostic methodologies could be immense, as wasted life is recovered without increasing risk. Components were tested as simulated thermal cycles were applied. From this data, a mathematical relationship was developed between phase growth and time to failure. Correlations were found between the rate of change of the phase growth parameter and existing macro indicators of damage. It is shown that RUL can be found based on the phase growth rate and the interfacial shear stress of the chip. State estimation is becoming a leading technology for the prognosis of complex machine systems. Kalman filters have become a particularly appealing solution, as they contain an error covariance that provides a confidence interval for the prediction through time. Reference [49] incorporates such methods with Petri nets to find and predict failures in a thermal plant. Petri nets are graphical representations of relationships between conditions and events; they allow the root causes of failures to be found and preventive maintenance to be performed only on those components which are failing. Kalman filters are then applied to the current state of the system in order to predict the following state. N-step state predictions can be performed as well, but the confidence of each step decreases as the error in the covariance matrix of the filter increases.
These methods were implemented on an application-specific integrated circuit and used in a thermal power plant to validate the framework. Initial results of the proposed scheme were seen to be very promising. The authors of [50] discuss how interference can couple from one circuit conductor to another without any physically connected components. A fuzzy model was conceived as a method to predict the interference caused by overhead transmission lines. The feature vector was calculated using linear correlation analysis, nonparametric correlation analysis, and partial correlation analysis. If-then rules were applied using training data obtained during the project. Fault current, soil resistivity, separation distance, and mitigation systems were the four fuzzified inputs, and total pipeline maximum voltage was the defuzzified output. The membership functions used in the fuzzy model are given in the paper, and the effect of interference from nearby metallic structures was analyzed. Excellent agreement between test and validation data was obtained for three different scenarios.

2.8 Reliability Centered Maintenance

RCM is defined as an analytical process used to determine appropriate failure management strategies to ensure safe and cost-effective operation of a physical asset in a specific operating environment. It relies heavily on prior knowledge of the system and subsystems under evaluation. It was developed after it was found that most systems were being replaced before the end of their useful life. It compares the requirements of the component from a user perspective with the design reliability of the component. When employed, it is used in conjunction with the FMECA as a reference for the CBM and PHM portions of the health analysis framework to guarantee that the following seven questions are answered during a failure [2, 51]:

1. What is the item supposed to do, and what are its associated performance standards?
2.
In what ways can it fail to provide the required functions?
3. What are the events that cause each failure?
4. What happens when each failure occurs?
5. In what way does each failure matter?
6. What systematic task can be performed proactively to prevent, or to diminish to a satisfactory degree, the consequences of the failure?
7. What must be done if a suitable preventive task cannot be found?

2.9 System Identification Techniques

System identification is the process of building mathematical models of dynamical systems from observed data. These methods can save the cost and time of having an engineer develop physical models of a system. They usually require a large, well-annotated database of historical system data in order to build a robust mathematical model. There are three entities involved in creating these mathematical models [34]:

- A data set
- A set of candidate models
- A rule by which candidate models can be assessed

Figure 8 shows the general system identification loop.

Figure 8 - The system identification loop.

2.9.1 Autoregressive Models

The most basic of system identification techniques is a linear difference equation between the input and output of a system. While there are continuous-time models in system identification, discrete-time models are used most often in practice.
These difference equations are known as autoregressive models and are written as:

$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_1 u(t-1) + \dots + b_m u(t-m)$ (2.3)

This notation may be rearranged to solve for the next output value given the previous observations:

$y(t) = -a_1 y(t-1) - \dots - a_n y(t-n) + b_1 u(t-1) + \dots + b_m u(t-m)$ (2.4)

To account for measurement and process noise, a zero-mean white noise term can be included, with additional coefficients that model the error as a moving average:

$y(t) = -a_1 y(t-1) - \dots - a_n y(t-n) + b_1 u(t-1) + \dots + b_m u(t-m) + e(t) + c_1 e(t-1) + \dots + c_{n_c} e(t-n_c)$ (2.5)

To correctly model a system mathematically, the coefficients of Eq. 2.3 must be calculated. There are various methods that can calculate the coefficients based on recorded inputs and outputs over a time interval. Two of the most popular are the Levinson-Durbin recursive algorithm and the least squares method [34].

2.9.2 Kalman Filters

State-space models form a relationship between the input, noise, and output signals using an auxiliary state vector. These models incorporate the physical mechanisms of the system. One type of state-space model is the Kalman filter, which was developed in the 1960s [34]. The discrete Kalman filter is defined in two steps, a time update and a measurement update. The time update equations are defined by:

$\hat{x}_{k+1} = A_k \hat{x}_k + B_k u_k + G_k w_k$ (2.6)

$P_k^- = A_k P_{k-1} A_k^T + Q$ (2.7)

where $A_k$ and $B_k$ are matrices of parameters that correspond to unknown values of physical coefficients, material constants, etc., $G_k$ is a matrix of parameters describing the process noise in the system, $\hat{x}_{k+1}$ is the predicted state vector at the next time step, $x_k$ is the internal state vector, $w_k$ is the process noise of the system, $u_k$ is the control input to the system, $P_k$ is the a posteriori estimate error covariance, and $Q$ is the process noise covariance.
The measurement update equations are defined by:

$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1}$ (2.8)

$\hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-)$ (2.9)

$P_k = (I - K_k H) P_k^-$ (2.10)

The first step in the measurement update is to compute the Kalman gain, $K_k$. The process or sensor is then actually measured and the result placed into $z_k$. This is used to generate an a posteriori state estimate by incorporating the new measurement data. The final step is to obtain an a posteriori error covariance estimate as in Eq. 2.10. The goal of the Kalman filter is to minimize the a posteriori error covariance. The equations are recursive, which makes them appealing for practical applications. In the field of prognosis, the Kalman filter can perform multiple time updates without a measurement update to predict health variables in the future [2].

Chapter 4: APPROACH

This thesis attempts to build a framework for an intelligent valve module for ISHM. This framework is based on the health analysis framework discussed in the previous section. This section will focus on the specific approach taken to fulfill the objectives of each segment in the framework. Most of the work presented is in general support of valves in mission-critical situations, but some applies specifically to the NASA-SSC test stand environment. The particular valve that will be analyzed is the large linear actuator valve (LLAV), which is responsible for the distribution of cryogenic fluids to the test stand and test articles. Figure 9 shows the regions of interest of the LLAV.

Figure 9 - LLAV with regions of interest labeled.

3.1 Failure Modes

Valves are a critical component of the day-to-day operations at NASA-SSC. The valves must be precisely machined to meet the strict specifications set forth by the test stand operators. These specifications raise the price of a valve, which can reach tens of thousands of dollars. Though manufacturing of the valves is meticulous, physical degradation still occurs because of the strenuous environment in which the valves operate.
In particular, the LLAV must transport cryogenic and noncryogenic fluids at high pressures to test articles on the test stands at NASA-SSC. Therefore, a FMECA must be performed in order to classify and rank the important failure modes for the LLAV. The analysis was performed in the early stages of the project in order to guarantee that the algorithms developed could detect the failure modes of the valves. Since the valves have already been developed and the sensors have been chosen, the goal of this FMECA is to identify the critical faults and attempt to find solutions with the current capabilities at NASA-SSC. The LLAV FMECA was performed in conjunction with Scott Jensen, a NASA-SSC test operations engineer and domain expert in the valves on the test stands. He was able to provide valuable insight into the operational characteristics of the valves in the E-Complex test stand. These characteristics include the role the LLAVs fulfill, descriptions of the different components in a LLAV, the signs of degradation in the LLAV, and the common failure modes that have been identified by the NASA-SSC test operations engineers. The information was compiled and risk priority numbers were calculated to prioritize the failure modes identified during the study. The algorithms and framework were then designed around the specific task of collecting data that could identify and eventually predict these failure modes. Table 6 and Figure 10 show the results of the FMECA:

Table 6 - Failure modes and effects for LLAV.
Function: Controller for the cryogenic fluid tank. Failure mode: Seat wear causes leaking fluid. Effects: Fluid can enter the system during a test, causing catastrophic failure.

Function: Monitor the feedback of the valve and downstream pressure. Failure mode: A faulty pressure sensor falsely indicates valve failure. Effects: Incorrect valve maintenance may be performed.

Function: Packing at the top of the valve prevents leaks and allows for balanced pressure. Failure mode: When frozen, the packing can crack and break apart, degrading the performance of the valve. Effects: The valve may not function properly or be able to maintain the pressure needed for a test.

Function: The actuator must transition from fully open to fully closed in a consistent amount of time. Failure mode: If the valve does not open or close at consistent timings, valve maintenance must be performed. Effects: Emergency shutdown procedures may not be performed properly.

Function: The controller of the valve sends the valve to full close. Failure mode: If the PID controller is unstable or commands the valve to a value it cannot reach, the actuator may "bounce" on the seat, causing degradation in the soft metal. Effects: Seat wear (described above) can occur more quickly, resulting in delays and increased maintenance costs.

Function: The valve feedback must respond to the control signal in an appropriate time for effective test operations. Failure mode: Excessive "deadtimes" create poor timing in test operations and can cause pressure or flow mixture errors. Effects: If the mixture is not precise in certain test articles, undesired results can occur.

[Figure: scatter plot of Criticality (x-axis, 0-11) versus Risk Priority Number (y-axis, 0-2200) for the failure modes Seat Wear, Frost Point, Sensor Failures, Extended Deadtimes, Transition Times, and Seat Bouncing.]

Figure 10 - Prioritization of LLAV failure modes (see Equations 2.1 and 2.2 for y-axis calculation).

3.2 Intelligent Valve Framework

Once the failure modes were found, the framework could be constructed based on the requirements set forth by NASA. Figure 11 shows the system level flow chart of the framework.
Figure 12 shows the detailed health analysis framework for the intelligent valve.

Figure 11 - System level flowchart of the Intelligent Valve framework (NASA Data Acquisition System (DDE Server) → DDE Client → measurement data → WonderWare .NET Plugin → G2 Diagnostic Environment → health data → Virtual Reality Environment).

Figure 12 - Health analysis framework for the Intelligent Valve.

3.2.1 Data Acquisition

The E-Complex operations center utilizes both the User Datagram Protocol (UDP) and the Dynamic Data Exchange (DDE) protocol to transmit data between the networked computers in its test stands. Under the advice of NASA test operators, it was decided that the best method of acquiring data into the plug-in was via the DDE pipeline. The selection of DDE over UDP provided several benefits, as well as certain drawbacks that must be accounted for in the development process. Some of the benefits were:

- The data could be acquired as simple strings rather than by parsing the UDP packet's binary format.
- The data would already be formatted into engineering units based on the calibration sheets used in the UDP format.
- WonderWare and LabVIEW, both used for test operations, have built-in support for Network DDE (NDDE).
- Since the developers will not be at Stennis for the tests, application setup is easier with DDE because of the prior knowledge the test engineers possess.
- The framework can request just the specific data it requires for its algorithms, reducing its network footprint.

The drawbacks that must be overcome are:

- The maximum DDE transfer rates are much lower than UDP.
- The DDE data packet does not include an accurate time stamp for data annotation.
- While WonderWare still includes DDE with its applications, the developers of the protocol, Microsoft, have not updated or supported it for over a decade.

The most crucial drawback in the selection of the DDE protocol is the absent time stamp in the data packet.
Fortunately, NASA has a link to the Inter-Range Instrumentation Group (IRIG) system, which provides highly accurate timestamps to the networked computers in the test stands. In the software, this IRIG timer is used to timestamp all the data as soon as it is acquired from the system. While there are still some delays from the data acquisition software, this provides accuracy sufficient for comparing data for algorithm development purposes. If the framework ever became a "mission-critical" component, the accuracy issue would have to be addressed more strictly.

3.2.2 Preprocessing

To validate incoming data, threshold checks are performed at the acquisition of each data point. The thermocouple data are subjected to the following test:

$T_{min} \le T \le T_{max}$

where $T_{min}$ and $T_{max}$ are the minimum and maximum temperatures of the thermocouple type. The following table gives the type and temperature range for some commonly used thermocouples:

Table 7 - Thermocouple types and ranges.

Type | Minimum Temperature (°C) | Maximum Temperature (°C)
J | 0 | 750
K | -200 | 1250
E | -200 | 900
T | -250 | 350

The valves are also subjected to the threshold test:

$-5 \le V \le 100$

where $V$ is the feedback or control signal of the valve. While it does not seem intuitive that the valve state can be below zero, the operators at NASA use this method to guarantee that a tight seal is being created between the actuator and the soft metal at the bottom of the valve. The data acquisition systems used in the E-Complex test stand perform preprocessing techniques themselves in an attempt to deliver noiseless signals to the test stand computers. Therefore, there is no need for advanced preprocessing techniques in the intelligent valve module. Moreover, this allows the module to classify any noise detected in the signal as an anomaly rather than as process and measurement noise.
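The range checks just described can be sketched in a few lines. This is an illustrative sketch only (the function and table names are not the thesis's implementation); the thermocouple ranges are taken from Table 7 and the valve limits from the -5 to 100 rule:

```python
# Sketch of the preprocessing range checks. Names are illustrative;
# thermocouple ranges follow Table 7 (degrees Celsius).
THERMOCOUPLE_RANGES_C = {
    "J": (0, 750),
    "K": (-200, 1250),
    "E": (-200, 900),
    "T": (-250, 350),
}

def thermocouple_in_range(tc_type: str, temp_c: float) -> bool:
    """Return True if a reading lies within the type's rated range."""
    lo, hi = THERMOCOUPLE_RANGES_C[tc_type]
    return lo <= temp_c <= hi

def valve_signal_in_range(v: float) -> bool:
    """Valve feedback/control may legitimately read slightly below zero
    (the tight-seal check described above), hence the -5 lower bound."""
    return -5 <= v <= 100
```

A reading that fails either check would be flagged as invalid rather than passed on to the diagnostic algorithms.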
3.2.3 Failure Mode Detection and Diagnosis

The failure modes were investigated based on the FMECA, with priority given to those with a high RPN.

3.2.4 Valve Operational Statistics

Seat wear is one of the most severe and costly failure modes that can occur in the LLAV. Not only is it expensive to replace the valve seat and insert, but it also forces excessive delays in projects. It is very difficult to obtain a direct quantitative measurement of the seal without the use of additional sensors. There are studies into detecting seat wear and recession, but all use external instrumentation, such as x-ray machines, that is not available for this research. Even though a direct measurement is not possible, combining the valve's operational statistics with the test operators' expert knowledge can provide information and advisories for maintenance teams. After consulting with the test operations team at NASA-SSC, seven statistics were selected for observation. They are as follows:

- Transitions - The number of times the valve has traveled from completely open to completely closed with non-cryogenic fluid flow.
- Cryogenic Transitions - The number of times the valve has traveled from completely open to completely closed with cryogenic fluid flow.
- Distance Traveled - The linear distance the valve has traveled, in inches.
- Last Transition Time - The time it took for the valve to go from completely open to completely closed.
- Average Transition Time - The average of the last ten transitions from completely open to completely closed.
- Direction Changes - The number of times the valve has changed motion, either from opening to closing or from closing to opening.
- Number of Closings - The number of times the valve has come to a completely closed state.

These statistics can be used to measure how the valve is performing under certain operating conditions.
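As an illustration, several of these statistics could be accumulated from a stream of valve position samples along the following lines. This is a simplified sketch with hypothetical class and field names, not the thesis's actual event logic (which is given by the Figure 13 algorithm); positions are assumed to run from 0 (fully closed) to 100 (fully open):

```python
class ValveStatistics:
    """Accumulates a subset of the operating statistics listed above from
    successive valve position samples (0 = fully closed, 100 = fully open).
    Illustrative sketch; names and event rules are assumptions."""

    def __init__(self):
        self.transitions = 0        # full-open -> full-close, non-cryogenic
        self.cryo_transitions = 0   # full-open -> full-close, cryogenic
        self.distance = 0.0         # total linear travel (input units)
        self.direction_changes = 0
        self.closings = 0           # times the valve reached fully closed
        self._last = None
        self._last_dir = 0
        self._was_fully_open = False

    def update(self, position, cryogenic=False):
        if self._last is not None:
            delta = position - self._last
            self.distance += abs(delta)
            d = (delta > 0) - (delta < 0)   # sign of the motion
            if d != 0:
                if self._last_dir != 0 and d != self._last_dir:
                    self.direction_changes += 1
                self._last_dir = d
            if position == 0 and self._last > 0:
                self.closings += 1
                if self._was_fully_open:    # completed a full transition
                    if cryogenic:
                        self.cryo_transitions += 1
                    else:
                        self.transitions += 1
                    self._was_fully_open = False
        if position == 100:
            self._was_fully_open = True
        self._last = position

# Example stroke: open fully, back off, reopen, then close completely.
stats = ValveStatistics()
for p in (0, 50, 100, 60, 100, 0):
    stats.update(p)
```

For the sample stroke above, the tracker records one non-cryogenic transition, one closing, three direction changes, and 280 units of travel.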
To detect these events, an algorithm, shown in Figure 13, was developed based on the changing state of the valve and the definitions presented previously.

Figure 13 - Valve statistics algorithm.

In the specific application of detecting seat recession, the statistics of relevance are transitions, cryogenic transitions, and number of closings. Under cryogenic conditions the metal packing hardens, reducing the amount of degradation on the seat. Conversely, non-cryogenic closings create a deeper impact and reduce the operational life of the seat. As stated in Table 6, seat bouncing can also adversely affect the seat if not detected. The number of closings can be observed between tests and compared to the number of closings the controller relayed to the valve during the test. If there is a large disparity between the two, it is an indication of bouncing due either to a valve fault or to controller instability. In either case, seat wear can be accelerated when there is a constant changing of force on the seat.

3.2.5 Auto-associative Neural Networks for Sensor Validation

In order to provide a test article with the correct mixture of propellants, pressure and fluid flow must be kept at very specific rates. This requires accurate sensors that can relay the current readings back to test operations. Sensor readings during failure modes can be unpredictable and can cause misclassified faults in a valve. For example, if a downstream pressure sensor has a near-zero reading after a valve is opened, it can appear as though the valve did not open properly. When this happens, weeks or months of unnecessary valve repairs may be performed instead of the day it takes to replace a sensor. There are two main approaches to handling this type of fault: physical and analytic redundancy. Physical redundancy requires the use of multiple, similar sensors in the same spatial location.
Often, three sensors are used and a majority-rules weighting system determines the actual reading. Analytic redundancy exploits functional relationships between components in the system. The functions are normally isolated into closely related subsystems to reduce their complexity. While physical redundancy is a more robust solution than analytic redundancy, it is not feasible in all situations. At the E-Complex test stands there is a limited number of sensors that can be attached to the data acquisition system for any given test. Also, running additional connections through the complex test stand is very costly, and safety protocols apply stringent rules to where and how wires can be run. Analytic redundancy can be applied to a system using either a complex model comprised of physical properties and equations or a mathematical model that approximates the functional relationship based on previous data. The physical model results in a very detailed understanding of the system, but is applicable only to the current system setup. Artificial neural networks (ANNs) have been used extensively in function approximation and pattern recognition. Specifically, auto-associative neural networks (AANNs) have been used in sensor validation because of their ability to perform nonlinear principal component analysis, which allows for the extraction of key features in a high-dimensional, nonlinear dataset [52, 53]. Reference [53] presents a training method for sensor validation with an AANN in which two training runs are performed. The first training run presents accurate training data to both the input and output in order to learn the functional relationships between the two. The second training run presents faulty data to the input, but accurate data to the output. This method allows the AANN to become "insensitive" to faulty data and extract only the proper features from the dataset. Figure 14 shows the two training methods.
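The two-run training scheme can be illustrated on a toy problem: two redundant "sensors" (the second nominally reading twice the first) and a tiny 2-1-2 linear bottleneck network trained by gradient descent. Run 1 uses clean input/target pairs; run 2 adds inputs with sensor 2 failed (stuck at zero) while keeping clean targets, so the network learns to reconstruct the failed sensor from the healthy one. The architecture, data, and every name below are illustrative assumptions, not the thesis's actual network:

```python
# Toy two-run AANN training for sensor validation (illustrative sketch).
ts = [-1.0, -0.5, 0.5, 1.0]
clean = [((t, 2 * t), (t, 2 * t)) for t in ts]
faulted = [((t, 0.0), (t, 2 * t)) for t in ts]   # sensor 2 stuck at zero

v = [0.5, 0.3]   # encoder weights (bottleneck of size one)
w = [0.4, 0.6]   # decoder weights

def forward(x):
    h = v[0] * x[0] + v[1] * x[1]        # bottleneck activation
    return h, (w[0] * h, w[1] * h)       # reconstructed sensor values

def train(data, epochs, lr=0.05):
    for _ in range(epochs):
        gv = [0.0, 0.0]
        gw = [0.0, 0.0]
        for x, y in data:                # accumulate squared-error gradients
            h, yhat = forward(x)
            e0, e1 = yhat[0] - y[0], yhat[1] - y[1]
            gw[0] += e0 * h
            gw[1] += e1 * h
            gh = e0 * w[0] + e1 * w[1]
            gv[0] += gh * x[0]
            gv[1] += gh * x[1]
        n = len(data)
        for i in range(2):
            w[i] -= lr * gw[i] / n
            v[i] -= lr * gv[i] / n

train(clean, epochs=30000)            # run 1: learn the sensor relationship
_, est = forward((1.0, 0.0))          # sensor 2 has failed
err_run1 = abs(est[1] - 2.0)

train(clean + faulted, epochs=60000)  # run 2: desensitize to the seeded fault
_, est = forward((1.0, 0.0))
err_run2 = abs(est[1] - 2.0)          # reconstruction error for failed sensor
```

After the second run the network should reconstruct the failed sensor's value substantially better than after the first, which is the "insensitivity" property the two-run method is designed to produce.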
Linear principal component analysis (PCA) can be beneficial in reducing high-dimensional datasets into their principal components. To accomplish this task, eigenvectors of the covariance matrix are used to maximize the variance of the dataset in a lower dimension, i.e.,

$YP = T$ (3.3)

where $Y$ is the sample set, $T$ is the transformed data, and $P$ is the matrix of eigenvectors of the covariance matrix. Nonlinear PCA extends the capabilities of linear PCA by using nonlinear functions instead of eigenvectors. In some cases, this can increase the variance of the selected dataset and result in less information loss than linear PCA during the dimensionality reduction. The following equations describe nonlinear PCA:

$T_i = G_i(Y)$ (3.4)

where $T_i$ is the transformed data and $G_i(Y)$ is a vector of nonlinear functions. In order to restore data in nonlinear PCA, another nonlinear function is needed:

$Y_j' = H_j(T)$ (3.5)

where $Y_j'$ is the restored data and $H_j(T)$ is a vector of nonlinear functions. A difficulty in nonlinear PCA is the determination of the nonlinear functions $G$ and $H$. However, it has been shown in previous work that functions of the following form are capable of fitting any nonlinear function to arbitrary precision:

$v_k = \sum_{j=1}^{N_2} w_{jk} \, \sigma\!\left(\sum_{i=1}^{N_1} w_{ij} u_i + \theta_j\right)$ (3.6)

where $v$ is the desired nonlinear function, $w$ are the network weights, and $\sigma(x)$ is a function that approaches 1 as $x$ approaches $\infty$ and 0 as $x$ approaches $-\infty$. A sigmoid satisfies this criterion:

$\sigma(x) = \frac{1}{1 + e^{-x}}$ (3.7)

Sigmoids are typical transfer functions in artificial neural networks. In order to perform the dimensionality reduction, a bottleneck layer is used in the hidden layers of a multilayer perceptron. This allows the common backpropagation training technique to be used for sensor validation in the auto-associative neural network.

Figure 14 - Training method for auto-associative neural networks for sensor validation.
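The linear PCA transform of Eq. 3.3 can be checked numerically. For a two-dimensional dataset the covariance matrix is 2x2, so its eigenvalues follow from the quadratic formula and the principal eigenvector can be written in closed form. The function name and structure below are illustrative, not the thesis's code:

```python
import math

def pca_2d(samples):
    """Return the principal eigenvector (unit length) and its eigenvalue
    for 2-D samples, from the 2x2 covariance matrix. Illustrative sketch."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    cxx = sum((x - mx) ** 2 for x, _ in samples) / n
    cyy = sum((y - my) ** 2 for _, y in samples) / n
    cxy = sum((x - mx) * (y - my) for x, y in samples) / n
    # Eigenvalues of [[cxx, cxy], [cxy, cyy]] via the quadratic formula.
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam = tr / 2 + math.sqrt(max(0.0, tr * tr / 4 - det))  # larger eigenvalue
    if abs(cxy) > 1e-12:
        vx, vy = lam - cyy, cxy       # eigenvector for lam
    else:                             # axis-aligned special case
        vx, vy = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

# Perfectly correlated data lying on the line y = 2x: one component
# should capture all of the variance.
direction, lam = pca_2d([(t, 2.0 * t) for t in (-2, -1, 0, 1, 2)])
```

For the example data the covariance matrix is [[2, 4], [4, 8]], the leading eigenvalue equals the total variance (10), and the principal direction is (1, 2) normalized, so projecting onto it loses no information.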
Another benefit of AANNs is their ability to predict the values of faulty sensors at the output. If utilized in a mission-critical situation, this can provide the information needed to continue a test even when a fault is detected. Also, this data can provide other fault diagnosis algorithms with accurate data that can narrow down the exact cause of faults in a system.

3.2.6 Thermal Modeling

While cryogenic fluid causes less wear on the seal at the bottom of the valve, there is another packing at the top that allows the valve to offset pressures in order to operate properly. If this packing freezes, there is the potential that it will crack and cause pressure equalization problems with the valve. This cracking in the packing is one of the reasons that the stem of the valve is so long. Since the machining of the valves is so precise, added inches in the stem can increase costs by tens of thousands of dollars. NASA-SSC performed a series of tests under simulated conditions in August 2006 in an attempt to establish a formula for valve frost points. From the tests, they discovered that complex thermodynamic equations were unnecessary to estimate the frost line; instead, a simple fin model gave accuracy up to 95%, which is sufficient for this application. The equation estimates the base temperature of the body by tracking the amount of time cryogenic fluid has been flowing through an open valve. This value can be projected up the valve based on a thermal fin equation provided by NASA-SSC engineers [54]:

$T_{tc} = (T_{amb} - T_{fluid}) \, e^{-t_{open}/m} + T_{fluid}$ (3.1)

where $T_{amb}$ is the ambient temperature, $T_{fluid}$ is the boiling temperature of the flowing cryogen, $t_{open}$ is the amount of time the valve has been open, $m$ is the amount of time it takes for the valve to reach its steady state, and $T_{tc}$ is the estimated base temperature of the body.
$T_{est} = \frac{\cosh(m_t (L_{valve} - L_{TC}))}{\cosh(m_t L_{valve})} (T_{tc} - T_{amb}) + T_{amb}$ (3.2)

where $L_{valve}$ is the length of the stem of the valve, $L_{TC}$ is the distance of the thermocouple from the base, $m_t$ is a material constant found experimentally for the valve, and $T_{est}$ is the estimated temperature of the thermocouple located at $L_{TC}$. This formula can be manipulated to solve for the frost line of the valve by setting $T_{est}$ to 32°F and solving for $L_{TC}$, i.e.,

$L_{TC} = L_{valve} - \frac{1}{m_t} \cosh^{-1}\!\left(\frac{32 - T_{amb}}{T_{tc} - T_{amb}} \cosh(m_t L_{valve})\right)$ (3.3)

This thermal model will be utilized to continually monitor the frost line of the valve both during tests and when the test stand is idle. The monitoring of the frost line provides two key benefits to NASA test operations. The first benefit is the ability to monitor how many times, and for how long, the seal at the top of the valve has been exposed to freezing temperatures. Knowledge of this statistic can assist the operator in diagnosing any anomalies or faults found in the valve data. The second benefit is the ability to monitor frost lines for future valve production. If a study can present conclusive evidence that the valves being used in the test stand are much longer than needed, tens of thousands of dollars can be saved when the existing valves need to be replaced.

3.2.7 Adaptive Thresholding

When preparing for a test, control algorithms are set to autonomously operate the valves. The timings are very specific, and the valve's behavior must remain consistent in order to guarantee proper test firings. There are various faults that can prevent the valve from operating correctly, but one of the most important details is how the valve reacts to the control input, independent of the operating conditions. Therefore, simulations of the valve's output based on the input can be run to estimate valve stroke timings and behavior.
The model used is a bank of autoregressive moving average (ARMA) filters with an optimization constraint to specify an adaptive threshold. This was first proposed in [35]. Figure 15 shows the algorithm for the design and choice of ARMA models for the adaptive thresholding, with a description following.

Figure 15 - Adaptive threshold algorithm for designing and choosing ARMA models.

The adaptive threshold is chosen based on two optimization functions:

$\underline{\hat{y}}(k) = \min_{\theta \in [\underline{\theta}, \overline{\theta}]} \left( G_u(q, \theta) u(k) \right)$ (3.4)

$\overline{\hat{y}}(k) = \max_{\theta \in [\underline{\theta}, \overline{\theta}]} \left( G_u(q, \theta) u(k) \right)$ (3.5)

where $\underline{\hat{y}}(k)$ and $\overline{\hat{y}}(k)$ are the minimum and maximum values of the simulated ARMA models, respectively, $G_u(q, \theta)$ is the transfer function of the ARMA model with coefficients $\theta$ and order $q$, and $u(k)$ is the control signal input at time $k$. The quality of each model is measured by:

$Fit = 100 \left(1 - \frac{\lVert y_h - y \rVert}{\lVert y - \mathrm{mean}(y) \rVert}\right)$ (3.6)

where $Fit$ is the percentage of the output variation that is explained by the model, $y_h$ is the estimated output, and $y$ is the measured output [17]. The historical data is assumed to be entirely nominal in order to design a set of models that can represent the whole set. During the training process, the fit equation is calculated to guarantee that the models are neither too tight nor too loose; a threshold is therefore set requiring the fit value to be above 70%. This threshold was found experimentally and may need to be refined based on the application. Once the models have been selected, they are run against testing data from a similar dataset. If any faults are found in this dataset, it can be concluded that there are not enough models to completely describe the data. Models are continually created, varying the number of coefficients, in order to build a complete representation of the dataset. Once a sufficient number of models have been created, the control algorithm can be run through the simulation and compared with the actual feedback from the valve during the test.
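The envelope test of Eqs. 3.4 and 3.5 can be sketched with a small bank of first-order models of the form y(k) = a*y(k-1) + b*u(k-1): every model in the bank is simulated against the same control input, the pointwise minimum and maximum form the adaptive threshold, and any measurement outside the envelope is flagged. The coefficient values and function names below are made up for illustration; the thesis's actual models are identified from test stand data:

```python
# Adaptive-threshold envelope over a bank of first-order models (sketch).
def simulate(a, b, u):
    """Simulate y(k) = a*y(k-1) + b*u(k-1) from rest for input sequence u."""
    y, out = 0.0, []
    for k in range(len(u)):
        y = a * y + b * (u[k - 1] if k > 0 else 0.0)
        out.append(y)
    return out

def envelope(bank, u):
    """Pointwise min/max of all bank members driven by the same input."""
    runs = [simulate(a, b, u) for a, b in bank]
    lo = [min(r[k] for r in runs) for k in range(len(u))]
    hi = [max(r[k] for r in runs) for k in range(len(u))]
    return lo, hi

def flag_faults(measured, lo, hi, margin=0.0):
    """Indices where the measurement leaves the [lo, hi] envelope."""
    return [k for k, m in enumerate(measured)
            if m < lo[k] - margin or m > hi[k] + margin]

bank = [(a, b) for a in (0.7, 0.8, 0.9) for b in (0.1, 0.2)]  # made-up ranges
u = [1.0] * 10                       # step input
lo, hi = envelope(bank, u)
nominal = simulate(0.8, 0.15, u)     # a plant inside the modeled family
faulty = [y + (0.5 if k >= 5 else 0.0) for k, y in enumerate(nominal)]
```

The nominal response stays inside the envelope at every step, while the offset injected at step 5 pushes the measurement above the upper threshold, which is the alerting behavior described above.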
The adaptive threshold can mark faults during the test, which can alert test operations to anomalous behavior. The simulation of the control algorithm and the feedback can be seen in Figure 16.

Figure 16 - Adaptive threshold algorithm simulation on real-time data.

3.3 Prognostic Survey

The ultimate goal of the intelligent valve framework is the ability to determine the remaining useful life of the LLAV. At this time, however, the prognostics portion of the framework is outside the scope of this research. Therefore, several prognostic techniques will be investigated, based on simple linear predictors as well as a state-space model. The linear predictors will consist of the autoregressive and autoregressive moving-average filters. The state-space model implemented will be the Kalman filter. These techniques will be used in conjunction with a neural network to determine their feasibility for future development in the Intelligent Valve framework.

3.4 Diagnostic Process

Creating a software framework that can be expanded in the future requires careful planning and structuring. Therefore, object-oriented programming (OOP) techniques were utilized to construct a backend acquisition and configuration protocol. An MS-SQL database schema was designed to store configuration information throughout tests in order to create a persistent environment. The schema for the MS-SQL database can be seen in Figure 17.

Figure 17 - Intelligent Valve database schema.

This database was designed in such a way that it meets the requirements of third normal form (3NF). The normalization of a database enforces guidelines that efficiently organize its data. The database defines the necessary attributes required to access data from the DDE servers at NASA-SSC. Valves contain several sensor measurements that must be monitored for the diagnostic process to work correctly.
These values are stored in the ValveDetails table, where the DDE tags and servers can be specified, as well as the length of the valve for the thermal models described previously. In order to store the operating history of a valve, the ValveStatistics table contains a column for each of the statistics described earlier. Each valve can have several thermocouples attached to its stem in order to validate the thermal model. The Thermocouples table holds a foreign key to the ValveDetails table to correlate thermocouples with their valves. This table also holds the current position of the thermocouple on the stem of the valve and the high and low thresholds used to set the flagged state of the thermocouples. The FluidDetails table holds the information used in the thermal model for several fluids, including their boiling points. The final table, DDE, contains the connection strings for the DDE servers in order to access all the sensor data. Each thermocouple, valve feedback, and valve control is required to have a foreign key to one of the DDE servers. The software framework was written in C# in order to simplify the development process. Also, since the more computationally intensive algorithms are performed offline, the speed benefits of C++ would have been minimal for this application. A class structure was defined that allows several user controls to share the same data in an efficient manner. Class interfaces and structures have been defined to allow extensibility of the Intelligent Valve framework. The first interface defines how a sensor receives values from the data servers. It includes a single function with parameters for the name of the item and the value captured by the data client. The reason for including the name of the value is that certain sensors, such as a valve, must keep track of multiple values, like the control and process variables. Currently, only a thermocouple class and a valve class have been developed that implement this interface.
The purpose of this interface, however, is to allow other sensors, such as pressure and strain, to be included in a single collection in the intelligent valve data handler. The next interface defines the functionality a data server is required to have in order to be included in the IV framework. The interface defines a number of function templates that allow the data handler to either sample incoming data at a set rate or subscribe to data. Since most data servers require drastically different implementations, this framework allows various data servers to be integrated seamlessly, in a way that is transparent to the IV data handler. The functions for the data server interface are listed in Table 8.

Table 8 - Data server class interface.

Method Name       Parameters           Description
RequestDelegate   None                 Allows the data handler to subscribe to any new data that is sampled.
StartRequest      String ItemName      Commands the data server to begin sampling the item when commanded by the data handler.
StopRequest       String ItemName      Commands the data server to stop sampling the item.
PerformRequests   Double elapsedTime   Commands the data server to sample all requested data. The elapsed time parameter is tracked by the IV data handler and represents the amount of time since the server was last sampled.
StartAdvise       String ItemName      Commands the data server to begin reporting the item whenever a new value is available.
StopAdvise        String ItemName      Commands the data server to stop reporting new values for the item.
Disconnect        None                 Disconnects the client from the server and stops all sampling and advise loops.
Stop              None                 Stops all sampling and advise loops, but does not disconnect from the server.
Resume            None                 Resumes all sampling and advise loops.

All data passed around the Intelligent Valve framework is a simple structure that has three fields: String Item, String Value, and String TimeStamp.
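The three-field record and the Table 8 interface can be sketched together as follows. This is an illustrative Python stand-in for the C# classes (the names `SensorReading` and `FakeDataServer` are hypothetical, and the in-memory server exists only so the flow can be exercised without a live DDE connection).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SensorReading:
    """The three-field record passed around the framework; all fields are
    strings, and each client converts them as needed."""
    item: str
    value: str
    time_stamp: str

class FakeDataServer:
    """Minimal in-memory stand-in exposing the Table 8 surface."""
    def __init__(self, readings):
        self.readings = readings      # item name -> (value, timestamp)
        self.requested = set()
        self.delegate = None          # plays the role of RequestDelegate

    def set_request_delegate(self, callback):
        self.delegate = callback

    def start_request(self, item_name):
        self.requested.add(item_name)

    def stop_request(self, item_name):
        self.requested.discard(item_name)

    def perform_requests(self, elapsed_time):
        # Sample every requested item and hand it to the data handler.
        for item in sorted(self.requested):
            value, stamp = self.readings[item]
            self.delegate(SensorReading(item, value, stamp))

    def start_advise(self, item_name):
        self.requested.add(item_name)

    def stop_advise(self, item_name):
        self.requested.discard(item_name)

    def disconnect(self):
        self.requested.clear()

received = []
server = FakeDataServer({"TC-01": ("-321.8", "09:26:53")})
server.set_request_delegate(received.append)
server.start_request("TC-01")
server.perform_requests(elapsed_time=0.1)
```

The point of the design is that the data handler only ever sees `SensorReading` records arriving through the delegate, no matter how the underlying server acquires them.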
Each client of the value is responsible for transforming the data into its own desired format. A static type for the value parameter increases the predictability of the values the client will receive and therefore reduces the amount of type and error checking that future developers must perform. The data handler encapsulates the entire backend of the Intelligent Valve framework. All controls in the framework receive a reference to this data handler and can subscribe to updates of the different sensor temperatures as well as to the update timer used when the data servers are commanded to sample. The data handler is also responsible for logging the sensor data into an MS-SQL database for the offline diagnostic tools. Figure 18 shows the entire class structure of the project.

Figure 18 - Software framework for the Intelligent Valve framework.

Chapter 4: RESULTS

Stennis Space Center test operators oversee the testing and validation of rockets for both NASA and private companies. While few accidents have occurred at Stennis, it is still important for test engineers to have a better understanding of the behavior of the valves on the test stands. To further their comprehension, the diagnostic algorithms mentioned above have been tested and validated against canonical data and against simulated and injected faults in test stand data.

4.1 Diagnostic Validation Data

Several datasets were used to validate the diagnostic algorithms and process discussed in the previous section. The following sections outline the procedures by which this data was collected and how faults were injected into the data.

4.1.1 Thermal Model Data

In order to verify the thermocouple models, a test apparatus was constructed by the test operations group. The setup was simple, but provided the ability to capture isolated anomalies to see how a thermocouple reacts under different operating conditions. The test was completed with the following protocol:
1.
A simulated valve was programmed into the WonderWare simulation environment.
2. When the simulated valve opened, liquid nitrogen (LN) was poured into the box containing the valve.
3. Over the next several hours, the liquid nitrogen was kept at a constant level in order to simulate the passing of fluid through an open valve.
4. The temperature and frost line were monitored after the body reached a steady-state temperature of -322 °F (the boiling point of LN).
5. One thermocouple was placed at the base of the valve and another about 20 inches up the stem; both were monitored and stored in a data file.

During the test protocol, anomalies were inserted periodically in order to simulate and capture failure modes commonly seen at the test stands. These anomalies included disconnecting the top thermocouple, decreasing the power supply voltage and current, connecting a resistor potentiometer to the amplifier input and output, debonding the thermocouple, and adding ice insulation. As stated previously, the thermocouples used for the frost line calculations have an error rate based on their type and measurement range. This measurement error, as well as the error associated with the thermal model, provides a threshold value that helps guarantee accurate data from the instrumentation. In order to more accurately determine the experimentally calculated values, m_t, an optimization algorithm was utilized based on a curve-fitting method with least-squares constraints.

4.1.2 Sensor Validation Data

In March 2006, NASA initiated the Methane Thruster Testbed Project (MTTP) as a platform for the research of plume diagnostics and ISHM. Historical data from live tests was used to train and test the AANN for sensor validation. Hard and soft faults were artificially injected into the test runs, and simple thresholding was used to determine when faults had occurred.
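The curve-fitting step of Section 4.1.1 can be sketched as follows. The thesis does not reproduce the thermal model equations here, so this Python sketch assumes a simple exponential chill-down model as a stand-in, with the rate parameter m_t recovered by linear least squares on the log of the normalized residual; the model form and every numeric value below are assumptions for illustration only.

```python
import numpy as np

def fit_chilldown_rate(t, temps, t_ss, t_0):
    """Recover m_t for an assumed exponential chill-down model
        T(t) = T_ss + (T_0 - T_ss) * exp(-m_t * t)
    by linear least squares on log((T - T_ss) / (T_0 - T_ss))."""
    residual = (temps - t_ss) / (t_0 - t_ss)   # normalized, in (0, 1]
    slope, _ = np.polyfit(t, np.log(residual), 1)
    return -slope                              # estimated m_t

# Synthetic chill-down data: 80 F ambient relaxing toward -322 F.
t = np.linspace(0.0, 600.0, 200)
m_true = 0.005
temps = -322.0 + (80.0 - (-322.0)) * np.exp(-m_true * t)
m_est = fit_chilldown_rate(t, temps, t_ss=-322.0, t_0=80.0)
```

On noiseless synthetic data the fit recovers the generating rate; on measured data the same least-squares step yields the experimentally determined m_t values reported later in Table 10.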
These artificial faults were characterized during the thermal model tests in order to create realistic faults in the data. The MTTP trailer can be seen in Figure 19.

Figure 19 - MTTP trailer used for validating sensor faults.

4.1.3 Adaptive Threshold Data

In order to validate the adaptive threshold model, extensive failure data would be needed that tracks a valve from nominal conditions through abnormal behavior and eventually to complete failure. This data is difficult to acquire, since valves are normally not left in service until they fail. Therefore, a simulated control system was needed that provided a method to show degradation in a valve's response based on adaptable parameters. A common transfer function used to simulate valves is given in Equation (3.2), with a description of each parameter to follow:

    V_process(s) = (g · e^(-T_s·s)) / (s² + 2·ζ·T_w·s + T_w²)    (3.2)

where g is the gain, T_s is the unit delay, T_w is the natural frequency, ζ is the damping ratio, and V_process is the output of the transfer function modeling a valve's response to a PID controller. As the parameters are changed, the valve's feedback should change accordingly; as the valve's performance degrades, the algorithm's adaptive threshold detects these changes and labels faults in the system. In order to model NASA-SSC as closely as possible, the control system uses a PID controller simulated in MATLAB's Simulink. Parameters for the PID were selected from common values used during live test firings at NASA-SSC. The proportional constant was set to 1 and the integral constant to 0.1. The parameters were modified over the following intervals:

Table 9 - Adaptive threshold simulation parameters.

Parameter           Nominal              Low Abnormal        High Abnormal
Gain                0.98 ≤ g ≤ 1.01      0.8 ≤ g < 0.98      1.01 < g ≤ 1.2
Natural Frequency   0.9 ≤ T_w ≤ 1.1      0.8 ≤ T_w < 0.9     1.1 < T_w ≤ 1.2
Damping Ratio       0.9 ≤ ζ ≤ 1.1        0.8 ≤ ζ < 0.9       1.1 < ζ ≤ 1.2
Delay               2 ≤ T_s ≤ 3          0 ≤ T_s < 2         N/A

4.2 Thermal Model Validation

In order to validate the thermal model, experiments were performed with 10 faults injected into a thermocouple bonded three inches up the stem of a fifteen-inch valve. The thermocouple data was compared to the thermal model, and a simple threshold of 22 °F was used to determine when a fault had occurred. This threshold was derived from the 95% accuracy of the thermal model: the overall range of temperatures runs from -322 °F to 80 °F, or approximately 400 °F, and approximately 5% of that range yields the 22 °F threshold.

4.2.1 Thermal Modeling

The first test performed at NASA-SSC was a base run to identify the valve's physical parameters. The least-squares optimization curve-fitting method described in the approach section was used to determine the parameters for the remaining tests. Table 10 lists the values found by the optimization algorithm, and Figure 20 shows the simulation results using these parameters.

Table 10 - Physical parameters obtained from least-squares optimization curve fit of the base run.

Parameter   Chill Down   Warm Up
M_t         659.80       4672
m_t         0.36         0.32

Figure 20 - Simulation data using thermal modeling for the base run.

Once the physical parameters were determined, they could be used to validate the thermal model's ability to detect anomalies by injecting faults during similar test runs. Disconnections were made at various locations during the test runs to see how the system responded. Figure 21 shows the test setup; the numbered locations will be referenced within parentheses, e.g. (13), throughout the following results. The bottom simulation throughout the tests provides inaccurate results because of its dependency on the ambient temperature.
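Stepping back to the adaptive-threshold data of Section 4.1.3: the transfer function (3.2) can also be integrated directly in the time domain. The sketch below is a minimal Python stand-in for the Simulink model, using forward-Euler integration of the equivalent second-order ODE with a delayed step input; the parameter values are illustrative picks from the nominal ranges of Table 9, not the thesis configuration.

```python
import numpy as np

def valve_step_response(g, zeta, t_w, t_s, t_end=40.0, dt=0.001):
    """Forward-Euler simulation of the ODE equivalent to (3.2),
        y'' + 2*zeta*t_w*y' + t_w**2 * y = g * u(t - t_s),
    for a unit step input u."""
    n = int(t_end / dt)
    y, v = 0.0, 0.0                      # position and velocity states
    out = np.empty(n)
    for i in range(n):
        t = i * dt
        u = 1.0 if t >= t_s else 0.0     # delayed unit step
        a = g * u - 2.0 * zeta * t_w * v - t_w**2 * y
        v += a * dt
        y += v * dt
        out[i] = y
    return out

# A nominal valve (Table 9 ranges); the response settles near g / t_w**2.
resp = valve_step_response(g=1.0, zeta=1.0, t_w=1.0, t_s=2.0)
```

Sweeping the parameters through the Low Abnormal and High Abnormal ranges of Table 9 then produces the degraded responses the adaptive threshold is meant to flag.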
These tests were run for several hours, and sometimes overnight, with only a single ambient temperature being recorded. Therefore, the measured bottom thermocouple is used as the input to the top simulation, except in the presence of a fault, in which case the simulated bottom value is used instead.

Figure 21 - Data acquisition setup for thermal modeling fault detection.

The first fault simulates a faulty connection before the amplifier (13) and after the patch panel (12). The faulty connection was simulated by connecting a potentiometer to the referenced locations and increasing its resistance quickly at 8230 and 8990 seconds. The fault detection was able to detect both faults accurately using the thermal modeling in the top thermocouple. However, there are some false positives reported in the chill-down phase of the test. While no fault was documented, abnormal behavior can be seen in the top thermocouple as it rises slightly when the temperature reaches its minimum. In determining the performance metrics, only documented faults were considered to be true positives, even if the measurements show unexpected results.

Figure 22 - Simulation data using thermal modeling for faulty connections in the Tustin amplifier input.

Figure 23 - Fault classification using thermal modeling for faulty connections in the Tustin amplifier input.

The next fault simulates a faulty amplifier (6, 13) as well as disconnects in the Tustin patch panel (7, 14).
The power-downs of the amplifier were performed at 5563 and 5910 seconds, with six input disconnections occurring at 7381, 7457, 7592, 7641, 9336, and 9363 seconds. Again, simple thresholding combined with the thermal equations was able to detect all faults accurately in both the top and bottom thermocouples. This test revealed no false positives in the top thermocouple, which is the desired metric for these tests.

Figure 24 - Simulation data using thermal modeling for amplifier power-downs and Tustin input disconnections.

Figure 25 - Fault detection using thermal modeling for amplifier power-downs and Tustin input disconnections.

In order to simulate a fault in the digitizer input, a potentiometer was connected between (13) and (14). Instead of a hard fault, the resistance was slowly increased at 6693 seconds to simulate a drifting connection. A hard fault was injected at 7750 seconds by quickly increasing the resistance. While both faults were detected, several false negatives were reported because the simulation was predicting values lower than the measured value. Therefore, since the fault was injected slowly, there was a delay before the error reached the threshold value indicating a fault.

Figure 26 - Simulation data using thermal modeling for faulty input connections in the digitizer.
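The simple thresholding used throughout these tests amounts to flagging any sample whose measured temperature deviates from the thermal-model simulation by more than the threshold. A minimal Python sketch (the 22 °F value comes from Section 4.2; the sample data below is invented for illustration):

```python
import numpy as np

THRESHOLD_F = 22.0  # deg F, from the model-accuracy argument in Section 4.2

def classify_faults(measured, simulated, threshold=THRESHOLD_F):
    """Flag a fault wherever the measured temperature deviates from the
    thermal-model simulation by more than the threshold."""
    measured = np.asarray(measured)
    simulated = np.asarray(simulated)
    return np.abs(measured - simulated) > threshold

# Invented samples: the third reading drifts far from the simulation.
measured  = np.array([-300.0, -295.0, -250.0, -298.0])
simulated = np.array([-301.0, -299.0, -299.0, -300.0])
flags = classify_faults(measured, simulated)
```

Running this per sample over each test produces the binary fault-classification traces plotted beneath the temperature curves in the figures of this section.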
Figure 27 - Fault detection using thermal modeling for faulty input connections in the digitizer.

During a humid day, the moisture in the air can change into ice as it comes into contact with the surface of the valve. When surrounding a thermocouple, the frost may act as an insulator and cause incorrect readings. In order to simulate this, water was applied to the valve stem while the test was occurring. The water then froze when that part of the valve reached the freezing point. In this test, there were no identifiable effects from the frost insulation. However, the top simulation estimates a steeper drop in temperature during chill down, which is recorded as a false positive. If the frost insulation occurred on the bonnet of the valve ...

Figure 28 - Simulation data using thermal modeling for simulated frost insulation test 1.

Figure 29 - Fault detection using thermal modeling for frost insulation test 1.

In the next test, frost insulation was again added, but this time the thermocouple was not in direct contact with the valve. This induced fault checks how frost can affect a loose thermocouple. Based on the top thermocouple's data in Figure 30, it can be seen that the top thermocouple lowered in temperature, but remained well above the actual temperature of the valve based on the top simulation data.
The simulation threshold method was again able to detect this fault with 100% accuracy.

Figure 30 - Simulation data using thermal modeling for simulated frost insulation test 2.

Figure 31 - Fault detection using thermal modeling for frost insulation test 2.

Figure 32 - Modified data acquisition setup for thermal modeling fault detection.

A junction reference error can cause erroneous thermocouple readings. In this particular test, the junction was placed into ice water to simulate a reference error. During the beginning of the test, the top thermocouple does not reach the expected temperature, but the more noticeable fault occurs when the junction was lifted out of the water around 11832 seconds. A sharp decrease in temperature resulted from this induced fault. The fault detection algorithm was able to detect both faults with reasonable accuracy.

Figure 33 - Simulation data using thermal modeling for temperature junction reference errors.

Figure 34 - Fault detection using thermal modeling for temperature junction reference errors.

The next test simulated a series of disconnects and shorts in both the top and bottom thermocouples.
During warm up, the top thermocouple was repeatedly connected and disconnected to simulate a connection that was just starting to become faulty. The faults were detected with high precision, but several false positives and false negatives occurred during the repetitive disconnects because the voltage did not have enough time to reach its minimum value.

Figure 35 - Simulation data using thermal modeling for thermocouple and power disconnections.

Figure 36 - Fault detection using thermal modeling for thermocouple and power disconnections.

This test was again simply a disconnect and short of the thermocouple; however, the ambient temperature was recorded during the test, which provided a more accurate simulation model. Similar symptoms were seen as in previous tests, where a disconnect produced a level shift to the channel's minimum value and a short produced a level shift to the channel's maximum value. Even with the ambient temperature, however, several false positives can be seen during cool down. This again is probably due to the body freezing much faster than expected under the testing procedures.

Figure 37 - Simulation data using thermal modeling for thermocouple disconnections and shorts.
Figure 38 - Fault detection using thermal modeling for thermocouple disconnections and shorts.

This test demonstrated a drifting fault, simulated by decreasing the voltage on the transmitters' power supply over a two-minute span. Since the fault's effect was slow and the threshold value is high, a number of false negatives were reported. This same test was performed several times over the course of an hour with similar results. Near the end of the test, the power supply for both transmitters was dropped, which caused a fault in both the bottom and top thermocouples. The fault detection in the lower thermocouple allowed the top simulation to retain a value closer to the actual temperature of the valve, which resulted in proper fault detection of the transmitter's low power output.

Figure 39 - Simulation data using thermal modeling for transmitter power failures.

Figure 40 - Fault detection using thermal modeling for transmitter power failures.

A new setup was used for this test (Figure 32), where a thermocouple junction was added (18, 19) and placed in an ice bath. At warm up, it was removed from the ice bath and a heat gun was blown on it.
When the junction reference was in the ice bath, the thermocouple's temperature read higher than expected, and when the heat gun was applied, it read lower than expected. Both of these induced faults were detected accurately.

Figure 41 - Simulation data using thermal modeling for unaccounted thermocouple junctions.

Figure 42 - Fault detection using thermal modeling for unaccounted thermocouple junctions.

Figure 43 shows a comparison of the thermal model's prediction of the frost line against an actual thermocouple that was bonded to the stem of the valve three inches from the body.

Figure 43 - Comparison of predicted and actual frost line.

It can be seen that the difference between the predicted and actual frost line at three inches is only approximately one minute. Since the heat dissipation of the valves is exponential, the steady-state time of larger valves can be upwards of twenty-two hours; therefore, a minute is well within the accepted error for this application. These results further validate the study performed in [14], but expand the work to detect faults in thermocouples.
The model being incorporated into the intelligent valve framework will allow for continuous monitoring of the frost line in the LLAV. If it can be shown that the frost line of the valve never reaches the packing at the top of the valve, the stem length can be reduced, saving thousands of dollars in the manufacturing of the valve.

4.2.2 Simulation Metrics

In the confusion matrices below, rows give the actual class and columns the predicted class; PPV and NPV denote the positive and negative predictive values.

Table 11 - Performance metrics for faulty connection in amplifier input.

                  Positive   Negative
Positive              5529       1411
Negative              1217     255460

Sensitivity 79.67%, Specificity 99.53%, PPV 81.96%, NPV 99.45%, F-Measure 89.89%.

Table 12 - Performance metrics for amplifier power down and Tustin input disconnect.

                  Positive   Negative
Positive               854          7
Negative                30     246106

Sensitivity 99.19%, Specificity 99.99%, PPV 96.61%, NPV 100%, F-Measure 98.27%.

Table 13 - Performance metrics for input disconnection on the digitizer.

                  Positive   Negative
Positive              4487        190
Negative              2048     240361

Sensitivity 95.94%, Specificity 99.16%, PPV 68.66%, NPV 99.92%, F-Measure 81.14%.

Table 14 - Performance metrics for frost insulation test 1.

                  Positive   Negative
Positive              2592       7619
Negative                 0     204984

Sensitivity 25.38%, Specificity 100%, PPV 100%, NPV 96.42%, F-Measure 100%.

Table 15 - Performance metrics for frost insulation test 2.

                  Positive   Negative
Positive             34702       8800
Negative                 0     371790

Sensitivity 79.77%, Specificity 100%, PPV 100%, NPV 97.69%, F-Measure 100%.

Table 16 - Performance metrics for temperature junction reference error.

                  Positive   Negative
Positive            155980        770
Negative              4778      51895

Sensitivity 99.51%, Specificity 91.57%, PPV 97.03%, NPV 98.54%, F-Measure 94.22%.

Table 17 - Performance metrics for thermocouple and power disconnection.

                  Positive   Negative
Positive              1120         44
Negative                 9     244668

Sensitivity 96.22%, Specificity 100%, PPV 99.20%, NPV 99.98%, F-Measure 99.60%.

Table 18 - Performance metrics for thermocouple disconnections and shorts.

                  Positive   Negative
Positive               841         22
Negative              3880     173357

Sensitivity 97.45%, Specificity 97.81%, PPV 17.81%, NPV 99.99%, F-Measure 30.14%.

Table 19 - Performance metrics for transmitter power failures.

                  Positive   Negative
Positive            115910      38566
Negative                10      92715

Sensitivity 75.03%, Specificity 99.99%, PPV 99.99%, NPV 70.62%, F-Measure 99.99%.

Table 20 - Performance metrics for unaccounted thermocouple junction.

                  Positive   Negative
Positive             92865      55685
Negative                 0      58750

Sensitivity 62.51%, Specificity 100%, PPV 100%, NPV 51.34%, F-Measure 100%.

Table 21 - Average performance metrics for all thermocouple fault tests.

Sensitivity 81.07%, Specificity 98.80%, PPV 86.13%, NPV 91.39%, F-Measure 89.32%.

The metrics validate the feasibility of using thermal models for frost line calculation and thermocouple sensor validation. With a greater population size and a more controlled test configuration, the results can be validated further, but the initial tests show promising results for the use of this algorithm in the Intelligent Valve framework. The faults that caused drastic changes in temperature during thermocouple disconnects and shorts were always detected within one measurement sample of the induced fault occurring. Other faults, such as a slowly degrading transmitter power supply, caused a slow discrepancy between the thermocouple's measured data and its simulated temperature. This type of fault was still detected, but it took several minutes into the fault for the measured value to cross the threshold value.
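The metric columns in these tables follow the standard confusion-matrix definitions, sketched below in Python with the entries of Table 11 used purely as an illustration. (Note the first four values reproduce Table 11; the F-measure values reported in the thesis may use a weighting different from the standard F1 computed here.)

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics, reported as percentages."""
    sensitivity = tp / (tp + fn)       # true positive rate (recall)
    specificity = tn / (tn + fp)       # true negative rate
    ppv = tp / (tp + fp)               # positive predictive value (precision)
    npv = tn / (tn + fn)               # negative predictive value
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)  # standard F1
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "ppv": round(100 * ppv, 2),
        "npv": round(100 * npv, 2),
        "f_measure": round(100 * f_measure, 2),
    }

# Entries of Table 11 (faulty connection in amplifier input):
m = classification_metrics(tp=5529, fn=1411, fp=1217, tn=255460)
```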
The computational efficiency of this approach is also very appealing for use in mission-critical situations where processing power is limited and must be reserved for operational algorithms. The calculation of the two thermal equations can be performed on a sample-by-sample basis for numerous thermocouples during live test fires, giving real-time results.

4.3 Sensor Validation

NASA-SSC provided valve data with a downstream pressure sensor for validation of the diagnostic algorithms. This data provided canonical datasets for the development of the AANN sensor validation. Five datasets were provided in total, with three sets used for training and two for testing. All of the data provided was nominal, so artificial soft and hard faults were injected into it. A hard fault is defined as a level shift in the data, where the measurement values change drastically to a certain value and remain there for an extended period of time; this is typical behavior of a sensor that is completely disconnected. A soft fault is defined as a slow deviation of the sensor's value from the physical value; this is characteristic of a sensor whose performance slowly degrades from either a slow bonding disconnect or an insulation disconnect. A hard and a soft fault can be seen in Figure 44 and Figure 45, respectively.

Figure 44 - Example of a hard fault.

Figure 45 - Example of a soft fault.

An example dataset of the valve can be seen in Figure 46. This dataset has a very simple correlation between the pressure sensor and the valve's position; it is nearly a step function between the valve and pressure reading. This test, while simple, provides validation for the AANN method, which will be expanded to a more complex system later. As previously mentioned, hard and soft faults were artificially injected, and an AANN was trained using the method described in the background section. Figure 47 and Figure 48 show the fault conditions and the AANN output.
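Fault injection of the kind described above can be sketched as follows: a hard fault replaces a window of samples with a constant level, while a soft fault adds a slowly growing drift. This is an illustrative Python sketch; the window positions, drift rate, and pressure values are invented, not taken from the NASA-SSC datasets.

```python
import numpy as np

def inject_hard_fault(signal, start, stop, level):
    """Level shift: the sensor sticks at a constant value."""
    faulty = signal.copy()
    faulty[start:stop] = level
    return faulty

def inject_soft_fault(signal, start, drift_per_sample):
    """Slow degradation: a linearly growing offset from the true value."""
    faulty = signal.copy()
    n = len(signal) - start
    faulty[start:] += drift_per_sample * np.arange(n)
    return faulty

nominal = np.full(1000, 250.0)                    # flat 250 PSIG segment
hard = inject_hard_fault(nominal, 400, 600, 0.0)  # disconnected sensor
soft = inject_soft_fault(nominal, 400, 0.05)      # slowly drifting sensor
```

Injecting faults into otherwise nominal records in this way gives a known ground truth, which is what makes the sensitivity/specificity bookkeeping of the previous section possible.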
Figure 46 - Example dataset from the LLAV and downstream pressure sensor.

Figure 47 - Hard fault detection using the AANN.

Figure 48 - Soft fault detection using the AANN.

In order to further validate the AANN algorithm, the MTTP data discussed above was also used to create a more extensive subsystem that could be tested. Again, artificial faults were injected into different sensors at different times during the test, but were characteristic of actual faults found during the thermal modeling tests. Figure 49 shows a hard fault in a pressure sensor, Figure 50 demonstrates the AANN's ability to track a soft fault in a separate pressure sensor, and Figure 51 shows the robustness of the AANN in the case of large disturbances, which is a known symptom of a faulty connection.

Figure 49 - Fault detection of a simulated hard fault in a pressure sensor.

Figure 50 - Fault detection of a soft fault in a pressure sensor.

Figure 51 - Detection of a simulated disconnect in a pressure transducer.
The AANN was able to detect the faults in the pressure sensor as well as predict the values of the pressure sensor to a reasonable degree. For the hard and soft faults, Figure 49 and Figure 50, no false positives or false negatives were detected by the AANN. In the simulated disconnect, the fault data occasionally approached the AANN's estimated value, causing false negatives. Depending on the application, this may be remedied by raising an alarm only when a predefined number of consecutive fault classifications occurs, and conversely clearing the alarm only after a defined number of nominal classifications.

To verify the algorithm further, the GOX subsystem of the MTTP was also tested. Similar artificial faults were injected into the test data, including multiple sensor faults at concurrent times. The same metrics that were used for the thermocouple algorithm were also calculated for the sensor validation, with the addition of mean squared error (MSE). Mean squared error was not used for the thermocouples due to the lack of a "true" signal being present. The first test did not contain any faults, to ensure that the AANN had correctly learned the correlations in the system.

Figure 52 - Legend for AANN estimations: (a) top estimation plots and (b) bottom error plots.

Figure 53 - AANN estimation for PE-1134-GO and PE-1140-GO pressure sensors under normal operating conditions.
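The alarm-latching remedy suggested above can be sketched as a small counter-based debounce. This is an illustrative Python sketch; the class name and the counts of three consecutive classifications are hypothetical choices, not values from the thesis.

```python
class FaultAlarm:
    """Raise the alarm only after n_set consecutive fault flags;
    clear it only after n_clear consecutive nominal flags."""
    def __init__(self, n_set=3, n_clear=3):
        self.n_set, self.n_clear = n_set, n_clear
        self.fault_run = 0
        self.nominal_run = 0
        self.active = False

    def update(self, fault_flag):
        if fault_flag:
            self.fault_run += 1
            self.nominal_run = 0
            if self.fault_run >= self.n_set:
                self.active = True
        else:
            self.nominal_run += 1
            self.fault_run = 0
            if self.nominal_run >= self.n_clear:
                self.active = False
        return self.active

# A single spurious classification no longer trips the alarm:
alarm = FaultAlarm(n_set=3, n_clear=3)
flags = [False, True, False, True, True, True, False, False, False]
history = [alarm.update(f) for f in flags]
```

With this latching in place, the brief threshold crossings seen in the simulated-disconnect test would neither raise nor prematurely clear the operator-facing alarm.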
Figure 54 - AANN Estimation for PE-1143-GO and PC1 pressure sensors under normal operating conditions.

Figure 55 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors under normal operating conditions.

Table 22 – Performance metrics for fault detection using AANN under normal operating conditions.

           Positive   Negative
Positive   0          0
Negative   0          55685

Sensitivity: 100%    Specificity: NaN    Positive Predictive Value: NaN
Negative Predictive Value: 100%    F-Measure: NaN    Average MSE: 14.56

Sensor           MSE
PE-1134-GO       48.3133
PE-1140-GO       0.8716
PE-1143-GO       37.4732
PC1              0.1882
VPV-1139-FB      0.5223
VPV-1139-CMD     0.0176

As can be seen in Figure 53 - Figure 55, the AANN was able to find the correct correlations based on the training data, then estimate the test data set while operating under normal conditions. The VPV-1139-FB channel had significant noise in all of the datasets, which seemed to be caused by either a bad power supply or a bad connection. In order to create relatively nominal data, a moving average window was applied to the training dataset as well as all of the test datasets. In the next test, a hard fault was injected into the PE-1134-GO pressure sensor during the startup phase.
The hard fault was a level shift to zero for the first 200 samples in the sequence. Figure 56 - Figure 58 show the results of the six monitored sensors and Table 23 shows the respective performance metrics.

Figure 56 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with hard fault in PE-1143.

Figure 57 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with hard fault in PE-1143.

Figure 58 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with hard fault in PE-1143.

Table 23 – Performance metrics for fault detection using AANN with injected hard fault in PE-1143-GO.
           Positive   Negative
Positive   201        0
Negative   0          55685

Sensitivity: 100%    Specificity: 100%    Positive Predictive Value: 100%
Negative Predictive Value: 100%    F-Measure: 100%    Average MSE: 18.1059

Sensor           MSE
PE-1134-GO       58.7237
PE-1140-GO       3.8857
PE-1143-GO       44.225
PC1              1.0074
VPV-1139-FB      0.7697
VPV-1139-CMD     0.0240

This test shows the robustness of the AANN with a hard fault in a pressure sensor. The AANN was able to detect all of the faults in the pressure sensor as well as maintain proper values for the rest of the sensors. Since the training data contains windows of zeroed-out sensors, it makes sense that this test would perform well. The MSE was slightly higher in certain sensors, especially in the faulty sensor PE-1134-GO. However, the values produced by the AANN were close enough to be used in lieu of the faulty data, which is the goal of this algorithm. It was seen in the thermocouple tests that a shorted sensor connection can result in a level shift to the maximum value of the sensor. The next test simulates a similar short in the PE-1143-GO sensor. The value was held for the entirety of the test to make sure that the AANN could detect the fault through all transitions and not just the initial state. Figure 59 - Figure 61 show the results of the test and Table 24 shows the respective performance metrics.

Figure 59 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with level shift in PE-1143-GO.
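The metrics reported in these tables follow the standard confusion-matrix definitions; a minimal sketch of how they can be computed from the true/false positive and negative counts (illustrative, not the thesis code):

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics as used in the tables above.
    Returns NaN where a denominator is zero (e.g. no actual faults)."""
    div = lambda a, b: a / b if b else float("nan")
    sens = div(tp, tp + fn)              # sensitivity (recall)
    spec = div(tn, tn + fp)              # specificity
    ppv = div(tp, tp + fp)               # positive predictive value
    npv = div(tn, tn + fn)               # negative predictive value
    f = div(2 * ppv * sens, ppv + sens)  # F-measure
    return sens, spec, ppv, npv, f
```

For the all-correct hard-fault case above (201 detected faults, 55685 correct negatives, no misclassifications), every metric evaluates to 100%.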
Figure 60 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with level shift in PE-1143-GO.

Figure 61 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with level shift in PE-1143-GO.

Table 24 – Performance metrics for fault detection using AANN with injected level shift fault in PE-1143-GO.

           Positive   Negative
Positive   975        0
Negative   28         4847

Sensitivity: 100%    Specificity: 99.42%    Positive Predictive Value: 97.20%
Negative Predictive Value: 100%    F-Measure: 98.30%    Average MSE: 415.51

Sensor           MSE
PE-1134-GO       1701.9
PE-1140-GO       9.30
PE-1143-GO       779.1
PC1              0.9
VPV-1139-FB      1.80
VPV-1139-CMD     0.137

The shorted-sensor simulations produced similar fault classification results as the hard fault, but the prediction error of the sensor data was much higher. This is to be expected, as a number of runs in the training dataset contained values of PE-1143-GO similar to those the fault was reporting. Therefore, the correlations found by the neural network's bottleneck layer would have been caught between two different states of the training data.
Even though the prediction accuracy decreased, the fault detection would still be sufficient to pass on the data to a fault diagnosis algorithm, which could identify the faulty pressure sensor. There are times on the test stand when, due to weather and wind conditions, a sensor's insulation can become loose, causing a faulty connection in the pressure sensor. This disconnect can cause considerable noise in the channel's measurements. These faults can be particularly difficult to detect at an early stage because only small variations in the measurement data can be seen. The first test, seen in Figure 62 - Figure 64, is a simulation of a more drastic disconnect, where the values of the PC1 pressure sensor have a more severe discrepancy from the actual measured value.

Figure 62 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with noise in PC1.

Figure 63 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with noise in PC1.
Figure 64 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with noise in PC1.

Table 25 – Performance metrics for fault detection using AANN with injected noise in PC1.

           Positive   Negative
Positive   260        41
Negative   9          5540

Sensitivity: 99.64%    Specificity: 86.37%    Positive Predictive Value: 96.65%
Negative Predictive Value: 99.26%    F-Measure: 98.22%    Average MSE: 89.43

Sensor           MSE
PE-1134-GO       52.02
PE-1140-GO       4.40
PE-1143-GO       58.11
PC1              0.87
VPV-1139-FB      1.62
VPV-1139-CMD     0.085

This test again verified the AANN's robustness, even in the presence of noise. The training method using random biases in the second training set optimized the weights to enhance the network's understanding of the complex system. The next test is the only "real world" fault that was available in the data. As stated previously, the VPV-1139-FB sensor had noise in its channel during every test run. The preprocessing of the data used a moving average to create normal operating data that was sufficient for training the AANN. This test uses the original dataset to validate the AANN's performance with actual fault data.

Figure 65 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with noise in VPV-1139-FB.
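The moving-average preprocessing mentioned above can be sketched as a centered windowed mean; the window length is an illustrative choice, not the value used in the thesis:

```python
import numpy as np

def moving_average(x, window=11):
    """Smooth a 1-D signal with a centered moving-average window
    (use an odd window so the output length matches the input).
    Edges are padded by repeating the end samples."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    padded = np.concatenate([np.full(half, x[0]), x, np.full(half, x[-1])])
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")
```

Applying this to the noisy VPV-1139-FB channel before training yields the "relatively nominal" data described earlier.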
Figure 66 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with noise in VPV-1139-FB.

Figure 67 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with noise in VPV-1139-FB.

Table 26 – Performance metrics for fault detection using AANN with noise in VPV-1139-FB.

           Positive   Negative
Positive   129        846
Negative   3          4872

Sensitivity: 99.93%    Specificity: 13.23%    Positive Predictive Value: 97.72%
Negative Predictive Value: 85.20%    F-Measure: 98.82%    Average MSE: 23.22

Sensor           MSE
PE-1134-GO       78.77
PE-1140-GO       11.24
PE-1143-GO       45.17
PC1              0.98
VPV-1139-FB      3.18
VPV-1139-CMD     0.03

While the AANN was still able to hold a low MSE in this case, the lack of a fault region detection algorithm produced a very low specificity rating, which could result in an undetected fault in the sensor. Detection of isolated spikes in the data makes it difficult to determine and diagnose the source of a fault, and therefore the sensitivity requirement is usually defined based on the application. Fault region detection algorithms can use fault windows with a majority-rule decision to determine the overall health of a sensor over a period of time and assist the fault diagnosis algorithm.
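A fault-window majority rule like the one described could be sketched as follows; the window length is an illustrative choice:

```python
from collections import deque

def majority_rule_health(fault_flags, window=25):
    """Slide a window over per-sample fault flags and declare the sensor
    faulty wherever more than half of the flags in the window are set.
    Isolated spikes are thereby filtered out."""
    recent = deque(maxlen=window)
    health = []
    for flag in fault_flags:
        recent.append(bool(flag))
        health.append(sum(recent) > len(recent) / 2)
    return health
```

With this scheme, a handful of noise-induced spikes in VPV-1139-FB would not mark the sensor faulty, while a sustained fault region would.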
The last test determines whether the AANN can detect simultaneous faults in multiple sensors. A disconnect was injected into PE-1143-GO and a short was injected into PC1 for the entirety of the test.

Figure 68 - AANN Estimation for PE-1134-GO and PE-1140-GO pressure sensors with simultaneous faults in PE-1143-GO and PC1.

Figure 69 - AANN Estimation for PE-1143-GO and PC1 pressure sensors with simultaneous faults in PE-1143-GO and PC1.

Figure 70 - AANN Estimation for VPV-1139-FB and VPV-1139-CMD valve sensors with simultaneous faults in PE-1143-GO and PC1.

Table 27 – Performance metrics for fault detection using AANN with simultaneous faults in PE-1143-GO and PC1.
           Positive   Negative
Positive   386        1564
Negative   1530       2370

Sensitivity: 60.76%    Specificity: 19.80%    Positive Predictive Value: 20.15%
Negative Predictive Value: 60.24%    F-Measure: 30.26%    Average MSE: 14561

Sensor           MSE
PE-1134-GO       20464
PE-1140-GO       10226
PE-1143-GO       56154
PC1              361
VPV-1139-FB      50
VPV-1139-CMD     110

This verifies that the AANN is only useful in the presence of a single sensor fault. The data from the output of the AANN would not be useful for the fault diagnosis algorithm, as there are too many misclassifications of faults in sensors that were still operating properly. A possible solution is to replace the motor valves' binary indicators with a feedback sensor to get a more accurate understanding of the operating state of the system. The binary inputs from these valves were tested in the AANN to see if they could improve performance, but showed no real benefit in similar tests. Therefore, to reduce computational complexity and training time, they were removed from the tests. Overall, the use of a bottlenecked neural network with mapping and demapping layers proved to be an effective tool for the detection of single sensor faults in a complex system. The AANN was able to find correlations in the data without any knowledge of the physical dynamics of the MTTP dataset. This provides a generic algorithm that will work for most complex systems if enough training data is provided. The estimated data provided by the AANN can also assist in the decision made by NASA test operators on whether to continue an expensive rocket engine test in the event that a fault is detected. There are also several drawbacks to this method. First, large amounts of data must exist that encompass all of the system's operating conditions in order to train the AANN properly. If insufficient data is provided, an online training algorithm may have to be implemented to continually update the weights of the neural network.
Also, multiple sensor faults were not able to be detected with the current amount of data that was provided to the system. Lastly, certain sensors may have no correlation to each other, for example valve position and strain. Therefore, while pressure and valve sensors work in this context, detection of a fault in a strain sensor would not, and may even throw off the detection of faults in the other sensors. It can be concluded, then, that the AANN's sensor selection and training data must be determined carefully in order to guarantee the successful detection of faults in the sensors that it is monitoring.

4.4 Adaptive Threshold

To validate the adaptive threshold method, six different set point transitions were used to determine the robustness of the algorithm with different transition ranges and speeds. These transitions can be seen in Figure 71.

Figure 71 - Set point transitions for adaptive thresholding testing.

The parameters of the transfer function were modified as mentioned above in order to validate the adaptive thresholding algorithm. Several example plots of the algorithm working are presented below, with a description of the results following.
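The degraded valve behavior in these tests is produced by varying the parameters of a second-order transfer function (the gain G, natural frequency, damping ratio, and input delay shown in the figure annotations). A discrete-time sketch of such a model is given below; the forward-Euler integration scheme and step size are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def valve_response(setpoint, gain=1.0, wn=1.0, zeta=1.0, delay=0, dt=0.01):
    """Simulate a second-order valve model
        y'' + 2*zeta*wn*y' + wn^2*y = gain*wn^2*u(t - delay)
    with forward-Euler integration. `setpoint` is the commanded
    position per sample; `delay` is an input delay in samples."""
    y, v = 0.0, 0.0
    out = np.zeros(len(setpoint))
    for k in range(len(setpoint)):
        u = setpoint[k - delay] if k >= delay else setpoint[0]
        accel = gain * wn**2 * u - 2 * zeta * wn * v - wn**2 * y
        v += accel * dt
        y += v * dt
        out[k] = y
    return out
```

Lowering the gain below 1 reproduces a valve that cannot reach fully open, while lowering the natural frequency or damping ratio reproduces the sluggish, oscillatory transitions seen in the faulty operating states.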
Figure 72 - Set point transition #1 with fault detection while operating in: (a) normal OS and (b) faulty OS.

In the first set point transition, the effects of a lower natural frequency and damping ratio can be seen. The valve reacts normally as it ramps up to the setpoint, but has difficulty reaching its steady state value. The algorithm was able to detect this using the adaptive threshold until the valve reached a steady state point that was reasonably close to the set point.

Figure 73 - Set point transition #2 with fault detection while operating in: (a) normal OS and (b) faulty OS.

The second set point transition in Figure 73 again shows the effects of a degrading valve that cannot reach its set point quickly enough and, when it does, overshoots the value. This test shows a lower number of faults, but the faults are localized around the transitional points of the test.
This information could be vital to a test engineer by providing knowledge of not only how, but also at what points in the test, the valve is failing.

Figure 74 - Set point transition #3 with fault detection while operating in: (a) normal OS and (b) faulty OS.

Although all valves have an input delay between the time a signal is received and the time the valve actually moves, this delay can increase as the health of a valve decreases and cause undesirable behavior. In Figure 74 (a), it can be seen that the valve and threshold react in the same reasonable time frame; however, in Figure 74 (b), the valve reacts more slowly and causes a fault to be detected in the valve.

Figure 75 - Set point transition #4 with fault detection while operating in: (a) normal OS and (b) faulty OS.

Figure 75 shows how the algorithm would need to be tuned based on a fit parameter in order to get perfect accuracy.
This set point time series was very fast, with little steady state time between the transitional periods. This is not a common operating procedure in the test stands at NASA-SSC, but it is still useful to see how a valve will operate during an emergency shutdown. Also, as the gain parameter lowers to 0.9, the valve is unable to reach fully open. This can be caused by excessive wear, transition friction, or a power failure in the control systems.

Figure 76 - Set point transition #5 with fault detection while operating in: (a) normal OS and (b) faulty OS.

In Figure 76, an issue with the algorithm can be seen, as initialization effects can cause false positives in the valve. This would not normally be a problem because the algorithm would be continually running, but if the framework were to be turned on in the middle of a test, it could cause some false positives in the valve's health analysis. Also, in Figure 76 (b), the effects of a large gain parameter coupled with a low damping ratio can be seen, as large oscillations occur at the top of the transitional period. These effects are continually seen as the valve is suddenly closed and not given time to reach its steady state value. This effect continues much longer than in the previous tests due to the continually changing control variable.
Figure 77 - Set point transition #6 with fault detection while operating in: (a) normal OS and (b) faulty OS.

Figure 77 shows a type of transition where the valve's degradation is masked by the long and steady rise of the valve's control variable. Because most of the valve's problems are exposed during fast transition times, the algorithm is unable to detect the difference between a normally operating valve and a faulty operating valve. While the algorithm is not detecting the failing health of the valve, it is also not producing false positives that would cause a valve to be fixed unnecessarily.

Figure 78 – Average fault values for different parameters (gain, natural frequency, damping coefficient, and input delay) of the ARMA model thresholding method over all tests.

It can be seen in Figure 78 that, as the parameters of the transfer function are modified, the average number of faults increases as the valve's physical parameters degrade. While there are faults in the nominal region, there is an increasing trend in all of the variables as the boundaries are exceeded.
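The thresholding scheme evaluated here can be sketched as a prediction band: fit model coefficients on fault-free data, predict each sample from its predecessors, and flag a fault when the measurement leaves the prediction by more than a fit margin. The simplified AR variant below is illustrative; the model order and margin are assumptions, not the thesis values:

```python
import numpy as np

def fit_ar(train, order=2):
    """Least-squares fit of AR coefficients on fault-free training data."""
    train = np.asarray(train, dtype=float)
    X = np.column_stack(
        [train[order - i - 1:len(train) - i - 1] for i in range(order)]
    )
    coeffs, *_ = np.linalg.lstsq(X, train[order:], rcond=None)
    return coeffs

def adaptive_threshold_faults(signal, coeffs, margin):
    """Flag samples whose value falls outside the model prediction
    plus or minus `margin` (the fit parameter)."""
    signal = np.asarray(signal, dtype=float)
    order = len(coeffs)
    faults = np.zeros(len(signal), dtype=bool)
    for k in range(order, len(signal)):
        pred = float(np.dot(coeffs, signal[k - order:k][::-1]))
        faults[k] = abs(signal[k] - pred) > margin
    return faults
```

Because the band is derived from the signal's own recent history, it automatically widens and narrows as the valve transitions, which is what makes the threshold "adaptive" rather than a fixed limit.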
To further validate the fault detection algorithm in the scope of the Intelligent Valve framework, data was taken from the E-Complex test stand's simulation lab using PLCs from NASA-SSC. A similar transfer function was used to model the valve's response; however, the PID controller was run on an Allen-Bradley PLC, which is used in the E-Complex test stand. By reducing the gain parameter, a simulated obstruction or power failure can be injected into the feedback signal of the valve. The same control signal was used for the test and training data to show how the valve changes based on its parameters. Figure 79 shows the adaptive threshold on the training data and Figure 80 shows the results of the simulated obstruction fault.

Figure 79 - Training data with final threshold fit.

Figure 80 - Fault detection of simulated obstruction fault using adaptive thresholding.

The adaptive threshold was able to detect when the valve was unable to match the control signal. Since all of the faults were detected by the lower threshold, the diagnosis can be narrowed to causes such as an obstruction or a power fault. If both lower and upper faults were found, then the data would need to be analyzed further by domain experts to determine the correct maintenance for the system. Also, there are several false positives detected between t = 111 s and t = 115 s; however, these faults only exist for one time step, which can be accounted for by the error in the ARMA models.
The adaptive threshold algorithm has been validated using a forward analytical model to detect degradation in the LLAV, as well as actual data from hardware and simulations from NASA-SSC's E-Complex test stand. Since no fault data from the LLAV was available, the parameters of the transfer function were used to model how the valve would react to its given input. The algorithm provides a fit parameter which can be used to develop a range of values that are considered nominal by the test engineer while maintaining the quality of performance required in such a critical environment. The algorithm showed that it could detect faults among the various transitions that would be commonly seen in NASA-SSC test stands. There is a large difference in the number of faults detected between the nominal parameters and the fault parameters. If the faults detected by the algorithm are trended between tests, the trend lines will show when a valve begins to become faulty. A drawback of this method is that it is data-driven and, therefore, requires previous data from the valve to develop the ARMA models required to create the adaptive threshold. The advantage of this method is that the data required to calculate the coefficients is all normal functioning data rather than faulty data. Another drawback is the lack of an optimization parameter for the fit equation. If one could be developed, the number of ARMA models needed could be optimized to reduce the computational load of the algorithm.

4.5 Valve Statistics

The valve operating statistics are used to advise the fault diagnosis after a failure mode has been detected by the previous algorithms. The operating statistics have been captured for several test runs of two LLAVs from historical data in Table 28. These statistics are presented to the operators in order to investigate negative trends in the system's behavior and assist in determining more accurate maintenance decisions with an understanding of the valve's operating history.
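Operating statistics of this kind can be accumulated from a logged valve position time series. The sketch below is illustrative: the minimum movement that counts as a transition is an arbitrary assumption, and `number_closings` is a crude proxy that counts runs of downward movement, not the thesis's exact counting rules:

```python
import numpy as np

def valve_statistics(position, min_move=1.0):
    """Accumulate basic operating statistics from a valve position log
    (percent open per sample). Movements smaller than `min_move`
    are ignored as noise."""
    position = np.asarray(position, dtype=float)
    deltas = np.diff(position)
    moving = np.abs(deltas) >= min_move
    signs = np.sign(deltas[moving])
    # A direction change occurs whenever consecutive movements flip sign.
    direction_changes = int(np.sum(signs[1:] != signs[:-1]))
    # A "closing" is a maximal run of downward movement (crude proxy).
    neg = signs < 0
    closings = int(np.sum(neg & ~np.concatenate([[False], neg[:-1]])))
    return {
        "distance_traveled": float(np.sum(np.abs(deltas[moving]))),
        "direction_changes": direction_changes,
        "number_closings": closings,
    }
```

Trending these counters across test runs, as Table 28 does for valves 10A23 and 10A24, is what exposes the negative trends mentioned above.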
Table 28 - Operating Statistics for LLAV

Name                       10A23    10A24
Transitions                25       17
Cryogenic Transitions      33       35
Distance Traveled          15       14
Transition Time            14.5     13.2
Average Transition Time    13.78    15
Direction Changes          13       20
Number Closings            12       25

4.6 Health Visualizations

The data is visualized using a 3D model to show the different operating conditions of the LLAV. Utilizing drafted design documents, each of the valve components was modeled in Autodesk 3D Studio Max. The valves were designed and animated to allow for an exploded view or a cross-sectional view during operation. When operating, the visualization would display the direction of flow with a series of arrows: green indicated that the valve was open, while red indicated that it was closed. The frost point was visualized through the use of a shader program that would display the frost height through an icy bitmap texture, which would slowly replace the normal metal appearance as the frost continued to migrate up the valve. Each of these visualizations can be seen in Figure 81, Figure 82, and Figure 83.

Figure 81 - Frost line visualization of LLAV.

Figure 82 - Cross sectional and exploded view with flow and position visualizations.

Figure 83 - Frost line visualization of LLAV with thermocouple values.

4.7 Prognostics

While no prognostics were performed to determine the remaining useful life of the LLAV, several techniques were investigated for future consideration of this task.

4.8 Prognostics Data

In order to test the feasibility of these techniques' use in prognostics, the following data was used to determine their performance under different environments.

4.8.1 Canonical Data

To perform simple validation of the AR, ARMA, and Kalman filter models, canonical time series data was used. A linear equation with a specified mean and variance was used to produce multiple test series. Additive white Gaussian noise was then added in order to see how well each model could perform under harsh environmental conditions.
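The canonical series described above can be generated as a linear trend plus additive white Gaussian noise of a specified mean and variance; the slope and seed below are illustrative choices:

```python
import numpy as np

def linear_series_with_awgn(n=100, slope=1.0, noise_mean=0.0,
                            noise_var=1.0, seed=0):
    """Linear trend corrupted by additive white Gaussian noise
    with the given mean and variance."""
    rng = np.random.default_rng(seed)
    t = np.arange(n, dtype=float)
    return slope * t + rng.normal(noise_mean, np.sqrt(noise_var), n)
```

Varying `noise_var` (or, equivalently, the SNR) gives the family of test signals used to stress the prediction models below.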
An example of this time series can be seen in Figure 84 and Figure 85.

Figure 84 - Linear equation with 0 mean and 1 variance.
Figure 85 - Linear time series with 0 mean and 10 variance.

4.8.2 LLAV Data

To test the prognostic methods on data from an actual test, the LLAV data was used. This presented a simple case, with an input and an output, that could determine how well the techniques predict future values of a time series. This data was presented earlier, in the sensor validation section and Figure 46.

4.9 Prognostic Performance

The first time series was based on a signal with 0 mean and 1 variance. The models were tested with prediction steps ranging from 1 to 25 and with the SNR of the AWGN ranging from -5 to 25 dB. The MSE was measured and plotted to gauge performance. The results can be seen in the following figures:

Figure 86 - Original model time series.
Figure 87 - AR prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 88 - AR prediction of first time signal at 5 prediction steps and SNR = 25 dB.
Figure 89 - AR prediction of first time signal at 5 prediction steps and SNR = -5 dB.
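A minimal AR predictor of the kind evaluated in Figures 87 through 89 can be sketched as follows (least-squares coefficient fitting and recursive multi-step forecasting; the model order of 4 is an arbitrary choice, not the order used in the thesis):

```python
import numpy as np

def ar_fit(x, p=4):
    """Least-squares AR(p) coefficients, most recent lag first."""
    X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def ar_forecast(history, a, steps):
    """Iterate one-step predictions, feeding each forecast back in."""
    p = len(a)
    buf = list(history[-p:])               # last p samples, chronological
    out = []
    for _ in range(steps):
        nxt = float(np.dot(a, buf[::-1]))  # most recent lag first
        out.append(nxt)
        buf = buf[1:] + [nxt]
    return np.array(out)
```

Because each forecast is fed back as the input for the next, error accumulates with the horizon, which is the behavior visible in the MSE surfaces.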
Figure 90 - AR MSE performance on the 0 mean, 1 variance signal.
Figure 91 - ARMA prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 92 - ARMA prediction of first time signal at 1 prediction step and SNR = -5 dB.
Figure 93 - ARMA prediction of first time signal at 5 prediction steps and SNR = -5 dB.
Figure 94 - ARMA MSE performance on the 0 mean, 1 variance signal.
Figure 95 - Kalman filter prediction of first time signal at 1 prediction step and SNR = 25 dB.
Figure 96 - Kalman filter prediction of first time signal at 5 prediction steps and SNR = 25 dB.
Figure 97 - Kalman filter prediction of first time signal at 5 prediction steps and SNR = -5 dB.
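The ARMA predictions require estimating moving-average coefficients in addition to the autoregressive ones. A compact way to do this, used here as an illustrative stand-in for whatever estimator was actually employed, is a two-stage Hannan-Rissanen regression:

```python
import numpy as np

def arma_fit(x, p=2, q=2, p_long=10):
    """Two-stage ARMA(p, q) estimate (Hannan-Rissanen style)."""
    # Stage 1: long AR fit, whose residuals approximate the innovations
    Xl = np.array([x[t - p_long:t][::-1] for t in range(p_long, len(x))])
    a_long, *_ = np.linalg.lstsq(Xl, x[p_long:], rcond=None)
    e = np.zeros_like(x)
    e[p_long:] = x[p_long:] - Xl @ a_long
    # Stage 2: regress x[t] on p signal lags and q innovation lags
    m = p_long + q
    R = np.array([np.concatenate([x[t - p:t][::-1], e[t - q:t][::-1]])
                  for t in range(m, len(x))])
    theta, *_ = np.linalg.lstsq(R, x[m:], rcond=None)
    return theta[:p], theta[p:], e

def arma_one_step(x, e, ar, ma):
    """One-step-ahead prediction of x[len(x)] from the fitted model."""
    return float(x[-len(ar):][::-1] @ ar + e[-len(ma):][::-1] @ ma)
```

The moving-average terms act on the estimated innovation sequence, which is why the ARMA model copes better with heavy injected noise than the plain AR model.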
Figure 98 - Kalman filter MSE performance on the 0 mean, 1 variance signal.

As can be seen in the figures, the accuracy of the models decreases as the number of prediction steps increases. This is true of both linear regression models, which is to be expected, as they both use the same general approach to predicting future time series values. However, under significant noise the ARMA model performs much better, due to the additional coefficients that calculate a moving average of the white noise. The signal was then changed by increasing the variance to 10, with the following results:

Figure 99 - Original time series model #2.
Figure 100 - AR MSE performance on the 0 mean, 10 variance signal.
Figure 101 - ARMA MSE performance on the 0 mean, 10 variance signal.
Figure 102 - Kalman filter MSE performance on the 0 mean, 10 variance signal.

The results of this test are similar to those of the previous one, with some changes in the performance of the ARMA model. The two linear regression models had a much higher error base than in the first test, due to the large variance of the signal; the Kalman filter, however, stayed relatively consistent through both tests. The ARMA model performed the worst, which is due to its cancellation of noise through the extra coefficients.
The ARMA model is actually smoothing the estimate too much, because the high variance is treated as noise. The Kalman filter performed more consistently in this test than in the previous one, and was still the best of the three.

The real benefit of prognostics can be seen when a process or output variable of a system can be predicted from the measurable input variables of the system. One instance of this is using a valve's control variable to predict its output variable. If the process variable can be predicted several time steps into the future, a disaster can be averted by performing health analysis algorithms on the future state of the valve. In order to perform this type of calculation, the techniques mentioned above must be extended to account for an input variable. In the AR and ARMA models, an external input is added, along with another vector of coefficients that must be calculated; the resulting models are known as ARX and ARMAX. The Kalman filter adds a control input term to the time update equation to account for this type of prediction.

These three methods were applied to LLAV data provided by NASA-SSC. As in the previous tests, AWGN was added to the signal to test the robustness of the techniques in harsh conditions. The results can be seen in Figure 103 through Figure 108.

Figure 103 - ARX prediction of the LLAV data to 30 time steps.
Figure 104 - Performance of the ARX model based on LLAV data.
Figure 105 - ARMAX prediction of the LLAV data to 30 time steps.
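An ARX predictor with an exogenous control input can be sketched as follows (the orders, and the first-order valve-like plant used to exercise it, are assumptions for illustration):

```python
import numpy as np

def arx_fit(y, u, na=2, nb=2):
    """Least squares for y[t] = sum a_i*y[t-i] + sum b_j*u[t-j]."""
    n = max(na, nb)
    R = np.array([np.concatenate([y[t - na:t][::-1], u[t - nb:t][::-1]])
                  for t in range(n, len(y))])
    theta, *_ = np.linalg.lstsq(R, y[n:], rcond=None)
    return theta[:na], theta[na:]

def arx_forecast(y, u, a, b, t0, steps):
    """Forecast y[t0:t0+steps], using the known future control input u."""
    yy = list(y[:t0])
    for t in range(t0, t0 + steps):
        ylags = np.array(yy[t - len(a):t][::-1])
        ulags = u[t - len(b):t][::-1]
        yy.append(float(a @ ylags + b @ ulags))
    return np.array(yy[t0:])
```

Fitting this on simulated first-order actuator data recovers the plant coefficients and lets the forecaster track a commanded step, which mirrors why the LLAV predictions benefit from the known control input.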
Figure 106 - Performance of the ARMAX model based on LLAV data.
Figure 107 - Kalman filter prediction of the LLAV data to 30 time steps.
Figure 108 - Performance of the Kalman filter based on LLAV data.

All three algorithms were able to predict the output of the process variable reasonably well, even out to 30 time steps. The results are similar to the canonical results, with the error of the techniques growing with the number of prediction steps and falling as the SNR increases. The good performance at long horizons is due to the presence of an input control variable that gives the predictors better context of how the valve will respond in future states. The Kalman filter was the least consistent, as the process and measurement noise are both modeled by constant vectors that assume prior knowledge of the noise covariance. This prognostic process, used in conjunction with the adaptive threshold method developed above, could provide valuable seconds to the test engineers at NASA-SSC for making determinations about test operations in the E-complex test stand.

The ARX model is the simplest of the techniques and performs well on systems with relatively low noise. Its simplicity makes it the lowest in both computational and memory costs, which can save valuable resources on mission-critical devices if large numbers of valves are being monitored. The ARMAX model provides a way to estimate the measurement and process noise through additional coefficients that model the error of the system as a moving average of white noise.
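The Kalman filter runs on the LLAV data use a state-space model driven by the control input. A minimal scalar sketch is shown below; the plant coefficients and noise covariances are assumed known here, which reflects the tuning burden discussed in the text:

```python
import numpy as np

def kalman_forecast(z, u, a=0.9, b=0.1, q=1e-4, r=0.01, steps=5):
    """Scalar Kalman filter for x[t+1] = a*x[t] + b*u[t] + w, z[t] = x[t] + v.

    Returns, for each sample, a `steps`-ahead forecast driven by the known
    control input. The parameters a, b, q, r are illustrative assumptions;
    in practice they would come from system identification and tuning.
    """
    x, P = float(z[0]), 1.0
    preds = []
    for t in range(len(z)):
        # Measurement update
        K = P / (P + r)
        x = x + K * (z[t] - x)
        P = (1.0 - K) * P
        # Multi-step forecast from the filtered state, using future inputs
        xf = x
        for k in range(steps):
            xf = a * xf + b * u[min(t + k, len(u) - 1)]
        preds.append(xf)
        # Time update to the next sample
        x = a * x + b * u[t]
        P = a * P * a + q
    return np.array(preds)
```

Only the current filtered state and covariance are carried between samples, which is why the Kalman filter is the least memory intensive of the three approaches.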
The ARX and ARMAX models are both data driven in that they require historical data to calculate their coefficients. In the tests performed in this research, the coefficients were determined quickly and with small amounts of training data, and the ARX and ARMAX models were still able to perform well in the prognosis tests. A significant drawback of these models is their inability to incorporate physical mechanisms into their equations: they are purely mathematical models with no relation to the real world. The Kalman filter, like other state-space models, provides the ability for real-world processes to be described by an internal state vector. This state vector is continually updated throughout the prognosis process to minimize the state error covariance through measurement and time update equations. The drawbacks of the Kalman filter are that its parameters must be tuned, which requires knowledge of the physics of the system, and that initial values are needed at the start of the algorithm to ensure optimal results. The Kalman filter is the most computationally intensive, but the least memory intensive, as it relies only on the current sensor data point, which can be discarded after the measurement update has been performed.

4.10 Diagnostic Process

In order to make the framework practical for use by NASA-SSC engineers, the health data must be displayed efficiently on the control computers. The software used by the control engineers, WonderWare InTouch, allows developers to extend its functionality through Microsoft's ActiveX modules and .NET controls. The control computers each have four monitors, giving the control engineers vast screen real-estate to monitor the test stands during test article firings.
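The framework's persistent record of operating statistics can be illustrated stand-alone with Python's built-in sqlite3 (the actual implementation stores these in an MS-SQLCE database accessed from the .NET module; the schema below is an assumption):

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the persistent valve statistics store."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS valve_stats (
        valve_name TEXT PRIMARY KEY,
        transitions INTEGER NOT NULL DEFAULT 0,
        cryogenic_transitions INTEGER NOT NULL DEFAULT 0,
        distance_traveled REAL NOT NULL DEFAULT 0,
        total_transition_time REAL NOT NULL DEFAULT 0,
        closings INTEGER NOT NULL DEFAULT 0)""")
    return conn

def record_transition(conn, name, distance, duration, cryogenic, closing):
    """Accumulate one observed transition into the valve's running totals."""
    conn.execute("INSERT OR IGNORE INTO valve_stats (valve_name) VALUES (?)",
                 (name,))
    conn.execute("""UPDATE valve_stats SET
                        transitions = transitions + 1,
                        cryogenic_transitions = cryogenic_transitions + ?,
                        distance_traveled = distance_traveled + ?,
                        total_transition_time = total_transition_time + ?,
                        closings = closings + ?
                    WHERE valve_name = ?""",
                 (int(cryogenic), distance, duration, int(closing), name))
    conn.commit()
```

Keyed on the valve name, the totals survive between test setups, which is what allows trends such as the average transition time to be derived later.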
Through the software framework described in the approach section, a process was designed and implemented, with the design constraints in mind, that provides the data necessary to perform and visualize the health analysis of the LLAV. The .NET module accomplishes the tasks mentioned above through a tabbed control that provides test operations with the information required to make intelligent maintenance decisions. A tabbed control was selected because it lowers the footprint the module has on control computer screens, while still allowing extensibility in the future. The first tab contains the historical context of the valve by displaying crucial operating statistics, which are continuously monitored by the module. These values are stored in an MS-SQLCE database in order to create a persistent record of the valve's events. The statistics tab can be seen in Figure 109.

Figure 109 - Intelligent Valve statistics tab.

The second tab, Figure 110, demonstrates the ability to track the frost line of the valve. The method used to track the frost line is discussed in the thermal modeling portion of the Prototype Diagnostics section of this report. This tab allows a test operator to quickly see all the thermocouples that are attached to a valve, as well as their current health status. The flagged attribute of a thermocouple is determined by either a percentage or an absolute threshold designated by the test operator in the setup tab. The tab also shows the current position, control, and open time of the valve. Each valve can be selected from a drop-down menu to see its current status. A 2D view is provided so that when the user clicks on a thermocouple in the list view, its position is shown by a red box. This gives context to the position of the thermocouple in relation to the total length of the valve.

Figure 110 - Intelligent Valve thermocouple tab.

The final functional tab allows the test operator to add, modify, reset, and delete valves.
It also provides functionality for adding, modifying, and deleting thermocouples from the valve, and, finally, the ability to add and delete DDE data servers. This feature enables test operators to change between setups while still keeping persistent tracking of the valve statistics. The test operator can also specify a data folder where the raw measurement data is stored in another MS-SQLCE database; future health analysis algorithms can be developed, tested and validated on this data. Figure 111 shows the setup tab.

Figure 111 - Intelligent Valve setup tab.

Chapter 5: CONCLUSIONS

ISHM capabilities can provide significant benefits for ground-based spacecraft monitoring and control, and can ultimately be adapted to provide on-board support for spacecraft. Progressive development and demonstration of key ISHM architectural elements requires that key propulsion components be adequately modeled and supported with high-performance anomaly detection algorithms. It is also important that the integration of the model within an ISHM framework be supported with useful user interfaces that maximize the selectivity and utility of the ISHM output in order to obtain the intended benefits.

5.1 Summary of Accomplishments

The objectives of this thesis are revisited below, and the solutions proposed to address each of the problems identified in this research work are summarized.

1. To design a framework for the detection of faults and failure modes in the large linear actuator valves that are used on the rocket engine test stands at NASA-SSC.

An Intelligent Valve framework was designed using domain expert knowledge to identify the key faults and failure modes in the LLAV. A FMECA was performed, as described in Section 3.1, to focus efforts on the most critical problems with the valves. Once this knowledge had been acquired, a diagnostic process and algorithms could be developed to detect these faults and failure modes.

2. To develop a diagnostic process that - a.
Receives and stores incoming sensor data; b. Performs calculation of operating statistics; c. Compares with existing analytical models; and, d. Visualizes faults, failures, and operating conditions in a 3D GUI environment.

The diagnostic process was developed with an interface that can be easily expanded in the future. The DDE protocol and a SQL database (Section 3.4) were used to receive and store incoming sensor data in an efficient manner that can be easily annotated by the diagnostic algorithms. In order to give maintenance personnel historical context of the valve's operation, an algorithm was developed (Section 3.2.4) to capture key operating statistics throughout the valve's lifespan. A thermal analytical model (Section 3.2.6) was developed by NASA engineers and implemented in the Intelligent Valve framework. A 3D environment was developed using advanced visualization techniques to show faults, failures, and operating statistics, as can be seen in Section 4.6.

3. To develop a suite of diagnostic algorithms that can detect anomalous behavior in the valve and other system components of the rocket engine test stand.

A suite of diagnostic algorithms was developed that detects various anomalous behaviors in the LLAV and other system components: a sensor validation algorithm using auto-associative neural networks (Section 4.3), an adaptive thresholding method to detect degradation in valve parameters (Section 4.4), and a thermocouple fault detection method using the thermal analytical model developed by NASA engineers (Section 4.2.1). These fault detection algorithms, coupled with the contextual information from the operating statistics, can help advise maintenance personnel in their decisions to repair the valves.

4. To expand the capability of the diagnostic algorithms to perform prognosis in specific contexts.

The diagnostic algorithms have been expanded with prediction in specific contexts.
In particular, AR, ARMA and Kalman filter methods were used to gauge the ability to predict the process variable of a valve. These values can be used by the adaptive thresholding method to determine faults in a valve seconds before they occur. If accurate enough, these seconds could be the difference between an emergency shutdown and a catastrophe.

In this thesis, we have shown that a judicious combination of technologies, namely the DDE data transfer protocol, auto-associative neural networks, empirical and physical models, and virtual reality environments, can be used to develop a diagnostic procedure for assessing the integrity of rocket engine test stand components. We have specifically focused on valves because they are critical to the cryogen transport mechanisms that are vital to test operations. This project is in the area of an identified core competency at John C. Stennis Space Center, specifically in the technology focus area of ISHM user interfaces. The project addressed the development of an effective interface between the ISHM and its users in order to reduce information overload in the typically crowded environments of complex system control rooms. We have designed, developed and validated a user interface that presents information related to system health and supports the user's navigation through diagnostic scenarios, with the ability to extract and visualize the required system details.

5.2 Recommendations for Future Work

The state of the ISHM functional art is hampered by a number of factors; a major constraint is the unavailability of intelligent process models that can provide a reasoned determination of element condition based on the available data sources that feed the ISHM architecture. One of the significant challenges is to develop realistic models for the most common and problem-prone elements.
Surprisingly, there are major gaps in our understanding of how even fundamental elements (such as valves in a rocket engine test stand) degrade and, more importantly, of how to determine the remaining operational life available from a valve or any other similar component. And, if an anomaly is detected, what are the best means of providing a user with efficient tools to explore the nature of the anomaly and its possible effects on the element, as well as its relationship to the overall system state? This thesis has addressed a part of the problem by providing a framework for diagnosing the integrity of a specific test-stand component, the large linear actuator valve.

The next steps in expanding this research work will involve the design, development and validation of prognosis algorithms that can predict potential anomalies in a reasonable time frame before they actually occur. This recognizes the fact that in a test-stand environment, by the time a fault is diagnosed, it is usually too late to remedy the problem. The subsequent addition of a prognosis module to the intelligent valve model will allow test operations personnel to initiate "what if?" queries and enhance the ability to perform a comprehensive risk analysis of every test procedure. The combination of the analysis and prognosis algorithms can be used to arrive at a model that can predict the remaining useful life of a test-stand component such as a valve; making such predictions provides a significant capability enhancement to ISHM platforms.

The research work presented in this thesis expands upon a prior ISHM framework that utilizes smart sensors by developing diagnostic tools that can track changing health conditions in dynamic systems. This work has the potential to advance sensor data fusion and integration to the degree required to achieve the benefits that are necessary to support next-generation space exploration missions.

References

[1] J. Schmalzel, F. Figueroa, J. Morris, R. Polikar, and S.
Mandayam, "An architecture for intelligent systems based on smart sensors," IEEE Transactions on Instrumentation and Measurement, vol. 54, no. 4, pp. 1612-1616, August 2005.

[2] G. Vachtsevanos, F. L. Lewis, M. Roemer, A. Hess, and B. Wu, Intelligent Fault Diagnosis and Prognosis for Engineering Systems, 1st ed. Hoboken, United States of America: John Wiley & Sons, Inc., 2006.

[3] D. Schrage, D. DeLaurentis, and K. Taggart, "FCS Study: IPPD Concept Development Process for Future Combat Systems," Georgia Institute of Technology, Atlanta, Georgia, AIAA MDO Specialists Meeting, September 2002.

[4] NASA, "NASA Reliability Centered Maintenance (RCM) Guide for Facilities and Collateral Equipment," NASA, Maintenance Guide, 2008.

[5] M. B. Mengel, W. L. Holleman, and S. A. Fields, Eds., Fundamentals of Clinical Practice, 2nd ed. New York, United States of America: Kluwer Academic/Plenum Publishers, 2002.

[6] J. K. Shim and J. G. Siegel, Handbook of Financial Analysis, Forecasting and Modeling, 2nd ed. Chicago, United States of America: CCH Incorporated, 2004.

[7] NASA History Division. (2010, January) NASA History. [Online]. http://history.nasa.gov/

[8] NASA Ames Research Center. (2005, March) NASA - Design Principles for Robust ISHM. [Online]. http://www.nasa.gov/centers/ames/research/technologyonepagers/design_principles.html

[9] F. Figueroa, R. Holland, J. Schmalzel, and D. Duncavage, "Integrated System Health Management (ISHM): Systematic Capability," IEEE Sensors Applications Symposium, Houston, 2006, pp. 202-206.

[10] Pratt and Whitney Rocketdyne. (2010, January) J-2X. [Online]. http://www.pw.utc.com/Products/Pratt+&+Whitney+Rocketdyne/J-2X

[11] NASA. (2010, January) Propulsion Testing at NASA's John C. Stennis Space Center. [Online]. http://www.nasa.gov/centers/stennis/pdf/372105main_FS-2008-1000071-SSC.pdf

[12] M. Currie, "Where did all the People Go? The New Case for Condition Monitoring," Chicago, 2006.

[13] M. Fargnoli, E. Rovida, and R.
Troisi, "An example of a morphological matrix," The 4th International Conference on Axiomatic Design, Florence, 2006.

[14] Z. Fan and J. Ma, "An Approach to Multiple Attribute Decision Making Based on Incomplete Information on Alternatives," Thirty-Second Annual Hawaii International Conference on System Sciences, vol. 6, Maui, 1999, p. 6041.

[15] T. Marchant et al., Evaluation and Decision Models - A Critical Perspective (International Series in Operations Research and Management Science, Volume 32). Norwell, United States of America: Kluwer Academic Publishers, 2000.

[16] S. G. Arunajadai, S. J. Uder, R. B. Stone, and I. Y. Tumer, "Failure Mode Identification Through Clustering Analysis," Quality and Reliability Engineering International, vol. 20, no. 5, pp. 511-526, April 2004.

[17] Society of Automotive Engineers, "Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA)," Automotive Quality and Process Improvement Committee, Standard SAE J1739, 2009.

[18] FMEA-FMECA.com. (2009, August) FMEA / FMECA Information. [Online]. www.fmea-fmeca.com

[19] D. H. Stamatis, Failure Mode and Effect Analysis: FMEA from Theory to Execution, 2nd ed., P. O'Mara, Ed. Milwaukee, United States of America: William A. Tony, 2003.

[20] R. E. McDermott, J. R. Mikulak, and M. R. Beauregard, The Basics of FMEA, 2nd ed. New York, United States of America: Productivity Press, 2008.

[21] NASA Lewis Research Center, "Tools of Reliability Analysis: Introduction and FMEAs," Cleveland, Presentation, 2009.

[22] P. D. T. O'Connor, Practical Reliability Engineering, 4th ed. Hoboken, United States of America: John Wiley & Sons Inc., 2002.

[23] C. Bunis et al., Design for Reliability, 1st ed., D. Crowe and A. Feinberg, Eds. Lowell, United States of America: CRC, 2001.

[24] E. Crow, K. Reichard, J. Banks, and L. Weiss.
(2005, February) Penn State Applied Research Laboratory. [Online]. http://csrp.psu.edu/files/ishm2005/ishm_reichard.pdf

[25] A. Bayoumi et al. (2008, February) Condition-Based Maintenance at University of South Carolina. [Online]. http://cbm.me.sc.edu/pubs/AHS1.pdf; http://cbm.me.sc.edu/pubs/AHS3.pdf

[26] A. Bandes, "What You Need to Know About Ultrasound CBM," Pumps & Systems, pp. 60-61, December 2006.

[27] T. Wireman, Computerized Maintenance Management Systems, 2nd ed. New York, United States of America: Industrial Press, 1994.

[28] University of South Carolina. (2009, February) College of Engineering and Computing Condition-Based Maintenance. [Online]. http://cbm.me.sc.edu/pubs.html

[29] S. X. Ding, Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools. Berlin, Germany: Springer-Verlag, 2008.

[30] H. Park, W. Pedrycz, and S. Oh, "Granular Neural Networks and Their Development Through Context-Based Clustering and Adjustable Dimensionality of Receptive Fields," IEEE Transactions on Neural Networks, vol. 20, no. 10, pp. 1604-1616, October 2009.

[31] G. M. Davis, Ed., Noise Reduction in Speech Applications. Boca Raton, United States of America: CRC, 2002.

[32] E. Micheli-Tzanakou, Ed., Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence. Boca Raton, United States of America: CRC Press LLC, 2000.

[33] I. Guyon et al., Eds., Feature Extraction: Foundations and Applications. Berlin, Germany: Springer-Verlag, 2006.

[34] L. Ljung, System Identification: Theory for the User, 2nd ed. Upper Saddle River, United States of America: Prentice Hall PTR, 2007.

[35] V. Puig, J. Quevedo, T. Escobet, F. Nejjari, and S. de las Heras, "Passive Robust Fault Detection of Dynamic Processes Using Interval Models," IEEE Transactions on Control Systems Technology, vol. 16, no. 5, pp. 1083-1089, September 2008.

[36] H. Bassily, R. Lund, and W.
John, "Fault Detection in Multivariate Signals With Applications to Gas Turbines," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 835-842, March 2009.

[37] C. H. Lo, E. H. K. Fung, and Y. K. Wong, "Intelligent Automatic Fault Detection for Actuator Failures in Aircraft," IEEE Transactions on Industrial Informatics, vol. 5, no. 1, pp. 50-55, February 2009.

[38] G. Spitzlsperger, C. Schmidt, G. Ernst, H. Strasser, and M. Speil, "Fault Detection for a Via Etch Process Using Adaptive Multivariate Methods," IEEE Transactions on Semiconductor Manufacturing, vol. 18, no. 4, pp. 528-533, November 2005.

[39] W. R. A. Ibrahim and M. M. Morcos, "An Adaptive Fuzzy Self-Learning Technique for Prediction of Abnormal Operation of Electrical Systems," IEEE Transactions on Power Delivery, vol. 21, no. 4, pp. 1770-1777, October 2006.

[40] S. Huang and K. K. Tan, "Fault Detection and Diagnosis Based on Modeling and Estimation Methods," IEEE Transactions on Neural Networks, vol. 20, no. 5, pp. 872-881, May 2009.

[41] J. Yun, K. Lee, K. Lee, S. B. Lee, and J. Yoo, "Detection and Classification of Stator Turn Faults and High-Resistance Electrical Connections for Induction Machines," IEEE Transactions on Industry Applications, vol. 45, no. 2, pp. 666-674, March/April 2009.

[42] Financial Forecast Center, LLC. (2009, November) Financial Forecast Center Home Page. [Online]. http://www.forecasts.org/

[43] A. Rodgers and A. Streluk, Forecasting the Weather, 2nd ed. Chicago, United States of America: Reed Elsevier Inc., 2007.

[44] F. P. et al., "A Generic Prognostic Methodology Using Damage Trajectory Models," IEEE Transactions on Reliability, vol. 58, no. 2, pp. 277-285, June 2009.

[45] Z. Sun, J. Wang, D. Howe, and G. Jewell, "Analytical Prediction of the Short-Circuit Current in Fault-Tolerant Permanent-Magnet Machines," IEEE Transactions on Industrial Electronics, vol. 55, no. 12, pp. 4210-4217, December 2008.

[46] Y.
Zhang et al., "Connected Vehicle Diagnostics and Prognostics, Concept, and Initial Practice," IEEE Transactions on Reliability, vol. 58, no. 2, pp. 286-294, June 2009.

[47] M. Baybutt, C. Minnella, A. E. Ginart, P. W. Kalgren, and M. J. Roemer, "Improving Digital System Diagnostics Through Prognostic and Health Management (PHM) Technology," IEEE Transactions on Instrumentation and Measurement, vol. 58, no. 2, pp. 255-262, February 2009.

[48] P. Lall, M. N. Islam, M. K. Rhim, and J. C. Suhling, "Prognostics and Health Management of Electronic Packaging," IEEE Transactions on Components and Packaging Technologies, vol. 29, no. 3, pp. 666-677, September 2006.

[49] S. K. Yang, "A Condition-Based Failure-Prediction and Processing-Scheme for Preventive Maintenance," IEEE Transactions on Reliability, vol. 52, no. 3, pp. 373-383, September 2003.

[50] A. H. Al-Badi, S. M. Ghania, and E. F. El-Saadany, "Prediction of Metallic Conductor Voltage Owing to Electromagnetic Coupling Using Neuro Fuzzy Modeling," IEEE Transactions on Power Delivery, vol. 24, no. 1, pp. 319-327, January 2009.

[51] Society of Automotive Engineers, "Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes," Standards Report SAE JA1011, 1998.

[52] M. Kramer, "Nonlinear Principal Component Analysis Using Autoassociative Neural Networks," AIChE Journal, vol. 37, no. 2, pp. 233-243, February 1991.

[53] L. D. Mattern, C. L. Jaw, T. Guo, R. Graham, and W. McCoy, "Using Neural Networks for Sensor Validation," 34th Joint Propulsion Conference, Cleveland, 1998.

[54] J. H. Lienhard IV and J. H. Lienhard V, A Heat Transfer Textbook, 3rd ed. Cambridge, United States of America: Phlogiston Press, 2008.

[55] S. J. McPhee and M. Papadakis, Current Medical Diagnosis and Treatment 2009, 48th ed. New York, United States of America: McGraw-Hill Professional, 2009.

[56] D. Ruppert, Statistics and Finance: An Introduction, 1st ed., G. Casella, S. Fienberg, and I. Olkin, Eds.
New York, United States of America: Springer-Verlag, 2004.

[57] R. Mimick, M. Thompson, and S. W. William, Business Diagnostics 2005: Evaluate and Grow Your Business. Victoria, Canada: Trafford, 2005.

[58] J. Schmalzel and F. Figueroa, "Rocket Testing and Integrated System Health Management," in Condition Monitoring and Control for Intelligent Manufacturing, D. T. Pham, Ed. London, England: Springer London, 2006, ch. 15, pp. 373-391.