Detection and Diagnosis of Faults and Energy Monitoring of HVAC Systems with Least-Intrusive Power Analysis

by

Dong Luo

M.E., Thermal Engineering, Tianjin University, 1991

Submitted to the Department of Architecture in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Architecture: Building Technology at the Massachusetts Institute of Technology, February 2001

© 2001 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Department of Architecture, January 12, 2001

Certified by: Leslie K. Norford, Associate Professor of Building Technology, Thesis Supervisor

Accepted by: Stanford Anderson, Chairman, Departmental Committee on Graduate Students; Head, Department of Architecture

Thesis Committee:
Leon R. Glicksman, Professor of Building Technology
Qingyan Chen, Associate Professor of Building Technology

Detection and Diagnosis of Faults and Energy Monitoring in HVAC Systems with Least-Intrusive Power Analysis

by Dong Luo

Submitted to the Department of Architecture on January 12, 2001 in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Architecture: Building Technology

ABSTRACT

Faults indicate degradation or sudden failure of equipment in a system. Widespread in heating, ventilating, and air conditioning (HVAC) systems, faults lead to inefficient energy consumption, undesirable indoor air conditions, and even damage to mechanical components. Continuous monitoring of the system and analysis of faults and their major effects are therefore crucial to identifying faults at an early stage and making decisions for repair. This requires a method of fault detection and diagnosis (FDD) that is not only sensitive and reliable but also causes minimal interruption to the system's operation at low cost.
However, because they rely on additional sensors for component-specific information or on black-box modeling, current fault detection and diagnosis methods either introduce too much interruption to the system's normal operation through sensor installation at unacceptable cost, or require a long time for parameter training. To solve these problems, this thesis first develops major innovations to a change detection algorithm, the generalized likelihood ratio (GLR), to extract useful information from the system's total power data. Then, in order to improve the quality of detection and simplify the training of the power models, appropriate multi-rate sampling and filtering techniques are designed for the change detector. From the detected variations in the total power, performance at the system level is examined, and general problems associated with unstable control and on/off cycling can be identified. With information that is basic to common HVAC systems, power functions are established for the major components, which help to obtain more reliable detection and more accurate estimation of the system's energy consumption. In addition, a method for the development of expert rules based on semantic analysis is set up for fault diagnosis. Power models at both the system and component levels developed in this thesis have been successfully applied to tests in real buildings and provide a systematic way to perform FDD in HVAC systems at low cost and with minimal interruption to systems' operation.

Thesis Supervisor: Leslie K. Norford
Title: Associate Professor of Building Technology

DEDICATION

To my family, who are always behind me

ACKNOWLEDGEMENTS

I would like to thank Professor Leslie Norford, my thesis advisor, for his guidance and support throughout my Ph.D. study. Also, I want to thank Professor Leon Glicksman and Professor Qingyan Chen, my thesis committee, for their valuable advice on my thesis.
I wish to express my special gratitude to my family for their love and support.

CONTENTS

ABSTRACT 5
DEDICATION 7
ACKNOWLEDGEMENTS 9
CONTENTS 11

1. INTRODUCTION 13
   1.1 Background of this study 13
   1.2 Model-based fault detection and diagnosis in HVAC systems 15
   1.3 Approaches to solve the problems of current FDD models 17
   1.4 Aims of this thesis 18

2. THEORY OF CHANGE DETECTION 21
   2.1 Review of the non-intrusive load monitoring 21
   2.2 Steady-state change detection 22
   2.3 Theory of abrupt change detection 24
   2.4 Summary 32

3. SYSTEM POWER MODELING FOR FAULT DETECTION AND DIAGNOSIS 33
   3.1 Introduction of on-line change detection in power data of the HVAC systems 34
   3.2 Modification of the GLR: two-window equation vs. one-window equation 36
   3.3 Training of the parameters for the GLR detector 41
   3.4 Improvements of the detection algorithm 45
   3.5 The median filter 48
   3.6 Change + oscillation detector 49
   3.7 Monitoring of the total power data: multi-rate vs. single-rate sampling 61
   3.8 Summary of the training guidelines 62
   3.9 Application of the GLR model in fault detection with system's total power input 72
   3.10 Results and discussion

4. COMPONENT POWER MODELING FOR FAULT DETECTION AND DIAGNOSIS 75
   4.1 Introduction 75
   4.2 Component power modeling by correlation with basic measurements or signals 76
   4.3 Error analysis: confidence intervals of the estimated models 87
   4.4 Fault detection and diagnosis of HVAC systems with submetered power input 91
   4.5 Application of the power models in fault detection with submetered power input 94
   4.6 Discussion and conclusions 107

5. MONITORING OF COMPONENTS THROUGH THE TOTAL POWER PROFILE 109
   5.1 Introduction 109
   5.2 Parameter selection for component modeling with system power monitoring 110
   5.3 Analysis of the detected changes for component modeling 111
   5.4 Application of the gray-box model in fault detection and energy estimation 119
   5.5 Discussion and conclusions 127

6.
DIAGNOSIS OF FAULTS BY CAUSAL SEARCH WITH SIGNED-DIRECTED-GRAPH RULES 129
   6.1 Introduction 129
   6.2 Fault diagnosis by shallow reasoning with system's total power input 132
   6.3 Deep knowledge expert system 134
   6.4 Diagnosis based on causal search of fault origin 135
   6.5 SDG rule development for fault diagnosis with power input of components 143
   6.6 Rules for detection and diagnosis of typical faults in common air handling units 156
   6.7 Discussion and conclusions 162

7. SUMMARY 165
   7.1 Review 165
   7.2 Achievements and future work 168

REFERENCES 171

NOMENCLATURE 177

APPENDIX: DESCRIPTIONS OF THE TEST SYSTEM AND THE FAULTS 179

CHAPTER 1
Introduction

1.1 Background of this study

Faults in an HVAC system indicate that some components are not running properly according to the design intent. Faults can be divided into two general groups based on the abruptness of their occurrence: gradual degradation and abrupt failure of components [Annex 25]. A degradation fault emerges after some time of operation and requires calibration or adjustment when the error exceeds a threshold, such as a static pressure sensor offset in the fan-duct loop. A failure means the equipment suddenly stops working and usually needs immediate maintenance or replacement to resume normal operation of the system, e.g., a stuck damper. In general, an abrupt fault is easier to detect because it results in sudden failure of equipment and obvious changes in the monitored parameters. However, this does not necessarily mean that an abrupt fault is easier to diagnose. Unacceptable deviations in the monitored parameters caused by a degradation fault are determined with reference to thresholds that must be evaluated by cost-benefit analysis. The potential benefits of detecting and correcting a fault include energy savings, desirable indoor air quality, and minimal interruption to the normal equipment operating schedules.
By definition, any kind of fault has undesirable effects on the system, including inefficient use of energy, an uncomfortable working environment, and even damage to mechanical components. Excess energy consumption caused by a fault may range from a few percent to several times the design value, depending on the severity of the fault [Luo et al. 2001][Kao and Pierce, 1983]. A fault may not directly affect the indoor air quality of the conditioned space, since the control system can sometimes compensate for the loss of conditioning capacity. But this is often achieved at the expense of excess energy consumption. For example, an offset of the static pressure sensor increases the speed of the supply fan and causes more energy to be wasted as the VAV dampers close up in order to meet the indoor air temperature setpoint. If a fault cannot be compensated by the control system itself, the working environment becomes uncomfortable or even unhealthy, as with the insufficient fresh air supply caused by a stuck-closed outdoor air damper. Sometimes, a fault may lead to rapid damage of components. A typical case is unstable control of a component, where the oscillatory control signal may quickly accelerate wear of the mechanical parts. In practice, faults exist widely in operating HVAC systems. The performance of many HVAC systems is affected by various types of faults that may result from poor installation, commissioning, and maintenance. A recent study has indicated that 20-30% energy savings in commercial buildings can be achieved by re-commissioning HVAC systems to rectify faulty operation [Annex 25]. Current building energy management systems are widely used to automate HVAC systems' operation and to prevent critical faults. However, manufacturers offer very few tools for detection and diagnosis of the defects that cause faulty operation.
Therefore, it is very important to find an effective method to minimize the negative effects of faults. Fault detection aims to determine whether the observed behavior of the monitored object deviates beyond an acceptable range from the expected performance. Fault diagnosis is to find which of the possible causes is most consistent with the observed faulty conditions. For engineering systems with zero fault tolerance, as in the control system of a nuclear reactor, a fault is usually detected and immediately removed by hardware redundancy, with the faulty part or sensor substituted by another and an alarm issued. Hardware redundancy involves the use of multiple components for the same purpose. Faults are determined by majority vote among the redundant sensors. Although physically more reliable, this technique is expensive, bulky, and limited in ability [Rossi et al. 1996]. The alternative is analytical redundancy, which exploits for FDD the inherent relationships between a system's inputs and outputs. These relationships can be described with mathematical models or with collections of rules. For less critical systems like ordinary HVAC systems, hardware redundancy is rarely used due to the extra cost and the additional space required to accommodate the equipment. Therefore, in order to keep track of the operation of the system, analytical models must be established. Analytical models for fault detection and diagnosis can be implemented in two steps: estimation and classification. Estimators are generally mathematical models based on either physical laws or black-box estimation. They can generate two types of features, residuals (or innovations) and parameters, depending on the monitored output. A residual is the difference of state estimates between the plant and the nominal models. Model parameters, derived from state variables, are compared between estimates of the current and fault models for given inputs.
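As a simple illustration of the residual-based estimation step described above, the following sketch compares a plant measurement against a nominal model's prediction and flags a fault when the residual exceeds a threshold. The fan power law, numerical values, and threshold here are hypothetical, chosen only to make the idea concrete; they are not from the thesis.

```python
# Sketch of residual-based fault detection: a nominal model predicts the
# expected output, the residual is the gap between the plant measurement and
# the prediction, and an alarm is raised when the residual exceeds a threshold.
# All names and numbers are illustrative assumptions.

def nominal_fan_power(speed_signal):
    """Hypothetical nominal model: fan power ~ cube of the speed signal."""
    return 10.0 * speed_signal ** 3  # assumed 10 kW draw at full speed

def detect_fault(measured_power, speed_signal, threshold=0.5):
    """Return (residual, alarm) for one sample, in kW."""
    residual = measured_power - nominal_fan_power(speed_signal)
    return residual, abs(residual) > threshold

# Normal operation: measurement close to prediction, no alarm
residual, alarm = detect_fault(measured_power=10.1, speed_signal=1.0)
assert not alarm

# Faulty operation: excess power draw triggers an alarm
residual, alarm = detect_fault(measured_power=12.0, speed_signal=1.0)
assert alarm
```

The threshold plays the role of the classifier's decision logic: it must be set above the noise level of normal operation but below the deviation caused by faults worth reporting.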
Classifiers can be systems based on either expert knowledge or artificial neural networks. An expert system is a machine that emulates the reasoning process of a human expert in a particular domain [Rossi et al. 1996]. It comprises two important parts: a knowledge base and an inference engine. The knowledge base contains expert knowledge about the domain. The inference engine combines data about a particular problem with the expert knowledge to provide a solution. Expert knowledge can be expressed as production rules, stored as a collection of a priori and conditional probabilities in statistical pattern recognition classifiers, or set up as semantic networks. Evaluation made in the classifier is often based on the criteria of economy, comfort, safety, and environmental hazard. As shown by the diagram in Fig. 1.1, estimators take the plant inputs as control signals or measurements from sensors, perform quantitative operations, and generate simplified features for classification. Classifiers operate on the residuals or parameters based on appropriate logic with thresholds and rules. They then either issue alarms, if the deviation is not acceptable, and give an analysis of possible causes of the fault, or keep a log of the operating status if no abnormal difference is observed.

Figure 1.1. Diagram of fault detection and diagnosis with an analytical model. Consisting of measurements and/or control signals, the vector X represents the inputs to both the plant and the estimator. Y and Ŷ are vectors of outputs from the plant and predictions by the model, respectively.

1.2 Model-based fault detection and diagnosis in HVAC systems

1.2.1 Methods of fault detection and diagnosis

Three types of models may be used for fault detection and diagnosis: plant, reference, and fault models.
A plant model keeps a record of the current operating features of the system, a nominal model provides the expected performance of the system under normal operation for given inputs, and a fault model represents how the system would respond to given inputs if a specific fault occurred. Fault detection can be achieved by comparing the outputs of the plant model against those of the nominal model. Diagnosis can be conducted by selecting the fault model that provides features closest to the current operation. There are two basic types of reference models [Benouarets et al. 1994]: physical and black-box models. Physical models are mainly established from analysis of the system based on first principles, though some empirical relationships may be assumed. Black-box models are empirical models that do not use any prior knowledge of the physical processes. They can only be constructed with training data generated from the monitored system itself or by simulation of the system. Fault detection and diagnosis based on physical laws have been studied and used in simulation since the early 1980s and progressed rapidly during the 1990s, especially with the establishment of Annex 25 within the framework of the International Energy Agency (IEA). Pape et al. [1991] developed a methodology for fault detection in HVAC systems based on optimal control using simulation, in which deviations from optimal performance are detected by comparing the measured system power against the power predicted with the optimal strategy. Haves et al. [1996] described a condition monitoring scheme used to detect the presence of valve leakage and waterside coil fouling within the cooling coil subsystem of an air handling unit based on first-principle models. The major advantage of a physical model is that the prior knowledge, and hence the better understanding of the physical process, enables the model to be extrapolated to regions of the operating range where no training data are available.
Also, the parameters in a physical model are usually related to physically meaningful quantities, which not only makes it possible to estimate their values from design information and manufacturers' catalogs but also provides an opportunity for the observer to associate abnormal parameter values with the presence of a specific fault. However, physical models often require the installation of additional sensors for detailed information about the monitored components, which is not only limited by the extra cost of sensors but also leads to interruption of the system's operation. Moreover, a large set of nonlinear differential and algebraic equations needs to be set up to define the system's behavior, because HVAC systems are often composed of a number of subsystems, each of which may exhibit time-varying and/or nonlinear characteristics. For example, a detailed description of the dynamics of a typical five-zone commercial HVAC system requires on the order of 1,000 differential and algebraic equations [Kelly et al. 1984]. This often makes the solution prohibitively time-consuming, and yet the results are often not satisfactory. In addition, the parameters of this dynamical description generally vary with load, weather, and building occupancy, such as the heat transfer coefficient of a cooling coil. These limitations make it almost impossible to use complete physical models for real-time fault detection and diagnosis in practice. Two major methods have been used for black-box models during the past ten years: ARX (autoregressive with exogenous input) or its extension ARMAX (autoregressive moving average with exogenous input), and ANN (artificial neural network). Lee et al. [1996] examined ARMAX/ARX models for a laboratory VAV unit with both SISO (single-input/single-output) and MISO (multi-input/single-output) structures, with model parameters determined using the Kalman filter recursive identification method.
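To make the ARX structure concrete, the sketch below fits a first-order single-input ARX model, y(k) = a·y(k-1) + b·u(k-1) + e(k), by ordinary least squares on synthetic data. This is a generic illustration of the model form, not the Kalman-filter recursive identification used by Lee et al.; all coefficients and data are made up.

```python
# Illustrative ARX(1,1) identification: simulate y(k) = a*y(k-1) + b*u(k-1) + e(k)
# with known "true" coefficients, then recover (a, b) by solving the 2x2
# least-squares normal equations. Coefficients and noise level are assumptions.
import random

random.seed(0)
a_true, b_true = 0.8, 0.5
N = 300
u = [random.gauss(0, 1) for _ in range(N)]   # exogenous input, e.g. a control signal
y = [0.0] * N
for k in range(1, N):
    y[k] = a_true * y[k - 1] + b_true * u[k - 1] + 0.01 * random.gauss(0, 1)

# Normal equations for theta = (a, b) minimizing sum (y[k] - a*y[k-1] - b*u[k-1])^2
S11 = sum(y[k - 1] * y[k - 1] for k in range(1, N))
S12 = sum(y[k - 1] * u[k - 1] for k in range(1, N))
S22 = sum(u[k - 1] * u[k - 1] for k in range(1, N))
c1 = sum(y[k - 1] * y[k] for k in range(1, N))
c2 = sum(u[k - 1] * y[k] for k in range(1, N))
det = S11 * S22 - S12 * S12
a_hat = (S22 * c1 - S12 * c2) / det
b_hat = (S11 * c2 - S12 * c1) / det
assert abs(a_hat - a_true) < 0.05 and abs(b_hat - b_true) < 0.05
```

Because the model is linear in its parameters, the fit reduces to a small linear system even though the plant it approximates may be nonlinear, which is the computational advantage noted below.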
Peitsman and Soethout [1997] established a real-time ARX model for a simulated VAV system. Li et al. [1996] demonstrated the feasibility of using ANNs for FDD of a specific heating, ventilating, and air-conditioning (HVAC) system. Peitsman and Bakker [1996] demonstrated that ANN models fit nonlinear systems better than ARX models after studying several faults in a laboratory chiller and a simulated VAV system with both MISO ARX models and ANN models. Black-box models have been studied and tested with simulation and laboratory units in many research projects in recent years because such methods do not require detailed knowledge of the system and therefore do not require the user to know much about the physical processes. Also, forms that are linear in the parameters can be selected to make the parameter estimation more computationally efficient and more robust even if the system itself is nonlinear. However, with no physical meaning in its parameters, the results of a black-box model cannot be reliably extrapolated. This drawback greatly limits the use of black-box models, because faulty conditions are often beyond the range of normal expectations, making it impossible to train the parameters of a fault model in the absence of the faults in the system. Meanwhile, the opaque nature of a black-box model makes it difficult to understand and keep track of the performance of the model itself. For example, an improper guess of the initial values of the weight factors and offsets of a neural network often results in an inappropriately trained network [Peitsman et al. 1996]. For such reasons, black-box models for fault detection and diagnosis in HVAC systems are still largely in the experimental phase. In addition to the problems of the algorithms themselves, most FDD techniques are based on the examination of controlled parameters against their setpoints.
In modern HVAC control systems, a hierarchical structure is generally used for the control logic in commercial applications, and variables are often controlled in local feedback loops. A typical application is zone temperature control with a VAV box. Offsets in the controlled variables can usually be compensated or cancelled in the local control loop, leaving the upper levels of the system to absorb the error at excessive energy cost. If no further supervision of the system's operation is available, the effects can only be seen in some uncontrolled variables as the ultimate outcome, such as the power input of a system or a component. For example, a static pressure sensor offset can lead to considerable excess power consumption while the room air temperature and other controlled points are still kept within the normal range. A stuck-closed outdoor air damper in an HVAC system not only causes significant waste of energy but also results in an unhealthy environment during the transitional period when the damper is expected to be wide open. But the faulty operation may never be found from measurements of the indoor air temperature or other controlled parameters. This indicates that traditional FDD techniques based on controlled variables are often insensitive to malfunctions in the system and hence fail to estimate the outcome of the defects for appropriate maintenance or repair. In practice, energy waste due to neglect of faults in a system may become significant, especially after a long period of operation, when a degradation fault unavoidably occurs and keeps worsening the energy efficiency of the system without notice. Moreover, in the diagnosis of a fault, the decision for further action needs to be based on the major evaluation factors at the system level, such as cost vs. benefit, rather than on the monitored local variables.
The above review reveals five major problems associated with the current methods used for fault detection and diagnosis in HVAC systems:

1. Extra costs and interruption to the system's operation caused by the installation of additional sensors for detailed information about the monitored components, as required by physical models;
2. Computational load imposed by physical models;
3. Lack of capacity for model manipulation and the uncertainty in output extrapolation of black-box models;
4. Insensitivity to faults that may cause inefficient operation and/or undesirable indoor air quality when controlled variables are used as the primary detection indices;
5. Insufficient information for cost-conscious decisions.

1.3 Approaches to solve the problems of current FDD models

To solve the above problems, the FDD model should make full use of the available information, issue timely alarms or reports about the system's status even if the design specifications for the controlled variables are met, and produce physically meaningful output and the cost effects of faults for further maintenance if necessary. For real-time FDD applications, the model also needs to be easily incorporated into the present control system, which indicates that the FDD model should rely on as little information as possible beyond the signals readily available from the building energy management and control system, such as the control signal for the supply fan speed. So far, limited research has been carried out to address such issues. Seem et al. [1999] described a detection algorithm to reduce the data load for online computing. Rossi et al. [1997] described a statistical rule-based fault detection and diagnostic method for vapor compression air conditioners, which uses only temperature and humidity measurements. However, such work only improves the data communication problem but does not make fundamental changes in the modeling itself.
In order to minimize the intrusion into the system, a more efficient method needs to be found at the system level. In this research, on the basis of tests and observations of the relationship between system performance and the data trend of the electrical power input of buildings, it has been found that, as the ultimate reflection of the system's energy cost, power consumption can be used to detect faults regardless of the status of the controlled variables in the system. Also, FDD based on power consumption directly demonstrates the effects of the current operating conditions on the system's cost and hence enables decisions about appropriate maintenance. In addition, measurements of power consumption involve only the electric circuits of the energy system; they impose little intrusion into the system's mechanical structure and hence introduce minimal interruption to the system's operation. In this thesis, with power consumption as the major criterion for detection and diagnosis of faults, a gray-box model is developed with statistical estimation of the system's total power as well as with power functions of the major components obtained through fundamental analysis of the system's structure and control logic. In addition, with the power model as the primary fault indicator, an effective method for the causal search of fault origin is established through semantic analysis of the general physical processes of common HVAC systems. With the structured inference logic, the expert rules for the identification of a fault can easily be developed for any fault in a given system.

1.4 Aims of this thesis

In this research, a statistical model is first developed using the total electrical power input of a system to identify changes in the data trend. With the detected changes, abnormal patterns of power consumption of the major components can be identified with reference to the design information.
Models at a more detailed level, based on submetered measurements of power and the related variables, are also developed for detection. Then, the feasibility of modeling a component's power input with the detected changes in the total power series and basic measurements or control signals is studied, and an effective method is proposed for component power models by appropriate combination of the change detection and component modeling techniques. The resulting function can be used to detect faults of the component and, in return, help to improve the resolution of detection at the system level. In addition, the power function provides an efficient tool for accurate energy estimation and appropriate evaluation of the system and the components as well. With the total power logger and other signals inherent to common control systems, this method yields a timely response to abnormal changes in the system with little interruption to the system's operation. The power models developed in this thesis can be applied to real building systems under different load conditions. With the deviations found by the power models, a knowledge-based approach is proposed from semantic analysis of typical faults in common HVAC systems. To find undesirable operation at an early stage, before the deterioration of control quality, an inference structure with power input as the end node of a digraph is developed. Innovations are introduced in the inference logic to allow progressive alarms as violations accumulate. Modifications of the rules can easily be implemented to apply the inference structure to different systems. This thesis is organized as follows. Chapter 2 introduces the theory of change detection. Chapter 3 develops the algorithm for detection of abnormal power input of HVAC systems. Chapter 4 describes the principle of the fitting algorithm and component power modeling with submeters.
Chapter 5 develops the gray-box model by applying detection at the system level to component power modeling. Chapter 6 explores the digraph-based rule development technique for diagnosis of faults in HVAC systems. Chapter 7 summarizes the whole thesis.

CHAPTER 2
Theory of Change Detection

This chapter motivates the development of a steady-state change detector for applications in both residential and commercial buildings. Electrical power data trends and algorithms for abrupt change detection in common HVAC systems will be examined. First, the concept of non-intrusive load monitoring (NILM) is introduced, and change detection in the power series of HVAC systems under steady-state and transient conditions is discussed. The steady-state method is selected not only for its smaller computing load but also for its capability of dealing with the noisy power environment. Moreover, the steady-state method is itself a complete load monitoring system, while the transient method is not. Second, characteristics of power profiles in both residential and commercial buildings are illustrated, and motivations for new algorithms for change detection of power in commercial applications are proposed. Finally, the necessity and feasibility of the new methods for steady-state change detection of power consumption in commercial buildings are discussed. The algorithm based on the generalized likelihood ratio (GLR) is then selected according to the information available from typical HVAC systems in practice.

2.1 Review of the non-intrusive load monitoring

Monitoring changes in power data can determine whether designated equipment has been turned on or off. Also, by analyzing the characteristics of the change, abnormalities or faults may be identified. Conventional load monitoring requires a separate meter for the motor of each component of interest, which makes it very expensive and impractical, as a real system is usually driven by more than one motor.
A nonintrusive load monitor is designed to monitor an electrical circuit that contains a number of devices that switch on and off independently. By a comprehensive analysis of the current and voltage waveforms of the total load, the NILM estimates the number and nature of the individual loads and other relevant statistics. The key feature of the NILM compared to conventional load monitors is that no access to the individual components is necessary for sensor installation and measurements. For a system with multiple monitored components, the use of the NILM may significantly reduce the cost and the interruption to the system's operation caused by the installation of separate power meters. There are two general ways to conduct change detection: steady-state detection and transient detection. Steady-state change detection determines the magnitude of the power change caused by on/off events, while transient change detection analyzes the dynamic patterns of startup signals. Transient detection has been developed primarily to discriminate between two changes of equal amplitude, which can hardly be distinguished by the steady-state method if no additional information about the events is known in advance [Leeb 1992][Norford and Leeb 1996]. Although this method can yield earlier responses than the steady-state method and can distinguish some nearly overlapping events [Hart 1992][Norford and Leeb 1996], it suffers several major drawbacks. First, transient signals cannot be added, which makes it difficult to combine simultaneous events. Second, the transient method only makes sense for startup detection and cannot provide any useful information about the equipment during operation or at shutdown, which also plays an important role in both fault detection and energy estimation. If two pieces of equipment are turned on within a short time interval, transient response can tell the order of turn-on, but not the order of turn-off.
In addition, transient detection is easily degraded by the presence of oscillation or noise in the data series. Furthermore, transient detection only recognizes known patterns of equipment startups and cannot discover unanticipated but potentially interesting events. Finally, transient detection usually imposes a heavy burden on the computing facility for complex data processing, with a data sampling rate of not lower than 60 Hz to accommodate the fundamental frequency of the voltage waveform. On the other hand, steady-state detection is capable of dealing with almost all of the above problems, except that it cannot distinguish simultaneous events without other information for reference. However, this is not a critical issue, given that the future trend is for the fault detection system to be incorporated into the building energy management system, from which the on/off signals can readily be obtained by the detector for further discrimination. Based on the above analysis, steady-state detection is chosen as the basis of the detection strategy in this research. The definition of "steady state" should be noted. Steady-state conditions should not be interpreted so strictly that they rarely occur in practice. In this thesis, steady state is recognized if there are no significant abrupt changes in the detection windows, although the power data may still vary within them. This is because in commercial buildings, fluctuations and variations in power data are unavoidable due to the application of variable-speed motor drives as well as the existence of noise. Therefore, the power data are not constant even without on/off switching. This indicates that to obtain reliable detection output, a reasonable threshold is needed to identify a significant signal in the "steady state" environment.
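A minimal sketch of this idea compares the mean power in two adjacent sliding windows and flags a change only when the mean shift exceeds a threshold set above the noise level. The window length, noise magnitude, and threshold below are illustrative, not the values trained for the GLR detector developed later.

```python
# Sketch of threshold-based steady-state change detection: the data may
# fluctuate within each window, but an event is reported only when the mean
# shift between adjacent windows exceeds a noise-aware threshold (in watts).
def detect_changes(power, window=5, threshold=50.0):
    """Return sample indices where the mean of the leading window differs
    from the mean of the trailing window by more than the threshold."""
    events = []
    for t in range(window, len(power) - window):
        before = sum(power[t - window:t]) / window
        after = sum(power[t:t + window]) / window
        if abs(after - before) > threshold:
            events.append(t)
    return events

# Noisy but "steady" data around 700 W with a 100 W step at sample 20:
# the +/- 5 W fluctuation stays below threshold, the step does not.
data = [700 + (-1) ** i * 5 for i in range(20)] + \
       [800 + (-1) ** i * 5 for i in range(20)]
events = detect_changes(data)
assert events and all(15 <= t <= 25 for t in events)
```

Raising the threshold suppresses false alarms from fluctuations at the price of missing small on/off events, which is exactly the trade-off that motivates the statistical treatment in the following sections.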
2.2 Steady-state change detection

2.2.1 Power monitoring of residential buildings

One of the major characteristics of the power data of houses is that appliances are generally driven by constant-speed motors with on/off control or stepwise finite states. The noise is often negligible compared to the total power magnitude, as shown in Fig. 2.1 for a single-family house over a one-hour period, with data collected by Hart [1992].

[Figure 2.1. Total electrical power input vs. time of a single-family house; power vs. time over one hour (0-3600 s).]

By assuming time-invariant complex power for each appliance, a list of all the appliances in the system can be set up from the nameplates or the manufacturers' specifications. Change detection then becomes edge identification and can be accomplished by combinatorial optimization based on the following equations, with appropriate clustering techniques if necessary [Hart 1992]:

P_p(t) = Σ_{i=1}^{n} a_i(t) P_{p,i} + e(t)    (2.1)

â(t) = arg min_{a(t)} | P_p(t) - Σ_{i=1}^{n} a_i(t) P_{p,i} |    (2.2)

a_i(t) = 1 if appliance i is on at time t; a_i(t) = 0 if appliance i is off at time t.

With the above equations and constraint, if the vectors of the p-phase load P_{p,i} of each component i are known and the measured total power vector P_p(t) is given at each time t, the error e(t) is minimized by searching over the n-vector a(t). Application of a steady-state detector to a small house containing several constant-power appliances has been demonstrated in previous research [Hart 1992] by detecting the edges in the power series.

2.2.2 Power monitoring in commercial buildings

In commercial applications, the characteristics of the electrical power profile are quite different, and the usage patterns and types of equipment involved are more likely to generate power quality problems.
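As an illustration of the residential edge-identification approach of Section 2.2.1, the combinatorial search of Eq. (2.2) can be sketched in a few lines. The appliance list and nameplate ratings below are hypothetical, and a real implementation would detect step changes (edges) and use clustering rather than exhaustive search over large n:

```python
from itertools import product

def disaggregate(total_power, ratings, tol=50.0):
    """Brute-force form of Eq. (2.2): for each measured total power P(t),
    find the on/off vector a minimizing |P(t) - sum_i a_i * P_i|."""
    states = []
    for p in total_power:
        best = min(product((0, 1), repeat=len(ratings)),
                   key=lambda a: abs(p - sum(ai * pi for ai, pi in zip(a, ratings))))
        residual = abs(p - sum(ai * pi for ai, pi in zip(best, ratings)))
        # Reject fits whose residual e(t) exceeds the tolerance
        states.append(best if residual <= tol else None)
    return states

# Hypothetical nameplate ratings (W): refrigerator, water heater, oven
ratings = [150.0, 4500.0, 2000.0]
print(disaggregate([160.0, 4660.0, 2150.0], ratings))
# -> [(1, 0, 0), (1, 1, 0), (1, 0, 1)]
```

The exhaustive search is exponential in the number of appliances; Hart [1992] makes the approach practical by detecting step changes in the power series and clustering them, rather than solving the full optimization at every sample.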
First, electric surges and spikes caused by the startup of large motors or the automatic setpoint adjustments of some components make the data environment much noisier, which can be seen by comparing the power quality of a typical residential application in Fig. 2.1 with that of a commercial building in Fig. 2.2 [Norford and Leeb, 1996]. Second, variable-speed motor drives, which lead to gradual power changes, are very common in commercial buildings. And finally, even constant-speed motors consume variable power in response to changing load conditions in commercial buildings. With the presence of noise and variations in the power data of commercial buildings as shown in Fig. 2.2, the edge detection algorithm used in the residential case tends to cause unacceptable false alarm rates. Therefore, in order to achieve desirable detection performance, more efficient and robust algorithms based on statistical analysis are necessary.

[Figure 2.2. Power data with 4 on/off events of a pump in a campus building; power vs. time over 0-1080 s.]

2.3 Theory of abrupt change detection

In this section, methods for abrupt change detection are introduced and the GLR algorithm is developed as the major detection tool of this thesis. The material in this section follows the development in [Shanmugan and Breipohl 1988] and [Basseville and Nikiforov 1993].

2.3.1 Introduction to abrupt change detection

Abrupt changes are changes in the properties of the monitored object that occur very fast with respect to the sampling period of the measurements. In an industrial process, faults related to such changes can be divided into two general categories. The first is failures or catastrophic events that usually stop the operation and need to be identified immediately.
The second is smaller faults, sudden or gradual (incipient), which affect the process without causing it to stop but are of crucial practical interest in preventing the subsequent occurrence of more serious or catastrophic events. Both types of faults can be approached within the abrupt change detection framework. Detection of abrupt changes refers to tools that help to decide whether such a change has occurred in the characteristics of interest. With the increasing complexity of technological processes and the availability of sophisticated information processing systems over the last twenty years, applications of abrupt change detection have grown rapidly in many areas, from critical applications such as the prediction of natural catastrophic events, e.g., earthquakes and tsunamis, to other major industrial processes, such as quality control, vibration monitoring, and pattern recognition. A common objective of these applications is to detect abrupt changes in some characteristic properties of the monitored object, which usually can be described as the problem of detecting changes in the parameters of a static or dynamic stochastic system at unknown time instants. The major challenge is to identify intrinsic changes that are not necessarily directly observed and that are measured together with other types of perturbations [Basseville and Nikiforov 1993]. The design of a change detection and diagnosis algorithm generally consists of two major tasks: a). Derivation of the sufficient statistics. With known mean values before changes, the ideal residuals generated from the measurements should be close to zero. However, for on-line detection in a dynamic system, the mean value or the spectral properties of these residuals may change, and in such cases the generation of "residuals" amounts to deriving the sufficient statistics. b). Design of decision rules based on these residuals.
This involves designing a convenient decision rule that solves the change detection problem as reflected by the residuals. In this thesis, a major task is to develop a parametric statistical tool for detecting abrupt changes in the properties of discrete-time signals from dynamic systems. In the following sections of this chapter, a sequence of independent random variables y_i (i = 1, 2, ...) is studied with a probability density p_θ that depends on the parameter θ. θ is equal to θ0 before the unknown change time t_e and becomes θ1 after the change. On the basis of the requirements of applications and the corresponding mathematical mechanisms, statistical change detection can be divided into three major classes: on-line detection of a change, off-line hypotheses testing, and off-line estimation of the change time.

On-line detection of a change

The detection is determined by a stopping rule for the alarm time t_a:

t_a = inf{ n : g_n(y_1, y_2, ..., y_n) > h }    (2.3)

which searches for the minimum sample size n at which the value of the statistical function g_n exceeds the threshold h. With a trained threshold, neither the mean values before and after the change nor the time of change is required. The overall criterion is to minimize the time delay of detection for a given mean time between false alarms.

Off-line hypotheses testing

With a given finite sample y_1, y_2, ..., y_n, the test is to verify one of the two hypotheses, "without change" and "with change":

H0: for 1 ≤ j ≤ n:  p_θ(y_j | y_{j-1}, ..., y_1) = p_{θ0}(y_j | y_{j-1}, ..., y_1)

H1: there exists an unknown t_e, 1 ≤ t_e ≤ n, such that:
for 1 ≤ j ≤ t_e - 1:  p_θ(y_j | y_{j-1}, ..., y_1) = p_{θ0}(y_j | y_{j-1}, ..., y_1)
for t_e ≤ j ≤ n:      p_θ(y_j | y_{j-1}, ..., y_1) = p_{θ1}(y_j | y_{j-1}, ..., y_1)    (2.4)

The criterion for this algorithm is to maximize the probability of deciding H1 when H1 is true and to minimize false alarms when H0 is true. The estimation of the change time is not required.
Off-line estimation of the change time

With the same hypotheses as above, the detection is used to find the time of the change from H0 to H1, under the assumption that H1 does occur in the finite time period from 1 to N:

t̂_e = inf{ j : p_θ(y_j | y_{j-1}, ..., y_1) = p_{θ1}(y_j | y_{j-1}, ..., y_1) },  1 ≤ j ≤ N    (2.5)

The objective of this detection is to track more accurately the time of a given event H1. It can be seen from the above definitions that both off-line detection methods require a priori knowledge about the mean values before and after the change. In this thesis, the objective is to develop an on-line detection method, not only because this provides real-time monitoring but also because the nonintrusive detection is based on the total power data of the HVAC system or even the whole building's power system. With variable patterns of multiple on/off switches in response to the load conditions, it is difficult to know in advance the mean values before and after a change. Meanwhile, the unknown time of change adds a third variable for detection. This indicates that statistical estimates of the most likely values of those variables are necessary for practical detection.

2.3.2 Algorithm of on-line change-of-mean detection

Algorithms for abrupt change detection, on-line or off-line, are based on an important concept in mathematical statistics, the logarithm of the likelihood ratio, defined by

s(y) = ln [ p_{θ1}(y) / p_{θ0}(y) ]    (2.6)

With on-line detection, the system is continuously monitored and changes may happen at any time without prior notification. Therefore, in the likelihood tests of change detection with a random independent series, three parameters are involved: the means before and after the change and the time of change, denoted by θ0, θ1, and j (or t_e for a continuous series) respectively. The log-likelihood ratio for a continuous series from y1 to y2 can be described as

S_{y1}^{y2}(θ0, θ1) = ∫_{y1}^{y2} ln [ p_{θ1}(y) / p_{θ0}(y) ] dy

For a discrete sample from time j to k of a random sequence,

S_j^k(θ0, θ1) = Σ_{i=j}^{k} ln [ p_{θ1}(y_i) / p_{θ0}(y_i) ]    (2.7)

For on-line tests with a complex dynamic system of multiple components, all three parameters can vary. Therefore, their estimates should be determined by maximum likelihood. For a continuous series, this leads to the triple maximization conditions for the sufficient statistic,

∂S/∂θ1 + λ1 ∂φ1/∂θ1 + λ2 ∂φ2/∂θ1 = 0
∂S/∂t_e + λ1 ∂φ1/∂t_e + λ2 ∂φ2/∂t_e = 0
φ1 = 0,  φ2 = 0

where φ1 and φ2 are applicable constraints on the function S. For a discrete sequence, the search becomes

g_k = ln Λ_k = max_{1≤j≤k} sup_{θ1} S_j^k(θ1)    (2.8)

where sup represents the supremum, i.e., the least upper bound of S_j^k over the window [j, k] with respect to the post-event mean θ1, for a given pre-event mean θ0. For off-line detection, the corresponding conditional maximum likelihood estimates of the three values can be given as

(t̂_e, θ̂0, θ̂1) = arg max_{1≤k≤N} sup_{θ0} sup_{θ1} ln [ Π_{i=1}^{k-1} p_{θ0}(y_i) · Π_{i=k}^{N} p_{θ1}(y_i) ]    (2.9)

or, in condensed form,

t̂_e = arg max_{1≤k≤N} ln [ Π_{i=1}^{k-1} p_{θ̂0}(y_i) · Π_{i=k}^{N} p_{θ̂1}(y_i) ]    (2.10)

It should be noted that the mean values θ0 and θ1 indicate the most probable or most representative values of the data samples before and after a change, not simply the average of the samples. This can be demonstrated by the following plot.

[Figure 2.3. Demonstration of the difference between the average and the most representative value of the mean used in change detection; a sample with spikes is shown together with its average and its most representative value.]

Extremely high or low values are usually caused by random noise and are common for a complex system with disturbances. Hence the mean needs to be estimated statistically.
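As a small numerical illustration of Fig. 2.3 (the data values here are invented for the example), a single spike pulls the window average well away from the representative level of the sample, while a robust statistic such as the median is barely affected:

```python
from statistics import mean, median

# Short window of power samples (kW, hypothetical) containing one spike
window = [20.0, 21.0, 19.5, 20.5, 20.0, 45.0, 20.5, 19.5]

print(round(mean(window), 2))    # -> 23.25, pulled upward by the spike
print(round(median(window), 2))  # -> 20.25, close to the representative 20 kW level
```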
For a data series following the Gaussian distribution N(μ, σ²), with probability density

p(y) = [ 1 / (σ √(2π)) ] e^{ -(y-μ)² / (2σ²) }    (2.11)

the most representative value of a sample is equal to the average; this is the case for the noise in electrical power systems [Shanmugan and Breipohl 1988]. In this thesis, the change-of-mean detection is conducted with the system's total power data. Therefore, the mean equals the average in the detection window. In principle, estimates of the mean values and the time of change can only be performed off-line, in order to find the maximum likelihood over all the data samples in the sub-windows. For on-line detection with a dynamic system, however, these mean values are generally time-dependent and changes may happen at any time. Therefore, a detection window of finite length that progresses with time should be used to isolate each time period of change, which is the idea of the finite moving average (FMA) algorithm. With the estimated mean before the change assumed to be a known value, two possible algorithms can be used for the detection, i.e., for the estimation of the time of change and the post-event mean: the weighted cumulative sum (CUSUM) and the generalized likelihood ratio (GLR). The CUSUM method weights the likelihood ratio with a weighting function dF(θ1) over all possible values of the parameter θ1:

Λ_n = ∫ [ p_{θ1}(y_1, y_2, ..., y_n) / p_{θ0}(y_1, y_2, ..., y_n) ] dF(θ1)    (2.12)

while the GLR uses the maximum likelihood estimate of θ1:

Λ_n = sup_{θ1} [ p_{θ1}(y_1, y_2, ..., y_n) / p_{θ0}(y_1, y_2, ..., y_n) ]    (2.13)

With the weighted CUSUM algorithm, all possible change values need to be known before the detection begins. For a system with known finite states of changes under all normal and faulty conditions, as in the residential applications, this method can be used for the best statistical estimate of the coming event. On the other hand, the GLR algorithm is based on the maximization of the likelihood ratio, i.e., the maximization of the probability ratio defined in Eq. (2.13), of the sample between the two values θ0 and θ1.
Without information about the incoming event in advance, both the post-event mean θ1 and the time of change are searched for in the detection window through the maximization of the probability ratio. In a moving FMA window, the maximum of the ratio is first identified by computing the likelihood with the data in each sub-window, and the time of change is then determined by a second maximization of the ratio among all the sub-windows. The search in an FMA window is carried out as each new data point is accepted into the window. In real systems such as commercial HVAC applications, it is not feasible to define all the possible states under faulty conditions, even if all the changes under normal conditions can be well determined. Moreover, training of the weighting function for the CUSUM method is not a trivial issue and may add uncertainty to the estimation. In this research, it has been found that with reasonable innovations to the algorithm and appropriate processing of the sampled data, the GLR algorithm can achieve reliable detection without knowledge of the potential events. The improved GLR detector not only makes it possible to detect changes in the HVAC power data but also requires fewer parameters to be trained than the CUSUM method. Therefore, in this thesis, the GLR algorithm is the principal method for on-line change detection in complex systems.

2.3.3 The GLR algorithm

In the GLR algorithm, the mean value of the sequence before a change is assumed known, obtained either from previous hypothesis testing or from estimation with past data. For commercial HVAC systems, power consumption varies continually with gradually changing load conditions, which means the mean of the power data from the last hypothesis may be very different from the current value.
Therefore, the mean value before the change must be estimated from its probability density function,

θ̂0 = arg sup_{θ0} p_{θ0}(y)    (2.14)

For a series with the normal distribution N(μ, σ²), the most representative value θ̂0 is the average of the series,

μ̂0 = (1/n) Σ_{i=1}^{n} y_i    (2.15)

which can be proved by solving ∂p_{θ0}(y)/∂θ0 = 0. It should be noted that in the presence of large spikes in a window of finite length, this definition of the mean may deviate significantly from the real representative value of the sample, as shown in Fig. 2.3. Therefore, the power data need to be processed before being used in the GLR computation, which will be discussed in Chapter 3. With this estimate as the known mean before a change, for a sequence of random variables (y_i) with a probability density function p_θ(y), two independent unknowns are left: the change time and the post-event mean. In the GLR method, these two variables are determined by maximum likelihood estimation, i.e., a double maximization of the likelihood ratio. For a continuous series, the mean after the change θ1 and the change time t_e can be found with appropriate boundary conditions through

∂S/∂θ1 |_(θ̂1, t̂_e) = 0,   ∂S/∂t_e |_(θ̂1, t̂_e) = 0    (2.16)

For a discrete independent sequence, the double maximization is expressed as

g_k = ln Λ_k = max_{1≤j≤k} sup_{θ1} S_j^k(θ1)    (2.17)

The probability density is assumed to belong to the Koopman-Darmois family of probability densities [Lorden 1971]:

p_θ(y) = l(y) e^{ q(θ) m(y) - r(θ) }    (2.18)

where l, m, q, and r are finite and measurable functions. Moreover, m and q are monotonic, and r is strictly concave upward and infinitely differentiable over an interval of the real line. For a given system, the minimum magnitude of changes in the system can sometimes be estimated from manufacturers' catalogs or product specifications. In change detection, this means the minimum expected change ν_m in the parameter θ can be used as a constraint in the double maximization,

g_k = max_{1≤j≤k} sup_{θ1: |θ1-θ0| ≥ ν_m} S_j^k(θ1)    (2.19)

Hence the conditional maximum likelihood estimates of the after-change mean and the change time are

(t̂_e, θ̂1) = arg max_{1≤j≤k} sup_{θ1: |θ1-θ0| ≥ ν_m} Σ_{i=j}^{k} ln [ p_{θ1}(y_i) / p_{θ0}(y_i) ]    (2.20)

For an independent Gaussian sequence N(μ, σ²), the sufficient statistic function can then be derived as

S_j^k = Σ_{i=j}^{k} [ (θ1 - θ̂0) / σ² ] [ y_i - θ̂0 - (θ1 - θ̂0)/2 ]    (2.21)

where θ̂0 is the estimate of the pre-event mean from Eq. (2.15). To obtain the estimate of θ1, let ν = θ1 - θ̂0, so that

g_k = max_{1≤j≤k} sup_{|ν| ≥ ν_m} Σ_{i=j}^{k} [ (ν/σ²)(y_i - μ̂0) - ν²/(2σ²) ]

The maximum of g_k as a function of ν can be found from ∂g/∂ν = 0 subject to the constraint ν_m,

|ν̂| = ( | (1/(k-j+1)) Σ_{i=j}^{k} (y_i - μ̂0) | - ν_m )⁺ + ν_m

and then

g_k = max_{1≤j≤k} [ (ν̂_j/σ²) Σ_{i=j}^{k} (y_i - μ̂0) - (k-j+1) ν̂_j² / (2σ²) ]    (2.22)

If ν_m = 0, meaning that changes of any magnitude are of interest or that no information about the minimum expected change is available in advance, then

g_k = max_{1≤j≤k} [ 1 / (2σ²(k-j+1)) ] [ Σ_{i=j}^{k} (y_i - μ̂0) ]²    (2.23)

2.4 Summary

In this chapter, the theory of change detection has been studied in detail. The steady-state method is selected not only for its smaller computing load but also for its capability of dealing with the noisy power environment. In addition, the steady-state method by itself constitutes a complete load monitoring system. Second, the characteristics of power profiles of both residential and commercial systems are discussed, and algorithms for change detection of power in commercial applications are analyzed. Finally, the necessity and the feasibility of methods for steady-state change detection of power consumption in commercial buildings are evaluated, and the algorithm based on the generalized likelihood ratio is selected according to the information available for common HVAC systems in practice.

CHAPTER 3 System power modeling for fault detection and diagnosis

This chapter presents a detailed study of the GLR detection of changes in the total power data of commercial HVAC systems. First, several fundamental issues are put forward regarding the GLR algorithm when applied in practice.
In order to reduce false and missed alarms, some innovations are proposed and verified with data from real HVAC systems. Then, to deal with the noise in the power data and the various startup characteristics of the different equipment in one system, methods for preprocessing the power data, including the median filter and the multi-rate sampling technique, are developed. Finally, with the enhanced GLR detector working on properly preprocessed data, case studies of change detection in the total power data of both the building and the HVAC systems are presented, and a short discussion is given of the training and application of the change detector.

3.1 Introduction to on-line change detection in the power data of HVAC systems

Monitoring and analysis of a system's total power input involve keeping track of the on/off schedules of individual components and of changes with abnormal magnitudes, and obtaining appropriate estimates of the system's energy consumption. Undesirable operating status may be identified by examining the trend of the total power data. For example, unstable control of a component in the system can be detected by checking the standard deviation of the sampled power data against a trained threshold. Monitoring of the total power profile can be generally defined as a quality control problem with two major aspects: constant standard deviation with varying mean, or constant mean with varying standard deviation, as illustrated in Fig. 3.1. In quality control, these two types of variation are considered systematic and random errors respectively.

[Figure 3.1. Two major types of change detection in quality control: a) constant standard deviation with varying mean; b) constant mean with varying standard deviation.]
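The standard-deviation check for unstable control described above can be sketched as follows. This is a minimal illustration with a fixed window length and a hypothetical trained threshold, not the full detector developed later in this chapter:

```python
from statistics import pstdev

def oscillation_alarm(power, window=10, sigma_threshold=5.0):
    """Flag sample indices at which the standard deviation of the FMA
    window exceeds a trained threshold, suggesting unstable control."""
    alarms = []
    for k in range(window, len(power) + 1):
        if pstdev(power[k - window:k]) > sigma_threshold:
            alarms.append(k - 1)  # index of the newest sample in the window
    return alarms

steady = [100.0] * 20                        # stable operation: no alarms
hunting = [100.0] * 10 + [100.0, 120.0] * 5  # oscillating power data
print(oscillation_alarm(steady))             # -> []
print(len(oscillation_alarm(hunting)) > 0)   # -> True
```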
In common HVAC systems, a systematic error means an undesired on/off switch of a component under faulty operating conditions, while a random error indicates oscillation in the power data resulting from unstable control in the system. In order to identify such abnormalities under steady state, two thresholds must be established: one for the sufficient statistic, to find the changes, and the other for the standard deviation of the sampled data. As a dimensionless index statistically derived from a noisy environment, the sufficient statistic must be trained for a given system. To evaluate the noise level, which usually depends on the on-site configuration of both the electrical and mechanical systems, appropriate training and estimation of the threshold for the standard deviation are essential for oscillation identification. Unlike event monitoring in product quality control, where the mean and the standard deviation can be treated as constants under normal operation, in an operating HVAC system both the mean and the standard deviation change all the time. In order to obtain timely estimates of the mean values of a sequence, a finite moving average (FMA) window should be used for the detection. An FMA window is a data sampling window with a finite length, in which the mean values for each sub-window can be determined progressively with weighting or forgetting factors. For a Gaussian sequence, due to the randomness of the independent data in the window, the weighting factor can be assumed constant, eliminating the training of these factors. 3.2 Modification of the GLR: one-window equation vs.
two-window equation

In the theory of on-line change detection, the search for the maximum sufficient statistic is conducted by successively searching each sub-window [j, t_e]:

(t̂_e, θ̂1) = arg max_{1≤j≤k} sup_{θ1: |θ1-θ0| ≥ ν_m} Σ_{i=j}^{k} ln [ p_{θ1}(y_i) / p_{θ0}(y_i) ]    (3.1)

In the estimation of the post-event mean θ1 in a dynamic system, one problem inherent in this equation is the availability of the pre-event mean θ0. In practice, especially when detecting unexpected events, which is a major task of fault detection, the pre-event mean must be estimated from the continuously updated data sample and cannot be selected from expected values or equipment specifications. For a system whose data are relatively constant, or whose variations are very small compared to the magnitude of the sampled data, the post-event mean detected from the last hypothesis may be used as the pre-event mean for the next event, as with the total power data of a small house. However, in some cases the total power data vary continually with the load as a function of time, especially in the presence of the variable-speed drives that are widely used in commercial HVAC systems. This indicates that the post-event mean from the last hypothesis can be very different from the current pre-event mean θ0, which tends to cause unacceptable errors or even false/missed alarms in the detection if θ0 is assumed to be equal to the previous post-event mean. Fig. 3.2 illustrates such effects with two abrupt turn-on events, at 27 and 88 seconds respectively. The post-event mean value of 21.75 from the previous hypothesis is obtained from the average during time 0-5 seconds. If this value, instead of the mean value of 31.29 during 21-26 seconds, is used as the pre-event mean for the change at time 27 with a post-event mean of 39.12, the resulting error in the detected change is 80%.
For the change at 88 seconds, the pre-event mean from the previous hypothesis is very close to the post-event mean of 39.61 during 88-93 seconds, which makes it virtually impossible for the detector to see the change at 88 seconds if 39.12 is used as the pre-event mean.

[Figure 3.2. Effects of the window selection on the estimation of the pre-event mean; power vs. time over 0-120 s.]

This chart suggests a potential solution to this problem: to obtain a continuously updated estimate of the pre-event mean θ0, the data in the latest section of the series should be used. For more accurate change detection, the pre-event mean should be estimated independently, i.e., it cannot be obtained with any data in the window used for the post-event mean. Therefore, to distinguish the two mean values, there must be an additional window before the post-event window in which the pre-event mean is estimated. By moving with the post-event window, the pre-event window provides appropriately updated estimates of the base for the next "step" change. Hence the equation for the estimation of the change time and the magnitude is modified as

(t̂_e, θ̂1) = arg max_{M+1≤j≤N} sup_{θ1: |θ1-θ̂0| ≥ ν_m} Σ_{i=j}^{N} ln [ p_{θ1}(y_i) / p_{θ̂0}(y_i) ]    (3.2)

θ̂0 = arg sup_{θ0} p_{θ0}(y_i),  1 ≤ i ≤ M

where θ̂0, the estimate of the pre-event mean θ0, is obtained as the most likely value based on the probability density p_{θ0}(y) in the pre-event window 1 to M. With the N(μ, σ²) distribution,

(t̂_e, θ̂1) = arg max_{M+1≤j≤N} sup_{θ1: |θ1-θ̂0| ≥ ν_m} Σ_{i=j}^{N} ln [ p_{θ1}(y_i) / p_{θ̂0}(y_i) ],   θ̂0 = μ̂0 = (1/M) Σ_{i=1}^{M} y_i    (3.3)

The lengths of the two windows, M and N-M, depend on the on/off characteristics of the monitored equipment in a system and need to be properly trained in practice.

3.3 Training of the parameters for the GLR detector

According to Sections 3.1 and 3.2, four parameters need to be trained before the GLR algorithm can be used for detection: a). the length of the pre-event window; b). the length of the post-event window; c).
the threshold for the sufficient statistic; d). the threshold for the standard deviation. The effects of these parameters on the detection output are illustrated in Fig. 3.4 with the total power data shown in Fig. 3.3(a) [Hill, 1995] for four fans in the HVAC system of a campus building. During a period of 28 hours, there were 10 on/off events of the fans, at minutes 230, 455, 475, 490, 520, 545, 670, 1150, 1193, and 1560 respectively, as represented by the binary on/off switches in Fig. 3.3(b). Data were sampled at a rate of 1/60 Hz.

[Figure 3.3. Power data for four supply fans in the HVAC system of a campus building: a) power data sampled at 1/60 Hz; b) binary event indicator of the on/off switches.]

3.3.1 The length of the pre-event window

The length of the data window has a profound impact on the decision function. The major objective of the pre-event window is to provide a stable datum for the detector to find the coming change. Therefore, there should be enough sampled data points in this window, i.e., this window has to be long enough for a given sampling rate. On the other hand, multiple power changes frequently occur in sequence in HVAC systems, sometimes closely separated in time, which requires a short window to avoid averaging out two consecutive events. However, with a single sampling rate, the mean obtained with a short window tends to be affected by noise in the data, and the resulting deviation of the estimated mean from the most representative average may lead to false or missed alarms. In addition, for events that occur gradually over a period of time, a short window tends to miss the event or produce multiple alarms for the same event. Fig. 3.4(a) shows the short-window effect on the GLR detection output with a pre-event window of 5 data points.
Further decrease of the window length yielded more false alarms and even unstable outputs. On the other hand, a long window prevents the detector from finding multiple changes that are close to each other in time. This effect is demonstrated in Fig. 3.4(b), where, in addition to continuous alarms for each event, the detector with a window of 30 data points produced sufficient statistics above the threshold over the entire period from 450-550 minutes, when five events occurred, and one continuous false alarm beginning at minute 1198. The continuous alarm for a single event arises because the data from the coming event can still be "assimilated" or "averaged out" in such a long pre-event window; the sufficient statistic therefore stays above the threshold until enough post-change data are admitted into the pre-event window, resulting in a smaller probability ratio between the two values. For a similar reason, a long pre-event window leads to continuous alarms for multiple close events. With a window length of 10, as in Fig. 3.4(c), the duration of continuous alarms was much shorter than with 30 points in the window and the close changes could be distinguished, but three events were missed, at 230, 670, and 1560, due to the longer duration of those events. Apparently, the proper window length changes with the characteristics of different systems. But for a given system, the window length for detection appears to be consistent under various operating conditions, as demonstrated by the application of this method to detect HVAC equipment on/off events in a test building [Shaw et al. 2000] [Norford et al. 2000]. Observations indicate that, as an upper limit, the pre-event window should not be longer than the interval between two major consecutive events.
Such a value can sometimes be obtained from the basic design specifications of the system, because in practice the major components in an HVAC system are usually turned on and off with a designated lower limit on the time between switches, in order to protect the equipment from deterioration due to frequent switching, as with chiller control. Moreover, the on/off switches of multiple components are often governed by the load conditions, which usually evolve over a certain time span, e.g., in the sequence control of multiple fans. As a lower limit, the pre-event window should not be shorter than the duration of a noise spike. Ideally, it should be longer than the duration of the startup period of each component in the system. However, this condition may not always be met in a real system, since the startup of some variable-speed-drive equipment can last as long as 15 minutes, sometimes longer than the interval between two consecutive switches. In such cases, other approaches are needed, such as the multi-rate sampling technique discussed later. Without violating the above basic rules, the pre-event window should be kept short, which facilitates one of the major innovations to the GLR algorithm, namely pre-event window reset, as presented later.

[Figure 3.4. Demonstration of the improvement of detection quality with the proper pre-event window length: a) pre-event window length of 5: five false alarms, at 258, 350, 380, 510, and 1241; b) pre-event window length of 30: continuous alarms for each event, mixed alarms for the five close events during 450-550, and continuous false alarms beginning at 1198; c) pre-event window length of 10: mixed alarms for the close events eliminated, and missed alarms at 230, 670, and 1560.]
All three tests are based on a constant variance of 0.617 and a minimum expected change of zero.

3.3.2 The length of the post-event window

The basic limits on the pre-event window also apply to the post-event window: it should not be longer than the interval between two consecutive events, never shorter than a disturbance, and ideally not shorter than the duration of a startup transient. On the other hand, unlike the pre-event window, which is used to achieve a stable mean as the reference for coming events, the post-event window is intended to be sensitive to events yet robust to disturbances. Eq. (3.3) shows that a shorter post-event window is more sensitive to changes than a longer one. Moreover, to save the time used in searching for the change in the post-event window, the window length should be as short as possible. The appropriate length of the post-event window was found to be 25-50% of the pre-event window, in order to obtain a relatively stable yet sensitive average for the detection of on/off events.

3.3.3 The threshold for the sufficient statistic

The magnitude of an appropriate detection threshold scales with the signal noise, the minimum signal change of interest, and the abruptness of potential changes in the system. All three factors are system-dependent. The minimum signal change may be found from the specifications of the components in the system, while the other two depend on the component characteristics and the system setup as well. Therefore, the threshold has to be established adaptively during an early test period for a given system. The training process benefits from available reference information (such as the type of the motor drive), which can be obtained from the design information of the system, on-site observations, and on/off tests if possible. Tests with different systems also suggested that a single threshold can be used consistently for a given system.
3.3.4 The standard deviation of the power data
The standard deviation, or the variance, is an important measure of data quality. The standard deviation calculated for an FMA window may change rapidly over time in the power series of HVAC systems, as shown in Fig. 3.5. In the GLR algorithm, since the sufficient statistic is directly affected by the standard deviation, determining the value of the standard deviation in an FMA window becomes one of the key issues for successful detection. However, the output of detection based on a fixed value of the standard deviation was rarely satisfactory, even though such a value might be tuned to reduce false alarms. For example, a constant variance of 0.617 was used for the detection illustrated by Fig. 3.4. Tests with tuning of the variance showed that no constant value yielded acceptable false/missed alarm rates.

Figure 3.5. Variations in the noise of the electrical signal for the aggregated fan power measurement.

Tests for the training of the GLR detector have shown that even with well-tuned values of the above parameters, false and missed alarms still occurred at rates unacceptable for real applications. In order to achieve the desired detection quality for a practical system, further improvements are necessary in addition to the previous modifications and tuning guidelines.

3.4 Improvements of the detection algorithm
In this thesis, three innovations have been made to the original algorithm for improved performance of the GLR detector: window reset, variance update, and a non-zero minimum magnitude of change.

3.4.1 Progressive vs. reset windows
From Fig. 3.4(c), it can be seen that even with an appropriately trained pre-event window length, continuous alarms are still issued in the detection output.
This is because, with the finite window length that is essential to obtain a stable mean value, the sufficient statistic gradually decreases as more data are accepted into the windows. With the window reset technique, once the sufficient statistic exceeds the threshold, the data in the two windows before the detected change point are replaced with the data after this point. Upon this replacement, the calculated sufficient statistic immediately drops below the threshold, because both windows now contain similar post-event values. Hence the duration of an alarm for a single event is minimized and masking of subsequent events is eliminated. The problem of a window that is too long or too short is also alleviated. It should be noted that more data points may be needed to reset the pre-event window. This is realized by using a third window, following the detection window, with the same length as the pre-event window. The two-window equation is not affected by the window reset because the third window is not involved in the current detection. The window lengths used here are 10 for the pre-event window, 5 for the detection window, and 10 for the reset window, in sequence as shown in Fig. 3.6. The effect of this innovation is shown in Fig. 3.7.

Figure 3.6. FMA windows used with the GLR detection algorithm: a pre-event window with 10 data points, a detection window with 5 data points, and a reset window with 10 data points.

3.4.2 Updated standard deviation
The standard deviation can be continuously calculated from the pre-event window, with high and low limits to avoid singular values of the calculated sufficient statistic. This effect is demonstrated by comparing Fig. 3.7(a) with Fig. 3.7(c). In this test, the limits were 10000 and 0.0001.
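The window-reset behavior described above can be sketched as follows, with the windows refilled from post-change data once an alarm fires. This is a simplified illustration; the window lengths, the statistic, and the threshold value are assumptions rather than the thesis's exact implementation:

```python
def detect_with_reset(data, pre_len=10, det_len=5, sigma=1.0, threshold=10.0):
    """GLR-style detection with window reset (sketch).

    After an alarm, both windows restart just past the detection
    window, so the alarm lasts one report and later events are not
    masked by lingering pre-change data.
    """
    alarms = []
    start = 0                    # first index of the pre-event window
    i = pre_len + det_len        # one past the detection window
    while i <= len(data):
        pre = data[start:start + pre_len]
        det = data[i - det_len:i]
        mu_pre = sum(pre) / pre_len
        mu_det = sum(det) / det_len
        g = det_len * (mu_det - mu_pre) ** 2 / (2 * sigma ** 2)
        if g > threshold:
            alarms.append(i)                  # alarm at this sample
            start = i                         # reset: windows restart here
            i = start + pre_len + det_len     # wait for refill
        else:
            start += 1
            i += 1
    return alarms

data = [100.0] * 20 + [110.0] * 20
print(detect_with_reset(data))   # a single, short alarm just after the step
```

Without the reset, the same step would keep the statistic above the threshold for many consecutive samples.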
The problem with this approach is that it is difficult to set proper values for the fixed limits, which consequently introduces inaccuracy into the calculation of the sufficient statistic. Another method to determine the threshold for the updated standard deviation is to set it as an approximate fraction of the mean of the data in the current FMA window and then train this fraction. Tests showed that the standard deviation tends to increase with the total power input of an HVAC system as more equipment is put into operation. Data from a training period can be used to determine the ratio of the measured standard deviation to the measured total power. The ratio is calculated as the standard deviation divided by the mean value in the moving pre-event window during steady-state operation; periods with significant dynamics should be excluded. The maximum of the computed values during the training period is set as the ratio to be used in detection, because the standard deviation in the pre-event window caused by random noise following a Gaussian distribution is not expected to deviate significantly from its value under normal conditions. Training the ratio requires the total power data of the operating system during a typical day. Some basic knowledge of the magnitude of the system's power consumption is also needed, which is often obtainable from the design data. During subsequent on-line FDD applications, the threshold for the calculated standard deviation is estimated as the product of this ratio and the averaged total power. From tests with different HVAC systems, it has been found that reasonable upper and lower limits for the standard deviation, as fractions of the current power data, are on the order of 10% and 1% respectively.
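As a sketch of this training step, the std-to-mean ratio can be extracted from steady-state training data and later scaled by the current power. The window size and the data below are illustrative:

```python
import statistics

def train_std_ratio(power, window=10):
    """Largest std/mean ratio seen over a sliding window (sketch).

    Run on steady-state training data; periods with abrupt changes
    should be excluded, as the text notes.  During on-line detection
    the returned ratio is multiplied by the current average power to
    form the standard-deviation limit.
    """
    best = 0.0
    for i in range(len(power) - window + 1):
        chunk = power[i:i + window]
        mu = statistics.fmean(chunk)
        if mu > 0:
            best = max(best, statistics.pstdev(chunk) / mu)
    return best

ratio = train_std_ratio([100.0, 101.0, 99.0, 100.0, 102.0,
                         98.0, 100.0, 101.0, 99.0, 100.0, 100.0, 101.0])
# On-line limits as fractions of the current mean power; the text
# reports workable orders of about 10% (upper) and 1% (lower).
current_mean_kw = 500.0
upper, lower = 0.10 * current_mean_kw, 0.01 * current_mean_kw
```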
This method gives the most reliable estimate of the standard deviation because it eliminates the effect of extreme values while incorporating an updated estimate of the standard deviation in the calculation of the sufficient statistic. Tests conducted with real building systems have verified the success of this method.

3.4.3 Nonzero minimum expected change
A value of zero for the minimum expected power change makes the GLR equation easy to implement. However, relative to a minimum value assigned on the basis of knowledge of the equipment sizes in a system, zero-minimum detection usually leads to more false alarms, as shown in Fig. 3.7(b) at times 270 and 1260. This can also be seen from the equation itself, since g decreases as the assumed minimum change increases. In practice, it is often reasonable to set a minimum expected change based on knowledge of the system and its components. With a properly determined minimum expected change Vm, the algorithm will neglect small abrupt disturbances and their accumulation within the window and thus reduce the false alarm rate and yield more reliable detection outputs. This effect is shown with a known minimum change of Vm = 5 kW in Fig. 3.7(c).

Figure 3.7. Demonstration of the improvements of the GLR detection method by pre-event window reset, variance update, and minimum signal estimation (5 kW). a). Window reset, constant variance, and non-zero minimum expected change. b). Window reset, updated variance, and zero minimum expected change. c). Window reset, updated variance, and non-zero minimum expected change.
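The effect of a nonzero minimum expected change can be illustrated with the standard constrained-magnitude form of the Gaussian GLR statistic. This is a hedged sketch: it may not be the thesis's exact equation, and the names are illustrative:

```python
def glr_min_change(mu_hat, n, sigma, v_min):
    """GLR statistic with a minimum expected change magnitude v_min.

    mu_hat: estimated mean shift between the two windows; n: number of
    post-event samples.  When |mu_hat| < v_min the statistic is
    penalized, so small disturbances accumulate less evidence.
    """
    scale = n / (2.0 * sigma ** 2)
    if abs(mu_hat) >= v_min:
        return scale * mu_hat ** 2
    # Constrained-magnitude branch: best change of size at least v_min.
    return scale * (2.0 * abs(mu_hat) * v_min - v_min ** 2)

# A 1-unit apparent shift scores far lower once a 5-unit minimum
# change is assumed, suppressing alarms from small disturbances.
print(glr_min_change(1.0, 5, 1.0, 0.0))   # 2.5
print(glr_min_change(1.0, 5, 1.0, 5.0))   # -37.5
```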
3.5 The median filter
In the search for a change in the detection window, each sub-window j-N (j = 1, ..., N) is used for the calculation of the likelihood ratio. If a large spike enters the window, the ratio may exceed the threshold and lead to a false alarm. This effect is demonstrated by Fig. 3.8 and the following equation. When the detector reaches time point 15 in the post-event window, which is affected by a big spike, the likelihood ratio jumps dramatically and a false alarm is issued for the spike.

Figure 3.8. Sampled data with one spike in the post-event window (a pre-event window with 10 data points followed by a detection window with 5 data points).

g = \max_{11 \le j \le 15} \sum_{i=j}^{15} \sup_{\theta_1} \ln \frac{p_{\theta_1}(y_i)}{p_{\theta_0}(y_i)} \approx \ln \frac{p_{\theta_1}(y_{15})}{p_{\theta_0}(y_{15})}

In addition, although the lengths of the two windows can be selected to alleviate the noise effect by "averaging out" the deviations, the detection quality is still often degraded when the data environment becomes very noisy with large disturbances, as shown in Fig. 3.9(a). The dotted line shows the profile of the total power input of a campus building at a sampling rate of 1 Hz with four on/off events of a water pump. With frequent big spikes, the estimated mean in each window may deviate significantly from the true representative value. From the above discussion, it becomes clear that a preprocessor is needed to improve the quality of the sampled data before they are used by the detector. In this thesis, to remove the large disturbances in the power data, a median filter [Karl et al. 1992] is employed to facilitate the change detection. The median of a series with a probability density function p(x) is the value x_med for which larger and smaller values of x are equally probable:

\int_{-\infty}^{x_{med}} p(x)\,dx = \int_{x_{med}}^{\infty} p(x)\,dx = \frac{1}{2}   (3.4)

For a discrete sequence, the median of a distribution is estimated from a sample of values x_1, ...
, x_N by finding the value x_i that has an equal number of data points greater and smaller than it. If N is even, the median is the average of the two central values. With the sample sorted into ascending (or descending) order, the formula for the median of the sample is

x_{med} = x_{(N+1)/2}, if N is odd;  x_{med} = (x_{N/2} + x_{N/2+1})/2, if N is even   (3.5)

A median filter is a nonlinear filter based on the above function: it sorts the data in a window and keeps the median as the value at the end point of the window before sliding over one point. Spikes that appear in the window as points of very large or very small value compared to the selected median are discarded. Hence a representative value is picked by the filter in the FMA window for the detector. By eliminating large spikes, the median filter significantly improves the robustness of the GLR detection with respect to the detection threshold. It has been found that the GLR detector is less sensitive to the value of the threshold, and hence easier to train, when the disturbances in the data environment are greatly reduced by a median filter. The advantage of using a median filter has been verified by tests; one example is illustrated below, based on the total power data sampled at 1 Hz from a campus building. During a period of 1020 seconds, a pump was switched on or off 8 times by the control system, at 124, 183, 443, 558, 737, 797, 939, and 998 seconds. Fig. 3.9(a) shows the profile of the total power data, with the dashed line for the raw data and the solid line for the filtered data. Without the median filter, 16 events were detected, including the 8 switches and 8 false alarms, as shown in Fig. 3.9(b). With an appropriately designed median filter, the false alarms caused by spikes at 29, 102, 119, 207, 355, 675, 812, and 934 seconds were eliminated, as illustrated in Fig. 3.9(c).
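A minimal sliding median filter of the kind described above might look like this; the edge handling and window width are illustrative choices:

```python
import statistics

def median_filter(data, width=5):
    """Sliding median filter to suppress spikes before GLR detection.

    The width should be at least twice the typical spike duration (in
    samples) and shorter than the gap between consecutive events; edge
    samples are passed through unchanged in this sketch.
    """
    half = width // 2
    out = list(data)
    for i in range(half, len(data) - half):
        out[i] = statistics.median(data[i - half:i + half + 1])
    return out

signal = [660, 660, 661, 940, 660, 659, 700, 700, 701, 700]
print(median_filter(signal))
# The 940 spike is removed while the genuine step toward 700 survives.
```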
Comparison between the computed power changes in Fig. 3.9(b) and (c) also indicates that for the 8 events, the detected changes are more accurate with the startup electrical surges eliminated by the filter. In the training of the GLR, the same results can be obtained with the median filter for any threshold value between 1.0 and 3.0. On the other hand, no identical outputs were found with different thresholds in the detection based on the raw data. A value of 3.0 was found "optimal" for the threshold, leading to the identification of all the events with the minimum number of false alarms, as illustrated by Fig. 3.9(b).

Figure 3.9. Demonstration of the improved detector performance with a pre-processing median filter to reduce unwanted electrical noise. a). 1-Hz total power data of a campus building with 4 on/off events of a pump, filtered and unfiltered; b). output of the GLR detector with the raw data; c). output of the GLR detector with the filtered data.

It is necessary to select a value for the window length used by a median filter. The window should be at least twice as long as the typical duration of an electrical spike, which is generally less than 5 seconds from our observations, and not longer than the interval between two consecutive events or the duration of an on/off state, whichever is shorter. The median filter is more robust than the mean (or average) filter because the median filter discards the extreme values, while the mean filter averages all the data in the sample and is hence susceptible to every spike in the window [Rice 1988].
Statistically, the median filter fails as an estimator only if the area in the tails of the distribution is large, while the mean filter fails if the first moment of the tails is large. One practical problem with the median filter is the detection of data oscillation in a window. For common HVAC systems, oscillation in the power input is generally caused by unstable control and needs to be corrected once detected. With a median filter, however, the oscillation tends to be neglected, because the extreme values are removed by the filter and the fault is consequently hidden from the detector.

3.6 Change + oscillation detector
Oscillation of power caused by unstable control in HVAC systems may degrade equipment and in some cases increase energy consumption. One of the major characteristics of oscillation is the deviation of the data from their mean value in an FMA window. Hence, the approach used in this thesis to identify this fault is to compare the standard deviation of the data set against a threshold that is dynamically adjusted as a fraction of the current power data. To avoid false alarms caused by random spikes, which might lead to significant errors in calculating the standard deviation in a short window, the detection of oscillation is conducted continuously over the whole pre-event window rather than by searching through all the sub-windows. The key points in designing a GLR detector with oscillation detection capability are the proper thresholds for the sufficient statistic and for the standard deviation. For the change detector, a higher threshold HGLR is set for the sufficient statistic and a lower threshold LSTD for the standard deviation to identify changes. For the oscillation detector, a higher threshold HSTD is selected for the standard deviation and a lower threshold LGLR for the sufficient statistic to find the oscillation.
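The four-threshold scheme can be sketched as a simple classifier. The threshold values and the 10%/20% relations below follow the example values used in this section, but the function itself is illustrative:

```python
def classify(g, std, h_glr=2500.0, h_std=0.02):
    """Parallel change/oscillation classification (sketch).

    A step change needs a high sufficient statistic with a low spread;
    an oscillation needs a high spread with a low statistic.  The low
    thresholds, at 10% and 20% of the high ones, carve out the
    ambiguous middle region.
    """
    l_glr, l_std = 0.1 * h_glr, 0.2 * h_std
    if g > h_glr and std < l_std:
        return "step change"
    if std > h_std and g < l_glr:
        return "oscillation"
    return "no alarm"

print(classify(3000.0, 0.001))   # step change
print(classify(100.0, 0.05))     # oscillation
print(classify(1000.0, 0.01))    # no alarm: fuzzy intermediate region
```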
Thresholds established in this way eliminate the fuzzy intermediate region where, for example, fluctuations in power may cause false alarms in the change detector, and permit a clearer distinction between step changes and oscillations. In the example illustrated by Fig. 3.10, LGLR and LSTD were set to 10% of HGLR and 20% of HSTD respectively, for the detection of a turn-on event amid power oscillation caused by an unstable controller in the total power data of the HVAC system in a test building. While the supply-fan static pressure controller gain was set to produce power oscillation, a chilled water pump was turned on at about 3604 seconds in the plot. With HGLR = 2500, LGLR = 250, HSTD = 0.02, and LSTD = 0.004, both events were alarmed successfully.

Figure 3.10. Parallel detection of changes and oscillation. a). 1-Hz total electrical power data of the HVAC system in a test building, showing oscillation indicative of an unstable supply-duct static-pressure controller as well as a turn-on event at 3604 seconds; b). output for the detection of the unstable controller, with a change of 872 watts found at 3604 seconds.

3.7 Monitoring of the total power data - multi-rate vs. single-rate sampling
In addition to the above improvements to the GLR algorithm and data filtering, the sampling rate of the data fed to the detector has also been found critical for keeping track of the events in a system. This is not only because the same event appears more or less abrupt to the detector at different rates, but also because different sampling rates unveil distinct characteristics of the data trend, such as the data spread about the mean value.
In this section, after comparing the detection output with single- and multi-rate sampling, a detector for change and oscillation tracking with multiple sampling intervals is developed and discussed with power data from a test building.

3.7.1 Change detection by multi-rate sampling of the total power data
It can be seen from the previous sections that on/off events occur over a wide range of time intervals in HVAC systems. In order to find as many events as possible, a short sampling period should be used to avoid missing events. A shorter sampling interval, or a faster sampling rate, helps to distinguish closely separated events. However, a fast sampling rate also tends to produce more false alarms, because the calculation is more susceptible to high-frequency transients and disturbances. Moreover, too fast a sampling rate may cause more than one alarm for a single event, especially for relatively slow on/off events, e.g., a supply fan with a variable-speed motor drive. On the other hand, a long sampling interval, or a slow sampling rate, is likely to miss changes of short duration, such as the startup of a constant-speed water pump, and may sometimes mix up closely separated events. Therefore, a reasonable range of sampling intervals should be determined before applying the GLR detector to a data series.

i). Order-of-magnitude analysis of the sampling interval
In general, the sampling interval should be shorter than the time lapse between the two most closely spaced events. Such a limit can be obtained by examining the total power data at a fast enough sampling rate or by observing the signals from the control system, if possible. For example, in the tests with a real HVAC system, the shortest time period between two consecutive events was found to be about 1 second by analyzing the building's total power data. Fig. 3.11 shows the whole-building power data during a typical day.
The effect of sampling rate on the data pattern fed to the detector is illustrated in Fig. 3.12. All the data were collected with a 24-Hz data logger called remote-1 but are plotted here with a sampling interval of 10 seconds due to limitations of the plotting software.

Figure 3.11. Whole-building electrical power, sampled at 24 Hz, plotted at 0.1 Hz.

In the 24 hours of operation, there were more than 100 on/off switches of equipment in the system. During a selected period of 30 seconds (12300-12330), three components were turned on sequentially. The power data series with sampling intervals of 60, 10, 1, and 0.125 seconds are shown sequentially in the following charts.

Figure 3.12. Effect of different data sampling intervals on the detectability of rapidly occurring equipment start-up events. (a). 60 seconds; (b). 10 seconds; (c). 1 second; (d). 0.125 seconds.

As shown in the above charts, if an interval of 60 seconds is used to sample the data, these three turn-on events appear as a single step change. With the sampling interval reduced to 10 seconds, each event is sampled as only one point, which still cannot be taken as solid proof of a real step change. When the sampling interval is reduced by another order of magnitude, i.e., to 1 second, the first event can be clearly recognized by eye and by the detector as well.
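The effect of coarser sampling on closely spaced steps can be reproduced with a simple subsampler; the series below is a synthetic stand-in for the three sequential turn-on events:

```python
def resample(series, step):
    """Subsample a fast-sampled series at a coarser interval."""
    return series[::step]

# Synthetic stand-in for three sequential turn-on events, sampled fast:
fast = [10000] * 30 + [20000] * 10 + [30000] * 10 + [40000] * 30
print(resample(fast, 60))   # [10000, 40000]: the steps collapse into one jump
print(resample(fast, 5))    # intermediate plateaus remain visible
```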
Further decrease of the sampling interval makes the second event discernible to the eye, but not to the detector, which needs an appropriate length of data window to build up the sufficient statistic. The second event remains ambiguous to the detector until the sampling interval is reduced to 1/8 second, i.e., a sampling rate of 8 Hz. However, tests to detect the cycling of a reciprocating chiller in the given system showed that the false alarm rate increases drastically when the sampling interval drops below 1 second. Based on the tests and analysis of the sampling effect on detection quality, the appropriate lower limit of the sampling interval for GLR detection has been found to be of the same order of magnitude as the shortest duration of the on/off events or the shortest interval between two consecutive events of interest, whichever is shorter. The upper limit should be longer than the lower limit by about 1-2 orders of magnitude. In this case, the lower and upper limits are 1 and 30 seconds respectively. In addition to a reduced false alarm rate, another significant advantage of using such a range, compared with the higher sampling rate of 8 Hz, is the more acceptable execution time of the detection. This matters more when the automatic detector is developed with multiple sampling intervals, as discussed later. The running time of the detector was reduced from 90 minutes to less than 5 minutes when the lower limit of the sampling interval was increased from 0.125 seconds to 1 second for the test with one day's data, making the detector more suitable for on-line detection. One problem with such a moderate sampling range is that some very closely separated events might be missed or misinterpreted, such as the above event with an interval of less than 2 seconds.
But such events happen by coincidence and are rarely seen in common HVAC systems, because the switches of major electrically driven components are typically staggered to limit electrical surges. In practice, there is generally an interval of not less than 30 seconds even for coupled on/off equipment, such as the supply and return fans associated with the same air handler.

ii). Detection of on/off changes with a single sampling rate
Since the data pattern supplied to a detector, and consequently the 'visibility' of an event to the detector, varies significantly with the data sampling frequency, the sampling interval must be carefully determined based on the above order-of-magnitude analysis to achieve the desired quality of detection. In this thesis, many tests have been conducted to search for the optimal sampling rates to detect all the on/off events in an HVAC system. In the following analysis, one such test is demonstrated with the power data from a test building over six days. The tests started with the detection of the on/off cycling of a reciprocating chiller from the building's total power data. Fig. 3.13 illustrates the first seven hours of the total power data on a single day and the GLR detection output for the chiller's on/off switches with different sampling rates. From the original power data shown in Fig. 3.13(a), it is difficult to visually distinguish the chiller cycles from the whole-building power signal. The first chiller cycle, aligned with data from an electricity submeter used to check the GLR results, is readily discerned. Most of the following switches are obscured by noise or other events.
Figure 3.13. Demonstration of the detection of chiller on/off cycles from the total power data of a test building with different sampling intervals. (a). total power data from remote-1 during 0:00-7:00, 99/05/23; (b). chiller power data from a submeter during 0:00-7:00, 99/05/23; (c)-(g). detection output of the chiller's on/off switches with sampling intervals of 1, 2, 5, 10, and 20 seconds respectively.

All other sampling intervals between 1 and 60 seconds produced results similar to those shown above. Apparently, with a single sampling interval, the detector is not able to find all the on/off switches shown in the submetered data. However, it can be seen from the above plots that all the events can be found by matching the on's and off's among outputs with different sampling intervals. In this case, on/off matching among the outputs with sampling intervals of 1, 2, and 5 seconds identifies all the switches without any false alarms or missed events. Other combinations of sampling intervals may yield the same results, such as the combination of 1, 2, and 10 seconds. Similar detection patterns with different sampling intervals were found with the remaining five days' data. This indicates that the quality of detection can be significantly improved if the outputs from different sampling rates are properly integrated.
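One way to sketch the on/off matching across sampling intervals is to confirm an event only when it appears in more than one detection stream. The times, magnitudes, and tolerance below are illustrative, and the thesis's matcher additionally sorts detections by change magnitude:

```python
def match_events(outputs, tolerance=5):
    """Confirm events that appear in more than one detection stream.

    outputs maps a sampling interval to a list of (time, delta)
    detections; an event from the first (fastest) stream is accepted
    when some other stream detects it within `tolerance` seconds.
    """
    intervals = list(outputs)
    confirmed = []
    for t, delta in outputs[intervals[0]]:
        hits = sum(
            any(abs(t - t2) <= tolerance for t2, _ in outputs[k])
            for k in intervals[1:]
        )
        if hits >= 1:
            confirmed.append((t, delta))
    return confirmed

streams = {
    1: [(124, 5000), (300, 800)],   # 300 s: a spike-induced false alarm
    2: [(125, 5100)],
    5: [(123, 4900)],
}
print(match_events(streams))   # [(124, 5000)]: only the real switch survives
```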
This is because, first, different components in a system have different time constants, which result in different durations for the turn-on switches. Second, for given sizes of the pre- and post-event windows, the ideal sampling interval should be longer than any of the on/off transients and shorter than the time interval between two on/off events. Unfortunately, this condition cannot always be met in commercial applications, where VSD drives and load controls are widely used and lead to more gradual changes in the system's total power series. As a consequence, an on/off transient may last longer than the interval between two events. Moreover, as can be seen from Fig. 3.13(a), the quality of the total power data makes it difficult to isolate changes under different conditions with a single sampling rate, even for the same equipment.

iii). Detection of on/off changes by automatic matching among multiple sampling intervals
In this detector, the data series is sampled with discrete integer intervals between the lower and upper limits and then supplied to the GLR detector. The detection output for each interval is sorted by the magnitude of the changes, assigned to the equipment with that specific magnitude, and matched with the outputs for the other sampling intervals. Fig. 3.14 displays the output of this automatic detector for the data shown in Fig. 3.11, without false or missed alarms.

Figure 3.14. GLR detection output for the chiller on/off switches with the building's total power data by automatic matching among different sampling intervals: 1, 2, 3, 4, 5, 10, 20, and 30 seconds.

Additional data-filtering criteria, if available, may be applied for specific equipment to reduce the false alarm rate, especially in the presence of other components with a similar magnitude of change.
For the chiller in the test building, for example, these criteria include the minimum off-time between cycles, which is typically set within the chiller controls to prevent unnecessary equipment cycling, and the minimum expected on-time. Such limits have been incorporated in the detector designed for the given system. Data sampling with multiple intervals therefore not only sharpens the abruptness of changes, making an event clearer for the GLR detector to "see", but also helps to distinguish abrupt on/off cycles within a gradually changing process such as the slow startup of a VSD fan. With only one sampling rate, such 'enclosed' changes can hardly be found without false alarms. As shown in Fig. 3.15, fan turn-off events as recorded by the NILM logger vary in their apparent abruptness, as influenced by changes in other loads. All four events were successfully detected with multi-rate sampling, but some were always missed with any single-rate detection.

Figure 3.15. Demonstration of the varying abruptness of the turn-off transient of the same fans on two different days. Multi-rate sampling produces at least one data stream with a 'visible' or 'abrupt' event for the GLR detector despite the noise variations in the data environment, which could not be achieved with any single rate. (a). total power data of the HVAC system, 20:07 99/05/11-4:07 99/05/12, showing the turnoff of supply fans A and B among multiple events; (b). total power of the HVAC system, 20:07 99/05/24-4:07 99/05/25, demonstrating the impaired and obscured abruptness of the turnoff events in the total power data at certain sampling rates.
3.7.2 Variance tracking by multi-rate sampling of the total power
In addition to the on/off changes, the data spread about the mean value of a power series is another important factor for monitoring system performance. An effective parameter for describing this aspect of a system's operating status is the standard deviation, or the variance, of the sampled data set. Continuously high values of the standard deviation usually indicate serious problems in a system's operation. For example, unstable control in an HVAC system often leads to oscillation in the power data, which can be detected by examining the standard deviation in the FMA window. However, although such an effect can usually be seen by a detector at some sampling rates, it may become ambiguous or even totally invisible to detection based on sampling intervals that are too short compared with the oscillation period or close to integer multiples of the period. Therefore, if a wide range of sampling intervals can be used for the detection, the event monitor will be able to keep a more thorough track of the data trend and achieve a better understanding of the ongoing processes. For fault detection, this means that abnormal operation occurring in some specific range of frequencies can be reliably identified by detection based on multi-rate sampling, but might be very difficult to find with single-rate detection if the employed sampling interval falls within certain ranges. One example of such effects is illustrated by Fig. 3.16 as the output of oscillation detection during 24 hours in a test building. Oscillation in the data series shown in Fig. 3.16(a) can be detected by checking the calculated standard deviation of the data in the pre-event window, sampled at an interval of five seconds, against a threshold, but is not visible at all to detection with a one-second sampling interval, which was found too short relative to the oscillation period.
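The adaptive standard-deviation check used for oscillation tracking can be sketched as follows, with the threshold taken as a trained fraction of the current window mean. The window length, the fraction, and the synthetic data are illustrative:

```python
import statistics

def oscillation_flags(power, window=20, fraction=0.02):
    """Flag windows whose spread exceeds a moving threshold (sketch).

    The threshold is a fixed fraction of the current window mean, so it
    adapts as the load changes; `fraction` would be trained per system.
    """
    flags = []
    for i in range(len(power) - window + 1):
        chunk = power[i:i + window]
        sd = statistics.pstdev(chunk)
        flags.append(sd > fraction * statistics.fmean(chunk))
    return flags

steady = [1000.0] * 25
swinging = [1000.0 + (100.0 if i % 2 else -100.0) for i in range(25)]
print(any(oscillation_flags(steady)))     # False
print(all(oscillation_flags(swinging)))   # True
```

Because the flags come from a window-level statistic, the same check can be run on streams resampled at several intervals to cover oscillations of unknown period.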
Established as a fixed percentage of the current mean in the continuously updated samples, the threshold for the standard deviation varies with the moving window, as shown by the solid lines in Fig. 3.16(b) and (c). It can be seen that this approach mitigates the problem of picking a sampling rate appropriate for the detection of oscillations at an unknown frequency. As a result, data tracking based on multi-rate sampling can be easily implemented and yields more reliable output than tracking with a single sampling rate.

Figure 3.16. Effects of sampling rate on the detection of oscillation in a power series. (a). twenty-four-hour total power data of a test building's HVAC system, exclusive of the chiller; (b), (c). detection output (calculated standard deviation and threshold) based on sampling intervals of 1 second and 5 seconds respectively.

3.8 Summary of the training guidelines
Guidelines for training the parameters involved in the GLR detection used in the previous sections are summarized as follows.
1). Record the electrical power for the circuits monitored by the data logger for one day under typical operating conditions. The sampling rate should be between 1 and 10 Hz for common HVAC systems.
2). Locate the events from the abrupt changes in the total power data and identify the fastest and the slowest events.
3). Determine the sampling rates for detection. The base sampling rate will be used as the fastest sampling rate if multi-rate sampling is employed; power data sampled at this rate should allow each event of interest to be discerned by eye. The lower limit of the sampling rate (the slowest rate) should still be able to pick up the events of longest duration.
Other sampling rates can then be chosen between these two limits, with 2-5 rates selected within each order of magnitude.
4) Determine the window lengths. The detection window should contain at least two data points; it should not be longer than the interval between two consecutive events and never shorter than a disturbance. The pre-event window is 2-4 times longer than the detection window. A third window with the same length as the pre-event window is used for data reset.
5) Estimate the maximum of the standard deviation of the power data as a fraction f of the current total power. Note that periods with abrupt changes should be excluded from the calculation of f. The lower limit of the standard deviation should be set one order of magnitude lower than the estimated value f.
6) Estimate the threshold for the detection statistic. Without knowledge of the minimum expected change Vm in the total power data, a reasonable base value for the threshold is 1/f^2. If Vm is known in advance, the training of the threshold can be started with (s/Vm)^2, where s is the average standard deviation of the samples from the training data. The threshold for the GLR is then trained by adjusting this value until all events of interest can be seen by the detector with a minimum number of false alarms. The lower limit of the GLR can be started at 1/(10f)^2 or (s/(10 Vm))^2.

Note that the starting values for the training of the parameters are estimated thresholds, not exact values. Moreover, exact values are not expected for the above parameters because of the statistical properties of the detection method; slightly different combinations of these parameters may yield equally acceptable detection output. The detector developed in this research has been successfully applied to two test sites. With the above basic rules and guidelines, the training process for the parameters becomes much easier.
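The threshold guidelines above can be condensed into a small helper that derives starting values from a stretch of training data. This is a minimal sketch that follows only the stated formulas (1/f^2, (s/Vm)^2, and the factor-of-ten lower limits); the function name and the simple non-overlapping windowing are illustrative, and windows containing abrupt changes are assumed to have been removed beforehand.

```python
import math

def train_thresholds(training_power, window=6, v_min=None):
    """Estimate starting GLR thresholds per the guidelines. f is the maximum
    windowed standard deviation expressed as a fraction of the window mean;
    training_power is assumed free of abrupt changes."""
    f = 0.0
    stds = []
    for i in range(0, len(training_power) - window + 1, window):
        w = training_power[i:i + window]
        mean = sum(w) / window
        s = math.sqrt(sum((x - mean) ** 2 for x in w) / (window - 1))
        stds.append(s)
        if mean > 0:
            f = max(f, s / mean)
    if v_min is None:
        base = 1.0 / f ** 2                # no knowledge of the minimum change Vm
        lower = 1.0 / (10.0 * f) ** 2
    else:
        s_avg = sum(stds) / len(stds)
        base = (s_avg / v_min) ** 2        # start from (s / Vm)^2
        lower = (s_avg / (10.0 * v_min)) ** 2
    return {"f": f, "glr_threshold": base, "glr_lower_limit": lower}
```

The returned values are only starting points for training; per the guidelines, the final threshold is tuned until all events of interest are seen with a minimum of false alarms.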
Because the rules for the window lengths apply to common HVAC systems in general, similar window lengths can be used in different applications with minor adjustments, and the major training task is usually reduced to determining the thresholds for the sufficient statistic and the standard deviation.

3.9 Application of the GLR model in fault detection with the system's total power input

Faulty operation in common HVAC systems often leads to excessive energy input to some components and hence abnormal total power supply to the system. Therefore, in addition to detecting the occurrence of on/off switches as in the previous sections, faults in an HVAC system can also be identified by monitoring the magnitude of changes in the total power data or the duration of an event with appropriate techniques. Without submetered power data and the related reference parameters, such abnormal changes can usually be recognized during the on/off switches of the components. For example, an offset in the signal of the static pressure sensor leads to higher power input to the supply fan, which can be found from the magnitude of the total power change when the fan is turned off, but can hardly be seen while the fan keeps running. In principle, faults that cause an abnormal change magnitude or cycling frequency can be detected with the change detection algorithm presented in the previous sections. However, depending on the magnitude of the power change relative to the monitored total power and the noise level, detection of significant changes caused by components' on/off switches must be implemented at different system levels. It is difficult to derive an analytical threshold that defines the appropriate system level for monitoring a specific component.
From tests conducted with some typical HVAC systems, the detectable magnitude should not be less than 5% of the total power, which is generally close to the observed magnitude of the standard deviation relative to the total power data under normal operation. For example, in the tests with the HVAC system used in the previous section, the reciprocating chiller generally switches between 5 and 10 kW, which is about 5-10% of the building's total power input, while the turnoff power of the supply fan and other equipment is usually around 0.5 kW. Therefore, faults related to chiller operation can be found from the building's total power data, while those related to the supply fan are better detected from the total power data of a subsystem, e.g., the HVAC system exclusive of the chiller. This is feasible in buildings where fans and pumps are served by a motor control center while chillers are metered separately. The effect of the relative magnitude of a change on the detection accuracy is studied in Chapter 5. Another typical undesirable operating status of an HVAC system is unstable control, which may not only cause extra energy cost but also directly harm the related equipment. Since oscillation in power consumption is typical of an unstable control system, the instabilities can be identified by checking the variance or the standard deviation of the sampled sequence as discussed before. The detector developed with the enhanced GLR algorithm and data processing techniques has been tested with faults that cause the above effects in the total power. In this section, some of the tests for several typical degrading and abrupt HVAC faults related to pressure sensors, dampers, valves, etc., are presented and discussed. Data were collected from a typical HVAC system in a test building as shown in Fig. 3.17 [Norford et al., 2000].
Further details about the system and fault implementation are given in the appendix.

Figure 3.17. Schematics of: (a) the test building; (b) the chilled water flow circuit; and (c) the air-handling units.

Tests were conducted in a one-story building that combines laboratory-testing capability with real building characteristics and is capable of simultaneously testing two full-scale commercial building systems side by side with identical thermal loads. The building is equipped with three variable-air-volume air-handling units: AHU-A, AHU-B, and AHU-1. AHU-A and B are identical, while AHU-1 is similar but larger to accommodate higher thermal loads. Two fan-coil units are also installed in this building but were not used in the tests for this research. The major components of the AHUs are the recirculated air, exhaust air, and outdoor air dampers; the cooling and heating coils with control valves; the supply and return fans; and the ducts that transfer air to and from the conditioned spaces. Air from the AHUs is supplied to VAV box units, each having electric or hydronic reheat. All six fans and ten pumps, with capacities ranging from 0.5 to 5.0 kW, are served by a motor control center. The pumps in AHU-A and B are equipped with constant-speed motor drives while the other pumps and all the fans have variable-speed motor drives.
The total power input of the motor control center is recorded by a logger named remote-2, and each component is monitored with a power submeter for the tests. The cooling load is handled by a two-stage air-cooled reciprocating chiller with a nominal input of 10 kW. The power input of the chiller used for this test is supplied through the building's general electricity distribution circuit, which is monitored by another power logger called remote-1. For the GLR detection, power data from remote-1 were used for analysis of faults associated with the chiller, and power data from remote-2 were used for detection related to the fans and the pumps. In addition to the power meters, the AHUs are well instrumented with sensors for all the controlled variables in common HVAC systems, which include the basic measurements used for the detection method developed in this research.

Table 3.1. Power capacity and motor types of fans and pumps served by the motor control center.

Equipment                Power capacity (kW)   Motor drive speed
Supply fan 1             5.0                   Variable
Return fan 1             2.0                   Variable
Supply fan A             5.0                   Variable
Return fan A             2.0                   Variable
Supply fan B             5.0                   Variable
Return fan B             2.0                   Variable
Hot water pump 1         0.75                  Variable
Hot water pump A         0.4                   Constant
Hot water pump B         0.4                   Constant
Hot water pump LA        0.4                   Variable
Hot water pump LB        1.0                   Variable
Chilled water pump 1     1.0                   Variable
Chilled water pump A     0.4                   Constant
Chilled water pump B     0.4                   Constant
Chilled water pump LC    1.0                   Variable
Chilled water pump CH    1.0                   Variable
Total                    32.25

3.9.1 Model calibration and parameter identification

As summarized in Section 3.8, parameters for event monitoring with power consumption can be obtained without any special control operation or arrangement of the HVAC system. With data from normal operation, the related thresholds are established for future detection. To find the power threshold for the turnoff of the supply fan by the detector, data at the turnoff time must be sampled.
Ten minutes are needed each day, five before and five after the turnoff, and three to five days of data during this time slot are used to eliminate the effect of noise on the magnitude. Thus for the turnoff of the supply fan, 30-50 minutes of data are required over 3-5 days of normal operation, or of operation with faults that do not significantly affect the fan power, during the late evening when the fan is turned off. For detection involving the chiller power, the power cycling during the low-load condition must be used to estimate the normal intervals between two consecutive turn-on events of the chiller. The cooling load generally reaches its minimum during the early morning hours, and the chiller cycling will be maintained at a minimum constant frequency under normal operation. In this test, six hours of the building's total power data from midnight to 6 am are required to determine the cycling period. For detection that requires full operation of the economizer, i.e., a 100% open outdoor air damper while the outdoor air temperature is slightly higher than the supply air temperature during the operating time, the required weather condition is not easy to find. Although only 5-6 hours of such a period are needed, it is not typically seen during summer and is more likely to be found under spring or fall load conditions. For the estimation of the noise in the power data, a period of 24 hours that covers the operating conditions of a typical day has been found sufficient to calibrate the threshold for the calculated standard deviation. Table 3.2 summarizes the trained values for the detection with the test system.

Table 3.2. Trained parameters for the detection and diagnosis in the test system.
Parameter                                              Value
Sampling intervals (seconds)
  Fastest events                                       1
  Slowest events                                       600
  Base sampling interval                               1
  Other sampling intervals                             2, 3, 4, 5, 10, 20, 30
Window lengths (data points)
  Pre-event window                                     6
  Detection window                                     4
Standard deviation as a fraction of the power data
  Upper limit                                          0.1
  Lower limit                                          0.02
Threshold for the sufficient statistic
  Upper limit                                          20
  Lower limit                                          5
Chiller cycling interval (minutes)                     35
Normalized outdoor air temperature                     0.2

For the non-intrusive detection based on the centralized power data, besides the two meters for the power input of the building and the motor control center, the only sensor needed is the outdoor air temperature, which is generally available from the building energy management system. Three setpoints are used: the supply air temperature, the room air temperature, and the balance point of the outdoor air for the economizer control.

3.9.2 Detection of typical faults with the total power data

a) Detection of faulty operation by change magnitude at turnoff

In general, a change in the total power data at turnoff is more abrupt and hence easier to detect than one at turn-on, especially for equipment with startup protection or variable load control. Therefore, detection of faults related to magnitude change in total power is conducted at turnoff. Any fault that causes abnormal power consumption can be detected by this method if the related equipment is turned off under similar load conditions. In HVAC applications, equipment is turned off or run in the economizer mode at a fixed time when the building is not occupied during the late evening and early morning, which means a similar low-load condition at turnoff each day. If the detected change in the total power data caused by turnoff of the equipment at the given time is significantly higher than the trained threshold for this component, an alarm is issued for it.
For example, a VSD fan tends to consume more energy when the control loop is not working properly. Fig. 3.18(a) shows the total power profile of the motor control center under normal and faulty operating conditions throughout 24 hours (8:00 pm - 8:00 pm) of three different days, with the supply fan turned off at 10:00 pm on each day. Fig. 3.18(b) compares the detected magnitude of the changes under normal and faulty conditions, with additional normal days' data included. All the points under 600 watts are the detected turnoff power changes under normal conditions during 16 days. The two days with a pressure sensor offset clearly show significantly larger values (> 600 W). Therefore, by setting a threshold of 600 W on the change magnitude, abnormal power consumption can be recognized by the detector. Other faults with similar effects can also be found in the same way, such as an unexpectedly restricted air flow path, which also leads to higher fan power consumption and hence a larger magnitude of change in the system's total power when the fan is turned off. Fig. 3.18(b) presents such an example for a stuck-closed recirculation air damper.

Figure 3.18. Detection of the faults of pressure sensor offset and stuck-closed damper from the magnitude of changes in the HVAC system's total power input at the turnoff of a supply fan. (a) 24-hour total power data of the air-conditioning system on three typical days (99/05/22 normal; 99/05/15 pressure sensor offset; 99/02/28 stuck-closed recirculation damper); (b) detection output of the total power change caused by the turnoff of the fan under normal and faulty operating conditions.
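The turnoff check itself can be sketched as a comparison of mean power shortly before and after the scheduled switch. This is a hypothetical helper, and the 600 W default reflects the threshold trained for this test building's supply fan, not a general value.

```python
def turnoff_magnitude_alarm(pre_window, post_window, threshold_watts=600.0):
    """Compare mean power just before and just after a scheduled turnoff;
    alarm when the detected drop exceeds the trained threshold."""
    drop = sum(pre_window) / len(pre_window) - sum(post_window) / len(post_window)
    return drop > threshold_watts, drop

# Faulty day: the fan was drawing extra power, so the turnoff drop is large.
alarm, drop = turnoff_magnitude_alarm([1900.0] * 5, [700.0] * 5)
```

In the tests, the pre- and post-turnoff windows correspond to the five minutes of data sampled on each side of the fixed turnoff time.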
While faults can be detected when an abnormally large magnitude is found at turnoff, an extremely small change, or no change, found at the designated turnoff time also indicates potential faulty operation. If the turnoff of a component is not found at the given time, then a fault has likely occurred in the system. For example, a fan with a slipping belt may produce an extremely small magnitude of total power change at turnoff.

b) Detection of faulty operation from on/off cycling

Sometimes a fault does not cause any abnormality in the power magnitude but rather in the on/off cycling frequency of the equipment, especially when the motor runs at a constant speed. This method is very effective because in HVAC systems, especially in commercial applications, there are generally restrictions on the on/off cycling of some equipment to protect it from deterioration by frequent switching. For such components, some faults can accordingly be analyzed by examining the on/off intervals, based on the detected events from the power input, with reference to the current load condition or the design specifications. For example, chiller operation is generally controlled based on the entering and leaving water temperatures, which are mostly affected by the cooling load introduced through the cooling coil valve. If the load condition stays around some stable state, e.g., during the late night or early morning in an office building, the chiller cycling period should also remain constant under normal operation. When the cooling coil valve is leaking, the load on the chiller tends to increase due to the higher flow rate of water to be processed, which tends to cause shorter off-on intervals for a chiller with stepwise control. Therefore, the leaky valve might be identified if the chiller on/off switches can be correctly detected during periods under appropriate load conditions.
It should be noted that with the chiller power input, the leaky-valve fault can be most reliably identified when the cooling coil valve is commanded closed by the control signal. The following charts show the output obtained by the detector from the total power data of the test building. In this system, the cooling coil valve is closed and the chiller cycles on and off regularly by design intent during the early morning hours. Under normal conditions, the on-off duration and the off-on interval were 4-5 minutes and 38-39 minutes respectively, i.e., about 10 full on/off cycles in total between 0:00-7:00 am while the valve was closed. Therefore, the threshold for a full cycle, which consists of the on-off duration and the off-on interval, was calculated to be 42 minutes/cycle and was then used as the minimum period of a full on/off cycle. If the valve is leaky, the average period of a full on/off cycle tends to fall below this threshold when the valve is expected to be closed, and an alarm will be issued by the automatic detector. During the FDD tests in spring, the leaky-valve fault was found on three days, with average full cycling intervals of 33.2, 30, and 27.3 minutes respectively. Fig. 3.19 demonstrates the detection output of one day with the leaky valve. Other faults that cause abnormal cycling of the chiller can be detected similarly. However, the conditions for the detection might differ depending on the characteristics of the fault and the control system setup. For example, the fault of a leaky recirculation air damper can only be seen from the chiller cycles when the system is in the economizer mode, i.e., when the outside temperature falls within a certain range beneficial for energy conservation.
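As a sketch, the cycling check reduces to averaging the spacing of detected turn-on events and comparing it against the trained minimum period. The 42-minute default is the value trained for this chiller; the function and variable names are illustrative.

```python
def leaky_valve_alarm(on_event_times_min, min_cycle_min=42.0):
    """Average full-cycle period from consecutive chiller turn-on times
    (minutes). A leaky cooling coil valve shortens the cycles, so alarm
    when the average period falls below the trained minimum."""
    if len(on_event_times_min) < 2:
        return False, None
    periods = [b - a for a, b in zip(on_event_times_min, on_event_times_min[1:])]
    avg = sum(periods) / len(periods)
    return avg < min_cycle_min, avg
```

The turn-on times here would come from the GLR change detector applied during the early-morning hours when the valve is commanded closed.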
In this test building, it was found that the abnormal cycles could be detected only when the outside temperature was slightly above the point where the supply air temperature controller began to open the chilled-water valve in order to maintain the supply-air temperature setpoint. Limiting the examination of the cycling frequency to this narrow region reduces the risk of false alarms due to normal changes in cycling rate at higher outside temperatures, when the cooling load increases. The rule used in the tests is:

IF ( 0 < (T_oa - T_spt.sa) / (T_spt.ra - T_spt.sa) < h_t.oa AND chiller cycling frequency > threshold ) THEN alarm

where h_t.oa is the threshold for the dimensionless outdoor air temperature, calculated as the ratio of the difference between the outdoor air temperature and the supply air temperature setpoint to that between the supply and the return air temperature setpoints. Although it is desirable to keep the temperature threshold as small as possible to minimize the false alarm rate, some variation in the outdoor air temperature should be accommodated by the detection rule. In this case, h_t.oa < 20% was found acceptable.

Figure 3.19. Detection of the fault of a leaky cooling coil valve from the cycling frequency of the chiller with the building's total power data under low load condition during the early morning hours (0:00-7:00, May 22, 1999). (a) 7-hour total power data of the system with a leaky cooling coil valve; (b) chiller cycles from the submetered chiller power; (c) chiller cycles as the detected power changes.

c).
Detection of faulty operation from the standard deviation

As discussed in Section 3.7.2, unstable control of equipment usually leads to oscillations in the system's total power consumption, which can be detected by analyzing characteristics such as the standard deviation of the sampled power sequence. What needs to be noted here are the rules for the oscillation alarm. The rule used to detect an abrupt change is:

IF ( sufficient statistic > upper limit of the GLR AND standard deviation in the detection window < lower limit for the standard deviation ) THEN alarm for abnormal change and check the magnitude.

The rule for detecting power oscillation is:

IF ( sufficient statistic < lower limit of the GLR AND standard deviation in the detection window > upper limit for the standard deviation ) THEN alarm for oscillation.

The reason for combining the sufficient statistic and the standard deviation is that whenever an abrupt change occurs, the data variance in the detection window increases significantly and a short alarm for oscillation might otherwise be issued. To eliminate such false alarms, and also in consideration of the data noise, a certain ambiguous region should be discarded in the search for real events. The upper and lower limits depend on the equipment characteristics of the HVAC system and can be trained with test data. For a given system, these parameters remain constant.

3.10 Results and discussion

This chapter demonstrates that low-cost information about individual electrical components in a building can be obtained via careful analysis of power measurements at central locations within the building, notably the electrical service entrance and motor-control centers that supply power to the HVAC components.
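The two rules above can be written directly as a small classifier over the sufficient statistic and the detection-window standard deviation; anything falling in the ambiguous band between the limits raises no alarm. This is an illustrative sketch, with limit values supplied by training (e.g., Table 3.2 for the test system).

```python
def classify(glr_stat, std_detection, glr_upper, glr_lower, std_upper, std_lower):
    """Combine the sufficient statistic and the detection-window standard
    deviation; the band between the limits is treated as ambiguous."""
    if glr_stat > glr_upper and std_detection < std_lower:
        return "abrupt_change"   # then check the change magnitude
    if glr_stat < glr_lower and std_detection > std_upper:
        return "oscillation"
    return "no_alarm"
```

Requiring a low GLR statistic for the oscillation alarm suppresses the short false oscillation alarms that would otherwise appear at every genuine abrupt change.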
A reliable automatic detector suitable for use in noisy and complex electrical environments has been developed, involving the establishment of guidelines for tuning the detector, innovations in the detection algorithm, and preprocessing of the data, including the design of a filter and a multi-rate data sampler, all contributing to significantly improved performance (enhanced sensitivity to signals of interest and rejection of electrical noise). Test results show that with the centralized power data, the detector is able to reliably detect the on/off events of interest and keep track of the trend in the power data series from real buildings at low cost and with little intrusion into the system. The performance of the detector has been demonstrated by the detection of several typical faults related to abnormalities in the magnitude of on/off changes, the frequency of equipment cycling, or the trend of the data in real HVAC systems. The multi-rate sampling technique enables the detector to find changes that may be invisible to detection with a single sampling rate. However, on/off events that overlap in time can hardly be detected without false reports of magnitude change. To distinguish such changes, additional information, such as the control signals from the building energy management system, is necessary to avoid false alarms. Dynamic analysis of the transient process is also a possible solution. Moreover, although faults usually cause abnormal effects in power consumption and can be detected and initially diagnosed with the algorithm developed in this chapter, further identification is necessary to find the real cause of a fault. This indicates that more specific information is necessary to describe the components in the system, which leads to the study of detection and diagnosis at deeper levels of the system in Chapter 4 and Chapter 6.
CHAPTER 4
Component power modeling for fault detection and diagnosis

This chapter describes the second steady-state method for fault detection and diagnosis, based on modeling the submetered power input of equipment in HVAC systems. First, typical faults in HVAC systems and their effects on power consumption with reference to other parameters are studied. Then the correlation between submetered power data and basic control signals or measurements available in common HVAC systems is analyzed based on a least-squares fitting algorithm with singular value decomposition. Finally, models are developed for major HVAC components in real buildings and tested with typical faults.

4.1 Introduction

Faults in HVAC systems lead to abnormal power consumption at both the system and the component levels. For detection at the system level, the most valuable data are the change magnitude in the total power data when a piece of equipment is turned on or off and the standard deviation of the sampled data in the FMA window, as demonstrated in Chapter 3. Although it can be used to find the faulty operating status of a system, a detector based on the system's total power input is usually not able to issue reliable alarms until a fault becomes serious enough to be recognizable in the noisy power environment. In practice, this leads to unacceptable system malfunctions and energy waste before the fault can be identified. In order to maintain desirable indoor air quality and avoid excessive energy consumption, faults should be eliminated at an early stage. This means detection and correction of faults should be implemented not only at the system level during the on/off switches of a device, but also at the component level while the equipment is running, which requires closer monitoring of the component. For example, the gradual offset of a static pressure sensor used for fan speed control may result in considerable energy waste before it becomes discernible in the total power data.
In order to keep track of such variations in power consumption at an early stage, the load needs to be monitored at the component level with reference to appropriate control signals or measurements. In a common air-handling process, the pressure-independent VAV system maintains a constant static pressure at a certain location in the supply air duct by controlling the fan speed, while the heating/cooling coil adjusts the supply air temperature according to the load conditions via control of the valve and the chiller. Abnormal operation of any of these devices tends to cause undesirable air conditions, excessive energy consumption, or both. Faults in an air-handling unit (AHU) may occur on either the air or the water side. In the air loop, faults are usually caused by some type of blockage or leakage that changes the resistance or the setpoint for the air flow and hence the driving power drawn by the fan, e.g., a stuck damper in the air duct. This indicates that at a given air flow rate, the fan power under faulty conditions will differ from that in normal operation. Such faults can therefore be found and analyzed by monitoring the variations in the fan power with reference to the air flow rate. Previous research has shown that the power consumption of a fan with variable load adjustment can be described as a polynomial function of the air flow rate or the fan speed under constant static pressure control [Englander and Norford, 1990a]. Similarly, in the water loop, such blockage and leakage can be identified with the pump and/or chiller power. It should be noted that some degradation must be accepted for most equipment in common HVAC systems after a certain period of operation. In order to avoid false alarms caused by random disturbances and to allow a reasonable level of defects, it is appropriate to introduce an offset or a confidence interval for the fitted function.
With a given control signal or measurement, if the sampled power data exceed a certain threshold or fall outside the confidence interval of a power function, a fault is identified and then diagnosed with expert rules. The objective of this chapter is to establish an appropriate algorithm in matrix form, use it to develop the power functions of the major equipment in common HVAC systems, and then establish the means for detection and diagnosis of typical faults in VAV AHUs by applying proper confidence intervals to the functions.

4.2 Component power modeling by correlation with basic measurements or signals

The power model of a component depends on the type of the related motor drive. For a constant-speed motor drive with finite-state control, the submetered power can be established as a stepwise function in accordance with the range of the reference data. For equipment with continuously adjusted control signals or load, such as a fan with a variable-speed motor or a variable inlet vane, the power consumption can be properly represented by a polynomial function of the reference parameter.

4.2.1 Correlation by linear least squares with singular value decomposition

Since the measurement errors or the noise in the power data follow a normal distribution, the maximum-likelihood estimation of the coefficients in the power function can be carried out by chi-square minimization [Draper and Smith, 1981]. Hence the linear least-squares fit finds the coefficients a_j (j = 1, 2, ..., M) of the function

    y(x) = \sum_{j=1}^{M} a_j f_j(x)    (4.1)

from a set of measured data pairs (x_i, y_i) (i = 1, 2, ..., n) by minimizing the chi-square function

    \chi^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \sum_{j=1}^{M} a_j f_j(x_i)}{\sigma_i} \right]^2    (4.2)

where f_1(x), ..., f_M(x) are arbitrary functions of x, called basis functions, and \sigma_i is the measurement standard deviation of the i-th data point. Note that "linear" fitting here means y(x) is linearly dependent on the parameters a_j, while the functions f_j(x) can be linear or nonlinear.
The minimum of \chi^2 is found from its partial derivatives with respect to the M parameters a_j,

    \frac{\partial \chi^2}{\partial a_j} = 0 \quad (j = 1, ..., M),

which results in the normal equations in matrix form,

    (X^T \cdot X) \cdot a = X^T \cdot y    (4.3)

where X is the n x M design matrix with elements X_{ij} = f_j(x_i)/\sigma_i, y is the vector with elements y_i/\sigma_i, and a = (a_1, a_2, ..., a_M)^T. Let A = X^T \cdot X and b = X^T \cdot y; Eq. (4.3) can then be rewritten as

    A \cdot a = b    (4.4)

and the normal equations can be solved for the a_j by multiplying both sides by A^{-1},

    a = A^{-1} \cdot b    (4.5)

However, the solution of Eq. (4.5) is rather susceptible to roundoff errors when the condition number, the ratio between the largest and the smallest singular values (the diagonal elements of the matrix W below), is extremely large. To avoid the poor approximation caused by an ill-conditioned matrix, a method called singular value decomposition (SVD) is used to decompose the matrix A [Golub and Van Loan, 1989]. The SVD algorithm is based on the following theorem of linear algebra: any M x N matrix A whose number of rows M is greater than or equal to its number of columns N can be expressed as the product of an M x N column-orthogonal matrix U, an N x N diagonal matrix W with positive or zero elements, and the transpose of an N x N orthogonal matrix V:

    A = U \cdot W \cdot V^T    (4.6)

Hence the inverse of A is

    A^{-1} = V \cdot [\mathrm{diag}(1/w_j)] \cdot U^T    (4.7)

In this clearly decomposed structure, the only source of undesirable results is the term 1/w_j when w_j = 0 or w_j \to 0 from roundoff. Such a singular matrix can still be solved for a 'healthy' set of a_j by simply replacing 1/w_j by zero if w_j = 0 or w_j \to 0, which in effect removes the one linear combination of the parameters that reduces the rank of A. Note that in the above discussion there is no restriction on the type of the functions f_j (j = 1, 2, ..., M), and the method can also be used for fitting with multiple independent variables, i.e., multi-dimensional correlation.
The coefficients of the original function can thus be written as

    a = (a_1, a_2, ..., a_M)^T = V \cdot [\mathrm{diag}(1/w_j)] \cdot U^T \cdot b    (4.8)

Analytically, the dependent variable can then be estimated for any new data point as

    \hat{y}(x) = [f_1(x), f_2(x), ..., f_M(x)] \cdot a    (4.9)

In practice, however, this fitted model is always subject to some level of uncertainty due to the unavoidable errors introduced by the measurements and the model fit itself. To achieve reliable fault detection, a range based on statistical confidence must be defined to accommodate the error effects.

4.2.2 Selection of parameters for correlation

The power consumption of a component in an HVAC system may be affected by more than one factor and can be modeled, through the above equations, as different functions of various reference indices. However, the resulting functions may vary greatly in accuracy and applicability and hence differ significantly in their efficacy at distinguishing between faulty and normal operation when used in fault detection. To achieve reliable outputs, a model should present a consistent and sensitive response in power to the reference parameter. Therefore, selection of the reference parameter is crucial for the power model. The major criteria for selecting such a parameter are its ability to predict the equipment's power input, the accessibility of the measurement, and the detectability of multiple faults with the model. The ability of prediction implies minimum uncertainty in the correlation between the dependent and the independent variables, which requires the power consumption to be closely, and ideally exclusively, related to the parameter.
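The procedure of Eqs. (4.3)-(4.8) can be sketched with NumPy: form A = X^T X and b = X^T y, decompose A, and zero the reciprocals of negligible singular values. This is a minimal illustration with unit measurement weights (sigma_i = 1) and an assumed cutoff tolerance; the basis functions here are a quadratic, matching the polynomial fan models discussed later.

```python
import numpy as np

def svd_solve(X, y, tol=1e-12):
    """Solve the normal equations A.a = b with A = X^T X and b = X^T y by
    SVD (A = U W V^T), replacing 1/w_j by zero for (near-)zero singular
    values, per Eqs. (4.3)-(4.8)."""
    A = X.T @ X
    b = X.T @ y
    U, w, Vt = np.linalg.svd(A)
    w_inv = np.where(w > tol * w.max(), 1.0 / w, 0.0)  # drop rank-deficient modes
    return Vt.T @ (w_inv * (U.T @ b))

# Basis functions f1 = 1, f2 = x, f3 = x^2.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + 0.5 * x ** 2
X = np.column_stack([np.ones_like(x), x, x ** 2])
a = svd_solve(X, y)   # recovers approximately [3.0, 2.0, 0.5]
```

As a design note, numerical texts often apply the SVD to X directly rather than to A, since forming X^T X squares the condition number; the formulation above follows the derivation in this section.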
For example, although fan power changes with outdoor air temperature, it is hard to formulate a useful relationship between them for fault detection, because the outdoor air temperature cannot directly and exclusively determine the fan power; this increases the uncertainty in the model due to the 'transmission error' and leads to a 'loose' correlation or even no applicable correlation at all. Fig. 4.1 illustrates the difference in the association between the power input of a supply fan and two parameters, the outdoor air temperature and the supply air flow rate. Compared to the plot of supply fan power vs. air flow rate under constant static pressure control, which shows a clear polynomial trend, the data clusters of supply fan power vs. outdoor air temperature present a much wider spread of power data at the same outdoor air temperature, caused by the air flow rate varying with the combined effects of the free-cooling switch, the delay in thermal response, and the control deadband. Therefore, for fault detection with the fan power input, the supply air flow rate is a better reference parameter than the outdoor air temperature and is usually used as a major index in fan power prediction. Accessibility is also an important practical consideration, which in fact determines the feasibility of the detection method. For example, the damper position determines the change of the resistance in the air loop and hence affects the fan power. However, damper position is not easy to measure during operation. In a typical HVAC system, usually more than one damper is used to control the air distribution, which further complicates the measurement and the correlation. The detectability of multiple faults with the model requires that the fitted function, with an appropriate confidence interval, be able to detect as many different faults as possible.
This is intended to save cost and minimize the intrusion into the system due to the measurement of multiple parameters, as well as to alleviate the complexity of the correlation itself. For example, even if the fan power vs. damper position function can be used to identify an inappropriate damper position, it fails to recognize other significant faults related to the fan power, e.g., a static pressure sensor offset. Therefore, the independent variable selected for fitting should be not only a parameter of easy access, with least intrusion into the system and minimum interruption to its operation, but also a sensitive and comprehensive index for variations of power consumption related to different causes.

Figure 4.1. Illustration of the difference in the ability to predict the power input of a supply fan between two different reference parameters: (a) outdoor air temperature and (b) supply air flow rate. Data were sampled at an interval of 1 minute from a test building during 7:30-19:00 when the fan was running.

4.2.3 Power functions of major HVAC components

Based on thermal and fluid laws, the power consumption of the fans, pumps, and chillers of HVAC systems can usually be expressed as functions of control signals or measurements, either as a step function when stepwise control is used for the equipment or as a polynomial function when continuous adjustment such as PI control is involved. By fitting the power data against appropriate parameters, a power model can be established for each major piece of equipment, which can then be used for energy estimation and for fault detection with certain confidence intervals.
4.2.3.1 Equipment with continuous capacity control

The power model of such a component can be expressed by a bi-quadratic function of related reference parameters [Braun et al., 1987],

\[
P = a_0 + a_1 x + a_2 x^2 + a_3 y + a_4 y^2 + a_5 x y \tag{4.10}
\]

where x and y are the measured reference parameters and a_0, ..., a_5 are the coefficients to be fitted. The reference parameters are selected based on the criteria given in Section 4.2.2. For example, in the power function of a VSD fan, x and y may represent the air volume flow rate and the pressure difference across the fan, respectively. With the submetered power data P_i (i = 1, 2, ..., n) and the measured parameters x_i and y_i, the model can be written in matrix form as

\[
\begin{pmatrix} P_1 \\ P_2 \\ \vdots \\ P_n \end{pmatrix}
= \begin{pmatrix}
1 & x_1 & x_1^2 & y_1 & y_1^2 & x_1 y_1 \\
1 & x_2 & x_2^2 & y_2 & y_2^2 & x_2 y_2 \\
\vdots & & & & & \vdots \\
1 & x_n & x_n^2 & y_n & y_n^2 & x_n y_n
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{pmatrix} \tag{4.11}
\]

The χ² merit function is then formed as

\[
\chi^2 = \sum_{i=1}^{n} \left[ \frac{P_i - (a_0 + a_1 x_i + a_2 x_i^2 + a_3 y_i + a_4 y_i^2 + a_5 x_i y_i)}{\sigma_i} \right]^2 \tag{4.12}
\]

The fitting procedure for such a function is similar for all the power-consuming equipment. At the component level, fan power is of key importance in detecting many faults in air handling units. In this thesis, the fan power function has been thoroughly studied with data from real buildings. The power functions of the other equipment and the corresponding confidence intervals for detection can be established similarly. Previous studies have shown that fan power correlates well with some parameters. For example, Lorenzetti and Norford [1992] have shown that the hourly average power consumption of a VSD fan can be expressed as a function of the outdoor air dry bulb temperature. Although such a correlation works well when the internal thermal load is insignificant or remains constant compared to the external load, it cannot be used to predict the fan power if the conditioned space is subject to considerable and varying thermal disturbances other than the outdoor air temperature.
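As an illustrative sketch of the fitting step for Eqs. (4.10)-(4.11), the following fits the bi-quadratic model by least squares on synthetic data; the coefficients, ranges, and units are invented for the example and are not measurements from the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def design(x, y):
    """Design matrix for the bi-quadratic model of Eq. (4.10)."""
    return np.column_stack([np.ones_like(x), x, x**2, y, y**2, x * y])

# Synthetic 'submetered' sample; coefficients and ranges are invented
a_true = np.array([100.0, 0.3, 0.05, 40.0, 2.0, 0.01])
x = rng.uniform(1.0, 3.0, 120)   # e.g. air flow rate (thousands of CFM)
y = rng.uniform(1.0, 4.0, 120)   # e.g. pressure difference across the fan
P = design(x, y) @ a_true

a_fit, *_ = np.linalg.lstsq(design(x, y), P, rcond=None)
```

With noise-free data the six coefficients are recovered essentially exactly; with real submetered data the residual scatter is what the confidence intervals of Section 4.3 must absorb.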
In air conditioning systems, all the thermal disturbances, from both inside and outside, are ultimately accommodated by the amount and the discharge temperature of the supply air. Therefore, the air flow rate becomes the most comprehensive index of fan power consumption when the discharge air temperature is fixed, which is typical of control in HVAC systems. From the characteristics of a fan, fan power can be determined by any two of the three variables: total pressure, air flow rate, and fan speed, as shown by the curves in Fig. 4.2 for a fan typically used in HVAC systems [ASHRAE Handbook, 1996].

Figure 4.2. Characteristic curves of a backward-tipped fan (total pressure, shaft power, and efficiency vs. volume flow rate).

For a fan with a variable-speed motor, or a constant-speed motor under inlet vane control, the power input can be determined from the supply air flow rate, the total pressure gain, and the total efficiency, expressed as the product of four efficiencies related to the fan, the fan-motor coupling, the motor, and the motor drive:

P_i = (total pressure gain × air flow rate) / (total efficiency)

The total pressure gain is the rise of the total pressure across the supply fan and is equal to the total pressure loss in the air loop. The total pressure loss in the air flow path consists of two parts: first, the pressure drop to overcome the resistance due to friction and abrupt section changes at duct fittings and air-processing components upstream of the static pressure sensor; and second, a constant pressure drop from the static pressure setpoint to the pressure in the occupied space through the terminal box. The static pressure setpoint is intended to accommodate the air flow requirement to maintain the room condition. In principle, the static pressure setpoint can be adjusted to save the energy consumed at low open levels of the VAV dampers.
As a result, the total pressure becomes a variable which must be taken into account in the prediction of fan power input. In many systems, however, the static pressure setpoint is a fixed value for a given system, and hence the pressure-difference term in the power function can be dropped,

\[
P = a_0 + a_1 x + a_2 x^2 \tag{4.13}
\]

For n pairs of measured data (x_i, P_i), the power function can be written in matrix form as

\[
P = (P_1, P_2, \ldots, P_n)^T = X\,\hat{a} \tag{4.14}
\]

with the merit function

\[
\chi^2 = \sum_{i=1}^{n} \left[ \frac{P_i - \sum_{j=0}^{2} a_j x_i^{\,j}}{\sigma_i} \right]^2 \tag{4.15}
\]

which can be solved by the singular value decomposition method introduced in Section 4.2.1.

a). Fan power as a function of air flow rate

Based on the above fitting algorithm, the correlation between fan power and air flow rate has been coded in FORTRAN and applied to the HVAC systems in several buildings. Fig. 4.3(a) illustrates the fitting results with data selected from the HVAC system of a test building; the fitted curve is y = 3.652e-4 x² - 2.462e-2 x - 2.627e-5 with a goodness of fit R² = 0.9887. The close association between fan power and air flow rate is demonstrated in Fig. 4.3(b), which shows the profiles of the fitted and measured power with the corresponding air flow rate sampled at an interval of 1 minute during one normal day's operation.

Figure 4.3. Correlation between fan power and air flow rate with data sampled at an interval of 1 minute from a test building. (a) fitted curve of fan power vs. air flow rate and the data points selected for fitting; (b) air flow rate, measured fan power, and fitted fan power vs. time on a summer day, demonstrating the association between fan power and air flow rate.

b).
Fan power as a function of the motor speed control signal

As discussed before, for a fan with a variable-speed motor, the power input can also be modeled as a function of the motor speed when constant static pressure control is used. Figure 4.4 demonstrates the correlation between fan power and the motor speed control signal based on the same day's data as used in Fig. 4.3; the fitted curve is y = 0.812x² - 51.852x + 989.75 with R² = 0.9802. While the air flow rate needs to be measured in most HVAC systems, the fan speed is usually readily available as the fan speed control signal. Therefore, the cost of submetered measurements associated with the power vs. speed detection can be reduced compared with the power vs. flow rate method.

Figure 4.4. Correlation between fan power and fan motor speed based on the same day's data used in Fig. 4.3. (a) fitted curve of fan power vs. motor speed control signal and the data points selected for fitting; (b) fan speed, measured fan power, and fitted fan power vs. time, demonstrating the association between fan power and fan motor speed.

4.2.3.2 Equipment with constant power input

Such equipment is often driven by constant-speed motors to maintain a fixed load or flow. For example, as shown in Fig. 4.5(a), the power consumption of a constant-speed chilled-water pump with constant-flow-rate loop control is always around a fixed value, i.e., the design capacity of the pump, regardless of the position of the valve stem. Therefore, the power function can be established as a constant with an offset to allow for the random errors in the measured data. It should be noted, however, that constant motor speed does not necessarily mean constant power consumption.
As mentioned before, the power consumption of a fan or a pump can be determined by any two of the three parameters: pressure rise, motor speed, and flow rate of the medium. This indicates that the power function of a constant-speed fan or pump should be determined with reference to the loop structure and hence the flow rate delivered by the equipment. Fig. 4.5 demonstrates the completely different relationships between pump power and the valve-position control signal under the different loop setups of two air handling units in a test building, one for AHU-A and the other for AHU-1.

Figure 4.5. Dependence of the power function of equipment with constant-speed motor drives on the loop setup. Data were collected from two similar air handling units with different water loop structures in a test building. Figures (a) and (c) show the profiles of pump power vs. valve position, with the corresponding water control loops shown in (b) for AHU-A and (d) for AHU-1.

In the AHU-A tests, a three-way valve was used to regulate the water flow rate through the cooling coil while maintaining a constant total pressure drop across the loop, and hence a constant total water flow rate through the secondary loop pump, which resulted in constant pump power consumption regardless of the control valve position. In the AHU-1 tests, the water flow rate through the secondary loop pump was adjusted by a two-way valve directly connected to the pump, and hence the water flow rate, the pressure drop across the valve, and the pump power all varied with the valve position. Therefore, the power functions must be established accordingly.
4.3 Error analysis - confidence intervals of the estimated models

Since errors in the sampled data and uncertainties of the correlation are unavoidable in practice, a confidence analysis must be conducted to establish the reliability of a model. A confidence interval is a region, with a given confidence level, that represents the probability for the true value of a parameter to fall within this region about the measured (or fitted) value. A confidence level represents a certain percentage of the total probability distribution of the errors and is usually designated by the user. An analytical form of the confidence interval for a model can be derived if the random errors in the measurements follow a known statistical distribution [Rice, 1988]. Since noise in electrical power generally follows a normal distribution [Shanmugan and Breipohl, 1988], the power models for fault detection can be defined by the power functions with statistical offset ranges to accommodate the errors caused by random disturbances. For equipment with variable power consumption, the upper/lower limits of the intervals vary with the load conditions, while for equipment with constant power consumption, the upper/lower limits are also constants.

4.3.1 Confidence intervals for equipment with variable power consumption

Deviations in such models are caused by the measurement errors in the fitting data and the uncertainties of the correlation. For the fitting itself, a confidence interval for the M fitted coefficients is a region of the M-dimensional space that contains a given percentage of the total probability distribution, represented by Δχ², the deviation of χ² from its minimum at â corresponding to the data set used for fitting. This involves the examination of the M-dimensional space with reference to a certain confidence level. Increasing the confidence level enlarges Δχ² and hence the confidence interval.
In this research, power consumption as a function of selected measurements is intended to serve as the sole criterion for least-intrusive fault detection. This means the power input is used as the index for the confidence limit to check the perturbation of the power y_0 for any given measurement x_0, e.g., supply fan power for a given air flow rate. Therefore, for the dependent variable y, the confidence interval for prediction must contain the uncertainty due to the model fit as well as the measurement error [Little, 1991] [Little and Norford, 1993]. For a given measurement x, the dependent variable y and its estimate ŷ by the fitted function can be expressed as

\[
y = \sum_{j=1}^{M} a_j f_j(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)
\]

\[
E(y) = \sum_{j=1}^{M} a_j f_j(x), \qquad \hat{y} = \sum_{j=1}^{M} \hat{a}_j f_j(x)
\]

or, in matrix form for the multiple measurements (x_i, y_i) (i = 1, 2, ..., n) used in the fitting, y = A a + ε, where

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\qquad
A = \begin{pmatrix}
f_1(x_1) & f_2(x_1) & \cdots & f_M(x_1) \\
f_1(x_2) & f_2(x_2) & \cdots & f_M(x_2) \\
\vdots & \vdots & & \vdots \\
f_1(x_n) & f_2(x_n) & \cdots & f_M(x_n)
\end{pmatrix},
\qquad
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{pmatrix}
\]

For a large data sample of size n, the covariance of the correlation is

\[
\mathrm{Cov}(\hat{a}) = \sigma^2 (A^T A)^{-1}
\]

With the fitted function obtained in Section 4.2.1, the distribution of the estimated variable ŷ_0 for a new observation x_0 can be derived as

\[
\hat{y}_0 \sim N\!\left(E(y_0),\; \sigma^2 f_0^T (A^T A)^{-1} f_0\right) \tag{4.16}
\]

where f_0 is the function vector for x_0, f_0 = (f_1(x_0), f_2(x_0), ..., f_M(x_0))^T. In addition, the distribution of the sum of squares Q of the residuals of the sample can be represented by the χ² function with n − M degrees of freedom,

\[
\frac{Q}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[ y_i - \sum_{j=1}^{M} \hat{a}_j f_j(x_i) \right]^2 \sim \chi^2(n - M) \tag{4.17}
\]

and the standard deviation s of the sampled data used for fitting is s = \sqrt{Q/(n-M)}. Since Q and ŷ_0 are independent of each other,

\[
\frac{(\hat{y}_0 - E(y_0)) \Big/ \left(\sigma\sqrt{f_0^T (A^T A)^{-1} f_0}\right)}{\sqrt{Q \big/ (\sigma^2 (n-M))}}
= \frac{\hat{y}_0 - E(y_0)}{s\sqrt{f_0^T (A^T A)^{-1} f_0}} \sim t(n - M) \tag{4.18}
\]

which, as shown, follows the t-distribution with (n − M) degrees of freedom governed by the density function

\[
p(t, N) = \frac{\Gamma\!\left(\frac{N+1}{2}\right)}{\sqrt{N\pi}\;\Gamma\!\left(\frac{N}{2}\right)}\left(1 + \frac{t^2}{N}\right)^{-\frac{N+1}{2}}, \qquad -\infty < t < +\infty \tag{4.19}
\]

where \Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\,dx.
Therefore, the confidence interval for the expectation of y_0 with the double-sided t-distribution at the confidence level of 1 − α is

\[
\hat{y}_0 \pm t_{\alpha/2}(n - M)\; s\; \sqrt{f_0^T (A^T A)^{-1} f_0} \tag{4.20}
\]

which may also be expressed in terms of the probability for the expectation to fall within the given confidence interval,

\[
P\left\{ \frac{\left|\hat{y}_0 - E(y_0)\right|}{s\sqrt{f_0^T (A^T A)^{-1} f_0}} < t_{\alpha/2}(n - M) \right\} = 1 - \alpha \tag{4.21}
\]

Since y_0, ŷ_0, and Q are all independent of each other, and

\[
E(y_0 - \hat{y}_0) = E(y_0) - E(\hat{y}_0) = 0
\]
\[
D(y_0 - \hat{y}_0) = D(y_0) + D(\hat{y}_0) = \sigma^2\left(1 + f_0^T (A^T A)^{-1} f_0\right)
\]

the distribution related to the predicted value y_0 can be derived as

\[
\frac{y_0 - \hat{y}_0}{\sigma\sqrt{1 + f_0^T (A^T A)^{-1} f_0}} \sim N(0, 1) \tag{4.22}
\]

With the estimated standard deviation s, the t-distribution of the prediction is obtained as

\[
\frac{y_0 - \hat{y}_0}{s\sqrt{1 + f_0^T (A^T A)^{-1} f_0}} \sim t(n - M) \tag{4.23}
\]

Therefore, the confidence interval for the predicted y_0 with the double-sided t-distribution at the confidence level of 1 − α is

\[
\hat{y}_0 \pm t_{\alpha/2}(n - M)\; s\; \sqrt{1 + f_0^T (A^T A)^{-1} f_0} \tag{4.24}
\]

which can also be written as the probability for a given value y_0 to fall in the confidence interval,

\[
P\left\{ \frac{\left|y_0 - \hat{y}_0\right|}{s\sqrt{1 + f_0^T (A^T A)^{-1} f_0}} < t_{\alpha/2}(n - M) \right\} = 1 - \alpha \tag{4.25}
\]

It should be noted that the confidence interval is affected not only by the confidence level (1 − α), the degrees of freedom of the t-distribution (n − M), the sample standard deviation s, and the covariance of the sample s²(A^T A)^{-1}, but also by the function vector f_0, which indicates that the predicted range of the power consumption varies with the independent variables. Eq. (4.24) shows that the confidence interval widens as the confidence level increases, which means that for higher confidence about the fitting, the data should be expected within a wider range about the predicted values. The interval also increases with the estimated error s of the sample used for fitting.
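A minimal sketch of the prediction interval of Eq. (4.24), using NumPy and SciPy on synthetic data (the quadratic model, coefficients, and noise level are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def prediction_interval(A, y, f0, alpha=0.10):
    """Double-sided prediction interval of Eq. (4.24) at confidence 1 - alpha."""
    n, M = A.shape
    a_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    s = np.sqrt(np.sum((y - A @ a_hat) ** 2) / (n - M))  # s = sqrt(Q/(n-M))
    g = f0 @ np.linalg.inv(A.T @ A) @ f0                 # f0^T (A^T A)^-1 f0
    half = stats.t.ppf(1.0 - alpha / 2.0, n - M) * s * np.sqrt(1.0 + g)
    y0_hat = float(f0 @ a_hat)
    return y0_hat - half, y0_hat + half

# Quadratic 'fan power' model with Gaussian noise (all values illustrative)
x = rng.uniform(1.0, 3.0, 200)
A = np.column_stack([np.ones_like(x), x, x**2])
y = A @ np.array([100.0, 20.0, 30.0]) + rng.normal(0.0, 5.0, x.size)

x0 = 2.0
f0 = np.array([1.0, x0, x0**2])
lo, hi = prediction_interval(A, y, f0)   # 90% limits for the power at x0
```

The term f_0^T (A^T A)^{-1} f_0 is what makes the interval widen toward the edges of the fitted range, consistent with the dependence on f_0 noted above.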
If the selected data scatter loosely around the fitted average, i.e., the sample has a large standard deviation, then for a given confidence level (1 − α) the confidence interval must be wider to accommodate the error in the fitted data themselves plus the deviation Δχ² from the minimized χ² for the new measurement. In addition, since the value of the t-function decreases with the degrees of freedom (n − M), the predicted range becomes narrower as more data pairs are involved in the fitting of a polynomial function. For a least squares correlation, a small interval with a high confidence level usually indicates a good fit to the real process. For fault detection, such a correlation provides sensitive yet reliable identification of abnormalities.

4.3.2 Confidence intervals for equipment with constant power consumption

If the power consumption of the equipment is a constant, the confidence interval for fault detection can be determined as a constant offset from the design value based on the probability function. At a given confidence level of 1 − α, the upper/lower limits can be expressed as

\[
y = y_0 \pm \sigma\, z_{\alpha/2} \tag{4.26}
\]

where y_0 is the design value of the power input of the equipment and z_{α/2} is the double-sided quantile of the standard normal distribution. The standard deviation σ can be approximated by the sample standard deviation s when the sample is large enough (more than 25-30) [Rice, 1988], a condition easily met with the training data usually collected during one normal day's operation.

4.4 Fault detection and diagnosis of HVAC systems with submetered power input

In addition to the on/off switch detection and data trend monitoring of a system's total power data introduced in the previous chapters, modeling of the power consumption of electrically-driven equipment has proved in this study to be another efficient method for fault detection and diagnosis in HVAC systems.
With the submetered power data of specific equipment, fault detection and diagnosis can be conducted at the component level, and hence the FDD output becomes more accurate than that based on the total power data as presented in Chapter 3. In addition, unlike the definition used in the abrupt change detection of Chapter 2, steady state for the submetered power modeling means the whole process of the equipment's operation except the short on/off transient periods. This indicates that detection and diagnosis can be implemented whenever the equipment is in operation and are not limited to the short time periods when the equipment is turned on or off. Therefore, detection and diagnosis with submetered power data make it possible to find a fault at an early stage, and hence to prevent further energy loss or a more serious outcome, before the fault can be seen from the total power data by the change detector. Moreover, monitoring based on submetered data enables the detection of faults in equipment that runs around the clock, e.g., an exhaust fan serving a space with toxic emissions. In addition to detecting a fault, the fault origin can be diagnosed with higher resolution by the submetered power models than by the system's total power data. This is because, as a function of certain reference parameters, the submetered model may present different patterns of abnormality for different causes, which is the basis of fault diagnosis. With the submetered measurements of power and the related parameters, detection and diagnosis of faults in HVAC systems can be divided into two general categories according to the types of the power functions discussed in Sec. 4.2, i.e., power models for equipment with variable and with constant power inputs.
4.4.1 FDD for equipment with continuously varying power input

From the previous sections, the power consumption of such components can be modeled as polynomial functions of basic measurements or control signals. However, as shown in Fig. 4.1(b), even under normal operation the measured power data usually scatter around the predicted values within a certain range. This indicates that for reliable detection output, an offset range must be appropriately defined to reduce the rate of false alarms caused by deviations from the predicted value due to random errors. Such a range can be established by the confidence interval described in Sec. 4.3, which takes into account the error of the model fit itself as well as that of the sampled data used for fitting. The FDD models with confidence intervals corresponding to the power functions plotted in Fig. 4.3(a) and Fig. 4.4(a) are illustrated in Fig. 4.6(a) and (b), respectively. Both are based on an estimated standard deviation of 5% of the average power value and a confidence level of 90%, which is commonly used in fault detection in non-critical applications [Huber, 1981].

Figure 4.6. Illustration of the power models of a VSD fan with a 90%-confidence interval for fault detection and diagnosis of the air handling unit in a test building. (a) the model of fan power vs. air flow rate; (b) the model of fan power vs. motor speed control signal.

In fault detection, if the submetered power data exceed the confidence interval at a given flow rate, then there is a probability of 1 − α that the operation is under a faulty condition. In this thesis, faults related to the power input of a VSD fan in a test building have been studied based on the above algorithms, and several typical cases are discussed as follows.
4.4.2 FDD for equipment with constant power consumption

Abnormal behavior of such equipment can usually be seen from the magnitude of the power input, and the FDD can be implemented by checking the submetered power value against a threshold based on an offset from the design value. The design value is generally the design power input of the equipment, and the offset can be obtained as the confidence interval introduced in Sec. 4.3.2. For example, based on a confidence level of 90%, the upper and lower limits were found to be 423 and 387 watts, respectively, for the power input of a chilled water pump with a design value of 405 watts in a test building, as shown in Fig. 4.7.

Figure 4.7. Illustration of the power thresholds for a chilled water pump with constant power consumption at different valve open levels, for fault detection and diagnosis of the HVAC system in a test building.

4.4.3 FDD from the on/off cycling of equipment with constant power consumption

Sometimes, abnormalities associated with equipment consuming constant electrical power can be found from its on/off cycling frequency. First, to prevent energy inefficiency and degradation of some equipment due to frequent on/off switching, limits are usually set by design in the form of a maximum number of daily cycles and/or minimum intervals between on and off; in HVAC systems, such a control strategy is often seen in chiller operation. Second, equipment should cycle on and off in accordance with the load conditions. If a device turns on more frequently under low load than it does under significantly higher load in normal operation, then the system is not acting properly. For example, a chiller should normally run less frequently during the early morning hours than in the late afternoon.
With the submetered power data, such on/off cycles are easily counted, as shown in Fig. 4.8.

Figure 4.8. Constant on/off power cycling period of a reciprocating chiller under low load conditions during the early morning hours (0:00-7:40), for fault detection and diagnosis of the HVAC system in a test building.

4.4.4 Oscillation detection by the standard deviation of the submetered power data

In principle, detection of oscillation from the submetered power data is similar to that with the total power data introduced in Chapter 3, and hence the same method can be used here. In addition, since the equipment is directly and solely monitored by the submeter, detection of oscillation is easier to implement due to the reduced noise, and diagnosis of the fault becomes more straightforward. Fig. 4.9 demonstrates the effects of oscillation in the submetered power series.

Figure 4.9. Submetered power data of a supply fan during two different days, demonstrating the significant increase of the data spread caused by power oscillation. Plots (a) and (b) are the 24-hour fan power with and without oscillation, respectively.

4.5 Application of the power models in fault detection with submetered power input

The power models for fault detection and diagnosis described in the previous sections have been applied to HVAC systems in real buildings. The performance of the application is discussed in this section based on the detection and diagnosis output of several typical faults that were introduced into the air handling units of the same system as the one used in Section 3.6.
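The cycle counting of Sec. 4.4.3 and the standard-deviation check of Sec. 4.4.4 can be sketched as follows; the power series, the on-threshold, the window length, and the oscillation limit are all invented for illustration:

```python
import numpy as np

def count_on_cycles(power_w, on_threshold_w):
    """Count off->on transitions in a submetered power series (one per cycle)."""
    on = power_w > on_threshold_w
    return int(np.sum(~on[:-1] & on[1:]))

def oscillating(power_w, window, std_limit_w):
    """Flag oscillation when the rolling standard deviation exceeds a limit."""
    stds = [np.std(power_w[i:i + window])
            for i in range(len(power_w) - window + 1)]
    return bool(max(stds) > std_limit_w)

# Synthetic 1-minute chiller power with three on/off cycles (illustrative)
p = np.zeros(360)
for start in (30, 150, 270):
    p[start:start + 60] = 5000.0

n_cycles = count_on_cycles(p, on_threshold_w=1000.0)
steady = oscillating(p[30:90], window=15, std_limit_w=50.0)    # False
wobbly = oscillating(p[30:90] + 300.0 * np.sin(np.arange(60)),
                     window=15, std_limit_w=50.0)              # True
```

In practice the cycle count during a low-load period would be compared against a threshold trained under normal operation, as described in Sec. 4.5.1.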
4.5.1 Model calibration and parameter identification

To set up the power functions for equipment with varying power input, such as a supply fan driven by a variable-speed motor, the data should ideally be sampled over a wide range of load, from the minimum to the maximum possible load conditions. But this may require a fairly long time and is not necessary in practice, because the model is based on physical laws and hence can reasonably be extrapolated to higher or lower load conditions. This may result in small errors in some regions due to the lack of test data but will not significantly affect the detection quality. Tests with several air handling units showed that, without fan power data at low air flow rates (around 500-1000 CFM) during the calibration period, the power data later collected in this region were slightly higher than the fitted curve, but still within the confidence interval and much closer to the fitted curve than to the upper limit of the interval. Moreover, during the normal tests, the fan power input, driven by the thermal load varying with the outside weather, changed over a desirable range for the fitting. In the test building, the time required for submetered data collection for the fan power function was less than the 10 hours covering the work day. This length of time also suffices for the estimation of the fan power noise. In the test building, the parameters of the power function were obtained by the least squares fitting introduced in Section 4.2. To reduce the offset caused by random fluctuations, data for the fitting were selected only when the static pressure signal deviated from the setpoint by less than 5% of the setpoint. For each supply fan, about 30 data pairs representing different load conditions were collected, starting 60 minutes after startup and ending 30 minutes before shutdown each day. For equipment with constant power consumption, the models can be obtained more easily.
For example, the power consumption of the chilled water pump shown in Fig. 4.5(c) is close to a constant when the cooling capacity of the coil is adjusted by a bypass valve and hence the water flow rate through the pump is unchanged. With such an arrangement, the power function is simplified to a threshold for the detection, which can be set as the average of the measured values. To minimize the rate of false alarms, an offset based on a given probability should be established to allow for random deviations. The random errors are independent of the load condition and should be within an expected range at a given confidence level for the normal distribution of the electrical power noise. In the test building, a 90%-confidence interval of ±18 W was set up about the design power input of 405 W for a chilled water pump, as shown in Fig. 4.7. For the evaluation of the on/off cycling frequency, training of the limits may require several days of observation under low load conditions, normally during off-work time periods. In the test system, the threshold for the chiller's on/off cycling interval was found during early morning hours when the cooling coil valve was closed under normal operation. For the estimation of the power noise, thresholds for the standard deviation and time duration can be set up with the 10-hour data sampled at an interval of 1 minute. One important index for evaluating the operating status is the outside air temperature. For example, in the detection of a leaky recirculation damper based on the condition introduced in Chapter 3, the outdoor air temperature needs to be recorded, and the thresholds for the dimensionless outdoor air temperature and the chiller cycling interval should be determined when the outside air damper is 100% open in the economizer mode. It should be noted that all the data for the detection with submeters can be obtained simultaneously, because no change in the system needs to be made for model training.
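As a sketch, the 90%-confidence limits of Eq. (4.26) for the 405 W pump can be reproduced with SciPy, assuming a sample standard deviation of about 11 W (a value inferred here from the reported ±18 W limits, not stated in the text):

```python
from scipy import stats

def constant_power_limits(design_w, sigma_w, conf=0.90):
    """Upper/lower detection limits of Eq. (4.26) about the design power input."""
    z = stats.norm.ppf(0.5 + conf / 2.0)   # double-sided normal quantile
    return design_w - z * sigma_w, design_w + z * sigma_w

lo, hi = constant_power_limits(405.0, 11.0)   # chilled water pump example
```

With z_{0.05} ≈ 1.645, the limits come out close to the 387 W and 423 W reported for the test building.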
4.5.2 Detection and diagnosis of typical HVAC faults with submetered power models

Faults in HVAC systems tend to cause abnormal power input to at least one electrically driven component. By comparing the measured power data against the values obtained from the power function and the associated confidence intervals, abnormal equipment operation can be found and analyzed accordingly.

a). Detection and diagnosis of faulty operation by the power models of equipment with variable power consumption

According to the load condition, the power consumption of equipment is adjusted either by a motor with VSD control or by changing the flow rate of the medium delivered with a CSD motor. In modern HVAC systems, VSD motors are becoming more widely used because of their low operating cost, especially in commercial buildings [Englander and Norford, 1990]. In the test building introduced in Chapter 3, all the supply and return fans are driven by VSD motors. As discussed in Section 4.2.3, the power consumption of a VSD fan is determined by two of three factors: total pressure loss, air flow rate, and motor speed. While the air flow rate depends on the thermal load condition, the total pressure gain across the fan is composed of two parts: the loss due to the resistance and the static pressure setpoint, which is the index for fan speed control. Hence faults that cause deviations in either part will lead to abnormal fan power input and can be recognized from the power models described in Section 4.4. Fig. 4.10 compares the detection output of normal operation against two typical faults. One is a leak in the static pressure sensor's pneumatic line, which results in an offset in the total pressure; the other is a stuck-closed recirculation air damper, which increases the resistance in the air loop. It can be seen that the power data in both faulty operations lie well above the upper limit of the confidence interval.
In the case of the pressure sensor offset, the static pressure signal fed back to the fan controller by the leaky sensor was always lower than the actual value. To maintain the fixed static pressure setpoint, the supply fan had to work harder and consumed more energy, which led to the higher power input at given air flow rates as shown in Fig. 4.10(b). A pressure sensor offset can be negative (reported pressure lower than the actual value) or positive (reported pressure higher than the actual value) and therefore leads to increased or decreased fan power input. For a pneumatic pressure sensor as used in the test building, an offset is often caused by a leak in the transmission line, which produces an increase in fan power. Under normal operation in the given air handling unit, the recirculation air damper was supposed to be 70% open in accordance with the 30% open outdoor air damper for the minimum fresh air supply. When the recirculation air damper was stuck closed, only a small amount of the return air was drawn through the leaks in the recirculation air damper, while the outdoor air damper was still fixed at the 30% open position by the control system based on the outdoor air temperature. As a result, the total flow rate of the supply air was greatly reduced and the supply fan power increased significantly to overcome the extra resistance introduced by the closed damper, as shown in Fig. 4.10(c). These two faults can be best distinguished from each other by checking the air flow rate after the working hours (between 17:01-22:00 in this case). Since the building was not occupied during this time period, regardless of the outdoor air temperature, 100% return air was recirculated in the building through the fully open recirculation air damper in order to save energy for cooling and the outdoor air damper was shut off. 
If the recirculation air damper was closed by fault, the air flow rate, which then could only pass through the dampers by leakage, would be extremely low (less than 500 CFM in this case) and would never occur under normal conditions. The rule for detection of the stuck-closed recirculation air damper can be expressed as:

IF (power > upper limit of the confidence interval for the fitted power value AND flow rate < threshold after working hours with fan on) THEN alarm.

The threshold for the air flow rate after working hours can be observed from one day's normal operation.

Figure 4.10. Detection output for operation under normal and faulty conditions by the power consumption of a supply fan as a function of air flow rate in an air handling unit. (a) under normal operation; (b) with an offset in the static pressure sensor; (c) with a stuck-closed recirculation air damper. [Plots of fan power (W) vs. air flow rate (CFM), with fitted curve and lower and upper confidence limits.]

In addition to the fan power vs. air flow rate correlation for fault detection in the air flow path and the static pressure control, fan power has also been proved useful in this thesis for finding abnormal status of the fan itself with reference to the motor speed. As shown in Fig. 4.6(b), fan power can be modeled as a polynomial function of the motor speed with a confidence interval to allow for random disturbances. Power data that fall beyond the interval will be regarded as a fault.
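The out-of-band check and the damper rule above can be sketched as below. The function names are illustrative, the model coefficients would come from the fitting of Section 4.2, and the 500 CFM leakage threshold is the test-building value from the text.

```python
import numpy as np

def power_out_of_band(flow_cfm, power_w, coeffs, halfwidth_w):
    """Generic check: is the measured power outside the confidence
    band around the fitted power function?"""
    return abs(power_w - np.polyval(coeffs, flow_cfm)) > halfwidth_w

def stuck_closed_damper_alarm(flow_cfm, power_w, coeffs, halfwidth_w,
                              after_hours, flow_threshold_cfm=500.0):
    """Rule from the text: power above the upper confidence limit AND
    after-hours air flow below the leakage-level threshold."""
    power_high = power_w > np.polyval(coeffs, flow_cfm) + halfwidth_w
    return power_high and after_hours and flow_cfm < flow_threshold_cfm
```

With an assumed linear model P = flow + 100 (coeffs [1, 100]) and a 50 W band, a reading of 800 W at 300 CFM after hours would trigger the alarm, while 2110 W at 2000 CFM during the day would not.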
For example, the slipping fan belt, a typical degradation problem in HVAC applications, can be seen clearly at high loads through the FDD model. The initial approach combines detection and diagnosis in a single step by searching for abnormally low fan power at a very high motor speed control signal. This is demonstrated by the data cluster in Fig. 4.11, with the most significant drop in power consumption occurring when the fan was running at full speed. The dependence of the detectability of this fault on fan speed has been further confirmed by detection in other tests with the same air handling unit. In those tests, belt slippage was introduced by adjusting the belt tension to reduce the maximum fan speed at 100% control signal by 15% for the first stage and by 20% for the second. With the tension reduced by more than 15%, the fan had to work at high speed to meet the load requirement during the summer. In practice, however, as a degradation fault, belt slippage tends to develop gradually day by day. With the power vs. speed function, therefore, detection of this fault can be conducted at any speed and is not limited to full speed. The rule is:

IF (power < lower limit of the confidence interval for the fitted power value AND time duration > time threshold) THEN alarm.

The minimum time period for determination of the fault should be longer than one sampling interval. With a sampling interval of 1 minute by a submeter in the tests, the threshold for the time duration of this fault should not be shorter than 2 minutes. This is intended to eliminate false alarms that may occur when one or two points accidentally drop to this level without a slipping fan belt. In the test building, 3 minutes proved appropriate for reliable FDD output.
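The rule with its time-duration condition can be sketched as a persistence counter over 1-minute samples. This is an illustrative sketch; for simplicity the lower limit is passed as a single value, whereas in the thesis it follows the fitted power vs. speed curve.

```python
def belt_slip_alarm(power_series_w, lower_limit_w, min_duration=3):
    """Alarm when fan power stays below the lower confidence limit for
    at least min_duration consecutive 1-minute samples (3 minutes
    proved appropriate in the test building)."""
    run = 0
    for p in power_series_w:
        run = run + 1 if p < lower_limit_w else 0
        if run >= min_duration:
            return True
    return False
```

Isolated low readings do not persist long enough to alarm, which suppresses the one- or two-point false alarms mentioned above.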
Figure 4.11. Utilization of the power vs. motor speed model of a VSD fan with a 90%-confidence interval for fault detection and diagnosis of the air handling unit in a test building, showing the reduction in the fan power input due to a slipping fan belt. (a) under normal operation; (b) with a slipping belt. [Plots of fan power (W) vs. fan speed control signal (%), with fitted curve and lower confidence limit.]

As discussed in Section 4.2.3.2, the power consumption of a component driven by a CSD motor varies with the flow rate, which is adjusted by a flow control device. In the test building, a pump was used with the loop setup shown in Fig. 4.5(d) in an air handling unit, in which the water flow rate delivered by the pump changes with the valve position. The pump power can then be correlated as a polynomial function of the valve control signal, as demonstrated by Fig. 4.12 with the 90%-confidence interval for fault detection and diagnosis.

Figure 4.12. Power consumption of a chilled-water pump with a CSD motor as a polynomial function of the valve control signal, showing that the pump power input varies with the water flow rate even though the motor speed remains constant. [Plot of pump power (W) vs. valve position (%), with lower and upper limits.]

b). Detection and diagnosis of faulty operation by the magnitude of power input of equipment

With the submetered data, the magnitude of power consumption can be used to detect faults in equipment with constant power input. For such components, abnormal operation is recognized with appropriate thresholds, as discussed in Sec. 4.4.2. In the test building, for example, the decreased-capacity fault of a cooling coil due to reduction of the water flow rate leads to reduced power consumption of the pump. For a pump driven by a CSD motor, such a fault can be easily detected by the model illustrated in Fig. 4.7. Fig.
4.13 shows the detection output with one day's data from the test building. It can be seen from the figure that the deviation of the measured pump power from the designated range increases with the valve opening, which responds to the increased requirement for water flow rate under higher thermal load. Therefore, the rule for detecting and diagnosing this fault from the submetered pump power is:

IF (valve open level > valve threshold AND |pump power − design value| > threshold of power offset) THEN alarm.

For practical application, the design value for the power input of the equipment can be established as a measured value or, more accurately, as the average of several measured values. The threshold for the pump power offset is determined by Eq. (4.26). The valve threshold is introduced in this rule because the fault is difficult to find when the valve is not wide open; a partly closed valve increases the resistance and hence the power input. Electrical spikes have also been found when the valve is nearly closed. It has also been observed that to detect the capacity fault, the valve should be at least 30% open. The valve threshold may also be trained if data for training are available. For the test building, the design capacity, the valve threshold, and the upper and lower limits for the pump power were 405 W, 40%, and 423 and 387 W respectively.

Figure 4.13. Detection of the coil-capacity fault from the reduction in the power input of a chilled-water pump driven by a CSD motor. The fault became detectable when the valve was wide open. [Plot of pump power (W) vs. valve position (%), with lower and upper limits.]

In principle, such a fault would also result in lower chiller power input.
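The coil-capacity rule above can be sketched as follows, with the test-building values (405 W design power, ±18 W offset, 40% valve threshold) as illustrative defaults.

```python
def coil_capacity_alarm(valve_percent, pump_power_w,
                        design_power_w=405.0, power_offset_w=18.0,
                        valve_threshold=40.0):
    """Rule from the text: alarm when the valve is open beyond its
    threshold AND the pump power deviates from the design value by
    more than the trained offset."""
    return (valve_percent > valve_threshold
            and abs(pump_power_w - design_power_w) > power_offset_w)
```

With the valve 80% open and the pump drawing 360 W, the 45 W deviation exceeds the 18 W offset and the rule alarms; at 20% valve opening the same reading is ignored.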
In practice, however, it is difficult to find this fault with the chiller power input, because the effect of the reduced flow becomes visible only under high load conditions, when the chiller cycling presents a more irregular and hence more complex pattern to recognize. Sometimes, faults in equipment with a VSD motor may also be detected with a constant power threshold. Such detection is generally based on an understanding of the equipment from the design information and is especially useful when the submetered power data are the only available reference for fault detection. For example, under a given load condition, the fault of a slipping fan belt leads to significantly lower power input of the fan compared with normal operation. Such a fault can be easily recognized from the submetered power data with a conservative constant threshold obtained from one typical day's operation. In the test building, a value of 1 kW was used as the lower limit for the power input of a supply fan with a design capacity of 5 kW. During the peak-load period between 2 and 4 pm in summer, the power input of the fan should never be lower than 1 kW unless a fault such as a slipping fan belt occurs, as shown in Fig. 4.14. The relatively flat and low data trend during the faulty time was due to the continuous 100% fan speed caused by the slipping belt as the fan strove to meet the load requirement, which could be met at lower fan speeds during normal operation.

Figure 4.14. Power consumption of a supply fan during 2:00-4:00 pm of two days, under normal condition and with a slipping fan belt respectively, showing the detectability of faults with a constant power threshold for equipment with a VSD motor drive. [Plot of fan power (W) vs. time (min.), with threshold line.]

c).
Detection and diagnosis of faulty operation from on/off cycling of equipment with constant power consumption

As discussed in Chapter 3, faulty operation of some components can be detected by analyzing the on/off cycles of the related equipment. Such detection is generally applicable under stable low-load conditions, not only because the on/off interval becomes longer and more visible to the detector but also because such load conditions provide the periods of constant on/off cycling frequency that the detection requires. In the test building, for example, the cycling period of a reciprocating chiller during the early-morning hours, when the cooling coil valve was closed, has been analyzed to detect faults related to the thermal load of the chiller, e.g., a leaky cooling coil valve and a leaky recirculation air damper. Based on the same criteria used in Section 3.9.2(b), identification of such a fault becomes easier by counting the on/off cycles with the submeter than by detecting changes in the total power data, as shown in Fig. 4.15. The rule for detection of this fault is:

IF (averaged chiller off-on interval < threshold for chiller off-on interval AND valve position = 0) THEN alarm.

The threshold for the chiller off-on interval can be observed from the 7-hour normal operation of the chiller in the early morning. In the test building, this value was found to be 35 minutes.

Figure 4.15. Submetered power data of the chiller under low load condition during 0:00-7:00 am, showing a decrease in the off-on interval of the chiller in the presence of a leaky cooling coil valve. (a) under normal operation: chiller turned on at an interval of 38 minutes; (b) with a leaky cooling coil valve: chiller turned on at an interval of 30 minutes. [Plots of chiller power (W) vs. time (min.).]
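The interval extraction and the rule above can be sketched as below. The on-detection by threshold crossing and the 1-minute sampling are assumptions consistent with the text; the 35-minute interval threshold is the trained test-building value.

```python
def on_off_intervals(power_w, on_threshold_w, dt_min=1.0):
    """Extract off-to-on intervals (minutes) from a submetered chiller
    power series sampled every dt_min minutes, by finding the samples
    where the power rises through on_threshold_w."""
    on_times = [i * dt_min for i in range(1, len(power_w))
                if power_w[i] >= on_threshold_w > power_w[i - 1]]
    return [b - a for a, b in zip(on_times, on_times[1:])]

def leaky_valve_alarm(power_w, on_threshold_w,
                      interval_threshold_min=35.0, valve_percent=0.0):
    """Rule from the text: alarm when the averaged off-on interval
    drops below the trained threshold while the valve is closed."""
    ivals = on_off_intervals(power_w, on_threshold_w)
    if not ivals or valve_percent != 0:
        return False
    return sum(ivals) / len(ivals) < interval_threshold_min
```

A synthetic series that turns on every 30 minutes alarms, while one cycling every 38 minutes (the normal pattern) does not.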
Another fault related to chiller cycling is the leaky recirculation air damper. As explained before, abnormal chiller cycling can be detected under low cooling loads. As with the leaky cooling coil valve, the deviation in chiller cycling due to the leaky recirculation air damper is most significant when the damper is commanded closed, i.e., when the outdoor air temperature falls between the supply air temperature and the return air temperature (or economizer) setpoints. Fig. 4.16 shows the common control sequence of dampers and valves for HVAC systems, with one day's data from the test building as an example.

Figure 4.16. Control of dampers and valves based on outdoor air temperature and the related setpoints. (a) typical sequencing for air handling; (b) dependence of the recirculation air damper position on outdoor air temperature in the test building during the working hours of a day. The setpoints of the supply air temperature and the economizer are 55°F and 65°F respectively. [Plots of damper position (%) and outdoor air temperature vs. time (min.).]

Furthermore, it has been found that the effect on chiller cycles is visible to the detector only when the outside temperature is slightly above the point where the supply-air temperature controller begins to open the chilled-water valve in order to maintain the supply-air temperature at its setpoint. Limiting the examination of the cycling frequency to this narrow region, as defined in Section 3.9.2(b), reduces the risk of false alarms caused by normal changes in cycling rate at higher outdoor air temperatures, when the cooling load increases. With the same rule defined in Sec. 3.9.2(b), the fault of the leaky recirculation air damper has been detected, as shown in Fig. 4.17.
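The narrow detection region can be expressed as a simple gate on the outdoor air temperature. The 55°F and 65°F setpoints are from the test building; the 5°F margin defining "slightly above" the supply-air setpoint is an assumed value, not given in the text.

```python
def in_detection_region(t_oa_f, t_spt_sa_f=55.0, t_econ_f=65.0,
                        margin_f=5.0):
    """Limit cycling-frequency checks to outdoor temperatures slightly
    above the supply-air setpoint, within the economizer region where
    the recirculation damper is commanded closed (100% outdoor air)."""
    return t_spt_sa_f < t_oa_f <= min(t_spt_sa_f + margin_f, t_econ_f)
```

Cycling counts would then be examined only for periods where this gate holds, which is what suppresses false alarms from load-driven cycling changes at higher outdoor temperatures.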
Figure 4.17. Submetered power data of the chiller under low load conditions when the outdoor air temperature lies within the defined region, showing the increase in on/off cycles of the chiller in the presence of a leaky recirculation damper. (a) under normal operation: chiller turned on at intervals between 30-31 minutes; (b) with a leaky recirculation air damper: chiller turned on at intervals between 14-16 minutes. [Plots of chiller power (W) vs. time (min.).]

Note that the leaky valve fault was detected only when the valve control signal was zero, so the two faults that make use of the chiller cycling can be separated.

d). Detection and diagnosis of unstable control from the submetered power data

As an abrupt fault, unstable control leaves a clear oscillatory power signature, which can be detected with submetered power data by quantifying the magnitude of the oscillations. To distinguish sustained power oscillations from the impact of start-up or shut-down events on the standard deviation of the signal, computed over a sliding window, the GLR output should be monitored simultaneously to avoid mixed alarms. If the sufficient statistic is below its threshold and the power oscillations are high, an alarm for unstable control will be issued. The fault detection rule with the submetered power is similar to that with the total power data presented in Chapter 3:

IF (standard deviation > power × fraction threshold AND sufficient statistic < lower limit of the GLR threshold) THEN alarm.

The two parameters for this rule, the fraction threshold for oscillation and the lower limit of the GLR threshold, can be trained from one day's data under normal operation, as discussed in Chapter 3. For the fraction power threshold, the deviation of the power value due to noise should be within a reasonable region.
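The rule can be sketched as follows, assuming the GLR sufficient statistic is computed elsewhere (Chapter 3) and passed in; the 10% fraction threshold is the test-building value.

```python
import statistics

def unstable_control_alarm(window_w, glr_statistic, glr_lower_limit,
                           fraction_threshold=0.10):
    """Alarm when the sliding-window standard deviation exceeds a
    fraction of the mean window power while the GLR sufficient
    statistic stays below its threshold, i.e. no start-up or
    shut-down step is in progress."""
    mean_w = statistics.fmean(window_w)
    oscillation = statistics.pstdev(window_w)
    return (oscillation > fraction_threshold * mean_w
            and glr_statistic < glr_lower_limit)
```

An oscillating window alarms only while the GLR statistic is quiet; a genuine step change raises the statistic and suppresses the oscillation alarm.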
In the test building, the deviation should never exceed 10% under the fault-free condition. Figure 4.18 shows an example of the increased standard deviation due to unstable control for the power data presented in Fig. 4.9(a).

Figure 4.18. Detection output of power oscillation caused by unstable control of a supply fan in an air handling unit of the test building during one day's operation, as shown in Fig. 4.9(a). [Plot of the standard deviation (W) vs. time (min.).]

4.6 Discussion and conclusions

This chapter demonstrated that by examining the submetered power data of the major components in an HVAC system, faults can be detected and analyzed at an early stage. Such faults may not be seen by detecting changes in the total power of the building, especially degradation faults like the slight offset in the static pressure sensor. If other parameters are used as the indices for fault detection, defining an effective set of parameters and formulating a corresponding set of rules to cover a comprehensive range of fault conditions in a whole air-conditioning system becomes a protracted and difficult task. With the submetered power data and a few basic reference parameters, faults in the major components of an HVAC system can be not only detected but also diagnosed, as shown in this chapter. Also, the development of power models based on submeters does not require any special setup to test and train the parameters. This avoids interruption to the system's daily operation and hence makes it feasible to implement the detection in an existing system. Furthermore, all the submetered data related to different equipment can be obtained simultaneously, and the small amount of data needed for the models simplifies the model setup process. As shown in this chapter, for a typical HVAC system, a total of 24 hours of data obtained at a sampling rate of one minute suffices for training the models.
The method of correlating the submetered power with appropriate parameters can be used for any system. It should be noted that the form of the models may vary with the type of motor drive. For example, chiller power can be used to check the on/off frequency if the chiller is driven by a CSD motor, or it can be modeled as a polynomial function of the appropriate temperatures with confidence intervals when a VSD motor is involved. In this chapter, power models based on submetered measurements have been developed for equipment driven by CSD or VSD motors in different system configurations. The model parameters are identified from the normal-operation data with statistical analysis and appropriate fitting algorithms. The FDD methods employed in this chapter have been verified with data from three air handling units in a test building.

CHAPTER 5 Monitoring of components through the total power profile

5.1 Introduction

For fault detection and diagnosis, it is always desirable to obtain accurate and inexpensive information about the plant with the least interruption to the system's operation. In principle, electrical power consumption, as a comprehensive index, can be used to detect almost all the faults in an HVAC system, because faults tend to cause an abnormal magnitude or trend of power consumption. In practice, however, the detection output and costs vary significantly with measurements at different levels in the monitored system. In the previous chapters, detection and diagnosis of several faults with electrical power data have been demonstrated at two levels in a test building, i.e., the system and the component levels. Although the tests were successful for the selected typical faults, it may still be difficult to implement the two methods in a real system, where the information available for detection and diagnosis of faults varies.
As shown in Chapter 3, abnormal operation can be identified by analyzing the total power data obtained with two meters, one for the whole building system, as used in the detection of chiller cycling, and the other for the motor control center, as used in the detection of the remaining equipment of the HVAC system. Based on the examination of the on/off switches or the trend in the total power data, this detection method is efficient under some special load conditions or operating patterns. For example, the excessive power consumption of a supply fan could be seen by checking the detected change in the total power against a trained constant threshold upon turnoff each day at 10 pm, when the fan was only used to maintain appropriate air circulation at a minimum flow rate. In practice, however, turnoff of equipment may occur at different time points under various load conditions, as in a system controlled by a presence sensor, which makes it impossible to detect the abnormal operation with a fixed threshold established under some specific conditions. This indicates that a practical detector must be robust to any possible changes, i.e., it should be able to find the events of interest in spite of variations in the current conditions or parameters. The robustness of an FDD scheme is the degree to which its performance is unaffected by conditions in the operating system that turn out to be different from what they were assumed to be in the design of the FDD scheme. For the change detection in the total power series developed in Chapter 3, a major problem of robustness is to find events under uncertain load conditions. In contrast to the change detection with the least amount of measurements discussed in Chapter 3, the FDD strategy developed in Chapter 4 uses a submeter for each major piece of electrically driven equipment in the HVAC system, 17 in total for the three air handling units and the chiller in the test building.
Although the detector with these submeters produced desirable results in the tests, such devices are not usually available in building systems, and the detection may thus not be feasible when cost is weighed against benefit in common applications. Therefore, it is important for the detector to be applicable in current building systems under different load conditions at acceptable cost. To minimize or eliminate the use of submeters, power thresholds need to be established for all the common shutdown conditions with a minimum number of sensors and centralized meters. In this chapter, a detector is developed in the form of a power function by correlation, as discussed in Chapter 4, based on a measured reference parameter (such as the air flow rate) and the corresponding power values detected from the total power data at the appropriate system level instead of from submetered power data. The power values of a component can be obtained by detecting the total power changes in two different ways, depending on the load conditions when the equipment is turned off. One way is to detect and collect the changes only at turnoffs, if the shutdown conditions vary over a wide range from day to day; the other is to detect the changes introduced by several manual shutdowns during a typical operating day. Although the difference in the methods of data acquisition has no effect on the algorithms of detection and correlation, modeling based on manual shutdown should be conducted with appropriate guidelines to develop a reliable detector. In the following sections, the method based on manual shutdown is presented with tests in an HVAC system. First, the detectability and the appropriate accuracy of changes in the total power data are discussed. Then, as an example, the power model of a supply fan is set up with the detected power changes in the 24-hour power data and the measured air flow rate from a test building.
With the fitted function, the status of a component in the system can be monitored by comparing the detected changes in the total power at shutdown against the thresholds given by the power function with the measured value of the reference parameter. In addition, the energy consumption of the equipment can be estimated with reasonable accuracy. Demonstrated with a supply fan as an example, such a detector can be established with this method and used for any equipment with a variable-speed motor drive in HVAC systems.

5.2 Parameter selection for component modeling with system power monitoring

Component modeling aims to establish a power function that can provide reference power data for the equipment under various operating conditions. In principle, a model based on the physical relationships between a component's power input and the reference parameters can always be extrapolated. In practice, however, to minimize the uncertainty in fitting due to random errors in the sampled data, it is desirable to collect the power input of the equipment over a wide load range, in the form of total power changes at turnoff under different conditions. Also, the sampled data should be appropriately distributed within the load range to obtain a more reliable correlation. Therefore, some representative load conditions need to be determined, and the corresponding power input of the equipment should be properly detected as changes in the total power series. In general, the trend of a component's power input is affected by more than one factor. To facilitate the modeling process and reduce the related cost, the factor selected as the reference parameter should be the one that dominates the equipment's load variations, because only such a parameter can be used as the load indicator of the equipment to provide a guideline for the time schedule of the manual shutdowns.
In addition, the indicator must be easy to obtain or already available in the given system for the purpose of least-intrusive detection. In general, a load indicator for a component can be determined from knowledge of the system and the component. For example, the outdoor air temperature can usually be used as an indicator of the load variations of air-conditioning systems, especially when the internal load of the system remains relatively stable throughout a day. Fig. 5.1 shows the 12-hour profiles of the outdoor air temperature and the electrical power input of a supply fan of an air handling unit in a test building with a stable internal load during a typical summer day's operation.

Figure 5.1. Profiles of the outdoor air temperature and the power input of a supply fan in a test building. [Plot of fan power (W) and outdoor air temperature (°F) vs. time of day, 9:00-21:00.]

Note that the load indicator provides the trend of the load profile and is not necessarily used for monitoring or calculation of the load, because the actual thermal load profile may not follow that of the load indicator in time, owing to the delay caused by the thermal capacity of the building. For example, the daily peak load of an air-conditioning system generally occurs at a time different from that of the highest outdoor air temperature. For a system with the outdoor air temperature as the major variant, data detected at a certain interval, e.g., 1 hour, can be used to fit the power function.

5.3 Analysis of the detected changes for component modeling

Another important issue in data acquisition for a model is the detection of changes in the total power series. In this FDD strategy, the power functions are based on the detected changes, which may deviate significantly from the true values because of the dynamics in the operating system.
In Chapter 3, it was shown that some faults can be detected directly from the total power data of a system as changes of abnormal magnitude at specific time points or under certain load conditions. For the modeling of a component's power consumption from the system's total power input, however, changes need to be detected under different load conditions. In addition, to obtain a reliable function for fault detection, the magnitude of the detected changes used for correlation must be computed with reasonable accuracy. Therefore, it is essential to analyze the detectability of changes and the accuracy of the computed values before the model can be established. Although it is difficult to define a common standard for the detectability of the changes and the acceptable error range, given the diverse characteristics of the power data in different systems, some guidelines for implementing the detection and analyzing the output have been established in this research to improve the accuracy of the models for the monitored components. The most important factors obtainable on-line without additional measurements that affect the detection are the sampling interval of the power data, the relative magnitude of the change, and the noise level in the integrated power series. In a time series of the power consumption of an HVAC system, the data pattern in the detection window changes with different sampling intervals, as shown in Chapter 3, and hence an event appears with varying visibility and magnitude to the detector. In general, the accuracy of the detected power changes is affected by the magnitude of the change relative to the current total power value. The visibility of the change increases with its magnitude, which indicates that data should be selected more carefully when the monitored component is under low load conditions and the relative magnitude of the change in the total power data becomes smaller.
Observations have also verified that the detection quality improves when the data remain constant and the event is close to a step change. This means data obtained during transient periods or in a very noisy environment should be avoided in fitting the model. Therefore, change detection should be implemented not only under representative load conditions but also in a relatively noise- and dynamics-free data environment.

5.3.1 Effects of the sampling interval

It has been observed that the visibility and the accuracy of the detected changes are affected by the sampling intervals of the power data. Since the duration of events varies among equipment at different time points, e.g., during on/off switches, detection with multiple sampling intervals usually yields more desirable results for a system with more than one electrically driven component. Upon the identification of an event, the magnitude of a change is calculated as the difference between the averages in the detection and the pre-event windows. Errors always exist due to deviations from the real values caused by random noise, which is inherent to electrical power systems. With a single sampling interval, the error is determined by the sampled data points and changes with the FMA windows. The error value also differs among sampling intervals, depending on the relationship between the sampling interval and the period of the random noise. For detection with a single interval, the error caused by the random disturbance cannot be balanced or reduced. However, it can often be significantly decreased by averaging out the opposite spikes in the detected changes among different sampling rates, owing to the normal distribution of noise in the electrical power data. As an example of the tests conducted in this research, Fig.
5.2 and Table 5.1 compare the detection output for changes of different magnitudes with single- and multi-rate sampling of power data from the motor control center of a test building against the values measured by submeters. During this test period, there were 17 on/off switching events of two hot-water pumps, two supply fans, and two return fans among a total of six fans and 10 pumps. The sampling interval used in the single-rate detection is 10 seconds, while nine intervals, namely 1, 2, 3, 4, 5, 10, 20, 30, and 60 seconds, have been employed in the multi-rate case. These sampling intervals, selected after extensive tests with intervals from 0.125 through 1200 seconds based on power data from different systems and under various conditions, have been found to be the most representative values for HVAC power systems. The two limits, 0.125 and 1200 seconds, are observed as the lower and upper limits for the duration of the on/off events in HVAC systems.

a). With a single sampling interval

Of the 17 events, 15 were found with different levels of error and two were missed. From Table 5.1, it can be seen that the quality of the detection depends on the magnitude of the change relative to the total power as well as on the current data trend. For example, at 20:34 with the turnoff of pump B, when the power change of -244 W was less than 5% of the total power, the error was about 150%. At 21:55 with the turnoff of supply fan A, when the power change of 1400 W was about 25% of the total power, the error was less than 0.3%. When the change magnitude was very small, the detector was not able to find the event at all, such as the event at 21:56 when return fan A was turned off with a power change of -82 W. In addition, the shorter the time interval between two changes, the more difficult it is to find the changes, especially the small ones.
This is because, as a steady-state detector, the GLR needs some time for the effect of the former event to die out in order to find the subsequent change. This is demonstrated by the output at 20:21 for the turnoff of pump A. With a magnitude of 275 W, the event was still missed due to the gradually changing property of the data caused by the previous change and the noise, as seen from Fig. 5.2(a).

b). With multiple sampling intervals

With a single sampling interval, significant discrepancies in power changes have been observed between the submetered data and the values detected by the detector from a centralized data logger. Sometimes, on/off events are undetectable with a single sampling rate. By definition, the GLR algorithm is intended to locate changes only, not to quantify the magnitude of changes with sufficient precision to detect potentially faulty conditions. However, with further techniques such as multiple sampling rates, the detector itself, which is based on the innovated GLR algorithm and the data processing techniques, is significantly improved in terms of detection quality, not only in the detectability of the events but also in the resolution of the output. This has been verified by running the detector with multiple sampling intervals through the same data set. As shown in Fig. 5.2 and Table 5.1, with the multiple sampling intervals, 16 out of the 17 on/off events were identified, one of which was undetectable with the single sampling rate implemented before. Moreover, the errors of nearly 70 percent of the events were significantly smaller with the multi-rate sampling than with the single-rate sampling, while only one event, at 23:40, showed considerably larger errors with multi-rate samples, as listed in Table 5.1. Under some circumstances, the error of the multi-rate detection is larger than that of the single-rate detection.
This is because, to obtain the maximum number of alarms for the events with each sampling rate, the threshold for the detection must be lowered to the minimum value among all the sampling rates. With the lowered threshold, the alarm time points change for those rates with higher thresholds. As a result, the data patterns for the reset of the detection window are changed, which in turn may cause some variations in the computed magnitude. However, the occurrence of this problem is rare, and such errors can usually be averaged out among the different sampling intervals in the multi-rate detection.

Figure 5.2. Detection of changes under different conditions with single and multiple sampling intervals in comparison with the values measured by submeters. (a). Five-hour power data sampled at 1 Hz from the motor control center of a test building; (b). Comparison of the submetered power with the changes detected in the centralized power data with a single sampling interval of 10 seconds - 15 alarms for 17 events, 2 missed; (c). Comparison of the submetered power with the changes detected in the centralized power data with multiple sampling intervals of 1, 2, 3, 4, 5, 10, 20, 30, and 60 seconds - 16 alarms for 17 events, 1 missed.

Table 5.1. On/off power change detection for a motor control center serving fans and pumps in a test building, as compared with submetered data.
Time   Equipment                     Submeter    Total       Relative       Single interval          Multiple intervals
                                     power (W)   power (W)   magnitude (%)  Power (W)   Error (%)    Power (W)   Error (%)
20:10  Loop-B hot water pump--OFF      -405       5970         6.8            -502        24.0         -401         1.1
20:19  Loop-B hot water pump--ON        264       5325         5.0             249         5.7          284         7.6
20:21  Loop-A hot water pump--OFF      -275       5181         5.3           Missed        --          -279         1.5
20:25  Loop-A hot water pump--ON        252       5489         4.6             313        24.2          227        10.1
20:34  Loop-B hot water pump--OFF      -244       5125         4.8             -98        59.8         -476        95.1
20:44  Loop-B hot water pump--ON        232       5519         4.2             497       114.2          230         0.7
21:02  Loop-B hot water pump--OFF      -309       5289         5.8            -346        12.0         -282         8.9
21:14  Loop-B hot water pump--ON        267       5105         5.2             171        36.0          252         5.6
21:29  Loop-B hot water pump--OFF      -202       5755         3.5             -11        94.6         -172        15.1
21:47  Loop-B hot water pump--ON        237       5371         4.4             289        21.9          257         8.5
21:55  AHU-A supply fan--OFF          -1400       4224        33.1           -1396         0.3        -1411         0.8
21:56  AHU-A return fan--OFF            -82       4142         2.0           Missed        --         Missed        --
22:00  AHU-B supply fan--OFF           -430       3727        11.5            -274        36.3         -485        12.8
22:06  Loop-B hot water pump--OFF      -300       3338         9.0            -242        19.3         -243        18.9
22:26  Loop-A hot water pump--OFF      -264       2734         9.7            -283         7.2         -244         7.4
--     AHU-B return fan--OFF            --          --          --           Missed        --            --         --
23:40  Loop-A hot water pump--ON        342       3092        11.1             373         9.1          220        35.6
00:33  Loop-B hot water pump--ON        295       3789         7.8             494        67.5          451        52.7

As noted above, in addition to the data sampling technique, characteristics of the power profile play an important role in the detection of changes and in the selection of values for the modeling, as can be seen from the above plots and the table. In this research, two major factors have been found useful to outline the effects of the current data trend on the detection quality: the relative magnitude and the noise level.

5.3.2 Effects of the relative magnitude

In principle, the GLR algorithm can be used to detect changes of any magnitude.
However, tests have shown that in order to minimize the occurrence of false alarms due to random noise in the data, the magnitude of the minimum detectable event should not be less than the current standard deviation of the data window. This minimum value helps to determine the smallest component that can be modeled from the system's total power, as well as the appropriate time for the detection of changes to be used for the fitting, as discussed later in this chapter. It has been found that the detectability of an event changes with its magnitude relative to the current integrated power value of the system, and the detection becomes more accurate when the magnitude accounts for a larger portion of the total power consumption of a given system. Defined as the ratio of the magnitude of a change to the current total power of the monitored system, the relative magnitude is used as an indicator because the visibility of a change to the detector diminishes with the increase of the power value and the consequent accumulation of noise as more components are put into operation in the system. Although no quantitative and concrete correlation has been observed between the detection error and the relative magnitude of a change, a general descriptive trend has been found as the envelope of the detection error under different relative magnitudes. This has been verified through tests with different systems and can also be seen from the above example. No consistent function can be obtained between the detection error and the relative magnitude for all the data points in Table 5.1, as shown by the scattered points in Fig. 5.3. However, a trend line can be formed, composed of the maximum error found at different relative magnitudes in the detection, as demonstrated by the dashed line in Fig. 5.3.
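As a sketch, the relative-magnitude screen described above reduces to a simple filter. The function name and event tuples are illustrative; the 5% cutoff is the value reported for the test building.

```python
def screen_by_relative_magnitude(events, min_ratio=0.05):
    # events: (detected magnitude in W, current total power in W) pairs;
    # keep only changes whose magnitude is at least min_ratio of the
    # current total power, since smaller changes proved unreliable
    return [(m, p) for (m, p) in events if abs(m) / p >= min_ratio]

# three events from Table 5.1: a pump turnoff, a supply fan turnoff, and
# the small return fan turnoff that the detector tends to miss
events = [(-405.0, 5970.0), (-1400.0, 4224.0), (-82.0, 4142.0)]
kept = screen_by_relative_magnitude(events)   # the -82 W event is dropped
```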
The randomly distributed points show that the errors vary significantly even for the same relative magnitude, but the dashed line clearly shows that the maximum error of the monitored event decreases with the increase of the relative magnitude.

Figure 5.3. Errors between the detected changes from the total power data and the submetered values vs. the relative magnitude of the submetered values.

The relationship between the error and the relative magnitude verifies that in the modeling of a component, equipment with low power consumption under common operating conditions should be avoided, because a reliable power model of the component cannot be established from the changes detected in the total power if the component's power input accounts for a very small portion of the monitored system, e.g., less than 5% in this example. However, with only the relative magnitude, it is still difficult to choose the appropriate data points without measurements by submeters, due to the lack of a reference and the uncertainties in the detection caused by irregular fluctuations in the power series. In order to choose reliable detection outputs, another index, namely the standard deviation, has been found useful in tracking the noise level of the current environment.

5.3.3 Data screening by the standard deviation in the FMA window

In a real power system, the values of random noise can hardly be defined by a function or formula. However, the characteristics of the noise at a given time can usually be reflected by some indices, such as the approximate period and the peak value for the transient description of a disturbance, and the standard deviation for the average noise level. Generally, the error of a detected change increases with the noise level.
In this research, it has been observed that the accuracy of the detected changes is related to the noise level represented by the standard deviation in the detection window. Moreover, it has been found that the relative error, and consequently the relative standard deviation, defined as the ratio of the standard deviation to the current total power in the detection window, are more appropriate for obtaining a reasonably balanced power model. This is first because the noise level, or the standard deviation, usually increases with the amount of equipment in operation and hence with the total power value. Second, the power model is desired to be of evenly distributed accuracy under various load conditions. Test results showed that large relative errors were always accompanied by high levels of noise, the major source of errors in change detection. It should be noted, however, that the converse may not hold, due to the random data distribution in the window; i.e., high noise levels do not necessarily result in large errors. Fig. 5.4 shows the distribution of the relative error as the relative standard deviation changes for the 17 events shown in Fig. 5.2. The number beside each point is the value of the relative standard deviation at the time of the event. Although no concrete correlation can be achieved for the randomly scattered points, it can be clearly seen that large errors always occur with large relative standard deviations, as at the points at 66.4, 77.2, and 90.1. On the other hand, large relative standard deviations do not necessarily lead to large errors, as at the points at 76.6 and 79.6. Similar patterns for the distribution of relative error vs. relative standard deviation have been found in all other tests conducted in this research. In spite of the lack of an explicit formula relating the two variables, such a trend provides a sufficient, but not necessary, condition for the selection of data to be used for correlation.
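The screening by relative standard deviation can be sketched as a greedy drop of the noisiest events. This is a hypothetical helper, not the thesis's code; the pairing of magnitudes with the relative standard deviations of Fig. 5.4 is invented for illustration.

```python
def screen_by_relative_std(events, n_drop):
    # events: (magnitude, relative standard deviation) pairs, where the
    # relative standard deviation is the FMA-window standard deviation
    # divided by the current total power; discard the n_drop events with
    # the largest relative standard deviation, preserving input order
    cutoff = sorted(rs for _, rs in events)[len(events) - n_drop]
    return [(m, rs) for (m, rs) in events if rs < cutoff]

# illustrative events labeled with relative standard deviations from Fig. 5.4
events = [(-98.0, 66.4), (-11.0, 77.2), (373.0, 49.7), (249.0, 15.3),
          (-502.0, 90.1), (-346.0, 18.3)]
kept = screen_by_relative_std(events, n_drop=3)
```

Note the trade-off stated in the text: the screen can discard a point that happened to have a small error, but it protects the fit from the cases where large errors do occur.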
Useful data can be chosen from the detected changes by discarding the ones with relatively large values of the relative standard deviation in the current detection window. Although this may sacrifice some data points with large values of relative standard deviation but small relative errors, such as the points at 76.6 and 79.6 in Fig. 5.4, this method helps to secure a more accurate model and hence more reliable outputs in future fault detection.

Figure 5.4. Distribution of the relative error with the relative standard deviation in the detection window.

Moreover, such a screening method can be realized without the reference of submetered data. This is because the values of the relative standard deviation are directly calculated by the detector with each incoming point in the FMA window and then compared among all the events involved.

5.4 Application of the gray-box model in fault detection and energy estimation

With the above analysis of the factors that can be used to improve the quality of detection, models based on the system's total power can be developed for the electrically driven equipment in HVAC systems. Assuming the GLR detector has been trained with the method developed in Chapter 3, the component modeling procedure can be summarized as follows.

1). Obtain information about a component of interest, including power capacity, type of motor drive, and general operating schedules, from manufacturer's catalogs and/or the system's design specifications;

2). Study the feasibility of modeling the equipment's power input with the system's total power based on the component's power capacity relative to the system's total value. Although there is no fixed criterion, the relative magnitude should not be too small; e.g., 1/1000 is too small to be seen by the detector.
Tests showed that this value should not be less than 5%, especially in the presence of VSD motors in the system. Also, refer to the standard deviation obtained in the training of the GLR detector in Chapter 3 for the decision. If the power capacity of the equipment is less than the standard deviation of the total power data when this equipment is in operation, then its power input may not be modeled with desirable accuracy;

3). Determine the parameter to be correlated with the power consumption of the equipment, as discussed in Chapter 4;

4). Select the parameter that can be used as the indicator of the major variations in the load of the system. In common HVAC systems, outdoor air temperature or enthalpy can be used if the heat gain/loss from outside is the major variable part of the total load. If the internal load dominates, then use the schedule of the source as a reference, e.g., the number of occupants;

5). Observe the total power data of the monitored system from one normal day's operation and find the times for the on/off events to be detected based on the variations of the load condition;

6). Determine the sampling rates of the power data as analyzed in Chapter 3. In addition, check the availability of the control parameters of the equipment, such as the PID coefficients, to help determine the time constant of the process;

7). Apply the detector to the power data and record the values of the reference parameter to be correlated at the on/off switches of the equipment;

8). Choose the power data for correlation based on the calculated values of the relative magnitude of the changes and the relative standard deviation at the event time, as discussed in Section 5.3;

9). Set up the model by correlating the power data with the reference data and form the allowable offset range from the fitted function based on a confidence level;

10). Test the model with additional data collected under normal operation and make modifications if needed.
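The feasibility check in Step 2 can be written as a small helper. The function is hypothetical; the 5% floor and the noise-floor comparison follow the guidelines above, and the example numbers are those of the test building's supply fan (5.0 kW in a 32.25-kW motor control center, standard deviation up to 105 W).

```python
def feasible_to_model(component_kw, system_total_kw, total_std_w,
                      min_ratio=0.05):
    # a component is a candidate for gray-box modeling from the total power
    # only if (a) its capacity is not too small a fraction of the system
    # total and (b) it exceeds the noise floor of the total power series
    if component_kw / system_total_kw < min_ratio:
        return False
    return component_kw * 1000.0 > total_std_w

fan_ok = feasible_to_model(5.0, 32.25, 105.0)     # 15.5% of system: feasible
tiny_ok = feasible_to_model(0.03, 32.25, 105.0)   # below the 5% floor
```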
In deciding the time points for the on/off events in Step 5, some preliminary information or observations may help to avoid switches of the equipment that are likely to cause large errors in the detection. The analysis in Section 5.3.3 has shown that in order to obtain more accurate changes, periods with more dynamics, marked by large relative standard deviations, such as transient processes, should be avoided. Although the occurrence of large noise and transient processes is rarely known in advance, detection with significant effects of such disturbances can be skipped based on knowledge of the system's normal control sequence or by examining the power profile during a typical day's operation. The representative load conditions are best selected in a day with a wide span of the parameter that causes the major variations in the load. For example, if outside air temperature is used as an indicator, the turnoff events may be executed every hour from the early morning into the late afternoon during the operating hours of a day. Fig. 5.5 illustrates the modeling process for the power consumption of a supply fan with a variable speed drive and a design capacity of 5 kW in a test building. The monitored system is composed of six fans and 10 pumps electrically wired together at a motor control center, as shown in the appendix. It is known from the design information of this system that the major factor causing the variations in the building's cooling load is the outdoor air temperature. So the cooling load trend can be approximately predicted by keeping track of the outdoor air temperature, which changes with time during a day as shown in Fig. 5.1. Table 5.2 lists the necessary preliminary information for the detection.
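Choosing the event times while skipping the transient periods amounts to a one-line schedule. The function name is illustrative; the hours and exclusions are those used for the test building (hourly turnoffs from 8:00 to 20:00, skipping the lunch-break turndown).

```python
def turnoff_schedule(start_h=8, end_h=20, exclude=(12, 13)):
    # hourly manual-turnoff times over the operating day, skipping the
    # transient periods (here the 12:00-13:00 lunch-break turndown)
    return [h for h in range(start_h, end_h + 1) if h not in exclude]

hours = turnoff_schedule()   # 8, 9, 10, 11, 14, ..., 20
```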
From the twenty-four-hour data of a typical summer day, it can be seen that during the transient processes between 7:00 - 8:00, when the system was started, and 12:00 - 13:00, when the system was turned down by the control system during the lunch break, the power data showed significantly more dynamics than during the rest of the day. This indicates that the manual turnoff should not be implemented during these time periods. For the remaining time, manual turnoff of the fan was simulated with the power data by subtracting the submetered power data at an interval of 1 hour from 8:00 to 20:00, excluding 12:00 and 13:00, as shown in Fig. 5.5(a) and (b). The 1-Hz data were then fed to the detector for changes at these time points. The standard deviation was also computed simultaneously in the moving window. The detected changes in comparison with the submetered values, and the relative error distribution with the relative standard deviation, are shown in Fig. 5.5(c) and (d), respectively. Data points at 14:00, 15:00, and 16:00, corresponding to the three largest relative standard deviations of 70.5, 100.3, and 74.6, were further removed for the modeling, though it can be seen from Fig. 5.5(c) that the error at 14:00 is not the largest among the other nine points. The nine remaining data pairs were finally used for the polynomial power model. From the comparison between the model based on detection with the total power and that from the submetered data, as shown in Fig. 5.5(e) and Table 5.3, it can be seen that the difference between the two models decreased with an increase in the air flow rate and hence the fan power input, which further verified that the detection accuracy increases with the relative magnitude of the monitored equipment in its host system. The model for detection based on a 90% confidence level has been successfully implemented with power data from the test system. One example is shown in Fig. 5.5(f) for the detection of a 3-stage offset of a static pressure sensor.

Table 5.2.
List of the information used in the power change detection of a supply fan.

Item                           Description                Comment
Power capacity                 5.0 kW
Minimum power input            0.5 kW                     Under the zero-load condition
Type of fan motor drive        VSD                        The fan speed control signal can also be used for FDD and energy estimation
Minimum speed control signal   20%
Operating schedule             7:00 - 22:00
Power percentage of system     15.5%                      When the fan is in operation
Correlation parameter          Air flow rate
Confidence level of model      90%
Load indicator                 Outdoor air temperature
Event time                     Every hour, 8:00 - 20:00,  Lunch break - system turndown
                               excluding 12 and 1 p.m.
System total capacity          32.25 kW
Transient period               12:00 - 14:00
Detectable magnitude           --
Range of standard deviation    0.0 - 105.0 W              Higher values occurred at events
Sampling intervals             1 - 30 seconds             Multi-rate detection

Figure 5.5. Component power modeling via detection of power changes at the system level - electrical power input vs. air flow rate of a supply fan in a test building. (a). Twenty-four-hour total power of the motor control center and the submetered power of a supply fan; (b).
Total power of the motor control center with the manual shutdown at an interval of one hour from 8:00 to 20:00, except 12:00 and 13:00 when the system was turned down for lunch break; (c). Power changes measured by the submeter and those detected from the total power at the shutdown of a supply fan; (d). Error distribution with the relative standard deviation; (e). Comparison between the two models based on the changes detected from the total power data and on the submetered fan power data, respectively; (f). Detection output for the static pressure sensor offset.

Table 5.3. Comparison between the fan power as detected changes in the total power and submetered measurements at given air flow rates.

Air flow rate (CFM)        914.1   965.3   1361.6   1999.1   2332.5   2391     2490.4   2636.7   2820.9
Detected power change (W)  387.2   418.8    698.5   1280.0   1648.7   1718.0   1838.8   2023.9   2269.0
Submetered power data (W)  317.5   347.2    620.2   1218.8   1610.1   1684.3   1814.2   2014.0   2280.4
Error (%)                   22.0    20.6     12.6      5.0      2.4      2.0      1.4      0.5      0.5

Tests of other faults related to the fan power have also been conducted with this method. Despite the slight deviation of the detection-based model from that based on submetered measurements, all the implemented faults that were found by the submetered model were also identified by this model. Because of the decreased accuracy when the relative magnitude of a detected change is small, the model may become less sensitive to a fault under low-load conditions. In practice, a fault under low load of a component can be found only when the fault becomes more serious, even with the submeter-based model, due to the relatively wide range of the confidence interval compared with the power value itself. In addition, although VSD motors are used to reduce energy waste at off-peak loads, equipment driven by such motors is not designed to work within extremely low-load ranges.
This is because accurate and steady control is difficult to maintain under such conditions, and hence a minimum load is always required to start the motor [Chen and Demster 1995]. Moreover, the confidence intervals are established to prevent false alarms caused by potential random deviations, and data points are usually supposed to spread closely around the fitted curve rather than near the upper or the lower limit. If the sampled power data line up much closer to the upper or lower limit than to the fitted line, this should be treated as a fault even when the sampled points still stay within the interval. In addition to fault detection, the model based on detection of changes by manual shutdown can also be used to estimate the energy consumption of the related equipment, and of the monitored system if a model can be established for each component. For the 24-hour operation of the day used in the above example, as listed in Table 5.4, the energy consumption of the fan was 19.5 kWh by the submeter and 18.8 kWh from the model with the measured air flow rate. The error was about 3.6%, despite an electrical surge around 17:30, as seen in Fig. 5.6, which has rarely occurred in common operation. The errors for other normal days' operation were found to be less than 3.5%.

Table 5.4. Energy estimation of a fan based on submeter and detector for a normal day.

Item           Fan power based on submeter   Fan power based on detector
Energy (kWh)   19.5                          18.8
Error (%)      0.0                           3.6

Figure 5.6. Twenty-four-hour supply fan power profiles during a normal working day by measurements and by simulation of power as a function of air flow rate based on detection of changes in the total power data.
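The fit-then-integrate workflow can be sketched end-to-end with synthetic data. The quadratic fan law, the flow values, and the daily flow profile below are invented for illustration and are not the test building's measurements.

```python
import numpy as np

# assumed "true" fan law for the illustration: power (W) vs. flow (CFM)
def true_fan_power(q):
    return 2.5e-4 * q**2 + 0.1 * q

flows = np.array([900.0, 1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 2700.0])
detected = true_fan_power(flows)          # stand-ins for detected changes

# gray-box model: polynomial fit of detected power change vs. air flow rate
model = np.poly1d(np.polyfit(flows, detected, 2))

# energy estimate: evaluate the model on a day's hourly flow profile (CFM)
# and sum over the hours the fan is running (1-hour steps, Wh -> kWh)
profile = np.array([0, 0, 0, 0, 0, 0, 0, 1200, 1800, 2100, 2400, 2400,
                    1500, 1500, 2400, 2700, 2700, 2400, 2100, 1800,
                    1200, 0, 0, 0], dtype=float)
running = profile > 0
energy_kwh = model(profile[running]).sum() / 1000.0
```

Because the synthetic data are noise-free, the fitted model reproduces the assumed fan law; with real detected changes, the screening of Section 5.3 decides which points enter the fit.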
In practice, control signals that can be obtained from the building energy management system may also be used to establish power functions for the related equipment by following the guidelines described above. For example, with the same procedure as shown in Section 5.3, the power input of a fan can also be modeled as a function of the motor speed control signal. However, for equipment driven by a VSD motor, special care must be taken in data acquisition for the model fitting when the equipment is running under low load conditions, i.e., when the control signal is at a low value. Although the power data should ideally be obtained under a wide range of the control signal to produce an accurate model, detection of the related changes in the total power series may not be applicable for the model. This is because, with a small value of the control signal, the relative magnitude of the equipment's power consumption may also become too small to be computed with acceptable accuracy, and hence the quality of the model is affected. Therefore, the power model needs to be extrapolated for detection and energy estimation under low load conditions. When the equipment is not in operation, the controlled variable and the power input should be zero. In practice, however, a minimum load is always required by the design of a VSD motor, i.e., a nonzero lower limit of the motor speed control signal is needed. This means a zero power input corresponds to a non-zero control signal. In the test building, the minimum fan speed control signal was 20%, as listed in Table 5.2, which indicates that zero power input is related to a 20% fan speed control signal even though the actual fan speed is zero. Therefore, the zero-load data pair (0.2, 0.0) must be used in the correlation as a constraint on the function for low-load energy calculation.
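One way to enforce the zero-load pair (0.2, 0.0) exactly, rather than as just another data point, is to fit in the shifted variable s - 0.2 with no constant term, so the fitted function passes through zero power at the 20% minimum speed signal. This is a sketch of that idea; the (signal, power) pairs below are invented for illustration.

```python
import numpy as np

# illustrative (speed signal, fan power in W) pairs; (0.2, 0.0) is the
# zero-load constraint: 20% minimum speed signal at zero power input
signal = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
power_w = np.array([0.0, 300.0, 1000.0, 2100.0, 3600.0])

# fit p(s) = a*(s - 0.2)**2 + b*(s - 0.2); the missing constant term
# forces p(0.2) = 0 exactly, unlike an unconstrained polynomial fit
t = signal - 0.2
a, b = np.linalg.lstsq(np.column_stack([t**2, t]), power_w, rcond=None)[0]

def fan_power_model(s):
    u = np.asarray(s, dtype=float) - 0.2
    return a * u**2 + b * u
```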
Otherwise, the power input will be interpreted as zero under a zero speed control signal by extrapolation, which may lead to considerable errors in energy estimation, as listed in Table 5.5 for the same data set used in the above example illustrated in Fig. 5.5. A comparison of the fan power consumption between the submetered values and the power model with the appropriate low-load constraint is illustrated by Fig. 5.7.

Table 5.5. Estimation of energy consumption of a fan in one day by the power vs. speed control signal model with different constraints under the zero-load condition.

                           Power functions based on change detection
Item           Submeter    No constraint on the     Zero-speed    20%-speed
                           zero-load condition      zero-power    zero-power
Energy (kWh)   19.5        14.5                     14.6          19.0
Error (%)       0.0        25.6                     25.1           2.7

Figure 5.7. Twenty-four-hour supply fan power profiles during a normal working day by measurements and by simulation of power as a function of fan speed control signal based on detection of changes in the total power data.

For a system without VSD motors, the model can be simplified by clustering the power magnitudes of different equipment, as has been done for residential buildings [Hart 1992]. However, for commercial buildings, statistical detection of changes is still necessary due to the presence of higher noise levels, especially when a change is of small relative magnitude. Also, the GLR detector may be simplified to a single sampling rate due to the shorter and more consistent on/off duration of equipment with CSD drives. Fig. 5.8 demonstrates a typical power profile of such a system over one hour and the GLR output for the 57 events that occurred during this time period.
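For a CSD-only system, the clustering of detected step magnitudes can be sketched as a greedy one-dimensional grouping. This is only in the spirit of the appliance-load clustering cited above, not Hart's actual algorithm; the tolerance and the sample magnitudes are assumptions.

```python
def cluster_changes(magnitudes, tolerance=50.0):
    # greedily group step changes whose magnitudes agree within `tolerance`
    # watts of a cluster's running mean; each cluster then stands for the
    # on/off signature of one piece of constant-speed equipment
    clusters = []
    for m in sorted(magnitudes, key=abs):
        for c in clusters:
            if abs(m - sum(c) / len(c)) <= tolerance:
                c.append(m)
                break
        else:
            clusters.append([m])
    return [sum(c) / len(c) for c in clusters]

# turn-on/off magnitudes of a ~400 W pump and a ~1400 W fan (illustrative)
means = cluster_changes([400.0, 410.0, -405.0, 1400.0, 1390.0, 395.0])
```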
Figure 5.8. GLR detection of changes for component modeling from the total power data of a building energy system without VSD motors. (a). 1-Hz power data from a restaurant; (b). Detection output for the 57 on/off switches in the system, with no false or missed alarms.

5.5 Discussion and conclusions

This chapter demonstrates the evaluation and modeling of the power consumption of the major components in HVAC systems based on detection of changes in a system's total power profile. Guidelines and rules are proposed for the development of the gray-box model for fault detection and diagnosis as well as for energy estimation. With appropriately selected indices, including the sampling rate, the relative magnitude of a change, and the standard deviation of the current series, the power function of a component can be established for effective fault detection and energy estimation. As a general rule for equipment to be monitored from the integrated power profile, it has been found that the relative magnitude of the component's power consumption in the system's total power input should not be too small. With the test building, a lower limit of 5% was used for the power model. A power logger at a lower level is needed in order to find power changes that are accurate enough for feasible evaluation or modeling of a small component, such as a chilled water pump. For example, a logger could serve all the pumps in a building, or those in several AHUs, to meet the requirement for the minimum applicable detected change. Although such a monitor does not work on the system's total power, and the detection may require more than one logger for the whole system, it may still be worthwhile compared to submeters for each component.
In addition, two potential problems with the tests need to be addressed when applying the detection method. First, for the turn-off power detection, it is assumed that no other events happen at the same time. This might produce false alarms if something else actually occurred. One possible solution is to repeat the manual change detection on another day with a similar load condition and remove the data points that differ significantly from those at the same time on the other day(s). Second, the detection in the test building was conducted with a base sampling interval of one second, while the sampling interval of the reference power data by the submeter was one minute. This might introduce deviations between the submetered data and the detected values due to the ever-changing magnitude of the total power, even if no other events occurred during the one-minute period. Such a potential difference can be reduced by using a submeter with a sampling interval that is comparable to the base sampling interval of the detector. It should be noted that although feasible power models of the major components can be obtained from the total power data at the system level with the method developed in this chapter, submeters need to be used for fault detection and monitoring when the equipment must run around the clock, such as in a hospital where the supply fans should never be turned off. Under such conditions, fault detection and diagnosis should be based on submeters if power consumption is used as the major index for monitoring.

CHAPTER 6

Diagnosis of faults by causal search with signed-directed-graph rules

Diagnosis of a fault is intended to identify the fault origin at the lowest possible level in a system based on the exploration of the maximum amount of available information about the system and its components. In Chapters 3 and 4, simple rules have been demonstrated for alarms of abnormal power input due to some typical faults.
However, since those rules only point out the equipment with abnormal power consumption instead of the origin of a fault, they are actually applicable to fault detection and are not sufficient to produce a full diagnosis. This chapter applies a top-down knowledge-based scheme for fault diagnosis through analysis of power consumption. Based on a thorough study of current methods of fault diagnosis, two diagnostic approaches are proposed, with power data at the system and the component levels. With only the total power input, the knowledge about the system and the components is very limited, and hence a shallow diagnostic reasoning technique is introduced to signal the possible directions or devices. When more detailed information about the system and its components is available to the detector, a deep diagnostic reasoning technique called a causal search scheme is developed to trace down the most likely branches to identify the real cause of the detected abnormalities. Formation of the knowledge base and diagnostic rules is described, and diagnostic schemes are implemented and verified for each technique with data from real buildings.

6.1 Introduction

While fault detection involves the search for undesirable operation status in a system or a component, fault diagnosis aims to trace the cause behind the abnormal behaviors. In general, automatic identification and correction of a fault can be fulfilled in two different ways, hardware redundancy and software redundancy [Rossi et al., 1996]. As discussed in Chapter 1, diagnosis of faults in HVAC systems is usually based on software redundancy, which indicates the "duplication", either qualitative or quantitative, of the outputs of a plant with given inputs and comparison of the results against measurements. If the difference exceeds a threshold that accommodates the presence of noise and disturbance, then the monitor issues alarms for possible cause(s) of the faults for further actions if needed.
Fault diagnosis systems based on software redundancy are normally designed using artificial intelligence techniques and emulate human performance in cause analysis [Patton et al., 1989]. Current fault diagnosis schemes fall into two general categories, knowledge-based and artificial neural network (ANN) approaches. Based on an understanding of the physical relationships within the monitored system, the knowledge-based method compares the selected measurements against criteria or thresholds obtained from physical models or basic principles and traces the most likely cause(s) of the deviation by checking through an established set of logic. Initiated as an if-then rule checking technique, the knowledge-based approach has been extensively studied and applied in different areas, including the HVAC industry. Rossi and Braun [1996] presented a method for automated detection and diagnosis of faults in vapor compression air conditioners by using statistical properties of the residuals for current and normal operation and comparing the directional change of each residual with a generic set of rules unique to each fault. Stylianou [1996] and Stylianou and Nikanpour [1997] demonstrated a methodology that uses expert knowledge to diagnose the selected faults when a problem is found with the "health" of a reciprocating chiller from thermodynamic models and pattern recognition. Breuker and Braun [1998] conducted a detailed evaluation of the performance of a statistical rule-based fault detection and diagnostic technique with tests in a simple rooftop air conditioner over a range of conditions and fault levels. Fault diagnosis based on rules or knowledge has been studied by other researchers in different ways [Karki and Karjalainen, 1999] [Ngo and Dexter, 1999] [Peitsman and Soethout, 1997]. House et al.
[1999] conducted a preliminary study of several typical classification techniques from the two basic categories for diagnosis of faults in data generated by a variable air volume air handling unit simulation model and found that the rule-based method yielded the most reliable results in the test outputs. However, most traditional knowledge-based methods rely on detailed information about the system, including the system setup and the physical and thermal parameters of the components, which greatly limits the compatibility or portability of the diagnostic tool. In addition, diagnosis is primarily based on selected physical parameters, or a combination of them, from specific components, but measurements of such values for the knowledge base are sometimes difficult to obtain or not available from the current control system. The ANN technique basically aims to enable a program to learn, reason, and make judgements based on a training data set without detailed knowledge of the system's physical background. In the past five years, some research has been conducted with data from simulation or small test units. Lee et al. [1997] described the application of an ANN using the backpropagation algorithm to the problem of fault diagnosis in an air handling unit. Peitsman et al. [1996] studied the application of black-box models for fault detection and diagnosis in HVAC systems by comparing a multiple-input/single-output (MISO) ARX model and an ANN model. Li et al. [1997] presented an ANN prototype for fault detection and diagnosis in complex heating systems. Although these efforts showed some positive prospects for the ANN technique for fault diagnosis in HVAC systems, such black-box models are still mainly in the research stage due to the uncertainties in the technique itself for reliable models that can be extended to practical applications with little knowledge about the physical processes.
In addition, the requirement by an ANN model for training data representative of a system's behaviors under both normal and faulty conditions makes it difficult to use in practice. As discussed above, a practical tool for fault diagnosis should be adaptable to different systems, i.e., the criteria for diagnosis must be consistent and common among various types of systems. It should also be based on appropriate basic principles for potential extension and on a reasonable amount of measurements or control signals that are currently available or easy to obtain in common HVAC systems. Based on thorough research of the previous work, this thesis aims to develop a practical, system-independent method for fault diagnosis in HVAC systems from analysis of appropriately selected parameters. In this research, a knowledge-based methodology is developed with power consumption as the primary index for evaluation of detected deviations. Based on the generic facts of HVAC systems, an expert system structure is designed for the inference of the fault source. Fig. 6.1 illustrates the structure of a common expert system [Patton et al., 1989].

Figure 6.1. Architecture of a common expert system.

As shown in Fig. 6.1, an expert system consists of:
(a). User;
(b). Human interface - window for the explained outputs;
(c). Inference engine - automatic inferring program based on information from the knowledge base;
(d). Knowledge base - facts from the data base and models and rules for the problem;
(e). Data base - input data;
(f). Knowledge acquisition - models and rules provided by the expert or from storage;
(g). Workspace - memory for the storage of the problem.

The key elements in an expert system are the knowledge base, which supplies the facts and the related models and rules for analysis, and the inference engine, which provides the algorithm for diagnosis.
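The interaction between the knowledge base and the inference engine can be sketched as a toy forward-chaining program. The facts and rules below are invented examples for illustration, not the rule set developed in this thesis.

```python
class ExpertSystem:
    """Minimal sketch of the Fig. 6.1 structure: a knowledge base of facts
    and IF-THEN rules, an inference engine, and a workspace log."""

    def __init__(self):
        self.knowledge_base = set()   # known facts
        self.rules = []               # (conditions, conclusion) pairs
        self.workspace = []           # record of the inferences made

    def add_rule(self, conditions, conclusion):
        self.rules.append((frozenset(conditions), conclusion))

    def infer(self, observations):
        """Forward-chain over the rules until no new fact can be derived."""
        facts = self.knowledge_base | set(observations)
        changed = True
        while changed:
            changed = False
            for conditions, conclusion in self.rules:
                if conditions <= facts and conclusion not in facts:
                    facts.add(conclusion)
                    self.workspace.append((sorted(conditions), conclusion))
                    changed = True
        return facts

# Illustrative (invented) HVAC rules and observations:
es = ExpertSystem()
es.add_rule({"fan power high", "fan speed normal"}, "possible belt or duct fault")
es.add_rule({"possible belt or duct fault", "static pressure high"}, "duct blockage likely")
result = es.infer({"fan power high", "fan speed normal", "static pressure high"})
```

The workspace keeps the chain of fired rules, so the operator can see why a conclusion was reached, mirroring the "explained outputs" of the human interface.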
By directly presenting to the diagnostic tool the energy effects of faults, which are a major concern of the HVAC industry, FDD based on power input helps the building energy management system to make proper decisions for necessary maintenance not only promptly but also efficiently. In addition, since electric power is the major energy source of current air conditioning systems, this approach can be applied to HVAC systems with different configurations. Moreover, as a function of reference parameters that can be obtained from the control system or by basic measurements, power consumption is modeled with physical insight into the process, and hence the uncertainty due to lack of understanding of the plant is minimized. Depending on the available power measurements, two reasoning techniques are established for fault diagnosis, i.e., the shallow reasoning approach and the deep reasoning method [Patton et al., 1989]. The shallow reasoning approach is used when only the system's total power is available, while the deep reasoning approach can be applied if power functions can be obtained with basic measurements and control signals.

6.2 Fault diagnosis by shallow reasoning with the system's total power input

The shallow reasoning technique was first developed as a diagnostic expert system in the medical domain for inference of the causes of observed evidence. Without information about the internal physical descriptions of the system, direct relationships are assumed between the observed symptoms and system malfunctions based on a fault dictionary or fault tree. The fault dictionary is prepared from basic knowledge about the system, which can be obtained from the system's past records. The diagnosis usually gives a list of possibly defective components instead of the ultimate cause of the fault. The major characteristic of this approach is that it can be implemented with a minimal amount of information about the monitored system.
In this research, the shallow diagnostic reasoning method is applied when little information is available except the total power input of the given system. Detection in such situations indicates the identification of changes or variations in the total power series, while the diagnosis involves recognition of the electrically driven equipment in the system with abnormal power consumption at on/off switches, unexpected operating schedules, and unstable control. Without measurements of any components in the system, the knowledge base mainly consists of data and facts obtained from the information in the manufacturers' catalogs, design specifications of equipment, and operating schedules from the building management system. For example, when a component is scheduled to start at the design time point, the FDD program that runs continuously will detect whether there has been a change in the total power of the monitored system around the given time and diagnose the change by checking the magnitude against the design value. If no change is found or the detected magnitude does not match the design value, then an alarm is issued to the operator and recorded in the workspace. Fig. 6.2 shows the profile of the total power input of the motor control center in a test building. The FDD output is given as follows for the startup of a supply fan around 7 o'clock in the morning.

Figure 6.2. Power profile of the HVAC system in a test building during a typical day (supply fan started at 7:00).

Detection output:
  Time of event: 7:20
  Magnitude: 1106 W

Knowledge base:
  Time of event: 7:00 - 7:30
  Magnitude: 500 - 3000 W
  Duration: 10 - 30 minutes
  Event description: Supply fan startup

Diagnosis output:
  Time of event: 7:20
  Magnitude: 0 (0 means the magnitude matches the specifications in the knowledge base.
+1 means an excessive value in the positive direction and -1 in the negative direction.)
  Event description: Supply fan startup

In this case, data in the knowledge base were obtained from the design information of the HVAC system (the 'Time of event' and 'Event description'), specifications for the fan from the manufacturer's catalog (power capacity and minimum power input as the 'Magnitude'), and observations of sampled power data (the 'Duration', which may also be available in the catalog). Based on the analysis of the information available for common HVAC systems, the shallow reasoning approach is able to provide a set of overall guidelines for the maintenance of the essential functions of a system. In addition, with no measurements of components and hence no intrusion into the system, this approach is virtually an interruption-free and cost-effective tool for general monitoring of the system's operation. However, in the shallow reasoning approach, both the fault dictionary and the fault tree methods are based on look-up tables with all potential faults preprocessed. In practice, such tables result in a large number of entries, and hence diagnosis of large systems becomes difficult and time-consuming. If the table is incomplete or fails to provide proper evaluations due to lack of insight into the system, the diagnostic output may be meaningless or erroneous. Moreover, with no data from the related components in a system, it is impossible to identify the underlying cause of an observed fault. Therefore, diagnosis with total power input based on shallow reasoning is applicable to systems consisting of components with distinct power capacities and operating schedules, such as the air conditioning system of a restaurant. In order to locate the specific source of a fault in a complex system, the inference scheme must be based on a more thorough understanding of the system and its components as well.
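The supply fan startup example above amounts to a table lookup plus range checks. A minimal sketch, with illustrative field names and the design values quoted in the example:

```python
def diagnose_event(detected, entry):
    """Return a diagnosis code for a detected power change.

    0    -> magnitude matches the knowledge-base specification
    +1   -> excessive in the positive direction
    -1   -> excessive in the negative direction
    None -> event outside the scheduled time window (alarm)
    """
    t = detected["time_min"]                       # minutes after midnight
    if not entry["t_min"] <= t <= entry["t_max"]:
        return None
    if detected["magnitude_w"] > entry["p_max"]:
        return +1
    if detected["magnitude_w"] < entry["p_min"]:
        return -1
    return 0

# Knowledge-base entry built from design schedule and catalog data
# (values taken from the example above):
fan_startup = {"t_min": 7 * 60, "t_max": 7 * 60 + 30,
               "p_min": 500.0, "p_max": 3000.0,
               "event": "supply fan startup"}
code = diagnose_event({"time_min": 7 * 60 + 20, "magnitude_w": 1106.0}, fan_startup)
```

With the detected 1106 W change at 7:20, the check returns 0, i.e., the startup matches the knowledge base, as in the diagnosis output above.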
6.3 Deep knowledge expert system

Deep reasoning is based on analytical models of a system and its components, which are also called deep models. A deep model of a plant can derive the behaviors of interest with a given set of parameters and predict the effects of variations in these parameters. By simulating the underlying principles of a process, the deep knowledge expert system can be used to estimate the effects of a potential fault without preprocessing the assumed fault scenario. Diagnosis with deep modeling can be divided into three basic methods according to their principles and functions: causal search, constraint suspension, and governing equations. The causal search technique is based on tracing observed deviations or malfunctions to their origins [Shiozaki et al., 1985]. Potential paths of fault propagation are described by signed directed graphs (SDG or digraphs) consisting of nodes, which represent state variables, alarm conditions, or fault sources, and branches, which display influence between nodes. A complete digraph is able to demonstrate the intensity of the effect, the time delays, and the probabilities of fault propagation along the branches. The constraint suspension technique mainly aims to reason from misbehaviors to structural defects based on inspection of the constraints on detailed input/output ports in a hierarchical structure [Davis and Shrobe, 1983] [Davis, 1984]. By examining the values at the I/O terminals of a component against constraints (or rules) that define the behaviors of the component, such as the inputs and the output of an adder in a digital circuit, this method is able to deal with complex systems with multiple layers when measurements of the input and output of each component can be obtained. The governing equations method works on a complete set of associated quantitative constraint equations [Kramer, 1986], e.g., mass balance across a unit in a flow path.
First developed for fault identification in the domain of chemical engineering, this technique can be applied to processes where the governing equation for each component and the associations between equations for different equipment are available. For non-critical applications such as an HVAC system, cost vs. benefit requires that feasible diagnosis be based more on logical inference than on specific descriptions of the equipment in the system. Moreover, it is usually difficult to obtain the I/O measurements or the appropriate governing equation for each component in operation. On the other hand, based on a qualitative diagnostic digraph, the causal search method can be implemented for diagnosis with much less information. Nodes representing key equipment are described by physical models that can not only demonstrate the severity of the effect of a fault, such as the power function of a fan, but also allow for time delay in the transportation of media or variations of parameters, which is common in the thermal processes of HVAC systems. Branches connect the related nodes in a process and give the actual active directions of fault propagation. Moreover, in this research, the dominant criterion for FDD is power consumption, which is the ultimate reflection of faults in the energy cost of the system. Hence fault diagnosis becomes a process of backward tracing to the origin, which is exactly the major function of the causal search technique. In principle, the major limitation of this method is that diagnosis can be conducted only if each variable undergoes at most one transition at a time between qualitative states, i.e., either an increase or a decrease from the normal value, during fault propagation. Although this might become a problem in some engineering processes, e.g., the concentration of certain compounds in a chemical reaction process, this condition can usually be met in a common HVAC system, especially when power consumption is the output of interest.
Therefore, the causal search method is more efficient and feasible than the other two approaches for FDD applications in HVAC systems. In this thesis, a rule-based SDG technique [Kramer and Palowitch, 1988] is developed for fault diagnosis in common HVAC systems. The inference logic is designed in consideration of the characteristics of HVAC control systems. By converting the digraph into a concise set of rules with quantitative descriptions derived from physical models, the reliability of diagnosis is increased and the threshold sensitivity is reduced. The performance of the inference logic is verified with tests in an HVAC system.

6.4 Diagnosis based on causal search of the fault origin

The causal search method consists of two major steps, the construction of a digraph and the conversion of the SDG into applicable rules to be implemented in an automatic FDD program. This section introduces the procedure for deriving expert rules for fault diagnosis based on semantic analysis by following the development by Tzafestas [Patton et al., 1989].

6.4.1 Construction of the digraph

A digraph is the graphical representation of the fault simulation process. In the causal search method, it is always assumed that a fault is generated from a single node, i.e., the root node, which is the source of all the consequent abnormal symptoms. Hence the digraph usually starts with a single root node, from which branches representing potential influences on other components lead the pathways of fault propagation to the next nodes that describe the responses of these components. The directed branches are extended following the system's functional structure until the ultimate node that represents the model of the monitored equipment is reached. The status of the root node is classified as "+1" (too high) for deviations in the positive direction, "-1" (too low) for deviations in the negative direction, and "0" (normal) for no deviation or deviations within tolerance.
The plus sign "+" marked on a branch indicates that with an increase of the value in the current node, the value in the next node increases or remains unchanged, while the minus sign "-" means the next node changes in the negative (decreasing) direction or remains unchanged. For example, "a positive change originating in Component A leads to an increase in Component B and consequently a decrease in Component C" can be simply represented by a digraph as shown in Fig. 6.3.

Figure 6.3. Basic elements in the signed directed graph.

Note that a time delay might occur in the responses of components along a pathway. Hence the status of the next node may or may not be immediately affected by the current node. As in the above example, if a time delay occurs between nodes B and C, then the current status of C might be -1 or 0, though an increased B tends to result in a decrease in C. In addition, measurements are assumed fast enough that a variable representing a physical parameter at one node and its measured value can be lumped as a single node, which makes it convenient to represent the offset or deviation caused by a fault, as illustrated in Fig. 6.4, where the subscript "s" represents the value measured by a sensor and "e" indicates the deviation between the measured and the design values. Fig. 6.4 also shows a further simplification of the digraph for diagnosis: nodes representing unmeasured elements that are not of interest in the simulation of the physical process can be removed from the digraph.

Figure 6.4. Node merge and elimination in an SDG. (a) original digraph; (b) simplified digraph after lumping of variables A and D with their measurements As and Ds, respectively, and removal of unmeasured nodes B and C.
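The branch-sign semantics can be read as a small graph program. The sketch below is an illustrative reading of the SDG, not code from the thesis: it propagates a root deviation through signed branches, ignoring time delays. With the Fig. 6.3 example of a positive branch from A to B and a negative branch from B to C, a +1 deviation at A yields +1 at B and -1 at C.

```python
# Qualitative states: +1 (too high), -1 (too low), 0 (normal).
branches = {               # (from, to): branch sign, per the Fig. 6.3 example
    ("A", "B"): +1,
    ("B", "C"): -1,
}

def propagate(root, deviation, branches):
    """Expected eventual qualitative state of every node reachable from the
    root, assuming full propagation (time delays ignored, so no node is
    left at 0 along an affected pathway)."""
    states = {root: deviation}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for (src, dst), sign in branches.items():
            if src == node and dst not in states:
                states[dst] = sign * states[node]   # "+" keeps, "-" flips
                frontier.append(dst)
    return states
```

In a real diagnosis the downstream nodes may still read 0 because of time delays, which is why the interpretations derived later allow truncated pathways.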
Although some controlled variables can be kept within the desired range even in the presence of a fault, as shown in Fig. 6.5(b) for the physical process in Fig. 6.5(a) [Patton et al., 1989], where the disturbance seems to have been "eliminated" by the feedback loop, extra effort is always expended to compensate for it. For example, a fouled cooling coil must be compensated for by a higher chilled water flow rate or a lower chilled water temperature, both causing excessive consumption of pump or chiller energy. This indicates that the feedback control may not be able to completely compensate for the fault in the whole digraph, especially for an uncontrolled variable or for a controlled variable when the offset is large enough to override the feedback effect. In the construction of a digraph, for the continuation of fault propagation along the directed pathways, feedback branches between two adjacent nodes should be removed from the simulation tree of the digraph for the convenience of inference, as shown in Fig. 6.5(c).

Figure 6.5. Control loop in digraph presentation. (a) the original digraph with a feedback loop; (b) loop with disturbance cancellation - disturbance stopped; (c) loop with control saturation - disturbance transported.

Fig. 6.6 demonstrates a typical process for the construction of a digraph for diagnosis with a negative deviation in the root node A. The digraph for the original physical process is shown in Fig. 6.6(a) [Patton et al., 1989]. After lumping of nodes in Fig. 6.6(b), elimination of the unmeasured variables in Fig. 6.6(c), and removal of the feedback branches in Fig. 6.6(d), a complete set of all the possible disturbance pathways is presented as interpretations I-1 through III-2. Note that because diagnosis is finally conducted with the number of violations, there should be no repeated node in one interpretation.
Since a fault may not affect all the measurements within the sampling interval in practice due to the time delay in the responses of components, i.e., the effect may not propagate throughout the whole pathway, it is usually necessary to derive a list of possible interpretations instead of a smaller set of the most complete pathways, such as A-C-D-F-G. In the thermal process of an HVAC system, for example, time delay must be considered to obtain an interpretation with "simultaneous" violations in different components, and the pathway may stop at some point depending on the relationship between the delay and the sampling interval involved. This characteristic of the SDG technique also improves the flexibility in diagnosis when the number of accessible components changes within a system or among systems with similar structures. It should be noted that although the SDG structure for future diagnosis is based on obtainable measurements or control signals in the system, the variable represented by the root node in an SDG is usually not accessible; identifying it is exactly the task of a fault diagnosis program. In a system with interactive components, the observed abnormal behavior of a component may originate from more than one part of the system. Therefore, more violations of constraints generally result in more reliable or more accurate identification of the fault source, which means longer pathways or more nodes in an SDG. With a given number of accessible components, the diagnostic resolution can be improved when the obtainable measurement is closer to the fault source. For example, in Fig. 6.6, interpretation I-3 (A-Ce-De) yields a more reliable result than the interpretation A-De-Ee. This is because the abnormal behavior of an accessible component is more likely to be shared by other fault origins when it is separated from the source by more components along the flow path, i.e., farther down a branch from the root node.
This also indicates that an interpretation should not be simplified by removing intermediate nodes when measurements for these components are available. For example, A-Ce-De in the SDG may be interpreted as A-Ce and A-Ce-De but not A-De. In summary, a typical process to establish a complete digraph consists of the following procedures:
(a). Determine the root node from analysis of the system;
(b). Construct the complete digraph with all potential pathways from the root node as well as the nodes for all affected components in the propagation of the deviation;
(c). Lump the nodes and their measurements if appropriate;
(d). Remove the non-measured nodes except the root node from the digraph;
(e). Remove the feedback branches from the digraph.

Figure 6.6. Demonstration of the SDG inference procedure. (a)-(d): simplifications of the SDG model; I-1 through III-2: interpretations of potential pathways for the digraph.

Based on the knowledge about the structure and the basic control logic of a system, the SDG technique provides an efficient way to describe the semantic network of fault simulation. However, for automatic execution of diagnosis, it must be converted to certain forms that can be recognized by a computer program. In the causal search technique, one flexible and computationally efficient way for the conversion is the rule-based SDG method, in which all potential pathways for fault propagation are converted to a set of rules [Kramer and Palowitch, 1988].

6.4.2 Rule development from the digraph

Potential deviations along the pathways in a digraph can be represented by a set of signed Boolean series by assigning +1 or -1 to deviations (sensor - setpoint) in the positive or negative direction and 0 to the neutral or normal state.
For a branch X -(+)-> Y in a simulation tree, the relationship is expressed as: if X = +1, then Y = +1 or Y = 0 (due to the unspecified time delay for the propagation from component X to Y); if X = 0, then Y = 0; and if X = -1, then Y = -1 or Y = 0. The relationship for a branch X -(-)-> Y can be derived similarly. Table 6.1 shows the Boolean representation of the interpretations in Fig. 6.6 (the subscript e is removed for convenience).

Table 6.1. Boolean node patterns for the digraph in Figure 6.6.

Interpretation group I   Interpretation group II   Interpretation group III
 C   D   F   G            C   D   F   G             C   D   F   G
-1  -1  -1  -1            0   0  -1  -1            -1   0  -1  -1
-1  -1  -1   0            0   0  -1   0            -1   0  -1   0
-1  -1   0   0
-1   0   0   0

It can be seen that the node patterns can be directly used for diagnosis by comparing them with the sampled data from the monitored system. Also, by checking through the signed Boolean series, duplication of rules can be avoided. However, for a large system with multiple faults, such a table may create an extremely large set of codes and hence become very awkward to use in the development of an FDD scheme. Therefore, the format of the presentation of the deduction logic must be simplified for practical use. In an inference system developed in a computer program, this problem can be solved by utilizing a pair of logical predicates, designated p and n here, in the standard "IF ... THEN ..." rules to form a module or subroutine as shown below.

(p X Y) <=> (X = Y) or (|X| > |Y|)
(n X Y) <=> (X = -Y) or (|X| > |Y|)

Thus Interpretation group I becomes: A -(+)-> C => (p A C), C -(+)-> D => (p C D), D -(+)-> F => (p D F), F -(+)-> G => (p F G). From the above definition and the rule table, it can be deduced that the p and n predicates along a branch should be connected with the AND logic sequentially. For branches starting from a common node, such as in Interpretation III-2 in Fig. 6.6, the AND logic should also be used. Hence, the rules in Table 6.1 can be represented as follows.
Interpretation group I: (p A C) AND (p C D) AND (p D F) AND (p F G).
Interpretation group II: (p A F) AND (p F G).
Interpretation group III: (p A C) AND (p A F) AND (p F G).

Note that the unmeasured component A cannot be used in the conditions for inference. But it is easy to prove that (X = -1) AND (p X Y) is equivalent to Y ≠ +1. Therefore, in the form of the standard IF...THEN... rule:

Interpretation group I: IF ((C ≠ +1) AND (p C D) AND (p D F) AND (p F G)) THEN A = -1 is a possible fault source.
Interpretation group II: IF ((F ≠ +1) AND (p F G)) THEN A = -1 is a possible fault source.
Interpretation group III: IF ((C ≠ +1) AND (F ≠ +1) AND (p F G)) THEN A = -1 is a possible fault source.

For a given digraph, all interpretations should be incorporated into one "IF...THEN..." clause. This not only provides a more concise form for diagnosis but also helps to prevent redundancy of node appearance in one interpretation. However, special care should be taken in developing or combining interpretations when a converging node exists in a digraph. A converging node indicates multiple causes for the deviation of a component. Although two diverging branches are integrated by an AND, two converging pathways should be combined with an OR in order to avoid repeated clauses in the final IF...THEN... rule. For example, at the converging node F in Fig. 6.6, an OR needs to be used in combining the branches, i.e., the complete form should be

(p A C) AND (p C D) AND ((p A F) OR (p D F)) AND (p F G),

or,

IF ((C ≠ +1) AND (p C D) AND (p F G) AND ((F ≠ +1) OR (p D F))) THEN A = -1 is a possible fault source.

6.4.3 Direct rule development from a digraph

With the logical form described in the previous sections, rules for diagnosis can be obtained in a concise format. However, for applications in a real system, rule derivation based on the fault table may become a huge list of possible pathways.
The situation can be further complicated when there are multiple possible causes of the deviation in one component, because converging nodes often lead to a large number of combinations among different pathways. For example, with the digraph containing two converging nodes shown in Fig. 6.7(a) [Patton et al, 1989], eighteen combinations can be derived by following the procedures presented in Secs. 6.4.1 and 6.4.2. Therefore, a more efficient method needs to be used for the development of rules. In fact, from the above discussion, it can be found that if the network is properly disassembled, the rules can be derived directly from a partially developed digraph. In general, converging pathways represent a choice in the construction of the simulation tree, where one of the paths is used to form an interpretation group. By using all combinations of these choices, one can obtain the full set of interpretations. Thus the combined set of interpretations can be represented by making the choices explicit, instead of enumerating each interpretation. With the two converging nodes D and E in the digraph in Fig. 6.7(a), four interpretation groups can be obtained as shown in Fig. 6.7(b)-(e). The combined set of interpretations can be represented by the following set of branches:

A -+-> B AND A -+-> C AND C -+-> E AND (B -+-> D OR C -+-> D) AND (D -+-> F OR E -+-> F),

which can be transformed to the following rule:

IF ((p A B) AND (p A C) AND (p C E) AND ((p B D) OR (p C D)) AND ((p D F) OR (p E F))) THEN A = +1 is a possible fault source.

It can be seen from the above discussion that the number of interpretations needed for further derivation of the rules from the given digraph has been significantly reduced.

Figure 6.7. Interpretation development for a digraph with converging branches. (a) the partially developed digraph; (b)-(e) simplified presentations of interpretations.
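As a concrete illustration, the p/n predicates and the folded rules above can be expressed in a few lines of Python. This is a minimal sketch, not code from the thesis; the node status vectors passed in the examples are assumed for illustration.

```python
# Minimal sketch (not thesis code) of the p/n predicates of Sec. 6.4.2 and
# of the folded rules for Fig. 6.6 and Fig. 6.7(a). Node statuses: {-1, 0, +1}.

def p(x, y):
    """(p X Y): X = Y, or |X| > |Y| (fault not yet propagated to Y)."""
    return x == y or abs(x) > abs(y)

def n(x, y):
    """(n X Y): X = -Y, or |X| > |Y|."""
    return x == -y or abs(x) > abs(y)

def rule_fig66(C, D, F, G):
    # IF ((C != +1) AND (p C D) AND (p F G) AND ((F != +1) OR (p D F)))
    # THEN A = -1 is a possible fault source
    return C != +1 and p(C, D) and p(F, G) and (F != +1 or p(D, F))

def rule_fig67(A, B, C, D, E, F):
    # IF ((p A B) AND (p A C) AND (p C E) AND
    #     ((p B D) OR (p C D)) AND ((p D F) OR (p E F)))
    # THEN A = +1 is a possible fault source
    return (p(A, B) and p(A, C) and p(C, E)
            and (p(B, D) or p(C, D)) and (p(D, F) or p(E, F)))

# First pattern of Interpretation group I in Table 6.1:
print(rule_fig66(-1, -1, -1, -1))        # True
# Fault A = +1 fully propagated along A-B-D-F, the A-C-E path still neutral:
print(rule_fig67(+1, +1, 0, +1, 0, +1))  # True
```

Note how the single folded clause accepts the patterns of all interpretation groups in Table 6.1, so no enumeration of pathways is needed at run time.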
6.5 SDG rule development for fault diagnosis with power input of components

As discussed in the previous sections, for diagnosis with the causal search technique, rules can be established based on digraphs for the potential faults. In HVAC systems, the modeling process for diagnosis involves the development of the digraph for a given fault and of the expert rule set, with reference to the system configuration and the control sequence. The construction of a digraph usually starts with a root node that represents the originally faulty component in an HVAC system. In principle, there can be many defective components, as an HVAC system generally consists of a considerable amount of mechanical equipment and parts as well as electrical devices. In practice, however, it has been learned that faults in a given type of HVAC system usually occur in some typical forms [ANNEX 25]. In addition, although a fault may occur at any component level, a diagnosis program can reasonably be expected to reach only the level where the faulty part is directly related to a measured or modeled parameter. For example, a slipping fan belt leading to less power consumption (modeled) at high fan speed (measured) can be detected by the fan power function. But the slipping effect itself may be caused by more than one defect, such as a loose screw in the mechanical transmission system, reduction of belt tension after a long time of operation, etc. Fault origins at such a deeper level are generally invisible to the FDD program because the related physical parameters cannot feasibly be measured. Therefore, fault origins in HVAC systems in this research are identified as defects or malfunctions that directly lead to excessive variations in the observed parameters. Since power consumption is the major index of this FDD method, the diagnosis function is usually activated by violations of the power criteria.
As the ultimate effect of system malfunctions, power consumption is generally the node at the end of a digraph for a given fault. Hence diagnosis based on the SDG rules is essentially the inverse of the modeling process, which predicts the response of a plant to a given operating state. In the SDG rules, the status of the end node is usually determined with a component's power model obtained by the methods demonstrated in the previous chapters. As discussed in Chapters 3 and 4, abnormal effects in power consumption can be defined in terms of the magnitude of a component's power input, its cycling frequency, or the standard deviation of the sampled power series. Development of a digraph for a given fault requires finding the possible connections from the fault origin as the root node to the power response as the end node. From the fault origin, all potential branches in the propagation of the variation in the root node, as well as the nodes representing all affected components in the pathways that can be measured or modeled, should be included in the digraph. Generally, in the rules for diagnosis, nodes with constraints or thresholds should be considered first in order to detect the maximum number of violations. For example, as a controlled variable with a setpoint and a tolerance range, the indoor air temperature is a useful index and is usually available from the control system. As discussed before, a major advantage of using power consumption as the criterion for fault detection is that the detector is able to find abnormal operation even if the other physical parameters appear normal under their local feedback controls. Moreover, some parameters used in the power models are measured but not controlled or modeled.
Although constant thresholds can be applied to those parameters under some circumstances, they cannot generally be flagged as abnormal, i.e., given a logical value of -1 or +1, when used in the rules, and hence may block the diagnosis for the next variable in the p or n pairs. For example, in a VAV system, the supply airflow rate is measured for diagnosis with a fan power model but not controlled to a setpoint. Hence it cannot be found in the fault state with a logical value of -1 or +1 unless it is lower than the threshold for the minimum airflow rate (at status -1) that may be used to avoid insufficient air supply in case of failure of some key components in a system. Therefore, in this thesis, the logic inference mechanism is developed in a more compatible structure as follows. For a node X with or without feedback control,

(p X Y) <=> (X*Y ≥ 0)
(n X Y) <=> (X*Y ≤ 0)

Or, in the pathway form,

X -+-> Y: Y = 0 or +1 if X = 0 or +1, or Y = 0 or -1 if X = 0 or -1;
X ---> Y: Y = 0 or -1 if X = 0 or +1, or Y = 0 or +1 if X = 0 or -1.

This inference logic accounts for the potential time delay between nodes as well as the local compensations in the rule development. Alarms can thus be issued for different levels of severity of a fault. Abnormality in a non-controlled parameter usually gives a warning if the fault is not critical to the occupants' or the system's health. Urgent alarms are issued for immediate attention if the deviation of a controlled variable X is found beyond its allowed range, or if an abnormality critical to the system's health is detected in a non-controlled parameter. For example, as a controlled variable of HVAC systems, the indoor air temperature should always be maintained within a specified range for the occupants' health or the needs of some production processes.
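The relaxed predicates of this compatible structure can be sketched as follows (a minimal illustration, not code from the thesis); with statuses confined to {-1, 0, +1}, the product test rejects only outright sign conflicts.

```python
# Minimal sketch (not thesis code) of the relaxed predicates used in the
# compatible inference structure. Node statuses are confined to {-1, 0, +1},
# so the sign of the product is all that matters.

def p(x, y):
    """(p X Y) <=> X*Y >= 0: reject only opposite (nonzero) signs."""
    return x * y >= 0

def n(x, y):
    """(n X Y) <=> X*Y <= 0: reject only equal (nonzero) signs."""
    return x * y <= 0

# A positive branch X -+-> Y where the effect reaches Y before X deviates:
print(p(0, +1))    # True: accepted (time delay or local compensation at X)
print(p(-1, +1))   # False: sign conflict, pathway rejected
```

Unlike the strict form of Sec. 6.4.2, the pair (X = 0, Y = +1) is accepted here, which is exactly what allows a non-controlled parameter reading 0 to avoid blocking the p or n chain.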
A typical fault in a non-controlled variable that may call for prompt intervention is the unstable control of a component, which not only deteriorates the quality of the control system but may also damage the related mechanical parts. Such a fault can usually be found from oscillation of the data series available for analysis by computing the standard deviation of the samples. Fig. 6.8 shows the digraph for the diagnosis of an unstable fan control.

Figure 6.8. Digraph for the diagnosis of a supply fan with unstable control.

With an inappropriately tuned gain Ku.sf, the variances (or the standard deviations) of the fan speed control signal usf.v and the supply fan power Psf.v tend to increase. Hence the SDG rule can be written as

IF (Psf.v = +1) THEN alarm for oscillation in the power input of the supply fan
IF ((usf.v ≠ 0) AND (p usf.v Psf.v)) THEN Ku.sf (+1) is a possible fault origin.

If only power input is available for diagnosis, the rule can be simplified as

IF (Psf.v = +1) THEN alarm for oscillation in the power input of the supply fan.

It should be noted that a branch in a complete digraph represents a potential pathway of fault propagation, whereas a branch in the simulation tree to be used for diagnosis is the specific active pathway along which the fault is expected to propagate. With the SDG diagram, this means not all the pathways or nodes in the digraph can be used at the same time for a given fault. In this thesis, the control of the active pathways in a digraph is realized by first choosing the related operating modes based on the general control logic of an air conditioning system.
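The oscillation check behind a variance node such as Psf.v can be sketched as below; the window of power samples and the threshold sigma_max are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch (not thesis code) of the oscillation check behind a
# variance node such as Psf.v: the standard deviation of a window of power
# samples is compared with a trained threshold. The sample values and the
# threshold sigma_max (kW) below are assumed for illustration.
import math

def variance_status(samples, sigma_max):
    """Return +1 if the sample standard deviation exceeds sigma_max, else 0."""
    m = sum(samples) / len(samples)
    sigma = math.sqrt(sum((x - m) ** 2 for x in samples) / len(samples))
    return +1 if sigma > sigma_max else 0

steady = [5.0, 5.1, 4.9, 5.0, 5.1, 4.9]    # stable supply fan power, kW
hunting = [3.0, 7.0, 2.5, 7.5, 3.0, 7.0]   # oscillating control

print(variance_status(steady, sigma_max=0.5))    # 0: normal
print(variance_status(hunting, sigma_max=0.5))   # +1: alarm for oscillation
```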
The operating mode of an HVAC system is usually determined by the outdoor air temperature and the corresponding type of thermal process involved, which is typically defined as heating, free cooling with outdoor air, mechanical cooling with 100% outside air, or mechanical cooling with minimum outdoor air, named modes 1, 2, 3, and 4, respectively, in this thesis. For example, a leaky recirculation air damper in an HVAC system tends to result in lower heating energy consumption in the heating mode, higher cooling energy input in the modes of free cooling with outdoor air and mechanical cooling with 100% outdoor air, and lower cooling power input in the mode of mechanical cooling with minimum outdoor air. In practice, it is difficult to find the leaky recirculation air damper from the power input of the boiler or the chiller when the damper is wide open or modulating by the control command. Therefore, this fault should be detected and diagnosed with the chiller power input only when the system is operating in mode 3. The effects of this fault on the chiller power input are listed in Table 6.2.

Table 6.2. Effects of a leaky RA damper on the chiller power in different operating modes.

Mode  Outside temperature region   RA damper   Chiller power (normal)  Chiller power (leaky)
1     Toa < Tbp                    open        P1                      P11 = P1
2     Tbp < Toa < Tsa - ΔTsf       modulating  P2 = P1                 P21 > P2
3     Tsa - ΔTsf < Toa < Tra       closed      P3                      P31 > P3
4     Tra < Toa                    open        P4                      P41 = P4

Toa, Tsa, and Tra are the temperatures of the outdoor, supply, and return air, respectively, while Tbp represents the balance point temperature for switching between the minimum and the modulating amounts of outdoor air, and ΔTsf is the approximate temperature rise of the air across the supply fan. If the above temperature measurements are not available, the supply and return air temperature setpoints Tspt.sa and Tspt.ia can be used in place of Tsa - ΔTsf and Tra, respectively.
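The mode switch of Table 6.2 can be sketched as a simple classifier; this is an illustrative sketch, and the temperatures in the example call are assumed values, not data from the thesis.

```python
# Minimal sketch (not thesis code) of the operating-mode switch based on the
# outdoor air temperature regions of Table 6.2. The temperatures in the
# example call (deg F) are assumed for illustration.

def operating_mode(Toa, Tbp, Tsa, Tra, dTsf):
    """Modes: 1 heating, 2 free cooling with outdoor air, 3 mechanical
    cooling with 100% outside air, 4 mechanical cooling with minimum
    outdoor air."""
    if Toa < Tbp:
        return 1
    if Toa < Tsa - dTsf:
        return 2
    if Toa < Tra:
        return 3
    return 4

# Assumed values: Tbp = 40, Tsa = 55, Tra = 75, fan temperature rise dTsf = 2
print(operating_mode(65, 40, 55, 75, 2))   # 3: the leaky RA damper is visible
```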
In mode 3, since the outdoor air temperature is lower than the return air temperature, 100% outside air is used for cooling to save energy and the recirculation air damper is closed by the control system. If the recirculation air damper cannot be fully closed due to a mechanical failure, as shown by the digraph in Fig. 6.9, the power input of the chiller Pch tends to increase with the warm return air Qra from the leakage Ad.ra to be processed by the cooling coil.

Figure 6.9. Digraph for the diagnosis of a leaky recirculation air damper.

In principle, the leaky recirculation damper can be identified from the chiller power model as long as the system is operating in mode 3. However, in practice, it has been found that the fault can be reliably detected only when the outdoor air temperature Toa is closer to the supply air temperature Tsa than to the return air temperature Tra. This is because the uncertainty in the outdoor air temperature, the increase of the air temperature in the return duct, and the control deadband of the room air temperature must be considered in order to eliminate alarms for deviations due to random disturbances. Such a modification of the temperature threshold for FDD is especially useful when the chiller power model is not sensitive to load changes. For example, with a two-stage reciprocating chiller in the test building, a dimensionless threshold for the outdoor air temperature was found useful for the FDD of this fault, as shown in the following SDG rule used for the diagnosis of this fault. In fact, the extra clause for the outdoor air temperature not only helps to reduce the false alarm rate due to random disturbances but also greatly improves the resolution of the leaky recirculation air damper in spite of its short pathway, which may be shared by other potential faults that also lead to excessive chiller power input.
IF (Pch = +1) THEN alarm for high power input of the chiller
IF ((Qra ≠ 0) AND (p Qra Pch) AND (0 < Toa - Tspt.sa < ht.oa*(Tspt.ia - Tspt.sa))) THEN Ad.ra (+1) is a possible fault origin

The chiller power model Pch can be established as a polynomial function if continuous load control is used, or as a threshold on the on/off cycling frequency when stepwise control is involved. In the test building, a threshold of 35 minutes as the lower limit for the interval between off and on was used for the reciprocating chiller with a CSD motor. ht.oa is a trained dimensionless threshold for (Toa - Tspt.sa)/(Tspt.ia - Tspt.sa); for the test building, a value of 20% was used for ht.oa.

In addition to the operating modes, the status of a component may also be determined by the system's operating schedule. For example, a leaky cooling coil valve is expected to be visible only when the system is in mode 1 or 2 according to Table 6.3. However, this does not necessarily mean that the FDD for this defect must wait until the outdoor air temperature switches the system to mode 1 or 2. In some HVAC systems, when the air handling unit is shut down in mode 3 or 4, the cooling coil valve is also closed and the chiller may run at a constant low speed or cycle on and off regularly to prevent heat accumulation in the equipment. If the valve is leaky, with chilled water to be processed by the cooling coil, the chiller will run at a higher speed or cycle more frequently than under normal conditions with a fully closed valve. In the test building, a chiller with a CSD motor cycled at an interval of over 38 minutes with a fully closed valve when the AHU was shut down during late night (after 10 pm) and early morning (before 7 am). With a leaky valve, the interval was found to be shorter than 35 minutes. Fig. 6.10 shows the digraph for this fault, followed by the corresponding rule set.

Figure 6.10. Digraph for the diagnosis of a leaky cooling coil valve.
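The extra outdoor-air clause of the leaky-RA-damper rule can be sketched as follows. In this sketch, the supply air setpoint of 55 °F is an assumed value, while the indoor setpoint of 72.5 °F and the 20% threshold follow the test-building figures quoted in the text.

```python
# Minimal sketch (not thesis code) of the leaky-RA-damper clause with the
# dimensionless outdoor-air window. Tspt_sa = 55 F is an ASSUMED supply air
# setpoint; Tspt_ia = 72.5 F and ht_oa = 0.20 follow the test-building
# values quoted in the text. Qra and Pch are node statuses in {-1, 0, +1}.

def toa_in_window(Toa, Tspt_sa, Tspt_ia, ht_oa=0.20):
    """0 < (Toa - Tspt.sa)/(Tspt.ia - Tspt.sa) < ht.oa"""
    theta = (Toa - Tspt_sa) / (Tspt_ia - Tspt_sa)
    return 0.0 < theta < ht_oa

def leaky_ra_damper(Qra, Pch, Toa, Tspt_sa=55.0, Tspt_ia=72.5):
    """True if Ad.ra(+1) is a possible fault origin."""
    p = lambda x, y: x * y >= 0        # relaxed predicate of Sec. 6.5
    return Qra != 0 and p(Qra, Pch) and toa_in_window(Toa, Tspt_sa, Tspt_ia)

print(leaky_ra_damper(Qra=+1, Pch=+1, Toa=58.0))   # True: Toa near Tsa
print(leaky_ra_damper(Qra=+1, Pch=+1, Toa=70.0))   # False: Toa near Tra
```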
IF (Pch = +1) THEN alarm for high power input of the chiller
IF ((Qcw ≠ -1) AND (p Qcw Pch) AND (usf = usf.min)) THEN Av.cc (+1) is a possible fault origin

where Av.cc and Qcw represent the leakage and the flow rate of the chilled water. The minimum fan speed control signal usf.min is used as an indicator of the shutdown of the air handling unit; usf.min is typically used for a VSD fan, as discussed in Chapter 5.

Table 6.3. Effects of a leaky CC valve on the chiller power in different operating modes.

Mode  Outside temperature region   Cooling coil valve  Chiller power (normal)  Chiller power (leaky)
1     Toa < Tbp                    closed              P1                      P11 > P1
2     Tbp < Toa < Tsa - ΔTsf       closed              P2 = P1                 P21 > P2
3     Tsa - ΔTsf < Toa < Tra       modulating          P3                      P31 = P3
4     Tra < Toa                    wide open           P4                      P41 = P4

From the above analysis, it can be seen that the detectability of a fault changes significantly with the status of the related components, which is determined by the operating mode and the control scheme. For example, it is impossible to detect a leaky cooling coil valve in summer by power analysis when the chiller is turned off. With the extended logic design, nodes representing equipment that is not involved in the current operation can be skipped by assigning their values to 0, based on switches between operating modes or the control design, so as to facilitate the execution of the FDD program. By allowing the state of a node to switch with a given operating mode or condition, the compatible inference structure greatly improves the flexibility of the program, both for one system operating under various conditions and for different systems with various information available for diagnosis under similar control logic. Examples in the following discussions are generally based on fault diagnosis implemented under summer conditions. SDG rules can be developed similarly for other conditions.
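The cycling-frequency form of the chiller power criterion can be sketched as an interval check; the switch-time series below are invented for illustration, while the 35-minute threshold is the trained value reported for the test building.

```python
# Minimal sketch (not thesis code) of the cycling-frequency criterion for a
# stepwise-controlled chiller: with the AHU shut down, intervals between
# switch events shorter than the trained 35-minute threshold suggest chilled
# water is still being drawn, e.g. by a leaky cooling coil valve. The switch
# times (minutes) below are invented for illustration.

def min_cycle_interval(switch_times):
    """Smallest interval between successive chiller switch events."""
    return min(t2 - t1 for t1, t2 in zip(switch_times, switch_times[1:]))

def chiller_cycling_status(switch_times, threshold=35):
    """Return +1 for abnormally frequent cycling, else 0."""
    return +1 if min_cycle_interval(switch_times) < threshold else 0

normal = [0, 40, 79, 120]   # 39-41 min apart: valve fully closed
leaky = [0, 30, 62, 92]     # 30-32 min apart: suspect leakage

print(chiller_cycling_status(normal))   # 0
print(chiller_cycling_status(leaky))    # +1
```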
The resolution of the diagnosis output may also vary with the load conditions during different time periods of the same day. For example, during summer, a stuck-closed recirculation air damper in a typical HVAC system can be more positively identified when the condition Qsa = -1 (insufficient air supply) is met after the end of a work day, when the outside air damper is closed, which helps to distinguish the stuck-closed recirculation air damper from other faults that also result in excessive fan power input. If Qsa = -1 is found throughout a day when the system is in operation, then both the outdoor and recirculation air dampers are diagnosed as stuck-closed. Therefore, for better recognition of a fault, appropriate time constraints corresponding to load conditions should be utilized in the inference rules based on the system's control logic. Fig. 6.11 illustrates the SDG rule development for a stuck-closed recirculation air damper.

Figure 6.11. Developed digraph for the fault of a stuck-closed recirculation air damper.

Representing deviations of the physical variables from their setpoints or normal values, the nodes Dra, Qsa, Tia, Pch, Ps, usf, and Psf.Q describe the potential abnormal status of the recirculation air damper position, the flow rate of the supply air, the room air temperature, the chiller power input, the duct static pressure, the fan motor speed, and the supply fan power model. Note that the indoor air temperature Tia and the static pressure Ps in the air duct, both in bold letters, are generally controlled variables in air conditioning systems. As a result of the failure in mechanical transmission, a stuck-closed recirculation air damper will lead to a reduced supply air flow rate and an increased room air temperature. With a lower air flow rate, the cooling load and hence the power consumption of the chiller will decrease.
Meanwhile, the extra pressure drop across the recirculation damper in the air loop tends to cause a lower static pressure in the duct, which then calls for a higher speed and hence more power input of the fan. The rule for the diagnosis of this fault can be derived as follows.

IF (Psf.Q = +1) THEN alarm for excessive power input of the supply fan with air flow rate
IF (Ps = -1) THEN alarm for low static pressure
IF (Tia = +1) THEN alarm for high room temperature
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Qsa = -1) THEN alarm for low flow rate of the supply air
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (toffwork < t < tahuoff) THEN alarm for stuck-closed recirculation air damper
IF ((Qsa ≠ +1) AND (n Qsa Tia) AND (p Qsa Pch) AND (Ps ≠ +1) AND (n Ps usf) AND (p usf Psf.Q) AND (|Pch| + |Ps| + |Tia| + |usf| + |Qsa| ≠ 0)) THEN Dra (-1) is a possible fault origin

where Qsa = -1 indicates the supply air flow rate is lower than the normal operating range and usf = +1 means 100% fan speed. Two time constraints, toffwork and tahuoff, were used for the end of the normal occupied hours and the time to turn off the air handling units each day. In the test building, to reduce the cooling load caused by the hot outdoor air in summer, the outdoor air damper was closed and the recirculation air damper was set fully open between toffwork and tahuoff. In general, the fewer the neutral values (0's) among the nodes, the more positive is the diagnosis output.
For a less intrusive and hence less confident diagnosis without the information about Tia and Ps, the rule can be simplified as

IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (Qsa = -1) THEN alarm for low flow rate of the supply air
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (toffwork < t < tahuoff) THEN alarm for stuck-closed recirculation air damper

The components and the structure of a digraph are generally determined by the control logic of the system and the priority of actions. As an uncontrolled variable and a major means by which the control system compensates for deficiencies in other equipment, the power input of the supply fan responds quickly to this fault. In addition, power input is the primary index in this approach. Therefore, the rule for the supply fan power should be the first condition to be checked, i.e., this rule is the outermost layer in this diagnosis structure. Rules for other variables, if available, should then be nested under this condition. Note that although the chiller power is also affected, in the test building the power model for the two-stage reciprocating chiller is not sensitive to load changes during the occupied period of the building and is hence not reliable for detection of this fault. Therefore, the rule for the chiller power is assigned the lowest priority in the rule set. If the chiller is sensitive to load changes, such as a chiller driven by a VSD motor, then the full diagnosis can be nested in the clause for the chiller power. In practice, if a node is not expected to respond to a fault with a recognizable pattern under certain conditions, the node can be assigned a default value of zero.
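The nested, outermost-first evaluation described above can be sketched as follows; this is an illustrative sketch, and the function name and the work-schedule hours are assumptions, not values from the thesis.

```python
# Minimal sketch (not thesis code) of the nested, outermost-first evaluation
# of the simplified rule set for a stuck-closed recirculation air damper.
# Statuses are in {-1, 0, +1}; t and the schedule hours (17 = toffwork,
# 22 = tahuoff) are ASSUMED for illustration.

def diagnose(Psf_Q, Qsa, Pch, t, t_offwork=17, t_ahuoff=22):
    alarms = []
    if Psf_Q == +1:                     # outermost layer: most sensitive node
        alarms.append("excessive supply fan power")
        if Qsa == -1:
            alarms.append("low supply air flow rate")
            if Pch == -1:
                alarms.append("reduced chiller power")
            if t_offwork < t < t_ahuoff:
                alarms.append("stuck-closed recirculation air damper")
    return alarms

# Fault checked after work hours, when the outdoor air damper is closed:
print(diagnose(Psf_Q=+1, Qsa=-1, Pch=-1, t=18))
```

Each satisfied clause appends an alarm before descending, so early warnings are issued even when the innermost identification is not yet reached.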
It can be seen from this example that, for a given fault, the nesting structure must be designed carefully based on the control logic as well as the components' characteristics in the given system, so that the execution of the outer rules never causes diagnoses to be missed by the inner rules. If a node with lower sensitivity is used in an outer layer, then its status may block the detection by nodes in the inner layers that are more sensitive to the fault, as shown in the above example. System deterioration and energy waste may become even worse with an inappropriate rule structure if the other nodes remain neutral under the compensation provided by extra power consumption, especially with a degradation fault. For example, as shown by Fig. 6.12(a), a pressure sensor offset in summer may lead to an increase of the power input of the supply fan and the chiller as well, but may not cause further violations at nodes maintained by local feedback control, such as the indoor air temperature. Although the abnormal operation can easily be found by the node for the fan power model, the excessive power input of the chiller may not be found until the fault becomes very serious if the chiller power model cannot produce a detectable pattern, as has been observed with the HVAC system in the test building. If the chiller power node were used in the outer layer, then its status would block the detection by the fan power node, which can be used to find the fault at an early stage. In order to prevent missed rule executions in a nested structure, which may be caused by node sensitivity, time delay between nodes, compensation by local feedback loops, and switches of modes, the diagnosis is designed in a nested structure that generally develops in depth with the increasing severity of the fault. Generally, in a given structure, the sensitivity of the nodes to the fault decreases while the confidence about the fault origin increases with more violations as the layers grow deeper.
Once an alarm is issued from an end node, it is recorded, and at the same time further violations are checked until a fault origin is identified. In case no possible cause can be matched to all the errors, the deviations are stored for further analysis by experts.

Figure 6.12. Developed digraphs for two typical faults in common air handling units. (a) static pressure sensor offset; (b) slipping fan belt.

Fig. 6.12 shows the digraphs for two other typical faults related to abnormal fan power input: offset in the static pressure sensor and slippage of the fan belt. usp, xt, and Psf.u define the status of the static pressure sensor output, the fan belt tension, and the fan power vs. motor speed model. Ps is equal to usp if the pressure sensor is in normal operation. Typically caused by a leak in the pneumatic signal tube, a pressure sensor offset requires a higher motor speed and hence excessive fan power consumption. When the fault becomes very serious and the VAV control for the air flow is saturated, more air will be processed by the cooling coil and then delivered into the conditioned space, leading to higher chiller power input and lower room air temperature. The rule can be developed as follows.

IF (Psf.Q = +1) THEN alarm for excessive power input of the supply fan with air flow rate
IF (Tia = -1) THEN alarm for low room air temperature
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF ((Qsa ≠ -1) AND (n Qsa Tia) AND (p Qsa Pch) AND (usp ≠ +1) AND (n usp usf) AND (p usf Psf.Q) AND (toffwork < t < tahuoff)) THEN usp (-1) is a possible fault origin

It should be noted that although Qsa increases with fan power, Qsa = +1 cannot be used as an independent clause because no upper limit is typically set for the supply air flow rate.
With the minimum amount of measurements and control signals, i.e., without Tia and usf, the rule can be simplified as

IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF (Qsa ≠ -1) THEN alarm for static pressure sensor offset

As a degradation fault, a slipping fan belt results in reduced resistance at the shaft and hence lower fan power input at a given motor speed, which can be represented by the fan power vs. motor speed model Psf.u. On the other hand, with the decreased transmission efficiency between the motor drive and the fan shaft, the motor needs to run at a higher speed, and hence with a higher power input, to provide the required air flow rate. If the air flow rate cannot be maintained, then the chiller power input will decrease with the reduced load and the room air temperature will increase due to insufficient supply air. All of this can be summarized in the following rule set.

IF (Psf.u = -1) THEN alarm for reduced power input of the supply fan with motor speed
IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Ps = -1) THEN alarm for low static pressure
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (Tia = +1) THEN alarm for high room temperature
IF ((Qsa ≠ +1) AND (n Qsa Tia) AND (p Qsa Pch) AND (Ps ≠ +1) AND (n Ps usf) AND (p usf Psf.Q)) THEN xt (-1) is a possible fault origin

Since the most direct and significant effect of a slipping fan belt is the reduced resistance, and hence the reduced effort needed to run the fan at a given speed, the supply fan power input is more sensitive to the motor speed than to the air flow rate. Therefore, in the inference structure for the slipping fan belt, the node for the fan power vs. motor speed model is used in the outermost layer, i.e., as the first essential condition to start the diagnosis of this fault.
With the minimum amount of measurements or control signals, i.e., without Tia, usf, Ps, and Pch, the rule can be simplified as

IF (Psf.u = -1) THEN alarm for reduced power input of the supply fan with motor speed
IF (Psf.Q = +1) THEN alarm for excessive input of the supply fan
IF (Qsa ≠ +1) THEN alarm for slipping fan belt

From the above analysis of fault origins, it can be seen that the enhanced structured SDG inference technique enables flexible yet reliable control of the output of the detection and diagnosis of defects and malfunctions according to the available information. Alarms are issued as soon as new abnormalities are detected, but identification of the specific cause is delayed until the complete rule set for the fault is met. Such a sequence control has two major advantages: maximum rule violations and minimum false/missed alarms with high-resolution diagnosis. First, by allowing alarms for each node with appropriate constraints in the pathways, faulty operation can be reported at an early stage. When a fault occurs, the power input of the affected equipment, as an uncontrolled variable, is alarmed by the most sensitive power model as the end node in an SDG earlier than the other variables. Although early alarms may not require immediate actions, they inform the operator of the health of the system and provide the flexibility for appropriate actions before the final alarm if necessary. When a controlled variable is found to shift beyond the tolerance from its setpoint, alarms are issued for immediate attention. Second, with the maximum number of violations, the unique inference rule set with an appropriately designed structure for each specific fault helps to ensure reliable diagnosis and enables recognition among faults that share common pathways at the upper levels and separate from each other as the pathways reach the lower levels toward the origin.
For example, in the above diagnosis of the stuck-closed recirculation air damper in the test building, the power consumption of the supply fan was first found to be about four times higher than the value expected by the model, starting at 8:26 a.m. At the same time, the static pressure was around 0.1 in., much lower than its setpoint of 1.2 in. In addition to a stuck-closed damper, the other two typical faults, static pressure sensor error and fan belt slippage, as shown in Fig. 6.12, may also result in excessive fan power consumption as a function of air flow rate and a low static pressure signal in an HVAC system. It can be seen from the digraphs that all three faults share some common abnormalities in the fan power vs. air flow rate model, the fan speed, and the static pressure in the duct. From 8:41 a.m., the indoor air temperature went over the upper limit of the expected range of 72.5 ± 2.5 °F and stayed around 80 °F, which eliminated the possibility of the pressure sensor fault. In practice, as the setpoints of the static pressure and the indoor air temperature were not met, early actions might be needed to remove the fault. Although these phenomena could be caused by either a stuck-closed damper or a slipping belt, a slipping fan belt, as a typical degradation fault, is not expected to lead to a sudden jump of such a large magnitude in the fan power input. Therefore, if the room is occupied and needs quick repair, the status of the dampers should be inspected first. For automatic identification of the fault origin, further violations are necessary to distinguish between a stuck-closed outdoor air damper and a stuck-closed recirculation air damper. In the test building, the full diagnosis was completed after 5 p.m., when the outdoor air damper was closed by the control system as the building was no longer occupied for the day and the fans only circulated the return air in the building until the AHU was shut down at 10 p.m.
With the closing of the outdoor air damper by the control signal, the flow rate of the supply air dropped significantly, which is not likely to happen with a slipping fan belt or a stuck-closed outdoor air damper, indicating the stuck-closed recirculation air damper as the error source. As the leaks around the outdoor air damper became the only openings for the incoming air, the alarm for extremely low air flow rate was activated and the fault origin was reported. In the test building, a value of 500 CFM was used as the lower limit of the supply air flow rate.

Fig. 6.13 illustrates the process of SDG rule development for a typical fault in the water loop of an air handling unit: a cooling coil fouled with a thickness δcc of sediments on the water side.

Figure 6.13. Developed digraph for the fault of a fouled cooling coil (water side).

Here Tsa, Tia, uv.cc, Qcw, ucp, and Pcp represent deviations in the supply air temperature, room air temperature, stem position of the cooling coil valve, flow rate of the chilled water, motor speed of the chilled water pump, and power consumption of the chilled water pump. The power consumption of the pump is modeled as a function of the water flow rate or the control signal of the valve. The chiller power model can be defined as a function of such load conditions as the outside air temperature or the chilled water temperatures when a VSD is used, or by checking the on/off cycling frequency if the chiller is under stepwise control. With the increased thermal resistance due to the fouling δcc, the heat transfer efficiency will be reduced, which may then lead to increases of the supply air temperature Tsa and, if the control is saturated, the indoor air temperature Tia. The cooling coil valve control signal uv.cc and hence the chilled water flow rate Qcw are then increased to compensate for the deviation in Tsa.
With the increased cooling load due to more chilled water to be processed, the chiller power Pch may go beyond its normal range under the same load conditions. Meanwhile, the pump speed ucp and hence the pump power consumption Pcp may also be higher than normal with the increased flow rate of water to be delivered and the increased flow resistance when the flow path is restricted due to the fouling. The rule set for the fault in this configuration is determined as follows.

IF (Pcp = +1) THEN alarm for excessive power input of the chilled water pump
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF (Tsa = +1) THEN alarm for high supply air temperature
IF (Tia = +1) THEN alarm for high room air temperature
IF ((Tsa ≠ -1) AND (p Tsa Tia) AND (p Tsa uv.cc) AND (p uv.cc Qcw) AND (p Qcw Pch) AND (p ucp Pcp) AND ((ucp ≠ -1) OR (p Qcw ucp)) AND (|Tsa| + |Tia| + |uv.cc| + |Qcw| + |ucp| ≠ 0)) THEN δcc (+1) is a possible fault origin

6.6 Rules for detection and diagnosis of typical faults in common air handling units

Rule development for fault identification with reference to the operating modes has been studied by other researchers [House et al., 2001]. In this section, rules are developed in a similar manner with power consumption as the major index for diagnosis of faults that are usually found in a typical VAV air handling unit such as the one installed in the test building. Potential violations of rules are summarized in a table, and the rule set for each fault can be established with the method presented in Section 6.5.

6.6.1 System description

The schematic diagram of the system is shown by Fig.
A3 in the Appendix with the following basic characteristics:
- Single-duct variable air volume system
- Temperature and humidity control of indoor air
- Chiller with constant speed motor drive*
- Hot and chilled water pumps with constant speed motor drive*
- Supply fan control: constant static pressure setpoint
- Return fan control: flow rate tracking of the supply fan

6.6.2 Expert rules

6.6.2.1 General rules

Although the data available for fault detection and diagnosis may vary among different systems, the basic functions of air conditioning processes should be performed, and hence the following general rules should be met in common HVAC systems under any normal operating condition. Since fans driven by electrical power are in operation under all modes, the following rules based on the design intent related to the power consumption and controls of fans can be set up for common HVAC systems.

Rule A. |Psf - Psf.m| ≤ ΔPsf.ci
Rule B. Psf ≥ Psf.min
Rule C. For usf.max - usf ≥ esf, |Tspt.ia - Tia| ≤ et.ia + ΔTdb
Rule D. usf.max - usf ≥ esf
Rule E. Psf > Prf
Rule F. Prf ≥ Prf.min
Rule G. (urf/usf)min ≤ urf/usf ≤ (urf/usf)max

Rule A indicates that the random deviation of the power input Psf of a VSD fan from its expected (modeled) value Psf.m should be less than a value defined as a confidence interval ΔPsf.ci, which is determined based on a given confidence level. Rules B and F require the power drawn by the fans to be larger than their minimum values when the fans are in operation. Rule C means the indoor air temperature Tia should be maintained around its setpoint Tspt.ia within the offset caused by the random error et.ia and the deadband or tolerance range ΔTdb, as long as the fan speed control signal is lower than its maximum usf.max. Violation of Rule C means the supply fan is not able to maintain the indoor air temperature and will lead to an alarm. esf and et.ia can be determined by the user based on specifications of the related devices. Rule D is a warning for the capacity of the supply fan.
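As an illustration, Rules A and C reduce to simple predicate checks. This is a hypothetical sketch, not the thesis implementation; the function names, argument names, and threshold values are all illustrative.

```python
# Hedged sketch of general Rules A and C (names and thresholds illustrative).
def check_rule_A(P_sf, P_sf_model, dP_ci):
    """Rule A: measured VSD fan power stays inside the model's confidence interval."""
    return abs(P_sf - P_sf_model) <= dP_ci

def check_rule_C(u_sf, u_sf_max, e_sf, T_ia, T_spt_ia, e_t_ia, dT_db):
    """Rule C: with the fan below its speed limit, room temperature must track setpoint."""
    if u_sf_max - u_sf < e_sf:        # fan saturated: Rule C does not apply
        return True
    return abs(T_spt_ia - T_ia) <= e_t_ia + dT_db

# A fan drawing four times its modeled power violates Rule A:
print(check_rule_A(8.0, 2.0, 0.5))    # False -> alarm
```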
Continuous violation of Rule D indicates either that the fan is not properly sized or that a serious fault has occurred in the air loop, which can be a pressure sensor error, an unexpected increase of flow resistance, or a fault in the mechanical transmission of the fan itself. Control of a return fan is usually coupled with the supply fan through different indices: air flow rate, fan speed, or room static pressure. In any case, the power input of a return fan should be lower than that of the related supply fan, as stated by Rule E. Rule G specifies the appropriate range of the control signals for the return-supply fan match. Undesirable working status of the fans, such as belt slippage of the supply fan, can be detected by this rule. In HVAC systems, frequent on/off cycling of some equipment, such as a reciprocating chiller, may lead to quick deterioration of the equipment and waste of energy. Therefore, the total number of cycles each day is usually limited under a threshold. Rules H and I provide the constraints for this purpose.

Rule H. ncdch ≤ ncdch.max
Rule I. ncdb ≤ ncdb.max

To prevent oscillating operation of the components in a system, which can result in quick damage to the devices, Rule J is introduced by checking the computed standard deviation of the sampled power data against a trained threshold. Note that such a rule can be used for the power data of a component and of the system as well.

Rule J. sP < sP.max

Temperature control is the basic function of a common HVAC system, and the related rules can therefore be applied as long as the system is in operation. Rules K and L provide constraints for the indoor air temperature and the supply air temperature, respectively.

Rule K. |Tia - Tspt.ia| ≤ et.ia + ΔTdb.ia
Rule L. |Tsa - Tspt.sa| ≤ et.sa + ΔTdb.sa

6.6.2.2 Rules under different operating modes

As discussed before, the detectability of a fault is often affected by the current operating mode of the system.
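Rule J is a single comparison of a computed standard deviation against a trained threshold. A minimal sketch, with an illustrative threshold and made-up samples:

```python
import statistics

# Sketch of Rule J: oscillating operation is flagged when the standard
# deviation of the sampled power exceeds a trained threshold s_max.
def rule_J(power_samples, s_max):
    """True if the power data satisfies Rule J (no excessive oscillation)."""
    return statistics.pstdev(power_samples) <= s_max

steady = [10.0, 10.1, 9.9, 10.0]       # kW samples under stable control
oscillating = [6.0, 14.0, 5.0, 15.0]   # kW samples under unstable control
print(rule_J(steady, 1.0), rule_J(oscillating, 1.0))  # True False
```

The same check can be applied to submetered component power or to the total power of the system.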
Separated by the related thermal processes of the system, the modes provide a clear structure for the implementation of the appropriate rules.

Mode 1. Heating

In this mode, the outdoor air temperature is lower than a setpoint Tbp used for the control of the HVAC system, plus a sensor error et.oa. Rules 1 and 2 define the current operating mode and a constraint for the basic function of the system in this mode, i.e., the supply air temperature should not be lower than the return air temperature. Violation of Rule 1 can be caused by an error in the outdoor air temperature sensor and indicates a discrepancy between the mode and the expected outdoor air temperature.

Rule 1. Toa < Tbp + et.oa
Rule 2. Tsa ≥ Tra

The supply air is heated by hot water in a heating coil before entering the conditioned space. Therefore, the hot water pump and the boiler should be running with appropriate power input while the chilled water pump and the chiller should be shut down. Rules 3 and 4 provide the corresponding constraints for the power input of the related equipment. The minimum power inputs Php.min and Pb.min can be obtained from the design specifications. Rule 5 is to prevent frequent on/off cycling of the boiler. Control signals for the heating process should keep the heating coil valve open and the cooling coil valve closed, as described by Rules 6 and 7. The minimum values for the time duration of continuous operation of the boiler ontdb.min and the valve position uv.hc.min, and the errors ep.cp, ev.cc, and ep.ch, can be specified by the user based on the capacity range of the equipment. Violations of these rules indicate that inappropriate thermal processes are involved.

Rule 3. Php ≥ Php.min and Pb ≥ Pb.min
Rule 4. Pcp ≤ ep.cp and Pch ≤ ep.ch
Rule 5. ontdb ≥ ontdb.min
Rule 6. uv.hc ≥ uv.hc.min
Rule 7. uv.cc ≤ ev.cc

With the low outdoor air temperature, the outdoor air damper should be at the minimum open position to save energy needed for heating.
Violation of Rule 8 indicates a too high or a too low fraction of outdoor air.

Rule 8. |Qoa/Qsa - (Qoa/Qsa)min| ≤ eQ

Mode 2. Cooling with outdoor air

In this mode, the outdoor air temperature is higher than the temperature of the balance point but lower than the supply air temperature minus the temperature rise due to the supply fan. Rules 9 and 10 define the current operating mode and a constraint for the basic function of the system in this mode, i.e., the supply air temperature should not be higher than the return air temperature.

Rule 9. Tbp + et.oa ≤ Toa ≤ Tspt.sa - ΔTsf + et.oa
Rule 10. Tsa ≤ Tra

When the outside air temperature is within the range defined by Rule 9, free cooling can be realized by mixing the cool outdoor air with the return air to achieve the design temperature of the supply air through the coupled modulation of the outdoor air and return air dampers. Hence, the equipment for mechanical heating or cooling at the plant should be shut down, as shown by Rules 11 and 12 for the zero power input of the chiller and the boiler and the related pumps as well.

Rule 11. Pcp ≤ ep.cp and Pch ≤ ep.ch
Rule 12. Php ≤ ep.hp and Pb ≤ ep.b

Rules 13 and 14 indicate that the control signals should keep the related valves closed, as no chilled or hot water is needed.

Rule 13. uv.hc ≤ ev.hc
Rule 14. uv.cc ≤ ev.cc

Rule 15 indicates that the outdoor air supply should meet the minimum requirement by design, though the outdoor air damper is modulating in this operating mode.

Rule 15. Qoa/Qsa > (Qoa/Qsa)min - eQ

Mode 3. Mechanical cooling with 100% outside air

When the outdoor air temperature is between the supply air temperature minus the temperature rise due to the supply fan and the return air temperature, the outdoor air damper is set to 100% open and the recirculation air damper is 100% closed. Rule 16 defines the current operating mode and Rule 17 indicates the supply air temperature should not be higher than the return air temperature.
Rule 16. Tspt.sa - ΔTsf + et.oa ≤ Toa ≤ Tspt.ia + ΔTrf + et.oa
Rule 17. Tsa ≤ Tra

Since the outdoor air temperature is higher than the supply air temperature, the chiller and the chilled water pump should be in operation to provide mechanical cooling for the supply air, while the boiler and the hot water pump should be turned off. Rule 18 provides the power constraints for the cooling equipment. Rule 19 means the power input of the heating components should be zero. Rule 20 gives the minimum continuous operating time of the chiller once it is turned on.

Rule 18. Pcp ≥ Pcp.min and Pch ≥ Pch.min
Rule 19. Php ≤ ep.hp and Pb ≤ ep.b
Rule 20. ontdch ≥ ontdch.min
Rule 21. For Toa ≤ ht.oa·(Tspt.ia - Tspt.sa), offtdch ≥ offtdch.min

In addition, it has been found that when the outdoor air temperature is within a range that is closer to the lower bound than to the upper bound of this mode, the chiller runs at a near-constant off-on interval that is longer than when the outdoor air temperature is outside this region. Such a range can be defined by the difference between the outdoor air temperature and the supply air temperature setpoint, normalized to the difference between the indoor and supply air temperature setpoints, as shown in Section 6.5. Rule 21 provides such a constraint for detection of the related faults, such as a leaking recirculation air damper. The control signals should keep the heating coil valve closed and the cooling coil valve open, as represented by Rules 22 and 23.

Rule 22. uv.hc ≤ ev.hc
Rule 23. uv.cc ≥ uv.cc.min

With 100% outdoor air, Rule 24 can be used to identify faults in damper positions and the related control signals.

Rule 24. |Qoa/Qsa - 1| ≤ eQ

Mode 4. Mechanical cooling

In this mode, the outdoor air temperature is higher than the return air temperature, as shown by Rule 25, and the supply air temperature should be lower than the return air temperature, as shown by Rule 26.
Rule 25. Toa ≥ Tspt.ia + ΔTrf + et.oa
Rule 26. Tsa ≤ Tra

Under the mechanical cooling condition, the chiller and the chilled water pump should be turned on, as represented by the minimum power input constraints shown in Rule 27. The boiler and the hot water pump should be out of operation with zero power input, as stipulated by Rule 28. Rule 29 sets the constraint for the continuous operating time of the chiller.

Rule 27. Pcp ≥ Pcp.min and Pch ≥ Pch.min
Rule 28. Php ≤ ep.hp and Pb ≤ ep.b
Rule 29. ontdch ≥ ontdch.min

Rules 30 and 31 indicate that the heating coil valve should be closed while the cooling coil valve should be open by the control signals.

Rule 30. uv.hc ≤ ev.hc
Rule 31. uv.cc ≥ uv.cc.min

Rule 32 shows that with the high outdoor air temperature, the amount of outdoor air is limited to the design minimum value to save cooling energy.

Rule 32. |Qoa/Qsa - (Qoa/Qsa)min| ≤ eQ

[Table: potential rule violations for typical faults, including a stuck or leaking cooling coil valve, a stuck or leaking heating coil valve, an unstable controller, a static pressure sensor offset, a slipping supply fan belt, stuck recirculation and outdoor air dampers, leaking mixing box dampers, hot and chilled water circulating pump faults, too low hot water supply temperature, too high chilled water supply temperature, fouled or undersized heating and cooling coils, chilled water not available, and outdoor, return, and supply air temperature sensor errors. The violation matrix is not legible in this copy.]

* When variable load control is used, power consumption of the chiller, the boiler, and the pumps can also be correlated with appropriate variables such as water temperatures, water flow rate, or valve control signals. The resulting functions can be used for detection of related faults, e.g., valve leakage.

** Theoretically, if the OA temperature sensor is not working properly, the working mode of the system could be totally different.
Therefore, depending on the severity and direction of the fault and the priority of the controlled variable in the control logic design, this fault may violate all of the rules or only some of them. For example, if the error in Mode 1 results in an extremely high value of the OA temperature signal, then the control may be in Mode 4 with mechanical cooling, and all the rules are violated. If the error just gives an even lower OA temperature in Mode 1, then only Rules 2, 3, 5, 6, and 8 might be violated.

*** An unstable controller tends to result in a more consistently recognizable pattern in power consumption. Other non-controlled variables that are available for analysis may also be used to help identify the source of the power oscillation. For example, unstable control of a supply fan causes large variance not only in the power input of both the system and the fan but also in the fan speed signal, the supply air flow rate, and the static pressure in the duct.

6.7 Discussion and conclusions

In order to find the fault origin with the least intrusion into the system, the diagnostic tool should be built with a flexible structure to produce reliable output from the varying amount of available information in different systems. For fault diagnosis with power consumption as an end effect and with limited basic measurements in HVAC systems, the diagnosis needs to be implemented by tracing the observed effects back to the defective components or system malfunctions without monitoring the input and output of each component. In this chapter, different techniques are first studied for fault diagnosis when abnormal behaviors are observed in a system. Two approaches are introduced for diagnosis with different levels of knowledge about the system: the shallow reasoning technique for evidentially oriented systems and the deep inference technique based on a structural and functional model of the problem domain.
For a system with little information other than the total power consumption and the design information of the system, the shallow diagnostic method can be used for indication and general analysis of a fault in the system. Although it imposes virtually no intrusion into the system, the shallow reasoning method generally is not able to lead to a fault origin that is specific enough for correction. With basic measurements and control signals of the system, the causal search method for tracing the observed abnormal behavior to its origin is more applicable and reliable than the other approaches for efficient and least-intrusive diagnosis of faults in HVAC systems. The deep reasoning approach is essentially a model that can derive its own behavior for a given set of parameters and signals and predict the effects of changes in them. In this thesis, a reasoning method based on the signed directed graph (SDG) is developed to produce reliable alarms for faults that are typical of common HVAC systems. Derived from the system's structure and control logic, a digraph is composed of nodes representing the status of variables or control signals based on appropriate models or thresholds, and branches representing the propagation of the effects of the fault. One major advantage of the SDG inference technique is that it allows multiple pathways in the propagation of the fault, which maximizes the number of violations caused by a fault in the given system and hence helps to improve the resolution of the diagnosis. In addition, some critical issues regarding the operating mode and the time delay in the propagation of a deviation along a pathway can be handled by the inference logic. With the modified concise form for the logical consequence, the rules can be directly derived from a digraph without missing or repeated rule executions.
In this study, the inference structure is further improved with more flexible control of the rules by allowing zero violations of controlled variables regardless of the deviations in the consequent variables. This innovation not only prevents missed alarms when power consumption is the major criterion, but also greatly improves the flexibility of the design of the rule structure so that the diagnosis can be applied to systems under different conditions or operating modes or with varying amounts of available information. Moreover, by including the controlled variables in the inference, urgent alarms will be issued promptly for faults leading to serious deterioration of the system's function that requires immediate attention. Faults that result in degradation of the system's performance can be monitored and alarmed continuously. Although it has been verified to be an effective approach for fault diagnosis in HVAC systems, this SDG rule-based causal search method has limitations in two aspects. One is that it cannot locate the specific origin of a fault that rarely happens in HVAC systems and hence cannot be diagnosed with the knowledge base, though all the abnormal behaviors caused by such a fault are reported for further analysis by an expert. The other limitation is that fault diagnosis is based on the assumption that at most one fault exists at a time in one loop, i.e., the water or the air loop, of the monitored system. This is because the interactions among multiple faults in the same loop still cannot be appropriately distinguished by power consumption with limited measurements and control signals. But it is still possible to distinguish between faults in different loops if the power models are obtainable for each loop.

CHAPTER 7 Summary

7.1 Review

A fault in an HVAC system means an undesired state or position of a component that causes deterioration of the system's performance when the component is in operation.
The effects can be categorized as excessive energy consumption, undesirable indoor air quality, and degradation or failure of components. Sometimes the temperature and humidity of the indoor air can be maintained through compensation for the fault by the local feedback loop, but always at the expense of extra energy consumption. The power input required by the system due to the existence of a fault can be significantly higher than the value under normal operation. If the fault is not found and treated in time, the indoor air quality may also become unacceptable, either because the increasing severity of the fault saturates the local control, as with a slipping fan belt, or because the fault does not cause deviations monitored in any feedback loop in the system, as with a broken fan belt. At the same time, components involved in the propagation of the fault may be damaged due to unexpected actions. Therefore, in order to achieve the expected air quality with efficient energy use, faults in an HVAC system must be identified and removed appropriately. However, detection and diagnosis of a fault is often a difficult task for a human operator without a proper tool or method when a complex system is to be dealt with, such as a commercial HVAC system consisting of multiple interactive components. It is therefore necessary to develop an effective method that can be used to find abnormal behaviors and then search for the fault origin in the system. Although faults may occur in diverse ways in different applications, it is often desired that the implementation of the detection and diagnosis cause as little interruption to the system as possible. Also, the FDD program should be based on the analysis of parameters that are common in different HVAC systems. In addition, the decision for removal or repair of a faulty component is usually based on the evaluation of benefit vs.
cost, especially in the case of a degradation fault that is unavoidable after some period of operation. Hence an effective FDD method needs to be developed that can be used not only to locate faults in different systems with the least intrusion but also to demonstrate the cost effect of the fault for the operator or the building energy management system to make appropriate decisions. Faults in an HVAC system usually lead to abnormal effects in its power consumption. Moreover, as a common input of HVAC systems, electrical power data are generally obtainable without significant intrusion into or interruption of the physical processes in the system. Therefore, if the electrical power data can be properly processed or modeled for the system and its components, detection and diagnosis of faults may be fulfilled with negligible interruption of operation for different systems. By comparing the power input based on the models against that from measurements, the effect of a fault on energy cost can be estimated and the results then used in decision-making for maintenance. In this thesis, based on a comprehensive study of the previous research, a new methodology for detection and diagnosis of faults in HVAC systems has been developed. With this approach, identification of faults is conducted by monitoring the electrical power consumption of the system and its components. Abnormalities in the power input of a system or a component caused by faults are detected as magnitude deviations or power oscillations, which consequently start the search engine for the fault origin. Depending on the availability of power submeters for electrically driven equipment, models are developed at two different levels, i.e., the system and the component levels.
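The cost estimate from the model-measurement comparison can be illustrated by integrating the gap between measured and expected power over time. This is a hypothetical sketch with made-up numbers, not data from the test building.

```python
# Illustrative estimate of the extra energy cost of a fault: sum the excess of
# measured power over the model's expected power, times interval and price.
def excess_cost(measured_kw, expected_kw, dt_hours, price_per_kwh):
    """Cost of energy consumed above the modeled baseline."""
    extra_kwh = sum(max(m - e, 0.0) * dt_hours
                    for m, e in zip(measured_kw, expected_kw))
    return extra_kwh * price_per_kwh

measured = [8.0, 8.2, 7.9]   # kW, sampled every 0.25 h during a fault
expected = [2.0, 2.0, 2.0]   # kW, from the power model
print(excess_cost(measured, expected, 0.25, 0.12))  # about $0.54 of extra cost
```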
Without power submeters, detection at the system level aims to check the on/off status of a device and the magnitude of its power consumption by detecting the switches and computing the corresponding changes in the total power data, testing on/off cycles of some major equipment against the constraints of their design operating schedules, and tracking power oscillation in the system. Utilization of submeters enables the modeling of the equipment's power input as a function of basic measurements or control signals, which eliminates the need for on/off switches to find abnormal magnitudes and allows detection whenever the component is in operation. In Chapter 2, a brief review of the theory of change detection is first conducted. Based on the characteristics of the power data of a commercial HVAC system compared to those in residential applications, the GLR (generalized likelihood ratio) method is proposed for detection of changes in the power data. The GLR search for changes enables detection without knowledge of the coming event by double maximization of the probabilities over the time and the magnitude of a change. With the normal distribution of the noise in electrical power data, an executable form of the algorithm has been derived for automatic detection. Although post-event information is not required, the pre-event information is essential in GLR detection. With the common application of variable speed motor drives as well as the unavoidable presence of noise and disturbances in the electrical power environment of an HVAC system, the total power changes continuously. This implies that the mean value of the total power consumption before any event must be continuously computed on line to form a correct base for the calculation of the change caused by the event.
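The double maximization can be sketched as follows: for a mean jump in Gaussian noise, the change magnitude is maximized analytically by the post-change sample mean, and the change time by an exhaustive search over the window. This is a simplified illustration of the GLR idea, not the detector implemented in the thesis (it omits the two-window mean tracking, window reset, and multi-rate sampling).

```python
# Simplified GLR sketch for a jump in the mean of Gaussian power data.
def glr_detect(window, mu0, sigma, threshold):
    """Return (change index, GLR statistic); the index is None below threshold."""
    best_k, best_stat = None, 0.0
    for k in range(len(window) - 1):            # candidate change times
        seg = window[k:]
        nu_hat = sum(seg) / len(seg) - mu0      # ML estimate of the jump size
        stat = len(seg) * nu_hat ** 2 / (2 * sigma ** 2)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return (best_k, best_stat) if best_stat > threshold else (None, best_stat)

data = [10.0] * 5 + [14.0] * 5                  # a 4 kW step at sample 5
print(glr_detect(data, mu0=10.0, sigma=0.5, threshold=10.0))  # (5, 160.0)
```

The statistic peaks at the true change time because earlier candidates dilute the post-change mean with pre-change samples.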
In Chapter 3, the two-window equation has been derived for the computation of the pre-event mean in a moving window that follows the detection window, where the thorough search is conducted for the double maximization of probabilities. Based on tests with data from several buildings, major innovations have been made in the algorithms to improve the detection quality: the window reset technique eliminates repeated alarms for one event, while the updated standard deviation and the nonzero expected minimum help to prevent false or missed alarms. To minimize the noise effect, a median filter based on an equal-probability search has been designed for the pre-processing of the power data. With spikes and large fluctuations in the power data due to random noise removed by the median filter, the false alarm rate is greatly reduced. In addition to the median filter, the multi-rate sampling technique is applied in the detection to identify changes or variations that are not all visible to a detector with any single sampling interval, due to the various on/off time durations of different components in the system, transformed data patterns caused by noise, and oscillation around a certain period. Clearly, for the varied on/off durations of different equipment, such as those driven by VSD or CSD motors, detection with multiple sampling rates is more capable than with a single rate. In addition to the detection for different components, the multi-rate GLR detector can recognize events whose abruptness is impaired by the dynamics or noise in the power data, which is very useful in matching the on/off cycles of a component. Moreover, detection of power oscillation due to unstable control, which can be found by tracking the standard deviation in the pre-event window, is also greatly enhanced due to the fact that oscillation with a specific period can be easily identified with a wide range of sampling intervals, though it may be invisible to detection with certain sampling intervals.
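The spike-removal property of the median pre-filter can be shown in a few lines. This is a generic moving-median sketch (window width illustrative); the thesis's equal-probability search is not reproduced here.

```python
# Sketch of median-filter pre-processing: a moving median removes isolated
# spikes in the power data without smearing genuine step changes.
def median_filter(data, width=3):
    """Moving median with edge windows clipped to the available samples."""
    half = width // 2
    out = []
    for i in range(len(data)):
        lo, hi = max(0, i - half), min(len(data), i + half + 1)
        out.append(sorted(data[lo:hi])[(hi - lo) // 2])
    return out

noisy = [10.0, 10.1, 55.0, 10.0, 9.9]   # a single 55 kW spike at index 2
print(median_filter(noisy))              # the spike is replaced by a neighbor
```

A mean filter over the same window would spread the spike into its neighbors and could trigger a false change alarm; the median simply discards it.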
The performance of the detector developed with these innovations and equipped with the median filter and the multi-rate sampler has been verified by tests with several typical faults related to abnormalities in the magnitude of on/off changes, the frequency of equipment cycling, or the trend of the power data of real HVAC systems. For practical applications of the detector, Chapter 3 provides guidelines for the training of the related parameters, including window lengths, thresholds, and sampling rates. The work in Chapter 3 demonstrates that with the improved GLR detector, centralized power data can be explored to reveal the health of the system through the electrical power input of the key components. When the power input of equipment is obtainable from submeters, fault detection and diagnosis can be conducted at a more specific level, i.e., the component level, as presented in Chapter 4. A brief review of the algorithm of least squares fitting through singular value decomposition is first conducted. For the modeling of the power consumption of a component, Chapter 4 addresses the selection of the appropriate reference parameter as the variable in the power function, which is very important because the reference parameter is expected to result in an effective correlation for fault detection while the intrusion caused by the related measurements is minimized. Moreover, models are also analyzed and developed for different system structures and their control logic. With the confidence level required for the detection, an interval can be set up for the upper and lower limits of the deviation of the component's power consumption under a given value of the reference parameter. Power models built at the component level therefore not only lead to easier diagnosis with higher resolution, but also make it possible to conduct detection continuously. This is especially useful when the equipment is driven by a VSD motor, which is rarely turned off in some systems.
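The component-level fitting step can be illustrated with a least squares fit computed via SVD (`numpy.linalg.lstsq` is SVD-based) plus a deviation band around the model. The data points and the 3-sigma interval below are illustrative, not from the test building.

```python
import numpy as np

# Sketch of a component power model: quadratic fit of submetered fan power
# against a reference parameter (air flow), with a residual-based band.
flow = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # reference parameter
power = np.array([1.1, 3.9, 9.2, 15.8, 25.1])   # submetered power, kW
A = np.vander(flow, 3)                           # design matrix [x^2, x, 1]
coef, *_ = np.linalg.lstsq(A, power, rcond=None) # SVD-based least squares
residuals = power - A @ coef
band = 3 * residuals.std()                       # e.g. a 3-sigma interval

def in_bounds(x, p):
    """True if a measured power p is within the band of the model at x."""
    return abs(p - np.polyval(coef, x)) <= band

print(in_bounds(3.0, 9.0), in_bounds(3.0, 20.0))  # normal vs. faulty reading
```

A measurement falling outside the band at the current value of the reference parameter triggers detection without waiting for an on/off switch.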
Also, detection of faults associated with cycling frequency and oscillation becomes more straightforward with submetered power data, as shown. However, it should be noted that the detectability of a fault is sometimes determined by the equipment's sensitivity to load variations, which consequently affects the capacity of the power model. For example, a stuck-open outdoor air damper tends to cause an increase in the chiller's power input, especially under the operating mode of mechanical cooling with minimum outdoor air. However, with a two-stage reciprocating chiller as used in the test building, it is difficult to find the fault from the chiller power because of its stepwise response to load changes. One major challenge for FDD with power consumption is that although the basic measurements involved in the power models are generally available, submetered power data may not be obtainable in common HVAC systems. Therefore, it is necessary to find an appropriate approach to build the models without power submeters. Chapter 5 explores a feasible method to obtain power data by detecting the changes in the total power controlled by manual shutdown of the related equipment. However, due to the presence of noise and other events, the power data should be carefully selected for the fitting to avoid deviation in the model itself. Two major factors have been found effective in screening the detected values: the relative magnitude and the standard deviation. With these two indices, errors in the detected changes are minimized, which makes the resulting model applicable for detection. Tests with an HVAC system have verified the feasibility of this approach, which has produced output matching that of models based on submetered power data. With the deviations found by the power models, the fault origin is yet to be identified. Chapter 6 proposes a knowledge-based approach based on analysis of a digraph for the common faults in HVAC systems.
This method is capable of dealing with problems typical of HVAC systems, such as time delay. To find undesirable operation at an early stage, before the deterioration of the controlled variables that might be compensated by local feedback even in the presence of a fault, a further innovation is proposed for the inference logic that allows alarms to be issued progressively with violations. In addition, with the flexible control of the nodes provided by this innovation, the logic structure can be easily applied to systems with similar structure but different measurements available for diagnosis. Based on the understanding of the structure and the control logic of the system, the potential pathways and affected components are explored in the digraph, which maximizes the violations available for the diagnosis. By using the simplified form of the logic predicates, the rules can be extracted from the digraph efficiently in a concise form. The methodology developed in this thesis has been verified by tests with different building energy systems. It is concluded that the detection and diagnosis of faults in practical HVAC systems can be conducted based on this research with minimized intrusion into the operating system.

7.2 Achievements and future work

In this thesis, the following tasks have been successfully fulfilled and verified with tests in HVAC applications.

1) Derived and innovated the algorithm for detection of changes in the power input of HVAC systems;
2) Designed two parallel detectors with an appropriate filter to pre-process the noisy power data;
3) Developed a multi-scale sampling mechanism for the detector to identify changes or variations in the power data that cannot be found with single-rate sampling;
4) Established the complete set of training rules for the window lengths, thresholds, and selection of sampling intervals for the change detector;
5)
Set up submetered power models of components in HVAC systems to be used for continuous monitoring and addressed the key issues in model development; 6). Developed an approach to obtain feasible power models for components without submetered power data; 7). Designed the concise inference logic for rule development based on a signed directed digraph for enhanced identification of typical fault origins in HVAC systems.

Future improvements to the methods for fault detection and diagnosis in HVAC systems may be achieved in three directions. First, as the models developed in this thesis are primarily for the simulation of the actual operation of the monitored system, a model of the design intent is desirable for identification of faults by a three-way comparison among the design intent, the control signals, and the actual operation. Such a model may be established for a given system based on optimization of energy consumption, indoor air conditions, and other indices of concern. Second, an optimal search mechanism may be developed to distinguish multiple events with overlapping durations. A possible solution to this issue is a modified algorithm for the Markov series with multi-rate sampling. And third, the open-ended inference capacity of the fault diagnosis method should be explored to replace the closed deduction techniques if possible, which implies that a learning model for a system should be established.

References

Andelman, R., P. S. Curtiss, and J. F. Kreider. 1997. Demonstration knowledge-based tool for diagnosing problems in small commercial HVAC systems.
Annex 25. 1996. "Building optimization and fault diagnosis source book." International Energy Agency. Real Time Simulation of HVAC Systems for Building Optimization, Fault Detection and Diagnosis.
Asada, H., C. Federspiel, and S. Liu. 1993. "Human centered control." ASME Journal of Dynamic Systems, Measurements, and Control, Vol. 115, No. 2B, pp. 271-280.
ASHRAE Handbook. 1996.
HVAC Systems and Equipment. American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, Ga.
Basseville, M. 1983. "Design and comparative study of some sequential jump detection algorithms for digital signals." IEEE Transactions, ASSP, Vol. 31, pp. 521-534.
Basseville, M. and I. V. Nikiforov. 1993. Detection of abrupt changes: theory and application. PTR Prentice Hall, Englewood Cliffs, New Jersey.
Benouarets, M., A. L. Dexter, R. S. Fargus, P. Haves, T. I. Salsbury, and J. A. Wright. 1994. "Model-based approaches to fault detection and diagnosis in air-conditioning systems." Proceedings of Systems Simulation in Buildings, Liege, Belgium.
Brambley, M., R. Pratt, D. P. Chassin, and S. Katipamula. 1998. "Automated diagnostics for outdoor air ventilation and economizers." ASHRAE Journal, Vol. 40, No. 10.
Braun, J. E., J. W. Mitchell, and S. A. Klein. 1987. "Performance and control characteristics of a large cooling system." ASHRAE Transactions, Vol. 93, pp. 1830-1852.
Breuker, M. S. and J. E. Braun. 1998. "Evaluating the performance of a fault detection and diagnostic system for vapor compression equipment." International Journal of HVAC&R Research, October, pp. 401-425.
Carlson, R. A. 1991. Understanding building automation systems: direct digital control, energy management, life safety, security/access control, lighting, building management programs. R. S. Means Co.
Chen, S. Y. S. and S. J. Demster. 1995. Variable air volume systems for environmental quality. McGraw-Hill.
Davis, R. 1984. "Diagnostic reasoning based on structure and behavior." Artificial Intelligence, Vol. 24, pp. 347-410.
Davis, R. and H. Shrobe. 1983. "Representing structure and behavior of digital hardware." IEEE Computer, October, pp. 75-82.
Dodier, R., P. S. Curtis, and J. F. Kreider. 1998. "Small-scale on-line diagnostics for an HVAC system." ASHRAE Transactions, Vol. 104, Part 1A, pp. 530-539.
Dodier, R. H. and J. F. Kreider. 1999.
"Detecting whole building energy problems." ASHRAE Transactions, Vol.105, Part 1, pp. 579-589. Draper, N. R. and H. Smith. 1981. Applied regression analysis. John Willey & Sons Englander, S. L. and L. K. Norford. 1990. "VAV system simulation, Part 1: Development and experimental validation of a DDC terminal box model." Proceedings of the third international conference on system simulation of buildings, Liege, Belgium. Golub, G. H. and C. F. Van Loan. 1989. Matrix computations. University Press, Baltimore. 2 nd ed. John Hopkins Friedrich, M., and M. T. Messinger. 1995. "Method to assess the gross annual savings potential of energy conservation technologies used in commercial buildings." ASHRAE Transactions, Vol.101, Part 1, pp.444-453. Giarratano, J. C. 1983. "Expert systems: principles and programming." 1998. PWS Pub. Co. Han, C. Y., Y. Xiao, C. J. Ruther. 1999. "Fault detection and diagnosis of HVAC systems." ASHRAE Transactions, Vol.105, Part 1, pp. 568-578. Hart, G. W. 1992. "Nonintrusive Appliance Load Monitoring" 1992. Proceedings of the IEEE. Vol.80. No.12, pp.1870-1891. Haves, P., T. I. Salisbury, and J. A. Wright. 1996. "Condition monitoring in HVAC subsystems using first-principle models." ASHRAE Transactions, Vol. 102, Part 1, pp. 519-539. He, X. and Liu, S. "Modeling of Vapor Compression Cycles for Multivariable Feedback Control of HVAC Systems", in print, ASME Journal of Dynamic Systems, Measurement, and Control. Hill, R. 0. 1995. Applied change of mean detection techniques for HVAC fault detection and diagnosis and power monitoring. Master's thesis, Massachusetts Institute of Technology. House, J. M., W-Y. Lee, D. R. Shin. 1999. "Classification techniques for fault detection and diagnosis of an air-handling unit ." ASHRAE Transactions, Vol. 105, Part 1, pp.1087-1097. 172 House, J. M., H. V. Naezi-Nej ad, and J. M. Whitcomb. 2001. "An expert rule set for fault detection in air-handling units." ASHRAE Transactions, Vol.107, Part 1. Huber, P. J. 1981. 
Robust statistics. John Wiley & Sons.
Isermann, R. 1984. "Process fault detection based on modeling and estimation methods - A survey." Automatica, Vol. 20, No. 4, pp. 387-404.
Kao, J. L. and E. T. Pierce. 1983. "Sensor errors - effect on energy consumption." ASHRAE Journal, Vol. 25, No. 12, pp. 42-45.
Karki, S. H. 1999. "Performance factors as a basis of practical fault detection and diagnostic methods for air handling units." ASHRAE Transactions, Vol. 105, Part 1, pp. 1069-1077.
Karl, W. C., S. B. Leeb, L. A. Jones, J. L. Kirtley, Jr., and G. C. Verghese. 1992. "Applications of rank-based filters in power electronics." IEEE Transactions on Power Electronics, Vol. 7, No. 3, pp. 437-444.
Katipamula, S., R. G. Pratt, D. P. Chassin, Z. T. Taylor, G. Krishnan, and M. R. Brambley. 1999. "Automated fault detection and diagnostics for outdoor-air ventilation systems and economizers: methodology and results from field testing." ASHRAE Transactions, Vol. 105, pp. 555-567.
Kramer, M. A. 1986. "Malfunction diagnosis using quantitative models and non-Boolean reasoning in expert systems." AIChE Journal, Vol. 34, pp. 1383-1393.
Kramer, M. A. and B. L. Palowitch Jr. 1988. "A rule-based approach to fault diagnosis using the signed directed graph." AIChE Journal, Vol. 36, pp. 1225-1235.
Kreider, J. F., D. Anderson, L. Graves, W. Reinert, J. Dow, and H. Wubbena. 1989. "A quasi-real-time expert system for commercial building HVAC diagnostics." ASHRAE Transactions, Vol. 95, Part 2, pp. 954-960.
Lee, W-Y., J. M. House, and C. Park. 1997. "Fault diagnosis of an air handling unit using artificial neural networks." ASHRAE Transactions, Vol. 102, Part 1, pp. 540-549.
Lee, W-Y., C. Park, and G. Kelly. 1996. "Fault detection in an air-handling unit using residual and recursive parameter identification methods." ASHRAE Transactions, Vol. 102, Part 1, pp. 528-539.
Leeb, S. B. 1992. A conjoint pattern recognition approach to non-intrusive load monitoring. Ph.D.
thesis, Massachusetts Institute of Technology.
Li, X., H. Vaezi-Nejad, and J-C. Visier. 1996. "Development of a fault diagnosis method for heating systems using neural networks." ASHRAE Transactions, Vol. 102, Part 1, pp. 607-614.
Li, X., J-C. Visier, and H. Vaezi-Nejad. 1997. "A neural network prototype for fault detection and diagnosis of heating systems." ASHRAE Transactions, Vol. 103, Part 1, pp. 634-643.
Little, R. D. 1991. Electrical power disaggregation in commercial buildings with applications to a non-intrusive load monitor. Master's thesis, Massachusetts Institute of Technology.
Lorden, G. 1971. "Procedures for reacting to a change in distribution." Annals of Mathematical Statistics, Vol. 42, pp. 1897-1908.
Lorenzetti, D. and L. K. Norford. 1992. "Measured energy consumption of variable-air-volume fans under inlet vane and variable-speed drive control." ASHRAE Transactions, Vol. 98, Part 2, pp. 371-379.
Luo, D., L. K. Norford, S. R. Shaw, and S. B. Leeb. 2001. "Monitoring HVAC equipment electrical loads from a centralized location - methods and field test results." To be published in ASHRAE Transactions, Vol. 107.
McGowan, J. 1992. Networking for building automation and control systems. Englewood Cliffs, NJ.
Newman, H. M. 1994. Direct digital control of building systems: Theory and practice. John Wiley & Sons.
Ngo, D. and A. L. Dexter. 1999. "A robust model-based approach to diagnosing faults in air-handling units." ASHRAE Transactions, Vol. 105, Part 1, pp. 1078-1086.
Norford, L. K. and S. B. Leeb. 1996. "Nonintrusive electrical load monitoring in commercial buildings based on steady-state and transient load-detection algorithms." Energy and Buildings, Vol. 24, pp. 51-64.
Norford, L. K. and R. D. Little. 1993. "Fault detection and load monitoring in ventilation systems." ASHRAE Transactions, Vol. 99, Part 1, pp. 590-602.
Norford, L. K., J. A. Wright, R. A. Buswell, and D. Luo. 2000. Demonstration of fault detection and diagnosis methods in a real building.
Final Report for ASHRAE Research Project 1020-RP.
Pape, F. L. F., J. W. Mitchell, and W. A. Beckman. 1991. "Optimal control and fault detection in heating, ventilating, and air conditioning systems." ASHRAE Transactions, Vol. 97, Part 1, pp. 729-736.
Patton, R., P. Frank, and R. Clark. 1989. Fault diagnosis in dynamic systems - Theory and application. Prentice Hall International (UK) Ltd.
Peitsman, H. C. and V. E. Bakker. 1996. "Application of black-box models to HVAC systems for fault detection." ASHRAE Transactions, Vol. 102, Part 1, pp. 628-640.
Peitsman, H. C. and L. L. Soethout. 1997. "ARX models and real-time model-based diagnosis." ASHRAE Transactions, Vol. 103, Part 1, pp. 657-671.
Rice, J. A. 1988. Mathematical statistics and data analysis. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California.
Rossi, T. M. and J. E. Braun. 1996. "A statistical, rule-based fault detection and diagnostic method for vapor compression air conditioners." Real Time Simulation of HVAC Systems for Building Optimization, Fault Detection and Diagnosis. Technical Papers of Annex 25, pp. 729-754.
Rossi, T. and J. E. Braun. 1997. "A statistical, rule-based fault detection and diagnostic method for vapor compression air conditioners." HVAC&R Research, Vol. 3, No. 1, pp. 19-37.
Rossi, T. M., J. E. Braun, and W. Ray. 1996. "Fault detection and diagnosis methods." Real Time Simulation of HVAC Systems for Building Optimization, Fault Detection and Diagnosis. Technical Papers of Annex 25, pp. 123-133.
Seem, J. E., J. M. House, and R. H. Monroe. 1999. "On-line monitoring and fault detection." ASHRAE Journal, July 1999, Vol. 41, pp. 21-26.
Shanmugan, K. S. and A. M. Breipohl. 1988. Random signals: detection, estimation and data analysis. John Wiley & Sons.
Shiozaki, J., H. Matsuyama, E. O'Shima, and M. Iri. 1985. "An improved algorithm for diagnosis of system failures in the chemical process." Computers and Chemical Engineering, Vol. 9, pp. 285-293.
Stylianou, M. 1997. "Application of classification functions to chiller fault detection and diagnosis." ASHRAE Transactions, Vol. 103, Part 1, pp. 645-656.
Visier, J. C., H. Vaezi-Nejad, and P. Corrales. 1999. "A fault detection tool for school buildings." ASHRAE Transactions, Vol. 105, pp. 543-554.
Yoshida, H., T. Iwami, and H. Yuzawa. 1996. "Typical faults of air conditioning systems and fault detection by ARX model and extended Kalman filter." ASHRAE Transactions, Vol. 102, Part 1, pp. 557-564.

Nomenclature

a - coefficient
argmax - optimality region for maximization
argmin - optimality region for minimization
D - damper position
e - noise or error
F - cumulative distribution function
g - generalized likelihood ratio
h - threshold
H - hypothesis for an event
Inf - the greatest lower bound on a function over (a subset of) its domain
n - number of samples, or logic predicate
N - finite number of samples
ncd - number of on/off cycles each day
offtd - off-time duration
ontd - on-time duration
p - probability density, or logic predicate
P - power
Q - flow rate
s - standard deviation of a sampled data set
S - log-likelihood ratio
t - time
u - control signal
y - data sample
δ - thickness of fouling
Δ - width of leakage
Λ - likelihood ratio
x̄ - average value of variables with Gaussian distribution
μ - mean value
σ - standard deviation

Superscripts
^ - estimated behavior
~ - weighted behavior
(bold lowercase) - vector
(bold uppercase) - matrix

Subscripts
0 - before the change
1 - after the change
b - boiler
cc - cooling coil
ci - confidence interval
ch - chiller
cp - chilled water pump
d - damper
db - deadband
e - event of change
hc - heating coil
hp - hot water pump
i, j, k, n - sample numbers
ia - indoor air
m - model
max - maximum
mn - minimum
oa - outdoor air
p - power, or phase of power
Q - flow rate
ra - return air
rf - return fan
sa - supply air
sf - supply fan
sp - static pressure
spt - setpoint
t - temperature
v - valve, or variance

Appendix: Descriptions of the test system and the faults

Based on the final report for ASHRAE Research Project 1020-RP [Norford et al., 2000], this appendix provides detailed information about the building, the HVAC system and control sequences, and the methods of fault implementation for the demonstration of the FDD method presented in this thesis. Sensors and trained parameters required by this method in the tests are also listed. Figures A1-A3 are repeated from Section 3.9 for reading convenience.

Building

Fault testing was conducted in a unique building that combines laboratory-testing capability with real-building characteristics and is capable of simultaneously testing two full-scale commercial building systems side by side with identical thermal loads. The building is equipped with three variable-air-volume air-handling units (AHUs), namely AHU-A, AHU-B, and AHU-1. AHU-A and AHU-B are identical, each serving four test rooms (Figure A1). The building has a true north-south solar alignment so that the pairs of test rooms have identical exposures to the external thermal loads. The test rooms are unoccupied but are equipped with two-stage electric-baseboard heaters to simulate thermal loads and with two-stage room lighting, both scheduled to simulate various usage patterns.
The test rooms are also equipped with fan coil units, although these were not used in this research. AHU-1 serves the general areas of the facility, including offices, reception space, a classroom, a computer center, a display room, service spaces, and the media center. A second classroom was added to the east side of the building during the later stages of this project. Because AHU-1 serves the occupied part of the building, it is subject to variable occupant, lighting, and external and internal loads. The test rooms, heating and cooling loops, and AHUs are well instrumented with near-research-grade sensors. Notably, the instrumentation included watt transducers for all components of interest. The A and B test rooms are individually controlled by a single commercial energy-management and control system (EMCS) and the general areas are controlled by a second EMCS. The building has a structural steel frame with internally insulated, pre-cast concrete panels, a flat roof, slab-on-grade flooring, and a floor area of 7,900 m2 (9,272 ft2), including the new classroom. The east, south, and west test rooms each have 6.3 m2 (74 ft2) windows with double-layer, clear glass.

HVAC system

The heating plant consists of a gas-fired boiler, circulation pumps, and the necessary control valves. Heating operation of the HVAC systems was not required as part of the tests conducted in this research, other than for preheating of the outside air during winter operation to simulate higher outside temperatures and force the HVAC systems into economizer mode. The cooling plant (Figure A2) consists of a nominal 10 kW, two-stage reciprocating, air-cooled chiller; a 149 kWh thermal energy storage (TES) unit that was isolated from the cooling system for this research; chilled water supplied by a central facility; and the pumps, valves, and piping necessary to circulate chilled water through the HVAC components.
The major components of the AHUs are the recirculated air, exhaust air, and outdoor air dampers; cooling and heating coils with control valves; and the supply and return fans (Figure A3). Ducts transfer the air to and from the conditioned spaces. Both the supply and return fans are controlled with variable frequency drives. An additional heating coil was installed for this research on AHU-A and AHU-B, between the outside air inlet (OA) and the flow and temperature sensors. This coil was employed to preheat the outside air so as to force the control system into free cooling mode. AHU-A and AHU-B are identical, while AHU-1 is similar but larger to accommodate higher thermal loads. Air from the AHUs is supplied to VAV box units, each having electric or hydronic reheat.

Figure A1. Plan of the test building.

Figure A2. Chilled-water flow circuit in the test building.

Figure A3. Air-handling unit in the test building.

Control sequences

--Hot and chilled water pump control logic

The constant-speed heated- and chilled-water pumps (HWP-A, HWP-B, CHWP-A, and CHWP-B) are turned on and off based on the position of the heating and cooling coil control valves. If the control valve is above 15% open, the pump is turned on and stays on until the control valve is below 5% open. For a 15% open valve, the HTG-FLOW is about 1.5 gpm, and the CHW-FLOW is about 2.5 gpm.
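The pump interlock just described (on above 15% open, off below 5% open) is a simple hysteresis rule. A minimal sketch, with a hypothetical function name and an illustrative valve trajectory:

```python
# Minimal sketch of the pump on/off interlock: the pump starts when the coil
# control valve opens beyond 15% and stays on until the valve closes below
# 5%. Function name and state handling are illustrative, not from the EMCS.
def pump_command(valve_open_pct, pump_was_on,
                 on_threshold=15.0, off_threshold=5.0):
    """Return True if the pump should run, applying on/off hysteresis."""
    if valve_open_pct > on_threshold:
        return True
    if valve_open_pct < off_threshold:
        return False
    # Between the two thresholds the pump keeps its previous state,
    # which prevents rapid cycling as the valve hunts near a threshold.
    return pump_was_on

# Valve trajectory: opens past 15%, hunts near 10%, then closes below 5%.
state = False
for valve in [3, 12, 18, 10, 8, 4]:
    state = pump_command(valve, state)
    print(valve, state)
```

The 10-point deadband between the two thresholds is what keeps the constant-speed pumps from short-cycling when a control valve hovers near a single switching point.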
The speeds of HWP-LA, HWP-LB, and CHWP-LC are modulated to maintain the pressure in the pipe at the pressure set point for heated water loops A and B and chilled water loop C. The pressure set points are generally set at 30 psi for loops A, B, and C.

--Supply and return fan control sequence

The speed of the supply fan is modulated to maintain duct static pressure at the static pressure set point, which is generally 1.2 in. W.G. The return fan for the tests was controlled by air flow rate matching with the supply fan, i.e., the return air flow rate is maintained at a percentage of the supply air flow rate. For example, if the supply air flow rate is 3600 cfm and the return air flow rate is set to maintain 80 percent of the supply air flow rate, then the speed of the return fan is controlled to maintain a return air flow rate of 0.80 * 3600 = 2880 cfm.

Table A1. Test system input accuracy.

Voltage and current: 0.0244%
100 Ω RTD (DIN platinum): 0.07 F
1000 Ω RTD (JCI nickel): 0.07 F
1000 Ω RTD (DIN platinum): 0.10 F

--Supply air temperature control sequence

The control sequence used to maintain supply air temperature is divided into four control regions, namely, mechanical cooling, mechanical and 'free' cooling, 'free' cooling, and mechanical heating, as shown in Figure A4. Each region depends on whether the OA-TEMP is greater or less than a reference temperature known as the economizer set point temperature (ECONSPT, typically around 55-65 F), and whether DA-TEMP is above or below the supply air temperature set point (SUPSPT). The control sequence is in the mechanical cooling mode when OA-TEMP is greater than ECONSPT and the system calls for cooling (DA-TEMP is greater than SUPSPT plus a deadband). During the mechanical cooling mode, the OA-DMPR is held in the minimum position and the CLG-VLV is modulated to maintain DA-TEMP at SUPSPT.
When the OA-TEMP is less than ECONSPT, the supply air temperature control sequence is in the mechanical and 'free' cooling mode. The OA-DMPR is fully opened and CLG-VLV is modulated to maintain DA-TEMP.

Figure A4. Supply air temperature control sequence.

As the OA-TEMP drops, the need for mechanical cooling is eliminated (CLG-VLV modulates to fully closed), thereby switching the control sequence to the 'free' cooling mode. In this mode, DA-TEMP is maintained by modulating OA-DMPR. If the OA-TEMP continues to decrease, a point is reached where the OA-DMPR is in the minimum position and mechanical heating is required, causing the control sequence to switch to the mechanical heating mode. During the mechanical heating mode, the HTG-VLV is modulated to maintain DA-TEMP.

--VAV control sequences

The RM-TEMP is maintained above RMHTGSP and below RMCLGSP. This is accomplished using either a VAV unit and/or an FCU. If a VAV unit is used, there are two methods of reheat available, electric and hydronic. The control sequence used for each method is described below. The first control sequence describes the control logic when a VAV unit is used to maintain RM-TEMP. If RM-TEMP is above RMCLGSP, the VAV-DMPR is opened to bring in more supply air and cool the test room. The air flow rate entering the test room may be varied between a minimum value (OCC-MIN), determined either by indoor air quality or equipment limitations, and the maximum (OCC-MAX) rated flow rate for the VAV unit. The air flow rate is determined from the output of a control algorithm (normally a proportional-integral-derivative (PID) loop). This control sequence assumes that the DA-TEMP is lower than RM-TEMP. If RM-TEMP drops below RMHTGSP and the hydronic reheat coil is selected, the VAV-HVLV is modulated to keep RM-TEMP near RMHTGSP. The VAV-HVLV can vary anywhere from fully closed to fully open.
The position of the heating valve is determined from the output of a control algorithm. If an electric reheat coil is selected instead of a hydronic reheat coil, the first stage of electric reheat is turned on when the output of the control algorithm is greater than 10%. If the temperature continues to drop after the first stage of electric reheat is turned on, and the output of the control algorithm becomes greater than 20%, the second stage of electric reheat is turned on. Again, if the temperature continues to drop after the first and second stages of electric reheat are on, and the output of the control algorithm becomes greater than 30%, the final stage of electric reheat is turned on. The stages of electric reheat are turned off in the same manner as they are turned on, with a 5% load differential: the third stage of electric reheat is turned off when the control algorithm output drops below 25%, the second stage is turned off when the control algorithm output drops below 15%, and the first stage is turned off when the control algorithm output drops below 5%.

Faults

Requirements for this project stipulated a minimum of six faults to be investigated, with at least two degradation faults and at least one fault in each of the three AHU subsystems (air-mixing, filter-coil, and fan). Table A2 shows seven faults consistent with these requirements and their method of implementation for AHU-A and AHU-B. Table A3 indicates that each fault was implemented in at least two of the three test periods, held during the summer, late winter, and spring seasons. Fault magnitudes were established during an initial period when the FDD method, the methods for introducing faults, and the HVAC systems were all commissioned. Magnitudes were such that it was anticipated that the FDD method would be able to detect the middle level of degradation faults, with the lowest level more of a challenge. Fault magnitudes were reproducible across test periods.
While this would not likely be the case in a typical building, it provided a firmer basis for evaluating the FDD methods, in what is among the first extensive field tests of such methods. HVAC system commissioning consisted primarily of sensor calibration and establishing standard system operating configurations, important in a flexible research facility where systems were often changed to meet the needs of a number of research programs. This proved to be a major task for the test-building staff, encompassing fan control algorithms, isolation of the thermal-storage tank (which provided a thermal capacitance that interfered with analysis of chiller cycling periods), and operating schedules for HVAC equipment and false loads in test rooms.

Table A2. Method of implementation of faults.

Air Mixing Section
- Stuck-closed recirculation damper (abrupt): application of a control voltage from an independent source to maintain the damper in the closed position.
- Leaking recirculation damper (degradation): removal of the recirculation-damper seals, with one seal removed for the first fault stage, two for the second, and all seals for the third stage.

Filter-Coil Section
- Leaking cooling coil valve (degradation): manual opening of a coil bypass valve.
- Reduced coil capacity, water-side (degradation): manual throttling of the cooling-coil balancing valve, to 70%, 42%, and 27% of the maximum coil flow of 1.7 l/s (27.5 gpm) for the three fault stages.

Fan
- Drifting pressure sensor (degradation): introduction of a controlled leak in the pneumatic signal tube from the supply-duct static-pressure sensor to the transducer, to a maximum reduction of 225 Pa (0.9 in. H2O).
- Unstable supply fan controller (abrupt): introduction of alternative gains for the PID controller that adjusts fan speed to regulate static pressure.
- Slipping supply-fan belt (degradation): adjustment of fan-belt tension to reduce maximum fan speed by 15% at 100% control signal for the first stage and 20% for the second stage; the third stage has an extremely loose belt with variable fan speed.

Table A3. Faults introduced during the blind-test periods. Each of the seven faults listed in Table A2 was introduced in at least two of the three blind-test periods (summer, winter, and spring).

Table A4. A listing of faults that can be associated with a given electrical-power signature.

Polynomial correlation of supply-fan power with supply airflow; possible faults causing a deviation between predicted and measured electrical power:
- change in airflow resistance, possibly due to stuck air-handler dampers or air-side fouling of heating or cooling coils
- static-pressure sensor error (affects the portion of fan power due to static pressure)
- flow sensor error
- power transducer error
- change in fan efficiency, caused by a change in blade type or pitch, or use of a VFD in lieu of inlet vanes
- change in motor efficiency

Polynomial correlation of supply-fan power with supply-fan speed control signal:
- slipping fan belt
- disconnected control loop (fan speed differs from control signal)
- power transducer error
- change in fan efficiency
- change in motor efficiency

Polynomial correlation of chilled-water pump power with cooling-coil control valve position control signal:
- change in water flow resistance, possibly due to constricted cooling-coil tubes or piping
- disconnected control loop
- power transducer error
- change in pump efficiency
- change in motor efficiency

Detection of change in cycling frequency for two-stage reciprocating chiller:
- leaky cooling-coil valve
- leaky recirculation damper

Detection of power oscillations:
- unstable local-loop controller

Sensors

Sensors required by the FDD method based on power analysis are listed in Table A5.

Table A5. Sensors used for FDD with power analysis.
Polynomial correlation of supply-fan power with supply airflow; required sensors (typically installed?):
- supply-fan electrical power (no)
- supply airflow (yes, if volume-tracking control is used for the return fan)
- supply-duct static pressure, training only (yes)

Polynomial correlation of supply-fan power with supply-fan speed control signal:
- supply-fan electrical power (no)
- supply-fan speed control signal (yes, if VFD)

Polynomial correlation of chilled-water pump power with cooling-coil control valve position control signal:
- chilled-water-pump electrical power (no)
- cooling-coil valve position control signal (yes)

Detection of change in cycling frequency for two-stage reciprocating chiller:
- chiller electrical power (no)
- cooling-coil valve position control signal (yes)
- outdoor temperature (yes)

Detection of power oscillations:
- fan and pump electrical power (no)

Table A6. Required values in the application of the FDD method in the tests.

Fan-power correlations with airflow and speed-control signal
- Maximum deviation of static pressure from set point for training data: 25 Pa (0.1 in. H2O)
- Confidence level to establish boundary between normal and faulty data: 90%
- Airflow boundary to distinguish stuck-closed recirculation damper from static-pressure offset/drift: 500 cfm
- Fan power at 100% speed, below which a slipping-fan-belt fault was considered, subject to a minimum time duration*: 1 kW
- Time duration for low fan power at 100% speed, above which a slipping-fan-belt fault was flagged: 3 one-minute power samples

Pump-power correlation with cooling-coil valve position-control signal
- Valve-position control signal above which pump-power data were analyzed for a cooling-coil capacity fault**: 40%
- Measured normal-operation power level of the secondary chilled-water pump: 400 W
- Minimum decrease of pump power below the normal-operation value, in excess of which a coil-capacity fault was flagged: 10 W
- Confidence level to establish boundary between normal and faulty data: 90%

Chiller-cycling analysis
- Power level above which the chiller is considered to be operating in the low-power stage: 4 kW
- Cycling interval when the cooling-coil valve control signal is at 0%, below which a leaky-valve fault is flagged: 35 minutes
- Normalized outdoor-air temperature, below which chiller cycling is analyzed to detect a leaky recirculation damper***: 0.2

Power-oscillation analysis
- Size of sliding window for averaging one-minute power data from submeters: 5 samples
- Standard deviation of power signal above which a fault is flagged, as a percentage of average power: 15%

* Fan-power analysis at 100% speed was used in AHU-A and B to detect the slipping fan belt. For AHU-1 this approach was replaced by the more rigorous and sensitive polynomial correlation of fan power with speed control signal.
** Pump-power analysis relative to a measured and near-constant normal-operation value was used in AHU-A and B to detect the coil-capacity fault.
*** The normalized outdoor-air temperature is the difference between the outdoor-air temperature and the supply air temperature set points, normalized by the difference between the supply and room air temperature set points.
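The power-oscillation analysis parameterized in Table A6 (a 5-sample sliding window for averaging one-minute power data, and a fault flagged when the standard deviation exceeds 15% of average power) can be sketched as follows. Whether the standard deviation is taken over the smoothed series is an assumption made here, and the example data are illustrative, not measurements from the test building.

```python
# Sketch of the power-oscillation test of Table A6: one-minute submetered
# power samples are smoothed with a 5-sample sliding average, and a fault is
# flagged when the standard deviation exceeds 15% of average power. Applying
# the test to the smoothed series is an assumption for this illustration.
from statistics import mean, pstdev

def oscillation_fault(power_w, window=5, rel_stdev_limit=0.15):
    """Flag an oscillation fault in a list of one-minute power samples (W)."""
    if len(power_w) < window:
        return False
    # 5-sample sliding average to suppress single-sample noise spikes.
    smoothed = [mean(power_w[i:i + window])
                for i in range(len(power_w) - window + 1)]
    avg = mean(smoothed)
    return avg > 0 and pstdev(smoothed) > rel_stdev_limit * avg

steady = [1000, 1010, 990, 1005, 995, 1000, 1002, 998]
cycle = [1000, 1235, 1380, 1380, 1235, 1000, 765, 620, 620, 765]  # hunting
hunting = cycle * 2  # two ~10-minute hunting periods
print(oscillation_fault(steady), oscillation_fault(hunting))  # False True
```

Note that the 5-sample averaging attenuates oscillations with periods near the window length, so the test is most sensitive to the multi-minute hunting typical of an unstable local-loop controller rather than to sample-to-sample noise.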