Detection and Diagnosis of Faults and Energy Monitoring
of HVAC Systems with Least-Intrusive Power Analysis
by
Dong Luo
M. E., Thermal Engineering, 1991
Tianjin University
Submitted to the Department of Architecture
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy in
Architecture: Building Technology
at the
Massachusetts Institute of Technology
February 2001
@ 2001 Massachusetts Institute of Technology. All rights reserved.
Signature of Author
Department of Architecture
January 12, 2001
Certified by
Leslie K. Norford
Associate Professor of Building Technology
Thesis Supervisor
Accepted by
Stanford Anderson
Chairman, Departmental Committee on Graduate Students
Head, Department of Architecture
Thesis Committee:
Leon R. Glicksman, Professor of Building Technology
Qingyan Chen, Associate Professor of Building Technology
Detection and Diagnosis of Faults and Energy Monitoring
in HVAC Systems with Least-Intrusive Power Analysis
by
Dong Luo
Submitted to the Department of Architecture
on January 12, 2001 in partial fulfillment of the
requirements for the Degree of Doctor of Philosophy in
Architecture: Building Technology
ABSTRACT
Faults indicate degradation or sudden failure of equipment in a system. Widespread
in heating, ventilating, and air-conditioning (HVAC) systems, faults lead to inefficient
energy consumption, undesirable indoor air conditions, and even damage to the
mechanical components. Continuous monitoring of the system and analysis of faults and
their major effects are therefore crucial for identifying faults at an early stage and
making repair decisions. This requires a fault detection and diagnosis (FDD) method
that is not only sensitive and reliable but also causes minimal interruption to the
system's operation at low cost. However, current FDD work either relies on additional
sensors for component-specific information, which introduces unacceptable cost and
interruption to normal operation during sensor installation, or relies on black-box
modeling, which requires a long period of parameter training.
To solve these problems, this thesis first defines and makes major innovations to a
change detection algorithm, the generalized likelihood ratio (GLR), to extract useful
information from the system's total power data. Then in order to improve the quality of
detection and simplify the training of the power models, appropriate multi-rate sampling
and filtering techniques are designed for the change detector. From the detected
variations in the total power, the performance at the system's level is examined and
general problems associated with unstable control and on/off cycling can be identified.
Using information that is basic to common HVAC systems, power functions are
established for the major components, which help to obtain more reliable detection and
more accurate estimation of the systems' energy consumption. In addition, a method for
the development of expert rules based on semantic analysis is set up for fault diagnosis.
Power models at both system and component levels developed in this thesis have
been successfully applied to tests in real buildings and provide a systematic way for FDD
in HVAC systems at low cost and with minimal interruption to systems' operation.
Thesis Supervisor: Leslie K. Norford
Title: Associate Professor of Building Technology
DEDICATION
To my family
who are always behind me
Acknowledgements
I would like to thank Professor Leslie Norford, my thesis advisor, for his guidance
and support throughout my Ph.D. study. Also, I want to thank Professor Leon Glicksman
and Professor Qingyan Chen, my thesis committee, for their valuable advice on my
thesis.
I wish to express my special gratitude to my family for their love and support.
CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
CONTENTS

1. INTRODUCTION
1.1 Background of this study
1.2 Model-based fault detection and diagnosis in HVAC systems
1.3 Approaches to solve the problems of current FDD models
1.4 Aims of this thesis

2. THEORY OF CHANGE DETECTION
2.1 Review of the non-intrusive load monitoring
2.2 Steady-state change detection
2.3 Theory of abrupt change detection
2.4 Summary

3. SYSTEM POWER MODELING FOR FAULT DETECTION AND DIAGNOSIS
3.1 Introduction of on-line change detection in power data of the HVAC systems
3.2 Modification of the GLR: one-window equation vs. two-window equation
3.3 Training of the parameters for the GLR detector
3.4 Improvements of the detection algorithm
3.5 The median filter
3.6 Change + oscillation detector
3.7 Monitoring of the total power data - multi-rate vs. single-rate sampling
3.8 Summary of the training guidelines
3.9 Application of the GLR model in fault detection with system's total power input
3.10 Results and discussion

4. COMPONENT POWER MODELING FOR FAULT DETECTION AND DIAGNOSIS
4.1 Introduction
4.2 Component power modeling by correlation with basic measurements or signals
4.3 Error analysis - confidence intervals of the estimated models
4.4 Fault detection and diagnosis of HVAC systems with submetered power input
4.5 Application of the power models in fault detection with submetered power input
4.6 Discussion and conclusions

5. MONITORING OF COMPONENTS THROUGH THE TOTAL POWER PROFILE
5.1 Introduction
5.2 Parameter selection for component modeling with system power monitoring
5.3 Analysis of the detected changes for component modeling
5.4 Application of the gray-box model in fault detection and energy estimation
5.5 Discussion and conclusions

6. DIAGNOSIS OF FAULTS BY CAUSAL SEARCH WITH SIGNED-DIRECTED-GRAPH RULES
6.1 Introduction
6.2 Fault diagnosis by shallow reasoning with system's total power input
6.3 Deep knowledge expert system
6.4 Diagnosis based on causal search of fault origin
6.5 SDG rule development for fault diagnosis with power input of components
6.6 Rules for detection and diagnosis of typical faults in common air handling units
6.7 Discussion and conclusions

7. SUMMARY
7.1 Review
7.2 Achievements and future work

REFERENCES
NOMENCLATURE
APPENDIX: DESCRIPTIONS OF THE TEST SYSTEM AND THE FAULTS
CHAPTER 1
Introduction
1.1 Background of this study
Faults in an HVAC system indicate that some components are not running
properly according to the design intent. Faults can be divided into two general groups
based on the abruptness of the occurrence: gradual degradation and abrupt failure of
components [Annex 25]. A degradation fault emerges after some time of operation and
requires calibration or adjustment when the error exceeds a threshold, such as the static
pressure sensor offset in the fan-duct loop. A failure means the equipment suddenly stops
working and usually needs immediate maintenance or replacement to resume the normal
operation of the system, e.g., a stuck damper. In general, an abrupt fault is easier to detect
because it results in sudden failure of equipment and obvious changes of the monitored
parameters. However, this does not necessarily mean that an abrupt fault is easier to
diagnose. Unacceptable deviations in the monitored parameters caused by a degradation
fault are determined with reference to thresholds that must be evaluated by cost-benefit
analysis. The potential benefits of detecting and correcting a fault include energy savings,
desirable indoor air quality, and minimal interruption to the normal equipment operating
schedules.
By definition, any kind of fault has some undesirable effect on the system, including
inefficient use of energy, an uncomfortable working environment, and even
damage to the mechanical components. Excess energy consumption of components
caused by a fault may range from several percent to several times the design value,
depending on the severity of the fault [Luo et al. 2001; Kao and Pierce 1983]. A fault
may not directly affect the indoor air quality of the conditioned space since the control
system sometimes can compensate for the loss of the conditioning capacity. But this is
often achieved at the expense of excessive energy consumption. For example, an offset of
the static pressure sensor increases the speed of the supply fan and causes more energy to
be wasted with the VAV dampers closing up in order to meet the indoor air temperature
setpoint. If a fault cannot be compensated by the control system itself, the working
environment will become uncomfortable or even unhealthy, such as the insufficient fresh
air supply caused by a stuck-closed outdoor air damper. Sometimes, a fault may lead to
quick damage of components. A typical case is the unstable control of a component,
which may quickly worsen the wearing of the mechanical parts due to the oscillatory
control signal.
In practice, faults exist widely in operating HVAC systems. The performance of
many HVAC systems is affected by various types of faults that may result from poor
installation, commissioning, and maintenance. A recent study has indicated that 20-30%
of energy savings in commercial buildings can be achieved by re-commissioning of the
HVAC systems to rectify faulty operation [Annex 25]. Current building energy
management systems are widely used for automation of HVAC systems' operation and
for prevention of critical faults. However, manufacturers offer very few tools for
detection and diagnosis of defects that cause faulty operation. Therefore, it is very
important to find an effective method to minimize the negative effects of faults.
Fault detection aims to determine whether the observed behavior of the monitored
object deviates beyond an acceptable range from the expected performance. Fault
diagnosis finds which of the possible causes is most consistent with the observed
faulty conditions.
For engineering systems with zero tolerance, as in the control system of a nuclear
reactor, a fault is usually detected and immediately removed by hardware redundancy
with the faulty part or sensor substituted by another one and an alarm is issued. Hardware
redundancy involves the use of multiple components for the same purpose. Faults are
determined by majority rules from the voting among the redundant sensors. Although
physically more reliable, this technique is expensive, bulky, and limited in ability [Rossi
et al. 1996].
The alternative is analytical redundancy that utilizes for FDD the inherent
relationships existing between a system's inputs and outputs. These relationships can be
described with mathematical models or with collections of rules. For less critical systems
like the ordinary HVAC systems, hardware redundancy is rarely used due to the extra
cost and the additional space required to accommodate the equipment. Therefore, in order
to keep track of the operation of the system, analytical models must be established.
Analytical models for fault detection and diagnosis can be implemented in two
steps: estimation and classification. Estimators are generally mathematical models based
on either physical laws or black-box estimation. They can generate two types of features,
residuals (or innovations) and parameters, depending on the monitored output. A residual
is the difference of state estimates between the plant and the nominal models. Model
parameters, derived from state variables, are compared between estimates of the current
and fault models for given inputs. Classifiers can be a system based on either expert
knowledge or artificial neural networks. An expert system is a machine that emulates the
reasoning process of a human expert in a particular domain [Rossi et al. 1996]. It is
comprised of two important parts: a knowledge base and an inference engine. The
knowledge base contains expert knowledge about the domain. The inference engine
combines data about a particular problem with the expert knowledge to provide a
solution. Expert knowledge can be expressed as production rules, or stored as a collection
of a priori and conditional probabilities in statistical pattern recognition classifiers, or set
up as semantic networks. Evaluation made in the classifier is often based on the criteria
of economy, comfort, safety, and environmental hazard.
As shown by the diagram in Fig. 1.1, estimators take the plant inputs as control
signals or measurements from sensors, perform quantitative operations, and generate
simplified features for classification. Classifiers operate on the residuals or parameters
based on appropriate logic with thresholds and rules. Then they will either issue alarms if
the deviation is not acceptable and give analysis of some possible causes of the fault or
keep a log of the operating status if no abnormal difference is observed.
Figure 1.1. Diagram of fault detection and diagnosis with analytical model.
Consisting of measurements and/or control signals, the vector X represents the
inputs to both the plant and the estimator. Y and Ŷ are vectors of the plant outputs
and the model predictions, respectively.
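As a minimal sketch of this residual-based scheme (the power values and the 1 kW threshold below are hypothetical, not values from this thesis), a detector subtracts the model prediction Ŷ from the plant output Y and raises an alarm wherever the residual exceeds the acceptable deviation:

```python
import numpy as np

def detect_fault(measured, predicted, threshold):
    """Return the residual series and a boolean alarm mask where the
    absolute residual exceeds the acceptable deviation."""
    residual = np.asarray(measured, float) - np.asarray(predicted, float)
    return residual, np.abs(residual) > threshold

# Hypothetical example: a 2 kW step at sample 3 against a flat 10 kW prediction.
measured = np.array([10.1, 9.9, 10.0, 12.2, 12.1])
predicted = np.full(5, 10.0)
residual, alarms = detect_fault(measured, predicted, threshold=1.0)
print(alarms.tolist())  # [False, False, False, True, True]
```

In a real classifier the threshold would come from the logic and rules described above rather than a fixed constant.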
1.2 Model-based fault detection and diagnosis in HVAC systems
1.2.1 Methods of fault detection and diagnosis
Three types of models may be used for fault detection and diagnosis: plant,
reference, and fault models. A plant model keeps a record of the current operating
features of the system, a reference (nominal) model provides the expected performance
of the system under normal operation for given inputs, and a fault model represents how
the system would respond to given inputs if a specific fault occurred. Fault detection can
be achieved by comparing the outputs of the plant model against those of the reference
model. Diagnosis can be conducted by selecting the fault model whose features are
closest to the current operation.
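This selection step can be sketched as follows; the candidate fault models and their predicted power traces are invented for illustration only. The diagnosis picks the model whose output is closest, in a least-squares sense, to the observed plant output:

```python
import numpy as np

def diagnose(plant_output, model_outputs):
    """Return the name of the candidate model whose predicted output is
    closest (least squared error) to the observed plant output."""
    errors = {name: float(np.sum((plant_output - pred) ** 2))
              for name, pred in model_outputs.items()}
    return min(errors, key=errors.get)

# Hypothetical power traces (kW) for one operating condition.
plant = np.array([10.0, 12.0, 12.0])
candidates = {
    "normal":        np.array([10.0, 10.0, 10.0]),
    "sensor offset": np.array([10.0, 12.1, 11.9]),
    "stuck damper":  np.array([14.0, 14.0, 14.0]),
}
print(diagnose(plant, candidates))  # sensor offset
```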
There are two basic types of reference models [Benouarets et al. 1994]: physical
and black-box models. Physical models are mainly established from analysis of the
system based on first principles, though some empirical relationships may be assumed.
Black-box models are empirical models that do not use any prior knowledge of the
physical processes. They can only be constructed with training data generated from the
monitored system itself or by simulation of the system.
Fault detection and diagnosis based on physical laws have been studied and used
for simulation since the early 1980s and progressed rapidly during the 1990s, especially
within the framework of International Energy Agency (IEA) Annex 25. Pape et al.
[1991] developed a methodology for fault detection in HVAC
systems based on optimal control using simulation, in which deviations from the optimal
performance are detected by comparing the measured system power against the power
predicted with the optimal strategy. Haves et al. [1996] described a condition monitoring
scheme used to detect the presence of valve leakage and waterside coil fouling within the
cooling coil subsystem of an air handling unit based on first-principle models.
The major advantage of a physical model is that the prior knowledge and hence
the better understanding of the physical process enable the model to be extrapolated to
regions of the operating range where no training data are available. Also the parameters
in a physical model are usually related to physically meaningful quantities, which not
only makes it possible to estimate the values of the parameters from the design
information and manufacturers' catalogs but also provides an opportunity for the
observer to associate the abnormal values of parameters with the presence of a specific
fault. However, physical models often require installation of additional sensors for
detailed information about the monitored components in the system, which is not only
limited by the extra cost of sensors but also leads to interruption to the system's
operation. Moreover, a large set of nonlinear differential and algebraic equations need to
be set up to define the system's behaviors because HVAC systems are often comprised of
a number of subsystems and each may exhibit time-varying and/or nonlinear
characteristics. For example, a detailed description of the dynamics of a typical five-zone
commercial HVAC system requires on the order of 1,000 differential and algebraic equations
[Kelly et al. 1984]. This often makes the solution prohibitively time-consuming, and yet
the results are often not satisfactory. In addition, the parameters of this dynamical
description generally vary with load, weather, and building occupancy, such as the heat
transfer coefficient of a cooling coil. These limitations make it almost impossible to use
complete physical models for real-time fault detection and diagnosis in practice.
Two major methods have been used for black-box models during the past ten
years: ARX (autoregressive with exogenous input) or its extension ARMAX
(autoregressive moving average with exogenous input), and ANN (artificial neural
network) models. Lee et al. [1996] examined ARMAX/ARX models for a laboratory VAV unit
with both SISO (single-input/single-output) and MISO (multi-input/single-output)
structures, with model parameters determined using the Kalman filter recursive
identification method. Peitsman and Soethout [1997] established a real-time ARX model
for a simulated VAV system. Li et al. [1996] demonstrated the feasibility of using ANNs
for FDD of a specific heating, ventilating, and air-conditioning (HVAC) system.
Peitsman and Bakker [1996] demonstrated that ANN models fit better than ARX models
for nonlinear systems after studying several faults in a laboratory chiller and a simulated
VAV system with both MISO ARX models and ANN models.
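As a rough sketch of ARX identification (an ordinary least-squares fit on synthetic, noise-free data, not the Kalman-filter recursive method of Lee et al.), a first-order model y(t) = a1·y(t-1) + b1·u(t-1) can be fitted by stacking lagged regressors:

```python
import numpy as np

def fit_arx(y, u, na=1, nb=1):
    """Ordinary least-squares fit of an ARX model:
    y(t) = sum_i a_i*y(t-i) + sum_j b_j*u(t-j)."""
    n = max(na, nb)
    phi = [[y[t - i] for i in range(1, na + 1)] +
           [u[t - j] for j in range(1, nb + 1)] for t in range(n, len(y))]
    theta, *_ = np.linalg.lstsq(np.array(phi), y[n:], rcond=None)
    return theta  # [a_1..a_na, b_1..b_nb]

# Synthetic first-order process y(t) = 0.8*y(t-1) + 0.5*u(t-1).
rng = np.random.default_rng(0)
u = rng.standard_normal(200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * y[t - 1] + 0.5 * u[t - 1]
print(np.round(fit_arx(y, u), 3))  # [0.8 0.5]
```

Because the regressors are linear in the parameters, the fit stays cheap and robust even when the underlying plant is nonlinear, which is the computational advantage noted below.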
Black-box models have been studied and tested with simulation and
laboratory units in many recent research projects because such methods do not require
detailed knowledge of the system and therefore do not require the user to know much
about the physical processes. Also, linear-in-parameters forms can be selected to make
parameter estimation more computationally efficient and more robust even if the system
itself is nonlinear. However, with no physical meaning in its parameters, the results of a
black-box model cannot be reliably extrapolated. This drawback greatly limits the use of
a black-box model because faulty conditions are often beyond the range of normal
expectations, which makes it impossible to train the model's parameters in the absence
of the faults in the system. Meanwhile, the opaque nature of a black-box model makes it
difficult to understand and keep track of the performance of the model itself. For
example, an improper guess of the initial values of the weight factors and offsets of a
neural network often results in an inappropriately trained network [Peitsman et al. 1996].
For such reasons, black-box models for fault detection and diagnosis in HVAC systems
are still largely in the experimental phase.
In addition to the problems of the algorithms themselves, most FDD techniques
are based on the examination of some controlled parameters against their setpoints. In
modern HVAC control systems, a hierarchical structure is always used for the control
logic in commercial applications and variables are often controlled in local feedback
loops. A typical application is the zone temperature control with a VAV box. Offsets in
the controlled variables can usually be compensated or cancelled in the local control loop,
leaving to the upper level of the system the task of handling the error at excessive energy
cost. If no further supervision of the system's operation is available, then the effects can
only be seen in some uncontrolled variables as the ultimate outcome, such as the power
input of a system or a component. For example, a static pressure sensor offset can lead to
considerable excess power consumption while the temperature of the room air
or other controlled points may still be kept within the normal range. A stuck-closed
outdoor air damper in an HVAC system not only causes significant waste of energy but
also results in unhealthy environment during the transitional period when it is expected to
be wide open. But the faulty operation may never be found by the measurements of the
indoor air temperature or other controlled parameters. This indicates that traditional FDD
techniques based on the controlled variables are often insensitive to the existence of the
malfunctions in the system and hence fail in estimating the outcome of the defects for
appropriate maintenance or repair. In practice, energy waste due to neglect of faults in
a system may become significantly high, especially after a long period of operation
when a degradation fault occurs unavoidably and keeps worsening the energy efficiency
of the system without notice. Moreover, in the diagnosis of a fault, the decision for
further actions needs to be based on the major evaluation factors at the system level, such
as the cost vs. benefit effect, rather than the monitored local variables.
The above review reveals five major problems associated with the current
methods used for fault detection and diagnosis in HVAC systems:
1. Extra costs and interruption to the system's operation caused by installation of
additional sensors for detailed information about the monitored components required
by the physical models;
2. Computational load imposed by the physical models;
3. Lack of capacity for model manipulation and the uncertainty in output extrapolation
of the black-box models;
4. Insensitivity to fault existence that may cause inefficient operation and/or undesirable
indoor air quality when using controlled variables as the primary detection indices;
5. Insufficient information for cost-conscious decisions.
1.3 Approaches to solve the problems of current FDD models
To solve the above problems, the FDD model should be able to make full use of
the available information, issue timely alarms or reports about the system's status even if
the design specifications for the controlled variables are met, and produce physically
meaningful output and the cost effects of faults for further maintenance if necessary. For
real-time FDD applications, the model also needs to be easily incorporated into the
present control system, which indicates that the FDD model should rely on as little
information as possible in addition to the signals readily available from the building
energy management and control system, such as the control signal for the supply fan
speed.
So far, limited research has been carried out to address such issues. Seem et al.
[1999] described a detection algorithm to reduce the data load for online computing.
Rossi et al. [1997] described a statistical rule-based fault detection and diagnostic method
for vapor compression air conditioners, which uses only temperature and humidity
measurements. However, such work only improves the data communication problem but
does not make fundamental changes in the modeling itself. In order to minimize the
intrusion into the system, a more efficient method needs to be found at system level.
In this research, on the basis of tests and observations of the relationship between
the system performance and the data trend of the electrical power input of buildings, it
has been found that as the ultimate reflection of the system's energy cost, power
consumption can be used to detect faults regardless of the status of the controlled
variables in the system. Also, FDD based on power consumption directly demonstrates
the effects of the current operating conditions on the system's cost and hence enables the
decision for appropriate maintenance. In addition, measurements of power consumption
only involve the electric circuits of the energy system and impose little intrusion into the
system's mechanical structure and hence introduce minimal interruption to the system's
operation.
In this thesis, with power consumption as the major criterion for detection and
diagnosis of faults, a gray-box model is developed with statistical estimation of the
system's total power as well as with power functions of the major components through
fundamental analysis of the system's structure and control logic. In addition, with the
power model as the primary fault indicator, an effective method for the causal search of
fault origin is established with semantic analysis of the general physical processes of
common HVAC systems. With the structured inference logic, the expert rules for the
identification of a fault can be easily developed for any fault in a given system.
1.4 Aims of this thesis
In this research, a statistical model is first developed using the total electrical
power input of a system to identify the changes in the data trend. With the detected
changes, abnormal patterns of power consumption of the major components can be
identified with reference to the design information. Models at a more detailed level based
on submetered measurements of power and the related variables are also developed for
detection. Then, the feasibility of modeling a component's power input with the detected
changes in the total power series and basic measurements or control signals is studied and
an effective method is proposed for component power models by appropriate
combination of the change detection and the component modeling techniques. The
resulting function can be used to detect faults of the component and, in return, help to
improve the resolution of detection at the system level. In addition, the power function
provides an efficient tool for accurate energy estimation and appropriate evaluation of the
system and the components as well. With the total power logger and other signals
inherent to the common control systems, this method yields timely response to abnormal
changes in the system with little interruption to the system's operation. The power
models developed in this thesis can be applied to real building systems under different
load conditions.
With the deviations found by the power models, a knowledge-based approach is
proposed from semantic analysis of typical faults in common HVAC systems. To find the
undesirable operation at an early stage before the deterioration of the control quality, an
inference structure with power input as the end node of a digraph is developed.
Innovations are introduced for the inference logic to allow progressive alarms with
violations. Modifications of the rules can be easily implemented for the application of the
inference structure in different systems.
This thesis is organized as follows. Chapter 2 introduces the theory of change
detection. Chapter 3 develops the algorithm for detection of abnormal power input of
HVAC systems. Chapter 4 describes the principle of fitting algorithm and component
power modeling with submeters. Chapter 5 develops the gray-box model by applying
detection at system level to component power modeling. Chapter 6 explores the digraph-based rule development technique for diagnosis of faults in HVAC systems. Chapter 7
summarizes the whole thesis.
CHAPTER 2
Theory of Change Detection
This chapter motivates the development of a steady-state change detector for
applications in both residential and commercial buildings. Electrical power data trend and
algorithms for abrupt change detection in common HVAC systems will be examined.
First, the concept of non-intrusive load monitoring (NILM) is introduced and change
detection in the power series of HVAC systems under steady state and transient period is
discussed. The steady-state method is selected not only for its smaller computing
load but also for its capability of dealing with the noisy power environment.
Moreover, the steady state method itself is a complete load monitoring system while the
transient state method is not. Second, characteristics of power profiles in both residential
and commercial buildings are illustrated and motivations of new algorithms for change
detection of power in commercial applications are proposed. Finally, the necessity and
feasibility of the new methods for steady-state change detection of power consumption in
commercial buildings are discussed. Then the algorithm based on the generalized
likelihood ratio (GLR) is selected according to the available information of typical
HVAC systems in practice.
2.1 Review of the non-intrusive load monitoring
Monitoring changes in power data can determine if the designated equipment has
been turned on or off. Also, by analyzing the characteristics of the change, abnormalities
or faults may be identified. Conventional load monitoring requires a separate meter for
the motor of each component of interest, which makes it expensive and impractical,
as a real system is usually driven by more than one motor. A non-intrusive load
monitor is designed to monitor an electrical circuit that contains a number
of devices which switch on and off independently. By a comprehensive analysis of the
current and voltage waveforms of the total load, the NILM estimates the number and
nature of the individual loads and other relevant statistics. The key feature of the NILM
compared to conventional load monitors is that no access to the individual components is
necessary for sensor installation and measurements. For a system with multiple
monitored equipment, the use of NILM may significantly reduce the cost and the
interruption to the system's operation caused by the installation of separate power meters.
There are two general ways to conduct change detection, steady state detection
and transient detection. The steady state change detection determines the magnitude of
the power change caused by on/off events while transient change detection analyzes the
dynamic patterns of startup signals.
Transient detection has been primarily developed for discrimination between two
changes with equal amplitude that can hardly be discriminated by the steady state method
if no additional information about the events is known in advance [Leeb 1992] [Norford
and Leeb 1996]. Although this method seems to be able to yield earlier response than the
steady state method and can distinguish some nearly overlapping events [Hart 1992]
[Norford and Leeb 1996], it suffers several major drawbacks. First, transient signals
cannot be added, which makes it difficult to combine simultaneous events. Second, the
transient method only makes sense for startup detection and cannot provide useful
information about the equipment during operation or at shutdown, which also plays
an important role in both fault detection and energy estimation. If two pieces of
equipment are turned on within a short time interval, transient response can tell the order
of turn-on, but not the order of turn-off.
In addition, transient detection is easily degraded by the presence of
oscillation or noise in the data series. Furthermore, transient detection only recognizes the
known patterns of equipment startups and cannot discover unanticipated but potentially
interesting events. And last, the transient response usually imposes a heavy burden on the
computing facility for complex data processing with a data sampling rate of not lower
than 60 Hz to accommodate the fundamental frequency of voltage variation.
On the other hand, steady-state detection is capable of dealing with almost all of
the above problems, except that it cannot distinguish simultaneous events without other
information for reference. However, this is unlikely to be a critical issue, given that
fault detection systems will increasingly be incorporated into building energy
management systems, from which on/off signals can readily be obtained by the detector
for further discrimination.
Based on the above analysis, steady state detection was chosen as the basis of the
detection strategy in this research.
The definition of "steady state" should be noted. Steady state conditions should
not be interpreted so strictly that they rarely occur in practice. In this thesis, steady state
is recognized if there are no significant abrupt changes in the detection windows, but the
power data may still vary in the detection windows. This is because in commercial
buildings, fluctuations and variations in power data are unavoidable due to the
application of variable speed motor drives as well as the existence of noise. Therefore,
the power data are not constant even without on/off switches. This indicates that to obtain
reliable detection output, a reasonable threshold is needed to identify a significant signal
in the "steady state" environment.
2.2 Steady-state change detection
2.2.1 Power monitoring of residential buildings
One of the major characteristics of the power data of houses is that appliances are
generally driven by constant-speed motors with on/off control or stepwise finite states.
The noise effect is often negligible compared to the total power magnitude, as shown in
the following figure for a single-family house over a one-hour period, collected by
Hart [1992].
Figure 2.1. Total electrical power input vs. time of a single-family house.
By assuming time-invariant complex power for each appliance, a list of all the
appliances in the system can be set up from the nameplates or the specifications from the
manufacturers. Change detection then becomes edge identification and can be
accomplished by combinatorial optimization based on the following equations, with
appropriate clustering techniques if necessary [Hart 1992].
$$P_p(t) = \sum_{i=1}^{n} a_i(t) P_{p,i} + e(t) \qquad (2.1)$$

$$\hat{a}(t) = \arg\min_{a} \left\| P_p(t) - \sum_{i=1}^{n} a_i(t) P_{p,i} \right\| \qquad (2.2)$$

$$a_i(t) = \begin{cases} 1, & \text{if appliance } i \text{ is on at time } t, \\ 0, & \text{if appliance } i \text{ is off at time } t. \end{cases}$$
With the above equations and constraint, if the vectors of the p-phase loads P_{p,i} of
each component i are known and the measured total power vector P_p(t) is given at each
time t, the error e(t) is minimized by searching over the n-vector a(t).
Application of a steady state detector for a small house that contains several
constant-power appliances has been demonstrated in previous research [Hart 1992] by
detecting the edges in the power series.
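As an illustrative sketch (not Hart's implementation), the combinatorial optimization of Eqs. (2.1)-(2.2) can be carried out by brute force over all on/off state vectors; the appliance list and power values below are hypothetical:

```python
from itertools import product

def detect_states(p_total, p_appliances):
    """Find the on/off vector a minimizing |P(t) - sum_i a_i * P_i|.

    p_total: measured total power at time t (W)
    p_appliances: list of nameplate powers P_i for the n appliances (W)
    Returns (best_states, residual). Brute force: O(2^n), fine for small n.
    """
    best, best_err = None, float("inf")
    for a in product((0, 1), repeat=len(p_appliances)):
        err = abs(p_total - sum(ai * pi for ai, pi in zip(a, p_appliances)))
        if err < best_err:
            best, best_err = a, err
    return best, best_err

# Hypothetical appliance list: fridge 150 W, heater 1200 W, lamp 60 W
states, err = detect_states(1352.0, [150.0, 1200.0, 60.0])
```

With larger appliance counts, the exhaustive search would be replaced by the clustering techniques Hart describes, but the objective being minimized is the same.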
2.2.2 Power monitoring in commercial buildings
In commercial applications, the characteristics of the electrical power profile are
quite different and the usage patterns and types of equipment involved are more likely to
generate power quality problems. First, electric surges and spikes caused by the startup of
larger motors or the automatic setpoint adjustments of some components make the data
environment much noisier, as can be seen by comparing the power quality of a typical
house in Fig. 2.1 with that of a commercial building in Fig. 2.2 [Norford and Leeb,
1996]. Second, variable-speed motor drives, which lead to gradual power changes, are
very common in commercial buildings. Finally, even constant-speed motors consume
variable power in response to changing load conditions in commercial buildings.
It can be seen that with the presence of noise and variations in the power data of
commercial buildings as shown in Fig. 2.2, the edge detection algorithm used in the
residential case tends to cause unacceptable false alarm rates. Therefore, in order to
achieve desirable detection performance, more efficient and robust algorithms based on
statistical analysis are necessary.
Figure 2.2. Power data with 4 on/off events of a pump in a campus building.
2.3 Theory of abrupt change detection
In this section, methods for abrupt change detection are introduced and the GLR
algorithm is developed as a major detection tool in this thesis. The materials analyzed in
this section follow the development in [Shanmugan and Breipohl 1988] and [Basseville
and Nikiforov 1993].
2.3.1 Introduction of abrupt change detection
Abrupt changes are changes in the properties of the monitored object that occur
very fast with respect to the sampling period of the measurements. In an industrial
process, faults associated with such changes can be divided into two general categories.
The first is failures or catastrophic events that usually stop the operation and need to be
identified immediately. The second is smaller faults, sudden or gradual (incipient), which
affect the process without causing it to stop but whose detection is of crucial practical
interest in preventing the subsequent occurrence of more serious or catastrophic events.
Both types of faults can be approached in the abrupt change detection framework.
Detection of abrupt changes refers to tools that help to decide whether such a
change has occurred in the characteristics of interest. With the increasing complexity of
technological processes and availability of sophisticated information processing systems
over the last twenty years, applications of abrupt change detection have grown rapidly in
many areas, from critical applications like prediction of natural catastrophic events, e.g.,
earthquake and tsunami, to other major industrial processes, such as quality control,
vibration monitoring, and pattern recognition. A common objective of these applications
is to detect abrupt changes in some characteristic properties of the monitored object,
which usually can be described as the problem of detecting changes in the parameters of
a static or dynamic stochastic system at unknown time instants. The major challenge is
to identify intrinsic changes that are not necessarily directly observed and that are
measured together with other types of perturbations [Basseville and Nikiforov 1993].
The design of a change detection and diagnosis algorithm generally consists of
two major tasks:
a). Derivation of the sufficient statistics. With known mean values before changes, the
ideal residuals generated from the measurements should be close to zero. However, for
online detection in a dynamic system, the mean value or the spectral properties of these
residuals may change and in such cases, the generation of "residuals" is to derive the
sufficient statistics.
b). Design of decision rules based on these residuals. This involves designing the
convenient decision rule which solves the change detection problem as reflected by the
residuals.
In this thesis, a major task is to develop a parametric statistical tool for detecting
abrupt changes in the properties of discrete time signals from dynamic systems.
In the following sections of this chapter, a sequence of independent random
variables yi (i =1, 2, ... ) is studied with a probability density po which depends on
parameter 0. 0 is equal to 00 before the unknown change time te and becomes 01 after the
change.
On the basis of the requirements of applications and the corresponding
mathematical mechanisms, statistical change detection can be divided into three major
classes, on-line detection of a change, off-line hypotheses testing, and off-line estimation
of the change time.
On-line detection of change
The detection is determined by a stopping rule:
te = inf{n : gn(yi, y2,
..
.,yn) > h}
(2.3)
which declares a change at the first sample size n for which the decision function g_n
exceeds the threshold h. With a trained threshold, neither the mean values before and
after the change nor the time of change needs to be known in advance. The overall
criterion is to minimize the detection delay for a given mean time between false alarms.
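The stopping rule of Eq. (2.3) can be sketched as a simple loop; the decision function used here is a toy stand-in for the g_n developed later in this chapter, and the threshold value is hypothetical:

```python
def online_alarm(samples, g, h):
    """Return the first index n (1-based) at which g(y_1..y_n) > h, or None.

    g: decision function computed on the data seen so far
    h: trained threshold
    """
    history = []
    for n, y in enumerate(samples, start=1):
        history.append(y)
        if g(history) > h:
            return n
    return None

# Toy decision function: absolute deviation of the latest sample from the
# running mean of all earlier samples (hypothetical, for illustration only).
def g_dev(ys):
    if len(ys) < 2:
        return 0.0
    mean_prev = sum(ys[:-1]) / (len(ys) - 1)
    return abs(ys[-1] - mean_prev)

data = [10.0, 10.2, 9.9, 10.1, 15.0, 15.1]  # step change at the 5th sample
alarm_at = online_alarm(data, g_dev, h=2.0)
```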
Off-line hypotheses testing
With a given finite sample $y_1, y_2, \ldots, y_n$, the test is to verify one of two
hypotheses, "without change" and "with change":

$$H_0: \ \text{for } 1 \le j \le n: \quad p(y_j \mid y_{j-1}, \ldots, y_1) = p_{\theta_0}(y_j \mid y_{j-1}, \ldots, y_1)$$

$$H_1: \ \text{there exists an unknown } 1 \le t_e \le n \text{ such that:}$$
$$\text{for } 1 \le j \le t_e - 1: \quad p(y_j \mid y_{j-1}, \ldots, y_1) = p_{\theta_0}(y_j \mid y_{j-1}, \ldots, y_1)$$
$$\text{for } t_e \le j \le n: \quad p(y_j \mid y_{j-1}, \ldots, y_1) = p_{\theta_1}(y_j \mid y_{j-1}, \ldots, y_1) \qquad (2.4)$$
The criterion for this algorithm is to maximize the probability of H1 when H1 is
true and minimize the alarms when Ho is true. The estimation of the change time is not
required.
Off-line estimation of the change time
With the same hypotheses as above, the detection is used to find the time of the
change from Ho to H1 , with the assumption that H1 does happen in the finite time period
from 1 to N.
$$t_e = \inf \{\, j : p(y_j \mid y_{j-1}, \ldots, y_1) = p_{\theta_1}(y_j \mid y_{j-1}, \ldots, y_1) \,\}, \quad 1 \le j \le N \qquad (2.5)$$
The objective of this detection is to track more accurately the time of a given
event H1.
It can be seen from the above definitions that both off-line detection methods
require a priori knowledge about the mean values before and after the change.
In this thesis, the objective is to develop an online detection method, not only
because this can provide real-time monitoring but also due to the fact that the nonintrusive detection is based on the total power data of the HVAC or even the whole
building's power system. With variable patterns of multiple on/off switches in response
to the load conditions, it is difficult to know in advance the mean values before and after
a change. Meanwhile, the unknown time of change adds a third variable for detection.
This indicates that statistical estimates for the most likely values of those variables are
necessary for practical detection.
2.3.2 Algorithm of on-line change-of-mean detection
Algorithms for abrupt change detection, on-line or off-line, are based on an
important concept in mathematical statistics, namely, the logarithm of the likelihood
ratio, defined by
$$s(y) = \ln \frac{p_{\theta_1}(y)}{p_{\theta_0}(y)} \qquad (2.6)$$
With on-line detection, the system is continuously monitored and changes may
happen at any time without prior notification. Therefore, in the likelihood tests of change
detection with a random independent series, three parameters are involved, the mean
before and after the change and the time of change, denoted by 00, 01, and i (or te for a
continuous series) respectively. The log-likelihood ratio for a continuous series from y1 to
$y_2$ can be described as

$$S_{y_1}^{y_2}(\theta_0, \theta_1) = \int_{y_1}^{y_2} \ln \frac{p_{\theta_1}(y)}{p_{\theta_0}(y)} \, dy.$$

For a discrete sample from time $j$ to $k$ of a random sequence,

$$S_j^k = \sum_{i=j}^{k} \ln \frac{p_{\theta_1}(y_i)}{p_{\theta_0}(y_i)} \qquad (2.7)$$
For online tests with a complex dynamic system of multiple components, all three
parameters can be varying. Therefore, the estimates of them should be determined with
the maximum likelihood. This leads to the triple maximization equation of the sufficient
statistic gk for a continuous series
$$\frac{\partial S}{\partial \theta_0} + \lambda_1 \frac{\partial \varphi_1}{\partial \theta_0} + \lambda_2 \frac{\partial \varphi_2}{\partial \theta_0} = 0, \qquad
\frac{\partial S}{\partial \theta_1} + \lambda_1 \frac{\partial \varphi_1}{\partial \theta_1} + \lambda_2 \frac{\partial \varphi_2}{\partial \theta_1} = 0, \qquad
\frac{\partial S}{\partial t_e} + \lambda_1 \frac{\partial \varphi_1}{\partial t_e} + \lambda_2 \frac{\partial \varphi_2}{\partial t_e} = 0,$$
$$\varphi_1 = 0, \qquad \varphi_2 = 0,$$

where $\varphi_1$ and $\varphi_2$ are applicable constraints for the function $S$.
For a discrete sequence, the search becomes

$$g_k = \max_{1 \le j \le k} \ln \Lambda_j = \max_{1 \le j \le k} \ \sup_{\theta_1} S_j^k(\theta_1) \qquad (2.8)$$

where $\sup_{\theta_1} S_j^k(\theta_1)$ is the estimate of the upper bound of the log ratio of the joint
frequency function about the post-event mean θ1, for a given pre-event mean θ0, within
the window [j, k], and sup represents the supremum, i.e., the least upper bound of $S_j^k$
over [j, k] about the mean θ1 with reference to the mean θ0 before the change.
For off-line detection, the corresponding conditional maximum likelihood
estimates of the three values can be given as
$$(\hat{t}_e, \hat{\theta}_0, \hat{\theta}_1) = \arg \max_{1 \le k \le N} \ \sup_{\theta_0} \ \sup_{\theta_1} \ln \left[ \prod_{i=1}^{k-1} p_{\theta_0}(y_i) \prod_{i=k}^{N} p_{\theta_1}(y_i) \right] \qquad (2.9)$$

or in its condensed form

$$\hat{t}_e = \arg \max_{1 \le k \le N} \ln \left[ \prod_{i=1}^{k-1} p_{\hat{\theta}_0}(y_i) \prod_{i=k}^{N} p_{\hat{\theta}_1}(y_i) \right] \qquad (2.10)$$
It should be noted that the mean values θ0 and θ1 indicate the most probable or
most representative values of the data samples before and after a change, not simply the
arithmetic average of the samples. This can be demonstrated by the following plot.
Figure 2.3. Demonstration of the difference between the average and the most
representative value for the mean used in change detection.
The extremely high or low values are usually caused by random noise and are
common in a complex system with disturbances. Hence the mean needs to be estimated
statistically.
For a data series following the Gaussian distribution $N(\mu, \sigma^2)$,

$$p(y) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y - \mu)^2}{2\sigma^2}} \qquad (2.11)$$

the most representative value of a sample is equal to the average; an example is the noise
in electrical power systems [Shanmugan and Breipohl 1988].
In this thesis, the change-of-mean detection is conducted with the system's total
power data. Therefore, the mean equals the average in the detection window.
In principle, estimates of mean values and the time of change can only be
performed off-line in order to find the maximum likelihood throughout all the data
samples in the sub-windows. For on-line detection with a dynamic system, however,
these mean values are generally time-dependent and changes may happen at any time.
Therefore, some finite length of the detection window that progresses with time should be
used to isolate each time period of change, which is the idea of the finite moving average
(FMA) algorithm. With the estimated mean before the change assumed as a known value,
two possible algorithms can be used for the detection, i.e., for estimation of the time of
change and the post-event mean: the weighted cumulative sum (CUSUM) and the
generalized likelihood ratio (GLR). The CUSUM method is to weight the likelihood ratio
with a weighting function $dF(\theta_1)$ with respect to all possible values of the parameter θ1:

$$\Lambda_n = \int \frac{p_{\theta_1}(y_1, y_2, \ldots, y_n)}{p_{\theta_0}(y_1, y_2, \ldots, y_n)} \, dF(\theta_1) \qquad (2.12)$$

while the GLR uses the maximum likelihood estimate of θ1:

$$\Lambda_n = \frac{\sup_{\theta_1} p_{\theta_1}(y_1, y_2, \ldots, y_n)}{p_{\theta_0}(y_1, y_2, \ldots, y_n)} \qquad (2.13)$$
With the weighted CUSUM algorithm, all possible change values need to be
known before the detection begins. For a system with known finite states of changes
under all normal and faulty conditions, as in the residential applications, this method can
be used for the best statistical estimate of the coming event. On the other hand, the GLR
algorithm is based on the maximization of the likelihood ratio, i.e., the maximization of
the probability ratio defined in Eq. (2.13), of the sample between two values 00 and 01.
Without information about the incoming event in advance, both the post-event mean 01
and the time of change are searched in the detection window through the maximization of
the probability ratio. In a moving FMA window, the maximum of the ratio is first
identified by computing the likelihood with data in each sub-window and the time of
change is then determined by a second maximization of the ratio among all the subwindows. The search in a FMA window is implemented with each new data point
accepted into the window.
In such real systems as commercial HVAC applications, it is not feasible to define
all the possible states under faulty conditions, even if all the changes under normal
conditions can be well determined. Moreover, training of the weighting function with the
CUSUM method is not a trivial issue and may add uncertainty to the estimation. In this
research, it has been found that with reasonable innovations of the algorithm and
appropriate processing of the sampled data, the GLR algorithm can be used to achieve
reliable detection without knowledge of the potential events. The improved GLR detector
not only makes it possible to detect changes in the HVAC power data but also requires
fewer parameters to be trained than the CUSUM method. Therefore, in this thesis, the
GLR algorithm is the principal method for on-line change detection in complex systems.
2.3.3 The GLR algorithm
In the GLR algorithm, the mean value of the sequence before a change is assumed
known, which is from either previous hypothesis testing or estimation with the past data.
For commercial HVAC systems, power consumption is always varying with gradually
changing load conditions, which indicates the mean of the power data from the last
hypothesis may be very different from the current value. Therefore, the mean value
before the change must be estimated based on its probability density function,
$$\hat{\theta}_0 = \arg \sup_{\theta_0} p_{\theta_0}(y) \qquad (2.14)$$

For a series with the normal distribution $N(\mu, \sigma^2)$, the most representative value $\hat{\theta}_0$ is
the average of the series,

$$\hat{\theta}_0 = \frac{1}{n} \sum_{j=1}^{n} y_j \qquad (2.15)$$

which can be proved by solving $\partial p_{\theta_0}(y) / \partial \theta_0 = 0$.
It should be noted that in the presence of large spikes in a window of finite
length, this definition of the mean may lead to significant deviations from the real
representative value of the sample, as shown in Fig. 2.3. Therefore, the power data need
to be processed before being used in the GLR computation, which will be discussed in
Chapter 3.
With this estimate as the known mean before a change, for a sequence of random
variables (yO) with a probability density function of pe(y), there are two independent
unknowns left, the change time and the post-event mean. In the GLR method, these two
variables are determined with the maximum likelihood estimation, i.e., double
maximization of the likelihood ratio. For a continuous series, the mean after the change
θ1 and the change time te can be searched with appropriate boundary conditions through

$$\left. \frac{\partial S}{\partial \theta_1} \right|_{(\hat{\theta}_1, \hat{t}_e)} = 0, \qquad \left. \frac{\partial S}{\partial t_e} \right|_{(\hat{\theta}_1, \hat{t}_e)} = 0 \qquad (2.16)$$
For a discrete independent sequence, the double maximization is expressed as

$$g_k = \max_{1 \le j \le k} \ln \Lambda_j = \max_{1 \le j \le k} \ \sup_{\theta_1} S_j^k(\theta_1) \qquad (2.17)$$
The probability density belongs to the Koopman-Darmois family of probability
densities [Lorden 1971]:

$$p_\theta(y) = l(y) \, e^{\, q(\theta) m(y) - r(\theta)} \qquad (2.18)$$

where $l$, $m$, $q$, and $r$ are finite and measurable functions. Moreover, $m$ and $q$ are
monotonic and $r$ is strictly concave upward and infinitely differentiable over an interval
of the real line.
For a given system, the minimum magnitude of changes in the system can
sometimes be estimated from manufacturers' catalogs or product specifications. In change
detection, this means the minimum expected change νm in the parameter θ can be used as
a constraint for the double maximization,

$$g_k = \max_{1 \le j \le k} \ \sup_{\theta_1 : |\theta_1 - \theta_0| \ge \nu_m} S_j^k(\theta_1) \qquad (2.19)$$
Hence the conditional maximum likelihood estimates of the after-change mean
and the change time are

$$(\hat{j}, \hat{\theta}_1) = \arg \max_{1 \le j \le t_e} \ \sup_{\theta_1 : |\theta_1 - \theta_0| \ge \nu_m} \sum_{i=j}^{t_e} \ln \frac{p_{\theta_1}(y_i)}{p_{\theta_0}(y_i)} \qquad (2.20)$$
For an independent Gaussian $N(\mu, \sigma^2)$ sequence, the sufficient statistic function can
then be derived as

$$S_j^k = \frac{\theta_1 - \hat{\theta}_0}{\sigma^2} \sum_{i=j}^{k} \left( y_i - \frac{\hat{\theta}_0 + \theta_1}{2} \right) \qquad (2.21)$$

where $\hat{\theta}_0$ is the estimate of the pre-event mean from Eq. (2.15).
To obtain the estimate of θ1, let $\nu = \theta_1 - \hat{\theta}_0$; then

$$g_k = \max_{1 \le j \le k} \ \sup_{|\nu| \ge \nu_m} \sum_{i=j}^{k} \left[ \frac{\nu}{\sigma^2} (y_i - \hat{\theta}_0) - \frac{\nu^2}{2\sigma^2} \right]$$

The maximum of $g_k$ as a function of ν can be found from the equation
$\partial g_k / \partial \nu = 0$ with the constraint νm,

$$|\hat{\nu}_j| = \left( \left| \frac{1}{k-j+1} \sum_{i=j}^{k} (y_i - \hat{\theta}_0) \right| - \nu_m \right)^{+} + \nu_m$$

and then

$$g_k = \max_{1 \le j \le k} \sum_{i=j}^{k} \left[ \frac{\hat{\nu}_j}{\sigma^2} (y_i - \hat{\theta}_0) - \frac{\hat{\nu}_j^2}{2\sigma^2} \right] \qquad (2.22)$$

If νm = 0, meaning that changes of any magnitude are of interest or no information
about the minimum expected change is available in advance, then

$$g_k = \frac{1}{2\sigma^2} \max_{1 \le j \le k} \frac{1}{k-j+1} \left[ \sum_{i=j}^{k} (y_i - \hat{\theta}_0) \right]^2 \qquad (2.23)$$
2.4 Summary
In this chapter, the theory of change detection has been studied. First, the
steady state method was selected, not only for its smaller computing load but also for its
capability of dealing with a noisy power environment; in addition, the steady state
method by itself constitutes a complete load monitoring system. Second, the
characteristics of the power profiles of both residential and commercial systems were
discussed, and algorithms for change detection of power in commercial applications were
analyzed. Finally, the necessity and feasibility of methods for steady-state change
detection of power consumption in commercial buildings were evaluated, and the
algorithm based on the generalized likelihood ratio was selected according to the
information commonly available for HVAC systems in practice.
CHAPTER 3
System power modeling for fault detection and diagnosis
This chapter presents a detailed study of the GLR detection of changes in the total
power data of commercial HVAC systems. First, several fundamental issues are put
forward regarding the GLR algorithm when applied in practice. In order to reduce false
and missed alarms, some innovations are proposed and verified with data from real
HVAC systems. Then, to deal with the noise effect in the power data and the various
startup characteristics of different equipment in one system, methods including the
median filter and the multi-rate sampling technique for preprocessing of the power data
are developed. Finally, with the enhanced GLR detector working on properly
preprocessed data, case studies for change detection in the total power data of both the
building and the HVAC systems are presented and a short discussion is given about the
training and application of the change detector.
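As a preview of the preprocessing step, the median filter mentioned above can be sketched as follows; it removes short spikes while preserving genuine step changes, unlike an averaging filter (a simplified illustration, not the thesis implementation; the data are hypothetical):

```python
def median_filter(data, width=5):
    """Sliding-window median filter; width should be odd.

    Spikes shorter than about half the window are removed, while
    genuine step changes are preserved (unlike averaging filters).
    Edges are handled by shrinking the window.
    """
    half = width // 2
    out = []
    for i in range(len(data)):
        lo, hi = max(0, i - half), min(len(data), i + half + 1)
        window = sorted(data[lo:hi])
        out.append(window[len(window) // 2])
    return out

# A single-sample spike at index 3 is removed; the step at index 6 survives.
raw = [10, 10, 10, 95, 10, 10, 40, 40, 40, 40]
smooth = median_filter(raw, width=3)
```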
3.1 Introduction of on-line change detection in power data of the HVAC systems
Monitoring and analysis of a system's total power input involve keeping track of
the on/off schedules of individual components and the abnormal magnitude of changes
and obtaining appropriate estimates of the system's energy consumption. Undesirable
operating status may be identified by examination of the trend of the total power data. For
example, unstable control of a component in the system can be detected by checking the
standard deviation of the sampled power data against a trained threshold.
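The oscillation check just described can be sketched as a rolling standard deviation compared against a trained threshold; the window length and threshold values here are hypothetical stand-ins for trained parameters:

```python
import statistics

def unstable_control(power, window=10, sigma_max=5.0):
    """Flag windows whose standard deviation exceeds a trained threshold.

    power: sampled total power data
    window: number of samples per detection window (hypothetical)
    sigma_max: trained threshold for the standard deviation (hypothetical)
    Returns the start indices of windows flagged as oscillating.
    """
    flagged = []
    for i in range(0, len(power) - window + 1):
        if statistics.stdev(power[i:i + window]) > sigma_max:
            flagged.append(i)
    return flagged

# Steady at first, then an oscillation between 80 and 120 (toy data)
data = [100.0] * 10 + [80.0, 120.0] * 5
alarms = unstable_control(data, window=10, sigma_max=5.0)
```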
Monitoring of the total power profile can be generally defined as a quality control
problem in two major aspects: constant mean with varying standard deviation or constant
deviation with varying mean, as illustrated in Fig. 3.1. In quality control, these two types
of variation are considered as systematic and random errors respectively.
Figure 3.1. Two major types of change detection in quality control: a) constant
standard deviation with varying mean; b) constant mean with varying standard
deviation.
In common HVAC systems, a systematic error corresponds to an undesired on/off
switch of a component under faulty operating conditions, while a random error indicates
oscillation in the power data resulting from unstable control in the system. In order to
identify such abnormalities under steady state, two thresholds must be established, one
for the sufficient statistic to find the changes and the other for the standard deviation of
the sampled data. As a dimensionless index statistically derived from a noisy
environment, the sufficient statistic must be trained for a given system. To evaluate the
noise level, which usually depends on the on-site configuration of both the electrical and
mechanical systems, the appropriate training and estimation of the threshold for the
standard deviation are essential for oscillation identification.
Unlike the event monitoring in product quality control where the mean and the
standard deviation can be treated as constants under normal operation, in an operating
HVAC system, both the mean and the standard deviation are changing all the time. In
order to obtain timely estimates for the mean values of a sequence, the finite moving
average (FMA) window should be used for the detection. An FMA window is a data
sampling window with a finite length, in which the mean values for each sub-window can
be determined progressively with weighting or forgetting factors. For a Gaussian
sequence, due to the randomness of the independent data in the window, the weighting
factor can be assumed as a constant to eliminate the training of these factors.
3.2 Modification of the GLR: one-window equation vs. two-window equation
In the theory of on-line change detection, the search for the maximum sufficient
statistic is conducted by successively searching through each sub-window [j, te]:

$$(\hat{j}, \hat{\theta}_1) = \arg \max_{1 \le j \le t_e} \ \sup_{\theta_1 : |\theta_1 - \theta_0| \ge \nu_m} \sum_{i=j}^{t_e} \ln \frac{p_{\theta_1}(y_i)}{p_{\theta_0}(y_i)} \qquad (3.1)$$
In the estimation of the post-event mean θ1 in a dynamic system, one problem
inherent in this equation is the availability of the pre-event mean θ0. In practice,
especially when detecting unexpected events, which is a major task of fault detection, the
pre-event mean must be estimated from the continuously updated data sample and can
not be selected from the expected values or the equipment specifications. For a system
with relatively constant or very small variations compared to the magnitude of the
sampled data, the post-event mean detected from the last hypothesis may be used as the
pre-event mean for the next event, such as the total power data of a small house.
However, in some cases, the total power data are always varying with the load as a
function of time, especially with the presence of variable speed drives which are widely
used in commercial HVAC systems. This indicates the post-event mean from the last
hypothesis, $\hat{\theta}_1$, can be very different from the current pre-event mean θ0, which tends to
cause unacceptable errors or even false/missed alarms in the detection if θ0 is assumed to
be equal to $\hat{\theta}_1$. Fig. 3.2 illustrates such effects with two abrupt turn-on events at 27 and
88 seconds separately. The post-event mean value of 21.75 of the previous hypothesis is
obtained from the average during time 0-5. If this value instead of the mean value of
31.29 during 21-26 seconds is used as the pre-event mean for the change at time 27 with
a post-event mean of 39.12, the resulting error of the detected change is 80%. For the
change at 88 seconds, the pre-event mean from the previous hypothesis is very close to
the post-event mean of 39.61 during 88-93, which virtually makes it impossible for the
detector to see the change at 88 seconds if 39.12 is used as the pre-event mean.
Figure 3.2. Effects of the window selection for the estimation of the pre-event mean.
This chart suggests a potential solution to this problem: to obtain a continuously
updated estimate of the pre-event mean θ0, the data in the latest section of the series
should be used. For more accurate change detection, the pre-event mean should be
properly estimated independently, i.e., it cannot be obtained from any data in the same
window used for the post-event mean. Therefore, to clarify the difference between the
two mean values, there must be an additional window before the post-event window to
perform the estimation for the pre-event mean. By moving with the post-event window,
the pre-event window provides appropriate updated estimates of the base for the next
"step" change. Hence the equation for the estimation of the change time and the
magnitude is modified as
$$(\hat{j}, \hat{\theta}_1) = \arg \max_{M+1 \le j \le N} \ \sup_{\theta_1 : |\theta_1 - \hat{\theta}_0| \ge \nu_m} \sum_{i=j}^{N} \ln \frac{p_{\theta_1}(y_i)}{p_{\hat{\theta}_0}(y_i)} \qquad (3.2)$$

$$\hat{\theta}_0 = \arg \sup_{\theta_0} p_{\theta_0}(y_i), \quad 1 \le i \le M$$

where $\hat{\theta}_0$, the estimate of the pre-event mean θ0, is obtained as the most likely value
based on the probability density $p_{\theta_0}(y)$ in the pre-event window [1, M].
With the $N(\mu, \sigma^2)$ distribution,

$$(\hat{j}, \hat{\theta}_1) = \arg \max_{M+1 \le j \le N} \ \sup_{\theta_1 : |\theta_1 - \hat{\theta}_0| \ge \nu_m} \sum_{i=j}^{N} \frac{\theta_1 - \hat{\theta}_0}{\sigma^2} \left( y_i - \frac{\hat{\theta}_0 + \theta_1}{2} \right) \qquad (3.3)$$

$$\hat{\theta}_0 = \frac{1}{M} \sum_{i=1}^{M} y_i$$

where $\hat{\theta}_0 = \hat{\mu}$.
The length of the two windows M and N-M depends on the on/off characteristics
of the monitored equipment in a system and needs to be properly trained in practice.
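The two-window scheme of Eqs. (3.2)-(3.3) can be sketched for Gaussian data with νm = 0: the pre-event window supplies the estimate of θ0 and the post-event window is scanned for the most likely change time (a simplified illustration; the data, window lengths, and variance are hypothetical):

```python
def two_window_glr(series, m, n, var):
    """GLR statistic with separate pre- and post-event windows.

    series: the latest n samples; the first m form the pre-event window
    and the remaining n - m form the post-event window.
    var: noise variance (sigma^2)
    Returns (g, j_hat): the maximum statistic and the estimated change
    index (offset within the post-event window), for nu_m = 0.
    """
    assert len(series) == n and 0 < m < n
    theta0 = sum(series[:m]) / m            # pre-event mean, Eq. (2.15)
    post = series[m:]
    g, j_hat = 0.0, 0
    for j in range(len(post)):              # candidate change times
        s = sum(y - theta0 for y in post[j:])
        stat = s * s / (len(post) - j) / (2.0 * var)
        if stat > g:
            g, j_hat = stat, j
    return g, j_hat

# Pre-event window of 6 samples near 30, post-event window with a step up
data = [30.0, 30.2, 29.8, 30.1, 29.9, 30.0, 30.1, 38.0, 38.2, 37.8]
g, j_hat = two_window_glr(data, m=6, n=10, var=0.02)
```

Because the pre-event mean is estimated only from the first window, a load drift between the two windows shows up as a change, which is why the window lengths must be trained to the system's on/off characteristics.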
3.3 Training of the parameters for the GLR detector
According to Sections 3.1 and 3.2, four parameters need to be trained before the
GLR algorithm can be used for detection:
a). the length of the pre-event window;
b). the length of the post-event window;
c). the threshold for the sufficient statistic;
d). the threshold for the standard deviation.
The effects of the above parameters on the detection output are illustrated by
Fig. 3.4 with the total power data shown in Fig. 3.3(a) [Hill, 1995] for four fans in the
HVAC system of a campus building. During a period of 28 hours, there were 10 on/off
events of the fans at minutes 230, 455, 475, 490, 520, 545, 670, 1150, 1193, and 1560
respectively, as represented by the binary on/off switches in Fig. 3.3(b). Data were
sampled at a rate of 1/60 Hz.
Figure 3.3. Power data for four supply fans in the HVAC system of a campus building:
a) power data sampled at 1/60 Hz; b) binary event indicator of the on/off switches.
3.3.1 The length of the pre-event window
The length of the data window has a profound impact on the decision function.
The major objective of the pre-event window is to provide a stable datum for the detector
to find the coming change. Therefore, there should be enough sampled data points in this
window, i.e., this window has to be long enough for a given sampling rate. On the other
hand, multiple power changes frequently occur in sequence in HVAC systems and
sometimes they are closely separated in time, which requires a short window to avoid
averaging out two consecutive events. However, with a single sampling rate, the mean
obtained with a short window tends to be affected by noise in the data and the resulting
deviation of the estimated mean from the most representative average may lead to false or
missed alarms. In addition, for events that occur gradually over a time period, a short
window tends to miss the event or produce multiple alarms for the same event. Fig. 3.4(a)
shows the short-window effect on the GLR detection output with a pre-event window of
5 data points. Further decrease of the window length yielded more false alarms and even
unstable outputs.
On the other hand, a long window prevents the detector from finding multiple
changes that are close to each other in time. This effect is demonstrated in Fig. 3.4(b),
where in addition to continuous alarms for each event, the detector with a window of 30
data points produced sufficient statistics above the threshold over the entire time period
from 450-550 minutes when four events occurred and one continuous false alarm
beginning at 1198. The continuous alarm for a single event is because the data from the
coming event can still be "assimilated" or "averaged out" in such a long pre-event
window and hence the sufficient statistics stay beyond the threshold until enough data
after the change are admitted into the pre-event window, resulting in a smaller probability
ratio between the two values. For a similar reason, a long pre-event window leads to
continuous alarms for multiple close events.
With a window length of 10 in Fig. 3.4(c), the duration of continuous alarms was
much shorter than with 30 points in the window and the close changes could be
distinguished, but three events were missed at 230, 670, and 1560 due to the longer
duration of those events.
Apparently, the proper window length changes with the characteristics of
different systems. But for a given system, the window length for detection seems to be
consistent under various operating conditions, as demonstrated in the applications of this
method to detect HVAC equipment on/off events in a test building [Shaw et al. 2000]
[Norford et al. 2000].
Observations indicate that the upper limit for the length of the pre-event window
should not be longer than the interval between two major consecutive events. Such a
value can sometimes be obtained from the basic design specifications of the system,
because in practice, the major components in an HVAC system are usually turned on and
off with a designated lower limit of time lapse in order to protect the equipment from
deterioration due to frequent switches, such as the chiller control. Moreover, the on/off
switches of multiple components are often governed by the load conditions which usually
occur over a certain time span, e.g., the sequence control of multiple fans.
As a lower limit, the pre-event window should not be shorter than the duration of
a noise or a spike. Ideally, it should be longer than the duration of the startup period of
each component in the system. However, this condition may not be always met for a real
system, since the startup of some VSD equipment can be as long as 15 minutes, which is
sometimes longer than the interval between two consecutive switches. In such cases,
other approaches are needed, such as the multi-rate sampling technique discussed later.
Without violation of the above basic rules, the pre-event window should be kept short,
which facilitates one of the major innovations to the GLR algorithm, namely, pre-event
window reset, as presented later.
a). Pre-event window length of 5. Five false alarms at 258, 350, 380, 510, and 1241.
b). Pre-event window length of 30. Continuous alarms for each event, mixed alarms for
the five close events during 450-550, and continuous false alarms beginning at 1198.
c). Pre-event window length of 10. Mixed alarms for the close events eliminated and
missed alarms at 230, 670, and 1560.
Figure 3.4. Demonstration of the improvement of the detection quality with the proper
pre-event window length. All three tests are based on a constant variance of 0.617
and a minimum expected change of zero.
3.3.2 The length of the post-event window
The above basic limits for the pre-event window also apply for the post-event
window, i.e., it should not be longer than the interval between two consecutive events,
never shorter than a disturbance, and ideally not shorter than the duration of a startup
transient process. On the other hand, unlike the pre-event window, which is used to
achieve a stable mean as the reference for the coming events, the post-event window is
intended to be sensitive to events yet robust to disturbances. Eq. (3.3) shows that a shorter
post-event window is more sensitive to changes than a longer one. Moreover, to save the
time used for searching the change in the post-event window, the window length should
be as short as possible. The appropriate length of the post-event window was found to be
25-50% of the pre-event window in order to obtain a relatively stable yet sensitive
average for the detection of on/off events.
3.3.3 The threshold for the sufficient statistic
The magnitude of an appropriate detection threshold scales with signal noise, the
minimum signal change of interest, and the abruptness of potential changes in the system.
All three factors are system-dependent. The minimum signal change may be found
from the specifications of the components in the system while the other two depend on
the component characteristics and system setup as well. Therefore, the threshold has to be
established adaptively during the early test period for a given system. The training
process benefits from available reference information (such as the type of the motor
drive), which can be obtained from design information of the system, on-site
observations, and on/off tests if possible.
Tests with different systems also suggested that the threshold can be consistently
used for a given system.
3.3.4 The standard deviation of the power data
The standard deviation, or the variance, is an important measure of data quality.
The standard deviation calculated for an FMA window may change rapidly over time in
the power series of HVAC systems, as shown in Fig. 3.5. In the GLR algorithm, since
the sufficient statistic is directly affected by the standard deviation, determination of the
value of the standard deviation in an FMA window becomes one of the key issues for
successful detection. However, the output of detection based on a fixed value of the
standard deviation was rarely satisfactory even though such a value might be tuned to
reduce false alarms. For example, a constant variance of 0.617 has been used for the
detection illustrated by Fig. 3.4. Tests with the tuning of the variance showed that with
any constant value, the false/missed alarm rates were not desirable.
Figure 3.5. Variations in the noise of the electrical signal for the aggregated fan power
measurement.
Tests for the training of the GLR detector have shown that even with well-tuned
values of the above parameters, false and missed alarms still occurred at unacceptable
rates for real applications. In order to achieve desirable detection quality for a practical
system, further improvements are necessary in addition to the previous modifications
and tuning guidelines.
3.4 Improvements of the detection algorithm
In this thesis, three innovations have been made to the form of the original
algorithm for improved performance of the GLR detector: window reset, variance update,
and non-zero minimum magnitude of change.
3.4.1 Progressive vs. reset windows
From Fig. 3.4(c), it can be seen that with an appropriately trained length of the
pre-event window, continuous alarms are still issued in the detection output. This is
because with the finite length of a window, which is essential to obtain a stable mean
value, the sufficient statistic gradually decreases as more data are accepted into the
windows. With the window reset technique, once the sufficient statistic exceeds the
threshold, the data in the two windows before the found change point are replaced with
the data after this point. Upon this replacement, the calculated sufficient statistic
immediately drops below the threshold because both windows now hold similar post-event
values. Hence the duration of an alarm for a single event is minimized and masking of
subsequent events is eliminated. Also, the problem of an overly long or short
window is alleviated. It should be noted that more data points may be needed to reset the
pre-event window. This is realized by using a third window following the detection
window with the same length as the pre-event window. The two-window equation is not
affected by the window reset because the third window is not involved in the current
detection. The window lengths used here are 10 for the pre-event, 5 for the detection, and
10 for reset in sequence as shown in Fig. 3.6. The effect of this innovation is shown in
Fig. 3.7.
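The reset mechanism can be sketched in code. This is a simplified illustration, not the thesis implementation: the full GLR sufficient statistic is replaced by a basic noise-normalized mean-shift score, and the class name, threshold value, and noise level are assumptions. Only the three-window arrangement of Fig. 3.6 (pre-event 10 points, detection 5, reset 10) and the reset-on-alarm behavior follow the text.

```python
from collections import deque

class ResetWindowDetector:
    """Sketch of GLR-style change detection with pre-event window reset."""

    def __init__(self, pre_len=10, det_len=5, reset_len=10,
                 threshold=25.0, sigma=1.0):
        self.pre = deque(maxlen=pre_len)      # oldest data: reference mean
        self.det = deque(maxlen=det_len)      # detection window
        self.reset = deque(maxlen=reset_len)  # newest data: refills pre on reset
        self.threshold = threshold
        self.sigma = sigma                    # assumed known noise level

    def _statistic(self):
        # Simplified stand-in for the sufficient statistic: squared,
        # noise-normalized difference of the two window means.
        mu_pre = sum(self.pre) / len(self.pre)
        mu_det = sum(self.det) / len(self.det)
        return len(self.det) * (mu_det - mu_pre) ** 2 / (2.0 * self.sigma ** 2)

    def step(self, y):
        """Feed one sample; return True if an alarm is raised."""
        # Data flow oldest-to-newest: pre-event <- detection <- reset <- y.
        if len(self.reset) == self.reset.maxlen:
            shifted = self.reset.popleft()
            if len(self.det) == self.det.maxlen:
                self.pre.append(self.det.popleft())
            self.det.append(shifted)
        self.reset.append(y)
        if len(self.pre) < self.pre.maxlen or len(self.det) < self.det.maxlen:
            return False
        if self._statistic() > self.threshold:
            # Window reset: refill the pre-event window with post-event data
            # so the statistic drops below the threshold at once and the
            # alarm is not repeated continuously for the same event.
            post = list(self.det) + list(self.reset)
            self.pre.clear()
            self.pre.extend(post[:self.pre.maxlen])
            self.det.clear()
            self.reset.clear()
            return True
        return False
```

Fed a clean step change, this detector raises a single short alarm rather than the continuous alarms seen in Fig. 3.4(c).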
Figure 3.6. FMA windows used with the GLR detection algorithm.
3.4.2 Updated standard deviation
The standard deviation can be continuously calculated from the pre-event
window, with high and low limits to avoid singular values of the calculated sufficient
statistic. This effect is demonstrated by comparing Fig. 3.7(a) with Fig. 3.7(c). In this
test, the limits were 10000 and 0.0001. The problem with this approach is that it is
difficult to set proper values for the fixed limits, which consequently produces inaccuracy
in calculation of the sufficient statistic.
Another method to determine the threshold of the updated standard deviation is to
set it as an approximate fraction of the mean of the data in the current FMA window and
then train this fraction number. Tests showed that the standard deviation tends to increase
with the total power input of an HVAC system as more equipment are put into operation.
Data from a training period can be used to determine the ratio of the measured standard
deviation and the measured total power. The ratio is calculated as the standard deviation
to the mean value in the moving pre-event window during steady state operation and
periods with significant dynamics should be excluded. The maximum of the computed
values during the training period is set as the ratio to be used in detection, because the
standard deviation in the pre-event window, produced by random noise following a
Gaussian distribution, is not expected to deviate significantly from this value under
normal conditions. The training of the ratio requires the total power data of
the operating system during a typical day. Some basic knowledge of the system's power
consumption magnitude is also needed, which is often obtainable from the design data.
During subsequent on-line FDD applications, the threshold for the calculated standard
deviation is estimated as a product of this ratio and the averaged total power. From tests
with different HVAC systems, it has been found that reasonable upper and lower limits
for the standard deviation as fractions of the current power data are on the order of 10%
and 1% respectively. This method gives the most reliable estimate of the standard
deviation because it eliminates the effect of the extreme values of the standard deviation
while incorporating an updated estimate of the standard deviation in the calculation of the
sufficient statistic. Tests conducted with real building systems have verified the success
of this method.
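The training and clamping steps above can be sketched as follows. The function names and the 10-point window are illustrative assumptions; the 1% and 10% limits and the ratio-times-mean threshold follow the text.

```python
import statistics

def train_std_ratio(power, window=10):
    """Training step: maximum of (window std / window mean) over a
    steady-state training series of total power data."""
    ratios = []
    for i in range(len(power) - window + 1):
        w = power[i:i + window]
        mu = statistics.fmean(w)
        if mu > 0:
            ratios.append(statistics.pstdev(w) / mu)
    return max(ratios)

def std_threshold(ratio, mean_power):
    """On-line threshold for the calculated standard deviation:
    the trained ratio times the current averaged total power."""
    return ratio * mean_power

def bounded_std(window, lo_frac=0.01, hi_frac=0.10):
    """Standard deviation of the current window, clamped between
    ~1% and ~10% of its mean power to avoid singular values."""
    mu = statistics.fmean(window)
    s = statistics.pstdev(window)
    return min(max(s, lo_frac * mu), hi_frac * mu)
```

A zero-noise window is clamped up to the 1% floor, while a wildly noisy window is clamped down to the 10% ceiling, so the sufficient statistic never blows up or vanishes.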
3.4.3 Nonzero minimum expected change
A value of zero for the minimum expected power change makes the GLR
equation easy to implement. However, relative to a minimum value assigned on the basis
of the knowledge about equipment sizes in a system, the zero-minimum detection usually
leads to more false alarms, as shown in Fig. 3.7(b) at time 270 and 1260. This can also be
seen from the equation itself, since g decreases as the minimum expected change increases. In practice, it
is often reasonable to find and set a minimum expected change based on the knowledge
of the system and its components. With a properly determined minimum expected change
Vm, the algorithm will neglect small abrupt disturbances and their accumulation within
the window and thus reduce the false alarm rate and yield more reliable detection outputs.
This effect is shown with a known minimum change of Vm = 5 kW in Fig. 3.7(c).
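For a Gaussian mean-shift model, the effect of a nonzero Vm can be illustrated with the constrained maximum-likelihood estimate of the change: the raw mean difference when it exceeds Vm in magnitude, otherwise Vm with the sign of the difference. This is the standard GLR formulation assumed here for illustration, not a transcription of the thesis's equation; the data values and noise level are hypothetical.

```python
import math

def glr_statistic(pre, post, sigma, v_min=0.0):
    """GLR statistic for a Gaussian mean shift with a minimum
    expected change magnitude v_min (kW)."""
    mu0 = sum(pre) / len(pre)
    n = len(post)
    d = sum(post) / len(post) - mu0       # raw mean difference
    # Constrained MLE of the change: at least v_min in magnitude.
    v = d if abs(d) >= v_min else math.copysign(v_min, d if d else 1.0)
    # Log-likelihood ratio summed over the post-event window.
    return (n / sigma ** 2) * (v * d - v * v / 2.0)

pre = [100.0] * 10      # steady pre-event power (kW), hypothetical
small = [100.6] * 5     # small drift, below a 5 kW minimum
big = [110.0] * 5       # genuine turn-on event
```

With `v_min = 0` the small drift scores positive and can accumulate toward the threshold; with `v_min = 5` its score is driven negative, while the statistic for the large change is unaffected.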
a). window reset, constant variance, and non-zero minimum expected change.
b). window reset, updated variance, and zero minimum expected change.
c). window reset, updated variance, and non-zero minimum expected change.
Figure 3.7. Demonstration of the improvements of the GLR detection method by pre-event window reset, variance update, and minimum signal estimation (5 kW).
3.5 The median filter
In the search for a change in the detection window, each sub-window ending at point j
(j = 1, ..., N) is used for the calculation of the likelihood ratio. If a large spike enters the
window, the ratio may exceed the threshold and lead to a false alarm. This effect is
demonstrated by Fig. 3.8 and the following equation. When the detector reaches time
point 15 in the post-event window which is affected by a big spike, the likelihood ratio
will have a dramatic jump and a false alarm will be issued for the spike.
Figure 3.8. Sampled data with one spike in the post-event window.
g_15 = max_{11<=j<=15} sup_{θ1} Σ_{i=j}^{15} ln[ p_{θ1}(y_i) / p_{θ0}(y_i) ]

where the term for j = 15 reduces to ln[ p_{θ1}(y_15) / p_{θ0}(y_15) ] and is dominated by
the spike.
In addition, although the length of the two windows can be selected to alleviate
the noise effect by "averaging out" the deviations, the detection quality is still often
degraded when the data environment becomes very noisy with large disturbances, as
shown in Fig.3.9(a). The dotted line shows the profile of the total power input of a
campus building at a sampling rate of 1 Hz with four on/off events of a water pump. With
frequent big spikes, the estimated mean in each window may significantly deviate from
the real representative value.
From the above discussion, it becomes clear that a preprocessor needs to be
developed to improve the quality of the sampled data before they are used for the
detector. In this thesis, to remove the large disturbances in the power data, a median filter
[Karl et al. 1992] is employed to facilitate the change detection. The median of a series
with a probability density function p(x) is the value Xmed for which larger and smaller
values of x are equally probable:
∫_{-∞}^{x_med} p(x) dx = ∫_{x_med}^{+∞} p(x) dx = 1/2                  (3.4)
For a discrete sequence, the median of a distribution is estimated from a sample of
values x_1, ..., x_N by searching for the value x_i that has an equal number of data greater
and smaller than it. If N is even, the median becomes the average of the two central
values. With the sample sorted into an ascending (or a descending) order, the formula for
the median of the sample is

x_med = x_{(N+1)/2},                     if N is odd
x_med = ( x_{N/2} + x_{N/2+1} ) / 2,     if N is even                  (3.5)
A median filter is a nonlinear filter based on the above function, which sorts the
data in a window and preserves the median as the value at the end point of the window
before sliding over one point. Spikes that appear in the window as points of very large or
small values compared to the selected median are discarded. Hence a representative value
is picked by the filter in the FMA window for the detector.
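A minimal causal sliding-median preprocessor of this kind can be written as follows; the 5-point window is illustrative, and per the guideline later in this section it should span at least twice the typical spike duration.

```python
def median_filter(data, window=5):
    """Causal sliding median: the output at point i is the median of the
    last `window` samples ending at i (fewer near the start of the series).
    Isolated spikes never reach the change detector."""
    out = []
    for i in range(len(data)):
        w = sorted(data[max(0, i - window + 1):i + 1])
        n = len(w)
        # Eq. (3.5): middle value if n is odd, average of the two
        # central values if n is even.
        out.append(w[n // 2] if n % 2 else 0.5 * (w[n // 2 - 1] + w[n // 2]))
    return out
```

A single large spike is removed entirely, while a genuine step change passes through with its edge intact (delayed by roughly half a window).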
By eliminating large spikes, the median filter significantly improves the
robustness of the GLR detection with reference to the detection threshold. It has been
found that the GLR detector is less sensitive to the value of the threshold, and hence
easier to train, when the disturbances in the data environment are greatly reduced by a
median filter.
The advantage of using a median filter has been verified by tests with one
example illustrated below based on the total power data sampled at 1 Hz from a campus
building. During a period of 1020 seconds, a pump was turned on and off 8 times by the
control system at 124, 183, 443, 558, 737, 797, 939, and 998 seconds. Fig.3.9(a) shows
the profile of the total power data with the dashed line for the raw data and the solid line
for the filtered data. Without the median filter, 16 events were detected including the 8
switches and 8 false alarms, as shown in Fig.3.9(b). By using the appropriately designed
median filter for the detection, false alarms caused by spikes at 29, 102, 119, 207, 355,
675, 812, and 934 seconds were eliminated, as illustrated in Fig.3.9(c). Comparison
between the computed power changes in Fig.3.9(b) and (c) also indicates that for the 8
events, the detected changes are more accurate with the startup electrical surges
eliminated by the filter. In the training of the GLR, the same results can be obtained with
the median filter for any threshold value between 1.0 and 3.0. On the other hand, no
identical outputs were found with different thresholds in the detection based on the raw
data. A value of 3.0 was found "optimal" for the threshold that leads to the identification
of all the events with the minimum amount of false alarms, as illustrated by Fig.3.9(b).
Figure 3.9. Demonstration of the improved detector performance with a pre-processing
median filter to reduce unwanted electrical noise. a). 1-Hz total power data of a campus
building with 4 on/off events of a pump, filtered and unfiltered; b). output of the GLR
detector with the raw data; c). output of the GLR detector with the filtered data.
It is necessary to select a value for the window length used by a median filter. The
window should be at least twice as long as the general duration of an electrical spike,
which is generally less than 5 seconds from our observations, and not longer than the
interval between two consecutive events or the duration of an on/off state, whichever is
shorter.
The median filter is more robust than the mean (or average) filter due to the fact
that the median filter discards the extreme values while the mean filter calculates the
average of all the data in the sample, taking into account and hence susceptible to all the
spikes in the window [Rice 1988]. Statistically, the median filter fails as an estimator
only if the area in the tails is large, while the mean filter fails if the first moment of the
tails of the distribution is large.
One practical problem with the median filter is the detection of data oscillation in
a window. For common HVAC systems, oscillation in the power input is generally
caused by unstable control and needs to be removed once detected. With a median filter,
however, the oscillation tends to be neglected because the extreme values are removed by
the filter and consequently the fault is hidden from the detector.
3.6 Change + oscillation detector
Oscillation of power caused by unstable control in HVAC systems may degrade
equipment and in some cases increase energy consumption. One of the major
characteristics of oscillation is the deviation of the data from their mean value in an FMA
window. Hence, the approach used to identify this fault in this thesis is to compare the
standard deviation of the data set against a threshold that is dynamically adjusted as a
fraction of the current power data. To avoid false alarms caused by random spikes which
might lead to significant errors in calculating the standard deviation in a short window,
the detection of oscillation is continuously conducted over the whole pre-event window
rather than searching through all the sub-windows.
The key points in designing a GLR detector with oscillation detection capability
are the proper thresholds for the sufficient statistic and for the standard deviation. For the
change detector, a higher threshold HGLR is determined for the sufficient statistic and a
lower threshold LSTD for standard deviation to identify changes. For the oscillation
detector, a higher threshold HSTD is selected for the standard deviation and a lower
threshold LGLR for the sufficient statistic to find the oscillation. Thresholds established
in this way eliminate the fuzzy intermediate region where, for example, fluctuations in
power may cause false alarms in the change detector, and permit a clearer distinction of
step changes and oscillations. The LGLR and the LSTD were set to be 10% of the HGLR
and 20% of the HSTD respectively for an example illustrated by Fig. 3.10 for the
detection of a turn-on event among power oscillation caused by an unstable controller in
the total power data of the HVAC system in a test building. While the supply-fan static
pressure controller gain was set to produce power oscillation, a chilled water pump was
turned on at about 3604 seconds in the plot. With HGLR= 2500, LGLR= 250, HSTD=
0.02, and LSTD= 0.004, both events were alarmed successfully.
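The dual-threshold logic can be sketched as below, using the example values given above (HGLR = 2500, LGLR = 250, and HSTD = 0.02, LSTD = 0.004 as fractions of the current mean power). The `classify` function and the relative-spread measure are illustrative assumptions rather than the exact detector structure.

```python
import statistics

HGLR, LGLR = 2500.0, 250.0   # LGLR is 10% of HGLR
HSTD, LSTD = 0.02, 0.004     # LSTD is 20% of HSTD (fractions of mean power)

def classify(g, window):
    """Parallel change/oscillation decision for one pre-event window,
    given the current sufficient statistic g."""
    mu = statistics.fmean(window)
    spread = statistics.pstdev(window) / mu   # relative standard deviation
    if g > HGLR and spread < LSTD:
        return "change"        # high statistic, quiet data: step change
    if spread > HSTD and g < LGLR:
        return "oscillation"   # noisy data, low statistic: unstable control
    return "normal"            # fuzzy intermediate region: no alarm
```

Separating the two pairs of thresholds leaves a dead band in between, so power fluctuations that merely inflate the spread do not trip the change detector.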
Figure 3.10. Parallel detection of changes and oscillation. a). 1-Hz total electrical power
data of the HVAC system in a test building, showing oscillation indicative of an unstable
supply-duct static-pressure controller as well as a turn-on event at 3604 seconds; b).
output for the detection of the unstable controller with a change of 872 watts found at
3604 seconds.
3.7 Monitoring of the total power data - multi-rate vs. single-rate sampling
In addition to the above improvements to the GLR algorithm and data filtering,
the sampling rate of the data fed to the detector has also been found critical in keeping
track of the events in a system. This is not only because of the varying abruptness of the
same event exposed to the detector but also due to the distinct characteristics of the data
trend unveiled with different sampling rates, such as the data spread about the mean
value. In this section, after comparing the detection output with single- and multi-rate
sampling, a detector for change and oscillation tracking with multiple sampling intervals
is developed and discussed with power data from a test building.
3.7.1 Change detection by multi-rate sampling of the total power data
It can be seen from the previous sections that on/off events occur in a wide range
of time intervals in HVAC systems. In order to find as many events as possible, a short
sampling period should be used to avoid missing events. This is
because a shorter sampling interval, or a faster sampling rate, can help to distinguish
closely separated events. However, a fast sampling rate also tends to result in more false
alarms because the calculation is more susceptible to high-frequency transients and
disturbances. Moreover, an overly fast sampling rate may cause more than one alarm for a
single event, especially for some relatively slow on/off events, e.g., a supply fan with a
variable speed motor drive. On the other hand, a long sampling interval, or, a slow
sampling rate, is likely to miss changes with short time duration, such as the startup of a
constant-speed water pump, and sometimes mix up closely separated events. Therefore, a
reasonable range of sampling intervals should be determined before applying the GLR
detector to a data series.
i). Order of magnitude analysis of the sampling interval
In general, the sampling interval should be shorter than the time lapse between
two most closely spaced events. Such a limit can be obtained by examining the total
power data at a fast enough sampling rate or by observing the signals from the control
system, if possible. For example, in the tests with a real HVAC system, the shortest time
period between two consecutive events has been found to be about 1 second by analyzing
the building's total power data. Fig. 3.11 shows the whole-building power data during a
typical day. The effect of sampling rate on the data pattern fed to the detector is
illustrated in Fig. 3.12. All the data were collected with a 24-Hz data logger called
remote-1 but plotted here with a sampling interval of 10 seconds due to limitations of the
plotting software.
Figure 3.11. Whole-building electrical power, sampled at 24 Hz, plotted at 0.1 Hz.
In the 24 hours of operation, there have been more than 100 on/off switches of
equipment in the system. During a selected period of 30 seconds (12300-12330) three
components were turned on sequentially. The power data series with sampling intervals
of 60, 10, 1, and 0.125 seconds are shown sequentially in the following charts.
Figure 3.12. Effect of different data sampling intervals on detectability of rapidly
occurring equipment start-up events. (a). 60 seconds; (b). 10 seconds; (c). 1 second; (d).
0.125 seconds.
As shown in the above charts, if an interval of 60 seconds is used to sample the
data, then these three turn-on events are taken as only one step change. With the sampling
interval reduced to 10 seconds, the events are only sampled as one point each, which still
cannot be taken as solid proof of a real step change. When the sampling interval is
reduced by another order of magnitude, i.e., to 1 second, then the first event can be
clearly recognized by eye and by the detector as well. Further decrease of the sampling
interval makes the second event discernible to eye, but not to the detector, which needs
an appropriate length of data window to build up the sufficient statistic. The second event
remains ambiguous to the detector until the sampling interval is reduced to 1/8 seconds,
i.e., a sampling rate of 8 Hz. However, tests to detect the cycling of a reciprocating chiller
in the given system showed that the false alarm rate increases drastically when the
sampling interval drops below 1 second. Based on the tests and analysis of the sampling
effect on the detection quality, the appropriate lower limit of the sampling interval for the
GLR detection has been found to be of the same order of magnitude as the shortest time
duration of the on/off events or the shortest time interval between two consecutive events
of interest, whichever is shorter. The upper limit should be longer than the lower limit by
about 1-2 orders of magnitude. In this case, the lower and upper limits are 1 and 30
second(s) respectively. In addition to a reduced false alarm rate, another significant
advantage of using such a range compared to the higher sampling rate of 8 Hz is the more
acceptable execution time of the detection. This is more meaningful when the automatic
detector is developed with multiple sampling intervals as discussed later. The running
time of the detector was reduced from 90 minutes to less than 5 minutes when the lower
limit of the sampling intervals increased from 0.125 seconds to 1 second for the test with
one day's data, making the detector more desirable for on-line detection.
One problem with such a moderate sampling range is that some very closely
separated events might be missed or misinterpreted, such as the above event with an
interval of less than 2 seconds. But such events seem to happen by coincidence and are
rarely seen in common HVAC systems because the switches of major electrically-driven
components are typically staggered to alleviate electrical surge. In practice, there is
generally an interval of not less than 30 seconds even for some coupled on/off equipment,
such as the supply and return fans associated with the same air handler.
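The order-of-magnitude rule above can be expressed as a small helper: the lower sampling-interval limit matches the order of magnitude of the shortest event duration or event spacing (whichever is shorter), and the upper limit is 1-2 orders of magnitude longer. This is purely a sketch of the stated heuristic; the function name and the default of 1.5 decades are assumptions.

```python
import math

def sampling_interval_range(shortest_event_s, shortest_gap_s, decades=1.5):
    """Return (lower, upper) sampling-interval limits in seconds."""
    t = min(shortest_event_s, shortest_gap_s)
    lower = 10 ** math.floor(math.log10(t))   # same order of magnitude as t
    upper = lower * 10 ** decades             # 1-2 orders of magnitude longer
    return lower, upper
```

For the test building discussed above (shortest event about 1 second, typical gaps no less than 30 seconds), this reproduces the 1-to-roughly-30-second range used in the text.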
ii). Detection of on/off changes with single sampling rate
Since the data patterns supplied to a detector and consequently the 'visibility' of
an event to the detector vary significantly with the data sampling frequency, the sampling
interval must be carefully determined based on the above order-of-magnitude analysis to
achieve desirable quality of the detection. In this thesis, many tests have been conducted
to search for the optimal sampling rates to detect all the on/off events in an HVAC
system. In the following analysis, one example of such tests is demonstrated with the
power data from a test building over six days. The tests were started with the detection of
the on/off cycling of a reciprocating chiller from the building's total power data. Fig. 3.13
illustrates the first seven hours of the total power data on a single day and the GLR
detection output for the chiller's on/off switches with different sampling rate. From the
original power data shown in Fig. 3.13(a), it is difficult to visually distinguish the chiller
cycles from the whole-building power signal. The first chiller cycle, aligned with data
from an electricity submeter used to check the GLR results, is readily discerned. Most of
the following switches are obscured by noises or other events.
Figure 3.13. Demonstration of the detection of chiller on-off cycles from the total power
data of a test building with different sampling intervals. (a). total power data from
remote-1 during 0:00--7:00, 99/05/23; (b). chiller power data from a submeter during
0:00--7:00, 99/05/23; (c) - (g). detection output of the chiller's on/off switches with
sampling intervals of 1, 2, 5, 10, and 20 seconds respectively.
All other sampling intervals between 1 and 60 seconds produced results similar to
those shown above. Apparently, with a single sampling interval, the detector is not able
to find all the on/off switches correctly as shown in the submetered data. However, it can
be seen from the above plots that all the events can be found by matching the on's and
off's among outputs with different sampling intervals. In this case, on/off matching
among the outputs with the sampling intervals of 1, 2, and 5 seconds identifies all the
switches without any false alarms or missing events. Other combinations of sampling
intervals may yield the same results, such as the combination of 1, 2, and 10 second(s).
Similar detection patterns with different sampling intervals have been found with the
remaining five days' data. This indicates that the quality of detection can be significantly
improved if the detection can be performed by properly integrating the outputs among
different sampling rates.
This is because, first, different components in a system have different time
constants that result in different duration for the turn-on switches. Second, for given sizes
of the pre- and post-event windows, the ideal sampling interval should be longer than any
of the on/off transients and shorter than the time interval between two on/off events.
Unfortunately, this condition can not always be met in commercial applications where
VSD drives and load controls are widely used, which lead to more gradual changes in the
system's total power series. As a consequence, an on/off transient process may last longer
than the interval between two events. Moreover, as can be seen from Fig. 3.13(a), the
quality of the total power data makes it difficult to isolate changes under different
conditions with a single sampling rate even for the same equipment.
iii). Detection of on/off changes by automatic matching among multiple sampling
intervals
In this detector, the data series is sampled with discrete integer intervals between
the lower and upper limits and then supplied to the GLR detector. The detection output
with each interval is sorted by the magnitude of the changes, assigned to the equipment
with that specific magnitude, and matched with the outputs with other sampling intervals.
Fig. 3.14 displays the output of this automatic detector for the data shown in Fig. 3.11
without false/missed alarms.
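The matching step can be sketched as follows: each sampling interval yields its own list of detected (time, power-change) events, and an event is confirmed when detections from different rates agree in time and in change magnitude. The grouping rule, tolerances, and requirement of two agreeing rates are illustrative assumptions, not the thesis's exact procedure.

```python
def match_events(outputs, time_tol=5.0, power_tol=0.2, min_rates=2):
    """outputs: one event list per sampling rate, each a list of
    (time_seconds, power_change_watts) tuples. Returns events confirmed
    by at least `min_rates` rates, with averaged time and magnitude."""
    candidates = sorted(e for out in outputs for e in out)
    used = [False] * len(candidates)
    confirmed = []
    for i, (t, dp) in enumerate(candidates):
        if used[i]:
            continue
        group = [(t, dp)]
        used[i] = True
        for j in range(i + 1, len(candidates)):
            if used[j]:
                continue
            tj, dpj = candidates[j]
            # Match on proximity in time and relative change magnitude.
            if abs(tj - t) <= time_tol and abs(dpj - dp) <= power_tol * abs(dp):
                group.append((tj, dpj))
                used[j] = True
        if len(group) >= min_rates:
            t_avg = sum(g[0] for g in group) / len(group)
            dp_avg = sum(g[1] for g in group) / len(group)
            confirmed.append((t_avg, dp_avg))
    return confirmed
```

An event reported by only one sampling rate is rejected as a probable false alarm, which is how the matching suppresses the spurious detections seen in the single-rate outputs of Fig. 3.13.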
Figure 3.14. GLR detection output for the chiller on/off switches with the building's total
power data by automatic matching among different sampling intervals: 1, 2, 3, 4, 5, 10,
20, and 30 seconds.
Additional data filtering criteria, if available, may be applied for specific
equipment to reduce the false alarm rate, especially with the presence of other
components with close magnitude of change. For the chiller in the test building, for
example, these criteria include minimum off-time between cycles, which is typically set
within the chiller controls to prevent unnecessary equipment cycling, and minimum
expected on-time. Such limits have been incorporated in the detector designed for the
given system.
Therefore, data sampling with multiple intervals not only sharpens the abruptness
of changes, which makes the event clearer for the GLR detector to "see", but also helps to
distinguish abrupt on/off cycles among a gradually changing process which for example,
can be a slow startup of a VSD fan. With only one sampling rate, such 'enclosed'
changes can hardly be found without false alarms. As shown in Fig. 3.15, fan turn-off
events as recorded by the NILM logger vary in their apparent abruptness, as influenced
by changes in other loads. All the four events were successfully detected with multi-rate
sampling but some were always missed with any single-rate detection.
Figure 3.15. Demonstration of the varying abruptness of the turn-off transient of the
same fans on two different days. Multi-rate sampling produces at least one data stream
with a 'visible' or 'abrupt' event for the GLR detector despite the noise variations in
the data environment, which could not be achieved with any single rate. (a). total power
data of the HVAC system, 20:07 99/05/11-4:07 99/05/12, showing the turnoff of the fans
among multiple events; (b). total power of the HVAC system, 20:07 99/05/24-4:07
99/05/25, demonstrating the impaired and obscured abruptness of the turnoff events in
the total power data at certain sampling rates.
3.7.2 Variance tracking by multi-rate sampling of the total power
In addition to the on/off changes, data spread about the mean value of a power
series is another important factor for monitoring system performance. An effective
parameter to describe this factor for a system's operating status is the standard deviation
or the variance of the sampled data set. Continuous high values of the standard deviation
usually indicate serious problems of a system's operation. For example, unstable control
in an HVAC system often leads to oscillation in the power data, which can be detected by
examining the standard deviation in the FMA window. However, although such an effect can usually be seen by a detector at some sampling rates, it may become ambiguous or even totally invisible to detection based on sampling intervals that are too short compared with the oscillation period or close to an integer multiple of the period. Therefore, if a wide range of sampling intervals can be used for the detection, the event monitor will be able to keep a more thorough track of the data trend and achieve a better understanding of the ongoing processes.
For fault detection, this means abnormal operation occurring at some specific
range of frequencies can be reliably identified by detection based on multi-rate sampling,
but might be very difficult to find with single-rate detection if the employed sampling interval falls within certain ranges. One example of such effects is illustrated in Fig. 3.16, showing the output of oscillation detection during 24 hours in a test building. The oscillation in the data series of Fig. 3.16(a) can be detected by checking the calculated standard deviation of the data in the pre-event window against a threshold when the data are sampled at an interval of five seconds, but it is not visible at all to detection with the one-second sampling interval, which was found too short relative to the oscillation period. Established as a fixed percentage of the current mean of the continuously updated samples, the threshold of the standard deviation varies with the moving window, as shown by the solid lines in Fig. 3.16(b) and (c).
It can be seen that this approach is able to mitigate the problem of picking a
sampling rate appropriate for detection of oscillations at an unknown frequency. As a
result, data tracking based on multi-rate sampling can be easily implemented and yields
more reliable output than that with single sampling rate.
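The variance tracking just described can be sketched as follows. This is a minimal illustration with assumed window length and threshold fraction, not the thesis detector; multi-rate sampling is approximated by decimating a 1 Hz series:

```python
# Sketch of variance tracking over a moving window: the standard deviation of
# the windowed samples is compared against a threshold set as a fixed
# fraction f of the window's current mean, repeated at several sampling
# intervals so oscillations are not hidden by an unlucky choice of rate.
import math

def window_std_alarm(power, interval, window, f):
    """Return True if any window of the downsampled series exceeds f * mean."""
    series = power[::interval]                 # downsample by decimation
    for i in range(len(series) - window + 1):
        w = series[i:i + window]
        mean = sum(w) / window
        std = math.sqrt(sum((x - mean) ** 2 for x in w) / window)
        if std > f * mean:
            return True
    return False

# 1 Hz data with a 10 s oscillation: invisible at a 10 s interval (an integer
# multiple of the period, so every sample lands on the same phase), but
# clearly visible at a 3 s interval.
data = [5000 + 500 * math.sin(2 * math.pi * t / 10) for t in range(600)]
print(window_std_alarm(data, interval=10, window=6, f=0.05),
      window_std_alarm(data, interval=3, window=6, f=0.05))
```

The example reproduces the pathology discussed above: a sampling interval matched to the oscillation period sees a perfectly flat series.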
[Figure 3.16: (a) total power (W) vs. time (s); (b), (c) calculated standard deviation and its threshold (solid line) vs. time (s).]
Figure 3.16. Effects of sampling rate on the detection of oscillation in a power series. (a). twenty-four-hour total power data of a test building's HVAC system, exclusive of the chiller; (b), (c). detection output based on sampling intervals of 1 second and 5 seconds respectively.
3.8 Summary of the training guidelines
Guidelines for training the parameters involved in the GLR detection used in the
previous sections are summarized as follows.
1). Record electrical power for the circuits monitored by the data logger for one day
under typical operating conditions. The sampling rate should be between 1 and 10 Hz for
common HVAC systems;
2). Locate the events from the abrupt changes in the total power data and estimate the
fastest and the slowest events;
3). Determine the sampling rates for detection. The base sampling rate will be used as the fastest sampling rate if multi-rate sampling is employed; power data sampled at this rate should therefore allow each event of interest to be discerned by eye. The lower limit of the sampling rate should be able to pick up the longest duration of the events of interest. The other sampling rates can then be chosen between these two limits, with 2-5 rates selected within each order of magnitude;
4). Determine the window lengths. The length of the detection window should contain at
least two data points. It should not be longer than the interval between two consecutive
events and never shorter than a disturbance. The length of the pre-event window is 2-4 times that of the detection window. A third window with the same length as the pre-event window is used for data reset;
5). Calculate and estimate the maximum of the standard deviation of the power data as a fraction of the current total power data, f. Note that periods with abrupt changes should be avoided in the calculation of f. The lower limit of the standard deviation should be set one order of magnitude lower than the estimated value f.
6). Estimate the threshold for the detection statistic. Without knowledge of the minimum expected change Vm in the total power data, a reasonable base value for the threshold is 1/f^2. If Vm is known in advance, then the training of the threshold can be started with (s/Vm)^2, where s is the average standard deviation of the samples from the training data. The threshold for the GLR can be trained by adjusting this value until all events of interest can be seen by the detector, with a minimum number of false alarms. The lower limit of the GLR can be started with 1/(10f)^2 or (s/(10Vm))^2.
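Step 5) can be sketched in code. The window length and the jump test used to exclude periods with abrupt changes are illustrative assumptions, not the thesis implementation:

```python
# Hedged sketch of estimating the standard-deviation fraction f from training
# data, skipping windows that contain abrupt changes (illustrative window
# length and change test; not the thesis code).
import statistics

def estimate_f(power, window=6, jump_frac=0.05):
    """Max std/mean over quiet windows; windows with abrupt changes skipped."""
    worst = 0.0
    for i in range(0, len(power) - window + 1, window):
        w = power[i:i + window]
        mean = statistics.fmean(w)
        if max(w) - min(w) > jump_frac * mean:   # abrupt change: skip window
            continue
        worst = max(worst, statistics.pstdev(w) / mean)
    return worst

power = [1000, 1002, 998, 1001, 999, 1000,       # quiet window: used
         1000, 1500, 1500, 1502, 1498, 1500]     # contains a 500 W step: skipped
f = estimate_f(power)
print(f, f / 10)    # estimated f and a lower limit one magnitude below it
```

With f in hand, 1/f^2 gives a starting point for the GLR threshold as described in step 6).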
Note that the starting values for the training of the parameters are estimates, not exact values. Moreover, exact values are not expected for the above parameters due to the statistical properties of the detection method; slightly different combinations of these parameters may yield equally acceptable detection output.
The detector developed in this research has been successfully applied to two test
sites. With the above basic rules and guidelines, the training process for the parameters
becomes much easier. Because the rules for the window lengths are for common HVAC
systems, similar window lengths can be used in different applications with minor
adjustments and the major training task is usually reduced to determination of the
thresholds for the sufficient statistic and the standard deviation.
3.9 Application of the GLR model in fault detection with the system's total power input
Faulty operation in common HVAC systems often leads to excessive energy input to some components and hence abnormal total power supplied to the systems. Therefore, in
addition to detection of the occurrence of on/off switches as in the previous sections,
faults in an HVAC system can also be identified by monitoring the magnitude of changes
in the total power data or time duration of an event with appropriate techniques. Without
submetered power data and the related reference parameters, those abnormal changes can
usually be recognized during the on/off switches of the components. For example, an
offset in the signal of the static pressure sensor leads to higher power input to the supply
fan, which can be found from the magnitude of the total power change when the fan is
turned off, but can hardly be seen if the fan keeps running.
In principle, faults that cause abnormal magnitude change or cycling frequency
can be detected with the change detection algorithm presented in the previous sections.
However, depending on the relative magnitude of the power change to the monitored
total power and the noise level, detection of significant changes caused by components'
on/off switches must be implemented at different system levels. It is difficult to derive
the analytical form of the threshold to define the appropriate system level for the
monitoring of a specific component. From the tests conducted with some typical HVAC
systems, the detectable magnitude should not be less than 5% of the total power, which is
generally close to the observed magnitude of the standard deviation relative to the total
power data under normal operation. For example, in the tests with the HVAC system
used in the previous section, the reciprocating chiller generally switches between 5 and 10
kW, which is about 5-10% of the building's total power input, while the turnoff power of
the supply fan and other equipment is usually around 0.5 kW. Therefore, faults related to
chiller operation can be found from the building's total power data while those related to
the supply fan can be better detected from the total power data of a subsystem, e.g., the
HVAC system exclusive of the chiller. This is feasible in buildings where fans and
pumps are served by a motor control center in HVAC applications while chillers are
handled separately. The effect of the relative magnitude of a change on the detection
accuracy is studied in Chapter 5.
Another typical undesirable operating status of an HVAC system is the unstable
control of the system, which may not only cause extra energy cost but also harm the
related equipment directly. Since oscillation in power consumption is typical of an
unstable control system, the instabilities can be identified by checking the variance or the
standard deviation of the sampled sequence as discussed before.
The detector developed with the enhanced GLR algorithm and data processing
techniques has been tested with faults that cause the above effects in the total power. In
this section, some of the tests for several typical degrading and abrupt HVAC faults
related to pressure sensors, dampers, valves, etc., are presented and discussed for the
application of this detector. Data were collected from a typical HVAC system in a test
building as shown in Fig. 3.17 [Norford et al, 2000]. Further details about the system and
fault implementation are given in the appendix.
[Figure 3.17: schematic diagrams of the test building areas served by AHU-A, AHU-B, and AHU-1, the chilled water circuit (with chilled water pumps, CHWP), and an air handling unit, with an instrumentation legend (electric power transducers, pressure transducers, flow meters, temperature probes) and measurement points such as OA-TEMP, OA-HUMD, RA-TEMP, RA-FLOW, MA-TEMP, SA-TEMP, SA-FLOW, HTG-EWT/MWT/LWT, CLG-EWT/MWT/LWT, SF Watts, and RF Watts.]
Figure 3.17. Schematics of: a). the test building; b). the chilled water flow circuit; and c). the air handling units.
Tests were conducted in a one-story building that combines laboratory-testing
capability with real building characteristics and is capable of simultaneously testing two
full-scale commercial building systems side-by-side with identical thermal loads. The
building is equipped with three variable-air-volume air-handling units: AHU-A, AHU-B,
and AHU-1. AHU-A and B are identical, while AHU-1 is similar but larger to
accommodate higher thermal loads. Two fan coil units are also installed in this building
but were not used in the tests for this research. The major components of the AHUs are the recirculated air, exhaust air, and outdoor air dampers; cooling and heating coils with control valves; the supply and return fans; and ducts to transfer the air to and from the conditioned spaces. Air from the AHUs is supplied to VAV box units, each having electric or hydronic reheat.
All six fans and 10 pumps, with capacities ranging from 0.5 to 5.0 kW, are
served by a motor control center. The pumps in AHU-A and B are equipped with
constant speed motor drives while the other pumps and all the fans have variable speed
motor drives. The total power input of the motor control center is recorded by a logger
named remote-2 and each component is monitored with a power submeter for the tests.
The cooling load is handled by a two-stage air-cooled reciprocating chiller with a
nominal input of 10 kW. The power input of the chiller used for this test is supplied
through the building's general electricity distribution circuit, which is monitored by
another power logger called remote-1. For the GLR detection, power data from remote-1
were used for analysis of faults associated with the chiller and power data from remote-2
were used for detection related to the fans and the pumps.
In addition to the power meters, the AHUs are well instrumented with sensors for
all the controlled variables in common HVAC systems, which include the basic
measurements used for the detection method developed in this research.
Table 3.1. Power capacity and motor types of fans and pumps served by the motor control center.

Equipment                 Power capacity (kW)   Motor drive speed
Supply fan 1              5.0                   Variable
Return fan 1              2.0                   Variable
Supply fan A              5.0                   Variable
Return fan A              2.0                   Variable
Supply fan B              5.0                   Variable
Return fan B              2.0                   Variable
Hot water pump 1          0.75                  Variable
Hot water pump A          0.4                   Constant
Hot water pump B          0.4                   Constant
Hot water pump LA         0.4                   Variable
Hot water pump LB         1.0                   Variable
Chilled water pump 1      1.0                   Variable
Chilled water pump A      0.4                   Constant
Chilled water pump B      0.4                   Constant
Chilled water pump LC     1.0                   Variable
Chilled water pump CH     1.0                   Variable
Total                     32.25
3.9.1 Model calibration and parameter identification
As summarized in Section 3.8, parameters for event monitoring with power
consumption can be obtained without any special control operation or arrangement of the
HVAC system. With the data from normal operation, the related thresholds are
established for future detection.
To find the power threshold for the turnoff of the supply fan by the detector, data
at the turnoff time must be sampled. Ten minutes are needed each day, five before and five after the turnoff, and three to five days of data during this time slot are used to eliminate the effect of data noise in the magnitude. Thus for the turnoff of the supply fan,
30-50 minutes are required in 3-5 days of normal operation or with faults that do not
significantly affect the fan power during the late evening when the fan is turned off.
For detection involving the chiller power, the power cycling during the low load
condition must be used to estimate the normal intervals between two consecutive turn-on
events of the chiller. Cooling load generally reaches its minimum during the early
morning hours and the chiller cycling will be maintained at a minimum constant
frequency under normal operation. In this test, six hours of the building's total power data
from midnight to 6 am are required to determine the cycling period.
For detection that requires the full operation of the economizer, i.e., a 100% open outdoor air damper with the outdoor air temperature slightly higher than the supply air temperature during the operating time, the required weather condition is not easy to find. Although only 5-6 hours of such a period are needed, it is not typically seen during summer and is more likely to be found under spring or fall load conditions.
For the estimation of the noise in the power data, a period of 24 hours that covers
the operating conditions of a typical day has been found sufficient to calibrate the
threshold for the calculated standard deviation. Table 3.2 summarizes the trained values
for the detection with the test system.
Table 3.2. Trained parameters for the detection and diagnosis in the test system.

Parameters                                                  Values
Sampling intervals (seconds)
  Fastest events                                            1
  Slowest events                                            600
  Base sampling interval                                    1
  Other sampling intervals                                  2, 3, 4, 5, 10, 20, 30
Window lengths (data points)
  Pre-window                                                6
  Detection window                                          4
Standard deviation as the fraction of the power data (W)
  Upper limit                                               0.1
  Lower limit                                               0.02
Threshold for the sufficient statistic
  Upper limit                                               20
  Lower limit                                               5
Chiller cycling interval (minutes)                          35
Normalized outdoor air temperature                          0.2
For the non-intrusive detection based on the centralized power data, besides the
two meters for the power input of the building and the motor control center, the only
sensor needed is the outdoor air temperature, which is generally available with the
building energy management system.
Three setpoints are used for the temperatures of the supply air, the room air, and
the balance point of the outdoor air for the economizer control.
3.9.2 Detection of typical faults with the total power data
a). Detection of faulty operation by change magnitude at turnoff
A change in the total power data at turn-off is generally more abrupt and hence easier to detect than one at turn-on, especially for equipment with startup protection or variable load control. Therefore, detection of faults related to magnitude change in total power is conducted at turn-off. Any faults that cause abnormal power
consumption can be detected by this method if the related equipment is turned off under
similar load conditions. In HVAC applications, equipment is turned off or run under the economizer operation mode at a fixed time when the building is not occupied during the late evening and early morning, which means a similar low load condition at turnoff each day. If the detected change in the total power data caused by turnoff of the equipment at the given time is significantly higher than a trained threshold for this component, a fault alarm will be issued for it. For example, a VSD fan tends to consume
more energy when the control loop is not working properly. Fig. 3.18(a) shows the total
power profile of the motor control center under normal and faulty operating conditions
throughout 24 hours (8:00pm - 8:00pm) of 3 different days, with the supply fan turned
off at 10:00pm on each day. Fig. 3.18(b) compares the detected magnitude of the changes
under normal and faulty conditions with additional normal days of data. All the
points under 600 watts are the detected turnoff power changes under normal conditions
during 16 days. The two days with pressure sensor offset clearly show significantly larger
values (> 600 W). Therefore, by setting a threshold of 600 W for the change magnitude,
abnormal power consumption can be recognized by the detector. Other faults with similar
effects can also be found in the same way, such as an unexpectedly restricted air flow
path that also leads to higher fan power consumption and hence a larger magnitude of
change in the system's total power when the fan is turned off. Fig.3.18(b) presents such
an example of a stuck-closed recirculation air damper.
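The magnitude check just described reduces to a one-line comparison. The sketch below uses the 600 W threshold trained for the test building; the function name and data values are illustrative:

```python
# Sketch of the turnoff-magnitude check of Fig. 3.18(b): the detected power
# drop at the scheduled fan turnoff is compared with a trained threshold
# (600 W for the test building; sample values are illustrative).

def turnoff_alarm(detected_drop_w, threshold_w=600.0):
    """Flag abnormal fan power at turnoff (e.g. static-pressure sensor offset)."""
    return detected_drop_w > threshold_w

normal_days = [380, 410, 395, 420, 405]    # typical turnoff drops, W
faulty_days = [1250, 980]                  # sensor offset / stuck damper days
print([turnoff_alarm(d) for d in normal_days + faulty_days])
```

A complementary check for a missing or extremely small drop at the scheduled turnoff time (e.g. a slipping fan belt) would compare against a lower bound in the same way.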
[Figure 3.18: (a) 24-hour total power (W) profiles for 99/05/22 (normal), 99/05/15 (pressure sensor offset), and 99/02/28 (stuck RA damper); (b) detected turnoff power change (W) vs. flow rate (CFM), with the normal days' points below 600 W and the faulty days 99/05/10 and 99/05/15 (pressure sensor offset) and 99/02/28 (stuck-closed recirculation damper) above.]
Figure 3.18. Detection of the faults of pressure sensor offset and stuck-closed damper from the magnitude of changes in the HVAC system's total power input at the turnoff of a supply fan. a). 24-hour total power data for three typical days of an air conditioning system including 6 fans and 10 pumps; b). detection output of the total power change caused by the turnoff of the fan under normal and faulty operating conditions.
While faults can be detected when an abnormally large magnitude is found at turnoff, an extremely small or missing change at the designated turnoff time also indicates potential faulty operation. If the turnoff of a component is not found at the given time, then a fault has likely occurred in the system. For example, a fan with a slipping belt may produce an extremely small magnitude of total power change at turnoff.
b). Detection of faulty operation from on/off cycling
Sometimes, a fault does not cause any abnormality in the power magnitude, but,
rather, in the on/off cycling frequency of the equipment, especially when the motor runs
at a constant speed. This method is very effective because in HVAC systems, especially in commercial applications, there are generally restrictions on the on/off cycling of some equipment to protect it from deterioration caused by frequent switching. For such components, some faults can accordingly be analyzed by examining the on/off intervals
based on the detected events from the power input with reference to the current load
condition or the design specifications.
For example, chiller operation is generally controlled based on the entering and
leaving water temperatures, which are mostly affected by the cooling load introduced
through the cooling coil valve. If the load condition stays around some stable state, e.g.,
during the late night or early morning in an office building, the chiller cycling period
should also remain constant under normal operation. When the cooling coil valve is
leaking, the load of the chiller tends to increase due to the higher flow rate of water to be
processed, which tends to cause shorter off-on intervals for a chiller with stepwise
control. Therefore, the leaky valve might be identified if the chiller on/off switches can
be correctly detected during periods under appropriate load conditions.
It should be noted that with the chiller power input, the leaky fault can be most
reliably identified when the cooling coil valve is closed by the control signal. The
following charts show the output obtained by the detector from the total power data of the
test building. In this system, the cooling coil valve is closed and the chiller only cycled on
and off regularly by the design intent during the early morning hours. Under normal
conditions, the on-off duration and the off-on interval were 4-5 minutes and 38-39
minutes respectively, i.e., about 10 full on/off cycles in total between 0:00-7:00 am when
the valve was closed. Therefore, the threshold of a full cycle that consists of the on-off
duration and the off-on interval has been calculated to be 42 minutes/cycle, which was
then used as the minimum time period for a full on/off cycle. If the valve is leaky, the average period of a full on/off cycle tends to fall below this threshold when the valve is expected to be closed, and an alarm will be issued for the shortfall by the automatic detector. During the FDD tests in spring, the leaky valve fault was found on three days, with average full cycling intervals of 33.2, 30, and 27.3 minutes respectively.
Fig. 3.19 demonstrates the detection output of one day with the leaky valve.
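The cycling check can be sketched as follows, using the 42 minutes/cycle threshold from the test system; the event times and function names are illustrative:

```python
# Sketch of the leaky-valve check: the average full on/off cycle period,
# computed from consecutive detected chiller turn-on times while the cooling
# coil valve is commanded closed, is compared with the trained minimum of
# 42 minutes per cycle (event times below are illustrative).

def avg_cycle_minutes(turn_on_times_min):
    """Average interval between consecutive chiller turn-on events."""
    gaps = [b - a for a, b in zip(turn_on_times_min, turn_on_times_min[1:])]
    return sum(gaps) / len(gaps)

def leaky_valve_alarm(turn_on_times_min, min_cycle_min=42.0):
    """Alarm when cycles are shorter than the normal minimum period."""
    return avg_cycle_minutes(turn_on_times_min) < min_cycle_min

normal = [0, 43, 86, 129, 172]     # ~43 min/cycle: within design intent
leaky = [0, 33, 66, 100, 133]      # ~33 min/cycle: shorter cycles, alarm
print(leaky_valve_alarm(normal), leaky_valve_alarm(leaky))
```

The 33 min/cycle case mirrors the faulty days reported above (33.2, 30, and 27.3 minutes).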
Other faults that cause the abnormal cycling of the chiller can also be detected
similarly. However, the conditions for the detection might be different depending on the
characteristics of the fault and the control system setup. For example, the fault of a leaky
recirculation air damper can only be seen from the chiller cycles when the system is in
the economizer mode, i.e., when the outside temperature falls within a certain beneficial
range for energy conservation. In this test building, it has been found that the abnormal
cycles could be found only when the outside temperature was slightly above the point
where the supply air temperature controller began to open the chilled-water valve in order
to maintain the supply-air temperature setpoint. Limiting the examination of the cycling
frequency to this narrow region reduces the risk of false alarms due to normal changes in
cycling rate at higher outside temperatures when the cooling load increases. The rule used
in the tests is:
IF ( 0 < (Toa - Tspt.sa)/(Tspt.ra - Tspt.sa) < ht.oa AND chiller cycling frequency > threshold ) THEN alarm
Where ht.oa is the threshold for the dimensionless outdoor air temperature calculated as
the ratio of the difference between the outdoor air temperature and the supply air
temperature set point to that between the supply and the return air temperature set points.
Although it is desirable to keep the temperature threshold as small as possible to minimize the false alarm rate, the detection rule should allow for some variation in the outdoor air temperature. In this case, ht.oa < 20% was found acceptable.
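The rule above transcribes directly into code. The setpoint values below are illustrative, and ht.oa is taken as the 0.2 found acceptable in the tests:

```python
# Direct transcription of the economizer-band cycling rule (illustrative
# setpoints; h_t_oa = 0.2 as found acceptable in the tests).

def recirc_damper_leak_alarm(t_oa, t_spt_sa, t_spt_ra,
                             cycling_freq, freq_threshold, h_t_oa=0.2):
    """Alarm only inside the narrow band just above the supply-air setpoint."""
    theta = (t_oa - t_spt_sa) / (t_spt_ra - t_spt_sa)  # normalized OA temperature
    return 0 < theta < h_t_oa and cycling_freq > freq_threshold

# 13 C supply / 23 C return setpoints; 14 C outside air lies in the band.
print(recirc_damper_leak_alarm(14.0, 13.0, 23.0,
                               cycling_freq=2.2, freq_threshold=1.5))
```

Restricting the test to the narrow band is what keeps normal load-driven cycling changes at higher outside temperatures from triggering false alarms.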
[Figure 3.19: (a) total power (W) vs. time (s); (b), (c) chiller cycles (W) vs. time (min.).]
Figure 3.19. Detection of the fault of a leaky cooling coil valve from the cycling frequency of the chiller with the building's total power data under low load conditions during the early morning hours (0:00-7:00, May 22, 1999). (a). 7-hour total power data of the system with a leaky cooling coil valve; (b). chiller cycles from the submetered chiller power; (c). chiller cycles as the detected power changes.
c). Detection of faulty operation from the standard deviation
As discussed in Section 3.7.2, unstable control of equipment usually leads to
oscillations in the system's total power consumption, which can be detected by analyzing
such characteristics as the standard deviation of the sampled power sequence.
What needs to be noted here are the rules for the alarm of power oscillation. The
rule used to detect an abrupt change is:
IF ( sufficient statistic > upper limit of the GLR AND the standard deviation in
the detection window < lower limit for the standard deviation ) THEN alarm for
abnormal change and check the magnitude;
The rule for detecting power oscillation is:
IF ( sufficient statistic < lower limit of the GLR AND the standard deviation in
the detection window > upper limit for the standard deviation ) THEN alarm for
oscillation.
The reason to combine the sufficient statistic and the standard deviation is that
whenever an abrupt change occurs, the data variance in the detection window increases
significantly and a short alarm for oscillation might be issued. To eliminate such false
alarms and also in consideration of the data noise, a certain ambiguous region should be
discarded in search of the real events. The upper and lower limits depend on the
equipment characteristics of the HVAC system and can be trained with test data. For a
given system, these parameters remain constant.
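The two rules can be combined into a single classifier. The limit values below are the trained constants from Table 3.2; the function name and return values are illustrative:

```python
# The two detection rules written as one classifier (limit values taken from
# Table 3.2; names and return values are illustrative).

def classify(glr, std, glr_hi=20.0, glr_lo=5.0, std_hi=0.1, std_lo=0.02):
    """Return 'change', 'oscillation', or None for one detection window."""
    if glr > glr_hi and std < std_lo:
        return "change"           # abrupt change: go on to check its magnitude
    if glr < glr_lo and std > std_hi:
        return "oscillation"      # unstable control
    return None                   # ambiguous region: discarded

print(classify(glr=30.0, std=0.01),   # strong statistic, quiet window
      classify(glr=2.0, std=0.15),    # weak statistic, noisy window
      classify(glr=30.0, std=0.15))   # variance spike during a step: no alarm
```

The third case shows why the two conditions are combined: the variance jump that accompanies every abrupt change falls into the discarded ambiguous region instead of raising a spurious oscillation alarm.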
3.10 Results and discussion
This chapter demonstrates that low-cost information about individual electrical
components in a building can be obtained via careful analysis of power measurements at
central locations within the building, notably the electrical service entrance and motor-control centers that supply power to the HVAC components. A reliable automatic detector suitable for use in noisy and complex electrical environments has been developed,
involving establishment of guidelines for tuning the detector, innovations of the detection
algorithm, and preprocessing of data for the detector including design of a filter and data
sampler of multiple rates, all contributing to the significantly improved performance of
the detector (enhanced sensitivity to signals of interest and rejection of electrical noise).
Test results show that with the centralized power data, the detector is able to conduct
reliable detection for on-off events of interest and keep track of the data trend in the
power data series from real buildings with low costs and little intrusion into the system.
The performance of the detector has been proved by the detection of several
typical faults related to abnormalities in the magnitude of on/off changes, frequency of
equipment cycling, or trend of data in real HVAC systems.
The multi-rate sampling technique enables the detector to find changes that may
be invisible to detection with single sampling rate. However, on/off events that cross each
other's time bounds can hardly be detected without false reports of magnitude change. To
distinguish such changes, additional information such as control signals from the building energy management system is necessary to avoid false alarms. Dynamic analysis of the
transient process is also a possible solution. Moreover, although faults usually cause
abnormal effects in power consumption and can be detected and initially diagnosed with
the algorithm developed in this chapter, further identification is necessary to find the real cause of a fault. This indicates that more specific information is needed to describe the components in the system, which leads to the study of detection and diagnosis at deeper levels of the system, as presented in Chapter 4 and Chapter 6.
CHAPTER 4
Component power modeling for fault detection and diagnosis
This chapter describes the other steady-state method for fault detection and
diagnosis based on the modeling of submetered power input of equipment in HVAC
systems. First, typical faults in HVAC systems and their effects in power consumption
with reference to other parameters are studied. Then correlation between submetered
power data and basic control signals or measurements available in common HVAC
systems is analyzed based on a least-squares fitting algorithm with singular value
decomposition. Finally, models are developed for major HVAC components in real
buildings and tested with typical faults.
4.1 Introduction
Faults in HVAC systems lead to abnormal power consumption at both the system
and the component levels. For detection at the system level, the most valuable data are
the change magnitude in the total power data when a piece of equipment is turned on or
off and the standard deviation of the sampled data in the FMA window as demonstrated
in Chapter 3. Although it can be used to find the faulty operating status of a system, the
detector based on the system's total power input is usually not able to issue reliable
alarms until a fault becomes serious enough to be recognizable in the noisy power
environment. In practice, this leads to unacceptable system malfunctions and energy
waste before the fault can be identified. In order to maintain desirable indoor air quality
and avoid excessive energy consumption, faults should be eliminated at an early stage.
This means detection and correction of faults should be implemented not only at the
system level during the on/off switches of a device, but also at the component level when
the equipment is running, which requires closer monitoring of the component. For
example, the gradual offset of a static pressure sensor for the fan speed control may result
in considerable energy waste before it becomes discernible in the total power data. In
order to keep track of such variations in power consumption at an early stage, the load
needs to be monitored at the component level with reference to appropriate control
signals or measurements.
In a common air handling process, the pressure-independent VAV system
maintains a constant static pressure at a certain location in the supply air duct by control
of the fan speed while the heating/cooling coil adjusts the supply air temperature
according to the load conditions via control of the valve and the chiller. Abnormal
operation of any of the above devices tends to cause undesirable air conditions or
excessive energy consumption or both. Faults in an air handling unit (AHU) may occur
on either the air or the water side. In the air loop, faults are usually caused by some type
of blockage or leakage that changes the resistance or the setpoint for the air flow and
hence the driving power drawn from the fan, e.g., a stuck damper in the air duct. This
indicates that with a given air flow rate, the fan power under faulty conditions will differ
from that in normal operation. Such faults can therefore be found and analyzed by
monitoring the variations in the fan power with reference to the air flow rate. Previous
research has shown that power consumption of a fan with variable load adjustment can be
described as a polynomial function of the air flow rate or the fan speed with constant
static pressure control [Englander and Norford, 1990a]. Similarly, in the water loop, such
blockage and leakage can be identified with the pump and/or the chiller power.
It should be noted that some degradation must be accepted for most equipment in
common HVAC systems after certain time period of operation. In order to avoid false
alarms caused by random disturbances and allow a reasonable level of defects, it is
appropriate to introduce an offset or a confidence interval for the fitted function. With a
given control signal or measurement, if the sampled power data exceed a certain threshold
or falls out of the confidence interval of a power function, a fault is identified and then
diagnosed with expert rules.
The objective of this chapter is to establish an appropriate algorithm in the matrix
form and use it to develop the power functions of the major equipment in common
HVAC systems and then establish the means for detection and diagnosis of typical faults
in VAV AHUs by applying proper confidence intervals to the functions.
4.2 Component power modeling by correlation with basic measurements or signals
The power model of a component depends on the type of the related motor drive.
For a constant-speed motor drive with finite state control, the submetered power can be
established as a stepwise function in accordance with the range of the reference data. For
equipment with continuously adjusted control signals or load, such as a fan with a
variable speed motor or a variable inlet vane, the power consumption can be properly
represented by a polynomial function of the reference parameter.
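The stepwise model type can be illustrated with a minimal sketch (Python here for illustration; the state names and wattage values are hypothetical placeholders, not measurements from the thesis):

```python
def stepwise_power(state):
    """Power model for a constant-speed drive with finite-state control:
    a stepwise (lookup) function of the control state.  The states and
    wattages below are illustrative placeholders."""
    table = {"off": 0.0, "low": 1200.0, "high": 2400.0}
    return table[state]
```

A polynomial model for continuously adjusted equipment is fitted as described in the next section.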
4.2.1 Correlation by linear least-square with singular value decomposition
Since the measurement errors or the noise in the power data follow the normal
distribution, the maximum likelihood estimation of the coefficients in the power function
can be carried out by the chi-square minimization [Draper and Smith, 1981]. Hence the
linear least-squares fitting is to find the coefficients a_j (j = 1, 2, ..., M) of the function
y(x) = \sum_{j=1}^{M} a_j f_j(x)    (4.1)

from a set of measured data pairs (x_i, y_i) (i = 1, 2, ..., n) by minimizing the chi-square function

\chi^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \sum_{j=1}^{M} a_j f_j(x_i)}{\sigma_i} \right]^2    (4.2)

where f_1(x), ..., f_M(x) are arbitrary functions of x, called basis functions. Note that the
"linear" fitting here means y(x) is linearly dependent on the parameters a_j, while the
functions f_j(x) can be linear or nonlinear.
The minimum of \chi^2 can be found from its partial derivatives with respect to the
M parameters a_j, \partial\chi^2/\partial a_j = 0 (j = 1, ..., M), which result in the following equation set
in matrix form,

(X^T \cdot X) \cdot a = X^T \cdot y    (4.3)

where

X = \begin{pmatrix} f_1(x_1)/\sigma_1 & f_2(x_1)/\sigma_1 & \cdots & f_M(x_1)/\sigma_1 \\ f_1(x_2)/\sigma_2 & f_2(x_2)/\sigma_2 & \cdots & f_M(x_2)/\sigma_2 \\ \vdots & & & \vdots \\ f_1(x_n)/\sigma_n & f_2(x_n)/\sigma_n & \cdots & f_M(x_n)/\sigma_n \end{pmatrix}, \quad
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{pmatrix}, \quad
y = \begin{pmatrix} y_1/\sigma_1 \\ y_2/\sigma_2 \\ \vdots \\ y_n/\sigma_n \end{pmatrix}
Let A = X^T \cdot X and b = X^T \cdot y; Eq. (4.3) can be rewritten as

A \cdot a = b    (4.4)

and then the normal equations in Eq. (4.3) can be solved for the a_j's by multiplying both
sides with A^{-1},

a = A^{-1} \cdot b    (4.5)
However, the solution of Eq. (4.5) is rather susceptible to roundoff errors when the
condition number, the ratio between the largest and the smallest singular values of the
matrix, is extremely large. To avoid poor approximation caused by
an ill-conditioned matrix, a method called singular value decomposition (SVD) is used to
decompose the matrix A [Golub and Van Loan, 1989]. The SVD algorithm is based on
the following theorem of linear algebra: Any M x N matrix A whose number of rows M
is greater than or equal to its number of columns N can be expressed as the product of an
M x N column-orthogonal matrix U, an N x N diagonal matrix W with positive or zero
elements, and the transpose of an N x N orthogonal matrix V .
A = U \cdot W \cdot V^T    (4.6)

Hence the inverse of A is

A^{-1} = V \cdot [\mathrm{diag}(1/w_j)] \cdot U^T    (4.7)
In this clearly decomposed structure, the only source of undesirable results is
the terms 1/w_j when w_j = 0 or w_j \to 0 from roundoff. Such a singular matrix can still be
solved for a 'healthy' set of a_j by simply replacing 1/w_j by zero if w_j = 0 or w_j \to 0,
which in effect removes one linear combination of the parameters that causes the decrement
of the rank of A.
Note that in the above discussion, there is no restriction for the type of the
function fj (j=1, 2, ... , M) and the method can also be used for fitting with multiple
independent variables, i.e., multi-dimensional correlation.
The coefficients of the original function can be rewritten as

(a_1, a_2, \ldots, a_M)^T = V \cdot [\mathrm{diag}(1/w_j)] \cdot U^T \cdot b    (4.8)

Analytically, the dependent variable can be estimated for any new data as

\hat{y}(x) = [f_1(x), f_2(x), \ldots, f_M(x)] \cdot a    (4.9)
In practice, however, this fitted model is always susceptible to some level of
uncertainties due to the unavoidable errors caused by the measurements and the model fit
itself. To achieve reliable fault detection, a range based on statistical confidence must be
defined to accommodate the error effects.
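The procedure of this section can be sketched as follows. The thesis implementation was coded in FORTRAN; this illustrative version uses Python/NumPy, applies the SVD to the design matrix directly (a numerically preferable route to the same least-squares solution as the normal equations), and zeroes 1/w_j for vanishing singular values as described above. The synthetic fan data are hypothetical:

```python
import numpy as np

def svd_fit(x, y, basis, rcond=1e-10):
    """Least-squares coefficients a_j of y(x) = sum_j a_j f_j(x), obtained
    through singular value decomposition.  Singular values below
    rcond * max(w) are treated as zero, i.e. 1/w_j is replaced by zero,
    as described in the text."""
    X = np.column_stack([f(x) for f in basis])   # design matrix X_ij = f_j(x_i)
    U, w, Vt = np.linalg.svd(X, full_matrices=False)
    w_inv = np.where(w > rcond * w.max(), 1.0 / w, 0.0)
    return Vt.T @ (w_inv * (U.T @ y))            # a = V diag(1/w_j) U^T y

# quadratic fan-power model P = a0 + a1*x + a2*x^2 on synthetic data
rng = np.random.default_rng(0)
x = np.linspace(500.0, 3000.0, 200)              # air flow rate (CFM), synthetic
P = 100.0 + 0.05 * x + 3.0e-4 * x**2 + rng.normal(0.0, 5.0, x.size)
a = svd_fit(x, P, [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2])
```

Because the basis functions are passed in as callables, the same routine serves linear or nonlinear basis functions and multi-dimensional correlations alike.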
4.2.2 Selection of parameter for correlation
Power consumption of a component in an HVAC system may be affected by more
than one factor and can be modeled as different functions through the above equations
with reference to various indices. However, the resulting functions may vary greatly in
terms of accuracy and applicability and hence differ significantly in their efficacy of
distinguishing between faulty and normal operations when used in fault detection.
For fault detection, to achieve reliable outputs, a model should be able to present a
consistent and sensitive response in power to the reference parameter. Therefore,
selection of the reference parameter is crucial for the power model. The major criteria for
selecting such a parameter can be ascribed as its ability to predict the equipment's power
input, the accessibility of the measurement, and the detectability of multiple faults with
the model. The ability of prediction indicates minimum uncertainties in the correlation
between the dependent and the independent variables, which requires the power
consumption to be closely and ideally, exclusively, related to the parameter. For example,
although fan power changes with outdoor air temperature, it is hard to formulate a useful
relationship between them for fault detection because the outdoor air temperature can not
directly affect and exclusively determine the fan power, which increases the uncertainties
in the model due to the 'transmission error' and leads to a 'loose' correlation or even no
applicable correlation at all. Fig. 4.1 illustrates the difference in the association between
the power input of a supply fan and two parameters, the outdoor air temperature and the
supply air flow rate. Compared to the plot of supply fan power vs. air flow rate under
constant static pressure control which shows a clear polynomial trend, the data clusters in
the supply fan power vs. outdoor air temperature present a much wider spread of power
data under the same outdoor air temperature, which is caused by the varying air flow rate
with the combined effects of the free-cooling switch, the delay in thermal response, and
the control deadband. Therefore, for fault detection with the fan power input, the supply
air flow rate is a better reference parameter than the outdoor air temperature and is
usually used as a major index in fan power prediction.
Accessibility is also an important consideration in practice, which actually
determines the feasibility of the detection method. For example, the damper position
determines the change of the resistance in the air loop and hence affects the fan power.
However, damper position is not easy to measure during operation. In a typical HVAC
system, usually more than one damper is used to control the air distribution, which
further complicates the measurement and the correlation.
The detectability of multiple faults with the model requires that the fitted function
with an appropriate confidence interval should be able to detect as many different faults
as possible. This is aimed to save the cost and minimize the intrusion into the system due
to the measurements of multiple parameters as well as alleviate the complexity in the
correlation itself. For example, even if the fan power vs. damper position function can be
used for identification of inappropriate damper position, it fails in recognizing other
significant faults related to the fan power, e.g., the static pressure sensor offset.
Therefore, the independent variable selected for fitting should be not only a
parameter of easy access with least intrusion into the system and minimum interruption to
its operation, but also a sensitive and comprehensive index for variations of power
consumption related to different causes.
[Figure 4.1: scatter plots of supply fan power (W) vs. (a) outdoor air temperature (°F) and (b) supply air flow rate (CFM).]
Figure 4.1. Illustration of the difference in the ability to predict the power input of a
supply fan between two different reference parameters: (a) outdoor air temperature and
(b) supply air flow rate. Data were sampled at an interval of 1 minute from a test
building during 7:30-19:00 when the fan was running.
4.2.3 Power functions of major HVAC components
Based on thermal and fluid laws, power consumption of fans, pumps, and chillers
of HVAC systems can usually be expressed as functions of control signals or
measurements, either as a step function when stepwise control is used for the equipment
or as a polynomial function when such continuous adjustments as PI control are involved.
By fitting the power data with appropriate parameters, power models can be established
for each major equipment, which can then be used for energy estimation and fault
detection with certain confidence intervals.
4.2.3.1 Equipment with continuous capacity control
The power model of such a component can be expressed by a bi-quadratic
function of related reference parameters [Braun et al, 1987],
P = a_0 + a_1 x + a_2 x^2 + a_3 y + a_4 y^2 + a_5 x y    (4.10)

where x and y are the measured reference parameters and a_0, ..., a_5 are the coefficients to
be fitted. The reference parameters are selected based on the criteria mentioned in Section
4.2.2. For example, in the power function of a VSD fan, x and y may represent air
volume flow rate and the pressure difference across the fan respectively.
With the submetered power data P_i (i = 1, 2, ..., n) and the measured parameters x_i
and y_i, the model can be written in matrix form,

\begin{pmatrix} P_1 \\ P_2 \\ \vdots \\ P_n \end{pmatrix} =
\begin{pmatrix} 1 & x_1 & x_1^2 & y_1 & y_1^2 & x_1 y_1 \\ 1 & x_2 & x_2^2 & y_2 & y_2^2 & x_2 y_2 \\ \vdots & & & & & \vdots \\ 1 & x_n & x_n^2 & y_n & y_n^2 & x_n y_n \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{pmatrix}    (4.11)

The \chi^2 merit function is then formed as

\chi^2 = \sum_{i=1}^{n} \left[ \frac{P_i - (a_0 + a_1 x_i + a_2 x_i^2 + a_3 y_i + a_4 y_i^2 + a_5 x_i y_i)}{\sigma_i} \right]^2    (4.12)
The fitting procedure of such a function is similar for all the power-consuming
equipment. At the component level, fan power is of key importance in detecting many
faults in air handling units. In this thesis, the fan power function has been thoroughly
studied with data from real buildings. The power functions of the other equipment and
the corresponding confidence intervals for detection can be similarly established.
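The bi-quadratic fit of Eqs. (4.10)-(4.11) can be sketched as below (Python with synthetic data; the variable names, coefficient values, and ranges are hypothetical, not measurements from the test building):

```python
import numpy as np

def fit_biquadratic(x, y, P):
    """Fit P = a0 + a1*x + a2*x^2 + a3*y + a4*y^2 + a5*x*y (Eq. 4.10)
    by linear least squares on the design matrix of Eq. 4.11."""
    X = np.column_stack([np.ones_like(x), x, x**2, y, y**2, x * y])
    a, *_ = np.linalg.lstsq(X, P, rcond=None)    # SVD-based solver
    return a

# synthetic VSD-fan example: x = air flow rate, y = pressure difference
rng = np.random.default_rng(1)
q = rng.uniform(1000.0, 2500.0, 300)             # air flow rate (CFM)
dp = rng.uniform(2.0, 4.0, 300)                  # pressure difference (in. w.g.)
true = np.array([50.0, 0.1, 2e-4, 30.0, 5.0, 1e-2])
Xd = np.column_stack([np.ones_like(q), q, q**2, dp, dp**2, q * dp])
P = Xd @ true + rng.normal(0.0, 2.0, q.size)     # submetered power + noise
a = fit_biquadratic(q, dp, P)
```

The same stacked-design-matrix pattern extends to any of the power-consuming equipment discussed here; only the choice of reference parameters changes.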
Previous studies have shown that fan power correlates well with some parameters.
For example, Lorenzetti and Norford [1992] have shown that the hourly average power
consumption of a VSD fan can be expressed as the function of the outdoor air dry bulb
temperature. Although such a correlation works well when the internal thermal load is
insignificant or remains constant all the time compared to the external load, it can not be
used to predict the fan power if the conditioned space is subject to considerable and
varying thermal disturbances other than the outdoor air temperature.
In air conditioning systems, all the thermal disturbances, from both inside and
outside, are finally accommodated by the amount and the discharge temperature of the
supply air. Therefore, air flow rate becomes the most comprehensive index to represent
fan power consumption when the discharge air temperature is fixed, which is typical of
control in HVAC systems.
From the characteristics of a fan, fan power can be determined by two of the three
variables: total pressure, air flow rate, and fan speed, as shown by the curves in Fig. 4.2
for a fan typically used in HVAC systems [ASHRAE Handbook, 1996].
[Figure 4.2: characteristic fan curves of shaft power, total pressure, and efficiency vs. volume flow rate (m³/s).]
Figure 4.2. Characteristic curves of a backward-tipped fan.
For a fan with a variable speed motor or a constant speed motor under inlet vane
control, the power input can be determined by the supply air flow rate, the total pressure
gain, and the total efficiency as a product of four efficiencies related to fan, fan-motor
coupling, motor, and the motor drive:
P_in = (total pressure gain × air flow rate) / (total efficiency)
The total pressure gain is the rise of the total pressure across the supply fan and is
equal to the total pressure loss in the air loop. The total pressure loss in the air flow path
consists of two parts: first, the pressure drop to overcome the resistance due to friction
and abrupt section changes at duct fittings and air processing components before the
static pressure sensor; and second, a constant pressure drop from the static pressure
setpoint to that in the occupied space through the terminal box. The static pressure
setpoint is intended to accommodate the air flow requirement to maintain the room
condition. In principle, the static pressure setpoint can be adjusted to save the energy
consumed by the low open level of the VAV dampers. As a result, the total pressure
becomes a variable which must be taken into account in the prediction of fan power
input.
In many systems, the static pressure setpoint is usually a fixed value for a given
system and hence the term for the pressure difference in the power function can be
dropped,
P = a_0 + a_1 x + a_2 x^2    (4.13)

For n pairs of measured data (x_i, P_i), the power function can be written in matrix form as

P = X \cdot a, \quad P = (P_1, P_2, \ldots, P_n)^T    (4.14)

with the \chi^2 error function

\chi^2 = \sum_{i=1}^{n} \left[ \frac{P_i - \sum_{j=1}^{3} a_{j-1} x_i^{\,j-1}}{\sigma_i} \right]^2    (4.15)

which can be solved with the singular value decomposition method introduced in
Section 4.2.1.
a). Fan power as a function of air flow rate
Based on the above fitting algorithm, the correlation between the fan power and
the air flow rate has been coded in the FORTRAN language and applied for the HVAC
systems in several buildings. Fig. 4.3(a) illustrates the fitting results with data selected
from the HVAC system of a test building. The close association between the fan power
and the air flow rate is demonstrated in Fig.4.3(b), which shows the profiles of the fitted
and the measured power with the corresponding air flow rate sampled at an interval of 1
minute during one normal day's operation.
[Figure 4.3: (a) fan power (W) vs. air flow rate (CFM), with the data points selected for fitting and the fitted curve y = 3.652e-4 x^2 - 2.462e-2 x - 2.627e-5, R^2 = 0.9887; (b) air flow rate and measured and fitted fan power vs. time, 7:30-19:30.]
Figure 4.3. Correlation between fan power and air flow rate with data sampled at an
interval of 1 minute from a test building. (a) fitted curve of fan power vs. air flow rate
and the data points selected for fitting; (b) air flow rate, measured fan power, and fitted
fan power vs. time of a summer day, demonstrating the association between fan power
and air flow rate.
b). Fan power as a function of the motor speed control signal
As discussed before, for a fan with a variable speed motor, the power input can
also be modeled as a function of the motor speed when constant static pressure control is
used. Figure 4.4 demonstrates the correlation between the fan power and the motor speed
control signal based on the same day's data as used in Fig. 4.3.
While air flow rate needs to be measured in most HVAC systems, fan speed is
usually readily available as the fan speed control signal. Therefore, cost due to the need
of submeters for measurements associated with the power vs. speed detection can be
reduced compared with the power vs. flow rate method.
[Figure 4.4: (a) fan power (W) vs. fan speed (%), with the data points selected for fitting and the fitted curve y = 0.812x^2 - 51.852x + 989.75, R^2 = 0.9802; (b) fan speed and measured and fitted fan power vs. time, 7:30-19:30.]
Figure 4.4. Correlation between fan power and fan motor speed based on the same day's
data used in Fig. 4.3. (a) fitted curve of fan power vs. motor speed control signal and
the data points selected for fitting; (b) fan speed, measured fan power, and fitted fan
power vs. time, demonstrating the association between the fan power and the fan motor
speed.
4.2.3.2 Equipment with constant power input
Such equipment is often driven by constant-speed motors to maintain a fixed
load or flow. For example, as shown in Fig. 4.5(a), power consumption of a constant
speed chilled-water pump with constant flow rate loop control is always around a fixed
value, i.e., the design capacity of the pump, regardless of the position of the valve stem.
Therefore, the power function can be established as a constant with an offset due to the
random errors in the measured data.
However, it should be noted that constant motor speed does not necessarily mean
constant power consumption. As mentioned before, power consumption of a fan or a
pump can be determined by any two of the three parameters: pressure rise, motor speed,
and flow rate of the medium.
This indicates that the power function of a constant-speed fan or pump should be
determined with reference to the loop structure and hence the flow rate delivered by
the equipment. Fig. 4.5 demonstrates the completely different relationships between the
pump power and the control signal for the valve position under different loop setups of
two air handling units in a test building, one for AHU-A and the other for AHU-1.
[Figure 4.5: (a), (c) pump power vs. valve position (%) for AHU-A and AHU-1 respectively; (b), (d) schematics of the corresponding water loops with primary and secondary loop pumps.]
Figure 4.5. Dependence of the power function of equipment with constant-speed motor
drives on the loop setup. Data were collected from two similar air handling units with
different water loop structures in a test building. Figures (a) and (c) show the profiles of
pump power vs. valve position with the water control loops in (b) for AHU-A and (d) for
AHU-1 respectively.
In the AHU-A tests, a three-way valve was used to regulate the water flow rate
through the cooling coil and maintain the constant total pressure drop across the loop and
hence the constant total water flow rate through the secondary loop pump, which resulted
in constant pump power consumption regardless of the control valve position. In the
AHU-1 tests, the water flow rate through the secondary loop pump was adjusted by a
two-way valve that was directly connected to the pump and hence the water flow rate, the
pressure drop across the valve, and the pump power varied with the valve position.
Therefore, the power functions must be established accordingly.
4.3 Error analysis - confidence intervals of the estimated models
Since errors in the sampled data and uncertainties of the correlation are always
unavoidable in practice, confidence analysis must be conducted for the reliability of a
model. A confidence interval is a region with certain confidence level which represents
the probability for the true values of a parameter to fall within this region about the
measured (or fitted) value. A confidence level represents a certain percentage of the total
probability distribution of the errors and is usually designated by the user. An analytical
form of the confidence interval for a model can be derived if the random errors in the
measurements follow some known statistical distribution [Rice, 1988]. Since noise in
electrical power generally follows a normal distribution [Shanmugan and Breiohl, 1988],
the power models for fault detection can be defined by the power functions with
statistical offset ranges to accommodate the errors caused by random disturbances. For
equipment with variable power consumption, the upper/lower limits of the intervals vary
with the load conditions while for equipment with constant power consumption, the
upper/lower limits are also constants.
4.3.1 Confidence intervals for equipment with variable power consumption
Deviations in such models are caused by the errors of measurement for the fitting
and the uncertainties of the correlation. For the fitting itself, a confidence interval for the
M fitted coefficients is a region of the M-dimensional parameter space that contains a given
percentage of the total probability distribution, represented by \Delta\chi^2, the deviation of
\chi^2 from its minimum at the fitted coefficients a^{(0)} corresponding to the data set for
fitting. This involves the examination of the M-dimensional space with reference to a certain
confidence level. Increasing the confidence level will enlarge \Delta\chi^2 and hence the
confidence interval. In this research,
power consumption as a function of some measurements is intended to be used as the
sole criterion for the least-intrusive fault detection. This means power input is used as the
index for the confidence limit to check the perturbation of power yo for any given
measurement xO, e.g., supply fan power for air flow rate. Therefore, for the dependent
variable y, the confidence interval for prediction must contain the uncertainties due to the
model fit and the measurement error as well [Little, 1991] [Little and Norford, 1993].
For a given measurement x, the dependent variable y and its estimate \hat{y} by the
fitted function can be expressed as

y = \sum_{j=1}^{M} a_j f_j(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2), \qquad E(y) = \sum_{j=1}^{M} a_j f_j(x), \qquad \hat{y} = \sum_{j=1}^{M} \hat{a}_j f_j(x)
Or in matrix form for the multiple measurements (x_i, y_i) (i = 1, 2, ..., n) used for
the fitting, y = A \cdot a + \varepsilon, where

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
A = \begin{pmatrix} f_1(x_1) & f_2(x_1) & \cdots & f_M(x_1) \\ f_1(x_2) & f_2(x_2) & \cdots & f_M(x_2) \\ \vdots & & & \vdots \\ f_1(x_n) & f_2(x_n) & \cdots & f_M(x_n) \end{pmatrix}, \quad
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{pmatrix}

For a large data sample of size n, the covariance of the fitted coefficients is

\mathrm{Cov}(\hat{a}) = \sigma^2 (A^T A)^{-1}
With the fitted function as obtained in Section 4.2.1, the distribution of the
estimated variable \hat{y}_0 for a new observation x_0 can be derived as

\hat{y}_0 \sim N\big( E(y_0), \; \sigma^2 f_0^T (A^T A)^{-1} f_0 \big)    (4.16)

where f_0 is the basis-function vector for x_0, f_0 = (f_1(x_0), f_2(x_0), \ldots, f_M(x_0))^T.
In addition, the distribution of the sum of squares Q of the residuals of the sample
can be represented by the \chi^2 function with n-M degrees of freedom,

\frac{Q}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \Big[ y_i - \sum_{j=1}^{M} \hat{a}_j f_j(x_i) \Big]^2 \sim \chi^2(n-M)    (4.17)

and the standard deviation s of the sampled data for fitting is s = \sqrt{Q/(n-M)}.
Since Q and \hat{y}_0 are independent of each other,

\frac{(\hat{y}_0 - E(y_0)) \big/ \big( \sigma \sqrt{f_0^T (A^T A)^{-1} f_0} \big)}{\sqrt{Q / (\sigma^2 (n-M))}} = \frac{\hat{y}_0 - E(y_0)}{s \sqrt{f_0^T (A^T A)^{-1} f_0}} \sim t(n-M)    (4.18)
which as shown follows the t-distribution with (n-M) degrees of freedom governed by the
density function
FpN+I)
p(t,N) =
N7g
where F(a) =
J x-I e-
2
(1+_-)
N
F(N)
2
N+1
2
,-oo<t<+oo
(4.19)
dx.
0
Therefore, the confidence interval for the expectation of yo with the double-sided
t-distribution at the confidence level of 1-a is
\hat{y}_0 \pm t_{\alpha/2}(n-M) \cdot s \cdot \sqrt{f_0^T (A^T A)^{-1} f_0}    (4.20)
which may also be expressed in terms of the probability for the expectation to fall within
the given confidence interval,
P\left\{ \frac{|\hat{y}_0 - E(y_0)|}{s \sqrt{f_0^T (A^T A)^{-1} f_0}} < t_{\alpha/2}(n-M) \right\} = 1 - \alpha    (4.21)

Since y_0, \hat{y}_0, and Q are all independent of each other and

E(y_0 - \hat{y}_0) = E(y_0) - E(\hat{y}_0) = 0, \qquad D(y_0 - \hat{y}_0) = D(y_0) + D(\hat{y}_0) = \sigma^2 \big( 1 + f_0^T (A^T A)^{-1} f_0 \big),

the
distribution related to the predicted value yo can be derived as
\frac{y_0 - \hat{y}_0}{\sigma \sqrt{1 + f_0^T (A^T A)^{-1} f_0}} \sim N(0, 1)    (4.22)
With the estimated standard deviation s, the t-distribution of the prediction can be
obtained as
\frac{y_0 - \hat{y}_0}{s \sqrt{1 + f_0^T (A^T A)^{-1} f_0}} \sim t(n-M)    (4.23)
Therefore, the confidence interval for the predicted yo with the double-sided
t-distribution at the confidence level of 1-a is
\hat{y}_0 \pm t_{\alpha/2}(n-M) \cdot s \cdot \sqrt{1 + f_0^T (A^T A)^{-1} f_0}    (4.24)
which can also be written as the probability for a given value yo to fall in the confidence
interval,
P\left\{ \frac{|y_0 - \hat{y}_0|}{s \sqrt{1 + f_0^T (A^T A)^{-1} f_0}} < t_{\alpha/2}(n-M) \right\} = 1 - \alpha    (4.25)
It should be noted that the confidence interval is affected not only by the
confidence level (1-\alpha), the degrees of freedom of the t-distribution (n-M), the sample
standard deviation s, and the sample covariance s^2 (A^T A)^{-1}, but also by the
basis-function vector f_0, which indicates that the predicted range of the power consumption
varies with the independent variables. Eq. (4.24) shows that the confidence interval increases
with the confidence level, which means that for higher confidence about the fitting, the data
should be expected within a wider range about the predicted values. The interval also
increases with the estimated error s in the sample for fitting. If the selected data scatter
loosely around the fitted average, i.e., a large standard deviation of the sample, then for a
given confidence level (1-\alpha), the confidence interval must be wider to accommodate the
error in the fitted data themselves plus the deviation \Delta\chi^2 from the minimized \chi^2
with the new measurement. In addition, since the value of the t-quantile decreases with the
degrees of freedom (n-M), the predicted range becomes narrower with more data pairs involved
in the fitting of a polynomial function.
For the least squares correlation, a small interval with a high confidence level
usually indicates a good fit of the real process. For fault detection, such a correlation
provides sensitive yet reliable identification of abnormalities.
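The prediction interval of Eq. (4.24) and the resulting fault check can be sketched as follows (Python/NumPy; for the large samples typical of a day of 1-minute data, the t quantile is approximated by the normal quantile, an assumption that slightly narrows the band for small n - M):

```python
import numpy as np
from statistics import NormalDist

def prediction_band(X, y, f0, alpha=0.10):
    """Detection limits of Eq. (4.24) for a new observation with basis
    vector f0 = (f_1(x0), ..., f_M(x0)).  X is the n x M design matrix of
    the training data and y the submetered power used for the fit.  The
    t quantile is approximated by the normal quantile (large n - M)."""
    n, M = X.shape
    a, *_ = np.linalg.lstsq(X, y, rcond=None)    # fitted coefficients
    resid = y - X @ a
    s = np.sqrt(resid @ resid / (n - M))         # s = sqrt(Q/(n-M)), Eq. (4.17)
    f0 = np.asarray(f0)
    g = f0 @ np.linalg.inv(X.T @ X) @ f0         # f0^T (A^T A)^-1 f0
    half = NormalDist().inv_cdf(1 - alpha / 2) * s * np.sqrt(1.0 + g)
    y0_hat = f0 @ a
    return y0_hat - half, y0_hat + half

def is_faulty(P_measured, X, y, f0, alpha=0.10):
    """Flag a fault when the submetered power falls outside the band."""
    lo, hi = prediction_band(X, y, f0, alpha)
    return not (lo <= P_measured <= hi)
```

In use, X holds the training-day basis values (e.g. 1, x, x² of the air flow rate) and f0 the same basis evaluated at the new measurement.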
4.3.2 Confidence intervals for equipment with constant power consumption
If power consumption of the equipment is a constant, the confidence interval for
fault detection can be determined as a constant offset from the design value based on the
probability function. At the given confidence level of 1-a, the upper/lower limits can be
expressed as
y = y_0 \pm \sigma \cdot Z_{\alpha/2}    (4.26)

where y_0 is the design value of the power input for the equipment and Z_{\alpha/2} is the
two-sided quantile of the standard normal distribution. The standard deviation \sigma can be
approximated with the sample standard deviation s when the sample is large enough
(more than 25-30 points) [Rice 1988],
which can easily be met with the training data usually collected during one normal day's
operation.
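A minimal sketch of the constant-power thresholds of Eq. (4.26), assuming a Python implementation and using the sample standard deviation in place of \sigma:

```python
from statistics import NormalDist, stdev

def constant_power_limits(design_value, sample, alpha=0.10):
    """Upper/lower detection thresholds y0 +/- Z_{alpha/2} * s of Eq. (4.26).
    The sample standard deviation s stands in for sigma, which is
    acceptable for samples of more than roughly 25-30 points."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided normal quantile
    s = stdev(sample)
    return design_value - z * s, design_value + z * s
```

At a 90% confidence level the quantile is about 1.645, so the limits sit symmetrically 1.645 s above and below the design value.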
4.4 Fault detection and diagnosis of HVAC systems with submetered power input
In addition to the on/off switch detection and data trend monitoring of the total
power data of a system as introduced in the previous chapters, modeling of the power
consumption of electrically-driven equipment has been proved to be another efficient
method in this study for fault detection and diagnosis in HVAC systems. With the
submetered power data of specific equipment, fault detection and diagnosis can be
conducted at the component level and hence the FDD output becomes more accurate than
that based on the total power data as presented in Chapter 3. In addition, unlike the
definition in the abrupt change detection given in Chapter 2, steady state for the
submetered power modeling means the process throughout the operation of the
equipment except the short on/off transient periods. This indicates that the detection and
diagnosis can be implemented when the equipment is in operation and are not limited to
the short time period when the equipment is turned on or off. Therefore, detection and
diagnosis with submetered power data makes it possible to find a fault at the early stage
and hence prevent further energy loss or more serious outcome before it can be seen from
the total power data by the change detector. Moreover, monitoring based on submetered
data enables the detection of faults in equipment that runs around the clock, e.g., the
exhaust fan serving a space with toxic emissions.
In addition to detection of a fault, diagnosis of the fault origin can be achieved
with higher resolution by the submetered power models than that based on the system's
total power data. This is because as the function of certain reference parameters, the
submetered model may present different patterns of abnormalities for various causes of
the observed abnormality, which is the basis of fault diagnosis.
With the submetered measurements of power and the related parameters,
detection and diagnosis of faults in HVAC systems can be divided into two general
categories according to the types of the power functions as discussed in Sec.4.2, i.e.,
power models for equipment with variable and constant power inputs.
4.4.1 FDD for equipment with continuously varying power input
From the previous sections, power consumption of such components can be
modeled as polynomial functions of basic measurements or control signals. However, as
shown in Fig. 4.1(b), even under normal operation, the measured power data usually
scatter around the predicted values within a certain range. This indicates that for reliable
detection output, some offset range must be appropriately defined to reduce the rate of
false alarm caused by deviation from the predicted value due to random errors. Such a
range can be established by the confidence interval described in Sec. 4.2. The confidence
interval takes into account the error of the model fit itself as well as that of the sampled
data for fitting. The FDD models with confidence intervals corresponding to the power
functions plotted in Fig. 4.3(a) and Fig. 4.4(a) are illustrated by Fig. 4.6(a) and (b)
respectively. Both are based on an estimated standard deviation of 5% of the average
power value and a confidence level of 90%, which is commonly used in fault detection in
non-critical applications [Huber, 1981].
[Figure 4.6: fan power (W) with 90%-confidence bands vs. (a) air flow rate (CFM) and (b) fan speed control signal (%).]
Figure 4.6. Illustration of the power models of a VSD fan with a 90%-confidence interval
for fault detection and diagnosis of the air handling unit in a test building. (a) the model
of fan power vs. air flow rate; (b) the model of fan power vs. motor speed control signal.
In fault detection, if the submetered power data exceed the confidence interval at
a given flow rate, then there is a probability of 1-a that the operation is under a faulty
condition. In this thesis, faults related to the power input of a VSD fan in a test building
have been studied based on the above algorithms and several typical cases are discussed
as follows.
4.4.2 FDD for equipment with constant power consumption
Abnormal behaviors in such equipment can usually be seen from the magnitude of
the power input and the FDD can be implemented by checking the measured power value
by the submeter against a threshold, which should be based on an offset from the design
value. The design value is generally the design power input of the equipment and the
offset can be obtained as a confidence interval introduced in Sec. 4.3.2. For example,
based on a confidence level of 90%, the upper and lower limits were found to be 423 and
387 watts respectively for the power input of a chilled water pump with a design value of
405 watts in a test building, as shown in Fig. 4.7.
[Figure 4.7: pump power data (W) vs. valve position (%) with the constant upper and lower limits about the design value.]
Figure 4.7. Illustration of the power thresholds for a chilled water pump with constant
power consumption at different valve open levels for fault detection and diagnosis of the
HVAC system in a test building.
4.4.3 FDD from the on/off cycling of equipment with constant power consumption
Sometimes, abnormalities associated with equipment consuming constant
electrical power can be found from its on/off cycling frequency. First, to prevent energy
inefficiency and degradation of some equipment due to frequent on/off switching, limits
are usually set by design in the form of a maximum number of daily cycles and/or a
minimum interval between on and off. In HVAC systems, such a control strategy is often
seen in chiller operation. Second, equipment should cycle on and off in accordance with
the load conditions. If a device is turned on more frequently under low load than under
significantly higher load in normal operation, then the system is not acting properly. For
example, a chiller should normally run less frequently during the early morning hours
than in the late afternoon. With the submetered power data, such on/off cycles are easily
counted, as shown in Fig. 4.8.
[Figure: chiller power (W) vs. time (min.), showing regular on/off cycles]
Figure 4.8. Constant on/off power cycling period of a reciprocating chiller under low load
conditions during the early morning hours (0:00-7:40) for fault detection and diagnosis
of the HVAC system in a test building.
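Counting on/off cycles from a submetered series amounts to finding threshold crossings and differencing the switch-on times. A minimal sketch, assuming a 1-minute sampling interval and an illustrative on/off power threshold (all names are hypothetical):

```python
def off_on_intervals(power, on_threshold=500.0, dt_min=1.0):
    """Off->on intervals (minutes) of a device from its submetered power
    series, using a simple on/off threshold (values are illustrative)."""
    on = [p > on_threshold for p in power]
    # indices where the device switches from off to on
    starts = [i for i in range(1, len(on)) if on[i] and not on[i - 1]]
    return [(b - a) * dt_min for a, b in zip(starts, starts[1:])]

# Synthetic 1-minute data: chiller runs 10 min, then rests 28 min (38-min cycle)
chiller_power = ([5000.0] * 10 + [0.0] * 28) * 4
```

On this synthetic series every off-to-on interval comes out as 38 minutes, matching the cycle length.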
4.4.4 Oscillation detection by the standard deviation of the submetered power data
In principle, detection of oscillation from the submetered power data is similar
to detection with the total power data introduced in Chapter 3, and hence the same method
can be used here. In addition, since the equipment is directly and solely monitored by the
submeter, detection of oscillation is easier to implement due to the reduced noise, and
diagnosis of the fault becomes more straightforward. Fig. 4.9 demonstrates the effects of
oscillation in the submetered power series.
[Figure: two panels (a) and (b), supply fan power (W) vs. time (min., 0-1440)]
Figure 4.9. Submetered power data of a supply fan during two different days,
demonstrating the significant increase of the data spread caused by power oscillation.
Plots (a) and (b) are the 24-hour fan power with and without oscillation respectively.
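The data spread visible in Fig. 4.9 can be quantified with a sliding-window standard deviation of the power series. A stdlib-only sketch with an illustrative window length; the series here are synthetic, not thesis data:

```python
import math
import statistics

def sliding_std(power, window=30):
    """Standard deviation of the power series over a sliding window; a large,
    sustained value exposes oscillation (window length is illustrative)."""
    return [statistics.pstdev(power[i:i + window])
            for i in range(len(power) - window + 1)]

# Synthetic comparison: a steady fan vs. one oscillating with ~300 W amplitude
steady = [1500.0] * 60
oscillating = [1500.0 + 300.0 * math.sin(i / 3.0) for i in range(60)]
```

The steady series yields a sliding standard deviation of zero, while the oscillating one stays far above it over every window.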
4.5 Application of the power models in fault detection with submetered power input
The power models for fault detection and diagnosis described in the previous
sections have been applied to HVAC systems in real buildings. The performance of the
application is discussed in this section based on the detection and diagnosis output of
several typical faults that have been introduced into the air handling units of the same
system as the one used in Section 3.6.
4.5.1 Model calibration and parameter identification
To set up the power functions for equipment with varying power input, such as a
supply fan driven by a variable speed motor, the data should be ideally sampled over a
wide range of load, from the minimum to the maximum possible load conditions. But this
may require a fairly long time to obtain such a range of data and is not necessary in
practice, because the model is obtained based on physical laws and hence can be
reasonably extrapolated to higher or lower load conditions. This may result in small
errors for some regions due to lack of test data but will not have significant effects on the
detection quality. Tests with several air handling units showed that without the fan power
data under low air flow rates (around 500~1000 CFM) during the calibration period, the
later collected power data in this area were slightly higher than the fitted curve, but still
within the confidence interval and much closer to the fitted curve than to the upper limit
of the interval. Moreover, during the normal tests, the fan power input for the thermal
load varying with the outside weather changed over a desirable range for the fitting. In
the test building, the time required for data collection for the fan power function by
submeter was less than the 10 hours that cover the work day. This time length also
suffices for estimating the fan power noise. In the test building, the parameters of the
power function were obtained by the least-squares fitting introduced in Section 4.2. To
reduce the offset caused by random fluctuations, the data for the fitting
were selected when the static pressure signal deviated from the setpoint by less than 5%
of the setpoint. For each supply fan, about 30 data pairs representing different load
conditions were collected 60 minutes after the startup and 30 minutes before the
shutdown in a day.
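The calibration step described above — a least-squares polynomial fit plus a confidence half-width derived from the residual noise — can be sketched as follows. The quadratic form, the synthetic data, and all names are illustrative assumptions; the thesis fits its own power function by least squares as described in Section 4.2.

```python
import statistics

def fit_quadratic(x, y):
    """Least-squares fit y ~ c0 + c1*x + c2*x^2 via the normal equations,
    a stdlib-only stand-in for the least-squares fitting of Section 4.2."""
    m = 3
    a = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]
    for col in range(m):                      # Gaussian elimination w/ pivoting
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):            # back substitution
        coef[r] = (b[r] - sum(a[r][c] * coef[c]
                              for c in range(r + 1, m))) / a[r][r]
    return coef

def confidence_half_width(x, y, coef, z=1.645):
    """90% confidence half-width from the residual noise, assuming the
    electrical power noise is normally distributed (Sec. 4.3.2)."""
    resid = [yi - (coef[0] + coef[1] * xi + coef[2] * xi * xi)
             for xi, yi in zip(x, y)]
    return z * statistics.pstdev(resid)

# Illustrative calibration pairs: air flow in thousands of CFM, fan power in W
flow = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
power = [400.0 + 300.0 * q + 120.0 * q * q for q in flow]
coef = fit_quadratic(flow, power)
```

With ~30 real data pairs, the fitted curve plus the half-width reproduces the confidence band used for detection.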
For equipment with constant power consumption, the models can be obtained
more easily. For example, the power consumption of the chilled water pump shown in
Fig. 4.5(c) is close to a constant when the cooling capacity of the coil is adjusted by a
bypass valve and hence the water flow rate through the pump is unchanged. With such an
arrangement, the power function is simplified as a threshold for the detection, which can
be set up as the average of the measured values. To minimize the rate of false alarm, an
offset value based on a given probability should be established to allow random
deviations. The random errors are independent of the load condition and should be within
an expected range with a given confidence level for the normal distribution of the
electrical power noise. In the test building, a 90%-confidence interval of ± 18 W was set
up about the design power input of 405 W for a chilled water pump as shown in Fig. 4.7.
For the evaluation of the on/off cycling frequency, training of the limits may
require several days of observation under low load conditions, normally during off-work
time periods. In the test system, the threshold for the chiller's on/off cycling interval was
found during early morning hours when the cooling coil valve was closed under normal
operation.
For the estimation of the power noise, thresholds for the standard deviation and
time duration can be set up with the 10-hour data sampled at an interval of 1 minute.
One important index to evaluate the operating status is the outside air
temperature. For example, in detection of the fault of leaky recirculation damper based on
the condition introduced in Chapter 3, the outdoor air temperature needs to be recorded
and the thresholds for the dimensionless outdoor air temperature and the chiller cycling
interval should be determined when the outside air damper is 100% open in the
economizer mode.
It should be noted that all the data for the detection with submeters can be
obtained simultaneously because no change in the system needs to be made for model
training.
4.5.2 Detection and diagnosis of typical HVAC faults with submetered power models
Faults in HVAC systems tend to cause abnormal power input to at least one
electrically-driven component. By comparing the measured power data against the values
obtained from the power function and the associated confidence intervals, the abnormal
equipment operation can be found and analyzed accordingly.
a). Detection and diagnosis of faulty operation by the power models of equipment
with variable power consumption
According to the load condition, power consumption of equipment is adjusted by
using a motor with VSD control or by changing the flow rate of the medium delivered
with a CSD motor. In modern HVAC systems, VSD motors are becoming more widely
used due to the low operating cost, especially in commercial buildings [Englander and
Norford, 1990]. In the test building introduced in Chapter 3, all the supply and return fans
are driven by VSD motors. As discussed in Section 4.2.3, power consumption of a VSD
fan is determined by two of the three factors: total pressure loss, air flow rate, and motor
speed. While the air flow rate depends on the thermal load condition, the total pressure
gain across the fan is composed of two parts, the loss due to the resistance and the static
pressure setpoint which is the index for fan speed control. Hence faults that cause
deviations in either part will lead to abnormal fan power input and can be recognized
from the power models described in Section 4.4.
Fig. 4.10 compares the detection output of normal operation against two typical
faults. One is a leak in the static pressure sensor's pneumatic line which results in an
offset in the total pressure and the other is a stuck-closed recirculation air damper which
increases the resistance in the air loop. It can be seen that the power data in both faulty
operations lie well above the upper limit of the confidence interval.
In the case of the pressure sensor offset, the static pressure signal fed back to the
fan controller by the leaky sensor was always lower than the actual value. To maintain
the fixed static pressure setpoint, the supply fan had to work harder and consumed more
energy, which led to the higher power input at given air flow rates as shown in Fig.
4.10(b).
A pressure sensor offset can be negative (reported pressure lower than the actual
value) or positive (reported pressure higher than the actual value) and therefore leads to
increased or decreased fan power input. For a pneumatic pressure sensor as used in the
test building, an offset is often caused by a leak in the transmission line, which produces
an increase in fan power.
Under normal operation in the given air handling unit, the recirculation air
damper was supposed to be 70% open in accordance with the 30% open outdoor air
damper for the minimum fresh air supply. When the recirculation air damper was stuck
closed, only a small amount of the return air was drawn through the leaks in the
recirculation air damper, while the outdoor air damper was still fixed at the 30% open
position by the control system based on the outdoor air temperature. As a result, the total
flow rate of the supply air was greatly reduced and the supply fan power increased
significantly to overcome the extra resistance introduced by the closed damper, as shown
in Fig. 4.10(c).
These two faults can be best distinguished from each other by checking the air
flow rate after the working hours (between 17:01-22:00 in this case). Since the building
was not occupied during this time period, regardless of the outdoor air temperature, 100%
return air was recirculated in the building through the fully open recirculation air damper
in order to save energy for cooling and the outdoor air damper was shut off. If the
recirculation air damper was closed by fault, the air flow rate, which then could only go
through the dampers by leakage, would be extremely low (less than 500 CFM in this
case) and could never happen under normal condition.
The rule for detection of the stuck-closed recirculation air damper can be
expressed as:
IF ( power > higher limit of the confidence interval for the fitted power value
AND flow rate < threshold after working hours with fan on) THEN alarm.
The threshold for the airflow rate after working hours can be observed from one
day's normal operation.
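The stuck-closed damper rule above can be read as a direct boolean test. An illustrative sketch, using the roughly 500 CFM after-hours flow value from this section as the default threshold (interface names are assumptions):

```python
def recirc_damper_stuck_closed(power_w, upper_limit_w, flow_cfm, after_hours,
                               flow_threshold_cfm=500.0):
    """Stuck-closed recirculation damper rule: fan power above the upper
    confidence limit AND an after-hours (fan on) air flow below the trained
    threshold (about 500 CFM in the test building)."""
    return (power_w > upper_limit_w
            and after_hours
            and flow_cfm < flow_threshold_cfm)
```

Both conditions must hold: the excess power alone would also match a pressure sensor offset, so the low after-hours flow is what separates the two faults.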
[Figure: three panels (a)-(c), supply fan power (W) vs. air flow rate (CFM), each showing the power data, fitted curve, and lower and upper limits of the confidence interval]
Figure 4.10. Detection output for operation under normal and faulty conditions by the
power consumption of a supply fan as a function of air flow rate in an air handling unit.
(a) under normal operation; (b) with an offset in the static pressure sensor; (c) with a
stuck-closed recirculation air damper.
In addition to the fan power vs. airflow rate correlation for fault detection in the
air flow path and the static pressure control, in this thesis, fan power has also been proved
useful in finding abnormal status of the fan itself with reference to the motor speed. As
shown in Fig. 4.6(b), fan power can be modeled as a polynomial function of the motor
speed with a confidence interval to allow random disturbances. Power data that fall
beyond the interval will be regarded as a fault. For example, as a typical degradation
problem in HVAC applications, the fault of a slipping fan belt can be seen clearly at high
loads through the FDD model. The initial approach combines detection and diagnosis in a
single step by searching for abnormally low fan power at a very high motor speed control
signal. This is demonstrated by the data cluster in Fig. 4.11, with the most significant
lower power consumption occurring when the fan was running at full speed. The
dependence of the detectability of such a fault on fan speed has been further proved by
detection conducted in other tests with the same air handling unit.
In the tests, the slipping fan belt was introduced by adjusting the tension of the fan
belt to reduce the maximum fan speed by 15% at 100% control signal for the first stage
and 20% for the second stage. With the tension reduced by more than 15%, the fan had to
work at high speed to meet the load requirement during the summer. In practice,
however, as a degradation fault, belt slippage tends to develop gradually day by day.
Therefore, with the power vs. speed function, the detection of this fault can be
conducted at any speed and is not limited to full speed. The rule is:
IF (power < lower limit of the confidence interval for the fitted power value AND
time duration > time threshold) THEN alarm.
The minimum time period for determination of the fault should be longer than one
sampling interval. With a sampling interval of 1 minute by a submeter in the tests, the
threshold for time duration of this fault should not be shorter than 2 minutes. This is
intended to eliminate false alarms which may occur when one or two points accidentally
drop to this level without a slipping fan belt. In the test building, 3 minutes have been
proved appropriate for reliable FDD output.
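The slipping-belt rule adds a persistence requirement to suppress single-sample false alarms. A sketch assuming one sample per minute and the 3-minute duration found appropriate in the test building; the interface (a per-sample lower limit sequence) is illustrative:

```python
def slipping_belt_alarms(power, lower_limit, min_duration=3):
    """Alarm indices for the slipping-belt rule: power must stay below the
    fitted lower confidence limit for at least min_duration consecutive
    samples (3 one-minute samples in the test building)."""
    run = 0
    alarms = []
    for i, (p, lo) in enumerate(zip(power, lower_limit)):
        run = run + 1 if p < lo else 0   # count consecutive sub-limit samples
        if run >= min_duration:
            alarms.append(i)
    return alarms
```

A single stray sample below the limit resets nothing downstream but never alarms on its own, which is the point of the duration threshold.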
[Figure: two panels (a) and (b), fan power (W) vs. fan speed control signal (%), showing the power data, fitted curve, and lower limit of the confidence interval]
Figure 4.11. Utilization of the power vs. motor speed model of a VSD fan with a
90%-confidence interval for fault detection and diagnosis of the air handling unit in a test
building, showing the reduction in the fan power input due to a slipping fan belt. (a)
under normal operation; (b) with a slipping belt.
As discussed in Section 4.2.3.2, power consumption of a component driven by a
CSD motor varies with flow rate that is adjusted by a flow control device. In the test
building, a pump was used with a loop setup as shown in Fig. 4.5(d) in an air handling
unit, in which the water flow rate delivered by the pump changes with the valve position.
The pump power can then be correlated as a polynomial function of the valve control
signal, as demonstrated by Fig. 4.12 with the 90%-confidence interval for fault detection
and diagnosis.
[Figure: pump power (W) vs. valve position (%), showing the power data with the lower and upper limits]
Figure 4.12. Power consumption of a chilled-water pump with a CSD motor as a
polynomial function of the valve control signal, showing that the pump power input
varies with the water flow rate even though the motor speed remains constant.
b). Detection and diagnosis of faulty operation by the magnitude of power input of
equipment
With the submetered data, the magnitude of power consumption can be used to
detect faults in equipment with constant power input. For such components, abnormal
operation is recognized with appropriate thresholds as discussed in Sec. 4.4.2. In the test
building, for example, the decreased capacity fault of a cooling coil due to reduction of
the water flow rate leads to reduced power consumption of the pump. For a pump driven
by a CSD motor, such a fault can be easily detected by the model illustrated by Fig. 4.7.
Fig. 4.13 shows the detection output with one day's data from the test building.
It can be seen from the figure that the deviation of the measured pump power
from the designated range increases with the open level of the valve which is in response
to increased requirement for water flow rate due to the higher thermal load. Therefore,
the rule for detecting and diagnosing this fault from the submetered pump power is:
IF ((valve open level > valve threshold) and ( Ipump power-design value I> threshold of
power offset)) THEN alarm.
For practical application, the design value for the power input of the equipment
can be established as a measured value, or, more accurately, as the average of several
measured values. The threshold for the pump power offset is determined by Eq.(4.26).
The valve threshold is introduced in this rule because the fault is difficult to find when
the valve is only partly open, since the added resistance raises the power input.
Electrical spikes have also been found when the valve is nearly closed. It has been
observed that to detect the capacity fault, the valve should be at least 30% open. The
valve threshold may also be trained if data for training are available. For the test building,
the design capacity, the valve threshold, and the upper and lower limits for the pump
power were 405 watts, 40%, and 423 and 387 watts respectively.
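The coil-capacity rule with the test-building values (40% valve threshold, 405 W design power, ±18 W offset) can be sketched as follows; the function name and defaults are illustrative:

```python
def coil_capacity_fault(pump_power_w, valve_open_pct, design_w=405.0,
                        offset_w=18.0, valve_threshold_pct=40.0):
    """Rule of Sec. 4.5.2(b): alarm only when the valve is sufficiently open
    and the pump power deviates from the design value by more than the
    trained offset (test-building values as defaults)."""
    return (valve_open_pct > valve_threshold_pct
            and abs(pump_power_w - design_w) > offset_w)
```

The valve gate suppresses the regions where a partly open valve or electrical spikes would otherwise mask or mimic the fault.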
[Figure: pump power (W) vs. valve position (%), showing the pump power against the lower and upper limits]
Figure 4.13. Detection of the coil-capacity fault from the reduction in the power input
of a chilled-water pump driven by a CSD motor. The fault became detectable when the
valve was wide open.
In principle, such a fault would also result in lower chiller power input. In
practice, however, it is difficult to find this fault from the chiller power input: the effect
of reduced flow is visible in the pump power under high load conditions, when the
chiller cycling presents a more irregular and hence more complex pattern for
recognition.
Sometimes, faults in equipment with a VSD motor may also be detected with a
constant power threshold. Such detection is generally based on understanding of the
equipment from the design information and is especially useful when the submetered
power data are the only available reference for fault detection. For example, under a
given load condition, the fault of a slipping fan belt leads to significantly lower power
input of the fan compared to that in normal operation. Such a fault can be easily
recognized from the submetered power data with a conservative constant threshold
obtained from one typical day's operation. In the test building, a value of 1 kW was used
as the lower limit for the power input of a supply fan with a design capacity of 5 kW.
During the peak load time period between 2-4pm in summer, the power input of the fan
should never be lower than 1 kW unless a fault like a slipping fan belt occurs, as
shown in Fig. 4.14. The relatively flat and low data trend during the faulty time was due
to the continuous 100% fan speed caused by the slipping belt when the fan strove to meet
the load requirement, which could be met at lower fan speeds during normal operation.
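The constant-threshold check for the VSD fan during the 2-4 pm peak can be sketched as below. The 1 kW limit is the test-building value from this paragraph; the time-window interface is an assumption for illustration:

```python
def belt_fault_peak_check(power_w, minute_of_day, low_limit_w=1000.0,
                          window=(14 * 60, 16 * 60)):
    """Conservative constant-threshold check: during the 2-4 pm peak the
    supply fan should never draw less than ~1 kW, so a lower reading in that
    window suggests a fault such as a slipping belt."""
    return window[0] <= minute_of_day <= window[1] and power_w < low_limit_w
```

Outside the peak window the check stays silent, since low fan power is then a normal consequence of low load.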
[Figure: fan power (W) vs. time (min., 360-480), comparing normal operation, the slipping fan belt case, and the 1-kW threshold]
Figure 4.14. Power consumption of a supply fan during 2:00-4:00 pm of two days, under
normal conditions and with a slipping fan belt respectively, showing the detectability of
faults with a constant power threshold for equipment with a VSD motor drive.
c). Detection and diagnosis of faulty operation from on/off cycling of equipment with
constant power consumption
As discussed in Chapter 3, faulty operation of some components can be detected
by analyzing the on/off cycles of the related equipment. Such detection is generally
applicable under stable low load conditions, not only because the on/off interval becomes
longer and more visible to the detector but also due to the fact that such load conditions
provide a reliable basis for the detection that requires periods with a constant on/off
cycling frequency of the equipment. In the test building, for example, cycling period of a
reciprocating chiller during the early morning hours when the cooling coil valve was
closed has been analyzed to detect faults related to the thermal load of the chiller, e.g., the
leaky cooling coil valve and the leaky recirculation air damper. Based on the same
criteria used in Section 3.9.2(b), the identification of such a fault becomes easier by
counting the on/off cycles with the submeter than by detection of changes in the total
power data, as shown in Fig. 4.15. The rule for detection of this fault is:
IF (averaged chiller off-on interval < threshold for chiller off-on interval AND
valve position = 0 ) THEN alarm.
The threshold for the chiller off-on interval in this rule can be observed from the 7-hour
normal operation of the chiller in the early morning. In the test building, this value
was found to be 35 minutes.
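The chiller off-on interval rule with the 35-minute test-building threshold can be sketched as follows; the interface (a list of measured intervals) is an assumption:

```python
def leaky_valve_alarm(off_on_intervals_min, valve_position_pct,
                      interval_threshold_min=35.0):
    """Rule of Sec. 4.5.2(c): with the coil valve commanded closed, alarm
    when the averaged chiller off->on interval drops below the trained
    threshold (35 minutes in the test building)."""
    if valve_position_pct != 0 or not off_on_intervals_min:
        return False
    mean = sum(off_on_intervals_min) / len(off_on_intervals_min)
    return mean < interval_threshold_min
```

The valve-position guard is essential: chiller cycling speeds up for many legitimate reasons once the valve opens.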
[Figure: two panels (a) and (b), chiller power (W) vs. time (min.), showing on/off cycles]
Figure 4.15. Submetered power data of the chiller under low load conditions during
0:00-7:00 am, showing a decrease in the off-on interval of the chiller in the presence of a leaky
cooling coil valve. (a) under normal operation: chiller turned on at an interval of 38
minutes; (b) with a leaky cooling coil valve: chiller turned on at an interval of 30 minutes.
Another fault related to chiller cycling is the leaky recirculation air damper. As
explained before, abnormal chiller cycling can be detected under low cooling loads.
Similar to that of the leaky cooling coil valve, the deviation in chiller cycling due to the
leaky recirculation air damper is most significant if it is closed by the control command
when the outdoor air temperature falls between the supply and the return air temperature
(or the economizer) setpoints. Fig. 4.16 shows the common control sequence of dampers
and valves for HVAC systems with one day's data from the test building as an example.
[Figure: panel (a), damper/valve sequencing vs. outdoor air temperature with setpoints Tspt.sa and Tspt.ia; panel (b), R.A. damper position, O.A. temperature, S.A. setpoint, and economizer setpoint vs. time]
Figure 4.16. Control of dampers and valves based on outdoor air temperature and the
related setpoints. (a) typical sequencing for air handling; (b) dependence of the
recirculation air damper position on outdoor air temperature in the test building during
the working hours of a day. The setpoints of the supply air temperature and the economizer
are 55°F and 65°F respectively.
Furthermore, it has been found that the effect on chiller cycles is visible to the
detector only when the outside temperature is slightly above the point where the
supply-air temperature controller begins to open the chilled-water valve in order to
maintain the supply-air temperature at its setpoint. By limiting the examination of the
cycling frequency to this narrow region, as defined in Section 3.9.2(b), the risk of false
alarms due to normal changes in cycling rate at higher outdoor air temperatures, when
the cooling load increases, is reduced.
With the same rule defined in Sec.3.9.2(b), the fault of the leaky recirculation air
damper has been detected, as shown in Fig. 4.17.
[Figure: two panels (a) and (b), chiller power (W) vs. time (min., 420-1140), showing on/off cycles]
Figure 4.17. Submetered power data of the chiller under low load conditions when the
outdoor air temperature lies within the defined region, showing the increase in on/off
cycles of the chiller in the presence of a leaky recirculation damper. (a) under normal
operation: chiller turned on at intervals between 30-31 minutes; (b) with a leaky
recirculation air damper: chiller turned on at intervals between 14-16 minutes.
Note that the leaky valve fault is detected only when the valve control signal is
zero, so the two faults that make use of the chiller cycling can be separated.
d). Detection and diagnosis of unstable control from the submetered power data
As an abrupt fault, unstable control leaves a clear oscillatory power signature,
which can be detected with submetered power data by quantifying the magnitude of the
oscillations. To distinguish sustained power oscillations from the impact of start-up or
shut-down events on the standard deviation of signals, as computed over a sliding
window, the GLR output should be simultaneously monitored to avoid mixed alarms. If
the sufficient statistic is below its threshold and the power oscillations are high, an alarm
for unstable control will be issued. The fault detection rule with the submetered power is
similar to that with the total power data presented in Chapter 3.
IF ( standard deviation > power * fraction threshold AND sufficient statistic <
lower limit of the GLR threshold ) THEN alarm
The two parameters for this rule, the fraction threshold for oscillation and the
lower limit of the GLR threshold, can be trained from one day's data under normal
operation as discussed in Chapter 3. For the fraction power threshold, deviation of power
value due to noise should be within a reasonable region. In the test building, the deviation
should never be greater than 10% under the fault-free condition.
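The unstable-control rule combines the oscillation measure with the GLR output. In the sketch below the 10% fraction comes from this section, while the GLR lower limit and all names are placeholder assumptions:

```python
def unstable_control_alarm(power_mean_w, power_std_w, glr_statistic,
                           fraction_threshold=0.10, glr_lower_limit=1.0):
    """Rule of Sec. 4.5.2(d): alarm when the sliding-window standard
    deviation exceeds a trained fraction of the mean power (10% in the
    test building) while the GLR sufficient statistic stays below its lower
    limit, i.e. no start-up or shut-down step is inside the window."""
    return (power_std_w > fraction_threshold * power_mean_w
            and glr_statistic < glr_lower_limit)
```

The GLR condition is what prevents the large standard deviation around a legitimate on/off step from being mistaken for oscillation.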
Figure 4.18 shows an example of the increased standard deviation due to unstable
control for the power data presented in Fig. 4.9(a).
[Figure: standard deviation of fan power (W) vs. time (min., 0-1440)]
Figure 4.18. Detection output of power oscillation caused by unstable control of a
supply fan in an air handling unit of the test building during one day's operation as
shown in Fig. 4.9(a).
4.6 Discussion and conclusions
This chapter demonstrated that by examining the submetered power data of the
major components in an HVAC system, faults can be detected and analyzed at an early
stage. Such faults may not be seen by detection of changes from the total power of the
building, especially for degradation faults like the slight offset in the static pressure
sensor. If other parameters are used as the indices for fault detection, defining an
effective set of parameters and formulating a corresponding set of rules to cover a
comprehensive range of fault conditions in a whole air-conditioning system becomes a
protracted and difficult task. With the submetered power data and a few basic reference
parameters, faults in the major components of an HVAC system can be not only detected
but also diagnosed as shown in this chapter.
Also, the development of power models based on submeters does not require any
special setup to test and train the parameters. This avoids the interruption to the system's
daily operation and hence makes it feasible to implement the detection in an existing
system.
Furthermore, all the submetered data related to different equipment can be
obtained simultaneously and the small amount of data needed for the models simplifies
the model setup process. As shown in this chapter, for a typical HVAC system, a total of
24 hours of data obtained at a sampling interval of one minute will suffice for training
the models.
The method of correlating the submetered power with appropriate parameters can
be used for any system. It should be noted that the form of the models may vary
depending on the type of the related motor drive. For example, chiller power can be used
to check the on/off frequency if the chiller is driven by a CSD motor or can be modeled
as a polynomial function of the appropriate temperatures with confidence intervals when
a VSD motor is involved.
In this chapter, the power models based on submetered measurements have been
developed for equipment driven by CSD or VSD motors with different system
configurations. The model parameters are identified from the normal operation data with
statistical analysis and appropriate fitting algorithms. The FDD methods employed in this
chapter have been verified with data from three air handling units in a test building.
CHAPTER 5
Monitoring of components through the total power profile
5.1 Introduction
For fault detection and diagnosis, it is always desirable to obtain accurate and
inexpensive information about the plant with least interruption to the system's operation.
In principle, electrical power consumption as a comprehensive index can be used for
detection of almost all the faults in an HVAC system due to the fact that faults tend to
cause abnormal magnitude or trend of power consumption. In practice, however, the
detection output and costs vary significantly with measurements at different levels in the
monitored system. In the previous chapters, detection and diagnosis of several faults with
electrical power data have been demonstrated at two levels in a test building, i.e., the
system and the component levels. Although the tests were successful for the selected
typical faults, it may still be difficult to implement the two methods in a real system with
various available information for detection and diagnosis of faults.
As shown in Chapter 3, abnormal operation can be identified by analyzing the
total power data obtained with two meters, one for the whole building system as used in
the detection of chiller cycling and the other for the motor control center as used in the
detection of the remaining equipment of the HVAC system. Based on the examination of
the on/off switches or the trend in the total power data, this detection method is efficient
under some special load conditions or operating patterns. For example, the excessive
power consumption of a supply fan could be seen by checking the detected change in the
total power against a trained constant threshold upon turnoff each day at 10pm when the
fan was only used to maintain appropriate air circulation at a minimum flow rate. In
practice, however, turnoff of equipment may occur at different time points under various
load conditions, such as a system controlled by a presence sensor, which makes it
impossible to detect the abnormal operation with a fixed threshold established under
some specific conditions. This indicates that a practical detector must be robust to any
possible changes, i.e., it should be able to find the events of interest in spite of the
variations in the current conditions or parameters. The robustness of an FDD scheme is
the degree to which its performance is unaffected by conditions in the operating system
which turn out to be different from what they were assumed to be in the design of the
FDD scheme. For the change detection in the total power series developed in Chapter 3, a
major problem of robustness is to find events under uncertain load conditions.
In contrast to the change detection with the least amount of measurements
discussed in Chapter 3, the FDD strategy developed in Chapter 4 uses a submeter for
each major piece of electrically driven equipment in the HVAC system, 17 in total for the three
air handling units and the chiller in the test building. Although the detector with these
submeters has produced desirable results in the tests, it should be noted that such devices
are not usually available in building systems and thus the detection may not be feasible in
consideration of cost vs. benefit in common applications. Therefore, it is important for
the detector to be applicable in current building systems under different load conditions at
acceptable costs. In order to minimize or eliminate the use of submeters, power
thresholds need to be established for all the common shutdown conditions with a
minimum amount of sensors and centralized meters.
In this chapter, a detector is developed in the form of a power function by
correlation, as discussed in Chapter 4, based on the measured reference parameter (like
the air flow rate) and the corresponding power values detected from the total power data
at the appropriate system level instead of the submetered power data. The power values
of a component can be obtained by detection of the total power changes in two different
ways, depending on the load conditions when the equipment is turned off. One way is to
detect and collect the changes only at turnoffs, if the shutdown conditions vary over a
wide range across different days; the other is to detect the changes introduced by
several manual shutdowns during a typical operating day. Although the difference
in the methods of data acquisition has no effect on the algorithms of detection and
correlation, modeling based on manual shutdown should be conducted with appropriate
guidelines in the development of a reliable detector. In the following sections, the method
based on manual shutdown is presented with tests in an HVAC system. First, the
detectability and the appropriate accuracy of changes in the total power data are
discussed. Then, as an example, the power model of a supply fan is set up with the
detected power changes in the 24-hour power data and the measured air flow rate from a
test building. With the fitted function, the status of a component in the system can be
monitored by comparing the detected changes in the total power at shutdown against the
thresholds given by the power function with the measured value of the reference
parameter. In addition, energy consumption of the equipment can be estimated with
reasonable accuracy. Demonstrated with a supply fan as an example, such a detector can
be established with this method and used for any equipment with a variable speed motor
drive in HVAC systems.
5.2 Parameter selection for component modeling with system power monitoring
Component modeling aims to establish a power function that can provide
reference power data of the equipment under various operating conditions. In principle, a
model based on the physical relationships between a component's power input and the
reference parameters can always be extrapolated. In practice, however, in order to
minimize the uncertainty in fitting due to random errors in the sampled data, it is always
desirable to collect the power input of the equipment over a wide load range as the total
power changes at turnoff under different conditions. Also, the sampled data should be
appropriately distributed within the load range to obtain a more reliable correlation.
Therefore, some representative load conditions need to be determined and the
corresponding power input of the equipment should be properly detected as changes in
the total power series.
In general, the trend of a component's power input is affected by more than one
factor. To facilitate the modeling process and reduce the related cost, the factor selected
as the reference parameter should be the most dominant driver of the equipment's load
variations, because only such a parameter can serve as the load indicator of the
equipment and provide a guideline for the time schedule of the manual shutdowns. In
addition, the indicator must be readily available or easy to obtain in the given system
for the purpose of least-intrusive detection. In general, a load
indicator for a component can be determined based on the knowledge of the system and
the component as well. For example, outdoor air temperature can usually be used as an
indicator of the load variations of air conditioning systems, especially when the internal
load of the system remains relatively stable throughout a day. Fig.5.1 shows the 12-hour
profiles of the outdoor air temperature and the electrical power input of a supply fan of an
air handling unit in a test building with a stable internal load during a typical summer
day's operation.
Figure 5.1. Profiles of the outdoor air temperature and the power input of a supply fan in
a test building.
Note that the load indicator is to provide the trend of the load profile and not
necessarily for monitoring or calculation of the load. This is because the actual thermal
load profile may not conform to that of the load indicator in time due to the delay caused
by the thermal capacity of the building. For example, the daily peak load of an air
conditioning system generally occurs at a time different from that of the highest outdoor
air temperature.
For a system with outdoor air temperature as the major variant, data detected at a
certain interval, e.g., 1 hour, can be used to fit the power function.
5.3 Analysis of the detected changes for component modeling
Another important issue in data acquisition for a model is the detection of changes
in the total power series. In this FDD strategy, power functions are based on the detected
changes, which may deviate significantly from the true values due to the dynamics in the
operating system.
In Chapter 3, it has been shown that some faults can be directly detected from the
total power data of a system as abnormal magnitude of changes at specific time points or
under certain load conditions. However, for the modeling of a component's power
consumption with the system's total power input, changes need to be detected under
different load conditions. In addition, to obtain a reliable function for fault detection, the
magnitude of the detected changes used for correlation must be computed with
reasonable accuracy. Therefore, it is essential to analyze the detectability of changes and
the accuracy of the computed values before the model can be established.
Although it is difficult to define a common standard for the detectability of the
changes and the acceptable error range due to the diverse characteristics of the power
data in different systems, some guidelines for implementation of the detection and
analysis of the output have been established in this research to improve the accuracy of
the models for the monitored components. The most important factors obtainable on-line
without additional measurements that affect the detection are the sampling interval of
power data, the relative magnitude of the change, and the noise level in the integrated
power series.
In the time series of an HVAC system's power consumption, the data pattern in
the detection window changes with the sampling interval, as shown in Chapter 3,
and hence an event appears to the detector with varying visibility and magnitude.
In general, the accuracy of the detected power changes is affected by the
magnitude of the change relative to the current total power value. The visibility of the
change increases with its magnitude, which indicates that data should be more carefully
selected when the monitored component is under low load conditions and the relative
magnitude of the change becomes smaller in the total power data.
Observations have also verified that the detection quality can be improved when
the data remain constant and the event is close to a step change. This means data obtained
during transient periods or in a very noisy environment should be avoided in fitting the
model. Therefore, change detection should be implemented not only under representative
load conditions but also in a relatively noise- and dynamics-free data environment.
5.3.1 Effects of the sampling interval
It has been observed that the visibility and the accuracy of the detected changes
are affected by the sampling intervals of the power data. Since the duration of events
varies among equipment at different time points, e.g., during on/off switches, detection
with multiple sampling intervals usually yields more desirable results for a system with
more than one electrically-driven component.
Upon the identification of an event, the magnitude of a change is calculated as the
difference between the averages in the detection and the pre-event windows. Errors
always exist due to deviations from the real values caused by random noise which is
inherent to electrical power systems. With a single sampling interval, the error is
determined by the sampled data points and changes with the FMA windows. The error
value also differs among different sampling intervals depending on the relationship
between the sampling interval and the period of the random noise. For detection with a
single interval, the error caused by the random disturbance cannot be balanced or
reduced. However, it can often be significantly decreased by averaging out the opposite
spikes in the detected changes among different sampling rates owing to the normal
distribution of noise in the electrical power data.
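The magnitude computation and the multi-rate averaging described above can be sketched as follows. This is a minimal illustration, not the implementation used in this research; the window lengths and the decimation-based resampling are assumptions.

```python
from statistics import mean

def change_magnitude(series, event_idx, window=30):
    """Magnitude of a change: mean of the detection window (after the event)
    minus the mean of the pre-event window."""
    pre = mean(series[max(0, event_idx - window):event_idx])
    post = mean(series[event_idx:event_idx + window])
    return post - pre

def multirate_magnitude(series_1hz, event_index_s, intervals=(1, 2, 3, 5, 10)):
    """Average the estimates obtained at several sampling intervals; because the
    noise is roughly normally distributed, opposite spikes tend to cancel."""
    estimates = []
    for dt in intervals:
        resampled = series_1hz[::dt]          # simple decimation to a dt-second rate
        idx = event_index_s // dt             # event position after decimation
        estimates.append(change_magnitude(resampled, idx, window=max(3, 30 // dt)))
    return mean(estimates)
```

A single-rate estimate inherits whatever noise happens to fall in its two windows; averaging the estimates from several decimated copies of the same 1-Hz record is what reduces that error.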
As an example of the tests conducted in this research, Fig. 5.2 and Table 5.1
compare the detection output for changes of different magnitudes with single- and
multi-rate sampling of power data from the motor control center of a test building against
that measured by submeters. During this test period, there were 17 on/off switching events of
two hot-water pumps, two supply fans, and two return fans among a total of six fans and
10 pumps.
The sampling interval used in the single-rate detection is 10 seconds while nine
intervals, including 1, 2, 3, 4, 5, 10, 20, 30, and 60 seconds, have been employed in the
multi-rate case. These sampling intervals, selected after extensive tests with intervals
from 0.125 through 1200 seconds on power data from different systems and under
various conditions, have been found to be the most representative values for HVAC power
systems. The two limits, 0.125 and 1200 seconds, correspond to the observed lower and
upper bounds on the duration of on/off events in HVAC systems.
a). With single sampling interval
Of the 17 events, 15 were found with different levels of error and two were
missed. From Table 5.1, it can be seen that the quality of the detection depends on the
magnitude of the change relative to the total power as well as the current data trend. For
example, at 20:34 with the turnoff of pump B, when the power change of -244 W was
less than 5% of the total power, the error was about 150%. At 21:55 with the turnoff of
supply fan A, when the power change of -1400 W was about 25% of the total power, the
error was less than 0.3%. When the change magnitude was very small, the detector was
not able to find the event at all, as with the event at 21:56 when return fan A was turned
off with a power change of -82 W.
In addition, the shorter the time interval between two changes, the more difficult
it is to find the changes, especially the small ones. This is because, as a steady-state
detector, the GLR needs some time for the effect of the former event to die out before it
can find the subsequent change. This is demonstrated by the output at 20:21 for the
turnoff of pump A: with a magnitude of -275 W, the event was still missed due to the
gradually changing property of the data caused by the previous change and the noise, as
seen from Fig. 5.2(a).
b). With multiple sampling intervals
With a single sampling interval, significant discrepancies of power changes have
been observed between the submetered data and that detected from a centralized data
logger by the detector. Sometimes, on/off events are undetectable with a single sampling
rate. By definition, the GLR algorithm is intended to locate changes only and not to
quantify the magnitude of changes with sufficient precision to detect potentially faulty
conditions. However, with additional techniques such as multiple sampling rates, the
detector built on the enhanced GLR algorithm and the data-processing techniques is
significantly improved in detection quality, both in the detectability of the events
and in the resolution of the output. This has been verified by running the detector with
multiple sampling intervals through the same data set. As shown in Fig. 5.2 and Table
5.1, with the multiple sampling intervals, 16 of the 17 on/off events were identified, one
of which was undetectable with the single sampling rate used before. Moreover, for
nearly 70 percent of the events the errors were significantly smaller with multi-rate
sampling than with single-rate sampling, while only one event, at 23:40, showed
considerably larger errors with multi-rate samples, as listed in Table 5.1.
Under some circumstances, the error from multi-rate detection is larger than that
from single-rate detection. This is because, to obtain the maximum number of alarms for
the events at each sampling rate, the detection threshold must be lowered to the
minimum value among all the sampling rates. With the lowered threshold, the alarm time
points shift for the rates with higher thresholds. As a result, the data patterns for the
reset of the detection window change, which in turn may cause some variations in the
computed magnitude. But this problem occurs rarely, and such errors can
usually be averaged out among the different sampling intervals in the multi-rate detection.
Figure 5.2. Detection of changes under different conditions with single and multiple
sampling intervals in comparison with the values measured by submeters.
(a). Five-hour power data sampled at 1 Hz from the motor control center of a test
building; (b). Comparison of the submetered power with the changes detected in the
centralized power data with a single sampling interval of 10 seconds - 15 alarms for 17
events, 2 missed; (c). Comparison of the submetered power with the changes detected in
the centralized power data with multiple sampling intervals of 1, 2, 3, 4, 5, 10, 20, 30,
and 60 seconds - 16 alarms for 17 events, 1 missed.
Table 5.1. On/off power change detection for a motor control center serving fans and
pumps in a test building, as compared with submetered data.

Time  | Equipment                  | Submeter (W) | Total power (W) | Relative power (%) | Single interval: Power (W) / Error (%) | Multiple intervals: Power (W) / Error (%)
20:10 | Loop-B hot water pump--OFF | -405         | 5970            | 6.8                | -502 / 24.0                            | -401 / 1.1
20:19 | Loop-B hot water pump--ON  | 264          | 5325            | 5.0                | 249 / 5.7                              | 284 / 7.6
20:21 | Loop-A hot water pump--OFF | -275         | 5181            | 5.3                | Missed                                 | -279 / 1.5
20:25 | Loop-A hot water pump--ON  | 252          | 5489            | 4.6                | 313 / 24.2                             | 227 / 10.1
20:34 | Loop-B hot water pump--OFF | -244         | 5125            | 4.8                | -98 / 59.8                             | -476 / 95.1
20:44 | Loop-B hot water pump--ON  | 232          | 5519            | 4.2                | 497 / 114.2                            | 230 / 0.7
21:02 | Loop-B hot water pump--OFF | -309         | 5289            | 5.8                | -346 / 12.0                            | -282 / 8.9
21:14 | Loop-B hot water pump--ON  | 267          | 5105            | 5.2                | 171 / 36.0                             | 252 / 5.6
21:29 | Loop-B hot water pump--OFF | -202         | 5755            | 3.5                | -11 / 94.6                             | -172 / 15.1
21:47 | Loop-B hot water pump--ON  | 237          | 5371            | 4.4                | 289 / 21.9                             | 257 / 8.5
21:55 | AHU-A supply fan--OFF      | -1400        | 4224            | 33.1               | -1396 / 0.3                            | -1411 / 0.8
21:56 | AHU-A return fan--OFF      | -82          | 4142            | 2.0                | Missed                                 | Missed
22:00 | AHU-B supply fan--OFF      | -430         | 3727            | 11.5               | -274 / 36.3                            | -485 / 12.8
22:06 | Loop-B hot water pump--OFF | -300         | 3338            | 9.0                | -242 / 19.3                            | -243 / 18.9
22:26 | Loop-A hot water pump--OFF | -264         | 2734            | 9.7                | -283 / 7.2                             | -244 / 7.4
___   | AHU-B return fan--OFF      | ___          | ___             | ___                | ___                                    | ___
23:40 | Loop-A hot water pump--ON  | 342          | 3092            | 11.1               | 373 / 9.1                              | 220 / 35.6
00:33 | Loop-B hot water pump--ON  | 295          | 3789            | 7.8                | 494 / 67.5                             | 451 / 52.7
As noted above, in addition to the data sampling technique, characteristics of the
power profile play an important role in the detection of changes and the selection of
values for the modeling, as can be seen from the above plots and the table. In
this research, two major factors have been found useful for outlining the effects of the
current data trend on the detection quality: the relative magnitude and the noise level.
5.3.2 Effects of the relative magnitude
In principle, the GLR algorithm can be used to detect changes of any magnitude.
However, tests have shown that in order to minimize the occurrence of false alarms due
to random noise in the data, the magnitude of the minimum detectable event should not
be less than the current standard deviation of the data window. This minimum value helps
to determine the smallest component that can be modeled from the system's total power
as well as the appropriate time for the detection of changes to be used for the fitting, as
discussed later in this chapter.
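The GLR test referred to here can be illustrated with a minimal sketch of its textbook form for a single mean change with a known noise level; the actual detector of Chapter 3 adds windowing and training on top of this idea, so the code below is an assumption-laden simplification, not the thesis implementation.

```python
from statistics import mean

def glr_mean_change(window, sigma):
    """Generalized likelihood ratio statistic for a single change in the mean of
    a Gaussian sequence with noise level sigma. For each candidate change point
    k, the maximum-likelihood step size is the difference of the sample means
    before and after k; the statistic is maximized over k."""
    n = len(window)
    best_stat, best_k = 0.0, None
    for k in range(1, n):
        step = mean(window[k:]) - mean(window[:k])
        # log-likelihood ratio of "change at k" against "no change"
        stat = k * (n - k) * step * step / (2.0 * n * sigma * sigma)
        if stat > best_stat:
            best_stat, best_k = stat, k
    return best_stat, best_k

def detect_change(window, sigma, threshold):
    """Alarm only when the maximized statistic exceeds the threshold; a step
    much smaller than sigma rarely clears a sensibly chosen threshold."""
    stat, k = glr_mean_change(window, sigma)
    return k if stat > threshold else None
```

The threshold plays the role described in the text: raising it suppresses false alarms from random noise at the cost of missing events whose magnitude is below roughly one standard deviation of the data window.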
It has been found that the detectability of an event changes with its magnitude
relative to the current integrated power value of the system, and that the detection
becomes more accurate when the event accounts for a larger portion of the total power
consumption of the given system. The relative magnitude, defined as the ratio of the
magnitude of a change to the current total power of the monitored system, is used as an
indicator because the visibility of a change to the detector diminishes as the power value,
and the noise that accumulates with it, grows with more components put into operation
in the system.
Although no quantitative and concrete correlation has been observed between the
detection error and the relative magnitude of a change, a general descriptive trend has
been found in the envelope of the detection error at different relative magnitudes. This
has been verified through tests with different systems and can also be seen from the
above example. No consistent function can be obtained between the detection error and
the relative magnitude for all the data points in Table 5.1, as shown by the scattered
points in Fig. 5.3. However, a trend line can be formed from the maximum error found at
each relative magnitude in the detection, as demonstrated by the dashed line in Fig. 5.3.
The randomly distributed points show that the errors vary significantly even at the same
relative magnitude, but the dashed line clearly shows that the maximum error of a
monitored event decreases as the relative magnitude increases.
Figure 5.3. Errors between the detected changes from the total power data and the
submetered values vs. the relative magnitude of the submetered values.
The relationship between the error and the relative magnitude verifies that, in the
modeling of a component, equipment with low power consumption under common
operating conditions should be avoided: a reliable power model of the component
cannot be established from the changes detected in the total power if the component's
power input accounts for a very small portion of the monitored system, e.g., less than
5% in this example.
However, with only the relative magnitude, it is still difficult to choose the
appropriate data points without measurements by submeters due to lack of reference and
the uncertainties in the detection caused by the irregular fluctuations in the power series.
In order to choose reliable detection outputs, another index, namely, the standard
deviation has been found useful in tracking the noise level of the current environment.
5.3.3 Data screening by the standard deviation in the FMA window
In a real power system, the values of random noise can hardly be defined by a
function or formula. However, the characteristics of the noise at a given time can usually
be reflected by some indices, such as the approximate period and the peak value for the
transient description of a disturbance and the standard deviation for the average noise
level.
Generally, the error of a detected change increases with the noise level. In this
research, it has been observed that the accuracy of the detected changes is related to the
noise level represented by the standard deviation in the detection window.
Moreover, it has been found that the relative error and consequently the relative
standard deviation as the ratio between the standard deviation to the current total power
data in the detection window are more appropriate to obtain a reasonably balanced power
model. This is first because the noise level, or the standard deviation, usually increases
with the amount of equipment in operation and hence the total power value. Second, the
power model is desired to be of evenly distributed accuracy under various load
conditions.
Test results showed that large relative errors were always accompanied by high
levels of noise, the major source of error in change detection. It should be noted,
however, that the converse does not hold, owing to the random distribution of the data
in the window; that is, high noise levels do not necessarily result in large errors.
Fig. 5.4 shows the distribution of the relative error as the relative standard
deviation changes for the 17 events shown in Fig. 5.2. The number beside each point is
the value of the relative standard deviation at the time of the event. Although no concrete
correlation can be achieved for the randomly scattered points, it can be clearly seen that
large errors always occur with large relative standard deviations, as at the points labeled
66.4, 77.2, and 90.1. On the other hand, large relative standard deviations do not
necessarily lead to large errors, as at the points labeled 76.6 and 79.6.
Similar patterns in the distribution of relative error vs. relative standard deviation
have been found in all other tests conducted in this research. In spite of the lack of an
explicit formula relating the two variables, such a trend provides a sufficient, though not
necessary, condition for selecting the data to be used for correlation. Useful data can be
chosen from the detected changes by discarding those with relatively large values of
relative standard deviation in the current detection window. Although this may sacrifice
some data points with large relative standard deviations but small relative errors, such
as the points at 76.6 and 79.6 in Fig. 5.4, the method helps to secure a more accurate
model and hence more reliable outputs in future fault detection.
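Such a screening step can be sketched as follows. The 5% relative-magnitude floor follows the example above, while the relative-standard-deviation cutoff and the window length are illustrative assumptions rather than values prescribed by this research.

```python
from statistics import mean, pstdev

def screen_events(events, power_series, window=60,
                  min_rel_magnitude=0.05, max_rel_std=0.05):
    """Keep only detected changes whose relative magnitude is large enough and
    whose pre-event window is quiet enough (small relative standard deviation).
    Each event is a pair (sample_index, detected_change_in_watts)."""
    kept = []
    for idx, change in events:
        pre = power_series[max(0, idx - window):idx]   # pre-event window
        total = mean(pre)                              # current total power level
        rel_magnitude = abs(change) / total
        rel_std = pstdev(pre) / total
        if rel_magnitude >= min_rel_magnitude and rel_std <= max_rel_std:
            kept.append((idx, change))
    return kept
```

Because both indices are computed from the centralized power data alone, the screening needs no submeter, which is the point made in the next paragraph.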
Figure 5.4. Distribution of the relative error with the relative standard deviation in the
detection window.
Moreover, such a screening method can be realized without reference to
submetered data, because the values of the relative standard deviation are calculated
directly by the detector with each incoming point in the FMA window and then
compared among all the events involved.
5.4 Application of the gray-box model in fault detection and energy estimation
With the above analysis of the factors that can be used to improve the quality of
detection, models based on the system's total power can be developed for the electrically
driven equipment in HVAC systems. Assuming the GLR detector has been trained with the
method developed in Chapter 3, the component modeling procedure can be summarized
as follows.
1). Obtain information about a component of interest, including power capacity, type of
motor drive, and general operating schedules, from manufacturer's catalogs and/or the
system's design specifications;
2). Study the feasibility of the modeling of the equipment's power input with the
system's total power based on the component's power capacity relative to the system's
total value. Although there is no fixed criterion, the relative magnitude should not be too
small, e.g., 1/1000 is too small to be seen by the detector. Tests showed that this value
should not be less than 5%, especially in the presence of VSD motors in the system. Also,
refer to the standard deviation obtained in the training of the GLR detector in Chapter 3
for the decision. If the power capacity of the equipment is less than the standard deviation
of the total power data when this equipment is in operation, then its power input may not
be modeled with desirable accuracy;
3). Determine the parameter to be correlated with the power consumption of the
equipment, as discussed in Chapter 4;
4). Select the parameter that can be used as the indicator of the major variations in the
load of the system. In common HVAC systems, outdoor air temperature or enthalpy can
be used if the heat gain/loss from outside is the major variable part of the total load. If the
internal load dominates, then use the schedule of the source as a reference, e.g., number
of occupants;
5). Observe the total power data of the monitored system from one normal day's
operation and find the time for the on/off events to be detected based on the variations of
the load condition;
6). Determine the sampling rates of power data as analyzed in Chapter 3. In addition,
check the availability of the control parameters of the equipment, such as the PID
coefficients, to help determine the time constant of the process;
7). Apply the detector to the power data and record the values of the reference parameter
to be correlated at the on/off switches of the equipment;
8). Choose the power data for correlation based on the calculated values of the relative
magnitude of the changes and the relative standard deviation at the event time as
discussed in Section 5.3;
9). Set up the model by correlating the power data with the reference data and form the
allowable offset range from the fitted function based on a confidence level;
10). Test the model with additional data collected under normal operation and make
modifications if needed.
In deciding the time points for the on/off events in Step 5, some preliminary
information or observations might be helpful to avoid switches of the equipment that are
likely to cause large errors in the detection. Analysis in Section 5.3.3 has shown that in
order to obtain more accurate changes, periods with more dynamics marked by large
relative standard deviations such as the transient processes should be avoided. Although
the occurrence of large noise and transient processes is rarely known in advance,
detection periods significantly affected by such disturbances can be skipped based on
knowledge of the system's normal control sequence or by inspecting the power profile
during a typical day's operation.
The representative load conditions are best selected in a day with a wide span of
the parameter that causes the major variations in the load. For example, if outside air
temperature is used as an indicator, the turnoff events may be executed every hour from
the early morning into the late afternoon during the operation time of a day.
Fig. 5.5 illustrates the modeling process for the power consumption of a supply
fan with a variable speed drive and a design capacity of 5 kW in a test building. The
monitored system is composed of six fans and 10 pumps electrically wired together at a
motor control center as shown in the appendix. It is known from the design information
of this system that the major factor that causes the variations in the building's cooling
load is the outdoor air temperature. So the cooling load trend can be approximately
predicted by keeping track of the outdoor air temperature, which changes with time
during a day as shown in Fig. 5.1. Table 5.2 lists the necessary preliminary information
for the detection. From the twenty-four-hour data of a typical summer day, it can be seen
that during the transient processes between 7:00 - 8:00 when the system was started and
12:00 - 13:00 when the system was turned down by the control system during lunch
break, the power data showed significantly more dynamics than the rest of the day. This
indicates the manual turnoff should not be implemented during these time periods. For
the remaining time, manual turnoff of the fan was simulated with the power data by
subtracting the submetered power data at an interval of 1 hour from 8:00 to 20:00
excluding 12:00 and 13:00, as shown in Fig. 5.5(a) and (b). The 1-Hz data were then fed
to the detector for changes at these time points. The standard deviation was also
computed simultaneously in the moving window. The detected changes in comparison
with the submetered values and the relative error distribution with the relative standard
deviation are shown in Fig. 5.5(c) and (d), respectively. Data points at 14:00, 15:00, and
16:00 corresponding to the three largest relative standard deviations 70.5, 100.3, and 74.6
were further removed for the modeling, though it can be seen from Fig. 5.5(c) that the
error at 14:00 is not the largest among the other nine points. The nine remaining data
pairs were finally used for the polynomial power model. From the comparison between
the model based on detection with the total power and that from the submetered data as
shown in Fig. 5.5(e) and Table 5.3, it can be seen that the difference between the two
models decreased with an increase in the air flow rate and hence the fan power input,
which further verified that the detection accuracy increased with the relative magnitude
of the monitored equipment in its host system. The model for detection based on a 90%
confidence has been successfully implemented with power data from the test system. One
example is shown in Fig. 5.5(f) for the detection of a 3-stage offset of a static pressure
sensor.
Table 5.2. List of the information used in the power change detection of a supply fan.

Fan
  Power capacity:               5.0 kW
  Minimum power input:          0.5 kW         (under the zero-load condition)
  Type of motor drive:          VSD            (the fan speed control signal can also be used for FDD and energy estimation)
  Minimum speed control signal: 20%
  Operating schedule:           7:00 - 22:00
  Power percentage of system:   15.5%          (when the fan is in operation)
  Correlation parameter:        Air flow rate
  Confidence level of model:    90%
  Event time:                   Every hour     (8:00 - 20:00, excluding 12 and 1 p.m.)
System
  Total capacity:               32.25 kW
  Load indicator:               Outdoor air temperature
  Transient period:             12:00 - 14:00  (lunch break - system turndown)
  Detectable magnitude:         Range of standard deviation, 0.0 - 105.0 W (higher values occurred at events)
  Sampling intervals:           1 - 30 seconds (multi-rate detection)
Figure 5.5. Component power modeling via detection of power changes at the system level -
electrical power input vs. air flow rate of a supply fan in a test building. (a). Twenty-four-hour
total power of the motor control center and the submetered power of a supply fan; (b). Total
power of the motor control center with the manual shutdown at an interval of one hour from 8:00
to 20:00, except 12:00 and 13:00 when the system was turned down for lunch break; (c). Power
changes measured from the submeter and those detected from the total power at the shutdown of a
supply fan; (d). Error distribution with the relative standard deviation; (e). Comparison between
the two models based on the detected changes from the total power data and the submetered fan
power data, respectively; (f). Detection output for the static pressure sensor offset.
Table 5.3. Comparison between the fan power as detected changes in the total power and
submetered measurements at given air flow rates.

Air flow rate (CFM) | Detected power change (W) | Submetered power data (W) | Error (%)
914.1               | 387.2                     | 317.5                     | 22.0
965.3               | 418.8                     | 347.2                     | 20.6
1361.6              | 698.5                     | 620.2                     | 12.6
1999.1              | 1280.0                    | 1218.8                    | 5.0
2332.5              | 1648.7                    | 1610.1                    | 2.4
2391.0              | 1718.0                    | 1684.3                    | 2.0
2490.4              | 1838.8                    | 1814.2                    | 1.4
2636.7              | 2023.9                    | 2014.0                    | 0.5
2820.9              | 2269.0                    | 2280.4                    | 0.5
Tests with other faults related to the fan power have also been conducted with this
method. Despite the slight deviation of the detection-based model from the one based on
submetered measurements, all the implemented faults found by the submetered model
were also identified by this model.
With the decreased accuracy when the relative magnitude of a detected change is
small, the model may become less sensitive to a fault under low-load conditions. In
practice, a fault under low load of a component can be found only when the fault
becomes more serious even with the submeter-based model due to the relatively wide
range of the confidence interval compared with the power value itself. In addition,
although VSD motors are used to reduce energy waste at off-peak loads, equipment
driven by such motors is not designed to work within extremely low-load ranges. This is
because accurate and steady control is difficult to maintain under such conditions and
hence a minimum load is always required to start the motor [Chen and Demster 1995].
Moreover, the confidence intervals are established to prevent false alarms caused by
potential random deviations and data points are usually supposed to spread closely
around the fitted curve rather than near the upper or the lower limit. If the sampled power
data line up much closer to the upper or lower limits than to the fitted line, it should be
treated as a fault even when the sampled points still stay within the interval.
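The band-bias check described above can be condensed into a simple test. The sketch below is a minimal illustration, not the thesis's implementation; the threshold of half the band half-width is an assumption chosen only to show the idea.

```python
import statistics

def band_bias_fault(residuals, half_width, bias_fraction=0.5):
    """Flag a fault when sampled points cluster near one confidence
    limit rather than spreading around the fitted curve, even though
    every point may still lie inside the interval.

    residuals     : measured power minus model prediction, in watts
    half_width    : half-width of the confidence band, in watts
    bias_fraction : illustrative threshold (an assumption, not from
                    the thesis): flag when the mean residual exceeds
                    this fraction of the band half-width
    """
    mean_r = statistics.fmean(residuals)
    return abs(mean_r) > bias_fraction * half_width

# Points hugging the upper limit of a band with 100 W half-width
# raise a flag; points scattered around the fit do not.
print(band_bias_fault([80, 85, 78, 92, 88], 100.0))   # True
print(band_bias_fault([-12, 8, -5, 10, -3], 100.0))   # False
```

A run-length or sign test could replace the mean-residual rule; the point is only that "inside the interval" is not by itself proof of normal operation.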
In addition to fault detection, the model based on detection of changes by manual
shutdown can also be used to estimate energy consumption of the related equipment and
of the monitored system if a model can be established for each component. For the 24-hour operation of the day used in the above example, as listed in Table 5.4, the energy
consumption of the fan was 19.5 kWh by the submeter and 18.8 kWh from the model
with the measured air flow rate. The error was about 3.6% despite an electrical surge
around 17:30, as seen in Fig. 5.6, which rarely occurs in common operation. The
errors for other normal days' operation were found to be less than 3.5%.
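The energy estimate above amounts to integrating the modeled fan power over the day's samples. A minimal sketch, assuming a fixed one-minute sampling interval (the submeter rate in the test building); the power series would come from the fitted power-vs-flow model evaluated at each measured flow rate:

```python
def energy_kwh(power_w, interval_s=60.0):
    """Estimate energy use (kWh) from power samples (W) taken at a
    fixed interval, using trapezoidal integration over the series."""
    joules = sum((a + b) / 2.0 * interval_s
                 for a, b in zip(power_w, power_w[1:]))
    return joules / 3.6e6   # J -> kWh

# A flat 1500 W load over 60 one-minute samples spans 59 intervals:
samples = [1500.0] * 60
print(round(energy_kwh(samples), 3))   # 1.475
```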
Table 5.4. Energy estimation of a fan based on submeter and detector for a normal day.

Item           Fan power based on submeter   Fan power based on detector
Energy (kWh)             19.5                          18.8
Error (%)                 0.0                           3.6

[Figure 5.6: fan power (W) vs. time of day (h), 0:00-0:00; legend: fan power based on submeter; fan power as a function of air flow rate.]
Figure 5.6. Twenty-four-hour supply fan power profiles during a normal working day by
measurements and simulation of power as a function of airflow rate based on detection
of changes in the total power data.
In practice, control signals that can be obtained from the building energy
management system may also be used to establish power functions for the related
equipment by following the guidelines described above. For example, with the same
procedure as shown in Section 5.3, power input of a fan can also be modeled as a
function of the motor speed control signal. However, for equipment driven by a VSD
motor, special care must be taken in data acquisition for the model fitting when the
equipment is running under low load conditions, i.e., when the control signal is at a low
value. Although the power data should be ideally obtained under a wide range of the
control signal to produce an accurate model, detection of the related changes in the total
power series may not be applicable for the model. This is because with a small value of
the control signal, the relative magnitude of the equipment's power consumption may
also become too small to be computed with acceptable accuracy and hence the quality of
the model is affected. Therefore, the power model needs to be extrapolated for detection
and energy estimation under low load conditions. When the equipment is not in
operation, the controlled variable and the power input should be zero. In practice,
however, a minimum load is always required by the design of a VSD motor, i.e., a nonzero lower limit of the motor speed control signal is needed. This means a zero power
input corresponds to a non-zero control signal. In the test building, the minimum fan
speed control signal was 20% as listed in Table 5.2, which indicates zero power input is
related to 20% fan speed control signal though the actual fan speed is also zero.
Therefore, the zero-load data pair (0.2, 0.0) must be used in the correlation as a constraint
for the function for low-load energy calculation. Otherwise, the power input will be
interpreted as zero under zero speed control signal by extrapolation, which may lead to
considerable errors in energy estimation, as listed in Table 5.5 for the same data set used
in the above example illustrated in Fig.5.5. Comparison of the fan power consumption
between the submetered value and the power model with appropriate low-load constraint
is illustrated by Fig. 5.7.
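Forcing the fitted curve through the zero-load point (0.2, 0.0) can be sketched as a constrained least-squares fit. This is an illustration, not the thesis's fitting code: writing the model as P(s) = (s - s0)(a + bs) builds the constraint into the basis, and the quadratic form and synthetic data are assumptions.

```python
def fit_with_zero_load(signal, power, s0=0.2):
    """Least-squares fit of fan power vs. speed-control signal forced
    through the zero-load point (s0, 0).  Because
        P(s) = (s - s0) * (a + b * s),
    P(s0) = 0 holds for any coefficients.  s0 = 0.2 is the 20% minimum
    speed signal of the test building.  Solves the 2x2 normal
    equations by Cramer's rule."""
    f1 = [s - s0 for s in signal]          # basis function (s - s0)
    f2 = [(s - s0) * s for s in signal]    # basis function (s - s0)*s
    a11 = sum(x * x for x in f1)
    a12 = sum(x * y for x, y in zip(f1, f2))
    a22 = sum(y * y for y in f2)
    b1 = sum(x * p for x, p in zip(f1, power))
    b2 = sum(y * p for y, p in zip(f2, power))
    det = a11 * a22 - a12 * a12
    a = (b1 * a22 - b2 * a12) / det
    b = (a11 * b2 - a12 * b1) / det
    return lambda s: (s - s0) * (a + b * s)

# Synthetic, roughly fan-law-shaped data (illustrative values only):
sig = [0.25, 0.4, 0.55, 0.7, 0.85, 1.0]
model = fit_with_zero_load(sig, [30, 240, 620, 1130, 1760, 2500])
print(model(0.2))   # 0.0: zero power at the 20% minimum signal
```

An unconstrained fit over the same data would extrapolate to zero power at a zero control signal, which is exactly the error the constraint prevents.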
Table 5.5. Estimation of energy consumption of a fan in one day by the power vs. speed
control signal model with different constraints under the zero-load condition.

                           Power functions based on change detection
Item           Submeter   No constraint on the    Zero-speed    20%-speed
                          zero-load condition     zero-power    zero-power
Energy (kWh)     19.5            14.5                14.6          19.0
Error (%)         0.0            25.6                25.1           2.7
[Figure 5.7: fan power (W) vs. time of day (h), 0:00-0:00; legend: fan power based on submeter; fan power as a function of the fan speed control signal.]
Figure 5.7. Twenty-four-hour supply fan power profiles during a normal working day by
measurements and simulation of power as a function of fan speed control signal based on
detection of changes in the total power data.
For a system without VSD motors, the model can be simplified by clustering the
power magnitude of different equipment for residential buildings [Hart 1992]. However,
for commercial buildings, statistical detection of changes is still necessary due to the
presence of higher noise levels, especially when a change is of small relative magnitude.
Also, the GLR detector may be simplified with a single sampling rate due to the shorter
and more consistent on/off durations of equipment with CSD drives. Fig. 5.8 demonstrates
a typical power profile of such a system in one hour and the GLR output for the 57 events
that occurred during this time period.
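The GLR decision function for a step of unknown size and onset in a noisy power series can be sketched as follows. This is a textbook-style illustration (after the standard formulation for a mean change in Gaussian noise), not the thesis's exact detector; the data are assumed centered at the pre-change level, and the noise scale and threshold are illustrative.

```python
def glr_step(y, sigma, threshold):
    """One-shot GLR test for a single mean step of unknown magnitude
    at an unknown time in a sequence y (assumed centered at the
    pre-change level, noise std sigma).  For each candidate change
    time k, the log-likelihood ratio maximized over the step size is
        g(k) = (sum of y[k:])**2 / (2 * sigma**2 * len(y[k:])).
    Returns (max statistic, change index), index None if below
    threshold."""
    best, k_best = 0.0, None
    for k in range(1, len(y)):
        tail = y[k:]
        s = sum(tail)
        g = s * s / (2.0 * sigma ** 2 * len(tail))
        if g > best:
            best, k_best = g, k
    return (best, k_best) if best > threshold else (best, None)

# A 3 kW turn-on step at sample 5, with ~100 W noise:
y = [0, 50, -30, 20, 0, 3000, 3050, 2980, 3020, 2990]
g, k = glr_step(y, sigma=100.0, threshold=10.0)
print(k)   # 5: the detected change time
```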
[Figure 5.8: (a) total power (W) vs. time (s), 0-3600; (b) GLR detection output vs. time (s), 0-3600.]
Figure 5.8. GLR detection of changes for component modeling from the total power data
of the building energy system without VSD motors. (a). 1-Hz power data from a
restaurant; (b). detection output for the 57 on/off switches in the system without false or
missed alarms.
5.5 Discussion and conclusions
This chapter demonstrates the evaluation and modeling of power consumption of
the major components in HVAC systems based on detection of changes in a system's
total power profile. Guidelines and rules are proposed for the development of the gray-box model for fault detection and diagnosis as well as energy estimation. With
appropriately selected indices including the sampling rate, the relative magnitude of a
change, and the standard deviation of the current series, the power function of a
component can be established for effective fault detection and energy estimation.
As a general rule for equipment to be monitored from the integrated power
profile, it has been found that the relative magnitude of the component's power
consumption in the system's total power input should not be too small. With the test
building, a lower limit of 5% was used for the power model. A power logger at a lower
level is needed in order to find the power changes that are accurate enough for feasible
evaluation or modeling of a small component, such as a chilled water pump. For
example, a single logger could cover all the pumps in a building or in several AHUs to meet the
requirement for minimum applicable detected changes. Although such a monitor does not
work on the system's total power and the detection may require more than one
logger for the whole system, it may still be worthwhile compared to submeters for each
component.
In addition, two potential problems with the tests need to be addressed when
applying the detection method. First, for the turnoff power detection, it is assumed there
should be no other events happening at the same time. This might produce false alarms if
something else actually occurred. One possible solution is to repeat the manual change
detection on another day with a similar load condition and remove the data points that
differ significantly from those at the same time on the other day(s). Second, the detection
in the test building was conducted with a base sampling interval of one second, while the
sampling interval of the reference power data by the submeter was one minute. This
might introduce deviations between the submetered data and the detected values due to
the ever-changing magnitude of the total power even if no other events occurred during
the one-minute period. Such a potential difference can be reduced by using a submeter
with a sampling interval that is comparable to the base sampling interval of the detector.
It should be noted that although feasible power models of the major components
can be obtained from the total power data at the system level with the method developed
in this chapter, submeters need to be used for fault detection and monitoring when the
equipment must run around the clock, such as in a hospital where the supply fans should
never be turned off. Under such conditions, fault detection and diagnosis should be based
on submeters if power consumption is used as the major index for monitoring.
CHAPTER 6
Diagnosis of faults by causal search with signed-directed-graph rules
Diagnosis of a fault is intended to identify the fault origin at the lowest possible
level in a system based on the exploration of the maximum amount of available
information about the system and its components. In Chapters 3 and 4, simple rules have
been demonstrated for alarms of abnormal power input due to some typical faults.
However, by only pointing out the equipment with abnormal power consumption instead
of the origin of a fault, those rules are actually applicable in fault detection and are not
sufficient to produce full diagnosis. This chapter applies a top-down knowledge-based
scheme for fault diagnosis through analysis of power consumption. Based on a thorough
study of current methods of fault diagnosis, two diagnostic approaches are proposed with
power data at the system and the component levels. With only the total power input, the
knowledge about the system and the components is very limited and hence a shallow
diagnostic reasoning technique is introduced to signal the possible directions or devices.
When more detailed information about the system and its components is available to the
detector, a deep diagnostic reasoning technique called a causal search scheme is
developed to trace down the most likely branches to identify the real cause of the
detected abnormalities. Formation of the knowledge base and diagnostic rules is
described and diagnostic schemes are implemented and verified for each technique with
data from real buildings.
6.1 Introduction
While fault detection involves the search for undesirable operating status in a
system or a component, fault diagnosis aims to trace the cause behind the abnormal
behaviors. In general, automatic identification and correction of a fault can be fulfilled in
two different ways, hardware redundancy and software redundancy [Rossi et al., 1996].
As discussed in Chapter 1, diagnosis of faults in HVAC systems is usually based on
software redundancy, which indicates the "duplication", either qualitative or quantitative,
of the outputs of a plant with given inputs and comparison of the results against measurements.
If the difference exceeds a threshold which accommodates the presence of noise and
disturbance, then the monitor issues alarms for possible cause(s) of the faults for further
actions if needed. Fault diagnosis systems based on software redundancy are normally
designed using artificial intelligence techniques and emulate human performance in the
cause analysis [Patton et al., 1989].
Current fault diagnosis schemes fall into two general categories, knowledge-based
and artificial neural network (ANN) approaches. Based on an understanding of the
physical relationships within the monitored system, the knowledge-based method
compares the selected measurements against some criteria or thresholds obtained from
physical models or basic principles and traces the most likely cause(s) of the deviation by
checking through an established set of logic. Initiated as an if-then rule checking
technique, the knowledge-based approach has been extensively studied and applied in
different areas including the HVAC industry. Rossi and Braun [1996] presented a method
for automated detection and diagnosis of faults in vapor compression air conditioners by
using statistical properties of the residuals for current and normal operation and
comparing the directional change of each residual with a generic set of rules unique to
each fault. Stylianou [1996] and Stylianou and Nikanpour [1997] demonstrated a
methodology that uses expert knowledge to diagnose the selected faults when a problem
is found with the "health" of a reciprocating chiller from thermodynamic models and
pattern recognition. Breuker and Braun [1998] conducted a detailed evaluation of the
performance of a statistical rule-based fault detection and diagnostic technique with tests
in a simple rooftop air conditioner over a range of conditions and fault levels. Fault
diagnosis based on rules or knowledge has been studied by other researchers in different
ways [Karki and Karjalainen, 1999] [Ngo and Dexter, 1999] [Peitsman and Soethout,
1997]. House et al. [1999] conducted a preliminary study of several typical classification
techniques from the two basic categories for diagnosis of faults in data generated by a
variable air volume air handling unit simulation model and found that the rule-based
method yielded the most reliable result in the test outputs. However, most traditional
knowledge-based methods rely on detailed information about the system including
system setup and physical and thermal parameters of the components, which greatly
limits the compatibility or portability of the diagnostic tool. In addition, diagnosis is
primarily based on selected physical parameters or a combination of them from specific
components, but measurements of such values for the knowledge base are sometimes
difficult to fulfill or usually not available with the current control system.
The ANN technique is basically aimed to enable a program to learn, reason and
make judgements based on the training data set without detailed knowledge of the
system's physical background. In the past five years, some research has been conducted
with data from simulation or small test units. Lee et al. [1997] described the application
of an ANN using the backpropagation algorithm to the problem of fault diagnosis in an
air handling unit. Peitsman et al. [1996] studied the application of black-box models for
fault detection and diagnosis in HVAC systems by comparing a multiple-input/single-output (MISO) ARX model and an ANN model. Li et al. [1997] presented an ANN
prototype for fault detection and diagnosis in complex heating systems. Although these
efforts showed some positive prospects of the ANN technique for fault diagnosis in
HVAC systems, such black-box models are still mainly in the research stage due to the
uncertainties in the technique itself for reliable models that can be extended to practical
applications with little knowledge about the physical processes. In addition, requirements
by an ANN model for training data representative of a system's behaviors under both
normal and faulty conditions make it difficult to be used in practice.
As discussed above, a practical tool for fault diagnosis should be adaptable to
different systems, i.e., the criteria for diagnosis must be consistent and common among
various types of systems. It should also be based on appropriate basic principles for
potential extension and a reasonable amount of measurements or control signals that are
currently available or easy to obtain in common HVAC systems.
Based on thorough research of the previous work, this thesis aims to develop a
practical system-independent method for fault diagnosis in HVAC systems from analysis
of appropriately selected parameters. In this research, a knowledge-based methodology is
developed with power consumption as the primary index for evaluation of detected
deviations. Based on the generic facts of HVAC systems, an expert system structure is
designed for the inference of the fault source. Fig. 6.1 illustrates the structure of a common
expert system [Patton et al, 1989].
Figure 6.1.
Architecture of a common expert system.
As shown in Fig. 6.1, an expert system consists of:
(a). User;
(b). Human interface - window for the explained outputs;
(c). Inference engine - automatic inferring program based on information from the
knowledge base;
(d). Knowledge base - facts from the data base and models and rules for the problem;
(e). Data base - input data;
(f). Knowledge acquisition - models and rules provided by the expert or from storage;
(g). Workspace - memory for the storage of the problem.
The key elements in an expert system are the knowledge base, which supplies the
facts and the related models and rules for analysis, and the inference engine, which
provides the algorithm for diagnosis.
By directly presenting to the diagnostic tool the energy effects of faults, which is
a major concern of the HVAC industry, FDD based on power input helps the building
energy management system to make proper decisions for necessary maintenance not only
promptly but also efficiently. In addition, since electric power is the major energy source of
current air conditioning systems, this approach can be applied to HVAC systems with
different configurations. Moreover, as a function of reference parameters that can be
obtained from the control system or by basic measurements, power consumption is
modeled with physical insights into the process and hence the uncertainty due to lack of
understanding of the plant is minimized.
Depending on the available power measurements, two reasoning techniques are
established for fault diagnosis, i.e., the shallow reasoning approach and the deep
reasoning method [Patton et al., 1989]. The shallow reasoning approach is used when
only the system's total power is available while the deep reasoning approach can be
applied if power functions can be obtained with basic measurements and control signals.
6.2 Fault diagnosis by shallow reasoning with system's total power input
The shallow reasoning technique was first initiated as the diagnostic expert
system in the medical domain for inference of the causes of observed evidence. Without
information about the internal physical descriptions of the system, direct relationships
are assumed between the observed symptoms and system malfunctions based on a fault
dictionary or fault tree. The fault dictionary is prepared from the basic knowledge about
the system, which can be obtained from the past record of the system. The diagnosis
usually gives a list of possibly defective components instead of the ultimate cause of the
fault. The major characteristic of this approach is that it can be implemented with a
minimal amount of information about the monitored system.
In this research, the shallow diagnostic reasoning method is applied when little
information is available except the total power input of the given system. Detection in
such situations indicates the identification of changes or variations in the total power
series while the diagnosis involves recognition of the electrically-driven equipment in the
system with abnormal power consumption at on/off switches, unexpected operating
schedules, and unstable control. Without measurements of any components in the
system, the knowledge base mainly consists of data and facts obtained from the
information in the manufacturers' catalogs, design specifications of equipment, and
operating schedules from the building management system. For example, when a
component is scheduled to start at the design time point, the FDD program that runs
continuously will detect if there has been a change in the total power of the monitored
system around the given time and diagnose the change by checking the magnitude against
the design value. If no change is found or the detected magnitude change does not match
the design value, then an alarm is issued to the operator and recorded in the workspace.
Fig. 6.2 shows the profile of the total power input of the motor control center in a test
building. The FDD output is given as follows for the startup of a supply fan around 7
o'clock in the morning.
[Figure 6.2: total power (W) of the motor control center vs. time of day (h), 0:00-0:00; annotation: "Supply fan started at 7:00".]
Figure 6.2. Power profile of the HVAC system in a test building during a typical day.
Detection output:
Time of event: 7:20
Magnitude: 1106W
Knowledge base:
Time of event: 7:00 - 7:30
Magnitude: 500 - 3000W
Duration: 10 - 30 minutes
Event descriptions: Supply fan startup
Diagnosis output:
Time of event: 7:20
Magnitude: 0 (0 means magnitude matches specifications in the knowledge base. +1
means excessive value in the positive direction and -1 in the negative
direction.)
Event descriptions: supply fan startup
In this case, data in the knowledge base were obtained from the design
information of the HVAC system (the 'Time of event' and 'Event descriptions'),
specifications for the fan from the manufacturer's catalog (power capacity and minimum
power input as the 'Magnitude'), and observations of sampled power data ('Duration'
which may also be available in the catalog).
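The supply-fan example above can be condensed into a sketch of the shallow-reasoning rule check. The dictionary layout and field names below are illustrative assumptions, not the thesis's data format; the +1/0/-1 convention follows the diagnosis output shown above.

```python
def diagnose(event, kb):
    """Shallow-reasoning check of a detected power change against a
    knowledge-base entry: 0 = magnitude matches the specification,
    +1 = excessive in the positive direction, -1 = in the negative
    direction, None = no scheduled event matches the detection time."""
    t_lo, t_hi = kb["time_window"]
    if not (t_lo <= event["time"] <= t_hi):
        return None                       # unexpected event: alarm
    m_lo, m_hi = kb["magnitude_w"]
    if event["magnitude_w"] > m_hi:
        return +1
    if event["magnitude_w"] < m_lo:
        return -1
    return 0

# Knowledge-base entry for the supply fan startup (7:00-7:30, 500-3000 W):
kb = {"time_window": (7.0, 7.5),          # hours, decimal
      "magnitude_w": (500, 3000),
      "event": "supply fan startup"}
print(diagnose({"time": 7.33, "magnitude_w": 1106}, kb))   # 0
```

A real implementation would also check the event duration against the knowledge base and log unmatched events to the workspace.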
Based on the analysis of the information available for common HVAC systems,
the shallow reasoning approach is able to provide a set of overall guidelines for the
maintenance of the essential functions of a system. In addition, with no measurements of
components and hence no intrusion into the system, this approach is virtually an
interruption-free and cost-effective tool for general monitoring of the system's operation.
However, in the shallow reasoning approach, both the fault dictionary and the
fault tree methods are based on look-up tables with all potential faults preprocessed. In
practice, such tables result in a large number of entries and hence diagnosis of large
systems becomes difficult and time-consuming. If the table is not complete or fails to
provide proper evaluations due to lack of insights into the system, the diagnostic output
may be meaningless or erroneous. Moreover, with no data from the related components
in a system, it is impossible to identify the underlying cause of an observed fault.
Therefore, diagnosis with total power input based on shallow reasoning is
applicable to systems consisting of components with distinct power capacity and
operating schedules, such as the air conditioning system of a restaurant.
In order to locate the specific source of a fault in a complex system, the inference
scheme must be based on a more thorough understanding of the system and its
components as well.
6.3 Deep knowledge expert system
Deep reasoning is based on analytical models of a system and its components,
which are also called deep models. A deep model of a plant can derive its behaviors of
interest with a given set of parameters and predict the effects of the variations in these
parameters. By simulating the underlying principles of a process, the deep knowledge
expert system can be used to estimate the effects of a potential fault without
preprocessing the assumed fault scenario.
Diagnosis with deep modeling can be divided into three basic methods according
to their principles and functions: causal search, constraint suspension, and governing
equations.
The causal search technique is based on the tracing of observed deviations or
malfunctions to their origins [Shiozaki et al, 1985]. Potential paths of fault propagation
are described by signed directed graphs (SDG or digraphs) consisting of nodes that
represent state variables, alarm conditions, or fault sources and branches that display
influence between nodes. A complete digraph is able to demonstrate the intensity of the
effect, the time delays, and the probabilities of fault propagation along the branches.
The constraint suspension technique mainly aims to reason from misbehaviors
to structural defects based on inspection of the constraints on detailed input/output ports
in a hierarchical structure [Davis and Shrobe, 1983] [Davis, 1984]. By examining the
values at the I/O terminals of a component against constraints (or rules) that define the
behaviors of the component, such as the inputs and the output of an adder in a digital
circuit, this method is able to deal with complex systems with multiple layers when
measurements of the input and output of each component can be obtained.
The governing equations method works on a complete set of associated
quantitative constraint equations [Kramer, 1986], e.g., mass balance across a unit in a
flow path. First developed for fault identification in the domain of chemical engineering,
this technique can be applied to processes where the governing equation for each
component and the associations between equations for different equipment are available.
For non-critical applications as in an HVAC system, cost vs. benefit requires that
feasible diagnosis be based more on logical inference than on specific descriptions of
equipment in the system. Moreover, it is usually difficult to obtain the I/O measurements
or the appropriate governing equation for each component in operation.
On the other hand, based on a qualitative diagnostic digraph, the causal search
method can be implemented for diagnosis with much less information. Nodes
representing key equipment are described by physical models that can not only
demonstrate the severity of the effect of a fault, such as the power function of a fan, but
also allow for time delay in the transportation of media or variations of parameters, which
is common in the thermal processes of HVAC systems, while branches connect the
related nodes in a process and give the actual active directions of fault propagation.
Moreover, in this research, the dominant criterion for FDD is power consumption,
which is the ultimate reflection of faults in energy cost of the system. Hence fault
diagnosis becomes a backward tracing process of the origin, which is exactly the major
function of the causal search technique.
In principle, the major limitation of this method is that diagnosis can be conducted
only if each variable undergoes at most one transition at a time between qualitative
states, i.e., either increase or decrease from the normal value, during fault propagation.
Although this might become a problem in some engineering processes, e.g., the
concentration of certain compounds in a chemical reaction process, this condition can
usually be met in a common HVAC system, especially when power consumption is the
output of interest.
Therefore, the causal search method is more efficient and feasible than the other
two approaches for FDD applications in HVAC systems.
In this thesis, a rule-based SDG technique [Kramer and Palowitch, 1988] is
developed for fault diagnosis in common HVAC systems. The inference logic is properly
designed in consideration of the characteristics of HVAC control systems. By converting
the digraph into a concise set of rules with quantitative descriptions derived from
physical models, the reliability of diagnosis is increased and the threshold sensitivity is
reduced. The performance of inference logic is verified with tests in an HVAC system.
6.4 Diagnosis based on causal search of fault origin
The causal search method consists of two major steps, the construction of a
digraph and the conversion of the SDG into applicable rules to be implemented in an
automatic FDD program. This section introduces the procedure for deriving expert rules
for fault diagnosis based on semantic analysis by following the development by Tzafestas
[Patton et al, 1989].
6.4.1 Construction of the digraph
A digraph is the graphical representation of the fault simulation process. In the
causal search method, it is always assumed that a fault is generated from a single node,
i.e., the root node, which is the source of all the consequent abnormal symptoms. Hence
the digraph usually starts with a single root node, from which branches representing
potential influences on other components lead the pathways of fault propagation to the
next nodes that describe the responses of these components. The directed branches are
extended following the system's functional structure until the ultimate node that
represents the model of the monitored equipment is reached. The status of the root node is
classified with "+" (too high) for deviations in the positive direction, "-" (too low) in
the negative direction, and "0" (normal) for no deviation or deviations within tolerance.
The plus sign "+" marked with a branch indicates with the increase of the value in the
current node, the value in the next node increases or remains unchanged, while the
minus sign "-" means the next node changes in the negative (decreasing) direction or
remains unchanged. For example, "a positive change originated in Component A leads to
the increase in Component B and consequently a decrease in Component C" can be
simply represented by a digraph as shown in Fig. 6.3.
Figure 6.3. Basic elements in the signed directed graph.
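The sign conventions above can be illustrated with a small sketch. The three-node graph mirrors the example described for Fig. 6.3 (a "+" branch A→B and a "-" branch B→C); encoding the SDG as a dictionary of signed edges is an assumption for illustration, not the thesis's representation.

```python
# Signed directed graph: branch signs for the Fig. 6.3 example.
EDGES = {("A", "B"): +1, ("B", "C"): -1}

def propagate(root_sign, path, edges):
    """Predict the qualitative state (+1 too high / -1 too low) at
    each node along one pathway of the SDG: signs multiply along the
    branches starting from the root deviation."""
    states, sign = {path[0]: root_sign}, root_sign
    for a, b in zip(path, path[1:]):
        sign *= edges[(a, b)]
        states[b] = sign
    return states

# A positive change at root A raises B and lowers C:
print(propagate(+1, ["A", "B", "C"], EDGES))
# {'A': 1, 'B': 1, 'C': -1}
```

With time delays, a downstream node's observed state may still be 0 even though the propagated sign predicts a deviation, as noted below.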
Note that time delay might occur in the responses of components along a
pathway. Hence the status of the next node may or may not be immediately affected by
the current node. As in the above example, if a time delay occurs between nodes B and C,
then the current status of C might be -1 or 0, though an increased B tends to result in a
decrease in C.
In addition, measurements are assumed fast enough that a variable representing a
physical parameter at one node and its measured value can be lumped as a single node,
which makes it convenient to represent the offset or deviation caused by a fault, as
illustrated in Fig. 6.4 where the subscript "s" represents the measured value by sensor and
"e" indicates the deviation between the measured and the design values.
Fig. 6.4 shows a further simplification of the digraph for diagnosis. Nodes
representing unmeasured elements that are not of interest in the simulation of the physical
process can be removed from the digraph.
Figure 6.4. Node merge and elimination in an SDG. (a). original digraph; (b). simplified
digraph after lumping of variables A and D with their measurements As and Ds,
respectively, and removal of unmeasured nodes B and C.
Although some controlled variables can be kept within the desired range even in
the presence of a fault, as shown in Fig. 6.5(b) for the physical process in Fig. 6.5(a)
[Patton et al, 1989], where the disturbance seems to have been "eliminated" by the
feedback loop, extra efforts are always made to compensate for it. For example, a fouled
cooling coil fault must be compensated by higher chilled water flow rate or lower chilled
water temperature, both causing excessive consumption of pump or chiller energy. This
indicates that the feedback control may not be able to completely compensate for the fault
in the whole digraph, especially for an uncontrolled variable or for a controlled variable
when the offset is large enough to override the feedback effect. In the construction of a
digraph, for the continuation of fault propagation along the directed pathways, feedback
branches between two adjacent nodes should be removed from the simulation tree of the
digraph for the convenience of inference, as shown in Fig.6.5(c).
Figure 6.5. Control loop in digraph presentation. (a) the original digraph with a feedback loop; (b) loop with disturbance cancellation (disturbance stopped); (c) loop with control saturation (disturbance transported).
Fig. 6.6 demonstrates a typical process for the construction of a digraph for diagnosis with a negative deviation in the root node A. The digraph for the original physical process is shown in Fig. 6.6(a) [Patton et al., 1989]. After lumping of nodes in Fig. 6.6(b), elimination of the unmeasured variables in Fig. 6.6(c), and removal of the feedback branches in Fig. 6.6(d), a complete set of all the possible disturbance pathways is presented as interpretations I-1 through III-2. Note that because diagnosis is ultimately conducted by counting violations, no node should appear more than once in one interpretation.
Since a fault may not affect all the measurements within the sampling interval in practice, due to time delays in the responses of components, i.e., the effect may not propagate throughout the whole pathway, it is usually necessary to derive a list of possible interpretations instead of a smaller set of the most complete pathways such as A-C-D-F-G. In the thermal process of an HVAC system, for example, time delay must be considered to obtain an interpretation with "simultaneous" violations in different components, and the pathway may stop at some point depending on the relationship between the delay and the sampling interval involved. This characteristic of the SDG technique also improves flexibility in diagnosis when the number of accessible components changes within a system or among systems with similar structures.
It should be noted that although the SDG structure for diagnosis is based on obtainable measurements or control signals in the system, the variable represented by the root node in an SDG is usually not accessible; identifying it is exactly the task of a fault diagnosis program. In a system with interactive components, the observed abnormal behavior of a component may originate from more than one part of the system. Therefore, more violations of constraints generally result in more reliable or more accurate identification of the fault source, which means longer pathways or more nodes in an SDG.
With a given number of accessible components, the diagnosis resolution can be improved when the obtainable measurement is closer to the fault source. For example, in Fig. 6.6, interpretation I-3 (A-Ce-De) yields a more reliable result than the interpretation A-De-Ee. This is because the abnormal behavior of an accessible component is more likely to be shared by other fault origins when it is separated from the source by more components along the flow path, i.e., farther down a branch from the root node. This also indicates that an interpretation should not be simplified by removing intermediate nodes when measurements for these components are available. For example, A-Ce-De in the SDG may be interpreted as A-Ce and A-Ce-De but not A-De.
In summary, a typical process to establish a complete digraph consists of the
following procedures:
(a). Determine the root node from analysis of the system;
(b). Construct the complete digraph with all potential pathways from the root node as
well as the nodes for all affected components in the propagation of the deviation;
(c). Lump the nodes and their measurements if appropriate;
(d). Remove the non-measured nodes except the root node from the digraph;
(e). Remove the feedback branches from the digraph.
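The pruning steps of this procedure can be sketched in code. This is an illustrative sketch only, not the thesis implementation: the representation of branches as (source, target, sign) triples and the function name are assumptions. It covers steps (d) and (e); lumping of a node with its sensor (step (c)) is treated as a pure renaming and omitted.

```python
# Hypothetical sketch of digraph simplification for SDG diagnosis.
# A branch is a (source, target, sign) triple; sign is +1 or -1.

def simplify_digraph(edges, measured, root, feedback_edges):
    """Remove feedback branches, then splice out unmeasured nodes
    (except the root), multiplying branch signs along the way."""
    # Step (e): remove the feedback branches.
    kept = [e for e in edges if (e[0], e[1]) not in feedback_edges]
    # Step (d): remove non-measured intermediate nodes by joining
    # their incoming and outgoing branches (signs multiply).
    nodes = {n for s, t, _ in kept for n in (s, t)}
    for node in nodes - set(measured) - {root}:
        into = [e for e in kept if e[1] == node]
        out = [e for e in kept if e[0] == node]
        kept = [e for e in kept if node not in (e[0], e[1])]
        kept += [(a, b, sa * sb) for a, _, sa in into for _, b, sb in out]
    return kept

# Example: A -> B -> C -> D with B and C unmeasured (cf. Fig. 6.4).
edges = [("A", "B", +1), ("B", "C", +1), ("C", "D", -1)]
print(simplify_digraph(edges, measured=["A", "D"], root="A",
                       feedback_edges=set()))  # [('A', 'D', -1)]
```

The spliced branch A-D carries the product of the signs along the removed path, so the qualitative cause-effect relationship is preserved.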
Figure 6.6. Demonstration of the SDG inference procedure. (a)-(d): simplifications of the SDG model; I-1 through III-2: interpretations of potential pathways for the digraph.
Based on knowledge of the structure and the basic control logic of a system, the SDG technique provides an efficient way to describe the semantic network of fault simulation. However, for automatic execution of diagnosis, it must be converted to a form that can be recognized by a computer program. In the causal search technique, one flexible and computationally efficient way to perform the conversion is the rule-based SDG method, in which all potential pathways for fault propagation are converted to a set of rules [Kramer and Palowitch, 1988].
6.4.2 Rule development from digraph
Potential deviations along the pathways in a digraph can be represented by a set of
signed Boolean series by assigning +1 or -1 to deviations (sensor-setpoint) in the positive
or negative direction and 0 to the neutral or normal state.
For a positive branch X --(+)--> Y in a simulation tree, the relationship is expressed as: if X = +1, then Y can be +1 or 0 (due to the unspecified time delay for the propagation from component X to Y); if X = 0, then Y = 0; and if X = -1, then Y = -1 or Y = 0. The relationship for a negative branch X --(-)--> Y can be derived similarly. Table 6.1 shows the Boolean representation of the interpretations in Fig. 6.6 (the subscript e is omitted for convenience).
Table 6.1. Boolean node patterns for the digraph in Figure 6.6.

        Interpretation group I        Interpretation group II       Interpretation group III
        C    D    F    G              C    D    F    G              C    D    F    G
I-1:   -1   -1   -1   -1      II-1:   0    0   -1   -1     III-1:  -1    0   -1   -1
I-2:   -1   -1   -1    0      II-2:   0    0   -1    0     III-2:  -1    0   -1    0
I-3:   -1   -1    0    0
I-4:   -1    0    0    0
These node patterns can be directly used for diagnosis by comparison with the sampled data from the monitored system. Also, by checking through the signed Boolean series, duplication of rules can be avoided.
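The table-lookup diagnosis described here amounts to quantizing each measured deviation to a sign and comparing the resulting sign vector with each stored pattern. A minimal sketch follows; the patterns are transcribed from Table 6.1, while the helper names and the tolerance-based quantization are illustrative assumptions.

```python
# Boolean node patterns for (C, D, F, G) from Table 6.1; a match
# implies A = -1 is a possible fault source.
PATTERNS = {
    "I-1": (-1, -1, -1, -1), "I-2": (-1, -1, -1, 0),
    "I-3": (-1, -1, 0, 0),   "I-4": (-1, 0, 0, 0),
    "II-1": (0, 0, -1, -1),  "II-2": (0, 0, -1, 0),
    "III-1": (-1, 0, -1, -1), "III-2": (-1, 0, -1, 0),
}

def sign(deviation, tol):
    """Quantize a measured deviation (sensor - setpoint) to -1, 0, or +1."""
    if deviation > tol:
        return 1
    if deviation < -tol:
        return -1
    return 0

def match_patterns(sampled):
    """Return the interpretation labels consistent with the sampled signs."""
    return [name for name, pat in PATTERNS.items() if pat == sampled]

print(match_patterns((-1, -1, -1, 0)))  # ['I-2']
```

A partially propagated fault (here, G still neutral because of time delay) matches the truncated pattern I-2 rather than the complete pathway I-1.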
However, for a large system with multiple faults, such a table may create an extremely large set of codes and hence become very awkward to use in the development of an FDD scheme. Therefore, the format of the presentation of the deduction logic must be simplified for practical use. In an inference system developed in a computer program, this problem can be solved by utilizing a pair of logical predicates, designated p and n here, in the standard "IF ... THEN ..." rules to form a module or subroutine as shown below.

(p X Y) <=> (X = Y) or (|X| > |Y|)
(n X Y) <=> (X = -Y) or (|X| > |Y|)
Thus Interpretation group I becomes:

A --(+)--> C => (p A C), C --(+)--> D => (p C D), D --(+)--> F => (p D F), F --(+)--> G => (p F G).
From the above definition and rule table, it can be deduced that the p and n pairs in a branch should be connected sequentially with AND logic. For branches starting from a common node, such as in Interpretation III-2 in Fig. 6.6, the AND logic should also be used. Hence, the rules in Table 6.1 can be represented as follows.
Interpretation group I:   (p A C) AND (p C D) AND (p D F) AND (p F G).
Interpretation group II:  (p A F) AND (p F G).
Interpretation group III: (p A C) AND (p A F) AND (p F G).
Note that the unmeasured component A cannot be used in the conditions for inference. But it is easy to prove that (X = -1) AND (p X Y) is equivalent to Y ≠ +1. Therefore, in the form of the standard IF...THEN... rule,

Interpretation group I:
IF ((C ≠ +1) AND (p C D) AND (p D F) AND (p F G))
THEN A = -1 is a possible fault source.

Interpretation group II:
IF ((F ≠ +1) AND (p F G))
THEN A = -1 is a possible fault source.

Interpretation group III:
IF ((C ≠ +1) AND (F ≠ +1) AND (p F G))
THEN A = -1 is a possible fault source.
For a given digraph, all interpretations should be incorporated into one "IF...THEN..." clause. This not only provides a more concise form for diagnosis, but also helps to prevent a node from appearing repeatedly in one interpretation. However, special care should be taken in developing or combining interpretations when a converging node exists in a digraph. A converging node indicates multiple causes for the deviation of a component. Although two diverging branches are integrated by an AND, two converging pathways should be combined with an OR in order to avoid repeated clauses in the final IF...THEN... rule. For example, at the converging node F in Fig. 6.6, an OR is needed to combine the branches, i.e., the complete form should be (p A C) AND (p C D) AND ((p A F) OR (p D F)) AND (p F G), or,

IF ((C ≠ +1) AND (p C D) AND (p F G) AND ((F ≠ +1) OR (p D F)))
THEN A = -1 is a possible fault source.
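This predicate logic can be made concrete in a few lines. The sketch below is illustrative, not the thesis implementation; node values are the quantized -1/0/+1 states, and the combined rule encodes the interpretation for the digraph of Fig. 6.6 with the OR at the converging node F.

```python
def p(x, y):
    """Consistent propagation with the same sign: (X = Y) or (|X| > |Y|)."""
    return x == y or abs(x) > abs(y)

def n(x, y):
    """Consistent propagation with the opposite sign: (X = -Y) or (|X| > |Y|)."""
    return x == -y or abs(x) > abs(y)

def a_is_fault_source(c, d, f, g):
    """Combined rule for the digraph of Fig. 6.6: is A = -1 a possible source?"""
    return c != +1 and p(c, d) and p(f, g) and (f != +1 or p(d, f))

# A fully propagated fault: every downstream node deviates low.
print(a_is_fault_source(-1, -1, -1, -1))  # True
# A positive deviation at C contradicts every pathway from A = -1.
print(a_is_fault_source(+1, 0, 0, 0))     # False
```

Because p and n both accept Y = 0 when |X| > |Y|, a pathway truncated by time delay still satisfies the rule, which is exactly the flexibility argued for above.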
6.4.3 Direct rule development from a digraph
With the logical form described in the previous sections, rules for diagnosis can be obtained in a concise format. However, for applications in a real system, rule derivations based on the fault table may become a huge list of possible pathways. The situation can be further complicated when there are multiple possible causes of the deviation in one component, because converging nodes often lead to a large number of combinations among different pathways. For example, for the digraph containing two converging nodes shown in Fig. 6.7(a) [Patton et al., 1989], eighteen combinations can be derived by following the procedures presented in Sections 6.4.1 and 6.4.2. Therefore, a more efficient method needs to be used for the development of rules.
In fact, from the above discussion, it can be found that if the network is properly
disassembled, the rules can be directly derived from a partially developed digraph. In
general, converging pathways represent a choice in the construction of the simulation tree
where one of the paths is used to form an interpretation group. By using all combinations
of these choices, one can obtain the full set of interpretations. Thus the combined set of interpretations can be represented by making the choices explicit, instead of enumerating each interpretation.
With the two converging nodes D and F in the digraph in Fig. 6.7(a), four interpretation groups can be obtained as shown in Fig. 6.7(b)-(e). The combined set of interpretations can be represented by the following set of branches:

A --(+)--> B AND A --(+)--> C AND C --(+)--> E AND (B --(+)--> D OR C --(+)--> D) AND (D --(+)--> F OR E --(+)--> F),
which can be transformed to the following rule:

IF ((p A B) AND (p A C) AND (p C E)
AND ((p B D) OR (p C D))
AND ((p D F) OR (p E F)))
THEN A = +1 is a possible fault source.
It can be seen from the above discussion that the number of the interpretations for
further derivation of the rules with the given digraph has been significantly reduced.
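The reduction can be seen by enumerating the converging-node choices explicitly: each interpretation group is one pick of an incoming branch per converging node, so the groups are the Cartesian product of the choices. A sketch, with the branch lists assumed from the rule above:

```python
from itertools import product

# Branches common to every interpretation group of Fig. 6.7(a).
fixed = [("A", "B"), ("A", "C"), ("C", "E")]
# For each converging node, one incoming branch is chosen per group.
choices = {"D": [("B", "D"), ("C", "D")],   # converging node D
           "F": [("D", "F"), ("E", "F")]}   # converging node F

groups = [fixed + list(pick) for pick in product(*choices.values())]
print(len(groups))  # 4 interpretation groups, as in Fig. 6.7(b)-(e)
```

Encoding each converging node as an OR in the final rule keeps the clause linear in the number of branches, instead of enumerating all product combinations (and their truncated variants) as separate interpretations.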
Figure 6.7. Interpretation development for a digraph with converging branches. (a) the partially developed digraph; (b)-(e) simplified presentations of interpretations.
6.5 SDG rule development for fault diagnosis with power input of components
As discussed in the previous sections, for diagnosis with the causal search technique, rules can be established based on digraphs for the potential faults. In HVAC systems, the modeling process for diagnosis involves the development of the digraph for a given fault and of the expert rule set, with reference to the system configuration and the control sequence.
The construction of a digraph usually starts with a root node that represents the originally faulty component in an HVAC system. In principle, there can be many defective components, as an HVAC system generally consists of a considerable amount of mechanical equipment and parts as well as electrical devices. In practice, however, it has been learned that faults in a given type of HVAC system usually occur in some typical forms [ANNEX 25]. In addition, although a fault may happen at any component level, a diagnosis program can reasonably be expected to reach only the level at which the faulty part is directly related to a measured or modeled parameter. For example, a slipping fan belt leading to less power consumption (modeled) at high fan speed (measured) can be detected by the fan power function. But the slipping effect itself may be caused by more than one defect, such as a loose screw in the mechanical transmission system, reduction of belt tension after a long period of operation, etc. Fault origins at such a deeper level are generally invisible to the FDD program because measurements of the related physical parameters are infeasible. Therefore, fault origins in HVAC systems in this research are identified as defects or malfunctions that directly lead to excessive variations in the observed parameters.
Since power consumption is the major index of this FDD method, the diagnosis function is usually activated by violations of the power criteria. As an ultimate effect of system malfunctions, power consumption is generally the node at the end of a digraph for a given fault. Hence diagnosis based on the SDG rules is essentially the inverse of the modeling process, which predicts the response of a plant to a given operating state. In the SDG rules, the status of the end node is usually determined with a component's power model obtained by the methods demonstrated in the previous chapters. As discussed in Chapters 3 and 4, abnormal effects in power consumption can be defined in terms of the magnitude of a component's power input, its cycling frequency, or the standard deviation of the sampled power series.
Development of a digraph for a given fault requires finding the possible
connections from the fault origin as the root node and the power response as the end
node. From the fault origin, all potential branches in the propagation of the variation in
the root node as well as the nodes representing all affected components that can be
measured or modeled in the pathways should be included in the digraph. Generally, in the
rules for diagnosis, nodes with some constraints or thresholds should be first considered
to detect the maximum number of violations. For example, as a controlled variable with a
setpoint and a tolerance range, the indoor air temperature is a useful index and is usually
available for the control system.
As discussed before, a major advantage of using power consumption as the criterion for fault detection is that the detector is able to find abnormal operation even if the other physical parameters appear normal under their local feedback controls. Moreover, some parameters used in the power models are measured but not controlled or modeled. Although constant thresholds can be applied to those parameters under some circumstances, they cannot generally be assigned an abnormal status, i.e., a logical value of -1 or +1, when used in the rules, and hence may block the diagnosis for the next variable in the p or n pairs. For example, in a VAV system, the supply airflow rate is measured for diagnosis with a fan power model but is not controlled with a setpoint. Hence it cannot be found in the fault state with the logical value of -1 or +1 unless it is lower than the threshold for the minimum air flow rate (at status -1) that may be used to avoid insufficient air supply in case of failure of some key components in the system. Therefore, in this thesis, the logic inference mechanism is developed in a more compatible structure
as follows. For a node X with or without feedback control,

(p X Y) <=> (X*Y ≥ 0)
(n X Y) <=> (X*Y ≤ 0)

Or, along a pathway,

X --(+)--> Y: Y = 0 or +1 if X = 0 or +1; Y = 0 or -1 if X = 0 or -1;
X --(-)--> Y: Y = 0 or -1 if X = 0 or +1; Y = 0 or +1 if X = 0 or -1.
This inference logic considers the potential time delay between nodes as well as
the local compensations in the rule development. Alarms can thus be issued for different
levels of severity of a fault. Abnormality in a non-controlled parameter usually gives a
warning if the fault is not critical for the occupants' or the system's health. Urgent alarms
are issued for immediate attention if the deviation of a controlled variable X is found
beyond its allowed range or the abnormality of a non-controlled parameter is detected
that is critical to the system's health. For example, as a controlled variable of HVAC
systems, the indoor air temperature should always be maintained within a specified range
for the occupants' health or the needs of some production processes. A typical fault in a
non-controlled variable that may call for prompt intervention is the unstable control of a
component, which not only deteriorates the quality of the control system but may also
damage the related mechanical parts. Such a fault can usually be found from oscillation
of the data series that are available for analysis by computing the standard deviation of
the samples. Fig. 6.8 shows the digraph for the diagnosis of an unstable fan control.
Figure 6.8. Digraph for the diagnosis of a supply fan with unstable control.
With an inappropriately tuned gain Ku.sf, the variances (or the standard deviations) of the fan speed control signal usf.v and the supply fan power Psf.v tend to increase. Hence the SDG rule can be written as

IF (Psf.v = +1) THEN alarm for oscillation in the power input of the supply fan
IF ((usf.v ≠ 0) AND (p usf.v Psf.v))
THEN Ku.sf (+1) is a possible fault origin.
If only power input is available for diagnosis, the rule can be simplified as
IF (Psf.v = +1) THEN alarm for oscillation in the power input of the supply fan.
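Detecting such oscillation reduces to thresholding the standard deviation of the sampled power series. A minimal sketch, assuming a trained threshold value and hypothetical sample data:

```python
from statistics import pstdev

def oscillation_status(power_samples, sigma_threshold):
    """Return +1 if the power series oscillates abnormally, else 0.
    sigma_threshold is a trained limit in the same units as the samples."""
    return 1 if pstdev(power_samples) > sigma_threshold else 0

# Steady fan power vs. a hunting control loop (kW, hypothetical values).
steady = [5.0, 5.1, 5.0, 4.9, 5.0]
hunting = [4.0, 6.2, 3.8, 6.4, 3.9]
for series in (steady, hunting):
    if oscillation_status(series, sigma_threshold=0.5) == 1:
        print("alarm: oscillation in the power input of the supply fan")
```

The same quantizer supplies the Psf.v node state for the SDG rule above, so the power series alone can both raise the alarm and drive the inference toward the gain Ku.sf.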
It should be noted that a branch in a complete digraph represents a potential
pathway of fault propagation, whereas a branch in the simulation tree to be used for
diagnosis is the specific active pathway along which the fault is expected to propagate.
With the SDG diagram, this means not all the pathways or the nodes in the digraph can
be used at the same time for a given fault. In this thesis, the control of the active
pathways in a digraph is realized by first choosing the related operating modes based on
the general control logic of an air conditioning system.
The operating mode of an HVAC system is usually determined by the outdoor air
temperature and the corresponding type of the thermal process involved, which is
typically defined as heating, free cooling with outdoor air, mechanical cooling with
100% outside air, or mechanical cooling with minimum outdoor air, named as modes 1,
2, 3, and 4 respectively in this thesis. For example, a leaky recirculation air damper in an
HVAC system tends to result in lower heating energy consumption in the heating mode,
higher cooling energy input in the modes of free cooling with outdoor air and mechanical
cooling with 100% outdoor air, and lower cooling power input in the mode of mechanical
cooling with minimum outdoor air. In practice, it is difficult to find the leaky recirculation air damper from the power input of the boiler or the chiller when the damper is wide open or being modulated by the control command. Therefore, this fault should be detected and diagnosed with the chiller power input only when the system is operating in mode 3. The effects of this fault on the chiller power input are listed in Table 6.2.
Table 6.2. Effects of a leaky RA damper on the chiller power in different operating modes.

Mode  Outside temperature region     RA damper    Chiller power (normal)  Chiller power (leaky)
1     Toa < Tbp                      open         P1                      P1l = P1
2     Tbp < Toa < Tsa - ΔTsf         modulating   P2                      P2l = P2
3     Tsa - ΔTsf < Toa < Tra         closed       P3                      P3l > P3
4     Tra < Toa                      open         P4                      P4l = P4
Toa, Tsa, and Tra are the temperatures of the outdoor, supply, and return air respectively, while Tbp represents the balance point temperature for switching between the minimum and the modulating amount of outdoor air, and ΔTsf is the approximate temperature rise of the air across the supply fan. If the above temperature measurements are not available, the supply and the indoor air temperature setpoints Tspt.sa and Tspt.ia can be used in place of Tsa - ΔTsf and Tra respectively.
In mode 3, since the outdoor air temperature is lower than the return air
temperature, 100% outside air is used for cooling to save energy and the recirculation air
damper is closed by the control system. If the recirculation air damper cannot be fully closed due to a mechanical failure, as shown by the digraph in Fig. 6.9, the power input of the chiller Pch tends to increase because the warm return air Qra admitted through the leakage area Ad.ra must be processed by the cooling coil.
Figure 6.9. Digraph for the diagnosis of a leaky recirculation air damper.
In principle, the leaky recirculation damper can be identified from the chiller
power model as long as the system is operating in mode 3. However, in practice, it has
been found that the fault can be reliably detected only when the outdoor air temperature Toa is closer to the supply air temperature Tsa than to the return air temperature Tra. This is
because the uncertainty in the outdoor air temperature, the increase of the air temperature
in the return duct, and the control deadband in the room air temperature should be
considered in order to eliminate alarms for deviations due to random disturbance. Such a
modification in the temperature threshold for FDD is especially useful when the chiller
power model is not sensitive to load changes. For example, with a two-stage
reciprocating chiller in the test building, a dimensionless threshold for the outdoor air
temperature was found useful for the FDD of this fault, as shown in the following SDG
rule used for the diagnosis of this fault. In fact, the extra clause for the outdoor air
temperature not only helps to reduce the false alarm rate due to random disturbance but
also greatly improves the resolution of the leaky recirculation air damper in spite of its
short pathway that may be shared by other potential faults which also lead to excessive
chiller power input.
IF (Pch = +1) THEN alarm for high power input of the chiller
IF ((Qra ≠ 0) AND (p Qra Pch) AND (0 < Toa - Tspt.sa < ht.oa*(Tspt.ia - Tspt.sa)))
THEN Ad.ra (+1) is a possible fault origin
The chiller power model Pch can be established as a polynomial function if continuous load control is used, or as a threshold on the on/off cycling frequency when stepwise control is involved. In the test building, a threshold of 35 minutes as the lower limit for the interval between off and on was used for the reciprocating chiller with a CSD motor. ht.oa is a trained dimensionless threshold for (Toa - Tspt.sa)/(Tspt.ia - Tspt.sa); for the test building, a value of 20% was used for ht.oa.
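Putting the mode constraint and the dimensionless temperature threshold together, the rule can be sketched as follows. The function name and the example values are illustrative assumptions; the p predicate is the compatible product form defined above, and the 20% threshold is the trained value reported for the test building.

```python
def leaky_ra_damper_suspected(q_ra, p_ch, t_oa, t_spt_sa, t_spt_ia,
                              ht_oa=0.20):
    """SDG rule for a leaky recirculation air damper in mode 3.
    q_ra and p_ch are quantized node states (-1/0/+1); temperatures in degC."""
    def p(x, y):  # compatible same-sign propagation predicate
        return x * y >= 0
    # Dimensionless outdoor air temperature, thresholded at ht_oa.
    t_star = (t_oa - t_spt_sa) / (t_spt_ia - t_spt_sa)
    return q_ra != 0 and p(q_ra, p_ch) and 0 < t_star < ht_oa

# Warm return air leaks past the damper, chiller power is high, and
# Toa sits just above the supply air setpoint (hypothetical values).
print(leaky_ra_damper_suspected(q_ra=+1, p_ch=+1, t_oa=14.0,
                                t_spt_sa=13.0, t_spt_ia=23.0))  # True
```

With t_oa closer to the indoor setpoint (t_star above 0.20) the clause fails, which is precisely how the extra temperature condition suppresses false alarms from random disturbance.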
In addition to the operating modes, the status of a component may also be
determined by the system's operating schedule. For example, a leaky cooling coil valve is
expected to be visible only when the system is in mode 1 or 2 according to Table 6.3.
However, this does not necessarily mean that the FDD for this defect must wait until the
outdoor air temperature switches the system to mode 1 or 2. In some HVAC systems,
when the air handling unit is shut down in mode 3 or 4, the cooling coil valve is also
closed and the chiller may run at a constant low speed or cycle on and off regularly to
prevent heat accumulation in the equipment. If the valve is leaky, with chilled water to be
processed by the cooling coil, the chiller will run at a higher speed or cycle more
frequently than under normal conditions with a fully-closed valve. In the test building, a
chiller with a CSD motor cycled at an interval of over 38 minutes with a fully-closed
valve when the AHU was shut down during late night (after 10pm) and early morning
(before 7am). With a leaky valve, the interval was found to be shorter than 35 minutes.
Fig. 6.10 shows the digraph for this fault followed by the corresponding rule set.
Figure 6.10. Digraph for the diagnosis of a leaky cooling coil valve.
IF (Pch = +1) THEN alarm for high power input of the chiller
IF ((Qcw ≠ -1) AND (p Qcw Pch) AND (usf = usf.min))
THEN Av.cc (+1) is a possible fault origin

where Av.cc and Qcw represent the leakage area of the valve and the flow rate of the chilled water. The minimum fan speed control signal usf.min is used as an indicator of the shutdown of the air handling unit; usf.min is typically used for a VSD fan, as discussed in Chapter 5.
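The cycling-frequency criterion can be sketched directly from timestamps of the chiller's off-to-on transitions. The 35-minute lower limit is the trained value from the test building; the function name and the sample timestamps are hypothetical.

```python
def short_cycling(off_to_on_times_min, min_interval_min=35.0):
    """True if any interval between successive off->on events falls
    below the trained lower limit (35 min for the test chiller)."""
    intervals = [b - a for a, b in zip(off_to_on_times_min,
                                       off_to_on_times_min[1:])]
    return any(dt < min_interval_min for dt in intervals)

# With the AHU off, a fully-closed valve gave >38 min between starts;
# a leaky valve shortened the interval (times in minutes, hypothetical).
normal = [0, 39, 80, 121]
leaky = [0, 30, 58, 90]
print(short_cycling(normal), short_cycling(leaky))  # False True
```

Evaluated only during the night shutdown window, the short-cycling flag supplies the Pch = +1 end-node state that the leaky-valve rule above depends on.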
Table 6.3. Effects of a leaky CC valve on the chiller power in different operating modes.

Mode  Outside temperature region     Cooling coil valve  Chiller power (normal)  Chiller power (leaky)
1     Toa < Tbp                      closed              P1                      P1l > P1
2     Tbp < Toa < Tsa - ΔTsf         closed              P2                      P2l > P2
3     Tsa - ΔTsf < Toa < Tra         modulating          P3                      P3l = P3
4     Tra < Toa                      wide open           P4                      P4l = P4
From the above analysis, it can be seen that the detectability of a fault changes significantly with the status of the related components, which is determined by the operating mode and the control scheme. For example, it is impossible to detect a leaky cooling coil valve by power analysis in summer when the chiller is also turned off. With the extended logic design, nodes representing equipment not involved in the current operation can be skipped by assigning them the value 0, based on switches between operating modes or on the control design, so as to facilitate the execution of the FDD program. By allowing a node's state to switch with a given operating mode or condition, the compatible inference structure greatly improves the flexibility of the program for use in a system under various conditions, as well as in different systems with various information available for diagnosis under similar control logic.
Examples in the following discussions are generally based on the fault diagnosis
implemented under summer conditions. SDG rules can be developed similarly for other
conditions.
The resolution of the diagnosis output may also vary with the load conditions during different periods of the same day. For example, during summer, a stuck-closed recirculation air damper in a typical HVAC system can be more positively identified when the condition Qsa = -1 (insufficient air supply) is met after the end of a work day when the outside air damper is closed, which helps to distinguish the stuck-closed recirculation air damper from other faults that also result in excessive fan power input. If Qsa = -1 is found throughout a day when the system is in operation, then both the outdoor and recirculation air dampers are diagnosed as stuck-closed. Therefore, for better recognition of a fault, appropriate time constraints corresponding to load conditions should be utilized in the inference rules based on the system's control logic.
Fig.6.11 illustrates the SDG rule development for a stuck-closed recirculation air
damper.
Figure 6.11. Developed digraph for the fault of a stuck-closed recirculation air damper.
The nodes Dra, Qsa, Tia, Pch, Ps, usf, and Psf.Q denote the deviations of the physical variables from their setpoints or normal values: the recirculation air damper position, the flow rate of the supply air, the room air temperature, the chiller power input, the duct static pressure, the fan motor speed, and the supply fan power model, respectively. Note that the indoor air temperature Tia and the static pressure Ps in the air duct, both in bold letters, are generally controlled variables in air conditioning systems.
As a result of the failure in mechanical transmission, a stuck-closed recirculation
air damper will lead to reduced supply air flow rate and increased room air temperature.
With a lower air flow rate, the cooling load and hence the power consumption of the
chiller will decrease. Meanwhile, the extra pressure drop across the recirculation damper
in the air loop tends to cause a lower static pressure in the duct, which then calls for a
higher speed and hence more power input of the fan. The rule for the diagnosis of this
fault can be derived as follows.
IF (Psf.Q = +1) THEN alarm for excessive power input of the supply fan with air flow rate
IF (Ps = -1) THEN alarm for low static pressure
IF (Tia = +1) THEN alarm for high room temperature
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Qsa = -1) THEN alarm for low flow rate of the supply air
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (toffwork < t < tahuoff) THEN alarm for stuck-closed recirculation air damper
IF ((Qsa ≠ +1) AND (n Qsa Tia) AND (p Qsa Pch)
AND (Ps ≠ +1) AND (n Ps usf) AND (p usf Psf.Q)
AND (|Pch| + |Ps| + |Tia| + |usf| + |Qsa| ≠ 0))
THEN Dra (-1) is a possible fault origin

where Qsa = -1 indicates the supply air flow rate is lower than the normal operating range and usf = +1 means 100% fan speed. Two time constraints, toffwork and tahuoff, mark the end of the normal occupied hours and the time the air handling units are turned off each day. In the test building, to reduce the cooling load caused by the hot outdoor air in summer, the outdoor air damper was closed and the recirculation air damper was set fully open between toffwork and tahuoff.
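The rule set above can be condensed into one function to show how the nesting works; this is an illustrative sketch, not the thesis program. Node states are the quantized -1/0/+1 deviations, the p and n predicates are the compatible product forms, and only three of the alarms are reproduced for brevity.

```python
def diagnose_stuck_closed_ra(psf_q, ps, tia, usf, qsa, pch,
                             after_work_hours):
    """Condensed SDG rule set for a stuck-closed recirculation air damper.
    Returns the list of alarms raised; the fault origin is appended only
    when every pathway constraint is satisfied."""
    def p(x, y):  # same-sign compatible predicate
        return x * y >= 0

    def n(x, y):  # opposite-sign compatible predicate
        return x * y <= 0

    alarms = []
    if psf_q == +1:
        alarms.append("excessive supply fan power at given air flow")
    if qsa == -1 and after_work_hours:
        alarms.append("stuck-closed recirculation air damper")
    if (qsa != +1 and n(qsa, tia) and p(qsa, pch)
            and ps != +1 and n(ps, usf) and p(usf, psf_q)
            and abs(pch) + abs(ps) + abs(tia) + abs(usf) + abs(qsa) != 0):
        alarms.append("Dra(-1) is a possible fault origin")
    return alarms

# Fully developed fault: low flow, hot room, low duct pressure,
# fan at full speed, reduced chiller power (hypothetical states).
print(diagnose_stuck_closed_ra(psf_q=+1, ps=-1, tia=+1, usf=+1,
                               qsa=-1, pch=-1, after_work_hours=True))
```

The final magnitude-sum clause blocks the all-neutral case, so the fault origin is only proposed when at least one node has actually deviated.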
In general, the fewer the neutral values (0's) among the nodes, the more positive the diagnosis output. For a less intrusive and hence less confident diagnosis without the information about Tia and Ps, the rule can be simplified as

IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (Qsa = -1) THEN alarm for low flow rate of the supply air
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (toffwork < t < tahuoff) THEN alarm for stuck-closed recirculation air damper
The components and the structure of a digraph are generally determined by the control logic of the system and the priority of actions. As an uncontrolled variable and a major means by which the control system compensates for deficiencies in other equipment, the power input of the supply fan responds quickly to this fault. In addition, power input is the primary index in this approach. Therefore, the rule for the supply fan power should be the first condition to be checked, i.e., this rule is the outermost layer in this diagnosis structure. Rules for other variables, if available, should then be nested under this condition. Note that although the chiller power is also affected, in the test building the power model for the two-stage reciprocating chiller is not sensitive to load changes during the occupied period of the building and hence not reliable for detection of this fault. Therefore, the rule for the chiller power is assigned the lowest priority in the rule set. If the chiller is sensitive to load changes, such as a chiller driven by a VSD motor, then the full diagnosis can be nested in the clause for the chiller power. In practice, if a node is not expected to respond to a fault with a recognizable pattern under certain conditions, the node can be assigned a default value of zero.
It can be seen from this example that for a given fault, the nesting structure must
be designed carefully based on the control logic as well as the components'
characteristics of the given system so that the execution of the outer rules never leads to
missing of diagnosis by the inner rules. If a node with lower sensitivity is used in the
outer layer, then its status may block the detection with other nodes in the inner layers
that are more sensitive to the fault, as shown in the above example. System deterioration
and energy waste may become even worse with inappropriate design of rule structure if
the other nodes remain neutral with the compensation by extra power consumption,
especially with a degradation fault. For example, as shown by Fig. 6.12(a), a pressure
sensor offset in summer may lead to an increase of power input of the supply fan and the
chiller as well but may not cause further violations of nodes maintained by local feedback
control, such as the indoor air temperature. Although the abnormal operation can be
easily found by the node for the fan power model, the excessive power input of the chiller
may not be found until the fault becomes very serious if the chiller power model can not
produce a detectable pattern, as has been observed with the HVAC system in the test
building. If the chiller power node is used in the outer layer, then its status will block the
detection by the fan power node which can be used to find the fault at an early stage.
In order to prevent missed rule executions in a nested structure, which may be
caused by node sensitivity, time delays between nodes, compensation by a local feedback
loop, or mode switches, the diagnosis is designed as a nested structure that generally
develops in depth with the increasing severity of the fault. Generally, in a given structure,
the sensitivity of nodes to the fault decreases while the confidence about the fault origin
increases with more violations as the layer grows. Once an alarm is issued from an end
node, it will be recorded and at the same time further violations are also checked until a
fault origin is identified. In case no possible cause can be matched to all the errors, the
deviations will be stored for further analysis by experts.
Figure 6.12. Developed digraphs for two typical faults in common air handling units:
(a) static pressure sensor offset; (b) slipping fan belt.
Fig. 6.12 shows the digraphs for two other typical faults related to abnormal fan
power input: offset in the static pressure sensor and slippage of the fan belt. Here usp, τ, and Psf.u
define the status of the static pressure sensor output, the fan belt tension, and the fan
power vs. motor speed model. Ps is equal to usp if the pressure sensor is in normal
operation.
Typically caused by a leak in the pneumatic signal tube, a pressure sensor offset
requires higher motor speed and hence excessive fan power consumption. When the fault
becomes very serious and the VAV control for the air flow is saturated, more air will be
processed by the cooling coil and then delivered into the conditioned space, leading to
higher chiller power input and lower room air temperature. The rule can be developed as
follows.
IF (Psf.Q = +1) THEN alarm for excessive power input of the supply fan with air flow rate
IF (Tia = -1) THEN alarm for low room air temperature
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF ((Qsa ≠ -1) AND (n Qsa Tia) AND (p Qsa Pch)
    AND (usp ≠ +1) AND (n usp usf) AND (p usf Psf.Q)
    AND (toffwork < t < tahuoff))
THEN usp(-1) is a possible fault origin
It should be noted that although Qsa increases with fan power, Qsa = +1 can not be
used as an independent clause as no upper limit is typically set for the supply air flow
rate.
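The nested rule set above can be encoded as ordinary conditional logic. The following is an illustrative sketch only: the node names mirror the digraph, but the qualitative status encoding, the p/n consistency predicates, and the function itself are assumptions for illustration, not the implementation developed in this work.

```python
# Qualitative node status: +1 = above normal, -1 = below normal, 0 = neutral.
# p(x, y) / n(x, y) check that two node statuses are consistent with a
# positive / negative branch of the digraph; a neutral node never conflicts.

def p(x, y):  # consistent with a positive influence
    return x == 0 or y == 0 or x == y

def n(x, y):  # consistent with a negative influence
    return x == 0 or y == 0 or x == -y

def diagnose_sensor_offset(node, t, t_offwork, t_ahuoff):
    """Nested rule set for a static pressure sensor offset (Fig. 6.12(a)).

    `node` maps node names (e.g. 'Psf.Q' = fan power vs. air flow model)
    to qualitative statuses.  Alarms are issued as soon as inner nodes are
    violated; the fault origin is reported only when the full set holds.
    """
    alarms = []
    if node['Psf.Q'] == +1:
        alarms.append('excessive supply fan power vs. air flow')
        if node['Tia'] == -1:
            alarms.append('low room air temperature')
        if node['usf'] == +1:
            alarms.append('high supply fan speed')
        if node['Pch'] == +1:
            alarms.append('excessive chiller power input')
        if (node['Qsa'] != -1 and n(node['Qsa'], node['Tia'])
                and p(node['Qsa'], node['Pch'])
                and node['usp'] != +1 and n(node['usp'], node['usf'])
                and p(node['usf'], node['Psf.Q'])
                and t_offwork < t < t_ahuoff):
            alarms.append('possible fault origin: usp(-1)')
    return alarms
```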
With the minimum amount of measurements and control signals, i.e., without Tia
and usf, the rule can be simplified as,
IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF (Qsa ≠ -1) THEN alarm for static pressure sensor offset
As a degradation fault, a slipping fan belt results in reduced resistance to the shaft
and hence lower fan power input at a given motor speed, which can be represented by the
fan power vs. motor speed model Psfu. On the other hand, with the decreased
transmission efficiency between the motor drive and the fan shaft, the motor needs to run
at a higher speed and hence a higher power input to provide the required air flow rate. If
the air flow rate can not be maintained, then the chiller power input will decrease with the
reduced load and the room air temperature will increase due to insufficient supply air. All
these can be summarized in the following rule set.
IF (Psf.u = -1) THEN alarm for reduced power input of the supply fan with motor speed
IF (Psf.Q = +1) THEN alarm for excessive consumption of the supply fan
IF (usf = +1) THEN alarm for high speed of the supply fan
IF (Ps = -1) THEN alarm for low static pressure
IF (Pch = -1) THEN alarm for reduced power input of the chiller
IF (Tia = +1) THEN alarm for high room temperature
IF ((Qsa ≠ +1) AND (n Qsa Tia) AND (p Qsa Pch)
    AND (Ps ≠ +1) AND (n Ps usf) AND (p usf Psf.Q))
THEN τ(-1) is a possible fault origin
Since the most direct and significant effect of a slipping fan belt is the reduced
resistance and hence reduced effort to run the fan at a given speed, the supply fan power
input is more sensitive to the motor speed than to the air flow rate. Therefore, in the
inference structure for the slipping fan belt, the node for the fan power vs. motor speed is
used in the outermost layer, i.e., as the first essential condition to start the diagnosis of this
fault.
With the minimum amount of measurements or control signals, i.e., without Tia,
usf, Ps, and Pch, the rule can be simplified as,
IF (Psf.u = -1) THEN alarm for reduced power input of the supply fan with motor speed
IF (Psf.Q = +1) THEN alarm for excessive power input of the supply fan
IF (Qsa ≠ +1) THEN alarm for slipping fan belt
From the above analysis of fault origin, it can be seen that the enhanced structured
SDG inference technique enables flexible yet reliable control of the output of the
detection and diagnosis of defects and malfunctions according to the available
information. Alarms are issued once new abnormalities are detected, but identification of
the specific cause is delayed until the complete rule set for the fault is met. Such a
sequence control has two major advantages: maximum rule violations and minimum
false/missed alarms with high-resolution diagnosis.
First, by allowing alarms for each node with appropriate constraints in the
pathways, faulty operation can be reported at an early stage. When a fault occurs, as an
uncontrolled variable, power input of equipment with abnormalities is alarmed by the
most sensitive power model as the end node in an SDG earlier than other variables.
Although early alarms may not require immediate actions, they inform the operator of the
health of the system and provide the flexibility for appropriate actions before the final
alarm if necessary. When a controlled variable is found to shift beyond the tolerance from
its setpoint, alarms are issued for immediate attention.
Second, with the maximum number of violations, the unique inference rule set
with appropriately designed structure for each specific fault helps to ensure reliable
diagnosis and enables recognition among faults that share common pathways at the upper
levels and separate from each other when the pathways reach the lower levels toward the
origin. For example, in the above diagnosis of the stuck-closed recirculation air damper
in the test building, power consumption of the supply fan was first found to be about four times
higher than its expected value by the model from 8:26 a.m. At the same time, the static
pressure was around 0.1 in., much lower than its setpoint of 1.2 in. In addition to a stuck-closed damper, the other two typical faults, static pressure sensor error and fan belt
slippage, as shown in Fig. 6.12, may also result in excessive fan power consumption as a
function of air flow rate and a low static pressure signal in an HVAC system. It can be
seen from the digraphs that all the three faults share some common abnormalities in the
fan power vs. airflow rate model, fan speed, and the static pressure in the duct. From
8:41 a.m., the indoor air temperature went over the upper limit of the expected range of
72.5 ± 2.5 °F and stayed around 80 °F, which eliminated the possibility of the pressure
sensor fault. In practice, as the setpoints of the static pressure and the indoor air
temperature were not met, early actions might be needed to remove the fault. Although
these phenomena could be caused by either a stuck-closed damper or a slipping belt, as a
typical degradation fault, a slipping fan belt is not expected to lead to a sudden jump in
the fan power input with such a large magnitude. Therefore, if the room is occupied and
needs quick repair, the status of the dampers should be inspected first. For automatic
identification of the fault origin, further violations are also necessary to distinguish
between a stuck-closed outdoor air damper and a stuck-closed recirculation air damper.
In the test building, the full diagnosis was completed after 5 p.m. when the outdoor air
damper was closed by the control system as the building was no longer occupied for the
day and the fans were only to circulate the return air in the building until the AHU was
shut down at 10 p.m. With the closing of the outdoor air damper by the control signal,
the flow rate of the supply air dropped significantly, which is not likely to happen with a
slipping fan belt or a stuck-closed outdoor air damper, indicating the stuck-closed
recirculation air damper as the error source. As the leaks in the outdoor air damper
became the only openings for the incoming air, the alarm for extremely low air flow rate
was activated and the fault origin was reported. In the test building, a value of 500 CFM
was used as the lower limit of the supply air flow rate.
Fig. 6.13 illustrates the process of the SDG rule development for a typical fault in
the water loop of an air handling unit: a cooling coil fouled with a thickness δcc of
sediments on the water side.
Figure 6.13. Developed digraph for the fault of a fouled cooling coil (water-side).
Here Tsa, Tia, uv.cc, Qcw, ucp, and Pcp represent deviations in the supply air
temperature, room air temperature, stem position of the cooling coil valve, flow rate of
the chilled water, motor speed of the chilled water pump, and power consumption of the
chilled water pump.
The power consumption of the pump is modeled as a function of water flow rate
or control signal of the valve. The chiller power model can be defined as a function of
such load conditions as outside air temperature or chilled water temperatures when a
VSD is used or by checking the on/off cycling frequency if the chiller is under stepwise
control. With the increased thermal resistance due to the fouling δcc, the heat transfer
efficiency will be reduced, which then may lead to an increase of the supply air temperature
Tsa and the indoor air temperature Tia if the control is saturated. The cooling coil valve
control signal uv.cc and hence the chilled water flow rate Qcw are then increased to
compensate for the deviation in Tsa. With the increased cooling load due to more chilled
water to be processed, the chiller power Pch may go beyond its normal range under the
same load conditions. Meanwhile, the pump speed ucp and hence the pump power
consumption Pcp may also be higher than normal with the increased flow rate of water to
be delivered and the increased flow resistance when the flow path is restricted due to the
fouling. The rule for the fault in this configuration is determined as follows.
IF (Pcp = +1) THEN alarm for excessive power input of the chilled water pump
IF (Pch = +1) THEN alarm for excessive power input of the chiller
IF (Tsa = +1) THEN alarm for high supply air temperature
IF (Tia = +1) THEN alarm for high room air temperature
IF ((Tsa ≠ -1) AND (p Tsa Tia) AND (p Tsa uv.cc) AND (p uv.cc Qcw) AND (p Qcw Pch)
    AND (p ucp Pcp)
    AND ((ucp ≠ -1) OR (p Qcw ucp))
    AND (|Tsa| + |Tia| + |uv.cc| + |Qcw| + |ucp| ≠ 0))
THEN δcc(+1) is a possible fault origin
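The digraph itself can be held as a small data structure of signed edges. The sketch below rests on assumed semantics: an edge sign of +1 (or -1) means the downstream node is expected to deviate in the same (or opposite) direction as the upstream node, and a neutral node never contradicts an edge. The edge list follows the fouled-coil description above, but the names and the helper function are hypothetical.

```python
# Signed edges for the water-side fouled cooling coil digraph (Fig. 6.13).
# d_cc is the fouling thickness; the other names follow the text.
EDGES = {
    ('d_cc', 'Tsa'): +1,    # fouling -> higher supply air temperature
    ('Tsa', 'Tia'): +1,
    ('Tsa', 'uv.cc'): +1,   # control opens the cooling coil valve
    ('uv.cc', 'Qcw'): +1,
    ('Qcw', 'Pch'): +1,
    ('Qcw', 'ucp'): +1,
    ('ucp', 'Pcp'): +1,
}

def consistent(status, edges=EDGES):
    """True if every pair of non-neutral node statuses agrees with its edge sign.

    `status` maps node names to qualitative values in {-1, 0, +1};
    missing nodes default to 0 (neutral) and are skipped.
    """
    for (src, dst), sign in edges.items():
        s, d = status.get(src, 0), status.get(dst, 0)
        if s != 0 and d != 0 and d != s * sign:
            return False
    return True
```

A pattern of observed deviations that cannot be traced consistently through these edges rules out δcc as the origin, which is exactly what the rule's pathway clauses encode.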
6.6 Rules for detection and diagnosis of typical faults in common air handling units
Rule development for fault identification with reference to the operating modes
has been studied by other researchers [House et al., 2001]. In this section, rules are
developed in a similar manner with power consumption as the major index for diagnosis
of faults that are usually found in a typical VAV air handling unit as the one installed in
the test building. Potential violations of rules are summarized in a table and the rule set
for each fault can be established with the method presented in Section 6.5.
6.6.1 System description
The schematic diagram of the system is shown by Fig. A3 in the Appendix with
the following basic characteristics:
-Single-duct variable air volume system
-Temperature and humidity control of indoor air
-Chiller with constant speed motor drive*
-Hot and chilled water pumps with constant speed motor drive *
-Supply fan control: constant static pressure setpoint
-Return fan control: flow rate tracking of the supply fan
6.6.2 Expert rules
6.6.2.1 General rules
Although the data available for fault detection and diagnosis may vary among
different systems, the basic functions of air conditioning processes should be practiced
and hence the general rules should be met in common HVAC systems under any normal
operating condition. Since fans driven by electrical power are in operation under different
modes, the following rules based on the design intent related to the power consumption
and controls of fans can be set up for common HVAC systems.
Rule A. |Psf - Psf.m| ≤ ΔPsf.ci
Rule B. Psf ≥ Psf.min
Rule C. For usf.max - usf ≥ esf, |Tspt.ia - Tia| ≤ et.ia + ΔTdb
Rule D. usf.max - usf ≥ esf
Rule E. Psf > Prf
Rule F. Prf ≥ Prf.min
Rule G. (urf/usf)min ≤ urf/usf ≤ (urf/usf)max
Rule A indicates that the random deviation of the power input Psf of a VSD fan from
its expected (modeled) value Psf.m should be less than a value defined as a confidence
interval ΔPsf.ci, which is determined based on a given confidence level. Rules B and F
require the power drawn by the fans to be larger than their minimum values when the fan
is in operation. Rule C means the indoor air temperature Tia should be maintained around
its setpoint Tspt.ia within the offset caused by random error et.ia and the deadband or a
tolerance range of ΔTdb, with the fan speed control signal lower than its maximum usf.max.
Violation of Rule C means the supply fan is not able to maintain the indoor air
temperature and will lead to an alarm. esf and et.ia can be determined by the user based on
specifications of the related device. Rule D is a warning for the capacity of the supply
fan. Continuous violations of Rule D indicate either the fan is not properly designed or a
serious fault occurred in the air loop, which can be a pressure sensor error, an unexpected
increase of resistance, or a fault in the mechanical transmission of the fan itself. Control
of a return fan is usually coupled with the supply fan with different indices: air flow rate,
fan speed, or room static pressure. In any case, power input of a return fan should be
lower than that of the related supply fan. Rule G specifies the appropriate range of the
control signals for the return-supply fan match. Undesirable working status of the fans
can be detected by this rule, such as belt slippage of the supply fan.
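Several of these general rules reduce to simple threshold tests on the power readings. A minimal sketch of Rules A, B, E, and F, assuming scalar power values in consistent units (the function and argument names are invented for illustration):

```python
def check_fan_rules(Psf, Psf_m, dPsf_ci, Psf_min, Prf, Prf_min):
    """Return the labels of violated fan rules.

    Psf / Prf : measured supply / return fan power
    Psf_m     : modeled (expected) supply fan power
    dPsf_ci   : confidence interval on the model residual (Rule A)
    """
    violations = []
    if abs(Psf - Psf_m) > dPsf_ci:   # Rule A: residual outside confidence interval
        violations.append('A')
    if Psf < Psf_min:                # Rule B: supply fan power below minimum
        violations.append('B')
    if not Psf > Prf:                # Rule E: supply fan must draw more than return fan
        violations.append('E')
    if Prf < Prf_min:                # Rule F: return fan power below minimum
        violations.append('F')
    return violations
```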
In HVAC systems, frequent on/off cycling of some equipment may lead to quick
deterioration of the equipment and waste of energy, such as a reciprocating chiller.
Therefore, the total number of cycles each day is usually limited under a threshold. Rules
H and I provide the constraints for this purpose.
Rule H. ncdch ≤ ncdch.max
Rule I. ncdb ≤ ncdb.max
To prevent oscillating operation of the components in a system that can result in
quick damage to the devices, Rule J is introduced by checking the computed standard
deviation of the sampled power data against a trained threshold. Note that such a rule can
be used for the power data of a component and the system as well.
Rule J. sP ≤ sP.max
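Rule J can be sketched as a windowed standard-deviation test. The window length and the threshold sP.max would have to be trained per component; the function name is illustrative.

```python
import statistics

def rule_j_violated(power_window, sP_max):
    """Rule J: flag oscillating operation when the sample standard deviation
    of a window of power readings exceeds the trained threshold sP_max."""
    return statistics.stdev(power_window) > sP_max
```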
Temperature control is the basic function of a common HVAC system and the
related rules can therefore be applied if only the system is in operation. Rules K and L
provide constraints for the indoor air temperature and the supply air temperature
respectively.
Rule K. |Tia - Tspt.ia| ≤ et.ia + ΔTdb.ia
Rule L. |Tsa - Tspt.sa| ≤ et.sa + ΔTdb.sa
6.6.2.2 Rules under different operating modes
As discussed before, detectability of a fault is often affected by the current
operating mode of the system. Separated by the related thermal processes of the system,
the modes provide a clear structure for implementation of the appropriate rules.
Mode 1. Heating
In this mode, outdoor air temperature is lower than a setpoint Tbp used for the
control of the HVAC system plus a sensor error et.oa. Rules 1 and 2 define the current
operating mode and a constraint for the basic function of the system in this mode, i.e., the
supply air temperature should not be lower than the return air temperature. Violation of
Rule 1 can be caused by an error in the outdoor air temperature and indicates a
discrepancy between the mode and the expected outdoor air temperature.
Rule 1. Toa < Tbp + et.oa
Rule 2. Tsa ≥ Tra
The supply air is heated by hot water in a heating coil before entering the
conditioned space. Therefore, the hot water pump and the boiler should be running with
appropriate power input while the chilled water pump and the chiller should be shut
down. Rules 3 and 4 provide the corresponding constraints for the power input of the
related equipment. The minimum power input Php.min and Pb.min can be obtained from the
design specifications. Rule 5 is to prevent frequent on/off cycling of the boiler. Control
signals for the heating process should keep the heating coil valve open and the cooling
coil valve closed, as described by Rules 6 and 7. The minimum values for the time
duration of continuous operation of the boiler ontdb.min, the valve position uv.hc.min, and the
errors ep.cp, ev.cc, and ep.ch can be specified by the user based on the capacity range of the
equipment. Violations of these rules indicate inappropriate thermal processes are
involved.
Rule 3. Php ≥ Php.min and Pb ≥ Pb.min
Rule 4. Pcp ≤ ep.cp and Pch ≤ ep.ch
Rule 5. ontdb ≥ ontdb.min
Rule 6. uv.hc ≥ uv.hc.min
Rule 7. uv.cc ≤ ev.cc
With the low outdoor air temperature, the outdoor air damper should be at the
minimum open position to save energy needed for heating. Violation of Rule 8 indicates
a too high or a too low fraction of the outdoor air.
Rule 8. |Qoa/Qsa - (Qoa/Qsa)min| ≤ eQ
Mode 2. Cooling with outdoor air
In this mode, outdoor air temperature is higher than the temperature of the balance
point but lower than the supply air temperature minus the temperature rise due to the
supply fan. Rules 9 and 10 define the current operating mode and a constraint for the
basic function of the system in this mode, i.e., the supply air temperature should not be
higher than the return air temperature.
Rule 9. Tbp + et.oa ≤ Toa ≤ Tspt.sa - ΔTsf + et.oa
Rule 10. Tsa ≤ Tra
When the outside air temperature is within the range defined by Rule 9, free
cooling can be realized by mixing the cool outdoor air with the return air to achieve the
design temperature of the supply air through the coupled modulation of the outdoor air
and return air dampers. Hence, the equipment for mechanical heating or cooling at the
plant should be shut down as shown by Rules 11 and 12 for the zero power input of the
chiller and the boiler and the related pumps as well.
Rule 11. Pcp ≤ ep.cp and Pch ≤ ep.ch
Rule 12. Php ≤ ep.hp and Pb ≤ ep.b
Rules 13 and 14 indicate that control signals should keep the related valves closed
as no chilled or hot water is needed.
Rule 13. uv.hc ≤ ev.hc
Rule 14. uv.cc ≤ ev.cc
Rule 15 indicates the outdoor air supply should meet the minimum requirement
by design, though the outdoor air damper is modulating in this operating mode.
Rule 15. Qoa/Qsa > (Qoa/Qsa)min - eQ
Mode 3. Mechanical cooling with 100% outside air
When the outdoor air temperature is between the supply air temperature minus the
temperature rise due to the supply fan and the return air temperature, the outdoor air
damper is set to 100% open and the recirculation air damper is 100% closed. Rule 16
defines the current operating mode and Rule 17 indicates the supply air temperature
should not be higher than the return air temperature.
Rule 16. Tspt.sa - ΔTsf + et.oa ≤ Toa ≤ Tspt.ia + ΔTrf + et.oa
Rule 17. Tsa ≤ Tra
Since the outdoor air temperature is higher than the supply air temperature, the
chiller and the chilled water pump should be in operation to provide mechanical cooling
for the supply air while the boiler and the hot water pump should be turned off. Rule 18
provides the power constraints for the cooling equipment. Rule 19 means the power input
of the heating components should be zero. Rule 20 gives the minimum continuous
operating time of the chiller once it is turned on.
Rule 18. Pcp ≥ Pcp.min and Pch ≥ Pch.min
Rule 19. Php ≤ ep.hp and Pb ≤ ep.b
Rule 20. ontdch ≥ ontdch.min
Rule 21. For (Toa - Tspt.sa) ≤ ht.oa (Tspt.ia - Tspt.sa), offtdch ≥ offtdch.min
In addition, it has been found that when the outdoor air temperature is within a
range that is closer to the lower bound than to the upper bound in this mode, the chiller
runs at a near constant off-on interval which is longer than that when the outdoor air
temperature is out of this region. Such a range can be defined by the difference in the
outdoor air temperature and the supply air temperature setpoint normalized to the
difference between the indoor and supply air temperature setpoints as shown in Section
6.5. Rule 21 provides such a constraint for detection of the related faults, such as the
leaking recirculation air damper.
The control signals should keep the heating coil valve closed and cooling coil
valve open, as represented by Rules 22 and 23.
Rule 22. uv.hc ≤ ev.hc
Rule 23. uv.cc ≥ uv.cc.min
With 100% outdoor air, Rule 24 can be used to identify faults in damper positions
and the related control signals.
Rule 24. |Qoa/Qsa - 1| ≤ eQ
Mode 4. Mechanical cooling
The outdoor air temperature is higher than the return air temperature, as shown by
Rule 25, and the supply air temperature should be lower than the return air temperature.
Rule 25. Toa ≥ Tspt.ia + ΔTrf + et.oa
Rule 26. Tsa ≤ Tra
Under the mechanical cooling condition, the chiller and the chilled water pump
should be turned on, as represented by the minimum power input constraints shown in
Rule 27. Boiler and hot water pump should be out of operation with zero power input, as
stipulated by Rule 28. Rule 29 sets the constraint for the continuous operating time of the
chiller.
Rule 27. Pcp ≥ Pcp.min and Pch ≥ Pch.min
Rule 28. Php ≤ ep.hp and Pb ≤ ep.b
Rule 29. ontdch ≥ ontdch.min
Rules 30 and 31 indicate that the heating coil valve should be closed while the
cooling coil valve should be open by the control signals.
Rule 30. uv.hc ≤ ev.hc
Rule 31. uv.cc ≥ uv.cc.min
Rule 32 shows that with the high outdoor air temperature, the amount of the
outdoor air is limited to the design minimum value to save cooling energy.
Rule 32. |Qoa/Qsa - (Qoa/Qsa)min| ≤ eQ
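The four operating modes are all delimited by comparisons against the outdoor air temperature, so mode selection can be sketched as a chain of tests following the boundary conditions in Rules 1, 9, 16, and 25. The function and parameter names are illustrative; the setpoints and tolerances would come from the system's design data.

```python
def operating_mode(Toa, Tbp, Tspt_sa, Tspt_ia, dTsf, dTrf, et_oa):
    """Classify the operating mode implied by the outdoor air temperature Toa.

    Tbp      : balance-point temperature used for mode control
    Tspt_sa  : supply air temperature setpoint
    Tspt_ia  : indoor air temperature setpoint
    dTsf/dTrf: temperature rise across the supply / return fan
    et_oa    : outdoor air temperature sensor tolerance
    """
    if Toa < Tbp + et_oa:
        return 1  # heating (Rule 1)
    if Toa < Tspt_sa - dTsf + et_oa:
        return 2  # cooling with outdoor air, free cooling (Rule 9)
    if Toa < Tspt_ia + dTrf + et_oa:
        return 3  # mechanical cooling with 100% outside air (Rule 16)
    return 4      # mechanical cooling with minimum outside air (Rule 25)
```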
[Table: potential rule violations for typical faults in a common air handling unit; the
individual violation marks are not recoverable from the extracted text. Faults covered:
stuck cooling coil valve, leaking cooling coil valve, unstable controller***, static
pressure sensor offset, slipping supply fan belt, stuck recirculated air damper, stuck
outdoor air damper, leaking mixing box damper(s), hot water circulating pump fault, hot
water supply temperature too low, fouled heating coil, undersized heating coil, stuck
heating coil valve, leaking heating coil valve, chilled water not available, chilled water
circulating pump fault, chilled water supply temperature too high, fouled cooling coil,
undersized cooling coil, outdoor air temperature sensor error**, return air temperature
sensor error, supply air temperature sensor error.]
* When variable load control is used, power consumption of the chiller, the boiler, and
the pumps can also be correlated with appropriate variables such as water temperatures,
water flow rate, or valve control signals. The resulting functions can be used for detection
of related faults, e.g., valve leakage.
** Theoretically, if the OA temperature sensor is not working properly, the working
mode of the system could be totally different. Therefore, depending on the severity and
direction of the fault and the priority of the controlled variable in the logic design, this
fault may violate all the rules or some of them. For example, if the error in mode 1
results in an extremely high value of the OA temperature signal, then the control mode
may be in Mode 4 with mechanical cooling. Then all the rules are violated. If the error
just gives an even lower OA temperature in mode 1, then only Rules 2, 3, 5, 6, and 8
might be violated.
*** An unstable controller tends to result in a more consistently recognizable pattern in
power consumption. Other non-controlled variables that are available for analysis may
also be used to help to identify the source of the power oscillation. For example, unstable
control of a supply fan causes large variance not only in the power input of both the
system and the fan but also in the fan speed signal, the supply air flow rate, and the static
pressure in the duct.
6.7 Discussion and conclusions
In order to find the fault origin with least intrusion into the system, the diagnostic
tool should be built with a flexible structure to produce reliable output from the varying
amount of available information in different systems. For fault diagnosis with power
consumption as an end effect and with limited basic measurements in HVAC systems, the
diagnosis needs to be implemented by tracing the observed effects back to the defect
components or system malfunctions without monitoring the input and output of each
component.
In this chapter, different techniques are first studied for fault diagnosis when
abnormal behaviors are observed in a system. Two approaches are introduced for
diagnosis with different levels of knowledge about the system, the shallow reasoning
technique for evidentially oriented systems and the deep inference technique based on a
structural and functional model of the problem domain. For a system with little
information other than the total power consumption and the design information of the
system, the shallow diagnostic method can be used for indication and general analysis of
a fault in the system. Although it imposes virtually no intrusion into the system, the
shallow reasoning method generally is not able to lead to a fault origin that is specific
enough for correction.
With basic measurements and control signals of the system, the causal search
method for tracing the observed abnormal behavior to its origin is more applicable and
reliable than the other approaches for efficient and least-intrusive diagnosis of faults in
HVAC systems. The deep reasoning approach is virtually a model that can derive its own
behavior for a given set of parameters and signals and predict the effects of changes in
them. In this thesis, a reasoning method based on the signed directed graph (SDG) is
developed to produce reliable alarms for faults that are typical of the common HVAC
systems. Derived from the system's structure and control logic, a digraph is composed of
nodes representing the status of variables or control signals based on appropriate models
or thresholds and branches for the propagation of the effects of the fault.
One major advantage of the SDG inference technique is that it allows multiple
pathways in the propagation of the fault, which maximizes the number of violations of a
fault in the given system and hence helps to improve the resolution of the diagnosis. In
addition, some critical issues regarding the operating mode and time delay in the
propagation of a deviation along a pathway can be handled by the inference logic. With
the modified concise form for the logic consequence, the rules can be directly derived
from a digraph without missing or repeated rule executions.
In this study, the inference structure is further improved with more flexible
control of the rules by allowing zero violations of controlled variables regardless of the
deviations in the consequent variables. This innovation not only prevents missed
alarms when power consumption is the major criterion, but also greatly improves the
flexibility of the rule structure design so that the diagnosis can be applied to systems
under different conditions or operating modes or with varying amounts of available
information. Moreover, by including the controlled variables in the inference, urgent
alarms will be issued promptly for faults leading to serious deterioration of the system's
function that requires immediate attention. Faults that result in degradation of the
system's performance can be monitored and alarmed continuously.
Although it has been verified to be an effective approach for fault diagnosis in
HVAC systems, this SDG rule-based causal search method has limitations in two
respects. One is that it can not locate the specific origin of a fault that rarely occurs in
HVAC systems and hence can not be diagnosed with the knowledge base, though all the
abnormal behaviors caused by such a fault are reported for further analysis by an expert.
The other limitation is that fault diagnosis is based on the assumption that at most one
fault exists at a time in one loop, i.e., the water or the air loop, of the monitored system.
This is because the interactions among multiple faults in the same loop still can not be
appropriately distinguished by power consumption with limited measurements and
control signals. But it is still possible to distinguish between faults in different loops if the
power models are obtainable for each loop.
CHAPTER 7
Summary
7.1 Review
A fault in an HVAC system means an undesired state or position of a component
that causes deterioration of the system's performance when the component is in
operation. The effects can be categorized as excessive energy consumption, undesirable
indoor air quality, and degradation or failure of components. Sometimes, the temperature
and humidity of the indoor air can be maintained through the compensation by the local
feedback loop for the fault, but it is always at the expense of extra energy consumption.
The power input required by the system due to the existence of a fault can be
significantly higher than the value under normal operation. If the fault is not properly
found and treated in time, the indoor air quality may also become unacceptable, either as
the increasing severity of the fault saturates the local control, such as with a slipping fan
belt, or when the fault does not cause deviations monitored in any feedback loop in the system,
such as with a broken fan belt. At the same time, components involved in the propagation of
the fault may be damaged due to unexpected actions. Therefore, in order to achieve
expected air quality with efficient energy use, faults in an HVAC system must be
identified and removed appropriately.
However, detection and diagnosis of a fault is often a difficult task for the human
operator without a proper tool or method if a complex system is to be dealt with, such as
a commercial HVAC system consisting of multiple interactive components. It is therefore
necessary to develop an effective method that can be used to find abnormal behaviors and
then search for the fault origin in the system. Although faults may occur in diverse ways
for different applications, it is often desired that the implementation of the detection and
diagnosis should cause as little interruption to the system as possible. Also, the FDD
program should be based on the analysis of parameters that are common in different
HVAC systems. In addition, the decision for removal or repair of a faulty component is
usually based on the evaluation of benefit vs. cost, especially in the case of a degradation
fault that is unavoidable after some period of operation. Hence an effective FDD method
needs to be developed that can be used not only to locate faults in the different systems
with least intrusion but also to demonstrate the cost effect in the system due to the fault
for the operator or the building energy management system to make appropriate
decisions.
Faults in an HVAC system usually lead to abnormal effects in its power
consumption. Moreover, as a common input of HVAC systems, electrical power data are
generally obtainable without significant intrusion into or interruption of the physical
processes in the system. Therefore, if the electrical power data can be properly processed
or modeled for the system and its components, detection and diagnosis of faults may be
fulfilled with negligible interruption of operation for different systems. By comparing the
power input based on the models against that from measurements, the effects of a fault on
165
energy cost can be estimated and the results are then used in decision-making for
maintenance.
In this thesis, based on a comprehensive study of the previous research, a new
methodology for detection and diagnosis of faults in HVAC systems has been developed.
With this approach, identification of faults is conducted by monitoring electrical power
consumption of the system and its components. Abnormalities in the power input of a system or a component caused by faults are detected as magnitude deviations or power oscillations, which then trigger the search for the fault origin.
Depending on the availability of the power submeters for electrically-driven
equipment, models are developed at two different levels, i.e., the system and the
component levels. Without power submeters, detection at the system level aims to check the on/off status of a device and the magnitude of its power consumption by detecting switching events and computing the corresponding changes in the total power data, testing the on/off cycles of major equipment against the constraints of their designed operating schedules, and tracking power oscillation in the system. Utilization of submeters enables the modeling of the equipment's power input as a function of basic measurements or control signals, which eliminates the need for on/off switches to find abnormal magnitudes and allows detection whenever the component is in operation.
In Chapter 2, a brief review of the theory for change detection is first conducted.
Based on the characteristics of the power data of a commercial HVAC system compared
to that in residential applications, the GLR (generalized likelihood ratio) method is
proposed for detection of changes in the power data. The GLR search for changes enables
detection without knowledge of the coming event by double maximization of
probabilities of the time and the magnitude of a change. With the normal distribution of
the noise in electrical power data, an executable form of the algorithm has been derived
for automatic detection.
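As an illustration, the double maximization over the change time and the change magnitude can be sketched as follows. The brute-force search, the Gaussian-noise assumption with a known standard deviation, and the threshold value are illustrative choices, not the implementation developed in this thesis.

```python
import numpy as np

def glr_change_detector(y, mu0, sigma, threshold):
    """GLR test for a jump in the mean of Gaussian data.

    For each candidate change time j, the jump magnitude that maximizes the
    likelihood is the post-change sample mean, which yields the closed-form
    statistic S(j,k) = (sum_{i=j..k}(y_i - mu0))^2 / (2*sigma^2*(k-j+1)).
    An alarm is raised at the first k where max_j S(j,k) exceeds the threshold.
    """
    n = len(y)
    for k in range(n):
        best = 0.0
        for j in range(k + 1):
            count = k + 1 - j
            s = np.sum(y[j:k + 1]) - mu0 * count
            best = max(best, s * s / (2.0 * sigma**2 * count))
        if best > threshold:
            return k          # alarm time (index into y)
    return -1                 # no change detected
```

Note that no knowledge of the post-change level is required; the magnitude is estimated inside the maximization, which is the property emphasized above.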
Although the post-event information is not required, the pre-event information is essential in the GLR detection. With the common application of variable-speed motor drives, as well as the unavoidable presence of noise and disturbances in the electrical power environment of an HVAC system, the total power changes continuously. This implies that the mean value of the total power consumption before any event must be continuously computed online to form a correct baseline for calculating the change of the
event. In Chapter 3, the two-window equation has been derived for the computation of
the pre-event mean in a moving window that follows the detection window where the
thorough search is conducted for the double-maximization of probabilities. Based on the
tests with data from several buildings, major innovations have been made for the
algorithms to improve the detection quality: the window reset technique eliminates repeated alarms for one event, while the updated standard deviation and the nonzero expected minimum help to prevent false or missed alarms.
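The two-window arrangement can be illustrated with a simplified mean-comparison detector. The window lengths, the threshold, and the reset step below are illustrative placeholders for the trained values discussed in Chapter 3.

```python
def two_window_detect(data, pre_len, det_len, threshold):
    """Two adjacent windows slide over the power data: a pre-event window
    supplies the running baseline mean, and a detection window that follows
    it in time is tested against that baseline.  After an alarm, both
    windows are skipped past the event (the 'window reset') so that one
    event does not trigger repeated alarms."""
    alarms = []
    k = pre_len + det_len
    while k <= len(data):
        pre = data[k - pre_len - det_len : k - det_len]
        det = data[k - det_len : k]
        mu_pre = sum(pre) / pre_len
        mu_det = sum(det) / det_len
        if abs(mu_det - mu_pre) > threshold:
            alarms.append(k - det_len)   # estimated change time
            k += pre_len + det_len       # window reset: skip past the event
        else:
            k += 1
    return alarms
```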
To minimize noise effects, a median filter based on an equal-probability search has been designed for the pre-processing of the power data. With the spikes and large fluctuations caused by random noise removed by the median filter, the false-alarm rate is greatly reduced.
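A basic sliding-window median filter illustrates why the median is preferred to a moving average ahead of a change detector: short spikes are removed while genuine step changes are preserved. The window width is illustrative, and the equal-probability search used in the thesis is replaced here by a simple sort.

```python
def median_filter(x, width=5):
    """Sliding-window median prefilter.  Spikes shorter than half the window
    width are suppressed, while a sustained step change passes through
    unsmeared, unlike with a moving-average filter."""
    assert width % 2 == 1, "use an odd window width"
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        window = sorted(x[lo:hi])
        out.append(window[len(window) // 2])
    return out
```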
In addition to the median filter, a multi-rate sampling technique is applied in the detection to identify changes or variations that are not all visible to a detector at any single sampling interval, owing to the varied on/off durations of different components in the system, data patterns transformed by noise, and oscillation around a certain period. For the on/off durations of different equipment, such as that driven by VSD or CSD motors, detection with multiple sampling rates is clearly more capable than detection at a single rate. In addition to covering different components, the multi-rate GLR detector can recognize events whose abruptness is blurred by dynamics or noise in the power data, which is very useful in matching the on/off cycles of a component.
Moreover, detection of power oscillation due to unstable control, which is found by tracking the standard deviation in the pre-event window, is also greatly enhanced, because oscillation with a specific period can be readily identified over a wide range of sampling intervals even though it may be invisible at certain individual intervals. The performance of the detector, developed with these innovations and equipped with the median filter and the multi-rate sampler, has been verified by tests with several typical faults related to abnormalities in the magnitude of on/off changes, the frequency of equipment cycling, or the trend of the power data of real HVAC systems. For practical
applications of the detector, Chapter 3 provides the guidelines for the training of the
related parameters, including window lengths, thresholds, and sampling rates. The work
in Chapter 3 demonstrates that with the improved GLR detector, centralized power data
can be explored to reveal the health of the system through the electrical power input of
the key components.
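The multi-rate idea can be sketched by running a single detector over decimated copies of the data and merging the alarms on the original time base. The decimation intervals and the toy edge detector below are illustrative, not the trained values or the GLR detector itself.

```python
def decimate(x, interval):
    """Resample by keeping every `interval`-th point (simple decimation)."""
    return x[::interval]

def multirate_alarms(x, detector, intervals=(1, 2, 4, 8)):
    """Run one change detector at several sampling intervals and merge the
    alarms, mapped back to the original time base.  Events whose duration
    or period hides them at one sampling rate can surface at another."""
    merged = set()
    for dt in intervals:
        for k in detector(decimate(x, dt)):
            merged.add(k * dt)   # map alarm index back to the base rate
    return sorted(merged)
```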
When the power input of equipment is obtainable by submeters, fault detection
and diagnosis can be conducted at a more specific level, i.e., the component level, as
presented in Chapter 4. A brief review of the algorithm of least-squares fitting through singular value decomposition is first conducted. For the modeling of power consumption
of a component, Chapter 4 addresses the selection of an appropriate reference parameter as the variable in the power function; this choice is important because the reference parameter is expected to yield an effective correlation for fault detection while minimizing the intrusion caused by the related measurements. Moreover, models are also
analyzed and developed for different system structures and their control logic. With the
confidence required for the detection, an interval can be set up for the upper and lower
limits of the deviation of the component's power consumption under a given value of the
reference parameter. Power models built at the component level therefore not only lead to
easier diagnosis with higher resolution, but also make it possible to conduct detection
continuously. This is especially useful when the equipment is driven by a VSD motor, which in some systems is rarely turned off. Also, detection of faults associated with cycling frequency and oscillation becomes more straightforward with submetered power data, as shown. However, it should be noted that the detectability of a fault is sometimes determined by the equipment's sensitivity to load variations, which in turn affects the capability of the power model. For example, a stuck-open outdoor air damper tends to cause an increase in the chiller's power input, especially under the
air damper tends to cause an increase in the chiller's power input, especially under the
operation mode of mechanical cooling with minimum outdoor air. However, with a two-stage reciprocating chiller such as the one in the test building, it is difficult to find the fault from the chiller power because of its stepwise response to load changes.
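A component-level power model of the kind described above can be sketched as a least-squares polynomial fit (NumPy's `lstsq` solver is SVD-based) with detection limits set from the residual scatter. The polynomial form, the reference parameter, and the 3-sigma band are illustrative choices, not the specific models of Chapter 4.

```python
import numpy as np

def fit_power_model(ref, power, degree=2):
    """Fit power = f(reference parameter) by least squares and return the
    coefficients plus the residual standard deviation used for detection
    limits.  np.linalg.lstsq solves the system through an SVD-based routine."""
    A = np.vander(ref, degree + 1)          # columns ref^d ... ref^0
    coef, *_ = np.linalg.lstsq(A, power, rcond=None)
    sigma = np.std(power - A @ coef)        # residual scatter
    return coef, sigma

def out_of_band(ref_value, measured_power, coef, sigma, z=3.0):
    """Flag a measurement outside the mu +/- z*sigma confidence band at the
    given value of the reference parameter."""
    predicted = np.polyval(coef, ref_value)
    return abs(measured_power - predicted) > z * sigma
```

In practice the band half-width `z*sigma` would be chosen from the confidence level required for detection, as discussed in Chapter 4.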
One major challenge for FDD with power consumption is that although the basic
measurements involved in the power models are generally available, submetered power
data may not be obtainable in common HVAC systems. Therefore, it is necessary to find
an appropriate approach to build the models without power submeters. Chapter 5
explores a feasible method to obtain power data by detecting the changes in the total
power controlled by manual shutdown of the related equipment. However, due to the
presence of noise and other events, the power data should be carefully selected for the
fitting to avoid deviation in the model itself. Two major factors have been found effective in screening the detected values: the relative magnitude and the standard deviation. With these two indices, errors in the detected changes are minimized, which makes the resulting model applicable for detection. Tests with an HVAC system have verified the feasibility of this approach, which has produced output matching that of models based on submetered power data.
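The screening step can be sketched as follows. The event-tuple layout and the two threshold values are hypothetical, standing in for the trained indices of relative magnitude and standard deviation described above.

```python
def screen_detected_changes(events, min_rel_mag=0.05, max_rel_std=0.25):
    """Screen step changes detected in the total power before using them to
    fit a component model.  Each event carries the detected change dP, the
    total power level P_total, and the pre-event standard deviation std.
    Events that are small relative to the total power, or that were detected
    against noisy pre-event data, are rejected."""
    kept = []
    for dP, P_total, std in events:
        if abs(dP) / P_total < min_rel_mag:
            continue                      # change buried in the total power
        if std / abs(dP) > max_rel_std:
            continue                      # pre-event window too noisy
        kept.append(dP)
    return kept
```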
With the deviations found in the power models, the fault origin is yet to be identified. Chapter 6 proposes a knowledge-based approach based on the analysis of a digraph for the common faults in HVAC systems. This method is capable of dealing with problems typical of HVAC systems, such as time delay. To find undesirable operation at an early stage, before the deterioration of controlled variables that local feedback might mask even in the presence of a fault, a further innovation is proposed for the inference logic that allows alarms to be raised progressively as violations accumulate. In
addition, with the flexible control of the nodes by the innovation, the logic structure can
be easily applied to systems with similar structure but different measurements available
for diagnosis. Based on the understanding of the structure and the control logic of the
system, the potential pathways and affected components are explored in the digraph,
which maximizes the violations available for the diagnosis. By using the simplified form
of the logic predicates, the rules can be extracted from the digraph efficiently in a concise
form.
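A minimal sketch of inference over a signed digraph is given below. The node names, edge signs, and single-fault assumption are hypothetical and far simpler than the inference logic of Chapter 6; the sketch only shows how consistent sign propagation singles out a candidate fault origin.

```python
# Hypothetical signed digraph: an edge (cause, effect, sign) means a positive
# deviation at `cause` pushes `effect` up (+1) or down (-1).
EDGES = [
    ("oa_damper_stuck_open", "mixed_air_temp", +1),
    ("mixed_air_temp", "cooling_coil_load", +1),
    ("cooling_coil_load", "chiller_power", +1),
]

def candidate_faults(observed, edges=EDGES):
    """Return root nodes from which every observed deviation (+1/-1 per
    node) is reachable with a consistent product of edge signs."""
    def reach(root):
        # forward-propagate the sign of a +1 deviation at `root`
        signs, changed = {root: +1}, True
        while changed:
            changed = False
            for u, v, s in edges:
                if u in signs and v not in signs:
                    signs[v] = signs[u] * s
                    changed = True
        return signs
    roots = {u for u, _, _ in edges} - {v for _, v, _ in edges}
    return [r for r in roots
            if all(reach(r).get(node) == dev for node, dev in observed.items())]
```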
The methodology developed in this thesis has been verified by tests with different
building energy systems. It is concluded that the detection and diagnosis of faults in
practical HVAC systems can be conducted based on this research with minimized
intrusion into the operating system.
7.2 Achievements and future work
In this thesis, the following tasks have been successfully fulfilled and verified
with tests in HVAC applications.
1). Derived and improved the algorithm for detection of changes in the power input of HVAC systems;
2). Designed two parallel detectors with an appropriate filter to pre-process the noisy
power data;
3). Developed a multi-scale sampling mechanism for the detector to identify changes or variations in the power data that cannot be found with single-rate sampling;
4). Established the complete set of training rules for the window length, thresholds, and
selection of sampling intervals for the change detector;
5). Set up submetered power models of components in HVAC systems to be used for
continuous monitoring and addressed the key issues in model development;
6). Developed an approach to obtain feasible power models for components without
submetered power data;
7). Designed the concise inference logic for rule development based on signed directed
digraph for enhanced identification of typical fault origins in HVAC systems.
Future improvements on the methods for fault detection and diagnosis in HVAC
systems may be achieved in three directions. First, as the models developed in this thesis
are primarily for the simulation of the actual operation of the monitored system, a model
for the design intent is desirable for identification of faults by a three-way comparison
among the design intent, the control signals, and the actual operation. Such a model may
be established for a given system based on optimization of energy consumption, indoor
air conditions, and other indices of concern. Second, an optimal search mechanism may be developed to distinguish multiple events with overlapping durations; a possible solution is a modified algorithm for Markov series with multi-rate sampling. Third, the open-ended inference capacity of the fault diagnosis method should be explored to replace closed deduction techniques where possible, which implies that a learning model of the system should be established.
References
Andelman, R., P. S. Curtiss, and J. F. Kreider. 1997. "Demonstration knowledge-based tool for diagnosing problems in small commercial HVAC systems."
Annex 25. 1996. "Building optimization and fault diagnosis source book." International Energy Agency Annex 25: Real Time Simulation of HVAC Systems for Building Optimization, Fault Detection and Diagnosis.
Asada, H., Federspiel, C. and Liu, S. 1993. "Human centered control." ASME Journal of
Dynamic Systems, Measurements, and Control, Vol. 115, No. 2B, pp. 271 - 280.
ASHRAE Handbook. 1996. HVAC Systems and Equipment. American Society of
Heating, Refrigerating and Air-conditioning Engineers. Atlanta, Ga.
Basseville, M. 1983. "Design and comparative study of some sequential jump detection
algorithms for digital signals." IEEE Transactions, ASSP, Vol. 31, pp. 521-534.
Basseville, M. and I. V. Nikiforov. 1993. Detection of abrupt changes: theory and
application. PTR Prentice Hall, Englewood Cliffs, New Jersey.
Benouarets, M., A. L. Dexter, R. S. Fargus, P. Haves, T. I. Salsbury, and J. A. Wright.
1994. "Model-based approaches to fault detection and diagnosis in air-conditioning
systems." Proceedings of Systems Simulation in Buildings, Liege, Belgium.
Brambley, M., R. Pratt, D. P. Chassin, and S. Katipamula. 1998. "Automated diagnostics
for outdoor air ventilation and economizers." ASHRAE Journal Vol. 40, No.10.
Braun, J. E., J. W. Mitchell, and S. A. Klein. 1987. "Performance and control characteristics of a large cooling system." ASHRAE Transactions, Vol. 93, pp. 1830-1852.
Breuker, M. S. and J. E. Braun. 1998. "Evaluating the performance of a fault detection
and diagnostic system for vapor compression equipment." International Journal of
HVAC&R Research, October, pp. 401-425.
Carlson, R. A. 1991. Understanding building automation systems: direct digital control,
energy management, life safety, security/access control, lighting, building management
programs. R. S. Means Co.
Chen, S. Y. S. and S. J. Demster. 1995. Variable air volume systems for environmental
quality. McGraw-Hill.
Davis, R. 1984. "Diagnostic reasoning based on structure and behavior." Artificial
Intelligence, Vol. 24, pp. 347-410.
Davis, R. and H. Shrobe. 1983. "Representing structure and behavior of digital
hardware." IEEE Computer, October, pp. 75-82.
Dodier, R., P. S. Curtis, and J. F. Kreider. 1998. "Small-scale on-line diagnostics for an
HVAC system." ASHRAE Transactions, Vol.104, Part lA, pp.530-539.
Dodier, R. H. and J. F. Kreider. 1999. "Detecting whole building energy problems."
ASHRAE Transactions, Vol.105, Part 1, pp. 579-589.
Draper, N. R. and H. Smith. 1981. Applied regression analysis. John Wiley & Sons.
Englander, S. L. and L. K. Norford. 1990. "VAV system simulation, Part 1: Development
and experimental validation of a DDC terminal box model." Proceedings of the third
international conference on system simulation of buildings, Liege, Belgium.
Golub, G. H. and C. F. Van Loan. 1989. Matrix computations. 2nd ed. Johns Hopkins University Press, Baltimore.
Friedrich, M., and M. T. Messinger. 1995. "Method to assess the gross annual savings
potential of energy conservation technologies used in commercial buildings." ASHRAE
Transactions, Vol.101, Part 1, pp.444-453.
Giarratano, J. C. 1998. Expert systems: principles and programming. PWS Pub. Co.
Han, C. Y., Y. Xiao, C. J. Ruther. 1999. "Fault detection and diagnosis of HVAC
systems." ASHRAE Transactions, Vol.105, Part 1, pp. 568-578.
Hart, G. W. 1992. "Nonintrusive appliance load monitoring." Proceedings of the IEEE, Vol. 80, No. 12, pp. 1870-1891.
Haves, P., T. I. Salisbury, and J. A. Wright. 1996. "Condition monitoring in HVAC
subsystems using first-principle models." ASHRAE Transactions, Vol. 102, Part 1,
pp. 519-539.
He, X. and Liu, S. "Modeling of Vapor Compression Cycles for Multivariable Feedback
Control of HVAC Systems", in print, ASME Journal of Dynamic Systems, Measurement,
and Control.
Hill, R. O. 1995. Applied change of mean detection techniques for HVAC fault detection and diagnosis and power monitoring. Master's thesis, Massachusetts Institute of Technology.
House, J. M., W-Y. Lee, and D. R. Shin. 1999. "Classification techniques for fault detection and diagnosis of an air-handling unit." ASHRAE Transactions, Vol. 105, Part 1, pp. 1087-1097.
House, J. M., H. Vaezi-Nejad, and J. M. Whitcomb. 2001. "An expert rule set for fault detection in air-handling units." ASHRAE Transactions, Vol. 107, Part 1.
Huber, P. J. 1981. Robust statistics. John Wiley & Sons.
Isermann, R. 1984. "Process fault detection based on modeling and estimation methods - A survey." Automatica, Vol. 20, No. 4, pp. 387-404.
Kao, J. L. and E. T. Pierce. 1983. "Sensor errors - effect on energy consumption." ASHRAE Journal, Vol. 25, No. 12, pp. 42-45.
Karki, S. H. 1999. "Performance factors as a basis of practical fault detection and
diagnostic methods for air handling units." ASHRAE Transactions, Vol.105, Part 1,
pp. 1069-1077.
Karl, W. C., S. B. Leeb, L. A. Jones, J. L. Kirtley, Jr., and G. C. Verghese. 1992.
"Applications of rank-based filters in power electronics." IEEE Transactions on Power
Electronics, Vol. 7, No. 3, pp. 437-444.
Katipamula, S., R. G. Pratt, D. P. Chassin, Z. T. Taylor, G. Krishnan, and M. R.
Brambley. 1999. "Automated fault detection and diagnostics for outdoor-air ventilation
systems and economizers: methodology and results from field testing." ASHRAE
Transactions, Vol. 105, pp. 555-567.
Kramer, M. A. 1986. "Malfunction diagnosis using quantitative models and non-Boolean
reasoning in expert systems." AIChE Journal, Vol. 34, pp. 1383-1393.
Kramer, M. A. and B. L. Palowitch Jr. 1988. "A rule-based approach to fault diagnosis
using the signed directed graph." AIChE Journal, Vol. 36, pp 1225-1235.
Kreider, F., D. Anderson, L. Graves, W. Reinert, J. Dow, and H. Wubbena. 1989. "A quasi-real-time expert system for commercial building HVAC diagnostics." ASHRAE Transactions, Vol. 95, Part 2, pp. 954-960.
Lee, W-Y., J. M. House, and C. Park. 1997. "Fault diagnosis of an air handling unit using
artificial neural networks." ASHRAE Transactions, Vol. 102, Part 1, pp. 540-549.
Lee, W-Y., C. Park, G. Kelly. 1996. "Fault detection in an air-handling unit using
residual and recursive parameter identification methods." ASHRAE Transactions,
Vol. 102, Part 1, pp. 528-539.
Leeb, S. B. 1992. A conjoint pattern recognition approach to non-intrusive load
monitoring. Ph.D. thesis, Massachusetts Institute of Technology.
Li, X., H. Vaezi-Nejad, and J-C. Visier. 1996. "Development of a fault diagnosis method
for heating systems using neural networks." ASHRAE Transactions, Vol. 102, Part 1, pp. 607-614.
Li, X., J-C. Visier, and H. Vaezi-Nejad. 1997. "A neural network prototype for fault
detection and diagnosis of heating systems." ASHRAE Transactions, Vol. 103, Part 1,
pp.634-643.
Little, R. D. 1991. Electrical power disaggregation in commercial buildings with
applications to a non-intrusive load monitor. Master's thesis, Massachusetts Institute of
Technology.
Lorden, G. 1971. "Procedures for reacting to a change in distribution." Annals of Mathematical Statistics, Vol. 42, pp. 1897-1908.
Lorenzetti, D. and L. K. Norford. 1992. "Measured energy consumption of variable-air-volume fans under inlet vane and variable-speed drive control." ASHRAE Transactions, Vol. 98, Part 2, pp. 371-379.
Luo, D., L. K. Norford, S. R. Shaw, and S. B. Leeb. 2001. "Monitoring HVAC equipment electrical loads from a centralized location - methods and field test results." To be published in ASHRAE Transactions, Vol. 107.
McGowan, J. 1992. Networking for building automation and control systems. Englewood
Cliffs, NJ
Newman, H. M. 1994. Direct digital control of building systems: Theory and practice.
John Wiley & Sons.
Ngo, D. and A. L. Dexter. 1999. "A robust model-based approach to diagnosing faults in
air-handling units." ASHRAE Transactions, Vol.105, Part 1, pp. 1078-1086.
Norford, L. K. and S. B. Leeb. 1996. "Nonintrusive electrical load monitoring in
commercial buildings based on steady-state and transient load-detection algorithms."
Energy and Buildings, Vol. 24, pp. 51-64.
Norford, L. K. and R. D. Little. 1993. "Fault detection and load monitoring in ventilation
systems". ASHRAE Transactions, Vol. 99, Part 1, pp. 590-602.
Norford, L. K., J. A. Wright, R. A. Buswell, and D. Luo. 2000. Demonstration of fault
detection and diagnosis methods in a real building. Final Report for ASHRAE Research
Project 1020-RP.
Pape, F. L. F., J. W. Mitchell, and W. A. Beckman. 1991. "Optimal control and fault
detection in heating, ventilating, and air conditioning systems." ASHRAE Transactions
Vol.97, Part 1, pp. 729-736.
Patton, R., P. Frank, and R. Clark. 1989. Fault diagnosis in dynamics systems - Theory
and application. Prentice Hall International (UK) Ltd.
Peitsman, H. C. and V. E. Bakker. 1996. "Application of black-box models to HVAC systems for fault detection." ASHRAE Transactions, Vol. 102, Part 1, pp. 628-640.
Peitsman, H. C. and L. L. Soethout. 1997. "ARX models and real-time model-based
diagnosis." ASHRAE Transactions, Vol. 103, Part 1, pp. 657-671.
Rice, J. A. 1988. Mathematical statistics and data analysis. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California.
Rossi, T. M. and J. E. Braun. 1996. "A statistical, rule-based fault detection and
diagnostic method for vapor compression air conditioners." Real Time Simulation of
HVAC Systems for Building Optimization, Fault Detection and Diagnosis. Technical Papers of Annex 25, pp. 729-754.
Rossi, T. and J. E. Braun. 1997. "A statistical, rule-based fault detection and diagnostic method for vapor compression air conditioners." HVAC&R Research, Vol. 3, No. 1, pp. 19-37.
Rossi, T. M., J. E. Braun, and W. Ray. 1996. "Fault detection and diagnosis methods."
Real Time Simulation of HVAC Systems for Building Optimization, Fault Detection and Diagnosis. Technical Papers of Annex 25, pp. 123-133.
Seem, J. E., J. M. House, and R. H. Monroe. 1999. "On-line monitoring and fault
detection." ASHRAE Journal, July 1999, Vol.41, pp. 21-26.
Shanmugan, K. S. and A. M. Breipohl. 1988. Random signals: detection, estimation and
data analysis. John Wiley & Sons.
Shiozaki, J., H. Matsuyama, E. O'Shima, and M. Iri. 1985. "An improved algorithm for
diagnosis of system failures in the chemical process." Computers and Chemical
Engineering, Vol. 9, pp. 285-293.
Stylianou, M. 1997. "Application of classification functions to chiller fault detection and
diagnosis." ASHRAE Transactions, Vol. 103, Part 1, pp. 645-656.
Visier, J. C., H. Vaezi-Nejad, and P. Corrales. 1999. "A fault detection tool for school buildings." ASHRAE Transactions, Vol. 105, pp. 543-554.
Yoshida, H., Iwami, T., and H. Yuzawa. 1996. "Typical faults of air conditioning systems
and fault detection by ARX model and extended Kalman filter." ASHRAE Transactions,
Vol.102, Part 1, pp. 557-564.
Nomenclature

a        coefficient
argmax   optimality region for maximization
argmin   optimality region for minimization
D        damper position
e        noise or error
F        cumulative distribution function
g        generalized likelihood ratio
h        threshold
H        hypothesis for an event
Inf.     the greatest lower bound on a function over (a subset of) its domain
n        number of samples or logic predicate
N        finite number of samples
ncd      number of on/off cycles each day
offtd    off-time duration
ontd     on-time duration
p        probability density or logic predicate
P        power
Q        flow rate
s        standard deviation of a sampled data set
S        log-likelihood ratio
t        time
u        control signal
y        data sample
δ        thickness of fouling
Δ        width of leakage
Λ        likelihood ratio
θ        average value of variables with Gaussian distribution
μ        mean value
σ        standard deviation

Superscripts
^ (hat)  estimated behavior
- (bar)  weighted behavior
-> (arrow)  vector

Subscripts
0        before the change
1        after the change
b        boiler
cc       cooling coil
ch       chiller
ci       confidence interval
cp       chilled water pump
d        damper
db       deadband
e        event of change
hc       heating coil
hp       hot water pump
i,j,k,n  sample numbers
ia       indoor air
m        model
max      maximum
mn       minimum
oa       outdoor air
p        power or phase of power
Q        flow rate
ra       return air
rf       return fan
sa       supply air
sf       supply fan
sp       static pressure
spt      setpoint
t        temperature
v        valve or variance
θ        average value of variables with Gaussian distribution
μ        mean value

Bold     matrix
Appendix
Descriptions of the test system and the faults
Based on the final report for the ASHRAE Research Project 1020-RP [Norford et
al, 2000], this appendix provides the detailed information about the building, HVAC
system and control sequences, and the methods of fault implementation for the
demonstration of the FDD method presented in this thesis. Sensors and trained parameters required by this method in the tests are also listed. Figures A1-A3 are repeated from Section 3.9 for reading convenience.
Building
Fault testing was conducted in a unique building that combines laboratory-testing
capability with real building characteristics and is capable of simultaneously testing two
full-scale commercial building systems side-by-side with identical thermal loads. The
building is equipped with three variable-air-volume air-handling units (AHUs), namely,
AHU-A, AHU-B, and AHU-1. AHU-A and AHU-B are identical, each serving four test
rooms (Figure A1). The building has a true north-south solar alignment so that the pairs
of test rooms have identical exposures to the external thermal loads. The test rooms are
unoccupied but are equipped with two-stage electric-baseboard heaters to simulate
thermal loads and with two-stage room lighting, both scheduled to simulate various usage
patterns. The test rooms are also equipped with fan coil units, although these were not
used in this research. AHU-1 serves the general areas of the facility including offices,
reception space, a classroom, a computer center, a display room, service spaces, and the
media center. A second classroom was added to the east side of the building during the
later stages of this project. Because AHU-1 serves the occupied part of the building, it
will be subject to variable occupant, lighting, and external and internal loads.
The test rooms, heating and cooling loops, and AHUs are well instrumented with
near-research-grade sensors. Notably, instrumentation included Watt transducers for all
components of interest. The A and B test rooms are individually controlled by a single
commercial energy-management and control system (EMCS) and the general areas are
controlled by a second EMCS. The building has a structural steel frame with internally
insulated, pre-cast concrete panels, a flat roof, slab-on-grade flooring, and a floor area of
7,900 m2 (9,272ft2), including the new classroom. The east, south, and west test rooms
each have 6.3 m2 (74 ft2) windows with double-layer, clear glass.
HVAC system
The heating plant consists of a gas-fired boiler, circulation pumps and the
necessary control valves. Heating operation of the HVAC systems was not required as
part of the tests conducted in this research, other than for the preheating of the outside air
during winter operation to simulate higher outside temperatures and force the HVAC
systems into economizer mode. The cooling plant (Figure A2) consists of a nominal 10
kW, two-stage reciprocating, air-cooled chiller, a 149 kWh thermal energy storage (TES)
unit that was isolated from the cooling system for this research, chilled water supplied by
a central facility, pumps, and necessary valves and piping to circulate chilled water
through the HVAC components.
The major components of the AHUs are the recirculated air, exhaust air, and
outdoor air dampers; cooling and heating coils with control valves; and the supply and
return fans (Figure A3). Ducts transfer the air to and from the conditioned spaces.
Both the supply and return fans are controlled with variable frequency drives. An
additional heating coil was installed, for this research, on AHU-A and B, between the
outside air inlet (OA) and the flow and the temperature sensor. This coil was employed to
preheat the outside air so as to force the control system into free cooling mode. AHU-A
and B are identical, while AHU-1 is similar but larger to accommodate higher thermal
loads. Air from the AHUs is supplied to VAV box units, each having electric or hydronic
reheat.
Figure A1. Plan of the test building.
Figure A2. Chilled-water flow circuit in the test building. (The legend identifies electric power transducers, pressure transducers, flow meters, and temperature probes; CHWP denotes a chilled water pump; marked points are listed in Table 3.4 for the cooling plant and Table 3.6 for AHU-A and B.)
Figure A3. Air-handling unit in the test building. (Labeled points include EA, EA-DMPR, ODA-TEMP, ODA-HUMD, RA-TEMP, RA-FLOW, RA-HUMD, RF Watts, OA-TEMP, OA-FLOW, OA-DMPR, SF Watts, MA-TEMP, DA-TEMP, SA, SA-HUMD, HTG-EWT, HTG-LWT, CLG-DAT, CLG-EWT, and CLG-LWT.)
Control sequences
--Hot and chilled water pump control logic
The constant-speed heated- and chilled-water pumps (HWP-A, HWP-B, CHWP-A, and CHWP-B) are turned on and off based on the position of the heating and cooling coil control valves. If the control valve is more than 15% open, the pump is turned on and stays on until the control valve is less than 5% open. For a 15% open valve, the HTG-FLOW is about 1.5 gpm, and the CHW-FLOW is about 2.5 gpm.
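This on/off logic is a simple hysteresis band, which can be sketched as follows (the 5% and 15% limits are those given above; the function form is illustrative):

```python
def pump_command(valve_pct, pump_on):
    """Hysteresis band from the pump control sequence: start the pump when
    the coil valve opens beyond 15%, stop it only when the valve closes
    below 5%.  Inside the 5-15% deadband the previous state is held, which
    prevents short-cycling of the pump."""
    if valve_pct > 15.0:
        return True
    if valve_pct < 5.0:
        return False
    return pump_on   # inside the band: hold the previous state
```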
The speeds of HWP-LA, HWP-LB, and CHWP-LC are modulated to maintain the
pressure in the pipe at the pressure set point for heated water loops A and B and chilled
water loop C. The pressure set points are generally set at 30 psi for loops A, B, and C.
--Supply and return fan control sequence
The speed of the supply fan is modulated to maintain duct static pressure at the
static pressure set point, which is generally at 1.2 in. W.G.
The return fan for the tests was controlled by air flow rate matching with the
supply fan, i.e., the return air flow rate is maintained at a percentage of the supply air
flow rate. For example, if the supply air flow rate is 3600 cfm, and the return air flow
rate is set to maintain 80 percent of the supply air flow rate, then the speed of the return
fan would be controlled to maintain a return air flow rate of (0.80*3600 =) 2880 cfm.
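The flow-matching setpoint is a single multiplication; a sketch using the 80 percent fraction from the example above:

```python
def return_fan_setpoint(supply_cfm, fraction=0.80):
    """Return-air flow setpoint as a fixed fraction of the measured supply
    air flow, per the flow-matching control sequence."""
    return fraction * supply_cfm
```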
Table A1. Test system input accuracy.

Input                          Accuracy
Voltage and current            0.0244%
100 Ω RTD (DIN platinum)       0.07 °F
1000 Ω RTD (JCI nickel)        0.07 °F
1000 Ω RTD (DIN platinum)      0.10 °F
--Supply air temperature control sequence
The control sequence used to maintain supply air temperature is divided into four control regions, namely, mechanical cooling, mechanical and 'free' cooling, 'free' cooling, and mechanical heating, as shown in Fig. A4. Each region depends on whether the OA-TEMP is greater or less than a reference temperature known as the economizer set point temperature (ECONSPT, typically around 55-65°F), and whether DA-TEMP is above or below the supply air temperature setpoint (SUPSPT).
The control sequence is in the mechanical cooling mode when OA-TEMP is
greater than ECONSPT and the system calls for cooling (DA-TEMP is greater than the
SUPSPT plus a deadband). During the mechanical cooling mode, the OA-DMPR is
held in the minimum position and the CLG-VLV is modulated to maintain DA-TEMP at
SUPSPT.
When the OA-TEMP is less than ECONSPT, the supply air temperature control
sequence is in the mechanical and 'free' cooling mode. The OA-DMPR is fully opened
and CLG-VLV is modulated to maintain DA-TEMP.
Figure A4. Supply air temperature control sequence. (The horizontal axis is outdoor air temperature; the economizer set point divides the control regions, and oa-min denotes the minimum outdoor-air damper position.)
As the OA-TEMP drops, the need for mechanical cooling is eliminated (CLG-VLV modulates to fully closed), thereby switching the control sequence to the 'free' cooling mode. In this mode, DA-TEMP is maintained by modulating OA-DMPR.
If the OA-TEMP continues to decrease, a point is reached where the OA-DMPR
is in the minimum position and mechanical heating is required causing the control
sequence to switch to the mechanical heating mode. During the mechanical heating
mode, the HTG-VLV is modulated to maintain DA-TEMP.
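The four control regions described above can be sketched as a simple mode selector. This is a minimal illustration, not the facility's actual controller logic: the setpoint defaults, the deadband handling, and the use of DA-TEMP as a proxy for the cooling and heating calls are all assumptions.

```python
def supply_air_temp_mode(oa_temp: float, da_temp: float,
                         econ_spt: float = 60.0, sup_spt: float = 55.0,
                         deadband: float = 1.0) -> str:
    """Select the supply-air temperature control region (cf. Fig. A4).

    Temperatures in °F; econ_spt and sup_spt defaults are illustrative
    values within the 55-65 °F range mentioned in the text.
    """
    if oa_temp > econ_spt:
        # Outdoor air warmer than the economizer setpoint: OA-DMPR held at
        # minimum, CLG-VLV modulated (mechanical cooling).
        return "mechanical cooling"
    if da_temp > sup_spt + deadband:
        # Outdoor air helps, but mechanical cooling is still required:
        # OA-DMPR fully open, CLG-VLV modulated.
        return "mechanical + free cooling"
    if da_temp >= sup_spt - deadband:
        # CLG-VLV fully closed; OA-DMPR alone maintains DA-TEMP.
        return "free cooling"
    # OA-DMPR at minimum and DA-TEMP still low: HTG-VLV modulated.
    return "mechanical heating"
```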
--VAV control sequences
The RM-TEMP is maintained above RMHTGSP and below RMCLGSP. This is
accomplished using a VAV unit and/or an FCU. If a VAV unit is used, there are
two methods of reheat available, electric and hydronic. The control sequence used for
each method is described below.
The first control sequence describes the control logic when a VAV unit is used to
maintain RM-TEMP. If RM-TEMP is above RMCLGSP, the VAV-DMPR is opened to
bring in more supply air and cool the Test Room. The air flow rate entering the Test
Room may be varied between a minimum value (OCC-MIN) (determined either by indoor
air quality requirements or by equipment limitations) and the maximum (OCC-MAX) rated flow rate for
the VAV unit. The air flow rate is determined from the output of a control algorithm
(normally a proportional-integral-derivative (PID) loop). This control sequence assumes
that the DA-TEMP is lower than RM-TEMP. If RM-TEMP drops below RMHTGSP,
and the hydronic reheat coil is selected, the VAV-HVLV is modulated to keep RM-TEMP near RMHTGSP. The VAV-HVLV can vary anywhere from fully closed to fully
open. The position of the heating valve is determined from the output of a control
algorithm. If an electric reheat coil is selected instead of a hydronic reheat coil, the first
stage of electric reheat is turned on when the output of the control algorithm is greater
than 10%. If the temperature continues to drop after the first stage of electric reheat is
turned on, and the output of the control algorithm becomes greater than 20%, the second
stage of electric reheat is turned on. Again, if the temperature continues to drop after the
first and second stages of electric reheat are on, and the output of the control algorithm
becomes greater than 30%, the final stage of electric reheat is turned on. The stages of
electric reheat are turned off in the same manner as they are turned on with a 5% load
differential, so the third stage of electric reheat is turned off when the control algorithm
output drops below 25%, the second stage is turned off when the control algorithm output
drops below 15%, and the first stage is turned off when the control algorithm output
drops below 5%.
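The electric reheat staging described above is a three-stage sequencer with a 5% hysteresis band between the on and off thresholds. A minimal sketch, with the function and state representation chosen for illustration:

```python
# Stage i switches on as the heating-loop output rises past 10%, 20%, and 30%,
# and switches off 5% lower (5%, 15%, 25%) to provide hysteresis.
ON_THRESHOLDS = (10.0, 20.0, 30.0)   # stage i turns on above ON_THRESHOLDS[i]
OFF_THRESHOLDS = (5.0, 15.0, 25.0)   # stage i turns off below OFF_THRESHOLDS[i]

def update_reheat_stages(output_pct, stages_on):
    """Given the control-loop output (%) and the current on/off state of the
    three electric reheat stages, return the new state tuple."""
    new_state = []
    for on_t, off_t, was_on in zip(ON_THRESHOLDS, OFF_THRESHOLDS, stages_on):
        if was_on:
            new_state.append(output_pct >= off_t)   # stays on until below off_t
        else:
            new_state.append(output_pct > on_t)     # turns on above on_t
    return tuple(new_state)

state = (False, False, False)
state = update_reheat_stages(32.0, state)   # all three stages energize
state = update_reheat_stages(22.0, state)   # third stage drops out (below 25%)
```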
Faults
Requirements for this project stipulated a minimum of six faults to be
investigated, with at least two degradation faults and at least one fault in each of the three
AHU subsystems (air-mixing, filter-coil, and fan). Table A2 shows seven faults
consistent with these requirements and their method of implementation for AHU-A and
AHU-B. Table A3 indicates that each fault was implemented in at least two of the three
test periods, held during summer, late winter, and spring seasons.
Fault magnitudes were established during an initial period when the FDD method,
the methods for introducing faults, and the HVAC systems were all commissioned.
Magnitudes were set such that the FDD method was anticipated to detect the middle
level of the degradation faults, with the lowest level posing more of a challenge.
Fault magnitudes were reproducible across test periods. While this would not likely be
the case in a typical building, it provided a firmer basis for evaluating the FDD methods,
in what is among the first extensive field tests of such methods. HVAC system
commissioning consisted primarily of sensor calibration and establishing standard system
operating configurations, important in a flexible research facility where systems were
often changed to meet the needs of a number of research programs. This proved to be a
major task for the test-building staff, encompassing fan control algorithms, isolation of
the thermal-storage tank (which provided a thermal capacitance that interfered with
analysis of chiller cycling periods), and operating schedules for HVAC equipment and
false loads in test rooms.
Table A2. Method of implementation of faults.

Air-Mixing Section
  Stuck-closed recirculation damper (abrupt): application of a control voltage
  from an independent source to maintain the damper in the closed position.
  Leaking recirculation damper (degradation): removal of the recirculation-damper
  seals, with one seal removed for the first fault stage, two for the second,
  and all seals for the third stage.

Filter-Coil Section
  Leaking cooling coil valve (degradation): manual opening of a coil bypass valve.
  Reduced coil capacity, water-side (degradation): manual throttling of the
  cooling-coil balancing valve, to 70%, 42%, and 27% of the maximum coil flow
  of 1.7 l/s (27.5 gpm) for the three fault stages.

Fan
  Drifting pressure sensor (degradation): introduction of a controlled leak in
  the pneumatic signal tube from the supply-duct static-pressure sensor to the
  transducer, to a maximum reduction of 225 Pa (0.9 in. H2O).
  Unstable supply fan controller (abrupt): introduction of alternative gains
  for the PID controller that adjusts fan speed to regulate static pressure.
  Slipping supply-fan belt (degradation): adjustment of fan-belt tension to
  reduce maximum fan speed by 15% at 100% control signal for the first stage
  and 20% for the second stage; the third stage has an extremely loose belt
  with variable fan speed.
Table A3. Faults introduced during the blind-test periods.

[Matrix of the seven faults of Table A2 (rows, grouped into air-mixing, filter-coil, and fan sections) against the summer, winter, and spring test periods; each fault was introduced in at least two of the three periods.]
Table A4. Faults that can be associated with a given electrical-power signature.

Polynomial correlation of supply-fan power with supply airflow:
  Change in airflow resistance, possibly due to stuck air-handler dampers or
  air-side fouling of heating or cooling coils
  Static-pressure sensor error (affects the portion of fan power due to static pressure)
  Flow sensor error
  Power transducer error
  Change in fan efficiency, caused by a change in blade type or pitch, or use
  of a VFD in lieu of inlet vanes
  Change in motor efficiency

Polynomial correlation of supply-fan power with supply-fan speed control signal:
  Slipping fan belt
  Disconnected control loop (fan speed differs from control signal)
  Power transducer error
  Change in fan efficiency
  Change in motor efficiency

Polynomial correlation of chilled-water pump power with cooling-coil control
valve position control signal:
  Change in water flow resistance, possibly due to constricted cooling-coil
  tubes or piping
  Disconnected control loop
  Power transducer error
  Change in pump efficiency
  Change in motor efficiency

Detection of change in cycling frequency for two-stage reciprocating chiller:
  Leaky cooling-coil valve
  Leaky recirculation damper

Detection of power oscillations:
  Unstable local-loop controller
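As an illustration of the first analysis in Table A4, a polynomial correlation fit to normal-operation training data can be evaluated against measured fan power, with a fault flagged when the residual is too large. The quadratic coefficients and the fixed 10% relative threshold below are assumed values for illustration only; the thesis establishes its thresholds statistically (see Table A6).

```python
# Assumed offline fit: P[kW] = a*Q^2 + b*Q + c, with Q the supply airflow in cfm.
TRAINED_COEFFS = (1.5e-7, 0.0, 0.0)

def predicted_fan_power(airflow_cfm, coeffs=TRAINED_COEFFS):
    """Evaluate the polynomial correlation of fan power with airflow."""
    a, b, c = coeffs
    return a * airflow_cfm ** 2 + b * airflow_cfm + c

def power_residual_fault(airflow_cfm, measured_kw, rel_threshold=0.10):
    """Flag a fault when measured power deviates from the correlation by more
    than a relative threshold (10% here, an illustrative value)."""
    predicted = predicted_fan_power(airflow_cfm)
    return abs(measured_kw - predicted) > rel_threshold * predicted

print(power_residual_fault(3000.0, 1.35))  # matches the correlation: False
print(power_residual_fault(3000.0, 1.80))  # 33% above prediction: True
```

A deviation alone does not identify which fault in Table A4 occurred; the diagnosis step uses additional signatures (e.g., the speed-control-signal correlation) to discriminate among them.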
Sensors
Sensors required by the FDD method based on power analysis are listed in Table A5.
Table A5. Sensors used for FDD with power analysis (with an indication of
whether each sensor is typically installed).

Polynomial correlation of supply-fan power with supply airflow:
  Supply-fan electrical power (not typically installed)
  Supply airflow (typically installed if volume-tracking control is used for
  the return fan)
  Supply-duct static pressure, training only (typically installed)

Polynomial correlation of supply-fan power with supply-fan speed control signal:
  Supply-fan electrical power (not typically installed)
  Supply-fan speed control signal (typically installed if a VFD is used)

Polynomial correlation of chilled-water pump power with cooling-coil control
valve position control signal:
  Chilled-water-pump electrical power (not typically installed)
  Cooling-coil valve position control signal (typically installed)

Detection of change in cycling frequency for two-stage reciprocating chiller:
  Chiller electrical power (not typically installed)
  Cooling-coil valve position control signal (typically installed)
  Outdoor temperature (typically installed)

Detection of power oscillations:
  Fan and pump electrical power (not typically installed)
Table A6. Required values in the application of the FDD method in the tests.

Fan-power correlations with airflow and speed-control signal:
  Maximum deviation of static pressure from set point for training data: 25 Pa (0.1 in. H2O)
  Confidence level to establish boundary between normal and faulty data: 90%
  Airflow boundary to distinguish a stuck-closed recirculation damper from
  static-pressure offset/drift: 500 cfm
  Fan power at 100% speed, below which a slipping-fan-belt fault was
  considered, subject to a minimum time duration*: 1 kW
  Time duration of low fan power at 100% speed, above which a slipping-fan-belt
  fault was flagged: 3 one-minute power samples

Pump-power correlation with cooling-coil valve position-control signal:
  Valve-position control signal above which pump-power data were analyzed for a
  cooling-coil capacity fault**: 40%
  Measured normal-operation power level of the secondary chilled-water pump: 400 W
  Minimum decrease of pump power below the normal-operation value, in excess of
  which a coil-capacity fault was flagged: 10 W
  Confidence level to establish boundary between normal and faulty data: 90%

Chiller-cycling analysis:
  Power level above which the chiller is considered to be operating in the
  low-power stage: 4 kW
  Cycling interval when the cooling-coil valve control signal is at 0%, below
  which a leaky-valve fault is flagged: 3 minutes
  Normalized outdoor-air temperature, below which chiller cycling is analyzed
  to detect a leaky recirculation damper***: 0.2

Power-oscillation analysis:
  Size of sliding window for averaging one-minute power data from submeters: 5 samples
  Standard deviation of power signal above which a fault is flagged, as a
  percentage of average power: 15%
* Fan-power analysis at 100% speed was used in AHU-A and B to detect the slipping
fan belt. For AHU-1 this approach was replaced by the more rigorous and sensitive
polynomial correlation of fan power with speed control signal.
** Pump power analysis relative to a measured and near-constant normal-operation
value was used in AHU-A and B to detect the coil-capacity fault.
*** The normalized outdoor-air temperature is the difference between the outdoor-air
temperature and the supply-air temperature set point, normalized by the difference
between the supply and room air temperature set points.
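The power-oscillation thresholds in Table A6 (a 5-sample sliding window and a limit of 15% of average power) can be sketched as follows. The choice of the population standard deviation as the estimator is an assumption; the thesis does not specify which estimator was used.

```python
from statistics import mean, pstdev

WINDOW = 5          # sliding-window size in one-minute samples (Table A6)
STD_LIMIT = 0.15    # fraction of window-average power (Table A6)

def oscillation_fault(power_w):
    """Return True if any 5-sample window of one-minute power data has a
    standard deviation greater than 15% of its mean."""
    for i in range(len(power_w) - WINDOW + 1):
        window = power_w[i:i + WINDOW]
        avg = mean(window)
        if avg > 0 and pstdev(window) > STD_LIMIT * avg:
            return True
    return False

print(oscillation_fault([1000, 1005, 998, 1002, 1001, 999]))   # steady: False
print(oscillation_fault([1000, 1400, 700, 1350, 650, 1300]))   # hunting: True
```

A hunting fan or pump driven by an unstable local-loop controller produces exactly this kind of large periodic swing in submetered power, which is why the oscillation test maps to that fault in Table A4.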