Feedforward Neural Networks for
Process Identification and Prediction
Ashish Mehta1, Sai Ganesamoorthi2, and Willy Wojsznis1
Emerson Process Management1, Cisco Systems2
ashish.mehta@emersonprocess.com
KEYWORDS
Neural Networks, Soft Sensors, Prediction, Process Identification
ABSTRACT
This paper presents a conceptual background of feedforward neural network based intelligent sensors
that are tightly integrated with a scalable process control system. The applied algorithms are based on
the back propagation technique with significant modifications for use in process industries.
Statistical pre-processing techniques remove data unrepresentative of the general region of operation. A
novel two-step process is used to realize a suitable input-output configuration for the data set. The input
is time shifted and cross-correlation values between the output and input at various time delays
computed. A heuristic approach establishes the delay at which the input is maximally correlated to the
output. A combination of delays around the identified delay accommodates process dynamics. This is
done for all inputs expected to affect the process output. Sensitivity analysis determines the relative
importance of the various inputs. Conjugate gradient back propagation training with direction
remembrance and optimal learning rate calculation, and specific checks against overfitting, is used to
realize a robust neural network based identifier. Network model predictions are validated against actual
data.
INTRODUCTION
In the recent past neural networks have been used for process variable predictions as intelligent sensors,
also called soft sensors since they are software versions of physical sensors [1, 2, 3]. The idea is to use
their system identification capability to model the nonlinear behavior of processes based on historical
data. While there is a significant amount of literature on the architecture of neural nets and their
suitability to various kinds of problems, this paper addresses the easy development of sophisticated
neural networks (NN) suitable for prediction of process outputs.
The motivation for using neural networks follows from the fact that they are universal approximators: a
multilayer network with a single hidden layer consisting of an arbitrary number of sigmoidal hidden
layer neurons, can approximate any continuous function defined on a compact set to any desired degree
of accuracy [4, 5]. As an intelligent sensor, a neural net (NN) typically, has one output (the predicted
variable), and any number of process variables as inputs. Figure 1 shows a three layer feed-forward NN
with n inputs, a single layer of hidden neurons, bias nodes at the input and hidden layer, and one output.
The weight w_ij connects the ith node in the previous layer to the jth node in the next layer.
Figure 1: Three layer feed-forward Neural Network
Weighted values are summed at the node before being passed through an activation function. For a
sigmoidal hidden neuron, the summed node input S_j and the output h_j are given by:

S_j = \sum_i w_{ij} x_i, \qquad h_j = \frac{1 - e^{-S_j}}{1 + e^{-S_j}} \qquad (1)
Typically the output layer has a linear activation function. Training the NN involves presenting known
data sets and minimizing the error function J between the actual (target) T, and predicted Y, output
vectors:
J_p = \tfrac{1}{2}\,(t_p - y_p)^2, \qquad J = \sum_p J_p \qquad (2)
p denotes one exemplar (input – output set).
The back propagation algorithm [6] minimizes the error function in the direction of negative gradient
(steepest descents) to modify the weights for each pass through the data set (an epoch):
w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \quad \text{where} \quad \Delta w_{ij} = -\eta\,\frac{\partial J}{\partial w_{ij}} \qquad (3)
η defines the step size in the gradient direction, more popularly known as the learning rate. The network
is trained until an acceptable J value is obtained. In the forward pass the trained NN is used to predict
the output Y from known process inputs X = (x_1, x_2, ..., x_n).
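As a minimal sketch (with arbitrary sizes and randomly initialized weights, not those of any product implementation), the forward pass of equation 1 can be written as:

```python
import numpy as np

def forward(x, W_hidden, W_output):
    """Forward pass for the three-layer network of Figure 1.

    Hidden nodes use the sigmoid h = (1 - e^-S)/(1 + e^-S); the output
    node is linear; bias nodes are modeled as a constant input of 1.
    """
    x_b = np.append(x, 1.0)                      # bias node at the input layer
    S = W_hidden @ x_b                           # summed node inputs S_j
    h = (1.0 - np.exp(-S)) / (1.0 + np.exp(-S))  # sigmoidal hidden outputs h_j
    h_b = np.append(h, 1.0)                      # bias node at the hidden layer
    return float(W_output @ h_b)                 # linear output neuron y

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(4, 3 + 1))   # 3 inputs + bias -> 4 hidden
W_output = rng.normal(scale=0.1, size=4 + 1)        # 4 hidden + bias -> 1 output
y = forward(np.array([0.2, -0.5, 0.7]), W_hidden, W_output)
```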
IDENTIFIER CONFIGURATION
As a prerequisite to training, one needs to configure the input-output structure of the network by
including the measured input variables that are considered to affect the output. However, in a typical
process this is not trivial: it may not be known exactly which inputs have an influence on the output and
there is a temporal relationship between an input and its effect on the output. This section details
development of the best NN configuration by starting with the inclusion of all inputs that may affect the
output and then identifying, for each of the inputs, the delay that has maximum influence on the output.
Figure 2: Data Selection and Preprocessing
Data Selection: As with any empirical method, the accuracy of the model is highly dependent on the
quality of data. Careful preparation of the data set involves inspection (graphical/tabular, etc.), removal
of bad data, definition of the range of data, removal of outliers, handling of noisy data and ensuring that
there is sufficient variation in data over the desired region of operation. Statistical techniques often
prove useful in defining the outlier boundaries. A good rule of thumb for valid data limits is mean ± 3.5 × standard deviation, which includes approximately 99.9% of the data in the given region. Figure 2 shows
such a graphical data selection tool.
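The mean ± 3.5σ rule can be sketched as follows (the sample data and function name are illustrative, not part of the described tool):

```python
import numpy as np

def remove_outliers(data, n_sigma=3.5):
    """Flag samples outside mean +/- n_sigma * standard deviation as bad data."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std()
    bad = np.abs(data - mu) > n_sigma * sigma
    return data[~bad], bad          # kept samples, and the bad-data flags

# 100 good samples around 1.0 plus one gross outlier at 50.0
clean, bad = remove_outliers(list(np.ones(100)) + [50.0])
```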
Correlation Calculation: As a first step in identifying the delays, the cross-correlation between the
output and the time shifted input is calculated for various delays of the input. For an N sample data set,
with input and output vectors X and Y, input vector for delay d is
X_d = \{\, x_d(i),\; i = 1..N \mid x_d(i) = x(i - d) \,\} \qquad (4)
Then the cross correlation coefficient between the delayed input and output is given by:
\rho_d = \frac{\mathrm{Cov}(X_d, Y)}{\sigma_{X_d}\,\sigma_Y}, \quad \text{where} \quad \mathrm{Cov}(X_d, Y) = \frac{1}{N} \sum_i \big(x_d(i) - \mu_{X_d}\big)\big(y_i - \mu_Y\big) \qquad (5)
where μ and σ denote the mean and standard deviation, respectively. ρ_d is calculated for all possible
delays (based on process response times) of the input. Figure 3A is a plot of the cross-correlation
coefficients for various delays of one process input.
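Equations 4 and 5 amount to the following computation (a sketch on synthetic data; the true delay of 24 samples is planted deliberately so the correlation peaks there):

```python
import numpy as np

def delayed_correlations(x, y, max_delay):
    """Correlation coefficient rho_d between output y and input x shifted by d."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rho = []
    for d in range(max_delay + 1):
        xd, yd = x[: len(x) - d], y[d:]          # x_d(i) = x(i - d), overlap only
        cov = np.mean((xd - xd.mean()) * (yd - yd.mean()))
        rho.append(cov / (xd.std() * yd.std()))
    return np.array(rho)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_delay = 24
y = np.roll(x, true_delay) + 0.1 * rng.normal(size=500)  # output lags input by 24
rho = delayed_correlations(x, y, max_delay=40)
best = int(np.argmax(np.abs(rho)))                       # delay of max correlation
```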
Delay Identification: The cross-correlation value indicates the magnitude (and sign) of the effect of the
input on the output. For example, for a simple first order process, the input delayed by approximately
(dead time + time constant/2) is most relevant, as the maximum correlation value
occurs at that delay. The second step in delay identification, therefore, is a heuristic based approach that
defines the maximum of the correlation plot. In order to remove noise and spurious maxima, filtering of
the following form is used:
\bar{\rho}_d = \lambda_1 \rho_d + \lambda_2 (\rho_{d-1} + \rho_{d+1}), \quad \text{where} \quad \lambda_1 + \lambda_2 = 1;\; \lambda_1 > \lambda_2 \qquad (6)
To exemplify, in Figure 3A, the input delayed by 24 seconds has most influence on the output. Also,
more weight is given to the maxima found at smaller delay values, and the use of only one delay
value per input accounts, to some extent, for the case of correlated inputs. Once the most significant delay
is known, the input data is shifted by that delay to form the data set for training, as in equation 4. While
the delayed input represents the significant process dynamics, use of secondary delays in a time series
fashion (for example, X(d−1), X(d), X(d+1), etc.) is shown to incorporate the faster responding
higher order dynamics.
Figure 3: Delay Identification and Sensitivity Analysis for A) Detailed display for one input (left), and
B) Overview for all configured inputs (right)
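The filtering heuristic of equation 6 can be sketched as follows (the λ values and correlation curve are illustrative; the curve contains a single-sample spurious spike at delay 2 and a broad true maximum near delay 6):

```python
import numpy as np

def smooth_and_pick_delay(rho, lam1=0.8, lam2=0.2):
    """Smooth the correlation curve per equation 6, then locate its maximum."""
    rho = np.asarray(rho, float)
    smoothed = rho.copy()
    for d in range(1, len(rho) - 1):
        smoothed[d] = lam1 * rho[d] + lam2 * (rho[d - 1] + rho[d + 1])
    return smoothed, int(np.argmax(np.abs(smoothed)))

rho = [0.0, 0.1, 0.9, 0.1, 0.5, 0.8, 0.85, 0.8, 0.4, 0.1]
smoothed, best = smooth_and_pick_delay(rho)   # spurious spike at d=2 is suppressed
```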
Sensitivity Analysis: Sensitivity is defined as the change in dependent variable (output) y for a unit
change in independent variable (input) x, or mathematically,
S_x = \frac{\Delta y}{\Delta x} \qquad (7)
In a multivariable system, a higher value of sensitivity indicates that change in that input has higher
influence on the output. The sum total of all sensitivities will equal 1 (one). Information on the
sensitivity of the inputs to the output indicates their relative importance. The sensitivity value at the
delay identified in the previous step can be used to exclude inputs that the output shows little or no
dependence on, in order to prevent training the NN over relatively insignificant data.
One technique is to exclude inputs whose individual sensitivities are small compared to the average
sensitivity. The sensitivity at a different delay value, the dependence on specific inputs, and the effect of
inclusion (and/or exclusion) of inputs from the overall network, can all be explored in a “what-if”
scenario using sensitivity analysis. It must be noted here that the underlying assumption is that the data
set was representative of the region of operation the network is being trained for.
A simplified linear model obtained by Partial Least Squares (PLS) is used for sensitivity computation.
PLS is an established mathematical technique to develop an inferential model using dimensionality
reduction. It is a multiple linear regression algorithm that avoids the ill conditioning problem associated
with inverting a matrix and the poorly conditioned solution that is sensitive to noise.
The model for computing the sensitivities is obtained using the standard PLS algorithm, for which there
are several good references [7, 8]. The input, whose sensitivity is to be calculated is perturbed by unit
amount while all other inputs are kept constant, and the change in output (from the PLS based model)
defines the sensitivity. Graphical and numeric editing capability facilitate the development of the best
NN configuration for the observed data. Of the twenty process inputs in the example of Figure 3B, the
delay identification and sensitivity analysis techniques automatically exclude the four insignificant
inputs shown crossed out.
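The sensitivity computation can be sketched with a small NIPALS-style PLS1 routine standing in for the product's PLS implementation (the data is synthetic; inputs 1 and 3 are constructed to drive the output, and the sensitivities are normalized to sum to 1 as stated above):

```python
import numpy as np

def pls_coefficients(X, y, n_components=2):
    """Minimal NIPALS PLS1: regression coefficients for mean-centered data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xk, yk = X.copy(), y.copy()
    W, P, c = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)            # weight vector for this component
        t = Xk @ w                           # scores
        tt = t @ t
        p = Xk.T @ t / tt                    # loadings
        ck = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)             # deflate X and y
        yk = yk - ck * t
        W.append(w); P.append(p); c.append(ck)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.solve(P.T @ W, c)   # B such that y ~ X @ B

def sensitivities(X, y, n_components=2):
    """Unit-perturbation sensitivities from the PLS model, normalized to sum to 1."""
    B = pls_coefficients(np.asarray(X, float), np.asarray(y, float), n_components)
    s = np.abs(B)                            # output change per unit input change
    return s / s.sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.05 * rng.normal(size=200)
s = sensitivities(X, y)                      # inputs 1 and 3 should dominate
```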
NEURAL NET TRAINING
The gradient descent based back propagation algorithm, with its conceptual simplicity and numerous
customizable modifications, appears to be the best technique for training. Post sensitivity analysis, the
data set is rearranged according to the selected inputs at their delays, to form the exemplars (one set of
input/output values) of the training set. An exemplar for the pth sample of an n input NN would be of the
form: [x_1(p − d_1), x_2(p − d_2), ..., x_n(p − d_n), y(p)], where d_i denotes the delay for the ith input.
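Building the delayed exemplars can be sketched as follows (array lengths and delays are illustrative):

```python
import numpy as np

def build_exemplars(inputs, delays, y):
    """Pair x_i(p - d_i) with y(p) for every p where all delayed inputs exist."""
    start = max(delays)                          # first p with all inputs valid
    X = np.column_stack(
        [np.asarray(x, float)[start - d : len(y) - d] for x, d in zip(inputs, delays)]
    )
    return X, np.asarray(y, float)[start:]

x1 = np.arange(10, dtype=float)                  # x1(p) = p
x2 = 10.0 * np.arange(10)                        # x2(p) = 10 p
y = np.arange(10, dtype=float)
X, Y = build_exemplars([x1, x2], delays=[2, 1], y=y)
# Row k is the exemplar [x1(p - 2), x2(p - 1)] with target y(p), for p = k + 2
```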
Uniform scaling via normalization equalizes the importance of the inputs, so that the engineering unit
range of inputs does not hamper learning. Since the hidden nodes are not affected (only input/output
node values are normalized), the forward pass can proceed with no need for scaling by simply denormalizing the weights at the end of training.
Training an NN involves two aspects: the quality of the
training algorithm itself, and its generalization capability. While the former relates to how well and how
fast the algorithm converges over the given data set (performance), the latter is concerned with how well
the trained NN will perform as a predictor over unseen data (robustness). This section explores these
two issues in some detail.
Training algorithm
As described earlier, training proceeds by changing the weights so as to minimize an error function
along the direction of the gradient. Standard back propagation, however, is notorious for its slow
convergence and the use of a fixed learning rate. Instead, the conjugate gradient method is used.
Conjugate Gradients: In this algorithm, the new direction uses a component of the previous direction,
and so convergence is much faster. The method produces a set of mutually conjugate directions such
that each successive step continually refines the direction towards the minimum; on a quadratic error
surface, successive minimizations along these conjugate directions reach the minimum in a finite number
of steps. The new direction vector from point i+1, h_{i+1}, is computed by adding the gradient at point
i+1, g_{i+1}, to the previous direction h_i scaled by a constant γ_i:
hi 1  g i 1  i hi
(8)
The scalar γ_i is defined by methods such as the Fletcher-Reeves expression [9]:

\gamma_i = \frac{g_{i+1} \cdot g_{i+1}}{g_i \cdot g_i} \qquad (9)
This direction is used in place of the gradient in equation 3. It is this choice of direction that results in
the rapid convergence of the conjugate gradient algorithm. This is illustrated in Figure 4 where the
algorithm converges within a few epochs whenever a new hidden node is added.
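The update of equations 8 and 9 can be sketched on a toy quadratic error surface J(W) = ½ WᵀAW (illustrative only; in training, g would be the backpropagated negative gradient):

```python
import numpy as np

def fletcher_reeves_direction(g_new, g_old, h_old):
    gamma = (g_new @ g_new) / (g_old @ g_old)   # equation 9
    return g_new + gamma * h_old                # equation 8

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                      # SPD Hessian of the toy error
w = np.array([4.0, -2.0])                       # initial weights
g = -(A @ w)                                    # negative gradient of J
h = g.copy()                                    # first direction: steepest descent
for _ in range(2):                              # n steps for an n-dim quadratic
    eta = (g @ h) / (h @ A @ h)                 # exact line-search learning rate
    w = w + eta * h
    g_new = -(A @ w)
    h = fletcher_reeves_direction(g_new, g, h)
    g = g_new
# Conjugate gradients reach the minimum of a 2-D quadratic in two steps.
```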
Learning rate: The second weight update term in equation 3, η, determines the distance that is traveled
along the current direction of error minimization. A fixed learning rate, if too small, takes a long time
to converge; if too large, it oscillates about the minimum and may even overshoot it. It is
therefore best to calculate the optimal learning rate that finds the minimum error at each epoch. Since a
weight vector, W, containing all the weights is defined, a direction vector, h, for minimization is
obtained, and the learning rate is a single parameter, this is the problem of minimizing a function
f(W + ηh) of one variable, η, i.e., minimizing along a line. Use is made of the algorithm by Brent [10],
wherein the minimum is bracketed by finding three points such that the function value (error J) at the
middle point is less than at its two neighbors.
Figure 4: Error vs. Epoch for training the NN with incremental number of hidden nodes. Spikes
represent re-initialization of the network when another hidden node is added
At each successive step the interval between the points is refined, until the desired accuracy is obtained.
The  value for the minimum J is then used as the learning rate for that epoch. A new learning rate is
evaluated at the next epoch.
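The per-epoch learning-rate search can be sketched with a simple golden-section refinement of a bracketed minimum (a simplification of Brent's method, which additionally uses parabolic interpolation; the error function along the line is illustrative):

```python
import numpy as np

def line_search(f, lo=0.0, hi=1.0, tol=1e-6):
    """Golden-section search for the eta minimizing f(eta) on [lo, hi]."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0           # golden-ratio shrink factor
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                        # minimum bracketed in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                                  # minimum bracketed in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return 0.5 * (a + b)

# Quadratic error along the search direction: J(eta) = (eta - 0.3)^2 + 1,
# so the optimal learning rate for this epoch is 0.3.
eta_opt = line_search(lambda eta: (eta - 0.3) ** 2 + 1.0)
```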
Direction Remembrance: The scalar term γ (equation 8) optimally defines how much the current
direction remembers the previous one. However, it might be the case that the current direction is not
suitable, or the algorithm may be trapped in a local minimum. The training algorithm handles this by
automatically turning direction remembrance on and off depending on whether the error is
improving. Setting γ = 0 results in starting with a brand new direction.
Network initialization: The network is initialized with random small non-zero values for the weights.
Randomness ensures lack of bias when training begins, while small values give larger freedom for the
weights to be modified and avoid saturation. The error value spikes in Figure 4 are a manifestation of
random weight initialization. However, the standard random number generators that come with most
programming language libraries have a number of problems: typically no more than 32,768 distinct
values can be generated, the sequence is periodic, and successive values are serially correlated [11]. The
generator described in [12] has been applied here. It increases the number of random possibilities to
714,025, and eliminates periodic repetition and serial correlation.
Generalization
Figure 5A plots the error of an NN as a function of the number of epochs, for two different sets of data.
One is the training set and the other is a validation set that is drawn from the same process as the
training set, but not used for training. As expected, the error for the training set decreases monotonically,
approaching an asymptote. However, the error for the validation set starts increasing after a certain
point. The reason for this is that, like all finite data based error minimization techniques, neural nets start
learning features specific to the training data set.
With NN, this is further compounded as a result of their inherent nonlinearity. The goal of training
though, is to learn to predict the output given real process inputs, and not to memorize the training set.
This is known as generalization. This still assumes that the distribution from which data is obtained (the
process) remains similar. The following techniques are implemented to satisfy the generalization
requirements:
Figure 5: Generalization with A) Cross-validation (left), and B) Incremental Hidden Nodes (right)
Cross-validation: When forming the data set, a certain portion of the data is automatically kept aside for
validation during the training phase. This is called the test set. It is imperative that the distributions of
the training set, the test set, and the entire data set are close to each other; for example, their means and
standard deviations should be within 5% of one another. At each epoch, while the weights are modified based on the error
on the training set, the test set is used to prevent the network from overfitting the training set. Data may
be distributed between the training and test sets in either a sequential or random fashion. The random
number generator described previously finds use in random distribution. The split of data between the
two sets is configurable; a value of around 80% for the training set is generally satisfactory.
Convergence Criteria: At each epoch, the error on the test set (test error) for the new set of weights is
calculated and compared with the best test error. Training convergence criteria are based on test error
reduction to guard against the overtraining phenomenon. In order to prevent training from stopping at
some random choice of weights for which the test error happens to be small, the algorithm runs for at
least a fixed number of epochs before establishing any minimum. Also, the training error is added to the
test error to define a stringent minimum total error condition. In this way it is made sure that both the
data sets have acceptable errors when the algorithm converges, and the weights at the best epoch are
picked at the end of training, as indicated in Figure 5A.
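The convergence bookkeeping can be sketched as follows (the error curves are synthetic, shaped like Figure 5A: training error keeps falling while test error eventually turns back up):

```python
import numpy as np

def best_epoch(train_err, test_err, min_epochs=5):
    """Best epoch by combined train + test error, after a minimum epoch count."""
    total = np.asarray(train_err, float) + np.asarray(test_err, float)
    return min_epochs + int(np.argmin(total[min_epochs:]))

epochs = np.arange(50)
train_err = 1.0 / (1.0 + epochs)                                # keeps decreasing
test_err = 0.5 / (1.0 + epochs) + 0.001 * (epochs - 20.0) ** 2  # rises again
picked = best_epoch(train_err, test_err)
```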
Incremental hidden nodes: Deciding the optimal number of nodes in the hidden layer remains a non-trivial task. Too small a number does a poor job of identification, while too large a number does
not generalize well. This is indicated graphically in Figure 5B, where overfitting occurs at the 5 th hidden
node. In this algorithm, hidden nodes increment from a small starting number to a maximum number
that equals the number of configured inputs. For each hidden node increment, the best weight vector that
corresponds to the minimum error is stored.
If the difference between errors for different number of hidden nodes is within a tolerance level, the NN
with a smaller number of hidden nodes is given preference. The algorithm then automatically picks the
weights for the best combination of train/test error (in engineering units) obtained. In Figures 4 and 5B, of the multiple
networks trained, the NN with four hidden nodes is automatically chosen. In this manner the
generalization of the network is maintained while exploring for the best possible network configuration.
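The selection rule can be sketched as follows (the error values are illustrative, with overfitting appearing at the fifth hidden node):

```python
def pick_hidden_nodes(errors, tolerance=0.01):
    """Prefer the smallest hidden-node count whose combined error is within
    the tolerance of the overall best error."""
    best = min(errors.values())
    for n in sorted(errors):                 # smallest networks considered first
        if errors[n] - best <= tolerance:
            return n

errors = {1: 0.30, 2: 0.12, 3: 0.055, 4: 0.050, 5: 0.070}
chosen = pick_hidden_nodes(errors)           # 3 nodes: within tolerance of best
```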
Regularization: The tradeoff between training and generalization is a manifestation of the criterion of
equation 3 which tries to achieve an accurate I/O mapping as measured by only the training data set.
Figure 6: Validation plots – Actual and Predicted vs Sample (left), and Actual vs Predicted (right)
Consequently, another technique to obtain better generalization is to constrain the NN to find a solution
that satisfies additional criteria. The method used in this algorithm, known as regularization by weight
decay, includes a penalty term on the weights as described by:
\tilde{J} = J + \nu \sum_{ij} w_{ij}^2 \qquad (10)

ν denotes the penalty factor and w_ij indexes all the weight terms in the NN. The additional penalty term
increases the robustness of the NN by encouraging the weights to be small, which in itself is
advantageous for learning. It must be noted that this applies only to the training set, hence, the test error
convergence criteria still remain valid.
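Equation 10 amounts to the following (the ν value and weights are illustrative):

```python
import numpy as np

def penalized_error(J_train, weight_matrices, nu=1e-3):
    """J~ = J + nu * sum of squared weights, applied to the training error only."""
    penalty = sum(float(np.sum(W ** 2)) for W in weight_matrices)
    return J_train + nu * penalty

weights = [np.array([[0.5, -0.25], [0.1, 0.0]]), np.array([0.2, -0.4])]
J_tilde = penalized_error(1.0, weights, nu=0.1)
```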
VERIFICATION
Once a model is developed, it is validated by applying real input data to it and comparing its output with
the actual process output. The validation/verification data set must be different from the training and test
data sets and should represent the data region the NN is expected to operate on. Figure 6 shows an
example with both visual comparison and statistical error computation of model validation results. For
example, if the root mean square error per scan (in engineering units) is not satisfactory, the model
should not be put online. The XY plot of the Actual vs. Predicted values in Figure 6 also provides
insight into the different modes of operation of the process. At times multiple models for the different
modes will be needed. Hence, even though a test set is used during training, verification should always
be carried out whenever additional data is made available.
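The statistical part of the comparison can be sketched as follows (the data values are illustrative):

```python
import numpy as np

def rms_error_per_scan(actual, predicted):
    """Root mean square prediction error per scan, in engineering units."""
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

rmse = rms_error_per_scan([10.0, 12.0, 11.0, 13.0], [10.5, 11.5, 11.0, 13.5])
```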
CONCLUSION
An approach for developing neural network based models for process prediction, with special
application to software sensors is presented. A suitable network configuration is obtained by identifying
the temporal relationship between the inputs and the outputs and evaluating sensitivities at the most
significant delays. Sensitivity analysis before model generation enables clearer understanding of process
interactions and allows process experts to do “what-if” analysis. The conjugate gradient algorithm with
several enhancements against over training results in fast and accurate model development. This
technology forms the background for the next generation of easy-to-use, integrated NN implemented in
a scalable process control system [13]. Keeping the algorithmic engine the same, this approach can
easily be extended to implement MIMO process modeling.
REFERENCES
1. Qin, S.J., “Neural Networks for Intelligent Sensors and Control – Practical Issues and some
Solutions”, in Progress in Neural Networks: Neural Networks for Control, 1995, edited by D. Elliott.
2. Ganesamoorthi, S., Colclazier, J., and Tzovla, V., “Automatic Creation of Neural Nets for use in
Process Control Applications”, in Proceedings of ISA Expo/2000, 21-24 August 2000, New Orleans,
LA.
3. Fisher Rosemount Systems. “Installing and Using the Intelligent Sensor Toolkit”. User Manual for
the ISTK on PROVOX Instrumentation, 1997.
4. Hornik, K., Stinchcombe, M., and White, H., “Multilayer Feedforward Networks are Universal
Approximators”, Neural Networks, 2:5, pp 359-366, 1989.
5. Cybenko, G., “Approximation by Superpositions of a Sigmoidal Function”, Math. Control, Signals,
and Systems, vol.2, pp 303-314, 1989.
6. Rumelhart, D. and McClelland, J., “Parallel Distributed Processing”, 1986, MIT Press, Cambridge,
MA.
7. Geladi, P. and Kowalski, B., “Partial Least Squares regression: a tutorial”, Analytica Chimica Acta,
185, pp 1-17, 1986.
8. Wold, S., Jellberg, S., Lundsted, T., Sjostrom, M., and Wold, H., “PLS modeling with latent
variables in two or more dimensions”, Proc. Frankfurt PLS Meeting, September, 1987.
9. Polak, E., “Computational Methods in Optimization”, 1971, Academic Press, New York.
10. Brent, R., “Algorithms for Minimization without Derivatives”, 1973, Prentice-Hall, Englewood
Cliffs, NJ.
11. Press, W. H., Flannery, B., Teukolsky, S., and Vetterling, W., “Numerical Recipes in C”, 1988,
Cambridge University Press, New York.
12. Masters, T., “Practical Neural Network Recipes in C++”, 1993, Academic Press, London, UK.
13. DeltaV™ Home Page: http://www.DeltaV.com.