Feedforward Neural Networks for
Process Identification and Prediction
Ashish Mehta1, Sai Ganesamoorthi2, and Willy Wojsznis1
Emerson Process Management1, Cisco Systems2
ashish.mehta@emersonprocess.com
KEYWORDS
Neural Networks, Soft Sensors, Prediction, Process Identification
ABSTRACT
This paper presents a conceptual background of feedforward neural network based intelligent sensors
that are tightly integrated with a scalable process control system. The applied algorithms are based on
the back propagation technique with significant modifications for use in process industries.
Statistical pre-processing techniques remove data unrepresentative of the general region of operation. A
novel two-step process is used to realize a suitable input-output configuration for the data set. The input
is time shifted and cross-correlation values between the output and input at various time delays
computed. A heuristic approach establishes the delay at which the input is maximally correlated to the
output. A combination of delays around the identified delay accommodates process dynamics. This is
done for all inputs expected to affect the process output. Sensitivity analysis determines the relative
importance of the various inputs. Conjugate gradient back propagation training with direction
remembrance and optimal learning rate calculation, and specific checks against overfitting, is used to
realize a robust neural network based identifier. Network model predictions are validated against actual
data.
INTRODUCTION
In the recent past neural networks have been used for process variable predictions as intelligent sensors,
also called soft sensors since they are software versions of physical sensors [1, 2, 3]. The idea is to use
their system identification capability to model the nonlinear behavior of processes based on historical
data. While there is a significant amount of literature on the architecture of neural nets and their
suitability to various kinds of problems, this paper addresses the easy development of sophisticated
neural networks (NN) suitable for prediction of process outputs.
The motivation for using neural networks follows from the fact that they are universal approximators: a
multilayer network with a single hidden layer consisting of an arbitrary number of sigmoidal hidden
layer neurons, can approximate any continuous function defined on a compact set to any desired degree
of accuracy [4, 5]. As an intelligent sensor, a neural net (NN) typically, has one output (the predicted
variable), and any number of process variables as inputs. Figure 1 shows a three layer feed-forward NN
with n inputs, a single layer of hidden neurons, bias nodes at the input and hidden layer, and one output.
The weight w_ij connects the ith node in the previous layer to the jth node in the next layer.
Figure 1: Three layer feed-forward Neural Network
Weighted values are summed at the node before being passed through an activation function. For a
sigmoidal hidden neuron, the summed node input S_j and the output h_j are given by:

S_j = \sum_i w_{ij} x_i, \qquad h_j = \frac{1 - e^{-S_j}}{1 + e^{-S_j}} \qquad (1)
Typically the output layer has a linear activation function. Training the NN involves presenting known
data sets and minimizing the error function J between the actual (target) T, and predicted Y, output
vectors:
J_p = \tfrac{1}{2}\,(t_p - y_p)^2, \qquad J = \sum_p J_p \qquad (2)
p denotes one exemplar (input – output set).
The back propagation algorithm [6] minimizes the error function in the direction of negative gradient
(steepest descents) to modify the weights for each pass through the data set (an epoch):
w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \quad \text{where} \quad \Delta w_{ij} = -\eta\,\frac{\partial J}{\partial w_{ij}} \qquad (3)
η defines the step size in the gradient direction, more popularly known as the learning rate. The network
is trained until an acceptable J value is obtained. In the forward pass the trained NN is used to predict
the output Y from known process inputs X = (x_1, x_2, ..., x_n).
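As a minimal sketch (with arbitrary sizes and randomly initialized weights, not those of any product implementation), the forward pass of equation 1 can be written as:

```python
import numpy as np

def forward(x, W_hidden, W_output):
    """Forward pass for the three-layer network of Figure 1.

    Hidden nodes use the sigmoid h = (1 - e^-S)/(1 + e^-S); the output
    node is linear; bias nodes are modeled as a constant input of 1.
    """
    x_b = np.append(x, 1.0)                      # bias node at the input layer
    S = W_hidden @ x_b                           # summed node inputs S_j
    h = (1.0 - np.exp(-S)) / (1.0 + np.exp(-S))  # sigmoidal hidden outputs h_j
    h_b = np.append(h, 1.0)                      # bias node at the hidden layer
    return float(W_output @ h_b)                 # linear output neuron y

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(4, 3 + 1))   # 3 inputs + bias -> 4 hidden
W_output = rng.normal(scale=0.1, size=4 + 1)        # 4 hidden + bias -> 1 output
y = forward(np.array([0.2, -0.5, 0.7]), W_hidden, W_output)
```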
IDENTIFIER CONFIGURATION
As a prerequisite to training, one needs to configure the input-output structure of the network by
including the measured input variables that are considered to affect the output. However, in a typical
process this is not trivial: it may not be known exactly which inputs have an influence on the output and
there is a temporal relationship between an input and its effect on the output. This section details
development of the best NN configuration by starting with the inclusion of all inputs that may affect the
output and then identifying, for each of the inputs, the delay that has maximum influence on the output.
Figure 2: Data Selection and Preprocessing
Data Selection: As with any empirical method, the accuracy of the model is highly dependent on the
quality of data. Careful preparation of the data set involves inspection (graphical/tabular, etc.), removal
of bad data, definition of the range of data, removal of outliers, handling of noisy data and ensuring that
there is sufficient variation in data over the desired region of operation. Statistical techniques often
prove useful in defining the outlier boundaries. A good rule of thumb for valid data limits is mean ± 3.5 × standard deviation, which includes approximately 99.9% of the data in the given region. Figure 2 shows
such a graphical data selection tool.
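The mean ± 3.5σ rule can be sketched as follows (the sample data and function name are illustrative, not part of the described tool):

```python
import numpy as np

def remove_outliers(data, n_sigma=3.5):
    """Flag samples outside mean +/- n_sigma * standard deviation as bad data."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std()
    bad = np.abs(data - mu) > n_sigma * sigma
    return data[~bad], bad          # kept samples, and the bad-data flags

# 100 good samples around 1.0 plus one gross outlier at 50.0
clean, bad = remove_outliers(list(np.ones(100)) + [50.0])
```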
Correlation Calculation: As a first step in identifying the delays, the cross-correlation between the
output and the time shifted input is calculated for various delays of the input. For an N sample data set,
with input and output vectors X and Y, input vector for delay d is
X_d = \{\, x_d(i),\; i = 1..N \mid x_d(i) = x(i - d) \,\} \qquad (4)
Then the cross correlation coefficient between the delayed input and output is given by:
\rho_d = \frac{\mathrm{Cov}(X_d, Y)}{\sigma_{X_d}\,\sigma_Y}, \quad \text{where} \quad \mathrm{Cov}(X_d, Y) = \frac{1}{N} \sum_i \big(x_d(i) - \mu_{X_d}\big)\big(y_i - \mu_Y\big) \qquad (5)
where μ and σ denote the mean and standard deviation, respectively. ρ_d is calculated for all possible
delays (based on process response times) of the input. Figure 3A is a plot of the cross-correlation
coefficients for various delays of one process input.
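Equations 4 and 5 amount to the following computation (a sketch on synthetic data; the true delay of 24 samples is planted deliberately so the correlation peaks there):

```python
import numpy as np

def delayed_correlations(x, y, max_delay):
    """Correlation coefficient rho_d between output y and input x shifted by d."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rho = []
    for d in range(max_delay + 1):
        xd, yd = x[: len(x) - d], y[d:]          # x_d(i) = x(i - d), overlap only
        cov = np.mean((xd - xd.mean()) * (yd - yd.mean()))
        rho.append(cov / (xd.std() * yd.std()))
    return np.array(rho)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_delay = 24
y = np.roll(x, true_delay) + 0.1 * rng.normal(size=500)  # output lags input by 24
rho = delayed_correlations(x, y, max_delay=40)
best = int(np.argmax(np.abs(rho)))                       # delay of max correlation
```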
Delay Identification: The cross-correlation value indicates the magnitude (and sign) of the effect of the
input on the output. For example, for a simple first order process, the input delayed by approximately
(dead time + time constant/2) is most relevant, as the maximum correlation value
occurs at that delay. The second step in delay identification, therefore, is a heuristic based approach that
defines the maximum of the correlation plot. In order to remove noise and spurious maxima, filtering of
the following form is used:
\bar{\rho}_d = \lambda_1 \rho_d + \lambda_2 (\rho_{d-1} + \rho_{d+1}), \quad \text{where} \quad \lambda_1 + \lambda_2 = 1;\; \lambda_1 > \lambda_2 \qquad (6)
To exemplify, in Figure 3A, the input delayed by 24 seconds has most influence on the output. Also,
more weight is given to the maxima found at smaller delay values, and the use of only one delay
value per input accounts, to some extent, for the case of correlated inputs. Once the most significant delay
is known, the input data is shifted by that delay to form the data set for training, as in equation 4. While
the delayed input represents the significant process dynamics, use of secondary delays in a time series
fashion (for example, X(d−1), X(d), X(d+1), etc.) is shown to incorporate the faster responding
higher order dynamics.
Figure 3: Delay Identification and Sensitivity Analysis for A) Detailed display for one input (left), and
B) Overview for all configured inputs (right)
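The filtering heuristic of equation 6 can be sketched as follows (the λ values and correlation curve are illustrative; the curve contains a single-sample spurious spike at delay 2 and a broad true maximum near delay 6):

```python
import numpy as np

def smooth_and_pick_delay(rho, lam1=0.8, lam2=0.2):
    """Smooth the correlation curve per equation 6, then locate its maximum."""
    rho = np.asarray(rho, float)
    smoothed = rho.copy()
    for d in range(1, len(rho) - 1):
        smoothed[d] = lam1 * rho[d] + lam2 * (rho[d - 1] + rho[d + 1])
    return smoothed, int(np.argmax(np.abs(smoothed)))

rho = [0.0, 0.1, 0.9, 0.1, 0.5, 0.8, 0.85, 0.8, 0.4, 0.1]
smoothed, best = smooth_and_pick_delay(rho)   # spurious spike at d=2 is suppressed
```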
Sensitivity Analysis: Sensitivity is defined as the change in dependent variable (output) y for a unit
change in independent variable (input) x, or mathematically,
S_x = \frac{\Delta y}{\Delta x} \qquad (7)
In a multivariable system, a higher value of sensitivity indicates that change in that input has higher
influence on the output. The sum total of all sensitivities will equal 1 (one). Information on the
sensitivity of the inputs to the output indicates their relative importance. The sensitivity value at the
delay identified in the previous step can be used to exclude inputs that the output shows little or no
dependence on, in order to prevent training the NN over relatively insignificant data.
One technique is to exclude inputs whose individual sensitivities are small compared to the average
sensitivity. The sensitivity at a different delay value, the dependence on specific inputs, and the effect of
inclusion (and/or exclusion) of inputs from the overall network, can all be explored in a “what-if”
scenario using sensitivity analysis. It must be noted here that the underlying assumption is that the data
set was representative of the region of operation the network is being trained for.
A simplified linear model obtained by Partial Least Squares (PLS) is used for sensitivity computation.
PLS is an established mathematical technique to develop an inferential model using dimensionality
reduction. It is a multiple linear regression algorithm that avoids the ill conditioning problem associated
with inverting a matrix and the poorly conditioned solution that is sensitive to noise.
The model for computing the sensitivities is obtained using the standard PLS algorithm, for which there
are several good references [7, 8]. The input, whose sensitivity is to be calculated is perturbed by unit
amount while all other inputs are kept constant, and the change in output (from the PLS based model)
defines the sensitivity. Graphical and numeric editing capability facilitate the development of the best
NN configuration for the observed data. Of the twenty process inputs in the example of Figure 3B, the
delay identification and sensitivity analysis techniques automatically exclude the four insignificant
inputs shown crossed out.
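The sensitivity computation can be sketched with a small NIPALS-style PLS1 routine standing in for the product's PLS implementation (the data is synthetic; inputs 1 and 3 are constructed to drive the output, and the sensitivities are normalized to sum to 1 as stated above):

```python
import numpy as np

def pls_coefficients(X, y, n_components=2):
    """Minimal NIPALS PLS1: regression coefficients for mean-centered data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xk, yk = X.copy(), y.copy()
    W, P, c = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)            # weight vector for this component
        t = Xk @ w                           # scores
        tt = t @ t
        p = Xk.T @ t / tt                    # loadings
        ck = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)             # deflate X and y
        yk = yk - ck * t
        W.append(w); P.append(p); c.append(ck)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.solve(P.T @ W, c)   # B such that y ~ X @ B

def sensitivities(X, y, n_components=2):
    """Unit-perturbation sensitivities from the PLS model, normalized to sum to 1."""
    B = pls_coefficients(np.asarray(X, float), np.asarray(y, float), n_components)
    s = np.abs(B)                            # output change per unit input change
    return s / s.sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.05 * rng.normal(size=200)
s = sensitivities(X, y)                      # inputs 1 and 3 should dominate
```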
NEURAL NET TRAINING
The gradient descent based back propagation algorithm, with its conceptual simplicity and numerous
customizable modifications, appears to be the best technique for training. Post sensitivity analysis, the
data set is rearranged according to the selected inputs at their delays, to form the exemplars (one set of
input/output values) of the training set. An exemplar for the pth sample of an n input NN would be of the
form: [x_1(p − d_1), x_2(p − d_2), ..., x_n(p − d_n), y(p)], where d_i denotes the delay for the ith input.
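Building the delayed exemplars can be sketched as follows (array lengths and delays are illustrative):

```python
import numpy as np

def build_exemplars(inputs, delays, y):
    """Pair x_i(p - d_i) with y(p) for every p where all delayed inputs exist."""
    start = max(delays)                          # first p with all inputs valid
    X = np.column_stack(
        [np.asarray(x, float)[start - d : len(y) - d] for x, d in zip(inputs, delays)]
    )
    return X, np.asarray(y, float)[start:]

x1 = np.arange(10, dtype=float)                  # x1(p) = p
x2 = 10.0 * np.arange(10)                        # x2(p) = 10 p
y = np.arange(10, dtype=float)
X, Y = build_exemplars([x1, x2], delays=[2, 1], y=y)
# Row k is the exemplar [x1(p - 2), x2(p - 1)] with target y(p), for p = k + 2
```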
Uniform scaling via normalization equalizes the importance of the inputs, so that the engineering unit
range of inputs does not hamper learning. Since the hidden nodes are not affected (only input/output
node values are normalized), the forward pass can proceed with no need for scaling by simply denormalizing the weights at the end of training.
Training an NN involves two aspects: the quality of the
training algorithm itself, and its generalization capability. While the former relates to how well and how
fast the algorithm converges over the given data set (performance), the latter is concerned with how well
the trained NN will perform as a predictor over unseen data (robustness). This section explores these
two issues in some detail.
Training algorithm
As described earlier, training proceeds by changing the weights so as to minimize an error function
along the direction of the gradient. Standard back propagation, however, is notorious for its slow
convergence and the use of a fixed learning rate. Instead, the conjugate gradient method is used.
Conjugate Gradients: In this algorithm, the new direction uses a component of the previous direction,
and so convergence is much faster. The method produces a set of mutually conjugate directions such
that each successive step continually refines the direction towards the minimum; on a quadratic error
surface, successive minimizations along these conjugate directions reach the minimum in a finite number
of steps. The new direction vector from point i+1, h_{i+1}, is computed by adding the gradient at point
i+1, g_{i+1}, to the previous direction h_i scaled by a constant γ_i:
hi 1  g i 1  i hi
(8)
The scalar γ_i is defined by methods such as the Fletcher-Reeves expression [9]:

\gamma_i = \frac{g_{i+1} \cdot g_{i+1}}{g_i \cdot g_i} \qquad (9)
This direction is used in place of the gradient in equation 3. It is this choice of direction that results in
the rapid convergence of the conjugate gradient algorithm. This is illustrated in Figure 4 where the
algorithm converges within a few epochs whenever a new hidden node is added.
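The update of equations 8 and 9 can be sketched on a toy quadratic error surface J(W) = ½ WᵀAW (illustrative only; in training, g would be the backpropagated negative gradient):

```python
import numpy as np

def fletcher_reeves_direction(g_new, g_old, h_old):
    gamma = (g_new @ g_new) / (g_old @ g_old)   # equation 9
    return g_new + gamma * h_old                # equation 8

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                      # SPD Hessian of the toy error
w = np.array([4.0, -2.0])                       # initial weights
g = -(A @ w)                                    # negative gradient of J
h = g.copy()                                    # first direction: steepest descent
for _ in range(2):                              # n steps for an n-dim quadratic
    eta = (g @ h) / (h @ A @ h)                 # exact line-search learning rate
    w = w + eta * h
    g_new = -(A @ w)
    h = fletcher_reeves_direction(g_new, g, h)
    g = g_new
# Conjugate gradients reach the minimum of a 2-D quadratic in two steps.
```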
Learning rate: The second weight update term in equation 3, η, determines the distance that is traveled
along the current direction of error minimization. A fixed learning rate, if too small, takes a long time
to converge; if too large, it oscillates about the minimum and may even overshoot it. It is
therefore best to calculate the optimal learning rate that finds the minimum error at each epoch. Since a
weight vector, W, containing all the weights is defined, a direction vector, h, for minimization is
obtained, and the learning rate is a single parameter, this is the problem of minimizing a function
f(W + ηh) of one variable, η, i.e., minimizing along a line. Use is made of the algorithm by Brent [10],
wherein the minimum is bracketed by finding three points such that the function value (error J) at the
middle point is less than at its two neighbors.
Figure 4: Error vs. Epoch for training the NN with incremental number of hidden nodes. Spikes
represent re-initialization of the network when another hidden node is added
At each successive step the interval between the points is refined, until the desired accuracy is obtained.
The  value for the minimum J is then used as the learning rate for that epoch. A new learning rate is
evaluated at the next epoch.
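The per-epoch learning-rate search can be sketched with a simple golden-section refinement of a bracketed minimum (a simplification of Brent's method, which additionally uses parabolic interpolation; the error function along the line is illustrative):

```python
import numpy as np

def line_search(f, lo=0.0, hi=1.0, tol=1e-6):
    """Golden-section search for the eta minimizing f(eta) on [lo, hi]."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0           # golden-ratio shrink factor
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                        # minimum bracketed in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                                  # minimum bracketed in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return 0.5 * (a + b)

# Quadratic error along the search direction: J(eta) = (eta - 0.3)^2 + 1,
# so the optimal learning rate for this epoch is 0.3.
eta_opt = line_search(lambda eta: (eta - 0.3) ** 2 + 1.0)
```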
Direction Remembrance: The scalar term γ (equation 8) optimally defines how much the current
direction remembers the previous one. However, it might be the case that the current direction is not
suitable, or the algorithm may be trapped in a local minimum. The training algorithm handles this by
automatically turning direction remembrance on and off depending on whether the error is
improving. Setting γ = 0 results in starting with a brand new direction.
Network initialization: The network is initialized with random small non-zero values for the weights.
Randomness ensures lack of bias when training begins, while small values give larger freedom for the
weights to be modified and avoid saturation. The error value spikes in Figure 4 are a manifestation of
random weight initialization. However, the standard random number generators that come with most
programming language libraries have a number of problems: typically no more than 32,768 distinct
values can be generated, the sequence is periodic, and successive values are serially correlated [11]. The
generator described in [12] has been applied here. It increases the number of random possibilities to
714,025, and eliminates periodic repetition and serial correlation.
Generalization
Figure 5A plots the error of an NN as a function of the number of epochs, for two different sets of data.
One is the training set and the other is a validation set that is drawn from the same process as the
training set, but not used for training. As expected, the error for the training set decreases monotonically,
approaching an asymptote. However, the error for the validation set starts increasing after a certain
point. The reason for this is that, like all finite data based error minimization techniques, neural nets start
learning features specific to the training data set.
With NN, this is further compounded as a result of their inherent nonlinearity. The goal of training
though, is to learn to predict the output given real process inputs, and not to memorize the training set.
This is known as generalization. This still assumes that the distribution from which data is obtained (the
process) remains similar. The following techniques are implemented to satisfy the generalization
requirements:
Figure 5: Generalization with A) Cross-validation (left), and B) Incremental Hidden Nodes (right)
Cross-validation: When forming the data set, a certain portion of the data is automatically kept aside for
validation during the training phase. This is called the test set. It is imperative that the distributions of
the training set, the test set, and the entire data set are close to each other; for example, their means and
standard deviations should be within 5% of one another. At each epoch, while the weights are modified based on the error
on the training set, the test set is used to prevent the network from overfitting the training set. Data may
be distributed between the training and test sets in either a sequential or random fashion. The random
number generator described previously finds use in random distribution. The split of data between the
two sets is configurable; a value of around 80% for the training set is generally satisfactory.
Convergence Criteria: At each epoch, the error on the test set (test error) for the new set of weights is
calculated and compared with the best test error. Training convergence criteria are based on test error
reduction to guard against the overtraining phenomenon. In order to prevent training from stopping at
some random choice of weights for which the test error happens to be small, the algorithm runs for at
least a fixed number of epochs before establishing any minimum. Also, the training error is added to the
test error to define a stringent minimum total error condition. In this way it is made sure that both the
data sets have acceptable errors when the algorithm converges, and the weights at the best epoch are
picked at the end of training, as indicated in Figure 5A.
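The convergence bookkeeping can be sketched as follows (the error curves are synthetic, shaped like Figure 5A: training error keeps falling while test error eventually turns back up):

```python
import numpy as np

def best_epoch(train_err, test_err, min_epochs=5):
    """Best epoch by combined train + test error, after a minimum epoch count."""
    total = np.asarray(train_err, float) + np.asarray(test_err, float)
    return min_epochs + int(np.argmin(total[min_epochs:]))

epochs = np.arange(50)
train_err = 1.0 / (1.0 + epochs)                                # keeps decreasing
test_err = 0.5 / (1.0 + epochs) + 0.001 * (epochs - 20.0) ** 2  # rises again
picked = best_epoch(train_err, test_err)
```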
Incremental hidden nodes: Deciding the optimal number of nodes in the hidden layer remains a non-trivial task. Too small a number does a poor job of identification, while too large a number does
not generalize well. This is indicated graphically in Figure 5B, where overfitting occurs at the 5 th hidden
node. In this algorithm, hidden nodes increment from a small starting number to a maximum number
that equals the number of configured inputs. For each hidden node increment, the best weight vector that
corresponds to the minimum error is stored.
If the difference between errors for different number of hidden nodes is within a tolerance level, the NN
with a smaller number of hidden nodes is given preference. The algorithm then automatically picks the
weights for the best combination of train/test error (in engineering units) obtained. In Figures 4 and 5B, of the multiple
networks trained, the NN with four hidden nodes is automatically chosen. In this manner the
generalization of the network is maintained while exploring for the best possible network configuration.
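The selection rule can be sketched as follows (the error values are illustrative, with overfitting appearing at the fifth hidden node):

```python
def pick_hidden_nodes(errors, tolerance=0.01):
    """Prefer the smallest hidden-node count whose combined error is within
    the tolerance of the overall best error."""
    best = min(errors.values())
    for n in sorted(errors):                 # smallest networks considered first
        if errors[n] - best <= tolerance:
            return n

errors = {1: 0.30, 2: 0.12, 3: 0.055, 4: 0.050, 5: 0.070}
chosen = pick_hidden_nodes(errors)           # 3 nodes: within tolerance of best
```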
Regularization: The tradeoff between training and generalization is a manifestation of the criterion of
equation 3 which tries to achieve an accurate I/O mapping as measured by only the training data set.
Figure 6: Validation plots – Actual and Predicted vs Sample (left), and Actual vs Predicted (right)
Consequently, another technique to obtain better generalization is to constrain the NN to find a solution
that satisfies additional criteria. The method used in this algorithm, known as regularization by weight
decay, includes a penalty term on the weights as described by:
\tilde{J} = J + \nu \sum_{ij} w_{ij}^2 \qquad (10)

ν denotes the penalty factor and w_ij indexes all the weight terms in the NN. The additional penalty term
increases the robustness of the NN by encouraging the weights to be small, which in itself is
advantageous for learning. It must be noted that this applies only to the training set, hence, the test error
convergence criteria still remain valid.
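Equation 10 amounts to the following (the ν value and weights are illustrative):

```python
import numpy as np

def penalized_error(J_train, weight_matrices, nu=1e-3):
    """J~ = J + nu * sum of squared weights, applied to the training error only."""
    penalty = sum(float(np.sum(W ** 2)) for W in weight_matrices)
    return J_train + nu * penalty

weights = [np.array([[0.5, -0.25], [0.1, 0.0]]), np.array([0.2, -0.4])]
J_tilde = penalized_error(1.0, weights, nu=0.1)
```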
VERIFICATION
Once a model is developed, it is validated by applying real input data to it and comparing its output with
the actual process output. The validation/verification data set must be different from the training and test
data sets and should represent the data region the NN is expected to operate on. Figure 6 shows an
example with both visual comparison and statistical error computation of model validation results. For
example, if the root mean square error per scan (in engineering units) is not satisfactory, the model
should not be put online. The XY plot of the Actual vs. Predicted values in Figure 6 also provides
insight into the different modes of operation of the process. At times multiple models for the different
modes will be needed. Hence, even though a test set is used during training, verification should always
be carried out whenever additional data is made available.
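The statistical part of the comparison can be sketched as follows (the data values are illustrative):

```python
import numpy as np

def rms_error_per_scan(actual, predicted):
    """Root mean square prediction error per scan, in engineering units."""
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

rmse = rms_error_per_scan([10.0, 12.0, 11.0, 13.0], [10.5, 11.5, 11.0, 13.5])
```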
CONCLUSION
An approach for developing neural network based models for process prediction, with special
application to software sensors is presented. A suitable network configuration is obtained by identifying
the temporal relationship between the inputs and the outputs and evaluating sensitivities at the most
significant delays. Sensitivity analysis before model generation enables clearer understanding of process
interactions and allows process experts to do “what-if” analysis. The conjugate gradient algorithm with
several enhancements against over training results in fast and accurate model development. This
technology forms the background for the next generation of easy-to-use, integrated NN implemented in
a scalable process control system [13]. Keeping the algorithmic engine the same, this approach can
easily be extended to implement MIMO process modeling.
REFERENCES
1. Qin, S.J., “Neural Networks for Intelligent Sensors and Control – Practical Issues and some
Solutions”, in Progress in Neural Networks: Neural Networks for Control, 1995, edited by D. Elliott.
2. Ganesamoorthi, S., Colclazier, J., and Tzovla, V., “Automatic Creation of Neural Nets for use in
Process Control Applications”, in Proceedings of ISA Expo/2000, 21-24 August 2000, New Orleans,
LA.
3. Fisher Rosemount Systems. “Installing and Using the Intelligent Sensor Toolkit”. User Manual for
the ISTK on PROVOX Instrumentation, 1997.
4. Hornik, K., Stinchcombe, M., and White, H., “Multilayer Feedforward Networks are Universal
Approximators”, Neural Networks, 2:5, pp 359-366, 1989.
5. Cybenko, G., “Approximation by Superpositions of a Sigmoidal Function”, Math. Control, Signals,
and Systems, vol.2, pp 303-314, 1989.
6. Rumelhart, D. and McClelland, J., “Parallel Distributed Processing”, 1986, MIT Press, Cambridge,
MA.
7. Geladi, P. and Kowalski, B., “Partial Least Squares regression: a tutorial”, Analytica Chimica Acta,
185, pp 1-17, 1986.
8. Wold, S., Jellberg, S., Lundsted, T., Sjostrom, M., and Wold, H., “PLS modeling with latent
variables in two or more dimensions”, Proc. Frankfurt PLS Meeting, September, 1987.
9. Polak, E., “Computational Methods in Optimization”, 1971, Academic Press, New York.
10. Brent, R., “Algorithms for Minimization without Derivatives”, 1973, Prentice-Hall, Englewood
Cliffs, NJ.
11. Press, W. H., Flannery, B., Teukolsky, S., and Vetterling, W., “Numerical Recipes in C”, 1988,
Cambridge University Press, New York.
12. Masters, T., “Practical Neural Network Recipes in C++”, 1993, Academic Press, London, UK.
13. DeltaV™ Home Page: http://www.DeltaV.com.