Computers and Chemical Engineering 24 (2000) 2755–2764
www.elsevier.com/locate/compchemeng

A new approach for improved identification of measurement bias

Sriram Devanathan a, Derrick K. Rollins b,*, Stephen B. Vardeman c

a Department of Chemical Engineering, Iowa State University, Ames, IA 50011, USA
b Departments of Chemical Engineering and Statistics, Iowa State University, Ames, IA 50011, USA
c Departments of Statistics and Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, IA 50011, USA

Received 27 August 1998; received in revised form 28 August 2000; accepted 28 August 2000

Abstract

This work presents a technique that can completely and accurately identify measurement bias in cases where the method of Rollins and Davis (1992, 1993) cannot be used and where the method of Narasimhan and Mah (1987) fails to perform accurately. This technique makes use of information contained in the relationship between individual measurements and the corresponding nodal imbalance. The performance of this method is demonstrated on a problem from the literature that has been difficult for other methods to handle. In addition, this article discusses how the new technique can be used as a visual monitoring tool for identifying biased measured variables. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Measurement bias; Imbalance correlation strategy; Unbiased estimation technique

1. Introduction

In the chemical industry, measurements collected on process variables are subject to both large random errors and systematic errors (i.e. measurement biases). Accurate values of process variables are often required for the design of new processes, improvement of existing processes, accurate material accounting, and optimal process control. Hence, it is desirable that measured process variables be close to their true values (i.e. be accurate) and also satisfy the physical constraints that govern the process variables (by the laws of conservation).
In general, the mathematical reduction of random variation of measured process variables is broadly classified as ‘filtering’ or ‘smoothing’. When estimates (filtered/smoothed values) are required to satisfy the physical constraints, the task of obtaining such estimates is called data reconciliation (DR). However, in the presence of measurement biases, although the estimates may satisfy the physical constraints, they can still be very inaccurate. Hence, it is important to detect, identify, and eliminate the effect of biases in order to obtain accurate estimates of process variables.

* Corresponding author. Tel.: +1-515-2947642; fax: +1-515-2942689.

This article considers issues related to the identification of biases in measured process variables. The past four decades have witnessed the introduction of various statistical methods in chemical engineering research (see, for example, Reilly & Carpani, 1963) for the purpose of detecting and identifying biases in measured variables. Mah and Tamhane (1982) introduced the measurement test (MT), which has grown to be perhaps the most widely used statistical test in this context. However, it has been shown that when applied to a process with multiple measurements this test can have a high probability of type I errors (incorrect identification of unbiased variables) and low power (small probability of correct identification of biased variables) (Heenan & Serth, 1986).

Narasimhan and Mah (1987) proposed a serial compensation strategy (SCS). In this strategy, one measurement bias is identified at a time, then estimated and mathematically removed, before attempting to identify another bias. Rollins and Davis (1992) discussed the following undesirable characteristics of SCS: (1) it can have a large probability of reaching a wrong conclusion for measured variables that are unbiased when at least one variable is biased; and (2) estimates for measurement biases can be inaccurate.
0098-1354/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0098-1354(00)00626-8

Rollins and Davis (1992, 1993) developed a new approach for identifying measurement biases under steady state or pseudo steady state conditions and linear physical constraints. They called this method the unbiased estimation technique (UBET). Rollins and Davis (1992) presented results from a study of this method for various combinations of two non-zero measurement biases ($\delta$). UBET was illustrated on the process represented in Fig. 1, which has seven (7) mass flow variables and four (4) nodes (interconnecting units). As indicated in Rollins and Davis (1992), one of the limitations of UBET is its inability to pinpoint biased process variables for certain combinations of $\delta$s. Their work showed that SCS also performs poorly in these cases. In these situations UBET can narrow down the location to three variables, but is unable to make a more specific identification than ‘at least two of the three variables are biased’.

Phillips and Harrison (1993) presented a gross error detection and data reconciliation analysis for the context of experimental kinetics. Their work was based upon the modified iterative measurement test (MIMT) of Serth and Heenan (1986). Hence, the problems of high type I error probability and low power associated with the measurement test (as mentioned earlier) exist here too.

Tong and Crowe (1996) developed a new strategy for detection of gross errors using principal component analysis. The main focus of their work was the development of a method that remained effective when the assumption of normality was not valid. However, for certain combinations of biases their method does not appear to be capable of leading to complete identification (due to confounding of the effects of the multiple biases).
Furthermore, the principal component tests involve intensive computations in calculating eigenvalues and eigenvectors, which could be a drawback for some large processes requiring on-line detection.

In an effort to improve identification, this work presents a new strategy that makes use of the relationship between a nodal imbalance (i.e. a mass or energy balance residual) and the measured variables involved in the nodal balance. This technique is computationally simple and straightforward. In addition, this article will show that it can perform well in determining the specific locations of the measurement biases. We are calling this technique the ‘imbalance correlation strategy (ICS)’. Before presenting ICS, the next section reviews the relevant measurement and process models. This section is followed by a description of how ICS works using a process example. Next, the test statistic is presented. Finally, we discuss the results of a simulation study to evaluate the performance of ICS.

Fig. 1. Recycle process network used in the simulation study, taken from Narasimhan and Mah (1987).

2. Mathematical models

This section presents the statistical and physical models for a pseudo steady state process related to the work of Narasimhan and Mah (1987) and Rollins and Davis (1992, 1993). The notation of this section will be important to the introduction and understanding of the proposed ICS.
First, the statistical model (relating the measured and the true values) can be represented by

$$y_{ij} = \mu_i + \delta_{ij} + \lambda_{ij} + \varepsilon_{ij} \tag{1}$$

where

$$\lambda_{ij} \sim N(0, \sigma^2_{\lambda_i}), \tag{2}$$

$$\varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon_i}) \tag{3}$$

and

$$E[y_{ij}] = \mu_i + \delta_{ij} \tag{4}$$

and is subject to

$$A\mu = \gamma \tag{5}$$

with

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} \tag{6}$$

where $y_{ij}$ is the measured value of variable $i$ at the $j$th time instant; $\mu_i$ is the steady state true value of variable $i$; $\delta_{ij}$ is the measurement bias of variable $i$ at the $j$th time instant; $\lambda_{ij}$ is the true value of the random process deviation of variable $i$ from $\mu_i$ at the $j$th time instant; and $\varepsilon_{ij}$ is the random error of variable $i$ at the $j$th time instant. $A$ is a $q \times p$ matrix often called the constraint matrix and in this case (since the constraints are simply total mass balances taken around each node) the number of constraint equations $q$ is equal to the number of nodes. Eq. (5) represents the linear mass and energy conservation constraints and $\gamma$ represents the vector of process leaks. This article assumes that the $\varepsilon_{ij}$'s are normally distributed with mean 0 and known variance. Additionally, each variable is assumed to be independent at different values of $j$ (i.e. at different times). Finally, the $\varepsilon$'s are assumed to be independent of the $\lambda$'s.

Rollins and Davis (1992) showed how nonzero elements of $\gamma$ can be identified for steady state conditions. Hence, for simplicity we set $\gamma = 0$. In vector form, for the $j$th sampling time, Eq. (1) can be written as

$$y_j = \mu + \delta_j + \lambda_j + \varepsilon_j \tag{7}$$

where

$$y_j = \begin{bmatrix} y_{1j} \\ y_{2j} \\ \vdots \\ y_{pj} \end{bmatrix}, \quad \delta_j = \begin{bmatrix} \delta_{1j} \\ \delta_{2j} \\ \vdots \\ \delta_{pj} \end{bmatrix}, \quad \varepsilon_j = \begin{bmatrix} \varepsilon_{1j} \\ \varepsilon_{2j} \\ \vdots \\ \varepsilon_{pj} \end{bmatrix}, \quad \lambda_j = \begin{bmatrix} \lambda_{1j} \\ \lambda_{2j} \\ \vdots \\ \lambda_{pj} \end{bmatrix}. \tag{8}$$

Hence, on the average, $y_j$ will deviate from $\mu$ by $\delta_j$. The objective of a detection scheme is to determine if any of the elements of $\delta_j$ are non-zero. Similarly, the objective of an identification scheme is to determine which specific elements of $\delta_j$ are non-zero. The steady state Global Test (Reilly & Carpani, 1963) for the conditions represented by Eq. (1) can be used for detection. This test, like the nodal identification tests, is based on a linear transformation of $y_j$ (at any time instant $j$) to give the vector of nodal imbalances $s_j$ (as in a total mass balance). The transformed measurement model is given by Rollins, Cheng and Devanathan (1996) as

$$s_j = A y_j = A\mu + A\delta_j + A\lambda_j + A\varepsilon_j. \tag{9}$$

Let

$$A\lambda_j = \tau_j. \tag{10}$$

Substituting for $\tau_j$ in the expression for $s_j$, with $\gamma = 0$ in Eq. (5), we have

$$s_j = A\delta_j + \tau_j + A\varepsilon_j \tag{11}$$

with

$$\tau_j \sim N_q(0, \Sigma_\tau) \tag{12}$$

where $\Sigma_\tau$ characterizes the variability due to physical process changes. Note that

$$E[s_j] = A\delta_j, \quad j = 1, \ldots, n \tag{13}$$

and

$$\mathrm{Var}(s_j) = \Sigma_\tau + A \Sigma A^T \tag{14}$$

where $\Sigma$ is the variance–covariance matrix of $\varepsilon_j$. The model given in this section is basically the same model presented by Narasimhan and Mah (1987) but expanded to clearly show the effect of process variability (i.e. the inclusion of the $\lambda$'s). We would like to note that the proposed method is subject to the same conditions and assumptions as Narasimhan and Mah (1987). For example, $\delta_{ij}$ in Eq. (1) is not restricted to positive values.

3. The imbalance correlation strategy (ICS)

ICS will be illustrated using the seven (7) stream and four (4) node steady state process introduced by Narasimhan and Mah (1987). The process is shown in Fig. 1. For the conditions given in the previous section, the true total mass balance around the four nodes is

$$\mu_1 + \mu_4 + \mu_6 - \mu_2 = 0 \tag{15}$$

$$\mu_2 - \mu_3 = 0 \tag{16}$$

$$\mu_3 - \mu_4 - \mu_5 = 0 \tag{17}$$

$$\mu_5 - \mu_6 - \mu_7 = 0. \tag{18}$$

The transformation vector at time instant $j$ is specified as

$$s_j = A y_j = \begin{bmatrix} s_{Aj} \\ s_{Bj} \\ s_{Cj} \\ s_{Dj} \end{bmatrix} \tag{19}$$

where

$$A = \begin{bmatrix} 1 & -1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & -1 \end{bmatrix} \tag{20}$$

with

$$s_{Aj} \sim N(\delta_{1j} + \delta_{4j} + \delta_{6j} - \delta_{2j}, \sigma^2_{s_A}), \tag{21}$$

$$s_{Bj} \sim N(\delta_{2j} - \delta_{3j}, \sigma^2_{s_B}), \tag{22}$$

$$s_{Cj} \sim N(\delta_{3j} - \delta_{4j} - \delta_{5j}, \sigma^2_{s_C}) \tag{23}$$

and

$$s_{Dj} \sim N(\delta_{5j} - \delta_{6j} - \delta_{7j}, \sigma^2_{s_D}). \tag{24}$$

As mentioned before, the key feature of ICS lies in the recognition of a special relationship between a nodal imbalance and the measured variables associated with this node. Table 1 is helpful in demonstrating how ICS works. Rows in the table correspond to the four material balances around the four nodes in the recycle process shown in Fig. 1. Columns correspond to the seven process variables. In this table, the ‘×’s indicate the associations between streams and nodes. For example, variables 1, 2, 4, and 6 are associated with node A, but variables 3, 5, and 7 are not. Thus, a change in bias for a variable will also change its associated nodal balance(s) but will not change an un-associated nodal balance(s). Note that, under the assumption of steady state, while mean shifts will change the level of variables (i.e. the $y$'s) they will not change the level of nodal balances (i.e. the $s$'s), thus leaving the correlation of the $y$'s and $s$'s unaffected.

Table 1
Relationship between nodes and streams of the recycle process a

       y1   y2   y3   y4   y5   y6   y7
MBA    ×    ×    -    ×    -    ×    -
MBB    -    ×    ×    -    -    -    -
MBC    -    -    ×    ×    ×    -    -
MBD    -    -    -    -    ×    ×    ×

a ‘MBi’ means material balance on node i (where i = A, B, C, or D). ‘×’ means that the stream yi is associated with the node in the row.

Fig. 2. Elliptical 0.95 confidence region: no biases in any variables. The data tend to form one cluster and the sample correlation coefficient between any $s_k$ and any $y_i$ will tend to be close to zero.

The idea of using correlation to detect bias can be illustrated using an example with a single biased variable (i.e. where only one $\delta$ is non-zero). Suppose, for example, that $\delta_{2j} \ne 0$. Then,

$$E[y_{2j}] = \mu_2 + \delta_{2j} \tag{25}$$

and

$$E[y_{ij}] = \mu_i, \quad i \ne 2, \ i = 1, \ldots, 7. \tag{26}$$

Furthermore,

$$E[s_{Aj}] = E[y_{1j} + y_{4j} + y_{6j} - y_{2j}] = \mu_1 + \mu_4 + \mu_6 - \mu_2 + \delta_{1j} + \delta_{4j} + \delta_{6j} - \delta_{2j} = -\delta_{2j} \tag{27}$$

since $\mu_1 + \mu_4 + \mu_6 - \mu_2 = 0$ by Eq. (15) and $\delta_{1j} = \delta_{4j} = \delta_{6j} = 0$.
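The nodal imbalances of Eq. (19) and the expectation in Eq. (27) can be checked numerically. The following is a minimal sketch, assuming NumPy is available. The true flows in `mu` are hypothetical values (not taken from the paper) chosen only so that Eqs. (15)–(18) hold, and the bias size of 7 is illustrative.

```python
import numpy as np

# Constraint matrix A of Eq. (20): rows are nodes A-D, columns are streams 1-7.
A = np.array([
    [1, -1,  0,  1,  0,  1,  0],   # node A: y1 + y4 + y6 - y2
    [0,  1, -1,  0,  0,  0,  0],   # node B: y2 - y3
    [0,  0,  1, -1, -1,  0,  0],   # node C: y3 - y4 - y5
    [0,  0,  0,  0,  1, -1, -1],   # node D: y5 - y6 - y7
], dtype=float)

# Hypothetical true flows (an assumption for illustration), chosen so A @ mu = 0.
mu = np.array([100.0, 130.0, 130.0, 20.0, 110.0, 10.0, 100.0])
balance_residual = A @ mu

# A single bias in stream 2 (illustrative size); all other biases are zero.
delta = np.zeros(7)
delta[1] = 7.0
expected_imbalance = A @ delta          # E[s_j] = A delta_j, Eq. (13)

# Streams associated with each node, i.e. the nonzero pattern shown in Table 1.
nodes = ["A", "B", "C", "D"]
associated = {
    node: [int(i) + 1 for i in np.flatnonzero(row)]
    for node, row in zip(nodes, A)
}

print("A @ mu    =", balance_residual)
print("A @ delta =", expected_imbalance)
print("associations:", associated)
```

With a bias only in stream 2, the expected imbalance is −7 at node A and +7 at node B, while nodes C and D are unaffected, which is the pattern the derivation in this section relies on.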
Similarly,

$$E[s_{Bj}] = E[y_{2j} - y_{3j}] = \delta_{2j}, \tag{28}$$

$$E[s_{Cj}] = E[y_{3j} - y_{4j} - y_{5j}] = 0, \tag{29}$$

and

$$E[s_{Dj}] = E[y_{5j} - y_{6j} - y_{7j}] = 0. \tag{30}$$

Comparing Eqs. (25), (27) and (28), we see that when $\delta_{2j}$ changes from zero (0), $y_{2j}$, $s_{Aj}$, and $s_{Bj}$ will have shifts of the same magnitude in their expected values (with opposite sign for $s_{Aj}$). In other words, a change in the expected value of a variable will also cause a corresponding positive or negative change in the expected value of all mass balances containing the variable. Therefore, it appears that an evaluation of changes in the correlation of certain combinations of mass balances and flow rates could be the basis of an effective method to detect and identify biased variables.

This work proposes two ways to exploit the effect of measurement bias on the correlation of nodal material balances and mass flow rates for measurement bias identification. The first way is through hypothesis testing for a non-zero correlation coefficient. The second way is by visually monitoring plots of nodal balances (i.e. $s_{kj}$) versus measured flow rate variables (i.e. $y_{ij}$). A visual change in the correlation would provide diagnostic information on the occurrence of a measurement bias. The second way is discussed in more detail first, before the details of the first way are presented.

It is worth noting that $s_k$ and $y_i$ ($k$ = A, B, C, or D, and $i = 1, \ldots, 7$) are bivariate normal random variables. Therefore,

$$(s_k, y_i)\, \Sigma^{-1} \begin{pmatrix} s_k \\ y_i \end{pmatrix} \sim \chi^2_2 \tag{31}$$

where, in this expression, $\Sigma$ denotes the $2 \times 2$ variance–covariance matrix of the pair $(s_k, y_i)$ and the pair is measured as a deviation from its mean. Given that the upper fifth percentile for the $\chi^2_2$ distribution is 5.99, then

$$P\left[ (s_k, y_i)\, \Sigma^{-1} \begin{pmatrix} s_k \\ y_i \end{pmatrix} < 5.99 \right] = 0.95. \tag{32}$$

Note that the term in the brackets describes a region within a solid ellipse in $(s_k, y_i)$ space. Thus, when there are no biases, a scatter plot of $s_{kj}$ versus $y_{ij}$ should have most observations contained within the ellipse as shown in Fig. 2.

With the process in steady state, the plot in Fig. 3 indicates a change in measurement bias of one or more streams connected to node $k$ other than $y_i$. Although this plot shows a separation, the linear association between $s_k$ and $y_i$ is unaffected. In contrast, under steady state conditions, Fig. 4 indicates that with $y_i$ associated with node $k$, there is a shift in $y_i$ due to bias, which causes an increase in the linear association of $y_i$ with $s_k$. Yet, if $y_i$ changes due to a change in mean (i.e. $\mu_i$), the level of $s_k$ would not change and the clusters in a plot of $s_k$ versus $y_i$ would move in the direction of the change along the $y_i$ axis but not up or down, as we will show in a plot later. Thus, in this case also, the linear association between $s_k$ and $y_i$ would not be affected.

Fig. 3. Elliptical 0.95 confidence region: bias in variable other than $y_i$. Bias has occurred in a variable associated with the node $k$, but the variable is not $y_i$. The two clusters of points represent the data before the bias occurred and after the bias occurred. The sample correlation coefficient between $s_k$ and $y_i$ will tend to be close to zero.

In the next section, we describe a formal statistical test, based on the correlation between $s_k$ and $y_i$, that is useful for bias identification. However, one may want to use a nodal strategy for identification first (such as the one described by Rollins et al., 1996). Then, either simultaneously or after exhausting the use of the nodal strategy, one can apply ICS.

4. Tests of hypotheses

This section describes the ICS statistical test and shows the development of the null distribution of the test statistic. With $N_1$ representing the sampling times from one period (the reference population) and $N_2$ representing the sampling times from another period (the test population), we consider the following null and alternative hypotheses, respectively:

$H_0$: $\delta_{ij}$ has averaged $\delta_0$ in the time space represented by $N_1$ and $N_2$

versus

$H_a$: $\delta_{ij}$ has averaged $\delta_0$ in the time space represented by $N_1$, but has not averaged $\delta_0$ in the time space represented by $N_2$.

Note that these hypotheses are written in very general terms. That is, they represent any change in $\delta_{ij}$, which also includes a change from zero to a constant value or non-constant value in the test space. Let $\rho$, given by Eq. (33) below, be the correlation coefficient of $s_k$ and $y_i$ defined under $H_0$:

$$\rho = \mathrm{corr}(s_k, y_i) = \frac{E[(s_k - 0)(y_i - \mu_i)]}{[\mathrm{Var}(s_k)\,\mathrm{Var}(y_i)]^{1/2}} \tag{33}$$

where the numerator is the covariance of $s_k$ and $y_i$. When the variances and covariances are unknown, samples from the reference population may be used to estimate $\rho$. However, under the assumptions of Eq. (1), including $\Sigma$ and $\Sigma_\tau$ known, with $\delta_0 = 0$, $\rho$ can be determined theoretically. Note that Eq. (33) is simply the definition of the population correlation coefficient (see Devore, 1995) for $s_k$ and $y_i$. Also note that if the biases are not zero in the reference population, Eq. (33) has to be adjusted to the appropriate means to include the biases.

Let $R$ be the sample correlation coefficient for the pairs $(s_{kj}, y_{ij})$, $j = 1, \ldots, n$ with $n = N_1 + N_2$. If the measurement bias in the test space is different from the bias in the reference space, $R$ will tend to be significantly different from $\rho$, the true correlation of $s_k$ and $y_i$ under $H_0$. In order to find an approximate null distribution for $R$, we will use the ‘Z’ transform, which is well known in the statistical literature (see Devore, 1995).
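The remark above that $\rho$ can be determined theoretically when the variances are known can be illustrated with a short calculation. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes $\delta_0 = 0$, no process variation (so $\Sigma_\tau = 0$ and $\mathrm{Var}(y_j) = \Sigma$), and uses $\Sigma = I$ as an example. Under these assumptions $\mathrm{Cov}(s_j, y_j) = A\Sigma$ and $\mathrm{Var}(s_j) = A \Sigma A^T$. The function name `theoretical_rho` is hypothetical.

```python
import numpy as np

# Constraint matrix A of Eq. (20): rows are nodes A-D, columns are streams 1-7.
A = np.array([
    [1, -1,  0,  1,  0,  1,  0],
    [0,  1, -1,  0,  0,  0,  0],
    [0,  0,  1, -1, -1,  0,  0],
    [0,  0,  0,  0,  1, -1, -1],
], dtype=float)


def theoretical_rho(A, Sigma):
    """Correlation of every (s_k, y_i) pair, assuming y = mu + eps with
    Var(eps) = Sigma, zero bias and no process variation (Sigma_tau = 0)."""
    cov_sy = A @ Sigma                      # Cov(s_k, y_i), shape (q, p)
    var_s = np.diag(A @ Sigma @ A.T)        # Var(s_k)
    var_y = np.diag(Sigma)                  # Var(y_i)
    return cov_sy / np.sqrt(np.outer(var_s, var_y))


rho = theoretical_rho(A, np.eye(7))
print(np.round(rho, 3))
```

With $\Sigma = I$, the node A imbalance sums four unit-variance measurements, so each associated stream has correlation $\pm 1/\sqrt{4} = \pm 0.5$ with $s_A$ (negative for the outlet stream 2), and un-associated streams have correlation zero.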
If $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ is a random sample of size $n$, sufficiently large for the central limit theorem to apply (see Devore, 1995), from a bivariate normal distribution under $H_0$, with correlation $\rho$, and $R$ is the sample correlation coefficient, let us define $Z$, $\zeta$, and $\nu$ by

$$Z = \frac{1}{2} \log_e \frac{1 + R}{1 - R} = \mathrm{arctanh}[R] \tag{34}$$

$$\zeta = \frac{1}{2} \log_e \frac{1 + \rho}{1 - \rho} = \mathrm{arctanh}[\rho] \tag{35}$$

$$\nu^2 = (n - 3)^{-1} \tag{36}$$

Then $Z$ is approximately $N(\zeta, \nu^2)$ (see Graybill, 1976). The test statistic

$$T_s = \frac{Z - \zeta}{\nu} = (Z - \zeta)\sqrt{n - 3} \tag{37}$$

is approximately $N(0, 1)$ under $H_0$. Rewriting Eq. (37) using Eqs. (34) and (35) gives

$$T_s = (\mathrm{arctanh}[R] - \mathrm{arctanh}[\rho])\sqrt{n - 3}. \tag{38}$$

Fig. 4. Elliptical 0.95 confidence region: bias in variable $y_i$ where $y_i$ is a stream that is connected to node $k$. The two clusters represent data before and after the bias occurs. The absolute value of the sample correlation coefficient between $s_k$ and $y_i$ will not tend to be close to zero.

Therefore, the ICS test is: reject $H_0$: $\delta_{ij} = \delta_0 \ \forall j$ if and only if $T_s \ge z_\alpha$ (for cases where $\rho > 0$), or reject $H_0$: $\delta_{ij} = \delta_0 \ \forall j$ if and only if $T_s \le -z_\alpha$ (for cases where $\rho < 0$), where $\alpha$ is the significance level of the test and $z_\alpha$ is the $100(1 - \alpha)$th percentile of the standard normal distribution.

When nodal strategies fail to completely identify all measurement biases, there is a set of variables declared to be potentially biased. A table like Table 1 can then be used to select $(s_k, y_i)$ pairs to be tested using the test given above. For example, suppose that the conclusion after implementing nodal strategies for the process in Fig. 1 is that at least two of the three variables 1, 6, and 7 are biased. Then the pairs to be tested are $(s_A, y_1)$, $(s_A, y_6)$, $(s_D, y_6)$, and $(s_D, y_7)$.

Note that ICS is actually a test that detects a change in correlation structure between the reference set (the set on which $\rho$ is based) and the test set (the set from which $R$ is determined). When a change in correlation structure occurs due to a change in bias, this test is designed to detect it. This change in bias could be from no bias to either a negative or positive bias. It could also be a change from the level of bias in the reference set. In addition, a change in the mean value of $y_i$ satisfying the model of Eq. (1) (i.e. steady state) would not change $s_k$. Thus, such a change would not alter the correlation structure between $y_i$ and $s_k$, and so would not give a false identification of a change in measurement bias. However, the correlation between $y_i$ and time would change, making this relationship ineffective at distinguishing changes in means from changes in biases. We illustrate this idea later when we describe ICS visual monitoring. The next section presents the simulation study to evaluate ICS performance.

Table 2
SCS and UBET results (Table 5) from Rollins and Davis (1992) with $\delta_i = 7$, $\delta_j = 4$, $\alpha = 0.05$, and $\Sigma = I$ a

i   j   AVTI (SCS)   OPF (SCS)   OPF (UBET)
1   2   0.0138       0.9862      0.9588
1   3   0.0116       0.9884      0.9749
1   4   0.0188       0.9812      0.9900
1   5   0.0827       0.9226      0.9881
1   6   0.8733       0.1550      1, 6, 7**
1   7   1.0924       0.0000      1, 6, 7**
2   3   1.0908       0.0000      2, 3, 4**
2   4   0.9131       0.1071      2, 3, 4**
2   5   0.0836       0.9276      0.9796
2   6   0.0204       0.9799      0.9658
2   7   0.0135       0.9865      0.9830
3   4   0.9137       0.1074      2, 3, 4**
3   5   0.0816       0.9294      0.9866
3   6   0.0190       0.9813      0.9567
3   7   0.0128       0.9872      0.9679
4   5   1.1041       0.0000      4, 5, 6**
4   6   0.5426       0.0000      4, 5, 6**
4   7   0.0527       0.9473      0.9900
5   6   1.0217       0.0249      4, 5, 6**
5   7   0.0960       0.9046      0.9726
6   7   0.9198       0.1063      1, 6, 7**

a **, At least two of the three measurements are biased.

5. Results of simulation studies

The basic purpose of creating ICS is to obtain accurate identification of biased variables in cases where other gross error detection (GED) methods cannot perform well.
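Before turning to the simulations, the test of Eqs. (34)–(38) can be sketched in code. This is a minimal illustration under stated assumptions rather than the authors' implementation. The function name `ics_test` is hypothetical, the lower-tail rule for $\rho < 0$ is read here as $T_s \le -z_\alpha$, and the simulated data assume $\Sigma = I$, no process variation, hypothetical true flows, and a known $\rho = 0.5$ for $(s_A, y_1)$ (since $\mathrm{Cov}(s_A, y_1) = 1$ and $\mathrm{Var}(s_A) = 4$ under those assumptions). Larger sample sizes than in the paper are used so that the outcome of the demonstration is not borderline.

```python
import math
from statistics import NormalDist

import numpy as np


def ics_test(s, y, rho, alpha=0.05):
    """One-sided ICS test of Eq. (38) for one (s_k, y_i) pair.

    s, y : pooled reference + test observations (n = N1 + N2 values each)
    rho  : correlation of s_k and y_i under H0
    """
    n = len(s)
    r = float(np.corrcoef(s, y)[0, 1])                   # sample correlation R
    ts = (math.atanh(r) - math.atanh(rho)) * math.sqrt(n - 3)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)          # 100(1 - alpha)th percentile
    # Upper tail when rho > 0, lower tail when rho < 0 (assumed reading).
    reject = ts >= z_alpha if rho > 0 else ts <= -z_alpha
    return ts, reject


# Deterministic check: these vectors have R = 1/sqrt(2), so Ts is ~0 when rho matches.
s_demo = np.array([2.0, 0.0, 0.0, -2.0])
y_demo = np.array([1.0, -1.0, 1.0, -1.0])
ts_null, reject_null = ics_test(s_demo, y_demo, rho=1.0 / math.sqrt(2.0))

# Simulated check (assumed setup): bias of 7 enters stream 1 in the test period.
rng = np.random.default_rng(1)
node_a = np.array([1.0, -1.0, 0.0, 1.0, 0.0, 1.0, 0.0])          # node A row of Eq. (20)
mu = np.array([100.0, 130.0, 130.0, 20.0, 110.0, 10.0, 100.0])   # hypothetical flows
y = mu + rng.normal(size=(100, 7))     # first 50 rows: reference, last 50: test
y[50:, 0] += 7.0
s_a = y @ node_a
ts_bias, reject_bias = ics_test(s_a, y[:, 0], rho=0.5)

print("no-change example:", ts_null, reject_null)
print("biased y1 example:", ts_bias, reject_bias)
```

In the second example both $s_A$ and $y_1$ shift together when the bias enters, which raises the pooled sample correlation well above 0.5 and drives $T_s$ far into the upper tail.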
The specific purpose of our simulation study was to determine ICS performance for the cases presented by Rollins and Davis (1992) that gave their technique and the technique of Narasimhan and Mah (1987) difficulty. These cases are shown in Table 2, which contains all the combinations of two biases for the recycle process in Fig. 1. The problem combinations are identified with ‘**’. The objective of our simulation study was to show that ICS could achieve excellent performance for these problem combinations.

Our evaluation used the same conditions and assumptions originally used by Narasimhan and Mah (1987) (i.e. the model is given by Eq. (1) and the values of relevant parameters are given in Table 2). However, although we could have obtained $\rho$ theoretically in this study, we assume that $\rho$ (which could be different for each $(s_k, y_i)$ pair) is unknown. Hence, this study estimated $\rho$ from data. Note that, by using the known $\rho$, we could run the analysis with the same data sets as Rollins and Davis (1992) and Narasimhan and Mah (1987). By estimating $\rho$ for each case we can evaluate this more difficult situation, which will likely represent the common application. Thus, for each simulated case, we create a reference data set of size $N_1$ with $\delta = 0$ to estimate $\rho$ and a test data set of size $N_2$. In this study, $N_2$ was fixed at 10 while $N_1$ took the values 5, 10 and 20, so that the effect of the reference sample size on performance could also be evaluated. Each simulation consisted of generating data for each of the process variables for a single combination of biases and then using ICS to identify the biased variables. In this manner 10 000 trials of simulated data were run for each result obtained.

We used two measures of performance for ICS.
The first one, given below, is a measure of the technique's ability to correctly identify a particular biased variable $i$ and is called the power (denoted by $P_i$, where $i$ is the variable number):

$$P_i = \frac{\text{number of nonzero } \delta_i\text{'s correctly identified}}{\text{number of nonzero } \delta_i\text{'s simulated}} \tag{39}$$

The second performance measure is called the average type I error (AVTI) and indicates the technique's tendency toward misidentification of unbiased variables. AVTI is defined as

$$\mathrm{AVTI} = \frac{\text{number of zero } \delta_i\text{'s wrongly identified}}{\text{total number of simulations}} \tag{40}$$

Thus, for the technique to perform well, one would want $P_i$ to be high (near one) and AVTI to be low (near zero). Two levels of $\alpha$ were used in this study, 0.30 and 0.05.

Tables 3–8 show results from the simulation study. The first two columns in these tables give the locations of the two biases ($i$ and $j$), the third and fourth columns give the corresponding power values ($P_i$ and $P_j$), and the fifth column shows AVTI.

Table 3
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   0.98   0.0016
1   7   0.93   0.44   0.0008
6   7   1.00   0.97   0.0009
2   3   0.90   0.37   0.0012
2   4   0.99   0.37   0.0012
3   4   0.99   0.29   0.0016
4   5   1.00   0.95   0.0005
4   6   1.00   0.96   0.0004
5   6   0.95   0.29   0.0025

a N1 = 5; N2 = 10; α = 0.05; δi = 7; δj = 4.

Table 4
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   1.00   0.0006
1   7   0.99   0.67   0.0007
6   7   1.00   1.00   0.0002
2   3   0.99   0.59   0.0004
2   4   1.00   0.60   0.0004
3   4   1.00   0.48   0.0008
4   5   1.00   1.00   0.0003
4   6   1.00   1.00   0.0004
5   6   1.00   0.47   0.0017

a N1 = 10; N2 = 10; α = 0.05; δi = 7; δj = 4.

Table 5
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   1.00   0.0005
1   7   1.00   0.81   0.0002
6   7   1.00   1.00   0.0002
2   3   1.00   0.75   0.0003
2   4   1.00   0.75   0.0001
3   4   1.00   0.63   0.0006
4   5   1.00   1.00   0.0001
4   6   1.00   1.00   0.0002
5   6   1.00   0.62   0.0004

a N1 = 20; N2 = 10; α = 0.05; δi = 7; δj = 4.

Table 6
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   0.99   0.0121
1   7   0.99   0.79   0.0094
6   7   1.00   0.99   0.0078
2   3   0.99   0.71   0.0066
2   4   1.00   0.71   0.0071
3   4   1.00   0.62   0.0131
4   5   1.00   1.00   0.0057
4   6   1.00   1.00   0.0057
5   6   1.00   0.62   0.0133

a N1 = 5; N2 = 10; α = 0.30; δi = 7; δj = 4.

Table 7
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   1.00   0.0067
1   7   1.00   0.92   0.0067
6   7   1.00   1.00   0.0041
2   3   1.00   0.86   0.0023
2   4   1.00   0.87   0.0029
3   4   1.00   0.79   0.0083
4   5   1.00   1.00   0.0030
4   6   1.00   1.00   0.0036
5   6   1.00   0.79   0.0084

a N1 = 10; N2 = 10; α = 0.30; δi = 7; δj = 4.

Table 8
ICS results a

i   j   Pi     Pj     AVTI
1   6   1.00   1.00   0.0023
1   7   1.00   0.96   0.0031
6   7   1.00   1.00   0.0018
2   3   1.00   0.94   0.0028
2   4   1.00   0.94   0.0023
3   4   1.00   0.88   0.0056
4   5   1.00   1.00   0.0023
4   6   1.00   1.00   0.0018
5   6   1.00   0.88   0.0045

a N1 = 20; N2 = 10; α = 0.30; δi = 7; δj = 4.

As mentioned earlier, these combinations are the cases for which Rollins and Davis (1992) could not completely identify the biases. However, as shown, ICS accurately identifies the biases for these cases. (Tables 3–5 show results for $\alpha = 0.05$.) For example, for $i = 1$ and $j = 6$, Table 3 shows that $P_1 = 1.00$, $P_6 = 0.98$ and AVTI is 0.0016. Going down the columns in Table 3, for certain combinations the power values are low (e.g. $i = 2$, $j = 3$, and $P_j = 0.37$). However, upon increasing $N_1$ (from 5 to 10, and then to 20), Tables 4 and 5 show that the power increases significantly. Additionally, AVTI, which is already low in Table 3, decreases even further in Tables 4 and 5.

The second set of tables (Tables 6–8) is for $\alpha = 0.30$, as opposed to 0.05 for the previous tables. These tables show the same trends with an increase in $N_1$. Additionally, both power and AVTI (which, however, is still quite low) increase with $\alpha$, as expected. Notice that $P_i$ and $P_j$ increase significantly going from Table 6 to Table 7 to Table 8 as $N_1$ increases from 5 to 10 to 20.
Also note that for $N_1 = 20$, the power is very high and the AVTI is low.

Finally, as explained earlier, ICS can also be implemented through on-line visual analysis (i.e. visual monitoring). Consider the simple case of $\delta_1 = 7.0$ and $\delta_j = 0.0$ for $j \ne 1$ illustrated in Fig. 5. Recall that node A has three inlet streams (1, 4, and 6) and one outlet stream (2). Based on the explanations given earlier (comparison of Fig. 2 with Fig. 3), the inference obtained from the plots of $s_A$ versus $y_2$, $y_4$, and $y_6$ of Fig. 5 is that bias has occurred in a variable associated with node A other than $y_2$, $y_4$, and $y_6$. The only choice is $y_1$, which is confirmed by the plot of $s_A$ versus $y_1$ of Fig. 5, which shows a significant change in the correlation. Thus, the occurrence of a measurement bias can be detected by on-line plots such as $s_A$ versus $y_i$ and by comparing the cluster of sampled observations to the corresponding elliptical confidence region.

In Fig. 6 we illustrate the insensitivity of ICS to changes in means. In this figure mean shifts have occurred in $y_1$ and $y_6$ (steady state is maintained) but the correlations of $s_A$ and the $y$'s have not changed. Thus, the ICS plots support the correct conclusion of no bias when process variables change due to shifts in means. In contrast, if one tried to infer a change in bias from changes in the time series plots of the $y$'s, a shift in the mean of $y$ could easily lead to a false conclusion of bias, as illustrated in Fig. 7. In this figure, the $y_1$ data used in Fig. 5 ($\delta_1$ goes from 0 to 20 and $\mu_1 = 100$) and Fig. 6 ($\delta_1 = 0$ and $\mu_1$ goes from 100 to 120) are plotted against time. Although $y_1$ is inherently different in these two cases (they have different means and different biases), its plot is the same in Fig. 7. Hence, it is not possible to know accurately whether the change in a process variable is due to a mean shift or a change in bias from its time series plot alone.

Fig. 5. Plots of $s_A$ vs. $y_2$, $y_4$, $y_6$, and $y_1$. Bias has occurred in $y_1$. The open circles are before the bias entered and the solid circles are after the bias entered the process. The separation in $s_A$ vs. $y_1$ shows the effect of the bias in shifting the mean of $s_A$ and $y_1$ (i.e. a change in their correlation structure). The other plots show only a shift in $s_A$.

Fig. 6. Plots of $s_A$ vs. $y_2$, $y_4$, $y_6$, and $y_1$ showing a change in the steady state level due to a change in the means of $y_1$ and $y_6$. The open circles are at the same conditions as Fig. 5 and the solid circles represent the new steady state from the mean changes. The shifts in the means of $y_1$ and $y_6$ are detectable but there is no shift in $s_A$ for any plot. The indication is that no change in bias has occurred in these streams associated with node A because their correlation structures are unchanged.

6. Concluding remarks

In this work, we have presented a new approach, ICS, for the identification of measurement bias in linear pseudo steady state processes. ICS is easy to implement and is not computationally intensive. For difficult combinations of measurement biases, this approach was shown to be capable of accurate identification where other strategies have failed to perform well. Thus, if a diagnostic analysis cannot reach an accurate or specific conclusion about the location of a biased variable, ICS can be quite useful. In addition, one could also use ICS as a mathematical or visual on-line monitoring method, which could aid in the early detection and identification of biased variables. Visual monitoring would simply consist of plotting the mass or energy balance residuals against the associated process variables and looking for linear trends. If one has a fixed reference set of data of sufficient size, ICS can identify, on-line, bias for any variable that differs from its value in the reference set.
However, one could also set up a moving window mathematical on-line monitoring scheme with $N_1$ equal to the sample size of the reference set and $N_2$ equal to the sample size of the test set, as we did in the simulation study. $N_2$ would contain the most current data and $N_1$ would contain data from the past, beyond the $N_2$ data. These two sets do not necessarily have to be close together in time. The amount of data needed for high accuracy will depend on the process variability, sampling error, and sampling frequency. These values affect the window sizes (i.e. $N_1$ and $N_2$). Trade-offs in accuracy may be required to identify frequently changing biases, which require smaller window sizes. However, in the common situation, given advancements in computer and sampling technology, large data sets with small window sizes should be obtainable. In these cases, a moving window strategy should work well for ICS. For off-line analysis, where the window sizes must be fairly large (biases are assumed to occur slowly), the periods do not have to be very close in time and one could be somewhat conservative in their selection. In other situations, engineering judgement and knowledge could be used to select the periods based on historical considerations or some other diagnostic methodology.

Fig. 7. Plot of $y_1$ vs. time showing a shift in the mean or bias of stream 1. This is the $y_1$ used in Fig. 5 (a shift in bias of $y_1$) and Fig. 6 (a shift in the mean of $y_1$). The open circles are before the change and the solid circles are after the change. This graph shows the inability of time series plots of process variables to distinguish between shifts in means and shifts in biases and the superiority of plots like those in Figs. 5 and 6.
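The moving window scheme described in the concluding remarks can be sketched as follows. This is an illustrative sketch under stated assumptions, not a prescribed implementation: the function names, the optional `gap` between the two windows, the window sizes of 50, and the simulated data ($\Sigma = I$, no process variation, hypothetical true flows, a bias of 7 entering stream 1 at sample 120) are all assumptions made for the example. As in the simulation study, $\rho$ is estimated from the reference window, and the lower-tail rule for negative $\rho$ is read as $T_s \le -z_\alpha$.

```python
import math
from statistics import NormalDist

import numpy as np


def ics_statistic(s_ref, y_ref, s_test, y_test):
    """Return (Ts, rho_hat); rho is estimated from the reference window only."""
    rho_hat = float(np.corrcoef(s_ref, y_ref)[0, 1])
    s_all = np.concatenate([s_ref, s_test])
    y_all = np.concatenate([y_ref, y_test])
    r = float(np.corrcoef(s_all, y_all)[0, 1])    # R from the pooled n = N1 + N2 pairs
    ts = (math.atanh(r) - math.atanh(rho_hat)) * math.sqrt(len(s_all) - 3)
    return ts, rho_hat


def moving_window_alarms(s, y, n1, n2, gap=0, alpha=0.05):
    """Slide the (reference, test) windows along the record and collect the
    sample indices (window end points) at which the ICS test rejects H0."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    alarms = []
    for end in range(n1 + gap + n2, len(s) + 1):
        test = slice(end - n2, end)                        # most recent N2 samples
        ref = slice(end - n2 - gap - n1, end - n2 - gap)   # older N1 samples
        ts, rho_hat = ics_statistic(s[ref], y[ref], s[test], y[test])
        reject = ts >= z_alpha if rho_hat > 0 else ts <= -z_alpha
        if reject:
            alarms.append(end)
    return alarms


# Assumed demonstration data: bias of 7 enters stream 1 at sample 120.
rng = np.random.default_rng(0)
node_a = np.array([1.0, -1.0, 0.0, 1.0, 0.0, 1.0, 0.0])          # node A row of Eq. (20)
mu = np.array([100.0, 130.0, 130.0, 20.0, 110.0, 10.0, 100.0])   # hypothetical flows
y = mu + rng.normal(size=(200, 7))
y[120:, 0] += 7.0
s_a = y @ node_a

alarms = moving_window_alarms(s_a, y[:, 0], n1=50, n2=50)
print("alarm indices:", alarms)
```

The window ending at sample 170 has an unbiased reference set and a fully biased test set, which is the situation the test is designed for. Shorter windows react faster but give noisier estimates of $\rho$, which is the accuracy trade-off noted above.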
Acknowledgements

We would like to acknowledge partial support for this project by the National Science Foundation under grant CTS-9453534, and Meiyu Shen and Molly McNaughton for helping with the final draft.

7. Notation

A       q×p matrix representing process physical constraints
AVTI    average type I error
I       identity matrix
N1      sample size taken from the reference population
N2      sample size taken from the test population
P_i     power for variable i
P       probability
p       number of process variables
q       number of process constraint equations
R       random variable representing the sample correlation coefficient
s_Aj    mass balance on node A at the jth time instant
TS      test statistic for the ICS hypothesis test
y_ij    measured value of variable i at the jth time instant
Z       Fisher's 'Z' transform of R
Z_α     the 100(1−α)th percentile of the standard normal distribution

Greek letters

δ_ij    measurement bias of variable i at the jth time instant
ε_ij    random error of variable i at the jth time instant
γ       vector of process leaks
λ_ij    true value of process deviation of variable i from μ_i at the jth time instant
μ_i     true value of variable i
ν²      variance of Z
ρ       true value of correlation coefficient
σ_i²    variance of ε_ij
Σ       variance-covariance matrix for ε_j
Σ_τ     variance-covariance matrix for τ_j
τ_j     vector representing the effects of process deviation

References

Devore, J. L. (1995). Probability and statistics for engineering and the sciences. Albany, NY: Duxbury Press.
Graybill, F. A. (1976). Theory and application of the linear model. Pacific Grove, CA: Wadsworth and Brooks/Cole.
Heenan, W. A., & Serth, R. W. (1986). Detecting errors in process data. Chemical Engineering, 99.
Mah, R. S. H., & Tamhane, A. C. (1982). Detection of gross errors in process data. American Institute of Chemical Engineering Journal, 28 (5), 828.
Narasimhan, S., & Mah, R. S. H. (1987). Generalized likelihood ratios for gross error identification. American Institute of Chemical Engineering Journal, 33 (9), 1514.
Phillips, A. G., & Harrison, D. P. (1993). Gross error detection and data reconciliation in experimental kinetics. Industrial Engineering and Chemical Research, 32 (11), 2530.
Reilly, P. M., & Carpani, R. E. (1963). Application of statistical theory of adjustments to material balances. Thirteenth Canadian chemical engineering conference. Montreal, Quebec.
Rollins, D. K., & Davis, J. F. (1992). Unbiased estimation of gross errors in process measurements. American Institute of Chemical Engineering Journal, 38 (4), 563.
Rollins, D. K., & Davis, J. F. (1993). Gross error detection when variance-covariance matrices are unknown. American Institute of Chemical Engineering Journal, 39 (8), 1335.
Rollins, D. K., Cheng, Y., & Devanathan, S. (1996). Intelligent selection of hypothesis tests to enhance gross error identification. Computers and Chemical Engineering, 20 (5), 517.
Serth, R. W., & Heenan, W. A. (1986). Gross error detection and data reconciliation in steam metering systems. American Institute of Chemical Engineering Journal, 32 (5), 733.
Tong, H., & Crowe, C. M. (1996). Detection of gross errors in data reconciliation by principal component analysis. American Institute of Chemical Engineering Journal, 41 (7), 1712.