JOURNAL OF APPLIED MATHEMATICS AND DECISION SCIENCES, 8(1), 15–32 c 2004, Lawrence Erlbaum Associates, Inc. Copyright SUR Models Applied to an Environmental Situation With Missing Data and Censored Values ROSS SPARKS† ross.sparks@csiro.au CSIRO Mathematical and Information Sciences, Locked Bag 17, North Ryde, NSW 1670, Australia Abstract. This paper develops methodology for predicting faecal coliform values in waterways when some of the data are missing, and some of the data are left-censored. Such predictions are important in predicting when bathing is safe in specific areas of Sydney harbor. The approach taken in the paper makes use of spatial information to improve these predictions. Keywords: EM-algorithm, Generalized least squares, Left-censoring, Parameter estimation, Simulation and Spatio-temporal modelling. 1. Introduction The Seemingly Unrelated Regression (SUR) model is well-known in the econometric literature (Zellner [19], Srivastava and Giles [17], Greene [4]), but is less known else where. Its benefits have been explored by several authors (Zellner [20], Kmenta and Gilbert [6], Revankar [13] & [14], Mehta and Swamy [10], Dwivedi and Srivastava [3], Maeshiro [9], Blatten and George [1], Kurata and Kariya [7]). More recently the SUR model is being applied in agricultural economics (O’Donnell, Shumway and Ball [11], Wilde, McNamara, and Ranney [18]), aquaculture economics (SamonteTan and Davis [15]), and forestry for modelling tree growth (Hasenauer, Monserud and Gregoire [5]). Its application in the natural sciences is likely to increase once scientists in these disciplines are exposed to its potential. Liu and Wang [8] demonstrated to natural scientists the superiority of SUR model over the unrelated model that uses ordinary least squares to estimate parameters. This paper illustrates how the SUR model could be used to fit spatiotemporal data in an environmental setting. It starts by introducing the SUR model and presents some of its properties when covariances are known. † Requests for reprints should be sent to Ross Sparks, CSIRO Mathematical and Information Sciences, Locked Bag 17, North Ryde, NSW 1670, Australia. 16 R. SPARKS It then examines how regression parameters will be estimated if covariances are unknown. A simulation study is used to compare SUR model with unrelated regression models. Missing data is fairly common in the applications cited in this paper. Thus the paper indicates how the EM algorithm (Dempster, Laird, and Rubin [2]) can be used to give Maximum Likelihood Estimates of regression and dispersion parameters. A simulated examples illustrate the advantage of using all the data to estimate parameters when some of the responses are missing. The applications considered also have left-censored values. This paper illustrates how the EM algorithm is used to overcome this incomplete data problem. A simulation study demonstrates the methodology. Followed by a detailed consideration of an example of application. 2. Introduction to the SUR Model Assume that you have p response variables each with n observations denoted by vectors y1 , y2 , ···, yp with associated explanatory variables X1 , X2 , ·· ·, Xp , respectively. One way of fitting these models is to treat them as unrelated multiple regression models of the form of yj = X j βj + e j (1) where βj is a vector of unknown regression parameters and ej is a vector of random errors with each element having variance σj2 , for j = 1, 2, · · ·, p. Let X1 0 . . 0 y1 β1 e1 0 X2 . . 0 y2 β2 e2 . . . , y = . , β = . , e = X= . (2) . . . . . . . . 0 0 . . Xp yp βp ep 0 E(ei ej ) = σij I where σij 6= 0, E(ee0 ) = Σ = I ⊗ ∆ , where ∆ = {σij } is the covariance matrix capturing the variances and covariances of the random error terms of (1), then the SUR form of this model is y = Xβ + e (3) When ∆ is known, then SUR formulation of the regression models produces more efficient regression parameter estimates using generalized least squares estimates given by 17 SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION 0 0 β̂ = (X Σ−1 X)−1 X Σ−1 y, (4) than the ordinary least squares estimates in the unrelated form of the model given by 0 0 β̂j = (Xj Xj )−1 Xj yj (5) for each j = 1, 2, ..., p. However the efficiency of the estimators in applications is not that simple. 2.1. Efficiency of SUR model for estimating regression coefficients Note that if either σij = 0 for all j 6= i or Xi = Xj for all j 6= i, then the two model formulations produce identical estimates, that is, equation (4) will be identical to (5). The efficiency in the SUR formulation increases, the more the correlations between error vector differ from zero and the closer the explanatory variables for each response are to being uncorrelated (e.g., see Mehta and Swamy [10]). Sparks [16] discussed how to select variables and parameter estimators for the SUR model. The standard errors for the set of the regression parameter estimates in the SUR formulation is 0 −1 given by the diagonal elements of X Σ−1 X while for the unrelated 0 formulation it is the appropriate diagonal elements of (Xj Xj )−1 for j = 1, 2, . . . , p. These can be used to gain an idea of the relative merits of the SUR model formulation for estimating the regression parameters of the model. Generally when σij 6= 0 and the covariances are known, it 0 can be shown that the diagonal elements of (Xj Xj )−1 are larger than the 0 −1 corresponding diagonal elements of X Σ−1 X for each j, and the mean square error of prediction using the generalized least squares estimate is smaller. This is not generally true when the covariances are unknown. The relative benefit of SUR model when covariances are unknown depends on the sample size n (Zellner, [20], Kmenta and Gilbert [6], Revankar [13] & [14], Mehta and Swamy [10], Dwivedi and Srivastava [3], Maeshiro [9]). When fitting regression models with small sample sizes, its unlikely that the seemingly unrelated regression formulation and related generalized least squares estimates are going to add much value. However, the larger the sample size, the more reliable is the estimate of ∆, and hence the more likely an advantage is gained from the seemingly unrelated regression formulation of the model. 18 3. R. SPARKS The Application In this paper, the response variables are the natural logarithm of one plus the faecal coliform measurement taken at various sites in Port Jackson (Sydney). Faecal coliform measurements are a good indicator of the level of bacteria in the waterways, and therefore they flag risks associated with recreational activities, like swimming and boating. The collection of faecal coliform measurements is expensive, however moderate to heavy local rainfall induce many sewerage overflows into the waterways. Therefore an important explanatory variable of faecal coliform levels in the waterways is the aggregation of previous rainfall values. Aggregation was taken over non-overlapping twelve hour periods (running from 6am to 6pm, and 6pm to 6am). Typically measuring rainfall is considerably cheaper than measuring faecal coliforms. Using a modelling approach we can establish predictions of faecal coliform levels from daily rainfall values, and therefore averting the need to measure faecal coliforms on a daily basis. Each sample site in Port Jackson differed in distance from the rainfall gauge site. Therefore, forward selection with a backward look (stepwise regression) was used to select the set of rainfall explanatory variables that were best for predicting each site response. These subset models were included in the seeming unrelated regression formulation. The rainfall data takes away some of the spatial relationship between site response values, however the error term in the models are still going to have some spatial information relating to tidal influences, local rainfall missed by the network of gauges, and flow-down effects in Port Jackson. These spatial relationships determine the value of using the SUR formulation of the model. The faecal coliform data from sample sites have two incomplete data problems: • Missing data: Not every site was sampled on each day the data were recorded. There were days when a subset of the sites was sampled. • Censored data: The measurement methodology could not measure faecal coliform values below 4 CFU (colony forming units). Therefore, in this paper methodology is developed to deal with these issues. SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION 3.1. 19 Application of the EM algorithm to deal with missing data and left-censored observations 3.1.1. Dealing with missing data Missing data occurs when no feacal coliform measurement is collected for least for one of the sites on a sample day. The EM-algorithm of Dempster, Rubin and Laird [2] is applied to establish maximum likelihood estimates for the regression coefficients and dispersion parameters of the SUR model. The E-step of this algorithm is defined for the various cases below. Denote yij as the ith day’s logarithm of faecal coliform measurement plus one collected at the jth location in the waterway (xij is defined similarly 0 for rainfall readings). Note that yij ∼ N (xij βj , σj2 ) is assumed. Case 1: Only one site has missing data. Assume that this jth observed response is missing for day i. The expected value of this missing value, given all other measured values is 0 0 0 E(yij |yik for all k 6= j) = xij βj +σ(−j) Σ−1 bij.(−j) (−j) (yi(−j) −Zi(−j) β(−j) ) = µ (6) where 0 • σ(−j) is the j th row of Σ with the jth element removed, • Σ(−j) is the matrix Σ with the jth row and jth column removed, • yi(−j) is the vector of all faecal coliform observations collected on day i with the jth response removed, and • β(−j) is the vector β with the regression coefficients relating to selected explanatory variables of the jth response/site removed, i.e., βj removed. • xij be the ith row of Xj and 0 0 0 xi1 02 . . . 0p 00 x 0 . . . 0 0 p 1 i2 . . . . in which 0j is a vector of zeros of the same Zi = . . . . . . . . 0 0 0 01 02 . . . xip 0 length as vector xij , then Zi(−j) is Zi with the jth row removed. • 0 20 R. SPARKS Case 2: More than one site has missing data. Let C be the set of number sites that have faecal coliform measurements taken on day i. Then the expected value of the missing jth response value, given all observed response values during data collection day i is 0 0 0 E(yij |yik for all k ∈ C) = xij βj + σjC Σ−1 bij.C CC (yiC − ZiC βC ) = µ (7) where • ΣCC is the sub-matrix of Σ that corresponds to all responses in C, • σjC are the covariances between jth response and all responses in C as a row vector, • yiC is the vector of all faecal coliform observations collected on day i with response number in C, • βC is the vector β with the regression coefficients relating to selected explanatory variables of all response/site variables with numbers not in C removed, and • ZiC is a matrix with rows consisting of all rows Zi that correspond to variable numbers in C. 0 0 The expected value of the product of two missing responses on 0 the same day. Denote σjk.C = σjk −σjC Σ−1 CC σkC . If any pair of response variables are missing on day i, then the expected value of their product is E(yij yik |yim for all m ∈ C) = σjk.C + µ bij.C µ bik.C (8) Therefore, when applying the SUR model we need to consider whether to apply the SUR model to: • all sites with only completed data (discarding observations with any site having missing data), or • all the data using the EM algorithm (Dempster et al [2]) to deal with missing data. We found using simulation techniques that the latter approach was significantly better in terms of minimizing the mean square error of prediction. SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION 3.1.2. 21 Dealing with left-censored data values. The measurement process of the faecal coliform values can not report values of 4 colony forming units (CFU) or below, e.g., any “true” zero counts for CFU are reported as below 4 CFU. We therefore do not have complete information for all measurements. Also, the variation in measured values such as: • sampling variation; • variation caused by differences in handling (outside the laboratory); and • variation caused by errors within measurement laboratories, can be as much as 40% on average at some sites under certain conditions. Therefore, we can not be certain that a near zero measurement is a true reflection of the faecal coliform level at a site. Four methods of dealing with the censored values and missing values are considered. These are as follows: Method 1: Simple approach to handling censored data. Here we patch censored data with the quantiles of the estimated distribution below the censored value. To achieve this, the distribution is assumed to be normal (log-normal) and the parameters are estimated from the data as outlined below. Let ȳj be the mean of all uncensored y values from the jth site that are not missing, and mj be the number of missing values in T days sampled. The estimated population mean (Persson and Rootzen [12]) is: µ̂j = ȳj − α σ̂RM L (j) (9) and T −mj σ̂y2j = Σi=1 2 ∗ − ȳj )2 /(T − mj ) − (αλ(T −mj )/T − α2 )σ̂RM (yij L (j) (10) where: • ∗ is the ith observed value of the jth response, yij • α= • λa/b is the upper ab th quantile of the standard normal distribution, and 2 √T e−λ(T −mj)/T /(T 2Π − mj ) , 22 R. SPARKS • 2 σ̂RM L (j) = q 1 {λ(T −mj )/T ν̄j + [(λ(T −mj )/T ν̄j )2 + 4τj2 } 2 T −mj in which ν̄j = Σi=1 log(5))2 /(T − mj ). T −mj ∗ (yij − log(5))/(T − mj ) and τj2 = Σi=1 (11) ∗ (yij − For our example we know that faecal coliforms are positively associated with local rainfall. Therefore for the jth site, the censored days are ranked according to total rainfall in the region and they are patched with the respective quantile value of a normal distribution with mean µ̂j and variance σ̂j2 . This is repeated for each response j. These patched censored values are treated as if they were known, and the parameters of the model (excluding missing observation) are estimated for each site individually using (5). Method 2: Simple approach to handling censored data and missing data. We carry out all steps in Method 1. That is, we patch censored data with the quantiles of the estimated distribution below the censored value. Then the patched censored values and ordinary least squares are used to estimate regression parameter for each site, independently. These estimated models are used to estimate missing data for each site, and then the SUR model is used to improve the estimate of the regression model (i.e., using (4)). Method 3: EM algorithm used to handle both censored data and missing data. It seems convenient to use the same methodology for dealing with both sets of incomplete information. Therefore, the EM algorithm is used to “patch” left-censored value as well as the missing data via an expectation step, and then these “patched” values are used to establish maximum likelihood estimates of the parameters. These two steps are applied iteratively until estimates converge. We build in the information recorded for left-censored values by finding the conditional expectation of responses given censored values are less than the low detect values. However, there is information available from neighboring sites and local rainfall records, and this can also help with patching “censored” values. Case 1: All other response values for the ith day are either uncensored or 0 not missing. Put σjj.(−j) = σjj − σ(−j) Σ−1 (−j) σ(−j) , Z(−j) , and let y denote the log faecal coliform measurement plus 1. Censored value yij is replaced by (see Appendix A) E(yij |yik for all k 6= j & yij < log(5)) = µ bij.(−j) −σjj.(−j) φ(z1 )/Φ(z1 ) (12) where: SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION √ z1 = [log(5) − µ bij.(−j) ]/ σjj.(−j) 23 (13) Case 2: All other response values for i th day are uncensored but some sites on the day have missing values. Censored value yij is replaced by (see Appendix A) E(yij |yik for all k 6= j & yij < log(5)) = µ bij.C − σjj.C φ(z2 )/Φ(z2 ) (14) where: √ z2 = [log(5) − µ bij.C ]/ σjj.C (15) Case 3: Some other response values for ith day are censored and other sites on the day have missing values (see Appendix B). Let S be all sites with censored values on day i. Censored value yij is replaced by (see Appendix B) E(yij |yij < log(5) & yi` < log(5) for all ` ∈ S) (16) which is solved using simulation methods. Case 4: Some other response values for ith day are censored and sites on the day have observed uncensored values (see Appendix B). Censored value yij is replaced by (see Appendix B) E(yij |yik for all k 6= j, yij < log(5) & yi` < log(5) for all ` ∈ S) (17) which is solved using simulation methods. Method 4: E-Step of the EM algorithm is used to handle both censored data and missing data, but this E-step is simplified to condition on all variables not included in the expectation. It would be convenient if we could ignore the difficult conditioning on other censored values as in Case 3, and condition on them as if they were observed, by using the previous estimates of censored values in iterations as the“observed values”. This method obviously ignores some of the information the censored values provide, but we want to know what is lost. 4. 4.1. Comparison of Methods by Simulation Example 1 The following SUR model was simulated wi1 = exp(2.5 + 2x1j + x2i + e1i + 0.4e2i ) − 1 24 R. SPARKS and wi2 = exp(3 + 2x3i + 2x4i + e3i + 0.4e2i ) − 1 where x1i , x2i , x3i , x4i , e1i , e2i and e3i are generated as independent standard normal random variates. Ten percent of the random variables wi1 and wi2 were hidden at random, i.e., treated as missing. Any of wij values below 4 were censored. The resulting data were fairly typical of some sites in Port Jackson. The number of observations simulated on each occasions is 40. The response variables in the fitting process was taken as yij = loge (wij + 1) for j = 1, 2. Denote β̂p as the estimator gained by using the E-step of EM-algorithm to patch missing and censored values as in equations (5), (6), (7), (10), (11) and (12). β̂np is the estimator gained by first setting censored values of yij to loge (4 + 1) and then the EM algorithm to establish parameter estimates. The simulation was run 100 independent times with each simulation cal0 culating |β̂p − βtrue |−| β̂np − βtrue | where βtrue = 2.5 2 1 3 2 2 . The results are presented in Fig. 1, where the parameter estimates are number 1, 2, 3, 4, 5 & 6 referring to estimates of 2.5, 2, 1, 3, 2, and 2, respectively. It is clear from Fig. 1 that we gained by using the EM algorithm to adjust for missing values because, on average, |β̂p − βtrue |≤| β̂np − βtrue |. Figure 1. Comparison of regression estimates when missing are patched, and censored data are patched and not patched. SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION 4.2. 25 Example 2 The same simulation example as Example 1 was used, except this simulation compared the various methods defined earlier for dealing with missing and censored values. Fig. 2 illustrates that there is very little difference between Method 3 and Method 4, and both these methods are superior to Method 1 and Method 2. Method 2 has only a slight advantage over Method 1. Method 4 requires much less computational effort than Method 3, of the order of 1 400 th of the effort, and this is likely to increase as the amount of censoring increases. In terms of balancing effort and accuracy, Method 4 seems to be preferable. Very little is lost by avoiding the complicated evaluation of conditional expectations (16) and (17). Figure 2. Comparing methods of estimating model parameters using a simulation study. 26 5. R. SPARKS Examples of Application Recall the application section earlier in the paper. This outlined the intention of producing efficient predictions of faecal coliform (f c) values at various recreational sites in Port Jackson (Sydney). Three different geographical locations in Port Jackson are used to illustrate the value of these models (see Table 1): Table 1. Sample site information and model fit. Areas of Port Jackson Site locations No. of obs.† values censored values missing variation explained by model Parramatta River Lane Cove River Middle Harbor PJ02, PJ03, PJ04 PJ05, PJ06, PJ07 PJ29, PJ30, PJ31, PJ32 146 20% 4% 72% 150 11% 3% 73% 153 31% 1% 80% Percentage † days f c values were measured Recall the explanatory variables are the lagged 12-hourly rainfall totals. Up to five lags at each of eleven rainfall stations are considered. Using forward selection with a backward look (stepwise regression) we selected the significant rainfall explanatory variables for each site. The selected subset models are then written in SUR model form. The parameters are estimated using the SUR model. The results are recorded in Tables 2, 3, and 4. The following are worth noting in these three regions of Port Jackson. • The variable selection process selects 12-hour rainfall totals lagged 1, 2, 3, 4 and 5 from time f c was measured as significant explanatory variables approximately 9%, 9%, 19%, 23%, and 40% of the time, respectively. This was a surprise because prior to this analysis, experts at Sydney Water suggested that rainfall now would not influence faecal coliform levels 48 hours later. For this reason, we only considered explanatory rainfall variables up to lag 5 (48 to 60 hours later). The modelling process did not seem to adhere to conventional wisdom, and future models should consider lags beyond 60 hours. • Note that the explanatory variables relating to lag 5 values generally corresponded to, on average, higher regression parameter estimates in 27 SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION Table 2. Estimated model for sites in the Parramatta River. PJ02 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ03 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ04 Constant Rainfall site Rainfall site Rainfall site Rainfall site Method 3 Method 2 Method 1 22 25 37 64 70 lag lag lag lag lag 5 1 2 4 3 3.0132 0.2382 0.1869 0.0703 0.1903 0.6155 2.9653 0.2639 0.1787 0.0719 0.1961 0.6134 2.9331 0.3123 0.1662 0.0723 0.1985 0.6624 25 29 37 37 37 70 70 87 lag lag lag lag lag lag lag lag 1 5 2 3 5 3 5 4 2.2788 0.0664ns 0.1026 0.0484 0.1691 0.1398 0.1466 0.3135 0.0593ns 2.1949 0.0674 0.1054 0.0505 0.1854 0.1395 0.1657 0.3287 0.0616 2.1602 0.0680 0.0952 0.0555 0.2079 0.1337 0.2190 0.3585 0.0608 25 64 70 87 lag lag lag lag 5 4 5 4 1.8708 -0.0629 0.0769 0.1300 0.1769 1.7810 -0.0679 0.0821 0.1345 0.1897 1.7914 -0.1053 0.0860 0.1553 0.2110 the Lane Cove River. Explanatory variables relating to lag 3 or 5 rainfall values generally corresponded to, on average, higher regression parameter estimates in Middle Harbor. Explanatory variables relating to lag 3, 4 or 5 rainfall values generally corresponded to higher regression parameter estimates in Port Jackson. • Variables that are significant when fitting independent multiple regression models can become insignificant when fitting the seeming unrelated regression model (e.g., those parameters with a superscript ns on the left-hand side of the estimate are nonsignificant). • Some estimates change very little when all the information is used (SUR formulation), while others change significantly (e.g., Rainfall site 25 lag 5 of PJ30 changes dramatically). • Estimates of the explanatory variable regression coefficients are generally closer to zero using the SUR model (Method 3), than the respective 28 R. SPARKS coefficients estimated from separate models – Method 1 (in more than 80% of the cases). Table 3. Estimated model for sites in the Lane Cove River. PJ05 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ06 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ07 Constant Rainfall site Rainfall site Rainfall site 6. Method 3 Method 2 Method 1 25 29 37 64 70 lag lag lag lag lag 5 1 4 5 5 2.0164 -0.0860 0.1948 0.0973 0.2061 0.2308 1.8858 -0.0805 0.2052 0.1027 0.2102 0.2273 1.8801 -0.0886 0.1967 0.1070 0.2107 0.2618 22 22 25 66 87 87 lag lag lag lag lag lag 2 5 5 1 3 4 2.8925 0.1976 0.4554 -0.1985 0.1792 0.1785 0.0499 2.7821 0.2029 0.4765 -0.2060 0.1882 0.1670 0.0641 2.7773 0.2391 0.4931 -0.2197 0.1629 0.1608 0.0698 17 lag 4 29 lag 3 66 lag 5 3.5448 0.1887 0.3449 0.1167 3.4947 0.1924 0.3479 0.1217 3.4823 0.1956 0.3617 0.1223 Conclusion The SUR model can be usefully applied to many point estimation problems in environmental settings. This model proves useful in spatio-temporal settings where temporal information is built into models via the appropriate explanatory variables and lagged response variables, while the spatial correlations are used via the model errors to improve parameter estimates and predictions. Often these environmental settings have missing data and left censoring observations. This paper provides an approach to fitting SUR models when faced with these difficulties. Several methods of handling these are explored in the paper, and the simple approach of applying the E-step of the EM algorithm by conditioning on all observations (whether observed, censored or missing), and iterating until estimates converge (Method 4) 29 SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION Table 4. Estimated model for sites in Middle Harbor. PJ29 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ30 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ31 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site PJ32 Constant Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Rainfall site Method 3 Method 2 Method 1 25 64 70 70 87 87 lag lag lag lag lag lag 5 5 3 5 3 5 1.9208 -0.0657ns 0.1525 -0.3198 0.2300 0.2857 0.1339 1.8250 -0.588 0.1563 -0.2565 0.2282 0.2882 0.1418 1.8517 -0.0976 0.1750 -0.2852 0.2465 0.2924 0.1428 25 37 70 70 87 lag lag lag lag lag 5 5 4 5 3 1.6241 -0.0181ns 0.0709 0.1626 0.3772 0.6791 1.5312 -0.0246 0.0739 0.1831 0.3849 0.6967 1.5508 -0.0760 0.0923 0.1898 0.4195 0.7063 22 29 37 66 70 70 70 lag lag lag lag lag lag lag 2 4 3 5 3 4 5 1.4817 0.2505 0.0693 0.3255 0.1226 0.3399 0.0835ns 0.3046 1.4038 0.2727 0.0764 0.3259 0.1299 0.4350 0.0769 0.3058 1.4035 0.3084 0.0618 0.3205 0.1312 0.4118 0.0889 0.3025 17 17 17 29 64 70 lag lag lag lag lag lag 2 3 4 5 5 4 1.7820 0.1888 0.6784 0.1532 0.3904 0.1477 0.1518 1.7777 0.1956 0.6714 0.1579 0.3727 0.1470 0.1556 1.7609 0.1967 0.6733 0.1584 0.4346 0.1486 0.1405 is computationally efficient and reasonably accurate. where accuracy is paramount use Method 3. In those few cases References 1. Blattberg, R., and George, E.I. (1991), Shrinkage estimation of price and optional elasticities: Seemingly Unrelated Equations, J. Amer. Statist. Assoc.,86, 304-315. 30 R. SPARKS 2. Dempster, A.P., Laird, N. M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Roy. Stat. Soc. B 39, 1-38. 3. Dwivedi, T.D. and Srivastava, V.K. (1978). Optimality of least squares in the seemingly unrelated regression equation model. Journal of Econometrics, 7, 391-395. 4. Greene, W. (1993). Econometric Analysis (2nd Ed.) New York: Macmillan. 5. Hasenauer, H., Monserud, R.A. and Gregoire, T.G. (1998). Using simultaneous regression techniques with individual-tree growth models. Forest Science, 44(1), 87-95. 6. Kmenta, J. and Gilbert, R.F. (1968). Small sample properties of alternative estimators of seemingly unrelated regressions. J. Amer. Statist. Assoc., 63, 1180-1200. 7. Kurata, H., and Kariya, T. (1996). Least upper bound for covariance matrix of a generalized least squares estimator in regression with applications to a seemingly unrelated regression model and a heteroscedastic model. The Annals of Statistics, 24, 1547-1559. 8. Liu, J.S. and Wang, S.G. (1999). Two-stage estimate of the parameters in seemingly unrelated regression model. Progress in Natural Science, 9. 9. Maeshiro, A. (1980). New evidence on small sample properties of estimators of SUR models with autocorrelated disturbances. Journal of Econometrics, 12, 177-187. 10. Mehta, J.S. and Swamy, P.A.V.B. (1976). Further evidence on the relative efficiencies of Zellner’s seemingly unrelated regression estimator. J. Amer. Statist. Assoc., 71, 634-639. 11. O’Donnell, C.J., Shumway, C.R. and Ball, V.E. (1999). Input demands and inefficiency in US agriculture. American Journal of Agricultural Economics, 81, 865-880. 12. Persson, T. and Rootzen, H. (1977). Simple and highly efficient estimators for a type I censored normal sample. Biometrika, 64, 123-128. 13. Revankar, N.S. (1974), Some finite sample results in the context of two seemingly unrelated regression equations. J. Amer. Statist. Assoc.,69, 187-190. 14. Revankar, N.S. (1976). Use of restricted residuals in SUR systems: Some finite sample results. J. Amer. Statist. Assoc.,71, 183-192. 15. Samonte-Tan, G.P.B. and Davis, G.C. (1998). Economic analysis of stake and rackhanging methods of farming oysters (Crassostrea iredalei) in the Philippines. Aquaculture, 160,239-249. 16. Sparks, R.S. (1987). Selecting estimators and variables in the seemingly unrelated regression model. Commun. Statist.-Simula., 16, 99-127. 17. Srivastava, V.K. and Giles, D.E.A. (1987). Seemingly Unrelated Regression Equations Model: Estimation and Inference, New York:Marcel Dekker. 18. Wilde, P.E., McNamara, P.E., and Ranney, C.K. (1999). The effect of income and food programs on dietary quality: A seemingly unrelated regression analysis with error components. American Journal of Agricultural Economics, 81(4), 959-971. 19. Zellner, A. (1962). An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. J. Amer. Statist. Assoc., 57, 348-368. 20. Zellner, A. (1963). Estimators for seemingly unrelated regression equations: some exact finite sample results. J. Amer. Statist. Assoc., 58, 977-992. Appendix A Let log of the faecal coliform values plus one be denoted by the random vari1 2 √ able y and assume that this is normally distributed. Let φ(z) = e− 2 z / 2π SUR MODELS APPLIED TO AN ENVIRONMENTAL SITUATION and Φ(z) = 31 Rz φ(x)dx, then √ Rc 2 1 E(y|y < c) = −∞ ye− 2σ2 (y−µ) /( 2πσ)dy/Φ( c−µ σ ) R c (y−µ) − 1 (y−µ)2 √ / 2πdy/Φ( c−µ = −∞ σ e 2σ2 σ )+µ √ y=c − 2σ12 (y−µ)2 c−µ = −σe / 2π|y=−∞ /Φ( σ ) + µ (y−µ) = −σφ( σ )/Φ( c−µ σ )+µ −∞ Appendix B Log of faecal coliform values plus one at site j is denoted by the random variable yj . Assume the normal distribution. Let S = (`1 ...`S ) be all response variable numbers corresponding to censored values and C = (C1 , ..., Cp−S−1 ) be all the response variable numbers that correspond to uncensored values, 0 yr = [ yj y`1 . . . y`S ], 0 yC = [ yC1 yC2 . . . yCp−S−1 ], 0 0 0 0 0 0 Σrr ΣrC cov([ yr yC ]) = , E([ yr yC ]) = [ µr µC ] Σrr.C = Σrr − ΣCr ΣCC −1 ΣrC Σ−1 CC ΣCr , µr.C = µr − ΣrC ΣCC (yC − µC ), Φ(1c) = Z c ... −∞ Z c 1 e− 2 (yr −µr ) −∞ 0 Σ−1 rr (yr −µr ) /{(2π) p/2 |Σrr ||1/2 }dyj dyS , where dyS = dy`1 ...dy`S , and Z c Z c 0 −1 1 p/2 Φr.C (1c) = ... e− 2 (yr −µr.C ) Σrr.C (yr −µr.C ) /{(2π) |Σrr.C |1/2 }dyS −∞ −∞ 1. The case when the only data available within a day is censored data. E(yj |yj < c & y` < c for all` ∈ S) 0 −1 Rc Rc R∞ R∞ 1 p/2 ... −∞ −∞ ... −∞ yj e− 2 (y−µ) Σ (y−µ) /{(2π) |Σ|1/2 } −∞ dyj dyS /Φ(1c) 0 −1 Rc Rc 1 p/2 = −∞ ... −∞ yj e− 2 (yr −µr ) Σrr (yr −µr ) /{(2π) |Σrr |1/2 } dyj dyS /Φ(1c). = 2. When both censored and uncensored values are available E(yj |yj < c , y` < c for all ` ∈ S and yC ) = 0 −1 Rc Rc 1 p/2 ... −∞ yj e− 2 (yr −µr.C ) Σrr>c (yr −µr.C ) /{(2π) |Σrr.C |1/2 } −∞ dyj dyS /Φr.C (1c) 32 R. SPARKS The above integrals can be solved by simulation. That is, 1. Simulated the response vector from a multivariate normal random variate with mean vector µ and covariance matrix Σ. Retain all values that are censored. Repeat this many times. Then find all simulations that have yj < c & y` < c for all ` ∈ S, then take the average of all the yj values in this group to give the required values. 2. Simulated yr conditional on yC given they are jointly multivariate normal. Retain only observations with all values in yr censored (less than c). Then calculate the average of these values these retained values to give the expected values of for each variable in yr . These approaches are computationally fairly expensive and the parameter estimation process is slow.