Detecting Pattern in Biological Stressor Response Relationships Using Model Based Cluster Analysis

Ilya Lipkovich1, Eric P. Smith2 and Keying Ye2

Abstract

Environmental monitoring of aquatic systems is needed to estimate the quality of the systems, to evaluate standards and to study stressor-response relationships. Monitoring programs often focus on the collection of biological, chemical and physical measures of the system. An important concern is the effect of chemical and physical stressors on the biological community. Evaluating these relationships may be difficult because their extent is not known. From a management perspective, interest is in which factors affect the biological community and where these factors have an influence. The focus of this paper is on the use of regression based cluster analysis as a tool for finding relationships between a single biological response and a suite of environmental stressors. The approach to cluster analysis uses a penalized regression classification likelihood and Markov Chain Monte Carlo Model Composition (MC3). This approach allows for simultaneous development of regression models and clustering of the regression models. The method is applied to the analysis of a data set describing stressor-response relationships in Ohio.

Key words: Bayesian methods, cluster analysis, Markov Chain Monte Carlo (MCMC) simulation, regression, water quality

Author Footnote: 1 Eli Lilly and Company, Lilly Corporate Center, Indianapolis, Indiana 46285; 2 Department of Statistics, Virginia Tech, Blacksburg, VA 24061

1. Introduction

Recent programs to monitor the aquatic environment focus on the biological, chemical and physical measurement of the environment. Of particular interest is the status of the environment (sometimes referred to as the integrity) at local and regional levels and the factors that determine the status.
Both state (for example, Ohio’s Environmental Protection Agency (EPA)) and national agencies (such as USEPA’s EMAP program) are routinely involved in the monitoring of fish and benthic macroinvertebrates along with chemicals in the water, organism habitat and physical characteristics of the landscape surrounding the aquatic system. In the State of Ohio, for example, the state’s EPA regularly monitors over a thousand sites (http://www.epa.state.oh.us/dsw/document_index/305b.html). The data sets produced by this sampling program are often quite large, typically containing over 30 fish taxa, 100 benthic macroinvertebrate taxa, habitat variables, water chemistry variables and land use information (http://www.epa.state.oh.us/dsw/documents/exsumm96.pdf).

Of particular interest to environmental agencies and managers is biological integrity and the factors that influence it. Integrity is often measured through a set of biological metrics that are combined into a single summary measure. Examples include the Index of Biotic Integrity (IBI) for fish and the Invertebrate Community Index (ICI) for benthic macroinvertebrates. For example, IBI consists of the sum of ten metrics. Metrics include quantities such as the percent of individuals classified as tolerant taxa, the number of species and the percent of individuals with anomalies. Indices may also be used to summarize nonbiological variables, an example being the qualitative habitat evaluation index (QHEI).

Relating biological measures to physical, chemical and habitat factors is not an easy problem (see for example, Norton, 1999). Use of methods that relate biology and chemistry over large geographical regions presents difficulties in model selection. While for small geographical systems the assumption of a single model may be reasonable, models for large-scale systems may be affected by multiple stressors that operate on different scales.
For example, acid mine drainage may affect organisms in a single stream or portion of a stream. Acid rain might affect aquatic communities in a portion of a state while changes in temperature could affect the entire state. It is therefore valuable to identify the dominant patterns in biological stress and to know the extent of the stressor-response relationship. In addition, the study design is not based on relationships, so the extent of the relationship is not known and the various influencing factors may vary over different spatial scales. Use of cluster analysis followed by regression is inefficient because cluster analysis seeks to minimize variance within clusters rather than maximize relationships between stressors and responses. We therefore seek clustering methods that find clusters based on regression relationships.

This paper is concerned with the clustering of aquatic sites based on stressor-response relationships using a cluster algorithm based on Markov Chain Monte Carlo Model Composition (MC3) (Madigan and Raftery, 1994) and Model-based Cluster Analysis (MBCA) (see for example, Banfield and Raftery, 1993; Bensmail et al., 1997; Fraley and Raftery, 1998). MBCA allows a researcher to divide a set of multivariate observations into clusters/classes so as to maximize the underlying likelihood function. Two basic approaches exist to formulate the likelihood function: the classification likelihood method and the finite normal mixture approach. The former approach simply combines the likelihood functions (typically based on the Gaussian distribution) from individual clusters to obtain an overall likelihood, given as follows

$$p_c(\mathbf{x} \mid \theta, \gamma) = \prod_{i=1}^{n} f(x_i \mid \mu_{\gamma_i}, \Sigma_{\gamma_i}), \qquad (1)$$

where $\gamma_i$ indicates the class membership for an observation (that is, $\gamma_i$ is an index of the class where the $i$th case belongs).
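As a small illustration of expression (1), the sketch below evaluates the classification log-likelihood for a given vector of class labels. It is a simplification made for this example only: the data are univariate, so each cluster has a scalar mean and variance rather than a mean vector and covariance matrix, and the function names `normal_logpdf` and `classification_loglik` are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal distribution (illustrative simplification)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classification_loglik(xs, labels, means, variances):
    """Log of expression (1): each observation contributes the log density
    of the single cluster it is currently assigned to."""
    return sum(
        normal_logpdf(x, means[g], variances[g])
        for x, g in zip(xs, labels)
    )


# Two well separated observations and two clusters (made-up numbers).
xs = [0.0, 10.0]
means = [0.0, 10.0]
variances = [1.0, 1.0]
good = classification_loglik(xs, [0, 1], means, variances)
bad = classification_loglik(xs, [1, 0], means, variances)
```

Swapping the labels, as in `bad`, assigns each point to the distant cluster and lowers the log-likelihood, which is the quantity a classification-likelihood method tries to increase by changing the labels.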
In the mixture approach the likelihood is formed by mixing the observations across K clusters with mixing probabilities $\tau_k$ associated with each cluster (see Banfield and Raftery, 1993; Bensmail et al., 1997; Fraley and Raftery, 1998).

In the present approach to cluster analysis, the model space is expanded to include all possible partitions of multivariate observations. Models formed using subsets of variables and observations are evaluated using a penalized classification likelihood. In the cluster analysis, we form the model space and then use a Bayesian Information Criterion (BIC) that allows us to compute an approximate Bayes Factor for any two models. Using the machinery of the Markov Chain Monte Carlo Model Composition (MC3) method, we navigate through the model space searching for data-supported partitions of the data set. In particular, for each observation we can provide its (posterior) activation probability in each of the given clusters. Each observation is allocated to the cluster with the highest posterior probability. The MC3 method enhances the traditional approach by providing estimates of the uncertainties associated with the class memberships.

The paper is organized as follows. In Section 2 we describe our basic algorithm for cluster analysis. In Section 3, a modified version of the algorithm that allows for information about sub-clusters is applied to a data set describing stressor-response relationships in Ohio river systems. Concluding remarks are made in Section 4.

2. Model Based Cluster Analysis via MC3

The implementation we propose defines clusters in terms of the relationship between the set of y’s and the explanatory variables, x (see Wedel and Kamakura, 1999; McLachlan and Peel, 2000). To implement the procedure we use MC3. We define the states of our Markov Chain to be distinct partitions of the n observations into K clusters (the number of clusters is pre-determined).
That is, a model (state) is described by a matrix with elements $z_{ik}$, where $z_{ik} = 1$ when observation $i$ belongs to class $k$, and $z_{ik} = 0$ otherwise. The neighborhood of each state, $Z = \{z_{ik}\}$, consists of Z itself plus all models that can be obtained from Z by moving one observation from a cluster $k$ to a cluster $k'$ (see Lipkovich, 2002, Section 1.4.2). We make use of the classification likelihood whose basic form is given by expression (1). Note that we can also write (1) via the class membership variables, $z_{ik}$. For a normal regression model:

$$\log\{p_c(\mathbf{x} \mid \theta, Z)\} = \sum_{k=1}^{K} \sum_{i=1}^{n} z_{ik} \log\{f(x_i \mid \beta_k, \Sigma_k)\}. \qquad (2)$$

A proposed partitioning $Z_p$ is compared against the current partitioning $Z_c$ using the difference of their classification Bayesian Information Criterion (BIC) values, as explained below.

We use a stochastic search algorithm based on MC3. First, an initial model is generated by randomly allocating each observation to any of the available clusters, with the only restriction that the number of observations in each cluster must not be less than a user-specified number, say $n_L$. At any stage of the process, a new model is generated by randomly drawing an integer uniformly distributed in the range from 1 to n. This is the index of the observation that is to be moved from its current cluster to a cluster randomly selected from the $K-1$ remaining clusters (another selection is made when the number of observations in the current cluster, $n_k$, has reached its lower limit, $n_L$).
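The initialization and proposal steps just described can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `initial_partition` and `propose_move` are hypothetical, a partition is stored as a list of cluster labels instead of the indicator matrix Z, and seeding each cluster with the minimum number of observations before allocating the rest at random is only one assumed way of satisfying the size restriction.

```python
import random


def initial_partition(n, K, n_min, rng):
    """Randomly allocate n observations to K clusters, each of size >= n_min."""
    if K * n_min > n:
        raise ValueError("n is too small for K clusters of size n_min")
    order = list(range(n))
    rng.shuffle(order)
    labels = [0] * n
    # Assumed scheme: seed every cluster with n_min randomly chosen observations
    for k in range(K):
        for i in order[k * n_min:(k + 1) * n_min]:
            labels[i] = k
    # and then allocate the remaining observations uniformly at random.
    for i in order[K * n_min:]:
        labels[i] = rng.randrange(K)
    return labels


def propose_move(labels, K, n_min, rng):
    """Return a copy of labels with one observation moved to another cluster."""
    sizes = [labels.count(k) for k in range(K)]
    if K < 2 or all(size <= n_min for size in sizes):
        raise ValueError("no admissible move exists")
    while True:
        i = rng.randrange(len(labels))   # index drawn uniformly from the n observations
        if sizes[labels[i]] > n_min:     # redraw if the source cluster is at its lower limit
            break
    targets = [k for k in range(K) if k != labels[i]]
    proposal = list(labels)
    proposal[i] = rng.choice(targets)    # one of the K - 1 remaining clusters
    return proposal


rng = random.Random(1)
current = initial_partition(n=20, K=3, n_min=4, rng=rng)
candidate = propose_move(current, K=3, n_min=4, rng=rng)
```

The candidate differs from the current partition in exactly one label, so it lies in the neighborhood of the current state as defined above.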
The proposed model, p, is evaluated with respect to the comparator model, c, using the BIC approximation to the Bayes Factor (see also Raftery, 1995),

$$BF_{p/c} = \frac{\exp(0.5\,BIC_p)}{\exp(0.5\,BIC_c)} = \exp(0.5\{BIC_p - BIC_c\}), \qquad (3)$$

where the BIC is expressed via the classification likelihood from expression (1) as follows,

$$BIC = 2\log\{p_c(\mathbf{x} \mid \theta, Z)\} - \text{penalty} = 2\sum_{k=1}^{K}\sum_{i=1}^{n} z_{ik} \log f(x_i \mid \hat{\theta}_k) - \text{penalty}, \qquad (4)$$

where the vector $\theta$ comprises cluster-specific means and variance-covariance matrices. With this sign convention a larger BIC indicates a better supported model. A proposed model $Z_p$ is accepted with probability $\alpha$, where $\alpha = \min\{1, BF_{p/c}\}$. The penalty term that we used was $\sum_{k=1}^{K} p_k \log(n)$, where $p_k$ is the number of variables in the model for cluster $k$.

Assuming that clusters are formed with observations $y_k = X_k \beta_k + \varepsilon_k$, where $\varepsilon_k \sim N(0, \sigma_k^2 I)$, and $X_k$ and $y_k$ are the subsets of the data that correspond to cluster $k$, we can express (4), up to an additive constant that does not depend on the partition, as follows (see Raftery, 1995):

$$BIC = -\sum_{k=1}^{K} \left\{ n_k \log \frac{(y_k - X_k \hat{\beta}_k)'(y_k - X_k \hat{\beta}_k)}{n_k} + p_k \log(n) \right\}, \qquad (5)$$

where the regression coefficients are estimated by ordinary least squares (OLS). The implementation of the algorithm is rather cumbersome: it requires simultaneously maintaining several matrices of sums of squares and cross-products (one for each cluster) and updating them when an observation is moved from one cluster to another (a fast updating scheme was implemented via the Sherman-Morrison-Woodbury formulas, see Thisted, 1988).

The output of the described procedure contains a set of models that are filtered using Occam’s window, i.e., only the models for which $\exp(0.5\{BIC_i - \max_l(BIC_l)\})$ exceeds a certain lower limit are retained. Each of the models is assigned a posterior probability by renormalizing its respective BIC as follows:

$$\hat{p}(M_i \mid D) = \exp(0.5\,BIC_i) \Big/ \sum_{l=1}^{L} \exp(0.5\,BIC_l),$$

where L is the number of models in the output and $M_i$ denotes a particular partitioning. Choosing a single “best clustering” is accomplished by computing estimates of the posterior class membership probabilities for each observation, $\pi_{ik}$, $i = 1, \dots, n$; $k = 1, \dots, K$.
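Before turning to the membership probabilities, the evaluation steps described above (the BIC of expression (5), the acceptance rule based on expression (3), and the Occam's window filter with renormalization) can be sketched as follows. The sketch assumes the sign convention in which a larger BIC indicates a better model, which is what makes the "max" in the Occam's window rule meaningful. The function names are hypothetical, and the residual sums of squares are taken as given inputs rather than computed from OLS fits.

```python
import math
import random


def classification_bic(rss, sizes, n_params):
    """BIC in the form of expression (5); larger values are better (assumed convention).

    rss[k] is the residual sum of squares of the regression in cluster k,
    sizes[k] the cluster size n_k and n_params[k] the number of variables p_k.
    """
    n = sum(sizes)
    return -sum(
        n_k * math.log(r_k / n_k) + p_k * math.log(n)
        for r_k, n_k, p_k in zip(rss, sizes, n_params)
    )


def accept(bic_proposed, bic_current, rng):
    """Accept the proposal with probability min(1, BF), BF = exp(0.5 * difference)."""
    if bic_proposed >= bic_current:   # BF >= 1: always accept, and avoid overflow in exp
        return True
    return rng.random() < math.exp(0.5 * (bic_proposed - bic_current))


def occam_weights(bics, lower_limit):
    """Keep models inside Occam's window and renormalize their weights."""
    best = max(bics)
    ratios = {i: math.exp(0.5 * (b - best)) for i, b in enumerate(bics)}
    kept = {i: r for i, r in ratios.items() if r > lower_limit}
    total = sum(kept.values())
    return {i: r / total for i, r in kept.items()}


# Made-up numbers, purely for illustration.
bic = classification_bic(rss=[10.0, 20.0], sizes=[5, 5], n_params=[2, 2])
weights = occam_weights([0.0, -2.0, -100.0], lower_limit=0.05)
```

Subtracting the best BIC before exponentiating leaves the normalized weights unchanged and keeps the exponentials in a numerically safe range.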
The posterior class membership probabilities are estimated as

$$\hat{\pi}_{ik} = \sum_{\{M_j :\, i \in M_j^k\}} \hat{p}(M_j \mid D), \qquad (6)$$

where $M^k$ denotes the subset of indices $\{1, \dots, n\}$ associated with observations that fall in cluster $k$ in model $M$. In words, we simply add the estimated posterior probabilities of those models in which a given observation is active in the $k$th cluster. Once the class membership probabilities have been obtained, an observation is allocated to the cluster where it has the highest posterior probability. It is important to understand that allocating an observation to a cluster brings about uncertainty that has to be accounted for. We estimate the class membership uncertainty as $1 - \max_{k=1,\dots,K}(\hat{\pi}_{ik})$.

It should be noted that using a BIC approximation instead of a full Bayesian model averaging approach may result in inaccurate approximations to the posterior model probabilities and, in particular, it does not account for uncertainty in the estimated parameters (regression coefficients), which are fixed at their ML estimates rather than integrated out. We resort to BIC as a computationally simpler and reliable alternative to a fully Bayesian approach (see Raftery, 1995).

3. Analysis of Ohio data

Until recently, most biological data was generally not collected at the same time or location as chemical and habitat data. Dyer et al. (2000) developed a historical data set on streams in Ohio by matching biological data in space and time with chemical and habitat data using a geographical information system rule based approach. When multiple observations of chemical measurements were available, the median value was used. This resulted in a matched spatial set of observations for the state. In this study, we analyze the relationship between the response, given by the index of biological integrity (IBI), and four explanatory variables: three chemical stressors, DO (dissolved oxygen), pH and Zinc, and the qualitative habitat evaluation index, QHEI (a measure of habitat quality that combines scores for various factors related to stream gradient, in-stream cover, siltation, etc.).
Even with the matching, the full data set contains a large number of missing values, and for the present analysis we selected a representative group of 330 cases with complete records. The data for the analysis were selected from various basins with the intention of covering most of the basins in the state. Our goal is to divide the data into clusters based on the strength of the relationship between the IBI (the response) and the explanatory variables. This is different from standard cluster analysis in that the latter would classify the sites based on the within-cluster distances for the relevant variables, disregarding the nature of the relationship between the biological data and the environmental variables, which itself can be viewed as a determining feature of the region.

It can be argued that in forming clusters we have to preserve the local integrity of the data and therefore, instead of combining individual sites, we will be concerned with aggregating information on larger units (river basins). Therefore, we consider two levels of hierarchy, basins (larger units) and streams (smaller units), and constrain the algorithm to treat all samples within a basin as an object or all samples within a stream as an object.

The data selected for analysis contain information on 19 basins with varying numbers of observations per basin. Figure 1 shows the individual sites from a larger data set (734 records) plotted with their associated physical coordinates (i.e. latitude and longitude). Prior to analysis of the data, the log transformation was applied to DO and Zinc. Biplot displays (Gabriel, 1971; Lipkovich and Smith, 2002) were used to provide an initial evaluation of the sites and variables and indicated relationships between QHEI and IBI, although no strong indication of clustering was apparent.
3.1 Analysis with restrictions at stream level

Results of clustering with restrictions at the stream level are displayed in Figure 1 and some summary information is given in Table 1. One can see that the first group represents a region with overall better environmental conditions (the mean level of IBI is 41.88 versus 35.08 for the second group), which can be partly explained by smaller concentrations of Zinc. This cluster of sites also shows a significant negative relationship between Zinc and the response (IBI). QHEI is an important factor for both clusters although it has a more significant effect for cluster 1. Dissolved oxygen is important for cluster 2 while pH is important for cluster 1.

It is interesting to check whether the sites from the first group form a compact geographic area. This can be seen from Figure 1, where the grouping is related to the physical coordinates. Most of the sites that form our first class are found in the southern part of the map, and there are several sites in the northern part whose allocation to the first group exhibits a high level of uncertainty. To draw more direct conclusions about this example, we need more subject-specific information about the ecoregions that cover the state and other knowledge about the sites.

3.2 Analysis with restrictions at basin level

To obtain more spatially compact clusters, we imposed constraints at the basin level so that each set of samples within a basin was considered an object. Again we use the regression relationships with the same four independent variables. There were a total of 12 basins (sub-clusters) that were classified into three groups. Figure 2 displays the results of this analysis for a three-cluster solution. Cluster 1 is formed of geographically compact sites in the northwestern part of the ecoregion and corresponds to a single basin.
The vertical bars in Figure 2 associated with samples from cluster 3 that are near cluster 1 are class membership uncertainties (multiplied by 105). Cluster 1 represents a basin with regions of poor habitat (QHEI) and stress from metals (Zinc). The second cluster of basins has the highest regression t-statistic on QHEI, represents the group of basins with the highest absolute levels of IBI and QHEI, and corresponds to the region in the southern part of the state. The third cluster is primarily associated with the northwestern part of the state where agriculture dominates. Here, loss of habitat is of importance.

4. Conclusion and Discussion

Environmental data sets are often collected over space without emphasis on the underlying model that relates biological response and environmental stress. Finding relationships in the collected data becomes an important problem as it allows for evaluation of the extent and strength of biological stressor-response relationships. The goal of this paper was to show how the MC3 methodology can be applied to the problem of selecting subsets of observations with certain interesting features. This is an interesting example of employing the symmetry between the treatment of variables and observations in statistical studies. We developed a general algorithm for MC3 based cluster analysis whose performance was shown with several simulated examples (Lipkovich, 2002) and a real data set. An important modification of the algorithm can accommodate natural restrictions on the model space, in that the original observations are nested within larger units and their hierarchical relationships have to be preserved in the solution. This situation is very typical of ecological studies of effect/stressor relationships.
Our data analysis allowed us to classify basins of one Ohio ecoregion based on the similarity in the patterns of relationships between biological variables, habitat and chemical stressors, rather than just similarity in the means of multivariate observations. In addition to providing a reliable search procedure (as was shown in limited simulation experiments that will be presented elsewhere), the stochastic search method enhances our analysis by the estimation of model selection uncertainty, which in the context of cluster analysis can be used to construct class memberships. In this application we assumed that the number of clusters, K, is known and fixed; we also developed a heuristic procedure for determining the best K by using a cross-validation type procedure (see Lipkovich, 2002, p.118). According to this procedure, K is chosen so as to minimize a cross-validation classification error that is estimated by re-allocating each observation to the closest cluster after re-computing the regression without the deleted observation.

Acknowledgements

This research was funded in part by U.S. EPA-Science To Achieve Results (STAR) Grant #RD83136801-0.

Biographical Sketches

Ilya Lipkovich is a Research Scientist at Eli Lilly & Co in Indianapolis, IN, USA.
Eric Smith is a Professor in the Statistics Department at Virginia Tech in Blacksburg, VA, USA.
Keying Ye is an Associate Professor in the Statistics Department at Virginia Tech in Blacksburg, VA, USA.

References

Banfield, J.D. and Raftery, A.E. (1993). Model based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
Bensmail, H., Celeux, G., Raftery, A.E. and Robert, C. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1-10.
Dyer, S.D., White-Hull, C., Carr, G.C., Smith, E.P. and Wang, X. (2000). Bottom-up and top-down approaches to assess multiple stressors over large geographic areas. Environmental Toxicology and Chemistry, 19(4-2), 1066-1075.
Fraley, C. and Raftery, A.E. (1998).
How many clusters? Which clustering method? Answers via model-based cluster analysis. Technical Report No. 329, Department of Statistics, University of Washington.
Gabriel, K.R. (1971). The biplot-graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467.
Lipkovich, I. (2002). Bayesian Model Averaging and Variable Selection in Multivariate Ecological Models. PhD Dissertation, Virginia Polytechnic Institute.
Lipkovich, I. and Smith, E.P. (2002). Biplot and SVD macros for EXCEL. Journal of Statistical Software, 7(5).
Madigan, D. and Raftery, A.E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89(428), 1535-1546.
McLachlan, G.J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
Norton, S.B. (1999). Using Biological Monitoring Data to Distinguish Among Types of Stress in Streams of the Eastern Corn Belt Plains Ecoregion. Ph.D. thesis, George Mason University.
Raftery, A.E. (1995). Bayesian model selection in social research (with discussion). In Marsden, P.V., editor, Sociological Methodology, 111-195. Blackwells Publishers, Cambridge, Mass.
Thisted, R.A. (1988). Elements of Statistical Computing. Chapman and Hall, London.
Wedel, M. and Kamakura, W.A. (1999). Market Segmentation, Methodological and Conceptual Foundation. Dordrecht: Kluwer, 2nd edition.

Table 1. Group means and t-statistics (in parentheses) from OLS regressions for the two-cluster solution. IBI is the response variable. T-statistics in the IBI column are for the intercept.
Group means with associated t-statistics

Class   IBI             DO            pH            Zinc            QHEI
1       41.88 (-3.96)   7.72 (0.18)   8.07 (4.40)   14.61 (-2.03)   66.18 (10.09)
2       35.08 (-1.81)   7.54 (3.49)   7.87 (2.43)   22.74 (-0.87)   64.37 (6.08)
Total   37.61           7.61          7.95          19.71           65.04

Figure 1. Geographical coordinates (degrees of longitude and latitude) of sites with classification results (2 clusters, restrictions at stream level; vertical bars indicate class membership uncertainties). [Plot of latitude against longitude; legend: Cluster 1, Cluster 2.]

Figure 2. Geographical coordinates of the streams with classification results (3 clusters, restrictions at basin level; vertical bars indicate class membership uncertainties). [Plot of latitude against longitude; legend: Cluster 1, Cluster 2, Cluster 3.]