LipkovichSmithYe

advertisement
Detecting Pattern in Biological Stressor Response
Relationships Using Model Based Cluster Analysis
Ilya Lipkovich1, Eric P. Smith2 and Keying Ye2
Abstract
Environmental monitoring of aquatic systems is needed to estimate the quality of the systems, to
evaluate standards and to study stressor-response relationships. Monitoring programs often focus
on the collection of biological, chemical and physical measures of the system. An important
concern is the effect of chemical and physical stressors on the biological community. Evaluation
of relationships may be difficult as the extent of the relationship is not known. From a
management perspective, interest is on what factors affect the biological community and where
these factors have an influence. The focus of this paper is on the use of regression based cluster
analysis as a tool for finding relationships between a single biological response and a suite of
environmental stressors. The approach to cluster analysis uses a penalized regression
classification likelihood and Markov Chain Model Composition Monte Carlo. This approach
allows for simultaneous development of regression models and clustering of the regression
models. The method is applied to the analysis of a data set describing stressors/response
relationship in Ohio.
Key words: Bayesian methods, cluster analysis, Markov Chain Monte Carlo (MCMC)
simulation, regression, water quality
Author Footnote:
1
2
Eli Lilly and Company, Lilly Corporate Center, Indianapolis, Indiana 46285
Department of Statistics, Virginia Tech, Blacksburg, VA 24061
1
1.
Introduction
Recent programs to monitor the aquatic environment focus on the biological, chemical and
physical measurement of the environment. Of particular interest is the status of the environment
(sometimes referred to as the integrity) at local and regional levels and the factors that determine
the status. Both state (for example, Ohio’s Environmental Protection Agency (EPA)) and
national agencies (such as USEPA’s EMAP program) are routinely involved in the monitoring of
fish and benthic macroinvertebrates along with chemicals in the water, organism habitat and
physical characteristics of the landscape surrounding the aquatic system. In the State of Ohio,
for example, the state’s EPA monitors on a regular basis over a thousand sites
(http://www.epa.state.oh.us/dsw/document_index/305b.html). The data sets that are produced
from this sampling program are often quite large, typically containing over 30 fish taxa, 100
benthic macroinvertebrate taxa, habitat variables, water chemistry variables and land use
information (http://www.epa.state.oh.us/dsw/documents/exsumm96.pdf).
Of particular interest to environmental agencies and managers is biological integrity and
what factors are influencing the integrity. Integrity is often measured through a set of biological
metrics that are combined into a single summary measure. Examples include the index of Biotic
Integrity (IBI) for fish and the Invertebrate Community Index (ICI) for benthic
macroinvertebrates. For example, IBI consists of the sum of ten metrics. Metrics include
quantities such as the percent of individuals classified as tolerant taxa, the number of species and
the percent of individuals with anomalies. Indices may also be used to summarize nonbiological variables, an example being the qualitative habitat evaluation index (QHEI).
Relating biological measures to physical, chemical and habitat factors is not an easy
problem (see for example, Norton, 1999). Use of methods that relate biology and chemistry over
2
large geographical regions present difficulties in model selection. While for small geographical
systems the assumption of a single model may be reasonable, models for large-scale systems
may be affected by multiple stressors that operate on different scales. For example, acid mine
drainage may affect organisms in a single stream or portion of a stream. Acid rain might affect
aquatic communities in a portion of a state while changes in temperature could affect the entire
state. It is therefore valuable to identify the dominant patterns in biological stress and to know
the extent of stressor-response relationship. In addition, the study design is not based on
relationships, so the extent of the relationship is not known and the various influencing factors
may vary over different spatial scales. Use of cluster analysis followed by regression is
inefficient because cluster analysis seeks to minimize variance within clusters rather than
maximize relationships between stressors and responses. We therefore seek clustering methods
that finds clusters based on regression relationships.
This paper is concerned with the clustering of aquatic sites based on stressor-response
relationships using a cluster algorithm based on the Markov Chain Monte Carlo Model
Composition (MC3) (Madigan and Raftery, 1994) and Model-based Cluster Analysis (MBCA)
(see for example, Banfield and Raftery, 1993; Bensmail et al., 1997; Fraley and Raftery, 1998).
MBCA allows a researcher to divide a set of multivariate observations into clusters/classes so as
to maximize the underlying likelihood function. Two basic approaches exist to formulate the
likelihood function: the classification likelihood method and the finite normal mixture approach.
The former approach simply combines the likelihood (typically based on Gaussian distribution)
functions from individual clusters to obtain an overall likelihood, given as follows
n
pc (x |  , )   f ( xi |  i ,  i ) ,
i 1
3
(1)
where i indicates the class membership for an observation (that is, i is an index of the class
where the ith case belongs). In the mixture approach the likelihood is formed by mixing the
observations across K clusters with mixing probabilities k, associated with each cluster (see
Banfield and Raftery, 1993; Bensmail et al., 1997; Fraley and Raftery, 1998).
In the present approach to cluster analysis, the model space is expanded to include all
possible partitions of multivariate observations. Models formed using subsets of variables and
observations are evaluated using a penalized classification likelihood. In the cluster analysis, we
form the model space and then use a Bayesian Information Criterion (BIC) that allows us to
compute an approximate Bayes Factor for any two models. Using the machinery of the Markov
Chain Model Composition Monte Carlo (MC3) method, we navigate through the model space
searching for data supported partitions of the data set. In particular, for each observation we can
provide its (posterior) activation probability in each of the given clusters. Each observation is
allocated to a cluster with the highest posterior probability. The MC3 enhances the traditional
approach by providing estimates of the uncertainties associated with the class memberships. The
paper is organized as follows. In Section 2 we describe our basic algorithm for cluster analysis.
A modified version of the algorithm that allows for information about sub-clusters, is applied to
a data set describing stressors/response relationships in Ohio river systems in Section 3 and
concluding remarks are made in Section 4.
2.
Model Based Cluster Analysis via MC3
The implementation we propose defines clusters in terms of the relationship between the set of
y’s and the explanatory variables, x (see Wedel and Kamakura, 1999; McLachlan and Peel,
2000).
4
To implement the procedure we use MC3. We define the states of our Markov Chain to be
distinct models of n observations into K clusters (the number of clusters is pre-determined). That
is, a model (state) is described by a matrix with elements zik, where zik =1 when observation i
belongs to a class k, and zik = 0 otherwise. The neighborhood of each state, Z={ zik}, is formed
of all models such that any model in the neighborhood of Z can be obtained from Z by moving
an observation from a cluster k to a cluster k’, plus the partitioning of Z itself (see Lipkovich,
2002, Section 1.4.2). We make use of the classification likelihood whose basic form is given by
expression (1). Note that we can also write (1) via class membership variables, zik. For a normal
regression model:
K
n
log{ pc (x | θ, Z)}   zik log{ f ( xi | β k , Σ k )} .
(2)
k 1 i 1
A proposed partitioning Zp is compared against any current partitioning Zc, using the
difference of their classification Bayesian Information Criterion (BIC), as will be explained in
the next section. We use a stochastic search algorithm based on MC3. First, generate an initial
model by randomly allocating each observation to any of the available clusters, with the only
restriction that the number of observations in each cluster must be not less than a user prespecified number, say nL. At any stage of the process, a new model is generated by randomly
drawing an integer uniformly distributed in the range from 1 to n. This is the index of the
observation that has to be moved from its current cluster to a cluster that is randomly selected
from the k-1 remaining clusters (another selection will be made when the number of observations
in the cluster, nk, has reached its lower limit, nL).
The proposed model, p is evaluated (with respect to the comparator model, c) based on
the approximation to the Bayes Factor via BIC (see also Raftery, 1995),
5
BFp / c 
exp(0.5BIC p )
exp(0.5BICc )
 exp(0.5{BIC p  BICc }) ,
(3)
where the BIC is expressed via the classification likelihood from the expression (1) as follows,
K
n
BIC  2 log{ pc (x | θ, Z)}  penalty  2 zik log f ( xi | θˆ k )  penalty ,
(4)
k 1 i 1
where vector  comprises cluster-specific means and variance-covariance matrices. A proposed
model Z is accepted with probability , where   min 1, BF p / c  The penalty term that we used
K
was
p
k 1
k
log( n) , where pk is the number of variables in the model for cluster k. Assuming that
clusters are formed with observations, y k  Xk βk  ε k , where ε k ~ N (0,  k I) , and Xk and yk are
the subsets of the data that corresponds to a cluster k, we can express (4) as follows (and Raftery,
1995).
K
BIC   {nk log
k 1
(y k  X k βˆ k )' (y  X k βˆ k )
 pk log(n)} .
nk
(5)
where the regression coefficients are estimated using OLS procedure.
The implementation of the algorithm is rather cumbersome and requires simultaneously
maintaining several matrices of sums of squares and cross-products (one for each cluster) and
updating them when an observation is moved from one cluster to another (a fast updating scheme
was implemented via Sherman-Morrison-Woodbury formulas, see Thisted, 1988). The output of
the described procedure contains a set of models that are filtered using Occam’s window, i.e. the
models where exp( 0.5{BIC i  max( BIC i )}) exceeds a certain lower limit. Each of the models is
assigned a posterior probability by renormalizing its respective BIC as follows:
L
pˆ ( M i | D)  exp( 0.5 BIC i ) /  exp( 0.5 BIC i ) ,
i 1
6
where L is the number of models in the output, Mi denotes a particular partitioning.
Choosing a single “best clustering” is accomplished via computing the estimates of
posterior class membership probabilities for each observation, ik , i=1,..,n; k=1,..K.
ˆik 

pˆ ( M j | D) ,
(6)
k
{M j :i M
}
j
where  Mk denotes a sub-set of indices {1,..n} associated with observations that fall in cluster k in
model M. In words, we just add the estimated posterior probabilities for those models where a
given observation is active in the kth cluster. Once the class membership probabilities have been
obtained, an observation is allocated to a cluster where it has the highest posterior probability. It
is important to understand that allocating an observation to a cluster brings about uncertainty that
has to be accounted for. We estimate of the class membership uncertainty as 1  max(ˆik ) .
k 1... K
It should be noted that using a BIC approximation instead of a full Bayesian model
averaging approach may result in inaccurate approximations to the posterior model probabilities
and, in particular, it does not account for uncertainty in estimated parameters (regression
coefficients), that are fixed at their ML estimates rather than integrated out. We resort to BIC as
a computationally simpler and reliable alternative to fully Bayesian approach (see Raftery 1995).
3.
Analysis of Ohio data
Until recently, most biological data was generally not collected at the same time or location as
chemical and habitat data. Dyer et al. (2000) developed a historical data set on streams in Ohio
by matching biological data in space and time with chemical and habitat data using a
geographical information system rule based approach. When multiple observations of chemical
measurements were available, the median value was used. This resulted in a matched spatial set
of observations for the state. In this study, we analyze the relationship between the response
7
given by the index of biological integrity (IBI), three chemical stressors, DO (dissolved oxygen),
pH, and Zinc and the qualitative habitat evaluation index, QHEI (it is a measure of habitat
quality that combines scores for various factors related to stream gradient, in-stream cover score,
siltation, etc). Even with the matching, the full data contains a large number of missing values
and for the present analysis we selected a representative group of 330 cases with completed
records. The data for the analysis were selected from various basins with the intention to cover
most of the basins in the state. Our goal is to divide the data into clusters based on the strength of
the relationship between the IBI (the response) and the explanatory variables. This is different
from the standard cluster analysis in that the latter would classify the sites based on the withincluster distances for the relevant variables, however disregarding the nature of the relationship
between the biological data and the environmental variables which itself can be viewed as a
determining feature of the region. It can be argued that in forming clusters we have to preserve
the local integrity of the data and therefore instead of combining individual sites, we will be
concerned with aggregating information on larger units (river basins). Therefore, we consider
two levels of hierarchy: basins (larger level) and streams (smaller units) and constrain the
algorithm to treat all samples within a basin as an object or all samples within stream as an
object.
The data selected for analysis contain information on 19 basins with varying number of
observations per basin. Figure 1 shows the individual sites from a larger data set (734 records)
plotted with their associated physical coordinates (i.e. latitude and longitude). Prior to analysis
of the data, the log transformation was applied to DO and Zinc. Biplot displays (Gabriel, 1971;
Lipkovich and Smith, 2002) were used to provide an initial evaluation of the sites and variables
8
and indicated relationships between QHEI and IBI although no strong indication of clustering
was apparent.
3.1
Analysis with restrictions at stream level
Results of clustering with restrictions at the stream level are displayed in Figure 1 and some
summary information is given in Table 1. One can see that the first group represents a region
with overall better environmental conditions (the mean level of IBI is 41.88 versus 35.08 for the
second group), which can be partly explained by a smaller concentrations of Zinc. Also this
cluster of sites shows significant negative relationship between Zinc and the effect of interest.
QHEI is an important factor for both clusters although it has a more significant effect for cluster
1. Dissolved oxygen is important for cluster 2 while Ph is important for cluster 1. It is
interesting to check if the sites from the first group form a compact geographic area. This can be
seen from Figure 1 where the grouping is related to the physical coordinates. Most of the sites
that form our first class are found in the southern part of the map, and there are several sites in
the northern part whose allocation to the first group exhibits high level of uncertainty. To make
more direct conclusions about this example, we need more subject specific information about the
ecoregions that cover the state and other knowledge about the sites.
3. 2
Analysis with restrictions at basin level
To obtain more spatially compact clusters, we imposed constraints at the basin level so that each
set of samples within a basin was considered an object. Again we use the regression relationships
with the same four independent variables. There were a total of 12 basins (sub-clusters) that were
classified into three groups. Figure 2 displays the results of this analysis for a three-cluster
solution.
9
Cluster 1 is formed of geographically compact sites in the northwestern part of the ecoregion and
corresponds to a single basin. The vertical bars in Figure 2 associated with samples from cluster
3 that are near cluster 1 are class membership uncertainties (multiplied by 105). Cluster 1
represents a basin with regions of poor habitat (QHEI) and stress from metals (Zinc). The second
cluster of basins has the highest regression t-statistic on QHEI (and represents the group of
basins with the highest absolute level of IBI and QHEI) and represents the region in the southern
part of the state. The third cluster is primarily associated with the northwestern part of the state
where agriculture dominates. Here, loss of habitat is of importance.
4
Conclusion and Discussion
Environmental data sets are often collected over space without emphasis on the underlying
model that relates biological response and environmental stress. Finding relationships in the
collected data becomes an important problem as it allows for evaluation of the extent and
strength of biological stressor-response relationships. The goal of this paper was to show how the
MC3 methodology can be applied to the problem of selecting subsets of observations with certain
interesting features. This is an interesting example of employing the symmetry between
treatment of variables and observations in statistical studies.
We developed a general algorithm for MC3 based cluster analysis whose performance
was shown with several simulated examples (Lipkovich, 2002) and a real data set. An important
modification of the algorithm can accommodate natural restrictions on the model space in that
the original observations are nested within larger units and their hierarchical relationships has to
be preserved in the solution. This situation is very typical of ecological studies of effect/stressor
relationships. Our data analysis allowed us to classify basins of one Ohio ecoregion based on the
10
similarity in the patterns of relationships between biological variables, habitat and chemical
stressors, rather than just similarity in the means of multivariate observations.
In addition to providing a reliable search procedure (as was shown in limited simulation
experiments that will be presented elsewhere), the stochastic search method enhances our
analysis by the estimation of model selection uncertainty, which in the context of cluster analysis
can be used to construct class memberships.
In this application we assumed that the number of clusters, K is known and fixed; we also
developed a heuristic procedure for determining the bets K by using a cross-validation type
procedure (see Lipkovich, 2002, p.118). According to this procure, K is chosen so as to minimize
a cross-validation classification error that is estimated by re-allocating each observation to the
closest cluster after re-computing regression without the deleted observation.
Acknowledgements
This research was funded in part by U.S. EPA-Science To Achieve Results (STAR) Grant #RD83136801-0.
Biographical Sketches
Ilya Lipkovich is a Research Scientist at Eli Lily & Co in Indianapolis, IN, USA
Eric Smith is a Professor in the Statistics Department at Virginia Tech in Blacksburg, VA, USA
Keying Ye is an Associate Professor in the Statistics Department at Virginia Tech in Blacksburg,
VA, USA.
11
References
Banfield, J.D. and Raftery, A.E. (1993). Model based Gaussian and non-Gaussian clustering.
Biometrics, 49, 803-821.
Bensmail, H., Celeux, G., Raftery, A.E. and Robert, C. (1997). Inference in model-based cluster
analysis. Statistics and Computing, 7, 1-10.
Dyer, S.D., White-Hull, C., Carr G.C., Smith, E.P. and Wang, X. 2000. Bottom-up and top-down
approaches to assess multiple stressors over large geographic areas. Environmental
Toxicology and Chemistry, 19(4-2), 1066-1075.
Fraley, C., Raftery, A.E. (1998). How many clusters? Which clustering method? Answers via
model-based cluster analysis. Technical Report No. 329, Department of Statistics, University
of Washington
Gabriel, K.R. (1971). The biplot-graphic display of matrices with application to principal
component analysis. Biometrika, 58, 453-67.
Lipkovich, I. (2002). Bayesian Model Averaging and Variable Selection in Multivariate
Ecological Models. PhD Dissertation, 2002, Virginia Polytechnic Institute.
Lipkovich, I. and Smith, E.,P. (2002). Biplot and SVD macros for EXCEL. Journal of Statistical
Software. 7:5
Madigan, D. and Raftery, A.E. (1994). Model selection and accounting for model uncertainty in
graphical models using Occam’s window. Journal of the American Statistical Association,
89:428, 1535-1546
McLachlan, G.J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
Norton, S.B. (1999). Using Biological Monitoring Data to Distinguish Among Types of Stress in
Streams of the Eastern Corn Belt Plains Ecoregion. Ph.D. thesis. George Mason University
Raftery, A.E. (1995) Bayesian model selection in social research (with discussion). In Marsden,
P.V., editor, Sociological Methodology, 111-195. Blackwells Publishers, Cambridge, Mass
Thisted, R.A. (1988). Elements of Statistical Computing. Chapman and Hall, London
Wedel, M. and Kamakura, W.A. (1999). Market Segmentation, Methodological and Conceptual
Foundation. Dordrecht: Kluwer, 2nd edition.
12
Table 1.
Group means and t-statistics (in parenthesis) from OLS regressions for twocluster solution. IBI is response variable. T-statistics in the IBI column are for the
intercept.
Class
1
2
Total
Figure 1
Group Means with associated t statistics
DO
pH
Zinc
7.72 (0.18)
8.07 (4.40)
14.61 (-2.03)
7.54 (3.49)
7.87 (2.43)
22.74 (-0.87)
7.61
7.95
19.71
IBI
41.88 (-3.96)
35.08 (-1.81)
37.61
QHEI
66.18 (10.09)
64.37 (6.08)
65.04
Geographical coordinates (degrees of longitude and latitude) of sites with
classification results (2 clusters, restrictions at stream level, vertical bars indicate
latitude
class membership uncertainties)
220
210
Cluster 1
200
Cluster 2
190
longitude
180
90
100
110
13
120
130
Figure 2
Geographical coordinates of the steams with classification results (3 clusters,
restrictions are set at basin level, vertical bars indicate class membership
latitude
uncertainties
220
210
Cluster 1
Cluster 2
200
Cluster 3
190
longitude
180
90
100
110
14
120
130
Download