EXPLORING SPATIAL CORRELATION IN RIVERS
by Joshua French

Introduction

A city is required to extend its sewage pipelines farther into its bay to meet EPA requirements. How far should the pipelines be extended? The city doesn't want to spend any more money than necessary, so it needs a way to predict the waste levels at different sites in the bay.

With the passage of the Clean Water Act in the 1970s, spatial analysis of aquatic data has become even more important. Section 305(b) requires state governments to provide "a description of the water quality of all navigable waters in such State..." It is not physically or financially possible to make measurements at all sites, so some form of spatial interpolation must be used.

Usually we fit some sort of linear model to the data to make predictions, assuming the observations are independent. For spatial data, however, we intuitively expect two sampling sites in close proximity to be more similar than two sites separated by a great distance. We can use the correlation between sampling sites to make better predictions with our model.

The Ohio River

The Road Ahead
- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work

Introduction to the Variogram

Spatial data is often viewed as a stochastic process. At each point x, a property Z(x) is viewed as a random variable with mean µ, variance σ², higher-order moments, and a cumulative distribution function. Each individual Z(x_i) is assumed to have its own distribution, and the set {Z(x_1), Z(x_2), ...} is a stochastic process. The values in a given data set are simply one realization of the stochastic process.

We want to measure the relationship between different points. Define the covariance of Z(x_j) and Z(x_k) to be

    Cov(Z(x_j), Z(x_k)) = E[{Z(x_j) - µ(x_j)}{Z(x_k) - µ(x_k)}],

where µ(x_j) and µ(x_k) are the means of Z at the respective locations. However, we have a problem: we don't know the mean at each point because we only have one realization. To solve this, we must assume some sort of stationarity, i.e. that certain features of the distribution are identical everywhere. We will work with data that satisfies second-order stationarity. Second-order stationarity means the mean is the same everywhere, E[Z(x_j)] = µ for all points x_j. It also implies that Cov(Z(x_j), Z(x_k)) becomes a function of the distance from x_j to x_k. Thus

    Cov(Z(x_j), Z(x_k)) = Cov(Z(x), Z(x+h)) = Cov(h),

where h measures the distance between the two points. We can then derive

    Cov(Z(x), Z(x+h)) = E[(Z(x) - µ)(Z(x+h) - µ)] = E[Z(x)Z(x+h)] - µ².

Sometimes it is clear that our data is not second-order stationary. Georges Matheron addressed this problem in 1965 by establishing his "intrinsic hypothesis". For small distances h, Matheron held that

    E[Z(x) - Z(x+h)] = 0.

Looking at the variance of the differences, this leads to

    Var[Z(x) - Z(x+h)] = E[(Z(x) - Z(x+h))²] = 2γ(h).

Intrinsic stationarity is useful because analysis may be conducted even when second-order stationarity is violated. Unfortunately, the covariance function is not defined under intrinsic stationarity alone. For this reason, we will work with data that is second-order stationary. If the original data violates second-order stationarity, we will perform additional procedures (such as removing a trend) so that the data we analyze is second-order stationary.
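To make the second-order stationarity assumption concrete, here is a minimal Python sketch (not part of the original analysis) that simulates a stationary process on a one-dimensional transect and checks empirically that the covariance between two sites depends only on their separation h, not on where the pair sits. The exponential covariance function and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D transect of 100 sites, one unit (e.g., one river mile) apart.
x = np.arange(100)

# Assumed second-order stationary model: constant mean and an
# exponential covariance that depends only on the lag |h|.
mu, sigma2, range_par = 10.0, 2.0, 5.0
H = np.abs(x[:, None] - x[None, :])      # matrix of pairwise lags
C = sigma2 * np.exp(-H / range_par)      # Cov(h) = sigma^2 * exp(-|h|/r)

# Draw many realizations of the process Z.
Z = rng.multivariate_normal(np.full(x.size, mu), C, size=5000)

# The empirical covariance between Z(x) and Z(x+h) should match Cov(h)
# regardless of where the pair sits on the transect.
h = 3
for start in (0, 40, 80):
    c = np.cov(Z[:, start], Z[:, start + h])[0, 1]
    print(f"pair at sites ({start}, {start + h}): cov = {c:.3f}, "
          f"model Cov({h}) = {sigma2 * np.exp(-h / range_par):.3f}")
```

In practice we only have one realization, which is exactly why a stationarity assumption is needed before anything can be estimated from the data.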
Note that second-order stationarity implies intrinsic stationarity, so the variogram equation is still defined. Under second-order stationarity,

    γ(h) = Cov(0) - Cov(h).

γ(h) is known as the semi-variogram. In practice, however, it is usually referred to simply as the variogram.

Things to know about variograms:
1. γ(h) = γ(-h). Because the variogram is an even function, usually only positive lag distances are shown.
2. Nugget effect: by definition, γ(0) = 0. In practice, however, sample variograms often have a positive value at lag 0. This is called the "nugget effect".
3. Variograms tend to increase monotonically with lag distance.
4. Sill: the maximum variance of the variogram.
5. Range: the lag distance at which the sill is reached.

[Figure: "Variogram Example", variance vs. lag distance, with the nugget, sill, and range labeled]

Exploratory Analysis

Before we model variograms, we should explore the data:
- We need to make sure that the data analyzed satisfies second-order stationarity.
- We need to check for outliers.
- We need to make sure that the data is not too badly skewed (G1 > 1).

We can look at the river data as a one-dimensional linear system, so it is fairly easy to check for stationarity using a scatter plot of the variable against river mile (RMI).

[Figure: scatter plot of Square Root of Percent Invertivore vs. RMI]

If there is an obvious trend in the data, we should remove it and analyze the residuals. If the variance increases or decreases with lag distance, we should transform the variable to correct this.

To check for outliers, we may use a typical boxplot. If the data contains outliers, we should do the analysis both with and without the outliers present.

If G1 > 1, we should transform the data to approximate normality if possible. To check approximate normality, the standard qq-plot can be used.

[Figure: qq-plot of observed values against quantiles of the standard normal]

The Sample Variogram

One of the previous definitions of the semivariance is

    γ(h) = (1/2) E[(Z(x) - Z(x+h))²].

The logical estimator is

    γ̂(h) = (1/(2N(h))) Σ_{j=1}^{N(h)} [z(x_j) - z(x_j + h)]²,

where N(h) is the number of pairs of observations separated by lag h.

[Figure: "Sample Variogram Example", variance vs. lag distance]
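The estimator γ̂(h) translates almost directly into code. Below is a minimal Python sketch of the sample variogram for one-dimensional river-mile data. The lag binning scheme, the tolerance parameter, and the synthetic example values are illustrative assumptions, not the exact procedure used in this analysis.

```python
import numpy as np

def sample_variogram(x, z, lags, tol):
    """Classical sample variogram for 1-D coordinates x.

    For each lag h in `lags`, averages (z_j - z_k)^2 / 2 over all
    pairs whose separation |x_j - x_k| falls within h +/- tol.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    d = np.abs(x[:, None] - x[None, :])     # pairwise separations
    sq = (z[:, None] - z[None, :]) ** 2     # pairwise squared differences
    iu = np.triu_indices(len(x), k=1)       # count each pair once
    d, sq = d[iu], sq[iu]

    gamma, counts = [], []
    for h in lags:
        in_bin = np.abs(d - h) <= tol
        n = in_bin.sum()                    # N(h): number of pairs at this lag
        gamma.append(sq[in_bin].sum() / (2 * n) if n else np.nan)
        counts.append(n)
    return np.array(gamma), np.array(counts)

# Example with synthetic river-mile data (illustrative values only):
rng = np.random.default_rng(1)
rmi = np.sort(rng.uniform(0, 1000, 200))    # river-mile coordinates
z = rng.normal(size=rmi.size)               # placeholder observations
gamma_hat, n_pairs = sample_variogram(rmi, z,
                                      lags=np.arange(10, 260, 10), tol=5)
```

Because river sites are irregularly spaced, pairs are grouped into lag bins rather than matched at exact distances; wider bins give smoother but less detailed sample variograms.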
Modeling the Variogram

Our goal is to estimate the true variogram of the data. Four variogram models were used to model the sample variogram: the spherical, Gaussian, exponential, and Matern models.

[Figure: "Variogram Models", example spherical, Gaussian, exponential, and Matern curves, variance vs. lag distance]

The spherical model is fit by least squares; the exponential, Gaussian, and Matern models are fit by maximum likelihood. The spherical model is fit first to get estimates of the sill, nugget, and range, and these estimates are then used to fit the other three models. The "best model" is the model that minimizes the AICC statistic.

Analysis

The data analyzed is a set of particle size and biological variables for the Ohio River, collected by the Ohio River Valley Water Sanitation Commission, better known as ORSANCO.

[Figure: ORSANCO data collection]

There were between 190 and 235 unique sampling sites, depending on the variable. Some sites had more than one observation; in these situations, the average value for the site was used for analysis.

[Figure: "Ohio River Sampling Sites", map in NAD27 longitude/latitude, marking Pittsburgh, PA; Cincinnati, OH; Louisville, KY; and Cairo, IL]

There were two main types of data: particle size data and biological levels. The particle size data measured percent gravel, percent sand, percent fines, percent hardpan, percent boulder, and percent cobble. The biological data measured:
- Number of individuals at a site
- Number of species at a site
- Percent tolerant fish
- Percent simple lithophilic fish (fish that lay eggs on rocks)
- Percent non-native fish
- Percent detritivore fish (fish that eat mostly decomposed plants or animals)
- Percent invertivore fish (fish that eat mostly invertebrate animals)
- Percent piscivore fish (fish that eat mostly other fish)

The results of the analysis fell into three main groups:
- Sample variogram fit well
- Sample variogram did not fit well
- Analysis not reasonable

Good Results: Number of Individuals at a Site

The skewness coefficient of the data is 8.16, which is much too high, so the data is transformed using the natural logarithm. The new skewness coefficient is reduced to 0.56: not perfect, but much less skewed.

[Figure: qq-plot "Check Normality of log(Num Individuals)"]

[Figure: scatter plot "Check Second-Order Stationarity of log(Num Individuals)", log(Number of Individuals) vs. RMI]

[Figure: boxplot "Check for outliers of log(Num Individuals)"]

There are a number of outliers for the transformed variable, so we should do the analysis both with and without the outliers present.

[Figure: "log(Num Individuals) Sample Variogram with outliers", variance vs. lag distance (miles)]

[Figure: qq-plot "Check normality of log(Num Individuals) without outliers"]

[Figure: "log(Num Individuals) Sample Variogram without outliers", variance vs. lag distance (miles)]

We were not able to model the sample variogram perfectly, but we were able to detect some amount of spatial correlation in the data, especially when the outliers were removed. For the transformed variable without outliers, the exponential model estimated the nugget to be 0.20, the sill to be 0.2709, and the range to be 37.7 miles.

Poor Results: Percent Sand

The skewness coefficient is only 0.18, so skewness is not a major factor. We check second-order stationarity using a scatter plot.

[Figure: "Check Stationarity of Percent Sand", percent sand vs. RMI]

There appears to be a trend in the data. After removing the trend, the data appears to be second-order stationary, and the residuals are approximately normal.

[Figure: "Check stationarity of percent sand residuals", residuals vs. RMI]

[Figure: qq-plot "Check normality of percent sand residuals"]

[Figure: "Sample Variogram of percent sand residuals", variance vs. lag distance (miles)]

The sample variogram does not really increase monotonically with distance, and our variogram models cannot fit it very well. Though we can obtain estimates of the nugget, sill, and range, the estimates cannot be trusted.

No Results: Percent Hardpan

This variable was so badly skewed that analysis was not reasonable. The skewness coefficient is 12.38, which is extremely high.

[Figure: "QQplot of Percent Hardpan"]

[Figure: "Scatter plot of Percent Hardpan", percent hardpan vs. RMI]

The data is nearly all zeros! There is also an erroneous data value: a percentage cannot be greater than 100%. Data analysis does not seem reasonable, because the data does not meet the conditions necessary to use the spatial methods discussed.
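As an illustration of the model-fitting step described in the Modeling the Variogram section, here is a minimal Python sketch that fits an exponential variogram model to a sample variogram by nonlinear least squares. The parameterization (with practical range roughly 3a) and the use of scipy's curve_fit are my assumptions; the actual analysis fit the spherical model by least squares and the exponential, Gaussian, and Matern models by maximum likelihood. The synthetic values below are chosen to echo the log(Number of Individuals) fit (nugget 0.20, sill about 0.27, range about 37.7 miles), not reproduce it.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_variogram(h, nugget, sill, a):
    """Exponential variogram model: rises from `nugget` toward `sill`
    with range parameter `a` (practical range is roughly 3*a)."""
    return nugget + (sill - nugget) * (1.0 - np.exp(-h / a))

# Sample variogram input (lags and gamma_hat as produced by the
# sample_variogram sketch above; these values are synthetic).
lags = np.arange(10, 260, 10, dtype=float)
gamma_hat = exp_variogram(lags, 0.20, 0.27, 12.5) + \
            np.random.default_rng(2).normal(0, 0.005, lags.size)

# Starting values play the role that the spherical-model estimates of
# nugget, sill, and range play in the actual analysis.
p0 = [0.1, 0.3, 20.0]
(nugget, sill, a), _ = curve_fit(exp_variogram, lags, gamma_hat, p0=p0,
                                 bounds=([0, 0, 0], np.inf))
print(f"nugget={nugget:.3f}, sill={sill:.3f}, "
      f"practical range = {3 * a:.1f} miles")
```

Comparing several such fitted models by a criterion like AICC then selects the "best model" for each variable.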
Conclusions

- Able to fit the sample variogram reasonably well: percent gravel, number of individuals, number of species
- Not able to fit the sample variogram well: percent sand, percent detritivore, percent simple lithophilic individuals, percent invertivore
- No results: remaining variables

Summary of Results

Response                         Transformation             Trend Removed       Model        Nugget   Sill     Range
Percent Gravel                                                                  Exponential  286.09   335.53   72.9 miles
Percent Sand                                                38.1082 + 0.0330x   Gaussian     520.88   658.32   71.67 miles
Percent Cobble
Percent Hardpan
Percent Fines
Percent Boulder
Number of Individuals            Natural Log                                    Gaussian     0.29     0.39     44.19 miles
Number of Individuals            Natural Log (no outliers)                      Exponential  0.20     0.27     37.69 miles
Number of Native Species                                    17.7849 - 0.0042x   Gaussian     10.1     11.87    39.93 miles
Percent Tolerant Individuals
Percent Lithophilic Individuals  Square Root                15.5364 - 0.0023x   Exponential  0.92     2.76     44.02 miles
Percent Nonnative Individuals
Percent Detritivore              Square Root                                    Exponential  1.09     1.57     24.08 miles
Percent Detritivore              Square Root (no outliers)                      Exponential  0.94     1.40     19.17 miles
Percent Invertivore              Square Root                6.5207 - 0.0039x    Matern       1.40     2.97     13.43 miles
Percent Piscivore

Future Work

The next data set involves three streams in Norfolk, Virginia, each with 25 observations, collected by researchers at Old Dominion University.

Difficulties to overcome:
- What is the best way to measure distance between points?
- Few observations
- Overlapping points after coordinate conversion

Problem: What is the best way to measure distance between points? There is some aspect of two-dimensionality to the data, but it is still really a one-dimensional problem.

[Figure: "Paradise Creek Sampling Sites", with the Paradise Creek region of interest shown in UTM coordinates]

Problem: 25 observations per stream is considered the minimum number of points needed to create a variogram.
- The sample variogram will be very rough.
- Our variogram model estimates will probably be poor.
To correct this, we will explore the possibility of combining the data from the three streams.

Problem: Overlapping points after conversion.
- The original data is in longitude/latitude coordinates.
- We convert to UTM coordinates so that Euclidean distance makes sense.
- The converted UTM coordinates often result in overlapping sites (and even fewer unique sampling sites).

[Figure: "Stream Sampling Sites (Lat/Long)" and "Stream Sampling Sites (UTM)", the same sites plotted in NAD27 longitude/latitude and in UTM coordinates]

Acknowledgments

- My committee: Dr. Urquhart, Dr. Wang, and Dr. Theobald
- Dr. Davis and Dr. Reich, for answering my spatial questions and letting me use their S-Plus spatial library

Concluding Thought

Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you're a mile away and you have their shoes.
- Jack Handey