EXPLORING SPATIAL CORRELATION IN RIVERS
by Joshua French

Introduction

A city is required to extend its sewage pipelines farther into its bay to meet EPA requirements. How far should the pipelines be extended? The city does not want to spend more than it needs to on the extension, so it needs a way to predict the waste levels at different sites in the bay.

Usually we might try to interpolate the data using a linear model, and usually we assume that observations are independent. For spatial data, however, we intuitively know that response values at points close together should be more similar than values at points separated by a great distance. We can use the correlation between sampling sites to make better predictions with our model.

The Road Ahead

- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work

Introduction to the Variogram

Spatial data is often viewed as a stochastic process. For each point $x$, a specific property $Z(x)$ is viewed as a random variable with mean $\mu$, variance $\sigma^2$, higher-order moments, and a cumulative distribution function. Each individual $Z(x_i)$ is assumed to have its own distribution, and the set $\{Z(x_1), Z(x_2), \ldots\}$ is a stochastic process. The data values in a given data set are simply a realization of the stochastic process.

For a spatial process, second-order stationarity is often assumed. Second-order stationarity implies that the mean is the same everywhere, i.e. $E[Z(x_j)] = \mu$ for all points $x_j$. It also implies that $\mathrm{Cov}(Z(x_j), Z(x_k))$ depends only on the separation between $x_j$ and $x_k$. Thus

$$\mathrm{Cov}(Z(x_j), Z(x_k)) = \mathrm{Cov}(Z(x), Z(x+h)) = \mathrm{Cov}(h),$$

where $h$ measures the distance between the two points.

Looking at the variance of differences,

$$\mathrm{Var}[Z(x) - Z(x+h)] = E[(Z(x) - Z(x+h))^2] = 2\gamma(h).$$

Assuming second-order stationarity, $\gamma(h) = \mathrm{Cov}(0) - \mathrm{Cov}(h)$. $\gamma(h)$ is known as the semi-variogram. The plot of $\gamma(h)$ against $h$ is known as the variogram.
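
The step from the variance of differences to the covariance form of the semi-variogram can be spelled out as follows. This is a short derivation that uses only the second-order stationarity assumptions stated above (a constant mean and a covariance that depends only on $h$).

```latex
\begin{align*}
E[Z(x) - Z(x+h)] &= \mu - \mu = 0, \\
\mathrm{Var}[Z(x) - Z(x+h)] &= E[(Z(x) - Z(x+h))^2] \\
  &= \mathrm{Var}[Z(x)] + \mathrm{Var}[Z(x+h)] - 2\,\mathrm{Cov}(Z(x), Z(x+h)) \\
  &= 2\,\mathrm{Cov}(0) - 2\,\mathrm{Cov}(h), \\
\gamma(h) &= \tfrac{1}{2}\,\mathrm{Var}[Z(x) - Z(x+h)] = \mathrm{Cov}(0) - \mathrm{Cov}(h).
\end{align*}
```

Because $\mathrm{Cov}(0) = \sigma^2$, a covariance that decays to zero at large lags corresponds to a variogram that levels off at $\sigma^2$.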

Things to know about variograms:

1. $\gamma(h) = \gamma(-h)$. Because it is an even function, usually only positive lag distances are shown.
2. Nugget effect: by definition, $\gamma(0) = 0$. In practice, however, sample variograms often have a positive value at lag 0. This is called the "nugget effect".
3. Variograms tend to increase monotonically with lag distance.
4. Sill: the maximum variance of the variogram.
5. Range: the lag distance at which the sill is reached. Observations are not correlated past this distance.

The following figure shows these features.

[Figure: Variogram Example. Variance against lag distance, with the nugget, sill, and range marked.]

Exploratory Analysis

The data studied is the longitudinal profile of the Ohio River. Instead of worrying about the river network with streams, tributaries, and other factors, we simply look at the Ohio River as a one-dimensional object.

[Figure: The Ohio River. Longitudinal Profile of the Ohio River Sampling Sites, latitude against longitude (NAD27), with Pittsburgh, PA; Cincinnati, OH; Louisville, KY; and Cairo, IL marked.]

Before we model variograms, we should explore the data.

- We need to make sure that the data analyzed satisfies second-order stationarity.
- If there is an obvious trend in the data, we should remove it and analyze the residuals.
- If the variance increases or decreases with lag distance, then we should transform the variable to correct this.

It is fairly easy to check for stationarity of this data set using a scatter plot.

[Figure: Scatter plot of Square Root of Percent Invertivore against RMI.]

- If the data contains outliers, we should do the analysis both with and without the outliers present.
- If the skewness coefficient G1 > 1, then we should transform the data to approximate normality if possible.

The Sample Variogram

One of the previous definitions of semivariance is

$$\gamma(h) = \frac{1}{2} E[(Z(x) - Z(x+h))^2].$$

The logical estimator is

$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{j=1}^{N(h)} [z(x_j) - z(x_j + h)]^2,$$

where $N(h)$ is the number of pairs of observations associated with that lag.
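
The estimator above can be sketched in a few lines of Python for one-dimensional locations such as river miles. This is a minimal illustration, not the code used for the analysis: the function name `sample_variogram` and the parameters `lag_width` and `max_lag` are hypothetical, and grouping pairs into lag bins is an added assumption, needed because irregularly spaced sites rarely share an exact lag $h$.

```python
import numpy as np


def sample_variogram(x, z, lag_width, max_lag):
    """Method-of-moments sample semivariogram for 1-D locations.

    x : site locations (for example, river mile)
    z : observed values at those sites
    Pairs are grouped into bins of width `lag_width` (an assumption of this
    sketch); pairs separated by `max_lag` or more are ignored.
    Returns the mean lag, the semivariance, and the pair count N(h) per bin.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)

    # Every unordered pair of sites exactly once.
    i, j = np.triu_indices(len(x), k=1)
    dist = np.abs(x[i] - x[j])
    sq_diff = (z[i] - z[j]) ** 2

    keep = dist < max_lag
    dist, sq_diff = dist[keep], sq_diff[keep]

    n_bins = int(np.ceil(max_lag / lag_width))
    bin_idx = (dist // lag_width).astype(int)

    n_pairs = np.bincount(bin_idx, minlength=n_bins)
    sum_sq = np.bincount(bin_idx, weights=sq_diff, minlength=n_bins)
    sum_dist = np.bincount(bin_idx, weights=dist, minlength=n_bins)

    # Empty bins give 0/0, which is reported as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        gamma = sum_sq / (2.0 * n_pairs)
        mean_lag = sum_dist / n_pairs
    return mean_lag, gamma, n_pairs


if __name__ == "__main__":
    # Made-up toy data, only to show the call.
    lags, gamma, n = sample_variogram(
        [0, 1, 2, 3], [1, 3, 2, 5], lag_width=1.0, max_lag=3.0
    )
    for h, g, count in zip(lags, gamma, n):
        print(h, g, count)
```

Returning the pair counts alongside the semivariances is deliberate: bins with few pairs give noisy estimates, so it helps to see $N(h)$ before trusting a point on the sample variogram.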

[Figure: Sample Variogram Example. Variance against lag distance.]

Modeling the Variogram

Our goal is to estimate the true variogram of the data. Four variogram models were used to model the sample variogram: the spherical, Gaussian, exponential, and Matern models.

[Figure: Variogram Models. Variance against lag distance for the exponential, spherical, Gaussian, and Matern models.]

Analysis

The data analyzed is a set of particle size and biological variables for the Ohio River. The data was collected by the Ohio River Valley Water Sanitation Commission, better known as ORSANCO. There were between 190 and 235 unique sampling sites, depending on the variable.

[Figure: ORSANCO data collection.]

The results of the analysis fell into three main groups:

- Able to fit the sample variogram well
- Not able to fit the sample variogram well
- Analysis not reasonable

Good Results: Number of Individuals at a Site

After correcting for skewness with a log transformation, there are a number of outliers. We analyze the data both with and without the outliers.

[Figure: log(Num Individuals) Sample Variogram with outliers. Variance against lag distance (mi).]

[Figure: log(Num Individuals) Sample Variogram without outliers. Variance against lag distance (mi).]

We were not able to model the sample variogram perfectly, but we were able to detect some amount of spatial correlation in the data, especially when the outliers were removed. We are able to obtain reasonable estimates of the nugget, sill, and range.

Poor Results: Percent Sand

After doing exploratory spatial analysis and removing a trend, we fit the sample variogram of the percent sand residuals.

[Figure: Sample Variogram of percent sand residuals. Variance against lag distance (mi).]

The sample variogram does not really increase monotonically with distance. Our variogram models cannot fit this very well.
Though we can obtain estimates of the nugget, sill, and range, the estimates cannot be trusted.

No Results: Percent Hardpan

This variable was so badly skewed that analysis was not reasonable. The skewness coefficient is 12.38, which is extremely high.

[Figure: QQ plot of Percent Hardpan. Percent Hard Pan against quantiles of the standard normal.]

[Figure: Scatter plot of Percent Hardpan. Percent Hard Pan against RMI.]

The data is nearly all zeros! There is also an erroneous data value: a percentage cannot be greater than 100%. Data analysis does not seem reasonable. Our data does not meet the conditions necessary to use the spatial methods discussed.

Conclusions

- Able to fit sample variogram reasonably well: percent gravel, number of individuals, number of species
- Not able to fit sample variogram well: percent sand, percent detritivore, percent simple lithophilic individuals, percent invertivore
- No results: remaining variables

Summary of Results

| Response | Transformation | Trend Removed | Model | Nugget | Sill | Range (miles) |
|---|---|---|---|---|---|---|
| Percent Gravel | | | Exponential | 286.09 | 335.53 | 72.9 |
| Percent Sand | | 38.1082 + 0.0330x | Gaussian | 520.88 | 658.32 | 71.67 |
| Percent Cobble | | | | | | |
| Percent Hardpan | | | | | | |
| Percent Fines | | | | | | |
| Percent Boulder | | | | | | |
| Number of Individuals | Natural Log | | Gaussian | 0.29 | 0.39 | 44.19 |
| Number of Individuals | Natural Log (no outliers) | | Exponential | 0.2 | 0.27 | 37.69 |
| Number of Native Species | | 17.7849 - 0.0042x | Gaussian | 10.1 | 11.87 | 39.93 |
| Percent Tolerant Individuals | | | | | | |
| Percent Lithophilic Individuals | Square Root | 15.5364 - 0.0023x | Matern | 0.92 | 2.76 | 44.02 |
| Percent Nonnative Individuals | | | | | | |
| Percent Detritivore | Square Root | | Exponential | 1.09 | 1.57 | 24.08 |
| Percent Detritivore | Square Root (no outliers) | | Exponential | 0.94 | 1.4 | 19.17 |
| Percent Invertivore | Square Root | 6.5207 - 0.0039x | Exponential | 1.4 | 2.97 | 13.43 |
| Percent Piscivore | | | | | | |

Future Work

Things to consider in future analysis:

- The water flows in only one direction.
  A point downstream cannot affect a point upstream.
- Natural features such as tributaries may impact spatial correlation.
- Manmade features such as dams may impact spatial correlation.

Concluding Thought

"Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you're a mile away and you have their shoes."
- Jack Handey