This file was created by scanning the printed publication. Errors identified by the software have been corrected; however, some errors may remain.

Filling in Missing Forestry Data: Exploring Autocorrelational Techniques

Alissa N. Antle¹ and Peter L. Marshall²

Abstract. A crucial component of any effective approach to long-term management and protection of valuable forest resources is the ability to represent and model spatial forest ecosystem processes. Geographic information systems (GIS) integrated with spatial forest models provide an appropriate spatial framework for forest ecosystem modeling. In order to be effective, accurate ecological data must be used as input to GIS and spatial forest models. It is often the case that the forest inventory attribute of interest has incomplete coverage for the area under management. In these cases it is mandatory to have techniques that can predict the missing values while preserving the complex spatial patterns between forest inventory attributes. One of the most prominent spatial patterns found in ecological data is the presence of autocorrelation. This paper introduces several autocorrelational tests, describes different methods of representing spatial autocorrelational structures, and discusses various approaches which, either explicitly or implicitly, use this information to predict missing values in ecological data sets.

INTRODUCTION

In order for spatial forest models to be effective, accurate and complete ecological data must be used. Spatial data quality has received much recent attention in the GIS community (Goodchild and Gopal 1989, Chrisman 1991, Guptill and Morrison 1995). The term completeness has been used in reference to both resolution and missing values. Missing values are those which are expected to appear in the data set but do not. In forest inventory data sets, values may be missing due to collection errors or omissions.
In particular, attributes required for a specific forest model may not have been collected, or, because of the expense or effort required to collect the data, only a limited set of data may be available. In order to ensure that data sets are fit for forest modeling purposes, techniques that can predict missing values are necessary to complete data sets prior to use in existing models or in developing new models.

While conventional statistical techniques have been widely used to predict missing values, most of these techniques substitute mean or median values for missing data. These approaches do not preserve the complex spatial patterns which may be found among multivariate forest inventory attributes. In addition, most interpolative approaches are based on the assumption of spatially independent data. Simply stated, the problem is how best to predict missing, autocorrelated data. This paper presents an initial exploration into the missing forestry data problem given the autocorrelated nature of most spatial ecological data. Space restrictions preclude a complete discussion of this topic.

¹ Department of Geography, University of British Columbia, Canada.
² Department of Forest Resources Management, University of British Columbia, Canada.

SPATIAL STRUCTURE

While the problem outlined above may seem to be a simple statistical inference problem, it is not. The distribution of natural abiotic features, and of induced biotic features, is typically neither uniform nor random. Rather, these features tend to be aggregated in patches or to form gradients or other types of spatial structures (Legendre 1993).

Why is representation of spatial pattern important in forest and ecological models? The spatial or temporal structure or pattern of ecosystems is a crucial element in most ecological theories. Any forest or ecological model based on such theories must take spatial pattern and the nature of spatial ecological data into account.
Specific characteristics of spatial data can be identified which cause problems in traditional statistical techniques. There are two main classes of spatial effects: spatial dependence and spatial heterogeneity (Anselin and Getis 1992, Cressie and ver Hoef 1993). Only spatial dependence will be considered in this paper.

Spatial Autocorrelation

What is significant about the patterns found in spatial ecological data? According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things. This simply means that the relations within and between variables sampled at spatially proximal locations will be stronger than those sampled at greater distances. Ecological data are spatially dependent by nature. Spatial dependence is often referred to as spatial autocorrelation.

A major consequence of spatial autocorrelation is that traditional statistical inference from an autocorrelated sample is not as efficient as inference from an independent sample of the same size. That is, spatial dependence leads to a loss of information, information which is crucial for predicting missing values. Positive spatial autocorrelation may cause the variance to be underestimated. Negative autocorrelation may inflate estimates of variance. In either case, predictive equations can become biased and the results of prediction unreliable. Clearly, some new methods or enhancements to existing statistical methods of spatial inference are needed to address the case of spatially autocorrelated data (Antle and Klinkenberg 1995).

MODELING SPATIAL AUTOCORRELATION

As discussed above, many of the basic statistical methods used in ecological studies are impaired by autocorrelated data. Most, if not all, environmental data fall into this category (Legendre and Fortin 1989). Prediction methods which can be used to fill in missing forestry data must account for spatial autocorrelation or be resistant to its effects.
Before these techniques are applied, the nature of the spatial patterns in the data should be explored. Various tests for autocorrelation and trends can be used. Once we know that the data are autocorrelated and the nature of the relation (i.e., either positive or negative), techniques which take this pattern into account can be applied.

Testing for Autocorrelation

Since the presence of autocorrelation may distort the results of conventional statistical interpolation techniques, autocorrelation should be tested for prior to selecting any particular model (i.e., during the exploratory phase of data analysis) in order to ensure accurate predictions. There are various methods which test for the presence of spatial autocorrelation in univariate data and for correlation in multivariate data sets. These are well documented and will not be described here. Univariate methods include the join count test for nominal data, Geary's c or Moran's I test for ordinal and interval data, and spectral analysis (see Haining 1990 for a detailed description). Multivariate methods include the Spearman rank correlation coefficient for ordinal data, the Pearson product-moment correlation coefficient for interval data, the Mantel test and the correlogram (see Haining 1990 or Goodchild 1986 for details).

If large scale spatial dependency exists, it may be removed by regression or model-fitting (e.g., trend surface analysis). Traditional statistical techniques can then be used without violating their assumptions. However, caution should be taken in order to avoid removing the determinants of the underlying processes along with the spatial structure. An alternative to removing spatial dependence is to modify statistical approaches to take spatial autocorrelation into account. At the very least, techniques which are resistant to spatial autocorrelational effects should be used. Techniques which utilize autocorrelational information (to enhance prediction accuracy) are preferable.
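As an illustration of the univariate tests named above, the following is a minimal sketch of Moran's I with a permutation-based significance check. It is not taken from the paper: the function names (`morans_i`, `permutation_test`), the binary distance-threshold weighting scheme and the two-sided permutation p-value are assumptions chosen for brevity, and other weighting schemes are equally defensible.

```python
import numpy as np


def morans_i(values, coords, threshold):
    """Moran's I using binary weights (assumed scheme): two distinct
    sites are neighbours if they lie within `threshold` of each other."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all sites.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Weight 1 for neighbouring pairs, 0 otherwise (including self-pairs).
    # At least one pair must fall within the threshold, otherwise the
    # sum of weights is zero and the statistic is undefined.
    weights = ((dist > 0) & (dist <= threshold)).astype(float)
    z = values - values.mean()
    n = len(values)
    return (n / weights.sum()) * (z @ weights @ z) / (z @ z)


def permutation_test(values, coords, threshold, n_perm=999, seed=0):
    """Two-sided permutation p-value for Moran's I (illustrative only)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed = morans_i(values, coords, threshold)
    # Expected value of I when values are assigned to sites at random.
    expected = -1.0 / (len(values) - 1)
    simulated = np.array([
        morans_i(rng.permutation(values), coords, threshold)
        for _ in range(n_perm)
    ])
    extreme = np.abs(simulated - expected) >= abs(observed - expected)
    p_value = (1 + extreme.sum()) / (n_perm + 1)
    return observed, p_value
```

Values of I well above the expected value suggest positive autocorrelation (similar values cluster), while values well below it suggest negative autocorrelation. The choice of `threshold` is itself a modeling decision and should reflect the sampling design.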
First, how do we represent spatial autocorrelation?

Representing Autocorrelation

Autocorrelation can be represented through structure functions, which help quantify spatial dependency and partition it along distance intervals.

Geographic coordinates

The spatial structure of variables can be expressed as a linear combination of geographic coordinates of sample sites (Legendre 1993). For example, a high order polynomial can be built up using the x and y coordinates of sites (as is done in trend surface analysis). This information can then be incorporated into traditional inference models, such as regression models, using partial regression analysis techniques. The result of this analysis is the separation of the variation of the target variable into four components: the variance resulting from non-spatial environmental factors, spatial structure, the interaction of environmental and spatial factors, and the unexplained variation.

Geographic distances

Another approach is to use location as one of a set of predictor variables in a statistical model (Legendre 1993). In this case, the spatial structure of the data is represented by a matrix of geographic distances between site locations (also called a proximity matrix). For example, the Euclidean distance can be computed for all pairs of sites based on their geographic coordinates. This information is then summarized in a spatial distance matrix. If the remaining environmental variables can be represented in the form of distance matrices, then these distance matrices can be compared using some form of correlation analysis (e.g., the Mantel test; see Legendre and Fortin 1989).

Covariation

One of the most common structure functions is the variogram (and the related covariance function and correlogram). The variogram summarizes the spatial continuity for all possible pairings of data, for all significant lag distances, by modeling the average degree of similarity between values as a function of their separation distance.
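The variogram described above can be estimated from data in a few lines. The sketch below computes an omnidirectional empirical semivariogram (half the mean squared difference between pairs of values, grouped by separation distance) and defines one common parameterization of the exponential model. The function names, the binning scheme and the parameter names `nugget`, `partial_sill` and `range_` are assumptions introduced for illustration, not part of the paper.

```python
import numpy as np


def empirical_semivariogram(coords, values, bin_edges):
    """Omnidirectional empirical semivariogram.

    Returns, for each distance bin (lower, upper], half the mean squared
    difference between all pairs of values whose separation falls in the
    bin, together with the number of pairs per bin.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)
    # Each unordered pair of sites is counted once.
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)  # NaN marks bins with no pairs
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = (dist > bin_edges[b]) & (dist <= bin_edges[b + 1])
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            gamma[b] = 0.5 * sq_diff[in_bin].mean()
    return gamma, counts


def exponential_model(h, nugget, partial_sill, range_):
    """One common form of the exponential variogram model (assumed
    parameterization; some texts scale the exponent by 3)."""
    h = np.asarray(h, dtype=float)
    return nugget + partial_sill * (1.0 - np.exp(-h / range_))
```

Bins containing few pairs give unstable estimates, so the pair counts should be inspected before a model such as `exponential_model` is fitted to the estimated values.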
Variograms can be computed either as averages over all directions or specific to a particular direction. If there is a trend in the data (i.e., local means and local variances change as a function of location within the sampling space), the variogram will include both lag-to-lag variability and the trend variability (Rossi et al. 1992). In the case of stationary data, the variogram can be experimentally estimated from observed values. Various statistical models can then be fitted to the variogram, including exponential, spherical and linear models.

Prediction Utilizing Autocorrelation

Since ecological data are often correlated, techniques which can take advantage of the information contained in these relationships may be able to best predict missing values while retaining the variability and covariation structure of forest attribute data.

Regression models

Traditional linear regression can be extended to situations where spatial autocorrelation exists by the addition of lagged variables. Griffith (1987) outlines how spatial autocorrelation can be incorporated into traditional regression models. Promising results have been obtained from the introduction of an additional parameter into conventional statistical models that accounts for the latent spatial autocorrelative structure of the data. This technique has also been extended to multivariate models (Haining 1990). This approach is one step toward the integration of spatial dependence (in the form of geographic distance information) into predictive models.

As mentioned above, in partial regression analysis, the spatial component can be "partialled out" by regressing the spatial factor variables onto each explanatory variable until only the regression residuals remain (Legendre 1993). The residuals are then used to model the variable(s) which have missing values. This approach "explains" much of the variation that is left unaccounted for in simple regression models.
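The partialling-out idea can be sketched as follows, under simplifying assumptions that are ours rather than the paper's: a single explanatory variable that is known at every site, a low order polynomial trend surface in the site coordinates as the spatial component, and ordinary least squares throughout. The function names `trend_surface_design` and `fill_missing` are hypothetical, and the sketch does not reproduce the full variation partitioning described by Legendre (1993).

```python
import numpy as np


def trend_surface_design(x, y, order=1):
    """Polynomial terms x**p * y**q with p + q <= order (includes the
    constant term), as used in trend surface analysis."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    columns = [x ** p * y ** q
               for p in range(order + 1)
               for q in range(order + 1 - p)]
    return np.column_stack(columns)


def fill_missing(x, y, predictor, target, order=1):
    """Fill NaN entries of `target` using a spatial trend plus the
    relation between the spatially detrended predictor and target."""
    predictor = np.asarray(predictor, dtype=float)
    target = np.asarray(target, dtype=float)
    spatial = trend_surface_design(x, y, order)
    observed = ~np.isnan(target)
    # Partial the spatial component out of both variables, using only
    # the sites where the target was actually observed.
    t_coef, *_ = np.linalg.lstsq(spatial[observed], target[observed], rcond=None)
    p_coef, *_ = np.linalg.lstsq(spatial[observed], predictor[observed], rcond=None)
    t_resid = target[observed] - spatial[observed] @ t_coef
    p_resid = predictor - spatial @ p_coef
    # Regress the target residuals on the predictor residuals.
    slope = (p_resid[observed] @ t_resid) / (p_resid[observed] @ p_resid[observed])
    # Prediction = spatial trend + non-spatial contribution.
    filled = target.copy()
    filled[~observed] = spatial[~observed] @ t_coef + slope * p_resid[~observed]
    return filled
```

Under these assumptions the result coincides with an ordinary regression of the target on the trend terms and the predictor together; the value of the two-step form is that it makes the spatial and non-spatial contributions to each prediction explicit.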
Kriging

Kriging is a well known geostatistical spatial interpolation technique which explicitly uses autocorrelational information in its predictions (Rossi et al. 1994). Since it uses a local estimator and the autocorrelation structure of the data, kriging is quite unlike traditional methods, such as trend surface analysis, which ignore autocorrelational effects and use all of the data to estimate the unknown point's value. In addition, kriging provides variance information which is crucial in determining the accuracy of predicted values. Kriging can be applied to both nominal and continuous data.

In general, kriging predicts missing values at a specific site by taking a weighted linear average of available samples (like regression). It is similar to multiple regression with a few important differences. First, kriging can produce predicted values that are either larger or smaller than any of the sample values. Second, kriging utilizes both distance and geometry among samples, whereas traditional methods use only distance. Third, kriging attempts to minimize the variance of the expected error by inferring the variance from an empirical model of the degree of spatial dependence with distance and direction (i.e., the variogram).

Most similar neighbor analysis

Most similar neighbor (MSN) analysis is a canonically based technique which can be used to fill in interval or continuous data values by implicitly utilizing the spatial structural information of the data. Instead of estimating missing values one-by-one (for a particular site), the technique "chooses" a most similar site from a set of sites to act as a stand-in. The surrogate site is chosen on the basis of a similarity function (derived using canonical correlation analysis). This approach works well when the variable to be predicted and the predictor variables have strong functional relationships to each other (Moeur 1995) (as is the case for spatially autocorrelated data).
Because surrogate sites are chosen from among actual sites, impossible predictions for missing values cannot occur. However, this means that the sample sites must adequately span the full range of sites. While Moeur's procedure does not explicitly model spatial correlation, it attempts to maintain the covariance structure of the multivariate response attributes.

DISCUSSION

None of the methods described here for filling in missing forestry data can be used without careful consideration. They all make assumptions about the nature of the data and the phenomena under study. For specific cases, these assumptions may or may not hold. The choice of techniques should consider how the information contained in the observed data can be used, the accuracy of predictions, the amount of computational effort required, and how flexible the techniques are in allowing both theoretical and empirical information to be utilized (Haining 1990).

Regression can be extended by incorporating a spatial autocorrelation parameter and by using different forms of the model with spatially lagged variables. A fair amount of trial and error is required to determine the best fitting model for a particular data set. For a particular study, the different regression models should be compared not only in terms of included and excluded variables but also in terms of different forms of the model. Interpretive knowledge can be utilized in the choice of model. Regression assumes normality and stationarity, which may or may not hold. While regression is useful for estimating population means and totals, it does not always preserve the underlying relations among inventory attributes. For example, if one or more variables with missing values are difficult to predict, the correlation structure may be distorted. Regression may produce unreliable results for cases of high unique variances.
While traditional methods of interpolation (e.g., inverse distance weighting, local sample means) are less accurate, they are also less computationally intensive than kriging and do not require the subjective estimation of the variogram. Kriging assumes that the variogram is accurate over the entire study area. This is known as the stationarity hypothesis, and in cases where it does not hold, the results from kriging will be less accurate. In addition, kriging assumes that the variable being predicted is multivariate normal, yet raw data are rarely even univariate normal (Rossi et al. 1994).

While MSN analysis is a robust approach that implicitly captures spatial structure, it is sensitive to sample size in that smaller sample sizes reduce the variance, which in turn affects its predictive function. Results from MSN analysis are unpredictable if a strong non-linear relation exists among the variables of one or both sets. Canonical analysis (upon which MSN is based) is adequate for the case where the joint distribution of variables is linear and continuous. So, in cases where data are either heterogeneous or exhibit non-linear relations between variables, MSN will fall short. As well, the linear orthogonal nature of solutions in canonical analysis may be unrealistic for ecological data, where there are often high unique variances.

An issue which is important but has not been addressed in this paper is that of the scale of variation. Spatial variation may be conceptualized as being composed of several (possibly independent) components at several different scales (Haining 1990). In testing for, representing and modeling spatial autocorrelation, it is important to identify and account for the contribution from each of the different scales to the total variation.
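To make the kriging assumptions discussed in this section concrete, the following is a minimal sketch of ordinary kriging at a single location. It assumes that a variogram model has already been chosen and that it holds over the whole study area (the stationarity hypothesis); the function names, the exponential model and its default parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np


def exponential_variogram(h, partial_sill=1.0, range_=1.0):
    """Exponential variogram with no nugget (assumed model); equals 0
    at zero separation."""
    h = np.asarray(h, dtype=float)
    return partial_sill * (1.0 - np.exp(-h / range_))


def ordinary_kriging(coords, values, target, variogram=exponential_variogram):
    """Predict the value at `target` as a weighted average of `values`.

    Returns the prediction, the kriging variance and the weights.
    Sample locations are assumed to be distinct.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(values)
    # Variogram values between all pairs of sample sites.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Kriging system with a Lagrange multiplier that forces the weights
    # to sum to one (the unbiasedness constraint).
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = variogram(d)
    a[n, n] = 0.0
    # Variogram values between each sample site and the target.
    d0 = np.linalg.norm(coords - target, axis=1)
    b = np.append(variogram(d0), 1.0)
    solution = np.linalg.solve(a, b)
    weights, multiplier = solution[:n], solution[n]
    prediction = weights @ values
    variance = weights @ b[:n] + multiplier
    return prediction, variance, weights
```

Because the weights are obtained from the variogram rather than from distance alone, clustered samples are down-weighted relative to isolated ones, and individual weights may be negative, which is why predictions can fall outside the range of the sample values. The kriging variance depends only on the sample configuration and the variogram, so it is only as trustworthy as the fitted variogram model.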
CONCLUSIONS

If we rely on prediction models that assume that biological features are distributed uniformly or randomly in space, then the chances of obtaining accurate predictions of missing values are small, since spatial autocorrelation characterizes most ecological data. It is often difficult to determine whether assumptions have been violated, making it hard to compensate for such violations or to know how reliable the newly completed data set is. Consequently, spatial autocorrelation must be tested for, and techniques which either explicitly or implicitly consider spatial autocorrelation must be used when it is present.

While the different techniques examined above have various advantages, disadvantages and restrictions, it is impossible to produce a truly meaningful comparison unless each method is applied to a single, common set of variables. To date there have been few studies which have formally compared the various spatial inference techniques which can be used to fill in missing data values. Given the increased importance such methods will have in the future, conducting such a comparative study is certainly warranted.

REFERENCES

Antle, A. N. and Klinkenberg, B. 1995. Statistically-based data generation techniques: an emerging trend. In: Heit, M., Parker, H. D. and Shortreid, A. (eds.) GIS Applications in Natural Resources 2, GIS World, Inc.

Anselin, L. and Getis, A. 1992. Spatial statistical analysis and geographic information systems. The Annals of Regional Science, Springer-Verlag, Vol. 26, pp. 19-33.

Chrisman, N.R. 1991. The error component in spatial data. In: MacGuire D.J.,