This file was created by scanning the printed publication.
Errors identified by the software have been corrected;
however, some errors may remain.
Filling in Missing Forestry Data:
Exploring Autocorrelational Techniques
Alissa N. Antle¹ and Peter L. Marshall²
Abstract.-A crucial component of any effective approach to
long-term management and protection of valuable forest resources
is the ability to represent and model spatial forest ecosystem
processes. Geographic information systems (GIS) integrated with
spatial forest models provide an appropriate spatial framework for
forest ecosystem modeling. For this modeling to be effective, accurate
ecological data must be used as input to GIS and spatial forest
models. It is often the case that the forest inventory attribute of
interest may have incomplete coverage for the area under
management. In these cases it is mandatory to have techniques that
can predict the missing values while preserving the complex spatial
patterns between forest inventory attributes. One of the most
prominent spatial patterns found in ecological data is the presence
of autocorrelation. This paper introduces several autocorrelational tests,
describes different methods of representing spatial autocorrelational
structures, and discusses various approaches which, either explicitly
or implicitly, use this information to predict missing values in
ecological data sets.
INTRODUCTION
In order for spatial forest models to be effective, accurate and complete
ecological data must be used. Spatial data quality has received much recent
attention in the GIS community (Goodchild and Gopal 1989, Chrisman 1991,
Guptill and Morrison 1995). The term completeness has been used in reference to
both resolution and missing values. Missing values are those which are expected
to appear in the data set but do not. In forest inventory data sets, values may be
missing due to collection errors or omissions. In particular, attributes required for
a specific forest model may not have been collected or, because of the expense or
effort required to collect the data, only a limited set of data may be available. In
order to ensure that data sets are fit for forest modeling purposes, techniques that
can predict missing values are necessary to complete data sets prior to use in
existing models or in developing new models.
¹ Department of Geography, University of British Columbia, Canada
² Department of Forest Resources Management, University of British Columbia, Canada
While conventional statistical techniques have been widely used to predict
missing values, most of these techniques substitute mean or median values for
missing data. These approaches do not preserve the complex spatial patterns which
may be found among multivariate forest inventory attributes. In addition, most
interpolative approaches are based on the assumption of spatially independent data.
Simply stated, the problem is how to best predict missing, autocorrelated data.
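The loss of variability under mean substitution can be illustrated with a short sketch. The transect values are hypothetical and are not taken from any inventory, and the function name mean_impute is an assumption made for illustration only.

```python
from statistics import mean, pvariance

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical stand heights (m) along a transect; None marks a missing value.
heights = [10.0, 12.0, None, 18.0, None, 20.0]
observed = [v for v in heights if v is not None]
filled = mean_impute(heights)

# Every gap receives the same value, so the completed series is less variable
# than the observed one and ignores the values at neighbouring sites.
print(filled)
print(pvariance(observed), pvariance(filled))
```

In this small example the gap between the sites measuring 12 m and 18 m is filled with the same value as the gap between 18 m and 20 m, which is exactly the kind of spatial pattern a mean substitute cannot reproduce.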
This paper presents an initial exploration into the missing forestry data
problem given the autocorrelated nature of most spatial ecological data. Space
restrictions preclude a complete discussion of this topic.
SPATIAL STRUCTURE
While the problem outlined above may seem to be a simple statistical
inference problem, it is not. The distributions of natural abiotic features and of
induced biotic features are typically neither uniform nor random. Rather, they tend to be either
aggregated in patches or form gradients or other types of spatial structures
(Legendre 1993).
Why is representation of spatial pattern important in forest and ecological
models? The spatial or temporal structure or pattern of ecosystems is a crucial
element in most ecological theories. Any forest or ecological model based on such
theories must take spatial pattern and the nature of spatial ecological data into
account. Specific characteristics of spatial data can be identified which cause
problems in traditional statistical techniques. There are two main classes of spatial
effects: spatial dependence and spatial heterogeneity (Anselin and Getis 1992,
Cressie and ver Hoef 1993). Only spatial dependence will be considered in this
paper.
Spatial Autocorrelation
What is significant about the patterns found in spatial ecological data?
According to Tobler's first law of geography, everything is related to everything
else, but near things are more related than distant things. This simply means that
the relations within and between variables sampled at spatially proximal locations
will be stronger than those sampled at further distances. Ecological data are
spatially dependent by nature. Spatial dependence is often referred to as spatial
autocorrelation.
A major consequence of spatial autocorrelation is that, for traditional
statistical inference, an autocorrelated sample is not as efficient as an
independent sample of the same size. That is, spatial dependence leads to a loss
of information, information which
is crucial for predicting missing values. Positive spatial autocorrelation may cause
the variance to be underestimated. Negative autocorrelation may inflate estimates
of variance. In either case, predictive equations can become biased and the results
of prediction unreliable.
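One way to make the loss of information concrete is the large-sample approximation for a first-order autoregressive series, in which n observations with lag-one correlation rho carry roughly as much information about the mean as n(1 - rho)/(1 + rho) independent observations. The autoregressive assumption and the function name below are illustrative choices, not part of the original discussion, and spatial dependence in two dimensions is generally more complex.

```python
def effective_sample_size(n, rho):
    """Approximate number of independent observations equivalent to n
    observations from a first-order autoregressive series with lag-one
    correlation rho (large-sample approximation for estimating a mean)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)

# Positive autocorrelation: 100 plots behave like far fewer independent
# plots, so a standard error computed as if n = 100 is too small.
ess_positive = effective_sample_size(100, 0.5)
# Negative autocorrelation has the opposite effect.
ess_negative = effective_sample_size(100, -0.2)
print(ess_positive, ess_negative)
```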
Clearly, some new methods or enhancements to existing statistical methods
of spatial inference are needed to address the case of spatially autocorrelated data
(Antle and Klinkenberg 1995).
MODELING SPATIAL AUTOCORRELATION
As discussed above, many of the basic statistical methods used in ecological
studies are impaired by autocorrelated data. Most, if not all, environmental data
fall into this category (Legendre and Fortin 1989). Prediction methods which can
be used to fill in missing forestry data must account for spatial autocorrelation or
be resistant to its effects. Before these techniques are applied, the nature of the
spatial patterns in the data should be explored. Various tests for autocorrelation
and trends can be used. Once we know that the data are autocorrelated and the nature
of the relation (i.e., either positive or negative), techniques which take this pattern
into account can be applied.
Testing for Autocorrelation
Since the presence of autocorrelation may distort the results of conventional
statistical interpolation techniques, autocorrelation should be tested for prior to
selecting any particular model (i.e., during the exploratory phase of data analysis)
in order to ensure accurate predictions. There are various methods which test for
the presence of spatial autocorrelation in univariate data and for correlation in
multivariate data sets. These are well documented and will not be described here.
Univariate methods include the join count test for nominal data, Geary's c or
Moran's I test for ordinal and interval data, and spectral analysis (see Haining
1990 for a detailed description). Multivariate methods include the Spearman rank
correlation coefficient for ordinal data, the Pearson product-moment correlation coefficient
for interval data, the Mantel test and correlogram (see Haining 1990 or Goodchild
1986 for details).
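As a minimal sketch of two of the univariate statistics named above, Moran's I and Geary's c can be computed directly from a vector of site values and a spatial weight matrix. The four-site transect, the binary adjacency weights, and the function names are assumptions made for illustration; significance testing (for example by permutation) is omitted.

```python
def morans_i(values, weights):
    """Moran's I for values at n sites, given an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)          # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

def gearys_c(values, weights):
    """Geary's c for values at n sites, given an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in weights)
    sq_diff = sum(weights[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    ss = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2.0 * s0)) * sq_diff / ss

# Hypothetical transect of four sites; adjacent sites are neighbours.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
gradient = [1.0, 2.0, 3.0, 4.0]      # similar values side by side
alternating = [1.0, 4.0, 1.0, 4.0]   # dissimilar values side by side

print(morans_i(gradient, w), gearys_c(gradient, w))
print(morans_i(alternating, w), gearys_c(alternating, w))
```

The gradient yields a positive I and a c below one (positive autocorrelation), while the alternating series yields a negative I and a c above one (negative autocorrelation).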
If large scale spatial dependency exists, it may be removed by regression or
model-fitting (e.g., trend surface analysis). Traditional statistical techniques can
then be used without violating their assumptions. However, caution should be
taken in order to avoid removing the determinants of the underlying processes
along with the spatial structure. An alternative to removing spatial dependence is
to modify statistical approaches to take spatial autocorrelation into account. At the
very least, techniques which are resistant to spatial autocorrelational effects should
be used. Techniques which utilize autocorrelational information (to enhance
prediction accuracy) are preferable. First, how do we represent spatial
autocorrelation?
Representing Autocorrelation
Autocorrelation can be represented through structure functions, which help
quantify spatial dependency and partition it along distance intervals.
Geographic coordinates
The spatial structure of variables can be expressed as a linear combination of
geographic coordinates of sample sites (Legendre 1993). For example, a high order
polynomial can be built up using the x and y coordinates of sites (as is done in
trend surface analysis). This information can then be incorporated into traditional
inference models, such as regression models, using partial regression analysis
techniques. The result of this analysis is the separation of the variation of the
target variable into four components: the variance resulting from non-spatial
environmental factors, spatial structure, the interaction of environmental and spatial
factors, and the unexplained variation.
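The coordinate-based representation can be sketched as a least-squares trend surface. The example below fits a polynomial in x and y and returns the residuals, which are what would remain for further modeling once the broad spatial trend is removed. The site coordinates, attribute values, and function names are hypothetical, and the normal equations are solved with a small hand-written routine only to keep the sketch self-contained; the partitioning of variation itself is not shown.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def poly_terms(x, y, order):
    """Monomials x**i * y**j with i + j <= order (the trend surface terms)."""
    return [x ** i * y ** j
            for i in range(order + 1) for j in range(order + 1 - i)]

def fit_trend_surface(xs, ys, zs, order=1):
    """Least-squares polynomial coefficients via the normal equations."""
    rows = [poly_terms(x, y, order) for x, y in zip(xs, ys)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(k)]
    return solve(ata, atz)

def predict(coeffs, x, y, order=1):
    return sum(c * t for c, t in zip(coeffs, poly_terms(x, y, order)))

# Hypothetical sites whose attribute follows the plane z = 2 + 3x - y exactly.
xs = [0.0, 1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0, 1.0]
zs = [2.0, 5.0, 1.0, 4.0, 7.0]
coeffs = fit_trend_surface(xs, ys, zs, order=1)   # order of terms: 1, y, x
residuals = [z - predict(coeffs, x, y) for x, y, z in zip(xs, ys, zs)]
print(coeffs, residuals)
```

A higher value of order adds the quadratic and cubic terms used in higher order surfaces, at the cost of needing more sites than coefficients.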
Geographic distances
Another approach is to use location as one of a set of predictor variables in
a statistical model (Legendre 1993). In this case, the spatial structure of the data
is represented by a matrix of geographic distances between site locations (also
called a proximity matrix). For example, the Euclidean distance can be computed
for all pairs of sites based on their geographic coordinates. This information is
then summarized in a spatial distance matrix. If the remaining environmental
variables can be represented in the form of distance matrices, then these distance
matrices can be compared using some form of correlation analysis (e.g., the
Mantel test, see Legendre and Fortin 1989).
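A simple version of this comparison can be sketched as follows: build a Euclidean distance matrix for the site coordinates and another for an environmental variable, correlate their off-diagonal entries, and judge the result against random relabelings of the sites. The site data, the number of permutations, and the function names are assumptions for illustration; this is a basic permutation form of the Mantel test rather than a full implementation.

```python
import math
import random

def distance_matrix(points):
    """Euclidean distances between all pairs of points (tuples)."""
    return [[math.dist(p, q) for q in points] for p in points]

def upper(m):
    """Entries above the diagonal, one per pair of sites."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def mantel(d1, d2, n_perm=999, seed=0):
    """Correlation between two distance matrices and a one-sided
    permutation p-value obtained by relabeling the sites of d2."""
    rng = random.Random(seed)
    n = len(d1)
    r_obs = pearson(upper(d1), upper(d2))
    idx = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(d1), upper(permuted)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical sites along a line, with an attribute that changes steadily.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
attribute = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
geo = distance_matrix(sites)
env = distance_matrix(attribute)
r_obs, p_value = mantel(geo, env)
print(r_obs, p_value)
```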
Covariation
One of the most common structure functions is the variogram (and the related
covariance function and correlogram). The variogram summarizes the spatial continuity for
all possible pairings of data, for all significant lag distances, by modeling the
average degree of similarity between values as a function of their separation
distance. Variograms can be computed either as averages over all directions or
specific to a particular direction. If there is a trend in the data (i.e., local means
and local variances change as a function of location within the sampling space),
the variogram will include both lag-to-lag variability and the trend variability
(Rossi et al. 1992). In the case of stationary data, the variogram can be
experimentally estimated from observed values. Various statistical models can then
be fitted to the variogram, including exponential, spherical and linear.
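The experimental estimate can be sketched as half the average squared difference between values at pairs of sites falling in each lag class, averaged over all directions. The lag-class rule (rounding the distance to the nearest multiple of the lag width), the transect data, and the parameterization of the spherical model (nugget, partial sill, and range) are assumptions made for this illustration; conventions differ among texts and software.

```python
import math

def empirical_variogram(points, values, lag_width):
    """Return {lag class k: semivariance}, where class k groups site pairs
    whose separation distance rounds to k * lag_width (all directions)."""
    sums, counts = {}, {}
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            k = round(math.dist(points[i], points[j]) / lag_width)
            sums[k] = sums.get(k, 0.0) + (values[i] - values[j]) ** 2
            counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / (2.0 * counts[k]) for k in sorted(sums)}

def spherical(h, nugget, sill, rng):
    """Spherical model: rises from the nugget to nugget + sill at distance
    rng and is constant beyond it. Here sill is the partial sill."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + sill
    r = h / rng
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)

# Hypothetical transect with unit spacing and a steady gradient.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]
gamma = empirical_variogram(points, values, lag_width=1.0)
print(gamma)
```

Because these hypothetical values contain a pure trend, the semivariance keeps increasing with lag rather than levelling off, which is the trend contamination described in the text.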
Prediction Utilizing Autocorrelation
Since ecological data are often correlated, techniques which can take
advantage of the information contained in these relationships may be able to best
predict missing values while retaining the variability and covariation structure of
forest attribute data.
Regression models
Traditional linear regression can be extended to situations where spatial
autocorrelation exists by the addition of lagged variables. Griffith (1987) outlines
how spatial autocorrelation can be incorporated into traditional regression models.
Promising results have been obtained from the introduction of an additional
parameter into conventional statistical models that accounts for the latent spatial
autocorrelative structure of the data. This technique has also been extended to
multivariate models (Haining 1990). This approach is one step toward the
integration of spatial dependence (in the form of geographic distance information)
into predictive models.
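A spatially lagged variable is usually taken to be a weighted average of the values at neighbouring sites. The sketch below computes such a lag from a row-standardized weight matrix; the resulting column could then be added to the predictors of a regression. The neighbour definition, the data, and the function names are assumptions for illustration, and the sketch does not reproduce the specific models of Griffith (1987) or Haining (1990).

```python
def row_standardize(weights):
    """Scale each row of a weight matrix so that it sums to one."""
    out = []
    for row in weights:
        total = sum(row)
        out.append([w / total if total else 0.0 for w in row])
    return out

def spatial_lag(values, weights):
    """Weighted average of neighbouring values at each site."""
    w = row_standardize(weights)
    return [sum(wij * v for wij, v in zip(row, values)) for row in w]

# Hypothetical transect of four sites; adjacent sites are neighbours.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
basal_area = [1.0, 2.0, 3.0, 4.0]
lagged = spatial_lag(basal_area, w)
print(lagged)
```

When the lagged variable is the response itself, ordinary least squares is generally not an appropriate estimator, and specialised estimation methods are normally used instead.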
As mentioned above, in partial regression analysis, the spatial component can
be "partialled out" by regressing the spatial factor variables onto each explanatory
variable until only the regression residuals remain (Legendre 1993). The residuals
are then used to model the variable(s) which have missing values. This approach
"explains" much of the variation that is left unaccounted for in simple regression
models.
Kriging
Kriging is a well known geostatistical spatial interpolation technique which
explicitly uses autocorrelational information in its predictions (Rossi et al. 1994).
Since it uses a local estimator and the autocorrelation structure of the data, kriging
is quite unlike traditional methods, such as trend surface analysis, which ignore
autocorrelational effects and use all of the data to estimate the unknown point's
value. In addition, kriging provides variance information which is crucial in
determining the accuracy of predicted values.
Kriging can be applied to both nominal and continuous data. In general,
kriging predicts missing values at a specific site by taking a weighted linear
average of available samples (like regression). It is similar to multiple regression
with a few important differences. First, kriging can produce predicted values that
are either larger or smaller than any of the sample values. Second, kriging utilizes
both distance and geometry among samples, whereas traditional methods use only
distance. Third, kriging attempts to minimize the variance of the expected error by
inferring the variance from an empirical model of degree of spatial dependence
with distance and direction (i.e., the variogram).
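The weighting step can be sketched for ordinary kriging, one common variant, in which the weights are obtained from the variogram values among the samples and between the samples and the target, subject to the weights summing to one. The two-sample data set, the linear variogram, and the function names are assumptions chosen so that the result is easy to check by hand; a practical application would use a variogram model fitted to the data and a local search neighbourhood.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(points, values, target, variogram):
    """Return (prediction, kriging variance) at target."""
    n = len(points)
    # Semivariances among samples, bordered by the unbiasedness constraint.
    a = [[variogram(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    g0 = [variogram(math.dist(p, target)) for p in points]
    sol = solve(a, g0 + [1.0])
    lam, mu = sol[:n], sol[n]          # weights and Lagrange multiplier
    prediction = sum(l * z for l, z in zip(lam, values))
    variance = sum(l * g for l, g in zip(lam, g0)) + mu
    return prediction, variance

def linear_variogram(h):
    return h                            # hypothetical model with unit slope

points = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 20.0]
mid_pred, mid_var = ordinary_kriging(points, values, (1.0, 0.0), linear_variogram)
at_pred, at_var = ordinary_kriging(points, values, (0.0, 0.0), linear_variogram)
print(mid_pred, mid_var, at_pred, at_var)
```

Midway between the two samples the prediction is their average and the kriging variance is positive; at a sample location the prediction reproduces the observed value and the variance falls to zero.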
Most similar neighbor analysis
Most similar neighbor (MSN) analysis is a canonically based technique which
can be used to fill in interval or continuous data values by implicitly utilizing the
spatial structural information of the data. Instead of estimating missing values one-
by-one (for a particular site), the technique "chooses" a most similar site from a
set of sites to act as a stand-in. The surrogate site is chosen on the basis of a
similarity function (derived using canonical correlation analysis).
This approach works well when the variable to be predicted and the predictor
variables have strong functional relationships to each other (Moeur 1995) (as is
the case for spatially autocorrelated data). Because surrogate sites are chosen from
among actual sites, impossible predictions for missing values cannot occur.
However, this means that the sample sites must adequately span the full range of
sites. While Moeur's procedure does not explicitly model spatial correlation, it
attempts to maintain the covariance structure of the multivariate response
attributes.
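The selection step can be sketched as a nearest neighbour search in canonical space. The sketch assumes that canonical coefficients for the predictor variables and the canonical correlations have already been obtained from a canonical correlation analysis carried out elsewhere, and it weights each canonical variate by its squared canonical correlation. That weighting, the data, and all names below are assumptions for illustration and should not be read as a reproduction of Moeur's procedure.

```python
def msn_distance(x_target, x_ref, coeffs, canon_corr):
    """Squared distance between two sites in canonical space.

    coeffs[p][k] is the (assumed, precomputed) canonical coefficient of
    predictor p on canonical variate k; canon_corr[k] is the canonical
    correlation of variate k."""
    diff = [a - b for a, b in zip(x_target, x_ref)]
    total = 0.0
    for k, rho in enumerate(canon_corr):
        score = sum(d * row[k] for d, row in zip(diff, coeffs))
        total += (rho * score) ** 2
    return total

def fill_from_neighbor(x_target, ref_x, ref_y, coeffs, canon_corr):
    """Return (index, attributes) of the most similar reference site."""
    best = min(range(len(ref_x)),
               key=lambda i: msn_distance(x_target, ref_x[i], coeffs, canon_corr))
    return best, ref_y[best]

# Hypothetical reference sites: two predictors each, one measured attribute.
ref_x = [(4.0, 0.0), (9.0, 100.0)]
ref_y = [[250.0], [400.0]]
coeffs = [[1.0, 0.0],
          [0.0, 1.0]]            # placeholder coefficients (identity)
target_x = (5.0, 100.0)          # site whose attribute is missing

# If only the first canonical variate is related to the attributes,
# the second predictor is ignored and the first reference site is chosen.
index_a, filled_a = fill_from_neighbor(target_x, ref_x, ref_y, coeffs, [1.0, 0.0])
# If only the second variate matters, the choice reverses.
index_b, filled_b = fill_from_neighbor(target_x, ref_x, ref_y, coeffs, [0.0, 1.0])
print(index_a, filled_a, index_b, filled_b)
```

Because the filled value is copied from an actual site, it is always a value that was observed somewhere in the reference set.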
DISCUSSION
None of the methods described here for filling in missing forestry data can
be used without careful consideration. They all make assumptions about the nature
of data and the phenomena under study. For specific cases, these assumptions may
or may not hold. The choice of techniques should consider how the information
contained in the observed data can be used, the accuracy of predictions, the
amount of computational effort required, and how flexible they are in allowing
both theoretical and empirical information to be utilized (Haining 1990).
Regression can be extended by incorporating a spatial autocorrelation
parameter and by using different forms of the model with spatially lagged
variables. A fair amount of trial and error is required to determine the best fitting
model for a particular data set. For a particular study, the different regression
models should be compared not only in terms of included and excluded variables
but also in terms of different forms of the model. Interpretive knowledge can be
utilized in the choice of model.
Regression assumes normality and stationarity, which may or may not hold.
While regression is useful for estimating population means and totals, it does not
always preserve the underlying relations among inventory attributes. For example,
if one or more variables with missing values is difficult to predict, the correlation
structure may be distorted. Regression may produce unreliable results for cases of
high unique variances.
While traditional methods of interpolation (e.g., inverse distance weighting,
local sample means) are less accurate, they are also less computationally intensive
than kriging and do not require the subjective estimation of the variogram. Kriging
assumes that the variogram is accurate over the entire study area. This is known
as the stationarity hypothesis and in cases where it does not hold, the results from
kriging will be less accurate. In addition, kriging assumes that the variable being
predicted is multivariate normal. Raw data are rarely univariate normal (Rossi et
al. 1994).
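For comparison, inverse distance weighting, one of the traditional interpolators named above, needs no variogram: each sample is weighted by a power of the inverse of its distance to the target. The sample data, the default power of two, and the function name are assumptions for illustration.

```python
import math

def idw(points, values, target, power=2.0):
    """Inverse distance weighted prediction at target."""
    num = den = 0.0
    for p, z in zip(points, values):
        d = math.dist(p, target)
        if d == 0.0:
            return z                 # target coincides with a sample site
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den

points = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 20.0]
print(idw(points, values, (1.0, 0.0)))   # equidistant: the simple average
print(idw(points, values, (0.5, 0.0)))   # closer to the first sample
```

The weights depend only on distance to the target, so no measure of prediction error is produced and clustered samples are not down-weighted.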
While MSN analysis is a robust approach that implicitly captures spatial
structure, it is sensitive to sample size in that smaller sample sizes reduce the
variance, which in turn affects its predictive function. Results from MSN analysis
are unpredictable if a strong non-linear relation exists among the variables of one
or both sets. Canonical analysis (upon which MSN is based) is adequate for the
case where the joint distribution of variables is linear and continuous. So, in cases
where data are either heterogeneous or exhibit non-linear relations between
variables, MSN will fall short. As well, the linear orthogonal nature of solutions
in canonical analysis may be unrealistic for ecological data where there are often
high unique variances.
An issue which is important but has not been addressed in this paper is that
of the scale of variation. Spatial variation may be conceptualized as being
composed of several (possibly independent) components at several different scales
(Haining 1990). In testing for, representing and modeling spatial autocorrelation,
it is important to identify and account for the contribution from each of the
different scales to the total variation.
CONCLUSIONS
If we rely on prediction models that assume that biological features are
distributed uniformly or randomly in space, then chances of obtaining accurate
predictions of missing values are small, since spatial autocorrelation characterizes
most ecological data. It is often difficult to determine if assumptions have been
violated, making it hard to compensate for such violations or to know how reliable
the newly completed data set is. Consequently, spatial autocorrelation must be
tested for and techniques which either explicitly or implicitly consider spatial
autocorrelation must be used when it is present.
While the different techniques examined above have various advantages,
disadvantages and restrictions, it is impossible to produce a truly meaningful
comparison unless each method is applied to the same set of variables. To date
there have been few studies which have formally compared the various spatial
inference techniques which can be used to fill in missing data values. Given the
increased importance such methods will have in the future, conducting such a
comparative study is certainly warranted.
REFERENCES
Antle, A. N. and Klinkenberg, B. 1995. Statistically-based data generation
techniques: an emerging trend. In: Heit, M., Parker, H. D. and Shortreid, A.
(eds.) GIS Applications in Natural Resources 2, GIS World, Inc.
Anselin, L. and Getis, A. 1992. Spatial statistical analysis and geographic
information systems. The Annals of Regional Science, Springer-Verlag, Vol.
26, pp. 19-33.
Chrisman, N.R. 1991. The error component in spatial data. In: MacGuire D.J.,