This file was created by scanning the printed publication.
Errors identified by the software have been corrected;
however, some errors may remain.
Data Accuracy to Data Quality:
Using spatial statistics to predict the implications of
spatial error in point data
Adam Lewis and Michael F. Hutchinson
Abstract.- Data error has received a good deal of attention; however, data
error alone is not very informative. To judge the value of a dataset for a
specific application, measures of data quality are needed.
A quantitative approach to the problem of estimation of data quality for
point data is presented, based on the use of geostatistics to characterise
the GIS datasets with which the points are to be overlaid.
Results demonstrate that the intended application of data is critical to the
assessment of data quality, and that the suitability of data for a given
application can be assessed with only a basic model of the absolute spatial
error associated with the point data.
INTRODUCTION
A substantial body of literature exists on the subject of error in spatial
databases.
For point data, spatial error can be modelled as a random process. Most
simply a bivariate normal distribution can be used, but other symmetric probability
density functions have also been used, e.g. Bolstad et al. (1990). The scaling
parameter of the distribution, such as the standard deviation, may be referred to as
epsilon. An alternative to the probabilistic model is the deterministic interpretation,
in which all points are considered to fall within a distance, epsilon, of the true
location. This extends to the epsilon-band model of spatial error in linear features.
These interpretations of spatial error are often quoted as a standard in mapping
organisations (Goodchild 1988), for instance "90% of points are within 25 metres
of the correct location".
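The probabilistic interpretation can be tied to a quoted mapping standard with a short sketch. It assumes independent normal errors in x and y with a common standard deviation (a circular bivariate normal, so the radial error follows a Rayleigh distribution); the function names are illustrative and do not come from the paper.

```python
import math

def fraction_within(distance, sigma):
    """Fraction of points whose radial error is <= distance.

    Assumes independent normal errors in x and y with a common standard
    deviation sigma, so the radial error is Rayleigh distributed.
    """
    return 1.0 - math.exp(-distance ** 2 / (2.0 * sigma ** 2))

def sigma_for_standard(distance, fraction):
    """Per-axis sigma implied by 'fraction of points lie within distance'."""
    return distance / math.sqrt(-2.0 * math.log(1.0 - fraction))

# The quoted standard: 90% of points within 25 metres.
sigma = sigma_for_standard(25.0, 0.90)
print(round(sigma, 2))
print(round(fraction_within(25.0, sigma), 3))
```

Under a different error model (anisotropic, or with heavy tails) the same quoted standard would imply a different scale parameter.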
Attribute errors arise because thematic maps, DEMs and other spatial datasets
are approximations. The polygon representation of spatial variability maps
homogeneous units with distinct boundaries; however, it is well understood that for
natural resource data these polygons are heterogeneous. Where continuous spatial
variation is represented on a grid or lattice, as with a DEM, there is a residual error
in the model, which may be expressed as the RMS error.
Both spatial and attribute errors are, typically, spatially autocorrelated; the
error in one location is dependent on the errors nearby. This reduces the
consequences of error in some circumstances - digital contours do not cross
randomly although they may be within each other's epsilon-band - but complicates
the modelling of errors. The probabilistic point epsilon model of spatial errors
cannot be extended to lines without complication, and models of attribute error
require simulation of autocorrelated processes (Goodchild et al. 1992).
Data error versus data quality
Few studies have focussed on the practical implications of errors in spatial
data. Ultimately the question is not the absolute error associated with a dataset,
but whether the datasets available are of adequate quality to support a given use.
This paper develops the relationship between locational accuracy and data
quality, the ultimate aim being to provide a basis from which to make informed
statements on accuracy.
DATA SOURCES AND METHODS
The analysis is limited to spatial error of point locations using a simple error
model. The consequence of spatial error on the outcome of overlay with datasets
representing continuous variables is analysed. The suitability of the point data for
overlay with a particular variable is assessed using the expected $R^2$ value between
the outcome of overlay given the spatial error, and the outcome of the overlay if
the points were indeed accurately placed. This provides a measure of the quality of
the point dataset for a particular use.
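As a minimal sketch of the quality measure, the squared correlation between values sampled at the true and at the mapped locations can be computed as follows. The arrays are hypothetical values for five quadrats, not data from the study.

```python
import numpy as np

def r_squared(true_values, observed_values):
    """Squared Pearson correlation between two equal-length samples."""
    r = np.corrcoef(true_values, observed_values)[0, 1]
    return r ** 2

# Hypothetical values of one variable at five quadrats.
z_true = np.array([120.0, 135.0, 150.0, 180.0, 210.0])    # at true locations
z_mapped = np.array([118.0, 140.0, 149.0, 176.0, 215.0])  # at mapped locations
print(r_squared(z_true, z_mapped))
```

A value near 1 indicates that the positional error has little effect on the overlay for that variable.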
Quadrat locations (source1) were recorded in a floristic database (Turner and
Mueck 1992) in decimal degrees, calculated from map coordinates read from
1:100,000 scale topographic maps. Source1 coordinate error is assumed to be due
to incorrect transcription of points onto maps, incorrect reading of coordinates from
maps, and incorrect transcription of coordinate values.
Quadrat locations (source2) were also marked onto 1:25,000 topographic
maps of dubious cartographic lineage. It is assumed that the quadrat locations are
marked correctly onto the topographic maps relative to local stream, ridge and
road features identified on the map. Source2 coordinate error is assumed to be due
to stretches and shifts in the low quality map base.
Source3 data were compiled by a separate process which avoids the sources
of error associated with source1 and source2. These are considered to be the true
quadrat locations. Given the cartographic scale of the maps and the limits on the
accuracy with which a point can be located on a map from a field survey, a
planimetric error of 20 metres is regarded as very good.
Figure 1. Location of study quadrats.
A terrain model (15 metre cell size) was interpolated from 1:25,000 contours
(10 metre contour interval) for the study area using ANUDEM (Hutchinson 1988,
1989). Two attributes of the terrain model, slope and elevation, were estimated at
each quadrat location using the GIS. Slope was estimated using the method of
Burrough (1986) as implemented in ARCINFO.
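For readers who want to reproduce a slope surface, the following is a minimal sketch that uses central differences on a gridded elevation model. It is an assumption for illustration and is not necessarily the algorithm used in ARCINFO; the function name and the synthetic grid are hypothetical.

```python
import numpy as np

def slope_percent(dem, cell_size):
    """Slope in percent rise from a gridded elevation model.

    dem is a 2-D array of elevations in metres and cell_size is the grid
    spacing in metres. Gradients come from numpy.gradient, which uses
    central differences in the interior and one-sided differences at edges.
    """
    dz_dy, dz_dx = np.gradient(dem, cell_size)
    return 100.0 * np.hypot(dz_dx, dz_dy)

# Synthetic check: a plane rising 1.5 m per 15 m cell in x (a 10% slope).
cols = np.arange(6) * 15.0
dem = np.tile(0.1 * cols, (5, 1))
print(slope_percent(dem, 15.0))
```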
INITIAL RESULTS
The spatial error associated with the points from Source1 is illustrated in
figure 2. Extreme values are not shown. Source2 has similar errors.
[Figure 2 panel: "Source1: magnitude and direction of errors in x and y"; x-axis: horizontal error (metres) in x.]
Figure 2. Scatter diagram of the horizontal error in x and y observed in quadrat
locations from source1. The diagram indicates that errors in x and y tend to be
negatively correlated, suggesting an anisotropic error distribution. The mechanisms
behind this are unknown.
The importance of the spatial error in terms of inducing errors in the estimates
of elevation and slope is illustrated by figure 3, which shows deviations from the
correct values of slope and elevation. Figure 3 clearly illustrates that for a given
degree of spatial error the implications of the error are greater for estimates of
slope than for elevation.
The results given in figure 3 can be summarised using $R^2$ values for slope and
elevation between the true locations and the mapped locations as follows:

                               Slope (percent rise)   Elevation (metres)
Point locations from Source1   0.181                  0.952
Point locations from Source2   0.318                  0.976
MODELLING THE IMPORTANCE OF SPATIAL ERRORS
Spatial error is present when $x_0 \neq x_1$, $x_0$ being the true place of observation,
and $x_1$ the location mapped. Let $f(x)$ be a function denoting the probability density
of the event $x = x_1$ and $Z(x_1)$ be the actual observation (of slope, insolation, soil
depth), while $Z(x_0)$ is the correct observation (unknown).
The question, then, is: when $x_0$ is estimated by $x_1$, what is the expected
outcome of observations of some spatial variable, $Z(x_1)$?
We can state the following:
$$E[Z(x_1)] = \int Z(x)\,f(x)\,dx = \int Z(x_0 + h)\,f(x_0 + h)\,dh \qquad (1)$$
where $h$ is a vector such that $x = x_0 + h$.
This is the same form of equation identified by Allen and Starr (1982) as
corresponding to observation of $Z(x)$ at a coarser scale, with the scaling function
weights being determined by $f(x_0 + h)$. From this we can say that spatial error in
point locations has similar consequences for the expected value of observation as
would viewing at a broader scale.
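The smoothing interpretation of equation 1 can be illustrated along a single transect. This is a sketch under stated assumptions: a one-dimensional transect, a hypothetical three-point error distribution, and an illustrative function name.

```python
import numpy as np

def expected_observation(z, weights):
    """Discrete form of equation 1 along a transect.

    z       : values of Z at evenly spaced positions
    weights : probabilities f(h) for offsets h = -k..k (summing to 1)
    Returns the expected observed value at each interior position.
    """
    weights = np.asarray(weights, dtype=float)
    # For a symmetric f, convolution equals the weighted average over offsets.
    return np.convolve(z, weights, mode="valid")

z = np.array([0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0])  # a narrow ridge
f = np.array([0.25, 0.5, 0.25])                      # hypothetical error pdf
print(expected_observation(z, f))
```

The expected value at the ridge is halved and its neighbours are raised, which is the same effect as observing the ridge at a coarser scale.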
[Figure 3 panels: "Errors in slope from Source1", "Errors in elevation from Source1", "Errors in elevation from Source2"; observations are plotted against the reference line y = x.]
Figure 3. Plots of estimates of elevation and slope from quadrat locations from source1
and source2, against the correct estimate. Departures from 'y=x' indicate errors in the
estimate due to quadrat location errors. Relatively large errors are introduced into
estimates of slope cf. elevation, irrespective of the source of the quadrat error (source1 vs.
source2). The type of variable observed, rather than the specific amount of spatial error in
the quadrats, appears to dictate whether the errors introduced will be substantial or
negligible.
In qualitative terms, the results become fuzzy. Statistically, the correlation
between the observed value $Z(x_1)$ and the true value $Z(x_0)$ will drop below 1
unless the spatial error is zero. The severity of the error will be indicated by the $R^2$
value between the observed value, $Z(x_1)$, and the true value, $Z(x_0)$.
For a known spatial error, $h$, the covariance, $\mathrm{cov}(h)$, between the imperfect
observation, taken at location $x_1 = x_0 + h$, and the true observation, taken at $x_0$, is
$$\mathrm{cov}(h) = E[(Z(x_0 + h) - \mu_Z)(Z(x_0) - \mu_Z)] \qquad (2)$$
where $\mu_Z$ is the mean of $Z$.
The correlation between $Z(x_1)$ and $Z(x_0)$ for a known error $h$ is given, in
squared form, by
$$[R(h)]^2 = \left(\mathrm{cov}(h) / \sigma^2\right)^2 \qquad (3)$$
where $\sigma^2$ is the variance of $Z$ and $[R(h)]^2$ is commonly interpreted as the
proportion of the variance in $Z(x_0)$ accounted for by $Z(x_1)$.
The covariance is related to the semivariance by $\gamma(h) = \sigma^2 - \mathrm{cov}(h)$ (assuming
that $Z$ is a stationary function and that the error model is isotropic), so equation 3
can be re-stated as
$$[R(h)]^2 = \left(1 - \gamma(h) / \sigma^2\right)^2 \qquad (4)$$
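Equation 4 can be explored numerically. The sketch below assumes a hypothetical exponential variogram model with no nugget (the paper does not specify a model) and compares a variable with short-range structure against one with long-range structure for the same known displacement.

```python
import math

def exponential_semivariance(h, sill, practical_range):
    """Hypothetical exponential variogram model (no nugget)."""
    return sill * (1.0 - math.exp(-3.0 * h / practical_range))

def r_squared_known_error(gamma_h, sill):
    """Equation 4: squared correlation for a known displacement h."""
    return (1.0 - gamma_h / sill) ** 2

h = 100.0  # a known displacement in metres
for name, rng in [("short-range variable", 300.0), ("long-range variable", 6000.0)]:
    gamma = exponential_semivariance(h, sill=1.0, practical_range=rng)
    print(name, round(r_squared_known_error(gamma, 1.0), 3))
```

The same displacement costs far more correlation for the short-range variable, which is the pattern reported for slope relative to elevation.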
To progress from the simple case of known error $h$ to a probability model of
error $f(h)$ we calculate the expected value of $[R(h)]^2$ using simply:
$$E[R^2] = \int f(h)\,\left(1 - \gamma(h) / \sigma^2\right)^2\,dh \qquad (5)$$
This gives a direct mathematical expression to estimate the importance of
spatial error in a given situation, requiring only a model of the spatial error of the
point observations and a correlogram or variogram of the spatial variable $Z$. The
semivariance, $\gamma(h)$, can be thought of as that part of the total variance which is
introduced at lag $h$.
Equation 5 also illustrates the relationship between the magnitude of the
spatial error and the inherent scaling of the variable in determining the importance
of the spatial error, discussed earlier. Clearly if $f(h)$ is high only where $\gamma(h)$ is near
0, the error is less important than if $f(h)$ is high over a wide range of $\gamma(h)$, including
where $\gamma(h)$ approaches $\sigma^2$. Forms of $f(h)$ which either have a non-zero mean error,
$E[h] \neq 0$, or which are multi-modal, with a peak some distance from $x_0$ (e.g. Bolstad
et al. 1990), will tend to introduce more significant errors.
Prediction of the errors in elevation and slope introduced through spatial
error in coordinate locations.
Equation 5 demonstrates that information on the distribution of spatial errors
in the point data and a knowledge of the variogram of $Z$ are required to estimate
$R^2$, giving a direct assessment of the data quality of the points, for observation of
a given variable $Z$.
If isotropic variograms and error models are assumed, equation 5 simplifies to
$$E[R^2] = \int f(h)\,R(h)^2\,dh = \int f(h)\,\left(1 - \gamma(h) / \sigma^2\right)^2\,dh \qquad (6)$$
where $h = |h|$.
In the following section this method is applied to the datasets used here. The
Excel software was used to manipulate data and to apply mathematical formulae,
while variograms were estimated using software written by the author. The
continuous functions indicated by equation 6 were approximated using discrete
forms with an interval of 40 metres (h = 0, 40, ..., 8000).
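The variogram software used by the authors is not described. As a minimal sketch of how an empirical semivariogram might be estimated from a gridded surface, the function below pools squared differences along rows and columns, which treats the surface as isotropic. The function name and the synthetic surface are assumptions for illustration.

```python
import numpy as np

def empirical_semivariogram(grid, cell_size, max_lag_cells):
    """Semivariance at lags of 1..max_lag_cells cells along rows and columns.

    gamma(h) = 0.5 * mean[(Z(x + h) - Z(x))^2], pooling both grid directions.
    Returns the lag distances and the corresponding semivariances.
    """
    lags, gammas = [], []
    for k in range(1, max_lag_cells + 1):
        d_rows = grid[k:, :] - grid[:-k, :]
        d_cols = grid[:, k:] - grid[:, :-k]
        squared = np.concatenate([d_rows.ravel() ** 2, d_cols.ravel() ** 2])
        lags.append(k * cell_size)
        gammas.append(0.5 * squared.mean())
    return np.array(lags), np.array(gammas)

# Synthetic, spatially autocorrelated surface (not the study DEM).
rng = np.random.default_rng(0)
surface = rng.normal(size=(50, 50)).cumsum(axis=0).cumsum(axis=1)
lags, gammas = empirical_semivariogram(surface, cell_size=15.0, max_lag_cells=5)
print(lags, gammas)
```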
Models of spatial error
An isometric error model was adopted. Normal and lognormal distributions
were fitted to observed errors using the method of moments (figure 4).
The parametric models enable the standard deviation to be used as a
parameter to vary the model, allowing sensitivity testing. The assumption of an
isometric normal model, $f(h)$, has the further advantage of requiring only one
parameter, $\sigma$. The convention of epsilon = $2\sigma$ is adopted, thus in terms of an
epsilon model of error we can state that 95% of the data points are expected to lie
within epsilon metres of the mapped point.
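The method of moments fits can be sketched as below. The paper does not give its fitting details, so this is an assumption: the normal model takes the sample mean and standard deviation, and the lognormal parameters are chosen so that the model mean and variance match the sample values. The error magnitudes are hypothetical.

```python
import math
import statistics

def fit_normal_moments(errors):
    """Method of moments for a normal model: sample mean and std dev."""
    return statistics.fmean(errors), statistics.stdev(errors)

def fit_lognormal_moments(errors):
    """Method of moments for a lognormal model.

    Matches the sample mean m and variance v:
    sigma_log^2 = ln(1 + v / m^2),  mu_log = ln(m) - sigma_log^2 / 2.
    """
    m = statistics.fmean(errors)
    v = statistics.variance(errors)
    sigma_log_sq = math.log(1.0 + v / m ** 2)
    return math.log(m) - sigma_log_sq / 2.0, math.sqrt(sigma_log_sq)

# Hypothetical horizontal error magnitudes in metres (not the paper's data).
errors = [35.0, 60.0, 90.0, 120.0, 150.0, 210.0, 400.0]
mean, sd = fit_normal_moments(errors)
mu_log, sigma_log = fit_lognormal_moments(errors)
print(mean, sd, mu_log, sigma_log)
```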
[Figure 4 panels: cumulative distribution against horizontal error (metres), one panel per source.]
Figure 4. Plots of the cumulative distribution of horizontal error for source1 quadrats
(left) and source2 quadrats (right); and normal and lognormal probability models fitted
using the method of moments. An isometric error distribution is assumed. For source1, the
normal distribution was fitted after exclusion of extreme values. Neither probability model
deals with source1 very well; however, the normal distribution was chosen in preference to
the lognormal.
ASSESSING DATA QUALITY FOR SLOPE AND ELEVATION
The discrete version of equation 6 was used to calculate the expected $R^2$ value
for true and estimated values of slope and elevation from source1 and source2 for
the values of epsilon given in Table 1.
$$E[R^2] = \sum_{h = \Delta, 2\Delta, \ldots, 8000} \left[F(h + \Delta/2) - F(h - \Delta/2)\right] \left[1 - \gamma(h) / \sigma^2\right]^2 + \left[F(\Delta/2) - F(0)\right] \left[1 - \gamma(0) / \sigma^2\right]^2 \qquad (7)$$
where: $F(h)$ is the cumulative normal probability distribution with standard
deviation of epsilon/2, truncated to exclude negative values ($F(0) = 0$) and scaled
to maintain $F(\infty) = 1$; $h$ is the magnitude of the spatial error, and the lag distance
for $\gamma(h)$; $\Delta$ is an interval (40 m); $\sigma^2$ is the observed variance of $Z$ (slope or
elevation, as appropriate).
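The discrete sum of equation 7 can be sketched in a few lines. The truncated and rescaled normal distribution follows the definition of F(h) given above; the exponential variogram, its parameters, and all function names are hypothetical stand-ins, since the fitted variograms are not reported.

```python
import math

def truncated_normal_cdf(h, sigma):
    """F(h) for the magnitude of a zero-mean normal error.

    Negative values are excluded and the result rescaled, so F(0) = 0 and
    F(h) tends to 1; this equals erf(h / (sigma * sqrt(2))).
    """
    if h <= 0.0:
        return 0.0
    return math.erf(h / (sigma * math.sqrt(2.0)))

def expected_r_squared(epsilon, semivariance, variance, step=40.0, max_lag=8000.0):
    """Discrete form of equation 7, with sigma = epsilon / 2."""
    sigma = epsilon / 2.0
    # Bin at h = 0 covers error magnitudes between 0 and step / 2.
    weight0 = truncated_normal_cdf(step / 2.0, sigma) - truncated_normal_cdf(0.0, sigma)
    total = weight0 * (1.0 - semivariance(0.0) / variance) ** 2
    for i in range(1, int(round(max_lag / step)) + 1):
        h = i * step
        weight = (truncated_normal_cdf(h + step / 2.0, sigma)
                  - truncated_normal_cdf(h - step / 2.0, sigma))
        total += weight * (1.0 - semivariance(h) / variance) ** 2
    return total

def example_variogram(h, sill=100.0, practical_range=1500.0):
    """Hypothetical exponential variogram, for illustration only."""
    return sill * (1.0 - math.exp(-3.0 * h / practical_range))

for eps in (100.0, 320.0, 384.0, 1000.0):
    print(eps, round(expected_r_squared(eps, example_variogram, variance=100.0), 3))
```

Swapping in a variogram estimated from the surface of interest gives the predicted $R^2$ for any assumed epsilon.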
Using equation 7 the predicted $R^2$ values for the observations of slope and
elevation, from source1 and source2, were calculated and compared with the
observed values (table 1).
Table 1. Modelled and observed (in brackets) correlations ($R^2$) between observations
of elevation and slope from true data point locations and data point locations containing
spatial error. Epsilon = $2\sigma$.

                                              Elevation (metres)   Slope (percent rise)
Point locations from Source1. Epsilon = 384   0.957 (0.952)        0.303 (0.181)
Point locations from Source2. Epsilon = 320   0.967 (0.976)        0.343 (0.318)
In most practical situations, $\gamma(h)$ is readily estimated for datasets in a GIS, but
the precise nature of $F(h)$ will be unknown. To develop some idea of the sensitivity
of data quality to $f(h)$, $E[R^2]$ was calculated for a wide range of epsilon. Results are
shown in figure 5, from which it is clear that the modelled $R^2$ values for elevation
from source1 and source2 are closely predicted. $R^2$ for slope is accurately predicted
for source2, but for source1 is optimistic, reflecting the extreme values not catered
for by the normal distribution (figure 4).
[Figure 5 panels: "Predicted and observed R-square values for elevation" and "Predicted and observed R-square values for slope"; x-axis: epsilon.]
Figure 5. Predictions of $R^2$ value for a wide range of error bands, compared with
observed $R^2$ values.
The contrast between figures 5 and 6 shows that the key factor determining the
$R^2$ between correct and incorrect values of elevation and slope is not the spatial
error in the observation points, but the inherent spatial structure of the variables
being observed.
Thus, even the very coarse models of spatial error used here are sufficient to
quite accurately model the quality of the quadrat data sources for overlay with
other spatial data themes.
DISCUSSION
These results have wide-ranging practical value. They suggest that the detail
of the probability error model, $f(x)$, is less important in determining data quality
than the inherent properties of the surfaces being interrogated. Only the main
features of f(x) seem to be important.
The results also suggest an approach to management of GIS error which
integrates spatial and attribute error. If the value of a variable is estimated from
interrogation of a data-surface $Z$, where $Z$ is a model with residual (attribute) error
$\sigma^2_{eZ}$, the variance of $Z$, at any given place, due to spatial and attribute error can
be calculated as the expected semivariance, plus the variance of the residual error
of the model.
$$\sigma^2_Z = E[\gamma_Z] + \sigma^2_{eZ}, \quad \text{where } E[\gamma_Z] = \int f_Z(h)\,\gamma_Z(h)\,dh \qquad (8)$$
In which $f_Z(h)$ is the pdf of the spatial errors in variable $Z$, while $\gamma_Z(h)$ is the
semivariance of variable $Z$, readily estimated from the data-surface $Z$.
Applying equation 8 to the DEM used here, assuming isometric normally
distributed spatial errors with epsilon = 25, and an RMS of 5 ($\sigma^2_{eZ} = 25$), gives
$E[\gamma_Z] = 3.7$, $\sigma^2_Z = 3.7 + 25$. Thus (for elevation) the influence of spatial error is
small compared with residual error.
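The arithmetic of this worked case can be checked with a short sketch. The inputs are the figures quoted above; the function name is illustrative, and the share of variance is a derived quantity that the paper does not report.

```python
def total_variance(expected_semivariance, residual_variance):
    """Equation 8: variance at a point from spatial plus attribute error."""
    return expected_semivariance + residual_variance

rms = 5.0                     # residual (attribute) error of the DEM
residual_variance = rms ** 2  # 25
expected_gamma = 3.7          # E[gamma_Z] quoted for epsilon = 25
sigma_sq = total_variance(expected_gamma, residual_variance)
spatial_share = expected_gamma / sigma_sq
print(round(sigma_sq, 1), round(spatial_share, 3))
```

On these figures the spatial error contributes roughly an eighth of the total variance in elevation.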
REFERENCES
Allen, T F H, Starr, T B (1982) Hierarchy: perspectives for ecological complexity.
University of Chicago Press.
Bolstad, P V, Gessler, P, Lillesand, T M (1990) Positional uncertainty in manually
digitised map data. International Journal of Geographical Information Systems 4:
399
Burrough, P A (1986) Principles of geographical information systems for land
resources assessment. Clarendon Press, Oxford.
Goodchild, M F (1988) The issue of accuracy in global databases. In Mounsey, H,
Tomlinson, R F (eds.) Building databases for global science. Proceedings of
the first meeting of the International Geographical Union Global Database
Planning Project. Tylney Hall, Hampshire, UK, May 1988. Taylor and Francis,
London & New York.
Goodchild, M F, Guoqing, S, Shiren, Y (1992) Development and test of an error
model for categorical data. International Journal of Geographical Information
Systems 6: 87-104
Hutchinson, M F (1988) Calculation of hydrologically sound digital elevation
models. Proceedings of the Third International Symposium on Spatial Data
Handling. Sydney, Australia.
Hutchinson, M F (1989) A new procedure for gridding elevation and stream line
data with automatic removal of spurious pits. Journal of Hydrology 106: 211-232
Turner, L A, Mueck, S G (1992) The vegetation of the Sardine, Rich and Ellery
forest blocks, Orbost region, Victoria. Silvicultural Systems Project Technical
Report number 9. Department of Conservation and Environment, Melbourne,
Australia.
BIOGRAPHICAL SKETCH
Adam Lewis is the Senior GIS Scientist in the Natural Resource Systems
Branch (NRS) of the Department of Conservation and Natural Resources, based in
Melbourne, Australia. He has experience in Forest and Land Management, and
recently completed a PhD at the Australian National University.
Michael Hutchinson is a Senior Fellow at the Australian National University
whose primary interest is the spatial and temporal analysis of physical
environmental data.