This file was created by scanning the printed publication. Errors identified by the software have been corrected; however, some errors may remain.

Data Accuracy to Data Quality: Using spatial statistics to predict the implications of spatial error in point data

Adam Lewis and Michael F. Hutchinson

Abstract.- Data error has received a good deal of attention; however, data error alone is not very informative. To judge the value of a dataset for a specific application, measures of data quality are needed. A quantitative approach to the problem of estimation of data quality for point data is presented, based on the use of geostatistics to characterise the GIS datasets with which the points are to be overlaid. Results demonstrate that the intended application of data is critical to the assessment of data quality, and that the suitability of data for a given application can be assessed with only a basic model of the absolute spatial error associated with the point data.

INTRODUCTION

A substantial body of literature exists on the subject of error in spatial databases. For point data, spatial error can be modelled as a random process. Most simply, a bivariate normal distribution can be used, but other symmetric probability density functions have also been used, e.g. Bolstad et al. (1990). The scaling parameter of the distribution, such as the standard deviation, may be referred to as epsilon. An alternative to the probabilistic model is the deterministic interpretation, in which all points are considered to fall within a distance, epsilon, of the true location. This extends to the epsilon-band model of spatial error in linear features. These interpretations of spatial error are often quoted as a standard in mapping organisations (Goodchild 1988), for instance "90% of points are within 25 metres of the correct location".

Attribute errors arise because thematic maps, DEMs and other spatial datasets are approximations.
The polygon representation of spatial variability maps homogeneous units with distinct boundaries; however, it is well understood that for natural resource data these polygons are heterogeneous. Where continuous spatial variation is represented on a grid or lattice, as with a DEM, there is a residual error in the model, which may be expressed as the RMS error.

Both spatial and attribute errors are, typically, spatially autocorrelated; the error in one location is dependent on the errors nearby. This reduces the consequences of error in some circumstances - digital contours do not cross randomly although they may be within each other's epsilon-band - but complicates the modelling of errors. The probabilistic point epsilon model of spatial errors cannot be extended to lines without complication, and models of attribute error require simulation of autocorrelated processes (Goodchild et al. 1992).

Data error versus data quality

Few studies have focussed on the practical implications of errors in spatial data. Ultimately the question is not the absolute error associated with a dataset, but whether the datasets available are of adequate quality to support a given use. This paper develops the relationship between locational accuracy and data quality, the ultimate aim being to provide a basis from which to make informed statements on accuracy.

DATA SOURCES AND METHODS

The analysis is limited to spatial error of point locations using a simple error model. The consequence of spatial error for the outcome of overlay with datasets representing continuous variables is analysed. The suitability of the point data for overlay with a particular variable is assessed using the expected $R^2$ value between the outcome of the overlay given the spatial error, and the outcome of the overlay if the points were accurately placed. This provides a measure of the quality of the point dataset for a particular use.
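The quality measure described above, the $R^2$ between values read at the mapped locations and values read at the true locations, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and none of it comes from the study data: the function names, the synthetic `surface`, the 5000 metre extent and the error standard deviations are all hypothetical.

```python
import math
import random


def r_squared(z_true, z_mapped):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(z_true)
    mean_t = sum(z_true) / n
    mean_m = sum(z_mapped) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(z_true, z_mapped))
    var_t = sum((t - mean_t) ** 2 for t in z_true)
    var_m = sum((m - mean_m) ** 2 for m in z_mapped)
    return cov * cov / (var_t * var_m)


def surface(x, y):
    """Hypothetical smooth surface standing in for a GIS layer (not real data)."""
    return 100.0 * math.sin(x / 500.0) + 50.0 * math.cos(y / 300.0)


def overlay_r_squared(sigma, n_points=500, seed=0):
    """R^2 between overlay values at true points and at displaced points.

    Each point is displaced by independent normal errors in x and y with
    standard deviation `sigma` (metres), an assumed isotropic error model.
    """
    rng = random.Random(seed)
    z_true, z_mapped = [], []
    for _ in range(n_points):
        x0, y0 = rng.uniform(0, 5000), rng.uniform(0, 5000)
        x1, y1 = x0 + rng.gauss(0, sigma), y0 + rng.gauss(0, sigma)
        z_true.append(surface(x0, y0))
        z_mapped.append(surface(x1, y1))
    return r_squared(z_true, z_mapped)


if __name__ == "__main__":
    for sigma in (10.0, 100.0, 400.0):
        print(sigma, round(overlay_r_squared(sigma), 3))
```

In this sketch the same point set is reused for every value of `sigma` (fixed seed), so differences in the printed values reflect the size of the positional error rather than a change of sample.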
Quadrat locations (source1) were recorded in a floristic database (Turner and Mueck 1992) in decimal degrees, calculated from map coordinates read from 1:100,000 scale topographic maps. Source1 coordinate error is assumed to be due to incorrect transcription of points onto maps, incorrect reading of coordinates from maps, and incorrect transcription of coordinate values.

Quadrat locations (source2) were also marked onto 1:25,000 topographic maps of dubious cartographic lineage. It is assumed that the quadrat locations are marked correctly onto the topographic maps relative to local stream, ridge and road features identified on the map. Source2 coordinate error is assumed to be due to stretches and shifts in the low quality map base.

Source3 data were compiled by a separate process which avoids the sources of error associated with source1 and source2. These are considered to be the true quadrat locations. Given the cartographic scale of the maps and the limits on the accuracy with which a point can be located on a map from a field survey, a planimetric error of 20 metres is regarded as very good.

Figure 1. Location of study quadrats.

A terrain model (15 metre cell size) was interpolated from 1:25,000 contours (10 metre contour interval) for the study area using ANUDEM (Hutchinson 1988, 1989). Two attributes of the terrain model, slope and elevation, were estimated at each quadrat location using the GIS. Slope was estimated using the method of Burrough (1986) as implemented in ARCINFO.

INITIAL RESULTS

The spatial error associated with the points from source1 is illustrated in figure 2. Extreme values are not shown. Source2 has similar errors.

[Figure 2 panel title: Source1: magnitude and direction of errors in x and y.]

Figure 2. Scatter diagram of the horizontal error in x and y observed in quadrat locations from source1.
The diagram indicates that errors in x and y tend to be negatively correlated, suggesting an anisotropic error distribution. The mechanisms behind this are unknown.

The importance of the spatial error in terms of inducing errors in the estimates of elevation and slope is illustrated by figure 3, which shows deviations from the correct values of slope and elevation. Figure 3 clearly illustrates that for a given degree of spatial error the implications of the error are greater for estimates of slope than for elevation. The results given in figure 3 can be summarised using $R^2$ values for slope and elevation between the true locations and the mapped locations as follows:

                                Slope (percent rise)    Elevation (metres)
Point locations from source1    0.181                   0.952
Point locations from source2    0.318                   0.976

MODELLING THE IMPORTANCE OF SPATIAL ERRORS

Spatial error is present when $x_0 \neq x_1$, $x_0$ being the true place of observation and $x_1$ the location mapped. Let $f(x)$ be a function denoting the probability density of the event $x = x_1$, and $Z(x_1)$ be the actual observation (of slope, insolation, soil depth), while $Z(x_0)$ is the correct observation (unknown). The question then is: when $x_0$ is estimated by $x_1$, what is the expected outcome of observations of some spatial variable, $Z(x_1)$? We can state the following:

$$E[Z(x_1)] = \int Z(x)\, f(x)\, dx = \int Z(x_0 + \mathbf{h})\, f(x_0 + \mathbf{h})\, d\mathbf{h} \tag{1}$$

where $\mathbf{h}$ is a vector such that $x = x_0 + \mathbf{h}$. This is the same form of equation identified by Allen and Starr (1982) as corresponding to observation of $Z(x)$ at a coarser scale, with the scaling function weights being determined by $f(x_0 + \mathbf{h})$. From this we can say that spatial error in point locations has similar consequences for the expected value of observation as would viewing at a broader scale.

[Figure 3 panel titles: Errors in slope from source1 observations; Errors in elevation from source1 observations; Errors in elevation from source2 observations. Reference line: y = x.]

Figure 3.
Plots of estimates of elevation and slope from quadrat locations from source1 and source2, against the correct estimate. Departures from 'y=x' indicate errors in the estimate due to quadrat location errors. Relatively large errors are introduced into estimates of slope compared with elevation, irrespective of the source of the quadrat error (source1 vs. source2).

The type of variable observed, rather than the specific amount of spatial error in the quadrats, appears to dictate whether the errors introduced will be substantial or negligible. In qualitative terms, the results become fuzzy. Statistically, the correlation between the observed value $Z(x_1)$ and the true value $Z(x_0)$ will drop below 1 unless the spatial error is zero. The severity of the error will be indicated by the $R^2$ value between the observed value, $Z(x_1)$, and the true value, $Z(x_0)$.

For a known spatial error, $\mathbf{h}$, the covariance, $\mathrm{cov}(\mathbf{h})$, between the imperfect observation, taken at location $x_1 = x_0 + \mathbf{h}$, and the true observation, taken at $x_0$, is

$$\mathrm{cov}(\mathbf{h}) = E\left[\left(Z(x_0 + \mathbf{h}) - \mu_Z\right)\left(Z(x_0) - \mu_Z\right)\right] \tag{2}$$

The correlation coefficient $R(\mathbf{h})$ between $Z(x_1)$ and $Z(x_0)$ for a known error $\mathbf{h}$ satisfies

$$R(\mathbf{h})^2 = \left(\mathrm{cov}(\mathbf{h}) / \sigma^2\right)^2 \tag{3}$$

where $R(\mathbf{h})^2$ is commonly interpreted as the proportion of the variance in $Z(x_0)$ accounted for by $Z(x_1)$. The covariance is related to the semivariance by $\gamma(\mathbf{h}) = \sigma^2 - \mathrm{cov}(\mathbf{h})$ (assuming that $Z$ is a stationary function and that the error model is isotropic), so equation 3 can be re-stated as

$$R(\mathbf{h})^2 = \left(1 - \gamma(\mathbf{h}) / \sigma^2\right)^2 \tag{4}$$

To progress from the simple case of known error $\mathbf{h}$ to a probability model of error $f(\mathbf{h})$, we calculate the expected value of $R^2$ simply as:

$$E[R^2] = \int f(\mathbf{h}) \left(1 - \gamma(\mathbf{h}) / \sigma^2\right)^2 d\mathbf{h} \tag{5}$$

This gives a direct mathematical expression to estimate the importance of spatial error in a given situation, requiring only a model of the spatial error of the point observations and a correlogram or variogram of the spatial variable $Z$.
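The relationship in equation 4 between the semivariance at a known lag and the resulting $R^2$ can be illustrated with a small sketch. The exponential variogram, its sill and its range values are assumptions chosen purely for illustration; they are not the variograms estimated in this study, and the function names are hypothetical. The sill is taken to equal the variance $\sigma^2$, consistent with the stationarity assumption.

```python
import math


def exponential_variogram(h, sill, range_m):
    """Assumed variogram model: gamma(h) = sill * (1 - exp(-h / range_m))."""
    return sill * (1.0 - math.exp(-h / range_m))


def r_squared_at_lag(h, sill, range_m):
    """Equation 4, R(h)^2 = (1 - gamma(h) / sigma^2)^2, with sigma^2 = sill."""
    gamma = exponential_variogram(h, sill, range_m)
    return (1.0 - gamma / sill) ** 2


if __name__ == "__main__":
    # A short-range variable (rough surface) against a long-range one
    # (smooth surface), both evaluated at the same known error magnitudes.
    for h in (0.0, 50.0, 200.0):
        rough = r_squared_at_lag(h, sill=1.0, range_m=150.0)
        smooth = r_squared_at_lag(h, sill=1.0, range_m=3000.0)
        print(h, round(rough, 3), round(smooth, 3))
```

Under this assumed model the same positional error costs far more $R^2$ for the short-range variable than for the long-range one, which is the qualitative pattern reported for slope and elevation.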
The semivariance, $\gamma(\mathbf{h})$, can be thought of as that part of the total variance which is introduced at lag $\mathbf{h}$. Equation 5 also illustrates the relationship between the magnitude of the spatial error and the inherent scaling of the variable in determining the importance of the spatial error, discussed earlier. Clearly if $f(\mathbf{h})$ is high only where $\gamma(\mathbf{h})$ is near 0, the error is less important than if $f(\mathbf{h})$ is high over a wide range of $\gamma(\mathbf{h})$, including where $\gamma(\mathbf{h})$ approaches $\sigma^2$. Forms of $f(\mathbf{h})$ which either have a non-zero mean error, $E[\mathbf{h}] \neq 0$, or which are multi-modal, with a peak some distance from $x_0$ (e.g. Bolstad et al. 1990), will tend to introduce more significant errors.

Prediction of the errors in elevation and slope introduced through spatial error in coordinate locations

Equation 5 demonstrates that information on the distribution of spatial errors in the point data and a knowledge of the variogram of $Z$ are required to estimate $R^2$, giving a direct assessment of the data quality of the points for observation of a given variable $Z$. If isotropic variograms and error models are assumed, equation 5 simplifies to

$$E[R^2] = \int f(h)\, R(h)^2\, dh = \int f(h) \left(1 - \gamma(h) / \sigma^2\right)^2 dh \tag{6}$$

where $h = |\mathbf{h}|$. In the following section this method is applied to the datasets used here. The Excel software was used to manipulate data and to apply mathematical formulae, while variograms were estimated using software written by the author. The continuous functions indicated by equation 6 were approximated using discrete forms with an interval of 40 metres (h = 0, 40, ..., 8000).

Models of spatial error

An isometric error model was adopted. Normal and lognormal distributions were fitted to observed errors using the method of moments (figure 4). The parametric models enable the standard deviation to be used as a parameter to vary the model, allowing sensitivity testing.
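The discrete approximation of equation 6 at a 40 metre interval can be sketched as follows. This is a minimal illustration under stated assumptions, not the spreadsheet calculation used in the study: the error magnitude is assumed to follow a normal distribution truncated at zero (a half-normal), the variogram is an assumed exponential model, and all names and parameter values are hypothetical.

```python
import math


def half_normal_cdf(h, sigma):
    """CDF of a zero-mean normal truncated to h >= 0 and rescaled to reach 1."""
    if h <= 0.0:
        return 0.0
    return math.erf(h / (sigma * math.sqrt(2.0)))


def make_exponential_variogram(sill, range_m):
    """Return an assumed variogram gamma(h) = sill * (1 - exp(-h / range_m))."""
    def gamma(h):
        return sill * (1.0 - math.exp(-h / range_m))
    return gamma


def expected_r_squared(gamma, variance, sigma, step=40.0, h_max=8000.0):
    """Discrete form of equation 6 over lags h = 0, step, ..., h_max.

    Each lag h is weighted by the probability that the error magnitude
    falls in the bin centred on h; the first bin covers 0 to step / 2.
    """
    weight0 = half_normal_cdf(step / 2.0, sigma) - half_normal_cdf(0.0, sigma)
    total = weight0 * (1.0 - gamma(0.0) / variance) ** 2
    n_steps = int(round(h_max / step))
    for k in range(1, n_steps + 1):
        h = k * step
        weight = (half_normal_cdf(h + step / 2.0, sigma)
                  - half_normal_cdf(h - step / 2.0, sigma))
        total += weight * (1.0 - gamma(h) / variance) ** 2
    return total


if __name__ == "__main__":
    rough = make_exponential_variogram(sill=1.0, range_m=150.0)
    smooth = make_exponential_variogram(sill=1.0, range_m=3000.0)
    for sigma in (40.0, 160.0, 320.0):
        print(sigma,
              round(expected_r_squared(rough, 1.0, sigma), 3),
              round(expected_r_squared(smooth, 1.0, sigma), 3))
```

Varying `sigma` while holding the variogram fixed gives a simple sensitivity test of the kind the parametric error models are intended to support.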
The assumption of an isometric normal model, $f(h)$, has the further advantage of requiring only one parameter, $\sigma$. The convention of epsilon = $2\sigma$ is adopted; thus, in terms of an epsilon model of error, we can state that 95% of the data points are expected to lie within epsilon metres of the mapped point.

[Figure 4 panels: cumulative distribution against horizontal error (metres), for source1 (left) and source2 (right).]

Figure 4. Plots of the cumulative distribution of horizontal error for source1 quadrats (left) and source2 quadrats (right), and normal and lognormal probability models fitted using the method of moments. An isometric error distribution is assumed. For source1, the normal distribution was fitted after exclusion of extreme values. Neither probability model deals with source1 very well; however, the normal distribution was chosen in preference to the lognormal.

ASSESSING DATA QUALITY FOR SLOPE AND ELEVATION

The discrete version of equation 6 was used to calculate the expected $R^2$ value for true and estimated values of slope and elevation from source1 and source2 for the values of epsilon given in Table 1.

$$E[R^2] = \sum_{h = \Delta, 2\Delta, \ldots, 8000} \left[F(h + \Delta/2) - F(h - \Delta/2)\right] \left[1 - \gamma(h)/\sigma^2\right]^2 + \left[F(\Delta/2) - F(0)\right] \left[1 - \gamma(0)/\sigma^2\right]^2 \tag{7}$$

where: $F(h)$ is the cumulative normal probability distribution with standard deviation of epsilon/2, truncated to exclude negative values ($F(0) = 0$) and scaled to maintain $F(\infty) = 1$; $h$ is the magnitude of the spatial error, and the lag distance for $\gamma(h)$; $\Delta$ is an interval (40 m); $\sigma^2$ is the observed variance of $Z$ (slope or elevation, as appropriate).

Using equation 7, the predicted $R^2$ values for the observations of slope and elevation, from source1 and source2, were calculated and compared with the observed values (table 1).

Table 1. Modelled and observed (in brackets) correlations ($R^2$) between observations of elevation and slope from true data point locations and data point locations containing spatial error. Epsilon = $2\sigma$.
                        Point locations from source1    Point locations from source2
                        (epsilon = 384)                 (epsilon = 320)
Slope (percent rise)    0.303 (0.181)                   0.343 (0.318)
Elevation (metres)      0.957 (0.952)                   0.967 (0.976)

In most practical situations, $\gamma(h)$ is readily estimated for datasets in a GIS, but the precise nature of $F(h)$ will be unknown. To develop some idea of the sensitivity of data quality to $f(h)$, $E[R^2]$ was calculated for a wide range of epsilon. Results are shown in figure 5, from which it is clear that the modelled $R^2$ values for elevation from source1 and source2 are closely predicted. $R^2$ for slope is accurately predicted for source2, but for source1 is optimistic, reflecting the extreme values not catered for by the normal distribution (figure 4).

[Figure 5 panels: predicted and observed R-square values for elevation; predicted and observed R-square values for slope. Horizontal axis: epsilon.]

Figure 5. Predictions of $R^2$ value for a wide range of error bands, compared with observed $R^2$ values.

The contrast between figures 5 and 6 shows that the key factor determining the $R^2$ between correct and incorrect values of elevation and slope is not the spatial error in the observation points, but the inherent spatial structure of the variables being observed. Thus, even the very coarse models of spatial error used here are sufficient to model quite accurately the quality of the quadrat data sources for overlay with other spatial data themes.

DISCUSSION

These results have wide-ranging practical value. They suggest that the detail of the probability error model, $f(x)$, is less important in determining data quality than the inherent properties of the surfaces being interrogated. Only the main features of $f(x)$ seem to be important. The results also suggest an approach to management of GIS error which integrates spatial and attribute error.
If the value of a variable is estimated from interrogation of a data-surface $Z$, where $Z$ is a model with residual (attribute) error $\sigma_{eZ}^2$, the variance of $Z$, at any given place, due to spatial and attribute error can be calculated as the expected semivariance plus the variance of the residual error of the model.

$$\sigma_Z^2 = E[\gamma_Z] + \sigma_{eZ}^2, \quad \text{where} \quad E[\gamma_Z] = \int f_Z(h)\, \gamma_Z(h)\, dh \tag{8}$$

in which $f_Z(h)$ is the pdf of the spatial errors in variable $Z$, while $\gamma_Z(h)$ is the semivariance of variable $Z$, readily estimated from the data-surface $Z$. Applying equation 8 to the DEM used here, assuming isometric normally distributed spatial errors with epsilon = 25, and an RMS of 5 ($\sigma_{eZ}^2 = 25$), gives $E[\gamma_Z] = 3.7$, $\sigma_Z^2 = 3.7 + 25$. Thus (for elevation) the influence of spatial error is small compared with residual error.

REFERENCES

Allen, T F H, Starr, T B (1982) Hierarchy: perspectives for ecological complexity. University of Chicago Press.

Bolstad, P V, Gessler, P, Lillesand, T M (1990) Positional uncertainty in manually digitised map data. International Journal of Geographical Information Systems. 4 399

Burrough, P A (1986) Principles of Geographical information systems for land resources assessment. Clarendon Press. Oxford.

Goodchild, M F (1988) The issue of accuracy in global databases. In Mounsey, H, Tomlinson, R F (eds.) Building databases for global science. Proceedings of the first meeting of the International Geographical Union Global Database Planning Project. Tylney Hall, Hampshire, UK, May 1988. Taylor and Francis. London & New York, 1988.

Goodchild, M F, Guoqing, S, Shiren, Y (1992) Development and test of an error model for categorical data. International Journal of Geographical Information Systems. 6 87-104

Hutchinson, M F (1988) Calculation of hydrologically sound digital elevation models. Proceedings of the Third International Symposium on Spatial Data Handling. Sydney, Australia.
Hutchinson, M F (1989) A new procedure for gridding elevation and stream line data with automatic removal of spurious pits. Journal of Hydrology 106 211-232

Turner, L A, Mueck, S G (1992) The vegetation of the Sardine, Rich and Ellery forest blocks, Orbost region, Victoria. Silvicultural Systems Project Technical Report number 9. Department of Conservation and Environment, Melbourne, Australia.

BIOGRAPHICAL SKETCH

Adam Lewis is the Senior GIS Scientist in the Natural Resource Systems Branch (NRS) of the Department of Conservation and Natural Resources, based in Melbourne, Australia. He has experience in Forest and Land Management, and recently completed a PhD at the Australian National University. Michael Hutchinson is a Senior Fellow at the Australian National University whose primary interest is the spatial and temporal analysis of physical environmental data.