Experimental Development of a Model of Vector Data Uncertainty

G.J. Hunter¹, B. Hock², M. Robey³ and M.F. Goodchild⁴

¹ Assistant Professor, Department of Geomatics, The University of Melbourne, Victoria, Australia
² Scientist, Forest Research Institute, Rotorua, New Zealand
³ PhD student, Department of Geomatics, The University of Melbourne, Victoria, Australia
⁴ Director, National Center for Geographic Information & Analysis, University of California, Santa Barbara, CA

Abstract. In a recent paper by Hunter and Goodchild (1996), a model of vector data uncertainty was proposed and its conceptual design and likely manner of implementation were discussed. The model allows for probabilistic distortion of point, line and polygon features through the creation of separate, random horizontal positional error fields in the x and y directions. These are overlaid with the vector data so as to apply coordinate shifts to all nodes and vertices to establish new versions of the original data set. By studying the variation in the family of outputs derived from the distorted input data, an assessment may be made of the uncertainty associated with the final product. This paper is a continuation of that initial work and discusses the experimental development undertaken thus far to implement the model in practice.

INTRODUCTION

In a recent paper by Hunter and Goodchild (1996), a model of vector data uncertainty was proposed and its conceptual design and likely manner of implementation and application were discussed. This paper is a continuation of that initial work and discusses the experimental development since undertaken to implement the model in practice.

In the context of this research, we suggest there is a clear distinction to be made between 'error' and 'uncertainty'. The former implies that some degree of knowledge has been attained about the differences between actual results or observations and the truth to which they pertain. 'Uncertainty', on the other hand, conveys the fact that it is the lack of such knowledge which is responsible for hesitancy in accepting those same results or observations without caution, and the term 'error' is often used when it would be more appropriate to use 'uncertainty'.

The model that has been developed can be defined as a stochastic process capable of generating a population of distorted versions of the same reality (such as a map), with each version being a sample from the same population. The traditional Gaussian model (in which the mean of the population estimates the true value and the standard deviation is a measure of variation in the observations) is one attempt at describing error, but it is global in nature and says nothing about local variations or the processes by which error may have accumulated. The model applied here is viewed as an advance on that approach, since it has the ability to show spatial variation in uncertainty, and the capability to include in its realizations the probable effects of error propagation resulting from the subsequent algorithms and processes that have been applied to the data. This latter point is particularly important, since most software vendors do not divulge the algorithms used in their packages for commercial reasons, which prevents formal mathematical error propagation analysis from being undertaken.
By studying different versions of the products created by the model, it is possible to see how differences in output are affected by variations in input. Elsewhere in these proceedings, a grid-cell version of the model (which is at a more advanced stage of development) has been described and applied to assess the uncertainty associated with landslide susceptibility indices derived from a simple slope stability model (Murillo and Hunter, 1996).

The vector data model involves the creation of two independent, normally distributed, random error grids in the x and y directions. When combined, these grids provide the two components of a set of simulated positional error vectors regularly distributed throughout the region of the data set to be perturbed (Figure 1). The assumptions made by the authors are firstly that the error has a circular normal distribution, and secondly that its x and y components are independent of each other. The grids are generated with a mean and standard deviation equal to the data producer's estimate of positional error in the data set to be perturbed (a prerequisite for application of the model). These error estimates might come, for example, from the residuals at control points reported during digitiser setup, or from existing data quality statements such as those that now accompany many spatial data sets.

Figure 1. The model of vector data uncertainty uses normally distributed, random error grids in the x and y directions to produce a distorted version of the original data.

By overlaying the two grids with the vector data to be distorted, x and y positional shifts can be applied to the coordinates of each node and vertex in the data set to create a new, but equally probable, version of it. Thus, the probabilistic coordinates of a point are considered to be (x + error, y + error). With the distorted version of the data, the user then applies the same set of procedures as before to create the final product, and by using a number of these distorted data sets the uncertainty residing in the end product can be assessed. Alternatively, several different data sets may be separately distorted (each on the basis of its own error estimate) prior to being combined to assess final output uncertainty. While the model requires an initial error estimate for creation of the two distortion grids, it is the resultant uncertainty arising from the use of perturbed data (in conjunction with the spatial operations that are subsequently applied) which is under investigation, hence its label as an 'uncertainty' model.
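In essence the procedure is a simple Monte Carlo simulation: the same analysis is repeated on many perturbed versions of the input, and the spread of the resulting outputs is examined. As a minimal illustration of this idea only (the authors' implementation uses Arc/Info grids, as described in the next section), the following Python sketch perturbs the vertices of a single polygon with independent, normally distributed x and y shifts and summarizes the variation in its area; the function names and the choice of area as the derived output are illustrative assumptions.

```python
# Minimal Monte Carlo sketch of the approach (illustrative only; the paper's
# implementation uses Arc/Info error grids rather than direct perturbation).
import numpy as np

def polygon_area(coords):
    """Shoelace formula for a closed polygon given as an (n, 2) array."""
    x, y = coords[:, 0], coords[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def perturb(coords, sigma, rng):
    """Apply independent, normally distributed x and y shifts (mean 0, std sigma)."""
    return coords + rng.normal(0.0, sigma, size=coords.shape)

rng = np.random.default_rng(42)
square = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0]])
sigma = 5.0    # the data producer's positional error estimate (metres)

areas = [polygon_area(perturb(square, sigma, rng)) for _ in range(1000)]
print(f"area: mean = {np.mean(areas):.1f}, std = {np.std(areas):.1f}")
```

The spread of the derived areas, rather than any single realization, is what characterizes the uncertainty in the product.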
DEVELOPMENT OF THE MODEL

Choosing the Error Grid Spacing

As discussed in Hunter and Goodchild (1996), the first step required to implement the model is to determine an appropriate error grid spacing. If it is too large, the nodes and vertices of small features in the source data will receive similar-sized shifts in x and y during perturbation and the process will not be random, giving rise to unwanted local autocorrelation between shifts (Figure 2). Conversely, if it is too small, processing time can increase dramatically as additional grid points are needlessly processed.

The authors suggest an appropriate spacing be selected from one of the following options: (1) the standard deviation of the horizontal positional error for the source data; (2) a distance equal to 0.5 mm at source map scale where the data has been digitized; or (3) a threshold value smaller than the user would care to consider given the nature of the data to be processed. Thus far, the authors have found the standard deviation (as supplied by the data producer) to be a useful value to apply, since it tends to be smaller than (2) and therefore more conservative, and of similar magnitude to practical estimates of (3).

Figure 2. If the error grid spacing (d) is too large, then unwanted local autocorrelation between shifts may occur in small features.

Generating the Initial Error Grids

Using the development diagram in Figure 3 as a guide for the remaining discussion, the next step is to generate the x and y error grids. The Arc/Info and Arc Grid software modules were used as the development platform, and the commands referred to in this paper apply to those packages. To ensure the grids completely cover the extent of the source data, the dimensions of the grids were pre-determined by setting a window equal to the data set's dimensions (SETWINDOW command) and the cell size equivalent to the chosen grid spacing (SETCELL command). Two grids (named accordingly for the x and y directions) were created automatically using these parameters and then populated with randomly placed, normally distributed values having a mean of zero and a standard deviation defined by the data producer (NORMAL command). It should be noted here that selection of the standard deviation as the grid spacing does not affect the random population of the grids. The two grids are only temporary and will require further refinement before being used to perturb the vector data.

Figure 3. Development diagram for the uncertainty model.

Reduction of Error Grid Size

As a means of minimizing processing time, the number of cells in the error grids needs to be reduced, since unless the vector data set is extremely dense there will be many cells that are not required during operation. To achieve this, the original vector data are converted to grid format (POINTGRID or LINEGRID commands) to form a temporary masking grid which only contains 'live' cells that the source data either lie within or pass through. Note that polygons are also processed by LINEGRID, since only their boundary strings need be perturbed. The cell attribute that is passed during rasterization is unimportant, since the grid will only be used for masking purposes. All other (non-contributing) cells are deliberately given a NODATA value.
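For readers unfamiliar with the Arc Grid commands referred to above, the grid-generation and masking steps can be sketched as follows. The array layout, cell-size handling and helper names are assumptions made for illustration and do not reproduce the Arc/Info implementation; NaN values stand in for NODATA cells.

```python
# Sketch of the initial error grids and the masking step (illustrative
# assumptions: a row/column grid aligned with the data extent, numpy arrays in
# place of Arc Grid layers, and rasterization of the line work done elsewhere).
import numpy as np

def make_error_grids(xmin, ymin, xmax, ymax, spacing, sigma, rng):
    """Create x and y error grids covering the data extent (cf. SETWINDOW,
    SETCELL and NORMAL), populated with N(0, sigma) shifts."""
    ncols = int(np.ceil((xmax - xmin) / spacing)) + 1
    nrows = int(np.ceil((ymax - ymin) / spacing)) + 1
    x_err = rng.normal(0.0, sigma, size=(nrows, ncols))
    y_err = rng.normal(0.0, sigma, size=(nrows, ncols))
    return x_err, y_err

def apply_mask(err_grid, live_cells):
    """Keep shift values only where the rasterized source data lies or passes
    through (cf. POINTGRID/LINEGRID); all other cells become NODATA (NaN)."""
    masked = np.full_like(err_grid, np.nan)
    masked[live_cells] = err_grid[live_cells]
    return masked

rng = np.random.default_rng(1)
x_err, y_err = make_error_grids(0, 0, 3000, 2000, spacing=30.0, sigma=20.0, rng=rng)
live = np.zeros(x_err.shape, dtype=bool)
live[10, 20:40] = True     # cells a (hypothetical) line feature passes through
x_err_masked = apply_mask(x_err, live)
```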
However, a problem arises here with using a masking grid that contains only single cells (representing point data) or one-cell-wide strings (representing lines or polygons). As mentioned in Hunter and Goodchild (1996), there is the possibility that the magnitude and direction of adjacent x and y error grid shifts may cause them to overlap each other, resulting in possible loss of topological integrity in the original data when they are applied to it (such as the transposition of adjacent vertices causing unwanted loops).

The solution involves filtering any such 'offending' pairs of shifts, which in turn requires that the masking grid be extended for a distance of at least five standard deviations either side of the original data, given that, with the normal distribution, it is rare for a value to occur at a distance greater than this from the mean. This buffering of the masking grid permits neighboring error grid shifts to be used in a subsequent filtering process. To achieve this, the grid spacing is compared with the standard deviation of the data and the equivalent number of cells is calculated (remembering that the grid spacing will not always equal the standard deviation). For example, using a standard deviation of 20 m and a grid spacing of 30 m, the correct number of cells equals ⌊(100/30) + 1⌋ = 4 (with the extra cell and the truncation to an integer ensuring the answer is 'rounded up'). Using this value, a spread function is applied to the masking grid (EUCDISTANCE command). A new masking grid is created during the process, and any (previously) NODATA cells affected by the operation are automatically returned to active status. The masking grid is then overlaid with the initial error grids to provide reduced versions of the x and y grids that contain only the shift values lying within the required regions. Finally, the error grids are expanded by the width of a cell on all sides (and given NODATA values) to support later processing (SELECTBOX command).

Filtering the Error Grids

To preserve topological integrity in the source data upon distortion, some means of adjustment must be introduced to control the magnitude of positional shifts between neighbouring points in the x and y error grids. As previously mentioned, if this does not occur there is a likelihood that neighbouring points in the original data set may be transposed in position, causing unwanted 'fractures' in the feature structure (Figure 4a). Intuitively, these modifications should not occur and are considered unacceptable.

Figure 4. In (a), uncontrolled shifts between neighbouring error grid points can cause unwanted 'fractures' or transposition of features. In (b), 'fractures' occur when the difference between neighbouring shifts is larger than their separation (d). In (c), a 'fracture' is circled, requiring filtering on the basis of neighbouring shifts.

Originally, the authors considered using spatial autocorrelation to constrain the x and y shifts, but this would have had the effect of unnecessarily altering all the shifts in the error grids when only a relatively small number needed to be adjusted in practice. Accordingly, a localised filter was seen as a better solution. A routine was therefore developed to test the difference between consecutive cells (in horizontal or row sequence for the x grid, and vertical or column sequence for the y grid) to determine whether the absolute value of the difference between them was greater than their separation distance (Figure 4b). If so, a 'fracture' is possible at that location and a filter must be applied to average out the shift values on the basis of their neighbors. The procedure is iterative and proceeds until no 'fractures' exist in the error grids. For example, given consecutive shifts Δx₀, Δx₁, Δx₂ and Δx₃ in any row of the x error grid (and testing the difference between the two middle shifts), there is potential for a 'fracture' to occur if |Δx₁ − Δx₂| > d (where d is the grid spacing). The adjusted values of Δx₁ and Δx₂ are then computed by equation (1), which averages the offending shifts on the basis of their neighbouring values.
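This detection-and-averaging pass can be sketched as below for the x error grid (the y grid is treated identically along columns). Because equation (1) is not reproduced in this text, the particular averaging rule shown, which pulls each offending shift toward its outer neighbour, is an illustrative assumption rather than the authors' filter.

```python
# Sketch of the 'fracture' filter for the x error grid (row direction). The
# averaging rule (replace each offending shift with the mean of itself and its
# outer neighbour) is an assumption for illustration only; the paper's
# equation (1) is not reproduced here. NaN cells stand in for NODATA.
import numpy as np

def filter_fractures(x_err, d, max_iter=100):
    """Iteratively smooth pairs of consecutive row values whose absolute
    difference exceeds the grid spacing d, until no 'fractures' remain."""
    grid = x_err.copy()
    for _ in range(max_iter):
        changed = False
        for i in range(grid.shape[0]):
            for j in range(grid.shape[1] - 1):
                a, b = grid[i, j], grid[i, j + 1]
                if np.isnan(a) or np.isnan(b):
                    continue                    # NODATA: leave values unaltered
                if abs(a - b) > d:              # potential 'fracture'
                    left = grid[i, j - 1] if j > 0 else np.nan
                    right = grid[i, j + 2] if j + 2 < grid.shape[1] else np.nan
                    # Pull each offending shift toward its outer neighbour where
                    # one exists; otherwise retain the original large shift (as
                    # the paper does for perimeter cells).
                    new_a = a if np.isnan(left) else 0.5 * (a + left)
                    new_b = b if np.isnan(right) else 0.5 * (b + right)
                    if new_a != a or new_b != b:
                        grid[i, j], grid[i, j + 1] = new_a, new_b
                        changed = True
        if not changed:
            break
    return grid
```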
The justification for the filtering is that although the shifts are normally distributed, the choice of that distribution is only an assumption and we should accept there may be faults in its selection as a theoretical model. For example, statisticians developed the human Intelligence Quotient (IQ) using a normal distribution with a mean of 100 and a standard deviation of 20, yet the distribution is deliberately truncated at zero to disallow negative values, which intuitively should not occur even though they will be predicted by the model. In a similar manner, the filtering of neighboring error shifts to avoid 'fractures' is simply a truncation of the distribution to cater for a minority of outlying values.

The filtering was implemented by developing a routine that moves progressively through each error grid and tests the difference between each pair of cell values (DOCELL command). If the condition for a 'fracture' is not met, the values are written to an update grid and the next pair is tested. If the condition is met, the filter is applied and the new values are computed and passed to the update grid. In practice, it has been found that shifts several standard deviations in size can cause several iterations of the process to be required; however, the procedure still converges fairly rapidly.

An early problem encountered was the situation where a large shift existed in a perimeter row or column of the error grid, requiring an extra cell value (which did not exist) to be input to the filter equation. It was found that without the placement of an extra row or column on each boundary of the grid, the absence of a cell caused the filter process to substitute a NODATA value for the cell being adjusted. This was unacceptable, and the placement of an extra perimeter of cells around the error grid (with NODATA values) ensured that the original large shift was retained unaltered, which was considered preferable to it being discarded. While this implies that some nodes and vertices may receive shifts to new positions outside the original extent of the data set, this decision remains consistent with the original concept of distorting their values to form alternative, but equally probable, versions of the data.

At the time of writing, the experimental development of the model has not proceeded beyond this stage and the following discussion remains conceptual.

Calculation of Data Point Shifts

In the final step of the model's development, values in the error grids must be transferred to the data set being perturbed. Inevitably, it is expected that few if any nodes or vertices in the source data will coincide exactly with the error grid points, and a method is required for calculating x and y shifts based on the neighboring values in the grid. A simple bilinear interpolation procedure (Watson, 1992) is proposed, in which the x and y shifts assigned to each point are calculated on the basis of the respective shifts of the four surrounding grid points (Figure 5). An ASCII feature file containing the identifier and coordinates of each data point can be automatically derived (UNGENERATE command), and the four surrounding error shifts will be determined for each point and used to interpolate the correct values to be applied.

Figure 5. The x and y shifts for a data point not coinciding with the error grid are calculated using bilinear interpolation based on the four surrounding grid values.
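The proposed interpolation can be sketched as follows; the grid origin and spacing arguments, and the assumption that each error grid is stored as a row/column array aligned with the data extent, are made for illustration only and do not describe the intended Arc/Info implementation.

```python
# Sketch of bilinear interpolation of an error shift for a data point that does
# not coincide with the error grid (illustrative layout assumptions: row 0 at
# ymin, column 0 at xmin, square cells of size 'spacing').
import numpy as np

def interpolate_shift(err_grid, x, y, xmin, ymin, spacing):
    """Return the shift at (x, y) interpolated from the four surrounding grid values."""
    col = (x - xmin) / spacing
    row = (y - ymin) / spacing
    c0, r0 = int(np.floor(col)), int(np.floor(row))
    c1 = min(c0 + 1, err_grid.shape[1] - 1)
    r1 = min(r0 + 1, err_grid.shape[0] - 1)
    tx, ty = col - c0, row - r0                 # fractional position within the cell
    # Weighted average of the four corner shifts.
    return ((1 - tx) * (1 - ty) * err_grid[r0, c0]
            + tx * (1 - ty) * err_grid[r0, c1]
            + (1 - tx) * ty * err_grid[r1, c0]
            + tx * ty * err_grid[r1, c1])

# Usage: perturb one vertex with interpolated x and y shifts.
# x_new = x + interpolate_shift(x_err, x, y, xmin=0.0, ymin=0.0, spacing=30.0)
# y_new = y + interpolate_shift(y_err, x, y, xmin=0.0, ymin=0.0, spacing=30.0)
```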
The distorted coordinates are then written to a new file and, when all records are processed, the file topology is rebuilt. Clearly, the need may arise for high performance computers to be employed for this task, particularly when many thousands of nodes and vertices may have to be perturbed, and current research into GIS and supercomputing (such as that described in Armstrong, 1994) is being investigated by the authors.

CONCLUSIONS

This paper has described the experimental development of an uncertainty model for vector data, which operates by taking an input data set of point, line or polygon features and then applying simulated positional error shifts in the x and y directions to calculate new coordinates for each node and vertex. In effect this produces a distorted, but equally probable, representation of the data set that can be used to create a family of alternative outputs, usually in map form. Assessment of the variation in the outputs can be used to provide an estimate of the uncertainty residing in them, based on the error in the source data and its propagation through the subsequent algorithms and processes employed.

At this stage of the research, software has been written that automates the creation of the error grids and filters their values to avoid conditions that would cause loss of topological integrity in the original data. The software also masks the input data to ensure redundant values in the error grids are not unnecessarily processed. Optimisation of the process is still required to achieve faster operation. The next stage of the project involves development of an algorithm which enables each node and vertex in the input data to be allocated the correct x and y shift from the error grids, yielding the perturbed version of the source data. The final stage entails testing of the working model, refinement of its operation, and assessment of its strengths and limitations.

ACKNOWLEDGMENTS

The authors wish to acknowledge the support given to this project by the New Zealand Forest Research Institute, the Australian Research Council, and the Australian Federal Department of Industry, Science & Technology (under the Bilateral Science & Technology Collaboration Scheme). The National Center for Geographic Information & Analysis is supported by the National Science Foundation, grant SBR 88-10917, and this research constitutes part of the Center's Research Initiative 7 on Visualisation of Spatial Data Quality.

REFERENCES

Armstrong, M.P. 1994. "GIS and High Performance Computing". Proceedings of the GIS/LIS '94 Conference, Phoenix, Arizona, pp. 4-13.

Hunter, G.J. and Goodchild, M.F. 1996. "A New Model for Handling Vector Data Uncertainty in GIS". Urban and Regional Information Systems Journal, vol. 8, no. 1, 12 pp.

Murillo, M. and Hunter, G.J. 1996. "Modeling Slope Stability Uncertainty: A Case Study at the H.J. Andrews Long-Term Ecological Research Site, Oregon". Proceedings of the 2nd International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Modeling, Fort Collins, Colorado, 10 pp.

Watson, D.F. 1992. Contouring: A Guide to the Analysis and Display of Spatial Data (Pergamon/Elsevier Science: New York).