Experimental Development of a Model
of Vector Data Uncertainty
G.J. Hunter1, B. Hock2, M. Robey3 and M.F. Goodchild4

1 Assistant Professor, Department of Geomatics, The University of Melbourne, Victoria, Australia
2 Scientist, Forest Research Institute, Rotorua, New Zealand
3 PhD Student, Department of Geomatics, The University of Melbourne, Victoria, Australia
4 Director, National Center for Geographic Information & Analysis, University of California, Santa Barbara, CA
Abstract.-In a recent paper by Hunter and Goodchild (1996), a model
of vector data uncertainty was proposed and its conceptual design and
likely manner of implementation were discussed. The model allows for
probabilistic distortion of point, line and polygon features through the
creation of separate, random horizontal positional error fields in the x and
y directions. These are overlaid with the vector data so as to apply
coordinate shifts to all nodes and vertices to establish new versions of the
original data set. By studying the variation in the family of outputs
derived from the distorted input data, an assessment may be made of the
uncertainty associated with the final product. This paper is a continuation of that initial work and discusses the experimental development
undertaken thus far to implement the model in practice.
INTRODUCTION
In a recent paper by Hunter and Goodchild (1996), a model of vector data
uncertainty was proposed and its conceptual design and likely manner of
implementation and application were discussed. This paper is a continuation of
that initial work and discusses the experimental development since undertaken to
implement the model in practice. In the context of this research, we suggest there
is a clear distinction to be made between 'error' and 'uncertainty', since the
former implies that some degree of knowledge has been attained about differences
between actual results or observations and the truth to which they pertain. On the
other hand, 'uncertainty' conveys the fact that it is the lack of such knowledge
which is responsible for hesitancy in accepting those same results or observations
without caution; indeed, the term 'error' is often used when it would be more
appropriate to use 'uncertainty'.
The model that has been developed can be defined as a stochastic process
capable of generating a population of distorted versions of the same reality (such
as a map), with each version being a sample from the same population. The
traditional Gaussian model (where the mean of the population estimates the true
value and the standard deviation is a measure of variation in the observations) is
one attempt at describing error, but it is global in nature and says nothing about
local variations or the processes by which error may have accumulated.
The model applied here is viewed as an advance on that approach, since it can
show spatial variation in uncertainty and can include in its realizations the
probable effects of error propagation resulting from the
subsequent algorithms and processes that have been applied to the data. This
latter point is particularly important, since most software vendors do not divulge
the algorithms used in their packages for commercial reasons-which prevents
formal mathematical error propagation analysis from being undertaken. By
studying different versions of the products created by the model, it is possible to
see how differences in output are affected by variations in input. Elsewhere in
these proceedings, a grid-cell version of the model (which is at a more advanced
stage of development) has been described and applied to assess the uncertainty
associated with landslide susceptibility indices derived from a simple slope
stability model (Murillo and Hunter, 1996).
The vector data model involves the creation of two independent, normally
distributed, random error grids in the x and y directions. When combined, these
grids provide the two components of a set of simulated positional error vectors
regularly distributed throughout the region of the data set to be perturbed (Figure
1). The assumptions made by the authors are firstly that the error has a circular
normal distribution, and secondly that its x and y components are independent of
each other. The grids are generated with a mean and standard deviation equal to
the data producer's estimate for positional error in the data set to be perturbed (a
prerequisite for application of the model). These error estimates, for example,
might come from the residuals at control points reported during digitiser setup, or
from existing data quality statements such as those that now accompany many
spatial data sets.
By overlaying the two grids with the vector data to be distorted, x and y
positional shifts can be applied to the coordinates of each node and vertex in the
data set to create a new, but equally probable, version of it. Thus, the
probabilistic coordinates of a point are considered to be (x + error, y + error).
With the distorted version of the data, the user then applies the same set of
procedures as required previously to create the final product, and by using a
number of these distorted data sets the uncertainty residing in the end product is
capable of being assessed. Alternatively, several different data sets may be
separately distorted (each on the basis of its own error estimate) prior to being
combined to assess final output uncertainty. While the model requires an initial
error estimate for creation of the two distortion grids, it is the resultant uncertainty
arising from the use of the simulated, perturbed data which is under investigation
(in conjunction with the spatial operations that are subsequently
applied)-hence its label as an 'uncertainty' model.

Figure 1.-The model of vector data uncertainty uses normally distributed, random
error grids in the x and y directions to produce a distorted version of the original data.
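To make the perturbation step concrete, the sketch below (in Python with NumPy, and not part of the authors' implementation) generates a pair of error grids and applies the corresponding shifts to a small set of vertex coordinates. The grid extent, spacing and standard deviation are illustrative values only, and the nearest grid value is used for simplicity rather than the interpolation proposed later in the paper.

import numpy as np

# Illustrative parameters (assumed for this sketch, not taken from the paper).
sigma = 20.0             # producer's estimate of positional error (m)
spacing = 30.0           # error grid spacing (m)
x_min, y_min = 0.0, 0.0  # lower-left corner of the error grids
n_rows, n_cols = 100, 100

rng = np.random.default_rng(0)

# Two independent, normally distributed error grids (mean 0, std dev sigma),
# one holding shifts in x and the other shifts in y.
x_error_grid = rng.normal(0.0, sigma, size=(n_rows, n_cols))
y_error_grid = rng.normal(0.0, sigma, size=(n_rows, n_cols))

def perturb(vertices):
    """Shift an (N, 2) array of vertex coordinates by the error grid values,
    taking the nearest grid point for simplicity (the paper proposes bilinear
    interpolation from the four surrounding points; see Figure 5)."""
    cols = np.clip(np.rint((vertices[:, 0] - x_min) / spacing).astype(int), 0, n_cols - 1)
    rows = np.clip(np.rint((vertices[:, 1] - y_min) / spacing).astype(int), 0, n_rows - 1)
    dx = x_error_grid[rows, cols]
    dy = y_error_grid[rows, cols]
    return vertices + np.column_stack([dx, dy])   # (x + error, y + error)

vertices = np.array([[150.0, 420.0], [180.0, 435.0], [210.0, 450.0]])
distorted = perturb(vertices)                      # one equally probable version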
DEVELOPMENT OF THE MODEL
Choosing the Error Grid Spacing
As discussed in Hunter and Goodchild (1996), the first step required to
implement the model is to determine an appropriate error grid spacing. If it is too
large, the nodes and vertices of small features in the source data will receive
similar-sized shifts in x and y during perturbation and the process will not be
random, giving rise to unwanted local autocorrelation between shifts (Figure 2).
Conversely, if it is too small then processing time can increase dramatically as
additional grid points are needlessly processed.
Figure 2.-If the error grid spacing (d) is too large, then unwanted local
autocorrelation between shifts may occur in small features.

The authors suggest an appropriate spacing be selected from one of the following
options: (1) the standard deviation of the horizontal positional error for the source
data; (2) a distance equal to 0.5 mm at source map scale where the data has been
digitized; or (3) a threshold value smaller than the user would care to consider
given the nature of the data to be processed. Thus far, the authors have found the
standard deviation (as supplied by the data producer) to be a useful value to apply,
since it tends to be smaller than (2) and therefore more conservative, and of
similar magnitude to practical estimates of (3).
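As a minimal sketch of this selection step (with purely illustrative values for the producer's standard deviation, map scale and user threshold, none of which come from the paper), the three candidates might be set out as follows, with the producer's standard deviation taken as the working choice:

# Illustrative values only; none of these figures come from the paper.
sd_positional_error = 20.0              # option 1: producer's standard deviation (m)
map_scale = 50000                       # denominator of the source map scale
half_mm_at_scale = 0.0005 * map_scale   # option 2: 0.5 mm at source scale -> 25 m
user_threshold = 25.0                   # option 3: smallest spacing worth considering (m)

# The authors have found option 1 to work well in practice, being more
# conservative than option 2 and of similar magnitude to option 3.
grid_spacing = sd_positional_error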
Generating the Initial Error Grids
Using the development diagram (Figure 3) as a guide for the
remaining discussion, the next step is to generate the x and y error
grids. The Arc/Info and Arc Grid software modules were used as the
development platform and the commands referred to in the paper apply to these
packages. To ensure the grids completely cover the extent of the source data, the
dimensions of the grids were pre-determined by setting a window equal to the
data set's dimensions (SETWINDOW command), and the cell size equivalent to
the chosen grid spacing (SETCELL command). Two grids (named accordingly
for the x and y directions) were created automatically using these parameters and
then populated with randomly placed, normally distributed values having a mean
of zero and a standard deviation defined by the data producer (NORMAL
command). It should be noted here that selection of the standard deviation for the
grid spacing does not affect the random population of the grids. The two grids are
only temporary and will require further refinement before being used to perturb
the vector data.

Figure 3.-Development diagram for the uncertainty model (components: ASCII file of
features from source data; mask grid; initial x and y error grids; refined x and y error
grids; new version of source data).
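Outside the Arc Grid environment, the same step can be sketched as follows (a NumPy illustration, not the authors' code): the grid dimensions are derived from the data set's extent and the chosen cell size, in the manner of SETWINDOW and SETCELL, and the cells are then filled with independent normal values as NORMAL does.

import numpy as np

def make_error_grid(extent, cell_size, std_dev, seed=None):
    """Create one error grid covering the data extent: the grid dimensions
    are derived from the extent and cell size, and every cell receives an
    independent draw from a normal distribution with mean zero."""
    x_min, y_min, x_max, y_max = extent
    n_cols = int(np.ceil((x_max - x_min) / cell_size))
    n_rows = int(np.ceil((y_max - y_min) / cell_size))
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std_dev, size=(n_rows, n_cols))

# Illustrative extent (x_min, y_min, x_max, y_max) and parameters.
extent = (0.0, 0.0, 3000.0, 2000.0)
initial_x_error = make_error_grid(extent, cell_size=30.0, std_dev=20.0, seed=1)
initial_y_error = make_error_grid(extent, cell_size=30.0, std_dev=20.0, seed=2)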
Reduction of Error Grid Size
As a means of minimizing processing time the number of cells in the error
grids needs to be reduced, since unless the vector data set is extremely dense there
will be many cells that are not required during operation. To achieve this, the
original vector data are converted to grid format (POINTGRID or LINEGRID
commands) to form a temporary masking grid which only contains 'live' cells that
the source data either lie within or pass through. Note that polygons are also
processed by LINEGRID since only boundary strings need be perturbed. The cell
attribute that has been passed during rasterization is unimportant since the grid
will only be used for masking purposes. All other (non-contributing) cells are
deliberately given a NODATA value.
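A simplified stand-in for this rasterization step is sketched below (Python/NumPy, not the POINTGRID or LINEGRID commands themselves): each line is sampled densely and the cells it passes through are flagged as 'live', with all remaining cells left as the equivalent of NODATA.

import numpy as np

def mask_from_lines(lines, extent, cell_size):
    """Build a boolean masking grid in which True marks the 'live' cells that
    the line work passes through and False stands in for NODATA. Each segment
    is sampled at half-cell steps rather than traced exactly, which is
    sufficient for masking purposes."""
    x_min, y_min, x_max, y_max = extent
    n_cols = int(np.ceil((x_max - x_min) / cell_size))
    n_rows = int(np.ceil((y_max - y_min) / cell_size))
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    for line in lines:                                # line = sequence of (x, y) vertices
        pts = np.asarray(line, dtype=float)
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            n = max(2, int(np.ceil(np.hypot(x1 - x0, y1 - y0) / (cell_size / 2.0))))
            xs = np.linspace(x0, x1, n)
            ys = np.linspace(y0, y1, n)
            cols = np.clip(((xs - x_min) / cell_size).astype(int), 0, n_cols - 1)
            rows = np.clip(((ys - y_min) / cell_size).astype(int), 0, n_rows - 1)
            mask[rows, cols] = True
    return mask

# Example: a single polyline within the illustrative extent used earlier.
mask_grid = mask_from_lines([[(100.0, 100.0), (900.0, 450.0), (1500.0, 400.0)]],
                            extent=(0.0, 0.0, 3000.0, 2000.0), cell_size=30.0)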
However, a problem arises here with using a masking grid that contains only
single cells (representing point data) or 1-cell-wide strings (representing lines or
polygons). As mentioned in Hunter and Goodchild (1996), there is the possibility
that the magnitude and direction of adjacent x and y error grid shifts may cause
them to overlap each other, resulting in possible loss of topological integrity in the
original data when applied to it (such as the transposition of adjacent vertices
causing unwanted loops).
The solution requires filtering of any such 'offending' pairs of shifts, which in
turn means that the masking grid must be extended for a distance of at least five
standard deviations either side of the original data-given that with the normal distribution
it is rare for a value to occur greater than this distance from the mean. This
buffering of the masking grid will permit neighboring error grid shifts to be used
in a subsequent filtering process.
To achieve this, the grid spacing is compared with the standard deviation of
the data and the equivalent number of cells is calculated (remembering that the
grid spacing will not always equal the standard deviation). For example, using a
standard deviation of 20 m and a grid spacing of 30 m, the buffer distance of five
standard deviations is 100 m and the correct number of cells equals
|(100/30) + 1| = 4 (with the extra cell and the modulus sign ensuring
'rounding up' of the answer). Using this value, a spread function is applied to the
masking grid (EUCDISTANCE command). A new masking grid is created
during the process and any (previously) NODATA cells affected by the operation
are automatically returned to active status.
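The cell calculation and the spread operation can be sketched as follows (a NumPy illustration in which a simple iterative dilation stands in for EUCDISTANCE, using the example figures above):

import numpy as np

def dilate(mask, n_cells):
    """Grow the True region of a boolean mask by n_cells in each direction,
    a simple stand-in for the EUCDISTANCE spread used in Arc Grid."""
    grown = mask.copy()
    for _ in range(n_cells):
        padded = np.pad(grown, 1, mode='constant', constant_values=False)
        grown = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1] |
                 padded[1:-1, :-2] | padded[1:-1, 2:])
    return grown

std_dev = 20.0     # producer's estimate of positional error (m)
cell_size = 30.0   # error grid spacing (m)

# Buffer the mask by at least five standard deviations (100 m), which at a
# 30 m spacing requires |(100/30) + 1| = 4 cells.
n_cells = int(5.0 * std_dev / cell_size) + 1
# buffered_mask = dilate(mask_grid, n_cells)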
The masking grid is then overlaid with the initial error grids to provide
reduced versions of the x and y grids that contain only shift values lying within the
required regions. Finally, the error grids are expanded by the width of a cell on
all sides (and given NODATA values) to support later processing (SELECTBOX
command).
Filtering the Error Grids
To preserve topological integrity in the source data upon distortion, some
means of adjustment must be introduced to control the magnitude of positional
shifts between neighbouring points in the x and y error grids. As previously
mentioned, if this does not occur there is a likelihood that neighbouring points in
the original data set may be transposed in position-causing unwanted 'fractures'
in the feature structure (Figure 4a). Intuitively, these modifications should not
occur and are considered unacceptable. Originally, the authors considered using
spatial autocorrelation to constrain the x and y shifts, but this would have had the
effect of unnecessarily altering all shifts in the error grids when only a relatively
small number needed to be adjusted in practice. Accordingly, a localised filter
was seen as a better solution.

Figure 4.-In (a), uncontrolled shifts between neighbouring error grid points can cause
unwanted 'fractures' or transposition of features. In (b), 'fractures' occur when the
difference between neighbouring shifts is larger than their separation (d). In (c), a
'fracture' is circled, requiring filtering on the basis of neighbouring shifts.
Therefore, a routine was developed to test the difference between consecutive
cells (in horizontal or row sequence for the x grid, and vertical or column
sequence for the y grid) to determine whether the absolute value of the difference
between them was greater than their separation distance (Figure 4b). If so, then a
'fracture' is possible at that location and a filter must be applied to average out the
shift values on the basis of their neighbors. The procedure is iterative and
proceeds until no 'fractures' exist in the error grids.
For example, given consecutive shifts Δx0, Δx1, Δx2 and Δx3 in any row of the x
error grid (and testing the difference between the two middle shifts), there is
potential for a 'fracture' to occur if |Δx1 - Δx2| > d (where d is the grid spacing).
The adjusted values of Δx1 and Δx2 are then computed by equation (1), which
averages each of them with its neighbouring shifts.
The justification for the filtering is that although the shifts are normally
distributed, the choice of that distribution is only an assumption and we should
accept there may be faults in its selection as a theoretical model. For example,
statisticians developed the human Intelligence Quotient (IQ) using a normal
distribution with a mean of 100 and a standard deviation of 20. Yet the distribution
is deliberately truncated at zero to disallow negative values which intuitively
should not occur-even though they will be predicted by the model. In a similar
manner, the filtering of neighboring error shifts to avoid 'fractures' is simply a
truncation of the distribution to cater for a minority of outlying values.
The filtering was implemented by developing a routine that moves progressively through each error grid and tests the difference between each pair of cell
values (DOCELL command). If the condition for a 'fracture' is not met, the
values are written to an update grid and the next pair is tested. If the condition is
met, the filter is applied and new values are computed and passed to the update
grid. In practice, it has been found that shifts of several standard deviations in
size can cause several iterations of the process to be required. However, the
procedure still converges fairly rapidly.
An early problem encountered was the situation where a large shift existed in
a perimeter row or column of the error grid-requiring an extra cell value (which
did not exist) to be input to the filter equation. It was found that without the
placement of an extra row or column on each boundary of the grid, the absence of
a cell caused the filter process to substitute a NODATA value for the cell to be
adjusted. This was unacceptable and the placement of an extra perimeter cell
around the error grid (with NODATA values) ensured that the original large shift
was retained unaltered. This was considered preferable to it being discarded.
While this implies that some nodes and vertices may receive shifts to new
positions outside the original extent of the data set, this decision remains
consistent with the original concept of distorting their values to form alternative,
but equally probable versions of the data. At the time of writing, the experimental
development of the model has not proceeded beyond this stage and the following
discussion remains conceptual.
Calculation of Data Point Shifts
In the final step of the model's development, values in the error grids must be
transferred to the data set being perturbed. Inevitably, it is expected that few if
any nodes or vertices in the source data will coincide exactly with the error grid
points, and a method is required for calculating x and y shifts based on the
neighboring values in the grid. A simple bilinear interpolation procedure (Watson,
1992) is proposed in which the x and y shifts assigned to each point are calculated
on the basis of the respective shifts of the four surrounding grid points (Figure 5).

Figure 5.-The x and y shifts for a data point not coinciding with the error grid are
calculated using bilinear interpolation based on the four surrounding grid values.
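A sketch of the proposed interpolation is given below (Python/NumPy, with the grid origin and spacing named as in the earlier sketches); the distorted coordinates of a point are then simply (x + interpolated x shift, y + interpolated y shift).

import numpy as np

def interpolate_shift(grid, x, y, x_min, y_min, spacing):
    """Bilinearly interpolate a shift for the point (x, y) from the four
    surrounding error grid values (rows index y, columns index x)."""
    col = (x - x_min) / spacing
    row = (y - y_min) / spacing
    c0 = int(np.clip(np.floor(col), 0, grid.shape[1] - 2))
    r0 = int(np.clip(np.floor(row), 0, grid.shape[0] - 2))
    tx = col - c0                # fractional position within the grid cell
    ty = row - r0
    top = (1 - tx) * grid[r0, c0] + tx * grid[r0, c0 + 1]
    bottom = (1 - tx) * grid[r0 + 1, c0] + tx * grid[r0 + 1, c0 + 1]
    return (1 - ty) * top + ty * bottom

# Distorted coordinates of a single node or vertex:
# new_x = x + interpolate_shift(x_error_grid, x, y, x_min, y_min, spacing)
# new_y = y + interpolate_shift(y_error_grid, x, y, x_min, y_min, spacing)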
An ASCII feature file containing the identifier and coordinates of each data
point can be automatically derived (UNGENERATE command), and the four
surrounding error shifts will be determined for each point and used to interpolate
the correct values to be applied. The distorted coordinates are then written to a
new file and when all records are processed, the file topology is rebuilt. Clearly,
the need may arise for high performance computers to be employed for this task,
particularly when many thousands of nodes and vertices may have to be
perturbed, and current research into GIS and supercomputing (such as that described
in Armstrong, 1994) is being investigated by the authors.
CONCLUSIONS
This paper has described the experimental development of an uncertainty
model for vector data, which operates by taking an input data set of point, line or
polygon features and then applying simulated positional error shifts in the x and y
directions to calculate new coordinates for each node and vertex. In effect this
produces a distorted, but equally probable, representation of the data set that can
be used to create a family of alternative outputs, usually in map form.
Assessment of the variation in the outputs can be used to provide an estimate of the
uncertainty residing in them, based on the error in the source data and its
propagation through the subsequent algorithms and processes employed.
At this stage of the research, software has been written that automates the
creation of the error grids and filters their values to avoid conditions that would
cause loss of topological integrity in the original data. The software also masks
the input data to ensure redundant values in the error grids are not unnecessarily
processed. Optimisation of the process is still required to achieve faster
operation. The next stage of the project involves development of an algorithm
which enables each node and vertex in the input data to be allocated the correct x
and y shift from the error grids, which yields the perturbed version of the source
data. The final stage entails testing of the working model, refinement of its
operation, and assessment of its strengths and limitations.
ACKNOWLEDGMENTS
The authors wish to acknowledge the support given to this project by the New
Zealand Forest Research Institute, the Australian Research Council, and the
Australian Federal Department of Industry, Science & Technology (under the
Bilateral Science & Technology Collaboration Scheme). The National Center for
Geographic Information & Analysis is supported by the National Science
Foundation, grant SBR 88-10917, and this research constitutes part of the
Center's Research Initiative 7 on Visualisation of Spatial Data Quality.
REFERENCES
Armstrong, M.P. 1994. "GIS and High Performance Computing". Proceedings of
the GIS/LIS 94 Conference, Phoenix, Arizona, pp. 4-13.
Hunter, G.J. and Goodchild, M.F. 1996. "A New Model for Handling Vector Data
Uncertainty in GIS". Urban and Regional Information Systems Journal, vol.
8, no. 1, 12 pp.
Murillo, M. and Hunter, G.J. 1996. "Modeling Slope Stability Uncertainty: A
Case Study at the H.J. Andrews Long-Term Ecological Research Site,
Oregon". Proceedings of the 2nd International Symposium on Spatial
Accuracy Assessment in Natural Resources and Environmental Modeling, Fort
Collins, Colorado, 10 pp.
Watson, D.F. 1992. Contouring: A Guide to the Analysis and Display of Spatial
Data (Pergamon/Elsevier Science: New York).