
This file was created by scanning the printed publication.
Errors identified by the software have been corrected;
however, some errors may remain.
Total Error Estimation in a Spatial
Database for GIS
Jose Alberto Quintanilha¹ and Marcos Rodrigues²
Abstract. - The aim of this paper is to present a methodology for estimating the errors
in a digital spatial database originating from the digitizing of a large number of maps. It
is understood that the desired quality of that digital database is associated with a
specific foreseen use. Taking the spatial database as the population to be
sampled and the objects in it as the primary sampling units, two strategies are
discussed: spatial sampling, focused on areas where errors are likely to occur, and non-spatial sampling, focused on the selection of objects according to a database
structure. The error types are organized in a sequential structure to which a
multi-stage sampling strategy is applied. The sampling areas are selected with a
specific sampling frame, quadrats, chosen according to spatial variability evaluated
by entropy measures. Assuming that the number of errors in the spatial units follows a
Poisson distribution, the total error distribution can be estimated, thus allowing
inference about the quality of the whole population.
The purpose of this paper is to present a methodology to estimate the total
errors in a spatial database for GIS applications. The population to be sampled is
the database resulting from the digitizing of a large number of paper maps.
The adopted sampling scheme involves spatial and non-spatial aspects. The
spatial aspects concern the selection of areas where errors are more likely to
occur; the non-spatial aspects are related to the selection of spatial objects (points,
lines and polygons) structured in a particular data model.
The sampling area frames, quadrats, are selected according to their variability,
measured by their corresponding entropy.
A sequence of non-independent errors in the database, namely identification
(completeness), absolute position, relative position and topology, in that order (for
more details, see Quintanilha, 1996), was used under a multi-stage sampling strategy.
The total errors in a spatial database can be estimated under the assumption that
the frequencies of the errors in the sequence follow a multivariate Poisson
distribution (Johnson and Kotz, 1969).
¹ Universidade de Sao Paulo - Brazil
² Universidade de Sao Paulo, CARTA Consultoria - Brazil
The aim is to obtain an estimate for each type of error while examining the smallest
number of maps. A methodology to analyse the frequency of each type of error
(completeness, absolute position, relative position and topology) in a spatial
database is proposed.
As mentioned before, the sequence of errors is evaluated through the
following steps:
step 1: extract a sample of sets of maps (random or intentional) from the different
existing sets of maps (non-spatial sampling);
step 2: extract a sample of maps from the sets of maps selected in the first
step (non-spatial sampling);
step 3: choose areas from the maps selected in step 2 (spatial sampling);
step 4: stratify the selected areas by themes or coverages (non-spatial
sampling);
step 5: select the objects from the database by themes or coverages of interest
(non-spatial sampling).
In steps 1, 2, 3 and 5, Shannon's entropy concept is used to select the sets of
maps, the maps and the areas to be sampled within each map, by themes and by
objects. Considering that high variability implies a high probability of observing the
errors under study, one selects the sets of maps, maps, and areas with the highest
entropy values, which may be estimated by the number of bits of the corresponding
datafile. Thus, sampling a spatial database is reduced to locating the spatial sampling
units using information theory.
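The selection rule used in steps 1 to 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name select_by_size, the set names and the datafile sizes are hypothetical, and the datafile size in bits is taken as the entropy proxy, as suggested in the text.

```python
def select_by_size(sizes_in_bits, k):
    """Return the k candidates with the largest datafile size.

    sizes_in_bits maps a candidate name (set of maps, map or area)
    to the size of its datafile in bits, used here as an entropy proxy.
    """
    ranked = sorted(sizes_in_bits, key=sizes_in_bits.get, reverse=True)
    return ranked[:k]


# Hypothetical datafile sizes (bits) for three sets of maps.
map_sets = {"set_a": 8_000_000, "set_b": 2_500_000, "set_c": 5_100_000}

# Step 1: keep the two sets of maps with the largest datafiles.
selected_sets = select_by_size(map_sets, 2)
print(selected_sets)
```

The same function could be applied again to the maps of each selected set and then to the areas of each selected map.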
Spatial Sampling Techniques
We aim at selecting an adequate area in which to identify and count errors in order
to certify the database accuracy. Therefore all the spatial units and spatial sampling
frames used are of area type.
The most common sampling frame in spatial sampling is the quadrat. A quadrat
(Shaw and Wheeler, 1985) is the result of a superposition of a regular grid on the
map (or area). Errors are counted within the quadrat.
Kish (1965) states that quadrat sampling is not a sampling method, but a
methodology to collect information at a spatial position selected by some
traditional sampling scheme (random, systematic, stratified, etc.).
Quadrat sampling is recommended when the units of the population are not
available (either access to them is difficult or the associated cost is very high). The
evaluation of errors in a database is an example in which a census is infeasible for
reasons of time and cost.
Quadrat sampling can be used for two different purposes: to measure the
abundance per areal unit or to investigate the population pattern (Ripley, 1981). In
the first case, one may suppose that the errors in the quadrat follow a Poisson
process with rate $\lambda$ (the error rate, i.e., the number of errors occurring per unit area). The
frequency of errors in a quadrat with area $A$ will follow a Poisson distribution
(mean and variance equal to $\lambda A$). In the second case, indexes have been proposed
that are based on counts from a set of sample quadrats (Ripley, op. cit.).
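1) if $N(A)$ is the observed number of errors in area $A$ and $P_n(A) = \mathrm{prob}\{N(A) = n\}$, there exists a constant $\lambda > 0$ such that, for any quadrat with area $\delta A > 0$ (Diggle, 1979
and 1981):
$$P_1(\delta A) = \lambda\,\delta A + o(\delta A), \qquad \sum_{n \ge 2} P_n(\delta A) = o(\delta A),$$
where $o(\delta A)$ denotes a quantity of smaller order of magnitude than $\delta A$ as $\delta A \to 0$;
2) for any two disjoint regions $A$ and $B$, $N(A)$ and $N(B)$ are
independent. Then:
$$P_n(A) = \frac{e^{-\lambda A} (\lambda A)^n}{n!}, \qquad n = 0, 1, 2, \ldots$$
Thus, $N(A)$ follows a Poisson distribution with mean equal to $\lambda A$.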
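As an illustration of the Poisson assumption, the error rate can be estimated from quadrat counts, and the variance-to-mean ratio gives a rough check of the assumption (it should be close to 1 when mean and variance are equal). This is a minimal sketch: the function names and the counts are hypothetical, and the dispersion index is a standard quadrat-count statistic rather than something prescribed in this paper.

```python
from statistics import mean, variance


def poisson_rate(counts, quadrat_area):
    """Estimate the error rate (errors per unit area) from quadrat counts."""
    return mean(counts) / quadrat_area


def dispersion_index(counts):
    """Sample variance divided by the mean; near 1 for Poisson counts."""
    return variance(counts) / mean(counts)


# Hypothetical error counts observed in six quadrats of area 4.0 each.
counts = [2, 0, 3, 1, 4, 2]
rate = poisson_rate(counts, quadrat_area=4.0)
index = dispersion_index(counts)
print(rate, index)
```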
Determination of quadrats size or number of quadrats
The determination of the number of quadrats is related to a cost model, which
involves the precision of the estimates and, obviously, the determination of costs
and times. For a certain budget or duration of the database quality investigation,
the spatial objects in the sets of maps, maps, areas and themes/coverages can be
investigated. In other words, for an expected value of the error of the estimates,
we can determine an ideal sample size that satisfies this precision.
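One possible way to turn a precision target into a sample size, assuming independent Poisson counts per quadrat, is sketched below. With mean $\lambda A$ per quadrat, the mean count over n quadrats has relative standard error $1/\sqrt{n \lambda A}$, so n must be at least $1/(\lambda A\, r^2)$ for a target relative error r. The function name and the numbers are hypothetical, and the cost side of the model is not represented.

```python
from math import ceil


def quadrats_needed(expected_errors_per_quadrat, relative_error):
    """Smallest n such that 1 / sqrt(n * mean) <= relative_error.

    Assumes independent Poisson counts with the given mean per quadrat.
    """
    return ceil(1.0 / (expected_errors_per_quadrat * relative_error ** 2))


# Hypothetical planning values: 2 errors expected per quadrat, 10% precision.
n_quadrats = quadrats_needed(2.0, 0.1)
print(n_quadrats)
```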
Certainly, a higher number of small quadrats will cover the area more
efficiently. Thus, the chance of detecting some type of error will be improved and the
corresponding spatial correlation will be minimized. However, when the sampling
units are areas and lines, the definition of the quadrats must consider the
problem of the size of these objects, which can extend beyond the area of the quadrat.
One example is the segmentation of polygons into meaningless lines.
Determination of position of the quadrats
For a certain area represented in a quadrat and for a certain representation
model, raster or vector, the size of the datafile which represents that quadrat is
variable and proportional to the number $c$ of different objects inside it. For each
selected area, $p_i$ is the probability of occurrence of the $i$th object, which is
estimated by its frequency. The size $t$ of the datafile related to a quadrat is
proportional to the number $c$ too, i.e.:
$$t \propto c .$$
Considering $t_i$ as the size of the part of a quadrat datafile related to object $i$, the
expected mean size of the quadrat datafile is given by:
$$E(t) = \sum_{i=1}^{c} p_i\, t_i .$$
The expected size of the quadrat datafile $E(t)$ is proportional to the
number of different objects within it.
Similarly, the entropy of the quadrat is given by:
$$H = -\sum_{i=1}^{c} p_i \log_2 p_i \quad \text{bits/quadrat}.$$
So, the sampling procedure must be restricted to areas where the entropy is
higher, which corresponds to the areas with higher number of bits. The maps and
sets of maps will be selected similarly.
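The entropy of a quadrat can be computed directly from the observed object frequencies, as in the following sketch. The function name quadrat_entropy and the object labels are hypothetical; the probabilities are estimated by relative frequencies, as described above.

```python
from collections import Counter
from math import log2


def quadrat_entropy(object_labels):
    """Shannon entropy (bits) of the object classes found in a quadrat."""
    counts = Counter(object_labels)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Four equally frequent object classes give the maximum entropy for c = 4.
varied = quadrat_entropy(["road", "river", "building", "parcel"])
# A quadrat holding a single object class has zero entropy.
uniform = quadrat_entropy(["road", "road", "road", "road"])
print(varied, uniform)
```

Under the selection rule of the text, the first quadrat would be preferred for sampling over the second.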
The themes/coverages will be used as a secondary variable in the object
stratification: objects associated with each theme/coverage in the areas with the
largest quadrat datafiles will be in the sample.
Use of the thematic stratification
The option for stratified sampling by themes/coverages is justified because
simple random sampling relates to the global distribution of the population (i.e.,
the errors), which implies that some types of errors may not be sampled or may be
undersampled. This could be avoided with a larger sample, but that would have
implications for the time and associated costs.
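A minimal sketch of stratification by theme is given below, assuming the objects of a selected quadrat are already grouped by theme/coverage. The function name, the theme names, the sampling fraction and the rule of taking at least one object per non-empty theme are all illustrative assumptions, not part of the proposed methodology.

```python
import random


def stratified_sample(objects_by_theme, fraction, seed=0):
    """Draw a simple random sample inside each theme (stratum).

    At least one object is taken from every non-empty theme, so that
    no theme is left out of the sample.
    """
    rng = random.Random(seed)
    sample = {}
    for theme, objects in objects_by_theme.items():
        k = max(1, round(fraction * len(objects))) if objects else 0
        sample[theme] = rng.sample(objects, k)
    return sample


# Hypothetical object identifiers grouped by theme.
objects_by_theme = {
    "roads": list(range(100)),
    "hydrography": list(range(10)),
    "buildings": list(range(3)),
}
sample = stratified_sample(objects_by_theme, fraction=0.1)
print({theme: len(chosen) for theme, chosen in sample.items()})
```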
As mentioned, we use the sequence presented above and consider each type of
error (identification, absolute position, relative position, and topology) as a
Bernoulli experiment. That is, there are two possible results for each type of
error: it exists or it does not.
The assumptions are:
the vector $Y = (Y_I, Y_A, Y_R, Y_T)'$ is the vector of the numbers of
errors of each type in n quadrats, where:
    $Y_I$ is the number of identification errors,
    $Y_A$ is the number of absolute position errors which were not
    generated by identification errors,
    $Y_R$ is the number of relative position errors which were not generated
    by absolute position errors,
    $Y_T$ is the number of topology errors which were not generated
    by relative position errors;
each type of error follows a Poisson distribution;
the four kinds of errors are correlated (identification errors imply
absolute position errors, which imply relative position errors, which imply
topology errors).
So, the vector $Y$ will follow a multivariate Poisson distribution with the
parameters $\lambda^* = (\lambda_I, \lambda_A, \lambda_R, \lambda_T)$, where
$\lambda_i$ is the rate of error type $i$, with $i$ = I
(identification), A (absolute position), R (relative position), T (topology). Details
of this distribution can be found in Johnson and Kotz (1969), Jensen (1985) and Ho
(1995).
Consider a situation in which a number of maps with the same cartographic
characteristics (scale, datum, coordinate system, generalization conditions,
legends, etc.) are necessary to cover a whole area. Over those maps, one overlays a
regular grid which creates a similar number of equal quadrats in each map. The
overall procedure is:
create homogeneous sets of maps;
select a sample of sets of maps according to the entropy criterion;
for each selected set of maps, extract a sample of maps according to the same
criterion (entropy);
for each map in the sample, locate the quadrats with the highest entropy;
for each theme/coverage in each selected quadrat, compute:
    the number of identification errors;
    the number of absolute position errors not generated by the
    identification errors;
    the number of relative position errors not generated by the previous
    errors;
    the number of topology errors not generated by the previous errors.
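The counting in the last step of the procedure can be sketched as below. It assumes, as a simplification, that every object has been checked for the four error types and that any later error on an object that already has an earlier error in the sequence was generated by it, so each object is counted once, under its first error. The names ERROR_ORDER, first_error and count_errors, and the sample records, are hypothetical.

```python
from collections import Counter

# Order of the error sequence used in the text.
ERROR_ORDER = ("identification", "absolute_position",
               "relative_position", "topology")


def first_error(flags):
    """Return the first error type present for an object, or None."""
    for kind in ERROR_ORDER:
        if flags.get(kind, False):
            return kind
    return None


def count_errors(objects):
    """Count objects by first error, so induced errors are not recounted."""
    counts = Counter()
    for flags in objects:
        kind = first_error(flags)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical check results for four objects of one theme in one quadrat.
objects = [
    {"identification": True, "absolute_position": True},
    {"absolute_position": True, "topology": True},
    {"topology": True},
    {},
]
counts = count_errors(objects)
print(dict(counts))
```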
The event error can be represented by:
$$\text{error} = I \cup (\neg I \cap A) \cup (\neg I \cap \neg A \cap R) \cup (\neg I \cap \neg A \cap \neg R \cap T).$$
Because these are mutually exclusive events, the probability of occurrence of an error can be
estimated by:
$$\begin{aligned}
p(\text{error}) &= p(I) + p(\neg I \cap A) + p(\neg I \cap \neg A \cap R) + p(\neg I \cap \neg A \cap \neg R \cap T) \\
&= p(I) + p(A \mid \neg I)\, p(\neg I) + p(R \mid \neg A \cap \neg I)\, p(\neg A \mid \neg I)\, p(\neg I) \\
&\quad + p(T \mid \neg I \cap \neg A \cap \neg R)\, p(\neg R \mid \neg A \cap \neg I)\, p(\neg A \mid \neg I)\, p(\neg I)
\end{aligned}$$
where:
$p(I)$ is estimated by the proportion of identification errors, i.e., the
proportion of non-identified objects or objects that have been
erroneously identified,
$p(\neg I) = 1 - p(I)$ is estimated by the proportion of objects correctly
identified,
$p(A \mid \neg I)$ is estimated by the proportion of objects with absolute position
errors that were not generated by the identification errors,
$p(R \mid \neg A \cap \neg I)$ is estimated by the proportion of objects with relative
position errors that were not generated by the previous errors,
$p(\neg R \mid \neg A \cap \neg I)$ is estimated by the proportion of objects without
identification, absolute position and relative position errors,
$p(T \mid \neg I \cap \neg A \cap \neg R)$ is estimated by the proportion of objects with
topology errors that were not generated by the previous errors.
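The chain of conditional probabilities can be evaluated numerically as in the sketch below. It assumes that each conditional proportion is taken over the objects that are free of all earlier errors in the sequence, which is one reading of the estimators listed above. The function name and the counts are hypothetical.

```python
def total_error_probability(n_objects, first_error_counts):
    """Estimate p(error) from counts of objects by first error type.

    first_error_counts lists the counts in sequence order:
    identification, absolute position, relative position, topology.
    """
    p_error = 0.0
    p_no_error_so_far = 1.0   # probability of no earlier error in the chain
    remaining = n_objects     # objects free of all earlier errors
    for count in first_error_counts:
        p_conditional = count / remaining if remaining else 0.0
        p_error += p_conditional * p_no_error_so_far
        p_no_error_so_far *= 1.0 - p_conditional
        remaining -= count
    return p_error


# Hypothetical sample: 200 objects, with 10, 19, 9 and 18 first errors.
p_error = total_error_probability(200, [10, 19, 9, 18])
print(round(p_error, 4))
```

Under this reading, the estimate reduces to the proportion of sampled objects with at least one error (here 56 out of 200), which is a convenient check on the arithmetic.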
The proposed methodology for error evaluation in spatial databases aims at
reducing the time and costs associated with qualifying and validating a spatial
database for GIS users. Additional studies will be carried out to handle the
spurious polygons that can be generated by each type of error. These polygons
may complicate the sampling procedure and increase the sample size, time and
costs.
The authors are grateful to Escola Politecnica da Universidade de Sao Paulo
which supported this research and to Dr. Linda Lee Ho, who supported the
statistical aspects of Poisson multivariate distribution.
Bartlett, M. S. 1975. The statistical analysis of spatial pattern. Chapman and Hall,
London, 90 p.
Bellhouse, D. R. 1988. Spatial sampling. In: Kotz, S.; Johnson, N.L. (ed.)
Encyclopedia of Statistical Science. New York, John Wiley, V. 8, p. 581-4.
Berger, T. 1981. Information theory and coding theory. In: Kotz, S.; Johnson,
N.L. (ed.) Encyclopedia of Statistical Science. New York, John Wiley, V.4, p.
Diggle, P. J. 1979. Statistical methods for spatial point patterns in ecology. In:
Cormack, R.M.; Ord, J.K., ed. Spatial and temporal analysis in ecology.
Fairland, International Co-operative Pub. House, p.95- 150.
Diggle, P. J. 1981. Some graphical methods in the analysis of spatial point
patterns. In: Barnett, V., ed., Interpreting multivariate data. Chichester, John
Wiley, chap. 4, p. 55-73.
Ho, L. L. 1995. Analysis of multivariate counts. São Paulo. 109p. Thesis (PhD) - Escola Politécnica, Universidade de São Paulo, Brazil (in Portuguese).
Jensen, D. R. 1985. Multivariate distributions. In: Kotz, S.; Johnson, N.L. (ed.)
Encyclopedia of Statistical Science. New York, John Wiley, V.6, p. 43-54.
Johnson, N.L.; Kotz, S. 1969. Distributions in statistics: discrete distributions.
New York, John Wiley. Chap. 11: Multivariate Discrete Distributions, p. 281-323.
Kish, L. 1965. Survey sampling. Chapter 9: Area sampling. p. 301-358. New
York, John Wiley.
Quintanilha, J. A. 1996. Errors and uncertainties in a spatial database for GIS
(non-official version). 193p. Thesis (PhD) - Escola Politécnica, Universidade de
São Paulo, Brazil (in Portuguese).
Ripley, B. D. 1981. Spatial statistics. New York, John Wiley & Sons, 252 p.
Ripley, B. D. 1988. Spatial data analysis. In: Kotz, S.; Johnson, N.L. (ed.)
Encyclopedia of Statistical Science. New York, John Wiley, V. 8, p. 570-3.
Shaw, G.; Wheeler, D. 1985. Statistical techniques in geography. S.I.: Spatial
indices & pattern analysis, Chap. 16.