Total Error Estimation in a Spatial Database for GIS

Jose Alberto Quintanilha¹ and Marcos Rodrigues²

Abstract. The aim of this paper is to present a methodology for estimating the errors in a digital spatial database produced by digitizing a large number of maps. It is understood that the desired quality of such a database depends on its specific foreseen use. Taking the spatial database as the population to be sampled and the objects in it as the primary sampling units, two strategies are discussed: spatial sampling, focused on areas where errors are likely to occur, and non-spatial sampling, focused on the selection of objects according to the database structure. The error types are organized in a sequential structure to which a multi-stage sampling strategy is applied. The sampling areas are selected using a specific sampling frame, quadrats, chosen according to their spatial variability as evaluated by entropy measures. Assuming that the number of errors in the spatial units follows a Poisson distribution, the total error distribution can be estimated, which allows inference about the quality of the whole population.

INTRODUCTION

The purpose of this paper is to present a methodology for estimating the total errors in a spatial database for GIS applications. The population to be sampled is the database obtained by digitizing a large number of paper maps. The adopted sampling scheme involves spatial and non-spatial aspects. The spatial aspects concern the selection of areas where errors are more likely to occur; the non-spatial aspects relate to the selection of spatial objects (points, lines and polygons) structured in a particular data model. The units of the area sampling frame, quadrats, are selected according to their variability, measured by their corresponding entropy.
A sequence of non-independent errors in the database (identification or completeness, absolute position, relative position and topology, in that order; for more details, see Quintanilha, 1996) is used under a multi-stage sampling procedure. The total errors in a spatial database can be estimated under the assumption that the frequencies of the errors in this sequence follow a multivariate Poisson distribution (Johnson and Kotz, 1969).

The aim is to obtain an estimate for each type of error while examining the smallest possible number of maps. A methodology to analyse the frequency of each type of error (completeness, absolute position, relative position and topology) in a spatial database is proposed.

¹ Universidade de São Paulo, Brazil
² Universidade de São Paulo and CARTA Consultoria, Brazil

EVALUATION OF THE ERRORS IN A SPATIAL DATABASE

As mentioned before, the sequence of errors is evaluated through the following steps:

step 1: extract a sample of sets of maps (random or intentional) from the different existing sets of maps (non-spatial sampling);
step 2: extract a sample of maps from each set of maps selected in step 1 (non-spatial sampling);
step 3: choose areas from the maps selected in step 2 (spatial sampling);
step 4: stratify the selected areas by themes or coverages (non-spatial stratification);
step 5: select the objects from the database by themes or coverages of interest (non-spatial sampling).

In steps 1, 2, 3 and 5, Shannon's entropy concept is used to select the sets of maps, the maps and the area to be sampled within each map, by themes and by objects. Considering that high variability implies a high probability of observing the errors under study, one selects the sets of maps, maps and areas with the highest entropy values, which may be estimated by the number of bits of the corresponding data file. Sampling a spatial database is thus reduced to locating the spatial sampling units using information theory.
SPATIAL SAMPLING

Spatial Sampling Techniques

We aim at selecting an adequate area in which to identify and count errors in order to certify the database accuracy. Therefore all the spatial units and spatial sampling frames used are of area type. The most common sampling frame in spatial sampling is the quadrat. A quadrat (Shaw and Wheeler, 1985) results from superimposing a regular grid on the map (or area). Errors are counted within the quadrat. Kish (1965) states that quadrat sampling is not a sampling method in itself, but a methodology for collecting information at a spatial position selected by some traditional sampling scheme (random, systematic, stratified, etc.). Quadrat sampling is recommended when the units of the population are not available (either access to them is difficult or the associated cost is very high). Error evaluation in a database is an example in which a census is not viable for reasons of time and cost.

Quadrat sampling can be used for two different purposes: to measure the abundance per areal unit or to investigate the population pattern (Ripley, 1981). In the first case, one may suppose that the errors in the quadrat follow a Poisson process with rate $\lambda$ (the error rate, i.e., the number of error occurrences per unit area). The frequency of errors in a quadrat with area $A$ then follows a Poisson distribution with mean and variance equal to $\lambda A$. In the second case, indices have been proposed that are based on counts from a set of sample quadrats (Ripley, op. cit.).

Formally (Diggle, 1979 and 1981):

1) if $N(A)$ is the observed number of errors in area $A$ and $P_n(A) = \Pr\{N(A) = n\}$, there exists a constant $\lambda > 0$ such that, for any quadrat with area $\delta A > 0$,

$$P_1(\delta A) = \lambda\,\delta A + o(\delta A), \qquad \sum_{n \ge 2} P_n(\delta A) = o(\delta A),$$

where $o(\delta A)$ denotes terms of smaller order of magnitude than $\delta A$;

2) for any two disjoint regions $A$ and $B$, $N(A)$ and $N(B)$ are independent.

Then:

$$P_n(A) = \frac{e^{-\lambda A}\,(\lambda A)^n}{n!}, \qquad n = 0, 1, 2, \ldots$$

Thus, $N(A)$ follows a Poisson distribution with mean equal to $\lambda A$.
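As an illustration of the first use of quadrat sampling, the following minimal sketch estimates the error rate from error counts in equal-area quadrats and evaluates the resulting Poisson probabilities. The function names, the example counts and the variance-to-mean check are assumptions made for this sketch and are not part of the proposed methodology.

```python
import math
import statistics


def estimate_error_rate(counts, quadrat_area):
    """Maximum-likelihood estimate of the Poisson rate (errors per unit area).

    counts: error counts observed in quadrats that all have the same area.
    """
    if not counts or quadrat_area <= 0:
        raise ValueError("need at least one count and a positive quadrat area")
    return sum(counts) / (len(counts) * quadrat_area)


def poisson_pmf(n, rate, area):
    """Probability of observing exactly n errors in a region of the given area."""
    mean = rate * area
    return math.exp(-mean) * mean ** n / math.factorial(n)


def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("dispersion index is undefined when no errors were counted")
    return statistics.variance(counts) / mean


if __name__ == "__main__":
    # Hypothetical error counts in eight quadrats of 4 area units each.
    counts = [2, 0, 1, 3, 2, 1, 0, 3]
    rate = estimate_error_rate(counts, quadrat_area=4.0)
    print("estimated rate per unit area:", rate)
    print("P(no error in one quadrat):", poisson_pmf(0, rate, 4.0))
    print("variance-to-mean ratio:", dispersion_index(counts))
```

A variance-to-mean ratio far from 1 would suggest that the Poisson assumption deserves a closer look before the estimated rate is extrapolated to the whole database.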
Determination of quadrat size or number of quadrats

The determination of the number of quadrats is related to a cost model, which involves the precision of the estimates and, obviously, the determination of costs and times. A given budget or duration for the database quality investigation determines how many of the spatial objects in the sets of maps, maps, areas and themes/coverages can be investigated. In other words, for an expected value of the estimation error, we can determine an ideal sample size that satisfies this precision. Certainly, a larger number of small quadrats will cover the area more efficiently, so the chance of detecting a given type of error will be improved and the corresponding spatial correlation will be minimized. However, when the sampling units are areas and lines, the definition of the quadrats must consider the size of these objects, which can extend beyond the area of the quadrat. One example is the segmentation of polygons into meaningless lines.

Determination of the position of the quadrats

For a given area represented in a quadrat and a given representation model, raster or vector, the size of the data file that represents the quadrat is variable and proportional to the number $c$ of different objects inside it. For each selected area, $p_i$ is the probability of occurrence of the $i$th object, which is estimated by its frequency. The size $t$ of the data file related to a quadrat is also proportional to the number $c$, i.e., $t \propto c$.

Considering $t_i$ as the size of the part of a quadrat data file related to object $i$, the expected mean size of the quadrat data file is given by:

$$E(t) = \sum_{i=1}^{c} p_i\, t_i$$

The expected size of the quadrat data file $E(t)$ is proportional to the number of different objects within it. Similarly, the entropy of the quadrat is given by:

$$H = -\sum_{i=1}^{c} p_i \log_2 p_i \quad \text{bits/quadrat}$$

So, the sampling procedure must be restricted to areas where the entropy is higher, which correspond to the areas with a higher number of bits.
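The entropy criterion can be sketched as follows, assuming that each quadrat is described by the list of object-type labels found inside it and that each probability is estimated by the relative frequency of the corresponding label. The labels, quadrat names and function names are hypothetical and serve only to illustrate the ranking.

```python
import math
from collections import Counter


def quadrat_entropy(object_types):
    """Shannon entropy, in bits, of the object-type frequencies in one quadrat."""
    counts = Counter(object_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty quadrat carries no information
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def rank_quadrats(quadrats):
    """Return quadrat names ordered from highest to lowest entropy."""
    return sorted(quadrats, key=lambda name: quadrat_entropy(quadrats[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical contents of three quadrats.
    quadrats = {
        "q1": ["road", "road", "road", "road"],
        "q2": ["road", "road", "river", "river"],
        "q3": ["road", "river", "building", "contour"],
    }
    for name in rank_quadrats(quadrats):
        print(name, quadrat_entropy(quadrats[name]))
```

In this sketch the quadrat with four equally frequent object types is ranked first, which matches the idea that the most heterogeneous areas are the ones to be inspected.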
The maps and sets of maps are selected similarly. The themes/coverages are used as a secondary variable in the stratification of the objects: objects associated with each theme/coverage in the areas with the larger quadrat data files will be in the sample.

Use of the thematic stratification

The option for stratified sampling by themes/coverages is justified because simple random sampling relates to the global distribution of the population (i.e., the errors), which implies that some types of errors may not be sampled or may be undersampled. This could be avoided with a larger sample, but that would have implications for the time and associated costs.

TEST FOR THE TOTAL ERRORS AMOUNT

As mentioned, we use the sequence presented above and consider each type of error (identification, absolute position, relative position and topology) as a Bernoulli experiment. That is, there are two possible results for each type of error: it exists or it does not. The assumptions are:

- the vector $Y = (Y_I, Y_A, Y_R, Y_T)'$ is the vector of the number of errors of each type in $n$ quadrats, where $Y_I$ is the number of identification errors, $Y_A$ is the number of absolute position errors that were not generated by identification errors, $Y_R$ is the number of relative position errors that were not generated by absolute position errors, and $Y_T$ is the number of topology errors that were not generated by relative position errors;
- each type of error follows a Poisson distribution;
- the four kinds of errors are correlated (identification errors imply absolute position errors, which imply relative position errors, which imply topology errors).

So, the vector $Y$ will follow a multivariate Poisson distribution with parameters $\lambda^* = (\lambda_I, \lambda_A, \lambda_R, \lambda_T)'$, where $\lambda_i$ is the rate of error type $i$, with $i$ = I (identification), A (absolute position), R (relative position), T (topology). Details of this distribution can be found in Johnson and Kotz (1969), Jensen (1985) and Ho (1995).
PROPOSED METHODOLOGY

Consider a situation in which a number of maps with the same cartographic characteristics (scale, datum, coordinate system, generalization conditions, legends, etc.) is necessary to cover a whole area. Over these maps, one overlays a regular grid that creates a similar number of equal quadrats in each map. The overall procedure is:

- create homogeneous sets of maps;
- select a sample of sets of maps according to the entropy criterion;
- for each selected set of maps, extract a sample of maps according to the same criterion (entropy);
- for each map in the sample, locate the quadrats with the highest entropy;
- for each theme/coverage in each selected quadrat, compute:
  1. the number of identification errors;
  2. the number of absolute position errors not generated by the identification errors;
  3. the number of relative position errors not generated by the previous errors;
  4. the number of topology errors not generated by the previous errors.

The event error can be represented by:

$$\text{error} = I \cup (\lnot I \cap A) \cup (\lnot I \cap \lnot A \cap R) \cup (\lnot I \cap \lnot A \cap \lnot R \cap T).$$
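The four parts of this event can be tallied by assigning each inspected object to the first error type, in the sequence I, A, R, T, that it exhibits. The sketch below assumes a hypothetical layout in which each object is a dictionary of boolean flags keyed by error type; the names `SEQUENCE`, `first_error`, `tally_errors` and `error_proportion` are illustrative only.

```python
SEQUENCE = ("I", "A", "R", "T")  # identification, absolute, relative, topology


def first_error(flags):
    """Return the first error type in SEQUENCE flagged for an object, or None."""
    for kind in SEQUENCE:
        if flags.get(kind, False):
            return kind
    return None


def tally_errors(objects):
    """Count objects by their first error, so that no object is counted twice."""
    tally = {kind: 0 for kind in SEQUENCE}
    for flags in objects:
        kind = first_error(flags)
        if kind is not None:
            tally[kind] += 1
    return tally


def error_proportion(objects):
    """Proportion of inspected objects showing at least one error."""
    if not objects:
        raise ValueError("no objects were inspected")
    return sum(tally_errors(objects).values()) / len(objects)


if __name__ == "__main__":
    # Hypothetical inspection results for five objects in one quadrat.
    inspected = [
        {"I": True, "A": True},   # counted once, as an identification error
        {"A": True, "R": True},   # counted once, as an absolute position error
        {"T": True},              # topology error with no earlier error
        {},                       # error-free object
        {},                       # error-free object
    ]
    print(tally_errors(inspected))
    print(error_proportion(inspected))
```

Counting each object only under its first error keeps the four tallies disjoint, so an identification error that propagates into position or topology errors is not counted more than once.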
Because these are mutually exclusive events, the probability of occurrence of an error can be estimated by:

$$
\begin{aligned}
p(\text{error}) &= p(I) + p(\lnot I \cap A) + p(\lnot I \cap \lnot A \cap R) + p(\lnot I \cap \lnot A \cap \lnot R \cap T) \\
&= p(I) + p(A \mid \lnot I)\, p(\lnot I) + p(R \mid \lnot A \cap \lnot I)\, p(\lnot A \mid \lnot I)\, p(\lnot I) \\
&\quad + p(T \mid \lnot I \cap \lnot A \cap \lnot R)\, p(\lnot R \mid \lnot A \cap \lnot I)\, p(\lnot A \mid \lnot I)\, p(\lnot I)
\end{aligned}
$$

where:

- $p(I)$ is estimated by the proportion of objects with identification errors, i.e., the proportion of objects that were not identified or were erroneously identified;
- $p(\lnot I) = 1 - p(I)$ is estimated by the proportion of objects correctly identified;
- $p(A \mid \lnot I)$ is estimated by the proportion of objects with absolute position errors that were not generated by identification errors, and $p(\lnot A \mid \lnot I) = 1 - p(A \mid \lnot I)$;
- $p(R \mid \lnot A \cap \lnot I)$ is estimated by the proportion of objects with relative position errors that were not generated by the previous errors;
- $p(\lnot R \mid \lnot A \cap \lnot I)$ is estimated by the proportion of objects free of relative position errors among those with neither identification nor absolute position errors;
- $p(T \mid \lnot I \cap \lnot A \cap \lnot R)$ is estimated by the proportion of objects with topology errors that were not generated by the previous errors.

CONCLUSIONS

The proposed methodology for error evaluation in a spatial database aims at reducing the time and costs associated with qualifying and validating that database for GIS users. Additional studies will be carried out to handle the spurious polygons that can be generated by each type of error. These polygons may complicate the sampling procedure and increase the sample size, time and costs.

ACKNOWLEDGMENTS

The authors are grateful to Escola Politécnica da Universidade de São Paulo, which supported this research, and to Dr. Linda Lee Ho, who supported the statistical aspects of the multivariate Poisson distribution.

REFERENCES

Bartlett, M. S. 1975. The statistical analysis of spatial pattern. Chapman and Hall, London, 90 p.
Bellhouse, D. R. 1988. Spatial sampling. In: Kotz, S.; Johnson, N.L. (ed.)
Encyclopedia of Statistical Sciences. New York, John Wiley, V. 8, p. 581-4.
Berger, T. 1981. Information theory and coding theory. In: Kotz, S.; Johnson, N.L. (ed.) Encyclopedia of Statistical Sciences. New York, John Wiley, V. 4, p. 125-41.
Diggle, P. J. 1979. Statistical methods for spatial point patterns in ecology. In: Cormack, R.M.; Ord, J.K. (ed.) Spatial and temporal analysis in ecology. Fairland, International Co-operative Pub. House, p. 95-150.
Diggle, P. J. 1981. Some graphical methods in the analysis of spatial point patterns. In: Barnett, V. (ed.) Interpreting multivariate data. Chichester, John Wiley, chap. 4, p. 55-73.
Ho, L. L. 1995. Analysis of multivariate counts. São Paulo. 109 p. Thesis (PhD) - Escola Politécnica, Universidade de São Paulo, Brazil (in Portuguese).
Jensen, D. R. 1985. Multivariate distributions. In: Kotz, S.; Johnson, N.L. (ed.) Encyclopedia of Statistical Sciences. New York, John Wiley, V. 6, p. 43-54.
Johnson, N.L.; Kotz, S. 1969. Distributions in statistics: discrete distributions. New York, John Wiley. Chap. 11: Multivariate discrete distributions, p. 281-323.
Kish, L. 1965. Survey sampling. New York, John Wiley. Chap. 9: Area sampling, p. 301-358.
Quintanilha, J. A. 1996. Errors and uncertainties in a spatial database for GIS (non-official version). 193 p. Thesis (PhD) - Escola Politécnica, Universidade de São Paulo, Brazil (in Portuguese).
Ripley, B. D. 1981. Spatial statistics. New York, John Wiley & Sons, 252 p.
Ripley, B. D. 1988. Spatial data analysis. In: Kotz, S.; Johnson, N.L. (ed.) Encyclopedia of Statistical Sciences. New York, John Wiley, V. 8, p. 570-3.
Shaw, G.; Wheeler, D. 1985. Statistical techniques in geography. S.I.: Spatial indices & pattern analysis, Chap. 16.