Exploring Spatial Confidence with a Raster GIS

Matthew H. Pelkki 1

Abstract.-- One of the main input methods into geographic information systems remains digitizing spatial data from paper maps. While newer, automated methods such as global positioning systems promise better accuracy and quality of data for future GIS, digitizers are likely to be part of the data collection technology for some time into the future. When digitizing these data, it is important to remember that a single sample of the representation of reality is being taken. Information related to the spatial confidence of the resulting representation often is not quantified well by the digitizing software. Spatial confidence can be expressed as a percentage or degree of certainty that a represented object does indeed exist at a location or within a certain distance of that location. This research found that map registration error may not be a good indication of the raster resolution required to obtain high spatial confidence. It was also found that complex lines tend to increase the amount of error or variance in the resulting digital map.

1 Assistant Professor, University of Kentucky, Department of Forestry, Lexington, KY 40546-0073.

INTRODUCTION

Obtaining spatial data of high quality is an obvious desire of spatial data users. Positional accuracy is one important component of data quality (Antenucci et al., 1991). Error in spatial data is inherent and difficult to remove. While minimizing error has long been a goal of the spatial sciences (Chrisman, 1991), others (Aronoff, 1989) suggest that since error cannot be eliminated, it should be managed. Certainly, the first step in managing error is being able to quantify it.

The need to explicitly understand spatial confidence in a GIS is growing as digital data becomes less specialized and more widespread and mainstream. It is well known that digital data is perceived to be of higher quality than analog data due to the appearance of precision and accuracy. As developers of digital data become more and more removed from the end-users, explicit understanding of locational confidence is important to prevent mis-application of data to analyses for which it is not suitable.

Paper maps are currently the most common source for geographic data, and while automated methods such as global positioning systems promise better accuracy and quality of data for future GIS, digitizers are likely to be a part of the data collection technology for some time into the future. Historical data, and much natural resource data, will be available only on paper maps. In these cases, the map is in fact the only representation of "truth," and any one manual digitization of that map is but one sample of reality. The map scale is often used as an estimate for the accuracy of the digitized data (Fisher, 1991), and one sample of reality is often all that is taken due to time and cost constraints. This limits the information on the quality of the data, as it assumes that the first sample is truly representative of the "truth." Ideally, spatial confidence metadata should be attached to a raster GIS that states something like, "there is a 95% confidence that any object classed within a raster cell actually exists within that cell or within a radius of X cells from the indicated cell."
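As a rough illustration of how such a statement might be derived from repeated digitizations of the same feature, the sketch below (in Python) estimates the cell radius that contains 95% of the observed displacements for a single point. The function name, arguments, and sample displacements are hypothetical and are not part of the procedure or software used in this study.

    # Illustrative only: estimate a 95% "zone of inclusion" radius, in raster
    # cells, from repeated digitizations of one point.  Data are hypothetical.
    import math

    def confidence_radius(displacements_m, resolution_m, confidence=0.95):
        """Radius (in cells) containing `confidence` of the observed displacements."""
        radii_cells = sorted(math.hypot(dx, dy) / resolution_m
                             for dx, dy in displacements_m)
        index = max(0, math.ceil(confidence * len(radii_cells)) - 1)
        return radii_cells[index]

    # Ten hypothetical (dx, dy) displacements, in meters, for one point:
    samples = [(3.1, -2.4), (0.0, 5.0), (-4.2, 1.1), (2.0, 2.0), (-1.5, -3.3),
               (5.5, 0.4), (-2.8, 2.9), (1.2, -4.7), (0.9, 0.6), (-3.7, 3.2)]
    print(confidence_radius(samples, resolution_m=5.0))   # radius in cells at 5 m

Producing a confidence statement of this kind, however, requires more than one observation of each feature.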
From a single sample, even assuming that the analog data represents "truth," this is difficult to measure. This research involves exploration of the absolute locational accuracy of point and line data that is manually digitized and rasterized to various resolutions, under the assumption that the analog medium being digitized is in fact "truth."

METHODS

The GIS software chosen for this study was EPPL7 (Environmental Programming and Planning Language) release 2.1. This is a relatively simple raster-based GIS created and maintained by the State of Minnesota Land Management Information Center. It is used as a "desktop" GIS in Minnesota, and has users throughout the United States and in 15 other countries. Users include universities, several state natural resource agencies, the National Park Service, and the Fish and Wildlife Service.

Two separate spatial representations were created digitally. One representation was composed of ten randomly located points. The second spatial representation was composed of ten lines, two each having one to five line segments. This was done to simulate increasing complexity in lines, but to keep it on a quantifiable scale. According to Burrough (1986), more complex lines (those with more vertices) should have a greater degree of error associated with them. These digital representations were considered to be "truth," and were printed out on a relatively stable medium (transparency film) for manual digitizing. The scale of these printed images was 1:50,000, and the point and line width printed on the map was 1/100th of an inch, giving an approximate representation width of < 13 m.

Both the point and the line images were digitized ten times under two conditions. The first set of ten samples was collected under a single map registration, and the second set of ten samples was each collected under a different map registration. This was done to see how map registration differences affected positional accuracy. The registration standard deviation of error was recorded in every case. In the EPPL7 manual, it is recommended that the standard deviation be less than the desired resolution for rasterizing the file (LMIC, 1992). All digitizing was done in the center of a 24 x 36 inch Calcomp 3300 digitizing tablet. The room was climate controlled to reduce climate-caused distortion of the transparencies.

Once digitized, the data files were rasterized to various resolutions (5, 10, 20, and 40 meters). The digital "truth" file was also rasterized to the same resolutions. Error in absolute position was determined by overlay, and the error was weighted by the distance between any one raster cell and the "truth." Similar to the study reported by Maffini et al. (1989), this study examines positional errors. It examines the performance of a single digitizer operator over repeated trials and does not control for error properties in converting the digital "truth" to the transparency medium, errors in the digitizing tablet or equipment, nor the speed of the operator.

RESULTS

Table 1 shows the results from ten separate digitizing operations on the ten point locations using the same map registration. The distances for locational error were calculated by counting the number of raster cells the sample point was displaced from the "truth" point and multiplying by the resolution to obtain X and Y positional error.
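A minimal sketch of this cell-count calculation is given below, assuming each point is stored as (row, column) indices in rasters of a common resolution; the function and example values are illustrative and are not taken from EPPL7.

    # Illustrative only: X and Y positional error from raster cell displacement.
    def positional_error_m(truth_cell, sample_cell, resolution_m):
        """Return (x error, y error) in meters between two (row, col) cells."""
        d_rows = abs(sample_cell[0] - truth_cell[0])
        d_cols = abs(sample_cell[1] - truth_cell[1])
        return d_cols * resolution_m, d_rows * resolution_m

    # A point displaced by one column and two rows at 10 m resolution:
    print(positional_error_m((14, 7), (16, 8), 10.0))   # -> (10.0, 20.0)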
Since under each resolution there were ten points digitized ten times, the five points with the largest positional errors were discarded, and the positional error that would contain 95 points is recorded in table 1. We can see that as raster resolution gets smaller, the 95% zone of inclusion approaches a value around 20 m. Point data digitized under different map registrations is shown in table 2. The rasterized data for 40 m resolution were omitted, but a trend similar to table 1 appears to occur. As raster resolution gets smaller, a more accurate measurement of absolute position error is possible, and the maximum value which includes 95% of all points approaches some value.

Table 1. Positional error of point data digitized under the same registration and rasterized from 5 m to 40 m resolution: mean, minimum, and maximum positional error, and the distance within which 95/100 cells lie, by raster cell resolution. Std. Dev. of registration = 3.16 m.

Table 2. Positional error of point data digitized under different registrations and rasterized from 5 m to 30 m resolution: mean, minimum, and maximum positional error, and the distance within which 95/100 cells lie, by raster cell resolution (at 30 m resolution: mean 20.8 m, minimum 0 m, maximum 42.4 m, with 95/100 cells lying within 42.4 m). Std. Dev. of registrations ranged from 3.69 m to 13.17 m, and averaged 6.93 m.

Determining positional error for line data is a bit more complex. Error for lines includes overshoots and undershoots, as well as locational deviation for portions of the line. A line error index was calculated for each digitized line as it was compared to the "truth" line. The error index weighted errors by the magnitude of the deviation from the "truth" line. The formula used is as follows:

    LEI = \frac{1}{N} \sum_{i=1}^{N'} \frac{d_i}{r}

where:
    LEI = location error index
    N = number of raster cells in the "truth" line
    N' = number of raster cells in the digitized line
    r = raster cell resolution
    d_i = distance between the "truth" line and the digitized line for cell i

This index is independent of line distance, but it is not independent of resolution, since with larger raster cell sizes, small deviations between the "truth" line and the digitized line will be impossible to detect. Therefore, the error index should decrease with raster cell size but increase with line complexity. An LEI = 0.50 means that, for any given line length, 50% of that line lies in an incorrect cell location, weighted by the magnitude of the error.

Tables 3 and 4 show the LEI values for lines rasterized to various resolutions under the same and multiple map registrations, respectively. The tables also show the maximum deviation, in number of cells, of the ten sample lines from the "truth" line. From the tables, two relationships appear. The first is that as raster cell size increases, the LEI index decreases, as does the maximum number of cells of deviation for the sample lines. The second relationship is that more complex lines have higher LEI values and maximum cell deviation values, indicating that complex lines introduce more error than simple lines.

Table 3. LEI values for line data digitized under the same registration and rasterized from 5 m to 40 m resolution, by raster cell resolution and number of line segments per line (1 to 5); maximum deviation in number of cells in parentheses.

Table 4. LEI values for line data digitized under different registrations and rasterized from 5 m to 40 m resolution, by raster cell resolution and number of line segments per line (1 to 5); maximum deviation in number of cells in parentheses.
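As a rough illustration of how an index of the form given above behaves, the sketch below computes the LEI from a list of per-cell deviations; the function name and the sample deviations are hypothetical and are not the EPPL7 macros used in this study.

    # Illustrative only: location error index from per-cell deviations (meters).
    def location_error_index(deviations_m, n_truth_cells, resolution_m):
        """Deviation-weighted share of a line lying in incorrect cell locations."""
        return sum(d / resolution_m for d in deviations_m) / n_truth_cells

    # Hypothetical 40-cell "truth" line: the digitized line is off by one cell
    # (10 m) for 15 cells, by two cells (20 m) for 5 cells, and coincident
    # elsewhere (one entry per digitized cell).
    deviations = [10.0] * 15 + [20.0] * 5 + [0.0] * 22
    print(location_error_index(deviations, n_truth_cells=40, resolution_m=10.0))  # 0.625

Because deviations smaller than one cell collapse to zero after rasterizing, the same digitized line yields a lower index at coarser resolutions, consistent with the relationship noted above.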
Multiple linear regressions indicated that both of these relationships are significant at the α = 0.05 level of significance. It is also interesting to note that incorporating the map registration error into the regression model did not significantly improve the prediction of error (α = 0.10).

CONCLUSIONS

In the EPPL7 GIS, the summed standard deviation reported during map registration appears to be a poor indicator of the locational accuracy of the resulting map. Given the map scale, the point and line representation equated to just under 13 m, and so perhaps a better estimation of adequate raster cell size would be the 13 m plus two standard deviations. This would suggest a cell size of 20 meters. It appeared that multiple map registrations increased error, but the tests were uncontrolled and so no comments about the significance of any apparent numerical differences can be made.

In line data, error is strongly correlated to line complexity. This makes good intuitive sense, as the more complex the line, the greater the number of vertices required to represent that line with reasonable accuracy. The correlation between line error and target raster cell size is an indication that larger raster cells do a poor job of calculating distance: as the raster cell size increases, differences between the digitized line and the "truth" line that are less than the raster cell size are rounded to zero. However, for maps covering large areas, small raster cell size requires a great deal of processing time and storage.

This preliminary work has identified the need to incorporate various map scales into the procedure, particularly larger scale maps that will better allow testing of the effect of registration error on overall locational error. Testing multiple digitizing personnel, and various levels of operator experience, might also prove interesting.

ACKNOWLEDGMENTS

Special thanks go to Linda Delay, a student worker who performed the digitizing and ran the computer macros to determine the error components. Without her time and effort on data collection, this project would still be in the conceptual stage.

REFERENCES

Antenucci, J. C., K. Brown, P. L. Croswell, M. J. Kevany, and H. Archer. 1991. Geographic information systems: A guide to the technology. Van Nostrand Reinhold, New York, NY. 301 p.

Aronoff, S. 1989. Geographic information systems: A management perspective. WDL Publications, Ottawa, Canada. 294 p.

Burrough, P. A. 1986. Principles of geographical information systems for land resources assessment. Oxford University Press, New York, NY. 194 p.

Chrisman, N. R. 1991. The error component in spatial data. In Geographical Information Systems, Volume 1: Principles, D. J. Maguire, M. F. Goodchild, and D. W. Rhind, editors. John Wiley and Sons, New York. Pp. 165-174.

Fisher, P. F. 1991. Spatial data sources and data problems. In Geographical Information Systems, Volume 1: Principles, D. J. Maguire, M. F. Goodchild, and D. W. Rhind, editors. John Wiley and Sons, New York. Pp. 175-189.

LMIC. 1992. EPPL7 User's Guide, Release 2.0, Tutorial Chapter, page 110. Land Management Information Center, St. Paul, MN.

Maffini, G., M. Arno, and W. Bitterlich. 1989. Observations and comments on the generation and treatment of error in digital GIS data. In Accuracy of spatial databases, M. Goodchild and S. Gopal, editors. Taylor and Francis, Bristol, PA. Pp. 55-68.

BIOGRAPHICAL SKETCH

Matthew H. Pelkki is an Assistant Professor of Forest Management and Economics at the University of Kentucky Department of Forestry.
He graduated with a B.S.F. in 1985 from the University of Michigan's School of Natural Resources & Environment, and then earned an M.S. (1988) and Ph.D. (1992) from the University of Minnesota's College of Natural Resources. Matthew has been an assistant professor at the University of Kentucky since 1991, where he teaches courses in timber management, integrated forest resource management, and applications of GIS in natural resources. His research interests include dynamic programming and stand-level optimization, natural resource information system planning and design, and the economics of bioremediation/ecological restoration.