Date: 16 September 2009 and afterwards
To: Drs. Claude Boyd and Phil Chaney
From: Mike Polioudakis and Nathan Campbell
Subject: Statistics on ground truth of GIS method for finding inland water

SUMMARY

Ground truth data on location and size of ponds in Lee County was compared with data from a GIS map that had been generated from satellite information. The map gave reliable information for ponds located outside urban areas but not within urban areas. Urban areas produced many false positives: apparent ponds that turned out to be other features such as parking lots. It would be hard to correct this bias in urban areas without extensive and complicated remodeling. The map is useful in its present form as long as people keep the limitations in mind. Simple statistics on matching of map to ground truth are given.

INTRODUCTION

Nathan Campbell did all the field work. Mike Polioudakis created and processed the GIS map and wrote this document. Unless specified otherwise, the GIS program used was ERDAS Imagine and the statistical program used was MS Excel. Except for integers where an exact count could be done, the last digit in a number was rounded.

The project summarized in this document consisted of “ground truth-ing” data on pond size and pond location from a GIS map of inland water in Alabama that was constructed using satellite data. From that map, 277 random points were generated from within water bodies. Each point could be used to specify a water body. Nathan Campbell located the points on Internet mapping services such as Google Maps. He visited the points to determine their status, as coded below. Mr. Campbell (A) assessed whether a pond was at the site indicated on the GIS map, and (B) estimated the size of the pond, independently of the size reported on the GIS map, by simple visual inspection and/or by using a laser sighting device. He had earlier practiced assessing pond size on ponds of known size that were not part of the sample points.
The code numbers in parentheses are for convenience here. The color in parentheses is for anybody who sees the Excel data.

(1) (yellow)   No access, so could not be evaluated       19 sites
(2) (red)      Present on map but not found on ground     45 sites
(3) (green)    Present on map and found on ground        177 ponds (sites)
(4) (blue)     Duplicate points                           36 sites

STATISTICS ON FOUND POINTS

Duplicate points (4, blue) occurred in the sampling process because the algorithm that samples points chooses pixels (rasters) from all the pixels available in a category (water) regardless of whether two pixels are in the same pond (same clump). Ponds consisted of more than one pixel, so more than one pixel might be chosen out of the same pond, resulting in a duplicate. Duplicates can be discounted from assessments.

Some points were not accessible to confirm whether they were present on the ground, and some points could not be assessed for size. Points were not accessible because of local terrain or because they were on private property for which Mr. Campbell could not get permission to enter. Points that were not accessible at all are removed from assessments. Points that could not be properly sized are removed from any size assessments.

Results: Found points (3, green) plus not found points (2, red) total 222. On that basis:

0.20270 (20.3%) of the points were not found (2, red)
0.79730 (79.7%) of the points were found (3, green)

In addition, we were able to divide the points into (A) rural; (B) peri-urban; and (C) urban. We used the USGS TIGER vector map as the basis for the division. That map is based on US Census data. Urban points all lay within the USGS designated urban areas for Lee County. Rural and peri-urban (B) points all lay outside the USGS designated urban areas. Peri-urban points were assigned according to personal knowledge of Lee County and by visual inspection of the maps. “Peri-urban” includes ponds in the areas that are now often called “ex-urban” or “suburban”.
We will see that (B) peri-urban points belong most properly with rural points, but separating them initially was a useful way to feel confident in this assessment. Table 1 summarizes the results.

                   Rural    Peri-urban    R and P-U     Urban
Found (3)            126            16          142        35
Ratio             0.8811        1.0000       0.8931    0.6604
Not Found (2)         17             0           17        18
Ratio             0.1189             0       0.1069    0.3396
Totals               143            16          159        53

The ratio of found points is similar for rural and peri-urban points: about 88% for rural points and 100% for the 16 peri-urban points, or about 89% combined. The ratio of found points is considerably lower for urban points, about 66%.

The urban area gave many false positives. Some places indicated water according to the criteria that were used to make the GIS map, but those points of apparent water were not ponds. Those points included rooftops, parking lots, swimming pools, large air conditioner facilities, etc. It is not clear whether recent rain helped to create these false positives (data on rainfall at the time of the satellite passages from which the GIS map was derived was not immediately available), but that is probably so.

Initially separating peri-urban points allowed us to clarify that the problem was limited to the urban area. Care is needed when assessing the GIS data within an urban area, but the GIS map seems to be fairly reliable outside of the immediate urban area, even in suburban or ex-urban terrain.

IMPUTING SIGNIFICANCE TO STATISTICS ON FOUND POINTS

By making a few statistical assumptions, it is possible to approximately assess the probability that these degrees of accuracy occurred by chance alone. Think of “found or not found” as “hit or miss” in a Poisson distribution. Assume the mean or expected “hit” ratio (found ratio) is about 0.75. Rural and peri-urban are grouped together. The results are as below:

                R and P-U      Urban
Found                 142         35
Not Found              17         18
Total                 159         53
Expected              119         40
Probability        0.0041    0.04854

For Rural and Peri-Urban: the probability that exactly 142 ponds were found when 119 were expected, at a rate of 75% success, is 0.0041.
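The probabilities in the table can be checked with a short calculation. The sketch below assumes they are Poisson point probabilities with the expected count as the mean, which is what Excel's POISSON function returns when its cumulative argument is FALSE. The function names `poisson_pmf` and `poisson_range` are illustrative only and are not part of the original work.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k hits when `mean` hits are expected (Poisson)."""
    # Work in logarithms so that large factorials do not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_range(low, high, mean):
    """Probability of a hit count from low to high, inclusive."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))


# Found counts come from Table 1. Expected counts are 75% of each total,
# rounded to whole points as in the memo (159 -> 119, 53 -> 40).
rural_found, rural_expected = 142, 119
urban_found, urban_expected = 35, 40

print("Rural and peri-urban:", round(poisson_pmf(rural_found, rural_expected), 4))
print("Urban:", round(poisson_pmf(urban_found, urban_expected), 5))
```

Under these assumptions the two printed values should agree with the Probability row of the table. Summing the same function over a range of counts with `poisson_range` gives the probability of landing anywhere in that range.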
This finding for the rural and peri-urban points indicates, but does not establish, that ponds were found at a rate well above the 75% that we assumed.

For Urban: the probability that exactly 35 ponds were found when 40 were expected is 0.04854. This finding indicates, but does not establish, that ponds were found at a rate below the 75% that we assumed. It supports the observation that the success rate in the urban area was noticeably less than expected. This low success rate in the urban area drove down the overall success rate to about 80%.

The finding for the rural and peri-urban sites can be strengthened by creating the equivalent of a “confidence interval”. The fact that the probability of finding exactly 142 points is low (if we expected to find 75% of the points) is not by itself very telling. By applying the same procedure to a range of counts of found points, we establish that the probability of finding from 139 to 145 points equals 0.0304. The interval of 139 to 145 found points is the 0.03 confidence interval if we expected to find 75% of points. This fact more clearly indicates a higher-than-expected success rate in the rural and peri-urban combined area. The size of the confidence interval can be expanded, or made one-tailed, with similarly significant results. This rate of success in the rural and peri-urban area is reliably high.

A similar procedure can be used to establish that the success rate in the urban area is reliably low, but the result is not as clearly significant, so the results are not given here.

SIZE ASSESSMENT OF PONDS

The size of ponds as determined from the GIS map was compared to the size of ponds as determined in the field using estimation or using optical measurement and calculation. We can only assess size among ponds that were actually found in the field. For this assessment, Nathan Campbell counted by hand the number of pixels on the GIS map associated with each pond.
The reason why we followed this procedure will be explained below after the data.

The “Correlation” (“Correl”) function from MS Excel was used to assess the correlation between the map size of ponds and the measured size of ponds. The function returns Pearson’s “r” but does not return a value for “n” or “p”. The figures below are for “raw” r, not for r squared.

Rural correl                    = 0.9523
Peri-urban correl               = 0.9740
Rural plus peri-urban correl    = 0.9596
Urban correl                    = 0.8078
Overall correl                  = 0.9493

Rural and peri-urban figures are closely comparable. Because about a third of the urban points were not found on the ground (see above), the apparently high correlation for urban ponds should be considered carefully. It means that the map is accurate for the ponds that were found; it says nothing about how many of the mapped points were actually ponds. Even though the correlation for urban ponds is high, it is lower than the figures for rural and peri-urban. The comparatively lower correlation in urban ponds lowers the overall correlation slightly. Still, this degree of correlation indicates that the figures for size taken from the GIS map can be relied upon for most purposes, especially outside of urban areas.

Exactly what constitutes a pond is not always clear from the GIS map. In about two-thirds of the cases, pond boundaries were clear, so that the number of pixels in a pond could be determined exactly. However, sometimes a pond has nearby outlier pixels that probably belong to the pond but would not be counted as part of the pond by GIS procedures because the pixels are not immediately adjacent (touching). Sometimes outliers trailed off in strings or formed their own clusters, and these features probably should not be counted as part of the pond. When counting pixels from the GIS map, Mr. Campbell counted both a lower range of pixels to include as part of a pond and an upper range.
About one-third of the ponds were reported with both an upper and lower pixel range. Nearly all of these were middle or large size ponds. For the correlations given above, we used only the lower range of counted pixels. Further work is indicated on how the accuracy of pond boundary assessment affects the accuracy of size assessment.

Correlation was highest among small ponds and medium sized ponds because they showed fewer problems of pond boundaries than large size ponds. Data on size effects is not given here because size effects cannot be separated from boundary problems. Further work on correlation on medium sized and large sized water bodies around the state of Alabama is in progress.

This data needs to be “run through” a statistical program that provides “n” and “p” automatically, and that might allow easier manipulation of the range problem.
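As a minimal sketch of what such a program would report, the code below computes Pearson's r together with n and an approximate two-sided p for paired pond sizes. It is an illustration only: the function name `pearson_summary` and the example values are hypothetical and are not taken from the field data. The p value uses the Fisher z-transform with a normal approximation, which is a stand-in for the exact t-based p that a full statistics package would give.

```python
import math
from statistics import NormalDist


def pearson_summary(map_sizes, field_sizes):
    """Return (r, n, approx_p) for paired map and field pond sizes."""
    n = len(map_sizes)
    if n != len(field_sizes):
        raise ValueError("each pond needs one map size and one field size")
    if n < 4:
        raise ValueError("the Fisher approximation needs at least 4 ponds")

    mean_x = sum(map_sizes) / n
    mean_y = sum(field_sizes) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(map_sizes, field_sizes))
    sxx = sum((x - mean_x) ** 2 for x in map_sizes)
    syy = sum((y - mean_y) ** 2 for y in field_sizes)
    if sxx == 0 or syy == 0:
        raise ValueError("sizes must vary for a correlation to exist")

    r = sxy / math.sqrt(sxx * syy)

    # Fisher z-transform: atanh(r) is roughly normal with standard
    # deviation 1 / sqrt(n - 3) when the true correlation is zero.
    # Clamp r slightly inside (-1, 1) so that atanh stays finite.
    r_safe = max(min(r, 1 - 1e-12), -1 + 1e-12)
    z = math.atanh(r_safe) * math.sqrt(n - 3)
    approx_p = 2 * (1 - NormalDist().cdf(abs(z)))
    return r, n, approx_p


# Hypothetical example values, NOT the project's measurements.
example_map_sizes = [3, 5, 8, 12, 20, 31]
example_field_sizes = [0.6, 1.1, 1.9, 2.6, 4.8, 6.9]

r, n, p = pearson_summary(example_map_sizes, example_field_sizes)
print(f"r = {r:.4f}, n = {n}, approximate p = {p:.4g}")
```

Running the function once with the lower pixel counts and once with the upper pixel counts would be one simple way to see how much the range problem moves r.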