Date: 16 September 2009 and afterwards
To: Drs. Claude Boyd and Phil Chaney
From: Mike Polioudakis and Nathan Campbell
Subject: Statistics on ground truth of GIS method for finding inland water
SUMMARY
Ground truth data on location and size of ponds in Lee County was compared with data from a
GIS map that had been generated from satellite information. The map gave reliable information
for ponds located outside urban areas but not within urban areas. Urban areas produced many
false positives of apparent ponds that turned out to be other features such as parking lots. It
would be hard to correct this bias in urban areas without extensive and complicated remodeling. The map is useful in its present form as long as people keep the limitations in mind.
Simple statistics on matching of map to ground truth are given.
INTRODUCTION
Nathan Campbell did all the field work. Mike Polioudakis created and processed the GIS map
and wrote this document.
Unless specified otherwise, the GIS program used was ERDAS Imagine and the statistical
program used was MS Excel. Except for integers where an exact count could be done, the last
digit in a number was rounded.
The project summarized in this document consisted of “ground truth-ing” data on pond size and
pond location from a GIS map for inland water in Alabama that was constructed using satellite
data.
From that map, 277 random points were generated from within water bodies. Each point could
be used to specify a water body. Nathan Campbell located the points on Internet computer
mapping services such as Google Maps. He then visited the points to determine their status,
as listed below. At each site Mr. Campbell (A) assessed whether a pond was present at the
location indicated on the GIS map, and (B) estimated the size of the pond, independently of
the size reported on the GIS map, by simple visual inspection and/or by using a laser sighting
device. He had earlier practiced assessing pond size on ponds of known size that were not part
of the sample points.
The code numbers in parentheses are for convenience here. The color in parentheses is for
anybody who sees the Excel data.
(1) (yellow)  No access, so could not be evaluated      19 sites
(2) (red)     Present on map but not found on ground    45 sites
(3) (green)   Present on map and found on ground        177 ponds (sites)
(4) (blue)    Duplicate points                          36 sites
STATISTICS ON FOUND POINTS
Duplicate points (4, blue) occurred in the sampling process because the algorithm that samples
points chooses pixels (rasters) from all the pixels available in a category (water) regardless of
whether two pixels are in the same pond (same clump). Ponds consisted of more than one
pixel, so more than one pixel might be chosen out of the same pond, resulting in a duplicate.
Duplicates can be discounted from assessments.
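The way duplicates arise can be illustrated with a minimal sketch. It is written in Python, not in ERDAS Imagine, and everything in it is hypothetical: the pond identifiers, the pixel counts, and the function names are invented for illustration and are not the survey data or the actual sampling algorithm. Pixels are drawn from the pool of all water pixels without regard to pond, and any pixel that falls in a pond already sampled is flagged as a duplicate.

```python
import random

def sample_water_pixels(pond_sizes, n_points, seed=0):
    """Draw n_points pixels from the pool of ALL water pixels.

    pond_sizes maps a pond id to its pixel count (hypothetical values).
    The draw ignores which pond (clump) a pixel belongs to.
    """
    pixels = [(pond, i) for pond, size in pond_sizes.items() for i in range(size)]
    rng = random.Random(seed)
    return rng.sample(pixels, n_points)

def split_duplicates(points):
    """Keep the first sampled pixel per pond; later pixels in that pond are duplicates."""
    seen, unique, duplicates = set(), [], []
    for pond, pixel in points:
        if pond in seen:
            duplicates.append((pond, pixel))
        else:
            seen.add(pond)
            unique.append((pond, pixel))
    return unique, duplicates

# Hypothetical ponds: large ponds hold most of the pixels, so they attract most duplicates.
ponds = {"A": 40, "B": 3, "C": 12, "D": 1}
points = sample_water_pixels(ponds, 10)
unique, duplicates = split_duplicates(points)
print(len(unique), "distinct ponds,", len(duplicates), "duplicate points")
```

Because the sketch has only four ponds, ten sampled points must contain at least six duplicates whatever the random draw.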
Some points were not accessible to confirm whether they were present on the ground, and
some points could not be assessed for size. Points were not accessible because of local terrain
or because they were on private property for which Mr. Campbell could not get permission to
enter. Points that were not accessible at all are removed from assessments. Points that could
not be properly sized are removed from any size assessments.
Results: Found points (3, green) plus not found points (2, red) total 222. On that basis:
0.20270 (20.3%) of the points were not found (2, red)
0.79730 (79.7%) of the points were found (3, green)
In addition, we were able to divide the points into (A) rural; (B) peri-urban; and (C) urban. We
used the USGS Tiger vector map as the basis for the division. That map is based on US
Census data. Urban points all lay within the USGS designated urban areas for Lee County.
Rural and peri-urban (B) points all lay outside the USGS designated urban areas. Peri-urban
points were assigned according to personal knowledge of Lee County and by visual inspection
of the maps. “Peri-urban” includes ponds in the areas that are now often called “ex-urban” or
“suburban”. We will see that (B) peri-urban points belong most properly with rural points,
but separating them initially was a useful way to feel confident in this assessment.
Table 1 summarizes the results.
                Rural    Peri-urban   R and P-U   Urban
Found (3)       126      16           142         35
Ratio           0.8811   1.0000       0.8931      0.6604
Not Found (2)   17       0            17          18
Ratio           0.1188   0            0.1069      0.3396
Totals          143      16           159         53
The ratio of found points is high for both rural points (about 88%) and peri-urban points
(100%), and is about 89% for the two combined. The ratio of found points is considerably lower
for urban points, about 66%.
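The ratios in Table 1 follow directly from the counts. The sketch below recomputes them in Python (the memo itself used MS Excel); the dictionary layout and the function name found_ratio are assumptions made for illustration, while the counts are taken from Table 1.

```python
# (found, not found) counts from Table 1
counts = {
    "Rural": (126, 17),
    "Peri-urban": (16, 0),
    "Urban": (35, 18),
}
# Combined rural and peri-urban column
counts["R and P-U"] = (126 + 16, 17 + 0)

def found_ratio(found, not_found):
    """Share of assessed points at which a pond was found on the ground."""
    return found / (found + not_found)

for area, (found, not_found) in counts.items():
    ratio = found_ratio(found, not_found)
    print(f"{area:12s} total = {found + not_found:3d}  found ratio = {ratio:.4f}")
```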
The urban area gave many false positives. Some places indicated water according to criteria
that were used to make the GIS map but those points of apparent water were not ponds. Those
points included rooftops, parking lots, swimming pools, large air conditioner facilities, etc. It is
not clear whether recent rain helped to create these false positives (data on rainfall at the time
of the satellite passages from which the GIS map was derived were not immediately available),
but it probably did. Initially separating peri-urban points allowed us to clarify that the problem
was limited to the urban area. Care is needed when assessing the GIS data within an urban
area; but the GIS map seems to be fairly reliable outside of the immediate urban area, even in
suburban or ex-urban terrain.
IMPUTING SIGNIFICANCE TO STATISTICS ON FOUND POINTS
By making a few statistical assumptions, it is possible to approximately assess the probability
that these degrees of accuracy occurred by chance alone. Think of “found or not found” as “hit
or miss” in a Poisson distribution. Assume the mean or expected “hit” ratio (found ratio) is about
0.75. The expected number of hits (0.75 times the total number of points, rounded to a whole
number) serves as the mean of the distribution. Rural and peri-urban are grouped together. The
results are as below:
              Rural and peri-urban   Urban
Found         142                    35
Not Found     17                     18
Total         159                    53
Expected      119                    40
Probability   0.0041                 0.04854
For Rural and Peri-Urban: the probability that exactly 142 ponds were found when 119 were
expected, at a rate of 75% success, is 0.0041. This finding indicates, but does not establish,
that the rate of finding ponds was much greater than would occur by chance if we expected 75%
of the points to be ponds.
For Urban: the probability that exactly 35 ponds were found when 40 were expected is 0.04854.
This finding indicates, but does not establish, that the rate of finding ponds was less than
would occur by chance if we expected 75% of the points to be ponds. The success rate in
the urban area was noticeably less than expected, and this low success rate drove the
overall success rate down to about 80%.
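The two probabilities can be reproduced with a short calculation. The sketch below is in Python rather than MS Excel, and the function name poisson_pmf is hypothetical; the memo does not say which formula was entered. It evaluates the Poisson probability of an exact count, taking the rounded expected count as the mean. With 142 found against an expected 119, and with 35 found against an expected 40, the results should agree with the reported 0.0041 and 0.04854 to rounding.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable with the given mean.

    Computed in log space (lgamma(k + 1) = ln k!) so that large
    factorials do not overflow.
    """
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

# Rural and peri-urban combined: 142 found, 119 expected
p_rural = poisson_pmf(142, 119)
# Urban: 35 found, 40 expected
p_urban = poisson_pmf(35, 40)

print(f"rural and peri-urban: {p_rural:.4f}")
print(f"urban:                {p_urban:.5f}")
```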
The finding for the rural and peri-urban sites can be strengthened by creating the equivalent of a
“confidence interval”. The fact that the probability of finding exactly 142 points is low (if we
expected to find 75% of the points) is not by itself very telling. By applying the same procedure
to a range of points, we establish that the probability of finding from 139 to 145 points equals
0.0304. The interval of 139 to 145 found points is the 0.03 confidence interval if we expected to
find 75% of points. This fact more clearly indicates a higher-than-expected success rate in the
rural and peri-urban combined area. The size of the confidence interval can be expanded, or
made one-tailed, with similar significant results. This rate of success in the rural and peri-urban
area is reliably high.
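The interval figure can be checked in the same way, by summing the Poisson probabilities of each count from 139 to 145 with a mean of 119. This is a minimal sketch in Python with hypothetical function names. The upper-tail function illustrates the one-tailed variant mentioned above; its result is not a figure taken from this memo.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_range(low, high, mean):
    """P(low <= X <= high), both ends inclusive."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))

def poisson_upper_tail(k, mean):
    """One-tailed probability P(X >= k) = 1 - P(X <= k - 1)."""
    return 1.0 - poisson_range(0, k - 1, mean)

interval = poisson_range(139, 145, 119)
tail = poisson_upper_tail(142, 119)
print(f"P(139 <= X <= 145) = {interval:.4f}")
print(f"P(X >= 142)        = {tail:.4f}")
```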
A similar procedure can be used to establish that the success rate in the urban area is reliably
low, but the result is not as clearly significant, so the results are not given here.
SIZE ASSESSMENT OF PONDS
The size of ponds as determined from the GIS map was compared to the size of ponds as
determined in the field using estimation or using optical measurement and calculation. We can
only assess size among ponds that were actually found in the field.
For this assessment, Nathan Campbell counted by hand the number of pixels on the GIS map
associated with each pond. The reason why we followed this procedure will be explained below
after the data. The “Correlation” (“Correl”) function from MS Excel was used to assess the
correlation between the map size of ponds and the measured size of ponds. The function
returns Pearson’s “r”; it does not return a value for “n” or “p”. The figures below are for
“raw” r, not for r squared.
Rural correl = 0.9523
Peri-urban correl = 0.9740
Rural plus peri-urban correl = 0.9596
Urban correl = 0.8078
Overall correl = 0.9493
Rural and peri-urban figures are closely comparable.
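For readers without Excel, Pearson's r can be computed directly from paired values. The sketch below is in Python; the function names and the sample values are hypothetical and are NOT the survey data. It also reports n and the t statistic for r, the quantity from which a p value is normally derived, since the Excel function does not supply them.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two equal-length series with at least 3 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.

    Undefined when r is exactly +1 or -1.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical example: pixel counts from a map and field-estimated
# sizes in arbitrary units. These are invented values.
map_pixels = [4, 9, 15, 22, 40, 63]
field_size = [0.9, 2.1, 3.2, 5.4, 8.8, 15.0]

r = pearson_r(map_pixels, field_size)
n = len(map_pixels)
print(f"n = {n}, r = {r:.4f}, t = {t_statistic(r, n):.2f}")
```

A statistics package would convert t into a p value; SciPy's scipy.stats.pearsonr, for example, returns both r and p.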
Because about one-third of urban points were not found on the ground (see above), the
apparently high correlation for urban ponds should be considered carefully. It means that the
map is accurate in size for the ponds that were found, but it says nothing about how many
mapped points are actually ponds.
Even though the correlation for urban ponds is high, it is lower than the figures for rural and
peri-urban. The comparatively lower correlation in urban ponds lowers the overall correlation
slightly.
Still, this degree of correlation indicates that the figures for size taken from the GIS map can be
relied upon for most purposes, especially outside of urban areas.
Exactly what constitutes a pond is not always clear from the GIS map. In about two-thirds of the
cases, pond boundaries were clear, so that the number of pixels in a pond could be determined
exactly. However, sometimes a pond has nearby outlier pixels that probably are attached to the
pond but would not be counted as part of the pond by GIS procedures because the pixels are
not immediately adjacent (touching). Sometimes outliers trailed off in strings or formed their
own clusters, and these features probably should not be counted as part of the pond. When
counting pixels from the GIS map, Mr. Campbell counted both a lower range of pixels to include
as part of a pond and an upper range. About one-third of the ponds were reported with both an
upper and lower pixel range. Nearly all of these were middle or large size ponds. For the
correlations given above, we used only the lower range of counted pixels. Further work is
needed on how the choice of pond boundary (lower or upper pixel range) affects the accuracy of
size assessment.
Correlation was highest among small ponds and medium sized ponds because they showed
fewer pond boundary problems than large ponds. Data on size effects are not given here
because size effects cannot be separated from boundary problems. Further work on correlation
for medium sized and large water bodies around the state of Alabama is in progress.
This data needs to be “run through” a statistical program that provides “n” and “p” automatically,
and that might allow easier manipulation of the range problem.