The effects of uncertainty in forest inventory plot locations

Ronald E. McRoberts, Geoffrey R. Holden, and Greg C. Liknes
North Central Research Station, USDA Forest Service, Saint Paul, Minnesota 55108 USA

Abstract

Data from forest inventory plots are used to obtain a variety of estimates, many of which require knowledge of exact plot locations. Although geographic coordinates of plot locations are measured with greater precision than in the past, the measurements are still subject to uncertainty. In addition, to comply with policies that prohibit disclosure of exact locations, some inventory programs release only perturbed plot locations to the public. For design-based estimates, a worst case scenario of the effects of uncertainty in plot locations is evaluated in terms of the radius of circular areas of interest and the maximum uncertainty in plot locations. In a model-based context, the effects of uncertainty in plot locations on predictions from a logistic regression model calibrated using inventory data and satellite imagery are investigated. Finally, a method is developed for circumventing the effects of perturbed plot locations when using spectral values of satellite imagery as independent variables in modeling applications.

Introduction

Many applications of forest inventory data are spatial in nature and require knowledge of exact plot locations. Examples include estimating the volume of timber within a specified distance of a mill and calibrating models using inventory plot data and the spectral values of satellite imagery for pixels containing the plot centers. Although coordinates of plot locations are quite precisely measured using global positioning system (GPS) receivers, aerial imagery, and digitization methods, the coordinates still have uncertainty associated with them.
In addition, for privacy and sample integrity reasons, some forest inventory programs are prohibited from disclosing exact plot locations to the public. For example, the Forest Inventory and Analysis (FIA) program of the Forest Service, U.S. Department of Agriculture, randomly perturbs all plot locations by as much as 0.8 km (0.5 mi) and some by as much as 1.6 km (1.0 mi) before releasing them to the public. In these situations, uncertainty in plot locations may contribute to both bias and uncertainty in estimates based on inventory data. The objectives of the study were threefold: (1) to estimate the effects of uncertainty in plot locations on the uncertainty in design-based estimates, (2) to estimate the effects of uncertainty in plot locations on the uncertainty in model-based estimates, and (3) to investigate methods for circumventing the effects of perturbed plot locations for model-based applications that use spectral values of satellite imagery as independent variables.

Methods

Estimation using inventory plot data has historically been design-based, although model-based applications are becoming more extensive. The properties of design-based estimators derive from the sampling designs used to obtain the data. Design-based estimates often consist of plot-based means and variances of means for selected areas of interest (AOI). For design-based estimates, the primary effect of uncertainty in plot locations is that the set of plots determined to be in an AOI will exclude some plots that are actually in the AOI and include some that are actually outside the AOI. The negative effects of the uncertainty in plot locations decrease as the uncertainty decreases, as the size of circular AOIs increases, and as the strength of spatial correlation among plot observations increases.

The properties of model-based estimators derive from the mathematical forms of the models, unexplained residual variability around model predictions, and the spatial correlation among residuals.
For this study, model-based estimation entails formulating mathematical models of the relationships between dependent and independent variables, predicting the value of the dependent variable for each estimation unit in the AOI, and calculating the mean of predictions over all estimation units in the AOI. When the dependent and independent variables are observed at the same geographic location and the plot coordinates themselves are not independent variables, then the estimated model of the relationship is unaffected by uncertainty in plot locations, although the spatial correlation among model prediction residuals may be poorly estimated.

For many analyses, however, the dependent and independent variables are not observed coincidentally. For example, the independent variables may be spectral values of satellite image pixels containing the centers of the sampling units on which the dependent variable is observed. In this case, uncertainty arises in model-based estimates as a result of uncertainty in the plot locations, errors in image registration, and errors in co-registration of the plot locations to the image. When the independent and dependent variables are not observed at the same geographic location, bias may be introduced into the model predictions, residual variability may increase, and the spatial correlation among residuals may be poorly estimated. The effects of uncertainty in plot locations on model-based estimates decrease as the uncertainty in plot location decreases, as registration errors decrease, as the size of the sampling unit increases, and as the strength of the spatial correlation among observations of the variables increases. However, unlike the case of design-based estimation, the negative effects of uncertainty in plot locations do not necessarily decrease as the size of the AOI increases.

McRoberts et al.
(in press) proposed a framework for systematically estimating the effects of uncertainty in forest inventory plot locations on design- and model-based estimates. The framework proposed estimating the effects separately for design- and model-based estimation, and considered three factors: (1) the range and strength of spatial correlation, (2) the sizes of AOIs, and (3) the spatial resolution of the units on which variables are observed.

The effects of uncertainty in plot locations on design-based estimates

For a circular AOI (Figure 1), the expected correlation, ρ, between design-based estimates using exact plot locations and estimates using locations with uncertainty may be expressed in terms of five quantities: (1) the radius, R, of the AOI; (2) the distribution of plot location errors in the interval, [-rmax, rmax], where rmax is the maximum uncertainty; (3) the number of plots with exact locations in B whose locations with uncertainty place them in C; (4) the number of plots with exact locations in C whose locations with uncertainty place them in B; and (5) the spatial correlation of the attribute of interest.

[Figure 1. Circular AOI of radius R, showing the inner region A, the ring B of width rmax inside the AOI boundary, and the ring C of width rmax outside the AOI boundary.]

A worst case scenario occurs when all plots with exact locations in B are replaced by plots with exact locations in C and when observations for the plots in B are uncorrelated with observations for plots in C. The latter condition is truly worst case and occurs only when the status of the forest in B is substantially different from the status of the forest in C; e.g., B is forest and C is nonforest. Under this scenario, assuming a maximum uncertainty distance of rmax, the worst case correlation between means estimated using data from exact locations and data from locations with uncertainty may be expressed as

$$\rho = \frac{\mathrm{Area}_A}{\mathrm{Area}_A + \mathrm{Area}_B} = \frac{\pi (R - r_{max})^2}{\pi R^2} = \left(1 - \frac{r_{max}}{R}\right)^2 \qquad [1]$$

The worst case correlation with respect to AOI radius, R, is shown in Figure 2 for four values of rmax: 0.04 km, 0.20 km, 0.80 km, and 1.60 km. The first value of rmax corresponds approximately to maximum GPS error, and the third and fourth values correspond to the intermediate and maximum plot location perturbing distances, respectively, used by the FIA program.

[Figure 2. Correlations for worst case scenarios. Correlation (y-axis) versus AOI radius in km (x-axis), with one curve for each maximum perturbing distance.]

The effects of uncertainty in plot locations on model-based estimation

McRoberts (in review) developed a logistic regression model to predict the probability of forest for estimation units corresponding to individual satellite image pixels:

$$E(p_i) = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \dots + \beta_m X_{im})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \dots + \beta_m X_{im})} \qquad [2]$$

where E(.) denotes statistical expectation, pi is the probability of forest for the ith pixel, exp(.) is the exponential function, Xij is the value of the jth spectral band for the ith pixel, and the βs are parameters to be estimated. Observations of forest/nonforest were obtained from FIA plots, and the spectral data were obtained from Landsat TM/ETM+ imagery.

The FIA field plot consists of four 7.31-m (24-ft) radius circular subplots. The subplots are configured as a central subplot and three peripheral subplots with centers located at 36.58 m (120 ft) and azimuths of 0°, 120°, and 240° from the center of the central subplot. Locations of forested or previously forested plots are measured using GPS receivers, while locations of non-forested plots are measured using aerial imagery and digitization methods. For this study, inventory data for three 15-km radius circular study areas in Minnesota, USA, were used and consisted of observations for 83 plots for which 200 subplots were completely forested and 132 subplots were completely non-forested.
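As an illustrative aside, Equations [1] and [2] can be evaluated directly. The following Python sketch is not the authors' code: the function names are hypothetical, and the sketch assumes that R and rmax are expressed in the same units with 0 ≤ rmax ≤ R. The inversion of Equation [1] for the radius is simple algebra added here for convenience and does not appear in the text above.

```python
import math


def worst_case_correlation(radius, r_max):
    """Equation [1]: rho = (1 - r_max / R)^2 for a circular AOI of radius R."""
    if radius <= 0 or r_max < 0 or r_max > radius:
        raise ValueError("requires radius > 0 and 0 <= r_max <= radius")
    return (1.0 - r_max / radius) ** 2


def radius_for_correlation(rho, r_max):
    """Invert Equation [1]: the AOI radius at which the worst case
    correlation equals rho (hypothetical helper, 0 <= rho < 1)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return r_max / (1.0 - math.sqrt(rho))


def forest_probability(betas, spectral_values):
    """Equation [2]: logistic probability of forest for one pixel.

    betas[0] is the intercept; betas[1:] pair with spectral_values.
    """
    if len(betas) != len(spectral_values) + 1:
        raise ValueError("need one intercept plus one beta per predictor")
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], spectral_values))
    # Two algebraically equal forms of exp(eta) / (1 + exp(eta)),
    # chosen so that math.exp never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


if __name__ == "__main__":
    # FIA perturbing distance of 0.8 km and a 16-km AOI radius.
    print(worst_case_correlation(16.0, 0.8))  # approximately 0.9025
    print(radius_for_correlation(0.90, 0.8))
```

The coefficient values passed to `forest_probability` in any application would come from model calibration; none are given in this paper, so none are shown here.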
Landsat imagery for two Minnesota scenes, rows 27 and 28 of path 28, for three dates corresponding to early, peak, and late vegetation green-up was used. The spatial configuration of the FIA subplots with centers separated by 36.58 m and the 30-m x 30-m spatial resolution of the TM/ETM+ imagery permit individual subplots to be associated with individual image pixels. The subplot area of 167.87 m² is approximately 19 percent of the 900 m² pixel area. The satellite image-based predictor variables consisted of the normalized difference vegetation index (NDVI) and the greenness, brightness, and wetness tasseled cap (TC) transformations of the spectral values, scaled to the interval [0, 255], for each of the three image dates. The calibration of the logistic regression model was based on data aggregated from all three study areas.

Within each of the three 15-km radius study areas, design- and model-based estimates of proportion forest area were calculated. To estimate the effects of uncertainty in inventory plot locations, four maximum perturbation distances were selected: rmax = 0.05 km, rmax = 0.20 km, rmax = 0.80 km, and rmax = 1.60 km. The first value corresponds, approximately, to the sum of maximum GPS error and maximum image registration error; the third and fourth values correspond to the intermediate and maximum plot location perturbing distances, respectively, used by the FIA program.
The plot locations were perturbed in four steps: (1) perturbations, rlon for longitude and rlat for latitude, were randomly selected from a uniform distribution with positive density in the interval [-rmax, rmax]; (2) the total perturbation, rtot = √(rlon² + rlat²), was checked to ensure that rtot ≤ rmax; if not, Step 1 was repeated; (3) rlon and rlat were added to the exact coordinates of all four subplots for each plot to obtain the perturbed locations; and (4) the perturbed subplot locations were checked to ensure they remained in their respective study area; if not, Steps 1-3 were repeated. The spectral values of the pixels containing the perturbed subplot centers were associated with the subplot attributes, the model was recalibrated, and predictions, p̂, of the probability of forest were calculated for each pixel in each study area. The procedure was repeated 10 times for each selection of rmax.

The analyses took two forms. First, for each study area and for each selection of rmax, proportion forest area was estimated as the mean of pixel predictions over the entire study area for the exact and each of the 10 perturbed plot models. The bias in proportion forest area estimates due to perturbed plot locations was calculated as the difference between the mean of the 10 estimates for perturbed plot models and the estimate for the exact plot model. Second, for the exact and each of the 10 perturbed plot models, all pixels in each study area were classified as nonforest if the predicted probability of forest was p̂ ≤ 0.5 and forest if p̂ > 0.5. The misclassification proportion for each perturbed plot model was calculated as the proportion of pixels for which the classifications based on exact and perturbed plot model predictions differed. For each selection of rmax and for each study area, the mean misclassification proportion over the 10 perturbed plot models was calculated.
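The four perturbation steps and the misclassification calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes planar coordinates in kilometres rather than longitude and latitude, a circular study area given by a centre and radius, and exact plot locations that already lie inside the study area; all function names are hypothetical.

```python
import math
import random


def draw_perturbation(r_max, rng):
    """Steps 1-2: draw (r_lon, r_lat) uniformly from [-r_max, r_max] and
    redraw until the total perturbation does not exceed r_max."""
    while True:
        r_lon = rng.uniform(-r_max, r_max)
        r_lat = rng.uniform(-r_max, r_max)
        if math.hypot(r_lon, r_lat) <= r_max:
            return r_lon, r_lat


def perturb_plot(subplots, r_max, center, study_radius, rng):
    """Steps 3-4: add one offset to all subplots of a plot and redraw
    if any perturbed subplot falls outside the circular study area."""
    while True:
        d_lon, d_lat = draw_perturbation(r_max, rng)
        moved = [(x + d_lon, y + d_lat) for x, y in subplots]
        if all(math.hypot(x - center[0], y - center[1]) <= study_radius
               for x, y in moved):
            return moved


def misclassification_proportion(p_exact, p_perturbed, threshold=0.5):
    """Proportion of pixels whose forest/nonforest class differs between
    the exact plot model and a perturbed plot model (forest if p > 0.5)."""
    if len(p_exact) != len(p_perturbed) or not p_exact:
        raise ValueError("prediction lists must be non-empty and equal length")
    differ = sum((a > threshold) != (b > threshold)
                 for a, b in zip(p_exact, p_perturbed))
    return differ / len(p_exact)


if __name__ == "__main__":
    rng = random.Random(1)
    d = 0.03658  # 36.58 m between subplot centres, in km
    plot = [(0.0, 0.0)] + [
        (d * math.sin(math.radians(az)), d * math.cos(math.radians(az)))
        for az in (0, 120, 240)
    ]
    print(perturb_plot(plot, 0.8, (0.0, 0.0), 15.0, rng))
```

Because the same offset is applied to all four subplots, the subplot configuration of each plot is preserved, as in Step 3. Model recalibration on the perturbed data is outside the scope of this sketch.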
Circumventing the effects of perturbed plot locations For models that use spectral values of satellite imagery as independent variables, a method for circumventing the effects of perturbed plot locations was investigated. First, a check was made to determine if searches of an entire Landsat image using only spectral values could uniquely locate individual pixels. If so, inventory data appended with the spectral values of corresponding pixels provides sufficient information to determine exact plot locations, a violation of the FIA non-disclosure policy. One, two, and three dates of imagery were searched to determine the number of pixels with the same combination of spectral values for each of 1,000 randomly selected pixels. If only a small proportion of image pixels have unique combinations of spectral values, then inventory data appended with spectral values of associated pixels may be released with confidence that most exact subplot locations cannot be determined. However, if a large proportion of pixels have unique combinations of spectral values, then alternatives must be sought. An alternative may be to perturb the spectral values of the pixels associated with the exact plot locations. If the perturbed spectral values mask the exact plot locations but yet retain sufficient information for model calibration, then the effects of perturbed plot locations may be circumvented while yet complying with the FIA non-disclosure policy. First, for each of the 1,000 randomly selected image pixels, each spectral value for all three dates of imagery was perturbed by randomly selecting an integer from a uniform distribution with positive density in the interval, [-Imax, Imax] where Imax=1, 2, …, 5. For one, two, and three dates of imagery, all image pixels at the minimum distance in spectral space from each of the 1,000 pixels with perturbed spectral values were identified. 
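The image search described above can be sketched in a few lines. This is a hedged illustration only: the text does not state which distance measure was used in spectral space, so squared Euclidean distance is assumed here, perturbed values are not clipped to [0, 255], and the image is represented as a plain list of integer tuples. All names are hypothetical.

```python
import random


def perturb_spectral(values, i_max, rng):
    """Add an independent uniform random integer from [-i_max, i_max]
    to each spectral value of one pixel."""
    return tuple(v + rng.randint(-i_max, i_max) for v in values)


def search_outcome(image, target_index, query):
    """Search the whole image for the pixels nearest to `query` and
    report how the pixel of interest (target_index) fares.

    Returns 'single' if it is the only pixel at the minimum distance,
    'multiple' if it is one of several, 'not_at_minimum' otherwise.
    """
    def sq_dist(a, b):
        # Assumed distance measure: squared Euclidean, exact for integers.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    dists = [sq_dist(pixel, query) for pixel in image]
    d_min = min(dists)
    nearest = [i for i, dist in enumerate(dists) if dist == d_min]
    if target_index not in nearest:
        return "not_at_minimum"
    return "single" if len(nearest) == 1 else "multiple"


def tally_outcomes(image, target_indices, i_max, rng):
    """Proportion of target pixels in each outcome class after their
    spectral values are perturbed with maximum perturbation i_max."""
    counts = {"single": 0, "multiple": 0, "not_at_minimum": 0}
    for idx in target_indices:
        query = perturb_spectral(image[idx], i_max, rng)
        counts[search_outcome(image, idx, query)] += 1
    n = len(target_indices)
    return {key: count / n for key, count in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Small synthetic "image": 500 pixels with three spectral values each.
    image = [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(500)]
    targets = rng.sample(range(len(image)), 100)
    for i_max in range(6):
        print(i_max, tally_outcomes(image, targets, i_max, rng))
```

With `i_max = 0` the query equals the pixel's own spectral values, so the search reduces to counting pixels with an identical combination of values, which corresponds to the unperturbed check described first.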
Second, for the logistic regression model, the spectral values for pixels associated with FIA subplots were perturbed by randomly selecting an integer from a uniform distribution with positive density in the interval, [-Imax, Imax] where Imax=1, 2, …, 5. The model was then recalibrated, and predictions, p̂, were calculated for each pixel in each study area. The procedure was repeated 10 times. The bias and mean misclassification proportion were calculated in the same manner as when plot locations were perturbed.

Results

Design-based estimates

Based on Equation [1] and Figure 2, the effects of GPS error (rmax = 0.04 km) on design-based estimates are negligible. For circular AOIs and a maximum perturbing distance of 0.8 km, the minimum worst case correlation between means of proportion forest area based on exact and perturbed plot locations is greater than ρ=0.90 when the radius of a circular AOI is greater than R=16 km and is greater than ρ=0.95 when the radius is greater than R=32 km. For a maximum perturbing distance of 1.6 km, a radius of R=32 km is required for a worst case correlation of ρ=0.90, and a radius of R=64 km is required for a worst case correlation of ρ=0.95. The actual correlations will depend on the shape of the AOI and the spatial correlation among observations of the attribute of interest.

Model-based estimates

The 10 replications of procedures for each rmax and each Imax produced coefficients of variation for estimates of bias that were generally less than 0.10 and always less than 0.15. Coefficients of variation for mean misclassification proportion were generally less than 0.05 and always less than 0.10. Thus, although the replications were small in number, they were sufficient to produce quite precise estimates of bias and mean misclassification proportion. One effect of uncertainty in plot locations when dependent and independent variables are not observed at the same geographic locations is bias in estimated model relationships.
Even the relatively small uncertainty associated with the combination of GPS and image registration errors produced detectable bias and misclassification for the logistic regression model (Figures 3 and 4).

[Figure 3. Bias in proportion forest estimates for 15-km radius study areas. Bias (y-axis) versus maximum plot location perturbation in km (x-axis), with one curve for each of Study Areas 1, 2, and 3.]

[Figure 4. Mean misclassification proportion for 15-km radius study areas. Mean misclassification proportion (y-axis) versus maximum plot location perturbation in km (x-axis), with one curve for each of Study Areas 1, 2, and 3.]

Circumventing the effects of perturbed plot locations

With one, two, and three dates of Landsat TM/ETM+ imagery, proportions of 0.118, 0.974, and 1.000 of pixels, respectively, were uniquely located when the spectral values were not perturbed (Table 1). These proportions indicate that appending inventory subplot data with spectral values of corresponding pixels violates the FIA non-disclosure policy.

Table 1. Results of searching image for a given pixel (proportion of 1,000 pixels).

                 Pixel of interest at minimum distance    Pixel of interest
Maximum pixel    -------------------------------------    not at minimum
perturbation     Single pixel¹      Multiple pixels²      distance³

Single date
  0              0.118              0.882                 0.000
  1              0.002              0.034                 0.964
  2              0.002              0.004                 0.994
  3              0.000              0.002                 0.998
  4              0.000              0.002                 0.998
  5              0.000              0.002                 1.000

Two dates
  0              0.974              0.016                 0.000
  1              0.242              0.418                 0.340
  2              0.072              0.138                 0.790
  3              0.020              0.036                 0.944
  4              0.010              0.028                 0.962
  5              0.002              0.006                 0.992

Three dates
  0              1.000              0.000                 0.000
  1              0.708              0.230                 0.062
  2              0.286              0.266                 0.448
  3              0.108              0.146                 0.746
  4              0.082              0.058                 0.860
  5              0.032              0.034                 0.934

¹ The proportion of 1,000 pixels for which the pixel was the only pixel at the minimum spectral distance.
² The proportion of 1,000 pixels for which the pixel was one of multiple pixels at the minimum spectral distance.
³ The proportion of 1,000 pixels for which the pixel was not among the pixels at the minimum spectral distance.
Assume, arbitrarily, that the criterion for compliance with the FIA non-disclosure policy is that only 10 percent or fewer pixels can be located by searching the image. This criterion means that the proportion of searches for individual pixels for which the pixel of interest is not found at the minimum spectral distance is 0.900 or greater. For one date of Landsat TM/ETM+ imagery, spectral value perturbations of randomly selected, uniformly distributed integers from the interval [-1, 1] are sufficient; for two dates, integers from the interval [-3, 3] are sufficient; and for three dates, integers from the interval [-5, 5] are sufficient. Thus, relatively small perturbations are sufficient to mask the locations of subplots.

The negative effects of perturbing spectral values with integers from the interval [-5, 5] on logistic model predictions were a small bias in estimates of proportion forest area and a small misclassification proportion (Table 2). However, the negative effects were smaller than or comparable to the effects of perturbing plot locations from the interval [-0.05 km, 0.05 km] (Table 2). The important result of this finding is that the effects of perturbing spectral values from the interval [-5, 5] are no greater than the combined effects of GPS measurement error and image registration error.

Table 2. Effects of perturbing locations and spectral values.

              Plot location perturbation          Spectral value perturbation
              [-0.05 km, 0.05 km]                 [-5, 5]
              --------------------------------    --------------------------------
Study Area    Bias      Mean misclassification    Bias      Mean misclassification
                        proportion                          proportion
1             0.0306    0.0399                    0.0088    0.0193
2             0.0383    0.0507                    0.0152    0.0280
3             0.0690    0.1085                    0.0899    0.0969

Conclusions

Three conclusions may be drawn from this study. First, for design-based analyses, the effects of GPS error are negligible, and the worst case effects of greater uncertainty for circular AOIs may be estimated from Equation [1] and Figure 2.
Similar results for AOIs of other shapes are expected except when the AOIs have very narrow components. Second, appending spectral values of satellite image pixels to inventory data violates the FIA policy of not disclosing exact plot locations. The third conclusion, however, is that the negative effects of perturbing plot locations in order to comply with the policy may be circumvented at least partially for the logistic regression model by perturbing spectral values with randomly selected, uniformly distributed integers from the interval [-5, 5]. This action not only masks inventory plot locations but also retains most of the information in the imagery. Further, the effects on bias and misclassification of these spectral value perturbations were typically less than the effects of GPS and image registration errors. Nevertheless, caution should be exercised when using different model forms and when predicting forest attributes that exhibit different spatial correlation.

References

Liknes, G.C., Holden, G.R., Nelson, M.D., and McRoberts, R.E. (2005) Spatially locating FIA plots from pixel values. In R.E. McRoberts, G.A. Reams, P.C. Van Deusen, and W.H. McWilliams (Eds). Proceedings of the Fourth Annual Forest Inventory and Analysis Symposium. NC-GTR- U.S. Department of Agriculture, Forest Service, St. Paul, MN.

McRoberts, R.E., Holden, G.R., Nelson, M.D., Moser, W.K., Lister, A.J., King, S.L., LaPoint, E.B., Coulston, J.W., Smith, W.B., and Reams, G.A. (In press) Estimating and circumventing the effects of perturbing and swapping inventory plot locations. Journal of Forestry.

McRoberts, R.E. (In review) Using a logistic regression model, satellite imagery, and inventory data to estimate forest area. Remote Sensing of Environment.