Estimating insect distributions in Alaskan landscapes not covered in aerial surveys. APPENDIX A Modeling and Error Estimation Descriptions GIS and Landsat Data Spectral information were taken from satellite imagery of the study site. A Landsat-5 TM image was obtained from EROS Data Center (Sioux Falls, South Dakota), while a MODIS image was obtained from the Eric Johnson (Geospatial Coordinator, GIS/ALP, USDA Forest Service, Juneau, Alaska). Both the Landsat-5 TM and MODIS imagery were obtained for the study site during the month of August, 2007 coinciding with the field data collection. The Landsat-5 TM imagery had seven spectral bands with a 30 m resolution, while the MODIS imagery had six spectral bands with a resolution of 250 m (bands 1 and 2) and 500 m (bands 3-6). These latter four MODIS bands were resampled to a 250 m spatial resolution using nearest neighbor techniques (Muukkonena and Heiskanenb, 2005). Nearest neighbor resampling was selected due to quicker computer processing time as compared to other interpolation methods. In addition, nearest neighbor interpolation better maintains original reflectance values while providing sufficient accuracy and reduced potential introduction of unwanted geometric distortions in areas with no ground control points to provide precise control (Muukkonena and Heiskanenb, 2005). Topographic data were taken from a digital elevation model (DEM). The DEM was obtained from the National Elevation Dataset (NED) as a seamless ArcInfo (ESRI, 1995) grid at a 90 m resolution (U.S. Geological Survey (USGS), Gesch et al., 2002; Rabus et al., 2003). The DEM was resampled to both a 30 m and 250 m spatial resolution using bilinear techniques (Edenius et al., 2003), producing a more continuous surface reflecting gradual changes in elevation at both a 30 m and 250 m spatial resolution. GIS grids of elevation, slope, and aspect were derived from the DEM’s using ArcView (ESRI, 1998). Values for all grid layers of information were derived for each sample point station using Avenue (ESRI, 1998) code. Modeling Vegetation Types and Distribution of Aspen Leaf Miner Binary classification trees were used to describe the presence/absence of a given vegetation type infested and non-infested as a function of topographic data and the satellite imagery (Joy et al., 2003). A 10-fold cross-validation procedure was used to identify the best tree size to minimize the prediction classification error. The pruned classification trees were then used to generate GIS layers showing the spatial distribution of the various vegetation types at a 30 m resolution for the Landsat-5 TM imagery and at a 250 m resolution for the MODIS imagery. This was accomplished by passing the appropriate GIS layers of topographic data and satellite imagery through the pruned classification trees. Receiver Operating Curves (ROC) were used to assess the predictive accuracy of the binary models (Boyce et al., 2002). 1 An ROC curve is a plot of the true positive rate against the false positive rate for different possible outcomes of a model. An ROC curve demonstrates several things: 1) the tradeoff between sensitivity and specificity; 2) the closer the curve is to the left-hand border and the top border of the ROC space, the more accurate the model; 3) the closure the curve comes to the 45-degree diagonal of the ROC space, the less accurate the model; 4) the slope of the tangent line at a give point in the ROC curve gives the likelihood ratio for that value of the model; and 5) the area under the curve is a measure of test accuracy (Boyce et al., 2002). This last characteristic, the area under the curve measures discrimination, or the ability of the model to correctly classify a sample point as to its vegetation type. Modeling Forest Characteristics Modeling of basal area and canopy closure was accomplished in two stages using procedures developed by Reich et al. (2004a). In the first stage, multiple regression analysis was used to describe the coarse-scale variability in the stand structure as a function of elevation, slope, aspect and the Landsat-5 TM or MODIS bands. For each component of forest structure, a stepwise procedure was used to identify the best subset of independent variables to include in regression models. Semi-variograms were used to evaluate spatial dependencies among residuals from the various models. If the residuals exhibited spatial dependencies, a generalized least squares model was used to estimate the regression coefficients associated with the regression models (Reich and Davis, 1998). In the second stage, a tree-based stratification design was used to enhance the estimation process of the small-scale spatial variability. With this approach, sample units (i.e., pixel of a satellite image) are classified with respect to predictions of error attributes into homogeneous classes, and the classes are then used as strata in the stratified analysis. Independent variables considered in the binary regression trees (Breiman et al., 1984) included elevation, slope, aspect and Landsat-5 TM or MODIS bands. A decision rule was used to identify a tree size that minimized the error in estimating the variance of the mean response and prediction uncertainties at new spatial locations. The effectiveness of the final models were evaluated using a goodness-of-prediction statistic (G) (Agterberg, 1984; Kravchenka and Bullock, 1999; Guisan and Zimmermann, 2000; Schloeder et al., 2001). The G-value measures how effective a prediction might be relative to that which could have been derived using the sample mean (Agterberg, 1984). A G-value equal to 1 indicates perfect prediction, a positive value indicates a more reliable model than if one had used the sample mean, and a negative value indicates a less reliable model than if one had used the sample mean. Grids of basal area and canopy closure were generated for the best fitting regression models. Similarly, grids representing the error associated with each regression model were generated by passing each grid for the appropriate independent variables through the regression trees. The final predicted surfaces were obtained from the sum of the two grids. 2 Model Evaluation A 10-fold cross-validation was used to estimate the prediction error for each model (Efron and Tibshrani, 1993; Stone, 1974). Data were split into K=10 parts consisting of approximately 21 sample points and models were fitted to the remaining K-1=9 parts of the data. The fitted model was used to predict the part of the data removed from the modeling process. This process was repeated 10 times so that each weather station was excluded from the model fitting step and its response predicted. Prediction errors were obtained from the predicted minus actual values. Various measures of prediction error were calculated to evaluate the effectiveness of the models. Prediction bias (Williams, 1997) was calculated for each validation data set as a percentage of the true value and accuracy (Kravchenko and Bullock, 1999; Schloeder et al., 2001) was measured by the mean absolute error (MAE), which is a measure of the sum of absolute residuals (i.e., actual minus predicted) and the root mean squared error (RMSE). Small MAE values indicate a model with few errors, while small values of RMSE indicate more accurate predictions on a point-by-point basis (Schloeder et al., 2001). Estimation uncertainty in the models (Reich et al., 2004a) was calculated as the estimation error variance (EEV), σˆ i2* for each observation in the data set. The consistency between the EEV and the observed estimation errors (i.e., true errors) was calculated using the standard mean squared error (SMSE) (Hevesi et al., 1992). EEVs were assumed consistent with true errors if the SMSE fell within the interval [1 ± 2(2 n ) ] 1/ 2 (Hevesi et al., 1992). The EEVs were also used to construct 95% confidence intervals around individual estimates. Coverage rates were calculated as the proportion of individual confidence intervals that contained the true value. Models were developed to predict the spatial distribution of Aspen leaf miner in the study area using satellite imagery. The results are summarized in the Table below. Results -- Modeling Vegetation Types and Distribution of Aspen Leaf Minor All aspen stands were heavily infected and because of the difficulty in locating stands with low to moderate levels of infections it was difficult to discriminate between birch and non-infected aspen stands and so for the analysis, these stands were combined. The Landsat-5 TM model was able to discriminate between all land cover types, except open areas, with an overall accuracy of 81% (Table 1). A second measure of accuracy is provided by the area under the curve of the ROC curve. An area of 1 represents perfect prediction, while an area of 0.5 represents a worthless prediction. A rough guide for classifying the accuracy of the model is the traditional academic point system ● 0.90-1 = excellent (A) 3 ● 0.80-0.90 = good (B) ● 0.70-0.80 = fair (C) ● 0.60-0.70 = poor (D) ● 0.50-0.60 = fail (E) The area under the curve for non-infected aspen and spruce was 0.89 and the model would be considered almost excellent in the ability to discriminate between these two vegetation types. Infested aspen had an area under the curve of 0.85 so the model would be considered good at discriminating this vegetation type. The model was only fair (area under the curve – 0.78) in its ability to discriminate open areas. Vegetation Type Accuracy Area Under the Curve1 0.85 0.89 ROC Ranking1 Good Good Infected Aspen 0.80 Birch/Non-infected 0.89 Aspen Spruce 0.82 0.89 Good Open Areas 0.59 0.78 Fair Overall 0.81 1 AUC = 1-0.90 – excellent; 0.80 – 0.89 – good; 0.70-0.79 – fair; 0.60-0.69 – poor; 0.50 – 0.59 – failure. Table 1. Accuracy assessment of Landsat-5 TM imagery in identifying vegetation types on the study site near Fairbanks, Alaska. Figure 2 displays the predicted spatial distribution of the vegetation types and associated ROC curve, while Table 3 summarizes the estimated areas associated with each of the vegetation types. Spruce dominated the study site (45%) followed by birch/non-infected aspen stands (31%) and then infected aspen stands (22%). Open areas covered only 2% of the study site. The MODIS model had a lower overall accuracy (69%) compared to the model based on the Landst-5 TM imagery (Table 3). Spruce (78%) and birch/non-infected aspen stands (73%) had the highest level of accuracy, while infected aspen stands (60%) and open areas (59%) had the lowest accuracy. Fig.3 displays the predicted spatial distribution of the vegetation types and associated ROC curve. The area under the curve indicted that the model was only fair in its ability to discriminate among the various vegetation types, except for spruce in which the model was able to do a good job in discriminating this vegetation type. Table 4 summarizes the estimated areas associated with each of the vegetation types. The MODIS model over estimated the area in spruce and infected aspen stands and under estimated the area covered by birch/non-infected aspen stands when compared to the Landsat-5 TM model. The area covered in open areas was similar for both models. 4 1.0 Vegetation Types of Study Area Near Fairbanks, Alaska Birch Aspen 0.6 Open 0.4 TPR (sensitivity) 0.8 Spruce N i sc od ti i na r im (r on m do an e ss gu ) 0.2 Better Legend Worse Roads 4 0.0 Water Aspen 0 1,125 2,250 4,500 6,750 9,000 Miles Birch/Non-infected aspen Open areas 0.0 0.2 0.4 0.6 0.8 1.0 FPR (1-specificity) Spruce Fig. 2. Spatial distribution of vegetation types predicted using Landsat TM imagery on a study site near Fairbanks, Alaska, and associated ROC curve. Table 2. Estimated areas associated with the Landsat-5 TM model for the various vegetation types on the study site near Fairbanks, Alaska. Vegetation Type Area (ha) Percent Infected Aspen 4,433 22.2 Birch/Non-infected Aspen 6,216 31.1 363 1.8 Spruce 8,979 44.9 Total 19,981 Open Areas Table 3. Accuracy assessment of MODIS imagery in identifying vegetation types on the study site near Fairbanks, Alaska. Tree Cover Accuracy Area Under the ROC 1 Curve Ranking1 Infected Aspen 0.60 0.76 Fair Birch/Non-infected 0.73 0.77 Fair Aspen Spruce 0.78 0.82 Good Open Areas 0.59 0.78 Fair Overall 0.69 5 1 AUC = 1-0.90 – excellent; 0.80 – 0.89 – good; 0.70-0.79 – fair; 0.60-0.69 – poor; 0.50 – 0.59 – failure. 1.0 Vegetation Types of Study Area Near Fairbanks, Alaska 0.8 Spruce 0.6 Aspen 0.4 TPR(sensitivity) Birch Open d No i na r im i sc ti o ra n( om nd e ss gu ) Better Roads Water Aspen 0 1,125 2,250 4,500 6,750 9,000 Miles Worse 0.0 4 0.2 Legend Birch/Non-infected aspen Open areas Spruce 0.0 0.2 0.4 0.6 0.8 1.0 FPR(1-specificity) Fig. 3. Spatial distribution of vegetation types predicted using MODIS imagery on a study site near Fairbanks, Alaska, and associated ROC curve. Table 4. Estimated areas associated with the MODIS model for the various vegetation types on the study site near Fairbanks, Alaska. Vegetation Type Infected Aspen Area (ha) Percent 6,306 30.6 3025 14.7 219 1.1 11,037 53.6 Birch/Non-infected Aspen Open Areas Spruce Total 20,587 Results -- Modeling Forest Characteristics Canopy closure and basal were linearly correlated with the topographic (elevation and slope) and satellite imagery, and these linear relationships varied significantly among vegetation types. All of the topographic, satellite imagery and vegetation type data? were also used in regression trees to describe the error in one or more of the regression models. The decision rule used to identify the optimal tree size identified 9 strata for the canopy closure model using the Landsat-5 TM data and 9 for the basal model and with a minsize 6 of 60 observations. The basal area model accounted for 53% of the variability observed on the sample plots while the canopy model had a G-statistic of 42% (Table 5). In contrast, the models for canopy closure and basal area using the MODIS data both had strata sizes of 8, respectively and with a minsize of 60 observations. The canopy model had a G-statistic of 58% and 52% for the basal area model (Table 5). Sample plots with the same canopy closure or basal area can have different spectral properties depending on, for example the vegetation type, stocking level and average tree size thus, making it more difficult to model using the spectral reflectance and terrain data. This results in a coarser partitioning of the data in a small number of large strata. This coarser partitioning of the data results in a lower G-statistic for these two models. Fig. 4 and 5 compare the final models for canopy closure and basal area using the two sets of satellite imagery The confidence and prediction coverage rates for all models were close to the 95% nominal coverage rate. The standardized mean squared errors were close to their expected value of unity indicating the variance estimates were consistent with the true errors. The standardized mean squared errors ranged from 0.91 to 1.16. Figure 6 display the histograms of the prediction errors from the 10-fold cross-validation of the final models. The histograms and plots of the predicted vs. observed values (not shown) did not show any trends suggesting any systematic bias in the models. Canopy Closure on the Study Area Near Fairbanks, Alaska Canopy Closure on the Study Area Near Fairbanks, Alaska Roads Roads Water Water Percent Percent 0 0 0 - 20 0 - 20 20-40 4 20-40 4 40-60 0 0.5 1 2 3 60-80 4 Miles 80-100 40-60 0 0.5 1 2 3 60-80 4 Miles Fig. 4. Comparison of the predicted spatial distribution of canopy closure on the study site near Fairbanks, Alaska using Landstat-5 TM (left) and MODIS data (right). 7 80-100 Table 5. Overall model performance (G-statistic) of the multiple regression models (LS) and the use of regression trees (RT) in describing the errors in the LS models of forest structure using Landsat-5 TM and MODIS imagery on a study site near Fairbanks, Alaska. Model 1 1 LS LS + RT Canopy Closure-L (%) 0.182 0.453 Basal Area-L (m2 ha-1) 0.422 0.526 Canopy Closure-M (%) 0.448. 0.577 Basal Area-M (m2 ha-1) 0.395 0.519 L=Landsat-5 TM; M=MODIS imagery Prediction bias was nominal (Table 6) for all models. Minimum, maximum, and quartile values showed that estimated and observed value distributions were similar for all models. Estimation errors for the canopy closure and basal area had a similar spread in the estimation errors for both set of satellite imagery. The mean estimation errors did not differ significantly from zero (p-value > 0.050). The MAE was smaller than the RMSE for all models indicating that in general, the models are more accurate in predicting regional or global means than on a point-by-point basis. Basal Area on the Study Area Near Fairbanks, Alaska Basal Area on the Study Area Near Fairbanks, Alaska Roads Roads Water 2 m ha Water -1 2 m ha 0 0 - 40 0 - 40 40-50 40-50 50-100 50-100 100-150 4 100-150 4 150-200 0 0.5 1 2 3 200-300 4 Miles -1 0 300-500 150-200 0 0.5 1 2 3 200-300 4 Miles Fig. 5. Comparison of the predicted spatial distribution of basal area on the study site near Fairbanks, Alaska using Landstat-5 TM (left) and MODIS data (right). 8 300-500 Table 6. Summary statistics of estimation errors of forest structure using Landsat-5 TM and MODIS imagery for a study area near Fairbanks, Alaska based on the 10-fold cross-validation. N 188 188 205 Basal Area –M2 (m2 ha1) 205 Mean 1.34 0.96 -2.35 0.63 IQR 24.26 86.09 27.04 109.09 MAE 16.96 55.10 18.83 65.42 RMSE 22.60 71.88 25.69 81.83 SMSEM 0.92 0.95 0.91 0.91 SMSEP 1.16 0.92 0.99 1.06 CRM 0.97 0.95 0.96 0.96 CRP 0.93 0.96 0.92 0.95 Statistic 1 Canopy Basal 2 Closure- L Area –L2 (%) (m2 ha-1) 1 Canopy Closure-M2 (%) IQR = interquartile range, MAE = mean absolute error, RMSE = root mean square error, SMSE = standardized mean square error for the mean response (M) and for prediction; CR = coverage rate for the mean response (M) and prediction (P). 2 L=Landsat-5 TM; M=MODIS image 9 Table 7. Summary statistics of observed and estimated forest structure using Landsat-5 TM and MODIS imagery for a study site near Fairbanks, Alaska from 10-fold cross-validation. Canopy Closure-L1 (%) Basal Area-L1 (m2 ha-1) Statistic Obs. Est. N 188 188 Mean 67.24 65.9 Std. Dev. 22.98 14.22 CV% 34.17 21.58 Minimum 0.0 16.01 First quantile 59.48 64.24 Median 72.18 68.53 Third quantile 84.38 73.2 Maximum 96.35 89.07 Bias% 1.99 1 L=Landsat-5 TM; M=MODIS imagery Obs. 188 133 80.95 60.87 0.0 70.0 120.0 180.0 370.0 Est. 188 132 63.82 48.34 0.0 89.07 136.70 180.0 272.2 0.75 10 Canopy Closure-M1 (%) Obs. Est. 205 205 61.77 64.12 22.98 19.84 34.17 30.94 0.0 0.0 52.0 61.97 71.0 70.65 83.0 75.87 96.0 97.05 -3.80 Basal Area-M1 (m2 ha1 ) Obs. Est. 205 205 133 120.7 80.95 64.07 60.87 53.09 0.0 0.0 70.0 73.5 120.0 124.2 180.0 176.0 370.0 266.3 9.25 60 b 40 30 20 Number Sample Plots 40 0 0 10 20 Number Sample Plots 60 50 a -50 0 50 -200 -100 Residuals 100 200 100 200 Residuals d 30 20 Number Sample Plots 40 0 0 10 20 Number Sample Plots 60 40 50 80 c 0 -100 -50 0 50 -200 Residuals -100 0 Residuals Fig. 6. Histograms of prediction errors from a 10-fold cross-validation of the canopy model (a and c) and basal area model (b and d) using Landsat-5 TM (a nd b) and MODIS (c and d) satellite imagery. 11