Cost-Effective, Practical Sampling Strategies for Accuracy Assessment of Large-Area Thematic Maps

Stephen V. Stehman
SUNY College of Environmental Science and Forestry, 920 Bray Hall, Syracuse, NY

Abstract. - Accuracy assessment is an expensive but necessary process in the development and eventual use of a large-scale thematic map. The sampling strategy used for collecting and analyzing the reference samples required for the accuracy assessment must be cost-effective, yet still achieve satisfactory precision for the estimated accuracy parameters. Strata and clusters may be used to improve the efficiency of the sampling strategy, and more specialized designs such as double sampling or adaptive cluster sampling may provide better precision for certain objectives. Poststratified and regression estimators may be combined with a simple design to yield enhanced precision without substantially increasing costs. Each strategy has strengths and weaknesses, and no single strategy is ideal for all applications.

INTRODUCTION

Large-area, satellite-based land cover mapping projects are becoming increasingly important for use in environmental monitoring and modeling, resource inventories, and global change research. Assessing the thematic accuracy of these maps requires balancing the needs of statistical validity and rigor with the practical constraints and limited resources available for the task. Finding this balance is a characteristic of practically any applied sampling problem. The primary expense is obtaining ground reference samples, the critical data for an accuracy assessment. A cost-effective, practical accuracy assessment strategy must use these ground reference samples efficiently. This efficiency can be achieved by selecting an efficient sampling design, or by using efficient analysis procedures following collection of the data.

A site-specific, thematic accuracy assessment is assumed of primary interest, with the evaluation unit being a single pixel as suggested by Janssen and van der Wel (1994). Analogous strategies could be developed for a polygon-based accuracy assessment, but that topic is not pursued here. Each pixel on the target map corresponds to a particular location on the ground, and it is assumed that a ground visit leads to a correct land-cover classification of that pixel. Location error will be confounded with classification error if field site locations do not correspond exactly with map locations. This view of accuracy assessment permits illustration of the basic principles and techniques to be presented in this paper.

The statistical approach taken is the classical finite population sampling model (Cochran 1977). In this perspective as applied to accuracy assessment (Stehman 1995), the population consists of the N pixels on the map. If a complete census of ground reference sites were obtained, each pixel on the map could be labeled as correctly or incorrectly classified, and a population error matrix constructed from these results, in which the rows of the error matrix represent the map classifications and the columns represent the reference classifications ("truth").
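To make the error matrix concrete, the following minimal sketch (in Python, with hypothetical map and reference labels; the helper name error_matrix is illustrative, not from the paper) cross-tabulates map and reference classifications and computes the standard accuracy parameters discussed below:

```python
import numpy as np

def error_matrix(map_labels, ref_labels, classes):
    """Cross-tabulate classifications: rows = map, columns = reference."""
    idx = {c: i for i, c in enumerate(classes)}
    m = np.zeros((len(classes), len(classes)))
    for ml, rl in zip(map_labels, ref_labels):
        m[idx[ml], idx[rl]] += 1
    return m

# Hypothetical labels for a handful of reference pixels
classes = ["forest", "water", "marsh"]
map_labels = ["forest", "forest", "water", "marsh", "forest", "water"]
ref_labels = ["forest", "water", "water", "marsh", "forest", "water"]

p = error_matrix(map_labels, ref_labels, classes)
p /= p.sum()                                 # cell proportions
overall = np.trace(p)                        # overall proportion correct, P_c
users = np.diag(p) / p.sum(axis=1)           # user's accuracy, by map class (rows)
producers = np.diag(p) / p.sum(axis=0)       # producer's accuracy, by reference class (columns)
p_e = (p.sum(axis=1) * p.sum(axis=0)).sum()  # chance agreement
kappa = (overall - p_e) / (1 - p_e)          # kappa coefficient of agreement
```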
The objective of the accuracy assessment is to use a sample of reference locations to construct a sample error matrix and then to compute estimates of various accuracy parameters, typically the overall proportion of pixels correctly classified ($P_c$), user's accuracy, producer's accuracy, and the kappa coefficient of agreement ($\kappa$).

A basic recommendation is that a probability sampling design should be used to collect the reference data. For a probability sample, each pixel in the population has a positive, known probability of being included in the sample. This "inclusion probability" is a characteristic of the sampling design. By knowing the inclusion probability of each sampled pixel, the pixels can be correctly weighted when estimating accuracy parameters. All pixels need not have the same inclusion probability, a stratified sample being a good example, as long as the inclusion probability is known. Probability sampling provides an objective, scientifically defensible protocol for accuracy assessment.

In the finite sampling perspective, a sampling strategy consists of two parts: the sampling design, which is a protocol by which the sample data are collected, and an estimator, which is a formula for estimating a population parameter. An "efficient" sampling strategy is one in which the variance of an estimator is low or, equivalently, precision is high. Efficiency translates into cost savings in that if one sampling strategy can provide the same variance using fewer sampling locations than another, the former strategy is more cost-effective. Efficiency can be gained by judicious choice of design or estimator, or both. Stratification, cluster sampling, double sampling, and adaptive cluster sampling are design options for improving efficiency. Regression and poststratified estimators are estimation techniques commonly employed in finite population sampling to improve precision. These estimation techniques require auxiliary information in addition to the sample reference data. Aerial photography or videography are potential sources of auxiliary information, and the number of pixels classified into each cover type by the land-cover map can be used as auxiliary information in poststratification.

ENHANCING PRECISION VIA SAMPLING DESIGN

Stratification

Several stratification options exist. Stratifying by geographic region ensures that the sampling design for accuracy assessment results in a regionally well-distributed sample, spreads the workload out geographically, and allows for convenient reporting of accuracy statistics by region. For parameters summarizing the full error matrix, such as $P_c$ and $\kappa$, geographic stratification normally does not produce large gains in precision over simple random sampling (SRS) (Cochran 1977, p. 102) and should not be selected with the hope of achieving a substantial reduction in variance for these estimators. Although geographic stratification is convenient for reporting accuracy statistics by region, it is not necessary to stratify to obtain regional estimates. Any geographic region can be identified as a subpopulation and accuracy statistics calculated for that subpopulation. The advantage of stratification is that the sample size can be controlled in each region or stratum, and the stratum-specific estimates will have smaller variance compared to the estimates obtained via SRS or another equal probability design such as systematic or cluster sampling. A sketch of this stratum-weighted estimation follows.
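As a minimal illustration (Python; the stratum sizes and match indicators are hypothetical), the sketch below estimates $P_c$ from a geographically stratified sample, weighting each pixel by the inverse of its inclusion probability $n_h/N_h$:

```python
import numpy as np

# Hypothetical stratified reference sample: for each geographic stratum h we
# know N_h (pixels in the stratum) and have n_h sampled pixels with a 0/1
# indicator y of whether the map and reference classifications agree.
strata = {
    "coastal": {"N": 4_000_000, "y": np.array([1, 1, 0, 1, 1])},
    "piedmont": {"N": 2_500_000, "y": np.array([1, 0, 1, 1])},
    "upland": {"N": 1_500_000, "y": np.array([1, 1, 1, 0])},
}

N = sum(s["N"] for s in strata.values())

# Each pixel is weighted by N_h / n_h, the inverse of its inclusion
# probability pi_h = n_h / N_h, so unequal sampling intensities across
# strata are handled correctly.
P_c_hat = sum(s["N"] / len(s["y"]) * s["y"].sum() for s in strata.values()) / N

# Stratified variance estimator (finite population corrections are
# negligible here and omitted)
var_hat = sum(
    (s["N"] / N) ** 2 * s["y"].var(ddof=1) / len(s["y"]) for s in strata.values()
)
print(P_c_hat, var_hat ** 0.5)
```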
For any equal probability design, the sample size in a region will be proportional to the area of that region, and for small regions, this sample size may be too small to achieve adequate precision.

Stratifying by the land-cover classes identified on the target map has the advantage of guaranteeing a specified sample size in each land-cover class, and thus is cost-effective if the primary objective is estimating user's accuracy. Efficiency for estimating producer's accuracy, $P_c$, and $\kappa$ may be reduced when the design is stratified by the classes identified on the map, depending on the allocation of samples to the strata. Stratifying by the map classification also requires that the map be completed prior to selecting the sample, and this may result in a delay between the time of imagery and the time the ground sampling takes place. Changes in land cover may occur during the intervening period.

Still another stratification option was employed by Edwards et al. (1996) in a large-area accuracy assessment in Utah, USA. Within broad geographical strata, two additional spatial strata were created: a stratum identified by a 1-km wide corridor centered on a road, and an off-road stratum. The advantage of this stratification scheme is that access to reference sites is much easier for areas in close proximity to a road, and the number of samples per unit cost can be increased dramatically by increasing the sampling intensity in the road stratum. Selecting some samples from the non-road stratum maintains the probability sampling characteristic of the design.

Cluster Sampling

Cluster sampling is another potentially cost-effective design for selecting reference samples. In this design, the primary sampling unit (psu) consists of a cluster of pixels, such as a 3x3 block or a linear row of pixels. In one-stage cluster sampling, all pixels within the psu are sampled, whereas in two-stage cluster sampling, a subsample of pixels within each psu is selected. The advantage of cluster sampling is that the number of pixels sampled per unit cost is increased because pixels are sampled in closer proximity. Moisen et al. (1994) demonstrated the efficiency gains achievable by cluster sampling based on an analysis taking into account sampling costs and the spatial autocorrelation of classification errors. The pixels within a given sampled psu cannot be regarded as independent observations, so the standard error formulas must reflect the cluster structure of the sampling design (Czaplewski 1994, Moisen et al. 1994, Stehman 1996b). The usual SRS variance estimators do not take into account the within-cluster correlation and will likely underestimate the cluster sampling variance.

Sampling Rare Land-Cover Classes

The emphasis placed on sampling rare land-cover categories strongly influences the design selected for accuracy assessment. If rare classes are considered extremely important, the sampling design should reflect this priority, and practically forces using a stratified design to ensure adequate sample sizes for these rare classes. At the other extreme, rare classes may be assigned low priority, and an equal probability design is a viable option. Rare classes will not be sampled in large numbers in such designs. An intermediate emphasis on rare land-cover types requires a compromise design. Two general strategies are offered. The first strategy is to treat rare classes as a separate sampling problem.
For example, a general purpose sampling design for accuracy assessment could be a simple random, systematic, or cluster sample. Because such designs will result in only a few reference samples in the rare classes, the general design is supplemented by a specialized design tailored to sample rare classes with high probability. This two-step design must still be conducted according to a probability sampling protocol, and the analysis must take into account that pixels selected in the supplemental design may have different inclusion probabilities than pixels selected by the original design.

Adaptive cluster sampling (Thompson 1990) is tailored to sample rare but spatially clustered items efficiently. An example illustrating this strategy in an accuracy assessment setting is as follows. Suppose the rare land-cover class is marsh, and assume that a 5x5 block of pixels is adopted as the psu. A simple random or systematic sample of psus is selected. For those psus in which at least one marsh pixel (according to the reference classification) is found, the sampling procedure is "adapted" to then sample adjacent 5x5 blocks surrounding the initial sampled psu. If one or more of these adjacent psus is found to have at least one marsh pixel, the adaptive strategy is continued. The process stops when no marsh pixels are found in adjacent psus. If the rare cover type is spatially clustered, this strategy will greatly improve the probability of sampling reference pixels in this class because the method intensifies sampling effort in those areas in which the rare class is found. The adaptive design satisfies the necessary criteria of a probability sample. Special estimators of the accuracy parameters need to be employed when this design is used (see Thompson (1990) for the basic theory). Adaptive cluster sampling may be an effective design for change detection accuracy assessment because, while change pixels may be rare, they are likely to be spatially clustered, and the adaptive design may sample such pixels more cost-effectively.

Double sampling is another design alternative that may be used to enhance estimation for rare cover types. In double sampling, a large first-phase sample is selected and the stratum to which each sampled pixel belongs is identified. From these first-phase sample units, a stratified random sample, called the second-phase sample, is then selected. The primary advantage of this method is that only the first-phase sample units, not all N pixels, need to be assigned to strata. The stratification could be based on the land-cover classes as identified by the reference data instead of stratifying on the land-cover classes identified by the map. This stratification is then used to increase the probability of sampling rare ground classes. To implement this design, a large first-phase sample is selected and each location is assigned to a land-cover class. These assignments would not necessarily require a ground visit if reasonably accurate stratum identifications can be made using available maps, aerial photographs, or videography. The second-phase sample is then a stratified random subsample of the first-phase sample, and the "true" classification of these second-phase sample sites is identified. To illustrate how this approach focuses sampling effort on rare categories, suppose a particular class, say marsh, represents only 0.1% of the land area, and that commission error for marsh is high.
If the stratification is based on land-cover as identified by the map, it is likely that only a few true marsh sites will appear in the reference sample. By using double sampling and stratifying on the ground classification, the probability of sampling marsh reference pixels could be increased dramatically by intensifying sampling effort in the marsh stratum. Increasing the sample size in this manner will improve the precision of producer's accuracy estimates for the marsh class.

The second strategy for sampling rare classes is to adapt the general purpose sampling design itself to emphasize rare classes more strongly. A stratified design is a typical approach for such a one-step strategy. However, if the general design is chosen to emphasize sampling rare classes, some precision will be lost for other estimates of map accuracy, such as $\hat{P}_c$ and $\hat{\kappa}$, because of this allocation of sampling resources. This precision trade-off is characteristic of any sampling design choice, and the properties of the strategy selected should reflect the objectives specified for the accuracy assessment.

ENHANCING PRECISION VIA ANALYSIS

The second component of a cost-effective sampling strategy is the analysis of the data, provided by the estimators of map accuracy parameters. Three techniques for improving the precision of map accuracy estimates at the analysis stage are poststratification, regression estimation, and incorporating "found" data into an accuracy assessment. All three approaches use auxiliary information in the analysis to achieve this precision gain. Poststratification and regression estimation can be applied to practically any sampling design. Here it is assumed that SRS is employed, because an advantage of these analysis techniques is that they can improve precision for simple, easily implemented sampling designs.

Poststratification

Poststratification is an estimation technique, not a sampling design, requiring auxiliary information similar to that required for stratified sampling. Suppose n sample pixels are obtained via SRS. Let $N_{k+}$ and $n_{k+}$ denote the number of pixels classified as cover type k in the population and sample, respectively. Both $N_{k+}$ and $n_{k+}$ are known once the map is completed. Poststratification incorporates the known totals $N_{k+}$ into the estimator by analyzing the SRS data as a stratified random sample in which $n_{k+}$ pixels have been selected from the $N_{k+}$ pixels available in that stratum (cover type k as identified on the map).

Consider the estimator of $P_c$. If $n_{kk}$ is the number of sample pixels correctly classified in cover type k, the usual SRS estimator of $P_c$ is $\hat{P}_c = \sum_{k=1}^{q} n_{kk}/n$, where the summation is over the q cover types. For SRS, each pixel is weighted equally, the weights being N/n. For the poststratified estimator, the weights depend on the identified strata, the weight for stratum k being $N_{k+}/n_{k+}$. The poststratified estimator of $P_c$ is then $\hat{P}_c^{ps} = (1/N)\sum_{k=1}^{q} (N_{k+}/n_{k+})\, n_{kk}$. Poststratified estimators for other parameters are constructed in essentially the same manner, replacing the weight N/n used in the usual SRS formula by $N_{k+}/n_{k+}$ and then summing over the q strata. Card (1982) presents poststratified estimators and variance estimators for some accuracy parameters, and Stehman (1995) shows the poststratified estimators of $P_c$ and $\kappa$. The precision achieved by poststratification is approximately that of a stratified sample with proportional allocation (Cochran 1977, p. 134), so poststratification will usually result in some gain in precision over the usual SRS estimators. A sketch of the computation follows.
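As a minimal numerical sketch (Python; the stratum totals and sample counts are hypothetical), the poststratified estimator simply reweights the class-specific counts of correctly classified pixels:

```python
import numpy as np

# Hypothetical SRS poststratified by map cover type k:
# N_k: pixels mapped as class k (known once the map is complete)
# n_k: sampled pixels mapped as class k; n_kk: of those, correctly classified
N_k = np.array([6_000_000, 3_000_000, 1_000_000])
n_k = np.array([58, 32, 10])
n_kk = np.array([50, 24, 6])

N, n = N_k.sum(), n_k.sum()

P_c_srs = n_kk.sum() / n                  # usual SRS estimator (weight N/n)
P_c_ps = ((N_k / n_k) * n_kk).sum() / N   # poststratified (weight N_k+/n_k+)
print(P_c_srs, P_c_ps)
```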
Based on a small simulation study (Stehman 1996c), the gain in precision from poststratification can be expected to be around 5% (in terms of standard error) for estimating $P_c$ and $\kappa$, and as much as 15-30% for estimating producer's accuracy. A disadvantage of poststratification relative to a stratified design is that the sample size within each stratum, $n_{k+}$, is under control with a stratified design, but not with SRS followed by poststratification. However, poststratification can be applied to any classification scheme. For example, if a user wishes to collapse land-cover classes or use an entirely different classification scheme, poststratification can still be applied. Poststratification is flexible because it can be adapted to any identified subpopulation, whereas the usual stratified sampling design is advantageous only for those subpopulations identified as strata prior to sampling. Poststratification and stratified sampling both require the $N_{k+}$, but these are available once the land-cover map is completed. Because poststratification does not require this information to obtain the sample, no time delay between the imagery and the ground sampling need occur.

Regression Estimation

In the regression estimator approach, any auxiliary information that can provide a land-cover classification of a pixel may be used. Primary sources of auxiliary data include aerial photography or videography, or even another land-cover map obtained via remote sensing and using a coarser scale of resolution (e.g., AVHRR). This auxiliary information is separate from the imagery data used to construct the target land-cover map being assessed. A sample of ground reference locations is still required, and these reference data are combined with the auxiliary data via a regression estimator to estimate $P_c$.

For SRS of n pixels, let $\hat{P}_y$ denote the sample proportion of pixels in which the reference and map classifications match. If no auxiliary data are available, $\hat{P}_y$ is the usual estimator of $P_c$. Let $\hat{P}_x$ denote the sample proportion of pixels in which the map classification matches the classification obtained by the auxiliary data, and $\hat{P}_{xy}$ the sample proportion of pixels in which both the reference and auxiliary data classifications match the map classification. The regression estimator is $\hat{P}_{reg} = \hat{P}_y + \hat{b}(P_x - \hat{P}_x)$, where $P_x$ is the proportion of pixels in the entire map in which the auxiliary data and map classifications match, and $\hat{b} = (\hat{P}_{xy} - \hat{P}_x \hat{P}_y)/[\hat{P}_x(1 - \hat{P}_x)]$ is the estimated slope of the regression of the reference-match indicator on the auxiliary-match indicator. The estimated variance of $\hat{P}_{reg}$ is approximately $\hat{V}(\hat{P}_{reg}) = [\hat{P}_y(1 - \hat{P}_y) - \hat{b}^2 \hat{P}_x(1 - \hat{P}_x)]/n$.

The regression estimator results in some gain in precision over the usual SRS estimator, no matter how poor the auxiliary data classifications, but the approach is not worth implementing unless a meaningful gain in precision is achieved. The more accurate the classifications from the auxiliary data, the greater the gain in precision achieved by the regression estimator relative to the usual SRS estimator.

The usual regression estimator approach requires auxiliary information for the entire map region. In practice, it is more feasible to use a double sampling approach in which the auxiliary data are collected for a first-phase sample, and ground visits are made to a subsample of the first-phase sample. The double sampling regression estimator incorporates both the first- and second-phase sample data. Stehman (1996a) demonstrated that the precision of the regression estimator combined with double sampling was nearly the same as the strategy employing the regression estimator with SRS. The double sampling strategy is more cost-effective because it requires obtaining the auxiliary data only on the first-phase sample, not the entire target region.
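The following minimal sketch (Python, with simulated indicator data; the agreement rates and the known map-wide proportion P_x are hypothetical) computes the regression estimator and its estimated variance from the quantities defined above:

```python
import numpy as np

# Simulated SRS of n pixels with binary indicators:
# y = 1 if reference and map classifications match,
# x = 1 if auxiliary (e.g., photo-interpreted) and map classifications match.
rng = np.random.default_rng(42)
n = 200
x = rng.binomial(1, 0.85, n)
# make y correlated with x: pixels where the photo agrees with the map are
# more likely to be correctly classified on the ground as well
y = np.where(x == 1, rng.binomial(1, 0.92, n), rng.binomial(1, 0.5, n))

P_x = 0.83          # known map-wide proportion where auxiliary and map agree
P_y_hat = y.mean()  # usual SRS estimator of P_c
P_x_hat = x.mean()
P_xy_hat = (x & y).mean()

# estimated regression slope of y on x
b = (P_xy_hat - P_x_hat * P_y_hat) / (P_x_hat * (1 - P_x_hat))
P_reg = P_y_hat + b * (P_x - P_x_hat)  # regression estimator of P_c

# approximate variance estimator (residual variance of the regression)
var_reg = (P_y_hat * (1 - P_y_hat) - b**2 * P_x_hat * (1 - P_x_hat)) / n
print(P_reg, var_reg**0.5)
```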
Found Data

Additional potential ground reference data may be available via purposeful, haphazard, convenience, or other non-probability sampling methods. For example, suppose an organization has land-cover classifications for various special interest sites which were visited "to see what was there." Assuming the land-cover classifications made at these sites are correct and consistent with the classification scheme employed for the map being evaluated, is it possible to use these data in the accuracy assessment? Such data in a sense represent "free" reference samples because they already exist. But incorporating them into the accuracy assessment in a statistically valid manner is not simple, because it is difficult to generalize from these data to the population at large. That is, what population do these "found" sites represent? To illustrate the nature of the difficulty, if reference samples were obtained only from areas within, say, 250 meters of a road, the sample data can be generalized only to the portion of the study region within 250 meters of a road. Generalizing beyond that area is not supported by statistical inference and must rest on non-statistical arguments regarding what these sites represent. The same can be said of "found" sites. Because these sites were not selected by probability sampling methods, the population represented by these sites is unknown. The probability sampling protocol must be replaced by assumptions concerning the population represented by the found data. Overton (1990) and Overton et al. (1993) describe some statistical procedures for using found data that apply to this problem. However, the amount of work (cost) needed to use these data in a statistically valid manner appears high, and unless these data are abundant or exceedingly valuable, such as for a very rare cover type, it is doubtful that such data can contribute significantly to reducing the costs of a statistically rigorous accuracy assessment. Incorporating existing data is much more feasible if a valid sampling design was used to obtain the data.

GENERAL RECOMMENDATIONS

Some general recommendations for sampling in large-scale accuracy assessment projects are proposed. As with any general recommendation, it is easy to think of situations in which exceptions would obviously be needed. Recommendations should also evolve over time as better methods and new insights are gained. With those caveats in mind, the following suggestions are proposed.

For accuracy assessment of large-area land-cover maps, stratification by a few large geographic regions is helpful to ensure a regionally representative sample, to provide adequate sample sizes in these regions for reporting region-specific accuracy statistics, and to spread the workload more evenly. As an example of the scale of the recommended stratification, a state such as North Carolina could be stratified into three physiographic regions: the coastal plain, Piedmont, and western upland areas. Bauer et al. (1994) provide another example, as they partitioned their study region (Minnesota) into eight physiographic regions that could be used as strata in an accuracy assessment. States themselves are administratively convenient strata for a land-cover map spanning several states. Within each geographic stratum, a simple design such as simple random or systematic sampling is recommended as the general design; a sketch of such a design appears below.
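As a minimal sketch of this recommended general design (Python; the raster of region labels and the sample sizes are hypothetical, and pixels are indexed in one dimension for simplicity), a systematic sample with a random start is drawn within each geographic stratum:

```python
import numpy as np

rng = np.random.default_rng(7)

def systematic_sample(indices, n):
    """Systematic sample of n items from `indices`, using a random start."""
    step = len(indices) // n
    start = rng.integers(step)
    return indices[start::step][:n]

region_id = rng.integers(1, 4, size=1_000_000)  # three geographic strata
n_per_stratum = 100                             # workload fixed per region

sample = {
    r: systematic_sample(np.flatnonzero(region_id == r), n_per_stratum)
    for r in np.unique(region_id)
}
# sample[r] holds the pixel indices to visit in region r; equal-probability
# sampling within each stratum keeps the analysis simple and leaves room for
# poststratification or regression estimation at the analysis stage.
```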
Implementing a simple design creates the opportunity to employ precision-enhancing analysis techniques such as poststratification and regression estimation. Cluster sampling has been demonstrated to be cost-effective (Moisen et al. 1994), and the combination of cluster sampling with stratification into road and non-road areas (Edwards et al. 1996) has considerable practical appeal. Simple designs are easier to implement in the field, thus increasing the likelihood that the design will be implemented correctly. This general design strategy is adaptable to a variety of analyses and classification systems, and will accommodate the multiple general uses and objectives present in a large-area mapping project.

To accomplish more specialized objectives, a design tailored to these objectives can supplement the general purpose sampling design. For example, if sampling certain rare classes is an important objective, the data from the general design can be supplemented by an additional simple random sample from each rare class. Users must be careful to incorporate proper weights in the analysis when combining data collected by these two different designs. The double sampling and adaptive cluster sampling designs described earlier may also be used to supplement a simpler, general purpose design.

Accuracy assessment of large-area thematic maps creates several challenging statistical problems. These problems, however, are not insurmountable. The classical finite sampling approach has been in use for over 50 years and has been applied to sampling large, spatially dispersed, and hard-to-measure populations. National surveys collecting labor, economic, and health statistics are applied to populations spanning large geographic areas, and these programs face a demanding set of objectives requiring estimates for a variety of parameters at various spatial scales. These problems are characteristic of large-area accuracy assessment efforts. Classical finite sampling theory and methods provide a rich toolbox from which to choose cost-effective sampling strategies, and adapting these strategies to accuracy assessment of large-area thematic maps is a viable approach to pursue.

ACKNOWLEDGMENTS

I thank Ray Czaplewski for his review and helpful suggestions. This work has been supported by cooperative agreement CR821782 between the U.S. EPA and SUNY-ESF.

REFERENCES

Bauer, M.E. et al. 1994. Satellite inventory of Minnesota forest resources. Photogram. Eng. & Remote Sensing 60: 287-298.

Card, D.H. 1982. Using known map category marginal frequencies to improve estimates of thematic map accuracy. Photogram. Eng. & Remote Sensing 48: 431-439.

Cochran, W.G. 1977. Sampling Techniques (3rd ed). Wiley: New York.

Czaplewski, R.L. 1994. Variance approximations for assessments of classification accuracy. Res. Pap. RM-316. Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Forest and Range Experiment Station. 29 p.

Edwards, T.C., Jr., Moisen, G.G., and Cutler, D.R. 1996. Assessing map accuracy in an ecoregion-scale cover-map (in review).

Janssen, L.L.F., and van der Wel, F.J.M. 1994. Accuracy assessment of satellite derived land-cover data: A review. Photogram. Eng. & Remote Sensing 60: 419-426.

Moisen, G.G., Edwards, T.C., Jr., and Cutler, D.R. 1994. Spatial sampling to assess classification accuracy of remotely sensed data. In Environmental Information Management and Analysis: Ecosystem to Global Scales, W.K. Michener, J.W. Brunt, and S.G. Stafford (eds). New York: Taylor and Francis.
Overton, W.S. 1990. A strategy for use of found samples in a rigorous monitoring design. Tech. Rep. 139, Dept. of Statistics, Oregon State University, Corvallis, OR.

Overton, J.McC., Young, T.C., and Overton, W.S. 1993. Using 'found' data to augment a probability sample: procedure and case study. Envir. Monit. and Assmt. 26: 65-83.

Stehman, S.V. 1995. Thematic map accuracy assessment from the perspective of finite population sampling. Inter. J. of Remote Sensing 16: 589-593.

Stehman, S.V. 1996a. Use of auxiliary data to improve the precision of estimators of thematic map accuracy (Remote Sensing of Environment, in review).

Stehman, S.V. 1996b. Estimating standard errors of accuracy assessment statistics under cluster sampling (Remote Sensing of Environment, in review).

Stehman, S.V. 1996c. Sampling design and analysis issues for thematic map accuracy assessment. ASPRS and ACSM Annual Proceedings (to appear).

Thompson, S.K. 1990. Adaptive cluster sampling. J. Amer. Stat. Assoc. 85: 1050-1059.

BIOGRAPHICAL SKETCH

Stephen Stehman is an Associate Professor at SUNY-ESF. He has a B.S. in Biology from Penn State, an M.S. in Statistics from Oregon State, and a Ph.D. in Statistics from Cornell. He provides statistical consulting for faculty and graduate students at ESF and teaches courses in sampling, experimental design, and multivariate statistics.