Measures of Association and Agreement for Describing Land Cover Characterization Classes

Eugene A. Fosnight¹ and Gary W. Fowler²

Abstract.--Several statistical measures of association or agreement are available for comparing land cover characterizations to reference data. This study compares five statistical measures under three sampling schemes for five different class aggregations. Results show that each statistical measure contributes to the overall view of the information. Aggregated land cover characterization classes are less successful in predicting the reference classes. Sample design and size can cause biased estimators.

INTRODUCTION

Confidence in land cover characterizations derived from remotely sensed images is acquired through a thorough description of the characterizations. This paper seeks to describe measures of agreement and association suitable for quantifying the relationship between ground-sampled or photointerpreted data and characterizations of remotely sensed images. This objective consists of three tasks: (1) describe a subset of measures, (2) analyze the distribution of each measure under one simple and two stratified random sampling schemes, and (3) discuss issues of interpretability associated with each statistic.

For this study the U.S. Land Cover Characterization database was compared to a photointerpreted land use and land cover data set. The measures of association and agreement calculated were Cohen's Kappa, Cramer's V, Guttman's Proportional Reduction in Error (PRE), Goodman and Kruskal's Proportion of Explained Variance (PEV), and Percent Correctly Classified (PCC).

MEASURES OF ASSOCIATION AND AGREEMENT

Many measures of association and agreement are available for quantifying the cross-classified tables that are created by cross-tabulating classified images and reference data.
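As a minimal sketch of the cross-tabulation step just described, the following Python builds a table of counts from two parallel label sequences, with the classified image classes as rows and the reference classes as columns. The function name `cross_tabulate` and the six sample labels are hypothetical and serve only as illustration; they are not taken from the study data.

```python
from collections import Counter


def cross_tabulate(map_labels, ref_labels):
    """Return (row_classes, col_classes, table), where table[i][j] counts the
    samples with map class row_classes[i] and reference class col_classes[j]."""
    if len(map_labels) != len(ref_labels):
        raise ValueError("label sequences must have the same length")
    rows = sorted(set(map_labels))
    cols = sorted(set(ref_labels))
    counts = Counter(zip(map_labels, ref_labels))
    # Counter returns 0 for pairs that never occur, giving empty cells.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical labels for six sample points.
map_labels = ["forest", "forest", "range", "range", "water", "forest"]
ref_labels = ["forest", "range", "range", "range", "water", "forest"]
rows, cols, table = cross_tabulate(map_labels, ref_labels)
```

When both label sets use the same classes in the same order, the diagonal of `table` holds the samples on which the two classifications agree.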
Within the remote sensing literature, these tables are often referred to as "confusion tables" or "error matrices" (Czaplewski and Catts 1992, Story and Congalton 1986).

A few general concepts should be defined. The probability for row i and column j is $p_{ij}$. The estimated value of $p_{ij}$ is $\hat{p}_{ij} = x_{ij} / N$, where $x_{ij}$ is the cell count. The expected value, $m_{ij}$, of $x_{ij}$, expressed as a proportion of N and given no information other than the sample marginal frequencies $x_{i+}$ and $x_{+j}$, is $m_{ij} = p_{i+} p_{+j}$, where $p_{i+}$ are the row marginal probabilities and $p_{+j}$ are the column marginal probabilities.

¹ Cartographer, Hughes STX, UNEP/GRID, EROS Data Center, Sioux Falls, SD. Work performed under U.S. Geological Survey contract 1434-92-C-40004.
² Biometrician, School of Natural Resources and Environment, University of Michigan, Ann Arbor, MI.

Measures based on Chi-square

Several measures of association are derivatives of Pearson's Coefficient of Mean-Square Contingency, $\Phi^2$ (Eq. 1). Cramer suggested modifications to restrict the range to between 0 and 1 (Bishop et al. 1975, Conover 1980). The statistical measures are

$$\Phi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - p_{i+} p_{+j})^2}{p_{i+} p_{+j}} \quad \text{and} \quad V = \sqrt{\frac{\Phi^2}{\min[(I-1), (J-1)]}} \tag{1}$$

where the maximum achievable value of $\Phi^2$ is $\min[(I-1), (J-1)]$. As $\Phi^2$ and V depart from zero, the variables are interpreted as becoming less independent.

Proportional Reduction in Error

Goodman and Kruskal (1979) discussed the PRE ($\lambda$) measure based on a probabilistic model (Eq. 2). Two probabilities are compared in calculating PRE. First, the probability of an error in guessing the column with no knowledge of the row is determined from the most frequently occurring column. This probability is $1 - p_{+m}$, where $p_{+m} = \max(p_{+1}, p_{+2}, \ldots, p_{+J})$. Second, the probability of correctly guessing the column given the row is determined from the column that occurs most frequently within each row.
This probability is $\sum_{i=1}^{I} p_{im}$, where $p_{im} = \max(p_{i1}, p_{i2}, \ldots, p_{iJ})$. PRE is the resulting relative reduction in the probability of error:

$$\lambda_{C|R} = \frac{\sum_{i=1}^{I} p_{im} - p_{+m}}{1 - p_{+m}} \tag{2}$$

If the variables are statistically independent, PRE will be zero; however, if PRE is zero, the variables are not necessarily statistically independent. The predictive interpretation of PRE distinguishes it from Chi-square-based measures.

Proportion of Explained Variance

After Bishop et al. (1975), Goodman and Kruskal's PEV ($\tau$) can be explained as the difference between predicting the column given only the marginal probabilities of the columns and predicting the column given the conditional proportions of the rows (Eq. 3). PEV can be cast into analysis of variance terminology, where the total variation is defined as:

$$TSS = \frac{N}{2} - \frac{1}{2N} \sum_{j=1}^{J} x_{+j}^2$$

This definition of total variation meets the criterion of equaling zero if and only if all counts are in the same column, and it is maximized when the marginal frequencies of the columns are uniformly distributed. The total variation is split into between (BSS) and within (WSS) class components:

$$WSS = \frac{N}{2} - \frac{1}{2} \sum_{i=1}^{I} \frac{1}{x_{i+}} \sum_{j=1}^{J} x_{ij}^2 \quad \text{and} \quad BSS = TSS - WSS$$

PEV is then set equal to the ratio of the between class component and the total variation:

$$\tau_{C|R} = \frac{BSS}{TSS} \tag{3}$$

The measure PEV for the column given the rows can be described as the percentage of the variation in the columns that can be explained knowing the rows.

Measures of Agreement

Measures of agreement have more restrictive class specifications than do measures of association, but they also provide more powerful tests if the class specifications can be met. The cross-classified table must have identical classes in the same order for both rows and columns. This restriction gives special meaning to the diagonal of the table.

PCC, the simplest measure of agreement, is calculated by dividing the sum of the diagonal cells (all samples correctly classified) by the total number of samples (Eq. 4):

$$PCC = \sum_{i=1}^{I} p_{ii} \quad \text{and} \quad \widehat{PCC} = \sum_{i=1}^{I} x_{ii} / N \tag{4}$$

This is the most common statistic for assessing accuracy in the remote sensing literature. PCC has a highly intuitive interpretation. Cohen (1960) proposed a measure of agreement, Kappa (Eq.
5) that adjusts PCC for the probability of chance agreement. Chance agreement is the probability of agreement given only the sample marginal proportions.

$$\kappa = \frac{\sum_{i=1}^{I} p_{ii} - \sum_{i=1}^{I} p_{i+} p_{+i}}{1 - \sum_{i=1}^{I} p_{i+} p_{+i}} \tag{5}$$

Kappa equals 1 if there are no off-diagonal counts. With the exception of perfect agreement, Kappa will always be lower than PCC. Unfortunately, as noted by Card (1982) and Hay (1979), PCC can also be a highly biased estimator.

Calibration

The estimated probability of each cell in the cross-classified table (Eq. 6) can be calibrated by incorporating the known population marginal frequencies:

$$\hat{p}^{*}_{ij} = \frac{x_{ij}}{x_{i+}} \cdot \frac{X_{i+}}{\sum_{k=1}^{I} X_{k+}} \tag{6}$$

where $X_{i+}$ are the population marginal frequencies. Calibrated estimates of the cell counts (Eq. 7) can then be calculated from the cell probabilities and the known finite population:

$$x^{*}_{ij} = \hat{p}^{*}_{ij} N \tag{7}$$

Card (1982) and Czaplewski and Catts (1992) described in detail calibration techniques for deriving calibrated estimates for the areas of reference classes given known population frequencies of the modeled classes.

The measures of association and agreement, Cramer's V, Guttman's PRE, Goodman and Kruskal's PEV, PCC, and Cohen's Kappa, form a representative sample of the measures that could contribute toward assessing the accuracy of remotely sensed classifications.

EXPERIMENTAL DATA

The study area is the conterminous United States. The U.S. Land Cover Characterization database is compared to the reference U.S. Geological Survey (USGS) Land Cover and Land Use digital data. The U.S. Land Cover Characterization database is derived from Advanced Very High Resolution Radiometer (AVHRR) images (Loveland et al. 1991, Brown et al. 1993). The USGS Land Cover and Land Use digital data are derived from photographs at scales smaller than 1:60,000 (Anderson 1976, USGS 1990). The two classifications were grouped into several alternative aggregations.
Three levels of the Anderson classification scheme were created for the reference data set: (1) the 37-class Anderson Level II (lulc), (2) a 14-class modified Anderson that retains the classes of natural vegetation, range (mixed, shrub and brush, and herbaceous), forest (deciduous, evergreen, and mixed), and wetlands (forested and nonforested) at Anderson Level II, while collapsing the remaining classes to Anderson Level I (lulcmA), and (3) an 8-class Anderson Level I (lulcI). The Level I classification was determined by Anderson to be the class resolution attainable by Landsat Multispectral Scanner scale classifications. The Level II classification was considered attainable using small-scale (less than 1:60,000) aerial photographs. The detailed temporal information available with AVHRR images (daily coverage) and the ancillary information provide the possibility of attaining at least Level I class resolutions.

Three classifications were created from the land cover characterization: (1) the ungrouped classes (lcc), (2) an aggregation corresponding to the lulcI classes (lccI), and (3) an aggregation corresponding to the lulcmA classes (lccmA).

The classifications of the land cover characterization classes and reference land use and land cover classes were cross-classified into five tables: lcclulc, lcclulcmA, lcclulcI, lccmAlulcmA, and lccIlulcI. The first three cross-classified tables (lcclulc, lcclulcmA, and lcclulcI) are designed to show how well the full set of land cover characterization classes can predict decreasing levels of class resolution within the Anderson classification. For the square cross-classified tables (lccmAlulcmA and lccIlulcI), Cohen's Kappa and PCC are used to measure agreement between the variables.

RESULTS

The parameters provide a baseline for comparison with simulation results.
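The population parameters are the measures defined earlier, evaluated on a full cross-classified table. The sketch below shows one way they might be computed in plain Python, using the proportion forms of Cramer's V, PRE, and PEV for predicting the column class from the row class, and adding PCC and Kappa only when the table is square. The function name `measures` and the small example table are hypothetical. The sketch assumes at least two rows and two columns and more than one non-empty column, and it skips cells whose expected proportion is zero.

```python
import math


def measures(table):
    """Compute V, PRE and PEV (column given row) for a table of counts,
    plus PCC and Kappa when the table is square."""
    n = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    p = [[x / n for x in row] for row in table]
    p_row = [sum(row) for row in p]
    p_col = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Cramer's V from the mean-square contingency.
    phi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = p_row[i] * p_col[j]
            if expected > 0:
                phi2 += (p[i][j] - expected) ** 2 / expected
    v = math.sqrt(phi2 / min(n_rows - 1, n_cols - 1))

    # PRE: relative reduction in the probability of a wrong guess.
    best_col = max(p_col)
    best_given_row = sum(max(row) for row in p)
    pre = (best_given_row - best_col) / (1 - best_col)

    # PEV: between-class share of the total variation in the columns.
    col_sq = sum(q * q for q in p_col)
    cond_sq = sum(sum(q * q for q in p[i]) / p_row[i]
                  for i in range(n_rows) if p_row[i] > 0)
    pev = (cond_sq - col_sq) / (1 - col_sq)

    result = {"V": v, "PRE": pre, "PEV": pev}
    if n_rows == n_cols:
        pcc = sum(p[i][i] for i in range(n_rows))
        chance = sum(p_row[i] * p_col[i] for i in range(n_rows))
        result["PCC"] = pcc
        result["Kappa"] = (pcc - chance) / (1 - chance)
    return result


# Hypothetical 3 x 3 table of counts (rows: map classes, columns: reference).
example = measures([[2, 1, 0], [0, 2, 0], [0, 0, 1]])
```

For a rectangular table such as lcclulc, the returned dictionary would hold only the three measures of association, which matches the restriction of Kappa and PCC to square tables.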
When lcc is compared to the three lulc aggregations, the ability of lcc to predict lulc, measured by $PRE_{lulc|lcc}$, increases as the number of classes in lulc decreases. Likewise, the proportion of explained variance, as measured by $PEV_{lulc|lcc}$, and the lack of independence, as measured by Cramer's V, also increase. The least amount of predictive or explanatory power exists when lcc is aggregated to the modified Anderson classification, where the class resolution of the land cover characterization classes is stretched to the limit. When the data sets are aggregated to a class resolution more appropriate to the AVHRR satellite, as in lccIlulcI, the measures approach their maximum.

The results of the Monte Carlo simulations for simple random sampling, stratified random sampling with probability proportional to size, and stratified random sampling with equal probabilities are shown in Figures 1, 2, and 3, respectively. The Monte Carlo simulations consisted of a thousand repetitions calculated for each of the measures for random samples selected without replacement from the full data set. The population parameters are shown on each of the figures.

All of the measures have inflated values for small sample sizes extracted from the large rectangular tables. Whether this is caused by the large number of cells (and large number of empty cells) or the rectangular nature of the tables is unknown; however, the effect is least pronounced in table lcclulcI, which is the most rectangular and has the fewest empty cells. Cramer's V shows some bias for all tables at small sample sizes. For simple random sampling and stratified random sampling with probability proportional to size, the distributions are very similar. Stratified random sampling with probability proportional to size shows slightly larger biases for small sample sizes.
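The simple random sampling simulation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the helper names `simple_random_sample` and `monte_carlo`, the 3 x 3 population table, and the 10 percent sampling fraction are all hypothetical, and PCC stands in for any of the five measures.

```python
import random
import statistics


def pcc(table):
    """Percent correctly classified: diagonal count over total count."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


def simple_random_sample(table, fraction, rng):
    """Draw a simple random sample without replacement from the units
    summarised by a table of counts; return the sampled table of counts."""
    units = [(i, j) for i, row in enumerate(table)
             for j, count in enumerate(row) for _ in range(count)]
    size = max(1, round(fraction * len(units)))
    sampled = [[0] * len(table[0]) for _ in table]
    for i, j in rng.sample(units, size):
        sampled[i][j] += 1
    return sampled


def monte_carlo(table, fraction, statistic, repetitions=1000, seed=0):
    """Return the statistic computed on each of `repetitions` samples."""
    rng = random.Random(seed)
    return [statistic(simple_random_sample(table, fraction, rng))
            for _ in range(repetitions)]


# Hypothetical 3 x 3 population table (2,000 units in total).
population = [[800, 150, 50], [100, 500, 100], [20, 80, 200]]
values = monte_carlo(population, fraction=0.10, statistic=pcc)
summary = (statistics.mean(values), statistics.stdev(values))
```

Comparing the distribution of `values` with the parameter `pcc(population)` is the kind of comparison that the boxplots and horizontal lines in the figures summarize.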
The stratified sampling by lcc class enforced a more even distribution of samples across lulc classes, causing fewer classes with a marginal frequency of zero.

Under stratified random sampling with equal probabilities, PRE, PEV, and Cramer's V are consistently overestimated, while Kappa and PCC are consistently underestimated (Figure 3). Equal probability sampling results in small classes being over-sampled and large classes being under-sampled. For large percent samples, the small classes are exhaustively sampled, and hence cannot achieve the target percentage. Certainly, in the case of a 10 percent sample, all 159 lcc classes cannot be sampled at 10 percent of the total. The very smallest classes are nearly exhaustively sampled with the 0.01 percent sample. Because equal probability sampling is not necessarily representative of the population, it becomes susceptible to differential accuracy among the classes. The net result is that the statistical measures are highly biased for even large sample sizes, and the direction of the bias depends on the statistic and per class accuracies.

[Figure 1 panels: PRE C|R, PEV C|R, V; table lccIlulcI]

Figure 1.--Results from Monte Carlo simulation of simple random sampling for measures of association and agreement. Missing boxplots result either from all zero cells for an lcc or an lulc class or from computation time too long to complete. The horizontal lines through the boxes are the population parameters.

[Figure 2 panels: PRE C|R, PEV C|R, V, Kappa, PCC; tables lcclulcI and lccIlulcI; sample sizes 0.01%, 0.1%, 1%, 10%]

Figure 2.--Results from Monte Carlo simulation of stratified random sampling with probability proportional to size for measures of association and agreement. Missing boxplots result from all zero cells for either an lcc or an lulc class. The horizontal lines through the boxes are the population parameters.
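The effect of equal probability allocation, and the row-wise calibration of Eq. 6, can be illustrated with a small sketch. It assumes that the rows of the table are the strata (the lcc classes in this study) and that the population row totals are known. The function names `stratified_equal_sample` and `calibrate`, the population table, and the allocation of 50 units per class are hypothetical.

```python
import random


def stratified_equal_sample(table, per_class, rng):
    """Sample up to `per_class` units without replacement from each row
    (stratum) of a table of counts; rows smaller than that are exhausted."""
    sampled = []
    for row in table:
        units = [j for j, count in enumerate(row) for _ in range(count)]
        chosen = rng.sample(units, min(per_class, len(units)))
        sampled.append([chosen.count(j) for j in range(len(row))])
    return sampled


def calibrate(sampled, row_totals):
    """Return calibrated cell proportions: the sampled row-conditional
    proportions weighted by the known population row proportions."""
    population_size = sum(row_totals)
    calibrated = []
    for row, total in zip(sampled, row_totals):
        n_row = sum(row)
        share = total / population_size
        calibrated.append([share * x / n_row if n_row else 0.0 for x in row])
    return calibrated


def pcc(table):
    """Diagonal sum over table total (works for counts or proportions)."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


# Hypothetical population: one large accurate class, two small weaker ones.
population = [[900, 100, 0], [30, 60, 10], [10, 10, 30]]
row_totals = [sum(row) for row in population]
rng = random.Random(0)
sample = stratified_equal_sample(population, per_class=50, rng=rng)
raw_pcc = pcc(sample)
calibrated_pcc = pcc(calibrate(sample, row_totals))
```

In this hypothetical table the smallest class is exhausted by the allocation and the small classes are less accurate than the large one, so the uncalibrated `raw_pcc` would be expected to fall below the population value of 990/1150 (about 0.86), while `calibrated_pcc` would be expected to lie closer to it.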
[Figure 3 panels: PRE C|R, PEV C|R, V, PCC; tables lcclulc, lcclulcmA, and lccIlulcI; sample sizes 0.01%, 0.1%, 1%, 10%]

Figure 3.--Boxplots of results from Monte Carlo simulation of stratified random sampling with equal probability allocation for measures of association and agreement. Missing boxplots result from all zero cells for either an lcc or an lulc class. The horizontal lines through the boxes are the population parameters.

Equal probability sampling ensures a sufficient sample size of each stratified class. This allows a sufficient description of each of the stratified classes, in this case the lcc classes. Under probability proportional to size sampling, some classes may be under-represented. For example, rare classes may be sampled only once, if at all, to meet strict probability proportional to size criteria.

The cell counts resulting from stratified random sampling with equal probabilities can be adjusted to better conform to cell probabilities yielded through simple random sampling. In this Monte Carlo simulation, each sample from the cross-classified tables lcclulcI and lccIlulcI was selected with equal probabilities and was calibrated using the population marginal sums for the lcc classes. The results from the calibration are shown in Figure 4.

[Figure 4 panels: PRE C|R, PEV C|R, V, Kappa, PCC; tables lcclulcI and lccIlulcI; sample sizes 0.01%, 0.1%, 1%, 10%]

Figure 4.--Results from Monte Carlo simulation of stratified random sampling with equal probability allocation for measures of association and agreement after calibration of cell counts. The horizontal lines through the boxes are the population parameters.

The distributions after calibration for the cross-classified table lccIlulcI are very similar to those from the simple random sample. For the larger rectangular table lcclulcI, even though the biases are removed for large sample sizes, the biases for small samples grew to be even larger.
For this type of table, the small sample sizes and large number of empty cells prevent the calibration from being well controlled. Unfortunately, these are the practical sample sizes for actual applications.

CONCLUSION

Only three of the measures, PRE, PEV, and Cramer's V, are suitable for rectangular tables. PRE and PEV are asymmetrical predictive measures, and Cramer's V is a measure of independence. Cramer's V shows some tendency to be biased for small sample sizes, even for a small number of classes. Both PEV and PRE were designed for ease of interpretation. Cramer's V can perhaps be used best in tandem with other measures. For example, a table can be tested for independence with Cramer's V, and then the strength of the association can be measured with PRE and PEV. For those cross-classified tables where Kappa is applicable, it provides the most robust statistic; however, PCC is the simplest to understand.

REFERENCES

Anderson, J.R., Hardy, E.E., Roach, J.T., and Witmer, R.E., 1976, A Land Use and Land Cover Classification System for Use with Remote Sensor Data: Washington, DC, U.S. Government Printing Office.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W., 1975, Discrete Multivariate Analysis: Cambridge, Massachusetts, The MIT Press.

Brown, J.F., Loveland, T.R., Merchant, J.W., Reed, B.C., and Ohlen, D.O., 1993, Using Multisource Data in Global Land-Cover Characterization: Concepts, Requirements, and Methods: Photogrammetric Engineering and Remote Sensing, v. 59, no. 6, p. 977-987.

Card, D.H., 1982, Using Known Map Category Marginal Frequencies to Improve Estimates of Thematic Map Accuracy: Photogrammetric Engineering and Remote Sensing, v. 48, no. 3, p. 431-439.

Cohen, J., 1960, A Coefficient of Agreement for Nominal Scales: Educational and Psychological Measurement, v. 20, no. 1, p. 37-46.

Conover, W.J., 1980, Practical Nonparametric Statistics: New York, NY, John Wiley & Sons.

Czaplewski, R.L., and Catts, G.P., 1992,
Calibration of Remotely Sensed Proportion or Area Estimates for Misclassification Error: Remote Sensing of Environment, v. 39, p. 29-43.

Goodman, L.A., and Kruskal, W.H., 1979, Measures of Association for Cross Classifications, Springer Series in Statistics: New York, NY, Springer-Verlag.

Hay, A.M., 1979, Sampling Designs to Test Land-Use Map Accuracy: Photogrammetric Engineering and Remote Sensing, v. 45, no. 4, p. 529-533.

Loveland, T.R., Merchant, J.W., Ohlen, D.O., and Brown, J.F., 1991, Development of a Land-Cover Characteristics Database for the Conterminous U.S.: Photogrammetric Engineering and Remote Sensing, v. 57, no. 11, p. 1,453-1,463.

Story, M., and Congalton, R.G., 1986, Accuracy Assessment: A User's Perspective: Photogrammetric Engineering and Remote Sensing, v. 52, no. 3, p. 397-399.

U.S. Geological Survey, 1990, Land Use and Land Cover Digital Data from 1:250,000- and 1:100,000-Scale Maps: Reston, Virginia, U.S. Geological Survey.

ACKNOWLEDGMENTS

The research described in this article has been supported by the U.S. Environmental Protection Agency (EPA) through Interagency Agreement IAG DW 14936073 to the U.S. Geological Survey and through the National Aeronautics and Space Administration in support of the United Nations Environment Programme. However, it has not been subjected to EPA review and therefore does not necessarily reflect the views of the agency. No official endorsement should be inferred by either the U.S. Government or the United Nations Environment Programme.

BIOGRAPHICAL SKETCH

Eugene A. Fosnight is a cartographer with the United Nations Environment Programme's Global Resource Information Database (GRID), Sioux Falls, S. Dak. He graduated from Purdue University with a B.S. in 1972, from University College Swansea with a diploma in cartography in 1980, and from the University of Michigan with an M.S. in remote sensing in 1992. Gene provides cartographic, remote-sensing, and statistical support for GRID-Sioux Falls.

Gary W.
Fowler is a Professor of biometrics with the School of Natural Resources and Environment at the University of Michigan, Ann Arbor, Mich. He graduated from the University of California with a Ph.D. in 1969. Gary's research concentrates on sequential sampling, estimation of forest stem parameters, volume-basal area ratios for use in horizontal point sampling, efficient sampling of endemic forest insect populations, development of new volume equations for commercial coniferous species, and statistical properties of species diversities.