This file was created by scanning the printed publication.
Errors identified by the software have been corrected;
however, some errors may remain.
Measures of Association and
Agreement for Describing Land Cover
Characterization Classes
Eugene A. Fosnight¹ and Gary W. Fowler²
Abstract.-- Several statistical measures of association or agreement are
available for comparing land cover characterizations to reference data.
This study compares five statistical measures under three sampling
schemes for five different class aggregations. Results show that each
statistical measure contributes to the overall view of the information.
Aggregated land cover characterization classes are less successful in
predicting the reference classes. Sample design and size can cause biased
estimators.
INTRODUCTION
Confidence in land cover characterizations derived from remotely sensed
images is acquired through a thorough description of the characterizations. This
paper seeks to describe measures of agreement and association suitable for
quantifying the relationship between ground-sampled or photointerpreted data and
characterizations of remotely sensed images. This objective consists of three tasks:
(1) describe a subset of measures, (2) analyze the distribution of each measure
under one simple and two stratified random sampling schemes, and (3) discuss
issues of interpretability associated with each statistic.
For this study the U.S. Land Cover Characterization database was compared to
a photointerpreted land use and land cover data set. The measures of association
and agreement calculated were Cohen's Kappa, Cramer's V, Guttman's
Proportional Reduction in Error (PRE), Goodman and Kruskal's Proportion of
Explained Variance (PEV), and Percent Correctly Classified (PCC).
MEASURES OF ASSOCIATION AND AGREEMENT
Many measures of association and agreement are available for quantifying the
cross-classified tables that are created by cross-tabulating classified images and
reference data. Within the remote sensing literature, these tables are often referred to
as "confusion tables" or "error matrices" (Czaplewski and Catts 1992, Story and
Congalton 1986).
A few general concepts should be defined. The probability for row i and
column j is $p_{ij}$. The estimated value of $p_{ij}$ is $\hat{p}_{ij} = x_{ij} / N$, where $x_{ij}$ is the cell
count. The expected value, $m_{ij}$, of $x_{ij}$, given no information other than the sample
marginal frequencies, $x_{i+}$ and $x_{+j}$, is $m_{ij} = N p_{i+} p_{+j}$, where $p_{i+}$ are the row marginal
probabilities and $p_{+j}$ are the column marginal probabilities.

¹ Cartographer, Hughes STX, UNEP/GRID, EROS Data Center, Sioux Falls, SD. Work performed under U.S.
Geological Survey contract 1434-92-C-40004.
² Biometrician, School of Natural Resources and Environment, University of Michigan, Ann Arbor, MI.
Measures based on Chi-square
Several measures of association are derivatives of Pearson's Coefficient $\Phi^2$ of
Mean-Square Contingency (Eq. 1). Cramer suggested modifications to restrict the
range to between 0 and 1 (Bishop et al. 1975, Conover 1980). The statistical
measures are

\[ \Phi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - p_{i+} p_{+j})^2}{p_{i+} p_{+j}} \qquad \text{and} \qquad \text{Cramer's } V = \sqrt{\frac{\Phi^2}{\min[(I - 1), (J - 1)]}} \]  [1]

where the maximum achievable value of $\Phi^2$ is min[(I - 1), (J - 1)].
As $\Phi^2$ and V depart from zero, the variables are interpreted as becoming less
independent.
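As an illustration of Eq. 1, the following minimal sketch computes $\Phi^2$ and Cramer's V from a table of cell counts. The function names and the example tables are hypothetical, and the square-root form of V is assumed from the standard definition; rows or columns with zero marginal totals are simply skipped, which is one of several possible conventions.

```python
import math

def phi_squared(table):
    """Pearson's mean-square contingency for a list-of-lists count table."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]          # p_{i+}
    col_p = [sum(col) / n for col in zip(*table)]    # p_{+j}
    phi2 = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            expected = row_p[i] * col_p[j]
            if expected > 0:  # assumption: ignore empty rows and columns
                phi2 += (x / n - expected) ** 2 / expected
    return phi2

def cramers_v(table):
    """Cramer's V: Phi^2 rescaled by its maximum, then square-rooted."""
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(phi_squared(table) / k)

# Hypothetical 2 x 2 tables: perfect association and exact independence.
print(cramers_v([[10, 0], [0, 10]]))
print(cramers_v([[5, 5], [5, 5]]))
```

The first table yields the maximum value of V and the second yields zero, matching the interpretation that departure from zero indicates departure from independence.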
Proportional Reduction in Error
Goodman and Kruskal (1979) discussed the PRE ($\lambda$) measure based on a
probabilistic model (Eq. 2). Two probabilities are compared in calculating PRE.
First, the probability of an error in guessing the column by chance is
determined by finding the most frequently occurring column. This probability is
$1 - p_{+m}$, where $p_{+m} = \max(p_{+1}, p_{+2}, \ldots, p_{+J})$.
Second, the probability of correctly guessing the column given the rows is
determined by finding, for each row, the column that occurs most frequently.
This probability is $\sum_{i=1}^{I} p_{im}$, where $p_{im} = \max(p_{i1}, p_{i2}, \ldots, p_{iJ})$. PRE is the
relative reduction in the probability of error:

\[ \lambda_{C|R} = \frac{\sum_{i=1}^{I} p_{im} - p_{+m}}{1 - p_{+m}} \]  [2]
If the variables are statistically independent, PRE will be zero; however, if PRE
is zero, the variables are not necessarily statistically independent. The predictive
interpretation of PRE distinguishes it from Chi-square-based measures.
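The predictive reading of PRE can be sketched in a few lines. The code below assumes the standard Goodman and Kruskal form for predicting the column from the row; the function name and the example counts are hypothetical. The second example is a table that is not independent yet has PRE equal to zero, because every row shares the same most frequent column.

```python
import math

def pre_columns_given_rows(table):
    """Proportional reduction in error when guessing the column from the row."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    best_without_rows = max(col_totals) / n               # p_{+m}
    best_with_rows = sum(max(row) for row in table) / n   # sum of p_{im}
    if best_without_rows == 1.0:
        return math.nan  # all counts in one column: no error to reduce
    return (best_with_rows - best_without_rows) / (1.0 - best_without_rows)

# Rows predict the column perfectly.
print(pre_columns_given_rows([[10, 0], [0, 10]]))
# Row proportions differ (0.75 vs. about 0.56 in column 1), so the table is
# not independent, yet column 1 is the best guess in every row.
print(pre_columns_given_rows([[30, 10], [25, 20]]))
```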
Proportion of Explained Variance
After Bishop et al. (1975), Goodman and Kruskal's PEV ($\tau$) can be explained as the
difference between predicting the column given only the marginal probabilities of the
columns and predicting the column given the conditional proportions of the rows
(Eq. 3).
PEV can be cast into analysis of variance terminology, where the total variation
is defined as:

\[ TSS = \frac{N}{2} - \frac{1}{2N} \sum_{j=1}^{J} x_{+j}^2 \]

This definition of total variation meets the criterion of equaling zero if and
only if all counts are in the same column, and it is maximized when the marginal
frequencies of the columns are uniformly distributed.
The total variation is split into between (BSS) and within (WSS) class
components:

\[ WSS = \frac{N}{2} - \frac{1}{2} \sum_{i=1}^{I} \frac{1}{x_{i+}} \sum_{j=1}^{J} x_{ij}^2 \qquad \text{and} \qquad BSS = TSS - WSS \]

PEV is then set equal to the ratio of the between class component and the total
variation.

\[ \tau_{C|R} = \frac{BSS}{TSS} \]  [3]
The measure PEV for the column given the rows can be described as the
percentage of the variation in the columns that can be explained knowing the rows.
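The analysis of variance view of PEV can be illustrated with a short sketch. It assumes the categorical sums of squares given by Bishop et al. (1975) for nominal data (total variation proportional to one minus the sum of squared column proportions); the function name and example tables are hypothetical.

```python
import math

def pev_columns_given_rows(table):
    """Proportion of column variation explained by knowing the row."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # Total variation: zero only when all counts fall in a single column.
    tss = n / 2 - sum(c * c for c in col_totals) / (2 * n)
    # Within-row variation, skipping rows that have no counts.
    wss = n / 2 - sum(
        sum(x * x for x in row) / sum(row) for row in table if sum(row) > 0
    ) / 2
    if tss == 0:
        return math.nan  # nothing to explain
    return (tss - wss) / tss  # BSS / TSS

print(pev_columns_given_rows([[10, 0], [0, 10]]))  # rows explain everything
print(pev_columns_given_rows([[5, 5], [5, 5]]))    # rows explain nothing
```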
Measures of Agreement
Measures of agreement have more restrictive class specifications than do
measures of association, but they also provide more powerful tests if the class
specifications can be met. The cross-classified table must have identical classes in
the same order for both rows and columns. This restriction gives special meaning
to the diagonal of the table.
PCC, the simplest measure of agreement, is calculated by dividing the sum of
the diagonal cells (all samples correctly classified) by the total number of samples
(Eq. 4).
\[ PCC = \sum_{i=1}^{I} p_{ii} \qquad \text{and} \qquad \widehat{PCC} = \sum_{i=1}^{I} x_{ii} / N \]  [4]
This is the most common statistic for assessing accuracy in the remote sensing
literature. PCC has a highly intuitive interpretation.
Cohen (1960) proposed a measure of agreement, Kappa, (Eq. 5) that adjusts
PCC for the probability of chance agreement. Chance agreement is the probability
of agreement given only the sample marginal proportions.
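\[ \kappa = \frac{\sum_{i=1}^{I} p_{ii} - \sum_{i=1}^{I} p_{i+} p_{+i}}{1 - \sum_{i=1}^{I} p_{i+} p_{+i}} \]  [5]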
Kappa equals 1 if there are no off-diagonal counts. With the exception of
perfect agreement Kappa will always be lower than PCC. Unfortunately, as noted
by Card (1982) and Hay (1979), PCC can also be a highly biased estimator.
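A brief sketch of the two measures of agreement follows, assuming a square table whose rows and columns list the same classes in the same order. The function names and the counts are hypothetical.

```python
def pcc(table):
    """Percent correctly classified: diagonal counts over total counts."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n

def kappa(table):
    """Cohen's Kappa: PCC adjusted for agreement expected by chance."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    chance = sum(r * c for r, c in zip(row_p, col_p))
    return (pcc(table) - chance) / (1.0 - chance)

# Hypothetical 2-class error matrix with 100 samples.
table = [[40, 10], [5, 45]]
print(pcc(table))    # 85 of 100 samples lie on the diagonal
print(kappa(table))  # chance agreement is 0.5 here, so Kappa is below PCC
```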
Calibration
The estimated probability of each cell in the cross-classified table (Eq. 6) can be
calibrated by incorporating the known population marginal frequencies:

\[ \hat{p}^{*}_{ij} = \frac{x_{ij} X_{i+}}{x_{i+} N}, \quad \text{where } X_{i+} = \sum_{j=1}^{J} X_{ij} \text{ are the population marginal frequencies} \]  [6]

and N here is the known finite population size, the sum of the $X_{i+}$.
Calibrated estimates of the cell counts (Eq. 7) can then be calculated from the
cell probabilities and the known finite population size:

\[ x^{*}_{ij} = \hat{p}^{*}_{ij} N \]  [7]
Card (1982) and Czaplewski and Catts (1992) described in detail calibration
techniques for deriving calibrated estimates for the areas of reference classes given
known population frequencies of the modeled classes.
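The calibration idea can be sketched as follows, assuming that each sampled row is rescaled to its known share of the population, in the spirit of Card (1982). The function name, the sample counts, and the population totals are hypothetical, and rows with no samples are given zero probability, which is a simplification.

```python
def calibrate(sample, population_row_totals):
    """Rescale sampled rows to known population row totals.

    Returns (cell probabilities, calibrated cell counts).
    """
    pop_n = sum(population_row_totals)
    probs = []
    for row, pop_total in zip(sample, population_row_totals):
        row_n = sum(row)
        if row_n == 0:
            probs.append([0.0] * len(row))  # assumption: unsampled row
        else:
            probs.append([x / row_n * pop_total / pop_n for x in row])
    counts = [[p * pop_n for p in row] for row in probs]
    return probs, counts

# Equal allocation drew 10 samples from each of two classes, although the
# first class covers 90 percent of a hypothetical population of 1,000.
probs, counts = calibrate([[8, 2], [1, 9]], [900, 100])
print(probs)
print(counts)
```

After calibration the cell probabilities sum to one and each row total of the calibrated counts matches the known population total for that class.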
The measures of association and agreement, Cramer's V, Guttman's PRE,
Goodman and Kruskal's PEV, PCC, and Cohen's Kappa form a representative
sample of the measures that could contribute toward assessing the accuracy of
remotely sensed classifications.
EXPERIMENTAL DATA
The study area is the conterminous United States. The U.S. Land Cover
Characterization database is compared to the reference U.S. Geological Survey
(USGS) Land Cover and Land Use digital data. The U.S. Land Cover
Characterization database is derived from Advanced Very High Resolution
Radiometer (AVHRR) images (Loveland et al. 1991, Brown et al. 1993). The
USGS Land Cover and Land Use digital data are derived from photographs at
scales smaller than 1:60,000 (Anderson 1976, USGS 1990).
The two classifications were grouped into several alternative aggregations.
Three levels of the Anderson classification scheme were created for the reference
data set: (1) the 37-class Anderson Level II (lulc), (2) a 14-class modified Anderson
that retains the classes of natural vegetation, range (mixed, shrub and brush, and
herbaceous), forest (deciduous, evergreen, and mixed), and wetlands (forested and
nonforested) at Anderson Level II, while collapsing the remaining classes to
Anderson Level I (lulcmA), and (3) an 8-class Anderson Level I (lulcI).
The Level I classification was determined by Anderson to be the class resolution
attainable by Landsat Multispectral Scanner scale classifications. The Level II
classification was considered attainable using small-scale (less than 1:60,000) aerial
photographs. The detailed temporal information available with AVHRR images
(daily coverage) and the ancillary information provide the possibility of attaining at
least Level I class resolutions.
Three classifications were created from the land cover characterization: (1) the
ungrouped classes (lcc), (2) an aggregation corresponding to the lulcI (lccI) classes,
and (3) an aggregation corresponding to the lulcmA (lccmA) classes.
The classifications of the land cover characterization classes and reference land
use land cover classes were cross-classified into five tables: lcclulc, lcclulcmA,
lcclulcI, lccmAlulcmA, and lccIlulcI. The first three cross-classified tables (lcclulc,
lcclulcmA, and lcclulcI) are designed to show how well the full set of land cover
characterization classes can predict decreasing levels of class resolution within the
Anderson classification. The square cross-classified tables (lccmAlulcmA and
lccIlulcI) use Cohen's Kappa and PCC statistics for measuring agreement between
the variables.
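Aggregating a cross-classified table amounts to summing the counts of the cells that collapse into the same coarser class. The sketch below assumes each original class is assigned an integer index of its aggregated class; the function name, the small table, and the mappings are hypothetical and do not reproduce the actual lcc or Anderson groupings.

```python
def aggregate(table, row_map, col_map):
    """Collapse rows and columns of a count table into coarser classes.

    row_map[i] and col_map[j] give the aggregated class index of the
    original row i and column j.
    """
    n_rows = max(row_map) + 1
    n_cols = max(col_map) + 1
    out = [[0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            out[row_map[i]][col_map[j]] += x
    return out

# A 3 x 3 table collapsed to 2 x 2: rows 0 and 1 merge, columns 1 and 2 merge.
fine = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(aggregate(fine, row_map=[0, 0, 1], col_map=[0, 1, 1]))
```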
RESULTS
The population parameters provide a baseline for comparison with simulation results.
When lcc is compared to the three lulc aggregations, the ability of lcc to predict lulc,
measured by $PRE_{lulc|lcc}$, increases as the number of classes in lulc decreases;
likewise, the proportion of explained variance, as measured by $PEV_{lulc|lcc}$, and the
lack of independence, as measured by Cramer's V, also increase. The least amount
of predictive or explanatory power exists when lcc is aggregated to the modified
Anderson classification, where the class resolution of the land cover
characterization classes is stretched to the limit. When the data sets are aggregated
to class resolution more appropriate to the AVHRR satellite, as in lccIlulcI, the
measures approach their maximum.
The results of the Monte Carlo simulations for simple random sampling,
stratified random sampling with probability proportional to size, and stratified
random sampling with equal probabilities are shown in Figures 1, 2, and 3,
respectively. The Monte Carlo simulations consisted of a thousand repetitions
calculated for each of the measures for random samples selected without
replacement from the full data set. The population parameters are shown on each of
the figures.
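The simple random sampling simulation can be sketched as follows. This is an illustrative outline rather than the procedure actually used: it expands the table into individual units, which is practical only for small tables, and the function names, the example table, and the choice of PCC as the statistic are assumptions.

```python
import random

def pcc(table):
    """Percent correctly classified for a square count table."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n

def sample_table(table, k, rng):
    """Draw k units without replacement and re-tabulate them."""
    units = [(i, j) for i, row in enumerate(table)
             for j, x in enumerate(row) for _ in range(x)]
    out = [[0] * len(table[0]) for _ in table]
    for i, j in rng.sample(units, k):
        out[i][j] += 1
    return out

def simulate(table, fraction, statistic, repetitions=1000, seed=0):
    """Repeat the sampling and return the statistic for every repetition."""
    rng = random.Random(seed)
    n = sum(sum(row) for row in table)
    k = max(1, round(n * fraction))
    return [statistic(sample_table(table, k, rng)) for _ in range(repetitions)]

# Hypothetical population table; compare the draws with the parameter.
population = [[400, 100], [50, 450]]
draws = simulate(population, 0.10, pcc, repetitions=200)
print(pcc(population), min(draws), max(draws))
```

The spread of the draws around the population value is what a boxplot for a given sample size summarizes.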
All of the measures have inflated values for small sample sizes extracted from
the large rectangular tables. Whether this is caused by the large number of cells
(and large number of empty cells) or the rectangular nature of the tables is
unknown; however, the effect is least pronounced in table lcclulcI, which is the
most rectangular and has the fewest empty cells. Cramer's V shows some bias for
all tables at small sample sizes.
For simple random sampling and stratified random sampling with probability
proportional to size, the distributions are very similar. Stratified random sampling
with probability proportional to size shows slightly larger biases for small sample
sizes. The stratified sampling by lcc class enforced a more even distribution of
samples across lulc classes, causing fewer classes with a marginal frequency of
zero.
Under stratified random sampling with equal probabilities, PRE, PEV, and
Cramer's V are consistently overestimated, while for Kappa and PCC the measures
are consistently underestimated (Figure 3).
Equal probability sampling results in small classes being over-sampled and
large classes being under-sampled. For large percent samples, the small classes are
exhaustively sampled, and hence cannot achieve the target percentage. Certainly in
the case of a 10 percent sample, all 159 lcc classes cannot be sampled at 10 percent
of the total. The very smallest classes are nearly exhaustively sampled with the 0.01
percent sample.
Because equal probability sampling is not necessarily representative of the
population, it becomes susceptible to differential accuracy among the classes. The
net result is that the statistical measures are highly biased for even large sample
sizes, and the direction of the bias depends on the statistic and per class accuracies.
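The contrast between the two stratified allocations can be made concrete with a small sketch. It assumes that "equal probabilities" means an equal sample count per lcc stratum, capped at the stratum size; the function names and the stratum sizes are hypothetical.

```python
def proportional_allocation(stratum_sizes, total):
    """Sample counts proportional to stratum size (rounded)."""
    n = sum(stratum_sizes)
    return [round(total * s / n) for s in stratum_sizes]

def equal_allocation(stratum_sizes, total):
    """Equal sample count per stratum, capped at the stratum size."""
    share = total // len(stratum_sizes)
    return [min(share, s) for s in stratum_sizes]

# Three hypothetical classes: one large, one moderate, one rare.
sizes = [10000, 500, 20]
print(proportional_allocation(sizes, 300))
print(equal_allocation(sizes, 300))
```

With these sizes, equal allocation samples 1 percent of the large class, 20 percent of the moderate class, and all of the rare class, which therefore cannot supply its full share of the target sample.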
[Figure 1: boxplot panels for PRE C|R, PEV C|R, and V; recovered table label: lccIlulcI.]
Figure 1.--Results from Monte Carlo simulation of simple random sample for measures
of association and agreement. Missing boxplots result either from all zero cells for an
lcc or an lulc class or from computation time too long to complete. The horizontal lines
through the boxes are the population parameters.
[Figure 2: boxplot panels for PRE C|R, PEV C|R, V, Kappa, and PCC; recovered table labels: lcclulcI and lccIlulcI; horizontal axis: sample size (.01%, .1%, 1%, 10%).]
Figure 2.--Results from Monte Carlo simulation of stratified random sampling with
probability proportional to size for measures of association and agreement. Missing
boxplots result from all zero cells for either an lcc or an lulc class. The horizontal lines
through the boxes are the population parameters.
[Figure 3: boxplot panels for PRE C|R, PEV C|R, V, and PCC; recovered table labels: lcclulc, lcclulcmA, and lccIlulcI; horizontal axis: sample size (.01%, .1%, 1%, 10%).]
Figure 3.--Boxplots of results from Monte Carlo simulation of stratified random
sampling with equal probability allocation for measures of association and agreement.
Missing boxplots result from all zero cells for either an lcc or an lulc class. The
horizontal lines through the boxes are the population parameters.
Equal probability sampling ensures a sufficient sample size of each stratified
class. This allows a sufficient description of each of the stratified, in this case lcc,
classes. Under probability proportional to size sampling, some classes may be
under-represented. For example, rare classes may be sampled only once, if at all,
to meet strict probability proportional to size criteria.
The cell counts resulting from stratified random sampling with equal
probabilities can be adjusted to better conform to cell probabilities yielded through
simple random sampling. In this Monte Carlo simulation, each sample from the
cross-classified tables lcclulcI and lccIlulcI was selected with equal probabilities and
was calibrated using the population marginal sums for the lcc classes. The results
from the calibration are shown in Figure 4.
[Figure 4: boxplot panels for PRE C|R, PEV C|R, V, Kappa, and PCC; recovered table labels: lcclulcI and lccIlulcI; horizontal axis: sample size (.01%, .1%, 1%, 10%).]
Figure 4.--Results from Monte Carlo simulation of stratified random sampling with
equal probability allocation for measures of association and agreement after calibration
of cell counts. The horizontal lines through the boxes are the population parameters.
The distributions after calibration for the cross-classified table lccIlulcI are
very similar to those from the simple random sample. For the larger rectangular
table lcclulcI, even though the biases are removed for large sample sizes, the biases
for small samples grew to be even larger. For this type of table, the small sample
sizes and large number of empty cells prevent the calibration from being well
controlled. Unfortunately, these are the practical sample sizes for actual
applications.
CONCLUSION
Only three of the measures, PRE, PEV, and Cramer's V, are suitable for
rectangular tables. PRE and PEV are asymmetrical predictive measures, and
Cramer's V is a measure of independence. Cramer's V shows some tendency to be
biased for small sample sizes, even for a small number of classes.
Both PEV and PRE were designed for ease of interpretation. Cramer's V can
perhaps be used best in tandem with other measures. For example, a table can be
tested for independence with Cramer's V, and then the strength of the association
can be measured with PRE and PEV. For those cross-classified tables where
Kappa is applicable, it provides the most robust statistic; however, PCC is the
simplest to understand.
REFERENCES
Anderson, J.R., Hardy, E.E., Roach, J.T., and Witmer, R.E., 1976, A Land Use
and Land Cover Classification System for Use with Remote Sensor Data:
Washington, DC, U.S. Government Printing Office.
Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W., 1975, Discrete Multivariate
Analysis: Cambridge, Massachusetts, The MIT Press.
Brown, J.F., Loveland, T.R., Merchant, J.W., Reed, B.C., and Ohlen, D.O., 1993,
Using Multisource Data in Global Land-Cover Characterization: Concepts,
Requirements, and Methods: Photogrammetric Engineering and Remote
Sensing, v. 59, no. 6, p. 977-987.
Card, D.H., 1982, Using Known Map Category Marginal Frequencies to Improve
Estimates of Thematic Map Accuracy: Photogrammetric Engineering and
Remote Sensing, v. 48, no. 3, p. 431-439.
Cohen, J., 1960, A Coefficient of Agreement for Nominal Scales: Educational and
Psychological Measurement, v. 20, no. 1, p. 37-46.
Conover, W. J., 1980, Practical Nonparametric Statistics: New York, NY: John
Wiley & Sons.
Czaplewski, R.L., and Catts, G.P., 1992, Calibration of Remotely Sensed
Proportion or Area Estimates for Misclassification Error: Remote Sensing of
Environment, v. 39, p. 29-43.
Goodman, L.A., and Kruskal, W.H., 1979, Measures of Association for Cross
Classifications, Springer Series in Statistics: New York, NY: Springer-Verlag.
Hay, A.M., 1979, Sampling Designs to Test Land-Use Map Accuracy:
Photogrammetric Engineering and Remote Sensing, v. 45, no. 4, p. 529-533.
Loveland, T.R., Merchant, J.W., Ohlen, D.O., and Brown, J.F., 1991,
Development of a Land-Cover Characteristics Database for the Conterminous
U.S.: Photogrammetric Engineering and Remote Sensing, v. 57, no. 11, p.
1,453-1,463.
Story, M., and Congalton, R.G., 1986, Accuracy Assessment: A User's
Perspective: Photogrammetric Engineering and Remote Sensing, v. 52, no. 3,
p. 397-399.
U.S. Geological Survey, 1990, Land Use and Land Cover Digital Data from
1:250,000- and 1:100,000-Scale Maps: Reston, Virginia, U.S. Geological
Survey.
ACKNOWLEDGMENTS
The research described in this article has been supported by the U.S.
Environmental Protection Agency (EPA) through Interagency Agreement IAG
DW 14936073 to the U.S. Geological Survey and through the National Aeronautics
and Space Administration in support of the United Nations Environment
Programme. However, it has not been subjected to EPA review and therefore does
not necessarily reflect the views of the agency. No official endorsement should be
inferred by either the U.S. Government or the United Nations Environment
Programme.
BIOGRAPHICAL SKETCH
Eugene A. Fosnight is a cartographer with the United Nations Environmental
Programme's Global Resource Information Database (GRID), Sioux Falls, S.
Dak. He graduated from Purdue University with a B.S. in 1972, from University
College Swansea with a diploma in cartography in 1980 and from University of
Michigan with a M.S. in remote sensing in 1992. Gene provides cartographic,
remote-sensing and statistical support for GRID-Sioux Falls.
Gary W. Fowler is a Professor of biometrics with the School of Natural
Resources and Environment at the University of Michigan, Ann Arbor, Mich. He
graduated from the University of California with a Ph.D. in 1969. Gary's research
concentrates on sequential sampling, estimation of forest stem parameters, volume-basal
area ratios for use in horizontal point sampling, efficient sampling of endemic
forest insect populations, development of new volume equations for commercial
coniferous species, and statistical properties of species diversities.