REMOTE SENS. ENVIRON. 37:35-46 (1991)
A Review of Assessing the Accuracy of
Classifications of Remotely Sensed Data
Russell G. Congalton
Department of Forestry and Resource Management, University of California, Berkeley
This paper reviews the necessary considerations
and available techniques for assessing the accuracy
of remotely sensed data. Included in this review
are the classification system, the sampling scheme,
the sample size, spatial autocorrelation, and the
assessment techniques. All analysis is based on the
use of an error matrix or contingency table. Example matrices and results of the analysis are presented. Future trends including the need for assessment of other spatial data are also discussed.
INTRODUCTION
With the advent of more advanced digital satellite
remote sensing techniques, the necessity of performing an accuracy assessment has received renewed interest. This is not to say that accuracy
assessment is unimportant for the more traditional
remote sensing techniques. However, given the
complexity of digital classification, there is more of
a need to assess the reliability of the results.
Traditionally, the accuracy of photointerpretation
has been accepted as correct without any confirmation. In fact, digital classifications are often
assessed with reference to photointerpretation. An
obvious assumption made here is that the photointerpretation is 100% correct. This assumption is
rarely valid and can lead to a rather poor and
unfair assessment of the digital classification
(Biging and Congalton, 1989).
Therefore, it is essential that researchers and
users of remotely sensed data have a strong knowledge of both the factors needed to be considered
as well as the techniques used in performing any
accuracy assessment. Failure to know these techniques and considerations can severely limit one's
ability to effectively use remotely sensed data. The
objective of this paper is to provide a review of the
appropriate analysis techniques and a discussion of
the factors that must be considered when performing any accuracy assessment. Many analysis techniques have been published in the literature; however, I believe that it will be helpful to many
novice and established users of remotely sensed
data to have all the standard techniques summarized in a single paper. In addition, it is important
to understand the analysis techniques in order to
fully realize the importance of the various other
considerations for accuracy assessment discussed
in this paper.
Address correspondence to R. G. Congalton, 145 Mulford Hall, Department of Forestry and Resource Management, University of California, Berkeley, CA 94720.
Received 15 October 1990; revised 14 April 1991.

TECHNIQUES
Until recently, the idea of assessing the classification accuracy of remotely sensed data was treated
more as an afterthought than as an integral part of
any project. In fact, as recently as the early 1980s
many studies would simply report a single number
to express the accuracy of a classification. In many
of these cases the accuracy reported was what is
called non-site-specific accuracy. In a non-site-specific accuracy assessment, locational accuracy is
completely ignored. In other words, only total
amounts of a category are considered without regard for the location. If all the errors balance out,
a non-site-specific accuracy assessment will yield
very high but misleading results. In addition, most
assessments were conducted using the same data
set as was used to train the classifier. This training
and testing on the same data set also results in
overestimates of classification accuracy.
Once these problems were recognized, many
more site specific accuracy assessments were performed using an independent data set. For these
assessments, the most common way to represent
the classification accuracy of remotely sensed data
is in the form of an error matrix. Using an error
matrix to represent accuracy has been recommended by many researchers and should be
adopted as the standard reporting convention. The
reasons for choosing the error matrix as the standard are clearly demonstrated in this paper.
An error matrix is a square array of numbers
set out in rows and columns which express the
number of sample units (i.e., pixels, clusters of
pixels, or polygons) assigned to a particular category relative to the actual category as verified on
the ground (Table 1). The columns usually represent the reference data while the rows indicate the
classification generated from the remotely sensed
data. An error matrix is a very effective way to
represent accuracy in that the accuracies of each
category are plainly described along with both the
errors of inclusion (commission errors) and errors
of exclusion (omission errors) present in the classification.
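As a concrete illustration, the tallying itself takes only a few lines. The following minimal Python sketch (my illustration, with hypothetical sample labels, not code from any of the studies cited) builds such a matrix from paired reference and classified labels:

```python
import numpy as np

def error_matrix(reference, classified, categories):
    """Tally sample units into an error matrix: rows = classified, columns = reference."""
    index = {cat: i for i, cat in enumerate(categories)}
    m = np.zeros((len(categories), len(categories)), dtype=int)
    for ref, cls in zip(reference, classified):
        m[index[cls], index[ref]] += 1
    return m

# Hypothetical sample units (pixels, clusters, or polygons) verified on the ground:
categories = ["D", "C", "BA", "SB"]
reference  = ["D", "D", "C", "BA", "SB", "D"]
classified = ["D", "C", "C", "BA", "SB", "SB"]
print(error_matrix(reference, classified, categories))
```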
Descriptive Techniques
The error matrix can then be used as a starting
point for a series of descriptive and analytical
statistical techniques. Perhaps the simplest descriptive statistic is overall accuracy which is computed by dividing the total correct (i.e., the sum of
the major diagonal) by the total number of pixels
in the error matrix. In addition, accuracies of
individual categories can be computed in a similar
manner. However, this case is a little more complex in that one has a choice of dividing the
number of correct pixels in that category by either
the total number of pixels in the corresponding
row or the corresponding column. Traditionally,
the total number of correct pixels in a category is
divided by the total number of pixels of that
category as derived from the reference data (i.e.,
the column total). This accuracy measure indicates
the probability of a reference pixel being correctly
classified and is really a measure of omission error. This accuracy measure is often called "producer's accuracy" because the producer of the classification is interested in how well a certain area can be classified. On the other hand, if the total number of correct pixels in a category is divided by the total number of pixels that were classified in that category, then this result is a measure of commission error. This measure, called "user's accuracy" or reliability, is indicative of the probability that a pixel classified on the map/image actually represents that category on the ground (Story and Congalton, 1986).

Table 1. An Example Error Matrix

                          Reference Data
                        D      C     BA     SB   Row Total
Classified Data    D   65      4     22     24      115
                   C    6     81      5      8      100
                  BA    0     11     85     19      115
                  SB    4      7      3     90      104
Column Total           75    103    115    141      434

Land cover categories: D = deciduous, C = conifer, BA = barren, SB = shrub

OVERALL ACCURACY = 321/434 = 74%

PRODUCER'S ACCURACY              USER'S ACCURACY
D  = 65/75  = 87%                D  = 65/115 = 57%
C  = 81/103 = 79%                C  = 81/100 = 81%
BA = 85/115 = 74%                BA = 85/115 = 74%
SB = 90/141 = 64%                SB = 90/104 = 87%
A very simple example quickly shows the advantages of considering overall accuracy, "producer's accuracy," and "user's accuracy." The error matrix shown in Table 1 indicates an overall
map accuracy of 74%. However, suppose we are
most interested in the ability to classify deciduous
forests. We can calculate a "producer's accuracy"
for this category by dividing the total number of
correct pixels in the deciduous category (65) by
the total number of deciduous pixels as indicated
by the reference data (75). This division results in
a "producer's accuracy" of 87%, which is quite
good. If we stopped here, one might conclude
that, although this classification has an overall
accuracy that is only fair (74%), it is adequate for
the deciduous category. Making such a conclusion
could be a very serious mistake. A quick calculation of the "user's accuracy" computed by dividing
the total number of correct pixels in the deciduous
category (65) by the total number of pixels classified as deciduous (115) reveals a value of 57%. In
other words, although 87% of the deciduous areas
have been correctly identified as deciduous, only
57% of the areas called deciduous are actually
deciduous. A more careful look at the error matrix
reveals that there is significant confusion in discriminating deciduous from barren and shrub.
Therefore, although the producer of this map can
claim that 87% of the time an area that was
deciduous was identified as such, a user of this
map will find that only 57% of the time will an
area he visits that the map says is deciduous will
actually be deciduous.
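These three measures are trivial to compute once the matrix exists. A minimal sketch (mine, not software from the original study) reproduces the Table 1 figures:

```python
import numpy as np

# Error matrix from Table 1: rows = classified data, columns = reference data.
m = np.array([[65,  4, 22, 24],
              [ 6, 81,  5,  8],
              [ 0, 11, 85, 19],
              [ 4,  7,  3, 90]])

overall = np.trace(m) / m.sum()          # 321/434 = 0.74
producers = np.diag(m) / m.sum(axis=0)   # correct / column (reference) totals
users = np.diag(m) / m.sum(axis=1)       # correct / row (classified) totals
print(overall, producers, users)         # producers[0] = 0.87, users[0] = 0.57
```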
Analytical Techniques
In addition to these descriptive techniques, an
error matrix is an appropriate beginning for many
analytical statistical techniques. This is especially
true of the discrete multivariate techniques. Starting with Congalton et al. (1983), discrete multivariate techniques have been used for performing
statistical tests on the classification accuracy of
digital remotely sensed data. Since that time many
others have adopted these techniques as the standard accuracy assessment tools (e.g., Rosenfield
and Fitzpatrick-Lins, 1986; Hudson and Ramm,
1987; Campbell, 1987). Discrete multivariate techniques are appropriate because remotely sensed
data are discrete rather than continuous. The data
are also binomially or multinomially distributed
rather than normally distributed. Therefore, many
common normal theory statistical techniques do
not apply. The following example presented in
Tables 2-9 demonstrates the power of these discrete multivariate techniques. The example begins
with three error matrices and presents the results
of the analysis techniques.
Table 2 presents the error matrices generated
from using three different classification algorithms
to map a small area of Berkeley and Oakland,
California surrounding the University of California
campus from SPOT satellite data. The three classification algorithms used included a traditional supervised approach, a traditional unsupervised approach, and a modified approach that combines
the supervised and unsupervised classifications together to maximize the advantages of each
(Chuvieco and Congalton, 1988). The classification
was a simple one using only four categories; forest
(F), industrial (I), urban (U), and water (W). All
three classifications were performed by a single
analyst. In addition, Table 3 presents the error
matrix generated for the same area using only the
modified classification approach by a second analyst. Each analyst was responsible for performing
an accuracy assessment. Therefore, different numbers of samples and different sample locations
were selected by each.
The next analytical step is to "normalize" or
standardize the error matrices. This technique uses
an iterative proportional fitting procedure which
forces each row and column in the matrix to sum
to one. In this way, differences in sample sizes
used to generate the matrices are eliminated and,
therefore, individual cell values within the matrix
are directly comparable. In addition, because as
part of the iterative process the rows and columns
are totaled (i.e., marginals), the resulting
normalized matrix is more indicative of the off-diagonal cell values (i.e., the errors of omission and commission). In other words, all the values in the matrix are iteratively balanced by row and column, thereby incorporating information from that row and column into each individual cell value. This process then changes the cell values along the major diagonal of the matrix (correct classifications), and therefore a normalized overall accuracy can be computed for each matrix by summing the major diagonal and dividing by the total of the entire matrix. Consequently, one could argue that the normalized accuracy is a better representation of accuracy than is the overall accuracy computed from the original matrix because it contains information about the off-diagonal cell values. Table 4 presents the normalized matrices from the same three classification algorithms for analyst #1, generated using a computer program called MARGFIT (marginal fitting). Table 5 presents the normalized matrix for the modified approach performed by analyst #2.

Table 2. Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach
                          Reference Data
                        F      I      U      W
Classified Data    F   68      7      3      0
                   I   12    112     15     10
                   U    3      9     89      0
                   W    0      2      5     56
Overall Accuracy = 325/391 = 83%

Unsupervised Approach
                          Reference Data
                        F      I      U      W
Classified Data    F   60     11      3      4
                   I   15    102     14      8
                   U    6     13     90      2
                   W    2      4      5     52
Overall Accuracy = 304/391 = 78%

Modified Approach
                          Reference Data
                        F      I      U      W
Classified Data    F   75      6      1      0
                   I    4    116     11      3
                   U    3      7     96      2
                   W    1      1      4     61
Overall Accuracy = 348/391 = 89%

Table 3. Error Matrix for the Modified Classification Approach from Analyst #2

                          Reference Data
                        F     AG      U      W
Classified Data    F   35      6      1      0
                  AG    3     82      5     10
                   U    4      2     54      0
                   W    0      5      2     37
Overall Accuracy = 208/246 = 85%

Table 4. Normalized Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach
                             Reference Data
                         F        I        U        W
Classified Data    F   0.8652   0.0940   0.0331   0.0073
                   I   0.0845   0.7547   0.0784   0.0824
                   U   0.0435   0.1171   0.8319   0.0072
                   W   0.0069   0.0342   0.0567   0.9031
Normalized Accuracy = 3.3549/4 = 84%

Unsupervised Approach
                             Reference Data
                         F        I        U        W
Classified Data    F   0.7734   0.1256   0.0387   0.0622
                   I   0.1242   0.7014   0.1006   0.0824
                   U   0.0656   0.1163   0.7904   0.0273
                   W   0.0369   0.0567   0.0702   0.8370
Normalized Accuracy = 3.1022/4 = 78%

Modified Approach
                             Reference Data
                         F        I        U        W
Classified Data    F   0.9080   0.0687   0.0152   0.0076
                   I   0.0372   0.8460   0.0801   0.0366
                   U   0.0370   0.0697   0.8598   0.0334
                   W   0.0178   0.0156   0.0450   0.9224
Normalized Accuracy = 3.5362/4 = 88%

Table 5. Normalized Error Matrix for the Modified Approach from Analyst #2

                             Reference Data
                         F       AG        U        W
Classified Data    F   0.8519   0.1090   0.0287   0.0113
                  AG   0.0464   0.7641   0.0581   0.1313
                   U   0.0897   0.0348   0.8655   0.0094
                   W   0.0120   0.0921   0.0477   0.8480
Normalized Accuracy = 3.3295/4 = 83%
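The iterative proportional fitting behind these normalized matrices can be sketched in a few lines of Python. One caution: pure proportional fitting leaves zero cells at zero, so the sketch below adds a small constant to every cell first; this is one plausible way (an assumption on my part, not a documented detail of MARGFIT) to obtain the strictly positive values seen in Tables 4 and 5:

```python
import numpy as np

def normalize(matrix, eps=1e-2, max_iter=1000, tol=1e-6):
    """Iteratively force each row and column of the matrix to sum to one."""
    m = matrix.astype(float) + eps  # small constant; pure fitting keeps zeros at zero
    for _ in range(max_iter):
        m /= m.sum(axis=1, keepdims=True)  # scale rows to sum to 1
        m /= m.sum(axis=0, keepdims=True)  # scale columns to sum to 1
        if np.allclose(m.sum(axis=1), 1.0, atol=tol):
            break
    return m

# Supervised matrix from Table 2; the normalized accuracy comes out near the 84%
# of Table 4 (the exact cell values depend on the smoothing constant chosen).
m = np.array([[68, 7, 3, 0], [12, 112, 15, 10], [3, 9, 89, 0], [0, 2, 5, 56]])
print(np.trace(normalize(m)) / 4)
```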
In addition to computing a normalized accuracy, the normalized matrix can also be used to
directly compare cell values between matrices.
For example, we may be interested in comparing
the accuracy each analyst obtained for the forest
category using the modified classification approach. From the original matrices we can see that
analyst #1 classified 75 sample units correctly
I
while analyst #2 classified 35 correctly. Neither of
these numbers means much because they are not
directly comparable due to the differences in the
number of samples used to generate the error
matrix by each analyst. Instead, these numbers
would need to be converted into percent so that a
comparison could be made. Here another problem
arises: Do we divide the total correct by the row
total (user's accuracy) or by the column total (producer's accuracy)? We could calculate both and
compare the results or we could use the cell value
in the normalized matrix. Because of the iterative
proportional fitting routine, each cell value in the
matrix has been balanced by the other values in its
corresponding row and column. This balancing has
the effect of incorporating producer's and user's
accuracies together. Also, since each row and column sums to 1, an individual cell value can quickly
be converted to a percentage by multiplying by
100. Therefore, the normalization process provides
a convenient way of comparing individual cell
values between error matrices regardless of the
number of samples used to derive the matrix.
Another discrete multivariate technique of use
in accuracy assessment is called KAPPA (Cohen,
1960). The result of performing a KAPPA analysis
is a KHAT statistic (an estimate of KAPPA), which
is another measure of agreement or accuracy. The
KHAT statistic is computed as

$$\hat{K} = \frac{N \sum_{i=1}^{r} x_{ii} \;-\; \sum_{i=1}^{r} \left( x_{i+} \cdot x_{+i} \right)}{N^{2} \;-\; \sum_{i=1}^{r} \left( x_{i+} \cdot x_{+i} \right)},$$

where r is the number of rows in the matrix, x_{ii} is the number of observations in row i and column i, x_{i+} and x_{+i} are the marginal totals of row i and column i, respectively, and N is the total number of observations (Bishop et al., 1975). The KHAT
equation is published in this paper to clear up
some confusion caused by a typographical error in
Congalton et al. (1983), who originally proposed
the use of this statistic for remotely sensed data.
Since that time, numerous papers have been published recommending this technique. The equations for computing the variance of the KHAT
statistic and the standard normal deviate can be
found in Congalton et al. (1983), Rosenfield and
Fitzpatrick-Lins (1986), and Hudson and Ramm
(1987) to list just a few. It should be noted that
the KHAT equation assumes a multinomial sampling model and that the variance is derived using
the Delta method.
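In code, the KHAT computation is only a few lines; the sketch below (mine) reproduces, for example, the .7687 reported for the supervised matrix in Table 7:

```python
import numpy as np

def khat(m):
    """KHAT estimate of KAPPA from an error matrix."""
    m = m.astype(float)
    n = m.sum()                                      # N: total observations
    chance = (m.sum(axis=1) * m.sum(axis=0)).sum()   # sum of x_i+ * x_+i
    return (n * np.trace(m) - chance) / (n**2 - chance)

m = np.array([[68, 7, 3, 0], [12, 112, 15, 10], [3, 9, 89, 0], [0, 2, 5, 56]])
print(round(khat(m), 4))  # 0.7687, matching Table 7
```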
Table 6 provides a comparison of the overall
accuracy, the normalized accuracy, and the KHAT
statistic for the three classification algorithms used
by analyst #1. In this particular example, all three
measures of accuracy agree about the relative
ranking of the results. However, it is possible for
these rankings to disagree simply because each
measure incorporates various levels of information from the error matrix into its computations. Overall accuracy only incorporates the major diagonal and excludes the omission and commission errors.

Table 6. A Comparison of the Three Accuracy Measures for the Three Classification Approaches

Classification            Overall              Normalized
Algorithm                 Accuracy    KHAT     Accuracy
Supervised approach         83%       77%        84%
Unsupervised approach       78%       70%        78%
Modified approach           89%       85%        88%
As already described, normalized accuracy directly
includes the off-diagonal elements (omission and
commission errors) because of the iterative proportional fitting procedure. As shown in the KHAT
equation, KHAT accuracy indirectly incorporates
the off-diagonal elements as a product of the row
and column marginals. Therefore, depending on
the amount of error included in the matrix, these
three measures may not agree. It is not possible to
give clear-cut rules as to when each measure should
be used. Each accuracy measure incorporates different information about the error matrix and
therefore must be examined as different computations attempting to explain the error. My experience has shown that if the error matrix tends to
have a great many off-diagonal cell values with
zeros in them, then the normalized results tend to
disagree with the overall and Kappa results. Many
zeros occur in a matrix when an insufficient sample has been taken or when the classification is
exceptionally good. Because of the iterative proportional fitting routine, these zeros tend to take
on positive values in the normalization process
showing that some error could be expected. The
normalization process then tends to reduce the
accuracy because of these positive values in the
off-diagonal cells. If a large number of off-diagonal
cells do not contain zeros, then the results of the
three measures tend to agree. There are also times
when the Kappa measure will disagree with the
other two measures. Because of the ease of computing all three measures (software is available
from the author) and because each measure reflects different information contained within the
error matrix, I recommend an analysis such as the
one performed here to glean as much information from the error matrix as possible.

Table 7. Results of the KAPPA Analysis Test of Significance for Individual Error Matrices

Test of Significance of Each Error Matrix

Classification Algorithm      KHAT Statistic    Z Statistic    Result(a)
Supervised approach               .7687             29.41          S(b)
Unsupervised approach             .6956             24.04          S
Modified approach                 .8501             39.23          S

(a) At the 95% confidence level.
(b) S = significant.

Table 8. Results of KAPPA Analysis for Comparison between Error Matrices for Analyst #1

Test of Significant Differences between Error Matrices

Comparison                      Z Statistic    Result(a)
Supervised vs. unsupervised        1.8753         NS(b)
Supervised vs. modified            2.3968         S
Unsupervised vs. modified          4.2741         S

(a) At the 95% confidence level.
(b) S = significant, NS = not significant.

Table 9. Results of KAPPA Analysis for Comparison between Modified Approach for Analyst #1 vs. Analyst #2

Test of Significant Differences between Error Matrices

Comparison                         Z Statistic    Result(a)
Modified #1 vs. modified #2           1.6774         NS(b)

(a) At the 95% confidence level.
(b) NS = not significant.
In addition to being a third measure of accuracy, KAPPA is also a powerful technique in its
ability to provide information about a single matrix
as well as to statistically compare matrices. Table 7
presents the results of the KAPPA analysis to test
the significance of each matrix alone. In other
words, this test determines whether the results
presented in the error matrix are significantly better than a random result (i.e., the null hypothesis:
KHAT = 0). Table 8 presents the results of the
KAPPA analysis that compares the error matrices
two at a time to determine if they are significantly
different. This test is based on the standard normal
deviate and the fact that, although remotely sensed
data are discrete, the KHAT statistic is asymptotically normally distributed. A quick look at Table 8
shows why this test is so important. Despite the
overall accuracy of the supervised approach being
6% higher than the unsupervised approach (84%
- 78% = 6%), the results of the KAPPA analysis
show that these two approaches are not significantly different. Therefore, given the choice of
only these two approaches, one should use the
easier, quicker, or more efficient approach because
the accuracy will not be the deciding factor. Similar results are presented in Table 9 comparing the
modified classification approach for analyst #1
with analyst #2.
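For readers who want to reproduce Tables 7-9, the sketch below implements the KHAT variance as I read the Delta-method formula in Bishop et al. (1975); it should be checked against Congalton et al. (1983) before serious use. A |Z| above 1.96 is significant at the 95% confidence level, matching the S and NS labels in the tables:

```python
import numpy as np

def khat_and_variance(m):
    """KHAT and its approximate Delta-method variance (after Bishop et al., 1975)."""
    m = m.astype(float)
    n = m.sum()
    p = m / n                                  # cell proportions
    r, c = p.sum(axis=1), p.sum(axis=0)        # row and column marginal proportions
    t1 = np.trace(p)
    t2 = (r * c).sum()
    t3 = (np.diag(p) * (r + c)).sum()
    t4 = (p * (r[np.newaxis, :] + c[:, np.newaxis]) ** 2).sum()
    k = (t1 - t2) / (1 - t2)
    var = (t1 * (1 - t1) / (1 - t2) ** 2
           + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
           + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4) / n
    return k, var

def z_single(m):
    """Z statistic testing KHAT > 0 for one matrix (Table 7)."""
    k, var = khat_and_variance(m)
    return k / np.sqrt(var)

def z_pair(m1, m2):
    """Z statistic testing whether two independent KHATs differ (Tables 8 and 9)."""
    k1, v1 = khat_and_variance(m1)
    k2, v2 = khat_and_variance(m2)
    return abs(k1 - k2) / np.sqrt(v1 + v2)
```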
In addition to the discrete multivariate techniques just presented, other techniques for assessing the accuracy of remotely sensed data have also
been suggested. Rosenfield (1981) proposed the
use of analysis of variance techniques for accuracy
assessment. However, violation of the normal theory assumption and independence assumption
when applying this technique to remotely sensed
data has severely limited its application. Aronoff
(1985) suggested the use of a minimum accuracy
value as an index of classification accuracy. This
approach is based on the binomial distribution of
the data and is therefore very appropriate for
remotely sensed data. The major disadvantage of
the Aronoff approach is that it is limited to a single
overall accuracy value rather than using the entire
error matrix. However, it is useful in that this
index does express statistically the uncertainty
involved in any accuracy assessment. Finally,
Skidmore and Turner (1989) have begun work on
techniques for assessing error as it accumulates
through many spatial layers of information in a
GIS, including remotely sensed data. These techniques have included using a line sampling method
for accuracy assessment as well as probability theory to accumulate error from layer to layer. It is in
this area of error analysis that much new work
needs to be performed.
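As one concrete reading of Aronoff's index (an illustration only; his published formulation may differ in detail), a one-sided lower confidence bound on overall accuracy can be taken from the binomial distribution, here via the Clopper-Pearson bound and assuming SciPy is available:

```python
from scipy.stats import beta

def minimum_accuracy(correct, total, confidence=0.95):
    """One-sided lower Clopper-Pearson confidence bound on overall accuracy."""
    return beta.ppf(1 - confidence, correct, total - correct + 1)

print(minimum_accuracy(321, 434))  # roughly 0.70 for the Table 1 classification
```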
CONSIDERATIONS
Along with the actual analysis techniques, there
are many other considerations to note when performing an accuracy assessment. In reality, the
techniques are of little value if these other factors
are not considered because a critical assumption of
all the analysis described above is that the error
matrix is truly representative of the entire classification. If the matrix is improperly generated,
then all the analysis is meaningless. Therefore, the
following factors must be considered: ground data
collection, classification scheme, spatial autocorrelation, sample size, and sampling scheme. Each of
these factors provides essential information for the
assessment and failure to consider even one of
them could lead to serious shortcomings in the
assessment process.
Ground Data Collection
It is obvious that in order to adequately assess the
accuracy of the remotely sensed classification, accurate ground, or reference data must be collected. However, the accuracy of the ground data
is rarely known nor is the level of effort needed to
collect the appropriate data clearly understood.
Depending on the level of detail in the classification (i.e., classification scheme), collecting reference data can be a very difficult task. For example,
in a simple classification scheme the required level
of detail may be only to distinguish residential
from commercial areas. Collecting reference data
may be as simple as obtaining a county zoning
map. However, a more complex forest classification scheme may involve collecting reference data
for not only tree species but size class and
crown closure as well. Size class involves measuring the diameters of trees and therefore a great
many trees may have to be measured to estimate
the size class for each pixel. Crown closure is even
more difficult to measure. Therefore, in this case,
collecting accurate reference data can be difficult.
A traditional solution to this problem has been
for the producer and user of the classification to
assume that some reference data set is correct. For
example, the results of some photointerpretation
or aerial reconnaissance may be used as the reference data. However, errors in the interpretation
would then be blamed on the digital classification,
thereby wrongly lowering the digital classification
accuracy. It is exactly this problem that has caused
the lack of acceptance of digital satellite data for
many applications. Although no reference data set
may be completely accurate, it is important that
the reference data have high accuracy or else it is
not a fair assessment. Therefore, it is critical that
the ground or reference data collection be carefully considered in any accuracy assessment. Much
work is yet to be done to determine the proper
level of effort and collection techniques necessary
to provide this vital information.
Classification Scheme
When planning a project involving remotely sensed
data, it is very important that sufficient effort be
given to the classification scheme to be used.
In many instances, this scheme is an existing
one such as the Anderson classification system
(Anderson et al., 1976). In other cases, the classification scheme is dictated by the objectives of the
project or by the specifications of the contract. In
all situations a few simple guidelines should be
followed. First of all, any classification scheme
should be mutually exclusive and totally exhaustive. In other words, any area to be classified
should fall into one and only one category or class.
In addition, every area should be included in the
classification. Finally, if possible, it is very advantageous to use a classification scheme that is hierarchical in nature. If such a scheme is used, certain categories within the classification scheme can
be collapsed to form more general categories. This
ability is especially important when trying to meet
predetermined accuracy standards. Two or more
detailed categories falling below the minimum
required accuracy may need to be grouped together (collapsed) to form a more general category
that exceeds the minimum accuracy requirement.
For example, it may be impossible to separate
interior live oak from canyon live oak. Therefore,
these two categories may have to be collapsed to
form a live oak category to meet the required
accuracy standard.
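Collapsing two categories amounts to summing the corresponding rows and columns of the error matrix, as in this short sketch (the matrix and category positions are hypothetical):

```python
import numpy as np

def collapse(m, i, j):
    """Merge categories i and j of an error matrix by summing their rows and columns."""
    m = m.copy()
    m[i, :] += m[j, :]
    m[:, i] += m[:, j]
    return np.delete(np.delete(m, j, axis=0), j, axis=1)

# Two hypothetical oak classes (positions 0 and 1) collapse into one live oak class.
m = np.array([[30, 25, 2], [20, 35, 3], [1, 4, 60]])
print(collapse(m, 0, 1))
```

The confusion between the two collapsed categories moves onto the major diagonal, which is exactly why collapsing raises the accuracy of the more general category.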
Because the classification scheme is so important, no work should begin on the remotely sensed
data until the scheme has been thoroughly reviewed and as many problems as possible identified. It is especially helpful if the categories in the
scheme can be logically explained. The difference
between Douglas fir and Ponderosa pine is easy to
understand; however, the difference between
Density Class 3 (50-70% crown closure) and Density Class 4 (>70% crown closure) may not be. In
fact, many times these classes are rather artificial
and one can expect to find confusion between a
forest stand with a crown closure of 67% that
belongs in Class 3 and a stand of 73% that belongs
in Class 4. Sometimes there is little that can be
done about the artificial delineations in the classification scheme; other times the scheme can be
modified to better represent natural breaks. However, failure to try to understand the classification
scheme from the very beginning will certainly
result in a great loss of time and much frustration
in the end.
Spatial Autocorrelation
Spatial autocorrelation is said to occur when the
presence, absence, or degree of a certain characteristic affects the presence, absence, or degree of
the same characteristic in neighboring units (Cliff
and Ord, 1973). This condition is particularly important in accuracy assessment if an error in a
certain location can be found to positively or negatively influence errors in surrounding locations
(Campbell, 1981). Work by Congalton (1988a) on
Landsat MSS data from three areas of varying
spatial diversity (i.e., an agriculture, a range, and a
forest site) showed a positive influence as much as
30 pixels (over 1 mile) away. These results are
explainable in an agricultural environment where
field sizes are large and typical misclassification
would be to make an error in labeling the entire
field. However, these results are more surprising
for rangeland and forested sites. Surely these results should affect the sample size and especially
the sampling scheme used in accuracy assessment,
especially in the way this autocorrelation affects
the assumption of sample independence. This autocorrelation may then be responsible for periodicity in the data that could affect the results of any
type of systematic sample. In addition, the size of
the cluster used in cluster sampling would also be
affected because each new pixel would not be
contributing independent information.
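Congalton (1988a) measured this with join-count statistics on difference images; as a simpler stand-in (my sketch, not the published method), the reach of error dependence can be gauged by correlating a binary error image with shifted copies of itself:

```python
import numpy as np

def error_autocorrelation(errors, max_lag=30):
    """Approximate lag-k autocorrelation of a binary error image along one axis."""
    e = errors - errors.mean()
    var = (e * e).mean()
    return [(e[:, lag:] * e[:, :-lag]).mean() / var for lag in range(1, max_lag + 1)]

# Hypothetical 0/1 image of misclassified pixels; random here, so values sit near zero.
rng = np.random.default_rng(0)
errors = (rng.random((100, 100)) < 0.25).astype(float)
print(np.round(error_autocorrelation(errors, max_lag=5), 3))
```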
Sample Size
Sample size is another important consideration
when assessing the accuracy of remotely sensed
data. Each sample point collected is expensive and
therefore sample size must be kept to a minimum
and yet it is critical to maintain a large enough
sample size so that any analysis performed is
statistically valid. Of all the considerations discussed in this paper, the most has probably been
written about sample size. Many researchers, notably van Genderen and Lock (1977), Hay (1979),
Hord and Brooner (1976), Rosenfield et al. (1982),
and Congalton (1988b), have published equations
and guidelines for choosing the appropriate sample size. The majority of researchers have used an
equation based on the binomial distribution or the
normal approximation to the binomial distribution
to compute the required sample size. These techniques are statistically sound for computing the
sample size needed to compute the overall accuracy of a classification or even the overall accuracy
of a single category. The equations are based on
the proportion of correctly classified samples
(pixels, clusters, or polygons) and on some allowable error. However, these techniques were not
designed to choose a sample size for filling in an
error matrix. In the case of an error matrix, it is
not simply a matter of correct or incorrect. It is a
matter of which error or, in other words, which
categories are being confused. Sufficient samples
must be acquired to be able to adequately represent this confusion. Therefore, the use of these
techniques for determining the sample size for an
error matrix is not appropriate. Fitzpatrick-Lins
(1981) used the normal approximation equation to
compute the sample size for assessing a land use/
land cover map of Tampa, Florida. The results of
the computation showed that 319 samples needed
to be taken for a classification with an expected
accuracy of 85% and an allowable error of 4%. She
ended up taking 354 samples and filling in an
error matrix that had 30 categories in it (i.e., a
matrix of 30 rows × 30 columns or 900 possible
cells). Although this sample size is sufficient for
computing overall accuracy, it is obviously much
too small to be represented in a matrix. Only 35 of
the 900 cells had a value greater than zero. Other
researchers have used the equation to compute the
sample size for each category. Although resulting
in a larger sample, the equation still does not
account for the confusion between categories.
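For reference, the normal-approximation computation takes the familiar form n = Z²p(1 − p)/E²; with Z = 2, an expected accuracy p of 0.85, and an allowable error E of 0.04, it reproduces the 319 samples cited above:

```python
def binomial_sample_size(p, allowable_error, z=2.0):
    """Normal approximation to the binomial: n = z^2 * p * (1 - p) / E^2."""
    return z**2 * p * (1 - p) / allowable_error**2

print(binomial_sample_size(0.85, 0.04))  # 318.75, i.e., 319 samples
```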
Because of the large number of pixels in a
remotely sensed image, traditional thinking about
sampling does not often apply. Even a one-half
percent sample of a single Thematic Mapper scene
can be over 300,000 pixels. Not all assessments are
performed on a per pixel basis, but the same
relative argument holds true if the sample unit is a
cluster of pixels or a polygon. Therefore, practical
considerations more often dictate the sample size
selection. A balance between what is statistically
sound and what is practically attainable must be
found. It has been my experience that a good rule
of thumb seems to be collecting a minimum of 50
samples for each vegetation or land use category
in the error matrix. If the area is especially large
(i.e., more than a million acres) or the classification has a large number of vegetation or land use
categories (i.e., more than 12 categories), the minimum number of samples should be increased to 75
or 100 samples per category. The number of samples for each category can also be adjusted based
on the relative importance of that category within
the objectives of the mapping or by the inherent
variability within each of the categories. Sometimes it is better to concentrate the sampling on
the categories of interest and increase their number of samples while reducing the number of
samples taken in the less important categories.
Also it may be useful to take fewer samples in
categories that show little variability such as water
or forest plantations and increase the sampling in
the categories that are more variable such as uneven-aged forests or riparian areas. Again, the
object here is to balance the statistical recommendations in order to get an adequate sample to
generate an appropriate error matrix with the time,
cost, and practical limitations associated with any
viable remote sensing project.
Sampling Scheme
In addition to the considerations already discussed, sampling scheme is an important part of
any accuracy assessment. Selection of the proper
scheme is absolutely critical to generating an error
matrix that is representative of the entire classified
image. Poor choice in sampling scheme can result
in significant biases being introduced into the
error matrix which may over or under estimate the
true accuracy. In addition, use of the proper sampling scheme may be essential depending on the
analysis techniques to be applied to the error
matrix.
Many researchers have expressed opinions
about the proper sampling scheme to use (e.g.,
Hord and Brooner, 1976; Ginevan, 1979; Rhode,
1978; Fitzpatrick-Lins, 1981). These opinions vary
greatly among researchers and include everything
from simple random sampling to stratified systematic unaligned sampling. Despite all these opinions, very little work has actually been performed
in this area. Congalton (1988b) performed sampling simulations on three spatially diverse areas
and concluded that in all cases simple random
without replacement and stratified random sampling provided satisfactory results. Despite the
nice statistical properties of simple random sampling, this sampling scheme is not always that
practical to apply. Simple random sampling tends
to undersample small but possibly very important
areas unless the sample size is significantly increased. For this reason, stratified random sampiing is recommended where a minimum number
of samples are selected from each strata (i.e.,
category). Even stratified random sampling can be
somewhat impractical because of having to collect
ground information for the accuracy assessment at
random locations on the ground. The problems
with random locations are that they can be in
places with very difficult access and they can only
be selected after the classification has been performed. This limits the accuracy assessment data
to being collected late in the project instead of in
conjunction with the training data collection,
thereby increasing the costs of the project. In
addition, in some projects the time between the
project beginning and the accuracy assessment
may be so long as to cause temporal problems in
collecting ground reference data. In other words,
the ground may change (i.e., the forest is harvested)
between the time the project is started and the
accuracy assessment is begun.
Therefore, some systematic approach would
certainly help make this ground collection effort
more efficient by making it easier to locate the
points on the ground and allowing data to be
collected simultaneously for training and assessment. However, results of Congalton (1988a)
showed that periodicity in the errors as measured
by the autocorrelation analysis could make the use
of systematic sampling risky for accuracy assessment. Therefore, perhaps some combination of
random and systematic sampling would provide
the best balance between statistical validity and
practical application. Such a system may employ
systematic sampling to collect some assessment
data early in a project while random sampling
within strata would be used after the classification
is completed to assure that enough samples were
collected for each category and to minimize any
periodicity in the data.
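The stratified random draw recommended above is straightforward to implement; in this sketch the classified map is randomly generated purely for illustration:

```python
import numpy as np

def stratified_sample(classified_map, per_class=50, seed=0):
    """Pick up to per_class random pixel locations from each category of a map."""
    rng = np.random.default_rng(seed)
    samples = {}
    for category in np.unique(classified_map):
        rows, cols = np.nonzero(classified_map == category)
        pick = rng.choice(len(rows), size=min(per_class, len(rows)), replace=False)
        samples[category] = list(zip(rows[pick], cols[pick]))
    return samples

classified_map = np.random.default_rng(1).integers(0, 4, size=(200, 200))
print({k: v[:2] for k, v in stratified_sample(classified_map).items()})
```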
In addition to the sampling schemes already
discussed, cluster sampling has also been frequently used in assessing the accuracy of remotely
sensed data, especially to collect information on
many pixels very quickly. However, cluster sampling must be used intelligently. Simply using
very large clusters is not a valid method of collecting data because each pixel is not independent of
the other and adds very little information to the
cluster. Congalton (1988b) recommended that no
clusters larger than 10 pixels and certainly not
larger than 25 pixels be used because of the lack
of information added by each pixel beyond these
cluster sizes.
Finally, some analytic techniques assume that
certain sampling schemes were used to obtain the
data. For example, use of the Kappa analysis assumes a multinomial sampling model. Only simple
random sampling completely satisfies this assumption. The effect of using another of the sampling
schemes discussed here is unknown. An interesting project would be to test the effect on the
Kappa analysis of using a sampling scheme other
than simple random sampling. If the effect is
found to be small, then the scheme may be appropriate to use within the conditions discussed above.
If the effect is found to be large, then that sampling scheme should not be used to perform Kappa
analysis. To conclude that some sampling schemes
can be used for descriptive techniques and others
for analytical techniques seems impractical. Accuracy assessment is expensive and no one is going
to collect data for only descriptive use. Eventually,
someone will use that matrix for some analytical
technique.
CONCLUSIONS
This paper has reviewed the factors and techniques to be considered when assessing the accuracy of classifications of remotely sensed data. The
work has really just begun. The factors discussed
here are certainly not fully understood. The basic
issues of sample size and sampling scheme have
not been resolved. Spatial autocorrelation analysis
has rarely been applied to any study. Exactly what
constitutes ground or reference data and the level
of effort needed to collect it must be studied.
Research needs to continue in order to balance
what is statistically valid within the realm of practical application. This need becomes increasingly
important as techniques are developed to use remotely sensed data over large regional and global
domains. What is valid and practical over a small
area may not apply to regional or global projects.
Up to now, the little experience we have has been
on relatively small remote sensing projects. However, there is a need to use remote sensing for
much larger projects such as monitoring global
warming, deforestation, and environmental degradation. We do not know all the problems that will
arise when dealing with such large areas. Certainly, the techniques described must be extended
and refined to better meet these assessment needs.
It is critical that this work and the use of quantitative analysis of remotely sensed data continue. We
have suffered too long because of the oversell of
the technology and the underutilization of any
quantitative analysis early in the digital remote
sensing era. Papers such as Meyer and Werth
(1990) that state that digital remote sensing is
not a viable tool for most resource applications
continue to demonstrate the problems we have
created by not quantitatively documenting our
work. We must put aside the days of a casual
assessment of our classification. "It looks good" is
not a valid accuracy statement. A classification is
not complete until it has been assessed. Then and
only then can the decisions made based on that
information have any validity.
In addition, we must not forget that remotely
sensed data is just a small subset of spatial data
currently being used in geographic information
systems (GIS). The techniques and considerations
discussed here need to be applied over all spatial
data. Techniques developed for other spatial data
need to be tested for use with remotely sensed
data. The work has just begun, and if we are going
to use spatial data to help us make decisions, and
we should, then we must know about the accuracy
of this information.
The author would like to thank Greg Biging and Craig Olson
for their helpful reviews of this paper. Thanks also to the two
anonymous reviewers whose comments significantly improved
this manuscript.
REFERENCES
Anderson, J. R., Hardy, E. E., Roach, J. T., and Witmer, R. E. (1976), A land use and land cover classification system for use with remote sensor data, U.S. Geol. Survey Prof. Paper 964, 28 pp.
Aronoff, S. (1985), The minimum accuracy value as an index of classification accuracy, Photogramm. Eng. Remote Sens. 51(1):99-111.
Biging, G., and Congalton, R. (1989), Advances in forest inventory using advanced digital imagery, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 3, pp. 1241-1249.
Bishop, Y., Fienberg, S., and Holland, P. (1975), Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA, 575 pp.
Campbell, J. (1981), Spatial autocorrelation effects upon the accuracy of supervised classification of land cover, Photogramm. Eng. Remote Sens. 47(3):355-363.
Campbell, J. (1987), Introduction to Remote Sensing, Guilford Press, New York, 551 pp.
Chuvieco, E., and Congalton, R. (1988), Using cluster analysis to improve the selection of training statistics in classifying remotely sensed data, Photogramm. Eng. Remote Sens. 54(9):1275-1281.
Cliff, A. D., and Ord, J. K. (1973), Spatial Autocorrelation, Pion, London, 178 pp.
Cohen, J. (1960), A coefficient of agreement for nominal scales, Educ. Psychol. Measurement 20(1):37-46.
Congalton, R. G. (1988a), Using spatial autocorrelation analysis to explore errors in maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):587-592.
Congalton, R. G. (1988b), A comparison of sampling schemes used in generating error matrices for assessing the accuracy of maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):593-600.
Congalton, R. G., Oderwald, R. G., and Mead, R. A. (1983), Assessing Landsat classification accuracy using discrete multivariate statistical techniques, Photogramm. Eng. Remote Sens. 49(12):1671-1678.
Fitzpatrick-Lins, K. (1981), Comparison of sampling procedures and data analysis for a land-use and land-cover map, Photogramm. Eng. Remote Sens. 47(3):343-351.
Ginevan, M. E. (1979), Testing land-use map accuracy: another look, Photogramm. Eng. Remote Sens. 45(10):1371-1377.
Hay, A. M. (1979), Sampling designs to test land-use map accuracy, Photogramm. Eng. Remote Sens. 45(4):529-533.
Hord, R. M., and Brooner, W. (1976), Land use map accuracy criteria, Photogramm. Eng. Remote Sens. 42(5):671-677.
Hudson, W., and Ramm, C. (1987), Correct formulation of the kappa coefficient of agreement, Photogramm. Eng. Remote Sens. 53(4):421-422.
Meyer, M., and Werth, L. (1990), Satellite data: management panacea or potential problem?, J. Forestry 88(9):10-13.
Rhode, W. G. (1978), Digital image analysis techniques for natural resource inventory, in National Computer Conference Proceedings, pp. 43-106.
Rosenfield, G. (1981), Analysis of variance of thematic mapping experiment data, Photogramm. Eng. Remote Sens. 47(12):1685-1692.
Rosenfield, G., and Fitzpatrick-Lins, K. (1986), A coefficient of agreement as a measure of thematic classification accuracy, Photogramm. Eng. Remote Sens. 52(2):223-227.
Rosenfield, G. H., Fitzpatrick-Lins, K., and Ling, H. (1982), Sampling for thematic map accuracy testing, Photogramm. Eng. Remote Sens. 48(1):131-137.
Skidmore, A., and Turner, B. (1989), Assessing the accuracy of resource inventory maps, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 2, pp. 524-535.
Story, M., and Congalton, R. (1986), Accuracy assessment: a user's perspective, Photogramm. Eng. Remote Sens. 52(3):397-399.
van Genderen, J. L., and Lock, B. F. (1977), Testing land use map accuracy, Photogramm. Eng. Remote Sens. 43(9):1135-1137.