Accuracy Assessment

Note that the previously presented classification algorithms are
neither “right” nor “wrong”. For example, someone might object to
your use of the simple minimum distance to means classifier by
pointing out that your classes have different variability, or a person
might object to your use of a Maximum Likelihood Classifier
because your data are not normally distributed. Rather, each
algorithm does a better or worse job of classifying the pixels in an
image. Accuracy assessment tries to quantify how good a job the
classifier did.
Accuracy assessment should be an important part of any
classification. I say “should” because it is frequently not done…
The reason for this is that it usually involves a lot of work in the
field, which can be very expensive and time consuming. However,
without any accuracy assessment we do not know how accurate
our classification is: Are we right about 90% of the time with our
classification, or maybe only 50%?
The accuracy of a classification is usually assessed by comparing
the classification with some reference data that is believed to
accurately reflect the true land-cover. Sources of reference data
include among other things ground truth, higher resolution satellite
images, and maps derived from aerial photo interpretation. Note
that virtually all reference data (even ground truth data) are
inaccurate to some degree as well.
The accuracy assessment really reflects the difference between our
classification and the reference data. Consequently, if your
reference data are highly inaccurate, your assessment might indicate
that your classification is poor when it is actually a good
classification. It is better to get fewer, but more accurate, reference
data. Also be aware of temporal changes: if your satellite image
was taken at a different time than when you collected your
reference data, apparent errors might be due to the fact that your
landscape has changed.
Ideally, your selection of reference sites should be based on some
random sampling design. Ground truthing from the car is not a
random design and the estimated accuracy measures from these
types of sampling designs are usually biased.
Also, accuracy assessment should not be based on the training
pixels. The problem with using training pixels is that they are
usually not randomly selected and that the classification is not
independent of the training pixels. Using training pixels usually
results in an overly optimistic accuracy assessment.
The results of an accuracy assessment are usually summarized in a
confusion matrix:
                               Ground truth
Classified in
satellite image as:        Water   Agri.   Forest   No. classified pixels
Water                         46       2        2                      50
Agriculture                   10      37        3                      50
Forest                         5       1       44                      50
No. ground truth pixels       61      40       49                     150
Table 1: Confusion matrix
A measure for the overall classification accuracy can be derived
from this table by counting how many pixels were classified the
same in the satellite image and on the ground and dividing this by
the total number of pixels:
$$\frac{46 + 37 + 44}{150} = \frac{127}{150} \approx 84.7\%$$
The drawback of this measure is that it does not tell you anything
about how well individual classes were classified. The user and
producer accuracy are two widely used measures of class
accuracy. The producer’s accuracy refers to the probability that a
certain land-cover of an area on the ground is classified as such,
while the user’s accuracy refers to the probability that a pixel
labeled as a certain land-cover class in the map is really this class.
The user and producer accuracy for any given class typically are
not the same. In the above example, an estimate for the producer
accuracy of water is $46/61 \approx 75.4\%$ (46 of the 61 reference
water pixels), while the user accuracy is $46/50 = 92\%$ (46 of the
50 pixels classified as water).
As a user of a classification, I can expect that roughly 92% of all
the pixels classified as water are indeed water on the ground.
However, as a producer, I am quite unsettled by the fact that I only
classified 75.4% of all the water pixels as such.
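The three measures can be computed directly from a confusion matrix. The following Python sketch uses the sample counts behind Table 1, with rows as mapped classes and columns as reference classes. The off-diagonal counts are the ones implied by the totals and estimates quoted in this section, so treat them as an assumption; the function names are illustrative only.

```python
# Sample confusion matrix: rows = class assigned in the satellite image,
# columns = reference (ground truth) class. Off-diagonal counts are
# reconstructed from the totals quoted in the text (an assumption).
CLASSES = ["water", "agriculture", "forest"]
MATRIX = [
    [46, 2, 2],    # classified as water
    [10, 37, 3],   # classified as agriculture
    [5, 1, 44],    # classified as forest
]


def overall_accuracy(matrix):
    """Correctly classified samples (the diagonal) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def user_accuracy(matrix, k):
    """Diagonal entry over its row total (all samples mapped as class k)."""
    return matrix[k][k] / sum(matrix[k])


def producer_accuracy(matrix, k):
    """Diagonal entry over its column total (all reference samples of class k)."""
    return matrix[k][k] / sum(row[k] for row in matrix)


if __name__ == "__main__":
    print(f"overall: {overall_accuracy(MATRIX):.1%}")
    for k, name in enumerate(CLASSES):
        print(f"{name}: user {user_accuracy(MATRIX, k):.1%}, "
              f"producer {producer_accuracy(MATRIX, k):.1%}")
```

Reading along a row answers the map user's question (how much of what is labeled water really is water), while reading down a column answers the producer's question (how much of the real water was found).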
Note that all these measures of accuracies are estimates for the
true, unknown accuracies. Thus, it is reasonable to ask what is the
accuracy of these accuracy measures? In the above example, is the
user accuracy of water really 92% or could it also reasonably be
80%? In general, it is possible to construct confidence intervals for
all of these accuracy measures to give an idea of how precise the
estimates are. The width of these confidence intervals is influenced
mainly by the sample size and by the size of the accuracy measure.
For the same sample size, accuracy measures of 50% have a wider
confidence interval than accuracy measures of 90%. There is little
one can do statistically about the accuracy itself, since we are
already trying to get a classification with high accuracy.
Modifying the sample size, however, will affect the confidence
interval. Increasing the sample size results in a higher precision of
the estimated accuracy measures; decreasing it lowers the
precision. Note that you cannot bargain with statistics: if you
cannot afford more samples, then you need to live with the reality
that your estimated accuracies are imprecise estimates.
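To get a feeling for how sample size and the accuracy level drive the interval width, the following Python sketch tabulates an approximate 90% half-width. It is a simplification under stated assumptions: it ignores the finite population correction (the image is assumed to contain far more pixels than the sample) and uses the normal quantile 1.645 in place of a t quantile. The function name `half_width` is made up for this illustration.

```python
import math


def half_width(p_hat, n, z=1.645):
    """Approximate 90% confidence half-width for an estimated proportion.

    Uses the variance estimate p(1-p)/(n-1) without the finite population
    correction and a normal quantile z instead of a t quantile; both are
    simplifying assumptions.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / (n - 1))


if __name__ == "__main__":
    print("n, half-width at 50%, half-width at 90%")
    for n in (25, 50, 100, 400):
        print(n, round(half_width(0.5, n), 3), round(half_width(0.9, n), 3))
```

In every row the 50% column is wider than the 90% column, and quadrupling the sample size roughly halves the half-width.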
Contrary to what has been written in the literature, the total number
of pixels in a satellite image has little (negligible) influence on the
accuracy of the estimates. It is wrong to say that one needs to
increase the sample size because the area under investigation is
very large.
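One way to see this, assuming the variance estimator given under Confidence Interval later in this section, is that the population size N enters the variance only through the factor (N − n)/N. For a fixed sample of n = 50 pixels:

```latex
\frac{N-n}{N} = 1 - \frac{n}{N}:\qquad
N = 1{,}000 \;\Rightarrow\; 0.95,\qquad
N = 10{,}000 \;\Rightarrow\; 0.995,\qquad
N = 1{,}000{,}000 \;\Rightarrow\; 0.99995
```

The factor approaches 1 as N grows, so for large images the variance is determined almost entirely by the sample size and the estimated accuracy.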
Also, some people mistakenly believe that spatial correlation is a
problem for accuracy assessment. However, this also is usually
incorrect if a random sample is selected.
Another way to estimate the user accuracy more precisely is to use
a sampling design such as stratified random sampling. The course
“Sampling Methodology and Practice” by Timothy Gregoire deals
extensively with this subject.
Design Based Inference
In design based inference the population is regarded as fixed.
This makes sense since we have a certain classification that we
want to compare with the true land cover. The random part comes
into play when we select the sample. In simple random sampling,
each sample is selected with equal probability and independently
of the other samples. Consequently, samples are uncorrelated and
unbiased estimators are relatively easy to derive. Note that some
scientists believe that spatial autocorrelation is a problem for
deriving an unbiased estimator. However, it is your samples that
are random, not your land cover or the errors you made in your
classification.
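Drawing such a sample is straightforward in practice. The following Python sketch picks pixel positions with equal probability; it assumes sampling without replacement (no pixel is visited twice), and the function name and the image dimensions in the usage lines are hypothetical.

```python
import random


def simple_random_sample(n_rows, n_cols, n, seed=None):
    """Draw n distinct pixels, each with equal probability, as (row, col) pairs."""
    rng = random.Random(seed)
    # Sample positions in the flattened image, without replacement.
    flat = rng.sample(range(n_rows * n_cols), n)
    # Convert each flat index back to a (row, column) position.
    return [divmod(i, n_cols) for i in flat]


if __name__ == "__main__":
    # Hypothetical 100 x 80 pixel image, 50 reference sites.
    for row, col in simple_random_sample(100, 80, 50, seed=1)[:5]:
        print(row, col)
```

The land cover and the classification errors play no role in the selection; the randomness comes only from the draw itself.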
A very powerful design is the stratified sampling design. The basic
steps for this design are:
• Divide ALL of your pixels up so that they belong to one and
only one group. These groups are called strata.
• Take a sample within each stratum and calculate your
estimates for that stratum.
• Combine the results from all the strata. The trick here is to
estimate totals, i.e., the total number of correctly classified
pixels within each stratum. The variances for the total can
also be added.
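The steps above can be sketched in Python as follows. This is a minimal illustration, not a full implementation: it assumes simple random sampling within each stratum, uses the variance estimator from the Confidence Interval part of this section, and takes its input numbers from the worked stratified example later in this section. All function names are made up for the sketch.

```python
def stratum_total(N_h, n_h, correct_h):
    """Estimated number of correctly classified pixels in one stratum,
    together with the estimated variance of that total.

    N_h: pixels in the stratum, n_h: sampled pixels, correct_h: sampled
    pixels whose class agrees with the reference data.
    """
    p_h = correct_h / n_h
    var_p = (N_h - n_h) / N_h * p_h * (1.0 - p_h) / (n_h - 1)
    # Scaling a proportion by N_h scales its variance by N_h squared.
    return N_h * p_h, N_h ** 2 * var_p


def combine_strata(strata):
    """Add stratum totals and variances; return overall accuracy and its variance.

    strata: iterable of (N_h, n_h, correct_h) tuples.
    """
    strata = list(strata)
    N = sum(N_h for N_h, _, _ in strata)
    results = [stratum_total(*s) for s in strata]
    total = sum(t for t, _ in results)
    var_total = sum(v for _, v in results)
    return total / N, var_total / N ** 2


if __name__ == "__main__":
    # (stratum size, sample size, correct in sample) for water,
    # agriculture and forest, as in the example later in this section.
    accuracy, variance = combine_strata(
        [(1000, 50, 46), (5000, 50, 37), (4000, 50, 44)])
    print(f"overall accuracy {accuracy:.3f}, standard error {variance ** 0.5:.3f}")
```

Working with totals rather than proportions is what makes the combination simple: totals and their variances from different strata can just be summed, and the division by the overall number of pixels happens once at the end.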
Important: You can divide up your pixels into strata any way you
want to. As long as each pixel belongs to one and only one stratum
and you choose random sampling within each stratum, you can
come up with a valid estimate. However, some ways of stratifying
your data are more useful than others. Also, stratify your
classification before you select your samples. Some options are:
1) Use your classification. Each class becomes one stratum.
This way you have control over how accurately you want to
estimate the user accuracy. That is to say, you can ensure that
you take enough samples for smaller classes.
2) Create one (or several) “hard to get to” strata. If you are able
to identify areas that are very hard to get to before you take
your sample, create a separate stratum and take fewer
samples within this stratum. You might save a lot of time and
money, and you still can come up with valid estimates.
3) Create a “no way I can get to that place” stratum. Are you
really going to climb K2 and see what’s there? The
disadvantage here is that you will not be able to include these
areas in your final classification because you really have no
idea of how accurate your classification is in these areas. Are
you more or less accurate in these “difficult” areas? I believe
this is a more honest way of dealing with areas that you know
you will never be able to get to. Use this as a last resort:
don’t just create a stratum “I know I really did a poor job in
classifying these pixels here” or “I am too lazy to get these
pixels” to artificially increase your overall accuracy.
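Option 1 above (one stratum per mapped class) could be sketched in Python as follows. The function name, the label strings, and the tiny class map in the usage lines are hypothetical; the only requirements carried over from the text are that every pixel belongs to exactly one stratum and that pixels are chosen at random within each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(class_map, n_per_stratum, seed=None):
    """Pick pixels at random within each mapped class.

    class_map: 2-D list of class labels (the classification).
    n_per_stratum: dict mapping class label -> number of pixels to sample.
    Returns a dict mapping class label -> list of (row, col) pixels.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r, row in enumerate(class_map):
        for c, label in enumerate(row):
            strata[label].append((r, c))  # each pixel lands in exactly one stratum
    # Classes missing from n_per_stratum get no samples.
    return {label: rng.sample(pixels, n_per_stratum.get(label, 0))
            for label, pixels in strata.items()}


if __name__ == "__main__":
    class_map = [
        ["water", "water", "agri"],
        ["forest", "agri", "agri"],
        ["forest", "forest", "agri"],
    ]
    print(stratified_sample(class_map, {"water": 1, "agri": 2, "forest": 2}, seed=3))
```

Because the number of samples is set per class, a small class such as water can be given as many samples as a large one.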
Cluster Sampling
It is often less time consuming to assess the accuracy of two or
more neighboring pixels than it is to visit separate, independently
chosen pixels. For example, you might choose to select all pixels
in a 3 by 3 window instead of just one pixel. Since the pixels in
each cluster are not selected independently, the estimator is
somewhat less efficient than one based on independently chosen
pixels, and increasing your cluster size much beyond 25 (i.e., a
5 by 5 window) is usually not advisable.
Also, you can, of course, combine the cluster sampling design and
the stratification design.
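One possible way to select 3 by 3 clusters is sketched below in Python. It assumes the image is first tiled into non-overlapping blocks and that whole blocks are then drawn at random; this tiling is a choice made for the sketch, not something prescribed above, and the function name is hypothetical.

```python
import random


def sample_blocks(n_rows, n_cols, n_clusters, size=3, seed=None):
    """Select n_clusters non-overlapping size x size blocks at random.

    The image is tiled into blocks (blocks on the right or bottom edge may
    be smaller), blocks are drawn without replacement, and every pixel of
    a selected block is returned.
    """
    rng = random.Random(seed)
    # Upper-left corner of every block in the tiling.
    blocks = [(r, c) for r in range(0, n_rows, size)
              for c in range(0, n_cols, size)]
    clusters = []
    for r0, c0 in rng.sample(blocks, n_clusters):
        clusters.append([(r, c)
                         for r in range(r0, min(r0 + size, n_rows))
                         for c in range(c0, min(c0 + size, n_cols))])
    return clusters


if __name__ == "__main__":
    for cluster in sample_blocks(90, 90, 3, seed=7):
        print(cluster[0], "to", cluster[-1])
```

Each selected block is visited as a unit in the field, which is where the time saving comes from.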
Example for Stratified Sampling:
Suppose in our example above we had chosen to use a stratified
random sample design. In our classification we had a total of
10,000 classified pixels, 1,000 of them water, 5,000 agriculture and
4,000 forest. Thus we had three strata: water, agriculture and
forest. Since we don’t know anything about the accuracy of our
classification we decide to take the same number of samples (50)
in each stratum. Note that we could have chosen to take a different
number of samples in each stratum.
An estimate for the total number of correctly classified water
pixels in the water stratum is then: $\frac{46}{50} \times 1000 = 920$.
This means that we estimate that 920 of 1000 pixels classified as
water are indeed water. Similarly, we estimate that
$\frac{2}{50} \times 1000 = 40$ pixels were classified as water but
are indeed agriculture.
                          Ground truth, estimated
Classified in
satellite image as:        Water   Agri.   Forest   No. classified pixels
Water                        920      40       40                   1,000
Agriculture                1,000   3,700      300                   5,000
Forest                       400      80    3,520                   4,000
No. ground truth pixels    2,320   3,820    3,860                  10,000
Table 2: Confusion matrix of estimated total number of pixels
From this we can now estimate our overall accuracy as:
$$\frac{920 + 3{,}700 + 3{,}520}{10{,}000} = 81.4\%.$$
An estimate for our water user accuracy is then:
$$\frac{920}{1{,}000} = 92\%,$$
and an estimate for our water producer accuracy is:
$$\frac{920}{2{,}320} \approx 39.7\%.$$
Note that the confusion matrix in Table 1 cannot be used to derive
the overall accuracy if you had used a stratified sampling design.
However, if you just had a simple random sampling design, this
table would be still valid to derive an estimate for your overall
accuracy as well as the user and producer accuracy.
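The step from sample counts to the estimated totals in Table 2 can be reproduced with a few lines of Python. The sketch below assumes 50 samples per stratum and uses the sample counts implied by Table 2 and the figures quoted in the text; the variable and function names are illustrative only.

```python
# Sample counts per stratum (rows = mapped class, columns = reference class),
# in the order water, agriculture, forest. Off-diagonal counts are those
# implied by Table 2 and the totals quoted in the text (an assumption).
SAMPLE = [
    [46, 2, 2],
    [10, 37, 3],
    [5, 1, 44],
]
STRATUM_SIZES = [1000, 5000, 4000]  # classified pixels per class


def estimated_totals(sample, sizes):
    """Scale each row of sample counts by N_h / n_h to estimate pixel totals."""
    table = []
    for row, N_h in zip(sample, sizes):
        n_h = sum(row)
        table.append([count / n_h * N_h for count in row])
    return table


def accuracies(table, k):
    """Return (user, producer) accuracy of class k from a table of totals."""
    user = table[k][k] / sum(table[k])
    producer = table[k][k] / sum(row[k] for row in table)
    return user, producer


if __name__ == "__main__":
    totals = estimated_totals(SAMPLE, STRATUM_SIZES)
    overall = sum(totals[i][i] for i in range(3)) / sum(STRATUM_SIZES)
    user, producer = accuracies(totals, 0)
    print(f"overall {overall:.1%}, water user {user:.1%}, "
          f"water producer {producer:.1%}")
```

The user accuracy of water is unchanged by the scaling because it is computed within a single stratum, whereas the producer accuracy mixes pixels from all three strata and therefore depends strongly on the stratum sizes.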
Confidence Interval
An unbiased estimate for the variance is:
$$\mathrm{Var}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p}\,(1 - \hat{p})}{n - 1}$$
where $N$ denotes the total number of pixels, $n$ the number of
samples taken, and $\hat{p}$ is the estimated accuracy (proportion).
For example, the estimated user accuracy for water in the above
example was 0.92. The variance for this estimate is then
$$\mathrm{Var}(\hat{p}) = \frac{1{,}000 - 50}{1{,}000} \cdot \frac{0.92\,(1 - 0.92)}{50 - 1} = 0.00143$$
A 90% confidence interval then would be:
$$\hat{p} \pm t_{(1-0.9)/2,\;n-1}\,\sqrt{\mathrm{Var}(\hat{p})}$$
$$0.92 \pm 1.676\,\sqrt{0.00143}$$
$$0.92 \pm 0.063$$
Thus, the confidence limits are 85.7% and 98.3%.
Note that if we had a much larger population (e.g., 10,000 pixels in
this class), the estimated variance would be very similar (0.00149)
and the confidence limits would be 85.5% and 98.5%, an interval
that is only slightly wider.
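Both intervals above can be checked with a short Python sketch. It assumes the variance estimator with the factor (N − n)/N and takes the critical value 1.676 from the text as an input rather than computing it; the function names are made up for this illustration.

```python
import math


def variance(p_hat, n, N):
    """Estimated variance of a proportion with finite population correction."""
    return (N - n) / N * p_hat * (1.0 - p_hat) / (n - 1)


def confidence_interval(p_hat, n, N, t_crit):
    """Return (lower, upper) limits p_hat -/+ t_crit * sqrt(variance).

    t_crit is the Student t critical value for the chosen confidence level
    and n - 1 degrees of freedom, e.g. 1.676 for 90% and n = 50 as in the text.
    """
    half = t_crit * math.sqrt(variance(p_hat, n, N))
    return p_hat - half, p_hat + half


if __name__ == "__main__":
    for N in (1_000, 10_000):
        low, high = confidence_interval(0.92, 50, N, t_crit=1.676)
        print(f"N = {N}: variance {variance(0.92, 50, N):.5f}, "
              f"limits {low:.3f} to {high:.3f}")
```

Passing the critical value in keeps the sketch free of dependencies; it can be read from a t table or obtained from a statistics library.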