Accuracy Assessment

Note that the previously presented classification algorithms are neither "right" nor "wrong". For example, someone might object to your use of the simple minimum-distance-to-means classifier by pointing out that your classes have different variability, or a person might object to your use of a Maximum Likelihood Classifier because your data are not normally distributed. These algorithms are not "right" or "wrong"; rather, they do a better or worse job of classifying the pixels in an image. Accuracy assessment tries to quantify how good a job the classifier did.

Accuracy assessment should be an important part of any classification. I say "should" because it is frequently not done. The reason is that it usually involves a lot of work in the field, which can be very expensive and time consuming. However, without an accuracy assessment we do not know how accurate our classification is: are we right about 90% of the time, or maybe only 50%?

The accuracy of a classification is usually assessed by comparing the classification with some reference data that is believed to accurately reflect the true land cover. Sources of reference data include, among other things, ground truth, higher-resolution satellite images, and maps derived from aerial photo interpretation. Note that virtually all reference data (even ground truth data) are inaccurate to some degree as well. The accuracy assessment really reflects the difference between our classification and the reference data. Consequently, if your reference data are highly inaccurate, your assessment might indicate that your classification is poor even though it is actually good. It is better to have fewer, but more accurate, reference data. Also be aware of temporal changes: if your satellite image was taken at a different time than when you collected your reference data, apparent errors might be due to the fact that your landscape has changed in the meantime.

Ideally, your selection of reference sites should be based on some random sampling design. Ground truthing from the car is not a random design, and the accuracy measures estimated from such sampling designs are usually biased. Also, accuracy assessment should not be based on the training pixels. The problem with using training pixels is that they are usually not randomly selected and that the classification is not independent of them. Using training pixels usually results in an overly optimistic accuracy assessment.

The results of an accuracy assessment are usually summarized in a confusion matrix:

                                      Ground truth
                                Water   Agri.   Forest   No. classified pixels
Classified in    Water             46       2        2                      50
satellite        Agriculture       10      37        3                      50
image as:        Forest             5       1       44                      50
No. ground truth pixels            61      40       49                     150

Table 1: Confusion matrix

A measure of the overall classification accuracy can be derived from this table by counting how many pixels were classified the same in the satellite image and on the ground, and dividing this by the total number of pixels:

    (46 + 37 + 44) / 150 = 84.7%

The drawback of this measure is that it does not tell you anything about how well individual classes were classified. The user's and producer's accuracy are two widely used measures of class accuracy. The producer's accuracy is the probability that an area with a certain land cover on the ground is classified as such, while the user's accuracy is the probability that a pixel labeled as a certain land-cover class in the map really is that class.
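All three measures are easy to compute from a confusion matrix. The following is a minimal Python sketch using only the counts from Table 1; the array layout and variable names are my own choices for illustration:

```python
import numpy as np

# Confusion matrix from Table 1: rows = classified as, columns = ground truth.
# Row/column order: water, agriculture, forest.
classes = ["water", "agriculture", "forest"]
cm = np.array([
    [46,  2,  2],   # classified as water
    [10, 37,  3],   # classified as agriculture
    [ 5,  1, 44],   # classified as forest
])

# Overall accuracy: correctly classified pixels (the diagonal) over all pixels.
overall = cm.trace() / cm.sum()             # (46 + 37 + 44) / 150 = 0.847

# User's accuracy: diagonal over row sums (of all pixels *labeled* as a class,
# how many really are that class on the ground).
users = cm.diagonal() / cm.sum(axis=1)      # water: 46/50 = 0.92

# Producer's accuracy: diagonal over column sums (of all pixels that *are* a
# class on the ground, how many were labeled as such).
producers = cm.diagonal() / cm.sum(axis=0)  # water: 46/61 = 0.754

print(f"overall accuracy: {overall:.1%}")
for name, u, p in zip(classes, users, producers):
    print(f"{name:12s} user: {u:.1%}  producer: {p:.1%}")
```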
The user's and producer's accuracy for a given class are typically not the same. In the above example, an estimate of the producer's accuracy for water is 46/61 = 75.4%, while the user's accuracy is 46/50 = 92%. As a user of the classification, I can expect that roughly 92% of all pixels classified as water are indeed water on the ground. However, as the producer, I am quite unsettled by the fact that I classified only 75.4% of all the water pixels as such.

Note that all of these accuracy measures are estimates of the true, unknown accuracies. Thus, it is reasonable to ask: what is the accuracy of these accuracy measures? In the above example, is the user's accuracy of water really 92%, or could it also reasonably be 80%? In general it is possible to construct confidence intervals for all of these accuracy measures to convey the precision of the estimates. The width of these confidence intervals is influenced mainly by the sample size and by the magnitude of the accuracy measures themselves. For the same sample size, an accuracy measure near 50% has a wider confidence interval than one near 90%. Statistically, there is little one can do to improve the accuracy itself, since we are already trying to get a classification with high accuracy. Modifying the sample size, however, will affect the confidence interval: increasing the sample size increases the precision of the estimated accuracy measures, and decreasing it lowers the precision. Note that you cannot bargain with statistics: decreasing your sample size will result in less precise estimates. If you cannot afford more samples, then you must live with the reality that your estimated accuracies are poor estimates.

Contrary to what has been written in the literature, the total number of pixels in a satellite image has a negligible influence on the accuracy of the estimates. It is wrong to say that one needs to increase the sample size because the area under investigation is very large. Also, some people mistakenly believe that spatial correlation is a problem for accuracy assessment; this too is usually incorrect if a random sample is selected. Another way to obtain better estimates of the user's accuracy is to use a sampling design such as stratified random sampling. The course "Sampling Methodology and Practice" by Timothy Gregoire deals extensively with this subject.

Design Based Inference

In design-based inference the population is regarded as fixed. This makes sense here, since we have one particular classification that we want to compare with the true land cover. The random part comes into play when we select the sample. In simple random sampling, each sample is selected with equal probability and independently of the other samples. Consequently, the samples are uncorrelated, and unbiased estimators are relatively easy to derive. Note that some scientists believe that spatial autocorrelation is a problem for deriving an unbiased estimator. However, it is your samples that are random, not your land cover or the errors you made in your classification.

Stratification

A very powerful design is the stratified sampling design. The basic steps for this design are:

• Divide ALL of your pixels up so that each belongs to one and only one group. These groups are called strata.
• Take a sample within each stratum and calculate your estimates.
• Combine the results from all the strata.

The trick here is to estimate totals, i.e., the total number of correctly classified pixels within each stratum (a sketch of the sampling step follows below; the estimation step is worked through in the example further down).
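To make the first two steps concrete, here is a minimal sketch of drawing such a stratified random sample from a classified map. The map array, random seed, and per-stratum sample size are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical classified map, flattened to one class label per pixel:
# 0 = water, 1 = agriculture, 2 = forest. A real map would come from the
# classifier output, not from a random generator.
labels = rng.integers(0, 3, size=10_000)

# Step 1: each class becomes one stratum, so every pixel belongs to
# exactly one stratum.
# Step 2: simple random sampling within each stratum, without replacement.
n_per_stratum = 50
sample_indices = {}
for cls in (0, 1, 2):
    stratum = np.flatnonzero(labels == cls)   # all pixel indices in this stratum
    sample_indices[cls] = rng.choice(stratum, size=n_per_stratum, replace=False)

# sample_indices[cls] now lists the pixels to check against the reference
# data (field visits, aerial photos, ...) for that stratum.
```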
The variances of the stratum totals can also be added. Important: you can divide your pixels into strata any way you want to. As long as each pixel belongs to one and only one stratum and you use random sampling within each stratum, you can come up with a valid estimate. However, there are more, and less, useful ways to stratify your data. Also, stratify your classification before you select your samples. Examples:

1) Use your classification: each class becomes one stratum. This way you have control over how accurately you estimate the user's accuracy; that is, you can ensure that you take enough samples for the smaller classes.

2) Create one (or several) "hard to get to" strata. If you can identify areas that are very hard to reach before you take your sample, create a separate stratum for them and take fewer samples within that stratum. You might save a lot of time and money, and you can still come up with valid estimates.

3) Create a "no way I can get to that place" stratum. Are you really going to climb K2 to see what's there? The disadvantage is that you will not be able to include these areas in your final classification, because you really have no idea how accurate your classification is there: are you more or less accurate in these "difficult" areas? I believe this is a more honest way of dealing with areas that you know you will never be able to visit. Use this as a last resort, though, and do not simply create a stratum of "I know I really did a poor job in classifying these pixels" or "I am too lazy to get these pixels" to artificially increase your overall accuracy.

Cluster Sampling

It is often less time consuming to assess the accuracy of two or more neighboring pixels than to visit separate, independently chosen pixels. For example, you might choose to assess all pixels in a 3 by 3 window instead of just one pixel. Since the samples within a cluster are not selected independently, the estimator is somewhat less efficient than one based on independently chosen samples, and increasing your cluster size much beyond 25 pixels (i.e., a 5 by 5 window) is usually not advisable. You can, of course, combine the cluster sampling design with the stratification design.

Example for Stratified Sampling

Suppose in our example above we had chosen a stratified random sampling design. Our classification contains a total of 10,000 classified pixels: 1,000 water, 5,000 agriculture, and 4,000 forest. Thus we have three strata: water, agriculture, and forest. Since we do not know anything about the accuracy of our classification beforehand, we decide to take the same number of samples (50) in each stratum. Note that we could have chosen to take a different number of samples in each stratum.

An estimate of the total number of correctly classified water pixels in the water stratum is then (46/50) × 1,000 = 920. That is, we estimate that 920 of the 1,000 pixels classified as water are indeed water. Similarly, we estimate that (2/50) × 1,000 = 40 pixels were classified as water but are actually agriculture.

                                      Ground truth (estimated)
                                Water   Agri.   Forest   No. classified pixels
Classified in    Water            920      40       40                   1,000
satellite        Agriculture    1,000   3,700      300                   5,000
image as:        Forest           400      80    3,520                   4,000
No. ground truth pixels         2,320   3,820    3,860                  10,000

Table 2: Confusion matrix of estimated total numbers of pixels

From this we can now estimate our overall accuracy as (920 + 3,700 + 3,520) / 10,000 = 81.4%. An estimate of our water user's accuracy is then 920/1,000 = 92%, and our water producer's accuracy is 920/2,320 = 39.6%.
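The row-by-row scaling that produced Table 2 can be written compactly. The following sketch assumes the Table 1 sample counts and the stratum sizes given above; the variable names are mine:

```python
import numpy as np

# Sample confusion matrix from Table 1 (rows = classified stratum,
# columns = ground truth), with 50 samples per stratum.
cm_sample = np.array([
    [46,  2,  2],   # water stratum
    [10, 37,  3],   # agriculture stratum
    [ 5,  1, 44],   # forest stratum
])
strata_sizes = np.array([1_000, 5_000, 4_000])  # water, agriculture, forest
n_per_stratum = cm_sample.sum(axis=1)           # 50 samples in each stratum

# Scale each row from sample counts to estimated pixel totals (Table 2):
# estimated total = (sample count / stratum sample size) * stratum size.
cm_total = cm_sample / n_per_stratum[:, None] * strata_sizes[:, None]

overall = cm_total.trace() / cm_total.sum()             # 8,140/10,000 = 0.814
user_water = cm_total[0, 0] / cm_total[0].sum()         # 920/1,000   = 0.92
producer_water = cm_total[0, 0] / cm_total[:, 0].sum()  # 920/2,320   = 0.397

print(cm_total)
print(f"overall: {overall:.1%}, water user: {user_water:.1%}, "
      f"water producer: {producer_water:.1%}")
```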
Note that the confusion matrix in Table 1 cannot be used directly to derive the overall accuracy if you used a stratified sampling design. However, if you used a simple random sampling design, that table is valid for deriving estimates of the overall accuracy as well as the user's and producer's accuracies.

Confidence Interval

An unbiased estimate of the variance of an estimated accuracy is

\[
\widehat{\operatorname{Var}}(\hat{p}) = \frac{N - n}{N} \cdot \frac{\hat{p}(1 - \hat{p})}{n - 1},
\]

where N denotes the total number of pixels (here, the number of pixels in the stratum), n the number of samples taken, and \(\hat{p}\) the estimated accuracy (a proportion). For example, the estimated user's accuracy for water in the above example was 0.92. The variance of this estimate is then

\[
\widehat{\operatorname{Var}}(\hat{p}) = \frac{1{,}000 - 50}{1{,}000} \cdot \frac{0.92\,(1 - 0.92)}{50 - 1} = 0.00143.
\]

A 90% confidence interval is then

\[
\hat{p} \pm t_{1-\alpha/2,\,n-1} \sqrt{\widehat{\operatorname{Var}}(\hat{p})}
= 0.92 \pm 1.676\,\sqrt{0.00143}
= 0.92 \pm 0.063,
\]

with \(\alpha = 0.10\). Thus, the confidence limits are 85.7% and 98.3%. Note that if we had a much larger population (say, 10,000 pixels in this class), the estimated variance would be very similar (0.00149) and the confidence limits would be 85.5% and 98.5%; the interval is only slightly wider.
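A minimal sketch of this calculation, using scipy for the t quantile (the helper function and its name are my own):

```python
import numpy as np
from scipy.stats import t

def accuracy_ci(p_hat, n, N, level=0.90):
    """Finite-population variance and confidence interval for an estimated
    accuracy (proportion), following the formula in the text."""
    var = (N - n) / N * p_hat * (1 - p_hat) / (n - 1)
    t_val = t.ppf(1 - (1 - level) / 2, df=n - 1)  # e.g. t(0.95, 49) = 1.676
    half_width = t_val * np.sqrt(var)
    return var, (p_hat - half_width, p_hat + half_width)

# Water user's accuracy from the example: 46/50 = 0.92, stratum size N = 1,000.
var, (lo, hi) = accuracy_ci(p_hat=0.92, n=50, N=1_000)
print(f"variance: {var:.5f}, 90% CI: [{lo:.1%}, {hi:.1%}]")  # ~[85.7%, 98.3%]

# With a much larger population (N = 10,000) the interval barely changes.
var2, (lo2, hi2) = accuracy_ci(p_hat=0.92, n=50, N=10_000)
print(f"variance: {var2:.5f}, 90% CI: [{lo2:.1%}, {hi2:.1%}]")  # ~[85.5%, 98.5%]
```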