Image Classification (Chapter 12)

Introduction
• Digital image classification assigns pixels to classes (categories).
• Each pixel has as many digital values as there are spectral bands.
• These values are compared with pixels of known composition, and the pixel is assigned accordingly.
• Each class is (in theory) homogeneous.

Introduction: uses
• Direct use: produce a map of land use / land cover.
• Indirect use: classification is an intermediate step and may form only one of several data layers in a GIS (e.g., a water map used within a water-quality GIS).

Introduction: classifiers
• A classifier is a computer program that performs some form of image classification.
• Many different types are available; no one method is "best".
• The simplest is a point (spectral) classifier: it considers each pixel individually. It is simple and economical, but it cannot describe a pixel's relation to its neighbors.
• Spatial (neighborhood) classifiers consider groups of pixels; they are more difficult to program and more expensive to run.

Introduction: supervised vs unsupervised
• Supervised classification requires the analyst to identify areas of known cover.
• Unsupervised classification determines a set number of categories using a computer algorithm.
• Hybrid classifiers mix the two approaches.

Unsupervised classification
• No previous knowledge about the data is assumed; the algorithm tries to separate the pixels spectrally.
• The user controls the number of classes, the number of iterations, and the convergence thresholds.
• Two main algorithms: ISODATA and k-means.

Informational classes
• Informational classes are the categories of interest to users: geological units, forest types, land use, etc.

Spectral classes
• Spectral classes are groups of pixels with near-uniform brightness in each of their several channels.
• The idea is to link spectral classes to informational classes.
• In practice, variability causes confusion: a forest can contain trees of varying age, health, species composition, density, etc.

Classes
• Informational classes are therefore usually composed of numerous spectral subclasses.
• These subclasses may be displayed as a single unit on the final product.
• We often want to examine how different the pixels within a class are, using the variance and standard deviation.

Variance
• The variance is the average of the squared differences from the mean; it is the square of the standard deviation, i.e., σ².
• Worked example (dog heights at the shoulder): 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.
  Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
• Each dog's difference from the mean: 206, 76, −224, 36, −94.
• Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704
• Standard deviation: σ = √21,704 ≈ 147
• The standard deviation gives us a "standard" way of knowing what is normal and what is extra large or extra small: Rottweilers are tall dogs, and Dachshunds are a bit short ... but don't tell them!

Normal distribution

Differences between means
• A crude estimate of separability is simply the difference between the means of two classes.
• This is too simplistic, since it ignores differences in variability between the two classes.
• A better approach is the normalized difference:
  ND = |x̄a − x̄b| / (sa + sb)

NDWI
• The Normalized Difference Water Index (NDWI) (Gao, 1996).
• The SWIR reflectance responds to changes in both the vegetation water content and the spongy mesophyll structure of vegetation canopies.
• The NIR reflectance is affected by leaf internal structure and leaf dry-matter content, but not by water content.
• Combining the NIR with the SWIR removes the variations induced by leaf internal structure and leaf dry-matter content, improving the accuracy of vegetation water-content retrieval (Ceccato et al. 2001).
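Because the indices below all share the same (A − B) / (A + B) form, a minimal NumPy sketch can compute them directly from reflectance arrays. The band arrays and values here are made-up illustrations, not taken from any real scene, and the band pairing follows Gao's NIR/SWIR definition of NDWI.

```python
import numpy as np

def normalized_difference(band_a, band_b):
    """Generic normalized difference index: (a - b) / (a + b)."""
    a = band_a.astype(float)
    b = band_b.astype(float)
    denom = a + b
    # Avoid division by zero where both bands are zero.
    return np.where(denom != 0, (a - b) / denom, 0.0)

# Hypothetical reflectance arrays (rows x cols).
nir  = np.array([[0.40, 0.35], [0.42, 0.10]])
swir = np.array([[0.20, 0.30], [0.15, 0.05]])
red  = np.array([[0.08, 0.12], [0.07, 0.09]])

ndwi = normalized_difference(nir, swir)   # Gao (1996) NDWI
ndvi = normalized_difference(nir, red)    # NDVI, for comparison
print(ndwi)
print(ndvi)
```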
Normalized Difference Snow Index (NDSI)

Normalized Difference Vegetation Index (NDVI)

Normalized Difference Cloud Index (NDCI)
• Defined as the ratio between the difference and the sum of two zenith radiances measured in two narrow spectral bands in the visible and near-IR regions.
• Provides an extra tool for removing the radiative effects of 3-D cloud structure.

Unsupervised classification
• Remote sensing images are usually composed of several relatively uniform spectral classes.
• Unsupervised classification is the identification, labeling and mapping of such classes.

Unsupervised classification: advantages
• Requires no prior knowledge of the region.
• Human error is minimized.
• Unique classes are recognized as distinct units.
Unsupervised classification: disadvantages
• Classes do not necessarily match the informational categories of interest.
• Limited control over the classes and their identities.
• Spectral properties of classes can change with time.

Unsupervised classification: distance measures
• Distance measures are used to group (cluster) brightness values.
• The Euclidean distance between points in spectral space is a common measure of closeness: it is the "ordinary" distance one would measure with a ruler, given by the Pythagorean formula.

Euclidean distance: example
• The Euclidean distance between the points (2, −1) and (−2, 2) is
  dist((2, −1), (−2, 2)) = √((2 − (−2))² + ((−1) − 2)²) = √(4² + (−3)²) = √(16 + 9) = √25 = 5

Euclidean distance in several bands
• The same idea extends to multiple dimensions (bands): sum the squared differences over all bands and take the square root.

  Band   Pixel A   Pixel B   Difference   (Diff)²
   1       34        26          8           64
   2       28        16         12          144
   3       22        52        −30          900
   4        6        29        −23          529

  Σ(diff)² = 1,637;  √1,637 ≈ 40.5

Other distances
• A number of other distances can be calculated.
• The L1 distance is the sum of the absolute differences over the bands: 8 + 12 + 30 + 23 = 73 for the previous example.

Example: spectral plot (Landsat near-IR band vs red band)
• With two bands of data, each pixel marks a location in this 2-D spectral space.
• Our eyes can split the data into clusters; some points do not fit any cluster.

k-means (unsupervised)
1. A set number of cluster centers are positioned randomly through the spectral space.
2. Each pixel is assigned to its nearest cluster center.
3. The mean location is re-calculated for each cluster.
4. Repeat steps 2 and 3 until the movement of the cluster centers is below a threshold.
5. Assign class types to the spectral clusters.

Example k-means (Band 1 vs Band 2 plots)
1. First iteration: the cluster centers are set at random; pixels will be assigned to the nearest center.
2. Second iteration: each center moves to the mean of all pixels in its cluster.
3. N-th iteration: the centers have stabilized.
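The five k-means steps above can be written out in a short, self-contained sketch. This is a minimal illustration on synthetic two-band data, not tied to any particular sensor or software package.

```python
import numpy as np

def kmeans(pixels, k, max_iter=50, tol=1e-4, seed=0):
    """Minimal k-means on an (n_pixels, n_bands) array; returns labels and centers."""
    rng = np.random.default_rng(seed)
    # 1. Start from k randomly chosen pixels as the initial cluster centers.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2. Assign each pixel to its nearest center (Euclidean distance).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its assigned pixels.
        new_centers = np.array([
            pixels[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        # 4. Stop once the centers have effectively stopped moving.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers

# Synthetic two-band "image" flattened to (n_pixels, 2); the values are made up.
pixels = np.vstack([
    np.random.default_rng(1).normal([30, 80], 5, (100, 2)),
    np.random.default_rng(2).normal([90, 40], 5, (100, 2)),
])
labels, centers = kmeans(pixels, k=2)
print(centers)   # step 5 (naming the spectral clusters) is left to the analyst
```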
Key components
• Regardless of the unsupervised algorithm, attention must be paid to the methods used for:
  – Measuring distance
  – Identifying class centroids
  – Testing the distinctness of classes

Decision boundaries
• All classification programs try to determine classes using "decision boundaries", i.e., they divide feature space into an exhaustive set of non-overlapping regions.
• They begin with a set of pre-labeled points for each class (training samples).
• Minimum distance to means: the boundary is the locus of points equidistant from the class means.
• Nearest neighbor: the boundary is the locus of points equidistant from the nearest members of two classes.

Decision boundaries (continued)
• Classification is usually not so easy: the desired classes have distributions in feature space that are not obviously separated.
• We nearly always have to use more than three features (dimensions) and end up using discriminant functions.

Supervised classification
• Start with knowledge of the class types; the classes are chosen at the start.
• Training samples are created for each class, and ground truth is used to verify them.
• Quite a few algorithms exist; here we will look at the parallelepiped and maximum likelihood classifiers.

Supervised classification: advantages
• The analyst has control over the selected classes, tailored to the purpose.
• Uses specific classes of known identity.
• Spectral categories on the final map do not have to be matched to informational categories of interest afterwards.
• Serious classification errors can be detected if training areas are misclassified.

Supervised classification: disadvantages
• The analyst imposes a classification that may not be natural.
• Training data are usually tied to informational categories rather than spectral properties (remember within-class diversity).
• The training data selected may not be representative.
• Selecting training data may be time consuming and expensive.
• Special or unique categories may not be recognized because they are unknown or too small.

Supervised classification: training data
• Specify the corner points of the selected areas and assume the correct identity is known.
• Often requires ancillary data (maps, photos, etc.), and field work is often needed for verification.
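Once training pixels have been collected for each class, the per-band statistics that feed the ±1σ plots and separability measures discussed below can be computed in a few lines. This is a sketch only; the class names and digital numbers are hypothetical, and the separability measure reuses the normalized difference ND = |x̄a − x̄b| / (sa + sb) introduced earlier.

```python
import numpy as np

# Hypothetical training pixels: class name -> (n_pixels, n_bands) array of DNs.
training = {
    "water":  np.array([[12, 30, 8,  5], [14, 28, 9,  6], [11, 31, 7,  5]]),
    "forest": np.array([[25, 60, 40, 30], [27, 58, 43, 33], [26, 62, 41, 29]]),
}

stats = {}
for name, pixels in training.items():
    stats[name] = {
        "mean": pixels.mean(axis=0),          # per-band mean vector
        "std":  pixels.std(axis=0, ddof=0),   # per-band standard deviation
    }

def norm_diff(stats, a, b, band):
    """Normalized difference separability between classes a and b in one band."""
    num = abs(stats[a]["mean"][band] - stats[b]["mean"][band])
    return num / (stats[a]["std"][band] + stats[b]["std"][band])

for band in range(4):
    print(band, round(norm_diff(stats, "water", "forest", band), 2))
```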
Supervised classification: key characteristics of training areas
• Number of pixels: depends on the number of categories, their diversity, and the resources available.
• Use several training areas for each category, with a total of at least 100 pixels per category.
• Having more areas also allows discarding those with too high a variance.
• Shape is not important; areas are usually rectangular for ease of use.

Supervised classification: more key characteristics
• Locations must be spread around the image and be easily transferred from map to image.
• Size must be large enough to estimate the spectral characteristics and their variation; this varies with sensor type and resolution and with the heterogeneity of the area.
• Uniformity: each training set should be as homogeneous as possible.

Idealized sequence
1. Assemble information.
2. Conduct field studies.
3. Conduct a preliminary study of the scene to determine landmarks and assess image quality.
4. Identify training areas.
5. Evaluate the training data.
6. Edit the training data if necessary.

Feature selection
• Graphic method: one of the first simple feature-selection aids; plot the mean ±1σ for each class in a bar graph.
• Cospectral parallelepiped (ellipse) plots: a visual representation of separability in two-dimensional feature space. Using the mean and SD of the training statistics for each class c and band k, the parallelepipeds represent the mean ±1σ of each band for each class.
• Statistical methods: statistically try to separate the clusters. Two types of error result: a pixel assigned to a class to which it does not belong (error of commission), and a pixel not assigned to its appropriate class (error of omission).

Parallelepiped classifier
• Also known as the box decision rule or level-slice procedure; based on the values of the training data.
• For each training region, determine the range of values observed in each band. These ranges form a spectral box (parallelepiped) that is used to classify that class type.
• Assign new image pixels to the parallelepiped into which they fit best; pixels outside all boxes can be left unclassified or assigned to the closest box.
• Problem: classes that exhibit high correlation between bands form long "diagonal" clusters that do not fit well into a box.

Parallelepiped example
• Training classes are plotted in spectral space (here using two bands); each class type defines a spectral box.
• Note that some boxes overlap even though the classes are spatially separable. This is due to band correlation in some classes and can be overcome by customizing the boxes.
• The algorithm tests a pixel to see whether its spectral values fall within the bounds of each class. Pixels are tested sequentially against the defined classes (class 1 first, class 2 next, etc.); as soon as a test is passed, the pixel is classified and the algorithm moves on to the next pixel.
• The classifier is mathematically simple, but ambiguities arise for classes with overlapping bounds.
• Because the checking stops as soon as the pixel's digital numbers lie within the bounds of some class, the result is order dependent: the final classification depends on how the classes are numbered. This is not a desirable feature.
• Solution: the minimum distance classifier.
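Before moving on to the minimum distance classifier, here is a minimal sketch of the box decision rule just described. The class boxes are hypothetical min/max bounds per band, and the first-match assignment deliberately reproduces the order dependence noted above.

```python
import numpy as np

# Per-class boxes from hypothetical training data: min and max DN in each band.
boxes = {
    "water":  {"lo": np.array([10, 25]), "hi": np.array([15, 35])},
    "forest": {"lo": np.array([14, 30]), "hi": np.array([30, 65])},
}

def parallelepiped(pixel, boxes):
    """Return the first class whose box contains the pixel (order dependent),
    or 'unclassified' if the pixel falls outside every box."""
    for name, b in boxes.items():
        if np.all(pixel >= b["lo"]) and np.all(pixel <= b["hi"]):
            return name
    return "unclassified"

print(parallelepiped(np.array([12, 28]), boxes))  # inside the water box only
print(parallelepiped(np.array([14, 32]), boxes))  # inside both boxes -> first match wins
print(parallelepiped(np.array([80, 80]), boxes))  # outside every box
```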
Minimum distance classifier
• Each pixel in the scene is categorized using the distances between the pixel's digital number (spectral) vector and the means of the information classes derived from the training sets.
• The pixel is assigned to the class with the shortest distance.
• Some versions of this classifier use the standard deviation of the classes to set a minimum-distance threshold: if the minimum distance exceeds the threshold, the pixel is left unclassified (it does not belong to any of the classes represented by the training set).
• This classifier is slower than the parallelepiped classifier, but still mathematically simple.
• Problem: in its basic form it does not use the standard deviation derived from the training data.

Maximum likelihood (supervised)
• For each training class, the spectral variance and covariance are calculated, so the class can be statistically modeled with a mean vector and a covariance matrix.
• This assumes the class is normally distributed, which is generally acceptable for natural surfaces.
• Each unidentified pixel is then given a probability of belonging to each class and is assigned to the class with the highest probability, or left unclassified if all probabilities are low.

Maximum likelihood example (equiprobability contours)
• Normal probability distributions are fitted to each training class; the contour lines in the diagram show regions of equal probability.
• Point 1 would be assigned to the class "pond culture", as this is the most probable class.
• Point 2 would generally be left unclassified, as the probability of it fitting any of the classes is below the threshold.
• Characteristics: generally produces the most accurate classification results; assumes a normal distribution of the spectral data within the training classes; mathematically complex; computationally slow.

ISODATA (hybrid)
• Extends k-means by also calculating the standard deviation of each cluster.
• After the mean location is re-calculated for each cluster, we can:
  – Combine clusters whose centers are close.
  – Split clusters with a large standard deviation in any dimension.
  – Delete clusters that are too small.
• Then reclassify each pixel and repeat; stop on a maximum number of iterations or a convergence limit.
• Finally, assign class types to the spectral clusters.

Example ISODATA (Band 1 vs Band 2 plots)
1. The data are clustered, but the blue cluster is very stretched in band 1.
2. The cyan and green clusters contain two or fewer pixels, so they will be removed.
3. Outliers are either assigned to the nearest cluster or marked as unclassified.

Bayes's classification
• Bayesian classification is based on the probability of observing a particular class given a particular pixel value.
• Let's play a simple game with two sets of dice: one normal pair, and one augmented pair with 2 extra spots on each face.
• Player 1 selects a pair of dice at random and rolls them, announcing only the outcome; Player 2 names which type of dice was used.
• To determine the decision boundary, we can list all possible outcomes and how likely each one is: the normal dice give values 2–12, the augmented dice 6–16. How likely is each? A small enumeration is sketched below.
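As a quick check, this sketch enumerates the 36 equally likely rolls for each pair of dice and compares the outcome probabilities, assuming the augmented faces run 3 through 8 and that either pair is chosen with equal probability.

```python
from collections import Counter
from itertools import product

def outcome_probabilities(faces):
    """Probability of each sum when rolling two dice with the given faces."""
    counts = Counter(a + b for a, b in product(faces, repeat=2))
    return {total: n / 36 for total, n in sorted(counts.items())}

normal    = outcome_probabilities(range(1, 7))   # faces 1..6 -> sums 2..12
augmented = outcome_probabilities(range(3, 9))   # faces 3..8 -> sums 6..16

for total in range(2, 17):
    p_n = normal.get(total, 0)
    p_a = augmented.get(total, 0)
    # The more probable source is the Bayes decision for that outcome.
    guess = "normal" if p_n > p_a else "augmented" if p_a > p_n else "either"
    print(f"{total:2d}  P(normal)={p_n:.3f}  P(augmented)={p_a:.3f}  -> {guess}")
```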
• To get a 2, there is only one way to roll; to get a 3, there are two ways (1 and 2, or 2 and 1); and so on.
• The resulting histograms become discriminant functions, and the decision boundary can be set on the basis of the most probable outcome in any given case.
• If the outcome is a 7, it could have come from either pair of dice, but it is more likely to have come from the standard pair; if it is a 4, it almost certainly came from the standard pair.
• Now assume we are trying to guess the type of ground cover instead. We could estimate the probabilities from training areas, generate histograms to estimate the probability function for each class, and use these probabilities to separate the pixels.

k-nearest neighbors (k-NN)
• Requires three things:
  – The set of stored records.
  – A distance metric to compute the distance between records.
  – The value of k, the number of nearest neighbors to retrieve.
• To classify an unknown record: compute its distance to the training records, identify the k nearest neighbors, and use their class labels to determine the class of the unknown record (e.g., by majority vote).
• Distance between two points: Euclidean distance d(p, q) = √Σ(pᵢ − qᵢ)², or the Hamming distance (overlap metric) for categorical attributes.
• Example (figure): an unknown point "?" can be assigned to different classes depending on k; with k = 3 it belongs to the triangle class, with k = 7 to the square class.
• Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes. Choose an odd value of k to eliminate ties.
• The accuracy of any NN-based classification, prediction or recommendation depends solely on the data (there is no separate model), no matter which specific NN algorithm is used.
• Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one attribute. For example, height may vary from 4 ft to 6 ft, weight from 100 lbs to 300 lbs, and income from $10k to $500k.
• Nearest neighbor classifiers are "lazy learners": models are not built explicitly, unlike with eager learners.

k-NN advantages
• A simple technique that is easily implemented; building the model is cheap.
• An extremely flexible classification scheme, well suited to multi-modal classes and to records with multiple class labels.
• Error: Cover & Hart (1967) showed that the nearest-neighbor error rate is asymptotically at most twice the Bayes error rate; k-NN can sometimes be the best method. For example, k-nearest neighbor outperformed SVM for protein function prediction using expression profiles (Michihiro Kuramochi and George Karypis, "Gene Classification using Expression Profiles: A Feasibility Study", International Journal on Artificial Intelligence Tools, Vol. 14, No. 4, pp. 641-660, 2005).

k-NN disadvantages
• Classifying unknown records is relatively expensive: it requires computing the distances to the k nearest neighbors and is computationally intensive, especially as the training set grows.
• Accuracy can be severely degraded by the presence of noisy or irrelevant features.
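The three k-NN steps (distance, neighbor search, majority vote) fit in a short sketch. The two-band training records and class labels below are invented for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(unknown, records, labels, k=3):
    """Classify one record by majority vote among its k nearest (Euclidean) neighbors."""
    # 1. Compute the distance from the unknown record to every stored record.
    dists = np.linalg.norm(records - unknown, axis=1)
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among their class labels.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical two-band training records and their labels.
records = np.array([[12, 30], [14, 28], [11, 31], [60, 90], [62, 88], [58, 92]], float)
labels  = ["water", "water", "water", "forest", "forest", "forest"]

print(knn_classify(np.array([13, 29], float), records, labels, k=3))  # -> water
print(knn_classify(np.array([59, 91], float), records, labels, k=3))  # -> forest
```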
Many classifiers

Ground truth
• Ideally the training regions should be based on ground observation.
• They should be large enough to capture all the spectral variability in the class type, e.g., different types of forest, shallow water versus deep ocean, etc.
• They do not need to be too detailed, otherwise the classes will not be spectrally separable.

Ancillary data
• Data acquired by other means (maps, reports, other data sets), used to assist in the classification or analysis.
• Primary requirements: available digitally, pertinent to the problem, and compatible with the remote sensing data.
• Incompatibility is a serious problem: physical (digital formats), logical (the data were usually collected for another reason), scale, resolution, date, and accuracy.
• Stratification: subdivide the image into regions that are easy to define using ancillary data; elevation, for example, could be used to analyze alpine vegetation separately from lowland vegetation.
• Post-classification sorting: examine the confusion matrix and look in more detail at the confused classes.

Post-classification
• Non-training regions can be checked against additional ground truth, if available, and classification statistics calculated.
• Confusion matrix: columns show the ground truth; rows show how many pixels were assigned to each class.
• Overall accuracy: total correct pixels / total pixels.
• Commission errors: pixels incorrectly assigned to a class.
• Omission errors: pixels belonging to a class that were assigned to a different class.
• Visually check the result for major errors or unwanted features.

Classification and regression tree analysis
• Classification and Regression Tree analysis (CART) is a method of incorporating ancillary data into image classification.
• It requires accurate training data, but no prior knowledge of the role of the variables.
• Its advantage is that it identifies the useful data and separates them from data that do not contribute to the classification.

Fuzzy clustering
• Traditional methods allow a pixel to be identified with only a single cluster.
• Many processes can make the matching problematic, so many pixels will be incorrectly labeled.
• Fuzzy logic allows partial membership: instead of a "water" pixel, it could be 0.7 water and 0.3 forest.

Neural networks
• Artificial Neural Networks (ANNs) are computer programs that simulate the brain: linkages between input and output are established and then reinforced.
• They are generally composed of three elements: an input layer (the source data), hidden layers (association by weights), and an output layer (the classes).
• Forward propagation is the normal training-to-classification sequence; backward propagation is a retrospective analysis of input and output that allows the weights to be adjusted.
• This creates a transfer function: a quantitative link between input and output. The weights may show that some bands are more effective for certain classes and other bands for different classes.

Contextual classification
• Context is derived from the spatial relationships within the image.
• Contextual classifiers can operate on either classified or unclassified scenes; usually some classification has already been done.
• Pixels are reassigned as appropriate based on their location (context), as in the sketch below.
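As one simple illustration of a contextual operation (not the only one), this sketch applies a 3×3 majority (mode) filter to an already-classified map, reassigning each pixel to the most common class in its neighborhood. The tiny class map and the class codes are made up.

```python
import numpy as np
from collections import Counter

def majority_filter(class_map, size=3):
    """Reassign each pixel to the most common class in its size x size neighborhood."""
    pad = size // 2
    padded = np.pad(class_map, pad, mode="edge")
    out = class_map.copy()
    rows, cols = class_map.shape
    for r in range(rows):
        for c in range(cols):
            window = padded[r:r + size, c:c + size].ravel()
            out[r, c] = Counter(window).most_common(1)[0][0]
    return out

# Tiny made-up class map (1 = water, 2 = forest) with one isolated "water" pixel.
class_map = np.array([
    [2, 2, 2, 2],
    [2, 1, 2, 2],
    [2, 2, 2, 2],
])
print(majority_filter(class_map))   # the lone water pixel is reassigned to forest
```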