CLUSTER ANALYSIS OF LANDSAT TM DATA Masatoshi Mori and Keinosuke Gotoh t Department of Management Engineering, Kinki University, Iizuka 820, Japan tDepartment of Civil Engineering, Nagasaki University, Nagasaki 852, Japan PURPOSE: Cluster analysis ~ystem by using Statistical Analysis System (SAS) is constructed. The analysis system is applied to Landsat TM data. The s:x spectral bands except t~e thermal band of TM data are used for analysis. Three hierarchical algorithms of cluster analysIs, \\:,ard method, a,:,erag~ lmkage, and centroid method, are prepared in the framework of SAS cluster. The results by three alg;of1thm~ are exammed m relation to classification. The classified map by Ward method is the most accurate among them, and IS practIcal. On the other hand, the others are almost same and incorrect. KEYWORDS: Cluster analysis, Image processing, Landsat TM data, SAS 1. INTRODUCTION as the sea, rivers, ponds, residential, commercial, bare soil, roads, grass, etc. These typical categories are very important for checking cluster algorithms. We adopted six spectral bands except the thermal (the 6th) band for the analysis system, because the spacial resolution of the thermal band is 120m, which is different from that of the others, so we excluded the thermal band. Then we construct the six-dimensional feature space of TM data. In use of Landsat Thematic Mapper (TM) Data, cluster analysis as an unsupervised classification scheme is one of the most useful methods. Specially, in the case of being not clear of text area in objective image, cluster analysis is effective and powerful (Koontz, 1976). A difficulty in using of cluster analysis is, however, existing in extending spectral band capacity. There are seven spectral bands in Landsat TM sensors. Each of these sensors has a dynamic range with 256 levels (8 bits) nominally. In the first stage of cluster analysis we construct a multi-dimensional feature space (Goldberg, 1978). If using all the seven spectral data of TM in analysis, a seven-dimensional feature space is necessary, which needs 256 7 bits of computer memory (Wharton, 1983). It is, however, impossible to construct this feature space in actuality. Then, as conventional method of clustering, preprocess of sampling has been adopted. In this method, before analyzing data, several", several hundred points are selected from original image data. Then, these selected points are directly analyzed by using clustering scheme, and classified to land cover map. Finally every point of original data is assigned to the nearest classified point. As every point of original data is not analyzed directly by this procedure, incorrect classification in assigning points to the resultant class may occur. The reason is that the Euclid metric, not isotropic, is used in assigning, category area is not always an ellipsoid in the multi-dimensional feature space, so erroneous assigning may occur still. 2.2 SAS/cluster procedure The three algorithms of SAS/cluster procedures are prepared and compared with one another, which- are Ward method, average linkage, and centroid method. They are all hierarchical methods. The jobs of clustering are carried out on IBM 3081 computer and FACOM M-1S00 computer systems. The CPU time of SAS/cluster procedure is about 11 minutes for three algorithms, but the jobs need more than 9 mega bytes in computer memory. (a) Ward method By Ward method clusters are joined so as to maximize the likelihood at each level of the hierarchy. Ward method tends to join clusters with a small munber of observation and is strongly biased toward producing clusters with roughly the same number of observations. (b) average linkage In average linkage the distance between two clusters is the average distance between pairs of observations, one on each cluster. Average linkage tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance. In the present paper we consider a direct analysis of every point of original data of Landsat TM, without preprocessing for sampling. We use Statistical AnaJysis System (SAS) (J oyner, 1985) so as to complete this analysis system. By using SAS system, we can process binary data of multidimensional image. However, as the direct result from SAS is a set of records, it is difficult to reform the record format to binary data of classified image. Moreover, because of direct analysis system, a number of pixels of original image are limited. We try to examine three algorithms of cluster analysis, Ward method, average linkage, and centroid method, compared with resultant classified image. The result using TM data is shown precisely. (c) centroid method In centroid method the distance between two clusters is defined as Euclidean distance between their centroid or means. The centroid method is more robust to outliers than most other hierarchical methods but in other aspects may not perform as well as Ward method and average linkage. 2.3 Reconstruction to image data 2. CLUSTER ANALYSIS SYSTEM For direct analysis of image data by SAS, the six image files of TM data are allocated to device names, respectively. After the process of clustering procedure, the sample results by Ward method as shown in Table I are obtained. In the case of Table I, initially it starts with 1000 pixels, which are equal to 1000 clusters. Each record of the result means the one process of joining two clusters to one. In the right column the frequency means a number of pixels belonging to a new cluster. Finally one cluster is obtained as known 2.1 Image data In the present study, we used Landsat-4 TM data (the scene: 113-37, obtained on May 22, 1984). The square image of 120 x 120 pixels was cutoff from the scene, which is shown in Fig.!. This image is a part of Fukuoka City in Japan, and contains several typical features of land such 196 Table I. A sample of the result of SAS cluster procedure by Ward method. Number of clusters Cluster Jomted Frequency of new cluster CL48 CL45 CL34 CL23 CL36 CL25 CL39 CL24 CL20 CL28 CL16 CL18 CL27 CL15 CL13 CL7 CL9 CL8 CL6 CLll CL2 CL58 CL37 CL33 CL46 CL30 CL41 CL26 CL54 CL31 CL29 CL19 CL17 CL21 CL35 CL22 CLIO CL12 CL14 CL4 CL5 CL3 55 50 189 64 48 377 25 10 73 102 566 112 90 35 85 197 192 45 242 758 1000 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 Fig.I. Color composite image of Landsat TM ( the scene: 113-37, obtained on May 22, 1984). in Table I. Then, we can obtain an expected number of clusters by cluster procedure. In the present study we examined the case of eight clusters for three algorithms. In order to reconstruct classified image from these records, we must know which one among eight clusters esch pixel of image data belongs to. Analysis of this tree network gives the relation between pixels and cluster numbers. Thus we can obtain a final classified image. This reconstruction process is coded in FORTRAN77. The land cover map by centroid method is shown in Fig.2 (c). The result is almost same as- that by average linkage. The main cluster contains almost all-categories in land, which is more remarkable than the case by average linkage. 4. CONCLUSION 3. RESULTS OF CLASSIFICATION The cluster analysis system by using SAS has been constructed, and was applied. to Landsat TM data. Three algorithms of cluster analysis, Ward method, average linkage, and centroid method, were examined and compared with one another. The classification result by Ward method is most accurate among them. The results by average linkage and centroid method are almost same, but average linkage is somewhat better. However, the separation between water region and land area is rather accurate in classifications by three algorithms naturally. 3.1 Classification land cover map Results of classification by three algorithms are shown in Fig.2 (a)"" (c), respectively. Also the rates of categories are shown. These map were obtained by reconstruction procedure described in Section 2. Each color used in maps is assigned to the most suitable category. The field survay has been performed for checking accuracy of classification. 3.2 Ward method Although we can analyze six-dimensional image data at one procedure, the image size is limited to around 120x 120 pixels due to the framework of SAS/cluster. This size is, however, not sufficientlly practical. One more problem is that three algorithms are all hierarchical. Non-hierarchical algorithms such as the k-means and iso data method are not included in SAS cluster system. The result by Ward method is shown in Fig.2 (a). As known from Fig.2 (a)"",(c) , this classification is most accurate among three algorithms. Eight categories classified in the procedure are water, river and shallows, grass, wood, bare soil, residential, commercial, and asphalted area. Water area contains the sea and a pond with salt water. Wood occupies verdart parks. Bare soil contains playgrounds of school and open area being developed. Asphalted area contains roads, yards, and squares. Although these area are comparatively accurate, residential area contains unclassified area partly. REREFERENCES Goldberg, S.W., and Shilien, S., 1978. A clustering scheme for multidimensional images. IEEE Transaction on system, man, and cybernetics, SMC-8(2):,86-92~ Joyner, S.P., 1985. The SAS User's Guide: Stasistics, Version 5 Edition, SAS Institute Inc. 3.3 average linkage Koontz, W.L.G., Narendra, P., and Fukunaga, K., 1976. A graph-theoretic approach to nonparametric cluster analysis. IEEE Transaction on comp:uters, C-25(9):936-944. The result by average linkage is shown in Fig.2 (b). The classification in land area is incorrect. The main land cover gives commercial, residential, grass, and wood, which are joined to one cluster. Only bare soil is comparatively correct. Bare soil 1",,4 are essentially the same category. However, these bare soil 1",,4 have been separated due to the characteristics of this algorithm. Wharton, S.W., 1983. A generalized histogram clustering scheme for multidimensional image data. Pattern Recognition, 16(2):193-199. 3.4 centroid method 197 Water River and Shallows 16.58 % 5.67 % Asphalted 17.83 % Commercial 23.06 % Residential 16.02 % Bare soil 4.69 % Grass 12.77 % Wood 3.38 % Water 20.40 % Land 72.58 % Grass 0.13 % Bare soil 1 4.45 Bare soil 2 0.97 % Bare soil 3 0.24 % Bare soil 4 0.17 % Crossroads 1.06 % (a) Ward method - % (b) average linkage - Water 18.32 % Land 78.26 % Grass 0.02 % Bare soil 1 2.96 % Bare soil 2 0.24-% Bare soil 3 0.03 % Bare soil 4 0.14 % Crossroads 0.03 % (c) centroid method Fig.2. Results of cluster analysis of Fig.1 by three algorithms, which show classification maps, categories, and rates of occupation. 198