Cluster Analysis of Spatial Data Using Peano Count Tree 1, 2

Qiang Ding, William Perrizo
Department of Computer Science
North Dakota State University
Fargo, ND 58105-5164
{Qiang.Ding, William.Perrizo}@ndsu.nodak.edu

Abstract

Cluster analysis of spatial data is an important field due to the very large quantities of spatial data collected in various application areas. In most cases, the spatial data sets are too large to be mined in a reasonable amount of time using existing methods. In this paper, we restructure the data using a data-mining-ready data structure, the Peano Count Tree (P-tree), and then, based on this new data format, we present a new scalable clustering algorithm that also handles noise and outliers effectively. We achieve this by introducing a new similarity function and adjusting the partitioning method. Our analysis shows that, with the assistance of P-trees, very large data sets can be handled efficiently and easily.

Keywords

Data mining, Cluster Analysis, Spatial Data, Peano ordering, k-Medoids

1 INTRODUCTION

Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In the past 30 years, cluster analysis has been widely applied to many areas. Spatial data is a promising area for clustering [3]. Due to the large size of spatial data, such as satellite images, the existing methods are not very suitable. In this paper, we propose a new method to perform clustering on spatial data. The application focus of this paper is the clustering of productivity levels in agricultural fields. The lossless data structure, the Peano Count Tree (P-tree) [4], is used in the model. P-trees represent spatial data bit by bit in a recursive quadrant-by-quadrant arrangement. With the information in P-trees, we can do the clustering much more easily and efficiently.

_____________________________________________
1 The P-tree and bSQ technologies discussed in this paper are patent pending by North Dakota State University.
2 This work was partially supported by GSA grant number ACT# K96130308.

The rest of the paper is organized as follows. In Section 2, we review the data formats of spatial data and describe the P-tree data structure (and its variants) and algebra. In Section 3, we review the current clustering algorithms based on partitioning. In Section 4, we present the HOBBit distance function and detail our clustering method to show the advantage of P-trees; this section also includes a simple example to demonstrate the basic idea. Section 5 concludes with a summary and some directions for future research.

2 DATA STRUCTURES

A spatial image can be viewed as a 2-dimensional array of pixels. Associated with each pixel are various descriptive attributes, called "bands", for example, visible reflectance bands (Blue, Green and Red), infrared reflectance bands (e.g., NIR, MIR1, MIR2 and TIR), and possibly some bands of data gathered from ground sensors (e.g., yield quantity, yield quality, and soil attributes such as moisture and nitrate levels). For simplicity, all values have been scaled to the range 0 to 255. The pixel coordinates in raster order constitute the key attribute. One can view such data as a table in relational form, where each pixel is a tuple and each band is an attribute.
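To make this relational view concrete, the following minimal sketch (our own illustration with a toy 2-by-2 image and hypothetical band names, not data from this paper) maps per-band rasters to one tuple per pixel, keyed by the pixel coordinates:

    import numpy as np

    # Toy 2x2 image with three hypothetical bands, values already scaled to 0..255.
    bands = {
        "Blue":  np.array([[10, 20], [30, 40]], dtype=np.uint8),
        "NIR":   np.array([[200, 210], [220, 230]], dtype=np.uint8),
        "Yield": np.array([[90, 95], [100, 105]], dtype=np.uint8),
    }

    # Relational view: one tuple per pixel, keyed by its (x, y) raster coordinates,
    # with one attribute per band.
    rows, cols = next(iter(bands.values())).shape
    relation = [
        {"x": x, "y": y, **{name: int(img[x, y]) for name, img in bands.items()}}
        for x in range(rows) for y in range(cols)
    ]
    print(relation[0])  # {'x': 0, 'y': 0, 'Blue': 10, 'NIR': 200, 'Yield': 90}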
There are several formats used for spatial data, such as Band Sequential (BSQ), Band Interleaved by Line (BIL) and Band Interleaved by Pixel (BIP). In our previous work [4, 11], we proposed a new format called bit Sequential Organization (bSQ). Since each intensity value ranges from 0 to 255, it can be represented as a byte; we split each band into eight separate files, one for each bit position, each called a bSQ file. Each bSQ file can be reorganized into a quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ file and its P-tree.

    11 11 11 00
    11 11 10 00
    11 11 11 00
    11 11 11 10
    11 11 11 11
    11 11 11 11
    11 11 11 11
    01 11 11 11

    level 3 (root):  55
    level 2:         16   8   15   16
    level 1:         under the 8:  3 0 4 1      under the 15:  4 4 3 4
    level 0:         under the 3:  1 1 1 0      under the 1:  0 0 1 0      under the 3 (of the 15):  1 1 0 1

Figure 1. An 8-by-8 bSQ file and its P-tree.

In this example, 55 is the count of 1's in the entire image (called the root count); the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits (called pure-1 quadrants), we do not need sub-trees for them. Similarly, quadrants made up entirely of 0-bits are called pure-0 quadrants. This pattern continues recursively. Recursive raster ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf level (level 0), where each quadrant is a 1-row-by-1-column quadrant. If we were to expand all sub-trees, including those for pure quadrants, the leaf sequence would be just the Peano space-filling curve of the original raster image.

For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit position. For band Bi, we label the basic P-trees Pi,1, Pi,2, ..., Pi,8; thus Pi,j is a lossless representation of the jth bits of the values from the ith band. Moreover, the Pi,j provide additional information and are structured to facilitate data mining processes. Some of the useful features of P-trees can be found later in this paper or in our earlier work [4, 11].

The basic P-trees defined above can be combined using simple logical operations (AND, OR and COMPLEMENT) to produce P-trees for the original values (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, ..., or 8-bit precision. For example, Pb,110 can be constructed from the basic P-trees as

    Pb,110 = Pb,1 AND Pb,2 AND Pb,3'

where ' indicates the bit-complement (which is simply the count complement in each quadrant). This is called a value P-tree. The AND operation is simply the pixel-wise AND of the bits.

The data in the relational format can also be represented as P-trees. For any combination of values (v1, v2, ..., vn), where vi is from band i, the quadrant-wise count of occurrences of this tuple of values is given by

    P(v1, v2, ..., vn) = P1,v1 AND P2,v2 AND ... AND Pn,vn

This is called a tuple P-tree. Finally, we note that the basic P-trees can be generated quickly and that this is only a one-time cost. The logical operations are also very fast [5]. This structure can therefore be viewed as a "data-mining-ready", lossless format for storing spatial data.
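The quadrant-count idea is easy to prototype. The sketch below is only our own illustration (not the authors' implementation): it recurses over a raw 2^k-by-2^k bit array to produce the nested quadrant counts of a P-tree, and ANDs two bit files to show how the counts of a value P-tree can be obtained. An actual implementation would perform the AND directly on the compressed trees rather than on raw bit arrays.

    import numpy as np

    def build_ptree(bits):
        """Quadrant 1-bit counts for a 2^k x 2^k 0/1 array, in Peano order.
        Returns (count, children); children is None for pure-0/pure-1 quadrants."""
        n = bits.shape[0]
        count = int(bits.sum())
        if count == 0 or count == n * n or n == 1:  # pure quadrant or single pixel
            return (count, None)
        h = n // 2
        quads = [bits[:h, :h], bits[:h, h:], bits[h:, :h], bits[h:, h:]]
        return (count, [build_ptree(q) for q in quads])

    # Two basic (bit-position) P-trees of the same band, from toy 4x4 bSQ files.
    b1 = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]])
    b2 = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 1]])

    print(build_ptree(b1))       # (12, [(4, None), (0, None), (4, None), (4, None)])
    # Counts of the value P-tree for "both leading bits equal 1", obtained here
    # by ANDing the raw bit files.
    print(build_ptree(b1 & b2))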
3 CLUSTERING ALGORITHMS BASED ON PARTITIONING

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster [7]. K-means, k-medoids and their variations are the most commonly used partitioning methods.

The k-means algorithm [9] first randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. The k-means method, however, can be applied only when the mean of a cluster is defined, and it is sensitive to noisy data and outliers, since a small number of such data can substantially influence the mean value. So we do not use this method in our clustering.

The basic strategy of k-medoids clustering algorithms is to find k clusters among n objects by first arbitrarily choosing a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is most similar. The strategy then iteratively replaces one of the medoids by one of the non-medoids as long as the quality of the resulting clustering is improved.

PAM [8] is one of the best-known k-medoids algorithms. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. All of the possible pairs of objects are analyzed, where one object in each pair is considered a medoid and the other is not. The algorithm proceeds in two steps:

BUILD-step: sequentially select k "centrally located" objects to be used as the initial medoids.
SWAP-step: if the objective function can be reduced by swapping a selected object with an unselected object, carry out the swap. This is continued until the objective function can no longer be decreased.

Experimental results show that PAM works satisfactorily for small data sets, but it is not efficient for medium and large data sets. Instead of finding representative objects for the entire data set, CLARA [8] draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. However, a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased. CLARANS [2] was therefore proposed; it does not confine itself to any one sample, but draws a new sample with some randomness in each step of the search.

4 OUR ALGORITHM

4.1 Clustering Similarity Functions

In our algorithm, the clustering similarity function is based on the closeness of feature attribute values. Various similarity functions have been proposed. For two data points, X = <x_1, x_2, x_3, ..., x_{n-1}> and Y = <y_1, y_2, y_3, ..., y_{n-1}>, the Euclidean similarity function is defined as

    d_2(X, Y) = \sqrt{\sum_{i=1}^{n-1} (x_i - y_i)^2}.

It can be generalized to the Minkowski similarity function,

    d_q(X, Y) = \left( \sum_{i=1}^{n-1} w_i \, |x_i - y_i|^q \right)^{1/q}.

If q = 2, this gives the Euclidean function. If q = 1, it gives the Manhattan distance,

    d_1(X, Y) = \sum_{i=1}^{n-1} |x_i - y_i|.

If q = \infty, it gives the max function,

    d_\infty(X, Y) = \max_{i=1}^{n-1} |x_i - y_i|.

We proposed a metric using P-trees, called HOBBit [10]. The HOBBit metric measures distance based on the most significant consecutive bit positions, starting from the left (the highest-order bit). Values that share the same high-order bits are close to each other at a certain precision; the precision depends on the number of high-order bits considered. We can also partition each dimension into several intervals using this higher-order-bits concept hierarchy.

The HOBBit similarity between two integers A and B is defined by

    S_H(A, B) = \max\{\, s \mid 0 \le i \le s \Rightarrow a_i = b_i \,\},

where a_i and b_i are the ith bits of A and B, respectively. The HOBBit distance between two tuples X and Y is defined by

    d_H(X, Y) = \max_{i=1}^{n} \{\, m - S_H(x_i, y_i) \,\},

where m is the number of bits in the binary representation of the values, n is the number of attributes used for measuring distance, and x_i and y_i are the ith attributes of tuples X and Y. The HOBBit distance between two tuples is thus a function of the least similar pairing of attribute values in them. In this paper, we define the HOBBit distance between two tuples X and Y to be the HOBBit distance between the tuple P-trees P(X) and P(Y). From our experiments, we found that the HOBBit distance is more natural for spatial data than other distance metrics.
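The bit-matching computation behind these definitions is straightforward; the following small sketch (our own reading of the formulas above, applied to plain integer values rather than to tuple P-trees) illustrates it:

    def hobbit_similarity(a, b, m=8):
        """Number of identical most-significant bits shared by the m-bit values a and b."""
        s = 0
        for i in range(m - 1, -1, -1):      # walk from the highest-order bit down
            if (a >> i) & 1 != (b >> i) & 1:
                break
            s += 1
        return s

    def hobbit_distance(x, y, m=8):
        """HOBBit distance between two tuples: governed by the least similar attribute pair."""
        return max(m - hobbit_similarity(xi, yi, m) for xi, yi in zip(x, y))

    # 0b11000000 and 0b11010000 share their three highest-order bits.
    print(hobbit_similarity(0b11000000, 0b11010000))            # 3
    print(hobbit_distance((0b11000000, 7), (0b11010000, 7)))    # max(8 - 3, 8 - 8) = 5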
4.2 Our Clustering Method

Before giving a formal description of the clustering algorithm, we first review the concept of dense units [1]. Let S = B1 × B2 × ... × Bd be a d-dimensional numerical space, where B1, B2, ..., Bd are the dimensions of S. We consider each pixel of the spatial image data as a d-dimensional point v = {v1, v2, ..., vd}. If we partition every dimension into several intervals, the data space S is partitioned into non-overlapping rectangular units. Each unit u is the intersection of one interval from each attribute; it has the form {u1, u2, ..., ud}, where ui = [li, hi) is a right-open interval in the partitioning of Bi. We say that a point v = {v1, v2, ..., vd} is contained in a unit u = {u1, u2, ..., ud} if li ≤ vi < hi for all ui. The selectivity of a unit is defined as the fraction of the total data points contained in the unit. We call a unit u dense if selectivity(u) is greater than the density threshold r. A cluster is a maximal set of connected dense units.

Using the higher-order-bits concept hierarchy, we partition each dimension into several intervals. Each of these intervals can be represented as a P-tree; in this way, all the units can also be represented as P-trees by ANDing the corresponding interval P-trees. Furthermore, to deal with outliers, we simply disregard sparse values. If we have an estimate of the percentage of outliers, we can use the algorithm in Figure 2 to prune them out. We then do the clustering based on the remaining dense units.

    Input: total number of objects (N), all tuple P-trees, estimated percentage of outliers (t)
    Output: tuple P-trees after pruning
    (1) Choose the tuple P-tree with the smallest root count (Pv)
    (2) outliers := outliers + RootCount(Pv)
    (3) If (outliers / N <= t), then remove Pv and repeat (1)-(2)

Figure 2. Algorithm to prune out outliers.

Finally, we use the PAM method to partition the P-trees representing these dense units into k clusters. We consider each tuple P-tree as an object and use the HOBBit similarity function between tuple P-trees to calculate the dissimilarity matrix.
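As a minimal sketch of the pruning step in Figure 2 (our own illustration; each tuple P-tree is represented here only by its root count), the loop below drops the smallest groups while the cumulative fraction of dropped points stays within the outlier estimate t:

    def prune_outliers(root_counts, n_total, t):
        """Repeatedly drop the tuple P-tree with the smallest root count while the
        cumulative fraction of dropped points stays at or below t."""
        remaining = sorted(root_counts)
        removed = 0
        while remaining and (removed + remaining[0]) / n_total <= t:
            removed += remaining.pop(0)
        return remaining

    # 16 pixels grouped into 9 tuple P-trees (root counts as in the worked example
    # that follows); with t = 25% the four singleton groups are pruned away.
    counts = [3, 1, 1, 1, 1, 2, 2, 2, 3]
    print(prune_outliers(counts, n_total=16, t=0.25))    # [2, 2, 2, 3, 3]

The surviving groups are the dense units; their pairwise HOBBit distances form the dissimilarity matrix that is handed to PAM.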
4.3 EXAMPLE

The following relation contains 4 bands of 4-bit data values (expressed in binary). The BSQ format would consist of the 4 projections of this relation: R[YIELD], R[Blue], R[Green], R[Red]. To make the example clear, we generate dense units using all the bits instead of only some of the higher-order bits.

    FIELD COORDS   CLASS LABEL   REMOTELY SENSED REFLECTANCES
    (X, Y)         B1 (YIELD)    B2 (Blue)   B3 (Green)   B4 (Red)
    0,0            0011          0111        1000         1011
    0,1            0011          0011        1000         1111
    0,2            0111          0011        0100         1011
    0,3            0111          0010        0101         1011
    1,0            0011          0111        1000         1011
    1,1            0011          0011        1000         1011
    1,2            0111          0011        0100         1011
    1,3            0111          0010        0101         1011
    2,0            0010          1011        1000         1111
    2,1            0010          1011        1000         1111
    2,2            1010          1010        0100         1011
    2,3            1111          1010        0100         1011
    3,0            0010          1011        1000         1111
    3,1            1010          1011        1000         1111
    3,2            1111          1010        0100         1011
    3,3            1111          1010        0100         1011

Figure 3. Learning dataset.

This dataset is converted to bSQ format. We display the bSQ bit-band values in their spatial positions, rather than in 1-column files. The Band-1 bit-bands are:

    B11     B12     B13     B14
    0000    0011    1111    1111
    0000    0011    1111    1111
    0011    0001    1111    0001
    0111    0011    1111    0011

Thus, the Band-1 basic P-trees are as follows (tree pointers are omitted; each P-tree is listed as its root count, followed by the level-1 quadrant counts and, for any non-pure level-1 quadrant, its level-0 leaf bits):

    P1,1:   5    0 0 1 4    0001
    P1,2:   7    0 4 0 3    0111
    P1,3:   16
    P1,4:   11   4 4 0 3    0111

We can use the AND and COMPLEMENT operations to calculate all the value P-trees of Band 1 (e.g., P1,0011 = P1,1' AND P1,2' AND P1,3 AND P1,4). The non-zero value P-trees of Band 1 are:

    P1,0010:   3    0 0 3 0    1110
    P1,0011:   4    4 0 0 0
    P1,0111:   4    0 4 0 0
    P1,1010:   2    0 0 1 1    0001, 1000
    P1,1111:   3    0 0 0 3    0111

All other value P-trees of Band 1 (P1,0000, P1,0001, P1,0100, P1,0101, P1,0110, P1,1000, P1,1001, P1,1011, P1,1100, P1,1101 and P1,1110) are zero. The basic P-trees and value P-trees for B2, B3 and B4 are generated similarly. From the value P-trees, we can generate all the tuple P-trees. The non-zero tuple P-trees are:

    P-0010,1011,1000,1111:   3    0 0 3 0    1110
    P-1010,1010,0100,1011:   1    0 0 0 1    1000
    P-1010,1011,1000,1111:   1    0 0 1 0    0001
    P-0011,0011,1000,1011:   1    1 0 0 0    0001
    P-0011,0011,1000,1111:   1    1 0 0 0    0100
    P-0011,0111,1000,1011:   2    2 0 0 0    1010
    P-0111,0010,0101,1011:   2    0 2 0 0    0101
    P-0111,0011,0100,1011:   2    0 2 0 0    1010
    P-1111,1010,0100,1011:   3    0 0 0 3    0111

If the noise and outliers are estimated at about 25%, the four tuple P-trees with root count 1 are pruned, leaving the following dense units:

    1) P-0010,1011,1000,1111:   3    0 0 3 0    1110
    2) P-0011,0111,1000,1011:   2    2 0 0 0    1010
    3) P-0111,0010,0101,1011:   2    0 2 0 0    0101
    4) P-0111,0011,0100,1011:   2    0 2 0 0    1010
    5) P-1111,1010,0100,1011:   3    0 0 0 3    0111

Now, the dissimilarity matrix over these dense units can be calculated, as shown in Figure 4.

         1    2    3    4    5
    1    0
    2    4    0
    3    4    4    0
    4    4    4    3    0
    5    4    4    4    4    0

Figure 4. Dissimilarity matrix.

Finally, we use the PAM method to partition these tuple P-trees into k clusters (let k = 4):

    cluster 1:   P-0010,1011,1000,1111
    cluster 2:   P-0011,0111,1000,1011
    cluster 3:   P-0111,0010,0101,1011   P-0111,0011,0100,1011
    cluster 4:   P-1111,1010,0100,1011

5 CONCLUSION

In this paper, we propose a new approach to cluster analysis that is especially useful for clustering spatial data. We use the bit Sequential data organization (bSQ) and a lossless, data-mining-ready data structure, the Peano Count Tree (P-tree), to represent the information needed for clustering in an efficient and ready-to-use form. The rich and efficient P-tree storage structure and the fast P-tree algebra facilitate the development of a very fast clustering method.

We also discussed clustering methods based on partitioning. PAM is not efficient for medium and large data sets; CLARA and CLARANS first draw samples from the original data and then apply the PAM method. In our algorithm, we do not draw samples; instead, we group the data first (each tuple P-tree can be viewed as a group of data) and then apply the PAM method to these groups. The number of these P-trees is much smaller than the size of the original data set, and even smaller than what CLARA or CLARANS needs to deal with. We also present a data pruning method that handles outliers effectively.

6 REFERENCES
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", Proc. ACM SIGMOD, Washington, 1998.
[2] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining", Proc. VLDB, pp. 144-155, Santiago, Chile, 1994.
[3] M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996.
[4] William Perrizo, Qin Ding, Qiang Ding, and Amlendu Roy, "Deriving High Confidence Rules from Spatial Data using Peano Count Trees", Springer-Verlag, LNCS 2118, 2001.
[5] William Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. KDD, pp. 226-231, Portland, Oregon, 1996.
[7] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[8] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990.
[9] J. MacQueen, "Some methods for classification and analysis of multivariate observations", in L. Le Cam and J. Neyman, editors, Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[10] Maleq Khan, Qin Ding, and William Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD 2002, Springer-Verlag, LNAI 2336, pp. 517-528, 2002.
[11] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree algebra", Proc. ACM Symposium on Applied Computing (SAC 2002), pp. 426-431, Madrid, Spain, 2002.