Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng, Jiawei Han
Presented by Pavan Podila - COSC 6341, Fall '04

Overview
- Spatial Data Mining
- Clustering techniques
- CLARANS
- Spatial- and Non-Spatial-dominant CLARANS
- Observations
- Summary

Spatial Data Mining
- Identifying interesting relationships and characteristics that may exist implicitly in spatial databases.
- Different from relational databases:
  - Spatial objects store both spatial and non-spatial attributes
  - Spatial queries (e.g. "All Walmart stores within 10 miles of UH")
  - Spatial joins, which work on spatial indexes (R-trees)
  - Huge sizes (terabytes)
- GIS is a classic example.

Partitioning Methods
- Given K, the number of partitions to create, a partitioning method constructs initial partitions and then iteratively refines them so as to maximize intra-cluster similarity and inter-cluster dissimilarity.
- [Quality of clustering]: average dissimilarity of objects from their cluster centers (medoids).
- Selected algorithms:
  1. K-medoids
  2. PAM
  3. CLARA
  4. CLARANS

K-Medoids
- Partition-based clustering (K partitions).
- Effective, why?
  - Resistant to outliers
  - Does not depend on the order in which data points are examined
  - The cluster center is an actual object of the dataset, unlike k-means, where the cluster center is gravity-based (the mean of the cluster)
  - Experiments show that large data sets are handled efficiently
[Figure: side-by-side scatter plots of the same data clustered by k-medoids and by k-means]

PAM (Partitioning Around Medoids)
- [Goal]: Find K representative objects of the data set. Each of the K objects is called a medoid, the most centrally located object within its cluster.

PAM (2)
- Start with K data points designated as medoids.
- Create a cluster around each medoid by assigning every data point to its closest medoid: Oj belongs to Oi if d(Oj, Oi) = min over all selected Oe of d(Oj, Oe).
- Iteratively replace a selected object Oi with a non-selected object Oh whenever the quality of the clustering improves. A swapping cost TCih = Σj Cjih is associated with replacing Oi by Oh, where Cjih is the contribution of object Oj to the swap.
[Figure: effect on an object Oj of swapping the medoid Oi for Oh]

PAM (3)
- O(k(n−k)^2) per iteration.
- Good for small data sets (n ≈ 100, k ≈ 5).
- Algorithm (flowchart):
  1. Select K representative objects arbitrarily.
  2. Compute TCih for all pairs (Oi, Oh) of a selected Oi and a non-selected Oh.
  3. Select the pair (Oi, Oh) with minimum TCih.
  4. If TCih < 0, replace Oi with Oh and go to step 2.
  5. Otherwise, assign every Oj to its most similar representative object and stop.

CLARA (Clustering LARge Applications)
- Improvement over PAM.
- Finds medoids in a sample drawn from the dataset.
- [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the whole dataset.
- [Heuristics]: 5 samples of size 40 + 2k give satisfactory results.
- Works well for large datasets (n ≈ 1000, k ≈ 10).

CLARANS (Clustering Large Applications based on RANdomized Search)
- A graph abstraction, Gn,k:
  - Each node is a set of k candidate medoids, e.g. S1 = {Om1, ..., Omk}
  - Two nodes S1 and S2 are neighbors if |S1 ∩ S2| = k − 1, i.e. they differ in exactly one object
  - Each node has k(n−k) neighbors
  - The cost of a node is the total dissimilarity of all objects to their medoids
- PAM searches the whole graph; CLARA searches the subgraph induced by its sample.
[Figure: node {Om1, ..., Omk} and four neighboring nodes {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk}]

CLARANS (2)
- Input: maxNeighbors, numLocal.
- Algorithm (flowchart):
  1. Set i = 1, minCost = ∞, bestNode = −1.
  2. Set current to a random node of Gn,k and set j = 1.
  3. Pick a random neighbor S of current.
  4. If TC(S) < TC(current), set current = S, reset j = 1, and go to step 3.
  5. Otherwise increment j; if j < maxNeighbors, go to step 3.
  6. If TC(current) < minCost, set minCost = TC(current) and bestNode = current.
  7. Increment i; if i > numLocal, output bestNode and stop; otherwise go to step 2.
- Experimental values: numLocal = 2, maxNeighbors = max(1.25% of k(n−k), 250).
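As an illustration of the search loop above, here is a minimal, self-contained Python sketch. It is not the paper's implementation: Euclidean distance stands in for the general dissimilarity measure, points are assumed to be coordinate tuples, and the names (total_cost, clarans) are illustrative.

```python
import random
from math import dist, inf

def total_cost(points, medoids):
    """Cost of a node of Gn,k: total dissimilarity, i.e. the sum of
    distances from every point to its closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=250):
    """Randomized search over Gn,k, following the steps listed above.
    Assumes len(points) > k and that points are coordinate tuples."""
    n = len(points)
    best_node, min_cost = None, inf
    for _ in range(num_local):                        # numLocal restarts
        current = random.sample(range(n), k)          # random node of Gn,k
        current_cost = total_cost(points, current)
        j = 1
        while j < max_neighbors:
            # A random neighbor differs from current in exactly one medoid.
            drop = random.randrange(k)
            add = random.choice([o for o in range(n) if o not in current])
            neighbor = current[:drop] + [add] + current[drop + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:          # downhill move: accept, reset j
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:
                j += 1
        if current_cost < min_cost:                   # keep the best local minimum
            min_cost, best_node = current_cost, current
    return best_node, min_cost

# Example: medoid indices and cost for two clusters of four 2-D points.
# medoids, cost = clarans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
```

With the slide's experimental settings, max_neighbors would be max(0.0125 * k * (n − k), 250) rather than a fixed constant.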
CLARANS (3)
- Outperforms PAM and CLARA in terms of both running time and quality of clustering.
- O(n^2) per iteration.
[Charts: running time and clustering quality, CLARANS vs CLARA and CLARANS vs PAM]

Generalization
- Useful for mining non-spatial attributes.
- The process of merging tuples based on a concept hierarchy.
- DBLearn: takes an SQL query, a generalization hierarchy, and a threshold.
[Figure: concept hierarchies for the relation Sphere(color, diameter) - color: {red, orange} → reddish, {yellow} → yellowish, {green, blue, indigo, violet} → bluish; diameter: 1...20 → small, 21...40 → large; the initial relation is merged into a generalized relation]

Silhouette
- The silhouette of an object Oj measures how strongly Oj belongs to its cluster.
- Ranges between −1 and 1; a value near 1 indicates a high degree of membership.
- Silhouette width of a cluster: the average silhouette of all objects in the cluster.
- Silhouette coefficient: the average of the silhouette widths of the k clusters.
- Interpretation of the silhouette width (a computational sketch appears after the references):
    0.71 – 1.00    Strong cluster
    0.51 – 0.70    Reasonable cluster
    0.26 – 0.50    Weak or artificial cluster
    ≤ 0.25         No cluster found

SD and NSD approach
- SD = spatial dominant, NSD = non-spatial dominant.
- Clustering is used for the spatial attributes; generalization for the non-spatial attributes.
- Dominance is decided by what is carried out first (clustering or generalization).
- The second phase works on the tuples produced by the first.

SD(CLARANS)
1. Specify the learning request as an SQL query over the data.
2. Run CLARANS on the spatial attributes to find Knat clusters; the value of Knat is determined through heuristics using the silhouette coefficient.
3. For every cluster, collect the non-spatial components of its tuples.
4. Apply DBLearn to each collection.
- Finds non-spatial generalizations from spatial clustering.
- The clustering phase can be treated as finding a spatial generalization hierarchy.

NSD(CLARANS)
1. Specify the learning request as an SQL query over the data.
2. Apply DBLearn to the non-spatial attributes, producing generalized tuples.
3. For every generalized tuple, collect the spatial components and run CLARANS to find Knat clusters.
4. Check whether any clusters overlap; merge those that do.
- Finds spatial clusters from non-spatial generalizations.
- Clusters may overlap, hence the merge step.

Observations
- In all of the previous methods, the quality of mining depends on the SQL query.
- CLARANS assumes that the entire dataset fits in memory, which is not always the case for large data sets.
- The quality of the results cannot be guaranteed when N is very large, because of the randomized search.

Observations (2)
- Other clustering algorithms proposed for spatial data mining:
  - Hierarchical: BIRCH
  - Density based: DBSCAN, GDBSCAN, DBRS
  - Grid based: STING

Summary
- A seminal paper on the use of clustering for spatial data mining.
- CLARANS is an effective clustering technique for large datasets.
- SD(CLARANS) and NSD(CLARANS) are effective spatial data mining algorithms.

References
Primary:
- Raymond T. Ng, Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining (1994).
Secondary:
- Raymond T. Ng, Jiawei Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining.
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. Clustering for Mining in Large Spatial Databases.
- Ralf Hartmut Güting. An Introduction to Spatial Database Systems.
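As a companion to the Silhouette slide, the following is a minimal Python sketch of the silhouette, silhouette width, and silhouette coefficient. Assumptions not taken from the slides: objects are coordinate tuples, Euclidean distance replaces the general dissimilarity measure, clusters are plain lists, and the coefficient averages the k cluster widths as defined on that slide.

```python
from math import dist

def silhouette(obj, own_cluster, other_clusters):
    """s(obj) in [-1, 1]: compares the average distance to obj's own
    cluster (a) with the average distance to the closest other cluster (b).
    Assumes at least two clusters."""
    a = sum(dist(obj, o) for o in own_cluster if o != obj) / max(len(own_cluster) - 1, 1)
    b = min(sum(dist(obj, o) for o in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def silhouette_coefficient(clusters):
    """Average of the per-cluster silhouette widths (the slide's definition)."""
    widths = []
    for i, cluster in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        widths.append(sum(silhouette(o, cluster, others) for o in cluster) / len(cluster))
    return sum(widths) / len(widths)
```

One simple way to apply this when choosing Knat in SD(CLARANS)/NSD(CLARANS), offered here as an assumption rather than the paper's exact heuristic, is to cluster for several candidate values of k and keep the one with the highest coefficient, discarding candidates whose cluster widths fall in the "weak" or "no cluster" bands of the interpretation table.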