Course: M0614 / Data Mining & OLAP
Term: Feb 2010

Cluster Analysis (cont.)
Session 12
Bina Nusantara

Learning Outcomes
At the end of this session, students are expected to be able to:
• Apply cluster analysis techniques (partitioning, hierarchical, and model-based clustering) in data mining. (C3)

Acknowledgments
These slides have been adapted from Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., & Kumar, V., Introduction to Data Mining.

Outline
• A categorization of major clustering methods: hierarchical methods
• A categorization of major clustering methods: model-based clustering methods
• Summary

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
(Figure: six nested clusters and the corresponding dendrogram recording the merge sequence.)

Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time

Hierarchical Clustering
• Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
(Figure: objects a, b, c, d, e; AGNES merges them step by step into ab, de, cde, and finally abcde, while DIANA performs the same steps in reverse order.)

AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Uses the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
(Figure: three scatter plots showing clusters being merged step by step by AGNES.)

Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
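The basic algorithm above is available in standard libraries. The following is a minimal sketch using SciPy's hierarchical clustering routines; the toy data, the linkage method, and the number of clusters are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.random((10, 2))            # toy 2-D points (illustrative, not the points from the slides)

# Steps 1-2: each point starts as its own cluster; the proximity matrix is computed internally.
# Steps 3-6: linkage() repeatedly merges the two closest clusters until one cluster remains
# and records every merge, which is exactly the information a dendrogram displays.
Z = linkage(X, method='single')    # 'single' = MIN, 'complete' = MAX, 'average' = group average

# 'Cutting' the dendrogram at the proper level yields any desired number of clusters, e.g. k = 3.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (requires matplotlib).
```

Switching the method argument selects one of the inter-cluster proximity definitions discussed in the following slides.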
Starting Situation
• Start with clusters of individual points and a proximity matrix
(Figure: points p1–p12 and their proximity matrix.)

Intermediate Situation
• After some merging steps, we have some clusters
(Figure: clusters C1–C5 and the corresponding proximity matrix.)

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

After Merging
• The question is "How do we update the proximity matrix?"
(Figure: after C2 and C5 are merged, the row and column for C2 ∪ C5 in the proximity matrix must be recomputed, shown as "?" entries.)

How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error

Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – Determined by one pair of points, i.e., by one link in the proximity graph

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MIN
(Figure: nested clusters and the dendrogram produced by single link on six points.)

Strength of MIN
• Can handle non-elliptical shapes
(Figure: original points and the two clusters found.)

Limitations of MIN
• Sensitive to noise and outliers
(Figure: original points and the two clusters found.)

Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – Determined by all pairs of points in the two clusters
(Same similarity matrix as above.)

Hierarchical Clustering: MAX
(Figure: nested clusters and the dendrogram produced by complete link on the same six points.)

Strength of MAX
• Less susceptible to noise and outliers
(Figure: original points and the two clusters found.)

Limitations of MAX
• Tends to break large clusters
• Biased towards globular clusters
(Figure: original points and the two clusters found.)
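The MIN, MAX, and group-average definitions differ only in how the block of pairwise distances between two clusters is reduced to a single number. Below is a minimal numpy sketch; the distance matrix and the two cluster memberships are illustrative assumptions, not the similarity matrix from the slides.

```python
import numpy as np

# Toy symmetric distance matrix for five points p1..p5 and two clusters
# {p1, p2} and {p3, p4, p5}; the values are illustrative, not taken from the slides.
D = np.array([[0.0, 0.2, 0.9, 0.8, 0.7],
              [0.2, 0.0, 0.6, 0.9, 0.8],
              [0.9, 0.6, 0.0, 0.3, 0.4],
              [0.8, 0.9, 0.3, 0.0, 0.2],
              [0.7, 0.8, 0.4, 0.2, 0.0]])
c1, c2 = [0, 1], [2, 3, 4]

block = D[np.ix_(c1, c2)]   # all pairwise distances between the two clusters

d_min = block.min()         # MIN / single link: distance of the closest pair
d_max = block.max()         # MAX / complete link: distance of the farthest pair
d_avg = block.mean()        # group average: mean over all pairs

print(d_min, d_max, d_avg)  # 0.6, 0.9, 0.783...
```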
Cluster Similarity: Group Average
• Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

  proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i} \sum_{p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \cdot |Cluster_j|}

• Need to use average connectivity for scalability, since total proximity favors large clusters
(Same similarity matrix as above.)

Hierarchical Clustering: Group Average
(Figure: nested clusters and the dendrogram produced by group average on the same six points.)

Hierarchical Clustering: Group Average
• Compromise between single and complete link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters

Hierarchical Clustering: Comparison
(Figure: the clusterings produced by MIN, MAX, and group average on the same six points.)

Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different sized clusters and convex shapes
  – Breaking large clusters

DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
(Figure: three scatter plots showing one cluster being split step by step by DIANA.)

MST: Divisive Hierarchical Clustering
• Build an MST (Minimum Spanning Tree)
  – Start with a tree that consists of any point
  – In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  – Add q to the tree and put an edge between p and q

MST: Divisive Hierarchical Clustering
• Use the MST for constructing a hierarchy of clusters

Extensions to Hierarchical Clustering
• Major weaknesses of agglomerative clustering methods
  – Do not scale well: time complexity of at least O(n²), where n is the total number of objects
  – Can never undo what was done previously
• Integration of hierarchical and distance-based clustering
  – BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of subclusters
  – ROCK (1999): clusters categorical data by neighbor and link analysis
  – CHAMELEON (1999): hierarchical clustering using dynamic modeling

Model-Based Clustering
• What is model-based clustering?
  – Attempt to optimize the fit between the given data and some mathematical model
  – Based on the assumption that data are generated by a mixture of underlying probability distributions
• Typical methods
  – Statistical approach
    • EM (Expectation Maximization), AutoClass
  – Machine learning approach
    • COBWEB, CLASSIT
  – Neural network approach
    • SOM (Self-Organizing Feature Map)

EM (Expectation Maximization)
• EM: a popular iterative refinement algorithm
• An extension to k-means
  – Assign each object to a cluster according to a weight (probability distribution)
  – New means are computed based on weighted measures
• General idea
  – Starts with an initial estimate of the parameter vector
  – Iteratively rescores the patterns against the mixture density produced by the parameter vector
  – The rescored patterns are used to update the parameter estimates
  – Patterns are considered to belong to the same cluster if their scores place them in the same mixture component
• The algorithm converges quickly, but may not reach the global optimum

The EM (Expectation Maximization) Algorithm
• Initially, randomly assign k cluster centers
• Iteratively refine the clusters based on two steps
  – Expectation step: assign each data point X_i to each cluster C_k with the probability P(C_k | X_i) computed from the current parameter estimates
  – Maximization step: estimation of the model parameters given these assignments
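A minimal numpy sketch of the two steps for a one-dimensional mixture of two Gaussians; the synthetic data, the number of components, and the iteration count are illustrative assumptions, not material from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative, not data from the slides).
x = np.concatenate([rng.normal(-2.0, 0.8, 150), rng.normal(3.0, 1.2, 100)])

k = 2
mu = np.array([x.min(), x.max()])   # initial estimate of the parameter vector
var = np.full(k, x.var())
pi = np.full(k, 1.0 / k)            # mixing weights

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: weight (posterior probability) of each point under each component
    resp = pi * gaussian(x[:, None], mu, var)       # shape (n, k)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, variances, and mixing weights from the weighted points
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

print("means:", mu, "variances:", var, "weights:", pi)
```

Because each point keeps a fractional weight in every component rather than a hard assignment, this is the "soft" counterpart of k-means described above.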
Conceptual Clustering
• Conceptual clustering
  – A form of clustering in machine learning
  – Produces a classification scheme for a set of unlabeled objects
  – Finds a characteristic description for each concept (class)
• COBWEB
  – A popular and simple method of incremental conceptual learning
  – Creates a hierarchical clustering in the form of a classification tree
  – Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method
(Figure: a classification tree.)

More on Conceptual Clustering
• Limitations of COBWEB
  – The assumption that the attributes are independent of each other is often too strong, because correlations may exist
  – Not suitable for clustering large database data: skewed tree and expensive probability distributions
• CLASSIT
  – An extension of COBWEB for incremental clustering of continuous data
  – Suffers from problems similar to COBWEB's
• AutoClass
  – Uses Bayesian statistical analysis to estimate the number of clusters
  – Popular in industry

Neural Network Approach
• Neural network approaches
  – Represent each cluster as an exemplar, acting as a "prototype" of the cluster
  – New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure
• Typical methods
  – SOM (Self-Organizing Feature Map)
  – Competitive learning
    • Involves a hierarchical architecture of several units (neurons)
    • Neurons compete in a "winner-takes-all" fashion for the object currently being presented

Self-Organizing Feature Map (SOM)
• SOMs are also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
• A SOM maps all the points in a high-dimensional source space into a 2- or 3-D target space such that the distance and proximity relationships (i.e., the topology) are preserved as much as possible
• Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
• Clustering is performed by having several units compete for the current object
  – The unit whose weight vector is closest to the current object wins
  – The winner and its neighbors learn by having their weights adjusted
• SOMs are believed to resemble processing that can occur in the brain
• Useful for visualizing high-dimensional data in 2- or 3-D space
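As a concrete illustration of the competitive learning just described, here is a small from-scratch SOM sketch in numpy. It is not the WEBSOM system or any particular library implementation; the map size, learning rate, and decay schedule are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # toy "high-dimensional" inputs (3-D here for brevity)

rows, cols, dim = 8, 8, X.shape[1]       # 2-D target map of 8 x 8 units
W = rng.random((rows, cols, dim))        # one weight vector (exemplar) per unit
grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'))

n_iter, lr0, sigma0 = 2000, 0.5, 3.0
for t in range(n_iter):
    x = X[rng.integers(len(X))]          # present one object at a time
    # Competition: the unit whose weight vector is closest to x wins.
    d = np.linalg.norm(W - x, axis=2)
    winner = np.unravel_index(d.argmin(), d.shape)
    # Cooperation: the winner and its map neighbors adjust their weights toward x,
    # with the learning rate and neighborhood radius decaying over time.
    lr = lr0 * np.exp(-t / n_iter)
    sigma = sigma0 * np.exp(-t / n_iter)
    dist2 = ((grid - np.array(winner)) ** 2).sum(axis=2)
    h = np.exp(-dist2 / (2 * sigma ** 2))[:, :, None]
    W += lr * h * (x - W)

# After training, mapping each input to its best-matching unit gives a 2-D view of the data.
bmu = [np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(), (rows, cols)) for x in X]
print(bmu[:5])
```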
Web Document Clustering Using SOM
• The result of SOM clustering of 12088 Web articles
• The picture on the right: drilling down on the keyword "mining"
• Based on the websom.hut.fi Web page

User-Guided Clustering
(Figure: a multi-relational database schema with relations Professor, Course, Open-course, Register, Student, Work-In, Group, Advise, Publish, and Publication; the Student relation is the target of clustering and a research-area attribute in the Group relation is the user hint.)
• A user usually has a goal in clustering, e.g., clustering students by research area
• The user specifies this clustering goal to CrossClus

Comparing with Classification
• The user-specified feature (in the form of an attribute) is used as a hint, not as class labels
  – The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3
  – Additional features need to be included in the cluster analysis
(Figure: the user hint covers only part of the data, while all tuples are used for clustering.)

Comparing with Semi-Supervised Clustering
• Semi-supervised clustering: the user provides a training set consisting of "similar" ("must-link") and "dissimilar" ("cannot-link") pairs of objects
• User-guided clustering: the user specifies an attribute as a hint, and more relevant features are found for clustering
(Figure: in user-guided clustering the hint is an attribute over all tuples; in semi-supervised clustering the hint is a set of pairwise constraints.)

Why Not Semi-Supervised Clustering?
• Much information (in multiple relations) is needed to judge whether two tuples are similar
• A user may not be able to provide a good training set
• It is much easier for a user to specify an attribute as a hint, such as a student's research area
(Figure: two tuples to be compared, Tom Smith / SC1211 / TA and Jane Chang / BI205 / RA, with the user hint pointing at one attribute.)

CrossClus: An Overview
• Measure similarity between features by how they group objects into clusters
• Use a heuristic method to search for pertinent features
  – Start from the user-specified feature and gradually expand the search range
• Use tuple ID propagation to create feature values
  – Features can be easily created during the expansion of the search range by propagating IDs
• Explore three clustering algorithms: k-means, k-medoids, and hierarchical clustering

Multi-Relational Features
• A multi-relational feature is defined by:
  – A join path, e.g., Student → Register → OpenCourse → Course
  – An attribute, e.g., Course.area
  – (For a numerical feature) an aggregation operator, e.g., sum or average
• Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]: the areas of the courses taken by each student

Areas of courses per tuple:
  Tuple   DB   AI   TH
  t1       5    5    0
  t2       0    3    7
  t3       1    5    4
  t4       5    0    5
  t5       3    3    4

Values of feature f:
  Tuple   DB   AI   TH
  t1      0.5  0.5  0
  t2      0    0.3  0.7
  t3      0.1  0.5  0.4
  t4      0.5  0    0.5
  t5      0.3  0.3  0.4
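The feature values in the table above follow from normalizing each student's course-area counts so that they sum to 1. A minimal numpy sketch reproducing the table; the normalization step is inferred from the example values, while CrossClus itself builds these values via tuple ID propagation.

```python
import numpy as np

# Course-area counts for the five example tuples from the table above
# (rows = students t1..t5, columns = DB, AI, TH).
counts = np.array([[5, 5, 0],
                   [0, 3, 7],
                   [1, 5, 4],
                   [5, 0, 5],
                   [3, 3, 4]], dtype=float)

# Each tuple's feature value is the fraction of its courses in each area,
# i.e. each row of counts normalized to sum to 1.
f = counts / counts.sum(axis=1, keepdims=True)
print(np.round(f, 1))   # reproduces the "values of feature f" table above
```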
Representing Features
• Similarity between tuples t1 and t2 w.r.t. a categorical feature f
  – Cosine similarity between the vectors f(t1) and f(t2):

  sim_f(t_1, t_2) = \frac{\sum_{k=1}^{L} f(t_1).p_k \cdot f(t_2).p_k}{\sqrt{\sum_{k=1}^{L} f(t_1).p_k^2} \cdot \sqrt{\sum_{k=1}^{L} f(t_2).p_k^2}}

• The most important information of a feature f is how f groups tuples into clusters
• f is represented by the similarities between every pair of tuples indicated by f (the similarity vector V^f)
• This can be considered as a vector of N × N dimensions
(Figure: the similarity vector V^f plotted over all tuple pairs; the horizontal axes are the tuple indices and the vertical axis is the similarity.)

Similarity Between Features

Values of feature f (course) and feature g (group):
           Feature f (course)      Feature g (group)
  Tuple    DB    AI    TH          Info sys  Cog sci  Theory
  t1       0.5   0.5   0           1         0        0
  t2       0     0.3   0.7         0         0        1
  t3       0.1   0.5   0.4         0         0.5      0.5
  t4       0.5   0     0.5         0.5       0        0.5
  t5       0.3   0.3   0.4         0.5       0.5      0

• Similarity between two features: the cosine similarity of the two vectors V^f and V^g

  sim(f, g) = \frac{V^f \cdot V^g}{|V^f| \cdot |V^g|}

Computing Feature Similarity
• Tuple similarities are hard to compute directly, but feature-value similarities are easy to compute:

  V^f \cdot V^g = \sum_{i=1}^{N} \sum_{j=1}^{N} sim_f(t_i, t_j) \cdot sim_g(t_i, t_j) = \sum_{k=1}^{l} \sum_{q=1}^{m} sim(f_k, g_q)^2

  where the similarity between feature values w.r.t. the tuples is sim(f_k, g_q) = \sum_{i=1}^{N} f(t_i).p_k \cdot g(t_i).p_q

• Compute the similarity between each pair of feature values by one scan over the data

Searching for Pertinent Features
• Different features convey different aspects of information
  – Academic performance: GPA, GRE score, number of papers
  – Research area: research group area, conferences of papers, advisor
  – Demographic info: permanent address, nationality
• Features conveying the same aspect of information usually cluster tuples in more similar ways
  – Research group areas vs. conferences of publications
• Given a user-specified feature
  – Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features
• Overall procedure
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually
(Figure: the same multi-relational schema as before; the search starts from the user hint, the research-area attribute of Group, and expands along join paths toward the target Student relation.)
• Tuple ID propagation is used to create multi-relational features
  – IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple

Summary
• Cluster analysis groups objects based on their similarity and has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
• There are still many open research issues in cluster analysis

Continued in Session 13: Applications and Trends in Data Mining