DATA MINING LECTURE 6
Clustering: K-Means, K-Median

CLUSTERING

Clustering
• Clustering is the process of grouping abstract objects into classes of similar objects.
• It is an unsupervised machine-learning technique.
• Each group is treated as a cluster of data objects.
• In cluster analysis, the first step is to partition the data into groups based on data similarity; the groups are then assigned their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to changes and helps single out the useful features that differentiate different groups.

What is a Clustering?
• In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis
• Clustering is widely used in applications such as image processing, data analysis, and pattern recognition.
• It helps marketers find distinct groups in their customer base and characterize those groups by their purchasing patterns.
• In biology, it can be used to derive animal and plant taxonomies and to identify genes with similar functionality.
• It also helps in information discovery, for example by grouping documents on the web.
• Early application of cluster analysis: John Snow, London 1854.

Notion of a Cluster Can Be Ambiguous
• How many clusters? The same set of points can plausibly be seen as two, four, or six clusters.

Types of Clusterings
• A clustering is a set of clusters.
• An important distinction is between hierarchical and partitional sets of clusters.
• Partitional clustering: a division of the data objects into subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

[Figure: original points and one possible partitional clustering.]
[Figure: traditional and non-traditional hierarchical clusterings of points p1-p4, each shown with its dendrogram.]

Other Types of Clustering
• Exclusive (non-overlapping) versus non-exclusive (overlapping):
  • In non-exclusive clusterings, points may belong to multiple clusters, e.g., "border" points that belong to multiple classes.
• Fuzzy (soft) versus non-fuzzy (hard):
  • In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  • Weights usually must sum to 1 (often interpreted as probabilities).
• Partial versus complete:
  • In some cases, we only want to cluster some of the data.

Types of Clusters: Well-Separated
• A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters.]

Types of Clusters: Center-Based
• A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster.
• The center of a cluster is often a centroid, the minimizer of the distances from all the points in the cluster, or a medoid, the most "representative" point of the cluster.
• A center-based cluster groups data points by their proximity to a central point, often called the cluster center or centroid; the center represents the mean or median of the data points within the cluster.
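To make the centroid/medoid distinction concrete, here is a minimal sketch in plain NumPy. The toy points and all names are illustrative choices, not data from the lecture.

```python
import numpy as np

# Toy cluster; the fourth point is an outlier.
points = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [10.0, 10.0]])

# Centroid: the coordinate-wise mean; it need not be an actual data point.
centroid = points.mean(axis=0)

# Medoid: the data point whose total Euclidean distance to all other
# points in the cluster is smallest.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the outlier
print("medoid:  ", medoid)    # remains one of the original points
```

Note how the outlier drags the centroid away from the bulk of the cluster while the medoid stays put; this robustness is one reason medoid- and median-based methods (such as the K-Median assignment at the end) are of interest.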
[Figure: 4 center-based clusters.]

Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• DBSCAN

The main types of clustering:
• Hierarchical clustering, further subdivided into agglomerative clustering and divisive clustering.
• Partitioning clustering, further subdivided into K-means clustering and fuzzy C-means clustering.

K-MEANS

K-means Clustering
• An unsupervised learning algorithm; a partitional clustering approach.
• It groups an unlabeled dataset into different clusters.
• K defines the number of predefined clusters to be created: if K = 2 there will be two clusters, if K = 3 three clusters, and so on.
• It divides the unlabeled dataset into K clusters such that each data point belongs to only one group, and points in the same group have similar properties.
• Each cluster is associated with a centroid (center point).
• Each point is assigned to the cluster with the closest centroid.

K-means Clustering
• Problem: Given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, ..., C_K} such that

  Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} dist(x, c_i)

is minimized, where c_i is the centroid of the points in cluster C_i.

K-means Clustering
• The most common definition uses Euclidean distance, minimizing the Sum of Squared Errors (SSE); K-means is sometimes defined as exactly this problem.
• Problem: Given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, ..., C_K} such that

  Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2

is minimized, where c_i is the mean of the points in cluster C_i. This objective is the Sum of Squared Errors (SSE).

K-means Algorithm
• "K-means" is sometimes used synonymously with the following algorithm.
• The K-means clustering algorithm mainly performs two tasks:
  • It determines the best value for the K center points (centroids) by an iterative process.
  • It assigns each data point to its closest center; the data points near a particular center form a cluster.

How Does the K-Means Algorithm Work?
• Step 1: Select the number K of clusters.
• Step 2: Select K random points as initial centroids (they need not come from the input dataset).
• Step 3: Assign each data point to its closest centroid, forming the K clusters.
• Step 4: Recompute the centroid of each cluster from its newly assigned points.
• Step 5: Repeat Step 3, i.e., reassign each data point to its new closest centroid.
• Step 6: If any reassignment occurred, go back to Step 4; otherwise FINISH.
• Step 7: The model is ready.
(A minimal code sketch of these steps appears after the initialization discussion below.)

K-means Algorithm – Initialization
• Initial centroids are often chosen randomly.
• The clusters produced can vary from one run to another.
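The steps above translate almost line for line into code. Below is a minimal from-scratch sketch in plain NumPy; the function name kmeans, the stopping rule, and the random initialization are illustrative choices, not a reference implementation (empty clusters, for instance, are not handled).

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means: returns cluster labels, centroids, and the SSE."""
    rng = np.random.default_rng(seed)
    # Step 2: choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids (hence assignments) stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # the SSE objective above
    return labels, centroids, sse
```

Because the result depends on the random initialization (see the next slides), a common pattern is to call this several times with different seeds and keep the run with the smallest SSE.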
Two Different K-means Clusterings
[Figure: the same original points clustered by two different runs of K-means; one run finds the optimal clustering, the other a sub-optimal one.]

Importance of Choosing Initial Centroids
[Figure: a good initialization; the centroids converge to the natural clusters over iterations 1-6.]
[Figure: a poor initialization; the centroids converge to a sub-optimal clustering over iterations 1-5.]

Dealing with Initialization
• Do multiple runs and select the clustering with the smallest error.
• Select the original set of points by methods other than random, e.g., pick mutually distant points as the initial cluster centers (the idea behind the K-means++ algorithm).

K-means Algorithm – Centroids
• The centroid depends on the distance function: it is the minimizer of the cluster cost under that distance.
• "Closeness" can be measured by Euclidean distance (SSE), cosine similarity, correlation, etc.
• Centroid: the mean of the points in the cluster for SSE; the median for Manhattan distance (see the numerical sketch at the end of this section).

Limitations of K-means
• K-means has problems when clusters have different:
  • sizes
  • densities
  • non-globular shapes
• K-means also has problems when the data contains outliers.

Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters).]

Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters).]

Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters).]

Overcoming K-means Limitations
[Figures: original points vs. the K-means clusters found when using many clusters.]
• One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put back together afterwards.

Example
• Visualization: the clustering can be visualized as two distinct groups on a 2D graph, where the x-axis represents Study Hours and the y-axis represents Test Scores.

Assignment
• K-Median (self-study).
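As a starting point for the self-study, here is a minimal numerical sketch (plain NumPy, illustrative values) of the fact quoted in the Centroids slide: the mean minimizes the sum of squared distances, while the median minimizes the sum of Manhattan (absolute) distances, which is exactly why a K-median variant recomputes centers as medians.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 1-D "cluster" with an outlier

# Evaluate both costs over a dense grid of candidate centers.
c = np.linspace(0.0, 100.0, 100001)
sse = ((x[:, None] - c[None, :]) ** 2).sum(axis=0)  # squared Euclidean cost
l1 = np.abs(x[:, None] - c[None, :]).sum(axis=0)    # Manhattan cost

print("SSE minimizer ~", c[sse.argmin()], " mean   =", x.mean())      # 22.0
print("L1  minimizer ~", c[l1.argmin()], " median =", np.median(x))   # 3.0
```

Swapping the mean update in the earlier K-means sketch for a coordinate-wise median (np.median(..., axis=0)) and the squared-error cost for an absolute-error cost yields a basic K-median variant, which is noticeably more robust to outliers.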