Clustering Methods

Introduction to Clustering
• Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, "unsupervised" refers to the problem of finding hidden structure within unlabeled data.
• Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group them.
• Suppose variables such as age, years of education, and annual purchase expenditures were considered along with the personal income variable. What are the naturally occurring groupings of customers? Clustering tries to answer this question.

Applications of Clustering
• Clustering is primarily an exploratory technique used to discover hidden structure in the data, possibly as a prelude to more focused analysis or decision processes. Specific applications of clustering include image processing, medicine, and customer segmentation.
• Successive frames of security video can be examined to identify any changes to the clusters. Newly identified clusters may indicate unauthorized access to a facility.
• Patient attributes such as age, height, weight, systolic and diastolic blood pressure, cholesterol level, and others can reveal naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering is also useful in biology for the classification of plants and animals, as well as in the field of human genetics.

K-Means Clustering
• Given a collection of objects, k-means identifies k clusters of objects based on each object's proximity to the centers of the k groups. Each center is determined as the arithmetic average (mean) of the cluster's n-dimensional attribute vectors.

Method Step by Step
• In this chapter, to find k clusters from a collection of M objects with n attributes, the two-dimensional case (n = 2) is examined, because it is much easier to visualize the k-means method in two dimensions.
• Because each object has two attributes, it is useful to think of each object as a point (x, y), where x and y denote the two attributes. For a given cluster of m points, the point that corresponds to the cluster's mean is called a centroid. In mathematics, a centroid is the point corresponding to the center of mass of an object.
• The k-means algorithm finds the k clusters in the three steps described below; a code sketch of the full procedure follows this section.

Selection of Initial Centroids
• Choose the value of k and the k initial guesses for the centroids. In this example, k = 3 and the initial centroids are indicated by the points shaded in red, green, and blue.

Assignment of Samples
• Compute the distance from each data point (x, y) to each centroid, and assign each point to the closest centroid. This association defines the first k clusters. In two dimensions, the distance d between any two points (x1, y1) and (x2, y2) in the Cartesian plane is typically the Euclidean distance:
  d = sqrt((x1 − x2)^2 + (y1 − y2)^2)

Re-computation of Centroids
• Compute the centroid, the center of mass, of each newly defined cluster. In two dimensions, the centroid (xc, yc) of the m points in a k-means cluster is calculated as
  xc = (1/m) Σ xi,   yc = (1/m) Σ yi   (sums taken over i = 1, …, m)
• The assignment and re-computation steps are repeated until the algorithm converges.
• Convergence is reached when the computed centroids no longer change, or when the centroids and the assigned points oscillate back and forth from one iteration to the next.
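The three steps above translate directly into code. Below is a minimal sketch of the k-means procedure in Python with NumPy; the function name kmeans, the convergence tolerance, and the random initialization scheme are illustrative assumptions, not part of the original text.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (M, n) array of M objects with n attributes."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroid guesses (here: k random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged when the centroids no longer change (within tolerance).
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```

The sketch works for any number of attributes n, including the two-dimensional case discussed above.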
The latter case can occur when one or more points are equidistant from the computed centroids.

k-Means Clustering Algorithm
[Figure: trajectories of the centroids as the k-means clustering procedure is applied to two-dimensional data.]

Fuzzy k-Means (c-Means) Clustering Algorithm
[Figures: the fuzzy k-means (c-means) variant, in which each point receives a degree of membership in every cluster rather than a hard assignment; a code sketch appears at the end of this section.]

Units of Measure
• From a computational perspective, the k-means algorithm depends on the units of measure chosen for a given attribute (for example, meters or centimeters for a patient's height). The algorithm may identify different clusters depending on the choice of units.
• When height is expressed in meters, age (in years) dominates the distance calculation between two patients, because its numeric values span a much larger range.
• A widely used rescaling approach is to divide each attribute by the attribute's standard deviation. The resulting attributes each have a standard deviation equal to 1 and are without units (see the rescaling sketch at the end of this section).

Example Using the Iris Dataset
• Consider the iris dataset, which is made up of three classes: Iris setosa, Iris versicolor, and Iris virginica. A code sketch reproducing this example appears at the end of this section.
[Figures: k-means clustering results on the iris data for k = 2, and for k = 3, k = 4, and k = 5 clusters.]

Choosing the Best k Value
• The Within Cluster Sum of Squares (WSS, or inertia) is the total of the squared distances between each sample and its corresponding centroid.
• The smaller the WSS value, the more coherent the clusters. If as many clusters are used as there are samples in the dataset, the WSS value is zero.
• So how can the optimal number of clusters be found using the inertia value? For this, the so-called elbow method can be used (a code sketch follows at the end of this section).
• The value of k can be chosen based on a reasonable guess or some predefined requirement. However, even then, it is good to know how much better or worse having k clusters versus (k − 1) or (k + 1) clusters is at explaining the structure of the data.
• A heuristic using the WSS metric can be examined to determine a reasonably optimal value of k. Remember that WSS is the sum of the squares of the distances between each data point and its closest centroid.
[Figure: WSS (inertia) plotted against the number of clusters k for the iris data.]
• As the plot shows, WSS is greatly reduced when k increases from one to two. Another substantial reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly linear for k > 3. Therefore, the k-means analysis is conducted with k = 3.
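As referenced above, the fuzzy k-means (c-means) variant replaces hard cluster assignments with membership degrees. The slides give no implementation details, so the following is a minimal sketch of the standard fuzzy c-means update rules, assuming the common fuzzifier value m = 2; the function name and stopping rule are illustrative choices.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, max_iters=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means sketch: returns centroids and an (M, c) membership matrix."""
    rng = np.random.default_rng(seed)
    # Initialize random memberships; each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iters):
        Um = U ** m
        # Each centroid is a membership-weighted mean of all points.
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Update memberships from the distances to each centroid:
        # closer centroids receive larger membership degrees.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                      # avoid division by zero
        U_new = 1.0 / (d ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:          # stop when memberships stabilize
            break
        U = U_new
    return centroids, U
```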
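The rescaling approach described under Units of Measure, dividing each attribute by its standard deviation, is a one-liner. A minimal sketch, assuming the data are held in a NumPy array (the values below are illustrative, not from the slides):

```python
import numpy as np

X = np.array([[1.75, 25.0],     # height in meters, age in years
              [1.62, 47.0],
              [1.80, 33.0]])

# Divide each attribute (column) by its standard deviation; the rescaled
# attributes are unitless and each has standard deviation 1.
X_rescaled = X / X.std(axis=0)
```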
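The iris example above can be reproduced with scikit-learn's KMeans. A minimal sketch, assuming scikit-learn is available (the n_init and random_state values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # 150 samples, 4 attributes

# Cluster into k = 3 groups, matching the three iris classes.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)             # the 3 centroids
print(km.labels_[:10])                 # cluster assignments of the first 10 samples
```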
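Finally, the elbow method for choosing k can be sketched by computing the WSS (exposed as inertia_ in scikit-learn) over a range of k values and looking for the bend in the curve; the range of k values below is an illustrative choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# WSS (inertia) for k = 1..10: the sum of squared distances of each
# sample to its closest centroid.
ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WSS (inertia)")
plt.show()                             # look for the "elbow", here around k = 3
```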