Definition
Clustering is the task of finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications
• Group related documents for browsing
• Group genes and proteins that have similar functionality
• Group stocks with similar price fluctuations
• Reduce the size of large data sets
• Group users with similar buying mentalities

Clustering is ambiguous
There is no single correct or incorrect solution for a clustering problem. How many clusters? The same set of points can plausibly be divided into two, four, or six clusters.

Challenges faced
• Scalability
• Ability to deal with different types of attributes
• Noise and outliers
• Complex shapes and types of data
• Incremental clustering and insensitivity to the order of input records
• High dimensionality
• Constraint-based clustering
• Interpretability and usability

Types of Data Structures
Data Matrix: n objects with p variables. The structure is in the form of a relational table, or an n x p matrix.
Dissimilarity Matrix: an object-by-object structure that stores the proximities available for all pairs of the n objects. d(i, j) is the dissimilarity between objects i and j, with d(i, j) = d(j, i) and d(i, i) = 0.

Types of Data
• Interval-scaled variables
• Binary variables
• Nominal (categorical) variables
• Ordinal variables
• Ratio-scaled variables
• Variables of mixed types

Interval-Scaled Variables
Continuous measurements on a roughly linear scale, e.g., weight, height, and temperature.

Binary Variables
A binary variable has only two states, 0 and 1. The dissimilarity between two objects described by binary variables is computed from a 2 x 2 contingency table:

                 Object j
                  1     0    sum
  Object i  1     q     r    q+r
            0     s     t    s+t
          sum    q+s   r+t    p

Here q is the number of variables that are 1 for both objects, r the number that are 1 for i but 0 for j, s the number that are 0 for i but 1 for j, and t the number that are 0 for both; p = q + r + s + t.

Dissimilarity between binary variables

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      Y      N       N       N       N

Treating Y and P as 1 and N as 0, ignoring the symmetric attribute Gender, and using the asymmetric binary (Jaccard) dissimilarity d(i, j) = (r + s) / (q + r + s):
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
(A code sketch of this computation appears after the k-means algorithm below.)

Categorical Variables
A generalization of the binary variable: a categorical (nominal) variable can take more than two states, e.g., a color variable with the states red, green, and blue.

Other types of data
Ordinal: similar to nominal variables, but the values are ordered in a meaningful sequence, e.g., the rank of an employee can be assistant, associate, or full.
Ratio-scaled: a positive measurement on a non-linear scale, e.g., the growth of bacteria or radioactive decay.
Variables of mixed types: a data set may contain variables of all of the types above, whose individual dissimilarities are combined into a single measure.

Types of clustering
• Hierarchical clustering (BIRCH): a set of nested clusters organized as a hierarchical tree.
• Partitional clustering (k-means, k-medoids): a division of the data objects into non-overlapping (distinct) subsets, i.e., clusters, such that each data object is in exactly one subset.
• Density-based (DBSCAN): based on density functions.
• Grid-based (STING): based on a multiple-level granularity structure.
• Model-based (SOM): hypothesize a model for each of the clusters and find the best fit of the data to the given model.

[Figures in the original slides: a set of original points with one possible partitional clustering of them, and traditional vs. non-traditional hierarchical clusterings of points p1-p4 with their corresponding dendrograms.]

Clustering Algorithms
• Partitional: k-means, k-medoids
• Hierarchical: agglomerative, divisive

K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects and the number of clusters k
Output: a set of k clusters
Algorithm:
1. Randomly select k samples and mark them as the initial cluster means.
2. Repeat:
   - Assign/reassign each sample to the cluster whose mean it is most similar to.
   - Update each cluster's mean.
   Until there is no change.
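Referring back to the Jack/Mary/Jim example above: a minimal Python sketch (not part of the original slides) of the asymmetric binary (Jaccard) dissimilarity, assuming the Y/P-as-1, N-as-0 encoding described there:

```python
def jaccard_dissimilarity(a, b):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s).

    q: positions where both vectors are 1; r: a = 1, b = 0;
    s: a = 0, b = 1. Matches on 0 (the count t) are ignored,
    which is what makes the measure asymmetric."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4, with Y/P -> 1 and N -> 0 (Gender ignored).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(mary, jim), 2))   # 0.75
```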
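And a minimal sketch of the k-means pseudocode itself on one-dimensional data; the data set and starting partition anticipate Example 1 below, and the empty-cluster-mean-of-0 convention follows that worked example (real implementations usually re-seed or drop empty clusters instead):

```python
def kmeans_1d(points, k, clusters):
    """1-D k-means: given an initial partition of the points into k
    clusters, alternately compute each cluster's mean and reassign
    every point to the nearest mean, until nothing changes."""
    while True:
        # Mean of each cluster; an empty cluster's mean is taken as 0,
        # matching the convention used in Example 1 below.
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
        new_clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            new_clusters[nearest].append(p)
        if new_clusters == clusters:   # assignment stable -> converged
            return clusters, means
        clusters = new_clusters

data = [2, 3, 6, 8, 9, 12, 15, 18, 22]
start = [[2, 8, 15], [3, 9, 18], [6, 12, 22]]   # random initial partition
clusters, means = kmeans_1d(data, 3, start)
print(clusters)                      # [[6, 8, 9, 12], [2, 3], [15, 18, 22]]
print([round(m, 2) for m in means])  # [8.75, 2.5, 18.33]
```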
K-Means (Array)
Step 1: Randomly assign the objects to k clusters.
Step 2: Find the mean of each cluster.
Step 3: Reassign each object to the cluster with the closest mean.
Step 4: Go to Step 2; repeat until there is no change.

Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22} and k = 3.
Solution: randomly partition the given data set:
K1 = {2, 8, 15}    mean = 8.33
K2 = {3, 9, 18}    mean = 10
K3 = {6, 12, 22}   mean = 13.33

Reassign each value to the cluster with the closest mean:
K1 = {2, 3, 6, 8, 9}     mean = 5.6
K2 = {}                  mean = 0
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign:
K1 = {3, 6, 8, 9}        mean = 6.5
K2 = {2}                 mean = 2
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign:
K1 = {6, 8, 9}           mean = 7.67
K2 = {2, 3}              mean = 2.5
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign (12 is now closer to 7.67 than to 16.75, so it moves to K1):
K1 = {6, 8, 9, 12}       mean = 8.75
K2 = {2, 3}              mean = 2.5
K3 = {15, 18, 22}        mean = 18.33

Reassign: no object changes cluster, so STOP.
K1 = {6, 8, 9, 12}, K2 = {2, 3}, K3 = {15, 18, 22}

Example 2
Given {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2.
Solution:
K1 = {2, 3, 4, 10, 11, 12}
K2 = {20, 25, 30}

Advantages
• K-means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k the number of clusters, and t the number of iterations. Normally k << n and t << n.

Disadvantages
• It can be applied only when the mean of a cluster is defined.
• Users need to specify k in advance.
• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
• It is sensitive to noise and outlier data points, which can strongly influence the mean value.

K-Means (Graph)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object, using the Euclidean distance to determine the minimum:
        d(A, B) = √((x₂ − x₁)² + (y₂ − y₁)²)
Step 3: Assign objects to the k clusters based on minimum distance.
Step 4: Calculate the centroid of each cluster as
        C = ((x₁ + x₂ + … + xₙ)/n , (y₁ + y₂ + … + yₙ)/n)
        Go to Step 2; repeat until there is no change in the centroids.

Example 1
There are four types of medicines, each with two attributes, as shown below. Find a way to group them into 2 groups based on their features.

  Medicine  Weight  pH
  A           1      1
  B           2      1
  C           4      3
  D           5      4

Solution: plot the values on a graph and mark any k = 2 centroids; here A(1, 1) and B(2, 1) serve as the initial centroids. Calculate the Euclidean distance of each point (A, B, C, D) from the two centroids:

  D = | 0   1   3.61   5    |   (distances from centroid 1)
      | 1   0   2.83   4.24 |   (distances from centroid 2)

Based on minimum distance, assign the points to clusters:
K1 = {A}
K2 = {B, C, D}

Calculate the new centroids:
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)

Mark the new centroids and continue the iteration until there is no change in the centroids or clusters.

Final solution: K1 = {A, B} with centroid (1.5, 1) and K2 = {C, D} with centroid (4.5, 3.5).

Example 2
Use the k-means algorithm to create two clusters. [The data set was given as a figure in the original slides.]

Example 3
Group the given points into 3 clusters. [The points were given as a figure in the original slides.]
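To close, a minimal Python sketch (not part of the original slides) of the graph-based procedure applied to the medicine data of Example 1, with A and B as the initial centroids as in the worked solution. It assumes no cluster ever becomes empty, which holds for this data:

```python
import math

def kmeans_2d(points, centroids):
    """2-D k-means: assign each point to the nearest centroid by
    Euclidean distance, recompute each centroid as the coordinate-wise
    mean of its cluster, and repeat until the centroids stop moving."""
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [(sum(x for x, _ in c) / len(c),
                          sum(y for _, y in c) / len(c)) for c in clusters]
        if new_centroids == centroids:   # centroids stable -> converged
            return clusters, centroids
        centroids = new_centroids

# Medicines A, B, C, D as (weight, pH); A and B are the initial centroids.
medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]
clusters, centroids = kmeans_2d(medicines, [(1, 1), (2, 1)])
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```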