Definition
Clustering is the task of finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications
• Group related documents for browsing
• Group genes and proteins that have similar functionality
• Group stocks with similar price fluctuations
• Reduce the size of large data sets
• Group users with similar buying mentalities

Clustering is ambiguous
There is no single correct or incorrect solution for a clustering problem. How many clusters? The same set of points can plausibly be divided into two, four, or six clusters.

Challenges faced
• Scalability
• Ability to deal with different types of attributes
• Noise and outliers
• Complex shapes and types of data
• Incremental clustering and insensitivity to the order of input records
• High dimensionality
• Constraint-based clustering
• Interpretability and usability

Types of Data Structures
Data Matrix: n objects with p variables. The structure is in the form of a relational table, or an n x p matrix.
Dissimilarity Matrix: an object-by-object structure that stores the proximities available for all pairs of the n objects. d(i, j) is the dissimilarity between objects i and j, with d(i, j) = d(j, i) and d(i, i) = 0.

Types of Data
• Interval-scaled variables
• Binary variables
• Nominal (categorical) variables
• Ordinal variables
• Ratio-scaled variables
• Variables of mixed types

Interval-Scaled Variables
Continuous measurements on a roughly linear scale, e.g., weight, height, and temperature.

Binary Variables
A binary variable has only two states, 0 and 1. The dissimilarity between two objects described by binary variables is computed from a 2 x 2 contingency table:

                 Object j
                  1     0    sum
  Object i  1     q     r    q+r
            0     s     t    s+t
          sum    q+s   r+t    p

Here q is the number of variables that are 1 for both objects, r the number that are 1 for i but 0 for j, s the number that are 0 for i but 1 for j, and t the number that are 0 for both; p = q + r + s + t.

Dissimilarity between binary variables

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      Y      N       N       N       N

Treating Y and P as 1 and N as 0, ignoring the symmetric attribute Gender, and using the asymmetric binary (Jaccard) dissimilarity d(i, j) = (r + s) / (q + r + s):
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Mary, Jim)  = (1 + 2) / (1 + 1 + 2) = 0.75
(A code sketch of this computation appears after the k-means algorithm below.)

Categorical Variables
A generalization of the binary variable: a categorical (nominal) variable can take more than two states, e.g., a color variable with the states red, green, and blue.

Other types of data
Ordinal: similar to nominal variables, but the values are ordered in a meaningful sequence, e.g., the rank of an employee can be assistant, associate, or full.
Ratio-scaled: a positive measurement on a non-linear scale, e.g., the growth of bacteria or radioactive decay.
Variables of mixed types: a data set may contain variables of all of the types above, whose individual dissimilarities are combined into a single measure.

Types of clustering
• Hierarchical clustering (BIRCH): a set of nested clusters organized as a hierarchical tree.
• Partitional clustering (k-means, k-medoids): a division of the data objects into non-overlapping (distinct) subsets, i.e., clusters, such that each data object is in exactly one subset.
• Density-based (DBSCAN): based on density functions.
• Grid-based (STING): based on a multiple-level granularity structure.
• Model-based (SOM): hypothesize a model for each of the clusters and find the best fit of the data to the given model.

[Figures in the original slides: a set of original points with one possible partitional clustering of them, and traditional vs. non-traditional hierarchical clusterings of points p1-p4 with their corresponding dendrograms.]

Clustering Algorithms
• Partitional: k-means, k-medoids
• Hierarchical: agglomerative, divisive

K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects and the number of clusters k
Output: a set of k clusters
Algorithm:
1. Randomly select k samples and mark them as the initial cluster means.
2. Repeat:
   - Assign/reassign each sample to the cluster whose mean it is most similar to.
   - Update each cluster's mean.
   Until there is no change.
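Referring back to the Jack/Mary/Jim example above: a minimal Python sketch (not part of the original slides) of the asymmetric binary (Jaccard) dissimilarity, assuming the Y/P-as-1, N-as-0 encoding described there:

```python
def jaccard_dissimilarity(a, b):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s).

    q: positions where both vectors are 1; r: a = 1, b = 0;
    s: a = 0, b = 1. Matches on 0 (the count t) are ignored,
    which is what makes the measure asymmetric."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4, with Y/P -> 1 and N -> 0 (Gender ignored).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(mary, jim), 2))   # 0.75
```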
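And a minimal sketch of the k-means pseudocode itself on one-dimensional data; the data set and starting partition anticipate Example 1 below, and the empty-cluster-mean-of-0 convention follows that worked example (real implementations usually re-seed or drop empty clusters instead):

```python
def kmeans_1d(points, k, clusters):
    """1-D k-means: given an initial partition of the points into k
    clusters, alternately compute each cluster's mean and reassign
    every point to the nearest mean, until nothing changes."""
    while True:
        # Mean of each cluster; an empty cluster's mean is taken as 0,
        # matching the convention used in Example 1 below.
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
        new_clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            new_clusters[nearest].append(p)
        if new_clusters == clusters:   # assignment stable -> converged
            return clusters, means
        clusters = new_clusters

data = [2, 3, 6, 8, 9, 12, 15, 18, 22]
start = [[2, 8, 15], [3, 9, 18], [6, 12, 22]]   # random initial partition
clusters, means = kmeans_1d(data, 3, start)
print(clusters)                      # [[6, 8, 9, 12], [2, 3], [15, 18, 22]]
print([round(m, 2) for m in means])  # [8.75, 2.5, 18.33]
```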
K-Means (Array)
Step 1: Randomly assign the objects to k clusters.
Step 2: Find the mean of each cluster.
Step 3: Reassign each object to the cluster with the closest mean.
Step 4: Go to Step 2; repeat until there is no change.

Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22} and k = 3.
Solution: randomly partition the given data set:
K1 = {2, 8, 15}    mean = 8.33
K2 = {3, 9, 18}    mean = 10
K3 = {6, 12, 22}   mean = 13.33

Reassign each value to the cluster with the closest mean:
K1 = {2, 3, 6, 8, 9}     mean = 5.6
K2 = {}                  mean = 0
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign:
K1 = {3, 6, 8, 9}        mean = 6.5
K2 = {2}                 mean = 2
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign:
K1 = {6, 8, 9}           mean = 7.67
K2 = {2, 3}              mean = 2.5
K3 = {12, 15, 18, 22}    mean = 16.75

Reassign (12 is now closer to 7.67 than to 16.75, so it moves to K1):
K1 = {6, 8, 9, 12}       mean = 8.75
K2 = {2, 3}              mean = 2.5
K3 = {15, 18, 22}        mean = 18.33

Reassign: no object changes cluster, so STOP.
K1 = {6, 8, 9, 12}, K2 = {2, 3}, K3 = {15, 18, 22}

Example 2
Given {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2.
Solution:
K1 = {2, 3, 4, 10, 11, 12}
K2 = {20, 25, 30}

Advantages
• K-means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k the number of clusters, and t the number of iterations. Normally k << n and t << n.

Disadvantages
• It can be applied only when the mean of a cluster is defined.
• Users need to specify k in advance.
• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
• It is sensitive to noise and outlier data points, which can strongly influence the mean value.

K-Means (Graph)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object, using the Euclidean distance to determine the minimum:
        d(A, B) = √((x₂ − x₁)² + (y₂ − y₁)²)
Step 3: Assign objects to the k clusters based on minimum distance.
Step 4: Calculate the centroid of each cluster as
        C = ((x₁ + x₂ + … + xₙ)/n , (y₁ + y₂ + … + yₙ)/n)
        Go to Step 2; repeat until there is no change in the centroids.

Example 1
There are four types of medicines, each with two attributes, as shown below. Find a way to group them into 2 groups based on their features.

  Medicine  Weight  pH
  A           1      1
  B           2      1
  C           4      3
  D           5      4

Solution: plot the values on a graph and mark any k = 2 centroids; here A(1, 1) and B(2, 1) serve as the initial centroids. Calculate the Euclidean distance of each point (A, B, C, D) from the two centroids:

  D = | 0   1   3.61   5    |   (distances from centroid 1)
      | 1   0   2.83   4.24 |   (distances from centroid 2)

Based on minimum distance, assign the points to clusters:
K1 = {A}
K2 = {B, C, D}

Calculate the new centroids:
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)

Mark the new centroids and continue the iteration until there is no change in the centroids or clusters.

Final solution: K1 = {A, B} with centroid (1.5, 1) and K2 = {C, D} with centroid (4.5, 3.5).

Example 2
Use the k-means algorithm to create two clusters. [The data set was given as a figure in the original slides.]

Example 3
Group the given points into 3 clusters. [The points were given as a figure in the original slides.]
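To close, a minimal Python sketch (not part of the original slides) of the graph-based procedure applied to the medicine data of Example 1, with A and B as the initial centroids as in the worked solution. It assumes no cluster ever becomes empty, which holds for this data:

```python
import math

def kmeans_2d(points, centroids):
    """2-D k-means: assign each point to the nearest centroid by
    Euclidean distance, recompute each centroid as the coordinate-wise
    mean of its cluster, and repeat until the centroids stop moving."""
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [(sum(x for x, _ in c) / len(c),
                          sum(y for _, y in c) / len(c)) for c in clusters]
        if new_centroids == centroids:   # centroids stable -> converged
            return clusters, centroids
        centroids = new_centroids

# Medicines A, B, C, D as (weight, pH); A and B are the initial centroids.
medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]
clusters, centroids = kmeans_2d(medicines, [(1, 1), (2, 1)])
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```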