CLUSTERING (Segmentation)
Saed Sayad, www.ismartsoft.com

Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment

What is Clustering?
Given a set of records, organize the records into clusters. A cluster is a subset of records which are similar to one another.
[Scatter plot: Income vs. Age, with similar records grouped into clusters]

Clustering Requirements
• The ability to discover some or all of the hidden clusters.
• High within-cluster similarity and low between-cluster dissimilarity.
• The ability to deal with various types of attributes.
• Robustness to noise and outliers.
• The ability to handle high dimensionality.
• Scalability, interpretability and usability.

Similarity – Distance Measure
To measure similarity or dissimilarity between objects, we need a distance measure. The usual axioms for a distance measure D are:
• D(x, x) = 0
• D(x, y) = D(y, x)  (symmetry)
• D(x, y) ≤ D(x, z) + D(z, y)  (the triangle inequality)

For records x = (x_1, …, x_k) and y = (y_1, …, y_k):
• Euclidean: D(x, y) = √( Σ_{i=1..k} (x_i − y_i)² )
• Manhattan: D(x, y) = Σ_{i=1..k} |x_i − y_i|
• Minkowski: D(x, y) = ( Σ_{i=1..k} |x_i − y_i|^q )^(1/q)

Similarity – Correlation
r_xy = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )
[Scatter plots: Credit$ vs. Age; a high correlation between attribute profiles indicates similar records, a negative correlation indicates dissimilar ones]

Similarity – Hamming Distance
D_H = Σ_{i=1..k} |x_i − y_i|, i.e., the number of positions at which two sequences differ.
Gene 1:  A A T C C A G T
Gene 2:  T C T C A A G C
Differs: 1 1 0 0 1 0 0 1  →  D_H = 4

Clustering Methods
• Exclusive vs. Overlapping
• Hierarchical vs. Partitive
• Deterministic vs. Probabilistic
• Incremental vs. Batch learning

Exclusive vs. Overlapping
[Scatter plots: Income vs. Age; in exclusive clustering each record belongs to exactly one cluster, in overlapping clustering clusters may share records]

Hierarchical vs. Partitive
[Scatter plot: Income vs. Age; hierarchical methods build nested clusters, partitive methods produce a flat partition]

Hierarchical Clustering
• Hierarchical clustering creates clusters that have a predetermined ordering from top to bottom. For example, all files and folders on a hard disk are organized in a hierarchy.
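The three distance measures above can be sketched in plain Python. This is a minimal illustration; the point pair is made up for the example.

```python
import math

def euclidean(x, y):
    # D(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # D(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    # D(x, y) = (sum_i |x_i - y_i|^q)^(1/q)
    # q = 2 gives Euclidean distance, q = 1 gives Manhattan distance.
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(minkowski(a, b, 2))  # 5.0, same as Euclidean
```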
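The correlation coefficient and the Hamming distance above can likewise be checked in a few lines; the gene sequences are the ones from the slide, and the numeric series is made up.

```python
import math

def correlation(x, y):
    # r_xy = sum (x_i - x_bar)(y_i - y_bar)
    #        / sqrt( sum (x_i - x_bar)^2 * sum (y_i - y_bar)^2 )
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    num = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - xb) ** 2 for xi in x)
                    * sum((yi - yb) ** 2 for yi in y))
    return num / den

def hamming(x, y):
    # D_H = number of positions at which the two sequences differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfectly similar profiles)
print(hamming("AATCCAGT", "TCTCAAGC"))          # 4
```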
• There are two types of hierarchical clustering:
  – Agglomerative (bottom-up: start with singletons and merge)
  – Divisive (top-down: start with one cluster and split)
[Diagram: agglomerative clustering merges clusters upward; divisive clustering splits them downward]

Hierarchical Clustering – Agglomerative
1. Assign each observation to its own cluster.
2. Compute the similarity (e.g., distance) between each pair of clusters.
3. Join the two most similar clusters.
4. Repeat steps 2 and 3 until there is only a single cluster left.

Hierarchical Clustering – Divisive
1. Assign all of the observations to a single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively on each cluster until there is one cluster for each observation.

Hierarchical Clustering – Single Linkage
L(r, s) = min_{i,j} D(x_ri, x_sj)

Hierarchical Clustering – Complete Linkage
L(r, s) = max_{i,j} D(x_ri, x_sj)

Hierarchical Clustering – Average Linkage
L(r, s) = (1 / (n_r · n_s)) Σ_{i=1..n_r} Σ_{j=1..n_s} D(x_ri, x_sj)

K-Means Clustering
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as the initial cluster centers.
3. Assign each observation to its closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all instances in each cluster (this is the "means" part).
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
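The agglomerative procedure and the three linkage rules above can be sketched in pure Python. This is a minimal O(n³) illustration on made-up 1-D data, not a production implementation, and it stops when k clusters remain rather than merging all the way down to one.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage(r, s, kind="single"):
    # Distance between clusters r and s under the chosen linkage rule.
    d = [euclidean(x, y) for x in r for y in s]
    if kind == "single":    # L(r, s) = min D(x_ri, x_sj)
        return min(d)
    if kind == "complete":  # L(r, s) = max D(x_ri, x_sj)
        return max(d)
    return sum(d) / len(d)  # average: (1 / (n_r * n_s)) * sum of all D(x_ri, x_sj)

def agglomerative(points, k, kind="single"):
    # Step 1: assign each observation to its own cluster.
    clusters = [[p] for p in points]
    # Step 4: repeat until the desired number of clusters remains.
    while len(clusters) > k:
        # Step 2: compute the linkage distance between every pair of clusters.
        pairs = [(linkage(clusters[i], clusters[j], kind), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Step 3: join the two most similar (closest) clusters.
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

data = [(1.0,), (1.2,), (5.0,), (5.3,), (9.0,)]
print(agglomerative(data, 3, kind="complete"))
# [[(1.0,), (1.2,)], [(5.0,), (5.3,)], [(9.0,)]]
```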
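The five k-means steps above can be sketched in pure Python as well. The data are made up, a fixed seed replaces the random initialization so the run is reproducible, and the sketch also computes the within-cluster sum of squares J that k-means implicitly minimizes.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)                # fixed seed for reproducibility
    centers = rng.sample(points, k)          # step 2: random initial centers
    assignment = None
    while True:
        # Step 3: assign each observation to its closest center.
        new_assignment = [min(range(k), key=lambda j: dist2(p, centers[j]))
                          for p in points]
        # Step 5: stop when assignments repeat in consecutive rounds.
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))

def sum_of_squares(points, centers, assignment):
    # J = sum over clusters j of sum over points n in S_j of (x_n - mu_j)^2
    return sum(dist2(p, centers[a]) for p, a in zip(points, assignment))

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)]
centers, assignment = kmeans(data, 2)
print(sorted(centers))                       # [(1.25, 1.5), (8.25, 8.5)]
print(sum_of_squares(data, centers, assignment))  # 1.25
```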
[Scatter plot: Income vs. Age, with records partitioned around k cluster centers]

K-Means Clustering – Sum of Squares
K-means minimizes the within-cluster sum of squares:
J = Σ_{j=1..K} Σ_{n ∈ S_j} (x_n − μ_j)²
where S_j is the set of records assigned to cluster j and μ_j is its centroid.

Clustering Evaluation
• Sarle's Cubic Clustering Criterion
• The Pseudo-F Statistic
• The Pseudo-T² Statistic
• Beale's F-Type Statistic
• Target-based

Clustering Evaluation – Target-based
• Categorical target variable: Chi² test, K-S test
• Numerical target variable: ANOVA, H test

Chi² Test
A 2×2 table of actual vs. predicted class counts:

              Predicted Y   Predicted N
  Actual Y        n11           n12
  Actual N        n21           n22

χ² = Σ_{i=1..r} Σ_{j=1..c} (n_ij − e_ij)² / e_ij
where e_ij is the expected count of cell (i, j) under independence.

Analysis of Variance (ANOVA)

  Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square   | F           | P
  Between Groups      | SSB            | dfB                | MSB = SSB/dfB | F = MSB/MSW | P(F)
  Within Groups       | SSW            | dfW                | MSW = SSW/dfW |             |
  Total               | SST            | dfT                |               |             |

Clustering – Applications
• Marketing: finding groups of customers with similar behavior.
• Insurance & Banking: identifying frauds.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City planning: identifying groups of houses according to their house type, value and geographical location.
• World Wide Web: document classification; clustering weblog data to discover groups with similar access patterns.

Summary
• Clustering is the process of organizing objects (records or variables) into groups whose members are similar in some way.
• Hierarchical and K-Means are the two most widely used clustering techniques.
• The effectiveness of a clustering method depends on the similarity function.
• The result of a clustering algorithm can be interpreted and evaluated in different ways.

Questions?
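As a worked example of the χ² statistic defined above, the following sketch computes it for a contingency table of counts; the 2×2 counts are hypothetical, chosen only for illustration.

```python
def chi_square(table):
    # chi^2 = sum over cells of (n_ij - e_ij)^2 / e_ij, where e_ij is the
    # expected count under independence: (row total * column total) / grand total.
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / total) ** 2
               / (rows[i] * cols[j] / total)
               for i in range(len(table))
               for j in range(len(table[0])))

# Hypothetical actual-vs-predicted counts: [[n11, n12], [n21, n22]]
print(chi_square([[30, 10], [10, 30]]))  # 20.0
```

Every expected count here is 40·40/80 = 20, so each of the four cells contributes (30 − 20)²/20 or (10 − 20)²/20 = 5, giving χ² = 20.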