Course (Matakuliah) : M0614 / Data Mining & OLAP
Term : Feb 2010

Cluster Analysis
Session (Pertemuan) 11

Learning Outcomes
By the end of this session, students are expected to be able to:
• Use clustering analysis techniques: partitioning, hierarchical, and model-based clustering in data mining. (C3)

Acknowledgments
These slides have been adapted from Han, J., Kamber, M., & Pei, J., Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., & Kumar, V., Introduction to Data Mining.

Outline
• What is cluster analysis?
• Types of data in cluster analysis
• A categorization of major clustering methods: partitioning methods

What is Cluster Analysis?
• Cluster: a collection of data objects
  – similar (or related) to one another within the same group
  – dissimilar (or unrelated) to the objects in other groups
• Unsupervised learning: no predefined classes (a small code sketch appears after the application examples below)
• Typical applications
  – As a stand-alone tool to get insight into data distribution
  – As a preprocessing step for other algorithms

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized; inter-cluster distances are maximized

Clustering for Data Understanding and Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults
• Climate: understanding Earth's climate; finding patterns in the atmosphere and ocean
• Economic science: market research

Applications of Cluster Analysis
• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
  – Reduce the size of large data sets (Figure: clustering precipitation in Australia)

Discovered clusters of stocks with similar price movements:
  Cluster 1 (industry group Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  Cluster 2 (industry group Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (industry group Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (industry group Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
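To make the idea of grouping by similarity concrete, here is a minimal sketch of cluster analysis in Python. It assumes scikit-learn is available; the 2-D points and the choice of three clusters are invented purely for illustration and are not taken from the slides.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.2], [0.8, 0.9], [1.1, 1.0],   # a group near (1, 1)
    [5.0, 5.1], [4.9, 4.8], [5.2, 5.0],   # a group near (5, 5)
    [9.0, 1.0], [8.8, 1.2], [9.1, 0.9],   # a group near (9, 1)
])

# Unsupervised: only the number of clusters is given, no class labels.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster id assigned to each point
print(model.cluster_centers_)  # one centroid per cluster

Because no labels are supplied, the algorithm discovers the grouping from the distances alone: intra-cluster distances end up small and inter-cluster distances large.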
What is not Cluster Analysis?
• Supervised classification: has class label information
• Simple segmentation: dividing students into different registration groups alphabetically, by last name
• Results of a query: groupings are the result of an external specification
• Graph partitioning: some mutual relevance and synergy, but the areas are not identical

Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters

Types of Clusters: Well-Separated
• Well-separated clusters:
  – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: 3 well-separated clusters)

Types of Clusters: Center-Based
• Center-based
  – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster
  – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
(Figure: 4 center-based clusters)

Types of Clusters: Contiguity-Based
• Contiguous clusters (nearest neighbor or transitive)
  – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: 8 contiguous clusters)

Types of Clusters: Density-Based
• Density-based
  – A cluster is a dense region of points, separated from other regions of high density by low-density regions.
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: 6 density-based clusters)

Types of Clusters: Conceptual Clusters
• Shared property or conceptual clusters
  – Find clusters that share some common property or represent a particular concept.
(Figure: 2 overlapping circles)

Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters (see the sketch after the requirements list below)
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
(Figures: original points and a partitional clustering of them; a traditional hierarchical clustering with its dendrogram; a non-traditional hierarchical clustering with its dendrogram)

Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
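The partitional/hierarchical distinction above can be seen side by side in a short sketch. This assumes scikit-learn and SciPy are available; the six points are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)

# Partitional: each point lands in exactly one of k non-overlapping subsets.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("partitional labels:", labels)

# Hierarchical: a tree of nested clusters; cutting the tree at any height
# yields one particular flat clustering.
Z = linkage(X, method="single")  # nearest-neighbor (contiguity-style) merges
print(Z)  # each row: the two clusters merged, their distance, new size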
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  – high intra-class similarity: cohesive within clusters
  – low inter-class similarity: distinctive between clusters
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
• Dissimilarity/similarity metric
  – Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  – The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  – Weights should be associated with different variables based on applications and data semantics
• Quality of clustering:
  – There is usually a separate "quality" function that measures the "goodness" of a cluster.
  – It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Type of Data in Clustering Analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types

Interval-Valued Variables
• Standardize the data
  – Calculate the mean absolute deviation:
    s_f = (1/n) (|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)
    where m_f = (1/n) (x_{1f} + x_{2f} + ... + x_{nf})
  – Calculate the standardized measurement (z-score):
    z_{if} = (x_{if} - m_f) / s_f
• Using the mean absolute deviation is more robust than using the standard deviation

Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• A popular one is the Minkowski distance:
  d(i, j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q)^{1/q}
  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

Similarity and Dissimilarity Between Objects (Cont.)
• If q = 2, d is the Euclidean distance:
  d(i, j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2)
  – Properties
    • d(i, j) >= 0
    • d(i, i) = 0
    • d(i, j) = d(j, i)
    • d(i, j) <= d(i, k) + d(k, j)
• One can also use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
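Here is a plain-NumPy sketch of the formulas above: the Minkowski distance (with its Manhattan and Euclidean special cases) and z-score standardization using the mean absolute deviation. The example vectors and the single-variable column are invented values.

import numpy as np

def minkowski(x, y, q):
    # d(i, j) = (sum_f |x_f - y_f|^q)^(1/q); q = 1 gives Manhattan,
    # q = 2 gives Euclidean distance.
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 4 + 0) ~= 3.61

# Z-score standardization of one variable f, using the mean absolute
# deviation s_f, which is more robust to outliers than the standard deviation.
col = np.array([1.0, 2.0, 3.0, 10.0])
m_f = col.mean()
s_f = np.abs(col - m_f).mean()   # mean absolute deviation
print((col - m_f) / s_f)         # [-1.0, -0.67, -0.33, 2.0]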
Binary Variables
• A contingency table for binary data (object i vs. object j):

              object j
              1      0      sum
  object i 1  a      b      a+b
           0  c      d      c+d
     sum      a+c    b+d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant, if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)

Dissimilarity Between Binary Variables
• Example:

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       Y      N      P       N       N       N
  Mary   F       Y      N      P       N       P       N
  Jim    M       Y      P      N       N       N       N

  – gender is a symmetric attribute
  – the remaining attributes are asymmetric binary
  – let the values Y and P be set to 1, and the value N be set to 0

  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
  (these values are rechecked in the sketch after the ratio-scaled variables slide below)

Nominal Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  – m: # of matches, p: total # of variables
    d(i, j) = (p - m) / p
• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
  – replace x_{if} by its rank r_{if} in {1, ..., M_f}
  – map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_{if} = (r_{if} - 1) / (M_f - 1)
  – compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
• Methods:
  – treat them like interval-scaled variables: not a good choice! (why?)
  – apply a logarithmic transformation: y_{if} = log(x_{if})
  – treat them as continuous ordinal data and treat their rank as interval-scaled
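Before turning to mixed types, the asymmetric-binary case can be checked numerically. The sketch below recomputes the Jaccard dissimilarities from the Jack/Mary/Jim example, mapping Y and P to 1 and N to 0 and excluding the symmetric gender attribute.

def jaccard_dissim(x, y):
    # d = (b + c) / (a + b + c), ignoring the negative matches d.
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(jaccard_dissim(jack, mary))  # (0+1)/(2+0+1) = 0.33
print(jaccard_dissim(jack, jim))   # (1+1)/(1+1+1) = 0.67
print(jaccard_dissim(jim, mary))   # (1+2)/(1+1+2) = 0.75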
Variables of Mixed Types
• A database may contain all six types of variables
  – symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
• One may use a weighted formula to combine their effects:
  d(i, j) = [ sum_{f=1}^{p} delta_{ij}^{(f)} d_{ij}^{(f)} ] / [ sum_{f=1}^{p} delta_{ij}^{(f)} ]
  – f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, and d_{ij}^{(f)} = 1 otherwise
  – f is interval-based: use the normalized distance
  – f is ordinal or ratio-scaled:
    • compute the ranks r_{if} and z_{if} = (r_{if} - 1) / (M_f - 1)
    • and treat z_{if} as interval-scaled

Major Clustering Approaches
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Typical methods: STING, WaveCluster, CLIQUE

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:
  E = sum_{i=1}^{k} sum_{p in C_i} (p - m_i)^2
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen '67): each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

K-Means Clustering Method
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple (see the sketch below)

The K-Means Clustering Method: Example
(Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update repeatedly until the assignment stabilizes)

Comments on the K-Means Method
• Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
  – Compare: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weaknesses
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
• A few variants of k-means differ in
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies for calculating cluster means
• Handling categorical data: k-modes
  – Replacing the means of clusters with modes
  – Using new dissimilarity measures to deal with categorical objects
  – Using a frequency-based method to update the modes of clusters
  – For a mixture of categorical and numerical data: the k-prototype method
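Here is the promised sketch of the basic k-means loop in plain NumPy: assign each point to its nearest center, recompute each center as the mean of its points, and stop when nothing changes. The data, k = 2, and the random initialization are illustrative; empty clusters are not handled in this sketch.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # assign to the closest centroid
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged (a local optimum)
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
labels, centers = kmeans(X, k=2)
print(labels)   # e.g., [0 0 1 1]
print(centers)  # the two cluster means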
What Is the Problem with the K-Means Method?
• The k-means algorithm is sensitive to outliers!
  – An object with an extremely large value may substantially distort the distribution of the data.
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
(Figure: the same data set clustered around means vs. around medoids)

The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids):
  – starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
  – PAM works effectively for small data sets, but does not scale well to large data sets
• CLARA: sampling-based
• CLARANS: randomized sampling
• Focusing + spatial data structures

A Typical K-Medoids Algorithm (PAM)
(Figure: with K = 2, arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid, total cost = 26; then loop: randomly select a non-medoid object O_random, compute the total cost of swapping, and swap O and O_random if quality is improved, total cost = 20; repeat until no change)

PAM Clustering: Total Swapping Cost
• TC_{ih} = sum_j C_{jih}, where C_{jih} is the change in object j's cost when medoid i is replaced by candidate h (a code sketch follows below). The four cases, illustrated in the original figures:
  – j is reassigned from i to h: C_{jih} = d(j, h) - d(j, i)
  – j stays with another medoid t: C_{jih} = 0
  – j is reassigned from i to another medoid t: C_{jih} = d(j, t) - d(j, i)
  – j is reassigned from its medoid t to h: C_{jih} = d(j, h) - d(j, t)

PAM (Partitioning Around Medoids)
• PAM, built into S-PLUS
• Uses real objects to represent the clusters:
  – Select k representative objects arbitrarily
  – For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_{ih}
  – For each pair of i and h:
    • If TC_{ih} < 0, i is replaced by h
    • Then assign each non-selected object to the most similar representative object
  – Repeat steps 2-3 until there is no change

PAM Clustering: Finding the Best Cluster Center
• Case 1: p currently belongs to medoid o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects o_i, then p is reassigned to o_i
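A small sketch of PAM's swap evaluation in Python with NumPy. It computes TC_{ih} directly as the change in every non-selected object's distance to its nearest medoid, which is equivalent to summing the four cases of C_{jih} above. The six points and the chosen medoids are made up for illustration.

import numpy as np

def total_swap_cost(D, medoids, i, h):
    # TC_ih: total change in cost if medoid i is replaced by object h.
    # For each non-selected object j, the contribution C_jih is its
    # nearest-medoid distance after the swap minus the one before.
    after = [m for m in medoids if m != i] + [h]
    cost = 0.0
    for j in range(len(D)):
        if j in medoids or j == h:
            continue
        cost += min(D[j][m] for m in after) - min(D[j][m] for m in medoids)
    return cost  # negative means the swap improves the clustering

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(total_swap_cost(D, medoids=[0, 3], i=0, h=1))  # > 0: swap not helpful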
What Is the Problem with PAM?
• PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
• PAM works efficiently for small data sets but does not scale well to large data sets
  – O(k(n-k)^2) for each iteration, where n is # of data objects and k is # of clusters
• Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications)
• CLARA
  – Built into statistical analysis packages, such as S-PLUS
  – It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
  – Efficiency depends on the sample size
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS ("Randomized" CLARA)
• CLARANS (A Clustering Algorithm based on Randomized Search)
  – Draws a sample of neighbors dynamically
  – The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
  – If a local optimum is found, it restarts from a new, randomly selected node in search of a new local optimum
• Advantages: more efficient and scalable than both PAM and CLARA
• Further improvements: focusing techniques and spatial access structures

Continued in Session (Pertemuan) 12: Cluster Analysis (cont.)