Clustering Analysis
2013/04/13

Outline
What is Clustering Analysis?
Partitioning Methods
    K-Means
Hierarchical Methods
    Agglomerative and Divisive hierarchical clustering
Density-Based Methods
    DBSCAN

What is Clustering Analysis?
Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters).
Example: groups of friends on Facebook.
Clustering is useful because it can lead to the discovery of previously unknown groups within the data.

Partitioning Methods
Given a data set D of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

K-Means
Input:
1. X = {x_1, x_2, …, x_n}: a data set in d-dimensional space.
2. k: the number of clusters.
Output: cluster centers c_j, 1 ≤ j ≤ k.
Requirement: the output should minimize the objective function.

Objective function
\[ e_j = \sum_{x_i \in G_j} \| x_i - c_j \|^2 \]
where G_j is the j-th group and c_j is its cluster center.
\[ E = \sum_{j=1}^{k} e_j = \sum_{j=1}^{k} \sum_{x_i \in G_j} \| x_i - c_j \|^2 \]
where k is the number of clusters.
Goal: partition the data into groups and find the corresponding cluster centers such that the value of E is minimized.

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
    k: the number of clusters,
    D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)     (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4)     update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;

Example
[Figure: k-means iterations, alternating (3) assign/reassign and (4) update steps]

Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters.
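
Before going further into hierarchical methods, the k-means loop from the partitioning section (steps (1) to (5)) can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the choice of squared Euclidean distance as the similarity measure, and the sample points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    """Return (centers, assignment, E) for the data set `data`."""
    rng = random.Random(seed)
    # Step (1): arbitrarily choose k objects as the initial centers.
    centers = rng.sample(data, k)
    assignment = None
    for _ in range(max_iter):  # Step (2): repeat
        # Step (3): (re)assign each object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in data
        ]
        # Step (5): stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): update each center to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(data, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    # Objective function E: total squared distance to the assigned centers.
    error = sum(squared_distance(x, centers[a]) for x, a in zip(data, assignment))
    return centers, assignment, error


# Hypothetical sample data: two well-separated groups of three points.
points = [(1, 2), (2, 2), (1, 1), (8, 8), (9, 8), (8, 9)]
centers, labels, E = k_means(points, k=2)
print(labels, round(E, 4))
```

The cluster numbering depends on which objects are picked in step (1), so only the grouping itself is meaningful, not the label values. In general k-means can stop at a local minimum of E, which is why the initial choice matters on less clearly separated data.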
Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.

Agglomerative versus Divisive hierarchical clustering
Agglomerative methods start with individual objects as clusters, which are iteratively merged to form larger clusters.
Divisive methods initially let all the given objects form one cluster, which they iteratively split into smaller clusters.

                 Agglomerative clustering    Divisive clustering
Direction        Bottom-up                   Top-down
Starting point   Individual objects          All objects placed in one cluster
Operation        Merge                       Split

Example (Divisive hierarchical clustering)
[Figure: nine objects, numbered 1 to 9, split into groups; labels shown in the figure: d(Gt)=0.6, d(Gs)=1, d(Gt)=0.8, dmin=0.75]

Example (Agglomerative hierarchical clustering)
Five objects: (1,2) (2.5,4.5) (2,2) (4,1.5) (4,2.5)
D = [Figure: distance matrix of the five objects]
[Figure: dendrogram]

Density-Based Methods
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
Density-based clustering methods can discover clusters of nonspherical shape.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN finds core objects that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters.
An object is a core object if the ε-neighborhood of the object contains at least MinPts objects.
1. The ε-neighborhood of an object is the space within a radius ε centered at that object.
2. MinPts: the minimum number of objects that an ε-neighborhood must contain for a cluster to be formed around it.

Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
    D: a data set containing n objects;
    ε: the radius parameter;
    MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3)     randomly select an unvisited object p;
(4)     mark p as visited;
(5)     if the ε-neighborhood of p has at least MinPts objects
(6)         create a new cluster C, and add p to C;
(7)         let N be the set of objects in the ε-neighborhood of p;
(8)         for each point p’ in N
(9)             if p’ is unvisited
(10)                mark p’ as visited;
(11)                if the ε-neighborhood of p’ has at least MinPts points, add those points to N;
(12)            if p’ is not yet a member of any cluster, add p’ to C;
(13)        end for
(14)        output C;
(15)    else mark p as noise;
(16) until no object is unvisited;

Example
Let ε be given by the radius of the circles, and let MinPts = 3.
[Figure: example objects with their ε-neighborhoods drawn as circles]
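
The DBSCAN method above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: Euclidean distance is used, an object counts as a member of its own ε-neighborhood, noise is encoded as the label -1, and the names `region_query`, `dbscan`, `seed`, and the sample points are hypothetical choices made for illustration.

```python
import math
import random

NOISE = -1  # assumption: label used for objects marked as noise


def region_query(data, i, eps):
    """Indices of all objects within distance eps of data[i] (including i)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]


def dbscan(data, eps, min_pts, seed=0):
    """Return one label per object: a cluster id (0, 1, ...) or NOISE."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)                 # step (3): random selection order
    visited = [False] * len(data)      # step (1): all objects unvisited
    labels = [None] * len(data)        # None: not yet in any cluster
    cluster_id = 0
    for p in order:                    # steps (2) and (16)
        if visited[p]:
            continue
        visited[p] = True              # step (4)
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:   # step (15): p is not a core object
            labels[p] = NOISE
            continue
        labels[p] = cluster_id         # step (6): new cluster C containing p
        candidates = list(neighbors)   # step (7): the set N
        idx = 0
        while idx < len(candidates):   # step (8): N may grow while iterating
            q = candidates[idx]
            idx += 1
            if not visited[q]:         # step (9)
                visited[q] = True      # step (10)
                q_neighbors = region_query(data, q, eps)
                if len(q_neighbors) >= min_pts:  # step (11): q is a core object
                    candidates.extend(q_neighbors)
            if labels[q] is None or labels[q] == NOISE:  # step (12)
                labels[q] = cluster_id
        cluster_id += 1                # step (14): cluster C is complete
    return labels


# Hypothetical sample data: two dense groups and one isolated object.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In step (12) an object previously marked as noise is treated as not yet belonging to any cluster, so it can still join a cluster as a border object. The cluster ids depend on the random visiting order; the isolated object at (5, 5) has no other object within ε and is labeled as noise.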