Birch: Balanced Iterative Reducing and Clustering using Hierarchies
By Tian Zhang, Raghu Ramakrishnan
Presented by Vladimir Jelić 3218/10
e-mail: jelicvladimir5@gmail.com

What is Data Clustering?
- A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.
- Data clustering is the partitioning of a dataset into clusters.

Data Clustering
- Helps understand the natural grouping or structure in a dataset
- Given a large set of multidimensional data:
  - The data space is usually not uniformly occupied
  - Identify the sparse and crowded places
  - Helps visualization

Some Clustering Applications
- Biology – building groups of genes with related patterns
- Marketing – partitioning the population of consumers into market segments
- WWW – dividing pages into genres
- Image segmentation – for object recognition
- Land use – identifying areas of similar land use from satellite images

Clustering Problems
- Today many datasets are too large to fit into main memory
- For such datasets, the dominating cost of a clustering algorithm is I/O, because seek times on disk are orders of magnitude higher than RAM access times

Previous Work
Two classes of clustering algorithms:
- Probability-based. Examples: COBWEB and CLASSIT
- Distance-based. Examples: KMEANS, KMEDOIDS, and CLARANS

Previous Work: COBWEB
- Probabilistic approach to making decisions
- Clusters are represented with a probabilistic description
- Probabilistic representation of clusters is expensive
- Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data

Previous Work: KMeans
- Distance-based approach, so there must be a distance measurement
between any two instances
- Sensitive to instance order
- Instances must be stored in memory
- All instances must be initially available
- May have exponential run time

Previous Work: CLARANS
- Also a distance-based approach, so there must be a distance measurement between any two instances
- Computational complexity of CLARANS is about O(n^2)
- Sensitive to instance order
- Ignores the fact that not all data points in the dataset are equally important

Contributions of BIRCH
- Each clustering decision is made without scanning all data points
- BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
- BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)

Background Knowledge (1)
Given a cluster of N instances {x_i}, we define:

Centroid:
\[ \vec{x}_0 = \frac{\sum_{i=1}^{N} \vec{x}_i}{N} \]

Radius:
\[ R = \left( \frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2}{N} \right)^{1/2} \]

Diameter:
\[ D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{x}_i - \vec{x}_j)^2}{N(N-1)} \right)^{1/2} \]

Background Knowledge (2)
Given two clusters with centroids \vec{x}_{0_1} and \vec{x}_{0_2}, where the first cluster holds the points i = 1, ..., N_1 and the second holds the points j = N_1+1, ..., N_1+N_2:

Centroid Euclidean distance:
\[ D_0 = \left( (\vec{x}_{0_1} - \vec{x}_{0_2})^2 \right)^{1/2} \]

Centroid Manhattan distance:
\[ D_1 = |\vec{x}_{0_1} - \vec{x}_{0_2}| = \sum_{i=1}^{d} \left| x_{0_1}^{(i)} - x_{0_2}^{(i)} \right| \]

Average inter-cluster distance:
\[ D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{N_1 N_2} \right)^{1/2} \]

Average intra-cluster distance:
\[ D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2} \]

Variance increase distance:
\[ D_4 = \sum_{k=1}^{N_1+N_2} \left( \vec{x}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{x}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{x}_i - \frac{\sum_{l=1}^{N_1} \vec{x}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{x}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{x}_l}{N_2} \right)^2 \]

Clustering Features (CF)
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS and SS are defined below.

Clustering Feature (CF)
Given N d-dimensional data points in a cluster, {X_i} where i = 1, 2, …, N:
CF = (N, LS, SS)
- N is the number of data points in the cluster
- LS is the linear sum of the N data points
- SS is the square sum of the N data points

CF Additivity Theorem (1)
If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

CF Additivity Theorem (2)
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a grid with both axes running from 0 to 10]

Properties of CF-Tree
- Each non-leaf node has at most B entries
- Each leaf node has at most L CF entries, each of which satisfies the threshold T
- Node size is determined by the dimensionality of the data space and the input parameter P (page size)

CF Tree Insertion
- Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
- Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold. If there is no room, split the node
- Modifying the path: update the CF information up the path
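
The CF tuple, the additivity theorem, and the leaf absorption test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `CF`, `from_point`, and `can_absorb` are hypothetical, SS is kept per dimension to match the slide example, and the threshold is assumed to bound the radius R (a diameter bound would work the same way).

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """Clustering feature (N, LS, SS); SS is per dimension, as in the slide example."""
    n: int
    ls: Tuple[float, ...]  # linear sum of the points, per dimension
    ss: Tuple[float, ...]  # square sum of the points, per dimension

    @staticmethod
    def from_point(p: Sequence[float]) -> "CF":
        # A single point is a cluster with N = 1.
        return CF(1, tuple(p), tuple(x * x for x in p))

    def __add__(self, other: "CF") -> "CF":
        # CF additivity theorem: merge two disjoint sub-clusters component-wise.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            tuple(a + b for a, b in zip(self.ss, other.ss)),
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.n for s in self.ls)

    def radius(self) -> float:
        # R^2 = (sum of SS over dimensions) / N - ||centroid||^2,
        # which follows from expanding the radius definition.
        mean_sq = sum(self.ss) / self.n
        centroid_sq = sum(c * c for c in self.centroid())
        return sqrt(max(mean_sq - centroid_sq, 0.0))  # clamp rounding noise


def can_absorb(entry: CF, point: Sequence[float], threshold: float) -> bool:
    # Assumed absorption test: the merged entry must keep its radius within T.
    return (entry + CF.from_point(point)).radius() <= threshold


# Rebuild the slide example by adding the five points one at a time.
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = CF.from_point(points[0])
for p in points[1:]:
    cf = cf + CF.from_point(p)

print(cf)                              # N, LS and SS of the merged cluster
print(cf.centroid(), cf.radius())      # derived without revisiting the points
print(can_absorb(cf, (3, 6), 2.0))     # a nearby point (threshold 2.0 is arbitrary)
print(can_absorb(cf, (10, 10), 2.0))   # a distant point
```

The merged entry equals the (5, (16,30), (54,190)) tuple from the slide, and its radius comes out at about 1.6. The point of the design is that the centroid and radius are computed from the three stored values alone, which is what lets the tree make insertion decisions without keeping the raw points in memory.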

Example of the BIRCH Algorithm
[Figure: a new subcluster sc8 arrives at a CF tree whose root has leaf nodes LN1, LN2, and LN3 holding subclusters sc1 to sc7]

Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
[Figure: LN1 is split into LN1' and LN1''; the root now has leaf nodes LN1', LN1'', LN2, and LN3]

Merge Operation in BIRCH
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
[Figure: a new root with non-leaf nodes NLN1 and NLN2 above leaf nodes LN1', LN1'', LN2, and LN3]

Merge Operation in BIRCH
Assume that the subclusters are numbered according to the order of formation.
[Figure: root with leaf nodes LN1 and LN2 holding subclusters sc1 to sc6]

Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN2 is split.
[Figure: root with leaf nodes LN1, LN2', and LN2'']

Merge Operation in BIRCH
LN2' and LN1 will be merged, and the newly formed node will be split immediately.
[Figure: root with leaf nodes LN3', LN3'', and LN2'']

Birch Clustering Algorithm (1)
- Phase 1: Scan all data and build an initial in-memory CF tree.
- Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
- Phase 3: Global clustering.
- Phase 4: Cluster refining. This phase is optional and requires more passes over the data to refine the results.

Birch Clustering Algorithm (2)
[Figure omitted]

Birch – Phase 1
- Start with an initial threshold and insert points into the tree
- If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then continue with the remaining values
- A good initial threshold is important but hard to determine
- Outlier removal – outliers are removed when the tree is rebuilt

Birch – Phase 2
- Optional
- The clustering algorithm used in Phase 3 sometimes has an input size range within which it performs well, so Phase 2 prepares the tree for Phase 3
- BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones

Birch – Phase 3
Problems after Phase 1:
- Input order affects results
- Splitting is triggered by node size
Phase 3:
- Cluster all leaf nodes on their CF values according to an existing algorithm
- Algorithm used here: agglomerative hierarchical clustering

Birch – Phase 4
- Optional
- Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3
- Recalculate the centroids and redistribute the items
- Always converges (no matter how many times Phase 4 is repeated)

Conclusions (1)
- BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets
- Scans the whole dataset only once
- Handles outliers better
- Superior to other algorithms in stability and scalability

Conclusions (2)
- Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster
- Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster

References
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD '96.
- Jan Oberst. Efficient Data Clustering and How to Groom Fast-Growing Trees.
- Tan, Steinbach, Kumar. Introduction to Data Mining.