BIRCH: An Efficient Data Clustering Method for Very Large Databases

BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
By Tian Zhang and Raghu Ramakrishnan
Presented by Vladimir Jelić 3218/10
e-mail: jelicvladimir5@gmail.com
What is Data Clustering?

A cluster is a closely-packed group.
A collection of data objects that are similar to one another and treated collectively as a group.
Data clustering is the partitioning of a dataset into clusters.
Data Clustering

Helps understand the natural grouping or structure in a dataset.
Given a large set of multidimensional data:
– Data space is usually not uniformly occupied
– Identify the sparse and crowded regions
– Helps visualization
Some Clustering Applications

Biology – building groups of genes with related patterns
Marketing – partitioning the population of consumers into market segments
WWW – division of web pages into genres
Image segmentation – for object recognition
Land use – identification of areas of similar land use from satellite images
Clustering Problems

Today many datasets are too large to fit into main memory.
The dominating cost of any clustering algorithm is I/O, because seek times on disk are orders of magnitude higher than RAM access times.
Previous Work

Two classes of clustering algorithms:
– Probability-based: examples are COBWEB and CLASSIT
– Distance-based: examples are KMEANS, KMEDOIDS, and CLARANS
Previous Work: COBWEB

Probabilistic approach to making decisions.
Clusters are represented with probabilistic descriptions.
The probabilistic representation of clusters is expensive.
Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data.
Previous Work: KMEANS

Distance-based approach, so there must be a distance measure between any two instances.
Sensitive to instance order.
Instances must be stored in memory.
All instances must be initially available.
May have exponential run time.
Previous Work: CLARANS

Also a distance-based approach, so there must be a distance measure between any two instances.
Computational complexity of CLARANS is about O(n²).
Sensitive to instance order.
Ignores the fact that not all data points in the dataset are equally important.
Contributions of BIRCH

Each clustering decision is made without scanning all data points.
BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes.
BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency).
Background Knowledge (1)

Given a cluster of N d-dimensional instances {x_i}, i = 1, ..., N, we define:

Centroid: $x_0 = \frac{\sum_{i=1}^{N} x_i}{N}$

Radius: $R = \left( \frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N} \right)^{1/2}$

Diameter: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)} \right)^{1/2}$
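A minimal NumPy sketch of these three quantities, useful for checking the formulas on a small example (the points below are the ones reused later in the CF example; the helper names are mine, not from the paper):

    import numpy as np

    def centroid(points):
        # x0: per-dimension mean of the N points
        return points.mean(axis=0)

    def radius(points):
        # R: sqrt of the average squared distance from the points to the centroid
        x0 = centroid(points)
        return np.sqrt(((points - x0) ** 2).sum() / len(points))

    def diameter(points):
        # D: sqrt of the average squared pairwise distance within the cluster
        n = len(points)
        diffs = points[:, None, :] - points[None, :, :]
        return np.sqrt((diffs ** 2).sum() / (n * (n - 1)))

    pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
    print(centroid(pts), radius(pts), diameter(pts))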
Background Knowledge (2)

Given two clusters {x_i}, i = 1, ..., N_1, and {x_j}, j = N_1+1, ..., N_1+N_2, with centroids $x_0^{(1)}$ and $x_0^{(2)}$ in d dimensions:

Centroid Euclidean distance: $D_0 = \left( (x_0^{(1)} - x_0^{(2)})^2 \right)^{1/2}$

Centroid Manhattan distance: $D_1 = |x_0^{(1)} - x_0^{(2)}| = \sum_{i=1}^{d} |x_{0,i}^{(1)} - x_{0,i}^{(2)}|$

Average inter-cluster distance: $D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (x_i - x_j)^2}{N_1 N_2} \right)^{1/2}$

Average intra-cluster distance: $D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (x_i - x_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

Variance increase distance: $D_4 = \sum_{k=1}^{N_1+N_2} \left( x_k - \frac{\sum_{l=1}^{N_1+N_2} x_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( x_i - \frac{\sum_{l=1}^{N_1} x_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( x_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} x_l}{N_2} \right)^2$
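To make D0 and D2 concrete, a small NumPy sketch computed directly from raw points (in BIRCH itself these distances are derived from the CF summaries introduced on the next slides rather than from the raw data; the function names are mine):

    import numpy as np

    def d0(c1, c2):
        # D0: Euclidean distance between the two cluster centroids
        return np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

    def d2(c1, c2):
        # D2: sqrt of the average squared distance between points of different clusters
        diffs = c1[:, None, :] - c2[None, :, :]
        return np.sqrt((diffs ** 2).sum() / (len(c1) * len(c2)))

    c1 = np.array([[3, 4], [2, 6]], dtype=float)
    c2 = np.array([[4, 5], [4, 7], [3, 8]], dtype=float)
    print(d0(c1, c2), d2(c1, c2))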
Clustering Features (CF)

The BIRCH algorithm builds a dendrogram, called a clustering feature tree (CF tree), while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined on the following slide.
Clustering Feature (CF)

Given N d-dimensional data points in a cluster {x_i}, i = 1, 2, ..., N:
CF = (N, LS, SS)
– N is the number of data points in the cluster,
– LS is the linear sum of the N data points,
– SS is the square sum of the N data points.
CF Additivity Theorem (1)

If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
CF Additivity Theorem (2)

Example: for the five 2-dimensional points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

(Figure: the five points plotted on a 10 × 10 grid.)
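A small sketch (NumPy assumed; the cf and merge helpers are mine) that computes the CF triple for these points and illustrates the additivity theorem by merging two disjoint halves of the cluster:

    import numpy as np

    def cf(points):
        # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum
        points = np.asarray(points, dtype=float)
        return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

    def merge(cf1, cf2):
        # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

    pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
    print(cf(pts))                          # (5, [16. 30.], [ 54. 190.])
    print(merge(cf(pts[:2]), cf(pts[2:])))  # same triple, built from two disjoint halves

Note that the centroid, R, D, and the distances D0–D4 can all be computed from these (N, LS, SS) triples alone, without revisiting the raw points.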
Properties of CF-Tree

Each non-leaf node has at most B entries.
Each leaf node has at most L CF entries, each of which satisfies threshold T.
Node size is determined by the dimensionality of the data space and the input parameter P (page size).
CF Tree Insertion

Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric.
Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold T; if there is no room, split the node.
Modifying the path: update the CF information up the path.
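A simplified sketch of how a leaf node might absorb or add entries (CFEntry, LeafNode, the threshold test, and the split handling are illustrative stand-ins, not the paper's code; the non-leaf descent and the actual split are omitted):

    import numpy as np

    class CFEntry:
        def __init__(self, point):
            p = np.asarray(point, dtype=float)
            self.n, self.ls, self.ss = 1, p.copy(), p * p

        def centroid(self):
            return self.ls / self.n

        def diameter_if_added(self, other):
            # diameter of the merged entry, computed purely from the CF statistics
            n = self.n + other.n
            ls = self.ls + other.ls
            ss = self.ss + other.ss
            if n < 2:
                return 0.0
            return np.sqrt(max(0.0, (2 * n * ss.sum() - 2 * (ls * ls).sum()) / (n * (n - 1))))

        def absorb(self, other):
            self.n += other.n
            self.ls += other.ls
            self.ss += other.ss

    class LeafNode:
        def __init__(self, L, T):
            self.L, self.T, self.entries = L, T, []

        def insert(self, entry):
            # try to absorb the new entry into the closest existing entry without violating T
            if self.entries:
                closest = min(self.entries,
                              key=lambda e: np.linalg.norm(e.centroid() - entry.centroid()))
                if closest.diameter_if_added(entry) <= self.T:
                    closest.absorb(entry)
                    return True
            if len(self.entries) < self.L:
                self.entries.append(entry)    # room for a new CF entry
                return True
            return False                      # full: the caller must split this leaf

    leaf = LeafNode(L=3, T=1.5)
    for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
        leaf.insert(CFEntry(p))
    print(len(leaf.entries))                  # the five points collapse into 3 CF entries here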
Example of the BIRCH Algorithm

(Figure: the data space with subclusters sc1–sc8 and the corresponding CF tree, whose root points to leaf nodes LN1, LN2, and LN3; the new subcluster sc8 is inserted into leaf LN1.)
Merge Operation in BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

(Figure: LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.)
Merge Operation in BIRCH

If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.

(Figure: the root is split into non-leaf nodes NLN1 and NLN2, which divide the leaf nodes LN1', LN1'', LN2, and LN3 between them.)
Merge Operation in BIRCH

Assume that the subclusters are numbered according to the order of formation.

(Figure: the data space with subclusters sc1–sc6 and a CF tree whose root points to leaf nodes LN1 and LN2.)
Merge Operation in BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN2 is split.

(Figure: LN2 is split into LN2' and LN2''; the root now points to LN1, LN2', and LN2''.)
Merge Operation in BIRCH

LN2' and LN1 will be merged, and the newly formed node will be split immediately.

(Figure: after the merge and immediate resplit, the root points to LN3', LN3'', and LN2''.)
Birch Clustering Algorithm (1)

Phase 1: Scan all data and build an initial in-memory CF tree.
Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.
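For reference, scikit-learn ships a BIRCH implementation that follows this phase structure; a minimal usage sketch (the data and parameter values are purely illustrative, not tuned):

    import numpy as np
    from sklearn.cluster import Birch

    X = np.random.rand(10_000, 2)

    # threshold plays the role of T, branching_factor bounds the node fan-out (B and L),
    # and n_clusters drives the global clustering phase
    model = Birch(threshold=0.05, branching_factor=50, n_clusters=3)
    labels = model.fit_predict(X)
    print(labels[:10], model.subcluster_centers_.shape)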
Birch Clustering Algorithm (2)
Birch – Phase 1

Start with an initial threshold and insert points into the tree.
If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the values from the old tree, then continue with the remaining values.
A good initial threshold is important but hard to figure out.
Outlier removal – outliers can be removed when the tree is rebuilt.
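The threshold is what the rebuilding step manipulates: a larger threshold yields coarser leaf entries and hence a smaller tree. A quick illustration of that trade-off with scikit-learn's Birch (values are illustrative):

    import numpy as np
    from sklearn.cluster import Birch

    X = np.random.rand(5_000, 2)
    for T in (0.02, 0.05, 0.1, 0.2):
        tree = Birch(threshold=T, n_clusters=None).fit(X)
        # larger T -> fewer, coarser leaf subclusters, i.e. a smaller tree
        print(T, len(tree.subcluster_centers_))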
Birch – Phase 2

Optional.
The global clustering algorithm used in Phase 3 has an input size range in which it performs well, so Phase 2 condenses the tree to prepare it for Phase 3.
BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
Birch – Phase 3

Problems after Phase 1:
– Input order affects results
– Splitting is triggered by node size

Phase 3:
– Cluster all leaf entries on their CF values according to an existing algorithm
– Algorithm used here: agglomerative hierarchical clustering
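A hedged sketch of this global step with scikit-learn, using the leaf subcluster centroids as stand-ins for the leaf entries (the data and the choice of three clusters are illustrative):

    import numpy as np
    from sklearn.cluster import Birch, AgglomerativeClustering

    X = np.random.rand(5_000, 2)

    # Phase 1: build only the CF tree (no global clustering yet)
    tree = Birch(threshold=0.05, n_clusters=None).fit(X)
    leaf_centroids = tree.subcluster_centers_

    # Phase 3: run an existing algorithm (agglomerative hierarchical clustering) on the summaries
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(leaf_centroids)
    print(len(leaf_centroids), labels[:10])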
Birch – Phase 4

Optional.
Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3.
Recalculate the centroids and redistribute the items.
Always converges (no matter how many times Phase 4 is repeated).
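One refinement pass is essentially a k-means-style reassignment step; a minimal sketch (the centroids would come from Phase 3; the arrays here are made up):

    import numpy as np

    def refine(X, centroids):
        # assign every point to its closest centroid, then recompute the centroids
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centroids = np.array([X[assign == k].mean(axis=0) for k in range(len(centroids))])
        return assign, new_centroids

    X = np.random.rand(1_000, 2)
    centroids = np.array([[0.2, 0.2], [0.8, 0.8], [0.2, 0.8]])   # e.g. the output of Phase 3
    labels, centroids = refine(X, centroids)
    print(centroids)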
Conclusions (1)

BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets.
Scans the whole dataset only once.
Handles outliers better.
Superior to other algorithms in stability and scalability.
Conclusions (2)

Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster.
Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
References

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD '96.
Jan Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees.
Tan, Steinbach, Kumar: Introduction to Data Mining.