CanopyClustering - machinelearningbigdata

Canopy Clustering
Given a distance measure and two threshold distances T1 > T2:
1. Determine canopy centers – go through the list of input points to form a
list of “clusterCenters”. If a point is within T2 of a point already in
clusterCenters, ignore it. If not, append the point to clusterCenters.
2. Determine canopy membership – for each point in the input set, if the point is
within T1 of a cluster center, then the point is a member of the corresponding
cluster. Because T1 > T2, a point may belong to more than one canopy.
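The two steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the choice of Euclidean distance, and the reading of "within" as "less than or equal to" are all assumptions made for the example.

```python
import math


def canopy_centers(points, t2, distance=math.dist):
    """Step 1: keep a point as a center unless it is within t2 of an existing center."""
    centers = []
    for p in points:
        # Assumption: "within T2" is read as distance <= t2.
        if all(distance(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def canopy_membership(points, centers, t1, distance=math.dist):
    """Step 2: a point joins every canopy whose center is within t1 of it."""
    return {
        i: [p for p in points if distance(p, c) <= t1]
        for i, c in enumerate(centers)
    }


points = [(0, 0), (1, 0), (5, 0), (6, 0), (3, 0)]
centers = canopy_centers(points, t2=2)
members = canopy_membership(points, centers, t1=3)
```

In this small example the point (3, 0) lies within T1 of both centers, so it appears in both canopies; the result also depends on the order in which the input points are visited.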
Combine Canopy and kMeans or EM
Only calculate the distance between a point and a centroid when the two share
a canopy. Points outside the canopies containing the centroid are assigned
infinite distance to that centroid.
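A rough sketch of how the assignment step of k-means might use canopies is shown below. The helper names (`canopies_of`, `assign`) are hypothetical, and Euclidean distance is used for both the canopy test and the exact distance purely to keep the example short; in practice the canopy test would normally use a cheaper measure and be computed once up front.

```python
import math


def canopies_of(x, canopy_centers, t1):
    """Indices of the canopies that contain x (assumed: distance <= t1)."""
    return {i for i, c in enumerate(canopy_centers) if math.dist(x, c) <= t1}


def assign(points, centroids, canopy_centers, t1):
    """Assign each point to its nearest centroid among those sharing a canopy."""
    centroid_canopies = [canopies_of(c, canopy_centers, t1) for c in centroids]
    labels = []
    for p in points:
        p_canopies = canopies_of(p, canopy_centers, t1)
        best, best_d = None, math.inf  # unshared centroids stay at infinite distance
        for k, c in enumerate(centroids):
            if p_canopies & centroid_canopies[k]:
                d = math.dist(p, c)  # exact distance only for canopy-sharing pairs
                if d < best_d:
                    best, best_d = k, d
        labels.append(best)
    return labels


canopy_centers = [(0, 0), (10, 0)]
centroids = [(1, 0), (9, 0)]
points = [(0, 1), (2, 0), (10, 1), (5, 0)]
labels = assign(points, centroids, canopy_centers, t1=3)
```

A point that shares no canopy with any centroid, such as (5, 0) here, is left unassigned (`None`) in this sketch; how to handle such points is a design choice the text above does not specify.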
Canopy Clustering with MR
Given a distance metric and the tighter threshold T2:
Mapper – start with an empty set of canopyCenters. For each x in
inputData, if x is further than T2 from every member of canopyCenters,
then add x to canopyCenters and emit (1, x).
Reducer – start with an empty set of canopyCenters. Input = (key,
iterator over mapper cluster centers). For each x in the iterator, if x is
further than T2 from every member of canopyCenters, then add x to
canopyCenters and emit (1, x).
Because every mapper emits the same key, a single reducer sees all candidate
centers. This results in a list of canopy centers to be used for determining
canopy membership.
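The mapper and reducer logic can be simulated in plain Python without a MapReduce framework. The sketch below is only an in-process illustration: `select_centers`, `mapper`, `reducer`, and `run_job` are hypothetical names, each inner list in `splits` stands in for one mapper's share of the input, and Euclidean distance is assumed.

```python
import math


def select_centers(candidates, t2):
    """Keep each candidate that is further than t2 from every center kept so far."""
    centers = []
    for x in candidates:
        if all(math.dist(x, c) > t2 for c in centers):
            centers.append(x)
    return centers


def mapper(split, t2):
    """Emit (1, x) for each local canopy center found in this input split."""
    for x in select_centers(split, t2):
        yield (1, x)


def reducer(key, values, t2):
    """Repeat the same selection over the centers emitted by all mappers."""
    for x in select_centers(values, t2):
        yield (key, x)


def run_job(splits, t2):
    """Simulate the job: run every mapper, group by the single key, then reduce."""
    mapped = [kv for split in splits for kv in mapper(split, t2)]
    values = [x for key, x in mapped if key == 1]
    return [x for _, x in reducer(1, values, t2)]


splits = [
    [(0, 0), (1, 0), (5, 0)],
    [(0.5, 0), (5.5, 0), (10, 0)],
]
final_centers = run_job(splits, t2=2)
```

The reducer pass is what removes near-duplicate centers found independently by different mappers, such as (0, 0) and (0.5, 0) in this example.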