Clustering Analysis
2013/04/13
Outline
 What is Clustering Analysis?
 Partitioning Methods
K-Means
 Hierarchical Methods
Agglomerative and Divisive hierarchical clustering
 Density-Based Methods
DBSCAN
What is Clustering Analysis?
 Cluster analysis or clustering is the task of grouping
a set of objects in such a way that objects in the same
group (called a cluster) are more similar to each other
than to those in other groups (clusters).
Example: circles of friends on Facebook.
 Clustering is useful in that it can lead to the discovery
of previously unknown groups within the data.
Partitioning Methods
 Given a data set, D, of n objects, and k, the number of
clusters to form, a partitioning algorithm organizes the
objects into k partitions ( k ≤ n), where each partition
represents a cluster.
K-Means
 Input:
1. X = {x1, x2, …, xn}: a data set in d-dimensional space.
2. k: the number of clusters.
 Output:
Cluster centers c_j, 1 ≤ j ≤ k.
 Requirement:
The output should minimize the objective function.
Objective function

e_j = Σ_{x_i ∈ G_j} ||x_i − c_j||² ; G_j: the j-th group, c_j: its cluster center.

E = Σ_{j=1}^{k} e_j = Σ_{j=1}^{k} Σ_{x_i ∈ G_j} ||x_i − c_j||² ; k: the number of clusters.

Goal: determine the grouping and the corresponding cluster centers
so that the value of E is minimized.
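As a minimal sketch of the objective above (the groups and centers below are made up for illustration), E can be computed directly by summing squared distances within each group:

```python
# Sketch: compute the k-means objective E for a fixed partition.
# The groups and centers here are illustrative, not from the slides.

def kmeans_objective(groups, centers):
    """E = sum over clusters j of sum over x_i in G_j of ||x_i - c_j||^2."""
    total = 0.0
    for points, (cx, cy) in zip(groups, centers):
        for (x, y) in points:
            total += (x - cx) ** 2 + (y - cy) ** 2
    return total

groups = [[(1, 2), (2, 2)], [(4, 1.5), (4, 2.5)]]   # G_1, G_2
centers = [(1.5, 2.0), (4.0, 2.0)]                  # c_1, c_2
print(kmeans_objective(groups, centers))  # 0.25*4 = 1.0
```

A smaller E means each object sits closer to its own cluster center.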
Algorithm: k-means. The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean value of
the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the
most similar, based on the mean value of the objects in the
cluster;
(4) update the cluster means, that is, calculate the mean value of the
objects for each cluster;
(5) until no change;
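The method above can be sketched in a few lines of Python, assuming 2-D points and squared Euclidean distance; the function and variable names are illustrative, not from the slides:

```python
import random

# Sketch of the k-means method: steps (1), (3), (4), (5) from the
# pseudocode above, for 2-D points.

def kmeans(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)          # (1) arbitrary initial centers
    while True:
        # (3) (re)assign each object to the nearest center
        clusters = [[] for _ in range(k)]
        for (x, y) in points:
            j = min(range(k),
                    key=lambda j: (x - centers[j][0]) ** 2
                                  + (y - centers[j][1]) ** 2)
            clusters[j].append((x, y))
        # (4) update each center to the mean of its cluster
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(centers[j])  # keep a center with no members
        if new_centers == centers:              # (5) until no change
            return clusters, centers
        centers = new_centers

data = [(1, 2), (2, 2), (2.5, 4.5), (4, 1.5), (4, 2.5)]
clusters, centers = kmeans(data, k=2)
```

Because step (1) picks the initial centers arbitrarily, different seeds can converge to different partitions; the algorithm only guarantees a local minimum of E.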
Example
[Figure: k-means iterations alternating step (3), (re)assign, and step (4), update, until the clusters no longer change.]
Hierarchical Methods
 A hierarchical clustering method works by grouping
data objects into a hierarchy, or "tree," of clusters.
 Representing data objects in the form of a hierarchy is
useful for data summarization and visualization.
 Hierarchical methods suffer from the fact that once a
step (a merge or a split) is done, it can never be undone.
Agglomerative versus Divisive
hierarchical clustering
 Agglomerative methods start with individual objects
as clusters, which are iteratively merged to form larger
clusters.
 Divisive methods initially let all the given objects form
one cluster, which they iteratively split into smaller
clusters.
Agglomerative clustering is bottom-up: it starts from individual objects and proceeds by merging.
Divisive clustering is top-down: it starts by placing all objects in one cluster and proceeds by splitting.
Example
(Divisive hierarchical clustering)
[Figure: nine objects (1–9) are split top-down into groups, with group diameters d(G_s) = 1, d(G_t) = 0.8, and d(G_t) = 0.6, given d_min = 0.75.]
Example
(Agglomerative hierarchical clustering)
 Five objects: (1,2), (2.5,4.5), (2,2), (4,1.5), (4,2.5)
[Figure: the pairwise distance matrix D and the resulting dendrogram.]
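The bottom-up process on these five objects can be sketched as follows. This assumes single linkage (minimum pairwise distance between clusters), which the slides do not specify; the merge order is what the dendrogram records:

```python
import math

# Sketch of agglomerative (bottom-up) clustering on the five objects,
# assuming single linkage; clusters are lists of point indices.

points = [(1, 2), (2.5, 4.5), (2, 2), (4, 1.5), (4, 2.5)]

def dist(a, b):
    return math.dist(points[a], points[b])  # Euclidean distance (Python 3.8+)

clusters = [[i] for i in range(len(points))]  # start: one object per cluster
merges = []
while len(clusters) > 1:
    # find the closest pair of clusters under single linkage
    s, t = min(((s, t) for s in range(len(clusters))
                       for t in range(s + 1, len(clusters))),
               key=lambda st: min(dist(a, b) for a in clusters[st[0]]
                                             for b in clusters[st[1]]))
    merges.append((clusters[s], clusters[t]))
    clusters[s] = clusters[s] + clusters[t]   # merge cluster t into cluster s
    del clusters[t]

for a, b in merges:
    print("merge", a, "+", b)
```

Under single linkage, objects 0 and 2 (distance 1) and objects 3 and 4 (distance 1) merge first; the outlying object 1 at (2.5, 4.5) joins last.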
Density-Based Methods
 Partitioning and hierarchical methods are designed to
find spherical-shaped clusters.
 Density-based clustering methods can discover clusters
of nonspherical shape.
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN)
 DBSCAN finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their
neighborhoods to form dense regions as clusters.
 An object is a core object if the ε-neighborhood of the
object contains at least MinPts objects.
1. The ε-neighborhood is the space within a radius ε centered
at an object.
2. MinPts: the minimum number of points required to
form a cluster.
Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
D: a data set containing n objects; ε: the radius parameter; MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3)     randomly select an unvisited object p;
(4)     mark p as visited;
(5)     if the ε-neighborhood of p has at least MinPts objects
(6)         create a new cluster C, and add p to C;
(7)         let N be the set of objects in the ε-neighborhood of p;
(8)         for each point p′ in N
(9)             if p′ is unvisited
(10)                mark p′ as visited;
(11)                if the ε-neighborhood of p′ has at least MinPts points,
                    add those points to N;
(12)            if p′ is not yet a member of any cluster, add p′ to C;
(13)        end for
(14)        output C;
(15)    else mark p as noise;
(16) until no object is unvisited;
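A compact sketch of the method above for 2-D points, where `eps` and `min_pts` stand for ε and MinPts; the data set and label scheme (cluster ids starting at 0, −1 for noise) are illustrative:

```python
import math

# Sketch of DBSCAN: expand a cluster from each unvisited core object.

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)          # (1) mark all objects unvisited

    def neighborhood(i):
        # indices within radius eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for p in range(len(points)):                # (2)-(4) visit each object once
        if labels[p] != UNVISITED:
            continue
        n = neighborhood(p)
        if len(n) < min_pts:                    # (15) p is not a core object
            labels[p] = NOISE
            continue
        labels[p] = cluster_id                  # (6) start a new cluster at p
        seeds = [q for q in n if q != p]        # (7) its ε-neighborhood
        while seeds:                            # (8) expand the cluster
            q = seeds.pop()
            if labels[q] == NOISE:              # border point: claim it
                labels[q] = cluster_id
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id              # (10), (12) add q to the cluster
            if len(neighborhood(q)) >= min_pts: # (11) q is also a core object
                seeds.extend(neighborhood(q))
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0)]
print(dbscan(pts, eps=1.5, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

The two tight triangles become clusters 0 and 1, while the isolated point (10, 0) has no dense neighborhood and is labeled noise (−1).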
Example
 Let ε be represented by the radius of the circles, and let MinPts = 3.