Chapter 9 Part 1

advertisement
Chapter 9
UNSUPERVISED LEARNING:
Clustering
Part 1
Cios / Pedrycz / Swiniarski / Kurgan
Outline
• What is Clustering?
- Categories of clustering methods
- Similarity measures
• Partition-Based Clustering
• Hierarchical Clustering
• Model-Based (mixture of probabilities)
clustering
• Scalable Clustering
• Grid-Based Clustering
• Cluster Validity
• Clustering of Large Datasets
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
2
What is Clustering?
How do we understand data?
We look for structure in data by revealing
groups/clusters.
Clusters are about abstraction of data.
The structure is formed based on similarities
between patterns (data).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
3
How hard is clustering?
Consider N data points to be split into “c”
groups (clusters). The number of possible
splits (partitions) is described as
1 c
c i  c  N
 (1)  i
c! i 1
i
Even for a small problem of N =100 and c =5
we end up with 1067 partitions
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
4
Clustering Challenges – from Bezdek
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
5
Clustering Challenges – from Bezdek
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
6
Categories of Clustering
We distinguish between three main categories
(classes) of clustering methods
• Partition-based
• Hierarchical
• Model-based (mixture of probabilities)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
7
Major Clustering Approaches (I)
• Partitioning approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects)
using some criterion
– Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSACN, OPTICS, DenClue
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
8
Major Clustering Approaches (II)
• Model-based:
– A model is hypothesized for each of the clusters and tries to find
the best fit of that model to each other
– Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
– Based on the analysis of frequent patterns
– Typical methods: p-Cluster
• User-guided or constraint-based:
– Clustering by considering user-specified or application-specific
constraints
– Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
– Objects are often linked together in various ways
– Massive links can be used to cluster objects: SimRank, LinkClus
9
Partition-Based Clustering
It is also referred to as objective function
clustering, relies on the minimization of a
certain objective function (performance
index)
The result of minimization is a partition
matrix and a collection of prototypes
The methods in this class are conceptually
and algorithmically appealing
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
10
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)
E  ik1 pCi ( p  ci )2
• Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
11
Model-Based Clustering
In MBC we assume a certain probabilistic
model of data and estimate its parameters,
such as mean, covariance matrix, etc.
Mixture density model is the common
approach used – we assume that data are a
result of a mixture of “c” sources of data and
each source is treated as a separate cluster.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
12
Similarity measures
SM are the most fundamental components of
every clustering method; they are used to
quantify similarity (or dissimilarity) between
the data points.
The data with the highest similarity (like the
lowest distance) are candidates to form a
single cluster.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
13
Examples of Distance Functions:
Continuous Data (1)
Euclidean distance (p=2 in Minkowski)
d(x, y ) 
n
2
 (x i  y i )
i 1
Hamming distance (p=1 in Minkowski)
n
d(x, y)   | x i  y i |
i 1
Tchebyschev distance (p= ∞ in Minkowski)
d(x, y)  maxi1,2,...,n | x i  yi |
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
14
Examples of Distance Functions:
Continuous Data (2)
Minkowski distance
d(x, y ) 
p
n
 (x i  y i ) , p  0
p
i 1
(drawing from Wikipedia)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
15
Examples of Distance Functions:
Continuous Data (2)
Canberra distance d(x, y )   | x i  y i | , x and y are positive
i
i
i 1 x i  y i
n
Example:
v1(1,1,1), v2(1,1,0), v3(10,5,0), v4(1,2,3), v5(2,4,6)
d12=1
d13=2.485
d45=1
n
 x i yi
Angular separation
d(x, y ) 
Example:
v1(7,6,3,-1), v2(0,3,4,5)
d12=0.363
i 1
n
n
2
[  x i  y i2 ]1/ 2
i 1
i 1
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
16
Examples of Distance Functions:
Discrete Data (1)
Binary data x = [x1 x2 …xn]
a- number of occurrences where both xi and yi are 1
d- number of occurrences where both xi and yi are 0
b,c- number of occurrences where xi and yi are different (0-1)
1
0
1
a
c
0
b
d
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
17
Examples of Distances:
Discrete Data (2)
Binary data x = [x1 x2 …xn]
Matching index
ad
abcd
Rusell & Rao
a
abcd
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
18
Examples of Distances:
Discrete Data (3)
Binary data x = [x1 x2 …xn]
Jacard index
a
abc
Czekanowski
2a
2a  b  c
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
19
Hierarchical Clustering
HC provides graphical illustration of relationships
between the data in the form of dendrogram, which
is a binary tree.
There are two approaches to HC:
• Bottom – up / agglomerative
• Top-down / divisive
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
20
Hierarchical Clustering
• Agglomerative / bottom-up method starts with each
object in the data forming its own cluster, and then
successively merges the clusters until one large
cluster is formed, which encompasses the entire
dataset
• Divisive / top-down method starts by considering the
entire data as one cluster and then splits up the
cluster(s) until each object forms its own cluster
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
21
Hierarchical Clustering
Top –down /
divisive
Bottom-up /
agglomerative
{a}
{b,c,d,e}
{f,g,h}
a
b
c
d
e
f
g
h
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
22
Hierarchical Agglomerative Clustering
numbers of clusters
at different levels
4
3
2
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
23
Hierarchical Agglomerative Clustering
Given : a data set and the distance function
1. start with “N “ clusters by assigning each pattern to a
separate cluster
2. proceed with this initial configuration of the clusters and
merge the clusters that are the closest. In other words, if S
and T are the two clusters being recognized as the closest,
form a single cluster {S, T} and reduce the number of clusters
by one
3 repeat step 2 until a minimal number of the clusters has been
reached.
Result : clusters of data (partition)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
24
Distance Between Clusters
Single linkage method: T  S  min
complete linkage : T  S  max
average linkage : T -S 
xT
yS
xT
yS
xy
xy
1
xy

card (S )card (T )
xT
yS
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
25
Single Linkage
S imilarity between S and T is calculated based on the
minimal dis tan ce between the elements belonging to the
corresponding clusters
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
26
Complete Linkage
We rely on the maximal distance between
the patterns in the analyzed clusters.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
27
Average Linkage
We combine two clusters based
upon their averaged distance
between the patterns in the clusters.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
28
Hausdorff Distance Function
d(A, B)  max{max xA min yBd(x, y), max yBmin xA d(x, y)}
picture from Wikipedia
d(A,B) = max { min d(A,B)} = sup { inf d(A,B)}
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
29
Lance-Williams updating formula
d AB,C  α A d A, C  α Bd B,C  βd A, B  γ | d A, C  d B,C |
Clustering
method
Single link
Complete link
centroid
A (B)


1/2
1/2
0
0
-1/2
1/2
0
nA
nA  nB
median
1/2

nAnB
(n A  n B ) 2
-1/4
0
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
30
Hierarchical Divisive Method
HD algorithm starts by considering all divisions of the data into two
nonempty subsets
n 1
which amounts to
possibilities.
2
1
However, it is possible to construct divisive methods
that do not consider all divisions,
most of which may be incorrect anyway.
One such algorithm is by MacNaughton - Smith (1964)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
31
Hierarchical Divisive Method
At first A:=C and B:=
1. Move one object at a time from A to B.
For each object iA we compute average dissimilarity to all
other objects of A:
1
a(i ) 
d (i, j )

| A | 1 jA
j i
Object m of A for which a(m) is the largest, is moved to B:
A : A \ {m}, B : {m}
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
32
Hierarchical Divisive Method
2. Move other objects from A to B (called the “splinter
group”)
If |A|=1, stop. Otherwise compute a(i) for all iA, and the average
dissimilarity of i to all objects of B, denoted as d(i,B)
1
1
a(i)  d (i, B) 
d (i, j ) 
d (i, k )


| A | 1 jA
| B | kB
j i
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
33
Hierarchical Divisive Method
Select the object hA for which
a(h)  d (h, B)  maxiA (a(i)  d (i, B))
If a(h)-d(h,B) > 0  move h from A to B, go to 2
If a(h)-d(h,B)  0  the process stops
The division of C into clusters A and B is complete.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
34
Hierarchical Divisive Method
a
a  0.0
b  2.0
c  6.0

d 10.0
e  9.0
b
c
d
e
2.0 6.0 10.0 9.0
0.0 5.0 9.0 8.0
5.0 0.0 4.0 5.0

9.0 4.0 0.0 3.0
8.0 5.0 3.0 0.0
Object Average Dissimilarity to the
Other Objects
a
(2.0 + 6.0 + 10.0 + 9.0)/4 = 6.75
b
(2.0 + 5.0 + 9.0 +8.0)/4 = 6.00
c
(6.0 + 5.0 + 4.0 + 5.0)/4 = 5.00
d
(10.0 + 9.0 + 4.0 + 3.0)/4 = 6.50
e
(9.0 + 8.0 + 5.0 + 3.0)/4 = 5.25
In our example, object a is chosen to initiate the splinter group.
At this stage we have groups A={b,c,d,e} and B={a}, but we
don’t stop here.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
35
Hierarchical Divisive Method
Average Dissimilarity
Object to remaining Objects
Average Dissimilarity
to Objects of
Splinter Group
Difference
b
c
d
e
2.00
6.00
10.00
9.00
5.33
-1.33
-4.67
-3.67
(5.0+9.0+8.0)/37.33
(5.0+4.0+5.0)/34.67
(9.0+4.0+3.0)/35.33
(8.0+5.0+3.0)/35.33
Therefore, object b changes sides, so new splinter group is B={a, b} and the
remaining group becomes A={c, d, e}
Average Dissimilarity
Object to remaining Objects
Average Dissimilarity
to Objects of
Splinter Group
Difference
c
d
e
(6.0+5.0)/2=5.50
(10.0+9.0)/2=9.50
(9.0+8.0)/2=8.50
-1.00
-6.00
-4.50
(4.0+5.0)/2=4.50
(4.0+3.0)/2=3.50
(5.0+3.0)/2=4.00
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
36
Objective Function Clustering
Develop and optimize a partition matrix
so that a certain performance index is
optimized.
The lower the value of the objective
function, the better.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
37
Objective Function Clustering
Objective function
Minimization
Structure
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
38
Objective Function Clustering
• Depends on minimization of a certain
performance index Q
2 c N
c N
Q   U m x v    U m d 2
ik
k
i
ik
ik
i1 k1
i1 k1
c – number of clusters
U – partition matrix
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
39
Clustering: representation issues
how to representclusters?
P art it ion mat rix
N dat a point s
c clust ers
data  partition matrix
1 0 0 1 1 0 0 1
0 1 1 0 0 0 0 0 

U
0 0 0 0 0 1 1 0 




© 2007 Cios / Pedrycz / Swiniarski / Kurgan
40
Partition Matrix
1 0 0 1 1 0 0 1
0 1 1 0 0 0 0 0 

U
0 0 0 0 0 1 1 0 




U  {U | 0 
N
u
k 1
cluster1 : {1,4,5,8}
cluster2: {2,3}
cluster3: {6,7}
c
ik
 N, 0   u ik 1 }
i1
=
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
41
Objective Function-Based Clustering
Clustering is guided by the minimization of some
objective function/performance index Q.
Representation of structure is in the form of a:
• Partition matrix U = [uik], i=1,2,…,c; k=1, 2,…,N
N
0   u  N for i 1, 2, ..., c
k1 ik
N
 uik 1 for k 1, 2, ...,N
i1
•Prototypes vi, i=1,2,…, c
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
42
Objective Function Clustering
Algorithm:
Given: the (guessed!) number of clusters (c),
decide on the similarity function
(and on the value of the power factor (m) for fuzzy clustering only)
Compute the prototypes (v) and update the
partition matrix (U) based on the conditions of
the minimized objective function
Result: partition matrix and prototypes
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
43
K - means Algorithm
Input
Output
c = number of clusters
d = distance function
U = partition matrix
m = power (fuzziness) factor
V = Cluster centers
not used in hard K- means
t = termination criteria
Amount of movement
between clusters
v = cluster centers
Randomly chosen each run
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
44
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four
steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest
seed point
– Go back to Step 2, stop when no more new
assignment
April 8, 2015
Data Mining: Concepts and Techniques
45
The K-Means Clustering Method
• Example
10
10
9
9
8
8
7
7
6
6
5
5
10
9
8
7
6
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
4
3
2
1
0
0
1
2
3
4
5
6
reassign
10
10
9
9
8
8
7
7
6
6
5
5
4
2
1
0
0
1
2
3
4
5
6
7
8
7
8
9
10
reassign
3
April 8, 2015
Update
the
cluster
means
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
Data Mining: Concepts and Techniques
46
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical
data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
April 8, 2015
Data Mining: Concepts and Techniques
47
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
– Since an object with an extremely large value may substantially
distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster
as a reference point, medoids can be used, which is the most
centrally located object in a cluster
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
K - means Algorithm
Termination criteria
When the summed difference of the old and new
partitions (partition matrices) is less than a
threshold
• Hard
Unew - Uold == 0;
• Fuzzy
Unew - Uold < user chosen number
(like 0.0001)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
49
K - means Algorithm
Minimizes the objective function by allocating data points
to different clusters
Given: the number of clusters c
1. select initial c means
2. calculate distance between the pattern and the means of
the clusters
3. allocate the pattern to the cluster whose mean is
nearest to this pattern
4. recalculate the mean of the cluster from the patterns
allocated to it
5. repeat 2-4 until a termination criterion is satisfied
Result: a collection of means (prototypes) of the clusters
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
50
K-Means Clustering Algorithm (1)
Objective function
c
N
Q    u ik || x k  v i ||2
i 1 k 1
Minimize Q
subject to constraints
uik = 0 or 1
N
0   u  N for i 1, 2, ..., c
k1 ik
N
 uik 1 for k 1, 2, ...,N
i1
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
51
K-Means Clustering Algorithm (2)
start with some initial configuration of the prototypes vi, i=1,2, …, c (e.g., choose
them randomly)
- iterate
construct a partition matrix by assigning numeric values to U according to the
following rule
1, if d(x k , v i )  min ji d(x k , v j )
u ik  
otherwise
 0,
-
update the prototypes by computing the weighted average that involves the
entries of the partition matrix
N
vi 
 u ik x k
k 1
N
 u ik
k 1
until the performance index Q stabilizes and does not change or the changes
are negligible
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
52
Growing the Hierarchy of Clusters
c-clusters
Develop “c” clusters; split
the most heterogeneous cluster
into “c” clusters, etc.
c-clusters
(a)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
53
Growing the Hierarchy of Clusters
c-clusters
c-clusters
c-clusters
balanced growth
c-clusters
imbalanced growth
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
54
Kernel-Based Clustering
Idea:
Original data in the n-dimensional space are transformed
through some mapping f into elements in m-dimensional
space where m >>n.
Objective function in the new space:
c
N
m
Q    u ik
|| φ(x k )  φ(v i ) ||2
i 1 k 1
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
55
Kernel-Based Clustering
Given the dimensionality of a new space, m >>n,
we calculate in the new space a kernel function K(x,v)
as a dot product
K(x,v) = fT(x)f(v)
We can use a Gaussian kernel
K(x,v) = exp(- ||x-v||2/s2)
||f(xk)-f(vi)||2 = 2 – K(xk, vi)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
56
Kernel K-means
k-means cannot separate clusters that are non-linearly
separable
To solve this problem kernel k-means algorithm is used:
before doing clustering, all points are mapped into a higherdimensional space using some nonlinear function,
and then the algorithm partitions the points in the new space.
Major difference with k-means is calculation of distance in the
kernel k-means algorithm by the kernel method - not, say, by
Euclidean.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
57
Kernel K-means
c
N
m
Q    u ik
|| φ(x k )  φ(v i ) ||2
i 1 k 1
c
Q
i 1
N
 w(x) || φ(x
k 1
k
)  m j ||
2
To calculate the distances between the points in the new space
and the mj we use a kernel function that is specified in
the kernel matrix K.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
58
Kernel K-means
Input: K - kernel function, k - number of clusters
1. Initialize the k clusters: C1(0), ...,Ck(0)
2. Set t = 0
3. For each point x, find its new cluster by:
J*(x) = argmin j ||f(x)−mj||2
4. Compute the updated clusters as
Cj (t+1) = {x : J*(x) = J}
5. If not converged, set t = t + 1 and go to step 3; otherwise, stop
Result: partition into clusters C1, ....,Ck
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
59
K-Medoids Clustering
To enhance robustness of clustering we use
medoids instead of prototype mean values
In one-dimensional case it is the median
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
60
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
• Efficiency improvement on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): Randomized re-sampling
61
Median as a Robust Estimator
Consider an ordered collection of real data
x1 <= x2 <= … <=xN
Median is the central point in this sequence (if N is
even) or an average the two points in the middle (if N
is odd)
Median is a robust estimator (ordered statistics)
median
median
mean
mean
outlier
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
62
Median as a Robust Estimator
Median is a solution to the minimization problem
N
N
k 1
k 1
minii  | x k  x ii |  | x k  med |
We enhance the robustness by considering the
objective function
c
N n
   u ik | x k j  v ij |
i 1 k 1j1
Advantage: one of the original points becomes cluster center
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
63
Partitioning Around Medoids (PAM)
medoids – a family of the most centrally positioned data points.
PAM clustering:


represent the structure in the data by a collection of medoids,
each data point is grouped around the medoid to which its distance is the
shortest.
PAM starts with an arbitrary collection of elements treated as medoids.
At each step of the optimization, we make a swap between a certain data and one
of the medoids assuming that the swap results in improvement of the quality of the
clustering.
Limitations -- size of the dataset.
PAM works well for small datasets with a small number of clusters,
(100 data points and 5 clusters).
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
64
PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built in Splus
• Use real object to represent the cluster
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most
similar representative object
– repeat steps 2-3 until there is no change
65
PAM Clustering: Finding the Best Cluster
Center
Four Cases
66
PAM Clustering: Finding the
Best Cluster Center
• Case 1
Suppose p currently belongs to cluster represented by Oj. Further
More,
D(p,Oi) < D(p,Orandom) .Then If Oj is replaced by Orandom, p will belong to
the cluster represented by Oi
swap cost:
C=d(p,Oi)-d(p,Oj)
PAM Clustering: Finding the
Best Cluster Center
• Case 2
P currently belongs to the cluster represented by Oj. But this time we
assume D(p,Oi) > D(p,Orandom). So, Then If Oj is replaced by Orandom,
p will belong to the cluster represented by Orandom
Cost Swap
C=d(p,Orandom)-d(p,Oj)
PAM Clustering: Finding the
Best Cluster Center
• Case 3
P currently belongs to a cluster represented by Oi instead of Oj,
Besides, D(p,Oi) < D(p,Orandom) . Then If Oj is replaced by Orandom, p
will still belong to the cluster represented by Oi
Cost Swap
C=0;
PAM Clustering: Finding the
Best Cluster Center
• Case 4
P currently belongs to cluster represented by Oi, But D(p,Oi) >
D(p,Orandom)
If replacing Oj with Orandom, p will belong to Orandom
Cost Swap
C=d(p,Orandom)-d(p,Oi)
PAM: A Typical K-Medoids Algorithm
Total Cost = 20
10
10
10
9
9
9
8
8
8
Arbitrary
choose k
object as
initial
medoids
7
6
5
4
3
2
7
6
5
4
3
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
0
10
1
2
3
4
5
6
7
8
9
10
Assign
each
remainin
g object
to
nearest
medoids
7
6
5
4
3
2
1
0
0
K=2
Until no
change
2
3
4
5
6
7
8
9
10
Randomly select a
nonmedoid object,Oramdom
Total Cost = 26
Do loop
1
10
10
Compute
total cost of
swapping
9
9
Swapping O
and Oramdom
8
If quality is
improved.
5
5
4
4
3
3
2
2
1
1
7
6
0
8
7
6
0
0
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
What Is the Problem with
PAM?
• Pam is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean
• Pam works efficiently for small data sets but does not
scale well for large data sets.
– O(k(n-k)2 ) for each iteration
where n is # of data,k is # of clusters
Sampling-based method
CLARA(Clustering LARge Applications)
72
CLARA (Clustering Large Applications)
(1990)
• CLARA (Kaufmann and Rousseeuw in 1990)
– Built in statistical analysis packages, such as SPlus
– It draws multiple samples of the data set, applies
PAM on each sample, and gives the best clustering
as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
73
CLARA
• Algorithm CLARA
Five Examples of size 40+2k
1. For i =1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set, and call Algorithm
PAM to find k medoids of the sample.
3. For each object Oj in the entire data set, determine which of the k medoids is the most
similar to Oj.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If
this value is less than the current minimum, use this value as the current minimum,
and retain the k medoids found in Step 2 as the best set of medoids obtained so far.
5. Return to Step 1 to start the next iteration
CLARANS (“Randomized” CLARA) (1994)
• CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
– Draws sample of neighbors dynamically
– The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
– If the local optimum is found, it starts with new randomly
selected node in search for a new local optimum
• Advantages: More efficient and scalable than both PAM
and CLARA
• Further improvement: Focusing techniques and spatial
access structures (Ester et al.’95)
75
CLARANS
• Find k medoids can be viewed abstractly as
searching through a certain graph.
• node is represented by a set of k objects
{O1,……Ok} which is selected medoids.
• Two nodes are neighbors if and only if their
sets differ by only one object. Each node
corresponds to a clustering and each node
can be assigned a cost which defines the
dissimilarity between every object and the
medoit
CLARANS
• Algorithm CLARANS
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and
mincost to a large number.
2. Set current to an arbitrary node in Gn,k.
3. Set j to 1.
4. Consider a random neighbor S of current, and based on cost swap
functtion, calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step 3.
6. Otherwise, increment j by 1. If j < maxneighbor, go to Step 4.
7. Otherwise, when j > maxneighbor, compare the cost of current with
mincost. If the former is less than mincost, set mincost to the cost of
current and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go
to Step 2.
Clustering Large Applications (CLARA)
Modification of PAM to deal with large data sets:
Instead of processing all data, sample it ,and apply PAM to the sample
K-Medoids clustering:
•
PAM
•
CLARA
Advantages:
• robustness, and
• interpretability
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
78
Fuzzy C-Means
How to deal (quantify) data that are in-between clusters?
Consider partial membership to clusters – emergence of fuzzy sets
elements with partial membership
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
79
Fuzzy C-Means
Partial membership in clusters – fuzzy partition matrix U
Objective function
c
N
m
Q    u ik
|| x k  v i ||2
i 1 k 1
U = [uik]; uik – degree of membership of k-th data to i-th cluster
m – fuzzification coefficient, m>1
||. || - distance function
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
80
Fuzzy C - Means: Optimization
c
Q
i 1
N
u
k 1
m
ik
|| x k  v i ||  Minprototypes,UU Q
Min Q with respect toprototypes
 prototypesQ  0
Q
that is
 0, i  1,2,...,c j  1,2,...,n
v ij
2
Min Q with respect topartitionmatrix
UQ  0
that is
Q
 0, i  1,2,...,c k  1,2,..., N
u ik
constraint: U is a partitionmatrix!!
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
81
Fuzzy C – Means: Calculations
c
min prototypes
N

i 1
|| x k  v i ||
m
u ik
2
k 1

vi
N

N
m

T
u ik ( x k  v i ) ( x k  v i )  2
k 1
m
uik ( x k  v i )
k 1
N
u
N
u
k 1
m
ik ( k
x  vi )  0
vi 
m
ik
xk
k 1
N

u mik
k 1
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
82
FCM: Algorithm
Initialize: select the number of clusters (c), stopping value (e), fuzzification coefficient (m).
The distance function is Euclidean or weighted Euclidean. The initial partition matrix consists of random
entries
Repeat
update prototypes
N
m
 u ik x k
v i  k 1N
m
 u ik
k 1
update partition matrix
u ik 
1
c  || x  v || 
k
i


l 1 || x  v || 
j
 k

2/(m 1)
until a certain stopping criterion has been satisfied
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
83
FCM – Algorithm
Design aspects
stopping criterion: termination of iterations
maxik | uik(iter+1) – uik(iter)|  
fuzzification coefficient (m) : m>1
Shape of the membership functions
m =2.0 – typical value
m close to 1 – set like shape of membership functions
m higher than 2.0 - spike like membership functions
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
84
Model-Based Clustering
Mixture of data as an underlying model
Each component described by some conditional probability
density function described by parameters
p(x| θ 1, θ 2,…, θ c) =
c
 p(x | θ )p
i 1
i
i
Parameters estimation of the mixture of data
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
85
Mixture of Data Model
Maximum likelihood estimation
Given data x1, x2, …, xN choose parameters such that
the value of the expression
N
P(X | θ)   p(x k | θ)
k 1
becomes maximized
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
86
Scalable Clustering
Clustering algorithms need to be scalable to deal
with large data sets
Example algorithms:
•Density-Based Clustering (DBSCAN)
•OPTICS
•DENCLUE
•CLIQUE
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
87
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such
as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
88
Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Epsneighbourhood of that point
• NEps(p):
{q belongs to D | dist(p,q) <= Eps}
• Directly density-reachable: A point p is directly densityreachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
p
– core point condition:
q
MinPts = 5
Eps = 1 cm
|NEps (q)| >= MinPts
April 8, 2015
Data Mining: Concepts and Techniques
89
Density-Based Clustering DBSCAN
Based on the concept of -neighborhood, N_Pts, and
density-based reachability

-neighborhood, denoted by N(xk), is given as
N(xk) = { x | d(x, xk)
 }
N_Pts – number of points falling within the neighborhood
xi is xk density-reachable with parameters  and N_Pts if
the following conditions are satisfied
(a) xi belongs to N(xk), and
(b) card (N(xk))  N_Pts
Then xi becomes a CORE point
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
90
DBSCAN: Density-based Reachability
xi is density-reachable from xk by a chain of data points
xk+1, xk+2, …, xi-1
N(xk)
N(x1)
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
91
DBSCAN Algorithm
Set up the parameters of the neighborhood,  and N_Pts
(a)
arbitrarily select a data point, say xk
(b)
find (retrieve) all data that are density reachable from xk
(c)
if xk is a core point, then the cluster has been formed (all points density
reachable from xk)
(d) otherwise consider xk to be a border point and move on to the next data
point
The sequence (a) – (d) is repeated until all data points have been processed.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
92
Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from
a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
p
p1
q
• Density-connected
– A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is
a point o such that both, p and q
are density-reachable from o w.r.t.
Eps and MinPts
April 8, 2015
p
q
o
Data Mining: Concepts and Techniques
93
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
• Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases
with noise
Outlier
Border
Eps = 1cm
Core
MinPts = 5
94
DBSCAN: The Algorithm
• Arbitrary select a point p
• Retrieve all points density-reachable from p w.r.t. Eps
and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
95
OPTICS: A Cluster-Ordering Method (1999)
• OPTICS: Ordering Points To Identify the Clustering
Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database wrt its
density-based clustering structure
– This cluster-ordering contains info equiv to the densitybased clusterings corresponding to a broad range of
parameter settings
– Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
– Can be represented graphically or using visualization
techniques
96
OPTICS: Some Extension from
DBSCAN
• Index-based:
• k = number of dimensions
• N = 20
• p = 75%
• M = N(1-p) = 5
D
– Complexity: O(kN2)
• Core Distance
p1
• Reachability Distance
o
Max (core-distance (o), d (o, p))
r(p1,
= 2.8cm. r(p2,o) = 4cm
April 8,o)2015
p2
o
MinPts = 5
Data Mining: Concepts and Techniques
 = 3 cm
97
DENCLUE: Using Statistical
Density Functions
f Gaussian ( x , y )  e

d ( x , y )2
2s 2
DENCLUE: Using Statistical
Density Functions
• Density Attractor
The local maxima of the overall density function
• Density Attracted
A point x is density attracted to a density attractor x* if there exists a set
of points x0,x1…,xk such that x0=x, xk=x* and the gradient of xi-1 is
in the direction of xi for 0<i<k
In General Points that are density attracted to x* may form a cluster
Denclue
• center-defined cluster
A center-defined cluster for a density attractor, x* is a subset of points, C
belongs to D ,that are density-attracted by x*, and where the density
function x* is no less than a threshold.
•
arbitrary-shape cluster
For a set of density attractor is a set of Cs, each being density-attracted to
its respective density attractor, where
(1) Density function value at each density-attractor is no less than a
threshold
(2) there exist a path P, from each density-attractor to another, where the
density function value for each point along the path is no less than the
threshold
100
Grid-Based Clustering
Describe structure in data in the language of generic
geometric constructs – hyperboxes and their combinations
Collection of clusters of
different geometry
Formation of clusters by
merging adjacent hyperboxes
of the grid
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
101
Grid-Based Clustering
Hyperboxes
{ B1, B2, …, Bp.} with two requirements:
a) Bi is nonempty in the sense it includes some data points,
b) the hyperboxes are disjoint that is Bi  Bj =  if p i  j,
c) a union of all hyperboxes covers all data that is  B i  X
i 1
where X = {x1, x2, …, xN}.
It is also required that such hyperboxes “cover” some maximal number
(say bmax) of data points.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
102
Grid-Based Clustering Steps






Formation of the grid structure
Insertion of data into the grid structure
Computation of the density index of each hyperbox of the grid
structure
Sorting the hyperboxes with respect to the values of their density
index
Identification of cluster centres (viz. the hyperboxes of the highest
density)
Traversal of neighboring hyperboxes and merging process
Choice of the grid:
too rough grid may not help capture the details of the structure in the
data.
too detailed grid produces a significant computational overhead.
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
103
Clustering High-Dimensional Data
•
Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
•
Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering
April 8, 2015
Data Mining: Concepts and Techniques
104
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal length
interval
– It partitions an m-dimensional data space into non-overlapping
rectangular units
– A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a
subspace
April 8, 2015
Data Mining: Concepts and Techniques
105
CLIQUE: The Major Steps
• Partition the data space and find the number of points that
lie inside each cell of the partition.
• Identify the subspaces that contain clusters using the
Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interests
– Determine connected dense units in all subspaces of
interests.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of
connected dense units for each cluster
– Determination of minimal cover for each cluster
April 8, 2015
Data Mining: Concepts and Techniques
106
40
50
20
30
40
50
age
60
Vacation
=3
30
Vacation
(week)
0 1 2 3 4 5 6 7
Salary
(10,000)
0 1 2 3 4 5 6 7
20
age
60
30
50
age
April 8, 2015
Data Mining: Concepts and Techniques
107
Strength and Weakness of
CLIQUE
• Strength
– automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
– insensitive to the order of records in input and does not
presume some canonical data distribution
– scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
– The accuracy of the clustering result may be degraded
at the expense of simplicity of the method
April 8, 2015
Data Mining: Concepts and Techniques
108
Reference
• Data Mining Concepts and Techniques,
edited by Jiawei Han
• Jiawei Han, CLARANS: A Method for
Clustering Objects for Spatial Data Mining
Download