
Hierarchical Clustering
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion
[Figure: an example hierarchical clustering (dendrogram) and an example partitional clustering of the same data]
(How-to) Hierarchical Clustering
The number of possible dendrograms with n leaves =
(2n − 3)! / [2^(n − 2) · (n − 2)!]
(checked numerically after the table below)
Number of Leaves    Number of Possible Dendrograms
       2                             1
       3                             3
       4                            15
       5                           105
      ...                          ...
      10                    34,459,425
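As a quick sanity check of the formula, a minimal MATLAB snippet (exact only while the factorials stay within double-precision range):

    n = 10;
    factorial(2*n - 3) / (2^(n - 2) * factorial(n - 2))
    % ans = 34459425, matching the last row of the table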
Since we cannot test all possible trees, we must resort to a heuristic
search over the space of possible trees. We could do this in one of two ways:
Bottom-Up (agglomerative): Starting
with each item in its own cluster, find
the best pair to merge into a new
cluster. Repeat until all clusters are
fused together.
Top-Down (divisive): Starting with all
the data in a single cluster, consider
every possible way to divide the cluster
into two. Choose the best division and
recursively operate on both sides.
We begin with a distance matrix which
contains the distances between every
pair of objects in our database.
For example, D( · , · ) = 8 and D( · , · ) = 3 (the compared objects appear as images on the slide). The matrix itself, with a zero diagonal and the upper triangle shown (it is symmetric):

    0   8   8   7   7
        0   2   4   4
            0   5   5
                0   3
                    0
A generic technique for measuring similarity
To measure the similarity between two objects, transform one
of the objects into the other, and measure how much effort it
took. The measure of effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color,   1 point
  Change earring shape, 1 point
  Change hair part,     1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
  Change dress color,  1 point
  Add earrings,        1 point
  Decrease height,     1 point
  Take up smoking,     1 point
  Lose weight,         1 point
D(Marge, Selma) = 5
This is called the “edit
distance” or the
“transformation distance”
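A minimal sketch of this idea in MATLAB for objects described by categorical features, with hypothetical encodings standing in for the cartoon attributes above (1 point per changed feature):

    % Hypothetical feature values: dress color, earring shape, hair part
    patty = {'green', 'round', 'left'};
    selma = {'blue',  'drop',  'right'};
    D = sum(~strcmp(patty, selma))   % counts differing features: D = 3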
Agglomerative clustering algorithm
• Most popular hierarchical clustering technique
• Basic algorithm (a runnable sketch follows this list):
  1. Compute the distance matrix between the input data points
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains
• The key operation is the computation of the distance between two clusters
  – Different definitions of the distance between clusters lead to different algorithms
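As a from-scratch sketch of the basic algorithm above (not MATLAB's built-in linkage, which is used later), here is a naive agglomerative routine that takes a precomputed distance matrix and uses single linkage to define the distance between clusters. The function name and bookkeeping are illustrative assumptions:

    function merges = naiveAgglomerative(D)
    % Minimal sketch: D is an n-by-n symmetric distance matrix with a
    % zero diagonal. Returns one row [labelA labelB distance] per merge.
    % Unlike linkage, the merged cluster keeps the first cluster's label.
    n = size(D, 1);
    labels  = 1:n;               % current cluster labels
    members = num2cell(1:n);     % step 2: each data point is its own cluster
    merges  = zeros(n - 1, 3);
    for step = 1:n-1             % steps 3-6: repeat until one cluster remains
        best = inf;
        for i = 1:numel(members)-1        % step 4: find the two closest clusters
            for j = i+1:numel(members)
                % single linkage: smallest pairwise distance between clusters
                d = min(min(D(members{i}, members{j})));
                if d < best, best = d; bi = i; bj = j; end
            end
        end
        merges(step, :) = [labels(bi), labels(bj), best];
        members{bi} = [members{bi}, members{bj}];  % merge the closest pair
        members(bj) = [];                          % step 5: update bookkeeping
        labels(bj)  = [];
    end
    end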
[Figure sequence: the agglomerative process step by step. Starting with each item in its own cluster, consider all possible merges, choose the closest pair, and merge it into a new cluster; repeat until all clusters are fused together.]
We know how to measure the distance between two objects, but defining the
distance between an object and a cluster, or between two clusters, is not
obvious. Three standard definitions are listed below (a Matlab sketch follows the list).
• Single linkage (nearest neighbor): In this method the distance between
two clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distances
between clusters are determined by the greatest distance between any two
objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two
different clusters.
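These three definitions map directly onto MATLAB's linkage method options (a sketch, assuming X is any n-by-d data matrix):

    Y  = pdist(X);                 % pairwise distances
    Zs = linkage(Y, 'single');     % nearest neighbor
    Zc = linkage(Y, 'complete');   % furthest neighbor
    Za = linkage(Y, 'average');    % group average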
[Figure: dendrograms of the same data produced by single linkage and by average linkage]
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity is at least O(n²), where n is the
total number of objects.
• As with any heuristic search algorithm, local optima are a problem.
• Interpretation of the results is (very) subjective.
Hierarchical Clustering in Matlab
Given the following points:
(1, 2) (2.5, 4.5) (2, 2) (4, 1.5) (4, 2.5)
Build a hierarchical clustering of these points.
Answer:
The matrix X stores these points.
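The slide shows X as a screenshot; a minimal reconstruction of that assignment:

    % One row per point
    X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];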
Next, compute the distance between points 1 and 2, points 1 and 3, and so on,
until the distance for every pair of points is known. The Matlab function for
this is pdist.
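A sketch of this step (pdist defaults to Euclidean distance and returns the pairwise distances as a row vector):

    Y = pdist(X);
    % For the points above, Y should be approximately:
    % 2.9155  1.0000  3.0414  3.0414  2.5495  3.3541  2.5000  2.0616  2.0616  1.0000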
To make the distance vector Y easier to read, it can be transformed into a
square matrix as follows (element (1,1) is the distance from point 1 to itself, i.e. 0, and so on).
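The slide does not name the function it uses, but MATLAB's squareform performs exactly this transformation:

    D = squareform(Y);   % D(i,j) = distance between point i and point j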
Next, perform hierarchical clustering with the linkage function.
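A sketch of this step (linkage defaults to single linkage, which matches the results read off below):

    Z = linkage(Y);
    % Per the reading below, Z should be:
    %     4.0000   5.0000   1.0000
    %     1.0000   3.0000   1.0000
    %     6.0000   7.0000   2.0616
    %     2.0000   8.0000   2.5000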
The result matrix Z is read as follows:
Row 1: objects 4 and 5, which are at distance 1, are clustered.
Row 2: objects 1 and 3, which are at distance 1, are clustered.
Row 3: object 6 (the cluster from row 1) and object 7 (the cluster from row 2)
are clustered; they are at distance 2.0616.
Row 4: object 2 and object 8 (the cluster from row 3) are clustered; they are
at distance 2.5.
This can be seen more clearly in the figure above.
Create a dendrogram from the result matrix Z:
    dendrogram(Z)
which produces the following dendrogram figure.