Hierarchical Clustering

Two Types of Clustering
• Partitional algorithms: construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.

(Figure: a hierarchical clustering drawn as a dendrogram, next to a partitional clustering drawn as a flat grouping of the same objects.)

(How-to) Hierarchical Clustering

The number of possible dendrograms with n leaves is

    (2n − 3)! / [2^(n−2) · (n − 2)!]

Number of leaves:                2   3   4    5    ...   10
Number of possible dendrograms:  1   3   15   105  ...   34,459,425

Since we cannot test all possible trees, we will have to search the space of possible trees heuristically. We could do this in one of two ways:

Bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both halves.

We begin with a distance matrix which contains the distances between every pair of objects in our database. For example, for five objects:

         O1  O2  O3  O4  O5
    O1    0   8   8   7   7
    O2    8   0   2   4   4
    O3    8   2   0   5   5
    O4    7   4   5   0   3
    O5    7   4   5   3   0

so, for instance, D(O1, O2) = 8 and D(O4, O5) = 3.

A generic technique for measuring similarity

To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.

The distance between Patty and Selma:
• Change dress color: 1 point
• Change earring shape: 1 point
• Change hair part: 1 point
D(Patty, Selma) = 3

The distance between Marge and Selma:
• Change dress color: 1 point
• Add earrings: 1 point
• Decrease height: 1 point
• Take up smoking: 1 point
• Lose weight: 1 point
D(Marge, Selma) = 5

This is called the "edit distance" or the "transformation distance".

Agglomerative clustering algorithm
• The most popular hierarchical clustering technique.
• Basic algorithm:
  1. Compute the distance matrix between the input data points.
  2. Let each data point be a cluster.
  3. Repeat:
  4.   Merge the two closest clusters.
  5.   Update the distance matrix.
  6. Until only a single cluster remains.
• The key operation is the computation of the distance between two clusters; different definitions of the distance between clusters lead to different algorithms.

Bottom-up (agglomerative) clustering, step by step: at every iteration, consider all possible merges of the current clusters and merge the closest pair. Repeat until all clusters are fused together.

We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is not obvious. A sketch of the generic merge loop follows, and the standard definitions of cluster-to-cluster distance come after it.
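To make the merge loop concrete, here is a minimal sketch of naive agglomerative clustering in MATLAB, written for this document; the function name agglomerate is my own, and it hard-codes single linkage (the distance between the closest pair of objects drawn from the two clusters, defined formally in the next section) as the cluster-to-cluster distance. It illustrates the generic algorithm; it is not the implementation behind MATLAB's linkage function.

    function merges = agglomerate(D)
    % Naive agglomerative clustering over a precomputed n-by-n
    % symmetric distance matrix D. Returns the pair of cluster
    % labels merged at each step, one row per merge.
    n = size(D, 1);
    labels = 1:n;              % current cluster label of each object
    merges = zeros(n - 1, 2);
    for step = 1:n - 1
        best = inf; bi = 0; bj = 0;
        active = unique(labels);
        % Consider all possible merges of the current clusters...
        for a = 1:numel(active)
            for b = a + 1:numel(active)
                i = active(a); j = active(b);
                % Single linkage: distance of the closest pair of
                % objects, one drawn from each cluster.
                d = min(min(D(labels == i, labels == j)));
                if d < best
                    best = d; bi = i; bj = j;
                end
            end
        end
        % ...and merge the closest pair.
        merges(step, :) = [bi, bj];
        labels(labels == bj) = bi;
    end
    end

On the 5-object distance matrix above, this first merges objects 2 and 3 (distance 2), then objects 4 and 5 (distance 3), then those two clusters (distance 4), and finally object 1 joins at distance 7.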
• Single linkage (nearest neighbor): the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.

(Figure: dendrograms of the same 30 objects under single linkage and under average linkage.)

Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of the results is (very) subjective.

Hierarchical Clustering in Matlab

Given the following points: (1, 2), (2.5, 4.5), (2, 2), (4, 1.5), (4, 2.5). Build a hierarchical clustering of these points.

Answer: the matrix X stores these points. Next, compute the distance between points 1 and 2, between points 1 and 3, and so on, until the distance between every pair of points is known. The MATLAB function for this is pdist. To make the resulting distance matrix Y easier to read, it can be transformed into its square form, where element (1,1) is the distance from point 1 to itself, namely 0, and so on; this is what the squareform function does. Next, perform the hierarchical clustering with the linkage function. The result matrix Z is read as follows:

Row 1: objects 4 and 5, which are at distance 1, are clustered.
Row 2: objects 1 and 3, which are at distance 1, are clustered.
Row 3: object 6 (the cluster from row 1) and object 7 (the cluster from row 2) are clustered; they are at distance 2.0616.
Row 4: object 2 and object 8 (the cluster from row 3) are clustered; they are at distance 2.5.

(This is shown more clearly in the accompanying figure.)

Create the dendrogram from the result matrix Z with dendrogram(Z), which produces the dendrogram figure.

(Figure: the resulting dendrogram.)
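The MATLAB session itself appears in the original only as screenshots. A minimal sketch reproducing the steps described above, assuming the defaults that match the quoted numbers (Euclidean distance for pdist and single linkage for linkage):

    % The five points from the exercise, one per row
    X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];

    % Pairwise distances (Euclidean by default), as a condensed vector
    Y = pdist(X);

    % Square form for readability: element (i,j) is the distance
    % between points i and j, so the diagonal is all zeros
    S = squareform(Y)

    % Hierarchical clustering; the default method is single linkage
    Z = linkage(Y)

    % Each row of Z names the two merged clusters and their distance;
    % the first row, [4 5 1.0000], says points 4 and 5 merge at distance 1.
    dendrogram(Z)    % plot the hierarchy as a dendrogram

If flat clusters are needed afterwards, cluster(Z, 'maxclust', 2), for example, cuts the tree into two groups (here: point 2 on its own versus the remaining four points).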