Machine Learning_Assignment4

Mohammad Rameez Shafsad
Student ID: 1359924
Machine Learning Assignment 4

Metaclustering

Most clustering methods focus on finding an optimal or near-optimal clustering of the data. However, users cannot always specify in advance the clustering criterion they need. When the first clustering is not what they want, they have to change it by guessing at a suitable distance metric, which is a tedious task. The paper proposes a solution to this problem, known as metaclustering.

Metaclustering is done in three steps. First, a number of potentially useful clusterings are generated. Then the similarity between each pair of clusterings is measured using a distance metric. Finally, the clusterings are themselves clustered using the computed similarities.

Two different approaches are used to generate diverse clusterings. The first uses k-means clustering. Each cluster mean is initialized at a random point, every data point is assigned to the cluster with the nearest mean, and the means are then updated. This is repeated until no point changes cluster or the maximum number of iterations is reached. In practice, k-means is run several times and the clustering with the smallest sum of squared distances is taken as the final result. The second approach uses feature weights. Clustering repeatedly with random feature weights makes it possible to find better clusterings with the same clustering algorithm. The feature weights are drawn from a Zipf power-law distribution. When the data consists of pairwise similarities, MDS (multidimensional scaling) is used to convert it into feature-vector format.

The similarity between clusterings is measured using the Rand index. For a pair of points i and j, I_ij is defined as 1 if the two points are in the same cluster and 0 if they are in different clusters. This similarity metric is referred to as Cluster Difference. Once the distances between clusterings have been calculated, the clusterings are themselves clustered.
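The distance between clusterings described above can be made concrete with a short sketch. It assumes that the distance between two clusterings is the fraction of point pairs on which their I_ij indicators disagree, which is one minus the Rand index. The function names same_cluster and clustering_difference are illustrative and are not taken from the paper.

```python
from itertools import combinations


def same_cluster(labels, i, j):
    """I_ij: 1 if points i and j share a cluster under `labels`, else 0."""
    return 1 if labels[i] == labels[j] else 0


def clustering_difference(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings disagree.

    Assumption: this equals 1 - Rand index. Each argument is a list of
    cluster labels, one per data point, with points in the same order.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two points to form a pair")

    pairs = list(combinations(range(n), 2))
    disagreements = sum(
        same_cluster(labels_a, i, j) != same_cluster(labels_b, i, j)
        for i, j in pairs
    )
    return disagreements / len(pairs)
```

Because only co-membership is compared, renaming the clusters does not change the result: clustering_difference([0, 0, 1, 1], [1, 1, 0, 0]) is 0.0. Looping over all pairs is quadratic in the number of points, which is acceptable for a sketch but slow for large datasets.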
The paper performs this meta-level clustering with agglomerative clustering because it works with pairwise similarity data, does not require the number of clusters to be specified, and produces a hierarchy that is easy to navigate.

Metaclustering is then tested on six datasets. The Australia data contains measurements from every half degree of longitude and latitude along the Australian coastline. The Bergmark data has auxiliary labels, which are the 25 web crawls used to collect the data. The Covertype data contains cartographic variables for 30 x 30 meter grid cells. The Letters data is a subset of the ISOLET spoken letter data. The Protein data consists of the pairwise similarities between 639 proteins. The Phoneme data records 11 phonemes of 15 speakers.

Two clustering performance metrics, compactness and accuracy, are measured. Compactness measures the distances between points in the same cluster. Accuracy is measured relative to each auxiliary label.

Several parameters are varied to study behavior on the datasets. The first is the Zipf power-law parameter α. Next, the effect of PCA95 on accuracy and compactness is measured. The datasets are then evaluated with both local minima and feature weighting used for the initial clusterings. Clustering methods other than k-means and feature weighting are also tried and their results measured. Finally, the resulting clusterings are clustered at the meta level using agglomerative clustering.

The experiments suggest that a higher Zipf distribution parameter gives better clusterings, and that applying PCA before or after feature weighting gives different results on different datasets. On the above datasets, metaclustering proved better than other clustering techniques. A case study applies metaclustering to the Protein dataset to find groups that contain as many proteins with the same structure as possible.
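The meta-level step described above, agglomerative clustering over pairwise distances between clusterings, can be sketched as follows. This is a minimal illustration and not the paper's implementation. Average linkage is an assumption, since the summary does not say which linkage is used, and the function name agglomerate is hypothetical.

```python
def agglomerate(dist):
    """Average-link agglomerative clustering over a symmetric distance matrix.

    `dist[i][j]` is the distance between items i and j (here, between two
    base-level clusterings). Returns the merge history as a list of
    (members_a, members_b, linkage_distance) tuples, closest merge first.
    Assumption: average linkage; the source does not name a linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []

    while len(clusters) > 1:
        best = None
        # Find the pair of current groups with the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)

        avg, a, b = best
        merges.append((clusters[a], clusters[b], avg))
        merged = clusters[a] + clusters[b]
        # Replace the two merged groups with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)

    return merges
```

No number of clusters is passed in. The merge history is the hierarchy, and a user can cut it at any level to get a coarser or finer grouping of the clusterings.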
The Protein case study showed that metaclustering finds qualitatively different ways to cluster the proteins at the meta level. Another case study is conducted on the Phoneme dataset.

Metaclustering is a very useful clustering method when there is no prior knowledge of the number of clusters to use. Depending on the dataset, various parameters and methods can be used within metaclustering to tune its performance. Metaclustering is effective, but it is computationally expensive.

Reference
1. Caruana, Elhawary, et al., Metaclustering, Cornell University.
