Machine Learning_Assignment4

Mohammad Rameez Shafsad
Student Id: 1359924
Machine Learning Assignment 4
Metaclustering
Most clustering methods focus on finding a single optimal or near-optimal clustering of the data.
However, users cannot always specify in advance the clustering criterion they need. They may have to
revise the initial criterion repeatedly by guessing at an appropriate distance metric, which is tedious.
The paper proposes a solution to this problem, known as meta clustering.
Meta clustering is done in three steps. First, a number of potentially useful clusterings are generated.
Then the similarity between each pair of clusterings is measured using a distance metric. Finally, the
clusterings are themselves clustered using the computed similarities. Two different approaches are used
for generating diverse clusterings. The first uses k-means clustering, in which each mean is initialized
to a random data point. Each data point is assigned to the cluster with the nearest mean, and the means
are then updated. This process is repeated until a maximum number of iterations is reached or no point
changes its cluster. In practice, k-means is usually run several times and the clustering with the
smallest sum of squared distances is taken as the final result. The second approach uses feature weights.
Clustering several times with random feature weights can produce different clusterings from the same
clustering algorithm. The feature weights are drawn from a Zipf power-law distribution. For data given as
pairwise similarities, multidimensional scaling (MDS) is used to convert it into feature-vector format.
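The generation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`zipf_weights`, `kmeans`, `diverse_clusterings`) are hypothetical, the weights are produced by assigning power-law values 1 / rank^alpha to randomly ordered features (one simple way to get Zipf-shaped weights, assumed here), and the weights are applied by scaling each feature before running plain k-means.

```python
import random

def zipf_weights(n_features, alpha, rng):
    # Assumption: the feature placed at random rank r gets weight 1 / r**alpha.
    # alpha = 0 gives uniform weights; larger alpha concentrates weight on few features.
    ranks = list(range(1, n_features + 1))
    rng.shuffle(ranks)
    return [1.0 / (r ** alpha) for r in ranks]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    # Initialize each mean to a randomly chosen data point.
    means = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, means[c]))
                      for p in points]
        if new_labels == labels:  # no point changed its cluster
            break
        labels = new_labels
        # Update each mean to the centroid of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def diverse_clusterings(points, k, n_runs, alpha, seed=0):
    # Each run uses fresh random feature weights and a fresh random initialization.
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        w = zipf_weights(len(points[0]), alpha, rng)
        scaled = [[x * wi for x, wi in zip(p, w)] for p in points]
        results.append(kmeans(scaled, k, rng))
    return results
```

Calling `diverse_clusterings(points, k=2, n_runs=5, alpha=1.0)` returns five label lists, one per run, which serve as the base clusterings for the later steps.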
Similarity between clusterings is measured using a metric related to the Rand index. For each pair of
points i and j, I_ij is defined as 1 if the two points are in the same cluster and as 0 if they are in
different clusters. Two clusterings are then compared by the fraction of point pairs on which their I_ij
values disagree. This similarity metric is referred to as Cluster Difference. Once the distances are
calculated, the clusterings are themselves clustered. The paper does this with agglomerative clustering
because it works with pairwise similarity data, does not require the number of clusters to be specified,
and produces a hierarchy that is easy to navigate.
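As an illustration of the pair-counting idea behind this metric, the sketch below computes the fraction of point pairs on which two clusterings disagree, which equals one minus the Rand index. The function name `cluster_difference` and the normalization by the number of pairs are assumptions made for this example.

```python
from itertools import combinations

def cluster_difference(labels_a, labels_b):
    # labels_a[i] and labels_b[i] are the cluster ids of point i in each clustering.
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    if n < 2:
        return 0.0
    disagreements = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]  # I_ij in the first clustering
        same_b = labels_b[i] == labels_b[j]  # I_ij in the second clustering
        if same_a != same_b:
            disagreements += 1
    # Normalize by the number of point pairs, n * (n - 1) / 2.
    return disagreements / (n * (n - 1) / 2)
```

Because only co-membership is compared, renaming the clusters does not change the result: `cluster_difference([0, 0, 1, 1], [1, 1, 0, 0])` is 0.0.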
Meta clustering is then tested on six datasets. The Australia data contains measurements taken at
every half degree of longitude and latitude along the Australian coastline. The Bergmark data has
auxiliary labels, which are the 25 web crawls used to collect the data. The Covertype data contains
cartographic variables for 30 x 30 meter grid cells. The Letters data is a subset of the ISOLET spoken
letter data. The Protein data consists of the pairwise similarities between 639 proteins. The Phoneme
data records 11 phonemes from 15 speakers. Two clustering performance metrics, compactness and
accuracy, are measured. Compactness measures the distances between points in the same cluster.
Accuracy is measured relative to each auxiliary label.
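The two metrics can be sketched as below. The exact formulas are not given in this summary, so both definitions are assumptions: compactness is taken as the mean Euclidean distance over all pairs of points that share a cluster, and accuracy is taken as the fraction of points whose auxiliary label matches the majority label of their cluster.

```python
import math
from collections import Counter
from itertools import combinations

def compactness(points, labels):
    # Assumed definition: mean distance over all same-cluster pairs (lower is more compact).
    total, count = 0.0, 0
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        if lp == lq:
            total += math.dist(p, q)
            count += 1
    return total / count if count else 0.0

def accuracy(labels, aux_labels):
    # Assumed definition: each cluster predicts its most frequent auxiliary label.
    by_cluster = {}
    for c, y in zip(labels, aux_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return correct / len(labels)
```

For example, with clusters `[0, 0, 1, 1]` and auxiliary labels `['a', 'a', 'b', 'a']`, the first cluster is fully consistent and the second is split, giving an accuracy of 0.75.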
Several parameters are varied to study their effect on each dataset. The first is the Zipf power-law
distribution parameter α. Next, PCA95 is applied and its effect on accuracy and compactness is
measured. The contributions of local minima and of feature weighting to the initial clusterings are
then compared. Clustering methods other than k-means with feature weighting are also tried, and their
results are measured in the same way. The clusterings are then clustered at the meta level using
agglomerative clustering. The experiments suggest that a higher Zipf distribution parameter gives
better clusterings, and that applying PCA before or after feature weighting gives results that vary
from dataset to dataset. Meta clustering performed better than the other clustering techniques on the
above datasets.
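The meta-level step can be illustrated with a small agglomerative procedure that works directly on a matrix of pairwise distances between clusterings. This is a sketch under stated assumptions: the function name `agglomerate` is hypothetical, and average linkage is chosen for the example because the summary does not say which linkage the paper uses.

```python
from itertools import combinations

def agglomerate(dist):
    # dist: symmetric n x n list of lists of pairwise distances between clusterings.
    # Returns the merge history as (cluster_id_a, cluster_id_b, distance) tuples.
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []

    def avg_link(a, b):
        # Average distance between all members of cluster a and cluster b.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters; no target number of clusters is needed.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda ab: avg_link(*ab))
        d = avg_link(a, b)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

The returned merge history describes the full hierarchy, so a user can cut it at any level to browse groups of similar clusterings.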
A case study on the Protein dataset uses meta clustering to find groups that contain as many proteins
with the same structure as possible. It showed that meta clustering found qualitatively different ways
to cluster the proteins at the meta level. Another case study is conducted on the Phoneme dataset.
Meta clustering is a useful clustering method when there is no prior knowledge of the number of
clusters to use. Various parameters and methods can be chosen in meta clustering, depending on the
dataset, to tune its performance. Meta clustering is an effective clustering method, but it is
computationally expensive.
Reference
1. Caruana, Elhawary, et al., Meta Clustering, Cornell University