Report on IASc Summer Research Fellowship Program (SRFP) 2015 at IIT, Delhi

The two-month Summer Research Fellowship Programme (SRFP) 2015 for students and faculty in engineering institutions is organized every year by the Indian Academy of Sciences (IASc), Bangalore. In 2015, I was selected by the Academy to work at the Indian Institute of Technology, Delhi. With the prior permission of the Hon. Principal and the HOD-IT, I successfully completed the fellowship work under the guidance of Dr. B. K. Panigrahi of the Department of Electrical Engineering at IIT Delhi during May-July 2015.
The main aim of this research fellowship programme offered by the Indian Academy of Sciences every year is to nurture research skills in students and faculty across various engineering disciplines. During this period, I worked in the field of web mining. I am glad to state that I successfully implemented a project entitled "A Novel Approach for Data Clustering using Improved Markovian and Z-Score based Technique" during my stay at IIT Delhi.
The main objective of the task assigned to me was performance optimization of existing clustering algorithms and improvement of their results. During my stay at IIT Delhi I studied the need for optimization in data clustering and was able to optimize the existing clustering techniques. I also got the opportunity to attend a couple of Continuing Education Programmes (CEPs) and expert talks during this tenure. In particular, I attended an expert talk by Dr. Dipankar Chatterjee, Professor at IISc Bangalore, who addressed all IASc research fellows on 22 June 2015 at the Indian National Science Academy (INSA), Delhi, on various insights and career opportunities in research across disciplines.
The opportunity to learn concepts in evolutionary computing from one of the pioneers of this domain, Dr. Panigrahi, was truly fruitful and heartening. This fellowship provided an ideal platform for learning and experiencing the culture at IIT, which will surely help me in shaping my research career in the coming days.
A Report On
"A Novel Approach for Data Clustering using Improved Markovian and Z-Score based Technique"
Submitted to the Indian Academy of Sciences (IASc), Bangalore in fulfillment of
the requirements for Summer Research Fellowship for the duration of May-July
2015
By
Mr. Shah Sahil Kailas
Application ID: ENGT143
Vidya Pratishthan’s College of Engineering, Baramati-413133
Under the guidance of
Dr. B. K. Panigrahi
bkpanigrahi@ee.iitd.ac.in
Department of Electrical Engineering
Indian Institute of Technology Delhi
Hauz Khas, New Delhi – 110016
India
Clustering Techniques in Data Mining
In today's era, in almost every field there is a need to group things so that they can be scanned and analyzed quickly. As the volume of data items spread across various domains grows exponentially, groups of such data items can be used for their in-depth analysis.
Clustering is the technique of grouping data items (objects, object descriptions, features, etc.) into various groups/clusters so that data items with similar feature sets become members of the same cluster. In short, clustering forms clusters of data objects such that objects belonging to the same cluster are similar while those belonging to different clusters are dissimilar.
Example:
Animal taxonomy represents various classes of animals such as reptiles, mammals, etc. Other examples include plant taxonomy, bioinformatics, and the grouping of locations based on geographic area.
Grouping of data objects can be carried out in two ways:
1. Supervised classification
2. Unsupervised classification, a.k.a. clustering
1. Classification
This type of supervised learning involves assigning new data items to existing classes by predicting/learning from previously labelled examples. In this technique, various predefined classes with important feature sets are presumed, and on their basis new data items are classified.
Ex: A new species of dog can easily be classified into the class of dogs based on prediction/learning.
This technique requires the assumption of existing classes, and learning must be performed on that basis.
2. Clustering
This type of unsupervised classification involves the formation of clusters/groups based on the properties of data objects and their dissimilarities with each other. No predefined classes exist. Objects can be grouped into the same cluster if their distance/dissimilarity value is very small. In cluster analysis, a group of objects is split into a number of more or less homogeneous sub-groups on the basis of an often subjectively chosen measure of similarity, such that the similarity between objects within a sub-group is larger than the similarity between objects belonging to different sub-groups.
Example: Various data items such as websites, information pages, documents, etc. can be clustered into different clusters depending upon their features.
Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). A cluster can be described by considering internal homogeneity and external separation, i.e., patterns in the same cluster should be similar to each other, while patterns in different clusters should not.
Mathematical Model for Clustering
Given a set of data objects
S = {s1, s2, …, sN}
where
si: i-th data object consisting of various features {si1, si2, …, sin}
N: number of data objects to be clustered
form K clusters/partitions C = {C1, C2, …, CK}, K ≤ N, such that
1. Ci ≠ Φ, i = 1 to K
2. C1 ∪ C2 ∪ … ∪ CK = S
3. Ci ∩ Cj = Φ, i ≠ j
In the case of hard clustering, the third condition must hold, while in the case of soft clustering, or when overlapping is allowed, it can be relaxed.
While forming any cluster, the data objects must satisfy and optimize some objective function. The objective function is mostly given in terms of a dissimilarity/distance function.
The distance function can be computed using any of the following three metrics (a small sketch follows the list):
1. Euclidean distance
2. Manhattan distance
3. Minkowski distance
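As an illustration, here is a minimal Python sketch of these three distance measures; the function names and the use of NumPy are choices made for this report write-up only.

```python
import numpy as np

def euclidean(x, y):
    # Straight-line (L2) distance between two feature vectors
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # City-block (L1) distance: sum of absolute coordinate differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def minkowski(x, y, p=3):
    # Generalization of the two above; p = 2 gives Euclidean, p = 1 gives Manhattan
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

# Example: distances between two 3-dimensional data objects
a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=2))
```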
Clustering Process
Figure 1: Clustering Process [3]
The clustering process consists mainly of four stages:
1. Feature Extraction
Data objects to be clustered need to be pre-processed in order to extract only the selective features which are used in further processing. Selected features can be of two types, symmetric or asymmetric. Generally, ideal features should help in distinguishing patterns belonging to different clusters, be immune to noise, and be easy to extract and interpret.
2. Selection of Clustering Algorithm
This is a vital stage in the entire clustering process; it involves selection of a proper clustering technique from the wide variety of available clustering algorithms. The selection can be made depending on various parameters such as the nature, type and size of the data objects. This step in turn involves selection of a proximity/similarity measure, which plays a part in the subsequent process.
3. Cluster Validation
Given a data set, each clustering algorithm can always generate a division, no matter whether the structure
exists or not. Moreover, different approaches usually lead to different clusters; and even for the same
algorithm, parameter identification or the presentation order of input patterns may affect the final results.
Therefore, effective evaluation standards and criteria are important to provide the users with a degree of
confidence for the clustering results derived from the used algorithms. These assessments should be
objective and have no preferences to any algorithm. Also, they should be useful for answering questions
like how many clusters are hidden in the data, whether the clusters obtained are meaningful or just an
artefact of the algorithms, or why we choose some algorithm instead of another.
4. Results Interpretation
The crucial goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively solve the problems encountered. Interpretation involves experimentation on the obtained results in order to check the feasibility of the generated solution.
Clustering algorithms can be categorized into the following major types:
1. Partitioning Methods
This technique needs a user-defined parameter, the number of clusters K, at the start of the algorithm. The goal is to find K partitions of the given data items which satisfy the following properties:
a) Each group must contain at least one object.
b) Each object must belong to exactly one group.
The technique uses an iterative method which tries to relocate data items to different partitions at each iteration, depending upon the minimal distance between objects inside one cluster. The two representative algorithms in this category are listed below, followed by a short sketch of the first:
1. K-means
2. K-medoids
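A minimal K-means sketch in Python is given below for illustration; the random initialization and the simple convergence test are simplifications assumed here, not an implementation taken from any particular library.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # X: (n, d) array of data objects; k: user-defined number of clusters
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every object to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each centroid to the mean of the objects assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```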
2. Hierarchical Methods
This method creates a hierarchical decomposition of the data objects in either an agglomerative or a divisive way, i.e., it either merges data items in a bottom-up (agglomerative) manner or splits them into smaller individual clusters in a top-down (divisive) manner. Once a merge or split has been performed, it is hard to undo in this type of algorithm. Moreover, the user needs to specify the number of levels for refinement of the clustering results.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Clustering Using Representatives (CURE) are the most commonly used techniques in this category.
3. Density-based Methods
In this approach, clusters are grown, i.e., new data objects are added to a cluster as long as the number of data objects in the neighbourhood exceeds some predefined limit. In short, for each data object within a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. This approach is normally used to handle outliers: objects that are close to each other are enclosed within one cluster, while isolated points are left out.
DBSCAN and Ordering Points To Identify the Clustering Structure (OPTICS) are the main variants of density-based clustering.
4. Grid-based Methods
This method quantizes the data object space into a finite number of cells which form a grid structure. All clustering-related operations are then performed on this grid structure. This gives fast processing times, as the only requirement is to transform the given data objects into the grid structure.
STING, CLIQUE and WaveCluster are prominent clustering algorithms in this category.
5. Model-based Methods
This type of clustering involves finding a best/optimal-fit model for a given data set using some heuristics. Model-based approaches can directly find the final number of clusters using an iterative method that runs until the model description is satisfied. Minimum Description Length (MDL) based methods and Expectation-Maximization (EM) are a few examples of algorithms in this category.
New Approach: Data Clustering using the Markov Clustering Algorithm (MCL) and Page Ranking
For partitioning of larger graphs, there is always a requirement for clustering algorithms capable of producing optimal results. The MCL algorithm is a graph-based clustering algorithm due to Stijn van Dongen (Van Dongen, S. (2000), Graph Clustering by Flow Simulation, PhD thesis, University of Utrecht, the Netherlands) [2].
The algorithm makes use of the Markov chain concept and the theory of Markov (stochastic) matrices to derive the final clustering structure when a graph is provided as input. The Markov Cluster process (abbreviated MCL process) defines a sequence of stochastic matrices by alternation of two operators on a generating matrix, and this process forms the foundation of the graph clustering. The MCL algorithm is based on the popular graph paradigm that "natural" groups in graphs, the groups that are sought but not known in advance, have the following property: a random walk in G that visits a dense cluster will likely not leave the cluster until many of its vertices have been visited.
MCL makes use of this idea while forming the cluster structure. This is achieved by promoting flow within regions of the graph where the connections are strong and minimizing flow outside these dense regions.
Figure 2: Sample Undirected Graph
Considering any graph, it is natural that nodes which belong together share a larger number of links among themselves than with other nodes. In the figure above, one can see the natural tendency of clusters: intra-cluster similarity is high compared with inter-cluster similarity. The figure demonstrates this paradigm clearly: if we assume two clusters connected by a single link, then many more nodes are strongly connected inside each cluster.
This means that if one starts at some node and then randomly travels to a connected node, one is more likely to stay within a cluster than to travel between clusters, as there are more links within the cluster. By doing random walks on the graph, it is possible to discover where the flow tends to gather and therefore where the clusters should be formed.
These random walks on a graph are calculated using Markov chains, i.e., stochastic matrices.
Working Example
As shown in the following figure, we can visually identify that there can be 2 clusters having 5 and 3 nodes respectively. The same result can be obtained using MCL as follows:
Figure 3: Sample Undirected Graph
In one time step, a random walker at node 1 has a 33% chance of going to node 2, 3 or 4, and a 0% chance of going to nodes 5, 6 or 7. From node 2, there is a 25% chance of going to 1, 3, 4 or 5, and 0% for 6 and 7. These random walk probabilities can be captured in the following transition (probability) matrix:
One can easily note that the above matrix is column stochastic, i.e., the sum of every column equals one. This is also known as a Markov chain, and it gives us the probabilities of random walks between the various nodes of the input graph. This matrix can be treated as a Markov matrix.
After forming the Markov matrix, the MCL algorithm involves two important stages, namely expansion and inflation. These two steps are carried out iteratively until a convergence criterion is achieved. Most of the time this convergence criterion is expressed in terms of residual energy. As per the MCL process, flow is easier within dense regions than across sparse boundaries; however, in the long run this effect may disappear.
During the earlier powers of the Markov chain, the edge weights are higher for links that lie within clusters and lower for links between clusters. This indicates a correspondence between the distribution of weight over the columns and the clusterings: the more the weight is concentrated in a few cells of a column, the stronger the hint towards a clustering. MCL exploits this by introducing an operator called inflation, whose job is to deliberately boost the clustering process by strengthening strong neighbours (intra-cluster similarity) and weakening inter-cluster similarity.
MCL Expansion [2]
Expansion means multiplying the Markov matrix by itself, in other words squaring the current Markov matrix. In this stage the future probabilities depend only on the existing/current probabilities. This step can be thought of as spreading the existing transition probabilities across the network/graph: the expansion operator is responsible for allowing flow to connect different regions of the graph.
MCL Inflation [2]
As noted above, inflation deliberately boosts the clustering process by strengthening strong intra-cluster connections and weakening inter-cluster connections.
Inflation can be defined as follows: given a matrix M, M ≥ 0, and a real non-negative number r, the matrix resulting from raising each entry of every column of M to the power r and then renormalizing each column is called TrM; Tr is called the inflation operator and r the inflation parameter. In symbols,
(TrM)pq = (Mpq)^r / Σk (Mkq)^r
where (TrM)pq indicates the value at cell (p, q) after inflation with power r.
The inflation operator is responsible for both the strengthening and the weakening of the current probabilities, while the inflation parameter r controls the extent of this strengthening/weakening. In brief, the granularity of the clustering depends on this inflation parameter: finer or coarser clusterings of the graph can be obtained by varying it. Values of r that typically result in good clustering are 1.4, 2, 4 and 6; however, most applications work fine with the default value of 2. A small sketch of the expansion and inflation operators follows.
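The following minimal Python sketch illustrates the two operators on a column-stochastic matrix; the function names and the use of NumPy are assumptions made here for illustration only.

```python
import numpy as np

def expand(M, e=2):
    # Expansion: raise the stochastic matrix to the power e (default: squaring)
    return np.linalg.matrix_power(M, e)

def inflate(M, r=2.0):
    # Inflation: raise every entry to the power r, then renormalize each column
    # so that the matrix stays column stochastic (this is TrM in the text).
    M = np.power(M, r)
    return M / M.sum(axis=0, keepdims=True)
```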
MCL Algorithm
Input: An undirected graph (which can be weighted). For better performance, a directed graph should be converted into an undirected one, as MCL works best with undirected graphs.
Output: Clustering structure / clustered nodes
Algorithm:
1. Read the input undirected graph and form an adjacency/weight matrix. In the case of weighted graphs, normalization is required.
2. Generate the Markov matrix from the adjacency matrix. This step involves computing the probabilities of moving from one node to another.
3. Add self-loops in order to maximize the flow of the process within the denser nodes compared with other nodes. This step is optional.
4. Check whether the resulting matrix is column stochastic (i.e., whether the elements in every column sum to 1). If it is not column stochastic, rescale the existing column values in order to make it stochastic.
5. Expansion: expand the result of the previous step by squaring the matrix, or by raising it to an appropriate power e if specified.
6. Inflation: inflate the result of step 5 by applying the inflation operator with parameter r, where (TrM)pq indicates the value at cell (p, q) after inflation with power r.
7. Repeat expansion and inflation until a convergence criterion is reached. The convergence criterion could be anything from a total cost, to the sparsity of the resulting matrix, to the number of computations required, etc. In our study, we use the residual energy of the matrix computations as the convergence criterion. Once the residual energy of the resulting matrix becomes 0, or falls below a specified threshold, expansion and inflation are stopped.
8. The resulting matrix gives the cluster structure, which can be read off as follows: a cluster consists of all the nodes that have positive values in a single row. A complete sketch of this procedure is given below.
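Putting the steps together, the following is a minimal Python sketch of the whole loop under the assumptions stated earlier (column-stochastic matrices, squaring for expansion, r = 2 for inflation, and the change between successive iterates used as a stand-in for the residual-energy check). It is an illustrative sketch, not the exact code used in the study.

```python
import numpy as np

def expand(M, e=2):                               # expansion operator, as sketched earlier
    return np.linalg.matrix_power(M, e)

def inflate(M, r=2.0):                            # inflation operator, as sketched earlier
    M = np.power(M, r)
    return M / M.sum(axis=0, keepdims=True)

def mcl(adjacency, r=2.0, e=2, max_iter=100, tol=1e-6, self_loops=True):
    A = np.asarray(adjacency, dtype=float)
    if self_loops:
        A = A + np.eye(len(A))                    # step 3: optional self-loops
    M = A / A.sum(axis=0, keepdims=True)          # steps 2 and 4: column-stochastic Markov matrix
    for _ in range(max_iter):
        M_prev = M
        M = inflate(expand(M, e), r)              # steps 5 and 6: expansion then inflation
        if np.abs(M - M_prev).max() < tol:        # step 7: convergence check
            break
    # Step 8: every row with positive entries defines one cluster (attractor row)
    clusters = []
    for row in M:
        members = frozenset(np.nonzero(row > tol)[0] + 1)   # 1-based node labels
        if members and members not in clusters:
            clusters.append(members)
    return M, clusters
```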
Working Example
Let the following be the input graph to the MCL clustering algorithm:
Figure 4: Input Graph G
1. We provide the input graph in the form of a probability/stochastic matrix, so the input matrix M has the following form for the above graph:
Matrix M:
Nodes    1      2      3      4      5      6      7      8      9      10     11     12
1        0.2    0.2    0      0      0      0.2    0.2    0      0      0.2    0      0
2        0.25   0.25   0.25   0      0.25   0      0      0      0      0      0      0
3        0      0.25   0.25   0.25   0.25   0      0      0      0      0      0      0
4        0      0      0.2    0.2    0      0      0      0.2    0.2    0      0.2    0
5        0      0.2    0.2    0      0.2    0      0.2    0.2    0      0      0      0
6        0.333  0      0      0      0      0.333  0      0      0      0.333  0      0
7        0.25   0      0      0      0.25   0      0.25   0      0      0.25   0      0
8        0      0      0      0.2    0.2    0      0      0.2    0.2    0      0.2    0
9        0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
10       0.25   0      0      0      0      0.25   0.25   0      0      0.25   0      0
11       0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
12       0      0      0      0      0      0      0      0      0.333  0      0.333  0.333
One can also note that self-loops have already been added to the matrix. The value in each cell M(i, j), where i, j ∈ 1 to 12, indicates the probability of visiting node j when starting at node i.
For instance, cell M(3, 4) indicates that the probability of visiting node 4 from node 3 in one step is 25%.
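For reference, the matrix above can be reproduced with the sketch below. The edge list is read off the non-zero off-diagonal entries of M (i.e., the graph of Figure 4 as implied by the matrix), and the row-wise normalization mirrors the convention M(i, j) = probability of moving from i to j used in this example.

```python
import numpy as np

# Edges of the 12-node example graph (1-based labels), as implied by matrix M
edges = [(1, 2), (1, 6), (1, 7), (1, 10), (2, 3), (2, 5), (3, 4), (3, 5),
         (4, 8), (4, 9), (4, 11), (5, 7), (5, 8), (6, 10), (7, 10),
         (8, 9), (8, 11), (9, 11), (9, 12), (11, 12)]

A = np.zeros((12, 12))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0   # undirected adjacency matrix
A += np.eye(12)                               # add self-loops

M = A / A.sum(axis=1, keepdims=True)          # row-normalize: M[i, j] = P(i -> j)
print(np.round(M, 3))
```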
2. As the input matrix is already normalized and stochastic, no rescaling or normalization is needed. At this stage Mnorm = M.
Matrix Mnorm:
Nodes    1      2      3      4      5      6      7      8      9      10     11     12
1        0.2    0.2    0      0      0      0.2    0.2    0      0      0.2    0      0
2        0.25   0.25   0.25   0      0.25   0      0      0      0      0      0      0
3        0      0.25   0.25   0.25   0.25   0      0      0      0      0      0      0
4        0      0      0.2    0.2    0      0      0      0.2    0.2    0      0.2    0
5        0      0.2    0.2    0      0.2    0      0.2    0.2    0      0      0      0
6        0.333  0      0      0      0      0.333  0      0      0      0.333  0      0
7        0.25   0      0      0      0.25   0      0.25   0      0      0.25   0      0
8        0      0      0      0.2    0.2    0      0      0.2    0.2    0      0.2    0
9        0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
10       0.25   0      0      0      0      0.25   0.25   0      0      0.25   0      0
11       0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
12       0      0      0      0      0      0      0      0      0.333  0      0.333  0.333
In the case of a weighted graph, normalization and rescaling are always needed. The normalization is carried out by rescaling the existing values so that the resulting matrix becomes stochastic.
3. Expansion
In this step, the resulting matrix of step 2 is expanded by multiplying it with itself and rescaling it so that it remains a Markov (stochastic) matrix.
Matrix Me: the 12 × 12 expanded matrix obtained by squaring Mnorm (the individual entries of this intermediate matrix are omitted here; they can be reproduced with the expansion sketch given earlier).
4. Inflation
This step raises the entries of the expanded matrix to the inflation parameter in order to strengthen the strong connections and weaken the less strong connections. The default value of the inflation parameter is taken as 2.
Inflated matrix: the 12 × 12 matrix obtained by applying the inflation operator with r = 2 to Me, i.e., raising each entry to the power 2 and renormalizing (entries omitted here; they can be reproduced with the inflation sketch given earlier).
The resulting residual energy (stability criterion) is 0.17924319695886184. As the residual energy is greater than 0, the process of expansion and inflation is continued until the residual energy criterion is satisfied.
5. This input graph requires 7 iterations for the formation of the final cluster structure. The final resulting matrix after the 7th iteration is:
Final clusters (steady-state matrix):
Nodes    1     2     3     4     5     6     7     8     9     10    11    12
1        1     0     0     0     0     1     1     0     0     1     0     0
2        0     0     0     0     0     0     0     0     0     0     0     0
3        0     0     0     0     0     0     0     0     0     0     0     0
4        0     0     0     0     0     0     0     0     0     0     0     0
5        0     1     1     0     1     0     0     0     0     0     0     0
6        0     0     0     0     0     0     0     0     0     0     0     0
7        0     0     0     0     0     0     0     0     0     0     0     0
8        0     0     0     0     0     0     0     0     0     0     0     0
9        0     0     0     0.5   0     0     0     0.5   0.5   0     0.5   0.5
10       0     0     0     0     0     0     0     0     0     0     0     0
11       0     0     0     0.5   0     0     0     0.5   0.5   0     0.5   0.5
12       0     0     0     0     0     0     0     0     0     0     0     0
From the above resulting matrix, one can see that 3 clusters are formed by the MCL algorithm:
C1: {1, 6, 7, 10}
C2: {2, 3, 5}
C3: {4, 8, 9, 11, 12}
To identify the cluster structure, one only needs to find those rows of the resulting matrix which have non-zero entries in their cells.
MCL Cluster Interpretation
To interpret the clusters, the vertices are split into two types: attractors, which attract other vertices, and vertices that are being attracted by the attractors. Attractors have at least one positive flow value within their corresponding row (in the steady-state matrix). Each attractor attracts the vertices which have positive values within its row. Attractors and the elements they attract are swept together into the same cluster.
In the above example, the attractors are 1, 5 and 9, which attract the remaining vertices with some positive flow value to form the clusters.
In general, overlapping clusters (where one or more nodes are shared by multiple clusters) are found only in very special cases of graph symmetry: only when a vertex is attracted exactly equally by more than one cluster. This occurs only when the clusters concerned are isomorphic. The following graph shows an example of isomorphic clusters where node 4 is shared between both clusters.
Figure 5: Sample graph showing isomorphic clusters
Analysis of MCL Clustering
The clustering granularity is mostly governed by the inflation parameter, and the total number of clusters formed varies directly with it.
The MCL algorithm requires O(N^3) time, as its main operations are expansion and inflation, which involve matrix multiplication for matrices over N nodes.
We introduce a technique called "matrix pruning" in order to reduce this computational time.
Matrix Pruning
1. Analyze and inspect the resulting matrix after every iteration (expansion and inflation) to check whether any value Mr(i, j) of the resulting matrix is close to zero, i.e., less than a specified threshold.
2. If the value is below the threshold, set it to zero (prune the original value to 0).
This improves the computational speed, as many of the multiplications are avoided due to pruning; a small sketch is given below.
With this pruning technique the time can be reduced to roughly O(N^2).
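A minimal sketch of such a pruning step is shown below; the threshold value, the renormalization after pruning, and the function name are illustrative assumptions, not taken from the report.

```python
import numpy as np

def prune(M, threshold=1e-4):
    # Zero out entries that are (almost) zero so that the next expansion
    # step has fewer non-zero entries to multiply.
    M = np.where(M < threshold, 0.0, M)
    # Renormalize columns so the pruned matrix stays column stochastic.
    sums = M.sum(axis=0, keepdims=True)
    sums[sums == 0] = 1.0                 # guard against all-zero columns
    return M / sums
```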
Advantages of MCL
- Scales well with increasing graph size.
- Works with both weighted and unweighted graphs.
- Produces good clustering results.
- Robust against noise in graph data; outliers are separated smoothly from the prominent members.
- The number of clusters need not be specified ahead of time, and cluster granularity can be adjusted with the inflation parameter.
Result Analysis
To analyze the clustering performance with the newly introduced pruning technique, we computed the clustering results for two sample cases with 20 buses and 30 buses respectively.
The computed results were checked for correctness by evaluating them against cluster validity indices, from which we conclude that the results produced are optimized.
We use 5 different cluster validity indices [5] for the performance analysis:
1. Davies-Bouldin Index (DBI)
It measures the ratio between the separation within clusters (how spread out the objects are inside a cluster) and the separation between clusters (how distant two clusters are):
DBI = (1/C) Σi max(j≠i) (Ai + Aj) / dij
Where,
C: total number of clusters
Ai: average intra-cluster distance of cluster i
dij: inter-cluster distance between clusters i and j
Normally, this ratio should take small values for good clusters. Smaller values indicate that there is little spread among the objects within the clusters and that the inter-cluster distances are large. The results obtained for the sample data give small values for the DB index.
2. Dunn Index (DI)
This cluster validation score measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance for a given set of clusters and data objects:
DI = min(i≠j) δij / max(k) Δk
Where,
δij: distance between cluster i and cluster j (for simplicity, this can be taken as the distance between the centroids of clusters i and j)
Δk: maximum distance within cluster k
This ratio should take large values for good clusters, which indicates that the separation between clusters is large compared with the separation within clusters. A small sketch of the DB and Dunn computations follows.
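A minimal sketch of how these two indices could be computed is shown below, assuming the data objects are available as feature vectors and the clusters as lists of row indices; centroid-based distances are used as a simplification, so the exact values may differ from those reported in the tables.

```python
import numpy as np

def db_and_dunn(X, clusters):
    # X: (n, d) array of data objects; clusters: list of lists of row indices
    cents = [X[c].mean(axis=0) for c in clusters]
    intra = [np.mean([np.linalg.norm(X[i] - cents[k]) for i in c])    # average intra-cluster distance A_k
             for k, c in enumerate(clusters)]
    diam = [max(np.linalg.norm(X[i] - X[j]) for i in c for j in c)    # maximum intra-cluster distance (diameter)
            for c in clusters]
    C = len(clusters)
    d = [[np.linalg.norm(cents[i] - cents[j]) for j in range(C)] for i in range(C)]
    dbi = np.mean([max((intra[i] + intra[j]) / d[i][j] for j in range(C) if j != i)
                   for i in range(C)])
    dunn = min(d[i][j] for i in range(C) for j in range(C) if j != i) / max(diam)
    return dbi, dunn
```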
3. Calinski-Harabasz Index (CHI)
This cluster validity score computes the relationship between two traces, TB and TW:
CHI = [TB / (C - 1)] / [TW / (n - C)]
Where,
TB: between-cluster trace, computed as the weighted sum of squared distances between the centroid of the whole object set and every cluster centroid.
TW: within-cluster trace, computed as the sum of squared distances between every node and its cluster centroid.
C: total number of clusters
n: number of data objects/nodes
Larger values of this index are generally preferred, as they indicate good clusters; the number of objects and the number of clusters formed also play a vital role.
4. Xie-Beni Index
This index relates the within-cluster trace TW to the minimum inter-cluster distance. It should take small to intermediate values for good clustering.
Where,
TW: within-cluster trace, computed as the sum of squared distances between every node and its cluster centroid.
dij: inter-cluster distance between clusters i and j
5. Φ Index / I Index
This measure is the ratio between two parameters, E1 and EC, which measure the separateness among the clusters and the compactness within clusters respectively.
Where,
C: total number of clusters
E1: sum of the distances between each node and the centroid of the whole data set.
EC: sum of the distances between each node and the centroid of its corresponding cluster.
dij: inter-cluster distance between clusters i and j
This index gives larger values for good clusters.
The results analyzed for the sample 20-bus and 30-bus systems show optimized performance when we change the granularity of the clusters by changing the inflation parameter.
Table 1: Cluster Validity Index Scores for the 20-bus System
Inflation Parameter (r)   Clusters (C)   Davies-Bouldin   Dunn     Calinski-Harabasz   Xie-Beni   Φ index
1.6                       2              0.37             0.87     20.84               0.46       26.50
2.0                       3              1.10             0.11     8.40                63.07      18.51
2.8                       5              1.46             0.0520   7.41                11.30      14.83
3.5                       4              1.56             0.066    23.60               7.47       20.03
5.2                       6              1.33             0.10     8.76                75.12      19.60
Figure 6: Performance Analysis: Case 1 20 bus System
Case 2: 30-bus System
Table 2: Cluster Validity Index Scores for the 30-bus System
Inflation Parameter (r)   Clusters (C)   Davies-Bouldin   Dunn     Calinski-Harabasz   Xie-Beni   Φ index
1.6                       9              1.85             0.6      9.45                0.39       2.47
2.0                       11             1.97             0.79     23.88               22.36      18.45
2.6                       15             1.6              0.45     21.44               26.32      8.55
3.8                       16             1.13             1.34     25.35               11.63      27.31
4.5                       16             1.05             1.03     37.28               26.4       39.77
Figure 7: Performance Analysis, Case 2: 30-bus System
Naïve Performance Measure [2]:
The performance of MCL can also be measured using the naïve performance measure proposed by Van Dongen in his PhD thesis. This measure clearly depicts the quality of the clusters formed. Here again the stress is on intra-cluster connectivity: the final clusters should cover as many of the edges ('ones' in the resulting matrix) as possible inside their blocks, while there should be few missing edges (zeroes) within the blocks and few edges running between clusters.
The measure relates the edges that connect two or more clusters to the edges that are present inside the clusters; the total number of data objects/nodes appears in the denominator as a normalization. The value lies between 0 and 1.
Where,
G: input graph
P: partition/clusters formed
#1out(G, P): total number of edges not covered by the clustering, i.e., edges (i, j) in G for which i and j lie in different clusters of P.
#0in(G, P): total number of edges suggested by P but absent in G, i.e., all pairs (i, j) for which i and j are in the same cluster of P but the edge (i, j) is not present in the original graph G.
Good clusterings should have values far from 1, while poor partitionings have values approaching 1. A small sketch of this measure is given below.
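The sketch below counts #1out and #0in as defined above and combines them into a single score. Since the report does not spell out the exact normalization, the division by the total number of node pairs used here is an assumption, so the absolute values may differ from those in Tables 3 and 4.

```python
from itertools import combinations

def naive_performance(edges, clusters, n):
    # edges: set of frozenset({i, j}); clusters: list of sets of nodes; n: number of nodes
    label = {v: k for k, c in enumerate(clusters) for v in c}
    ones_out = sum(1 for e in edges if len({label[v] for v in e}) > 1)   # #1out: edges cut by the clustering
    zeros_in = sum(1 for c in clusters                                   # #0in: missing intra-cluster edges
                   for i, j in combinations(sorted(c), 2)
                   if frozenset({i, j}) not in edges)
    return (ones_out + zeros_in) / (n * (n - 1) / 2)                     # assumed normalization by node pairs
```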
Table 3: Naïve Performance Values for the 20-bus System
Inflation Parameter (r)   Clusters Formed   Performance Value
1.6                       2                 0.91
2.0                       3                 0.924
2.8                       5                 0.89
3.5                       4                 0.86
5.2                       6                 0.88
Table 4: Naïve Performance Values for the 30-bus System
Inflation Parameter (r)   Clusters Formed   Performance Value
1.6                       9                 0.937
2.0                       11                0.903
2.6                       15                0.87
3.8                       16                0.894
4.5                       16                0.84
Figure 8: Naïve Performance Analysis
The analysis scores for both cases show that good clusters are obtained when we increase the value of the inflation parameter, while the quality degrades when the cluster granularity is lowest.
Page Ranking in MCL [8]:
The concept of page ranking for web pages can easily be applied to graph partitioning problems in real-world applications. Communicating across the web has become an integral part of everyday life, and this communication is enabled in part by scientific studies of the structure of the web.
We consider a simple model known as the random surfer model. Let the web be a fixed set of pages, with each page containing a fixed set of hyperlinks and each link a reference to some other page. We can then analyze the probability that a surfer moves from one web page to another. Here we consider the adjacency/transition matrix in which two web pages are connected if there is a hyperlink between them.
The same analogy can be applied to real-world graph partitioning problems, where the input is a simple graph indicating the connections between various nodes. The graph is converted into a transition matrix, which can then be used to compute the page ranks of the individual nodes.
Thus, the importance of each node can be determined from the value of its page rank. When this concept is combined with the MCL technique it gives us optimized results: the nodes having higher page ranks tend to be situated in different clusters. A small PageRank sketch is given below.
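For illustration, the following is a minimal power-iteration PageRank sketch on an adjacency matrix under the random surfer model; the damping factor of 0.85 and the convergence tolerance are conventional choices assumed here, not values taken from the report.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, max_iter=100, tol=1e-8):
    A = np.asarray(adjacency, dtype=float)
    n = len(A)
    out_deg = A.sum(axis=1)
    out_deg[out_deg == 0] = 1.0                   # avoid division by zero for dangling nodes
    T = A / out_deg[:, None]                      # row-stochastic transition matrix: T[i, j] = P(i -> j)
    rank = np.full(n, 1.0 / n)                    # start from a uniform rank vector
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (T.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:   # stop when the ranks have converged
            break
        rank = new_rank
    return rank
```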
References:
1. Jain, A. K., Murty, M. N. and Flynn, P. J., "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
2. Van Dongen, S., "Graph Clustering by Flow Simulation", PhD Thesis, University of Utrecht, 2000.
3. Han, J. and Kamber, M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.
4. Xu, R. and Wunsch, D., "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
5. Foggia, P. et al., "Benchmarking graph-based clustering algorithms", Image and Vision Computing, Vol. 27 (2009), pp. 978-988.
6. Das, S. et al., "Metaheuristic Clustering", Studies in Computational Intelligence, Volume 178, Springer, 2009.
7. Jauregui, J., "Markov Chains - Google Page Rank Algorithm", 2012.
8. Li, C. et al., "Method for evaluating the importance of power grid nodes based on PageRank algorithm", IET Generation, Transmission & Distribution, 2014, Vol. 8, Iss. 11, pp. 1843-1847.
9. Xiong, Y. et al., "Top-k Similarity Join in Heterogeneous Information Networks", IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 6, June 2015.
10. Ozdemir, A. et al., "Hierarchical Spectral Consensus Clustering for Group Analysis of Functional Brain Networks", IEEE Transactions on Biomedical Engineering, 2015, DOI 10.1109/TBME.2015.2415733.
11. http://introcs.cs.princeton.edu/java/16pagerank/ [Last accessed: 16 July 2015]