Report on the IASc Summer Research Fellowship Program (SRFP) 2015 at IIT Delhi

The two-month Summer Research Fellowship Program (SRFP) 2015 for students and faculty of engineering institutions is organized every year by the Indian Academy of Sciences (IASc), Bangalore. During 2015 I was selected by the Academy to work at the Indian Institute of Technology, Delhi. With the prior permission of the Hon. Principal and the HOD-IT, I successfully completed the fellowship work under the guidance of Dr. B. K. Panigrahi of the Department of Electrical Engineering at IIT Delhi during May-July 2015. The main aim of this research fellowship program offered by the Indian Academy of Sciences every year is to nurture research skills in students and faculty across various engineering disciplines.

During this period I worked in the field of web mining. I am glad to state that I successfully implemented a project entitled "A Novel Approach for Data Clustering using Improved Markovian and Z-Score based Technique" during my stay at IIT Delhi. The main objective of the task assigned to me was the performance optimization of existing clustering algorithms and the improvement of their results. During this stay at IIT Delhi I studied the need for optimization in data clustering and was able to optimize the existing clustering techniques. I also got the opportunity to attend a couple of Continuing Education Programs (CEPs) and expert talks during this tenure, including an expert talk by Dr. Dipankar Chatterjee, Professor at IISc Bangalore, who addressed all IASc research fellows on 22 June 2015 at the Indian National Science Academy (INSA), Delhi, on various insights and career opportunities in research across disciplines. The chance to learn concepts in evolutionary computing from Dr. Panigrahi, an expert in this domain, was really fruitful and heartening. This fellowship has provided an ideal platform for learning and for experiencing the research culture at IIT, which will surely help me in shaping my research career in the days to come.

A Report on
"A Novel Approach for Data Clustering using Improved Markovian and Z-Score based Technique"
Submitted to the Indian Academy of Sciences (IASc), Bangalore, in fulfilment of the requirements of the Summer Research Fellowship for the duration May-July 2015
By Mr. Shah Sahil Kailas, Application ID: ENGT143
Vidya Pratishthan's College of Engineering, Baramati - 413133
Under the guidance of Dr. B. K. Panigrahi (bkpanigrahi@ee.iitd.ac.in)
Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi - 110016, India

Clustering Techniques in Data Mining

In today's era, almost every field needs to group items so that they can be scanned and analyzed quickly. As the volume of data spread across various domains grows exponentially, grouping such data items facilitates their in-depth analysis. Clustering is the technique of grouping data items (objects, object descriptions, features, etc.) into groups/clusters so that data items with similar feature sets become members of the same cluster.
In short, clustering forms clusters of data objects such that objects belonging to the same cluster are similar, while those belonging to different clusters are dissimilar. Examples: animal taxonomy, which represents various classes of animals such as reptiles and mammals; plant taxonomy; bioinformatics; grouping of locations based on geographic areas; and so on.

Grouping of data objects can be carried out in two ways:
1. Supervised classification
2. Unsupervised classification, a.k.a. clustering

1. Classification
This type of supervised learning involves assigning new data items to existing classes by predicting/learning from those classes. In this technique, predefined classes with an important feature set are presumed, and on that basis new data items are classified. Example: a new breed of dog can easily be classified into the class of dogs based on prediction/learning. This technique requires the assumption of existing classes, on the basis of which learning must be performed.

2. Clustering
This type of unsupervised classification involves the formation of clusters/groups based on the properties of the data objects and their dissimilarities with each other. No predefined classes exist. Objects are grouped into the same cluster if their distance/dissimilarity value is very small. In cluster analysis, a group of objects is split up into a number of more or less homogeneous sub-groups on the basis of an often subjectively chosen measure of similarity, such that the similarity between objects within a sub-group is larger than the similarity between objects belonging to different sub-groups. Example: data items such as websites, information pages and documents can be clustered into different clusters depending upon their features. Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). A cluster can be described by considering internal homogeneity and external separation, i.e. patterns in the same cluster should be similar to each other, while patterns in different clusters should not.

Mathematical Model for Clustering

Given a set of data objects S = {s1, s2, ..., sN}, where si is the i-th data object consisting of various features {si1, si2, ..., sin} and N is the number of data objects to be clustered, form K clusters/partitions C = {C1, C2, ..., CK}, K <= N, such that:
1. Ci ≠ Φ, i = 1 to K
2. C1 ∪ C2 ∪ ... ∪ CK = S
3. Ci ∩ Cj = Φ, i ≠ j
In the case of hard clustering the third condition must hold, while in the case of soft clustering, or when overlapping is allowed, it can be skipped. While forming any cluster, the data objects must satisfy and optimize some objective function. The objective function is mostly given in terms of a dissimilarity/distance function, which can be any of the following three measures:
1. Euclidean distance
2. Manhattan distance
3. Minkowski distance
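To make the three distance measures concrete, here is a minimal Python sketch; the two feature vectors si and sj are made-up values used only for illustration and are not data from this study.

    import numpy as np

    # Two hypothetical feature vectors s_i and s_j
    si = np.array([1.0, 3.0, 5.0])
    sj = np.array([4.0, 1.0, 5.0])

    euclidean = np.sqrt(np.sum((si - sj) ** 2))   # sqrt(9 + 4 + 0) ~= 3.61
    manhattan = np.sum(np.abs(si - sj))           # 3 + 2 + 0 = 5
    p = 3                                         # Minkowski order: p = 2 gives Euclidean, p = 1 gives Manhattan
    minkowski = np.sum(np.abs(si - sj) ** p) ** (1.0 / p)

    print(euclidean, manhattan, minkowski)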
Clustering Process

Figure 1: Clustering Process [3]

The clustering process can be divided into four main stages:

1. Feature Extraction
The data objects to be clustered need to be pre-processed in order to extract only the selective features that are used in further processing. The selected features can be of two types, symmetric or asymmetric. In general, ideal features should be useful in distinguishing patterns belonging to different clusters, immune to noise, and easy to extract and interpret.

2. Selection of the Clustering Algorithm
This is the vital stage in the entire clustering process; it involves selecting a proper clustering technique from the wide variety of available clustering algorithms. The selection can be made depending on various parameters such as the nature, type and size of the data objects. This step in turn involves the selection of a proximity/similarity measure, which plays a part in the subsequent processing.

3. Cluster Validation
Given a data set, every clustering algorithm will always generate a division, whether or not the structure actually exists. Moreover, different approaches usually lead to different clusters, and even for the same algorithm, the parameter settings or the presentation order of the input patterns may affect the final results. Therefore, effective evaluation standards and criteria are important to give users a degree of confidence in the clustering results derived from the algorithms used. These assessments should be objective and have no preference for any algorithm. They should also help answer questions such as how many clusters are hidden in the data, whether the clusters obtained are meaningful or merely an artefact of the algorithm, and why one algorithm is chosen instead of another.

4. Results Interpretation
The crucial goal of clustering is to offer users meaningful insights from the original data, so that they can effectively solve the problems they encounter. Interpretation involves experimenting with the obtained results in order to check the feasibility of the generated solution.

Clustering algorithms can be categorized into the following major families:

1. Partitioning Methods
This technique needs a user-defined parameter, the number of clusters K, at the start of the algorithm. The goal is to obtain K partitions of the given data items which satisfy the following properties: (a) each group must contain at least one object, and (b) each object must belong to exactly one group. The technique uses an iterative method which tries to relocate data items to different partitions at each iteration, depending on the minimal distance between objects inside one cluster. K-means and K-medoids are the typical algorithms of this family (a minimal K-means example is sketched after this list of families).

2. Hierarchical Methods
This method creates a hierarchical decomposition of the data objects in either an agglomerative or a divisive way, i.e. it either groups data items bottom-up (agglomerative) or divides the data items into small individual clusters top-down (divisive). It is hard to undo groupings once they have been made in this type of algorithm. Moreover, the user needs to specify the number of levels for refinement of the clustering results. Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) and Clustering Using Representatives (CURE) are the most used techniques in this category.

3. Density Based Methods
In these methods, clusters are grown, i.e. new data objects are added to a cluster, as long as the neighborhood density exceeds a predefined limit: for each data object in a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. This is normally used to remove outliers, while objects that are near each other are enclosed inside one cluster. DBSCAN and Ordering Points To Identify the Clustering Structure (OPTICS) are the main variants of density based clustering.

4. Grid Based Methods
This method quantizes the data object space into a finite number of cells, which form a grid structure. All clustering-related operations are then performed on this grid structure. This leads to faster processing times, as the only requirement is to transform the given data objects into the grid structure. STING, CLIQUE and Wave-Cluster are prominent clustering algorithms in this family.

5. Model Based Methods
This type of clustering involves finding a best/optimal-fit model for the given data set using some heuristic. Model based approaches can directly find the final number of clusters using an iterative method that runs until the model description is satisfied. Minimum Description Length (MDL) and Expectation-Maximization (EM) are examples of algorithms in this category.
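As an illustration of the partitioning family above, the following minimal K-means sketch shows how the number of clusters K must be supplied up front. The use of scikit-learn and the synthetic two-group data are assumptions made purely for this illustration; they are not part of the MCL-based approach developed in this report.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D data objects forming two well-separated groups
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)),
                   rng.normal(5, 1, (20, 2))])

    # Partitioning method: K (n_clusters) has to be chosen in advance
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster index assigned to every object
    print(km.cluster_centers_)  # centroid of each partition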
New Approach: Data Clustering using the Markov Clustering Algorithm (MCL) and Page Ranking

For the partitioning of larger graphs, there is always a need for clustering algorithms capable of producing optimal results. The MCL algorithm is a graph-based clustering algorithm introduced by Stijn van Dongen in his PhD thesis, Graph Clustering by Flow Simulation (University of Utrecht, The Netherlands, 2000) [2]. The algorithm uses the Markov chain concept and the theory of Markov (stochastic) matrices to derive the final clustering structure when a graph is provided as input. The Markov Cluster process (abbreviated MCL process) defines a sequence of stochastic matrices by alternating two operators on a generating matrix; this process is the basic machinery behind the clustering of graphs.

The foundation of the MCL algorithm is a popular graph paradigm: natural groups in a graph, the groups that are sought but not known in advance, have the property that a random walk in G which visits a dense cluster will likely not leave the cluster until many of its vertices have been visited. MCL makes use of this idea while forming the cluster structure. It does so by promoting flow within the graph where there are strong connections and minimizing flow outside the dense regions.

Figure 2: Sample Undirected Graph

In any graph, nodes that naturally belong together share a larger number of links among themselves than with the rest of the graph. In the figure above, one can see the natural tendency of clusters: intra-cluster similarity is higher than inter-cluster similarity. The figure demonstrates this paradigm clearly: if we assume two clusters connected by a single link, then far more nodes are strongly connected inside each cluster. This means that if one starts from some node and then travels randomly to a connected node, one is more likely to stay within a cluster than to travel between clusters, because there are more links within the cluster. By doing random walks on the graph, it is possible to discover where the flow tends to gather, and therefore where the clusters can be formed. These random walks on a graph are calculated using Markov chains, i.e. stochastic matrices.
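As a small illustration of how such a stochastic matrix is built from a graph, the following Python sketch column-normalizes a hypothetical adjacency matrix; the 4-node graph used here is made up for the illustration and is not the graph of Figure 2.

    import numpy as np

    # Adjacency matrix of a small, hypothetical undirected graph
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    # Column-normalize: entry (i, j) becomes the probability of stepping
    # from node j to node i in one move of the random walk
    M = A / A.sum(axis=0, keepdims=True)

    print(M)
    print(M.sum(axis=0))  # every column sums to 1, i.e. the matrix is column stochastic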
Working Example
As shown in the following figure, one can manually visualize that there are 2 clusters, having 5 and 3 nodes respectively. The same result can be obtained using MCL as follows.

Figure 3: Sample Undirected Graph

In one time step, a random walker at node 1 has a 33% chance of going to each of nodes 2, 3 and 4, and a 0% chance of reaching nodes 5, 6 or 7. From node 2 there is a 25% chance for each of nodes 1, 3, 4 and 5, and 0% for nodes 6 and 7. These random-walk probabilities are collected in a transition (probability) matrix in which the entry in column j and row i gives the probability of moving from node j to node i in one step. One can easily note that such a matrix is column stochastic, i.e. the sum of every column equals one. This construction is also known as Markov chaining, and it gives the probabilities of random walks from the various nodes of the input graph; the matrix is treated as a Markov matrix.

After forming the Markov matrix, the MCL algorithm involves two important stages, expansion and inflation. These two steps are carried out iteratively until a convergence criterion is achieved; most of the time this convergence criterion is expressed in terms of residual energy. As per the MCL process, flow is easier within dense regions than across sparse boundaries; in the long run, however, this effect fades away. During the earlier powers of the Markov chain, the edge weights are higher for links that lie within clusters and lower for links between clusters. This correspondence between the distribution of weight over the columns and the clustering (the more the weight concentrates in certain cells, the closer the matrix is to a clustering) is deliberately boosted in MCL by an operator called inflation, whose job is to strengthen the strong neighbors (intra-cluster similarity) and weaken the inter-cluster similarity.

MCL Expansion [2]
Expansion means multiplying the Markov matrix by itself, in other words squaring the current stochastic matrix (or raising it to a chosen power). In this stage, future probabilities depend only on the existing/current probabilities. The step can be thought of as spreading the existing transition probabilities across the network/graph; the expansion operator is responsible for allowing flow to connect different regions of the graph.

MCL Inflation [2]
Inflation is defined as follows: given a matrix M, M >= 0, and a real non-negative number r, the matrix resulting from raising each entry of M to the power r and then rescaling and normalizing each column of M is called TrM; Tr is called the inflation operator and r the inflation parameter:

(TrM)pq = (Mpq)^r / Σi (Miq)^r

where (TrM)pq indicates the value at cell (p, q) after inflation with power r. The inflation operator is responsible for both the strengthening and the weakening of the current flow, while the inflation parameter r controls the extent of this strengthening/weakening. In brief, the granularity of the clustering depends on this inflation parameter: in graph-based approaches it is feasible to obtain finer granularity amongst the clusters, and this is achieved by varying the inflation parameter. Values of r which typically result in good clusterings are 1.4, 2, 4 and 6; however, most applications work fine with the default value of 2.
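The effect of the inflation operator can be seen in a short Python sketch applied to a single column of a stochastic matrix; the function name inflate and the sample probabilities are illustrative choices, not values from this report.

    import numpy as np

    def inflate(M, r=2.0):
        """Raise every entry to the power r, then re-normalize each column."""
        Mr = np.power(M, r)
        return Mr / Mr.sum(axis=0, keepdims=True)

    # One column of a stochastic matrix (made-up probabilities)
    col = np.array([[0.50], [0.30], [0.15], [0.05]])
    print(inflate(col, r=2.0).ravel())
    # -> approximately [0.685, 0.247, 0.062, 0.007]: the strongest entry is
    #    strengthened and the weak ones are weakened, which is exactly the
    #    effect that drives cluster formation in MCL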
MCL Algorithm

Input: An undirected graph (it can be weighted). For better performance, a directed graph should first be converted into an undirected one, as MCL works best with undirected graphs.

Output: Clustering structure / clustered nodes

Algorithm:
1. Read the input undirected graph and form an adjacency/weight matrix. In the case of weighted graphs, normalization is required.
2. Generate the Markov matrix from the input adjacency matrix. This step involves computing the probability of moving from one node to another.
3. Add self-loops in order to maximize the flow of the process within the denser nodes as compared with the other nodes. This step is optional.
4. Check whether the resulting matrix is column stochastic, i.e. whether the elements of every column sum to 1. If it is not column stochastic, rescale the existing column values to make it so.
5. Expansion: expand the result of the previous step by squaring the matrix, or by raising it to an appropriate power e if one is specified.
6. Inflation: inflate the result of step 5 by applying the inflation operator with parameter r, where (TrM)pq indicates the value at cell (p, q) after inflation with power r.
7. Repeat expansion and inflation until a convergence criterion is reached. The convergence criterion could be anything from the total cost to the sparsity of the resulting matrix or the number of computations required; in this study, the residual energy behind the matrix computations is used. Once the residual energy of the resulting matrix becomes 0, or falls below a specified threshold, expansion and inflation are stopped.
8. The resulting matrix gives the cluster structure, which can be read off as follows: a cluster consists of all the nodes that have positive values in a single row.
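The eight steps above can be combined into a compact NumPy sketch. This is an illustrative implementation under stated assumptions rather than the exact code used in this study: the report does not spell out the residual-energy formula, so the maximum change between successive iterates is used here as a stand-in stopping test, and the function and parameter names (mcl, e, r, tol) are chosen only for this sketch.

    import numpy as np

    def mcl(A, e=2, r=2.0, add_self_loops=True, max_iter=100, tol=1e-6):
        """Minimal Markov Cluster sketch: expansion (power e) and inflation (power r)."""
        A = np.array(A, dtype=float)
        if add_self_loops:
            A = A + np.eye(len(A))                   # step 3 (optional self-loops)
        M = A / A.sum(axis=0, keepdims=True)         # steps 1, 2 and 4: column-stochastic matrix

        for _ in range(max_iter):
            M_prev = M
            M = np.linalg.matrix_power(M, e)         # step 5: expansion
            M = np.power(M, r)                       # step 6: inflation ...
            M = M / M.sum(axis=0, keepdims=True)     # ... with column re-normalization
            # step 7: stop once the matrix has (almost) stopped changing; this
            # stands in for the residual-energy criterion used in the report
            if np.abs(M - M_prev).max() < tol:
                break

        # step 8: every row with positive entries defines one cluster
        clusters = []
        for row in M:
            members = set(np.flatnonzero(row > 1e-9) + 1)   # 1-based node labels
            if members and members not in clusters:
                clusters.append(members)
        return clusters

Increasing r in this sketch changes the cluster granularity in the same direction as described above: smaller values of r merge clusters together, larger values split them apart.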
Working Example

Let the following be the input graph to the MCL clustering algorithm:

Figure 4: Input Graph G

1. The input graph is supplied as a probability/stochastic matrix. The input matrix M for the above graph (note that self-loops have already been added to it) is:

Matrix M:

Nodes   1      2      3      4      5      6      7      8      9      10     11     12
1       0.2    0.2    0      0      0      0.2    0.2    0      0      0.2    0      0
2       0.25   0.25   0.25   0      0.25   0      0      0      0      0      0      0
3       0      0.25   0.25   0.25   0.25   0      0      0      0      0      0      0
4       0      0      0.2    0.2    0      0      0      0.2    0.2    0      0.2    0
5       0      0.2    0.2    0      0.2    0      0.2    0.2    0      0      0      0
6       0.333  0      0      0      0      0.333  0      0      0      0.333  0      0
7       0.25   0      0      0      0.25   0      0.25   0      0      0.25   0      0
8       0      0      0      0.2    0.2    0      0      0.2    0.2    0      0.2    0
9       0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
10      0.25   0      0      0      0      0.25   0.25   0      0      0.25   0      0
11      0      0      0      0.2    0      0      0      0.2    0.2    0      0.2    0.2
12      0      0      0      0      0      0      0      0      0.333  0      0.333  0.333

The value in each cell M(i, j), where i, j ∈ {1, ..., 12}, indicates the probability of visiting node j when the walk is started at node i. For instance, cell M(3, 4) indicates that the probability of visiting node 4 from node 3 in one step is 25%.

2. As the input matrix is already normalized and column stochastic, there is no need for rescaling or normalization, and at this stage Mnorm = M (the matrix shown above). In the case of a weighted graph, normalization and rescaling are always needed; the normalization is carried out by rescaling the existing values in such a way that the resulting matrix becomes column stochastic.

3. Expansion
In this step the matrix of step 2 is expanded by multiplying it with itself and rescaling it back to a Markov matrix. (The full 12 x 12 expanded matrix Me is omitted here.)

4. Inflation
This step raises every entry of the expanded matrix to the inflation parameter in order to strengthen the strong connections and weaken the less strong ones; the default value of the inflation parameter, 2, is used. (The full 12 x 12 inflated matrix is omitted here.)

Resulting residual energy (stability criterion) = 0.17924319695886184. As the residual energy is greater than 0, the process of expansion and inflation is continued until the residual-energy criterion is satisfied.

5. This input graph requires 7 iterations for the formation of the final cluster structure.
The final resulting matrix after the 7th iteration is:

Final Clusters:

Nodes   1    2    3    4    5    6    7    8    9    10   11   12
1       1    0    0    0    0    1    1    0    0    1    0    0
2       0    0    0    0    0    0    0    0    0    0    0    0
3       0    0    0    0    0    0    0    0    0    0    0    0
4       0    0    0    0    0    0    0    0    0    0    0    0
5       0    1    1    0    1    0    0    0    0    0    0    0
6       0    0    0    0    0    0    0    0    0    0    0    0
7       0    0    0    0    0    0    0    0    0    0    0    0
8       0    0    0    0    0    0    0    0    0    0    0    0
9       0    0    0    0.5  0    0    0    0.5  0.5  0    0.5  0.5
10      0    0    0    0    0    0    0    0    0    0    0    0
11      0    0    0    0.5  0    0    0    0.5  0.5  0    0.5  0.5
12      0    0    0    0    0    0    0    0    0    0    0    0

From the above resulting matrix, one can see that 3 clusters are formed by the MCL algorithm:
C1: {1, 6, 7, 10}
C2: {2, 3, 5}
C3: {4, 8, 9, 11, 12}

Identifying the cluster structure only requires finding the rows of the resulting matrix that have non-zero entries in their cells.

MCL Cluster Interpretation
To interpret clusters, the vertices are split into two types: attractors, which attract other vertices, and vertices that are being attracted by the attractors. Attractors have at least one positive flow value within their corresponding row (in the steady-state matrix). Each attractor attracts the vertices that have positive values within its row. Attractors and the elements they attract are swept together into the same cluster. In the above example, the attractors are 1, 5 and 9 respectively, which attract the remaining vertices to form a cluster with some positive flow value.

In general, overlapping clusters (where one or more nodes are shared by multiple clusters) are found only in very special cases of graph symmetry: only when a vertex is attracted exactly equally by more than one cluster. This occurs only when both clusters are isomorphic. The following graph shows an example of isomorphic clusters where node 4 is shared by both clusters.

Figure 5: Sample graph showing isomorphic clusters

Analysis of MCL Clustering
The clustering granularity is mostly governed by the inflation parameter, and the total number of clusters formed varies directly with it. The MCL algorithm requires O(N^3) time, since the main work is expansion and inflation, which requires multiplication of matrices over N nodes. A new technique called "matrix pruning" is introduced here in order to reduce this computational time.

Matrix Pruning
1. Analyze and inspect the resulting matrix after every iteration (expansion and inflation) to check whether any value of the resulting matrix Mr(i, j) is near zero or less than a specified threshold.
2. If the value is below the threshold, set it to zero (prune the original value to 0). This improves the computational speed, as most of the unnecessary multiplications are avoided by the pruning. The time can be reduced up to O(N^2) by using this pruning technique.
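A minimal sketch of the pruning step described above, assuming the matrix is held as a NumPy array; the threshold value and the function name prune are illustrative choices.

    import numpy as np

    def prune(M, threshold=1e-4):
        """Zero out near-zero entries after an expansion/inflation round, then
        re-normalize the columns so the matrix stays column stochastic."""
        M = np.where(M < threshold, 0.0, M)
        return M / M.sum(axis=0, keepdims=True)

Calling prune(M) right after the inflation step of each iteration keeps the matrix sparse; the speed-up described above assumes that the subsequent multiplications exploit that sparsity.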
Advantages of MCL
1. Scales well with increasing graph size.
2. Works with both weighted and unweighted graphs.
3. Produces good clustering results.
4. Robust against noise in graph data; outliers are separated smoothly from the other, more prominent members.
5. The number of clusters need not be specified ahead of time, and cluster granularity can be adjusted through the parameters.

Result Analysis
For analyzing the clustering performance with the newly introduced pruning technique, the clustering results were computed for 2 sample cases, with 20 buses and 30 buses respectively. The computed results were checked for correctness by evaluating them against cluster validity indices, and we can conclude that the results produced are optimized. Five different cluster validity indices [5] are used for the performance analysis:

1. Davies-Bouldin Index (DBI)
This index measures the ratio between the separation within clusters (how sparse the objects are inside a cluster) and the separation between clusters (how distant two clusters are), where:
C: total number of clusters
Ai: average intra-cluster distance of cluster i
dij: inter-cluster distance between clusters i and j
Normally this ratio should take small values for good clusters. Smaller values indicate less sparseness among the objects within the clusters and larger inter-cluster distances. The results obtained for the sample data give small values for this DB index.

2. Dunn Index (DI)
This cluster validation score measures the ratio of the inter-cluster distance to the maximum intra-cluster distance for a given set of clusters and data objects, where:
δij: distance between a node in cluster i and a node in cluster j, or simply the distance between the centroids of clusters i and j
Δk: maximum distance within cluster k
This ratio should give large values for good clusters, indicating that the separation between clusters is larger than the separation within clusters.

3. Calinski-Harabasz Index (CHI)
This cluster validity score computes the relationship between two traces, TB and TW, where:
TB: trace between clusters, computed as the weighted sum of squared distances between the centroid of the whole object set and every cluster centroid
TW: trace for objects within a cluster, given as the sum of squared distances between every node and its cluster centroid
C: total number of clusters
n: number of data objects/nodes
Larger values of this index are preferred, as they indicate good clusters; parameters such as the number of objects and the number of clusters formed also play a vital role.

4. Xie-Beni Index
This index relates the within-cluster trace TW to the minimum inter-cluster distance, and should produce small to intermediate values for good clustering, where:
TW: trace for objects within a cluster, given as the sum of squared distances between every node and its cluster centroid
dij: inter-cluster distance between clusters i and j

5. Φ Index / I Index
This measure is the ratio between two parameters, E1 and EC, which measure the separateness among the clusters and the compactness within the clusters respectively, where:
C: total number of clusters
E1: sum of the distances between each node and the centroid of the whole data set
EC: sum of the distances between each node and the centroid of its corresponding cluster
dij: inter-cluster distance between clusters i and j
This index gives larger values for good clusters.

The results analyzed for the sample 20-bus and 30-bus systems show optimized performance when the granularity of the clustering is changed by varying the inflation parameter.
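Two of these indices are available directly in recent versions of scikit-learn; the remaining ones (Dunn, Xie-Beni, Φ) would need custom code. The sketch below is only illustrative: it assumes scikit-learn is installed and uses synthetic feature vectors and labels in place of the bus-system data analyzed here.

    import numpy as np
    from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

    # Synthetic stand-ins for the node feature vectors and the MCL cluster labels
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (10, 2)),
                   rng.normal(5, 1, (10, 2)),
                   rng.normal(10, 1, (10, 2))])
    labels = np.repeat([0, 1, 2], 10)

    print("Davies-Bouldin   :", davies_bouldin_score(X, labels))     # smaller is better
    print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # larger is better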
Table 1: Cluster Validity Index Scores for the 20-bus System

Inflation parameter (r) | Number of clusters (C) | Davies-Bouldin Index | Dunn Index | Calinski-Harabasz Index | Xie-Beni Index | Φ Index
1.6 | 2 | 0.37 | 0.87   | 20.84 | 0.46  | 14.83
2.0 | 3 | 1.10 | 0.11   | 8.40  | 26.50 | 7.47
2.8 | 5 | 1.46 | 0.0520 | 7.41  | 63.07 | 20.03
3.5 | 4 | 1.56 | 0.066  | 23.60 | 18.51 | 75.12
5.2 | 6 | 1.33 | 0.10   | 8.76  | 11.30 | 19.60

Figure 6: Performance Analysis, Case 1: 20-bus System

Case 2: 30-bus System

Table 2: Cluster Validity Index Scores for the 30-bus System

Inflation parameter (r) | Number of clusters (C) | Davies-Bouldin Index | Dunn Index | Calinski-Harabasz Index | Xie-Beni Index | Φ Index
1.6 | 9  | 1.85 | 0.6  | 9.45  | 0.39  | 8.55
2.0 | 11 | 1.97 | 0.79 | 23.88 | 2.47  | 11.63
2.6 | 15 | 1.6  | 0.45 | 21.44 | 22.36 | 27.31
3.8 | 16 | 1.13 | 1.34 | 25.35 | 18.45 | 26.4
4.5 | 16 | 1.05 | 1.03 | 37.28 | 26.32 | 39.77

Figure 7: Performance Analysis, Case 2: 30-bus System

Naïve Performance Measure [2]
The performance of MCL can also be assessed using the performance measure proposed by Van Dongen in his PhD thesis, which clearly depicts the quality of the clusters formed. In this measure, the stress is again on intra-cluster connectivity: a cluster should connect as many of its internal node pairs as possible, while there should be as few edges as possible between clusters. In other words, the final clusters should cover as many edges (the 'ones' of the resultant matrix) as possible, while there should be few zeroes within the blocks and few edges outside them. The measure relates the edges connecting two or more clusters to the edges present inside the clusters; in the denominator, the total number of data objects/nodes is used for normalization, so the value lies between 0 and 1.

Where,
G: input graph
P: partitions/clusters formed
#1out(G, P): total number of edges not covered by the clustering, i.e. edges (i, j) in G for which i and j are in different clusters of P
#0in(G, P): total number of edges suggested by P that are absent in G, i.e. all pairs (i, j) for which i and j are in the same cluster of P and the edge (i, j) is not present in the original graph G

Good clusterings should give values away from 1, while partitionings that are not good give values approaching 1.

Table 3: Naïve Performance Values for the 20-bus System

Inflation parameter (r) | Number of clusters | Performance value
1.6 | 2 | 0.91
2.0 | 3 | 0.924
2.8 | 5 | 0.89
3.5 | 4 | 0.86
5.2 | 6 | 0.88

Table 4: Naïve Performance Values for the 30-bus System

Inflation parameter (r) | Number of clusters | Performance value
1.6 | 9  | 0.937
2.0 | 11 | 0.903
2.6 | 15 | 0.87
3.8 | 16 | 0.894
4.5 | 16 | 0.84

Figure 8: Naïve Performance Analysis

The analysis scores for both cases show that good clusters are obtained when the value of the inflation parameter is increased, and that quality degrades when the cluster granularity is lowest.
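A small Python sketch of the naïve performance measure's two error counts, #1out and #0in, following the definitions given above. The normalization by the number of node pairs is an assumption made only for this sketch, since the exact denominator of the report's formula is not reproduced here; the function name naive_performance is likewise an illustrative choice.

    import itertools

    def naive_performance(A, clusters):
        """Error counts of the naive performance measure for graph A and partition 'clusters'.

        A        : adjacency matrix (0/1, symmetric) of the input graph G
        clusters : list of sets of node indices (the partition P), assumed non-overlapping
        """
        n = len(A)
        label = {}
        for cid, members in enumerate(clusters):
            for v in members:
                label[v] = cid

        ones_out = 0   # edges of G whose endpoints lie in different clusters (#1out)
        zeros_in = 0   # same-cluster node pairs that are not edges of G (#0in)
        for i, j in itertools.combinations(range(n), 2):
            if A[i][j] and label[i] != label[j]:
                ones_out += 1
            if not A[i][j] and label[i] == label[j]:
                zeros_in += 1

        # Normalizing by the number of node pairs keeps the value in [0, 1];
        # this particular denominator is an assumption of this sketch.
        return (ones_out + zeros_in) / (n * (n - 1) / 2)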
Page Ranking in MCL [8]
The concept of page ranking for web pages can easily be applied to graph partitioning in real-world applications. Communicating across the web has become an integral part of everyday life, and this communication is enabled in part by scientific studies of the structure of the web. We consider a simple model, known as the random surfer model: the web is regarded as a fixed set of pages, with each page containing a fixed set of hyperlinks, each link being a reference to some other page. The probability of the next move of an entity from one web page to another can then be analyzed through an adjacency/transition matrix in which different web pages are connected if there is a hyperlink between them. The same analogy can be applied to real-world graph partitioning problems, where the input is a simple graph indicating the connections between various nodes. This graph can be converted into a transition matrix, which is further used to compute the page ranks of the individual nodes.

Thus, the importance of various nodes can be determined from the values of their page ranks. When this concept is combined with the MCL technique, it gives optimized results: nodes with higher page ranks end up situated in different clusters.

References:
1. Jain, A. K., Murty, M. N. and Flynn, P. J., "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
2. Van Dongen, S., "Graph Clustering by Flow Simulation", PhD Thesis, University of Utrecht, 2000.
3. Han, J. and Kamber, M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.
4. Xu, R. et al., "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
5. Foggia, P. et al., "Benchmarking Graph-Based Clustering Algorithms", Image and Vision Computing, Vol. 27, 2009, pp. 978-988.
6. Das, S. et al., "Metaheuristic Clustering", Studies in Computational Intelligence, Vol. 178, Springer, 2009.
7. Jauregui, J., "Markov Chains: Google's PageRank Algorithm", 2012.
8. Li, C. et al., "Method for Evaluating the Importance of Power Grid Nodes Based on PageRank Algorithm", IET Generation, Transmission & Distribution, Vol. 8, Iss. 11, 2014, pp. 1843-1847.
9. Xiong, Y. et al., "Top-k Similarity Join in Heterogeneous Information Networks", IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 6, June 2015.
10. Ozdemir, A. et al., "Hierarchical Spectral Consensus Clustering for Group Analysis of Functional Brain Networks", IEEE Transactions on Biomedical Engineering, 2015, DOI 10.1109/TBME.2015.2415733.
11. http://introcs.cs.princeton.edu/java/16pagerank/ [Last accessed: 16 July 2015]