International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 4 – Feb 2014, ISSN: 2231-5381, http://www.ijettjournal.org

Bonding By Space Method for Effective Distance Calculation

Mr. V. Harikrishnan#1, Ms. D. Dhanabakyam#2
1 Assistant Professor, Department of BCA, K.S.Rangasamy College of Arts & Science, Tiruchengode, Namakkal DT, Tamilnadu, India-637215.
2 Assistant Professor, Department of Computer Applications, Sri Jayendra Saraswathy Maha Vidhyalaya College of Arts & Science, Coimbatore, Tamilnadu, India.

Abstract
The Bonding by Space (BBS) model is used to estimate the distance between transactions with local consistency and global connectivity information. Ant colony optimization (ACO) techniques are used for the data clustering process, and the visited-path management behavior of ants is used in the ant colony optimization schemes. The ant colony clustering algorithm is integrated with the Bonding by Space based distance measure. Fuzzy logic techniques are used to analyze complex relationships between the objects. Data partitioning using machine learning techniques is performed with the distance measures. Similarity between transactions is usually estimated using distance measurement algorithms such as the Euclidean distance measure and the cosine distance measure. Transaction assignments for the clusters are carried out with respect to the distance measures.

Keywords
Fuzzy logic, ACO, distance, partitioning, cluster, weights

1. INTRODUCTION

Cluster analysis is an important research branch of data mining. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters [1]. Ants and other gregarious insects exhibit decentralization, self-organization, pheromone communication, cooperation and other such characteristics.
The ant colony algorithm imitates such intelligent behavior and applies it to the solution of hard computational problems. The inspiration for ant colony clustering comes from the way ants accumulate their dead and sort their larvae. The earliest work in this area was initiated by Deneubourg et al. [2]. According to the similarity between a data object and its surrounding objects, the algorithm decides whether ants move randomly, pick up or drop data objects in order to cluster the data. This algorithm is in effect a density-based clustering method, which has difficulty with data sets of uneven density distribution.

The Euclidean distance measure and the cosine distance measure consider the local consistency factor only. These distance measures do not consider the global connectivity information. The distance with connection measure model is therefore enhanced with fuzzy logic. The transaction weights are updated using a fuzzification process, in which all the attribute weight values are updated with a fuzzy set weight value. The Bonding by Space model is tuned to estimate the distance between transactions using the fuzzy set values, so that the distance measure model handles uneven transaction distributions efficiently. The ant colony clustering algorithm is also improved with fuzzy logic: the similarity computations are carried out with fuzzy distance measurement models. Handling of uneven data distributions, an accurate distance measure and cluster accuracy are the features of the proposed clustering algorithm.

Many researchers have put forward improved or new ant colony clustering algorithms (ACCA) [3], [4], [5], [6]. However, the core idea of such algorithms is the comparison of the Euclidean distances between objects, or between objects and cluster centroids. Each object is assigned to the cluster whose center is nearest to it.
These pure Euclidean distance-based ant colony clustering algorithms are suitable for discovering spherical clusters but encounter difficulty in discovering clusters of arbitrary shape. Real data sets, however, have a wide variety of complex structures, which calls for an ant colony clustering algorithm capable of finding clusters of arbitrary shape.

This paper proposes an improved formula for calculating the distance. The basic idea is to extend the traditional Euclidean distance with a measure of connectivity between the data. Used as the clustering criterion, such an improved distance between data objects reflects not only the local consistency, meaning that data points close in location have a high affinity, but also the global connectivity, meaning that data points located in the same manifold structure have a high affinity [10]. Compared with density-based methods, the method proposed in this paper is better able to describe the inherent clustering characteristics of data sets and is suitable for data sets with uneven density distribution. The experiments demonstrate that the ant colony clustering algorithm based on the improved distance calculation formula can discover clusters of arbitrary shape.

2. FUZZY LOGIC CONCEPTS

This section gives an overview of fuzzy logic and fuzzy sets.

2.1. Fuzzy Logic

Fuzzy logic starts with and builds on a set of user-supplied human language rules. Fuzzy systems convert these rules to their mathematical equivalents. This simplifies the job of the system designer and the computer, and results in a much more accurate representation of the way systems behave in the real world. Additional benefits of fuzzy logic include its simplicity and its flexibility.
Fuzzy logic can handle problems with imprecise and incomplete data, and it can model nonlinear functions of arbitrary complexity. "If there is not a good plant model, or if the system is changing, then fuzzy will produce a better solution than conventional control techniques," says Bob Varely, a Senior Systems Engineer at Harris Corp., an aerospace company in Palm Bay, Florida. A fuzzy system can be created to match any set of input data. The Fuzzy Logic Toolbox makes this particularly easy by supplying adaptive techniques such as adaptive neuro-fuzzy inference systems (ANFIS) and fuzzy subtractive clustering.

Fuzzy logic models, called fuzzy inference systems, consist of a number of conditional "if-then" rules. For the designer who understands the system, these rules are easy to write, and as many rules as necessary can be supplied to describe the system adequately. In fuzzy logic, unlike standard conditional logic, the truth of any statement is a matter of degree. A standard inference rule has the form p → q; in fuzzy logic it is also possible to say (0.5·p) → (0.5·q). For example, in a rule of the form "if cold then on", both variables, cold and on, map to ranges of values. Fuzzy inference systems rely on membership functions to explain to the computer how to calculate the correct value between 0 and 1. The degree to which any fuzzy statement is true is denoted by a value between 0 and 1. The rule-based approach and the flexible membership function scheme not only make fuzzy systems straightforward to create, but also simplify the design of systems and ensure that the system can easily be updated and maintained over time.

2.2. Fuzzy Set

Fuzzy Set Theory was formalized by Professor Lotfi Zadeh at the University of California in 1965. What Zadeh proposed is very much a paradigm shift; it first gained acceptance in the Far East, and its successful application has ensured its adoption around the world.
A paradigm is a set of rules and regulations that defines boundaries and tells us what to do to be successful within them. Bivalent Set Theory can be somewhat limiting when it is used to describe a humanistic problem mathematically. The most obvious limiting feature of bivalent sets is that they are mutually exclusive: it is not possible to have membership of more than one set. Clearly, it is not accurate to define a transition from a quantity such as warm to hot by the application of one degree Fahrenheit of heat. In the real world a smooth, unnoticeable drift from warm to hot would occur. This natural phenomenon can be described more accurately by Fuzzy Set Theory.

3. DATA CONNECTIVITY BASED DISTANCE ESTIMATION

As shown in Fig. 1 (a), data objects a and b belong to the same cluster, while object c belongs to a different cluster from a and b. The definition of clustering requires that data objects within the same cluster are similar to one another and dissimilar to the objects in other clusters. We would therefore expect the similarity between a and b to be higher than the similarity between a and c. According to the Euclidean distance, however, d(a, c) is less than d(a, b), which implies that a is more similar to c than to b. Thus the Euclidean distance does not reflect the global consistency of data sets and cannot find clusters of arbitrary shape.

As shown in Fig. 1 (c) and Fig. 1 (d), the number of data points between objects a and b in (c) is smaller than in (d). Although the Euclidean distance between a and b is the same in (c) and (d), a and b in (d) are clearly more likely to belong to the same cluster than a and b in (c).
Suppose we link each data point of (c) and (d) to its three nearest data points (the figure does not draw all the connections). The number of reachability paths between a and b in (c) is then smaller than in (d). Following this idea, we can use the number of reachability paths to measure the connectivity between data points, and hence the similarity between data points.

Figure 1. (a) The Euclidean distance metric cannot reflect the global consistency; (b) the weight of (n+1)-step reachability paths between data points must be less than that of n-step reachability paths; (c) and (d) the number of reachability paths can measure the connectivity between data points.

In the same way, in Fig. 1 (b) the number of reachability paths between objects a and d is much larger than that between objects a and b. If we simply used the number of reachability paths to measure the similarity between data points, the similarity of a and d would be higher than that of a and b, which is obviously unreasonable. The path counts therefore have to be processed so that the weight of (n+1)-step reachability paths between data points is less than that of n-step reachability paths.

An improved distance calculation method based on these data characteristics can reflect the local compactness of the data and, simultaneously, the global connectivity between the data. We put forth a novel formula for distance calculation, named BBS (distance with connection).

First, establish the adjacency matrix of the data set. Using the Euclidean distance formula, calculate for each data object $x_i$ ($i = 1, \dots, N$) the set $\Omega_i = \{x_{i_j},\ j = 1, \dots, L\}$ of its $L$ ($L > 0$) nearest neighbors [8]. Then link $x_i$ ($i = 1, \dots, N$) to each of its $L$ nearest neighbors; the links are undirected.
In this way, we construct an undirected graph $G = (X, V)$ and the adjacency matrix $R = [R_{ij}]_{N \times N}$ of the data set, where

$$R_{ij} = \begin{cases} 1, & x_j \in \Omega_i \ \text{or}\ x_i \in \Omega_j \\ 0, & \text{otherwise} \end{cases}$$

and $V$ is the set of all links between data objects. Let $R^s = [R^s_{ij}]_{N \times N}$ be the $s$-th power of $R$, so that $R^s_{ij}$ is the number of $s$-step reachability paths between $x_i$ and $x_j$.

Definition 1: The connectivity between data objects $x_i$ and $x_j$ is defined as

$$Conn(x_i, x_j) = \sum_{s=1}^{step} conn_s(x_i, x_j)$$

where $conn_s(x_i, x_j)$ denotes the connectivity contributed by the $s$-step reachability paths between $x_i$ and $x_j$, and $step$ ($1 \le step \le N$) is a parameter representing the maximum number of steps.

The higher the connectivity between data points $x_i$ and $x_j$, the more reachability paths exist between them, which reflects a higher similarity and a closer distance between $x_i$ and $x_j$. Hence $BBS(x_i, x_j) \propto 1 / Conn(x_i, x_j)$. Furthermore, the definition of $BBS(x_i, x_j)$ should reflect the local consistency at the same time, so we define the BBS distance of data objects as follows.

Definition 2: The BBS distance between data objects $x_i$ and $x_j$ is defined as:

(3)

where

$$Dis(x_i, x_j) = \sqrt{\sum_{v=1}^{m} (x_{iv} - x_{jv})^2}$$

is the Euclidean distance between $x_i$ and $x_j$, $m$ denotes the number of attributes of a data object, $Max$ is a very large positive constant and $M$ is a positive constant.

If $Dis(x_i, x_j)$ is short and $Conn(x_i, x_j)$ is high, $BBS(x_i, x_j)$ will be small, and the data objects $x_i$ and $x_j$ will be clustered into the same cluster with a high probability. If $Dis(x_i, x_j)$ is short but $Conn(x_i, x_j)$ is low, $BBS(x_i, x_j)$ will still be large, and $x_i$ and $x_j$ will not be grouped in the same cluster. Formula (3) therefore effectively prevents data objects with little connectivity or a large distance from being clustered into the same category.
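The construction above (L-nearest-neighbor adjacency matrix, path counts from powers of R, connectivity, BBS distance) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not give the exact form of conn_s or of formula (3), so the sketch assumes a hypothetical geometric weight `decay**s` per s-step path (so that longer paths weigh less) and a hypothetical combination `M * Dis / Conn`, with `Max` used when two objects are not connected within `step` steps. The function names and the `decay` parameter are illustrative only.

```python
import numpy as np

def knn_adjacency(X, L):
    """Symmetric 0/1 adjacency matrix: R[i, j] = 1 if x_j is one of the L
    nearest neighbours of x_i, or x_i is one of the L nearest of x_j."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances Dis(x_i, x_j).
    dis = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n = len(X)
    R = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        order = np.argsort(dis[i], kind="stable")
        neighbours = [j for j in order if j != i][:L]
        R[i, neighbours] = 1
    return np.maximum(R, R.T), dis  # undirected links

def connectivity(R, step, decay=0.5):
    """Conn = sum over s = 1..step of conn_s.
    ASSUMPTION: conn_s = decay**s * (R^s), i.e. each s-step path is
    down-weighted geometrically; the paper's weighting is not recoverable."""
    conn = np.zeros(R.shape, dtype=float)
    power = np.eye(len(R))
    for s in range(1, step + 1):
        power = power @ R            # power[i, j] = number of s-step paths
        conn += (decay ** s) * power
    return conn

def bbs_distance(X, L, step, M=1.0, Max=1e9, decay=0.5):
    """ASSUMED stand-in for formula (3): M * Dis / Conn when the objects are
    connected within `step` steps, and the large constant Max otherwise."""
    R, dis = knn_adjacency(X, L)
    conn = connectivity(R, step, decay)
    bbs = np.full(dis.shape, Max)
    linked = conn > 0
    bbs[linked] = M * dis[linked] / conn[linked]
    np.fill_diagonal(bbs, 0.0)
    return bbs
```

For six points on a line forming two well-separated groups, for example `[[0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0]]` with `L=1` and `step=3`, no path links the two groups, so every cross-group pair receives the distance `Max` regardless of its Euclidean distance, while pairs inside a group stay close.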
At the same time, the definition of connectivity is not limited to the neighborhood within a radius r of a given object; instead, the L nearest data points are directly accessible. As a result, the neighborhood is relatively narrow in a dense region and relatively wide in a sparse region. Compared with density-based methods such as DBSCAN, BBS can produce more natural clustering results and is suitable for data sets with uneven density distribution. BBS satisfies the following basic properties: $BBS(x, y) \ge 0$, with equality if and only if $x = y$; and $BBS(x, y) = BBS(y, x)$. BBS does not always satisfy the triangle inequality, so BBS is a generalized distance.

4. ANT COLONY OPTIMIZATION

The ant colony optimization (ACO) algorithm is a probabilistic technique for solving computational problems that can be reduced to finding good paths through graphs. This algorithm is a member of the ant colony algorithms family, in swarm intelligence methods, and it constitutes some metaheuristic optimizations. Initially proposed by Marco Dorigo in 1992 in his PhD thesis, the first algorithm aimed to search for an optimal path in a graph, based on the behavior of ants seeking a path between their colony and a source of food. The original idea has since diversified to solve a wider class of numerical problems and, as a result, several problems have emerged, drawing on various aspects of the behavior of ants.

5. ANT COLONY CLUSTERING ALGORITHM BASED ON BBS

This article improves the ant colony clustering algorithm proposed in the literature [3].
Given a data set $\{x_1, x_2, \dots, x_N\}$ of $N$ objects and the number of clusters to form, $K$ ($0 < K < N$), clustering analysis organizes the $N$ objects into $K$ clusters so as to minimize the clustering objective function $F$, where each object $x_i$ ($i = 1, \dots, N$) has $m$ attributes, expressed as $\{x_{i1}, x_{i2}, \dots, x_{im}\}$. The objective function is computed as follows:

$$\min F(w, C) \qquad (4)$$

subject to:

$$\sum_{j=1}^{K} w_{ij} = 1, \quad i = 1, 2, \dots, N \qquad (5)$$

$$\sum_{i=1}^{N} w_{ij} \ge 1, \quad j = 1, 2, \dots, K \qquad (6)$$

Here, $w$ is an $N$-by-$K$ weighting matrix with elements

$$w_{ij} = \begin{cases} 1, & x_i \in cluster_j \\ 0, & x_i \notin cluster_j \end{cases} \qquad (7)$$

and $C_j$ denotes the center of $cluster_j$ ($j = 1, \dots, K$). Instead of taking the mean value of the objects in a cluster as a reference point, we pick actual objects to represent the clusters, using one representative object per cluster. Each representative object is the medoid, or most centrally located object, of its cluster, which diminishes the sensitivity of the algorithm to outliers.

In the algorithm, we use $R$ ($R \ge 5$) ants to build solutions $s_q = \{c_{q1}, c_{q2}, \dots, c_{qN}\}$ ($q = 1, \dots, R$), strings of length $N$, where $c_{qi}$ ($i = 1, 2, \dots, N$) is the class identifier of data object $x_i$ and $c_{qi} \in \{cluster_1, \dots, cluster_K\}$. $c_{qi} = c_{qj}$ means that objects $x_i$ and $x_j$ belong to the same cluster in the solution built by the $q$-th ant; conversely, $c_{qi} \ne c_{qj}$ denotes that $x_i$ and $x_j$ belong to different clusters [3]. For example, given $N = 5$, $\{x_1, x_2, \dots, x_5\}$ and $K = 3$, a feasible solution is $s_1 = \{cluster_1, cluster_3, cluster_2, cluster_2, cluster_3\}$, which means that $x_1$ is assigned to $cluster_1$, $x_2$ is assigned to $cluster_3$, and so on.

At the beginning of the algorithm, the $N$-by-$N$ BBS distance matrix is initialized according to formula (3), where we recommend setting the parameters $L$ and $step$ within the ranges $3\% \cdot N \le L \le 4\% \cdot N$ and $N / (K \cdot L) \le step \le N$. Subsequently, calculations involving the BBS distance can be completed efficiently in linear time.
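The solution encoding, constraints (5) to (7) and the medoid representation described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: clusters are numbered from 0, `dist` stands for any precomputed N-by-N distance matrix (for instance the BBS matrix), and, because the right-hand side of formula (4) is not given in the text, `objective` assumes the common medoid-based cost, the sum of the distances from each object to the medoid of its cluster. All function names are illustrative.

```python
import numpy as np

def weight_matrix(solution, K):
    """Formula (7): w[i, j] = 1 if object i is assigned to cluster j, else 0."""
    solution = np.asarray(solution)
    w = np.zeros((len(solution), K), dtype=int)
    w[np.arange(len(solution)), solution] = 1
    return w

def is_feasible(w):
    """Constraints (5) and (6): every object lies in exactly one cluster
    and every cluster holds at least one object."""
    return bool((w.sum(axis=1) == 1).all() and (w.sum(axis=0) >= 1).all())

def medoids(solution, dist, K):
    """Most centrally located member of each cluster (feasible solutions only)."""
    solution = np.asarray(solution)
    result = []
    for j in range(K):
        members = np.flatnonzero(solution == j)
        within = dist[np.ix_(members, members)].sum(axis=1)
        result.append(int(members[np.argmin(within)]))
    return result

def objective(solution, dist, K):
    """ASSUMED form of F: total distance from each object to its cluster medoid."""
    centres = medoids(solution, dist, K)
    return float(sum(dist[i, centres[c]] for i, c in enumerate(solution)))

# The example solution s1 from the text, with clusters renumbered from 0:
# {cluster1, cluster3, cluster2, cluster2, cluster3} -> [0, 2, 1, 1, 2]
s1 = [0, 2, 1, 1, 2]
points = np.array([0.0, 5.0, 10.0, 11.0, 6.0])    # hypothetical 1-D data
dist = np.abs(points[:, None] - points[None, :])  # stand-in distance matrix
```

With these hypothetical data, `is_feasible(weight_matrix(s1, 3))` holds, whereas a solution that leaves a cluster empty, such as `[0, 0, 0, 0, 0]`, violates constraint (6).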
Note that if A and B are both N-by-N matrices, the computational complexity of A×B is $O(N^{\log_2 7})$ using Strassen matrix multiplication. The computational complexity of initializing the BBS distance matrix therefore lies between $O(N^2)$ and $O(N^{\log_2 7})$. Then, initialize the $N$-by-$K$ pheromone matrix $[\tau_{ij}]_{N \times K}$, setting each element to an initial pheromone value $\tau_0$, where the matrix element $\tau_{ij}$ denotes the concentration of pheromone of object $x_i$ relative to $cluster_j$. In each loop, each artificial ant constructs a pheromone-based solution and thereafter updates the pheromone matrix based on the quality of the solutions achieved. Under the guidance of the pheromone matrix, the ants improve the quality of the solution step by step until a stopping criterion is met.

5.1. Solution construction

In the ant colony algorithm, the ants construct a solution (S) by means of formula (8). An ant located at object $x_i$ ($i = 1, \dots, N$) selects $cluster_j$ ($j = 1, \dots, K$) with probability $P_{ij}$:

$$P_{ij} = \frac{\tau_{ij}\, \eta_{ij}^{\beta}\, path_{ij}}{\sum_{k=1}^{K} \tau_{ik}\, \eta_{ik}^{\beta}\, path_{ik}}, \quad j = 1, \dots, K \qquad (8)$$

where $P_{ij}$ is the probability that object $x_i$ belongs to $cluster_j$; $\eta_{ij} = 1 / BBS(x_i, C_j)$ is the heuristic information value ($BBS(x_i, C_j)$ is the BBS distance between object $x_i$ and the center of $cluster_j$); and $\beta$ is the heuristic factor, indicating the relative importance of the heuristic information. $path_{ij}$ denotes the number of excellent ants, that is, ants which constructed good solutions and grouped $x_i$ into $cluster_j$. If $path_{ij}$ is very large, we can infer that a good solution must group $x_i$ into $cluster_j$. Formula (8) reflects this: if $path_{ij}$ is very large, $x_i$ is grouped into $cluster_j$ with a high probability. In this way a good solution can be built rapidly, that is, the ant colony clustering algorithm converges fast.

5.2.
Pheromone Update Rule

After each loop of the algorithm, i.e., when the $R$ ($R \ge 5$) ants have each completed a solution, we sort the solutions in ascending order of the clustering objective function value. We then get $S\_sorted = \{s'_1, s'_2, \dots, s'_R\}$, where $s'_q = \{c'_{q1}, c'_{q2}, \dots, c'_{qN}\}$ ($q = 1, \dots, R$), of which the top 20% of solutions, $S\_best = \{s'_q \in S\_sorted,\ 1 \le q \le 20\% \cdot R,\ q \in Z\}$, are used to update the pheromone matrix. The pheromone updating formula is as follows:

$$\tau_{ij}(t+1) = (1 - \rho)\, \tau_{ij}(t) + \Delta\tau_{ij}(t) \qquad (9)$$

$$\Delta\tau_{ij}(t) = \begin{cases} Q / BBS(x_i, C_j^{s'_q}), & c'_{qi} = cluster_j \ \text{and}\ s'_q \in S\_best \\ 0, & \text{otherwise} \end{cases}$$

$$path_{ij} = \begin{cases} path_{ij} + 1, & c'_{qi} = cluster_j \ \text{and}\ s'_q \in S\_best \\ path_{ij}, & \text{otherwise} \end{cases}$$

where $\rho$, $0 \le \rho \le 1$, is a user-defined parameter called the evaporation coefficient, $Q$ is a positive constant, and $BBS(x_i, C_j^{s'_q})$ represents the BBS distance between $x_i$ and the center ($C_j$) of $cluster_j$ in solution $s'_q$. Selecting the 20% better solutions as the basis for pheromone updating not only records relatively better solutions through the updated pheromone trail, but also expands the search space of the algorithm, avoiding the stagnation that derives from an excessive trail level on the moves of one solution.

TABLE I. AVERAGE F-MEASURE OF CLUSTERING [9]

| Data set | Ant colony 1 (SACA) | Ant colony 2 (SACA) | Ant colony 3 (SACA) | Combination of the 3 ant colonies (MACCA) | Algorithm suggested in this paper |
| Iris | 0.918 | 0.910 | 0.915 | 0.927 | 0.925 |

Based on the data in the table, this algorithm is better than the average performance of SACA and somewhat less effective than MACCA.

Fig. 2 (a) and Fig. 2 (c) show the simulation results on the Line and Circle-square synthetic data sets obtained with the ant colony clustering algorithm (ACCA) of the literature [3]; Fig. 2 (b) and Fig. 2 (d) show the simulation results on the same Line and Circle-square synthetic data sets obtained with the algorithm of this paper.

Figure 2. Clusters obtained by the algorithms in literature [3] and this paper for synthetic data sets

6. FUZZY ENABLED CLUSTERING SCHEME

The data clustering scheme is designed with dynamic distance measures. The ant colony clustering algorithm is integrated with the dynamic distance measure.
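One loop of the ant colony clustering algorithm that this scheme builds on, namely solution construction by formula (8) followed by the pheromone update of formula (9), can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: `path` is assumed to start at 1 for every entry (a zero start would make every probability in formula (8) undefined), the increments of the selected best solutions are assumed to be summed, and a small hypothetical `eps` guards against division by zero when an object is itself the cluster center. `bbs_to_centre[q]` is assumed to hold, for solution q, the N-by-K matrix of BBS distances from each object to each cluster center. All names are illustrative.

```python
import numpy as np

def construct_solution(tau, eta, path, beta, rng):
    """Formula (8): object i picks cluster j with probability proportional
    to tau[i, j] * eta[i, j]**beta * path[i, j]."""
    score = tau * eta ** beta * path
    prob = score / score.sum(axis=1, keepdims=True)
    return np.array([rng.choice(prob.shape[1], p=row) for row in prob])

def update_pheromone(tau, path, solutions, costs, bbs_to_centre,
                     rho=0.1, Q=1.0, top=0.2, eps=1e-12):
    """Formula (9): evaporate, then reinforce the assignments made by the
    best `top` fraction of solutions (lowest objective value first)."""
    n_best = max(1, int(top * len(solutions)))
    best = np.argsort(costs)[:n_best]
    rows = np.arange(tau.shape[0])
    delta = np.zeros_like(tau, dtype=float)
    new_path = np.array(path, dtype=float)
    for q in best:
        s = np.asarray(solutions[q])
        # BBS distance from each object to the center of its own cluster.
        d = np.maximum(bbs_to_centre[q][rows, s], eps)
        delta[rows, s] += Q / d      # ASSUMPTION: contributions are summed
        new_path[rows, s] += 1       # one more "excellent ant" for (i, j)
    return (1.0 - rho) * tau + delta, new_path
```

A driver would call `construct_solution` once per ant with a generator such as `np.random.default_rng(seed)`, evaluate the objective of each solution, and then call `update_pheromone` once per loop until the stopping criterion is met.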
The distance with connection measure is used to maintain the local consistency and global connectivity factors. The distance with connection measure model is enhanced using the fuzzy logic technique. The fuzzy enhancement is done in two areas: the distance estimation function is enhanced with fuzzy models to handle uneven data distributions, and the ant colony clustering algorithm is enhanced with a fuzzy relationship analysis model. The fuzzy logic based ant colony clustering algorithm is thus designed with the dynamic distance measure and fuzzy enabled ant colony clustering models.

The scheme consists of four phases: fuzzification, fuzzy based distance estimation, clustering process and cluster analysis. The fuzzification phase converts the attribute weights of each transaction into fuzzy sets. The distance and global connectivity are analyzed using the fuzzy weight values. The clustering process is carried out with the dynamic distance based ant colony clustering algorithm and the fuzzy enabled ant colony clustering algorithm. Cluster accuracy is analyzed in the cluster analysis phase.

The attribute weight values of each transaction are passed into the fuzzification process, and all the attribute weight values are converted into fuzzy weight values. The fuzzy weight conversion process maps the weight values into the range 0 to 1. In some data sets the weight value distribution is not even; the fuzzification process removes the overhead of calculating distances in such uncertain data distributions. The distance measure estimates the distance with global connectivity factors, and fuzzy sets are applied to the distance estimation process. The dynamic distance based ant colony clustering algorithm is enhanced with fuzzy comparison for dynamic similarity analysis. All the transaction analysis is carried out with fuzzy enabled weight values.
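The fuzzification step described above, which maps attribute weights into the range 0 to 1, can be illustrated with a short Python sketch. The text does not specify the membership function, so this sketch assumes a simple linear (min-max) membership per attribute; `fuzzify` is a hypothetical helper name, and constant attributes are mapped to 0 by convention.

```python
import numpy as np

def fuzzify(weights):
    """Map each attribute (column) of a transaction-by-attribute weight
    matrix linearly onto [0, 1].
    ASSUMPTION: linear min-max membership; the paper's function is not given."""
    weights = np.asarray(weights, dtype=float)
    low = weights.min(axis=0)
    span = weights.max(axis=0) - low
    span[span == 0] = 1.0          # constant attribute: avoid division by zero
    return (weights - low) / span
```

For instance, `fuzzify([[1, 10], [3, 10], [5, 10]])` maps the first attribute to 0, 0.5 and 1 and leaves the constant second attribute at 0, so attributes with very different raw scales contribute comparably to the subsequent distance estimation.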
The cluster results are updated using the actual attribute weights, while the comparison process is performed with the fuzzy weights. The cluster analysis is done with a set of parameters: the precision/recall and fitness measures are used in the cluster analysis process.

The distance with connection measure model is enhanced with fuzzy logic. The transaction weights are updated using the fuzzification process, so that the distance measure model handles uneven transaction distributions efficiently. The ant colony clustering algorithm is also improved with fuzzy logic: the similarity computations are carried out with fuzzy distance measurement models, and the global relationship between transactions is used in the system.

The system is divided into four modules: Distance Analysis, Fuzzification Process, Ant Colony Clustering and Fuzzy Ant Colony Clustering. The distance analysis module is designed to measure local and global distance. The fuzzification module is designed to estimate fuzzy weights for transactions. The ant colony clustering module is designed with the distance with connection model. Fuzzy weights are used in the fuzzy based ant colony clustering module.

6.1. Distance Analysis

Distance analysis is performed to estimate transaction relevancy. Local and global distance estimation schemes are used in the system. Local distance is estimated with the current transaction information only. Global distance estimation uses the transaction details and support information.

6.2. Fuzzification Process

A fuzzy model is used to assign weights in the range 0 to 1. Transaction weights are converted into fuzzy based weights. The fuzzy weights are used for the distance estimation. The support value is also calculated using fuzzy weights.

6.3. Ant Colony Clustering

Ant colony optimization is used for the clustering process.
Transaction weights are used in the clustering process. Transaction comparison is done with the ant behavior. One-pass analysis is used in the system.

6.4. Fuzzy Ant Colony Clustering

The clustering process is performed using fuzzy weights. The fuzzy weight based distance is used for the relevancy estimation. Global distance is used for the clustering process. Fuzzy relations are integrated with the ant colony clustering model.

7. CONCLUSION

Starting from the Euclidean distance between objects, the system uses data connectivity to obtain an improved formula for calculating the distance, named BBS. BBS reflects not only the local consistency but also the global connectivity between objects, and it overcomes the disadvantage of the Euclidean distance in data clustering. We then improve the ant colony clustering algorithm by using BBS and fuzzy logic concepts. Our experimental results on both synthetic and real-world data sets show that the improved algorithm can discover clusters of arbitrary shape and achieves a better clustering effect than the earlier techniques.

8. REFERENCES

[1] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, San Francisco: Morgan Kaufmann, 2006, pp.383.
[2] Deneubourg JL, Goss S, et al. "The dynamics of collective sorting: robot-like ant and ant-like robot," In: Meyer JA, Wilson SW ed. Proceedings first conference on simulation of adaptive behavior: from animals to animats. Cambridge, MA: MIT Press, 1991, pp.356–365.
[3] Shiyong Li, Baojiang Zhao. "Ant Colony Clustering Algorithm," Measurement & Control, vol. 15, no. 11, 2007, pp.1590–1592.
[4] Parag M. Kanade and Lawrence O. Hall, "Fuzzy Ants and Clustering," IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, vol. 37, no. 5, September 2007, pp. 758–769.
[5] Xinbin Yang, Jinggao Sun and Dao Huang, "A New Clustering Method Based on Ant Colony Algorithm," Proceedings of the 4th World Congress on Intelligent Control and Automation, June 10–14, 2002, pp.2222–2226.
[6] Jian Gao. "Cluster Analysis Based on Parallel Ant Colony Adaptive Algorithm," Computer Engineering and Application, vol. 25, 2003, pp.78–79.
[7] Maoguo Gong and Liefeng Bo. "Density-Sensitive Evolutionary Clustering," The 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag Berlin Heidelberg, 2007, pp.507–514.
[8] Julia Handl and Joshua Knowles. "An Evolutionary Approach to Multiobjective Clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, Feb. 2007, pp.60.
[9] Yan Yang, Fan Jin and Mohamed Kamel, "Clustering Combination Based on Ant Colony Algorithm," Journal of the China Railway Society, vol. 4, no. 26, 2004, pp.6–69.
[10] Miguel A. Sanz-Bobi and Mario Castro, "IDSAI: A Distributed System for Intrusion Detection Based on Intelligent Agent," 5th International Conference on Internet Monitoring and Protection, IEEE, 2010.