International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 4 - Dec 2013

Enhanced cBoids Algorithm for Web Logs Clustering

R Suguna 1, D Sharmila 2
1 Assistant Professor, Computer Science and Engineering, Arunai College of Engineering, Thiruvannamalai – 606 603
2 Professor and Head, Electronics and Instrumentation Engineering, Bannari Amman Institute of Technology, Sathyamangalam – 638 401

Abstract— Web based applications generate enormous amounts of data, and managing and extracting information from such voluminous data is difficult. Grouping the data on suitable parameters provides a way to maintain the data in an easier and more constructive manner; in the data mining perspective such grouping is called clustering. Web data clustering groups web information into clusters in order to organize web documents, facilitate data availability and access, satisfy user preferences, understand the navigation behaviour of users, and improve information retrieval and content delivery. Clustering techniques are applied to web data in two ways: (i) session based clustering and (ii) link based clustering. In this paper the EBoid (Enhanced Boid) algorithm is used to group users who have similar browsing patterns, and the User Cluster Comparison (UCC) algorithm is used to identify the likeness value for particular websites. The performance of the enhanced boids algorithm is evaluated with precision, recall and f-measure for three websites: google.com, facebook.com and microsoft.com.

Keywords— cBoids, EBoid, cluster, web logs.

I. INTRODUCTION

Web based applications generate enormous amounts of data, and managing and extracting information from such voluminous data is difficult. Grouping the data on suitable parameters provides a way to maintain the data in an easier and more constructive manner.
In the data mining perspective such grouping is called clustering. It is essential to group web data for information accessibility, understanding user behaviour and improving information retrieval [1].

ISSN: 2231-5381

Clustering is an important step in data analysis, machine learning and statistics. It is defined as the process of grouping N item sets into distinct clusters based on a similarity measure. A good clustering technique produces clusters that have low inter-cluster and high intra-cluster similarity; the objective of clustering is to maximize the similarity of the data points within each cluster and the dissimilarity across clusters [2]. The two main cases in the clustering process are: (i) two or more clusters are combined to form new clusters, and (ii) all items start in one cluster, which is recursively split into the most appropriate clusters. The process continues until a stopping measure is achieved. Clustering techniques face two issues: the first is to find the optimal number of clusters in a given dataset, and the second is to calculate the relative measure between two clusters to assess the possibility of combining them [3]. In conventional clustering techniques, objects of similar character are allocated to the same cluster while objects that differ in nature belong to different clusters; such clusters are known as hard clusters. In soft clustering, an object may be placed in more than one cluster. Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Users are grouped either session based or link based by using partitional, hierarchical and fuzzy based clustering algorithms. The aim of partitional clustering is to divide P points in N dimensions into M clusters.
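The intra-/inter-cluster criterion described above can be made concrete with a small sketch. This is a toy illustration with hypothetical data, assuming Euclidean distance as the similarity measure:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra_distance(cluster):
    # Average pairwise distance inside one cluster (0 for singletons).
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    if not pairs:
        return 0.0
    return sum(euclidean(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def centroid(cluster):
    # Component-wise mean of the cluster's points.
    return [sum(col) / len(cluster) for col in zip(*cluster)]

# Two well-separated toy clusters: tight within, far apart between.
c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra = max(mean_intra_distance(c1), mean_intra_distance(c2))
inter = euclidean(centroid(c1), centroid(c2))
print(intra < inter)  # a good clustering keeps intra-cluster distance small
```

A good partition keeps `intra` well below `inter`; the quality measures discussed later in the paper (compactness and separation) follow the same intuition.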
Hierarchical clustering forms a binary tree of the data by merging similar groups of points into one category, whereas fuzzy based clustering assigns a single data point to two or more groups on the basis of certain membership functions. Although these algorithms have been applied efficiently in various domains, they are mainly suited to classifying static numerical data [1]. Web data clustering is the method of grouping web information into clusters in order to organize web documents, facilitate data availability and access, satisfy user preferences, understand the navigation behaviour of users, and improve information retrieval and content delivery. Clustering techniques are applied to web data in two ways: (i) session based and (ii) link based [2]. In this paper the EBoid algorithm is used to group users who have similar browsing patterns, and the User Cluster Comparison (UCC) algorithm is used to identify the likeness value for particular websites.

Section I introduces the work. Section II details related work. Section III discusses the cBoids algorithm. Section IV presents the enhanced boids algorithm. Section V details the experiments and results. Section VI gives the conclusion and future work.

II. RELATED WORKS

Athena Vakali (2004) stated that clustering is a challenging job in a web based environment. Numerous clustering techniques have been proposed by researchers for diverse applications, performed either offline or online depending on the requirement. Sophia et al (2006) describe clustering as the process of grouping similar objects into the same group: the data objects within a group have higher similarity and more common access behaviour than those in other groups. Clustering is one of the pattern mining techniques used to group similar data items.
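The fuzzy assignment described above, where a single point belongs to several groups through membership values, can be illustrated with a fuzzy c-means style membership computation. This is a generic sketch; the fuzzifier m and the one-dimensional data are assumptions for illustration, not taken from the paper:

```python
def fuzzy_memberships(point, centers, m=2.0):
    # Fuzzy c-means style membership of one point in every cluster:
    # u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), so the memberships sum to 1.
    dists = [abs(point - c) for c in centers]
    if any(d == 0 for d in dists):            # point sits exactly on a center
        return [1.0 if d == 0 else 0.0 for d in dists]
    memberships = []
    for di in dists:
        s = sum((di / dj) ** (2 / (m - 1)) for dj in dists)
        memberships.append(1 / s)
    return memberships

# A point halfway between two centers belongs equally to both groups.
u = fuzzy_memberships(5.0, [0.0, 10.0])
print(u)  # [0.5, 0.5]
```

Unlike hard clustering, every point receives a weight in every group, which is what lets a single data point sit in two or more clusters at once.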
In web usage mining, a variety of clustering algorithms has been proposed to group web log files in two ways, namely (i) session based clustering and (ii) link based clustering. Session based clustering groups users who have similar navigation behaviour, while link based clustering groups web pages that have similar contents. Session based clustering helps to identify the interest level of website visitors and to group users who have similar browsing patterns; it mainly depends on the time spent on particular web pages during the surfing period. This literature survey discusses the existing research carried out in session based clustering to identify the user interest level in particular web pages and websites.

A session based clustering methodology is described by Athena et al (2004) to group web users and web sessions, with similarity based and model based approaches. In similarity based clustering, a sequence alignment method is used to measure the similarity between sessions and the BIRCH algorithm is used to cluster the web pages. In the model based clustering approach, model selection techniques are applied to construct the model structure and the parameters are estimated using maximum likelihood algorithms. George et al (2005) built on the model based clustering concept of Athena et al (2004); in this method users are grouped based on the time spent on each website. The authors propose a model based clustering algorithm to group the web logs based on sessions: each model has defined parameters, web pages are grouped under each model based on similarity, and the resultant clusters are validated using an equilibrium distribution.
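Session based grouping of the kind described above starts from per-user vectors of time spent on each website. A minimal sketch of building such vectors, assuming hypothetical cleaned-log records of the form (user, website, seconds):

```python
from collections import defaultdict

# Hypothetical cleaned log records: (user, website, seconds_spent).
records = [
    ("u1", "google.com", 120), ("u1", "facebook.com", 30),
    ("u2", "google.com", 110), ("u2", "facebook.com", 40),
    ("u3", "microsoft.com", 300),
]

def session_vectors(records, sites):
    # One vector per user: total time spent on each site, in a fixed order.
    totals = defaultdict(lambda: defaultdict(int))
    for user, site, secs in records:
        totals[user][site] += secs
    return {u: [totals[u][s] for s in sites] for u in totals}

sites = ["google.com", "facebook.com", "microsoft.com"]
vectors = session_vectors(records, sites)
print(vectors["u1"])  # [120, 30, 0]
```

Once every user is a fixed-length numeric vector, any of the clustering algorithms surveyed here can be applied to the vectors directly.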
George et al (2008) introduced the clustweb algorithm to group web pages that have similar weight in a directed graph. In the graph, each node represents a web page requested by the user and each edge represents a web page request; every node is associated with a weight. To avoid network traffic delay, the web pages that users are expected to visit in the near future are prefetched. A clustering algorithm with learning automata was developed by Babak et al (2011), who proposed three steps for effective clustering: first, web access patterns are transformed into weight vectors using learning automata; second, web pages are assigned to the nearest group based on their weight vectors; finally, a weighted fuzzy c-means algorithm is used to group the cluster centers. This method helps to identify the user's surfing behaviour exactly for further personalized services. Miao et al (2012) applied the K-means algorithm to cluster web users based on their browsing behaviour. The cluster quality is validated using the compactness and separation force of intra- and inter-cluster behaviour: users with high similarity satisfy the compactness measure, while dissimilar users have a greater separation force. A random indexing approach is used to prefetch web pages based on the formed clusters.

Existing research in web usage mining has attempted to extract the required knowledge from web log files. Owing to the nature of web log files, it is difficult to work with them directly, so preprocessing techniques and pattern discovery algorithms are applied to obtain meaningful information. It is better to split the data into groups with respect to meaningful parameters, which makes the task simpler and easier; hence a variety of clustering algorithms is applied to web log files to group them either link based or session based.
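The K-means style user grouping mentioned above can be sketched generically as follows. This is a toy illustration of Lloyd iterations on session vectors, not the cited authors' implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centers, iters=10):
    # Plain Lloyd iterations: assign each point to its nearest center,
    # then recompute each center as the mean of its assigned points.
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            groups[i].append(p)
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

# Toy session vectors (time spent on two sites) with two obvious groups.
points = [[120, 10], [130, 5], [10, 200], [5, 210]]
centers, groups = kmeans(points, centers=[[120, 10], [10, 200]])
print(len(groups[0]), len(groups[1]))  # 2 2
```

Compactness corresponds to points lying close to their own center, and separation to the final centers lying far apart.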
The above literature survey highlights some of the clustering algorithms developed by researchers.

III. CBOIDS ALGORITHM

The cBoids algorithm was developed by Marcio & Leandro (2008) for numeric data clustering in data mining. Its main concept is based on the boid model of Reynolds (1987), a swarm intelligence model inspired by the grouping behaviour of animals such as birds and fishes. In this model, a boid is the representation of a bird; the speed and route of the gathering of birds are controlled by three rules:

Cohesion rule - keeps the boids in the same group.
Separation rule - helps to avoid collisions between boids.
Alignment rule - coordinates each boid's flying direction with its neighbours.

The Reynolds (1987) boid model is suitable for creating realistic computer animations of the flocking behaviour of birds, fishes and animals. Marcio & Leandro (2008) modified it into the clustering boids (cBoids) algorithm for data clustering in data mining by adding two new rules, the centroid rule and the merging rule. An affinity value is calculated by measuring the similarity between data objects, and Reynolds' three basic rules are redefined with respect to this affinity value.

Centroid Rule - Initially each data object is in its own cluster and is considered the centroid of that cluster.

Merging Rule - This rule is used to combine two similar centroids together.

A. Affinity Calculation

The affinity value defines the degree of relationship between data objects: the greater the affinity value between two objects, the higher their similarity. In the cBoids algorithm the affinities of data objects are the main parameter for grouping the objects. The affinity value A_ij between two objects is calculated using equation (1):

A_ij = 1 / sqrt( sum_{k=1..L} (X_ik - Y_jk)^2 )    (1)

where X_ik and Y_jk are the data objects considered for the similarity calculation and L is the total number of fields in a data object. The inverse Euclidean distance is thus used for the affinity calculation.

The probability Pm_xy of merging two centroids X and Y is directly proportional to the affinity between them and is calculated using equation (2):

Pm_xy = 1 / sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (2)

where x_i and y_i are the fields of the two centroids and n is the total number of fields in each centroid.

When two or more objects are grouped under one centroid, an object may leave its current group and join another; the probability of leaving is based on a greater affinity with another group. The current affinity of an object with its own group (cag), equation (3), is its affinity with the centroid of the group it belongs to. The affinity of the object with the centers of all remaining groups is calculated in the same way, equation (4), and the highest of these values is the greatest affinity group value (gag). The probability P_i of the object leaving its group is directly proportional to the difference gag - cag, equation (5):

cag = 1 / sqrt( sum_{i=1..n} (b_i - c_i)^2 )    (3)

gag = 1 / sqrt( sum_{i=1..n} (b_i - g_i)^2 )    (4)

P_i = gag - cag    (5)

where b_i is the object that may leave the group, c_i is the centroid of its current group and g_i is the centroid of the greatest affinity group.

B. Centroid Rule and Merging Rule

According to the cBoids approach, initially every object is considered a centroid: if a database has 100 objects, the algorithm initially has 100 boids and each cluster contains a single object. The behavioural rules are applied to merge centroids and build new centroids; a new centroid can also be created by splitting a previously merged group into two new groups. The merging rule combines two clusters into one centroid: when two objects have sufficient similarity between them, their centroids are merged, with the merging probability Pm_xy of equation (2).

C. The New Separation, Alignment and Cohesion Rules

According to the new rules defined by Marcio Frayze David and Leandro Nunes de Castro, the three behavioural rules of C. Reynolds are redefined on the basis of the affinity value:

Separation rule: the strength of the separation between two boids is inversely proportional to the affinity between them; the lower the affinity value, the greater the separation force.

Alignment rule: the degree of alignment varies with the affinity between the boids; the greater the affinity value, the stronger the alignment between them.

Cohesion rule: the strength of cohesion varies with the boids' affinity; the greater the affinity value, the stronger the cohesion, and vice versa.

By following these three rules, new clusters are formed.

D. Stopping Condition

The algorithm stops when a maximum number of iterations passes without any change in the clustering process; cBoids allows a maximum of 1000 iterations.

IV. ENHANCED CBOIDS ALGORITHM

Web logs are unformatted text files and are inconsistent by nature, so preprocessing techniques are carried out to make them consistent [10][15]. The preprocessing activities, namely data cleaning, user identification, session identification and path completion, make the web logs suitable for applying pattern mining algorithms [11][12][13]. The following modifications to the cBoids algorithm are proposed with the aim of grouping users by their website visiting behaviour:

(i) The proposed algorithm is applied to the preprocessed web logs; during preprocessing, the session time and frequency value of websites are extracted from the web logs [15].
(ii) Websites and users are grouped with respect to the parameters session time and frequency value.
(iii) Euclidean distance is used to calculate the affinity value between the data objects.
(iv) Cluster similarity is validated using the Shannon minimum entropy measure.
(v) The new centroid value is calculated as the average of the merged groups' centroids.
(vi) The stopping criterion is specified based on the total number of unique users.

The proposed algorithm performs two levels of clustering to identify the user interest level. First, websites are grouped with respect to each user; then, in the second level, users are grouped with respect to each website. The web logs are processed according to the needs of the proposed approach and the clustering process groups the websites based on their usage behaviour. In the proposed system each boid has five attributes, namely user name, IP address, URL, session time and frequency, represented as

b = <user name, ip address, url, session time, frequency>

These attributes are the input for the EBoid algorithm.

A. Affinity Calculation

The EBoid algorithm uses session time and frequency as the parameters for affinity calculation. The data items are arranged in a distance matrix, and the Euclidean distance between objects is adapted as the affinity value within a group. The affinity A_ij between two websites is calculated using equation (6):
A_ij = sqrt( sum_{k in {s, f}} (x_ik - y_jk)^2 )    (6)

where s is the session time, f is the frequency value, and x_ik and y_jk are the session and frequency values of objects X and Y.

B. Entropy Calculation for Merging Centroids

Entropy is the mechanism for validating the similarity of the data items within a group. When two clusters are merged based on the probability rule Pm_xy, the entropy value of the merged cluster stays equal or increases. If the average entropy value is unchanged, the two clusters are assumed to have high similarity; therefore the clusters having minimum entropy are merged together. The entropy of each cluster is calculated using equation (7):

H(X) = - sum_{x in X} p(x) log2 p(x)    (7)

where H is the entropy function, X is the set of data items in a cluster, x is a data item and p(x) is the probability mass function of the random variable x.

C. New Centroid Calculation

When two clusters (centroids) are merged together, a new centroid value is calculated for the merged group by taking the average of the two centroids:

nc = (c1 + c2) / 2

where nc is the new centroid, c1 is the centroid of the initial group and c2 is the centroid of the new group.

D. Proposed Separation, Alignment and Cohesion Rules

Separation rule: the strength of the separation between two boids is inversely proportional to the affinity between them; the lower the affinity value and the higher the entropy, the greater the separation force.

Alignment rule: the degree of alignment varies with the affinity between the boids; the greater the affinity value and the lower the entropy value, the stronger the alignment between them.

Cohesion rule: the strength of cohesion varies with the boids' affinity; the greater the affinity value and the lower the entropy value, the stronger the cohesion, and vice versa.

By following these three rules, new clusters are formed. The algorithm stops when a number of cycles passes without any change in the groups; this number equals the total number of users divided by 2.

E. User Interest Level Grouping

The website clusters obtained from the EBoid algorithm represent the websites most visited and most used by each user. The UCC method aims at extracting the user interest from the clusters formed by the proposed algorithm, where clusters are made according to a user's sessions and frequent visits to particular websites. The inputs for the UCC algorithm are (i) the user list and (ii) the website clusters. The user interest is calculated against the clusters based on a constraint, defined as the likeness lk of a user for particular websites. The user interest level depends on three parameters: user, website and likeness. The likeness value lk of a particular website is calculated using equation (8):

lk = vi + ts    (8)

where vi is the number of visits and ts is the total time spent on a particular website. The likeness value of each cluster is calculated in order to select the high likeness clusters, which are considered for user interest level identification; the clusters with low likeness value are discarded from further analysis.

V. EXPERIMENT AND PERFORMANCE ANALYSIS

The preprocessed web logs are given as the input for this work. Java (jdk 1.6) is used for implementing the clustering technique on a system with an Intel Core i3 processor and 4 GB RAM. In this section the performance of the proposed approach is evaluated based on its recall, precision and f-measure values.
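The merge validation and likeness steps described above can be sketched as follows. This is a minimal illustration of equations (7) and (8) on hypothetical toy data, not the paper's Java implementation:

```python
import math
from collections import Counter

def entropy(items):
    # Shannon entropy (equation 7) of the labels in one cluster.
    n = len(items)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(items).values())
    return h + 0.0  # "+ 0.0" turns -0.0 into 0.0 when all labels are equal

def new_centroid(c1, c2):
    # Merged-group centroid: component-wise average of the two centroids.
    return [(a + b) / 2 for a, b in zip(c1, c2)]

def likeness(visits, time_spent):
    # Likeness of a website for a user: lk = vi + ts (equation 8).
    return visits + time_spent

print(entropy(["g", "g", "g"]))          # 0.0 (uniform cluster)
print(entropy(["g", "f", "g", "f"]))     # 1.0 (even two-way split)
print(new_centroid([120, 4], [100, 2]))  # [110.0, 3.0]
print(likeness(5, 300))                  # 305
```

A uniform cluster has minimum entropy, so a candidate merge that keeps entropy low is accepted, and the merged group then gets the averaged centroid.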
The analysis is conducted on two criteria: first, the websites most visited by the users are retrieved; second, the likeness values of those top websites are analyzed.

A. Performance of the EBoid Algorithm Against the Existing Algorithm

The performance of the proposed EBoid algorithm is compared with the cBoids algorithm by applying the UCC model, using the parameters precision, recall and f-measure for three websites: google.com, facebook.com and microsoft.com. The 550 unique users identified during the preprocessing activities are considered for the performance evaluation.

B. Average Precision, Recall and F-measure Analysis for Three Websites

The average precision, recall and f-measure values are calculated to determine the user interest level for a particular website. The analysis is carried out for three popular websites, namely google.com, facebook.com and microsoft.com. Table 6.1 shows the average precision, recall and f-measure values of the proposed and existing algorithms for these websites.

Figure 6.1 Average precision, recall and f-measure value for google.com

C. Evaluation Metrics

The performance of the proposed system is evaluated with the following three basic criteria.

Recall: the ratio of the number of relevant users that are retrieved to the total number of relevant users for a particular website,

Recall = |U_Ret intersect U_Rel| / |U_Rel|

where U_Ret and U_Rel are the sets of retrieved and relevant users respectively.

Precision: the ratio of the number of relevant users that are retrieved to the total number of retrieved users,

Precision = |U_Ret intersect U_Rel| / |U_Ret|

F-measure: the precision and recall values are combined to give the f-measure for the total dataset.
Thus the f-measure is calculated as

F-measure = 2 x Recall x Precision / (Recall + Precision)

Figure 6.2 Average precision, recall and f-measure value for facebook.com

Figure 6.3 Average precision, recall and f-measure value for microsoft.com

Table 6.1 Average Precision, Recall and F-measure values for Google.com, Facebook.com and Microsoft.com (P - Precision, R - Recall, F - F-measure)

Algorithm    Google.com         Facebook.com       Microsoft.com
             P     R     F      P     R     F      P     R     F
cBoid        0.62  0.57  0.59   0.54  0.47  0.50   0.52  0.37  0.43
EBoid        0.73  0.67  0.70   0.63  0.56  0.59   0.62  0.49  0.55

VI. CONCLUSION

Identifying and analyzing website visitors' behaviour is an interesting task for many website analysts. In this paper the EBoid algorithm is used to cluster users based on their browsing patterns and to group websites based on their usage. Web logs are given as the input to the proposed algorithm, and the UCC algorithm is newly proposed to extract the likeness value for websites. The proposed algorithm is compared with the existing algorithm to demonstrate its efficiency, with recall, precision and f-measure as the evaluation metrics. The experiments consider three different websites, and the analysis is carried out from different aspects to extract the high likeness websites and the users for each website. The proposed algorithm gives better performance in identifying the interest level of users. In future work, association rule mining will be applied to the high likeness cluster of each user to find frequent patterns.

REFERENCES

[1] Athena Vakali, Jaroslav Pokorny, and Theodore Dalamagas, "An Overview of Web Data Clustering Practices", Springer-Verlag Berlin Heidelberg, 2004.
[2] George Pallis, Athena Vakali, Jaroslav Pokorny, "A clustering-based prefetching scheme on a Web cache environment", Computers and Electrical Engineering, vol. 34, pp. 309-323, 2008.
[3] Miao Wan, Arne Jönsson, Cong Wang, Lixiang Li and Yixian Yang, "Web user clustering and Web prefetching using Random Indexing with weight functions", Knowledge and Information Systems, vol. 33, no. 1, pp. 89-115, 2012.
[4] Sophia G. Petridou, Vassiliki A. Koutsonikola, Athena I. Vakali, and Georgios I. Papadimitriou, "A Divergence-Oriented Approach for Web Users Clustering", ICCSA 2006, LNCS 3981, pp. 1229-1238, Springer-Verlag Berlin Heidelberg, 2006.
[5] Babak Anari, Mohammad Reza Meybodi and Zohreh Anari, "Clustering Web Access Patterns Based on Learning Automata", IJCSI International Journal of Computer Science Issues, vol. 8, issue 5, no. 1, September 2011. ISSN (Online): 1694-0814.
[6] George Pallis, Lefteris Angelis, and Athena Vakali, "Model-Based Cluster Analysis for Web Users Sessions", ISMIS 2005, LNAI 3488, pp. 219-227, Springer-Verlag Berlin Heidelberg, 2005.
[7] Dipa, D and Jayant, G, "A New Approach for Clustering of Navigation Patterns of Online Users", International Journal of Engineering Science and Technology, vol. 2, no. 6, pp. 1670-1676, 2010.
[8] Suguna, R and Sharmila, D, "User Interest Based Web Usage Mining using a Modified Bird Flocking Algorithm", European Journal of Scientific Research (EJSR), vol. 86, no. 2, pp. 218-231.
[9] Marcio Frayze David and Leandro Nunes de Castro, "A New Clustering Boids Algorithm for Data Mining", 2008.
[10] Cooley, R, Mobasher, B and Srivastava, J, "Data Preparation for Mining World Wide Web Browsing Patterns", Knowledge and Information Systems, Springer-Verlag, pp. 1-27, 1999.
[11] Cooley, R, "Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data", Ph.D. thesis, University of Minnesota, 2000.
[12] Malarvizhi, M and Sahaaya, AM, "Preprocessing of Educational Institution Web Log Data for Finding Frequent Patterns using Weighted Association Rule Mining Technique", European Journal of Scientific Research, vol. 74, no. 4, pp. 617-633, 2012.
[13] Maheswara, R and Valli, VVR, "An Enhanced Pre-Processing Research Framework for Web Log Data using a Learning Algorithm", NetCom, CS&IT - CSCP, pp. 1-15, 2011.
[14] Reynolds, CW, "Flocks, Herds and Schools: A Distributed Behavioral Model", ACM SIGGRAPH Computer Graphics, vol. 21, no. 4, pp. 25-34, 1987.
[15] Suguna, R and Sharmila, D, "User Interest Level based Preprocessing Algorithms using Web Usage Mining", International Journal of Computer Science and Engineering, vol. 5, no. 9, pp. 815-822, 2013. ISSN: 0975-3397.