International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 2, Jun 2013, 143-152
© TJPRC Pvt. Ltd.

CLUSTERING OF WEB USAGE DATA USING FUZZY TOLERANCE ROUGH SET SIMILARITY AND TABLE FILLING ALGORITHM

T. VIJAYA KUMAR & H. S. GURUPRASAD
Department of IS & E, BMS College of Engineering, Bull Temple Road, Bangalore, Karnataka, India

ABSTRACT

Web Usage Mining is the application of data mining techniques to learn usage patterns from Web server log files in order to understand and better serve the requirements of Web-based applications. Web Usage Mining includes three main steps, namely data preprocessing, pattern discovery, and analysis of the discovered patterns. One of the most important tasks in Web usage mining is to find groups of users exhibiting similar browsing patterns. Grouping Web transactions into clusters is important in order to understand users' navigational behavior. Different types of clustering algorithms, such as partition-based, distance-based, density-based, grid-based, hierarchical and fuzzy clustering algorithms, are used to find clusters from Web usage data. In this paper we propose an approach for clustering Web usage data based on fuzzy tolerance rough set theory and the table filling algorithm. First, we construct the sessions using concept hierarchy and link information. The similarity between two sessions is approximated by using a rough set tolerance relation. The tolerance relation is reformulated into an equivalence relation using fuzzy tolerance. Then the clusters are obtained by using a modified table filling algorithm. We provide experimental results of fuzzy rough set similarity and the table filling algorithm on the MSNBC Web navigation data set. We have also considered the server log files of the Website www.enggresources.com for overall study and analysis.

KEYWORDS: Web Usage Mining, Concept Hierarchy, Website Ontology, Rough Set Similarity, Fuzzy Tolerance, Table Filling Algorithm

INTRODUCTION

The growth of the World Wide Web in terms of Web sites and their users over the last two decades has resulted in a large amount of data related to users' interactions with Web sites. This data is recorded in the log files of Web servers and is referred to as Web usage data. Web usage mining (WUM) uses data mining techniques to discover valuable information from Web usage data. WUM deals with the automatic discovery of user access patterns from one or more Web servers. Web usage mining contains three main tasks, namely data preprocessing, cluster discovery and cluster analysis.

Data preprocessing consists of data cleaning, data transformation and data reduction. Data cleaning routines clean the data by filling in missing values, smoothing noisy data and resolving inconsistencies. In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data reduction techniques can be applied to obtain a reduced representation of the data that is much smaller in volume, yet closely maintains the integrity of the original data. Cluster discovery deals with the formation of groups of users exhibiting similar browsing patterns and with obtaining groups of pages that are accessed together. Cluster analysis filters out uninteresting patterns from the user clusters and page clusters found in the cluster discovery phase.

Clustering is a data mining technique that groups together a set of items having similar characteristics. In the Web usage domain, two kinds of interesting clusters can be discovered: user clusters and page clusters. This paper presents a new approach for finding session similarity using fuzzy rough set theory. Rough set theory deals with uncertainty and vagueness. The building block of rough set theory is the assumption that with every set of the universe, we can associate some information in the form of data and knowledge. Objects clustered by the same information are similar with respect to the available information about them. The set similarity considered for two sessions is a tolerance relation, which is reflexive and symmetric but not transitive. Fuzzy tolerance is used to reform the tolerance relation into an equivalence relation. Then the indiscernibility-based fuzzy tolerance rough set similarity is combined with the table filling algorithm to form the clusters.

The table filling algorithm is used to minimize deterministic finite automata. The minimization problem is to find the unique minimal deterministic finite automaton that accepts the same language as a given deterministic finite automaton. Algorithms solving this problem are used in applications ranging from compiler construction to hardware circuit minimization.

The rest of the paper is organized as follows. Section 2 gives a brief description of the related work. Section 3 explains the proposed model. Section 4 covers details of data preprocessing using concept hierarchy and Web site topology. The details of rough set theory and fuzzy tolerance are discussed in Section 5. The proposed approach using fuzzy tolerance rough set theory and the table filling algorithm is explained in Section 6. The experimental design and results are discussed in Section 7. Finally, we give our conclusion in Section 8.

LITERATURE SURVEY

Several researchers are working on Web usage mining and have contributed various methodologies and tools. A number of data mining methods have been used to generate models of usage patterns. Models based on association rules, clustering algorithms, sequential analysis and Markov models have been used for discovering knowledge from Web usage data. All these models are predominantly based on usage information from Web usage data alone.
Significant improvement can be achieved by making use of domain knowledge, which is usually available from domain experts, content providers and Web designers. Cooley et al. [1, 2] covered the Web usage mining process and the various steps involved in it. These works serve as primary references for understanding the fundamentals of Web usage mining. Along with the server log file, other sources of knowledge such as site content or structure and semantic domain knowledge can be used in Web usage mining [3]. In [4], Murat Ali Bayir et al. have proposed a novel framework for the Web usage mining problem, called Smart-Miner, which uses link information for producing accurate user sessions and frequent navigation patterns. Norwati Mustapha et al. [5] have proposed a model for mining users' navigation patterns based on the Expectation Maximization algorithm, used for finding maximum likelihood estimates of parameters in probabilistic models. A complete framework for mining evolving user profiles in dynamic Websites is proposed in [6]. The authors also describe how to enrich the discovered user profiles with explicit information needs inferred from search queries extracted from Web log data. In [7], T. Vijaya Kumar et al. have proposed a framework for finding useful information from Web usage data that uses Self Organizing Maps (SOM). Sessions are constructed using the concept hierarchy and the link information, and SOM is then used to form the clusters. In [8], Hannah et al. have proposed an approach to obtain user profiles based on intelligent rough clustering techniques. The proposed method provides efficient algorithms for finding hidden patterns in Web log data and is able to learn the number of clusters automatically from the given data. They give a two-fold approach for clustering user access patterns and retrieving effective user profiles from Web logs using Gaussian Rough (GR) clustering, Gaussian Rough Fuzzy (GRF) clustering, rough clustering and rough fuzzy clustering.
In [9], Rajhans Mishra et al. have adopted similarity upper approximation based clustering of Web logs using various similarity metrics. In [10], K. Santhisree et al. have presented a technique to cluster Web transactions based on set similarity measures from Web log data, which identifies the behavior of the user's page visits and the order of occurrence of visits. They have formed the Web data clusters using similarity upper approximations. In [11], Philip Hingston presented a method for mining interesting sequential patterns from large sequential data sets. In the first step, the data set is modeled in terms of a stochastic grammar or automaton. Then queries about frequently occurring patterns in the data set are answered by converting pattern frequencies into formulae concerning the model. In [12], Sunil Joshi et al. have proposed a new algorithm, PMFLT (Pattern Mining using Formal Language Tools), for sequential pattern mining using formal language tools such as regular expression constraints. The algorithm finds only user-specific frequent sequences in an efficient, optimized way compared to other existing algorithms.

SYSTEM DESIGN

The main goal of the proposed system is to find Web user clusters from Web server log files. We have adopted the Web Usage Mining system shown in Figure 1. The WUM system in our approach is partitioned into two modules. In the first module, data cleaning, user identification and session construction are carried out as the data preprocessing phase. Sessions are constructed using Web site ontology and concept hierarchy. In the second module, Web usage clusters are formed using rough set theory, fuzzy tolerance and the table filling algorithm.

Figure 1: Web Usage Clustering Process

DATA PREPROCESSING

Data preprocessing [13] comprises merging of log files from different Web servers, data cleaning, identification of users, sessions and visits, data formatting, and summarization. Data cleaning consists of removing superfluous data from the log file. User identification deals with identifying unique clients of the Web server. A combination of IP address and user agent is used to identify a user uniquely. User identification can also be done using client-side cookies. However, for privacy reasons cookies can be disabled by the user, and not every Website employs cookies. Session identification is the next step. A session is a sequence of requests made by a single user with a unique IP address on a particular Web domain during a specified period of time.

Time Oriented Approach

The most basic session definition comes from time-oriented heuristics, which are based on limits on total session time or page-stay time. They are divided into two categories with respect to the thresholds they use.

In the first, the duration of a session is limited by a predefined upper bound, usually taken as 30 minutes. A new page can be appended to the current session if the time difference from the first page does not violate the total session duration. Otherwise, a new session is assumed to start with the new page request.

In the second time-oriented heuristic, the time spent on any page is limited by a threshold, usually taken as 10 minutes. If the difference between the timestamps of two consecutively accessed pages is greater than the threshold, the current session is terminated after the former page and a new session starts with the latter page.

Navigation Oriented Approach

The navigation-oriented approach [14, 15] uses the link information of the Website graph, which is present in the concept-based Website graph constructed using Website knowledge.
In this approach, it is necessary to have a hyperlink between every two consecutive Web page requests. Let $S = [P_1, P_2, \ldots, P_n]$ be a session containing Web pages in timestamp order. In this session, for every page $P_i$ except the initial page, there must be at least one page $P_j$ in the session which refers to $P_i$ and has a smaller timestamp than $P_i$. This topology constraint forces user navigation to be considered along some path in the Website graph.

Concept-matching approach: This approach considers the concepts of Web pages from the concept-based Website graph. Adding page $P_{i+1}$ to a session is performed as follows: if the concept names of pages $P_i$ and $P_{i+1}$ are the same, then add $P_{i+1}$ to the current session; else create a new session and add $P_{i+1}$ to it. That is, concept switching is taken as one more criterion for breaking a session [16].

FUZZY TOLERANCE ROUGH SET SIMILARITY

In this section we present a fuzzy rough set theoretic approach to cluster user access transactions over the Web. The presented approach is based on the table filling algorithm. Rough set theory is based on the assumption that with every set of the universe, some information in the form of data and knowledge can be associated. Objects clustered by the same information are similar, and the similarity generated from this information forms the basis of rough set theory. Given two transactions $T_1$ and $T_2$, the sequence and set similarity measure proposed in [17] is considered for our study. Sequence similarity measures the amount of similarity in the order of occurrence of pages within two page sequences. The sequence similarity measure is given in equation (1). The length of the longest common subsequence (LLCS), taken relative to the length of the longer sequence, determines the sequence similarity of two sequences. The LLCS can be calculated by a dynamic programming approach [18]. Set similarity is defined as the ratio of the number of common pages to the number of unique pages in two page sequences.
The Set similarity measure is given in equation (2). The Sequence set similarity metric satisfies non-negativity, symmetry and normalization, and hence qualifies as a proper similarity metric [19]. The Sequence set similarity measure is given in equation (3). For two transactions $T_1$ and $T_2$:

$$SeqSim(T_1, T_2) = \frac{LLCS(T_1, T_2)}{\max(|T_1|, |T_2|)} \qquad (1)$$

$$SetSim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \qquad (2)$$

$$S(T_1, T_2) = p \cdot SeqSim(T_1, T_2) + q \cdot SetSim(T_1, T_2) \qquad (3)$$

Here $p + q = 1$ and $p, q \geq 0$. The values of p and q determine the relative weights for sequence similarity and set similarity. In equation (1), $|T|$ denotes the length of a sequence; in equation (2), the intersection and union are taken over the sets of distinct pages in the two sequences. $S(x, y) \in [0, 1]$ for all x and y. $S(x, y) = 1$ when two transactions x and y are exactly identical, and $S(x, y) = 0$ when two transactions x and y have no items in common and are therefore totally different.

This measure of similarity gives information about users' access patterns related to their common areas of interest and common navigational patterns. The navigation of any two users over a Web site may not be exactly identical, but may share some common interesting pages or patterns. Moreover, the same user can navigate the same pattern in different ways. This similarity between the navigational behaviors of two users is modeled by a binary relation R defined on T, the set of user transactions. For any threshold value $\delta \in (0, 1]$ and for any two user transactions $x, y \in T$, a binary relation R on T, denoted xRy, is defined by xRy iff $S(x, y) > \delta$. The similarity class of t, denoted SimClass(t), is the set of transactions which are similar to t. It is given by SimClass(t) = {s ∈ T : sRt}. Different threshold values give different similarity classes. A domain expert can choose the threshold based on experience to get a proper similarity class.
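
The sequence and set similarity described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `llcs`, `seq_sim`, `set_sim` and `seq_set_sim` are hypothetical, the equal weights p = q = 0.5 are an assumed default, and the two page sequences in the example are illustrative.

```python
def llcs(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)          # LLCS values for the previous row
    for x in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def seq_sim(a, b):
    """LLCS relative to the length of the longer sequence."""
    longest = max(len(a), len(b))
    return llcs(a, b) / longest if longest else 0.0


def set_sim(a, b):
    """Common pages divided by unique pages across both sequences."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def seq_set_sim(a, b, p=0.5, q=0.5):
    """Weighted sum of sequence and set similarity (p + q = 1 assumed)."""
    if p < 0 or q < 0 or abs(p + q - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return p * seq_sim(a, b) + q * set_sim(a, b)


# Illustrative page-category sequences (assumed example data).
x = [2, 12, 3, 4, 12, 12]
y = [1, 1, 12, 2, 2, 4]
print(llcs(x, y))                    # 2
print(round(seq_set_sim(x, y), 2))   # 0.47
```

In this example the longest common subsequence has length 2 out of 6 pages, and 3 of the 5 distinct pages are shared, so the combined value is 0.5 × 2/6 + 0.5 × 3/5 ≈ 0.47.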
For a fixed threshold $\delta \in (0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class. The relation R is a tolerance relation: R is both reflexive and symmetric, but transitivity may not always hold. Let a, b and c be three different transactions. For a specified threshold, if a is similar to b and b is similar to c, then a may not be similar to c. A tolerance or proximity relation R is a relation that exhibits only the properties of reflexivity and symmetry. A tolerance relation R can be reformed into an equivalence relation by at most (n-1) compositions with itself, where n is the cardinal number of the set defining R. A fuzzy relation R on a single universe X is also a relation from X to X. It is a fuzzy equivalence relation if all three of the following properties for matrix relations define it:

Reflexivity: $\mu_R(x_i, x_i) = 1$
Symmetry: $\mu_R(x_i, x_j) = \mu_R(x_j, x_i)$
Transitivity: $\mu_R(x_i, x_j) = \lambda_1$ and $\mu_R(x_j, x_k) = \lambda_2$. Then $\mu_R(x_i, x_k) = \lambda$, where $\lambda \geq \min(\lambda_1, \lambda_2)$.

MODIFIED TABLE FILLING ALGORITHM

The indiscernibility-based fuzzy tolerance rough set similarity is combined with the table filling algorithm to form the clusters. The table filling algorithm is used to minimize deterministic finite automata. The minimization problem is to find the unique minimal deterministic finite automaton that accepts the same language as a given deterministic finite automaton.

Algorithm: Modified table filling algorithm with fuzzy tolerance rough set based similarity.
Input: A set T of n transactions
Threshold: $\delta \in (0, 1]$
Similarity measure: $S(x, y)$
Output: Web usage clusters

Procedure Mark
Step 1: Remove sessions with session length < minimum session length.
Step 2: Consider all pairs of sessions (x, y) and construct the similarity matrix using the similarity measure.
Step 3: Repeat the following until no previously unmarked pairs are marked. For all pairs $(x_i, x_j)$, if $S(x_i, x_j) > \delta$ then sessions $x_i$ and $x_j$ are indistinguishable or equivalent. The sessions $x_i$ and $x_j$ are placed in the same cluster.

Procedure Reduce: Construct Session Clusters
Step 1: Use procedure Mark to find all pairs of similar sessions. Use the fuzzy tolerance rough set similarity measure to find the similarity class of each session using SimClass(t) = {s ∈ T : sRt}.
Step 2: Each group of equivalent sessions must be placed in a single cluster to form session clusters.

EXPERIMENTAL DESIGN AND RESULTS

Description of the Dataset

The data come from the UCI dataset repository and consist of Internet Information Server (IIS) logs for msnbc.com and the news-related portions of msn.com. Each sequence in the dataset corresponds to the page views of a user. Each event in the sequence corresponds to a user's request for a page. Requests are recorded at the level of page categories, as determined by the site administrator. There are 17 page categories, namely 'front page', 'news', 'tech', 'local', 'opinion', 'on-air', 'misc', 'weather', 'health', 'living', 'business', 'sports', 'summary', 'bulletin board service', 'travel', 'msn-news' and 'msn-sports'. Each page category is represented by an integer label. For example, 'front page' is coded as 1, 'news' is coded as 2, 'tech' is coded as 3, and so on. Each row describes the hits of a single user. Figure 2 shows an example of the Web navigation data. The similarity table is computed using Sequence set similarity and is shown in Figure 3.
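
To make the thresholding step concrete, the following Python sketch stores the pairwise values reported in Figure 3 and derives the similarity class of each transaction for a threshold of 0.2. The names `UPPER`, `similarity` and `sim_class` are illustrative and not part of the original work, and the strict comparison follows Step 3 of Procedure Mark. It covers only the similarity-class computation, not the full cluster construction.

```python
# Upper-triangular similarity values transcribed from Figure 3.
# UPPER[i] holds S(Ti, Tj) for j = i+1 .. 10.
UPPER = {
    1: [0, 0, 0, 0.21, 0.29, 0, 0, 0, 0],
    2: [0, 0.47, 0.17, 0.17, 0.17, 0, 0.15, 0.15],
    3: [0, 0, 0.25, 0.33, 0.33, 0, 0.21],
    4: [0.17, 0, 0.45, 0.27, 0.24, 0.5],
    5: [0.18, 0, 0, 0, 0],
    6: [0.18, 0.21, 0, 0.17],
    7: [0.58, 0.17, 0.62],
    8: [0, 0.5],
    9: [0.24],
}
N = 10  # number of transactions T1 .. T10


def similarity(i, j):
    """Symmetric lookup; every transaction is fully similar to itself."""
    if i == j:
        return 1.0
    lo, hi = min(i, j), max(i, j)
    return UPPER[lo][hi - lo - 1]


def sim_class(t, delta):
    """Indices of all transactions s with S(s, t) > delta."""
    return {s for s in range(1, N + 1) if similarity(s, t) > delta}


for t in range(1, N + 1):
    members = ", ".join(f"T{s}" for s in sorted(sim_class(t, 0.2)))
    print(f"SimClass(T{t}) = {{{members}}}")
```

Run as written, the sketch yields, for example, SimClass(T1) = {T1, T5, T6} and SimClass(T9) = {T4, T9, T10}, in agreement with the similarity classes listed for a threshold of 0.2.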

T1:   6   7   7   7   6   7
T2:   2  12   3   4  12  12
T3:  14  14  14  14  14  14
T4:   1   1  12   2   2   4
T5:   6   8   8   8   8  12
T6:   6   6   6   6   3  14
T7:   1  14  14   1   1   2
T8:   1   1   1   1   1  14
T9:   2   2  15   5   5  16
T10:  1  11   1   2   2  14

Figure 2: Sample Web Navigation Data

      T2    T3    T4    T5    T6    T7    T8    T9    T10
T1    0     0     0     0.21  0.29  0     0     0     0
T2          0     0.47  0.17  0.17  0.17  0     0.15  0.15
T3                0     0     0.25  0.33  0.33  0     0.21
T4                      0.17  0     0.45  0.27  0.24  0.5
T5                            0.18  0     0     0     0
T6                                  0.18  0.21  0     0.17
T7                                        0.58  0.17  0.62
T8                                              0     0.5
T9                                                    0.24

Figure 3: Similarity Table Using Sequence Set Similarity

Assuming a threshold value of 0.2, the similarity classes are as follows.

SimClass(T1) = {T1, T5, T6}
SimClass(T2) = {T2, T4}
SimClass(T3) = {T3, T6, T7, T8, T10}
SimClass(T4) = {T2, T4, T7, T8, T9, T10}
SimClass(T5) = {T1, T5}
SimClass(T6) = {T1, T3, T6, T8}
SimClass(T7) = {T3, T4, T7, T8, T10}
SimClass(T8) = {T3, T4, T6, T7, T8, T10}
SimClass(T9) = {T4, T9, T10}
SimClass(T10) = {T3, T4, T7, T8, T9, T10}

Initially {T1, T5, T6} and {T2, T4} form two separate clusters. Based on the fuzzy tolerance rough set similarity we get the following clusters.

C1 = {T1, T5, T6}
C2 = {T2, T4, T7, T8, T9, T10}
C3 = {T3, T4, T6, T7, T8, T9, T10}

Some transactions belong to multiple clusters. Different clusters can be formed by choosing different threshold values.

We have considered the Web server log file from the Web site www.enggresources.com for our experimental study, and a concept-based Website graph is constructed as additional input. Error records and requests for images and multimedia files are removed from the server log file by using a tool called Web log filter. Usually this process removes requests concerning non-analyzed resources such as images, multimedia files and page style files (*.css). IP address, timestamp, user agent, request and referrer are retained for further processing. In user identification, IP address and user agent are used.
That is, a combination of IP address and user agent is used to identify a unique user. In session construction, we have combined two basic approaches, the time-oriented approach and the navigation-oriented approach, along with the concept name match approach for identifying user sessions. The page-stay time threshold and session timeout threshold are set to 10 and 30 minutes respectively. Each Web page is assigned a unique index, and every unique session is also given a unique index. 10217 users and 25814 sessions were discovered from preprocessing. The similarity table is constructed using Sequence set similarity. Experiments are conducted by randomly selecting 100, 200, 300, 400 and 500 sessions from the preprocessed data with threshold values $\delta$ = 0.2 and $\delta$ = 0.3. As the number of records increases, the number of clusters formed also increases. The graphs of the number of sessions versus the number of clusters with threshold values $\delta$ = 0.2 and $\delta$ = 0.3 are shown in Figure 4(a) and 4(b) respectively.

Figure 4(a): Graph Depicting Number of Sessions versus Number of Clusters with $\delta$ = 0.2

Figure 4(b): Graph Depicting Number of Sessions versus Number of Clusters with $\delta$ = 0.3

CONCLUSIONS

Clustering of Web user transactions can be used to find interesting user access patterns from Web server log files. In this paper we have proposed an approach for finding Web session clusters using fuzzy tolerance rough set theory and the table filling algorithm. These clusters represent groups of users exhibiting similar browsing patterns. These patterns can be used to provide a set of recommendations for the Web site, which can be deployed by the Web site administrator for Website enhancement. Traditional clustering methods create clusters by describing the members of each cluster, whereas rough set based clustering techniques create clusters by describing the main characteristics of each cluster.
In this work, we introduced the fuzzy tolerance rough set similarity measure along with the table filling algorithm. The proposed approach allows merging of two or more clusters. We investigated our approach on the MSNBC Web navigation data set. We also conducted experiments on the server log files of the Website www.enggresources.com to form clusters.

REFERENCES

1. R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web", Ninth IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, USA, 1997, Pages 558-567.
2. J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, "Web usage mining: discovery and applications of usage patterns from Web data", ACM SIGKDD Explorations Newsletter, Volume 1, Pages 12-23, 2000.
3. Bamshad Mobasher, Chapter 12, "Web Usage Mining in Data Collection and Pre-Processing", ACM SIGKKD 2007, Pages 450-483.
4. Murat Ali Bayir, Ismail Hakki Toroslu, Guven Fidan, and Ahmet Cosar, "Smart Miner: A New Framework for Mining Large Scale Web Usage Data", ACM 2009.
5. Norwati Mustapha, Manijeh Jalali, and Mehrdad Jalali, "Expectation Maximization Clustering Algorithm for User Modeling in Web Usage Mining Systems", European Journal of Scientific Research, ISSN 1450-216X, Volume 32, Number 4 (2009), Pages 467-476.
6. Olfa Nasraoui, Maha Soliman, Esin Saka, Antonio Badia, and Richard Germain, "Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites", IEEE Transactions on Knowledge and Data Engineering, Volume 20, Number 2, February 2008.
7. T. Vijaya Kumar and Dr. H. S. Guruprasad, "Clustering Web Usage Data using Concept hierarchy and Self Organizing Maps", International Journal of Computer Applications (0975-8887), Volume 56, No. 18, October 2012, www.ijcaonline.org.
8. H. Hannah Inbarani and K. Thangavel, "Rough set based User profiling for Web Personalization", International Journal of Recent Trends in Engineering, Vol. 2, No. 1, November 2009.
9. Rajhans Mishra and Pradeep Kumar, "Clustering Web Logs Using Similarity Upper Approximation with Different Similarity Measures", International Journal of Machine Learning and Computing, Vol. 2, No. 3, June 2012.
10. K. Santhisree and Dr. A. Damodaram, "Clustering on Web usage data using Approximations and Set Similarities", International Journal of Computer Applications (0975-8887), Volume 1, No. 4, 2010.
11. Philip Hingston, "Using Finite State Automata for Sequence Mining", Proceedings of the Twenty-Fifth Australian Conference on Computer Science, Volume 4, Pages 105-110. Australian Computer Science Communications, Vol. 24, Issue 1, Jan-Feb 2002.
12. Sunil Joshi, Dr. R. S. Jadon, and Dr. R. C. Jain, "Sequential Pattern Mining Using Formal Language Tools", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No. 2, September 2012, ISSN (Online): 1694-0814.
13. G. T. Raju and P. S. Satyanarayana, "Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology", IJCSNS International Journal of Computer Science and Network Security, Volume 8, Number 1, January 2008.
14. C. Shahabi and F. B. Kashani, "Efficient and anonymous Web-usage mining for Web personalization", INFORMS Journal on Computing, 15(2), Pages 123-147, 2003.
15. M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, "A framework for the evaluation of session reconstruction heuristics in Web usage analysis", INFORMS Journal on Computing, 15(2), Pages 171-190, 2003.
16. T. Vijaya Kumar, Dr. H. S. Guruprasad, Bharath Kumar K. M., Irfan Baig, and Kiran Babu S., "A New Web Usage Mining approach for Website recommendations using Concept hierarchy and Website Graph", International Journal of Computer and Electrical Engineering (IJCEE), ISSN: 1793-8198 (online version); 1793-8163 (print version).
17. P. Kumar, M. V. Rao, P. R. Krishna, R. S. Bapi, and A. Laha, "Intrusion detection system using sequence and set preserving metric", Proceedings of IEEE International Conference on Intelligence and Security Informatics, LNCS, Springer Verlag, Atlanta, 2005, pp. 498–504.
18. L. Bergroth, H. Hakonen, and T. Raita, "A survey of longest common subsequence algorithms", Seventh International Symposium on String Processing and Information Retrieval, Atlanta, 2000, pp. 39–48.
19. Pradeep Kumar, P. Radha Krishna, Raju S. Bapi, and Supriya Kumar De, "Rough clustering of sequential data", Data & Knowledge Engineering 63 (2007) 183–199, www.elsevier.com/locate/datak.