Web Usage Mining – A Review
Roshni S. Ali#, Rahila H. Sheikh*
Department of Computer Science & Engineering#*
Rajiv Gandhi College of Engineering, Research & Technology, Gondwana University, Chandrapur, Maharashtra, India-442401
#roshniali27@gmail.com *rahila.patel@gmail.com

Abstract: Web Usage Mining (WUM) is an area in which the navigational behavior of users is tracked and analyzed. Analyzing this behavior lets a website owner recognize the usage patterns of its users. By collecting data on user activity, the owner can improve the quality of service to retain existing customers and attract new ones. Usage patterns are analyzed using web log files. A web log file contains different parameters that track each user request. Based on these parameters, similarities in user access are identified, and patterns are discovered and analyzed. In this paper we review the concepts and process of Web Usage Mining, i.e. Data Pre-processing, Pattern Discovery and Pattern Analysis.

Keywords: Web Usage Mining, Web Log File, User Access Pattern, Pattern Discovery.

I. INTRODUCTION
The World Wide Web (WWW) is expanding day by day with a growing volume of information [9]. It serves as a huge, widely distributed, global information centre for news, consumer information, financial management, advertisements, education, government and e-commerce. It contains a rich and dynamic collection of information: web page contents with hypertext structures and multimedia, hyperlink information, and access and usage information. Web mining is an application of data mining techniques that helps to extract useful knowledge from the vast web. This web data may include Web documents, usage logs of web sites, hyperlinks between documents, etc. [1].
Data Mining (also called knowledge discovery) is the computational technique of discovering and identifying patterns in large data sets, involving methods at the intersection of artificial intelligence, database systems, machine learning, and statistics [4]. It is the practice of analysing data from different sources and parameters and summarizing it into useful information that can be used to boost revenue, cut costs, or both. Data is analysed along many different dimensions and categorized, and the relationships identified are summarized. Technically, data mining is the method of finding correlations or patterns among dozens of fields in large relational databases. On the whole, the goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
In data mining terms, Web Mining can perform three major operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). Web mining is the application of data mining techniques to extract knowledge from Web data, where at least one of structure (hyperlink) or usage (Web log) data is used in the mining process (with or without other types of Web data) [1].
In practice, companies, organizations and individuals increasingly gather information through the Web and use it in their own interest. Web Mining crawls through various Web resources to collect the required information, which helps an individual or a company to promote business and to understand marketing dynamics, new promotions appearing on the Web, and so on.
In the following sections, we first give an overview of Web Mining and its classification in Section II.
Section III explains the concepts of Web Usage Mining, the Web Log File and the process of Web Usage Mining. Section IV reviews related work in the area of Web Usage Mining. Section V presents the proposed methodology, and Section VI concludes the paper.

II. WEB MINING & ITS TAXONOMY
Web mining is an application area of Data Mining techniques that automatically extracts information from web documents. It mainly performs four major tasks [3]:
A. Resource finding
This task retrieves the intended web documents. It is the procedure by which data is extracted from online or offline text resources available on the web.
B. Information selection and pre-processing
This task automatically selects and pre-processes specific information from the retrieved web resources, transforming the originally retrieved data into information. The transformation could be the removal of stop words or stemming, or it may aim at obtaining a desired representation, such as finding phrases in the training corpus.
C. Generalization
This task automatically discovers general patterns at individual web sites as well as across multiple sites. Data Mining techniques and machine learning are used in generalization.
D. Analysis
This task involves the validation and interpretation of the mined patterns, and it plays a vital role in pattern mining. Humans play an important role in the information and knowledge discovery process on the web.

Web Mining is broadly divided into three categories, as shown in Fig 1:
[Figure: Web Mining divides into Web Content Mining, Web Structure Mining and Web Usage Mining; the application areas of Web Usage Mining are Web Personalization, System Improvement, Modification of Web Site, Business Intelligence and Characterization of User.]
Fig 1. Classification of Web Mining & Application Areas of Web Usage Mining
A. Web Content Mining
Web content mining, also referred to as text mining, scans and mines the text, pictures and graphs of a Web page to determine the relevance of the content to a search query. Content mining provides search engines with result lists ordered by highest resemblance to the keywords in the query.
B. Web Usage Mining
This technique groups Web access information for Web pages. The collected usage data describes the paths by which Web pages are accessed. This information is most often gathered automatically into access logs by the Web server.
C. Web Structure Mining
This technique identifies the relationship between Web pages linked by information or by a direct link connection. The structured data is obtained by providing a web structure schema through database techniques for Web pages. The connection then allows a search engine to pull data relating to a search query directly to the linking Web page from the Web site where the content resides.

III. WEB USAGE MINING
Web usage mining is a method that uses various web data sources to uncover hidden knowledge about users and their access patterns on the Web. Such knowledge of user access is used to bring benefits to the business and can lead directly to increased profit. Web Usage Mining involves identifying how frequently pages are accessed by users and then finding the common traversal paths. Long and complicated user access paths, together with low use of a web page, show that the web site is not designed in an intuitive manner. With the help of this analysis, one can reorganize the web site.
Web Usage Mining applies the concepts of chart technology, data mining, artificial intelligence, and automated learning techniques to user data sets and web logs [4]. Various models, such as random-walk and Markov-chain models, are also implemented for statistical simulation [5].
A. Concept of Web Usage Mining
Web usage mining is the discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers [26].
Typical Sources of Data:
1) Automatically generated data stored in server access logs, referrer logs, proxy server logs, browser logs, agent logs, and client-side cookies.
2) E-commerce and product-oriented user events (e.g. shopping cart changes, ad or product click-throughs, etc.) and registration data.
3) User profiles and/or user ratings.
4) Meta-data, page content, site structure, page attributes.
5) User queries, bookmark data, mouse clicks and scrolls.
B. Web Log Format
A web server log file contains the requests made to the web server, recorded in sequential order. Different web servers maintain different information in the log file. The most popular log file formats are the Common Log Format (CLF) and the extended CLF. A Common Log Format file is generated by the web server to keep track of the requests that occur on a web site [26]. The basic parameters that make up the entries of a web log file are listed below.
1) User Name: This determines who visited the web site. The user is mostly identified through the IP address allotted by the Internet Service Provider (ISP).
2) Visiting Path: The path followed by the user to reach the web site. The user may enter the URL directly, follow a link, or arrive through a search engine.
3) Path Traversed: This specifies the path followed by the user within the web site using its different links.
4) Time stamp: It records when each request was made, from which the time spent by the user on each web page while browsing is derived. This time spent is identified as the session.
5) Page last visited: It specifies the page that was visited by the user before he or she left the website.
6) Success Rate: The success rate of the web site can be determined by the number of downloads made and the amount of copying activity undertaken by the user. Purchases of goods or software add to the success rate.
7) User Agent: It specifies the browser from which the user sends the request to the web server.
It is simply a string describing the type and version of the browser software being used.
8) URL: The resource accessed by the user. It may be an HTML page, a CGI program, or a script.
9) Request Type: The method used for information transfer is recorded, for example GET or POST.
C. Web Usage Mining: A Process
Web Usage Mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. Fig 2 below shows the sequence of the Web Usage Mining process, which runs from the web log file through data preprocessing and pattern discovery to pattern analysis.
1) Pre-Processing
Pre-processing converts the unstructured data into useful information by applying some algorithm. Web usage data sources must be integrated, filtered, cleaned, and transformed, so that gaps are filled where possible, irrelevant information is thrown away, and user sessions and transactions are identified. These data sources are mainly Web server log files, agent logs and other interfaces. The data present in the log file cannot be used as it is for the mining process [7]. Therefore the contents of the log file are cleaned in this preprocessing step: the unwanted data are discarded and a reduced log file is obtained. The different actions performed on the data during the preprocessing phase are given below.
Data cleaning: The log entries recording unwanted requests for images, graphics, multimedia, etc. are removed. Once these data are cleaned, the size of the file is reduced considerably.
Session Identification: A session is the time duration spent in the web page. It is identified using the time stamp details of the web pages. It can also be identified by noting the user id of those who visited the web page and traversed its links.
Data conversion: This is the process of converting the log file data into the format needed by the mining algorithms.
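The log format and the preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any cited paper: the regular expression assumes plain Common Log Format entries, the list of ignored file extensions and the 30-minute session timeout are assumptions chosen for the example, and the sample log lines and the function names `parse_line`, `clean` and `sessionize` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these file types are embedded resources, not page views.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Parse one CLF entry into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


def clean(records):
    """Data cleaning: drop requests for images and other embedded resources."""
    return [r for r in records
            if not r["url"].lower().endswith(IGNORED_EXTENSIONS)]


def sessionize(records):
    """Session identification: group page requests by host and time gap."""
    sessions = []
    last_seen = {}  # host -> (time of last request, pages of its open session)
    for r in sorted(records, key=lambda rec: rec["time"]):
        previous = last_seen.get(r["host"])
        if previous is None or r["time"] - previous[0] > SESSION_TIMEOUT:
            pages = []
            sessions.append((r["host"], pages))
        else:
            pages = previous[1]
        pages.append(r["url"])
        last_seen[r["host"]] = (r["time"], pages)
    return sessions


# Hypothetical log entries used only to exercise the functions above.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2013:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2013:13:55:37 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2013:13:56:00 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2013:13:58:10 +0000] "GET /products.html HTTP/1.0" 200 4100',
    '10.0.0.1 - - [10/Oct/2013:15:10:00 +0000] "GET /contact.html HTTP/1.0" 200 900',
]

records = [parse_line(line) for line in SAMPLE_LOG]
sessions = sessionize(clean(records))
```

With the sample entries, the image request is dropped during cleaning and the host 10.0.0.1 ends up with two sessions, because its last request arrives more than 30 minutes after the previous one. Identifying users by host address alone is a simplification; proxies and shared addresses would need the additional heuristics discussed in the literature.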
2) Pattern Discovery
After the data in the log file has been converted into a formatted form, the pattern discovery process is carried out [8]. From the existing data of the log files, many useful patterns are identified using user ids, session details, time outs, etc. Pattern discovery is the key component for analysing the pre-processed data. In this phase, various algorithms and knowledge discovery techniques from pattern recognition, data mining, machine learning, etc. are applied. The main techniques are statistical analysis, clustering, classification, association rules, sequential patterns and dependency modeling.
Statistical Analysis computes measures such as the median, the mean and frequency counts.
Clustering of users helps to discover groups of users with similar navigation patterns (to provide personalized Web data).
Classification is the technique of assigning a data item to one of several predefined classes.
Association Rules find correlations among pages accessed together by a client.
Sequential Patterns extract frequently occurring inter-session patterns in which the occurrence of a set of items is followed by another item in time order.
Dependency Modeling checks whether there are any significant dependencies among the variables in the Web domain.
Fig 2. Web Usage Mining Process
[Figure: Web Log File, Data Preprocessing, Pattern Discovery and Pattern Analysis, in sequence.]
3) Pattern Analysis
This process eliminates the irrelevant rules or patterns that were generated and extracts the interesting rules or patterns from the output of pattern discovery. The most familiar form of pattern analysis uses a knowledge query mechanism such as SQL (Structured Query Language), or loads the usage data into a data cube to perform OLAP (Online Analytical Processing) operations. Visualization techniques, such as graphing patterns or assigning colors to different values, highlight overall patterns or trends emerging in the data. Various mechanisms used for mining these patterns are mentioned below:
Site Filter: This technique is implemented by the WEBMINER system.
The site filter uses the site topology to filter out rules and patterns that are not interesting. Any rule that merely identifies direct hypertext links among pages is filtered out [10].
mWAP (Modified Web Access Pattern): This technique eliminates the need for repeated reconstruction of intermediate WAP-trees during mining and considerably reduces execution time [11].
EXT-PrefixSpan: This method mines the complete set of patterns but greatly reduces the effort of candidate subsequence generation. The prefix-projection process involved in this method substantially reduces the size of the projected database [12].

IV. LITERATURE REVIEW
In this section we look at some frameworks that have been studied for implementing Web Usage Mining, and at various techniques and algorithms for pattern discovery and analysis.
The author of [10] proposed a framework for web page personalization based on web access. This framework follows a three-step process. First, it recommends processing data not only from the web log but also from the site topology and from a page classification (head, content, navigation, look up, personal) based on physical and usage characteristics; these heuristics are then used to determine users and sessions. The data are then transformed into transactions, which represent clusters of page preferences for individual users. Data cleaned and transformed in this way is presented to one of the pattern discovery methods.
The hybrid approach to web usage mining proposed in [25] combines the compact HPG (Hypertext Probabilistic Grammar) approach with explicit OLAP. In this model, data is stored in a database through Quilt and XML Query. The constraints for the analysis are built on top of this database, and the data together with the constraints are used to model Hypertext Probabilistic Grammars, which are then mined with a Breadth First Search (BFS) based algorithm for mining association rules.
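Association rules, which recur throughout the approaches reviewed in this section, capture pages that tend to be accessed together. The sketch below is only an illustration of the idea for rules between pairs of pages, using the standard support and confidence measures; it is not the BFS-based algorithm of [25] nor any Apriori variant from the literature. The function name `pair_rules`, the thresholds and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between pairs of pages.

    support    = fraction of sessions that contain both pages
    confidence = sessions containing both pages / sessions containing lhs
    """
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    page_counts = Counter(page for s in page_sets for page in s)
    pair_counts = Counter(
        pair for s in page_sets for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical user sessions, each a list of requested pages.
SAMPLE_SESSIONS = [
    ["/index", "/products", "/cart"],
    ["/index", "/products"],
    ["/index", "/contact"],
    ["/products", "/cart"],
]

strong_rules = pair_rules(SAMPLE_SESSIONS, min_support=0.5, min_confidence=0.7)
```

In the sample data, /cart and /products occur together in two of the four sessions (support 0.5), and every session that contains /cart also contains /products (confidence 1.0), so the rule /cart to /products is the only one that passes the 0.7 confidence threshold. Algorithms such as Apriori extend the same counting idea to larger page sets while pruning infrequent candidates.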
The algorithm proposed in [13] is based on Maximal Forward References. These are used for mining path traversal patterns in environments where documents or objects are linked together to facilitate interactive access on the web. Two algorithms are devised for determining large reference sequences. One is based on hashing and pruning techniques, and the other is an improvement that reduces the number of database scans required.
The Markov chain algorithm proposed in [14] is based on the Association Rule Mining technique of Web Usage Mining and is used for link prediction. The structural knowledge is tracked in the form of three different types of clusters: grid clusters, hierarchical clusters and reference clusters. The predicted Web pages and the resulting Web structures are then grouped to assist Web users in navigating the Web site.
An improved AprioriAll algorithm has been proposed for Web log mining in [15]. It is based on Association Rule Mining. It improves on the existing Apriori algorithm by adding the UserID property at each step of generating the candidate set and at every step of scanning the database. This helps to decide whether an item in the candidate set should be put into the large set that will be used to produce the next candidate set. It also restricts the size of the candidate set as soon as it is produced.
In [16] the author proposed the FP-growth and PrefixSpan algorithms, based on Association Rule Mining, for Web Usage Mining in a real business case. Maximum Forward Path (MFP) is also used in the web usage mining model, along with sequential pattern mining using PrefixSpan, to reduce the interference of "false visits" caused by the browser cache and to improve the mining of frequent traversal paths.
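The idea of maximal forward references (or maximum forward paths) mentioned above can be illustrated with a short sketch. It follows the usual description of the technique: a forward path is extended until the user moves back to a page already on it, at which point the path is recorded and truncated. The code is a simplified illustration under that reading, not the implementation of [13] or [16]; the function name and the sample session are hypothetical.

```python
def maximal_forward_references(path):
    """Split one traversal path into its maximal forward references.

    A backward move is a request for a page that is already on the
    current forward path. When a backward move follows a forward move,
    the current path is emitted as a maximal forward reference.
    """
    current = []          # the forward path built so far
    results = []
    moving_forward = True
    for page in path:
        if page in current:
            # Backward reference: record the path only if we were moving forward.
            if moving_forward:
                results.append(list(current))
            current = current[:current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        results.append(list(current))
    return results


# Hypothetical session: the user browses forward, backs up, and branches again.
session = ["A", "B", "C", "D", "C", "B", "E", "G", "H",
           "G", "W", "A", "O", "U", "O", "V"]
forward_paths = ["".join(p) for p in maximal_forward_references(session)]
# forward_paths == ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
```

Backward moves made only to reach another branch are filtered out in this way, which leaves the forward paths that are then counted to find frequent traversal patterns.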
Self Organized Maps were proposed in [17]. They are a kind of artificial neural network but in practice serve as a clustering technique, used here to identify users' navigational patterns. The work focuses on the transformations required to turn the data stored in the Web server log files into input for the Self Organized Maps.
An algorithm based on graph partitioning is used to identify users' access patterns in [18]. An undirected graph is built from the connectivity between each pair of web pages, and weights are assigned to the edges of the graph. This improved the quality of clustering for users' navigation patterns in web usage mining systems.
Ant-based clustering, proposed in [8], is applied to preprocessed logs to extract frequent access patterns for pattern discovery, which are then displayed in an interpretable format. It uses a neighborhood function; after clustering, alignment processing is applied to the sequences obtained in each cluster, and a representative is extracted for each cluster.
The modified k-means clustering algorithm proposed in [19] solves the empty cluster problem. This problem had been considered unimportant and was handled by executing the k-means algorithm repeatedly a number of times. To deal with large data sets, a number of different parallel implementations of k-means were developed and used for clustering.
A custom-built Apriori algorithm was proposed for effective pattern analysis of web logs for usage and access trends [20]. The algorithm was used to identify, in a reasonable execution time, the different rules or correlations among all the frequent item sets of an educational log file. The rules (correlations) obtained from the system helped website developers make decisions that improved their site effectively.
K-means with a Genetic Algorithm, based on rough sets to find interval sets of clusters, was proposed in [22].
The refined initial condition allowed the iterative algorithm to converge to a "better" local minimum. In the next step, the authors proposed a GA-based refinement algorithm to improve the cluster quality. The proposed algorithm was evaluated with web access logs obtained from the Internet Traffic Archive (ITA) and showed that refined initial starting points and post-processing refinement of clusters lead to improved solutions.
The Naive Bayesian classification algorithm proposed in [21] was used to identify interested users. The performance of this algorithm was measured on web log data using session-based timing, page depth relative to site length, page visits and repeated user profiling. It showed improvements in time and memory utilization when applied to web log files.
The Learning Based K-Mean clustering algorithm proposed in [23] is used to develop the learning capabilities and reduce the computational intensity of a competitive-learning multilayered neural network. A multi-layered network architecture with a back propagation learning mechanism is used to identify and analyze useful knowledge from the existing Web log data. It uses the learning capabilities of neural networks to classify the web traffic data set.
Improve-K-Means clustering in [24] is used to improve the clustering patterns. Its idea is to assign the data objects through iterative clustering so as to minimize the target function, so that each generated cluster is as compact and as independent as possible. The K-Means clustering algorithm is based on the effective index.
From this literature survey we observe that several algorithms implement clustering on web data. These clustering techniques are found to be useful and efficient, and each enhances the Web usage mining process in one way or another.
But as the web grows rapidly day by day as an information gateway, the size of clusters will also increase because more users access it. This may result in data similarity during clustering. We therefore propose a technique for cluster formation and optimization that lays a basis for personalizing web pages, so that a user can easily switch to the page where his or her requirements are fulfilled.

V. PROPOSED METHODOLOGY
To give web users an easier and faster way to access the data that matches their interests and needs, we propose a plan that not only supports a better way of clustering but also focuses on cluster optimization to support improved web usage mining. This methodology follows the same sequential phases as Web Usage Mining. The flow of the proposed methodology is given as follows:
1. Tracking the user sessions with the help of the Web Log File.
2. Discovering the user access patterns using NeuroFuzzy computing.
3. Optimizing the grouped clusters using the Ant Nest mate approach.
4. Generating and tracking user profiles from the clusters.
User sessions are tracked and pre-processed by converting the usage, content and structure information available in the data sources into the data abstractions necessary for discovering interesting navigation patterns. The criteria for interesting navigation patterns are dynamically specified by a human expert. Once the navigational patterns are determined, NEFCLASS, a theory based on neural and fuzzy clustering, is used for processing the clusters [27]. This NeuroFuzzy approach adapts to changes in users' navigation patterns over time without losing earlier information. The processed clusters are then optimized using a swarm intelligence technique [28], the Ant Nest mate approach, which generates user profiles that hold the data on users' interests and needs [29].

VI. CONCLUSION
Web Usage Mining plays a vital role in improving the usability of website design.
It emphasizes improving customer relations, system performance and other relevant factors. Web usage mining supports web site design, personalization of the web server, business decision making, etc. In this paper we focused on the process of web usage mining, which involves three important tasks: Preprocessing, Pattern Discovery and Pattern Analysis. We have also reviewed various algorithms that are implemented for improved web usage mining. However, as the size of the web and access to it increase day by day, cluster size is also increasing. Hence these clusters can be optimized, for which a proposed plan of work is specified.

REFERENCES
[1] Srivastava J, Desikan P and V Kumar, "Web Mining - Concepts, Applications & Research Direction", in 2002 Conference.
[2] Srivastava J, Desikan P and V Kumar, "Web Mining Accomplishment & Future Directions", in 2004 Conference.
[3] R. Kosala and H. Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, Vol. 2, No. 1, pp. 1-15, 2000.
[4] Qingtian Han, Xiaoyan Gao, Wenguo, "Study on Web Mining Algorithm based on Usage Mining", Computer Aided Industrial Design and Conceptual Design, CAID/CD 2008, 2008.
[5] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Application of Usage Pattern from Web Data", ACM SIGKDD, Jan 2000.
[6] Wang Tong, He Pi-lian, "Web Log Mining by an Improved AprioriAll Algorithm", Proceedings of World Academy of Science, Engineering and Technology, Volume 4, February 2005, ISSN 1307-6884, © 2005 WASET.ORG.
[7] Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, and Mohamad Farhan Mohamad Mohsin, "Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm", World Academy of Science, Engineering and Technology, 2008.
[8] Kobra Etminani, Mohammad-R. Akbarzadeh-T, and Noorali Raeeji Yanehsari, "Web Usage Mining: Users' Navigational Patterns Extraction from Web Logs using Ant-based Clustering Method", in Proc. IFSA-EUSFLAT, 2009.
[9] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", IEEE Computer Society, 2009, pp. 558.
[10] R. Cooley, B. Mobasher, and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns", Knowledge and Information Systems, vol. 1, 1999.
[11] D. Vasumathi and A. Govardan, "BC-WASPT: Web Access Sequential Pattern Tree Mining", IJCSNS International Journal of Computer Science and Network Security, Vol. 9, June 2009, pp. 569–571.
[12] S. Vijayalakshmi, V. Mohan, S. Suresh Raja, "Mining Constraint-based Multidimensional Frequent Sequential Pattern in Web Logs", European Journal of Scientific Research, Vol. 36, pp. 480-490, 2009.
[13] Ming-Syan Chen, Jong Soo Park, Philip S. Yu, "Efficient Data Mining for Path Traversal Patterns", IEEE Transactions on Knowledge and Data Engineering, Vol. 10, No. 2, March/April 1998.
[14] Jianhan Zhu, Jun Hong, John G. Hughes, "Using Markov Chains for Link Prediction in Adaptive Web Sites", Soft-Ware 2002, LNCS 2311, pp. 60–73, 2002.
[15] Wang Tong, He Pi-lian, "Web Log Mining by an Improved AprioriAll Algorithm", World Academy of Science, Engineering and Technology, 2005.
[16] Hengshan Wang, Cheng Yang, Hua Zeng, "Design and Implementation of a Web Usage Mining Model Based on FPgrowth and PrefixSpan", Communications of the IIMA, 2006, Volume 6, Issue 2.
[17] Paola Britos, Damián Martinelli, Hernán Merlino, Ramón García-Martínez, "Web Usage Mining Using Self Organized Maps", International Journal of Computer Science and Network Security, Vol. 7, No. 6, June 2007.
[18] Mehrdad Jalali, Norwati Mustapha, Ali Mamat, Md. Nasir B Sulaiman, "Web User Navigation Pattern Mining Approach Based on Graph Partitioning Algorithm", Journal of Theoretical and Applied Information Technology, 2008.
[19] Malay K. Pakhira, "A Modified k-means Algorithm to Avoid Empty Clusters", International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009.
[20] Sandeep Singh Rawat, Lakshmi Rajamani, "Discovering Potential User Browsing Behaviors using Custom-built Apriori Algorithm", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 2, No. 4, August 2010.
[21] Mahdi Khosravi, Mohammad J. Tarokh, "Dynamic Mining of Users Interest Navigation Patterns Using Naive Bayesian Method", 978-1-4244-8230-6/10/$26.00 © 2010 IEEE.
[22] N. Sujatha, K. Iyakutty, "Refinement of Web Usage Data Clustering from K-means with Genetic Algorithm", European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3 (2010), pp. 464-476.
[23] Vinita Shrivastava, Neetesh Gupta, "Performance Improvement of Web Usage Mining by Using Learning Based K-Mean Clustering", International Journal of Computer Science and its Applications, ISSN 2250-3765, Vol. I, Issue I, 2011.
[24] TingZhong Wang, "The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis", in Advances in CSIE, Vol. 2, AISC 169, pp. 613–618, Springer-Verlag Berlin Heidelberg, 2012.
[25] Jespersen, S. E., Thorhauge, J., & Pedersen, T. B. (2002). A Hybrid Approach to Web Usage Mining. Retrieved April 22, 2009, from SpringerLink: http://www.springerlink.com/content/26rynqvgkhephq1x/fulltext.pdf.
[26] Rajni Pamnani, Pramila Chawan, "Web Usage Mining: A Research Area in Web Mining", International Conference on Recent Trends in Computer Engineering, ISCET 2010, RIMT, Punjab, ISBN 978-81-910301-0-2.
[27] Detlef Nauck and Rudolf Kruse, "NEFCLASS - A Neuro-Fuzzy Approach for the Classification of Data", Proceedings of the 1995 ACM Symposium on Applied Computing, ACM, 1995.
[28] O. A. Mohamed Jafar and R. Shivkumar, "Ant Based Clustering Algorithms: A Brief Survey", International Journal of Computer Theory and Engineering, Vol. 2, No. 5, October 2010, 1793-8201.
[29] Anna Alphy and S. Prabakaran, "Cluster Optimization for Improved Web Usage Mining using Ant Nest mate Approach", IEEE International Conference on Recent Trends in Information Technology, June 3-5, 2011.