A Literature Review on Web Usage Mining Techniques

P. Srinivasa Rao 1, Dr. D. Vasumathi 2
1 Asst. Professor, Department of IT, MRCET, Hyderabad, Telangana
2 Professor, Department of CSE, JNTUH CEH, Hyderabad, Telangana

Abstract

The rapid growth of the World Wide Web poses unprecedented scaling challenges for search engines. In the modern era of high-volume information generation, the search engine has proved to be a pivotal technology for data mining and information retrieval. General-purpose search engines have achieved a great deal of success in providing relevant information to the user, and they are used as an effective tool for retrieving information from a huge information repository. For instance, Google, one of the most popular search engines, not only returns fitting results to users around the world from an index of more than 2 billion web pages, but also usually answers a query within 0.5 seconds. The ubiquity of the Internet and the Web has led to the emergence of several Web search engines with varying capabilities. These search engines index Web sites, images, Usenet news groups, content-based directories, and news sources with the goal of producing the search results that are most relevant to user queries. However, only a small number of web users actually know how to utilize the true power of Web search engines. To address this problem, search engines have started providing access to their services via various interfaces.

1. INTRODUCTION

A search engine, as a tool for investigating the Web, must return the desired results for any given query. The success of a search engine depends directly on the satisfaction level of its users. Users want information to be presented to them within a short time interval. They also expect the most relevant and recent information to be presented [3]. Most search engines cannot completely satisfy users' requirements, and their search results are often inaccurate and irrelevant [4].
Many researchers have already reported on various aspects of search engines [5, 6]. A meta-search engine is a kind of search engine that provides information services to users without maintaining its own database of web pages. It sends the search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried [4]. The lack of any specific structure and the wide range of data published on the web make it highly challenging for users to find data without external assistance. It is generally believed [8, 9] that a single general-purpose search engine for all web data is improbable, because its processing power cannot scale up to the fast-growing and unbounded amount of web data. A tool that is swiftly gaining approval among users is the meta-search engine [10]. A meta-search engine can run a user query across multiple component search engines concurrently, retrieve the generated results, and aggregate them. The benefits of meta-search engines over individual search engines are notable [11]. A meta-search engine increases the search coverage of the web, providing higher recall. The overlap among the primary search engines is generally small [12] and can be as low as three percent of the total results retrieved. A meta-search engine addresses the scalability issue of searching the web and facilitates the use of multiple search engines, enabling consistency checking [13]. It also enhances retrieval effectiveness, providing higher precision because of the 'chorus effect' [14]. Web meta-searching, in contrast to rank aggregation, is a problem that presents its own unique challenges.
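As a point of reference for the rank aggregation mentioned above, the following minimal sketch merges ranked URL lists from several component engines with a Borda count. Borda count is a classic rank-aggregation baseline chosen here purely for illustration; it is not a method attributed to any of the cited works, and the function name `borda_merge` and the example URLs are hypothetical.

```python
from collections import defaultdict


def borda_merge(ranked_lists):
    """Merge ranked URL lists from component engines using a Borda count.

    Each list awards (n - position) points to its URLs, where n is the
    length of the longest list. A URL absent from a list gets 0 from it.
    """
    n = max((len(lst) for lst in ranked_lists), default=0)
    scores = defaultdict(int)
    for lst in ranked_lists:
        seen = set()
        for position, url in enumerate(lst):
            if url in seen:  # ignore duplicates within one engine's list
                continue
            seen.add(url)
            scores[url] += n - position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical result lists returned by three component engines.
engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "a.example/1", "d.example/4"]
engine_c = ["b.example/2", "c.example/3", "a.example/1"]
merged = borda_merge([engine_a, engine_b, engine_c])
```

A scheme of this kind uses only the positions assigned by each engine, so it treats every result as a vote and takes no account of titles, snippets, or other content.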
The results that a meta-search system gathers from its component engines are not like votes or other single-dimensional entities: apart from the individual ranking assigned to it by a component engine, a Web result also carries a title, a small fragment of text that indicates its relevance to the submitted query (the textual snippet) [7, 15], and a uniform resource locator (URL). The traditional rank aggregation techniques are therefore insufficient to provide a robust ranking mechanism for meta-search engines, because they ignore the semantics accompanying each Web result.

2. LITERATURE REVIEW

With the development of the Internet, web services generate a large amount of log information, and how to mine user-preferred browsing paths has become an important research area. Current research focuses mainly on the mining of user-preferred browsing paths. This section gives a brief review of some of the related works.

Leonidas Akritidis et al. [16] presented the QuadRank technique, which considers additional information regarding the query terms, the collected results, and the data correlation. They implemented QuadRank in a real-world meta-search engine, tested it comprehensively for both effectiveness and efficiency in that search environment, and also used the task from the TREC-2009 conference. They demonstrated that in most cases their technique outperformed all component engines.

Hideaki Ishii et al. [17] proposed a technique to reduce the computation and communication loads of the PageRank algorithm. They developed a method to systematically aggregate web pages into groups by exploiting the sparsity inherent in the web. For each group, they computed an aggregated PageRank value that can be distributed among the group members. They provided a distributed update scheme for the aggregated PageRank along with an analysis of its convergence properties.
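For orientation, the sketch below computes ordinary PageRank by power iteration on a tiny hypothetical link graph. It is the standard centralized algorithm, not the aggregated or distributed scheme of Ishii et al. [17]; the damping factor of 0.85, the tolerance, and the graph itself are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Plain PageRank by power iteration.

    links maps each page to the list of pages it links to. Pages with no
    out-links (dangling pages) spread their rank evenly over all pages.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages is redistributed uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        converged = sum(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical four-page web: D links in but nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Every iteration touches every page and every link, which is the computation and communication cost that grouping pages is intended to reduce.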
Ishii et al. also provided a numerical example to illustrate the level of reduction in computation achieved while keeping the error in rankings small.

Tagging activity has recently been identified as a potential source of knowledge about personal interests, preferences, goals, and other attributes known from user models. Tags themselves can therefore be used for finding personalized recommendations of items. Frederico Durao and Peter Dolog [18] presented a tag-based recommender system that suggests similar Web pages based on the similarity of their tags from a Web 2.0 tagging application. The proposed approach extends the basic similarity calculus with external factors such as tag popularity, tag representativeness, and the affinity between user and tag.

Soheila Abrishami et al. [19] aimed to design a hybrid recommendation system that integrates semantic information with Web usage mining and clusters pages by semantic similarity. Since Web pages are treated as ontology individuals, frequent navigational patterns take the form of ontology instances instead of Web page addresses, and page clustering is done using semantic similarity. The result is used to generate web page recommendations for users. The recommender engine presented in [19], which is based on semantic patterns and page clustering, creates a list of appropriate recommendations. The results of implementing this hybrid recommendation system indicate that integrating semantic information and page access sequences into the patterns yields more accurate recommendations.

Yang and Hanjalic [20] developed a prototype-based re-ranking framework. It constructs meta re-rankers corresponding to visual prototypes that represent the textual query, and it learns the weights of a linear re-ranking model that combines the results of the individual meta re-rankers to produce the re-ranking score of a given image taken from the initial text-based search result.
The induced re-ranking model was learned in a query-independent way, requiring only a limited labeling effort and being able to scale up to a broad range of queries. Experimental results on the Web Queries dataset demonstrated that the proposed method outperformed the existing supervised and unsupervised re-ranking methods.

To provide personalized preferred paths that fulfill user needs, Zhurong Zhou and Dengwu Yang [21] proposed a novel method to compute the similarity between preferred paths and fields given by experts. First, the similarity between each page on a preferred path and the given field is computed. Second, these page-level similarities are averaged over all pages on the preferred path, and the average is used as the similarity between the preferred path and the given field. Experimental results showed that the method was accurate and scalable. It could be applied to optimize websites or to design personalized services.

Nazneen Tarannum S.H. Rizvi and Prof. Ranjit R. Keole [22] presented a new framework for semantic-enhanced Web-page recommendation (WPR), together with a suite of enabling techniques that include semantic network models of domain knowledge and Web usage knowledge, and querying techniques.

If a user returns to a page by using the back button of the browser, the browser serves a copy of that page stored in its cache. This kind of access records no entry in the log file, which causes the problem of missing references; path completion techniques are therefore required to fill in these entries in the log file [23].

The learnt object components range from local structures over line segments to global silhouette-like descriptions. This representation can be used for categories in a totally unsupervised fashion. Furthermore, it employs the representation as the basis for building a supervised multi-category detection system that makes efficient use of training examples and outperforms pure feature-based representations.

| Author | Preprocessing techniques | Focused on | Remarks |
|---|---|---|---|
| Cooley et al. [7] | Data cleaning, user identification, session identification, transaction identification | Transaction identification | Proposed heuristics are not suitable for complex web sites. |
| Pabarskaite [15] | Advanced data cleaning, filtering and data visualization | Data cleaning | Did not perform any other preprocessing technique, such as user identification or session identification. |
| Tanasa et al. [24] | Data fusion, data cleaning, data structuration and data summarization | All except path completion | Ignored the removal of wrong HTTP request status codes. |
| Castellano et al. [25] | Data cleaning module, data structuration module and data filtering module | All except path completion | Included almost all steps of data preprocessing. |
| Dell et al. [26] | Data cleaning and filtering, user identification, session identification | Session identification | Better session creation, generating sessions simultaneously by using integer programming. |
| Yan Li [23] | Data cleaning, user identification, session identification and path completion | Path completion | Combined two approaches, maximal forward reference length and reference length, to find the completed path. |
| Xiang-ying Li [27] | Data cleaning, client identification, session identification and path completion | Client identification and session identification | High accuracy and high efficiency but poor operating rate. |

Tanasa et al. [24] divide the preprocessing process into four steps: data fusion, data cleaning, data structuration, and data summarization. In data fusion, the authors joined multiple log files from different web servers, and also from site maps, into a single log file. They then anonymized the log file by encrypting host names. Data cleaning is performed by removing requests for non-analyzed resources, such as multimedia files (images, audio, video, etc.), and robot-generated requests. In the data structuration step, the authors performed user identification by authentication data or IP address, session identification by host and agent, and page view identification by site map. Finally, the data summarization step covers pattern analysis by using data generalization and aggregation. They did not consider unsuccessful requests in the data cleaning phase; these also need to be removed to avoid unnecessary calculations in later phases of the web log mining process.

When evaluated on the TRECVID 2005 video benchmark, the proposed approach improves retrieval on average by up to 32% relative to the baseline text search method in terms of story-level Mean Average Precision. For people-related queries, which usually have recurrent coverage across news sources, the relative improvement can reach 40%. Most importantly, the proposed method does not require any additional input from users (e.g., example images) or complex search models for special queries (e.g., named person search).

Castellano et al. [25] developed a tool, LODAP (Log Data Preprocessor), which takes a log file as input and produces a statistical analysis and user sessions as output. The tool is divided into three modules: a data cleaning module, a data structuration module, and a data filtering module.

Dell et al. [26] introduced the use of integer programming for better session identification. This method generates sessions simultaneously and produces sessions that better match an empirical distribution.

Xiang-ying Li [27] proposed an algorithm named CSIA (Client and Session Identification Algorithm) for the identification of users and sessions.
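The cleaning and session-identification steps surveyed above can be illustrated with a minimal sketch. It assumes log records already parsed into dictionaries with hypothetical field names (`ip`, `agent`, `time`, `url`, `status`), treats one IP address plus user agent as one user, and starts a new session after 30 minutes of inactivity. The timeout, the suffix list, and the robot markers are illustrative assumptions, not values taken from the cited works.

```python
from datetime import datetime, timedelta

# Illustrative assumptions, not values from the surveyed papers.
MEDIA_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".mp3", ".mp4")
ROBOT_MARKERS = ("bot", "crawler", "spider")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Drop failed requests, media/resource files, and robot traffic."""
    kept = []
    for r in records:
        if not 200 <= r["status"] < 400:  # unsuccessful request
            continue
        if r["url"].lower().endswith(MEDIA_SUFFIXES):
            continue
        agent = r["agent"].lower()
        if r["url"] == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
            continue
        kept.append(r)
    return kept


def sessionize(records):
    """Group records into per-user sessions split by an inactivity timeout."""
    sessions = []
    last_seen = {}  # (ip, agent) -> (time of last request, that user's session)
    for r in sorted(records, key=lambda rec: rec["time"]):
        user = (r["ip"], r["agent"])
        previous = last_seen.get(user)
        if previous is None or r["time"] - previous[0] > SESSION_TIMEOUT:
            current = []  # first request, or gap too long: new session
            sessions.append(current)
        else:
            current = previous[1]
        current.append(r["url"])
        last_seen[user] = (r["time"], current)
    return sessions


def rec(ip, agent, minutes, url, status=200):
    """Build one hypothetical parsed log record."""
    t0 = datetime(2024, 1, 1, 10, 0)
    return {"ip": ip, "agent": agent, "time": t0 + timedelta(minutes=minutes),
            "url": url, "status": status}


log = [
    rec("10.0.0.1", "Firefox", 0, "/index.html"),
    rec("10.0.0.1", "Firefox", 1, "/logo.png"),
    rec("10.0.0.2", "ExampleBot/1.0", 2, "/index.html"),
    rec("10.0.0.1", "Chrome", 3, "/index.html"),
    rec("10.0.0.1", "Firefox", 5, "/products.html"),
    rec("10.0.0.1", "Firefox", 6, "/missing.html", status=404),
    rec("10.0.0.1", "Firefox", 50, "/contact.html"),
]
sessions = sessionize(clean(log))
```

In this example the image request, the robot request, and the failed request are discarded, the two browsers on the same IP address are treated as different users, and the 45-minute gap opens a new session for the Firefox user. Path completion is not attempted here.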
CSIA takes a comprehensive approach, combining IP address, topology, browser version, and referrer page to identify unique users with better accuracy and efficiency. The author implemented the algorithm in Java, citing its good space utilization. However, the algorithm suffers from a reduced operating rate because of the many factors it considers when identifying a user.

3. CONCLUSIONS

Web-page recommendation or personalization plays a significant role in intelligent web systems. Useful knowledge discovery from web usage data and satisfactory knowledge representation for effective Web-page recommendations are crucial and challenging. The common problems of the existing techniques are as follows. The major problem of many online web sites is that they present many choices to the client at a time, which usually makes finding the right product or information on the site a strenuous and time-consuming task. Existing techniques make little use of ontology and history knowledge for personalization. Owing to a lack of accuracy and long run times, existing recommendation systems also exhibit low coverage. Pages that were recently added or are rarely visited by end users are not shown by the existing techniques, which is also an important problem. These problems motivate research on web-page recommendation. In future, a combination of two or more user identification techniques could be used to achieve better user identification. This paper has reviewed the various data preprocessing techniques that have been applied, with their advantages and disadvantages, and has drawn conclusions and directions for future research.

REFERENCES

[1] Abawajy, J.H., Hu, M.J., "A new Internet meta-search engine and implementation," The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005.
[2] Juan Tang, Ya-Jun Du, Ke-Liang Wang, "Design and Implement of personalize Meta-Search Engine Based on FCA," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[3] K. Satya Sai Prakash, S. V. Raghavan, "DLAPANGSE: Distributed Intelligent Agent based Parallel Architecture for Next Generation Search Engines," IIT Madras, India, 2001.
[4] Z. Li, Y. Wang, V. Oria, "A New Architecture for Web Meta-Search Engines," Seventh Americas Conference on Information Systems, CIS Department, New Jersey Institute of Technology, 2001.
[5] A. Arasu et al., "Searching the Web," ACM Transactions on Internet Technology, Vol. 1, August 2001, pp. 2-43.
[6] G. S. Goldszmidt, "Distributed Management by Delegation," Ph.D. Thesis, Columbia University, 1996.
[7] R. Cooley, B. Mobasher, J. Srivastava (1999), Data preparation for mining world wide web browsing patterns, in Journal of Knowledge and Data Engineering Workshop, IEEE, Vol. 1, pp. 5-32.
[8] Sugiura, A., Etzioni, O., 2000. Query routing for Web search engines: architecture and experiments. Computer Networks 33 (1-6), 417-429.
[9] Manning, C.D., Raghavan, P., Schutze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
[10] Meng, W., Yu, C., Liu, K.-L., 2002. Building efficient and effective metasearch engines. ACM Computing Surveys 34 (1), 48-89.
[11] Spink, A., Jansen, B.J., Blakely, C., Koshman, S., 2006. Overlap among major Web search engines. In: Proceedings of the IEEE International Conference on Information Technology: New Generations (ITNG), pp. 370-374.
[12] Aslam, J.A., Montague, M.H., 2001a. Metasearch consistency. In: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 386-387.
[13] Vogt, C.C., 1999. Adaptive combination of evidence for information retrieval. Ph.D. Thesis. University of California at San Diego.
[14] Dwork, C., Kumar, R., Naor, M., Sivakumar, D., 2001. Rank aggregation methods for the Web. In: Proceedings of the ACM International Conference on World Wide Web (WWW), pp. 613-622.
[15] Pabarskaite, Z. (2002), Implementing advanced cleaning and end-user interpretability technologies in web log mining, in 24th International Conference on Information Technology Interfaces (ITI), Vol. 1, pp. 109-113.
[16] Leonidas Akritidis, Dimitrios Katsaros and Panayiotis Bozanis, "Effective rank aggregation for metasearching," The Journal of Systems and Software, vol. 84, pp. -143, 2011.
[17] Hideaki Ishii, Roberto Tempo and Er-Wei Bai, "A Web Aggregation Approach for Distributed Randomized PageRank Algorithms," IEEE Transactions on Automatic Control, Vol. 57, No. 11, pp. 2703-2717, 2012.
[18] Frederico Durao, Peter Dolog, "A Personalized Tag-Based Recommendation in Social Web Systems," 2012.
[19] Soheila Abrishami, Mahmoud Naghibzadeh, Mehrdad Jalali, "Web Page Recommendation Based on Semantic Web Usage Mining," Volume 7710 of the series Lecture Notes.
[20] Linjun Yang, Alan Hanjalic, "Prototype-Based Image Search Reranking," IEEE Transactions on Multimedia, Vol. 14, No. 3, June 2012.
[21] Zhou, Zhurong, and Dengwu Yang, "Personalized Recommendation of Preferred Paths Based On Web Log," Journal of Software 9, no. 3, pp. 684-688, 2014.
[22] Nazneen Tarannum S.H. Rizvi and Prof. Ranjit R. Keole, "A Preliminary Review of Web-Page Recommendation Information Retrieval Using Mining," International Journal of Advance Research in Computer Science and Management.
[23] Yan Li (2008), Research on path completion technique in web usage mining, in International Symposium on Computer Science and Computational Technology, IEEE, Vol. 1, pp. 554-559.
[24] D. Tanasa, B. Trousse (2004), Advanced Data Preprocessing for Intersites Web Usage Mining, in IEEE Intelligent Systems, Vol. 19, Issue 2, pp. 59-65.
[25] G. Castellano, A. Fanelli, M. Torsello, LODAP: A Log Data Preprocessor for Mining Web Browsing Patterns, in Proceedings of the 6th Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, pp. 12-17.
[26] R. F. Dell (2008), Web user session reconstruction using integer programming, in International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/ACM/WIC, Vol. 1, pp. 385-388.
[27] Xiang-ying Li (2013), Data Preprocessing in Web Usage Mining, in 19th International Conference on Industrial Engineering and Engineering Management, pp. 257-266.