International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 3 - Apr 2014
ISSN: 2231-5381 http://www.ijettjournal.org Page 144

Web Data Mining: Survey

Avneet*, Hardeep Singh#
* M.Tech Student, # Assistant Professor
1 Dept of CSE, Lovely Professional University, Phagwara, Punjab
2 Dept of ECE, Lovely Professional University, Phagwara, Punjab

Abstract: The use of the World Wide Web increases day by day, and with this rapid growth it has become the biggest database. Most people access web services publicly on a daily basis and generate a huge amount of data, such as text, images, multimedia, queries, user logs and blog data. Mining web data is difficult because the data is huge, dynamic and diverse. Web Usage Mining is used to mine user logs, Web Structure Mining is used to mine the link structure between web pages, and Web Content Mining is used to mine the content of web pages. This survey presents the types of web data mining, the problems that occur in web data mining, and its applications.

Keywords: Web Data Mining, Web Usage Mining, Web Structure Mining, Web Content Mining, Problems, Applications

I. INTRODUCTION

Data mining (knowledge discovery) is the process of analyzing data from different perspectives, summarizing it into valuable information, and discovering interesting patterns that can improve business processes. It can also be called “GOLD MINING”. Web data mining is an application of data mining and uses data mining techniques to find relevant information. With the rapid growth of web data, the data has become difficult to handle because necessary and unnecessary data are available together. Given the huge amount of web data available, users face many problems while interacting with the World Wide Web. The main problems are:

A. Extracting new information from the available data
The World Wide Web contains a large number of datasets, and users need to find new information in them, for example by accessing user logs to extract a user's behavior and interests.

B. Finding relevant and effective knowledge
The web is a collection of a large number of datasets, and when users query for information, the results they get may be accurate or inaccurate.

C. Understanding the users
The web contains large records of users, from which we can understand user behavior and provide information to users according to their needs.

To find relevant and useful information in this biggest database, data mining techniques are applied; this is called web data mining. Web data mining accesses online web data in order to clean it, extract from it, and find relevant information or discover knowledge. Web data mining also reduces the irrelevant data taken from a web page. “Web data mining is an application of data mining that finds the interesting and potentially useful knowledge from web page. Normally, it is expected that hyperlink structure and user data both helpful for mining process.”

II. TYPES OF WEB DATA MINING

A. Web Usage Mining
Web usage mining is used to analyze user logs and predict the behavior of website users. It uses the secondary data available on the web to predict user behavior while users access the web. Most of this information is generated automatically by web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data [6]. Analyzing such data can help organizations determine the lifetime value of customers and cross-marketing strategies across products. Analysis of server access logs and user registration data can also provide valuable information on how to better structure a web site in order to create a more effective presence for the organization [4]. Usage mining provides effective information through the following steps:
1) Data Collection: First, collect the user-specific data from the web servers.
2) Data Preprocessing: After the data is collected, preprocess the logs and user-specific data to obtain useful information from the huge volume of data.
3) Data Clustering: After cleaning, the data is divided into clusters of similar records for easier pattern discovery.
4) Pattern Discovery and Analysis: After clustering, patterns are identified in the data and then analyzed to gain knowledge.

B. Web Structure Mining
Structure mining summarizes the structure of web pages. Content mining is concerned with the inner structure of a web page, whereas structure mining considers the structure of hyperlinks at the inter-page level. Web structure mining covers the link structure of web pages and categorizes the pages to find similarities and relationships between link structures. The main purpose of structure mining is to extract previously unknown relationships between web pages. Web structure mining allows a business to link the information of its own web site to enable navigation and to cluster information into site maps. This gives its users the ability to access the desired information through keyword association and content mining. The hyperlink hierarchy is also determined, in order to trace how related information within the site relates to competitor links and to connections through search engines and third-party co-links [5]. This enables clustering of connected web pages to establish the relationships between these pages. It is quite useful for establishing connections between two or more organizations. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.

C. Web Content Mining
Web content mining extracts hidden user-specific data or knowledge from the available data or documents. With enhanced accessibility, every organization focuses on making better web pages with large-scale content. A web page contains a large amount of data such as text, video, audio and images, available in unstructured, semi-structured and structured forms, so it is difficult to find interesting and knowledge-bearing patterns in it. The challenges faced by content mining are:
1) Information extraction: The required information or content is extracted from various web pages, mostly with the aim of extracting structured data. Various techniques exist, but extracting information automatically from web pages is difficult.
2) Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. It is difficult to integrate the different types of data and then match their schemas.
3) Detecting noise: Detecting noise in web pages is a very difficult task. Automatically segmenting a web page to extract its main content is an interesting problem. Web pages contain a huge amount of content that changes over time, so it is difficult to find specific information.
4) Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms.
Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. We will introduce a few tasks and techniques to mine such sources.

III. PROBLEMS OF WEB DATA MINING

A. Indexing Data in Search Engines
A search engine mines data to assign page rankings and improve indexing, but indexing large-scale data for a search engine is sometimes complicated.

B. Update Data
Most web pages require their data to be updated from time to time, so mining the data of those pages and predicting future trends is more difficult.

C. Real-Time Data Updating
Web pages that work in a real-time environment need more consideration when their data is mined, because the data is updated or changed within milliseconds, as in air ticketing.

D. Predicting User Requirements
To place advertisements on a user's page, one has to mine the data of the user's searches to find the user's interests and changes in needs, in order to provide better facilities and advertisements according to those needs. Users sometimes change the web pages they use from time to time, so it is difficult to predict user requirements.

E. Fraud and Threat Analysis
Data mining helps detect fraud by checking the activities of a particular user, but this is difficult if a person gives wrong personal information on web sites.

F. Security
Through mining we can learn a user's personal information. Even if we do not know who the particular user is, we come to know that user's activities, so mining user log information and user profile information is an ethical issue.

G. Web Content Mining
It is difficult to mine the content of social network sites because the content is available in huge amounts and changes within milliseconds. Finding relevant information is difficult when noisy and complicated data is present.

H. Web Structure Mining
It is difficult to find the complicated link structure between pages, and also difficult to find duplicate web pages or duplicate web page content.

I. Web Usage Mining
User data is private, so mining user data to provide users with better facilities sometimes creates problems. For example, when a user accesses YouTube, the record of the user's searches is mined to offer him/her recently added videos related to his/her interests, but it is not ethical to use search data that is personal to the user.

IV. APPLICATIONS OF WEB DATA MINING

Today most organizations rely on the internet to improve their business and build better relationships with customers. Web data mining provides analysis of the available web data, and it now extends that analysis by combining corporate information with web traffic data. Web mining tools can be extended and programmed to answer almost any query. Web data mining tools are used in the following areas:
A. Web mining can provide companies with managerial insight into visitor profiles, which helps top management take strategic actions accordingly [2].
B. The company can obtain some subjective measurements through web mining on the effectiveness of its marketing campaign or marketing research, which will help the business improve and align its marketing strategies in a timely manner.
C. In the business world, structure mining can be quite useful in determining the connection between two or more business web sites.
D. The company can identify the strengths and weaknesses of its web marketing campaign through web mining, make strategic adjustments, and then obtain feedback from web mining again to see the improvement.
E. The search engine Google provides advanced and efficient searching capabilities [6].

V. CONCLUSION

The Internet has grown from a simple search tool into a gold mine, but only for those companies that adopt a web data mining strategy. Web data mining comprises different categories for analyzing different kinds of available web data. Most companies implement web data mining to understand their customers' profiles, and also to identify the strengths and weaknesses of their e-marketing efforts on the web through continuous improvement.

VI. REFERENCES

[1] Robert Cooley, Bamshad Mobasher, Jaideep Srivastava, “Web Mining: Information and Pattern Discovery on the WWW”.
[2] Mary Garvin, “Data Mining and the Web: What They Can Do Together”.
[3] B. Masand, M. Spiliopoulou, J. Srivastava, O. Zaiane, ed., Proceedings of “WebKDD – Web Mining for Usage Patterns and User Profiles 2002”, Edmonton, CA.
[4] M. Spiliopoulou, “Data Mining for the Web”, Proceedings of the Symposium on Principles of Knowledge Discovery in Databases (PKDD).
[5] R. Kosala, H. Blockeel, “Web Mining Research: A Survey”, in SIGKDD Explorations 2(1), ACM, July 2000.
[6] R. Cooley, B. Mobasher, J. Srivastava, “Data Preparation for Mining World Wide Web Browsing Patterns”.
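
The four usage-mining steps listed under Web Usage Mining (data collection, data preprocessing, data clustering, pattern discovery and analysis) could be sketched in Python roughly as follows. This is only an illustrative sketch and not a method taken from the paper: the Common Log Format input, the 30-minute session timeout, the grouping of sessions by their most visited top-level section as a simple stand-in for clustering, the counting of page-to-page transitions as the discovered pattern, the sample log lines, and all function names are assumptions.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumption: server access logs use the Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def collect(lines):
    """Step 1: parse raw log lines into records, skipping malformed lines."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append({
                "host": match["host"],
                "time": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
                "path": match["path"],
                "status": int(match["status"]),
            })
    return records


def preprocess(records):
    """Step 2: drop failed requests and static files, then split into sessions."""
    clean = [r for r in records
             if r["status"] == 200
             and not r["path"].lower().endswith(STATIC_SUFFIXES)]
    clean.sort(key=lambda r: (r["host"], r["time"]))
    sessions, prev = [], None
    for r in clean:
        new_visitor = prev is None or r["host"] != prev["host"]
        if new_visitor or r["time"] - prev["time"] > SESSION_GAP:
            sessions.append([])
        sessions[-1].append(r["path"])
        prev = r
    return sessions


def section(path):
    """Top-level section of a path, e.g. '/shop/item1' -> 'shop'."""
    return path.strip("/").split("/")[0] or "home"


def cluster(sessions):
    """Step 3: group sessions by the section they visit most often."""
    clusters = defaultdict(list)
    for session in sessions:
        label = Counter(section(p) for p in session).most_common(1)[0][0]
        clusters[label].append(session)
    return dict(clusters)


def discover_patterns(clusters, top=3):
    """Step 4: count page-to-page transitions within each cluster."""
    patterns = {}
    for label, sessions in clusters.items():
        transitions = Counter()
        for session in sessions:
            transitions.update(zip(session, session[1:]))
        patterns[label] = transitions.most_common(top)
    return patterns


# Hypothetical log lines, made up for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Apr/2014:10:00:00 +0000] "GET /shop/list HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Apr/2014:10:00:05 +0000] "GET /static/site.css HTTP/1.1" 200 90',
    '10.0.0.1 - - [01/Apr/2014:10:01:00 +0000] "GET /shop/item1 HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Apr/2014:10:02:00 +0000] "GET /shop/list HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Apr/2014:10:03:00 +0000] "GET /shop/item1 HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Apr/2014:11:30:00 +0000] "GET /blog/post1 HTTP/1.1" 200 300',
    '10.0.0.3 - - [01/Apr/2014:10:05:00 +0000] "GET /missing HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    sessions = preprocess(collect(SAMPLE_LOG))
    for label, found in discover_patterns(cluster(sessions)).items():
        print(label, found)
```

A real deployment would likely replace the section-based grouping with a proper clustering algorithm and identify visitors by more than the host address; the sketch only shows how the four steps feed into one another.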