Research on Web Data Mining

Mrs. Sunita S. Sane, Veermata Jijabai Technological Institute, Mumbai. Email: sssane@vjti.org.in
Mrs. Archana A. Shirke, Veermata Jijabai Technological Institute, Mumbai. Email: archanashirke25@gmail.com

ABSTRACT
Web Data Mining is the mining of Web data. Web Mining aims to discover useful information or knowledge from Web hyperlink structure, page content and usage data. Although Web Mining uses many Data Mining techniques, it is not purely an application of traditional Data Mining, due to the heterogeneity and semi-structured nature of Web data.

Keywords
Web Data Mining, Web Mining, Data Mining, Web Content Mining, Web Usage Mining, Web Structure Mining

1. INTRODUCTION
The heterogeneity and the lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and management of Web-based information difficult.

1.1 INTRODUCTION TO DATA MINING
Data Mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g. databases, texts, images, the Web, etc. [13]. The patterns must be valid, potentially useful and understandable. Data Mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, Data Mining is the process of finding correlations or patterns among dozens of fields in large relational databases. There are many Data Mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule Mining and sequential pattern Mining [15].

Classification: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clustering: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Association rule Mining: Data can be mined to identify associations. The beer-diaper example is a classic instance of association Mining (a small sketch follows at the end of this subsection).

Sequential pattern Mining: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data Mining consists of five major elements [8]:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
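As a rough, self-contained illustration of association rule Mining (the transactions and thresholds below are invented for the example, not taken from the paper), the following Python sketch computes support and confidence for beer-diaper-style rules:

```python
from itertools import combinations

# Toy market-basket transactions (hypothetical data for illustration only).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Enumerate simple one-to-one rules above a chosen support threshold.
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    s = support({a, b})
    if s >= 0.4:  # minimum support threshold (arbitrary for the demo)
        print(f"{{{a}}} -> {{{b}}}: support={s:.2f}, "
              f"confidence={confidence({a}, {b}):.2f}")
```

A real association miner (e.g. an Apriori implementation) would prune the search space with the support threshold instead of enumerating all pairs, but the support and confidence measures are the same.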
1.2 INTRODUCTION TO WEB MINING
Web Mining research is a converging research area drawing on several research communities, such as the database, Information Retrieval and Artificial Intelligence communities [1]. It has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities [2].

The huge amount of information available on the World Wide Web leads to the mining of the Web. Web Mining can thus be defined as the use of Data Mining techniques to automatically discover and extract information from Web documents and services [2]. Web Mining is a cross-disciplinary area spanning Data Mining, Information Retrieval, Information Extraction and Artificial Intelligence. The Web is huge, diverse and dynamic, and thus raises scalability and multimedia issues [1]. Users can encounter the following problems when interacting with the Web: finding relevant information, creating new knowledge out of information available on the Web, personalization of information, and learning about consumers or individual users. Web Mining techniques can be used to solve these information overload problems directly or indirectly. However, techniques from other research areas, such as database (DB), Information Retrieval (IR), Natural Language Processing (NLP) and the Web document community, can also be used [2].

Web Mining is decomposed into the following subtasks [2]:
1. Resource finding: the task of retrieving intended Web documents.
2. Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data, either online or offline, from the text sources available on the Web, such as electronic newsletters, electronic newswires, newsgroups, the text contents of HTML documents obtained by removing HTML tags, and also manually selected Web resources. Information selection and pre-processing covers any transformation of the original data retrieved in the IR process. The transformation can either be preprocessing of the kind mentioned above, such as removing stop words and stemming, or preprocessing aimed at obtaining a desired representation, such as finding phrases in the training corpus or transforming the representation to relational or first-order logic form (a minimal preprocessing sketch follows below).
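A minimal sketch of the text preprocessing just described, using a tiny hand-made stop-word list and a naive suffix-stripping stemmer (both invented for illustration; a real system would use a curated stop list and a proper stemmer such as Porter's):

```python
import re

# Tiny illustrative stop-word list; real systems use much larger curated lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def naive_stem(token: str) -> str:
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Mining the Web is the mining of Web data"))
# ['min', 'web', 'min', 'web', 'data']
```

The output shows both steps at work: stop words vanish and inflected forms collapse to a shared (if crude) canonical stem.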
Machine Learning or Data Mining techniques are typically used for generalization. Humans play an important role in the information and knowledge discovery process on the Web, since the Web is an interactive medium; thus query-triggered knowledge discovery is as important as the more automatic data-triggered knowledge discovery. Web Mining therefore refers to the overall process of discovering potentially useful and previously unknown information or knowledge from Web data. It is an extension of the standard process of knowledge discovery in databases (KDD) [2].

Web Mining is often associated with IR or IE. However, Web Mining, or information discovery on the Web, is not the same as IR or IE. IR is the automatic retrieval of all relevant documents while retrieving as few non-relevant documents as possible. IR has the primary goal of indexing text and searching for useful documents in a collection; nowadays research in IR also includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. A task that can be considered an instance of Web Mining is Web document classification or categorization, which can be used for indexing. Viewed in this respect, Web Mining is part of the (Web) IR process [2].

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. IE aims to extract relevant facts from documents, while IR aims to select relevant documents. IE is interested in the structure or representation of a document, while IR views the text in a document just as a bag of unordered words. Some IE systems use Machine Learning or Data Mining techniques to learn extraction patterns or rules for Web documents semi-automatically or automatically. Within this view, Web Mining is part of the (Web) IE process [2].

The Web Mining process is similar to the Data Mining process. The difference usually lies in data collection. In traditional Data Mining, the data is often already collected and stored in a data warehouse. For Web Mining, data collection can be a substantial task, especially for Web Structure and Content Mining, which involve crawling a large number of target Web pages. The classification of retrieval and mining tasks for different types of data is given below [1].

Purpose | Any Data | Textual Data | Web-Related Data
Retrieving known data or documents efficiently and effectively | Data Retrieval | Information Retrieval | Web Document Retrieval
Finding previously unknown patterns or knowledge | Data Mining | Text Mining | Web Mining

Figure 1: Classification of retrieval and mining tasks

2. WEB MINING CATEGORIES
Web Mining tasks can be categorized into three types [2]:
1. Web Content Mining (WCM): Web Content Mining refers to the discovery of useful information from Web contents, including text, image, audio, video, etc. Research in Web Content Mining encompasses resource discovery from the Web, document categorization and clustering, and information extraction from Web pages.
2. Web Structure Mining (WSM): Web Structure Mining studies the Web's hyperlink structure. It usually involves analysis of the in-links and out-links of a Web page, and it has been used for search engine result ranking.
3. Web Usage Mining (WUM): Web Usage Mining focuses on analyzing search logs or other activity logs to find interesting patterns. One of the main applications of Web Usage Mining is to learn user profiles.

2.1 Web Content Mining
Web Content Mining is related to but different from Data Mining and Text Mining. It is related to Data Mining because many Data Mining techniques can be applied in Web Content Mining. It is related to Text Mining because much of the Web content is text. However, it is also quite different from Data Mining, because Web data is mainly semi-structured and/or unstructured, while Data Mining deals primarily with structured data. Web Content Mining is also different from Text Mining because of the semi-structured nature of the Web, while Text Mining focuses on unstructured texts. Web Content Mining thus requires creative applications of Data Mining and/or Text Mining techniques as well as its own unique approaches.
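As a rough illustration of document categorization, one of the Web Content Mining tasks listed above, here is a minimal scikit-learn sketch (the library choice and the toy labeled snippets are assumptions of this example, not part of the paper):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled page snippets (hypothetical data for illustration).
pages = [
    "latest football scores and match results",
    "player transfer news and league standings",
    "stock market closes higher on tech earnings",
    "central bank raises interest rates again",
]
labels = ["sports", "sports", "finance", "finance"]

# Bag-of-words representation, as in the vector space model (Section 3.1).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

# Naive Bayes is a common baseline for text categorization.
classifier = MultinomialNB()
classifier.fit(X, labels)

new_page = ["quarterly earnings beat market expectations"]
print(classifier.predict(vectorizer.transform(new_page)))  # ['finance']
```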
In the past few years there has been a rapid expansion of activity in the Web Content Mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems; several important Web Content Mining problems, and existing techniques for solving them, are examined in [1]. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, as well as to extend database and Data Mining techniques to provide a higher level of organization for the semi-structured data available on the Web.

2.2 Web Structure Mining
Web Structure Mining is the process of using graph theory to analyze the node and connection structure of a Web site [1]. According to the type of Web structural data, Web Structure Mining can be divided into two kinds. The first kind is extracting patterns from hyperlinks in the Web; a hyperlink is a structural component that connects a Web page to a different location. The other kind is mining the document structure: using the tree-like structure of HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags to analyze and describe a Web page. As noted above, hyperlink structure has been used for search engine result ranking; a toy sketch follows below.
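A minimal PageRank-style sketch over a hand-made four-page link graph (the graph, damping factor and iteration count are invented for the example; the paper itself does not commit to a specific ranking algorithm):

```python
# Toy Web graph: page -> pages it links to (out-links).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

DAMPING = 0.85  # conventional damping factor
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

# Power iteration: a page's rank is split evenly among its out-links.
for _ in range(50):
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - DAMPING) / len(pages) + DAMPING * incoming
    rank = new_rank

for p, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{p}: {r:.3f}")  # C ranks highest: it has the most in-links
```

Pages with many in-links, like C here, accumulate rank, matching the intuition developed in Section 4 that a frequently referenced page is likely to be more important.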
2.3 Web Usage Mining
Web Usage Mining applies Data Mining to analyze and discover interesting patterns in users' usage data on the Web. Usage data records a user's behavior when the user browses or makes transactions on a Web site. Web Usage Mining is the type of Web Mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations; most of this information is generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring page for each page reference, and user registration or survey data gathered via tools such as CGI scripts. Analyzing such data can help organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to structure a Web site in order to create a more effective presence for the organization. Within organizations using intranet technologies, such analysis can shed light on more effective management of workgroup communication and organizational infrastructure. Finally, for organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.

WUM can be decomposed into the following subtasks:

2.3.1 Data Preprocessing for Mining
It is necessary to perform data preparation to convert the raw data for further processing:
• Content preprocessing: converting text, images, scripts and other files into forms that can be used by usage mining.
• Structure preprocessing: the structure of a Web site is formed by the hyperlinks between page views. Structure preprocessing can be treated much like content preprocessing; however, each server session may have to construct a different site structure than others.
• Usage preprocessing: the inputs of this phase may include Web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications.

2.3.2 Pattern Discovery
This is the key component of Web Mining. Pattern discovery draws on algorithms and techniques from several research areas, such as Data Mining, Machine Learning, Statistics, and Pattern Recognition:
• Statistical analysis: analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file. The statistical information contained in a periodic Web system report can be useful for improving system performance, enhancing system security, facilitating the site modification task, and providing support for marketing decisions.
• Association rules: in the Web domain, the pages that are most often referenced together can be placed in a single server session by applying association rule generation. Association rule mining techniques can be used to discover unordered correlations between items found in a database of transactions.
• Clustering: clustering analysis groups together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies (see the session-clustering sketch after this list).
• Classification: classification maps a data item into one of several predefined classes. It can be done using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc.
• Sequential patterns: this technique aims to find inter-session patterns, such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Sequential patterns also include other types of temporal analysis such as trend analysis, change point detection, and similarity analysis.
• Dependency modeling: the goal of this technique is to establish a model that can represent significant dependencies among the various variables in the Web domain. Such a model provides a theoretical framework for analyzing the behavior of users and is potentially useful for predicting future Web resource consumption.
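A minimal sketch of the clustering step above: user sessions are encoded as page-visit vectors and grouped with k-means (scikit-learn, the toy sessions, and the choice of two clusters are all assumptions of this example):

```python
from sklearn.cluster import KMeans

PAGES = ["home", "products", "cart", "blog", "about"]

# Toy sessions: 1 if the session visited the page, 0 otherwise (hypothetical).
sessions = [
    [1, 1, 1, 0, 0],  # shopper-like sessions
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],  # reader-like sessions
    [0, 0, 0, 1, 1],
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sessions)
for session, cluster in zip(sessions, kmeans.labels_):
    print(cluster, session)
```

With real logs, the session vectors would come out of the usage-preprocessing step (the user session file), and the resulting clusters could feed the marketing decisions mentioned above.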
2.3.3 Pattern Analysis
Pattern analysis is the final stage of Web Usage Mining. The goal of this stage is to eliminate irrelevant rules or patterns and to extract the interesting ones from the output of the pattern discovery process. The output of Web Mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. There are two common approaches to pattern analysis: one is to use a knowledge query mechanism such as SQL, while the other is to construct a multi-dimensional data cube before performing OLAP operations. Both methods assume that the output of the previous phase has been structured.

3. BASIC MODELS
In its full generality, a model must build machine representations of world knowledge, and would therefore involve a natural-language grammar for text, hypertext, and semi-structured data that will be useful for our learning applications. We discuss some such models in this section [8].

3.1 Models for structured text
1. Boolean model: the simplest statistical model. It uses the notion of exact matching of documents to the user query; both the query and the retrieval are based on Boolean algebra.
2. Vector space model: a document is represented as a weight vector, in which each component weight is computed based on some variation of the TF or TF-IDF scheme. Documents are tokenized using simple syntactic rules (such as whitespace delimiters in English) and tokens are stemmed to canonical form (e.g., 'reading' to 'read'; 'is', 'was', 'are' to 'be'). Each canonical token represents an axis in a Euclidean space (a small TF-IDF sketch appears at the end of this section).
3. Statistical language model: this model is based on probability and has foundations in statistical theory. It first estimates a language model for each document and then ranks documents by the likelihood of the query given that language model.
4. Probabilistic model: this model is used for document generation, with the disclaimer that such models have no bearing on grammar and semantic coherence.

In spite of minor variations, all these models regard documents as multisets of terms, without paying attention to the ordering between terms. They are therefore collectively called bag-of-words models.

3.2 Models for semi-structured data
Semi-structured data is a point of convergence for the Web and database communities: the former deals with documents, the latter with data. The form of that data is evolving from rigidly structured relational tables of numbers and strings toward the natural representation of complex real-world objects like books, papers, movies, jet engine components, and chip designs, without sending the application writer into contortions.

Object Exchange Model (OEM): in OEM, data takes the form of atomic or compound objects: atomic objects may be integers or strings; compound objects refer to other objects through labeled edges. HTML is a special case of such 'intra-document' structure. These forms of irregular structure naturally encourage Data Mining techniques from the domain of 'standard' structured warehouses to be applied, adapted, and extended to discover useful patterns from semi-structured sources as well.
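The TF-IDF sketch promised in Section 3.1: a minimal vector space construction in plain Python (the three tiny documents and the log-based IDF variant are assumptions of this example):

```python
import math
from collections import Counter

documents = [
    "web mining discovers knowledge from web data",
    "data mining finds patterns in databases",
    "web crawlers download pages for search engines",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)

def tf_idf_vector(tokens):
    """Map each term to TF * IDF; each distinct term is one axis."""
    tf = Counter(tokens)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for d in tokenized if term in d)  # document frequency
        idf = math.log(N / df)                       # a common IDF variant
        vec[term] = (count / len(tokens)) * idf
    return vec

vec = tf_idf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
# 'web' has high TF here but appears in 2 of 3 documents, so IDF tempers it;
# terms unique to this document, like 'discovers', get the full IDF weight.
```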
4. LINK ANALYSIS
In recent years, Web link structure has been widely used to infer important information about Web pages. Web Structure Mining has been largely influenced by research in social network analysis and citation analysis [8]. Citations (linkages) among Web pages are usually indicators of high relevance or good quality. We use the term in-links for the hyperlinks pointing to a page and the term out-links for the hyperlinks found in a page. Usually, the larger the number of in-links, the more useful a page is considered to be. The rationale is that a page referenced by many people is likely to be more important than a page that is seldom referenced; as in citation analysis, an often-cited article is presumed to be better than one that is never cited. In addition, it is reasonable to give a link from an authoritative source (such as Yahoo!) a higher weight than a link from an unimportant personal home page. By analyzing the pages containing a URL, we can also obtain the anchor text that describes it. Anchor text shows how other Web page authors annotate a page and can be useful in predicting the content of the target page. Several algorithms have been developed to address this issue.

5. THE SEMANTIC WEB
The Semantic Web is a term coined by Berners-Lee [17] for the vision of making the information on the Web machine-processable. The basic idea is to enrich Web pages with machine-processable knowledge represented in the form of ontologies [19]. Ontologies define certain types of objects and the relations between them. As ontologies are readily accessible (like other Web documents), a computer program can use them to draw inferences about the information provided on Web pages. One of the research challenges in this area is to annotate the information that is currently available on the Web with semantic tags. Typically, techniques from text classification, hypertext classification and information extraction are used for that purpose. A landmark application in this area was the WebKB project at Carnegie Mellon University (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam & Slattery, 2000). Its goal was to assign Web pages, or parts of Web pages, to entities in an ontology. A simple test ontology modeled knowledge about computer science departments: there are entities like students (graduate and undergraduate), faculty members (professors, researchers, lecturers, post-docs, ...), courses, projects, etc., and relations between these entities, such as "courses are taught by one lecturer and attended by several students" or "every graduate student is advised by a professor". Many applications could be imagined for such an ontology. For example, it could enhance the capabilities of search engines by enabling them to answer queries like "Who teaches course X at university Y?" or "How many students are in department Z?", or serve as a backbone for Web catalogues. A description of the first prototype system can be found in Craven et al. (2000). Semantic Web Mining has emerged as a research field that focuses on the interaction of Web Mining and the Semantic Web.

6. WEB CRAWLING
Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches [13]. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code, and for gathering specific types of information from Web pages, such as harvesting email addresses. A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
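A minimal seed-and-frontier sketch using only the Python standard library (the seed URL, the page limit and the absence of robots.txt handling are simplifications of this example; a real crawler must also honor the politeness and re-visit policies discussed next):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])  # the crawl frontier
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable, non-HTTP, or non-text resource
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return visited

print(crawl("https://example.com"))  # hypothetical seed URL
```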
URLs from the frontier are recursively visited according to a set of policies. Three important characteristics of the Web make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation,
which combine to produce a wide variety of possible crawlable URLs. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added, or that pages have already been updated or even deleted. As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must therefore carefully choose at each step which pages to visit next. The behavior of a Web crawler is the outcome of a combination of policies:
• a selection policy that states which pages to download,
• a re-visit policy that states when to check for changes to the pages,
• a politeness policy that states how to avoid overloading Web sites, and
• a parallelization policy that states how to coordinate distributed Web crawlers.

The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers. The main problem in focused crawling is that, in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page.
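One common workaround, sketched below, is to score a candidate URL by the similarity of its anchor text to the query before fetching it (the cosine measure over raw term counts and the sample anchor texts are assumptions of this example):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = "web mining research"

# Hypothetical (anchor text, URL) pairs harvested from already-crawled pages.
frontier = [
    ("survey of web mining research", "http://example.com/survey"),
    ("cheap flights and hotel deals", "http://example.com/travel"),
    ("data mining on the web", "http://example.com/mining"),
]

# Visit the most promising URLs first.
for anchor, url in sorted(frontier,
                          key=lambda item: -cosine_similarity(query, item[0])):
    print(f"{cosine_similarity(query, anchor):.2f}  {url}")
```

Anchor text is a natural signal here because, as noted in Section 4, it reflects how other authors describe the target page without requiring the page itself to be downloaded.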
7. WEB DATA MINING AND THE AGENT PARADIGM
Web Mining is often viewed from, or implemented within, an agent paradigm; thus Web Mining has a close relationship with software agents or intelligent agents, and some of these agents indeed perform Data Mining tasks to achieve their goals. According to Green [4] there are three sub-categories of software agents: user interface agents, distributed agents, and mobile agents. The user interface agents that can be classified into the Web Mining agent category are information retrieval agents, information filtering agents and personal assistant agents. Distributed agent technology is concerned with problem solving by a group of agents; the relevant agents in this category are distributed agents for knowledge discovery or Data Mining. Delgado [5] classifies user interface agents by the underlying information filtering technology into content-based filters, event-based filters and hybrid filters. In event-based filtering, the system tracks and follows events that are inferred from people's surfing habits on the Web; examples of such events are saving a URL into a bookmark folder, mouse clicks and scrolls, and link traversal behavior.

8. CONCLUSIONS
Web Data Mining is a new field into which researchers have only begun to venture, especially with Text Mining techniques. The key component of Web Mining is the mining process itself. A lot of work still remains to be done in adapting known mining techniques, as well as in developing new ones.

9. REFERENCES
[1] Raymond Kosala and Hendrik Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, ACM SIGKDD, July 2000.
[2] Wang Bin and Liu Zhijing, "Web Mining Research", in Proceedings of the 5th IEEE International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'03), 2003.
[3] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[4] S. Green, L. Hurst, B. Nangle, P. Cunningham, F. Somers, and R. Evans, "Software Agents: A Review", Technical Report TCD-CS-1997-06, Trinity College, University of Dublin, 1997.
[5] J. A. Delgado, "Agent-Based Information Filtering and Recommender Systems on the Internet", PhD thesis, Dept. of Intelligence and Computer Science, Nagoya Institute of Technology, March 2000.
[6] Web site: http://www.celi.it
[7] Soumen Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan Kaufmann, 2003.
[8] Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data", Springer, 2007.
[9] Hsinchun Chen and Michael Chau, "Web Mining: Machine Learning for Web Applications", Annual Review of Information Science and Technology, University of Arizona.
[10] S. Chakrabarti, "Data Mining for Hypertext: A Tutorial Survey", SIGKDD Explorations, 1(1), 1-11, 2000.
[11] S. Chakrabarti, B. Dom, and P. Indyk, "Enhanced Hypertext Categorization Using Hyperlinks", in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 307-318, 1998.
[12] Johannes Fürnkranz, "Web Mining" (book chapter), TU Darmstadt, Knowledge Engineering Group.
[13] Web site: http://www.wikipedia.org
[14] Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, 2003.
[15] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, second edition, 2006.
[16] Tom Mitchell, "Machine Learning", McGraw-Hill, 1997.
[17] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web", Scientific American, 2001.
[18] Search engine: http://www.google.com
[19] Dieter Fensel, "Ontology Versioning on the Semantic Web", 2001.